Language Value - Using IETF instead of ISO 639 #8

Open
opened 2 months ago by krixano · 8 comments
krixano commented 2 months ago

You mention that one of the reasons for the language is for correct pronunciations by screen readers. If this is the case, then the dialect of the language should also be included. For example, instead of "en", it should be one of the following "en_US", "en_GB", "en_AU", etc. This is the "IETF BCP 47 language tag" spec (https://en.wikipedia.org/wiki/IETF_language_tag)

Edit: The IETF language tag should use hyphens, not underscores. Everything else still applies.

Collaborator

The text/gemini spec currently requires conformance with RFC4646:

Valid values for the "lang" parameter are comma-separated lists of one or
more language tags as defined in RFC4646.

However, RFC 5646 (https://tools.ietf.org/html/rfc5646) replaces RFC 4646. RFC 5646, in combination with RFC 4647, comprises BCP 47.

So we probably need to see if we can get the text/gemini spec updated to refer to BCP 47, and then modify the gempub spec to refer to BCP 47 as well.

There is currently an open issue for that: https://gitlab.com/gemini-specification/gemini-text/-/issues/1
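
To make the "comma-separated lists of one or more language tags" wording quoted above a bit more concrete, here is a minimal Python sketch. The function name `parse_lang_param` is hypothetical, and trimming whitespace around each tag is an assumption on my part, not something taken from the text/gemini spec.

```python
from typing import List


def parse_lang_param(value: str) -> List[str]:
    """Split a lang parameter value such as "en-GB,fr" into individual tags.

    Assumption: whitespace around each tag is ignored and empty entries
    are dropped. The tags themselves are not validated here.
    """
    return [tag.strip() for tag in value.split(",") if tag.strip()]


print(parse_lang_param("en-GB, sr-Latn-RS"))
```

This only separates the list; deciding whether each tag is well-formed is a separate step.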


https://tools.ietf.org/html/bcp47 seems to suggest that the Gemini spec is correct in saying that language tags should have hyphen-minuses in them, not underscores ("en-US", "sr-Latn-RS").

Poster

Right, but my original issue is about gempub not even using IETF's language tags at all, where, afaik (could be wrong, idk), it has no ability to add regions. Whether it's underscores or hyphens is irrelevant and mostly just a small mistake from me thinking it was underscores.

Collaborator

Yes, the hyphen is mandated by BCP 47 (https://tools.ietf.org/html/bcp47#section-2.1):

A language tag is composed from a sequence of one or more "subtags",
each of which refines or narrows the range of language identified by
the overall tag. Subtags, in turn, are a sequence of alphanumeric
characters (letters and digits), distinguished and separated from
other subtags in a tag by a hyphen ("-", [Unicode] U+002D).
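
As an illustration of the subtag structure described in that passage, here is a small Python sketch. The helper name `split_subtags` is made up for this example, and converting underscores to hyphens is only a leniency I am assuming a reader app might want; BCP 47 itself only allows the hyphen.

```python
from typing import List


def split_subtags(tag: str) -> List[str]:
    """Return the subtags of a language tag, e.g. "sr-Latn-RS" -> ["sr", "Latn", "RS"]."""
    # Assumption: an underscore (as in "en_US") is treated as a common
    # mistake and converted to a hyphen before splitting.
    return tag.replace("_", "-").split("-")


print(split_subtags("sr-Latn-RS"))
print(split_subtags("en_US"))
```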

krixano changed title from Language value to Language Value - Using IETF instead of ISO 639 2 months ago
Collaborator

> Right, but my original issue is about gempub not even using IETF's language tags at all, where, afaik (could be wrong, idk), it has no ability to add regions. Whether it's underscores or hyphens is irrelevant and mostly just a small mistake from me thinking it was underscores.

@krixano, absolutely. I think you are correct here in pointing out that the gempub language metadata field should use BCP 47 language tags, and not ISO 639-1 or ISO 639-2 language codes. We're just trying to get pointers to the right specs.

I think I have the summary of this issue correct, which is that the current gempub spec provides for the use of an ISO 639-1 or ISO 639-2 code in the language field of the metadata.txt file. This only allows a two- or three-character language code, with no region or script.

The spec ought to provide for the use of complete language tags that are compliant with BCP 47, rather than only for two- or three-character language codes.

In other words, "en" or "en-GB" or "sr-Latn-RS" should all work, as they are all proper BCP 47 language tags that can be passed to screen readers for (among other things) the purpose of supporting correct pronunciation.
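
A rough Python sketch of what accepting those three examples could look like follows. It is deliberately simplified: the pattern only covers the `language[-script][-region]` shape, ignores the other parts of the BCP 47 grammar (extended language subtags, variants, extensions, private use, grandfathered tags), and does not check subtags against the IANA registry. The name `looks_like_simple_tag` is hypothetical.

```python
import re

# Simplified shape only: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Not the full BCP 47 grammar.
_SIMPLE_TAG = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{4})?(-(?:[A-Za-z]{2}|[0-9]{3}))?")


def looks_like_simple_tag(tag: str) -> bool:
    """Return True if tag has the simplified language[-script][-region] shape."""
    return _SIMPLE_TAG.fullmatch(tag) is not None


for candidate in ("en", "en-GB", "sr-Latn-RS", "en_US"):
    print(candidate, looks_like_simple_tag(candidate))
```

A real implementation would more likely lean on an existing BCP 47 library than on a hand-written pattern like this one.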

Owner

Everyone happy with BCP 47 then? I'll update it tomorrow - fwiw the spec did say that field needed some thinking about, it was just thrown in.

The use case I had in mind was, say, a Spanish user reading an English book, and how the screen reader would handle that. I need to check what APIs are available for Android (just because that's my primary platform) and make sure it's a realistic requirement - at the moment I have no idea.

Poster

If we're talking about a Spanish user reading an English book, you would want the screen reader to read in English? If that's the case, then this should be fine.
In the case of translation, the software should already have a setting of its own for what the user's native language is. You wouldn't change anything in the gpub, that wouldn't make sense. The gpub only needs to have what is actually inside the book. gpubs can't magically know that a person of a different native language is reading the book. That's the job of the reader software.

There is one thing to consider though - multi-language books.

Owner

Multi-language books... let's leave that as a future problem for now.

And yes - re screen reader. On Android (for my sins, my primary platform) the screen reader TalkBack would default to the user's Locale (Spanish in our example), so if we have the language from metadata.txt we can wrap the String in a LocaleSpan (https://developer.android.com/reference/android/text/style/LocaleSpan.html), which would give TalkBack what it needs to correctly pronounce the English words on a Spanish Locale device.
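
For the "language from metadata.txt" step, a reader app might do something like the Python sketch below. It assumes metadata.txt consists of `key: value` lines with a `language` key, which is my reading of the format and not confirmed in this thread; the function name `language_from_metadata` is hypothetical. The Android side (building the Locale and the LocaleSpan) is not shown.

```python
from typing import Optional


def language_from_metadata(text: str) -> Optional[str]:
    """Return the value of the first "language: ..." line, or None if absent.

    Assumption: metadata.txt holds one "key: value" pair per line.
    """
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "language" and value.strip():
            return value.strip()
    return None


sample = "title: Example Book\nlanguage: en-GB\n"
print(language_from_metadata(sample))
```

Returning None when the field is missing leaves the fallback (for example, the device locale) up to the reader software.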
