RIO normalizeLanguage shouldn't lowercase the tag

Description

RIO converts the language tag to lower case (toLowerCase); see
https://bitbucket.org/openrdf/sesame/src/7796198d63fb3b43fb21c6c79eba06701e9914a2/core/rio/languages/src/main/java/org/openrdf/rio/languages/RFC3066LanguageHandler.java?at=master
org.openrdf.rio.languages.RFC3066LanguageHandler::normalizeLanguage

While RFC 5646 = BCP 47 (which obsoletes RFC 3066) demands that tags should be treated case-insensitively, it also recommends: "consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived." (and the format is NOT all lowercase).
The RFC describes a case normalization algorithm: "An implementation can reproduce this format without accessing the registry as follows...".
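For illustration, the case conventions from RFC 5646 section 2.1.1 could be sketched in Java roughly as follows. This is only a sketch, not the RIO code: the class and method names (LanguageTagCase, normalizeCase) are hypothetical, and it assumes that "after singletons" means "immediately following a one-character subtag". It does not validate the tag.

```java
import java.util.Locale;

// Hypothetical helper, not part of Sesame/RIO.
public class LanguageTagCase {

    /**
     * Applies the RFC 5646 section 2.1.1 case conventions:
     * everything lowercase, except that 2-letter subtags become
     * uppercase and 4-letter subtags become titlecase, unless they
     * are the first subtag or directly follow a singleton.
     * (Reading "after singletons" as "directly after" is an assumption.)
     */
    public static String normalizeCase(String tag) {
        String[] subtags = tag.split("-");
        StringBuilder out = new StringBuilder(tag.length());
        boolean afterSingleton = false;
        for (int i = 0; i < subtags.length; i++) {
            String s = subtags[i].toLowerCase(Locale.ROOT);
            if (i > 0 && !afterSingleton) {
                if (s.length() == 2) {
                    // e.g. region subtag: "ca" -> "CA"
                    s = s.toUpperCase(Locale.ROOT);
                } else if (s.length() == 4) {
                    // e.g. script subtag: "latn" -> "Latn"
                    s = Character.toUpperCase(s.charAt(0)) + s.substring(1);
                }
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
            // A one-character subtag (such as "x") is a singleton.
            afterSingleton = s.length() == 1;
        }
        return out.toString();
    }
}
```

With this sketch, normalizeCase("EN-ca-X-CA") gives "en-CA-x-ca" and normalizeCase("AZ-LATN-x-LATN") gives "az-Latn-x-latn", matching the example tags in the RFC.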

I posted a similar request for RDF::Trine::Node::Literal, and implemented this normalization algorithm there (in Perl):
https://rt.cpan.org/Public/Bug/Display.html?id=88964
Gregg Williams (the author of RDF::Trine) agreed this normalization is better than lowercasing.

(Not sure which is the correct component: this is in the common part of RIO)

Environment

None

Activity

Peter Ansell
January 19, 2014, 11:08 PM

There are two slightly different issues here.

Firstly, both the RDF-1.0 and RDF-1.1 specifications explicitly state that the value space of language tags (RFC3066 for RDF-1.0 and BCP47 for RDF-1.1) is lowercased strings. RDF-1.1 only has a MAY requirement for converting lexical representations to lower case, which means you won't be able to consistently roundtrip between tools and keep casing anyway:

"Lexical representations of language tags MAY be converted to lower case. The value space of language tags is always in lower case." http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

"a language tag as defined by [RFC-3066], normalized to lowercase." http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-Graph-Literal
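To make the distinction between lexical form and value space concrete, here is a minimal Java sketch. It is an illustration only and not how LiteralImpl or any Sesame ValueFactory behaves: the LangTag class is hypothetical. It keeps the tag as written for output, but compares and hashes on the lowercased value.

```java
import java.util.Locale;

// Hypothetical class for illustration; not part of Sesame.
public final class LangTag {

    private final String lexical;

    public LangTag(String lexical) {
        this.lexical = lexical;
    }

    /** The tag exactly as it was supplied, e.g. "en-US". */
    public String getLexical() {
        return lexical;
    }

    /** The tag in the lowercase value space, e.g. "en-us". */
    public String getValue() {
        return lexical.toLowerCase(Locale.ROOT);
    }

    // Equality and hashing use the value space, so tags that differ
    // only in case are treated as the same tag.
    @Override
    public boolean equals(Object o) {
        return o instanceof LangTag
                && getValue().equals(((LangTag) o).getValue());
    }

    @Override
    public int hashCode() {
        return getValue().hashCode();
    }
}
```

Under this sketch, new LangTag("en-US") equals new LangTag("en-us"), while getLexical() still returns the original casing. Whether another tool preserves that casing is a separate question, since the specification allows it to lowercase the lexical form.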

Secondly, the RFC3066LanguageHandler.normalizeLanguage is not normally called. We don't normalise unless you explicitly set BasicParserSettings.NORMALIZE_LANGUAGE_TAGS to true in your ParserConfig. The actual conversion right now is done by the ValueFactory that you are using. That is possibly in LiteralImpl if you are using a ValueFactory that creates those objects.

Would you mind if I close this as a duplicate of to keep discussion in one place? Although there are two slightly different issues, any fix for both of them will need to be in the same place.

Peter Ansell
January 21, 2014, 1:44 AM

Closing as duplicate of to keep discussion in one place.

It is not exactly the same issue as described there, but support for any or all of lowercase, uppercase, RFC3066-case and BCP47-case will need to be handled in a uniform manner, so it is useful to aggregate the issues together.

Assignee

Jeen Broekstra

Reporter

Vladimir Alexiev

Labels

None

Components

Affects versions

Priority

Major