RIO normalizeLanguage shouldn't lowercase the tag
Activity
Closing as duplicate of https://openrdf.atlassian.net/browse/SES-1659#icft=SES-1659 to keep discussion in one place.
It is not exactly the same issue as the one described there, but support for any or all of lowercase, uppercase, RFC3066-case and/or BCP47-case will need to be handled in a uniform manner, so it is useful to aggregate the issues together.
There are two slightly different issues here.
Firstly, both the RDF-1.0 and RDF-1.1 specifications explicitly state that the value space of language tags (RFC3066 for RDF-1.0 and BCP47 for RDF-1.1) is lowercased strings. RDF-1.1 only has a MAY requirement for converting the lexical representation to lower case, which means you won't be able to consistently roundtrip between tools and keep casing anyway:
"Lexical representations of language tags MAY be converted to lower case. The value space of language tags is always in lower case." http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
"a language tag as defined by [RFC-3066], normalized to lowercase." http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-Graph-Literal
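To illustrate the distinction the specifications draw between the lexical representation and the value space, here is a minimal sketch in plain Java. The class `LangTag` and its methods are hypothetical names made up for this example; this is not how Sesame represents language tags, only one way the two notions could coexist.

```java
import java.util.Locale;

// Hypothetical illustration only: not a Sesame class.
final class LangTag {
    // The tag exactly as it was written in the source document.
    private final String lexical;

    LangTag(String lexical) {
        this.lexical = lexical;
    }

    // Lexical representation: casing is kept as supplied.
    String lexical() {
        return lexical;
    }

    // Value space: always lower case, as both RDF specifications state.
    String value() {
        return lexical.toLowerCase(Locale.ROOT);
    }

    // Equality is decided in the value space, so casing does not matter.
    @Override
    public boolean equals(Object o) {
        return o instanceof LangTag && value().equals(((LangTag) o).value());
    }

    @Override
    public int hashCode() {
        return value().hashCode();
    }
}
```

With this sketch, `new LangTag("en-US")` and `new LangTag("en-us")` compare equal, while `lexical()` still returns whatever casing was supplied. A tool that stores only `value()` loses the original casing, which is the roundtrip problem described above.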
Secondly, RFC3066LanguageHandler.normalizeLanguage is not normally called. We don't normalise unless you explicitly set BasicParserSettings.NORMALIZE_LANGUAGE_TAGS to true in your ParserConfig. The actual conversion is currently done by the ValueFactory that you are using, possibly in LiteralImpl if your ValueFactory creates those objects.
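For reference, opting in to normalisation would look roughly like the fragment below. It is a configuration sketch that assumes the Sesame Rio jars are on the classpath; the class name `NormalizeTagsExample` and the choice of Turtle as the format are only for illustration.

```java
import org.openrdf.rio.ParserConfig;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;
import org.openrdf.rio.helpers.BasicParserSettings;

// Hypothetical example class: shows only how the setting is switched on.
class NormalizeTagsExample {
    static RDFParser newNormalizingParser() {
        ParserConfig config = new ParserConfig();
        // Off unless set explicitly; this is what causes the language
        // handler's normalizeLanguage to be consulted during parsing.
        config.set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);

        // Turtle is an arbitrary choice for this sketch.
        RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
        parser.setParserConfig(config);
        return parser;
    }
}
```

Even with this setting left off, the ValueFactory may still change the casing, as noted above.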
Would you mind if I close this as a duplicate of https://openrdf.atlassian.net/browse/SES-1659#icft=SES-1659 to keep discussion in one place? Although there are two slightly different issues, any fix for both of them will need to be in the same place.
RIO converts the languageTag toLowerCase: see
https://bitbucket.org/openrdf/sesame/src/7796198d63fb3b43fb21c6c79eba06701e9914a2/core/rio/languages/src/main/java/org/openrdf/rio/languages/RFC3066LanguageHandler.java?at=master
org.openrdf.rio.languages.RFC3066LanguageHandler::normalizeLanguage
While RFC 5646 = BCP 47 (which obsoletes RFC 3066) demands that tags should be treated case-insensitively, it also recommends: "consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived." (and the format is NOT all lowercase).
The RFC describes a case normalization algorithm: "An implementation can reproduce this format without accessing the registry as follows...".
I posted a similar request for RDF::Trine::Node::Literal, and implemented such normalization algorithm there (in Perl):
https://rt.cpan.org/Public/Bug/Display.html?id=88964
Gregg Williams (the author of RDF::Trine) agreed this normalization is better than lowercasing.
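For comparison with the Perl implementation, the case formatting rule from RFC 5646 section 2.1.1 can be sketched in Java as follows. The RFC says that all subtags are lowercase, except that two-letter subtags are uppercase and four-letter subtags are titlecase when they neither start the tag nor occur after a singleton. The class `LanguageTagCase` and the method `formatCase` are hypothetical names, and the sketch assumes that "after singletons" means anywhere after the first one-character subtag. It does not validate the tag.

```java
import java.util.Locale;

// Hypothetical helper: not part of Sesame or RIO.
final class LanguageTagCase {

    // Applies the RFC 5646 section 2.1.1 casing conventions to a tag.
    // No validation is performed; malformed tags are only re-cased.
    static String formatCase(String tag) {
        String[] subtags = tag.split("-");
        StringBuilder out = new StringBuilder(tag.length());
        boolean afterSingleton = false;

        for (int i = 0; i < subtags.length; i++) {
            // Default rule: every subtag is lowercase.
            String s = subtags[i].toLowerCase(Locale.ROOT);

            // Exceptions apply only when the subtag is not first
            // and no singleton (one-character subtag) has been seen.
            if (i > 0 && !afterSingleton) {
                if (s.length() == 2) {
                    s = s.toUpperCase(Locale.ROOT);      // e.g. region "US"
                } else if (s.length() == 4) {
                    s = Character.toUpperCase(s.charAt(0))
                            + s.substring(1);            // e.g. script "Latn"
                }
            }

            if (s.length() == 1) {
                afterSingleton = true;                   // e.g. "x" or "u"
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
        }
        return out.toString();
    }
}
```

Using the RFC's own examples, `formatCase("EN-ca-X-CA")` gives `en-CA-x-ca`, `formatCase("SGN-be-fr")` gives `sgn-BE-FR`, and `formatCase("az-latn-x-LATN")` gives `az-Latn-x-latn`. Because BCP 47 tags are case-insensitive, the result still maps to the same lowercase value in the RDF value space.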
(Not sure which is the correct component: this is in the common part of RIO)