Allow storage of Language country code in upper case

Description

The LiteralImpl class in Sesame's model currently normalizes all language tags to lower case before passing them on to the rest of Sesame. Strictly speaking, this follows the spec: the RDF Concepts and Abstract Syntax recommendation states that language tags should be normalized to lower case.

However, the issue reporter has stated a need for being allowed to store language tags 'as-is', that is, without Sesame normalizing them.

Environment

None

Activity

Vladimir Alexiev
January 20, 2014, 3:59 PM

5b. I posted to public-rdf-comments: http://lists.w3.org/Archives/Public/public-rdf-comments/2014Jan/0011.html :

http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
"Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case."

CHANGE TO:

"Lexical representations of language tags MAY be normalized, according to BCP47 section 2.1.1. "Formatting of Language Tags" (country codes in upper case, script codes capitalized, the rest in lower case).
Language tags MAY also be normalized by converting all to lower case, but BCP47 normalization is preferred.
No matter which method is chosen, the semantics of language tags MUST NOT depend on case.
In particular, implementations MUST NOT store as separate statements, two statements that differ only by the case of language tags."
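For reference, the two normalizations discussed here can be sketched in a few lines of Java. This is only an illustration and not Sesame code: the class name LanguageTagCase and its methods are hypothetical, the input is assumed to be a well-formed tag, and the "after singletons" rule of BCP47 section 2.1.1 is read as "anywhere after a singleton subtag", which is an interpretation rather than something the text above settles.

```java
import java.util.Locale;

// Hypothetical helper, not part of Sesame: illustrates the two case conventions.
public final class LanguageTagCase {
    private LanguageTagCase() {}

    /** Lower-case form, matching the "value space is lower case" wording. */
    public static String toLowerCase(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    /**
     * BCP47 section 2.1.1 formatting: everything lower case, except that
     * two-letter subtags are upper-cased and four-letter subtags are
     * title-cased when they are not the first subtag and do not follow
     * a singleton (assumed here to mean: anywhere after a singleton).
     */
    public static String toBcp47Case(String tag) {
        String[] subtags = tag.split("-", -1);
        StringBuilder out = new StringBuilder(tag.length());
        boolean afterSingleton = false;
        for (int i = 0; i < subtags.length; i++) {
            String s = subtags[i].toLowerCase(Locale.ROOT);
            if (i > 0 && !afterSingleton) {
                if (s.length() == 2) {
                    s = s.toUpperCase(Locale.ROOT);          // region, e.g. GB
                } else if (s.length() == 4) {
                    s = Character.toUpperCase(s.charAt(0))   // script, e.g. Latn
                            + s.substring(1);
                }
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
            if (s.length() == 1) {
                afterSingleton = true; // extension or private-use part follows
            }
        }
        return out.toString();
    }
}
```

Both methods map tags that differ only by case to the same string, so either one could serve as the single comparison form that the proposed wording requires.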

Peter Ansell
January 21, 2014, 1:56 AM

I noticed your comment on public-rdf-comments. It may be best to wait until the RDF-WG makes a formal response before continuing. Given the timelines, particularly the recent January 9 update, it is possible that the comment will be postponed to a later release.

However, I am definitely interested in making the different behaviours easily configurable.

One issue is that the ValueFactory.createLiteral(String,String) method doesn't currently have any context in which to put configuration parameters. This means that, to stay internally consistent with RDF 1.1 Concepts and Abstract Syntax w.r.t. language tag/triple comparisons while remaining performant, a single normalisation algorithm must be applied in or below that method. If it were possible to configure ValueFactory in a similar way to RDFParser, the user could specify their desired algorithm at that point.
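To make the idea concrete, a configurable factory could look roughly like the sketch below. None of this is Sesame's API: ConfigurableLiteralFactory and its nested Literal are hypothetical stand-ins for ValueFactory and LiteralImpl, and the normalizer is passed as a plain java.util.function.UnaryOperator, which assumes Java 8.

```java
import java.util.Objects;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a configurable ValueFactory; not Sesame's API.
public final class ConfigurableLiteralFactory {

    /** Minimal literal holding a label and an already-normalized language tag. */
    public static final class Literal {
        private final String label;
        private final String language;

        Literal(String label, String language) {
            this.label = label;
            this.language = language;
        }

        public String getLabel() {
            return label;
        }

        public String getLanguage() {
            return language;
        }
    }

    // The single normalisation algorithm, chosen once when the factory is built.
    private final UnaryOperator<String> languageNormalizer;

    public ConfigurableLiteralFactory(UnaryOperator<String> languageNormalizer) {
        this.languageNormalizer = Objects.requireNonNull(languageNormalizer);
    }

    /** Same shape as createLiteral(String,String); normalisation happens here. */
    public Literal createLiteral(String label, String language) {
        return new Literal(label, languageNormalizer.apply(language));
    }
}
```

A caller would pick the behaviour once, for example new ConfigurableLiteralFactory(tag -> tag.toLowerCase(java.util.Locale.ROOT)) for the current behaviour, or UnaryOperator.identity() to store tags as-is.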

Another alternative is to make it a configuration setting for RDFWriter, and produce the desired upper/lower/BCP47/etc., case by modifying the language tags on the fly inside of RDFWriter.handleStatement. It could use an existing setting (i.e., RDFWriter.getWriterConfig().set(BasicParserSettings.LANGUAGE_HANDLERS, ...)), so there wouldn't be a need for any new configuration settings. If the RDF-WG does respond and keeps the "value space is lowercase" condition, then, to be interoperable, all Sesame Sail/Repository implementations could still emit Statement objects with lowercase language tags, which the RDFWriter would then modify to the user's specifications.

Vladimir Alexiev
January 21, 2014, 9:40 AM

Agree, we must wait for RDF-WG response.

"upper/lower/BCP47/etc., case"

In this case variety is a bad thing, so let's not add options needlessly. There are "only" 2 alternatives: lower-case and BCP47-case.

Peter Ansell
June 11, 2015, 6:10 AM

Based on the RDF-1.1 direction in the abstract syntax specifying that the value space for language tags is lower cased strings, I am working on the first patch for version 4 that will allow comparisons using that value space, even if the parser or repository uses a different case.
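The general shape of that approach, comparing in the lower-case value space while keeping the tag exactly as supplied, can be sketched as follows. This is not the patch itself: LangLiteral is a hypothetical class used only for illustration, and it ignores datatypes and everything else a real Literal carries.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical illustration, not the actual patch: keeps the original case
// of the language tag but compares on its lower-case form.
public final class LangLiteral {
    private final String label;
    private final String language;     // as supplied by the parser or repository
    private final String languageKey;  // lower-cased, used only for comparison

    public LangLiteral(String label, String language) {
        this.label = Objects.requireNonNull(label);
        this.language = Objects.requireNonNull(language);
        this.languageKey = language.toLowerCase(Locale.ROOT);
    }

    public String getLabel() {
        return label;
    }

    /** Returns the tag in whatever case it was created with. */
    public String getLanguage() {
        return language;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof LangLiteral)) {
            return false;
        }
        LangLiteral that = (LangLiteral) other;
        return label.equals(that.label) && languageKey.equals(that.languageKey);
    }

    @Override
    public int hashCode() {
        // Must agree with equals, so hash the lower-cased key, not the raw tag.
        return Objects.hash(label, languageKey);
    }
}
```

With this shape, "colour"@en-GB and "colour"@en-gb are equal and collapse to one entry in a HashSet, while getLanguage() still returns the case each instance was created with.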

Jeen Broekstra
June 18, 2015, 11:37 PM

Assigning this ticket to you, Peter, since you are active on this issue.

Assignee

Peter Ansell

Reporter

Les Kneebone

Labels

None

Components

Affects versions

Priority

Major