The Internationalized Resource Identifier (IRI) was defined by the Internet Engineering Task Force (IETF) in 2005 as a new standard extending the Uniform Resource Identifier (URI) scheme. [1] The new standard was published in RFC 3987.

While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the Universal Character Set (Unicode / ISO 10646), including Chinese, Japanese kanji, Korean, and Cyrillic characters.

Syntax

IRIs extend URIs by using the Universal Character Set, whereas URIs are limited to ASCII, which has far fewer characters. IRIs may be represented by a sequence of bytes, but they are by definition a sequence of characters, because IRIs can be spoken or written by hand. [2]

Compatibility

IRIs are mapped to URIs to retain backwards-compatibility with systems that do not support the new format. [2]

For applications and protocols that do not allow direct consumption of IRIs, the IRI should first be converted to Unicode using canonical composition normalization (NFC), if not already in Unicode format.

All non-ASCII code points in the IRI should be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.

Example: The IRI https://en.wiktionary.org/wiki/Ῥόδος becomes the URI https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82

ASCII code points that are invalid URI characters may be encoded the same way, depending on implementation. [2]
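The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a full RFC 3987 implementation: the function name iri_to_uri is hypothetical, ASCII characters are passed through unchanged (so invalid ASCII characters such as spaces are not handled), and the host component is not given any special treatment.

```python
import unicodedata

def iri_to_uri(iri: str) -> str:
    """Sketch: NFC-normalize, then percent-encode the UTF-8 bytes of non-ASCII characters."""
    normalized = unicodedata.normalize("NFC", iri)
    parts = []
    for ch in normalized:
        if ord(ch) < 128:
            # Assumption of this sketch: ASCII is copied as-is, even if it
            # would not be a valid URI character.
            parts.append(ch)
        else:
            # Encode the character as UTF-8 and percent-encode each byte.
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)

print(iri_to_uri("https://en.wiktionary.org/wiki/Ῥόδος"))
# https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
```

Normalizing first matters because the same visible text can be stored as different code point sequences; without NFC, two visually identical IRIs could map to different URIs.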

This conversion is easily reversible; by definition, converting an IRI to a URI and back again yields an IRI that is semantically equivalent to the original IRI, even though it may differ in exact representation. [3]
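The reverse direction could look roughly like the following sketch. The function name uri_to_iri is hypothetical, and the approach is a simplification: it decodes only runs of percent-escapes that form valid UTF-8, keeps escaped ASCII characters escaped (so that an escaped "/" or space does not change meaning), and ignores the additional exclusions that RFC 3987 lists for this step.

```python
import re

# One or more consecutive percent-escapes, e.g. "%CF%82".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def uri_to_iri(uri: str) -> str:
    """Sketch: turn percent-encoded UTF-8 sequences back into characters."""
    def decode_run(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the whole run untouched.
            return match.group(0)
        # Keep ASCII characters percent-encoded; only non-ASCII is restored.
        return "".join(
            ch if ord(ch) >= 128 else f"%{ord(ch):02X}" for ch in text
        )
    return _PCT_RUN.sub(decode_run, uri)

print(uri_to_iri("https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82"))
# https://en.wiktionary.org/wiki/Ῥόδος
```

Because escaped ASCII is re-emitted with uppercase hex digits, an input such as %2f comes back as %2F: equivalent in meaning, but not identical in representation.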

Some protocols may impose further transformations, e.g., Punycode for DNS labels.
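As an illustration of the DNS case, Python's built-in "idna" codec (which implements the older IDNA 2003 rules) converts a host name with non-ASCII labels into its Punycode-based ASCII form. The host name bücher.example is an assumption chosen for the example and does not come from the text above.

```python
# Hypothetical host name used only for illustration.
host = "bücher.example"

# Encode each non-ASCII label as an "xn--" Punycode label.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--bcher-kva.example

# The transformation can be reversed to recover the Unicode form.
unicode_host = ascii_host.encode("ascii").decode("idna")
print(unicode_host)  # bücher.example
```

Only the host part of an IRI is treated this way; the path and query would still use percent-encoding.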

Advantages

There are reasons to see URIs displayed in different languages; mostly, it makes them easier to use for people who are unfamiliar with the Latin (A–Z) alphabet. Assuming that it is not too difficult for anyone to replicate arbitrary Unicode on their keyboards, this can make the URI system more accessible. [4]

Disadvantages

Mixing IRIs and ASCII URIs can make it much easier to carry out phishing attacks that trick somebody into believing they’re on a site they really are not on. For example, one can replace the “a” in www.ebay.com or www.paypal.com with an internationalized look-alike “a” character such as ⟨α⟩, and point that IRI to a malicious site. This is known as an IDN homograph attack.
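The following sketch shows why such look-alikes are hard to spot by eye but easy to spot in code. The helper scripts_in is hypothetical and uses a crude heuristic (the first word of each character's Unicode name) as a stand-in for a real script property lookup, so it should be read as an illustration rather than a dependable defence.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Heuristic: collect the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

genuine = "paypal"
spoofed = "p\u03b1yp\u03b1l"  # Greek small alpha in place of Latin "a"

print(genuine == spoofed)            # False
print(sorted(scripts_in(genuine)))   # ['LATIN']
print(sorted(scripts_in(spoofed)))   # ['GREEK', 'LATIN']
```

A label that mixes scripts in this way is a common warning sign, although legitimate names can also mix scripts and some attacks use a single script throughout.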

While a URI does not let people identify resources in their own alphabets, an IRI can be hard to enter on a keyboard that is not able to generate the required internationalized characters. In this respect, IRIs are handled much like any other text that requires non-keyboard input methods when dealing with various languages.

See also

  • IDN (Internationalized Domain Name)
  • Semantic Web
  • Punycode
  • XRI (Extensible Resource Identifier)

References

  1. Gangemi, Aldo; Presutti, Valentina (2006). “The bourne identity of a web resource” (PDF). Proceedings of Identity Reference and the Web Workshop (IRW). Laboratory for Applied Ontology. Roma, Italy: National Research Council (ISTC-CNR): 3. “Notice that IRIs (Internationalized Resource Identifiers) [11] are supposed to replace URIs in next future.”
  2. Duerst, M. (2005). “RFC 3987”. Network Working Group. Internet Engineering Task Force. Standards Track. Retrieved 12 October 2014.
  3. Fensel, Dieter; Domingue, John; Hendler, James A., eds. (2010). Handbook of Semantic Web Technologies (1st ed.). Berlin: Springer-Verlag GmbH. ISBN 978-3-540-92912-3. Retrieved 12 October 2014.
  4. Clark, Kendall (2003-05-07). “Internationalizing the URI”. O’Reilly Media, Inc. Retrieved 12 October 2014.