UTF-16


UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode; in fact this number of code points is dictated by the design of UTF-16. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed.
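The variable-length property is easy to observe in practice. A minimal Python sketch (the character choices are illustrative):

```python
# A character in the Basic Multilingual Plane, e.g. 'A' (U+0041),
# takes one 16-bit code unit, i.e. 2 bytes in UTF-16.
assert len("A".encode("utf-16-be")) == 2

# A character above U+FFFF, e.g. U+1F600 (grinning face emoji),
# takes two 16-bit code units (a surrogate pair), i.e. 4 bytes.
assert len("\U0001F600".encode("utf-16-be")) == 4

# In UTF-8, by contrast, the same characters take 1 and 4 bytes.
assert len("A".encode("utf-8")) == 1
assert len("\U0001F600".encode("utf-8")) == 4
```

The big-endian codec (`utf-16-be`) is used here so the byte count is not inflated by a byte order mark.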

UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language, and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. Since May 2019, Microsoft has supported UTF-8 in addition to UTF-16 and has encouraged its use.

UTF-16 is the only web encoding incompatible with ASCII and never gained popularity on the web, where it is used by under 0.002% (a little over 1 thousandth of 1 percent) of web pages. UTF-8, by comparison, accounts for 98% of all web pages. The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and holds that, for security reasons, browser applications should not use UTF-16.


History


In the late 1980s, work began on developing a uniform encoding for a "Universal Character Set" (UCS) that would replace earlier language-specific encodings with one coordinated system. The goal was to include all required characters from most of the world's languages, as well as symbols from technical domains such as science, mathematics, and music. The original idea was to replace the typical 256-character encodings, which required 1 byte per character, with an encoding using 65,536 (2^16) values, which would require 2 bytes (16 bits) per character.

Two groups worked on this in parallel, ISO/IEC JTC 1/SC 2 and the Unicode Consortium, the latter representing mostly manufacturers of computing equipment. The two groups attempted to synchronize their character assignments so that the developing encodings would be mutually compatible. The early 2-byte encoding was originally called "Unicode", but is now called "UCS-2".

When it became increasingly clear that 2^16 characters would not suffice, the IEEE introduced a larger 31-bit space and an encoding (UCS-4) that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of memory and disk space, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996. It is fully specified in RFC 2781, published in 2000 by the IETF.

In the UTF-16 encoding, code points less than 2^16 are encoded with a single 16-bit code unit equal to the numerical value of the code point, as in the older UCS-2. Code points greater than or equal to 2^16 are encoded by a compound value using two 16-bit code units. These two 16-bit code units are chosen from the UTF-16 surrogate range 0xD800–0xDFFF, which had not previously been assigned to characters. Values in this range are not used as characters, and UTF-16 provides no legal way to code them as individual code points. A UTF-16 stream therefore consists of single 16-bit code units outside the surrogate range for code points in the Basic Multilingual Plane (BMP), and pairs of 16-bit values within the surrogate range for code points above the BMP.
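The surrogate-pair construction follows standard arithmetic: 0x10000 is subtracted from the code point, and the remaining 20-bit value is split into two 10-bit halves, which are added to 0xD800 (high surrogate) and 0xDC00 (low surrogate) respectively. A minimal Python sketch of this computation (function name is illustrative):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a code point above the BMP (U+10000..U+10FFFF)
    as a UTF-16 surrogate pair (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000             # 20-bit offset from U+10000
    high = 0xD800 + (v >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return high, low

# U+1F600 (grinning face) is encoded as the pair 0xD83D, 0xDE00.
assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
```

Decoding reverses the arithmetic: (high − 0xD800) × 0x400 + (low − 0xDC00) + 0x10000 recovers the code point.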

UTF-16 is specified in the latest versions of both the international standard ISO/IEC 10646 and the Unicode Standard. "UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard." UTF-16 will never be extended to support a larger number of code points or to support the code points that were replaced by surrogates, as this would violate the Unicode Stability Policy with respect to general category or surrogate code points. Any scheme that remained a self-synchronizing code would require allocating at least one BMP code point to start a sequence, and changing the purpose of a code point is disallowed.