Unicode


Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines 144,697 characters covering 159 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes.

Beyond the character repertoire itself, the standard specifies normalization rules, decomposition, collation, rendering, and bidirectional text display order for multilingual texts, and so on. The Standard also includes reference data files and visual charts to help developers and designers correctly implement the repertoire.

Unicode's success at unifying character sets has led to its widespread and predominant usage in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, and most modern programming languages.

Unicode can be implemented by different character encodings. The Unicode Standard defines the Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32, and several other encodings. The most commonly used encodings are UTF-8, UTF-16, and the obsolete UCS-2, a precursor of UTF-16 without full support for Unicode; GB18030, while not an official Unicode standard, is standardized in China and implements Unicode fully.
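
For illustration, here is a minimal Python sketch (standard library only) that encodes the same short text under UTF-8, UTF-16, UTF-32, and GB18030; the sample string is arbitrary, and the point is that all four encodings round-trip arbitrary Unicode text while differing in how many bytes they use.

```python
# Encode one string under several Unicode encodings and compare byte counts.
text = "Hi \u4f60\u597d \U0001F600"   # Latin letters, Chinese, and an emoji outside the BMP

for codec in ("utf-8", "utf-16", "utf-32", "gb18030"):
    encoded = text.encode(codec)
    assert encoded.decode(codec) == text        # every encoding round-trips losslessly
    print(f"{codec}: {len(encoded)} bytes")
```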

UTF-8, the dominant encoding on most Unix-like operating systems, uses one byte for the first 128 code points and up to 4 bytes for other characters. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also UTF-8 text.
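
A small Python sketch of this ASCII compatibility: an ASCII string encodes to exactly the same bytes under UTF-8, while characters outside the ASCII range take two to four bytes each (the sample characters are arbitrary illustrations).

```python
# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "Hello"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII code points need two, three, or four bytes.
for ch in ("A", "é", "€", "😀"):            # U+0041, U+00E9, U+20AC, U+1F600
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
# Prints 1, 2, 3, and 4 bytes respectively.
```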

UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see below) on 17 planes, and with over 144,000 code points defined as of version 14.0, UCS-2 is only able to represent less than half of all encoded Unicode characters. Therefore, UCS-2 is obsolete, though still used in software. UTF-16 extends UCS-2 by using the same 16-bit encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text.
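
A brief Python sketch of the mechanism just described: a BMP code point occupies a single 16-bit unit in UTF-16, while a code point from another plane becomes a surrogate pair, which can also be derived by hand from the code point value.

```python
bmp_char = "\u4f60"         # U+4F60, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside the BMP

print(bmp_char.encode("utf-16-be").hex())     # "4f60"     -> one 16-bit unit
print(astral_char.encode("utf-16-be").hex())  # "d83dde00" -> surrogate pair

# Deriving the surrogate pair by hand:
cp = ord(astral_char) - 0x10000
high = 0xD800 + (cp >> 10)      # high surrogate, here 0xD83D
low = 0xDC00 + (cp & 0x3FF)     # low surrogate, here 0xDE00
assert astral_char.encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```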

UTF-32 (also referred to as UCS-4) uses four bytes to encode any given code point, but not necessarily any given user-perceived character (broadly speaking, a grapheme), since a user-perceived character may be represented by a grapheme cluster (a sequence of multiple code points). Like UCS-2, the number of bytes per code point is fixed, facilitating code point indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each code point uses four bytes, UTF-32 takes significantly more space than other encodings and is not widely used. Although UTF-32 has a fixed size for each code point, it is also variable-length with respect to user-perceived characters. Examples include: the Devanagari kshi (क्षी), which is encoded by 4 code points, and national flag emojis, which are composed of two code points. All combining character sequences are graphemes, but there are other sequences of code points that are as well, for example \r\n.
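
A short Python sketch of this distinction, reusing the two examples above; the byte counts assume big-endian UTF-32 without a byte order mark, and counting user-perceived characters would additionally require grapheme-cluster segmentation (available in third-party libraries such as regex via \X, not in the standard library).

```python
kshi = "\u0915\u094d\u0937\u0940"   # Devanagari kshi (क्षी): four code points
flag = "\U0001F1FA\U0001F1F8"       # two regional indicators rendered as one flag

for s in (kshi, flag):
    print(len(s), "code points ->", len(s.encode("utf-32-be")), "bytes in UTF-32")
# 4 code points -> 16 bytes; 2 code points -> 8 bytes.
```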

Origin and development


Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO/IEC 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).

Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).

In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
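
As a small illustration of this abstraction, the following Python sketch looks up the code point and the standard character name of a few arbitrary characters; nothing in the data says how the characters should be drawn.

```python
import unicodedata

for ch in ("A", "ß", "中"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0041  LATIN CAPITAL LETTER A
# U+00DF  LATIN SMALL LETTER SHARP S
# U+4E2D  CJK UNIFIED IDEOGRAPH-4E2D
```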

The first 256 code points were made identical to the content of ISO/IEC 8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean (CJK) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. For other examples, see duplicate characters in Unicode.
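
A minimal Python sketch of one such duplicate: the fullwidth letter is a code point distinct from its ordinary counterpart, and compatibility normalization (NFKC) maps it back to the ordinary form.

```python
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
ascii_a = "A"            # U+0041

print(fullwidth_a == ascii_a)                                   # False: distinct code points
print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a)    # True: compatibility-equivalent
```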

Unicode Bulldog Award recipients include many names influential in the development of Unicode, such as Tatsuo Kobayashi, Thomas Milo, Roozbeh Pournader, Ken Lunde, and Michael Everson.

Based on experiences with earlier character encoding work at Xerox, Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set. With additional input from Peter Fenwick and Dave Opstad, Joe Becker published a draft proposal for an "international/multilingual text character encoding system in August 1988, tentatively called Unicode". He explained that "the name 'Unicode' is intended to suggest a unique, unified, universal encoding".

In this document, entitled Unicode 88, Becker outlined a 16-bit character model:

Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:

Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2¹⁴ = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun Microsystems, and in 1990, Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan of NeXT joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a review draft of Unicode was ready.

The Unicode Consortium was incorporated in California on 3 January 1991, and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.

In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., Egyptian hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names, making them rarely used, but much more essential than envisioned in the original architecture of Unicode.
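
The codespace figures quoted in this article follow from a short computation, sketched here in Python using only the plane and surrogate counts stated above.

```python
planes = 17
per_plane = 0x10000            # 65,536 code points per plane
surrogates = 0xE000 - 0xD800   # 2,048 code points reserved for UTF-16 surrogates

print(planes * per_plane - surrogates)   # 1112064 code points available for characters
print(hex(planes * per_plane - 1))       # 0x10ffff, the highest code point
```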

The Microsoft TrueType specification version 1.0 from 1992 used the name 'Apple Unicode' instead of 'Unicode' for the Platform ID in the naming table.

The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the leading computer software and hardware companies with any interest in text-processing standards, including Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, and SAP SE.

Over the years several countries or government agencies have been members of the Unicode Consortium. Presently only the Ministry of Endowments and Religious Affairs (Oman) is a full member with voting rights.

The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.

Unicode currently covers most major writing systems in use today.

As of 2021, a total of 159 scripts are included in the latest version of Unicode (covering alphabets, abugidas and syllabaries), although there are still scripts that are not yet encoded, especially those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols), also occur.

The Unicode Roadmap Committee (Michael Everson, Rick McGowan, Ken Whistler, V.S. Umamaheswaran) manages the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap page of the Unicode Consortium website. For some scripts on the Roadmap, such as Jurchen and Khitan small script, encoding proposals have been made and they are working their way through the approval process. For other scripts, such as Mayan (besides numerals) and Rongorongo, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.

Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Areas code assignments.

There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.

The Script Encoding Initiative, a project run by Deborah Anderson at the University of California, Berkeley, was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years.

The Unicode Consortium and the International Organization for Standardization (ISO) have together developed a shared repertoire following the initial publication of The Unicode Standard in 1991; Unicode and the ISO's Universal Coded Character Set (UCS) use identical character names and code points. However, the Unicode versions do differ from their ISO equivalents in two significant ways.

While the UCS is a simple character map, Unicode specifies the rules, algorithms, and properties necessary to achieve interoperability between different platforms and languages. Thus, The Unicode Standard includes more information, covering—in depth—topics such as bitwise encoding, collation and rendering. It also provides a comprehensive catalog of character properties, including those needed for supporting bidirectional text, as well as visual charts and reference data sets to aid implementers. Previously, The Unicode Standard was sold as a print volume containing the complete core specification, standard annexes, and code charts. However, Unicode 5.0, published in 2006, was the last version printed this way. Starting with version 5.2, only the core specification, published as a print-on-demand paperback, may be purchased. The full text, on the other hand, is published as a free PDF on the Unicode website.
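
As a rough illustration of this difference, Python's unicodedata module exposes a subset of the character properties and the normalization algorithm defined by the standard (the characters chosen here are arbitrary).

```python
import unicodedata

ch = "é"
print(unicodedata.category(ch))        # "Ll": letter, lowercase
print(unicodedata.bidirectional(ch))   # "L": strong left-to-right

composed = "\u00e9"       # é as a single precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)   # True: canonically equivalent
```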

A practical reason for this publication method highlights the second significant difference between the UCS and Unicode—the frequency with which updated versions are released and new characters added. The Unicode Standard has regularly released annual expanded versions, occasionally with more than one version released in a calendar year and with rare cases where the scheduled release had to be postponed. For instance, in April 2020, only a month after version 13.0 was published, the Unicode Consortium announced they had changed the intended release date for version 14.0, pushing it back six months from March 2021 to September 2021 due to the COVID-19 pandemic.

Thus far, the following major and minor versions of the Unicode standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.
