7.9. unicodedata — Unicode 数据库

This module provides access to the Unicode Character Database which defines character properties for all Unicode characters. The data in this database is based on the UnicodeData.txt file version 5.2.0 which is publicly available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see https://www.unicode.org/reports/tr44/). It defines the following functions:

unicodedata.lookup(name)

Look up character by name. If a character with the given name is found, return the corresponding Unicode character. If not found, KeyError is raised.

unicodedata.name(unichr[, default])

Returns the name assigned to the Unicode character unichr as a string. If no name is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.decimal(unichr[, default])

Returns the decimal value assigned to the Unicode character unichr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.digit(unichr[, default])

Returns the digit value assigned to the Unicode character unichr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.numeric(unichr[, default])

Returns the numeric value assigned to the Unicode character unichr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.category(unichr)

Returns the general category assigned to the Unicode character unichr as string.

unicodedata.bidirectional(unichr)

Returns the bidirectional class assigned to the Unicode character unichr as string. If no such value is defined, an empty string is returned.

unicodedata.combining(unichr)

Returns the canonical combining class assigned to the Unicode character unichr as integer. Returns 0 if no combining class is defined.

unicodedata.east_asian_width(unichr)

Returns the east asian width assigned to the Unicode character unichr as string.

2.4 新版功能.

unicodedata.mirrored(unichr)

Returns the mirrored property assigned to the Unicode character unichr as integer. Returns 1 if the character has been identified as a “mirrored” character in bidirectional text, 0 otherwise.

unicodedata.decomposition(unichr)

Returns the character decomposition mapping assigned to the Unicode character unichr as string. An empty string is returned in case no such mapping is defined.

unicodedata.normalize(form, unistr)

返回 Unicode 字符串 unistr 的正常形式 formform 的有效值为 ‘NFC’ 、 ‘NFKC’ 、 ‘NFD’ 和 ‘NFKD’ 。

Unicode 标准基于规范等价和兼容性等效的定义定义了 Unicode 字符串的各种规范化形式。在 Unicode 中,可以以各种方式表示多个字符。 例如,字符 U+00C7 (带有 CEDILLA 的 LATIN CAPITAL LETTER C )也可以表示为序列 U+0043( LATIN CAPITAL LETTER C )U+0327( COMBINING CEDILLA )。

对于每个字符,有两种正规形式:正规形式 C 和正规形式 D 。正规形式D(NFD)也称为规范分解,并将每个字符转换为其分解形式。 正规形式C(NFC)首先应用规范分解,然后再次组合预组合字符。

除了这两种形式之外,还有两种基于兼容性等效的其他常规形式。 在 Unicode 中,支持某些字符,这些字符通常与其他字符统一。 例如, U+2160(ROMAN NUMERAL ONE)与 U+0049(LATIN CAPITAL LETTER I)完全相同。 但是, Unicode 支持它与现有字符集(例如 gb2312 )的兼容性。

正规形式KD(NFKD)将应用兼容性分解,即用其等价项替换所有兼容性字符。 正规形式KC(NFKC)首先应用兼容性分解,然后是规范组合。

即使两个 unicode 字符串被规范化并且人类读者看起来相同,如果一个具有组合字符而另一个没有,则它们可能无法相等。

2.3 新版功能.

此外,该模块暴露了以下常量:

unicodedata.unidata_version

此模块中使用的 Unicode 数据库的版本。

2.3 新版功能.

unicodedata.ucd_3_2_0

这是一个与整个模块具有相同方法的对象,但对于需要此特定版本的 Unicode 数据库(如 IDNA )的应用程序,则使用 Unicode 数据库版本 3.2 。

2.5 新版功能.

示例:

>>> import unicodedata
>>> unicodedata.lookup('LEFT CURLY BRACKET')
u'{'
>>> unicodedata.name(u'/')
'SOLIDUS'
>>> unicodedata.decimal(u'9')
9
>>> unicodedata.decimal(u'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a decimal
>>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
'Lu'
>>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
'AN'