32.7. tokenize – Tokenizer for Python source

Source code: Lib/tokenize.py


The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.

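To simplify token stream handling, all operator and delimiter tokens and Ellipsis are returned using the generic OP token type. The exact type can be determined by checking the exact_type property on the named tuple returned from tokenize.tokenize().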

32.7.1. Tokenizing Input

The primary entry point is a generator:

tokenize.tokenize(readline)

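The tokenize() generator requires one argument, readline, which must be a callable object that provides the same interface as the io.IOBase.readline() method of file objects. Each call to the function should return one line of input as bytes.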

The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the logical line; continuation lines are included. The 5-tuple is returned as a named tuple with the field names: type, string, start, end, line.

The returned named tuple has an additional property named exact_type that contains the exact operator type for token.OP tokens. For all other token types exact_type equals the named tuple type field.
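As a minimal sketch of the fields described above, the following tokenizes a one-line source held in memory. The sample source and the variable names are illustrative assumptions, not part of the module.

```python
import io
import tokenize

source = b"x = (1 + 2)\n"

# readline must return bytes, so wrap the source in a BytesIO object.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Each token is a named tuple with the fields type, string, start, end, line.
name_tok = tokens[1]  # tokens[0] is the ENCODING token
print(name_tok.string, name_tok.start, name_tok.end)

# Operators share the generic OP type; exact_type tells them apart.
ops = [
    (tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], tok.string)
    for tok in tokens
    if tok.type == tokenize.OP
]
for row in ops:
    print(row)
```

For tokens that are not operators, exact_type is simply the same value as type.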

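Changed in version 3.1: Added support for named tuples.

Changed in version 3.3: Added support for exact_type.

tokenize() determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to PEP 263.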

All constants from the token module are also exported from tokenize, as are three additional token type values:

tokenize.COMMENT

Token value used to indicate a comment.

tokenize.NL

Token value used to indicate a non-terminating newline. The NEWLINE token indicates the end of a logical line of Python code; NL tokens are generated when a logical line of code is continued over multiple physical lines.

tokenize.ENCODING

Token value that indicates the encoding used to decode the source bytes into text. The first token returned by tokenize() will always be an ENCODING token.
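To illustrate how these three token types show up in practice, here is a small sketch that tokenizes one logical line split over two physical lines, with a comment on the first. The sample source is an assumption chosen for illustration.

```python
import io
import tokenize

# One logical line continued inside parentheses, with a trailing comment.
source = b"total = (1 +  # first operand\n         2)\n"

names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]
print(names)
```

The line break inside the parentheses is reported as NL, while the break that ends the statement is reported as NEWLINE.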

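Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.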

tokenize.untokenize(iterable)

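Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.

The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input, so the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change.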

It returns bytes, encoded using the ENCODING token, which is the first token sequence output by tokenize().
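The round-trip guarantee can be checked with a short sketch like the one below. The sample source is an assumption. The sketch compares only token types and token strings, since spacing is not guaranteed when two-element sequences are passed.

```python
import io
import tokenize

source = b"x = 1 + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Full 5-tuples carry position information for each token.
full_round_trip = tokenize.untokenize(tokens)

# With only (type, string) pairs, the spacing between tokens may change.
pairs = [(tok.type, tok.string) for tok in tokens]
compat_output = tokenize.untokenize(pairs)

# Tokenizing the rebuilt source yields the same (type, string) pairs.
retokenized = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(compat_output).readline)
]
print(retokenized == pairs)
```

Because the token stream starts with an ENCODING token, both results here are bytes objects.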

tokenize() needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:

tokenize.detect_encoding(readline)

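The detect_encoding() function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, readline, in the same way as the tokenize() generator.

It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.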

It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in PEP 263. If both a BOM and a cookie are present, but disagree, a SyntaxError will be raised. Note that if the BOM is found, 'utf-8-sig' will be returned as an encoding.

If no encoding is specified, then the default of 'utf-8' will be returned.

Use open() to open Python source files: it uses detect_encoding() to detect the file encoding.
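The cases described above can be exercised with in-memory byte strings, as in this sketch. The sample sources and the helper name detect() are assumptions made for illustration.

```python
import io
import tokenize

def detect(source):
    """Run detect_encoding() over a bytes object (hypothetical helper)."""
    return tokenize.detect_encoding(io.BytesIO(source).readline)

# An encoding cookie on the first line; the name is reported in normalized form.
cookie_result = detect(b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n")

# A UTF-8 BOM with no cookie.
bom_result = detect(b"\xef\xbb\xbfx = 1\n")

# Neither a BOM nor a cookie: the default applies.
default_result = detect(b"x = 1\n")

# A BOM and a cookie that disagree.
try:
    detect(b"\xef\xbb\xbf# -*- coding: latin-1 -*-\n")
    conflict_raised = False
except SyntaxError:
    conflict_raised = True

print(cookie_result, bom_result, default_result, conflict_raised)
```

The second item of each result holds the raw lines that were consumed while looking for the cookie, so a caller can still process them.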

tokenize.open(filename)

Open a file in read only mode using the encoding detected by detect_encoding().

New in version 3.2.
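A brief sketch of tokenize.open() follows. It writes a small Latin-1 source file into a temporary directory and reads it back as text. The file name and its contents are assumptions chosen for the example.

```python
import os
import tempfile
import tokenize

raw = b"# -*- coding: latin-1 -*-\ngreeting = 'caf\xe9'\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp, "latin1_example.py")
    with open(path, "wb") as f:
        f.write(raw)

    # tokenize.open() reads the cookie and decodes the file accordingly.
    with tokenize.open(path) as f:
        detected = f.encoding
        text = f.read()

print(detected)
print(text)
```

Opening the same file with the built-in open() and a UTF-8 encoding would fail to decode the byte 0xE9, which is why the detected encoding matters here.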

exception tokenize.TokenError

Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:

"""Beginning of
docstring

or:

[1,
 2,
 3

Note that unclosed single-quoted strings do not cause an error to be raised. They are tokenized as ERRORTOKEN, followed by the tokenization of their contents.
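The two incomplete inputs shown above can be fed to the tokenizer to observe the exception, as in this sketch. The helper name raises_token_error() is hypothetical. Note that the error only surfaces while iterating over the generator.

```python
import io
import tokenize

def raises_token_error(source):
    """Return True if tokenizing the given bytes raises TokenError."""
    try:
        # Exhaust the generator; nothing is tokenized until it is iterated.
        list(tokenize.tokenize(io.BytesIO(source).readline))
    except tokenize.TokenError:
        return True
    return False

unfinished_docstring = b'"""Beginning of\ndocstring\n'
unfinished_list = b"[1,\n 2,\n 3\n"
complete_list = b"[1,\n 2,\n 3]\n"

print(raises_token_error(unfinished_docstring))
print(raises_token_error(unfinished_list))
print(raises_token_error(complete_list))
```

Tools that read source interactively can use this exception as a signal that more input is needed before the text forms a complete statement.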

32.7.2. Command-Line Usage

New in version 3.3.

The tokenize module can be executed as a script from the command line. It is as simple as:

python -m tokenize [-e] [filename.py]

The following options are accepted:

-h, --help

show this help message and exit

-e, --exact

display token names using the exact type

If filename.py is specified, its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.

32.7.3. Examples

Example of a script rewriter that transforms float literals into Decimal objects:

from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s)  #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')

Example of tokenizing from the command line. The script:

def say_hello():
    print("Hello, World!")

say_hello()

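will be tokenized to the following output, where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any):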

$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the -e option:

$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
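Listings like the two above can be approximated from within a program. The sketch below is only an approximation: the helper format_token() and its column widths are assumptions chosen to resemble the command-line output, not an interface provided by the module.

```python
import io
import tokenize

def format_token(tok, exact=False):
    """Format one token similarly to a line of the listings (hypothetical helper)."""
    token_type = tok.exact_type if exact else tok.type
    # start and end are 2-tuples; concatenating them gives four integers.
    token_range = "%d,%d-%d,%d:" % (tok.start + tok.end)
    line = "%-20s%-15s%-15r" % (
        token_range,
        tokenize.tok_name[token_type],
        tok.string,
    )
    return line.rstrip()

source = b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

generic_lines = [format_token(tok) for tok in tokens]
exact_lines = [format_token(tok, exact=True) for tok in tokens]

print("\n".join(generic_lines))
print("\n".join(exact_lines))
```

Passing exact=True mirrors the effect of the -e option by substituting exact_type for the generic type.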