32.7. tokenize – Tokenizer for Python source
Source code: Lib/tokenize.py
The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers," including colorizers for on-screen displays.
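To simplify token stream handling, all operator and delimiter tokens, as well as Ellipsis, are returned using the generic OP token type. The exact type can be determined by checking the exact_type property on the named tuple returned from tokenize.tokenize().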
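The difference between the generic OP type and exact_type can be seen in a minimal sketch such as the following. The sample source line and the variable names are illustrative assumptions, not part of the module.

```python
import io
import tokenize

# Hypothetical sample input; tokenize() expects a readline callable yielding bytes.
source = b"x = (1 + 2)\n"

# Collect (string, generic type name, exact type name) for every operator token.
ops = [
    (tok.string, tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type])
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
    if tok.type == tokenize.OP
]

for string, generic, exact in ops:
    print(f"{string!r}: type={generic}, exact_type={exact}")
```

Every operator reports OP as its type, while exact_type distinguishes EQUAL, LPAR, PLUS and RPAR.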
32.7.1. Tokenizing Input
The primary entry point is a generator:
-
tokenize.tokenize(readline)
The tokenize() generator requires one argument, readline, which must be a callable object that provides the same interface as the io.IOBase.readline() method of file objects. Each call to the function should return one line of input as bytes.
The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the logical line; continuation lines are included. The 5-tuple is returned as a named tuple with the field names: type string start end line.
The returned named tuple has an additional property named exact_type that contains the exact operator type for token.OP tokens. For all other token types, exact_type equals the named tuple type field.
Changed in version 3.1: Added support for named tuples.
Changed in version 3.3: Added support for exact_type.
tokenize() determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to PEP 263.
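As a small illustration of the interface described above, the sketch below feeds tokenize() the readline method of an in-memory io.BytesIO object and inspects the resulting named tuples. The one-line sample program is a hypothetical input chosen for brevity.

```python
import io
import tokenize

source = b"answer = 42\n"  # hypothetical one-line program

# BytesIO.readline returns one line of bytes per call, as tokenize() requires.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Each item is a named tuple, so it can be unpacked like a plain 5-tuple...
tok_type, tok_string, start, end, line = tokens[1]

# ...or read through its field names.
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

Rows are counted from 1 and columns from 0, so the NAME token for answer spans (1, 0) to (1, 6).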
All constants from the token module are also exported from tokenize, as are three additional token type values:
-
tokenize.COMMENT
Token value used to indicate a comment.
-
tokenize.NL
Token value used to indicate a non-terminating newline. The NEWLINE token indicates the end of a logical line of Python code; NL tokens are generated when a logical line of code is continued over multiple physical lines.
-
tokenize.ENCODING
Token value that indicates the encoding used to decode the source bytes into text. The first token returned by tokenize() will always be an ENCODING token.
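The three additional token types can be observed together in a short sketch like the one below. The sample source (a comment line followed by a list literal continued over two physical lines) is an assumption made for illustration.

```python
import io
import tokenize

# Hypothetical input: a comment-only line, then one logical line
# that is continued over two physical lines inside brackets.
source = (
    b"# a comment on its own line\n"
    b"values = [1,\n"
    b"          2]\n"
)

names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]
print(names)
```

The stream starts with ENCODING, the comment appears as a COMMENT token, the line breaks that do not end a logical line are reported as NL, and only the end of the assignment produces NEWLINE.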
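Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.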
-
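tokenize.untokenize(iterable)
Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.
The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input, so the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change.
It returns bytes, encoded using the ENCODING token, which is the first token sequence output by tokenize(). If the input contains no ENCODING token, it returns a str instead.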
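The round-trip guarantee can be checked with a minimal sketch such as the following. The sample source and variable names are illustrative assumptions; the sketch compares only token types and strings after the 2-tuple round trip, since spacing may differ.

```python
import io
import tokenize

source = b"x = 1 + 2\n"  # hypothetical input
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Full 5-tuples carry positions, so this simple input is rebuilt unchanged.
rebuilt = tokenize.untokenize(tokens)

# With (type, string) pairs only, spacing between tokens may change...
pairs = [(tok.type, tok.string) for tok in tokens]
respaced = tokenize.untokenize(pairs)

# ...but tokenizing the result again yields the same types and strings.
retokenized = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(respaced).readline)
]

# Without the leading ENCODING token the result is str rather than bytes.
text_only = tokenize.untokenize(pairs[1:])

print(rebuilt, respaced, repr(text_only))
```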
tokenize() needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:
-
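tokenize.detect_encoding(readline)
The detect_encoding() function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, readline, in the same way as the tokenize() generator.
It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.
It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in PEP 263. If both a BOM and a cookie are present, but disagree, a SyntaxError will be raised. Note that if the BOM is found, 'utf-8-sig' will be returned as an encoding.
If no encoding is specified, then the default of 'utf-8' will be returned.
Use open() to open Python source files: it uses detect_encoding() to detect the file encoding.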
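The cases described above (encoding cookie, BOM, default, and a BOM that disagrees with the cookie) can be exercised with a sketch like this one. The detect() helper and the sample byte strings are hypothetical and exist only for this illustration.

```python
import io
import tokenize

def detect(source):
    """Hypothetical helper: run detect_encoding() over in-memory bytes."""
    return tokenize.detect_encoding(io.BytesIO(source).readline)

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark

cookie = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")
bom = detect(BOM + b"x = 1\n")
default = detect(b"x = 1\n")

# A BOM combined with a non-UTF-8 cookie is reported as a SyntaxError.
try:
    detect(BOM + b"# -*- coding: latin-1 -*-\nx = 1\n")
    conflict_raised = False
except SyntaxError:
    conflict_raised = True

print(cookie, bom, default, conflict_raised)
```

The encoding name is returned in normalized form, so a latin-1 cookie is reported as 'iso-8859-1'. The returned lines are the raw bytes already consumed, with the BOM stripped.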
-
tokenize.open(filename)
Open a file in read only mode using the encoding detected by detect_encoding().
New in version 3.2.
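A brief sketch of tokenize.open() in use follows. It writes a temporary Latin-1 encoded file, whose name and contents are assumptions made for this example, and reads it back as text without specifying the encoding by hand.

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name and contents, written as Latin-1 bytes.
    path = os.path.join(tmp, "latin1_example.py")
    with open(path, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1"))

    # tokenize.open() reads the coding cookie and decodes accordingly.
    with tokenize.open(path) as f:
        detected = f.encoding
        text = f.read()

print(detected)
print(text)
```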
-
exception tokenize.TokenError
Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:
"""Beginning of docstring
or:
[1, 2, 3
Note that unclosed single-quoted strings do not cause an error to be raised. They are tokenized as ERRORTOKEN, followed by the tokenization of their contents.
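The two incomplete inputs shown above can be tried with a sketch along these lines. The try_tokenize() helper is hypothetical, and the exact error message is not relied upon because it may vary between Python versions.

```python
import io
import tokenize

def try_tokenize(source):
    """Hypothetical helper: return the TokenError raised, or None."""
    try:
        # The generator is lazy, so it must be consumed to trigger the error.
        list(tokenize.tokenize(io.BytesIO(source).readline))
    except tokenize.TokenError as exc:
        return exc
    return None

docstring_error = try_tokenize(b'"""Beginning of docstring\n')
bracket_error = try_tokenize(b"[1, 2, 3\n")
complete = try_tokenize(b"[1, 2, 3]\n")

print(docstring_error)
print(bracket_error)
print(complete)
```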
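32.7.2. Command-Line Usage
New in version 3.3.
The tokenize module can be executed as a script from the command line. It is as simple as:
python -m tokenize [-e] [filename.py]
The following options are accepted:
-
-h, --help
show this help message and exit
-
-e, --exact
display token names using the exact type
If filename.py is specified, its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.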
32.7.3. Examples
Example of a script rewriter that transforms float literals into Decimal objects:
from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s)  #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')
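Example of tokenizing from the command line. The script:
def say_hello():
    print("Hello, World!")

say_hello()
will be tokenized to the following output, where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any).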
$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
The exact token type names can be displayed using the -e option:
$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
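A listing in roughly the same shape can also be produced from Python code. The sketch below is an approximation under stated assumptions, not the module's own command-line implementation: the format_tokens() helper, its exact parameter, and the column widths are choices made for this illustration.

```python
import io
import tokenize

def format_tokens(source, exact=False):
    """Hypothetical helper: one formatted line per token in `source` (bytes)."""
    lines = []
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        # Choose between the generic type and the exact operator type.
        token_type = tok.exact_type if exact else tok.type
        # start and end are 2-tuples; concatenating them gives four ints.
        token_range = "%d,%d-%d,%d:" % (tok.start + tok.end)
        lines.append(
            "%-20s%-15s%r" % (token_range, tokenize.tok_name[token_type], tok.string)
        )
    return lines

listing = format_tokens(b"say_hello()\n", exact=True)
print("\n".join(listing))
```

Passing exact=False reports the parentheses as OP instead of LPAR and RPAR, matching the difference between the two listings above.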