`tokenize` --- Tokenizer for Python source

Source code: Lib/tokenize.py

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.
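As a small illustration of comments being returned as tokens, the sketch below collects every comment in a piece of source text. The helper name `extract_comments` and the sample string are made up for this example; `generate_tokens()` and the `COMMENT` constant are part of the module.

```python
import io
import tokenize

def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) pairs found in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start[0], tok.string)          # start is a (row, col) pair
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]

sample = "x = 1  # set x\n# standalone comment\ny = x + 1\n"
for lineno, text in extract_comments(sample):
    print(lineno, text)
```

A colorizer or pretty-printer could use the same loop and branch on other token types as well.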

To simplify token stream handling, all operator and delimiter tokens and Ellipsis are returned using the generic OP token type. The exact type can be determined by checking the exact_type property on the named tuple returned from tokenize.tokenize().

Warning

Note that the functions in this module are only designed to parse syntactically valid Python code (code that does not raise when parsed using ast.parse()). The behavior of the functions in this module is undefined when providing invalid Python code and it can change at any point.

Tokenizing Input

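The primary entry point is a generator:

tokenize.tokenize(readline)

The tokenize() generator requires one argument, readline, which must be a callable object that provides the same interface as the io.IOBase.readline() method of file objects. Each call to the function should return one line of input as bytes.

The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the physical line. The 5-tuple is returned as a named tuple with the field names: type string start end line.

The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.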
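A minimal sketch of reading those fields, assuming an in-memory source wrapped in `io.BytesIO` (the sample statement is invented for illustration). It prints the generic type next to the exact type so the difference is visible for OP tokens.

```python
import io
import tokenize

source = b"total = price * 2\n"

# tokenize() wants a readline callable that returns bytes.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

for tok in tokens:
    # tok_name maps numeric token types to their names.
    type_name = tokenize.tok_name[tok.type]
    exact_name = tokenize.tok_name[tok.exact_type]
    print(f"{type_name:<10} {exact_name:<10} {tok.string!r:<10} {tok.start} {tok.end}")
```

For the `*` token the generic type is OP while exact_type is STAR; for names and numbers both columns show the same value.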

Changed in version 3.1: Added support for named tuples.

Changed in version 3.3: Added support for exact_type.

tokenize() determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to PEP 263.

tokenize.generate_tokens(readline)

Tokenize a source reading unicode strings instead of bytes.

Like tokenize(), the readline argument is a callable returning a single line of input. However, generate_tokens() expects readline to return a str object rather than bytes.

The result is an iterator yielding named tuples, exactly like tokenize(). It does not yield an ENCODING token.
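The difference between the two entry points can be checked with a short comparison. This is only a sketch over a one-line sample string; it tokenizes the same text once as str and once as UTF-8 bytes.

```python
import io
import tokenize

text = "answer = 42\n"

# generate_tokens() reads str lines; tokenize() reads bytes lines.
str_tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
byte_tokens = list(tokenize.tokenize(io.BytesIO(text.encode("utf-8")).readline))

# tokenize() starts with an ENCODING token; generate_tokens() does not.
print(tokenize.tok_name[byte_tokens[0].type])
print(tokenize.tok_name[str_tokens[0].type])

# Apart from that leading token, the (type, string) pairs should agree.
print([t[:2] for t in byte_tokens[1:]] == [t[:2] for t in str_tokens])
```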

All constants from the token module are also exported from tokenize.

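Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.

tokenize.untokenize(iterable)

Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.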

The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change.

It returns bytes, encoded using the ENCODING token, which is the first token sequence output by tokenize(). If there is no encoding token in the input, it returns a str instead.
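The round-trip guarantee can be exercised as follows. This is a sketch on a deliberately oddly spaced sample line; the helper `type_string_pairs` is a name introduced here, not part of the module.

```python
import io
import tokenize

source = b"x  =  1 +   2\n"

tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Passing the full 5-tuples lets untokenize() use the recorded positions.
full = tokenize.untokenize(tokens)

# Passing only (type, string) pairs is also accepted; spacing may then differ.
pairs = [(tok.type, tok.string) for tok in tokens]
compact = tokenize.untokenize(pairs)

def type_string_pairs(data: bytes):
    """Tokenize data and keep only the parts covered by the guarantee."""
    return [(t.type, t.string) for t in tokenize.tokenize(io.BytesIO(data).readline)]

# Both results are bytes because the stream starts with an ENCODING token.
print(type(full).__name__, type(compact).__name__)
print(type_string_pairs(full) == type_string_pairs(source))
print(type_string_pairs(compact) == type_string_pairs(source))
```

Feeding tokens from generate_tokens() instead would give a str result, since that stream carries no ENCODING token.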

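tokenize() needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:

tokenize.detect_encoding(readline)

The detect_encoding() function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, readline, in the same way as the tokenize() generator.

It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.

It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in PEP 263. If both a BOM and a cookie are present, but disagree, a SyntaxError will be raised. Note that if the BOM is found, 'utf-8-sig' will be returned as an encoding.

If no encoding is specified, then the default of 'utf-8' will be returned.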
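The cases described above can be tried on in-memory byte strings. This is a sketch; the helper `sniff` and the sample sources are invented for illustration. In CPython a `latin-1` cookie is reported under the normalized name `iso-8859-1`.

```python
import io
import tokenize

def sniff(source: bytes):
    """Return (encoding, lines_read) for an in-memory source."""
    return tokenize.detect_encoding(io.BytesIO(source).readline)

plain = b"x = 1\n"
with_bom = b"\xef\xbb\xbf" + plain                   # UTF-8 BOM, no cookie
with_cookie = b"# -*- coding: latin-1 -*-\n" + plain  # PEP 263 cookie

for label, data in [("plain", plain), ("BOM", with_bom), ("cookie", with_cookie)]:
    encoding, lines = sniff(data)
    print(label, encoding, lines)

# A BOM combined with a non-UTF-8 cookie is contradictory.
try:
    sniff(b"\xef\xbb\xbf" + with_cookie)
except SyntaxError as exc:
    print("conflict:", exc)
```

The returned lines are still bytes, so a caller that wants to continue reading has to account for the lines already consumed.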

Use open() to open Python source files: it uses detect_encoding() to detect the file encoding.

tokenize.open(filename): Open a file in read only mode using the encoding detected by detect_encoding().

Added in version 3.2.
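A short sketch of tokenize.open() on a file that is not UTF-8. The temporary file and its contents are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile
import tokenize

# Hypothetical source file saved as Latin-1, declared with a PEP 263 cookie.
raw = "# -*- coding: latin-1 -*-\nname = 'café'\n".encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # tokenize.open() reads the cookie and returns a text-mode file object.
    with tokenize.open(path) as f:
        detected = f.encoding
        text = f.read()
finally:
    os.remove(path)

print(detected)
print("café" in text)
```

Opening the same file with the built-in open() and a UTF-8 default would fail to decode the accented character, which is the situation tokenize.open() is meant to avoid.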

exception tokenize.TokenError

Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:

"""Beginning of
docstring

or:

[1,
 2,
 3
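A sketch of handling the unterminated-bracket case shown above. The function name `is_complete` is an assumption made for this example; depending on the Python version, other malformed inputs may surface as different exception types, so only TokenError is handled here.

```python
import io
import tokenize

def is_complete(source: str) -> bool:
    """Return False if tokenizing hits an unterminated multi-line construct."""
    try:
        # The generator is lazy, so it has to be consumed for errors to surface.
        for _ in tokenize.generate_tokens(io.StringIO(source).readline):
            pass
    except tokenize.TokenError:
        return False
    return True

print(is_complete("[1,\n 2,\n 3\n"))   # bracket never closed
print(is_complete("[1,\n 2,\n 3]\n"))  # bracket closed
```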

Command-Line Usage

Added in version 3.3.

The tokenize module can be executed as a script from the command line. It is as simple as:

python -m tokenize [-e] [filename.py]

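The following options are accepted:

-h, --help: show this help message and exit

-e, --exact: display token names using the exact type

If filename.py is specified its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.

Examples

Example of a script rewriter that transforms float literals into Decimal objects: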

from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s)  #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')

Example of tokenizing from the command line. The script:

def say_hello():
    print("Hello, World!")

say_hello()

will be tokenized to the following output, where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any):

$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the -e option:

$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
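A listing in the same spirit can be produced from Python code. The sketch below is an approximation written for this example, not the module's own command-line implementation; the function name `format_tokens` and the column widths are assumptions, and because it uses generate_tokens() there is no leading ENCODING row.

```python
import io
import tokenize

def format_tokens(source: str, exact: bool = False) -> list[str]:
    """Render one text row per token: position range, type name, token string."""
    rows = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Mirror the -e switch: use exact_type only when requested.
        token_type = tok.exact_type if exact else tok.type
        token_range = "%d,%d-%d,%d:" % (tok.start + tok.end)
        rows.append(
            "%-20s%-15s%-15r" % (token_range, tokenize.tok_name[token_type], tok.string)
        )
    return rows

for row in format_tokens("say_hello()\n", exact=True):
    print(row)
```

Calling it with `exact=False` reports the parentheses as OP, matching the difference between the two listings above.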

Example of tokenizing a file programmatically, reading unicode strings instead of bytes with generate_tokens():

import tokenize

with tokenize.open('hello.py') as f:
    tokens = tokenize.generate_tokens(f.readline)
    for token in tokens:
        print(token)

Or reading bytes directly with tokenize():

import tokenize

with open('hello.py', 'rb') as f:
    tokens = tokenize.tokenize(f.readline)
    for token in tokens:
        print(token)
