2. 词法分析

Python 程序由一个 解析器 读取。输入到解析器的是一个由 词法分析器 所生成的 形符 流,本章将描述词法分析器是如何将一个文件拆分为一个个形符的。

Python uses the 7-bit ASCII character set for program text.

2.3 新版功能: An encoding declaration can be used to indicate that string literals and comments use an encoding different from ASCII.

For compatibility with older versions, Python only warns if it finds 8-bit characters; those warnings should be corrected by either declaring an explicit encoding, or using escape sequences if those bytes are binary data, instead of characters.

The run-time character set depends on the I/O devices connected to the program but is generally a superset of ASCII.

Future compatibility note: It may be tempting to assume that the character set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most western languages that use the Latin alphabet), but it is possible that in the future Unicode text editors will become common. These generally use the UTF-8 encoding, which is also an ASCII superset, but with very different use for the characters with ordinals 128-255. While there is no consensus on this subject yet, it is unwise to assume either Latin-1 or UTF-8, even though the current implementation appears to favor Latin-1. This applies both to the source character set and the run-time character set.

2.1. 行结构

一个 Python 程序可分为许多 逻辑行

2.1.1. 逻辑行

逻辑行的结束是以 NEWLINE 形符表示的。语句不能跨越逻辑行的边界,除非其语法允许包含 NEWLINE (例如复合语句可由多行子语句组成)。一个逻辑行可由一个或多个 物理行 按照明确或隐含的 行拼接 规则构成。

2.1.2. 物理行

物理行是以一个行终止序列结束的字符序列。在源文件和字符串中,可以使用任何标准平台上的行终止序列 - Unix 所用的 ASCII 字符 LF (换行), Windows 所用的 ASCII 字符序列 CR LF (回车加换行), 或者旧 Macintosh 所用的 ASCII 字符 CR (回车)。所有这些形式均可使用,无论具体平台。输入的结束也会被作为最后一个物理行的隐含终止标志。

当嵌入 Python 时,源码字符串传入 Python API 应使用标准 C 的传统换行符 (即 \n,表示 ASCII 字符 LF 作为行终止标志)。

2.1.3. 注释

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens.

2.1.4. 编码声明

如果一条注释位于 Python 脚本的第一或第二行,并且匹配正则表达式 coding[=:]\s*([-\w.]+),这条注释会被作为编码声明来处理;上述表达式的第一组指定了源码文件的编码。编码声明必须独占一行。如果它是在第二行,则第一行也必须是注释。推荐的编码声明形式如下

# -*- coding: <encoding-name> -*-

这也是 GNU Emacs 认可的形式,以及

# vim:fileencoding=<encoding-name>

which is recognized by Bram Moolenaar’s VIM. In addition, if the first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s notepad).

If an encoding is declared, the encoding name must be recognized by Python. The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts.

2.1.5. 显式的行拼接

两个或更多个物理行可使用反斜杠字符 (\) 拼接为一个逻辑行,规则如下: 当一个物理行以一个不在字符串或注释内的反斜杠结尾时,它将与下一行拼接构成一个单独的逻辑行,反斜杠及其后的换行符会被删除。例如:

if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1

以反斜杠结束的行不能带有注释。反斜杠不能用来拼接注释。反斜杠不能用来拼接形符,字符串除外 (即原文字符串以外的形符不能用反斜杠分隔到两个物理行)。不允许有原文字符串以外的反斜杠存在于物理行的其他位置。

2.1.6. 隐式的行拼接

圆括号、方括号或花括号以内的表达式允许分成多个物理行,无需使用反斜杠。例如:

month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year

隐式的行拼接可以带有注释。后续行的缩进不影响程序结构。后续行也允许为空白行。隐式拼接的行之间不会有 NEWLINE 形符。隐式拼接的行也可以出现于三引号字符串中 (见下);此情况下这些行不允许带有注释。

2.1.7. 空白行

A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated). During interactive input of statements, handling of a blank line may differ depending on the implementation of the read-eval-print loop. In the standard implementation, an entirely blank logical line (i.e. one containing not even whitespace or a comment) terminates a multi-line statement.

2.1.8. 缩进

一个逻辑行开头处的空白 (空格符和制表符) 被用来计算该行的缩进等级,以决定语句段落的组织结构。

First, tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight (this is intended to be the same rule as used by Unix). The total number of spaces preceding the first non-blank character then determines the line’s indentation. Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation.

跨平台兼容性注释: 由于非 UNIX 平台上文本编辑器本身的特性,在一个源文件中混合使用制表符和空格符是不明智的。另外也要注意不同平台还可能会显式地限制最大缩进层级。

行首有时可能会有一个进纸符;它在上述缩进层级计算中会被忽略。处于行首空格内其他位置的进纸符的效果未定义 (例如它可能导致空格计数重置为零)。

多个连续行各自的缩进层级将会被放入一个堆栈用来生成 INDENT 和 DEDENT 形符,具体说明如下。

在读取文件的第一行之前,先向堆栈推入一个零值;它将不再被弹出。被推入栈的层级数值从底至顶持续增加。每个逻辑行开头的行缩进层级将与栈顶行比较。如果相同,则不做处理。如果新行层级较高,则会被推入栈顶,并生成一个 INDENT 形符。如果新行层级较低,则 应当 是栈中的层级数值之一;栈中高于该层级的所有数值都将被弹出,每弹出一级数值生成一个 DEDENT 形符。在文件末尾,栈中剩余的每个大于零的数值生成一个 DEDENT 形符。

这是一个正确 (但令人迷惑) 的Python 代码缩进示例:

def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r

以下示例显示了各种缩进错误:

 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent

(实际上,前三个错误会被解析器发现;只有最后一个错误是由词法分析器发现的 — return r 的缩进无法匹配弹出栈的缩进层级。)

2.1.9. 形符之间的空白

除非是在逻辑行的开头或字符串内,空格符、制表符和进纸符等空白符都同样可以用来分隔形符。如果两个形符彼此相连会被解析为一个不同的形符,则需要使用空白来分隔 (例如 ab 是一个形符,而 a b 是两个形符)。

2.2. 其他形符

除了 NEWLINE, INDENT 和 DEDENT,还存在以下类别的形符: 标识符, 关键字, 字面值, 运算符 以及 分隔符。 空白字符 (之前讨论过的行终止符除外) 不属于形符,而是用来分隔形符。如果存在二义性,将从左至右读取尽可能长的合法字符串组成一个形符。

2.3. 标识符和关键字

Identifiers (also referred to as names) are described by the following lexical definitions:

identifier ::=  (letter|"_") (letter | digit | "_")*
letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"
digit      ::=  "0"..."9"

标识符的长度没有限制。对大小写敏感。

2.3.1. 关键字

以下标识符被作为语言的保留字或称 关键字,不可被用作普通标识符。关键字的拼写必须与这里列出的完全一致。

and       del       from      not       while
as        elif      global    or        with
assert    else      if        pass      yield
break     except    import    print
class     exec      in        raise
continue  finally   is        return
def       for       lambda    try

在 2.4 版更改: None became a constant and is now recognized by the compiler as a name for the built-in object None. Although it is not a keyword, you cannot assign a different object to it.

在 2.5 版更改: Using as and with as identifiers triggers a warning. To use them as keywords, enable the with_statement future feature .

在 2.6 版更改: as and with are full keywords.

2.3.2. 保留的标识符类

某些标识符类 (除了关键字) 具有特殊的含义。这些标识符类的命名模式是以下划线字符打头和结尾:

_*

Not imported by from module import *. The special identifier _ is used in the interactive interpreter to store the result of the last evaluation; it is stored in the __builtin__ module. When not in interactive mode, _ has no special meaning and is not defined. See section The import statement.

注解

_ 作为名称常用于连接国际化文本;请参看 gettext 模块文档了解有关此约定的详情。

__*__

System-defined names. These names are defined by the interpreter and its implementation (including the standard library). Current system names are discussed in the 特殊方法名称 section and elsewhere. More will likely be defined in future versions of Python. Any use of __*__ names, in any context, that does not follow explicitly documented use, is subject to breakage without warning.

__*

类的私有名称。这种名称在类定义中使用时,会以一种混合形式重写以避免在基类及派生类的 “私有” 属性之间出现名称冲突。参见 标识符(名称)

2.4. 字面值

字面值用于表示一些内置类型的常量。

2.4.1. String literals

字符串字面值由以下词法定义进行描述:

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR"
                     | "b" | "B" | "br" | "Br" | "bR" | "BR"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''"
                     | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | escapeseq
longstringitem  ::=  longstringchar | escapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
escapeseq       ::=  "\" <any ASCII character>

One syntactic restriction not indicated by these productions is that whitespace is not allowed between the stringprefix and the rest of the string literal. The source character set is defined by the encoding declaration; it is ASCII if no encoding declaration is given in the source file; see section 编码声明.

In plain English: String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences. A prefix of 'u' or 'U' makes the string a Unicode string. Unicode strings use the Unicode character set as defined by the Unicode Consortium and ISO 10646. Some additional escape sequences, described below, are available in Unicode strings. A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A “quote” is the character used to open the string, i.e. either ' or ".)

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

转义序列

含义

注释

\newline

Ignored

\\

反斜杠 (\)

\'

单引号 (')

\"

双引号 (")

\a

ASCII 响铃 (BEL)

\b

ASCII 退格 (BS)

\f

ASCII 进纸 (FF)

\n

ASCII 换行 (LF)

\N{name}

Character named name in the Unicode database (Unicode only)

\r

ASCII 回车 (CR)

\t

ASCII 水平制表 (TAB)

\uxxxx

Character with 16-bit hex value xxxx (Unicode only)

(1)

\Uxxxxxxxx

Character with 32-bit hex value xxxxxxxx (Unicode only)

(2)

\v

ASCII 垂直制表 (VT)

\ooo

八进制数 ooo 码位的字符

(3,5)

\xhh

十六进制数 hh 码位的字符

(4,5)

注释:

  1. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence.

  2. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default).

  3. 与标准 C 一致,接受最多三个八进制数码。

  4. 与标准 C 不同,要求必须为两个十六进制数码。

  5. In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences marked as “(Unicode only)” in the table above fall into the category of unrecognized escapes for non-Unicode string literals.

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase 'n'. String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string. For example, the string literal ur"\u0062\n" consists of three Unicode characters: ‘LATIN SMALL LETTER B’, ‘REVERSE SOLIDUS’, and ‘LATIN SMALL LETTER N’. Backslashes can be escaped with a preceding backslash; however, both remain in the string. As a result, \uXXXX escape sequences are only recognized when there are an odd number of backslashes.

2.4.2. 字符串字面值拼接

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).

2.4.3. 数字字面值

There are four types of numeric literals: plain integers, long integers, floating point numbers, and imaginary numbers. There are no complex literals (complex numbers can be formed by adding a real number and an imaginary number).

注意数字字面值并不包含正负号;-1 这样的负数实际上是由单目运算符 ‘-‘ 和字面值 1 合成的。

2.4.4. Integer and long integer literals

Integer and long integer literals are described by the following lexical definitions:

longinteger    ::=  integer ("l" | "L")
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"
octinteger     ::=  "0" ("o" | "O") octdigit+ | "0" octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
nonzerodigit   ::=  "1"..."9"
octdigit       ::=  "0"..."7"
bindigit       ::=  "0" | "1"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"

Although both lower case 'l' and upper case 'L' are allowed as suffix for long integers, it is strongly recommended to always use 'L', since the letter 'l' looks too much like the digit '1'.

Plain integer literals that are above the largest representable plain integer (e.g., 2147483647 when using 32-bit arithmetic) are accepted as if they were long integers instead. 1 There is no limit for long integer literals apart from what can be stored in available memory.

Some examples of plain integer literals (first row) and long integer literals (second and third rows):

7     2147483647                        0177
3L    79228162514264337593543950336L    0377L   0x100000000L
      79228162514264337593543950336             0xdeadbeef

2.4.5. 浮点数字面值

浮点数字面值由以下词法定义进行描述:

floatnumber   ::=  pointfloat | exponentfloat
pointfloat    ::=  [intpart] fraction | intpart "."
exponentfloat ::=  (intpart | pointfloat) exponent
intpart       ::=  digit+
fraction      ::=  "." digit+
exponent      ::=  ("e" | "E") ["+" | "-"] digit+

Note that the integer and exponent parts of floating point numbers can look like octal integers, but are interpreted using radix 10. For example, 077e010 is legal, and denotes the same number as 77e10. The allowed range of floating point literals is implementation-dependent. Some examples of floating point literals:

3.14    10.    .001    1e100    3.14e-10    0e0

Note that numeric literals do not include a sign; a phrase like -1 is actually an expression composed of the unary operator - and the literal 1.

2.4.6. 虚数字面值

虚数字面值由以下词法定义进行描述:

imagnumber ::=  (floatnumber | intpart) ("j" | "J")

一个虚数字面值将生成一个实部为 0.0 的复数。复数是以一对浮点数来表示的,它们的取值范围相同。要创建一个实部不为零的复数,就加上一个浮点数,例如 (3+4j)。一些虚数字面值的示例如下:

3.14j   10.j    10j     .001j   1e100j  3.14e-10j

2.5. 运算符

以下形符属于运算符:

+       -       *       **      /       //      %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=      <>

The comparison operators <> and != are alternate spellings of the same operator. != is the preferred spelling; <> is obsolescent.

2.6. 分隔符

以下形符在语法中归类为分隔符:

(       )       [       ]       {       }      @
,       :       .       `       =       ;
+=      -=      *=      /=      //=     %=
&=      |=      ^=      >>=     <<=     **=

The period can also occur in floating-point and imaginary literals. A sequence of three periods has a special meaning as an ellipsis in slices. The second half of the list, the augmented assignment operators, serve lexically as delimiters, but also perform an operation.

以下可打印 ASCII 字符作为其他形符的组成部分时具有特殊含义,或是对词法分析器有重要意义:

'       "       #       \

以下可打印 ASCII 字符不在 Python 词法中使用。如果出现于字符串字面值和注释之外将无条件地引发错误:

$       ?

备注

1

In versions of Python prior to 2.4, octal and hexadecimal literals in the range just above the largest representable plain integer but below the largest unsigned 32-bit number (on a machine using 32-bit arithmetic), 4294967296, were taken as the negative plain integer obtained by subtracting 4294967296 from their unsigned value.