7.2. re — 正则表达式操作

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.

正则表达式使用反斜杠('\')来表示特殊形式,或者把特殊字符转义成普通字符。 而反斜杠在普通的 Python 字符串里也有相同的作用,所以就产生了冲突。比如说,要匹配一个字面上的反斜杠,正则表达式模式不得不写成 '\\\\',因为正则表达式里匹配一个反斜杠必须是 \\ ,而每个反斜杠在普通的 Python 字符串里都要写成 \\

解决办法是对于正则表达式样式使用 Python 的原始字符串表示法;在带有 'r' 前缀的字符串字面值中,反斜杠不必做任何特殊处理。 因此 r"\n" 表示包含 '\''n' 两个字符的字符串,而 "\n" 则表示只包含一个换行符的字符串。 样式在 Python 代码中通常都会使用这种原始字符串表示法来表示。

It is important to note that most regular expression operations are available as module-level functions and RegexObject methods. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.

参见

第三方模块 regex , 提供了与标准库 re 模块兼容的API接口, 同时还提供了额外的功能和更全面的Unicode支持。

7.2.1. 正则表达式语法

一个正则表达式(或RE)指定了一集与之匹配的字符串;模块内的函数可以让你检查某个字符串是否跟给定的正则表达式匹配(或者一个正则表达式是否匹配到一个字符串,这两种说法含义相同)。

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced above, or almost any textbook about compiler construction.

以下是正则表达式格式的简要说明。更详细的信息和演示,参考 正则表达式HOWTO

正则表达式可以包含普通或者特殊字符。绝大部分普通字符,比如 'A', 'a', 或者 '0',都是最简单的正则表达式。它们就匹配自身。你可以拼接普通字符,所以 last 匹配字符串 'last'. (在这一节的其他部分,我们将用 this special style 这种方式表示正则表达式,通常不带引号,要匹配的字符串用 'in single quotes' ,单引号形式。)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using the \number notation, e.g., '\x00'.

重复修饰符 (*, +, ?, {m,n}, 等) 不能直接嵌套。这样避免了非贪婪后缀 ? 修饰符,和其他实现中的修饰符产生的多义性。要应用一个内层重复嵌套,可以使用括号。 比如,表达式 (?:a{6})* 匹配6个 'a' 字符重复任意次数。

特殊字符是:

'.'

(点) 在默认模式,匹配除了换行的任意字符。如果指定了标签 DOTALL ,它将匹配包括换行符的任意字符。

'^'

(插入符号) 匹配字符串的开头, 并且在 MULTILINE 模式也匹配换行后的首个符号。

'$'

匹配字符串尾或者换行符的前一个字符,在 MULTILINE 模式匹配换行符的前一个字符。 foo 匹配 'foo''foobar' , 但正则 foo$ 只匹配 'foo'。更有趣的是, 在 'foo1\nfoo2\n' 搜索 foo.$ ,通常匹配 'foo2' ,但在 MULTILINE 模式 ,可以匹配到 'foo1' ;在 'foo\n' 搜索 $ 会找到两个空串:一个在换行前,一个在字符串最后。

'*'

对它前面的正则式匹配0到任意次重复, 尽量多的匹配字符串。 ab* 会匹配 'a''ab', 或者 'a'``后面跟随任意个 ``'b'

'+'

对它前面的正则式匹配1到任意次重复。 ab+ 会匹配 'a' 后面跟随1个以上到任意个 'b',它不会匹配 'a'

'?'

对它前面的正则式匹配0到1次重复。 ab? 会匹配 'a' 或者 'ab'

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.

“{m}”

对其之前的正则式指定匹配 m 个重复;少于 m 的话就会导致匹配失败。比如, a{6} 将匹配6个 'a' , 但是不能是5个。

“{m, n}”

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.

{m,n}?

前一个修饰符的非贪婪模式,只匹配尽量少的字符次数。比如,对于 'aaaaaa'a{3,5} 匹配 5个 'a' ,而 a{3,5}? 只匹配3个 'a'

'\'

转义特殊字符(允许你匹配 '*', '?', 或者此类其他),或者表示一个特殊序列;特殊序列之后进行讨论。

如果你没有使用原始字符串( r'raw' )来表达样式,要牢记Python也使用反斜杠作为转义序列;如果转义序列不被Python的分析器识别,反斜杠和字符才能出现在字符串中。如果Python可以识别这个序列,那么反斜杠就应该重复两次。这将导致理解障碍,所以高度推荐,就算是最简单的表达式,也要使用原始字符串。

[]

用于表示一个字符集合。在一个集合中:

  • 字符可以单独列出,比如 [amk] 匹配 'a''m', 或者 'k'

  • Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'.

  • 特殊字符在集合中,失去它的特殊含义。比如 [(+*)] 只会匹配这几个文法字符 '(', '+', '*', or ')'

  • Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether LOCALE or UNICODE mode is in force.

  • 不在集合范围内的字符可以通过 取反 来进行匹配。如果集合首字符是 '^' ,所有 在集合内的字符将会被匹配,比如 [^5] 将匹配所有字符,除了 '5'[^^] 将匹配所有字符,除了 '^'. ^ 如果不在集合首位,就没有特殊含义。

  • 在集合内要匹配一个字符 ']',有两种方法,要么就在它之前加上反斜杠,要么就把它放到集合首位。比如, [()[\]{}][]()[{}] 都可以匹配括号。

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

(?…)

这是个扩展标记法 (一个 '?' 跟随 '(' 并无含义)。 '?' 后面的第一个字符决定了这个构建采用什么样的语法。这种扩展通常并不创建新的组合; (?P<name>...) 是唯一的例外。 以下是目前支持的扩展。

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in 模块内容.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

(?:…)

正则括号的非捕获版本。 匹配在括号内的任何正则表达式,但该分组所匹配的子字符串 不能 在执行匹配后被获取或是之后在模式中被引用。

(?P<name>…)

(命名组合)类似正则组合,但是匹配到的子串组在外部是通过定义的 name 来获取的。组合名必须是有效的Python标识符,并且每个组合名只能用一个正则表达式定义,只能定义一次。一个符号组合同样是一个数字组合,就像这个组合没有被命名一样。

命名组合可以在三种上下文中引用。如果样式是 (?P<quote>['"]).*?(?P=quote) (也就是说,匹配单引号或者双引号括起来的字符串):

引用组合 “quote” 的上下文

引用方法

在正则式自身内

  • (?P=quote) (如示)

  • \1

when processing match object m

  • m.group('quote')

  • m.end('quote') (等)

in a string passed to the repl argument of re.sub()

  • \g<quote>

  • \g<1>

  • \1

(?P=name)

反向引用一个命名组合;它匹配前面那个叫 name 的命名组中匹配到的串同样的字串。

(?#…)

注释;里面的内容会被忽略。

(?=…)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?!…)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<=…)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Group references are not supported even if they match strings of some fixed length. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

这个例子搜索一个跟随在连字符后的单词:

>>> m = re.search('(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
(?<!…)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com'.

2.4 新版功能.

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number

匹配数字代表的组合。每个括号是一个组合,组合从1开始编号。比如 (.+) \1 匹配 'the the' 或者 '55 55', 但不会匹配 'thethe' (注意组合后面的空格)。这个特殊序列只能用于匹配前面99个组合。如果 number 的第一个数位是0, 或者 number 是三个八进制数,它将不会被看作是一个组合,而是八进制的数字值。在 '['']' 字符集合内,任何数字转义都被看作是字符。

\A

只匹配字符串开始。

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.

\d

When the UNICODE flag is not specified, matches any decimal digit; this is equivalent to the set [0-9]. With UNICODE, it will match whatever is classified as a decimal digit in the Unicode character properties database.

\D

When the UNICODE flag is not specified, matches any non-digit character; this is equivalent to the set [^0-9]. With UNICODE, it will match anything other than character marked as digits in the Unicode character properties database.

\s

When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.

\S

When the UNICODE flag is not specified, matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v] The LOCALE flag has no extra effect on non-whitespace match. If UNICODE is set, then any character not marked as space in the Unicode character properties database is matched.

\w

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

\Z

只匹配字符串尾。

If both LOCALE and UNICODE flags are included for a particular sequence, then LOCALE flag takes effect first followed by the UNICODE.

绝大部分Python的标准转义字符也被正则表达式分析器支持。:

\a      \b      \f      \n
\r      \t      \v      \x
\\

(注意 \b 被用于表示词语的边界,它只在字符集合内表示退格,比如 [\b] 。)

Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

参见

Mastering Regular Expressions

Book on regular expressions by Jeffrey Friedl, published by O’Reilly. The second edition of the book no longer covers Python at all, but the first edition covered writing good regular expression patterns in great detail.

7.2.2. 模块内容

模块定义了几个函数,常量,和一个例外。有些函数是编译后的正则表达式方法的简化版本(少了一些特性)。绝大部分重要的应用,总是会先将正则表达式编译,之后在进行操作。

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

这个表达式的行为可以通过指定 标记 的值来改变。值可以是以下任意变量,可以通过位的OR操作来结合( | 操作符)。

序列

prog = re.compile(pattern)
result = prog.match(string)

等价于

result = re.match(pattern, string)

如果需要多次使用这个正则表达式的话,使用 re.compile() 和保存这个正则对象以便复用,可以让程序更加高效。

注解

The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

re.DEBUG

Display debug information about compiled expression.

re.I
re.IGNORECASE

Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale. To get this effect on non-ASCII Unicode characters such as ü and Ü, add the UNICODE flag.

re.L
re.LOCALE

Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.M
re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.S
re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

re.U
re.UNICODE

Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.

2.0 新版功能.

re.X
re.VERBOSE

这个标记允许你编写更具可读性更友好的正则表达式。通过分段和添加注释。空白符号会被忽略,除非在一个字符集合当中或者由反斜杠转义,或者在 *?, (?: or (?P<…> 分组之内。当一个行内有 # 不在字符集和转义序列,那么它之后的所有字符都是注释。

意思就是下面两个正则表达式等价地匹配一个十进制数字:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

注意即便是 MULTILINE 多行模式, re.match() 也只匹配字符串的开始位置,而不匹配每行开始。

如果你想定位 string 的任何位置,使用 search() 来替代(也可参考 search() vs. match()

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list (e.g., if there’s one capturing group in the separator, the 0th, the 2nd and so forth).

Note that split will never split a string on an empty pattern match. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

在 2.7 版更改: 增加了可选标记参数。

re.findall(pattern, string, flags=0)

string 返回一个不重复的 pattern 的匹配列表, string 从左到右进行扫描,匹配按找到的顺序返回。如果样式里存在一到多个组,就返回一个组合列表;就是一个元组的列表(如果样式里有超过一个组合的话)。空匹配也会包含在结果里。

注解

Due to the limitation of the current implementation the character following an empty match is not included in a next match, so findall(r'^|\w+', 'two words') returns ['', 'wo', 'words'] (note missed “t”). This is changed in Python 3.7.

1.5.2 新版功能.

在 2.4 版更改: 增加了可选标记参数。

re.finditer(pattern, string, flags=0)

Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result. See also the note about findall().

2.2 新版功能.

在 2.4 版更改: 增加了可选标记参数。

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

在字符串类型的 repl 参数里,如上所述的转义和向后引用中,\g<name> 会使用命名组合 name,(在 (?P<name>…) 语法中定义) \g<number> 会使用数字组;\g<2> 就是 \2,但它避免了二义性,如 \g<2>0\20 就会被解释为组20,而不是组2后面跟随一个字符 '0'。向后引用 \g<0>pattern 作为一整个组进行引用。

在 2.7 版更改: 增加了可选标记参数。

re.subn(pattern, repl, string, count=0, flags=0)

行为与 sub() 相同,但是返回一个元组 (字符串, 替换次数).

在 2.7 版更改: 增加了可选标记参数。

re.escape(pattern)

Escape all the characters in pattern except ASCII letters and numbers. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. For example:

>>> print re.escape('python.exe')
python\.exe

>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
>>> print '[%s]+' % re.escape(legal_chars)
[abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^\_\`\|\~\:]+

>>> operators = ['+', '-', '*', '/', '**']
>>> print '|'.join(map(re.escape, sorted(operators, reverse=True)))
\/|\-|\+|\*\*|\*
re.purge()

清除正则表达式缓存。

exception re.error

Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.

7.2.3. 正则表达式对象 (正则对象)

class re.RegexObject

The RegexObject class supports the following methods and attributes:

search(string[, pos[, endpos]])

Scan through string looking for a location where this regular expression produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

可选的第二个参数 pos 给出了字符串中开始搜索的位置索引;默认为 0,它不完全等价于字符串切片; '^' 样式字符匹配字符串真正的开头,和换行符后面的第一个字符,但不会匹配索引规定开始的位置。

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found, otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<_sre.SRE_Match object at ...>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
match(string[, pos[, endpos]])

If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object at ...>

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

split(string, maxsplit=0)

等价于 split() 函数,使用了编译后的样式。

findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

sub(repl, string, count=0)

等价于 sub() 函数,使用了编译后的样式。

subn(repl, string, count=0)

等价于 subn() 函数,使用了编译后的样式。

flags

The regex matching flags. This is a combination of the flags given to compile() and any (?...) inline flags in the pattern.

groups

捕获组合的数量。

groupindex

映射由 (?P<id>) 定义的命名符号组合和数字组合的字典。如果没有符号组,那字典就是空的。

pattern

The pattern string from which the RE object was compiled.

7.2.4. 匹配对象

class re.MatchObject

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

匹配对象支持以下方法和属性:

expand(template)

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

如果正则表达式使用了 (?P<name>…) 语法, groupN 参数就也可能是命名组合的名字。如果一个字符串参数在样式中未定义为组合名,一个 IndexErrorraise

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'
groups([default])

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None. (Incompatibility note: in the original Python 1.5 release, if the tuple was one element long, a string would be returned instead. In later versions (from 1.5.1 on), a singleton tuple is returned in such cases.)

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')
groupdict([default])

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
start([group])
end([group])

返回 group 匹配到的字串的开始和结束标号。group 默认为0(意思是整个匹配的子串)。如果 group 存在,但未产生匹配,就返回 -1 。对于一个匹配对象 m, 和一个未参与匹配的组 g ,组 g (等价于 m.group(g))产生的匹配是

m.string[m.start(g):m.end(g)]

注意 m.start(group) 将会等于 m.end(group) ,如果 group 匹配一个空字符串的话。比如,在 m = re.search('b(c?)', 'cba') 之后,m.start(0) 为 1, m.end(0) 为 2, m.start(1)m.end(1) 都是 2, m.start(2) raise 一个 IndexError 例外。

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
span([group])

For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

pos

The value of pos which was passed to the search() or match() method of the RegexObject. This is the index into the string at which the RE engine started looking for a match.

endpos

The value of endpos which was passed to the search() or match() method of the RegexObject. This is the index into the string beyond which the RE engine will not go.

lastindex

捕获组的最后一个匹配的整数索引值,或者 None 如果没有匹配产生的话。比如,对于字符串 'ab',表达式 (a)b, ((a)(b)), 和 ((ab)) 将得到 lastindex == 1 , 而 (a)(b) 会得到 lastindex == 2

lastgroup

最后一个匹配的命名组名字,或者 None 如果没有产生匹配的话。

re

The regular expression object whose match() or search() method produced this MatchObject instance.

string

The string passed to match() or search().

7.2.5. Examples

7.2.5.1. Checking For a Pair

在这个例子里,我们使用以下辅助函数来更好的显示匹配对象:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

假设你在写一个扑克程序,一个玩家的一手牌为五个字符的串,每个字符表示一张牌,”a” 就是 A, “k” K, “q” Q, “j” J, “t” 为 10, “2” 到 “9” 表示2 到 9。

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of MatchObject in the following manner:

>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

7.2.5.2. 模拟 scanf()

Python 目前没有一个类似c函数 scanf() 的替代品。正则表达式通常比 scanf() 格式字符串要更强大一些,但也带来更多复杂性。下面的表格提供了 scanf() 格式符和正则表达式大致相同的映射。

scanf() 格式符

正则表达式

%c

.

%5c

.{5}

%d

[-+]?\d+

%e, %E, %f, %g

[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?

%i

[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)

%o

[-+]?[0-7]+

%s

\S+

%u

\d+

%x, %X

[-+]?(0[xX])?[\dA-Fa-f]+

从文件名和数字提取字符串

/usr/sbin/sendmail - 0 errors, 4 warnings

你可以使用 scanf() 格式化

%s - %d errors, %d warnings

等价的正则表达式是:

(\S+) - (\d+) errors, (\d+) warnings

7.2.5.3. search() vs. match()

Python 提供了两种不同的操作:基于 re.match() 检查字符串开头,或者 re.search() 检查字符串的任意位置(默认Perl中的行为)。

例如

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object at ...>

search() 中,可以用 '^' 作为开始来限制匹配到字符串的首位

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object at ...>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object at ...>

7.2.5.4. 建立一个电话本

split() 将字符串用参数传递的样式分隔开。这个方法对于转换文本数据到易读而且容易修改的数据结构,是很有用的,如下面的例子证明。

First, here is the input. Normally it may come from a file, here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

条目用一个或者多个换行符分开。现在我们将字符串转换为一个列表,每个非空行都有一个条目:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

最终,将每个条目分割为一个由名字、姓氏、电话号码和地址组成的列表。我们为 split() 使用了 maxsplit 形参,因为地址中包含有被我们作为分割模式的空格符:

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

:? 样式匹配姓后面的冒号,因此它不出现在结果列表中。如果 maxsplit 设置为 4 ,我们还可以从地址中获取到房间号:

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

7.2.5.5. 文字整理

sub() 替换字符串中出现的样式的每一个实例。这个例子证明了使用 sub() 来整理文字,或者随机化每个字符的位置,除了首位和末尾字符

>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

7.2.5.6. 找到所有副词

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

7.2.5.7. 找到所有副词和位置

If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides instances of MatchObject instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
07-16: carefully
40-47: quickly

7.2.5.8. 原始字符记法

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object at ...>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>