11. 标准库简介 —— 第二部分¶
第二部分涵盖了专业编程所需要的更高级的模块。这些模块很少用在小脚本中。
11.1. 格式化输出¶
reprlib
模块提供了一个定制化版本的 repr()
函数,用于缩略显示大型或深层嵌套的容器对象:
>>> import reprlib
>>> reprlib.repr(set('supercalifragilisticexpialidocious'))
"{'a', 'c', 'd', 'e', 'f', 'g', ...}"
pprint
模块提供了更加复杂的打印控制,其输出的内置对象和用户自定义对象能够被解释器直接读取。当输出结果过长而需要折行时,“美化输出机制”会添加换行符和缩进,以更清楚地展示数据结构:
>>> import pprint
>>> t = [[[['black', 'cyan'], 'white', ['green', 'red']], [['magenta',
... 'yellow'], 'blue']]]
...
>>> pprint.pprint(t, width=30)
[[[['black', 'cyan'],
'white',
['green', 'red']],
[['magenta', 'yellow'],
'blue']]]
textwrap
模块能够格式化文本段落,以适应给定的屏幕宽度:
>>> import textwrap
>>> doc = """The wrap() method is just like fill() except that it returns
... a list of strings instead of one big string with newlines to separate
... the wrapped lines."""
...
>>> print(textwrap.fill(doc, width=40))
The wrap() method is just like fill()
except that it returns a list of strings
instead of one big string with newlines
to separate the wrapped lines.
locale
模块处理与特定地域文化相关的数据格式。locale 模块的 format 函数包含一个 grouping 属性,可直接将数字格式化为带有组分隔符的样式:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'English_United States.1252')
'English_United States.1252'
>>> conv = locale.localeconv() # get a mapping of conventions
>>> x = 1234567.8
>>> locale.format("%d", x, grouping=True)
'1,234,567'
>>> locale.format_string("%s%.*f", (conv['currency_symbol'],
... conv['frac_digits'], x), grouping=True)
'$1,234,567.80'
11.2. 模板¶
string
模块包含一个通用的 Template
类,具有适用于最终用户的简化语法。它允许用户在不更改应用逻辑的情况下定制自己的应用。
上述格式化操作是通过占位符实现的,占位符由 $
加上合法的 Python 标识符(只能包含字母、数字和下划线)构成。一旦使用花括号将占位符括起来,就可以在后面直接跟上更多的字母和数字而无需空格分割。$$
将被转义成单个字符 $
:
>>> from string import Template
>>> t = Template('${village}folk send $$10 to $cause.')
>>> t.substitute(village='Nottingham', cause='the ditch fund')
'Nottinghamfolk send $10 to the ditch fund.'
如果在字典或关键字参数中未提供某个占位符的值,那么 substitute()
方法将抛出 KeyError
。对于邮件合并类型的应用,用户提供的数据有可能是不完整的,此时使用 safe_substitute()
方法更加合适 —— 如果数据缺失,它会直接将占位符原样保留。
>>> t = Template('Return the $item to $owner.')
>>> d = dict(item='unladen swallow')
>>> t.substitute(d)
Traceback (most recent call last):
...
KeyError: 'owner'
>>> t.safe_substitute(d)
'Return the unladen swallow to $owner.'
Template 的子类可以自定义定界符。例如,以下是某个照片浏览器的批量重命名功能,采用了百分号作为日期、照片序号和照片格式的占位符:
>>> import time, os.path
>>> photofiles = ['img_1074.jpg', 'img_1076.jpg', 'img_1077.jpg']
>>> class BatchRename(Template):
... delimiter = '%'
>>> fmt = input('Enter rename style (%d-date %n-seqnum %f-format): ')
Enter rename style (%d-date %n-seqnum %f-format): Ashley_%n%f
>>> t = BatchRename(fmt)
>>> date = time.strftime('%d%b%y')
>>> for i, filename in enumerate(photofiles):
... base, ext = os.path.splitext(filename)
... newname = t.substitute(d=date, n=i, f=ext)
... print('{0} --> {1}'.format(filename, newname))
img_1074.jpg --> Ashley_0.jpg
img_1076.jpg --> Ashley_1.jpg
img_1077.jpg --> Ashley_2.jpg
模板的另一个应用是将程序逻辑与多样的格式化输出细节分离开来。这使得对 XML 文件、纯文本报表和 HTML 网络报表使用自定义模板成为可能。
11.3. 使用二进制数据记录格式¶
struct
模块提供了 pack()
和 unpack()
函数,用于处理不定长度的二进制记录格式。下面的例子展示了在不使用 zipfile
模块的情况下,如何循环遍历一个 ZIP 文件的所有头信息。Pack 代码 "H"
和 "I"
分别代表两字节和四字节无符号整数。"<"
代表它们是标准尺寸的小尾型字节序:
import struct
with open('myfile.zip', 'rb') as f:
data = f.read()
start = 0
for i in range(3): # show the first 3 file headers
start += 14
fields = struct.unpack('<IIIHH', data[start:start+16])
crc32, comp_size, uncomp_size, filenamesize, extra_size = fields
start += 16
filename = data[start:start+filenamesize]
start += filenamesize
extra = data[start:start+extra_size]
print(filename, hex(crc32), comp_size, uncomp_size)
start += extra_size + comp_size # skip to the next header
11.4. 多线程¶
线程是一种对于非顺序依赖的多个任务进行解耦的技术。多线程可以提高应用的响应效率,当接收用户输入的同时,保持其他任务在后台运行。一个有关的应用场景是,将 I/O 和计算运行在两个并行的线程中。
以下代码展示了高阶的 threading
模块如何在后台运行任务,且不影响主程序的继续运行:
import threading, zipfile
class AsyncZip(threading.Thread):
def __init__(self, infile, outfile):
threading.Thread.__init__(self)
self.infile = infile
self.outfile = outfile
def run(self):
f = zipfile.ZipFile(self.outfile, 'w', zipfile.ZIP_DEFLATED)
f.write(self.infile)
f.close()
print('Finished background zip of:', self.infile)
background = AsyncZip('mydata.txt', 'myarchive.zip')
background.start()
print('The main program continues to run in foreground.')
background.join() # Wait for the background task to finish
print('Main program waited until background was done.')
多线程应用面临的主要挑战是,相互协调的多个线程之间需要共享数据或其他资源。为此,threading 模块提供了多个同步操作原语,包括线程锁、事件、条件变量和信号量。
尽管这些工具非常强大,但微小的设计错误却可以导致一些难以复现的问题。因此,实现多任务协作的首选方法是将对资源的所有请求集中到一个线程中,然后使用 queue
模块向该线程供应来自其他线程的请求。应用程序使用 Queue
对象进行线程间通信和协调,更易于设计,更易读,更可靠。
11.5. 日志¶
logging
模块提供功能齐全且灵活的日志记录系统。在最简单的情况下,日志消息被发送到文件或 sys.stderr
import logging
logging.debug('Debugging information')
logging.info('Informational message')
logging.warning('Warning:config file %s not found', 'server.conf')
logging.error('Error occurred')
logging.critical('Critical error -- shutting down')
这会产生以下输出:
WARNING:root:Warning:config file server.conf not found
ERROR:root:Error occurred
CRITICAL:root:Critical error -- shutting down
默认情况下,informational 和 debugging 消息被压制,输出会发送到标准错误流。其他输出选项包括将消息转发到电子邮件,数据报,套接字或 HTTP 服务器。新的过滤器可以根据消息优先级选择不同的路由方式:DEBUG
,INFO
,WARNING
,ERROR
,和 CRITICAL
。
日志系统可以直接从 Python 配置,也可以从用户配置文件加载,以便自定义日志记录而无需更改应用程序。
11.6. 弱引用¶
Python does automatic memory management (reference counting for most objects and garbage collection to eliminate cycles). The memory is freed shortly after the last reference to it has been eliminated.
This approach works fine for most applications but occasionally there is a need
to track objects only as long as they are being used by something else.
Unfortunately, just tracking them creates a reference that makes them permanent.
The weakref
module provides tools for tracking objects without creating a
reference. When the object is no longer needed, it is automatically removed
from a weakref table and a callback is triggered for weakref objects. Typical
applications include caching objects that are expensive to create:
>>> import weakref, gc
>>> class A:
... def __init__(self, value):
... self.value = value
... def __repr__(self):
... return str(self.value)
...
>>> a = A(10) # create a reference
>>> d = weakref.WeakValueDictionary()
>>> d['primary'] = a # does not create a reference
>>> d['primary'] # fetch the object if it is still alive
10
>>> del a # remove the one reference
>>> gc.collect() # run garbage collection right away
0
>>> d['primary'] # entry was automatically removed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
d['primary'] # entry was automatically removed
File "C:/python36/lib/weakref.py", line 46, in __getitem__
o = self.data[key]()
KeyError: 'primary'
11.7. 用于操作列表的工具¶
Many data structure needs can be met with the built-in list type. However, sometimes there is a need for alternative implementations with different performance trade-offs.
The array
module provides an array()
object that is like
a list that stores only homogeneous data and stores it more compactly. The
following example shows an array of numbers stored as two byte unsigned binary
numbers (typecode "H"
) rather than the usual 16 bytes per entry for regular
lists of Python int objects:
>>> from array import array
>>> a = array('H', [4000, 10, 700, 22222])
>>> sum(a)
26932
>>> a[1:3]
array('H', [10, 700])
The collections
module provides a deque()
object
that is like a list with faster appends and pops from the left side but slower
lookups in the middle. These objects are well suited for implementing queues
and breadth first tree searches:
>>> from collections import deque
>>> d = deque(["task1", "task2", "task3"])
>>> d.append("task4")
>>> print("Handling", d.popleft())
Handling task1
unsearched = deque([starting_node])
def breadth_first_search(unsearched):
node = unsearched.popleft()
for m in gen_moves(node):
if is_goal(m):
return m
unsearched.append(m)
In addition to alternative list implementations, the library also offers other
tools such as the bisect
module with functions for manipulating sorted
lists:
>>> import bisect
>>> scores = [(100, 'perl'), (200, 'tcl'), (400, 'lua'), (500, 'python')]
>>> bisect.insort(scores, (300, 'ruby'))
>>> scores
[(100, 'perl'), (200, 'tcl'), (300, 'ruby'), (400, 'lua'), (500, 'python')]
The heapq
module provides functions for implementing heaps based on
regular lists. The lowest valued entry is always kept at position zero. This
is useful for applications which repeatedly access the smallest element but do
not want to run a full list sort:
>>> from heapq import heapify, heappop, heappush
>>> data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
>>> heapify(data) # rearrange the list into heap order
>>> heappush(data, -5) # add a new entry
>>> [heappop(data) for i in range(3)] # fetch the three smallest entries
[-5, 0, 1]
11.8. 十进制浮点运算¶
The decimal
module offers a Decimal
datatype for
decimal floating point arithmetic. Compared to the built-in float
implementation of binary floating point, the class is especially helpful for
- 财务应用和其他需要精确十进制表示的用途,
- 控制精度,
- 控制四舍五入以满足法律或监管要求,
- 跟踪有效小数位,或
- 用户期望结果与手工完成的计算相匹配的应用程序。
例如,使用十进制浮点和二进制浮点数计算70美分手机和5%税的总费用,会产生的不同结果。如果结果四舍五入到最接近的分数差异会更大:
>>> from decimal import *
>>> round(Decimal('0.70') * Decimal('1.05'), 2)
Decimal('0.74')
>>> round(.70 * 1.05, 2)
0.73
The Decimal
result keeps a trailing zero, automatically
inferring four place significance from multiplicands with two place
significance. Decimal reproduces mathematics as done by hand and avoids
issues that can arise when binary floating point cannot exactly represent
decimal quantities.
Exact representation enables the Decimal
class to perform
modulo calculations and equality tests that are unsuitable for binary floating
point:
>>> Decimal('1.00') % Decimal('.10')
Decimal('0.00')
>>> 1.00 % 0.10
0.09999999999999995
>>> sum([Decimal('0.1')]*10) == Decimal('1.0')
True
>>> sum([0.1]*10) == 1.0
False
decimal
模块的算法提供了尽可能的精度:
>>> getcontext().prec = 36
>>> Decimal(1) / Decimal(7)
Decimal('0.142857142857142857142857142857142857')