Library and Extension FAQ

Contents

General Library Questions

How do I find a module or application to perform task X?

Check the Library Reference to see if there's a relevant standard library module. (Once you know what's in the standard library, you can skip this step.)

For third-party packages, search the Python Package Index or try Google or another web search engine. Searching for "Python" plus a keyword or two describing what you need will usually find something helpful.

Where is the math.py (socket.py, regex.py, etc.) source file?

If you can't find a source file for a module, it may be a built-in or dynamically loaded module implemented in C, C++ or another compiled language. In this case you may not have the source file, or it may be something like mathmodule.c, stored in a C source directory (not on the Python path).

There are (at least) three kinds of modules in Python:

  1. modules written in Python (.py);

  2. modules written in C and dynamically loaded (.dll, .pyd, .so, .sl, etc.);

  3. modules written in C and linked with the interpreter; to get a list of these, type:

    import sys
    print sys.builtin_module_names
    

How do I make a Python script executable on Unix?

You need to do two things: the script file's mode must be executable, and the first line must begin with #! followed by the path of the Python interpreter.

The first is done by executing chmod +x scriptfile or perhaps chmod 755 scriptfile.

The second can be done in a number of ways. The most straightforward way is to write

#!/usr/local/bin/python

as the very first line of your file, using the pathname of the Python interpreter on your platform.

If you would like the script to be independent of where the Python interpreter lives, you can use the env program. Almost all Unix variants support the following, assuming the Python interpreter is in a directory on the user's PATH:

#!/usr/bin/env python

Don't do this for CGI scripts. The PATH variable for CGI scripts is often very minimal, so you need to use the actual absolute pathname of the interpreter.

Occasionally, a user’s environment is so full that the /usr/bin/env program fails; or there’s no env program at all. In that case, you can try the following hack (due to Alex Rezinsky):

#! /bin/sh
""":"
exec python $0 ${1+"$@"}
"""

This has a minor disadvantage: it defines the script's __doc__ string. However, that can be fixed by adding

__doc__ = """...Whatever..."""

Is there a curses/termcap package for Python?

For Unix variants the standard Python source distribution comes with a curses module in the Modules subdirectory, though it’s not compiled by default. (Note that this is not available in the Windows distribution – there is no curses module for Windows.)

The curses module supports basic curses features as well as many additional functions from ncurses and SYSV curses, such as colour, alternative character set support, pads, and mouse support. This means the module isn't compatible with operating systems that only have BSD curses, but there don't seem to be any currently maintained OSes that fall into this category.

For Windows: use the consolelib module.

Is there something like C's onexit() in Python?

The atexit module provides a register function that is similar to C's onexit().
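As a quick sketch, registering a cleanup function might look like this (the `goodbye` function and its message are made-up examples):

```python
import atexit

def goodbye(name):
    # Called automatically when the interpreter exits normally.
    print("Goodbye, %s" % name)

# Register the function along with the arguments to pass it at exit time.
atexit.register(goodbye, "world")
# Nothing happens yet; "Goodbye, world" is printed when the program exits.
```

Note that `atexit.register()` returns the function it was given, so it can also be used as a decorator.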

Why don't my signal handlers work?

The most common problem is that the signal handler is declared with the wrong argument list. It is called as

handler(signum, frame)

so it should be declared with two parameters:

def handler(signum, frame):
    ...
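A complete, minimal sketch: the handler below just records the signal it receives (the `received` list is an illustrative device, SIGUSR1 is an arbitrary choice, and this is Unix-only):

```python
import os
import signal

received = []

def handler(signum, frame):
    # signum is the signal number; frame is the interrupted stack frame.
    received.append(signum)

# Install the handler and deliver SIGUSR1 to our own process.
signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)
print(received == [signal.SIGUSR1])
```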

Common Tasks

How do I test a Python program or component?

Python comes with two testing frameworks. The doctest module finds examples in the docstrings of a module and runs them, comparing the output against the expected output given in the docstring.

The unittest module is a fancier testing framework modelled on the Java and Smalltalk testing frameworks.
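To illustrate the doctest approach, here is a minimal self-test (the `average` function is a made-up example):

```python
def average(values):
    """Return the arithmetic mean of a list of numbers.

    >>> average([20, 30, 70])
    40.0
    """
    return sum(values) / float(len(values))

import doctest
# testmod() finds the example in the docstring above, runs it,
# and compares the result against the expected output.
failures = doctest.testmod().failed
print(failures)
```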

To make testing easier, you should use good modular design in your program. Almost all functionality should be encapsulated in either functions or class methods, and this sometimes has the surprising and delightful side effect of making the program run faster (because local variable accesses are faster than global accesses). Furthermore, the program should avoid depending on mutating global variables, since this makes testing much more difficult.

The "global main logic" of your program may be as simple as

if __name__ == "__main__":
    main_logic()

and placed at the bottom of the main module of your program.

Once your program is organized as a tractable collection of functions and class behaviours, you should write test functions that exercise those behaviours. A test suite that automates a sequence of tests can be associated with each module. This sounds like a lot of work, but since Python is so terse and flexible it's surprisingly easy. You can make coding much more pleasant and fun by writing your test functions in parallel with the "production code", since this makes it easy to find bugs and even design flaws earlier.

"Support modules" that are not intended to be the main module of a program may include a self-test entry point:

if __name__ == "__main__":
    self_test()

An interesting development of this idea: even if your program has to interact with a complex external interface, it can still be tested when that interface isn't available, by using "fake" interfaces implemented in Python.

How do I create documentation from docstrings?

The pydoc module can create HTML from the docstrings in your Python source code. An alternative for creating API documentation purely from docstrings is epydoc. Sphinx can also include docstring content.

How do I get a single keypress at a time?

For Unix variants there are several solutions. It’s straightforward to do this using curses, but curses is a fairly large module to learn. Here’s a solution without curses:

import termios, fcntl, sys, os
fd = sys.stdin.fileno()

oldterm = termios.tcgetattr(fd)
newattr = termios.tcgetattr(fd)
newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
termios.tcsetattr(fd, termios.TCSANOW, newattr)

oldflags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)

try:
    while 1:
        try:
            c = sys.stdin.read(1)
            print "Got character", repr(c)
        except IOError: pass
finally:
    termios.tcsetattr(fd, termios.TCSAFLUSH, oldterm)
    fcntl.fcntl(fd, fcntl.F_SETFL, oldflags)

You need the termios and the fcntl module for any of this to work, and I’ve only tried it on Linux, though it should work elsewhere. In this code, characters are read and printed one at a time.

termios.tcsetattr() turns off stdin's echoing and disables canonical mode. fcntl.fcntl() is used to obtain stdin's file descriptor flags and modify them for non-blocking mode. Since reading stdin when it is empty results in an IOError, this error is caught and ignored.

Threads

How do I program using threads?

Be sure to use the threading module and not the thread module. The threading module builds convenient abstractions on top of the low-level primitives provided by the thread module.

Aahz has a set of slides from his threading tutorial that are helpful; see http://www.pythoncraft.com/OSCON2001/

None of my threads seem to run: why?

As soon as the main thread exits, all threads are killed. Your main thread is running too quickly, giving the threads no time to do any work.

A simple fix is to add a sleep to the end of the program that's long enough for all the threads to finish:

import threading, time

def thread_task(name, n):
    for i in range(n): print name, i

for i in range(10):
    T = threading.Thread(target=thread_task, args=(str(i), i))
    T.start()

time.sleep(10) # <----------------------------!

But now (on many platforms) the threads don't run in parallel, but appear to run sequentially, one at a time! The reason is that the OS thread scheduler doesn't start a new thread until the previous thread has blocked.

A simple fix is to add a tiny sleep to the start of the run function:

def thread_task(name, n):
    time.sleep(0.001) # <---------------------!
    for i in range(n): print name, i

for i in range(10):
    T = threading.Thread(target=thread_task, args=(str(i), i))
    T.start()

time.sleep(10)

Instead of trying to guess a good delay value for time.sleep(), it’s better to use some kind of semaphore mechanism. One idea is to use the Queue module to create a queue object, let each thread append a token to the queue when it finishes, and let the main thread read as many tokens from the queue as there are threads.
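A sketch of that token idea (the module is named Queue in Python 2 and queue in Python 3, so the import below tries both):

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2
import threading

done = queue.Queue()

def worker(n):
    # ... do the real work here ...
    done.put(n)  # drop a token in the queue when finished

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()

# Read exactly as many tokens as there are threads; each get() blocks
# until a token is available, so no guessed sleep delay is needed.
tokens = [done.get() for _ in range(len(threads))]
print(sorted(tokens))
```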

How do I parcel out work among a bunch of worker threads?

Use the Queue module to create a queue containing a list of jobs. The Queue class maintains a list of objects and has a .put(obj) method that adds items to the queue and a .get() method to return them. The class will take care of the locking necessary to ensure that each job is handed out exactly once.

Here's a trivial example:

import threading, Queue, time

# The worker thread gets jobs off the queue.  When the queue is empty, it
# assumes there will be no more work and exits.
# (Realistically workers will run until terminated.)
def worker():
    print 'Running worker'
    time.sleep(0.1)
    while True:
        try:
            arg = q.get(block=False)
        except Queue.Empty:
            print 'Worker', threading.currentThread(),
            print 'queue empty'
            break
        else:
            print 'Worker', threading.currentThread(),
            print 'running with argument', arg
            time.sleep(0.5)

# Create queue
q = Queue.Queue()

# Start a pool of 5 workers
for i in range(5):
    t = threading.Thread(target=worker, name='worker %i' % (i+1))
    t.start()

# Begin adding work to the queue
for i in range(50):
    q.put(i)

# Give threads time to run
print 'Main thread sleeping'
time.sleep(5)

When run, this will produce the following output:

Running worker
Running worker
Running worker
Running worker
Running worker
Main thread sleeping
Worker <Thread(worker 1, started)> running with argument 0
Worker <Thread(worker 2, started)> running with argument 1
Worker <Thread(worker 3, started)> running with argument 2
Worker <Thread(worker 4, started)> running with argument 3
Worker <Thread(worker 5, started)> running with argument 4
Worker <Thread(worker 1, started)> running with argument 5
...

Consult the module’s documentation for more details; the Queue class provides a featureful interface.

What kinds of global value mutation are thread-safe?

A global interpreter lock (GIL) is used internally to ensure that only one thread runs in the Python VM at a time. In general, Python offers to switch among threads only between bytecode instructions; how frequently it switches can be set via sys.setcheckinterval(). Each bytecode instruction and therefore all the C implementation code reached from each instruction is therefore atomic from the point of view of a Python program.

In theory, this means an exact accounting requires an exact understanding of the PVM bytecode implementation. In practice, it means that operations on shared variables of built-in data types (ints, lists, dicts, etc.) that "look atomic" really are.

For example, the following operations are all atomic (L, L1, L2 are lists, D, D1, D2 are dicts, x and y are objects, i and j are ints):

L.append(x)
L1.extend(L2)
x = L[i]
x = L.pop()
L1[i:j] = L2
L.sort()
x = y
x.field = y
D[x] = y
D1.update(D2)
D.keys()

These aren't:

i = i+1
L.append(L[-1])
L[i] = L[j]
D[x] = D[x] + 1

Operations that replace other objects may invoke those other objects' __del__() method when their reference count reaches zero, and that can affect things. This is especially true for the mass updates to dictionaries and lists. When in doubt, use a mutex!

Can't we get rid of the Global Interpreter Lock?

The global interpreter lock (GIL) is often seen as a hindrance to Python's deployment on high-end multiprocessor server machines, because a multi-threaded Python program effectively only uses one CPU, due to the requirement that (almost) all Python code can only run while the GIL is held.

Back in the days of Python 1.5, Greg Stein actually implemented a comprehensive patch set (the “free threading” patches) that removed the GIL and replaced it with fine-grained locking. Unfortunately, even on Windows (where locks are very efficient) this ran ordinary Python code about twice as slow as the interpreter using the GIL. On Linux the performance loss was even worse because pthread locks aren’t as efficient.

Since then, the idea of getting rid of the GIL has occasionally come up but nobody has found a way to deal with the expected slowdown, and users who don’t use threads would not be happy if their code ran at half the speed. Greg’s free threading patch set has not been kept up-to-date for later Python versions.

This doesn’t mean that you can’t make good use of Python on multi-CPU machines! You just have to be creative with dividing the work up between multiple processes rather than multiple threads. Judicious use of C extensions will also help; if you use a C extension to perform a time-consuming task, the extension can release the GIL while the thread of execution is in the C code and allow other threads to get some work done.

It has been suggested that the GIL should be a per-interpreter-state lock rather than truly global; interpreters then wouldn't be able to share objects. Unfortunately, this isn't likely to happen either, because it would be a tremendous amount of work: many object implementations currently have global state. Small integers and short strings, for example, are cached, and these caches would have to be moved to the interpreter state. Other object types have their own free lists, which would likewise have to move into the interpreter state. And so on.

It is doubtful that this can even be done in finite time, because the same problem exists for third-party extensions, which are likely being written faster than you could convert them to store all their global state in the interpreter state.

And finally, once you have multiple interpreters not sharing any state, what have you gained over running each interpreter in a separate process?

Input and Output

How do I delete a file? (And other file questions…)

Use os.remove(filename) or os.unlink(filename); for documentation, see the os module. The two functions are identical; unlink() is simply the name of the Unix system call for this function.

To remove a directory, use os.rmdir(); use os.mkdir() to create one. os.makedirs(path) will create any intermediate directories in path that don't exist, and os.removedirs(path) will remove intermediate directories as long as they're empty. If you want to delete an entire directory tree and its contents, use shutil.rmtree().

To rename a file, use os.rename(old_path, new_path).

To truncate a file, open it using f = open(filename, "r+"), and use f.truncate(offset); offset defaults to the current seek position. There’s also os.ftruncate(fd, offset) for files opened with os.open(), where fd is the file descriptor (a small integer).
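For example, the following sketch truncates a scratch file to its first five bytes (the file name comes from tempfile and is otherwise arbitrary):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
f.write("hello world")
f.close()

# Reopen for updating and cut the file down to 5 bytes.
f = open(path, "r+")
f.truncate(5)
f.close()

content = open(path).read()
print(content)  # -> hello
os.remove(path)
```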

The shutil module also contains a number of functions to work on files, including copyfile(), copytree(), and rmtree().

How do I copy a file?

The shutil module contains a copyfile() function. Note that on MacOS 9 it doesn't copy the resource fork and Finder info.

How do I read (or write) binary data?

For complex binary data formats, it's best to use the struct module. It allows you to take a string containing binary data (usually numbers) and convert it to Python objects, and vice versa.

For example, the following code reads two 2-byte integers and one 4-byte integer in big-endian format from a file:

import struct

f = open(filename, "rb")  # Open in binary mode for portability
s = f.read(8)
x, y, z = struct.unpack(">hhl", s)

The '>' in the format string forces big-endian data; the letter 'h' reads one "short integer" (2 bytes), and 'l' reads one "long integer" (4 bytes) from the string.

For data that is more regular (e.g. a homogeneous list of ints or floats), you can also use the array module.
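The reverse direction works the same way: struct.pack() builds the binary string from Python values. A small round-trip sketch:

```python
import struct

# Pack two 2-byte integers and one 4-byte integer, big-endian.
data = struct.pack(">hhl", 1, 2, 3)
print(len(data))  # -> 8, i.e. 2 + 2 + 4 bytes

# Unpack recovers the original values.
x, y, z = struct.unpack(">hhl", data)
print((x, y, z))  # -> (1, 2, 3)
```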

Why doesn't os.read() work on a pipe created with os.popen()?

os.read() is a low-level function which takes a file descriptor, a small integer representing the opened file. os.popen() creates a high-level file object, the same type returned by the built-in open() function. Thus, to read n bytes from a pipe p created with os.popen(), you need to use p.read(n).

How do I run a subprocess with pipes connected to both input and output?

Use the popen2 module. For example:

import popen2
fromchild, tochild = popen2.popen2("command")
tochild.write("input\n")
tochild.flush()
output = fromchild.readline()

Warning: in general it is unwise to do this because you can easily cause a deadlock where your process is blocked waiting for output from the child while the child is blocked waiting for input from you. This can be caused by the parent expecting the child to output more text than it does or by data being stuck in stdio buffers due to lack of flushing. The Python parent can of course explicitly flush the data it sends to the child before it reads any output, but if the child is a naive C program it may have been written to never explicitly flush its output, even if it is interactive, since flushing is normally automatic.

Note that a deadlock is also possible if you use popen3() to read stdout and stderr. If one of the two is too large for the internal buffer (increasing the buffer size does not help) and you read() the other one first, there is a deadlock, too.

Note on a bug in popen2: unless your program calls wait() or waitpid(), finished child processes are never removed, and eventually calls to popen2 will fail because of a limit on the number of child processes. Calling os.waitpid() with the os.WNOHANG option can prevent this; a good place to insert such a call would be before calling popen2 again.

In many cases, all you really need is to run some data through a command and get the result back. Unless the amount of data is very large, the easiest way to do this is to write it to a temporary file and run the command with that temporary file as input. The standard module tempfile exports a mktemp() function to generate unique temporary file names.

import tempfile
import os

class Popen3:
    """
    This is a deadlock-safe version of popen that returns
    an object with errorlevel, out (a string) and err (a string).
    (capturestderr may not work under windows.)
    Example: print Popen3('grep spam','\n\nhere spam\n\n').out
    """
    def __init__(self,command,input=None,capturestderr=None):
        outfile=tempfile.mktemp()
        command="( %s ) > %s" % (command,outfile)
        if input:
            infile=tempfile.mktemp()
            open(infile,"w").write(input)
            command=command+" <"+infile
        if capturestderr:
            errfile=tempfile.mktemp()
            command=command+" 2>"+errfile
        self.errorlevel=os.system(command) >> 8
        self.out=open(outfile,"r").read()
        os.remove(outfile)
        if input:
            os.remove(infile)
        if capturestderr:
            self.err=open(errfile,"r").read()
            os.remove(errfile)

Note that many interactive programs (e.g. vi) don’t work well with pipes substituted for standard input and output. You will have to use pseudo ttys (“ptys”) instead of pipes. Or you can use a Python interface to Don Libes’ “expect” library. A Python extension that interfaces to expect is called “expy” and available from http://expectpy.sourceforge.net. A pure Python solution that works like expect is pexpect.

How do I access the serial (RS232) port?

For Win32, POSIX (Linux, BSD, etc.), and Jython:

For Unix, see a Usenet post by Mitch Chapman:

Why doesn't closing sys.stdout (stdin, stderr) really close it?

Python file objects are a high-level layer of abstraction on top of C streams, which in turn are a medium-level layer of abstraction on top of (among other things) low-level C file descriptors.

For most file objects you create in Python via the built-in file constructor, f.close() marks the Python file object as being closed from Python’s point of view, and also arranges to close the underlying C stream. This also happens automatically in f’s destructor, when f becomes garbage.

But stdin, stdout and stderr are treated specially by Python, because of the special status also given to them by C. Running sys.stdout.close() marks the Python-level file object as being closed, but does not close the associated C stream.

To close the underlying C stream for one of these three, you should first be sure that’s what you really want to do (e.g., you may confuse extension modules trying to do I/O). If it is, use os.close:

os.close(0)   # close C's stdin stream
os.close(1)   # close C's stdout stream
os.close(2)   # close C's stderr stream

Network/Internet Programming

What WWW tools are there for Python?

See the chapters titled Internet Protocols and Support and Internet Data Handling in the Library Reference. Python has many modules that will help you build server-side and client-side web systems.

An overview of the available frameworks is maintained by Paul Boddie at https://wiki.python.org/moin/WebProgramming

Cameron Laird maintains a useful collection of pages about Python web technologies at http://phaseit.net/claird/comp.lang.python/web_python

How do I mimic CGI form submission (METHOD=POST)?

I would like to retrieve web pages that are the result of POSTing a form. Is there existing code that would let me do this easily?

Yes. Here’s a simple example that uses httplib:

#!/usr/local/bin/python

import httplib, sys, time

# build the query string
qs = "First=Josephine&MI=Q&Last=Public"

# connect and send the server a path
httpobj = httplib.HTTP('www.some-server.out-there', 80)
httpobj.putrequest('POST', '/cgi-bin/some-cgi-script')
# now generate the rest of the HTTP headers...
httpobj.putheader('Accept', '*/*')
httpobj.putheader('Connection', 'Keep-Alive')
httpobj.putheader('Content-type', 'application/x-www-form-urlencoded')
httpobj.putheader('Content-length', '%d' % len(qs))
httpobj.endheaders()
httpobj.send(qs)
# find out what the server said in response...
reply, msg, hdrs = httpobj.getreply()
if reply != 200:
    sys.stdout.write(httpobj.getfile().read())

Note that in general for percent-encoded POST operations, query strings must be quoted using urllib.urlencode(). For example, to send name=Guy Steele, Jr.:

>>> import urllib
>>> urllib.urlencode({'name': 'Guy Steele, Jr.'})
'name=Guy+Steele%2C+Jr.'

What module should I use to help with generating HTML?

You can find a collection of useful links on the Web Programming wiki page.

How do I send mail from a Python script?

Use the smtplib standard library module.

Here's a very simple interactive mail sender that uses it. This method will work on any host that supports an SMTP listener:

import sys, smtplib

fromaddr = raw_input("From: ")
toaddrs  = raw_input("To: ").split(',')
print "Enter message, end with ^D:"
msg = ''
while True:
    line = sys.stdin.readline()
    if not line:
        break
    msg += line

# The actual mail send
server = smtplib.SMTP('localhost')
server.sendmail(fromaddr, toaddrs, msg)
server.quit()

A Unix-only alternative uses sendmail. The location of the sendmail program varies between systems; sometimes it is /usr/lib/sendmail, sometimes /usr/sbin/sendmail. The sendmail manual page will help you out. Here's some sample code:

import os

SENDMAIL = "/usr/sbin/sendmail"  # sendmail location
p = os.popen("%s -t -i" % SENDMAIL, "w")
p.write("To: receiver@example.com\n")
p.write("Subject: test\n")
p.write("\n") # blank line separating headers from body
p.write("Some text\n")
p.write("some more text\n")
sts = p.close()
if sts != 0:
    print "Sendmail exit status", sts

How do I avoid blocking in the connect() method of a socket?

The select module is commonly used to help with asynchronous I/O on sockets.

To prevent the TCP connect from blocking, you can set the socket to non-blocking mode. Then when you do the connect(), you will either connect immediately (unlikely) or get an exception that contains the error number as its errno attribute. errno.EINPROGRESS indicates that the connection is in progress but hasn't finished yet. Different OSes will return different values, so you're going to have to check what's returned on your system.

You can use the connect_ex() method to avoid creating an exception. It will just return the errno value. To poll, you can call connect_ex() again later – 0 or errno.EISCONN indicate that you’re connected – or you can pass this socket to select to check if it’s writable.
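A sketch of the connect_ex() approach (the address used here is arbitrary, and which errno values you actually see is system-dependent):

```python
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setblocking(False)  # a plain connect() would now raise instead of waiting

# connect_ex() returns an errno value instead of raising an exception.
# 0 means connected; EINPROGRESS means the handshake is still under way.
err = s.connect_ex(("127.0.0.1", 9))
if err == 0 or err == errno.EISCONN:
    state = "connected"
elif err == errno.EINPROGRESS:
    state = "in progress"
else:
    state = "failed"
print(state)
s.close()
```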

Databases

Are there any interfaces to database packages in Python?

Yes.

Python 2.3 includes the bsddb package which provides an interface to the BerkeleyDB library. Interfaces to disk-based hashes such as DBM and GDBM are also included with standard Python.

Support for most relational databases is available. See the Database Programming wiki page for details.

How do you implement persistent objects in Python?

The pickle library module solves this in a very general way (though you still can’t store things like open files, sockets or windows), and the shelve library module uses pickle and (g)dbm to create persistent mappings containing arbitrary Python objects. For better performance, you can use the cPickle module.
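A minimal round-trip sketch (the try/except import uses cPickle where it exists, as in Python 2, and falls back to pickle otherwise):

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle

data = {"name": "spam", "values": [1, 2, 3]}

# dumps() serializes to a (byte) string; loads() reconstructs the object.
blob = pickle.dumps(data)
restored = pickle.loads(blob)
print(restored == data)  # -> True
```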

A more awkward way of doing things is to use pickle’s little sister, marshal. The marshal module provides very fast ways to store noncircular basic Python types to files and strings, and back again. Although marshal does not do fancy things like store instances or handle shared references properly, it does run extremely fast. For example, loading a half megabyte of data may take less than a third of a second. This often beats doing something more complex and general such as using gdbm with pickle/shelve.

Why is cPickle so slow?

By default pickle uses a relatively old and slow format for backward compatibility. You can however specify other protocol versions that are faster:

import cPickle

largeString = 'z' * (100 * 1024)
myPickle = cPickle.dumps(largeString, protocol=1)

If my program crashes with a bsddb (or anydbm) database open, it gets corrupted. How come?

Databases opened for write access with the bsddb module (and often by the anydbm module, since it will preferentially use bsddb) must explicitly be closed using the .close() method of the database. The underlying library caches database contents which need to be converted to on-disk form and written.

If you have initialized a new bsddb database but not written anything to it before the program crashes, you will often wind up with a zero-length file and encounter an exception the next time the file is opened.

I tried to open Berkeley DB file, but bsddb produces bsddb.error: (22, ‘Invalid argument’). Help! How can I restore my data?

Don’t panic! Your data is probably intact. The most frequent cause for the error is that you tried to open an earlier Berkeley DB file with a later version of the Berkeley DB library.

Many Linux systems now have all three versions of Berkeley DB available. If you are migrating from version 1 to a newer version use db_dump185 to dump a plain text version of the database. If you are migrating from version 2 to version 3 use db2_dump to create a plain text version of the database. In either case, use db_load to create a new native database for the latest version installed on your computer. If you have version 3 of Berkeley DB installed, you should be able to use db2_load to create a native version 2 database.

You should move away from Berkeley DB version 1 files because the hash file code contains known bugs that can corrupt your data.

Mathematics and Numerics

How do I generate random numbers in Python?

The standard module random implements a random number generator. Usage is simple:

import random
random.random()

This returns a random floating point number in the range [0, 1).

There are also many other specialized generators in this module, such as:

  • randrange(a, b) chooses an integer in the range [a, b).

  • uniform(a, b) chooses a floating point number in the range [a, b).

  • normalvariate(mean, sdev) samples the normal (Gaussian) distribution.

Some higher-level functions operate on sequences directly, such as:

  • choice(S) chooses a random element from a given sequence.

  • shuffle(L) shuffles a list in-place, i.e. permutes it randomly.

There's also a Random class you can instantiate to create multiple independent random number generators.
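For instance, two Random instances seeded identically produce the same stream, without disturbing the module-level generator (the seed value 42 is arbitrary):

```python
import random

gen1 = random.Random(42)
gen2 = random.Random(42)

# Each instance carries its own internal state.
a = [gen1.random() for _ in range(3)]
b = [gen2.random() for _ in range(3)]
print(a == b)  # -> True
```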