Python support for the "perf map" compatible profilers
******************************************************

作者:
   Pablo Galindo

The Linux perf profiler and samply are powerful tools that allow you
to profile and obtain information about the performance of your
application. Both tools have vibrant ecosystems that aid with the
analysis of the data they produce.

The main problem with using these profilers with Python applications
is that they only get information about native symbols, that is, the
names of functions and procedures written in C. This means that the
names and file names of Python functions in your code will not appear
in the profiler output.

Since Python 3.12, the interpreter can run in a special mode that
allows Python functions to appear in the output of compatible
profilers. When this mode is enabled, the interpreter will interpose a
small piece of code compiled on the fly before the execution of every
Python function and it will teach the profiler the relationship
between this piece of code and the associated Python function using
perf map files.

備註:

  Support for profiling is available on Linux and macOS on select
  architectures. Perf is available on Linux, while samply can be used
  on both Linux and macOS. samply support on macOS is available
  starting from Python 3.15. Check the output of the "configure" build
  step or check the output of "python -m sysconfig | grep
  HAVE_PERF_TRAMPOLINE" to see if your system is supported.

例如，參考以下腳本：

   def foo(n):
       result = 0
       for _ in range(n):
           result += 1
       return result

   def bar(n):
       foo(n)

   def baz(n):
       bar(n)

   if __name__ == "__main__":
       baz(1000000)

我們可以執行 "perf" 以 9999 赫茲取樣 CPU 堆疊追蹤 (stack trace)：

   $ perf record -F 9999 -g -o perf.data python my_script.py

然後我們可以使用 "perf report" 來分析資料：

   $ perf report --stdio -n -g

   # Children      Self       Samples  Command     Shared Object       Symbol
   # ........  ........  ............  ..........  ..................  ..........................................
   #
       91.08%     0.00%             0  python.exe  python.exe          [.] _start
               |
               ---_start
               |
                   --90.71%--__libc_start_main
                           Py_BytesMain
                           |
                           |--56.88%--pymain_run_python.constprop.0
                           |          |
                           |          |--56.13%--_PyRun_AnyFileObject
                           |          |          _PyRun_SimpleFileObject
                           |          |          |
                           |          |          |--55.02%--run_mod
                           |          |          |          |
                           |          |          |           --54.65%--PyEval_EvalCode
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     |
                           |          |          |                     |--51.67%--_PyEval_EvalFrameDefault
                           |          |          |                     |          |
                           |          |          |                     |          |--11.52%--_PyCompactLong_Add
                           |          |          |                     |          |          |
                           |          |          |                     |          |          |--2.97%--_PyObject_Malloc
   ...

如你所見，Python 函式未顯示在輸出中，僅顯示
"_Py_Eval_EvalFrameDefault" （為 Python 位元組碼 (bytecode) 求值的函式
）。不幸的是，這不是很有用，因為所有 Python 函式都使用相同的 C 函式來
替位元組碼求值，因此我們無法知道哪個 Python 函式是對應於哪個位元組碼計
算函式。

作為替代，如果我們在啟用 "perf" 支援的情況下執行相同的實驗，我們會得到
：

   $ perf report --stdio -n -g

   # Children      Self       Samples  Command     Shared Object       Symbol
   # ........  ........  ............  ..........  ..................  .....................................................................
   #
       90.58%     0.36%             1  python.exe  python.exe          [.] _start
               |
               ---_start
               |
                   --89.86%--__libc_start_main
                           Py_BytesMain
                           |
                           |--55.43%--pymain_run_python.constprop.0
                           |          |
                           |          |--54.71%--_PyRun_AnyFileObject
                           |          |          _PyRun_SimpleFileObject
                           |          |          |
                           |          |          |--53.62%--run_mod
                           |          |          |          |
                           |          |          |           --53.26%--PyEval_EvalCode
                           |          |          |                     py::<module>:/src/script.py
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     py::baz:/src/script.py
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     py::bar:/src/script.py
                           |          |          |                     _PyEval_EvalFrameDefault
                           |          |          |                     PyObject_Vectorcall
                           |          |          |                     _PyEval_Vector
                           |          |          |                     py::foo:/src/script.py
                           |          |          |                     |
                           |          |          |                     |--51.81%--_PyEval_EvalFrameDefault
                           |          |          |                     |          |
                           |          |          |                     |          |--13.77%--_PyCompactLong_Add
                           |          |          |                     |          |          |
                           |          |          |                     |          |          |--3.26%--_PyObject_Malloc


Using the samply profiler
=========================

samply is a modern profiler that can be used as an alternative to
perf. It uses the same perf map files that Python generates, making it
compatible with Python's profiling support. samply is particularly
useful on macOS where perf is not available.

To use samply with Python, first install it following the instructions
at https://github.com/mstange/samply, then run:

   $ samply record PYTHONPERFSUPPORT=1 python my_script.py

This will open a web interface where you can analyze the profiling
data interactively. The advantage of samply is that it provides a
modern web-based interface for analyzing profiling data and works on
both Linux and macOS.

On macOS, samply support requires Python 3.15 or later. Also on macOS,
samply can't profile signed Python executables due to restrictions by
macOS. You can profile with Python binaries that you've compiled
yourself, or which are unsigned or locally-signed (such as anything
installed by Homebrew). In order to attach to running processes on
macOS, run "samply setup" once (and every time samply is updated) to
self-sign the samply binary.


如何啟用 "perf" 分析支援
========================

要啟用 "perf" 分析支援，可以在一開始就使用環境變數 "PYTHONPERFSUPPORT"
或使用 "-X perf" 選項，也可以使用 "sys.activate_stack_trampoline()" 和
"sys.deactivate_stack_trampoline()" 來動態啟用。

"sys" 函式優先於 "-X" 選項、"-X" 選項優先於環境變數。

例如，使用環境變數：

   $ PYTHONPERFSUPPORT=1 perf record -F 9999 -g -o perf.data python my_script.py
   $ perf report -g -i perf.data

例如，使用 "-X" 選項：

   $ perf record -F 9999 -g -o perf.data python -X perf my_script.py
   $ perf report -g -i perf.data

例如，在 "example.py" 檔案中使用 "sys" API：

   import sys

   sys.activate_stack_trampoline("perf")
   do_profiled_stuff()
   sys.deactivate_stack_trampoline()

   non_profiled_stuff()

...然後：

   $ perf record -F 9999 -g -o perf.data python ./example.py
   $ perf report -g -i perf.data


如何獲得最佳結果
================

For best results, keep frame pointers enabled. On supported GCC-
compatible toolchains, CPython builds itself with "-fno-omit-frame-
pointer" and similar flags (see "--without-frame-pointers" for
details). These flags allow profilers to unwind using only the frame
pointer and not on DWARF debug information. This is because as the
code that is interposed to allow "perf" support is dynamically
generated it doesn't have any DWARF debugging information available.

你可以透過執行以下指令來檢查你的系統是否已使用此旗標進行編譯：

   $ python -m sysconfig | grep 'no-omit-frame-pointer'

如果你沒有看到任何輸出，則表示你的直譯器尚未使用 frame 指標進行編譯，
因此它可能無法在 "perf" 的輸出中顯示 Python 函式。


How to work without frame pointers
==================================

If you are working with a Python interpreter that has been compiled
without frame pointers, you can still use the "perf" profiler, but the
overhead will be a bit higher because Python needs to generate
unwinding information for every Python function call on the fly.
Additionally, "perf" will take more time to process the data because
it will need to use the DWARF debugging information to unwind the
stack and this is a slow process.

To enable this mode, you can use the environment variable
"PYTHON_PERF_JIT_SUPPORT" or the "-X perf_jit" option, which will
enable the JIT mode for the "perf" profiler.

備註:

  Due to a bug in the "perf" tool, only "perf" versions higher than
  v6.8 will work with the JIT mode.  The fix was also backported to
  the v6.7.2 version of the tool.Note that when checking the version
  of the "perf" tool (which can be done by running "perf version") you
  must take into account that some distros add some custom version
  numbers including a "-" character.  This means that "perf 6.7-3" is
  not necessarily "perf 6.7.3".

When using the perf JIT mode, you need an extra step before you can
run "perf report". You need to call the "perf inject" command to
inject the JIT information into the "perf.data" file.:

   $ perf record -F 9999 -g -k 1 --call-graph dwarf -o perf.data python -Xperf_jit my_script.py
   $ perf inject -i perf.data --jit --output perf.jit.data
   $ perf report -g -i perf.jit.data

或使用環境變數：

   $ PYTHON_PERF_JIT_SUPPORT=1 perf record -F 9999 -g --call-graph dwarf -o perf.data python my_script.py
   $ perf inject -i perf.data --jit --output perf.jit.data
   $ perf report -g -i perf.jit.data

"perf inject --jit" command will read "perf.data", automatically pick
up the perf dump file that Python creates (in "/tmp/perf-$PID.dump"),
and then create "perf.jit.data" which merges all the JIT information
together. It should also create a lot of "jitted-XXXX-N.so" files in
the current directory which are ELF images for all the JIT trampolines
that were created by Python.

警告:

  When using "--call-graph dwarf", the "perf" tool will take snapshots
  of the stack of the process being profiled and save the information
  in the "perf.data" file. By default, the size of the stack dump is
  8192 bytes, but you can change the size by passing it after a comma
  like "--call-graph dwarf,16384".The size of the stack dump is
  important because if the size is too small "perf" will not be able
  to unwind the stack and the output will be incomplete. On the other
  hand, if the size is too big, then "perf" won't be able to sample
  the process as frequently as it would like as the overhead will be
  higher.The stack size is particularly important when profiling
  Python code compiled with low optimization levels (like "-O0"), as
  these builds tend to have larger stack frames. If you are compiling
  Python with "-O0" and not seeing Python functions in your profiling
  output, try increasing the stack dump size to 65528 bytes (the
  maximum):

     $ perf record -F 9999 -g -k 1 --call-graph dwarf,65528 -o perf.data python -Xperf_jit my_script.py

  Different compilation flags can significantly impact stack sizes:

  * Builds with "-O0" typically have much larger stack frames than
    those with "-O1" or higher

  * Adding optimizations ("-O1", "-O2", etc.) typically reduces stack
    size

  * Frame pointers ("-fno-omit-frame-pointer") generally provide more
    reliable stack unwinding
