perf 两个子系统组成:
- 内核 SYSCALL 提供对性能数据的访问,包括操作系统软件事件和硬件性能事件,以及
- 一组用于收集、显示和分析性能数据的用户空间工具。
PERF 是访问硬件性能计数器的标准方法。它同时支持计数模式(counting mode)和采样模式(sampling mode)。
perf 是内核的一部分,因此安装 linux-tools 就包含了 perf
sudo apt-get install linux-tools-$(uname -r)
# perf 工具选项
perf --help
# perf 特定命令的选项, 如 report
perf report --help
perf 可以测量什么 #
PERF 支持大量预定义事件:
perf list
perf 常用硬件事件符号名称:
cpu-cycles OR cycles
instructions
cache-references
cache-misses
branch-instructions OR branches
branch-misses
bus-cycles
stalled-cycles-frontend OR idle-cycles-frontend
stalled-cycles-backend OR idle-cycles-backend
ref-cycles
perf 一些预定义软件事件:
cpu-clock
task-clock
page-faults OR faults
context-switches OR cs
cpu-migrations OR migrations
minor-faults
major-faults
alignment-faults
emulation-faults
该 cpu-clock
事件测量时间的流逝,使用 Linux CPU Clock 作为计时源。cpu-clock
事件时间是性能分析的良好起点,该事件易于使用和理解。
cpu-clock 和 faults 测量 #
分析的第一步是建立基线性能。基线为我们提供了一种方法来评估我们对源代码或构建过程所做的任何更改的有效性。请不要忘记编译器会执行许多由命令行选项启用的优化。有时提高性能的最佳方法是通过标志挖掘,即找到产生最佳性能的编译器选项组合。
➜ TERRIER_BENCHMARK_THREADS="4" perf record -e cpu-clock,faults --call-graph dwarf -- ./slot_iterator_benchmark
2022-05-11 15:42:00
Running ./slot_iterator_benchmark
Run on (16 X 2304.01 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 1280K (x8)
L3 Unified 24576K (x1)
-------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------------------
SlotIteratorBenchmark/ConcurrentSlotIterators/manual_time 6250 ms 0 ms 1 6.10352M items/s
[ perf record: Woken up 4000 times to write data ]
[ perf record: Captured and wrote 1000.992 MB perf.data (124083 samples) ]
8991.71 msec cpu-clock
以毫秒为单位显示 cpu-clock 事件数和经过的执行时间,该时间是未来比较的基准(在4个线程下)。
我们应该问什么问题? #
性能分析中有三个首要问题:
- 程序的哪些地方占用的执行时间最多?
- 软件或硬件事件的数量是否表明程序的实际性能,需要修复吗?
- 如何解决性能问题?
执行时间最长的代码区域称为热点。热点是调整和优化的最佳场所,因为在加快热点上付出一点努力可以获得巨大的回报。在只执行一次或不需要太多时间的代码上花费工程精力是不值得的。由于这种分析涉及执行时间,因此您可以很容易地猜测cpu-clock
事件是分析方法的一部分。
第二个问题询问事件的数量是否page-faults
表明要修复的实际性能问题。在上面的第二个例子中,885 页面错误是不是太多了?要回答这个问题,我们需要对程序中的算法和数据结构有一些了解或直觉。我们至少需要知道程序是受 CPU 限制(进行大量计算)还是受内存限制(需要在内存中读取/写入大量数据)。我们还需要知道事件在程序中发生的位置。如果事件与已知热点相关,那么我们可能发现了真正的性能问题并需要修复它。
最后一个问题是最难回答的。假设我们发现了一个具有大量页面错误的热点。我们需要研究程序的算法和数据结构,并确定程序是否具有不必要的大工作集大小并在短时间内触及许多页面。我们可能需要减少或重新组织数据,以便数据驻留在更少的页面上。一旦代码更改完成,我们必须测量修改后程序的性能,并将其执行时间与基线进行比较,看看是否有任何加速。
性能调优是一项艰苦的实验性工作,如果修改后的程序运行速度低于基线,请不要感到惊讶。如果一个假设失败,那么尝试另一个假设。继续尝试!
如何找到热点? #
Step1. Collect profile data #
在收集 profile data 之前,最好使用 gdb 的 -g
选项编译,perf 可以根据 debug symbol 注释源代码,这样 perf 可以提供更多信息,否则 perf 只能生成带有注释的反汇编。
TERRIER_BENCHMARK_THREADS="4" perf record -e cpu-clock,faults --call-graph dwarf -- ./slot_iterator_benchmark
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.835 MB perf.data (40023 samples) ]
perf-record 支持很多选项
--call-graph
记录caller/callee 信息,用于 stack chain/backtrace- 在多核机器上工作时,可以在所有 CPU 或指定的 CPU core 上收集
- 可以指定输出文件的名称、更改采样周期、抑制状态消息…
Step2. Display profiles (TUI) #
perf report
使用在 prof.data
(默认的profile 文件名称)中显示 profile data, 支持额外的输出选项
--tui
--stdio
--gtk
perf report
输出
# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ....................... .................................................................................................................................................................................................................
#
27.31% 27.31% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_try_lock_byte
19.99% 19.99% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_pause
17.65% 0.00% slot_iterator_b [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
|
---entry_SYSCALL_64_after_hwframe
|
--17.56%--do_syscall_64
|
|--13.64%--__x64_sys_sched_yield
| |
| |--10.60%--schedule
| | |
| | --10.31%--__sched_text_start
| |
| --2.89%--do_sched_yield
|
--3.64%--syscall_enter_from_user_mode
17.56% 0.13% slot_iterator_b [kernel.kallsyms] [k] do_syscall_64
|
--17.43%--do_syscall_64
|
|--13.64%--__x64_sys_sched_yield
| |
| |--10.60%--schedule
| | |
| | --10.31%--__sched_text_start
| |
| --2.89%--do_sched_yield
|
--3.64%--syscall_enter_from_user_mode
13.64% 0.16% slot_iterator_b [kernel.kallsyms] [k] __x64_sys_sched_yield
|
--13.49%--__x64_sys_sched_yield
|
|--10.60%--schedule
| |
| --10.31%--__sched_text_start
我们可以屏蔽 callstack
perf report --no-children -g graph
输出
# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Overhead Command Shared Object Symbol
# ........ ............... ....................... .................................................................................................................................................................................................................
#
27.31% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_try_lock_byte
19.99% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_pause
10.31% slot_iterator_b [kernel.kallsyms] [k] __sched_text_start
|
---__sched_text_start
schedule
__x64_sys_sched_yield
do_syscall_64
entry_SYSCALL_64_after_hwframe
3.64% slot_iterator_b [kernel.kallsyms] [k] syscall_enter_from_user_mode
|
---syscall_enter_from_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
2.92% slot_iterator_b libc-2.31.so [.] __sched_yield
2.89% slot_iterator_b [kernel.kallsyms] [k] do_sched_yield
|
---do_sched_yield
__x64_sys_sched_yield
do_syscall_64
entry_SYSCALL_64_after_hwframe
1.35% slot_iterator_b slot_iterator_benchmark [.] terrier::common::RawConcurrentBitmap::Flip
1.33% slot_iterator_b slot_iterator_benchmark [.] tbb::internal::atomic_backoff::pause
1.32% slot_iterator_b slot_iterator_benchmark [.] __TBB_LockByte
1.31% slot_iterator_b slot_iterator_benchmark [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator--
0.95% slot_iterator_b slot_iterator_benchmark [.] std::operator&
0.88% slot_iterator_b slot_iterator_benchmark [.] std::__cxx11::list<terrier::storage::RawBlock*, std::allocator<terrier::storage::RawBlock*> >::empty
0.74% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleAccessStrategy::Block::Column
0.62% slot_iterator_b slot_iterator_benchmark [.] std::vector<unsigned short, std::allocator<unsigned short> >::size
0.62% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::BlockLayout::NumColumns
0.61% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::DataTable::end
0.59% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleSlot::GetOffset
//.....
# Samples: 1K of event 'faults'
# Event count (approx.): 159709
#
# Overhead Command Shared Object Symbol
# ........ ............... ....................... ......................................................................................................................................................................
#
59.51% slot_iterator_b libc-2.31.so [.] _int_malloc
37.33% slot_iterator_b libc-2.31.so [.] __memset_avx2_erms
2.42% slot_iterator_b libc-2.31.so [.] sysmalloc
0.20% slot_iterator_b libc-2.31.so [.] __memmove_avx_unaligned_erms
0.18% slot_iterator_b libc-2.31.so [.] _int_memalign
0.17% slot_iterator_b ld-2.31.so [.] _dl_map_object_from_fd
0.10% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::UndoRecord::InitializeInsert
0.06% slot_iterator_b slot_iterator_benchmark [.] __gnu_cxx::new_allocator<terrier::storage::RecordBufferSegment*>::construct<terrier::storage::RecordBufferSegment*, terrier::storage::RecordBufferSegment* const&>
0.02% slot_iterator_b ld-2.31.so [.] strchr
0.01% slot_iterator_b ld-2.31.so [.] _dl_start
0.00% slot_iterator_b libc-2.31.so [.] __res_thread_freeres
0.00% slot_iterator_b libc-2.31.so [.] new_heap
0.00% slot_iterator_b [kernel.kallsyms] [k] __clear_user
0.00% slot_iterator_b ld-2.31.so [.] _start
0.00% slot_iterator_b ld-2.31.so [.] do_lookup_x
0.00% slot_iterator_b [kernel.kallsyms] [k] copy_user_generic_unrolled
0.00% slot_iterator_b [vdso] [.] __vdso_clock_gettime
0.00% slot_iterator_b libc-2.31.so [.] __sched_yield
0.00% slot_iterator_b libc-2.31.so [.] clock_gettime@GLIBC_2.2.5
#
结果显示
__TBB_machine_try_lock_byte
是 cpu-clock 热点-
libc-2.31.so 的
_int_malloc
是 page faults 热点
Step2. Display profiles (stdio) #
perf report --stdio
显示用户空间和内核中最热门函数的列表。
- 测量了
cpu-clock, faults
两个事件,会得到了两个列表:最热的函数cpu-clocks
和最热的函数page-faults
。 - 通常,对内核活动不感兴趣,因此从结果中过滤掉内核函数通常很有帮助。
--sort
选项聚合(“sums up”)并按 command(comm)、动态共享对象(dso)、进程标识符(pid)等对结果进行排序。
perf report --stdio --sort comm,dso --no-children -g graph
# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Overhead Command Shared Object
# ........ ............... .......................
#
78.62% slot_iterator_b slot_iterator_benchmark
18.27% slot_iterator_b [kernel.kallsyms]
|
|--10.31%--__sched_text_start
| schedule
| __x64_sys_sched_yield
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
|
|--3.64%--syscall_enter_from_user_mode
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
|
--2.89%--do_sched_yield
__x64_sys_sched_yield
do_syscall_64
entry_SYSCALL_64_after_hwframe
3.11% slot_iterator_b libc-2.31.so
0.00% slot_iterator_b ld-2.31.so
0.00% slot_iterator_b libstdc++.so.6.0.28
# Samples: 1K of event 'faults'
# Event count (approx.): 159709
#
# Overhead Command Shared Object
# ........ ............... .......................
#
99.64% slot_iterator_b libc-2.31.so
0.20% slot_iterator_b ld-2.31.so
0.16% slot_iterator_b slot_iterator_benchmark
0.00% slot_iterator_b [kernel.kallsyms]
0.00% slot_iterator_b [vdso]
#
结果显示
- 78.62% cpu-clock 是
slot_iterator_benchmark
程序本身 - 99.64% page faults 是 lic-2.31.so 导致的
如何找到热点函数? #
perf report --stdio --dsos=slot_iterator_benchmark,libc-2.31.so
--dsos
选项将输出限制为指定的动态共享对象 (DSO),上面命令将输出限制为 slot_iterator_benchmark
和
libc-2.13.so runtime lib.
-
cpu-cloc 热点函数
# Total Lost Samples: 0 # # Samples: 122K of event 'cpu-clock' # Event count (approx.): 30544750000 # # Children Self Command Shared Object Symbol # ........ ........ ............... ....................... ................................................................................................................................................................................................................. # 27.31% 27.31% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_try_lock_byte 19.99% 19.99% slot_iterator_b slot_iterator_benchmark [.] __TBB_machine_pause 1.35% 1.35% slot_iterator_b slot_iterator_benchmark [.] terrier::common::RawConcurrentBitmap::Flip 1.33% 1.33% slot_iterator_b slot_iterator_benchmark [.] tbb::internal::atomic_backoff::pause 1.32% 1.32% slot_iterator_b slot_iterator_benchmark [.] __TBB_LockByte 1.31% 1.31% slot_iterator_b slot_iterator_benchmark [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator-- 0.95% 0.95% slot_iterator_b slot_iterator_benchmark [.] std::operator& 0.88% 0.88% slot_iterator_b slot_iterator_benchmark [.] std::__cxx11::list<terrier::storage::RawBlock*, std::allocator<terrier::storage::RawBlock*> >::empty 0.74% 0.74% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleAccessStrategy::Block::Column 0.62% 0.62% slot_iterator_b slot_iterator_benchmark [.] std::vector<unsigned short, std::allocator<unsigned short> >::size 0.62% 0.62% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::BlockLayout::NumColumns 0.61% 0.61% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::DataTable::end 0.59% 0.59% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleSlot::GetOffset 0.55% 0.55% slot_iterator_b slot_iterator_benchmark [.] tbb::spin_mutex::lock 0.53% 0.53% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::DataTable::SlotIterator::operator++ 0.51% 0.51% slot_iterator_b slot_iterator_benchmark [.] terrier::common::SpinLatch::Lock 0.49% 0.49% slot_iterator_b slot_iterator_benchmark [.] terrier::SlotIteratorBenchmark_ConcurrentSlotIterators_Benchmark::BenchmarkCase(benchmark::State&)::{lambda()#1}::operator() 0.48% 0.48% slot_iterator_b slot_iterator_benchmark [.] std::atomic<terrier::storage::UndoRecord*>::store 0.47% 0.47% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::DataTableCounter::IncrementNumInsert 0.46% 0.46% slot_iterator_b slot_iterator_benchmark [.] std::__atomic_base<terrier::storage::UndoRecord*>::operator= 0.46% 0.46% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::ArrowBlockMetadata::Size 0.45% 0.45% slot_iterator_b slot_iterator_benchmark [.] std::atomic<terrier::common::StrongTypeAlias<terrier::transaction::tags::timestamp_t_typedef_tag, unsigned long> >::store 0.43% 0.43% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleAccessStrategy::ClearBlockBusyStatus 0.43% 0.43% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::DataTable::SlotIterator::SlotIterator 0.43% 0.43% slot_iterator_b slot_iterator_benchmark [.] terrier::common::SpinLatch::ScopedSpinLatch::ScopedSpinLatch 0.42% 0.42% slot_iterator_b slot_iterator_benchmark [.] terrier::common::RawConcurrentBitmap::Test 0.42% 0.42% slot_iterator_b slot_iterator_benchmark [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator* 0.42% 0.42% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::StorageUtil::PadUpToSize 0.40% 0.40% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleAccessStrategy::Block::AttrOffsets 0.39% 0.39% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::StorageUtil::AlignedPtr 0.39% 0.39% slot_iterator_b slot_iterator_benchmark [.] terrier::common::SpinLatch::ScopedSpinLatch::~ScopedSpinLatch 0.37% 0.37% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::BlockLayout::AttrSize 0.37% 0.37% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::BlockLayout::NumSlots 0.35% 0.35% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleSlot::TupleSlot 0.33% 0.33% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::TupleAccessStrategy::AccessForceNotNull
-
faults 热点函数
# Samples: 1K of event 'faults' # Event count (approx.): 159709 # # Children Self Command Shared Object Symbol # ........ ........ ............... ....................... ...................................................................................................................................................................... # 59.51% 59.51% slot_iterator_b libc-2.31.so [.] _int_malloc 37.33% 37.33% slot_iterator_b libc-2.31.so [.] __memset_avx2_erms 2.42% 2.42% slot_iterator_b libc-2.31.so [.] sysmalloc 0.20% 0.20% slot_iterator_b libc-2.31.so [.] __memmove_avx_unaligned_erms 0.18% 0.18% slot_iterator_b libc-2.31.so [.] _int_memalign 0.10% 0.10% slot_iterator_b slot_iterator_benchmark [.] terrier::storage::UndoRecord::InitializeInsert 0.06% 0.06% slot_iterator_b slot_iterator_benchmark [.] __gnu_cxx::new_allocator<terrier::storage::RecordBufferSegment*>::construct<terrier::storage::RecordBufferSegment*, terrier::storage::RecordBufferSegment* const&> 0.00% 0.00% slot_iterator_b libc-2.31.so [.] __res_thread_freeres 0.00% 0.00% slot_iterator_b libc-2.31.so [.] new_heap 0.00% 0.00% slot_iterator_b libc-2.31.so [.] __sched_yield 0.00% 0.00% slot_iterator_b libc-2.31.so [.] clock_gettime@GLIBC_2.2.5
查看热点函数内部的热点部分
perf annotate --stdio --dsos=slot_iterator_benchmark --symbol=__TBB_machine_try_lock_byte
Percent | Source code & Disassembly of slot_iterator_benchmark for cpu-clock (33370 samples, percent: local period)
-------------------------------------------------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000000fea49 <__TBB_machine_try_lock_byte(unsigned char&)>:
: _Z27__TBB_machine_try_lock_byteRh():
:
: static inline void __TBB_machine_and( volatile void *ptr, uintptr_t addend ) {
: __atomic_fetch_and(reinterpret_cast<volatile uintptr_t *>(ptr),addend,__ATOMIC_SEQ_CST);
: }
:
: inline bool __TBB_machine_try_lock_byte( __TBB_atomic_flag &flag ) {
0.52 : fea49: endbr64
0.00 : fea4d: push %rbp
0.03 : fea4e: mov %rsp,%rbp
0.29 : fea51: mov %rdi,-0x8(%rbp)
: return !__atomic_test_and_set(&flag,__ATOMIC_ACQUIRE);
0.01 : fea55: mov -0x8(%rbp),%rdx
1.08 : fea59: mov $0x1,%eax
0.00 : fea5e: xchg %al,(%rdx)
97.25 : fea60: xor $0x1,%eax
: }
0.02 : fea63: pop %rbp
0.81 : fea64: retq
注意:若是编译时没有开启调试选项,则不能查看符号表,只能显示注释的反汇编,需要使用这个命令
perf annotate --stdio --dsos=slot_iterator_benchmark --symbol=terrier::common::RawConcurrentBitmap::Flip --no-source
Percent | Source code & Disassembly of slot_iterator_benchmark for cpu-clock (33370 samples, percent: local period)
-------------------------------------------------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000000fea49 <__TBB_machine_try_lock_byte(unsigned char&)>:
: _Z27__TBB_machine_try_lock_byteRh():
0.52 : fea49: endbr64
0.00 : fea4d: push %rbp
0.03 : fea4e: mov %rsp,%rbp
0.29 : fea51: mov %rdi,-0x8(%rbp)
0.01 : fea55: mov -0x8(%rbp),%rdx
1.08 : fea59: mov $0x1,%eax
0.00 : fea5e: xchg %al,(%rdx)
97.25 : fea60: xor $0x1,%eax
0.02 : fea63: pop %rbp
0.81 : fea64: retq
这个汇编指令是热点,占用了 97.25 的 cpu-clock
修复 hotspot #
slot_iterator_benchmark
对数据库的 blocks 进行迭代,begin
和 end
调用需要加锁
修复后 #
# Total Lost Samples: 0
#
# Samples: 177K of event 'cpu-clock'
# Event count (approx.): 44291000000
#
# Overhead Command Shared Object
# ........ ............... .......................
#
75.59% slot_iterator_b libtbb.so.2
20.35% slot_iterator_b slot_iterator_benchmark
3.39% slot_iterator_b [kernel.kallsyms]
|
|--1.71%--__sched_text_start
| schedule
| __x64_sys_sched_yield
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
|
--0.58%--syscall_enter_from_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
0.67% slot_iterator_b libc-2.31.so
0.00% slot_iterator_b ld-2.31.so
0.00% slot_iterator_b libstdc++.so.6.0.28
0.00% slot_iterator_b libgcc_s.so.1
0.00% slot_iterator_b libtbbmalloc.so.2
# Samples: 1K of event 'faults'
# Event count (approx.): 159198
#
# Overhead Command Shared Object
# ........ ............... .......................
#
99.61% slot_iterator_b libc-2.31.so
0.27% slot_iterator_b ld-2.31.so
0.12% slot_iterator_b slot_iterator_benchmark
0.00% slot_iterator_b [kernel.kallsyms]
0.00% slot_iterator_b [vdso]
#
# (Tip: For a higher level overview, try: perf report --sort comm,dso)
#
- cpu-clock 热点降低了 50%, 对比之前,热点区域集中在 slot_iterator_benchmark, 因为i每次迭代都要获取 lock, 使用读写锁后,blocks 迭代只获取 read lock, 热点在 libtbb 内,这是合理的。
- faults 热点保持一致