使用 Perf 发现程序执行热点

Table of Contents

perf 两个子系统组成：

内核 SYSCALL 提供对性能数据的访问，包括操作系统软件事件和硬件性能事件，以及
一组用于收集、显示和分析性能数据的用户空间工具。

PERF 是访问硬件性能计数器的标准方法。它同时支持计数模式（counting mode）和采样模式（sampling mode）。

perf 是内核的一部分，因此安装 linux-tools 就包含了 perf

sudo apt-get install linux-tools-$(uname -r)

# perf 工具选项
perf --help
# perf 特定命令的选项, 如 report
perf report --help

perf 可以测量什么
#

PERF 支持大量预定义事件：

perf list

perf 常用硬件事件符号名称：

cpu-cycles OR cycles
instructions
cache-references
cache-misses
branch-instructions OR branches
branch-misses
bus-cycles
stalled-cycles-frontend OR idle-cycles-frontend
stalled-cycles-backend OR idle-cycles-backend
ref-cycles

perf 一些预定义软件事件：

cpu-clock
task-clock
page-faults OR faults
context-switches OR cs
cpu-migrations OR migrations
minor-faults
major-faults
alignment-faults
emulation-faults

该 cpu-clock 事件测量时间的流逝，使用 Linux CPU Clock 作为计时源。cpu-clock事件时间是性能分析的良好起点，该事件易于使用和理解。

cpu-clock 和 faults 测量
#

分析的第一步是建立基线性能。基线为我们提供了一种方法来评估我们对源代码或构建过程所做的任何更改的有效性。请不要忘记编译器会执行许多由命令行选项启用的优化。有时提高性能的最佳方法是通过标志挖掘，即找到产生最佳性能的编译器选项组合。

➜  TERRIER_BENCHMARK_THREADS="4" perf record -e cpu-clock,faults --call-graph dwarf -- ./slot_iterator_benchmark

2022-05-11 15:42:00
Running ./slot_iterator_benchmark
Run on (16 X 2304.01 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 1280K (x8)
  L3 Unified 24576K (x1)
-------------------------------------------------------------------------------------------------
Benchmark                                                          Time           CPU Iterations
-------------------------------------------------------------------------------------------------
SlotIteratorBenchmark/ConcurrentSlotIterators/manual_time       6250 ms          0 ms          1   6.10352M items/s
[ perf record: Woken up 4000 times to write data ]
[ perf record: Captured and wrote 1000.992 MB perf.data (124083 samples) ]

8991.71 msec cpu-clock 以毫秒为单位显示 cpu-clock 事件数和经过的执行时间，该时间是未来比较的基准（在4个线程下）。

我们应该问什么问题？
#

性能分析中有三个首要问题：

程序的哪些地方占用的执行时间最多？
软件或硬件事件的数量是否表明程序的实际性能，需要修复吗？
如何解决性能问题？

执行时间最长的代码区域称为热点。热点是调整和优化的最佳场所，因为在加快热点上付出一点努力可以获得巨大的回报。在只执行一次或不需要太多时间的代码上花费工程精力是不值得的。由于这种分析涉及执行时间，因此您可以很容易地猜测cpu-clock事件是分析方法的一部分。

第二个问题询问事件的数量是否page-faults表明要修复的实际性能问题。在上面的第二个例子中，885 页面错误是不是太多了？要回答这个问题，我们需要对程序中的算法和数据结构有一些了解或直觉。我们至少需要知道程序是受 CPU 限制（进行大量计算）还是受内存限制（需要在内存中读取/写入大量数据）。我们还需要知道事件在程序中发生的位置。如果事件与已知热点相关，那么我们可能发现了真正的性能问题并需要修复它。

最后一个问题是最难回答的。假设我们发现了一个具有大量页面错误的热点。我们需要研究程序的算法和数据结构，并确定程序是否具有不必要的大工作集大小并在短时间内触及许多页面。我们可能需要减少或重新组织数据，以便数据驻留在更少的页面上。一旦代码更改完成，我们必须测量修改后程序的性能，并将其执行时间与基线进行比较，看看是否有任何加速。

性能调优是一项艰苦的实验性工作，如果修改后的程序运行速度低于基线，请不要感到惊讶。如果一个假设失败，那么尝试另一个假设。继续尝试！

如何找到热点？
#

Step1. Collect profile data
#

在收集 profile data 之前，最好使用 gdb 的 -g选项编译，perf 可以根据 debug symbol 注释源代码，这样 perf 可以提供更多信息，否则 perf 只能生成带有注释的反汇编。

TERRIER_BENCHMARK_THREADS="4" perf record -e cpu-clock,faults --call-graph dwarf -- ./slot_iterator_benchmark

[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.835 MB perf.data (40023 samples) ]

perf-record 支持很多选项

--call-graph 记录caller/callee 信息，用于 stack chain/backtrace
在多核机器上工作时，可以在所有 CPU 或指定的 CPU core 上收集
可以指定输出文件的名称、更改采样周期、抑制状态消息…

Step2. Display profiles (TUI)
#

perf report 使用在 prof.data（默认的profile 文件名称）中显示 profile data，支持额外的输出选项

--tui
--stdio
--gtk

perf report

输出


# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Children      Self  Command          Shared Object            Symbol

# ........  ........  ...............  .......................  .................................................................................................................................................................................................................
#
    27.31%    27.31%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_try_lock_byte
    19.99%    19.99%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_pause
    17.65%     0.00%  slot_iterator_b  [kernel.kallsyms]        [k] entry_SYSCALL_64_after_hwframe
            |
            ---entry_SYSCALL_64_after_hwframe
               |
                --17.56%--do_syscall_64
                          |
                          |--13.64%--__x64_sys_sched_yield
                          |          |
                          |          |--10.60%--schedule
                          |          |          |
                          |          |           --10.31%--__sched_text_start
                          |          |
                          |           --2.89%--do_sched_yield
                          |
                           --3.64%--syscall_enter_from_user_mode

    17.56%     0.13%  slot_iterator_b  [kernel.kallsyms]        [k] do_syscall_64
            |
             --17.43%--do_syscall_64
                       |
                       |--13.64%--__x64_sys_sched_yield
                       |          |
                       |          |--10.60%--schedule
                       |          |          |
                       |          |           --10.31%--__sched_text_start
                       |          |
                       |           --2.89%--do_sched_yield
                       |
                        --3.64%--syscall_enter_from_user_mode

    13.64%     0.16%  slot_iterator_b  [kernel.kallsyms]        [k] __x64_sys_sched_yield
            |
             --13.49%--__x64_sys_sched_yield
                       |
                       |--10.60%--schedule
                       |          |
                       |           --10.31%--__sched_text_start

我们可以屏蔽 callstack

perf report --no-children -g graph

输出

# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Overhead  Command          Shared Object            Symbol

# ........  ...............  .......................  .................................................................................................................................................................................................................
#
    27.31%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_try_lock_byte
    19.99%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_pause
    10.31%  slot_iterator_b  [kernel.kallsyms]        [k] __sched_text_start
            |
            ---__sched_text_start
               schedule
               __x64_sys_sched_yield
               do_syscall_64
               entry_SYSCALL_64_after_hwframe

     3.64%  slot_iterator_b  [kernel.kallsyms]        [k] syscall_enter_from_user_mode
            |
            ---syscall_enter_from_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe

     2.92%  slot_iterator_b  libc-2.31.so             [.] __sched_yield
     2.89%  slot_iterator_b  [kernel.kallsyms]        [k] do_sched_yield
            |
            ---do_sched_yield
               __x64_sys_sched_yield
               do_syscall_64
               entry_SYSCALL_64_after_hwframe

     1.35%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::RawConcurrentBitmap::Flip
     1.33%  slot_iterator_b  slot_iterator_benchmark  [.] tbb::internal::atomic_backoff::pause
     1.32%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_LockByte
     1.31%  slot_iterator_b  slot_iterator_benchmark  [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator--
     0.95%  slot_iterator_b  slot_iterator_benchmark  [.] std::operator&
     0.88%  slot_iterator_b  slot_iterator_benchmark  [.] std::__cxx11::list<terrier::storage::RawBlock*, std::allocator<terrier::storage::RawBlock*> >::empty
     0.74%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleAccessStrategy::Block::Column
     0.62%  slot_iterator_b  slot_iterator_benchmark  [.] std::vector<unsigned short, std::allocator<unsigned short> >::size
     0.62%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::BlockLayout::NumColumns
     0.61%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::DataTable::end
     0.59%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleSlot::GetOffset

//.....

# Samples: 1K of event 'faults'
# Event count (approx.): 159709
#
# Overhead  Command          Shared Object            Symbol

# ........  ...............  .......................  ......................................................................................................................................................................
#
    59.51%  slot_iterator_b  libc-2.31.so             [.] _int_malloc
    37.33%  slot_iterator_b  libc-2.31.so             [.] __memset_avx2_erms
     2.42%  slot_iterator_b  libc-2.31.so             [.] sysmalloc
     0.20%  slot_iterator_b  libc-2.31.so             [.] __memmove_avx_unaligned_erms
     0.18%  slot_iterator_b  libc-2.31.so             [.] _int_memalign
     0.17%  slot_iterator_b  ld-2.31.so               [.] _dl_map_object_from_fd
     0.10%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::UndoRecord::InitializeInsert
     0.06%  slot_iterator_b  slot_iterator_benchmark  [.] __gnu_cxx::new_allocator<terrier::storage::RecordBufferSegment*>::construct<terrier::storage::RecordBufferSegment*, terrier::storage::RecordBufferSegment* const&>
     0.02%  slot_iterator_b  ld-2.31.so               [.] strchr
     0.01%  slot_iterator_b  ld-2.31.so               [.] _dl_start
     0.00%  slot_iterator_b  libc-2.31.so             [.] __res_thread_freeres
     0.00%  slot_iterator_b  libc-2.31.so             [.] new_heap
     0.00%  slot_iterator_b  [kernel.kallsyms]        [k] __clear_user
     0.00%  slot_iterator_b  ld-2.31.so               [.] _start
     0.00%  slot_iterator_b  ld-2.31.so               [.] do_lookup_x
     0.00%  slot_iterator_b  [kernel.kallsyms]        [k] copy_user_generic_unrolled
     0.00%  slot_iterator_b  [vdso]                   [.] __vdso_clock_gettime
     0.00%  slot_iterator_b  libc-2.31.so             [.] __sched_yield
     0.00%  slot_iterator_b  libc-2.31.so             [.] clock_gettime@GLIBC_2.2.5

#

结果显示

__TBB_machine_try_lock_byte 是 cpu-clock 热点
libc-2.31.so 的 _int_malloc 是 page faults 热点

Step2. Display profiles (stdio)
#

perf report --stdio 显示用户空间和内核中最热门函数的列表。

测量了 cpu-clock, faults 两个事件，会得到了两个列表：最热的函数 cpu-clocks 和最热的函数 page-faults。
通常，对内核活动不感兴趣，因此从结果中过滤掉内核函数通常很有帮助。--sort选项聚合（“sums up”）并按 command（comm）、动态共享对象（dso）、进程标识符（pid）等对结果进行排序。

perf report --stdio --sort comm,dso --no-children -g graph

# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Overhead  Command          Shared Object
# ........  ...............  .......................
#
    78.62%  slot_iterator_b  slot_iterator_benchmark
    18.27%  slot_iterator_b  [kernel.kallsyms]
            |
            |--10.31%--__sched_text_start
            |          schedule
            |          __x64_sys_sched_yield
            |          do_syscall_64
            |          entry_SYSCALL_64_after_hwframe
            |
            |--3.64%--syscall_enter_from_user_mode
            |          do_syscall_64
            |          entry_SYSCALL_64_after_hwframe
            |
             --2.89%--do_sched_yield
                       __x64_sys_sched_yield
                       do_syscall_64
                       entry_SYSCALL_64_after_hwframe

     3.11%  slot_iterator_b  libc-2.31.so
     0.00%  slot_iterator_b  ld-2.31.so
     0.00%  slot_iterator_b  libstdc++.so.6.0.28

# Samples: 1K of event 'faults'
# Event count (approx.): 159709
#
# Overhead  Command          Shared Object
# ........  ...............  .......................
#
    99.64%  slot_iterator_b  libc-2.31.so
     0.20%  slot_iterator_b  ld-2.31.so
     0.16%  slot_iterator_b  slot_iterator_benchmark
     0.00%  slot_iterator_b  [kernel.kallsyms]
     0.00%  slot_iterator_b  [vdso]

#

结果显示

78.62% cpu-clock 是 slot_iterator_benchmark 程序本身
99.64% page faults 是 lic-2.31.so 导致的

如何找到热点函数？
#

perf report --stdio --dsos=slot_iterator_benchmark,libc-2.31.so

--dsos 选项将输出限制为指定的动态共享对象 (DSO)，上面命令将输出限制为 slot_iterator_benchmark 和 libc-2.13.so runtime lib.

cpu-cloc 热点函数

# Total Lost Samples: 0
#
# Samples: 122K of event 'cpu-clock'
# Event count (approx.): 30544750000
#
# Children      Self  Command          Shared Object            Symbol

# ........  ........  ...............  .......................  .................................................................................................................................................................................................................
#
    27.31%    27.31%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_try_lock_byte
    19.99%    19.99%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_machine_pause
     1.35%     1.35%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::RawConcurrentBitmap::Flip
     1.33%     1.33%  slot_iterator_b  slot_iterator_benchmark  [.] tbb::internal::atomic_backoff::pause
     1.32%     1.32%  slot_iterator_b  slot_iterator_benchmark  [.] __TBB_LockByte
     1.31%     1.31%  slot_iterator_b  slot_iterator_benchmark  [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator--
     0.95%     0.95%  slot_iterator_b  slot_iterator_benchmark  [.] std::operator&
     0.88%     0.88%  slot_iterator_b  slot_iterator_benchmark  [.] std::__cxx11::list<terrier::storage::RawBlock*, std::allocator<terrier::storage::RawBlock*> >::empty
     0.74%     0.74%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleAccessStrategy::Block::Column
     0.62%     0.62%  slot_iterator_b  slot_iterator_benchmark  [.] std::vector<unsigned short, std::allocator<unsigned short> >::size
     0.62%     0.62%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::BlockLayout::NumColumns
     0.61%     0.61%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::DataTable::end
     0.59%     0.59%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleSlot::GetOffset
     0.55%     0.55%  slot_iterator_b  slot_iterator_benchmark  [.] tbb::spin_mutex::lock
     0.53%     0.53%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::DataTable::SlotIterator::operator++
     0.51%     0.51%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::SpinLatch::Lock
     0.49%     0.49%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::SlotIteratorBenchmark_ConcurrentSlotIterators_Benchmark::BenchmarkCase(benchmark::State&)::{lambda()#1}::operator()
     0.48%     0.48%  slot_iterator_b  slot_iterator_benchmark  [.] std::atomic<terrier::storage::UndoRecord*>::store
     0.47%     0.47%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::DataTableCounter::IncrementNumInsert
     0.46%     0.46%  slot_iterator_b  slot_iterator_benchmark  [.] std::__atomic_base<terrier::storage::UndoRecord*>::operator=
     0.46%     0.46%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::ArrowBlockMetadata::Size
     0.45%     0.45%  slot_iterator_b  slot_iterator_benchmark  [.] std::atomic<terrier::common::StrongTypeAlias<terrier::transaction::tags::timestamp_t_typedef_tag, unsigned long> >::store
     0.43%     0.43%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleAccessStrategy::ClearBlockBusyStatus
     0.43%     0.43%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::DataTable::SlotIterator::SlotIterator
     0.43%     0.43%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::SpinLatch::ScopedSpinLatch::ScopedSpinLatch
     0.42%     0.42%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::RawConcurrentBitmap::Test
     0.42%     0.42%  slot_iterator_b  slot_iterator_benchmark  [.] std::_List_const_iterator<terrier::storage::RawBlock*>::operator*
     0.42%     0.42%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::StorageUtil::PadUpToSize
     0.40%     0.40%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleAccessStrategy::Block::AttrOffsets
     0.39%     0.39%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::StorageUtil::AlignedPtr
     0.39%     0.39%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::common::SpinLatch::ScopedSpinLatch::~ScopedSpinLatch
     0.37%     0.37%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::BlockLayout::AttrSize
     0.37%     0.37%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::BlockLayout::NumSlots
     0.35%     0.35%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleSlot::TupleSlot
     0.33%     0.33%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::TupleAccessStrategy::AccessForceNotNull

faults 热点函数

# Samples: 1K of event 'faults'
# Event count (approx.): 159709
#
# Children      Self  Command          Shared Object            Symbol

# ........  ........  ...............  .......................  ......................................................................................................................................................................
#
    59.51%    59.51%  slot_iterator_b  libc-2.31.so             [.] _int_malloc
    37.33%    37.33%  slot_iterator_b  libc-2.31.so             [.] __memset_avx2_erms
     2.42%     2.42%  slot_iterator_b  libc-2.31.so             [.] sysmalloc
     0.20%     0.20%  slot_iterator_b  libc-2.31.so             [.] __memmove_avx_unaligned_erms
     0.18%     0.18%  slot_iterator_b  libc-2.31.so             [.] _int_memalign
     0.10%     0.10%  slot_iterator_b  slot_iterator_benchmark  [.] terrier::storage::UndoRecord::InitializeInsert
     0.06%     0.06%  slot_iterator_b  slot_iterator_benchmark  [.] __gnu_cxx::new_allocator<terrier::storage::RecordBufferSegment*>::construct<terrier::storage::RecordBufferSegment*, terrier::storage::RecordBufferSegment* const&>
     0.00%     0.00%  slot_iterator_b  libc-2.31.so             [.] __res_thread_freeres
     0.00%     0.00%  slot_iterator_b  libc-2.31.so             [.] new_heap
     0.00%     0.00%  slot_iterator_b  libc-2.31.so             [.] __sched_yield
     0.00%     0.00%  slot_iterator_b  libc-2.31.so             [.] clock_gettime@GLIBC_2.2.5

查看热点函数内部的热点部分

perf annotate --stdio --dsos=slot_iterator_benchmark --symbol=__TBB_machine_try_lock_byte

Percent |      Source code & Disassembly of slot_iterator_benchmark for cpu-clock (33370 samples, percent: local period)
-------------------------------------------------------------------------------------------------------------------------
         :
         :
         :
         :           Disassembly of section .text:
         :
         :           00000000000fea49 <__TBB_machine_try_lock_byte(unsigned char&)>:
         :           _Z27__TBB_machine_try_lock_byteRh():
         :
         :           static inline void __TBB_machine_and( volatile void *ptr, uintptr_t addend ) {
         :           __atomic_fetch_and(reinterpret_cast<volatile uintptr_t *>(ptr),addend,__ATOMIC_SEQ_CST);
         :           }
         :
         :           inline bool __TBB_machine_try_lock_byte( __TBB_atomic_flag &flag ) {
    0.52 :   fea49:  endbr64
    0.00 :   fea4d:  push   %rbp
    0.03 :   fea4e:  mov    %rsp,%rbp
    0.29 :   fea51:  mov    %rdi,-0x8(%rbp)
         :           return !__atomic_test_and_set(&flag,__ATOMIC_ACQUIRE);
    0.01 :   fea55:  mov    -0x8(%rbp),%rdx
    1.08 :   fea59:  mov    $0x1,%eax
    0.00 :   fea5e:  xchg   %al,(%rdx)
   97.25 :   fea60:  xor    $0x1,%eax
         :           }
    0.02 :   fea63:  pop    %rbp
    0.81 :   fea64:  retq

注意：若是编译时没有开启调试选项，则不能查看符号表，只能显示注释的反汇编，需要使用这个命令

perf annotate --stdio --dsos=slot_iterator_benchmark --symbol=terrier::common::RawConcurrentBitmap::Flip --no-source

Percent |      Source code & Disassembly of slot_iterator_benchmark for cpu-clock (33370 samples, percent: local period)
-------------------------------------------------------------------------------------------------------------------------
         :
         :
         :
         :           Disassembly of section .text:
         :
         :           00000000000fea49 <__TBB_machine_try_lock_byte(unsigned char&)>:
         :           _Z27__TBB_machine_try_lock_byteRh():
    0.52 :   fea49:  endbr64
    0.00 :   fea4d:  push   %rbp
    0.03 :   fea4e:  mov    %rsp,%rbp
    0.29 :   fea51:  mov    %rdi,-0x8(%rbp)
    0.01 :   fea55:  mov    -0x8(%rbp),%rdx
    1.08 :   fea59:  mov    $0x1,%eax
    0.00 :   fea5e:  xchg   %al,(%rdx)
   97.25 :   fea60:  xor    $0x1,%eax
    0.02 :   fea63:  pop    %rbp
    0.81 :   fea64:  retq

这个汇编指令是热点，占用了 97.25 的 cpu-clock

修复 hotspot
#

slot_iterator_benchmark 对数据库的 blocks 进行迭代，begin 和 end 调用需要加锁

修复后
#

# Total Lost Samples: 0
#
# Samples: 177K of event 'cpu-clock'
# Event count (approx.): 44291000000
#
# Overhead  Command          Shared Object
# ........  ...............  .......................
#
    75.59%  slot_iterator_b  libtbb.so.2
    20.35%  slot_iterator_b  slot_iterator_benchmark
     3.39%  slot_iterator_b  [kernel.kallsyms]
            |
            |--1.71%--__sched_text_start
            |          schedule
            |          __x64_sys_sched_yield
            |          do_syscall_64
            |          entry_SYSCALL_64_after_hwframe
            |
             --0.58%--syscall_enter_from_user_mode
                       do_syscall_64
                       entry_SYSCALL_64_after_hwframe

     0.67%  slot_iterator_b  libc-2.31.so
     0.00%  slot_iterator_b  ld-2.31.so
     0.00%  slot_iterator_b  libstdc++.so.6.0.28
     0.00%  slot_iterator_b  libgcc_s.so.1
     0.00%  slot_iterator_b  libtbbmalloc.so.2

# Samples: 1K of event 'faults'
# Event count (approx.): 159198
#
# Overhead  Command          Shared Object
# ........  ...............  .......................
#
    99.61%  slot_iterator_b  libc-2.31.so
     0.27%  slot_iterator_b  ld-2.31.so
     0.12%  slot_iterator_b  slot_iterator_benchmark
     0.00%  slot_iterator_b  [kernel.kallsyms]
     0.00%  slot_iterator_b  [vdso]

#
# (Tip: For a higher level overview, try: perf report --sort comm,dso)
#