Introduction

perf, aka perf_events, is the official Linux profiler and is included in the Linux kernel source under tools/perf. It can instrument CPU performance counters, tracepoints, kprobes, and uprobes, and it supports both system-wide and per-process profiling.

Performance Monitoring Counters (PMCs)

Show system-wide PMC statistics

The following example shows PMC statistics for the entire system, for 5 seconds.

$ perf stat -a sleep 5

 Performance counter stats for 'system wide':

        399,489.78 msec cpu-clock                 #   79.290 CPUs utilized
            21,795      context-switches          #    0.055 K/sec
               208      cpu-migrations            #    0.001 K/sec
            10,071      page-faults               #    0.025 K/sec
    15,739,163,983      cycles                    #    0.039 GHz
    11,493,495,432      instructions              #    0.73  insn per cycle
     2,091,049,131      branches                  #    5.234 M/sec
        23,987,055      branch-misses             #    1.15% of all branches

       5.038337951 seconds time elapsed

Show PMC statistics for the specified process

The following example shows PMC statistics for the fio process. The fio job file will be included later on.

$  perf stat fio fio_job_file.ini
Performance counter stats for 'fio fio_job_file.ini':

          9,500.26 msec task-clock                #    0.172 CPUs utilized
           133,844      context-switches          #    0.014 M/sec
            13,540      cpu-migrations            #    0.001 M/sec
           107,703      page-faults               #    0.011 M/sec
    21,039,694,713      cycles                    #    2.215 GHz
     8,896,839,896      instructions              #    0.42  insn per cycle
     1,755,629,641      branches                  #  184.798 M/sec
        34,380,966      branch-misses             #    1.96% of all branches

      55.143040711 seconds time elapsed

       4.520016000 seconds user
       5.921369000 seconds sys
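
The derived columns in the output are simple ratios. For example, the "insn per cycle" value is just instructions divided by cycles; a quick sketch using the numbers from the fio run above:

```shell
# insn per cycle = instructions / cycles (numbers taken from the run above)
instructions=8896839896
cycles=21039694713
awk -v i="$instructions" -v c="$cycles" \
    'BEGIN { printf "%.2f insn per cycle\n", i/c }'
# prints: 0.42 insn per cycle
```

The same arithmetic explains the other rates, e.g. branches per second is the branch count divided by the task-clock time.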

perf events

perf events can be listed using perf list. The events include hardware events, hardware cache events, software events, kernel PMU events, tracepoint events, and SDT events; only a few of each are shown below as examples.

$ perf list
List of pre-defined events (to be used in -e):

  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  context-switches OR cs                             [Software event]
  page-faults OR faults                              [Software event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event] 
  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
  cache-references OR cpu/cache-references/          [Kernel PMU event]
  cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]  
  block:block_plug                                   [Tracepoint event]
  block:block_rq_abort                               [Tracepoint event]
  block:block_rq_complete                            [Tracepoint event]
  block:block_rq_insert                              [Tracepoint event]
  block:block_rq_issue                               [Tracepoint event]
  irq:softirq_entry                                  [Tracepoint event]
  kmem:kfree                                         [Tracepoint event]
  kmem:kmalloc                                       [Tracepoint event]
  kmem:kmalloc_node                                  [Tracepoint event]
  kmem:kmem_cache_alloc                              [Tracepoint event]
  kmem:kmem_cache_alloc_node                         [Tracepoint event]
  kmem:kmem_cache_free                               [Tracepoint event]
  kmem:mm_page_alloc                                 [Tracepoint event]
  kmem:mm_page_alloc_extfrag                         [Tracepoint event]
  kmem:mm_page_alloc_zone_locked                     [Tracepoint event]
  kmem:mm_page_free                                  [Tracepoint event]
  kmem:mm_page_free_batched                          [Tracepoint event]
  kmem:mm_page_pcpu_drain                            [Tracepoint event]
  syscalls:sys_enter_write                           [Tracepoint event]
  syscalls:sys_enter_read                            [Tracepoint event]
  sdt_libc:memory_heap_free                          [SDT event]
  sdt_libc:memory_heap_less                          [SDT event]
  sdt_libc:memory_heap_more                          [SDT event]
  sdt_libc:memory_heap_new                           [SDT event]  
[...]  

PMC sample frequency

perf record uses a default sample frequency with PMCs, so not every event occurrence is recorded. This is desirable because some PMC events can fire billions of times per second, and recording every occurrence would add significant overhead.

The following two examples record the hardware cycles event and the software page-faults event. The verbose output shows that frequency-based sampling is enabled (freq=1) and the sample frequency is 4000 Hz.

$ perf record -vve cycles -a sleep 1
Using CPUID GenuineIntel-6-55-4
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  size                             112
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|CPU|PERIOD
  read_format                      ID
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
[...]
[ perf record: Captured and wrote 2.057 MB perf.data (14549 samples) ]

$ perf record -vve page-faults -a sleep 1
Using CPUID GenuineIntel-6-55-4
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x2
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|CPU|PERIOD
  read_format                      ID
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
[...]
[ perf record: Captured and wrote 1.434 MB perf.data (615 samples) ]

The sample frequency can be modified using the -F option.

$ perf record -F 99 -vve cycles -a sleep 1
Using CPUID GenuineIntel-6-55-4
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  size                             112
  { sample_period, sample_freq }   99
  sample_type                      IP|TID|TIME|CPU|PERIOD
  read_format                      ID
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
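
As a rough sanity check on the sample counts perf record reports (14549 samples at 4000 Hz above, for instance), frequency-based sampling collects at most sample_freq samples per second on each busy CPU, so the expected total is approximately freq × duration × busy CPUs. A sketch with assumed numbers:

```shell
# Approximate upper bound on samples from frequency-based sampling.
# busy_cpus is an assumed value for illustration; idle CPUs sample rarely,
# which is why real counts on a mostly idle system come in far lower.
freq=4000
seconds=1
busy_cpus=4
echo "expected at most $(( freq * seconds * busy_cpus )) samples"
# prints: expected at most 16000 samples
```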

Linux limits perf's maximum sample rate and the CPU time it may consume. These limits can be changed with sysctl if needed.

$ sysctl -a | grep kernel.perf_event_max_sample_rate
kernel.perf_event_max_sample_rate = 16000

$ sysctl -a | grep kernel.perf_cpu_time_max_percent
kernel.perf_cpu_time_max_percent = 25
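
If the defaults are too restrictive, the limits can be raised as root; the values below are illustrative, not recommendations:

```shell
# Raise the ceiling on the perf sampling rate (requires root; value is illustrative)
sysctl -w kernel.perf_event_max_sample_rate=32000

# Allow perf sampling work to consume up to 50% of CPU time (illustrative)
sysctl -w kernel.perf_cpu_time_max_percent=50
```

To make such settings persist across reboots, add them to /etc/sysctl.conf.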

CPU profiling

perf is often used for CPU profiling.

In the following example, a fio workload creates 10000 files of 100 KB each. We can profile the workload while it is running.

We increase the open-files limit from the default 1024 to 10240 so that fio can create 10000 files.

$ ulimit -a | grep "open files"
open files                      (-n) 1024

$ ulimit -n 10240

$ ulimit -a | grep "open files"
open files                      (-n) 10240

We use the following fio job file to run the workload.

$ cat fio_job_file.ini
[job1]
ioengine=libaio
iodepth=8
rw=write
direct=1
nrfiles=10000
filesize=102400
blocksize=4096
dedupe_percentage=30
buffer_compress_percentage=50
buffer_pattern="0123456789"
numjobs=1
directory=/testdir

$ fio fio_job_file.ini

We start the perf profiling in a different terminal while the above fio workload is running.

$ perf record -F 99 -p `pidof fio` -a -g -- sleep 5

Once the profiling is done, a perf.data result file is generated for later analysis. The perf report --stdio command summarizes perf.data into a text report.

From the text report, we can get a better idea of how fio works by reading the system-call graph. This is extremely useful when we need to identify which functions contribute high latency.

$ ls -la | grep perf.data
-rw-------.  1 root root   143460 Feb 25 02:25 perf.data

$ perf report --stdio > perf.report.stdio.out
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 335  of event 'cycles:ppp'
# Event count (approx.): 3215628508
#
# Children      Self  Command  Shared Object       Symbol
# ........  ........  .......  ..................  ............................................
#
    41.73%     0.00%  :198904  [kernel.kallsyms]   [k] system_call_fastpath
            |
            ---system_call_fastpath
               |
               |--30.99%--sys_io_submit
               |          |
               |          |--29.96%--do_io_submit
               |          |          |
               |          |          |--15.62%--vx_naio_write_v2
               |          |          |          vx_naio_write
               |          |          |          |
               |          |          |          |--9.73%--vx_naio_handoff
               |          |          |          |          |
               |          |          |          |           --8.57%--__wake_up
               |          |          |          |                     __wake_up_common_lock
               |          |          |          |                     __wake_up_common
               |          |          |          |                     vx_wq_wakeup_function
               |          |          |          |                     default_wake_function
               |          |          |          |                     try_to_wake_up
               |          |          |          |                     |
               |          |          |          |                      --0.87%--_raw_spin_lock_irqsave
               |          |          |          |
               |          |          |           --5.89%--vx_naio_checks.isra.5.constprop.6
               |          |          |                     |
               |          |          |                      --5.16%--vx_fel_io_allowed
               |          |          |                                |
               |          |          |                                |--2.86%--vx_irwunlock
               |          |          |                                |          vx_recsmp_rangeunlock
               |          |          |                                |          vx_rwsleep_rec_unlock
               |          |          |                                |
               |          |          |                                 --1.92%--vx_irwlock
               |          |          |                                           vx_irwlock2
               |          |          |                                           vx_recsmp_rangelock
               |          |          |                                           vx_rwsleep_rec_lock
               |          |          |                                           _raw_spin_lock_irqsave
               |          |          |
               |          |          |--7.94%--kmem_cache_alloc
               |          |          |          __slab_alloc
               |          |          |          ___slab_alloc
               |          |          |
               |          |          |--3.83%--rw_verify_area
               |          |          |          security_file_permission
               |          |          |          |
               |          |          |           --3.48%--selinux_file_permission
               |          |          |                     __inode_security_revalidate
               |          |          |
               |          |          |--1.04%--lookup_ioctx
               |          |          |
               |          |           --0.83%--fget
               |          |
               |           --0.86%--kmem_cache_alloc
               |
               |--5.52%--sys_io_getevents
               |          read_events
               |          |
               |          |--3.06%--schedule
               |          |          __schedule
               |          |          |
               |          |           --2.76%--deactivate_task
               |          |                     update_rq_clock.part.76
               |          |
               |          |--1.77%--aio_read_events
               |          |          |
               |          |           --0.58%--_cond_resched
               |          |
               |           --0.64%--prepare_to_wait
               |                     _raw_spin_lock_irqsave
               |
                --5.04%--sys_open
                          do_sys_open
                          |
                          |--4.12%--do_filp_open
                          |          path_openat
                          |          |
                          |          |--2.38%--link_path_walk
                          |          |          |
                          |          |           --0.66%--lookup_fast
                          |          |                     vx_drevalidate
                          |          |                     vx_nmspc_resolve
                          |          |                     _raw_spin_lock_irqsave
                          |          |
                          |          |--0.90%--get_empty_filp
                          |          |          |
                          |          |           --0.72%--kmem_cache_alloc
                          |          |
                          |           --0.81%--do_last
                          |                     |
                          |                      --0.56%--__audit_inode
                          |                                audit_copy_inode
                          |                                get_vfs_caps_from_disk
                          |                                vx_linux_getxattr
                          |                                vx_get_eatype
                          |
                           --0.92%--getname
                                     getname_flags
                                     kmem_cache_alloc
[...]

Tracepoint events

In the previous profile, we saw the function kmem_cache_alloc. There is a pre-defined tracepoint for it, so we can trace it directly to understand the related call graphs.

$ perf list | grep kmem_cache_alloc
  kmem:kmem_cache_alloc                              [Tracepoint event]


$ perf record -F 99 -e kmem:kmem_cache_alloc -p `pidof fio` -a -g -- sleep 5 

We can then check the call graph for the target tracepoint as shown below.

$ perf report --stdio
    14.61%    14.61%  (mmap_region+0x38c) call_site=ffffffff877fa39c ptr=0xffff9193252c5f38 bytes_req=216 bytes_alloc=216 gfp_flags=GFP_KERNEL|GFP_ZERO
            |
            ---__mmap64
               system_call_fastpath
               sys_mmap
               sys_mmap_pgoff
               vm_mmap_pgoff
               do_mmap
               mmap_region
               kmem_cache_alloc
[...]               

Another example is to check which functions issue the disk I/Os (e.g. synchronous reads/writes).

$ perf record -e block:block_rq_insert -a -g -- sleep 60

perf script can also be used to dump the individual samples. It is useful for spotting patterns over time that might be lost in a huge report summary.

$ perf script

fio 264249 [020] 3116264.020748: kmem:kmem_cache_alloc: (getname_flags+0x4f) call_site=ffffffff8785c74f ptr=0xffff9234810d5000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL
    ffffffff8782660d kmem_cache_alloc+0xfd ([kernel.kallsyms])
    ffffffff8785c74f getname_flags+0x4f ([kernel.kallsyms])
    ffffffff8785d8c5 user_path_at_empty+0x45 ([kernel.kallsyms])
    ffffffff8785d951 user_path_at+0x11 ([kernel.kallsyms])
    ffffffff87850633 vfs_fstatat+0x63 ([kernel.kallsyms])
    ffffffff87850a51 SYSC_newlstat+0x31 ([kernel.kallsyms])
    ffffffff87850ebe sys_newlstat+0xe ([kernel.kallsyms])
    ffffffff87d8fede system_call_fastpath+0x25 ([kernel.kallsyms])
        7fd8de3ea365 __lxstat64+0x15 (/usr/lib64/libc-2.17.so)
                8000 [unknown] ([unknown])
[...]

The following reports come from a profile recorded without the "-g" option; they summarize the profile without the detailed call graph.

$ perf script
    fio 276212 [047] 3116580.119986: kmem:kmem_cache_alloc: (get_empty_filp+0x5c) call_site=ffffffff8784d10c ptr=0xffff922b52f2ee00 bytes_req=256 bytes_alloc=256 g
    fio 276212 [047] 3116580.120138: kmem:kmem_cache_alloc: (sys_io_setup+0xaa) call_site=ffffffff878a182a ptr=0xffff918f24972f40 bytes_req=576 bytes_alloc=576 gfp
    fio 276212 [008] 3116580.120419: kmem:kmem_cache_alloc: (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91224f606f00 bytes_req=192 bytes_alloc=192 gf
    fio 276212 [008] 3116580.120420: kmem:kmem_cache_alloc: (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91224f606f00 bytes_req=192 bytes_alloc=192 gf
    fio 276212 [008] 3116580.120420: kmem:kmem_cache_alloc: (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91224f606f00 bytes_req=192 bytes_alloc=192 gf
    fio 276212 [008] 3116580.120425: kmem:kmem_cache_alloc: (vx_alloc+0x152) call_site=ffffffffc13ec032 ptr=0xffff91b661b0bd10 bytes_req=88 bytes_alloc=88 gfp_flag
    fio 276212 [008] 3116580.120440: kmem:kmem_cache_alloc: (getname_flags+0x4f) call_site=ffffffff8785c74f ptr=0xffff919470bd5000 bytes_req=4096 bytes_alloc=4096
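
Because perf script emits one line per event, ordinary shell tools can aggregate the output. The sketch below counts events per call site; the inline printf feeds in abbreviated sample lines standing in for a real `perf script` pipe:

```shell
# Count kmem_cache_alloc events per call site.
# In practice you would run: perf script | awk ... ;
# here abbreviated sample lines are fed inline for illustration.
printf '%s\n' \
  'fio 276212 [047] 3116580.119986: kmem:kmem_cache_alloc: (get_empty_filp+0x5c) bytes_req=256' \
  'fio 276212 [008] 3116580.120419: kmem:kmem_cache_alloc: (do_io_submit+0x194) bytes_req=192' \
  'fio 276212 [008] 3116580.120420: kmem:kmem_cache_alloc: (do_io_submit+0x194) bytes_req=192' |
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\(/) print $i }' |
sort | uniq -c | sort -rn
```

This prints the do_io_submit call site with a count of 2 and get_empty_filp with a count of 1, a miniature version of the percentages perf report computes.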


$ perf report --stdio
    12.93%  (vx_alloc+0x152) call_site=ffffffffc13ec032 ptr=0xffff90e61cd6c840 bytes_req=88 bytes_alloc=88 gfp_flags=GFP_NOFS|GFP_NOWARN|GFP_NORETRY
    12.50%  (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91cbb8190300 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO
    12.42%  (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91e20e737980 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO
    12.02%  (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff919379aea240 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO
    11.69%  (vx_alloc+0x152) call_site=ffffffffc13ec032 ptr=0xffff922ef33f91b8 bytes_req=88 bytes_alloc=88 gfp_flags=GFP_NOFS|GFP_NOWARN|GFP_NORETRY
    10.60%  (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff9153dad8be00 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO
     5.79%  (getname_flags+0x4f) call_site=ffffffff8785c74f ptr=0xffff90edba1c3000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL
     5.74%  (vx_alloc+0x152) call_site=ffffffffc13ec032 ptr=0xffff922efbb0b580 bytes_req=88 bytes_alloc=88 gfp_flags=GFP_NOFS|GFP_NOWARN|GFP_NORETRY
     4.88%  (do_io_submit+0x194) call_site=ffffffff878a1da4 ptr=0xffff91224f606a80 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO

perf stat

The perf stat subcommand can be used to count occurrences of a target event over a specified period.

The following example shows that the kmem:kmem_cache_alloc tracepoint fired 146,375 times during 5 seconds.

$ perf stat -e kmem:kmem_cache_alloc -p `pidof fio` -a -- sleep 5

 Performance counter stats for process id '9639':

           146,375      kmem:kmem_cache_alloc

       5.003282238 seconds time elapsed
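
Dividing the count by the interval gives the event rate, as sketched below with the totals from the run above:

```shell
# Event rate = total count / elapsed seconds (numbers from the run above)
events=146375
seconds=5
echo "$(( events / seconds )) kmem_cache_alloc events/sec"
# prints: 29275 kmem_cache_alloc events/sec
```

On reasonably recent perf versions, perf stat can also print running counts itself at intervals with the -I option (in milliseconds), e.g. perf stat -I 1000 for per-second counts.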

Dynamic tracing with probe events

Besides tracing the predefined events shown by the perf list command, perf can create more probe events dynamically.

kprobes

kprobes (kernel probes) can trace kernel functions and instructions.

We notice that there are many samples in low-level kernel functions such as native_apic_mem_write:

$ perf record -F 99 -p `pidof fio` -a -- sleep 5
$ perf script
  fio 225677 3122981.775077:      42831 cycles:ppp:  ffffffff87663ec0 native_apic_mem_write+0x0 ([kernel.kallsyms])
  fio 225677 3122981.775309:      62731 cycles:ppp:  ffffffff87635b48 native_sched_clock+0x58 ([kernel.kallsyms])
[...]  

Another function of interest, native_write_msr_safe, is not a pre-defined tracepoint.

$ perf list | grep native_write_msr_safe

It is, however, a kernel function.

$ cat /proc/kallsyms | grep native_write_msr_safe
ffffffff8766d5b0 t native_write_msr_safe

We add a probe event for the kernel function; it will then be listed in perf list and is ready for tracing.

$ perf probe --add native_write_msr_safe
Added new event:
  probe:native_write_msr_safe (on native_write_msr_safe)

You can now use it in all perf tools, such as:

    perf record -e probe:native_write_msr_safe -aR sleep 1

$ perf list | grep native_write_msr_safe
  probe:native_write_msr_safe                        [Tracepoint event]

$ perf record -e probe:native_write_msr_safe -p `pidof fio` -a -g sleep 5
Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.417 MB perf.data (1564 samples) ]

$ perf script
fio 241552 [020] 3123367.978632: probe:native_write_msr_safe: (ffffffff8766d5b0)
        ffffffff8766d5b1 native_write_msr_safe+0x1 ([kernel.kallsyms])
        ffffffff8770de01 clockevents_program_event+0x71 ([kernel.kallsyms])
        ffffffff8770fb03 tick_program_event+0x23 ([kernel.kallsyms])
        ffffffff876cad42 hrtimer_interrupt+0xf2 ([kernel.kallsyms])
        ffffffff8765c61b local_apic_timer_interrupt+0x3b ([kernel.kallsyms])
        ffffffff87d949d3 smp_apic_timer_interrupt+0x43 ([kernel.kallsyms])
        ffffffff87d90efa apic_timer_interrupt+0x16a ([kernel.kallsyms])
        ffffffffc13a484d vx_log+0x2bd ([kernel.kallsyms])
        ffffffffc14620cb vx_trancommit+0x70b ([kernel.kallsyms])
        ffffffffc1327fb3 vx_growfile+0x683 ([kernel.kallsyms])
        ffffffffc132ecbd vx_tran_extset+0x79d ([kernel.kallsyms])
        ffffffffc132d982 vx_extset+0x4e2 ([kernel.kallsyms])
        ffffffffc13ea139 vx_do_fallocate+0x479 ([kernel.kallsyms])
        ffffffffc13ea330 vx_fallocate+0x60 ([kernel.kallsyms])
        ffffffffc13ea3e6 vx_fallocate_v2+0x16 ([kernel.kallsyms])
        ffffffff87847d82 vfs_fallocate+0x142 ([kernel.kallsyms])
        ffffffff87848dab sys_fallocate+0x5b ([kernel.kallsyms])
        ffffffff87d8fede system_call_fastpath+0x25 ([kernel.kallsyms])
            7fcc1002a730 fallocate64+0x70 (/usr/lib64/libc-2.17.so)
[...]         

We can remove the kprobe if it’s no longer needed.

$ perf probe --del native_write_msr_safe
Removed event: probe:native_write_msr_safe

If kernel debuginfo is installed, the function's variables, including its arguments, can be inspected by perf.

$ perf probe --vars native_write_msr_safe

uprobes

uprobes can trace user-space functions in applications and libraries.

The following example adds a user probe on the fopen function in the libc library.

$ perf probe -x /usr/lib64/libc.so.6 --add fopen
Added new event:
  probe_libc:fopen     (on fopen in /usr/lib64/libc-2.17.so)

You can now use it in all perf tools, such as:

    perf record -e probe_libc:fopen -aR sleep 1

$ perf probe --del probe_libc:fopen
Removed event: probe_libc:fopen

The return of a function can be instrumented by appending %return to the function name.

$ perf probe -x /usr/lib64/libc.so.6 --add fopen%return
Added new event:
  probe_libc:fopen__return (on fopen%return in /usr/lib64/libc-2.17.so)

You can now use it in all perf tools, such as:

    perf record -e probe_libc:fopen__return -aR sleep 1

$ perf list |grep fopen
  probe_libc:fopen__return                           [Tracepoint event]

$ perf probe --del probe_libc:fopen__return
Removed event: probe_libc:fopen__return

If the system has debuginfo for the target library, the variable information including arguments might be available.

$ perf probe -x /usr/lib64/libc.so.6 --vars fopen

References

fio is an acronym for Flexible I/O Tester, a tool for I/O performance measurement.

From time to time, I use fio to run performance benchmarks, either to investigate the I/O performance of particular storage or to reproduce performance issues.

In this article, we do not aim to discuss how to use fio for general I/O performance testing. Instead, we will explore how to use it to create millions of files in the file system while simulating real-world data content.

fio provides two options, dedupe_percentage and buffer_compress_percentage, which specify the percentage of identical buffers and the percentage of compressible buffer content. To meet specific file-content requirements, the buffer_pattern option can be used to reflect real-world data content.

buffer_compress_percentage=int

If this is set, then fio will attempt to provide I/O buffer content (on WRITEs) that compresses to the specified level. Fio does this by providing a mix of random data followed by fixed pattern data. The fixed pattern is either zeros, or the pattern specified by buffer_pattern. If the buffer_pattern option is used, it might skew the compression ratio slightly. Setting buffer_compress_percentage to a value other than 100 will also enable refill_buffers in order to reduce the likelihood that adjacent blocks are so similar that they over compress when seen together. See buffer_compress_chunk for how to set a finer or coarser granularity for the random/fixed data region. Defaults to unset i.e., buffer data will not adhere to any compression level.

buffer_pattern=str

If set, fio will fill the I/O buffers with this pattern or with the contents of a file. If not set, the contents of I/O buffers are defined by the other options related to buffer contents. The setting can be any pattern of bytes, and can be prefixed with 0x for hex values. It may also be a string, where the string must then be wrapped with "". Or it may also be a filename, where the filename must be wrapped with '' in which case the file is opened and read. Note that not all the file contents will be read if that would cause the buffers to overflow. 

So, for example: buffer_pattern="0123456789"

dedupe_percentage=int

If set, fio will generate this percentage of identical buffers when writing. These buffers will be naturally dedupable. The contents of the buffers depend on what other buffer compression settings have been set. It’s possible to have the individual buffers either fully compressible, or not at all – this option only controls the distribution of unique buffers. Setting this option will also enable refill_buffers to prevent every buffer being identical.

The following example shows how we use fio to simulate real-world data.

$ fio -v
fio-3.7

$ cat fio_job_file.ini
[job1]
ioengine=libaio
iodepth=8
rw=write
direct=1
nrfiles=1000
filesize=102400
blocksize=4096
dedupe_percentage=30
buffer_compress_percentage=50
buffer_pattern="0123456789"
numjobs=8
directory=/testdir

The following example shows what the generated file content looks like. It contains the digit pattern defined by buffer_pattern. With realistic data generated this way, we can simulate production-like workloads and deliver more realistic performance results.

$ ls -la /testdir/job1.0.0
-rw-r--r--. 1 root root 102400 Feb 20 19:10 /testdir/job1.0.0

$  strings /testdir/job1.0.0 | head
-(=,(
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
2>}o
l&'TI
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
    |6=
&<y|
FSlK
t+pf|
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345

When the application source code is not available, we can trace its function calls at runtime in order to understand how the application works.

Library calls take place in user space. System calls are the fundamental interface between an application and the kernel; they are usually not invoked directly but via wrapper functions in glibc (or another library).

In this article, we will study how to use ltrace/strace/perf tracing tools to help understand the target application.

Both strace and ltrace are tracing tools that use ptrace (process trace) to inspect the internals of the target process. strace traces only system calls, while ltrace has no such restriction.

We use a Red Hat Enterprise Linux 6.2 system for this study.

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.2 (Santiago)

$ uname -r
2.6.32-220.el6.x86_64

We have ltrace, strace and perf installed in the system.

$ rpm -qa ltrace
ltrace-0.5-16.45svn.1.el6.x86_64

$ rpm -qa strace
strace-4.5.19-1.10.el6.x86_64

$ rpm -qa perf
perf-2.6.32-220.el6.x86_64

perf profiler

perf is the official Linux profiler and is well suited to CPU analysis. It can help analyze software functions as well.

$ pid=`pidof <process-name>`
$ perf record -p $pid -a -g -- sleep 10
$ perf report -n --stdio > perf.<process-name>.out

$ cat perf.<process-name>.out

# Events: 1K cycles
#
# Overhead  Samples    Command         Shared Object                       Symbol
# ........ ..........  .......         .................  ...........................
#

    10.17%   131       <process-name>  libc-2.12.so       [.] __memcpy

From the output, the __memcpy function in the library libc-2.12.so accounts for 10.17% of the samples.

Library function calls

ltrace is a library call tracer.

$ man ltrace

ltrace(1)

NAME
       ltrace - A library call tracer

SYNOPSIS
       ltrace  [-CdfhiLrStttV]  [-a  column] [-e expr] [-l filename] [-n nr] [-o filename] [-p pid] ... [-s strsize] [-u username] [-X extern] [-x extern]
       ... [--align=column] [--debug] [--demangle] [--help] [--indent=nr] [--library=filename] [--output=filename] [--version] [command [arg ...]]

DESCRIPTION
       ltrace is a program that simply runs the specified command until it exits.  It intercepts and records the dynamic library calls which are called by
       the executed process and the signals which are received by that process.  It can also intercept and print the system calls executed by the program.

       Its use is very similar to strace(1).


$ ltrace -h

Usage: ltrace [option ...] [command [arg ...]]
Trace library calls of a given program.

  -a, --align=COLUMN  align return values in a secific column.
  -c                  count time and calls, and report a summary on exit.
  -C, --demangle      decode low-level symbol names into user-level names.
  -d, --debug         print debugging info.
  -e expr             modify which events to trace.
  -f                  follow forks.
  -h, --help          display this help and exit.
  -i                  print instruction pointer at time of library call.
  -l, --library=FILE  print library calls from this library only.
  -L                  do NOT display library calls.
  -n, --indent=NR     indent output by NR spaces for each call level nesting.
  -o, --output=FILE   write the trace output to that file.
  -p PID              attach to the process with the process ID pid.
  -r                  print relative timestamps.
  -s STRLEN           specify the maximum string size to print.
  -S                  display system calls.
  -t, -tt, -ttt       print absolute timestamps.
  -T                  show the time spent inside each call.
  -u USERNAME         run command with the userid, groupid of username.
  -V, --version       output version information and exit.
  -x NAME             treat the global NAME like a library subroutine.

Report bugs to ltrace-devel@lists.alioth.debian.org

We use the following command to trace the target process's calls into the specified library. The “-c” option produces a summary report on exit.

$ ltrace -p $pid --library=/lib/libc-2.12.so -c -o ltrace.<process-name>.c.out

$ cat ltrace.<process-name>.c.out

% time     seconds  usecs/call     calls      function
=========================================================
 63.69    8.028116        2386      3364 write
 18.63    2.348951         190     12323 memcpy
 16.37    2.063338         613      3365 time
  0.62    0.078129          34      2243 access
  0.36    0.044836          19      2242 strncpy
  0.33    0.041813          18      2243 __errno_location
=========================================================
100.00   12.605183                 25780 total

From the above tracing output, the process spent 63.69% of its time in the write function call and 18.63% in the memcpy function call.

To see more detail on the individual function calls, we remove the “-c” option and trace again.

$ ltrace -p $pid --library=/lib/libc-2.12.so -o ltrace.<process-name>.out

$ cat ltrace.<process-name>.out

25236 access("", 0)                                = -1
25236 time(0x7fff8b770d58)                         = 1613691764
25236 memcpy(0x7fc462d54000, "\323j\377\305\207\037\332\234\304CH\317\270\005\036\250\271\304w"`\262\214\322\213\357\\\356\213,\300\357"..., 80384) = 0x7fc462d54000
25236 memcpy(0x7fc462d67a00, "\204\257\265P\261\241\340\232\0278\2132\272\350)\327\025Iwc\342\014#\353{\265d2|\240V\266"..., 131072) = 0x7fc462d67a00
25236 memcpy(0x7fc462d87a00, "\204\257\265Q\261\241\340\233\0278\2133\272\350)\330\025Iwd\342\014#\354{\265d3|\240V\267"..., 131072) = 0x7fc462d87a00
25236 memcpy(0x7fc462da7a00, "\204\257\265R\261\241\340\234\0278\2134\272\350)\331\025Iwe\342\014#\355{\265d4|\240V\270"..., 131072) = 0x7fc462da7a00
25236 memcpy(0x7fc462dc7a00, "\204\257\265S\261\241\340\235\0278\2135\272\350)\332\025Iwf\342\014#\356{\265d5|\240V\271"..., 50688) = 0x7fc462dc7a00
25236 __errno_location()                           = 0x7fc4631416a8
25236 write(1, "\323j\377\305\207\037\332\234\304CH\317\270\005\036\250\271\304w"`\262\214\322\213\357\\\356\213,\300\357"..., 524288) = 524288

From the above tracing output, five memcpy function calls copy a total of 524,288 bytes (80,384 + 3 × 131,072 + 50,688) in memory, followed by a write function call that writes the same amount of data to file descriptor 1.

We can read the memcpy function call definition from the library manual page.

$ man 3 memcpy

MEMCPY(3)                  Linux Programmer’s Manual                 MEMCPY(3)

NAME
       memcpy - copy memory area

SYNOPSIS
       #include <string.h>

       void *memcpy(void *dest, const void *src, size_t n);

DESCRIPTION
       The  memcpy() function copies n bytes from memory area src to memory area dest.  The memory areas should not overlap.  Use memmove(3) if the memory
       areas do overlap.

RETURN VALUE
       The memcpy() function returns a pointer to dest.

System calls

ltrace provides a “-S” option to display the system calls.

$ ltrace -p $pid --library=/lib/libc-2.12.so -S -c -o ltrace.<process-name>.l.S.c.out

$ cat ltrace.<process-name>.l.S.c.out
% time     seconds  usecs/call     calls      function
=========================================================
 41.04   10.231204        3444      2970 write
 40.78   10.166975        3420      2972 SYS_write
  9.25    2.306825         212     10877 memcpy
  8.33    2.076351        1048      1980 access
  0.22    0.055967          18      2971 time
  0.16    0.039493          19      1980 strncpy
  0.15    0.037153          18      1980 __errno_location
  0.07    0.016229           8      1980 SYS_access
  0.00    0.000037          37         1 _IO_putc
=========================================================
100.00   24.930234                 27711 total

We can also use strace to perform a similar trace of the system calls.

$ strace -p $pid -c -o strace.<process-name>.c.out

$ cat strace.<process-name>.c.out

% time     seconds  usecs/call     calls   errors syscall
=========================================================
 99.85    0.039798          11      3533           write
  0.15    0.000060           0      2356      2356 access
=========================================================
100.00    0.039858                  5889      2356 total

The function call details can be traced by removing the “-c” option.

$ strace -p $pid -o strace.<process-name>.out

$ cat strace.<process-name>.out

write(1, "\4\363\263\370EJ\n\336W,\270a\275\221;\7`\261\364\24\354_t7\6#\354\212}\v\325\230"..., 524288) = 524288

Similar to the output from ltrace, it writes 524,288 bytes of data from the buffer to file descriptor 1.

We can read the manual page to understand the parameters and return value for the write system call.

$ man 2 write

NAME
       write - write to a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);

DESCRIPTION
       write() writes up to count bytes from the buffer pointed to by buf to the file referred to by the file descriptor fd.

RETURN VALUE
       On success, the number of bytes written is returned (zero indicates nothing was written).  On error, -1 is returned, and  errno  is  set  appropriately.

If we want to know which file is actually written, we can inspect the file descriptors under /proc/$pid/fd.

$ ls -la /proc/$pid/fd
total 0
dr-x------ 2 root root  0 Feb 18 15:49 .
dr-xr-xr-x 7 root root  0 Feb 18 15:39 ..
lrwx------ 1 root root 64 Feb 18 15:49 0 -> socket:[39812145]
lrwx------ 1 root root 64 Feb 18 15:49 1 -> socket:[39812145]
lrwx------ 1 root root 64 Feb 18 15:49 2 -> socket:[39812186]

Now we understand that it actually writes to a socket rather than a regular file. This makes sense because the application is a data generator which produces data in memory and sends it to a remote server over the network.


Introduction

Memory-mapped I/O lets us map a file on disk into a buffer in memory so that, when we fetch bytes from the buffer, the corresponding bytes of the file are read. Similarly, when we store data in the buffer, the corresponding bytes are automatically written to the file. This lets us perform I/O without using read or write.

#include <sys/mman.h>

void *mmap(void *addr, size_t len, int prot, int flag, int fd, off_t off);

Returns: starting address of mapped region if OK, MAP_FAILED on error

The addr argument lets us specify the address where we want the mapped region to start. We normally set this value to 0 to allow the system to choose the starting address. The return value of this function is the starting address of the mapped area.

The fd argument is the file descriptor specifying the file that is to be mapped. We have to open this file before we can map it into the address space. The len argument is the number of bytes to map, and off is the starting offset in the file of the bytes to map.

The prot argument specifies the protection of the mapped region.

prot         Description
PROT_READ    Region can be read.
PROT_WRITE   Region can be written.
PROT_EXEC    Region can be executed.
PROT_NONE    Region cannot be accessed.

Reference

  • Advanced Programming in the UNIX Environment

Memory fragmentation

Memory page availability can be checked from /proc/buddyinfo as below.

$  cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      0      1      1      1      0      1      1      3
Node 0, zone    DMA32   3342   2441   2138   5025   1871    236      3      0      0      0      0
Node 0, zone   Normal 143135   6057 150803   4005    330     62      1      0      0      0      0

If the system log file (/var/log/messages) reports “kernel: nf_conntrack: falling back to vmalloc”, it is a strong indication that memory is fragmented.

The following command can be used to compact fragmented memory pages.

echo 1 > /proc/sys/vm/compact_memory

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in debugfs shows what the fragmentation index for each order is in each zone in the system. Values tending towards 0 imply allocations would fail due to lack of memory, values towards 1000 imply failures are due to fragmentation and -1 implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the fragmentation index is <= extfrag_threshold. The default value is 500.

Example output:

$  sysctl -a | grep extfrag_threshold
vm.extfrag_threshold = 500

$  cd /sys/kernel/debug/extfrag/
$  ls
extfrag_index  unusable_index

$  cat unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.001 0.009 0.018 0.035 0.035 0.035 0.173
Node 0, zone    DMA32 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.003 0.007 0.013 0.028
Node 0, zone   Normal 0.000 0.007 0.020 0.045 0.096 0.192 0.342 0.552 0.803 1.000 1.000
Node 1, zone   Normal 0.000 0.006 0.021 0.051 0.104 0.190 0.325 0.522 0.775 1.000 1.000

$  cat extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.955 0.978
Node 1, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.956 0.978

$  cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      0      2      1      1      0      0      1      3
Node 0, zone    DMA32      7     10     13     15     11      8      9     10      8      9    309
Node 0, zone   Normal 1134904 1124019 975094 1027980 973525 755862 528091 316182 123744      0      0
Node 1, zone   Normal 864976 1041246 994760 895916 726438 565284 415625 266573 118000      0      0

Memory overcommit

The Linux kernel supports the following overcommit handling modes:

0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

1 - Always overcommit. Appropriate for some scientific applications. Classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.

2 - Don’t overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM. Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate. Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.

The overcommit policy is set via the sysctl vm.overcommit_memory. The overcommit amount can be set via vm.overcommit_ratio (percentage) or vm.overcommit_kbytes (absolute value). The current overcommit limit and amount committed are viewable in /proc/meminfo as CommitLimit and Committed_AS respectively.

kmalloc

The kmalloc function allocates a contiguous region in physical memory. It is fast and does not clear the allocated memory content.

The kmalloc function is defined in include/linux/slab.h:

/**
 * kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate.
 *
 * kmalloc is the normal method of allocating memory
 * for objects smaller than page size in the kernel.
 *
 * The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN
 * bytes. For @size of power of two bytes, the alignment is also guaranteed
 * to be at least to the size.
 *
 * The @flags argument may be one of the GFP flags defined at
 * include/linux/gfp.h and described at
 * :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>`
 *
 * The recommended usage of the @flags is described at
 * :ref:`Documentation/core-api/memory-allocation.rst <memory_allocation>`
 *
 * Below is a brief outline of the most useful GFP flags
 *
 * %GFP_KERNEL
 *	Allocate normal kernel ram. May sleep.
 *
 * %GFP_NOWAIT
 *	Allocation will not sleep.
 *
 * %GFP_ATOMIC
 *	Allocation will not sleep.  May use emergency pools.
 *
 * %GFP_HIGHUSER
 *	Allocate memory from high memory on behalf of user.
 *
 * Also it is possible to set different flags by OR'ing
 * in one or more of the following additional @flags:
 *
 * %__GFP_HIGH
 *	This allocation has high priority and may use emergency pools.
 *
 * %__GFP_NOFAIL
 *	Indicate that this allocation is in no way allowed to fail
 *	(think twice before using).
 *
 * %__GFP_NORETRY
 *	If memory is not immediately available,
 *	then give up at once.
 *
 * %__GFP_NOWARN
 *	If allocation fails, don't issue any warnings.
 *
 * %__GFP_RETRY_MAYFAIL
 *	Try really hard to succeed the allocation but fail
 *	eventually.
 */
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
    if (__builtin_constant_p(size)) {
#ifndef CONFIG_SLOB
        unsigned int index;
#endif
        if (size > KMALLOC_MAX_CACHE_SIZE)
            return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
        index = kmalloc_index(size);

        if (!index)
            return ZERO_SIZE_PTR;

        return kmem_cache_alloc_trace(
                kmalloc_caches[kmalloc_type(flags)][index],
                flags, size);
#endif
    }
    return __kmalloc(size, flags);
}

The first argument is the size of the block to be allocated. The second argument is the memory allocation type flags, which control the behavior of kmalloc.

  • GFP_KERNEL is the most commonly used flag. It means the allocation is performed on behalf of a process running in kernel space. kmalloc can put the current process to sleep while waiting for page allocation if the system is short on memory. Free memory can be obtained either by flushing buffers to disk or by swapping out user process memory.
  • GFP_ATOMIC is used if the kmalloc call is made outside process context. The kernel can use its reserved free pages to serve the allocation, which can fail if even the reserves are exhausted.

vmalloc

vmalloc allocates a contiguous memory region in the virtual address space. The allocated physical pages are not necessarily consecutive, and each page is retrieved with a separate call to alloc_page. Thus, the vmalloc function is less efficient than kmalloc.

The vmalloc function is defined in mm/vmalloc.c:

/**
 * vmalloc - allocate virtually contiguous memory
 * @size:    allocation size
 *
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 *
 * For tight control over page level allocator and protection flags
 * use __vmalloc() instead.
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
void *vmalloc(unsigned long size)
{
    return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
                __builtin_return_address(0));
}

