What is the magic SysRq key?

It is a ‘magical’ key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up.

How do I enable the magic SysRq key?

You need to say “yes” to ‘Magic SysRq key (CONFIG_MAGIC_SYSRQ)’ when configuring the kernel. When running a kernel with SysRq compiled in, /proc/sys/kernel/sysrq controls the functions allowed to be invoked via the SysRq key. The default value in this file is set by the CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE config symbol, which itself defaults to 1. Here is the list of possible values in /proc/sys/kernel/sysrq:

  • 0 - disable sysrq completely

  • 1 - enable all functions of sysrq

  • >1 - bitmask of allowed sysrq functions (see below for detailed function description):

    2 =   0x2 - enable control of console logging level
    4 =   0x4 - enable control of keyboard (SAK, unraw)
    8 =   0x8 - enable debugging dumps of processes etc.
    

    16 = 0x10 - enable sync command
    32 = 0x20 - enable remount read-only
    64 = 0x40 - enable signalling of processes (term, kill, oom-kill)
    128 = 0x80 - allow reboot/poweroff
    256 = 0x100 - allow nicing of all RT tasks
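For instance, to allow only the sync, remount read-only, and reboot/poweroff functions, the corresponding bits are combined with a bitwise OR. A quick Python sketch of the arithmetic:

```python
# Each sysrq function is one bit; OR the bits together to build the bitmask.
SYNC       = 0x10  # enable sync command
REMOUNT_RO = 0x20  # enable remount read-only
REBOOT     = 0x80  # allow reboot/poweroff

mask = SYNC | REMOUNT_RO | REBOOT
print(mask, hex(mask))  # 176 0xb0
```

Writing either 176 or 0xb0 to /proc/sys/kernel/sysrq then enables exactly those functions.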

You can set the value in the file by the following command:

$ echo "number" >/proc/sys/kernel/sysrq

The number may be written here either as decimal or as hexadecimal with the 0x prefix. CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE must always be written in hexadecimal.

Note that the value of /proc/sys/kernel/sysrq influences only the invocation via a keyboard. Invocation of any operation via /proc/sysrq-trigger is always allowed (by a user with admin privileges).

How do I use the magic SysRq key?

Write a command key character to /proc/sysrq-trigger, e.g.:

$ echo "command key" > /proc/sysrq-trigger

The “command key” is case sensitive.

What are the command keys?

[Image: table of sysrq command keys]

What can I use them for?

If we want to check the backtrace of all active CPUs, we can do the following.

$ echo l > /proc/sysrq-trigger


$ tail -f /var/log/messages
May 21 17:49:41 host1 kernel: sysrq: Show backtrace of all active CPUs
May 21 17:49:41 host1 kernel: NMI backtrace for cpu 34
May 21 17:49:41 host1 kernel: CPU: 34 PID: 27101 Comm: bash Tainted: G           O      5.7.12-1.el7.elrepo.x86_64 #1
May 21 17:49:41 host1 kernel: Hardware name: Supermicro SYS-1029U-TN12RV/X11DPU-V, BIOS 3.4 11/03/2020
May 21 17:49:41 host1 kernel: Call Trace:
May 21 17:49:41 host1 kernel: dump_stack+0x6d/0x9a
May 21 17:49:41 host1 kernel: ? lapic_can_unplug_cpu.cold+0x40/0x40
May 21 17:49:41 host1 kernel: nmi_cpu_backtrace.cold+0x14/0x53
May 21 17:49:41 host1 kernel: nmi_trigger_cpumask_backtrace+0xd9/0xdb
May 21 17:49:41 host1 kernel: arch_trigger_cpumask_backtrace+0x19/0x20
May 21 17:49:41 host1 kernel: sysrq_handle_showallcpus+0x17/0x20
May 21 17:49:41 host1 kernel: __handle_sysrq.cold+0x48/0x111
May 21 17:49:41 host1 kernel: write_sysrq_trigger+0x28/0x37
May 21 17:49:41 host1 kernel: proc_reg_write+0x66/0x90
May 21 17:49:41 host1 kernel: __vfs_write+0x1b/0x40
May 21 17:49:41 host1 kernel: vfs_write+0xb9/0x1b0
May 21 17:49:41 host1 kernel: ksys_write+0x67/0xe0
May 21 17:49:41 host1 kernel: __x64_sys_write+0x1a/0x20
May 21 17:49:41 host1 kernel: do_syscall_64+0x60/0x1e0
May 21 17:49:41 host1 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
<...>

The same messages can also be viewed with the following commands.

$ dmesg -T
$ journalctl --since "5 minutes ago"


journalctl may be used to query the contents of the systemd journal as written by systemd-journald.service.

If called without parameters, it will show the full contents of the journal, starting with the oldest entry collected.

The following are some of the commonly used options:

  • --no-full, --full, -l Ellipsize fields when they do not fit in available columns. The default is to show full fields, allowing them to wrap or be truncated by the pager, if one is used.
  • -f, --follow Show only the most recent journal entries, and continuously print new entries as they are appended to the journal.
  • -u, --unit=UNIT|PATTERN Show messages for the specified systemd unit UNIT (such as a service unit), or for any of the units matched by PATTERN. This parameter can be specified multiple times.
  • -S, --since=, -U, --until= Start showing entries on or newer than the specified date, or on or older than the specified date, respectively.

The following commands save the journal to a file, or query it on the fly with the -f option.

$ systemctl -a | grep ntpd.service
  ntpd.service  loaded    active   running   Network Time Service
$ journalctl -u ntpd > ntpd.journal.out
$ tail ntpd.journal.out

$ journalctl -u ntpd* --since "10 minutes ago"

$ journalctl -lfu ntpd*

Flame graphs can be generated from the output of many different software profilers, covering different resources and event types. The article written by Brendan Gregg describes how flame graphs work.

We can start working with flame graphs using the following commands.

$ yum install git
$ git clone --depth 1 https://github.com/brendangregg/FlameGraph
$ cp result-15082021-201302/perf.data FlameGraph/
$ cd FlameGraph/
 
$ perf record -p $pids -a -g -- sleep 30
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf16.svg

[Image: example flame graph (perf16.svg)]


Regulus Aurelius woke up to the clanging of weapons and fierce battle cries of Roman elite soldiers during their practice session for the oncoming battles. Things weren’t going well for Rome when General Hannibal Barca of Carthage stormed into the Italian Peninsula with a raging army of 26000 troops, 6000 horses, and war elephants. Regulus, unlike the fearful Roman citizens, faced a continuous dilemma. Part of him was a loyal Roman general who wanted to defeat Carthage. The other part of him was loyal to Carthage, his homeland. Sometimes, a faint smile tugged at his mouth when frantic Roman scouts reported the losses they suffered.


Introduction

In the book System Design Interview by Lewis C. Lin, a six-step framework called PEDALS is introduced to help answer any system design question.

PEDALS stands for:

  • Process requirements
  • Estimate
  • Design the service
  • Articulate the data model
  • List the architectural components
  • Scale

This systematic method will carry you through the whole system design interview.

Process requirements

This step means clarifying the question before you answer it.

  • What is it? What does the system do? What are the goals or outcomes?
  • What features does it provide? What features should be left out?

Why is clarification important? If the requirements are not clarified,

  • the desired features may be missing or incorrectly designed
  • the interviewer's expectations may not be met
  • precious time may be wasted on unexpected areas

By clarifying the question, you prove that you can take proper action on any ambiguous problem in the real world.

Estimate

It is always a good idea to estimate the scale of the system, as it helps show whether the designed system can fulfill the functional requirements. The estimates might include:

  • Number of users
  • Number of active users (NAU)
  • Requests per second (RPS)
  • Logins per second
  • Transactions per second (TPS) for e-commerce
  • Likes/dislikes per second, shares per second, comments per second for social media sites
  • Searches per second for sites with a search feature
  • Storage needed
  • Servers needed
  • Network bandwidth needed

Based on the functional requirements above, it's possible to estimate the system requirements, including the servers, storage, and network bandwidth needed.

To estimate the hardware resources needed, we need to understand that there are four major resources in a computer system.

  • CPU
  • Memory
  • Storage
  • Network

Estimate servers needed

Modern computer systems are multi-processor systems, varying from a single CPU core to many cores. The following is a system with 32 CPU threads.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

In order to estimate how many servers are needed, we can approach it in the following manner.

  1. How much work can a single CPU do?
  2. How much work can a single server do?
  3. How many servers are needed?

Let's go through this approach with an example.

Let's say it takes 100ms for a single-core CPU system to handle a single client request. That means the system can handle 10 requests per second, so we can extrapolate that a 32-core system can handle 320 requests per second. If we have to handle 320,000 requests per second (RPS), 1000 servers are needed.

Notice that this is a rough calculation considering only CPU needs. In a real system, there may be other performance bottlenecks before a server reaches 320 requests per second; for example, the system might become I/O bound before running out of CPU capacity. But this method still gives us an estimate when considering the CPU resource alone.
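The arithmetic above can be captured in a small helper (a CPU-only sketch; servers_needed is a made-up name, and it deliberately ignores the I/O bottlenecks just mentioned):

```python
import math

def servers_needed(latency_ms, cores_per_server, target_rps):
    # CPU-only estimate: one core handles 1000/latency_ms requests per second.
    rps_per_core = 1000 / latency_ms
    rps_per_server = rps_per_core * cores_per_server
    return math.ceil(target_rps / rps_per_server)

# 100ms per request, 32-core servers, 320,000 RPS target
print(servers_needed(100, 32, 320_000))  # 1000
```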

Estimate storage needed

To estimate the storage needed, we can proceed as below.

  1. Identify the different data types
  2. Estimate the space needed for each data type
  3. Get the total space needed

Let’s take YouTube as an example to understand this approach.

  • Data types: videos, thumbnail images and comments.
  • Let's assume there are roughly 2B users and 5% of users (100M) upload videos consistently. On average, each user uploads weekly (~50 videos per year), so roughly 13M videos (100M * 50 / 365) are uploaded daily. Let's assume a video is 10 minutes long on average and takes 50MB of storage after compression, that each video has a thumbnail image of 20KB, and that each video has about 5 comments of 200 bytes each. In total, the space needed per video is 50MB + 20KB + 1KB, roughly 50MB. Multiplying by 13M videos, that is about 619TB of storage per day.
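The storage arithmetic can be sketched as follows (using the rounded 13M daily uploads and the binary MB-to-TB conversion implied by the 619TB figure):

```python
MB_PER_TB = 1024 * 1024  # binary divisors, matching the 619TB figure

videos_per_day = 13_000_000  # ~100M uploaders * 50 videos/year / 365, rounded down
mb_per_video = 50            # the 50MB video dominates the 20KB thumbnail + 1KB comments

daily_upload_mb = videos_per_day * mb_per_video
daily_upload_tb = daily_upload_mb / MB_PER_TB
print(int(daily_upload_tb))  # 619
```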

Estimate network bandwidth needed

Determine the incoming and outgoing data to estimate the network bandwidth.

  • We already know there would be ~619TB of data uploaded to YouTube in a day. Dividing by the number of seconds in a day (619TB / 86400 seconds), the incoming data rate would be about 7.3GB/s.
  • Let's say 10% of YouTube users are daily active users. With approximately 200M daily users, each watching 10 videos a day, YouTube would serve 2B views a day. This results in ~93PB of outgoing data per day. Dividing by the number of seconds in a day (93PB / 86400 seconds), the outgoing rate would be about 1128GB/s.
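The bandwidth arithmetic can be checked the same way (a sketch; the exact outgoing rate differs slightly from 1128GB/s because that figure divides the already-rounded 93PB):

```python
SECONDS_PER_DAY = 86400

# Incoming: ~619TB uploaded per day (from the storage estimate).
incoming_gb_per_s = 619 * 1024 / SECONDS_PER_DAY
print(round(incoming_gb_per_s, 1))  # 7.3

# Outgoing: 200M daily users * 10 views * 50MB per video.
outgoing_mb_per_day = 200_000_000 * 10 * 50
print(round(outgoing_mb_per_day / 1024**3))                 # 93  (PB per day)
print(round(outgoing_mb_per_day / 1024 / SECONDS_PER_DAY))  # ~1130 (GB/s)
```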

Design the service

When we processed the requirements, we should have already collected clear enough requirements for the system design. Now we need to define what to build and figure out how to build the system service.

CRUD framework

  • Create
  • Read
  • Update
  • Delete

For example, to build a YouTube-like video service, we can use CRUD to brainstorm the possible system actions.

  • Create
    • New users
    • New channels
    • Upload videos
    • Add comments
  • Read
    • View videos
    • Read comments
    • Search videos
    • Read video recommendations
  • Update
    • Edit video metadata
    • Edit comments
  • Delete
    • Delete users
    • Delete channels
    • Delete videos
    • Delete comments

Now we can further define the services (API endpoints) as below.

  • /users
  • /channels
  • /videos
  • /comments
  • /recommendations

Be RESTful: API Best Practices

  • Use nouns. REST uses the HTTP methods (GET/POST/PUT/DELETE) to read or write data; since each HTTP request already carries the corresponding verb, the endpoint names themselves should be nouns.
  • Use nesting to show hierarchy. For example, /users/1 returns the specific user with id=1.
  • Return JSON. It is the industry standard today.
  • Support filtering and pagination. They help minimize latency and waste.
  • Use plural nouns (e.g., /videos rather than /video).
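A hypothetical route table illustrating these conventions (the routes and descriptions are illustrative only, not a real API):

```python
# RESTful routes: plural nouns, HTTP verbs carry the action, nesting shows hierarchy.
routes = {
    ("POST",   "/videos"):            "upload a video (Create)",
    ("GET",    "/videos/1"):          "view the video with id=1 (Read)",
    ("GET",    "/videos/1/comments"): "read comments on video 1 (Read, nested)",
    ("PUT",    "/videos/1"):          "edit video metadata (Update)",
    ("DELETE", "/videos/1"):          "delete the video (Delete)",
}

for (verb, path), action in routes.items():
    print(f"{verb:6} {path:20} -> {action}")
```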

Design strategies

  • Information processing strategies
    • Batch strategy (scheduled processing)
    • Chain-of-command strategy
    • Checklist strategy
  • Rate limiting strategies
    • Fixed window strategy
    • Sliding window strategy
    • Token bucket strategy
    • Leaky bucket strategy
    • Limiting concurrent requests strategy
    • Critical requests strategy
  • Communication strategies
    • Middleman strategy
    • Town crier strategy
    • Asynchronous strategy
  • Latency strategies
    • Main-replica strategy
    • Push vs. pull strategy
    • Precompute strategy
    • Lazy loading strategy
    • Peer-to-peer strategy
  • Efficiency strategies
    • Divide and conquer strategy
    • Listener strategy
  • Space reduction strategies
    • Mario and Luigi strategy
  • Synchronization strategies
    • Locks strategy
  • Error handling strategies
  • Code readability, maintainability, and elegance strategies
  • Security strategies
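As one concrete example from the rate limiting strategies above, a minimal token bucket might look like this (an illustrative sketch, not code from the book):

```python
import time

class TokenBucket:
    """Token bucket rate limiter: burst up to `capacity`, sustain `refill_rate`/sec."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1)  # 3-request burst, 1 req/s sustained
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

The fourth back-to-back request is rejected because the burst budget is spent; after a second, one token has been refilled and a request is allowed again.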

Articulate the data model

  • Schema (tables, fields)
  • SQL vs. NoSQL database
    • SQL databases: MySQL, PostgreSQL, etc.
    • NoSQL databases: MongoDB, Redis, etc.
    • NoSQL databases have a flexible or unstructured database schema.
  • Non-database storage
    • Distributed file systems (e.g., store files in HDFS and file paths in a database)
    • Object storage (e.g., S3)

List the architectural components

  • Logical architecture, physical architecture, cloud architecture
  • Service-oriented architecture
  • Draw architecture diagrams

Scale

How to tackle common scale issues

What were the estimated scale requirements in step 2?

  • Number of total users
  • Number of active users
  • Requests per second(RPS)
  • Logins per second
  • Storage needed

Resolve the following system bottleneck:

  • CPU
  • Memory
  • Storage
  • Latency

We can use the following strategies to scale a small functional system.

  • Load balancing
  • Read replica databases
  • Distributed file systems
  • Content delivery networks
  • Daemon process pre-computation

Problem: Handling more users and requests

  • Horizontal scaling and load balancer
  • Vertical scaling

Problem: Handling more reads

  • Read replicas
    • Drawback: lack of data consistency
  • Sharding
    • Shard by customer or range
    • Shard by geography
    • Shard by hash function
    • Disadvantages: slower join performance, additional application code, recurring maintenance
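A minimal sketch of the shard-by-hash approach (illustrative only; NUM_SHARDS and shard_for are made-up names):

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(key: str) -> int:
    # Shard by hash function: a stable hash maps each key to one shard index.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same customer key always lands on the same shard.
print(shard_for("customer-42") == shard_for("customer-42"))  # True
print({shard_for(f"customer-{i}") for i in range(100)})      # which shards 100 keys hit
```

Note that changing NUM_SHARDS remaps most keys, which is part of the recurring maintenance cost listed above and one reason consistent hashing exists.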

Problem: Avoiding crashes

Chaos Engineering is a discipline where engineering teams purposefully create controlled failures within a large-scale system. It can emulate:

  • Failure of a data center
  • Failure of a database shard
  • Randomly thrown exceptions
  • Skew in distributed system clocks
  • Failure of a key internal microservice
  • “Storm” tests that emulate power outages

Problem: Providing data consistency

According to the CAP theorem, no distributed system can simultaneously provide all three of the following guarantees; at most two can hold:

  • Consistency - writes are immediately in effect
  • Availability - requests receive a valid response
  • Partition tolerance - the system keeps functioning despite network errors

When scaling a system, there is a trade-off between consistency and availability. When horizontally scaling, you'll see availability dramatically increase; however, consistency will suffer because the system becomes distributed.

To create stronger read consistency, you’ll have to decrease the effect of data replication your system creates, which will decrease availability.

Problem: Need to improve latency

As a system grows and creates an increasingly complex workflow, latency becomes a critical factor.

The bottleneck can be caused by system resource starvation, which can be alleviated through horizontal and vertical scaling. It can also be caused by application limitations. Another strategy to consider is caching, which helps reduce unnecessary disk I/O operations.
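As an illustrative sketch of the caching idea, Python's functools.lru_cache can stand in for an application-level cache (load_profile is a made-up function simulating a slow disk or database read):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def load_profile(user_id):
    # Stand-in for a slow disk or database read.
    time.sleep(0.01)
    return {"id": user_id}

start = time.perf_counter()
load_profile(42)                 # cache miss: pays the slow read
first = time.perf_counter() - start

start = time.perf_counter()
load_profile(42)                 # cache hit: served from memory
second = time.perf_counter() - start
print(second < first)  # True
```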

Identify and alleviate scalability bottlenecks

The complete guide to System Design in 2022 covers:

  • What is System Design?
  • System Design fundamentals
    • Horizontal and vertical scaling
    • Microservices
    • Proxy servers
    • CAP theorem
    • Redundancy and replication
  • Storage
    • Block storage
    • File storage
    • Object storage
    • Redundant Disk Arrays (RAID)
  • Message queues
    • Kafka
  • File systems
    • Google File System (GFS)
    • Hadoop Distributed File System (HDFS)
  • System Design patterns
    • Bloom filters
    • Consistent hashing
    • Quorum
    • Checksum
    • Merkle trees
    • Leader election
  • Databases
    • Relational databases: MySQL, PostgreSQL, SQL joins
    • Non-relational databases: MongoDB
    • How to choose a database
    • Database schemas
    • Database queries
    • ACID properties
    • Database sharding and partitioning
    • Database indexing
  • What are distributed systems?
    • Distributed system failures
    • Distributed system fundamentals
    • MapReduce
    • Stateless and stateful systems
    • Raft
    • Distributed system design patterns
  • Scalable web applications
    • DNS and load balancing
    • N-tier applications
    • HTTP and REST
    • Stream processing
    • Caching, cache invalidation, and cache eviction
  • Machine learning and System Design
  • Containerization and System Design
  • The cloud and System Design


Disasters can occur unexpectedly and take down a whole data center in an application region. In the real world, a well-designed disaster recovery solution helps restore the applications and related data at the time of disaster.

There are two critical metrics that measure how quickly the applications and data must be recovered and how much data loss can be tolerated.

  • Recovery Time Objective (RTO)
  • Recovery Point Objective (RPO)

[Image: RTO and RPO on a disaster recovery timeline]


Linux makes a clear distinction between the contents of a file and the information about a file. All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file.

File types

Linux files may be one of the following types.

  • Regular file
  • Directory
  • Symbolic link
  • Block device file
  • Character device file
  • Pipe (FIFO)
  • Socket

The first three file types are constituents of any Linux filesystem. Device files are related both to I/O devices, and to device drivers integrated into the kernel. When a program accesses a device file, it acts directly on the I/O device associated with that file. Pipes and sockets are special files used for interprocess communication.

File inodes

A Linux system must provide the following inode attributes, which are specified in the POSIX standard:

  • File type
  • Link count: Number of hard links associated with the file
  • File size in bytes
  • Device ID (i.e., the identifier of the device containing the file)
  • Inode number that identifies the file in the filesystem
  • UID of the file owner
  • User group ID of the file
  • Timestamps specifying the inode change time, the last access time, and the last modification time
  • Access rights/permissions (read, write, and execute) and file mode

The following is the inode struct definition in the Linux kernel source code.

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
    umode_t			i_mode;
    unsigned short		i_opflags;
    kuid_t			i_uid;
    kgid_t			i_gid;
    unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
    struct posix_acl	*i_acl;
    struct posix_acl	*i_default_acl;
#endif

    const struct inode_operations	*i_op;
    struct super_block	*i_sb;
    struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
    void			*i_security;
#endif

    /* Stat data, not accessed from path walking */
    unsigned long		i_ino;
    /*
     * Filesystems may only read i_nlink directly.  They shall use the
     * following functions for modification:
     *
     *    (set|clear|inc|drop)_nlink
     *    inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t			i_rdev;
    loff_t			i_size;
    struct timespec64	i_atime;
    struct timespec64	i_mtime;
    struct timespec64	i_ctime;
    spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
    unsigned short          i_bytes;
    u8			i_blkbits;
    u8			i_write_hint;
    blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t		i_size_seqcount;
#endif

    /* Misc */
    unsigned long		i_state;
    struct rw_semaphore	i_rwsem;

    unsigned long		dirtied_when;	/* jiffies of first dirtying */
    unsigned long		dirtied_time_when;

    struct hlist_node	i_hash;
    struct list_head	i_io_list;	/* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
    struct bdi_writeback	*i_wb;		/* the associated cgroup wb */

    /* foreign inode detection, see wbc_detach_inode() */
    int			i_wb_frn_winner;
    u16			i_wb_frn_avg_time;
    u16			i_wb_frn_history;
#endif
    struct list_head	i_lru;		/* inode LRU list */
    struct list_head	i_sb_list;
    struct list_head	i_wb_list;	/* backing dev writeback list */
    union {
        struct hlist_head	i_dentry;
        struct rcu_head		i_rcu;
    };
    atomic64_t		i_version;
    atomic64_t		i_sequence; /* see futex */
    atomic_t		i_count;
    atomic_t		i_dio_count;
    atomic_t		i_writecount;
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
    atomic_t		i_readcount; /* struct files open RO */
#endif
    union {
        const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
        void (*free_inode)(struct inode *);
    };
    struct file_lock_context	*i_flctx;
    struct address_space	i_data;
    struct list_head	i_devices;
    union {
        struct pipe_inode_info	*i_pipe;
        struct cdev		*i_cdev;
        char			*i_link;
        unsigned		i_dir_seq;
    };

    __u32			i_generation;

#ifdef CONFIG_FSNOTIFY
    __u32			i_fsnotify_mask; /* all events this inode cares about */
    struct fsnotify_mark_connector __rcu	*i_fsnotify_marks;
#endif

#ifdef CONFIG_FS_ENCRYPTION
    struct fscrypt_info	*i_crypt_info;
#endif

#ifdef CONFIG_FS_VERITY
    struct fsverity_info	*i_verity_info;
#endif

    void			*i_private; /* fs or device private pointer */
} __randomize_layout;

Display inode information

The inode data can be displayed with the stat command.

$ echo "hello inode" > testfile

$ stat testfile
  File: ‘testfile’
  Size: 12        	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 27787306    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-04-30 20:40:38.496696650 +0000
Modify: 2022-04-30 20:40:38.496696650 +0000
Change: 2022-04-30 20:40:38.496696650 +0000
 Birth: -

$ stat --format=%i testfile
27787306
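The same attributes can also be read programmatically. Below is a small Python sketch using the standard os.stat call (the inode number will differ per system; testfile is recreated so the 12-byte size matches the transcript):

```python
import os
import stat

# Create the same 12-byte file as in the transcript above.
with open("testfile", "w") as f:
    f.write("hello inode\n")

st = os.stat("testfile")
print(st.st_ino)                  # inode number (differs per system)
print(st.st_nlink)                # 1 hard link
print(st.st_size)                 # 12 bytes
print(stat.filemode(st.st_mode))  # e.g. -rw-r--r--
```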

The inode number can also be displayed with the ls -i command. The ls -l command displays the file permissions for the owner, group, and others. In the following example, the owner has read and write permissions; the group and others only have read permission.

$ ls -il testfile
27787306 -rw-r--r-- 1 root root 12 Apr 30 20:40 testfile

The inode usage information can be displayed with the df -i command.

$ df -i
Filesystem                   Inodes  IUsed     IFree IUse% Mounted on
devtmpfs                  132058034    953 132057081    1% /dev
tmpfs                     132061520      4 132061516    1% /dev/shm
tmpfs                     132061520   1270 132060250    1% /run
tmpfs                     132061520     17 132061503    1% /sys/fs/cgroup
/dev/mapper/vgroot-lvroot  97320960 188567  97132393    1% /
/dev/nvme0n1p2               128016     31    127985    1% /boot
/dev/nvme0n1p1                    0      0         0     - /boot/efi
tmpfs                     132061520      1 132061519    1% /run/user/0
tmpfs                     132061520      1 132061519    1% /var/lib/osd/lttng

The content of the filesystem superblock can be listed with the tune2fs command, and the inode-related information can be grepped out.

$ tune2fs -l /dev/mapper/vgroot-lvroot | grep -i inode
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Inode count:              97320960
Free inodes:              97167388
Inodes per group:         8192
Inode blocks per group:   512
First inode:              11
Inode size:	          256
Journal inode:            8
First orphan inode:       22157990
Journal backup:           inode blocks

The debugfs program is an interactive file system debugger. It can be used to examine and change the state of an ext2, ext3, or ext4 file system.

$ debugfs /dev/mapper/vgroot-lvroot
debugfs 1.42.9 (28-Dec-2013)

debugfs:  stat <27787306>
Inode: 27787306   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 269876152    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 12
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x626d9ec6:766bf528 -- Sat Apr 30 20:40:38 2022
 atime: 0x626d9ec6:766bf528 -- Sat Apr 30 20:40:38 2022
 mtime: 0x626d9ec6:766bf528 -- Sat Apr 30 20:40:38 2022
crtime: 0x626d9ec6:766bf528 -- Sat Apr 30 20:40:38 2022
Size of extra inode fields: 32
EXTENTS:
(0):111182849

The inode information for a directory can be displayed as below.

$ stat /
  File: ‘/’
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 2           Links: 18
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-04-14 23:50:13.812053511 +0000
Modify: 2022-04-30 20:42:56.468650708 +0000
Change: 2022-04-30 20:42:56.468650708 +0000
 Birth: -


What is MapReduce?

[Image: MapReduce overview diagram]

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

  • MapReduce consists of two distinct tasks: Map and Reduce.
  • As the name MapReduce suggests, the reduce phase takes place after the map phase has completed.
  • So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
  • The output of a Mapper or map job (key-value pairs) is input to the Reducer.
  • The reducer receives key-value pairs from multiple map jobs.
  • Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.

A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using MapReduce, finding the unique words and the number of occurrences of each.

[Image: word count MapReduce workflow]

  • First, we divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.
  • Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
  • Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
  • After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
  • So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
  • Now, each Reducer counts the values present in its list. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in the list and gives the final output: Bear, 2.
  • Finally, all the output key/value pairs are then collected and written in the output file.
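The whole word count flow above can be sketched in a few lines of Python (an illustrative single-process model of MapReduce, not Hadoop code; the three splits mirror the figure's partitioning):

```python
from collections import defaultdict

def mapper(split):
    # Map: tokenize the split and emit (word, 1) for every word.
    for word in split.split():
        yield (word, 1)

def shuffle(pairs):
    # Sort/shuffle: group intermediate values by key for the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: aggregate the list of ones into a final count.
    return key, sum(values)

# The same three splits as in the figure.
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]
pairs = [pair for split in splits for pair in mapper(split)]
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}
```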

