Author: R. Koucha
Last update: 23-Nov-2020
Huge pages setup on the Raspberry Pi 4B
Introduction
1 Transparent Huge Pages
2 Explicit Huge Pages
    2.1 Kernel configuration
    2.2 Kernel source code
    2.3 Example
3 Early reservation
About the author
Introduction

Huge pages are a way to improve application performance by reducing the number of TLB (Translation Lookaside Buffer) misses. The mechanism coalesces contiguous standard physical pages (typically 4 KB each) into a bigger one (e.g. 2 MB). Linux implements this feature in two flavors: transparent huge pages and explicit huge pages.

1 Transparent Huge Pages

Transparent huge pages (THP) are managed transparently by the kernel. User space applications have no direct control over them, apart from hints given with madvise() when the madvise mode is selected. The kernel does its best to allocate huge pages whenever possible, but this is not guaranteed. Moreover, THP may introduce overhead: a kernel daemon named khugepaged, acting as a sort of "garbage collector", is in charge of coalescing physical pages into huge pages. This may consume CPU time, with undesirable effects on the performance of the running applications. On systems with time-critical applications, it is generally advised to deactivate THP.

THP can be disabled on the boot command line (cf. section 3) or from the shell through sysfs:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

N.B.: Some interesting papers exist on the performance evaluation and issues of THP.

2 Explicit Huge Pages
2.1 Kernel configuration

If huge pages are required at application level (i.e. from user space), the HUGETLBFS kernel configuration option must be set to activate the hugetlbfs pseudo-filesystem (the menu in the kernel configurator is something like: "File systems" --> "Pseudo filesystems" --> "HugeTLB file system support").

For example, on an Ubuntu system, we can check:

$ cat /boot/config-5.4.0-53-generic | grep HUGETLBFS
CONFIG_HUGETLBFS=y

On a Raspberry Pi running Raspberry Pi OS, /proc/config.gz can be enabled in the kernel configuration: "General setup" --> "Kernel .config support" + "Enable access to .config through /proc/config.gz".

Then, it is possible to check the kernel configuration with zcat:

$ zcat /proc/config.gz | grep HUGETLBFS
CONFIG_HUGETLBFS=y

Once this configuration is in place, the huge page sizes supported by the architecture are listed in the /sys/kernel/mm/hugepages directory:

$ ls -l /sys/kernel/mm/hugepages
total 0
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-1048576kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-2048kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-32768kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-64kB

The default huge page size is displayed in the Hugepagesize field of /proc/meminfo:

$ cat /proc/meminfo
[...]
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

2.2 Kernel source code

In the kernel source tree the HUGETLBFS parameter is in fs/Kconfig:

config HUGETLBFS
    bool "HugeTLB file system support"
    depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
           SYS_SUPPORTS_HUGETLBFS || BROKEN
    help
      hugetlbfs is a filesystem backing for HugeTLB pages, based on
      ramfs. For architectures that support it, say Y here and read
      <file:Documentation/admin-guide/mm/hugetlbpage.rst> for details.

      If unsure, say N.

When this parameter is set, the hugetlbfs pseudo-filesystem is added to the kernel build (cf. fs/Makefile):

obj-$(CONFIG_HUGETLBFS)     += hugetlbfs/

The source code of hugetlbfs is located in fs/hugetlbfs/inode.c. At startup, the kernel mounts an internal hugetlbfs file system for each huge page size supported by the architecture it is running on. The mount for the default size is mandatory, the others are optional:

static int __init init_hugetlbfs_fs(void)
{
    struct vfsmount *mnt;
    struct hstate *h;
    int error;
    int i;

    if (!hugepages_supported()) {
        pr_info("disabling because there are no supported hugepage sizes\n");
        return -ENOTSUPP;
    }

    error = -ENOMEM;
    hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
                    sizeof(struct hugetlbfs_inode_info),
                    0, SLAB_ACCOUNT, init_once);
    if (hugetlbfs_inode_cachep == NULL)
        goto out;

    error = register_filesystem(&hugetlbfs_fs_type);
    if (error)
        goto out_free;

    /* default hstate mount is required */
    mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
    if (IS_ERR(mnt)) {
        error = PTR_ERR(mnt);
        goto out_unreg;
    }
    hugetlbfs_vfsmount[default_hstate_idx] = mnt;

    /* other hstates are optional */
    i = 0;
    for_each_hstate(h) {
        if (i == default_hstate_idx) {
            i++;
            continue;
        }

        mnt = mount_one_hugetlbfs(h);
        if (IS_ERR(mnt))
            hugetlbfs_vfsmount[i] = NULL;
        else
            hugetlbfs_vfsmount[i] = mnt;
        i++;
    }

    return 0;

 out_unreg:
    (void)unregister_filesystem(&hugetlbfs_fs_type);
 out_free:
    kmem_cache_destroy(hugetlbfs_inode_cachep);
 out:
    return error;
}

Several parameters related to huge pages can be set up on the boot command line. Those concerning hugetlbfs are parsed in mm/hugetlb.c:

static int __init hugetlb_nrpages_setup(char *s)
{
	unsigned long *mhp;
	static unsigned long *last_mhp;

	if (!parsed_valid_hugepagesz) {
		pr_warn("hugepages = %s preceded by "
			"an unsupported hugepagesz, ignoring\n", s);
		parsed_valid_hugepagesz = true;
		return 1;
	}
	/*
	 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
	 * so this hugepages= parameter goes to the "default hstate".
	 */
	else if (!hugetlb_max_hstate)
		mhp = &default_hstate_max_huge_pages;
	else
		mhp = &parsed_hstate->max_huge_pages;

	if (mhp == last_mhp) {
		pr_warn("hugepages= specified twice without interleaving hugepagesz=, ignoring\n");
		return 1;
	}

	if (sscanf(s, "%lu", mhp) <= 0)
		*mhp = 0;

	/*
	 * Global state is always initialized later in hugetlb_init.
	 * But we need to allocate >= MAX_ORDER hstates here early to still
	 * use the bootmem allocator.
	 */
	if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
		hugetlb_hstate_alloc_pages(parsed_hstate);

	last_mhp = mhp;

	return 1;
}
__setup("hugepages=", hugetlb_nrpages_setup);

static int __init hugetlb_default_setup(char *s)
{
	default_hstate_size = memparse(s, &s);
	return 1;
}
__setup("default_hugepagesz=", hugetlb_default_setup);

If not specified on the boot command line, the default huge page size is set to HPAGE_SIZE in the initialization function of the mm/hugetlb.c file:

static int __init hugetlb_init(void)
{
	int i;

	if (!hugepages_supported())
		return 0;

	if (!size_to_hstate(default_hstate_size)) {
		if (default_hstate_size != 0) {
			pr_err("HugeTLB: unsupported default_hugepagesz %lu. Reverting to %lu\n",
			       default_hstate_size, HPAGE_SIZE);
		}

		default_hstate_size = HPAGE_SIZE;
		if (!size_to_hstate(default_hstate_size))
			hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
	}
	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
	if (default_hstate_max_huge_pages) {
		if (!default_hstate.max_huge_pages)
			default_hstate.max_huge_pages = default_hstate_max_huge_pages;
	}

	hugetlb_init_hstates();
	gather_bootmem_prealloc();
	report_hugepages();

	hugetlb_sysfs_init();
	hugetlb_register_all_nodes();
	hugetlb_cgroup_file_init();

#ifdef CONFIG_SMP
	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
#else
	num_fault_mutexes = 1;
#endif
	hugetlb_fault_mutex_table =
		kmalloc_array(num_fault_mutexes, sizeof(struct mutex),
			      GFP_KERNEL);
	BUG_ON(!hugetlb_fault_mutex_table);

	for (i = 0; i < num_fault_mutexes; i++)
		mutex_init(&hugetlb_fault_mutex_table[i]);
	return 0;
}
subsys_initcall(hugetlb_init);

The value of the HPAGE_SIZE constant is architecture dependent. For the 32-bit ARM platform with LPAE (3-level page tables), it is set in arch/arm/include/asm/pgtable-3level.h:

/*
 * PMD_SHIFT determines the size a middle-level page table entry can map.
 */
#define PMD_SHIFT		21
[...]
/*
 * Hugetlb definitions.
 */
#define HPAGE_SHIFT		PMD_SHIFT
#define HPAGE_SIZE		(_AC(1, UL) << HPAGE_SHIFT)

In other words, the default huge page size is equal to (1 << 21) = 2^21 = 2097152 bytes = 2 MB on the ARM architecture.

2.3 Example

A hugetlbfs file system is a sort of RAM file system in which the kernel creates files to back the memory regions mapped by the applications.

The required number of huge pages can be reserved by writing it into /sys/kernel/mm/hugepages/hugepages-[hugepagesize]/nr_hugepages.

Then, mmap() is able to map some part of the application address space onto huge pages. Here is an example showing how to do it. The program maps a huge page of 2 MB and pauses until the operator types CTRL-C. Upon CTRL-C, the program goes on to write 0xff into the memory segment, and it ends as soon as the operator types CTRL-C again:

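#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>

#define HP_SIZE  (2 * 1024 * 1024) // <-- Adjust to the default huge page size of your system

// Last signal received, reported by main() after pause() returns
static volatile sig_atomic_t last_sig;

// The handler only records the signal: printf() is not async-signal-safe
static void handler(int sig)
{
  last_sig = sig;
}

int main(void)
{
  char *addr;

  signal(SIGINT, handler);

  // Map a huge page (anonymous shared mapping backed by hugetlbfs)
  addr = mmap(NULL, HP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
  if (addr == MAP_FAILED) {
    perror("mmap()");
    return 1;
  }

  printf("Mapping located at address: %p\n", (void *)addr);

  pause(); // Wait for the first CTRL-C
  printf("Signal %d\n", (int)last_sig);

  printf("Writing into the memory area...\n");

  memset(addr, 0xff, HP_SIZE); // <--- Triggers effective allocation of the physical page

  pause(); // Wait for the second CTRL-C
  printf("Signal %d\n", (int)last_sig);

  printf("Exiting...\n");

  munmap(addr, HP_SIZE); // Release the huge page

  return 0;
}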

$ gcc alloc_hp.c -o alloc_hp
$ ./alloc_hp
mmap(): Cannot allocate memory
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
0
$ sudo sh -c "echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1
$ ./alloc_hp 
Mapping located at address: 0x2000000000

While the program is blocked in the pause() system call, its memory map can be observed from another terminal to verify the size of the memory page:

$ pidof alloc_hp
13009
$ cat /proc/13009/smaps
[...]
2000000000-2000200000 rw-s 00000000 00:0e 20412   /anon_hugepage (deleted)
Size:               2048 kB
KernelPageSize:     2048 kB   <----- The page size is 2MB
MMUPageSize:        2048 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            0
[...]

In the preceding map, the file name /anon_hugepage of the huge page region is generated internally by the kernel. It is marked deleted because the kernel removes the associated memory file right away: the file disappears as soon as there are no more references to it (e.g. when the calling process ends, the underlying file is closed upon exit(), the reference counter on the file drops to 0 and the remove operation completes).

Back in the first terminal, type CTRL-C to make the program go on and write 0xff into the memory area:

$ ./alloc_hp 
Mapping located at address: 0x2000000000
^CSignal 2
Writing into the memory area...

In the other terminal, the process memory map shows that the Private_Hugetlb field is updated:

$ cat /proc/13009/smaps
[...]
2000000000-2000200000 rw-s 00000000 00:0e 20412   /anon_hugepage (deleted)
Size:               2048 kB
KernelPageSize:     2048 kB   <----- The page size is 2MB
MMUPageSize:        2048 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:    2048 kB    <----- Amount of memory backed by huge pages
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:            0
VmFlags: rd wr sh mr mw me ms de ht

According to the Linux documentation, Private_Hugetlb shows the amount of memory backed by huge pages, which is not counted in the "RSS" or "PSS" fields for historical reasons.

3 Early reservation

Huge pages are made of consecutive physical memory pages. The reservation should be done early during system startup (especially on heavily loaded systems), as the physical memory may become so fragmented that it is sometimes impossible to allocate huge pages afterward. To reserve them as early as possible, the following parameters can be passed on the kernel boot command line:

default_hugepagesz=
	[HW] The size of the default HugeTLB page. This is
	the size represented by the legacy /proc/ hugepages
	APIs.  In addition, this is the default hugetlb size
	used for shmget(), mmap() and mounting hugetlbfs
	filesystems.  If not specified, defaults to the
	architecture's default huge page size.  Huge page
	sizes are architecture dependent.  See also
	Documentation/admin-guide/mm/hugetlbpage.rst.
	Format: size[KMG]

hugepages=  
       [HW] Number of HugeTLB pages to allocate at boot.
       If this follows hugepagesz (below), it specifies
       the number of pages of hugepagesz to be allocated.
       If this is the first HugeTLB parameter on the command
       line, it specifies the number of pages to allocate for
       the default huge page size.  See also
       Documentation/admin-guide/mm/hugetlbpage.rst.
       Format: <integer>

hugepagesz=
        [HW] The size of the HugeTLB pages.  This is used in
        conjunction with hugepages (above) to allocate huge
        pages of a specific size at boot.  The pair
        hugepagesz=X hugepages=Y can be specified once for
        each supported huge page size. Huge page sizes are
        architecture dependent.  See also
        Documentation/admin-guide/mm/hugetlbpage.rst.
        Format: size[KMG]

transparent_hugepage=
        [KNL]
        Format: [always|madvise|never]
        Can be used to control the default behavior of the system
        with respect to transparent hugepages.
        See Documentation/admin-guide/mm/transhuge.rst
        for more details.

On Raspberry Pi OS, the boot command line can typically be updated in /boot/cmdline.txt and the current boot command line used by the running kernel can be seen in /proc/cmdline.

About the author

The author is an engineer in computer science located in France. He can be contacted here.