Wednesday, December 29, 2010


Sparsemem is a framework used on architectures where physical memory is spread across several memory banks at different places in the processor's address space. If such layouts are not handled specially, the kernel ends up creating a struct page for every page frame in the entire address space, even when large parts of it are holes containing no usable memory. A lot of memory is then wasted on bookkeeping for pages that do not exist.

In this first article of the series, I will talk about the logic and usability of Sparsemem. The second will cover the code flow and implementation of the framework, and the third will focus on testing tools and techniques for validating an MM solution.

The Sparsemem framework is a very small part of the hugely complex memory management subsystem of the Linux kernel. Its genesis can be traced back to the Discontigmem patch, which has existed in the kernel since prehistoric days (before 2.5). The Discontigmem implementation did not foresee how hardware would change or the new requirements that would be placed on it, such as memory hotplug and sparse physical memory within a node. Sparsemem filled that gap: it built on the existing Discontigmem framework, added some new concepts, unified a lot of code from different architectures, and provided support for new features like hotplug.

To understand what sparsemem is, we need to understand the memory layout of the kernel.
With large-scale machines, memory may be arranged into banks that incur a different access cost depending on their distance from the processor.

Each such bank is called a node (pg_data_t). UMA architectures such as a desktop PC have only one pg_data_t. Each node is divided into a number of blocks called zones, which represent different memory ranges. A zone can be of type ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM. Each node also has one mem_map array made up of struct page entries, one for each physical page frame present in that node.

Now, what sparsemem does is abstract the use of these mem_map[] arrays when they are discontiguous. This kind of mem_map[] is needed by discontiguous memory machines (like in the old CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.

Sparsemem divides the entire physical address space into sections. Developers set a SECTION_SIZE value (SECTION_SIZE_BITS in the kernel headers) to fine-tune the outcome. Some of these sections contain valid memory and some contain no memory at all.

               Physical Mem.        Sections
                     .                  .
                     .                  .
                     .                  .
                |          |      |           |
0x30000000      ------------      -------------  0x30000000
                |          |      |  offline  |
                |          |      -------------  0x2E000000
                |          |      |  offline  |
                |          |      -------------  0x2C000000
                |          |      |  offline  |
                |          |      -------------  0x2A000000
                |   none   |      |  offline  |
                |          |      -------------  0x28000000
                |          |      |  offline  |
                |          |      -------------  0x26000000
0x25000000      ------------      |  *online  |
                |          |      -------------  0x24000000
                |          |      |  online   |
                |  80 MiB  |      -------------  0x22000000
                |          |      |  online   |
0x20000000      ------------      -------------  0x20000000

In the above scenario the memory bank present is 80 MiB and the section size chosen is 32 MiB, so only 3 sections are valid and all the rest are invalid. The section marked *online is only partly backed by memory: the bank ends at 0x25000000, halfway through that section.
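The section count in this example can be checked with a little arithmetic. The sketch below is a userspace illustration, not kernel code: SECTION_SIZE_BITS is set to 25 (32 MiB) to match the diagram, and the helper names addr_to_section() and sections_spanned() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for this example: 32 MiB sections, i.e. 2^25 bytes. */
#define SECTION_SIZE_BITS 25
#define SECTION_SIZE      (UINT64_C(1) << SECTION_SIZE_BITS)

/* Index of the section that contains a physical address (hypothetical helper). */
static uint64_t addr_to_section(uint64_t phys_addr)
{
    return phys_addr >> SECTION_SIZE_BITS;
}

/* Number of sections touched by the bank [start, start + size). */
static uint64_t sections_spanned(uint64_t start, uint64_t size)
{
    assert(size > 0);
    uint64_t first = addr_to_section(start);
    uint64_t last  = addr_to_section(start + size - 1);
    return last - first + 1;
}
```

For the bank in the diagram, sections_spanned(0x20000000, 80 MiB) gives 3: the bank starts in section 16 and its last byte, at 0x24FFFFFF, falls in section 18.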

During boot, we identify the memory banks present in the system along with their addresses and sizes. Sections that contain part of a memory bank are marked valid, and the rest are marked invalid.
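A minimal sketch of that boot-time marking step, assuming 32 MiB sections and a 4 GiB physical address space. The names section_present[], mark_bank_present() and count_present() are hypothetical and only model the idea; the kernel's own bookkeeping is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTION_SIZE_BITS 25   /* assumed: 32 MiB sections */
#define MAX_PHYSADDR_BITS 32   /* assumed: 4 GiB physical address space */
#define NR_SECTIONS (1u << (MAX_PHYSADDR_BITS - SECTION_SIZE_BITS))   /* 128 */

/* One flag per section; false means the section is a hole. */
static bool section_present[NR_SECTIONS];

/* Mark every section overlapped by the bank [start, start + size) as valid. */
static void mark_bank_present(uint64_t start, uint64_t size)
{
    assert(size > 0 && start + size <= (UINT64_C(1) << MAX_PHYSADDR_BITS));
    uint64_t first = start >> SECTION_SIZE_BITS;
    uint64_t last  = (start + size - 1) >> SECTION_SIZE_BITS;
    for (uint64_t s = first; s <= last; s++)
        section_present[s] = true;
}

/* Count the sections that were marked valid. */
static unsigned count_present(void)
{
    unsigned n = 0;
    for (unsigned s = 0; s < NR_SECTIONS; s++)
        n += section_present[s];
    return n;
}
```

Calling mark_bank_present(0x20000000, 80 MiB) once marks sections 16, 17 and 18 and leaves the other 125 sections invalid.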

Later, when the kernel's allocator is building its inventory of usable pages, the invalid sections are skipped, saving both time and the space that their struct page entries would have taken.

Sparsemem uses an array to provide a different pfn_to_page() translation for each SECTION_SIZE area of physical memory, which in turn allows the mem_map[] to be broken up into per-section pieces.
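The per-section translation can be modelled in userspace roughly as follows. This is a simplified sketch under assumed values (4 KiB pages, 32 MiB sections, 128 sections); struct mem_section here holds only a plain pointer, and section_online() is a made-up helper that stands in for the boot-time allocation of a section's piece of the mem_map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SHIFT        12                                /* assumed: 4 KiB pages */
#define SECTION_SIZE_BITS 25                                /* assumed: 32 MiB sections */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)  /* 13 */
#define PAGES_PER_SECTION (1ul << PFN_SECTION_SHIFT)        /* 8192 */
#define NR_SECTIONS       128                               /* assumed: 4 GiB / 32 MiB */

struct page { unsigned long flags; };

/* Simplified: each section owns its own piece of the mem_map, or NULL for a hole. */
struct mem_section { struct page *map; };

static struct mem_section sections[NR_SECTIONS];

/* Hypothetical helper: allocate the struct page array for one valid section. */
static int section_online(unsigned long nr)
{
    assert(nr < NR_SECTIONS);
    if (!sections[nr].map)
        sections[nr].map = calloc(PAGES_PER_SECTION, sizeof(struct page));
    return sections[nr].map ? 0 : -1;
}

/* Translate a page frame number through the section table. */
static struct page *pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;
    if (nr >= NR_SECTIONS || !sections[nr].map)
        return NULL;                       /* hole: no struct page exists */
    return &sections[nr].map[pfn & (PAGES_PER_SECTION - 1)];
}
```

Only sections brought online consume memory for struct page entries; a lookup in any other section simply returns NULL in this model.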

In order to do quick page_to_pfn() operations (the reverse of pfn_to_page()), the section number of the page is encoded in page->flags. Part of the sparsemem infrastructure enables these bits to be shared more dynamically (at compile time) between the page_zone() and sparsemem operations.
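Packing the section number into the upper bits of page->flags, next to the zone bits, might look like the sketch below. The macro names follow the kernel's naming style, but the field widths and the 32-bit flags word are assumptions picked for illustration, and set_page_section(), set_page_zone_id() and page_zone_id() are simplified stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define FLAGS_BITS     32   /* assumed width of page->flags */
#define SECTIONS_WIDTH 7    /* assumed: 128 sections */
#define ZONES_WIDTH    2    /* assumed: up to 4 zones */

/* Section bits sit at the very top, zone bits directly below them. */
#define SECTIONS_PGSHIFT (FLAGS_BITS - SECTIONS_WIDTH)      /* 25 */
#define ZONES_PGSHIFT    (SECTIONS_PGSHIFT - ZONES_WIDTH)   /* 23 */
#define SECTIONS_MASK    ((UINT32_C(1) << SECTIONS_WIDTH) - 1)
#define ZONES_MASK       ((UINT32_C(1) << ZONES_WIDTH) - 1)

struct page { uint32_t flags; };

static void set_page_section(struct page *pg, uint32_t section)
{
    assert(section <= SECTIONS_MASK);
    pg->flags &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);   /* clear old value */
    pg->flags |= section << SECTIONS_PGSHIFT;
}

static uint32_t page_to_section(const struct page *pg)
{
    return (pg->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}

static void set_page_zone_id(struct page *pg, uint32_t zone)
{
    assert(zone <= ZONES_MASK);
    pg->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
    pg->flags |= zone << ZONES_PGSHIFT;
}

static uint32_t page_zone_id(const struct page *pg)
{
    return (pg->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}
```

Because both widths are compile-time constants, widening one field automatically shifts the other, which is the kind of sharing of bits the paragraph above describes.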

The major issues with SPARSEMEM are the cost of the extra indirection on every page lookup and the overhead when there are huge holes between comparatively small memory banks.
To address them, some variations have been put in place. The most noteworthy are:

1) Virtual mem_map -- This implementation is specifically for 64-bit architectures. The mem_map is mapped into a virtually contiguous area and only the active sections are physically backed. This allows virt_to_page, page_address and others to become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations, pfn_to_page and page_to_pfn, become:
#define __pfn_to_page(pfn)      (vmemmap + (pfn))
#define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap, we allow simple access without wasting physical memory.  As kernel memory is typically already mapped 1:1, this introduces no additional overhead.

The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This makes a virtual memmap difficult to use on 32-bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1:1 kernel space using TLBs (true of IA64 and x86_64, for example), then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset.
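To see why 36 address bits are a problem on a 32-bit platform, it helps to work out how much virtual space the memmap would need. The helper below is a hypothetical calculation, not a kernel function, and the 32-byte struct page used in the example is an assumption; the real size depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Virtual space needed to hold one struct page for every page frame
 * in a physical address space of phys_addr_bits bits.
 */
static uint64_t vmemmap_bytes(unsigned phys_addr_bits, unsigned page_shift,
                              uint64_t struct_page_size)
{
    assert(phys_addr_bits > page_shift && phys_addr_bits - page_shift < 64);
    uint64_t nr_pages = UINT64_C(1) << (phys_addr_bits - page_shift);
    return nr_pages * struct_page_size;
}
```

With 36 address bits, 4 KiB pages and the assumed 32 bytes per struct page, vmemmap_bytes(36, 12, 32) comes to 2^24 * 32 bytes = 512 MiB, a very large slice of the kernel's virtual address space on a 32-bit machine.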

2) SPARSEMEM_EXTREME -- makes mem_section a one-dimensional array of pointers to mem_sections. This two-level layout scheme achieves smaller memory requirements for SPARSEMEM, with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM -mm implementation is a one-dimensional array of mem_sections, which is the default SPARSEMEM configuration. The patch attempts to isolate the implementation details of the physical layout of the sparsemem section array.

ARCH_SPARSEMEM_EXTREME depends on 64BIT and defaults to boolean false. The disadvantage of SPARSEMEM_EXTREME is that it costs you the extra level in the lookup.
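The two-level layout and its extra lookup step can be sketched in userspace like this. The macro names are modelled on the kernel's, but the sizes are assumed, struct mem_section is reduced to a single field, and section_root_alloc() is a made-up helper that uses calloc() where the kernel would use its boot-time allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_MEM_SECTIONS   4096   /* assumed total number of sections */
#define SECTIONS_PER_ROOT 256    /* assumed number of sections per root */
#define NR_SECTION_ROOTS  (NR_MEM_SECTIONS / SECTIONS_PER_ROOT)   /* 16 */
#define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1)

struct mem_section { unsigned long section_mem_map; };

/* First level: an array of pointers; a NULL root costs no second-level memory. */
static struct mem_section *mem_section[NR_SECTION_ROOTS];

/* Hypothetical helper: make sure the root covering section nr exists. */
static int section_root_alloc(unsigned long nr)
{
    if (nr >= NR_MEM_SECTIONS)
        return -1;
    unsigned long root = nr / SECTIONS_PER_ROOT;
    if (!mem_section[root])
        mem_section[root] = calloc(SECTIONS_PER_ROOT, sizeof(struct mem_section));
    return mem_section[root] ? 0 : -1;
}

/* Two-level lookup: one extra divide (shift) and load compared to a flat array. */
static struct mem_section *nr_to_section(unsigned long nr)
{
    if (nr >= NR_MEM_SECTIONS)
        return NULL;
    struct mem_section *root = mem_section[nr / SECTIONS_PER_ROOT];
    return root ? &root[nr & SECTION_ROOT_MASK] : NULL;
}
```

In this model a machine whose memory all falls under one root allocates a single block of 256 entries instead of a flat array of 4096, at the price of the extra load in nr_to_section().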

SPARSEMEM_EXTREME requires bootmem to be functioning at the time of memory_present() calls.  This is not always feasible, so architectures which do not need it may allocate everything statically by using

References:
1) Kernel Documentation.txt
2) Various LKML discussions
3) Understanding the Linux Virtual Memory Manager by Mel Gorman.