Wednesday, January 19, 2011

Cache Memory demystified.

What is Cache ?
The webster dictionary defines Cache as "a safe place for hiding and storing things."
For software developers like me, cache memory is a temporary storage place for information requested most frequently. It consists of relatively small area of fast local memory. As a result, the information stored may be delivered much quicker than it would be done from slower external devices, such as operating memory or disk subsystem.

Why is it Needed ?
In simple terms, its needed to speed up the operations.
e.g. Time to run code = clock cycles used for running the code + clock cycles spent waiting for memory fetch. For many years now, CPUs all over the world have sped up by more than 50% per year, over memory chip speed ups. Hence memory access is a bottleneck to fast computing.

Cache in short is that part of memory hierarchy which is used to improve average access time to slower memory. The hierarchy is defined as follows :

<----- FASTER -----

Registers -- L1 Cache -- L2 Cache -- RAM -- Disk/Tape

----- BIGGER ---->

Cache Basics

Before I go any further, let me put out a glossary of few Cache related terms, we will be using a lot:
I-Cache - Instruction Cache
D-Cache - Data Cache
U-Cache - Unified Cache
S-Cache - Secondary Cache
T-Cache - Tertiary Cache
B-Cache - Back up Cache
E-Box - Integer execution Unit / Arithmetic Logic Unit.
I-Box - Decoder and Scheduler Unit
F-Box - Floating Point Unit
A-Box - Load and Store Unit
C-Box - Cache and System Bus Controller Unit.

Now with that out of the way, let me bring out the big guns.

In any computer system, cache memory is separated logically and physically for so-called levels. The internal memory built-in to the processor core is called L1-Cache or Level 1 Cache and the external cache memory located either on processor module or the main board is called L2-Cache or Level 2 Cache.

Cache memory on the first level is further subdivided into two regions, of roughly same size, called I-Cache and D-Cache. Cache memory of the second level is almost always unified. The reason being the Cache memory of second level doesn't need to be as fast as of the first level. In real life I-Cache and D-Cache satisfy about 80%-90% of all requests, usually, so it makes room for a trade off.

A general Cache based system

Cache functioning
At run time, while executing some instruction, if there's no data available in the register, a request has to be generated and dispatched to the nearest cache level i.e. to D-Cache. If a miss returns, the request goes redirected towards the next cache level and so on and so forth. Then while retrieving the data, the system first writes the data from memory to cache and then cache to register.

If you wish to save/write some data, the resulting operation is in opposite direction as in, the data is first written to register, then to D-Cache, then to L2 Cache and if required, then to memory (Depending on the write policy being followed: Write-through or Write-back.)

To reduce the cost of memory transfer, more than one element is loaded into cache, for each request. The unit of transfer is called a cache block or cache line. The size of this line is architecture dependent and is usually expressed in bytes. A line is typically between 32 - 128 bytes long.

Every address field consists of two primary parts : a dynamic (tag) which contains higher address bits, and a static (index) which contains the lower address bit. The first one may be modified during run time but the second one is fixed. Imagine that there is some physical memory space organized in the form of M memory segments, sized equally. Suppose those memory segments and cache segments(lines) are of exactly same size. Therefore to obtain an address of a memory line, it needs to determine the number of its memory segment first and the number of memory line inside of that segment, second, and then to merge both numbers.


Cache mapping
There are three general methods for mapping main memory to cache.
1) Direct mapping : In direct mapping, a block from main memory can go in exactly one place in the cache. It works by doing the following simple calculation for each memory address.
Cache address = memory address MODULO cache size.
e.g. location 0 in the cache memory can be occupied by location 0,4,8... from main memory.


2) Set associative : N-way set associative cache memory means that information stored at some address in operating memory could be placed (cached) in N locations (lines) of this cache memory. So if the memory is logically segmented then each segment will have exactly one line capable of caching information located at some memory address. So when the data from a certain address is required, the appropriate lines in all the segments are queried and a hit or miss is delivered.


3) Fully associative : In this arrangement, a block from main memory can be placed in any location in the cache.


There are several algorithms to choose from for replacing the cache lines in associative memory cases.


  • LRU (Least Recently Used) - A line containing information with oldest last hit is replaced.
  • LRR (Least Recently Replaced) / FIFO (First In First Out) - The line containing information with oldest record is replaced.
  • LFU (Least Frequently Used) - A line containing information which is been requested least often is replaced.
  • Random - A random line is chosen for replacement.
Direct mapped cache is faster than fully associative but is least efficient because controller has no choice while writing because there is only one line available to choose from . So significant performance degradation occur if there are two or more memory addresses challenging each other for a single cache line. Fully associative cache is slowest yet most efficient for the same above stated reasons.

Cache Read - uses the principal of locality to bring pages into cache. There are two different types of localities :

  • Temporal Locality (Locality in Time) - states that if an item is referenced, it will tend to be referenced again soon. e.g. loops, reuse etc.
  • Spatial Locality (Locality in Space) - states that if an item is referenced, items whose addresses are close by tend to be referenced soon. e.g. array access
Both of these principles are satisfied by moving blocks/lines instead of words to the memory for each request. Optimal line size is based on properties of data bus and memory subsystem.

Cache Write - There are 2 main approaches for writing data to memory.

  • Write Through - states that the information is written to both the block in Cache and the block in lower level memory at the same time. This approach is simpler but costly.
  • Write Back - states that the information is written only to the block in the Cache. The modified block is written to the main memory only when it is replaced in Cache. This method is less costly as it reduces the memory writes but is more complicated as a lot of checks need to be in place to prevent data corruption.
That's all from me at the moment. There will be more to come. Please let me know of any comments.

References:
Sun cache document
Various documents on the Internet.
 

Friday, January 7, 2011

Linux Memory Management Notes

Here, I will talk about some random MM things that I learned while working on it a while back. It might be useful for others. I am defining things here for MIPS but it could be tweaked for other archs as well.

1) How to define the no. of memory nodes contained it a system ??
Soln:- Its  a two step process.

a) In the .config file select CONFIG_SYS_SUPPORTS_NUMA & then CONFIG_NODES_SHIFT=<no. of nodes^2)

b) The second step is :
    in arch/mips/include/asm/mmzone.h, define the following:
   pg_data_t discontig_node_data[]
   #define NODE_DATA(nid) (&discontig_node_data[nid])
   #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map)

& define discontig_node_data  in arch/mips/mm/discontig.c
e.g.

pg_data_t discontig_node_data[MAX_NUMNODES] = {
{ .bdata = &bootmem_node_data[0]  },
.
.
.
};

and finally define MAX_NUMNODES in include/linux/numa.h

2) How many page descriptors can a page hold ??
Soln:
1 page typically is 4KB (4*1024 --> 4096)
1 page descriptor is tyically 32 bytes.

So the no. of page descriptors in a page are:- 4096/32 --> 128 page descriptors

3) What does PFN_UP / PFN_DOWN macros do ??
Soln:
PFN_UP macro gives a page frame number after rounding the given address to the next page frame boundary
PFN_DOWN macro gives a page frame number that contains the address passed to this macro.

(to be contd....)

Thursday, January 6, 2011

Sparsemem-III

This is my third article in the series for Sparsemem. First one dealt with the logic and usability of sparsemem. Second one discussed the code flow and implementation of the Sparsemem Framework.

In this article I will be talking about various testing tools to validate your implentation of sparsemem on your system. Most of these tools are for  generic memory testing and can be used for other MM areas as well.

The most affected area of sparsemem usage is the calls to pfn_to_page() and page_to_pfn(). This can be tested by the following methods:

 1. page fault
 2. memory reclaim
 3. slab(slub/slob) (virt_to_page() will be called.)
 4. high-order page allocator

But 1 and 2 is too heavy to measure the influence. 3 or 4 can be done and out of these 3 is better.

If you don't want to test-in-the-kernel, reading /proc/kpagecount may call tons of pfn_to_page() ....but copy information between userspace <-> kernel is very heavy in general. So, in-kernel test will yield much better results.

The following patch by christoph lameter is pretty useful for doing page faults on bootup and display results on the terminal.
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg190559.html

The other memory testing tools are as follows:
1) Memtester --  It is a GPL'd userspace utility for testing the memory subsystem for faults
It can be found at the following location:
http://pyropus.ca/software/memtester/
It has a output like the following:

<inux#> ./memtester 256 2
memtester version 4.2.0 (32-bit)
Copyright (C) 2010 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffff000
want 256MB (268435456 bytes)
got  256MB (268435456 bytes), trying mlock ...locked.
Loop 1/2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok
Loop 2/2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok
Done.

2) Stress -- stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system.
It can be found here: http://weather.ou.edu/~apw/projects/stress/

Output:
<Linux#> ./stress --cpu 2 --io 1 --vm 1 --vm-bytes 128M --timeout 10s --verbose
stress: info: [419] dispatching hogs: 2 cpu, 1 io, 1 vm, 0 hdd
stress: dbug: [419] using backoff sleep of 12000us
stress: dbug: [419] setting timeout to 10s
stress: dbug: [419] --> hogcpu worker 2 [420] forked
stress: dbug: [419] --> hogio worker 1 [421] forked
stress: dbug: [419] --> hogvm worker 1 [422] forked
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [419] using backoff sleep of 3000us
stress: dbug: [419] setting timeout to 10s
stress: dbug: [419] --> hogcpu worker 1 [423] forked
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [419] <-- worker 421 signalled normally
stress: dbug: [419] <-- worker 420 signalled normally
stress: dbug: [419] <-- worker 422 signalled normally
stress: dbug: [419] <-- worker 423 signalled normally
stress: info: [419] successful run completed in 10s
Apart from these two you can use googles stress tester tools as well. They can be found here

Let me know of your feedback and comments.

 

Tuesday, January 4, 2011

Sparsemem-II

This is the second article of Sparsemem series. The first describes the logic and usability of Sparsemem. This one will focus on the code flow and implementation of the Framework. And the third one will focus on testing tools and techniques for validation a MM solution.

So First of all we statcially allocate mem_sections (2-d array) in include/linux/mmzone.h
Then the following action takes place during kernel initialization process.

--   In setup_arch(), prom_init() is called, which initializes different kernel variables.
--   prom_init() will call add_memory_region() with the complete RAM size.
--   In case the user has given the "mem=" optin with kernel cmdline then
    the previous added region is deleted and new region created using the
    memory given.
--   early_parse_mem() records the detected memory areas using add_memory_region().
--   add_memory_region() adds an entry in the boot_mem_map for each separate memory bank.
--   setup_arch() then calls arch_mem_init()
--   arch_mem_init() calls plat_mem_setup() and then calls bootmem_init() where all the available memory is bunched together and its min pfn and max pfn are identified.
--   Then bootmem_init() calls memory_present() for the detected memory.
--   memory_present() marks all the memory sections that lie within the detected memory banks as SECTION_MARKED_PRESENT.
--   The function which allocates mem_map(array of struct page) is sparse_init().
--   Then memory for bootmem map is allocated and the usable memory range is added to the early_node_map[]
--   After sparse_init(), mem_maps are allocated. (depending on the config.) But, here, mem_map is not initialized.This is because initialization logic of memmap doesn't depend on FLATMEM/DISCONTIGMEM/SPARSEMEM.
--   Initializing mem_map is done by free_area_init_node(). This function initializes memory range registered by add_active_range() (see mm/page_alloc.c) (*)There are architecutures which doesn't use add_active_range(), but this function is for generic use.
--   After free_area_init_node(), all mem_map are initialized as PG_reserved and NODE_DATA(nid)->start_pfn, etc..are available.
--   PG_reserved is cleared at free_all_bootmem(). If you want to keep pages as Reserved (because of holes), OR, don't register memory hole as bootmem then, pages will be kept as Reserved.
--   memmap_init_zone() doesn't care about valid/invalid sections. Regardless of hole, it initializes page descriptors(including struct page which are on a hole). But if page descriptors on holes are _Reserved_ then they don't go to the buddy allocator as free page. To confirm this, free_bootmem_node marks 0x0 on bitmap about only _valid_ pages by bank. Afterwards, free_all_bootmem_core() doesn't insert pages on hole into buddy by using bitmap and rejecting invalid pages.
--   To free memory, even memmap on hole would be freed on ARM by free_unused_memmap_node.

Clarification:
 memory_present() ......... prepare for section[] and mark up PRESENT.
 sparse_init() .................. allocates mem_map. but just allocates it.
 free_area_init_node() .... initizalize mem_map at el.
 memmap_init_zone() ..... mark all pages as reserved.
 free_all_bootmem() ....... make pages available and put into buddy allocator.
 pfn_valid() ..................... useful for checking there are mem_map.

Notes:If a section contains both of valid pages and holes, the section itself is marked as SECTION_MARKED_PRESENT.

 kernel allocates memmap for section which has mixed with valid and invalid(ex, hole) pages. For example, if a memory bank supports 64M but system has only 16M. Let's assume section size is 64M. In this case, section has a hole of 48M, but it will still be a valid section.

It's not encouraged to detect a section is valid or invalid but you can use pfn_valid() to check there are memmap or not. (*) pfn_valid(pfn) is not for detecting there is memory but for detecting
there is memmap.

How to make pages kept as Reserved , reserve bootmem or not register to bootmem.
All of the above may depend on various CONFIG options.

 An active memory region is simply a memory region that does not contain any holes. add_active_range() must be used to register a region in the global variable early_node_map.

Each region is described by the following data structure in mmzone.h
struct node_active_region {
unsigned long start_pfn;
unsigned long end_pfn;
int nid;
};
where start_pfn and end_pfn denote the first and last page frame in a continuous region, and nid is the NUMA ID of the node to which the memory belongs. UMA systems naturally set this to 0.

So that concludes this second article of the series. It was more about the code flow and some small tidbits I remember about this topic from discussions on linux-mm mailing list with Kamezawa Hiroyuki and Minchan Kim. Theres still some more to come which will be about testing sparsemem implementation or memory management in general.

Pls let me know about your views on this article and how can I improve it. Waiting for your comments.

Till Next time, Adios.

References:
1) Linux-MM mailing list.
2) Discussions with Kamezawa and Minchan.
3) Linux kernel books.