Wednesday, January 19, 2011

Cache Memory demystified.

What is Cache ?
Webster's dictionary defines a cache as "a safe place for hiding and storing things."
For software developers like me, cache memory is a temporary storage place for the most frequently requested information. It consists of a relatively small area of fast local memory. As a result, the information stored there can be delivered much more quickly than it could be from slower devices, such as main memory or the disk subsystem.

Why is it Needed ?
In simple terms, it's needed to speed up operations.
e.g. Time to run code = clock cycles spent executing the code + clock cycles spent waiting for memory fetches. For many years now, CPU speeds have improved by roughly 50% per year more than memory chip speeds have. Hence memory access is a bottleneck to fast computing.

Cache, in short, is the part of the memory hierarchy used to improve the average access time to slower memory. The hierarchy is defined as follows:

<----- FASTER -----

Registers -- L1 Cache -- L2 Cache -- RAM -- Disk/Tape

----- BIGGER ---->

Cache Basics

Before I go any further, let me put out a glossary of a few cache-related terms we will be using a lot:
I-Cache - Instruction Cache
D-Cache - Data Cache
U-Cache - Unified Cache
S-Cache - Secondary Cache
T-Cache - Tertiary Cache
B-Cache - Back up Cache
E-Box - Integer execution Unit / Arithmetic Logic Unit.
I-Box - Decoder and Scheduler Unit
F-Box - Floating Point Unit
A-Box - Load and Store Unit
C-Box - Cache and System Bus Controller Unit.

Now with that out of the way, let me bring out the big guns.

In any computer system, cache memory is separated, logically and physically, into so-called levels. The internal memory built into the processor core is called the L1 Cache (Level 1 Cache), and the external cache memory, located either on the processor module or on the main board, is called the L2 Cache (Level 2 Cache).

Cache memory on the first level is further subdivided into two regions of roughly the same size, called the I-Cache and the D-Cache. Cache memory of the second level is almost always unified. The reason is that the second-level cache doesn't need to be as fast as the first level: in practice the I-Cache and D-Cache together satisfy about 80%-90% of all requests, which makes room for a trade-off.

A general Cache based system

Cache functioning
At run time, while executing an instruction, if the required data is not available in a register, a request is generated and dispatched to the nearest cache level, i.e. the D-Cache. On a miss, the request is redirected to the next cache level, and so on. While retrieving the data, the system first writes the data from memory to the cache and then from the cache to the register.

If you wish to save/write some data, the operation runs in the opposite direction: the data is first written to a register, then to the D-Cache, then to the L2 Cache and, if required, to memory (depending on the write policy being followed: write-through or write-back).

To reduce the cost of memory transfers, more than one element is loaded into the cache for each request. The unit of transfer is called a cache block or cache line. The size of this line is architecture dependent and is usually expressed in bytes; a line is typically between 32 and 128 bytes long.

Every address is split into two primary parts: the tag, which contains the higher address bits, and the index, which contains the lower address bits. The tag is stored alongside the cached data and compared at run time, while the index directly selects a location in the cache. Imagine physical memory organized as M equally sized memory segments, where each segment is the same size as the cache. Then, to form the address of a memory line, you determine the number of its memory segment first (the tag), the number of the line inside that segment second (the index), and merge the two numbers.


Cache mapping
There are three general methods for mapping main memory to cache.
1) Direct mapping : In direct mapping, a block from main memory can go in exactly one place in the cache. It works by doing the following simple calculation for each memory block:
Cache line = memory block number MODULO number of cache lines.
e.g. in a four-line cache, line 0 can be occupied by blocks 0, 4, 8, ... from main memory.


2) Set associative : An N-way set associative cache means that information stored at some address in main memory can be placed (cached) in any of N locations (lines) of the cache. The cache is logically segmented so that each segment has exactly one line capable of caching information located at a given memory address. When the data at a certain address is required, the appropriate lines in all the segments are queried and a hit or miss is delivered.


3) Fully associative : In this arrangement, a block from main memory can be placed in any location in the cache.


There are several algorithms to choose from for replacing the cache lines in associative memory cases.


  • LRU (Least Recently Used) - A line containing information with oldest last hit is replaced.
  • LRR (Least Recently Replaced) / FIFO (First In First Out) - The line containing information with oldest record is replaced.
  • LFU (Least Frequently Used) - A line containing information which has been requested least often is replaced.
  • Random - A random line is chosen for replacement.
A direct-mapped cache is faster than a fully associative one but is the least efficient, because the controller has no choice when writing: there is only one line available to choose from. So significant performance degradation occurs when two or more memory addresses compete for a single cache line. A fully associative cache is the slowest yet the most efficient, for the same reasons.

Cache Read - uses the principle of locality to bring pages into the cache. There are two different types of locality:

  • Temporal Locality (Locality in Time) - states that if an item is referenced, it will tend to be referenced again soon. e.g. loops, reuse etc.
  • Spatial Locality (Locality in Space) - states that if an item is referenced, items whose addresses are close by tend to be referenced soon. e.g. array access
Both of these principles are satisfied by moving blocks/lines instead of single words for each request. The optimal line size is based on the properties of the data bus and memory subsystem.

Cache Write - There are 2 main approaches for writing data to memory.

  • Write Through - the information is written to both the block in the cache and the block in lower-level memory at the same time. This approach is simpler but costly.
  • Write Back - the information is written only to the block in the cache. The modified block is written to main memory only when it is replaced in the cache. This method is less costly, as it reduces memory writes, but is more complicated, as checks need to be in place to prevent data corruption.
That's all from me at the moment. There will be more to come. Please let me know of any comments.

References:
Sun cache document
Various documents on the Internet.
 

Friday, January 7, 2011

Linux Memory Management Notes

Here, I will talk about some random MM things that I learned while working on it a while back. It might be useful for others. I am defining things here for MIPS but it could be tweaked for other archs as well.

1) How to define the no. of memory nodes contained in a system ??
Soln:- It's a two-step process.

a) In the .config file select CONFIG_SYS_SUPPORTS_NUMA & then set CONFIG_NODES_SHIFT=<log2(no. of nodes)> (the kernel derives MAX_NUMNODES as 2^NODES_SHIFT).

b) The second step is :
    in arch/mips/include/asm/mmzone.h, define the following:
   pg_data_t discontig_node_data[]
   #define NODE_DATA(nid) (&discontig_node_data[nid])
   #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map)

& define discontig_node_data  in arch/mips/mm/discontig.c
e.g.

pg_data_t discontig_node_data[MAX_NUMNODES] = {
{ .bdata = &bootmem_node_data[0]  },
.
.
.
};

and finally, MAX_NUMNODES comes from include/linux/numa.h, where it is defined as (1 << NODES_SHIFT).

2) How many page descriptors can a page hold ??
Soln:
1 page typically is 4KB (4*1024 --> 4096 bytes).
1 page descriptor is typically 32 bytes.

So the no. of page descriptors in a page is 4096/32 --> 128 page descriptors.

3) What do the PFN_UP / PFN_DOWN macros do ??
Soln:
The PFN_UP macro gives the page frame number after rounding the given address up to the next page frame boundary.
The PFN_DOWN macro gives the page frame number of the frame that contains the address passed to it.

(to be contd....)

Thursday, January 6, 2011

Sparsemem-III

This is my third article in the series for Sparsemem. First one dealt with the logic and usability of sparsemem. Second one discussed the code flow and implementation of the Sparsemem Framework.

In this article I will be talking about various testing tools to validate your implementation of sparsemem on your system. Most of these tools are for generic memory testing and can be used for other MM areas as well.

The most affected area of sparsemem usage is calls to pfn_to_page() and page_to_pfn(). These can be tested by the following methods:

 1. page fault
 2. memory reclaim
 3. slab(slub/slob) (virt_to_page() will be called.)
 4. high-order page allocator

But 1 and 2 are too heavy to measure the influence. 3 or 4 can be done, and of these, 3 is better.

If you don't want to test in the kernel, reading /proc/kpagecount may trigger tons of pfn_to_page() calls... but copying information between userspace and the kernel is very heavy in general, so an in-kernel test will yield much better results.

The following patch by Christoph Lameter is pretty useful for doing page faults on bootup and displaying the results on the terminal.
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg190559.html

The other memory testing tools are as follows:
1) Memtester -- a GPL'd userspace utility for testing the memory subsystem for faults.
It can be found at the following location:
http://pyropus.ca/software/memtester/
It has output like the following:

<Linux#> ./memtester 256 2
memtester version 4.2.0 (32-bit)
Copyright (C) 2010 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffff000
want 256MB (268435456 bytes)
got  256MB (268435456 bytes), trying mlock ...locked.
Loop 1/2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok
Loop 2/2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok
Done.

2) Stress -- stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system.
It can be found here: http://weather.ou.edu/~apw/projects/stress/

Output:
<Linux#> ./stress --cpu 2 --io 1 --vm 1 --vm-bytes 128M --timeout 10s --verbose
stress: info: [419] dispatching hogs: 2 cpu, 1 io, 1 vm, 0 hdd
stress: dbug: [419] using backoff sleep of 12000us
stress: dbug: [419] setting timeout to 10s
stress: dbug: [419] --> hogcpu worker 2 [420] forked
stress: dbug: [419] --> hogio worker 1 [421] forked
stress: dbug: [419] --> hogvm worker 1 [422] forked
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [419] using backoff sleep of 3000us
stress: dbug: [419] setting timeout to 10s
stress: dbug: [419] --> hogcpu worker 1 [423] forked
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [422] freed 134217728 bytes
stress: dbug: [422] allocating 134217728 bytes ...
stress: dbug: [422] touching bytes in strides of 4096 bytes ...
stress: dbug: [419] <-- worker 421 signalled normally
stress: dbug: [419] <-- worker 420 signalled normally
stress: dbug: [419] <-- worker 422 signalled normally
stress: dbug: [419] <-- worker 423 signalled normally
stress: info: [419] successful run completed in 10s
Apart from these two, you can use Google's stress-tester tools as well. They can be found here

Let me know of your feedback and comments.