### Determining L1 and L2 Cache size of a machine

When the time comes to buy a computer, one has to take many factors into account, each accounting for a small percentage of the total performance. One such factor is the size (or simply the presence or absence) of the processor's L1 and L2 caches.

In the memory hierarchy, a processor seeking an instruction or a data item checks L1 first, then L2 (and so on), until it hits main memory (also called RAM) and finally the hard disk. More detailed explanations can be found on many sites.

The purpose of this post is not to delve into the details of what these various memory types are, but rather to show how you can actually measure or verify them empirically (yes, it is indeed possible!). Nowadays we can rely on hardware manufacturers to correctly state the exact specifications of a processor, but back in those times it was not considered reliable to take the vendor's word on machine specifications, and the norm was to run your own tests to confirm them. I shall outline a few steps, based on matrix multiplication, that can shed some light on the processor memory in question.

All experiments were run on a Solaris cluster machine, although the same principles can easily be applied to any other operating system.

Step 1: Getting the assumed values for the sake of comparison.

Solaris allows us to run fpversion to get information about the cache sizes as known to the installed version of the OS. Here's one such run, which we use as a sample machine for the explanation in the rest of the article.

---

A SPARC-based CPU is available.

Kernel says CPU's clock rate is 360.0 MHz.

Kernel says main memory's clock rate is 120.0 MHz.

Sun-4 floating-point controller version 0 found.

An UltraSPARC chip is available.

Use "-xtarget=ultra2i -xcache=16/32/1:2048/64/1" code-generation option.

Hostid = 0xXXXXXX

---

This tells me the following details:

- L1 cache is 16 KB
- L1 cache line size is 32 bytes
- L1 cache is 1-way set associative (direct-mapped)
- L2 cache is 2048 KB (2 MB)
- L2 cache line size is 64 bytes
- L2 cache is 1-way set associative (direct-mapped)

Step 2: Running suitable tests to figure out the L2 cache size.

Since the L2 cache is the larger of the two caches under consideration, it will tend to hold the majority of the data being manipulated. A good way to test this is to run a simple matrix multiplication (the dreaded triply nested loop). The only variable parameter is the matrix dimension, say n. The time taken to perform the multiplication is measured and plotted against n. The point where the graph deteriorates drastically marks the threshold where the working set exceeds the cache size and accesses start spilling to RAM.

In my program runs, n = 240 was found to be the point where the timings went haywire.

Calculations:

n=240

a double = 8 bytes on the Solaris machine

For 3 matrices of dimension n = 240 each, the empirical cache footprint = 240 * 240 * 8 * 3 = 1,382,400 bytes, on the presumption that all three matrices are resident in the cache at that point.

=> L2 size approximately = 1.4 MB

The machine supposedly had a 2 MB L2 cache; the difference can be attributed to program instruction references and other programs multitasking at the same time. More careful runs would yield better results.

Step 3: Running suitable tests to figure out the L1 cache size.

Note that the classic triply nested matrix multiplication is not very efficient, since it reads the same value in more than once to perform the multiplication.

---

Consider matrix multiplication C = A * B

where each element of C is c(i,j) = sum over k = 1..n of a(i,k) * b(k,j)

Note that this means each a(i,k) is needed for every element of row i of C, and each b(k,j) for every element of column j, so the naive loop ends up reading the same values many times.

---

Ideally, you would not want to read a matrix value more than once -- once you read in a particular value, make sure you use it completely and then forget about it. This is the blocked (tiled) version of matrix multiplication. Without going into details, I can say that it works and provides a good test case for determining the L1 cache size, since it brings data in only when needed and not otherwise.

[ Algorithm not provided; left as an exercise for the reader ]

The program was run with block sizes from 20 to around 55. The following observations were made:

1. The peak megaflops for each block size were noted. Peak performance showed a local maximum at block sizes 20, 25, 30, 35, 40 and 45; after that the values steadily decreased.

2. The local maxima at those points also kept increasing, with the overall maximum observed at block size = 45.

Explanation for the observations: First, recall that the L1 cache is 16 KB, direct-mapped, with a 32-byte cache line. That amounts to 2048 double words, i.e. room for 2048 double values during matrix computations. Second, because of direct mapping, many addresses can map to the same cache location, causing conflict misses. The performance maximum at block size 45 can be explained as follows.

For block size 45, a single block of doubles occupies 45 * 45 * 8 = 16,200 bytes, the largest such block that fits in the 16,384-byte L1 cache. The L2 cache is 2 MB, so once the elements have been brought into L2, transfer to L1 can take place very rapidly. For block sizes greater than 45, performance decreases because some elements of the block have to be swapped out to accommodate others.

Calculations:

n=45

a double = 8 bytes on the Solaris machine

Hence, L1 cache size approx. = 45 * 45 * 8 = 16,200 bytes.

The machine on which I ran my program 'supposedly' had an L1 cache of 16 KB, so my experimental value matches the expected value fairly well.

---

Rest assured now that you are getting exactly what you paid for.

