Bloomiboy Babble -- Technical, it's all here...

Here's where I do a brain dump of everything I feel like throwing out. It's part technical, part personal, part crap, and hopefully you get to learn something... sometime.

Wednesday, September 27, 2006

.NET and Strings and some interesting tidbits

Strings are one of the simplest yet most abused data structures in most programming languages, and .NET is no exception. With strings supported as a native data type, and hence the possibility of their rampant usage, it becomes even more important for us to understand how they work.

Two important things to consider about the System.String (or simply string) implementation in .NET:
1.) Strings are reference types (although they behave like primitive types)
2.) Strings are immutable -- more on this below.

Without going into the details of another .NET paradigm called AppDomains, let me simply say that they are light-weight processes, somewhat like threads, but definitely a level above them. AppDomains can be thought of as independent light-weight processes that allow data to be shared across them without the overhead of a full process. They have all the independence of a process -- one crashing cannot bring down another -- and hence provide isolation.

The .NET framework defines and loads its own standard AppDomains during the execution of the CLR and other .NET programs.

The only thing of interest to us in this article is the System Domain, and a specific part of it called the "Global String Literal Map". This is nothing but a mapping of all string literals that get used during the execution of all .NET programs within the context of a CLR instance. Simply put, strings across the global space of the CLR are shared, which avoids creating them multiple times over. To achieve this, they have to be immutable.

The immutability of strings may be new compared to some other programming languages. What this does *not* mean is that strings cannot be modified. They can. What it means is that any string, if changed, results in another string whose address is different. The previous string will still be stored in the system (provided it is in scope) and will not be modified. Their references are maintained by the "Global String Literal Map" in the System Domain, by the CLR.

So, anytime you do strResult = strVal1 + strVal2, you create a new string, as expected. But with strVal1 = strVal1 + strVal2, you end up creating a new strVal1 while the previous value of strVal1 also remains, although strVal1 no longer points to it.
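A minimal sketch of that behavior (the variable names are mine):

```csharp
using System;

class ImmutabilityDemo
{
    static void Main()
    {
        string strVal1 = "Hello";
        string strVal2 = strVal1;      // both point at the same object

        strVal1 = strVal1 + " World";  // builds a NEW string; strVal1 now points elsewhere

        Console.WriteLine(strVal2);    // Hello -- the original object is untouched
        Console.WriteLine(strVal1);    // Hello World
    }
}
```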

Here's another example, which is somewhat analogous.

string str1 = "Some String";
string str2 = "Some String"; //note str1 and str2 are literally the same.

Evaluating ? Object.ReferenceEquals(str1, str2) in the Command window will return true. This means that both variables are pointing to the same global storage of string data.

This process in .NET-speak is called "Interning".
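A small sketch of interning at work; the GetSuffix indirection is my own device to keep the compiler from folding the concatenation into a literal at compile time:

```csharp
using System;

class InternDemo
{
    static void Main()
    {
        string a = "Some String";
        string b = "Some String";
        // Identical literals share one interned instance.
        Console.WriteLine(Object.ReferenceEquals(a, b));                 // True

        string c = "Some " + GetSuffix();
        // A string built at run time is a fresh object...
        Console.WriteLine(Object.ReferenceEquals(a, c));                 // False
        // ...until it is explicitly interned.
        Console.WriteLine(Object.ReferenceEquals(a, String.Intern(c)));  // True
    }

    static string GetSuffix() { return "String"; }
}
```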

This specific behavior of strings leads to bad programming when used without prior design. String manipulation in .NET is much faster than in its predecessors such as Visual Basic, but the cost is still substantial. .NET stores all strings as Unicode (UTF-16), which means at least 2 bytes per character. Avoiding all this extra storage is a good thing to do. This is a tip you may find all over the 'net, but using the StringBuilder class (provided by .NET) for a significant number of string concatenations is the best thing you can do to get your program to run X times faster, where X depends on how bad a programmer you are. ;)

System.Text.StringBuilder uses a combination of preallocated storage and an array-based string representation, so that repeatedly appended string data is written into the array instead of new strings being created. Finally, when you need the string value back, you can use the ToString() method. The usage of this class is highly recommended, whenever possible.
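As a concrete (if contrived) illustration, here's a sketch comparing naive concatenation against StringBuilder; the loop count and capacity hint are arbitrary picks of mine:

```csharp
using System;
using System.Text;

class BuilderDemo
{
    static void Main()
    {
        // Naive concatenation: each += allocates a brand new string,
        // copying everything accumulated so far -- O(n^2) work overall.
        string s = "";
        for (int i = 0; i < 10000; i++)
            s += "x";

        // StringBuilder appends into a preallocated buffer instead.
        StringBuilder sb = new StringBuilder(10000); // optional capacity hint
        for (int i = 0; i < 10000; i++)
            sb.Append("x");
        string t = sb.ToString(); // one final string at the end

        Console.WriteLine(s == t); // True: same content, far less garbage
    }
}
```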

I did want to write about one interesting aspect of the optimizations performed by the .NET CLR. It highlights how one has to tailor one's mind to the specific framework of the language under consideration. The usual thought process is that C# is so similar to C that any C programmer can easily adapt to it and write great programs; this does hold true in most cases, but there are always cases where common logic belies observation.

Let us consider two simple loops, placed side by side for comparison:

Loop 1

int[] intArr = new int[10000];
// timer start here
for (int j = 0; j < intArr.Length; j++)
intArr[j] = j;
// timer end here

Loop 2

int[] intArr = new int[10000];
// timer start here
int length = intArr.Length;
for (int j = 0; j < length; j++)
intArr[j] = j;
// timer end here

Loop 1 requires the length of the array to be checked at every iteration to decide whether to go ahead or not, while Loop 2 pre-stores the length of the one-dimensional array and uses that to make the decision. Given a decent C background, one is bound to believe that avoiding the repeated property access will make Loop 2 run faster.

Which one do YOU think will run faster in .NET?

The answer is below in white font color, highlight it with your mouse to read it...

In .NET, Loop 1 will run faster. The CLR sees that the loop bound is the Length property of the array itself, which the index can never exceed, and consequently eliminates the per-access bounds check that would otherwise guard against the "index out of range" condition (IndexOutOfRangeException). Hoisting the length into a local variable hides this pattern, so the checks stay in. Hence Loop 1 runs faster.


.NET: NGEN versus JITting

For those not very familiar with the details on how the Common Language Runtime (CLR) for .NET functions, this will come as a surprise.

The basic premise of the CLR was to delay the compilation of .NET assemblies until they actually get used. In fact, the level of granularity is at the function level rather than the module level. That is, every function gets "just-in-time" compiled (JITted) when it is first called, not when its module is loaded into memory. This late-compilation model was a decision based on application execution heuristics; we assume that this study covered a large variety of applications, making sure the working set encompassed a wide range of scenarios matching everyday program execution.

Windows executables for .NET are called Portable Executables (PEs), for reasons that are self-explanatory. The format that is fed into the JIT compiler is called the Intermediate Language (IL, or sometimes MSIL).
So basically:
NGEN (the "Native Image Generator"): pre-compiled code.
JITting (for "just-in-time" compilation): compile code only when needed.

Each has its own strengths and shortcomings.

Your argument will presumably be this: why would I ever prefer JITting over NGEN, since otherwise I need to JIT each function once per execution of the program? It's not as straightforward as that. NGENing your program generates a compiled version of the IL for that specific machine. The result becomes machine-specific (more or less), though to the extent it is plain x86 (and we assume it is 100%) it should still run on various machines. With the kind of machine heterogeneity out there, you definitely don't want these kinds of dependencies. With machine-specific code, you also face the "hit" of falling back to unoptimized code when specific instruction-set extensions (e.g. MMX) are not available on the machine.

Note that this machine independence was one of the basic premises behind the Internet's enormous success.
Java users will note the distinct similarity between IL and Java bytecode, the JITter and the JVM, etc. There are definite parallels, but the usage is dramatically different.

There are some things one should be aware of when deciding between NGEN and JITting:
  • The resulting native image is, as a rule, larger than the corresponding IL code. As with many applications, many methods are never executed during a given run of the application; thus the in-memory image of the code is larger if the entire assembly is stored as a native image.
  • The native image cannot reflect the way the code is actually used. By compiling dynamically, the .NET infrastructure (CLR) can adjust the code image to reflect the current state of the application and machine. For example, some methods may end up occupying a different virtual memory page than the rest of the application, causing fragmentation that has a negative impact on the working set of the application.
  • By producing a native image, the image becomes strongly bound to specific versions of other assemblies. Essentially, a contract has been established where the native image depends on specific versions of other assemblies being available, down to the level of a module version identifier (MVID) assigned to each module when it is compiled. The MVID is a unique identifier stored with the native code, allowing it to check compiled dependencies and verify that the correct version of each module is being loaded. If not, the CLR must fall back to the IL version of the code and use JIT.
The above are not specific prescriptions, but are meant as a guide for your decision-making process.

The best scenarios for NGEN are those where there is not too much code churn in your application, i.e. no frequent changes. Good choices would be code heavily dependent on mathematical or graphical operations that benefit greatly from CPU-specific optimizations. Code that does not change is also a good candidate, e.g. the .NET framework itself.

In conclusion, JITting (the default behavior) should work for most of your cases, and NGEN should be seen as a performance enhancement for very specific scenarios.

Sunday, September 17, 2006

got my MacBook!

I finally got my MacBook -- one word: ROCKS!

I went for the most basic MacBook edition, pictured here. (Apologies if the link is dead in a few months or so...) It was the 1.83 GHz Intel Core Duo with 512 MB RAM and a 60 GB HDD.

Looks-wise, this is the best laptop I have ever seen. Ergonomically, Apple has done an amazingly good job of making sure the laptop's features will last for a long time to come. Some examples:

1.) MagSafe power connector: Have you ever walked across a room in a hurry, only to trip over the power cord of your friend's laptop, which then flew across the room in the melee? No more. The MagSafe cord ensures quick disconnection when you yank it beyond a threshold; the magnetic power connector takes care of that.
2.) CD drive: It is the slot-loading variety (like the ones found in cars), rather than the usual tray-based one.
3.) Inbuilt camera (iSight): Just too good; the resolution and the pinhole-sized camera take your breath away, not to mention the microphone alongside.
Others: soft keys, large trackpad, inbuilt Bluetooth support, and the list goes on and on.

From the software perspective, Mac OS X (Tiger) is just sleek. With its inherent security support and detection capabilities, it takes your breath away. If you have used the Solaris GUI or Linux-based X Windows, then Mac OS X is eons better than those.

An amazing value-for-money package, if you ask me. In fact, I'm posting this blog entry from my Mac... ;)

Wednesday, September 06, 2006

Movie Review: Vettaiyadu Vilayadu

Movie Review: Memento

Saturday, September 02, 2006

Determining L1 and L2 Cache size of a machine

When the time comes to buy a computer, one has to take into account many factors, each accounting for a small percentage of the total performance. One such factor is the size (or simply the presence or absence) of a processor's L1 and L2 caches.

In the memory hierarchy, a processor seeking an instruction or data checks L1 before L2 (and so on), until it hits main memory (RAM) and finally the hard disk. More explanation can be found at many sites, including
this and this.

The purpose of this post is not to delve into the details of what these various memory types are, but rather to show how you can actually calculate or verify them empirically (yes, it is indeed possible!). Nowadays we can rely on hardware manufacturers to tell us the exact specifications of a machine's processor, but back in the day it was not considered reliable to take the vendor's word for the machine specifications, and the norm was to run your own tests to confirm them. I shall outline a few steps, based on matrix multiplication, that can help shed some light on the processor memory in question.

All experiments were run on a Solaris cluster machine although the same principles can be easily applied to any other Operating System.

Step 1: Getting the assumed values for the sake of comparison.

Solaris allows us to run fpversion to get information about the cache sizes as known to the installed version of the OS. Here's one such run, which we use as a sample machine for the explanation in the rest of the article.
A SPARC-based CPU is available.

Kernel says CPU's clock rate is 360.0 MHz.
Kernel says main memory's clock rate is 120.0 MHz.
Sun-4 floating-point controller version 0 found.

An UltraSPARC chip is available.

Use "-xtarget=ultra2i -xcache=16/32/1:2048/64/1" code-generation option.

Hostid = 0xXXXXXX


This tells me the following details:
L1 Cache is 16 KB
L1 Cache line size is 32 bytes
L1 Cache is 1-way set associative (direct-mapped)
L2 Cache is 2048 KB (2 MB)
L2 Cache line size is 64 bytes
L2 Cache is 1-way set associative (direct-mapped)

Step 2: Running suitable tests to figure out the L2 cache size.

Since the L2 cache is the larger of the two caches under consideration, it will (obviously) tend to hold the majority of the data values being manipulated. A good way to test this is to run a simple matrix multiplication (the dreaded triple nested loop). The only variable parameter is the dimension of the matrix, say 'n'. The time taken to perform the multiplication is measured and plotted against n. The point where the graph shows a drastic deterioration is taken as the threshold where the working set exceeds the cache size, forcing accesses out to RAM.
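As a sketch of the harness just described (the dimension range, step size and timer choice are my own illustrative picks; the matrices are left zero-filled since only the memory traffic matters for timing):

```csharp
using System;
using System.Diagnostics;

class CacheProbe
{
    // Time one naive n x n matrix multiply (the dreaded triple nested loop).
    public static double TimeMultiply(int n)
    {
        double[,] a = new double[n, n];
        double[,] b = new double[n, n];
        double[,] c = new double[n, n];

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i, k] * b[k, j];
                c[i, j] = sum;
            }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        // Plot time against n; a knee in the curve hints at the L2 limit.
        for (int n = 100; n <= 400; n += 20)
            Console.WriteLine("{0,4} {1,10:F2} ms", n, TimeMultiply(n));
    }
}
```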

In my program runs, n = 240 was found to be the point where the timings went haywire.

A double is 8 bytes on the Solaris machine.
For 3 matrices, each of dimension n = 240, the empirical cache size = 240 * 240 * 8 * 3 = 1,382,400 bytes, on the presumption that all 3 matrices are in the cache at that point.

=> L2 size is approximately 1.4 MB

The machine supposedly had a 2 MB cache; some of the difference can be attributed to program instruction references and other programs multitasking at the same time. More careful runs will yield better results.

Step 3: Running suitable tests to figure out the L1 cache size.

Note that the classic triply nested matrix multiplication is not very cache-efficient, since it reads each value in more than once to perform the multiplication.

Consider matrix multiplication C = A * B
where each element of C, c(i,j) = Summation of k from 1 to n of [ a (i,k)* b (k,j) ]

Note that this means some other element of C will need the same a(i,k) and b(k,j) values, for various values of i, k and j.

Ideally, you would not want to read a matrix value from memory more than once -- once you read in a particular value, make sure you use it completely and then forget about it. This is the idea behind the blocked version of matrix multiplication. Without going into details, I can say that it works and provides a good test case for determining the L1 cache size, since it brings data in only when needed and not otherwise.

[ Algorithm not provided, again left as a test for the reader ]
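Since the algorithm is left as an exercise, here is one possible sketch of the blocked version; the tile-loop structure is the standard one, but this particular code and its little identity-matrix check are my own:

```csharp
using System;

class BlockedMatMul
{
    // Blocked (tiled) multiply: work on bs x bs sub-blocks so that the
    // three active tiles fit in L1 at once; each element is then reused
    // from cache instead of being re-fetched from memory.
    public static void Multiply(double[,] a, double[,] b, double[,] c, int n, int bs)
    {
        for (int i0 = 0; i0 < n; i0 += bs)
            for (int j0 = 0; j0 < n; j0 += bs)
                for (int k0 = 0; k0 < n; k0 += bs)
                    // Multiply the (i0,k0) tile of A by the (k0,j0) tile of B.
                    for (int i = i0; i < Math.Min(i0 + bs, n); i++)
                        for (int j = j0; j < Math.Min(j0 + bs, n); j++)
                        {
                            double sum = c[i, j];
                            for (int k = k0; k < Math.Min(k0 + bs, n); k++)
                                sum += a[i, k] * b[k, j];
                            c[i, j] = sum;
                        }
    }

    static void Main()
    {
        int n = 6, bs = 2;
        double[,] a = new double[n, n];
        double[,] b = new double[n, n];
        double[,] c = new double[n, n];
        var rnd = new Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                a[i, j] = rnd.NextDouble();
                b[i, j] = (i == j) ? 1.0 : 0.0; // identity matrix
            }

        Multiply(a, b, c, n, bs);

        // A * I == A, so the blocked version must reproduce A exactly.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (c[i, j] != a[i, j]) throw new Exception("mismatch");
        Console.WriteLine("OK");
    }
}
```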

The program was run with block sizes from 20 to around 55. The following observations were made:
1.) The peak megaflops for the various block sizes were noted. The peak performance was a relative maximum (a local maximum) at the values 20, 25, 30, 35, 40 and 45, and after that the values steadily decreased.
2.) The local maxima at the above points also kept increasing, the maximum value being observed at block size = 45.

Explanation of the observations: First, note that the L1 cache is 16 KB, direct-mapped, with a 32-byte cache line. We can consider it as holding 2048 double words, i.e. 2048 double values for matrix computations. Second, because of direct mapping, there is a chance that several addresses map to the same cache location. We get maximum performance at a block size of 45, which can be concluded as follows.

For a block size of 45, the total size of the matrix block is 45 * 45 * 8 = 16,200 bytes, which is the largest block that can be stored in the 16,384-byte L1 cache. The L2 cache is 2 MB in size, so once the elements have been brought into L2, transfer to L1 can take place very rapidly. For block sizes greater than 45, performance decreases, because some elements of the block would have to be swapped out to accommodate the others.

A double is 8 bytes on the Solaris machine.
Hence, L1 cache size approx. = 45 * 45 * 8 = 16,200 bytes.

The machine on which I ran my program 'supposedly' had a L1 cache of 16 KB, so my experimental value matches more or less with the expected value.

Rest assured now that you are getting exactly what you paid for.

Euler's (or the Euler-Mascheroni) Constant and its large-precision calculation

This post has been inspired by one of the assignments I had for an Algorithms course while in graduate school. The task was to find 4 integer values a, b, c and d such that:

a/b < γ < c/d

There were 3 tests to be satisfied for the above condition:
1.) a/b < c/d
2.) a/b is the proper lower bound, i.e. a/b < γ
3.) c/d is the proper upper bound, i.e. c/d > γ

The ultimate test was to get the lowest possible difference (c/d - a/b).
Note that the same question can be applied to any mathematical constant, such as e (Euler's number, the base of natural logarithms) or pi (approx. 22/7).
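To make the three tests concrete, here's a small sketch; the fractions 4/7 and 7/12 are my own deliberately coarse bounds, nowhere near an assignment-quality answer:

```csharp
using System;

class GammaBounds
{
    static void Main()
    {
        // gamma to 10 decimal places (Euler-Mascheroni constant)
        double gamma = 0.5772156649;

        // Illustrative (coarse) bounds: a/b < gamma < c/d
        long a = 4, b = 7;    // 4/7  = 0.5714...
        long c = 7, d = 12;   // 7/12 = 0.5833...

        bool test1 = (double)a / b < (double)c / d; // lower bound < upper bound
        bool test2 = (double)a / b < gamma;         // proper lower bound
        bool test3 = (double)c / d > gamma;         // proper upper bound

        Console.WriteLine("{0} {1} {2}", test1, test2, test3); // True True True
        // The quantity to minimize: here c/d - a/b = 1/84, about 0.0119
        Console.WriteLine("gap = {0}", (double)c / d - (double)a / b);
    }
}
```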

Some background on γ from my final project document:

The calculation of Euler's constant (γ) has not attracted the same public intrigue as pi or even e, but it has inspired the dedication of a few. While we presently know millions of decimal places of its more famous contemporaries, only several thousand places of γ are known so far. Initially computed by D.E. Knuth, it was soon followed by Sweeney, then Brent & McMillan, and some recent methods have taken the limit to a few hundred thousand digits of accuracy.

This project, though not directly involved in the calculation of the accuracy of γ to any limit, does involve the calculation to a certain limit governed only by time. It seeks to find out the bounds within which γ lies. The accuracy of these bounds, as suggested, will be limited by the amount of computer time that is allowed for the program.
Most of the details are in the project document (in PostScript format; you will need GhostScript/GhostView to view it), and I will mention just my relevant experiences in this post. The document also provides a very clear background, with derivations and explanations of all the formulae used and the method chosen. The final program was written in C on the Solaris platform using the GNU Multiple Precision library (GMP). GMP defines its own float and integer types whose values and precision are limited only by the storage available on your machine; theoretically, you can represent a float value of arbitrary length using this library. Given a chance, I encourage you to use this extremely lucid library for any needs that require maintaining large-precision values. Note: be careful not to blame the library for bugs in your own program.

The maximum precision of your calculations is limited by the minimum precision among all the parameters in your equation.

For example, if you are calculating d = a * b * c, and a's precision is 10 digits, b's is 15 digits and c's is 7 digits, then the overall precision of d will be 7 digits. These kinds of gotchas make the above problem very interesting.
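A rough way to see this rule in action (the Sig rounding helper and the chosen constants are my own illustration, using plain doubles rather than GMP):

```csharp
using System;

class PrecisionDemo
{
    // Round x to n significant decimal digits (illustrative helper).
    public static double Sig(double x, int n)
    {
        if (x == 0) return 0;
        double scale = Math.Pow(10, n - 1 - (int)Math.Floor(Math.Log10(Math.Abs(x))));
        return Math.Round(x * scale) / scale;
    }

    static void Main()
    {
        double exact = Math.PI * Math.E * Math.Sqrt(2);

        // a carries 10 digits, b carries 15, c carries only 7
        double a = Sig(Math.PI, 10);
        double b = Sig(Math.E, 15);
        double c = Sig(Math.Sqrt(2), 7);

        double d = a * b * c;
        // The relative error of d is dominated by c's 7 digits (~1e-7),
        // not by the 10 or 15 digits of the other factors.
        Console.WriteLine("relative error = {0:E2}", Math.Abs(d - exact) / exact);
    }
}
```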

One rule to make the program tougher would be to disallow the use of GMP float variables in your program. In the process of finding the solution, we needed the natural logarithm of a number; based on the above logic, we inadvertently also solved the problem of finding the logarithm of a number to a large precision.

The document makes really interesting reading with regard to why one method was chosen over another. In the end, the C program (of course, the results are specific to the machine used then) could calculate to a precision of 14,200 bits (or roughly 4,275 digits of accuracy), and I'm happy to mention that this was the best precision calculated in a minute among everyone in the course.

Lots of details are in the linked document below, including the source code for the program (in PostScript format; you will need GhostScript/GhostView to view it). Most machine- and course-specific information has been removed from it.

Standard disclaimer: If you plan to use the information in this document for your personal or professional work, I will not raise any copyright issues, but feel free to acknowledge this article. Also, I bear no responsibility for the accuracy or correctness of this document if it causes issues for your specific usage. Please be aware of plagiarism policies regarding research work.

PostScript document (631KB, 59 pages)

.NET and ActiveX, using Thread.Abort()

There may be situations where one is forced to use ActiveX components within a .NET application. It is quite possible to overcome the overhead of interop and the cross-thread marshalling that needs to happen in such cases (note that ActiveX threads are forcibly STA).

One thing to beware of is the handling of the exception when Thread.Abort() is called on code that is currently executing in the COM context, i.e. inside the ActiveX component.

It is always recommended that we catch the ThreadAbortException and then decide whether to safely ignore it or perform a custom action, like alerting the user. Let's say, for the sake of discussion, that we would like to just continue processing, assuming the user has knowingly stopped further processing of the thread.

Bad idea. Threads in COM contexts cannot be safely aborted, at least not without causing exceptions that are not handled by the .NET application or the CLR.

Here's a hypothetical code snippet:

try
{
    // do some ActiveX stuff here
    // Thread.Abort() is called before the next statement is executed
}
catch (ThreadAbortException ex)
{
    // ignore this
}
catch (Exception ex)
{
    // the mother of all exceptions
    // Surprisingly, this won't handle the exception being thrown.
}

Here's more information from various places:

The evils of Thread.Abort:


"If a thread makes an unmanaged call into the operating system that has blocked the thread in unmanaged code, the runtime will not take control of it for System.Threading.Thread.Interrupt or System.Threading.Thread.Abort. In the case of System.Threading.Thread.Abort, the runtime marks the thread for Abort and takes control of it when it re-enters managed code. It is preferable for you to use managed blocking rather than unmanaged blocking."

Sometimes, it is just not possible. What do we do in such cases?

We would need to wait until control returns to the managed context and then avoid any further actions by simply returning. This means we no longer call Thread.Abort(); instead, we set a boolean flag to make sure we don't proceed.

So, let's add some extra information to the code snippet above, and assume we used to call Thread.Abort() in the window-closing event of a form. Here's the change based on what we have just discussed.

void Window_Closing(...)
{
    // was previously Thread.Abort()
    _isClosing = true;
    this.Hide();
}

// Next, use the boolean flag to stop further processing.

try
{
    // do some ActiveX stuff here
    if (!_isClosing)
    {
        // Thread.Abort() is not called now;
        // we simply do nothing further if the form is to be closed.
    }
}
catch (ThreadAbortException ex)
{
    // ignore this
}
catch (Exception ex)
{
    // the mother of all exceptions
    // There should be no need to handle any exceptions here.
}

One thing to note: although we hide the window, the code keeps chugging away in the background, and it only ends once it returns to managed code and hits the boolean check for (!_isClosing). It's the only reliable way to make sure the form is closed properly without throwing an exception for the OS to handle.