The Complete Magazine on Open Source

A sneak peek into the world of garbage collection

2.48K 0

Java garbage collection

This is not about hygiene and sanitation. In Java, garbage refers to unreferenced objects. Java memory can be made more efficient by removing these objects. Garbage collection is the process of reclaiming the runtime unused memory, automatically.

There was a time when developers were under constant pressure to write efficient code that kept a check on the memory usage and cleaning up of unused or idle resources. This constraint not only added to the woes of the developer but also made the code lengthy and repetitive. Java, the object orient language, came up with a brilliant mechanism for garbage collection. It eased the load on coders by running a daemon thread that automatically cleaned the memory as and when required and made it available for the program.

Now, the question is: where does Java keep the object? Well, the answer is that since Java abides by the OOPs paradigm, it stores everything as an object in the Heap memory. Whenever objects are created, they occupy some space in the Heap and are further referenced as per usage. From time to time, JVM marks objects that are idle for a long time or are not referenced, and selects them for deletion. Garbage collection in Java is tracking down all the objects that are still used, and marking the rest as garbage.

Figure 1: The memory structure of Heap

Figure 2: Marking live objects in the Heap

Before getting into garbage collection, let’s understand the memory pool of Java, i.e., how the Heap is divided into different segments and where objects are present within the Heap at different times. To start with, the Heap is divided into the Young and Tenured generations, along with the PermGen space. The Young generation is further divided into Eden, Survivor1 and Survivor2. When created, objects reside in Eden; if garbage collection (GC) does not result in sufficient memory space inside Eden, then the object is allocated to the Old generation. When garbage is being collected from Eden, the GC runs from the root to all reachable objects and marks them as alive—this process is called marking. After marking, all the live objects are copied from Eden to Survivor spaces, after which the whole of Eden is empty and can be reused to allocate more objects. This entire process is named as Mark and Copy. Beside Eden lie the Survivor spaces called 1 and 2, one of which is always empty. It is this empty space that will start getting filled once the GC works on Eden.

The copying of live objects in Survivor space is repeated several times, until some objects are considered mature enough and are promoted to the Old generation, where they reside until they become unreachable. In contrast to the Young generation, the size of the Old generation is quite large, and the frequency of GC here is less than in the Young generation. When GC takes place in the Old generation, the following takes place: a) Reachable objects are marked by setting the marked bit next to all objects accessible through GC roots; b) All unreachable objects are deleted; and c) The content of old space is compacted by copying live objects contiguously to the beginning of the Old space. Next comes the Permanent Generation, a.k.a. PermGen, which existed prior to Java 8 and was used to store the metadata information of the objects. With Java 8, the class definitions are now loaded into metaspace, which is located in the native memory and does not interfere with the regular Heap objects. By default, the metaspace size is only limited by the amount of native memory available to the Java process.

Moving on, GC is mainly categorised into three types, namely, Minor GC, Major GC and Full GC. Minor GC works on the Young generation and is triggered when the JVM is unable to allocate the space for a new object. Besides, it does trigger the stop-the-world pause suspending all the live threads. Major GC runs on the Old Generation while the full GC works on cleaning the entire Heap.

Java implements different algorithms to achieve GC and, to understand them, let’s start with some basic terminology.

  • Marking reachable objects: GC defines some specific objects as GC roots. Examples are local variables and input parameters of currently executing methods, active threads, static field of the loaded classes, JNI references, etc. This phase calls the stop-the-world pause.
  • Removing unused objects: Removal of unused objects is somewhat different for different GC algorithms, but all such algorithms can be divided into three groups—sweeping, compacting and copying.

Sweeping or ‘mark and sweep’ means that after the marking phase is complete, all space occupied by unvisited objects is considered free and can thus be reused to allocate new objects.

Compact or mark-sweep-compact solves the shortcomings of ‘mark and sweep’ by moving all marked – and thus alive – objects to the beginning of the memory region.

Copy or ‘mark and copy’ algorithms are very similar to the ‘mark and compact’ as they too relocate all live objects. The important difference is that the target of relocation is a different memory region as a new home for survivors.

We are done with all the basics and the terminology. Let’s now explore the different algorithms of GC. Basically, these can be put in four categories—namely, Serial, Parallel, CMS and G1.

Figure 3: Memory status after sweep

Figure 4: Memory status after compacting

Serial GC: In this, the GC uses mark-copy for the Young Generation and mark-sweep-compact for the Old Generation. Both the collectors are single threaded and trigger stop-the-world pauses, stopping all application threads. This GC algorithm cannot thus take advantage of multiple CPU cores commonly found in modern hardware.

Parallel GC: This uses mark-copy in the Young Generation and mark-sweep-compact in the Old Generation. Both Young and Old collections trigger stop-the-world events, stopping all application threads from performing garbage collection. It uses multiple threads and, thus, is suitable on multi-core machines in cases where your primary goal is to increase throughput. High throughput is achieved due to effective usage of system resources in two ways.

During collection, all cores clean the garbage in parallel, resulting in shorter pauses; and between garbage collection cycles, neither of the collectors consumes any resources.

Concurrent mark and sweep CMS: This uses the parallel stop-the-world mark-copy algorithm in the Young Generation and, usually, the concurrent mark-sweep algorithm in the Old Generation. This collector avoids long pauses while collecting in the Old Generation, as it does not compact the Old Generation, but uses free lists to manage reclaimed space; and it does most of the job in the mark-and-sweep phases concurrently with the application. This algorithm is divided into seven phases. These are: i) Initial mark, ii) Concurrent mark, iii) Concurrent Pre-clean, iv) Concurrent abortable Pre-clean, v) Final Remark, vi) Concurrent Sweep, and vii) Concurrent Reset.

Figure 5: Memory status after copying

Figure 6: Different algorithms for garbage collection

Figure 7: Memory structure of the Heap in G

G1: This is also known as Garbage-First and, with it, you can set specific performance goals. You can request the stop-the-world pauses to be no longer than x milliseconds within any given y-millisecond long time-range, e.g., no more than 5 milliseconds in any given second. Garbage-First GC will do its best to meet this goal with high probability (but not with certainty, as that would be hard to do in real-time). G1 has totally redesigned the way the objects are stored in Heap. It does not have the concept of Young and Old generations; instead, the Heap is split into a number of smaller (typically about 2048) Heap regions that can house objects. Each region may be an Eden region, a Survivor region or an Old region, as shown in Figure 7.

With all this information, you are up to date with the different types of garbage collection mechanisms. You can play around with the different GC algorithms by applying them in different generations of the Heap.