Jun 17, 2019

JVM Memory Management and Garbage Collection

Memory basics

Memory model

In JVM memory can be divided into two categories: stack memory and heap memory.

Each thread has its own stack memory that can only be accessed by itself. Stack memory stores some local references, functional call stack etc.

Heap memory are managed by Garbage Collector (GC)¹ dynamically. Most objects created during run-time, meta information such as classes and methods definitions are stored in heap memory.

Heap memory model is much more complicated, and most tuning happens here.

Garbage Collection

Garbage collector is responsible for 3 things:

Allocate memory dynamically.
Ensure valid objects remain in memory (i.e. valid object reference does not fail).
Release unused memory when the memory area cannot be accessed anywhere in the program.

GC basics

A simple GC

Now consider how a simple GC is implemented. A very naive GC can be described below:

Looking through the heap memory and mark all objects as alive or dead. An object is considered dead if there is no pointer that points to it in the running program, aka unreferenced.
Delete all dead objects.

Problems

There are some problems with the simple solution given above, some obvious ones are:

Every time the GC is needed, the whole heap memory must be went through, which costs much time. In a stop-the-world (STW) fashion, this will stuck the whole application and brings very bad user experience (or timeout etc).
After some time, this methods leaves many "holes" in the memory which makes it harder and less efficient for future memory allocation.

Evaluation metrics

How do we evaluate if a GC is good? There are some performance metrics:

Throughput, the percentage of total time not spent in garbage collection, considered over long periods of time.
Garbage collection overhead, the inverse of throughput, that is, the percentage of total time spent in garbage collection.
Pause time, the length of time during which application execution is stopped while garbage collection is occurring.
Frequency of collection, how often collection occurs, relative to application execution.
Footprint, a measure of size, such as heap size.
Promptness, the time between when an object becomes garbage and when the memory got released.

These metrics do not work alone: they affect each other. For example, a bigger heap size potentially leads to lower frequency to perform GC, but it raises the footprint and makes each GC take more time.

Also, importance of each metrics is different for different kinds of application. For example, an interactive application might require low pause time to get a better user experience, whereas an embedded device requires smaller footprint.

So it is usually trade-offs among these factors while choosing or tuning GC.

Generational collection

In order to optimize overall GC performance, JVM uses generational collection to manage heap memory.

Prologue

The following picture shows a typical distribution of lifetimes of objects.

As shown in the figure above, a huge amount of objects are dead shortly after they are created. Even though not every application fits in this graph, but a large number process this general shape, i.e. live fast, die young.

Heap Memory model

When generational collection is used, memory is divided into several small parts called generations, which means memory areas that holds objects of different ages, as shown below.

Heap memory is organized into three generations: young generation, old generation and permanent generation. Most objects are initially allocated in young generation. When some of them survives after some number of GC, they are moved to old generation. Permanent generation is used to store meta data of classes, methods or other information used by JVM internally.

Fast memory allocation

A technique called bump-the-pointer is used to achieve fast memory allocation. It means the JVM keeps track of previously allocated object and allocates new object in the next continuous memory.

When multi-thread is considered, allocation must be thread safe. One way to guarantee that is to add a global lock every time allocation is performed. However locking and unlocking are very slow and are purely waste of time. Thus a technique named Thread-Local Allocation Buffers (TLABs) is used to avoid performance bottleneck introduced by global lock. Using TLABs each generation are divided into smaller pieces and assigned to each thread as its local buffer.

The combination of TLABs and linear allocations enables each allocation to be more efficient.

Legacy GC algorithms

Serial GC

With serial collector, both young and old collections are done serially, in a STW fashion.

Young GC

Using this GC, between S0 and S1, there is always one space that is empty at any time. Suppose the S0 space is empty before GC starts. The live objects in Eden and S1 are copied to S0 space (the empty survivor space) if there is enough space. Otherwise, the object directly goes to old generation despite how many GC it really survived.

After this move, Eden and S1 becomes empty. For the next young GC, S0 and S1 swap their role and S1 becomes the empty space to hold survived objects.

Old GC

The old and permanent generations are collected via mark-sweep-compact algorithm, which consists of three steps:

In mark step, the collector identifies which objects are still alive and mark them.
In sweep phase, the collector looks for garbage memory and identifies them.
In compaction phase, the collector moves all live objects to the start of old generation (or perm generation) so that a continuous memory chunk is generated that can be used for fast allocation.

Evaluation

Serial collector are suitable for single-process machine with small footprint that does not require very low pause time.

Parallel GC

Parallel collector is very similar to serial GC. It uses multiple threads to performs GC in parallel to speed it up and reduce pause time. It is suitable for multi-core machines.

Concurrent Mark-Sweep (CMS) GC

It is also called low-latency collector. Compared with GC introduced above, it provides fast response time.

Young GC

Very similar to the parallel GC.

<img src="/jvm-memory/cms-young-2.png">2

Old GC

Most GC of old generation is done concurrently with the execution of the application.

The process of CMS old GC can be described as below:

It starts with a short pause stage called initial mark. During this stage an initial set of live objects directly reachable from root region is created. The root region may be application thread stacks and registers, or young generation objects.
During concurrent marking phase, the collector marks all live objects that are transitively reachable from the initial set. All objects assigned during this phase are marked as alive. Because CMS runs concurrently with the application, the object references might be updated while it is performing GC. Thus the result is not accurate.
The last phase remark pauses the application again. During this period the collector revisits any objects that were modified during the second phase and update object marks. Multiple threads are used to speed it up. After this step, it is guaranteed that all objects are correctly marked. The garbage will be collected while application is running without STW.

The overall status before and after old GC is show in following figures.

Evaluation

CMS sacrificed a lot to achieve low latency:

CMS does not do memory compacting, so it cannot simply use a pointer when allocating memory. Instead, a linked list is used to store available memory pieces. Every time an object is moved to old generation, the list has to be traversed until a suitable memory hole is found, which is expensive. This also raises another problem that the footprint is much larger because of the usage of linked list.
CMS collector tries to start a collection as early as possible to keep memory available. If the GC has to be triggered when the old generation is full, an old STW full GC is performed.

Gargage First (G1) GC

G1 was newly added in Oracle JDK 7 update 4 and later version. It takes a very different way compared with those legacy GC, from memory model to collection algorithm.

Memory model

Unlike other garbage collectors, with G1 the heap memory is partitioned into a set of equal-sized heap regions, as shown in the figure below. Certain region sets are assigned the same roles (Eden, survivor, old) but their sizes are not fixed.

The heap is split into around 2000 regions, with minimum size of 1Mb and maximum size of 32 Mb. Sizes of these regions can be adjusted dynamically by G1 when in need.

G1 has larger footprint mainly because two information are recorded:

Remembered Sets or RSets track object references into a given region. There is one RSet per region in the heap. The overall footprint impact of RSets is less than 5%.
Collection Sets or CSets are sets of regions that will be collected in an GC. All live data in a CSet is evacuated during a GC. CSets have a less than 1% impact on the size of the footprint.

GC algorithm

Young GC

Live objects are evacuated to one or more survivor regions, or even old generation regions. This process is a STW pause. Eden size and survivor size is calculated for the next young GC.

Old GC

GC for old generation can be described below:

Initial mark (STW)

Similar to CMS algorithm, G1 starts with a concurrent global marking phase. This phase is done just after a young GC, marking all alive objects. In the log it is noted as GC pause (young)(inital-mark).

Concurrent marking

If empty regions are found, they are marked for deletion. Also some information that decides aliveness is calculated during this phase.

Remark (STW)

All empty regions are removed and aliveness of all regions is calculated.

Cleanup (STW)

G1 selects the regions with most garbage, i.e. regions that can be reclaimed fastest. These regions are collected at the same time of young GC. This is noted as GC pause (mixed) in the log.

G1 uses a prediction model to arrange its GC process. It collects information for every GC and estimate how many regions it can process within given pause time. However G1 is not a real-time collector. The pause time limit is not strictly satisfied.

If in some rare cases, full GC is triggered, G1 falls back to legacy STW mode. Full GC can be avoided by tuning.

Monitor GC

-XX:+PrintGCDetails Enable GC logging.
-XX:+PrintGCDateStamps Log timestamp.

Tuning

Choosing suitable GC

-XX:+UseSerialGC Use serial GC.
-XX:+UseParallelGC Use parallel GC.
-XX:+UseConcMarkSweepGC Use CMS GC.
-XX:+UseG1GC Use G1 GC.

Heap size settings

-Xms<n> Set the initial heap size when JVM starts.
-Xmx<n> Set the maximum heap size (might be related to OutOfMemery error).
-Xmn<n> Set the size of young generation, rest of the space goes for old generation.
-XX:PermGen[fn:2] Set the initial size of the Permanent Generation Memory.
-XX:MaxPermGen[fn:2] Set the maximum size of Perm Gen.
-XX:SurvivorRatio For providing ratio of Eden space, for example if young generation size is 10m and VM switch is –XX:SurvivorRatio=2 then 5m will be reserved for Eden space and 2.5m each for both the Survivor spaces. The default value is 8.
-XX:NewRatio For providing ratio of old/new generation sizes. The default value is 2.

Behavior-based collector tuning

-XX:MaxGCPauseMillis=n Sets a target for the maximum GC pause time. This is a soft goal, and the JVM will make its best effort to achieve it.
-XX:GCTimeRatio=n Set the value of GC time/application time as 1/(1 + n). The default value is 99, i.e. 1% time spent on GC.

G1 switches

-XX:MaxGCPauseMillis=n Sets a target for the maximum GC pause time. This is a soft goal, and the JVM will make its best effort to achieve it.
-XX:InitiatingHeapOccupancyPercent=n Percentage of the (entire) heap occupancy to start a concurrent GC cycle. It is used by GCs that trigger a concurrent GC cycle based on the occupancy of the entire heap, not just one of the generations (e.g., G1). A value of 0 denotes 'do constant GC cycles'. The default value is 45.
-XX:NewRatio=n Ratio of new/old generation sizes. The default value is 2.
-XX:SurvivorRatio=n Ratio of eden/survivor space size. The default value is 8.
-XX:MaxTenuringThreshold=n Maximum value for tenuring threshold. The default value is 15.
-XX:ParallelGCThreads=n Sets the number of threads used during parallel phases of the garbage collectors. The default value varies with the platform on which the JVM is running.
-XX:ConcGCThreads=n Number of threads concurrent garbage collectors will use. The default value varies with the platform on which the JVM is running.
-XX:G1ReservePercent=n Sets the amount of heap that is reserved as a false ceiling to reduce the possibility of promotion failure. The default value is 10.
-XX:G1HeapRegionSize=n With G1 the Java heap is subdivided into uniformly sized regions. This sets the size of the individual sub-divisions. The default value of this parameter is determined ergonomically based upon heap size. The minimum value is 1Mb and the maximum value is 32Mb.

There are some best practices here while using G1:

Do not set young generation size. Young generation size should be maintained by G1 dynamically. If you do so, G1 will no long respect pause time target, or be able to expand and contract the young generation space as needed.
Better not setting average response time. Instead, set -XX:MaxGCPauseMillis=<N> option and set value of N to meet 90% of the time or more. This means that 90% of clients making a request will not experience a response time higher than the goal.

Java 8 related points

Changes of memory model

Since Java 8, the old Permanent Generation is totally removed from memory model. The main reason is that it is difficult to manage and tune. Thus there will be no error message like OutOfMemoryError: PermGen space.

Java 8 added a new memory area called Metaspace to play a similar role as old PermGen. It stores meta information of classes, methods etc.

One important difference is that Metaspace does not use JVM memory anymore. Instead, it uses raw memory (from OS) and automatically increases its size (up to what the underlying OS provides), while PermGen always has a fixed maximum size.

Note that it is still possible to give a limit for Metaspace by passing corresponding start-up option.

Q&A

Is PermGen really part of heap memory?

According to this question, no. This picture is somehow incorrect.

How does CMS remark phase really work?

According to this blog, CMS records all object reference changes during concurrent mark phase. After entering remark phase, GC complement the reference information according to the changes it makes.

References

Term GC might refer to Garbage Collector or Garbage Collection, according to the context.