Aleksey Shipilëv on Compact Strings

By: NightHacking

Uploaded on 02/08/2016

A presentation from Aleksey Shipilëv on his journey working through the compact string implementation in the JVM.

Full Abstract: java.lang.String is pervasively and perversely used in most Java applications. Not surprisingly, we are looking into optimizing it on both small and large scales. In this talk, we will take a deeper look into two interesting String-related features coming in JDK 9: a) Compact Strings, which saves memory for Strings representable with single-byte chars, with little or no performance regression and, in many cases, significant performance improvements; b) Indify String Concat, which uses the magic of invokedynamic to concatenate Strings, freeing runtime implementors to optimize string concatenation without pushing users to recompile their programs. We will talk about the rationale, pitfalls and caveats of implementing these intrusive core library/runtime changes.

Comments (6):

By anonymous    2017-09-20

Compressed strings (Java 6) and compact strings (Java 9) both have the same motivation (strings are often effectively Latin-1, so half the space is wasted) and goal (make those strings small) but the implementations differ a lot.

Compressed Strings

In an interview Aleksey Shipilëv (who was in charge of implementing the Java 9 feature) had this to say about compressed strings:

The UseCompressedStrings feature was rather conservative: while distinguishing between the char[] and byte[] cases, and trying to compress the char[] into a byte[] on String construction, it did most String operations on char[], which required unpacking the String. Therefore, it benefited only a special type of workload, where most strings are compressible (so compression does not go to waste), and only a limited set of known String operations is performed on them (so no unpacking is needed). In a great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.

[...] The UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar, which was loaded once the VM option was supplied. Optional features are harder to test, since they double the number of option combinations to try.

Compact Strings

In Java 9 on the other hand, compact strings are fully integrated into the JDK source. String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two. Most operations do a check to see which is the case, e.g. charAt:

public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);  // one byte per character
    } else {
        return StringUTF16.charAt(value, index);   // two bytes per character
    }
}

Compact strings are enabled by default and can be partially disabled with -XX:-CompactStrings - "partially" because strings are then still backed by a byte[], and operations returning chars must still assemble them from two separate bytes (due to intrinsics it is hard to say whether this has a performance impact).
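
As a rough illustration of that reassembly, here is a minimal sketch - not the JDK code, and the method name is made up - assuming a high-byte/low-byte layout; the real StringUTF16 accessor is intrinsified and also handles platform byte order and bounds checks:

// Hypothetical helper, assuming the high byte is stored first; the real
// StringUTF16 accessor is intrinsified and respects platform byte order.
static char charFromUtf16Bytes(byte[] value, int index) {
    int hi = value[index * 2]     & 0xFF;  // high-order byte of the code unit
    int lo = value[index * 2 + 1] & 0xFF;  // low-order byte of the code unit
    return (char) ((hi << 8) | lo);        // combine into one 16-bit char
}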

More

If you're interested in more background on compact strings, I recommend reading the interview linked above and/or watching this great talk by the same Aleksey Shipilëv (which also explains the new string concatenation).

Original Thread

By anonymous    2017-09-20

Java 9 brings String optimizations with it, most notably JEP 254 (Compact Strings).

"Instead of having char[] array, String is now represented as byte[] array. Depending on which characters it contains, it will either use UTF-16 or Latin-1, that is – either one or two bytes per character. There is a new field inside the String class – coder, which indicates which variant is used. Unlike Compressed Strings, this feature is enabled by default. If necessary (in a case where there are mainly UTF-16 Strings used), it can still be disabled by -XX:-CompactStrings.

The change does not affect any public interfaces of String or any other related classes. Many of the classes were reworked to support the new String representation, such as StringBuffer or StringBuilder."
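
To make the "depending on which characters it contains" part concrete, here is a hypothetical helper (fitsInLatin1 is not a JDK method) that performs essentially the same check the runtime makes when deciding between the one-byte and two-byte representation:

// Hypothetical helper, not part of the JDK: a string can use the compact
// one-byte representation only if every char fits into a single byte.
static boolean fitsInLatin1(String s) {
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) > 0xFF) {
            return false;  // at least one char needs the two-byte UTF-16 form
        }
    }
    return true;
}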

http://openjdk.java.net/jeps/254

Description

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.

The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.

For further detail, see

http://cr.openjdk.java.net/~shade/density/state-of-string-density-v1.txt

http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf

https://www.youtube.com/watch?v=wIyeOaitmWM

https://www.infoq.com/news/2016/02/compact-strings-Java-JDK9

Original Thread

By anonymous    2017-09-20

In Java 8

In Java 8 String has a field char[] value - a char takes up two bytes because it represents a full UTF-16 code unit. Oracle's analysis of many, many heap dumps came to two conclusions:

  • these arrays occupy somewhere between 20 % and 30 % of an average application’s live data (including headers and pointers)
  • the overwhelming majority of strings only require ISO-8859-1 (also called Latin-1), which is a single byte.

Refactoring String to use only one byte for Latin-1 characters could hence save about 10 % to 15 % of memory and improve runtime performance by reducing garbage allocation.
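
As a back-of-the-envelope sketch of that saving (payload only; object headers, padding and the String instance itself are ignored, and the method name is made up for illustration):

// Rough payload-only estimate: Java 8 always stores 2 bytes per char,
// Java 9 stores 1 byte per char if the string is Latin-1 only.
static void printBackingArrayPayload(String s) {
    boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
    int java8Payload = s.length() * 2;                        // char[] backing
    int java9Payload = latin1 ? s.length() : s.length() * 2;  // byte[] backing
    System.out.println(s + ": Java 8 ~" + java8Payload
            + " bytes, Java 9 ~" + java9Payload + " bytes");
}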

In Java 9

In Java 9, String is backed by a byte[] value, so that strings containing only Latin-1 characters need just one byte per character. But what happens if a string also contains characters that do not fit into a single byte?

This may sound like a case for variable-sized records like UTF-8, where the distinction between one and two bytes is made per character. But then there would be no way to predict for a single character which array slots it will occupy, thus requiring random access (e.g. charAt(int)) to perform a linear scan. Degrading random access performance from constant to linear time was an unacceptable regression.
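
To illustrate why such a variable-length encoding would break constant-time access, here is a purely hypothetical charAt over such a layout - this is NOT how the JDK stores strings - where a marker bit in the first byte says whether a character occupies one byte or two:

// Hypothetical variable-length layout, NOT the JDK's: locating character
// `index` requires walking every character before it, i.e. O(n) access.
static char charAtVariableLength(byte[] value, int index) {
    int pos = 0;
    for (int i = 0; i < index; i++) {
        // top bit set on the first byte means "two-byte character"
        pos += ((value[pos] & 0x80) != 0) ? 2 : 1;
    }
    if ((value[pos] & 0x80) == 0) {
        return (char) value[pos];                       // one-byte character
    }
    return (char) (((value[pos] & 0x7F) << 8) | (value[pos + 1] & 0xFF));
}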

Instead, the representation is chosen per string: if every character can be encoded with a single byte, the one-byte form is used; if at least one character requires two bytes, all of them are stored with two bytes. A new field coder denotes how the bytes encode characters, and many methods in String evaluate it to pick the correct code path.

Here’s how that looks in a simplified version of the String constructor:

// `char[] value` is the argument
if (COMPACT_STRINGS) {
    // try to compress to one byte per character (Latin-1)
    byte[] val = StringUTF16.compress(value);
    if (val != null) {
        // compression succeeded: keep the single-byte representation
        this.value = val;
        this.coder = LATIN1;
        return;
    }
}
// compression disabled or not possible: two bytes per character
this.coder = UTF16;
this.value = StringUTF16.toBytes(value);

There are a couple of things to note here:

  • The boolean flag COMPACT_STRINGS, which backs the command line flag -XX:-CompactStrings and with which the compression can be disabled.
  • The utility class StringUTF16 is first used to try to compress the value array to single bytes and, should that fail and return null, to convert it to double bytes instead (see the simplified sketch after this list).
  • The coder field is assigned the respective constant that marks which case applies.
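
For intuition, here is a simplified stand-in for the compression step (the real StringUTF16.compress is intrinsified and more involved): it either returns the single-byte representation or null to signal that UTF-16 is needed, which is exactly the branch the constructor above takes.

// Simplified sketch, not the real (intrinsified) StringUTF16.compress:
// returns the Latin-1 bytes, or null if any char needs more than one byte.
static byte[] compress(char[] value) {
    byte[] latin1 = new byte[value.length];
    for (int i = 0; i < value.length; i++) {
        if (value[i] > 0xFF) {
            return null;              // cannot compress, fall back to UTF-16
        }
        latin1[i] = (byte) value[i];  // safe: the char fits into one byte
    }
    return latin1;
}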

If you want to learn more about compact strings (and indified string concatenation), have a look at JEP 254 or Aleksey Shipilev’s talk (it also includes some benchmarks).

Original Thread
