Accelerating JPEG Coding With Multiple Threads

After the JPEG standard was released in 1992, JPEG images became synonymous with digital photography and are now used in almost every application that works with photo-quality images. The standard was adopted quickly and nearly universally because it combines multiple techniques to reduce the compressed file size. One of those techniques exploits the limits of the human visual system: an understanding of which information is important to preserve and which is less important and can be removed.

The JPEG coding algorithm

There are several steps involved in compressing an image with the JPEG method (see image below). To begin, the image is usually converted from RGB into the YCbCr colorspace. This conversion enables sub-sampling of the pixels: the human eye is more sensitive to changes in luminance than to changes in chrominance, so the chroma channels can be downsampled while the luma channel is kept at full resolution. The image can lose 50% of its data in this step with only minimal perceived degradation. The image is then divided into 8x8 MCUs, or minimum coded units (the equivalent terminology for video codecs is macroblocks). These MCUs are square blocks of pixels, each compressed in turn. The pixels in each MCU are transformed from the spatial domain to the frequency domain with a Discrete Cosine Transform (DCT). This operation allows high-frequency information (fine detail) to be easily removed to further compress the image. The more high-frequency coefficients are removed, the smaller the file and the blurrier the image becomes. This is effectively what the “Q Value”, or compression amount, used by encoders controls.

JPEG coding steps
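The colorspace conversion step can be sketched in a few lines of C. This is a minimal, illustrative version of the JFIF-style RGB-to-YCbCr transform (ITU-R BT.601 coefficients, as used by libjpeg); the function name is my own:

```c
#include <assert.h>

/* Convert one 8-bit RGB pixel to YCbCr. Cb and Cr are centered at 128
 * so that a neutral gray has no chroma. Results are clamped to 0..255. */
static unsigned char clamp255(double v)
{
    return (unsigned char)(v < 0.0 ? 0.0 : (v > 255.0 ? 255.0 : v + 0.5));
}

static void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                         unsigned char *y, unsigned char *cb, unsigned char *cr)
{
    *y  = clamp255( 0.299    * r + 0.587    * g + 0.114    * b);
    *cb = clamp255(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
    *cr = clamp255( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
}
```

Once the image is in YCbCr, downsampling the Cb and Cr planes by 2x in each direction (4:2:0 subsampling) is what yields the 50% data reduction described above.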

One of the many clever ideas incorporated into the standard is to use the similarity in DC value (essentially the brightness) between adjacent MCUs to further reduce the data size: only the change from one MCU to the next is encoded, instead of the entire value. This is depicted above in the “DPCM coding” block. It’s a great idea, but it creates a small problem. Because each successive MCU’s DC value depends on a delta from the previous one, an error in the data means that every MCU from that point onward will be wrong.
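The DPCM step can be sketched as follows. This is a minimal illustration with made-up names, not the actual encoder logic: each MCU stores only the difference between its DC value and the previous MCU’s, with the predictor starting at 0.

```c
#include <assert.h>

/* DPCM-code an array of DC values: each output is the delta from the
 * previous DC value; the predictor starts at 0 for the first MCU. */
static void dc_to_deltas(const int *dc, int *delta, int n)
{
    int pred = 0;
    for (int i = 0; i < n; i++) {
        delta[i] = dc[i] - pred;  /* encode only the change */
        pred = dc[i];             /* predictor for the next MCU */
    }
}
```

The deltas tend to be small numbers, which compress into far fewer bits than the full DC values would. The cost is the dependency chain: decoding delta *i* requires every delta before it to be correct.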

Making matters slightly worse is that the compressed symbols which encode the image data (shown above as “Huffman coding”) are Variable Length Codes (VLC). A single wrong bit can corrupt the data from that point forward. Back in the ‘old’ days when JPEG was invented, this was a very real concern because images were often transmitted over channels (e.g. acoustic modem) that may not have had robust error correction or stored on media (e.g. floppy disks) that were prone to errors. Knowing that the data may suffer from errors, a feature was added to the standard to mitigate the amount of the image that would be corrupted by bad data. The idea was to periodically reset the previous DC value to 0, forcing the next MCU to encode its entire value as a delta from 0. This means that any corrupted DC values will only affect pixels up to the next restart point.

This is implemented with Restart Markers. They are 2-byte markers placed in between the MCUs at regular intervals (e.g. every 100 MCUs). If a data corruption occurs, it’s easy to scan forward in the file to the next restart marker (JPEG markers are always on byte boundaries and preceded by 0xFF). Once the next restart marker is found, the image can be properly decoded from that point on since the number of MCUs between each restart marker is known.
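Scanning for the next restart marker is straightforward. The sketch below assumes a buffer of entropy-coded data; RST0 through RST7 are the bytes 0xFFD0-0xFFD7, and a literal 0xFF inside entropy-coded data is “byte-stuffed” as 0xFF 0x00, so that pair must be skipped:

```c
#include <assert.h>
#include <stddef.h>

/* Scan forward from 'start' for the next restart marker (0xFF 0xD0-0xD7).
 * Skips stuffed bytes (0xFF 0x00). Returns the offset of the 0xFF byte,
 * or -1 if no restart marker is found. */
static long find_next_restart(const unsigned char *buf, size_t len, size_t start)
{
    for (size_t i = start; i + 1 < len; i++) {
        if (buf[i] == 0xFF) {
            unsigned char m = buf[i + 1];
            if (m >= 0xD0 && m <= 0xD7)
                return (long)i;  /* found RSTn */
            i++;  /* 0xFF 0x00 is a stuffed byte; skip the pair */
        }
    }
    return -1;
}
```

Because markers are byte-aligned and cannot be mimicked by entropy-coded data, this scan needs no bit-level decoding at all, which is what makes it cheap enough to run before the real decode starts.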

The ever evolving use of digital images

Not long after the JPEG standard was released, computer networks and storage became more reliable and incorporated error detection and correction (e.g. TCP/IP). The solid-state storage cards used in digital cameras were quite reliable, and restart markers were ‘forgotten’ for a while since they made files slightly larger without providing much perceived benefit. During that period, computer software was mostly designed to run on a single processor using a single thread. The fact that a JPEG image had to be decoded in a single pass, due to its use of variable length codes and a chain of successive DC delta values, caused no problems for software that was designed to run as a single thread anyway.

In the last few years, however, computers and our use of JPEG images have changed dramatically. Nearly every computing device has multiple CPUs and runs an operating system with multiple threads (even phones). People are also taking, editing, and viewing billions of JPEG photos on their mobile phones, and each generation of phones produces larger and higher-resolution images. For anyone working with large numbers of photos, the time to encode and decode them has become increasingly important because of the sheer volume and size of new images being generated.

The average transfer size of images requested by the web page. Data source: HTTP Archive

Repurposing a good idea into an even better idea

Computers have been getting more powerful on a mostly continuous trajectory since the 1970s. Moore’s Law, the term coined for Gordon Moore’s prediction about continuous improvements in silicon manufacturing, states that the number of transistors on a chip will double roughly every two years. This has held mostly true for transistor count, but the maximum clock speed of processors has essentially hit a dead end due to the physical limitations of silicon and power and heat constraints. Since individual processor speed hasn’t advanced much in recent years, the emphasis has shifted to employing many processors working in parallel to complete tasks faster. In today’s computing environment, it’s advantageous to be able to divide a task into sections and assign them to multiple CPUs. Not every task can be divided this way, because each successive part may depend on the results of the previous one. JPEG encoding and decoding are normally difficult to split into pieces that run in parallel, because each successive MCU depends on the previous one and the entropy-coded data uses variable length codes.

However… a convenient benefit of restart markers is that the VLC data is reset to a byte boundary (after the marker) and the DC delta predictor is also reset. This means that both JPEG encoding and decoding can be divided across multiple threads with the use of restart markers. For encoding, the image can be divided into equal-sized horizontal strips, each encoded by a different processor. When every processor has finished, the outputs can be ‘glued’ together with restart markers placed between them. For decoding, the work can be spread across as many processors as there are restart intervals. The only extra effort needed is an initial scan through the compressed data to locate the restart markers, since the amount of compressed data between markers varies and JPEG files contain no ‘directory’ of their locations.
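Dividing the image into strips for parallel encoding is mostly a matter of rounding strip heights to MCU boundaries, so that no MCU straddles two threads. A minimal sketch (the 8-row MCU height assumes no chroma subsampling; the function name is illustrative):

```c
#include <assert.h>

/* Compute the number of pixel rows handed to each encoding thread.
 * Strips are a whole number of MCU rows (8 pixels tall here) so each
 * thread's output can be joined to the next with a restart marker.
 * The last thread simply gets whatever rows remain. */
static int rows_per_strip(int image_height, int num_threads)
{
    const int mcu_h = 8;
    int mcu_rows = (image_height + mcu_h - 1) / mcu_h;      /* total MCU rows */
    int per_thread = (mcu_rows + num_threads - 1) / num_threads;
    return per_thread * mcu_h;
}
```

For example, a 1080-row image split across 6 threads works out to 184 rows (23 MCU rows) per strip, with the final thread handling the shorter remainder.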

Real World Example

With multi-threaded applications, performance rarely scales 1:1 with the number of CPUs used (e.g. splitting a task into 12 threads on 12 CPU cores doesn’t make it run 12x faster). There is extra overhead in managing the threads, and main memory is a single resource shared among all the processors. Let’s run a test on the following image:

JPEG timings example image
Image credit: Matthew Henry on Unsplash

Among other techniques, at Optidash we incorporate this idea of using restart markers to greatly accelerate both the decoding and encoding of images. The image above was run through one of our testing tools on a 2018 MacBook Pro 15” laptop with a 6-core Intel i7 processor. Here are the results:

Number of CPUs | Image decode time | Image encode time

As you can see in the table above, there is a measurable advantage to splitting JPEG coding into multiple parts and assigning them to different CPUs. The speedup is not linear, though: in this case, the intense use of large areas of memory limits how much additional CPUs can improve the overall speed.

Here’s some pseudocode which demonstrates how a multi-threaded JPEG encoder might look:

void ThreadedWriteJPEG(assorted JPEG parameters, int numThreads) {
    int slice;
    pthread_t threads[MAX_THREADS];

    // Start a thread for each slice
    for (slice = 0; slice < numThreads; slice++) {

        // Use a ‘slice’ structure to hold info about each thread’s work
        // This includes a pointer to the start of that strip of pixels
        // and the number of lines to compress

        <setup slice structure for each thread’s work>

        pthread_create(&threads[slice], NULL, JPEGBuffer, &slices[slice]);
    } // for each slice

    // wait for all worker threads to finish
    for (slice = 0; slice < numThreads; slice++)
        pthread_join(threads[slice], NULL);

    // merge slices into a single file and write it
    WriteJPEGBuffer(slices, numThreads);
}
