EXIF / XMP / RDF / JPEG and JFIF or how to efficiently extract Motion Video from Google Pixel Photos

With the exception of JPEG, these acronyms may be unfamiliar, but I had to investigate them to extract all the short video clips, which are around 1.6 seconds long, contained inside the Google Pixel photos when you enable Motion Picture in the Camera app.

First, let's clarify what we are talking about. When your phone takes a picture, the sensor captures light through a Bayer filter and sends the pixel values of each color to an encoder (most likely a hardware one) that encodes the image in the JPEG compressed format. However, to decode and display the image, the decoder requires parameters such as the image width and height. That's where the JPEG File Interchange Format, aka JFIF, comes in. It defines the file format that we manipulate (i.e. *.jpg), and this file contains different sections used to decode the image and to carry application-specific data. One usual way to store additional information is to use the JFIF APP1 section with EXIF metadata such as location, camera/lens used, date and time, etc. However, this is still not enough, due to the inherent limitations of EXIF (mostly the size constraints). To address this issue, Google (and possibly other manufacturers?) started to add additional information in another format called Extensible Metadata Platform, aka XMP, a standard created by Adobe for PDF files that can also be used for other file formats. This standard uses a subset of the Resource Description Framework, aka RDF, a W3C standard for exchanging metadata.
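To make the idea of "sections" concrete, here is a minimal sketch that lists the segments at the start of a JPEG file. It assumes the usual layout: a 2-byte start-of-image marker, then segments made of a 0xff byte, a marker byte, and a 2-byte big-endian length that counts itself. The function name list_segments is made up for this illustration, and the sketch relies on od being available.

```bash
#!/usr/bin/env bash

# Print "marker offset length" for each segment found before the image data.
list_segments () {
    local file="$1" pos=2 ff marker hi lo len   # pos=2 skips the SOI marker (ff d8)
    # Read 4 bytes at pos: 0xff, the marker id, then a big-endian 16-bit length.
    while read -r ff marker hi lo < <(od -An -v -tx1 -j "$pos" -N 4 "$file" 2>/dev/null) \
            && [ "$ff" = "ff" ] && [ -n "$lo" ]; do
        len=$(( 16#$hi * 256 + 16#$lo ))
        echo "$marker $pos $len"
        [ "$marker" = "da" ] && break           # start of scan: compressed data follows
        pos=$(( pos + 2 + len ))                # 2 marker bytes + length (which counts itself)
    done
}
```

Running list_segments file.jpg prints one line per segment; the APP1 sections that carry EXIF and XMP show up with the marker e1.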

All I wanted was to get the offset and the size of the sidecar MP4 video within the file. After some research, I discovered that most phone manufacturers simply concatenate the video to the end of the JFIF file, and that most extraction tools scan the file until they hit a delimiter that indicates the start of an MP4 file. This doesn't seem to be a reliable way to do it, and scanning a multi-megabyte file is inefficient. Initially, I attempted to locate the video offset within the easily accessible EXIF metadata. However, all I found was a flag indicating that the image contains a Motion Photo, with no information on the video size or offset. As a result, I decided to look deeper into the file with xxd and discovered one of the pieces of information I was looking for: the sidecar video file size!
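For comparison, the delimiter-scanning approach can be sketched as follows. This is not the code of any particular tool. It assumes GNU grep, and it assumes the video starts with an MP4 ftyp box, whose 4-byte size field sits just before the ftyp tag. The function name guess_mp4_offset is hypothetical.

```bash
#!/usr/bin/env bash

# Guess where an appended MP4 starts by scanning the whole file for "ftyp".
guess_mp4_offset () {
    local file="$1" tag_offset
    # -a: treat binary as text, -b: print byte offsets, -o: offset of the match itself.
    tag_offset=$(LC_ALL=C grep -abo 'ftyp' "$file" | head -n 1 | cut -d: -f1)
    [ -n "$tag_offset" ] || return 1
    # The box begins with its 4-byte size, so back up 4 bytes from the tag.
    echo $(( tag_offset - 4 ))
}
```

The whole file has to be read, and the four bytes of the tag could in principle also occur inside the image data, which is why this approach feels fragile.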

At the time, I was unaware that the sidecar file was simply concatenated at the end of the JFIF. It seems like an odd choice to me. And indeed, this may be the exact problem that prevents me from enabling Motion Picture and Depth Map in the Camera app! Both features append their sidecar at the end without specifying the start offset. The only explanation I have is that the metadata sits at the beginning of the file, so modifying it would shift the start of the video and make a stored offset value pointless. Fortunately, there is a straightforward solution: simply specify the size of each sidecar file in a given order. Then, knowing the total file size, we can add up the sizes of the sidecar files to determine the offset of the one we wish to extract. Erratum: obviously, I should have noticed the Padding field, which indicates how many bytes are appended after the sidecar file and makes it easy to calculate the right offset if there are multiple sidecar files.
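The arithmetic can be sketched like this. It is only an illustration under stated assumptions: the sidecars are stored back to back at the very end of the file, in the order their lengths are listed, with no padding after them. The function name sidecar_offset and its arguments are hypothetical.

```bash
#!/usr/bin/env bash

# sidecar_offset FILE_SIZE INDEX LENGTH...
# Print the start offset of sidecar number INDEX (0-based, in file order),
# given the total file size and the length of every sidecar.
sidecar_offset () {
    local file_size="$1" index="$2"
    shift 2
    local lengths=("$@") offset="$file_size" i
    # Walk backwards from the end of the file, subtracting the length of the
    # requested sidecar and of every sidecar stored after it.
    for (( i = ${#lengths[@]} - 1; i >= index; i-- )); do
        offset=$(( offset - ${lengths[i]} ))
    done
    echo "$offset"
}
```

For a 1000-byte file with two sidecars of 300 and 200 bytes, sidecar_offset 1000 1 300 200 prints 800 and sidecar_offset 1000 0 300 200 prints 500.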

By using the length information inside the XMP data that I saw with xxd, I was able to extract the video with great success! I later found that I can extract the XMP information with exiftool using the following command.

exiftool -xmp -b file.jpg

Then I created a script that converts the XMP XML to JSON using xq and then uses the much more familiar jq command to extract the sidecar video length. With this, I was able to successfully extract the video files from my Motion Picture JPEGs!

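#!/usr/bin/env bash
# Extract the sidecar MP4 from Google Pixel Motion Photos.
# Requires exiftool, xq (used here with -j for JSON output) and jq.

# Print the length in bytes of the video/mp4 item declared in the XMP data.
get_video_length () {
    local filename="$1"
    exiftool -xmp -b "$filename" \
        | xq -j \
        | jq -r -f <(cat <<'EOF'
."x:xmpmeta"
."rdf:RDF"
."rdf:Description"
."Container:Directory"
."rdf:Seq"
."rdf:li"
| map(select(."Container:Item"."@Item:Mime"=="video/mp4"))
| .[0]
| ."Container:Item"
  ."@Item:Length"
EOF
)
}

# The video sits at the very end of the file, so copy its last $length bytes.
extract_video () {
    local filename="$1"
    local length="$2"
    tail -c "$length" "$filename" > "$filename.mp4"
}

for file in "$@"
do
    echo "$file"
    length=$(get_video_length "$file")
    # Skip files without a video item (jq prints "null" in that case).
    if [[ "$length" =~ ^[0-9]+$ ]]; then
        extract_video "$file" "$length"
    else
        echo "no video/mp4 item found in $file" >&2
    fi
done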

The jq filter may be ugly, but this was just a proof of concept.

After that, I started reading the XMP and RDF specifications, because there isn't much information available on the Internet besides the specification documents, which are more focused on specifying than explaining. After feeling lost for a while, I discovered that the standard way to decode XMP data is to use the Adobe XMP Toolkit.

I wanted to write my program in Rust. Fortunately, there is already an xmp_toolkit crate to manipulate XMP data and the jfifdump crate to access the different sections of a JFIF file. Specifically, I needed to access the APP1 JFIF section that contains the XMP data and that starts with the null-terminated string http://ns.adobe.com/xap/1.0/ (don't ask me why it's xap instead of xmp…).
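To illustrate what locating that section involves, here is a rough shell sketch (not the Rust program). It walks the segments, looks for an APP1 (marker e1) whose payload starts with the XMP header followed by a NUL byte, and prints the packet that follows. It assumes the XMP fits in a single APP1 segment; the names XMP_HEADER and print_xmp_packet are made up for this example.

```bash
#!/usr/bin/env bash

XMP_HEADER='http://ns.adobe.com/xap/1.0/'   # followed by a NUL byte in the file

# Print the XMP packet stored in the first APP1 segment carrying the XMP header.
print_xmp_packet () {
    local file="$1" pos=2 ff marker hi lo len sig skip
    skip=$(( ${#XMP_HEADER} + 1 ))          # header plus its NUL terminator
    while read -r ff marker hi lo < <(od -An -v -tx1 -j "$pos" -N 4 "$file" 2>/dev/null) \
            && [ "$ff" = "ff" ] && [ -n "$lo" ] && [ "$marker" != "da" ]; do
        len=$(( 16#$hi * 256 + 16#$lo ))
        if [ "$marker" = "e1" ]; then
            # The payload starts 4 bytes after the segment start (tail -c is 1-based).
            sig=$(tail -c "+$(( pos + 5 ))" "$file" | head -c "${#XMP_HEADER}" | tr -d '\0')
            if [ "$sig" = "$XMP_HEADER" ]; then
                tail -c "+$(( pos + 5 + skip ))" "$file" | head -c "$(( len - 2 - skip ))"
                return 0
            fi
        fi
        pos=$(( pos + 2 + len ))            # move on to the next segment
    done
    return 1
}
```

Only the few bytes of each segment header are read until the right APP1 is found, so the image data and the appended video are never scanned.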

"And to add a final twist, Google decided not to directly include the size value of the video inside a struct value. Instead, they used an RDF struct inside an RDF array, which made it even harder to access the value if you're not familiar with the XMP Toolkit. It's even more difficult because I haven't been able to find the schema of the RDF namespace that Google uses (http://ns.google.com/photos/1.0/container/). "

And to add a final twist, Google decided not to put the size of the video directly inside a struct value, but instead used an RDF struct inside an RDF array, making it even harder to access the value if you are not familiar with the XMP Toolkit. It is harder still because I could not find the schema of the RDF namespace that Google uses (http://ns.google.com/photos/1.0/container/).

Still, with the XMP Toolkit, the XMP paths to access the MIME and Length properties inside the Google namespace are:

// Where N is the array index
Container:Directory[N]/Container:Item/Item:Mime
Container:Directory[N]/Container:Item/Item:Length

I had to check the presence of the video size with a brute-force approach (for each array index) because xmp_toolkit doesn't support accessing struct properties inside arrays.
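The brute-force loop looks roughly like the sketch below, written in shell rather than Rust so it can be run on its own. The get_property function is a hypothetical stand-in for a real XMP lookup and serves made-up values, and the sketch assumes array indices start at 1, as XMP paths usually do.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for a real XMP property lookup: it returns canned
# values so that the loop below can be tried without any XMP library.
get_property () {
    case "$1" in
        'Container:Directory[1]/Container:Item/Item:Mime')   echo 'image/jpeg' ;;
        'Container:Directory[2]/Container:Item/Item:Mime')   echo 'video/mp4' ;;
        'Container:Directory[2]/Container:Item/Item:Length') echo '3141592' ;;
    esac
}

# Try each array index in turn until an item with the video MIME type is found.
find_video_length () {
    local n mime
    for (( n = 1; ; n++ )); do
        mime=$(get_property "Container:Directory[$n]/Container:Item/Item:Mime")
        [ -n "$mime" ] || return 1          # ran past the last array item
        if [ "$mime" = "video/mp4" ]; then
            get_property "Container:Directory[$n]/Container:Item/Item:Length"
            return 0
        fi
    done
}
```

With the canned values above, find_video_length stops at index 2 and prints 3141592.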

I was finally able to make everything work. Through this adventure I have improved my knowledge of the JPEG/JFIF image format, learned about RDF and XMP, and gained more experience in Rust, which I am currently learning. With my Rust program, I'm now able to extract the video from each photo faster than by scanning through each of the more than 2600 files that contain a sidecar video. However, I am disappointed that Google engineers decided to use XMP to store such basic information (probably because they already use XMP for storing HDR or Photo Sphere information) instead of adding a simple EXIF field specifying the start offset and the length of the video. That would make the information easier to access, with no need to deal with XMP/RDF. The major issue with EXIF is that it is unstructured, making it less flexible and not as extensible as XMP/RDF. Camera manufacturers already use different field names for the same information. Using standard namespaces would clean up this mess.

Github Repo of the project