Video and audio data offset

The duration of a Dolby Digital Plus access unit is always equal to 1,536 audio samples, or 32 ms at 48 kHz. The duration of a video frame varies, depending on the video frame rate. For example, a video frame rate of 25 fps corresponds to a video frame duration of 40 ms, and a video frame rate of 29.97 fps corresponds to a video frame duration of 33.367 ms. Consequently, video and audio PES packet boundaries are rarely (if ever) time aligned within the MPEG-2 transport stream. Therefore, at the end of an MPEG-2 transport stream segment, there will be an offset between the end of the last video PES packet and the end of the last audio PES packet of the segment. This is illustrated in the A/V presentation time stamp offset at MPEG-2 transport segment boundaries figure.
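
The durations above can be expressed in 90 kHz PTS ticks, which is the unit used in the rest of this section. The following is a minimal sketch of that arithmetic. The function and constant names are illustrative only, and 29.97 fps is assumed to mean the exact rate 30000/1001.

```python
from fractions import Fraction

PTS_CLOCK_HZ = 90_000    # resolution of MPEG-2 presentation time stamps
SAMPLES_PER_AU = 1_536   # audio samples per Dolby Digital Plus access unit
SAMPLE_RATE_HZ = 48_000  # audio sample rate assumed in the text


def audio_au_ticks():
    """Duration of one audio access unit in PTS ticks."""
    return Fraction(SAMPLES_PER_AU * PTS_CLOCK_HZ, SAMPLE_RATE_HZ)


def video_frame_ticks(fps):
    """Duration of one video frame in PTS ticks for a given frame rate."""
    return Fraction(PTS_CLOCK_HZ) / Fraction(fps)


def ticks_to_ms(ticks):
    """Convert a PTS tick count to milliseconds."""
    return float(Fraction(ticks) * 1000 / PTS_CLOCK_HZ)


if __name__ == "__main__":
    ntsc = Fraction(30_000, 1_001)  # assumed exact value of "29.97 fps"
    print("audio AU:", audio_au_ticks(), "ticks,", ticks_to_ms(audio_au_ticks()), "ms")
    print("25 fps:  ", video_frame_ticks(25), "ticks,", ticks_to_ms(video_frame_ticks(25)), "ms")
    print("29.97:   ", video_frame_ticks(ntsc), "ticks,", round(ticks_to_ms(video_frame_ticks(ntsc)), 3), "ms")
```

Because an access unit lasts 2,880 ticks while a video frame lasts 3,600 ticks (25 fps) or 3,003 ticks (29.97 fps), a segment boundary chosen on a video frame boundary will generally not coincide with an audio access unit boundary.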

To ensure that playback across a segment transition is seamless, and to maintain A/V synchronization, the segmenter must meet these requirements when constructing an HTTP Live Streaming compliant MPEG-2 transport stream:

Each segment contains only complete PES packets. Fragmentation of PES packets containing Dolby Digital Plus audio data across a segment boundary is not permitted.¹
To ensure that switching between audio and video streams encoded at different bit rates is seamless, all segments that correspond to the same presentation period of the multimedia presentation (segments containing alternative renditions of the same content, with each rendition encoded at a different bit rate) contain an identical number of video and audio access units.
The first PTS of each audio stream in the segment is equal to or greater than the first video PTS of the segment.
The time offset between the first video PTS and the first audio PTS of a segment is less than 2,880 PTS ticks.
The time offset between the Audio_In time and Video_In time of a segment (the A/V PTS offset) is identical to the time offset between the Audio_Out time and Video_Out time of the previous segment.
Segmenters may elect to reset the time base at the beginning of a segment. (The program clock reference [PCR] and PTS are then not continuous across segment boundaries.) In this case, the discontinuity indicator must be set to 1 in the first transport stream packet of the PID designated as the PCR_PID, and in the first packet of any audio elementary stream (see ISO/IEC 13818-1).
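
The three timing requirements above lend themselves to an automated check. The following is a hedged sketch, not part of any Dolby or HTTP Live Streaming tool: `SegmentTiming` and `check_segment` are hypothetical names, the In and Out times are assumed to have already been extracted from the PES headers, and PTS wraparound is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional

AUDIO_AU_TICKS = 2_880  # one Dolby Digital Plus access unit at 48 kHz, in PTS ticks


@dataclass
class SegmentTiming:
    """PTS values (90 kHz ticks) at the edges of one segment (hypothetical type)."""
    video_in: int   # PTS of the first video frame
    audio_in: int   # PTS of the first audio access unit
    video_out: int  # presentation end time of the last video frame
    audio_out: int  # presentation end time of the last audio access unit


def check_segment(seg: SegmentTiming, prev: Optional[SegmentTiming] = None) -> List[str]:
    """Return a list of violated timing requirements (empty if none)."""
    errors = []
    in_offset = seg.audio_in - seg.video_in

    # The first audio PTS must be equal to or greater than the first video PTS.
    if in_offset < 0:
        errors.append("first audio PTS precedes first video PTS")

    # The offset between the first video and audio PTS must be < 2,880 ticks.
    if in_offset >= AUDIO_AU_TICKS:
        errors.append("A/V PTS offset at segment start is 2,880 ticks or more")

    # The In offset must equal the Out offset of the previous segment.
    # Comparing offsets (not absolute PTS) works with or without a PTS reset.
    if prev is not None and in_offset != prev.audio_out - prev.video_out:
        errors.append("A/V PTS offset differs from the previous segment's Out offset")

    return errors


if __name__ == "__main__":
    ts1 = SegmentTiming(video_in=0, audio_in=0, video_out=10_800, audio_out=11_520)
    ts2 = SegmentTiming(video_in=0, audio_in=720, video_out=10_800, audio_out=12_240)
    print(check_segment(ts1))       # no previous segment to compare against
    print(check_segment(ts2, ts1))  # In offset matches the previous Out offset
```

The requirements on complete PES packets and on identical access unit counts across renditions need access to the packet payloads and to the other renditions, so they are outside the scope of this sketch.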

The figure shows an example of two MPEG-2 transport stream segments containing 25 fps video and Dolby Digital Plus audio.

A/V presentation time stamp offset at MPEG-2 transport segment boundaries

For the purposes of illustration, the segments shown contain only a few frames of video and audio data. Segments used in a real-world HTTP Live Streaming application contain up to ten seconds of video and audio. This example assumes multiplexer and segmenter interaction to reset PTS values at segment boundaries.

In this example, the multiplexer places three Dolby Digital Plus access units (each with a duration equivalent to 2,880 PTS ticks) within a single PES packet, and places each video frame (with a duration equivalent to 3,600 PTS ticks) within its own PES packet. To ensure that the PTS offset between the end of the video stream and the end of the audio stream is less than 2,880 ticks, the multiplexer places only a single Dolby Digital Plus access unit within the audio PES packet at the segment boundary, resulting in an A/V PTS offset of 720 PTS ticks.

At the start of the next segment, the PTS value of the first video PES packet is 0, and the PTS value of the first Dolby Digital Plus PES packet (which again contains three Dolby Digital Plus access units) is 720, ensuring that the offset at the end of segment TS1 is maintained at the start of segment TS2.

Note: In real-world implementations, the multiplexer may choose not to reset the PTS to 0 at the start of each new segment, and instead use the Video_Out and Audio_Out PTS values from the previous segment as the PTS values of the first video and audio PES packets, respectively.
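
The boundary handling in this example, including the variant in the note where the PTS is not reset, can be sketched as follows. This is an illustration under stated assumptions rather than the behavior of any particular multiplexer: `plan_audio_pes` is a hypothetical helper, and the count of three video frames per segment is an assumption chosen because it reproduces the 720-tick offset quoted above.

```python
AUDIO_AU_TICKS = 2_880     # one Dolby Digital Plus access unit
VIDEO_FRAME_TICKS = 3_600  # one video frame at 25 fps
MAX_AUS_PER_PES = 3        # access units per audio PES packet in the example


def plan_audio_pes(video_in, audio_in, video_frames):
    """Plan audio PES packets for one segment (hypothetical helper).

    Returns (aus_per_pes, video_out, audio_out), where aus_per_pes lists the
    number of access units in each audio PES packet of the segment.
    """
    video_out = video_in + video_frames * VIDEO_FRAME_TICKS

    # Smallest whole number of access units whose end is at or after the end
    # of the video; this keeps the Out offset in the range 0..2,879 ticks.
    span = video_out - audio_in
    au_count = -(-span // AUDIO_AU_TICKS)  # ceiling division

    # Fill PES packets with three access units each; the final packet at the
    # segment boundary carries whatever remains (possibly a single unit).
    full, rest = divmod(au_count, MAX_AUS_PER_PES)
    aus_per_pes = [MAX_AUS_PER_PES] * full + ([rest] if rest else [])

    audio_out = audio_in + au_count * AUDIO_AU_TICKS
    return aus_per_pes, video_out, audio_out


if __name__ == "__main__":
    # Segment TS1, assuming three video frames and both streams starting at 0.
    pes, v_out, a_out = plan_audio_pes(video_in=0, audio_in=0, video_frames=3)
    offset = a_out - v_out
    print("TS1:", pes, "A/V PTS offset at end:", offset)

    # Segment TS2 with the PTS reset: video restarts at 0, audio at the offset.
    print("TS2 (reset):", plan_audio_pes(0, offset, 3))

    # Segment TS2 without a reset: carry the Out times over as the In times.
    print("TS2 (continuous):", plan_audio_pes(v_out, a_out, 3))
```

In both variants the difference between the first audio and first video PTS of TS2 equals the offset at the end of TS1, which is the property the requirements ask for.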

By maintaining the A/V PTS offset at each segment boundary, the multiplexer and segmenter together ensure that the audio and video PTS values remain synchronized, and therefore that A/V synchronization is maintained during playback. Additionally, by ensuring that only whole PES packets are present in each MPEG-2 transport stream segment, a player is able to switch between audio and video streams encoded at different data rates without any visible or audible interruption to playback.

¹ This creates a dependency between the segmenter and the multiplexer, as PES packetization is usually performed by the multiplexer. For cases in which more than one audio access unit is present in a PES packet, remultiplexing may be required in the segmenter.