HOW TO remove time codes from a WebVTT file

MSDN Channel 9 has started providing captions for videos in WebVTT (Web Video Text Tracks) format. This means you can read the captions file as a transcript instead of watching the video, which is handy when you are on a low-bandwidth Internet connection.

You can grab the subtitles file by appending /captions?f=webvtt&l=en to the URL of the Channel 9 video (if captions are available for that video). For example, http://channel9.msdn.com/Shows/Azure-Friday/Scott-Guthries-explains-SQL-Databases-in-Azure/captions?f=webvtt&l=en will get you the captions file for the Azure Friday discussion on SQL Databases.
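If you want to build that captions URL in a script, here is a minimal Python sketch. The function name captions_url and its parameters are made up for illustration. It only builds the string and does not check whether captions actually exist for the video.

```python
def captions_url(video_url: str, fmt: str = "webvtt", lang: str = "en") -> str:
    """Append the captions path and query string to a Channel 9 video URL."""
    # Drop a trailing slash so the result does not contain "//captions".
    return f"{video_url.rstrip('/')}/captions?f={fmt}&l={lang}"


video = "http://channel9.msdn.com/Shows/Azure-Friday/Scott-Guthries-explains-SQL-Databases-in-Azure"
print(captions_url(video))
```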

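A typical WebVTT caption file starts with a WEBVTT header line, followed by cues. Each cue has a time code line of the form start --> end, followed by one or more lines of caption text.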

When the time codes (representing the time within the video when the words are spoken) are present in the text file, the content is a little distracting to read. You can get rid of the time codes using an editor that supports finding and replacing text with regular expressions.

Use the expression \d{2}:\d{2}:\d{2}\.\d{3}(\s)+-->(\s)+()\d{2}:\d{2}:\d{2}\.\d{3} to find the time codes, then replace each match with an empty string.
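The same find and replace can be done outside an editor. Below is a small Python sketch that applies the expression above with re.sub. The sample captions and the function name strip_time_codes are made up for illustration.

```python
import re

# The time-code expression from the text, used as-is.
TIME_CODE = re.compile(
    r"\d{2}:\d{2}:\d{2}\.\d{3}(\s)+-->(\s)+()\d{2}:\d{2}:\d{2}\.\d{3}"
)

# Made-up sample cues, for illustration only.
sample = (
    "WEBVTT\n"
    "\n"
    "00:00:01.000 --> 00:00:04.000\n"
    "Welcome to the show.\n"
    "\n"
    "00:00:04.500 --> 00:00:07.250\n"
    "Today we talk about databases.\n"
)


def strip_time_codes(text: str) -> str:
    """Replace every time-code range with an empty string."""
    return TIME_CODE.sub("", text)


print(strip_time_codes(sample))
```

The line breaks around each time code are left in place, so the output has blank lines where the time codes used to be.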

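If the time code is followed by more text on the same line (for example, cue settings such as alignment or position), this regular expression removes that text as well - \d{2}:\d{2}:\d{2}\.\d{3}(\s)+-->(\s)+()\d{2}:\d{2}:\d{2}\.\d{3}(\s).* Keep in mind that \s also matches a line break, so on a time code line with nothing after it, this expression goes on to match the caption text on the next line. Replacing the trailing (\s).* with ([ \t].*)? avoids that.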
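Here is a Python sketch of removing the time code together with anything that follows it on the same line. It is a variant, not the exact expression above: it uses [ \t] in place of \s for the trailing part, so the match cannot cross a line break and swallow the caption text. The sample cues and the name strip_time_code_lines are made up for illustration.

```python
import re

# Variant of the expression from the text: the optional trailing part must
# start with a space or tab, so it stays on the time-code line.
TIME_CODE_LINE = re.compile(
    r"\d{2}:\d{2}:\d{2}\.\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}\.\d{3}(?:[ \t].*)?"
)

# Made-up sample: the first cue carries settings, the second does not.
sample = (
    "00:00:01.000 --> 00:00:04.000 align:start position:0%\n"
    "Welcome to the show.\n"
    "\n"
    "00:00:04.500 --> 00:00:07.250\n"
    "Today we talk about databases.\n"
)


def strip_time_code_lines(text: str) -> str:
    """Remove each time code and any text after it on the same line."""
    return TIME_CODE_LINE.sub("", text)


print(strip_time_code_lines(sample))
```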

Sometimes the subtitles may be enclosed in HTML tags for emphasis. To remove those tags, search for <[^>]+> and replace each match with an empty string.
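As a quick illustration, here is the tag expression applied in Python. The sample line and the name strip_tags are made up. Note that the expression removes anything between angle brackets, so it is only safe when the captions do not contain a literal < followed later by a >.

```python
import re

# The tag expression from the text.
TAG = re.compile(r"<[^>]+>")

# Made-up caption line with emphasis tags.
sample = "This is <i>really</i> <b>important</b>."


def strip_tags(text: str) -> str:
    """Remove every <...> tag and keep the text between the tags."""
    return TAG.sub("", text)


print(strip_tags(sample))
```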

You can use an online regular expression tester like Regex101 if you don't have a handy editor that supports regular expressions.

[Update 4-Oct-2022] - Azure Friday videos come with WebVTT files in multiple languages. To get the US English version of the file, find the URL of the video through browser Developer Tools. It may be in this format -
https://learn.microsoft.com/video/media//bb9ceeb6-ed33-44ee-a8d9-b63c2e7a6ab3/azfrfrankel20190110_high.mp4

Replace the MP4 file part of the URL with caption-en-us.vtt to get the captions file -
https://learn.microsoft.com/video/media//bb9ceeb6-ed33-44ee-a8d9-b63c2e7a6ab3/caption-en-us.vtt
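The replacement can also be scripted. This is a minimal Python sketch that swaps the last path segment of the video URL for the caption file name. The function name caption_url is made up for illustration, and the sketch assumes the URL ends with the MP4 file name and has no query string.

```python
def caption_url(video_url: str, caption_file: str = "caption-en-us.vtt") -> str:
    """Swap the last path segment (the MP4 file name) for the caption file name."""
    # rpartition splits at the last "/", leaving everything before the file name.
    base, _, _ = video_url.rpartition("/")
    return f"{base}/{caption_file}"


video = "https://learn.microsoft.com/video/media//bb9ceeb6-ed33-44ee-a8d9-b63c2e7a6ab3/azfrfrankel20190110_high.mp4"
print(caption_url(video))
```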
