HOW TO extract subtitles from YouTube videos as plain text
Most questions on the Google Webmaster Central YouTube channel are answered by Matt Cutts. The questions are very interesting, but the answers are available only in video format. You have to watch a video of around 2 minutes to get the answer. While the background that Matt Cutts provides is educational, even 2 minutes is a lot of time if you need a quick answer or are on a low-bandwidth internet connection. So I started compiling short answers to build summaries of Google Webmaster Central YouTube videos.
I noticed that these videos have captions, and I realized that I don't even have to watch the video to know the question and answer. It is possible to extract subtitles from YouTube videos by specifying the language and video ID in this generic URL - http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}.
For instance, instead of watching a Google Webmaster Central YouTube video that has a URL like this -
http://www.youtube.com/watch?v=6dlr-1Qk8Uc
I can take the video ID - 6dlr-1Qk8Uc and use it in this URL -
http://video.google.com/timedtext?lang=en&v=6dlr-1Qk8Uc (original URL, struck through)
http://www.youtube.com/api/timedtext?lang=en&v=6dlr-1Qk8Uc
...to get a .xml file containing the English subtitles for that video.
Update: The original URL pattern (struck through) appears to have changed. The subtitles can be fetched only if the captions were manually transcribed, i.e. not automatically generated (as shown in the image below).
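For anyone who wants to script this step, here is a minimal Python sketch of the URL substitution described above. The function names (video_id_from_watch_url, timedtext_url, fetch_subtitles_xml) are my own and purely illustrative. It assumes the updated URL pattern still responds, and the UTF-8 decoding is also an assumption; as noted in the update, the pattern has already changed once and only manually transcribed captions seem to be served.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen

# Updated URL pattern from the article; it may have changed again since.
TIMEDTEXT_BASE = "http://www.youtube.com/api/timedtext"


def video_id_from_watch_url(watch_url):
    """Return the value of the v= parameter of a YouTube watch URL."""
    query = parse_qs(urlparse(watch_url).query)
    return query["v"][0]


def timedtext_url(video_id, lang="en"):
    """Fill the language and video ID into the generic subtitle URL."""
    return TIMEDTEXT_BASE + "?" + urlencode({"lang": lang, "v": video_id})


def fetch_subtitles_xml(video_id, lang="en"):
    """Download the subtitle XML as text (needs network access).

    Assumes the response is UTF-8 encoded.
    """
    with urlopen(timedtext_url(video_id, lang), timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    video_id = video_id_from_watch_url("http://www.youtube.com/watch?v=6dlr-1Qk8Uc")
    print(timedtext_url(video_id))
```

Building the URL is kept separate from downloading it, so you can check the generated address in a browser before fetching anything.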
Raw XML is meant for machines, not for reading. To view just the text in an XML (or HTML) file, I paste the XML content into EditPlus (my favorite text editor) and use the Ctrl + Shift + P keyboard shortcut to convert it. With other editors that support regular expressions (I tried this in Notepad++ and Visual Studio), you can paste in the XML content and use the expression <(.|\n)*?> with the Find and Replace option, replacing each match with nothing, to get just plain text.
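The same find-and-replace can be done in a few lines of Python, using the exact expression above. This is only a sketch: the function name xml_to_plain_text is hypothetical, the sample XML is made up to resemble a subtitle file rather than copied from a real response, and I replace each tag with a space (instead of nothing) so that neighbouring captions do not run together. Decoding entities such as &amp; is an extra step the editor approach does not do.

```python
import html
import re

# The same expression used in the editor's Find and Replace.
TAG_PATTERN = re.compile(r"<(.|\n)*?>")


def xml_to_plain_text(xml_content):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", xml_content)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


if __name__ == "__main__":
    # Made-up sample; real subtitle files may be structured differently.
    sample = (
        '<?xml version="1.0" encoding="utf-8" ?><transcript>'
        '<text start="0" dur="2">Hello &amp; welcome</text>\n'
        '<text start="2" dur="3">to the video</text></transcript>'
    )
    print(xml_to_plain_text(sample))  # prints: Hello & welcome to the video
```

A regular expression is fine for simple, well-formed subtitle files like this; for anything more complicated, a real XML parser would be the safer choice.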
You can also use Excel to convert HTML content to text.