Formatting dumped subtitles into a vocabulary list

Published: May 28, 2020, last updated: Jul 5, 2020

Tags: Formats, Languages, Linux, Media, Snippets, Software

As per my previous post, you should now have a single SRT subtitle file. To convert it into a single word list that you can begin translating, you can run the (admittedly verbose) script below.

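# Assumes GNU sed and tr; reads subs.srt and writes subs-sort.srt.
# The ellipsis is removed with sed because GNU tr works byte by byte
# and would also delete bytes that UTF-8 letters such as æ share with it.
tr ' ' '\n' < subs.srt | \
	sed -e 's/<[^>]*>//g' -e 's/…//g' | \
	tr '[:upper:]' '[:lower:]' | \
	tr -d '!-/:?>[:digit:]' | \
	sed -r -e 's/\{[^}]*\}//g' -e 's/\s+$//' -e '/^\s*$/d' | \
	sort -u > subs-sort.srt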

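In short, this puts every space-separated word on its own line, removes HTML tags, makes everything lowercase, strips digits, most ASCII punctuation, ellipses and position tags such as {\an8}, drops empty lines, and finally sorts the list while removing duplicates. Note that apostrophes and hyphens inside words are stripped along with the rest of the punctuation.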
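
To make the individual steps easier to follow, here is a small sketch that runs the same stages on a made-up two-cue sample instead of subs.srt. The function name `words` and the sample text are purely illustrative, and the sketch assumes GNU sed (for `-r` and `\s`).

```shell
#!/bin/sh
# Hypothetical helper: the same stages as the script above, but reading
# from stdin and writing to stdout. Assumes GNU sed (-r and \s).
words() {
	tr ' ' '\n' |                           # one space-separated token per line
	sed -e 's/<[^>]*>//g' -e 's/…//g' |     # strip HTML tags and ellipses
	tr '[:upper:]' '[:lower:]' |            # lowercase (ASCII only with GNU tr)
	tr -d '!-/:?>[:digit:]' |               # drop digits and ASCII punctuation
	sed -r -e 's/\{[^}]*\}//g' -e 's/\s+$//' -e '/^\s*$/d' |
	sort -u                                 # sort and remove duplicates
}

# Made-up sample in SRT layout: cue number, timestamps, text.
printf '%s\n' \
	'1' \
	'00:00:01,000 --> 00:00:03,000' \
	'<i>Hello</i> there, World!' \
	'' \
	'2' \
	'00:00:03,500 --> 00:00:05,000' \
	'{\an8}Hello again…' | words
```

On this sample the cue numbers, timestamps and arrows are reduced to empty lines and dropped, leaving the four words again, hello, there and world.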

One issue I’ve noticed is that some non-ASCII characters won’t be converted to lowercase, Å to å for example. This is most likely because GNU tr only handles single-byte characters. I don’t have an automated workaround for you aside from specifying the letters individually, for example using:

tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'
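
One possible alternative, offered as a sketch rather than a tested fix: GNU sed has a `\L` case conversion in the replacement part of `s`, and it follows the current locale, so with a UTF-8 locale active it should lowercase accented letters too. This assumes GNU sed and a UTF-8 locale; `lower` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: lowercase stdin with GNU sed instead of tr.
# \L is a GNU extension and depends on the active locale, so a UTF-8
# locale needs to be set for non-ASCII letters to be converted.
lower() {
	sed -e 's/.*/\L&/'
}

printf 'ÆØÅ Hello\n' | lower
```

If it behaves as expected on your system, it could stand in for the tr '[:upper:]' '[:lower:]' stage of the pipeline.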

Edit 2020-09-23: Added ellipsis removal, fixed pipes
Edit 2020-07-05: Added {\an} tag removal