Formatting dumped subtitles into a vocabulary list

Published: May 28, 2020, last updated: Jul 5, 2020
Tags: Formats Languages Linux Media Snippets Software 

As per my previous post, you should now have a single SRT subtitle file. To convert it into a single word list that you can begin translating, run the rather verbose script below.

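# subtitles.srt and wordlist.txt are placeholder file names; substitute your own.
# The ellipsis is removed with sed rather than tr: GNU tr works on single bytes,
# so deleting '…' with tr can also strip bytes from letters such as æ.
tr ' ' '\n' < subtitles.srt | \
	sed -e 's/<[^>]*>//g' | \
	tr '[:upper:]' '[:lower:]' | \
	tr -d '\>\/!-.:?,.\",[:digit:]' | \
	sed -e 's/…//g' | \
	sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' | \
	sort -u > wordlist.txt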

In short, this turns every space into a newline so that each word sits on its own line, removes HTML tags, makes everything lowercase, strips punctuation, digits and empty lines, then finally sorts the list while removing duplicates.
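To see those steps in action, here is a minimal sketch that wraps the same pipeline in a function and feeds it a tiny inline sample. The function name srt_to_words and the two-cue sample are made up for illustration, and the sketch assumes GNU tr and GNU sed. The ellipsis is removed with sed here, since GNU tr deletes individual bytes rather than whole UTF-8 characters.

```shell
#!/bin/sh
# srt_to_words is a hypothetical helper name: it reads subtitle text on
# stdin and prints a sorted, de-duplicated word list on stdout.
# Assumes GNU tr and GNU sed (for -r and \s).
srt_to_words() {
	tr ' ' '\n' |
		sed -e 's/<[^>]*>//g' |
		tr '[:upper:]' '[:lower:]' |
		tr -d '\>\/!-.:?,.\",[:digit:]' |
		sed -e 's/…//g' |
		sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' |
		sort -u
}

# Made-up two-cue sample, not taken from a real subtitle file.
srt_to_words <<'EOF'
1
00:00:01,000 --> 00:00:03,000
<i>Hello</i> World!

2
00:00:04,000 --> 00:00:06,000
Hello again, world.
EOF
```

On this sample the cue numbers and timestamps collapse to empty lines and are dropped, leaving again, hello and world, one per line.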

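One issue I’ve noticed is that some special characters won’t be converted to lowercase, Å to å for example. This is because GNU tr works on single bytes, while these letters take more than one byte in UTF-8. I don’t have an automated workaround for you aside from specifying the letters individually, for example using: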

tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'
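As a possible alternative, the same letter-by-letter idea can be sketched with sed. Each substitution matches a whole character rather than individual bytes, so it should not touch unrelated letters that happen to share a byte with one in the list. The helper name lower_extra is hypothetical, and the letter list is simply the one from the tr command above, so extend it for your own language.

```shell
#!/bin/sh
# lower_extra is a hypothetical helper name. It lowercases the letters from
# the tr example one substitution at a time, so each sed expression matches
# a complete UTF-8 character instead of a single byte.
lower_extra() {
	sed -e 's/Æ/æ/g' -e 's/Ø/ø/g' -e 's/Å/å/g' -e 's/Ä/ä/g' \
		-e 's/Ö/ö/g' -e 's/Ð/ð/g' -e 's/Þ/þ/g' -e 's/Á/á/g' \
		-e 's/É/é/g' -e 's/Í/í/g' -e 's/Ó/ó/g' -e 's/Ú/ú/g' \
		-e 's/Ý/ý/g'
}

# Made-up sample: tr handles the ASCII letters, lower_extra the rest.
printf 'ÅRHUS Ærø ØL\n' | tr '[:upper:]' '[:lower:]' | lower_extra
```

In the main script this would slot in as one more pipeline stage right after the tr '[:upper:]' '[:lower:]' step.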