Formatting dumped subtitles into a vocabulary list
Published: May 28, 2020, last updated: Jul 5, 2020
As per my previous post, you should now have a single srt subtitle file. To convert it into a single word list that you can begin translating, you can run the verbose script below.
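# 1. strip HTML tags first, so tags containing spaces (<font color="...">) are removed whole
# 2. put each word on its own line
# 3. lowercase (ASCII letters only with GNU tr)
# 4. delete punctuation and digits ("!-." is a range from "!" to ".")
# 5. delete ellipses and {\an} tags (their digit is already gone), trim trailing whitespace, drop empty lines
# 6. sort and remove duplicates
sed -e 's/<[^>]*>//g' subs.srt | \
tr ' ' '\n' | \
tr '[:upper:]' '[:lower:]' | \
tr -d '\>\/!-.:?,.\",[:digit:]' | \
sed -e 's/…//g' -e 's/{...}//' -e 's/[[:space:]]*$//' -e '/^$/d' | \
sort -u > subs-sort.srt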
In short, this puts each space-separated word on its own line, removes HTML tags, makes everything lowercase, removes punctuation, digits and empty lines, and finally sorts the list while removing duplicates.
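To make the effect concrete, here is a minimal sketch that wraps the same steps in a function and runs it on a made-up two-cue sample. The function name srt_words and the sample text are hypothetical. It is a variation rather than the exact script: tags are stripped before the split, and I swapped in tr -s '[:space:]' and the [:punct:] class for the hand-written character list, so it removes slightly more than the script does.

```shell
# Hypothetical helper: reads SRT text on stdin, prints a sorted, unique word list.
srt_words() {
  sed -e 's/<[^>]*>//g' -e 's/{[^}]*}//g' |  # drop HTML tags and {\an8}-style tags
  tr -s '[:space:]' '\n' |                    # one word per line (also handles CRLF endings)
  tr '[:upper:]' '[:lower:]' |                # lowercase (ASCII letters only with GNU tr)
  tr -d '[:digit:][:punct:]' |                # drop digits and ASCII punctuation
  sed -e '/^$/d' |                            # drop lines left empty
  LC_ALL=C sort -u                            # sort by byte value, remove duplicates
}

# Made-up sample: two cues, one with an HTML tag and one with a position tag.
printf '%s\n' \
  '1' \
  '00:00:01,000 --> 00:00:02,500' \
  '<i>Hello there!</i>' \
  '' \
  '2' \
  '00:00:03,000 --> 00:00:04,000' \
  '{\an8}Hello again.' |
srt_words
# prints:
# again
# hello
# there
```

The cue numbers and timestamps disappear because they consist only of digits and punctuation, which leaves empty lines that the last sed step drops.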
One issue I’ve noticed is that some special characters won’t be converted to lowercase, Å to å for example. I don’t have an automated workaround for you aside from specifying the letters individually, for example using:
tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'
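One possible alternative, which I have not tested against every subtitle file: GNU sed has a \L escape that lowercases the replacement text, and in a UTF-8 locale it should cover letters like Å as well. This is only a sketch. The helper name lower_utf8 is hypothetical, and \L is a GNU extension, so it will likely not work with BSD or BusyBox sed.

```shell
# Hypothetical helper: lowercase every line on stdin using GNU sed's \L escape.
# Assumes GNU sed; non-ASCII letters are only converted in a UTF-8 locale.
lower_utf8() {
  sed -e 's/.*/\L&/'
}

# Example call; with a UTF-8 locale active this should lowercase Å and Ø too.
printf '%s\n' 'Ålesund ØL' | lower_utf8
```

If this works on your system, it could stand in for both the tr '[:upper:]' '[:lower:]' step and the manual letter list.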
- Edit 2020-09-23: Added ellipsis removal, fixed pipes
- Edit 2020-07-05: Added {\an} tag removal