files
I'm working on a sizeable (around 1 million words) collection of texts spread across a number of GitHub repos. Dealing with them manually would be a chore, so I wrote a couple of automation scripts.
Remove BOM
MemoQ has a habit of exporting UTF-8 plaintext/markdown with a BOM. Our course building script doesn't catch this, and the stray BOM messes up the final HTML.
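A minimal sketch of the idea, not necessarily how the script in this repo does it: read each file as bytes, check for the three-byte UTF-8 BOM, and write the file back without it. The function names and the default list of suffixes are assumptions made for illustration.

```python
import codecs
from pathlib import Path


def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM in place; return True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_bom_in_tree(root: Path, suffixes=(".md", ".txt")) -> list:
    """Strip BOMs from matching files under root; return the files that changed."""
    changed = []
    for path in sorted(root.rglob("*")):
        # The suffix filter is an assumption; adjust it to the files you export.
        if path.is_file() and path.suffix.lower() in suffixes and strip_bom(path):
            changed.append(path)
    return changed
```

Working on bytes rather than decoded text means the rest of the file is left untouched, whatever line endings it uses.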
replace local links with absolute urls in markdown
This is useful when pasting a bunch of markdown files into some shared space like Google Drive (provided a hosted version already exists somewhere). It needs a CSV file that matches file names to page names. It handles anchors (including anchor-only links) based on pre-defined logic, recently adapted for Confluence.
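Roughly, the rewrite could look like the sketch below. Everything specific in it is an assumption: a two-column CSV (`file name,page name`), a hypothetical `base_url` that page names are appended to, matching on the bare file name, and anchors passed through unchanged. The Confluence-specific anchor logic is not reproduced here.

```python
import csv
import io
import re
from urllib.parse import quote

# Inline links only: [text](target). Images and reference-style links are skipped.
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def load_page_map(csv_text: str) -> dict:
    """Read 'file name,page name' rows into a dict (column layout is assumed)."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in rows if len(row) >= 2}


def absolutize(markdown: str, current_file: str, pages: dict, base_url: str) -> str:
    """Replace local link targets with absolute URLs built from page names."""

    def repl(match):
        text, target = match.group(1), match.group(2)
        if SCHEME_RE.match(target):
            return match.group(0)  # already absolute (http:, mailto:, ...)
        path, sep, anchor = target.partition("#")
        # An anchor-only link points at the page for the current file.
        file_name = path.rsplit("/", 1)[-1] if path else current_file
        if file_name not in pages:
            return match.group(0)  # unknown file: leave the link alone
        url = base_url.rstrip("/") + "/" + quote(pages[file_name])
        return f"[{text}]({url}{sep}{anchor})"

    return LINK_RE.sub(repl, markdown)
```

Links to files missing from the CSV are left as they were, so they are easy to spot afterwards.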
update yml (but like it's text)
Why update .yml files but read them as plaintext? It turns out that, in my case, they don't follow consistent quoting rules. So instead of figuring out why or trying to enforce one style, I decided to treat them like text files. This also works well because the changes I'm automating in these YAML files are minimal: one line (or two) in each file contains some PL text, and I need to add an extra line with the EN key and a value based on the existing PL one before I can start editing the file manually or deserialize it.
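To illustrate the line-based approach, here is a sketch under a few loud assumptions: the keys are literally `pl` and `en` (hypothetical names), values sit on a single line, and the glossary is a plain dict from PL text to EN text. The new line copies the indentation and quoting of the PL line, and a term missing from the glossary is copied over as-is as a placeholder for manual editing.

```python
import re

# 'pl' and 'en' are hypothetical key names; swap in whatever the files use.
PL_LINE = re.compile(r"^(?P<indent>\s*)pl:(?P<gap>\s*)(?P<value>.*)$")


def add_en_lines(text: str, glossary: dict) -> str:
    """Insert an 'en:' line after every 'pl:' line, reusing its indent and quotes."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        m = PL_LINE.match(line)
        if not m:
            continue
        indent, gap = m.group("indent"), m.group("gap")
        # Skip lines that already have an EN sibling, so reruns are harmless.
        if i + 1 < len(lines) and lines[i + 1].startswith(indent + "en:"):
            continue
        value = m.group("value").rstrip()
        quoted = len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]
        quote_char = value[0] if quoted else ""
        bare = value[1:-1] if quoted else value
        en = glossary.get(bare, bare)  # no glossary hit: keep the PL text
        out.append(f"{indent}en:{gap}{quote_char}{en}{quote_char}")
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Because nothing is parsed or re-serialized, every other line keeps its original quoting exactly as it was.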
Of course, to make it work I need a glossary. I collect the candidate terms with another script.
The glossary used here is a .csv file, usually compiled from two files (the source and the target of a translation).
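One way such a glossary could be put together, assuming the source and target files are line-aligned lists of terms (an assumption; the actual files may be structured differently). The function names are hypothetical.

```python
import csv
import io


def compile_glossary(source_lines, target_lines) -> str:
    """Pair line-aligned source and target terms into two-column CSV text."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    seen = set()
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        # Skip blanks and keep only the first translation of a repeated term.
        if not src or not tgt or src in seen:
            continue
        seen.add(src)
        writer.writerow([src, tgt])
    return buf.getvalue()


def load_glossary(csv_text: str) -> dict:
    """Load the two-column CSV back into a source -> target dict."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in rows if len(row) >= 2}
```

Going through the `csv` module rather than joining strings by hand keeps terms that contain commas intact.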