Command line to convert all .docx files in a directory (and subdirectories) to text files and write new files

I would like to convert all .docx files in a directory (and its subdirectories) to text files from the command line (so I can then run grep on these files). I found this

unzip -p tutu.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

here, which works well, but it prints the text to the terminal. I would like to write a new text file (.txt, for instance) in the same directory as each .docx file, and I would like a script that does this recursively.

I have this, using antiword, which does what I want for .doc files, but it doesn't work for .docx files.

find . -name '*.doc' | while read i; do antiword -i 1 "${i}" >"${i/doc/txt}"; done

I tried to mix both but without success... A command line that would do both at the same time would be appreciated!

Thank you


The following script:

  • converts all docx files in the directory where you run it, recursively (change the . in find . to your preferred starting point)
  • writes the txt files to where it found the docx file

Bash script:

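# Find every .docx below the current directory; IFS= and -r keep spaces and backslashes in names intact
find . -name "*.docx" | while IFS= read -r file; do
    # Extract the main document XML, turn paragraph ends into newlines (GNU sed), strip tags and non-printable characters
    unzip -p "$file" word/document.xml |
        sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' > "${file%.docx}.txt"
done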
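
A minimal sketch of how the output name can be derived from the input name, assuming every input path ends in .docx. txt_path is a hypothetical helper name used only for illustration. Stripping the suffix with ${1%.docx} touches only the end of the path, whereas a pattern substitution such as ${file/docx/txt} replaces the first docx it finds, which may sit in a directory name.

```shell
#!/bin/sh
# txt_path: hypothetical helper that maps a .docx path to its .txt sibling.
txt_path() {
    # ${1%.docx} removes a trailing ".docx" only; the rest of the path is untouched.
    printf '%s\n' "${1%.docx}.txt"
}

txt_path "./docx-reports/notes.docx"   # prints ./docx-reports/notes.txt
# For comparison, bash's ${file/docx/txt} would turn the same path into
# ./txt-reports/notes.docx, because it replaces the first "docx" it meets.
```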

Afterwards you can run grep like this:

grep -r "some text" --include "*.txt" .
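
Since the question also asks for .doc files, both conversions could be combined along the following lines. This is only a sketch: it assumes unzip, GNU sed and antiword are installed, and docx_to_text and convert_tree are hypothetical names. Be aware that report.doc and report.docx in the same directory would both write report.txt.

```shell
#!/bin/sh
# Sketch: convert every .doc and .docx below a start directory to .txt.
# Assumes unzip, GNU sed (for \n in the replacement) and antiword are available.

# docx_to_text (hypothetical name): print the plain text of one .docx file.
docx_to_text() {
    unzip -p "$1" word/document.xml |
        sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# convert_tree (hypothetical name): walk the tree, write each .txt next to its source.
convert_tree() {
    # One path per line: handles spaces, but not newlines inside file names.
    find "${1:-.}" -type f \( -name '*.docx' -o -name '*.doc' \) |
        while IFS= read -r file; do
            case "$file" in
                *.docx) docx_to_text "$file" > "${file%.docx}.txt" ;;
                *.doc)  antiword -i 1 "$file" > "${file%.doc}.txt" ;;
            esac
        done
}
```

After sourcing the file, convert_tree . converts everything below the current directory, and the grep command above can then search the resulting .txt files.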