How to count the number of bold and italic words in a markdown file


I've read that bold and italic words can be written in markdown as **bold_text** and *italic_text*, respectively. To make text both bold and italic at once, you can wrap it in double asterisks for bold and single underscores for italic (or vice versa).

I would like to write a bash script that counts the bold words and the italic words. I guess this comes down to counting the double asterisks, single asterisks, double underscores and single underscores. My question is how to count occurrences of specific strings like "**" or "__" in a file, so I can tell how many bold and italic words there are.

#!/bin/bash

if [ -z "$1" ]; then
    echo "No input file specified."
else
    ls $1 > /dev/null 2> /dev/null &&
    echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "File $1 does not exist."
fi

Example input file:

**This is bold and _italic_** text.

Expected output:

Bold words: 5
Italic words: 1
Bold and italic words: 1

Simple approach

A few assumptions:

  • Bold uses __ and italic uses * (even though markdown also allows ** for bold and _ for italic)
  • No "funny stuff" like (inline) code with these characters, or escaped _ or *, or lists with leading * that throw our count off

Now, to count bold words, we can use

grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l

This looks for anything between two pairs of __. I used the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, something like __bold__ not bold __bold__ would be just one match. -o returns just the matches.

The second grep matches the words: any sequence of one or more non-space characters; and wc -l counts the lines of output.
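To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch. The sample line and the variable names are made up for illustration, and it assumes GNU grep built with -P support.

```shell
#!/bin/bash
# Hypothetical sample line, not taken from the input file above.
sample='__one two__ plain __three__'

# Greedy: ERE has no lazy quantifier, so one match spans both bold spans.
greedy_matches=$(printf '%s\n' "$sample" | grep -Eo '__.*__' | wc -l)

# Non-greedy (PCRE): each bold span is returned as its own match.
lazy_matches=$(printf '%s\n' "$sample" | grep -Po '__.*?__' | wc -l)

# Words inside the bold spans only; "plain" is left out.
bold_words=$(printf '%s\n' "$sample" |
    grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)

echo "greedy matches: $greedy_matches"      # 1
echo "non-greedy matches: $lazy_matches"    # 2
echo "bold words: $bold_words"              # 3
```

Because -o prints every match on its own line, wc -l ends up counting matches rather than input lines.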

The same works for italics:

grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l

To count words that are both bold and italic, the two extraction steps have to be chained. For italic inside bold:

grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l

and bold inside italic:

grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
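The chaining may be easier to follow with the three steps wrapped in small functions. This is only a sketch: the helper names and the sample line are illustrative, and it relies on the same assumptions as above (GNU grep with -P, bold as __, italic as *).

```shell
#!/bin/bash
# Illustrative helper names, not part of the commands above.
bold_spans()   { grep -Po '__.*?__'; }
italic_spans() { grep -Po '\*.*?\*'; }
count_words()  { grep -o '[^[:space:]]\+' | wc -l; }

# Hypothetical sample with one nesting in each direction.
sample='__bold with *two italics* inside__ and *italic with __bold__ inside*'

# Italic inside bold: extract bold spans first, then italics within them.
italic_in_bold=$(printf '%s\n' "$sample" | bold_spans | italic_spans | count_words)

# Bold inside italic: the same steps in the opposite order.
bold_in_italic=$(printf '%s\n' "$sample" | italic_spans | bold_spans | count_words)

echo "italic inside bold: $italic_in_bold"   # 2
echo "bold inside italic: $bold_in_italic"   # 1
```

The order of the two extraction steps decides which nesting is counted, which is why both directions need their own pipeline.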

Cleaning up a more realistic file

Now, a real markdown file might have a few extra surprises (see the assumptions above):

* List item with **bold word**

Line with **bold words and \* an escaped asterisk**

Here is an *italicized* word

And *italics with a **bold** word inside*

And **bold words with *italics* inside**

    Code can have tons of *, ** and _ and we want to ignore them all

Also `inline code can have * and ** and _ to be ignored`, right?

which would render as

  • List item with bold word

Line with bold words and * an escaped asterisk

Here is an italicized word

And italics with a bold word inside

And bold words with italics inside

Code can have tons of *, ** and _ and we want to ignore them all

Also inline code can have * and ** and _ to be ignored, right?

One approach to clean up something like this would be a sed script:

/^$/d                           # Delete empty lines
/^    /d                        # Delete code lines (start with four spaces)
s/`[^`]*`//g                    # Remove inline code
/^\* /s/^\* (.*)/\1/            # Remove asterisk from list items
s/\\\*//g                       # Remove escaped asterisks
s/\\_//g                        # Remove escaped underscores
s/\*\*/__/g                     # Make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g    # Make sure italics use asterisks

with the following result:

$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and  an escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?

Ready for consumption by the commands from the first section.
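The last two rules of the cleanup do the normalization, and they can be tried in isolation. A minimal sketch, assuming a sed that accepts -E (GNU sed treats it like -r); the function name and the sample line are made up for illustration.

```shell
#!/bin/bash
# Illustrative wrapper around the two normalization rules only.
normalize() {
    # 1. Every ** becomes __ (bold).
    # 2. A single _ that has no _ right before or after it becomes * (italic).
    sed -E 's/\*\*/__/g; s/(^|[^_])_([^_]|$)/\1*\2/g'
}

# Hypothetical sample using the "other" spelling for both styles.
normalized=$(printf '%s\n' '**bold** and _italic_ text' | normalize)
echo "$normalized"    # __bold__ and *italic* text
```

The groups around the underscore keep the neighbouring characters intact, so only the marker itself is replaced.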

Putting it all together

Everything together in a script that takes the markdown file name as an argument:

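#!/bin/bash

fname="$1"

# Bail out early if no readable file was given
if [[ ! -r "$fname" ]]; then
    echo "Usage: $0 file.md" >&2
    exit 1
fi

tempfile="$(mktemp)"
trap 'rm -f "$tempfile"' EXIT    # Remove the temp file even on early exit

# Clean up the markdown so that bold always uses __ and italics always use *
sed -r '
    /^$/d
    /^    /d
    s/`[^`]*`//g
    /^\* /s/^\* (.*)/\1/
    s/\\\*//g
    s/\\_//g
    s/\*\*/__/g
    s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"

# Count the words inside bold spans and inside italic spans
bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)

# Both: italic inside bold plus bold inside italic
both=$((
    $(grep -Po '__.*?__' "$tempfile" |
        grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)
    +
    $(grep -Po '\*.*?\*' "$tempfile" |
        grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))

echo "Bold words: $bold"
echo "Italic words: $italic"
echo "Bold and italic words: $both"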

Which can be used like this:

$ ./wordcount infile.md
Bold words: 14
Italic words: 8
Bold and italic words: 2

Shortcomings

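  • This can be tripped up by words containing underscores. Some markdown flavours ignore these and assume they're part of the word.
  • Markers that touch each other trip up the normalization. With the question's example, _italic_** becomes _italic___ once the bold markers are converted, so the closing italic underscore is never turned into an asterisk and the italic word goes uncounted.
  • I'm sure I missed a few edge cases in the cleanup.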
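The underscore problem is easy to reproduce. The sketch below runs only the two normalization rules and the italic counter on a made-up line that contains an identifier but no emphasis at all; it assumes GNU grep with -P and a sed that accepts -E, and the function names are illustrative.

```shell
#!/bin/bash
# Illustrative helpers: the normalization rules and the italic word counter.
normalize()    { sed -E 's/\*\*/__/g; s/(^|[^_])_([^_]|$)/\1*\2/g'; }
count_italic() { grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l; }

# Hypothetical input: snake_case_name turns into snake*case*name,
# so "case" is reported as an italic word.
false_hits=$(printf '%s\n' 'call snake_case_name here' | normalize | count_italic)
echo "italic words reported: $false_hits"    # 1, although nothing is italic
```

Requiring whitespace or punctuation next to a marker before treating it as emphasis would be one way to reduce such false hits, at the cost of a more involved regex.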