Extract a string between the first and second commas in a specific row in all files in a directory


I am trying to extract a string that is located between the first and second comma in a specific line in a series of text files (subtitle files). The text files are formatted this way:

Subtitles01.txt

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: Default, Estrangelo Edessa, 57, &H00FFFFFF
Style: Title1, Arno Pro, 65, &H00606066

Subtitles02.txt

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: OP Eng, Arno Pro, 45, &H00100F11
Style: ED Romaji, Nueva Std Cond, 46, &H00FFFFFF

Subtitles03.txt

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: OP Eng, Estrangelo Edessa, 45, &H00100F11
Style: Default, Arno Pro, 45, &H00100F11
Style: ED Romaji, Nueva Std Cond, 46, &H00FFFFFF

What I want to achieve is to extract the Fontname from each line that starts with "Style: " and then determine which subtitle files contain the fonts I want, without repeats. So essentially the end result would be output to a text file like the following:

Subtitles01.txt: Estrangelo Edessa
Subtitles01.txt: Arno Pro
Subtitles02.txt: Arno Pro
Subtitles02.txt: Nueva Std Cond
Subtitles03.txt: Estrangelo Edessa
Subtitles03.txt: Arno Pro
Subtitles03.txt: Nueva Std Cond

Only Subtitles03.txt is needed.

Since Subtitles03.txt contains all the fonts in Subtitles01.txt and Subtitles02.txt, only Subtitles03.txt is needed. The goal is to use the fewest files that together cover every unique font across all the files. I have come up with the following batch script using findstr to extract the lines starting with "Style: ", but I am stuck beyond that.

@echo off
findstr /B /C:"Style:" *.txt > results.txt
if %errorlevel%==0 (
    echo Found! logged files into results.txt
) else (
    echo No matches found
)

Any help would be appreciated. Thank you guys!


I imagine it would be much easier to use some other language besides batch, or at least use non-native utilities. But here is a pure native batch solution.

I don't see how FINDSTR regex will help with this problem. Unlike many non-native regex utilities, it cannot extract a portion of the matching line.

You can use FOR /F to extract the fonts from each file:

for /f "tokens=2 delims=," %%A in ('findstr /lb "Style:" file.txt') do echo font=%%A
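For comparison only, here is a minimal Python sketch of the same extraction step. It is not part of the batch solution; the function name style_fonts and the sample list are made up for illustration. It assumes every "Style:" line has at least two comma-separated fields, and unlike the FOR /F line it strips the space that follows the comma.

```python
def style_fonts(lines):
    """Yield the Fontname field of every line that starts with 'Style:'."""
    for line in lines:
        if not line.startswith("Style:"):
            continue
        fields = line.split(",")
        if len(fields) >= 2:
            # fields[0] is "Style: <Name>", fields[1] is the font name.
            yield fields[1].strip()


# Hypothetical sample taken from the question's Subtitles01.txt.
sample = [
    "[V4+ Styles]",
    "Format: Name, Fontname, Fontsize, PrimaryColour",
    "Style: Default, Estrangelo Edessa, 57, &H00FFFFFF",
    "Style: Title1, Arno Pro, 65, &H00606066",
]

fonts = list(style_fonts(sample))
print(fonts)
```

To read a real file, the same function can be given an open file object, since it only iterates over lines.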

You can use environment variables to come up with a list of unique fonts. Define variables with the font name in the variable name, all prefixed with font_. Only one variable can be defined for a given name. The assigned value does not matter. You can then use set font_ to list all the unique font names. The number of unique names can be counted, or the actual font name can be parsed out (remove the font_ prefix).
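The variable-name trick is the same idea as using dictionary keys as a set. A small Python sketch of the analogy, with an invented fonts_seen list standing in for the fonts read from the files (the batch script itself uses environment variables, not this code):

```python
# Hypothetical list of fonts in the order they were read, with repeats.
fonts_seen = [
    "Estrangelo Edessa", "Arno Pro",
    "Arno Pro", "Nueva Std Cond",
    "Estrangelo Edessa", "Arno Pro", "Nueva Std Cond",
]

# Mimic "set font_<name>=1": a key can exist only once,
# and the stored value is irrelevant.
unique = {}
for name in fonts_seen:
    unique["font_" + name] = 1

# Count the unique names, then strip the prefix to recover them.
unique_count = len(unique)
names = sorted(key[len("font_"):] for key in unique)

print(unique_count)
print(names)
```

Sorting is only there to make the printed order predictable in this sketch.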

The tricky part is establishing the minimum set of files required to cover the complete set of unique font names. I imagine someone could come up with an efficient solution. I've just employed a brute force recursive method that walks the combinations of files (the script calls them permutations): I count the number of unique fonts found in each combination and compare that number to the total number of unique fonts. I have added a few shortcuts so that I do not proceed down a particular path if I've already found a complete set smaller than the current one.
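To make the brute force idea concrete, here is a rough Python sketch. It is not a translation of the batch routines: it tries subsets in order of increasing size and stops at the first complete one, whereas the batch code recurses over include/skip choices and prunes against the best set found so far. The name smallest_cover and the fonts_by_file mapping are assumptions for illustration. Like the batch version, it is still exponential in the number of files.

```python
from itertools import combinations


def smallest_cover(fonts_by_file):
    """Return a smallest list of files whose fonts cover every font seen."""
    all_fonts = set().union(*fonts_by_file.values())
    if not all_fonts:
        return []
    files = sorted(fonts_by_file)
    # Try 1 file, then 2 files, ... so the first hit is a minimum-size set.
    for size in range(1, len(files) + 1):
        for combo in combinations(files, size):
            covered = set().union(*(fonts_by_file[f] for f in combo))
            if covered == all_fonts:
                return list(combo)
    return []


# Hypothetical input built from the three example files in the question.
fonts_by_file = {
    "Subtitles01.txt": {"Estrangelo Edessa", "Arno Pro"},
    "Subtitles02.txt": {"Arno Pro", "Nueva Std Cond"},
    "Subtitles03.txt": {"Estrangelo Edessa", "Arno Pro", "Nueva Std Cond"},
}

needed = smallest_cover(fonts_by_file)
print(needed)
```

When several subsets of the same size are complete, this sketch simply returns the first one in sorted file order.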

The code could be simpler if I used SETLOCAL in my recursion, but batch is limited to only 32 levels of SETLOCAL. I wanted a solution that could support more than 32 files, although I'm a bit worried about performance with that many files.

Edit: I fixed a bug in my :permuteFiles routine that surfaced once I had more than 3 files.

@echo off
setlocal enableDelayedExpansion

::Make sure there are no font_ variables defined
for /f "delims==" %%A in ('set font_ 2^>nul') do set "%%A="

::Read all the Subtitle files and
:: - create an "array" of file names
:: - create a file of font names for each input file
:: - create an "associative array" of unique font names
:: - List the available file/font pairs in the final results
:: - List the unique fonts in the final results
set totalFiles=0
>results.txt (
  echo Available fonts
  echo ----------------------------
  for %%F in (subtitles*.txt) do (
    set /a totalFiles+=1
    set "file_!totalFiles!=%%F"
    3>"%%F.fonts" (
      for /f "tokens=2 delims=," %%A in ('findstr /lb "Style:" "%%F"') do (
        set "font_%%A=1"
        >&3 echo %%A
        echo %%F:%%A
      )
    )
  )
  echo(
  echo Unique fonts
  echo ----------------------------
  for /f "delims==" %%A in ('set font_') do (
    set "font=%%A"
    echo !font:~5!
  )
)

::Count the number of unique fonts
for /f %%N in ('set font_ ^| find /c /v ""') do set uniqueFonts=%%N

::Test all the permutations
set /a minFileCount=%totalFiles%+1
for /l %%N in (1 1 %totalFiles%) do (
  call :permuteFiles %%N 0 ""
)

::List the required files in the final results
>>results.txt (
  echo(
  echo The following files contain the complete set of unique fonts:
  echo -------------------------------------------------------------
  for %%N in (%minFileList:~1,-1%) do echo !file_%%N!
)
type results.txt

::Cleanup
del subtitles*.txt.fonts
exit /b

:permuteFiles  fileNumber  fileCount  fileList
if %1==%totalFiles% (
  if %2 gtr 0 call :testPermutation %2 %3
  set /a fileCount=%2+1
  if !fileCount! lss !minFileCount! call :testPermutation !fileCount! "%~3,%1"
) else (
  set /a nextFile=%1+1
  if %2 gtr 0 call :permuteFiles !nextFile! %2 %3
  set /a "nextFile=%1+1, fileCount=%2+1"
  if !fileCount! lss !minFileCount! call :permuteFiles !nextFile! !fileCount! "%~3,%1"
)
exit /b

:testPermutation  fileCount  fileList
for /f "delims==" %%A in ('set font_ 2^>nul') do set "%%A="
for %%N in (%~2) do (
  for /f "usebackq delims=" %%A in ("!file_%%N!.fonts") do set "font_%%A=1"
)
for /f %%N in ('set font_ ^| find /c /v ""') do if %%N==%uniqueFonts% (
  set minFileList=%2
  set minFileCount=%1
)
exit /b

Here are the results using your example input:

Available fonts
----------------------------
Subtitles01.txt: Estrangelo Edessa
Subtitles01.txt: Arno Pro
Subtitles02.txt: Arno Pro
Subtitles02.txt: Nueva Std Cond
subtitles03.txt: Estrangelo Edessa
subtitles03.txt: Arno Pro
subtitles03.txt: Nueva Std Cond

Unique fonts
----------------------------
 Arno Pro
 Estrangelo Edessa
 Nueva Std Cond

The following files contain the complete set of unique fonts:
-------------------------------------------------------------
subtitles03.txt