How do I parse a string containing wildcards and character classes in Ruby?


I would like to write a script that takes one argument that might look like this:

abc(ag)de*

a, b, c are literal characters.

(ag) means "an 'a' or a 'g'".

* means any one letter or number.

I want the script to create an Array of all the possible strings the input could represent. (The purpose is to check if they're available domain names.)

The input could also be something like abc(ag)de(mnlop), where there is more than one character class.

It seems like the first task is to split it into an Array of Arrays, so the first example would be...

[
  ['a'], ['b'], ['c'],
  ['a', 'g'],
  ['d'], ['e'],
  [
    'a', 'b', 'c', 'd', 'e', 'f', 'g',
    # etc...
  ]
]

This is where I get stuck. I don't know how to split it up into pieces like that.

Any suggestions on how to approach it?

Here's a pretty compact solution. It's in no way optimized for performance, which constrains the patterns you can supply: every wildcard multiplies the number of results by 26, so using too many wildcards is probably not the best idea.
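
To get a feel for how fast the result grows, the number of combinations can be computed without building any strings, since it is just the product of the number of choices at each position. This is only a rough sketch: count_combinations is a hypothetical helper name, and it assumes a wildcard stands for the 26 lowercase letters.

```ruby
# Hypothetical helper (not part of the solution): counts the strings a
# pattern expands to without generating them.
# Assumption: "*" stands for one of 26 lowercase letters.
def count_combinations(pattern, wildcard_size = 26)
  pieces = pattern.scan(/\(([^)]+)\)|(\*)|(\w+)/)
  pieces.reduce(1) do |total, (group, wildcard, _literal)|
    if group
      total * group.length   # a character class: one choice per character
    elsif wildcard
      total * wildcard_size  # a wildcard: one choice per letter
    else
      total                  # literal text: exactly one choice
    end
  end
end

puts count_combinations("abc(ag)de*")        # 2 * 26 = 52
puts count_combinations("abc(ag)de(mnlop)")  # 2 * 5 = 10
puts count_combinations("****")              # 26**4 = 456976
```

Checking this number first makes it easy to reject a pattern that would expand to more strings than you want to look up.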

Here's the code:

input1 = "abc(ag)de*"
input2 = "abc(ag)de(mnlop)"

class Array
  # Replaces the contents with every element followed by every suffix.
  def append_suffixes!(suffixes)
    replace(suffixes.map { |a| map { |p| p + a } }.flatten)
  end
end

def generate_combinations(pattern)
  combinations = [""]
  # Each match is a (character class), a * wildcard, or a run of literals.
  pattern.scan(/\(([^)]+)\)|(\*)|(\w+)/) do |group, wildcard, other|
    new_suffixes = case
      when group    then group.split('')
      when wildcard then [*'a'..'z']   # letters only
      when other    then [other]
      else raise "Unknown match!"
    end
    combinations.append_suffixes! new_suffixes
  end
  combinations
end

p generate_combinations(input1)
p generate_combinations(input2)
p generate_combinations("**").size

The output from running the code above is (slightly edited):

["abcadea", "abcgdea", "abcadeb", "abcgdeb", "abcadec",
 "abcgdec", "abcaded", "abcgded", "abcadee", "abcgdee",
 "abcadef", "abcgdef", "abcadeg", "abcgdeg", "abcadeh",
 "abcgdeh", "abcadei", "abcgdei", "abcadej", "abcgdej",
 "abcadek", "abcgdek", "abcadel", "abcgdel", "abcadem",
 "abcgdem", "abcaden", "abcgden", "abcadeo", "abcgdeo",
 "abcadep", "abcgdep", "abcadeq", "abcgdeq", "abcader",
 "abcgder", "abcades", "abcgdes", "abcadet", "abcgdet",
 "abcadeu", "abcgdeu", "abcadev", "abcgdev", "abcadew",
 "abcgdew", "abcadex", "abcgdex", "abcadey", "abcgdey",
 "abcadez", "abcgdez"]

["abcadem", "abcgdem", "abcaden", "abcgden", "abcadel",
 "abcgdel", "abcadeo", "abcgdeo", "abcadep", "abcgdep"]

676 # The number of two-letter words, i.e. 26*26
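
If you would rather build the Array of Arrays described in the question first, the expansion itself can be left to Array#product. The following is a minimal sketch of that alternative, not the approach used above: tokenize, expand, and WILDCARD are hypothetical names, and it assumes "*" covers lowercase letters plus digits, as the question describes.

```ruby
# Assumption: "*" means one lowercase letter or one digit.
WILDCARD = (('a'..'z').to_a + ('0'..'9').to_a).freeze

# Hypothetical helper: turns a pattern into an Array of Arrays,
# with one inner Array of choices per character position.
def tokenize(pattern)
  pattern.scan(/\(([^)]+)\)|(\*)|(\w)/).map do |group, wildcard, literal|
    if group
      group.chars    # character class: each character is a choice
    elsif wildcard
      WILDCARD       # wildcard: the whole assumed alphabet
    else
      [literal]      # literal: a single choice
    end
  end
end

# Hypothetical helper: expands the choices into full strings.
def expand(pattern)
  # Starting from [""] keeps the empty pattern well defined.
  [""].product(*tokenize(pattern)).map(&:join)
end

p tokenize("ab(ag)")   # [["a"], ["b"], ["a", "g"]]
p expand("ab(ag)c")    # ["abac", "abgc"]
p expand("a*").size    # 36
```

Note that Array#product varies the last position fastest, so the strings come out in a different order than in the output above.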

Please feel free to ask if you have any questions about the code above.