Analysis in Python: What is the most effective way to remove / normalize strings?


I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.

Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":

print("h e l l o "+
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")

Should be transformed to the following result:


Currently I treat it as a special case in the code (i.e. detect beginning of a string, and "manually" run until its end with several sub-special cases on the way). If there's a Python library function i can use or a nice regex that may make my code more efficient, that would be great.

Few remarks:

  • I would like the "start-of-string" character to be a variable, e.g. ' vs ".
  • I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
  • The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
  • I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any set of two characters (e.g. '\"') to escape enough? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)

Thanks, Rax

Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.

Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.

Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):

import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
    lambda match: or '"string"', source_code)

The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).

When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).