Programming Simple Regular Expression Matching in Python

Regular expressions are a very useful concept in programming. We can use them to parse strings based on known formats, typically to determine whether a string fits a pattern or not. Even so, regular expressions are often a poorly understood concept, because of the complexity of how they can be designed.

One way to learn how regular expressions work is to program simple regular expression matching yourself. In this article, we will learn how to match regular expressions, with patterns based on the following criteria:

1. A pattern can have any alphabetic character, in lowercase.

2. If a letter is followed by a *, it means that there can be 0 or more of the character.

3. If the pattern contains a ., it matches exactly one character of any kind.

Let’s look at a few examples to get a better idea.

Pattern: “ab”

For this pattern, we look for a string that has exactly the letters shown, “ab”. This is the only string that will match the pattern.

Pattern: “a*b*”

For this pattern, we look for a string that has the letter a 0 or more times, and the letter b 0 or more times. This means that “ab” is acceptable, as is “b” and “aaabbbb”.

Pattern: “.a”

For this pattern, we look for a string with any single character followed by the letter a. This means that “aa” is accepted, “ba” is accepted, and so is any similar two-character string ending in a.
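To make these examples concrete, here is how the same patterns behave with Python’s built-in re module (used purely for illustration here; the rest of the article builds a matcher from scratch):

```python
import re

# re.fullmatch requires the whole string to match the pattern.
print(bool(re.fullmatch("ab", "ab")))        # True:  exact match
print(bool(re.fullmatch("a*b*", "aaabbbb"))) # True:  zero or more a's, then b's
print(bool(re.fullmatch("a*b*", "ba")))      # False: a's must come before b's
print(bool(re.fullmatch(".a", "ba")))        # True:  any character, then 'a'
print(bool(re.fullmatch(".a", "b")))         # False: needs exactly two characters
```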

Programming this logic is relatively complex, due to how the * character acts. Since it looks at the previous character, and allows it 0 or more times, we need to be careful with how we implement our solution to make it as efficient as possible. I am going to present two solutions to this pattern matching problem. The first is a recursive solution, and the second uses dynamic programming.


The Recursive Approach

The recursive solution to this problem is pretty straightforward. Our strategy is to keep checking whether the first characters match until we run out of characters in the pattern and the text. The full code looks like this:

(Figure: the recursive solution)
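A minimal sketch of such a recursive matcher could look like the following; the function name match is an assumption, but the structure mirrors the logic walked through below:

```python
def match(text, pattern):
    # An empty pattern matches only an empty text.
    if not pattern:
        return not text

    # Does the first pattern character match the first text character?
    # A '.' in the pattern matches any single character.
    first_match = bool(text) and pattern[0] in (text[0], '.')

    # If the next pattern character is '*', it applies to pattern[0].
    if len(pattern) >= 2 and pattern[1] == '*':
        # Either skip the "x*" pair entirely (zero occurrences),
        # or consume one text character and reuse the same pattern.
        return match(text, pattern[2:]) or (first_match and match(text[1:], pattern))

    # No '*': consume one character from both the text and the pattern.
    return first_match and match(text[1:], pattern[1:])
```

For example, match("aaabbbb", "a*b*") returns True, while match("ab", ".") returns False.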

First, let’s discuss the recursive calls that exist in the program. We make one set of recursive calls when the first pattern character is followed by a *, which requires at least two characters left in the pattern. The other recursive call happens whenever the pattern still has characters left and there is no * to handle.

If we see a * in that second position, we recursively call the function with the pattern shortened by two characters, which treats the letter-and-* pair as matching zero characters. We then OR this result with a second case: the first characters match and the text from the second character onwards still matches the full pattern. Together, these two branches make sure the * allows zero or more occurrences of the preceding character in the text.

If we don’t see the * character, we simply shorten both the text and the pattern by one character and recursively call the function again. You can see that most of the complexity lies in the case where we have a * character, since this is where the most possibilities for a match exist. On each recursive call, all we do is check whether the text or pattern is empty and, if the pattern still has characters, whether the pattern and text still match.

This approach is good because it is quick to program, but its efficiency is far from the best possible. In general, plain recursion runs slower than optimized versions that take advantage of dynamic programming or similar approaches. This is made worse by the fact that our algorithm requires two recursive calls when we encounter a * character, one with pattern[2:] and one with text[1:]. Because of how these calls compound, we end up with a time complexity of:

O((T + P) · 2^(T + P/2))

This is because the pattern recursion can execute up to P/2 times and the text recursion up to T times, which is where the exponent comes from. On top of this, we have a third recursive call that uses text[1:] and pattern[1:], which gives us the (T + P) factor. Put together, we end up with a rather slow algorithm.


The Dynamic Programming Approach

When we encounter problems like this with a recursive solution, we should consider optimization techniques like dynamic programming. Dynamic programming allows us to remember checks we have already completed, reducing the number of steps and therefore the time complexity of our first algorithm.

The idea is that we start with the same basic logic as our first algorithm. We still check character by character, progressing as long as the text keeps matching our pattern. The difference is that each call now saves its result, which saves us from repeatedly looking back at the string and building substrings when we see the * character. The top-down approach looks like this:

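A memoized, top-down sketch of this idea might look like the following in Python; the function name dpTopDown comes from the discussion below, while the use of functools.lru_cache for memoization is an assumption about how the results are cached:

```python
from functools import lru_cache

def is_match(text, pattern):
    @lru_cache(maxsize=None)
    def dpTopDown(i, j):
        # If the pattern is exhausted, we match only if the text is too.
        if j == len(pattern):
            return i == len(text)

        # Does the current pattern character match the current text character?
        first_match = i < len(text) and pattern[j] in (text[i], '.')

        # A '*' follows pattern[j]: either skip the pair (j + 2) or consume
        # one text character (i + 1) while staying on the same pattern position.
        if j + 1 < len(pattern) and pattern[j + 1] == '*':
            return dpTopDown(i, j + 2) or (first_match and dpTopDown(i + 1, j))

        # No '*': advance both indices.
        return first_match and dpTopDown(i + 1, j + 1)

    return dpTopDown(0, 0)
```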

You will notice that a lot of this looks familiar compared to our recursive solution. We still recurse when we see a * or when there is more text left to parse. The difference is that rather than slicing with pattern[2:] and text[1:], we advance indices to j + 2 and i + 1 and keep track of the results we have already computed. Doing this allows us to skip the repeated checks our recursive solution was forced to do, because it kept no memory of previous results.

This small change helps our time complexity immensely. Before, each call spawned more work, so the total amount of work built up until we reached an exponential-time algorithm. Now, every call to the dpTopDown function completes in constant time, since all we do is compare the pattern against the text and move forward, and each (i, j) pair is computed at most once. There are at most T · P such pairs, where T and P are the lengths of the text and pattern respectively. This gives us an overall time complexity of O(T · P), which is polynomial rather than exponential.
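As a quick sanity check, the matcher sketched above agrees with the example patterns from earlier:

```python
print(is_match("ab", "ab"))         # True:  exact match
print(is_match("aaabbbb", "a*b*"))  # True:  zero or more a's, then b's
print(is_match("ba", ".a"))         # True:  any character, then 'a'
print(is_match("ba", "a*b*"))       # False: a's must come before b's
```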
