Making an OSS Contribution, Part 3: Into The Weeds with RegEx

How to read this particular post #

Make sure to read part 0, and watch part 1 and 2. It won’t make much sense otherwise.

We’ve started with a failing test. I encourage you to follow the setup instructions from Part 2.

Make sure to add the same test we did, which is visible here

Now, let’s make the test pass! Feel free to take a stab at it yourself, or watch along with us in the video.

Video Walkthrough #

Video Timestamps #

0:15 Conventions for writing Context blocks, generating useful error messages
1:00 our test failure matches the bug report. 💪
1:30 running a single test from the test file. Trying line number first
1:45 How to pass the name of the test through, in a regular expression: -n=/tamil/
3:00 Applying the fix “as suggested”, hoping it’ll be easy. (sad trombone)
4:15 Making the first change, running test, hoping for different output
4:40 Figuring out if what we’re thinking is getting called IS getting called
5:10 Oops, wrong mode. Updating the Regular Expression. Easy fix.
5:48 Now applying the given Regex to SLUGIFY_DEFAULT_REGEX
6:24 We made a change and got a change! Huzzah!
6:50 Oh boy. It didn’t work. Let’s debug. Lots of things to check. JUMP TO 33:45 TO SEE ANSWER IF YOU DON’T WANT TO WATCH US DEBUG THIS
8:28 “Slimming down” the magnitude of our change, seeing if that helps.
8:55 Adding non-special-characters to see if they show up in the test output. they don’t. :(
9:53 Adding a breakpoint so we can shorten the feedback loop. Pry for the win!
10:16 Oops. Pry doesn’t work.
10:35 Installing pry globally with -g flag. Still doesn’t work.
11:15 Pry still not working.
11:57 Incomprehensible error messages. Story of my life.
12:20 Looking up a Pry alternative, like IRB
12:35 using ‘debugger’ statement
13:10 Using require ‘irb’; binding.irb. it works!🎉
14:00 Exploring “state” of what’s available in this #replace_character_sequence_with_hyphen
15:00 Playing with return values of string.gsub(replaceable_char, "-")
15:20 Introducing Rubular, a great tool for working with Regular Expressions
16:05 Building up the regular expression and test string in Rubular
18:20 Trying the suggested Regular Expression in Rubular to see if it works
18:44 it works!
19:21 Trying the working Regular Expression in IRB
20:35 Still stumped. Even more stumped than before. This is not what we expected.
22:45 Still stumped. We’re wondering what we’re missing. Cannot find it.
24:32 Comparing strings. 💡
24:40 Pasting the regular expression straight into IRB
25:30 Looking at the construction of the regular expression
26:33 Trying to look in the docs for a hint of why Regexp.new(regex) might behave differently from /regex/
27:13 Regular Expression Character Properties. Check the notes for more details.
28:44 barking up the wrong tree.
29:00 Finding better words to google thanks to a JavaScript question
29:30 advanced GoogleFu: how to exclude certain answers
30:20 Noticing the escape characters. Getting warmer…
32:20 “OR-ed together.” What strange words.
33:50 Matt sees the problem! We have to double-escape certain characters. 🤦🏻‍♂️ So simple.
34:40 Seeing the correct regular expression, with \p{} visible
35:00 The test passes. 🙃. What a 🐇 hole!
35:45 Meta-principle: If evaluated Regular Expression isn’t doing what you expect, make sure the regex being run is the same as what you think you’re giving it.
36:38 Running all the tests. Still pass. 🏁 🎉🎊🎉🎊🎉🎊🎉🎊

Expanding on what came up in the walk-through #

Test Naming Conventions #

The context/should blocks in this project give nice error output, like:

The Utils.slugify method should break right now for this issue.

How nice!

Running a single test from a file #

If you’ve got 30 tests in a file, and many of them test the same method, you’d want to run just a single test in the file for two reasons:

This will save you time (no running unnecessary tests)
If you use pry or a breakpoint down the road, you can hit the breakpoint with the context of just the single test you’re working on.

In this case, we’d want to hit the breakpoint with மல்லிப்பூ வகைகள் as our input, not any of the other values in test/test_utils.rb.

How to run a specific test and exclude all others?

In RSpec, you can call rspec path/to/test/file:line_number, where line_number is the line of code your test is on.

We don’t have that option here, but we do have the --name flag, where we can specify the name of the test we want to run.

If our test name has the word tamil in it, we can say test/test_utils.rb -n /tamil/, and any test with the matching word in its name gets run.

According to the docs, you can pass either a regular expression or a string:

% ruby -Ilib:test test/minitest/test_minitest_test.rb --help
minitest options:
    -h, --help                       Display this help.
    -s, --seed SEED                  Sets random seed. Also via env. Eg: SEED=n rake
    -v, --verbose                    Verbose. Show progress processing files.
    -n, --name PATTERN               Filter run on /regexp/ or string.
    -e, --exclude PATTERN            Exclude /regexp/ or string from run.

Now you know!
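To make the filtering concrete, here is a rough sketch of how a name pattern narrows a list of tests. The test names and the filter_tests helper are made up for illustration, and this is not Minitest's actual code. It only mirrors the idea from the help text: a /regexp/ accepts any name that contains a match, while a plain string (in this sketch) has to equal the whole name.

```ruby
# Hypothetical test names, in the style this project's context/should blocks produce
test_names = [
  "The Utils.slugify method should replace whitespace with hyphens",
  "The Utils.slugify method should keep tamil characters intact",
  "The Utils.slugify method should drop trailing punctuation"
]

# Illustrative helper (not part of Minitest): keep the names the pattern accepts.
# Regexp#=== is true when the pattern matches anywhere in the name;
# String#=== is true only when the whole name is equal.
def filter_tests(names, pattern)
  names.select { |name| pattern === name }
end

by_regexp = filter_tests(test_names, /tamil/)
by_string = filter_tests(test_names, "tamil")

puts by_regexp         # prints only the "keep tamil characters intact" name
puts by_string.empty?  # prints true: no test is named exactly "tamil"
```

That difference is why we reached for the /tamil/ form: we only needed one distinctive word from a long test name.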

Adding a Breakpoint (Pry) #

We had success (and the answer!) from Stack Overflow:

How do I drop to the IRB prompt from a running script? (StackOverflow)

We ended up using:

require 'irb'
binding.irb

Now we can inspect the state of the program a bit:

> script/test test/test_utils.rb -n=/tamil/
+ ruby -S bundle exec ruby -I test test/test_utils.rb -n=/tamil/
# -------------------------------------------------------------
# SPECS AND TESTS ARE RUNNING WITH WARNINGS OFF.
# SEE: https://github.com/Shopify/liquid/issues/730
# SEE: https://github.com/jekyll/jekyll/issues/4719
# -------------------------------------------------------------

# Running tests with run options -n=/tamil/ --seed 59662:

cannot load such file -- awesome_print

From: /Users/joshthompson/crap/jekyll/lib/jekyll/utils.rb @ line 364 :

    359:         else
    360:           SLUGIFY_DEFAULT_REGEXP
    361:         end
    362:
    363:       # Strip according to the mode
 => 364:       require 'irb'; binding.irb
    365:       string.gsub(replaceable_char, "-")
    366:     end
    367:   end
    368: end

>> replaceable_char
=> /[^\p{M}\p{L}\p{Nd}]+/
>> string
=> "மல்லிப்பூ வகைகள்"
>> string.gsub(replaceable_char, '-')
=> "மல்லிப்பூ-வகைகள்"
>> mode
=> "default"
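The same check can be run outside the debugger. This is a small standalone sketch that reuses the regex and the string from the session above. The variable names are copied from the session, but the script itself is not part of Jekyll.

```ruby
# Values copied from the IRB session above
replaceable_char = /[^\p{M}\p{L}\p{Nd}]+/
string = "மல்லிப்பூ வகைகள்"

# Every run of characters that is not a mark, letter, or decimal digit becomes a hyphen
slug = string.gsub(replaceable_char, "-")

puts slug  # => மல்லிப்பூ-வகைகள்
```

Having a three-line script like this next to the test is a cheap way to confirm what the regex does on its own, separate from the rest of the method.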

Breaking the Regex into smaller pieces #

These are sorta crazy regular expressions to try to read:

/[^\p{M}\p{L}\p{Nd}._~!$&'()+,;=@]+/
/[^[:alnum:]._~!$&'()+,;=@]+/
/[^\\p{M}\\p{L}\\p{Nd}._~!$&'()+,;=@]+/

The characters in common are:

._~!$&'()+,;=@

If you put [._~!$&'()+,;=@] into Rubular, and give it the following test strings, you’ll see what it does:

this-matches_any;chars
_one!might~not%want.
=@in'a$url
_~!$&'()+,=@+

The + means “one or more of”, and the ^ right after the opening bracket negates the set. So these big regular expressions match runs of one or more characters that are not in the list enclosed inside the outer brackets.

Here’s how I “read” the regex:

[^anything_in_here_including._~!$&'()+,;=@]+
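If you'd rather poke at this in Ruby than in Rubular, here is a short sketch. The first part uses one of the test strings above against the shared character class. The second part uses a deliberately tiny negated class (my own example, not one of the Jekyll regexes) to show what the + adds.

```ruby
# The characters the three regexes have in common, as a character class
shared = /[._~!$&'()+,;=@]/

# scan returns every single-character match, in order
matches = "_one!might~not%want.".scan(shared)
p matches  # => ["_", "!", "~", "."]

# A small negated class with +: each run of non-lowercase-letters is one match
negated = /[^a-z]+/
collapsed = "one!might~~not".gsub(negated, "-")
p collapsed  # => "one-might-not"
```

Note that % is not in the shared list, so it is skipped, and that the two tildes collapse into a single hyphen because + treats the whole run as one match.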

Regular Expressions (character properties) #

There’s a lot to say about Regular Expressions, but in this case, we don’t have to worry too much about the exact details of the Regex we were working with.

We got wrapped up in Regex Character Properties. Take a look at Regex Character Properties(ruby-doc.org)

All the examples that they give look like the regular expression we got from @deepestblue’s GitHub issue: /[^\p{M}\p{L}\p{Nd}._~!$&'()+,;=@]+/

We can “parse” this Regex a bit. I’d never seen the \p{} thing before. From the docs:

The \p{} construct matches characters with the named property, much like POSIX bracket classes.

In the above Regex, we’ve got:

\p{M}
\p{L}
\p{Nd}

Which translates to:

\p{M}  => 'Mark'
\p{L}  => 'Letter'
\p{Nd} => 'Number: Decimal Digit'
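A few one-liners make these properties less abstract. This is just a sketch with sample characters I picked myself; the only part that comes from the bug is the Tamil pulli (U+0BCD), the dot-like sign that appears in the test string.

```ruby
# Letters: any script counts, not just a-z
letter = "é".match?(/\p{L}/)
p letter  # => true

# Decimal digits match \p{Nd}, but other numeric characters do not
digit = "7".match?(/\p{Nd}/)
fraction = "½".match?(/\p{Nd}/)
p digit     # => true
p fraction  # => false

# Marks: combining characters that attach to the letter before them
accent_marks = "e\u0301".scan(/\p{M}/).length  # "e" plus a combining acute accent
pulli = "\u0BCD".match?(/\p{M}/)               # Tamil pulli
p accent_marks  # => 1
p pulli         # => true

# A plain space has none of the three properties
space = " ".match?(/[\p{M}\p{L}\p{Nd}]/)
p space  # => false
```

So a class built from \p{M}, \p{L}, and \p{Nd} keeps letters, their attached marks, and digits, and treats everything else as replaceable.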

These were the changes that @deepestblue suggested. He suggested replacing the existing regex, shown below, with the \p{} version above:

[^[:alnum:]._~!$&'()+,;=@]

According to the docs, [:alnum:] is a POSIX bracket expression, which is similar to a character class.

If you’re like me, reading the above definition, you thought:

What does POSIX mean?

Wikipedia says:

The Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems.

None of this was what caused Matt and me to spend so much time working on this fix - we had a problem with character escaping.

Compare:

Regexp.new("[^\p{M}\p{L}\p{Nd}._~!$&'()+,;=@]+")
Regexp.new("[^\\p{M}\\p{L}\\p{Nd}._~!$&'()+,;=@]+")
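Here is a minimal sketch of why the two behave differently. It assumes the pattern is handed to Regexp.new as a double-quoted string, which is what bit us; the shortened pattern and the variable names are mine, for illustration only.

```ruby
# Double-quoted string, single backslash: Ruby drops the backslash before "p",
# so the regex engine never sees \p{L}
unescaped = Regexp.new("[^\p{L}]+")

# Double-quoted string, doubled backslash: the regex engine receives \p{L}
escaped = Regexp.new("[^\\p{L}]+")

# Single-quoted string: a backslash before an ordinary character is kept as-is
single_quoted = Regexp.new('[^\p{L}]+')

puts unescaped.source      # => [^p{L}]+
puts escaped.source        # => [^\p{L}]+
puts single_quoted.source  # => [^\p{L}]+

puts "ab cd".gsub(escaped, "-")    # => ab-cd
puts "ab cd".gsub(unescaped, "-")  # => -
```

Printing .source is the quickest way to see the regex that is actually being run, rather than the one you typed. The unescaped version quietly turns into "anything that is not p, {, L, or }", which is why it swallowed our whole string.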

As I was writing these notes, long after the video recording, I thought to myself:

Do the docs even mention the need for escape characters? We didn’t see any in use in the docs.

The answer is yes, the docs mention this:

Metacharacters and Escapes

The following are metacharacters (, ), [, ], {, }, ., ?, +, *. They have a specific meaning when appearing in a pattern. To match them literally they must be backslash-escaped. To match a backslash literally, backslash-escape it: \\.

The included example doesn’t really seem that helpful, either:

/1 \+ 2 = 3\?/.match('Does 1 + 2 = 3?') #=> #<MatchData "1 + 2 = 3?">
/a\\\\b/.match('a\\\\b')                #=> #<MatchData "a\\\\b">

Anytime in the future that I am dealing with Regular Expressions, I’ll think about escape characters.

Hopefully now you will too!

Checks for Understanding #

What are two reasons you might want to run just a specific test from a file?
Googling: How do you run a query and exclude results containing a certain word? (for example, googling on Regular Expressions, and you don’t want any answers that reference javascript)
If you see \p{something} in a regular expression, what does it mean? (Check the docs, no need to know this off the top of your head)
If I want my regular expression to be /\p{L}/, what should it really be?
What’s an alnum?
How many characters in a string will this regular expression capture? /[abc]/
How many characters in a string will this regular expression capture? /[abc]+/
Will /[^abc]+/ match a?
What online tool is helpful for using Regular Expressions with Ruby?
How do you run a specific test using the --name flag?
If pry doesn’t work in a Ruby project, what is a good alternative to try?

Next, jump over to part 4:

But before you go, why not subscribe to get updates when more guides in this series are done, as well as when future guides go up?