Regex cheat sheet for Python
I am writing a program that will find all the sex and sires in a raw data file and get rid of characters we don’t need anymore.
Character Classes
Character classes are short escape sequences: a backslash `\` followed by a letter. Each one stands in for a whole category of characters.
For instance, instead of writing `[A-Za-z0-9_]` you can just write `\w` and that will catch every word character. There are a handful of them:
Escape character | Use | Example |
---|---|---|
`\w` | Word characters | a-z, A-Z, 0-9, _ |
`\W` | Non-word characters | !@#$%^&*() |
`\d` | Digits | 0-9 |
`\D` | Non-digits | Anything except 0-9 |
`\s` | Whitespace | space, \t, \n, \r, \f, \v |
`\S` | Non-whitespace | Opposite of above |
`\b` | Word boundary | The position where a word starts or ends; `\bcat` finds "cat" only at the start of a word |
`\B` | Non-boundary | Opposite: a position inside a word; `\Bcat` finds the "cat" in "bobcat" |
The lowercase and uppercase versions are exact opposites. `\d` finds all digits (0-9) and `\D` finds everything that is not a digit.
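Here is a minimal sketch of the classes above using Python's built-in `re` module. The sample strings are made up for illustration and are not from the real data file.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Lot 42: sire_7 sold"

digits = re.findall(r"\d", text)   # every single digit: ['4', '2', '7']
words = re.findall(r"\w+", text)   # runs of word characters: ['Lot', '42', 'sire_7', 'sold']
chunks = re.split(r"\s+", text)    # split on whitespace: ['Lot', '42:', 'sire_7', 'sold']

# \b and \B match positions, not letters.
sample = "sire desire"
at_word_start = re.search(r"\bsire", sample).start()  # 0, the standalone "sire"
inside_word = re.search(r"\Bsire", sample).start()    # 7, the "sire" inside "desire"
```

Note that `\w+` keeps `sire_7` together because the underscore and the digit both count as word characters.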
Special characters
Special characters are where things get helpful. I don’t want to type out `\w\w\w\w` to find every 4-letter word. Seems silly.
Special Character | Use | Example |
---|---|---|
`^` | Beginning of a string | `^\d` finds a digit at the start of the string |
`$` | End of a string; whatever is to its left has to sit at the very end | `\w$` finds when I don’t end a line with punctuation |
`.` | Any character besides \n | `c.t` finds cat, cot, c9t |
`\` | Escape character; makes a special character literal | `\.` finds an actual period |
`\|` | OR | `A\|B` finds A or B |
`{N}` | Exact number of times the thing to its left needs to be found | `\w{4}` finds a word character 4 times in a row |
`{N,}` | Same as above, but N or more times | `\w{4,}` finds a run of 4 or more word characters |
`{N,X}` | Finds the thing to its left N through X times | `\w{4,6}` finds a run of 4 to 6 word characters |
`*` | Finds 0 or more times | `\w*` finds a word character 0 or more times |
`+` | Finds 1 or more times | `\d+` finds a digit 1 or more times |
`?` | Optional (0 or 1 times). Like the () around an area code: sometimes they’re there, other times they aren’t | `\(?\d{3}\)?` finds the area code with optional () |
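The rows above can be tried out with a short sketch like this one. It only uses Python's standard `re` module; the phone numbers and sentences are invented sample data.

```python
import re

# Hypothetical phone numbers, invented for this example.
phones = ["(555) 123-4567", "555 123-4567"]

# ^ anchors to the start; \(? and \)? make the parentheses optional.
area_code = re.compile(r"^\(?(\d{3})\)?")
codes = [area_code.match(p).group(1) for p in phones]    # ['555', '555']

sentence = "I like regex a lot"
four_letter = re.findall(r"\b\w{4}\b", sentence)         # ['like']
four_or_more = re.findall(r"\b\w{4,}\b", sentence)       # ['like', 'regex']
no_end_punct = bool(re.search(r"\w$", sentence))         # True
has_end_punct = bool(re.search(r"\w$", sentence + "."))  # False

either = re.findall(r"cat|dog", "cat, dog, bird")        # ['cat', 'dog']
any_char = re.findall(r"c.t", "cat cot c\nt")            # ['cat', 'cot'], . skips \n
escaped = re.findall(r"\$\d+", "cost $15")               # ['$15'], \$ is a literal $
```

Wrapping `\w{4}` in `\b...\b` is what limits the match to whole 4-letter words; without the boundaries it would also match the first four letters of longer words.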
Sets
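Sets use brackets and match one character out of a group. An example would be `[A-Z]`. That would match any single uppercase letter A-Z.
You don’t need to repeat letters inside a set; listing a letter once is enough. For instance, if I wanted a set that covers the characters in “James Carney” it would look like this:
[jamescrny\s]
Carney looks a little weird because it shares some letters with James, so they don’t need to be repeated. I also added `\s` because there is a space between James and Carney. On its own the set matches just one of those characters at a time, and it is case sensitive, so the capital J and C only match if you add them to the set or ignore case. Put a `+` after the set to match a whole run of those characters.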
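To see what that set actually matches, here is a small sketch with Python's `re` module. The variant with capital letters and the "say yes" test string are my own additions for illustration.

```python
import re

name = "James Carney"

# A set matches one character at a time, and lowercase letters only
# match lowercase, so the capital J and C are skipped here.
one_at_a_time = re.findall(r"[jamescrny\s]", name)
# ['a', 'm', 'e', 's', ' ', 'a', 'r', 'n', 'e', 'y']

# Adding the capitals and a + matches the whole name as one run.
whole_name = re.fullmatch(r"[JamesCrny\s]+", name) is not None                   # True

# Alternatively, keep the lowercase set and ignore case.
ignore_case = re.fullmatch(r"[jamescrny\s]+", name, re.IGNORECASE) is not None   # True

# The set only restricts which characters are allowed, not their order,
# so other strings built from the same letters match too.
also_matches = re.fullmatch(r"[JamesCrny\s]+", "say yes") is not None            # True
```

If the goal is to find that exact name, the literal pattern `James Carney` is the safer choice; the set is better suited to "any mix of these characters".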
Set layout | Purpose |
---|---|
[ ] | Regular set. Matches any one character listed between the brackets |
[milk] | Matches any one of m, i, l or k, not the whole word “milk” |
[a-z] | Matches one lowercase letter a-z |
[-a] | Matches - or ‘a’; the - is literal because it is at the beginning of the set |
[^anything] | The ^ at the start excludes everything in the brackets, so this matches any character that is not listed |
[ ]?, [ ]*, [ ]+ | A set followed by a special character; the special character determines how many characters from the set you’re looking for |
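Each row of the set table can be checked with a line or two of Python. This is a rough sketch using the standard `re` module; the input strings are invented, and the last line is only a guess at what stripping unneeded characters might look like.

```python
import re

any_of = re.findall(r"[milk]", "silk")             # ['i', 'l', 'k'], one character per match
lower_runs = re.findall(r"[a-z]+", "Raw DATA file")  # ['aw', 'file']
dash_or_a = re.findall(r"[-a]", "a-b")             # ['a', '-'], leading - is literal
not_digits = re.findall(r"[^0-9]+", "id42x")       # ['id', 'x']
digit_runs = re.findall(r"[0-9]+", "id42x7")       # ['42', '7']

# A negated set is handy for removing characters: drop everything
# that is not a word character or whitespace.
cleaned = re.sub(r"[^\w\s]", "", "sire: Big_Red!!")  # 'sire Big_Red'
```

The `re.sub` call keeps the underscore because `_` counts as a word character under `\w`.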