Regular Expressions : RegEx

Regular expressions (RegEx) are a way to describe and search for patterns in text.

https://www.activestate.com/wp-content/uploads/2020/03/Python-RegEx-Cheatsheet.pdf

The concept is not tied to Python: most programming languages and many text editors support regular expressions, though the exact syntax varies slightly between them.

The goal of a regular expression is to find a specific kind of text inside a string. Suppose a form on our webpage asks for an email address: can we check whether the submitted string actually has the shape of an email? That shape is some letters, numbers, or special characters, then an @ sign, then more letters, numbers, or special characters, then a ., then a few more letters.
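As a rough sketch of that email check in Python: the pattern below is a deliberately loose assumption that follows the shape described above, not a complete validator, and the names EMAIL_SHAPE and looks_like_email are made up for this example.

```python
import re

# Hypothetical, loose pattern: local part, "@", domain, ".", then a few letters.
EMAIL_SHAPE = re.compile(r'[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}')

def looks_like_email(candidate):
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_SHAPE.fullmatch(candidate) is not None

print(looks_like_email('abhi-arya@gmail.com'))
print(looks_like_email('not an email'))
```

fullmatch is used so that the entire input has to fit the pattern, not just a piece of it.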

RegEx Keys

. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)
\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group
Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
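To get a feel for a few of these keys, here is a small sketch that runs them through re.findall. The sample string is made up for illustration.

```python
import re

sample = 'Call 555-1234 at 9 am'  # made-up sample text

digits = re.findall(r'\d', sample)        # single digits
words = re.findall(r'\w+', sample)        # runs of word characters
spaces = re.findall(r'\s', sample)        # individual whitespace characters
first_word = re.findall(r'^\w+', sample)  # word anchored at the start of the string
last_word = re.findall(r'\w+$', sample)   # word anchored at the end of the string

print(digits)
print(words)
print(len(spaces))
print(first_word, last_word)
```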

Importing the re library:

import re

text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
123abc

Hello HelloHello

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

gmail.com

321-555-4321
123.555.1234

abhi-arya@gmail.com

Mr. Johnson
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

Searching Literals

pattern = re.compile(r'cba')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output: nothing is printed. The pattern searches for the literal sequence 'cba', and the text only contains 'abc'. The order of the characters matters.

Searching special characters

Square brackets []
Square brackets define a character set. Exactly one character from the set must match at that position, unless a quantifier says otherwise.

Example: Malwareb[yi]tes matches Malwarebytes and Malwarebites, but not Malwarebyites.
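A quick way to check this is to try the set [yi] against each spelling. This is a minimal sketch that assumes the intended pattern is Malwareb[yi]tes.

```python
import re

pattern = re.compile(r'Malwareb[yi]tes')  # [yi] stands for exactly one of y or i

results = {}
for word in ['Malwarebytes', 'Malwarebites', 'Malwarebyites']:
    results[word] = pattern.fullmatch(word) is not None

print(results)
```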

Hyphen -
Inside square brackets, the hyphen (minus sign) specifies a range of characters.
Example: [0-9] matches any single digit from 0 to 9.

Curly brackets {}
Curly brackets specify how many times the preceding element must repeat.
Example: [0-9]{3} matches any three-digit sequence from 000 to 999.
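Here is a short sketch of the [0-9]{3} pattern, using a plain ASCII hyphen for the range. The test strings are made up, apart from the phone number, which comes from the sample text.

```python
import re

three_digits = re.compile(r'[0-9]{3}')  # exactly three digits in a row

exact = three_digits.fullmatch('042') is not None      # three digits
too_short = three_digits.fullmatch('42') is not None   # only two digits
too_long = three_digits.fullmatch('1234') is not None  # four digits

# findall scans left to right and returns non-overlapping matches.
chunks = three_digits.findall('321-555-4321')

print(exact, too_short, too_long)
print(chunks)
```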

Parentheses ()
Parentheses group characters. A match must contain the grouped characters in exactly the given order.

Example: (are) matches inside malware, but not inside aerial, because there the same letters appear in a different order.
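The same example as a small sketch. search looks for the group anywhere in the string, and group(1) returns what the parentheses captured.

```python
import re

pattern = re.compile(r'(are)')

hit = pattern.search('malware')   # contains a, r, e in that order
miss = pattern.search('aerial')   # same letters, different order

print(hit.group(1), hit.span(1))
print(miss)
```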

Pipe |
The pipe (vertical bar), as in many languages, stands for the logical “or” operator.
Example: Most|more matches either of the two words.
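A minimal sketch of the alternation, run over a made-up sentence. Matching is case-sensitive by default, so the lowercase most at the end is not picked up.

```python
import re

pattern = re.compile(r'Most|more')  # either alternative may match

found = pattern.findall('Most people want more, not most.')
print(found)
```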

Period .
The dot or period acts as a wildcard. It matches any single character, except line break characters.
Example: Malwareb.tes will be a match for Malwarebytes, Malwarebites, Malwarebotes, and many others, but still not for Malwarebyites.
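The wildcard can be tried out like this. The candidate list is made up; the last entry puts a newline where the wildcard sits, to show that . does not match a line break by default.

```python
import re

pattern = re.compile(r'Malwareb.tes')  # . stands for any one character except a newline

candidates = ['Malwarebytes', 'Malwarebites', 'Malwarebotes',
              'Malwarebyites', 'Malwareb\ntes']
matched = [c for c in candidates if pattern.fullmatch(c)]

print(matched)
```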

Backslash \

The backslash escapes special characters so they are matched literally, and it gives special meaning to some ordinary characters that follow it.
Examples: \d matches one digit (0-9).
\w matches one word character (a letter, a digit, or an underscore).
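Both roles of the backslash can be seen in a short sketch. The sample string is made up for illustration.

```python
import re

sample = 'gmail.com gmailxcom a_1!'  # made-up sample text

wildcard = re.findall(r'gmail.com', sample)   # unescaped . matches any character
literal = re.findall(r'gmail\.com', sample)   # escaped \. matches only a real period
digits = re.findall(r'\d', sample)            # \d gives d a special meaning: a digit
word_chars = re.findall(r'\w', 'a_1!')        # \w: letters, digits, and underscore

print(wildcard)
print(literal)
print(digits, word_chars)
```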

Asterisk *
The asterisk is a repeater. It matches when the element preceding it occurs 0 or more times.
Example: cho*se matches chose and choose, but also chse (zero occurrences of o).

Asterisk and period .*
The asterisk is used in combination with the period to match for any character 0 or more times.
Example: Malware.* will match for Malware, Malwarebytes, and any misspelled version that starts with Malware.
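A small sketch of .* with made-up candidates. Anything that starts with Malware is accepted, including the bare word itself.

```python
import re

pattern = re.compile(r'Malware.*')  # Malware followed by zero or more of any character

candidates = ['Malware', 'Malwarebytes', 'Malwarebitez', 'Mal']
matched = [c for c in candidates if pattern.fullmatch(c)]

print(matched)
```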

Plus sign +
The plus sign matches when the character preceding + matches 1 or more times.
Example: cho+se will match for chose and choose, but not for chse.
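The difference between the two repeaters shows up when they are run side by side. This is a sketch over a made-up word list.

```python
import re

words = ['chse', 'chose', 'choose', 'chooose']

# * allows zero or more o's, + requires at least one.
star_matches = [w for w in words if re.fullmatch(r'cho*se', w)]
plus_matches = [w for w in words if re.fullmatch(r'cho+se', w)]

print(star_matches)
print(plus_matches)
```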

pattern = re.compile(r'.')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output: every character except newlines

pattern = re.compile(r'\.')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output: every literal '.' character

<re.Match object; span=(129, 130), match='.'>
<re.Match object; span=(163, 164), match='.'>
<re.Match object; span=(185, 186), match='.'>
<re.Match object; span=(189, 190), match='.'>
<re.Match object; span=(211, 212), match='.'>
<re.Match object; span=(219, 220), match='.'>
<re.Match object; span=(250, 251), match='.'>
<re.Match object; span=(263, 264), match='.'>

# Any character that is not a digit
pattern = re.compile(r'\D')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# A digit followed by a word character
pattern = re.compile(r'\d\w')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# A digit followed by whitespace
pattern = re.compile(r'\d\s')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Word boundary

# Hello HelloHello
pattern = re.compile(r'Hello')  # searching for 'Hello'
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output:

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(80, 85), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>

pattern = re.compile(r'Hello\b')  # 'Hello' followed by a word boundary
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output:

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>

Note: Here we searched for 'Hello' followed by a word boundary (a space, a newline, etc.), so the first half of 'HelloHello', which is followed directly by another letter, is skipped.

pattern = re.compile(r'\bHello\b')  # word boundary on both sides
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output:

<re.Match object; span=(74, 79), match='Hello'>

Other examples worth trying:

pattern = re.compile(r'\BHello\b')  # \B - Not a word boundary
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

pattern = re.compile(r'\b\d')  # a digit at the start of a word
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

pattern = re.compile(r'^\s')  # whitespace (space, tab, newline) at the beginning of the string
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Character sets

pattern = re.compile(r'[123]\w')  # 1, 2, or 3 followed by a word character
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Output :

<re.Match object; span=(55, 57), match='12'>
<re.Match object; span=(57, 59), match='34'>
<re.Match object; span=(66, 68), match='12'>
<re.Match object; span=(68, 70), match='3a'>
<re.Match object; span=(169, 171), match='32'>
<re.Match object; span=(178, 180), match='32'>
<re.Match object; span=(182, 184), match='12'>
<re.Match object; span=(190, 192), match='12'>
<re.Match object; span=(192, 194), match='34'>

# Two consecutive lowercase letters
pattern = re.compile(r'[a-z][a-z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# A letter or digit followed by a letter or a hyphen
pattern = re.compile(r'[a-zA-Z0-9][a-zA-Z-]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# A letter followed by anything that is not a letter
pattern = re.compile(r'[a-zA-Z][^a-zA-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Character groups

# 'abc', 'com', or 'texas' at the end of a word
pattern = re.compile(r'(abc|com|texas)\b')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# An uppercase letter or 'llo', followed by a letter
pattern = re.compile(r'([A-Z]|llo)[a-zA-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Quantifiers

# 'Mr', an optional period, whitespace, then a capital letter
pattern = re.compile(r'Mr\.?\s[A-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# Same as above, plus any lowercase letters that follow the capital
pattern = re.compile(r'Mr\.?\s[A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# 'Ms' or 'Mrs' instead of 'Mr'
pattern = re.compile(r'M(s|rs)\.?\s[A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# Phone numbers: 3 digits, 3 digits, 4 digits, separated by '.' or '-'
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# Word characters, a period, then three lowercase letters (e.g. gmail.com)
pattern = re.compile(r'[a-zA-Z0-9_]+\.[a-z]{3}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

# Email addresses
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

Accessing information in the Match object

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat.span(0))   # (start, end) of the whole match
    print(mat.group(0))  # the matched text itself
    print(text_to_search[mat.span(0)[0]:mat.span(0)[1]])  # same text, sliced by span

urls = r'''
https://www.google.com
http://yahoo.com
https://www.whitehouse.gov
https://craigslist.org
'''

# http or https, an optional 'www.', then domain and top-level domain
pattern = re.compile(r'https?://(www\.)?\w+\.\w+')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat)

# Group 2 is the domain name, group 3 is the top-level domain
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(2) + mat.group(3))

# The same pieces, recovered by slicing with each group's span
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(0))
    print(urls[mat.span(2)[0]:mat.span(2)[1]] + urls[mat.span(3)[0]:mat.span(3)[1]])
