Regular Expressions

(Usage hints for this presentation)

Winter Term 2023/2024
Dr. Jens Lechtenbörger (License Information)

Chair of Data Science: Machine Learning and Data Engineering (Prof. Gieseke)
Dept. of Information Systems
University of Münster, Germany

Agenda

1. Introduction
2. Background
3. Regular Expressions
4. The Real World
5. Learning Objectives

1. Introduction

1.1. Regular Expressions

Regular expressions (regexps) are formalism to define languages (permissible strings/words)

Sample regexps use cases
- Validate whether string conforms to format/pattern
  - Special case: Define tokens in programming languages
- Find strings of specific pattern in text
- Replace strings of specific pattern in text
- Split/elementize strings

1.2. Regular Expression as Tool

Versatile tool
- In most programming languages
- In reasonable text editors
- In UNIX command line tools, e.g., grep, sed
- In data profiling tools (patterns for phone numbers, e-mail addresses, URIs, …)
- In data integration tools (matching of various formats to standardized representation)

1.3. Regexp Facts

Regexp matching can be implemented via finite-state automata
- Some implementations such as Rust’s regex guarantee linear-time behavior
- Others may come with backtracking and exponential run-time, breaking the Internet
Regular expressions and (deterministic or non-deterministic) finite state automata specify precisely the same languages, namely the regular languages
- There is no regexp to match correctly nested (arbitrarily deeply) parenthesized expressions
  - (E.g., “(1 + (2 * ((3 * …)…)…)…)”)
  - Reason: finite memory!
  - Regular languages vs context-free languages, Chomsky hierarchy

This slides mentions two facts about regular expressions, for which actually formal proofs exist.

First, every regular expression can be implemented via a finite-state automaton, which is a particular machine model for computation where only a finite amount of memory is available.

Second, the expressive power of regular expressions is limited as far as formal languages are concerned.

A famous example of a language that cannot be defined via a regular expression is that of properly nested parentheses.

Different classes of languages can be arranged in the so-called Chomsky hierarchy, about which you can learn more at Wikipedia or in textbooks on theoretical computer science or on compiler construction.

Those classes of languages include regular languages, defined by regular expressions, context-free languages, which include parenthesized expressions and are fundamental in the definition of programming languages, and more.

2. Background

2.1. Alphabets, Words, and Languages

Alphabet = finite set of symbols, e.g.:
- Latin characters
- Digits
- ASCII, UTF-8
- Keyboard alphabet

Word (over alphabet A) = String (over alphabet A) = finite sequence of symbols (over alphabet A)
- ε denotes the empty word (no symbols)

Language (over alphabet A) = set of words (over alphabet A)

2.2. Operations on Words

If v and w are words, then vw is a word, the concatenation of v and w
- E.g., v = data, w = integration. Then vw = dataintegration
- ε is the neutral element w.r.t. concatenation, i.e., εw = wε = w for all words w

If w is a word and k a non-negative integer then the power w^k is defined as follows:
- w⁰ = ε
- w^k = w^k-1w for k > 0

E.g.;
- data¹ = data⁰data = εdata = data
- data² = data¹data = datadata

2.2.1. Quiz on Words

2.3. Operations on Languages

2.3.1. Let L and M be languages

L ∪ M is usual set union

LM = { vw | v ∈ L and w ∈ M } (concatenation)

L^* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ … (Kleene star/closure)
- Where L⁰ = { ε } and L^k+1 = L^kL (k ≥ 0)
- Zero or more concatenations of L with itself

L⁺ = L¹ ∪ L² ∪ L³ ∪ … (positive closure)
- One or more concatenations of L with itself

2.3.2. Language Operation Examples

L₁ = { ε, ab, abb }, L₂ = { b, ba }
L₁ L₂ = ?
L₂ L₁ = ?
L₁³ = ?
L₂³ = ?

2.3.3. Quiz on Language Operations

3. Regular Expressions

3.1. Regexp Definition

Regexps and their languages (over alphabet A) defined recursively as follows
1. ε is a regexp with L(ε) = {ε}
2. a is a regexp with L(a) = {a} for every symbol a ∈ A
3. Let r and s be regexps with languages L(r) and L(s).
  - (r)|(s) is a regexp with L((r)|(s)) = L(r) ∪ L(s)
    - Alternative
  - (r)(s) is a regexp with L((r)(s)) = L(r) L(s)
    - Concatenation
  - (r)^* is a regexp with L((r)^*) = (L(r))^*
    - Kleene star
  - (r) is a regexp with L((r)) = L(r)

We are now ready to define regular expressions, which in turn define languages.

The empty word is a regular expression, whose language contains nothing but the empty word.

Also, each individual symbol is a regular expression, whose language contains nothing but the symbol as a word.

Recursively, four rules allow to construct more complex regular expressions from given regular expressions.

First, the vertical bar denotes an alternative, so the resulting language is the union of the original languages.

Second, the concatenation of regular expressions is again a regular expression, whose language is the concatenation of the original languages.

Third, the Kleene star can be applied to a regular expression, which results in the language obtained by applying the Kleene star to the original language.

Fourth, parentheses can be applied without changing the defined language.

3.2. Parentheses

By definition, regexps contain many parentheses
Reduction of amount of parentheses via precedence rules for operators
- Kleene star > Concatenation > Alternative
- E.g.: ((a)(a))|(((a)(b))^*(c)) = aa|(ab)^*c

3.3. Regexp Matching

Regexp E matches word w if w ∈ L(E)
E₁ = data
- L(E₁) = { data }
- E₁ matches data (but not integration)
E₂ = (data)|(integration) = data|integration
- L(E₂) = { data, integration }
- E₂ matches data and integration (but not dataintegration)
E₃ = (E₂)*
- L(E₃) = ?

3.3.1. Quiz on RegExp Language

3.4. Major Regexp Shorthands

E regular expression, n < m integers
- E+ = EE* (at least one E)
- E? = E|ε (E is optional)
- E{n,m} = Eⁿ|Eⁿ⁺¹|…|E^m (n to m repetitions of E)
Alphabet A = { a1, a2, …, an }
- . = (a1|a2|…|an) (dots represents any symbol of A)
- [a_i1a_i2…a_ik] = (a_i1|a_i2|…|a_ik) (set of symbols)
- [^a_i1a_i2…a_ik] (complemented set; all symbols except { a_i1, a_i2, …, a_ik })
In practice, symbols of alphabets are ordered (e.g., a < c < z; 1 < 9)
- a’, a’’ ∈ A, a’ < a’’. Then [a’-a’’] = a’|…|a’’ (range of symbols)
  - [0-9] = 0|1|2|3|4|5|6|7|8|9

4. The Real World

4.1. Regexps in Practice

Various Implementations
- Wikipedia
- In particular, PCRE (Perl Compatible Regular Expressions)
  - Used in many tools and programming languages
  - All of the above: Concatenation, alternative (|), closure (*), positive closure (+), optional (?), sets ([…]), repetition ({n,m})
  - And more: Matching at beginning/end of word/line (^ and $), character classes (predefined expressions for words, numbers, whitespace, …), look-ahead, look-behind, grouping and backreference (parentheses and numbers) …
- E.g.:
  - data matches data integration and MIS and data warehousing
  - ^data matches data integration but not MIS and data warehousing; data$ matches neither
  - \d matches single digit, which is [0-9], which is (0|1|…|9)
  - [1-9]\d* matches positive integers without leading zeros
  - [hc]?at matches hat, cat, and at.

4.2. Learning Regexps

Lots of free software tools to build regexps and verify their effects
Some examples
- https://regexr.com/
- M-x regexp-builder in GNU Emacs
- Next slide in Python
  - Live code execution, editable
    - Based on in-browser Python implementation (skulpt), not complete
- Try out Jupyter notebooks (or Emacs IPython Notebook)
  - Web application for documents with live code, visualizations, documentation
    - Support for lots of languages
    - Conversations with and about data

4.2.1. Regexps in Python


import re # See https://docs.python.org/3/library/re.html
m1 = re.search("^Data", "Data Integration")
print("Match: '{}'".format(m1.group(0))) # Group 0 is entire match
m2 = re.search("^Data", "MIS and Data Warehousing")
if m2 is None:
        print("No match.")

4.3. Regexp Example: E-Mail

See https://www.regular-expressions.info/email.html
^[A-Z0-9.\_%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
- To be used with case-insensitive matching
Design decisions
- Match most real addresses but not all
  - Only ASCII allowed here
- Also allow some invalid strings
  - Multiple subsequent dots are allowed
- Don’t care about domain names
  - Could explicitly specify alternatives for top-level domains (TLDs), e.g., .museum missing but .asfg allowed

4.4. Patterns with Regexps

Regexp for telephone numbers?
- One expression matching both examples:
  - +49-251-83-38150
  - +49 251 83 38151
Different regexp for alternative format?
- (0251) (83) 38158

5. Learning Objectives

Explain what regexps are and where they are useful
Give examples for what regexps can and cannot do
- In general
- Matching accuracy
Read and write regexps for sample use cases
- Enumerate their languages

License Information

Source files are available on GitLab (check out embedded submodules) under free licenses. Icons of custom controls are by @fontawesome, released under CC BY 4.0.

No warranties are given. The license may not give you all of the permissions necessary for your intended use.

In particular, trademark rights are not licensed under this license. Thus, rights concerning third party logos (e.g., on the title slide) and other (trade-) marks (e.g., “Creative Commons” itself) remain with their respective holders.