Imprint | Privacy Policy

Regular Expressions

(Usage hints for this presentation)

Data Integration 2019/20
Dr. Jens Lechtenbörger (License Information)

DBIS Group
Prof. Dr. Gottfried Vossen
Chair for Computer Science
Dept. of Information Systems
WWU Münster, Germany

1 Usage hints

2 Introduction

2.1 Regular Expressions

  • Regular expressions (regexps) are formalism to define languages (permissible strings/words)
  • Sample regexps use cases
    • Validate whether string conforms to format/pattern
      • Special case: Define tokens in programming languages
    • Find strings of specific pattern in text
    • Replace strings of specific pattern in text
    • Split/elementize strings

2.2 Regular Expression as Tool

  • Versatile tool
    • In most programming languages
    • In reasonable text editors
    • In UNIX command line tools, e.g., grep, sed
    • In data profiling tools (patterns for phone numbers, e-mail addresses, URIs, …)
    • In data integration tools (matching of various formats to standardized representation)

2.3 Regexp Facts

  • Regexp matching can be implemented via finite-state automata
  • Regular expressions and (deterministic or non-deterministic) finite state automata specify precisely the same languages, namely the regular languages
    • There is no regexp to match correctly nested (arbitrarily deeply) parenthesized expressions
      • (E.g., “(1 + (2 * ((3 * …)…)…)…)”)
      • Reason: finite memory!
      • Regular languages vs context-free languages, Chomsky hierarchy

3 Background

3.1 Alphabets, Words, and Languages

  • Alphabet = finite set of symbols, e.g.:
    • Latin characters
    • Digits
    • ASCII, UTF-8
    • Keyboard alphabet
  • Word (over alphabet A) = String (over alphabet A) = finite sequence of symbols (over alphabet A)
    • ε denotes the empty word (no symbols)
  • Language (over alphabet A) = set of words (over alphabet A)

3.2 Operations on Words

  • If v and w are words, then vw is a word, the concatenation of v and w

    • E.g., v = data, w = integration. Then vw = dataintegration
    • ε is the neutral element w.r.t. concatenation, i.e., εw = = w for all words w
  • If w is a word and k a non-negative integer then the power wk is defined as follows:
    • w0 = ε
    • wk = wk-1w for k > 0
  • E.g.;
    • data1 = data0data = εdata = data
    • data2 = data1data = datadata

3.2.1 Quiz on Words

3.3 Operations on Languages

3.3.1 Let L and M be languages

  • L ∪ M is usual set union
  • LM = { vw | v ∈ L and w ∈ M } (concatenation)
  • L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ … (Kleene star/closure)
    • Where L0 = { ε } and Lk+1 = LkL (k ≥ 0)
    • Zero or more concatenations of L with itself
  • L+ = L1 ∪ L2 ∪ L3 ∪ … (positive closure)
    • One or more concatenations of L with itself

3.3.2 Language Operation Examples

  • L1 = { ε, ab, abb }, L2 = { b, ba }
  • L1 L2 = ?
  • L2 L1 = ?
  • L13 = ?
  • L23 = ?

3.3.3 Quiz on Language Operations

4 Regular Expressions

4.1 Regexp Definition

  • Regexps and their languages (over alphabet A) defined recursively as follows
    1. ε is a regexp with L(ε) = {ε}
    2. a is a regexp with L(a) = {a} for every symbol a ∈ A
    3. Let r and s be regexps with languages L(r) and L(s).
      • (r)|(s) is a regexp with L((r)|(s)) = L(r) ∪ L(s)
        • Alternative
      • (r)(s) is a regexp with L((r)(s)) = L(r) L(s)
        • Concatenation
      • (r)* is a regexp with L((r)*) = (L(r))*
        • Kleene star
      • (r) is a regexp with L((r)) = L(r)

4.2 Parentheses

  • By definition, regexps contain many parentheses
  • Reduction of amount of parentheses via precedence rules for operators
    • Kleene star > Concatenation > Alternative
    • E.g.: ((a)(a))|(((a)(b))*(c)) = aa|(ab)*c

4.3 Regexp Matching

  • Regexp E matches word w if w ∈ L(E)
  • E1 = data
    • L(E1) = { data }
    • E1 matches data (but not integration)
  • E2 = (data)|(integration) = data|integration
    • L(E2) = { data, integration }
    • E2 matches data and integration (but not dataintegration)
  • E3 = (E2)*
    • L(E3) = ?

4.3.1 Quiz on RegExp Language

4.4 Major Regexp Shorthands

  • E regular expression, n < m integers
    • E+ = EE* (at least one E)
    • E? = E|ε (E is optional)
    • E{n,m} = En|En+1|…|Em (n to m repetitions of E)
  • Alphabet A = { a1, a2, …, an }
    • . = (a1|a2|…|an) (dots represents any symbol of A)
    • [ai1ai2…aik] = (ai1|ai2|…|aik) (set of symbols)
    • [^ai1ai2…aik] (complemented set; all symbols except { ai1, ai2, …, aik })
  • In practice, symbols of alphabets are ordered (e.g., a < c < z; 1 < 9)
    • a’, a’’ ∈ A, a’ < a’’. Then [a’-a’’] = a’|…|a’’ (range of symbols)
      • [0-9] = 0|1|2|3|4|5|6|7|8|9

5 The Real World

5.1 Regexps in Practice

  • Various standards
    • See https://en.wikipedia.org/wiki/Regular_expression#Standards
    • In particular, PCRE (Perl Compatible Regular Expressions)
      • Used in many tools and programming languages
      • All of the above: Concatenation, alternative (|), closure (*), positive closure (+), optional (?), sets ([…]), repetition ({n,m})
      • And more: Matching at beginning/end of word/line (^ and $), character classes (predefined expressions for words, numbers, whitespace, …), look-ahead, look-behind, grouping and backreference (parentheses and numbers) …
    • E.g.:
      • data matches data integration and MIS and data warehousing
      • ^data matches data integration but not MIS and data warehousing
      • data$ matches neither
      • \d matches single digit, which is [0-9], which is (0|1|…|9)
      • [1-9]\d* matches positive integers without leading zeros
      • [hc]?at matches hat, cat, and at.

5.2 Learning Regexps

  • Lots of free software tools to build regexps and verify their effects
  • Some examples
    • https://regexr.com/
    • M-x regexp-builder in GNU Emacs
    • Next slide in Python
      • Live code execution, editable
        • Based on in-browser Python implementation (skulpt), not complete
    • Try out Jupyter notebooks (or Emacs IPython Notebook)
      • Web application for documents with live code, visualizations, documentation
        • Support for lots of languages
        • Conversations with and about data

5.2.1 Regexps in Python


import re # See https://docs.python.org/3/library/re.html
m1 = re.search("^Data", "Data Integration")
print("Match: '{}'".format(m1.group(0))) # Group 0 is entire match
m2 = re.search("^Data", "MIS and Data Warehousing")
if m2 is None:
    print("No match.")

5.3 Regexp Example: E-Mail

  • See https://www.regular-expressions.info/email.html
  • ^[A-Z0-9.\_%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
    • To be used with case-insensitive matching
  • Design decisions
    • Match most real addresses but not all
      • Only ASCII allowed here
    • Also allow some invalid strings
      • Multiple subsequent dots are allowed
    • Don’t care about domain names
      • Could explicitly specify alternatives for top-level domains (TLDs), e.g., .museum missing but .asfg allowed

5.4 Patterns with Regexps

  • Regexp for telephone numbers?
    • One expression matching both examples:
      • +49-251-83-38150
      • +49 251 83 38151
  • Different regexp for alternative format?
    • (0251) (83) 38158

6 Learning Objectives

  • Explain what regexps are and where they are useful
  • Give examples for what regexps can and cannot do
    • In general
    • Matching accuracy
  • Read and write regexps for sample use cases
    • Enumerate their languages

License Information

Except where otherwise noted, the work “Regular Expressions”, © 2019-2020 Jens Lechtenbörger, is published under the Creative Commons license CC BY-SA 4.0.

No warranties are given. The license may not give you all of the permissions necessary for your intended use.

In particular, trademark rights are not licensed under this license. Thus, rights concerning third party logos (e.g., on the title slide) and other (trade-) marks (e.g., “Creative Commons” itself) remain with their respective holders.