(Usage hints for this presentation)

Data Integration 2019/20

Dr. Jens Lechtenbörger (License Information)

- Regular expressions (regexps) are formalism to define languages (permissible strings/words)

- Sample regexps use cases
**Validate**whether string conforms to format/pattern- Special case:
**Define**tokens in programming languages

- Special case:
**Find**strings of specific**pattern**in text**Replace**strings of specific**pattern**in text**Split/elementize**strings

- Versatile tool
- In most programming languages
- In reasonable text editors
- In UNIX command line tools, e.g.,
`grep`

,`sed`

- In data profiling tools (patterns for phone numbers, e-mail addresses, URIs, …)
- In data integration tools (matching of various formats to standardized representation)

- Regexp matching can be implemented via
finite-state automata
- Some implementations such as Rust’s regex guarantee linear-time behavior
- Others may come with backtracking and
**exponential**run-time, breaking the Internet

- Regular expressions and (deterministic or non-deterministic)
finite state automata specify precisely the same languages, namely
the
**regular languages**- There is no regexp to match correctly nested
(arbitrarily deeply) parenthesized expressions
- (E.g., “(1 + (2 * ((3 * …)…)…)…)”)
- Reason: finite memory!
- Regular languages vs context-free languages, Chomsky hierarchy

- There is no regexp to match correctly nested
(arbitrarily deeply) parenthesized expressions

**Alphabet**= finite set of symbols, e.g.:- Latin characters
- Digits
- ASCII, UTF-8
- Keyboard alphabet

**Word**(over alphabet A) = String (over alphabet A) = finite sequence of symbols (over alphabet A)- ε denotes the
**empty**word (no symbols)

- ε denotes the

**Language**(over alphabet A) = set of words (over alphabet A)

If v and w are words, then

**vw**is a word, the**concatenation**of v and w- E.g., v = data, w = integration. Then vw = dataintegration

- ε is the neutral element w.r.t. concatenation, i.e., εw = wε = w for all words w

- If w is a word and k a non-negative integer then the
**power w**is defined as follows:^{k}- w
^{0}= ε - w
^{k}= w^{k-1}w for k > 0

- w

- E.g.;
- data
^{1}= data^{0}data = εdata = data - data
^{2}= data^{1}data = datadata

- data

- L ∪ M is usual
**set union**

- LM = { vw | v ∈ L and w ∈ M } (
**concatenation**)

- L
^{*}= L^{0}∪ L^{1}∪ L^{2}∪ L^{3}∪ … (**Kleene star/closure**)- Where L
^{0}= { ε } and L^{k+1}= L^{k}L (k ≥ 0) - Zero or more concatenations of L with itself

- Where L

- L
^{+}= L^{1}∪ L^{2}∪ L^{3}∪ … (**positive closure**)- One or more concatenations of L with itself

- L
_{1}= { ε, ab, abb }, L_{2}= { b, ba } - L
_{1}L_{2}= ? - L
_{2}L_{1}= ? - L
_{1}^{3}= ? - L
_{2}^{3}= ?

- Regexps and their languages (over alphabet A)
**defined recursively**as follows- ε is a regexp with L(ε) = {ε}
- a is a regexp with L(a) = {a} for every symbol a ∈ A
- Let r and s be regexps with languages L(r) and L(s).
- (r)|(s) is a regexp with
L((r)|(s)) = L(r)
∪ L(s)
- Alternative

- (r)(s) is a regexp with
L((r)(s))
= L(r) L(s)
- Concatenation

- (r)
^{*}is a regexp with L((r)^{*}) = (L(r))^{*}- Kleene star

- (r) is a regexp with L((r)) = L(r)

- (r)|(s) is a regexp with
L((r)|(s)) = L(r)
∪ L(s)

- By definition, regexps contain many parentheses
- Reduction of amount of parentheses via precedence rules for operators
- Kleene star > Concatenation > Alternative
- E.g.: ((a)(a))|(((a)(b))
^{*}(c)) = aa|(ab)^{*}c

- Regexp E
**matches**word w if w ∈ L(E) - E
_{1}= data- L(E
_{1}) = { data } - E
_{1}matches data (but not integration)

- L(E
- E
_{2}= (data)|(integration) = data|integration- L(E
_{2}) = { data, integration } - E
_{2}matches data and integration (but not dataintegration)

- L(E
- E
_{3}= (E_{2})*- L(E
_{3}) = ?

- L(E

- E regular expression, n < m integers
- E+ = EE* (at least one E)
- E? = E|ε (E is optional)
- E{n,m} = E
^{n}|E^{n+1}|…|E^{m}(n to m repetitions of E)

- Alphabet A = { a1, a2, …, an }
- . = (a1|a2|…|an) (dots represents any symbol of A)
- [a
_{i1}a_{i2}…a_{ik}] = (a_{i1}|a_{i2}|…|a_{ik}) (set of symbols) - [^a
_{i1}a_{i2}…a_{ik}] (complemented set; all symbols except { a_{i1}, a_{i2}, …, a_{ik}})

- In practice, symbols of alphabets are ordered (e.g., a < c < z;
1 < 9)
- a’, a’’ ∈ A, a’ < a’’.
Then [a’-a’’] = a’|…|a’’ (range of
symbols)
- [0-9] = 0|1|2|3|4|5|6|7|8|9

- a’, a’’ ∈ A, a’ < a’’.
Then [a’-a’’] = a’|…|a’’ (range of
symbols)

- Various standards
- See https://en.wikipedia.org/wiki/Regular_expression#Standards
- In particular, PCRE (Perl Compatible Regular Expressions)
- Used in many tools and programming languages
- All of the above: Concatenation, alternative (|), closure (*), positive closure (+), optional (?), sets ([…]), repetition ({n,m})
- And more: Matching at beginning/end of word/line (^ and $), character classes (predefined expressions for words, numbers, whitespace, …), look-ahead, look-behind, grouping and backreference (parentheses and numbers) …

- E.g.:
- data matches data integration and MIS and data warehousing
- ^data matches data integration but not MIS and data warehousing
- data$ matches neither
- \d matches single digit, which is [0-9], which is (0|1|…|9)
- [1-9]\d* matches positive integers without leading zeros
- [hc]?at matches hat, cat, and at.

- Lots of free software tools to build regexps and verify their effects
- Some examples
- https://regexr.com/
`M-x regexp-builder`

in GNU Emacs- Next slide in Python
**Live**code execution,**editable**- Based on in-browser Python implementation (skulpt), not complete

- Try out Jupyter notebooks (or Emacs IPython Notebook)
- Web application for documents with live code, visualizations,
documentation
- Support for lots of languages
- Conversations
**with**and**about**data

- Web application for documents with live code, visualizations,
documentation

```
import re # See https://docs.python.org/3/library/re.html
m1 = re.search("^Data", "Data Integration")
print("Match: '{}'".format(m1.group(0))) # Group 0 is entire match
m2 = re.search("^Data", "MIS and Data Warehousing")
if m2 is None:
print("No match.")
```

- See https://www.regular-expressions.info/email.html
`^[A-Z0-9.\_%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$`

- To be used with case-insensitive matching

- Design decisions
- Match most real addresses but not all
- Only ASCII allowed here

- Also allow some invalid strings
- Multiple subsequent dots are allowed

- Don’t care about domain names
- Could explicitly specify alternatives for top-level domains
(TLDs), e.g.,
`.museum`

missing but`.asfg`

allowed

- Could explicitly specify alternatives for top-level domains
(TLDs), e.g.,

- Match most real addresses but not all

- Regexp for telephone numbers?
- One expression matching both examples:
**+49-251-83-38150****+49 251 83 38151**

- One expression matching both examples:
- Different regexp for alternative format?
**(0251) (83) 38158**

- Explain what regexps are and where they are useful
- Give examples for what regexps can and cannot do
- In general
- Matching accuracy

- Read and write regexps for sample use cases
- Enumerate their languages

Except where otherwise noted, the work “Regular Expressions”, © 2019-2020 Jens Lechtenbörger, is published under the Creative Commons license CC BY-SA 4.0.

In particular, trademark rights are *not* licensed under this license.
Thus, rights concerning third party logos (e.g., on the title slide)
and other (trade-) marks (e.g., “Creative Commons” itself) remain with
their respective holders.