Regex: Testing and Quick Course

Published 19/01/2019, 07:53 by Luis Pitta -org-   [ updated 19/01/2019, 08:21 ]

Regex Testing

It is important to test our regular expression before putting it into operation.

For this we can use one of the following sites:

Regex Accelerated Course and Cheat Sheet


Character | Legend | Example | Sample Match
\d | Most engines: one digit from 0 to 9 | |
\d | .NET, Python 3: one Unicode digit in any script | file_\d\d | file_9੩
\w | Most engines: "word character": ASCII letter, digit or underscore | \w-\w\w\w | A-b_1
\w | Python 3: "word character": Unicode letter, ideogram, digit, or underscore | \w-\w\w\w | 字-ま_۳
\w | .NET: "word character": Unicode letter, ideogram, digit, or connector | \w-\w\w\w | 字-ま‿۳
\s | Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab | a\sb\sc | a b c
\s | .NET, Python 3, JavaScript: "whitespace character": any Unicode separator | a\sb\sc | a b c
\D | One character that is not a digit as defined by your engine's \d | \D\D\D | ABC
\W | One character that is not a word character as defined by your engine's \w | \W\W\W\W\W | *-+=)
\S | One character that is not a whitespace character as defined by your engine's \s | \S\S\S\S | Yoyo


Quantifier | Legend | Example | Sample Match
+ | One or more | Version \w-\w+ | Version A-b1_1
{3} | Exactly three times | \D{3} | ABC
{2,4} | Two to four times | \d{2,4} | 156
{3,} | Three or more times | \w{3,} | regex_tutorial
* | Zero or more times | A*B*C* | AAACC
? | Once or none | plurals? | plural

More Characters

Character | Legend | Example | Sample Match
. | Any character except line break | a.c | abc
. | Any character except line break | .* | whatever, man.
\. | A period (special character: needs to be escaped by a \) | a\.c | a.c
\ | Escapes a special character | \.\*\+\?    \$\^\/\\ | .*+?    $^/\
\ | Escapes a special character | \[\{\(\)\}\] | [{()}]


Logic | Legend | Example | Sample Match
| | Alternation / OR operand | 22|33 | 33
( … ) | Capturing group | A(nt|pple) | Apple (captures "pple")
\1 | Contents of Group 1 | r(\w)g\1x | regex
\2 | Contents of Group 2 | (\d\d)\+(\d\d)=\2\+\1 | 12+65=65+12
(?: … ) | Non-capturing group | A(?:nt|pple) | Apple

More White-Space

Character | Legend | Example | Sample Match
\t | Tab | T\t\w{2} | T     ab
\r | Carriage return character | see below |
\n | Line feed character | see below |
\r\n | Line separator on Windows | AB\r\nCD | AB (line break) CD
\N | Perl, PCRE (C, PHP, R…): one character that is not a line break | \N+ | ABC
\h | Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator | |
\H | One character that is not a horizontal whitespace | |
\v | .NET, JavaScript, Python, Ruby: vertical tab | |
\v | Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator | |
\V | Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace | |
\R | Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v) | |

More Quantifiers

Quantifier | Legend | Example | Sample Match
+ | The + (one or more) is "greedy" | \d+ | 12345
? | Makes quantifiers "lazy" | \d+? | 1 in 12345
* | The * (zero or more) is "greedy" | A* | AAA
? | Makes quantifiers "lazy" | A*? | empty in AAA
{2,4} | Two to four times, "greedy" | \w{2,4} | abcd
? | Makes quantifiers "lazy" | \w{2,4}? | ab in abcd

Character Classes

Character | Legend | Example | Sample Match
[ … ] | One of the characters in the brackets | [AEIOU] | One uppercase vowel
[ … ] | One of the characters in the brackets | T[ao]p | Tap or Top
- | Range indicator | [a-z] | One lowercase letter
[x-y] | One of the characters in the range from x to y | [A-Z]+ | GREAT
[ … ] | One of the characters in the brackets | [AB1-5w-z] | One of either: A,B,1,2,3,4,5,w,x,y,z
[x-y] | One of the characters in the range from x to y | [ -~]+ | Characters in the printable section of the ASCII table.
[^x] | One character that is not x | [^a-z]{3} | A1!
[^x-y] | One of the characters not in the range from x to y | [^ -~]+ | Characters that are not in the printable section of the ASCII table.
[\d\D] | One character that is a digit or a non-digit | [\d\D]+ | Any characters, including new lines, which the regular dot doesn't match
[\x41] | Matches the character at hexadecimal position 41 in the ASCII table, i.e. A | [\x41-\x45]{3} | ABE

Anchors and Boundaries

Anchor | Legend | Example | Sample Match
^ | Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not") | ^abc .* | abc (line start)
$ | End of string or end of line depending on multiline mode. Many engine-dependent subtleties. | .*? the end$ | this is the end
\A | Beginning of string (all major engines except JS) | \Aabc[\d\D]* | abc (string...
\z | Very end of the string. Not available in Python and JS | the end\z | this is...\n...the end
\Z | End of string or (except Python) before final line break. Not available in JS | the end\Z | this is...\n...the end\n
\G | Beginning of String or End of Previous Match. .NET, Java, PCRE (C, PHP, R…), Perl, Ruby | |
\b | Word boundary. Most engines: position where one side only is an ASCII letter, digit or underscore | Bob.*\bcat\b | Bob ate the cat
\b | Word boundary. .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore | Bob.*\bкошка\b | Bob ate the кошка
\B | Not a word boundary | c.*\Bcat\B.* | copycats

POSIX Classes

Character | Legend | Example | Sample Match
[:alpha:] | PCRE (C, PHP, R…): ASCII letters A-Z and a-z | [8[:alpha:]]+ | WellDone88
[:alpha:] | Ruby 2: Unicode letter or ideogram | [[:alpha:]\d]+ | кошка99
[:alnum:] | PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z | [[:alnum:]]{10} | ABCDE12345
[:alnum:] | Ruby 2: Unicode digit, letter or ideogram | [[:alnum:]]{10} | кошка90210
[:punct:] | PCRE (C, PHP, R…): ASCII punctuation mark | [[:punct:]]+ | ?!.,:;
[:punct:] | Ruby: Unicode punctuation mark | [[:punct:]]+ | ‽,:〽⁆

Inline Modifiers

None of these are supported in JavaScript. In Ruby, beware of (?s) and (?m)
Modifier | Legend | Example | Sample Match
(?i) | Case-insensitive mode (except JavaScript) | |
(?s) | DOTALL mode (except JS and Ruby). The dot (.) matches new line characters (\r\n). Also known as "single-line mode" because the dot treats the entire input as a single line | (?s)From A.*to Z | From A (line break) to Z
(?m) | Multiline mode (except Ruby and JS). ^ and $ match at the beginning and end of every line | |
(?m) | In Ruby: the same as (?s) in other engines, i.e. DOTALL mode, i.e. dot matches line breaks | (?m)From A.*to Z | From A (line break) to Z
(?x) | Free-spacing mode (except JavaScript). Also known as comment mode or whitespace mode | (multi-line example on the lines below) | abc d
    (?x) # this is a
    # comment
    abc # write on multiple
    # lines
    [ ]d # spaces must be
    # in brackets
(?n) | .NET, PCRE 10.30+: named capture only | Turns all (parentheses) into non-capture groups. To capture, use named groups. |
(?d) | Java: Unix linebreaks only | The dot and the ^ and $ anchors are only affected by \n |
(?^) | PCRE 10.32+: unset modifiers | Unsets ismnx modifiers |


Lookaround | Legend | Example | Sample Match
(?=…) | Positive lookahead | (?=\d{10})\d{5} | 01234 in 0123456789
(?<=…) | Positive lookbehind | (?<=\d)cat | cat in 1cat
(?!…) | Negative lookahead | (?!theatre)the\w+ | theme
(?<!…) | Negative lookbehind | \w{3}(?<!mon)ster | Munster

Character Class Operations

Class Operation | Legend | Example | Sample Match
[…-[…]] | .NET: character class subtraction. One character that is in those on the left, but not in the subtracted class. | [a-z-[aeiou]] | Any lowercase consonant
[…-[…]] | .NET: character class subtraction. | [\p{IsArabic}-[\D]] | An Arabic character that is not a non-digit, i.e., an Arabic digit
[…&&[…]] | Java, Ruby 2+: character class intersection. One character that is both in those on the left and in the && class. | [\S&&[\D]] | A non-whitespace character that is a non-digit.
[…&&[…]] | Java, Ruby 2+: character class intersection. | [\S&&[\D]&&[^a-zA-Z]] | A non-whitespace character that is a non-digit and not a letter.
[…&&[^…]] | Java, Ruby 2+: character class subtraction is obtained by intersecting a class with a negated class | [a-z&&[^aeiou]] | An English lowercase letter that is not a vowel.
[…&&[^…]] | Java, Ruby 2+: character class subtraction | [\p{InArabic}&&[^\p{L}\p{N}]] | An Arabic character that is not a letter or a number

Other Syntax

Syntax | Legend | Example | Sample Match
\K | Keep Out. Perl, PCRE (C, PHP, R…), Python's alternate regex engine, Ruby 2+: drop everything that was matched so far from the overall match to be returned | |
\Q…\E | Perl, PCRE (C, PHP, R…), Java: treat anything between the delimiters as a literal string. Useful to escape metacharacters. | \Q(C++ ?)\E | (C++ ?)

Taken from https://www.rexegg.com in Jan 2019

Regex in programming

Published 19/01/2019, 07:13 by Luis Pitta -org-   [ updated 19/01/2019, 07:45 ]

Regex: Model phrases for text search

Published 19/01/2019, 06:58 by Luis Pitta -org-   [ updated 19/01/2019, 07:01 ]

Regex Examples for Text File Search

What good are text editors if you can't perform complex searches? 
I checked these sample expressions in EditPad Pro, but they would probably work in Notepad++ or a regex-friendly IDE. 

Seven-Letter Word Containing "hay"

Search pattern: (?=\b\w{7}\b)\w*?hay\w*
Translation: Look right ahead for a seven-letter word (the \b boundaries are important). Lazily eat up any word characters followed by "hay", then eat up any word characters. We know that the greedy match has to stop because the word is seven characters long. 
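As a quick check of the pattern above, here is a minimal Python sketch. The sample sentence is made up for illustration and is not from the original article; Python's `re` module is assumed to behave like the editor's engine for this pattern.

```python
import re

# Pattern from the text: the lookahead fixes the word length at seven,
# then the word itself must contain "hay".
SEVEN_WITH_HAY = re.compile(r"(?=\b\w{7}\b)\w*?hay\w*")

# Hypothetical sample text (an assumption, not taken from the article).
sample = "The hayride passed a haystack and some hay near the hayloft."

matches = SEVEN_WITH_HAY.findall(sample)
print(matches)  # ['hayride', 'hayloft']
```

"haystack" (eight letters) and "hay" (three letters) are skipped because the lookahead requires exactly seven word characters between two boundaries.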

Here, in our word, we allow any characters that regex calls "word characters", which, besides letters, also include digits and underscores. If we want a more conservative pattern, we just need to change the lookahead:

Traditional word (only letters): (?i)(?=\b[A-Z]{7}\b)\w*?hay\w*

In this pattern, in the lookahead, you can see that I replaced \w{7} with [A-Z]{7}, which matches seven capital letters. To include lowercase letters, we could have used [A-Za-z]{7}. Instead, we used the case-insensitive modifier (?i). Thanks to this modifier, the pattern can match "HAY" or "hAy" just as easily as "hay". It all depends on what you want: regex puts the power in your hands. 

Line Contains both "bubble" and "gum"

Search pattern: ^(?=.*?\bbubble\b).*?\bgum\b.*
Translation: While anchored at the beginning of the line, look ahead for any characters followed by the word bubble. We could use a second lookahead to look for gum, but it is faster to just match the whole line, taking care to match gum on the way. 
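A small Python sketch of the same idea follows. It assumes the text is searched line by line via `re.MULTILINE` (so that ^ anchors at every line start); the four sample lines are invented for illustration.

```python
import re

# re.MULTILINE makes ^ match at the start of every line, which mimics
# a text editor searching line by line.
BUBBLE_AND_GUM = re.compile(r"^(?=.*?\bbubble\b).*?\bgum\b.*", re.MULTILINE)

# Hypothetical sample lines (an assumption, not from the article).
text = "\n".join([
    "I like bubble gum",
    "gum without the other word",
    "a bubble bath",
    "gum first, then a bubble",
])

both = BUBBLE_AND_GUM.findall(text)
print(both)  # ['I like bubble gum', 'gum first, then a bubble']
```

The last line shows that the order of the two words does not matter: the lookahead finds bubble anywhere on the line before the main pattern goes looking for gum.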

Line Contains "boy" or "buy"

Search pattern: \bb[ou]y\b
Translation: Inside a word (inside two \b boundaries), match the character b, then one character that is either o or u, then y.

Find Repeated Words, such as "the the"

This is a popular example in the regex literature. I don't know about you you, but it doesn't happen all that often often that mistakenly repeated words find their way way into my text. If this example is so popular, it's probably because it's a short pattern that does a great job of showcasing the power of regex. 

You can find a million ways to write your repeated word pattern. In this one, I used POSIX classes (available in Perl and PHP), allowing us to throw in optional punctuation between the words, in addition to optional space. 

Search pattern: \b([[:alpha:]]+)[ [:punct:]]+\1
Translation: After a word delimiter, in group one, capture a positive number of letters, then eat up space characters or punctuation marks, then match the same word we captured earlier in group one. 

If you don't want the punctuation, just use an \s+ in place of [ [:punct:]]+

Remember that \s eats up any white-space characters, including newlines, tabs and vertical tabs, so if this is not what you want use [ ]+ to specify space characters. The brackets are optional, but they make the space character easier to spot, especially in a variable-width font. 
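Python's built-in `re` module does not support POSIX classes such as [[:alpha:]], so the sketch below uses the \s+ variant mentioned above, with [A-Za-z] standing in for [[:alpha:]]. The trailing \b is an addition of this sketch (not part of the original pattern) so that "the theory" is not reported as a repeat; the sample sentence is made up.

```python
import re

# [A-Za-z]+ replaces [[:alpha:]]+; the final \b is an assumption added here
# to avoid matching "the the" inside "the theory".
REPEATED = re.compile(r"\b([A-Za-z]+)\s+\1\b")

text = "I don't know about you you, but it happens often often in the theory."

doubled = [m.group(1) for m in REPEATED.finditer(text)]
print(doubled)  # ['you', 'often']
```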

Line does Not Contain "boy"

Search pattern: ^(?!.*boy).*
Translation: At the beginning of the line, if the negative lookahead can assert that what follows is not "any characters then boy", match anything on the line. 

Line Contains "bubble" but Neither "gum" Nor "bath"

Search pattern: ^(?!.*gum)(?!.*bath).*?bubble.*
Translation: At the beginning of the line, assert that what follows is not "any characters then gum", assert that what follows is not "any characters then bath", then match the whole string, making sure to pick up bubble on the way.
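The two negative-lookahead patterns above can be tried out with a short Python sketch. As before, `re.MULTILINE` is assumed so that ^ applies to each line, and the sample lines are invented for illustration.

```python
import re

NO_BOY = re.compile(r"^(?!.*boy).*", re.MULTILINE)
BUBBLE_ONLY = re.compile(r"^(?!.*gum)(?!.*bath).*?bubble.*", re.MULTILINE)

# Hypothetical sample lines (an assumption, not from the article).
text = "\n".join([
    "the boy blew a bubble",
    "bubble gum",
    "bubble bath",
    "soap bubble",
])

without_boy = NO_BOY.findall(text)
bubble_only = BUBBLE_ONLY.findall(text)
print(without_boy)  # ['bubble gum', 'bubble bath', 'soap bubble']
print(bubble_only)  # ['the boy blew a bubble', 'soap bubble']
```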

Email Address

If I ever have to look for an email address in my text editor, frankly, I just search for @. That shows me both well-formed addresses, as well as addresses whose authors let their creativity run loose, for instance by typing DOT in place of the period. 

When it comes to validating user input, you want an expression that checks for well-formed addresses. There are thousands of email address regexes out there. In the end, none can really tell you whether an address is valid until you send a message and the recipient replies. 

The regex below is borrowed from chapter 4 of Jan Goyvaerts' excellent book, Regular Expressions Cookbook. I'm in tune with Jan's reasoning that what you really want is an expression that works with 999 addresses out of a thousand, an expression that doesn't require a lot of maintenance, for instance by forcing you to add new top-level domains ("dot something") every time the powers in charge of those things decide it's time to launch names ending in something like dot-phone or dot-dog. 

Search pattern: (?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

Let's unroll this one:

(?i)               # Turn on case-insensitive mode

\b                 # Position engine at a word boundary

[A-Z0-9._%+-]+     # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.

@                  # Match @

(?:[A-Z0-9-]+\.)+  # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com

[A-Z]{2,6}         # Match two to six letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced.

\b                 # Match a word boundary
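The unrolled form above maps naturally onto Python's verbose mode. The following is a minimal sketch, assuming Python's `re` module; the flags `re.IGNORECASE` and `re.VERBOSE` stand in for the inline (?i) and for free-spacing mode, and the sample sentence is invented.

```python
import re

# Same expression as in the text, written with comments in verbose mode.
EMAIL = re.compile(r"""
    \b                  # position engine at a word boundary
    [A-Z0-9._%+-]+      # local part
    @                   # literal @
    (?:[A-Z0-9-]+\.)+   # domains and sub-domains, each followed by a dot
    [A-Z]{2,6}          # top-level domain
    \b                  # word boundary
""", re.IGNORECASE | re.VERBOSE)

# Hypothetical sample text (an assumption, not from the article).
text = "Write to jane.doe@post.example.com or bob AT example DOT org."

found = EMAIL.findall(text)
print(found)  # ['jane.doe@post.example.com']
```

As the article notes, the creatively written second address is not found; a plain search for @ would not find it either.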

Regex: The Syntax of Regular Expressions

Published 19/01/2019, 06:31 by Luis Pitta -org-   [ updated 19/01/2019, 07:56 ]

What is Regex?

The term Regex is short for "Regular Expression".

In computer science, a regular expression (or regex or regexp) provides a concise and flexible way to identify strings of interest, such as particular characters, words or patterns of characters. 

Regular expressions are written in a formal language that can be interpreted by a regular expression processor.

The term derives from the work of the American mathematician Stephen Cole Kleene, who developed regular expressions as a notation for what he called the algebra of regular sets. His work served as the basis for the first computational search algorithms, and later for some of the oldest text-processing tools on the Unix platform.

Current uses of regular expressions include search and replace in text editors and programming languages, validation of text formats (validation of protocols or digital formats), syntax highlighting and information filtering. Online applications such as Google Forms accept regular expressions for filtering information. 

Gmail does not accept regex directly, but with the help of a GAS (Google Apps Script) we can run advanced searches with regex. 

See sites.aebenfica.org/apontamentos-tic/email/umgaspararegexnogmail

(Adapted from Wikipedia, May 2018)

Regular Expression (Regex) Syntax

A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.

A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). It is constructed by combining many smaller sub-expressions.

1  Matching a Single Character

The fundamental building blocks of a regex are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, the regex x matches the substring "x"; z matches "z"; and 9 matches "9".

Non-alphanumeric characters without special meaning in regex also match themselves. For example, = matches "="; @ matches "@".

2  Regex Special Characters and Escape Sequences

Regex's Special Characters

These characters have special meaning in regex (I will discuss in detail in the later sections):

  • metacharacter: dot (.)
  • bracket list: [ ]
  • position anchors: ^, $
  • occurrence indicators: +, *, ?, { }
  • parentheses: ( )
  • or: |
  • escape and metacharacter: backslash (\)

Escape Sequences

The characters listed above have special meanings in regex. To match one of these characters literally, prepend it with a backslash (\), known as an escape sequence. For example, \+ matches "+"; \[ matches "["; and \. matches ".".

Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \uhhhhhhhh for an 8-digit Unicode code point (support and exact spelling of the Unicode escapes vary by engine).

3  Matching a Sequence of Characters (String or Text)


A regex is constructed by combining many smaller sub-expressions or atoms. For example, the regex Friday matches the string "Friday". The matching, by default, is case-sensitive, but can be set to case-insensitive via modifier.

4  OR (|) Operator

You can provide alternatives using the "OR" operator, denoted by a vertical bar '|'. For example, the regex four|for|floor|4 accepts strings "four", "for", "floor" or "4".
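A minimal Python sketch of the alternation above; `fullmatch` is used so that the whole string must be accepted, and "five" is an extra test string added here for contrast.

```python
import re

FOUR = re.compile(r"four|for|floor|4")

candidates = ["four", "for", "floor", "4", "five"]
accepted = [s for s in candidates if FOUR.fullmatch(s)]
print(accepted)  # ['four', 'for', 'floor', '4']
```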

5  Bracket List (Character Class) [...], [^...], [.-.]

A bracket expression is a list of characters enclosed by [ ], also called a character class. It matches ANY ONE character in the list. However, if the first character of the list is the caret (^), then it matches ANY ONE character NOT in the list. For example, the regex [02468] matches a single digit 0, 2, 4, 6, or 8; the regex [^02468] matches any single character other than 0, 2, 4, 6, or 8.

Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].
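The bracket lists and ranges described so far can be checked with a short Python sketch; the input strings are arbitrary examples chosen here.

```python
import re

evens = re.findall(r"[02468]", "0123456789")
not_evens = re.findall(r"[^02468]", "0123456789")
a_to_d = re.findall(r"[a-d]", "abcdefg")      # same as [abcd]
not_a_to_d = re.findall(r"[^a-d]", "abcdefg")  # same as [^abcd]

print(evens)       # ['0', '2', '4', '6', '8']
print(not_evens)   # ['1', '3', '5', '7', '9']
print(a_to_d)      # ['a', 'b', 'c', 'd']
print(not_a_to_d)  # ['e', 'f', 'g']
```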

Most of the special regex characters lose their meaning inside a bracket list and can be used as they are, except ^, -, ] and \.

  • To include a ], place it first in the list, or use escape \].
  • To include a ^, place it anywhere but first, or use escape \^.
  • To include a -, place it last, or use escape \-.
  • To include a \, use escape \\.
  • No escape is needed for the other characters such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
  • You can also include metacharacters (to be explained in the next section), such as \w, \W, \d, \D, \s, \S, inside the bracket list.

Named Character Classes in Bracket List (For Perl Only?)

Named (POSIX) classes of characters are pre-defined within bracket expressions. They are:

  • [:alnum:], [:alpha:], [:digit:]: letters+digits, letters, digits.
  • [:xdigit:]: hexadecimal digits.
  • [:lower:], [:upper:]: lowercase/uppercase letters.
  • [:cntrl:]: control characters.
  • [:graph:]: printable characters, except space.
  • [:print:]: printable characters, including space.
  • [:punct:]: printable characters, excluding letters and digits.
  • [:space:]: whitespace

For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)

6  Metacharacters ., \w, \W, \d, \D, \s, \S

A metacharacter is a symbol with a special meaning inside a regex.

  • The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, ... matches any 3 characters (including letters, digits and whitespace, but not newline); the.. matches "there", "these", "the  ", and so on.
  • \w (word character) matches any single letter, digit or underscore (same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_]).
  • In regex, the uppercase metacharacter is always the inverse of its lowercase counterpart.
  • \d (digit) matches any single digit (same as [0-9]). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
  • \s (space) matches any single whitespace character (same as [ \t\n\r\f]: blank, tab, newline, carriage return and form feed). The uppercase counterpart \S (non-space) matches any single character that is not matched by \s (same as [^ \t\n\r\f]).


\s\s      # Matches two spaces
\S\S\s    # Two non-spaces followed by a space
\s+       # One or more spaces
\S+\s\S+  # Two words (non-spaces) separated by a space
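The metacharacters above can be exercised with a short Python sketch. Note one assumption: Python 3 treats \w and \d as Unicode-aware by default, so `re.ASCII` is passed to get the ASCII definitions given in this section. The sample string is invented.

```python
import re

text = "the cat_9 sat\ton these"   # \t is a real tab character

words = re.findall(r"\w+", text, re.ASCII)
non_words = re.findall(r"\W", text, re.ASCII)
digits = re.findall(r"\d", text, re.ASCII)
the_plus_two = re.findall(r"the..", text)   # "the" plus any two characters
two_words = re.findall(r"\S+\s\S+", text)   # two words, one whitespace between

print(words)         # ['the', 'cat_9', 'sat', 'on', 'these']
print(non_words)     # [' ', ' ', '\t', ' ']
print(digits)        # ['9']
print(the_plus_two)  # ['the c', 'these']
print(two_words)     # ['the cat_9', 'sat\ton']
```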

7  Backslash (\) and Regex Escape Sequences

Regex uses backslash (\) for three purposes:

  1. for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).
  2. to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.
  3. for common escape sequences such as \n for newline and \t for tab.

Take note that in many programming languages (C, Java, Python), backslash (\) is also used for escape sequences in string, e.g., "\n" for newline, "\t" for tab, and you also need to write "\\" for \. Consequently, to write regex pattern \\ (which matches one \) in these languages, you need to write "\\\\" (two levels of escape!!!). Similarly, you need to write "\\d" for regex metacharacter \d. This is cumbersome and error-prone!!!
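In Python, raw string literals (the r"..." prefix) are the usual way around the double escaping described above. The sketch below compares the two spellings of one regex; the sample value is made up.

```python
import re

# Ordinary string literal: each regex backslash must be doubled, so the
# regex \d+\\\d+ (digits, one literal backslash, digits) needs six in a row.
escaped = re.compile("\\d+\\\\\\d+")

# Raw string literal: the text between the quotes is the regex itself.
raw = re.compile(r"\d+\\\d+")

path = "12\\34"  # the five characters 1 2 \ 3 4

print(escaped.pattern == raw.pattern)  # True
print(bool(raw.fullmatch(path)))       # True
```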

8  Occurrence Indicators (Repetition Operators): +, *, ?, {m}, {m,n}, {m,}

A regex sub-expression may be followed by an occurrence indicator (aka repetition operator):

  • ?: The preceding item is optional and matched at most once (i.e., it occurs 0 or 1 times).
  • *: The preceding item will be matched zero or more times, i.e., 0+
  • +: The preceding item will be matched one or more times, i.e., 1+
  • {m}: The preceding item is matched exactly m times.
  • {m,}: The preceding item is matched m or more times, i.e., m+
  • {m,n}: The preceding item is matched at least m times, but not more than n times.

For example: The regex xy{2,4} accepts "xyy", "xyyy" and "xyyyy".
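A minimal Python sketch of the xy{2,4} example; the shorter and longer candidates ("xy" and "xyyyyy") are added here to show what falls outside the range.

```python
import re

XY = re.compile(r"xy{2,4}")

candidates = ["xy", "xyy", "xyyy", "xyyyy", "xyyyyy"]
accepted = [s for s in candidates if XY.fullmatch(s)]
print(accepted)  # ['xyy', 'xyyy', 'xyyyy']
```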

9  Modifiers

You can apply modifiers to a regex to tailor its behavior, such as global, case-insensitive, multiline, etc. The ways to apply modifiers differ among languages.

In Perl, you can attach modifiers after a regex, in the form of /.../modifiers. For examples:

m/abc/i     # case-insensitive matching
m/abc/g     # global (Match ALL instead of match first)

In Java, you apply modifiers when compiling the regex Pattern. For example,

Pattern p1 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching
Pattern p2 = Pattern.compile(regex, Pattern.MULTILINE);         // for multiline input string
Pattern p3 = Pattern.compile(regex, Pattern.DOTALL);            // Dot (.) matches all characters including newline

The commonly-used modifier modes are:

  • Case-Insensitive mode (or i): case-insensitive matching for letters.
  • Global (or g): match All instead of first match.
  • Multiline mode (or m): affects ^ and $. In multiline mode, ^ matches start-of-line or start-of-input; $ matches end-of-line or end-of-input. \A and \Z are not affected: \A still matches only start-of-input and \Z only end-of-input.
  • Single-line mode (or s): Dot (.) will match all characters, including newline.
  • Comment mode (or x): allow and ignore embedded comment starting with # till end-of-line (EOL).
  • more...
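For comparison with the Perl and Java forms above, here is a sketch of how the same modes are usually selected in Python, where they are passed as flags of the `re` module. Python has no "global" flag; `findall` and `finditer` play that role. The sample text is invented.

```python
import re

text = "From A\nto Z"

# i: case-insensitive matching
ci = re.search(r"from a", text, re.IGNORECASE)

# g: no flag in Python; findall returns every match instead of the first
all_caps = re.findall(r"[A-Z]", text)

# m: ^ and $ also match at the start and end of each line
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# s: dot also matches newline
no_dotall = re.search(r"A.*to", text)
dotall = re.search(r"A.*to", text, re.DOTALL)

# x: whitespace and # comments in the pattern are ignored
verbose = re.search(r"""
    From   # literal word
    [ ]A   # a space must be written in brackets
""", text, re.VERBOSE)

print(ci.group())       # From A
print(all_caps)         # ['F', 'A', 'Z']
print(line_starts)      # ['From', 'to']
print(no_dotall)        # None
print(repr(dotall.group()))  # 'A\nto'
print(verbose.group())  # From A
```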

10  Greediness, Laziness and Backtracking for Repetition Operators

Greediness of Repetition Operators *, +, ?, {m,n}: The repetition operators are greedy, and by default grab as many characters as possible for a match. For example, the regex xy{2,4} tries to match "xyyyy" first, then "xyyy", and then "xyy".

Lazy Quantifiers *?, +?, ??, {m,n}?, {m,}?: You can put an extra ? after the repetition operators to curb their greediness (i.e., stop at the shortest match). For example,

input = "The <code>first</code> and <code>second</code> instances"
regex = <code>.*</code> matches "<code>first</code> and <code>second</code>"
regex = <code>.*?</code> produces two matches: "<code>first</code>" and "<code>second</code>"
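The greedy and lazy results above can be reproduced with a short Python sketch using the same input string:

```python
import re

text = "The <code>first</code> and <code>second</code> instances"

greedy = re.findall(r"<code>.*</code>", text)
lazy = re.findall(r"<code>.*?</code>", text)

print(greedy)  # ['<code>first</code> and <code>second</code>']
print(lazy)    # ['<code>first</code>', '<code>second</code>']
```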

Backtracking: If a regex reaches a state where a match cannot be completed, it backtracks by unwinding one character from the greedy match. For example, if the regex z*zzz is matched against the string "zzzz", the z* first matches "zzzz"; unwinds to match "zzz"; unwinds to match "zz"; and finally unwinds to match "z", such that the rest of the patterns can find a match.

Possessive Quantifiers *+, ++, ?+, {m,n}+, {m,}+: You can put an extra + after the repetition operators to disable backtracking, even if this results in match failure. E.g., z++z will not match "zzzz". This feature might not be supported in some languages.

11  Position Anchors ^, $, \b, \B, \<, \>, \A, \Z

Positional anchors DO NOT match actual characters, but match positions in a string, such as start-of-line, end-of-line, start-of-word, and end-of-word.

  • ^ and $: The ^ matches the start-of-line. The $ matches the end-of-line excluding newline, or end-of-input (for input not ending with newline). These are the most commonly-used position anchors. For examples,
    ing$           # ending with 'ing'
    ^testing 123$  # Matches only one pattern. Should use equality comparison instead.
    ^[0-9]+$       # Numeric string
  • \b and \B: The \b matches the boundary of a word (i.e., start-of-word or end-of-word); and \B matches inverse of \b, or non-word-boundary. For examples,
    \bcat\b        # matches the word "cat" in input string "This is a cat."
                   # but does not match input "This is a catalog."
  • \< and \>: The \< and \> match the start-of-word and end-of-word, respectively (compared with \b, which can match both the start and end of a word).
  • \A and \Z: The \A matches the start of the input. The \Z matches the end of the input. 
    They are different from ^ and $ when it comes to matching input with multiple lines. ^ matches at the start of the string and after each line break, while \A only matches at the start of the string. $ matches at the end of the string and before each line break, while \Z only matches at the end of the string. For examples,
    $ python3
    # Using ^ and $ in multiline mode
    >>> p1 = re.compile(r'^.+$', re.MULTILINE)  # . for any character except newline
    >>> p1.findall('testing\ntesting')
    ['testing', 'testing']
    >>> p1.findall('testing\ntesting\n')
    ['testing', 'testing']
       # ^ matches start-of-input or after each line break at start-of-line
       # $ matches end-of-input or before line break at end-of-line
       # newlines are NOT included in the matches
    # Using \A and \Z in multiline mode
    >>> p2 = re.compile(r'\A.+\Z', re.MULTILINE)
    >>> p2.findall('testing\ntesting')
    []    # This pattern does not match the internal \n
    >>> p3 = re.compile(r'\A.+\n.+\Z', re.MULTILINE)  # to match the internal \n
    >>> p3.findall('testing\ntesting')
    ['testing\ntesting']
    >>> p3.findall('testing\ntesting\n')
    []    # This pattern does not match the trailing \n
       # \A matches start-of-input and \Z matches end-of-input

12  Capturing Matches via Parenthesized Back-References & Matched Variables $1, $2, ...

Parentheses ( ) serve two purposes in regex:

  1. Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, ...) is different from abc+ (accepts abc, abcc, abccc, ...).
  2. Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched substring. For example, the regex (\S+) creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the input string; the regex (\S+)\s+(\S+) creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in special variables $1, $2, … (or \1, \2, ... in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back-references which match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.

Back-references are important for manipulating strings. For example, the following Perl expression swaps the first and second words separated by a space:

s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space
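A rough Python counterpart of the Perl substitution, as a sketch: `re.sub` with `count=1` replaces only the first occurrence, like Perl's s/// without the g modifier, and the replacement refers to the groups as \1 and \2. The input string is an arbitrary example.

```python
import re

# Swap the first and second words separated by a single space.
swapped = re.sub(r"(\S+) (\S+)", r"\2 \1", "hello world", count=1)
print(swapped)  # world hello
```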

13  (Advanced) Lookahead/Lookbehind, Groupings and Conditional

These features might not be supported in some languages.

Positive Lookahead (?=pattern)

The (?=pattern) is known as positive lookahead. It performs the match, but does not capture the match, returning only the result: match or no match. It is also called assertion as it does not consume any characters in matching. For example, the following complex regex is used to match email addresses by AngularJS:


The first positive lookahead pattern ^(?=.{1,254}$) sets the maximum length to 254 characters. The second positive lookahead ^(?=.{1,64}@) sets a maximum of 64 characters before the '@' sign for the username.
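To illustrate how the two length-limiting lookaheads work, here is a minimal Python sketch. The body of the pattern ([^@\s]+@[^@\s]+) is a deliberately simplified stand-in chosen for this example; it is NOT the AngularJS expression, and the test addresses are made up.

```python
import re

# Two lookaheads from the text, followed by a simplified, hypothetical
# "something@something" body (an assumption, not the AngularJS regex).
LIMITED = re.compile(r"^(?=.{1,254}$)(?=.{1,64}@)[^@\s]+@[^@\s]+$")

ok = "user@example.com"
long_local = "a" * 65 + "@example.com"   # 65 characters before the @
long_total = "a@" + "b" * 253            # 255 characters in total

print(bool(LIMITED.match(ok)))          # True
print(bool(LIMITED.match(long_local)))  # False
print(bool(LIMITED.match(long_total)))  # False
```

Because lookaheads consume nothing, both length checks run from the start of the string before the main pattern begins matching.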

Negative Lookahead (?!pattern)

Inverse of (?=pattern). Match if pattern is missing. For example, a(?=b) matches 'a' in 'abc' (not consuming 'b'); but not 'acc'. Whereas a(?!b) matches 'a' in 'acc', but not abc.
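The a(?=b) and a(?!b) examples can be checked directly in Python; the match object also shows that the lookahead consumes nothing, since the match ends right after the 'a'.

```python
import re

pos_hit = re.search(r"a(?=b)", "abc")
pos_miss = re.search(r"a(?=b)", "acc")
neg_hit = re.search(r"a(?!b)", "acc")
neg_miss = re.search(r"a(?!b)", "abc")

print(pos_hit.group(), pos_hit.end())  # a 1
print(pos_miss)                        # None
print(neg_hit.group())                 # a
print(neg_miss)                        # None
```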

Positive Lookbehind (?<=pattern)


Negative Lookbehind (?<!pattern)


Non-Capturing Group (?:pattern)

Recall that you can use Parenthesized Back-References to capture the matches. To disable capturing, use ?: inside the parentheses in the form of (?:pattern). In other words, ?: disables the creation of a capturing group, so as not to create an unnecessary capturing group.

Example: [TODO]
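One possible illustration, as a Python sketch (the pattern A(nt|pple) is borrowed from the cheat sheet earlier on this page; the input string is made up). With a capturing group, `findall` returns the group contents; with a non-capturing group it returns the whole match and no group is created.

```python
import re

capturing = re.findall(r"A(nt|pple)", "Ant Apple")
non_capturing = re.findall(r"A(?:nt|pple)", "Ant Apple")

print(capturing)      # ['nt', 'pple']
print(non_capturing)  # ['Ant', 'Apple']

# Number of capturing groups in each compiled pattern
print(re.compile(r"A(nt|pple)").groups)    # 1
print(re.compile(r"A(?:nt|pple)").groups)  # 0
```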

Named Capturing Group (?<name>pattern)

The capture group can be referenced later by name.
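A short Python sketch of named groups follows. Note that Python's `re` module spells the syntax (?P<name>pattern) rather than (?<name>pattern); the group names first and second and the input string are chosen here for illustration.

```python
import re

# Python spelling of a named group: (?P<name>...)
TWO_WORDS = re.compile(r"(?P<first>\S+)\s+(?P<second>\S+)")

m = TWO_WORDS.search("hello world")
print(m.group("first"))   # hello
print(m.group("second"))  # world

# Named groups can also be referenced in a replacement string.
swapped = TWO_WORDS.sub(r"\g<second> \g<first>", "hello world")
print(swapped)  # world hello
```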

Atomic Grouping (?>pattern)

Disable backtracking, even if this may lead to match failure.

Conditional (?(Cond)then|else)


14  Unicode

The metacharacters \w, \W (word and non-word character) and \b, \B (word and non-word boundary) recognize Unicode characters.

Taken from www.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

Regex: A GAS for RegEx Search in Gmail

Published 19/01/2019, 06:29 by Luis Pitta -org-   [ updated 19/01/2019, 07:06 ]

The script scans the mailbox, compares the message body with the search pattern and prints any matching messages. Google Apps Script uses standard JavaScript functions to perform the regex search.

function Search() {

  var sheet   = SpreadsheetApp.getActiveSheet();
  var row     = 2;

  // Clear existing search results
  sheet.getRange(2, 1, sheet.getMaxRows() - 1, 4).clearContent();

  // Which Gmail Label should be searched?
  var label   = sheet.getRange("F3").getValue();

  // Get the Regular Expression Search Pattern
  var pattern = sheet.getRange("F4").getValue();

  // Retrieve all threads of the specified label
  var threads = GmailApp.search("in:" + label);

  for (var i = 0; i < threads.length; i++) {
    var messages = threads[i].getMessages();
    for (var m = 0; m < messages.length; m++) {
      var msg = messages[m].getBody();

      // Does the message content match the search pattern?
      if (msg.search(pattern) !== -1) {

        // Assumed layout: column 1 = date, 2 = sender, 3 = subject, 4 = link

        // Format and print the date of the matching message
        sheet.getRange(row, 1).setValue(
          Utilities.formatDate(messages[m].getDate(), "GMT", "yyyy-MM-dd"));

        // Print the sender's name and email address
        sheet.getRange(row, 2).setValue(messages[m].getFrom());

        // Print the message subject
        sheet.getRange(row, 3).setValue(messages[m].getSubject());

        // Print the unique URL of the Gmail message
        var id = "https://mail.google.com/mail/u/0/#all/"
          + messages[m].getId();
        sheet.getRange(row, 4).setFormula(
          '=hyperlink("' + id + '", "View")');

        // Move to the next row
        row++;
      }
    }
  }
}

Download the spreadsheet at labnol.org/?p=21623

Taken from https://ctrlq.org/ in Jan 2018

Gmail: The power of search filters in Gmail

Published on 19/01/2019, 06:27 by Luis Pitta -org-   [ updated on 19/01/2019, 08:07 ]

Let's explore the most useful Gmail search filters and their usage:

Q: I have hundreds of unread emails in my Gmail Inbox but not all of them show up on the main page. Can I bring up a view in Gmail containing only unread messages?
A: label:inbox is:unread

Q: My boss sent me a PDF document last month that I can no longer locate in the Inbox. Can you help me find it?
A: from:Name_Of_Boss filename:pdf after:2007/07/01 [yyyy/mm/dd]

Q: I received an email from Paypal support last week. I am not sure if I deleted the message, archived it or marked it as Spam. 
A: from:Paypal in:anywhere

Q: I have dozens of unread email messages in the Inbox but I am in too much of a hurry to check them all. Show me just the messages that are addressed to me.
A: is:unread after:2007/09/03 to:your.email@address.com

Q: Ryan is a good friend who sends me pretty interesting PowerPoint / Word files that often have inspirational quotes, beautiful natural landscapes and funny slide shows of cats. I love it all but they take up too much space.
A: from:Ryan has:attachment [Select all and then delete]

Q: I accidentally deleted an important email message from a colleague. My Gmail trash is already overflowing. How do I retrieve that particular message?
A: label:trash Name_Of_Your_Colleague

Q: I use Gmail to automatically back up my WordPress blog. The database backups are stored as Zip email attachments. Since I am running short on Gmail storage, I would like to delete all backups that are older than two weeks.
A: filename:.zip before:2007/08/15 wordpress

Q: While chatting over Google Talk, Veronica sent me a link to her Flickr pictures. 
A: in:chat from:veronica flickr.com

Q: Show me all emails from my Boss that he marked Urgent or Important in the subject.
A: from:Boss_Name subject:(Urgent OR Important)

Q: I have two contacts in Gmail with similar names - Peter King and Peter King Junior. Can I see only the emails received from the former contact?
A: from:(Peter King) -Junior
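
The answers above all follow the same shape: operator:value pairs separated by spaces, optionally followed by free text. As a minimal sketch, a small JavaScript helper can assemble such strings. The function name buildGmailQuery and its parameters are hypothetical. In Apps Script, the resulting string could be passed to GmailApp.search, as in the Search() script earlier on this page.

```javascript
// Hypothetical helper: joins operator/value pairs into a Gmail search string.
function buildGmailQuery(criteria, freeText = "") {
  const parts = Object.entries(criteria).map(([operator, value]) => {
    // Values containing spaces are wrapped in parentheses so that the
    // operator applies to all the words, not just the first one.
    const needsGrouping = /\s/.test(String(value));
    return `${operator}:${needsGrouping ? `(${value})` : value}`;
  });
  if (freeText) parts.push(freeText);
  return parts.join(" ");
}

console.log(buildGmailQuery({ label: "inbox", is: "unread" }));
// label:inbox is:unread
console.log(buildGmailQuery({ from: "Boss_Name", subject: "Urgent OR Important" }));
// from:Boss_Name subject:(Urgent OR Important)
console.log(buildGmailQuery({ filename: ".zip", before: "2007/08/15" }, "wordpress"));
// filename:.zip before:2007/08/15 wordpress
```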

Search Shortcuts: Instead of typing label:unread, you can simply type l:^u

Important: Unlike Google web search, Gmail won't suggest search results that contain plurals or misspellings of your search query. For instance, a search for "computer" will only show emails with that exact word - you won't see messages containing the word "computers".

Using a Main Account

Published on 09/05/2018, 04:26 by Luis Pitta -org-   [ updated on 10/05/2018, 14:36 ]

How can I use just one main email account?

After completing the procedure below, my main account will work as follows:

The main account will receive ALL the emails from the various secondary accounts.
On arrival it will immediately file them under the labels prepared for that purpose.
Whenever I start writing a new email I can decide who I am by changing my sender address.


1. Choose one of your email accounts to be the main account.

The account chosen as the main account should preferably be UNLIMITED, as is the case with all G Suite for Education accounts.
To help you choose your main account, bear the following in mind:
    • Personal Gmail accounts have a free storage limit of 15 GB. (Google)
    • G Suite for Education accounts (school Gmail) have UNLIMITED storage. (Google)
    • School Exchange accounts (Outlook) have a storage limit of 50 GB. (Microsoft)

2. Sign in to the secondary accounts (one by one) and set up forwarding to the main account.

How to set up forwarding: 

3. Create automatic filters in the main account to file the messages arriving from the secondary accounts under specific labels created for that purpose. For example:

Label "Personal Gmail": receives all messages arriving from my secondary Gmail account
Label "AEBenfica.pt email": receives all messages arriving from my secondary AEBenfica.pt account

4. Create new senders in the main account:

Pay attention to step 7, where we have to enter the following technical details:

If the secondary account is a personal Gmail account:

SMTP server: smtp.gmail.com Port: 587
Username: your_username (no need to add @gmail.com)
Password: <the password for this mailbox>

Choose "Secured connection using TLS (recommended)".

If the secondary account is a school Gmail account (G Suite):

SMTP server: smtp.gmail.com Port: 587
Username: your_username@aebenfica.org (or any other school domain)
Password: <the password for this mailbox>

Choose "Secured connection using TLS (recommended)".

If the secondary account is a school Office 365 account:

SMTP server: smtp.office365.com Port: 587
Username: firstname.lastname@aebenfica.pt (or any other school domain)
Password: <the password for this mailbox>

Choose "Secured connection using TLS (recommended)".
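
The three sets of details above differ only in the server and in whether the username is the full address. As a rough sketch, they can be kept in a small JavaScript lookup table. The key names, the helper smtpUsername and the sample addresses are hypothetical. Only the hosts, the port and the TLS choice come from the text.

```javascript
// Hosts, port and TLS come from the settings above; the key names are hypothetical.
const SMTP_SETTINGS = {
  personalGmail:   { host: "smtp.gmail.com",     port: 587, security: "TLS", usernameIsFullAddress: false },
  schoolGmail:     { host: "smtp.gmail.com",     port: 587, security: "TLS", usernameIsFullAddress: true },
  schoolOffice365: { host: "smtp.office365.com", port: 587, security: "TLS", usernameIsFullAddress: true },
};

// Hypothetical helper: the username to type for a given address and account type.
function smtpUsername(address, accountType) {
  const settings = SMTP_SETTINGS[accountType];
  if (!settings) throw new Error("Unknown account type: " + accountType);
  // Personal Gmail takes only the part before the @ sign.
  return settings.usernameIsFullAddress ? address : address.split("@")[0];
}

// The addresses below are made up for the example.
console.log(smtpUsername("maria.silva@gmail.com", "personalGmail"));
// maria.silva
console.log(smtpUsername("maria.silva@aebenfica.pt", "schoolOffice365"));
// maria.silva@aebenfica.pt
```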

5. And that's it!

From now on, whenever you start composing a new email message you can change the From: field, so that it is
as if you were sending the message from one of the secondary accounts.

More help at:

AEBenfica.pt: Forwarding the school email

Labels, Filters and Forwarding

Class email: Distribution lists (or Groups)

Published on 13/09/2016, 09:06 by Luis Pitta -org-   [ updated on 30/01/2018, 12:43 ]

Every year it is the same story:

Teachers need a class email address to send materials to their students.

And what do they do? 

They create (or ask the class representative to create) a "regular" email account on Gmail or Hotmail (or some other email service) and then share the password with the whole class!

Wrong. Wrong. Wrong. Nothing could be more wrong!!!

Why is it so wrong?

For every reason imaginable, including the following:

A personal email account is being used as if it were a "distribution list", which immediately brings several problems.

1. Passwords are, by definition, individual and private. Passwords are NOT to be shared! Ever!

2. Individual email management is impossible: if the teacher or the students delete a message, nobody knows which one it was, who did it, or where!

3. If a student decides to send nonsense from that email account, nobody will EVER know who it was.

4. If one of the students changes the email password, nobody will know who did it and, worse still, nobody will be able to get BACK into that mailbox! It is gone for good!

5. When email messages are deleted, there is no other place where a backup copy might exist!

So what is the "right" solution?

The answer is right there in the title of this post: we should create a "Distribution list" or "Group". 

Key features of Groups:
  • A Group (or distribution list) has its own email address and a list of members (our students). 
  • A copy of every message sent is always stored in the service, where it can be consulted at any time (and cannot be deleted).
  • Each student can interact (by message) with the rest of the class (this is optional and configurable), which creates a kind of forum.
  • Attachment handling is automatic, allowing attachments much larger than those we can send from a conventional mailbox.
  • Even if a student deletes a message from their own mailbox, there will ALWAYS be a copy in the Group.

The main services for creating and managing Groups are the following:

Google: www.GoogleGroups.com (use your free personal Gmail login)

Yahoo: groups.yahoo.com (you need to create a free Yahoo ID)

Microsoft: has a more recent service that I have not tested yet. It is free.

Alternatively, we can also create Groups (or distribution lists) from the school domain:
    • Google at AEBenfica.pt: www.GoogleGroups.com (use your school login firstname.lastname@aebenfica.pt)
    • Microsoft at AEBenfica.pt: within the web interface of the school email (use your school login firstname.lastname@aebenfica.pt)

Let's look at the procedure for creating a Group in Google Groups (Google):

Gmail: Finding messages with large attachments

Published on 27/12/2013, 06:50 by Luis Pitta

I would like to list (and delete) the messages with attachments larger than 5 MB. 
Proceed as follows:

1. Open Gmail
2. In the search box type: size:5m or larger:5m
3. Click the magnifying glass

Done!

Note: If you add the text older_than:1y, the search returns all messages larger than 5 MB that are more than 1 year old.

More information on the official Gmail blog:

Labels, Filters and Forwarding

Published on 20/11/2013, 05:34 by Luis Pitta   [ updated on 10/01/2019, 08:43 by Luis Pitta -org- ]

To get the most out of Gmail, it is essential to use Labels and Filters.

Let's see what they are and how to create them.

1. Labels

A label is a kind of folder that lets you file email messages by theme and subject.

Procedure for creating a label:

1. Go to the Labels area of your mailbox:
2. Click "Create new label"
3. Choose a name you like.
4. Click Save.

2. Filters

These are automatic rules that file incoming messages under the labels defined in the rule.
(If you have never used filters, you have no idea what you are missing!)

Procedure for creating a filter:

1. Go to the Filters area of your mailbox: mail.google.com/mail/u/0/#settings/filters
2. At the bottom, click "Create a new filter"
3. Fill in the search fields you want and click "Create filter with this search"
4. Now decide what action to take on the search results:

It is common to tick three boxes:
  • "Skip the Inbox" 
  • "Apply the label" (select an existing label or create one)
  • "Also apply filter to matching conversations."

5. Finish by clicking "Create filter"

3. Forwarding

Forwarding redirects the messages from one mailbox to another.
Both accounts must belong to the user.


I want to forward the email messages received at the address provided by the school (at @aebenfica.org) to my personal Hotmail mailbox (for example).

Go to the Forwarding area of your mailbox:

Google Help: support.google.com/mail/answer/10957?hl=pt-PT&ctx=mail
