
Regex: Sample Patterns for Text Search

Published 19/01/2019, 06:58 by Luis Pitta -org-   [ updated 19/01/2019, 07:01 ]

Regex Examples for Text File Search



What good are text editors if you can't perform complex searches? 
I checked these sample expressions in EditPad Pro, but they would probably work in Notepad++ or a regex-friendly IDE. 

Seven-Letter Word Containing "hay"

Search pattern: (?=\b\w{7}\b)\w*?hay\w*
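Translation: From the current position, look ahead for a seven-character word (the \b boundaries are important, because they keep the lookahead from matching part of a longer word). Then lazily eat up any word characters followed by "hay", and greedily eat up any remaining word characters. The greedy match has to stop at the end of the word, which the lookahead has already confirmed is seven characters long.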
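To try this pattern outside a text editor, here is a minimal sketch in Python. The function name and the sample sentence are my own inventions for illustration, and I am assuming Python's re module treats this pattern the same way the editor does.

```python
import re

# Pattern from the text: the lookahead checks that the word is exactly
# seven word characters long, then the main pattern requires "hay" in it.
SEVEN_WITH_HAY = re.compile(r"(?=\b\w{7}\b)\w*?hay\w*")

def find_seven_letter_hay_words(text):
    """Return every seven-character word in text that contains 'hay'."""
    return SEVEN_WITH_HAY.findall(text)

# Hypothetical sample sentence.
sample = "The hayseed left the hayloft, but hay and haystack stayed."
print(find_seven_letter_hay_words(sample))  # ['hayseed', 'hayloft']
```

"hay" (three characters) and "haystack" (eight) are skipped because the lookahead fails at the start of those words.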

Here, in our word, we allow any characters that regex calls "word characters", which, besides letters, also include digits and underscores. If we want a more conservative pattern, we just need to change the lookahead:

Traditional word (only letters): (?i)(?=\b[A-Z]{7}\b)\w*?hay\w*

In this pattern, in the lookahead, you can see that I replaced \w{7} with [A-Z]{7}, which matches seven capital letters. To include lowercase letters, we could have used [A-Za-z]{7}. Instead, we used the case-insensitive modifier (?i). Thanks to this modifier, the pattern can match "HAY" or "hAy" just as easily as "hay". It all depends on what you want: regex puts the power in your hands.
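The difference between the two lookaheads is easiest to see side by side. This is a rough Python sketch; the constant names and the sample string are made up for the demonstration.

```python
import re

# Both patterns are taken from the text; only the lookahead (and the
# case-insensitive modifier) differs.
ANY_WORD_CHARS = re.compile(r"(?=\b\w{7}\b)\w*?hay\w*")
LETTERS_ONLY = re.compile(r"(?i)(?=\b[A-Z]{7}\b)\w*?hay\w*")

# Hypothetical sample: seven-character "words" with digits, underscores
# and uppercase letters mixed in.
sample = "hay_123 HAYLOFT hayride 4hay567"

loose = ANY_WORD_CHARS.findall(sample)
strict = LETTERS_ONLY.findall(sample)

print(loose)   # ['hay_123', 'hayride', '4hay567']
print(strict)  # ['HAYLOFT', 'hayride']
```

The first pattern accepts digits and underscores but misses "HAYLOFT" because it is case-sensitive. The second one rejects anything that is not seven letters, in either case.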

Line Contains both "bubble" and "gum"

Search pattern: ^(?=.*?\bbubble\b).*?\bgum\b.*
Translation: While anchored at the beginning of the line, look ahead for any characters followed by the word bubble. We could use a second lookahead to look for gum, but it is faster to just match the whole line, taking care to match gum on the way.
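Here is a small Python sketch of the same search. One assumption to flag: a text editor applies ^ to every line, so in Python I pass re.MULTILINE to get that behavior. The sample lines are invented.

```python
import re

# Pattern from the text. re.MULTILINE is an assumption that mimics a text
# editor, where ^ anchors at the start of each line.
BUBBLE_AND_GUM = re.compile(r"^(?=.*?\bbubble\b).*?\bgum\b.*", re.MULTILINE)

# Hypothetical sample text, one candidate per line.
text = """I like bubble gum.
gum first, then a bubble
bubble bath only
chewing gum only
bubblegum as one word"""

matches = BUBBLE_AND_GUM.findall(text)
print(matches)  # ['I like bubble gum.', 'gum first, then a bubble']
```

The order of the two words does not matter, because the lookahead scans the whole line for bubble before the main pattern goes looking for gum. The last line fails because the \b boundaries require bubble and gum to be separate words.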

Line Contains "boy" or "buy"

Search pattern: \bb[ou]y\b
Translation: As a whole word (between two \b boundaries), match the character b, then one character that is either o or u, then y.
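To list the lines that contain either word, one option is to filter with search. This is only a sketch in Python, and the sample lines are made up.

```python
import re

# Pattern from the text: b, then o or u, then y, as a whole word.
BOY_OR_BUY = re.compile(r"\bb[ou]y\b")

# Hypothetical sample lines.
lines = [
    "The boy ran home.",
    "I will buy bread.",
    "A buoy marks the channel.",
    "The boys ran home.",
    "Bye for now.",
]

hits = [line for line in lines if BOY_OR_BUY.search(line)]
print(hits)  # ['The boy ran home.', 'I will buy bread.']
```

"buoy" and "boys" are rejected: the character class allows exactly one letter between b and y, and the trailing \b refuses a following s.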

Find Repeated Words, such as "the the"

This is a popular example in the regex literature. I don't know about you you, but it doesn't happen all that often often that mistakenly repeated words find their way way into my text. If this example is so popular, it's probably because it's a short pattern that does a great job of showcasing the power of regex. 

You can find a million ways to write your repeated word pattern. In this one, I used POSIX classes (available in Perl and PHP), allowing us to throw in optional punctuation between the words, in addition to optional space. 

Search pattern: \b([[:alpha:]]+)[ [:punct:]]+\1
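Translation: After a word boundary, in group one, capture one or more letters, then eat up space characters or punctuation marks, then match the same word we captured earlier in group one. Because the pattern does not end with \b, it also matches when the repeated text begins a longer word, as in "the theory"; append \b if you only want whole-word repeats.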

If you don't want the punctuation, just use \s+ in place of [ [:punct:]]+.

Remember that \s eats up any white-space characters, including newlines, tabs and vertical tabs, so if this is not what you want, use [ ]+ to specify space characters. The brackets are optional, but they make the space character easier to spot, especially in a variable-width font.
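Python's re module does not support POSIX bracket classes, so the sketch below swaps in rough ASCII stand-ins: [A-Za-z] for [[:alpha:]] and string.punctuation for [:punct:]. These substitutions are my assumption, not an exact equivalent, and the sample sentence is invented.

```python
import re
import string

# Assumption: ASCII-only stand-ins for the POSIX classes in the text.
PUNCT = re.escape(string.punctuation)

# Repeated word with optional punctuation between the two copies.
WITH_PUNCT = re.compile(r"\b([A-Za-z]+)[ " + PUNCT + r"]+\1")
# Variant from the text: only white space between the two copies.
SPACE_ONLY = re.compile(r"\b([A-Za-z]+)\s+\1")

def repeated_spans(pattern, text):
    """Return the full text of each repeated-word match."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "Paris in the the spring. It was, was it not, a long long way."

print(repeated_spans(WITH_PUNCT, sample))  # ['the the', 'was, was', 'long long']
print(repeated_spans(SPACE_ONLY, sample))  # ['the the', 'long long']
```

The \s+ variant skips "was, was" because a comma sits between the two copies.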

Line does Not Contain "boy"

Search pattern: ^(?!.*boy).*
Translation: At the beginning of the line, if the negative lookahead can assert that what follows is not "any characters then boy", match anything on the line. 
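A quick Python sketch of the negative lookahead at work. Again, re.MULTILINE is my assumption to make ^ behave per line as it does in an editor, and the sample text is made up.

```python
import re

# Pattern from the text; re.MULTILINE (an assumption) makes ^ match at
# the start of every line.
NO_BOY = re.compile(r"^(?!.*boy).*", re.MULTILINE)

# Hypothetical sample text.
text = "a boy and his dog\nnothing to see here\ncowboys ride at dawn\njust a girl"

kept = NO_BOY.findall(text)
print(kept)  # ['nothing to see here', 'just a girl']
```

Since the pattern has no \b boundaries around boy, the line with "cowboys" is rejected too. Wrap boy in \b if that is not what you want.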

Line Contains "bubble" but Neither "gum" Nor "bath"

Search pattern: ^(?!.*gum)(?!.*bath).*?bubble.*
Translation: At the beginning of the line, assert that what follows is not "any characters then gum", assert that what follows is not "any characters then bath", then match the whole string, making sure to pick up bubble on the way.
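Stacked lookaheads are easy to test on a handful of lines. The following is a sketch in Python with invented sample lines and a helper name of my own choosing.

```python
import re

# Pattern from the text: two negative lookaheads, then the required word.
BUBBLE_ONLY = re.compile(r"^(?!.*gum)(?!.*bath).*?bubble.*")

def keep_line(line):
    """True if the line has 'bubble' but neither 'gum' nor 'bath'."""
    return BUBBLE_ONLY.match(line) is not None

# Hypothetical sample lines.
lines = ["bubble gum", "bubble bath", "bubble wrap", "soap bubble", "no match here"]

print([line for line in lines if keep_line(line)])  # ['bubble wrap', 'soap bubble']
```

Each lookahead starts from the same position at the beginning of the line, so the order in which gum and bath are checked makes no difference.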

Email Address

If I ever have to look for an email address in my text editor, frankly, I just search for @. That shows me both well-formed addresses and addresses whose authors let their creativity run loose, for instance by typing DOT in place of the period.

When it comes to validating user input, you want an expression that checks for well-formed addresses. There are thousands of email address regexes out there. In the end, none can really tell you whether an address is valid until you send a message and the recipient replies. 

The regex below is borrowed from chapter 4 of Jan Goyvaerts's excellent book, Regular Expressions Cookbook. I'm in tune with Jan's reasoning that what you really want is an expression that works with 999 addresses out of a thousand, an expression that doesn't require a lot of maintenance, for instance by forcing you to add new top-level domains ("dot something") every time the powers in charge of those things decide it's time to launch names ending in something like dot-phone or dot-dog.

Search pattern: (?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

Let's unroll this one:

(?i)               # Turn on case-insensitive mode

\b                 # Position engine at a word boundary

[A-Z0-9._%+-]+     # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.

@                  # Match @

(?:[A-Z0-9-]+\.)+  # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com

[A-Z]{2,6}         # Match two to six letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced.

\b                 # Match a word boundary
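The commented breakdown above maps nicely onto a free-spacing pattern. Here is a sketch in Python, where I assume the re.VERBOSE and re.IGNORECASE flags in place of the inline (?i) modifier. The addresses in the sample are made up.

```python
import re

# Same pattern as in the text, written in free-spacing mode.
# Assumption: re.IGNORECASE replaces the inline (?i) modifier.
EMAIL = re.compile(r"""
    \b                   # word boundary
    [A-Z0-9._%+-]+       # local part
    @                    # literal @
    (?:[A-Z0-9-]+\.)+    # domain and sub-domains, each followed by a dot
    [A-Z]{2,6}           # top-level domain, two to six letters
    \b                   # word boundary
""", re.IGNORECASE | re.VERBOSE)

# Hypothetical sample text.
sample = ("Write to jane.doe@example.com or support@post.microsoft.com, "
          "not to jane at example DOT com.")

found = EMAIL.findall(sample)
print(found)  # ['jane.doe@example.com', 'support@post.microsoft.com']

# A top-level domain longer than six letters is not matched.
print(EMAIL.search("user@example.technology"))  # None
```

The second check shows the maintenance trade-off in practice: with {2,6}, an address whose top-level domain has more than six letters slips through unmatched.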