How to Use Python Regex

Diving headlong into data sets is a part of the lesson for anyone working in data science. Often, this means number-crunching, but what do we do when our data set is primarily text-based? We can use regular expressions. In this tutorial, we’re going to take a closer look at how to use regular expressions (regex) in Python.

Regular expressions (regex) are essentially text patterns that you can use to automate searching through and replacing elements within strings of text. This can make cleaning and working with text-based data sets much easier, saving you the trouble of having to search through mountains of text by hand.

Regular expressions can be used across a variety of programming languages, and they’ve been around for a very long time!

In this tutorial, though, we’ll be learning about regular expressions in Python, so basic familiarity with key Python concepts like if-else statements, while and for loops, etc., is required. (If you need a refresher on any of this stuff, our introductory Python courses cover all of the relevant topics interactively, right in your browser!)

By the end of the tutorial, you’ll be familiar with how Python regex works, and be able to use the basic patterns and functions in Python’s regex module, `re`, to analyze text strings. You’ll also get an introduction to how regex can be used in concert with pandas to work with large text corpora (a corpus is a data set of text).

(To work through the pandas section of this tutorial, you will need to have the pandas library installed. The easiest way to do this is to download Anaconda and work through this tutorial in a Jupyter notebook. For other options, check out the pandas installation guide.)

Our Task: Analyze Spam Emails

In this tutorial, we’ll use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read.

You can find the full corpus here. But we’ll start by learning basic regex commands using a few emails. If you’d like, you can use our test file as well, or you can try this with the full corpus.

Introducing Python’s Regex Module

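First, we’ll prepare the data set by opening the test file in read-only mode and reading it. We’ll also assign the result to a variable, `fh` (for “file handle”). Strictly speaking, because we call read() straight away, `fh` holds the text of the file as a single string rather than the handle itself.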

fh = open(r"test_emails.txt", "r").read()

Notice that we precede the directory path with an `r`. This prefix turns the string literal into a raw string, in which backslashes are kept as literal characters instead of starting escape sequences. That helps to avoid surprises with characters such as the backslashes in directory paths on Windows.
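To make the effect of the `r` prefix concrete, here is a minimal sketch. The Windows-style path is hypothetical and only serves as an illustration; it is not a file used in this tutorial.

```python
# A hypothetical Windows-style path, used only for illustration.
normal = "C:\new_folder\test.txt"   # "\n" and "\t" become a newline and a tab
raw = r"C:\new_folder\test.txt"     # backslashes stay as literal characters

print(len(normal))   # shorter: each escape sequence collapses to one character
print(len(raw))      # every backslash is counted as its own character
print(raw)
```

The same prefix is handy for regex patterns later on, since patterns such as `\w` and `\d` also rely on backslashes.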

Now, suppose we want to find out who the emails are from. We could try raw Python on its own:


for line in fh.split("n"):
    if "From:" in line:
        print(line)
der.com>
Message-Id: <[email protected]>
From: "Mr. Be
[email protected]>
Message-Id: <[email protected]>
From: "PRINCE OBONG ELEME" <obo

But that’s not giving us exactly what we want. If you take a look at our test file, we could figure out why and fix it, but instead, let’s use Python’s `re` module and do it with regular expressions!
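One likely reason for the fragmented output is that the loop splits the text on the letter "n" rather than on the newline character "\n". The sketch below compares the two on a short stand-in string; the header lines and addresses are made up for illustration and are not taken from the corpus.

```python
# Hypothetical two-line header; the addresses are invented for illustration.
text = 'Message-Id: <abc123@example.com>\nFrom: "Mr. Ben Suleman" <ben@example.com>\n'

# Splitting on the letter "n" cuts the text at every "n", not at line breaks.
pieces = [p for p in text.split("n") if "From:" in p]

# Splitting on the newline character "\n" yields whole lines.
lines = [l for l in text.split("\n") if "From:" in l]

print(pieces)
print(lines)
```

With the newline split, each result is a complete `From:` line, which is what the regex version also returns.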

We’ll start by importing Python’s `re` module. Then, we’ll use a function called `re.findall()` that returns a list of all instances of a pattern we define in the string we’re looking at.

Here’s how it looks:


import re

for line in re.findall("From:.*", fh):
    print(line)
From: "Mr. Ben Suleman" <[email protected]>
From: "PRINCE OBONG ELEME" <[email protected]>

This is essentially the same length as our raw Python, but that’s because it’s a very simple example. The more you’re trying to do, the more effort Python regex is likely to save you.

Before we move on, let’s take a closer look at `re.findall()`. This function takes two arguments in the form of `re.findall(pattern, string)`. Here, `pattern` represents the substring we want to find, and `string` represents the main string we want to find it in. The main string can consist of multiple lines. In this case, we’re having it search through all of `fh`, the file with our selected emails.

The `.*` is a shorthand for a string pattern. Regular expressions work by using these shorthand patterns to find specific patterns in text, so let’s take a look at some other common examples:

Common Python Regex Patterns

The pattern we used with `re.findall()` above contains a fully spelled-out string, `"From:"`. This is useful when we know precisely what we’re looking for, right down to the actual letters and whether or not they’re upper or lower case. If we don’t know the exact format of the strings we want, we’d be lost. Fortunately, regex has basic patterns that account for this scenario. Let’s look at the ones we use in this tutorial:

  • `\w` matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -.
  • `\d` matches digits, which means 0-9.
  • `\s` matches whitespace characters, which include the tab, new line, carriage return, and space characters.
  • `\S` matches non-whitespace characters.
  • `.` matches any character except the new line character `\n`.

With these regex patterns in hand, you’ll quickly understand our code above as we go on to explain it.
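The patterns listed above can be tried out on a small string. This is a minimal sketch; the sample string is invented purely to contain one of each kind of character.

```python
import re

# Invented sample: letters, an underscore, digits, a space, a dash, a tab, a newline.
sample = "id_7 a-b\t9\n"

word_chars = re.findall(r"\w", sample)      # letters, digits and the underscore
digits = re.findall(r"\d", sample)          # digits only
whitespace = re.findall(r"\s", sample)      # space, tab and newline
non_whitespace = re.findall(r"\S", sample)  # everything that is not whitespace
any_char = re.findall(r".", sample)         # every character except the newline

print(word_chars)
print(digits)
print(whitespace)
print(non_whitespace)
print(any_char)
```

Note that the dash shows up in the `\S` and `.` results but not in the `\w` results.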

Working with Regex Patterns

We can now explain the use of `.*` in the line `re.findall("From:.*", fh)` above. Let’s look at `.` first:


for line in re.findall("From:.", fh):
    print(line)
From:
From:

By adding a `.` next to `From:`, we look for one additional character next to it. Because `.` looks for any character except `\n`, it captures the space character, which we cannot see. We can try more dots to verify this.


for line in re.findall("From:...........", fh):
    print(line)
From: "Mr. Ben S
From: "PRINCE OB

It looks like adding dots does acquire the rest of the line for us. But it’s tedious, and we don’t know how many dots to add. This is where the asterisk symbol, `*`, comes in.


`*` matches zero or more instances of the pattern on its left, so it looks for repetitions of that pattern. By default this repetition is “greedy”: it consumes as many characters as it can while still letting the overall pattern match. A repetition that instead consumes as few characters as possible is called “non-greedy” or “lazy.”
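The difference between greedy and lazy matching is easiest to see side by side. The sketch below uses the lazy form `*?`, which is standard `re` syntax but is not used elsewhere in this tutorial; the sample line and its address are made up for illustration.

```python
import re

# Invented line with two quoted sections, to make the difference visible.
line = 'From: "Mr. Ben Suleman" <ben@example.com> "extra"'

# Greedy: .* runs as far as it can, so the match ends at the LAST quotation mark.
greedy = re.findall(r'".*"', line)

# Lazy: .*? stops as early as it can, so each quoted section is matched separately.
lazy = re.findall(r'".*?"', line)

print(greedy)
print(lazy)
```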

Let’s construct a greedy search for `.` with `*`.


for line in re.findall("From:.*", fh):
    print(line)
From: "Mr. Ben Suleman" <[email protected]>
From: "PRINCE OBONG ELEME" <[email protected]>

Because `*` matches zero or more instances of the pattern indicated on its left, and `.` is on its left here, we are able to acquire all the characters in the `From:` field until the end of the line. This prints out the full line with beautifully succinct code.

We might even go further and isolate only the name. Let’s use `re.findall()` to return a list of lines containing the pattern `"From:.*"` as we’ve done before. We’ll assign it to a variable for neatness. Next, we’ll iterate through the list. In each cycle, we’ll execute `re.findall()` again, matching the first quotation mark to pick out just the name:



Notice that we use a backslash next to the first quotation mark. The backslash is a special character used for escaping other special characters. For instance, when we want a quotation mark to be part of a string literal instead of ending it, we escape it with a backslash like this: `\"`. If we do not escape the quotation marks in the pattern above with backslashes, it would become `"".*""`, which the Python interpreter would read as a period and an asterisk between two empty strings. It would produce an error and break the script. Hence, it’s crucial that we escape the quotation marks here with backslashes.

After the first quotation mark is matched, `.*` acquires all the characters in the line until the next quotation mark, also escaped in the pattern. This gets us just the name, within quotation marks. The name is also printed within square brackets because `re.findall()` returns matches in a list.
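The name-extraction steps described above can be sketched as follows. This is a reconstruction from the description, not necessarily the original listing: the variable names are assumptions, and the sample text with its addresses is invented to stand in for the test file.

```python
import re

# Invented stand-in for the contents of the test file.
text = (
    'From: "Mr. Ben Suleman" <ben@example.com>\n'
    'Subject: hello\n'
    'From: "PRINCE OBONG ELEME" <prince@example.com>\n'
)

# First pass: collect every line that starts a From: field.
from_lines = re.findall("From:.*", text)

# Second pass: inside each line, match from the first quotation mark to the last.
# The backslashes keep the quotation marks from ending the string literal.
names = [re.findall("\".*\"", line) for line in from_lines]

for name in names:
    print(name)
```

Writing the pattern in single quotes, as `'".*"'`, would avoid the backslashes altogether.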

What if we want the email address instead?



Looks simple enough, doesn’t it? Only the pattern is different. Let’s walk through it.

Here’s how we match just the front part of the email address:



Emails always contain an @ symbol, so we start with it. The part of the email before the @ symbol might contain alphanumeric characters, which means `\w` is required. However, because some emails contain a period or a dash, that’s not enough. We add `\S` to look for non-whitespace characters. But `\w\S` will get only two characters. Add `*` to look for repetitions. The front part of the pattern thus looks like this: `\w\S*@`.

Now for the pattern behind the @ symbol:



The domain name usually contains alphanumeric characters, periods, and a dash sometimes, so a `.` will do. To make it greedy, we extend the search with a `*`. This allows us to match any character till the end of the line.

If we look at the line closely, we see that each email is encapsulated within angle brackets, < and >. Our pattern, `.*`, includes the closing bracket, >. Let’s remedy it:



Email addresses end with an alphanumeric character, so we cap the pattern with `\w`. So, after the @ symbol we have `.*\w`, which means that the pattern we want is a group of any type of characters ending with an alphanumeric character. This excludes >.

Our full email address pattern thus looks like this: `\w\S*@.*\w`.
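The way the email pattern is built up can be checked step by step. In this sketch the `From:` line and its address are invented for illustration, and the pattern is the deliberately loose one developed above, not a full email validator.

```python
import re

# Invented From: line; the address contains a period and a dash on purpose.
line = 'From: "Mr. Ben Suleman" <ben.suleman-1@mail.example.com>'

# Front part only: a word character, then non-whitespace, up to the @ symbol.
front = re.findall(r"\w\S*@", line)

# Adding .* after the @ runs to the end of the line, closing bracket included.
with_bracket = re.findall(r"\w\S*@.*", line)

# Capping the pattern with \w forces the match to end on an alphanumeric character.
full = re.findall(r"\w\S*@.*\w", line)

print(front)
print(with_bracket)
print(full)
```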

Phew! That was quite a bit to work through. Next, we’ll run through some common `re` functions that will be useful when we start reorganizing our corpus.

Common Python Regex Functions

`re.findall()` is undeniably useful, but it’s not the only built-in function that’s available to us in `re`:

  • `re.search()`
  • `re.split()`
  • `re.sub()`

Let’s look at these one by one before using them to bring some order to our data set.

While `re.findall()` matches all instances of a pattern in a string and returns them in a list, `re.search()` matches the first instance of a pattern in a string, and returns it as a `re` match object.


Like `re.findall()`, `re.search()` also takes two arguments. The first is the pattern to match, and the second is the string to find it in. Here, we’ve assigned the result to a variable for neatness.

Because `re.search()` returns a `re` match object, we can’t display the name and email address by printing it directly. Instead, we have to apply the `group()` function to it first. We’ve printed both their types out in the code above. As we can see, `group()` returns the matched text as a string.

We can also see that printing the match object displays properties beyond the string itself, whereas printing the result of `group()` displays only the string.
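A minimal sketch of the difference between a match object and its matched text follows. The variable names and the sample text are assumptions made for illustration.

```python
import re

# Invented stand-in for the file contents, with two From: fields.
text = (
    'From: "Mr. Ben Suleman" <ben@example.com>\n'
    'From: "PRINCE OBONG ELEME" <prince@example.com>\n'
)

result = re.search("From:.*", text)   # only the FIRST match is returned

print(type(result))           # a match object
print(type(result.group()))   # a plain string
print(result)                 # shows the span as well as the matched text
print(result.group())         # shows only the matched text

# When nothing matches, re.search() returns None instead of a match object.
missing = re.search("Reply-To:.*", text)
print(missing)
```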

re.split()

Suppose we need a quick way to get the domain name of the email addresses. We could do it with three regex operations, like so:


The first line is familiar. We return a list of strings, each containing the contents of the `From:` field, and assign it to a variable. Next, we iterate through the list to find the email addresses. At the same time, we iterate through the email addresses and use the `re` module’s `split()` function to snip each address in half, with the @ symbol as the delimiter. Finally, we print it.
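The three operations described above might look like the following sketch. The variable names are assumptions and the addresses are invented; only the patterns come from the earlier sections.

```python
import re

# Invented stand-in for the file contents.
text = (
    'From: "Mr. Ben Suleman" <ben@example.com>\n'
    'From: "PRINCE OBONG ELEME" <prince@example.org>\n'
)

domains = []
# 1. Find every From: field.
for line in re.findall("From:.*", text):
    # 2. Find the email address inside each field.
    for address in re.findall(r"\w\S*@.*\w", line):
        # 3. Split the address in two at the @ symbol.
        username, domain = re.split("@", address)
        domains.append(domain)
        print(username, domain)
```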

re.sub()

Another handy `re` function is `re.sub()`. As the function name suggests, it substitutes parts of a string. An example:


We’ve already seen the tasks on the first and second lines before. On the third line, we apply `re.sub()` to the matched text, which is the full `From:` field in the email header.


`re.sub()` takes three arguments. The first is the pattern to substitute, the second is the string we want in its place, and the third is the main string itself.
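Here is a short sketch of `re.sub()` with its three arguments. The replacement strings and the sample line are hypothetical choices made for illustration.

```python
import re

# Invented From: field.
line = 'From: "Mr. Ben Suleman" <ben@example.com>'

# Arguments: pattern to replace, replacement text, string to search.
relabeled = re.sub("From", "Sender", line)

# The pattern can be any regex, for instance the email pattern built earlier.
masked = re.sub(r"\w\S*@.*\w", "[address removed]", line)

print(relabeled)
print(masked)
```

`re.sub()` returns a new string and leaves the original untouched.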

Regex with Pandas

Now we have the basics of Python regex in hand. But often for data tasks, we’re not actually using raw Python, we’re using the pandas library. Now let’s take our regex skills to the next level by bringing them into a pandas workflow.

Don’t worry if you’ve never used pandas before. We’ll walk through the code every step of the way so you never feel lost. But if you’d like to learn about pandas in more detail, check out our pandas tutorial or the fully interactive course we offer on numpy and pandas.

Sorting Emails with Python Regex and Pandas

Our corpus is a single text file containing thousands of emails (though again, for this tutorial we’re using a much smaller file with just two emails, since printing the results of our regex work on the full corpus would make this post far too long).

We’ll use regex and pandas to sort the parts of each email into appropriate categories so that the corpus can be more easily read or analysed.

We’ll sort each email into the following categories:

  • the sender’s name
  • the sender’s email address
  • the recipient’s email address
  • the recipient’s name
  • the date the email was sent
  • the subject
  • the body of the email

Each of these categories will become a column in our pandas dataframe (i.e., our table). This will make it easier for us to work on and analyze each column individually.

We’ll keep working with our small sample, but it’s worth reiterating that regular expressions allow us to write more concise code. Concise code is quicker to write and check, which speeds up our analytical process. Working with our small file of two emails, there’s not much difference, but if you try processing the entire corpus with and without regex, you’ll start to see the advantages!

Preparing the Script

To start, let’s import the libraries we’ll need and get our file opened again.

In addition to `re` and `pandas`, we’ll import Python’s `email` package as well, which will help with the body of the email. The body of the email is rather complicated to work with using regex alone. It might even require enough cleaning up to warrant its own tutorial. So, we’ll use the well-developed `email` package to save some time and let us focus on learning regex.


We’ve also created an empty list, which will store dictionaries. Each dictionary will contain the details of one email.

Now, let’s begin applying regex!


Note: we cut off the printout above for the sake of brevity. If you print this on your own machine, it will display the full list of emails rather than being truncated like it is above.

We use the `re` module’s split function to split the entire chunk of text in `fh` into a list of separate emails, which we assign to a variable. This is important because we want to work on the emails one by one, by iterating through the list with a for loop. But how do we know to split by the string `From r`?

We know this because we looked into the file before we wrote the script. We didn’t have to peruse the thousands of emails in there. Just the first few, to see what the structure of the data looks like. Whenever possible, it’s good to get your eyes on the actual data before you start working with code, as you’ll often discover useful features like this.

We’ve taken a screenshot of what the original text file looks like:

Emails start with “From r”

The green block is the first email. The blue block is the second email. As we can see, both emails start with `From r`, highlighted with red boxes.

One reason we use the Fraudulent Email Corpus in this tutorial is to show that when data is disorganized, unfamiliar, and comes without documentation, we can’t rely solely on code to sort it out. It would require a pair of human eyes. As we’ve just shown, we had to look into the corpus itself to study its structure.

Disorganized data like this may require a lot of cleaning up. For instance, even though we count 3,977 emails in this set using the full script we’re about to construct for this tutorial, there are actually more. Some emails are not preceded by `From r`, and so are not counted separately. (However, for the purposes of brevity, we’ll proceed as if that issue has already been fixed and all emails are separated by `From r`.)

Notice also that we get rid of the first element in the list. That’s because a `From r` string precedes the first email. When the text is split on that string, it produces an empty string at index 0. The script we’re about to write is designed for emails. If we try to use it on an empty string, it might throw errors. Getting rid of the empty string lets us keep these errors from breaking our script.
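The splitting step and the empty first element can be reproduced on a tiny stand-in for the corpus. In this sketch the timestamps, addresses, bodies and variable names are all invented; only the `From r` delimiter comes from the text above.

```python
import re

# Invented two-email stand-in; each email begins with a "From r" line.
text = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    'From: "Mr. Ben Suleman" <ben@example.com>\n'
    "\n"
    "First body\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    'From: "PRINCE OBONG ELEME" <prince@example.org>\n'
    "\n"
    "Second body\n"
)

chunks = re.split(r"From r", text)
print(len(chunks))        # one more than the number of emails

# The text starts with the delimiter, so the first element is an empty string.
first = chunks.pop(0)
print(repr(first))
print(len(chunks))        # now one chunk per email
```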

Getting Every Name and Address With a For Loop

Next, we’ll work with the emails in the list.



In the code above, we use a `for` loop to iterate through the list so we can work with each email in turn. We create a dictionary that will hold all the details of each email, such as the sender’s address and name. In fact, these are the first items we find.

This is a three-step process. It begins by finding the `From:` field.



With Step 1, we find the entire `From:` field using the `re.search()` function. The `.` means any character except `\n`, and `*` extends it to the end of the line. We then assign the result to a variable.

But data isn’t always straightforward. It can contain surprises. For instance, what if there’s no `From:` field? The script would throw an error and break. We pre-empt errors from this scenario in Step 2.



To avoid errors resulting from missing `From:` fields, we use an `if` statement to check that the result of Step 1 isn’t `None`. If it is, we give the variables for the sender’s email address and name the value of `None` so that the script can move on instead of breaking unexpectedly.

If you’re working along with this tutorial in your own file, you’ve probably already realized that working with regular expressions gets messy. For instance, these if-else statements are the result of using trial and error on the corpus while writing it. Writing code is an iterative process. It’s worth noting that even if this tutorial is making it seem straightforward, actual practice involves a lot more experimentation.

In Step 2, we use a familiar regex pattern from before, `\w\S*@.*\w`, which matches the email address.

We’ll use a different tactic for the name. Each name is bounded by the colon, `:`, of the substring `"From:"` on the left, and by the opening angle bracket, `<`, of the email address on the right. Hence, we use `:.*<` to find the name. We get rid of `:` and `<` from each result in a moment.
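Steps 1 and 2 can be sketched as below. This is a reconstruction from the description rather than the original listing: the function name, the variable names and the sample emails are assumptions.

```python
import re

def find_sender_parts(email_text):
    """Return match objects (or None) for the sender's address and name."""
    # Step 1: find the whole From: field.
    sender = re.search(r"From:.*", email_text)

    # Step 2: only search inside the field if it was actually found.
    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None
    return s_email, s_name

# Invented emails: one with a From: field, one without.
with_from = 'Subject: hi\nFrom: "Mr. Ben Suleman" <ben@example.com>\n'
without_from = "Subject: hi\n\nNo sender line here\n"

email_match, name_match = find_sender_parts(with_from)
print(email_match)
print(name_match)
print(find_sender_parts(without_from))
```

At this stage the name match still carries the colon and the angle bracket, which is why a cleanup step follows.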

Now, let’s print out the results of our code to see how they look.



Again, we have match objects. Every time we apply `re.search()` to strings, it produces match objects. We have to turn them into string objects.

Before we do this, recall that if there is no `From:` field, the result of Step 1 would have the value of `None`, and so too would the results for the sender’s email address and name. Hence, we have to check for this scenario again so that the script doesn’t break unexpectedly. Let’s see how to construct the code with the email address first.



In Step 3A, we use an `if` statement to check that the email address match is not `None`, otherwise it would throw an error and break the script.

Then, we simply convert the match object into a string and assign it to a variable. We add this to the dictionary, which will make it incredibly easy for us to turn the details into a pandas dataframe later on.

We do almost exactly the same for the name in Step 3B.



Just as we did before, we first check that the name match isn’t `None` in Step 3B.

Then, we use the `re` module’s `re.sub()` function twice before assigning the string to a variable. First, we remove the colon and any whitespace characters between it and the name. We do this by substituting `:\s*` with an empty string `""`. Then, we remove whitespace characters and the angle bracket on the other side of the name, again substituting it with an empty string. Finally, after assigning the string to a variable, we add it to the dictionary.
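The cleanup in Step 3B might be sketched like this. The first pattern, `:\s*`, follows the description above; the second pattern, `\s*<`, along with the dictionary name and key, is an assumption chosen to match that description.

```python
import re

# Invented From: field, searched with the name pattern from Step 2.
name_match = re.search(r":.*<", 'From: "Mr. Ben Suleman" <ben@example.com>')

emails_dict = {}   # hypothetical name for the per-email dictionary

if name_match is not None:
    # Remove the colon and any whitespace between it and the name.
    sender_name = re.sub(r":\s*", "", name_match.group())
    # Remove the whitespace and angle bracket after the name (assumed pattern).
    sender_name = re.sub(r"\s*<", "", sender_name)
else:
    sender_name = None

emails_dict["sender_name"] = sender_name
print(emails_dict)
```

The quotation marks around the name are kept, since neither substitution touches them.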

Let’s check out our results.



Perfect. We’ve isolated the email address and the sender’s name. We’ve also added them to the dictionary, which will come into play soon.

Now that we’ve found the sender’s email address and name, we do exactly the same set of steps to acquire the recipient’s email address and name for the dictionary.

First, we find the `To:` field.


Next, we pre-empt the scenario where the result is `None`.


If the `To:` match isn’t `None`, we use `re.search()` to find the match objects containing the email address and the recipient’s name. Otherwise, we give both variables the value of `None`.

Then, we turn the match objects into strings and add them to the dictionary.
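Here is a hedged sketch of the recipient steps. The function name extract_recipient, the dictionary keys, and the patterns are assumptions for illustration. Note also that a bare To: pattern would match a Reply-To: header if one appeared first.

```python
import re


def extract_recipient(item):
    """Return the recipient's email address and name, using None for gaps."""
    recipient = re.search(r"To:.*", item)

    # Pre-empt the case where the email has no To: field at all.
    if recipient is not None:
        r_email = re.search(r"\w\S*@.*\w", recipient.group())
        r_name = re.search(r":.*<", recipient.group())
    else:
        r_email = None
        r_name = None

    # Turn the match objects into strings, keeping None where nothing matched.
    recipient_email = r_email.group() if r_email is not None else None
    if r_name is not None:
        recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", r_name.group()))
    else:
        recipient_name = None

    return {"recipient_email": recipient_email, "recipient_name": recipient_name}


# Two made-up emails: one with a To: field and one without.
with_to = 'From: a@example.com\nTo: "John Roe" <john.roe@example.org>\n\nHello'
without_to = "From: a@example.com\n\nHello"

print(extract_recipient(with_to))
print(extract_recipient(without_to))
```

Checking for None before calling group() matters because calling group() on None raises an AttributeError.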

Because the structure of the From: and To: fields is the same, we can use the same code for both. We need to tailor slightly different code for the other fields.

Getting the Date of the Email

Now for the date the email was sent.

We acquire the Date: field with the same code as for the From: and To: fields.

And, just as we do for those two fields, we check that the Date: field, assigned to its own variable, is not None.

Printing the Date: field lets us see the structure of the string more clearly. It includes the day, the date in DD MMM YYYY format, and the time. We want just the date. The code for the date is largely the same as for names and email addresses but simpler. Perhaps the only puzzler here is the regex pattern, \d+\s\w+\s\d+.

The date starts with a number. Hence, we use \d to account for it. However, as the DD part of the date, it could be either one or two digits. Here is where + becomes important. In Python regex, + matches one or more instances of the pattern on its left. \d+ would thus match the DD part of the date whether it is one or two digits.

After that, there’s a space. This is accounted for by \s, which looks for whitespace characters. The month is made up of three alphabetical letters, hence \w+. Then it hits another space, \s. The year is made up of numbers, so we use \d+ once more.

The full pattern, \d+\s\w+\s\d+, works because it is a precise pattern bounded on both sides by whitespace characters.
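To see the pattern at work, here is a small sketch on two made-up Date: strings (the sample values are assumptions, not lines from the corpus). It shows that the same pattern handles both a two-digit and a one-digit day.

```python
import re

# Hypothetical date fields in the "day, DD MMM YYYY time" layout described above.
two_digit_day = "Date: Thu, 31 Oct 2002 02:38:20 +0000"
one_digit_day = "Date: Mon, 4 Nov 2002 10:00:00 +0000"

date_pattern = r"\d+\s\w+\s\d+"

first = re.search(date_pattern, two_digit_day).group()
second = re.search(date_pattern, one_digit_day).group()

print(first)   # 31 Oct 2002
print(second)  # 4 Nov 2002
```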

Next, we do the same check for a value of None as before.

If the date match is not None, we turn it from a match object into a string and assign it to a variable. We then insert it into the dictionary.
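The two None checks can be sketched together like this. The function name extract_date, the variable names, and the "date_sent" key are assumptions made for the example.

```python
import re


def extract_date(item):
    """Return the DD MMM YYYY date from a raw email string, or None."""
    date_field = re.search(r"Date:.*", item)

    # First check: is there a Date: field at all?
    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None

    # Second check: did the date pattern match inside that field?
    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None
    return date_sent


emails_dict = {}
emails_dict["date_sent"] = extract_date(
    "From: a@example.com\nDate: Mon, 4 Nov 2002 10:00:00 +0000\n\nHi"
)
print(emails_dict)
```

An email without a Date: header simply yields None instead of raising an error.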

Before we go on, we should note a crucial point. + and * seem similar but they can produce very different results. Let’s use the date string here as an example.

If we use *, we’d be matching zero or more occurrences. + matches one or more occurrences. Comparing the results for both scenarios shows a big difference: + acquires the full date, whereas * gets only a space and the digits of the day.
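The difference is easy to reproduce. The sketch below runs both variants on a made-up date string (an assumption, not a line from the corpus). Because * allows every piece to match nothing, the search succeeds at the first spot where two whitespace characters sit on either side of a run of word characters.

```python
import re

# Hypothetical date field used only for this comparison.
date_field = "Date: Thu, 31 Oct 2002 02:38:20 +0000"

plus_result = re.search(r"\d+\s\w+\s\d+", date_field).group()
star_result = re.search(r"\d*\s\w*\s\d*", date_field).group()

print(repr(plus_result))  # '31 Oct 2002'
print(repr(star_result))  # ' 31 '
```

With *, the leading \d* matches nothing, \s takes the space after the comma, \w* takes the digits 31, the second \s takes the next space, and the trailing \d* again matches nothing.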

Next up, the subject line of the email.

Getting the Email Subject

As before, we use the same code and code structure to acquire the information we need.

We’re becoming more familiar with the use of Python regex now, aren’t we? It’s largely the same code as before, except that we substitute the "Subject: " prefix with an empty string to get only the subject itself.
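A minimal sketch of the subject step, assuming a made-up email and hypothetical variable names:

```python
import re

# Hypothetical raw email, made up for illustration.
item = "From: a@example.com\nSubject: URGENT ASSISTANCE\n\nBody text"

subject_field = re.search(r"Subject: .*", item)

if subject_field is not None:
    # Drop the field label so that only the subject text remains.
    subject = re.sub(r"Subject: ", "", subject_field.group())
else:
    subject = None

print(subject)  # URGENT ASSISTANCE
```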

Getting the Body of the Email

The last item to insert into our dictionary is the body of the email.


Separating the header from the body of an email is an awfully complicated task, especially when many of the headers are different in one way or another. Consistency is seldom found in raw unorganised data. Luckily for us, the work’s already been done. Python’s email package is highly adept at this task.

Remember that we’ve already imported the package earlier. Now, we apply its message_from_string() function to the raw text of the email, to turn the full email into an email Message object. A Message object consists of a header and a payload, which correspond to the header and body of an email.

Next, we apply the get_payload() method on the Message object. This method isolates the body of the email. We assign it to a variable, which we then insert into our dictionary under a key for the email body.
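As a rough illustration, the two calls look like this on a made-up email. The sample text, the variable names, and the "email_body" key are assumptions. message_from_string() and get_payload() are part of Python’s standard email package.

```python
import email

# Hypothetical raw email: headers, a blank line, then the body.
item = (
    "From: a@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
    "Second line.\n"
)

full_email = email.message_from_string(item)  # parse into a Message object
body = full_email.get_payload()               # everything after the headers

emails_dict = {"email_body": body}

print(full_email["Subject"])  # headers stay accessible by name
print(body)
```

For a simple, non-multipart message like this one, get_payload() returns the body as a single string.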

Why the Email Package and Not Regex for the Body

You may ask, why use the email Python package rather than regex? This is because there’s no good way to do it with Python regex at the moment that doesn’t require significant amounts of cleaning up. It would mean another sheet of code that probably deserves its own tutorial.

It’s worth checking out how we arrive at decisions like this one. However, we need to understand what square brackets, [ ], mean in regex before we can do that.

[ ] match any character placed inside them. For instance, if we want to find a, b, or c in a string, we can use [abc] as the pattern. The patterns we discussed above apply as well. [\w\s] would find either alphanumeric or whitespace characters. The exception is the period, ., which becomes a literal period within square brackets.
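A few quick checks make the behaviour of square brackets concrete. The sample strings are made up for illustration.

```python
import re

# Any one of the listed characters.
letters = re.findall(r"[abc]", "a big cab")
print(letters)  # ['a', 'b', 'c', 'a', 'b']

# Shorthand classes still work inside the brackets.
word_or_space = re.findall(r"[\w\s]", "hi, you!")
print(word_or_space)  # ['h', 'i', ' ', 'y', 'o', 'u']

# Outside brackets a period matches any character; inside, only a literal period.
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"[.]", "a.b")
print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
```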

Now, we can better understand how we made the decision to use the email package instead.

A peek at the data set reveals that email headers stop at one of two marker strings, and that each body ends before the string that opens the next email. We could thus use a pattern that runs from the first marker to the second, with [\s\S]* in between, to acquire only the email body. [\s\S]* works for large chunks of text, numbers, and punctuation because it searches for either whitespace or non-whitespace characters.

Unfortunately, some emails contain the header-ending marker more than once and others don’t contain the string that opens the next email, which means that we would split the emails into more or fewer pieces than the number of dictionaries in the emails list. They would not match with the other categories we already have. This will create problems when working with pandas. Hence, we decided to leverage the email package.
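The trade-off can be sketched with stand-in markers. MARK and NEXT below are hypothetical placeholders, not the strings from the real data set; they only illustrate why [\s\S]* can span lines and why a missing marker breaks the approach.

```python
import re

# Made-up text: MARK ends the header, NEXT opens the following email.
text = "Header: x\nMARK\nfirst line\nsecond line\nNEXT\n"

# By default, . does not match a newline, so this pattern finds nothing.
dot_match = re.search(r"MARK.*NEXT", text)

# [\s\S] matches whitespace or non-whitespace, so it crosses line breaks.
span_match = re.search(r"MARK[\s\S]*NEXT", text)

# An email that lacks the closing marker yields no match at all.
missing_end = re.search(r"MARK[\s\S]*NEXT", "Header: y\nMARK\nonly line\n")

print(dot_match)           # None
print(span_match.group())
print(missing_end)         # None
```

Every email that fails to match would leave a hole in the body column, which is the mismatch described above.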

Create the List of Dictionaries

Finally, append the dictionary to the emails list.

We might want to print the emails list at this point to see how it looks. This will be pretty anti-climactic if you’ve just been using our little sample file, but with the entire corpus you’ll see the power of regular expressions!

We could also run len() on the list to see how many dictionaries, and therefore emails, are in the list. As we mentioned before, the full corpus contains 3,977.

Putting all of the steps above into a single loop gives us the full script. Running it on our sample text file and printing the first item of the list shows what one parsed email looks like.
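As a condensed sketch of the whole loop, the code below strings the steps together. It is not the tutorial’s original script: the helper names search_text and parse_email, the dictionary keys, and the two sample emails are all assumptions made so that the example runs on its own.

```python
import email
import re


def search_text(pattern, text):
    """Return the matched text, or None when the pattern is absent."""
    match = re.search(pattern, text)
    return match.group() if match is not None else None


def parse_email(item):
    """Build one dictionary of fields from one raw email string."""
    emails_dict = {}

    # Sender and recipient share the same structure, so one loop serves both.
    for key, field in (("sender", r"From:.*"), ("recipient", r"To:.*")):
        line = search_text(field, item)
        address = search_text(r"\w\S*@.*\w", line) if line else None
        name = search_text(r":.*<", line) if line else None
        if name is not None:
            name = re.sub(r"\s*<", "", re.sub(r":\s*", "", name))
        emails_dict[key + "_email"] = address
        emails_dict[key + "_name"] = name

    date_line = search_text(r"Date:.*", item)
    emails_dict["date_sent"] = (
        search_text(r"\d+\s\w+\s\d+", date_line) if date_line else None
    )

    subject_line = search_text(r"Subject: .*", item)
    emails_dict["subject"] = (
        re.sub(r"Subject: ", "", subject_line) if subject_line else None
    )

    # The email package separates the header from the payload (the body).
    emails_dict["email_body"] = email.message_from_string(item).get_payload()
    return emails_dict


# Two made-up emails; the second one lacks the To: and Date: fields.
raw_emails = [
    'From: "Jane Doe" <jane.doe@example.com>\n'
    'To: "John Roe" <john.roe@example.org>\n'
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: Urgent business proposal\n"
    "\n"
    "Dear friend, please reply soon.\n",
    "From: sam@example.net\nSubject: Hello\n\nShort note.\n",
]

emails = []
for item in raw_emails:
    emails.append(parse_email(item))

print(len(emails))
print(emails[0])
```

Because every email passes through the same function, each dictionary carries the same keys, with None wherever a field is missing.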

We’ve printed out the first item in the emails list, and it’s clearly a dictionary with key and value pairs. Because we used a for loop, every dictionary has the same keys but different values.

We’ve substituted the full text of the email with a short placeholder string so that we don’t print out the entire mass of the email and clog up our screens. If you’re printing this at home using the actual data set, you’ll see the entire email.

Manipulating Data with Pandas

With dictionaries in a list, we’ve made it infinitely easy for the pandas library to do its job. Each key will become a column title, and each value becomes a row in that column.

All we have to do is hand the list to pandas. With a single line, we turn the emails list of dictionaries into a dataframe using the pandas DataFrame() function. We assign it to a variable too.

That’s it. We now have a sophisticated pandas dataframe. This is essentially a neat and clean table containing all the information we’ve extracted from the emails.

Let’s look at the first few rows.


The head() method displays just the first few rows rather than the entire data set. It takes one optional argument, which lets us specify how many rows we want displayed. Here, passing 3 lets us view three rows.
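A small sketch of both steps, using a made-up list of dictionaries in place of the parsed corpus (the keys and values are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the list of dictionaries built earlier.
emails = [
    {"sender_email": "amy@alpha.example", "subject": "Proposal", "email_body": "Dear friend"},
    {"sender_email": "bob@beta.example", "subject": "Urgent", "email_body": "Please reply"},
    {"sender_email": "cy@gamma.example", "subject": "Hello", "email_body": "Good day"},
    {"sender_email": "di@delta.example", "subject": "Offer", "email_body": "Greetings"},
]

# Each key becomes a column title and each dictionary becomes a row.
emails_df = pd.DataFrame(emails)

# Show only the first three rows.
first_rows = emails_df.head(3)
print(first_rows)
```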

We can also find precisely what we want. For instance, we can find all the emails sent from a particular domain name. However, let’s learn a new regex pattern to improve our precision in finding the items we want.

The pipe symbol, |, looks for the pattern on either side of itself. For instance, a|b looks for either a or b.

| might seem to do the same as [ ], but they really are different. Suppose we want to match any one of three whole words. Joining the words with | would make more sense than placing all of their letters inside [ ], wouldn’t it? The former would look for each whole word, whereas the latter would look for every single letter.
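The contrast shows up clearly on a short made-up sentence. The three words are arbitrary examples chosen for this sketch.

```python
import re

# Alternation matches whole words.
whole_words = re.findall(r"cat|dog|bird", "a cat chased a bird")
print(whole_words)  # ['cat', 'bird']

# A character class built from the same letters matches one letter at a time.
single_letters = re.findall(r"[catdogbird]", "a cat")
print(single_letters)  # ['a', 'c', 'a', 't']
```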

Now, let’s use | to find all the emails sent from one or another domain name.

The filter is a rather lengthy line of code, so let’s read it from the inside out. The innermost part selects the column holding the sender’s email address. Next, str.contains(), given a pattern that joins two domain names with |, returns True if either substring is found in that column. Finally, indexing the dataframe with that result returns a view of the rows where the sender’s email column contains the target substrings. Nifty!
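A hedged sketch of such a filter follows. The dataframe, the "sender_email" column name, and the alpha and beta domain fragments are assumptions for illustration. The na=False argument is added here so that rows with a missing address count as non-matches.

```python
import pandas as pd

# Hypothetical dataframe; the last row has no sender address.
emails_df = pd.DataFrame(
    {
        "sender_email": ["amy@alpha.example", "bob@beta.example", "cy@gamma.example", None],
        "subject": ["Proposal", "Urgent", "Hello", "Offer"],
    }
)

# Inside out: select the column, test each value against the pattern,
# then keep only the rows where the test is True.
mask = emails_df["sender_email"].str.contains("alpha|beta", na=False)
matches = emails_df[mask]
print(matches)
```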

We can view emails from individual cells too. To do this, we go through four steps. In Step 1, we find the index of the row where the sender’s email column contains a particular string. Notice how we use regex to do this.

In Step 2, we use the index to find the email address, which pandas returns as a Series object with several different properties. Printing it shows what it looks like.

In Step 3, we extract the email address from the Series object as we would items from a list. Its type is now str, a plain Python string.

Step 4 is where we extract the email body. A comparison on the sender’s email column finds the row where the column contains the address from Step 3. Next, we look up the value of the email body column in that same row. Finally, we print out the value.
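The four steps can be sketched as follows. The dataframe, the column names, and the search string are assumptions, and .iloc[0] is used in Step 3 so that the lookup goes by position rather than by index label.

```python
import pandas as pd

# Hypothetical dataframe standing in for the parsed corpus.
emails_df = pd.DataFrame(
    {
        "sender_email": ["amy@alpha.example", "bob@beta.example", "cy@gamma.example"],
        "email_body": ["Body from Amy.", "Body from Bob.", "Body from Cy."],
    }
)

# Step 1: find the index of the row whose sender address matches a regex.
index = emails_df[
    emails_df["sender_email"].str.contains(r"\w\S*@beta", na=False)
].index.values

# Step 2: use the index to get the address, returned as a Series.
address_series = emails_df.loc[index, "sender_email"]
print(address_series)

# Step 3: pull the address out of the Series as a plain string.
address = address_series.iloc[0]
print(type(address))

# Step 4: find the row with that address and read its email body.
body = emails_df[emails_df["sender_email"] == address]["email_body"].values
print(body[0])
```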

As you can see, we can work with regex in many ways, and it plays well with pandas, too! Don’t be discouraged if your regex work includes a lot of trial and error, especially when you’re just getting started!

Other Resources

Regex has grown tremendously since it leapt from biology to engineering all those years ago. Today, regex is used across different programming languages, where there are some variations beyond its basic patterns. We’ve learned a lot of Python regex, and if you’d like to take this to the next level, our Python Data Cleaning Advanced course might be a good fit.

You may also find some help in official references, like Python’s documentation for its re module. Google has a quicker reference.

If you’re so inclined, you can also start exploring the differences between Python regex and other forms of regex in this Stack Overflow post. Wikipedia has a page comparing the different regex engines.

If you require data sets to experiment with, Kaggle is useful.

Finally, here’s a Regex cheatsheet we made that is also quite useful.


About the author

Alex Yang

Alex is a writer fascinated by the things code can do. He also enjoys citizen science and new media art.

Which module in Python is used to run regex?

Python has a built-in package called re, which can be used to work with regular expressions.

What is Python regex?

Regex is short for regular expression, a sequence of characters that defines a search pattern. Fields that use this technique include Natural Language Processing (NLP), text mining, data validation, pattern finding, anomaly detection, and others.

What is a string in Python?

In Python, a string is a sequence of characters enclosed in single quotes, double quotes, or triple quotes. Computers do not understand characters. Internally, a string is stored and manipulated as a combination of 0s and 1s.