How to remove stop words from DataFrame in Python

Accepted answer

Assuming your stopwords is a list and each row of df['tokens'] is itself a list of words or tokens.
Simple Method:

clear_tokens = []
for i in df.index:
    clear_tokens.append([item for item in df.tokens[i] if item not in stopwords])

df['tokens'] = clear_tokens

If each row of df.tokens is instead a whole sentence string, then:

clear_tokens = []
for i in df.index:
    tokenlist = df.tokens[i].split()
    clear_tokens.append(' '.join([item for item in tokenlist if item not in stopwords]))

df['tokens'] = clear_tokens


After applying a function to a column, you need to assign the result back to the column; it is not an in-place operation.
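
A compact alternative (a sketch, assuming stopwords is a plain Python list or set and each row of df['tokens'] is already a list of tokens) uses apply with a list comprehension:

# keep only the tokens that are not stop words
df['tokens'] = df['tokens'].apply(lambda tokens: [t for t in tokens if t not in stopwords])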

  • after tokenization ukdata['text'] holds a list of words, so you can use a list comprehension inside the apply to remove the stop words.

    >>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
    >>> remover = StopWordsRemover(stopWords=["b"])
    >>> remover.setInputCol("text")
    StopWordsRemover...
    >>> remover.setOutputCol("words")
    StopWordsRemover...
    >>> remover.transform(df).head().words == ['a', 'c']
    True
    >>> stopWordsRemoverPath = temp_path + "/stopwords-remover"
    >>> remover.save(stopWordsRemoverPath)
    >>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath)
    >>> loadedRemover.getStopWords() == remover.getStopWords()
    True
    >>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive()
    True
    >>> loadedRemover.transform(df).take(1) == remover.transform(df).take(1)
    True
    >>> df2 = spark.createDataFrame([(["a", "b", "c"], ["a", "b"])], ["text1", "text2"])
    >>> remover2 = StopWordsRemover(stopWords=["b"])
    >>> remover2.setInputCols(["text1", "text2"]).setOutputCols(["words1", "words2"])
    StopWordsRemover...
    >>> remover2.transform(df2).show()
    +---------+------+------+------+
    |    text1| text2|words1|words2|
    +---------+------+------+------+
    |[a, b, c]|[a, b]|[a, c]|   [a]|
    +---------+------+------+------+
    ...

    In my last publication, I started a post series on the topic of text pre-processing. In it, I first covered the different steps of text cleaning.

    Now I will continue with the topics Tokenization and Stop Words.

    For this publication, the processed dataset Amazon Unlocked Mobile from the statistics platform “Kaggle” was used, as well as the Example String created in the previous post. You can download both files from my “GitHub Repository”.

    2 Import the Libraries and the Data

    import pandas as pd
    import numpy as np
    
    import pickle as pk
    
    import warnings
    warnings.filterwarnings("ignore")
    
    
    from bs4 import BeautifulSoup
    import unicodedata
    import re
    
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import sent_tokenize
    
    from nltk.corpus import stopwords
    
    
    from nltk.corpus import wordnet
    from nltk import pos_tag
    from nltk import ne_chunk
    
    from nltk.stem.porter import PorterStemmer
    from nltk.stem.wordnet import WordNetLemmatizer
    
    from nltk.probability import FreqDist
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    df = pd.read_csv('Amazon_Unlocked_Mobile_small_Part_I.csv')
    df.head()

    df['Clean_Reviews'] = df['Clean_Reviews'].astype(str)
    clean_text = pk.load(open("clean_text.pkl",'rb'))
    clean_text

    3 Definition of required Functions

    All functions used in this post are summarized here. If a function is new and has not been explained yet, I will show it again at the point in the post where it is used.

    def word_count_func(text):
        '''
        Counts words within a string
        
        Args:
            text (str): String to which the function is to be applied, string
        
        Returns:
            Number of words within a string, integer
        ''' 
        return len(text.split())
    def remove_english_stopwords_func(text):
        '''
        Removes Stop Words (also capitalized) from a list of word tokens, if present
        
        Args:
            text (list): List of word tokens to which the function is to be applied
        
        Returns:
            Clean string without Stop Words
        '''
        # build the stop word set once instead of re-reading it for every token
        stop_words = set(stopwords.words("english"))
        # check in lowercase
        t = [token for token in text if token.lower() not in stop_words]
        text = ' '.join(t)
        return text

    4 Text Pre-Processing

    4.1 (Text Cleaning)

    I have already described this part in the previous post of this series.

    4.2 Tokenization

    Tokenisation is a technique for breaking down a piece of text into small units, called tokens. A token may be a word, part of a word or just characters like punctuation.

    Tokenisation can therefore be roughly divided into three groups:

    • Word Tokenization
    • Character Tokenization and
    • Partial Word Tokenization (n-gram characters)

    In the following I will present two tokenizers:

    • Word Tokenizer
    • Sentence Tokenizer

    Of course there are some more. Find the one which fits best to your data or your problem.

    text_for_tokenization = \
    "Hi my name is Michael. \
    I am an enthusiastic Data Scientist. \
    Currently I am working on a post about NLP, more specifically about the Pre-Processing Steps."
    
    text_for_tokenization

    4.2.1 Word Tokenizer

    To break a sentence into words, the word_tokenize() function can be used. Based on this, further text cleaning steps can be taken, such as removing stop words or normalising text blocks. In addition, machine learning models need numerical data for training and prediction, and tokenisation of words is a crucial step in converting text into numerical data.

    words = word_tokenize(text_for_tokenization)
    print(words)

    print('Number of tokens found: ' + str(len(words)))

    4.2.2 Sentence Tokenizer

    Now the question arises, why do I actually need to tokenise sentences when I can tokenise individual words?

    An example of use would be if you want to count the average number of words per sentence. How can I do that with the Word Tokenizer alone? I can't; I need both the sent_tokenize() function and the word_tokenize() function to calculate the ratio.

    sentences = sent_tokenize(text_for_tokenization)
    print(sentences)
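
    If you want the average number of words per sentence, both tokenizers can be combined. A minimal sketch of the idea described above:

    # word-tokenize each sentence and average the resulting token counts
    words_per_sentence = [len(word_tokenize(sentence)) for sentence in sentences]
    print('Average words per sentence: ' + str(sum(words_per_sentence) / len(sentences)))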

    4.2.3 Application to the Example String

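    Applied to the Example String clean_text loaded at the beginning, a minimal sketch looks like this (the variable name tokens_clean_text is just illustrative):

    # tokenize the example string and look at the number of tokens found
    tokens_clean_text = word_tokenize(clean_text)
    print(tokens_clean_text)
    print('Number of tokens found: ' + str(len(tokens_clean_text)))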

    4.2.4 Application to the DataFrame

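    A minimal sketch of how the word tokenizer can be applied to the review column (the column names Reviews_Tokenized and Token_Count are just illustrative):

    # tokenize every review and store the number of tokens per review
    df['Reviews_Tokenized'] = df['Clean_Reviews'].apply(word_tokenize)
    df['Token_Count'] = df['Reviews_Tokenized'].apply(len)
    df.head()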

    Here I set a limit for the column width so that the output remains readable. This setting should be reset at the end, otherwise it will persist.
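
    In pandas this can be done with set_option; a sketch (the width of 30 characters is just an example value):

    # limit the displayed column width; reset it again when you are done
    pd.set_option('display.max_colwidth', 30)
    df.head()

    # to undo the setting later:
    # pd.reset_option('display.max_colwidth')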

    It is always worthwhile (I have made a habit of doing this) to display the number of remaining words or tokens and also to store it in the data record. The advantage is that, especially in later process steps, it is very quick and easy to see what influence an operation has had on the quality of my information. Of course, this can only be checked on a random basis, but it makes it easy to see whether an applied function had unintended negative effects. It also lets you compare two variants if you do not know which type of algorithm (for example, in normalisation) fits your data better.

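    The two averages can be compared directly, for example like this (a sketch, using the word_count_func from section 3 and the Token_Count column created in the sketch above):

    # compare the average number of words before tokenization with the average number of tokens
    print('Average words per review:  ' + str(df['Clean_Reviews'].apply(word_count_func).mean()))
    print('Average tokens per review: ' + str(df['Token_Count'].mean()))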

    Ok interesting, the average number of words has increased slightly. Let’s take a look at what caused that:

    Note: In the following I do not take the first row from the sorted dataset, but from the created dataset df_subset.

    Here we see the reason: The tokenizer has turned ‘cannot’ into ‘can not’.
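
    This behaviour is easy to reproduce on a small example (a sketch):

    # the NLTK word tokenizer splits 'cannot' into the two tokens 'can' and 'not'
    print(word_tokenize("I cannot do that"))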

    4.3 Stop Words

    Stop words are frequently used words such as I, a, an, in etc. They do not contribute significantly to the information content of a sentence, so it is advisable to remove them by storing a list of words that we consider stop words. The library nltk has such lists for 16 different languages that we can refer to.

    Here are the defined stop words for the English language:

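    A sketch of how the list can be displayed and counted with NLTK (the stopwords corpus may first have to be downloaded once with nltk.download('stopwords')):

    # show the predefined English stop words and how many there are
    print(stopwords.words("english"))
    print('Number of English stop words: ' + str(len(stopwords.words("english"))))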

    Stop words can be removed well with the following function. However, the sentences must first be converted into word tokens; I have explained in detail how to do this in the previous chapter.

    def remove_english_stopwords_func(text):
        '''
        Removes Stop Words (also capitalized) from a list of word tokens, if present
        
        Args:
            text (list): List of word tokens to which the function is to be applied
        
        Returns:
            Clean string without Stop Words
        '''
        # build the stop word set once instead of re-reading it for every token
        stop_words = set(stopwords.words("english"))
        # check in lowercase
        t = [token for token in text if token.lower() not in stop_words]
        text = ' '.join(t)
        return text

    4.3.1 Application to the Example String

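    Applied to the tokenized Example String, a minimal sketch looks like this (tokens_clean_text comes from the tokenization sketch above; the variable name clean_text_wo_stop_words is just illustrative):

    # remove the stop words from the tokenized example string
    clean_text_wo_stop_words = remove_english_stopwords_func(tokens_clean_text)
    print(clean_text_wo_stop_words)

    # the result is a plain string again, so use word_count_func for counting
    print('Number of words: ' + str(word_count_func(clean_text_wo_stop_words)))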

    Note: After removing the stop words, we need the word_count_func again for counting, because the result is a plain string again and no longer a list of tokens.

    4.3.2 Application to the DataFrame

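    For the DataFrame, the function can again be applied column-wise with apply; a sketch (using the Reviews_Tokenized column from the tokenization sketch above; the new column names are just illustrative):

    # remove stop words from the tokenized reviews and store the remaining word count
    df['Reviews_wo_Stop_Words'] = df['Reviews_Tokenized'].apply(remove_english_stopwords_func)
    df['Word_Count_wo_Stop_Words'] = df['Reviews_wo_Stop_Words'].apply(word_count_func)
    df.head()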

    5 Conclusion

    In this part of the Text Pre-Processing series, I explained how tokenization works, how to use it, and showed how to remove Stop Words.

    How do you remove stop words from a DataFrame?

    Here we have a DataFrame with a column named "tweet" that contains tweet text data. We use the pandas apply() method with a lambda function and a list comprehension to remove the stop words declared in the NLTK library.
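
    A sketch of that pattern, assuming the stop word list comes from NLTK and the column is called "tweet":

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    # drop every word that appears in the NLTK stop word list
    df['tweet'] = df['tweet'].apply(
        lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))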

    How do you remove stop words from a dataset in Python?

    Using Python's Gensim library, all you have to do is import the remove_stopwords() method from the gensim.parsing.preprocessing module. Next, pass the sentence from which you want to remove stop words to the remove_stopwords() method, which returns the text string without the stop words.
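
    A sketch with Gensim (assuming the gensim package is installed; the sentence is just an example):

    from gensim.parsing.preprocessing import remove_stopwords

    sentence = "Nick likes to play football, however he is not too fond of tennis."
    # returns the sentence with Gensim's built-in stop words removed
    print(remove_stopwords(sentence))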

    How do I remove a specific word from a DataFrame in Python?

    To remove characters or specific words from columns in a Pandas DataFrame, use the replace() method (or the string accessor's str.replace() for pattern-based replacement).
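
    For a single specific word, a sketch with the pandas string accessor could look like this (the word 'phone' is just an example):

    # remove the word 'phone' wherever it occurs in the column
    df['Clean_Reviews'] = df['Clean_Reviews'].str.replace(r'\bphone\b', '', regex=True)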

    How are stop words removed?

    The removal of stop words is highly dependent on the task we are performing and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we might not remove the stop words. Movie review: “The movie was not good at all.” Removing the stop word “not” here would leave something like “movie good”, which reverses the sentiment of the review.