How to remove stop words from DataFrame in Python

Accepted answer

Assuming your stopwords is a Python list and each row of df['tokens'] is itself a list of words or tokens:
Simple Method:

clear_tokens = []
for i in df.index:
    # keep only the tokens that are not in the stop word list
    clear_tokens.append([item for item in df.tokens[i] if item not in stopwords])
df['tokens'] = clear_tokens

If each row of df.tokens is instead a whole sentence, then:

clear_tokens = []
for i in df.index:
    # split the sentence into words, filter out stop words, and re-join
    tokenlist = df.tokens[i].split()
    clear_tokens.append(' '.join([item for item in tokenlist if item not in stopwords]))
df['tokens'] = clear_tokens

After applying a function to a column, you need to assign the result back to the column; it is not an in-place operation.
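For example, the sentence variant above can be written as a single apply; a minimal sketch, assuming stopwords is the same list as before:

# apply returns a new Series -- assign it back, nothing changes in place
df['tokens'] = df['tokens'].apply(
    lambda s: ' '.join(w for w in s.split() if w not in stopwords))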

  • After tokenization, ukdata['text'] holds a list of words, so you can use a list comprehension inside apply to remove the stop words, as sketched below.
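A minimal sketch of that approach, assuming ukdata is a DataFrame whose 'text' column already contains token lists and using NLTK's stop word list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set lookup is O(1) per token
ukdata['text'] = ukdata['text'].apply(
    lambda words: [w for w in words if w.lower() not in stop_words])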

For Spark DataFrames, the StopWordsRemover transformer from pyspark.ml.feature does the same job, as its doctest shows:

>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
>>> remover = StopWordsRemover(stopWords=["b"])
>>> remover.setInputCol("text")
StopWordsRemover...
>>> remover.setOutputCol("words")
StopWordsRemover...
>>> remover.transform(df).head().words == ['a', 'c']
True
>>> stopWordsRemoverPath = temp_path + "/stopwords-remover"
>>> remover.save(stopWordsRemoverPath)
>>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath)
>>> loadedRemover.getStopWords() == remover.getStopWords()
True
>>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive()
True
>>> loadedRemover.transform(df).take(1) == remover.transform(df).take(1)
True
>>> df2 = spark.createDataFrame([(["a", "b", "c"], ["a", "b"])], ["text1", "text2"])
>>> remover2 = StopWordsRemover(stopWords=["b"])
>>> remover2.setInputCols(["text1", "text2"]).setOutputCols(["words1", "words2"])
StopWordsRemover...
>>> remover2.transform(df2).show()
+---------+------+------+------+
|    text1| text2|words1|words2|
+---------+------+------+------+
|[a, b, c]|[a, b]|[a, c]|   [a]|
+---------+------+------+------+
...

In my last publication, I started a post series on the topic of text pre-processing. In it, I first covered the possible applications of Text Cleaning.

    Now I will continue with the topics Tokenization and Stop Words.

For this publication, the processed dataset Amazon Unlocked Mobile from the statistics platform “Kaggle” was used, as well as the example string created below. You can download both files from my “GitHub Repository”.

2 Import the Libraries and the Data

import pandas as pd
import numpy as np
import pickle as pk

import warnings
warnings.filterwarnings("ignore")

from bs4 import BeautifulSoup
import unicodedata
import re

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk import ne_chunk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv('Amazon_Unlocked_Mobile_small_Part_I.csv')
df.head()

df['Clean_Reviews'] = df['Clean_Reviews'].astype(str)

clean_text = pk.load(open("clean_text.pkl", 'rb'))
clean_text

    3 Definition of required Functions

All required functions are collected here. I will show each of them again at the point in this post where it is used, if it has not been explained before.

def word_count_func(text):
    '''
    Counts words within a string

    Args:
        text (str): String to which the function is to be applied

    Returns:
        Number of words within a string, integer
    '''
    return len(text.split())


def remove_english_stopwords_func(text):
    '''
    Removes Stop Words (also capitalized) from a string, if present

    Args:
        text (str): String to which the function is to be applied

    Returns:
        Clean string without Stop Words
    '''
    # check in lowercase
    t = [token for token in text if token.lower() not in stopwords.words("english")]
    text = ' '.join(t)
    return text

    4 Text Pre-Processing

    4.1 (Text Cleaning)

I have already described this part in the previous post of this series.

    4.2 Tokenization

    Tokenisation is a technique for breaking down a piece of text into small units, called tokens. A token may be a word, part of a word or just characters like punctuation.

    Tokenisation can therefore be roughly divided into three groups:

    • Word Tokenization
    • Character Tokenization and
    • Partial Word Tokenization (n-gram characters)

    In the following I will present two tokenizers:

    • Word Tokenizer
    • Sentence Tokenizer

Of course there are more. Find the one which fits best to your data or your problem.

text_for_tokenization = \
"Hi my name is Michael. \
I am an enthusiastic Data Scientist. \
Currently I am working on a post about NLP, more specifically about the Pre-Processing Steps."

text_for_tokenization

    4.2.1 Word Tokenizer

    To break a sentence into words, the word_tokenize() function can be used. Based on this, further text cleaning steps can be taken such as removing stop words or normalising text blocks. In addition, machine learning models need numerical data to be trained and make predictions. Again, tokenisation of words is a crucial part of converting text into numerical data.

words = word_tokenize(text_for_tokenization)
print(words)

    print('Number of tokens found: ' + str(len(words)))

    4.2.2 Sentence Tokenizer

    Now the question arises, why do I actually need to tokenise sentences when I can tokenise individual words?

An example of use would be if you want to count the average number of words per sentence. How can I do that with the Word Tokenizer alone? I can't; I need both the sent_tokenize() function and the word_tokenize() function to calculate the ratio, as sketched after the next code block.

sentences = sent_tokenize(text_for_tokenization)
print(sentences)

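A minimal sketch of the words-per-sentence ratio mentioned above (the variable name is illustrative):

print('Number of sentences found: ' + str(len(sentences)))

# average number of words per sentence = total words / total sentences
avg_words_per_sentence = len(word_tokenize(text_for_tokenization)) / len(sentences)
print('Average words per sentence: ' + str(avg_words_per_sentence))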

    4.2.3 Application to the Example String

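A minimal sketch of applying the Word Tokenizer to the example string clean_text loaded above (the variable name is illustrative):

tokens_example = word_tokenize(clean_text)
print(tokens_example)
print('Number of tokens found: ' + str(len(tokens_example)))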

    4.2.4 Application to the DataFrame

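A minimal sketch of tokenizing the review column; I assume the source column is 'Clean_Reviews' as above, and the new column name is illustrative:

df['Reviews_Tokenized'] = df['Clean_Reviews'].apply(word_tokenize)
df.head()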

Here I set a limit for the column width so that the output remains readable. This setting should be reset at the end, otherwise it will persist for the rest of the session.

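A sketch of limiting and later resetting the column width via pandas display options (the width value is arbitrary):

pd.set_option('display.max_colwidth', 30)   # truncate wide cells in the output
df.head()

# ... and at the end of the analysis:
pd.reset_option('display.max_colwidth')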

It is always worthwhile (I have made a habit of doing this) to display the number of remaining words or tokens and also to store it in the data set. The advantage is that, especially in later process steps, it is very quick and easy to see what influence an operation has had on the quality of my information. This can of course only be checked on a random basis, but it makes it easy to see whether an applied function had unintended negative effects. It also helps when deciding which type of algorithm (for example, in normalisation) fits the data better.

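A minimal sketch of storing the token count, assuming the tokenized column from the sketch above:

df['Token_Count'] = df['Reviews_Tokenized'].apply(len)
df.head()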

    Ok interesting, the average number of words has increased slightly. Let’s take a look at what caused that:

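A sketch of the comparison; I assume a Word_Count column from the previous post exists, and df_subset is an illustrative name for the rows where the counts diverge:

print('Average words per review:  ' + str(df['Word_Count'].mean()))
print('Average tokens per review: ' + str(df['Token_Count'].mean()))

# collect the reviews where tokenization changed the count
df_subset = df[df['Token_Count'] != df['Word_Count']]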

    Note: In the following I do not take the first row from the sorted dataset, but from the created dataset df_subset.

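A sketch of inspecting the first diverging row and its tokens, continuing the sketch above:

example_review = df_subset['Clean_Reviews'].iloc[0]
print(example_review)
print(word_tokenize(example_review))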

    Here we see the reason: The tokenizer has turned ‘cannot’ into ‘can not’.

    4.3 Stop Words

    Stop words are frequently used words such as I, a, an, in etc. They do not contribute significantly to the information content of a sentence, so it is advisable to remove them by storing a list of words that we consider stop words. The library nltk has such lists for 16 different languages that we can refer to.

    Here are the defined stop words for the English language:

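A sketch of displaying the list (the count printout is illustrative):

print(stopwords.words("english"))
print('Number of English stop words: ' + str(len(stopwords.words("english"))))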

Stop Words can be removed easily with the following function. However, the sentences must first be converted into word tokens; I explained in detail how to do this in the previous chapter.


def remove_english_stopwords_func(text):
    '''
    Removes Stop Words (also capitalized) from a string, if present

    Args:
        text (str): String to which the function is to be applied

    Returns:
        Clean string without Stop Words
    '''
    # check in lowercase; 'text' is expected to be a list of word tokens (see the note above)
    t = [token for token in text if token.lower() not in stopwords.words("english")]
    text = ' '.join(t)
    return text

    4.3.1 Application to the Example String

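A minimal sketch of applying the function to the example string (the variable names are illustrative):

tokens = word_tokenize(clean_text)
clean_text_wo_stop_words = remove_english_stopwords_func(tokens)
clean_text_wo_stop_words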

Note: After removing the stop words we need the word_count function again for counting, because the result is a single string again, not a list of tokens.

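Continuing the sketch above:

print('Number of words: ' + str(word_count_func(clean_text_wo_stop_words)))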

    4.3.2 Application to the DataFrame

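A minimal sketch for the DataFrame, assuming the column names used above (the new column names are illustrative):

# tokenize each review, remove its stop words, and store the word count
df['Reviews_wo_Stop_Words'] = df['Clean_Reviews'].apply(word_tokenize).apply(remove_english_stopwords_func)
df['Word_Count_wo_Stop_Words'] = df['Reviews_wo_Stop_Words'].apply(word_count_func)
df.head()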

    5 Conclusion

    In this part of the Text Pre-Processing series, I explained how tokenization works, how to use it, and showed how to remove Stop Words.

    How do you remove stop words from a DataFrame?

Here we have a DataFrame with a column named "tweet" that contains tweet text data. We can use Pandas apply with a lambda function and a list comprehension to remove the stop words declared in the NLTK library, as sketched below.
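A minimal sketch of that pattern (the sample data is illustrative):

import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df = pd.DataFrame({'tweet': ['this is a sample tweet about NLP']})
# rebuild each tweet from the words that are not stop words
df['tweet'] = df['tweet'].apply(
    lambda x: ' '.join([w for w in x.split() if w.lower() not in stop_words]))
print(df)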

    How do you remove stop words from a dataset in Python?

Using Python's Gensim library: all you have to do is import the remove_stopwords() method from the gensim.parsing.preprocessing module. Then pass the sentence from which you want to remove stop words to remove_stopwords(), which returns the text string without them.
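A minimal sketch (remove_stopwords uses Gensim's built-in stop word list):

from gensim.parsing.preprocessing import remove_stopwords

filtered = remove_stopwords("The movie was not good at all")
print(filtered)   # Gensim stop words such as 'the' and 'was' are dropped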

    How do I remove a specific word from a DataFrame in Python?

To remove characters or words from columns in a Pandas DataFrame, use the str.replace() method.
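A minimal sketch, with an illustrative column and word:

import pandas as pd

df = pd.DataFrame({'text': ['remove this badword here']})
# delete the exact word (word boundaries) from every row of the column
df['text'] = df['text'].str.replace(r'\bbadword\b', '', regex=True)
print(df)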

    How are stop words removed?

The removal of stop words is highly dependent on the task we are performing and the goal we want to achieve. For example, if we are training a model for the sentiment analysis task, we might not remove the stop words. Movie review: “The movie was not good at all.” Removing stop words such as 'not' would leave 'movie good', flipping the apparent sentiment, so here they should be kept.
