Preprocessing and Feature Engineering

Preprocessing and feature engineering are essential steps in preparing data for machine learning tasks. These techniques help improve the quality of data, extract meaningful features, and enhance the performance of predictive models. In this section, we will explore several common preprocessing and feature engineering techniques with corresponding Python code examples.

Removing Stopwords

Stopwords are commonly used words that do not carry significant meaning in a given language. They often include articles, prepositions, and conjunctions. Removing stopwords can help reduce noise and focus on the most important words in text data. Here's an example of removing stopwords using the Natural Language Toolkit (NLTK) library:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

text = "This is an example sentence with some stopwords."

filtered_text = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_text) # ['example', 'sentence', 'stopwords.']
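
Note that splitting on whitespace leaves punctuation attached to tokens, which is why the result above contains 'stopwords.' with a trailing period. One possible refinement, sketched below, is to tokenize with a regular expression before filtering. The `remove_stopwords` helper and the small `STOP_WORDS` set are assumptions made for illustration so the snippet runs without downloading NLTK data; in practice the NLTK stopword list shown above would take its place.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# in practice this would be set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"this", "is", "an", "with", "some", "a", "the", "and", "of"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep word-like tokens, and drop stopwords."""
    # findall keeps runs of letters, digits, and apostrophes,
    # so punctuation such as '.' or ',' never becomes part of a token
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


result = remove_stopwords("This is an example sentence with some stopwords.")
print(result)  # ['example', 'sentence', 'stopwords']
```

Lowercasing before tokenizing also makes the stopword lookup case-insensitive, at the cost of losing the original capitalization.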

Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (the lemma), usually to normalize text. This helps reduce the dimensionality of text data while preserving the core meaning of words. The following example demonstrates lemmatization using the NLTK library:

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

text = "Running, ran, and runs are all forms of the word run."

lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]

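print(lemmatized_text) # ['Running,', 'ran,', 'and', 'run', 'are', 'all', 'form', 'of', 'the', 'word', 'run.']
# 'Running,' and 'ran,' are unchanged: split() leaves punctuation attached, and lemmatize() treats words as nouns unless a pos argument is given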

Data Cleaning

Data cleaning is a crucial step in preparing data for analysis. It involves handling missing values, removing duplicates, and correcting inconsistencies. Here's an example of data cleaning using the pandas library:

import pandas as pd

data = {'Name': ['John', 'Jane', 'John', 'Adam', 'Jane'],
        'Age': [25, 30, None, 35, 28],
        'City': ['New York', 'London', 'New York', 'Paris', None]}

df = pd.DataFrame(data)

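# Handling missing values: fill the missing age with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Removing duplicates: rows with the same Name and City count as duplicates
df.drop_duplicates(subset=['Name', 'City'], inplace=True)

# Handling inconsistent values
df['City'] = df['City'].replace({'London': 'UK'})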

print(df) # Output:

   Name   Age      City
0  John  25.0  New York
1  Jane  30.0        UK
3  Adam  35.0     Paris
4  Jane  28.0      None
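
Before cleaning, it usually helps to measure how many values are missing and which rows would count as duplicates. The following is a minimal sketch on the same sample data; treating rows with the same Name and City as duplicates is an assumption made for this example, not a general rule.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Adam', 'Jane'],
    'Age': [25, 30, None, 35, 28],
    'City': ['New York', 'London', 'New York', 'Paris', None],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Flag rows that repeat an earlier (Name, City) combination
duplicate_mask = df.duplicated(subset=['Name', 'City'])

print(missing_per_column)
print(df[duplicate_mask])
```

Here Age and City each have one missing value, and only the second John row (index 2) is flagged as a duplicate.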

For text columns, data cleaning also involves removing unwanted characters, converting text to lowercase, and filling in missing values. Here's an example that assumes a DataFrame with a Text column:

df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation (regex=True is required in pandas 2.0+)
df['Text'] = df['Text'].str.lower()  # Convert text to lowercase
df['Text'] = df['Text'].fillna('')  # Handle missing values

Input DataFrame:

Text
Hello! How are you?
This is a test.
Good morning!

Output DataFrame:

Text
hello how are you
this is a test
good morning
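
The same steps can be wrapped in a reusable function. This is a minimal sketch: the `clean_text` name, the extra whitespace-trimming step, and the sample rows (including the missing value) are assumptions added for illustration.

```python
import pandas as pd


def clean_text(series):
    """Return a cleaned copy of a text Series."""
    return (
        series.fillna('')                               # missing values become empty strings
              .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
              .str.lower()                              # convert to lowercase
              .str.strip()                              # trim surrounding whitespace
    )


df = pd.DataFrame({'Text': ['Hello! How are you?', None, '  Good morning!  ']})
df['Text'] = clean_text(df['Text'])

print(df['Text'].tolist())  # ['hello how are you', '', 'good morning']
```

Filling missing values first keeps every later step working on strings only.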

Filtering, Groupby, Apply, Joining Columns, Mapping

Pandas provides powerful operations for data manipulation and feature engineering. Here are some examples of common operations:

Filtering

Filtering allows us to select specific rows from a DataFrame based on certain conditions. Here's an example:

import pandas as pd

data = {'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
        'Age': [25, 30, 35, 28, 32],
        'City': ['New York', 'London', 'Paris', 'London', 'Paris']}

df = pd.DataFrame(data)

filtered_df = df[df['Age'] > 30]

Input DataFrame:

Name     Age  City
John     25   New York
Jane     30   London
Adam     35   Paris
Emily    28   London
Michael  32   Paris

Output DataFrame:

Name     Age  City
Adam     35   Paris
Michael  32   Paris
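
Conditions can also be combined. The sketch below uses the same sample data and selects people older than 28 who live in London or Paris; the threshold and the city list are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Each condition needs its own parentheses; & means "and", | means "or"
mask = (df['Age'] > 28) & (df['City'].isin(['London', 'Paris']))
filtered_df = df[mask]

print(filtered_df['Name'].tolist())  # ['Jane', 'Adam', 'Michael']
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.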

Groupby

The groupby operation allows us to group rows based on a specific column and perform aggregations on other columns. Here's an example:

grouped_df = df.groupby('City')['Age'].mean()

print(grouped_df)

Input DataFrame:

Name     Age  City
John     25   New York
Jane     30   London
Adam     35   Paris
Emily    28   London
Michael  32   Paris

Output (a Series indexed by City; groupby sorts the group keys alphabetically):

City      Mean Age
London    29.0
New York  25.0
Paris     33.5
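
Several aggregations can be computed in one pass with named aggregation. The following is a sketch on the same sample data; the output column names `mean_age` and `people` are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    people=('Name', 'count'),
)

print(summary)
```

The result is a DataFrame indexed by City with one row per group, for example a mean age of 33.5 across the two people in Paris.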

Apply

The apply function allows us to apply a custom function to each element or row/column of a DataFrame. Here's an example:

df['Name Length'] = df['Name'].apply(len)  # call len on each name

Input DataFrame:

Name     Age  City
John     25   New York
Jane     30   London
Adam     35   Paris
Emily    28   London
Michael  32   Paris

Output DataFrame:

Name     Age  City      Name Length
John     25   New York  4
Jane     30   London    4
Adam     35   Paris     4
Emily    28   London    5
Michael  32   Paris     7
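
The apply function also works with named functions and, with axis=1, on whole rows. A minimal sketch follows; the `age_group` helper, its cutoff of 30, and the Label column are hypothetical and exist only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})


def age_group(age):
    """Bucket an age into one of two illustrative groups."""
    return 'under 30' if age < 30 else '30 and over'


# Element-wise: the function receives one value from the column at a time
df['Age Group'] = df['Age'].apply(age_group)

# Row-wise: axis=1 passes each row, so several columns can be combined
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['City']})", axis=1)

print(df[['Name', 'Age Group', 'Label']])
```

Row-wise apply is convenient but runs a Python function per row, so vectorized column operations are usually faster on large DataFrames.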

Joining Columns

Joining columns allows us to concatenate or combine multiple columns into a single column. Here's an example:

df['Full Name'] = df['First Name'] + ' ' + df['Last Name']

Input DataFrame:

First Name  Last Name  Age
John        Doe        25
Jane        Johnson    30
Adam        Smith      35
Emily       Brown      28
Michael     Davis      32

Output DataFrame:

First Name  Last Name  Age  Full Name
John        Doe        25   John Doe
Jane        Johnson    30   Jane Johnson
Adam        Smith      35   Adam Smith
Emily       Brown      28   Emily Brown
Michael     Davis      32   Michael Davis
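
When a column contains missing values or non-string data, simple concatenation with + needs extra care. The sketch below uses `Series.str.cat` with `na_rep` to tolerate a missing last name and `astype(str)` to include a numeric column. The sample rows, including the missing last name, and the Label column are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['John', 'Jane', 'Adam'],
    'Last Name': ['Doe', None, 'Smith'],
    'Age': [25, 30, 35],
})

# na_rep='' substitutes an empty string for missing values;
# strip() then removes the trailing space left behind for 'Jane'
df['Full Name'] = (
    df['First Name'].str.cat(df['Last Name'], sep=' ', na_rep='').str.strip()
)

# Numeric columns must be converted to strings before concatenation
df['Label'] = df['Full Name'] + ', ' + df['Age'].astype(str)

print(df[['Full Name', 'Label']])
```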

Mapping

Mapping allows us to create a new column by mapping values from an existing column using a predefined mapping dictionary. Here's an example:

category_mapping = {'red': 1, 'blue': 2, 'green': 3}
df['Category'] = df['Color'].map(category_mapping)

Input DataFrame:

Color  Quantity
red    5
green  3
blue   2

Output DataFrame:

Color  Quantity  Category
red    5         1
green  3         3
blue   2         2
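
Values that are missing from the mapping dictionary become NaN in the mapped column. The following sketch shows one way to handle that case; the extra 'yellow' row and the default category 0 are assumptions added for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'yellow'],
    'Quantity': [5, 3, 2, 4],
})

category_mapping = {'red': 1, 'blue': 2, 'green': 3}

# 'yellow' has no entry in the dictionary, so map() yields NaN for it
mapped = df['Color'].map(category_mapping)

# Replace unmapped values with a default category and restore integer dtype
df['Category'] = mapped.fillna(0).astype(int)

print(df)
```

Checking `mapped.isna()` before filling is a quick way to spot categories the dictionary does not cover.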

These examples demonstrate the use of various pandas operations for filtering, grouping, applying functions, joining columns, and mapping values. These techniques enable effective preprocessing and feature engineering, which are crucial for enhancing the quality and usability of data in machine learning tasks.