Preprocessing and Feature Engineering

Preprocessing and feature engineering are essential steps in preparing data for machine learning tasks. These techniques help improve the quality of data, extract meaningful features, and enhance the performance of predictive models. In this section, we will explore several common preprocessing and feature engineering techniques with corresponding Python code examples.

Removing Stopwords

Stopwords are very common words, such as articles, prepositions, and conjunctions, that carry little meaning on their own. Removing them can reduce noise and keep the focus on the more informative words in text data. Here's an example of removing stopwords using the Natural Language Toolkit (NLTK) library:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

text = "This is an example sentence with some stopwords."

filtered_text = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_text) # ['example', 'sentence', 'stopwords.']
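
The example above splits on whitespace, so punctuation stays attached to tokens (note 'stopwords.' in the output), and a stopword followed by a comma would not be filtered. Below is a minimal sketch of one way around this. It assumes a regular-expression tokenizer and a small hand-picked stopword set; both are illustrative choices, not part of the original example, and the NLTK list could be used in place of the set.

```python
import re

# Illustrative subset of English stopwords (an assumption for this sketch;
# in practice stopwords.words('english') from NLTK could be used instead).
stop_words = {"this", "is", "an", "with", "some", "a", "the", "and"}

text = "This is an example sentence, with some stopwords."

# Lowercase the text and keep only runs of letters and apostrophes,
# which drops punctuation before the stopword comparison.
tokens = re.findall(r"[a-z']+", text.lower())

filtered_tokens = [token for token in tokens if token not in stop_words]

print(filtered_tokens)  # ['example', 'sentence', 'stopwords']
```

Tokenizing before filtering means "sentence," and "sentence" are treated as the same word, which usually matters more than the choice of stopword list.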

Lemmatization

Lemmatization reduces words to their dictionary form (the lemma) so that inflected variants are treated as the same token. This shrinks the vocabulary of text data while preserving the core meaning of words. The following example demonstrates lemmatization using the NLTK library:

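import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

text = "Running, ran, and runs are all forms of the word run."

# lemmatize() treats every token as a noun unless a part of speech is passed
# via pos, so verb forms such as 'ran' are left unchanged here. Tokens with
# punctuation attached ('Running,', 'run.') are not recognized either.
lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]

print(lemmatized_text) # ['Running,', 'ran,', 'and', 'run', 'are', 'all', 'form', 'of', 'the', 'word', 'run.']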

Data Cleaning

Data cleaning is a crucial step in preparing data for analysis. It involves handling missing values, removing duplicates, and correcting inconsistencies. Here's an example of data cleaning using the pandas library:

import pandas as pd

data = {'Name': ['John', 'Jane', 'John', 'Adam', 'Jane'],
        'Age': [25, 30, None, 35, 28],
        'City': ['New York', 'London', 'New York', 'Paris', None]}

df = pd.DataFrame(data)

# Handling missing values: fill the missing age with the column mean (29.5)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Removing duplicates: only rows identical in every column are dropped,
# so no rows are removed from this sample
df = df.drop_duplicates()

# Handling inconsistent values
df['City'] = df['City'].replace({'London': 'UK'})

print(df) # Output:

   Name   Age      City
0  John  25.0  New York
1  Jane  30.0        UK
2  John  29.5  New York
3  Adam  35.0     Paris
4  Jane  28.0      None
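
Before filling or dropping anything, it usually helps to check how many values are missing and which rows count as duplicates. The sketch below reuses the sample data and assumes that a repeated Name is what should count as a duplicate; that rule is an illustrative assumption, not something the data itself dictates.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Adam', 'Jane'],
    'Age': [25, 30, None, 35, 28],
    'City': ['New York', 'London', 'New York', 'Paris', None],
})

# Count missing values per column before deciding how to fill them.
missing_counts = df.isna().sum()

# Flag rows whose Name has already appeared earlier in the frame.
repeated_names = df.duplicated(subset=['Name'])

# Keep only the first row for each Name (assumed deduplication rule).
deduplicated = df.drop_duplicates(subset=['Name'], keep='first')

print(missing_counts)
print(repeated_names.tolist())        # [False, False, True, False, True]
print(deduplicated['Name'].tolist())  # ['John', 'Jane', 'Adam']
```

Passing subset restricts the comparison to the listed columns, whereas drop_duplicates() without arguments only removes rows that match in every column.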

Cleaning text columns often also involves removing unwanted characters, converting text to lowercase, and filling in missing values. Here's an example of these operations on a Text column:

df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
df['Text'] = df['Text'].str.lower()  # Convert text to lowercase
df['Text'] = df['Text'].fillna('')  # Handle missing values

Input DataFrame:

Text
Hello! How are you?
This is a test.
Good morning!

Output DataFrame:

Text
hello how are you
this is a test
good morning
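
The same steps can be sketched as one self-contained, chained expression. The extra missing row is an assumption added here to show the effect of fillna; it is not part of the input table above.

```python
import pandas as pd

# Sample rows from the table above, plus one missing value (an assumption
# added to show what fillna does).
df = pd.DataFrame({
    'Text': ['Hello! How are you?', 'This is a test.', None, 'Good morning!']
})

cleaned = (
    df['Text']
    .fillna('')                               # missing values become empty strings
    .str.replace(r'[^\w\s]', '', regex=True)  # drop everything except word chars and whitespace
    .str.lower()                              # normalize case
)

print(cleaned.tolist())
# ['hello how are you', 'this is a test', '', 'good morning']
```

The regex=True argument matters: without it, recent pandas versions treat the pattern as a literal string and leave the punctuation in place.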

Filtering, Groupby, Apply, Joining Columns, Mapping

Pandas provides powerful operations for data manipulation and feature engineering. Here are some examples of common operations:

Filtering

Filtering allows us to select specific rows from a DataFrame based on certain conditions. Here's an example:

import pandas as pd

data = {'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
        'Age': [25, 30, 35, 28, 32],
        'City': ['New York', 'London', 'Paris', 'London', 'Paris']}

df = pd.DataFrame(data)

filtered_df = df[df['Age'] > 30]

Input DataFrame:

Name	Age	City
John	25	New York
Jane	30	London
Adam	35	Paris
Emily	28	London
Michael	32	Paris

Output DataFrame:

Name	Age	City
Adam	35	Paris
Michael	32	Paris
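
Conditions can also be combined. The following sketch, using the same sample data, shows a compound filter and a membership test; the thresholds and city lists are arbitrary illustrative values.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
london_over_26 = df[(df['Age'] > 26) & (df['City'] == 'London')]

# isin() tests membership in a list of allowed values.
paris_or_new_york = df[df['City'].isin(['Paris', 'New York'])]

print(london_over_26['Name'].tolist())     # ['Jane', 'Emily']
print(paris_or_new_york['Name'].tolist())  # ['John', 'Adam', 'Michael']
```

The parentheses are required because & and | bind more tightly than comparison operators in Python.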

Groupby

The groupby operation allows us to group rows based on a specific column and perform aggregations on other columns. The result is indexed by the group key and sorted by it by default. Here's an example:

grouped_df = df.groupby('City')['Age'].mean()

print(grouped_df)

Input DataFrame:

Name	Age	City
John	25	New York
Jane	30	London
Adam	35	Paris
Emily	28	London
Michael	32	Paris

Output Series (indexed by City, sorted alphabetically):

City	Mean Age
London	29.0
New York	25.0
Paris	33.5
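
Several aggregations can be computed in one pass with agg. This is a minimal sketch on the same sample data; the output column names mean_age and people are hypothetical labels chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Named aggregation: new_column=(source_column, aggregation_function).
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    people=('Name', 'count'),
)

print(summary)
print(summary.index.tolist())  # ['London', 'New York', 'Paris']
```

Unlike the single-column mean, this returns a DataFrame with one row per city and one column per aggregation.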

Apply

The apply function allows us to apply a custom function to each element or row/column of a DataFrame. Here's an example:

df['Name Length'] = df['Name'].apply(len)  # apply the built-in len to each name

Input DataFrame:

Name	Age	City
John	25	New York
Jane	30	London
Adam	35	Paris
Emily	28	London
Michael	32	Paris

Output DataFrame:

Name	Age	City	Name Length
John	25	New York	4
Jane	30	London	4
Adam	35	Paris	4
Emily	28	London	5
Michael	32	Paris	7
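
apply can also work row by row when axis=1 is passed, and simple string operations often have a vectorized equivalent under the .str accessor. The sketch below shows both; the Label column is a hypothetical feature invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Vectorized alternative to apply(len) for string columns.
df['Name Length'] = df['Name'].str.len()

# Row-wise apply: the function receives each row as a Series.
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['City']})", axis=1)

print(df['Name Length'].tolist())  # [4, 4, 4, 5, 7]
print(df['Label'].tolist()[:2])    # ['John (New York)', 'Jane (London)']
```

Row-wise apply is flexible but runs a Python function per row, so vectorized operations are generally preferable on large frames when one exists.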

Joining Columns

Joining columns allows us to concatenate or combine multiple columns into a single column. Here's an example:

df['Full Name'] = df['First Name'] + ' ' + df['Last Name']

Input DataFrame:

First Name	Last Name	Age
John	Doe	25
Jane	Johnson	30
Adam	Smith	35
Emily	Brown	28
Michael	Davis	32

Output DataFrame:

First Name	Last Name	Age	Full Name
John	Doe	25	John Doe
Jane	Johnson	30	Jane Johnson
Adam	Smith	35	Adam Smith
Emily	Brown	28	Emily Brown
Michael	Davis	32	Michael Davis
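
Two details tend to come up when joining columns: missing values and non-string columns. The following sketch assumes a missing last name and a hypothetical Label column, neither of which appears in the table above.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['John', 'Jane', 'Adam'],
    'Last Name': ['Doe', None, 'Smith'],  # the missing value is an assumption for this sketch
    'Age': [25, 30, 35],
})

# str.cat with na_rep substitutes an empty string for missing values;
# strip() then removes the dangling separator.
df['Full Name'] = (
    df['First Name']
    .str.cat(df['Last Name'], sep=' ', na_rep='')
    .str.strip()
)

# Numeric columns must be converted to strings before concatenation.
df['Label'] = df['Full Name'] + ', ' + df['Age'].astype(str)

print(df['Full Name'].tolist())  # ['John Doe', 'Jane', 'Adam Smith']
print(df['Label'].tolist())      # ['John Doe, 25', 'Jane, 30', 'Adam Smith, 35']
```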

Mapping

Mapping allows us to create a new column by mapping values from an existing column using a predefined mapping dictionary. Here's an example:

category_mapping = {'red': 1, 'blue': 2, 'green': 3}
df['Category'] = df['Color'].map(category_mapping)

Input DataFrame:

Color	Quantity
red	5
green	3
blue	2

Output DataFrame:

Color	Quantity	Category
red	5	1
green	3	3
blue	2	2
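
Values that are missing from the mapping dictionary become NaN, which also turns the new column into floats. Here is a minimal sketch of detecting and handling that case; the extra 'yellow' row and the fallback code 0 are assumptions made for illustration.

```python
import pandas as pd

category_mapping = {'red': 1, 'blue': 2, 'green': 3}

# 'yellow' is deliberately absent from the mapping (illustrative assumption).
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'yellow'],
    'Quantity': [5, 3, 2, 4],
})

df['Category'] = df['Color'].map(category_mapping)

# List the values that had no entry in the mapping.
unmapped = df.loc[df['Category'].isna(), 'Color'].tolist()

# Fall back to 0 for unknown colors (assumed placeholder) and restore integers.
df['Category'] = df['Category'].fillna(0).astype(int)

print(unmapped)                 # ['yellow']
print(df['Category'].tolist())  # [1, 3, 2, 0]
```

Checking for unmapped values before filling them makes it easier to notice categories that should have been added to the dictionary.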

These examples demonstrate the use of various pandas operations for filtering, grouping, applying functions, joining columns, and mapping values. These techniques enable effective preprocessing and feature engineering, which are crucial for enhancing the quality and usability of data in machine learning tasks.