Preprocessing and Feature Engineering
Preprocessing and feature engineering are essential steps in preparing data for machine learning tasks. These techniques help improve the quality of data, extract meaningful features, and enhance the performance of predictive models. In this section, we will explore several common preprocessing and feature engineering techniques with corresponding Python code examples.
Removing Stopwords
Stopwords are very common words, typically articles, prepositions, and conjunctions, that carry little meaning on their own. Removing them reduces noise and lets a model focus on the more informative words in text data. Here's an example of removing stopwords using the Natural Language Toolkit (NLTK) library:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "This is an example sentence with some stopwords."
filtered_text = [word for word in text.split() if word.lower() not in stop_words]
print(filtered_text) # ['example', 'sentence', 'stopwords.']
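Note that `text.split()` leaves punctuation attached to words, which is why the last token above is 'stopwords.' rather than 'stopwords'. The following is a minimal sketch of one way to handle this, assuming a simple regex tokenizer and a small hand-written stopword set. Both `SMALL_STOP_WORDS` and `remove_stopwords` are hypothetical names used for illustration, not part of NLTK, and they are used here only so the snippet runs without downloading a corpus:

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# in practice this could be set(stopwords.words('english')) from NLTK.
SMALL_STOP_WORDS = {"this", "is", "an", "with", "some"}

def remove_stopwords(text, stop_words=SMALL_STOP_WORDS):
    """Lowercase the text, keep only runs of word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

filtered_tokens = remove_stopwords("This is an example sentence with some stopwords.")
print(filtered_tokens)  # ['example', 'sentence', 'stopwords']
```

Tokenizing before filtering means punctuation never reaches the stopword check, at the cost of also lowercasing every token.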
Lemmatization
Lemmatization reduces each word to its dictionary form (its lemma), so that inflected variants such as "runs" and "run" are treated as the same token. This normalizes the text and shrinks the vocabulary, which lowers the dimensionality of text features. The following example demonstrates lemmatization using the NLTK library:
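import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
text = "Running, ran, and runs are all forms of the word run."
# lemmatize() treats every word as a noun unless a part-of-speech tag is passed,
# and text.split() leaves punctuation attached, so only 'runs' and 'forms' change here.
lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]
print(lemmatized_text) # ['Running,', 'ran,', 'and', 'run', 'are', 'all', 'form', 'of', 'the', 'word', 'run.']
print(lemmatizer.lemmatize('ran', pos='v')) # 'run' (verb forms need pos='v')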
Data Cleaning
Data cleaning is a crucial step in preparing data for analysis. It involves handling missing values, removing duplicates, and correcting inconsistencies. Here's an example of data cleaning using the pandas library:
import pandas as pd
data = {'Name': ['John', 'Jane', 'John', 'Adam', 'Jane'],
        'Age': [25, 30, None, 35, 28],
        'City': ['New York', 'London', 'New York', 'Paris', None]}
df = pd.DataFrame(data)
# Handling missing values: fill the missing Age with the column mean (29.5)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Removing duplicates (no two rows here are fully identical, so all five are kept)
df = df.drop_duplicates()
# Handling inconsistent values
df['City'] = df['City'].replace({'London': 'UK'})
print(df) # Output (the missing City may display as NaN on some pandas versions):
# Name Age City
# 0 John 25.0 New York
# 1 Jane 30.0 UK
# 2 John 29.5 New York
# 3 Adam 35.0 Paris
# 4 Jane 28.0 None
For text columns, data cleaning also involves removing unwanted characters such as punctuation, converting text to lowercase, and filling in missing values. Here's an example of these operations:
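import pandas as pd
df = pd.DataFrame({'Text': ['Hello! How are you?', 'This is a test.', 'Good morning!']})
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation (regex=True is required in pandas 2.0+)
df['Text'] = df['Text'].str.lower() # Convert text to lowercase
df['Text'] = df['Text'].fillna('') # Handle missing values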
Input DataFrame:
Text |
---|
Hello! How are you? |
This is a test. |
Good morning! |
Output DataFrame:
Text |
---|
hello how are you |
this is a test |
good morning |
Filtering, Groupby, Apply, Joining Columns, Mapping
Pandas provides powerful operations for data manipulation and feature engineering. Here are some examples of common operations:
Filtering
Filtering allows us to select specific rows from a DataFrame based on certain conditions. Here's an example:
import pandas as pd
data = {'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
'Age': [25, 30, 35, 28, 32],
'City': ['New York', 'London', 'Paris', 'London', 'Paris']}
df = pd.DataFrame(data)
filtered_df = df[df['Age'] > 30]
Input DataFrame:
Name | Age | City |
---|---|---|
John | 25 | New York |
Jane | 30 | London |
Adam | 35 | Paris |
Emily | 28 | London |
Michael | 32 | Paris |
Output DataFrame:
Name | Age | City |
---|---|---|
Adam | 35 | Paris |
Michael | 32 | Paris |
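Filters are often built from more than one condition. The sketch below reuses the same sample data and shows two common patterns: combining boolean conditions with `&`, and matching against a list of values with `isin`. The variable names `older_in_paris` and `in_london_or_paris` are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Each condition is a boolean Series; wrap each one in parentheses and
# combine them with & (and) or | (or) rather than Python's `and` / `or`.
older_in_paris = df[(df['Age'] > 30) & (df['City'] == 'Paris')]

# isin() keeps rows whose value appears in the given list.
in_london_or_paris = df[df['City'].isin(['London', 'Paris'])]

print(older_in_paris['Name'].tolist())      # ['Adam', 'Michael']
print(in_london_or_paris['Name'].tolist())  # ['Jane', 'Adam', 'Emily', 'Michael']
```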
Groupby
The groupby operation lets us group rows by the values in one column and compute aggregations over the other columns. Here's an example:
Input DataFrame:
Name | Age | City |
---|---|---|
John | 25 | New York |
Jane | 30 | London |
Adam | 35 | Paris |
Emily | 28 | London |
Michael | 32 | Paris |
Output DataFrame:
City | Mean Age |
---|---|
New York | 25 |
London | 29 |
Paris | 33.5 |
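The tables above could be produced with code along the following lines. This is a minimal sketch, assuming the same sample data as the filtering example; the result name `mean_age_by_city` is an illustrative choice, and `sort=False` is used only so the cities appear in the same order as in the output table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Group rows by city and average the ages within each group.
# sort=False keeps the cities in order of first appearance.
mean_age_by_city = (
    df.groupby('City', sort=False)['Age']
      .mean()
      .reset_index(name='Mean Age')
)
print(mean_age_by_city)
```

Other aggregations such as `sum()`, `count()`, or `max()` can be substituted for `mean()` in the same position.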
Apply
The apply function allows us to apply a custom function to each element or row/column of a DataFrame. Here's an example:
Input DataFrame:
Name | Age | City |
---|---|---|
John | 25 | New York |
Jane | 30 | London |
Adam | 35 | Paris |
Emily | 28 | London |
Michael | 32 | Paris |
Output DataFrame:
Name | Age | City | Name Length |
---|---|---|---|
John | 25 | New York | 4 |
Jane | 30 | London | 4 |
Adam | 35 | Paris | 4 |
Emily | 28 | London | 5 |
Michael | 32 | Paris | 7 |
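One way to produce the Name Length column shown above is sketched below, again assuming the same sample data. The second call illustrates row-wise use with `axis=1`; the `Label` column it creates is a hypothetical addition, not part of the tables above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'London', 'Paris'],
})

# Series.apply calls the function once per value; here the built-in len.
df['Name Length'] = df['Name'].apply(len)

# DataFrame.apply with axis=1 passes each row (as a Series) to the function.
# 'Label' is a hypothetical column added only to illustrate row-wise use.
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['City']})", axis=1)

print(df[['Name', 'Name Length', 'Label']])
```

For simple cases like string length, the vectorized `df['Name'].str.len()` gives the same result and is usually faster than `apply`.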
Joining Columns
Joining columns allows us to concatenate or combine multiple columns into a single column. Here's an example:
Input DataFrame:
First Name | Last Name | Age |
---|---|---|
John | Doe | 25 |
Jane | Johnson | 30 |
Adam | Smith | 35 |
Emily | Brown | 28 |
Michael | Davis | 32 |
Output DataFrame:
First Name | Last Name | Age | Full Name |
---|---|---|---|
John | Doe | 25 | John Doe |
Jane | Johnson | 30 | Jane Johnson |
Adam | Smith | 35 | Adam Smith |
Emily | Brown | 28 | Emily Brown |
Michael | Davis | 32 | Michael Davis |
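A minimal sketch of how the Full Name column above might be built, assuming the input is held in a DataFrame with the column names shown. The variable `full_name_alt` is an illustrative name for an equivalent result computed with the string accessor:

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['John', 'Jane', 'Adam', 'Emily', 'Michael'],
    'Last Name': ['Doe', 'Johnson', 'Smith', 'Brown', 'Davis'],
    'Age': [25, 30, 35, 28, 32],
})

# Element-wise string concatenation with a space between the two parts.
df['Full Name'] = df['First Name'] + ' ' + df['Last Name']

# Equivalent result using the string accessor's cat() method.
full_name_alt = df['First Name'].str.cat(df['Last Name'], sep=' ')

print(df['Full Name'].tolist())
```

Both forms propagate missing values, so a row with a missing first or last name yields a missing full name unless it is filled beforehand.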
Mapping
Mapping allows us to create a new column by mapping values from an existing column using a predefined mapping dictionary. Here's an example:
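import pandas as pd
df = pd.DataFrame({'Color': ['red', 'green', 'blue'], 'Quantity': [5, 3, 2]})
category_mapping = {'red': 1, 'blue': 2, 'green': 3}
df['Category'] = df['Color'].map(category_mapping)
Input DataFrame:
Color | Quantity |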
---|---|
red | 5 |
green | 3 |
blue | 2 |
Output DataFrame:
Color | Quantity | Category |
---|---|---|
red | 5 | 1 |
green | 3 | 3 |
blue | 2 | 2 |
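One detail worth knowing is that `map` returns NaN for any value that has no entry in the dictionary. The sketch below illustrates this with an extra 'yellow' row, which is a hypothetical addition to the sample data, and fills the gap with a sentinel value of -1 (also an assumption; the right default depends on the task):

```python
import pandas as pd

# 'yellow' is a hypothetical extra row that has no entry in the mapping.
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'yellow'],
    'Quantity': [5, 3, 2, 7],
})
category_mapping = {'red': 1, 'blue': 2, 'green': 3}

# Unmapped values become NaN, which also turns the column into floats.
df['Category'] = df['Color'].map(category_mapping)

# Replace NaN with a sentinel and restore an integer dtype.
df['Category'] = df['Category'].fillna(-1).astype(int)

print(df['Category'].tolist())  # [1, 3, 2, -1]
```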
These examples demonstrate the use of various pandas operations for filtering, grouping, applying functions, joining columns, and mapping values. These techniques enable effective preprocessing and feature engineering, which are crucial for enhancing the quality and usability of data in machine learning tasks.