Cenk Yildiran

Writing about lots of unrelated topics…


Data Cleaning using Python | Part-4

This post follows up on Data Cleaning using Python | Part-3.

Checking duplicates in a data frame

Checking for duplicates in a data frame before creating a machine learning model, such as regression, is crucial because duplicates can bias the model by over-representing certain data points, leading to skewed predictions. They can also create misleading performance metrics if the same data appears in both training and testing sets, falsely inflating accuracy.

Additionally, duplicates increase computational cost without adding value and violate assumptions like independence of observations, which regression models often rely on. Removing duplicates ensures a clean, representative dataset, helping the model generalize better and produce more reliable results.

And this is how you can check whether your data frame includes duplicates:

#Check whether the "housing" data frame has duplicate rows, based on the 'Id' column:
housing[housing.duplicated(['Id'])]

The above code returns an empty data frame, which means there are no duplicated rows (based on the Id column).
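Had the check returned rows, drop_duplicates() could remove them while keeping the first occurrence of each Id. A minimal sketch on a toy frame (the Id and SalePrice values are made up for illustration):

```python
import pandas as pd

# Toy frame with a deliberately duplicated Id
df = pd.DataFrame({"Id": [1, 2, 2, 3], "SalePrice": [200, 250, 250, 300]})

# Rows flagged as duplicates on 'Id' (every occurrence after the first)
dupes = df[df.duplicated(["Id"])]
print(len(dupes))  # 1 duplicated row

# drop_duplicates keeps the first occurrence of each Id
deduped = df.drop_duplicates(subset=["Id"])
print(len(deduped))  # 3 rows remain
```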

Another check is to look at the index of the data frame:

#Check whether the index of the "housing" data frame is unique:
housing.index.is_unique

The index of the housing data frame is unique.
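If the index were not unique, reset_index(drop=True) is one way to repair it by assigning a fresh, guaranteed-unique RangeIndex. A quick sketch with a toy Series:

```python
import pandas as pd

# A toy Series whose index label 1 appears twice
s = pd.Series([10, 20, 30], index=[0, 1, 1])
print(s.index.is_unique)  # False: label 1 is repeated

# reset_index(drop=True) discards the old labels and rebuilds a unique index
s = s.reset_index(drop=True)
print(s.index.is_unique)  # True
```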


Checking for missing values

Missing values are important to address because they can negatively impact the performance and reliability of a machine learning model. If left untreated, missing values can distort statistical calculations, bias the model, and lead to incorrect predictions. For instance, regression models may fail if key variables contain missing data, or they might underestimate relationships if rows with missing values are automatically excluded. Moreover, missing values can indicate meaningful patterns in the data (e.g., missing entries might reveal underlying processes), which should be carefully analyzed. Proper handling of missing values ensures a complete, accurate dataset, leading to better model performance and insights.

To check for missing values in the housing data frame, you can run the code below:

import matplotlib.pyplot as plt

# Count the number of missing values (NaN) in each column of the 'housing' DataFrame
total = housing.isnull().sum().sort_values(ascending=False)

# Select the top 20 columns with the highest number of missing values
total_select = total.head(20)

# Plot a bar chart of the missing values for the selected columns
total_select.plot(kind="bar", figsize=(8,6), fontsize=10)

# Label the axes with font size 20
plt.xlabel("Columns", fontsize=20)
plt.ylabel("Count", fontsize=20)

# Set the title of the plot as "Total Missing Values" with font size 20
plt.title("Total Missing Values", fontsize=20)
plt.show()

Here is how you can check which rows have missing values in the LotFrontage variable:

#Checking which rows have missing values in the "LotFrontage" variable:
missing_rows = housing[housing["LotFrontage"].isnull()].index
print(missing_rows)

There are several ways to handle missing values. Let’s examine the LotFrontage variable to explore these options:

1. We can remove rows with missing values using the dropna() method. This eliminates all rows where the LotFrontage feature contains null values (note that dropna() returns a new DataFrame rather than modifying housing in place):

#Remove rows from the DataFrame "housing" where the column "LotFrontage" has missing (NaN) values:
housing.dropna(subset=["LotFrontage"])

2. We can remove an entire column with missing values using the drop() method. This will completely eliminate the column that contains null values.

# Drop the "LotFrontage" column from the DataFrame (drop() also returns a new DataFrame)
housing.drop("LotFrontage", axis=1)

3. We can replace missing values with a specific value (e.g., zero, the mean, or the median) using the fillna() method. Note that calling fillna() with inplace=True on a single column can raise chained-assignment warnings in recent pandas versions, so assigning the result back is the safer idiom:

# Compute the median of "LotFrontage" and fill the missing values with it:
median = housing["LotFrontage"].median()
housing["LotFrontage"] = housing["LotFrontage"].fillna(median)
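A single global median can blur differences between subgroups. One common refinement is a group-wise median fill; the sketch below assumes a grouping column such as Neighborhood exists (it does in the Ames housing dataset, but verify for your own data) and uses a toy frame for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame; the 'Neighborhood' column is an assumption about the data
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 70.0, 90.0, 100.0, np.nan],
})

# Fill each missing LotFrontage with the median of its own neighborhood
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy["LotFrontage"].tolist())  # [60.0, 65.0, 70.0, 90.0, 100.0, 95.0]
```

Here neighborhood A's missing value becomes 65.0 (the median of 60 and 70) rather than the global median, preserving the difference between areas.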

Now check again if LotFrontage has missing values:

#Checking again for missing values in the "LotFrontage" variable:
missing_rows = housing[housing["LotFrontage"].isnull()].index
print(missing_rows)

I’ll keep this post brief and to the point. In the next post, I’ll explore feature scaling.




