In the real world, unlike in tutorials, raw data often contains duplicates, missing values, and irrelevant information. To prepare this data for use in a machine learning project, it’s essential to clean and preprocess it. In this post, I’ll guide you through handling duplicates, addressing missing values, and identifying outliers. Additionally, I’ll demonstrate how to transform data using logarithmic functions and explain techniques for standardizing and normalizing your dataset to improve its quality and usability. You can download the data from this address: https://www.kaggle.com/datasets/nickptaylor/iowa-house-prices

Importing the data
Below you can find all the packages required for the data-cleaning tasks in this post and the posts to come:
#start with importing the required packages:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import norm
from scipy import stats
Importing the data into your Jupyter Notebook:
#read the data:
housing = pd.read_csv("train.csv")
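Since the post opens by mentioning duplicates and missing values, a quick first check after loading is worthwhile. The sketch below uses a tiny synthetic frame in place of `housing` (the column names and values are illustrative, not from train.csv); with the real data you would run the same two calls on the loaded DataFrame:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for `housing` (illustration only; rows 0 and 2 are identical).
housing = pd.DataFrame({
    "LotArea": [8450, 9600, 8450, np.nan],
    "SaleCondition": ["Normal", "Abnorml", "Normal", "Partial"],
})

# Count fully duplicated rows.
n_dupes = housing.duplicated().sum()

# Count missing values per column.
missing = housing.isnull().sum()

print(n_dupes)   # number of duplicate rows
print(missing)   # per-column NaN counts
```

On the real dataset, `housing.duplicated().sum()` tells you whether any rows need dropping with `drop_duplicates()`, and `housing.isnull().sum()` flags which columns will need imputation later.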
Getting to know the variables/attributes:
Start by listing the column names:
#check the column names:
housing.columns

Checking the dataframe in the memory:
#check the first five rows:
housing.head(5)

Inspecting the data types of the variables:
#Check the datatype of the columns:
housing.info()


Using the info() function, you can quickly get an overview of your dataset. It reveals that you have 1,460 rows (indexed 0 to 1,459, since Python indexing starts at 0) and 81 columns. The target (or response) variable in this dataset is SalePrice, while the remaining 80 columns serve as predictor variables. Regarding data types, the columns include 3 float64, 35 int64, and 43 object (categorical) columns, providing a mix of numerical and categorical data to work with.
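Because the columns mix numeric and object dtypes, it is often handy to split them programmatically rather than reading them off the info() printout. A minimal sketch using `select_dtypes`, shown here on a small synthetic frame (the three columns are stand-ins for the real 81):

```python
import pandas as pd

# Synthetic stand-in for `housing` with one column of each flavor (illustration only).
housing = pd.DataFrame({
    "SalePrice": [208500, 181500],
    "LotFrontage": [65.0, 80.0],
    "SaleCondition": ["Normal", "Normal"],
})

# Split columns by dtype: numeric (int/float) vs. object (categorical).
numeric_cols = housing.select_dtypes(include="number").columns.tolist()
object_cols = housing.select_dtypes(include="object").columns.tolist()

print(numeric_cols)  # ['SalePrice', 'LotFrontage']
print(object_cols)   # ['SaleCondition']
```

These two lists are useful later, since numeric columns get describe()/scaling treatment while object columns go through value_counts() and encoding.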
Inspecting numeric variables using .describe() function
#looking at basic statistics for 'SalePrice' variable/attribute:
housing["SalePrice"].describe()

With 1,460 observations, the data shows an average sale price of approximately 180,921, but the median is lower at 163,000, indicating a right-skewed distribution influenced by a few high-priced properties. The prices range from 34,900 to 755,000, with a standard deviation of about 79,442, highlighting significant variability. The quartiles reveal that 25 percent of houses sold for 129,975 or less, while 75 percent sold for 214,000 or less, leaving the top 25 percent above this value. The skewness and large range suggest potential outliers, which may require further analysis or data transformation, such as log scaling, to improve modeling performance.
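Since the paragraph above suggests a log transformation to tame the right skew, here is a minimal sketch of the idea. The five prices below are synthetic (chosen to echo the quartile figures quoted above, not drawn from train.csv); on the real data you would apply the same transform to the full SalePrice column:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic right-skewed prices (illustration only; echoes the summary stats above).
prices = pd.Series([34900, 129975, 163000, 214000, 755000])

# log1p = log(1 + x): compresses the long right tail and is safe at zero.
log_prices = np.log1p(prices)

# Skewness should shrink markedly after the transform.
assert skew(log_prices) < skew(prices)
```

`np.log1p` pairs with `np.expm1` for converting model predictions back to the original price scale.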
Inspecting categorical variables using .value_counts()
The describe() function provides statistical insights into numeric attributes. To explore categorical (object) attributes, the value_counts() function is particularly useful. In this exercise, analyze all categories within the ‘SaleCondition’ variable/attribute using the value_counts() function.
housing["SaleCondition"].value_counts()

The SaleCondition attribute describes the circumstances of property sales. Most sales are “Normal” (1,198), followed by “Partial” (125) and “Abnorml” (101), indicating standard, partial, or abnormal conditions. Less common categories include “Family” (20), “Alloca” (12), and “AdjLand” (4), representing family-related, allocation, or land-adjusted sales.
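Raw counts like these are often easier to compare as proportions; `value_counts(normalize=True)` handles that directly. The sketch below rebuilds the SaleCondition counts quoted above as a synthetic Series, since the original CSV is not bundled here:

```python
import pandas as pd

# Synthetic Series reproducing the SaleCondition counts quoted above (illustration only).
sale_cond = pd.Series(
    ["Normal"] * 1198 + ["Partial"] * 125 + ["Abnorml"] * 101
    + ["Family"] * 20 + ["Alloca"] * 12 + ["AdjLand"] * 4
)

counts = sale_cond.value_counts()                  # raw frequencies, descending
shares = sale_cond.value_counts(normalize=True)    # same order, as proportions

print(counts.head(3))
print(round(shares["Normal"], 3))  # ≈ 0.821, i.e. ~82% of sales are "Normal"
```

On the real DataFrame the equivalent call is simply `housing["SaleCondition"].value_counts(normalize=True)`.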
I’ll keep this post concise and focused. In the next post, I’ll delve into how to examine correlations between the target variable and predictor variables.