Data Cleaning using Python | Part-2

This post follows up on the previous one.

Identifying the predictor variables most strongly correlated with the response variable

Before beginning the data cleaning process, it helps to examine the correlation between the response variable (in this case, SalePrice) and the predictor variables. This step can reveal features with little to no linear relationship with house prices, which are candidates for exclusion from the analysis.

There are several ways to assess the correlation between the target variable and other features, including pair plots, scatter plots, heat maps, and correlation matrices. In this analysis, we will use the pandas .corr() method to identify the top features based on the Pearson correlation coefficient, which quantifies the strength and direction of the linear relationship between two numerical variables on a scale from -1 to 1.
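To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with pandas. The helper name pearson_r and the two small series are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Made-up illustrative values, not real housing records
area = pd.Series([50, 80, 100, 120, 150])
price = pd.Series([110, 150, 210, 230, 300])

manual = pearson_r(area, price)
builtin = area.corr(price)  # Series.corr uses the Pearson method by default
print(round(manual, 4), round(builtin, 4))
```

Both numbers should agree, since .corr() applies the same formula unless another method is requested.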

Since the correlation coefficient applies only to numerical attributes (floats and integers), we will restrict our analysis to numeric features.

# Select only the numerical (float, int) columns into the hous_num DataFrame
hous_num = housing.select_dtypes(include=['float64', 'int64'])
# Correlate every numerical column with SalePrice, then drop SalePrice itself
# by name so the result does not depend on column order
hous_num_corr = hous_num.corr()['SalePrice'].drop('SalePrice')
# Keep only strong correlations (|corr| > 0.5), strongest first
top_features = hous_num_corr[hous_num_corr.abs() > 0.5].sort_values(ascending=False)
# Print the results
print("There are {} features strongly correlated with SalePrice:\n{}".format(len(top_features), top_features))

Ten of the numerical attributes have a correlation coefficient greater than 0.5, indicating a strong positive correlation with the SalePrice variable.
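Because the filter uses the absolute value, it would also keep strongly negative correlations if any existed. The sketch below shows this on a toy DataFrame; every column except SalePrice is invented for illustration, and the values are not from the housing data.

```python
import pandas as pd

# Hypothetical toy data; only the SalePrice name comes from the post
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "Quality":   [3, 5, 6, 8, 9],            # rises with price
    "Age":       [50, 40, 30, 15, 5],        # falls as price rises
    "LotShape":  [2, 1, 3, 1, 2],            # no linear pattern
    "Street":    ["a", "b", "a", "b", "a"],  # non-numeric, excluded below
})

# Keep numeric columns only, then correlate each one with SalePrice
toy_num = toy.select_dtypes(include="number")
toy_corr = toy_num.corr()["SalePrice"].drop("SalePrice")

# |corr| > 0.5 keeps both strong positive and strong negative relationships
toy_top = toy_corr[toy_corr.abs() > 0.5].sort_values(ascending=False)
print(toy_top)
```

Here Quality and Age both pass the filter, with Age carrying a negative sign, while LotShape is dropped.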

Next, we will create pair plots to visually examine the correlation between some of these features and the target variable. For this analysis, we will use Seaborn’s sns.pairplot() function. Pair plots are not only useful for visualizing relationships but also for identifying potential outliers in the data.

import matplotlib.pyplot as plt
import seaborn as sns

# Plot SalePrice against the numerical columns, five columns per row of plots
for i in range(0, len(hous_num.columns), 5):
    sns.pairplot(data=hous_num,
                 x_vars=hous_num.columns[i:i+5],  # current subset of up to 5 columns
                 y_vars=['SalePrice'])            # SalePrice is always on the y-axis

plt.show()  # display the figures when running as a script
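To see which columns land in each row of plots, the same slicing can be tried on a plain list. This is only a sketch: the list below is a hypothetical stand-in for hous_num.columns, and the real DataFrame has many more columns.

```python
# Hypothetical column names standing in for hous_num.columns
columns = ["Id", "LotArea", "OverallQual", "YearBuilt",
           "GrLivArea", "GarageArea", "SalePrice"]

# Same stepping as the loop above: start a new slice every 5 columns
chunks = [columns[i:i + 5] for i in range(0, len(columns), 5)]

for chunk in chunks:
    print(chunk)
```

The last slice can be shorter than five because slicing past the end of a sequence does not raise an error. It also shows that SalePrice ends up plotted against itself when it is one of the columns.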

The pair plots above align well with the Pearson correlation scores. From the graphs, it is evident that attributes such as OverallQual, GrLivArea, GarageArea, and seven others exhibit the strongest correlations with SalePrice.

I’ll keep this post brief and to the point. In the next post, I’ll explore how to analyze skewness and kurtosis.

Cenk Yildiran