Cenk Yildiran

Data Cleaning using Python | Part-2

This post follows up on the previous one.

Identifying the predictor variables most strongly correlated with the response variable

Before beginning the data cleaning process, it is helpful to examine the correlation between the response variable (in this case, SalePrice) and the predictor variables. This step can help identify features that have little or no linear relationship with the house price, which makes them candidates for exclusion from the analysis.

There are several methods to assess the correlation between the target variable and other features, including pair plots, scatter plots, heat maps, and correlation matrices. In this analysis, we will use the pandas .corr() method to identify the top features based on the Pearson correlation coefficient, which quantifies the strength and direction of the linear relationship between two numerical variables on a scale from -1 to 1.
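To make the coefficient more concrete, here is a minimal sketch that computes Pearson's r by hand in plain Python. The function name pearson_r and the sample numbers are made up for illustration and are not taken from the housing data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of the deviations from each mean (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: price is exactly twice the living area,
# so the relationship is perfectly linear and r should be very close to 1.
living_area = [50, 60, 80, 100, 120, 150]
price = [100, 120, 160, 200, 240, 300]
print(pearson_r(living_area, price))
```

A value near 1 or -1 means the points lie close to a straight line, while a value near 0 means there is little linear relationship, even if a non-linear one exists.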

Since the correlation coefficient applies only to numerical attributes (floats and integers), we will restrict our analysis to numeric features.

# 'housing' is assumed to be the DataFrame loaded in the previous post.
# Select only the numerical (float, int) variables and assign them to the hous_num DataFrame:
hous_num = housing.select_dtypes(include=['float64', 'int64'])
# Compute the correlation of every numerical variable with SalePrice, then drop SalePrice itself
# (dropping by name does not rely on SalePrice being the last column):
hous_num_corr = hous_num.corr()['SalePrice'].drop('SalePrice')
# Keep only the strong correlations (|corr| > 0.5), sorted from highest to lowest:
top_features = hous_num_corr[hous_num_corr.abs() > 0.5].sort_values(ascending=False)
# Print the results:
print("There are {} features strongly correlated with SalePrice:\n{}".format(len(top_features), top_features))

Among the attributes, ten have a correlation value greater than 0.5, indicating a strong positive correlation with the SalePrice variable.
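Note that the filter takes the absolute value of the coefficient, so it would also keep strongly negative correlations if there were any. The sketch below shows this on a small toy DataFrame. The columns Size, Age, and Noise and all of their values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data: Size rises with the price, Age falls as the price rises,
# and Noise has only a weak relationship with it.
toy = pd.DataFrame({
    "Size": [50, 60, 80, 100, 120, 150],
    "Age": [40, 35, 30, 20, 10, 5],
    "Noise": [1, -1, 1, -1, 1, -1],
    "SalePrice": [100, 120, 160, 200, 240, 300],
})

# Correlation of every column with SalePrice, without SalePrice itself
corr_with_price = toy.corr()["SalePrice"].drop("SalePrice")

# Keep the features whose absolute correlation exceeds 0.5
strong = corr_with_price[corr_with_price.abs() > 0.5].sort_values(ascending=False)
print(strong)
```

Here both Size (positive) and Age (negative) pass the filter, while Noise is dropped.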

Next, we will create pair plots to visually examine the correlation between some of these features and the target variable. For this analysis, we will use Seaborn’s sns.pairplot() function. Pair plots are not only useful for visualizing relationships but also for identifying potential outliers in the data.

import seaborn as sns

# Loop through the columns of the DataFrame 'hous_num' in steps of 5
for i in range(0, len(hous_num.columns), 5):
    # Generate one row of plots for each subset of 5 columns from 'hous_num':
    # 'x_vars' takes the current subset of 5 columns for the x-axis,
    # 'y_vars' always puts 'SalePrice' on the y-axis
    sns.pairplot(data=hous_num,
                 x_vars=hous_num.columns[i:i+5],
                 y_vars=['SalePrice'])

The pair plots above align well with the Pearson correlation scores. From the graphs, it is evident that attributes such as OverallQual, GrLivArea, GarageArea, and seven others exhibit the strongest correlations with SalePrice.
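If only the strongest features are of interest, the same plot can be restricted to them instead of looping over every numeric column. The sketch below is self-contained and uses a toy DataFrame; the column names OverallQual and GrLivArea come from the dataset, but the values are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical values, used only to keep the example runnable on its own
toy = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "GrLivArea": [900, 1100, 1400, 1600, 2000, 2600],
    "SalePrice": [100000, 125000, 160000, 200000, 260000, 340000],
})

# Plot only the selected features against SalePrice (one row, one panel per feature)
strong_cols = ["OverallQual", "GrLivArea"]
grid = sns.pairplot(data=toy, x_vars=strong_cols, y_vars=["SalePrice"])
```

With the real data, strong_cols could be replaced by the index of the filtered correlation Series, for example list(top_features.index).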

I’ll keep this post brief and to the point. In the next post, I’ll explore how to analyze skewness and kurtosis.


About Me

My name is Cenk, and I am an economist. On this site I write about economics, econometrics, finance, value investing, programming, calculus, basketball, history, food, books, self-improvement, well-being, and productivity. This is a personal blog; the posts reflect my personal views and do not represent those of any organization I have worked for.
For my academic work, please visit this site: https://cenkufukyildiran.academia.edu/
Posts about financial markets, trading, investing, and similar topics are not financial advice.
