Exploratory Data Analysis (EDA)

A guide for beginners

Hitesh Mandav
7 min read · Jul 7, 2021

Exploratory analysis is performed to extract insights and comprehend the dataset.

This is the first step we need to complete as soon as we get the dataset before we get lost in the weeds.

Why do we do Exploratory Data Analysis (EDA)?

The goal is to learn more about the dataset, which will help us make informed decisions throughout the project during data cleansing, feature engineering, etc.

EDA also helps us understand the relationships between features and assess data quality, often with the help of data visualization.

EDA is a very important first step in building machine learning models. It should not be skipped, but it also should not drag on for too long; it should be quick and decisive.

So how do we know when we have done enough analysis?

Here, we will discuss certain questions and conclude our EDA when we have the answers to these questions.

I have taken the Boston Housing Dataset from Kaggle as an example.

import pandas as pd

housing = pd.read_csv("HousingData.csv")

Now the very first questions we should ask are: how many data points do we have, what features are available, and what is the target variable?

housing.info()

Here we see that we have a total of 506 data points and a total of 14 columns. Also, we see the data type of each column along with how many non-null values we have.
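To see exactly where those missing values sit, we can also sum the nulls per column. This is a small sketch, assuming the data frame is loaded as above:

# Count missing values in each column to see where cleaning may be needed
missing_counts = housing.isnull().sum()
print(missing_counts[missing_counts > 0])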

Here the column names are very ambiguous; we cannot understand anything from these names. In this case you need to reach out to your stakeholder or data collector to get insights into what these acronyms mean.

In this case, since we have taken the data from Kaggle, we also have a description for each acronym given there. Sometimes, while downloading the data, you will get a file that contains these names and descriptions.

# CRIM — per capita crime rate by town
# ZN — proportion of residential land zoned for lots over 25,000 sq.ft.
# INDUS — proportion of non-retail business acres per town.
# CHAS — Charles River dummy variable (1 if tract bounds river; 0 otherwise)
# NOX — nitric oxides concentration (parts per 10 million)
# RM — average number of rooms per dwelling
# AGE — proportion of owner-occupied units built prior to 1940
# DIS — weighted distances to five Boston employment centres
# RAD — index of accessibility to radial highways
# TAX — full-value property-tax rate per $10,000
# PTRATIO — pupil-teacher ratio by town
# B — 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
# LSTAT — % lower status of the population
# MEDV — Median value of owner-occupied homes in $1000's

From this, we get some more insights into our data: MEDV is our target variable and is expressed in thousands of dollars, and CHAS can only take 1 or 0 as values, with the description telling us what each value signifies.
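For example, a quick sanity check (not part of the original walkthrough) can confirm that CHAS really only contains 0s and 1s:

# CHAS should be a binary dummy variable; dropna=False also reveals any missing entries
housing['CHAS'].value_counts(dropna=False)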

Now we can run the describe function on our data frame. It gives us useful statistics like the maximum, minimum, mean, and standard deviation, which help us better understand the numeric values in our data and how they are distributed within each column.

housing.describe()

Now, let's print some of our data and have a look at it. This should help us answer whether the columns make sense, whether the values make sense, whether the values are on the right scale, and whether the missing data will cause a big problem.

This is just to get a feel for the data and not do any rigorous analysis on the data.

housing.head(7)

Data visualization is usually a better way to understand trends in the data than looking at raw counts. We should therefore plot graphs to obtain more information.

But there are thousands of plots that you can map out with different combinations of features and targets. We should avoid doing this and plot graphs with purpose. A graph should tell us something useful about the data, such as the distribution of values, potential errors/outliers, correlations between features, and the boundaries/range of the data.

Let's see some of the plots that will help us with this.

Numerical Distribution

A very common plot is a histogram of a column, which shows the range and distribution of its numerical values. For example, here we plot a histogram of our target values.

import matplotlib.pyplot as plt

plt.hist([housing.MEDV], bins=10)
plt.xlabel('values')
plt.ylabel('count')
plt.show()

Here we have to look out for:

  • Unexpected distributions.
  • Boundaries that don’t make sense. For example, our target cannot be negative, and it is good that we don't see any data points with negative values (see the quick check after this list).
  • Potential measurement errors, if any.
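As a minimal sketch of that boundary check, assuming MEDV is the target as described above:

# The target should never be negative; both checks should come back clean
print(housing.MEDV.min())
print((housing.MEDV < 0).sum())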

Categorical Distribution

Now for categorical features, we can use a bar plot to understand their distribution. This dataset doesn't have a categorical feature to best showcase this, so let's create a feature called flooring type, which tells us the material of the flooring.

import random

flooring_types = ['Marbel Tile', 'Solid Hardwood', 'Concreate', 'Ceramic Tile', 'Porcelain', 'Vinyl Tile', 'Engineered Hardwood']
flooring_features_values = []
for x in range(0, housing.MEDV.count()):
    if x % 100 == 0:
        flooring_features_values.append('Plank Flooring')
    else:
        flooring_features_values.append(random.choice(flooring_types))
housing['flooring Type'] = flooring_features_values
housing['flooring Type'].value_counts()

Here I have fabricated a dummy feature with data points that we can use.

This dummy feature "flooring Type" tells us the type of flooring in the house. We can see that there are 8 different categories here, so to visualize this we can create a bar chart.

plt.barh(housing['flooring Type'].value_counts().keys(), width=housing['flooring Type'].value_counts())

Here we can observe the distribution for the categorical data and look for any sparse categories. Sparse categories are those which have fewer data points as compared to others. We should note these categories for our feature engineering step.
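One way to note them programmatically is to flag categories whose counts fall below some threshold; the cutoff of 20 below is just an illustrative choice:

# Flag sparse categories for the feature engineering step (threshold is arbitrary)
counts = housing['flooring Type'].value_counts()
sparse_categories = counts[counts < 20].index.tolist()
print(sparse_categories)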

Segmentations: Relation between categorical and numerical data

One very important way to find the relationship between numerical and categorical data is to segment the data and then plot graphs.

So let's take an example and find out a relation between flooring type and the price of the house.

First, we need to divide our data set based on categories as shown below.

# segmenting the data
vinyl_tile = housing[housing['flooring Type'] == 'Vinyl Tile']
marbel_tile = housing[housing['flooring Type'] == 'Marbel Tile']
solid_hardwood = housing[housing['flooring Type'] == 'Solid Hardwood']
concreate = housing[housing['flooring Type'] == 'Concreate']
porcelain = housing[housing['flooring Type'] == 'Porcelain']
engineered_hardwood = housing[housing['flooring Type'] == 'Engineered Hardwood']
ceramic_tile = housing[housing['flooring Type'] == 'Ceramic Tile']
plank_flooring = housing[housing['flooring Type'] == 'Plank Flooring']
# Getting house price arrays for plotting based on segmentation
plot_values = [vinyl_tile.MEDV, marbel_tile.MEDV, solid_hardwood.MEDV, concreate.MEDV, porcelain.MEDV, engineered_hardwood.MEDV, ceramic_tile.MEDV, plank_flooring.MEDV]
label_values = ['vinyl_tile', 'marbel_tile', 'solid_hardwood', 'concreate', 'porcelain', 'engineered_hardwood', 'ceramic_tile', 'plank_flooring']
# plotting box plot
plt.boxplot(plot_values, patch_artist=True, labels=label_values, vert=False)

Here we can get insights like min, max, median, and any outliers for the data for each category.
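As a more compact alternative to segmenting by hand, a groupby gives a similar per-category summary in a single line. This is a sketch using the dummy column created above:

# Summary statistics of house prices per flooring category
housing.groupby('flooring Type')['MEDV'].describe()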

Note: Here "flooring Type" is a dummy feature that we have created for illustration purposes.

Correlations

Correlation helps us to understand the relationship between numerical features of the data.

And using a correlation heatmap we can visualize this information.

import seaborn as sns

sns.set(rc={'figure.figsize': (15, 8)})
# numeric_only=True skips the non-numeric flooring column added earlier
sns.heatmap(housing.corr(numeric_only=True), vmin=-1, vmax=+1, annot=True, cmap='coolwarm')

Here the correlation values lie between -1 and 1 and represent how two features are related. For example, we see a value of 0.7 for MEDV and RM, which makes sense, as a higher number of rooms leads to higher house prices. This means they have a positive correlation.

Also, we can see that LSTAT and MEDV have a value of -0.74, which means that a higher percentage of lower-status population leads to lower house prices. This means they have a negative correlation.

Here we should concentrate on features that are strongly correlated with the target, be it positively or negatively. Also, any unexpected strong correlations between features should be noted and analyzed further.
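To focus the analysis, we can rank every numeric feature by its correlation with the target. A quick sketch; select_dtypes keeps only numeric columns so the dummy text column is skipped:

# Correlation of each numeric feature with the target, sorted from most negative to most positive
housing.select_dtypes('number').corr()['MEDV'].sort_values()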

We can also use the pandas scatter matrix plot to look for trends in features that have a strong correlation with the target.

We have seen that RM and LSTAT have a strong correlation with MEDV, so let's plot a scatter matrix and see the results.

pd.plotting.scatter_matrix(housing[['MEDV', 'RM', 'LSTAT']])

Here in the scatter plot we can clearly see that the lower the RM value, the lower the MEDV value, and the lower the LSTAT value, the higher the MEDV value.

Also, we can see some outlier values that don't follow the trend. To understand why we have these outliers, we should reach out to stakeholders and get more insights into the data, for example, why a house with two rooms has a higher price than a house with eight rooms. Based on the answers we get, we can decide whether to remove or cap this data. Here we can see that the price is capped at 50 for multiple values, as this is pre-cleaned data from Kaggle.
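We can confirm that cap by counting how many rows sit exactly at the maximum value. A small check, assuming the cap is at 50 as seen in the plot:

# Number of houses whose price is capped at the maximum value of 50
print((housing.MEDV == 50).sum())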

By the end of EDA we should have a pretty good understanding of our dataset, along with notes for data cleaning and ideas for feature engineering.

Thanks for reading. Do leave comments if you have any inputs.

ENJOY YOUR CODING!
