Exploratory Data Analysis with Python (Part 1)

Muskaan Pirani
Published in The Startup
5 min read · Nov 27, 2020

How many times have you been baffled by the beginner stage of data analysis? Probably many times, as I was. In this post, I will walk you through Exploratory Data Analysis (EDA) with Python.

Photo by Franki Chamaki on Unsplash

EDA is one of the basic steps in data analysis. Before you begin the analysis itself, understanding the data is very important. You should be familiar with the terminology of the kind of data you are working with. For instance, if you are working with financial data such as stocks, you should be familiar with terms like open, close, high, low, volume and so on. Now that we know why understanding the data matters, let’s move ahead with the analysis part!

Why Data Analysis?

To draw inferences from the data and to estimate possible future outcomes based on its historical values. Continuing the example above, let’s say I analyze the performance of a particular stock from 2000 to 2020. I will find certain trends in the graph of its daily value. If it shows an upward trend for most of that period, I would like to invest in it; otherwise, I will analyze other stocks.

An important thing to remember here is:

Analysis is never done on the basis of one or two parameters. There are usually many more: 30, 40, 50 or any number.

EDA is the first step of the analysis. It helps to:

  1. Get inferences from the data.
  2. Understand how the parameters of the data are correlated.
  3. Extract the important parameters and drop the ones that are not.
  4. Find relationships between those parameters.
  5. Draw a conclusion.

Let’s get our hands dirty with data

I will be using data provided by World Bank Data (Science and Technology Indicators). This dataset covers every country in the world, so I will narrow it down to India to keep the insights easy to follow. You can download it from here: Dataset(India)
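
If you start from a multi-country file instead of the ready-made India file, the narrowing step could look roughly like the sketch below. The table, its column names ("Country", "Year", "Indicator value") and its numbers are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Tiny made-up table standing in for a multi-country indicator file.
# Column names and values are hypothetical, not taken from the real dataset.
world = pd.DataFrame({
    "Country": ["India", "India", "Brazil", "Brazil"],
    "Year": [2018, 2019, 2018, 2019],
    "Indicator value": [1.0, 2.0, 3.0, 4.0],
})

# Keep only the rows for one country and drop the now-constant column.
india = (
    world[world["Country"] == "India"]
    .drop(columns="Country")
    .reset_index(drop=True)
)

print(india)
```

Dropping the country column afterwards is optional; it simply avoids carrying a constant column into the later statistics and plots.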

1. Importing packages

  • We will work with pandas, numpy, seaborn and matplotlib here for data analysis and visualizations.
  • We use %matplotlib inline to render our graphs within the notebook.
#import packages
import pandas as pd
import numpy as np
import seaborn as sns
#To plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading the dataset and getting an overview of the data

  • Pass the file path to read_csv; df.head() will print the first 5 rows of the dataset.
  • I am using Google Colaboratory, hence the file path is ‘/content/India.csv’.
df = pd.read_csv('/content/India.csv')
df.head()
Shows the first five rows of the dataset
  • To get the shape of the data, i.e. the number of rows and columns, we will use df.shape
df.shape
41 rows with 9 columns
  • To obtain the names of the columns we will write df.columns.values
df.columns.values
Gives the names of the columns
  • To find out which columns the dataset contains, their types and how many non-null values each one has, we use the info() function.
df.info()
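
Alongside info(), a direct count of missing values per column can make gaps easier to spot. The sketch below uses a small made-up frame; the column names and values are illustrative only and are not from India.csv.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps; column names are illustrative only.
sample = pd.DataFrame({
    "Year": [2016, 2017, 2018, 2019],
    "Inflation": [4.5, np.nan, 3.4, 4.8],
    "Volume of imports": [np.nan, np.nan, 8.7, -1.0],
})

# Count of missing values per column, complementing the non-null counts from info().
missing_counts = sample.isna().sum()

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data, the same two calls can be applied to df directly.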

3. Statistical values

  • The describe() method gives us summary statistics for each numeric column individually.
  • We can get to know mean, standard deviation, quartiles, min, max, etc.
df.describe()
  • From the output above, we can conclude that the mean of the first five columns is greater than or equal to the median (the 50% row).
  • There is a relatively big difference between the 75% and max values of the columns.
  • Together, these two observations suggest that the data is skewed towards high values and may contain outliers.
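
The two checks above (mean versus median, and the gap between 75% and max) can be made concrete with a short sketch. The values are made up, and the 1.5 × IQR fence is a common rule of thumb that I am assuming here; it is not something the original analysis applies.

```python
import pandas as pd

# Made-up values with one unusually large entry; not from the India dataset.
values = pd.Series([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 30.0], name="indicator")

mean_value = values.mean()
median_value = values.median()

# Quartiles as reported by describe() (the 25% and 75% rows).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag points above Q3 + 1.5 * IQR as potential outliers.
upper_fence = q3 + 1.5 * iqr
outliers = values[values > upper_fence]

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"75%={q3:.2f}, max={values.max():.2f}, upper fence={upper_fence:.2f}")
print(outliers)
```

Here the single large value pulls the mean above the median and sits far beyond the 75% mark, which is the same pattern described in the bullets.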

4. Pair Plot graph

  • sns.pairplot plots every numeric column against every other column, so that we can get a rough idea of how one column is related to another.

EDA in one line of code: sns.pairplot(df)

g = sns.pairplot(df)
g.fig.set_size_inches(20,20)
Pair plot

5. Correlation Matrix

  • The correlation matrix measures the strength of the linear relationship between each pair of columns, which makes it a common check before Linear Regression. It is a symmetric matrix.
  • We have used annot=True to print the values on the heatmap.
  • Values close to 1 show a positive correlation, values close to -1 show a negative correlation and values close to 0 show no linear relationship.
  • Exactly 1 indicates a perfect positive relationship, while exactly -1 indicates a perfect negative relationship.
  • Drawback of the correlation matrix: it will not reveal columns that have a non-linear relationship.
dataset = df.corr()
dataset = pd.DataFrame(dataset)
dataset
plt.figure(figsize=(10, 5))
sns.set_style('whitegrid')
sns.heatmap(dataset, robust=True, annot=True)
Heatmap
  • From the heatmap above, we can see a strong positive correlation between "Inflation, average consumer prices" and "Inflation, end of period consumer prices".
  • There is also a strong negative correlation between "General Government net lending/borrowing" and "Year".
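
With many columns, reading the strongest pairs off a heatmap by eye gets tedious. One possible way to list them is sketched below on made-up data; the columns a, b, c and d are hypothetical stand-ins, and the same steps could be applied to the correlation matrix computed earlier.

```python
import numpy as np
import pandas as pd

# Made-up columns: "b" rises with "a", "c" falls as "a" rises, "d" is unrelated noise.
rng = np.random.default_rng(0)
a = np.arange(20, dtype=float)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,
    "c": -3 * a + 5,
    "d": rng.normal(size=20),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute strength, strongest first (the sign is kept in the values).
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

Sorting by absolute value keeps strong negative pairs near the top instead of pushing them to the bottom of the list.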

6. PPS Matrix

  • PPS stands for Predictive Power Score.
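  • To overcome the drawback of the correlation matrix, we will use the PPS matrix.
  • This matrix can also capture non-linear relationships among variables. It is asymmetric: how well x predicts y can differ from how well y predicts x.
  • The scores range from 0 (no predictive power) to 1 (perfect predictive power). Unlike correlation values, they are never negative and do not show the direction of a relationship.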
pip install ppscore
import ppscore as pps
sns.set_style('whitegrid')
plt.figure(figsize=(10, 5))
matrix_df = pps.matrix(df).pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, annot=True, cmap='Blues')
PPS Matrix
  • To get the PPS values of all the predictors for a particular feature, we write the line of code below:
pps.predictors(df, "Inflation, average consumer prices")
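
To see why a correlation matrix can miss a relationship that a predictive score is designed to catch, consider the small sketch below. It uses made-up data and only pandas and numpy. It does not reproduce how ppscore computes its score; it only illustrates the non-linearity and the asymmetry between predicting y from x and x from y.

```python
import numpy as np
import pandas as pd

# Made-up data: y depends on x exactly, but not linearly.
x = np.arange(-5, 6, dtype=float)  # -5, -4, ..., 5
demo = pd.DataFrame({"x": x, "y": x ** 2})

# Pearson correlation is about 0 because the relationship is symmetric around x = 0.
pearson = demo["x"].corr(demo["y"])

# Knowing x pins down y exactly: every x maps to a single y.
max_y_per_x = demo.groupby("x")["y"].nunique().max()

# Knowing y does not pin down x: y = 4 could come from x = -2 or x = 2.
max_x_per_y = demo.groupby("y")["x"].nunique().max()

print(f"Pearson r = {pearson:.3f}")
print(max_y_per_x, max_x_per_y)
```

A correlation heatmap would show this pair as unrelated, even though x determines y completely, while y only narrows x down to two candidates.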

The analysis part ends here. In Exploratory Data Analysis with Python (Part 2), we will see how to visualize the data.

I would like to end the post with a quote:

Data is what you need to do analytics, information is what you need to do business.

— John Owen
