Exploratory Data Analysis with Python
Objectives
After completing this lab you will be able to:
- Explore the features, or characteristics, that help predict the price of a car
Table of Contents
1. Import Data from Module 2
2. Analyzing Individual Feature Patterns using Visualization
3. Descriptive Statistical Analysis
4. Basics of Grouping
5. Correlation and Causation
6. ANOVA
What are the main characteristics that have the most impact
on the car price?
1. Import Data from Module 2
Setup
If you are running the lab in your browser, install the
libraries using piplite. The calls below are commented out
because this notebook was run locally.
#you are running the lab in your browser, so we will install the libraries using ``piplite``
#import piplite
#await piplite.install(['pandas'])
#await piplite.install(['matplotlib'])
#await piplite.install(['scipy'])
#await piplite.install(['seaborn'])
# Installation commands are commented out because the code will be run locally
Import libraries:
This function will download the dataset into your browser
#This function will download the dataset into your browser
#from pyodide.http import pyfetch
#async def download(url, filename):
# response = await pyfetch(url)
# if response.status == 200:
# with open(filename, "wb") as f:
# f.write(await response.bytes())
Load the data and store it in dataframe df:
This dataset is hosted on IBM Cloud Object Storage.
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
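The load step itself is not shown above; a minimal sketch, assuming the URL in path is reachable and that the CSV file already contains column headers:
import pandas as pd

# read the CSV directly from the URL into the dataframe df
df = pd.read_csv(path)
df.head()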
2. Analyzing Individual Feature Patterns Using
Visualization
To install Seaborn we use pip, the
Python package manager.
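For example (a sketch; any standard pip setup works, and in a notebook cell the %pip magic does the same):
# install seaborn from PyPI using the same interpreter that runs this script
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "seaborn"])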
Import visualization packages “Matplotlib” and
“Seaborn”. Don’t forget about “%matplotlib
inline” to plot in a Jupyter notebook.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # used later for pearsonr and f_oneway
#%matplotlib inline
# The magic command above is commented out because it is only needed to plot inside a Jupyter notebook
How to choose the right visualization method?
When visualizing individual variables, it is important to
first understand what type of variable you are dealing
with. This will help you find the right visualization
method for that variable.
# list the data types for each column
print(df.dtypes)
## symboling int64
## normalized-losses int64
## make object
## aspiration object
## num-of-doors object
## body-style object
## drive-wheels object
## engine-location object
## wheel-base float64
## length float64
## width float64
## height float64
## curb-weight int64
## engine-type object
## num-of-cylinders object
## engine-size int64
## fuel-system object
## bore float64
## stroke float64
## compression-ratio float64
## horsepower float64
## peak-rpm float64
## city-mpg int64
## highway-mpg int64
## price float64
## city-L/100km float64
## horsepower-binned object
## diesel int64
## gas int64
## dtype: object
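As an optional aside (not part of the original lab), pandas can split the columns by type for us, which maps directly onto the choice of visualization method:
# numeric columns suit scatterplots and correlation;
# object columns suit boxplots and value counts
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(numeric_cols)
print(categorical_cols)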
Question #1:
What is the data type of the column
“peak-rpm”?
# Write your code below and press Shift+Enter to execute
type_peak_rpm = df.dtypes['peak-rpm']
type_peak_rpm
## dtype('float64')
# Alternatively:
df['peak-rpm'].dtypes
## dtype('float64')
For example, we can calculate the correlation between variables of
type “int64” or “float64” using the
method “corr”:
df.corr()
## symboling normalized-losses ... diesel gas
## symboling 1.000000 0.466264 ... -0.196735 0.196735
## normalized-losses 0.466264 1.000000 ... -0.101546 0.101546
## wheel-base -0.535987 -0.056661 ... 0.307237 -0.307237
## length -0.365404 0.019424 ... 0.211187 -0.211187
## width -0.242423 0.086802 ... 0.244356 -0.244356
## height -0.550160 -0.373737 ... 0.281578 -0.281578
## curb-weight -0.233118 0.099404 ... 0.221046 -0.221046
## engine-size -0.110581 0.112360 ... 0.070779 -0.070779
## bore -0.140019 -0.029862 ... 0.054458 -0.054458
## stroke -0.008245 0.055563 ... 0.241303 -0.241303
## compression-ratio -0.182196 -0.114713 ... 0.985231 -0.985231
## horsepower 0.075819 0.217299 ... -0.169053 0.169053
## peak-rpm 0.279740 0.239543 ... -0.475812 0.475812
## city-mpg -0.035527 -0.225016 ... 0.265676 -0.265676
## highway-mpg 0.036233 -0.181877 ... 0.198690 -0.198690
## price -0.082391 0.133999 ... 0.110326 -0.110326
## city-L/100km 0.066171 0.238567 ... -0.241282 0.241282
## diesel -0.196735 -0.101546 ... 1.000000 -1.000000
## gas 0.196735 0.101546 ... -1.000000 1.000000
##
## [19 rows x 19 columns]
The diagonal elements are always one; we will study
correlation more precisely, using the Pearson correlation,
at the end of the notebook.
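Note that in pandas 2.0 and later, df.corr() raises an error when the dataframe contains non-numeric columns; pass numeric_only=True to reproduce the behaviour shown above:
# equivalent call that also works on pandas 2.x
df.corr(numeric_only=True)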
Question #2:
Find the correlation between the following columns:
bore, stroke, compression-ratio, and horsepower.
Hint: if you would like to select those
columns, use the following syntax:
df[['bore','stroke','compression-ratio','horsepower']]
# Write your code below and press Shift+Enter to execute
df[['bore','stroke','compression-ratio','horsepower']].corr()
## bore stroke compression-ratio horsepower
## bore 1.000000 -0.055390 0.001263 0.566936
## stroke -0.055390 1.000000 0.187923 0.098462
## compression-ratio 0.001263 0.187923 1.000000 -0.214514
## horsepower 0.566936 0.098462 -0.214514 1.000000
Continuous Numerical Variables:
Continuous numerical variables are variables that may
contain any value within some range. They can be of type
“int64” or “float64”. A great way to
visualize these variables is by using
scatterplots with fitted lines.
In order to start understanding the (linear)
relationship between an individual variable and the
price, we can use “regplot” which plots the
scatterplot plus the fitted regression line for the
data.
Let’s see several examples of different linear
relationships:
Positive Linear Relationship
Let’s find the scatterplot of “engine-size” and
“price”.
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
## (0.0, 53380.78213556759)
plt.show()

As the engine-size goes up, the price goes
up: this indicates a positive direct
correlation between these two variables. Engine
size seems like a pretty good predictor of
price since the regression line is almost a perfect
diagonal line.
We can examine the correlation between
‘engine-size’ and ‘price’ and see that it’s
approximately 0.87.
df[["engine-size", "price"]].corr()
## engine-size price
## engine-size 1.000000 0.872335
## price 0.872335 1.000000
Highway mpg is a potential predictor
variable of price. Let’s find the scatterplot of
“highway-mpg” and “price”.
plt.clf()
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
## (0.0, 48182.90564471966)
plt.show()

As highway-mpg goes up, the price goes
down: this indicates an inverse/negative
relationship between these two variables. Highway mpg
could potentially be a predictor of price.
We can examine the correlation between
‘highway-mpg’ and ‘price’ and see it’s
approximately -0.704.
df[['highway-mpg', 'price']].corr()
## highway-mpg price
## highway-mpg 1.000000 -0.704692
## price -0.704692 1.000000
Weak Linear Relationship
Let’s see if “peak-rpm” is a predictor variable of
“price”.
plt.clf()
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
## (0.0, 47414.1)
plt.show()

Peak rpm does not seem like a good
predictor of the price at all since the regression line
is close to horizontal. Also, the data points are very
scattered and far from the fitted line, showing lots of
variability. Therefore, it’s not a reliable variable.
We can examine the correlation between
‘peak-rpm’ and ‘price’ and see it’s
approximately -0.101616.
df[['peak-rpm','price']].corr()
## peak-rpm price
## peak-rpm 1.000000 -0.101616
## price -0.101616 1.000000
Question 3 a):
Find the correlation between x=“stroke” and
y=“price”.
Hint: if you would like to select those columns,
use the following syntax: df[['stroke','price']].
# Write your code below and press Shift+Enter to execute
df[['stroke','price']].corr()
## stroke price
## stroke 1.00000 0.08231
## price 0.08231 1.00000
Question 3 b):
Given the correlation results between “price” and “stroke”, do
you expect a linear relationship?
Verify your results using the
function “regplot()”.
plt.clf()
sns.regplot(x="stroke", y="price", data=df)
plt.ylim(0,)
## (0.0, 47414.1)
plt.show()
# "stroke" no es un buen predictor de la variable "precio" ya que la línea de regresión está cerca a la horizontal y sus valores están lejanos a ésta. Gran variabilidad

Categorical Variables
These are variables that describe a ‘characteristic’ of a
data unit, and are selected from a small group of categories.
The categorical variables can have the type
“object” or “int64”. A good way to visualize
categorical variables is by using
boxplots.
Let’s look at the relationship between
“body-style” and “price”.
plt.clf()
sns.boxplot(x="body-style", y="price", data=df)
plt.show()

We see that the distributions of price between the
different body-style categories have a
significant overlap, so body-style would not be
a good predictor of price. Let's examine
"engine-location" and "price":
plt.clf()
sns.boxplot(x="engine-location", y="price", data=df)
plt.show()

Here we see that the distributions of price between
the two engine-location categories, front and
rear, are distinct enough to take engine-location as a
potentially good predictor of price.
Let’s examine “drive-wheels” and “price”.
# drive-wheels
plt.clf()
sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()

Here we see that the distribution of price between
the different drive-wheels categories differs. As such,
drive-wheels could potentially be a predictor of
price.
3. Descriptive Statistical Analysis
Let’s first take a look at the variables by
utilizing a description method.
The describe function automatically
computes basic statistics for all continuous variables.
Any NaN values are automatically
skipped in these statistics.
This will show:
- the count of that variable
- the mean
- the standard deviation (std)
- the minimum value
- the quartiles (25%, 50% and 75%)
- the maximum value
We can apply the method “describe” as follows:
df.describe()
## symboling normalized-losses ... diesel gas
## count 201.000000 201.00000 ... 201.000000 201.000000
## mean 0.840796 122.00000 ... 0.099502 0.900498
## std 1.254802 31.99625 ... 0.300083 0.300083
## min -2.000000 65.00000 ... 0.000000 0.000000
## 25% 0.000000 101.00000 ... 0.000000 1.000000
## 50% 1.000000 122.00000 ... 0.000000 1.000000
## 75% 2.000000 137.00000 ... 0.000000 1.000000
## max 3.000000 256.00000 ... 1.000000 1.000000
##
## [8 rows x 19 columns]
The default setting of “describe” skips
variables of type object. We can apply the method
“describe” on the variables of type ‘object’ as
follows:
df.describe(include=['object'])
## make aspiration ... fuel-system horsepower-binned
## count 201 201 ... 201 200
## unique 22 2 ... 8 3
## top toyota std ... mpfi Low
## freq 32 165 ... 92 115
##
## [4 rows x 10 columns]
Value Counts
Value counts is a good way of understanding how many units
of each characteristic/variable we have. We can apply the
“value_counts” method on the column
"drive-wheels". Don't forget the method
"value_counts" only works on pandas series, not
pandas dataframes. As a result, we only include one
bracket df['drive-wheels'], not two brackets
df[['drive-wheels']].
df['drive-wheels'].value_counts()
## fwd 118
## rwd 75
## 4wd 8
## Name: drive-wheels, dtype: int64
We can convert the series to a dataframe as
follows:
df['drive-wheels'].value_counts().to_frame()
## drive-wheels
## fwd 118
## rwd 75
## 4wd 8
Let’s repeat the above steps but save the results
to the dataframe “drive_wheels_counts” and
rename the column ‘drive-wheels’ to
‘value_counts’.
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
## value_counts
## fwd 118
## rwd 75
## 4wd 8
Now let’s rename the index to ‘drive-wheels’:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
## value_counts
## drive-wheels
## fwd 118
## rwd 75
## 4wd 8
We can repeat the above process for the variable
‘engine-location’.
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
## value_counts
## engine-location
## front 198
## rear 3
After examining the value counts of the engine location, we see that
engine location would not be a good predictor variable for the price.
This is because we only have three cars with a rear engine and 198 with
an engine in the front, so this result is skewed. Thus, we are not able
to draw any conclusions about the engine location.
4. Basics of Grouping
The “groupby” method groups data by
different categories. The data is grouped based on one or
several variables, and analysis is performed on the individual
groups.
For example, let’s group by the variable
“drive-wheels”. We see that there are 3
different categories of drive wheels.
df['drive-wheels'].unique()
## array(['rwd', 'fwd', '4wd'], dtype=object)
If we want to know, on average, which type
of drive wheel is most valuable, we can group by
"drive-wheels" and then average the prices.
We can select the columns ‘drive-wheels’, ‘body-style’ and
‘price’, then assign it to the variable
“df_group_one”.
df_group_one = df[['drive-wheels','body-style','price']]
We can then calculate the average price for each of
the different categories of data.
# grouping results
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
df_group_one
## drive-wheels price
## 0 4wd 10241.000000
## 1 fwd 9244.779661
## 2 rwd 19757.613333
From our data, it seems rear-wheel drive vehicles are, on
average, the most expensive, while 4-wheel and
front-wheel are approximately the same in
price.
You can also group by multiple variables. For
example, let’s group by both ‘drive-wheels’ and
‘body-style’. This groups the dataframe by the unique
combination of ‘drive-wheels’ and ‘body-style’. We can
store the results in the variable
‘grouped_test1’.
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
## drive-wheels body-style price
## 0 4wd hatchback 7603.000000
## 1 4wd sedan 12647.333333
## 2 4wd wagon 9095.750000
## 3 fwd convertible 11595.000000
## 4 fwd hardtop 8249.000000
## 5 fwd hatchback 8396.387755
## 6 fwd sedan 9811.800000
## 7 fwd wagon 9997.333333
## 8 rwd convertible 23949.600000
## 9 rwd hardtop 24202.714286
## 10 rwd hatchback 14337.777778
## 11 rwd sedan 21711.833333
## 12 rwd wagon 16994.222222
This grouped data is much easier
to visualize when it is made into a pivot table. A
pivot table is like an Excel
spreadsheet, with one variable along the column and another
along the row. We can convert the grouped dataframe
to a pivot table using the method "pivot".
In this case, we will leave the drive-wheels variable as the
rows of the table, and pivot body-style to become the
columns of the table:
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
## price ...
## body-style convertible hardtop ... sedan wagon
## drive-wheels ...
## 4wd NaN NaN ... 12647.333333 9095.750000
## fwd 11595.0 8249.000000 ... 9811.800000 9997.333333
## rwd 23949.6 24202.714286 ... 21711.833333 16994.222222
##
## [3 rows x 5 columns]
Often, we won’t have data for some of the pivot cells. We
can fill these missing cells with the value 0, but any
other value could potentially be used as well. It
should be mentioned that missing data is quite a complex
subject and is an entire course on its own.
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
## price ...
## body-style convertible hardtop ... sedan wagon
## drive-wheels ...
## 4wd 0.0 0.000000 ... 12647.333333 9095.750000
## fwd 11595.0 8249.000000 ... 9811.800000 9997.333333
## rwd 23949.6 24202.714286 ... 21711.833333 16994.222222
##
## [3 rows x 5 columns]
Question 4:
Use the “groupby” function to find the
average “price” of each car based on
“body-style”.
# Write your code below and press Shift+Enter to execute
# grouping results
df_gpbstyle = df[['body-style','price']]
grouped_bstyle = df_gpbstyle.groupby(['body-style'],as_index=False).mean()
grouped_bstyle
## body-style price
## 0 convertible 21890.500000
## 1 hardtop 22208.500000
## 2 hatchback 9957.441176
## 3 sedan 14459.755319
## 4 wagon 12371.960000
If you did not import "pyplot" earlier, import it now.
Variables: Drive Wheels and Body Style
vs. Price
Let's use a heat map to visualize the
relationship between body style, drive wheels, and price.
#use the grouped results
plt.clf()
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x0000015E1CE5B9D0>
plt.show()

The heatmap plots the target variable
(price) as colour intensity with respect to the variables
'drive-wheels' and 'body-style' on the vertical and horizontal axes,
respectively. This allows us to visualize how the price
is related to 'drive-wheels' and 'body-style'.
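The default axis labels of the plot above convey no useful information, though. A sketch of a labelled version (assuming numpy is available and reusing grouped_pivot from above):
import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

# label names: body styles along the columns, drive wheels along the rows
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

# move the ticks to the center of each cell
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

# insert the labels and rotate the long ones
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()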
Visualization is very important in data science, and Python
visualization packages provide great freedom. We will go more in-depth
in a separate Python visualizations course.
The main question we want to answer in this module is, “What
are the main characteristics which have the most impact on the car
price?”.
To get a better measure of the important
characteristics, we look at the correlation of these variables
with the car price. In other words: how is the car
price dependent on this variable?
5. Correlation and Causation
Correlation: a measure of the extent
of interdependence between variables.
Causation: the relationship between cause
and effect between two variables.
It is important to know the difference between
these two. Correlation does not imply causation. Determining correlation
is much simpler than determining causation, as causation may require
independent experimentation.
Pearson Correlation
The Pearson Correlation measures the linear
dependence between two variables X and Y.
The resulting coefficient is a value
between -1 and 1 inclusive, where:
- 1: perfect positive linear correlation.
- 0: no linear correlation; the two variables most likely do not affect each other.
- -1: perfect negative linear correlation.
Pearson Correlation is the default method
of the function “corr”. Like before, we can calculate the
Pearson Correlation of the ‘int64’ or ‘float64’
variables.
df.corr()
## symboling normalized-losses ... diesel gas
## symboling 1.000000 0.466264 ... -0.196735 0.196735
## normalized-losses 0.466264 1.000000 ... -0.101546 0.101546
## wheel-base -0.535987 -0.056661 ... 0.307237 -0.307237
## length -0.365404 0.019424 ... 0.211187 -0.211187
## width -0.242423 0.086802 ... 0.244356 -0.244356
## height -0.550160 -0.373737 ... 0.281578 -0.281578
## curb-weight -0.233118 0.099404 ... 0.221046 -0.221046
## engine-size -0.110581 0.112360 ... 0.070779 -0.070779
## bore -0.140019 -0.029862 ... 0.054458 -0.054458
## stroke -0.008245 0.055563 ... 0.241303 -0.241303
## compression-ratio -0.182196 -0.114713 ... 0.985231 -0.985231
## horsepower 0.075819 0.217299 ... -0.169053 0.169053
## peak-rpm 0.279740 0.239543 ... -0.475812 0.475812
## city-mpg -0.035527 -0.225016 ... 0.265676 -0.265676
## highway-mpg 0.036233 -0.181877 ... 0.198690 -0.198690
## price -0.082391 0.133999 ... 0.110326 -0.110326
## city-L/100km 0.066171 0.238567 ... -0.241282 0.241282
## diesel -0.196735 -0.101546 ... 1.000000 -1.000000
## gas 0.196735 0.101546 ... -1.000000 1.000000
##
## [19 rows x 19 columns]
Sometimes we would like to know the significance of the correlation
estimate.
P-value
What is this P-value? The P-value is the probability of
observing a correlation at least this strong by chance, assuming
the two variables are actually uncorrelated; the smaller it is, the
more confident we are that the correlation is statistically significant.
Normally, we choose a significance level of
0.05, which means that we are 95%
confident that the correlation between the variables is
significant.
By convention:
- p-value is < 0.001: we say there is strong evidence that the correlation is significant.
- p-value is < 0.05: there is moderate evidence that the correlation is significant.
- p-value is < 0.1: there is weak evidence that the correlation is significant.
- p-value is > 0.1: there is no evidence that the correlation is significant.
These thresholds are encoded in the helper sketch below this list.
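As an illustration (the helper name and its structure are hypothetical; it just encodes the convention above), we can wrap the computation and its interpretation in one function:
def correlation_report(data, col, target='price'):
    # hypothetical helper: Pearson r plus the evidence label from the convention above
    r, p = stats.pearsonr(data[col], data[target])
    if p < 0.001:
        evidence = "strong"
    elif p < 0.05:
        evidence = "moderate"
    elif p < 0.1:
        evidence = "weak"
    else:
        evidence = "no"
    print(f"{col} vs {target}: r = {r:.3f}, p = {p:.3g} ({evidence} evidence)")

correlation_report(df, 'wheel-base')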
Wheel-Base vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘wheel-base’ and ‘price’.
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
## The Pearson Correlation Coefficient is 0.5846418222655081 with a P-value of P = 8.076488270732989e-20
Conclusion:
Since the p-value is < 0.001, the
correlation between wheel-base and price is statistically
significant, although the linear relationship isn’t
extremely strong (~0.585).
Horsepower vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘horsepower’ and ‘price’.
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
## The Pearson Correlation Coefficient is 0.809574567003656 with a P-value of P = 6.369057428259557e-48
Conclusion:
Since the p-value is < 0.001, the
correlation between horsepower and price is statistically
significant, and the linear relationship is quite
strong (~0.809, close to 1).
Length vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘length’ and ‘price’.
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
## The Pearson Correlation Coefficient is 0.690628380448364 with a P-value of P = 8.016477466158986e-30
Conclusion:
Since the p-value is < 0.001, the
correlation between length and price is statistically
significant, and the linear relationship is moderately
strong (~0.691).
Width vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘width’ and ‘price’:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )
## The Pearson Correlation Coefficient is 0.7512653440522674 with a P-value of P = 9.200335510481516e-38
Conclusion:
Since the p-value is < 0.001, the
correlation between width and price is statistically
significant, and the linear relationship is quite
strong (~0.751).
Curb-Weight vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘curb-weight’ and ‘price’:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
## The Pearson Correlation Coefficient is 0.8344145257702846 with a P-value of P = 2.1895772388936914e-53
Conclusion:
Since the p-value is < 0.001, the
correlation between curb-weight and price is statistically
significant, and the linear relationship is quite
strong (~0.834).
Engine-Size vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘engine-size’ and ‘price’:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
## The Pearson Correlation Coefficient is 0.8723351674455185 with a P-value of P = 9.265491622198389e-64
Conclusion:
Since the p-value is < 0.001, the
correlation between engine-size and price is statistically
significant, and the linear relationship is very
strong (~0.872).
Bore vs. Price
Let’s calculate the Pearson Correlation Coefficient and
P-value of ‘bore’ and ‘price’:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )
## The Pearson Correlation Coefficient is 0.5431553832626602 with a P-value of P = 8.049189483935489e-17
Conclusion:
Since the p-value is < 0.001, the
correlation between bore and price is statistically
significant, but the linear relationship is only
moderate (~0.543).
We can repeat the process for 'city-mpg' and
'highway-mpg':
City-mpg vs. Price
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
## The Pearson Correlation Coefficient is -0.6865710067844677 with a P-value of P = 2.321132065567674e-29
Conclusion:
Since the p-value is < 0.001, the
correlation between city-mpg and price is statistically
significant, and the coefficient of about
-0.687 shows that the relationship is negative and
moderately strong.
Highway-mpg vs. Price
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )
## The Pearson Correlation Coefficient is -0.7046922650589529 with a P-value of P = 1.7495471144477352e-31
Conclusion:
Since the p-value is < 0.001, the
correlation between highway-mpg and price is statistically
significant, and the coefficient of about
-0.705 shows that the relationship is negative and
moderately strong.
6. ANOVA
ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical
method used to test whether there are significant differences
between the means of two or more groups. ANOVA returns
two parameters:
- F-test score: ANOVA assumes the means of all
groups are the same, calculates how much the actual means
deviate from the assumption, and reports it as the F-test
score. A larger score means there is a larger
difference between the means.
- P-value: P-value tells how statistically
significant our calculated score value is.
If our price variable is strongly correlated with
the variable we are analyzing, we expect ANOVA to return a
sizeable F-test score and a small p-value.
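As a toy illustration (synthetic numbers, not from the dataset), f_oneway returns a small F-score when the group means are nearly identical and a large one when a group mean clearly differs:
# three groups with (almost) identical means -> small F, large p
print(stats.f_oneway([10, 11, 12], [10, 12, 11], [11, 10, 12]))

# one group with a clearly different mean -> large F, small p
print(stats.f_oneway([10, 11, 12], [10, 12, 11], [30, 31, 32]))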
Drive Wheels
Since ANOVA analyzes the difference between
different groups of the same variable, the groupby
function will come in handy. Because the ANOVA algorithm
averages the data automatically, we do not need to take the average
beforehand.
To see if different types of ‘drive-wheels’ impact
‘price’, we group the data.
grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
grouped_test2.head(2)
## drive-wheels price
## 0 rwd 13495.0
## 1 rwd 16500.0
## 3 fwd 13950.0
## 4 4wd 17450.0
## 5 fwd 15250.0
## 136 4wd 7603.0
df_gptest
## drive-wheels body-style price
## 0 rwd convertible 13495.0
## 1 rwd convertible 16500.0
## 2 rwd hatchback 16500.0
## 3 fwd sedan 13950.0
## 4 4wd sedan 17450.0
## .. ... ... ...
## 196 rwd sedan 16845.0
## 197 rwd sedan 19045.0
## 198 rwd sedan 21485.0
## 199 rwd sedan 22470.0
## 200 rwd sedan 22625.0
##
## [201 rows x 3 columns]
We can obtain the values for one of the groups using
the method "get_group".
grouped_test2.get_group('4wd')['price']
## 4 17450.0
## 136 7603.0
## 140 9233.0
## 141 11259.0
## 144 8013.0
## 145 11694.0
## 150 7898.0
## 151 8778.0
## Name: price, dtype: float64
We can use the function ‘f_oneway’ in the module
‘stats’ to obtain the F-test score and
P-value.
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])
print( "ANOVA results: F=", f_val, ", P =", p_val)
## ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
This is a great result: the large F-test score
shows a strong separation between the group means, and the
P-value of almost 0 implies near-certain statistical
significance. But does this hold between every pair of
the three tested groups?
Let’s examine them separately.
fwd and rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])
print( "ANOVA results: F=", f_val, ", P =", p_val )
## ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23
Let’s examine the other groups.
4wd and rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])
print( "ANOVA results: F=", f_val, ", P =", p_val)
## ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333
4wd and fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])
print("ANOVA results: F=", f_val, ", P =", p_val)
## ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666
Conclusion: Important Variables
We now have a better idea of what our data looks like and
which variables are important to take into account when
predicting the car price. We have narrowed it down to
the following variables:
Continuous numerical variables:
- Length
- Width
- Curb-weight
- Engine-size
- Horsepower
- City-mpg
- Highway-mpg
- Wheel-base
- Bore
Categorical variables:
- Drive-wheels
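As a closing sanity check (an optional sketch, not part of the original lab), we can recompute the Pearson correlation of each short-listed continuous variable with price in a single loop:
shortlist = ['length', 'width', 'curb-weight', 'engine-size', 'horsepower',
             'city-mpg', 'highway-mpg', 'wheel-base', 'bore']

for col in shortlist:
    r, p = stats.pearsonr(df[col], df['price'])
    print(f"{col:12s} r = {r:6.3f}  p = {p:.3g}")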
As we now move into building machine learning models to automate our
analysis, feeding the model with variables that meaningfully
affect our target variable will improve our model’s
prediction performance.