Dedicated to those dipping their toes to R Data Analysis.

EDA is the first step in Data Analysis
In EDA a researcher is doing the following:
- Looks at the distribution of each model variable
- Looks for missing values and data anomalies (e.g., outliers)
- Investigates 1-way effects: how each model variable affects the Dependent Variable (aka Y, Target Variable)
- Looks at some prominent 2-way (or even 3-way) effects
At the end of EDA one has a better idea about the value of the data and about what is a valid model to build

Before EDA
1. Understand what is the purpose of your analysis or model
2. Research what affects your dependent variable and how:
 2.a Using desk research (read papers, books, posts, etc.)
 2.b Using field research (talk to experts, practitioners, etc.)
3. Get your data (need to know data meaning and provenance)

#

In this example:

Purpose: Find which factors determine fuel efficiency in cars sold during 1974
Data: Leverage the preloaded dataset - this is a clean dataset.

We assume one uses the popular RStudio IDE.

###

Model Dataset

# Learn more about the 1974 Motor Trend US Magazine Data using the help command
help(mtcars) # Opens the help tab (bottom-right)

#Data Format:
#A data frame with 32 observations on 11 (numeric) variables.
#
#[, 1]  mpg Miles/(US) gallon
#[, 2]  cyl Number of cylinders
#[, 3]  disp    Displacement (cu.in.)
#[, 4]  hp  Gross horsepower
#[, 5]  drat    Rear axle ratio
#[, 6]  wt  Weight (1000 lbs)
#[, 7]  qsec    1/4 mile time
#[, 8]  vs  Engine (0 = V-shaped, 1 = straight)
#[, 9]  am  Transmission (0 = automatic, 1 = manual)
#[,10]  gear    Number of forward gears
#[,11]  carb    Number of carburetors

### Quick Data Overview

# This is a small dataset, so just run
mtcars #print(mtcars) does the same
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# how big is the dataset?
dim(mtcars) # Show Number of Rows and Number of Columns
## [1] 32 11
# Quick Overview Method 1: Use str command
str(mtcars) # Now I know all variables are numeric (or coded as numeric)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Quick Overview Method 2: Use RStudio View command
#RUN: View(mtcars) # Now a new tab opens (top-left) with a spreadsheet like view (click mpg arrow to sort)
# This way you can see big datasets when you need to.

Construct my Own Data Frame Using mtcars

df = mtcars # copy original dataset

Exploratory Data Analysis (EDA)

Dependent Variable: mpg Miles/(US) gallon

# Distribution
summary(df$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
hist(df$mpg) # Histogram

Explanatory Variables (aka Independent Variables, Predictors)

Show Distribution and 1-way effect ### Predictor: cyl Number of cylinders

summary(df$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000
hist(df$cyl)

table(df$cyl) # Table of counts: cyl takes values 4,6,or 8
## 
##  4  6  8 
## 11  7 14
boxplot(mpg~cyl, data=df)
# Avg MPG by Cyl
mpg_by_cyl = tapply(df$mpg,df$cyl,mean)
mpg_by_cyl
##        4        6        8 
## 26.66364 19.74286 15.10000
lines(mpg_by_cyl, col="red")

# Effect of cyl is not really linear: will treat it as factor (aka categorical variable)
df$cyl = factor(df$cyl)

Predictor: disp Displacement (cu.in.)

summary(df$disp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.1   120.8   196.3   230.7   326.0   472.0
hist(df$disp)

plot(mpg~disp, data=df) # Effect appears to be nonlinear
lines(lowess(df$disp, df$mpg), col = "red") # Show trend line

# Alternative: Log-Transform disp
plot(mpg~log(disp), data=df) # Closer to linear trend now
lines(lowess(log(df$disp), df$mpg), col = "red")

Predictor: hp Gross horsepower

summary(df$hp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0
hist(df$hp)

plot(mpg~hp, data=df) # Effect appears to be nonlinear
lines(lowess(df$hp, df$mpg), col = "red") # Show trend line

# After Log-Transform
plot(mpg~log(hp), data=df) # Closer to linear trend now
lines(lowess(log(df$hp), df$mpg), col = "red") # Show trend line

Predictor: drat Rear axle ratio

summary(df$drat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.760   3.080   3.695   3.597   3.920   4.930
hist(df$drat)

plot(mpg~drat, data=df) 
lines(lowess(df$drat, df$mpg), col = "red") # Show trend line

Predictor: wt Weight (1000 lbs)

summary(df$wt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424
hist(df$wt)

plot(mpg~wt, data=df)
lines(lowess(df$wt, df$mpg), col = "red") # Show trend line

Predictor: qsec 1/4 mile time

summary(df$qsec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   16.89   17.71   17.85   18.90   22.90
hist(df$qsec)

plot(mpg~qsec, data=df)
lines(lowess(df$qsec, df$mpg), col = "red") # Show trend line default
lines(lowess(df$qsec, df$mpg, f=1.5), col = "green") # Show trend line smoothed

Predictor: vs Engine (0 = V-shaped, 1 = straight)

summary(df$vs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4375  1.0000  1.0000
hist(df$vs)

# Treat as factor
df$vs = factor(df$vs, levels = c(0,1), labels = c("V-shaped","straight"))
summary(df$vs)
## V-shaped straight 
##       18       14
boxplot(mpg~vs,df)

Predictor: am Transmission (0 = automatic, 1 = manual)

summary(df$am)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4062  1.0000  1.0000
hist(df$am)

# Treat as factor
df$am = factor(df$am, levels = c(0,1), labels = c("automatic","manual"))
summary(df$am)
## automatic    manual 
##        19        13
boxplot(mpg~am,df)

Predictor: gear Number of forward gears

summary(df$gear)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.000   3.688   4.000   5.000
hist(df$gear,n=10)

# Treat as factor
df$gear = factor(df$gear)
summary(df$gear)
##  3  4  5 
## 15 12  5
boxplot(mpg~gear,df,varwidth=TRUE)

# Note relation between gear box and number of gear
table(df$am, df$gear)
##            
##              3  4  5
##   automatic 15  4  0
##   manual     0  8  5
# Most Automatic Transmissions had 3 gears in 1974
# Mercedes was ahead of the curve
subset(df,(am=="automatic")&(gear==4))
##            mpg cyl  disp  hp drat   wt qsec       vs        am gear carb
## Merc 240D 24.4   4 146.7  62 3.69 3.19 20.0 straight automatic    4    2
## Merc 230  22.8   4 140.8  95 3.92 3.15 22.9 straight automatic    4    2
## Merc 280  19.2   6 167.6 123 3.92 3.44 18.3 straight automatic    4    4
## Merc 280C 17.8   6 167.6 123 3.92 3.44 18.9 straight automatic    4    4

Predictor: carb Number of carburetors

summary(df$carb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.812   4.000   8.000
hist(df$carb, n=10)

# Treat as factor and merge levels
df$carb = factor(df$carb, levels=c(1,2,3,4,6,8), labels=c("1-2","1-2","3-4","3-4","6-8","6-8"))
summary(df$carb)
## 1-2 3-4 6-8 
##  17  13   2
boxplot(mpg~carb,df,varwidth=TRUE)

2-Way Interactions - Optional

# Graph of pairwise plots (first 6 columns)
pairs(df[,1:7], main = "Pairwise Relations", gap=0.2) # Pairwise Graphs of ALL Vars is way too busy

# Observe here the relationship between disp and hp

# Selected - Key 2-Way Interactions (focus on variables to include in model)

coplot(mpg~disp| am,data=df,panel=panel.smooth) # NOTE: MPG v disp is different depending on am

coplot(mpg~disp| cyl,data=df,panel=panel.smooth,rows=1) #NOTE: MPG goes down with disp - But not for cyl==6!

# Key  3-Way Interaction
coplot(mpg~disp| cyl+am,data=df,panel=panel.smooth) # NOTE: Automatic with 4 cylinders mpg goes up with disp!!!

subset(df,(cyl==4)&(am=="automatic")) # Key is Horsepower Not Displacement
##                mpg cyl  disp hp drat    wt  qsec       vs        am gear carb
## Merc 240D     24.4   4 146.7 62 3.69 3.190 20.00 straight automatic    4  1-2
## Merc 230      22.8   4 140.8 95 3.92 3.150 22.90 straight automatic    4  1-2
## Toyota Corona 21.5   4 120.1 97 3.70 2.465 20.01 straight automatic    3  1-2
# Will use Horsepower in Model
coplot(mpg~hp| cyl+am,data=df,panel=panel.smooth) 

Model Development

Use Background Knowledge and EDA
MPG is affected by a) Horsepower, b) whether type is luxury car or sports car, c) whether car buyer is sensitive to cost or ownership.

Choice A: Use qsec to capture performance
Choice B: Use wt to differentiate between gas guzzlers: luxury v sports
Choice C: Apply Log Transformation to wt (given EDA)
Missing Info: Owner sensitivity to cost of operation.

Build Model

mod = lm(mpg~qsec+log(wt),data=df)

# Results

summary(mod)
## 
## Call:
## lm(formula = mpg ~ qsec + log(wt), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0729 -1.3876 -0.4368  0.7493  5.4694 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2967     4.4603   4.999 2.54e-05 ***
## qsec          0.8932     0.2224   4.016 0.000384 ***
## log(wt)     -16.1783     1.2519 -12.923 1.47e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.177 on 29 degrees of freedom
## Multiple R-squared:  0.878,  Adjusted R-squared:  0.8696 
## F-statistic: 104.3 on 2 and 29 DF,  p-value: 5.661e-14
# Discussion:
# qsec: longer 1/4 mile time better MPG (P-value < 0.001)
# log(wt): lower weight better MPG (P-value < 0.001)
#
# Very Strong Fit: Adj R-squared:  0.8696

Specification Tests

plot(df$mpg,predict(mod)) # Predicted v Actual: Reasonable Model close to diagonal
abline(0,1,col="blue")

plot(mod, which=1) # Residuals v Fitted: No Clear Deviation from Linearity

plot(mod, which=2, id.n=5) #Q-Q Plot: No Clear Deviation from Normality

  # But: Fiat 128, Toyota Corolla, Chrysler Imperial show above normal dist residuals
plot(mod, which=5) #Residuals v Leverage: No Influential Observations - Cooks-D < 0.5

  # Fiat 128, Toyota Corolla, Chrysler Imperial stand Out but are not influential

stand_out = c(17,18,20)
df[stand_out,] # Show stand out values
##                    mpg cyl  disp  hp drat    wt  qsec       vs        am gear
## Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42 V-shaped automatic    3
## Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47 straight    manual    4
## Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90 straight    manual    4
##                   carb
## Chrysler Imperial  3-4
## Fiat 128           1-2
## Toyota Corolla     1-2
predict(mod)[stand_out] # Expected MPG
## Chrysler Imperial          Fiat 128    Toyota Corolla 
##          10.73805          26.93060          30.24962
apply(df,2,rank)[stand_out,]  # Show stand out ranks
##                   mpg  cyl disp   hp drat wt qsec   vs am gear carb
## Chrysler Imperial   5 25.5   30 28.0 13.0 31   15 23.5 10  8.0   24
## Fiat 128           31  6.0    3  4.5 26.5  6   27  7.5 26 21.5    9
## Toyota Corolla     32  6.0    1  3.0 29.5  3   28  7.5 26 21.5    9
  # NOTE: No obvious reason why stand-out cases are special.

Future Research

Why are stand out cases not great fits to our model?