2-Way
Interactions -
OptionalDedicated to those dipping their toes to R Data
Analysis.
EDA is the first step in Data Analysis
In EDA a researcher is doing the following:
- Looks at the distribution of each model variable
- Looks for missing values and data anomalies (e.g., outliers)
- Investigates 1-way effects: how each model variable affects the
Dependent Variable (aka Y, Target Variable)
- Looks at some prominent 2-way (or even 3-way) effects
At the end of EDA one has a better idea about the value of the data and
about what is a valid model to build
Before EDA
1. Understand what is the purpose of your analysis or model
2. Research what affects your dependent variable and how:
 2.a Using desk research (read papers, books, posts, etc.)
 2.b Using field research (talk to experts, practitioners, etc.)
3. Get your data (need to know data meaning and provenance)
In this example:
Purpose: Find which factors determine fuel
efficiency in cars sold during 1974
Data: Leverage the preloaded dataset
We assume one uses the popular RStudio IDE.
# Learn more about the 1974 Motor Trend US Magazine Data using the help command
help(mtcars) # Opens the help tab (bottom-right)
#Data Format:
#A data frame with 32 observations on 11 (numeric) variables.
#
#[, 1] mpg Miles/(US) gallon
#[, 2] cyl Number of cylinders
#[, 3] disp Displacement (cu.in.)
#[, 4] hp Gross horsepower
#[, 5] drat Rear axle ratio
#[, 6] wt Weight (1000 lbs)
#[, 7] qsec 1/4 mile time
#[, 8] vs Engine (0 = V-shaped, 1 = straight)
#[, 9] am Transmission (0 = automatic, 1 = manual)
#[,10] gear Number of forward gears
#[,11] carb Number of carburetors
### Quick Data Overview
# This is a small dataset, so just run
mtcars #print(mtcars) does the same
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## [1] 32 11
# Quick Overview Method 1: Use str command
str(mtcars) # Now I know all variables are numeric (or coded as numeric)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Quick Overview Method 2: Use RStudio View command
#RUN: View(mtcars) # Now a new tab opens (top-left) with a spreadsheet like view (click mpg arrow to sort)
# This way you can see big datasets when you need to.
Construct my Own Data Frame Using mtcars
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
Show Distribution and 1-way effect ### Predictor: cyl Number of cylinders
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 4.000 6.000 6.188 8.000 8.000
##
## 4 6 8
## 11 7 14
## 4 6 8
## 26.66364 19.74286 15.10000
# Effect of cyl is not really linear: will treat it as factor (aka categorical variable)
df$cyl = factor(df$cyl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.1 120.8 196.3 230.7 326.0 472.0
plot(mpg~disp, data=df) # Effect appears to be nonlinear
lines(lowess(df$disp, df$mpg), col = "red") # Show trend line
# Alternative: Log-Transform disp
plot(mpg~log(disp), data=df) # Closer to linear trend now
lines(lowess(log(df$disp), df$mpg), col = "red")
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.0 96.5 123.0 146.7 180.0 335.0
plot(mpg~hp, data=df) # Effect appears to be nonlinear
lines(lowess(df$hp, df$mpg), col = "red") # Show trend line
# After Log-Transform
plot(mpg~log(hp), data=df) # Closer to linear trend now
lines(lowess(log(df$hp), df$mpg), col = "red") # Show trend line
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.760 3.080 3.695 3.597 3.920 4.930
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 16.89 17.71 17.85 18.90 22.90
plot(mpg~qsec, data=df)
lines(lowess(df$qsec, df$mpg), col = "red") # Show trend line default
lines(lowess(df$qsec, df$mpg, f=1.5), col = "green") # Show trend line smoothed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4375 1.0000 1.0000
# Treat as factor
df$vs = factor(df$vs, levels = c(0,1), labels = c("V-shaped","straight"))
summary(df$vs)
## V-shaped straight
## 18 14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4062 1.0000 1.0000
# Treat as factor
df$am = factor(df$am, levels = c(0,1), labels = c("automatic","manual"))
summary(df$am)
## automatic manual
## 19 13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 4.000 3.688 4.000 5.000
## 3 4 5
## 15 12 5
##
## 3 4 5
## automatic 15 4 0
## manual 0 8 5
# Most Automatic Transmissions had 3 gears in 1974
# Mercedes was ahead of the curve
subset(df,(am=="automatic")&(gear==4))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 straight automatic 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 straight automatic 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 straight automatic 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 straight automatic 4 4
2-Way
Interactions - Optional# Graph of pairwise plots (first 6 columns)
pairs(df[,1:7], main = "Pairwise Relations", gap=0.2) # Pairwise Graphs of ALL Vars is way too busy
# Observe here the relationship between disp and hp
# Selected - Key 2-Way Interactions (focus on variables to include in model)
coplot(mpg~disp| am,data=df,panel=panel.smooth) # NOTE: MPG v disp is different depending on am
coplot(mpg~disp| cyl,data=df,panel=panel.smooth,rows=1) #NOTE: MPG goes down with disp - But not for cyl==6!
# Key 3-Way Interaction
coplot(mpg~disp| cyl+am,data=df,panel=panel.smooth) # NOTE: Automatic with 4 cylinders mpg goes up with disp!!!
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 straight automatic 4 1-2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 straight automatic 4 1-2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 straight automatic 3 1-2
Use Background Knowledge and EDA
MPG is affected by a) Horsepower, b) whether type is luxury car or
sports car, c) whether car buyer is sensitive to cost or
ownership.
Choice A: Use qsec to capture performance
Choice B: Use wt to differentiate between gas guzzlers: luxury v
sports
Choice C: Apply Log Transformation to wt (given EDA)
Missing Info: Owner sensitivity to cost of operation.
##
## Call:
## lm(formula = mpg ~ qsec + log(wt), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0729 -1.3876 -0.4368 0.7493 5.4694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2967 4.4603 4.999 2.54e-05 ***
## qsec 0.8932 0.2224 4.016 0.000384 ***
## log(wt) -16.1783 1.2519 -12.923 1.47e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.177 on 29 degrees of freedom
## Multiple R-squared: 0.878, Adjusted R-squared: 0.8696
## F-statistic: 104.3 on 2 and 29 DF, p-value: 5.661e-14
plot(df$mpg,predict(mod)) # Predicted v Actual: Reasonable Model close to diagonal
abline(0,1,col="blue")
# But: Fiat 128, Toyota Corolla, Chrysler Imperial show above normal dist residuals
plot(mod, which=5) #Residuals v Leverage: No Influential Observations - Cooks-D < 0.5
# Fiat 128, Toyota Corolla, Chrysler Imperial stand Out but are not influential
stand_out = c(17,18,20)
df[stand_out,] # Show stand out values
## mpg cyl disp hp drat wt qsec vs am gear
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 V-shaped automatic 3
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 straight manual 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 straight manual 4
## carb
## Chrysler Imperial 3-4
## Fiat 128 1-2
## Toyota Corolla 1-2
## Chrysler Imperial Fiat 128 Toyota Corolla
## 10.73805 26.93060 30.24962
## mpg cyl disp hp drat wt qsec vs am gear carb
## Chrysler Imperial 5 25.5 30 28.0 13.0 31 15 23.5 10 8.0 24
## Fiat 128 31 6.0 3 4.5 26.5 6 27 7.5 26 21.5 9
## Toyota Corolla 32 6.0 1 3.0 29.5 3 28 7.5 26 21.5 9
Why are stand out cases not great fits to our model?