Objective


I came upon a helpful 3D regression plot in An Introduction to Statistical Learning, but unfortunately the text did not provide us with a way to replicate it. In this report I will attempt to match the chart using R. I will also explore other datasets.

Advertising Dataset


For the first plot I chose to work with the advertising dataset from the ISLR website.

ISLR Website

Advertising Data

library(knitr)
library(kableExtra)
kable(Advertising[1:6,2:5])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
TV radio newspaper sales
230.1 37.8 69.2 22.1
44.5 39.3 45.1 10.4
17.2 45.9 69.3 9.3
151.5 41.3 58.5 18.5
180.8 10.8 58.4 12.9
8.7 48.9 75.0 7.2

The data includes sales for a particular product (in thousands of units) and TV, radio, and newspaper advertising budgets (in thousands of dollars)

library("plot3D")

# set the x, y, and z variables
x <- Advertising$radio
y <- Advertising$TV
z <- Advertising$sales

# Compute the linear regression 
fit <- lm(z ~ x + y)

# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# create the fitted points for droplines to the surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 1,colvar = NULL, col="red", 
          theta = 20, phi = 10, bty="b",
          xlab = "Radio", ylab = "TV", zlab = "Sales",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="black"), main = "Advertising")

# change the plot orientation to better view the bottom
theta = 0, phi = -60

Plot from An Introduction to Statistical Learning


The textbook describes a non-linear relationship (positive residuals lie along the 45-degree line and negative residuals lie along the edges) but it is difficult to visualize with the provided chart.

By recreating it I now had the flexibility to change the plot orientation. We can now clearly see the different patterns.

Income2 Dataset


For the second plot I chose to use the Income2 dataset from the ISLR website:

Income2 Data

kable(Income2[1:6,2:4])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
Education Seniority Income
21.58621 113.1034 99.91717
18.27586 119.3103 92.57913
12.06897 100.6897 34.67873
17.03448 187.5862 78.70281
19.93103 20.0000 68.00992
18.27586 26.2069 71.50449
# set the x, y, and z variables
x <- Income2$Education
y <- Income2$Seniority
z <- Income2$Income

# Compute the linear regression 
fit <- lm(z ~ x + y)

# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# create the fitted points for droplines to the surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 18, cex = 1.5, colvar=FALSE, col="red", theta = 20, phi = 30, bty="u",
col.panel ="azure2", expand =0.5, col.grid = "snow",
xlab = "Years of Education", ylab = "Seniority", zlab = "Income",  
surf = list(x = x.pred, y = y.pred, z = z.pred,  
facets = TRUE, col=ramp.col (col = c("yellow","darkorange"), n = 100, alpha=0.7), fit = fitpoints, border="black"),main = "Income")

# change the plot orientation to better view the bottom
theta = 0, phi = -40

Plot from An Introduction to Statistical Learning


In this case there is no clear pattern in the residuals

NYC SAT Scores for 2012


In a previous report I analyzed NYC’s SAT scores for 2012, which can be seen here:

SAT Regression

For the 3D plot I chose to compare SAT math scores, the percentage of Asian students, and SAT reading scores.

kable(SAT2[1:6,c(6,8,5)])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
MS AP RS
404 14.0 355
423 29.2 383
402 9.7 377
401 2.2 414
433 9.3 390
557 84.7 332
# x, y, z variables
x <- SAT2$AP # Asian %
y <- SAT2$RS # Reading Score
z <- SAT2$MS # math Score

# Compute the linear regression 
fit <- lm(z ~ x + y)

# predict values on regular xy grid
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# fitted points for droplines to surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE, ticktype="detailed", 
          col="black",theta = 20, phi = 30, bty="g",
          expand =0.7, 
          xlab = "Asian Pct", ylab = "Reading Score", zlab = "Math Score",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = TRUE, col=ramp.col (col = c("yellow","green","orchid1","cyan"), n = 100, alpha=1), fit = fitpoints, border="black"),main = "SAT")

#change the plot orientation
theta = 0, phi = -60


Unfortunately, since most NYC high schools have a low percentage of Asian students, the data is clustered at the left.

Community Health Data


In a previous report I analyzed community health data. The report is available here:

Community Health Data

For the 3D plot I chose to compare years of potential life lost, the percentage of children in poverty, and the percentage of adults in fair or poor health.

kable(CHD2[1:6,])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
YPLL PFP CIP
8824 18 19
7225 18 15
9586 26 50
11784 20 27
10908 21 19
12067 29 48
# x, y, z variables
x <- CHD2$CIP # Children in poverty
y <- CHD2$PFP # % of adults in fair or poor health
z <- CHD2$YPLL # years of potential lost life
# Compute the linear regression 
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 50
x.pred <- seq(floor(min(x)), ceiling(max(x)), length.out = grid.lines)
y.pred <- seq(floor(min(y)), ceiling(max(y)), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE, 
          col="black",theta = 20, phi = 30, bty="b2",
          expand =0.7, 
          xlab = "Children in Poverty", ylab = "Pct Fair or Poor Health", zlab = "Years of Potential Life Lost",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = NA,col ="red",fit = fitpoints ),main = "Community Health Data")

#Change the plot orientation
theta = 0, phi = -40


We can see a fairly linear relationship and possibly some heteroscedasticity. The residuals seem to have greater variance as the independant variables increase in value.