I came upon a helpful 3D regression plot in An Introduction to Statistical Learning, but unfortunately the text did not provide us with a way to replicate it. In this report I will attempt to match the chart using R. I will also explore other datasets.
For the first plot I chose to work with the advertising dataset from the ISLR website.
library(knitr)
library(kableExtra)
kable(Advertising[1:6,2:5])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
| TV | radio | newspaper | sales |
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |
| 8.7 | 48.9 | 75.0 | 7.2 |
The data includes sales for a particular product (in thousands of units) and TV, radio, and newspaper advertising budgets (in thousands of dollars)
library("plot3D")
# set the x, y, and z variables
x <- Advertising$radio
y <- Advertising$TV
z <- Advertising$sales
# Compute the linear regression
fit <- lm(z ~ x + y)
# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# create the fitted points for droplines to the surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 1,colvar = NULL, col="red",
theta = 20, phi = 10, bty="b",
xlab = "Radio", ylab = "TV", zlab = "Sales",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="black"), main = "Advertising")
# change the plot orientation to better view the bottom
theta = 0, phi = -60
Plot from An Introduction to Statistical Learning
The textbook describes a non-linear relationship (positive residuals lie along the 45-degree line and negative residuals lie along the edges) but it is difficult to visualize with the provided chart.
By recreating it I now had the flexibility to change the plot orientation. We can now clearly see the different patterns.
For the second plot I chose to use the Income2 dataset from the ISLR website:
kable(Income2[1:6,2:4])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
| Education | Seniority | Income |
|---|---|---|
| 21.58621 | 113.1034 | 99.91717 |
| 18.27586 | 119.3103 | 92.57913 |
| 12.06897 | 100.6897 | 34.67873 |
| 17.03448 | 187.5862 | 78.70281 |
| 19.93103 | 20.0000 | 68.00992 |
| 18.27586 | 26.2069 | 71.50449 |
# set the x, y, and z variables
x <- Income2$Education
y <- Income2$Seniority
z <- Income2$Income
# Compute the linear regression
fit <- lm(z ~ x + y)
# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# create the fitted points for droplines to the surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 18, cex = 1.5, colvar=FALSE, col="red", theta = 20, phi = 30, bty="u",
col.panel ="azure2", expand =0.5, col.grid = "snow",
xlab = "Years of Education", ylab = "Seniority", zlab = "Income",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = TRUE, col=ramp.col (col = c("yellow","darkorange"), n = 100, alpha=0.7), fit = fitpoints, border="black"),main = "Income")
# change the plot orientation to better view the bottom
theta = 0, phi = -40
Plot from An Introduction to Statistical Learning
In this case there is no clear pattern in the residuals
In a previous report I analyzed NYC’s SAT scores for 2012, which can be seen here:
For the 3D plot I chose to compare SAT math scores, the percentage of Asian students, and SAT reading scores.
kable(SAT2[1:6,c(6,8,5)])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
| MS | AP | RS |
|---|---|---|
| 404 | 14.0 | 355 |
| 423 | 29.2 | 383 |
| 402 | 9.7 | 377 |
| 401 | 2.2 | 414 |
| 433 | 9.3 | 390 |
| 557 | 84.7 | 332 |
# x, y, z variables
x <- SAT2$AP # Asian %
y <- SAT2$RS # Reading Score
z <- SAT2$MS # math Score
# Compute the linear regression
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE, ticktype="detailed",
col="black",theta = 20, phi = 30, bty="g",
expand =0.7,
xlab = "Asian Pct", ylab = "Reading Score", zlab = "Math Score",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = TRUE, col=ramp.col (col = c("yellow","green","orchid1","cyan"), n = 100, alpha=1), fit = fitpoints, border="black"),main = "SAT")
#change the plot orientation
theta = 0, phi = -60
Unfortunately, since most NYC high schools have a low percentage of Asian students, the data is clustered at the left.
In a previous report I analyzed community health data. The report is available here:
For the 3D plot I chose to compare years of potential life lost, the percentage of children in poverty, and the percentage of adults in fair or poor health.
kable(CHD2[1:6,])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")
| YPLL | PFP | CIP |
|---|---|---|
| 8824 | 18 | 19 |
| 7225 | 18 | 15 |
| 9586 | 26 | 50 |
| 11784 | 20 | 27 |
| 10908 | 21 | 19 |
| 12067 | 29 | 48 |
# x, y, z variables
x <- CHD2$CIP # Children in poverty
y <- CHD2$PFP # % of adults in fair or poor health
z <- CHD2$YPLL # years of potential lost life
# Compute the linear regression
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 50
x.pred <- seq(floor(min(x)), ceiling(max(x)), length.out = grid.lines)
y.pred <- seq(floor(min(y)), ceiling(max(y)), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE,
col="black",theta = 20, phi = 30, bty="b2",
expand =0.7,
xlab = "Children in Poverty", ylab = "Pct Fair or Poor Health", zlab = "Years of Potential Life Lost",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = NA,col ="red",fit = fitpoints ),main = "Community Health Data")
#Change the plot orientation
theta = 0, phi = -40
We can see a fairly linear relationship and possibly some heteroscedasticity. The residuals seem to have greater variance as the independant variables increase in value.