Objective

I came upon a helpful 3D regression plot in An Introduction to Statistical Learning, but unfortunately the text did not provide us with a way to replicate it. In this report I will attempt to match the chart using R. I will also explore other datasets.

Advertising Dataset

For the first plot I chose to work with the advertising dataset from the ISLR website.

ISLR Website

Advertising Data

library(knitr)
library(kableExtra)
kable(Advertising[1:6,2:5])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")

TV	radio	newspaper	sales
230.1	37.8	69.2	22.1
44.5	39.3	45.1	10.4
17.2	45.9	69.3	9.3
151.5	41.3	58.5	18.5
180.8	10.8	58.4	12.9
8.7	48.9	75.0	7.2

The data includes sales for a particular product (in thousands of units) and TV, radio, and newspaper advertising budgets (in thousands of dollars)

library("plot3D")

# set the x, y, and z variables
x <- Advertising$radio
y <- Advertising$TV
z <- Advertising$sales

# Compute the linear regression 
fit <- lm(z ~ x + y)

# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# create the fitted points for droplines to the surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 1,colvar = NULL, col="red", 
          theta = 20, phi = 10, bty="b",
          xlab = "Radio", ylab = "TV", zlab = "Sales",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="black"), main = "Advertising")

# change the plot orientation to better view the bottom
theta = 0, phi = -60

Plot from An Introduction to Statistical Learning

The textbook describes a non-linear relationship (positive residuals lie along the 45-degree line and negative residuals lie along the edges) but it is difficult to visualize with the provided chart.

By recreating it I now had the flexibility to change the plot orientation. We can now clearly see the different patterns.

Income2 Dataset

For the second plot I chose to use the Income2 dataset from the ISLR website:

Income2 Data

kable(Income2[1:6,2:4])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")

Education	Seniority	Income
21.58621	113.1034	99.91717
18.27586	119.3103	92.57913
12.06897	100.6897	34.67873
17.03448	187.5862	78.70281
19.93103	20.0000	68.00992
18.27586	26.2069	71.50449

# set the x, y, and z variables
x <- Income2$Education
y <- Income2$Seniority
z <- Income2$Income

# Compute the linear regression 
fit <- lm(z ~ x + y)

# create a grid from the x and y values (min to max) and predict values for every point
# this will become the regression plane
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# create the fitted points for droplines to the surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 18, cex = 1.5, colvar=FALSE, col="red", theta = 20, phi = 30, bty="u",
col.panel ="azure2", expand =0.5, col.grid = "snow",
xlab = "Years of Education", ylab = "Seniority", zlab = "Income",  
surf = list(x = x.pred, y = y.pred, z = z.pred,  
facets = TRUE, col=ramp.col (col = c("yellow","darkorange"), n = 100, alpha=0.7), fit = fitpoints, border="black"),main = "Income")

# change the plot orientation to better view the bottom
theta = 0, phi = -40

Plot from An Introduction to Statistical Learning

In this case there is no clear pattern in the residuals

NYC SAT Scores for 2012

In a previous report I analyzed NYC’s SAT scores for 2012, which can be seen here:

SAT Regression

For the 3D plot I chose to compare SAT math scores, the percentage of Asian students, and SAT reading scores.

kable(SAT2[1:6,c(6,8,5)])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")

MS	AP	RS
404	14.0	355
423	29.2	383
402	9.7	377
401	2.2	414
433	9.3	390
557	84.7	332

# x, y, z variables
x <- SAT2$AP # Asian %
y <- SAT2$RS # Reading Score
z <- SAT2$MS # math Score

# Compute the linear regression 
fit <- lm(z ~ x + y)

# predict values on regular xy grid
grid.lines = 40
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)

# fitted points for droplines to surface
fitpoints <- predict(fit)

# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE, ticktype="detailed", 
          col="black",theta = 20, phi = 30, bty="g",
          expand =0.7, 
          xlab = "Asian Pct", ylab = "Reading Score", zlab = "Math Score",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = TRUE, col=ramp.col (col = c("yellow","green","orchid1","cyan"), n = 100, alpha=1), fit = fitpoints, border="black"),main = "SAT")

#change the plot orientation
theta = 0, phi = -60

Unfortunately, since most NYC high schools have a low percentage of Asian students, the data is clustered at the left.

Community Health Data

In a previous report I analyzed community health data. The report is available here:

Community Health Data

For the 3D plot I chose to compare years of potential life lost, the percentage of children in poverty, and the percentage of adults in fair or poor health.

kable(CHD2[1:6,])%>%kable_styling(full_width = FALSE, position = "left", bootstrap_options = "striped")

YPLL	PFP	CIP
8824	18	19
7225	18	15
9586	26	50
11784	20	27
10908	21	19
12067	29	48

# x, y, z variables
x <- CHD2$CIP # Children in poverty
y <- CHD2$PFP # % of adults in fair or poor health
z <- CHD2$YPLL # years of potential lost life
# Compute the linear regression 
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 50
x.pred <- seq(floor(min(x)), ceiling(max(x)), length.out = grid.lines)
y.pred <- seq(floor(min(y)), ceiling(max(y)), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 19, cex = 0.5, colvar=FALSE, 
          col="black",theta = 20, phi = 30, bty="b2",
          expand =0.7, 
          xlab = "Children in Poverty", ylab = "Pct Fair or Poor Health", zlab = "Years of Potential Life Lost",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = NA,col ="red",fit = fitpoints ),main = "Community Health Data")

#Change the plot orientation
theta = 0, phi = -40

We can see a fairly linear relationship and possibly some heteroscedasticity. The residuals seem to have greater variance as the independant variables increase in value.

3D Regression Plotting

Paul Jozefek

2/18/2020

Objective

Advertising Dataset

Income2 Dataset

NYC SAT Scores for 2012

Community Health Data