The authors open with an example that gives a feel for an application of statistical learning, using a visualization: in this practice scenario we work for a consulting firm and are in charge of analyzing sales as a function of TV, radio, or newspaper advertising. The code that generated the plots follows, for anyone interested. Source: “An Introduction to Statistical Learning: With Applications in R”.
#load libraries
library(tidyverse)
library(gridExtra)
library(plot3D)
#read in data
ad_sales <- read_csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv")
#creating plots
p1 <- ggplot(ad_sales) + geom_point(aes(x = TV, y = sales)) + geom_smooth(aes(x = TV, y = sales),method='lm', formula= y~x)
p2 <- ggplot(ad_sales) + geom_point(aes(x = newspaper, y = sales)) + geom_smooth(aes(x = newspaper, y = sales),method='lm', formula= y~x)
p3 <- ggplot(ad_sales) + geom_point(aes(x = radio, y = sales)) + geom_smooth(aes(x = radio, y = sales),method='lm', formula= y~x)
grid.arrange(p1, p2, p3, ncol=3)
Why Estimate F?
There are two reasons why we would want to estimate F:
- prediction
  - example: we want to predict whether a patient is allergic to a certain medication.
- inference
  - understanding the relationship between X and Y: what factors contribute to an increase or decrease in Y?
  - example: what form of advertising contributes most to sales?
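As a rough sketch of the difference, the same fitted model can serve both goals. The example below uses the built-in mtcars data instead of the advertising data (an assumption, so that it runs without a download), and the variable choices and the `new_car` values are purely illustrative.

```r
# Illustrative only: mtcars stands in for the advertising data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Prediction: estimate Y for new values of X (a hypothetical car)
new_car <- data.frame(wt = 3.0, hp = 120)
pred <- predict(fit, newdata = new_car)
print(pred)

# Inference: look at how each predictor relates to Y
coefs <- coef(summary(fit))
print(coefs)
```

For prediction we only care how close `pred` is to the truth; for inference we read the signs, sizes, and standard errors of the coefficients.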
How to estimate F?
# x, y, z variables
x <- mtcars$wt
y <- mtcars$disp
z <- mtcars$mpg
# Compute the linear regression (z = ax + by + d)
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 26
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy),
                 nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 18, cex = 2,
          theta = 20, phi = 20, ticktype = "detailed",
          xlab = "wt", ylab = "disp", zlab = "mpg",
          surf = list(x = x.pred, y = y.pred, z = z.pred,
                      facets = NA, fit = fitpoints),
          main = "mtcars")
Try a simple method first.
Parametric methods involve a two-step, model-based approach:
- make an assumption about the functional form of the model
- then use a training dataset to fit the model
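The two steps might look like this in R. This is a minimal sketch: the mtcars data, the seed, the 22-row training split, and the choice of a linear form are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
train_idx <- sample(nrow(mtcars), size = 22)  # hypothetical ~70% training split
train <- mtcars[train_idx, ]

# Step 1: assume a functional form, here linear: mpg = b0 + b1 * wt
# Step 2: use the training data to estimate the parameters b0 and b1
fit <- lm(mpg ~ wt, data = train)
print(coef(fit))
```

Estimating F is reduced to estimating two numbers, which is what makes the parametric approach simple; the cost is that the assumed form may not match the true F.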
Parametric methods
- non-linear model
Measuring the quality of fit
- mean squared error (MSE): used for linear regression models
- degrees of freedom
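To make the mean squared error concrete, here is a small sketch that compares training and test MSE as the model gets more flexible. The mtcars data, the split, and the use of polynomial degree as a stand-in for flexibility are assumptions, not something taken from the book's examples.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
train_idx <- sample(nrow(mtcars), size = 22)  # hypothetical split
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# MSE: average squared difference between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

results <- data.frame(degree = 1:3, train_mse = NA_real_, test_mse = NA_real_)
for (d in results$degree) {
  # higher polynomial degree = more flexible fit
  fit <- lm(mpg ~ poly(wt, d), data = train)
  results$train_mse[d] <- mse(train$mpg, predict(fit))
  results$test_mse[d]  <- mse(test$mpg, predict(fit, newdata = test))
}
print(results)
```

Training MSE can only go down as the degree increases, because each model contains the previous one; test MSE has no such guarantee, which is why it is the quantity we actually care about.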
Bayes classifier, k-nearest neighbors (KNN)
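The Bayes classifier needs the true conditional distribution of Y given X, which we do not know in practice; KNN estimates it from the nearest training points instead. Below is a hedged sketch using `knn()` from the class package (assumed to be installed; it normally ships with R as a recommended package). The iris data, the split, and k = 5 are illustrative choices.

```r
library(class)  # assumed available; provides knn()

set.seed(1)  # arbitrary seed, only for reproducibility
train_idx <- sample(nrow(iris), size = 100)  # hypothetical split
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# classify each test point by majority vote among its k nearest training points
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# test error rate: fraction of misclassified test observations
err <- mean(pred != test_y)
print(err)
```

Smaller k gives a more flexible decision boundary, larger k a smoother one.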
# 3 linear regression
Questions about a dataset?
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we estimate the effect of each medium on sales?
- How accurately can we predict future sales?
- Is the relationship linear?
Estimating coefficients
y = mx + b
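The slope m and intercept b of this line are usually estimated by least squares. As a sketch, they can be computed by hand and checked against `lm()`; the mtcars variables here are only a stand-in for the advertising data.

```r
x <- mtcars$wt   # illustrative predictor
y <- mtcars$mpg  # illustrative response

# least squares estimates of the slope (m) and intercept (b)
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)
print(c(intercept = b, slope = m))

# the same estimates from lm()
fit <- lm(y ~ x)
print(coef(fit))
```

Both print the same two numbers, since `lm()` minimizes the same residual sum of squares.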
Standard error
The estimate of SD is the residual standard error (RSE); SD^2 is the variance of the errors.
Standard errors are used to compute confidence intervals.
Standard errors can also be used to perform hypothesis tests on the coefficients.
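As a sketch of how a standard error turns into a confidence interval, the snippet below builds a 95% interval for a slope by hand and compares it with `confint()`. The mtcars model is an illustrative assumption.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # illustrative model
est <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

b1 <- est["wt", "Estimate"]
se <- est["wt", "Std. Error"]

# 95% interval: estimate +/- t quantile * SE (roughly estimate +/- 2 * SE)
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(b1 - t_crit * se, b1 + t_crit * se)
print(ci_manual)

# the same interval from confint()
print(confint(fit, "wt", level = 0.95))
```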
t-test
The t-statistic measures the number of standard deviations that the coefficient estimate is away from 0.
P-values
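A minimal sketch of the t-statistic and its p-value for the null hypothesis that a slope is 0, computed by hand and compared with what `summary()` reports. The mtcars model is again only an illustrative assumption.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # illustrative model
est <- coef(summary(fit))

# t-statistic: how many standard errors the estimate is away from 0
t_stat <- (est["wt", "Estimate"] - 0) / est["wt", "Std. Error"]

# two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(c(t = t_stat, p = p_val))
print(est["wt", c("t value", "Pr(>|t|)")])
```

A small p-value means an estimate this far from 0 would be unlikely if there were no real relationship between the predictor and the response.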
Assessing the accuracy of the model
The quality of a linear regression fit is typically assessed with the residual standard error (RSE) and R^2.
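Both quantities can be computed directly from the residuals. The sketch below does so for a simple regression with one predictor and checks the results against `summary()`; the mtcars model is an assumption chosen so the code runs without a download.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # illustrative model
y <- mtcars$mpg
n <- length(y)

rss <- sum(residuals(fit)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares

rse <- sqrt(rss / (n - 2))    # n - 2 because two coefficients are estimated
r2  <- 1 - rss / tss          # proportion of variance explained

print(c(RSE = rse, R2 = r2))
print(c(RSE = summary(fit)$sigma, R2 = summary(fit)$r.squared))
```

RSE is in the units of Y, so it depends on the scale of the response; R^2 is always between 0 and 1, which makes it easier to compare across problems.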