Introduction

  • predicting a continuous or quantitative output is a regression problem
  • predicting whether the stock market will go up or down on any given day is a classification problem
  • predicting which potential group a customer might belong to, given demographic data, is a clustering problem

Chapter 2

Example 2.1

The authors open with an example application of statistical learning, illustrated with a visualization. In this practice scenario, we work for a consulting firm and are in charge of analyzing sales as a function of TV, radio, and newspaper advertising budgets. The code that generated the plots follows for anyone interested. Source: “An Introduction to Statistical Learning: With Applications in R”

#load libraries
library(tidyverse)
library(gridExtra)
library(plot3D)

#read in data
ad_sales <- read_csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv")

#creating plots
p1 <- ggplot(ad_sales) + geom_point(aes(x = TV, y = sales)) + geom_smooth(aes(x = TV, y = sales),method='lm', formula= y~x)
p2 <- ggplot(ad_sales) + geom_point(aes(x = newspaper, y = sales)) + geom_smooth(aes(x = newspaper, y = sales),method='lm', formula= y~x)
p3 <- ggplot(ad_sales) + geom_point(aes(x = radio, y = sales)) + geom_smooth(aes(x = radio, y = sales),method='lm', formula= y~x)

grid.arrange(p1, p2, p3, ncol=3)


Why Estimate F?

There are two reasons why we would want to estimate F:

  • prediction: for example, we want to predict whether a patient is allergic to a certain medication.
  • inference: understanding the relationship between X and Y. Which factors contribute to an increase or decrease in Y? For example, which form of advertising contributes most to sales?
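A minimal sketch of the two goals, using the built-in mtcars data as a stand-in (the choice of mpg, wt and hp, and the hypothetical new car, are assumptions made only for illustration). The same fitted model is used once for prediction and once for inference.

```r
# Illustrative only: mtcars stands in for a real prediction/inference problem.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Prediction: treat the fitted model as a black box and ask for an
# estimate of Y at new inputs (this new car is hypothetical).
new_car <- data.frame(wt = 3.0, hp = 120)
pred <- predict(fit, newdata = new_car)

# Inference: look at the estimated coefficients to see how Y changes
# as each predictor changes, holding the other fixed.
coefs <- coef(fit)

print(pred)
print(coefs)
```

For prediction only the accuracy of `pred` matters; for inference the signs and sizes of `coefs` are the point.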

How to estimate F?

# x, y, z variables
x <- mtcars$wt
y <- mtcars$disp
z <- mtcars$mpg
# Compute the linear regression (z = ax + by + d)
fit <- lm(z ~ x + y)
# predict values on regular xy grid
grid.lines = 26
x.pred <- seq(min(x), max(x), length.out = grid.lines)
y.pred <- seq(min(y), max(y), length.out = grid.lines)
xy <- expand.grid( x = x.pred, y = y.pred)
z.pred <- matrix(predict(fit, newdata = xy), 
                 nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints <- predict(fit)
# scatter plot with regression plane
scatter3D(x, y, z, pch = 18, cex = 2, 
          theta = 20, phi = 20, ticktype = "detailed",
          xlab = "wt", ylab = "disp", zlab = "mpg",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = NA, fit = fitpoints), main = "mtcars")

How do we estimate F?

Try a simple method first.

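Parametric methods involve a two-step, model-based approach:

  • make an assumption about the functional form of the model (for example, that it is linear)
  • then use a training dataset to fit the model, that is, to estimate its parameters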
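The two steps can be sketched on simulated data, where the true relationship is known. Everything here (the true intercept 2, slope 3, noise level and sample size) is an assumption chosen for the illustration, not something from the book.

```r
set.seed(42)

# Simulated training data with a known true relationship: y = 2 + 3x + noise.
x <- runif(200, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(200, sd = 1)

# Step 1: assume a functional form, here linear: y = b0 + b1 * x.
# Step 2: use the training data to estimate b0 and b1 by least squares.
param_fit <- lm(y ~ x)
estimates <- unname(coef(param_fit))

print(estimates)  # should land close to the true values 2 and 3
```

The appeal of the parametric approach is visible here: estimating the whole function reduces to estimating two numbers.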

Parametric methods are not restricted to linear models: the assumed form can also be non-linear.

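Measuring the quality of fit

  • mean squared error (MSE): the most commonly used measure for regression models, including linear regression
  • degrees of freedom: a quantity that summarizes the flexibility of a fitted curve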
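As a rough sketch, the mean squared error is the average squared difference between observed and predicted values. The split sizes, the seed and the helper name `mse` below are arbitrary choices for illustration; mtcars is again just a convenient built-in dataset.

```r
set.seed(1)

# Arbitrary train/test split of the built-in mtcars data.
train_idx <- sample(nrow(mtcars), size = 22)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt, data = train)

# Hypothetical helper: mean of squared prediction errors.
mse <- function(actual, predicted) mean((actual - predicted)^2)

train_mse <- mse(train$mpg, predict(fit, newdata = train))
test_mse <- mse(test$mpg, predict(fit, newdata = test))

print(c(train = train_mse, test = test_mse))
```

The test MSE, computed on observations not used for fitting, is the one that tells us how the model is likely to do on new data.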

  • Bayes classifier
  • k-nearest neighbors (KNN)
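The Bayes classifier assigns each observation to its most likely class given the predictors, but the true conditional probabilities are unknown in practice. KNN approximates them with the class proportions among the k closest training points. Below is a minimal base-R sketch; the function name `knn_predict`, the iris data, the split and k = 5 are all assumptions for illustration (in practice a package implementation such as `class::knn` would normally be used).

```r
# Hypothetical helper: classify each row of new_x by majority vote
# among its k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  preds <- apply(new_x, 1, function(point) {
    # Distance from this point to every training observation.
    d <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    nearest <- order(d)[seq_len(k)]
    # Class counts among the neighbours; ties go to the first level.
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
  unname(preds)
}

set.seed(1)
idx <- sample(nrow(iris), 100)
preds <- knn_predict(iris[idx, 1:4], iris$Species[idx], iris[-idx, 1:4], k = 5)
accuracy <- mean(preds == iris$Species[-idx])
print(accuracy)
```

Small k gives a very flexible decision boundary, while large k gives a smoother, less flexible one.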

Chapter 3: Linear Regression

Questions about a dataset?

  • Is there a relationship between advertising budget and sales?
  • How strong is the relationship between advertising budget and sales?
  • Which media contribute to sales?
  • How accurately can we estimate the effect of each medium on sales?
  • How accurately can we predict future sales?
  • Is the relationship linear?

Estimating coefficients

y = mx + b
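For a single predictor, the least squares estimates of the slope m and intercept b have a closed form. The sketch below computes them by hand and compares them with `lm()`; using mtcars (wt as x, mpg as y) is an assumption made so the example runs without downloading the advertising data.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Least squares slope: covariation of x and y over variation of x.
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least squares intercept: the line passes through (mean(x), mean(y)).
b <- mean(y) - m * mean(x)

# lm() should return the same two numbers (intercept first, then slope).
fit <- lm(y ~ x)
print(c(b = b, m = m))
print(coef(fit))
```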

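Standard error

σ^2 is the variance of the error term. It is usually unknown, so σ is estimated from the data; that estimate is the residual standard error (RSE).

Standard errors are used to compute confidence intervals.

Standard errors can also be used to perform hypothesis tests on the coefficients.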
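A sketch of how the standard error of the slope and a 95% confidence interval can be computed by hand for simple linear regression, checked against R's own output. The model (mpg on wt in mtcars) is an illustrative assumption. The interval here uses the exact t quantile; the rule of thumb of roughly two standard errors on either side gives nearly the same answer.

```r
fit <- lm(mpg ~ wt, data = mtcars)
x <- mtcars$wt
n <- length(x)

# Residual standard error: the estimate of sigma.
rss <- sum(residuals(fit)^2)
rse <- sqrt(rss / (n - 2))

# Standard error of the slope in simple linear regression.
se_slope <- rse / sqrt(sum((x - mean(x))^2))

# 95% confidence interval: estimate +/- t quantile * standard error.
slope <- unname(coef(fit)["wt"])
t_crit <- qt(0.975, df = n - 2)
ci_manual <- c(slope - t_crit * se_slope, slope + t_crit * se_slope)

# R's built-in interval for comparison.
ci_r <- confint(fit, "wt", level = 0.95)
print(ci_manual)
print(ci_r)
```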

t-test

The t-statistic measures the number of standard deviations that the coefficient estimate is away from 0.

P-values
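The t-statistic and p-value reported by `summary()` can be reproduced from the estimate and its standard error. This is a sketch on an assumed example model (mpg on wt in mtcars), testing the null hypothesis that the slope is 0.

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef_table <- summary(fit)$coefficients

est <- coef_table["wt", "Estimate"]
se <- coef_table["wt", "Std. Error"]

# t-statistic: how far the estimate is from 0, in units of its standard error.
t_stat <- (est - 0) / se

# Two-sided p-value: probability of a value at least this extreme
# under the null hypothesis, from a t distribution with n - 2 df.
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```

A small p-value suggests that an association this strong would be unlikely if the true slope were 0.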

Assessing the accuracy of the model

The quality of a linear regression fit is typically assessed with the residual standard error (RSE) and R^2.
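Both quantities can be computed directly from the residuals. The sketch below does so for an assumed example model (mpg on wt in mtcars, so one predictor) and compares the results with what `summary()` reports.

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
n <- length(y)
p <- 1  # number of predictors in this example model

# Residual sum of squares and total sum of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)

# RSE: typical size of a residual, in the units of y.
rse <- sqrt(rss / (n - p - 1))

# R^2: proportion of the variance in y explained by the model.
r2 <- 1 - rss / tss

print(c(RSE = rse, R2 = r2))
print(c(summary(fit)$sigma, summary(fit)$r.squared))
```

RSE is an absolute measure of lack of fit, while R^2 is unitless and always lies between 0 and 1 for a least squares fit with an intercept.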