Import the data from the textbook website

ads <- read.csv("Advertising.csv", header = TRUE)
library(tidyverse)
attach(ads)

Learn about the dataset. What kinds of variables are there?

names(ads)
## [1] "X"         "TV"        "radio"     "newspaper" "sales"
head(ads)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
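
As a small, self-contained sketch of this inspection step, the six rows printed by head(ads) can be typed in by hand and examined with str() and summary(). The name ads_head is made up for illustration, and treating X as a plain row counter is an assumption based on the printed values.

```r
# First six rows copied from the head(ads) output above.
# `ads_head` is a hypothetical name used only for this sketch.
ads_head <- data.frame(
  X         = 1:6,
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Show the type of every column.
str(ads_head)

# X looks like a row counter rather than a measurement (assumption),
# so leave it out of the numeric summaries.
vars <- ads_head[, c("TV", "radio", "newspaper", "sales")]
summary(vars)

# Check that all remaining variables are quantitative.
all_numeric <- all(sapply(vars, is.numeric))
all_numeric
```

On these rows, all four remaining variables are numeric, which is what makes a correlation and a linear model appropriate tools for the questions that follow.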

Describe the relationship between sales and TV advertising

cor(sales, TV) 
## [1] 0.7822244
mod <- lm(sales ~ TV)

ggplot(ads, aes(x = TV, y = sales)) +
  geom_point() +
  geom_abline(slope = coef(mod)[2], intercept = coef(mod)[1],
              color = "blue", lty = 2, lwd = 1) +
  theme_bw()

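TV and sales are positively correlated (r is about 0.78): observations with higher TV advertising values tend to have higher sales, and the fitted line slopes upward.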
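
One way to put a number on the strength of this relationship is to square the correlation. In simple linear regression the squared correlation equals the proportion of variance in the response explained by the line. The sketch below uses only the value printed by cor(sales, TV); the variable names are made up for illustration.

```r
# Correlation printed by cor(sales, TV) above.
r <- 0.7822244

# In simple linear regression, the squared correlation is the share of
# the variance in sales explained by a straight-line fit on TV.
r_squared <- r^2
round(r_squared, 4)
```

The result should agree, up to rounding, with the Multiple R-squared that summary(mod) reports for the same model.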

Create a linear model for sales as a function of TV

mod <- lm(sales ~ TV)
mod
## 
## Call:
## lm(formula = sales ~ TV)
## 
## Coefficients:
## (Intercept)           TV  
##     7.03259      0.04754

Interpret the slope in the context of the problem.

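Assuming the units used in the textbook (TV budget in thousands of dollars, sales in thousands of units), each additional $1,000 spent on TV advertising is associated with an average increase in sales of about 0.0475 thousand units, or roughly 48 units.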
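
To make the slope concrete, the sketch below multiplies the fitted slope by a few increases in the TV value. The coefficient is copied from the printed model; the vector delta_tv is hypothetical, chosen only for illustration, and the results are expressed in whatever units the dataset uses for TV and sales.

```r
# Slope for TV as printed by `mod` above.
b1 <- 0.04754

# Hypothetical increases in the TV variable (illustration only).
delta_tv <- c(1, 10, 100)

# Under the fitted line, the expected change in sales is slope * change in TV.
expected_change <- b1 * delta_tv
data.frame(delta_tv, expected_change)
```

These are average changes along the fitted line, not guarantees for any single observation.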

Create a confidence interval for the slope coefficient

confint(mod)
##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256
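
The interval for TV can be rebuilt by hand as estimate plus or minus a t critical value times the standard error. This is a minimal sketch that copies the estimate, standard error, and residual degrees of freedom from the summary(mod) output in this document rather than refitting the model, so the last digits can differ slightly from confint(mod) because of rounding.

```r
# Values copied from the summary(mod) output in this document.
b1    <- 0.047537   # slope estimate for TV
se_b1 <- 0.002691   # standard error of the slope
df    <- 198        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df)
ci <- b1 + c(-1, 1) * t_crit * se_b1
round(ci, 4)

# The interval excludes 0, which matches rejecting slope = 0 at the 5% level.
excludes_zero <- ci[1] > 0
excludes_zero
```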

Perform a hypothesis test for the slope coefficient.

summary(mod)
## 
## Call:
## lm(formula = sales ~ TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
anova(mod)
## Analysis of Variance Table
## 
## Response: sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## TV          1 3314.6  3314.6  312.14 < 2.2e-16 ***
## Residuals 198 2102.5    10.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

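Both methods lead to the same conclusion: reject the null hypothesis that the slope is 0. summary() reports a t-test on the slope (t = 17.67, p < 2e-16), and anova() reports the equivalent F-test (F = 312.14, p < 2.2e-16). In simple linear regression the F statistic is the square of the t statistic, so the two tests always agree.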
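
The agreement between the two tests can be checked numerically. The sketch below copies the rounded statistics from the printed tables instead of refitting the model, so the squared t value matches F only up to rounding; the variable names are made up for illustration.

```r
# Statistics copied from the summary(mod) and anova(mod) output above.
t_tv   <- 17.67     # t value for TV
f_tv   <- 312.14    # F value for TV
df     <- 198       # residual degrees of freedom
ss_tv  <- 3314.6    # sum of squares for TV
ss_res <- 2102.5    # residual sum of squares

# With one predictor, F is the square of the slope's t statistic.
c(t_squared = t_tv^2, F = f_tv)

# F can also be rebuilt from the ANOVA table as a ratio of mean squares.
f_from_ss <- (ss_tv / 1) / (ss_res / df)
f_from_ss

# Two-sided p-value from the t distribution and upper-tail p-value from F.
p_t <- 2 * pt(-abs(t_tv), df)
p_f <- pf(f_tv, 1, df, lower.tail = FALSE)
c(p_t = p_t, p_f = p_f)
```

Both p-values fall far below any conventional significance level, which is why the printed output shows only an upper bound.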

Predict the response for a new observation of TV at 100. Include the prediction and confidence intervals. Why are these intervals different?

newdata <- data.frame(TV = 100)
predict(mod, newdata, interval="confidence")
##        fit      lwr     upr
## 1 11.78626 11.26782 12.3047
predict(mod, newdata, interval = "prediction")
##        fit      lwr      upr
## 1 11.78626 5.339251 18.23326

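The prediction interval is much wider because it covers a single new observation at TV = 100, so it must account for the scatter of individual points around the regression line as well as the uncertainty in the estimated line. The confidence interval covers only the mean response at TV = 100, so it reflects only the uncertainty in the estimated line.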
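
The difference in width can be traced to one extra variance term. For a new observation, the squared standard error is the residual variance plus the squared standard error of the estimated mean. The sketch below recovers the standard error of the mean from the printed confidence interval, adds the residual variance, and rebuilds the prediction interval. All inputs are rounded values copied from the output above, so the result matches predict() only approximately; the variable names are made up for illustration.

```r
# Values copied from the predict() and summary(mod) output above.
fit    <- 11.78626   # fitted value at TV = 100
ci_lwr <- 11.26782   # confidence interval, lower bound
ci_upr <- 12.3047    # confidence interval, upper bound
s      <- 3.259      # residual standard error
df     <- 198        # residual degrees of freedom

t_crit <- qt(0.975, df)

# Standard error of the estimated mean response, backed out of the CI width.
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# A single new observation adds the residual variance s^2 on top of that.
se_pred <- sqrt(s^2 + se_mean^2)

# Rebuilt 95% prediction interval.
pred_int <- fit + c(-1, 1) * t_crit * se_pred
round(pred_int, 2)

# Ratio of the prediction interval width to the confidence interval width.
width_ratio <- diff(pred_int) / (ci_upr - ci_lwr)
width_ratio
```

Because the residual standard error is much larger than the standard error of the mean, the residual term dominates and the prediction interval comes out many times wider than the confidence interval.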