For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks and questions to be completed are highlighted in a larger bold font and numbered according to their section.
In a given year, if it rains more, we may see an increase in crop production, because more water may lead to more plant growth.
This is a direct relationship; the number of fruits may be predictable from the amount of rainfall in a given year.
This example represents simple linear regression, an extremely useful technique that allows us to predict the values of one variable based on another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.
We are going to use tidyverse, a collection of R packages designed for data science.
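The package-loading command itself is not shown in this output; judging by the "Loading required package" messages below, it was presumably something like the following sketch (require() is what produces exactly those messages):
require(tidyverse)  # attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
require(plotly)     # interactive graphics; masks filter() and layout()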
## Loading required package: tidyverse
## -- Attaching packages ----------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts -------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() rename() head()
# read the Advertising data and rename the unnamed index column X to case_number
mydata <- read.csv(file = "data/Advertising.csv")
mydata <- rename(mydata, "case_number" = "X")
head(mydata)
## case_number TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
#corr = cor( MYDATA )
#corr
# drop the case_number column (column 1) before computing pairwise correlations
corr <- cor(mydata[ -c(1) ])
corr
## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
The 1.00 values on the diagonal just mean that each variable is perfectly correlated with itself, which makes sense. Aside from the diagonal, the strongest correlation is between TV and sales (about 0.78).
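To rank the advertising channels at a glance, the sales row of the correlation matrix can be sorted (an optional sketch using base R's sort()):
# correlations of each advertising channel with sales, largest first
sort(corr["sales", c("TV", "radio", "newspaper")], decreasing = TRUE)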
qplot( x = radio, y = sales, data = mydata)
As radio advertising spending increases, sales seem to increase as well.
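Since plotly was loaded earlier, the same scatterplot could also be drawn interactively with plot_ly(); this is an optional sketch, not one of the required commands:
# interactive scatterplot of radio spending vs. sales
plot_ly(mydata, x = ~radio, y = ~sales, type = "scatter", mode = "markers")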
#Simple Linear Regression Model
#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg <- lm( sales ~ radio, data = mydata )
reg
##
## Call:
## lm(formula = sales ~ radio, data = mydata)
##
## Coefficients:
## (Intercept) radio
## 9.3116 0.2025
#Summary of Simple Linear Regression Model
#summary(MODEL)
summary(reg)
##
## Call:
## lm(formula = sales ~ radio, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7305 -2.1324 0.7707 2.7775 8.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31164 0.56290 16.542 <2e-16 ***
## radio 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
The multiple R-squared and adjusted R-squared are both around 0.33, which means the radio budget alone explains only about a third of the variation in sales; most of the variation is left unexplained, so the points sit far from the regression line.
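Both values can also be pulled directly from the summary object instead of being read off the printout (a minimal sketch; r.squared and adj.r.squared are standard components of summary.lm):
# multiple and adjusted R-squared for the radio-only model
summary(reg)$r.squared
summary(reg)$adj.r.squared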
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()
p <- qplot( x = radio, y = sales, data = mydata) + geom_point()
#Add a trend line using a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method = "lm", formula = y ~ x)
As radio spending increases, the points spread further from the fitted line, so the simple model's errors grow for larger budgets.
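The Model 2 prediction later in this lab uses coefficients (2.92110, 0.18799, 0.04575) from a two-predictor model whose fitting step is not shown in this output; presumably it was fit along these lines before the full three-predictor model below (a sketch, assuming the object was named mlr1):
# multiple linear regression on radio and TV only
mlr1 <- lm(sales ~ radio + TV, data = mydata)
summary(mlr1)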
mlr2 <- lm( sales ~ radio + TV + newspaper, data = mydata)
mlr2
##
## Call:
## lm(formula = sales ~ radio + TV + newspaper, data = mydata)
##
## Coefficients:
## (Intercept) radio TV newspaper
## 2.938889 0.188530 0.045765 -0.001037
summary(mlr2)
##
## Call:
## lm(formula = sales ~ radio + TV + newspaper, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
This model is actually slightly weaker than the radio + TV model, because the adjusted R-squared decreases by about 0.0006 when newspaper is added. This means including newspaper does not improve the fit.
The radio and TV model is the best at predicting the dependent variable: it has the highest adjusted R-squared of the models considered.
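One quick numeric check of that comparison is AIC(), where a lower value indicates a better balance of fit and complexity (an optional sketch; mlr1 is the radio + TV fit sketched above):
# compare the radio-only, radio + TV, and full models
AIC(reg, mlr1, mlr2)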
MODEL 1
Radio = 69
Model1 = 9.31164 + 0.20250 * (Radio)
Model1
## [1] 23.28414
MODEL 2
Radio = 69
TV = 255
Model2 = 2.92110 + 0.18799 * (Radio) + 0.04575 * (TV)
Model2
## [1] 27.55866
MODEL 3
Radio = 69
TV = 255
newspaper = 75
Model3 = 2.938889 + 0.188530 * (Radio) + 0.045765 *(TV) + -0.001037 * (newspaper)
Model3
## [1] 27.53976
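Rather than typing the coefficients by hand, the same predictions can be obtained with predict(), which also avoids rounding error (a minimal sketch using the models fit above; mlr1 is the radio + TV sketch):
# predicted sales for radio = 69, TV = 255, newspaper = 75
new_obs <- data.frame(radio = 69, TV = 255, newspaper = 75)
predict(reg, new_obs)    # Model 1: radio only
predict(mlr1, new_obs)   # Model 2: radio + TV
predict(mlr2, new_obs)   # Model 3: radio + TV + newspaper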
To complete the last task, follow the directions found below. Make sure to take screenshots of the results obtained and of any questions asked, and attach them.
knitr::include_graphics("image2.png")
knitr::include_graphics("image3.png")
knitr::include_graphics("image1.png")
Radio and TV are the best predictors of sales, which confirms the results from earlier.
As I just mentioned, Watson found that using radio and TV together is the best way to predict sales. Newspaper is not included in this calculation, nor does newspaper alone have any noteworthy predictive strength for sales.