For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks and questions to be completed are highlighted in a larger bold font and numbered according to their section.
In a year with more rain, we may see an increase in crop production, because more water may lead to more plants.
This is a direct relationship: the number of fruits may be predictable from the amount of rainfall in a given year.
This example illustrates simple linear regression, an extremely useful concept that lets us predict the values of one variable based on another variable.
This lab explores simple linear regression, multiple linear regression, and Watson Analytics.
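The rainfall example can be written as an equation. The notation below is the standard textbook form and is not taken from the lab handout: $y$ is the variable being predicted (crop production), $x$ is the predictor (rainfall), $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the part of $y$ the line does not explain.

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

Multiple linear regression follows the same pattern with one slope per additional predictor, for example $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.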
We are going to use tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() rename() head()
mydata = read.csv(file="data/Advertising.csv")
head(mydata)
## X TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
corr = cor(mydata[ -c(1) ])
corr
## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
The diagonal values are all 1 because a variable is always perfectly correlated with itself. TV and sales have the strongest correlation (0.78).
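As a rough sketch, the same ranking can be done in code rather than by eye. The values below are copied from the sales row of the correlation matrix printed above; the names `sales_cor`, `ranked`, and `strongest` are made up for this illustration.

```r
# Correlations of each predictor with sales, copied from the printed matrix
sales_cor <- c(TV = 0.7822244, radio = 0.5762226, newspaper = 0.2282990)

# Rank predictors from strongest to weakest linear association with sales
# (abs() so that a strong negative correlation would also rank highly)
ranked <- sort(abs(sales_cor), decreasing = TRUE)
ranked

# Name of the predictor most strongly correlated with sales
strongest <- names(ranked)[1]
strongest
```

With the full dataset loaded, the same vector could presumably be taken straight from the matrix with `corr["sales", c("TV", "radio", "newspaper")]`.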
qplot( x =radio, y = sales, data = mydata)
It's a positive correlation, meaning the points drift upward as they move to the right.
#Simple Linear Regression Model
model <- lm( mydata$sales ~ mydata$TV )
#Summary of Simple Linear Regression Model
summary(model)
##
## Call:
## lm(formula = mydata$sales ~ mydata$TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## mydata$TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
sales = 7.0326 + (0.0475) * (TV). With an R-squared of 0.6119, TV spending explains about 61% of the variation in sales, which indicates a reasonably good fit for a single predictor.
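One way to sanity-check the R-squared: in a simple linear regression with a single predictor, R-squared equals the squared correlation between the predictor and the response. The short check below uses the TV and sales correlation from the matrix printed earlier; `r_tv_sales` and `r_squared` are names chosen just for this sketch.

```r
# Correlation between TV and sales, copied from the correlation matrix
r_tv_sales <- 0.7822244

# For one predictor, R-squared is the square of that correlation
r_squared <- r_tv_sales^2
round(r_squared, 4)
```

The result should agree with the "Multiple R-squared" line in the summary output up to rounding.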
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()
p <- qplot( x = TV, y = sales, data = mydata) + geom_point()
#Add a trend line using a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method = "lm", formula = y ~ x)
TV and sales are positively correlated, thus the trend line has a positive slope.
model3 <- lm( sales ~ radio + TV + newspaper, data=mydata)
summary(model3)
##
## Call:
## lm(formula = sales ~ radio + TV + newspaper, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
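The coefficient table can be read more closely with a little arithmetic. This sketch assumes the usual definitions: the t value is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom (196 here). The estimates and standard errors are copied from the table above, so the results only match the printed ones up to rounding.

```r
# Estimates and standard errors copied from summary(model3)
est <- c(radio = 0.188530, TV = 0.045765, newspaper = -0.001037)
se  <- c(radio = 0.008611, TV = 0.001395, newspaper = 0.005871)

# t value = estimate / standard error
t_val <- est / se

# Two-sided p-value from the t distribution with 196 residual df
p_val <- 2 * pt(-abs(t_val), df = 196)

round(t_val, 3)
signif(p_val, 2)
```

The newspaper p-value is far above 0.05, so once TV and radio are in the model there is no evidence that newspaper spending adds predictive value for sales.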
MODEL 1
TV = 255
salesmodel = 7.032594 + 0.047537 * (TV)
salesmodel
## [1] 19.15453
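Typing coefficients by hand is error-prone, so R's `predict()` may be a safer route. The sketch below is only an illustration: it fits a model on the six rows shown by `head(mydata)`, not the full file, so its numbers will differ from the lab's results. The names `toy`, `toy_model`, and `new_budget` are hypothetical. Note that `predict()` with `newdata` needs the model to be fit with `sales ~ TV, data = ...` rather than `mydata$sales ~ mydata$TV`.

```r
# Toy data: the six rows printed by head(mydata), NOT the full dataset
toy <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Fit with a data argument so predict() can look up TV in newdata
toy_model <- lm(sales ~ TV, data = toy)

# Predict sales for a TV budget of 255
new_budget <- data.frame(TV = 255)
pred <- predict(toy_model, newdata = new_budget)

# The same prediction by hand: intercept + slope * TV
by_hand <- coef(toy_model)[["(Intercept)"]] + coef(toy_model)[["TV"]] * 255

# Both routes should agree
all.equal(unname(pred), by_hand)
```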
MODEL 2
radio = 69
TV = 255
salesmodel2 = 2.92110 + 0.18799 * (radio) + 0.04575 * (TV)
salesmodel2
## [1] 27.55866
MODEL 3
radio = 69
TV = 255
newspaper = 75
salesmodel3 = 2.938889 + 0.188530 * (radio) + 0.045765 * (TV) - 0.001037 * (newspaper)
salesmodel3
## [1] 27.53976
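The three predictions can be lined up side by side with one small helper. This is a sketch, not part of the lab template: `predict_sales` is a hypothetical function. The coefficients are copied from this report: Model 1 from the `summary(model)` coefficient table, Model 2 from the MODEL 2 calculation, and Model 3 from `summary(model3)`.

```r
# Hypothetical helper: intercept plus the sum of slope * spend
predict_sales <- function(intercept, slopes, spend) {
  # Match budgets to slopes by name so the order does not matter
  intercept + sum(slopes * spend[names(slopes)])
}

# Budgets used in the predictions above
budget <- c(TV = 255, radio = 69, newspaper = 75)

# Model 1: TV only
pred1 <- predict_sales(7.032594, c(TV = 0.047537), budget)
# Model 2: radio and TV
pred2 <- predict_sales(2.92110, c(radio = 0.18799, TV = 0.04575), budget)
# Model 3: radio, TV, and newspaper
pred3 <- predict_sales(2.938889,
                       c(radio = 0.188530, TV = 0.045765, newspaper = -0.001037),
                       budget)

round(c(model1 = pred1, model2 = pred2, model3 = pred3), 2)
```

Models 2 and 3 give nearly the same prediction, which is consistent with the newspaper coefficient being close to zero.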
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.
knitr::include_graphics('img1.png')
knitr::include_graphics('img2.png')
TV and radio are the most influential variables for sales.

### 3B) Note the predictive power strength of the reported variables. Consider the one-field predictive model only, describe your findings, and add a screenshot.
knitr::include_graphics('img3.png')
TV and radio are the main drivers of sales.

### 3C) How do the Watson results reconcile with your findings based on the R regression analysis in task 2? Explain how.

The Watson results support the findings from task 2: both show that TV and radio are the main drivers of sales.