For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
In a given year, if it rains more, we may see an increase in crop production, because more water may lead to more plant growth.
This is a direct relationship; the number of fruits may be predictable from the amount of rainfall in a given year.
This example illustrates simple linear regression, an extremely useful technique that allows us to predict the value of one variable from another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and Watson Analytics.
We are going to use tidyverse, a collection of R packages designed for data science.
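The messages below come from loading these packages; the exact loading chunk is not shown here, but a minimal sketch consistent with the messages would be:
# Load the packages used throughout this lab; require() prints the
# "Loading required package" messages seen below
require(tidyverse)
require(plotly)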
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv(), rename(), head()
mydata = read.csv(file="data/Advertising.csv")
head(mydata)
## X TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
mydata <- rename(mydata, "case_number" = "X")
head(mydata)
## case_number TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
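The chunk above uses base read.csv(), which names the unnamed first column X. A tidyverse-flavoured sketch using the read_csv() and rename() commands listed above might look like this (it assumes readr names the blank column X1; adjust if your version names it differently):
# Alternative import with readr/dplyr; read_csv() returns a tibble
mydata <- read_csv("data/Advertising.csv")
mydata <- rename(mydata, case_number = X1)  # assumes readr calls the blank column X1
head(mydata)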
sales = mydata$sales
TV = mydata$TV
radio = mydata$radio
news = mydata$newspaper
#corr = cor( MYDATA )
#corr
corr = cor(mydata[ -c(1)])
corr
## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
The diagonal is 1.0 because it measures the correlation of each variable with itself, which is always perfect. The strongest correlation with sales is TV (0.78), and the second strongest is radio (0.58).
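As a quick check of that ranking, each predictor's correlation with sales can be pulled out of the matrix and sorted:
# Sort the predictors by their correlation with sales
sort(corr["sales", c("TV", "radio", "newspaper")], decreasing = TRUE)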
#Use radio because TV is above 0.60. Radio is 0.576.
qplot( x = radio, y = sales, data = mydata)
The points cluster more tightly toward the top of the plot, where sales are highest for each level of radio spending. It is a positive correlation, although the points are fairly scattered overall.
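Since plotly is already loaded, an interactive version of the same scatterplot can be made by wrapping the ggplot object (a minimal sketch, not required for the task):
# Interactive radio vs. sales scatterplot
ggplotly(qplot(x = radio, y = sales, data = mydata))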
#Simple Linear Regression Model
#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg <- lm(sales ~ radio)
#Summary of Simple Linear Regression Model
#summary(MODEL)
summary(reg)
##
## Call:
## lm(formula = sales ~ radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7305 -2.1324 0.7707 2.7775 8.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31164 0.56290 16.542 <2e-16 ***
## radio 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
The Multiple R-squared is 0.332 and the Adjusted R-squared is 0.3287. These are not very promising values; such a low R-squared indicates a poor fit to the data.
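As a sanity check, in simple linear regression the R-squared equals the squared correlation between the two variables, which reproduces the 0.332 reported above:
# Squared correlation between sales and radio equals the model's R-squared
cor(sales, radio)^2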
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()
p <- qplot( x = radio, y = sales, data = mydata) + geom_point()
#Add a trend line plot using the a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method ="lm" , formula = y ~ x)
This does not look like a very reliable model, since most of the points do not fall near the line. There does appear to be a positive relationship, even though the data do not fit the line well.
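Beyond the scatterplot, base R's diagnostic plots give a quick look at how well the simple model fits; a roughly flat, patternless residuals-vs-fitted panel would indicate a better fit:
# Standard lm diagnostic plots for the simple regression model
par(mfrow = c(2, 2))
plot(reg)
par(mfrow = c(1, 1))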
mlrl2 <- lm(sales ~ radio + TV + news)
summary(mlrl2)
##
## Call:
## lm(formula = sales ~ radio + TV + news)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## news -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Compared to the previous (radio and TV) model, the accuracy of this model goes down slightly, though it is still far better than the first simple regression. The Multiple R-squared stays essentially the same while the Adjusted R-squared decreases slightly, and the newspaper coefficient is not significant (p = 0.86). Overall this is a good fit for the data, just not the best one.
The second model, which considers radio and TV, is the best model for predicting the dependent variable: it has the highest Adjusted R-squared and uses the two strongest indicators of sales.
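The radio-and-TV model referred to above is not printed in this section; a sketch of how it could be fit and compared with the three-predictor model is shown below (the name mlrl1 is only illustrative; its coefficients are the ones used in the MODEL 2 calculation):
# Two-predictor model (radio + TV only)
mlrl1 <- lm(sales ~ radio + TV)
summary(mlrl1)$coefficients
# F-test comparing the two- and three-predictor models; a large p-value
# indicates that newspaper adds no explanatory power
anova(mlrl1, mlrl2)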
MODEL 1
9.31164 + 0.20250*69
## [1] 23.28414
MODEL 2
2.9211 + 0.18799*69 + 0.04575*255
## [1] 27.55866
MODEL 3
0.188530*69 + 0.045765*255 - 0.001037*75 + 2.938889
## [1] 27.53976
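The same predictions can be obtained with predict() instead of hand arithmetic, using the values plugged in above (radio = 69, TV = 255, newspaper = 75); note that the third model was fit with the variable named news:
# Predictions for the new observation used above
predict(reg, newdata = data.frame(radio = 69))                         # MODEL 1
predict(mlrl2, newdata = data.frame(radio = 69, TV = 255, news = 75))  # MODEL 3
# MODEL 2 requires the two-predictor sketch (mlrl1) from earlier:
# predict(mlrl1, newdata = data.frame(radio = 69, TV = 255))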
To complete the last task, follow the directions below. Make sure to take screenshots and attach pictures of the results obtained and any questions asked.
Sales are best predicted using the radio and TV values in both models. This is interesting because newspaper was not even considered to be a driver, which could be due to the 78% data accuracy reported for that variable. I also find it interesting that it is only 72% accurate. It is nice to see visually how much closer the 94% predictive strength of radio and TV together is to sales than the single-driver indicators.
knitr::include_graphics("img/PredictiveModel1.PNG")
knitr::include_graphics("img/PredictiveModel2.PNG")
The driver strength of TV is 59% and that of radio is 32%. This is not surprising, as the model run in R also indicated that TV was a stronger predictor of sales than radio.
knitr::include_graphics("img/DriverStrength.PNG")
The Watson results validate the findings of the R regression analysis. Both show that using TV and radio as predictors yields the best indicator of sales. I find it interesting that newspaper is not used at all in Watson, but both Watson and R found that it is not a reliable indicator, due in part to the outliers.