For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
In a given year, if it rains more, we may see that there might be an increase in crop production. This is because more water may lead to more plants.
This is a direct relationship; the number of fruits may be able to be predicted by amount of waterfall in a certain year.
This example represents simple linear regression, which is an extremely useful concept that allows us to predict values of a certain variable based off another variable.
This lab will explore the concepts of simple linear regression, multiple linear regression, and watson analytics.
We are going to use tidyverse a collection of R packages designed for data science.
Loading required package: tidyverse
[30m-- [1mAttaching packages[22m --------------------------------------- tidyverse 1.2.1 --[39m
[30m[32mv[30m [34mggplot2[30m 2.2.1 [32mv[30m [34mpurrr [30m 0.2.4
[32mv[30m [34mtibble [30m 1.4.2 [32mv[30m [34mdplyr [30m 0.7.4
[32mv[30m [34mtidyr [30m 0.7.2 [32mv[30m [34mstringr[30m 1.2.0
[32mv[30m [34mreadr [30m 1.1.1 [32mv[30m [34mforcats[30m 0.2.0[39m
[30m-- [1mConflicts[22m ------------------------------------------ tidyverse_conflicts() --
[31mx[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31mx[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
Loading required package: plotly
Attaching package: <U+393C><U+3E31>plotly<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:ggplot2<U+393C><U+3E32>:
last_plot
The following object is masked from <U+393C><U+3E31>package:stats<U+393C><U+3E32>:
filter
The following object is masked from <U+393C><U+3E31>package:graphics<U+393C><U+3E32>:
layout
Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() rename() head()
mydata = read.csv(file="data/Advertising.csv")
head(mydata)
#corr = cor( MYDATA )
#corr
corr = cor(mydata)
corr
X TV radio newspaper sales
X 1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
TV 0.01771469 1.00000000 0.05480866 0.05664787 0.78222442
radio -0.11068044 0.05480866 1.00000000 0.35410375 0.57622257
newspaper -0.15494414 0.05664787 0.35410375 1.00000000 0.22829903
sales -0.05161625 0.78222442 0.57622257 0.22829903 1.00000000
These variables are correlated with themselves. Pairs with the strongest correlations: Radio and TV Newspaper and TV Sales and TV
qplot( x = mydata$radio, y = mydata$sales, data = mydata)
qplot
function (x, y = NULL, ..., data, facets = NULL, margins = FALSE,
geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
main = NULL, xlab = deparse(substitute(x)), ylab = deparse(substitute(y)),
asp = NA, stat = NULL, position = NULL)
{
if (!missing(stat))
warning("`stat` is deprecated", call. = FALSE)
if (!missing(position))
warning("`position` is deprecated", call. = FALSE)
if (!is.character(geom))
stop("`geom` must be a character vector", call. = FALSE)
argnames <- names(as.list(match.call(expand.dots = FALSE)[-1]))
arguments <- as.list(match.call()[-1])
env <- parent.frame()
aesthetics <- compact(arguments[.all_aesthetics])
aesthetics <- aesthetics[!is.constant(aesthetics)]
aes_names <- names(aesthetics)
aesthetics <- rename_aes(aesthetics)
class(aesthetics) <- "uneval"
if (missing(data)) {
data <- data.frame()
facetvars <- all.vars(facets)
facetvars <- facetvars[facetvars != "."]
names(facetvars) <- facetvars
facetsdf <- as.data.frame(mget(facetvars, envir = env))
if (nrow(facetsdf))
data <- facetsdf
}
if ("auto" %in% geom) {
if ("sample" %in% aes_names) {
geom[geom == "auto"] <- "qq"
}
else if (missing(y)) {
x <- eval(aesthetics$x, data, env)
if (is.discrete(x)) {
geom[geom == "auto"] <- "bar"
}
else {
geom[geom == "auto"] <- "histogram"
}
if (missing(ylab))
ylab <- "count"
}
else {
if (missing(x)) {
aesthetics$x <- bquote(seq_along(.(y)), aesthetics)
}
geom[geom == "auto"] <- "point"
}
}
p <- ggplot(data, aesthetics, environment = env)
if (is.null(facets)) {
p <- p + facet_null()
}
else if (is.formula(facets) && length(facets) == 2) {
p <- p + facet_wrap(facets)
}
else {
p <- p + facet_grid(facets = deparse(facets), margins = margins)
}
if (!is.null(main))
p <- p + ggtitle(main)
for (g in geom) {
params <- arguments[setdiff(names(arguments), c(aes_names,
argnames))]
params <- lapply(params, eval, parent.frame())
p <- p + do.call(paste0("geom_", g), params)
}
logv <- function(var) var %in% strsplit(log, "")[[1]]
if (logv("x"))
p <- p + scale_x_log10()
if (logv("y"))
p <- p + scale_y_log10()
if (!is.na(asp))
p <- p + theme(aspect.ratio = asp)
if (!missing(xlab))
p <- p + xlab(xlab)
if (!missing(ylab))
p <- p + ylab(ylab)
if (!missing(xlim))
p <- p + xlim(xlim)
if (!missing(ylim))
p <- p + ylim(ylim)
p
}
<environment: namespace:ggplot2>
According to this graph, there appears to be a general positive trend between radio and sales. This implies that increase in radio advertisements generates more sales.
#Simple Linear Regression Model
#reg <- lm( DEPENDENT_VARIABLE ~ INDEPENDENT_VARIABLE )
reg <- lm(mydata$sales ~ mydata$radio)
reg
Call:
lm(formula = mydata$sales ~ mydata$radio)
Coefficients:
(Intercept) mydata$radio
9.3116 0.2025
#Summary of Simple Linear Regression Model
#summary(MODEL)
summary(reg)
Call:
lm(formula = mydata$sales ~ mydata$radio)
Residuals:
Min 1Q Median 3Q Max
-15.7305 -2.1324 0.7707 2.7775 8.1810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.31164 0.56290 16.542 <2e-16 ***
mydata$radio 0.20250 0.02041 9.921 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
From this summary, we can identify that the R-squared value is 0.332 and the adjusted R-squared value is 0.3287, which means these variables aren’t a good fit for the dataset. As we know from statistics, correlation coefficients less than 0.5 represent a weak relationship between variables.
#p <- qplot( x = INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE, data = mydata) + geom_point()
p <- qplot( x = mydata$radio, y = mydata$sales, data = mydata) + geom_point()
#Add a trend line plot using the a linear model
#p + geom_smooth(method = "lm", formula = y ~ x)
p + geom_smooth(method="lm")
This trendline demonstrates a positive linear relationship between radio and sales.
mlr2 <-lm(mydata$sales ~ mydata$radio + mydata$TV)
summary(mlr2)
Call:
lm(formula = mydata$sales ~ mydata$radio + mydata$TV)
Residuals:
Min 1Q Median 3Q Max
-8.7977 -0.8752 0.2422 1.1708 2.8328
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.92110 0.29449 9.919 <2e-16 ***
mydata$radio 0.18799 0.00804 23.382 <2e-16 ***
mydata$TV 0.04575 0.00139 32.909 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
Y_sales_predicted = 2.92110 + 0.18799(radio) + 0.04575(TV)
mlr3 <-lm(mydata$sales ~ mydata$radio + mydata$TV + mydata$newspaper)
summary(mlr3)
Call:
lm(formula = mydata$sales ~ mydata$radio + mydata$TV + mydata$newspaper)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
mydata$radio 0.188530 0.008611 21.893 <2e-16 ***
mydata$TV 0.045765 0.001395 32.809 <2e-16 ***
mydata$newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Y_sales_predicted = 2.938889 + 0.188530(radio) + 0.045765(TV) -0.001037*(newspaper)
Between these models, it appears mlr is the better regression line, as the adjusted R-squared value is lower.
MODEL 1
radio=69
TV=255
newspaper=75
Smlr1= 9.31164 + 0.20250*(radio)
Smlr1
[1] 23.28414
MODEL 2
Smlr2 = 2.92110 + 0.18799*(radio) + 0.04575*(TV)
Smlr2
[1] 27.55866
MODEL 3
Smlr3 = 2.938889 + 0.188530*(radio) + 0.045765*(TV) -0.001037*(newspaper)
Smlr3
[1] 27.53976
To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 5\\data\\img1.png')
Predictive strength of TV =59% Predictive strength of Radio = 32% Predictive strength of radio and TV = 94%
To test Watson’s results, let’s analyze each independent variable against sales. As we’ve previously seen, mlrl (model 1) illustrates the relationship between sales and radio. Let’s now analyze sales and TV:
summary(mlrl4)
Call:
lm(formula = mydata$sales ~ mydata$TV)
Residuals:
Min 1Q Median 3Q Max
-8.3860 -1.9545 -0.1913 2.0671 7.2124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594 0.457843 15.36 <2e-16 ***
mydata$TV 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Here, we notice the relationship between these variables. From these findings, we can deduce whether or not they adhere to Watson’s results.
According to our models, we will look at the R squared value, as this describes how well the dataset fits our regression model. For sales vs. TV, r-squared = 0.6119 ~ 61.19% (Watson = 59%) For sales vs. radio, r-squared = 0.332 ~ 33.2% (Waton = 32%) For sales vs. TV vs. radio, r-squared = 0.8972 ~ 89.72% (Watson = 94%)
while there may be some variance due to different methods of calculating predictive strength, results from the regression analysis and Watson results match well enough to be considered as a valid relationship between these variables.