This project aims to utilize various types of regression to forecast future values in a time series analysis. The data and the general workflow of the code was provided by the Eastern University course “Data Science for Business”.
The first step involved loading the necessary libraries for this analysis.
#load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(TTR)
Next the working directory was set in order to retrieve the data files.
#set working directory
setwd("/Users/nickwinters/desktop/MSDS/DS for Buisness")
Since regression was used for this analysis, the metrics used to evaluate each regression model was created via functions. These metrics relay how accurate each model is at predicting the outcome variable by indicating the error between the predicted and actual measures. Each metric relays this information in a different manner.
absolute difference
in observed vs predictedsquared difference in
observed vs predictedsquareroot of msepercent
difference in observed vs predictedThese metrics are relative values so they are beneficial at comparing between two different models.
# mean absolute error
mae<-function(actual,pred){
mae <- mean(abs(actual-pred), na.rm=TRUE)
return (mae)
}
# mean squared error
mse<-function(actual,pred){
mse <- mean((actual-pred)^2, na.rm=TRUE)
return (mse)
}
# root mean squared error
rmse<-function(actual,pred){
rmse <- sqrt(mean((actual-pred)^2, na.rm=TRUE))
return (rmse)
}
# mean absolute percentage error
mape<-function(actual,pred){
mape <- mean(abs((actual - pred)/actual), na.rm=TRUE)*100
return (mape)
}
Regression is a statistical technique that allows for the prediction of an outcome variable based on an input variable/s. This is achieved through the creation of a line of best fit. This line’s formula (y = mx + b) can be used to calculate or forecast the outcome variable.
Below are four business scenarios that utilize regression to forecast desired outcome variables.
The general structure of conducting these analysis include the following:
#read dataset into R
starbucksdf <- read.csv("starbucks_revenue.csv")
starbucksdf
## Year NetRevenue
## 1 2003 4.10
## 2 2004 5.30
## 3 2005 6.40
## 4 2006 7.80
## 5 2007 9.40
## 6 2008 10.40
## 7 2009 9.80
## 8 2010 10.70
## 9 2011 11.70
## 10 2012 13.30
## 11 2013 14.90
## 12 2014 16.45
## 13 2015 19.16
## 14 2016 21.32
## 15 2017 22.39
## 16 2018 24.72
## 17 2019 26.51
## 18 2020 19.16
## 19 2021 24.61
Observation: This data frame holds 19 years of yearly revenue (in billions) for Starbucks.
#create a time series plot showing yearly net revenue in billions
ggplot(data = starbucksdf, mapping = aes(x = Year, y = NetRevenue)) +
geom_line () +
geom_point() +
labs(title = "Starbucks Yearly Net Revenue in Billions of Dollars,
2003 to 2021", x = "Year", y = "Net Revenue")
Observation: This time series plot shows an overall positive trend in yearly revenue. There is a notable dip in the 2020 revenue. This is likely due to the COVID-19 epidemic due worldwide lockdowns. This dip does show some recovery in the following 2021 year.
To help simplify the model, instead of using the year as the predictor variable (due to its datatype being an integer) a new column was created that expresses each year as representative number.
# Add a column of consecutive numbers
starbucksdf$Time <- 1:nrow(starbucksdf)
head(starbucksdf)
## Year NetRevenue Time
## 1 2003 4.1 1
## 2 2004 5.3 2
## 3 2005 6.4 3
## 4 2006 7.8 4
## 5 2007 9.4 5
## 6 2008 10.4 6
Since there is only one outcome variable (NetRevenue),
linear regression is the model that was used for this scenario.
# simple linear regression analysis
sb_reg<-lm(NetRevenue ~ Time, data = starbucksdf)
summary(sb_reg)
##
## Call:
## lm(formula = NetRevenue ~ Time, data = starbucksdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1505 -1.0788 0.3347 0.8512 3.4086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.54719 0.93650 2.72 0.0146 *
## Time 1.20907 0.08214 14.72 4.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.961 on 17 degrees of freedom
## Multiple R-squared: 0.9273, Adjusted R-squared: 0.923
## F-statistic: 216.7 on 1 and 17 DF, p-value: 4.172e-11
Observation: The line of best fit is equal to \(y = 1.209x + 2.547\). This indicates that the year over year growth in revenue is 1.2 billion dollars.
# Predicted values were generated for evaluation
sb_pred = predict(sb_reg)
# Run previously created metric functions
mae (starbucksdf$NetRevenue, sb_pred)
## [1] 1.411594
mse (starbucksdf$NetRevenue, sb_pred)
## [1] 3.440664
rmse (starbucksdf$NetRevenue, sb_pred)
## [1] 1.854903
mape (starbucksdf$NetRevenue, sb_pred)
## [1] 9.219178
Note: These values generated above depict the models accuracy; however because they are relative values, they are really only beneficial at comparing between diffent models.
For this specific scenario, the net revenue for the years 2022, 2023, and 2024 were predicted
# Create a data frame with the time periods to use for the prediction
sb_new <- data.frame(Time = c(20, 21, 22))
predict(sb_reg, newdata = sb_new)
## 1 2 3
## 26.72860 27.93767 29.14674
#Create a vector of residuals generated from the regression above
sb_res = resid(sb_reg)
#Create a data frame of the predicted values and the residuals
pred_res_df <- data.frame(sb_pred, sb_res)
#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df, mapping = aes(x = sb_pred, y = sb_res)) +
geom_point() +
labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
y = "Residuals")
Observation: In an ideal scenario, the points on this plot should be randomly scattered around the zero line; however here we see a case of autocorrelation. In this case this means that data from the current year may be correlated with data from the previous year. This depicted by the apparent upward trend that points are following. If future analysis is conducted on this data it may be beneficial to create an autoregression model to account for this observation.
#read dataset into R
nytdf <- read.csv("NYT_revenue.csv")
nytdf
## Quarter Revenue
## 1 2013Q1 380.68
## 2 2013Q2 390.96
## 3 2013Q3 361.74
## 4 2013Q4 443.86
## 5 2014Q1 390.41
## 6 2014Q2 388.72
## 7 2014Q3 364.72
## 8 2014Q4 444.68
## 9 2015Q1 384.24
## 10 2015Q2 382.89
## 11 2015Q3 367.40
## 12 2015Q4 444.69
## 13 2016Q1 379.52
## 14 2016Q2 372.63
## 15 2016Q3 363.55
## 16 2016Q4 439.65
Observation: This data frame holds 3 years of revenue divided up by quarter for New York Times Magazine.
#create a time series plot showing NYT quarterly revenue
ggplot(data = nytdf, mapping = aes(x = Quarter, y = Revenue)) +
geom_line (group=1) +
geom_point() +
labs(title = "New York Times Quarterly Revenue 2013 to 2016",
x = "Quarter", y = "Revenue")
Observation: This time series plot shows an a seasonal horizontal trend, with spikes in revenue every 4th quarter.
To implement regression in this scenario, first binary variables were created to represent each quarter.
# dummy variables corresponding to each quarter were created
nytdf$Q1 <- ifelse(grepl("Q1",nytdf$Quarter), 1, 0)
nytdf$Q2 <- ifelse(grepl("Q2",nytdf$Quarter), 1, 0)
nytdf$Q3 <- ifelse(grepl("Q3",nytdf$Quarter), 1, 0)
nytdf$Q4 <- ifelse(grepl("Q4",nytdf$Quarter), 1, 0)
# check newly manipulated variable
nytdf
## Quarter Revenue Q1 Q2 Q3 Q4
## 1 2013Q1 380.68 1 0 0 0
## 2 2013Q2 390.96 0 1 0 0
## 3 2013Q3 361.74 0 0 1 0
## 4 2013Q4 443.86 0 0 0 1
## 5 2014Q1 390.41 1 0 0 0
## 6 2014Q2 388.72 0 1 0 0
## 7 2014Q3 364.72 0 0 1 0
## 8 2014Q4 444.68 0 0 0 1
## 9 2015Q1 384.24 1 0 0 0
## 10 2015Q2 382.89 0 1 0 0
## 11 2015Q3 367.40 0 0 1 0
## 12 2015Q4 444.69 0 0 0 1
## 13 2016Q1 379.52 1 0 0 0
## 14 2016Q2 372.63 0 1 0 0
## 15 2016Q3 363.55 0 0 1 0
## 16 2016Q4 439.65 0 0 0 1
Multiple regression was used since there are multiple outcome
variables. For this regression Q4 was intnetionaly left out
to be the reference variable.
#Use multiple regression with quarter variables to generate a regression
#equation for forecasting
nyt_reg<-lm(Revenue ~ Q1 + Q2 + Q3, data = nytdf)
summary(nyt_reg)
##
## Call:
## lm(formula = Revenue ~ Q1 + Q2 + Q3, data = nytdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1700 -2.7175 0.4475 1.8644 7.1600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 443.220 2.530 175.16 < 2e-16 ***
## Q1 -59.508 3.578 -16.63 1.19e-09 ***
## Q2 -59.420 3.578 -16.61 1.21e-09 ***
## Q3 -78.868 3.578 -22.04 4.48e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.061 on 12 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9732
## F-statistic: 182.8 on 3 and 12 DF, p-value: 2.8e-10
Observation: The line of best fit is equal to \(y = -59.5x -59.4x -78.9x + 443.2\). This indicates that the year over year growth in revenue is 1.2 billion dollars. These coefficients indicate how much less each quarters revenue is in relation to Q4.
#Create a vector of predicted values
nyt_pred = predict(nyt_reg)
# calculate accuracy measures
mae(nytdf$Revenue, nyt_pred)
## [1] 3.28625
mse(nytdf$Revenue, nyt_pred)
## [1] 19.20721
rmse(nytdf$Revenue, nyt_pred)
## [1] 4.382603
mape(nytdf$Revenue, nyt_pred)
## [1] 0.8484193
Note: These values generated above depict the models accuracy; however because they are relative values, they are really only beneficial at comparing between diffent models.
For this specific scenario, the revenue for Q1, Q2, Q3, Q4 of 2017 were predicted.
# Create an object with the time periods to use for the prediction
nyt_new <- data.frame(Q1 = c(1,0,0,0), Q2 = c(0,1,0,0), Q3 = c(0,0,1,0))
predict(nyt_reg, newdata = nyt_new)
## 1 2 3 4
## 383.7125 383.8000 364.3525 443.2200
#read dataset into R
wfdf <- read.csv("whole_foods.csv")
wfdf
## Quarter Net.Sales
## 1 2005Q1 1.37
## 2 2005Q2 1.09
## 3 2005Q3 1.13
## 4 2005Q4 1.12
## 5 2006Q1 1.67
## 6 2006Q2 1.31
## 7 2006Q3 1.34
## 8 2006Q4 1.29
## 9 2007Q1 1.87
## 10 2007Q2 1.46
## 11 2007Q3 1.51
## 12 2007Q4 1.74
## 13 2008Q1 2.46
## 14 2008Q2 1.87
## 15 2008Q3 1.84
## 16 2008Q4 1.79
## 17 2009Q1 2.47
## 18 2009Q2 1.86
## 19 2009Q3 1.88
## 20 2009Q4 1.83
## 21 2010Q1 2.64
## 22 2010Q2 2.11
## 23 2010Q3 2.16
## 24 2010Q4 2.10
## 25 2011Q1 3.00
## 26 2011Q2 2.35
## 27 2011Q3 2.40
## 28 2011Q4 2.35
## 29 2012Q1 3.39
## 30 2012Q2 2.67
## 31 2012Q3 2.73
## 32 2012Q4 2.91
## 33 2013Q1 3.86
## 34 2013Q2 3.03
## 35 2013Q3 3.06
## 36 2013Q4 2.98
## 37 2014Q1 4.24
## 38 2014Q2 3.32
## 39 2014Q3 3.38
## 40 2014Q4 3.26
## 41 2015Q1 4.67
## 42 2015Q2 3.65
## 43 2015Q3 3.63
## 44 2015Q4 3.44
## 45 2016Q1 4.83
## 46 2016Q2 3.70
## 47 2016Q3 3.70
## 48 2016Q4 3.50
Observation: This data frame holds 11 years of revenue divided up by quarter for Whole Foods.
#create a time series plot showing quarterly net sales
ggplot(data = wfdf, mapping = aes(x = Quarter, y = Net.Sales)) +
geom_line (group=1) +
geom_point() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Whole Foods Quarterly Net Sales 2005 to 2016 in $ millions",
x = "Quarter", y = "Net Sales")
Observation: This time series plot shows an a seasonal upward trend, with spikes in revenue every 1st quarter.
To implement regression in this scenario, first a new column was added to represent each year and the binary variables were created to represent each quarter.
# add column of consecutive numbers for each year
wfdf$Time <- 1:nrow(wfdf)
# create dummy variables corresponding to each quarter
wfdf$Q1 <- ifelse(grepl("Q1",wfdf$Quarter), 1, 0)
wfdf$Q2 <- ifelse(grepl("Q2",wfdf$Quarter), 1, 0)
wfdf$Q3 <- ifelse(grepl("Q3",wfdf$Quarter), 1, 0)
wfdf$Q4 <- ifelse(grepl("Q4",wfdf$Quarter), 1, 0)
# veiw manipulated df
wfdf
## Quarter Net.Sales Time Q1 Q2 Q3 Q4
## 1 2005Q1 1.37 1 1 0 0 0
## 2 2005Q2 1.09 2 0 1 0 0
## 3 2005Q3 1.13 3 0 0 1 0
## 4 2005Q4 1.12 4 0 0 0 1
## 5 2006Q1 1.67 5 1 0 0 0
## 6 2006Q2 1.31 6 0 1 0 0
## 7 2006Q3 1.34 7 0 0 1 0
## 8 2006Q4 1.29 8 0 0 0 1
## 9 2007Q1 1.87 9 1 0 0 0
## 10 2007Q2 1.46 10 0 1 0 0
## 11 2007Q3 1.51 11 0 0 1 0
## 12 2007Q4 1.74 12 0 0 0 1
## 13 2008Q1 2.46 13 1 0 0 0
## 14 2008Q2 1.87 14 0 1 0 0
## 15 2008Q3 1.84 15 0 0 1 0
## 16 2008Q4 1.79 16 0 0 0 1
## 17 2009Q1 2.47 17 1 0 0 0
## 18 2009Q2 1.86 18 0 1 0 0
## 19 2009Q3 1.88 19 0 0 1 0
## 20 2009Q4 1.83 20 0 0 0 1
## 21 2010Q1 2.64 21 1 0 0 0
## 22 2010Q2 2.11 22 0 1 0 0
## 23 2010Q3 2.16 23 0 0 1 0
## 24 2010Q4 2.10 24 0 0 0 1
## 25 2011Q1 3.00 25 1 0 0 0
## 26 2011Q2 2.35 26 0 1 0 0
## 27 2011Q3 2.40 27 0 0 1 0
## 28 2011Q4 2.35 28 0 0 0 1
## 29 2012Q1 3.39 29 1 0 0 0
## 30 2012Q2 2.67 30 0 1 0 0
## 31 2012Q3 2.73 31 0 0 1 0
## 32 2012Q4 2.91 32 0 0 0 1
## 33 2013Q1 3.86 33 1 0 0 0
## 34 2013Q2 3.03 34 0 1 0 0
## 35 2013Q3 3.06 35 0 0 1 0
## 36 2013Q4 2.98 36 0 0 0 1
## 37 2014Q1 4.24 37 1 0 0 0
## 38 2014Q2 3.32 38 0 1 0 0
## 39 2014Q3 3.38 39 0 0 1 0
## 40 2014Q4 3.26 40 0 0 0 1
## 41 2015Q1 4.67 41 1 0 0 0
## 42 2015Q2 3.65 42 0 1 0 0
## 43 2015Q3 3.63 43 0 0 1 0
## 44 2015Q4 3.44 44 0 0 0 1
## 45 2016Q1 4.83 45 1 0 0 0
## 46 2016Q2 3.70 46 0 1 0 0
## 47 2016Q3 3.70 47 0 0 1 0
## 48 2016Q4 3.50 48 0 0 0 1
A linear regression model was created to predict
NetSales from Time.
# linear regression model
wf_reg<-lm(Net.Sales ~ Time, data = wfdf)
summary(wf_reg)
##
## Call:
## lm(formula = Net.Sales ~ Time, data = wfdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5381 -0.2692 -0.1117 0.0778 1.0779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.979849 0.115413 8.49 5.67e-11 ***
## Time 0.063714 0.004101 15.54 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3936 on 46 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.8365
## F-statistic: 241.4 on 1 and 46 DF, p-value: < 2.2e-16
Observation: The line of best fit is equal to \(y = 0.06x + 0.980\). This indicates that the year over year growth in revenue is 0.06 billion dollars.
# create a vector of predicted values
wf_pred = predict(wf_reg)
# calculate accuracy measures
mae(wfdf$Net.Sales, wf_pred)
## [1] 0.2969519
mse(wfdf$Net.Sales, wf_pred)
## [1] 0.1484442
rmse(wfdf$Net.Sales, wf_pred)
## [1] 0.3852846
mape(wfdf$Net.Sales, wf_pred)
## [1] 11.4477
To also account for the seasonality trend depicted in the time series
plot as well as the upward trend seen year to year, a multiple
regression model was created where Net.Sales is the outcome
variable, and Time, Q2, Q3, and
Q4 were the predictor variables. Q1 was chosen
as the reference varible.
# multiple regression with the time and quarters
wf_mreg<-lm(Net.Sales ~ Time + Q2 + Q3 + Q4, data = wfdf)
summary(wf_mreg)
##
## Call:
## lm(formula = Net.Sales ~ Time + Q2 + Q3 + Q4, data = wfdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29907 -0.12774 -0.00415 0.11164 0.45273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.533813 0.066159 23.18 < 2e-16 ***
## Time 0.065450 0.001841 35.55 < 2e-16 ***
## Q2 -0.736284 0.071918 -10.24 4.21e-13 ***
## Q3 -0.773400 0.071989 -10.74 9.37e-14 ***
## Q4 -0.876351 0.072106 -12.15 1.69e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1761 on 43 degrees of freedom
## Multiple R-squared: 0.97, Adjusted R-squared: 0.9673
## F-statistic: 348.1 on 4 and 43 DF, p-value: < 2.2e-16
Observation: The line of best fit is equal to \(y = 0.06x - 0.74x - 0.77x - 0.88x + 1.53\). This indicates that the year over year growth in revenue is 0.065 billion dollars, and the Q2-Q4 coefficients represent the change in sales relative to Q1.
# create a vector of predicted values
wf_pred2 = predict(wf_mreg)
# calculate accuracy measures
mae(wfdf$Net.Sales, wf_pred2)
## [1] 0.1385392
mse(wfdf$Net.Sales, wf_pred2)
## [1] 0.02778254
rmse(wfdf$Net.Sales, wf_pred2)
## [1] 0.166681
mape(wfdf$Net.Sales, wf_pred2)
## [1] 6.316223
Observation: Comparing metrics between both the
wf_reg and wf_mreg model indicates that the
latter is the better fit due to its error being comparatively lower than
when just linear regression was employed.
For this specific scenario, the Whole Foods net sales for 2017 Q1, Q2, Q3, and Q4 were predicted using the multiple regression model.
# create an object with the time periods to use for the prediction
new <- data.frame(Time = c(49, 50, 51, 52), Q2 = c(0,1,0,0), Q3 = c(0,0,1,0),
Q4 = c(0,0,0,1))
predict(wf_mreg, newdata = new)
## 1 2 3 4
## 4.740871 4.070038 4.098371 4.060871
#read dataset into R
lidf <- read.csv("linked_in.csv")
lidf
## Quarter Members
## 1 2009Q1 37.3
## 2 2009Q2 42.0
## 3 2009Q3 48.0
## 4 2009Q4 55.1
## 5 2010Q1 64.2
## 6 2010Q2 71.8
## 7 2010Q3 80.6
## 8 2010Q4 90.4
## 9 2011Q1 101.5
## 10 2011Q2 115.8
## 11 2011Q3 131.2
## 12 2011Q4 145.0
## 13 2012Q1 160.6
## 14 2012Q2 173.9
## 15 2012Q3 187.4
## 16 2012Q4 201.9
## 17 2013Q1 218.3
## 18 2013Q2 238.1
## 19 2013Q3 259.2
## 20 2013Q4 276.8
## 21 2014Q1 296.5
## 22 2014Q2 313.4
Observation: This data frame holds 5 years of memberships divided up by quarter for LinkedIn.
#create a time series plot showing number of LinkedIn members by quarter,
#in millions
ggplot(data = lidf, mapping = aes(x = Quarter, y = Members)) +
geom_line (group=1) +
geom_point() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "LinkedIn members by Quarter (millions), 2009 to 2014",
x = "Quarter", y = "Members")
Observation: This time series plot shows an upward sloping trend in memberships.
To implement regression in this scenario, first a new column
(Time) of consecutive numbers was added to represent each
quarter and a new variable was created by squaring the Time
variable.
# add column of consecutive numbers corresponding with each quarter
lidf$Time <- 1:nrow(lidf)
# create a new variable that squares the Time variable
lidf$Time2 <- lidf$Time^2
# check data
lidf
## Quarter Members Time Time2
## 1 2009Q1 37.3 1 1
## 2 2009Q2 42.0 2 4
## 3 2009Q3 48.0 3 9
## 4 2009Q4 55.1 4 16
## 5 2010Q1 64.2 5 25
## 6 2010Q2 71.8 6 36
## 7 2010Q3 80.6 7 49
## 8 2010Q4 90.4 8 64
## 9 2011Q1 101.5 9 81
## 10 2011Q2 115.8 10 100
## 11 2011Q3 131.2 11 121
## 12 2011Q4 145.0 12 144
## 13 2012Q1 160.6 13 169
## 14 2012Q2 173.9 14 196
## 15 2012Q3 187.4 15 225
## 16 2012Q4 201.9 16 256
## 17 2013Q1 218.3 17 289
## 18 2013Q2 238.1 18 324
## 19 2013Q3 259.2 19 361
## 20 2013Q4 276.8 20 400
## 21 2014Q1 296.5 21 441
## 22 2014Q2 313.4 22 484
The first model created was a simple linear regression model, where
Members was the outcome and Time was used as
the predictor.
# simple linear regression
li_reg<-lm(Members ~ Time, data = lidf)
summary(li_reg)
##
## Call:
## lm(formula = Members ~ Time, data = lidf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.400 -9.994 -5.359 10.707 27.629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.7325 6.0825 -0.614 0.546
## Time 13.4036 0.4631 28.942 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.78 on 20 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.9755
## F-statistic: 837.7 on 1 and 20 DF, p-value: < 2.2e-16
Observation: The line of best fit is equal to \(y = 13.4x - 3.7\). This indicates that the quarter to quarter growth in membership is a gain of 13 million members.
# create a vector of predicted values
li_pred = predict(li_reg)
# calculate accuracy measures
mae (lidf$Members, li_pred)
## [1] 11.50083
mse (lidf$Members, li_pred)
## [1] 172.6512
rmse (lidf$Members, li_pred)
## [1] 13.13968
mape (lidf$Members, li_pred)
## [1] 12.65052
Next, because there was indication of an exponential trend in the
time series plot, a quadratic or polynomial model was created by adding
Time2 as a predictor.
# quadratic regression model
li_qreg<-lm(Members ~ Time + Time2, data = lidf)
summary(li_qreg)
##
## Call:
## lm(formula = Members ~ Time + Time2, data = lidf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0931 -1.5068 0.0645 1.6979 3.8404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56883 1.49168 19.82 3.74e-14 ***
## Time 5.07829 0.29877 17.00 5.99e-13 ***
## Time2 0.36197 0.01261 28.70 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.123 on 19 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.805e+04 on 2 and 19 DF, p-value: < 2.2e-16
Observation: The line of best fit is equal to \(y = 0.36x^2 + 5.08x + 29.57\).
# create a vector of predicted values
li_pred2 = predict(li_qreg)
# calculate accuracy measures
mae (lidf$Members, li_pred2)
## [1] 1.636153
mse (lidf$Members, li_pred2)
## [1] 3.893917
rmse (lidf$Members, li_pred2)
## [1] 1.973301
mape (lidf$Members, li_pred2)
## [1] 1.473033
Observation: Comparing metrics between both the
li_reg and li_qreg model indicates that the
latter is the better fit due to its error being significantly lower than
when just linear regression was employed.
For this specific scenario, LinkedIn memberships for Q3 and Q4 of 2014 were predicted using the quadratic regression model.
# Create an object with the time periods to use for the prediction
li_new <- data.frame(Time = c(23, 24), Time2 = c(529, 576))
predict(li_qreg, newdata = li_new)
## 1 2
## 337.8519 359.9429
This time series analysis successfully forecasted various outcome variables by utilizing various methods of regression. The techniques used in this project could just as easily be adapted to other business scenarios depending on the trends depicted in a time series plot.