I. Overview

Specific Task

This project uses several types of regression to forecast future values in a time series. The data and the general workflow of the code were provided by the Eastern University course “Data Science for Business”.

II. Setup

Load Packages

The first step involved loading the necessary libraries for this analysis.

#load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(TTR)

Set working directory

Next, the working directory was set so that the data files could be read in.

#set working directory
setwd("/Users/nickwinters/desktop/MSDS/DS for Buisness")

Metrics for Evaluation

Since regression was used throughout this analysis, the metrics used to evaluate each regression model were written as reusable functions. These metrics convey how accurately each model predicts the outcome variable by measuring the error between predicted and actual values. Each metric conveys this information in a different way.

  • mae: the average absolute difference between observed and predicted values
  • mse: the average squared difference between observed and predicted values
  • rmse: the square root of mse
  • mape: the average absolute percent difference between observed and predicted values

These metrics are relative values, so they are most useful for comparing two different models fit to the same data.

# mean absolute error
mae<-function(actual,pred){
  mae <- mean(abs(actual-pred), na.rm=TRUE)
  return (mae)
}

# mean squared error
mse<-function(actual,pred){
  mse <- mean((actual-pred)^2, na.rm=TRUE)
  return (mse)
}

# root mean squared error
rmse<-function(actual,pred){
  rmse <- sqrt(mean((actual-pred)^2, na.rm=TRUE))
  return (rmse)
}  

# mean absolute percentage error
mape<-function(actual,pred){
  mape <- mean(abs((actual - pred)/actual), na.rm=TRUE)*100
  return (mape)
}

III. Forecasting with Regression

Regression is a statistical technique that allows an outcome variable to be predicted from one or more input variables. This is achieved by creating a line of best fit, whose formula (y = mx + b) can be used to calculate, or forecast, the outcome variable.
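
As a minimal, self-contained illustration (using made-up toy numbers rather than any of the project data), the intercept and slope of a fitted line can be pulled from an lm() object and used to forecast the next period:

# toy example: fit a line of best fit and forecast the next period
toy <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_fit <- lm(y ~ x, data = toy)
coef(toy_fit)                                  # b (intercept) and m (slope)
predict(toy_fit, newdata = data.frame(x = 6))  # forecast for x = 6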

Below are four business scenarios that utilize regression to forecast desired outcome variables.

The general structure of each analysis includes the following steps:

  • Reading in the data from a csv file
  • Plotting a time series plot
  • Manipulating data where needed
  • Building the regression model
  • Producing accuracy metrics
  • Generating predictions based on the model

Starbucks Yearly Revenue

Load & Peek at the Data

#read dataset into R
starbucksdf <- read.csv("starbucks_revenue.csv")
starbucksdf
##    Year NetRevenue
## 1  2003       4.10
## 2  2004       5.30
## 3  2005       6.40
## 4  2006       7.80
## 5  2007       9.40
## 6  2008      10.40
## 7  2009       9.80
## 8  2010      10.70
## 9  2011      11.70
## 10 2012      13.30
## 11 2013      14.90
## 12 2014      16.45
## 13 2015      19.16
## 14 2016      21.32
## 15 2017      22.39
## 16 2018      24.72
## 17 2019      26.51
## 18 2020      19.16
## 19 2021      24.61

Observation: This data frame holds 19 years of yearly revenue (in billions) for Starbucks.

Time Series Plot

#create a time series plot showing yearly net revenue in billions
ggplot(data = starbucksdf, mapping = aes(x = Year, y = NetRevenue)) +
  geom_line () +
  geom_point() +
  labs(title = "Starbucks Yearly Net Revenue in Billions of Dollars, 
       2003 to 2021", x = "Year", y = "Net Revenue")

Observation: This time series plot shows an overall positive trend in yearly revenue. There is a notable dip in 2020, likely due to the COVID-19 pandemic and the worldwide lockdowns that accompanied it. Revenue shows some recovery in the following year, 2021.

Data Manipulation

To help simplify the model, rather than using the year itself as the predictor variable, a new column was created that expresses each year as a consecutive index number (1, 2, 3, …).

# Add a column of consecutive numbers
starbucksdf$Time <- 1:nrow(starbucksdf) 
head(starbucksdf)
##   Year NetRevenue Time
## 1 2003        4.1    1
## 2 2004        5.3    2
## 3 2005        6.4    3
## 4 2006        7.8    4
## 5 2007        9.4    5
## 6 2008       10.4    6

Create a Linear Regression Model

Since there is a single predictor variable (Time) and a single outcome variable (NetRevenue), simple linear regression was used for this scenario.

# simple linear regression analysis
sb_reg<-lm(NetRevenue ~ Time, data = starbucksdf)
summary(sb_reg)
## 
## Call:
## lm(formula = NetRevenue ~ Time, data = starbucksdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1505 -1.0788  0.3347  0.8512  3.4086 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.54719    0.93650    2.72   0.0146 *  
## Time         1.20907    0.08214   14.72 4.17e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.961 on 17 degrees of freedom
## Multiple R-squared:  0.9273, Adjusted R-squared:  0.923 
## F-statistic: 216.7 on 1 and 17 DF,  p-value: 4.172e-11

Observation: The line of best fit is \(y = 1.209x + 2.547\), where x is the time index. This indicates that revenue grows by roughly 1.2 billion dollars per year.
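
Note: as an optional sanity check (not part of the provided workflow), the fitted equation can be evaluated by hand for the next time period; the result should match the model-based forecast generated later.

# evaluate the fitted line by hand for Time = 20 (the next period)
coef(sb_reg)["(Intercept)"] + coef(sb_reg)["Time"] * 20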

Accuracy Metrics

# Predicted values were generated for evaluation
sb_pred = predict(sb_reg)

# Run previously created metric functions
mae (starbucksdf$NetRevenue, sb_pred)
## [1] 1.411594
mse (starbucksdf$NetRevenue, sb_pred)
## [1] 3.440664
rmse (starbucksdf$NetRevenue, sb_pred)
## [1] 1.854903
mape (starbucksdf$NetRevenue, sb_pred)
## [1] 9.219178

Note: The values generated above depict the model’s accuracy; however, because they are relative values, they are really only useful for comparing different models.

Generate Predictions

For this specific scenario, the net revenue for the years 2022, 2023, and 2024 was predicted.

# Create a data frame with the time periods to use for the prediction
sb_new <- data.frame(Time = c(20, 21, 22))
predict(sb_reg, newdata = sb_new)
##        1        2        3 
## 26.72860 27.93767 29.14674
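
Note: the Time values 20-22 correspond to the calendar years 2022-2024 because 2003 is Time = 1 in this data set (Time = Year - 2002). A small sketch keying the same forecast by calendar year:

# same forecast keyed by calendar year (Time = Year - 2002 in this data set)
sb_new_years <- data.frame(Time = c(2022, 2023, 2024) - 2002)
predict(sb_reg, newdata = sb_new_years)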

Residual vs Predicted Plot

#Create a vector of residuals generated from the regression above
sb_res = resid(sb_reg)

#Create a data frame of the predicted values and the residuals
pred_res_df <- data.frame(sb_pred, sb_res)

#create a scatterplot of the residuals versus the predicted values
ggplot(data = pred_res_df, mapping = aes(x = sb_pred, y = sb_res)) +
  geom_point() +
  labs(title = "Plot of residuals vs. predicted values", x = "Predicted values",
       y = "Residuals")

Observation: In an ideal scenario the points on this plot would be randomly scattered around the zero line; here, however, there is evidence of autocorrelation, meaning that each year’s revenue may be correlated with the previous year’s. This is depicted by the apparent upward trend that the points follow. If further analysis is conducted on this data, it may be beneficial to build an autoregressive model to account for this observation, as sketched below.
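
Note: a minimal sketch of such an autoregressive model (not part of the provided workflow) could regress each year’s revenue on the previous year’s revenue, for example using dplyr’s lag(); PrevRevenue is a helper column introduced only for this sketch.

# first-order autoregressive sketch: revenue regressed on last year's revenue
starbucksdf$PrevRevenue <- dplyr::lag(starbucksdf$NetRevenue)
sb_ar1 <- lm(NetRevenue ~ PrevRevenue, data = starbucksdf)
summary(sb_ar1)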

New York Times Quarterly Revenue

Load & Peek at the Data

#read dataset into R
nytdf <- read.csv("NYT_revenue.csv")
nytdf
##    Quarter Revenue
## 1   2013Q1  380.68
## 2   2013Q2  390.96
## 3   2013Q3  361.74
## 4   2013Q4  443.86
## 5   2014Q1  390.41
## 6   2014Q2  388.72
## 7   2014Q3  364.72
## 8   2014Q4  444.68
## 9   2015Q1  384.24
## 10  2015Q2  382.89
## 11  2015Q3  367.40
## 12  2015Q4  444.69
## 13  2016Q1  379.52
## 14  2016Q2  372.63
## 15  2016Q3  363.55
## 16  2016Q4  439.65

Observation: This data frame holds 4 years (16 quarters) of quarterly revenue for The New York Times.

Time Series Plot

#create a time series plot showing NYT quarterly revenue
ggplot(data = nytdf, mapping = aes(x = Quarter, y = Revenue)) +
  geom_line (group=1) +
  geom_point() +
  labs(title = "New York Times Quarterly Revenue 2013 to 2016", 
       x = "Quarter", y = "Revenue")

Observation: This time series plot shows a roughly horizontal trend with clear seasonality: revenue spikes every fourth quarter.

Data Manipulation

To implement regression in this scenario, binary (dummy) variables were first created to represent each quarter.

# dummy variables corresponding to each quarter were created
nytdf$Q1 <- ifelse(grepl("Q1",nytdf$Quarter), 1, 0)
nytdf$Q2 <- ifelse(grepl("Q2",nytdf$Quarter), 1, 0)
nytdf$Q3 <- ifelse(grepl("Q3",nytdf$Quarter), 1, 0)
nytdf$Q4 <- ifelse(grepl("Q4",nytdf$Quarter), 1, 0)
# check newly manipulated variable
nytdf
##    Quarter Revenue Q1 Q2 Q3 Q4
## 1   2013Q1  380.68  1  0  0  0
## 2   2013Q2  390.96  0  1  0  0
## 3   2013Q3  361.74  0  0  1  0
## 4   2013Q4  443.86  0  0  0  1
## 5   2014Q1  390.41  1  0  0  0
## 6   2014Q2  388.72  0  1  0  0
## 7   2014Q3  364.72  0  0  1  0
## 8   2014Q4  444.68  0  0  0  1
## 9   2015Q1  384.24  1  0  0  0
## 10  2015Q2  382.89  0  1  0  0
## 11  2015Q3  367.40  0  0  1  0
## 12  2015Q4  444.69  0  0  0  1
## 13  2016Q1  379.52  1  0  0  0
## 14  2016Q2  372.63  0  1  0  0
## 15  2016Q3  363.55  0  0  1  0
## 16  2016Q4  439.65  0  0  0  1

Create a Multiple Regression Model

Multiple regression was used since there are multiple predictor variables (the quarter dummies). Q4 was intentionally left out of the model so that it serves as the reference category.

#Use multiple regression with quarter variables to generate a regression 
#equation for forecasting
nyt_reg<-lm(Revenue ~ Q1 + Q2 + Q3, data = nytdf)
summary(nyt_reg)
## 
## Call:
## lm(formula = Revenue ~ Q1 + Q2 + Q3, data = nytdf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1700  -2.7175   0.4475   1.8644   7.1600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  443.220      2.530  175.16  < 2e-16 ***
## Q1           -59.508      3.578  -16.63 1.19e-09 ***
## Q2           -59.420      3.578  -16.61 1.21e-09 ***
## Q3           -78.868      3.578  -22.04 4.48e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.061 on 12 degrees of freedom
## Multiple R-squared:  0.9786, Adjusted R-squared:  0.9732 
## F-statistic: 182.8 on 3 and 12 DF,  p-value: 2.8e-10

Observation: The fitted equation is \(y = 443.2 - 59.5Q1 - 59.4Q2 - 78.9Q3\). The coefficients indicate how much lower each quarter’s revenue is, on average, relative to Q4 (represented by the intercept).
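
Note: as an equivalent alternative (not used in the provided workflow), R can build the quarter dummies automatically from a factor, with Q4 kept as the reference level; the coefficients should match those above. This sketch assumes each Quarter string ends in "Q1" through "Q4".

# alternative sketch: derive the quarter as a factor and relevel to Q4
nytdf$Qtr <- relevel(factor(substr(nytdf$Quarter, 5, 6)), ref = "Q4")
nyt_reg_factor <- lm(Revenue ~ Qtr, data = nytdf)
coef(nyt_reg_factor)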

Accuracy Metrics

#Create a vector of predicted values
nyt_pred = predict(nyt_reg)

# calculate accuracy measures
mae(nytdf$Revenue, nyt_pred)
## [1] 3.28625
mse(nytdf$Revenue, nyt_pred)
## [1] 19.20721
rmse(nytdf$Revenue, nyt_pred)
## [1] 4.382603
mape(nytdf$Revenue, nyt_pred)
## [1] 0.8484193

Note: The values generated above depict the model’s accuracy; however, because they are relative values, they are really only useful for comparing different models.

Generate Predictions

For this specific scenario, the revenue for Q1, Q2, Q3, and Q4 of 2017 was predicted.

# Create an object with the time periods to use for the prediction
nyt_new <- data.frame(Q1 = c(1,0,0,0), Q2 = c(0,1,0,0), Q3 = c(0,0,1,0)) 
predict(nyt_reg, newdata = nyt_new)
##        1        2        3        4 
## 383.7125 383.8000 364.3525 443.2200

Whole Foods Quarterly Revenue

Load & Peek at the Data

#read dataset into R
wfdf <- read.csv("whole_foods.csv")
wfdf
##    Quarter Net.Sales
## 1   2005Q1      1.37
## 2   2005Q2      1.09
## 3   2005Q3      1.13
## 4   2005Q4      1.12
## 5   2006Q1      1.67
## 6   2006Q2      1.31
## 7   2006Q3      1.34
## 8   2006Q4      1.29
## 9   2007Q1      1.87
## 10  2007Q2      1.46
## 11  2007Q3      1.51
## 12  2007Q4      1.74
## 13  2008Q1      2.46
## 14  2008Q2      1.87
## 15  2008Q3      1.84
## 16  2008Q4      1.79
## 17  2009Q1      2.47
## 18  2009Q2      1.86
## 19  2009Q3      1.88
## 20  2009Q4      1.83
## 21  2010Q1      2.64
## 22  2010Q2      2.11
## 23  2010Q3      2.16
## 24  2010Q4      2.10
## 25  2011Q1      3.00
## 26  2011Q2      2.35
## 27  2011Q3      2.40
## 28  2011Q4      2.35
## 29  2012Q1      3.39
## 30  2012Q2      2.67
## 31  2012Q3      2.73
## 32  2012Q4      2.91
## 33  2013Q1      3.86
## 34  2013Q2      3.03
## 35  2013Q3      3.06
## 36  2013Q4      2.98
## 37  2014Q1      4.24
## 38  2014Q2      3.32
## 39  2014Q3      3.38
## 40  2014Q4      3.26
## 41  2015Q1      4.67
## 42  2015Q2      3.65
## 43  2015Q3      3.63
## 44  2015Q4      3.44
## 45  2016Q1      4.83
## 46  2016Q2      3.70
## 47  2016Q3      3.70
## 48  2016Q4      3.50

Observation: This data frame holds 12 years (48 quarters) of quarterly net sales for Whole Foods.

Time Series Plot

#create a time series plot showing quarterly net sales
ggplot(data = wfdf, mapping = aes(x = Quarter, y = Net.Sales)) +
  geom_line (group=1) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = "Whole Foods Quarterly Net Sales 2005 to 2016 in $ millions", 
       x = "Quarter", y = "Net Sales")

Observation: This time series plot shows an upward trend with seasonality: net sales spike every first quarter.

Data Manipulation

To implement regression in this scenario, a new column (Time) of consecutive numbers was first added to index each quarter, and binary (dummy) variables were created to represent each quarter of the year.

# add a column of consecutive numbers indexing each quarter
wfdf$Time <- 1:nrow(wfdf) 

# create dummy variables corresponding to each quarter 
wfdf$Q1 <- ifelse(grepl("Q1",wfdf$Quarter), 1, 0)
wfdf$Q2 <- ifelse(grepl("Q2",wfdf$Quarter), 1, 0)
wfdf$Q3 <- ifelse(grepl("Q3",wfdf$Quarter), 1, 0)
wfdf$Q4 <- ifelse(grepl("Q4",wfdf$Quarter), 1, 0)

# view the manipulated data frame
wfdf
##    Quarter Net.Sales Time Q1 Q2 Q3 Q4
## 1   2005Q1      1.37    1  1  0  0  0
## 2   2005Q2      1.09    2  0  1  0  0
## 3   2005Q3      1.13    3  0  0  1  0
## 4   2005Q4      1.12    4  0  0  0  1
## 5   2006Q1      1.67    5  1  0  0  0
## 6   2006Q2      1.31    6  0  1  0  0
## 7   2006Q3      1.34    7  0  0  1  0
## 8   2006Q4      1.29    8  0  0  0  1
## 9   2007Q1      1.87    9  1  0  0  0
## 10  2007Q2      1.46   10  0  1  0  0
## 11  2007Q3      1.51   11  0  0  1  0
## 12  2007Q4      1.74   12  0  0  0  1
## 13  2008Q1      2.46   13  1  0  0  0
## 14  2008Q2      1.87   14  0  1  0  0
## 15  2008Q3      1.84   15  0  0  1  0
## 16  2008Q4      1.79   16  0  0  0  1
## 17  2009Q1      2.47   17  1  0  0  0
## 18  2009Q2      1.86   18  0  1  0  0
## 19  2009Q3      1.88   19  0  0  1  0
## 20  2009Q4      1.83   20  0  0  0  1
## 21  2010Q1      2.64   21  1  0  0  0
## 22  2010Q2      2.11   22  0  1  0  0
## 23  2010Q3      2.16   23  0  0  1  0
## 24  2010Q4      2.10   24  0  0  0  1
## 25  2011Q1      3.00   25  1  0  0  0
## 26  2011Q2      2.35   26  0  1  0  0
## 27  2011Q3      2.40   27  0  0  1  0
## 28  2011Q4      2.35   28  0  0  0  1
## 29  2012Q1      3.39   29  1  0  0  0
## 30  2012Q2      2.67   30  0  1  0  0
## 31  2012Q3      2.73   31  0  0  1  0
## 32  2012Q4      2.91   32  0  0  0  1
## 33  2013Q1      3.86   33  1  0  0  0
## 34  2013Q2      3.03   34  0  1  0  0
## 35  2013Q3      3.06   35  0  0  1  0
## 36  2013Q4      2.98   36  0  0  0  1
## 37  2014Q1      4.24   37  1  0  0  0
## 38  2014Q2      3.32   38  0  1  0  0
## 39  2014Q3      3.38   39  0  0  1  0
## 40  2014Q4      3.26   40  0  0  0  1
## 41  2015Q1      4.67   41  1  0  0  0
## 42  2015Q2      3.65   42  0  1  0  0
## 43  2015Q3      3.63   43  0  0  1  0
## 44  2015Q4      3.44   44  0  0  0  1
## 45  2016Q1      4.83   45  1  0  0  0
## 46  2016Q2      3.70   46  0  1  0  0
## 47  2016Q3      3.70   47  0  0  1  0
## 48  2016Q4      3.50   48  0  0  0  1

Linear Regression Model

A linear regression model was created to predict Net.Sales from Time.

# linear regression model
wf_reg<-lm(Net.Sales ~ Time, data = wfdf)
summary(wf_reg)
## 
## Call:
## lm(formula = Net.Sales ~ Time, data = wfdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5381 -0.2692 -0.1117  0.0778  1.0779 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.979849   0.115413    8.49 5.67e-11 ***
## Time        0.063714   0.004101   15.54  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3936 on 46 degrees of freedom
## Multiple R-squared:   0.84,  Adjusted R-squared:  0.8365 
## F-statistic: 241.4 on 1 and 46 DF,  p-value: < 2.2e-16

Observation: The line of best fit is \(y = 0.064x + 0.98\), where x is the quarter index. This indicates that net sales grow by roughly 0.06 billion dollars per quarter.

Accuracy Metrics

# create a vector of predicted values
wf_pred = predict(wf_reg)

# calculate accuracy measures
mae(wfdf$Net.Sales, wf_pred)
## [1] 0.2969519
mse(wfdf$Net.Sales, wf_pred)
## [1] 0.1484442
rmse(wfdf$Net.Sales, wf_pred)
## [1] 0.3852846
mape(wfdf$Net.Sales, wf_pred)
## [1] 11.4477

Multiple Regression Model

To account for the seasonality depicted in the time series plot as well as the upward year-to-year trend, a multiple regression model was created with Net.Sales as the outcome variable and Time, Q2, Q3, and Q4 as the predictor variables. Q1 was chosen as the reference category.

# multiple regression with the time and quarters
wf_mreg<-lm(Net.Sales ~ Time + Q2 + Q3 + Q4, data = wfdf)
summary(wf_mreg)
## 
## Call:
## lm(formula = Net.Sales ~ Time + Q2 + Q3 + Q4, data = wfdf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29907 -0.12774 -0.00415  0.11164  0.45273 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.533813   0.066159   23.18  < 2e-16 ***
## Time         0.065450   0.001841   35.55  < 2e-16 ***
## Q2          -0.736284   0.071918  -10.24 4.21e-13 ***
## Q3          -0.773400   0.071989  -10.74 9.37e-14 ***
## Q4          -0.876351   0.072106  -12.15 1.69e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1761 on 43 degrees of freedom
## Multiple R-squared:   0.97,  Adjusted R-squared:  0.9673 
## F-statistic: 348.1 on 4 and 43 DF,  p-value: < 2.2e-16

Observation: The fitted equation is \(y = 1.53 + 0.065Time - 0.74Q2 - 0.77Q3 - 0.88Q4\). This indicates that net sales grow by roughly 0.065 billion dollars per quarter, and the Q2-Q4 coefficients represent the change in sales relative to Q1.

Accuracy Metrics

# create a vector of predicted values
wf_pred2 = predict(wf_mreg)

# calculate accuracy measures
mae(wfdf$Net.Sales, wf_pred2)
## [1] 0.1385392
mse(wfdf$Net.Sales, wf_pred2)
## [1] 0.02778254
rmse(wfdf$Net.Sales, wf_pred2)
## [1] 0.166681
mape(wfdf$Net.Sales, wf_pred2)
## [1] 6.316223

Observation: Comparing the metrics of the wf_reg and wf_mreg models indicates that the latter is the better fit, since every error measure is lower than for the trend-only linear regression.
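
Note: a small sketch (not part of the provided workflow) that gathers the accuracy measures of both models into a single data frame makes this side-by-side comparison easier to read:

# collect accuracy measures for both Whole Foods models in one table
data.frame(
  model = c("wf_reg (trend only)", "wf_mreg (trend + seasonality)"),
  MAE   = c(mae(wfdf$Net.Sales, wf_pred),  mae(wfdf$Net.Sales, wf_pred2)),
  RMSE  = c(rmse(wfdf$Net.Sales, wf_pred), rmse(wfdf$Net.Sales, wf_pred2)),
  MAPE  = c(mape(wfdf$Net.Sales, wf_pred), mape(wfdf$Net.Sales, wf_pred2))
)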

Generate Predictions

For this specific scenario, the Whole Foods net sales for 2017 Q1, Q2, Q3, and Q4 were predicted using the multiple regression model.

# create an object with the time periods to use for the prediction
new <- data.frame(Time = c(49, 50, 51, 52), Q2 = c(0,1,0,0), Q3 = c(0,0,1,0), 
                  Q4 = c(0,0,0,1)) 
predict(wf_mreg, newdata = new)
##        1        2        3        4 
## 4.740871 4.070038 4.098371 4.060871

LinkedIn Quarterly Members

Load & Peek at the Data

#read dataset into R
lidf <- read.csv("linked_in.csv")
lidf
##    Quarter Members
## 1   2009Q1    37.3
## 2   2009Q2    42.0
## 3   2009Q3    48.0
## 4   2009Q4    55.1
## 5   2010Q1    64.2
## 6   2010Q2    71.8
## 7   2010Q3    80.6
## 8   2010Q4    90.4
## 9   2011Q1   101.5
## 10  2011Q2   115.8
## 11  2011Q3   131.2
## 12  2011Q4   145.0
## 13  2012Q1   160.6
## 14  2012Q2   173.9
## 15  2012Q3   187.4
## 16  2012Q4   201.9
## 17  2013Q1   218.3
## 18  2013Q2   238.1
## 19  2013Q3   259.2
## 20  2013Q4   276.8
## 21  2014Q1   296.5
## 22  2014Q2   313.4

Observation: This data frame holds 22 quarters (2009 Q1 through 2014 Q2) of LinkedIn membership counts, in millions.

Time Series Plot

#create a time series plot showing number of LinkedIn members by quarter, 
#in millions
ggplot(data = lidf, mapping = aes(x = Quarter, y = Members)) +
  geom_line (group=1) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "LinkedIn members by Quarter (millions), 2009 to 2014", 
       x = "Quarter", y = "Members")

Observation: This time series plot shows an upward, accelerating trend in memberships.

Data Manipulation

To implement regression in this scenario, first a new column (Time) of consecutive numbers was added to represent each quarter and a new variable was created by squaring the Time variable.

# add column of consecutive numbers corresponding with each quarter
lidf$Time <- 1:nrow(lidf) 

# create a new variable that squares the Time variable
lidf$Time2 <- lidf$Time^2

# check data
lidf
##    Quarter Members Time Time2
## 1   2009Q1    37.3    1     1
## 2   2009Q2    42.0    2     4
## 3   2009Q3    48.0    3     9
## 4   2009Q4    55.1    4    16
## 5   2010Q1    64.2    5    25
## 6   2010Q2    71.8    6    36
## 7   2010Q3    80.6    7    49
## 8   2010Q4    90.4    8    64
## 9   2011Q1   101.5    9    81
## 10  2011Q2   115.8   10   100
## 11  2011Q3   131.2   11   121
## 12  2011Q4   145.0   12   144
## 13  2012Q1   160.6   13   169
## 14  2012Q2   173.9   14   196
## 15  2012Q3   187.4   15   225
## 16  2012Q4   201.9   16   256
## 17  2013Q1   218.3   17   289
## 18  2013Q2   238.1   18   324
## 19  2013Q3   259.2   19   361
## 20  2013Q4   276.8   20   400
## 21  2014Q1   296.5   21   441
## 22  2014Q2   313.4   22   484

Linear Regression Model

The first model created was a simple linear regression model, where Members was the outcome and Time was used as the predictor.

# simple linear regression
li_reg<-lm(Members ~ Time, data = lidf)
summary(li_reg)
## 
## Call:
## lm(formula = Members ~ Time, data = lidf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.400  -9.994  -5.359  10.707  27.629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.7325     6.0825  -0.614    0.546    
## Time         13.4036     0.4631  28.942   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.78 on 20 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.9755 
## F-statistic: 837.7 on 1 and 20 DF,  p-value: < 2.2e-16

Observation: The line of best fit is \(y = 13.4x - 3.7\). This indicates that membership grows by roughly 13.4 million members per quarter.

Accuracy Metrics

# create a vector of predicted values
li_pred = predict(li_reg)

# calculate accuracy measures 
mae (lidf$Members, li_pred)
## [1] 11.50083
mse (lidf$Members, li_pred)
## [1] 172.6512
rmse (lidf$Members, li_pred)
## [1] 13.13968
mape (lidf$Members, li_pred)
## [1] 12.65052

Quadratic Regression

Next, because the time series plot suggested an accelerating (exponential-looking) trend, a quadratic (second-degree polynomial) model was created by adding Time2 as a predictor.

# quadratic regression model
li_qreg<-lm(Members ~ Time + Time2, data = lidf)
summary(li_qreg)
## 
## Call:
## lm(formula = Members ~ Time + Time2, data = lidf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0931 -1.5068  0.0645  1.6979  3.8404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.56883    1.49168   19.82 3.74e-14 ***
## Time         5.07829    0.29877   17.00 5.99e-13 ***
## Time2        0.36197    0.01261   28.70  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.123 on 19 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.805e+04 on 2 and 19 DF,  p-value: < 2.2e-16

Observation: The line of best fit is equal to \(y = 0.36x^2 + 5.08x + 29.57\).
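
Note: an equivalent way to fit this model (not used in the provided workflow) is poly() with raw = TRUE, which builds the squared term internally instead of relying on the hand-made Time2 column; the coefficients should match li_qreg.

# equivalent quadratic fit using poly() instead of a manual Time2 column
li_qreg_poly <- lm(Members ~ poly(Time, 2, raw = TRUE), data = lidf)
coef(li_qreg_poly)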

Accuracy Metrics

# create a vector of predicted values
li_pred2 = predict(li_qreg)

# calculate accuracy measures
mae (lidf$Members, li_pred2)
## [1] 1.636153
mse (lidf$Members, li_pred2)
## [1] 3.893917
rmse (lidf$Members, li_pred2)
## [1] 1.973301
mape (lidf$Members, li_pred2)
## [1] 1.473033

Observation: Comparing the metrics of the li_reg and li_qreg models indicates that the latter is the better fit, since its errors are substantially lower than those of the simple linear regression.

Generate Predictions

For this specific scenario, LinkedIn memberships for Q3 and Q4 of 2014 were predicted using the quadratic regression model.

# Create an object with the time periods to use for the prediction
li_new <- data.frame(Time = c(23, 24), Time2 = c(529, 576))
predict(li_qreg, newdata = li_new)
##        1        2 
## 337.8519 359.9429
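
Note: as an optional visual check (not part of the provided workflow), the quadratic fit can be overlaid on the observed membership counts; li_pred2 is the vector of fitted values created above.

# overlay the quadratic fit on the observed LinkedIn membership counts
ggplot(data = lidf, mapping = aes(x = Time)) +
  geom_point(aes(y = Members)) +
  geom_line(aes(y = li_pred2)) +
  labs(title = "LinkedIn Members: Observed vs. Quadratic Fit",
       x = "Time (quarter index)", y = "Members (millions)")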

IV. Conclusion

This time series analysis successfully forecasted several outcome variables using different regression methods. The techniques used in this project could just as easily be adapted to other business scenarios, depending on the trends depicted in a time series plot.