knitr::opts_chunk$set(echo = TRUE)
setwd("C:/Users/danie/OneDrive - University of New Haven")
options(digits = 3, scipen = 999)
remove(list = ls())
graphics.off()
library(readxl)
library(tidyverse)
library(dplyr)
library(car)
library(forecast)
My senior project compared McDonald’s Big Mac Index with the change in used car prices to determine which is the better indicator of inflation. The data sets were downloaded from FRED, and supervised and unsupervised machine learning methods were tested in RStudio.
The data for this project were downloaded from Federal Reserve Economic Data (FRED). I compiled them into a single Excel workbook, one sheet per series, for convenience.
bigmac <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "big-mac-source-data-v2")
cars <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "honda")
cpi <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "cpi")
apparel <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "apparel")
recreation <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "recreation")
energy <- read_excel("big-mac-source-data-v2 (1).xlsx",
sheet = "energy")
Each sheet was then reduced to the columns of interest, and the results were combined into a single data set for model building.
bigmac <- filter(bigmac, iso_a3 == "USA")
bigmac <- bigmac[-c(8,10,14,15,17,19,21,23,25,27,29,31,33,35,37,39,41),]
cpi <- cpi[-1,]
bm_cols <- bigmac[,4]
cars_cols <- cars %>% dplyr::select(2,5)
cpi_cols <- cpi %>% dplyr::select("CPI")
app_cols <- apparel %>% dplyr::select("130.4")
engy_cols <- energy %>% dplyr::select("120.5")
rec_cols <- recreation %>% dplyr::select(2)
data <- bind_cols(bm_cols, cars_cols, cpi_cols, app_cols, engy_cols, rec_cols)
data <- data[, c("Year", "local_price", "Value", "130.4", "120.5", "102.9", "CPI")]
names(data) <- c("Year", "BigMac_Price", "Percent_Change_UsedCar_Prices", "Apparel", "Energy", "Recreation", "CPI")
data
## # A tibble: 24 × 7
## Year BigMac_Price Percent_Change_UsedCar_P…¹ Apparel Energy Recreation CPI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 2.24 2.4 129 135. 105. 84.4
## 2 2001 2.24 1.9 125. 118. 106. 86.3
## 3 2002 2.35 -4.1 121. 136. 107. 87.6
## 4 2003 2.46 -5.8 121 147. 109. 89.1
## 5 2004 2.47 -6.5 120. 163. 109. 91.0
## 6 2005 2.58 4.5 120. 198. 111. 93.3
## 7 2006 2.67 0.5 120. 202. 111. 95.5
## 8 2007 2.89 -2.9 119. 240. 113. 97.7
## 9 2008 3.21 -1.2 120. 184. 114. 99.4
## 10 2009 3.43 -5 120. 210. 114. 100.
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹Percent_Change_UsedCar_Prices
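bind_cols() matches rows purely by position, so the step above relies on every sheet having the same number of rows in the same year order. A minimal alternative sketch, assuming each sheet carried a Year column, joins on that key instead. The toy data frames and the column name UsedCar_Change are hypothetical and only illustrate the idea.

```r
# Toy stand-ins for three of the sheets; names and values are illustrative only.
bigmac_toy <- data.frame(Year = c(2000, 2001, 2002),
                         BigMac_Price = c(2.24, 2.24, 2.35))
cars_toy   <- data.frame(Year = c(2000, 2001, 2002),
                         UsedCar_Change = c(2.4, 1.9, -4.1))
# Rows deliberately out of order to show that position no longer matters.
cpi_toy    <- data.frame(Year = c(2002, 2000, 2001),
                         CPI = c(87.6, 84.4, 86.3))

# Join the frames one after another on the shared Year key.
combined <- Reduce(function(a, b) merge(a, b, by = "Year"),
                   list(bigmac_toy, cars_toy, cpi_toy))
combined <- combined[order(combined$Year), ]
print(combined)
```

With a key-based join, a missing or reordered year shows up as a dropped or mismatched row rather than silently pairing values from different years.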
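With the single data set assembled, I standardized every column except Year to mean 0 and standard deviation 1. This puts all variables on a common scale, so the regression coefficients can be compared directly and no single variable dominates distance-based methods such as K-Means. Standardizing does not by itself change the fit of a linear regression.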
data_scaled <- data
data_scaled[,2:7] <- scale(data_scaled[,2:7])
data_scaled <- as.data.frame(data_scaled)
cor_matrix <- cor(data_scaled)
cor_matrix
## Year BigMac_Price Percent_Change_UsedCar_Prices
## Year 1.000 0.991 0.322
## BigMac_Price 0.991 1.000 0.286
## Percent_Change_UsedCar_Prices 0.322 0.286 1.000
## Apparel 0.437 0.521 0.202
## Energy 0.760 0.747 0.492
## Recreation 0.926 0.917 0.356
## CPI 0.976 0.968 0.360
## Apparel Energy Recreation CPI
## Year 0.437 0.760 0.926 0.976
## BigMac_Price 0.521 0.747 0.917 0.968
## Percent_Change_UsedCar_Prices 0.202 0.492 0.356 0.360
## Apparel 1.000 0.363 0.507 0.508
## Energy 0.363 1.000 0.782 0.772
## Recreation 0.507 0.782 1.000 0.984
## CPI 0.508 0.772 0.984 1.000
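As a small check on what the standardization step does, the sketch below reproduces scale() by hand on a short vector and confirms that correlations are unaffected by it. The two vectors are illustrative values, not the full data set.

```r
# Illustrative values only.
x <- c(84.4, 86.3, 87.6, 89.1, 91.0)
y <- c(2.24, 2.24, 2.35, 2.46, 2.47)

# Standardize by hand: subtract the mean, divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))   # same result as scale()
print(c(mean = mean(z_scale), sd = sd(z_scale)))

# Correlation is invariant to centering and rescaling.
cor_raw    <- cor(x, y)
cor_scaled <- cor(scale(x), scale(y))[1, 1]
print(all.equal(cor_raw, cor_scaled))
```

This is why the correlation matrix above would look the same whether it is computed on data or on data_scaled.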
The first supervised machine learning method I used was a linear regression model, chosen for its simplicity. An important caveat is that the data set has only 24 rows and the model is evaluated on the same data it was fit to, so the reported accuracy is likely optimistic.
model_scaled <- lm(CPI ~ BigMac_Price +
Percent_Change_UsedCar_Prices +
Apparel +
Energy +
Recreation,
data = data_scaled)
summary(model_scaled)
##
## Call:
## lm(formula = CPI ~ BigMac_Price + Percent_Change_UsedCar_Prices +
## Apparel + Energy + Recreation, data = data_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14092 -0.04708 -0.00177 0.04972 0.08415
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.000000000000000262 0.013300578613132222
## BigMac_Price 0.443417109519961217 0.035316178933994656
## Percent_Change_UsedCar_Prices 0.045337597128695784 0.015896468577579832
## Apparel -0.024058808248740533 0.016085796385526641
## Energy -0.053708911803309896 0.023943848842952600
## Recreation 0.615155259692325540 0.036904426670225289
## t value Pr(>|t|)
## (Intercept) 0.00 1.000
## BigMac_Price 12.56 0.0000000002427 ***
## Percent_Change_UsedCar_Prices 2.85 0.011 *
## Apparel -1.50 0.152
## Energy -2.24 0.038 *
## Recreation 16.67 0.0000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0652 on 18 degrees of freedom
## Multiple R-squared: 0.997, Adjusted R-squared: 0.996
## F-statistic: 1.08e+03 on 5 and 18 DF, p-value: <0.0000000000000002
lm_model_predictions <- predict(model_scaled, newdata = data_scaled)
lm_model_actuals <- data_scaled$CPI
plot(data_scaled$CPI, lm_model_predictions)
abline(a = 0, b = 1, col = "red")
lm_mae <- mean(abs(lm_model_actuals - lm_model_predictions))
lm_mae
## [1] 0.0468
lm_rmse <- sqrt(mean((lm_model_actuals - lm_model_predictions)^2))
lm_rmse
## [1] 0.0564
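The MAE and RMSE above are computed on the same rows the model was fit to. One way to get a less optimistic estimate with so few rows is leave-one-out cross-validation. The sketch below is an illustration on simulated data, not a result from this project; the variables x1, x2, and y are hypothetical.

```r
set.seed(1)
n <- 24
# Simulated stand-in data; coefficients and noise level are arbitrary.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 0.5 * toy$x1 + 0.6 * toy$x2 + rnorm(n, sd = 0.1)

# In-sample error: fit on all rows, evaluate on the same rows.
fit_all   <- lm(y ~ x1 + x2, data = toy)
mae_train <- mean(abs(toy$y - fitted(fit_all)))

# Leave-one-out: refit without row i, then predict row i.
loo_err <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = toy[-i, ])
  toy$y[i] - predict(fit_i, newdata = toy[i, , drop = FALSE])
})
mae_loo <- mean(abs(loo_err))

print(c(train = mae_train, leave_one_out = mae_loo))
```

For ordinary least squares the leave-one-out error is never smaller than the in-sample error, so the gap between the two numbers gives a rough sense of how optimistic the training-set figure is.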
Because the data were standardized, a Mean Absolute Error of 0.0468 means the predictions are off by about 0.047 standard deviations of CPI on average, not by 4.68% of its range. The model summary printed above gives the coefficient estimates behind these predictions.
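To read an error on the standardized scale in the original units, it can be multiplied by the standard deviation of the unscaled variable. The sketch below checks that identity on made-up numbers; cpi_raw and pred_raw are illustrative and are not the project's values.

```r
# Illustrative raw values and made-up predictions.
cpi_raw  <- c(84.4, 86.3, 87.6, 89.1, 91.0, 93.3)
pred_raw <- cpi_raw + c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1)

mu <- mean(cpi_raw)
s  <- sd(cpi_raw)

# MAE computed on the standardized scale, as in the report.
mae_scaled <- mean(abs((cpi_raw - mu) / s - (pred_raw - mu) / s))

# MAE computed directly in the original units.
mae_raw <- mean(abs(cpi_raw - pred_raw))

# Multiplying by the standard deviation converts one into the other.
print(all.equal(mae_scaled * s, mae_raw))
```

With the objects from this report, the equivalent conversion would be lm_mae * sd(data$CPI), assuming data still holds the unscaled CPI column.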
The triple asterisks next to the p-values of BigMac_Price and Recreation mark them as significant at the 0.001 level, making them the strongest predictors in the linear regression model. Percent_Change_UsedCar_Prices is also significant, but only at the 0.05 level (p = 0.011), and its standardized coefficient (0.045) is roughly a tenth of that of BigMac_Price (0.443).
The second supervised machine learning method I used was time-series regression, mainly as a better way to display predicted inflation over the coming years. The forecast comes from an Autoregressive Integrated Moving Average (ARIMA) model. The function auto.arima() searches over combinations of autoregressive lags, differencing orders, and moving-average terms and keeps the model with the best information criterion (AICc by default).
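# Start the annual series at the first year in the data so the time axis matches the Year column
ts_target <- ts(data_scaled$CPI, start = min(data_scaled$Year), frequency = 1)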
ts_predictors <- data_scaled[, -which(names(data_scaled) == "CPI")]
ts_predictors$time <- time(ts_target)
ts_model <- lm(ts_target ~ BigMac_Price +
Percent_Change_UsedCar_Prices +
Apparel +
Energy +
Recreation,
data = ts_predictors)
summary(ts_model)
##
## Call:
## lm(formula = ts_target ~ BigMac_Price + Percent_Change_UsedCar_Prices +
## Apparel + Energy + Recreation, data = ts_predictors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14092 -0.04708 -0.00177 0.04972 0.08415
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.000000000000000262 0.013300578613132222
## BigMac_Price 0.443417109519961217 0.035316178933994656
## Percent_Change_UsedCar_Prices 0.045337597128695784 0.015896468577579832
## Apparel -0.024058808248740533 0.016085796385526641
## Energy -0.053708911803309896 0.023943848842952600
## Recreation 0.615155259692325540 0.036904426670225289
## t value Pr(>|t|)
## (Intercept) 0.00 1.000
## BigMac_Price 12.56 0.0000000002427 ***
## Percent_Change_UsedCar_Prices 2.85 0.011 *
## Apparel -1.50 0.152
## Energy -2.24 0.038 *
## Recreation 16.67 0.0000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0652 on 18 degrees of freedom
## Multiple R-squared: 0.997, Adjusted R-squared: 0.996
## F-statistic: 1.08e+03 on 5 and 18 DF, p-value: <0.0000000000000002
ts_arima <- auto.arima(ts_target)
ts_arima
## Series: ts_target
## ARIMA(0,2,2)
##
## Coefficients:
## ma1 ma2
## 0.298 -0.509
## s.e. 0.208 0.199
##
## sigma^2 = 0.00294: log likelihood = 33.4
## AIC=-60.9 AICc=-59.5 BIC=-57.6
forecast_values <- forecast(ts_arima, h = 12)
plot(forecast_values)
The darker band in the chart is the 80% prediction interval, and the lighter outer band is the 95% prediction interval. The regression summary for ts_model, printed above, matches the earlier linear regression exactly because it fits the same formula to the same data; the ARIMA forecast itself uses only the history of CPI.
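The forecast bands can be understood as the point forecast plus or minus a normal quantile times the forecast standard error. The sketch below shows that arithmetic with base R's arima() and predict() on a simulated series. It is an illustration under assumed coefficients, not a reproduction of the fitted model above.

```r
set.seed(42)
# Simulate a series from an ARIMA(0,2,2) process; the MA coefficients are illustrative.
y <- arima.sim(model = list(order = c(0, 2, 2), ma = c(0.3, -0.5)), n = 40)

# Fit the same model order that auto.arima() selected in the report.
fit <- arima(y, order = c(0, 2, 2))
fc  <- predict(fit, n.ahead = 12)   # point forecasts and their standard errors

# 80% and 95% bands from normal quantiles.
lower80 <- fc$pred - qnorm(0.90)  * fc$se
upper80 <- fc$pred + qnorm(0.90)  * fc$se
lower95 <- fc$pred - qnorm(0.975) * fc$se
upper95 <- fc$pred + qnorm(0.975) * fc$se

bands <- data.frame(forecast = as.numeric(fc$pred),
                    lower80 = as.numeric(lower80), upper80 = as.numeric(upper80),
                    lower95 = as.numeric(lower95), upper95 = as.numeric(upper95))
print(head(bands, 3))
```

The 95% band is always wider than the 80% band because it uses a larger quantile on the same standard error, which is why it appears as the outer, lighter region in the chart.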
As in the linear regression model, BigMac_Price and Recreation carry the most weight in the time-series regression, with Percent_Change_UsedCar_Prices showing lower significance.
K-Means clustering, the unsupervised method, separates the observations into K distinct clusters based on their similarity. It assigns each point to the nearest cluster center and moves the centers so as to minimize the total squared distance between points and the center of their own cluster.
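A minimal sketch of that idea on simulated data: two well-separated groups of points are generated, and kmeans() is asked for two clusters. The data and the choice of nstart are assumptions made for illustration only.

```r
set.seed(123)
# Two simulated groups of 10 points each, centered near 0 and near 5.
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best solution.
km <- kmeans(toy, centers = 2, nstart = 10)

print(km$cluster)   # cluster label for each row
# The total sum of squares splits into a within-cluster and a between-cluster part;
# K-Means minimizes the within-cluster part.
print(c(within = km$tot.withinss, between = km$betweenss, total = km$totss))
```

Because the total is fixed for a given data set, making the within-cluster part small is the same as making the between-cluster part large.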
set.seed(123)
kmeans_model <- kmeans(data_scaled, centers = 3)
kmeans_data <- data_scaled
kmeans_data$cluster <- factor(kmeans_model$cluster)
kmeans_lm_model <- lm(CPI ~ BigMac_Price +
Percent_Change_UsedCar_Prices +
Apparel +
Energy +
Recreation +
cluster,
data = kmeans_data)
summary(kmeans_lm_model)
##
## Call:
## lm(formula = CPI ~ BigMac_Price + Percent_Change_UsedCar_Prices +
## Apparel + Energy + Recreation + cluster, data = kmeans_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12263 -0.03771 0.00659 0.03723 0.08272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0524 0.0325 -1.61 0.126
## BigMac_Price 0.4646 0.0728 6.38 0.000009142 ***
## Percent_Change_UsedCar_Prices 0.0373 0.0155 2.41 0.028 *
## Apparel -0.0136 0.0152 -0.90 0.383
## Energy -0.0218 0.0254 -0.86 0.404
## Recreation 0.5425 0.0528 10.28 0.000000019 ***
## cluster2 0.1143 0.0534 2.14 0.048 *
## cluster3 0.0507 0.0795 0.64 0.532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.059 on 16 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.997
## F-statistic: 943 on 7 and 16 DF, p-value: <0.0000000000000002
forecast::accuracy(kmeans_lm_model)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.00000000000000000867 0.0481 0.0389 -6.54 20.9 0.0485
ggplot(kmeans_data, aes(x = fitted(kmeans_lm_model), y = residuals(kmeans_lm_model),
color = cluster)) +
geom_point() +
geom_smooth()
kmeans_predictions <- predict(kmeans_lm_model, newdata = kmeans_data)
kmeans_actuals <- kmeans_data$CPI
kmeans_mae <- mean(abs(kmeans_actuals - kmeans_predictions))
kmeans_mae
## [1] 0.0389
kmeans_rmse <- sqrt(mean((kmeans_actuals - kmeans_predictions)^2))
kmeans_rmse
## [1] 0.0481
The model uses 3 clusters, which are separated by color in the chart. Note that kmeans() was run on every column of data_scaled, including the unscaled Year and CPI itself, so the clusters likely track time periods more than anything else. In the summary printed above, adding the cluster labels leaves BigMac_Price and Recreation as the dominant predictors.
Switching models did not change how the variables relate to CPI, as the summaries show. The partial relationship between each variable and CPI is shown in the added-variable plots below:
These plots show the relationship between each variable and CPI after accounting for the other predictors. For each variable, the part of CPI not explained by the other predictors is plotted against the part of that variable not explained by the other predictors, with a trend line for clarity. The points for BigMac_Price and Recreation lie close to their trend lines, which is consistent with the strong significance the models reported. The points in the Percent_Change_UsedCar_Prices plot are more scattered around the trend line, which is consistent with its lower significance.
After testing supervised and unsupervised machine learning models, I found the McDonald’s Big Mac Index to be the better indicator of inflation of the two. In every model summary, BigMac_Price has a larger standardized coefficient and a far smaller p-value than Percent_Change_UsedCar_Prices.
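The construction behind an added-variable plot can be reproduced by hand with two auxiliary regressions. The sketch below uses simulated data with hypothetical variables x1, x2, and y, and checks that the slope of the added-variable trend line equals the coefficient from the full model.

```r
set.seed(7)
n  <- 24
# Simulated, correlated predictors and a response; all values are illustrative.
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
y  <- 0.4 * x1 + 0.6 * x2 + rnorm(n, sd = 0.1)

# Full model with both predictors.
full <- lm(y ~ x1 + x2)

# Added-variable construction for x1:
res_y  <- residuals(lm(y ~ x2))    # part of y not explained by x2
res_x1 <- residuals(lm(x1 ~ x2))   # part of x1 not explained by x2
av_fit <- lm(res_y ~ res_x1)       # trend line of the added-variable plot

slope_full <- unname(coef(full)["x1"])
slope_av   <- unname(coef(av_fit)["res_x1"])
print(c(full_model = slope_full, added_variable = slope_av))
```

Calling plot(res_x1, res_y) followed by abline(av_fit) would draw the plot itself. The tighter the points sit around that line, the more precisely the coefficient is estimated.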