knitr::opts_chunk$set(echo = TRUE)

setwd("C:/Users/danie/OneDrive - University of New Haven")
options(digits = 3, scipen = 999)
remove(list = ls())
graphics.off()

library(readxl)
library(tidyverse)
library(dplyr)
library(car)
library(forecast)

My senior project compared McDonald’s Big Mac Index with the change in used car prices to determine which is the better indicator of inflation. The data sets were collected from FRED and compiled for analysis. Supervised and unsupervised machine learning methods were then tested in RStudio.

Data Collection and Pre-Processing

The data for this project were downloaded from Federal Reserve Economic Data (FRED). I compiled the data into a single Excel workbook for convenience.

bigmac <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                     sheet = "big-mac-source-data-v2")
cars <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                   sheet = "honda")
cpi <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                  sheet = "cpi")
apparel <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                      sheet = "apparel")
recreation <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                         sheet = "recreation")
energy <- read_excel("big-mac-source-data-v2 (1).xlsx", 
                     sheet = "energy")

Pre-processing

Each sheet was then filtered down to the relevant value columns, and the results were combined into a single data set for model building.

bigmac <- filter(bigmac, iso_a3 == "USA")   # keep U.S. observations only
bigmac <- bigmac[-c(8,10,14,15,17,19,21,23,25,27,29,31,33,35,37,39,41),]   # drop extra rows so one observation per year remains

cpi <- cpi[-1,]

bm_cols   <- bigmac[,4]                           # local_price (U.S. Big Mac price)
cars_cols <- cars %>% dplyr::select(2, 5)         # year and percent change in used car prices
cpi_cols  <- cpi %>% dplyr::select("CPI")
app_cols  <- apparel %>% dplyr::select("130.4")   # apparel index values
engy_cols <- energy %>% dplyr::select("120.5")    # energy index values
rec_cols  <- recreation %>% dplyr::select(2)      # recreation index values

data <- bind_cols(bm_cols, cars_cols, cpi_cols, app_cols, engy_cols, rec_cols)
data <- data[, c("Year", "local_price", "Value", "130.4", "120.5", "102.9", "CPI")]
names(data) <- c("Year", "BigMac_Price", "Percent_Change_UsedCar_Prices", "Apparel", "Energy", "Recreation", "CPI")

data
## # A tibble: 24 × 7
##     Year BigMac_Price Percent_Change_UsedCar_P…¹ Apparel Energy Recreation   CPI
##    <dbl>        <dbl>                      <dbl>   <dbl>  <dbl>      <dbl> <dbl>
##  1  2000         2.24                        2.4    129    135.       105.  84.4
##  2  2001         2.24                        1.9    125.   118.       106.  86.3
##  3  2002         2.35                       -4.1    121.   136.       107.  87.6
##  4  2003         2.46                       -5.8    121    147.       109.  89.1
##  5  2004         2.47                       -6.5    120.   163.       109.  91.0
##  6  2005         2.58                        4.5    120.   198.       111.  93.3
##  7  2006         2.67                        0.5    120.   202.       111.  95.5
##  8  2007         2.89                       -2.9    119.   240.       113.  97.7
##  9  2008         3.21                       -1.2    120.   184.       114.  99.4
## 10  2009         3.43                       -5      120.   210.       114. 100. 
## # ℹ 14 more rows
## # ℹ abbreviated name: ¹​Percent_Change_UsedCar_Prices

Scaling data

With the combined data set in place, I standardized the numeric columns. This puts all variables on a common scale, which makes the regression coefficients directly comparable and matters for distance-based methods such as k-means clustering.

data_scaled <- data
data_scaled[,2:7] <- scale(data_scaled[,2:7])
data_scaled <- as.data.frame(data_scaled)

cor_matrix <- cor(data_scaled)
cor_matrix
##                                Year BigMac_Price Percent_Change_UsedCar_Prices
## Year                          1.000        0.991                         0.322
## BigMac_Price                  0.991        1.000                         0.286
## Percent_Change_UsedCar_Prices 0.322        0.286                         1.000
## Apparel                       0.437        0.521                         0.202
## Energy                        0.760        0.747                         0.492
## Recreation                    0.926        0.917                         0.356
## CPI                           0.976        0.968                         0.360
##                               Apparel Energy Recreation   CPI
## Year                            0.437  0.760      0.926 0.976
## BigMac_Price                    0.521  0.747      0.917 0.968
## Percent_Change_UsedCar_Prices   0.202  0.492      0.356 0.360
## Apparel                         1.000  0.363      0.507 0.508
## Energy                          0.363  1.000      0.782 0.772
## Recreation                      0.507  0.782      1.000 0.984
## CPI                             0.508  0.772      0.984 1.000
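
To make these relationships easier to scan, the correlation matrix can also be drawn as a heat map. The sketch below uses only ggplot2 (already loaded via tidyverse); cor_long is a helper name introduced here, and the as.table() reshaping is just one convenient way to get the matrix into long format.

cor_long <- as.data.frame(as.table(cor_matrix))   # long format: Var1, Var2, Freq
names(cor_long) <- c("Var1", "Var2", "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = NULL, y = NULL)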

Supervised Machine Learning Methods

Linear Regression

The first supervised machine learning method I used was a linear regression model, chosen for its simplicity and interpretability. An important caveat is that the data set contains only 24 rows, so the model's in-sample accuracy will look very high simply because of the small sample size.

model_scaled <- lm(CPI ~ BigMac_Price + 
                     Percent_Change_UsedCar_Prices + 
                     Apparel + 
                     Energy + 
                     Recreation,
                   data = data_scaled)
summary(model_scaled)
## 
## Call:
## lm(formula = CPI ~ BigMac_Price + Percent_Change_UsedCar_Prices + 
##     Apparel + Energy + Recreation, data = data_scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14092 -0.04708 -0.00177  0.04972  0.08415 
## 
## Coefficients:
##                                            Estimate            Std. Error
## (Intercept)                   -0.000000000000000262  0.013300578613132222
## BigMac_Price                   0.443417109519961217  0.035316178933994656
## Percent_Change_UsedCar_Prices  0.045337597128695784  0.015896468577579832
## Apparel                       -0.024058808248740533  0.016085796385526641
## Energy                        -0.053708911803309896  0.023943848842952600
## Recreation                     0.615155259692325540  0.036904426670225289
##                               t value        Pr(>|t|)    
## (Intercept)                      0.00           1.000    
## BigMac_Price                    12.56 0.0000000002427 ***
## Percent_Change_UsedCar_Prices    2.85           0.011 *  
## Apparel                         -1.50           0.152    
## Energy                          -2.24           0.038 *  
## Recreation                      16.67 0.0000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0652 on 18 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.996 
## F-statistic: 1.08e+03 on 5 and 18 DF,  p-value: <0.0000000000000002
lm_model_predictions <- predict(model_scaled, newdata = data_scaled)
lm_model_actuals <- data_scaled$CPI

plot(data_scaled$CPI, lm_model_predictions,
     xlab = "Actual CPI (scaled)", ylab = "Predicted CPI (scaled)")
abline(a = 0, b = 1, col = "red")   # 45-degree reference line

lm_mae <- mean(abs(lm_model_actuals - lm_model_predictions))
lm_mae
## [1] 0.0468
lm_rmse <- sqrt(mean((lm_model_actuals - lm_model_predictions)^2))
lm_rmse
## [1] 0.0564

A mean absolute error of 0.0468 means that, on average, the in-sample predictions deviate from the actual standardized CPI values by about 0.05 standard deviations (recall that the data were scaled). The full coefficient detail appears in the model summary above.
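
Because the data are standardized, the MAE can also be put in context by comparing it with the overall spread of the scaled CPI series. The short sketch below is one way to do that; cpi_range is a helper name introduced here, not part of the original analysis.

cpi_range <- diff(range(data_scaled$CPI))   # total spread of the scaled CPI series
lm_mae / cpi_range                          # MAE as a share of that spread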


The triple asterisks next to the p-values of BigMac_Price and Recreation indicate that these variables are highly statistically significant in the linear regression model. Used car prices carry some explanatory weight, but they are not as statistically significant as the other two predictors.
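
Given the small-sample caveat raised earlier, a leave-one-out cross-validation gives a less optimistic error estimate than the in-sample MAE and RMSE. The sketch below uses only base R; loo_pred, loo_mae, and loo_rmse are names introduced here for illustration.

loo_pred <- sapply(seq_len(nrow(data_scaled)), function(i) {
  fit <- lm(CPI ~ BigMac_Price + Percent_Change_UsedCar_Prices +
              Apparel + Energy + Recreation,
            data = data_scaled[-i, ])                      # fit without observation i
  predict(fit, newdata = data_scaled[i, , drop = FALSE])   # predict the held-out row
})
loo_mae  <- mean(abs(data_scaled$CPI - loo_pred))
loo_rmse <- sqrt(mean((data_scaled$CPI - loo_pred)^2))
c(LOO_MAE = loo_mae, LOO_RMSE = loo_rmse)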

Time-Series Regression

The second supervised machine learning method I used was time-series regression, which gave me a better way to display inflation predictions for the coming years. The forecasting was done with an Autoregressive Integrated Moving Average (ARIMA) model; the auto.arima() function tests different combinations of lag terms, differences, and moving-average error terms and selects the best-fitting combination.

ts_target <- ts(data_scaled$CPI, start = 2000, frequency = 1)   # annual series covering 2000-2023
ts_predictors <- data_scaled[, -which(names(data_scaled) == "CPI")]
ts_predictors$time <- time(ts_target)

ts_model <- lm(ts_target ~ BigMac_Price + 
                 Percent_Change_UsedCar_Prices + 
                 Apparel + 
                 Energy + 
                 Recreation,
               data = ts_predictors)
summary(ts_model)
## 
## Call:
## lm(formula = ts_target ~ BigMac_Price + Percent_Change_UsedCar_Prices + 
##     Apparel + Energy + Recreation, data = ts_predictors)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14092 -0.04708 -0.00177  0.04972  0.08415 
## 
## Coefficients:
##                                            Estimate            Std. Error
## (Intercept)                   -0.000000000000000262  0.013300578613132222
## BigMac_Price                   0.443417109519961217  0.035316178933994656
## Percent_Change_UsedCar_Prices  0.045337597128695784  0.015896468577579832
## Apparel                       -0.024058808248740533  0.016085796385526641
## Energy                        -0.053708911803309896  0.023943848842952600
## Recreation                     0.615155259692325540  0.036904426670225289
##                               t value        Pr(>|t|)    
## (Intercept)                      0.00           1.000    
## BigMac_Price                    12.56 0.0000000002427 ***
## Percent_Change_UsedCar_Prices    2.85           0.011 *  
## Apparel                         -1.50           0.152    
## Energy                          -2.24           0.038 *  
## Recreation                      16.67 0.0000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0652 on 18 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.996 
## F-statistic: 1.08e+03 on 5 and 18 DF,  p-value: <0.0000000000000002
ts_arima <- auto.arima(ts_target)
ts_arima
## Series: ts_target 
## ARIMA(0,2,2) 
## 
## Coefficients:
##         ma1     ma2
##       0.298  -0.509
## s.e.  0.208   0.199
## 
## sigma^2 = 0.00294:  log likelihood = 33.4
## AIC=-60.9   AICc=-59.5   BIC=-57.6
forecast_values <- forecast(ts_arima, h = 12)
plot(forecast_values)

The darker band in the chart represents the 80% prediction interval, while the lighter outer band represents the 95% prediction interval (these are prediction intervals for future values, not confidence intervals for the mean).
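
The interval bounds behind those shaded bands can be inspected directly from the forecast object; by default forecast() produces 80% and 95% prediction intervals. A quick sketch (output not shown):

forecast_values        # point forecasts with the 80% and 95% interval bounds
# The bounds are also available directly as forecast_values$lower and forecast_values$upper.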


Similar to the linear regression model, the time-series regression placed heavy weight on BigMac_Price and Recreation, with Percent_Change_UsedCar_Prices showing lower significance.
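
For a rough comparison with the linear model's error metrics, the in-sample accuracy measures of the ARIMA fit can be computed the same way (a sketch; output not shown here):

forecast::accuracy(ts_arima)   # in-sample ME, RMSE, MAE, etc. for the ARIMA model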

Unsupervised Machine Learning Method

K-Means Clustering

K-means clustering separates the observations into ‘K’ distinct clusters based on their similarity. It groups data by minimizing the distance between points within the same cluster while maximizing the separation between clusters.
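
This project uses three clusters. One common way to sanity-check that choice is an elbow plot of the total within-cluster sum of squares over a range of cluster counts; the sketch below assumes k from 1 to 8 is a reasonable range to examine, and wss is a helper name introduced here.

set.seed(123)
wss <- sapply(1:8, function(k)
  kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")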

set.seed(123)
kmeans_model <- kmeans(data_scaled, centers = 3)
kmeans_data <- data_scaled
kmeans_data$cluster <- factor(kmeans_model$cluster)

kmeans_lm_model <- lm(CPI ~ BigMac_Price + 
                        Percent_Change_UsedCar_Prices + 
                        Apparel + 
                        Energy + 
                        Recreation + 
                        cluster, 
                      data = kmeans_data)
summary(kmeans_lm_model)
## 
## Call:
## lm(formula = CPI ~ BigMac_Price + Percent_Change_UsedCar_Prices + 
##     Apparel + Energy + Recreation + cluster, data = kmeans_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12263 -0.03771  0.00659  0.03723  0.08272 
## 
## Coefficients:
##                               Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)                    -0.0524     0.0325   -1.61       0.126    
## BigMac_Price                    0.4646     0.0728    6.38 0.000009142 ***
## Percent_Change_UsedCar_Prices   0.0373     0.0155    2.41       0.028 *  
## Apparel                        -0.0136     0.0152   -0.90       0.383    
## Energy                         -0.0218     0.0254   -0.86       0.404    
## Recreation                      0.5425     0.0528   10.28 0.000000019 ***
## cluster2                        0.1143     0.0534    2.14       0.048 *  
## cluster3                        0.0507     0.0795    0.64       0.532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.059 on 16 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.997 
## F-statistic:  943 on 7 and 16 DF,  p-value: <0.0000000000000002
forecast::accuracy(kmeans_lm_model)
##                                   ME   RMSE    MAE   MPE MAPE   MASE
## Training set -0.00000000000000000867 0.0481 0.0389 -6.54 20.9 0.0485
ggplot(kmeans_data, aes(x = fitted(kmeans_lm_model), y = residuals(kmeans_lm_model), 
                        color = cluster)) + 
  geom_point() + 
  geom_smooth()

kmeans_predictions <- predict(kmeans_lm_model, newdata = kmeans_data)
kmeans_actuals <- kmeans_data$CPI

kmeans_mae <- mean(abs(kmeans_actuals - kmeans_predictions))
kmeans_mae
## [1] 0.0389
kmeans_rmse <- sqrt(mean((kmeans_actuals - kmeans_predictions)^2))
kmeans_rmse
## [1] 0.0481

The model created three clusters, which are separated by color in the residual plot above; the coefficient detail appears in the model summary earlier in this section.
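
To see which years were grouped together, the cluster assignments can be listed by year; a quick sketch (output not shown):

split(data$Year, kmeans_model$cluster)   # years belonging to each cluster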


Using a different model did not change how the variables relate to CPI, as the summaries show. The relationship between each predictor and CPI can also be examined with added-variable plots, shown below.
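
These plots can be generated from the scaled linear model with the car package loaded at the top of the script; a minimal sketch:

car::avPlots(model_scaled)   # one added-variable plot per predictor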

These plots show the relationship between each individual predictor and CPI after the other predictors are accounted for, plotting the partial residuals with a trend line for clarity. The residuals for BigMac_Price and Recreation follow their trend lines closely, which explains the strong significance the models reported. The residuals for Percent_Change_UsedCar_Prices track their trend line less closely, which explains that variable's lower significance in the models.

Conclusion

After testing supervised and unsupervised machine learning models, McDonald’s Big Mac Index was deemed the better indicator of inflation: across every model summary, BigMac_Price showed consistently higher statistical significance than the change in used car prices.