Discussion Week 12

Introduction: CPI - The Consumer Price Index

The data set shows information about the consumer price index, or CPI. The CPI summarizes the cost of a representative market basket of goods. The data can be used to determine salaries for employees or to measure inflation. The index used in the given data compares each location to New York City. For example, if Rent.Index is 125, then the rent in that city is about 25% higher than in New York City. I provided an excerpt of the explanation from the website below:

“The Consumer Price Index (CPI) summarizes the cost of a representative market basket of goods that includes groceries, restaurants, transportation, utilities, and medical care. Global companies often use the CPI to determine living allowances and salaries for employees. Inflation is often measured by how much the CPI changes from year to year. Relative CPIs can be found for different cities. We have data giving CPI components relative to New York City. For New York City, each index is 100(%).”

The variables in the data set and their descriptions are below:

City: Location by city. U.S. states are included for all U.S. cities.

Consumer.Price.Index: The relative prices of consumer goods like groceries, restaurants, transportation, and utilities. It excludes rent and mortgage.

Rent.Index: Estimates the prices of renting apartments in a city.

Consumer.Price.Plus.Rent.Index: Estimates the consumer goods prices, including rent.

Groceries.Index: Estimation of grocery prices in a city.

Restaurant.Price.Index: The relative prices of meals and drinks in restaurants and bars.

Local.Purchasing.Power.Index: The relative purchasing power in a given city based on the average net salary. A lower purchasing power index means that residents with an average salary can afford fewer goods and services compared to residents of NYC with an average salary.

Source: https://dasl.datadescription.com/datafiles/?_sf_s=CPI_Worldwide&_sfm_cases=4+59943

In this assignment, I will create two multiple linear regression models. The first model will use the variables that provide the best analysis. The other model will use one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. After creating these models, I will interpret the coefficients and conduct residual analysis. Lastly, I will determine if the models are appropriate to use for the data.

cpi_data <- read.csv("https://raw.githubusercontent.com/juliaDataScience-22/cuny-spring-24/main/hw12file.csv")

Data Cleaning

The City column is split into city and country at the last comma. The United States rows are copied into their own data frame with state and city separated.

library(stringr)
# To split into City and Country by the last comma
# Some rows have two commas because of states, and some only have one comma
cpi_data[c('City', 'Country')] <- str_split_fixed(cpi_data$City, ', (?=[^,]+$)', 2)

# Create a data frame of just US data
us_data <- cpi_data[cpi_data$Country == "United States", ]

# Separate to City and State
# Note: Washington, DC is split into city and district
# DC is put into the state category
us_data[c('City', 'State')] <- str_split_fixed(us_data$City, ', ', 2)

# Country is identical for every row, so it is unnecessary; the column is removed
us_data <- us_data[, -8]

fifty_states_and_dc <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
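
To see what the last-comma pattern in the cleaning step does, here is a small base-R sketch on three made-up city strings. The strings and the helper name split_last_comma are illustrative and not taken from the data set, and the sketch uses strsplit with perl = TRUE instead of stringr, on the assumption that the same lookahead pattern behaves the same way in both.

```r
# Illustrative strings only; these are not rows from the CPI file
cities <- c("Austin, TX, United States",
            "Zurich, Switzerland",
            "Washington, DC, United States")

# Split on the comma that has no further comma after it (the last one)
split_last_comma <- function(x) {
  parts <- strsplit(x, ", (?=[^,]+$)", perl = TRUE)
  do.call(rbind, parts)
}

result <- split_last_comma(cities)
colnames(result) <- c("City", "Country")
print(result)
```

Rows with a state keep "City, ST" together in the first column, which is why the U.S. subset needs the second split on the remaining comma.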

First Linear Model

First, the correlation plot shows how different numerical variables are correlated. This is to help determine how to avoid collinearity problems. Local.Purchasing.Power.Index appears to be the least correlated to other variables.

I will create a linear model using the other numerical variables to predict Local.Purchasing.Power.Index. To choose among candidate models, I will use stepwise selection, which adds or drops one variable at a time and keeps the change that most lowers the AIC, rather than fitting every possible subset. The MASS library provides the function stepAIC for this.

Linear model:

Local.Purchasing.Power.Index = 44.8935 - 1.3413(Consumer.Price.Index) + 1.3379(Groceries.Index) + 1.1470(Restaurant.Price.Index)

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.92 loaded

library(MASS)

## Warning: package 'MASS' was built under R version 4.3.3

#Correlation plot
corrplot(cor(cpi_data[c(2,3,4,5,6,7)]), method = "color")

# This only includes the numerical variables
full_model <- lm(Local.Purchasing.Power.Index ~ ., data = cpi_data[c(2,3,4,5,6,7)])

step_model <- stepAIC(full_model, direction = "both")

## Start:  AIC=3441.9
## Local.Purchasing.Power.Index ~ Consumer.Price.Index + Rent.Index + 
##     Consumer.Price.Plus.Rent.Index + Groceries.Index + Restaurant.Price.Index
## 
##                                  Df Sum of Sq    RSS    AIC
## - Rent.Index                      1       6.8 493759 3439.9
## - Consumer.Price.Plus.Rent.Index  1       6.8 493759 3439.9
## - Consumer.Price.Index            1       7.9 493760 3439.9
## <none>                                        493752 3441.9
## - Groceries.Index                 1   23352.4 517104 3462.9
## - Restaurant.Price.Index          1   29373.3 523125 3468.6
## 
## Step:  AIC=3439.9
## Local.Purchasing.Power.Index ~ Consumer.Price.Index + Consumer.Price.Plus.Rent.Index + 
##     Groceries.Index + Restaurant.Price.Index
## 
##                                  Df Sum of Sq    RSS    AIC
## - Consumer.Price.Plus.Rent.Index  1     223.4 493982 3438.1
## <none>                                        493759 3439.9
## + Rent.Index                      1       6.8 493752 3441.9
## - Consumer.Price.Index            1    8176.6 501935 3446.1
## - Groceries.Index                 1   23392.6 517151 3460.9
## - Restaurant.Price.Index          1   29483.6 523242 3466.7
## 
## Step:  AIC=3438.13
## Local.Purchasing.Power.Index ~ Consumer.Price.Index + Groceries.Index + 
##     Restaurant.Price.Index
## 
##                                  Df Sum of Sq    RSS    AIC
## <none>                                        493982 3438.1
## + Consumer.Price.Plus.Rent.Index  1     223.4 493759 3439.9
## + Rent.Index                      1     223.4 493759 3439.9
## - Consumer.Price.Index            1    8393.4 502376 3444.5
## - Groceries.Index                 1   24524.4 518506 3460.2
## - Restaurant.Price.Index          1   29664.9 523647 3465.1
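
The AIC column in the trace above can be reproduced by hand. For a linear model, stepAIC relies on extractAIC, which reports n * log(RSS / n) + 2k, where k is the number of estimated coefficients. The sketch below plugs in the RSS values printed for the final step. Here n = 497 is inferred from the 493 residual degrees of freedom plus 4 coefficients, and the RSS values are the rounded ones from the trace, so the match is approximate.

```r
n   <- 497      # 493 residual df + 4 coefficients (inferred from the summary)
rss <- 493982   # RSS of the final model, as printed in the trace (rounded)
k   <- 4        # intercept + 3 slopes

aic_final <- n * log(rss / n) + 2 * k

# Dropping Consumer.Price.Index leaves 3 coefficients and RSS = 502376
aic_drop_cpi <- n * log(502376 / n) + 2 * 3

print(round(c(final = aic_final, drop_cpi = aic_drop_cpi), 2))
```

Because the penalty is only 2 per coefficient, a variable stays in the model whenever dropping it raises n * log(RSS / n) by more than 2.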

summary(step_model)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Consumer.Price.Index + 
##     Groceries.Index + Restaurant.Price.Index, data = cpi_data[c(2, 
##     3, 4, 5, 6, 7)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.613 -21.007  -2.974  17.235 128.952 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             44.8935     5.5014   8.160 2.80e-15 ***
## Consumer.Price.Index    -1.3413     0.4634  -2.894  0.00397 ** 
## Groceries.Index          1.3379     0.2704   4.947 1.03e-06 ***
## Restaurant.Price.Index   1.1470     0.2108   5.441 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.65 on 493 degrees of freedom
## Multiple R-squared:  0.4627, Adjusted R-squared:  0.4594 
## F-statistic: 141.5 on 3 and 493 DF,  p-value: < 2.2e-16

I will produce a similar linear model step by step to demonstrate another method. Variables whose coefficients are not statistically significant (p-values above 0.05) will be removed from the model one at a time. After removing each variable, I will refit the model and determine if any other variables should be removed.

full_model <- lm(Local.Purchasing.Power.Index ~ ., data = cpi_data[c(2,3,4,5,6,7)])
summary(full_model)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ ., data = cpi_data[c(2, 
##     3, 4, 5, 6, 7)])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89.44 -20.96  -2.30  17.22 129.00 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     45.5438     5.6958   7.996 9.28e-15 ***
## Consumer.Price.Index           -19.2696   217.2637  -0.089    0.929    
## Rent.Index                     -16.5071   201.0479  -0.082    0.935    
## Consumer.Price.Plus.Rent.Index  34.4839   418.3289   0.082    0.934    
## Groceries.Index                  1.3192     0.2737   4.819 1.93e-06 ***
## Restaurant.Price.Index           1.1431     0.2115   5.405 1.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.71 on 491 degrees of freedom
## Multiple R-squared:  0.4629, Adjusted R-squared:  0.4575 
## F-statistic: 84.64 on 5 and 491 DF,  p-value: < 2.2e-16
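
The very large standard errors on the three price indices in the full model (roughly 200 to 400) are what one would expect if Consumer.Price.Plus.Rent.Index is close to a linear combination of Consumer.Price.Index and Rent.Index. The following sketch uses simulated data, not the CPI file, to show the effect. All variable names, weights, and noise levels are assumptions chosen for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 100, sd = 20)          # stand-in for a price index
x2 <- rnorm(n, mean = 100, sd = 20)          # stand-in for a rent index
x3 <- (x1 + x2) / 2 + rnorm(n, sd = 0.01)    # nearly an exact combination of x1 and x2
y  <- 50 - 1 * x1 + rnorm(n, sd = 10)

# Pull the standard error of one coefficient out of a fitted model
se <- function(model, term) coef(summary(model))[term, "Std. Error"]

se_full    <- se(lm(y ~ x1 + x2 + x3), "x1")  # all three collinear predictors
se_reduced <- se(lm(y ~ x1 + x2), "x1")       # redundant predictor removed
print(c(full = se_full, reduced = se_reduced))
```

Removing any one of the three nearly redundant predictors restores a usable standard error, which matches the pattern in the reduced models that follow.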

full_model_minus_consumer <- lm(Local.Purchasing.Power.Index ~ Rent.Index +
                                  Consumer.Price.Plus.Rent.Index + Groceries.Index + Restaurant.Price.Index,
                                data = cpi_data)
summary(full_model_minus_consumer)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Rent.Index + Consumer.Price.Plus.Rent.Index + 
##     Groceries.Index + Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.592 -20.949  -2.349  17.215 128.835 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     45.5632     5.6858   8.013 8.16e-15 ***
## Rent.Index                       1.3242     0.4640   2.854  0.00450 ** 
## Consumer.Price.Plus.Rent.Index  -2.6185     0.8963  -2.922  0.00364 ** 
## Groceries.Index                  1.3198     0.2734   4.828 1.85e-06 ***
## Restaurant.Price.Index           1.1440     0.2111   5.420 9.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.68 on 492 degrees of freedom
## Multiple R-squared:  0.4629, Adjusted R-squared:  0.4585 
## F-statistic:   106 on 4 and 492 DF,  p-value: < 2.2e-16

second_model_minus_consumer_plus_rent <- lm(Local.Purchasing.Power.Index ~ Rent.Index +
                                              Groceries.Index + Restaurant.Price.Index,
                                            data = cpi_data)
summary(second_model_minus_consumer_plus_rent)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Rent.Index + Groceries.Index + 
##     Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100.44  -21.66   -2.55   19.44  131.83 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            33.51749    3.94499   8.496 2.33e-16 ***
## Rent.Index              0.03101    0.14005   0.221    0.825    
## Groceries.Index         0.61555    0.12993   4.738 2.83e-06 ***
## Restaurant.Price.Index  0.60932    0.10596   5.751 1.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.92 on 493 degrees of freedom
## Multiple R-squared:  0.4536, Adjusted R-squared:  0.4503 
## F-statistic: 136.4 on 3 and 493 DF,  p-value: < 2.2e-16

second_model_minus_rent <- lm(Local.Purchasing.Power.Index ~ Consumer.Price.Plus.Rent.Index + 
                                Groceries.Index + Restaurant.Price.Index,
                                data = cpi_data)
summary(second_model_minus_rent)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Consumer.Price.Plus.Rent.Index + 
##     Groceries.Index + Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.312  -21.350   -2.031   19.610  131.745 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     33.4207     3.7995   8.796  < 2e-16 ***
## Consumer.Price.Plus.Rent.Index  -0.1778     0.2704  -0.658    0.511    
## Groceries.Index                  0.7140     0.1735   4.114 4.55e-05 ***
## Restaurant.Price.Index           0.6645     0.1287   5.163 3.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.91 on 493 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4507 
## F-statistic: 136.7 on 3 and 493 DF,  p-value: < 2.2e-16

only_groceries_and_restaurant <- lm(Local.Purchasing.Power.Index ~ Groceries.Index + Restaurant.Price.Index,
                                data = cpi_data)
summary(only_groceries_and_restaurant)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Groceries.Index + 
##     Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.862  -21.694   -2.695   19.232  131.862 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             33.2788     3.7912   8.778  < 2e-16 ***
## Groceries.Index          0.6288     0.1153   5.453 7.83e-08 ***
## Restaurant.Price.Index   0.6142     0.1035   5.935 5.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.89 on 494 degrees of freedom
## Multiple R-squared:  0.4535, Adjusted R-squared:  0.4513 
## F-statistic:   205 on 2 and 494 DF,  p-value: < 2.2e-16

# That did not seem to improve the model, so we go back to the full model and
# instead remove Consumer.Price.Plus.Rent.Index, another non-significant term
full_minus_consumer_plus_rent <- lm(Local.Purchasing.Power.Index ~ Consumer.Price.Index + Rent.Index +
                                  Groceries.Index + Restaurant.Price.Index,
                                data = cpi_data)
summary(full_minus_consumer_plus_rent)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Consumer.Price.Index + 
##     Rent.Index + Groceries.Index + Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.580 -20.945  -2.346  17.215 128.847 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            45.56261    5.68549   8.014 8.14e-15 ***
## Consumer.Price.Index   -1.36001    0.46549  -2.922  0.00364 ** 
## Rent.Index              0.06581    0.13950   0.472  0.63728    
## Groceries.Index         1.31980    0.27336   4.828 1.84e-06 ***
## Restaurant.Price.Index  1.14398    0.21106   5.420 9.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.68 on 492 degrees of freedom
## Multiple R-squared:  0.4629, Adjusted R-squared:  0.4585 
## F-statistic:   106 on 4 and 492 DF,  p-value: < 2.2e-16

minus_rent <- lm(Local.Purchasing.Power.Index ~ Consumer.Price.Index +
                                  Groceries.Index + Restaurant.Price.Index,
                                data = cpi_data)
summary(minus_rent)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Consumer.Price.Index + 
##     Groceries.Index + Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.613 -21.007  -2.974  17.235 128.952 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             44.8935     5.5014   8.160 2.80e-15 ***
## Consumer.Price.Index    -1.3413     0.4634  -2.894  0.00397 ** 
## Groceries.Index          1.3379     0.2704   4.947 1.03e-06 ***
## Restaurant.Price.Index   1.1470     0.2108   5.441 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.65 on 493 degrees of freedom
## Multiple R-squared:  0.4627, Adjusted R-squared:  0.4594 
## F-statistic: 141.5 on 3 and 493 DF,  p-value: < 2.2e-16

I will now compare the adjusted R squared values of all the models to determine the best one from my own step-by-step analysis. All the models have very similar, moderate adjusted R squared values (about 0.45 to 0.46). The best is the last model, with only Consumer.Price.Index, Groceries.Index, and Restaurant.Price.Index as variables. This last model matches the model that was created in the first example using the function stepAIC.

all_values <- c(summary(full_model)$adj.r.squared, summary(full_model_minus_consumer)$adj.r.squared,
                summary(second_model_minus_consumer_plus_rent)$adj.r.squared, summary(second_model_minus_rent)$adj.r.squared,
                summary(only_groceries_and_restaurant)$adj.r.squared, summary(full_minus_consumer_plus_rent)$adj.r.squared,
                summary(minus_rent)$adj.r.squared)
compare <- data.frame(modelName = c("full_model", "full_model_minus_consumer", "second_model_minus_consumer_plus_rent",
                                    "second_model_minus_rent", "only_groceries_and_restaurant", "full_minus_consumer_plus_rent",
                                    "minus_rent"),
                      adjustedRSquared = all_values)
compare

##                               modelName adjustedRSquared
## 1                            full_model        0.4574535
## 2             full_model_minus_consumer        0.4585475
## 3 second_model_minus_consumer_plus_rent        0.4502717
## 4               second_model_minus_rent        0.4506989
## 5         only_groceries_and_restaurant        0.4513300
## 6         full_minus_consumer_plus_rent        0.4585487
## 7                            minus_rent        0.4594025

Second Linear Model

The other model will use one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. The quadratic term is Rent.Index raised to the power of 0.01 (strictly a power transformation rather than a squared term, but it fills the role of the quadratic term here). I plotted it below to show the correlation with Local.Purchasing.Power.Index, which was about 0.6 with the transformed term instead of about 0.5 with the original Rent.Index. The dichotomous term is Country, coded as us_indicator, where 0 indicates the United States and 1 indicates all other countries. The interaction term is the product of those two terms. I included one more variable, Restaurant.Price.Index, because it gave the best adjusted R squared value when I tested different models. I did not include the other models so that the final model stands out, but I performed a step-by-step analysis like the one in the first part of this assignment to arrive at the model shown below.

The final model:

Local.Purchasing.Power.Index = 2606 - 2426(Rent.Index^0.01) - 3560(us_indicator) + 3407(Rent.Index^0.01:us_indicator) + 0.7545(Restaurant.Price.Index)
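
Because the exponent is so small, Rent.Index^0.01 behaves almost exactly like a logarithm: x^0.01 = exp(0.01 * log(x)), which is approximately 1 + 0.01 * log(x). That also explains why the transformed values stay close to 1 and why the fitted coefficients are in the thousands. A quick numerical check over an assumed range of index values (10 to 200, chosen for illustration rather than read from the data):

```r
x <- seq(10, 200, by = 1)          # assumed range of rent index values

power_term <- x^0.01
log_approx <- 1 + 0.01 * log(x)    # first-order approximation

max_gap <- max(abs(power_term - log_approx))
r       <- cor(power_term, log(x))
print(c(max_gap = max_gap, correlation = r))
```

Under this approximation, a coefficient of -2426 on Rent.Index^0.01 corresponds to roughly -24.26 on log(Rent.Index).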

# Original
plot(cpi_data$Rent.Index, cpi_data$Local.Purchasing.Power.Index)
new <- lm(Local.Purchasing.Power.Index ~ Rent.Index, cpi_data)
abline(new)

plot(cpi_data$Rent.Index^0.01, cpi_data$Local.Purchasing.Power.Index)
cpi_data$Quadratic <- cpi_data$Rent.Index^0.01
new <- lm(Local.Purchasing.Power.Index ~ Quadratic, cpi_data)
abline(new)

cpi_data$us_indicator <- ifelse(cpi_data$Country == "United States", 0, 1)

partTwoLM <- lm(Local.Purchasing.Power.Index ~ Quadratic + us_indicator + Quadratic:us_indicator +
                  Restaurant.Price.Index, data = cpi_data)
summary(partTwoLM)

## 
## Call:
## lm(formula = Local.Purchasing.Power.Index ~ Quadratic + us_indicator + 
##     Quadratic:us_indicator + Restaurant.Price.Index, data = cpi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.097 -20.189  -2.781  17.388 131.348 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.606e+03  8.671e+02   3.005  0.00279 ** 
## Quadratic              -2.426e+03  8.373e+02  -2.898  0.00393 ** 
## us_indicator           -3.560e+03  8.881e+02  -4.009 7.04e-05 ***
## Restaurant.Price.Index  7.545e-01  9.187e-02   8.213 1.92e-15 ***
## Quadratic:us_indicator  3.407e+03  8.569e+02   3.977 8.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.14 on 492 degrees of freedom
## Multiple R-squared:  0.5138, Adjusted R-squared:  0.5098 
## F-statistic:   130 on 4 and 492 DF,  p-value: < 2.2e-16

Graphs for Linear Model 1

library(car)

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

library(cowplot)

## Warning: package 'cowplot' was built under R version 4.3.3

# First model

par(mfrow=c(2, 2))
plot(cpi_data$Consumer.Price.Index, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "Consumer Price Index", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs Consumer Price Index")

plot(cpi_data$Groceries.Index, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "Groceries Index", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs Groceries Index")

plot(cpi_data$Restaurant.Price.Index, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "Restaurant Price Index", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs Restaurant Price Index")

crPlots(step_model)

# Diagnostic plots for the first model


par(mfrow=c(2, 2))
plot(step_model)

Graphs for Linear Model 2

library(ggplot2)

par(mfrow=c(2, 2))

plot(cpi_data$Quadratic, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "Rent Index ^ 0.01", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs Rent Index ^ 0.01")

plot(cpi_data$us_indicator, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "U.S. Indicator", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs U.S. Indicator")

plot(cpi_data$Restaurant.Price.Index, cpi_data$Local.Purchasing.Power.Index, 
     xlab = "Restaurant Price Index", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs Restaurant Price Index")

plot(cpi_data$Quadratic * cpi_data$us_indicator, cpi_data$Local.Purchasing.Power.Index,
     xlab = "Rent Index ^ 0.01 : U.S. Indicator", ylab = "Local Purchasing Power Index",
     main = "Scatter Plot of Local Purchasing Power Index vs. Rent Index ^ 0.01 : U.S. Indicator")

cpi_data$residuals <- resid(partTwoLM)

plot1 <- ggplot(cpi_data, aes(x = Quadratic, y = residuals)) +
  geom_point() +
  labs(x = "Rent Index ^ 0.01", y = "Residuals") +
  theme_minimal()

plot2 <- ggplot(cpi_data, aes(x = us_indicator, y = residuals)) +
  geom_point() +
  labs(x = "U.S. Indicator", y = "Residuals") +
  theme_minimal()

plot3 <- ggplot(cpi_data, aes(x = Restaurant.Price.Index, y = residuals)) +
  geom_point() +
  labs(x = "Restaurant Price Index", y = "Residuals") +
  theme_minimal()



plot4 <- ggplot(cpi_data, aes(x = Quadratic, y = resid(partTwoLM), color = us_indicator)) +
  geom_point() +
  labs(x = "Rent Index ^ 0.01 : U.S. Indicator", y = "Residuals") +
  theme_minimal()

plot_grid(plot1, plot2, plot3, plot4, ncol = 2)

par(mfrow=c(2, 2))
plot(partTwoLM)

Statistical Analysis for Both Models

Analysis

First, we will analyze linear model 1:

Local.Purchasing.Power.Index = 44.8935 - 1.3413(Consumer.Price.Index) + 1.3379(Groceries.Index) + 1.1470(Restaurant.Price.Index)

In the results of this model, the coefficients show how the corresponding indices can predict the local purchasing power index. For example, a location whose consumer price index is higher by 1 is predicted to have a local purchasing power index that is lower by 1.3413, holding the other two indices constant. Overall, a higher consumer price index is associated with a lower local purchasing power index, while higher groceries and restaurant price indices are associated with a higher local purchasing power index. The p-values for all the coefficients were significant, so each predictor has a statistically significant association with the response after accounting for the others.
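
As a worked example of reading the coefficients, the sketch below evaluates the fitted equation for a hypothetical city (the index values 80, 75, and 70 are made up) and then raises the consumer price index by one point while holding the other two indices fixed. The helper name predict_lpp is an assumption for illustration.

```r
# Coefficients copied from the summary of the first model
b <- c(intercept = 44.8935, cpi = -1.3413, groceries = 1.3379, restaurant = 1.1470)

predict_lpp <- function(cpi, groceries, restaurant) {
  unname(b["intercept"] + b["cpi"] * cpi + b["groceries"] * groceries +
           b["restaurant"] * restaurant)
}

base   <- predict_lpp(80, 75, 70)   # hypothetical city
bumped <- predict_lpp(81, 75, 70)   # same city, CPI one point higher
print(c(base = base, change = bumped - base))
```

The change between the two predictions equals the Consumer.Price.Index coefficient, which is what "holding the other variables constant" means in practice.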

Although the coefficients have significant p-values, the adjusted R squared value shows a moderate relationship. The value is 0.4594. A strong relationship would be above 0.7, and a weak relationship would be below 0.3. Since this falls in between, the linear model provides a moderate explanation of the data.
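
The adjusted R squared reported by summary() follows from the multiple R squared, the sample size n, and the number of predictors p: 1 - (1 - R^2)(n - 1) / (n - p - 1). A short check using the rounded values from the two model summaries above (n = 497 is inferred from the degrees of freedom, and adj_r2 is a helper name introduced here):

```r
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

model1 <- adj_r2(0.4627, n = 497, p = 3)   # three predictors
model2 <- adj_r2(0.5138, n = 497, p = 4)   # four terms, including the interaction
print(round(c(model1 = model1, model2 = model2), 4))
```

With almost 500 observations and only a few predictors, the adjustment is small, so the adjusted and multiple R squared values are nearly the same.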

The residuals in the graphs above are displayed in two different ways. The first residual plots show the residuals for each independent variable. These residual plots do not show consistent variability. The final array of plots for this first model shows more residual plots and a Q-Q plot. The residual plots appear to have more consistent variability, but the spread still varies enough that the constant-variance assumption of a linear model is not met. The Q-Q plot shows that the residuals are mostly normally distributed, so that assumption is met.
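
A visual read of residual spread can be backed up with a formal check. The sketch below runs a Breusch-Pagan-style test by hand in base R on simulated data (not the CPI data): it regresses the squared residuals on the predictor and compares n * R^2 to a chi-squared distribution. The data-generating numbers are assumptions picked so that the error variance grows with x.

```r
set.seed(42)
n <- 300
x <- runif(n, 20, 150)
y <- 40 + 0.5 * x + rnorm(n, sd = 0.2 * x)   # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared        # studentized (Koenker) form
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = p_value))
```

A small p-value indicates non-constant variance. On the fitted models in this assignment, a related score test is available as ncvTest() in the car package, which is already loaded for the plots.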

Next, we will analyze linear model 2:

Local.Purchasing.Power.Index = 2606 - 2426(Rent.Index^0.01) - 3560(us_indicator) + 3407(Rent.Index^0.01:us_indicator) + 0.7545(Restaurant.Price.Index)

In the results of this model, the coefficients show how the corresponding variables predict the local purchasing power index. Because of the interaction term, the effect of rent depends on location. For U.S. cities (us_indicator = 0), the coefficient on Rent.Index^0.01 is -2426, so the local purchasing power index decreases as the rent index increases. For cities outside the U.S. (us_indicator = 1), the slope is -2426 + 3407 = 981, so the local purchasing power index increases with the rent index. The us_indicator coefficient of -3560 is the difference between the two groups when Rent.Index^0.01 equals zero, which does not occur for any positive rent index, so on its own it does not show that cities outside the U.S. have lower purchasing power. The restaurant price index slightly increased the local purchasing power index: each additional point adds about 0.75, holding the other terms constant. The p-values for all the coefficients were significant, so each term contributes to the model after accounting for the others.
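
Substituting the two values of us_indicator into the fitted equation gives one equation per group, which makes the interaction easier to read. Here Q stands for Rent.Index^0.01 and R for Restaurant.Price.Index (shorthand introduced only for this derivation), and the coefficients are the rounded values from the summary:

```latex
\begin{aligned}
\text{U.S. } (\texttt{us\_indicator} = 0):\quad
  \widehat{\text{LPP}} &= 2606 - 2426\,Q + 0.7545\,R \\[4pt]
\text{Non-U.S. } (\texttt{us\_indicator} = 1):\quad
  \widehat{\text{LPP}} &= (2606 - 3560) + (-2426 + 3407)\,Q + 0.7545\,R \\
  &= -954 + 981\,Q + 0.7545\,R
\end{aligned}
```

The two groups share the same restaurant coefficient but have different intercepts and rent slopes with opposite signs.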

Although the coefficients have significant p-values, the adjusted R squared value shows a moderate relationship. The value is 0.5098. A strong relationship would be above 0.7, and a weak relationship would be below 0.3. Since this falls in between, the linear model provides a moderate explanation of the data. However, this model is slightly better than the first model since it has a slightly higher adjusted R squared value (0.5098 compared with 0.4594).

The residuals in the graphs above are displayed in two different ways. The first residual plots show the residuals for each independent variable. Rent Index ^ 0.01 appears to have fairly consistent variability. The other graphs are not as consistent. The residuals for the restaurant price index appear to be the best of those other three graphs. These residual plots do not show consistent variability overall, so this assumption of the linear model is not met. The final array of plots for this second model shows more residual plots and a Q-Q plot. The residual plots appear to have more consistent variability than the individual graphs, but I would still say that they do not appear to meet the assumptions of a linear model. The Q-Q plot shows that the residuals are mostly normally distributed, so that assumption is met. It does have a tail, but the tail appears to be caused by a few outliers and is not a serious departure from normality.

Conclusions

Overall, these models do not seem to be appropriate for the data. Neither of the models showed a strong relationship when applied to the data. Also, the assumptions of linear regression were not met for either model. The second model, which contained one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term, had a slightly stronger relationship, but it was not enough to be considered strong.

To Be Continued . . .

The U.S. data set can be used in the future for analysis. Different states can be compared to see if regional differences have any effect on the results. I included the us_data data frame at the beginning of this assignment.