Question:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Answer:

library(tidyverse)

We load data on used car sales.

my_url <- "https://raw.githubusercontent.com/geedoubledee/lilliput/main/CAR_DETAILS_FROM_CAR_DEKHO.csv"
car_dekho <- read.csv(my_url, stringsAsFactors = TRUE)
# drop test-drive cars, convert model year to age (relative to 2021),
# and rename owner for clarity
car_dekho_numeric <- car_dekho %>%
    filter(owner != "Test Drive Car") %>%
    rename_at("year", ~"years_old") %>%
    mutate(years_old = 2021 - years_old) %>%
    rename_at("owner", ~"owners")

# collapse "Trustmark Dealer" into "Dealer" so seller_type is dichotomous
car_dekho_numeric$seller_type <- as.factor(str_replace_all(
    car_dekho_numeric$seller_type, "Trustmark Dealer", "Dealer"))
# recode the owner labels as integer counts
replacements <- c("First Owner" = "1", "Second Owner" = "2", "Third Owner" = "3",
                  "Fourth & Above Owner" = "4")
car_dekho_numeric$owners <- as.integer(str_replace_all(car_dekho_numeric$owners,
                                            replacements))
# drop unused columns and put the response, selling_price, first
remove <- c("name", "fuel", "transmission")
car_dekho_numeric <- car_dekho_numeric[, !colnames(car_dekho_numeric) %in% remove]
reorder <- c("selling_price", "years_old", "km_driven", "seller_type", "owners")
car_dekho_numeric <- car_dekho_numeric[, reorder]

We are interested in how selling_price (in Rupees) is related to the other variables in the data, so we look at pairwise scatterplots.

pairs(car_dekho_numeric, gap=0.5)

We can already tell visually that there is no distinctly linear relationship between selling_price and any of the other variables, so a linear model is unlikely to fit these data well, but we will continue. (I have reached this step with a couple of other data sets already, and none of them fared any better.)

Note also that none of the quantitative variables in this data set shows the curvature that would make it appropriate as a quadratic term, so I will load a different data set later to fulfill that requirement.

We set our p-value threshold for retaining predictors at p = 0.1 and perform backward elimination, removing the least significant predictor at each step.

car_dekho_numeric_lm_full <- lm(selling_price ~ years_old + km_driven + 
                                seller_type + owners, data = car_dekho_numeric)
summary(car_dekho_numeric_lm_full)
## 
## Call:
## lm(formula = selling_price ~ years_old + km_driven + seller_type + 
##     owners, data = car_dekho_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -666921 -232548  -91732   61873 8072217 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.094e+06  2.221e+04  49.274   <2e-16 ***
## years_old             -5.261e+04  2.235e+03 -23.540   <2e-16 ***
## km_driven              4.853e-03  1.892e-01   0.026    0.980    
## seller_typeIndividual -2.262e+05  1.896e+04 -11.929   <2e-16 ***
## owners                -3.478e+03  1.296e+04  -0.268    0.788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 518700 on 4318 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1969 
## F-statistic:   266 on 4 and 4318 DF,  p-value: < 2.2e-16

Looking at the summary for our model, we see that the median residual (-91,732) is far from 0 and the residual distribution is heavily right-skewed (the maximum residual is 8,072,217), reinforcing our suspicion that this will not be a good model. We also see that km_driven has the largest p-value and is therefore the least useful predictor for our model, so it is the first one we will remove.
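(As an aside, drop1() from base R's stats package reports these single-term deletion F-tests in one call, which can streamline picking the next predictor to remove; a minimal sketch:)

# F-tests for dropping each predictor from the full model;
# the term with the largest p-value is the next candidate for removal
drop1(car_dekho_numeric_lm_full, test = "F")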

car_dekho_numeric_lm_2 <- update(car_dekho_numeric_lm_full, .~. - km_driven,
                                 data = car_dekho_numeric)
summary(car_dekho_numeric_lm_2)
## 
## Call:
## lm(formula = selling_price ~ years_old + seller_type + owners, 
##     data = car_dekho_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -667091 -232584  -91435   61999 8072064 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1094311      21974  49.799   <2e-16 ***
## years_old               -52588       2120 -24.811   <2e-16 ***
## seller_typeIndividual  -226149      18881 -11.978   <2e-16 ***
## owners                   -3434      12843  -0.267    0.789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 518600 on 4319 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1971 
## F-statistic: 354.7 on 3 and 4319 DF,  p-value: < 2.2e-16

Removing km_driven barely changed the fit: the Adjusted \(R^2\) value increased only slightly. Now owners has the largest p-value and is therefore the least useful predictor, so it is the second one we will remove.

car_dekho_numeric_lm_3 <- update(car_dekho_numeric_lm_2, .~. - owners,
                                 data = car_dekho_numeric)
summary(car_dekho_numeric_lm_3)
## 
## Call:
## lm(formula = selling_price ~ years_old + seller_type, data = car_dekho_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -670729 -232043  -90843   61455 8072109 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1092080      20328   53.72   <2e-16 ***
## years_old               -52838       1902  -27.77   <2e-16 ***
## seller_typeIndividual  -227184      18478  -12.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 518600 on 4320 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1973 
## F-statistic: 532.2 on 2 and 4320 DF,  p-value: < 2.2e-16

Again, removing owners barely changed the fit: the Adjusted \(R^2\) value increased only slightly. But we are now left with only predictors below our p-value threshold of 0.1. The variable seller_type is a factor with Dealer as the base level (coded 0) and Individual coded as 1, so it is our dichotomous term. We will add the interaction between seller_type and years_old as our dichotomous vs. quantitative interaction term in the final model.
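We can confirm which level R treats as the baseline by inspecting the factor directly (the first level listed is the reference level):

levels(car_dekho_numeric$seller_type)    # the first level, Dealer, is the baseline
contrasts(car_dekho_numeric$seller_type) # Individual is coded 1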

car_dekho_numeric_lm_final <- lm(selling_price ~ years_old * seller_type,
                                 data = car_dekho_numeric)
summary(car_dekho_numeric_lm_final)
## 
## Call:
## lm(formula = selling_price ~ years_old * seller_type, data = car_dekho_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -848059 -222653  -88292   61157 8000150 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1371865      36636  37.446   <2e-16 ***
## years_old                         -94403       4922 -19.180   <2e-16 ***
## seller_typeIndividual            -566480      41384 -13.688   <2e-16 ***
## years_old:seller_typeIndividual    48705       5328   9.142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 513700 on 4319 degrees of freedom
## Multiple R-squared:  0.2129, Adjusted R-squared:  0.2124 
## F-statistic: 389.4 on 3 and 4319 DF,  p-value: < 2.2e-16

The final model is therefore:

\(\hat y = 1371865 - 94403x_1 - 566480x_2 + 48705x_1x_2\)

where \(x_1\) is years_old and \(x_2\) is an indicator that equals 1 when seller_type is Individual and 0 when it is Dealer.

Interpreting the coefficients: the intercept of 1,371,865 Rupees is the predicted selling price of a zero-year-old car sold by a dealer. For dealer cars, each additional year of age lowers the predicted price by 94,403 Rupees. At age zero, a car sold by an individual is predicted to sell for 566,480 Rupees less than one sold by a dealer; the interaction coefficient then narrows that gap by 48,705 Rupees per year of age, which is equivalent to saying individual-seller cars lose only 94403 - 48705 = 45,698 Rupees of predicted price per year.
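As a quick check on these interpretations, we can compare predicted prices for a hypothetical five-year-old car from each seller type (the age of 5 is an arbitrary illustrative value):

# predicted selling price for a hypothetical 5-year-old car,
# once sold by a dealer and once by an individual
new_cars <- data.frame(years_old = c(5, 5),
                       seller_type = factor(c("Dealer", "Individual"),
                                            levels = levels(car_dekho_numeric$seller_type)))
predict(car_dekho_numeric_lm_final, newdata = new_cars)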

par(mfrow=c(2,2))
plot(car_dekho_numeric_lm_final)

The plot of residuals vs. fitted values shows that the residuals are not uniformly scattered about zero, and the normal Q-Q plot shows the right tail deviating sharply from the reference line. So this linear model was not particularly appropriate, as expected.
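A histogram of the residuals shows the same right skew without the diagnostic plots; a minimal sketch:

# the long right tail from the Q-Q plot appears as heavy right skew here
hist(resid(car_dekho_numeric_lm_final), breaks = 50,
     main = "Residuals of final car price model", xlab = "Residual (Rupees)")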

Now we look at data on hand and foot reaction times, measured using the Nintendo Wii Balance Board, by age and gender, so we can examine what we expect to be an appropriate use of a quadratic term.

my_url2 <- "https://raw.githubusercontent.com/geedoubledee/lilliput/main/reaction_time_and_aging_nintendo_wii_balance_board.csv"
reaction_time <- read.csv(my_url2, row.names = 1)
cols <- c("Sex", "Age", "Medicine", "Smoking", "PAL1", "PAL2", "RTH_D",
              "RTH_ND", "RTF_D", "RTF_ND")
colnames(reaction_time) <- cols
remove <- c("Medicine", "Smoking", "PAL1", "PAL2")
reaction_time <- reaction_time[, !colnames(reaction_time) %in% remove]
replacements <- c("1" = "Male", "2" = "Female")
reaction_time$Sex <- as.factor(str_replace_all(reaction_time$Sex, replacements))
reaction_time$RTH_D <- as.numeric(gsub(",", "", reaction_time$RTH_D))
reaction_time$RTH_ND <- as.numeric(gsub(",", "", reaction_time$RTH_ND))
reaction_time$RTF_D <- as.numeric(gsub(",", "", reaction_time$RTF_D))
reaction_time$RTF_ND <- as.numeric(gsub(",", "", reaction_time$RTF_ND))
reaction_time_simplified <- reaction_time %>%
    mutate(RTH_mean = (RTH_D + RTH_ND) / 2, RTF_mean = (RTF_D + RTF_ND) / 2)
remove <- c("RTH_D", "RTH_ND", "RTF_D", "RTF_ND")
reaction_time_simplified <- reaction_time_simplified[, !colnames(
    reaction_time_simplified) %in% remove]
reorder <- c("RTH_mean", "RTF_mean", "Age", "Sex")
reaction_time_simplified <- reaction_time_simplified[, reorder]

Since the researchers measured dominant and non-dominant hands and feet separately, we take the mean of each observation's two hand measurements and the mean of its two foot measurements, then look at pairwise scatterplots of hand and foot reaction times against age and gender.

pairs(reaction_time_simplified, gap=0.5)

We see a curvilinear relationship between hand reaction times and age, so we will model only hand reaction times, with age squared as our quadratic term. Gender will be our dichotomous term, and we will include the interaction between age squared and gender as well.
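Below we create an explicit Age2 column; an equivalent and arguably more idiomatic alternative keeps the squaring inside the model formula with I(), which avoids the helper column:

# equivalent specification: I() makes ^ mean arithmetic squaring
# rather than formula crossing syntax
lm(RTH_mean ~ Age + I(Age^2) * Sex, data = reaction_time_simplified)

With the explicit column, the coefficient in the summary is simply labeled Age2.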

reaction_time_simplified$Age2 <- reaction_time_simplified$Age^2
reaction_time_simplified_lm_full <- lm(RTH_mean ~ Age + Age2 * Sex,
                                       data = reaction_time_simplified)
summary(reaction_time_simplified_lm_full)
## 
## Call:
## lm(formula = RTH_mean ~ Age + Age2 * Sex, data = reaction_time_simplified)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -357.15  -70.68  -16.23   52.65 1095.65 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  618.726981  54.896495  11.271  < 2e-16 ***
## Age           -5.932816   2.165200  -2.740  0.00646 ** 
## Age2           0.107207   0.019891   5.390  1.3e-07 ***
## SexMale      -35.480121  28.324338  -1.253  0.21118    
## Age2:SexMale  -0.007669   0.006770  -1.133  0.25814    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.9 on 348 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4219, Adjusted R-squared:  0.4152 
## F-statistic: 63.49 on 4 and 348 DF,  p-value: < 2.2e-16

The median residual (-16.23) is near 0, a good indication that this model might be useful. The p-value for Age squared is very low, indicating it is the strongest predictor in the model. Since we are at the bare minimum of terms required here, we keep Age, which also has a low p-value, as well as Sex and the interaction between Age squared and Sex, even though both have large p-values. Sex is a factor with Female as the base level (coded 0) and Male coded as 1. The Adjusted \(R^2\) value indicates this model explains about 41.5% of the variation in mean hand reaction times.

The model is thus:

\(\hat y = 618.726981 - 5.932816x_1 + 0.107207x_1^2 - 35.480121x_2 - 0.007669x_1^2x_2\)

where \(x_1\) is Age and \(x_2\) is an indicator that equals 1 when Sex is Male and 0 when it is Female. (Note that the interaction multiplies the male indicator by Age squared, not Age, since we specified Age2 * Sex.)

Interpreting the coefficients: for females, the predicted mean hand reaction time is 618.73 at age 0 (an extrapolation well outside the observed ages), initially decreases by about 5.93 per year of age, and, because of the positive coefficient on \(x_1^2\), curves back upward, reaching its minimum near age \(5.932816 / (2 \times 0.107207) \approx 28\). The male indicator shifts predicted reaction times down by 35.48 at every age, and the interaction coefficient slightly flattens the upward curvature for males, though neither of those last two effects is statistically significant.
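As with the car model, we can illustrate the fitted surface with predictions for a few hypothetical people, here a 30-year-old and a 70-year-old of each sex (the ages are arbitrary illustrative values):

# predicted mean hand reaction times at two illustrative ages for each sex;
# Age2 must be supplied explicitly because it is a separate column in the model
new_obs <- expand.grid(Age = c(30, 70),
                       Sex = factor(c("Female", "Male"),
                                    levels = levels(reaction_time_simplified$Sex)))
new_obs$Age2 <- new_obs$Age^2
predict(reaction_time_simplified_lm_full, newdata = new_obs)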

par(mfrow=c(2,2))
plot(reaction_time_simplified_lm_full)

The plot of residuals vs. fitted values shows that these residuals are still not uniformly scattered about zero, though, and the normal Q-Q plot again shows the right tail deviating sharply from the reference line. So this model would not be appropriate for fully explaining the relationship between mean hand reaction times and age, but it at least demonstrates a sensible use of a quadratic term.