Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(tidyverse)
We load data on used car sales.
my_url <- "https://raw.githubusercontent.com/geedoubledee/lilliput/main/CAR_DETAILS_FROM_CAR_DEKHO.csv"
car_dekho <- read.csv(my_url, stringsAsFactors = TRUE)
car_dekho_numeric <- car_dekho %>%
filter(owner != "Test Drive Car") %>%
rename_at("year", ~"years_old") %>%
mutate(years_old = 2021 - years_old) %>%
rename_at("owner", ~"owners")
car_dekho_numeric$seller_type <- as.factor(str_replace_all(
car_dekho_numeric$seller_type, "Trustmark Dealer", "Dealer"))
replacements <- c("First Owner" = "1", "Second Owner" = "2", "Third Owner" = "3",
"Fourth & Above Owner" = "4")
car_dekho_numeric$owners <- as.integer(str_replace_all(car_dekho_numeric$owners,
replacements))
remove <- c("name", "fuel", "transmission")
car_dekho_numeric <- car_dekho_numeric[, !colnames(car_dekho_numeric) %in% remove]
reorder <- c("selling_price", "years_old", "km_driven", "seller_type", "owners")
car_dekho_numeric <- car_dekho_numeric[, reorder]
We are interested in how selling_price (in Rupees) is related to the other variables in the data, so we look at pairwise scatterplots.
pairs(car_dekho_numeric, gap=0.5)
We can already tell visually that there is no distinctly linear relationship between selling_price and any other variables, so it is unlikely we can model this data well with a linear model, but we will continue. I’ve gotten to this step for a couple different data sets already, and none of them have fared any better.
Please also note that none of the variables in the data set are appropriate as quadratic terms, so I’ll load a different data set to fulfill that requirement later.
We set our p-value threshold for predictor inclusion at p = 0.1 and perform backward elimination.
car_dekho_numeric_lm_full <- lm(selling_price ~ years_old + km_driven +
seller_type + owners, data = car_dekho_numeric)
summary(car_dekho_numeric_lm_full)
##
## Call:
## lm(formula = selling_price ~ years_old + km_driven + seller_type +
## owners, data = car_dekho_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -666921 -232548 -91732 61873 8072217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.094e+06 2.221e+04 49.274 <2e-16 ***
## years_old -5.261e+04 2.235e+03 -23.540 <2e-16 ***
## km_driven 4.853e-03 1.892e-01 0.026 0.980
## seller_typeIndividual -2.262e+05 1.896e+04 -11.929 <2e-16 ***
## owners -3.478e+03 1.296e+04 -0.268 0.788
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 518700 on 4318 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1969
## F-statistic: 266 on 4 and 4318 DF, p-value: < 2.2e-16
Looking at the summary for our model, we see that our median is not near 0, reinforcing the fact that this will not be a good model. We also see that km_driven has the largest p-value and is therefore the least useful for our model. So it is the first predictor we will remove.
car_dekho_numeric_lm_2 <- update(car_dekho_numeric_lm_full, .~. - km_driven,
data = car_dekho_numeric)
summary(car_dekho_numeric_lm_2)
##
## Call:
## lm(formula = selling_price ~ years_old + seller_type + owners,
## data = car_dekho_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -667091 -232584 -91435 61999 8072064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1094311 21974 49.799 <2e-16 ***
## years_old -52588 2120 -24.811 <2e-16 ***
## seller_typeIndividual -226149 18881 -11.978 <2e-16 ***
## owners -3434 12843 -0.267 0.789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 518600 on 4319 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1971
## F-statistic: 354.7 on 3 and 4319 DF, p-value: < 2.2e-16
Removing that didn’t improve our model that much, as our Adjusted \(R^2\) value only increased slightly. Now, owners has the largest p-value and is therefore the least useful for our model. So it is the second predictor we will remove.
car_dekho_numeric_lm_3 <- update(car_dekho_numeric_lm_2, .~. - owners,
data = car_dekho_numeric)
summary(car_dekho_numeric_lm_3)
##
## Call:
## lm(formula = selling_price ~ years_old + seller_type, data = car_dekho_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -670729 -232043 -90843 61455 8072109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1092080 20328 53.72 <2e-16 ***
## years_old -52838 1902 -27.77 <2e-16 ***
## seller_typeIndividual -227184 18478 -12.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 518600 on 4320 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1973
## F-statistic: 532.2 on 2 and 4320 DF, p-value: < 2.2e-16
Again, removing that didn’t improve our model that much, as our Adjusted \(R^2\) value only increased slightly. But we are now left with only predictors below our p-value threshold of 0.1. The variable seller_type is a factor with Dealer as level 0 (base) and Individual as level 1, so that is our dichotomous variable. We will add seller_type vs. years_old as our dichotomous vs. quantitative interaction term for the final model.
car_dekho_numeric_lm_final <- lm(selling_price ~ years_old * seller_type,
data = car_dekho_numeric)
summary(car_dekho_numeric_lm_final)
##
## Call:
## lm(formula = selling_price ~ years_old * seller_type, data = car_dekho_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -848059 -222653 -88292 61157 8000150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1371865 36636 37.446 <2e-16 ***
## years_old -94403 4922 -19.180 <2e-16 ***
## seller_typeIndividual -566480 41384 -13.688 <2e-16 ***
## years_old:seller_typeIndividual 48705 5328 9.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 513700 on 4319 degrees of freedom
## Multiple R-squared: 0.2129, Adjusted R-squared: 0.2124
## F-statistic: 389.4 on 3 and 4319 DF, p-value: < 2.2e-16
The final model is therefore:
\(\hat y = 1371865 - 94403a - 566480b + 48705c\)
where \(a = years old\) and \(b = seller type\) and \(c = years old * seller type\)
par(mfrow=c(2,2))
plot(car_dekho_numeric_lm_final)
We can see by plotting the residuals vs. the fitted values that the residuals are not uniformly scattered about zero, and we can see from the qq-plot that the right tail deviates significantly from the normal distribution line. So this linear model was not particularly appropriate, as expected.
Now we look at data for hand and feet reaction times measured using the Nintendo Wii Balance Board by age and gender so we can see what we expect is an appropriate use of a quadratic term.
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/lilliput/main/reaction_time_and_aging_nintendo_wii_balance_board.csv"
reaction_time <- read.csv(my_url2, row.names = 1)
cols <- c("Sex", "Age", "Medicine", "Smoking", "PAL1", "PAL2", "RTH_D",
"RTH_ND", "RTF_D", "RTF_ND")
colnames(reaction_time) <- cols
remove <- c("Medicine", "Smoking", "PAL1", "PAL2")
reaction_time <- reaction_time[, !colnames(reaction_time) %in% remove]
replacements <- c("1" = "Male", "2" = "Female")
reaction_time$Sex <- as.factor(str_replace_all(reaction_time$Sex, replacements))
reaction_time$RTH_D <- as.numeric(gsub(",", "", reaction_time$RTH_D))
reaction_time$RTH_ND <- as.numeric(gsub(",", "", reaction_time$RTH_ND))
reaction_time$RTF_D <- as.numeric(gsub(",", "", reaction_time$RTF_D))
reaction_time$RTF_ND <- as.numeric(gsub(",", "", reaction_time$RTF_ND))
reaction_time_simplified <- reaction_time %>%
mutate(RTH_mean = (RTH_D + RTH_ND) / 2, RTF_mean = (RTF_D + RTF_ND) / 2)
remove <- c("RTH_D", "RTH_ND", "RTF_D", "RTF_ND")
reaction_time_simplified <- reaction_time_simplified[, !colnames(
reaction_time_simplified) %in% remove]
reorder <- c("RTH_mean", "RTF_mean", "Age", "Sex")
reaction_time_simplified <- reaction_time_simplified[, reorder]
Since the researchers measured dominant and non-dominant hands and feet separately, we take the mean for each observation’s hands and the mean for each observation’s feet to look at pairwise scatterplots of hand and feet reaction times against age and gender.
pairs(reaction_time_simplified, gap=0.5)
We see a curvilinear relationship between hand reaction times and age, so we will only model hand reaction times, and age squared will be our quadratic term. Gender will be our dichotomous term, and we’ll look at the interaction term between age squared and gender as well.
reaction_time_simplified$Age2 <- reaction_time_simplified$Age^2
reaction_time_simplified_lm_full <- lm(RTH_mean ~ Age + Age2 * Sex,
data = reaction_time_simplified)
summary(reaction_time_simplified_lm_full)
##
## Call:
## lm(formula = RTH_mean ~ Age + Age2 * Sex, data = reaction_time_simplified)
##
## Residuals:
## Min 1Q Median 3Q Max
## -357.15 -70.68 -16.23 52.65 1095.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 618.726981 54.896495 11.271 < 2e-16 ***
## Age -5.932816 2.165200 -2.740 0.00646 **
## Age2 0.107207 0.019891 5.390 1.3e-07 ***
## SexMale -35.480121 28.324338 -1.253 0.21118
## Age2:SexMale -0.007669 0.006770 -1.133 0.25814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.9 on 348 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4219, Adjusted R-squared: 0.4152
## F-statistic: 63.49 on 4 and 348 DF, p-value: < 2.2e-16
The median for our residuals is near 0, which is a good indicator this model might be useful. The p-value for Age squared is low, which indicates it is the best predictor in our model. Since we’re at the bare minimum of predictors for our purposes, we will leave in age, which also has a low p-value, as well as gender and the interaction term between age squared and gender, which both have large p-values. Gender is a factor with female as level 0 (base) and male as level 1. Our Adjusted \(R^2\) value indicates this model could explain 41.5% of variation in mean hand reaction times.
The model is thus:
\(\hat y = 618.726981 - 5.932816a + 0.107207a^2 - 35.480121b - 0.007669c\)
where \(a = age\) and \(b = gender\) and \(c = age * gender\)
par(mfrow=c(2,2))
plot(reaction_time_simplified_lm_full)
Plotting the residuals vs. the fitted values shows us the residuals aren’t really uniformly distributed about zero, though. And the qq-plot shows us the right tail deviates significantly from the normal distribution line. So this model would not be appropriate for fully explaining the relationship between mean hand reaction times and age, but we were at least able to look at a model that involved a quadratic term.