Using R, build a multiple regression model for data that interests you.
Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
import kaggle
import pandas as pd
kaggle.api.authenticate()
kaggle.api.dataset_download_files("debajyotipodder/co2-emission-by-vehicles", path = "./", unzip = True)
df = pd.read_csv("CO2 Emissions_Canada.csv")
##
## Call:
## lm(formula = `CO2 Emissions(g/km)` ~ Cylinders, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.548 -20.139 -1.139 18.861 150.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.9569 1.2181 82.88 <2e-16 ***
## Cylinders 26.6477 0.2063 129.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.41 on 7383 degrees of freedom
## Multiple R-squared: 0.6933, Adjusted R-squared: 0.6933
## F-statistic: 1.669e+04 on 1 and 7383 DF, p-value: < 2.2e-16
# distributions
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(df)) {
  hist(df[[col_name]], main = paste(col_name), xlab = "Value")
}
par(mfrow = c(1, 1))

| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Engine Size(L) | 7385 | 3.1600677 | 1.3541705 | 3.0 | 3.0236927 | 1.0 | 0.90 | 8.40 | 7.50 | 0.8090166 | 2.867314 | 0.0157579 |
| Cylinders | 7385 | 5.6150305 | 1.8283065 | 6.0 | 5.4144525 | 2.0 | 3.00 | 16.00 | 13.00 | 1.1101899 | 4.523330 | 0.0212752 |
| Fuel Consumption City (L/100 km) | 7385 | 12.5565335 | 3.5002741 | 12.1 | 12.2872060 | 2.2 | 4.20 | 30.60 | 26.40 | 0.8088404 | 4.194523 | 0.0407312 |
| Fuel Consumption Hwy (L/100 km) | 7385 | 9.0417062 | 2.2244564 | 8.7 | 8.8334913 | 1.3 | 4.00 | 20.60 | 16.60 | 1.0789975 | 5.006797 | 0.0258850 |
| Fuel Consumption Comb (L/100 km) | 7385 | 10.9750711 | 2.8925063 | 10.6 | 10.7375360 | 1.8 | 4.10 | 26.10 | 22.00 | 0.8931343 | 4.391820 | 0.0336588 |
| CO2 Emissions(g/km) | 7385 | 250.5846987 | 58.5126794 | 246.0 | 247.5640548 | 40.0 | 96.00 | 522.00 | 426.00 | 0.5259869 | 3.477664 | 0.6808865 |
| engine_size_squared | 7385 | 17.4345877 | 11.9266413 | 15.0 | 15.4922153 | 7.0 | 3.81 | 80.56 | 76.75 | 1.3880496 | 4.528622 | 0.1387851 |
| big_engine | 7385 | 0.4273527 | 0.4947277 | 0.0 | 0.4092063 | 0.0 | 0.00 | 1.00 | 1.00 | 0.2937057 | 1.086263 | 0.0057569 |
| feature_interaction | 7385 | 11.9961341 | 15.6607210 | 0.0 | 9.4542444 | 0.0 | 0.00 | 80.56 | 80.56 | 1.0111740 | 3.002941 | 0.1822370 |
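
The table lists three engineered columns (`engine_size_squared`, `big_engine`, `feature_interaction`), but the code that created them is not shown. As a minimal sketch of how a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction can be built in R, the following uses the built-in `mtcars` data. The column names `disp_squared`, `big_disp`, and `disp_interaction` and the median cutoff are assumptions for illustration only, not the construction used for the table above.

```r
# Illustrative only: built-in mtcars stands in for the CO2 data set.
cars <- mtcars

# Quadratic term: square of a quantitative predictor (displacement).
cars$disp_squared <- cars$disp^2

# Dichotomous term: 1 if displacement is above the median, else 0
# (the median cutoff is an assumption made for this sketch).
cars$big_disp <- as.integer(cars$disp > median(cars$disp))

# Dichotomous x quantitative interaction: product of the dummy and displacement.
cars$disp_interaction <- cars$big_disp * cars$disp

# Fit a model containing all three kinds of term.
fit <- lm(mpg ~ disp + disp_squared + big_disp + disp_interaction, data = cars)
summary(fit)
```

In this sketch the interaction column is zero for every row where the dummy is zero, so its coefficient only shifts the displacement slope within the "big" group.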
##
## Call:
## lm(formula = `CO2 Emissions(g/km)` ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.253 -6.794 1.968 10.718 70.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.7432 2.6875 15.532 < 2e-16 ***
## `Engine Size(L)` 5.4255 2.5647 2.115 0.03443 *
## Cylinders 5.7035 0.5259 10.845 < 2e-16 ***
## `Fuel Consumption City (L/100 km)` 1.0866 3.0307 0.359 0.71997
## `Fuel Consumption Hwy (L/100 km)` 1.5236 2.4994 0.610 0.54215
## `Fuel Consumption Comb (L/100 km)` 10.7175 5.4955 1.950 0.05119 .
## engine_size_squared 1.5277 0.5131 2.978 0.00292 **
## big_engine 9.9598 4.0773 2.443 0.01460 *
## feature_interaction -1.3558 0.2531 -5.357 8.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7376 degrees of freedom
## Multiple R-squared: 0.8824, Adjusted R-squared: 0.8823
## F-statistic: 6918 on 8 and 7376 DF, p-value: < 2.2e-16
# backwards elimination
back_model <- step(everything_model, direction = "backward", scope = formula(everything_model))
## Start: AIC=44312.94
## `CO2 Emissions(g/km)` ~ `Engine Size(L)` + Cylinders + `Fuel Consumption City (L/100 km)` +
## `Fuel Consumption Hwy (L/100 km)` + `Fuel Consumption Comb (L/100 km)` +
## engine_size_squared + big_engine + feature_interaction
##
## Df Sum of Sq RSS AIC
## - `Fuel Consumption City (L/100 km)` 1 52 2973303 44311
## - `Fuel Consumption Hwy (L/100 km)` 1 150 2973401 44311
## <none> 2973251 44313
## - `Fuel Consumption Comb (L/100 km)` 1 1533 2974784 44315
## - `Engine Size(L)` 1 1804 2975055 44315
## - big_engine 1 2405 2975656 44317
## - engine_size_squared 1 3574 2976825 44320
## - feature_interaction 1 11569 2984820 44340
## - Cylinders 1 47411 3020662 44428
##
## Step: AIC=44311.07
## `CO2 Emissions(g/km)` ~ `Engine Size(L)` + Cylinders + `Fuel Consumption Hwy (L/100 km)` +
## `Fuel Consumption Comb (L/100 km)` + engine_size_squared +
## big_engine + feature_interaction
##
## Df Sum of Sq RSS AIC
## - `Fuel Consumption Hwy (L/100 km)` 1 594 2973896 44311
## <none> 2973303 44311
## - `Engine Size(L)` 1 1788 2975091 44314
## - big_engine 1 2424 2975726 44315
## - engine_size_squared 1 3600 2976902 44318
## - feature_interaction 1 11614 2984916 44338
## - Cylinders 1 47413 3020715 44426
## - `Fuel Consumption Comb (L/100 km)` 1 303284 3276587 45026
##
## Step: AIC=44310.54
## `CO2 Emissions(g/km)` ~ `Engine Size(L)` + Cylinders + `Fuel Consumption Comb (L/100 km)` +
## engine_size_squared + big_engine + feature_interaction
##
## Df Sum of Sq RSS AIC
## <none> 2973896 44311
## - `Engine Size(L)` 1 1898 2975794 44313
## - big_engine 1 2352 2976249 44314
## - engine_size_squared 1 3452 2977349 44317
## - feature_interaction 1 11419 2985315 44337
## - Cylinders 1 46834 3020730 44424
## - `Fuel Consumption Comb (L/100 km)` 1 3454439 6428336 50001
##
## Call:
## lm(formula = `CO2 Emissions(g/km)` ~ `Engine Size(L)` + Cylinders +
## `Fuel Consumption Comb (L/100 km)` + engine_size_squared +
## big_engine + feature_interaction, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.364 -6.865 1.964 10.699 69.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1146 2.6711 15.767 < 2e-16 ***
## `Engine Size(L)` 5.5560 2.5605 2.170 0.03004 *
## Cylinders 5.6433 0.5235 10.779 < 2e-16 ***
## `Fuel Consumption Comb (L/100 km)` 13.2145 0.1427 92.575 < 2e-16 ***
## engine_size_squared 1.4987 0.5121 2.927 0.00344 **
## big_engine 9.8412 4.0740 2.416 0.01573 *
## feature_interaction -1.3454 0.2528 -5.323 1.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7378 degrees of freedom
## Multiple R-squared: 0.8824, Adjusted R-squared: 0.8823
## F-statistic: 9224 on 6 and 7378 DF, p-value: < 2.2e-16
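
To help with interpreting the coefficients, the backward-elimination fit above can be written out as a single equation. The coefficients are copied from the summary; the short variable labels are abbreviations of the column names introduced here for readability.

```latex
\widehat{\mathrm{CO2}} =
    42.1146
  + 5.5560 \,\mathrm{EngineSize}
  + 5.6433 \,\mathrm{Cylinders}
  + 13.2145 \,\mathrm{FuelComb}
  + 1.4987 \,\mathrm{engine\_size\_squared}
  + 9.8412 \,\mathrm{big\_engine}
  - 1.3454 \,\mathrm{feature\_interaction}
```

Read this way, each additional L/100 km of combined fuel consumption is associated with about 13.2 g/km more CO2 when the other terms are held fixed. If `engine_size_squared` and `feature_interaction` are derived from engine size, as their names suggest, the engine-size coefficient cannot be interpreted in isolation, because those terms change together with it.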
# forward selection
forward_model <- step(model, direction = "forward", scope = formula(everything_model))
## Start: AIC=51377.62
## `CO2 Emissions(g/km)` ~ Cylinders
##
## Df Sum of Sq RSS AIC
## + `Fuel Consumption Comb (L/100 km)` 1 4651718 3102036 44614
## + `Fuel Consumption City (L/100 km)` 1 4505272 3248482 44955
## + `Fuel Consumption Hwy (L/100 km)` 1 4292851 3460903 45423
## + `Engine Size(L)` 1 1123924 6629830 50223
## + engine_size_squared 1 580747 7173007 50805
## + feature_interaction 1 293384 7460369 51095
## + big_engine 1 181750 7572004 51204
## <none> 7753754 51378
##
## Step: AIC=44614.08
## `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)`
##
## Df Sum of Sq RSS AIC
## + `Engine Size(L)` 1 48008 3054027 44501
## + engine_size_squared 1 22896 3079139 44561
## + big_engine 1 7031 3095004 44599
## <none> 3102036 44614
## + feature_interaction 1 718 3101318 44614
## + `Fuel Consumption Hwy (L/100 km)` 1 133 3101903 44616
## + `Fuel Consumption City (L/100 km)` 1 96 3101940 44616
##
## Step: AIC=44500.89
## `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)` +
## `Engine Size(L)`
##
## Df Sum of Sq RSS AIC
## + feature_interaction 1 75637 2978390 44318
## + big_engine 1 36948 3017079 44413
## + engine_size_squared 1 7022 3047005 44486
## <none> 3054027 44501
## + `Fuel Consumption Hwy (L/100 km)` 1 290 3053737 44502
## + `Fuel Consumption City (L/100 km)` 1 223 3053805 44502
##
## Step: AIC=44317.69
## `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)` +
## `Engine Size(L)` + feature_interaction
##
## Df Sum of Sq RSS AIC
## + engine_size_squared 1 2141.94 2976249 44314
## + big_engine 1 1041.71 2977349 44317
## <none> 2978390 44318
## + `Fuel Consumption Hwy (L/100 km)` 1 336.92 2978054 44319
## + `Fuel Consumption City (L/100 km)` 1 256.37 2978134 44319
##
## Step: AIC=44314.38
## `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)` +
## `Engine Size(L)` + feature_interaction + engine_size_squared
##
## Df Sum of Sq RSS AIC
## + big_engine 1 2352.05 2973896 44311
## <none> 2976249 44314
## + `Fuel Consumption Hwy (L/100 km)` 1 522.13 2975726 44315
## + `Fuel Consumption City (L/100 km)` 1 421.52 2975827 44315
##
## Step: AIC=44310.54
## `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)` +
## `Engine Size(L)` + feature_interaction + engine_size_squared +
## big_engine
##
## Df Sum of Sq RSS AIC
## <none> 2973896 44311
## + `Fuel Consumption Hwy (L/100 km)` 1 593.93 2973303 44311
## + `Fuel Consumption City (L/100 km)` 1 495.95 2973401 44311
##
## Call:
## lm(formula = `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)` +
## `Engine Size(L)` + feature_interaction + engine_size_squared +
## big_engine, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.364 -6.865 1.964 10.699 69.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1146 2.6711 15.767 < 2e-16 ***
## Cylinders 5.6433 0.5235 10.779 < 2e-16 ***
## `Fuel Consumption Comb (L/100 km)` 13.2145 0.1427 92.575 < 2e-16 ***
## `Engine Size(L)` 5.5560 2.5605 2.170 0.03004 *
## feature_interaction -1.3454 0.2528 -5.323 1.05e-07 ***
## engine_size_squared 1.4987 0.5121 2.927 0.00344 **
## big_engine 9.8412 4.0740 2.416 0.01573 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 7378 degrees of freedom
## Multiple R-squared: 0.8824, Adjusted R-squared: 0.8823
## F-statistic: 9224 on 6 and 7378 DF, p-value: < 2.2e-16
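
The assignment asks for residual analysis, which the output here does not include. The following is a minimal sketch of the usual base-R checks, run on a small stand-in model fitted to the built-in `mtcars` data. The object names `fit`, `res`, and `fitted_vals` are assumptions for this example; the same calls would apply to `forward_model`.

```r
# Illustrative only: a small model on built-in mtcars stands in for forward_model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)
fitted_vals <- fitted(fit)

# Residuals vs fitted: curvature or a funnel shape suggests a poor linear fit
# or non-constant variance.
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality (accepts between 3 and 5000 observations).
shapiro.test(res)
```

With 7,385 rows, `shapiro.test` would need a random subsample of the residuals, so the two plots are likely the more practical check for this data set. `plot(forward_model, which = 1:2)` draws the built-in versions of the same two plots.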
# significant model
sig_model <- lm(`CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)`,
                data = df)
summary(sig_model)
##
## Call:
## lm(formula = `CO2 Emissions(g/km)` ~ Cylinders + `Fuel Consumption Comb (L/100 km)`,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.459 -6.546 2.023 10.976 75.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.8587 0.9370 47.88 <2e-16 ***
## Cylinders 9.5064 0.2087 45.54 <2e-16 ***
## `Fuel Consumption Comb (L/100 km)` 13.8812 0.1319 105.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.5 on 7382 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8773
## F-statistic: 2.639e+04 on 2 and 7382 DF, p-value: < 2.2e-16
model_names <- c("Simple", "Forward", "Backward", "Significant")
models <- list(model, forward_model, back_model, sig_model)
results <- data.frame(Model = character(length(model_names)), AIC = numeric(length(model_names)))
for (i in seq_along(model_names)) {
  # use a separate loop variable so the simple `model` object is not overwritten
  fit <- models[[i]]
  results[i, "Model"] <- model_names[i]
  results[i, "R_squared"] <- summary(fit)$r.squared
  results[i, "AIC"] <- AIC(fit)
}
print(results)
## Model AIC R_squared
## 1 Simple 72337.34 0.6932954
## 2 Forward 65270.26 0.8823656
## 3 Backward 65270.26 0.8823656
## 4 Significant 65573.80 0.8772970
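
The AIC values in this table (about 65270 for the selected model) are on a different scale from the ones printed by `step()` (about 44311). A minimal sketch of why, using a stand-in model on the built-in `mtcars` data: `step()` reports `extractAIC()`, which drops an additive constant that `AIC()` keeps. The names `aic_full`, `aic_step`, and `offset` are assumptions for this example.

```r
# Illustrative only: a small model on built-in mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

aic_full <- AIC(fit)             # based on the full Gaussian log-likelihood
aic_step <- extractAIC(fit)[2]   # the criterion that step() prints

# For an unweighted lm, the two differ by a constant that depends only on n,
# so they rank models fitted to the same data identically.
offset <- n * (1 + log(2 * pi)) + 2
all.equal(aic_full - aic_step, offset)
```

For n = 7385 the constant is about 20959.72, which matches the gap between 44310.54 from `step()` and 65270.26 in the table, so the two outputs agree on the model ranking.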