HW1: Find a data set to which you can fit a multiple linear regression, and interpret the results.
Multiple linear regression (MLR) is a method for estimating how several independent factors together influence a single outcome. It fits a straight-line equation to data points to reveal how each variable contributes when the others are held steady.
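A multiple linear regression model has the form:
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
where Y is the outcome, X1, …, Xn are the predictors, β0 is the intercept, β1, …, βn are the regression coefficients, and ε is the random error term.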
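To make the formula concrete, the sketch below simulates data from a known two-predictor version of the equation and checks that `lm()` recovers the coefficients. The names `x1`, `x2`, and `sim_fit`, and the chosen true coefficients, are illustrative assumptions and are not part of the assignment data.

```r
# Simulate data from Y = 2 + 1.5*X1 - 0.8*X2 + error (values chosen for illustration)
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 0.5)

# Fit the multiple linear regression and compare the estimates with the true values
sim_fit   <- lm(y ~ x1 + x2)
estimates <- coef(sim_fit)
print(round(estimates, 2))
```

With this much data and modest noise, the printed estimates should land close to 2, 1.5, and -0.8.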
# First, import the data set; read.csv() is part of base R, so no package needs to be loaded
world_population <- read.csv("world_population.csv", check.names = FALSE)
# Renaming the selected columns in the data set so that the regression formula is easier to read
names(world_population)[names(world_population) == "Country/Territory"] <- "Country"
names(world_population)[names(world_population) == "2022 Population"] <- "Pop2022"
names(world_population)[names(world_population) == "Area (km²)"] <- "Area_km2"
names(world_population)[names(world_population) == "Density (per km²)"] <- "Density_per_km2"
names(world_population)[names(world_population) == "Growth Rate"] <- "Growth_Rate"
names(world_population)
## [1] "Rank" "CCA3"
## [3] "Country" "Capital"
## [5] "Continent" "Pop2022"
## [7] "2020 Population" "2015 Population"
## [9] "2010 Population" "2000 Population"
## [11] "1990 Population" "1980 Population"
## [13] "1970 Population" "Area_km2"
## [15] "Density_per_km2" "Growth_Rate"
## [17] "World Population Percentage"
# Preview the first rows of the data before selecting the model variables
head(world_population)
## Rank CCA3 Country Capital Continent Pop2022 2020 Population
## 1 36 AFG Afghanistan Kabul Asia 41128771 38972230
## 2 138 ALB Albania Tirana Europe 2842321 2866849
## 3 34 DZA Algeria Algiers Africa 44903225 43451666
## 4 213 ASM American Samoa Pago Pago Oceania 44273 46189
## 5 203 AND Andorra Andorra la Vella Europe 79824 77700
## 6 42 AGO Angola Luanda Africa 35588987 33428485
## 2015 Population 2010 Population 2000 Population 1990 Population
## 1 33753499 28189672 19542982 10694796
## 2 2882481 2913399 3182021 3295066
## 3 39543154 35856344 30774621 25518074
## 4 51368 54849 58230 47818
## 5 71746 71519 66097 53569
## 6 28127721 23364185 16394062 11828638
## 1980 Population 1970 Population Area_km2 Density_per_km2 Growth_Rate
## 1 12486631 10752971 652230 63.0587 1.0257
## 2 2941651 2324731 28748 98.8702 0.9957
## 3 18739378 13795915 2381741 18.8531 1.0164
## 4 32886 27075 199 222.4774 0.9831
## 5 35611 19860 468 170.5641 1.0100
## 6 8330047 6029700 1246700 28.5466 1.0315
## World Population Percentage
## 1 0.52
## 2 0.04
## 3 0.56
## 4 0.00
## 5 0.00
## 6 0.45
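# Keep only the variables used in the model and drop rows with missing values
model_data <- na.omit(world_population[, c(
  "Country",
  "Continent",
  "Pop2022",
  "Area_km2",
  "Density_per_km2",
  "Growth_Rate"
)])
# Treat Continent as a categorical predictor, with Africa as the reference level
model_data$Continent <- factor(model_data$Continent)
model_data$Continent <- relevel(model_data$Continent, ref = "Africa")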
# Log-transform population and area because both are highly skewed.
model_data$log_Pop2022 <- log(model_data$Pop2022)
model_data$log_Area_km2 <- log(model_data$Area_km2)
mlr_model <- lm(
  log_Pop2022 ~ log_Area_km2 + Density_per_km2 + Growth_Rate + Continent,
  data = model_data
)
summary(mlr_model)
##
## Call:
## lm(formula = log_Pop2022 ~ log_Area_km2 + Density_per_km2 + Growth_Rate +
## Continent, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6847 -0.5672 0.1473 0.7306 2.0665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.742e+00 6.863e+00 0.982 0.327004
## log_Area_km2 7.206e-01 2.886e-02 24.973 < 2e-16 ***
## Density_per_km2 1.896e-04 4.088e-05 4.637 5.98e-06 ***
## Growth_Rate 6.110e-01 6.750e+00 0.091 0.927960
## ContinentAsia 7.533e-01 2.406e-01 3.130 0.001976 **
## ContinentEurope 9.719e-02 2.611e-01 0.372 0.710071
## ContinentNorth America -2.419e-01 2.807e-01 -0.862 0.389610
## ContinentOceania -1.074e+00 3.240e-01 -3.315 0.001067 **
## ContinentSouth America -1.224e+00 3.596e-01 -3.405 0.000784 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.16 on 225 degrees of freedom
## Multiple R-squared: 0.8321, Adjusted R-squared: 0.8261
## F-statistic: 139.4 on 8 and 225 DF, p-value: < 2.2e-16
anova(mlr_model)
## Analysis of Variance Table
##
## Response: log_Pop2022
## Df Sum Sq Mean Sq F value Pr(>F)
## log_Area_km2 1 1370.96 1370.96 1018.9886 < 2.2e-16 ***
## Density_per_km2 1 57.53 57.53 42.7610 4.103e-10 ***
## Growth_Rate 1 0.01 0.01 0.0058 0.9392
## Continent 5 71.59 14.32 10.6421 3.430e-09 ***
## Residuals 225 302.72 1.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
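Note that `anova()` on a single `lm` object reports sequential sums of squares, so each row depends on the order in which terms enter the formula. In the table above, Growth_Rate is tested after area and density but before Continent, which is why its F value (0.0058) is not the square of its t value (0.091) from `summary()`. A minimal sketch of the alternative, comparing two nested models so that one term is tested after all the others, is shown below. It uses the built-in `mtcars` data only so that the example is self-contained; the model names are hypothetical.

```r
# Built-in data set used purely for illustration
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Partial F-test: does qsec improve the fit once wt and hp are already included?
nested_test <- anova(reduced, full)
print(nested_test)

# For a single added numeric term, this F statistic equals the squared t value
# reported for that term in summary(full)
f_stat <- nested_test$F[2]
t_stat <- summary(full)$coefficients["qsec", "t value"]
print(c(F = f_stat, t_squared = t_stat^2))
```

The same pattern could be applied to the population model by fitting it once with and once without `Growth_Rate`.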
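# Convert log-scale coefficients into percentage effects per one-unit increase in each predictor.
# Note: for log_Area_km2 a one-unit increase means multiplying area by e (about 2.72),
# so that coefficient is better read as an elasticity; the intercept row is not meaningful.
coef_table <- summary(mlr_model)$coefficients
percent_effect <- (exp(coef(mlr_model)) - 1) * 100
interpretation_table <- data.frame(
  Term = names(coef(mlr_model)),
  Estimate = coef(mlr_model),
  Percent_Effect = percent_effect,
  P_Value = coef_table[, "Pr(>|t|)"],
  row.names = NULL
)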
print(interpretation_table)
## Term Estimate Percent_Effect P_Value
## 1 (Intercept) 6.7418483514 8.461251e+04 3.270042e-01
## 2 log_Area_km2 0.7206219728 1.055711e+02 8.538853e-67
## 3 Density_per_km2 0.0001895945 1.896125e-02 5.982233e-06
## 4 Growth_Rate 0.6109670574 8.422121e+01 9.279601e-01
## 5 ContinentAsia 0.7533191370 1.124038e+02 1.976384e-03
## 6 ContinentEurope 0.0971935585 1.020737e+01 7.100714e-01
## 7 ContinentNorth America -0.2419244333 -2.148845e+01 3.896101e-01
## 8 ContinentOceania -1.0742006188 -6.584293e+01 1.066903e-03
## 9 ContinentSouth America -1.2243572316 -7.060534e+01 7.840724e-04
# Basic diagnostic plots for checking linear regression assumptions.
par(mfrow = c(2, 2))
plot(mlr_model)
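The four plots are read by eye, but the same quantities can also be inspected numerically. In the summary above, the minimum residual of -6.68 against a residual standard error of 1.16 suggests at least one observation may be worth checking. The sketch below uses the built-in `mtcars` data as a stand-in and flags observations with large Cook's distance; the 4/n cutoff is a common rule of thumb rather than a formal test, and `diag_fit` is a hypothetical name.

```r
# Stand-in model on built-in data, used only to keep the example self-contained
diag_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance measures how much each observation influences the fitted coefficients
cooks  <- cooks.distance(diag_fit)
cutoff <- 4 / nrow(mtcars)

# Rows whose influence exceeds the rule-of-thumb cutoff
influential <- names(cooks)[cooks > cutoff]
print(influential)

# Standardized residuals beyond about +/- 3 suggest possible outliers
std_resid <- rstandard(diag_fit)
print(std_resid[abs(std_resid) > 3])
```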
log_Area_km2 is strongly significant. A 1% increase in area is associated with about a 0.72% increase in 2022 population, holding other variables constant.
Density_per_km2 is also significant. An increase of 100 people per km² is associated with about a 1.9% higher population, holding other variables constant.
Growth_Rate is not statistically significant in this model (p = 0.93), meaning it adds little explanatory power after accounting for area, density, and continent.
Compared with Africa:
Asia has significantly higher population levels, about 112% higher on average, holding other predictors constant.
Oceania and South America have significantly lower population levels, about 66% and 71% lower respectively.
Europe and North America are not significantly different from Africa in this model.
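The percentages quoted in this interpretation follow from the coefficient estimates by simple arithmetic. The sketch below reproduces three of them, with the estimates copied by hand from the model summary; the variable names are illustrative assumptions.

```r
# Coefficient estimates copied from the model summary
b_log_area <- 0.7206219728
b_density  <- 0.0001895945
b_asia     <- 0.7533191370

# Log-log term: percent change in population for a 1% increase in area
pct_area <- (1.01^b_log_area - 1) * 100

# Log-level term: percent change for 100 more people per square km
pct_density <- (exp(b_density * 100) - 1) * 100

# Dummy variable: percent difference between Asia and the reference level (Africa)
pct_asia <- (exp(b_asia) - 1) * 100

print(round(c(area = pct_area, density = pct_density, asia = pct_asia), 2))
```

The three values come out at roughly 0.72, 1.91, and 112.4, matching the figures used in the interpretation.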
HW2: Read about variable selection methods
Variable Selection
Variable selection is the process of identifying and choosing the most important independent variables (predictors) to include in a regression model. The goal is to retain variables that significantly contribute to explaining or predicting the dependent variable while excluding irrelevant or redundant variables.
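Variable selection plays a crucial role in building effective statistical and machine learning models because it helps reduce overfitting, remove irrelevant or redundant predictors, and keep the model simple enough to interpret.
Common variable selection methods include forward selection, backward elimination, stepwise selection, best-subset selection, and regularization methods such as the lasso.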
By selecting only the most relevant variables, researchers and analysts can develop regression models that are more accurate, robust, and easier to interpret while avoiding the problems associated with overfitting.
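As one concrete illustration, backward elimination by AIC can be run in base R with `step()`. The sketch below is a minimal example under stated assumptions: it uses the built-in `mtcars` data rather than the world population data, and the names `full_model` and `selected_model` are hypothetical.

```r
# Start from a model with every available predictor in the built-in mtcars data
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC the most
selected_model <- step(full_model, direction = "backward", trace = 0)

# Compare the retained formula and the AIC before and after selection
print(formula(selected_model))
print(c(full = AIC(full_model), selected = AIC(selected_model)))
```

P-values reported for a model chosen this way tend to be optimistic, because the same data were used both to select the variables and to test them.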