HW1: Find a data set to which you can fit a multiple linear regression, and interpret the results.

Definition

Multiple linear regression (MLR) is a method for estimating how several independent variables together influence a single outcome. It fits a linear equation to the data to show how each variable contributes when the others are held constant.

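A multiple linear regression model with n predictors has the form:

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where Y is the outcome, X1, …, Xn are the predictors, β0 is the intercept, β1, …, βn are the regression coefficients, and ε is the error term.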
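To make the formula concrete, here is a minimal sketch that simulates data from a known two-predictor model and checks that lm() recovers the coefficients. The values 2, 1.5, and -0.8, the sample size, and the names x1, x2, and sim_fit are illustrative assumptions and have nothing to do with the population data used below.

```r
# Simulate data from a known model: Y = 2 + 1.5*X1 - 0.8*X2 + error
set.seed(42)                      # fixed seed so the run is reproducible
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, sd = 0.5)         # plays the role of the error term
y   <- 2 + 1.5 * x1 - 0.8 * x2 + eps

# Fit the multiple linear regression and look at the estimates
sim_fit <- lm(y ~ x1 + x2)
coef(sim_fit)                     # estimates should land close to 2, 1.5 and -0.8
```

Each estimated coefficient is the expected change in y for a one-unit change in that predictor with the other predictor held fixed.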

# Import the data set. read.csv() is part of base R, so no extra packages need to be loaded.

world_population <- read.csv("world_population.csv", check.names = FALSE)


# Renaming the selected columns in the data set so that the regression formula is easier to read
names(world_population)[names(world_population) == "Country/Territory"] <- "Country"
names(world_population)[names(world_population) == "2022 Population"] <- "Pop2022"
names(world_population)[names(world_population) == "Area (km²)"] <- "Area_km2"
names(world_population)[names(world_population) == "Density (per km²)"] <- "Density_per_km2"
names(world_population)[names(world_population) == "Growth Rate"] <- "Growth_Rate"
names(world_population)
##  [1] "Rank"                        "CCA3"                       
##  [3] "Country"                     "Capital"                    
##  [5] "Continent"                   "Pop2022"                    
##  [7] "2020 Population"             "2015 Population"            
##  [9] "2010 Population"             "2000 Population"            
## [11] "1990 Population"             "1980 Population"            
## [13] "1970 Population"             "Area_km2"                   
## [15] "Density_per_km2"             "Growth_Rate"                
## [17] "World Population Percentage"
# Preview the data, then keep only the variables used in the model and remove rows with missing values

head(world_population)
##   Rank CCA3        Country          Capital Continent  Pop2022 2020 Population
## 1   36  AFG    Afghanistan            Kabul      Asia 41128771        38972230
## 2  138  ALB        Albania           Tirana    Europe  2842321         2866849
## 3   34  DZA        Algeria          Algiers    Africa 44903225        43451666
## 4  213  ASM American Samoa        Pago Pago   Oceania    44273           46189
## 5  203  AND        Andorra Andorra la Vella    Europe    79824           77700
## 6   42  AGO         Angola           Luanda    Africa 35588987        33428485
##   2015 Population 2010 Population 2000 Population 1990 Population
## 1        33753499        28189672        19542982        10694796
## 2         2882481         2913399         3182021         3295066
## 3        39543154        35856344        30774621        25518074
## 4           51368           54849           58230           47818
## 5           71746           71519           66097           53569
## 6        28127721        23364185        16394062        11828638
##   1980 Population 1970 Population Area_km2 Density_per_km2 Growth_Rate
## 1        12486631        10752971   652230         63.0587      1.0257
## 2         2941651         2324731    28748         98.8702      0.9957
## 3        18739378        13795915  2381741         18.8531      1.0164
## 4           32886           27075      199        222.4774      0.9831
## 5           35611           19860      468        170.5641      1.0100
## 6         8330047         6029700  1246700         28.5466      1.0315
##   World Population Percentage
## 1                        0.52
## 2                        0.04
## 3                        0.56
## 4                        0.00
## 5                        0.00
## 6                        0.45
model_data <- na.omit(world_population[, c(
  "Country",
  "Continent",
  "Pop2022",
  "Area_km2",
  "Density_per_km2",
  "Growth_Rate"
)])

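# Treat Continent as a categorical predictor and use Africa as the reference level,
# so each continent coefficient is a comparison against Africa.
model_data$Continent <- factor(model_data$Continent)
model_data$Continent <- relevel(model_data$Continent, ref = "Africa")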

# Log-transform population and area because both are highly skewed.
model_data$log_Pop2022 <- log(model_data$Pop2022)
model_data$log_Area_km2 <- log(model_data$Area_km2)

mlr_model <- lm(
  log_Pop2022 ~ log_Area_km2 + Density_per_km2 + Growth_Rate + Continent,
  data = model_data
)

summary(mlr_model)
## 
## Call:
## lm(formula = log_Pop2022 ~ log_Area_km2 + Density_per_km2 + Growth_Rate + 
##     Continent, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6847 -0.5672  0.1473  0.7306  2.0665 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             6.742e+00  6.863e+00   0.982 0.327004    
## log_Area_km2            7.206e-01  2.886e-02  24.973  < 2e-16 ***
## Density_per_km2         1.896e-04  4.088e-05   4.637 5.98e-06 ***
## Growth_Rate             6.110e-01  6.750e+00   0.091 0.927960    
## ContinentAsia           7.533e-01  2.406e-01   3.130 0.001976 ** 
## ContinentEurope         9.719e-02  2.611e-01   0.372 0.710071    
## ContinentNorth America -2.419e-01  2.807e-01  -0.862 0.389610    
## ContinentOceania       -1.074e+00  3.240e-01  -3.315 0.001067 ** 
## ContinentSouth America -1.224e+00  3.596e-01  -3.405 0.000784 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.16 on 225 degrees of freedom
## Multiple R-squared:  0.8321, Adjusted R-squared:  0.8261 
## F-statistic: 139.4 on 8 and 225 DF,  p-value: < 2.2e-16
anova(mlr_model)
## Analysis of Variance Table
## 
## Response: log_Pop2022
##                  Df  Sum Sq Mean Sq   F value    Pr(>F)    
## log_Area_km2      1 1370.96 1370.96 1018.9886 < 2.2e-16 ***
## Density_per_km2   1   57.53   57.53   42.7610 4.103e-10 ***
## Growth_Rate       1    0.01    0.01    0.0058    0.9392    
## Continent         5   71.59   14.32   10.6421 3.430e-09 ***
## Residuals       225  302.72    1.35                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
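# Convert log-scale coefficients into approximate percentage effects.
# (exp(b) - 1) * 100 is the percent change in population for a one-unit increase in a predictor.
# It is meaningful for Density_per_km2, Growth_Rate, and the Continent terms.
# For log_Area_km2 the estimate itself is an elasticity, and the intercept row is not a percent effect.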
coef_table <- summary(mlr_model)$coefficients
percent_effect <- (exp(coef(mlr_model)) - 1) * 100
interpretation_table <- data.frame(
  Term = names(coef(mlr_model)),
  Estimate = coef(mlr_model),
  Percent_Effect = percent_effect,
  P_Value = coef_table[, "Pr(>|t|)"],
  row.names = NULL
)

print(interpretation_table)
##                     Term      Estimate Percent_Effect      P_Value
## 1            (Intercept)  6.7418483514   8.461251e+04 3.270042e-01
## 2           log_Area_km2  0.7206219728   1.055711e+02 8.538853e-67
## 3        Density_per_km2  0.0001895945   1.896125e-02 5.982233e-06
## 4            Growth_Rate  0.6109670574   8.422121e+01 9.279601e-01
## 5          ContinentAsia  0.7533191370   1.124038e+02 1.976384e-03
## 6        ContinentEurope  0.0971935585   1.020737e+01 7.100714e-01
## 7 ContinentNorth America -0.2419244333  -2.148845e+01 3.896101e-01
## 8       ContinentOceania -1.0742006188  -6.584293e+01 1.066903e-03
## 9 ContinentSouth America -1.2243572316  -7.060534e+01 7.840724e-04
# Basic diagnostic plots for checking linear regression assumptions.
par(mfrow = c(2, 2))
plot(mlr_model)

Interpretation

log_Area_km2 is strongly significant. A 1% increase in area is associated with about a 0.72% increase in 2022 population, holding other variables constant.

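Density_per_km2 is also significant. An increase of 100 people per km² is associated with about a 1.9% higher population, holding other variables constant. Because density is itself population divided by area, part of this association is mechanical and should be interpreted with caution.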

Growth_Rate is not statistically significant in this model (p ≈ 0.93), meaning it adds little explanatory power after accounting for area, density, and continent.

Compared with Africa (the reference continent):

Asia has significantly higher population levels, about 112% higher on average, holding other predictors constant.

Oceania and South America have significantly lower population levels, about 66% and 71% lower on average, respectively, holding other predictors constant.

Europe and North America are not significantly different from Africa in this model.
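The percentages quoted in this interpretation can be reproduced with a few lines of arithmetic. The sketch below hard-codes three estimates copied from the coefficient table above; the variable names (b_log_area, pct_area, and so on) are illustrative only.

```r
# Estimates copied from the coefficient table above
b_log_area <- 0.7206219728   # log_Area_km2 (log-log term)
b_density  <- 0.0001895945   # Density_per_km2 (log-level term)
b_asia     <- 0.7533191370   # ContinentAsia (dummy, relative to Africa)

# Log-log term: a 1% increase in area multiplies population by 1.01^b
pct_area <- (1.01^b_log_area - 1) * 100

# Log-level term: 100 more people per square km multiplies population by exp(100 * b)
pct_density_100 <- (exp(100 * b_density) - 1) * 100

# Dummy term: Asia relative to Africa multiplies population by exp(b)
pct_asia <- (exp(b_asia) - 1) * 100

# Roughly 0.72, 1.9 and 112 percent, matching the text
round(c(area_1pct = pct_area,
        density_plus_100 = pct_density_100,
        asia_vs_africa = pct_asia), 2)
```

The log-log coefficient is read directly as an elasticity, while the log-level and dummy coefficients need the exp() conversion before they can be read as percentages.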

HW2: Read about variable selection methods

Variable Selection

Variable selection is the process of identifying and choosing the most important independent variables (predictors) to include in a regression model. The goal is to retain variables that significantly contribute to explaining or predicting the dependent variable while excluding irrelevant or redundant variables.

Variable selection plays a crucial role in building effective statistical and machine learning models because it helps reduce overfitting, remove redundant predictors, and keep the model interpretable.

Common variable selection methods include:

  1. Forward Selection – Starts with no predictors and adds variables one at a time, at each step choosing the variable that most improves the model according to a criterion such as a p-value or AIC.
  2. Backward Elimination – Starts with all candidate variables and removes the least useful variable at each step, judged by the same kind of criterion.
  3. Stepwise Selection – Combines forward selection and backward elimination by adding and removing variables iteratively.
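  4. Regularization Methods – Techniques such as LASSO and Ridge Regression penalize large coefficients. LASSO can shrink some coefficients exactly to zero, which removes those variables from the model; Ridge shrinks coefficients toward zero but keeps every variable.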
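The first three methods can be sketched with base R's step() function. This is a minimal illustration on the built-in mtcars data set, not on the population data from HW1, and the object names (full_model, null_model, upper_scope, and the three fits) are assumptions made for the example. Note that step() compares models by AIC by default rather than by p-values.

```r
# mtcars ships with base R; mpg is the outcome, all other columns are candidates
full_model <- lm(mpg ~ ., data = mtcars)   # every candidate predictor
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only

# One-sided formula listing every candidate, used as the upper search limit
upper_scope <- reformulate(setdiff(names(mtcars), "mpg"))

# Backward elimination: start from the full model and drop terms
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start from the null model and add terms
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = upper_scope),
                    direction = "forward", trace = 0)

# Stepwise selection: terms may be added or dropped at every step
stepwise_fit <- step(null_model,
                     scope = list(lower = ~ 1, upper = upper_scope),
                     direction = "both", trace = 0)

# Compare which predictors each procedure kept
formula(backward_fit)
formula(forward_fit)
formula(stepwise_fit)
```

The three procedures do not have to agree on the final set of predictors, which is one reason stepwise results are usually treated as a starting point rather than a final answer.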

By selecting only the most relevant variables, researchers and analysts can develop regression models that are more accurate, robust, and easier to interpret while avoiding the problems associated with overfitting.
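For the regularization methods listed above, the following sketch shows how a ridge penalty shrinks coefficients, using the closed-form ridge solution in base R. The choice of mtcars, the four predictors, the penalty value of 10, and the helper ridge_coef() are all hypothetical choices for illustration; in practice a package such as glmnet is commonly used and the penalty is chosen by cross-validation.

```r
# Standardize the predictors so one penalty treats them evenly
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)   # centered outcome, so no intercept is needed

# Hypothetical helper: closed-form ridge estimate (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols_beta   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 is ordinary least squares
ridge_beta <- ridge_coef(X, y, lambda = 10)   # a positive lambda shrinks the estimates

# Side-by-side comparison of the two sets of coefficients
cbind(OLS = ols_beta, Ridge = ridge_beta)
```

The ridge coefficients are pulled toward zero overall but none of them becomes exactly zero, so ridge by itself does not drop variables the way LASSO can.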