library(readr)
library(ggplot2)
countries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Structure check
str(countries)
## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : num [1:217] 521 5254 4279 NA 42030 ...
## $ Rural : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num [1:217] 3.72 4.08 13.81 NA NA ...
## $ Health : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num [1:217] 67.4 123.7 111 NA 104.4 ...
## $ HIV : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : num [1:217] NA 808 1328 NA NA ...
## $ Electricity : num [1:217] NA 2309 1363 NA NA ...
## $ Developed : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Code = col_character(),
## .. LandArea = col_double(),
## .. Population = col_double(),
## .. Density = col_double(),
## .. GDP = col_double(),
## .. Rural = col_double(),
## .. CO2 = col_double(),
## .. PumpPrice = col_double(),
## .. Military = col_double(),
## .. Health = col_double(),
## .. ArmedForces = col_double(),
## .. Internet = col_double(),
## .. Cell = col_double(),
## .. HIV = col_double(),
## .. Hunger = col_double(),
## .. Diabetes = col_double(),
## .. BirthRate = col_double(),
## .. DeathRate = col_double(),
## .. ElderlyPop = col_double(),
## .. LifeExpectancy = col_double(),
## .. FemaleLabor = col_double(),
## .. Unemployment = col_double(),
## .. Energy = col_double(),
## .. Electricity = col_double(),
## .. Developed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(countries)
## Country Code LandArea Population
## Length:217 Length:217 Min. : 0.01 Min. : 0.0120
## Class :character Class :character 1st Qu.: 10.83 1st Qu.: 0.7728
## Mode :character Mode :character Median : 94.28 Median : 6.5725
## Mean : 608.38 Mean : 35.0335
## 3rd Qu.: 446.30 3rd Qu.: 25.0113
## Max. :16376.87 Max. :1392.7300
## NA's :8 NA's :1
## Density GDP Rural CO2
## Min. : 0.1 Min. : 275 Min. : 0.00 Min. : 0.0400
## 1st Qu.: 37.5 1st Qu.: 2032 1st Qu.:19.62 1st Qu.: 0.8575
## Median : 92.1 Median : 5950 Median :38.15 Median : 2.7550
## Mean : 361.4 Mean : 14733 Mean :39.10 Mean : 4.9780
## 3rd Qu.: 219.8 3rd Qu.: 17298 3rd Qu.:57.83 3rd Qu.: 6.2525
## Max. :20777.5 Max. :114340 Max. :87.00 Max. :43.8600
## NA's :8 NA's :30 NA's :3 NA's :13
## PumpPrice Military Health ArmedForces
## Min. :0.1100 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.7450 1st Qu.: 3.015 1st Qu.: 6.157 1st Qu.: 12.0
## Median :0.9800 Median : 4.650 Median : 9.605 Median : 31.5
## Mean :0.9851 Mean : 6.178 Mean :10.597 Mean : 162.1
## 3rd Qu.:1.1800 3rd Qu.: 8.445 3rd Qu.:13.713 3rd Qu.: 146.5
## Max. :2.0000 Max. :31.900 Max. :39.460 Max. :3031.0
## NA's :50 NA's :67 NA's :29 NA's :49
## Internet Cell HIV Hunger
## Min. : 1.30 Min. : 13.70 Min. : 0.100 Min. : 2.50
## 1st Qu.:29.18 1st Qu.: 83.83 1st Qu.: 0.175 1st Qu.: 2.50
## Median :58.35 Median :110.00 Median : 0.400 Median : 6.50
## Mean :54.47 Mean :107.05 Mean : 1.941 Mean :11.25
## 3rd Qu.:78.92 3rd Qu.:127.50 3rd Qu.: 1.400 3rd Qu.:14.80
## Max. :98.90 Max. :328.80 Max. :27.400 Max. :61.80
## NA's :13 NA's :15 NA's :81 NA's :52
## Diabetes BirthRate DeathRate ElderlyPop
## Min. : 1.000 Min. : 7.00 Min. : 1.600 Min. : 1.200
## 1st Qu.: 5.350 1st Qu.:11.40 1st Qu.: 5.800 1st Qu.: 3.600
## Median : 7.200 Median :17.85 Median : 7.250 Median : 6.600
## Mean : 8.542 Mean :20.11 Mean : 7.683 Mean : 8.953
## 3rd Qu.:10.750 3rd Qu.:27.65 3rd Qu.: 9.350 3rd Qu.:14.500
## Max. :30.500 Max. :47.80 Max. :15.500 Max. :27.500
## NA's :10 NA's :15 NA's :15 NA's :24
## LifeExpectancy FemaleLabor Unemployment Energy
## Min. :52.20 Min. : 6.20 Min. : 0.100 Min. : 66
## 1st Qu.:66.90 1st Qu.:50.15 1st Qu.: 3.400 1st Qu.: 738
## Median :74.30 Median :60.60 Median : 5.600 Median : 1574
## Mean :72.46 Mean :57.95 Mean : 7.255 Mean : 2664
## 3rd Qu.:77.70 3rd Qu.:69.25 3rd Qu.: 9.400 3rd Qu.: 3060
## Max. :84.70 Max. :85.80 Max. :30.200 Max. :17923
## NA's :18 NA's :30 NA's :30 NA's :82
## Electricity Developed
## Min. : 39 Min. :1.00
## 1st Qu.: 904 1st Qu.:1.00
## Median : 2620 Median :2.00
## Mean : 4270 Mean :1.81
## 3rd Qu.: 5600 3rd Qu.:3.00
## Max. :53832 Max. :3.00
## NA's :76 NA's :75
# LifeExpectancy predicted by GDP
model_simple <- lm(LifeExpectancy ~ GDP, data = countries)
# Viewing coefficients, p-values, and R-squared
summary(model_simple)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
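Intercept: 68.42. This is the predicted life expectancy (in years) for a country with a GDP per capita of $0. Since the smallest GDP in the data is 275, this value is an extrapolation rather than a realistic prediction.
Slope: 0.0002476. For every $1 increase in GDP per capita, life expectancy is predicted to increase by 0.0002476 years (about 0.25 years for every additional $1,000).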
R^2 Value: 0.4304 (or 43.04%). This indicates that GDP alone explains roughly 43% of the variation in life expectancy across the countries in the dataset.
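To make the intercept and slope concrete, here is a small sketch that plugs the rounded coefficients from the output above into the fitted equation. The helper `predict_life()` and the two GDP values are hypothetical and chosen only for illustration; they are not part of the original analysis.

```r
# Coefficients copied from summary(model_simple), rounded as printed
b0 <- 68.42       # intercept, in years
b1 <- 0.0002476   # slope, in years per $1 of GDP per capita

# Hypothetical helper: predicted life expectancy for a given GDP per capita
predict_life <- function(gdp) b0 + b1 * gdp

pred_10k <- predict_life(10000)   # 68.42 + 2.476 = 70.896
pred_20k <- predict_life(20000)   # 68.42 + 4.952 = 73.372
gap <- pred_20k - pred_10k        # 10,000 * slope = 2.476 years

print(round(c(pred_10k = pred_10k, pred_20k = pred_20k, gap = gap), 3))
```

With the real model object, `predict(model_simple, newdata = data.frame(GDP = c(10000, 20000)))` should give nearly the same values, differing only by the rounding of the printed coefficients.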
# LifeExpectancy predicted by GDP, Health, and Internet
model_multiple <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
# Viewing coefficients and Adjusted R-squared
summary(model_multiple)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Health Coefficient: 0.2479. This means that for every 1 percentage point increase in the share of government expenditures directed to healthcare, life expectancy is predicted to increase by approximately 0.25 years, holding both GDP and Internet access constant.
Model Comparison: The adjusted R^2 for this multiple regression model is approximately 0.716 (or 71.6%). Compared to the 43% from Question 2 (adjusted R^2 of 0.427), this substantial increase suggests that Health and Internet are strong additional predictors that greatly improve the model’s ability to explain life expectancy. Note that GDP is no longer statistically significant (p = 0.302) once Health and Internet are included.
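The adjusted R^2 values being compared can be reproduced from the printed R^2, the sample size, and the number of predictors. The sketch below is only an illustration: `adj_r2()` is a hypothetical helper, and each sample size is taken as the residual degrees of freedom plus the number of estimated coefficients (177 + 2 and 169 + 4).

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = observations used in the fit, p = number of predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple model: R^2 = 0.4304, n = 177 + 2 = 179, p = 1
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R^2 = 0.7213, n = 169 + 4 = 173, p = 3
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

print(round(c(simple = adj_simple, multiple = adj_multiple), 4))
# Matches the printed values 0.4272 and 0.7164
```

Because the penalty term grows with the number of predictors, adjusted R^2 is the fairer number to use when comparing models of different sizes. The two models here were also fit on slightly different sets of countries (179 versus 173) because of missing values.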
# Base R method to view 4 key diagnostic plots at once
par(mfrow = c(2, 2))
plot(model_simple)
par(mfrow = c(1, 1)) # Reset grid to 1x1
# Visualization using ggplot2
diagnostic_data <- data.frame(
Fitted = fitted(model_simple),
Residuals = residuals(model_simple)
)
# Homoscedasticity plot (Residuals vs Fitted)
ggplot(diagnostic_data, aes(x = Fitted, y = Residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted (Check Homoscedasticity)",
x = "Fitted Values", y = "Residuals") +
theme_minimal()
# Normality plot (Q-Q Plot)
ggplot(diagnostic_data, aes(sample = Residuals)) +
geom_qq() +
geom_qq_line(color = "blue") +
labs(title = "Normal Q-Q Plot (Checking Normality of Residuals)") +
theme_minimal()
Homoscedasticity: Checked by plotting Residuals vs. Fitted values. An ideal outcome is a random, even scatter of points with no clear pattern. A violation (e.g., a funnel shape) indicates that the model’s prediction error varies at different levels of GDP, making predictions less reliable at certain extremes.
Normality of Residuals: Checked using a Q-Q plot or a histogram of residuals. The ideal outcome is points falling closely along the straight diagonal line. A violation suggests the residuals are skewed, affecting the validity of hypothesis tests and confidence intervals.
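The histogram check mentioned above is not shown in the code, so here is a minimal base R sketch of it, paired with a Shapiro-Wilk test as one possible formal complement. It uses simulated stand-in data (an assumption, not the AllCountries data) so that it runs on its own; with the real model, `res` would be `residuals(model_simple)`.

```r
# Simulated stand-in data so the sketch is self-contained
set.seed(42)
x <- runif(150, min = 0, max = 100)
y <- 60 + 0.2 * x + rnorm(150, sd = 4)

fit <- lm(y ~ x)
res <- residuals(fit)

# Histogram of residuals: roughly symmetric and bell-shaped if normality holds
hist(res, breaks = 20, main = "Histogram of Residuals", xlab = "Residuals")

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
print(sw$p.value)
```

A formal test like this is best read alongside the plots rather than in place of them, since with many observations even small, harmless departures from normality can produce a small p-value.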
# Calculating residuals for the multiple regression model
res_mult <- residuals(model_multiple)
# Computing Root Mean Square Error (RMSE)
rmse_val <- sqrt(mean(res_mult^2, na.rm = TRUE))
print(paste("RMSE for Multiple Regression Model:", round(rmse_val, 3)))
## [1] "RMSE for Multiple Regression Model: 4.056"
RMSE (Root Mean Square Error): Approximately 4.056. This represents the typical distance between the observed life expectancy values and the model’s predictions. In other words, the model’s predictions are off by about 4.06 years on average.
Impact of Large Residuals: Large residuals for specific countries mean the model poorly predicts their life expectancy. This reduces overall confidence in the model’s universal application. I would investigate these countries to find omitted contextual variables that are not accounted for by GDP, health spending, and internet access.
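The RMSE of 4.056 is slightly smaller than the residual standard error of 4.104 reported by `summary(model_multiple)`. The two differ only in the divisor: RMSE divides the sum of squared residuals by n, while the residual standard error divides by the residual degrees of freedom. A short sketch using the printed numbers (n = 173 is inferred as 169 residual degrees of freedom plus 4 coefficients):

```r
# Values copied from the summary(model_multiple) output
rse <- 4.104        # residual standard error
df_resid <- 169     # residual degrees of freedom
n <- 173            # observations used: 169 + 4 estimated coefficients

# RSE^2 * df = sum of squared residuals, so RMSE = sqrt(SSR / n)
rmse_from_rse <- rse * sqrt(df_resid / n)

print(round(rmse_from_rse, 3))
# 4.056, matching the RMSE computed directly from the residuals
```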
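One way to find the countries worth investigating is to flag observations whose standardized residuals are large in absolute value. The sketch below uses simulated data with one injected outlier, and the cutoff of 2 is a common rule of thumb rather than a requirement. With the real model, `rstandard(model_multiple)` could be used the same way.

```r
# Simulated stand-in data with one deliberately unusual observation
set.seed(1)
d <- data.frame(x = 1:30)
d$y <- 50 + 0.5 * d$x + rnorm(30, sd = 1)
d$y[15] <- d$y[15] + 15   # inject a large positive error at row 15

fit <- lm(y ~ x, data = d)

# Standardized residuals put every residual on a comparable scale
std_res <- rstandard(fit)

# Flag rows whose standardized residual exceeds 2 in absolute value
flagged <- which(abs(std_res) > 2)
print(d[flagged, ])
```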
I will be using library(car) for this portion, as I cannot answer this question with the “AllCountries” data alone.
# To actually test for multicollinearity in R, we use VIF (Variance Inflation Factor)
library(car)
## Loading required package: carData
# Hypothetical model for CO2 emissions
model_co2 <- lm(CO2 ~ Energy + Electricity, data = countries)
# Checking VIF values (Scores > 5 or 10 indicate problematic multicollinearity)
vif(model_co2)
## Energy Electricity
## 2.74052 2.74052
The strong correlation between Energy and Electricity causes multicollinearity in the regression model. This means both variables provide overlapping information, making it difficult for the model to cleanly isolate the independent effect that each predictor has on CO2 emissions. As a result, the regression coefficients for both predictors become unstable and can change noticeably with even minor changes to the dataset. Multicollinearity inflates the standard errors of the coefficients, which can make important predictors appear statistically insignificant by yielding high p-values. While the model might still yield accurate overall CO2 predictions and a high R^2 value, its reliability for explaining why those emissions are happening is compromised. Note that the VIF of about 2.74 computed above is below the common thresholds of 5 or 10, so the multicollinearity in this particular fit is moderate rather than severe.
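To connect the VIF output to the idea of overlapping information: for predictor j, VIF = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the others. With only two predictors, R_j^2 is the squared correlation between them, so the printed VIF can be converted back into a correlation. A small sketch using the value from the output (the sign of the correlation cannot be recovered this way):

```r
# VIF value copied from the vif(model_co2) output
vif_val <- 2.74052

# Invert VIF = 1 / (1 - R^2) to get the shared variance between predictors
r2_between <- 1 - 1 / vif_val

# Absolute correlation between Energy and Electricity in the rows used by the model
abs_cor <- sqrt(r2_between)

# Standard errors are inflated by sqrt(VIF) relative to uncorrelated predictors
se_inflation <- sqrt(vif_val)

print(round(c(r2 = r2_between, abs_cor = abs_cor, se_inflation = se_inflation), 3))
```

Under these numbers, the two predictors share roughly 64% of their variance (an absolute correlation of about 0.80), and each coefficient's standard error is about 1.66 times larger than it would be if the predictors were uncorrelated.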