HW 9

Setting working directory and loading the data

setwd("~/Desktop/Data 101")
countries <- read.csv("AllCountries.csv")

Checking the structure of the dataset

str(countries)

## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...

Summary statistics to explore the data

summary(countries)

##    Country              Code              LandArea          Population       
##  Length:217         Length:217         Min.   :    0.01   Min.   :   0.0120  
##  Class :character   Class :character   1st Qu.:   10.83   1st Qu.:   0.7728  
##  Mode  :character   Mode  :character   Median :   94.28   Median :   6.5725  
##                                        Mean   :  608.38   Mean   :  35.0335  
##                                        3rd Qu.:  446.30   3rd Qu.:  25.0113  
##                                        Max.   :16376.87   Max.   :1392.7300  
##                                        NA's   :8          NA's   :1          
##     Density             GDP             Rural            CO2         
##  Min.   :    0.1   Min.   :   275   Min.   : 0.00   Min.   : 0.0400  
##  1st Qu.:   37.5   1st Qu.:  2032   1st Qu.:19.62   1st Qu.: 0.8575  
##  Median :   92.1   Median :  5950   Median :38.15   Median : 2.7550  
##  Mean   :  361.4   Mean   : 14733   Mean   :39.10   Mean   : 4.9780  
##  3rd Qu.:  219.8   3rd Qu.: 17298   3rd Qu.:57.83   3rd Qu.: 6.2525  
##  Max.   :20777.5   Max.   :114340   Max.   :87.00   Max.   :43.8600  
##  NA's   :8         NA's   :30       NA's   :3       NA's   :13       
##    PumpPrice         Military          Health        ArmedForces    
##  Min.   :0.1100   Min.   : 0.000   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.:0.7450   1st Qu.: 3.015   1st Qu.: 6.157   1st Qu.:  12.0  
##  Median :0.9800   Median : 4.650   Median : 9.605   Median :  31.5  
##  Mean   :0.9851   Mean   : 6.178   Mean   :10.597   Mean   : 162.1  
##  3rd Qu.:1.1800   3rd Qu.: 8.445   3rd Qu.:13.713   3rd Qu.: 146.5  
##  Max.   :2.0000   Max.   :31.900   Max.   :39.460   Max.   :3031.0  
##  NA's   :50       NA's   :67       NA's   :29       NA's   :49      
##     Internet          Cell             HIV             Hunger     
##  Min.   : 1.30   Min.   : 13.70   Min.   : 0.100   Min.   : 2.50  
##  1st Qu.:29.18   1st Qu.: 83.83   1st Qu.: 0.175   1st Qu.: 2.50  
##  Median :58.35   Median :110.00   Median : 0.400   Median : 6.50  
##  Mean   :54.47   Mean   :107.05   Mean   : 1.941   Mean   :11.25  
##  3rd Qu.:78.92   3rd Qu.:127.50   3rd Qu.: 1.400   3rd Qu.:14.80  
##  Max.   :98.90   Max.   :328.80   Max.   :27.400   Max.   :61.80  
##  NA's   :13      NA's   :15       NA's   :81       NA's   :52     
##     Diabetes        BirthRate       DeathRate        ElderlyPop    
##  Min.   : 1.000   Min.   : 7.00   Min.   : 1.600   Min.   : 1.200  
##  1st Qu.: 5.350   1st Qu.:11.40   1st Qu.: 5.800   1st Qu.: 3.600  
##  Median : 7.200   Median :17.85   Median : 7.250   Median : 6.600  
##  Mean   : 8.542   Mean   :20.11   Mean   : 7.683   Mean   : 8.953  
##  3rd Qu.:10.750   3rd Qu.:27.65   3rd Qu.: 9.350   3rd Qu.:14.500  
##  Max.   :30.500   Max.   :47.80   Max.   :15.500   Max.   :27.500  
##  NA's   :10       NA's   :15      NA's   :15       NA's   :24      
##  LifeExpectancy   FemaleLabor     Unemployment        Energy     
##  Min.   :52.20   Min.   : 6.20   Min.   : 0.100   Min.   :   66  
##  1st Qu.:66.90   1st Qu.:50.15   1st Qu.: 3.400   1st Qu.:  738  
##  Median :74.30   Median :60.60   Median : 5.600   Median : 1574  
##  Mean   :72.46   Mean   :57.95   Mean   : 7.255   Mean   : 2664  
##  3rd Qu.:77.70   3rd Qu.:69.25   3rd Qu.: 9.400   3rd Qu.: 3060  
##  Max.   :84.70   Max.   :85.80   Max.   :30.200   Max.   :17923  
##  NA's   :18      NA's   :30      NA's   :30       NA's   :82     
##   Electricity      Developed   
##  Min.   :   39   Min.   :1.00  
##  1st Qu.:  904   1st Qu.:1.00  
##  Median : 2620   Median :2.00  
##  Mean   : 4270   Mean   :1.81  
##  3rd Qu.: 5600   3rd Qu.:3.00  
##  Max.   :53832   Max.   :3.00  
##  NA's   :76      NA's   :75

Checking for NA’s

colSums(is.na(countries))

##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75

Simple linear regression

# Fit simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)

# View the model summary
summary(simple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation:

The intercept (around 68.42) represents the predicted life expectancy for a country with a GDP of $0. This is not practically meaningful, but mathematically it’s the y-intercept.
The coefficient for GDP (around 0.0002476) means that for every $1 increase in GDP, predicted life expectancy increases by about 0.0002476 years.
Look at the p-values: both are < 0.05, indicating statistical significance.
R² (around 0.43) means GDP alone explains about 43% of the variance in life expectancy, which leaves more than half of the variation unexplained.
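
The per-dollar slope is hard to read because it is so small. A minimal sketch of rescaling it to larger GDP steps, using the estimate printed in the summary above (the variable names here are illustrative, not part of the original analysis):

```r
# GDP slope copied from summary(simple_model) above
gdp_slope <- 2.476e-04

# Predicted change in life expectancy (years) for larger GDP steps
per_1000  <- gdp_slope * 1000
per_10000 <- gdp_slope * 10000

c(per_1000 = per_1000, per_10000 = per_10000)
```

With the fitted model in memory, the same rescaling could be done with `coef(simple_model)["GDP"] * 1000`.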

Multiple linear regression

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

# View the model summary
summary(multiple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation:

Now life expectancy is explained by GDP, Health and Internet.

Coefficients (holding the other variables constant)

GDP: +0.0000237 years per $1 GDP (p = 0.302, not significant). After accounting for health spending and internet access, GDP doesn’t add clear independent information. Roughly +0.024 years per +$1000 GDP.

Health: +0.2479 years per 1 percentage point increase in health spending (p = 0.000247). Countries that devote a higher share of government spending to healthcare tend to have longer life expectancy, about +2.48 years per +10 percentage points.

Internet: +0.1903 years per +1 percentage point of internet access (p < 2e-16). Greater internet access strongly predicts longer life expectancy, even after controlling for GDP and health spending.
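
The two summaries above were fit on different rows: 179 countries for the simple model (177 residual df + 2 coefficients) and 173 for the multiple model (169 + 4), because `lm()` drops rows with missing values in the variables it uses. A sketch of one way to compare the models on the same rows follows. It uses a simulated stand-in data frame (`toy`, an assumption made only so the example runs on its own), not the real `countries` data.

```r
# Toy stand-in for `countries` (simulated values, not the real data)
set.seed(1)
n <- 50
toy <- data.frame(
  GDP      = runif(n, 500, 50000),
  Health   = runif(n, 2, 20),
  Internet = runif(n, 5, 95)
)
toy$LifeExpectancy <- 60 + 0.0001 * toy$GDP + 0.2 * toy$Health +
  0.1 * toy$Internet + rnorm(n, sd = 3)
toy$Health[c(3, 10)] <- NA   # mimic missing values
toy$Internet[7]      <- NA

# Keep only rows that are complete on every variable used by either model
vars   <- c("LifeExpectancy", "GDP", "Health", "Internet")
common <- toy[complete.cases(toy[, vars]), ]

fit_simple   <- lm(LifeExpectancy ~ GDP, data = common)
fit_multiple <- lm(LifeExpectancy ~ GDP + Health + Internet, data = common)

# Both models now use the same number of observations
c(simple = nobs(fit_simple), multiple = nobs(fit_multiple))

# Adjusted R-squared is comparable because the rows match
c(simple   = summary(fit_simple)$adj.r.squared,
  multiple = summary(fit_multiple)$adj.r.squared)
```

Applied to the real data, the same pattern would replace `toy` with `countries`.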

par(mfrow=c(2,2))
plot(simple_model)

par(mfrow=c(1,1))

Checking model assumptions

Homoscedasticity (Residuals vs Fitted Plot)

How to check: Look at the Residuals vs Fitted plot and compare the vertical spread of the residuals across the range of fitted values. The spread should be consistent from left to right.

Ideal outcome: Residuals should be evenly scattered around 0, with roughly equal spread and no funnel shape or pattern.

Violation would look like: Patterns or widening/narrowing spread indicate non-constant variance, making predictions less reliable across GDP levels. If variance isn’t constant, the model may give incorrect p-values and confidence intervals, so your conclusions about significance might be wrong.

Normality (Q–Q Plot)

How to check: Examine a Q–Q plot of the residuals to see if they follow a normal distribution.

Ideal outcome: Points fall roughly on the reference line, with few outliers and no S-shaped curve, showing residuals are approximately normal.

Violation meaning: Strong deviations or curve shapes indicate skewness or outliers, making p-values and confidence intervals less trustworthy.

Comparing the plots to these expectations:

Homoscedasticity Analysis:

Overall, the assumption seems mostly reasonable. The spread of the residuals stays fairly steady, though a few outliers do pop up. They’re worth noticing, but they don’t seriously undermine the model. For the most part, the variance looks stable, so this assumption holds pretty well.

Normality Analysis:

Here the model shows a bit more trouble. The center of the Q–Q plot follows the expected pattern, but the ends drift away from the line, especially on the lower side. This suggests heavier tails, possibly coming from countries with extremely low life expectancies relative to their GDP. Because of this tail behavior, the normality assumption is weaker than the homoscedasticity one, and it raises more potential concerns.
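
The visual read of a Q–Q plot can be complemented with a numeric check. One option, not used in the analysis above, is the Shapiro-Wilk test from base R's `stats` package. The sketch below runs it on a simulated toy model (`toy_model` is a made-up stand-in so the example is self-contained):

```r
# Toy stand-in data (simulated; not the AllCountries values)
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)
toy_model <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(toy_model))
sw$statistic   # W close to 1 suggests approximate normality
sw$p.value     # a small p-value is evidence against normality
```

For the model in this report, the equivalent call would be `shapiro.test(resid(simple_model))`. With a couple hundred observations the test can flag small departures, so it is best read alongside the plot rather than instead of it.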

Diagnose Model Fit with Metrics

Beyond summaries, compute the RMSE for the multiple model.

# For multiple model

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

RMSE measures how far the model’s predictions are from actual life expectancy. A value of roughly 4 means predictions are typically off by about 4 years. A residual is the difference between a country’s actual life expectancy and what the model predicted. If a country’s residual is really big, the model predicted way too high or way too low for that country. Large residuals make the model less trustworthy and may signal that an important factor is missing. I would investigate the outliers further: countries that are extreme cases (very poor or very wealthy) might behave differently and cause big prediction errors.
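
The RMSE (4.056) is slightly smaller than the residual standard error in the model summary (4.104) because RMSE divides the squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. A short sketch of that relationship, using only numbers printed above (the variable names are illustrative):

```r
rmse <- 4.056417   # RMSE computed above
df   <- 169        # residual degrees of freedom from summary(multiple_model)
n    <- df + 4     # observations used: residual df + 4 estimated coefficients

# Rescale from "divide by n" to "divide by residual df"
rse <- rmse * sqrt(n / df)
round(rse, 3)
```

The result agrees with the residual standard error reported by `summary(multiple_model)`.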

Extra

cor(countries[, c("Energy", "Electricity")], use = "complete.obs")

##                Energy Electricity
## Energy      1.0000000   0.7970054
## Electricity 0.7970054   1.0000000

Energy and Electricity have a correlation of 0.797, which is high. Because they move together, a regression model that includes both as predictors of CO₂ emissions can’t cleanly separate which one drives emissions. This makes the coefficients unstable, hard to interpret, and sometimes misleading. Such a model can still predict CO₂ fairly well, but you can’t trust the individual effects of Energy vs. Electricity.
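
One common way to quantify this overlap is the variance inflation factor (VIF). Assuming a model with only these two predictors, the VIF can be computed directly from their correlation as 1 / (1 - r²). This is a sketch under that assumption, not a result from a fitted CO₂ model:

```r
r <- 0.7970054          # correlation between Energy and Electricity, from above

# VIF for either predictor in a two-predictor model
vif <- 1 / (1 - r^2)
round(vif, 2)

# Factor by which each coefficient's standard error is inflated
round(sqrt(vif), 2)
```

With more than two predictors in the model, the VIF for each variable would instead come from regressing that predictor on all the others.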