HW9

Answer 1

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df<- read_csv("Allcountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df)

## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>

str(df)

## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : num [1:217] 521 5254 4279 NA 42030 ...
##  $ Rural         : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num [1:217] 3.72 4.08 13.81 NA NA ...
##  $ Health        : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num [1:217] 67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : num [1:217] NA 808 1328 NA NA ...
##  $ Electricity   : num [1:217] NA 2309 1363 NA NA ...
##  $ Developed     : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Code = col_character(),
##   ..   LandArea = col_double(),
##   ..   Population = col_double(),
##   ..   Density = col_double(),
##   ..   GDP = col_double(),
##   ..   Rural = col_double(),
##   ..   CO2 = col_double(),
##   ..   PumpPrice = col_double(),
##   ..   Military = col_double(),
##   ..   Health = col_double(),
##   ..   ArmedForces = col_double(),
##   ..   Internet = col_double(),
##   ..   Cell = col_double(),
##   ..   HIV = col_double(),
##   ..   Hunger = col_double(),
##   ..   Diabetes = col_double(),
##   ..   BirthRate = col_double(),
##   ..   DeathRate = col_double(),
##   ..   ElderlyPop = col_double(),
##   ..   LifeExpectancy = col_double(),
##   ..   FemaleLabor = col_double(),
##   ..   Unemployment = col_double(),
##   ..   Energy = col_double(),
##   ..   Electricity = col_double(),
##   ..   Developed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

colSums(is.na(df))

##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75

# Replace missing values in the selected columns with each column's median
df <- df |>
  mutate(across(
    c(LandArea, GDP, Health, Internet),
    \(x) if_else(is.na(x), median(x, na.rm = TRUE), x)
  ))
colSums(is.na(df))

##        Country           Code       LandArea     Population        Density 
##              0              0              0              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##              0              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##              0             49              0             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75
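As a small illustration of what median imputation does, here is a sketch on a made-up vector (the values are hypothetical, not taken from Allcountries.csv). It also shows why the median may be a safer fill value than the mean when a column contains an extreme value.

```r
# Toy vector with two missing values and one extreme value (hypothetical data)
x <- c(3, NA, 7, 1, NA, 100)

# Fill value candidates computed from the observed entries only
fill_median <- median(x, na.rm = TRUE)  # middle of 1, 3, 7, 100
fill_mean   <- mean(x, na.rm = TRUE)    # pulled upward by the value 100

# Replace each NA with the median, leaving observed values untouched
x_imputed <- ifelse(is.na(x), fill_median, x)

fill_median
fill_mean
x_imputed
```

The observed values are unchanged; only the two missing entries receive the median.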

Answer 2

model1<- lm(LifeExpectancy~GDP, data=df)
model1

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
## 
## Coefficients:
## (Intercept)          GDP  
##   69.224330     0.000235

summary(model1)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.147  -3.865   1.278   4.587  11.777 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.922e+01  5.358e-01  129.20   <2e-16 ***
## GDP         2.350e-04  2.227e-05   10.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.193 on 197 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3578 
## F-statistic: 111.3 on 1 and 197 DF,  p-value: < 2.2e-16

The y-intercept is 69.22. It means that when GDP equals 0, the predicted life expectancy is about 69 years. This is the model's baseline value rather than a realistic scenario.

The slope (coefficient) is 0.000235, which indicates that predicted life expectancy increases by 0.000235 years, on average, for each one-unit increase in GDP (about 0.235 years per 1,000 units of GDP).

R-squared is equal to 0.361, which means that the model explains about 36% of the variance in life expectancy.
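To make the intercept and slope concrete, the fitted line can be evaluated by hand. This is only a sketch: the coefficients are copied from the printed model1 output, and the three GDP values are the ones shown by head(df) for Afghanistan, Albania, and Andorra.

```r
# Coefficients copied from the printed model1 output
intercept <- 69.224330
slope <- 0.000235

# GDP values visible in head(df): Afghanistan, Albania, Andorra
gdp <- c(521, 5254, 42030)

# Predicted life expectancy = intercept + slope * GDP
pred <- intercept + slope * gdp
pred

# Change in predicted life expectancy for a 1,000-unit increase in GDP
per_1000 <- slope * 1000
per_1000
```

With the fitted model object available, predict(model1, newdata = data.frame(GDP = gdp)) should give the same values up to the rounding of the printed coefficients.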

Answer 3

model2<- lm(LifeExpectancy ~ GDP + Health + Internet, data=df)
model2

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
## 
## Coefficients:
## (Intercept)          GDP       Health     Internet  
##   5.926e+01    2.650e-05    2.161e-01    1.968e-01

summary(model2)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8766  -1.8037   0.2493   2.6083   9.2252 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.926e+01  7.625e-01  77.718  < 2e-16 ***
## GDP         2.650e-05  1.979e-05   1.339 0.181975    
## Health      2.161e-01  6.215e-02   3.476 0.000627 ***
## Internet    1.968e-01  1.401e-02  14.049  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.101 on 195 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.7226, Adjusted R-squared:  0.7184 
## F-statistic: 169.3 on 3 and 195 DF,  p-value: < 2.2e-16

The intercept is 59.26. This is the predicted life expectancy when GDP, Health, and Internet are all equal to zero.

Coefficients

All three coefficients are positive, which indicates that predicted life expectancy rises as each predictor increases while the other two are held constant. Health (p = 0.000627) and Internet (p < 2e-16) are statistically significant, but GDP is not (p = 0.182) once Health and Internet are in the model.

The adjusted R-squared is substantially higher for this model (0.7184) than for the previous one (0.3578). We can conclude that this model explains life expectancy better than the GDP-only model.
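The adjusted R-squared values can be checked from the printed summaries with the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below assumes n = 199 (217 rows minus the 18 deleted for missingness) and uses the rounded R-squared values from the output, so small rounding differences are expected. The helper adj_r2 is a name introduced here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 199  # 217 rows minus 18 observations deleted due to missingness

adj1 <- adj_r2(0.361,  n, p = 1)  # model1: GDP only
adj2 <- adj_r2(0.7226, n, p = 3)  # model2: GDP + Health + Internet

round(c(model1 = adj1, model2 = adj2), 4)
```

Because the penalty for extra predictors is small here, the gain in adjusted R-squared comes almost entirely from the better fit, not from the adjustment.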

Answer 4

To check the assumptions of homoscedasticity and normality, I will examine the standard diagnostic plots.

Normality is supported if the points in the Q-Q plot fall along the reference line. A marked, systematic deviation from the line would be considered a violation.

As for homoscedasticity, the residuals should form a roughly horizontal band with equal spread across the fitted values. A spread that widens or is uneven would be considered a violation.

par(mfrow=c(2,2)); plot(model1); par(mfrow=c(1,1))

Although there is a slight deviation at the end of the Q-Q line, most points follow the reference line, so the normality assumption is reasonably satisfied.

As for the variance, the residuals are unevenly spread across the fitted values, so homoscedasticity is violated.
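The visual checks can be complemented with formal tests. The sketch below is not run on the homework data: it uses simulated data in which the noise grows with the predictor, applies the Shapiro-Wilk test to the residuals, and computes a studentized Breusch-Pagan statistic by hand (n times the R-squared of a regression of squared residuals on the predictor). All variable names are introduced here for illustration.

```r
# Simulated data (not the Allcountries data): the noise standard deviation
# grows with x, so constant variance is violated by construction.
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals (null hypothesis: normality)
sw <- shapiro.test(resid(fit))

# Studentized Breusch-Pagan test computed by hand:
# regress the squared residuals on the predictor and compare n * R^2
# with a chi-squared distribution (df = number of predictors = 1).
aux <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

sw$p.value
bp_p
```

A small Breusch-Pagan p-value is evidence against constant variance. The same steps could be applied to model1 by replacing fit with model1 and x with the model's predictor.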

Answer 5

rs<- resid(model2)

rmse<-sqrt(mean(rs^2))
rmse

## [1] 4.05957

The model’s predictions miss the observed life expectancy by approximately 4.06 years on average (in the root-mean-square sense), so an individual prediction may be too high or too low by about that much.

Large residuals inflate the RMSE and signal that the model is less accurate for those countries. Outliers or measurement errors may need to be investigated, or the model may need a different specification.
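The RMSE is closely related to the residual standard error printed by summary(model2): both come from the same residual sum of squares, but the RMSE divides by n while the residual standard error divides by the residual degrees of freedom. The sketch below recovers the RMSE from the printed values only (4.101 on 195 degrees of freedom, with n = 199 assumed from 217 rows minus 18 deleted).

```r
# Values copied from the printed summary(model2) output
rse <- 4.101      # residual standard error
df_resid <- 195   # residual degrees of freedom
n <- 199          # observations used in the fit (217 - 18)

# Residual sum of squares implied by the residual standard error
rss <- rse^2 * df_resid

# RMSE divides the same sum of squares by n instead of df_resid
rmse_check <- sqrt(rss / n)
rmse_check
```

The result agrees with the value computed from resid(model2), up to the rounding of the printed standard error.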

Answer 6

This multicollinearity might affect the model because the predictors overlap in the information they carry, which can lead to misleading coefficient estimates: you cannot clearly tell whether it is Electricity or Energy that is influencing the outcome. The estimates also become unstable, so minor changes in the data can produce very different results.
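One common way to quantify this overlap is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated stand-ins for two highly correlated predictors. The names energy and electricity and all numbers are hypothetical, not values from the dataset.

```r
# Simulated predictors (hypothetical values): electricity is almost a
# rescaled copy of energy, so the two carry nearly the same information.
set.seed(42)
n <- 150
energy <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 100)

# Correlation between the two predictors
r <- cor(energy, electricity)

# VIF for electricity: regress it on the other predictor
r2 <- summary(lm(electricity ~ energy))$r.squared
vif_electricity <- 1 / (1 - r2)

r
vif_electricity
```

A VIF above roughly 5 to 10 is often taken as a sign of problematic multicollinearity. In that case, dropping one of the two predictors or combining them is a typical remedy.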

Carlos Dave Sidney

2025-11-27