Answer 1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df<- read_csv("Allcountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 26
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice Military
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AFG 653. 37.2 56.9 521 74.5 0.29 0.7 3.72
## 2 Albania ALB 27.4 2.87 105. 5254 39.7 1.98 1.36 4.08
## 3 Algeria DZA 2382. 42.2 17.7 4279 27.4 3.74 0.28 13.8
## 4 Americ… ASM 0.2 0.055 277. NA 12.8 NA NA NA
## 5 Andorra AND 0.47 0.077 164. 42030 11.9 5.83 NA NA
## 6 Angola AGO 1247. 30.8 24.7 3432 34.5 1.29 0.97 9.4
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## # Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## # DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## # Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>
str(df)
## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : num [1:217] 521 5254 4279 NA 42030 ...
## $ Rural : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num [1:217] 3.72 4.08 13.81 NA NA ...
## $ Health : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num [1:217] 67.4 123.7 111 NA 104.4 ...
## $ HIV : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : num [1:217] NA 808 1328 NA NA ...
## $ Electricity : num [1:217] NA 2309 1363 NA NA ...
## $ Developed : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Code = col_character(),
## .. LandArea = col_double(),
## .. Population = col_double(),
## .. Density = col_double(),
## .. GDP = col_double(),
## .. Rural = col_double(),
## .. CO2 = col_double(),
## .. PumpPrice = col_double(),
## .. Military = col_double(),
## .. Health = col_double(),
## .. ArmedForces = col_double(),
## .. Internet = col_double(),
## .. Cell = col_double(),
## .. HIV = col_double(),
## .. Hunger = col_double(),
## .. Diabetes = col_double(),
## .. BirthRate = col_double(),
## .. DeathRate = col_double(),
## .. ElderlyPop = col_double(),
## .. LifeExpectancy = col_double(),
## .. FemaleLabor = col_double(),
## .. Unemployment = col_double(),
## .. Energy = col_double(),
## .. Electricity = col_double(),
## .. Developed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(df))
## Country Code LandArea Population Density
## 0 0 8 1 8
## GDP Rural CO2 PumpPrice Military
## 30 3 13 50 67
## Health ArmedForces Internet Cell HIV
## 29 49 13 15 81
## Hunger Diabetes BirthRate DeathRate ElderlyPop
## 52 10 15 15 24
## LifeExpectancy FemaleLabor Unemployment Energy Electricity
## 18 30 30 82 76
## Developed
## 75
df <- df |>
  # Replace missing values with the column median for the four columns imputed here
  mutate(across(
    c(LandArea, GDP, Health, Internet),
    \(x) if_else(is.na(x), median(x, na.rm = TRUE), x)
  ))
colSums(is.na(df))
## Country Code LandArea Population Density
## 0 0 0 1 8
## GDP Rural CO2 PumpPrice Military
## 0 3 13 50 67
## Health ArmedForces Internet Cell HIV
## 0 49 0 15 81
## Hunger Diabetes BirthRate DeathRate ElderlyPop
## 52 10 15 15 24
## LifeExpectancy FemaleLabor Unemployment Energy Electricity
## 18 30 30 82 76
## Developed
## 75
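The median imputation used above can be sketched on a small toy data frame in base R. The values and the helper name `impute_median` are hypothetical and only illustrate the idea; they do not come from Allcountries.csv.

```r
# Toy data frame (hypothetical values, not from Allcountries.csv)
toy <- data.frame(
  GDP    = c(500, NA, 4000, 42000, NA),
  Health = c(2.0, 9.5, NA, 14.0, 5.5)
)

# Hypothetical helper: replace each NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_median))

# Every column should now report zero missing values
colSums(is.na(toy_imputed))
```

Median imputation keeps every row but pulls the imputed values toward the centre of the distribution, so it tends to understate the spread of the imputed columns.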
Answer 2
model1<- lm(LifeExpectancy~GDP, data=df)
model1
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
##
## Coefficients:
## (Intercept) GDP
## 69.224330 0.000235
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.147 -3.865 1.278 4.587 11.777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.922e+01 5.358e-01 129.20 <2e-16 ***
## GDP 2.350e-04 2.227e-05 10.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.193 on 197 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3578
## F-statistic: 111.3 on 1 and 197 DF, p-value: < 2.2e-16
The y-intercept is 69.22: the model predicts a life expectancy of about 69 years when GDP equals 0. This is an extrapolation rather than a realistic scenario.
The slope (coefficient) is 0.000235: predicted life expectancy increases by 0.000235 years for each one-unit increase in GDP, or about 0.24 years per 1,000 units.
R-squared is 0.361, so the model explains about 36% of the variance in life expectancy.
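Because the slope is so small per single unit of GDP, it can help to express the predictor in thousands. The sketch below uses simulated stand-in data (the variable names `gdp` and `life` and all values are assumptions, not the Allcountries data) to show that rescaling changes only the size of the coefficient, not the fit.

```r
set.seed(1)
# Simulated stand-in data (hypothetical; not the Allcountries values)
gdp  <- runif(100, min = 500, max = 50000)
life <- 69 + 0.000235 * gdp + rnorm(100, sd = 6)

fit       <- lm(life ~ gdp)             # GDP in original units
fit_thous <- lm(life ~ I(gdp / 1000))   # same model, GDP in thousands

coef(fit)["gdp"] * 1000     # change in life expectancy per 1,000 units
coef(fit_thous)[2]          # the same quantity, estimated directly
summary(fit)$r.squared      # R-squared is unchanged by rescaling
```

The two slopes describe the same relationship; only the unit of the predictor differs.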
Answer 3
model2<- lm(LifeExpectancy ~ GDP + Health + Internet, data=df)
model2
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
##
## Coefficients:
## (Intercept) GDP Health Internet
## 5.926e+01 2.650e-05 2.161e-01 1.968e-01
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8766 -1.8037 0.2493 2.6083 9.2252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.926e+01 7.625e-01 77.718 < 2e-16 ***
## GDP 2.650e-05 1.979e-05 1.339 0.181975
## Health 2.161e-01 6.215e-02 3.476 0.000627 ***
## Internet 1.968e-01 1.401e-02 14.049 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.101 on 195 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.7226, Adjusted R-squared: 0.7184
## F-statistic: 169.3 on 3 and 195 DF, p-value: < 2.2e-16
The intercept is 59.26: this is the predicted life expectancy when GDP, Health, and Internet are all equal to zero.
Coefficients
All coefficients are positive, so life expectancy is predicted to rise as each predictor increases, holding the others constant. Health and Internet are statistically significant, but GDP is not (p = 0.18) once the other two predictors are included.
The adjusted R-squared is substantially higher for this model (0.718) than for the previous one (0.358), so this model predicts life expectancy better.
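A comparison of a simple and an extended model can also be made formally with an F-test on nested models. The sketch below uses R's built-in `mtcars` data purely as a stand-in, since the country data are not reproduced here; the variable choice is an assumption for illustration.

```r
# Built-in data used as a stand-in for illustration only
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalises extra predictors, so it is comparable across models
adj_small <- summary(m_small)$adj.r.squared
adj_large <- summary(m_large)$adj.r.squared
c(small = adj_small, large = adj_large)

# F-test: does the larger model reduce residual variance by more than chance?
cmp <- anova(m_small, m_large)
cmp
```

The comparison is only valid when both models are fitted to the same rows, which is the case for `model1` and `model2` above (each drops the same 18 observations).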
Answer 4
To check the assumptions of homoscedasticity and normality, I will use the core diagnostic plots.
Normality of the residuals is supported if the points in the Q-Q plot fall along the reference line; a marked deviation from the line would be considered a violation.
As for homoscedasticity, a horizontal band of residuals with equal spread is expected; a spread that widens or is otherwise uneven would be considered a violation.
par(mfrow=c(2,2)); plot(model1); par(mfrow=c(1,1))
Although there is a slight deviation at the end, most of the points follow the reference line, so the normality assumption is reasonably met.
As for variance, the residuals are unevenly spread, so homoscedasticity is violated.
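The visual checks can be complemented with formal tests. The sketch below is one possible approach, not the method used above: it runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand, on simulated data whose error spread grows with the predictor (all names and values are assumptions).

```r
set.seed(42)
# Simulated data whose error spread grows with x (hypothetical example)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)
fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(resid(fit))

# Constant variance: studentized Breusch-Pagan statistic computed by hand,
# n * R^2 from regressing the squared residuals on the predictor
sq_res  <- resid(fit)^2
aux     <- lm(sq_res ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

A small Breusch-Pagan p-value points to heteroscedasticity; the degrees of freedom equal the number of predictors in the auxiliary regression (one here).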
Answer 5
rs<- resid(model2)
rmse<-sqrt(mean(rs^2))
rmse
## [1] 4.05957
The model’s predictions miss the observed life expectancy by approximately 4.06 years on average (root mean squared error). The errors go in both directions, so the actual life expectancy may be higher or lower than predicted.
Large residuals inflate the RMSE and point to observations the model predicts poorly. Outliers or measurement errors may need to be investigated, or the model may need a different specification.
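The RMSE is closely related to the residual standard error reported by `summary()`: the two differ only in the divisor (n versus the residual degrees of freedom). For `model2`, 4.101 × sqrt(195 / 199) ≈ 4.06, which matches the value above. The sketch below checks the same relation on the built-in `mtcars` data, used here only as a stand-in.

```r
# Built-in data used as a stand-in for illustration only
fit <- lm(mpg ~ wt + hp, data = mtcars)

res  <- resid(fit)
rmse <- sqrt(mean(res^2))   # divides the squared residuals by n
rse  <- sigma(fit)          # divides by the residual degrees of freedom
mae  <- mean(abs(res))      # mean absolute error, less sensitive to large residuals

c(rmse = rmse, rse = rse, mae = mae)

# RMSE and residual standard error differ only by the divisor
all.equal(rmse, rse * sqrt(df.residual(fit) / nobs(fit)))
```

Comparing RMSE with the mean absolute error gives a rough sense of how much a few large residuals drive the error: the wider the gap, the stronger their influence.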
Answer 6
This multicollinearity can affect the model because the predictors overlap: they carry largely the same information, which can lead to misleading coefficient estimates. You cannot clearly tell whether it is electricity or energy that is influencing the outcome, and minor changes in the data can produce very different results.
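One common way to quantify this overlap is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²) from regressing it on the other predictors. The sketch below computes it by hand in base R on simulated predictors; the names `energy`, `electricity`, and `internet` echo the dataset's columns, but the values are made up for illustration.

```r
set.seed(7)
n <- 200
# Simulated predictors (hypothetical): electricity is almost a copy of energy
energy      <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 50)
internet    <- rnorm(n, mean = 50, sd = 20)
X <- data.frame(energy, electricity, internet)

# VIF for predictor v: 1 / (1 - R^2) from regressing v on the other predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 1)
cor(energy, electricity)
```

Values well above 10 are a common rule-of-thumb signal of problematic multicollinearity, whereas a value near 1 indicates a predictor that is nearly unrelated to the others.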