Taking a look at some life expectancy data from the WHO
I’m going to use BASE R in conjunction with TIDYVERSE:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Loading/tidying data for model.
path = "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 605 /HW/DATA-605/Life Expectancy Data.csv"
df = read.csv(path)
head(df)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
Let’s look at schooling as a predictor of life expectancy:
df = df|>
select(c(Schooling, Life.expectancy)) |>
na.omit()
plot(Schooling ~ Life.expectancy, data=df)
Modeling the data:
model = lm(Schooling ~ Life.expectancy, data=df)
summary(model)
##
## Call:
## lm(formula = Schooling ~ Life.expectancy, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4111 -1.1027 0.0164 1.2696 6.1774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.643467 0.313559 -21.19 <2e-16 ***
## Life.expectancy 0.268828 0.004481 59.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.206 on 2766 degrees of freedom
## Multiple R-squared: 0.5655, Adjusted R-squared: 0.5653
## F-statistic: 3599 on 1 and 2766 DF, p-value: < 2.2e-16
Visualizing the fit:
ggplot(df, aes(x = Life.expectancy, y= Schooling)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
labs(x = "Life Expectancy")
## `geom_smooth()` using formula 'y ~ x'
Let’s see if this matches the model:
plot(Schooling ~ Life.expectancy, data=df)
abline(model, col="blue")
Now let’s look at the residuals:
plot(density(resid(model)))
The residuals appear to be normally distributed.
Schooling and Life Expectancy have a positive linear relationship with normally distributed residuals. These variables are appropriate for linear regression modeling.