DB WEEK 10

Taking a look at some life expectancy data from the WHO

LOADING PACKAGES

I’m going to use BASE R in conjunction with TIDYVERSE:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

DATA

Loading/tidying data for model.

path = "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 605 /HW/DATA-605/Life Expectancy Data.csv"
df = read.csv(path)
head(df)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2

DATA MODELING

Let’s look at schooling as a predictor of life expectancy:

df = df|>
  select(c(Schooling, Life.expectancy)) |>
  na.omit()
plot(Schooling ~ Life.expectancy, data=df)

Modeling the data:

model = lm(Schooling ~ Life.expectancy, data=df)
summary(model)
## 
## Call:
## lm(formula = Schooling ~ Life.expectancy, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4111  -1.1027   0.0164   1.2696   6.1774 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.643467   0.313559  -21.19   <2e-16 ***
## Life.expectancy  0.268828   0.004481   59.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.206 on 2766 degrees of freedom
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5653 
## F-statistic:  3599 on 1 and 2766 DF,  p-value: < 2.2e-16

Visualizing the fit:

ggplot(df, aes(x = Life.expectancy, y= Schooling)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Life Expectancy")
## `geom_smooth()` using formula 'y ~ x'

Let’s see if this matches the model:

plot(Schooling ~ Life.expectancy, data=df)
abline(model, col="blue")

Now let’s look at the residuals:

plot(density(resid(model)))

The residuals appear to be normally distributed.

CONCLUSION

Schooling and Life Expectancy have a positive linear relationship with normally distributed residuals. These variables are appropriate for linear regression modeling.