Homework 12

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

df <- read_csv("/Users/jordanglendrange/Documents/Data 605/who.csv")

## Rows: 190 Columns: 10

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Problem 1

df %>%
  ggplot(aes(y=LifeExp, x=TotExp)) + geom_point()

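The R^2 value measures the proportion of the variance in LifeExp that the regression line explains. Our model accounts for about 26% of the variance (R^2 = 0.2577). The F statistic tests whether the model fits better than an intercept-only model; here F = 65.26 on 1 and 188 degrees of freedom, with a p value of 7.714e-14. Since the p value is extremely low, the relationship between the two variables is statistically significant: a slope this far from zero would almost never appear by chance if the variables were unrelated. The residual standard error of 9.371 is on the scale of LifeExp, so it is measured in years.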

model <- lm(LifeExp~TotExp, df)
summary(model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
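The statistics discussed above can also be pulled out of the fitted model in code instead of being read off the printed summary. The following is a minimal sketch, assuming a small made-up data frame in place of who.csv; the names `toy`, `fit`, `r_squared`, `f_stat`, and `p_value` are illustrative only.

```r
# Toy data (hypothetical); stands in for the WHO data set.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.3, 3.8, 6.4, 7.7, 10.5, 11.6, 14.4, 15.9)
)

fit <- lm(y ~ x, data = toy)
s <- summary(fit)

r_squared <- s$r.squared   # proportion of variance explained
f_stat <- s$fstatistic     # named vector: value, numdf, dendf

# p value of the overall F test, taken from the upper tail of the F distribution
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)

c(r_squared = r_squared, p_value = p_value)
```

The same calls should work on the `model` object fitted above, since `summary()` of an `lm` fit stores `r.squared` and `fstatistic` as components.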

Problem 2

df2 <- df
df2$LifeExp <- df$LifeExp^4.6
df2$TotExp <- df$TotExp^0.6
df2

## # A tibble: 190 × 10
##    Country  LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##    <chr>      <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghani…  2.93e7          0.835          0.743  0.998 2.29e-4 5.72e-4      20
##  2 Albania   3.28e8          0.985          0.983  1.00  1.14e-3 4.61e-3     169
##  3 Algeria   3.28e8          0.967          0.962  0.999 1.06e-3 2.09e-3     108
##  4 Andorra   6.36e8          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
##  5 Angola    2.62e7          0.846          0.74   0.997 7.04e-5 1.15e-3      36
##  6 Antigua…  3.73e8          0.99           0.989  1.00  1.43e-4 2.77e-3     503
##  7 Argenti…  4.22e8          0.986          0.983  1.00  2.78e-3 7.41e-4     484
##  8 Armenia   2.88e8          0.979          0.976  0.999 3.70e-3 4.92e-3      88
##  9 Austral…  6.36e8          0.995          0.994  1.00  2.33e-3 9.15e-3    3181
## 10 Austria   5.68e8          0.996          0.996  1.00  3.61e-3 6.46e-3    3788
## # … with 180 more rows, and 2 more variables: GovtExp <dbl>, TotExp <dbl>

df2 %>%
  ggplot(aes(y=LifeExp, x=TotExp)) + geom_point() + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

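The thing to note in the summary statistics is how the R^2 value changed. We went from accounting for about 26% of the variability to about 57% (R^2 = 0.5728), and the F statistic rose from 65.26 to 252. The residual standard error is now on the scale of LifeExp^4.6, so it cannot be compared directly with the first model. I would say this model is better to use.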

model <- lm(LifeExp~TotExp, df2)
summary(model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -257351739  -82599957   14030425   93896945  237720335 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 211907647   10234512   20.70   <2e-16 ***
## TotExp         238461      15021   15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5705 
## F-statistic:   252 on 1 and 188 DF,  p-value: < 2.2e-16
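The transformed model can also be fitted without building a second data frame, by wrapping the powers in `I()` inside the formula. This is a sketch on simulated data, not on who.csv; the data frame `toy`, its columns `x` and `y`, and the simulated relationship are assumptions made only for illustration.

```r
set.seed(1)  # simulated toy data only (hypothetical)
toy <- data.frame(x = runif(30, 1, 100))
toy$y <- 5 + 0.2 * toy$x^0.6 + rnorm(30, sd = 0.1)

# Approach used in this homework: transform the columns in a copy first
toy2 <- toy
toy2$y <- toy$y^4.6
toy2$x <- toy$x^0.6
fit_copy <- lm(y ~ x, data = toy2)

# Alternative: apply the same powers inside the formula with I()
fit_inline <- lm(I(y^4.6) ~ I(x^0.6), data = toy)

# Both fits give the same intercept and slope
rbind(copy = unname(coef(fit_copy)), inline = unname(coef(fit_inline)))
```

Keeping the transformation in the formula leaves the original columns untouched, at the cost of longer coefficient names.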

Problem 3

Our function is: \[ LifeExp^{4.6} = 211,907,647 + 238,461 \times TotExp^{0.6} \]

# Forecast on the transformed scale, then undo the 4.6 power on LifeExp
v1 <- (238461 * (1.5) + 211907647)^(1/4.6)
v2 <- (238461 * (2.5) + 211907647)^(1/4.6)

c(v1, v2)

## [1] 64.59384 64.60961

Treating 1.5 and 2.5 as values of the transformed predictor (TotExp^0.6), we get a forecast life expectancy of 64.59 years for 1.5 and 64.61 years for 2.5.
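The back-transformation can be wrapped in a small helper so the same steps apply to any input. This is a sketch, assuming the coefficients are copied by hand from the printed summary of the transformed model; the function name `forecast_life_exp` and the argument `z` are hypothetical.

```r
# Coefficients copied from the printed summary of the transformed model
b0 <- 211907647
b1 <- 238461

# z is a value of the transformed predictor (TotExp^0.6)
forecast_life_exp <- function(z) {
  (b0 + b1 * z)^(1 / 4.6)  # undo the 4.6 power applied to LifeExp
}

forecast_life_exp(c(1.5, 2.5))
```

Because the function is vectorized, both forecasts come back from a single call.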

Problem 4

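The F statistic (34.49 on 3 and 186 degrees of freedom, p value < 2.2e-16) shows the results are statistically significant, and all three terms have p values well below 0.05. The R^2 accounts for only about 36% of the variability, compared with about 57% for the transformed model, so the previous model was better.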

model <- lm(LifeExp~ PropMD + TotExp + PropMD*TotExp, df)
summary(model)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
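In an R formula, `a * b` already expands to both main effects plus their interaction, so listing the main effects separately does not change the fit. The sketch below checks this on simulated data; the data frame `toy` and the columns `a`, `b`, and `y` are made up for illustration.

```r
set.seed(42)  # simulated toy data only (hypothetical)
toy <- data.frame(a = rnorm(40), b = rnorm(40))
toy$y <- 1 + 2 * toy$a - toy$b + 0.5 * toy$a * toy$b + rnorm(40, sd = 0.2)

# Formula written out as in this homework: main effects plus a * b
fit_long <- lm(y ~ a + b + a * b, data = toy)

# Shorter equivalent: a * b alone
fit_short <- lm(y ~ a * b, data = toy)

names(coef(fit_long))                       # the coefficient names of both fits
all.equal(coef(fit_long), coef(fit_short))  # compares the estimates
```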

Problem 5

The forecast is about 107.7 years, which is not realistic for a country's average life expectancy.

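# Inputs for the forecast
PropMD <- 0.03
TotExp <- 14
# Coefficients rounded from the model summary above
b_0 <- 62.77
b_1 <- 1.497 * 10^3
b_2 <- 7.233 * 10^-5
b_3 <- -6.026 * 10^-3

# Expected value: intercept + main effects + interaction
ev <- b_0 + b_1*PropMD + b_2*TotExp + b_3*PropMD*TotExp
ev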

## [1] 107.6785
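Writing the forecast out term by term shows where the large value comes from. The figures below use the rounded coefficients from the code above, so the last digits are approximate.

\[ \widehat{LifeExp} = 62.77 + 1497(0.03) + 7.233 \times 10^{-5}(14) - 6.026 \times 10^{-3}(0.03)(14) \]

\[ \widehat{LifeExp} \approx 62.77 + 44.91 + 0.00101 - 0.00253 \approx 107.68 \]

Almost all of the increase over the intercept comes from the PropMD term, which adds about 44.9 years on its own; the TotExp and interaction terms are negligible at TotExp = 14.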