library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df <- read_csv("/Users/jordanglendrange/Documents/Data 605/who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df %>%
  ggplot(aes(y = LifeExp, x = TotExp)) +
  geom_point()
The R^2 value measures the proportion of the variance in the response that the model explains. Here R^2 = 0.2577, so the model accounts for about 26% of the variance in life expectancy. The F statistic tests whether the model fits significantly better than an intercept-only model. Since the p-value (7.714e-14) is extremely low, we can be confident that the two variables are linearly associated: the likelihood of seeing results like these by chance alone is almost 0.
model <- lm(LifeExp~TotExp, df)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
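The R^2, F statistic, and p-value read off the printed summary can also be extracted from the summary object directly. The sketch below is only an illustration: it assumes the built-in `mtcars` data instead of who.csv, so that it runs without the file, and the names `fit`, `s`, `f_stat`, and `p_value` are chosen here for the example.

```r
# Illustrative only: uses the built-in mtcars data, not who.csv
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

# Proportion of variance in the response explained by the model
r_squared <- s$r.squared

# fstatistic is a named vector: value, numdf, dendf
f_stat <- s$fstatistic

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

c(r_squared = r_squared, p_value = p_value)
```

The same calls should work on the `model` object fitted above, since `summary()` of an `lm` fit always carries `r.squared` and `fstatistic`.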
df2 <- df
df2$LifeExp <- df$LifeExp^4.6
df2$TotExp <- df$TotExp^0.6
df2
## # A tibble: 190 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… 2.93e7 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 3.28e8 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 3.28e8 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 6.36e8 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 2.62e7 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua… 3.73e8 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## 7 Argenti… 4.22e8 0.986 0.983 1.00 2.78e-3 7.41e-4 484
## 8 Armenia 2.88e8 0.979 0.976 0.999 3.70e-3 4.92e-3 88
## 9 Austral… 6.36e8 0.995 0.994 1.00 2.33e-3 9.15e-3 3181
## 10 Austria 5.68e8 0.996 0.996 1.00 3.61e-3 6.46e-3 3788
## # … with 180 more rows, and 2 more variables: GovtExp <dbl>, TotExp <dbl>
df2 %>%
  ggplot(aes(y = LifeExp, x = TotExp)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The thing to note in the summary statistics is how the R^2 value changed: the model went from accounting for about 26% of the variability to about 57% (R^2 = 0.5728). I would say this model is better to use.
model <- lm(LifeExp~TotExp, df2)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257351739 -82599957 14030425 93896945 237720335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211907647 10234512 20.70 <2e-16 ***
## TotExp 238461 15021 15.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5705
## F-statistic: 252 on 1 and 188 DF, p-value: < 2.2e-16
Our function, written in terms of the transformed variables, is: \[ LifeExp^{4.6} = 238461 \cdot TotExp^{0.6} + 211907647 \]
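# Plug in transformed predictor values of 1.5 and 2.5,
# then take the 1/4.6 power to undo the transformation of LifeExp
v1 <- (238461 * 1.5 + 211907647)^(1/4.6)
v2 <- (238461 * 2.5 + 211907647)^(1/4.6)
c(v1, v2)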
## [1] 64.59384 64.60961
With the transformed expenditure term equal to 1.5, the forecast is 64.59 years. With it equal to 2.5, the forecast is 64.61 years.
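Written out for the first forecast, the back-transformation takes the 1/4.6 power of the fitted value, with 1.5 used as the value of the transformed predictor as in the code:

\[ LifeExp = (238461 \cdot 1.5 + 211907647)^{1/4.6} = 212265338.5^{1/4.6} \approx 64.59 \]

The slope term adds only 357691.5 to an intercept of about 2.1e8, which is why moving the predictor from 1.5 to 2.5 barely changes the forecast.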
The F statistic shows that the model is statistically significant. However, R^2 is only 0.3574, so the model accounts for about 36% of the variability, and the previous model was better.
model <- lm(LifeExp~ PropMD + TotExp + PropMD*TotExp, df)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
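The forecast is about 107.7 years, which is not realistic as a national life expectancy. PropMD = 0.03 is several times larger than any PropMD value in the ten rows printed earlier, so the model is likely being extrapolated beyond the range where it fits well.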
PropMD <- 0.03
TotExp <- 14
b_0 <- 62.77
b_1 <- 1.497 * 10^3
b_2 <- 7.233*10^-5
b_3 <- -6.026 * 10^-3
ev = b_0 + b_1*PropMD + b_2*TotExp + b_3*PropMD*TotExp
ev
## [1] 107.6785
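To see where the 107.68 comes from, the forecast can be split into the contribution of each term. This is a small sketch that reuses the rounded coefficients from the calculation above; the vector name `contrib` is introduced here for illustration.

```r
# Coefficients rounded from the summary output, as in the calculation above
PropMD <- 0.03
TotExp <- 14

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = 62.77,
  PropMD      = 1.497e3 * PropMD,
  TotExp      = 7.233e-5 * TotExp,
  interaction = -6.026e-3 * PropMD * TotExp
)

contrib
sum(contrib)  # matches the forecast of about 107.68
```

The PropMD term alone adds about 44.9 years to the intercept of 62.77, while the TotExp and interaction terms are both close to zero at these inputs.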