The attached who.csv dataset contains real-world data from 2008. The variables included follow.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
who <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 605\\Assignments\\Week 12\\who.csv")
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_integer(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_integer(),
## GovtExp = col_integer(),
## TotExp = col_integer()
## )
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures
Using glimpse() and head(), we take a first look at the who data set:
glimpse(who)
## Observations: 190
## Variables: 10
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra",...
## $ LifeExp <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74,...
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986...
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983...
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0....
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297...
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500...
## $ PersExp <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 378...
## $ GovtExp <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 185...
## $ TotExp <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 19...
head(who)
## # A tibble: 6 x 10
## Country LifeE~ Infant~ Under~ TBFr~ PropMD PropRN Pers~ GovtE~ TotExp
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Afghani~ 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20 92 112
## 2 Albania 71 0.985 0.983 1.000 1.14e-3 4.61e-3 169 3128 3297
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108 5184 5292
## 4 Andorra 82 0.997 0.996 1.000 3.30e-3 3.50e-3 2589 169725 172314
## 5 Angola 41 0.846 0.740 0.997 7.04e-5 1.15e-3 36 1620 1656
## 6 Antigua~ 73 0.990 0.989 1.000 1.43e-4 2.77e-3 503 12543 13046
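The variable list describes TotExp as the sum of personal and government expenditures. As a quick sanity check, the sketch below tests that definition against the six rows printed by head(who). The vectors are typed in by hand from that output (the names `pers_exp`, `govt_exp`, and `tot_exp` are illustrative only), so it checks only those six rows and not the full file.

```r
# Values copied by hand from the head(who) output above (first six rows only).
pers_exp <- c(20, 169, 108, 2589, 36, 503)
govt_exp <- c(92, 3128, 5184, 169725, 1620, 12543)
tot_exp  <- c(112, 3297, 5292, 172314, 1656, 13046)

# TotExp should equal PersExp + GovtExp row by row.
sums_match <- (pers_exp + govt_exp) == tot_exp
print(all(sums_match))
```

With the full data loaded, the same comparison could be written as all(who$PersExp + who$GovtExp == who$TotExp). That has not been verified here for all 190 rows.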
Before attempting the exercises, we look at the summary statistics using the summary() function from base R.
summary(who)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
#run the linear model on the LifeExp and TotExp variables
who_lm <- lm(LifeExp~TotExp, data = who)
#create the scatter plot
splot<- ggplot(data = who)+
geom_point(mapping = aes(x = TotExp, y = LifeExp),
color = "blue")
print(splot + labs(title = "The Average Life Expectancy vs Total Expenditures", y="Life Expectancy",x="Total Expenditures"))
Next, we run a summary on the who_lm variable
summary(who_lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
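Based on information from the Statistics How To website, the F-statistic must be used in combination with the p-value when deciding whether the overall results are significant.
The F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14. This p-value is extremely small (far below 0.05), so the model is statistically significant: TotExp is a significant predictor of LifeExp.
The Multiple R-squared is equal to 0.2577, so the model explains only about 26% of the variability in life expectancy, which indicates that a strong linear relationship is not present.
The standard error of the TotExp coefficient is 7.795e-06 (the coefficient estimate itself is 6.297e-05), and the residual standard error is 9.371 years on 188 degrees of freedom.
The residuals are asymmetric (median 3.154, minimum -24.764, maximum 13.292), which suggests that the assumption of normally distributed residuals is not well met for this model.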
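The link between the F-statistic, its p-value, and R-squared can be checked by hand. The following is a minimal sketch that uses only the rounded numbers printed in the summary(who_lm) output, so its results will match the printed values only approximately. The variable names are illustrative.

```r
# Rounded values copied from the summary(who_lm) output above.
f_stat <- 65.26   # F-statistic
df1    <- 1       # numerator degrees of freedom
df2    <- 188     # denominator degrees of freedom
r2     <- 0.2577  # Multiple R-squared

# The overall p-value is the upper-tail probability of the F distribution.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# For a linear model, F can be recovered from R-squared and the degrees of freedom.
f_from_r2 <- (r2 / df1) / ((1 - r2) / df2)

print(p_value)     # a very small number means the model is significant
print(f_from_r2)   # should be close to the reported F-statistic
```

A small p-value is evidence against the hypothesis that the slope is zero; it says nothing by itself about how much of the variation the model explains, which is what R-squared measures.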
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
#create new values based off instruction
LifeExp_n <- who$LifeExp^4.6
TotExp_n <- who$TotExp^0.06
#run linear regression on updated values
who_lm_n <- lm(LifeExp_n~TotExp_n )
#create a new plot
splot2<- ggplot(data = who)+
geom_point(mapping = aes(x = TotExp_n, y = LifeExp_n),
color = "green")
print(splot2 + labs(title = "The Average Life Expectancy vs Total Expenditures (Updated)", y="Life Expectancy^4.6",x="Total Expenditures^0.06"))
Run summary statistics on new model
summary (who_lm_n)
##
## Call:
## lm(formula = LifeExp_n ~ TotExp_n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_n 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
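The F-statistic is 507.7 on 1 and 188 degrees of freedom, with a p-value below 2.2e-16. The p-value is again extremely small, so this model is also statistically significant.
The Multiple R-squared is equal to 0.7298, so the model explains about 73% of the variability in the transformed life expectancy, which indicates that a strong linear relationship is present.
The standard error of the TotExp_n coefficient is 27518940 (the coefficient estimate itself is 620060216), and the residual standard error is 90490000 on 188 degrees of freedom. Because the response is now on the LifeExp^4.6 scale, this residual standard error is not directly comparable to the 9.371 from the first model.
Based on the visual representation, the higher R-squared (0.7298 versus 0.2577), and the larger F-statistic (507.7 versus 65.26), I would say this model is better than the first.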
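Residual diagnostics are another way to compare the two fits beyond R-squared. The sketch below defines a hypothetical helper, `check_residuals()`, that is not part of the original analysis, and demonstrates it on R's built-in `cars` data so that the block runs without who.csv. In this report it would be called on who_lm and who_lm_n instead.

```r
# Hypothetical helper (not from the original analysis): summarise the
# residuals of any fitted lm object.
check_residuals <- function(fit) {
  res <- residuals(fit)
  list(
    median_residual = median(res),               # near 0 if residuals are symmetric
    shapiro_p       = shapiro.test(res)$p.value, # small value suggests non-normal residuals
    residuals       = res,
    fitted          = fitted(fit)
  )
}

# Stand-in model on the built-in `cars` data, used only to make the sketch runnable.
demo_fit   <- lm(dist ~ speed, data = cars)
demo_check <- check_residuals(demo_fit)

print(demo_check$median_residual)
print(demo_check$shapiro_p)
```

For a visual check, plot(fit, which = 1:2) draws the residuals-versus-fitted and normal Q-Q plots for a fitted model.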
Using the results from 3, forecast life expectancy when TotExp^.06=1.5. Then forecast life expectancy when TotExp^.06=2.5
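We achieve this by creating a data frame with the new values (1.5, 2.5) and then using the predict() function from base R, which makes predictions from the results of various model fitting functions. Because the model predicts LifeExp^4.6, we raise each prediction to the power 1/4.6 to convert it back to years.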
#build a data frame with the values of 1.5 & 2.5
values <-data.frame(TotExp_n=c(1.5,2.5))
predict(who_lm_n, values)^(1/4.6)
## 1 2
## 63.31153 86.50645
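The same forecasts can be reproduced by hand from the coefficients printed in the summary(who_lm_n) output, which makes the back-transformation explicit. This is a minimal sketch with illustrative variable names. It uses the printed coefficient values, so small rounding differences from predict() are possible.

```r
# Coefficients copied from the summary(who_lm_n) output above.
intercept <- -736527910
slope     <- 620060216

# New values of TotExp^0.06 to forecast at.
x_new <- c(1.5, 2.5)

# Step 1: the linear prediction is on the LifeExp^4.6 scale.
pred_transformed <- intercept + slope * x_new

# Step 2: undo the power transform to get life expectancy in years.
pred_years <- pred_transformed^(1 / 4.6)
print(pred_years)

# Total expenditure implied by each transformed value (undoing the 0.06 power).
implied_tot_exp <- x_new^(1 / 0.06)
print(implied_tot_exp)
```

The implied expenditure for 2.5 comes out far above the observed maximum TotExp of 482750 shown in the summary statistics, so the second forecast is an extrapolation beyond the data.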
Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model?
\[LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\]
who_lm_n2 <- lm(LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who)
Run summary statistics on the model
summary (who_lm_n2)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
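The F-statistic is 34.49 on 3 and 186 degrees of freedom, with a p-value below 2.2e-16. The p-value is extremely small, so the model is statistically significant overall, and each coefficient is individually significant (all p-values below 0.001).
The Multiple R-squared is equal to 0.3574, so the model explains only about 36% of the variability in life expectancy, which indicates that a strong linear relationship is not present.
The residual standard error is 8.765 years on 186 degrees of freedom, and the coefficient standard errors are 2.788e+02 for PropMD, 8.982e-06 for TotExp, and 1.472e-03 for the interaction. Overall, the model is only a modest improvement over the first model and is not a good fit.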
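In R's formula syntax, a * b already expands to a + b + a:b, so the model formula used here contains the main effects twice and R silently drops the duplicates. The sketch below illustrates this on the built-in `mtcars` data, used as a stand-in so the block runs without who.csv. The variable names are illustrative.

```r
# Long form, mirroring the structure of the formula used for who_lm_n2.
fit_long  <- lm(mpg ~ wt + hp + (hp * wt), data = mtcars)

# Short form: `wt * hp` expands to wt + hp + wt:hp.
fit_short <- lm(mpg ~ wt * hp, data = mtcars)

# Both formulas describe the same model, so the coefficients agree.
same_coefs <- isTRUE(all.equal(unname(coef(fit_long)), unname(coef(fit_short))))
print(names(coef(fit_long)))
print(same_coefs)
```

By the same rule, LifeExp ~ PropMD * TotExp would be a shorter way to write the model fitted above.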
Forecast LifeExp when PropMD=.03 and TotExp=14. Does this forecast seem realistic? Why or why not?
#build a data frame with the values of 0.03 & 14
values <-data.frame(PropMD = 0.03, TotExp = 14)
predict(who_lm_n2, values)
## 1
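## 107.696
This forecast does not seem realistic. A life expectancy of about 107.7 years is far above the maximum of 83 observed in the data. The inputs are also an extreme combination: PropMD = 0.03 is close to the observed maximum (0.0351), while TotExp = 14 is close to the observed minimum (13), so the model is extrapolating well outside the typical range of the data.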
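Breaking the forecast into the contribution of each term shows where the large value comes from. This sketch uses the rounded coefficients printed in the summary(who_lm_n2) output, so the total differs slightly from the predict() result. The variable names are illustrative.

```r
# Rounded coefficients copied from the summary(who_lm_n2) output above.
b0    <- 6.277e+01   # (Intercept)
b_md  <- 1.497e+03   # PropMD
b_exp <- 7.233e-05   # TotExp
b_int <- -6.026e-03  # PropMD:TotExp

# Inputs for the forecast.
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy, in years.
contributions <- c(
  intercept   = b0,
  PropMD      = b_md * prop_md,
  TotExp      = b_exp * tot_exp,
  interaction = b_int * prop_md * tot_exp
)
print(contributions)

forecast <- sum(contributions)
print(forecast)
```

The PropMD term alone adds roughly 45 years on top of the intercept, while the TotExp and interaction terms contribute almost nothing at such a low level of spending.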