Assignment

The attached who.csv dataset contains real-world data from 2008. The variables it includes are described below.

Load Libraries/who.csv Data Set

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
who <- read_csv("C:\\Users\\lizza\\Documents\\CUNY - Data Analytics\\DATA 605\\Assignments\\Week 12\\who.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   LifeExp = col_integer(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_integer(),
##   GovtExp = col_integer(),
##   TotExp = col_integer()
## )

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures
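As a quick sanity check on the TotExp definition, the sketch below retypes the first six rows of the three expenditure columns from the head(who) output in this document and tests whether PersExp + GovtExp equals TotExp. The name who_head is an assumption used only for this illustration; it is not part of the original analysis.

```r
# First six rows of the expenditure columns, typed in from head(who).
# who_head is a hypothetical name used only for this check.
who_head <- data.frame(
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# TRUE for each row where the two components add up to the total.
sum_matches <- who_head$PersExp + who_head$GovtExp == who_head$TotExp
print(all(sum_matches))
```

With the full data loaded, the same comparison could be run on who directly.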

Using glimpse() and head(), we take a first look at the who data set:

glimpse(who)
## Observations: 190
## Variables: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra",...
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74,...
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986...
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983...
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0....
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297...
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500...
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 378...
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 185...
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 19...
head(who)
## # A tibble: 6 x 10
##   Country  LifeE~ Infant~ Under~ TBFr~  PropMD  PropRN Pers~ GovtE~ TotExp
##   <chr>     <int>   <dbl>  <dbl> <dbl>   <dbl>   <dbl> <int>  <int>  <int>
## 1 Afghani~     42   0.835  0.743 0.998 2.29e-4 5.72e-4    20     92    112
## 2 Albania      71   0.985  0.983 1.000 1.14e-3 4.61e-3   169   3128   3297
## 3 Algeria      71   0.967  0.962 0.999 1.06e-3 2.09e-3   108   5184   5292
## 4 Andorra      82   0.997  0.996 1.000 3.30e-3 3.50e-3  2589 169725 172314
## 5 Angola       41   0.846  0.740 0.997 7.04e-5 1.15e-3    36   1620   1656
## 6 Antigua~     73   0.990  0.989 1.000 1.43e-4 2.77e-3   503  12543  13046

Before attempting the exercises, we look at the summary statistics using the summary() function in base R.

summary(who)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN         
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750

Exercise 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

# fit the linear model of LifeExp on TotExp
who_lm <- lm(LifeExp ~ TotExp, data = who)
# create the scatterplot
splot <- ggplot(data = who) +
  geom_point(mapping = aes(x = TotExp, y = LifeExp),
             color = "blue")
print(splot + labs(title = "The Average Life Expectancy vs Total Expenditures", y = "Life Expectancy", x = "Total Expenditures"))

Next, we call summary() on the fitted model who_lm.

summary(who_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

According to the Statistics How To website, the F statistic must be read together with its p-value when deciding whether the overall results are significant.

The F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14. This p-value is extremely small (far below 0.05), so the overall regression is statistically significant.
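The link between the F-statistic and its p-value can be checked by hand. The sketch below uses only numbers copied from the summary(who_lm) output; it relies on the fact that, in a regression with a single predictor, the squared t value of the slope equals the F-statistic, and it uses the base R function pf() for the upper tail of the F distribution. The variable names are illustrative.

```r
# Values copied from the summary(who_lm) output.
t_value <- 8.079   # t value of the TotExp coefficient
f_value <- 65.26   # overall F-statistic
df1 <- 1           # numerator degrees of freedom
df2 <- 188         # denominator degrees of freedom

# In simple linear regression the squared slope t value equals F
# (up to rounding in the printed output).
t_squared <- t_value^2

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

print(c(t_squared = t_squared, f_value = f_value))
print(p_value)
```

The computed p-value should be of the same tiny order as the 7.714e-14 reported by summary(), which is why the regression counts as significant rather than insignificant.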

The Multiple R-squared is 0.2577, so the model explains only about 26% of the variation in life expectancy; the linear fit is weak.

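The residual standard error is 9.371 years on 188 degrees of freedom, and the standard error of the TotExp coefficient is 7.795e-06 (the estimate itself is 6.297e-05). The residual quantiles are also lopsided (median 3.154, minimum -24.764, maximum 13.292), which suggests the residuals are not symmetric around zero and that the assumptions of simple linear regression may not be met.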
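The exercise also asks whether the assumptions of simple linear regression are met, which is usually judged from residual plots. The sketch below shows the two standard checks on simulated data, not on the WHO data: the data frame sim, its coefficients, and the noise level are all assumptions chosen only so that the response levels off as x grows. It is a minimal illustration of the technique, not a result from this analysis.

```r
set.seed(1)

# Simulated stand-in (an assumption, not the WHO data):
# x spans roughly the TotExp range and y grows with log(x).
sim <- data.frame(x = exp(runif(190, log(13), log(482750))))
sim$y <- 40 + 3.3 * log(sim$x) + rnorm(190, sd = 4)

# Straight-line fit on the untransformed scale.
sim_lm <- lm(y ~ x, data = sim)

# Residuals vs fitted: a curved or funnel-shaped band suggests
# the linearity or constant-variance assumption is violated.
plot(fitted(sim_lm), resid(sim_lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals.
qqnorm(resid(sim_lm))
qqline(resid(sim_lm))
```

The same two plots can be produced for the real model by replacing sim_lm with who_lm.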

Exercise 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
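The remark that TotExp^.06 is "nearly a log transform" can be illustrated numerically. The sketch below builds a hypothetical grid of expenditures spanning the observed TotExp range from summary(who) (13 to 482750) and measures how close to linear the relationship between the 0.06 power and the natural log is. The grid and variable names are assumptions for illustration.

```r
# Hypothetical grid spanning the observed TotExp range (13 to 482750),
# evenly spaced on the log scale.
tot_exp <- exp(seq(log(13), log(482750), length.out = 200))

power_tf <- tot_exp^0.06   # the transform used in this exercise
log_tf <- log(tot_exp)     # an ordinary natural-log transform

# A correlation close to 1 means the two transforms are
# nearly linearly related over this range.
tf_cor <- cor(power_tf, log_tf)
print(tf_cor)
```

A correlation very close to 1 here means a regression on TotExp^0.06 behaves much like a regression on log(TotExp).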

# create the transformed variables described in the exercise
LifeExp_n <- who$LifeExp^4.6
TotExp_n <- who$TotExp^0.06

# run the linear regression on the transformed variables
who_lm_n <- lm(LifeExp_n ~ TotExp_n)

# create a new scatterplot of the transformed variables
splot2 <- ggplot(data = who) +
  geom_point(mapping = aes(x = TotExp_n, y = LifeExp_n),
             color = "green")
print(splot2 + labs(title = "The Average Life Expectancy vs Total Expenditures (Updated)", y = "Life Expectancy^4.6", x = "Total Expenditures^0.06"))

Next, we call summary() on the new model.

summary(who_lm_n)
## 
## Call:
## lm(formula = LifeExp_n ~ TotExp_n)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_n     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic is 507.7 on 1 and 188 degrees of freedom, with a p-value below 2.2e-16. The p-value is again extremely small, so this model is also statistically significant.

The Multiple R-squared is 0.7298, so the model explains about 73% of the variation in the transformed response; a much stronger linear relationship is present.

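The residual standard error is 90490000 on 188 degrees of freedom, and the standard error of the TotExp_n coefficient is 27518940 (the estimate itself is 620060216). Because the response is now LifeExp^4.6, these values are not directly comparable to the standard errors of the first model.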

Based on the higher R-squared (0.7298 versus 0.2577), the larger F-statistic (507.7 versus 65.26), and the visual representation, the second model is better than the first.

Exercise 3

Using the results from Exercise 2, forecast life expectancy when TotExp^.06=1.5. Then forecast life expectancy when TotExp^.06=2.5.

We do this by building a data frame with the new values (1.5 and 2.5) and passing it to predict(), the base R function that makes predictions from a fitted model. Because the model predicts LifeExp^4.6, we raise the predictions to the power 1/4.6 to return to years.

# build a data frame with the values 1.5 and 2.5
values <- data.frame(TotExp_n = c(1.5, 2.5))
predict(who_lm_n, values)^(1/4.6)
##        1        2 
## 63.31153 86.50645
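The forecasts can be reproduced by hand from the fitted equation, which makes the back-transformation explicit. The sketch below copies the two coefficients from the summary(who_lm_n) output; the variable names are illustrative and the calculation assumes nothing beyond those printed values.

```r
# Coefficients copied from the summary(who_lm_n) output.
intercept <- -736527910
slope <- 620060216

# Transformed expenditure values to forecast at.
tot_exp_n <- c(1.5, 2.5)

# Prediction on the transformed scale (LifeExp^4.6).
life_exp_n_hat <- intercept + slope * tot_exp_n

# Undo the 4.6 power to get life expectancy in years.
life_exp_hat <- life_exp_n_hat^(1 / 4.6)
print(round(life_exp_hat, 2))
```

The results should agree with the predict() output to about two decimal places, since the printed coefficients are rounded to whole numbers.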

Exercise 4

Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model?

\[LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\]

who_lm_n2 <- lm(LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who)

Next, we call summary() on the model.

summary(who_lm_n2)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

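The F-statistic is 34.49 on 3 and 186 degrees of freedom, with a p-value below 2.2e-16. The p-value is extremely small, so the model as a whole is statistically significant. Each coefficient, including the PropMD:TotExp interaction, is also significant at the 0.001 level.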
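The F-statistic and R-squared are two views of the same fit: for a model with k predictors and n observations, F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The sketch below applies this identity to the R-squared values printed for the three models in this document; the helper name f_from_r2 is an assumption introduced for illustration.

```r
# Overall F-statistic implied by R-squared, k predictors, n observations.
# f_from_r2 is a hypothetical helper, not part of the original analysis.
f_from_r2 <- function(r2, k, n) {
  (r2 / k) / ((1 - r2) / (n - k - 1))
}

n <- 190  # number of countries in who

# R-squared values copied from the three summary() outputs.
f_model_1 <- f_from_r2(0.2577, k = 1, n = n)  # LifeExp ~ TotExp
f_model_2 <- f_from_r2(0.7298, k = 1, n = n)  # transformed model
f_model_3 <- f_from_r2(0.3574, k = 3, n = n)  # interaction model

print(round(c(f_model_1, f_model_2, f_model_3), 1))
```

Small differences from the reported 65.26, 507.7, and 34.49 come from the R-squared values being rounded to four decimals.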

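The Multiple R-squared is 0.3574 (adjusted 0.3471), so the model explains only about 36% of the variation in life expectancy, and the residual standard error is 8.765 years on 186 degrees of freedom. The model improves on the first (R-squared 0.2577) but is still a weak fit.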
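Because the model contains a PropMD:TotExp interaction, the effect of PropMD on predicted life expectancy is not a single number; it equals b1 + b3 x TotExp and so depends on the level of spending. The sketch below evaluates that expression using the coefficients from the summary(who_lm_n2) output and the minimum, median, and maximum TotExp from summary(who). The function name prop_md_effect is an assumption for illustration.

```r
# Coefficients copied from the summary(who_lm_n2) output.
b_prop_md <- 1.497e+03       # PropMD
b_interaction <- -6.026e-03  # PropMD:TotExp

# Change in predicted LifeExp per unit of PropMD at a given TotExp.
# prop_md_effect is a hypothetical helper used only for this sketch.
prop_md_effect <- function(tot_exp) {
  b_prop_md + b_interaction * tot_exp
}

# Effect at the minimum, median, and maximum TotExp from summary(who).
effects <- prop_md_effect(c(13, 5541, 482750))
print(effects)

# TotExp at which the PropMD effect changes sign.
break_even <- -b_prop_md / b_interaction
print(break_even)
```

The negative interaction coefficient means the estimated benefit of additional doctors shrinks as total expenditure rises, and the fitted effect turns negative only at spending levels far above the median.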

Exercise 5

Forecast LifeExp when PropMD=.03 and TotExp=14. Does this forecast seem realistic? Why or why not?

# build a data frame with the values 0.03 and 14
values <- data.frame(PropMD = 0.03, TotExp = 14)
predict(who_lm_n2, values)
##       1 
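## 107.696

The forecast of about 107.7 years does not seem realistic. The highest life expectancy in the data is 83 years, and the inputs are an unlikely combination: PropMD = 0.03 is close to the maximum observed (0.0351), while TotExp = 14 is close to the minimum (13). The model is extrapolating well outside the bulk of the data.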
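A hand calculation shows where this forecast comes from. The sketch below plugs PropMD = 0.03 and TotExp = 14 into the fitted equation using the rounded coefficients from the summary(who_lm_n2) output, then compares the result with the largest life expectancy in summary(who). Variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary(who_lm_n2) output.
b0 <- 6.277e+01   # intercept
b1 <- 1.497e+03   # PropMD
b2 <- 7.233e-05   # TotExp
b3 <- -6.026e-03  # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast.
terms <- c(
  intercept   = b0,
  prop_md     = b1 * prop_md,
  tot_exp     = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
print(terms)

life_exp_hat <- sum(terms)
print(life_exp_hat)

# Largest observed LifeExp, from summary(who).
observed_max <- 83
print(life_exp_hat > observed_max)
```

Nearly all of the increase over the intercept comes from the PropMD term, so the forecast is driven by an unusually high share of doctors rather than by spending. The small gap from the predict() value reflects rounding in the printed coefficients.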