Data 605 Homework Wk12

Bonnie Cooper


This assignment will use the ‘who.csv’ data set. The following code loads the necessary R libraries and data into the working environment:
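The loading code itself is not shown in this rendering. A minimal sketch of what it might look like, assuming the file sits in the working directory under the name `who.csv` and using base R only (the `Rows:`/`Columns:` summary that follows resembles `dplyr::glimpse()` output, but that is an inference, and `who_path` is a hypothetical name):

```r
# Hypothetical path: adjust to wherever who.csv actually lives.
who_path <- "who.csv"

if (file.exists(who_path)) {
  # Read the WHO data; keep Country as character rather than factor.
  who_df <- read.csv(who_path, stringsAsFactors = FALSE)
  # Base-R structure summary (similar in spirit to dplyr::glimpse).
  str(who_df)
} else {
  message("File not found: ", who_path)
}
```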

## Rows: 190
## Columns: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol…
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976…
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, …
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.…
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, …
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 1876…
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190…

1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpretation of Stats:

  1. F statistic: The F statistic tests whether there is any linear relationship between the predictor and the response variable. Here F = 65.26 on 1 and 188 degrees of freedom, which is far above 1, so the data are not consistent with a slope of zero.
  2. \(R^2\): The \(R^2\) value of 0.2577 is quite small and suggests that this model only describes ~26% of the variance in the data.
  3. Residual standard error: This describes the quality of the lm fit. At 9.371, it suggests that, on average, the model is off by ~9.4 years of life expectancy for each estimate.
  4. p-values: The p-value for the TotExp slope (7.71e-14) is much smaller than conventional limits (e.g. 0.05 or 0.01). Therefore we reject the null hypothesis that the slope is zero. With a single predictor, the overall F-test gives the same p-value.
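The relationships between these statistics can be checked with a little arithmetic. The sketch below uses only the numbers printed in the summary above (the variable names are mine), so small differences come from rounding in that printout:

```r
# Values copied from the printed model summary.
t_slope <- 8.079   # t value for the TotExp slope
f_stat  <- 65.26   # overall F-statistic
df_num  <- 1       # numerator degrees of freedom
df_den  <- 188     # denominator degrees of freedom

# With one predictor, F is the square of the slope's t value.
f_from_t <- t_slope^2

# R^2 recovered from F and its degrees of freedom.
r2_from_f <- (f_stat * df_num) / (f_stat * df_num + df_den)

# Overall p-value: upper tail of the F distribution.
p_overall <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

cat("t^2        :", f_from_t, "\n")
cat("R^2 from F :", r2_from_f, "\n")
cat("p-value    :", p_overall, "\n")
```

This makes it explicit that a large F and a tiny p-value say the slope is not zero, while \(R^2\) separately says how much of the variance that slope accounts for.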

From the stats summary, we see that the linear model fit is statistically significant. However, is a linear model appropriate for these data? The following visuals help us address this:


Just looking at the Residuals vs Fitted plot and the QQ-plot, we can see that a linear model is not appropriate here. The residuals deviate systematically from the horizontal zero line, and their spread changes across the range of fitted values; ideally the points should scatter around the zero line with even variance. In the QQ-plot, normally distributed residuals would fall along the dashed reference line, but these do not. The linearity, constant-variance, and normality assumptions are therefore not met, and we can conclude that a linear model is not appropriate here.
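The plotting code is not shown here. As a rough sketch of how such diagnostics can be produced in base R, the block below fits the same formula to only the first eight rows printed in the data summary. `who_small` and `fit_small` are hypothetical names, and the subset is purely illustrative, so the plots will not match those for the full data set:

```r
# First eight LifeExp / TotExp values as printed in the data summary;
# an illustrative subset, NOT the full 190-country data set.
who_small <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73, 75, 69),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944)
)

fit_small <- lm(LifeExp ~ TotExp, data = who_small)

# Residuals and fitted values can also be inspected directly.
diag_df <- data.frame(fitted = fitted(fit_small), resid = resid(fit_small))
print(diag_df)

# Residuals vs Fitted (which = 1) and Normal Q-Q (which = 2).
par(mfrow = c(1, 2))
plot(fit_small, which = 1:2)
```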

2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
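A sketch of how the transformed data frame might be built, assuming (as the `lm` call in the output suggests) that the transformed columns keep the names `LifeExp` and `TotExp`. It uses only the first eight rows printed in the data summary, and the `_small` names are hypothetical, so the coefficients will differ from the full-data fit:

```r
# Illustrative subset: first eight rows from the printed data summary.
who_df_small <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73, 75, 69),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944)
)

# Overwrite both columns with their power-transformed versions.
who_df_mod_small <- transform(
  who_df_small,
  LifeExp = LifeExp^4.6,
  TotExp  = TotExp^0.06
)

# Same formula as before, now on the transformed scale.
fit_mod_small <- lm(LifeExp ~ TotExp, data = who_df_mod_small)
print(coef(fit_mod_small))
```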

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df_mod)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Interpretation of Stats:

  1. F statistic: For this linear model, F = 507.7 on 1 and 188 degrees of freedom, which is much greater than 1. This indicates a statistically significant relationship between the predictor and the response variable.
  2. \(R^2\): The \(R^2\) value of 0.7298 suggests that the model describes ~73% of the variance in the transformed data.
  3. Residual standard error: This value suggests that a typical estimate from the model is off by 90490000. This seems really big compared to the previous model, but it is in units of LifeExp^4.6 rather than years, so the two values are not directly comparable. However, I do not see why we would beat our data into a linear shape, and introduce artifacts in the process, instead of just fitting a nonlinear model.
  4. p-values: The p-values (< 2.2e-16) are very small, suggesting that the linear fit of this model is statistically significant.
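To get a feel for the residual standard error in interpretable units, one option is to back-transform a fitted value plus and minus one standard error. This is only a rough sketch: it uses the rounded coefficients printed above, picks TotExp^.06 = 1.5 as an arbitrary example point, and the function name `to_years` is mine:

```r
# Coefficients and residual standard error from the printed summary.
b0  <- -736527910
b1  <-  620060216
rse <-   90490000

# Undo the LifeExp^4.6 transform.
to_years <- function(y) y^(1 / 4.6)

x_t   <- 1.5              # example value of TotExp^0.06
fit_t <- b0 + b1 * x_t    # fitted value on the transformed scale

center <- to_years(fit_t)
lower  <- to_years(fit_t - rse)
upper  <- to_years(fit_t + rse)

cat("fitted (years):", center, "\n")
cat("fitted - 1 RSE:", lower, "\n")
cat("fitted + 1 RSE:", upper, "\n")
```

Because the back-transform is nonlinear, the band is not symmetric and its width in years depends on where along the curve it is evaluated.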

These visualizations will guide our decision as to whether a linear model is appropriate for our data:

Residuals vs Fitted: the bulk of the residuals clusters around the horizontal zero line across the fitted values, but there are some changes in the variance across the data range. QQ-plot: the points fall reasonably well along the reference line. Therefore, we can say that a linear model is reasonably appropriate for the transformed data.

In conclusion, we can say that this second linear model, fit to the transformed data, does a better job of modeling LifeExp ~ TotExp.

3

Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

## Prediction at TotExp^.06 = 1.5: 
## 193562413.987494
## Prediction at TotExp^.06 = 2.5: 
## 813622629.638233
## Rescaled Prediction at TotExp^.06 = 1.5: 
## 63.3115334469743
## Rescaled Prediction at TotExp^.06 = 2.5: 
## 86.5064484844719
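These forecasts can be reproduced by hand from the coefficients printed in the model 2 summary. The following is a sketch under that assumption (the author may well have used `predict()` on the fitted model instead; the helper name `forecast_transformed` is hypothetical):

```r
# Coefficients from the transformed simple regression summary.
b0 <- -736527910
b1 <-  620060216

# Forecast on the transformed scale (LifeExp^4.6).
forecast_transformed <- function(x_t) b0 + b1 * x_t

x_t       <- c(1.5, 2.5)              # values of TotExp^0.06
pred_t    <- forecast_transformed(x_t)
# Take the 4.6th root to return to life expectancy in years.
pred_year <- pred_t^(1 / 4.6)

print(data.frame(TotExp_t = x_t, LifeExp_t = pred_t, LifeExp = pred_year))
```

The rescaling step matters: the raw predictions are in units of LifeExp^4.6 and only become life expectancies after the root is taken.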

4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

The question is phrased ambiguously. Am I to fit to the original or the scaled data? Here is a look at the original WHO data:

## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + TotExp * PropMD, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

…and here is a look with the artificially scaled WHO data:

## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + TotExp * PropMD, data = who_df_mod)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -296470018  -47729263   12183210   60285515  212311883 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.244e+08  5.083e+07 -14.253   <2e-16 ***
## TotExp         6.048e+08  3.023e+07  20.005   <2e-16 ***
## PropMD         4.727e+10  2.258e+10   2.094   0.0376 *  
## TotExp:PropMD -2.121e+10  1.131e+10  -1.876   0.0622 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88520000 on 186 degrees of freedom
## Multiple R-squared:  0.7441, Adjusted R-squared:   0.74 
## F-statistic: 180.3 on 3 and 186 DF,  p-value: < 2.2e-16

From the look of the two fits, I’m going to assume that the exercise wants us to use the scaled data, so I’ll evaluate those summary stats:

Interpretation of Stats:

  1. F statistic: F = 180.3 on 3 and 186 degrees of freedom is much greater than 1. This indicates a statistically significant relationship between the predictors and the response variable.
  2. \(R^2\): The \(R^2\) value of 0.7441 suggests that the model describes ~74% of the variance in the data.
  3. Residual standard error: This suggests a typical estimate from the model is off by 88520000 (in units of LifeExp^4.6). This is a slight improvement over the previous model's 90490000.
  4. p-values: The overall p-value (< 2.2e-16) is very small, suggesting that the linear fit of this model is statistically significant. Among the individual terms, TotExp is highly significant, PropMD is significant only at the 0.05 level (p = 0.0376), and the TotExp:PropMD interaction is not significant at that level (p = 0.0622).
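Since adding predictors can never lower \(R^2\), adjusted \(R^2\) is a fairer way to compare this model with the simpler one from section 2. A small sketch that recomputes it from the printed \(R^2\) values, assuming n = 190 observations (the row count shown in the data summary); `adj_r2` is a hypothetical helper:

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 190  # rows in the data set

adj_simple <- adj_r2(0.7298, n, p = 1)  # transformed simple regression
adj_multi  <- adj_r2(0.7441, n, p = 3)  # transformed multiple regression

cat("adjusted R^2, simple  :", adj_simple, "\n")
cat("adjusted R^2, multiple:", adj_multi, "\n")
cat("difference            :", adj_multi - adj_simple, "\n")
```

The results agree with the printed summaries up to rounding, and the gap between the two models is only about one percentage point of explained variance.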

There aren’t clear gains in statistical significance of fit for this model over the 2nd model. In situations like this, I prefer to stick with the simpler model. However, I do not agree with forcing the data to look linear (use of scaling), because this introduces artifacts and distortions to the data. It is clear from the initial linear fit that a nonlinear model is more appropriate for these data.

5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

## Prediction at TotExp^.06 = 14: 
## 250760452.972775
## Rescaled Prediction at TotExp^.06 = 14: 
## 66.9770290789385

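No, this does not seem like a realistic forecast to make, because it is very far outside of the range of the existing data. A transformed value of TotExp^.06 = 14 corresponds to a TotExp on the order of 10^19, far beyond any expenditure in the data set.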
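The forecast can be approximated from the rounded coefficients in the transformed multiple regression summary. The sketch below does this for the value as used above (14 passed directly as TotExp^.06) and, as an alternative reading that is purely my assumption, for 14 treated as raw TotExp and transformed first. The helper name `predict_years` is hypothetical, and rounding in the printed coefficients shifts the result slightly from the output shown:

```r
# Rounded coefficients from the transformed multiple regression summary.
b0    <- -7.244e8
b_tot <-  6.048e8
b_md  <-  4.727e10
b_int <- -2.121e10

# Forecast in years; tot_t is TotExp^0.06, prop_md is PropMD.
predict_years <- function(tot_t, prop_md) {
  y_t <- b0 + b_tot * tot_t + b_md * prop_md + b_int * tot_t * prop_md
  y_t^(1 / 4.6)
}

# Reading 1: 14 is already on the transformed scale (as in the output above).
as_printed <- predict_years(14, 0.03)

# Reading 2 (assumption): 14 is raw TotExp, so transform it first.
raw_reading <- predict_years(14^0.06, 0.03)

# Raw TotExp implied by a transformed value of 14.
implied_totexp <- 14^(1 / 0.06)

cat("14 as TotExp^0.06:", as_printed, "years\n")
cat("14 as raw TotExp :", raw_reading, "years\n")
cat("implied raw TotExp for reading 1:", implied_totexp, "\n")
```

Either way the inputs should be compared against the observed ranges of PropMD and TotExp before the forecast is trusted.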