Background

The purpose of the assignment was to explore the properties of linear regression.

Load data

After downloading the .csv file from Blackboard and uploading it to GitHub, we read the raw data and then familiarize ourselves with the dataset by displaying the column names, the number of columns, the number of rows, and the first 6 observations.

#Load readr for read_csv(), then read the .csv data
library(readr)
who <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/Support-Files/who.csv")
## 
## -- Column specification --------------------------------------------------------------------
## cols(
##   Country = col_character(),
##   LifeExp = col_double(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_double(),
##   GovtExp = col_double(),
##   TotExp = col_double()
## )
#Familiarize ourselves with the dataset
colnames(who)
##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"
ncol(who)
## [1] 10
nrow(who)
## [1] 190
head(who)
## # A tibble: 6 x 10
##   Country LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>     <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghan~      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania      71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria      71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra      82          0.997          0.996  1.00  3.30e-3 3.50e-3    2589
## 5 Angola       41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigu~      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ... with 2 more variables: GovtExp <dbl>, TotExp <dbl>

From the above outputs we see that our dataset has 10 columns and 190 rows, with column headers as explained in the assignment spec. Additionally, the first six rows give a rough idea of the values we might expect per column (e.g., LifeExp runs from 41 to 82 in these rows).
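The first six rows only hint at each column's spread. As a minimal sketch, the ranges can be computed directly. The small data frame below is a stand-in typed in from the head() output above (the name `who_head` is hypothetical); the same calls would apply to the full `who` tibble.

```r
# Stand-in for two columns of the first six rows printed by head(who);
# values are copied from that output, not read from the .csv
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  PersExp = c(20, 169, 108, 2589, 36, 503)
)

# Minimum and maximum of each numeric column
col_ranges <- sapply(who_head, range)
rownames(col_ranges) <- c("min", "max")
col_ranges
```

On the full dataset, summary(who) reports the same minimum and maximum along with the quartiles for every column.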

Now that we’ve familiarized ourselves with the data at hand, we can move on to the exercises …

Exercise 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Written explanation

attach(who)
plot(TotExp, LifeExp, main = "Life Expectancy vs. Total Expenditures per Country", xlab = "Sum of Personal and Gov't Expenditures ($)", ylab = "Avg Life Expectancy (years)")

As can be seen above, we have a clearly nonlinear relationship on our hands, so we should not expect a straight line to describe it well.

Next, we’ll run a simple linear regression and interpret the corresponding statistics:

who.lm <- lm(LifeExp ~ TotExp)
summary(who.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Regarding simple linear regression assumptions:

  • Residual values: for a good fit, we would expect residuals roughly normally distributed around zero. Although our 1Q and 3Q values are of a similar scale, our median (3.154) is above 0 and the 1Q (-4.778) and 3Q (7.116) values are not equidistant from it. Additionally, the Min (-24.764) is almost twice the magnitude of the Max (13.292), so the residuals have a long negative tail. A better model would have a median nearer 0, and Min/Max and 1Q/3Q values closer in magnitude … I would consider failing the fit based on these observations but will instead opt for a “conditional pass” to further analyze the data. PASS

  • Coefficients: for a good model, we’d like to see a standard error on the scale of 5-10x smaller than our corresponding coefficient. The standard error of the TotExp slope (7.795e-06) is ~8x smaller than the slope itself (6.297e-05), which meets this condition. PASS

  • R^2 value: values closer to 1 indicate a better fit and representation of the data set of interest. Based on our R^2 value of 0.2577, our model explains only ~26% of the data’s variation and is thus not a good fit … FAIL

  • F-statistic p-value: the F-statistic p-value (7.714e-14) is far below 0.05, indicating that the model fits significantly better than an intercept-only model. PASS
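The t value, F-statistic, and R^2 in this summary are tied together when there is a single predictor, which gives a quick consistency check on the numbers being interpreted. A minimal sketch using values copied from the printed summary (the variable names are ours, and small differences from the printed figures come from rounding):

```r
# Values copied from the printed summary(who.lm) output
slope    <- 6.297e-05  # TotExp estimate
slope_se <- 7.795e-06  # TotExp standard error
f_stat   <- 65.26      # F-statistic
df_resid <- 188        # residual degrees of freedom

# t value = estimate / standard error
t_val <- slope / slope_se

# With a single predictor, the F-statistic equals the slope's t value squared
f_from_t <- t_val^2

# With a single predictor, R^2 = F / (F + residual degrees of freedom)
r2_from_f <- f_stat / (f_stat + df_resid)

round(c(t = t_val, F = f_from_t, R2 = r2_from_f), 4)
```

This is also why the slope's p-value and the F-statistic's p-value agree in the summary: in simple linear regression they test the same hypothesis.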

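The assumptions of simple linear regression are NOT MET: the scatterplot shows a nonlinear relationship, the residuals are not centered on 0, and the model explains little of the variation.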
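The quartiles in summary() are only a rough guide to residual behaviour. As a hedged sketch of a more direct check, the block below fits a straight line to simulated data (NOT the WHO data) whose true shape rises quickly and then levels off, and then inspects the residuals. All names and parameter values here are assumptions chosen for illustration.

```r
set.seed(605)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data: y rises quickly with x and then levels off
n <- 190
x <- runif(n, min = 10, max = 5000)
y <- 40 + 5 * log(x) + rnorm(n, sd = 3)

# Deliberately fit a straight line to the curved relationship
fit <- lm(y ~ x)
res <- residuals(fit)

# Least-squares residuals always average to zero when an intercept is fit,
# so the median and the tails are the informative part of the summary
resid_summary <- c(mean = mean(res), median = median(res),
                   min = min(res), max = max(res))
round(resid_summary, 3)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
normality_p <- shapiro.test(res)$p.value
normality_p
```

With the real model, plot(who.lm, which = 1:2) would draw the residuals-vs-fitted and normal Q-Q plots, which show curvature and skew more clearly than the five-number summary.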

Exercise 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Written explanation

attach(who)
## The following objects are masked from who (pos = 3):
## 
##     Country, GovtExp, InfantSurvival, LifeExp, PersExp, PropMD, PropRN,
##     TBFree, TotExp, Under5Survival
plot(TotExp^.06, LifeExp^4.6, main = "Transformed Life Expectancy vs. Transformed Total Expenditures per Country", xlab = "(Sum of Personal and Gov't Expenditures)^0.06", ylab = "(Avg Life Expectancy)^4.6")

LifeExp2 <- LifeExp^4.6
TotExp2 <- TotExp^.06

who.lm2 <- lm(LifeExp2 ~ TotExp2)
summary(who.lm2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The model holds up on most of the checks, with one flag on the residuals:

  • Residual values: we’re dealing with massive values this time because the response is now LifeExp^4.6. While our 1Q and 3Q values are of similar magnitude, and our Min is only ~1.5x the magnitude of our Max, the median (13697187) is still above 0, so our residuals are not centered on 0. This is a red flag and thus we’ll opt for a “conditional pass” to further analyze the data. PASS

  • Coefficients: our standard errors are roughly 16x (intercept) and 23x (slope) smaller than the corresponding coefficients, comfortably beyond the 5-10x guideline. PASS

  • R^2 value: based on our R^2 value of 0.7298, our model explains ~73% of the variation in the transformed response, which is a good fit … PASS

  • F-statistic p-value: the F-statistic p-value (< 2.2e-16) is far below 0.05, indicating that the model fits significantly better than an intercept-only model. PASS

Where the last model did not pass because of a low R^2 value, this model raised a flag since the residuals were not centered on a median of 0, but it otherwise passed every check. It’s capable of explaining ~73% of the variation in the transformed response.

Thus, this model seems to be better.

Exercise 3

Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

First we’ve got to retrieve our y-intercept and slope to define the corresponding equation. Once we’ve defined our equation, we input the given TotExp^.06 value to get a fitted value of LifeExp^4.6, and then raise that value to the power 1/4.6 to convert it back to years.

  • TotExp^.06 = 1.5 –> LifeExp ≈ 63.3 years

  • TotExp^.06 = 2.5 –> LifeExp ≈ 86.5 years

Calculations shown below:

#Retrieve our y-intercept and slope
who.lm2
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2)
## 
## Coefficients:
## (Intercept)      TotExp2  
##  -736527909    620060216
#Calculate life expectancy for TotExp^.06 = 1.5:
x <- 1.5 # TotExp^.06 = 1.5
y <- -736527909 + 620060216 * x # fitted LifeExp^4.6
#y is on the LifeExp^4.6 scale, so take the 4.6th root (not a log)
y_1.5 <- y^(1/4.6) # to reverse LifeExp^4.6
round(y_1.5, 2)
## [1] 63.31
#Calculate life expectancy for TotExp^.06 = 2.5:
x <- 2.5 # TotExp^.06 = 2.5
y <- -736527909 + 620060216 * x # fitted LifeExp^4.6
y_2.5 <- y^(1/4.6) # to reverse LifeExp^4.6
round(y_2.5, 2)
## [1] 86.51
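Because the second model was fit on LifeExp^4.6, a fitted value is converted back to years by raising it to the power 1/4.6. The inverse of a power is a root, not a logarithm. Written out with the coefficients printed above (results rounded to one decimal place):

```latex
\begin{align*}
\widehat{\text{LifeExp}^{4.6}} &= -736527909 + 620060216 \cdot \text{TotExp}^{0.06} \\
\widehat{\text{LifeExp}} &= \left(-736527909 + 620060216 \cdot \text{TotExp}^{0.06}\right)^{1/4.6} \\
\text{TotExp}^{0.06} = 1.5:&\quad (193562415)^{1/4.6} \approx 63.3 \text{ years} \\
\text{TotExp}^{0.06} = 2.5:&\quad (813622631)^{1/4.6} \approx 86.5 \text{ years}
\end{align*}
```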

Exercise 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

Given equation: LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

who.lm3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
summary(who.lm3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Regarding the model diagnostics (this is now a multiple regression):

  • Residual values: the median (2.098) is slightly above 0, the 1Q and 3Q values are of similar magnitude, and the Min (-27.320) is about twice the magnitude of the Max (13.074). Interpreting this data, it appears our residuals have a slight concentration above 0 with a long negative tail. Ideally our median would be ~0 and the Max and Min values would be of similar magnitude. We can flag this and conditionally pass for further analysis. PASS

  • Coefficients: while the PropMD:TotExp standard error is only ~4x smaller than the corresponding coefficient, all others are 5-10x smaller. PASS

  • R^2 value: our model explains ~35.7% of the data’s variation (adjusted R^2 ~34.7%). This is not a good fit … FAIL

  • F-statistic p-value: the F-statistic p-value (< 2.2e-16) is far below 0.05, indicating that the model fits significantly better than an intercept-only model. PASS

The low R^2 value in combination with the conditional pass of the residual values is enough to say that this is not a good model. This model is better than the 1st model, worse than the 2nd, and not a good fit in general.
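The negative PropMD:TotExp coefficient means the estimated effect of PropMD on LifeExp shrinks as TotExp grows: the slope with respect to PropMD is b1 + b3 x TotExp. A small sketch with coefficients copied from the summary above (the helper name `propmd_slope` and the spending levels are hypothetical, chosen for illustration only):

```r
# Coefficients copied from the printed summary(who.lm3) output
b1 <- 1.497e+03   # PropMD
b3 <- -6.026e-03  # PropMD:TotExp interaction

# Estimated change in LifeExp per unit of PropMD at a given TotExp
propmd_slope <- function(tot_exp) b1 + b3 * tot_exp

# Slope at a few illustrative spending levels
slopes <- propmd_slope(c(0, 1000, 100000))
slopes

# TotExp at which the estimated PropMD effect reaches zero
tot_exp_zero <- -b1 / b3
tot_exp_zero
```

Read this way, the model says additional physicians are associated with the largest gains in life expectancy where total expenditures are lowest.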

Exercise 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Given:

  • LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp,
  • PropMD = .03,
  • TotExp = 14, and
  • coefficient values as calculated by our linear model.

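Our forecast LifeExp is 107.6785 years.

The worldwide average life expectancy is in the low 70s, and the highest value among the rows displayed earlier is 82, so this forecast does not seem realistic (unless we were considering only Blue Zones). The inputs are also unusual: PropMD = .03 is roughly ten times the largest PropMD among the rows displayed earlier (3.30e-3), while TotExp = 14 is lower than any PersExp shown there, so the model is being pushed well outside the data it was fit on.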

#Observe our b values
who.lm3
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
## 
## Coefficients:
##   (Intercept)         PropMD         TotExp  PropMD:TotExp  
##     6.277e+01      1.497e+03      7.233e-05     -6.026e-03
#Specify our equation and substitute our givens
b0 <- 6.277 * 10^1
b1 <- 1.497 * 10^3
b2 <- 7.233 * 10^-5
b3 <- -6.026 * 10^-3
x1 <- .03 #PropMD
x2 <- 14 #TotExp
y <- b0 + (b1 * x1) + (b2 * x2) + (b3 * x1 * x2)
y
## [1] 107.6785
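Breaking the forecast into its four terms shows where the total comes from. A minimal sketch, reusing the rounded coefficients and inputs from above (the vector name `contrib` is ours):

```r
# Rounded coefficients copied from the printed who.lm3 output
b0 <- 6.277e+01   # intercept
b1 <- 1.497e+03   # PropMD
b2 <- 7.233e-05   # TotExp
b3 <- -6.026e-03  # PropMD:TotExp interaction

x1 <- 0.03  # PropMD
x2 <- 14    # TotExp

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = b0,
  PropMD      = b1 * x1,
  TotExp      = b2 * x2,
  interaction = b3 * x1 * x2
)
round(contrib, 4)

# The contributions add up to the forecast
sum(contrib)
```

Nearly all of the increase over the intercept comes from the PropMD term (about 44.9 years); the TotExp and interaction terms are negligible at TotExp = 14.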