raw <- read.csv("https://learn-us-east-1-prod-fleet02-xythos.content.blackboardcdn.com/61aab133e7df2/14511441?X-Blackboard-Expiration=1650337200000&X-Blackboard-Signature=T3cp%2BNJcKIFqv73QUn7g%2BXujpg85eArvqoRwe9b4QYE%3D&X-Blackboard-Client-Id=100211&response-cache-control=private%2C%20max-age%3D21600&response-content-disposition=inline%3B%20filename%2A%3DUTF-8%27%27who.csv&response-content-type=text%2Fcsv&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDUaCXVzLWVhc3QtMSJHMEUCIDsdIWfNJTMTy8%2BkQF0QkUApF8w59cWvBLzBtfh2EDpiAiEA%2BPOkUoawzwXF%2BatA%2B6FI4egPWuRtIrjTNg3RqnDPepUqgwQI3v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgw2MzU1Njc5MjQxODMiDLYuZrTntmchld4ItSrXAxQGLpsqRAO4l%2BH23HmR%2BXXdmyPFV9%2Bs39%2FhXmYedTULwH1kHSuU9YJbDpnKJJc9ybzJZ%2FLSEkGEVb8DLIU2iscC7Iz0TPIQqQlcYP5zXKWz6VFzHmlRAfIPCihB53K19XSQuMt79LhjlblFxBz6VnEg9JvrnJacBtHxIwui%2FDg8Yo3LNFByhO3qslQM7rLy7jCV8aBI%2BCivyOv0MIeLn7TjiiVeG3gsR8PY7iAJngVB7qsPp3VnnM%2FdAuGxL%2BTyeJvLYOgEbYxGtbpUds43NxHLJYFVjcP2QKmZriss0LH4oSP6thrQXrgf%2FLWJxrgoU3ooOvf%2Bhp2MpAdW01Ep8dkxPf2%2FvKLu%2BOYsc3v5MYY3JT7sEYlB23Ab6PZQosAo5JBBKn4UXD%2FPCJK%2BT3m%2FoRpYgQu8fAExQ5FakzFBrHFzTQFNaG%2F0O%2B7WyfZ8TlsoLd1KGBWY7wlL8PAlAc99tJFvfMSBXT%2BeL6fcZzseeKWnzbPsjRZa8PpvNvyofoloEhivvzAWTFcpAlW%2BewwbAacIrsF1TepPSPgqSSNEQQt30uj%2BI8%2BkeUeO4pRJHl5gIN5pQDJ%2FG%2FTQCTabbOhIZNzHt6GiI3saSHxQlmjkJkYFtrv%2BFUtZ6zC5nfeSBjqlAZvVR9tgJFoNOtyur7THWGz74R94ZMt6N1mo7NMZW198aH%2B1cYqablfwrTSCqwqLqUKw74dqpEvZujv8tsnrDclTpdcbsXRUOhKPcE1Ox7%2BMvEqVvpBPH7%2FKhHzmNnJ8uQzpznOdy9k%2FvnyT%2FvQGQOu7MzElDEVqrsbCpEGTNbs0juVOFfRYyOZXfup00FY2JysKrRc%2FFO5IYM1NbbzF26gHR6aTKA%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20220418T210000Z&X-Amz-SignedHeaders=host&X-Amz-Expires=21600&X-Amz-Credential=ASIAZH6WM4PLRCSZ5KXV%2F20220418%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=002fb3e6b7e541d22798771d558790bf1db37d7dd00eab49e32e217ba28e5a5c")

Question 1

sct1 <- ggplot(data = raw, aes(x = LifeExp, y = TotExp)) +
  geom_point() +
  xlab("Life Expectancy") +
  ylab("Total Expense") +
  ggtitle("Scatterplot of Total Expense vs. Life Expectancy")+
  theme(plot.title = element_text(hjust = 0.5, size = 7))

sct1

linfit <- lm(LifeExp ~ TotExp, data = raw)

summary(linfit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Analysis

The F-statistic is 65.26 on 1 and 188 degrees of freedom. The p-value is \(7.714 \times 10^{-14}\). The high F-statistic coupled with a near zero p-value indicates that the model provides a significant relationship and that the null hypothesis can be rejected.

The \(R^2\) is 0.2537, which is not necessarily indicative of a good fit. A low \(R^2\) value could indicate that the model does not sufficiently outperform the null hypothesis. The standard error is 9.371 on 188 degrees of freedom. This standard error indicates a less than ideal fit. The linear model may not be appropriate for this data, at least not for the whole data set.

Regression Analysis

## List of 1
##  $ plot.title:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : num 7.5
##   ..$ hjust        : num 0.5
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

The scatterplot is only linear for values above the age of 75 and below the age of 75. The histogram of the residuals heavily skewed. The residuals appear to not have constant variability. The variability follows a distinct pattern. The normal probability plot is not linear. The criteria for simple regression are not met.

Question 2

LifeExpExp <- (raw$LifeExp)^4.6
TotExpExp <- (raw$TotExp)^0.06

fit2 <- lm(LifeExpExp ~ TotExpExp)

summary(fit2)
## 
## Call:
## lm(formula = LifeExpExp ~ TotExpExp)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExpExp    620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Analysis

The F-statistic is 507.7 on 1 and 188 degrees of freedom. The p-value is \(2.2 \times 10^{-16}\). The high F-statistic coupled with a near zero p-value indicates that the model provides a significant relationship and that the null hypothesis can be rejected.

The \(R^2\) is 0.7283, which is much more indicative of a good fit. The standard error is 90490000 on 188 degrees of freedom. This standard error indicates a very poor fit. The linear model may not be appropriate for this data, at least not for the whole data set.

Regression Analysis

## List of 1
##  $ plot.title:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : num 7.5
##   ..$ hjust        : num 0.5
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

The scatterplot is only linear for values above the age of 75 and below the age of 75. The histogram of the residuals is nearly normal, mostly for the values to the right side of the distribution. The residuals appear to have constant variability, following no particular pattern. The normal probability plot is far more linear than the prior model. The criteria for simple regression are not met because the scatterplot is non linear. The data would need to be trimmed in order for the linear model to be sufficient.

This model is better than the first.

Question 3

#x = TotExp^0.6
pred <- function(x){
  (-736527910 + 620060216*x)^(1/4.6)
}

pred(1.5)
## [1] 63.31153

In the first case, life expectancy is predicted to be 63.31 years.

pred(2.5)
## [1] 86.50645

In the second case, life expectancy is predicted to increase to 86.51 years.

Question 4

mod <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = raw)
  
summary(mod)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is 34.49 and the p-value is \(2.2 \times 10^{-16}\). This is indicative that the model is better than the null hypothesis. The \(R^2\) is 0.3471, which is not as high as the second model, but it is an improvement over the first model. The standard error is 8.765 on 186 degrees of freedom, which indicates a strong fit. The p-values are near zero for each variable, which is indicative of each variable adding value to the model.

Question 5

predfull <- function(x, y){
  62.77 + 1497*x + 0.00007233*y - 0.006026*x*y
}

predfull(0.03,14)
## [1] 107.6785
summary(raw$PropMD)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000196 0.0002444 0.0010474 0.0017954 0.0024584 0.0351290

The predicted life expectancy in this case is incredibly high, with a value of 107.68 years. This is unrealistic, and that is likely due to the PropMD value being tested being well outside of the IQR for the PropMD variable.