raw <- read.csv("https://learn-us-east-1-prod-fleet02-xythos.content.blackboardcdn.com/61aab133e7df2/14511441?X-Blackboard-Expiration=1650337200000&X-Blackboard-Signature=T3cp%2BNJcKIFqv73QUn7g%2BXujpg85eArvqoRwe9b4QYE%3D&X-Blackboard-Client-Id=100211&response-cache-control=private%2C%20max-age%3D21600&response-content-disposition=inline%3B%20filename%2A%3DUTF-8%27%27who.csv&response-content-type=text%2Fcsv&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDUaCXVzLWVhc3QtMSJHMEUCIDsdIWfNJTMTy8%2BkQF0QkUApF8w59cWvBLzBtfh2EDpiAiEA%2BPOkUoawzwXF%2BatA%2B6FI4egPWuRtIrjTNg3RqnDPepUqgwQI3v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgw2MzU1Njc5MjQxODMiDLYuZrTntmchld4ItSrXAxQGLpsqRAO4l%2BH23HmR%2BXXdmyPFV9%2Bs39%2FhXmYedTULwH1kHSuU9YJbDpnKJJc9ybzJZ%2FLSEkGEVb8DLIU2iscC7Iz0TPIQqQlcYP5zXKWz6VFzHmlRAfIPCihB53K19XSQuMt79LhjlblFxBz6VnEg9JvrnJacBtHxIwui%2FDg8Yo3LNFByhO3qslQM7rLy7jCV8aBI%2BCivyOv0MIeLn7TjiiVeG3gsR8PY7iAJngVB7qsPp3VnnM%2FdAuGxL%2BTyeJvLYOgEbYxGtbpUds43NxHLJYFVjcP2QKmZriss0LH4oSP6thrQXrgf%2FLWJxrgoU3ooOvf%2Bhp2MpAdW01Ep8dkxPf2%2FvKLu%2BOYsc3v5MYY3JT7sEYlB23Ab6PZQosAo5JBBKn4UXD%2FPCJK%2BT3m%2FoRpYgQu8fAExQ5FakzFBrHFzTQFNaG%2F0O%2B7WyfZ8TlsoLd1KGBWY7wlL8PAlAc99tJFvfMSBXT%2BeL6fcZzseeKWnzbPsjRZa8PpvNvyofoloEhivvzAWTFcpAlW%2BewwbAacIrsF1TepPSPgqSSNEQQt30uj%2BI8%2BkeUeO4pRJHl5gIN5pQDJ%2FG%2FTQCTabbOhIZNzHt6GiI3saSHxQlmjkJkYFtrv%2BFUtZ6zC5nfeSBjqlAZvVR9tgJFoNOtyur7THWGz74R94ZMt6N1mo7NMZW198aH%2B1cYqablfwrTSCqwqLqUKw74dqpEvZujv8tsnrDclTpdcbsXRUOhKPcE1Ox7%2BMvEqVvpBPH7%2FKhHzmNnJ8uQzpznOdy9k%2FvnyT%2FvQGQOu7MzElDEVqrsbCpEGTNbs0juVOFfRYyOZXfup00FY2JysKrRc%2FFO5IYM1NbbzF26gHR6aTKA%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20220418T210000Z&X-Amz-SignedHeaders=host&X-Amz-Expires=21600&X-Amz-Credential=ASIAZH6WM4PLRCSZ5KXV%2F20220418%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=002fb3e6b7e541d22798771d558790bf1db37d7dd00eab49e32e217ba28e5a5c")
sct1 <- ggplot(data = raw, aes(x = LifeExp, y = TotExp)) +
geom_point() +
xlab("Life Expectancy") +
ylab("Total Expense") +
ggtitle("Scatterplot of Total Expense vs. Life Expectancy")+
theme(plot.title = element_text(hjust = 0.5, size = 7))
sct1
linfit <- lm(LifeExp ~ TotExp, data = raw)
summary(linfit)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The F-statistic is 65.26 on 1 and 188 degrees of freedom. The p-value is \(7.714 \times 10^{-14}\). The high F-statistic coupled with a near zero p-value indicates that the model provides a significant relationship and that the null hypothesis can be rejected.
The \(R^2\) is 0.2537, which is not necessarily indicative of a good fit. A low \(R^2\) value could indicate that the model does not sufficiently outperform the null hypothesis. The standard error is 9.371 on 188 degrees of freedom. This standard error indicates a less than ideal fit. The linear model may not be appropriate for this data, at least not for the whole data set.
## List of 1
## $ plot.title:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 7.5
## ..$ hjust : num 0.5
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
The scatterplot is only linear for values above the age of 75 and below the age of 75. The histogram of the residuals heavily skewed. The residuals appear to not have constant variability. The variability follows a distinct pattern. The normal probability plot is not linear. The criteria for simple regression are not met.
LifeExpExp <- (raw$LifeExp)^4.6
TotExpExp <- (raw$TotExp)^0.06
fit2 <- lm(LifeExpExp ~ TotExpExp)
summary(fit2)
##
## Call:
## lm(formula = LifeExpExp ~ TotExpExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The F-statistic is 507.7 on 1 and 188 degrees of freedom. The p-value is \(2.2 \times 10^{-16}\). The high F-statistic coupled with a near zero p-value indicates that the model provides a significant relationship and that the null hypothesis can be rejected.
The \(R^2\) is 0.7283, which is much more indicative of a good fit. The standard error is 90490000 on 188 degrees of freedom. This standard error indicates a very poor fit. The linear model may not be appropriate for this data, at least not for the whole data set.
## List of 1
## $ plot.title:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 7.5
## ..$ hjust : num 0.5
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
The scatterplot is only linear for values above the age of 75 and below the age of 75. The histogram of the residuals is nearly normal, mostly for the values to the right side of the distribution. The residuals appear to have constant variability, following no particular pattern. The normal probability plot is far more linear than the prior model. The criteria for simple regression are not met because the scatterplot is non linear. The data would need to be trimmed in order for the linear model to be sufficient.
This model is better than the first.
#x = TotExp^0.6
pred <- function(x){
(-736527910 + 620060216*x)^(1/4.6)
}
pred(1.5)
## [1] 63.31153
In the first case, life expectancy is predicted to be 63.31 years.
pred(2.5)
## [1] 86.50645
In the second case, life expectancy is predicted to increase to 86.51 years.
mod <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = raw)
summary(mod)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic is 34.49 and the p-value is \(2.2 \times 10^{-16}\). This is indicative that the model is better than the null hypothesis. The \(R^2\) is 0.3471, which is not as high as the second model, but it is an improvement over the first model. The standard error is 8.765 on 186 degrees of freedom, which indicates a strong fit. The p-values are near zero for each variable, which is indicative of each variable adding value to the model.
predfull <- function(x, y){
62.77 + 1497*x + 0.00007233*y - 0.006026*x*y
}
predfull(0.03,14)
## [1] 107.6785
summary(raw$PropMD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000196 0.0002444 0.0010474 0.0017954 0.0024584 0.0351290
The predicted life expectancy in this case is incredibly high, with a value of 107.68 years. This is unrealistic, and that is likely due to the PropMD value being tested being well outside of the IQR for the PropMD variable.