Here are average systolic blood pressure data for children/young adults aged 1 through 19 (we used these exact data in homework 8). Run the following lines of code:
age <- c(1:19)
SBP <- c(91,102,104,108,109,112,109,113,116,119,115,122,124,128,129,135,130,132,134)
bpstudy <- data.frame(age,SBP)
This time including the prediction intervals (make these blue), and the confidence intervals (make these red). As a reminder, the homework 8 question asked you to create a scatterplot showing both the individual data points and the best regression line fit through them, and to add labels to both axes and show the data points using a different color than the line.
## Create a model
my_reg_model <- lm(SBP ~ age, data = bpstudy)
##create a prediction
prediction <- predict(my_reg_model, interval = "prediction")
prediction <- as.data.frame(prediction)
bpstudy <- bpstudy %>%
bind_cols(prediction)
##Plot the data
bpstudy %>%
ggplot(mapping = aes(x = age)) +
geom_point(aes(y = SBP)) +
geom_smooth(aes(y = SBP), fill = "red", method = "lm", col = "gray") +
geom_line(aes(y = lwr), linetype = "dashed", col = "blue") +
geom_line(aes(y = upr), linetype = "dashed", col = "blue") +
labs(x = "Age", y = "SBP", title = "Systolic Blood Pressure by Age")
confidence intervals
and the
prediction intervals
in words.The confidence interval shows the possible range of values associated with a given parameter such as the mean or median. In other words, what range of values in the sample data is likely to contain a parameter of interest such as the mean or median?
On the other hand, a prediction interval is the interval or range within which a future observation or set of observation is likely to fall. Hence, the prediction interval informs us about the preciseness of our forecasts.
There appears to be no outliers in the plot.
my_iqr <- IQR(bpstudy$SBP) * 1.5
q3 <- quantile(bpstudy$SBP, probs = 0.75)
q1 <- quantile(bpstudy$SBP, probs = 0.25)
upper_bound = q3 + my_iqr
lower_bound = q1 - my_iqr
bpstudy$SBP > upper_bound
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
bpstudy$SBP < lower_bound
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Because there are no outliers, I remove the point with the highest residual, which happens to be the first point in the dataset.
residuals(my_reg_model) %>% abs() %>% max()
## [1] 7.573684
new_data <- bpstudy[-1, ]
new_reg <- lm(SBP ~ age, data = new_data)
stargazer::stargazer(my_reg_model, new_reg, type = "text")
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## SBP
## (1) (2)
## -------------------------------------------------------------------
## age 2.100*** 1.951***
## (0.127) (0.105)
##
## Constant 96.474*** 98.454***
## (1.451) (1.232)
##
## -------------------------------------------------------------------
## Observations 19 18
## R2 0.941 0.956
## Adjusted R2 0.938 0.953
## Residual Std. Error 3.039 (df = 17) 2.316 (df = 16)
## F Statistic 272.120*** (df = 1; 17) 344.064*** (df = 1; 16)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The \(R^{2}\) and adjusted \(R^{2}\) do improve significantly after removing the most extreme value in the data set. The coefficient for age rises while the constant in the regression equation goes down.
Attached to this assignment is health-status.csv
a clean
copy of the 2020 NHIS file used throughout the semester. This version
has the variables already renamed and recorded to NA as required. The
aim is to explore regression models that predict self-reported health
status as a function of age, sex, income, education, and alcohol
consumption. Here are brief summaries of the values.
You can consult the data dictionary for more detail where necessary.
health <- read_csv("health-status.csv") %>%
select(-`...1`, -X)
health_reg <- lm(health ~ ., data = health)
summary(health_reg)
##
## Call:
## lm(formula = health ~ ., data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4350 -0.7539 -0.0265 0.6520 3.4127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6212024 0.0294421 89.029 < 2e-16 ***
## age 0.0124396 0.0003078 40.414 < 2e-16 ***
## sex -0.0557810 0.0111688 -4.994 5.93e-07 ***
## income -0.1320181 0.0039370 -33.533 < 2e-16 ***
## education -0.0632032 0.0024944 -25.338 < 2e-16 ***
## alcohol -0.0343176 0.0029872 -11.488 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9673 on 31085 degrees of freedom
## (477 observations deleted due to missingness)
## Multiple R-squared: 0.158, Adjusted R-squared: 0.1578
## F-statistic: 1166 on 5 and 31085 DF, p-value: < 2.2e-16
For example, if body mass index (BMI) was in the model with a coefficient of 0.05, we would say that each unit increase in BMI reduced the health score by an average of 0.05, meaning that BMI is inversely correlated with general health.
\(\beta_{0}\): \(\beta_{0}\) is the baseline health level given that a person cannot have 0 (perfect health) or 6 (no health) unless the person is superhuman or dead.
\(\beta_{1}\): \(\beta_{1}\) for age
is
0.0124396. This coefficient is positive meaning that as the respondents
in the sample increased in age, the poorer their health status.
Specifically, for each unit rise in age, health tends to deteriorate by
0.0124396 units (with health on a scale of 1 to 5) all other variables
remaining the same.
\(\beta_{2}\): \(\beta_{2}\) of -0.0557810 corresponds to sex, with 1 being male and 2 for female. The coefficient shows that holding all other variables constant, women have less health issues compared to men. For instance if you got a man and a woman with the same characteristics except for their different sex, then the man is likely to be 0.0557810 units less healthy than the woman on a scale of 1 to 5 for health.
\(\beta_{3}\): Income has a coefficient of -0.1320181 meaning that more income corresponds to better health. A unit increase in income raise the health metric by 0.1320181 on a scale of 1 (excellent) to 5 (poor), all else remaining the same.
\(\beta_{4}\): Education has a coefficient of -0.0632032 meaning that more education corresponds to better health, ceteris paribus. Again a unit rise in education raises health by 0.0632032 on a scale of 1 (excellent health) to 5 (poor health).
\(\beta_{5}\): Contrary to research, Alcohol, with a coeffcient of -0.0343176 implies that the more alcohol one consumes, the better the health. An extra day of drinking corresponds to an improvement of health by 0.0343176 on the scale of 1 to 5.
The result for alcohol does not make sense. While minimum amounts of alcohol could improve health, daily drinking certainly does not. The number of days a person drinks may not always capture the quantity of alcohol consumed.
We could improve this model by measuring alcohol consumption in terms of quantity of alcohol consumes per week or per day.
my_vars <- data.frame(age = 45, sex = 1, income = 4, education = 8, alcohol = 7)
predict(health_reg, newdata = my_vars)
## 1
## 1.851283
According to this model, a person with the best health shoud be:
The response should be as follows.
If we had a sample of younger people who had similar characteristics with the 75 year old man (except for age), then these younger people would, on average, have better health scores than the 75 year old, all else remaining the same.