You will be analyzing data from the Jackson Heart Study (JHS). You can find the data on Canvas. For full credit, you must include all code chunks and R output backing up your responses.
0. Import the JHS data; you can download it from Canvas. You need to research how to read a SAS data file into R (hint: look into the haven package).
library(haven)
JHS_data <- read_sas("path/to/your/file/analysis1.sas7bdat")
Error: 'path/to/your/file/analysis1.sas7bdat' does not exist in current working directory ('/Users/kevonkidd/Desktop/Fall/ds').
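The error above means the path did not resolve from the working directory. A minimal sketch of a guard that reports this more readably before calling `haven::read_sas()`; the helper name `read_sas_checked` is hypothetical, and the placeholder path is kept as written:

```r
# Hypothetical helper (not part of haven): stop early with a readable
# message when the file cannot be found, otherwise read it with haven.
read_sas_checked <- function(path) {
  if (!file.exists(path)) {
    stop("SAS file not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  haven::read_sas(path)
}

# With the placeholder path the call fails, and the message says where R looked.
result <- tryCatch(
  read_sas_checked("path/to/your/file/analysis1.sas7bdat"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Passing an absolute path, or setting the working directory to the folder that holds the file, is usually enough to make the call succeed.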
Week 1: Categorical Predictors
1a. Model systolic blood pressure (sbp; mmHg) as a function of age (age; years), education (HSgrad; 0=no, 1=yes), and health status as defined by body mass index (BMI3cat; 0=poor health, 1=intermediate health, 2=ideal health). Remember to report the resulting model.
library(haven)
library(car)
library(ggplot2)
#Load the data
JHS_data <- read_sas("/Users/kevonkidd/Desktop/Fall/ds/analysis1sas.sas7bdat")
model <- lm(sbp ~ age + HSgrad + BMI3cat, data = JHS_data)
#Summary of the model and also where I can check the p-value
summary(model)
Call:
lm(formula = sbp ~ age + HSgrad + BMI3cat, data = JHS_data)
Residuals:
    Min      1Q  Median      3Q     Max
-43.993 -10.126  -1.246   8.014  65.176
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.37912    1.77783  58.712  < 2e-16 ***
age           0.43382    0.02542  17.065  < 2e-16 ***
HSgrad       -0.93980    0.85489  -1.099    0.272
BMI3cat      -1.73009    0.40132  -4.311 1.68e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.64 on 2638 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.1167, Adjusted R-squared: 0.1157
F-statistic: 116.2 on 3 and 2638 DF, p-value: < 2.2e-16
sbp = 104.37912 + 0.43382(age) − 0.93980(HSgrad) − 1.73009(BMI3cat)
Intercept: 104.37912 mmHg is the predicted systolic blood pressure for someone at the baseline of every predictor: age = 0, no high school diploma (HSgrad = 0), and poor health (BMI3cat = 0).
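The model above enters BMI3cat as a single numeric term, so it estimates one slope per one-category step. If separate comparisons against the poor-health group are wanted, the variable could be wrapped in `factor()`. The sketch below uses simulated stand-in data (not the JHS data, with made-up effect sizes) only to show how the two specifications differ in the terms they estimate:

```r
set.seed(42)
n <- 300

# Simulated stand-in data; these are NOT JHS values.
sim <- data.frame(
  age     = round(runif(n, 25, 85)),
  HSgrad  = rbinom(n, 1, 0.8),
  BMI3cat = sample(0:2, n, replace = TRUE)
)
sim$sbp <- 105 + 0.4 * sim$age - 1 * sim$HSgrad -
  2 * (sim$BMI3cat == 1) - 5 * (sim$BMI3cat == 2) + rnorm(n, sd = 14)

# Numeric coding: one slope shared by both category steps.
fit_numeric <- lm(sbp ~ age + HSgrad + BMI3cat, data = sim)

# Factor coding: one indicator per non-reference category (reference = 0).
fit_factor <- lm(sbp ~ age + HSgrad + factor(BMI3cat), data = sim)

print(names(coef(fit_numeric)))
print(names(coef(fit_factor)))
```

With factor coding, each BMI coefficient is read as the difference from the reference category rather than as a per-step change.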
1b. Construct the 95% confidence intervals for the regression coefficients.
confint(model)
                  2.5 %      97.5 %
(Intercept) 100.8930478 107.8651977
age           0.3839721   0.4836670
HSgrad       -2.6161194   0.7365287
BMI3cat      -2.5170271  -0.9431545
Intercept: (100.89, 107.87), Age: (0.384, 0.484), HSgrad: (-2.616, 0.737), BMI3cat: (-2.517, -0.943)
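As a check on where these intervals come from, each one is the estimate plus or minus a t critical value times the standard error. A small sketch for the age coefficient, using the rounded estimate, standard error, and residual degrees of freedom printed in the summary output above:

```r
# Values copied from the summary(model) output for age.
est      <- 0.43382
se       <- 0.02542
df_resid <- 2638

# Two-sided 95% interval: estimate +/- t_{0.975, df} * SE.
t_crit <- qt(0.975, df = df_resid)
ci_age <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci_age, 4))
```

Small differences from the `confint()` output in the last decimal places are expected because the inputs here are rounded.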
1c. Is this a significant regression line? Test at the \(\alpha=0.05\) level. Remember to state your hypotheses and conclusion.
anova(model)
Analysis of Variance Table
Response: sbp
            Df Sum Sq Mean Sq  F value    Pr(>F)
age          1  70429   70429 328.6472 < 2.2e-16 ***
HSgrad       1    281     281   1.3133    0.2519
BMI3cat      1   3983    3983  18.5846 1.685e-05 ***
Residuals 2638 565322     214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
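H0: The regression line is not significant (the slopes for age, HSgrad, and BMI3cat are all 0).
Ha: The regression line is significant (at least one slope is not 0).
Conclusion: With an F-statistic of 116.2 on 3 and 2638 degrees of freedom and a p-value < 2.2e-16, which is below alpha = 0.05, we reject the null hypothesis and conclude that the regression line is significant.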
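The overall F-statistic quoted in the conclusion comes from `summary(model)`, while `anova(model)` reports sequential tests for each term. The two agree: summing the term sums of squares in the ANOVA table recovers the overall F. A short sketch using the values printed above:

```r
# Sums of squares and degrees of freedom copied from the anova(model) table.
ss_model <- 70429 + 281 + 3983   # age + HSgrad + BMI3cat
df_model <- 3
ss_resid <- 565322
df_resid <- 2638

# Overall F = mean square for the model / mean square for the residuals.
f_stat  <- (ss_model / df_model) / (ss_resid / df_resid)
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value < 0.05)
```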
1d. Which predictors, if any, are significant predictors of systolic blood pressure? Test at the \(\alpha=0.05\) level. Remember to state your conclusions.
Age (p < 2e-16) and BMI3cat (p = 1.68e-05) are significant predictors of systolic blood pressure because their p-values are less than alpha = 0.05. HSgrad (p = 0.272) has a p-value above alpha, so it is not a significant predictor.
1e. Provide brief interpretations for the slopes of the predictors.
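Age: for each additional year of age, systolic blood pressure increases by 0.43 mmHg on average, holding education and BMI category constant. HSgrad: high school graduates have, on average, a systolic blood pressure 0.94 mmHg lower than non-graduates, holding age and BMI category constant. BMI3cat: each one-category increase in BMI health status (poor to intermediate, or intermediate to ideal) is associated with a 1.73 mmHg lower systolic blood pressure on average, holding age and education constant.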
1f. How many suspected outliers exist? You must justify your answer statistically. Remember to state your conclusion.
standardized_residuals <- rstandard(model)
outliers <- which(abs(standardized_residuals) > 2)
JHS_data[outliers, ]
# A tibble: 122 × 198
   subjid  visit VisitDate  DaysFromV1 YearsFromV1  ARIC recruit ageIneligible
   <chr>   <dbl> <date>          <dbl>       <dbl> <dbl>   <dbl>         <dbl>
 1 " 2054"     1 2003-06-30          0           0     0       3             0
 2 " 308"      1 2003-09-10          0           0     0       3             0
 3 " 2613"     1 2004-01-25          0           0     0       5             0
 4 " 931"      1 2003-12-19          0           0     0       5             0
 5 " 332"      1 2003-12-14          0           0     0       4             0
 6 " 1094"     1 2004-01-04          0           0     0       3             0
 7 " 2524"     1 2003-12-30          0           0     0       3             0
 8 " 2725"     1 2004-02-06          0           0     0       4             0
 9 " 1431"     1 2004-02-21          0           0     0       4             0
10 " 46"       1 2004-02-16          0           0     0       4             0
We have 122 suspected outliers based on standardized residuals greater than 2 or less than -2.
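One thing worth checking when listing the flagged rows: `which()` returns positions within the residual vector, and the model dropped 11 observations for missingness, so those positions need not match row numbers in the original data frame. The count of flagged points is unaffected, but the rows printed may be. A sketch with simulated data (not the JHS data) showing how the names of the residual vector recover the original rows:

```r
set.seed(1)
n <- 200

# Simulated stand-in data with a few missing predictor values.
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)
dat$x[c(5, 50, 120)] <- NA

fit <- lm(y ~ x, data = dat)   # rows 5, 50, 120 are dropped
r   <- rstandard(fit)          # length 197, named by original row

flag_pos  <- which(abs(r) > 2)            # positions within r
flag_rows <- as.integer(names(flag_pos))  # original row numbers

print(length(r))
print(head(data.frame(position = unname(flag_pos), row = flag_rows)))

# Index the data by original row number, not by position in r.
flagged <- dat[flag_rows, ]
```

Any flagged point that sits after a dropped row has a position smaller than its original row number, which is why indexing by name is safer.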
1g. How many suspected influential/leverage points exist? You must justify your answer statistically. Remember to state your conclusion.
cooks_d <- cooks.distance(model)
influential <- which(cooks_d > 4/nrow(JHS_data))
JHS_data[influential, ]
# A tibble: 142 × 198
   subjid  visit VisitDate  DaysFromV1 YearsFromV1  ARIC recruit ageIneligible
   <chr>   <dbl> <date>          <dbl>       <dbl> <dbl>   <dbl>         <dbl>
 1 " 2054"     1 2003-06-30          0           0     0       3             0
 2 " 308"      1 2003-09-10          0           0     0       3             0
 3 " 1521"     1 2003-10-01          0           0     0       4             0
 4 " 332"      1 2003-12-14          0           0     0       4             0
 5 " 1094"     1 2004-01-04          0           0     0       3             0
 6 " 748"      1 2003-10-18          0           0     0       4             0
 7 " 2524"     1 2003-12-30          0           0     0       3             0
 8 " 2725"     1 2004-02-06          0           0     0       4             0
 9 " 1762"     1 2004-02-12          0           0     0       4             0
10 " 2088"     1 2004-03-15          0           0     0       4             0
142 suspected influential/leverage points exist based on their Cook’s distance exceeding 4/n.
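A note on the 4/n cutoff: the code uses `nrow(JHS_data)` for n, which counts the 11 observations the model dropped. The number of observations actually used can be recovered from the summary output (residual degrees of freedom plus the number of estimated coefficients), or with `nobs(model)`. The sketch below only compares the two cutoffs; it does not recompute the count of influential points:

```r
# From summary(model): 2638 residual df, 4 estimated coefficients,
# and 11 observations deleted due to missingness.
n_model <- 2638 + 4        # observations used in the fit
n_rows  <- n_model + 11    # rows in the data frame

cutoffs <- c(model_n = 4 / n_model, data_rows = 4 / n_rows)
print(signif(cutoffs, 4))
```

The two cutoffs differ only slightly here, so the count of flagged points would be expected to change by little, if at all.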
1h. Is multicollinearity a problem in this model? You must justify your answer statistically. Remember to state your conclusion.
vif(model)
     age   HSgrad  BMI3cat
1.095162 1.094919 1.000371
Multicollinearity is not a problem in this model: all three VIF values are between 1.00 and 1.10, far below the commonly used cutoffs of 5 or 10.
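For intuition about what a VIF measures: the VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes this by hand on simulated data (not the JHS data); `vif_by_hand` is a hypothetical helper written for illustration and is not the `car::vif()` implementation:

```r
set.seed(7)
n <- 500

# Simulated predictors; x2 is mildly correlated with x1.
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- 0.3 * sim$x1 + rnorm(n)
sim$x3 <- rnorm(n)

# Hypothetical helper: regress each predictor on the others,
# then convert the resulting R-squared into a VIF.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(sim, c("x1", "x2", "x3"))
print(round(vifs, 3))
```

A VIF near 1 means the predictor shares almost no linear information with the others, which is the situation reported above.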
1i. Assess the assumptions on the linear model. Remember to draw your conclusion with appropriate justification.
par(mfrow = c(2, 2))
plot(model)
Based on the diagnostic plots produced by plot(model), the assumptions of linearity, normality, and homoscedasticity appear to be adequately met.
1j. Construct an appropriate data visualization to help with explaining the model results. Systolic blood pressure should be on the y-axis, age should be on the x-axis. Create lines for BMI (BMI3cat); remember that we will plug in a combination of 0’s and 1’s to represent BMI.
ggplot(JHS_data, aes(x = age, y = sbp, color = as.factor(BMI3cat))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(color = "BMI Category")
`geom_smooth()` using formula = 'y ~ x'
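The prompt asks for lines built by plugging values into the fitted model, which gives parallel lines (one per BMI category) rather than the separately fitted lines that `geom_smooth()` draws within each group. A minimal sketch of the plug-in predictions using the coefficients reported in 1a; fixing HSgrad at 1 and the three example ages are illustrative assumptions:

```r
# Coefficients copied from the summary(model) output in 1a.
b0     <- 104.37912
b_age  <- 0.43382
b_hs   <- -0.93980
b_bmi  <- -1.73009

# Grid of ages for each BMI category, with HSgrad fixed at 1 (assumption).
grid <- expand.grid(age = c(30, 50, 70), BMI3cat = 0:2, HSgrad = 1)
grid$sbp_hat <- b0 + b_age * grid$age + b_hs * grid$HSgrad + b_bmi * grid$BMI3cat

# Intercept of each line; the age slope is the same for every category.
line_intercepts <- b0 + b_hs * 1 + b_bmi * (0:2)
print(round(line_intercepts, 2))
print(grid)
```

A data frame like `grid` could then be drawn with a line layer mapped to `sbp_hat` and colored by `factor(BMI3cat)`.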
2. Required for graduate students / extra credit for undergraduate students: Write a paragraph to summarize the above analysis, written such that a non-quantitative person could understand. Note - knowledge about medicine/health is not required. I am only looking for you to report results in a digestible manner.
Insert your answer here :)
Week 2: Interaction Terms
3a. Model systolic blood pressure (sbp; mmHg) as a function of age (age; years), education (HSgrad; 0=no, 1=yes), and body mass index (BMI; kg/m2), and the following interactions: body mass index \(\times\) age and body mass index \(\times\) education. Remember to report the resulting model.
JHS_data$age_BMI3cat <- JHS_data$age * JHS_data$BMI
JHS_data$HSgrad_BMI3cat <- JHS_data$HSgrad * JHS_data$BMI
model_interaction <- lm(sbp ~ age * BMI + HSgrad * BMI, data = JHS_data)
summary(model_interaction)
Call:
lm(formula = sbp ~ age * BMI + HSgrad * BMI, data = JHS_data)
Residuals:
    Min      1Q  Median      3Q     Max
-43.173  -9.980  -1.016   7.968  64.649
sbp = 74.75 + 0.87(age) + 0.89(BMI) − 2.77(HSgrad) − 0.013(age × BMI) + 0.056(HSgrad × BMI)
The interaction between age and BMI is significant because its p-value (0.000165) is smaller than alpha = 0.05.
The interaction between HSgrad and BMI is not statistically significant because its p-value (0.652) is larger than alpha.
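In an R formula, `age * BMI` already expands to both main effects plus the `age:BMI` product term, so hand-built product columns are not needed for the model to include the interactions. The sketch below checks this on simulated data (not the JHS data, with made-up effect sizes) by fitting the model both ways:

```r
set.seed(3)
n <- 400

# Simulated stand-in data; these are NOT JHS values.
sim <- data.frame(
  age    = runif(n, 25, 85),
  BMI    = rnorm(n, 30, 6),
  HSgrad = rbinom(n, 1, 0.8)
)
sim$sbp <- 75 + 0.9 * sim$age + 0.9 * sim$BMI - 3 * sim$HSgrad -
  0.013 * sim$age * sim$BMI + rnorm(n, sd = 14)

# Formula interface: '*' adds main effects and the interaction.
fit_star <- lm(sbp ~ age * BMI + HSgrad * BMI, data = sim)

# Same model with manually built product columns.
sim$age_BMI    <- sim$age * sim$BMI
sim$HSgrad_BMI <- sim$HSgrad * sim$BMI
fit_manual <- lm(sbp ~ age + BMI + HSgrad + age_BMI + HSgrad_BMI, data = sim)

print(names(coef(fit_star)))
print(all.equal(unname(coef(fit_star)), unname(coef(fit_manual))))
```

The formula version has the advantage that R labels the terms `age:BMI` and `BMI:HSgrad`, matching the coefficient table.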
3b. Perform the appropriate hypothesis test to determine if the interaction between body mass index and age is significant. Test at the \(\alpha=0.05\) level. Remember to state your conclusion.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.749725   7.884470   9.481  < 2e-16 ***
age          0.869644   0.115042   7.559 5.56e-14 ***
BMI          0.885513   0.243923   3.630 0.000288 ***
HSgrad      -2.772950   4.009951  -0.692 0.489301
age:BMI     -0.013472   0.003571  -3.773 0.000165 ***
BMI:HSgrad   0.055666   0.123568   0.450 0.652393
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.55 on 2636 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.1277, Adjusted R-squared: 0.1261
F-statistic: 77.2 on 5 and 2636 DF, p-value: < 2.2e-16
Hypotheses
H0: The interaction between age and BMI is not significant (the age × BMI slope is 0).
Ha: The interaction between age and BMI is significant (the age × BMI slope is not 0).
Conclusion: The age × BMI interaction term has a test statistic of t = -3.773 and a p-value of 0.000165, which is less than alpha = 0.05. We therefore reject the null hypothesis and conclude that the interaction between age and BMI is statistically significant.
3c. Perform the appropriate hypothesis test to determine if the interaction between body mass index and education is significant Test at the \(\alpha=0.05\) level. Remember to state your conclusion.
Hypotheses
H0: The interaction between education and BMI is not significant (the HSgrad × BMI slope is 0).
Ha: The interaction between education and BMI is significant (the HSgrad × BMI slope is not 0).
Conclusion: The HSgrad × BMI interaction term has a test statistic of t = 0.450 and a p-value of 0.652, which is greater than alpha = 0.05. We therefore fail to reject the null hypothesis; there is not enough evidence to conclude that the interaction between education and BMI is statistically significant.
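Both interaction tests in 3b and 3c are t-tests on a single coefficient: the estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom. A sketch that reproduces them from the printed coefficient table; `coef_t_test` is a hypothetical helper name:

```r
# Hypothetical helper: two-sided t-test for one regression coefficient.
coef_t_test <- function(estimate, std_error, df_resid) {
  t_stat  <- estimate / std_error
  p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
  c(t = t_stat, p = p_value)
}

# Estimates and standard errors copied from summary(model_interaction).
age_bmi <- coef_t_test(-0.013472, 0.003571, df_resid = 2636)
hs_bmi  <- coef_t_test(0.055666, 0.123568, df_resid = 2636)

print(signif(age_bmi, 3))
print(signif(hs_bmi, 3))
```

Comparing each p-value with alpha = 0.05 gives the same decisions as reading the `Pr(>|t|)` column directly.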
3d. Create the following models (i.e., plug in the following values and algebraically simplify): (1) body mass index of 32, and (2) body mass index of 25. Remember to report the resulting simplified models.
JHS_data_25 <- subset(JHS_data, BMI >= 24 & BMI <= 26)
model_25 <- lm(sbp ~ age + HSgrad, data = JHS_data_25)
summary(model_25)
Call:
lm(formula = sbp ~ age + HSgrad, data = JHS_data_25)
Residuals:
    Min      1Q  Median      3Q     Max
-32.936  -9.508  -1.426   7.564  65.731
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.16364    5.25484  18.300  < 2e-16 ***
age          0.55510    0.07125   7.791 1.74e-13 ***
HSgrad      -2.87382    2.76852  -1.038      0.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.73 on 252 degrees of freedom
Multiple R-squared: 0.22, Adjusted R-squared: 0.2138
F-statistic: 35.54 on 2 and 252 DF, p-value: 2.526e-14
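Plugging a fixed BMI into the interaction model from 3a and collecting terms gives: intercept = 74.749725 + 0.885513(BMI), age slope = 0.869644 − 0.013472(BMI), and HSgrad slope = −2.772950 + 0.055666(BMI).
BMI = 32
sbp = 103.09 + 0.44(age) − 0.99(HSgrad)
BMI = 25
sbp = 96.89 + 0.53(age) − 1.38(HSgrad)
3e. Provide brief interpretations for the slopes of the predictors for the one of the models in 3d (your choice, but make sure you specify which model you are interpreting).
Interpretation of the predictors for BMI = 32.
Intercept: The predicted systolic blood pressure for someone with BMI = 32, age = 0, and no high school diploma is 103.09 mmHg.
Age: For each additional year of age, systolic blood pressure increases by 0.44 mmHg on average, holding education constant.
HSgrad: High school graduates have, on average, a systolic blood pressure 0.99 mmHg lower than non-graduates, holding age constant.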
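The plug-in step asked for in 3d can be sketched in code: fix BMI, then fold the BMI main effect into the intercept and each interaction into the matching slope. The sketch uses the full-precision coefficients from the `summary(model_interaction)` output, so its results may differ slightly from values worked out by hand with rounded coefficients; `simplify_at_bmi` is a hypothetical helper name:

```r
# Coefficients copied from summary(model_interaction).
b <- c(intercept = 74.749725, age = 0.869644, BMI = 0.885513,
       HSgrad = -2.772950, age_BMI = -0.013472, BMI_HSgrad = 0.055666)

# Hypothetical helper: simplified model at a fixed BMI value.
simplify_at_bmi <- function(bmi, b) {
  c(intercept = unname(b["intercept"] + b["BMI"] * bmi),
    age       = unname(b["age"] + b["age_BMI"] * bmi),
    HSgrad    = unname(b["HSgrad"] + b["BMI_HSgrad"] * bmi))
}

m32 <- simplify_at_bmi(32, b)
m25 <- simplify_at_bmi(25, b)
print(round(rbind(BMI_32 = m32, BMI_25 = m25), 2))
```

Because the HSgrad × BMI interaction is part of the model, the HSgrad slope changes with BMI as well, even though that interaction was not statistically significant.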
3f. Construct an appropriate data visualization to help with explaining the model results. Systolic blood pressure should be on the y-axis, age should be on the x-axis, and use the regression lines constructed in 3d to construct predicted values.
# Assumes JHS_data_32 was created analogously to JHS_data_25; its definition is not shown.
ggplot(JHS_data_32, aes(x = age, y = sbp, color = as.factor(HSgrad))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(color = "Education Level")
`geom_smooth()` using formula = 'y ~ x'
This plot shows that systolic blood pressure tends to increase with age, regardless of education level.
4. Required for graduate students / extra credit for undergraduate students: Write a paragraph to summarize the above analysis, written such that a non-quantitative person could understand. Note - knowledge about medicine/health is not required. I am only looking for you to report results in a digestible manner.