1.1.1) 𝜇 = exp(𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑥^2)
1.1.2) 𝜇 = log(𝛽0 + 𝛽1𝑥 ) + 𝛽2𝑥^2 (a) Linear in the parameters: No. 𝛽0 and 𝛽1 appear inside the logarithm, so 𝜇 is not a linear function of the parameters. (b) Appropriate for linear regression models: No, because linear regression requires a systematic component that is linear in the parameters. (c) Appropriate for generalized linear models: No, because no link function 𝑔 makes 𝑔(𝜇) a linear combination of the parameters.
1.1.3) 𝜇 = 𝛽0+𝛽1𝑥 (a) Linear in the parameters: Yes, it is linear in the parameters. (b) Appropriate for linear regression models: Yes, because the systematic component is linear in the parameters. (c) Appropriate for generalized linear models: Yes. A linear regression model is a special case of a generalized linear model, with the identity link function.
1.1.4) 𝜇 = (𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑥1 𝑥2) (a) Linear in the parameters: Yes. The product 𝑥1 𝑥2 is an interaction between explanatory variables, and each parameter still enters linearly. (b) Appropriate for linear regression models: Yes, because the systematic component is linear in the parameters. (c) Appropriate for generalized linear models: Yes, with the identity link function.
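As a rough illustration of how such systematic components map onto R model calls, the sketch below fits the component from 1.1.3 with lm() and with glm() using an identity link, and the component from 1.1.1 with glm() using a log link. The simulated data, the seed, the coefficient values, and the choice of a Poisson family are assumptions made only for this example.

```r
# Simulated data; every value here is an illustrative assumption.
set.seed(1)
x <- runif(50, 0, 2)

# 1.1.3: mu = b0 + b1*x is linear in the parameters, so lm() applies.
y_lin  <- 1 + 2 * x + rnorm(50, sd = 0.5)
fit_lm <- lm(y_lin ~ x)

# The same systematic component fitted as a GLM with an identity link.
fit_gauss <- glm(y_lin ~ x, family = gaussian(link = "identity"))

# 1.1.1: mu = exp(b0 + b1*x + b2*x^2), so log(mu) is linear in the
# parameters and a GLM with a log link applies.
y_cnt   <- rpois(50, lambda = exp(0.5 + 0.8 * x - 0.2 * x^2))
fit_glm <- glm(y_cnt ~ x + I(x^2), family = poisson(link = "log"))

coef(fit_lm)
coef(fit_gauss)
coef(fit_glm)
```

The first two sets of coefficients should agree, since the two calls fit the same model.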
1.2.1) Using names():
“Age”,“Percent.Fat”,“Gender”,“BMI”,“Gender_dummy”
1.2.2) Determine which variables are quantitative and which are qualitative.
Quantitative variables: Percent.Fat, Age, BMI
Qualitative variables: Gender, Gender_dummy
1.2.3) Reference level: Female
1.2.4) Plots:
Plot Percent.Fat vs Age: The plot suggests that there might be a weak positive linear relationship between Age and Percent.Fat.
Plot Percent.Fat vs BMI: The plot suggests a positive linear relationship between BMI and Percent.Fat. However, there is a clear outlier with high Percent.Fat for their BMI.
Plot Percent.Fat vs Gender_dummy: The plot suggests that there is a clear difference in Percent.Fat between males and females.
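The steps in 1.2.1 to 1.2.4 can be sketched in R as follows. The data frame below is a small made-up stand-in that only reuses the column names listed in 1.2.1; its values are assumptions for illustration and are not the real data.

```r
# Made-up stand-in data with the column names from 1.2.1 (values are assumptions).
fat <- data.frame(
  Age         = c(23, 27, 39, 41, 45, 50, 53, 58),
  Percent.Fat = c(9.5, 17.8, 31.4, 25.9, 27.4, 31.1, 34.7, 33.0),
  Gender      = factor(c("M", "M", "F", "F", "M", "F", "F", "M")),
  BMI         = c(17.8, 22.6, 27.7, 24.5, 25.3, 27.0, 28.9, 26.4)
)
fat$Gender_dummy <- as.numeric(fat$Gender == "M")  # 0 = female, 1 = male

names(fat)             # variable names (1.2.1)
sapply(fat, class)     # numeric columns vs factor columns (1.2.2)
levels(fat$Gender)[1]  # reference level is the first factor level (1.2.3)

if (!interactive()) pdf(NULL)  # avoid writing plot files in batch mode

# Plots of the response against each explanatory variable (1.2.4)
plot(Percent.Fat ~ Age, data = fat)
plot(Percent.Fat ~ BMI, data = fat)
boxplot(Percent.Fat ~ Gender, data = fat)
```

R orders factor levels alphabetically by default, which is why "F" becomes the reference level here.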
1.2.5) Would a linear regression model seem appropriate for modelling the data based on your plots from (4)? Yes, a linear regression model seems appropriate for modelling the relationship between Percent.Fat and BMI and Gender.
1.2.6) Suppose a linear regression model was fitted to the data with systematic component 𝜇 = 𝛽0 + 𝛽1 𝑥 , where 𝑥 is BMI. Interpret the systematic component of this model. In other words, interpret the meaning of 𝛽0 and 𝛽1. You do not need to fit the model and just use 𝛽0 and 𝛽1 for your interpretation.
𝛽0 represents the intercept of the linear regression model, which is the expected value of Percent.Fat when BMI is 0 (an extrapolation, since no person has a BMI of 0). 𝛽1 represents the slope of the linear regression model, which is the expected change in Percent.Fat for a one-unit increase in BMI.
1.2.7) Suppose a generalized linear model was fitted to the data with systematic component log(𝜇) = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2, where 𝑥1 is BMI, and 𝑥2 is 0 for females and 1 for males.
In the given model, 𝛽1 represents the change in log(𝜇), the log of the expected Percent.Fat, for a one-unit increase in BMI, while holding the gender variable constant. Equivalently, the expected Percent.Fat is multiplied by exp(𝛽1). 𝛽2 represents the difference in log(𝜇) between males and females with the same BMI. Equivalently, the expected Percent.Fat for males is exp(𝛽2) times that for females.
1.2.8) Determine the values of 𝑝 and 𝑞 for the models in (6) and (7), respectively. Model in (6): 𝜇 = 𝛽0 + 𝛽1 𝑥 , where 𝑥 is BMI p = 1, q = 2
Model in (7): log(𝜇) = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2, where 𝑥1 is BMI, and 𝑥2 is 0 for females and 1 for males. p = 2, q = 3
1.3.1) The degrees of freedom omitted from the table are:
The degrees of freedom for Cue, Sex, and Age are 1, 1, and 3 respectively. The table reports only the residual degrees of freedom, which is 60. The value 177639 is the residual sum of squares, not a degrees of freedom.
The omitted degrees of freedom therefore total 1 + 1 + 3 = 5.
1.3.2) The number of observations used in the analysis is:
Total number of observations = residual degrees of freedom + degrees of freedom for Cue, Sex, and Age + 1 (for the intercept)
Total number of observations = 60 + 5 + 1 = 66
So, 66 observations were used in the analysis.
1.3.3) An unbiased estimate of σ^2 is the residual mean square: s^2 = SS(Residual) / df(Residual) = 177639/60 = 2960.65
1.3.4) To determine which explanatory variables are statistically significant, we perform sequential F-tests for each variable in the order they appear in the table. For each test, the null hypothesis is that the variable does not improve the model once the variables listed above it are included, and the alternative hypothesis is that it does.
For Cue, the sequential F-test statistic is the mean square for Cue (its Type I sum of squares divided by its degrees of freedom) divided by the residual mean square, which is (117793 / 1) / (177639 / 60) = 117793 / 2960.65 = 39.79. The degrees of freedom for the F-test are (1, 60), and the corresponding p-value is p < 0.001. Since the p-value is less than 0.05, we reject the null hypothesis, which means Cue is statistically significant for predicting response time.
For Sex, the sequential F-test statistic is (22850 / 1) / (177639 / 60) = 22850 / 2960.65 = 7.72. The degrees of freedom for the F-test are (1, 60), and the corresponding p-value is p ≈ 0.007. Since the p-value is less than 0.05, we reject the null hypothesis, which means Sex is statistically significant for predicting response time after adjusting for Cue.
For Age, the sequential F-test statistic is (60 / 3) / (177639 / 60) = 20 / 2960.65 = 0.0068. The degrees of freedom for the F-test are (3, 60), and the corresponding p-value is p > 0.99. Since the p-value is greater than 0.05, we fail to reject the null hypothesis, which means Age is not statistically significant for predicting response time after adjusting for Cue and Sex.
Therefore, we can conclude that Cue and Sex are statistically significant for predicting response time, and Age is not.
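The sequential F-tests can be checked with a few lines of R. This is a minimal sketch: the helper name seq_f_tests is hypothetical, and the sums of squares and degrees of freedom are taken on trust from the values quoted in 1.3.1, 1.3.3 and 1.3.4, since the ANOVA table itself is not reproduced here.

```r
# Sequential F-tests from an ANOVA table.
# ss, df  : Type I sums of squares and degrees of freedom for each term, in order
# rss, rdf: residual sum of squares and residual degrees of freedom
seq_f_tests <- function(ss, df, rss, rdf) {
  s2 <- rss / rdf                           # unbiased estimate of sigma^2
  f  <- (ss / df) / s2                      # term mean square / residual mean square
  p  <- pf(f, df, rdf, lower.tail = FALSE)  # upper-tail probability
  data.frame(term = names(ss), df = df, F = unname(f), p.value = unname(p),
             row.names = NULL)
}

# Values as quoted in the text (assumed to match the original table).
ss  <- c(Cue = 117793, Sex = 22850, Age = 60)
df  <- c(1, 1, 3)
tab <- seq_f_tests(ss, df, rss = 177639, rdf = 60)
print(tab)
```

Each F-statistic divides by the residual mean square, not by the residual sum of squares, which is the step that is easiest to get wrong by hand.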
1.3.5) Omitting participants who failed to wake may introduce bias in the analysis if there is a systematic difference between those who woke and those who did not. This could affect the generalization of the results to the population as a whole.
1.3.6) R^2 and adjusted R^2 for the three models are calculated from the sums of squares used in 1.3.3 and 1.3.4 (Cue 117793, Sex 22850, Age 60, Residual 177639).
Total sum of squares: SST = 117793 + 22850 + 60 + 177639 = 318342, on n - 1 = 1 + 1 + 3 + 60 = 65 degrees of freedom.
For each model, R2 = 1 - RSS / SST and Adjusted R2 = 1 - (RSS / residual df) / (SST / (n - 1)), where RSS and the residual df are those of the model in question.
Model with Cue only: RSS = 318342 - 117793 = 200549 on 64 df. R2 = 117793 / 318342 = 0.3700. Adjusted R2 = 1 - (200549 / 64) / (318342 / 65) = 0.3602
Model with Cue and Sex: RSS = 200549 - 22850 = 177699 on 63 df. R2 = 140643 / 318342 = 0.4418. Adjusted R2 = 1 - (177699 / 63) / (318342 / 65) = 0.4241
Model with Cue, Sex and Age: RSS = 177639 on 60 df. R2 = 140703 / 318342 = 0.4420. Adjusted R2 = 1 - (177639 / 60) / (318342 / 65) = 0.3955
1.3.7) Ranking of models by R2 and adjusted R2:
By R2: Model with Cue, Sex and Age > Model with Cue and Sex > Model with Cue only
By adjusted R2: Model with Cue and Sex > Model with Cue, Sex and Age > Model with Cue only
R2 can never decrease when a term is added, so the largest model always ranks first by R2. Adjusted R2 penalizes the three extra parameters for Age, which explain almost none of the variation, so the model with Cue and Sex ranks first by adjusted R2.
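The way R^2 and adjusted R^2 follow from a sequential ANOVA table can be checked on any fitted model. The sketch below uses simulated data (the variable names x1, x2, y, the seed, and the coefficients are all assumptions) and compares the hand calculation with the values reported by summary().

```r
# Simulated data; names and values are assumptions for illustration.
set.seed(3)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit     <- lm(y ~ x1 + x2)
aov_tab <- anova(fit)                 # sequential (Type I) sums of squares

ss    <- aov_tab[["Sum Sq"]]          # last entry is the residual SS
dfs   <- aov_tab[["Df"]]
k     <- length(ss) - 1               # number of terms in the model
sst   <- sum(ss)                      # total SS about the mean
rss_m <- sst - cumsum(ss[1:k])        # residual SS after adding each term
rdf_m <- (n - 1) - cumsum(dfs[1:k])   # residual df after adding each term

r2     <- 1 - rss_m / sst
adj_r2 <- 1 - (rss_m / rdf_m) / (sst / (n - 1))
data.frame(model = c("x1", "x1 + x2"), R2 = r2, adj.R2 = adj_r2)
```

The total degrees of freedom is n - 1, and the residual degrees of freedom shrinks by the df of each added term. Adjusted R^2 uses both, which is why it can fall when a weak term is added.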
1.4.1) Calculation of p-values for Age and Smoking:
The null hypothesis for testing the significance of Age and Smoking in the model is H0: βj = 0, where j corresponds to the explanatory variables Age and Smoking. The alternative hypothesis is Ha: βj ≠ 0.
The t-test statistic for Age is given by:
t = (βAge - 0) / SE(βAge) = 13.096 / 0.062 = 211.23
The degrees of freedom for the test are df = n - p - 1 = 574 - 5 - 1 = 568.
The p-value for the test is P(|t| > 211.23) < 0.001, which is very small. Therefore, Age is a highly significant predictor of systolic blood pressure.
Similarly, the t-test statistic for Smoking is:
t = (βSmoking - 0) / SE(βSmoking) = -0.521 / 0.262 = -1.989
The degrees of freedom for the test are df = n - p - 1 = 574 - 5 - 1 = 568.
The p-value for the test is P(|t| > 1.989) = 0.047, which is just below the conventional significance level of 0.05. Therefore, Smoking is a significant predictor of systolic blood pressure at the 5% level of significance.
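A two-sided p-value of this kind can be computed directly from the estimate, its standard error, and the residual degrees of freedom. The helper name t_test_coef below is hypothetical, and the inputs are simply the figures quoted above for Smoking.

```r
# Two-sided t-test of H0: beta_j = 0 from an estimate and its standard error.
t_test_coef <- function(estimate, se, df) {
  t_stat <- estimate / se
  p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  c(t = t_stat, p.value = p_val)
}

# Figures as quoted in the text, with df = 574 - 5 - 1 = 568.
res <- t_test_coef(estimate = -0.521, se = 0.262, df = 568)
round(res, 3)
```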
1.4.2) Relationship between Ambient temperature and systolic blood pressure:
After adjusting for Age, Waist circumference, Alcohol consumption, and Smoking habits, the estimated regression coefficient for Ambient temperature is -0.521, with a standard error of 0.262. This suggests a negative association between Ambient temperature and systolic blood pressure: holding the other variables constant, each 1°C increase in ambient temperature is associated with a decrease of about 0.521 mm Hg in mean systolic blood pressure. However, this is a cross-sectional study, so causality cannot be inferred.
1.4.3) Calculation of a 95% confidence interval for the regression parameter for Ambient temperature:
The 95% confidence interval for the regression parameter for Ambient temperature is given by:
βAmbient temperature ± tα/2,df * SE(βAmbient temperature),
where tα/2,df is the t-distribution value for a two-sided test with significance level α = 0.05 and degrees of freedom df = n - p - 1 = 568.
Substituting the values, with t0.025,568 ≈ 1.96, we get:
-0.521 ± 1.96 * 0.262 = -0.521 ± 0.514,
which gives the confidence interval [-1.035, -0.007].
Therefore, we can be 95% confident that the true regression coefficient for Ambient temperature lies in this interval.
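The interval can be reproduced in R using the exact t quantile instead of the rounded value 1.96. The helper name ci_coef is hypothetical, and the estimate, standard error, and degrees of freedom are those quoted above. Because the exact quantile on 568 degrees of freedom is slightly larger than 1.96, the result may differ from a hand calculation in the third decimal place.

```r
# Confidence interval for a regression coefficient: estimate +/- t* x SE.
ci_coef <- function(estimate, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = df)  # e.g. the 0.975 quantile for 95%
  c(lower = estimate - t_crit * se, upper = estimate + t_crit * se)
}

# Figures as quoted in the text for Ambient temperature.
ci <- ci_coef(estimate = -0.521, se = 0.262, df = 568)
round(ci, 3)
```

The interval excludes zero, which agrees with a two-sided p-value just under 0.05 for the same estimate and standard error.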
1.4.4) Prediction of mean systolic blood pressure for 35-year-old Ghanaian men:
The predicted mean systolic blood pressure for 35-year-old Ghanaian men who do not smoke, do drink alcohol, and have a waist circumference of 100 cm when the ambient temperature is 30°C can be computed as follows:
Ŷ = β0 + βAge * Age + βWaist circumference * Waist circumference + βAlcohol * Alcohol + βSmoking * Smoking + βAmbient temperature * Ambient temperature
Substituting the values, we get:
Ŷ = 100.812 + 13.096 * 35 + 0.332 * 100 + (-3.003) * 1 + (-0.521) * 0 + (-0.362) * 30
which gives the predicted mean systolic blood pressure of 149.814 mm Hg
##
## Call:
## lm(formula = Percent.Fat ~ Age * Gender, data = humanfat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.6756 -2.8862 -0.2464  1.9100  9.1641
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  20.1116     6.2395   3.223  0.00613 **
## Age           0.2401     0.1204   1.994  0.06600 .
## GenderM     -29.2692    10.4098  -2.812  0.01386 *
## Age:GenderM   0.5725     0.2893   1.978  0.06790 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 14 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.7591
## F-statistic: 18.86 on 3 and 14 DF, p-value: 3.455e-05
1.5.1) The point estimates, standard errors, and p-values for each coefficient except the intercept are as follows:
Age: Point estimate = 0.2401, Standard error = 0.1204, p-value = 0.06600
GenderM: Point estimate = -29.2692, Standard error = 10.4098, p-value = 0.01386
Age:GenderM: Point estimate = 0.5725, Standard error = 0.2893, p-value = 0.06790
These values represent the estimated effect sizes, their precision, and their statistical significance for each of the corresponding predictor variables in the model.
1.5.2) The model was fitted with lm(), so the systematic component is on the scale of μ itself, with no log transformation. The systematic component for females is:
μ = 20.1116 + 0.2401*Age
The systematic component for males is:
μ = (20.1116 - 29.2692) + (0.2401 + 0.5725)*Age
Simplifying the above equation for males, we get:
μ = -9.1576 + 0.8126*Age
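The intercepts and slopes for the two groups can be assembled from the coefficient estimates in a few lines of R. This sketch simply copies the four estimates from the lm() output above into a named vector. The object names b, female, and male are assumptions made for the example.

```r
# Coefficient estimates copied from the lm() output above.
b <- c("(Intercept)" = 20.1116, Age = 0.2401, GenderM = -29.2692,
       "Age:GenderM" = 0.5725)

# Females are the reference level, so the GenderM and interaction terms drop out.
female <- c(intercept = unname(b["(Intercept)"]),
            slope     = unname(b["Age"]))

# For males, GenderM shifts the intercept and Age:GenderM shifts the slope.
male <- c(intercept = unname(b["(Intercept)"] + b["GenderM"]),
          slope     = unname(b["Age"] + b["Age:GenderM"]))

rbind(female, male)
```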
1.5.3) To test if the interaction term is significant, we can perform a t-test and an F-test.
First, the t-test:
H0: β3 = 0 (the interaction term is not significant) Ha: β3 ≠ 0 (the interaction term is significant)
The t-statistic is calculated as:
t = (β3 - 0) / SE(β3) = 0.5725 / 0.2893 = 1.978
The degrees of freedom is DF = n - p = 18 - 4 = 14 (n is the number of observations and p is the number of parameters estimated).
The p-value for a two-sided t-test with 14 degrees of freedom is 0.068.
Next, the F-test:
H0: β3 = 0 (the interaction term is not significant) Ha: β3 ≠ 0 (the interaction term is significant)
The F-statistic is calculated as:
F = [(SSRreduced - SSRfull) / (pfull - preduced)] / [SSRfull / (n - pfull)]
where SSR is the sum of squared residuals and p is the number of parameters estimated.
The reduced model only includes the intercept, age, and gender:
Percent.Fat ~ Age + Gender
The full model includes the interaction term:
Percent.Fat ~ Age * Gender
Using the output from the linear regression model lm1, the two models differ by a single parameter, so the F-statistic equals the square of the t-statistic for the interaction term: F = 1.978^2 = 3.912.
Using anova(), the degrees of freedom for the F-test are (1, 14).
The p-value for the F-test with 1 and 14 degrees of freedom is 0.068.
The p-values for the t-test and F-test are the same for the interaction term. This is expected: when a single parameter is tested, the squared t-statistic (1.978^2 = 3.912) equals the F-statistic, and the square of a t random variable on 14 degrees of freedom has an F-distribution on (1, 14) degrees of freedom. Therefore, we can conclude that there is only weak evidence (p-value = 0.068) of an interaction between Age and Gender.
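The agreement between the t-test and the F-test can be verified numerically. The sketch below uses simulated stand-in data, not the data analysed above: the variable names age, sex, and fat, the seed, and the coefficients are assumptions. The sample size of 18 is chosen so that the residual degrees of freedom match.

```r
# Simulated stand-in data (an assumption; not the data analysed above).
set.seed(4)
n   <- 18
age <- round(runif(n, 20, 60))
sex <- factor(rep(c("F", "M"), length.out = n))
fat <- 20 + 0.25 * age - 10 * (sex == "M") + 0.1 * age * (sex == "M") +
  rnorm(n, sd = 4)

full    <- lm(fat ~ age * sex)   # with the interaction term
reduced <- lm(fat ~ age + sex)   # without the interaction term

t_int <- coef(summary(full))["age:sexM", ]  # t-test for the interaction
f_cmp <- anova(reduced, full)               # F-test comparing the two models

c(t.squared = unname(t_int["t value"])^2, F = f_cmp$F[2])
c(p.t = unname(t_int["Pr(>|t|)"]), p.F = f_cmp[["Pr(>F)"]][2])
```

Both pairs of numbers should match up to rounding error, whatever data are used, because the two models differ by exactly one parameter.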
1.6.1) Residual plots for checking assumptions of the linear model:
Linearity for Age: Using a scatter plot of the residuals against Age to check for linearity, we find the plot shows no clear pattern in the residuals, indicating that linearity for Age is reasonable.
Constant variance: Using a plot of the absolute residuals against the fitted values to check for constant variance, we find the plot shows a “fan shape” in the absolute residuals, indicating that constant variance is not reasonable.
Normality: A normal Q-Q plot of the residuals shows a roughly straight line, indicating that normality is reasonable.
Outliers and influential observations:
The plot shows no points that are far from zero on the y-axis, indicating that there are no significant outliers or influential observations.
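Diagnostic plots of this kind can be produced in R along the following lines. This is a sketch on simulated stand-in data, not the data analysed above. The variable names, the seed, and the cut-off of 3 for flagging standardized residuals are assumptions chosen for illustration.

```r
# Simulated stand-in data (an assumption; not the data analysed above).
set.seed(5)
n   <- 30
age <- runif(n, 20, 60)
y   <- 10 + 0.4 * age + rnorm(n, sd = 3)
fit <- lm(y ~ age)

rs <- rstandard(fit)        # standardized residuals
cd <- cooks.distance(fit)   # influence of each observation

if (!interactive()) pdf(NULL)  # avoid writing plot files in batch mode

plot(age, rs, ylab = "Standardized residuals")             # linearity in age
plot(fitted(fit), abs(rs), ylab = "|Standardized resid.|") # constant variance
qqnorm(rs); qqline(rs)                                     # normality
plot(cd, type = "h", ylab = "Cook's distance")             # influence

which(abs(rs) > 3)          # candidate outliers (illustrative cut-off)
```

Standardized residuals are used here so that the vertical scale is comparable across observations. Raw residuals from resid() would give plots of the same general shape.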