# Load the humanfat dataset (assumed here to come from the GLMsData package)
library(GLMsData)
data(humanfat)
The names of the variables in the dataset are: Age, Percent.Fat, Gender, BMI
# Determine which variables are quantitative and which are qualitative
# Qualitative variables are Gender
quantitative_vars <- c("Age", "BMI", "Percent.Fat")
qualitative_vars <- "Gender"
The quantitative variables are Age, BMI, and Percent.Fat; the only qualitative variable is Gender.
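As a quick programmatic check of this classification (a minimal sketch, treating numeric columns as quantitative and the rest as qualitative):
# Class of each column; numeric columns are quantitative
sapply(humanfat, class)
names(humanfat)[sapply(humanfat, is.numeric)]    # quantitative variables
names(humanfat)[!sapply(humanfat, is.numeric)]   # qualitative variables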
# Find the values of the dummy variable using treatment coding for the first two observations
# Reference level is female; the levels in the data are "F" and "M"
# (the regression output below labels the male effect "GenderM")
dummy_vars <- ifelse(humanfat$Gender == "F", 0, 1)
dummy_vars[1:2]
The reference level for Gender is female (F). The treatment-coded dummy variable values for the first two observations are 1 and 1.
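Equivalently, R's built-in treatment coding can be inspected directly; a minimal sketch, assuming Gender is stored as a factor with F as its first level:
contrasts(humanfat$Gender)                      # shows the treatment (dummy) coding R uses
model.matrix(~ Gender, data = humanfat)[1:2, ]  # dummy values for the first two observations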
# Plot the response against each explanatory variable
library(ggplot2)
ggplot(humanfat, aes(x = BMI, y = Percent.Fat)) +
  geom_point() +
  labs(title = "Percent Body Fat vs. BMI",
       x = "BMI",
       y = "Percent Body Fat") +
  theme_bw()
ggplot(humanfat, aes(x = Age, y = Percent.Fat)) +
  geom_point() +
  labs(title = "Percent Body Fat vs. Age",
       x = "Age",
       y = "Percent Body Fat") +
  theme_bw()
ggplot(humanfat, aes(x = Gender, y = Percent.Fat)) +
  geom_boxplot() +
  labs(title = "Percent Body Fat by Gender",
       x = "Gender",
       y = "Percent Body Fat") +
  theme_bw()
Based on these plots, a linear regression model appears appropriate for modeling Percent.Fat as a function of Age, Gender, and BMI. The plot of Percent.Fat against BMI shows a clear positive, roughly linear relationship. The plot of Percent.Fat against Age also shows a positive trend, although it may not be strictly linear. The boxplot of Percent.Fat by Gender shows differences in the median and spread between males and females, which a linear model can capture through a dummy variable.
𝛽_1 represents the change in the log of the expected Percent.Fat for a one-unit increase in BMI, holding gender constant. Equivalently, exp(𝛽_1) is the ratio of the expected Percent.Fat for two individuals whose BMI differs by one unit and who have the same gender.
𝛽_2 represents the difference in the log of the expected Percent.Fat between males and females with the same BMI. Equivalently, exp(𝛽_2) is the ratio of the expected Percent.Fat for males relative to females with the same BMI.
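A minimal sketch of how these interpretations could be checked numerically, assuming the model in question is a linear model for log(Percent.Fat) (the exact model specification is not shown above):
# Hypothetical fit on the log scale; exponentiating the coefficients gives
# the multiplicative (ratio) interpretations described above
log_model <- lm(log(Percent.Fat) ~ BMI + Gender, data = humanfat)
exp(coef(log_model))  # exp(beta_1): ratio per unit of BMI; exp(beta_2): male/female ratio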
| Source of Variation | df | Sum of Squares |
|---|---|---|
| cue | 3 | 117793 |
| sex | 1 | 2659 |
| age | 3 | 22850 |
| Residual | 60 | 177639 |
| Total | 67 | 320941 |
There are 68 observations in the analysis: the total df is n - 1 = 67, so n = 67 + 1 = 68.
SS_residual <- 177639
df_residual <- 60
sigma_sq <- SS_residual / df_residual
An unbiased estimate of 𝜎^2 is approximately 2960.65
# Define the ANOVA table
anova_tab <- data.frame(
  Source = c("cue", "sex", "age", "Residual", "Total"),
  df = c(3, 1, 3, 60, 67),
  SS = c(117793, 2659, 22850, 177639, 320941)
)
# Residual mean square, the denominator of each F-statistic
MS_res <- anova_tab[4, "SS"] / anova_tab[4, "df"]
# Calculate the sequential F-tests
F_cue <- (anova_tab[1, "SS"] / anova_tab[1, "df"]) / MS_res
p_cue <- pf(F_cue, anova_tab[1, "df"], anova_tab[4, "df"], lower.tail = FALSE)
F_sex <- (anova_tab[2, "SS"] / anova_tab[2, "df"]) / MS_res
p_sex <- pf(F_sex, anova_tab[2, "df"], anova_tab[4, "df"], lower.tail = FALSE)
F_age <- (anova_tab[3, "SS"] / anova_tab[3, "df"]) / MS_res
p_age <- pf(F_age, anova_tab[3, "df"], anova_tab[4, "df"], lower.tail = FALSE)
Based on these tests, cue is the only statistically significant variable at the 0.05 level (F ≈ 13.26 on 3 and 60 df, p < 0.001), while sex (F ≈ 0.90, p ≈ 0.35) and age (F ≈ 2.57, p ≈ 0.06) are not.
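If the raw data were available, the same sequential (Type I) F-tests would come directly from a single fitted model; a sketch assuming a hypothetical data frame sleep_data with columns time, cue, sex, and age:
# Hypothetical data frame and column names: anova() reports sequential F-tests,
# each term tested against the residual mean square, as computed above
fit <- lm(time ~ cue + sex + age, data = sleep_data)
anova(fit)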
The exclusion of certain participants who failed to wake during the study may raise issues with the data’s representativeness of the population at large. This has the potential to cause biases in the results and restrict the extent to which we can generalize our conclusions. Additionally, there exists the possibility that the omitted participants had different response time distributions than those who were included in the analysis, which could influence our estimates of variability and statistical significance. Thus, it is crucial to acknowledge this limitation and interpret the findings with care.
# ANOVA table:
anova_table <- data.frame(
  Source = c("cue", "sex", "age", "Residual", "Total"),
  df = c(3, 1, 3, 60, 67),
  SS = c(117793, 2659, 22850, 177639, 320941)
)
# Total sum of squares:
SST <- 320941
# R-squared and adjusted R-squared for each model:
models <- c("Cue only", "Cue and Sex", "Cue, Sex, and Age")
explanatory_vars <- list(c("cue"), c("cue", "sex"), c("cue", "sex", "age"))
model_df <- c(3, 4, 7)  # model degrees of freedom (parameters excluding the intercept)
for (i in 1:length(models)) {
  # Regression (model) sum of squares for the current model:
  model_SSR <- sum(anova_table$SS[which(anova_table$Source %in% explanatory_vars[[i]])])
  # R-squared for the current model:
  R_squared <- model_SSR / SST
  # Adjusted R-squared for the current model:
  n <- 68            # number of observations
  p <- model_df[i]   # number of regression parameters excluding the intercept
  adj_R_squared <- 1 - (1 - R_squared) * (n - 1) / (n - p - 1)
  # Print results:
  cat("\nModel:", models[i], "\n")
  cat("R-squared:", round(R_squared, 4), "\n")
  cat("Adjusted R-squared:", round(adj_R_squared, 4), "\n")
}
##
## Model: Cue only
## R-squared: 0.367
## Adjusted R-squared: 0.3374
##
## Model: Cue and Sex
## R-squared: 0.3753
## Adjusted R-squared: 0.3356
##
## Model: Cue, Sex, and Age
## R-squared: 0.4465
## Adjusted R-squared: 0.3819
Models ranked on R^2: (Cue, Sex, and Age) > (Cue and Sex) > (Cue only).
Models ranked on adjusted R^2: (Cue, Sex, and Age) > (Cue only) > (Cue and Sex).
When selecting among these models, adjusted R^2 is generally preferred over R^2, since R^2 never decreases when a predictor is added and therefore favours overfitted models. Here the model that includes Cue, Sex, and Age has the highest adjusted R^2 and should be preferred.
# Define the parameter estimates and standard errors:
param_estimates <- c(100.812, 0.332, 0.411, -3.003, -0.362, -0.521)
se <- c(13.096, 0.062, 0.090, 1.758, 2.732, 0.262)
# Calculate t-statistic and two-sided p-value for Age:
t_age <- param_estimates[2] / se[2]
p_age <- 2 * pt(abs(t_age), df = 573, lower.tail = FALSE)
# Calculate t-statistic and two-sided p-value for Smoking
# (two-sided form is needed: the one-sided upper tail of a negative t gives p > 0.5):
t_smoking <- param_estimates[5] / se[5]
p_smoking <- 2 * pt(abs(t_smoking), df = 573, lower.tail = FALSE)
The p-value for Age is less than 0.001 (t ≈ 5.35), giving strong evidence that Age is associated with systolic blood pressure. In contrast, the p-value for Smoking is far above 0.05 (t ≈ -0.13), so there is no evidence that Smoking affects systolic blood pressure after adjusting for the other variables. Therefore, we reject the null hypothesis for Age but fail to reject the null hypothesis for Smoking.
After adjusting for Age, Waist circumference, Alcohol consumption, and Smoking habits, the relationship between Ambient temperature and systolic blood pressure is negative: for each 1°C increase in ambient temperature, mean systolic blood pressure is estimated to decrease by 0.521 mm Hg, holding the other variables constant.
# Define the coefficient estimate and standard error for Ambient temperature
est <- -0.521
se <- 0.262
# Compute the confidence interval for the regression parameter for Ambient temperature
t_score <- qt(0.975, df = 573)
lower <- est - t_score * se
upper <- est + t_score * se
The 95% confidence interval for the regression parameter for Ambient temperature is (-1.036, -0.006).
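For a fitted model object, confint() would produce the same interval; a sketch assuming a hypothetical fitted lm object bp_model whose temperature coefficient is named Temperature:
# Hypothetical object and coefficient name
confint(bp_model, parm = "Temperature", level = 0.95)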
# Define the predictor values
age <- 35
waist <- 100
alcohol <- 1
smoking <- 0
temp <- 30
# Calculate the predicted mean systolic blood pressure
predicted_mean <- 100.812 + 0.332 * age + 0.411 * waist - 3.003 * alcohol - 0.362 * smoking - 0.521 * temp
The predicted mean systolic blood pressure for 35-year-old Ghanaian men who do not smoke, who drink alcohol, and who have a waist circumference of 100 cm is 134.9 mm Hg when the ambient temperature is 30°C.
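The same prediction could be obtained with predict(), which also returns an interval for the mean response; a sketch assuming the hypothetical fitted object bp_model and matching column names:
# Hypothetical object and column names
new_obs <- data.frame(Age = 35, Waist = 100, Alcohol = 1, Smoking = 0, Temperature = 30)
predict(bp_model, newdata = new_obs, interval = "confidence")  # mean prediction with 95% CI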
# Load the humanfat data set
data(humanfat)
# Fit the linear regression model
model <- lm(Percent.Fat ~ Age * Gender, data = humanfat)
# View the parameter estimates, standard errors, and p-values
summary(model)$coefficients[-1, c("Estimate", "Std. Error", "Pr(>|t|)")]
## Estimate Std. Error Pr(>|t|)
## Age 0.2400802 0.1203972 0.06599752
## GenderM -29.2691967 10.4097941 0.01385748
## Age:GenderM 0.5724628 0.2893456 0.06789632
summary(model)
##
## Call:
## lm(formula = Percent.Fat ~ Age * Gender, data = humanfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6756 -2.8862 -0.2464 1.9100 9.1641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.1116 6.2395 3.223 0.00613 **
## Age 0.2401 0.1204 1.994 0.06600 .
## GenderM -29.2692 10.4098 -2.812 0.01386 *
## Age:GenderM 0.5725 0.2893 1.978 0.06790 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 14 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.7591
## F-statistic: 18.86 on 3 and 14 DF, p-value: 3.455e-05
anova(model)
## Analysis of Variance Table
##
## Response: Percent.Fat
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 891.87 891.87 44.2736 1.089e-05 ***
## Gender 1 168.79 168.79 8.3788 0.01177 *
## Age:Gender 1 78.85 78.85 3.9144 0.06790 .
## Residuals 14 282.02 20.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary and ANOVA outputs report the same p-value (0.0679) for the t-test and F-test on the interaction term. This is expected: for a single parameter, the two tests are exactly equivalent, with the F-statistic equal to the square of the t-statistic. Here the F-statistic for the interaction is 3.9144 and the squared displayed t-statistic is 1.978^2 ≈ 3.912; the small discrepancy arises only because the printed t value is rounded to three decimal places. In either case, the interaction term is not statistically significant at the 0.05 level.
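This equivalence can be checked from the fitted model itself, using the unrounded statistics:
t_int <- summary(model)$coefficients["Age:GenderM", "t value"]
t_int^2                                # squared t-statistic for the interaction
anova(model)["Age:Gender", "F value"]  # F-statistic; equal to t_int^2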
# residual plots
par(mfrow = c(2, 2))
# check linearity: plot standardized residuals against Age
scatter.smooth(rstandard(model) ~ humanfat$Age, col = "grey", las = 1,
               ylab = "Standardized residuals", xlab = "Age (years)")
# check constant variance: plot standardized residuals against fitted values
scatter.smooth(rstandard(model) ~ fitted(model), col = "grey",
               las = 1, ylab = "Standardized residuals", xlab = "Fitted values")
# check normality: Q-Q plot
qqnorm(rstandard(model), las = 1, pch = 19)
qqline(rstandard(model))  # add reference line
rstandard(model)
## 1 2 3 4 5 6
## -0.009235476 0.830004480 -1.319042852 1.329052836 0.471569467 -0.974976261
## 7 8 9 10 11 12
## -0.009235476 -1.545547407 -0.234897811 0.431777129 2.122646472 -0.922835599
## 13 14 15 16 17 18
## -0.246732303 -0.820454785 -0.244485356 -0.055745863 1.574040599 -0.061828355
Linearity for Age: The plot of standardized residuals against Age shows some curvature and increasing spread, suggesting the linearity assumption may not be fully met. A nonlinear term or a transformation of Age may be more appropriate.
Constant Variance: The plot of standardized residuals against fitted values suggests a mean-variance relationship, with variability increasing with the fitted values. The constant-variance assumption may therefore be violated and may need to be addressed in model selection or estimation.
Normality: In the Q-Q plot the residuals curve above the reference line on the right, so they appear more right-skewed than expected under normality. This may indicate model misspecification or the need to transform the response.
Outliers and Influential Observations: Standardized residuals were used to screen for outliers. The largest standardized residual is about 2.12 in absolute value, well below 3, so there is no evidence of outliers. Influence is more directly assessed with measures such as Cook's distance, but nothing in the residuals suggests problematic observations.
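As a more direct check of influence, a sketch using standard lm diagnostics:
# Cook's distance measures each observation's influence on the fitted coefficients;
# a common rule of thumb flags values above 4/n for further inspection
cooks.distance(model)
which(cooks.distance(model) > 4 / nrow(humanfat))
summary(influence.measures(model))  # combined diagnostics (DFBETAS, DFFITS, etc.)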