Question 1.1

Component 1

  • mu = exp(beta_0 + beta_1*x_1 + beta_2*x_2^2)
    • Linear in parameters: No
    • Appropriate for linear regression models: No
    • Appropriate for generalized linear models: Yes
    • Explanation: As written, mu is an exponential (non-linear) function of the parameters, so the component is not linear in the parameters and is not appropriate for a linear regression model. However, a logarithmic link gives log(mu) = beta_0 + beta_1*x_1 + beta_2*x_2^2, which is linear in the parameters (the x_2^2 term is non-linear in the covariate, which is permitted). The component is therefore appropriate for a generalized linear model with a log link (a sketch follows).
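
A minimal sketch of fitting component 1 as a GLM with a log link; the data are simulated purely for illustration, so all values here are hypothetical:

# Hypothetical illustration: component 1 as a GLM with a log link, since
# log(mu) = beta_0 + beta_1*x_1 + beta_2*x_2^2 is linear in the parameters
set.seed(1)
x1 <- runif(50); x2 <- runif(50)
y  <- exp(1 + 2*x1 - 0.5*x2^2) + rnorm(50, sd = 0.2)
fit <- glm(y ~ x1 + I(x2^2), family = gaussian(link = "log"))
coef(fit)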

Component 2

  • mu = log(beta_0 + beta_1*x_1) + beta_2*x_2^2
    • Linear in parameters: No
    • Appropriate for linear regression models: No
    • Appropriate for generalized linear models: No
    • Explanation: The component is not linear in the parameters, because beta_0 and beta_1 appear inside a logarithm. Moreover, no link function can transform mu into a linear combination of the parameters, since mu mixes log(beta_0 + beta_1*x_1) with the separate term beta_2*x_2^2. The component is therefore appropriate for neither linear regression models nor generalized linear models.

Component 3

  • mu = beta_0 + beta_1*x
    • Linear in parameters: Yes
    • Appropriate for linear regression models: Yes
    • Appropriate for generalized linear models: Yes
    • Explanation: The component is linear in the parameters, as mu is a linear function of beta_0 and beta_1. It is appropriate for a linear regression model directly, and for a generalized linear model using the identity link.

Component 4

  • mu = (beta_0 + beta_1*x_2 + beta_2*x_1*x_2)^2
    • Linear in parameters: No
    • Appropriate for linear regression models: No
    • Appropriate for generalized linear models: Yes
    • Explanation: The component is not linear in the parameters, because the entire linear combination is squared, so it is not appropriate for a linear regression model. However, a square-root link gives sqrt(mu) = beta_0 + beta_1*x_2 + beta_2*x_1*x_2, which is linear in the parameters (the x_1*x_2 interaction is non-linear in the covariates, which is permitted). The component is therefore appropriate for a generalized linear model with a square-root link (a sketch follows).
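
A minimal sketch of fitting component 4 as a GLM with a square-root link, again on simulated (hypothetical) data; quasi() is used because gaussian() does not offer a "sqrt" link directly:

# Hypothetical illustration: component 4 as a GLM with a sqrt link, since
# sqrt(mu) = beta_0 + beta_1*x_2 + beta_2*x_1*x_2 is linear in the parameters
set.seed(1)
x1 <- runif(50); x2 <- runif(50)
y  <- (2 + 1.5*x2 + 0.8*x1*x2)^2 + rnorm(50, sd = 0.2)
fit <- glm(y ~ x2 + I(x1*x2), family = quasi(link = "sqrt", variance = "constant"))
coef(fit)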

Question 1.2

1)

# Load the humanfat dataset (from the GLMsData package)
library(GLMsData)
data(humanfat)
names(humanfat)

The variables in the dataset are Age, Percent.Fat, Gender, and BMI.

2)

# Classify the variables: Gender is the only qualitative (categorical)
# variable; the rest are quantitative
quantitative_vars <- c("Age", "BMI", "Percent.Fat")
qualitative_vars <- "Gender"

The quantitative variables are Age, BMI, and Percent.Fat; the qualitative variable is Gender.

3)

# Dummy variable under treatment coding, with female (F) as the reference level.
# The factor levels are F and M (cf. the coefficient name GenderM in Question 1.5).
dummy_vars <- as.numeric(humanfat$Gender == "M")
dummy_vars[1:2]

The reference level for Gender is female (F); the dummy-variable values for the first two observations are 1 and 1, i.e. both are male.
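
R's treatment coding can also be inspected directly via the model matrix (a sketch; the GenderM column is the dummy variable):

# The model matrix shows the treatment-coded dummy column for the first two rows
model.matrix(~ Gender, data = humanfat)[1:2, ]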

4)

# Plot the response against each explanatory variable
library(ggplot2)

ggplot(humanfat, aes(x=BMI, y=Percent.Fat)) +
  geom_point() +
  labs(title = "Percent Body Fat vs. BMI",
       x = "BMI",
       y = "Percent Body Fat") +
  theme_bw()

ggplot(humanfat, aes(x=Age, y=Percent.Fat)) +
  geom_point() +
  labs(title = "Percent Body Fat vs. Age",
       x = "Age",
       y = "Percent Body Fat") +
  theme_bw()

ggplot(humanfat, aes(x=Gender, y=Percent.Fat)) +
  geom_boxplot() +
  labs(title = "Percent Body Fat by Gender",
       x = "Gender",
       y = "Percent Body Fat") +
  theme_bw()

Important features:

  • BMI vs Percent.Fat: There is a clear positive relationship between BMI and Percent.Fat, with some outliers at higher BMI values.
  • Age vs Percent.Fat: There is a clear positive relationship between Age and Percent.Fat, with some outliers.
  • Gender vs Percent.Fat: Females appear to have a higher median and a smaller range of Percent.Fat than males.

5)

Based on the plots, a linear regression model appears appropriate for modeling Percent.Fat as a function of Age, Gender, and BMI. The plot of Percent.Fat against BMI shows a clear positive, roughly linear relationship. The plot against Age shows a positive trend, although it may not be strictly linear. The boxplot by Gender shows differences in the median and spread of Percent.Fat between males and females, which a linear model can capture through a dummy variable.

6)

  • beta_0
    • represents the intercept of the regression line: the predicted Percent.Fat when BMI is 0. Since a BMI of 0 lies far outside the observed data, this is an extrapolation with no direct practical interpretation.
  • beta_1
    • represents the slope of the regression line: the change in the mean Percent.Fat for a one-unit increase in BMI.

7)

𝛽_1 is the change in the log of the expected Percent.Fat for a one-unit increase in BMI, holding gender constant; equivalently, exp(𝛽_1) is the multiplicative factor by which the expected Percent.Fat changes per unit increase in BMI.

𝛽_2 is the difference in the log of the expected Percent.Fat between males and females with the same BMI; equivalently, exp(𝛽_2) is the ratio of the expected Percent.Fat for males to that for females.
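
In symbols (assuming the log-link model implied by the question, log 𝜇 = 𝛽_0 + 𝛽_1*BMI + 𝛽_2*Gender):

log 𝜇(BMI + 1, Gender) − log 𝜇(BMI, Gender) = 𝛽_1, so 𝜇(BMI + 1, Gender) / 𝜇(BMI, Gender) = exp(𝛽_1)
log 𝜇(BMI, M) − log 𝜇(BMI, F) = 𝛽_2, so 𝜇(BMI, M) / 𝜇(BMI, F) = exp(𝛽_2)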

8)

    • For the model in part 6 (Percent.Fat against BMI): 𝑝 = 2, since there are two model parameters (𝛽_0 and 𝛽_1), and 𝑞 = 𝑛 − 𝑝 = 18 − 2 = 16, since there are 18 observations in the dataset.
    • For the model in part 7 (with BMI and Gender): 𝑝 = 3, since there are three model parameters (𝛽_0, 𝛽_1, and 𝛽_2), and 𝑞 = 𝑛 − 𝑝 = 18 − 3 = 15.
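
A quick arithmetic check (a sketch, with n = 18 taken from part 1):

# Residual degrees of freedom q = n - p for each of the two models
n <- 18
c(q_part6 = n - 2, q_part7 = n - 3)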

Question 1.3

1)

Source of Variation    df   Sum of Squares
cue                     3           117793
sex                     1             2659
age                     3            22850
Residual               60           177639
Total                  67           320941

2)

There are 68 observations in the analysis: the total df equals n − 1, so n = 67 + 1 = 68.

3)

# Unbiased estimate of sigma^2: the residual mean square
SS_residual <- 177639
df_residual <- 60
sigma_sq <- SS_residual / df_residual

An unbiased estimate of 𝜎^2 is the residual mean square, approximately 2960.65.

4)

# Define the ANOVA table (SS for Total is 320941; see part 1)
anova_tab <- data.frame(
  Source = c("cue", "sex", "age", "Residual", "Total"),
  df = c(3, 1, 3, 60, 67),
  SS = c(117793, 2659, 22850, 177639, 320941)
)

# Residual mean square: the denominator of every sequential F-test
MS_res <- anova_tab[4, "SS"] / anova_tab[4, "df"]

# Sequential (Type I) F-tests, each term tested against the residual mean square
F_cue <- (anova_tab[1, "SS"] / anova_tab[1, "df"]) / MS_res
p_cue <- pf(F_cue, anova_tab[1, "df"], anova_tab[4, "df"], lower.tail = FALSE)

F_sex <- (anova_tab[2, "SS"] / anova_tab[2, "df"]) / MS_res
p_sex <- pf(F_sex, anova_tab[2, "df"], anova_tab[4, "df"], lower.tail = FALSE)

F_age <- (anova_tab[3, "SS"] / anova_tab[3, "df"]) / MS_res
p_age <- pf(F_age, anova_tab[3, "df"], anova_tab[4, "df"], lower.tail = FALSE)

  • Cue:
    • F: 13.262
    • df: 3 and 60
    • p: 9.54 × 10^-7
  • Sex:
    • F: 0.898
    • df: 1 and 60
    • p: 0.347
  • Age:
    • F: 2.573
    • df: 3 and 60
    • p: 0.0624

Based on these sequential F-tests, cue is the only variable statistically significant at the 0.05 level; sex (p ≈ 0.35) is not, and age is marginal (p ≈ 0.06).

5)

The exclusion of certain participants who failed to wake during the study may raise issues with the data’s representativeness of the population at large. This has the potential to cause biases in the results and restrict the extent to which we can generalize our conclusions. Additionally, there exists the possibility that the omitted participants had different response time distributions than those who were included in the analysis, which could influence our estimates of variability and statistical significance. Thus, it is crucial to acknowledge this limitation and interpret the findings with care.

6)

# ANOVA table:
anova_table <- data.frame(
  Source = c("cue", "sex", "age", "Residual", "Total"),
  df = c(3, 1, 3, 60, 67),
  SS = c(117793, 2659, 22850, 177639, 320941)
)

# Total sum of squares (corrected value; see part 1):
SST <- 320941

# R-squared and adjusted R-squared for each nested model:
models <- c("Cue only", "Cue and Sex", "Cue, Sex, and Age")
explanatory_vars <- list(c("cue"), c("cue", "sex"), c("cue", "sex", "age"))
model_df <- c(3, 4, 7)  # regression (model) degrees of freedom for each model

for (i in seq_along(models)) {
  # Regression SS for the current model (sum of its sequential SS terms):
  model_SSR <- sum(anova_table$SS[anova_table$Source %in% explanatory_vars[[i]]])
  
  # R-squared for the current model:
  R_squared <- model_SSR / SST
  
  # Adjusted R-squared for the current model:
  n <- 68            # number of observations
  p <- model_df[i]   # regression degrees of freedom
  adj_R_squared <- 1 - (1 - R_squared) * (n - 1) / (n - p - 1)
  
  # Print results:
  cat("\nModel:", models[i], "\n")
  cat("R-squared:", round(R_squared, 4), "\n")
  cat("Adjusted R-squared:", round(adj_R_squared, 4), "\n")
}
## 
## Model: Cue only 
## R-squared: 0.367 
## Adjusted R-squared: 0.3374 
## 
## Model: Cue and Sex 
## R-squared: 0.3753 
## Adjusted R-squared: 0.3356 
## 
## Model: Cue, Sex, and Age 
## R-squared: 0.4465 
## Adjusted R-squared: 0.3819

7)

Models ranked on R^2: (Cue, Sex, and Age) > (Cue and Sex) > (Cue only).
Models ranked on adjusted R^2: (Cue, Sex, and Age) > (Cue only) > (Cue and Sex).

When selecting among models, adjusted R^2 is generally preferred over R^2, since R^2 never decreases when a predictor is added and therefore favours overfitted models. In this case, the model that includes Cue, Sex, and Age has the highest adjusted R^2 and should be preferred.

Question 1.4

1)

# Define the parameter estimates and standard errors:
param_estimates <- c(100.812, 0.332, 0.411, -3.003, -0.362, -0.521)
se <- c(13.096, 0.062, 0.090, 1.758, 2.732, 0.262)

# Two-sided t-test for Age (H0: beta_Age = 0):
t_age <- param_estimates[2] / se[2]
p_age <- 2 * pt(abs(t_age), df = 573, lower.tail = FALSE)

# Two-sided t-test for Smoking (H0: beta_Smoking = 0):
t_smoking <- param_estimates[5] / se[5]
p_smoking <- 2 * pt(abs(t_smoking), df = 573, lower.tail = FALSE)

  • Age:
    • t-value: 5.355
    • p-value: < 0.001
  • Smoking:
    • t-value: -0.133
    • p-value: ≈ 0.895

The p-value for Age is less than 0.001, giving strong evidence that Age is a significant predictor of systolic blood pressure. The p-value for Smoking (about 0.895) is far above 0.05, so there is no evidence that Smoking affects systolic blood pressure after adjusting for the other variables. Therefore, we reject the null hypothesis for Age but fail to reject the null hypothesis for Smoking.

2)

After adjusting for Age, waist circumference, alcohol consumption, and smoking habits, the relationship between ambient temperature and systolic blood pressure is negative: for each 1 °C increase in ambient temperature, the mean systolic blood pressure is estimated to decrease by 0.521 mmHg, holding the other variables constant.

3)

# Define the coefficient estimate and standard error for Ambient temperature
est <- -0.521
se <- 0.262

# Compute the confidence interval for the regression parameter for Ambient temperature
t_score <- qt(0.975, df = 573)
lower <- est - t_score * se
upper <- est + t_score * se

The 95% confidence interval for the regression parameter for ambient temperature is (−1.036, −0.006) mmHg per °C.

4)

# Define the predictor values
age <- 35
waist <- 100
alcohol <- 1
smoking <- 0
temp <- 30

# Calculate the predicted mean systolic blood pressure
predicted_mean <- 100.812 + 0.332 * age + 0.411 * waist - 3.003 * alcohol - 0.362 * smoking - 0.521 * temp

The predicted mean systolic blood pressure for 35-year-old Ghanaian men who do not smoke, who drink alcohol, and who have a waist circumference of 100 cm, when the ambient temperature is 30 °C, is 134.9 mmHg.

Question 1.5

1)

# Load the humanfat data set
data(humanfat)

# Fit the linear regression model
model <- lm(Percent.Fat ~ Age * Gender, data = humanfat)

# View the parameter estimates, standard errors, and p-values
summary(model)$coefficients[-1, c("Estimate", "Std. Error", "Pr(>|t|)")]
##                Estimate Std. Error   Pr(>|t|)
## Age           0.2400802  0.1203972 0.06599752
## GenderM     -29.2691967 10.4097941 0.01385748
## Age:GenderM   0.5724628  0.2893456 0.06789632

2)

  • For females (Gender = 0):
    • Percent.Fat = β0 + β1 * Age + β2 * 0 + β3 * Age * 0 + ε
    • Percent.Fat = β0 + β1 * Age + ε
    • Fitted equation: Percent.Fat = 20.11 + 0.24 * Age
  • For males (Gender = 1):
    • Percent.Fat = β0 + β1 * Age + β2 * 1 + β3 * Age * 1 + ε
    • Percent.Fat = (β0 + β2) + (β1 + β3) * Age + ε
    • Fitted equation: Percent.Fat = (20.11 − 29.27) + (0.24 + 0.57) * Age = −9.16 + 0.81 * Age
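
The gender-specific intercepts and slopes can be recovered directly from the fitted model object from part 1 (a sketch):

# Recover the intercept and slope for each gender from the fitted coefficients
b <- coef(model)
c(intercept_F = unname(b["(Intercept)"]),
  slope_F     = unname(b["Age"]),
  intercept_M = unname(b["(Intercept)"] + b["GenderM"]),
  slope_M     = unname(b["Age"] + b["Age:GenderM"]))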

3)

summary(model)
## 
## Call:
## lm(formula = Percent.Fat ~ Age * Gender, data = humanfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6756 -2.8862 -0.2464  1.9100  9.1641 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  20.1116     6.2395   3.223  0.00613 **
## Age           0.2401     0.1204   1.994  0.06600 . 
## GenderM     -29.2692    10.4098  -2.812  0.01386 * 
## Age:GenderM   0.5725     0.2893   1.978  0.06790 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.488 on 14 degrees of freedom
## Multiple R-squared:  0.8016, Adjusted R-squared:  0.7591 
## F-statistic: 18.86 on 3 and 14 DF,  p-value: 3.455e-05
anova(model)
## Analysis of Variance Table
## 
## Response: Percent.Fat
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Age         1 891.87  891.87 44.2736 1.089e-05 ***
## Gender      1 168.79  168.79  8.3788   0.01177 *  
## Age:Gender  1  78.85   78.85  3.9144   0.06790 .  
## Residuals  14 282.02   20.14                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary and ANOVA outputs report the same p-value for the interaction term, 0.0679. This is no coincidence: for a term with a single degree of freedom, the F-test and the t-test are exactly equivalent, with F = t^2. Here t = 0.5725/0.2893 ≈ 1.9785, and 1.9785^2 ≈ 3.9144, which equals the F-statistic up to rounding of the displayed values. In either case, the interaction term is not statistically significant at the 0.05 level.
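
This equivalence can be checked numerically from the objects fitted above (a sketch):

# Verify F = t^2 for the single-df interaction term
t_int <- summary(model)$coefficients["Age:GenderM", "t value"]
F_int <- anova(model)["Age:Gender", "F value"]
c(t_squared = t_int^2, F = F_int)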

4)

# Residual diagnostic plots
par(mfrow = c(2, 2))
# Check linearity: plot standardized residuals against Age
scatter.smooth(rstandard(model) ~ humanfat$Age, col = "grey", las = 1,
   ylab = "Standardized residuals", xlab = "Age (years)")
# Check constant variance: plot standardized residuals against fitted values
scatter.smooth(rstandard(model) ~ fitted(model), col = "grey",
   las = 1, ylab = "Standardized residuals", xlab = "Fitted values")
# Check normality: Q-Q plot
qqnorm(rstandard(model), las = 1, pch = 19)
qqline(rstandard(model))  # add reference line

rstandard(model)
##            1            2            3            4            5            6 
## -0.009235476  0.830004480 -1.319042852  1.329052836  0.471569467 -0.974976261 
##            7            8            9           10           11           12 
## -0.009235476 -1.545547407 -0.234897811  0.431777129  2.122646472 -0.922835599 
##           13           14           15           16           17           18 
## -0.246732303 -0.820454785 -0.244485356 -0.055745863  1.574040599 -0.061828355

Linearity for Age: The plot of standardized residuals against Age suggests that the linearity assumption may not be fully met: the trend shows some curvature, and the spread increases with Age. This indicates that the linear model may not fully capture the true relationship between the response and Age; a nonlinear term or a transformation of the Age variable may be more appropriate.

Constant Variance: The plot of residuals against fitted values indicated a mean-variance relationship, with variance increasing as the mean increased. This suggests that the assumption of constant variance may not be met, and may need to be addressed in model selection or estimation.

Normality: The Q-Q plot of residuals showed that the normality assumption may be violated, as the residuals curve above the line on the right. This suggests that the residuals are more right-skewed than expected under normality, which may indicate that the model is misspecified or that a transformation of the response variable may be necessary.

Outliers and Influential Observations: Standardized residuals were used to screen for outliers. None exceeds 3 in absolute value (the largest is about 2.12), suggesting no extreme outliers. Standardized residuals do not, however, directly measure influence; a formal check would use a measure such as Cook's distance, sketched below.
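
A supplementary influence check (a sketch; the cutoff shown is one common rule of thumb, not the only choice):

# Supplementary influence check: Cook's distance for each observation
plot(cooks.distance(model), type = "h", las = 1,
     xlab = "Observation", ylab = "Cook's distance")
abline(h = 4 / nrow(humanfat), lty = 2)  # rule-of-thumb cutoff 4/n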