ChronicPain <- read.csv("/Users/dgkamper/Library/Mobile Documents/com~apple~CloudDocs/Axis - HQ/PhD Terms/Classes/Spring 2024/Psych 250C/Problem Sets/Psych 250C Problems Sets/ChronicPain.csv")


Question 1

The outcome variable we are focusing on is reported pain. Estimate the correlation between pain_pre and pain_post. For an individual who starts out two standard deviations above the mean on pain_pre, what is the expected pain_post, in standard deviations from the mean?


# Correlation
correlation_preandpost <- cor(ChronicPain$pain_pre, ChronicPain$pain_post)

print(correlation_preandpost)
## [1] 0.4000344

# Prediction for two standard deviations

# Means and Standard Deviations

mean_pre <- mean(ChronicPain$pain_pre)

sd_pre <- sd(ChronicPain$pain_pre)

mean_post <- mean(ChronicPain$pain_post)

sd_post <- sd(ChronicPain$pain_post)

# Predict pain_post for an individual who starts out
# two standard deviations above the mean on pain_pre

# Regression
model_preandpost <- lm(pain_post ~ pain_pre, data = ChronicPain)

summary(model_preandpost)
## 
## Call:
## lm(formula = pain_post ~ pain_pre, data = ChronicPain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50343 -0.94279  0.05721  0.69737  2.05721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.7002     0.9207   1.847   0.0726 .
## pain_pre      0.5606     0.2084   2.691   0.0105 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.193 on 38 degrees of freedom
## Multiple R-squared:   0.16,  Adjusted R-squared:  0.1379 
## F-statistic:  7.24 on 1 and 38 DF,  p-value: 0.01054

# Predicted Score
sd2prepainpost <- data.frame(pain_pre = mean_pre + 2 * sd_pre)

predicted_pain_post <- predict(model_preandpost, sd2prepainpost)

print(predicted_pain_post)
##        1 
## 5.152907

# Standard Deviations of Predicted Score
predicted_pain_post_sd <- (predicted_pain_post - mean_post) / sd_post

print(predicted_pain_post_sd)
##         1 
## 0.8000689

As one can observe, the correlation coefficient is approximately 0.4000344. This suggests a moderate positive relationship between pain intensity scores before and after treatment: individuals who reported higher pain intensity before the treatment (pain_pre) tend to also report higher scores after the treatment (pain_post), though the relationship is not very strong. Equivalently, people who differ by one standard deviation in pain intensity before treatment are estimated to differ by about 0.40 standard deviations in pain intensity after treatment.

The expected post-treatment pain intensity score for an individual whose pre-treatment pain intensity score is two standard deviations above the mean is approximately 5.1529. Equivalently, that individual's expected post-treatment score is 0.8001 standard deviations above the mean on post-treatment pain intensity.
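As a quick check, in simple regression the predicted standardized score is simply the correlation times the predictor's z-score, so the regression-based answer can be verified directly; a minimal sketch:

# In standardized units, z_post_hat = r * z_pre. With z_pre = 2,
# this gives 2 * 0.4000344 = 0.8001 SDs above the mean,
# matching predicted_pain_post_sd above.
r <- cor(ChronicPain$pain_pre, ChronicPain$pain_post)
2 * r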



Question 2

Consider if we wanted to know whether age of participant was related to effectiveness of treatment (as measured by the change in pain [deltapain = pain_pre - pain_post]). Why is it important to account for regression to the mean in estimating this relationship?

Before answering, I want to clarify what regression to the mean is. Regression to the mean is the phenomenon whereby a variable that is extreme on its first measurement tends to be closer to the mean on a second measurement. Subjects with extremely high or low scores on one occasion are likely to have scores closer to the average on another occasion, simply due to chance rather than any systematic change.

For example, if the first measurement (pain_pre) is extreme, a second measurement (pain_post) is likely to be closer to the average, purely due to chance. Extreme values in pain_pre are likely to be followed by less extreme, more average values in pain_post, not necessarily because of any treatment effect but because of variability in the pain measurements. In studies such as this one, if pain_pre scores are particularly high or low, pain_post scores are statistically likely to move toward the mean. If this natural tendency is not accounted for, any observed improvement (a decrease in pain scores) could be mistakenly attributed to the effectiveness of the treatment rather than to this phenomenon.
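To make regression to the mean concrete, here is a small simulation sketch (using made-up data, not the assignment dataset) in which two noisy measurements of the same underlying pain level are taken and there is no treatment effect at all:

# Simulated example: two noisy measurements of the same true pain level,
# with no treatment whatsoever
set.seed(1)
true_pain <- rnorm(1000, mean = 5, sd = 1)
pre  <- true_pain + rnorm(1000, sd = 1)  # first measurement + noise
post <- true_pain + rnorm(1000, sd = 1)  # second measurement, new noise

# Among people with extreme first measurements, the second measurement
# is on average closer to the overall mean, even though nothing changed
high_pre <- pre > quantile(pre, 0.9)
mean(pre[high_pre])   # well above the overall mean of ~5
mean(post[high_pre])  # noticeably closer to 5: regression to the mean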



Question 3

Consider a model where we are predicting change in pain (deltapain) using treatment, which was randomly assigned. Treatment is coded in a dichotomous variable where 0 indicates acceptance and commitment therapy (ACT), and 1 indicates applied relaxation (AR).

3.1) Estimate a model predicting change in pain using treatment. Report the code and output.


# Model predicting change in pain by treatment type
deltapainmodel <- lm(deltapain ~ Treatment, data = ChronicPain)

summary(deltapainmodel)
## 
## Call:
## lm(formula = deltapain ~ Treatment, data = ChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4545 -0.8889  0.1111  0.5454  3.1111 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.4545     0.2616   1.737   0.0904 .
## Treatment    -0.5657     0.3900  -1.450   0.1552  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.227 on 38 degrees of freedom
## Multiple R-squared:  0.05244,    Adjusted R-squared:  0.02751 
## F-statistic: 2.103 on 1 and 38 DF,  p-value: 0.1552

3.2) Interpret the coefficient estimates (b_0, b_1), provide an interpretation of each estimate and information about the inferential decision we would make based on the results of the analysis. Use an alpha-level of 0.10 to determine “statistical significance.”


Intercept (b0) = 0.4545

The estimated intercept is 0.4545. Since Treatment is coded 0 for ACT, this is the expected change in pain (pain_pre - pain_post) for participants who received Acceptance and Commitment Therapy (ACT): on average, the ACT group's pain dropped by about 0.45 points.

Treatment coefficient (b1) = -0.5657

The treatment coefficient is -0.5657. Since AR is coded as 1, participants receiving Applied Relaxation (AR) are expected to show a change in pain that is 0.5657 units smaller than those receiving ACT. Because deltapain = pain_pre - pain_post (positive values indicate improvement), the negative coefficient means the AR group improved less than the ACT group; in fact, the AR group's average change is 0.4545 - 0.5657 ≈ -0.1111, a slight increase in pain.

Moreover, the p-value associated with the treatment coefficient is 0.1552. Since we are using an alpha level of 0.10 and 0.1552 > 0.10, we fail to reject the null hypothesis. The conclusion is that, although the ACT group shows more average pain reduction than the AR group, this difference is not statistically significant at the 10% level based on the sample data.

The model’s R2 is approximately 0.05244, indicating that about 5.24% of the variability in the change in pain scores is accounted for by the treatment type. This suggests a relatively small effect of treatment type on change in pain.
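Since Treatment is a 0/1 indicator, these coefficient estimates can be checked against the raw group means; a quick sketch:

# The intercept should equal the ACT (0) group mean of deltapain, and
# intercept + slope the AR (1) group mean: ~0.4545 and ~-0.1111
tapply(ChronicPain$deltapain, ChronicPain$Treatment, mean)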


3.3) What if we want to estimate whether the average change in pain is significantly different from zero for those in the AR condition. Create a new variable for treatment which uses a different coding strategy that allows us to estimate this effect in a simple linear regression. Include your code, output, and answer to this question based on the results.


# Subset ChronicPain to only include AR group participants
DataOnly_AR <- subset(ChronicPain, Treatment == 1)

# Create a variable that is 1 for all observations in the AR group.
# I will call this variable intercept. Note: lm() already includes an
# intercept term, so this extra constant is redundant and is dropped as
# NA (the "singularities" note in the output below); the fitted model is
# effectively intercept-only, which is what we want here.
DataOnly_AR$intercept <- 1

# Run regression of deltapain on the variable intercept
model_AR <- lm(deltapain ~ intercept, data = DataOnly_AR)

summary(model_AR)
## 
## Call:
## lm(formula = deltapain ~ intercept, data = DataOnly_AR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8889 -0.8889  0.1111  0.1111  3.1111 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.1111     0.3120  -0.356    0.726
## intercept         NA         NA      NA       NA
## 
## Residual standard error: 1.323 on 17 degrees of freedom

The estimated intercept is -0.1111, which is the average change in pain for the AR group, with a standard error of 0.312. The p-value for the intercept is 0.726 > 0.10, so the average change in pain for the AR group is not statistically different from zero. Therefore the answer is no: the average change in pain is not significantly different from zero for those in the AR condition.
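An alternative that keeps the full sample, and is closer to the "different coding strategy" the question asks for, is to recode treatment so that AR is the reference category (0); the intercept of that model then directly estimates the average change in pain for the AR group. A minimal sketch (the variable name Treatment_ARref is my own):

# Recode so AR becomes the reference group (0) and ACT becomes 1
ChronicPain$Treatment_ARref <- 1 - ChronicPain$Treatment

model_ARref <- lm(deltapain ~ Treatment_ARref, data = ChronicPain)
summary(model_ARref)
# The intercept should again be ~-0.1111 (the AR mean change) and lead to
# the same conclusion, though its standard error is based on the pooled
# residual variance from both groups rather than the AR subset alone.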



Question 4

Health-related quality of life is captured using two scores: PCS_pre and MCS_pre.

4.1) Estimate a linear model where you can test the combined contribution of the two quality of life measures with respect to explained variability in change in pain (deltapain) after controlling for initial levels of pain (pain_pre) and age.

# Exclude case 25 because of missing values
NewChronicPain <- ChronicPain[-25,]

# Fit the linear regression model
lifequalitymodel <- lm(deltapain ~ PCS_pre + MCS_pre + pain_pre + age, data = NewChronicPain)

summary(lifequalitymodel)
## 
## Call:
## lm(formula = deltapain ~ PCS_pre + MCS_pre + pain_pre + age, 
##     data = NewChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6450 -0.5595 -0.2092  0.8448  2.0257 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.320625   1.639467  -2.025   0.0507 .
## PCS_pre      0.043056   0.018180   2.368   0.0237 *
## MCS_pre     -0.006559   0.020235  -0.324   0.7478  
## pain_pre     0.647837   0.250458   2.587   0.0141 *
## age         -0.002492   0.018798  -0.133   0.8953  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.15 on 34 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.1465 
## F-statistic: 2.631 on 4 and 34 DF,  p-value: 0.05122

4.2) Report the change in R2 when adding the two variables to the model, report the results of a significance test examining if this change in R2 is significantly different from zero, and report the conclusion about the research question you would make based on the inferential test. Use an alpha-level of 0.10 to determine “statistical significance.”


# Reduced model with only pain_pre and age
lifequalitymodel_reduced <- lm(deltapain ~ pain_pre + age, data = NewChronicPain)

summary(lifequalitymodel_reduced)
## 
## Call:
## lm(formula = deltapain ~ pain_pre + age, data = NewChronicPain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.13169 -0.58268 -0.03745  0.86023  2.55749 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.524663   1.009929  -1.510    0.140  
## pain_pre     0.456229   0.234927   1.942    0.060 .
## age         -0.005385   0.019743  -0.273    0.787  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.211 on 36 degrees of freedom
## Multiple R-squared:  0.1037, Adjusted R-squared:  0.05388 
## F-statistic: 2.082 on 2 and 36 DF,  p-value: 0.1394

# Full model with pain_pre, age, and the quality of life scores
lifequalitymodel_full <- lm(deltapain ~ pain_pre + age + PCS_pre + MCS_pre, data = NewChronicPain)

summary(lifequalitymodel_full)
## 
## Call:
## lm(formula = deltapain ~ pain_pre + age + PCS_pre + MCS_pre, 
##     data = NewChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6450 -0.5595 -0.2092  0.8448  2.0257 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.320625   1.639467  -2.025   0.0507 .
## pain_pre     0.647837   0.250458   2.587   0.0141 *
## age         -0.002492   0.018798  -0.133   0.8953  
## PCS_pre      0.043056   0.018180   2.368   0.0237 *
## MCS_pre     -0.006559   0.020235  -0.324   0.7478  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.15 on 34 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.1465 
## F-statistic: 2.631 on 4 and 34 DF,  p-value: 0.05122

# Compare the two models using an ANOVA
lifequalitymodel_comparison <- anova(lifequalitymodel_reduced, lifequalitymodel_full)

print(lifequalitymodel_comparison)
## Analysis of Variance Table
## 
## Model 1: deltapain ~ pain_pre + age
## Model 2: deltapain ~ pain_pre + age + PCS_pre + MCS_pre
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     36 52.814                              
## 2     34 44.995  2    7.8192 2.9543 0.06562 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the change in R2, I first look at the following:

Reduced model R2 = 0.1037

Full model R2 which includes PCS_pre and MCS_pre = 0.2364

The change in R2 when adding the quality of life measures to the model is:

Delta R2 = 0.2364 - 0.1037 = 0.1327

For our inferential test, I use ANOVA to compare the two models to test if the addition of PCS_pre and MCS_pre significantly improves the model fit. I get F(2, 34) = 2.9543, p = 0.06562.

Given that the p-value of 0.06562 is less than the alpha-level of 0.10, we can conclude that the change in R2 is statistically significant. Therefore, the addition of the quality of life scores (PCS_pre and MCS_pre) does contribute significantly to explaining the variability in the change in pain, after controlling for initial levels of pain and age.

Based on the inferential test, we would conclude that yes, quality of life does predict change in pain after controlling for initial levels of pain and age, given our criteria for statistical significance at the alpha-level of 0.10. This suggests that the measures of physical and mental components of quality of life offer meaningful insights into the changes in pain that individuals experience.
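As a check, the same F-statistic can be computed directly from the two R2 values; a sketch:

# F for the R^2 change: ((R2_full - R2_reduced)/df1) / ((1 - R2_full)/df2)
df1 <- 2   # number of predictors added (PCS_pre, MCS_pre)
df2 <- 34  # residual df of the full model
((0.2364 - 0.1037) / df1) / ((1 - 0.2364) / df2)  # ~2.95, matching anova()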


4.3) Report the squared partial multiple correlation and squared semi-partial multiple correlation for the set of quality of life predictors with the outcome deltapain.

# Squared semi-partial multiple correlation for the set (PCS_pre, MCS_pre):
# the increase in R^2 when the set is added to the reduced model
R_squared_full <- summary(lifequalitymodel_full)$r.squared
R_squared_reduced <- summary(lifequalitymodel_reduced)$r.squared
squared_semi_partial_set <- R_squared_full - R_squared_reduced

# Squared partial multiple correlation for the set: the R^2 increase as a
# proportion of the variance the reduced model leaves unexplained
squared_partial_set <- (R_squared_full - R_squared_reduced) / (1 - R_squared_reduced)

print(paste("Squared Semi-Partial Multiple Correlation for the set:", squared_semi_partial_set))
## [1] "Squared Semi-Partial Multiple Correlation for the set: 0.13270233707974"
print(paste("Squared Partial Multiple Correlation for the set:", round(squared_partial_set, 3)))
## [1] "Squared Partial Multiple Correlation for the set: 0.148"

# For reference, squared partial correlations for each quality of life
# measure individually (adding each one alone to the reduced model)
model_with_pcs <- lm(deltapain ~ pain_pre + age + PCS_pre, data = NewChronicPain)
R_squared_with_pcs <- summary(model_with_pcs)$r.squared
squared_partial_PCS <- (R_squared_with_pcs - R_squared_reduced) / (1 - R_squared_reduced)

model_with_mcs <- lm(deltapain ~ pain_pre + age + MCS_pre, data = NewChronicPain)
R_squared_with_mcs <- summary(model_with_mcs)$r.squared
squared_partial_MCS <- (R_squared_with_mcs - R_squared_reduced) / (1 - R_squared_reduced)

print(paste("Squared Partial Correlation for PCS_pre:", squared_partial_PCS))
## [1] "Squared Partial Correlation for PCS_pre: 0.145419427149347"
print(paste("Squared Partial Correlation for MCS_pre:", squared_partial_MCS))
## [1] "Squared Partial Correlation for MCS_pre: 0.00751475496233062"

Question 5


Consider the model for change in pain with four predictors: PCS_pre, MCS_pre, pain_pre, and age. For this question you may not need to repeat any new analysis, but please paste the relevant output when requested for each part of the question.

5.1) Report R2 and adj-R2 for this model. Explain why we might trust adj-R2 more when trying to generalize explained variance to the population.

summary(lifequalitymodel_full)
## 
## Call:
## lm(formula = deltapain ~ pain_pre + age + PCS_pre + MCS_pre, 
##     data = NewChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6450 -0.5595 -0.2092  0.8448  2.0257 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.320625   1.639467  -2.025   0.0507 .
## pain_pre     0.647837   0.250458   2.587   0.0141 *
## age         -0.002492   0.018798  -0.133   0.8953  
## PCS_pre      0.043056   0.018180   2.368   0.0237 *
## MCS_pre     -0.006559   0.020235  -0.324   0.7478  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.15 on 34 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.1465 
## F-statistic: 2.631 on 4 and 34 DF,  p-value: 0.05122

R-squared = 0.2364

Adjusted R-squared = 0.1465

For our explanation, I will clarify R-squared and adjusted R-squared. R-squared is the proportion of variance in the dependent variable that is predictable from the independent variables; it tells us how well the independent variables as a group explain the variability of the dependent variable. Adjusted R-squared accounts not only for the proportion of variance explained but also for the number of predictors relative to the number of observations: adj-R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors. Unlike R-squared, which can never decrease when a predictor is added, adjusted R-squared increases only if the new predictor improves the model more than would be expected by chance; the formula penalizes predictors that do not improve the model's ability to predict.

When trying to generalize explained variance to the population, adjusted R-squared is considered more reliable than R-squared because it accounts for the number of predictors, preventing inflation of R-squared due to overfitting. If adding a new variable does not actually improve the model, the adjusted R-squared can decrease, thereby flagging the potential redundancy or irrelevance of the variable.


5.2) If the sample size were doubled, would we expect the difference between R2 and adj-R2 to be smaller or larger. Explain your answer.

If we double the sample size while keeping the number of predictors the same, we would expect the difference between R-squared and adjusted R-squared to be smaller. This follows from the formula above.

Adjusted R-squared penalizes the positive bias in R-squared, which tends to overestimate the population's explained variance when the sample size is small or when there are many predictors relative to the number of observations. In adj-R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), as the sample size n increases with p fixed, the ratio (n - 1)/(n - p - 1) approaches 1, so adjusted R-squared approaches R-squared. The adjustment becomes less severe as the sample size grows, because the risk of overfitting diminishes with more data and the sample-based estimates of the population parameters become more reliable.
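As a quick numerical check of this formula against the model output (n = 39 complete cases, p = 4 predictors):

# Manual adjusted R-squared for the full model
n <- nrow(NewChronicPain)  # 39
p <- 4
R2 <- summary(lifequalitymodel_full)$r.squared
1 - (1 - R2) * (n - 1) / (n - p - 1)  # ~0.1465, matching summary() above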


5.3) Report the results of a significance test which would examine if these four variables explain a statistically significant amount of variability in the outcome deltapain. Use an alpha-level of 0.10 to determine “statistical significance.” (1pt)

# Intercept-only model
intercept_model <- lm(deltapain ~ 1, data = NewChronicPain)

# Perform the ANOVA to compare the two models
anova_variability_deltapain <- anova(intercept_model, lifequalitymodel_full)

print(anova_variability_deltapain)
## Analysis of Variance Table
## 
## Model 1: deltapain ~ 1
## Model 2: deltapain ~ pain_pre + age + PCS_pre + MCS_pre
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 58.923                              
## 2     34 44.995  4    13.928 2.6312 0.05122 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One finds here that the four predictors explain a statistically significant amount of variability in the outcome deltapain: F(4, 34) = 2.6312, p = 0.05122 < 0.10.

Do these variables predict significant variance in change in pain? (Y/N, explain): Yes. Given F(4, 34) = 2.6312, p = 0.05122, we conclude that the four variables in the regression model together explain a statistically significant share of the variance in change in pain (R2 = 0.2364, about 23.6%).
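The same omnibus F can also be computed by hand from R2, which makes the connection to the last line of the summary() output explicit; a sketch:

# Omnibus test: F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
(0.2364 / 4) / ((1 - 0.2364) / 34)  # ~2.63, matching the ANOVA above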


5.4) Examine the coefficient for age in the model with all four predictors. Use an alpha-level of 0.10 to determine “statistical significance.” Report the following information about this coefficient (2pt):

library(car)
## Loading required package: carData
summary(lifequalitymodel_full)
## 
## Call:
## lm(formula = deltapain ~ pain_pre + age + PCS_pre + MCS_pre, 
##     data = NewChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6450 -0.5595 -0.2092  0.8448  2.0257 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.320625   1.639467  -2.025   0.0507 .
## pain_pre     0.647837   0.250458   2.587   0.0141 *
## age         -0.002492   0.018798  -0.133   0.8953  
## PCS_pre      0.043056   0.018180   2.368   0.0237 *
## MCS_pre     -0.006559   0.020235  -0.324   0.7478  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.15 on 34 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.1465 
## F-statistic: 2.631 on 4 and 34 DF,  p-value: 0.05122
# Confidence interval for the age coefficient, and use 90% for an alpha-level of 0.10.
# Also reporting the 95% confidence interval
conf_int_1 <- confint(lifequalitymodel_full, "age", level = 0.90) 
conf_int_2 <- confint(lifequalitymodel_full, "age", level = 0.95) 

print(paste("90% Confidence Interval for age:", conf_int_1))
## [1] "90% Confidence Interval for age: -0.0342771384535484"
## [2] "90% Confidence Interval for age: 0.0292938582132561"
print(paste("95% Confidence Interval for age:", conf_int_2))
## [1] "95% Confidence Interval for age: -0.0406931765082123"
## [2] "95% Confidence Interval for age: 0.03570989626792"
# Variance Inflation Factor (VIF) for coefficients
vif_values <- vif(lifequalitymodel_full)

# VIF for age
age_vif <- vif_values['age']
print(paste("Variance Inflation Factor for age:", age_vif))
## [1] "Variance Inflation Factor for age: 1.23501018302724"

Estimate: b = -0.002492

Inferential test: t(34) = -0.133; p = 0.8953

Conclusion based on inferential test: since p = 0.8953 > 0.10, we fail to reject the null hypothesis; age does not have a statistically significant effect on change in pain.

Confidence Interval: 90% CI = [-0.0343, 0.0293]; 95% CI = [-0.0407, 0.0357]

Variance Inflation Factor: 1.235. This indicates minimal collinearity between age and the other predictors; the standard error for age is inflated by a factor of sqrt(1.235), roughly 1.11, relative to what it would be with uncorrelated predictors.


5.5) Age could have been measured in months rather than years. Create a new variable which is age in months (age*12); note that this will change the standard deviation of the variable. Re-estimate the model with the age in months variable, PCS_pre, MCS_pre, and pain_pre. Examine the estimate and significance test for age in months. The standard error for a regression coefficient for regressor j is a function of the standard deviation of variable j. Explain why the t-value and p-value remain unchanged even though we have changed the standard deviation of the predictor. (2pt)

# New variable for age in months
NewChronicPain$age_months <- NewChronicPain$age * 12

# Re-estimate the model with the new age in months variable
lifequalitymodel_full_new <- lm(deltapain ~ PCS_pre + MCS_pre + pain_pre + age_months, data = NewChronicPain)

summary(lifequalitymodel_full_new)
## 
## Call:
## lm(formula = deltapain ~ PCS_pre + MCS_pre + pain_pre + age_months, 
##     data = NewChronicPain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6450 -0.5595 -0.2092  0.8448  2.0257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.3206249  1.6394665  -2.025   0.0507 .
## PCS_pre      0.0430557  0.0181803   2.368   0.0237 *
## MCS_pre     -0.0065595  0.0202352  -0.324   0.7478  
## pain_pre     0.6478366  0.2504583   2.587   0.0141 *
## age_months  -0.0002076  0.0015665  -0.133   0.8953  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.15 on 34 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.1465 
## F-statistic: 2.631 on 4 and 34 DF,  p-value: 0.05122

When you rescale a predictor variable (from years to months), the estimated regression coefficient changes in inverse proportion to the scaling: since age_months is 12 times age, the new coefficient is 1/12th the size of the old one (-0.002492 / 12 ≈ -0.0002076). The standard error is rescaled by exactly the same factor, because it too is inversely proportional to the predictor's standard deviation. The t-statistic is the ratio of the estimate to its standard error, t = b / SE(b), so the factor of 1/12 cancels and the t-value, and hence the p-value, is unchanged.
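This can be verified directly from the two fitted models; a small sketch:

# The age_months coefficient should be exactly the age coefficient / 12,
# while the t-values (estimate / SE) are identical across the two models
coef(lifequalitymodel_full)["age"] / 12        # ~-0.0002076
coef(lifequalitymodel_full_new)["age_months"]  # ~-0.0002076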


5.6) Choose one of the four assumptions of linear regression and create a visualization from the data that helps you examine whether this assumption is violated in the model with 4 predictors. Describe why you chose that visualization and, based on what you see, whether you think that assumption is violated. (1pt)

library(ggplot2)

# Calculate the residuals and fitted values
residuals_data <- data.frame(
  Fitted = lifequalitymodel_full$fitted.values,
  Residuals = lifequalitymodel_full$residuals)

ggplot(residuals_data, aes(x = Fitted, y = Residuals)) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") +  
  geom_smooth(method = "loess", color = "blue") + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 12, face = "plain"),
    axis.title = element_text(size = 13, face = "plain"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5)
  ) +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals/Fitted")
## `geom_smooth()` using formula = 'y ~ x'

In this plot, the red line marks zero, where the residuals should center if the model's linear form is correct, and the blue line is a non-parametric (loess) fit to the residuals. The residuals are generally scattered around the horizontal line at zero, which is good, and there is no clear pattern to them. The spread of the residuals is fairly consistent across the range of fitted values, with no obvious increase or decrease, which suggests homoscedasticity. Lastly, there are no clear outliers that stand out dramatically from the rest of the data points. Overall, there does not appear to be a violation of the homoscedasticity (constant variance) assumption: the residuals look randomly scattered, without the funnel shape that would suggest unequal variances.
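For reference, base R produces essentially the same diagnostic with one call, which is a quick way to cross-check the ggplot version:

# Residuals vs Fitted diagnostic from base R
plot(lifequalitymodel_full, which = 1)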