Psych 250C - Problem Set 4

Introduction

For this homework you will be using the DeathAnxiety data from Jong, Halberstadt, Bluemke, Kavanagh & Jackson (2019) Scientific Data, “Death anxiety, exposure to death, mortuary preferences, and religiosity in five countries.”

These data come from a survey study of people from four countries (Brazil, Korea, Philippines, Russia) asking questions about their concerns about death, exposure to death, religiosity, and belief in the supernatural. There are about 200 participants from each country (N = 797 total).

DeathAnxiety <- read.csv('/Users/dgkamper/Library/Mobile Documents/com~apple~CloudDocs/Axis - HQ/PhD Terms/Classes/Spring 2024/Psych 250C/Problem Sets/Psych 250C Problems Sets/DeathAnxiety.csv')

Question 1

The model we are going to fit will predict death anxiety (DAQ) from supernatural beliefs (SBS) controlling for Age. Write out the regression equation that corresponds to this research question. Assuming we are primarily concerned with the accurate estimation of the relationship between DAQ and SBS, identify which parameter is of most direct interest.

The coefficient b1 is the parameter of primary interest, as it measures the specific influence of supernatural beliefs on death anxiety, controlling for the effect of age.

Question 2

Create a visualization that could help you identify unusual cases. Provide a comment on whether you see any unusual cases, and if so what makes them unusual, use the language we've learned about distance, leverage, and influence.

# Fit the regression model
deathmodel <- lm(DAQ ~ SBS + Age, data = DeathAnxiety)

# Residuals vs Fitted
plot(deathmodel, which=1, main="Residuals vs Fitted")

# Normal Q-Q plot
plot(deathmodel, which=2, main="Normal Q-Q")

# Scale-Location (Spread vs Leverage)
plot(deathmodel, which=3, main="Scale-Location")

# Residuals vs Leverage
plot(deathmodel, which=5, main="Residuals vs Leverage")

I will work one by one.

Firstly, the residuals vs fitted plot, which is used to check the linearity of the relationship and the homogeneity of variance across the range of fitted values. The points do not show any clear pattern or curve. This is good, as it suggests linearity in the relationship between the predictors and the response variable. The spread of the residuals appears fairly constant across the range of fitted values, indicating homoscedasticity.

Secondly, we have the normal Q-Q plot, which checks if the residuals are approximately normally distributed. The points generally follow the 45-degree line, which is a good indication that the residuals are normally distributed. However, there is some deviation in the tails (both ends), suggesting slight departures from normality.

Thirdly, we have the scale-location plot, where this plot is used to assess the homoscedasticity of residuals. Here, the points are scattered around a horizontal line without showing any distinct patterns or funnels, which generally supports the assumption of homoscedasticity, although there might be slight increases in spread for mid-range fitted values.

Fourthly, we have the residuals vs leverage plot, which helps identify influential observations that might affect the regression analysis. Most data points cluster at the bottom of the plot, indicating low leverage and small residuals. A few points, however, have higher leverage and/or larger residuals. Notably, there are data points on the far right with moderate leverage, and one particularly influential point as indicated by the labeled Cook's distance lines. The Cook’s distance line suggests that the influence of some points (those beyond the Cook’s distance cutoff, typically 0.5 or 1) is significant. The one marked with high Cook's distance (around 0.02 leverage and 0.2 standardized residuals) could be considered an influential observation due to its potential to sway the regression line significantly.

Question 3

Calculate the dfbetas for each case in the dataset for the specific model. Identify the case that has a dfbeta of the largest magnitude for the relationship between DAQ and SBS. Which case is it? Provide an interpretation of the dfbeta value (make sure to clarify what the sign of the value means). Look at this case's scores on Age, DAQ, and SBS, why is this case so influential?

# Load library
library(car)

## Loading required package: carData

# Calculate dfbetas for each predictor in the model
dfbetas_df <- as.data.frame(dfbetas(deathmodel))

# Grab IDs
dfbetas_df$CaseID <- rownames(dfbetas_df)

# Identify ID with largest magnitude dfbeta for SBS
max_dfbeta_sbs <- dfbetas_df[which.max(abs(dfbetas_df$SBS)),]

# Retrieve  data for this specific case from DeathAnxiety
influential_case <- DeathAnxiety[max_dfbeta_sbs$CaseID,]

print(influential_case)

##     Participant Country Gender Age DAQ SBS
## 726         729  Russia      F  48   8  -4

print (max_dfbeta_sbs)

##     (Intercept)        SBS       Age CaseID
## 726 -0.03373943 -0.2566899 0.1349047    726

# For comparison
summary (deathmodel)

## 
## Call:
## lm(formula = DAQ ~ SBS + Age, data = DeathAnxiety)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0982 -1.0432  0.0529  1.2566  4.4080 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.571225   0.221836  20.606  < 2e-16 ***
## SBS          0.171473   0.030078   5.701 1.68e-08 ***
## Age         -0.006112   0.006083  -1.005    0.315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.803 on 794 degrees of freedom
## Multiple R-squared:  0.03966,    Adjusted R-squared:  0.03724 
## F-statistic: 16.39 on 2 and 794 DF,  p-value: 1.056e-07

The highest absolute DFBETA for SBS was for the femaile participant 729 from Russia. The dfbeta values are a noted:

Intercept: -0.03373943

SBS (Supernatural Beliefs): -0.2566899

Age: 0.1349047

This means that including this individual in the dataset makes the b1 coefficient for SBS about 0.2566899 less than if they were not included in the dataset. Thus, this case is contributing to an decrease in the estimated values of the SBS coefficient (and intercept as well).

The reason this participant is so influential appears to stem significantly from their SBS score, which is an outlier in its negativity. This not only affects the SBS coefficient but also has implications on the intercept. In other words, this participant is so influential because the participant has the highest possible scores on DAQ and SBS. Moreover, this is the oldest of the pool of participants who scored equally high absolute values on both DAQ and SBS.

Question 4

Create 3 plots to check the linearity of the data:

4.1) Evaluate whether the relationship between DAQ and Age is linear.

# Load library
library (ggplot2)

ggplot(DeathAnxiety, aes(x = Age, y = DAQ)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", color = "red") + 
  labs(title = "Scatterplot of DAQ vs. Age",
       x = "Age",
       y = "DAQ") +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 10, face = "plain"),
    axis.title = element_text(size = 11, face = "plain"),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5)
  )

## `geom_smooth()` using formula = 'y ~ x'

4.2) Evaluate whether the relationship between DAQ and SBS is linear.

# Load library
library (ggplot2)

ggplot(DeathAnxiety, aes(x = SBS, y = DAQ)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Scatterplot of DAQ vs. SBS",
       x = "Supernatural Beliefs Score (SBS)",
       y = "DAQ") +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 10, face = "plain"),
    axis.title = element_text(size = 11, face = "plain"),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5)
  )

## `geom_smooth()` using formula = 'y ~ x'

4.3) Evaluate whether the relationship between DAQ and the predicted scores is linear.

# Load library
library (ggplot2)

DeathAnxiety$Predicted_DAQ <- predict(lm(DAQ ~ SBS + Age, data = DeathAnxiety))

ggplot(DeathAnxiety, aes(x = Predicted_DAQ, y = DAQ)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Scatterplot of DAQ vs. Predicted DAQ",
       x = "Predicted DAQ",
       y = "Actual DAQ") +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 10, face = "plain"),
    axis.title = element_text(size = 11, face = "plain"),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5)
  )

## `geom_smooth()` using formula = 'y ~ x'

4.4) Based on your plots does it look like there is any strong deviation away from linearity for any of the relationships? Why or why not?

I will address each plot in turn:

In the Scatterplot of DAQ vs. Age, the plot shows that DAQ values are somewhat evenly distributed across different ages without any clear trend in the DAQ values with age. The linear model line is almost horizontal, suggesting little to no linear relationship between age and DAQ. The spread of the data points does not suggest any clear linear or non-linear pattern; rather, it indicates that age alone might not be a strong predictor for DAQ in this model. In this case, there is no strong deviation from linearity because there’s hardly any linear relationship present.

In Scatterplot of DAQ vs. SBS, the plot shows a slight upward trend. The linear fit line suggests a mild positive linear relationship. The data points are relatively well-distributed around the regression line without obvious deviations or patterns that would suggest a strong non-linear relationship. In this case, the relationship appears linear, and there is no strong evidence of deviation from linearity.

In the Scatterplot of DAQ vs. Predicted DAQ, this plot shows the actual DAQ values plotted against their predicted values from the linear model. The points are clustered around a line, but there is a visible spread that increases as the predicted DAQ increases. We see that the linearity of this relationship is confirmed by the trend of points along the diagonal, although the variability of DAQ around the predicted values suggests possible heteroscedasticity. In this case, the relationship appears linear, but the increasing spread of points (faning out to the end ) as predicted DAQ increases suggests that there might be some need to look at the variance.