1 Loading Libraries

library(psych) # for the describe() command
library(car) # for the vif() command
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
library(sjPlot) # to visualize our results

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use EAMMi2 data
d <- read.csv(file="Data/eammi2_data_final.csv", header=T)

3 State Your Hypothesis

I hypothesize that markers of adulthood (maturity), social media use, and perceived social support each significantly and positively predict emerging adulthood dimensions. In this hypothesis, the emerging adulthood dimensions serve as the dependent/outcome variable, while maturity, social media use, and perceived social support serve as independent/predictor variables.

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    3182 obs. of  27 variables:
##  $ ResponseId      : chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ gender          : chr  "f" "m" "m" "f" ...
##  $ race_rc         : chr  "white" "white" "white" "other" ...
##  $ age             : chr  "1 between 18 and 25" "1 between 18 and 25" "1 between 18 and 25" "1 between 18 and 25" ...
##  $ income          : chr  "1 low" "1 low" "rather not say" "rather not say" ...
##  $ edu             : chr  "2 Currently in college" "5 Completed Bachelors Degree" "2 Currently in college" "2 Currently in college" ...
##  $ sibling         : chr  "at least one sibling" "at least one sibling" "at least one sibling" "at least one sibling" ...
##  $ party_rc        : chr  "democrat" "independent" "apolitical" "apolitical" ...
##  $ disability      : chr  NA NA "psychiatric" NA ...
##  $ marriage5       : chr  "are currently divorced from one another" "are currently married to one another" "are currently married to one another" "are currently married to one another" ...
##  $ phys_sym        : chr  "high number of symptoms" "high number of symptoms" "high number of symptoms" "high number of symptoms" ...
##  $ pipwd           : num  NA NA 2.33 NA NA ...
##  $ moa_independence: num  3.67 3.67 3.5 3 3.83 ...
##  $ moa_role        : num  3 2.67 2.5 2 2.67 ...
##  $ moa_safety      : num  2.75 3.25 3 1.25 2.25 2.5 4 3.25 2.75 3.5 ...
##  $ moa_maturity    : num  3.67 3.33 3.67 3 3.67 ...
##  $ idea            : num  3.75 3.88 3.75 3.75 3.5 ...
##  $ swb             : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ mindful         : num  2.4 1.8 2.2 2.2 3.2 ...
##  $ belong          : num  2.8 4.2 3.6 4 3.4 4.2 3.9 3.6 2.9 2.5 ...
##  $ efficacy        : num  3.4 3.4 2.2 2.8 3 2.4 2.3 3 3 3.7 ...
##  $ support         : num  6 6.75 5.17 5.58 6 ...
##  $ socmeduse       : int  47 23 34 35 37 13 37 43 37 29 ...
##  $ usdream         : chr  "american dream is important and achievable for me" "american dream is important and achievable for me" "american dream is not important and maybe not achievable for me" "american dream is not important and maybe not achievable for me" ...
##  $ npi             : num  0.6923 0.1538 0.0769 0.0769 0.7692 ...
##  $ exploit         : num  2 3.67 4.33 1.67 4 ...
##  $ stress          : num  3.3 3.3 4 3.2 3.1 3.5 3.3 2.4 2.9 2.7 ...
# keep only the variables used in this analysis and drop incomplete cases
cont <- na.omit(subset(d, select=c(idea, moa_maturity, support, socmeduse)))

# standardize each variable (mean = 0, sd = 1) so the coefficients are comparable
cont$idea <- scale(cont$idea, center=T, scale=T)
cont$moa_maturity <- scale(cont$moa_maturity, center=T, scale=T)
cont$support <- scale(cont$support, center=T, scale=T)
cont$socmeduse <- scale(cont$socmeduse, center=T, scale=T)

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)
##              vars    n mean sd median trimmed  mad   min  max range  skew
## idea            1 3137    0  1   0.13    0.12 0.98 -6.83 1.13  7.96 -1.48
## moa_maturity    2 3137    0  1   0.17    0.14 1.15 -6.01 0.95  6.96 -1.20
## support         3 3137    0  1   0.19    0.11 0.88 -4.90 1.30  6.20 -1.10
## socmeduse       4 3137    0  1   0.06    0.03 0.87 -2.74 2.40  5.14 -0.30
##              kurtosis   se
## idea             4.19 0.02
## moa_maturity     1.87 0.02
## support          1.43 0.02
## socmeduse        0.26 0.02
# also use histograms to examine your continuous variables
hist(cont$idea)

hist(cont$moa_maturity)

hist(cont$support)

hist(cont$socmeduse)

# last, use scatterplots to examine your continuous variables together
plot(cont$idea, cont$moa_maturity)

plot(cont$idea, cont$support)

plot(cont$support, cont$socmeduse)

plot(cont$moa_maturity, cont$support)

plot(cont$socmeduse, cont$moa_maturity)

plot(cont$socmeduse, cont$idea)
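
If you would rather see every pairwise relationship at once, base R's pairs() command produces a scatterplot matrix in a single call. This is just an optional shortcut, not a required homework step:

# optional: all pairwise scatterplots in one matrix
pairs(cont)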

5 View Your Correlations

corr_output_m <- corr.test(cont)
corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix 
##              idea moa_maturity support socmeduse
## idea         1.00         0.15    0.12      0.23
## moa_maturity 0.15         1.00    0.10      0.09
## support      0.12         0.10    1.00      0.21
## socmeduse    0.23         0.09    0.21      1.00
## Sample Size 
## [1] 3137
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##              idea moa_maturity support socmeduse
## idea            0            0       0         0
## moa_maturity    0            0       0         0
## support         0            0       0         0
## socmeduse       0            0       0         0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
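
As the output itself suggests, you can view the confidence intervals for each correlation by printing with the short=FALSE option:

# print the full corr.test() output, including confidence intervals
print(corr_output_m, short=FALSE)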

6 Run a Multiple Linear Regression

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variables on the right
reg_model <- lm(idea ~ moa_maturity + support + socmeduse, data = cont)

7 Check Your Assumptions

7.1 Multiple Linear Regression Assumptions

Assumptions we’ve discussed previously:

  • Observations should be independent
  • Variables should be continuous and normally distributed
  • Outliers should be identified and removed (for our class, we will not be removing outliers, just identifying them)
  • Relationship between the variables should be linear

New assumptions:

  • Number of cases should be adequate (N ≥ 80 + 8m, where m is the number of IVs)
  • Independent variables should not be too correlated (aka multicollinearity)
  • Residuals should be normal and have constant variance

7.2 Count Number of Cases

For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me and we can figure out the best way to proceed!

# minimum sample size: 80 + 8m, where m = 3 predictors
needed <- 80 + 8*3
nrow(cont) >= needed
## [1] TRUE

7.3 Check multicollinearity

  • The variance inflation factor (VIF) tests for multicollinearity - higher values indicate more multicollinearity
  • Cutoff is usually 5

For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables.

vif(reg_model)
## moa_maturity      support    socmeduse 
##     1.016384     1.054528     1.052444
#scores should be equal to or less than 5 so that multicollinearity is not too high
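
If you would rather not eyeball the scores, the quick comparison below flags any predictor over the usual cutoff of 5 (just a convenience check, not a required homework step):

# flag any predictor whose VIF exceeds the usual cutoff of 5
vif(reg_model) > 5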

7.4 Check linearity with Residuals vs Fitted plot

The plot below shows the residuals for each case and the fitted line. The red line is the average residual at each point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal, indicating that the residuals average to around zero. However, a bit of deviation is okay – just like with skewness and kurtosis, there's a range that we can work in before non-linearity becomes a critical issue. For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page.

The Residuals vs Fitted plot for my variables meets the assumption of linearity, as seen in the mostly horizontal red line (the average residual at each point of the dependent variable). There is a bit of deviation from the grey dashed line; however, it falls within the acceptable range, so non-linearity does not become a critical issue.

plot(reg_model, 1)

# the ", 1" tells it to give us the first diagnostic plot
# the red line should hug the grey dashed line

7.5 Check normality of residuals with a Q-Q plot

This plot is a bit new. It's called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our residuals are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot.

It's normal for Q-Q plots to show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context. For your homework, you'll simply need to generate this plot and talk about how your plot compares to the ones pictured. Does your plot indicate any skew or kurtosis? Is it closer to the 'good' or 'bad' plots from the second link?

The Q-Q plot for my model does not indicate that the residuals are perfectly normal, as the points do not fit on the dashed line perfectly. Instead, my plot shows deviation from the straight diagonal line at the ends, making it more similar to a Q-Q plot for non-normal data. This matches the skewness and kurtosis of my variables: idea, moa_maturity, and support all have negative skew and high kurtosis, and the deviation at the tails of the Q-Q plot is exactly what we would expect from a leptokurtic, left-skewed distribution.

plot(reg_model, 2)
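
If you want to back up the visual check with numbers, you can also run describe() on the model's residuals; the skew and kurtosis values should tell the same story as the Q-Q plot. This is a supplemental check, not a required part of the homework:

# numeric check of residual normality: skew and kurtosis of the residuals
describe(residuals(reg_model))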

7.6 Check for outliers using Cook’s distance and a Residuals vs Leverage plot

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.

For your homework, you'll simply need to generate these plots, assess Cook's distance in your dataset, and then identify any potential cases that are prominent outliers. Since we have some cutoffs, this process is a bit less subjective than some of the other assessments we've done here, which is a nice change!

According to the Cook's distance plot, three outliers were identified. Observations 31 and 2850 have a notable influence on the regression model, with Cook's distance values of roughly 0.3 and 0.4 respectively, while observation 2989 has a very high influence, with a Cook's distance of about 0.6, indicating a substantial impact on the regression coefficients and predictions.

As for the Residuals vs. Leverage plot, the red line is roughly horizontal. However, the points do not lie evenly around the line along its entire length: the amount of variation around the line changes along its length, there are some outliers, and the points are not symmetrically scattered about the line, as you would expect if the errors were normally distributed.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
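
You can also pull the Cook's distance values directly and sort the most influential cases. The 4/n cutoff below is one common rule of thumb, not the only convention; the plot above simply labels the three most extreme cases:

# calculate Cook's distance for every case in the model
cd <- cooks.distance(reg_model)

# list the most influential cases, using the common (but not universal) 4/n rule of thumb
head(sort(cd[cd > 4/length(cd)], decreasing=TRUE))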

7.7 Check homogeneity of variance in a Scale-Location plot

This plot shows us the standardized residuals across the range of the regression line. Because the residuals are standardized, large residuals (whether positive or negative) are at the top of the plot, while small residuals (whether positive or negative) are at the bottom of the plot. If the assumption of homogeneity of variance (also called homoscedasticity) is met, the red line should be mostly horizontal. If it deviates from the mean line, that means that the variance is smaller or larger at that point of the regression line.

Once again, you can check out this page for some other examples of this type of plot. For your homework, you'll simply need to generate this plot and talk about how your plot compares to the ones pictured. Is it closer to the 'good' plots or one of the 'bad' plots? Again, this is a judgement call! It's okay if you feel uncertain, and you won't be penalized for that.

My Scale-Location plot is close to a 'good' regression plot: the assumption of homogeneity of variance appears to be met, as seen in the mostly horizontal red line. However, the points are widely scattered, and several points are numbered, indicating that their deviation deserves attention.

plot(reg_model, 3)
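
Because the car package is already loaded, you can pair this plot with its ncvTest() command, a score test for non-constant error variance (a significant result suggests heteroscedasticity). This is supplemental to the visual check, not a required homework step:

# score test for non-constant variance (heteroscedasticity)
ncvTest(reg_model)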

7.8 Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a multiple linear regression. We analyzed a Scale-Location plot and detected some issues with homogeneity of variance, as well as some issues with linearity in a Residuals vs Fitted plot. We identified a few influential outliers using Cook's distance, and the Q-Q plot showed some deviation from normality at the tails, but we did not judge any of these issues serious enough to prevent the analysis.

8 View Test Output

summary(reg_model)
## 
## Call:
## lm(formula = idea ~ moa_maturity + support + socmeduse, data = cont)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1510 -0.4709  0.1772  0.6869  1.9607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.082e-15  1.721e-02   0.000 1.000000    
## moa_maturity  1.206e-01  1.736e-02   6.948 4.49e-12 ***
## support       6.186e-02  1.768e-02   3.499 0.000473 ***
## socmeduse     2.041e-01  1.766e-02  11.557  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.964 on 3133 degrees of freedom
## Multiple R-squared:  0.07152,    Adjusted R-squared:  0.07063 
## F-statistic: 80.44 on 3 and 3133 DF,  p-value: < 2.2e-16
plot_model(reg_model, type="std")

plot_model(reg_model, type="emm", terms = c("moa_maturity"))

plot_model(reg_model, type="emm", terms = c("support"))

plot_model(reg_model, type="emm", terms = c("socmeduse"))

# note for section below: to type a lowercase beta (β) you can copy/paste it from elsewhere; holding down Alt and typing 225 on a numeric keypad produces ß (the German sharp s), which is sometimes used as a stand-in

9 Write Up Results

To test our hypothesis that markers of adulthood (maturity), social media use, and perceived social support significantly predict emerging adulthood dimensions, we used a multiple linear regression to model the relationship between the variables. We confirmed that our data met most of the assumptions of a linear regression; although there were some issues with homogeneity of variance and linearity, we continued with the analysis.

The model demonstrated statistically significant relationships between the predictor variables (markers of adulthood - maturity, perceived social support, and social media use) and the outcome variable (emerging adulthood dimensions), Adj. R2 = 0.07, F(3, 3133) = 80.44, p < .001. All three predictors showed a positive relationship with scores on the Inventory of the Dimensions of Emerging Adulthood (IDEA-8). By Cohen's (1988) benchmarks, the effect size for perceived social support (β = 0.06) falls below 0.10 and is trivial or negligible, while maturity (β = 0.12) and social media use (β = 0.20) fall between 0.10 and 0.30, indicating small effects. Full output from the regression model is reported in Table 1.

Table 1: Regression model of the influence of maturity, perceived social support, and social media use on emerging adulthood dimensions

                                            IDEA-8
Predictors                         Estimates    SE    CI             p
(Intercept)                            -0.00  0.02    -0.03 – 0.03   1.000
Markers of Adulthood (Maturity)         0.12  0.02     0.09 – 0.15   <0.001
Perceived Social Support                0.06  0.02     0.03 – 0.10   <0.001
Social Media Use                        0.20  0.02     0.17 – 0.24   <0.001
Observations                            3137
R2 / R2 adjusted                        0.072 / 0.071


References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.