1 Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table
library(broom) # for the augment() command
library(ggplot2) # to visualize our results

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the ARC data downloaded previously for lab
d <- read.csv(file="Data/final.csv", header=TRUE)

d <- na.omit (d)
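
Because na.omit() silently drops every row with a missing value, it can help to check how many cases were lost. The sketch below uses a made-up data frame (`toy` is hypothetical, not the lab dataset) and reads the "na.action" attribute that na.omit() attaches to its result.

```r
# hypothetical toy data with missing values (not the lab dataset)
toy <- data.frame(
  belong = c(2.6, NA, 3.8, 4.2, 3.4),
  stress = c(3.1, 3.8, 4.3, NA, 3.3)
)

toy_clean <- na.omit(toy)

# number of rows before and after listwise deletion
n_before <- nrow(toy)
n_after  <- nrow(toy_clean)

# na.omit() records the dropped row numbers in the "na.action" attribute
n_dropped <- length(attr(toy_clean, "na.action"))

cat("rows before:", n_before, " rows after:", n_after, " dropped:", n_dropped, "\n")
```

The same attribute shows up at the bottom of the str(d) output in this lab, where it lists 39 dropped rows.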

3 Correlation: State Your Hypothesis

We predict that stress, need to belong, subjective well-being, and social media use will all be correlated with each other. Furthermore, we predict that stress and need to belong will be positively correlated with each other and with social media use, and that subjective well-being will be negatively correlated with social media use.

4 Correlation: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    3143 obs. of  6 variables:
##  $ sex     : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ income  : int  3 3 1 1 6 1 2 3 7 1 ...
##  $ belong  : num  2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
##  $ stress  : num  3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
##  $ swb     : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ SocMedia: num  4.27 2.09 3.09 3.18 3.36 ...
##  - attr(*, "na.action")= 'omit' Named int [1:39] 61 199 210 304 421 511 728 743 789 1047 ...
##   ..- attr(*, "names")= chr [1:39] "61" "199" "210" "304" ...

# since we're focusing on our continuous variables, we're going to subset them into their own dataframe. this will make some stuff we're doing later easier.
cont <- subset(d, select=c(belong, stress, swb, SocMedia))

# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$pss)
describe (cont)

##          vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## belong      1 3143 3.30 0.73   3.30    3.32 0.74 1.1   5   3.9 -0.29    -0.23
## stress      2 3143 3.06 0.66   3.10    3.06 0.59 1.0   5   4.0  0.03    -0.04
## swb         3 3143 4.47 1.32   4.67    4.53 1.48 1.0   7   6.0 -0.36    -0.45
## SocMedia    4 3143 3.13 0.78   3.18    3.16 0.67 1.0   5   4.0 -0.32     0.27
##            se
## belong   0.01
## stress   0.01
## swb      0.02
## SocMedia 0.01

# skew and kurtosis are small for every variable here (all absolute values are below 0.5). you don't need to discuss univariate normality in the results write-ups for the labs/homework, but you will need to discuss it in your final manuscript

# also use histograms to examine your continuous variables
hist(d$belong)

hist(d$stress)

hist(d$swb)

hist(d$SocMedia)

# last, use scatterplots to examine your continuous variables together
plot (d$belong, d$stress)

plot (d$belong, d$swb)

plot (d$belong, d$SocMedia)

plot (d$stress, d$swb)

plot (d$stress, d$SocMedia)

plot (d$swb, d$SocMedia)

5 Correlation: Check Your Assumptions

5.1 Pearson’s Correlation Coefficient Assumptions

Should have two measurements for each participant
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear

5.1.1 Checking for Outliers

Note: You are not required to screen out outliers or take any action based on what you see here. This is something you will check and then discuss in your write-up.

d$belong_std <- scale(d$belong, center=T, scale=T)
hist(d$belong_std)

sum(d$belong_std < -3 | d$belong_std > 3)

## [1] 2

d$stress_std <- scale(d$stress, center=T, scale=T)
hist(d$stress_std)

sum(d$stress_std < -3 | d$stress_std > 3)

## [1] 1

d$swb_std <- scale(d$swb, center=T, scale=T)
hist(d$swb_std)

sum(d$swb_std < -3 | d$swb_std > 3)

## [1] 0

d$SocMedia_std <- scale(d$SocMedia, center=T, scale=T)
hist(d$SocMedia_std)

sum(d$SocMedia_std < -3 | d$SocMedia_std > 3)

## [1] 0
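
The four outlier checks above repeat the same steps, so they could be wrapped in one small function. This is only a sketch: `count_outliers` is a hypothetical helper (not part of psych or base R) and `toy` is made-up data, but the logic is the same z-score cutoff of 3 used above.

```r
# count cases more than `cutoff` standard deviations from the mean
# (count_outliers is a hypothetical helper, not part of psych or base R)
count_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  sum(abs(z) > cutoff, na.rm = TRUE)
}

# made-up data: 50 typical scores plus one extreme score in column a
toy <- data.frame(
  a = c(rep(c(2.8, 3.0, 3.2), length.out = 50), 9.0),
  b = rep(c(1, 2, 3), length.out = 51)
)

# apply the helper to every column at once
outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

With the lab data, something like sapply(cont, count_outliers) should give the same counts as the four separate checks, assuming cont still holds the four continuous variables.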

5.2 Issues with My Data

It appears that there is no clear linear relationship between Social Media Use and either Perceived Stress or Subjective Well-Being. If a relationship is non-linear, Pearson’s r may underestimate its strength and misrepresent its direction. Need to Belong has two outliers, Perceived Stress has one, and Subjective Well-Being and Social Media Use have none. Outliers can distort the relationship between two variables and sway the correlation.
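
To see why a non-linear relationship is a problem for Pearson’s r, here is a small made-up example (the vectors `x` and `y` are invented for illustration). The outcome is completely determined by the predictor, but because the pattern is U-shaped, the correlation comes out at essentially zero.

```r
# made-up example: a perfect but curved (U-shaped) relationship
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Pearson's r is essentially zero even though y is fully determined by x
r_curved <- cor(x, y)

# for comparison, a perfectly linear relationship gives r = 1
r_linear <- cor(x, 2 * x + 1)

cat("curved r:", round(r_curved, 2), " linear r:", round(r_linear, 2), "\n")
```

This is why the scatterplots matter: a near-zero r only means there is no linear relationship, not that the variables are unrelated.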

6 Correlation: Create a Correlation Matrix

corr_output_m <- corr.test(cont)

7 Correlation: View Test Output

corr_output_m

## Call:corr.test(x = cont)
## Correlation matrix 
##          belong stress   swb SocMedia
## belong     1.00   0.31 -0.15     0.28
## stress     0.31   1.00 -0.55     0.08
## swb       -0.15  -0.55  1.00     0.11
## SocMedia   0.28   0.08  0.11     1.00
## Sample Size 
## [1] 3143
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##          belong stress swb SocMedia
## belong        0      0   0        0
## stress        0      0   0        0
## swb           0      0   0        0
## SocMedia      0      0   0        0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

8 Correlation: Write Up Results

To test our hypothesis that stress, need to belong, subjective well-being, and social media use will all be correlated with each other, we calculated a series of Pearson’s correlation coefficients. Most of our data met the assumptions of the test, with all variables meeting the standards of normality. Need to Belong had two outliers and Perceived Stress had one; Social Media Use and Subjective Well-Being had no outliers.

As predicted, we found that all four variables were significantly correlated with one another (all ps < .001). Most effect sizes were small to medium (|r|s from .08 to .31; Cohen, 1988); the exception was the correlation between Perceived Stress and Subjective Well-Being, which was large (r = -.55). This test also supported our second hypothesis, that stress and need to belong would be positively correlated with each other and with social media use, as can be seen by the correlation coefficients reported in Table 1. The test did not support our hypothesis that there would be a negative correlation between Subjective Well-Being and Social Media Use; the observed correlation was small and positive (r = .11).

Table 1: Means, standard deviations, and correlations with confidence intervals
Variable                    M      SD     1              2              3
1. Need to Belong           3.30   0.73
2. Perceived Stress         3.06   0.66   .31**
                                          [.28, .34]
3. Subjective Well-Being    4.47   1.32   -.15**         -.55**
                                          [-.19, -.12]   [-.58, -.53]
4. Social Media Use         3.13   0.78   .28**          .08**          .11**
                                          [.25, .32]     [.05, .12]     [.07, .14]

Note:
M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval. The confidence interval is a plausible range of population correlations that could have caused the sample correlation.
* indicates p < .05.
** indicates p < .01.
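
The confidence intervals in Table 1 can be checked by hand. The sketch below assumes the usual Fisher z-transformation approach (the table-generating package may compute its intervals slightly differently) and plugs in the reported correlation between Need to Belong and Perceived Stress and the sample size from the output above.

```r
# values taken from the correlation output above
r <- 0.31   # correlation between belong and stress
n <- 3143   # sample size

# Fisher z-transformation, its standard error, and the 95% limits
z      <- atanh(r)
se     <- 1 / sqrt(n - 3)
z_crit <- qnorm(0.975)

# transform the limits back to the correlation scale
ci <- tanh(c(z - z_crit * se, z + z_crit * se))

round(ci, 2)
```

Rounded to two decimals, this reproduces the [.28, .34] interval shown in the table. With a sample this large the interval is narrow, which is also why even the very small correlations are statistically significant.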

9 Regression: State Your Hypothesis

We hypothesize that Social Media Use (measured by Social Media Use Scale) will significantly predict Need to Belong, and that the relationship will be positive.

10 Regression: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    3143 obs. of  10 variables:
##  $ sex         : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ income      : int  3 3 1 1 6 1 2 3 7 1 ...
##  $ belong      : num  2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
##  $ stress      : num  3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
##  $ swb         : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ SocMedia    : num  4.27 2.09 3.09 3.18 3.36 ...
##  $ belong_std  : num [1:3143, 1] -0.962 1.241 0.69 1.241 0.14 ...
##   ..- attr(*, "scaled:center")= num 3.3
##   ..- attr(*, "scaled:scale")= num 0.726
##  $ stress_std  : num [1:3143, 1] 0.0563 1.1183 1.8768 -0.0954 0.3597 ...
##   ..- attr(*, "scaled:center")= num 3.06
##   ..- attr(*, "scaled:scale")= num 0.659
##  $ swb_std     : num [1:3143, 1] -0.107 -0.233 -2.001 0.524 -0.612 ...
##   ..- attr(*, "scaled:center")= num 4.47
##   ..- attr(*, "scaled:scale")= num 1.32
##  $ SocMedia_std: num [1:3143, 1] 1.4652 -1.3367 -0.0525 0.0643 0.2977 ...
##   ..- attr(*, "scaled:center")= num 3.13
##   ..- attr(*, "scaled:scale")= num 0.779
##  - attr(*, "na.action")= 'omit' Named int [1:39] 61 199 210 304 421 511 728 743 789 1047 ...
##   ..- attr(*, "names")= chr [1:39] "61" "199" "210" "304" ...

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)

##              vars    n mean   sd median trimmed  mad   min  max range  skew
## sex             1 3143 1.77 0.46   2.00    1.81 0.00  1.00 3.00  2.00 -0.73
## income          2 3143 3.55 2.30   3.00    3.37 2.97  1.00 9.00  8.00  0.47
## belong          3 3143 3.30 0.73   3.30    3.32 0.74  1.10 5.00  3.90 -0.29
## stress          4 3143 3.06 0.66   3.10    3.06 0.59  1.00 5.00  4.00  0.03
## swb             5 3143 4.47 1.32   4.67    4.53 1.48  1.00 7.00  6.00 -0.36
## SocMedia        6 3143 3.13 0.78   3.18    3.16 0.67  1.00 5.00  4.00 -0.32
## belong_std      7 3143 0.00 1.00   0.00    0.03 1.02 -3.03 2.34  5.37 -0.29
## stress_std      8 3143 0.00 1.00   0.06    0.00 0.90 -3.13 2.94  6.07  0.03
## swb_std         9 3143 0.00 1.00   0.15    0.04 1.12 -2.63 1.91  4.55 -0.36
## SocMedia_std   10 3143 0.00 1.00   0.06    0.03 0.87 -2.74 2.40  5.14 -0.32
##              kurtosis   se
## sex             -0.17 0.01
## income          -1.12 0.04
## belong          -0.23 0.01
## stress          -0.04 0.01
## swb             -0.45 0.02
## SocMedia         0.27 0.01
## belong_std      -0.23 0.02
## stress_std      -0.04 0.02
## swb_std         -0.45 0.02
## SocMedia_std     0.27 0.02

# also use histograms to examine your continuous variables
hist(d$SocMedia)

hist(d$belong)

# last, use scatterplots to examine your continuous variables together
plot(d$SocMedia, d$belong)

11 Regression: Run a Simple Regression

# to calculate standardized coefficients, we have to standardize our IV
d$SocMedia_std <- scale(d$SocMedia, center=T, scale=T)
hist (d$SocMedia_std)

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variable on the right
reg_model <- lm(belong ~ SocMedia_std, data = d)
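
It may help to see what standardizing only the predictor does to the slope. In the sketch below (made-up data; `x`, `y`, and `toy_model` are hypothetical names), the slope for a standardized predictor equals Pearson’s r multiplied by the standard deviation of the outcome. Standardizing the outcome as well gives a coefficient equal to r itself.

```r
set.seed(1)  # made-up data, for illustration only
x <- rnorm(200, mean = 3, sd = 0.8)
y <- 2 + 0.3 * x + rnorm(200, sd = 0.7)

# as.numeric() drops the matrix attributes that scale() adds
x_std <- as.numeric(scale(x))
toy_model <- lm(y ~ x_std)

# slope with a standardized predictor and an unstandardized outcome
slope      <- unname(coef(toy_model)["x_std"])
r_times_sd <- cor(x, y) * sd(y)

# fully standardized coefficient: standardize the outcome as well
y_std <- as.numeric(scale(y))
beta  <- unname(coef(lm(y_std ~ x_std))["x_std"])

round(c(slope = slope, r_times_sd = r_times_sd, beta = beta, r = cor(x, y)), 3)
```

In this lab only the predictor is standardized, so the coefficient is expressed in Need to Belong scale points per standard deviation of Social Media Use.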

12 Regression: Check Your Assumptions

12.1 Simple Regression Assumptions

Should have two measurements for each participant
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear
Residuals should be normal and have constant variance

Note: we will not be evaluating whether our data meets these assumptions in this lab/homework. We’ll come back to them next week when we talk about multiple linear regression.

12.2 Create plots and view residuals

model.diag.metrics <- augment(reg_model)

ggplot(model.diag.metrics, aes(x = SocMedia_std, y = belong)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 3.4.0 and later)
  geom_segment(aes(xend = SocMedia_std, yend = .fitted), color = "red", linewidth = 0.3)

## `geom_smooth()` using formula = 'y ~ x'

12.3 Check linearity with Residuals vs Fitted plot

My Residuals vs Fitted plot aligns well with the good plots. Most of my points surround the line, which runs smoothly across the plot.

plot(reg_model, 1)

12.4 Check for outliers

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. Ideally, we want all points to have the same influence on the regression line, although we accept that there will be some variability. The cutoff for a high Cook’s distance score is .5 (not .05, which is our cutoff for statistical significance). For our data, some points do exert more influence than others but they’re generally equal, and none of them are close to the cutoff.

The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it. The red line indicates the average residual across points with the same amount of leverage. As usual, we want this line to stay as close to the mean line (or the zero line) as possible.

Because the leverage in our plot is low, part of it is actually cut off! If you check the first set of plots on this page (note that Residuals vs Leverage is the fourth in the grid) you can see there are curved red lines in the corners of the Residuals vs Leverage plots. This is the .5 cutoff for Cook’s distance, and so any point appearing past these lines is a serious outlier that needs to be removed. On this page you can also see Residuals vs Leverage plots with severe deviations from the mean line, which makes our deviations appear much less serious.

Our data doesn’t have any severe outliers. For your homework, you’ll simply need to generate these plots, assess Cook’s distance in your dataset, and then identify any potential cases that are prominent outliers. Since we have some cutoffs, this process is a bit less subjective than some of the other assessments we’ve done here, which is a nice change!

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
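
Besides reading the plots, Cook’s distance can be checked numerically. This is a minimal sketch on made-up data (`x`, `y`, and `toy_model` are hypothetical; the last case is deliberately extreme), using base R’s cooks.distance() and the same .5 cutoff described above.

```r
# made-up data: 20 points on a clear line plus one extreme case
x <- c(1:20, 40)
y <- c(2 * (1:20) + rep(c(-1, 1), 10), 0)
toy_model <- lm(y ~ x)

# Cook's distance for every case in the model
cd <- cooks.distance(toy_model)

# row numbers of cases above the .5 cutoff used in the text
flagged <- unname(which(cd > 0.5))

print(flagged)
```

For the lab model, which(cooks.distance(reg_model) > 0.5) should come back empty if, as the plots suggest, no case is near the cutoff.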

12.5 Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a simple linear regression. Analysis of a Residuals vs Fitted plot suggested that there is linearity. We also checked Cook’s distance and a Residuals vs Leverage plot to detect outliers. A few cases had large residuals and several had above-average leverage but all were below the recommended cutoff for Cook’s distance.

13 Regression: View Test Output

summary(reg_model)

## 
## Call:
## lm(formula = belong ~ SocMedia_std, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19477 -0.43942  0.03643  0.49156  2.14691 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.29857    0.01242  265.59   <2e-16 ***
## SocMedia_std  0.20683    0.01242   16.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6963 on 3141 degrees of freedom
## Multiple R-squared:  0.0811, Adjusted R-squared:  0.08081 
## F-statistic: 277.2 on 1 and 3141 DF,  p-value: < 2.2e-16
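
A quick sanity check on this output: with a single predictor, the F-statistic is the square of the slope’s t value, and the square root of R-squared is the absolute value of Pearson’s r. The sketch below only does arithmetic on the numbers printed above.

```r
# values copied from the summary() output above
t_value   <- 16.65
f_value   <- 277.2
r_squared <- 0.0811

# with a single predictor, F equals t squared (up to rounding)
t_squared <- t_value^2

# and the square root of R-squared is the absolute value of Pearson's r
r_implied <- sqrt(r_squared)

round(c(t_squared = t_squared, r_implied = r_implied), 2)
```

The implied r of about .28 matches the correlation between Need to Belong and Social Media Use in the correlation matrix, so the regression and the correlation describe the same relationship.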

# note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else

14 Regression: Write Up Results

To test our hypothesis that Social Media Use (measured by Social Media Use Scale) will significantly predict Need to Belong, and that the relationship will be positive, we used a simple linear regression to model the relationship between the variables. We confirmed that our data met the assumptions of a linear regression, checking the linearity of the relationship using a Residuals vs Fitted plot and checking for outliers using Cook’s distance and a Residuals vs Leverage plot. Note: we are skipping the assumptions of normality and homogeneity of variance for this assignment.

As predicted, we found that Social Media Use significantly predicted Need to Belong, Adj. R² = .08, F(1, 3141) = 277.2, p < .001. The relationship between Social Media Use and Need to Belong was positive, ß = .21, t(3141) = 16.65, p < .001 (refer to Figure 1). According to Cohen (1988), this constitutes a small-to-medium effect size (R² = .08, corresponding to r = .28, which is just below the .30 benchmark for a medium effect and well below the .50 benchmark for a large one).

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Correlation and Simple Regression Homework

Sophia Freeland

2023-06-01
