library(tidyverse)
library(car)
library(corrplot)
library(lsr)
In today’s lab we will be computing correlation coefficients and conducting linear regression.
We will use the file “Exam Anxiety.csv” to start with. Please load it into R. We will primarily use 3 variables in this dataset: 1) Revise – meaning the amount of time a person spent studying for an exam, 2) Exam – their score on the exam and 3) Anxiety – their level of anxiety prior to the exam.
library(readr)
ExamAnxiety <- read_csv("ExamAnxiety.csv")
sapply(ExamAnxiety, class)
## Code Revise Exam Anxiety Gender
## "numeric" "numeric" "numeric" "numeric" "numeric"
As a review and because it’s good to have a sense of your data before running any tests:
#Anxiety vs. Exam
plot(ExamAnxiety$Anxiety, ExamAnxiety$Exam)
#Anxiety vs. Revise
plot(ExamAnxiety$Anxiety, ExamAnxiety$Revise)
#Revise vs. Exam
plot(ExamAnxiety$Revise, ExamAnxiety$Exam)
#OR
pairs(ExamAnxiety)
In Anxiety vs. Exam, Anxiety vs. Revise, and Revise vs. Exam, the data points are clustered and the relationships do not appear strongly linear.
The data is continuous
There is a linear relationship
The data is normal-ish
The data is continuous and numeric, except Gender, which is binary in this dataset.
I already looked at the linear relationships for Anxiety vs. Exam, Anxiety vs. Revise, and Revise vs. Exam and determined that they did not appear very linear, due to clustering of the data points.
To check whether the data is normal, I create a histogram and use the shapiro.test() function.
hist(ExamAnxiety$Anxiety)
shapiro.test(ExamAnxiety$Anxiety)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Anxiety
## W = 0.82242, p-value = 8.65e-10
hist(ExamAnxiety$Exam)
shapiro.test(ExamAnxiety$Exam)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Exam
## W = 0.95516, p-value = 0.001522
hist(ExamAnxiety$Revise)
shapiro.test(ExamAnxiety$Revise)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Revise
## W = 0.80445, p-value = 2.252e-10
sapply(ExamAnxiety, length)
## Code Revise Exam Anxiety Gender
## 103 103 103 103 103
Given that our sample size is relatively large (n = 103), the Shapiro-Wilk test will flag even small departures from normality, so it is probably better to rely on histograms than on shapiro.test().
This data is skewed (not normal).
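As an optional extra check (not required here), a Q-Q plot gives a visual sense of normality that is less sensitive to sample size; the same idea applies to Exam and Revise.
#Q-Q plot of Anxiety against a normal distribution
qqnorm(ExamAnxiety$Anxiety)
qqline(ExamAnxiety$Anxiety)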
Let’s create a correlation matrix of Anxiety, Exam and Revise.
#Anxiety vs. Exam
cor(ExamAnxiety$Anxiety, ExamAnxiety$Exam)
## [1] -0.4409934
The Pearson correlation coefficient between anxiety and exam is approximately -0.44. This indicates a moderate negative linear relationship: as anxiety increases, the exam score tends to decrease (and vice versa).
#Anxiety vs. Revise
cor(ExamAnxiety$Anxiety, ExamAnxiety$Revise)
## [1] -0.7092493
The Pearson correlation coefficient between anxiety and revise is approximately -0.71. This indicates a fairly strong negative linear relationship: as anxiety increases, the amount of time spent studying tends to decrease (and vice versa).
#Revise vs. Exam
cor(ExamAnxiety$Revise, ExamAnxiety$Exam)
## [1] 0.3967207
The Pearson correlation coefficient between revise and exam is approximately 0.40. This indicates a relatively weak positive linear relationship: as the amount of time a person spent studying increases, their exam score also tends to increase (and vice versa).
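Since the prompt asks for a correlation matrix, here is a sketch that computes all three Pearson correlations at once and visualizes them with corrplot (loaded at the top); it assumes the columns are named Revise, Exam and Anxiety, as in the sapply() output above.
#Correlation matrix of the three variables
cor_mat <- cor(ExamAnxiety[, c("Revise", "Exam", "Anxiety")])
cor_mat
corrplot(cor_mat, method = "number")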
See above.
If the data have violated the assumptions of the Pearson correlation coefficient, we can use Spearman’s rank correlation coefficient instead.
As determined above, the data are not normally distributed, so the assumptions are violated.
#Anxiety vs. Exam
cor(ExamAnxiety$Anxiety, ExamAnxiety$Exam, method="spearman")
## [1] -0.4046141
The Spearman’s rank correlation between anxiety and exam is approximately -0.40. This indicates a moderate negative monotonic relationship: as anxiety increases, the exam score tends to decrease (and vice versa).
#Anxiety vs. Revise
cor(ExamAnxiety$Anxiety, ExamAnxiety$Revise, method="spearman")
## [1] -0.6219694
The Spearman’s rank correlation between anxiety and revise is approximately -0.62. This indicates a moderate negative monotonic relationship: as anxiety increases, the amount of time spent studying tends to decrease (and vice versa).
#Revise vs. Exam
cor(ExamAnxiety$Revise, ExamAnxiety$Exam, method="spearman")
## [1] 0.3498948
The Spearman’s rank correlation between revise and exam is approximately 0.35. This indicates a relatively weak positive monotonic relationship: as the amount of time a person spent studying increases, their exam score also tends to increase (and vice versa).
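If we also wanted a significance test for any of these correlations, cor.test() reports one; this is just a sketch, and with tied ranks R will warn that the Spearman p-value is approximate.
#Significance test for the Anxiety vs. Exam Spearman correlation
cor.test(ExamAnxiety$Anxiety, ExamAnxiety$Exam, method = "spearman")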
See #8
Let’s load the data: “Album Sales.csv”. This dataset has the following variables: adverts (amount spent on advertising in 1000s), sales (record sales in 1000s), airplay (number of plays on the radio per week), attract (attractiveness of the band – ranked from 1-10, 10 being highest).
AlbumSales <- read_csv("AlbumSales.csv")
sapply(AlbumSales, class)
## Adverts Sales Airplay Attract
## "numeric" "numeric" "numeric" "numeric"
sapply(AlbumSales, length)
## Adverts Sales Airplay Attract
## 200 200 200 200
pairs(AlbumSales)
The relationship between album sales and advertising budget looks somewhat linear, but the points are heavily clustered.
Let’s run a linear regression model of adverts on sales. Note: “Album Sales” is the dependent variable (the outcome variable) and “Advertising Budget” is the independent variable. After you run the model, examine the output by running the summary function. E.g., summary(reg_model)
#linear regression model
reg_model <- lm(AlbumSales$Sales ~ AlbumSales$Adverts)
summary(reg_model)
##
## Call:
## lm(formula = AlbumSales$Sales ~ AlbumSales$Adverts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## AlbumSales$Adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
Multiple R-squared: A measure of the proportion of the variance in the dependent variable that is explained by the model. In this case, it’s 0.3346, indicating that the model explains 33.46% of the variance in sales.
Pr(>|t|): The p-value associated with the t-statistic, used for hypothesis testing. In this case, the advertising p-value is very small (< 2e-16), indicating that the advertising coefficient is significantly different from zero at the 0.05 significance level.
sales_i = 134.1 + 0.09612 * adverts_i
sales_2 <- 1.341e+02 + 9.612e-02 * 2
sales_2
## [1] 134.2922
sales_3 <- 1.341e+02 + 9.612e-02 * 3
effect <- sales_3 - sales_2
effect * 1000
## [1] 96.12
Sales increase by the slope of the line: for every $1,000 increase in the advertising budget, sales are predicted to increase by 0.09612 thousand, i.e. about 96 extra records sold (0.09612 × 1000 = 96.12).
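Equivalently, we could refit the same model with a data argument (just a stylistic choice; the name reg_model_b is made up here) so that predict() can generate these values for us:
#Refit with a data argument so predict() can take new advertising values
reg_model_b <- lm(Sales ~ Adverts, data = AlbumSales)
predict(reg_model_b, newdata = data.frame(Adverts = c(2, 3)))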
Now let’s add airplay (number of plays on the radio) and
Attract (attractiveness of the band) to our model. We’ll store
this model in mod2.
#multiple regression model
mod2 <- lm(AlbumSales$Sales ~ AlbumSales$Adverts + AlbumSales$Airplay + AlbumSales$Attract)
summary(mod2)
##
## Call:
## lm(formula = AlbumSales$Sales ~ AlbumSales$Adverts + AlbumSales$Airplay +
## AlbumSales$Attract)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## AlbumSales$Adverts 0.084885 0.006923 12.261 < 2e-16 ***
## AlbumSales$Airplay 3.367425 0.277771 12.123 < 2e-16 ***
## AlbumSales$Attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
Is the overall model (mod2) significant? Hint: look at the p-value at the end of the summary output (associated with the F-statistic). Yes – the p-value (< 2.2e-16) indicates that the overall model is significant.
The overall model explains approximately 66.47% of the variance in sales (Multiple R-squared: 0.6647).
The simple linear regression model had a Multiple R-squared of 0.3346, so adding airplay and attractiveness as predictors substantially increased the proportion of variance explained.
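To formally test whether that increase in explained variance is significant, one option (a sketch, not required by the lab) is an F-test comparing the two nested models with anova():
#Does adding Airplay and Attract significantly improve on the adverts-only model?
anova(reg_model, mod2)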
The coefficient for advertising (0.084885) suggests that for every additional $1,000 spent on advertising, sales are expected to increase by approximately 0.085 thousand (about 85 records), holding airplay and attractiveness constant. This relationship is highly significant (p-value < 2e-16), indicating that advertising has a strong positive association with sales.
The coefficient for Airplay (3.367425) suggests that for each additional play on the radio per week, sales are expected to increase by approximately 3.367 thousand (about 3,367 records), holding advertising and attractiveness constant. This relationship is highly significant (p-value < 2e-16), indicating that airplay has a strong positive association with sales.
The coefficient for Attract (11.086335) suggests that for each one-point increase in the band’s attractiveness rating, sales are expected to increase by approximately 11.086 thousand (about 11,086 records), holding advertising and airplay constant. This relationship is highly significant (p-value = 9.49e-06), indicating that attractiveness has a strong positive association with sales.
sales = -26.612958 + 0.084885(advertising) + 3.367425(plays) + 11.086335(attractiveness)
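As a quick sketch of using this equation, we can pull the coefficients out of mod2 and predict sales for a hypothetical album; the advertising ($500k), airplay (20 plays per week) and attractiveness (7) values below are made up for illustration.
#Predicted sales (in thousands of records) for a hypothetical album
b <- coef(mod2)
b[1] + b[2]*500 + b[3]*20 + b[4]*7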
At this point, it is hard to compare the coefficients for advertising, plays on the radio, and attractiveness because they are measured on different scales. We can examine standardized coefficients, which we obtain by running this code:
standardCoefs(mod2)
## b beta
## AlbumSales$Adverts 0.08488483 0.5108462
## AlbumSales$Airplay 3.36742517 0.5119881
## AlbumSales$Attract 11.08633520 0.1916834
The first column (b) contains the unstandardized coefficients (the same values as in the summary output above). The second column (beta) contains the standardized coefficients.
You can interpret this as: for 1 standard deviation increase in
advertising, you see a 0.5108 standard deviation increase
in sales. We could calculate these standard deviations (the standard
deviation of advertising is $485,655 and the standard
deviation of sales is 80.669 – but remember it’s measured
in thousands, so 80,669). So, for 1 standard deviation
increase in advertising (aka $485,655), there is a result
of 0.511*80669 = 41,221 extra records sold.
Since the second column represents the standardized beta coefficients, 0.5119881 is the largest standardized beta coefficient. This value corresponds to Airplay. For every 1 standard deviation increase in airplay, we see a corresponding increase in sales of 0.512 standard deviations. This is only slightly larger than the effect of advertising.
0.512*80669
## [1] 41302.53
This means that for a 1 standard deviation increase in airplay, there is an estimated increase of approximately 41,302 extra records sold.
41302-41221
## [1] 81
That’s an extra 81 records sold.
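As a side note (not part of the lab), the standardized betas from standardCoefs() can be reproduced by z-scoring every variable with scale() before fitting; the slopes of this model should match the beta column above.
#Refit with standardized variables; the slopes are the standardized betas
mod2_z <- lm(scale(Sales) ~ scale(Adverts) + scale(Airplay) + scale(Attract), data = AlbumSales)
coef(mod2_z)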
There are many assumptions associated with regression.
Predictor variables must be quantitative or categorical (with 2 categories). The outcome variable must be quantitative, continuous and unbounded.
The predictors should have some variation in value. (Variance not equal to 0)
Multicollinearity: Your independent variables should not be too correlated with each other
Predictors are uncorrelated with ‘external variables’. Shouldn’t be variables that influence the predictor variables that aren’t included in the model.
Homoscedasticity: The variance of the residuals should be constant
Normally-distributed errors: The residuals should be normally distributed
Independent errors: Residual terms (errors) should be independent & therefore uncorrelated
All the values of the outcome variable are independent
The relationship between the predictors and the outcome variable lie along a straight line.
Of these assumptions, the ones we can test directly here are homoscedasticity and normally distributed errors.
We will plot our residuals to take a look at them.
hist(residuals(mod2))
Next, we should look at the 4 plots outputted by R in the plot function:
plot(mod2)
The first plot shows us Residuals vs Fitted. If our relationship is truly linear, we would expect the red line to be horizontal.
Has our assumption of linearity been met? Yes
The second plot shows us Q-Q plot. This shows us whether our residuals are normally distributed (similar test as the hist() function from above). The dots should fall on the diagonally dotted line.
The third plot presents the predicted [fitted] value vs. the square root of the absolute value of the standardized residual. This allows us to look for heteroscedasticity and non-linearity. To meet our assumptions, we want a random array of dots. If we have any funneling or trend in the residuals, it suggests we have deviated from our assumption.
It is important to test whether some values have a large influence on the parameters in your model. There are several different diagnostics we can use to determine whether a case is particularly influential – we will only examine 2 today.
Our fourth plot from the plot() function shows us standardized residuals vs. leverage. If we have an influential point, we will actually get a line that represents Cook’s distance. Check to see if there’s any points all by themselves on the left – suggesting that one point has a lot of influence.
Cook’s Distance – considers the effect of a single case on the model as a whole. Values greater than 1 may be of concern. This is probably the most popular way to test influential cases.
To get Cook’s distance, type:
#cooks.distance(mod2)
#this gives every Cook’s distance value
max(cooks.distance(mod2)) #gives you the maximum value
## [1] 0.07076588
This gives the maximum Cook’s distance across all observations; we want it to be less than 1, and values greater than 1 are a problem. Here the maximum is about 0.07, so no case is a concern.
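If we wanted to list any observations that cross that threshold (there are none here), something like this would do it:
#Which rows, if any, have a Cook's distance greater than 1?
which(cooks.distance(mod2) > 1)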
If two or more predictors in a regression model are highly correlated, it can make your betas unreliable and/or increase standard errors. Generally, if two variables are highly correlated it is very difficult to tell which is having an effect on the model. We will use the variance inflation factor (VIF) to determine if multicollinearity is a problem. Rule of thumb: a VIF greater than ~2.5 is problematic.
vif(mod2)
## AlbumSales$Adverts AlbumSales$Airplay AlbumSales$Attract
## 1.014593 1.042504 1.038455
Values greater than 10 are very problematic, and more conservative guidelines suggest staying below 2.5. All three VIFs here are close to 1, so multicollinearity is not a concern.
We can run a Durbin-Watson test to determine if our errors are independent. Values less than 1 or greater than 3 should raise alarm bells.
durbinWatsonTest(mod2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0026951 1.949819 0.748
## Alternative hypothesis: rho != 0
Test statistic close to 2: This indicates no autocorrelation in the residuals. The residuals are independent, which is a desirable property in a linear regression model.
marriage <- read_csv("Marriage.csv")
sapply(marriage, class)
## education age_marr1 age_menarche
## "numeric" "numeric" "numeric"
# Create a scatter plot with a linear regression line of education and age at first marriage
ggplot(data = marriage, aes(x = education, y = age_marr1)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Education", y = "Age at First Marriage")
#linear regression model
reg_model <- lm(marriage$age_marr1 ~ marriage$education)
summary(reg_model)
##
## Call:
## lm(formula = marriage$age_marr1 ~ marriage$education)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4357 -2.5508 -0.5242 1.8209 13.4492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.32064 0.41139 42.103 < 2e-16 ***
## marriage$education 0.37169 0.04447 8.358 1.64e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.299 on 341 degrees of freedom
## Multiple R-squared: 0.17, Adjusted R-squared: 0.1676
## F-statistic: 69.85 on 1 and 341 DF, p-value: 1.644e-15
Yes, the linear regression model suggests a significant (p-value: 1.644e-15) weak positive relationship between education and age at first marriage.
Multiple R-squared =0.17, Adjusted R-squared= 0.1676
The linear regression model suggests a significant (p-value: 1.644e-15) weak positive relationship between education and age at first marriage, with each additional year of education associated with a 0.37169-year increase in age at first marriage. The model explains 17% of the variance in age at first marriage (Multiple R-squared = 0.17).
age_marr1_8 <- 0.37169*8 + 17.32064
age_marr1_8
## [1] 20.29416
12*0.29416
## [1] 3.52992
The model predicts that a person with 8 years of education will first marry at around 20 years and 3.5 months of age.
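The same prediction can be reproduced directly from the fitted model’s coefficients (a small sketch, equivalent to the hand calculation above):
#Predicted age at first marriage for 8 years of education
coefs <- coef(reg_model)
coefs[1] + coefs[2]*8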
#multiple regression model
model_2 <- lm(marriage$age_marr1 ~ marriage$education + marriage$age_menarche)
summary(model_2)
##
## Call:
## lm(formula = marriage$age_marr1 ~ marriage$education + marriage$age_menarche)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2601 -2.4948 -0.4948 1.8787 13.3872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.72757 1.79391 8.767 < 2e-16 ***
## marriage$education 0.37224 0.04449 8.367 1.55e-15 ***
## marriage$age_menarche 0.11798 0.12931 0.912 0.362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.3 on 340 degrees of freedom
## Multiple R-squared: 0.172, Adjusted R-squared: 0.1672
## F-statistic: 35.32 on 2 and 340 DF, p-value: 1.153e-14
For model_2: Multiple R-squared = 0.172, Adjusted R-squared = 0.1672.
In model_2, is education a significant predictor of age at marriage? Is age at menarche a significant predictor of age at marriage? The coefficient for education (0.37224) suggests that for every additional year of education, a person’s age at first marriage is expected to increase by approximately 0.37 years, holding age at menarche constant. This relationship is highly significant (p-value = 1.55e-15), indicating that years of education has a moderate positive association with a person’s age at first marriage.
The coefficient for age at menarche (0.11798) suggests that for every additional year of age at menarche, a person’s age at first marriage is expected to increase by approximately 0.12 years, holding education constant. However, this relationship is not significant (p-value = 0.362), indicating that age at menarche has no significant association with a person’s age at first marriage.
standardCoefs(model_2)
## b beta
## marriage$education 0.3722382 0.41293603
## marriage$age_menarche 0.1179825 0.04502747
The output shows the original beta value for education as
0.3722382 and the standardized value
0.41293603. The original beta value for age at menarche is
0.1179825 and the standardized value
0.04502747.
We interpret the standardized betas as: with a 1 standard deviation increase in education we see a corresponding increase of 0.41293603 standard deviations in age at marriage, and with a 1 standard deviation increase in age at menarche we see a corresponding increase of 0.04502747 standard deviations in age at marriage. Because 0.41293603 is larger than 0.04502747, education has a larger effect on age at marriage.
plot(model_2)
The diagnostic plots look acceptable; the Q-Q plot shows a slight curve at the tails.
Are any cases having a large influence on model_2? Run Cook’s distance to determine this.
#cooks.distance(model_2)
max(cooks.distance(model_2))
## [1] 0.08728567
No value is greater than 1.
vif(model_2)
## marriage$education marriage$age_menarche
## 1.000185 1.000185
Both values are less than 2.5, therefore I conclude that the predictors are not highly correlated with each other.
dwt(model_2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02906788 1.939552 0.604
## Alternative hypothesis: rho != 0
The Durbin-Watson statistic for this model is 1.939552, with an associated p-value of 0.604. The test statistic is close to 2, which indicates no autocorrelation in the residuals. Furthermore, the high p-value means there is not enough evidence to reject the null hypothesis of no autocorrelation. The residuals are independent.