library(tidyverse)
library(car)
library(corrplot)
library(lsr)
In today’s lab we will be computing correlation coefficients and conducting linear regression.
We will use the file “Exam Anxiety.csv” to start with. Please load it into R. We will primarily use 3 variables in this dataset: 1) Revise – meaning the amount of time a person spent studying for an exam, 2) Exam – their score on the exam and 3) Anxiety – their level of anxiety prior to the exam.
library(readr)
ExamAnxiety <- read_csv("ExamAnxiety.csv")
sapply(ExamAnxiety, class)
## Code Revise Exam Anxiety Gender
## "numeric" "numeric" "numeric" "numeric" "numeric"
As a review and because it’s good to have a sense of your data before running any tests:
#Anxiety vs. Exam
plot(ExamAnxiety$Anxiety, ExamAnxiety$Exam)
#Anxiety vs. Revise
plot(ExamAnxiety$Anxiety, ExamAnxiety$Revise)
#Revise vs. Exam
plot(ExamAnxiety$Revise, ExamAnxiety$Exam)
#OR
pairs(ExamAnxiety)
In Anxiety vs. Exam, Anxiety vs. Revise, and Revise vs. Exam, the data points are clustered and the relationships do not appear strongly linear.
The data is continuous
There is a linear relationship
The data is normal-ish
The data is continuous and numeric, except Gender, which is binary in this dataset.
I already looked at the linear relationships for Anxiety vs. Exam, Anxiety vs. Revise, and Revise vs. Exam and determined that they did not appear very linear, due to clustering of the data points.
To check whether the data is normal, I create a histogram and use the shapiro.test() function.
hist(ExamAnxiety$Anxiety)
shapiro.test(ExamAnxiety$Anxiety)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Anxiety
## W = 0.82242, p-value = 8.65e-10
hist(ExamAnxiety$Exam)
shapiro.test(ExamAnxiety$Exam)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Exam
## W = 0.95516, p-value = 0.001522
hist(ExamAnxiety$Revise)
shapiro.test(ExamAnxiety$Revise)
##
## Shapiro-Wilk normality test
##
## data: ExamAnxiety$Revise
## W = 0.80445, p-value = 2.252e-10
sapply(ExamAnxiety, length)
## Code Revise Exam Anxiety Gender
## 103 103 103 103 103
Given that our sample size is relatively large (n = 103), the Shapiro-Wilk test will flag even small departures from normality, so it is probably better to rely on histograms than on shapiro.test().
This data is skewed (not normal).
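As an optional extra check (not required here), a Q-Q plot gives a visual sense of normality that is less sensitive to sample size; the same idea applies to Exam and Revise.
#Q-Q plot of Anxiety against a normal distribution
qqnorm(ExamAnxiety$Anxiety)
qqline(ExamAnxiety$Anxiety)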
Let’s create a correlation matrix of Anxiety, Exam and Revise.
#Anxiety vs. Exam
cor(ExamAnxiety$Anxiety, ExamAnxiety$Exam)
## [1] -0.4409934
The Pearson correlation coefficient between anxiety and exam is approximately -0.44. This indicates a moderate negative linear relationship: as anxiety increases, the exam score tends to decrease (and vice versa).
#Anxiety vs. Revise
cor(ExamAnxiety$Anxiety, ExamAnxiety$Revise)
## [1] -0.7092493
The Pearson correlation coefficient between anxiety and revise is approximately -0.71. This indicates a fairly strong negative linear relationship: as anxiety increases, the amount of time spent studying tends to decrease (and vice versa).
#Revise vs. Exam
cor(ExamAnxiety$Revise, ExamAnxiety$Exam)
## [1] 0.3967207
The Pearson correlation coefficient between revise and exam is approximately 0.40. This indicates a relatively weak positive linear relationship: as the amount of time a person spent studying increases, their exam score also tends to increase (and vice versa).
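Since the prompt asks for a correlation matrix, here is a sketch that computes all three Pearson correlations at once and visualizes them with corrplot (loaded at the top); it assumes the columns are named Revise, Exam and Anxiety, as in the sapply() output above.
#Correlation matrix of the three variables
cor_mat <- cor(ExamAnxiety[, c("Revise", "Exam", "Anxiety")])
cor_mat
corrplot(cor_mat, method = "number")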
See above.
If the data have violated the assumptions of the Pearson correlation coefficient, we can use Spearman’s rank correlation coefficient instead.
As determined above, the data are not normally distributed, so the assumptions are violated.
#Anxiety vs. Exam
cor(ExamAnxiety$Anxiety, ExamAnxiety$Exam, method="spearman")
## [1] -0.4046141
The Spearman’s rank correlation between anxiety and exam is approximately -0.40. This indicates a moderate negative monotonic relationship: as anxiety increases, the exam score tends to decrease (and vice versa).
#Anxiety vs. Revise
cor(ExamAnxiety$Anxiety, ExamAnxiety$Revise, method="spearman")
## [1] -0.6219694
The Spearman’s rank correlation between anxiety and revise is approximately -0.62. This indicates a moderate negative monotonic relationship: as anxiety increases, the amount of time spent studying tends to decrease (and vice versa).
#Revise vs. Exam
cor(ExamAnxiety$Revise, ExamAnxiety$Exam, method="spearman")
## [1] 0.3498948
The Spearman’s rank correlation between revise and exam is approximately 0.35. This indicates a relatively weak positive monotonic relationship: as the amount of time a person spent studying increases, their exam score also tends to increase (and vice versa).
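If we also wanted a significance test for any of these correlations, cor.test() reports one; this is just a sketch, and with tied ranks R will warn that the Spearman p-value is approximate.
#Significance test for the Anxiety vs. Exam Spearman correlation
cor.test(ExamAnxiety$Anxiety, ExamAnxiety$Exam, method = "spearman")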
See #8
Let’s load the data: “Album Sales.csv”. This dataset has the following variables: adverts (amount spent on advertising in 1000s), sales (record sales in 1000s), airplay (number of plays on the radio per week), attract (attractiveness of the band – ranked from 1-10, 10 being highest).
AlbumSales <- read_csv("AlbumSales.csv")
sapply(AlbumSales, class)
## Adverts Sales Airplay Attract
## "numeric" "numeric" "numeric" "numeric"
sapply(AlbumSales, length)
## Adverts Sales Airplay Attract
## 200 200 200 200
pairs(AlbumSales)
The relationship between album sales and advertising budget looks somewhat linear, but the points are heavily clustered.
Let’s run a linear regression model of adverts on sales. Note: “Album Sales” is the dependent variable (the outcome variable) and “Advertising Budget” is the independent variable. After you run the model, examine the output by running the summary function. E.g., summary(reg_model)
#linear regression model
reg_model <- lm(AlbumSales$Sales ~ AlbumSales$Adverts)
summary(reg_model)
##
## Call:
## lm(formula = AlbumSales$Sales ~ AlbumSales$Adverts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## AlbumSales$Adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
Multiple R-squared: A measure of the proportion of the variance in the dependent variable that is explained by the model. In this case, it’s 0.3346, indicating that the model explains 33.46% of the variance in sales.
Pr(>|t|): The p-value associated with the t-statistic, used for hypothesis testing. In this case, the advertising p-value is very small (< 2e-16), indicating that the advertising coefficient is significantly different from zero at the 0.05 significance level.
sales_i = 134.1 + 0.09612 * adverts_i
sales_2 <- 1.341e+02 + 9.612e-02 * 2
sales_2
## [1] 134.2922
sales_3 <- 1.341e+02 + 9.612e-02 * 3
effect <- sales_3 - sales_2
effect * 1000
## [1] 96.12
Sales increase by the slope of the line: for every $1,000 increase in the advertising budget, sales are predicted to increase by 0.09612 thousand, i.e. about 96 extra records sold (0.09612 × 1000 = 96.12).
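Equivalently, we could refit the same model with a data argument (just a stylistic choice; the name reg_model_b is made up here) so that predict() can generate these values for us:
#Refit with a data argument so predict() can take new advertising values
reg_model_b <- lm(Sales ~ Adverts, data = AlbumSales)
predict(reg_model_b, newdata = data.frame(Adverts = c(2, 3)))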
Now let’s add airplay (number of plays on the radio) and
Attract (attractiveness of the band) to our model. We’ll store
this model in mod2.
#multiple regression model
mod2 <- lm(AlbumSales$Sales ~ AlbumSales$Adverts + AlbumSales$Airplay + AlbumSales$Attract)
summary(mod2)
##
## Call:
## lm(formula = AlbumSales$Sales ~ AlbumSales$Adverts + AlbumSales$Airplay +
## AlbumSales$Attract)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## AlbumSales$Adverts 0.084885 0.006923 12.261 < 2e-16 ***
## AlbumSales$Airplay 3.367425 0.277771 12.123 < 2e-16 ***
## AlbumSales$Attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
Is the overall model (mod2) significant? Hint: look at the p-value at the end of the summary output (associated with the F-statistic). Yes – the p-value (< 2.2e-16) indicates that the overall model is significant.
The overall model explains approximately 66.47% of the variance in sales (Multiple R-squared: 0.6647).
The simple linear regression model had a Multiple R-squared of 0.3346, so adding airplay and attractiveness as predictors substantially increased the proportion of variance explained.
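To formally test whether that increase in explained variance is significant, one option (a sketch, not required by the lab) is an F-test comparing the two nested models with anova():
#Does adding Airplay and Attract significantly improve on the adverts-only model?
anova(reg_model, mod2)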
The coefficient for advertising (0.084885) suggests that for every additional $1,000 spent on advertising, sales are expected to increase by approximately 0.085 thousand (about 85 records), holding airplay and attractiveness constant. This relationship is highly significant (p-value < 2e-16), indicating that advertising has a strong positive association with sales.
The coefficient for Airplay (3.367425) suggests that for each additional play on the radio per week, sales are expected to increase by approximately 3.367 thousand (about 3,367 records), holding advertising and attractiveness constant. This relationship is highly significant (p-value < 2e-16), indicating that airplay has a strong positive association with sales.
The coefficient for Attract (11.086335) suggests that for each one-point increase in the band’s attractiveness rating, sales are expected to increase by approximately 11.086 thousand (about 11,086 records), holding advertising and airplay constant. This relationship is highly significant (p-value = 9.49e-06), indicating that attractiveness has a strong positive association with sales.
sales = -26.612958 + 0.084885(advertising) + 3.367425(plays) + 11.086335(attractiveness)
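As a quick sketch of using this equation, we can pull the coefficients out of mod2 and predict sales for a hypothetical album; the advertising ($500k), airplay (20 plays per week) and attractiveness (7) values below are made up for illustration.
#Predicted sales (in thousands of records) for a hypothetical album
b <- coef(mod2)
b[1] + b[2]*500 + b[3]*20 + b[4]*7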
At this point, it is hard to compare the coefficients for advertising, plays on the radio, and attractiveness because they are measured on different scales. We can examine standardized coefficients, which we obtain by running this code:
standardCoefs(mod2)
## b beta
## AlbumSales$Adverts 0.08488483 0.5108462
## AlbumSales$Airplay 3.36742517 0.5119881
## AlbumSales$Attract 11.08633520 0.1916834
The first column (b) contains the unstandardized coefficients (the same values as in the summary output above). The second column (beta) contains the standardized coefficients.
You can interpret this as: for 1 standard deviation increase in
advertising, you see a 0.5108 standard deviation increase
in sales. We could calculate these standard deviations (the standard
deviation of advertising is $485,655 and the standard
deviation of sales is 80.669 – but remember it’s measured
in thousands, so 80,669). So, for 1 standard deviation
increase in advertising (aka $485,655), there is a result
of 0.511*80669 = 41,221 extra records sold.
Since the second column represents the standardized beta coefficients, 0.5119881 is the largest standardized beta coefficient. This value corresponds to Airplay. For every 1 standard deviation increase in airplay, we see a corresponding increase in sales of 0.512 standard deviations. This is only slightly larger than the effect of advertising.
0.512*80669
## [1] 41302.53
This means that for a 1 standard deviation increase in airplay, there is an estimated increase of approximately 41,302 extra records sold.
41302-41221
## [1] 81
That’s an extra 81 records sold.
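As a side note (not part of the lab), the standardized betas from standardCoefs() can be reproduced by z-scoring every variable with scale() before fitting; the slopes of this model should match the beta column above.
#Refit with standardized variables; the slopes are the standardized betas
mod2_z <- lm(scale(Sales) ~ scale(Adverts) + scale(Airplay) + scale(Attract), data = AlbumSales)
coef(mod2_z)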
There are many assumptions associated with regression.
Predictor variables must be quantitative or categorical (with 2 categories). The outcome variable must be quantitative, continuous and unbounded.
The predictors should have some variation in value. (Variance not equal to 0)
Multicollinearity: Your independent variables should not be too correlated with each other
Predictors are uncorrelated with ‘external variables’. Shouldn’t be variables that influence the predictor variables that aren’t included in the model.
Homoscedasticity: The variance of the residuals should be constant
Normally-distributed errors: The residuals should be normally distributed
Independent errors: Residual terms (errors) should be independent & therefore uncorrelated
All the values of the outcome variable are independent
The relationship between the predictors and the outcome variable lie along a straight line.
Of these assumptions, the ones we can test directly here are homoscedasticity and normally distributed errors.
We will plot our residuals to take a look at them.
hist(residuals(mod2))
Next, we should look at the 4 plots outputted by R in the plot function:
plot(mod2)
The first plot shows us Residuals vs Fitted. If our relationship is truly linear, we would expect the red line to be horizontal.
Has our assumption of linearity been met? Yes
The second plot shows us Q-Q plot. This shows us whether our residuals are normally distributed (similar test as the hist() function from above). The dots should fall on the diagonally dotted line.
The third plot presents the predicted [fitted] value vs. the square root of the absolute value of the standardized residual. This allows us to look for heteroscedasticity and non-linearity. To meet our assumptions, we want a random array of dots. If we have any funneling or trend in the residuals, it suggests we have deviated from our assumption.
It is important to test whether some values have a large influence on the parameters in your model. There are several different diagnostics we can use to determine whether a case is particularly influential – we will only examine 2 today.
Our fourth plot from the plot() function shows us standardized residuals vs. leverage. If we have an influential point, we will actually get a line that represents Cook’s distance. Check to see if there’s any points all by themselves on the left – suggesting that one point has a lot of influence.
Cook’s Distance – considers the effect of a single case on the model as a whole. Values greater than 1 may be of concern. This is probably the most popular way to test influential cases.
To get Cook’s distance, type:
#cooks.distance(mod2)
#this gives every Cook’s distance value
max(cooks.distance(mod2)) #gives you the maximum value
## [1] 0.07076588
This gives the maximum Cook’s distance across all observations; we want it to be less than 1, and values greater than 1 are a problem. Here the maximum is about 0.07, so no case is a concern.
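If we wanted to list any observations that cross that threshold (there are none here), something like this would do it:
#Which rows, if any, have a Cook's distance greater than 1?
which(cooks.distance(mod2) > 1)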
If two or more predictors in a regression model are highly correlated, it can make your betas unreliable and/or increase standard errors. Generally, if two variables are highly correlated it is very difficult to tell which is having an effect on the model. We will use the variance inflation factor (VIF) to determine if multicollinearity is a problem. Rule of thumb: a VIF greater than ~2.5 is problematic.
vif(mod2)
## AlbumSales$Adverts AlbumSales$Airplay AlbumSales$Attract
## 1.014593 1.042504 1.038455
Values greater than 10 are very problematic, and more conservative guidelines suggest staying below 2.5. All three VIFs here are close to 1, so multicollinearity is not a concern.
We can run a Durbin-Watson test to determine if our errors are independent. Values less than 1 or greater than 3 should raise alarm bells.
durbinWatsonTest(mod2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0026951 1.949819 0.748
## Alternative hypothesis: rho != 0
Test statistic close to 2: This indicates no autocorrelation in the residuals. The residuals are independent, which is a desirable property in a linear regression model.
marriage <- read_csv("Marriage.csv")
sapply(marriage, class)
## education age_marr1 age_menarche
## "numeric" "numeric" "numeric"
# Create a scatter plot with a linear regression line of education and age at first marriage
ggplot(data = marriage, aes(x = education, y = age_marr1)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Education", y = "Age at First Marriage")
#linear regression model
reg_model <- lm(marriage$age_marr1 ~ marriage$education)
summary(reg_model)
##
## Call:
## lm(formula = marriage$age_marr1 ~ marriage$education)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4357 -2.5508 -0.5242 1.8209 13.4492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.32064 0.41139 42.103 < 2e-16 ***
## marriage$education 0.37169 0.04447 8.358 1.64e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.299 on 341 degrees of freedom
## Multiple R-squared: 0.17, Adjusted R-squared: 0.1676
## F-statistic: 69.85 on 1 and 341 DF, p-value: 1.644e-15
Yes, the linear regression model suggests a significant (p-value: 1.644e-15) weak positive relationship between education and age at first marriage.
Multiple R-squared =0.17, Adjusted R-squared= 0.1676
The linear regression model suggests a significant (p-value: 1.644e-15) weak positive relationship between education and age at first marriage, with each additional year of education associated with a 0.37169-year increase in age at first marriage. The model explains 17% of the variance in age at first marriage (Multiple R-squared = 0.17).
age_marr1_8 <- 0.37169*8 + 17.32064
age_marr1_8
## [1] 20.29416
12*0.29416
## [1] 3.52992
The model predicts that a person with 8 years of education will first marry at around 20 years and 3.5 months of age.
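The same prediction can be reproduced directly from the fitted model’s coefficients (a small sketch, equivalent to the hand calculation above):
#Predicted age at first marriage for 8 years of education
coefs <- coef(reg_model)
coefs[1] + coefs[2]*8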
#multiple regression model
model_2 <- lm(marriage$age_marr1 ~ marriage$education + marriage$age_menarche)
summary(model_2)
##
## Call:
## lm(formula = marriage$age_marr1 ~ marriage$education + marriage$age_menarche)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2601 -2.4948 -0.4948 1.8787 13.3872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.72757 1.79391 8.767 < 2e-16 ***
## marriage$education 0.37224 0.04449 8.367 1.55e-15 ***
## marriage$age_menarche 0.11798 0.12931 0.912 0.362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.3 on 340 degrees of freedom
## Multiple R-squared: 0.172, Adjusted R-squared: 0.1672
## F-statistic: 35.32 on 2 and 340 DF, p-value: 1.153e-14
For model_2: Multiple R-squared = 0.172, Adjusted R-squared = 0.1672.
In model_2, is education a significant predictor of age at marriage? Is age at menarche a significant predictor of age at marriage? The coefficient for education (0.37224) suggests that for every additional year of education, a person’s age at first marriage is expected to increase by approximately 0.37 years, holding age at menarche constant. This relationship is highly significant (p-value = 1.55e-15), indicating that years of education has a moderate positive association with a person’s age at first marriage.
The coefficient for age at menarche (0.11798) suggests that for every additional year of age at menarche, a person’s age at first marriage is expected to increase by approximately 0.12 years, holding education constant. However, this relationship is not significant (p-value = 0.362), indicating that age at menarche has no significant association with a person’s age at first marriage.
standardCoefs(model_2)
## b beta
## marriage$education 0.3722382 0.41293603
## marriage$age_menarche 0.1179825 0.04502747
The output shows the original beta value for education as
0.3722382 and the standardized value
0.41293603. The original beta value for age at menarche is
0.1179825 and the standardized value
0.04502747.
We interpret the standardized betas as: with a 1 standard deviation increase in education we see a corresponding increase of 0.41293603 standard deviations in age at marriage, and with a 1 standard deviation increase in age at menarche we see a corresponding increase of 0.04502747 standard deviations in age at marriage. Because 0.41293603 is larger than 0.04502747, education has a larger effect on age at marriage.
plot(model_2)
The diagnostic plots look acceptable; the Q-Q plot shows a slight curve at the tails.
Are any cases having a large influence on model_2? Run Cook’s distance to determine this.
#cooks.distance(model_2)
max(cooks.distance(model_2))
## [1] 0.08728567
No value is greater than 1.
vif(model_2)
## marriage$education marriage$age_menarche
## 1.000185 1.000185
Both values are less than 2.5, therefore I conclude that the predictors are not highly correlated with each other.
dwt(model_2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02906788 1.939552 0.604
## Alternative hypothesis: rho != 0
The Durbin-Watson statistic for this model is 1.939552, with an associated p-value of 0.604. The test statistic is close to 2, which indicates no autocorrelation in the residuals. Furthermore, the high p-value means there is not enough evidence to reject the null hypothesis of no autocorrelation. The residuals are independent.