Learning outcomes

At the end of today’s workshop you should be able to do the following:

  1. Conduct a polynomial regression

  2. Correct a correlation for range restriction

  3. Correct a correlation for unreliability

Install required packages

install.packages("psych")
install.packages("mosaic")
install.packages("car")
install.packages("Hmisc")
install.packages("tidyverse")
install.packages("ggplot2")

Load required packages

The required packages are the same as those you just installed. Write the code needed to load them in the R chunk below.

library(mosaic)
library(car)
library(Hmisc)
library(psych)
library(tidyverse)
library(ggplot2)

Polynomial regression

Polynomial regression is used to test the shape of non-linear relationships - that is, relationships assumed to follow a shape other than a straight line.

Let’s say, for example, that we are interested in the happiness of British college students over the last 10 years. We have reason to suspect that the trend will be non-linear: although college student happiness seems to have risen after the recession, there is good evidence that this rise has slowed in recent years due to several macro-economic factors, including increased competition for graduate jobs and their availability.

Okay, given this information, we need to create a polynomial regression that can approximate the levels of happiness of British college students across the past 10 years. I’m going to do something different in terms of data entry this week: I’m going to show you how to create your own data frames. First I’m going to enter the year variable covering 2010 to 2020.

Year <- c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020) # create an object year that contains (c) 2010 through 2020

Then I’m going to create the happiness variable, which contains mean happiness data from students in the UK who completed the NSS survey. Happiness is captured on a 1 to 7 scale, with 1 indicating not at all happy and 7 extremely happy.

Happiness <- c(4.835, 4.970, 5.085, 5.160, 5.310, 5.260, 5.235, 5.255, 5.235, 5.210, 5.175)

Finally, I am going to combine the Year and Happiness variables into a new data frame called happiness.

happiness <- data.frame(Year, Happiness)
happiness

As you can see, we now have a data object in R called happiness that houses the data of interest.

Okay, so we have the data. Whenever you are conducting polynomial regression, it is important to first mean center the predictor so that there is a neighborhood of zero (i.e., the mean value is 0, so the deviations above and below 0 balance out - remember that the mean is the balancing point of the distribution!). Note we are not changing anything about the distribution of the predictor variable, just its units, so that the shape of the trend can be plotted through 0. In our case, finding the mean of the predictor is easy: it is just the middle year, 2015. So we just subtract 2015 from each observed year to center it.

happiness$Year <- (happiness$Year-2015)
happiness
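
As an aside, you can achieve the same mean centering without hard-coding the mean by using R’s scale() function with scale = FALSE. Here is a minimal sketch, applied to a fresh copy of the data so we don’t center the already-centered Year variable twice (the name happiness_alt is just for illustration):

# a sketch: mean center Year with scale() rather than subtracting the mean by hand
happiness_alt <- data.frame(Year, Happiness)  # fresh, uncentered copy of the data
happiness_alt$Year <- as.numeric(scale(happiness_alt$Year, center = TRUE, scale = FALSE))
happiness_alt  # Year now runs from -5 to 5, as before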

Now let’s plot these values to get an idea of the shape of the trend. We can do this really easily in ggplot, given the visualisation methods we covered back in the Michaelmas term.

plot <- ggplot(happiness, aes(x = Year, y = Happiness)) + geom_point() + theme_classic()
plot

Okay, so we can see here that the trend is definitely not linear. Just as we suspected, there is an initial period of growth in happiness following the recession, but this slows down from around 2014 (-1 on our recoded year variable).

Yet it is not sufficient to just visualise this trend; we need to statistically ascertain its best fitting shape. For this we will run polynomial regression.

The 1st degree linear model

As we saw in the lecture, polynomial regression just uses the linear model, but adds powers to the predictor variable to model the shape of a relationship.

You might be asking: at what degree of the polynomial should we stop? Well, it depends on the degree of precision that we seek. The greater the degree of the polynomial, the greater the accuracy of the model, but also the greater the difficulty in calculation; remember, too, that we must verify the significance of the coefficients we find.

I am minded at this point to remind students of Occam’s razor - the principle of parsimony. It is named after William of Occam, who believed that ‘it is vain to do with more what can be done with less’. Science should be no different. An elegant equation is more likely to be right than an ugly one that seems to fit the data. In other words, natural symmetry and honed brevity are more likely to fit the data in the end.

As such, it is unlikely that psychologists would go further than the 3rd degree when testing polynomial regression. We will also stop here.

First, then, we need to test the linear model to ascertain the fit of a straight line to the data. We know it will not fit optimally because we have plotted the points, but it nevertheless gives us a baseline model against which to compare the addition of polynomial terms. So let’s go ahead and run the linear model and save it as a new R object called ‘fit1’.

fit1 <- lm(Happiness ~ 1 + Year, data = happiness)
summary(fit1)
## 
## Call:
## lm(formula = Happiness ~ 1 + Year, data = happiness)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17568 -0.06727  0.01568  0.05489  0.18205 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.15727    0.03306 155.988   <2e-16 ***
## Year         0.02932    0.01046   2.804   0.0206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1097 on 9 degrees of freedom
## Multiple R-squared:  0.4663, Adjusted R-squared:  0.407 
## F-statistic: 7.863 on 1 and 9 DF,  p-value: 0.02057

We can see that the linear model fits the data reasonably well. Year explains about 47% of the variance in happiness, and the slope coefficient for year is positive and statistically significant (i.e., p < .05). Let’s just bootstrap those estimates, as all good statisticians should.

fit1.boot <- Boot(fit1, f=coef, R = 5000) 
## Loading required namespace: boot
summary(fit1.boot)
confint(fit1.boot, level = .95, type = "norm") 
## Bootstrap normal confidence intervals
## 
##                  2.5 %     97.5 %
## (Intercept) 5.07810437 5.21773178
## Year        0.00283033 0.05681653

And bootstrapping supports the conclusions of normal theory testing.

But the question is, can we do better? Can we explain more variance by adding powers of the predictor to model the potential for non-linear effects? For this we need to add the polynomial terms.

The 2nd degree polynomial model (quadratic shape)

The first polynomial we are going to add is the 2nd degree polynomial term, which models a quadratic shape - that is, whether the effect accelerates or decelerates over time. For this we will add a third parameter to the simple regression model, which is Year^2 (i.e., Year raised to the second power). Let’s go ahead and do that now and save the model as a new R object called ‘fit2’.

fit2 <- lm(Happiness ~ 1 + Year + I(Year^2), data = happiness) # must specify the polynomial term in I() to tell R this is polynomial (and not Year specified twice)
summary(fit2)
## 
## Call:
## lm(formula = Happiness ~ 1 + Year + I(Year^2), data = happiness)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046888 -0.018834 -0.003159  0.002040  0.086748 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.263159   0.017655 298.110  < 2e-16 ***
## Year         0.029318   0.003696   7.933 4.64e-05 ***
## I(Year^2)   -0.010589   0.001323  -8.002 4.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03876 on 8 degrees of freedom
## Multiple R-squared:  0.9407, Adjusted R-squared:  0.9259 
## F-statistic: 63.48 on 2 and 8 DF,  p-value: 1.235e-05

Okay, now we are getting somewhere! We can see that both the linear and quadratic effects are significant. The negative sign of the quadratic slope indicates that while the linear trend is positive, it decelerates over time.

We can also see that the addition of the 2nd degree polynomial has substantially increased the variance explained by the model - from 47% to 94%!

As always, let’s just check this with bootstrapping.

fit2.boot <- Boot(fit2, f=coef, R = 5000) 
summary(fit2.boot)
confint(fit2.boot, level = .95, type = "norm")
## Bootstrap normal confidence intervals
## 
##                   2.5 %       97.5 %
## (Intercept)  5.21963356  5.304766292
## Year         0.01998425  0.042930166
## I(Year^2)   -0.01456384 -0.006270443

And the bootstrapping supports the normal theory testing.
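
As a brief aside, you can fit exactly the same quadratic model with R’s poly() helper rather than typing out each power. A minimal sketch (with raw = TRUE so the coefficients match the Year + I(Year^2) specification above; the default orthogonal polynomials give an equivalent but differently parameterised model):

# a sketch: the same 2nd degree model specified with poly()
fit2_poly <- lm(Happiness ~ poly(Year, 2, raw = TRUE), data = happiness)
summary(fit2_poly) # coefficients should match fit2 above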

As we saw in LT2, it is not an especially good idea to compare models directly using R2 because R2 does not take into consideration the degrees of freedom in the models.

There is another, more direct test of whether adding our 2nd degree polynomial significantly improves the model. Rather than inspect the R2 of the models, we can calculate F for the comparison between the simple model (linear effect only) and the full model (linear and quadratic). The calculation for this F test is:

F = ((R2full - R2simple) / (dfsimple - dffull)) / ((1 - R2full) / dffull), where the dfs are the residual degrees of freedom of each model - see lecture notes for the formula.
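
As a check, here is a short sketch of that calculation done by hand in R, pulling the R2 values and residual degrees of freedom from the two fitted models (the object names are my own; the result should match the anova() comparison described below):

# a sketch: compute the model-comparison F by hand from fit1 and fit2
r2_simple <- summary(fit1)$r.squared    # R2 for the linear model (~.47)
r2_full   <- summary(fit2)$r.squared    # R2 for the quadratic model (~.94)
df_simple <- fit1$df.residual           # residual df for the linear model (9)
df_full   <- fit2$df.residual           # residual df for the quadratic model (8)

F_change <- ((r2_full - r2_simple) / (df_simple - df_full)) / ((1 - r2_full) / df_full)
F_change                                # ~64, as reported below
pf(F_change, df1 = df_simple - df_full, df2 = df_full, lower.tail = FALSE) # its p value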

But we can also request it directly in R using anova(). Let’s do that now for models fit1 and fit2.

anova(fit1, fit2)

This output shows us that the addition of the quadratic term to the linear model has reduced the residual (error) sum of squares by about .10 (.11 - .01). In the context of the 1 degree of freedom spent adding the quadratic term, the F ratio associated with this reduction in error is large and significant (64.03, p < .05). There is an improvement in fit with the quadratic term added!

The 3rd degree polynomial model (cubic shape)

Finally, we can check whether the fit is improved further by the addition of a polynomial term to the third power. To do so, we just add a fourth parameter, Year^3, to our model. Let’s do that now and save it as a new R object called ‘fit3’.

fit3 <- lm(Happiness ~ 1 + Year + I(Year^2) + I(Year^3), data = happiness) 
summary(fit3)
## 
## Call:
## lm(formula = Happiness ~ 1 + Year + I(Year^2) + I(Year^3), data = happiness)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032774 -0.014802 -0.001253  0.003199  0.072634 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.2631585  0.0150667 349.324 4.16e-16 ***
## Year         0.0143638  0.0081282   1.767   0.1205    
## I(Year^2)   -0.0105886  0.0011293  -9.376 3.27e-05 ***
## I(Year^3)    0.0008401  0.0004209   1.996   0.0861 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03308 on 7 degrees of freedom
## Multiple R-squared:  0.9622, Adjusted R-squared:  0.946 
## F-statistic: 59.44 on 3 and 7 DF,  p-value: 2.403e-05

In this model, the slopes for the linear and cubic effects are not significant, and it therefore appears that the best model is the 2nd degree polynomial. We can confirm this using bootstrapping.

fit3.boot <- Boot(fit3, f=coef, R = 5000) 
summary(fit3.boot)
confint(fit3.boot, level = .95, type = "norm")
## Bootstrap normal confidence intervals
## 
##                    2.5 %       97.5 %
## (Intercept)  5.218829443 5.3026627236
## Year        -0.006482348 0.0388273603
## I(Year^2)   -0.020408585 0.0006132732
## I(Year^3)   -0.001862377 0.0033362721

And the bootstrapping results support the normal theory testing.

Furthermore, look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. Yes, there has been an increase in the accuracy of the model, but we know that adding any variable will increase R2, if for no other reason than chance. Therefore, let’s compare the cubic model with the quadratic model using anova() to see if the addition of the cubic term was worth it.

anova(fit2, fit3)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there was no significant improvement in the model. The quadratic model with the 2nd degree polynomial is the best fit.

Plotting the polynomial

Great, we have identified the shape of the trend - it is quadratic with a deceleration!

The last thing to do now is to represent the result graphically. Helpfully, ggplot has an excellent layer called stat_smooth that plots a polynomial of whatever degree you have found to fit the data best. In our case we have found a 2nd degree polynomial to fit the data best, so we just add this layer (+) to our initial plot of the data points above (i.e., plot).

plot + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1) # stat_smooth uses the linear model to draw the fitted shape; I(x^2) specifies the 2nd degree polynomial we found to fit the data best
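
If you prefer to draw the curve from the fitted model object itself, an equivalent sketch uses predict() on fit2 over a fine grid of Year values and adds the predictions as a line (the object name newdat is my own):

# a sketch: plot the fitted quadratic from fit2 using predict()
newdat <- data.frame(Year = seq(-5, 5, by = 0.1))    # grid over the centered years
newdat$Happiness <- predict(fit2, newdata = newdat)  # fitted values from fit2
plot + geom_line(data = newdat, colour = "blue")     # add the fitted curve to the scatterplot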

And there you have it! You have just successfully conducted your first polynomial regression. Not that difficult, right?!

Correcting for range restriction

Another topic we covered in the lecture was correcting correlations for two common artefacts that can influence their estimation: range restriction and unreliability.

Let’s start with range restriction and look at how we correct correlations that come from restricted ranges. In this example, we are going to use some of my perfectionism data collected from college students. Rather than look at the relationship of perfectionism with time or country, we are going to look at its relationship with age.

Our research question is: is there a relationship between age and perfectionism in American adults?

First, let’s load this data into our R environment. Go to the LT10 folder, then the workshop folder, and find the “perfectionism.csv” file. Click on it and select “import dataset”. In the new window that appears, click “update” and then, when the data frame shows, click import. If you want, you can try running the code below and it might do the same thing (if not, put your hand up).

perfectionism <- read_csv("C:/Users/tc560/Dropbox/Work/LSE/PB130/LT10/Workshop/perfectionism.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   year = col_double(),
##   spp = col_double(),
##   age = col_double()
## )

Because college student data has a restricted range of age (i.e., between 18 and 25), we need to correct the correlation observed in college students to yield the likely correlation in the adult population as a whole. To see this, let’s just run the favstats function on the age variable.

favstats(~age, data=perfectionism)

The range of values is small, going from 18 to 25.96, with a mean of 20.59 and SD of 2.27. This is a classic scenario for range restriction correction - the predictor variable is truncated at one end of the distribution (in this case the upper end, since older adults are missing from the sample).

So let’s first calculate the correlation coefficient for age and perfectionism (variable name: spp) in the restricted sample.

rcorr(as.matrix(perfectionism[,c("spp","age")]), type="pearson")
##       spp   age
## spp  1.00 -0.08
## age -0.08  1.00
## 
## n= 159 
## 
## 
## P
##     spp    age   
## spp        0.2925
## age 0.2925

As we can see, in our restricted data there is no correlation between age and perfectionism. The correlation is -0.08, and the p value is 0.29. But we know the range is restricted, so let’s estimate the correlation in the adult population using the range restriction correction we covered in the lecture.

Doing the range restriction calculation

A correction for range restriction that is well known is Thorndike Case 2:

Let R be the unrestricted correlation, r the restricted correlation, S the unrestricted standard deviation, and s the restricted standard deviation. Then:

R = (rS / s) / sqrt(1 - r^2 + r^2 * (S^2 / s^2)).
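
As a sketch, this formula translates directly into a small R function (the name thorndike_case2 is my own); the psych package’s rangeCorrection() function, used below, implements the same correction:

# a sketch of Thorndike's Case 2 as an R function
thorndike_case2 <- function(r, S, s) {
  (r * S / s) / sqrt(1 - r^2 + r^2 * (S^2 / s^2))
}
thorndike_case2(-0.08, 2.358381, 2.271121) # SD values derived below; should match rangeCorrection()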

Estimating the population SD

For this formula to hold, we need the population standard deviation (S). We could find the population range of ages and work out a SD for the American adult population. But for the sake of this task, let’s assume that we do not have access to the population SD. I am assuming this because it is often the case that we do not know the SD in the population and thus need to estimate it somehow.

To estimate the population SD from the distributional properties of a restricted sample, Cohen (1959) proposed the ratio SD^2/(X - Xc)^2; in other words, the sample variance (SD^2) over the squared difference between the sample mean (X) and the point of truncation (Xc).

This ratio is useful because a normal distribution truncated at Xc will have a unique value of SD^2/(X - Xc)^2. Two quantities - the standard deviation of the truncated distribution in standardised form, and the z-score representing the truncation point - are tabled directly against Cohen’s ratio (see Table 1 below). These tabled values can then easily be used to correct means, standard deviations, and correlations.

Table 1. Cohen’s Ratio, Restricted SD, and z Score for Truncation Point

The sample of interest is assumed to be a random sample from a normal population truncated at some unknown point. Cohen’s ratio is calculated from the sample mean and standard deviation (X, SD) and the lowest or highest observed value (Xc), depending on where the sample is restricted. In this example, truncation is assumed to occur at the upper end of the distribution (i.e., we are missing older age groups), so Xc is the highest observed value. If truncation has occurred at the lower end of a distribution, the method is the same except that Xc is taken as the lowest observed sample value. Let’s calculate Cohen’s ratio with the available information:

Cohen’s ratio: SD^2/(X - Xc)^2

2.271121^2/(20.5984 - 25.96)^2
## [1] 0.1794286
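
Incidentally, rather than typing the numbers in by hand, you could compute the same ratio directly from the data. A minimal sketch, assuming the perfectionism data frame is loaded as above (and that age has no missing values), with truncation at the highest observed age:

# a sketch: Cohen's ratio computed from the sample statistics
s_restricted <- sd(perfectionism$age)    # restricted (sample) SD
x_bar        <- mean(perfectionism$age)  # sample mean
x_c          <- max(perfectionism$age)   # truncation point (highest observed age)
s_restricted^2 / (x_bar - x_c)^2         # ~0.18, matching the hand calculation above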

Once Cohen’s ratio has been calculated, the closest tabled value (Table 1) is located. The corresponding z-cut, that is, the z-score identifying the point of truncation, and the standard deviation in a normal distribution truncated at that point may then be obtained from Table 1. Here we can see that a Cohen’s ratio of 0.18 corresponds with a truncated SD of 0.963 and a z score of -2.25. Since the standard deviation in Table 1 is the standardized value of the standard deviation after truncation, that tabled value also represents the proportional reduction in the standard deviation due to range restriction. The observed sample standard deviation may be corrected by:

population SD = restricted SD / truncated SD

2.271121/.963
## [1] 2.358381

And there we have it - the estimated SD in the population! Incidentally, we can also estimate the corrected mean using this information. All we have to do is take the truncation point - in this case the highest score, 25.96 - and subtract from it the z-score multiplied by the restricted SD (because the z-score is negative, this pushes the estimated mean upwards).

25.96 - (-2.25*2.271121)
## [1] 31.07002

When we do so, we get an unrestricted mean age of 31. This is still a slight underestimate given what we know about the mean age of the US adult population (approximately 38 years), but it is nevertheless closer than our original sample suggests.

Correcting the correlation

Now we have an estimated population SD of 2.36, let’s correct our correlation of -0.08 for range restriction. To do so, we can use the psych package, which implements Thorndike’s Case 2 via the rangeCorrection() function.

rangeCorrection(-0.08,2.358381,2.271121,case=2) # this function takes the correlation, followed by the unrestricted (population) SD, followed by the restricted (sample) SD, and case=2 stipulates Thorndike's Case 2 correction
## [1] -0.08305291

And here we can see that the corrected correlation is more or less the same as the uncorrected correlation. This is because the estimated population SD is quite close to the sample SD. If we increase the population SD to 4 (and, for illustration, use a slightly larger observed correlation of -0.10), you can see that the gap between the corrected and uncorrected correlation widens.

rangeCorrection(-0.10,4,2.271121,case=2)
## [1] -0.1743021

This is what happens when we correct correlations for range restriction - the bigger the difference between the population and sample variances, the larger the difference between the uncorrected and corrected correlation.

Correcting for unreliability

The second correction often applied to correlations is the correction for unreliability. As we saw in the lecture and in LT8, unreliability arises when a measure contains error. All measures contain some error, but some contain more than others. We can estimate how much error is contained in a measure by calculating the ratio of true score variance to overall variance in a variable.

To illustrate this, I want to return to the personality data we collected in LT8 and correct the correlation between neuroticism and openness for unreliability.

First, let’s load this data into our R environment. Go to the LT10 folder, then the workshop folder, and find the “bigfive.csv” file. Click on it and select “import dataset”. In the new window that appears, click “update” and then, when the data frame shows, click import. If you want, you can try running the code below and it might do the same thing (if not, put your hand up).

bigfive <- read_csv("C:/Users/tc560/Dropbox/Work/LSE/PB130/LT10/Workshop/bigfive.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Event Index` = col_character(),
##   `UTC Date` = col_character(),
##   `Local Date` = col_character(),
##   `Tree Node Key` = col_character(),
##   `Repeat Key` = col_logical(),
##   `Participant Public ID` = col_character(),
##   `Participant Starting Group` = col_logical(),
##   `Participant Status` = col_character(),
##   `Participant Completion Code` = col_logical(),
##   `Participant External Session ID` = col_logical(),
##   `Participant Device Type` = col_character(),
##   `Participant Device` = col_character(),
##   `Participant OS` = col_character(),
##   `Participant Browser` = col_character(),
##   `Participant Monitor Size` = col_character(),
##   `Participant Viewport Size` = col_character(),
##   Checkpoint = col_logical(),
##   `Task Name` = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 1 parsing failure.
## row col    expected    actual                                                              file
##  15  -- 115 columns 1 columns 'C:/Users/tc560/Dropbox/Work/LSE/PB130/LT10/Workshop/bigfive.csv'

Before we work on this, we need to run some preliminaries to clean it up. Let’s do that now.

# first filter out NAs
bigfive <- 
bigfive %>% 
filter(e1 >= "1")

# note I specify car:: here because both car and dplyr (loaded with the tidyverse) have a recode function and I want R to use the car version
bigfive$a1 <- car::recode(bigfive$a1r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$c2 <- car::recode(bigfive$c2r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$e2 <- car::recode(bigfive$e2r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$n2 <- car::recode(bigfive$n2r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$a3 <- car::recode(bigfive$a3r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$c4 <- car::recode(bigfive$c4r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$e5 <- car::recode(bigfive$e5r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$c5 <- car::recode(bigfive$c5r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$n5 <- car::recode(bigfive$n5r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$a6 <- car::recode(bigfive$a6r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$e7 <- car::recode(bigfive$e7r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$n7 <- car::recode(bigfive$n7r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$o7 <- car::recode(bigfive$o7r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$a8 <- car::recode(bigfive$a8r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$o9 <- car::recode(bigfive$o9r, "1=5; 2=4; 3=3; 4=2; 5=1")
bigfive$c9 <- car::recode(bigfive$c9r, "1=5; 2=4; 3=3; 4=2; 5=1")

# select out the items

bigfive <- select(bigfive, e1, e2, e3, e4, e5, e6, e7, e8, c1, c2, c3, c4, c5, c6, c7, c8, c9, o1, o2, o3, o4, o5, o6, o7, o8, o9, o10, a1, a2, a3, a4, a5, a6, a7, a8, a9, n1, n2, n3, n4, n5, n6, n7, n8)

# calculate variables

bigfive <- bigfive %>%
  mutate(neuroticism = (n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8)/8)
bigfive <- bigfive %>%
  mutate(openness = (o1 + o2 + o3 + o4 + o5 + o6 + o7 + o8 + o9 + o10)/10)

Okay, great - we have the data in good shape. Just as for range restriction, let’s first calculate the uncorrected correlation between neuroticism and openness.

rcorr(as.matrix(bigfive[,c("neuroticism","openness")]), type="pearson")
##             neuroticism openness
## neuroticism        1.00     0.43
## openness           0.43     1.00
## 
## n= 14 
## 
## 
## P
##             neuroticism openness
## neuroticism             0.124   
## openness    0.124

We can see that the uncorrected correlation is 0.43, but it is not significant (p > .05). Don’t worry too much about that - we have a small sample of only 14. Let’s focus on the correlation estimate and how much it has been attenuated by measurement unreliability.

Calculating reliability

The first step in correcting the correlation for unreliability is to calculate the reliability of each variable. If you recall, neuroticism has 8 items and openness has 10. We first collate these items and then calculate Cronbach’s alpha. We did this by hand in the lecture, and by using the alpha() function from the psych package in the workshop. So let’s go ahead and do that now.

neuroticism <- select(bigfive, n1, n2, n3, n4, n5, n6, n7, n8)
openness <- select(bigfive, o1, o2, o3, o4, o5, o6, o7, o8, o9, o10)

alpha(neuroticism)
## 
## Reliability analysis   
## Call: alpha(x = neuroticism)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
##       0.83      0.83    0.93      0.37 4.8 0.064  2.9 0.71     0.41
## 
##  lower alpha upper     95% confidence boundaries
## 0.7 0.83 0.95 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## n1      0.82      0.82    0.89      0.40 4.6    0.068 0.078  0.48
## n2      0.81      0.81    0.88      0.38 4.4    0.068 0.087  0.48
## n3      0.80      0.80    0.91      0.36 3.9    0.076 0.083  0.39
## n4      0.76      0.76    0.84      0.31 3.2    0.093 0.068  0.34
## n5      0.79      0.79    0.91      0.34 3.7    0.083 0.092  0.34
## n6      0.87      0.87    0.94      0.49 6.7    0.051 0.041  0.54
## n7      0.81      0.80    0.91      0.37 4.1    0.073 0.088  0.43
## n8      0.79      0.79    0.87      0.35 3.7    0.078 0.065  0.36
## 
##  Item statistics 
##     n raw.r std.r r.cor r.drop mean   sd
## n1 14  0.57  0.59  0.58  0.449  2.1 0.92
## n2 14  0.64  0.63  0.64  0.516  2.8 1.05
## n3 14  0.74  0.74  0.71  0.645  3.1 0.95
## n4 14  0.91  0.92  0.95  0.860  3.4 1.22
## n5 14  0.82  0.80  0.75  0.702  2.9 1.41
## n6 14  0.24  0.23  0.11  0.068  2.6 1.02
## n7 14  0.68  0.69  0.65  0.572  3.1 0.92
## n8 14  0.78  0.79  0.80  0.706  3.1 0.86
## 
## Non missing response frequency for each item
##       1    2    3    4    5 miss
## n1 0.29 0.43 0.21 0.07 0.00    0
## n2 0.00 0.57 0.14 0.21 0.07    0
## n3 0.00 0.29 0.36 0.29 0.07    0
## n4 0.00 0.36 0.07 0.36 0.21    0
## n5 0.29 0.00 0.43 0.14 0.14    0
## n6 0.07 0.50 0.29 0.07 0.07    0
## n7 0.00 0.29 0.43 0.21 0.07    0
## n8 0.00 0.21 0.50 0.21 0.07    0
alpha(openness)
## Warning in alpha(openness): Some items were negatively correlated with the total scale and probably 
## should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( o6 o9 o10 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## 
## Reliability analysis   
## Call: alpha(x = openness)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N  ase mean   sd median_r
##       0.55      0.62     0.9      0.14 1.7 0.18  3.6 0.46     0.12
## 
##  lower alpha upper     95% confidence boundaries
## 0.2 0.55 0.9 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## o1       0.46      0.53    0.88     0.113 1.14     0.21  0.14 0.095
## o2       0.57      0.65    0.89     0.172 1.87     0.17  0.15 0.164
## o3       0.45      0.53    0.88     0.110 1.11     0.22  0.14 0.095
## o4       0.41      0.53    0.85     0.109 1.11     0.23  0.15 0.095
## o5       0.38      0.48    0.83     0.095 0.94     0.25  0.14 0.087
## o6       0.66      0.72    0.91     0.224 2.59     0.14  0.12 0.229
## o7       0.60      0.64    0.87     0.165 1.78     0.16  0.13 0.115
## o8       0.41      0.46    0.83     0.087 0.86     0.23  0.13 0.071
## o9       0.61      0.67    0.89     0.182 2.01     0.16  0.15 0.229
## o10      0.56      0.64    0.90     0.164 1.77     0.16  0.16 0.185
## 
##  Item statistics 
##      n  raw.r std.r r.cor r.drop mean   sd
## o1  14  0.637  0.70  0.68  0.506  3.6 0.85
## o2  14  0.149  0.25  0.21 -0.013  4.4 0.74
## o3  14  0.661  0.72  0.70  0.502  4.0 1.04
## o4  14  0.753  0.72  0.73  0.629  3.6 1.01
## o5  14  0.820  0.84  0.85  0.728  3.2 0.97
## o6  14 -0.027 -0.14 -0.21 -0.261  3.8 1.12
## o7  14  0.202  0.30  0.29 -0.032  3.1 1.07
## o8  14  0.846  0.89  0.92  0.790  4.1 0.73
## o9  14  0.237  0.17  0.14 -0.026  2.9 1.21
## o10 14  0.437  0.31  0.23  0.147  3.4 1.40
## 
## Non missing response frequency for each item
##        1    2    3    4    5 miss
## o1  0.00 0.07 0.43 0.36 0.14    0
## o2  0.00 0.00 0.14 0.36 0.50    0
## o3  0.00 0.07 0.29 0.21 0.43    0
## o4  0.00 0.07 0.50 0.14 0.29    0
## o5  0.00 0.29 0.29 0.36 0.07    0
## o6  0.00 0.14 0.29 0.21 0.36    0
## o7  0.07 0.21 0.36 0.29 0.07    0
## o8  0.00 0.00 0.21 0.50 0.29    0
## o9  0.07 0.36 0.29 0.14 0.14    0
## o10 0.14 0.14 0.07 0.43 0.21    0

The first and most important output tells us that the neuroticism variable has a Cronbach’s alpha of .83 and the openness variable a Cronbach’s alpha of .55.

Correcting the correlation

With this information we can apply a simple correction to the correlation between these variables, which accounts for the measurement error (unreliability) inherent in the two variables. Recall from the lecture that the calculation for correcting a correlation for measurement error is:

rxy / sqrt(rxx * ryy),

where rxy is the uncorrected correlation, rxx is the Cronbach’s alpha for x, and ryy is the Cronbach’s alpha for y.

Plugging these numbers into R:

.43 / sqrt(.83*.55)
## [1] 0.6364262

We find that the corrected correlation is .64. That’s quite an increase! This is because the openness variable has poor reliability and therefore contains a lot of error that is irrelevant to neuroticism. Let’s see what would happen if openness contained less error and, for instance, had a Cronbach’s alpha of .90.

.43 / sqrt(.83*.90)
## [1] 0.4975173

A much smaller difference. So this is what I mean when I say unreliability attenuates relationships: the more error there is in a measure, the smaller the observed relationship will tend to be. It is therefore important to correct correlations for measurement error when we create variables by aggregating items from psychological tools. And now you know how to!
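
If you find yourself applying this correction often, you could wrap the formula in a small helper function. A minimal sketch (the function name is my own):

# a sketch: the attenuation correction as a reusable helper
correct_attenuation <- function(rxy, rxx, ryy) {
  rxy / sqrt(rxx * ryy)  # observed correlation divided by the square root of the product of the reliabilities
}
correct_attenuation(.43, .83, .55) # ~.64, as above
correct_attenuation(.43, .83, .90) # ~.50, the hypothetical more reliable openness measure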

Activity

For this activity, I want you to have a go at polynomial regression using the perfectionism dataset. To date, we have looked at the linear effect of perfectionism over time, but perhaps the trend is quadratic, or maybe cubic. Have a go at finding out.

Task 1 - Plot the relationship

Use the chunk code below to plot the points for the relationship between year and spp in the perfectionism dataframe.

plot1 <- ggplot(perfectionism, aes(x = ??, y = ??)) + geom_point() + theme_classic()
plot1

Describe the shape of this trend:

Task 2 - Mean center the predictor variable

Before we look at this relationship, we must center the predictor (year) at the mean. First, let’s calculate the mean for year.

mean(perfectionism$year)

And then subtract this mean from each observed year:

perfectionism$year <- (perfectionism$year-??)
perfectionism

Task 3 - Test the linear model

First we need to test the linear model to ascertain whether there is a linear trend. Let’s build the linear model now and save it as a new R object called per1.

per1 <- lm(spp ~ ??, data = perfectionism) 
summary(per1)

What is the R2 associated with this model?

What is the beta for year?

Is the beta significant?

Task 4 - Test the quadratic model

Next we need to test the quadratic model, with year raised to the 2nd power. To do this we just add year^2 to our linear model. Let’s do that now and save it as a new R object called per2.

per2 <- lm(spp ~ ??, data = perfectionism)
summary(per2)

What is the R2 associated with this model?

What is the beta for the quadratic effect?

Is the beta significant?

Now let’s see whether the introduction of the quadratic term improves the linear model using anova()..

anova(??, ??)

Does the addition of the quadratic term improve the model?

Why?

Test the cubic model

Finally, we need to test the cubic model, with year raised to the 3rd power. To do this we just add year^3 to our quadratic model. Let’s do that now and save it as a new R object called per3.

per3 <- lm(spp ~ ??, data = perfectionism) 
summary(per3)

What is the R2 associated with this model?

What is the beta for the cubic effect?

Is the beta significant?

Now let’s see whether the introduction of the cubic term improves the quadratic model using anova()..

anova(??, ??)

Does the addition of the cubic term improve the model?

Why?

Task 5 - Plot the trend

To visualise this relationship, draw a figure that fits the best fitting shape to the data (i.e., linear, quadratic, or cubic). You can use the original plot1 from Task 1 and add the stat_smooth layer with the best fitting polynomial model in the formula.

plot1 + stat_smooth(method = "lm", formula = y ~ x + ?? + ??, size = 1)