We will begin by introducing and observing the data set. We will then proceed to the chapters of correlation of numerical variables and association between categorical variables before arriving at conclusions.
Let us now define two research questions. The first will be answered with correlation analysis; the second with Pearson Chi2 test for association between 2 categorical variables.
Research question 1: We want to check for correlation between SATV scores and SATQ scores. Are SAT Verbal and SAT Quantitative scores correlated?
Research question 2: We want to check association between gender and education (the latter is measured on a scale from 1 to 5 where 1 stands for “high school” and 5 stands for “graduate work”. Is the self-reported degree of education related to gender?
library(psych)
mydata <- force(sat.act)
colnames(mydata) <- c("Gender", "Education", "Age", "ACT", "SATV", "SATQ") #changing column names to be capitalized because it bothers me
head(mydata)
## Gender Education Age ACT SATV SATQ
## 29442 2 3 19 24 500 500
## 29457 2 3 23 35 600 500
## 29498 2 3 20 21 480 470
## 29503 1 4 27 26 550 520
## 29504 1 2 33 31 600 550
## 29518 1 5 26 28 640 640
Explanations of the data set:
Gender: A representation of a respective person’s gender - 1 for male and 2 for female.
Education: A representation of a respective person’s self-reported level of education - 1 stands for high school and 5 stands for graduate work. The interim levels are not specified, so we will leave them numbered, because we do not know their representation and the precise intervals between them.
Age: A measurement of the respective person’s age.
ACT score: A measurement of the respective person’s composite American College Test (hereafter: ACT) score. The scores can range from 1-36; national norms have a mean of 20.
SATV score: A measurement of the respective person’s SAT verbal score. These scores can range from 200-800.
SATQ score: A measurement of the respective person’s SAT quantitative score. These scores can also range from 200-800.
The unit of observation is a subject; a person.
The sample size of our data encompasses 700 units along the 6 variables delineated above (I did not count ID as a specific variable).
Explanations of the variables:
The first two variables measure a person’s gender and self-reported level of education. Both of these variables are categorical. Gender is a nominal scale variable and education is an ordinal scale variable (but we do not know specific ordinal brackets, because they are not specified beyond the scores of 1 and 5).
The third variable is age. It is a numerical variable that measures a person’s age.
The fourth variable measures a person’s ACT composite score on a scale from 1-36. It is a numerical variable.
The fifth and sixth variables measure a person’s SATV and SATQ scores. Both are numerical variables whose scores can range from 200-800.
Following is the data source citation:
Revelle, William, Wilt, Joshua, and Rosenthal, Allen (2009) Personality and Cognition: The Personality-Cognition Link. In Gruszka, Alexandra and Matthews, Gerald and Szymura, Blazej (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.
Finally, before proceeding, I will factor the variables of Gender and Education.
mydata$Gender <- factor(mydata$Gender,
levels = c(1, 2),
labels = c("Male", "Female"))
mydata$Education <- factor(mydata$Education,
levels = c(1, 2, 3, 4, 5),
labels = c(1, 2, 3, 4, 5))
head(mydata)
## Gender Education Age ACT SATV SATQ
## 29442 Female 3 19 24 500 500
## 29457 Female 3 23 35 600 500
## 29498 Female 3 20 21 480 470
## 29503 Male 4 27 26 550 520
## 29504 Male 2 33 31 600 550
## 29518 Male 5 26 28 640 640
Before continuing with the respective chapters of correlation and association of two categorical variables, let’s observe some descriptive statistics.
summary(mydata)
## Gender Education Age ACT SATV SATQ
## Male :247 1 : 45 Min. :13.00 Min. : 3.00 Min. :200.0 Min. :200.0
## Female:453 2 : 44 1st Qu.:19.00 1st Qu.:25.00 1st Qu.:550.0 1st Qu.:530.0
## 3 :275 Median :22.00 Median :29.00 Median :620.0 Median :620.0
## 4 :138 Mean :25.59 Mean :28.55 Mean :612.2 Mean :610.2
## 5 :141 3rd Qu.:29.00 3rd Qu.:32.00 3rd Qu.:700.0 3rd Qu.:700.0
## NA's: 57 Max. :65.00 Max. :36.00 Max. :800.0 Max. :800.0
## NA's :13
We can observe 57 NA values with education - this is because factoring categorical variables assigns specific outcomes, and so the unassigned outcomes end up being NA. In our case, some units had education of “0”, while it is measured on a 1-5 scale (probably because they didn’t answer) and as 0 was not factored, it was assigned NA. I could drop the NA units, but I don’t really have to because we’re not doing regression here; plus, they will get automatically dropped later on as we will see in the chapter of association between 2 categorical variables. There are also 13 NA units in SATQ, we will adjust for this when doing correlation by using the “complete.obs” argument, which will effectively “drop” them from comparison by observing correlation between units that have complete observations only.
Explaining some descriptive statistics:
“Gender”: We have 247 males and 453 females in our sample (for the total sample size of 700; we did not drop NA values)
max of “Age”: The oldest person(s) in our sample is (are) 65 years old,
Mean of SATV: The respondents in our sample scored on average 612,2 points in their SATV test.
We will now conduct correlation analysis for two of the numerical variables (the last four variables; last four columns).
Our first research question was the following: “Are SAT Verbal and SAT Quantitative scores correlated?”.
Let’s begin by plotting distributions of each of the 2 variables and scatterplots to observe their correlation.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[ , c(-1, -2, -3, -4)], smooth=FALSE)
scatterplot(mydata$SATV ~ mydata$SATQ,
smooth = FALSE,
boxplots = FALSE,
ylab = "Verbal SAT score",
xlab = "Quantitative SAT score")
The distributions of SATV and SATQ do not appear normal. Both distributions are left-skewed. Looking at scatterplots, I cannot perceive any significant outliers I would like to remove from observation. I am generally not in habit of removing outliers when it comes to correlations because it can affect results significantly. Besides, units that more than slightly fall outside the correlation line are not really outliers so long as they follow the “logic” of the line. E.g. in the scatterplots above, we can observe correlation, in the most crude and simplistic words, “go from bottom left to top right”. More accurately put, units appear to be correlated when both SAT tests yield either low or high scores.
Let’s take one of the scatterplots for visualization - I will observe the one where SATV is on the y-axis and SATQ is on the x-axis (the top right scatterplot, which I have “zoomed in on” by drawing it separately). In it, we can see a unit in the bottom left corner below the regression line - it represents a person who had truly low scores of both SAT tests. But that still follows the logic of the trend we can see in the graph. I would consider this an outlier only if, for example, the unit had been in the top left corner where a person would have a huge SATV score but a negligible SATQ score. Then I might consider it as a measurement or data error. But, as it is, I cannot denote units than falls more-than-slightly off the line as outliers and remove then, because they still tell a story. Such people exist in the population and I have no basis upon which to presume they do not; that it could be just a measurement/data entry error.
Proceeding with our correlation analysis, let’s use the “rcorr” function. Because we suspect the distribution of the variables is not normal looking at the graphs above, let’s do a Shapiro-Wilks test to confirm.
shapiro.test(mydata$SATV)
##
## Shapiro-Wilk normality test
##
## data: mydata$SATV
## W = 0.96708, p-value = 1.934e-11
shapiro.test(mydata$SATQ)
##
## Shapiro-Wilk normality test
##
## data: mydata$SATQ
## W = 0.96627, p-value = 1.781e-11
With both of the Shapiro-Wilk tests above, our hypotheses are as follows:
H0: The respective variable is normally distributed
H1: The respective variable is not normally distributed.
For both SATV and SATQ scores we can reject the null hypothesis and conclude that neither of the variables is normally distributed at statistical significance (p<0.001).
Because the distributions of variables are not normal, we will use Spearman correlation.
cor(mydata$SATV, mydata$SATQ,
method = "spearman",
use = "complete.obs")
## [1] 0.6103223
We can see that there is a semi-strong positive linear relationship between SAT Verbal scores and SAT Quantitative scores. We needed to use “complete.obs” to specify observation of only those units with complete responses; as we saw with descriptive statistics above, we have 13 NA values with SATQ scores and without accounting for complete observations only we cannot observe correlation, for we cannot account for those units missing a value - 13 units would have SATV scores but not SATQ scores; we cannot correlate something with nothing.
cor.test(mydata$SATV, mydata$SATQ,
method = "spearman",
use = "complete.obs")
## Warning in cor.test.default(mydata$SATV, mydata$SATQ, method = "spearman", : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata$SATV and mydata$SATQ
## S = 21058311, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6103223
With this test, our hypotheses are as follows:
H0: Rho of SATV and SATQ is equal to 0.
H0: Rho of SATV and SATQ is not equal to 0.
We can reject the null hypothesis and conclude that Rho of SAT Verbal scores and SAT Quantitative scores differs from 0 at statistical significance; there is correlation between the variables (p<0.001).
Our first research question was: Are SAT Verbal and SAT Quantitative scores correlated?”. We can claim that there is correlation between SAT Verbal and SAT Quantitative scores (p<0.001). There is a semi-strong positive linear relationship.
Our second research question was the following: Is the self-reported degree of education related to gender?
We will answer this research question with the Pearson Chi2 test.
Firstly, we should note that we have 57 unanswered observations of the “Education” variable (NA). So when checking our frequencies, the total sum of all observations will be 643, because 57 people did not answer about their Education.
Finally, before proceeding with the Pearson CHi2 test, we need to check its assumptions. They are the following:
Observations are independent (categories do not mix). This is true; each person was either male or female and made a single choice about their self-reported level of education.
All expected frequencies are greater than 5 for all combinations.
In larger contingency tables like ours, up to 20% of expected frequencies can be between 1 and 5, but can never be below 1.
None of the assumptions are violated with our data set. This is not visible immediately, but will be seen below where we will display expected frequencies and see that they are all above 5. It makes sense, after all, given how large our sample size is (technically 643 for this test; 57 unanswered).
results <- chisq.test(mydata$Gender, mydata$Education,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydata$Gender and mydata$Education
## X-squared = 12.294, df = 4, p-value = 0.0153
The hypotheses for Pearon’s Chi2 test are as follows:
H0: There are no associations between the two categorical variables.
H1: There is association between the two categorical variables.
We can reject the null hypothesis and conclude that there is association between the categorical variables of Education and Gender (p=0.016).
Let’s now display our observed and expected frequencies.
addmargins(results$observed)
## mydata$Education
## mydata$Gender 1 2 3 4 5 Sum
## Male 20 23 80 51 46 220
## Female 25 21 195 87 95 423
## Sum 45 44 275 138 141 643
In the table above, we can see observed frequencies. This is simply a display of our responses. For example - 44 of respondents self reported their level of education as “2”; 23 of these 44 were male and 21 of them were female.
As we can see, the total sum is 643 due to missing data - we mentioned this before.
Let’s now observe expected frequencies.
addmargins(round(results$expected, 2))
## mydata$Education
## mydata$Gender 1 2 3 4 5 Sum
## Male 15.4 15.05 94.09 47.22 48.24 220
## Female 29.6 28.95 180.91 90.78 92.76 423
## Sum 45.0 44.00 275.00 138.00 141.00 643
In the table above, we have expected frequencies rounded to 2 decimal points. I double-checked some of them by hand to make sure we have correct data. These confirm our assumptions above - all of the expected frequencies are greater than 5.
Just as an example - let’s observe the expected frequency of male respondents who selected option “1” as self reported education. How have we arrived at 15.4? We simply calculated the expected probability based on how our sample is built and multiplied it with the total number of our respondents; 643. We can check observed frequencies above to see that we have 220 male and 243 female respondents who answered about education. Furthermore, 45 people answered with “1”, 44 with “2”, 275 with “3”, 138 with “4”, and 141 with “5” as their levels of self reported education. If we want expected value for male respondents who answered with “1”, we multiply the probability of a respondent being male with the probability that a respondent answered with “1”; because the variables are independent, we can do probability this way. As such, our calculation ends up being the probability of a male respondent (220/643) times the probability of a respondents answering “1” (45/643). Multiplying these fractions gives up the expected probability of 0,02394 (rounding) which, when multiplied by our total number of respondents who answered (643) yields the expected frequency rounded to 2 decimal places; 15.40.
R basically does the above for each of the combinations of gender and responses and yields the table.
In continuation, we are checking to see whether differences between observed and expected frequencies within categories prove significant. We will see whether these differences are statistically significant by displaying standardized residuals below and comparing them to the extreme values of the z-distribution.
round(results$res, 2)
## mydata$Education
## mydata$Gender 1 2 3 4 5
## Male 1.17 2.05 -1.45 0.55 -0.32
## Female -0.85 -1.48 1.05 -0.40 0.23
In the table above, we can see observed and expected frequencies morphed into standardized residuals. The first boundary for extreme values is at 1,96, where alpha is a=0.05. Consequently, we can see that only the category of (male, “2”) falls outside the critical zone by z-distribution. We conclude that there is more than expected people in category (male, “2”) at alpha a=0.05.
We can see that there is more than expected males who self-reported education as “2” at statistical significance; therefore, gender had an effect on self-reported education (a=0.05). This affirmatively answers our second research question, which stated: “Is the self-reported degree of education related to gender?”
We will now show the three possible proportion tables. They will offer us varying interpretations we will use to support our claim of effect of gender of self-reported education above. Each table is explained slightly differently, and we will explain a result or two in each to show the logic behind it.
addmargins(round(prop.table(results$observed), 3))
## mydata$Education
## mydata$Gender 1 2 3 4 5 Sum
## Male 0.031 0.036 0.124 0.079 0.072 0.342
## Female 0.039 0.033 0.303 0.135 0.148 0.658
## Sum 0.070 0.069 0.427 0.214 0.220 1.000
In the first table above, probabilities are divergent by all observed units. As such, we can say that 3,6% of all respondents were males who selected “2” as their level of self-reported education. Similarly, we can say that 3,3% of all respondents were females who selected “2”. This table does not really tell us much, but the following two tables tend to be much more informative in line with answering our research question either way.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$Education
## mydata$Gender 1 2 3 4 5 Sum
## Male 0.091 0.105 0.364 0.232 0.209 1.001
## Female 0.059 0.050 0.461 0.206 0.225 1.001
In the second table above, probabilities are divergent by gender so as to all given responses for a gender sum up to the probability of 1. Thus, we can say that 10.5% of male respondents selected “2” as their level of self-reported education, while only 5% of female respondents did so.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$Education
## mydata$Gender 1 2 3 4 5
## Male 0.444 0.523 0.291 0.370 0.326
## Female 0.556 0.477 0.709 0.630 0.674
## Sum 1.000 1.000 1.000 1.000 1.000
In the last table above, probabilities are divergent by the respective self-reported value of degree of education. From this table, we can see that out of all people who self-reported their level of education as “2”, 52,3% were male and 47,7% were female.
To draw a conclusion, we can combine the story that the latter two tables tell us - we have seen that when comparing by gender, there were more individuals among men who reported education as “2” than among women. And when comparing by the degree of self-reported education, more men have selected “2” than women. This serves as a notion of support to the conclusion we derived with standardized residuals, where we saw that there is more than expected males who self-reported education as “2” and concluded that gender had effect on self-reported education. There were more men than women who had lower than the middle value of education. We can supplement the answer to our research question by saying that men have worse education than women.
Finally, let’s look at Cramer’s V statistic as a measure of effect size. We will not take a look at odds ratios, because our table is larger than 2x2.
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$Gender, mydata$Education)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.11 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.11)
## [1] "small"
## (Rules: funder2019)
While we claim association between gender and self-reported level of education (a=0.05), and observe that men have worse education than women, we can see that the size of discrepancies is tiny.