We will begin by defining our research question: “How is performance in ACT (American College Test) affected by age, gender, education, and performance in SAT tests?”
Next, we will introduce and describe the data set. Since this is just an exercise, rather than grounding the model in formal theory we will go through the intuitive logic behind the variables included in the model (are the expected impacts positive or negative, and are these supposed effects sensible?).
After that, we will check all assumptions and practical requirements of regression and see whether they are met, correcting where they are not.
My reasoning in setting up the regression is the following: because SATV and SATQ both refer to SAT scores (verbal and quantitative, respectively), I will not model one as a function of the other. Rather, I will use both as explanatory variables for a different test, the ACT (American College Test), in my regression. It could very well happen that SATV and SATQ are too strongly correlated and I will have to drop one of them due to multicollinearity; we will see.
#data()                                          #lists the data sets in attached packages
#data(package = .packages(all.available = TRUE)) #lists the data sets in all installed packages
library(psych)
mydata <- force(sat.act) #copying the lazily loaded sat.act data set into mydata
colnames(mydata) <- c("Gender", "Education", "Age", "ACT", "SATV", "SATQ") #changing column names to be capitalized because it bothers me
head(mydata, 6)
## Gender Education Age ACT SATV SATQ
## 29442 2 3 19 24 500 500
## 29457 2 3 23 35 600 500
## 29498 2 3 20 21 480 470
## 29503 1 4 27 26 550 520
## 29504 1 2 33 31 600 550
## 29518 1 5 26 28 640 640
Explanations of the data set:
Gender: A representation of a respective person’s gender - 1 for male and 2 for female.
Education: A representation of a respective person’s self-reported level of education - 1 stands for high school and 5 stands for graduate work. The interim levels are not specified, so we will leave them numbered, because we do not know their representation and the precise intervals between them.
Age: A measurement of the respective person’s age.
ACT score: A measurement of the respective person’s composite American College Test (hereafter: ACT) score. The scores can range from 1-36; national norms have a mean of 20.
SATV score: A measurement of the respective person’s SAT verbal score. These scores can range from 200-800.
SATQ score: A measurement of the respective person’s SAT quantitative score. These scores can also range from 200-800.
The unit of observation is a subject; a person.
Our data set encompasses 700 units measured on the 6 variables delineated above (I do not count the row identifier as a variable).
Explanations of the variables:
The first two variables record a person's gender and self-reported level of education. Both are categorical: gender is a nominal variable and education is an ordinal variable (but we do not know the meaning of the intermediate levels, since only the values 1 and 5 are described).
The third variable is age. It is a numerical variable that measures a person’s age.
The fourth variable measures a person’s ACT composite score on a scale from 1-36. It is a numerical variable.
The fifth and sixth variables measure a person’s SATV and SATQ scores. Both are numerical variables whose scores can range from 200-800.
Following is the data source citation:
Revelle, William, Wilt, Joshua, and Rosenthal, Allen (2009) Personality and Cognition: The Personality-Cognition Link. In Gruszka, Alexandra and Matthews, Gerald and Szymura, Blazej (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.
Before doing anything else, I will add ID as a variable, in case I need to drop some units later, for example due to their Cook's distance values.
mydata$ID <- seq(1, nrow(mydata))
mydata <- mydata[c("ID", "Gender", "Education", "Age", "ACT", "SATV", "SATQ")]
head(mydata, 6)
## ID Gender Education Age ACT SATV SATQ
## 29442 1 2 3 19 24 500 500
## 29457 2 2 3 23 35 600 500
## 29498 3 2 3 20 21 480 470
## 29503 4 1 4 27 26 550 520
## 29504 5 1 2 33 31 600 550
## 29518 6 1 5 26 28 640 640
Because the regression will include the categorical variable Education, I have to check which of its five levels of self-reported education is the most common and use that level as the reference category. To do that, let's check its distribution below.
table(mydata$Education)
##
## 0 1 2 3 4 5
## 57 45 44 275 138 141
As we can see, the most commonly selected level of education was "3". When factoring the categorical variables, we will make sure to put "3" in first place so that it is taken as the reference category. Consequently, four dummy variables will be created for the other four values, each compared to "3". Some responses (57) have the value 0, which is not a valid score on the 1-5 scale; when we factor, those will be converted to NA automatically.
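As a side note, an equivalent way to set the reference category would be to factor first and then call relevel(); a minimal sketch, not run here, since below we set the level order explicitly inside factor():
#equivalent sketch (not run): values outside 1-5 become NA, then "3" is set as the reference level
mydata$Education <- relevel(factor(mydata$Education, levels = 1:5), ref = "3")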
Finally, I will factor the variables of Gender and Education.
mydata$Gender <- factor(mydata$Gender,
levels = c(1, 2),
labels = c("Male", "Female"))
mydata$Education <- factor(mydata$Education,
levels = c(3, 1, 2, 4, 5),
labels = c(3, 1, 2, 4, 5))
head(mydata)
## ID Gender Education Age ACT SATV SATQ
## 29442 1 Female 3 19 24 500 500
## 29457 2 Female 3 23 35 600 500
## 29498 3 Female 3 20 21 480 470
## 29503 4 Male 4 27 26 550 520
## 29504 5 Male 2 33 31 600 550
## 29518 6 Male 5 26 28 640 640
Factoring Education converted the invalid 0 values into NAs; together with the missing SATQ values already present in the data, we will now drop all rows containing NAs from our data set to avoid complications later on.
library(tidyr)
mydata <- drop_na(mydata)
With the NA rows dropped, we will now proceed to defining and fitting our regression. After that, we will check the assumptions and fix issues where necessary.
Finally, let’s display some descriptive statistics.
summary(mydata[ , -1])
## Gender Education Age ACT SATV
## Male :218 3:269 Min. :17.00 Min. : 3.00 Min. :200.0
## Female:413 1: 43 1st Qu.:19.00 1st Qu.:26.00 1st Qu.:550.0
## 2: 43 Median :23.00 Median :29.00 Median :620.0
## 4:137 Mean :26.41 Mean :28.64 Mean :611.8
## 5:139 3rd Qu.:30.00 3rd Qu.:32.00 3rd Qu.:700.0
## Max. :65.00 Max. :36.00 Max. :800.0
## SATQ
## Min. :200.0
## 1st Qu.:530.0
## Median :620.0
## Mean :609.3
## 3rd Qu.:700.0
## Max. :800.0
Explaining some descriptive statistics:
“Gender”: We have 218 males and 413 females in our sample (after dropping NA values, our total sample size is 631 units),
min. of “Age”: The youngest person(s) in our sample is (are) 17 years old,
Mean of ACT: The respondents in our sample scored 28.64 points on the ACT on average.
In our regression model, we will use age, gender, the self-reported degree of education, SATV (SAT Verbal) score, and SATQ (SAT Quantitative) score to explain performance in the ACT (American College Test).
As far as the effects of the explanatory variables go, we expect gender to have no impact on ACT score and education to have a positive impact on ACT score - the dummy variables for "4" and "5" having positive coefficients and the dummy variables for "1" and "2" having negative coefficients compared to the reference category of "3".
We expect age to positively impact ACT score due to more learning time.
We also expect higher SATV and SATQ scores to positively impact ACT score because they could hold some notion of signalling skill/education.
We have 7 main regression assumptions along with some other additional requirements; we will go through them systematically.
Assumption 1: Linearity in parameters
With the first assumption we want to see that ACT score (the dependent variable) is a linear function of the parameters. We will check this later, when we look at the scatter plot matrix drawn for the multicollinearity check, and we can also see it in the plot of standardized residuals against standardized fitted values used for Assumption 3. Spoiler alert: linearity is met; in those graphs we will not see any curved shapes when comparing ACT score with the other numerical variables.
Assumption 2: The expected value of the errors equals zero
This assumption will generally be met so long as our model is built on theory, which ensures we include all theoretically important explanatory variables; if we omit some of them, our coefficient estimates will be off. Since for this homework we are dealing with a given set of variables, we will simply use all that are provided. The theory backing this up is our reasoning above about what effect (positive, none, or negative) we expect each explanatory variable to have on the dependent variable. We will stick to this assumption by using all of the variables in our model.
Assumption 3: Homoskedasticity
To check for homoskedasticity, we will plot the standardized residuals against the standardized fitted values. We will also test for it using the Breusch-Pagan test for heteroskedasticity. Using the same graph, we will also address linearity for Assumption 1, as mentioned above.
fit <- lm(ACT ~ Gender + Education + Age + SATV + SATQ,
data = mydata)
mydata$StdFitted <- scale(fit$fitted.values) #Adding standardized fitted values to our data set
mydata$StdResid <- round(rstandard(fit), 3) #Adding standardized residuals to our data set
mydata$CooksD <- round(cooks.distance(fit), 3) #Adding Cook's distances to our data set
We have added standardized fitted values, standardized residuals, and Cook’s distances to our data set all at once, because we will need them all later anyway.
Now to test for homoskedasticity and linearity, let’s draw the graph we mentioned before.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(y=StdResid, x=StdFitted)) +
geom_point() +
ylab("Standardized residuals") +
xlab("Standardized fitted values") +
theme_minimal()
Observing the graph above, we see no curved pattern in the points, so we have no issues with linearity, as anticipated under the first assumption.
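As an optional supplementary check of linearity, component-plus-residual (partial-residual) plots from the car package could also be drawn for the numerical predictors; a minimal sketch, not part of the original analysis:
library(car)
#partial-residual plots; roughly straight trends support the linearity assumption
crPlots(fit, terms = ~ Age + SATV + SATQ)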
As far as homoskedasticity is concerned, the spread of the residuals does not look constant - it is likely that homoskedasticity is violated. Let's run a formal test to confirm.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : ACT
## Variables: fitted values of ACT
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 159.4366
## Prob > Chi2 = 1.502298e-36
The hypotheses are already stated in the test output. We can reject the null hypothesis and conclude that the variance is not constant, i.e. we have heteroskedasticity (p < 0.001). To solve this issue, we will later use robust standard errors in our regression.
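As an aside, the same kind of heteroskedasticity-robust (HC1, i.e. White) standard errors that we will use later could also be obtained for the already fitted model via the sandwich and lmtest packages; a minimal sketch, not part of the original analysis:
library(lmtest)
library(sandwich)
#coefficient tests for the existing lm fit with an HC1 ("White") covariance estimate
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))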
Assumption 4: Normal distribution of errors
We can check the normality of the distribution of errors with a histogram of the standardized residuals and a Shapiro-Wilk test. At the same time we are also checking for outliers: units that fall outside the -3 to +3 band of the standardized distribution.
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col= "darkred",
border = "black")
The standardized residuals (through them we test, based on our sample, whether the errors in the population can be said to be normally distributed) do not appear normally distributed. Let's also test this formally.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.97773, p-value = 3.308e-08
H0: Errors in the population are normally distributed
H1: Errors in the population are not normally distributed
We can reject the null hypothesis and conclude that errors in the population are not normally distributed (p<0.001) based on our sample’s residuals. BUT this is likely due to the outliers we will remove anyway. We will check again later.
Either way, due to the central limit theorem (CLT) and our sample size (631 units after dropping the NAs), the non-normality is not a serious concern: the t-tests on our coefficients remain approximately valid. As such, we leave the data as is, even though the residuals are not normally distributed.
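Before moving on to the outliers, a normal Q-Q plot is a common complementary visual check of the residuals; a minimal base-R sketch, not part of the original output:
#Q-Q plot of the standardized residuals against theoretical normal quantiles
qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid, col = "darkred")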
Outliers, on the other hand, were clearly present above and need to be removed. Let's start with those below the threshold of -3.
head(mydata[order(mydata$StdResid),], 6)
## ID Gender Education Age ACT SATV SATQ StdFitted StdResid CooksD
## 403 440 Male 2 22 3 200 400 -3.2186674 -4.367 0.100
## 448 490 Male 1 23 15 600 600 -0.5007597 -3.326 0.033
## 310 340 Male 4 34 19 590 742 0.7650939 -3.277 0.016
## 475 534 Male 1 17 15 600 596 -0.6298315 -3.217 0.031
## 103 107 Female 5 65 23 700 650 1.3803813 -2.733 0.031
## 181 193 Female 3 22 17 600 500 -0.6742755 -2.603 0.004
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!ID %in% c(440, 490, 340, 534))
head(mydata[order(mydata$StdResid),], 6)
## ID Gender Education Age ACT SATV SATQ StdFitted StdResid CooksD
## 103 107 Female 5 65 23 700 650 1.3803813 -2.733 0.031
## 181 193 Female 3 22 17 600 500 -0.6742755 -2.603 0.004
## 224 244 Male 4 28 16 500 500 -1.0282742 -2.588 0.010
## 575 645 Male 3 22 22 700 700 0.6831250 -2.382 0.005
## 574 644 Male 3 22 22 670 720 0.6693245 -2.370 0.005
## 458 518 Female 3 18 16 500 450 -1.4271638 -2.246 0.004
The units with the value of standardized residuals lower than -3 have been removed. Let’s do the same for those above +3.
head(mydata[order(-mydata$StdResid),], 6)
## ID Gender Education Age ACT SATV SATQ StdFitted StdResid CooksD
## 228 251 Female 3 23 35 200 200 -3.9264420 5.079 0.088
## 336 370 Female 1 17 35 350 230 -3.5048585 4.750 0.109
## 258 282 Male 2 22 30 300 300 -3.3541978 3.233 0.047
## 126 131 Male 1 45 35 450 500 -1.2655640 2.846 0.043
## 113 117 Female 4 33 36 500 500 -0.7980115 2.675 0.008
## 195 211 Female 2 26 36 550 555 -0.7304345 2.639 0.020
mydata <- mydata %>%
  filter(!ID %in% c(251, 370, 282))
head(mydata[order(-mydata$StdResid),], 6)
## ID Gender Education Age ACT SATV SATQ StdFitted StdResid CooksD
## 126 131 Male 1 45 35 450 500 -1.2655640 2.846 0.043
## 113 117 Female 4 33 36 500 500 -0.7980115 2.675 0.008
## 195 211 Female 2 26 36 550 555 -0.7304345 2.639 0.020
## 450 505 Male 3 20 30 300 400 -2.6226855 2.585 0.016
## 86 90 Female 3 38 30 450 250 -2.3632613 2.373 0.017
## 7 7 Female 5 30 36 610 500 -0.4305172 2.368 0.007
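For reference, the same seven removals could have been made in a single step by filtering on the absolute value of the standardized residuals; a sketch equivalent to the ID-by-ID removal above:
#keep only units whose standardized residuals fall within the -3 to +3 band
mydata <- mydata %>%
  filter(abs(StdResid) <= 3)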
We have removed a total of 7 units. Let’s plot the histogram of standardized residuals again to check.
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col= "darkred",
border = "black")
As we can see, the outlier units have been removed, and the distribution now looks much closer to normal as well. Let's test for normality of the errors in the population again to confirm.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.99637, p-value = 0.1663
H0: Errors in the population are normally distributed
H1: Errors in the population are not normally distributed
We can no longer reject the null hypothesis at statistical significance. We now lack sufficient evidence to claim that the distribution of errors in the population is not normal, based on our sample's residuals. As we said before, though, it would not have mattered either way; this is just to show what a large effect removing outliers can have on our data.
Assumption 5: Errors are independent
Since our data is neither panel nor time-series data, we have no issues with this assumption. Each of our units (persons) is observed only once and gave a single answer for the categorical variables such as gender and education; no unit is observed multiple times, so the assumption is not violated. Seeing that the observations are independent, we will heuristically assume the errors are independent as well, especially since the distribution of errors in the population has now (after removing the outliers) been tested to be compatible with normality.
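For completeness, a Durbin-Watson test for autocorrelation of the residuals could also be run; it is mainly meaningful when the observations have a natural ordering (e.g. time), so for this cross-sectional data it is shown only as a sketch:
library(car)
#Durbin-Watson test for first-order autocorrelation of the residuals
durbinWatsonTest(fit)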
Assumption 6: No perfect multicollinearity
To check this, let's plot a scatterplot matrix and observe the correlations between the explanatory variables. We will not touch the VIF statistics just yet; we will do that later with the additional requirements. We show the scatterplot matrix for the numerical variables only (no Gender or Education), and we also exclude the columns we added for testing the assumptions - ID, the standardized values, and Cook's distances.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[ , c(-1, -2, -3, -8, -9, -10)],
smooth = FALSE)
We can see that the distributions of the variables themselves are not normal, but that does not violate any of our assumptions.
To check multicollinearity, we observe the relationships among the explanatory variables - that is, Age, SATV, and SATQ. Age does not appear strongly correlated with either SATV or SATQ. SATV and SATQ, however, do appear correlated with each other, which makes sense: both are parts of the SAT, even though they measure different abilities. From the scatter plot alone it is hard to judge whether they are too strongly correlated. As far as this assumption goes, there is no perfect multicollinearity, so for now we are fine, but we will check the strength of the multicollinearity later with the VIF statistics when checking the additional requirements.
Back to the first assumption of linearity mentioned above: we can see that the relationships between ACT score and the other three numerical variables (Age, SATV, SATQ) are linear (there are no curves in the respective scatter plots). Let's now take a look at the exact correlation coefficients below.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2, -3, -8, -9, -10)]))
## Age ACT SATV SATQ
## Age 1.00 0.11 -0.05 -0.04
## ACT 0.11 1.00 0.57 0.63
## SATV -0.05 0.57 1.00 0.60
## SATQ -0.04 0.63 0.60 1.00
##
## n= 624
##
##
## P
## Age ACT SATV SATQ
## Age 0.0084 0.1985 0.3148
## ACT 0.0084 0.0000 0.0000
## SATV 0.1985 0.0000 0.0000
## SATQ 0.3148 0.0000 0.0000
For Assumption 6 we are only looking for perfect multicollinearity, which would mean that one explanatory variable is an exact linear combination of the remaining explanatory variables; we will check the VIF values later. There is no perfect multicollinearity between the explanatory variables.
The statistically significant correlation between the SATQ and SATV scores could prove troublesome further on. We will see later, with the VIF values, whether one of those variables should be excluded from our model.
Assumption 7: The number of units is larger than the number of estimated parameters
In our case this clearly holds: we have 624 units and estimate only 9 parameters (the intercept, the Gender dummy, four Education dummies, Age, SATV, and SATQ). We do not violate Assumption 7.
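A quick sanity check of these counts, shown as a sketch:
nrow(mydata)      #number of observations after cleaning
length(coef(fit)) #number of estimated coefficients, including the intercept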
Additional requirement #1
The dependent variable must be numerical; the explanatory variables can be numerical or dichotomous (with categorical variables transformed into dummies). This is the case with our data.
Additional requirement #2
Each explanatory variable must vary (have nonzero variance); it is desirable that they take as wide a range of values as possible. Our explanatory variables do vary - there may be some redundancy between SATV and SATQ, but we will address that later. We could also see in the scatter plot matrix above that none of the variables has zero variance.
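A quick way to confirm this, sketched here for the numerical and categorical explanatory variables:
#the numerical explanatory variables all have nonzero variance, and the categorical ones take more than one value
sapply(mydata[, c("Age", "SATV", "SATQ")], var)
table(mydata$Gender)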
Additional requirement #3
We need to deal with potential outliers and units of large impact. We have already dealt with outliers when we checked the normality of the standardized residuals and removed the units whose values fell outside the critical band of -3 to +3. We are only left with the units of large impact.
To deal with those, we will observe the Cook's distance values which we have already added to our data set above.
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "darkred",
     border = "black",
     main = "Histogram of Cook's distances")
Looking at the plot in R, the Cook's distances appear to feature no "gaps" between the "pillars" of the frequency distribution, and the values are well below 1. But when knitting the document, we can see two units that fall beyond the bin ending at 0.02.
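If we wanted to identify exactly which two units those are, we could sort by Cook's distance; a quick sketch:
#the two units with the largest Cook's distances
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 2)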
I have decided not to remove those two units. My reasoning is the following: I have already removed 7 units. I do not like removing units in regression, because all of them have a story to tell. Maybe these two units are of high impact by theoretical definition, but in relative terms, their impact is not so high. I look at it this way: when trying to forecast things from samples, it can be a lethal mistake to disregard units of high impact. These units, although deviations that affect our regression more than the others, still represent concrete observations that are certainly present in the population. Removing them can be a terrible mistake, so long as removing them is not truly crucial for plotting regression. I am referencing Nassim Nicholas Taleb’s books here so any statistical anger that arises from my reasoning should be directed at that guy. I want those units in my regression model. I want their impact. Not removing them is a choice I made in my analysis; we could remove them if we so wished, I simply do not see the necessity to do so.
Additional requirement #4: Multicollinearity is not too strong
Unlike with the 6th regression assumption, where we only checked that there is no perfect multicollinearity, this time we want the multicollinearity to be sufficiently low. To check this, we will observe the VIF (variance inflation factor) statistics, which we generally want to be below 5, with their mean as close to 1 as possible.
Our values will look slightly different, though: because our model includes the categorical variable Education, represented by 4 dummy variables, R yields generalized variance inflation factors (GVIF) instead. The adjusted values, GVIF^(1/(2*Df)), should be below the square root of 5 (about 2.24), which corresponds to the usual VIF threshold of 5.
If we do not deal with overly strong multicollinearity, our standard errors will be inflated and the coefficient estimates will be unstable (very sensitive to small changes in the data), making the individual t-tests unreliable, so it is important that we check for it.
Let's then observe the VIF values.
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## Gender 1.069198 1 1.034020
## Education 1.696736 4 1.068321
## Age 1.645405 1 1.282733
## SATV 1.662541 1 1.289396
## SATQ 1.711133 1 1.308103
Because we included a categorical variable with more than two categories via multiple dummy variables (Education required 4 dummies), the VIF output looks a little different than usual. Long story short, we want to observe the last column - the adjusted generalized inflation factors, GVIF^(1/(2*Df)). Whereas normally we would want VIF values below 5, here we want these adjusted values to be below the square root of 5 (approximately 2.24) in order not to have too strong multicollinearity between the explanatory variables.
As we can see, all of our adjusted GVIF values are below 2.24, so based on this criterion we can keep all of the explanatory variables in the model.
Lastly, we should also look at the mean of VIF values.
mean(vif(fit))
## [1] 1.451172
For the VIF mean, we want a value as close to 1 as possible. (Note that mean(vif(fit)) here averages every entry of the matrix returned by vif(), including the Df column, so it should be read as a rough indication only; the individual adjusted GVIF values above are the more reliable guide.) Either way, multicollinearity is clearly not a problem. Having checked all of our assumptions and requirements, we can fit the OLS regression model.
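A more targeted summary, sketched below, would average only the adjusted GVIF column of the matrix that vif() returns:
#mean of the adjusted generalized inflation factors only
mean(vif(fit)[, "GVIF^(1/(2*Df))"])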
In this final chapter, we will just observe R’s regression output and explain everything step-by-step.
However, since homoskedasticity was violated, we have to fit the model with robust standard errors before interpreting it. We will use White's (HC1) robust standard errors.
library(estimatr)
fit1 <- lm_robust(ACT ~ Gender + Education + Age + SATV + SATQ,
se_type="HC1",
data = mydata)
summary(fit1)
##
## Call:
## lm_robust(formula = ACT ~ Gender + Education + Age + SATV + SATQ,
## data = mydata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 7.80621 1.182409 6.6020 8.777e-11 5.484157 10.12825 615
## GenderFemale 0.25317 0.288149 0.8786 3.800e-01 -0.312702 0.81905 615
## Education1 -0.56499 0.627718 -0.9001 3.684e-01 -1.797721 0.66774 615
## Education2 -0.53971 0.561698 -0.9609 3.370e-01 -1.642789 0.56337 615
## Education4 0.38077 0.404513 0.9413 3.469e-01 -0.413627 1.17516 615
## Education5 0.17105 0.422036 0.4053 6.854e-01 -0.657755 0.99986 615
## Age 0.05827 0.021531 2.7063 6.994e-03 0.015985 0.10055 615
## SATV 0.01264 0.001654 7.6414 8.281e-14 0.009391 0.01589 615
## SATQ 0.01864 0.001617 11.5284 5.634e-28 0.015467 0.02182 615
##
## Multiple R-squared: 0.4727 , Adjusted R-squared: 0.4659
## F-statistic: 57.13 on 8 and 615 DF, p-value: < 2.2e-16
The first thing we will look at are the p-values of the estimated coefficients (the b's) above. These p-values come from t-tests of whether we have enough statistical evidence to say that the true regression coefficient (BETA) differs from 0. The hypotheses are thus:
H0: Beta(i) = 0
H1: Beta(i) =/= 0
Where the p-value is less than 0.05, we can reject the null hypothesis at statistical significance. Where it is not, we lack sufficient evidence to reject the null hypothesis.
Looking at the p-values of the estimated coefficients above, we can see that for both Gender and Education we lack the statistical evidence necessary to claim an impact on ACT score. In other words, we cannot claim that the true regression coefficients (BETAs) of those variables differ from 0 at statistical significance. We do not interpret statistically insignificant coefficients, but, for the sake of practice, let's show how they would be read, since both variables were transformed into dummies and it would be a shame to just leave them be.
If Gender were statistically significant, we could have said this: given the values of all other explanatory variables, female respondents will on average have ACT scores 0.25 points higher than male respondents.
If Education were statistically significant, we could have said the following for each of the education "levels". We know that our reference group is education level "3", because it was the most common answer; each of the dummies for levels "1", "2", "4", and "5" is therefore compared to "3".
If we want to interpret, for example, "Education1", we can say: given the values of all other explanatory variables, respondents with education level "1" will on average have ACT scores 0.56 points lower than respondents with education level "3".
We can interpret, for example, "Education4" as: given the values of all other explanatory variables, respondents with education level "4" will on average have ACT scores 0.38 points higher than respondents with education level "3".
I deliberately did not quote any p-values, because these estimates are not statistically significant.
As we said before, the interpretations of the coefficient estimates so far were only for practice; they are statistically insignificant. Now we move on to those we should actually interpret.
Intercept: To interpret the intercept we would need a person with a value of 0 on all explanatory variables, which is not sensible here (especially for Age and the SAT scores, whose minimum is 200), so we will not interpret it and move on.
Age: If a person's age increases by 1 year, their ACT score will on average increase by 0.058 points (p = 0.007), controlling for all other explanatory variables.
SATV: If a person scores 1 point more on their Verbal SAT exam, their ACT score will on average increase by 0.013 points (p < 0.001), controlling for all other explanatory variables.
SATQ: If a person scores 1 point more on their Quantitative SAT exam, their ACT score will on average increase by 0.019 points (p < 0.001), controlling for all other explanatory variables.
Let's now interpret the R-squared of 0.4727: 47.27% of the variability in ACT score (the dependent variable) is explained by the variability in Gender, Education, Age, SATV score, and SATQ score (the explanatory variables). Even though Gender and Education are not statistically significant, they still explain some of the variability; we cannot simply dismiss them for failing to show a statistically significant effect.
We will not interpret the adjusted R-squared. In simple terms, it can be used to judge whether "complicating" the model with an additional variable pays off, but there are better ways of doing so.
The final thing we will touch upon is the F-statistic at the very bottom of the output. The F-test, in the simplest terms, measures how good our model is as a whole: it tells us whether we can claim, at statistical significance, that our explanatory variables explain at least some of the variability in the dependent variable. The hypotheses for the F-test are the following:
H0: ρ² = 0 (ρ being the population multiple correlation coefficient)
H1: ρ² > 0
Our F-statistic has a value of 57.13, which corresponds to a p-value lower than 0.001. Thus our conclusion is: based on the sample data, we reject H0 (p < 0.001) and conclude that we can explain at least some of the variability in ACT score with gender, education, age, Verbal SAT score, and Quantitative SAT score. Our explanatory variables (Gender, Education, Age, SATV, SATQ) have at least some impact on ACT score.
To answer our research question - "How is performance in ACT (American College Test) affected by age, gender, education, and performance in SAT tests?" - we can say that Gender and Education do not affect ACT performance at statistical significance, while Age (p = 0.007) and performance in both SATV (p < 0.001) and SATQ (p < 0.001) affect it positively, which coincides with what we expected to see at the beginning.
Finally, let’s look at our multiple correlation coefficient.
sqrt(summary(fit1)$r.squared)
## [1] 0.6875387
Judging by the multiple correlation coefficient of about 0.69, the relationship between ACT score and the explanatory variables (Gender, Education, Age, SATV score, and SATQ score) is semi strong (it falls between 0.3 and 0.7).