Discussion Multiple regression

1. Load data

The data set “child_data” is stored on the local machine but can be found via Github https://github.com/ex-pr/DATA607/blob/week-11/Child_Data.sav
The data contains info about pupils. The data frame consists of 4 columns (AGE, MEM_SPAN - memory testing score, IQ, READ_AB - reading skills score) and 20 rows.

df <- read.spss("C:/Users/daria/Downloads/Child_data.sav", use.value.labels = T, to.data.frame = T, use.missings = T)
str(df)

## 'data.frame':    20 obs. of  4 variables:
##  $ AGE     : num  6.7 5.9 5.5 6.2 6.4 7.3 5.7 6.15 7.5 6.9 ...
##  $ MEM_SPAN: num  4.4 4 4.1 4.8 5 5.5 3.6 5 5.4 5 ...
##  $ IQ      : num  95 90 105 98 106 100 88 95 96 104 ...
##  $ READ_AB : num  7.2 6 6 6.6 7 7.2 5.3 6.4 6.6 7.3 ...
##  - attr(*, "variable.labels")= Named chr [1:4] "age" "short-term memory span" "IQ" "reading ability"
##   ..- attr(*, "names")= chr [1:4] "AGE" "MEM_SPAN" "IQ" "READ_AB"

2. Visualize the Data

We will analyse the relation between the memory and reading skills. First, we should determine whether or not a linear relationship exists between the predictor and the output value.
It looks like linear relationship between the memory and reading skills, when reading is increased, the memory goes up too.

ggplot(df, aes(READ_AB, MEM_SPAN)) + 
  geom_point (color="blue", side=4) + 
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90))+
  labs(x = 'Reading score', y = "Memory score")

3. The Model Function

Reading skills depends not only from memory but maybe from age too. We will add another predictor - students’ age. Maybe, older kids can read better.

model <- lm(READ_AB ~ MEM_SPAN + AGE, data=df)
summary(model)

## 
## Call:
## lm(formula = READ_AB ~ MEM_SPAN + AGE, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65555 -0.15167  0.01677  0.11466  0.73690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1.89714    0.57819   3.281  0.00441 **
## MEM_SPAN     0.52096    0.18353   2.839  0.01135 * 
## AGE          0.33936    0.09881   3.435  0.00316 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3192 on 17 degrees of freedom
## Multiple R-squared:  0.8071, Adjusted R-squared:  0.7844 
## F-statistic: 35.57 on 2 and 17 DF,  p-value: 8.414e-07

p-value is 8.414e-07, which is a tiny value which can help us to assume that there is strong evidence of a linear relationship between reading and memory/age, the model describes 81% of our data. (Multiple R-squared: 0.8071).
For MEM_SPAN, p-value is less than 0.05, so we have to reject the null-hypothesis about 0 coefficient for MEM_SPAN.
For AGE, p-value is less than 0.05, so we have to reject the null-hypothesis about 0 coefficient for AGE.
The Standard error is 3 times smaller than the estimated value for MEM_SPAN and 3 times for AGE, which is fine for now.
Regression equation:
READ_AB = 1.89714 + 0.52096 * MEM_SPAN + 0.33936 * AGE
If we increase MEM_SPAN by one, the reading skills should increase by 0.52096 if AGE is constant.
If we increase AGE by one, the reading skills should increase by 0.33936 if MEM_SPAN is constant.

But memory is also changing with age. As a result, memory/age intercation can affect reading skills. Let’s check it by adding another predictor.

model2 <- lm(READ_AB ~ MEM_SPAN + AGE + MEM_SPAN:AGE, data=df)
summary(model2)

## 
## Call:
## lm(formula = READ_AB ~ MEM_SPAN + AGE + MEM_SPAN:AGE, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48254 -0.15311  0.00693  0.14414  0.62320 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -10.6151     4.8385  -2.194  0.04336 * 
## MEM_SPAN       3.4686     1.1448   3.030  0.00797 **
## AGE            2.2162     0.7269   3.049  0.00766 **
## MEM_SPAN:AGE  -0.4382     0.1685  -2.600  0.01935 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2759 on 16 degrees of freedom
## Multiple R-squared:  0.8644, Adjusted R-squared:  0.839 
## F-statistic:    34 on 3 and 16 DF,  p-value: 3.578e-07

p-value is 0.01935, which is good. MEM_SPAN:AGE estimate value is -0.4382. The connection between independent and dependent variables is affected by another independent variable.
The three predictors explain 86% of the data (Multiple R-squared: 0.8644).
Regression equation:
READ_AB = -10.6151 + 3.4686 * MEM_SPAN + 2.2162 * AGE + (-0.4382 * MEM_SPAN:AGE)

Визуализируем взаимодействие с помощью функции visreg() из пакета visreg.

visreg(model2, "MEM_SPAN", by = "AGE", overlay=TRUE, band=TRUE)

4. Residual analysis

The last check for our model is residual analysis. The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line.
For a model to be a good fit with the data, the residuals should be normally distributed around a mean of zero, uniformly scattered above and below zero, the median value near zero, Min/Max and 1Q/3Q are around the same magnitude. In the summary() results we see that our model satisfies the requirements for median (0.00693) Min/Max (-0.48254/0.62320) and 1Q/3Q (-0.15311/0.14414).
To check other requirements, we will build the residuals plot. The residuals are not normally distributed around the zero, and not uniformly scattered above and below zero though there is different variance of the outliers at the beginning and the end of the plot.
We should continue checking the residuals.

plot(fitted(model2),resid(model2), xlab = "Fitted", ylab = "Residuals",
     main = "Residuals plot")
abline(0,0)

Another test is to use the quantile-versus-quantile plot, Q-Q. The Q-Q plot shows whether the residuals are normally distributed. We see that the right end diverge from the line.

qqnorm(resid(model2))
qqline(resid(model2))

The overall results of the analysis.

par(mfrow=c(2,2))
plot(model2)

6. Conclusion