Regression (Continued)

Regression vs. correlation

Load the students’ data

library(readr)
df <- read_csv("corr.dept.csv")
df$SEX <- as.factor(df$SEX)    # treat sex as a categorical variable
df$DEPT <- as.factor(df$DEPT)  # treat department as a categorical variable
library(psych)
describe(df)
##        vars  n   mean   sd median trimmed  mad min max range  skew
## HEIGHT    1 50 168.50 5.33  168.0  168.50 5.93 156 180    24 -0.04
## SEX*      2 50   1.14 0.35    1.0    1.05 0.00   1   2     1  2.01
## DEPT*     3 50   1.96 0.83    2.0    1.95 1.48   1   3     2  0.07
## WEIGHT    4 50  57.54 6.86   56.5   56.92 5.19  46  82    36  1.03
##        kurtosis   se
## HEIGHT    -0.54 0.75
## SEX*       2.10 0.05
## DEPT*     -1.58 0.12
## WEIGHT     1.54 0.97
cor.test(df$HEIGHT, df$WEIGHT)
## 
##  Pearson's product-moment correlation
## 
## data:  df$HEIGHT and df$WEIGHT
## t = 3.4981, df = 48, p-value = 0.001021
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1970908 0.6477917
## sample estimates:
##       cor 
## 0.4507125

r = 0.45

model <- lm(df$HEIGHT ~ df$WEIGHT)
summary(model)
## 
## Call:
## lm(formula = df$HEIGHT ~ df$WEIGHT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9119  -2.4623  -0.3605   2.8011   9.6889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 148.3721     5.7939  25.608  < 2e-16 ***
## df$WEIGHT     0.3498     0.1000   3.498  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.805 on 48 degrees of freedom
## Multiple R-squared:  0.2031, Adjusted R-squared:  0.1865 
## F-statistic: 12.24 on 1 and 48 DF,  p-value: 0.001021

The goodness-of-fit measure for linear regression is R2. It shows how much of the variance in the outcome the model explains.

Multiple R-squared: 0.2031

0.4507125*0.4507125
## [1] 0.2031418

The squared correlation coefficient equals the coefficient of determination (R2) in regression when there is one predictor.
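A quick way to check this in R, using the objects created above (a minimal sketch):

cor(df$HEIGHT, df$WEIGHT)^2   # squared Pearson correlation, about 0.2031
summary(model)$r.squared      # R-squared reported by the regression, the same value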

Moreover, look at the last line of the regression output:

F-statistic: 12.24 (it tests whether this model explains significantly more variance than a model with no predictors, i.e. the intercept-only model)

t-value for df$WEIGHT = 3.498

3.498*3.498
## [1] 12.236
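With a single predictor, the squared t-value of the slope equals the F-statistic (up to rounding). A small sketch pulling both numbers from the fitted model object:

t_val <- summary(model)$coefficients["df$WEIGHT", "t value"]  # t-value of the slope
f_val <- summary(model)$fstatistic["value"]                   # F-statistic of the model
c(t_squared = t_val^2, F = unname(f_val))                     # both about 12.24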

Difference in interpretation:

Correlation shows the strength and direction of the relationship between two variables. Regression shows how changes in one or more variables, called ‘predictors’, predict changes in another variable, called the ‘outcome’.

summary(model)
## 
## Call:
## lm(formula = df$HEIGHT ~ df$WEIGHT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9119  -2.4623  -0.3605   2.8011   9.6889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 148.3721     5.7939  25.608  < 2e-16 ***
## df$WEIGHT     0.3498     0.1000   3.498  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.805 on 48 degrees of freedom
## Multiple R-squared:  0.2031, Adjusted R-squared:  0.1865 
## F-statistic: 12.24 on 1 and 48 DF,  p-value: 0.001021

Here, the intercept estimates a ‘baseline’ height of 148.37 cm. When a person weighs 1 kg more, their height is expected to increase by 0.35 cm; when a person weighs 10 kg more, by 3.5 cm.

However, ‘weighs 1 kg more’ than what? By default, the intercept is estimated at a predictor value of 0 (here, 0 kg). Sometimes such a baseline (like a weight of 0 kg) makes no sense; therefore, it is wise to center such variables before modeling.

This is centering.

df$weight_c <- scale(df$WEIGHT, center = T, scale = F)  #center = T subtracts the mean; scale = T would also standardize the values (mean = 0, sd = 1), which we do not want here
df$weight_c <- as.integer(df$weight_c)  #scale() returns a one-column matrix; this converts it to a plain vector (as.integer also drops the decimals)
summary(df$WEIGHT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   46.00   53.00   56.50   57.54   61.50   82.00
summary(df$weight_c)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -11.00   -4.00   -0.50    0.08    3.50   24.00
model2 <- lm(df$HEIGHT ~ df$weight_c)
summary(model2)
## 
## Call:
## lm(formula = df$HEIGHT ~ df$weight_c)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9822  -2.4881  -0.4003   2.7738   9.6697 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.4702     0.6790 248.120  < 2e-16 ***
## df$weight_c   0.3720     0.1059   3.512  0.00098 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.801 on 48 degrees of freedom
## Multiple R-squared:  0.2044, Adjusted R-squared:  0.1878 
## F-statistic: 12.33 on 1 and 48 DF,  p-value: 0.0009805

The average height is now estimated as 168.47 cm – much more plausible, right? With each kilogram above the average weight, a person is expected to be 0.37 cm taller, which makes +3.7 cm for every 10 kilograms.
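To double-check these numbers, a minimal sketch that refits the same model with a data argument (so that predict() accepts new data; m_c is just an illustrative name) and predicts the height at the average weight and at 10 kg above it:

m_c <- lm(HEIGHT ~ weight_c, data = df)                  # same model, refitted with data = df
predict(m_c, newdata = data.frame(weight_c = c(0, 10)))  # about 168.5 and 172.2 cm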

Multiple R-squared: 0.2044 – essentially no difference (the small change from 0.2031 comes from rounding the centered weights to integers, not from the centering itself). The model is essentially the same, but all coefficients are now interpretable.
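If you center exactly (subtracting the mean without rounding to integers), the slope and the R-squared stay identical to the first model and only the intercept changes. A quick sketch, using an illustrative column name weight_c_exact:

df$weight_c_exact <- df$WEIGHT - mean(df$WEIGHT)           # exact centering, decimals kept
summary(lm(HEIGHT ~ weight_c_exact, data = df))$r.squared  # same 0.2031 as the first model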

How to plot the linear regression model?

plot(model2) #Produces residual (diagnostic) plots -- useful for checking the model, but not the fitted-line plot you might expect

This is ‘regression diagnostics’ – looking at what the model cannot explain. It is not exactly what you want for now, but if you are interested, have a look at chapter 4 of Miles & Shevlin’s book.
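If you do want to inspect the diagnostics, a common idiom is to show all four plots on one device (a minimal sketch):

par(mfrow = c(2, 2))   # 2 x 2 layout: residuals vs. fitted, Q-Q, scale-location, leverage
plot(model2)
par(mfrow = c(1, 1))   # reset the layout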

Plotting the regression line:

with(df, plot(weight_c, HEIGHT, main = "Students' height and weight", xlab = "Weight, centered"))
abline(model2)

library(ggplot2)
ggplot(df, aes(weight_c, HEIGHT)) +
  geom_smooth(method = 'lm', se = F) + labs(x = "Weight (centered), kg", y = "Height, cm")

ggplot(data = df, aes(x = weight_c, y = HEIGHT)) + 
  geom_point(color='blue') + geom_point(shape = 1) +
  geom_smooth(method = "lm", se = FALSE) + labs(x = "Weight (centered), kg", y = "Height, cm")

#https://stackoverflow.com/questions/7549694/adding-regression-line-equation-and-r2-on-graph
#Here is a function that draws the regression equation straight on the graph
lm_eqn = function(m) {

  l <- list(a = format(coef(m)[1], digits = 2),
      b = format(abs(coef(m)[2]), digits = 2),
      r2 = format(summary(m)$r.squared, digits = 3));

  if (coef(m)[2] >= 0)  {
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2,l)
  } else {
    eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(R)^2~"="~r2,l)    
  }

  as.character(as.expression(eq));                 
}
p <- ggplot(data = df, aes(x = weight_c, y = HEIGHT)) + 
  geom_point(color='orange') + geom_point(shape = 1, color = "black") +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", col = "purple") + labs(x = "Weight (centered), kg", y = "Height, cm")
p + annotate("text", 
             x = 1, y = 150, 
             label = lm_eqn(lm(HEIGHT ~ weight_c, data = df)), 
             colour="black", 
             size = 5, parse = TRUE) +
  theme_bw() #the flashy colors are intentional

The difference between ANOVA and regression with a categorical predictor is that ANOVA compares the group means directly, while regression estimates the differences between a reference group and each of the other groups.

oneway.test(df$HEIGHT ~ df$DEPT)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  df$HEIGHT and df$DEPT
## F = 13.741, num df = 2.000, denom df = 30.538, p-value = 5.547e-05

Look at the group means here!

userfriendlyscience::posthocTGH(df$HEIGHT, df$DEPT)
##    n means variances
## 1 18   164        16
## 2 16   170        13
## 3 16   172        26
## 
##     diff ci.lo ci.hi   t df    p
## 2-1  5.4   2.1   8.6 4.1 32 <.01
## 3-1  7.7   3.7  11.6 4.8 29 <.01
## 3-2  2.3  -1.5   6.2 1.5 27  .31

Now calculate the group means from the regression output:

summary(lm(df$HEIGHT ~ df$DEPT))
## 
## Call:
## lm(formula = df$HEIGHT ~ df$DEPT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0000  -2.2500  -0.6667   3.1510   9.6667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  164.333      1.008 163.044  < 2e-16 ***
## df$DEPT2       5.354      1.469   3.644 0.000669 ***
## df$DEPT3       7.667      1.469   5.218 4.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.276 on 47 degrees of freedom
## Multiple R-squared:  0.3819, Adjusted R-squared:  0.3556 
## F-statistic: 14.52 on 2 and 47 DF,  p-value: 1.229e-05
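In this output, the intercept is the mean of the reference group (DEPT 1), and each remaining coefficient is the difference between that department and the reference. A small sketch that recovers the group means (m_dept and b are just illustrative names):

m_dept <- lm(df$HEIGHT ~ df$DEPT)
b <- coef(m_dept)
c(dept1 = b[[1]], dept2 = b[[1]] + b[[2]], dept3 = b[[1]] + b[[3]])  # about 164.3, 169.7, 172.0
tapply(df$HEIGHT, df$DEPT, mean)                                     # the same means, straight from the data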

t-test and regression

df$SEX <- as.factor(df$SEX)
t.test(df$HEIGHT ~ df$SEX)
## 
##  Welch Two Sample t-test
## 
## data:  df$HEIGHT by df$SEX
## t = -2.7139, df = 9.9746, p-value = 0.02183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.3198739 -0.8163387
## sample estimates:
## mean in group f mean in group m 
##        167.8605        172.4286

172.43 - 167.86 = 4.57 (cm)

modelt <- lm(df$HEIGHT ~ df$SEX)
summary(modelt)
## 
## Call:
## lm(formula = df$HEIGHT ~ df$SEX)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8605  -2.8605   0.1395   2.1395  10.1395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 167.8605     0.7828 214.427   <2e-16 ***
## df$SEXm       4.5681     2.0922   2.183   0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.133 on 48 degrees of freedom
## Multiple R-squared:  0.09034,    Adjusted R-squared:  0.07139 
## F-statistic: 4.767 on 1 and 48 DF,  p-value: 0.03393

df$SEXm (the estimated difference in height for males compared to the reference group, females): 4.5681 cm – the same as the difference in group means from the t-test. Note that the p-values differ slightly (0.0339 vs. 0.0218) because t.test() applies the Welch correction for unequal variances by default, whereas the regression assumes equal variances.
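A last sketch confirming that this coefficient is simply the raw difference in group means:

means <- tapply(df$HEIGHT, df$SEX, mean)  # group means, straight from the data
unname(means["m"] - means["f"])           # about 4.57 cm, the same as the df$SEXm estimate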