library(readr)
df <- read_csv("corr.dept.csv")
df$SEX <- as.factor(df$SEX)
df$DEPT <- as.factor(df$DEPT)
library(psych)
describe(df)
##        vars  n   mean   sd median trimmed  mad min max range  skew kurtosis   se
## HEIGHT    1 50 168.50 5.33  168.0  168.50 5.93 156 180    24 -0.04    -0.54 0.75
## SEX*      2 50   1.14 0.35    1.0    1.05 0.00   1   2     1  2.01     2.10 0.05
## DEPT*     3 50   1.96 0.83    2.0    1.95 1.48   1   3     2  0.07    -1.58 0.12
## WEIGHT    4 50  57.54 6.86   56.5   56.92 5.19  46  82    36  1.03     1.54 0.97
(The asterisks mark factor variables that describe() coerced to numbers, so their means and sds should not be over-interpreted.)
cor.test(df$HEIGHT, df$WEIGHT)
##
## Pearson's product-moment correlation
##
## data: df$HEIGHT and df$WEIGHT
## t = 3.4981, df = 48, p-value = 0.001021
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1970908 0.6477917
## sample estimates:
## cor
## 0.4507125
So the correlation coefficient is r ≈ 0.45.
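As a quick sanity check (a minimal base-R sketch, not part of the original analysis), the same r can be reproduced from its definition, r = cov(x, y) / (sd(x) × sd(y)):
# Pearson's r from its definition; should print the 0.4507125 reported above
with(df, cov(HEIGHT, WEIGHT) / (sd(HEIGHT) * sd(WEIGHT)))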
model <- lm(df$HEIGHT ~ df$WEIGHT)
summary(model)
##
## Call:
## lm(formula = df$HEIGHT ~ df$WEIGHT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9119 -2.4623 -0.3605 2.8011 9.6889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148.3721 5.7939 25.608 < 2e-16 ***
## df$WEIGHT 0.3498 0.1000 3.498 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.805 on 48 degrees of freedom
## Multiple R-squared: 0.2031, Adjusted R-squared: 0.1865
## F-statistic: 12.24 on 1 and 48 DF, p-value: 0.001021
The goodness-of-fit measure for linear regression is R². It shows how much of the variance in the outcome the model explains.
Multiple R-squared: 0.2031
0.4507125*0.4507125
## [1] 0.2031418
The squared correlation coefficient equals the coefficient of determination (R²) in a regression with one (and only one) predictor.
Moreover, look at the last line of the regression output:
F-statistic: 12.24 (it tests whether this model is better than no model)
t-value for df$WEIGHT = 3.498
3.498*3.498
## [1] 12.236
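Both identities are easy to verify in code. A minimal sketch using only the `model` object fitted above (`all.equal()` tolerates floating-point noise):
# With a single predictor, r^2 equals R-squared and t^2 equals F
r <- cor(df$HEIGHT, df$WEIGHT)
all.equal(r^2, summary(model)$r.squared)   # TRUE
t_val <- summary(model)$coefficients["df$WEIGHT", "t value"]
f_val <- summary(model)$fstatistic["value"]
all.equal(unname(t_val^2), unname(f_val))  # TRUE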
Correlation shows the strength and direction of the relationship between two variables. Regression shows how changes in one or more variables, called ‘predictors’, predict changes in a variable called the ‘outcome’.
Looking again at the regression output above: the average height is estimated as 148.37 (cm). When a person weighs 1 kg more, their height is expected to be 0.35 cm greater; a person weighing 10 kg more is expected to be 3.5 cm taller.
However, ‘weighs 1 kg more’ than what? By default, the intercept is estimated at a predictor value of 0 (here, 0 kg). Sometimes such a value makes no sense (nobody weighs 0 kg); therefore, it is wise to center such variables before modeling.
This is centering.
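Centering just subtracts the mean, so that 0 comes to mean ‘average weight’. A by-hand equivalent of what scale() does below (the weight_c_byhand column is only for illustration):
# Centering by hand: subtract the mean of WEIGHT from every value
df$weight_c_byhand <- df$WEIGHT - mean(df$WEIGHT)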
df$weight_c <- scale(df$WEIGHT, center = TRUE, scale = FALSE) # scale = TRUE would also standardize WEIGHT (mean = 0, sd = 1)
df$weight_c <- as.integer(df$weight_c) # drops the matrix attribute scale() returns; as.integer() also truncates the decimals, which is why the mean below is 0.08 rather than exactly 0
summary(df$WEIGHT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46.00 53.00 56.50 57.54 61.50 82.00
summary(df$weight_c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.00 -4.00 -0.50 0.08 3.50 24.00
model2 <- lm(df$HEIGHT ~ df$weight_c)
summary(model2)
##
## Call:
## lm(formula = df$HEIGHT ~ df$weight_c)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9822 -2.4881 -0.4003 2.7738 9.6697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.4702 0.6790 248.120 < 2e-16 ***
## df$weight_c 0.3720 0.1059 3.512 0.00098 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.801 on 48 degrees of freedom
## Multiple R-squared: 0.2044, Adjusted R-squared: 0.1878
## F-statistic: 12.33 on 1 and 48 DF, p-value: 0.0009805
The average height is now estimated as 168.47 (cm) – much more plausible, huh? With each kilogram above the average weight, a person is expected to be 0.37 cm taller, which makes +3.7 cm for every 10 kilograms.
Multiple R-squared: 0.2044 – essentially no difference (the tiny change from 0.2031 comes from the as.integer() truncation above). The model is the same, but all the coefficients are now interpretable.
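As a small usage sketch (the coefficients are taken straight from model2 above), the fitted line turns weights into predicted heights:
# Predicted height at the average weight (weight_c = 0) and 10 kg above it
b <- coef(model2)
unname(b[1] + b[2] * c(0, 10)) # ~168.47 and ~172.19 cm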
plot(model2) # produces four residual diagnostic plots -- not the scatterplot you might expect
This is ‘regression diagnostics’ – looking at what the model cannot explain. It is not exactly what you want for now, but if you are interested, have a look at chapter 4 in Miles & Shevlin’s book.
with(df, plot(weight_c, HEIGHT, main = "Students' height and weight", xlab = "Weight, centered"))
abline(model2)
library(ggplot2)
ggplot(df, aes(weight_c, HEIGHT)) +
  geom_smooth(method = "lm", se = FALSE) + # regression line only, no points
  labs(x = "Weight (centered), kg", y = "Height, cm")
ggplot(data = df, aes(x = weight_c, y = HEIGHT)) +
  geom_point(color = 'blue') + geom_point(shape = 1) + # blue points with hollow outlines on top
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (centered), kg", y = "Height, cm")
# https://stackoverflow.com/questions/7549694/adding-regression-line-equation-and-r2-on-graph
#Here is a function that draws the regression equation straight on the graph
lm_eqn = function(m) {
  # Collect the intercept, the absolute slope, and R-squared as formatted strings
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3))
  # Build a plotmath expression, picking "+" or "-" according to the slope's sign
  if (coef(m)[2] >= 0) {
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2, l)
  } else {
    eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(R)^2~"="~r2, l)
  }
  as.character(as.expression(eq))
}
p <- ggplot(data = df, aes(x = weight_c, y = HEIGHT)) +
geom_point(color='orange') + geom_point(shape = 1, color = "black") +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", col = "purple") + labs(x = "Weight (centered), kg", y = "Height, cm")
p + annotate("text",
x = 1, y = 150,
label = lm_eqn(lm(HEIGHT ~ weight_c, data = df)),
colour="black",
size = 5, parse = TRUE) +
  theme_bw() #the colors are meant to look flashy intentionally
The difference between ANOVA and regression output is that ANOVA tests whether the group means differ at all, while regression with a dummy-coded factor estimates the difference between the reference group and each of the other groups.
oneway.test(df$HEIGHT ~ df$DEPT)
##
## One-way analysis of means (not assuming equal variances)
##
## data: df$HEIGHT and df$DEPT
## F = 13.741, num df = 2.000, denom df = 30.538, p-value = 5.547e-05
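For comparison, a classical one-way ANOVA that does assume equal variances can be fitted with aov() (a sketch we add here, not part of the original analysis):
# Classical one-way ANOVA (equal-variance assumption), for comparison
summary(aov(HEIGHT ~ DEPT, data = df))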
userfriendlyscience::posthocTGH(df$HEIGHT, df$DEPT)
## n means variances
## 1 18 164 16
## 2 16 170 13
## 3 16 172 26
##
## diff ci.lo ci.hi t df p
## 2-1 5.4 2.1 8.6 4.1 32 <.01
## 3-1 7.7 3.7 11.6 4.8 29 <.01
## 3-2 2.3 -1.5 6.2 1.5 27 .31
summary(lm(df$HEIGHT ~ df$DEPT))
##
## Call:
## lm(formula = df$HEIGHT ~ df$DEPT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0000 -2.2500 -0.6667 3.1510 9.6667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.333 1.008 163.044 < 2e-16 ***
## df$DEPT2 5.354 1.469 3.644 0.000669 ***
## df$DEPT3 7.667 1.469 5.218 4.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.276 on 47 degrees of freedom
## Multiple R-squared: 0.3819, Adjusted R-squared: 0.3556
## F-statistic: 14.52 on 2 and 47 DF, p-value: 1.229e-05
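Note that df$DEPT2 (5.354) essentially reproduces the 2-1 difference from the post-hoc table above, and df$DEPT3 (7.667) the 3-1 difference, because department 1 is the reference level. As a sketch (the DEPT_ref2 column name is ours), the baseline can be switched with relevel():
# Re-express the same model against department 2 as the reference group
df$DEPT_ref2 <- relevel(df$DEPT, ref = "2")
coef(lm(HEIGHT ~ DEPT_ref2, data = df)) # intercept is now dept 2's mean (~170)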
df$SEX <- as.factor(df$SEX) # SEX was already converted to a factor above; repeating the call is harmless
t.test(df$HEIGHT ~ df$SEX)
##
## Welch Two Sample t-test
##
## data: df$HEIGHT by df$SEX
## t = -2.7139, df = 9.9746, p-value = 0.02183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.3198739 -0.8163387
## sample estimates:
## mean in group f mean in group m
## 167.8605 172.4286
172.43 - 167.86 = 4.57 (cm) – the difference between the two group means.
modelt <- lm(df$HEIGHT ~ df$SEX)
summary(modelt)
##
## Call:
## lm(formula = df$HEIGHT ~ df$SEX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8605 -2.8605 0.1395 2.1395 10.1395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 167.8605 0.7828 214.427 <2e-16 ***
## df$SEXm 4.5681 2.0922 2.183 0.0339 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.133 on 48 degrees of freedom
## Multiple R-squared: 0.09034, Adjusted R-squared: 0.07139
## F-statistic: 4.767 on 1 and 48 DF, p-value: 0.03393
df$SEXm (the estimated difference in height for males compared to the female reference group): 4.5681 (cm) – the same mean difference the t-test reported.
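A final sketch (base R only) showing that this dummy coefficient is exactly the difference between the two group means:
# The df$SEXm coefficient equals mean(male heights) - mean(female heights)
m <- tapply(df$HEIGHT, df$SEX, mean)
unname(m["m"] - m["f"]) # 4.5681
coef(modelt)["df$SEXm"] # the same value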