There are two regression lines (light blue and dark blue) in the figure for the language and math example. One is based on aggregates and the other on individual scores. Which is which, and which has the steeper slope?
Ecological correlations are based on rates or averages. They tend to overstate the strength of an association.
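The overstatement can be illustrated with a small simulation. The following is a minimal sketch in base R that uses made-up data, not the data set analyzed in this exercise. The names `sim`, `u`, `e`, `n_school`, and `n_pupil`, along with all the numeric settings, are assumptions chosen for illustration only.

```r
# Simulated data for illustration only (hypothetical; not the langMath.csv data)
set.seed(2021)
n_school <- 50                      # number of schools (assumed)
n_pupil  <- 20                      # pupils per school (assumed)
school   <- rep(seq_len(n_school), each = n_pupil)

# A school effect shared by both scores, plus pupil-level noise
u     <- rnorm(n_school, sd = 3)
e     <- rnorm(n_school * n_pupil, sd = 6)
arith <- 20 + u[school] + e
lang  <- 40 + 1.5 * u[school] + 0.5 * e + rnorm(n_school * n_pupil, sd = 6)
sim   <- data.frame(school, arith, lang)

# Correlation computed on individual pupils
r_pupil <- cor(sim$arith, sim$lang)

# Ecological correlation computed on school averages
sim_mean <- aggregate(cbind(arith, lang) ~ school, data = sim, FUN = mean)
r_school <- cor(sim_mean$arith, sim_mean$lang)

# Compare the two correlations side by side
c(individual = r_pupil, ecological = r_school)
```

Averaging within schools removes most of the pupil-level noise, so in this setup the school means fall closer to a line than the individual scores do.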
The data set consists of grade 8 pupils (aged about 11 years) in elementary schools in the Netherlands. There are 2,287 pupils in 131 schools, and class sizes range from 4 to 35. The question of interest is the correlation between scores on an arithmetic test and a language test.
Source: Snijders, T., & Bosker, R. (1999). Multilevel Analysis.
Column 1: School ID
Column 2: Pupil ID
Column 4: Language test score
Column 5: Arithmetic test score
# data management and graphics package
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# input data
dta <- read.csv("C:/Users/Ching-Fang Wu/Documents/lmm/langMath.csv", header = TRUE)
# inspect data structure
str(dta)
## 'data.frame': 2287 obs. of 4 variables:
## $ School: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Pupil : int 17001 17002 17003 17004 17005 17006 17007 17008 17009 17010 ...
## $ Lang : int 46 45 33 46 20 30 30 57 36 36 ...
## $ Arith : int 24 19 24 26 9 13 13 30 23 22 ...
# compute averages by school
dta_a <- dta %>%
  group_by(School) %>%
  summarize(ave_lang = mean(Lang, na.rm = TRUE),
            ave_arith = mean(Arith, na.rm = TRUE))
# examine first 6 lines
head(dta)
## School Pupil Lang Arith
## 1 1 17001 46 24
## 2 1 17002 45 19
## 3 1 17003 33 24
## 4 1 17004 46 26
## 5 1 17005 20 9
## 6 1 17006 30 13
# superimpose two plots
ggplot(data = dta, aes(x = Arith, y = Lang)) +
  # individual pupils: light blue points and regression line
  geom_point(color = "skyblue") +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "skyblue") +
  # school averages: dark blue points and regression line
  geom_point(data = dta_a, aes(ave_arith, ave_lang), color = "steelblue") +
  stat_smooth(data = dta_a, aes(ave_arith, ave_lang),
              method = "lm", formula = y ~ x, se = FALSE, color = "steelblue") +
  labs(x = "Arithmetic score",
       y = "Language score") +
  theme_bw()
The light blue regression line is based on individual scores. The dark blue regression line is based on the aggregated data (the school averages), because it has the steeper slope.
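The comparison of slopes can also be checked numerically rather than read off the figure. The sketch below defines a hypothetical helper, `slope_by_level()`, which is not part of the original analysis. It assumes a data frame with the columns `School`, `Lang`, and `Arith`, as in `dta`, and it is demonstrated on a made-up toy data frame `toy`.

```r
# Hypothetical helper (not part of the original analysis): returns the slope of
# Lang on Arith fitted to individual pupils and fitted to school means.
slope_by_level <- function(d) {
  # Pupil-level fit on the raw scores
  fit_pupil <- lm(Lang ~ Arith, data = d)
  # School-level fit on per-school means
  # (the formula interface of aggregate() drops rows with missing values)
  d_mean <- aggregate(cbind(Lang, Arith) ~ School, data = d, FUN = mean)
  fit_school <- lm(Lang ~ Arith, data = d_mean)
  c(individual = unname(coef(fit_pupil)["Arith"]),
    aggregate  = unname(coef(fit_school)["Arith"]))
}

# Toy data with the same column names as dta, for illustration only
toy <- data.frame(
  School = rep(1:3, each = 3),
  Arith  = c(10, 12, 14, 15, 17, 19, 20, 22, 24),
  Lang   = c(30, 29, 31, 38, 37, 39, 46, 45, 47)
)
slope_by_level(toy)
```

With the data loaded as above, `slope_by_level(dta)` would return the two slopes that correspond to the light blue and dark blue lines.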