1 Problem statement

There are two regression lines (light blue and dark blue) on the figure of the language and math example. One is based on aggregates, the other on individual scores. Which is which, and which has the steeper slope?

2 Introduction

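Ecological correlations are based on rates or averages rather than on individual observations. They tend to overstate the strength of an association, because averaging removes much of the individual-level variation.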
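
A small simulation can illustrate this tendency. The sketch below uses made-up data: the number of groups, the group size, and the standard deviations are all assumptions chosen for illustration, not values taken from the language and math data set. Both variables share a group-level effect, and each individual adds independent noise on top of it. The correlation is then computed once on individuals and once on group averages.

```r
# Simulated illustration only; none of these numbers come from the real data.
set.seed(1)
n_groups <- 50      # hypothetical number of groups
n_per_group <- 20   # hypothetical group size

group <- rep(seq_len(n_groups), each = n_per_group)

# A shared group-level effect that drives both variables
group_effect <- rnorm(n_groups, mean = 0, sd = 1)[group]

# Individual scores = group effect + independent individual noise
x <- group_effect + rnorm(n_groups * n_per_group, sd = 2)
y <- group_effect + rnorm(n_groups * n_per_group, sd = 2)

# Correlation computed on individuals
r_individual <- cor(x, y)

# Correlation computed on group averages (the ecological correlation)
x_mean <- tapply(x, group, mean)
y_mean <- tapply(y, group, mean)
r_ecological <- cor(x_mean, y_mean)

c(individual = r_individual, ecological = r_ecological)
```

Under these assumptions the individual noise largely cancels within each group average, so the ecological correlation should come out well above the individual-level one.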

The data set consists of grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. There are 2,287 pupils in 131 schools, and class sizes range from 4 to 35. The question of interest is the correlation between scores on an arithmetic test and scores on a language test.

Source: Snijders, T., & Bosker, R. (1999). Multilevel Analysis.

Column 1: School ID
Column 2: Pupil ID
Column 4: Language test score
Column 5: Arithmetic test score

3 Data management

# data management and graphics package
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# input data
dta <- read.csv("C:/Users/Ching-Fang Wu/Documents/lmm/langMath.csv", header=TRUE)
# inspect data structure
str(dta)
## 'data.frame':    2287 obs. of  4 variables:
##  $ School: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Pupil : int  17001 17002 17003 17004 17005 17006 17007 17008 17009 17010 ...
##  $ Lang  : int  46 45 33 46 20 30 30 57 36 36 ...
##  $ Arith : int  24 19 24 26 9 13 13 30 23 22 ...
# compute averages by school
dta_a <- dta %>%
        group_by(School) %>%
        summarize(ave_lang = mean(Lang, na.rm=TRUE),
                  ave_arith = mean(Arith,na.rm=TRUE))
# examine first 6 lines
head(dta)
##   School Pupil Lang Arith
## 1      1 17001   46    24
## 2      1 17002   45    19
## 3      1 17003   33    24
## 4      1 17004   46    26
## 5      1 17005   20     9
## 6      1 17006   30    13
# superimpose two plots: individual pupils (light blue) and school averages (dark blue)
ggplot(data=dta, aes(x=Arith, y=Lang)) +
 geom_point(color="skyblue") +
 stat_smooth(method="lm", formula=y ~ x, se=FALSE, color="skyblue") +
 geom_point(data=dta_a, aes(ave_arith, ave_lang), color="steelblue") +
 stat_smooth(data=dta_a, aes(ave_arith, ave_lang),
             method="lm", formula=y ~ x, se=FALSE, color="steelblue") +
 labs(x="Arithmetic score",
      y="Language score") +
 theme_bw()

The light blue regression line is based on individual scores. The dark blue regression line is based on the aggregated (school-average) data, and it has the steeper slope.
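
The slopes can also be compared numerically instead of being read off the figure. The following is a minimal sketch, not part of the original analysis: `compare_slopes()` is a hypothetical helper, and `toy` is a tiny made-up data frame that only mimics the column names `School`, `Lang`, and `Arith` so that the example runs without the CSV file.

```r
# Hypothetical helper: slope of Lang on Arith at the individual level
# and at the school-average level.
# Assumes a data frame with columns School, Lang, and Arith.
compare_slopes <- function(d) {
  # Regression on individual pupils
  fit_ind <- lm(Lang ~ Arith, data = d)

  # Regression on school means
  d_mean <- aggregate(cbind(Lang, Arith) ~ School, data = d, FUN = mean)
  fit_agg <- lm(Lang ~ Arith, data = d_mean)

  c(individual = unname(coef(fit_ind)["Arith"]),
    aggregate  = unname(coef(fit_agg)["Arith"]))
}

# Made-up stand-in data (3 schools, 3 pupils each), not the real scores
toy <- data.frame(
  School = rep(1:3, each = 3),
  Arith  = 1:9,
  Lang   = c(3, 2, 1, 6, 5, 4, 9, 8, 7)
)

slopes <- compare_slopes(toy)
slopes
```

For this toy data the individual-level slope is 0.8 and the school-level slope is 1, so the aggregate line is the steeper one. With the real data loaded as `dta`, the same comparison would be `compare_slopes(dta)`.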

4 The End