In-class exercise 3: There are two regression lines (light blue and dark blue) in the figure for the language and math example. One is based on aggregates, the other on individual scores. Indicate which is which. Which has the steeper slope?

1 Introduction

Ecological correlations are based on rates or averages rather than on individual observations. They tend to overstate the strength of the association at the individual level.
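To see why averaging can inflate a correlation, here is a minimal simulated sketch. It does not use the langMath data: the number of groups, the group size, and the noise level are all made-up assumptions chosen only for illustration. Two variables share a group-level effect, and averaging within groups removes most of the individual noise that weakens the correlation.

```r
# Simulated illustration only (hypothetical values, not the langMath data)
set.seed(1)
n_groups <- 50   # assumed number of groups
n_per    <- 20   # assumed number of individuals per group
group    <- rep(seq_len(n_groups), each = n_per)

# a group-level effect shared by both variables, plus individual noise
g_eff <- rnorm(n_groups)
x <- g_eff[group] + rnorm(n_groups * n_per, sd = 2)
y <- g_eff[group] + rnorm(n_groups * n_per, sd = 2)

# correlation computed on individuals
r_individual <- cor(x, y)

# ecological correlation computed on group averages
r_ecological <- cor(as.vector(tapply(x, group, mean)),
                    as.vector(tapply(y, group, mean)))

print(c(individual = r_individual, ecological = r_ecological))
```

Under these assumptions the correlation of the group averages should come out noticeably larger than the individual-level correlation.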

The data set consists of grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. There are 2,287 pupils in 131 schools. Class sizes range from 4 to 35. The question of interest is the correlation between scores on an arithmetic test and a language test.

Source: Snijders, T., & Bosker, R. (1999). Multilevel Analysis.

Data: langMath.csv
R: langMath.R.txt

Column 1: School ID
Column 2: Pupil ID
Column 4: Language test score
Column 5: Arithmetic test score

2 Data management

# data management and graphics package
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.0.4     √ dplyr   1.0.2
## √ tidyr   1.1.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# input data
dta <- read.csv("C:/Users/Ching-Fang Wu/Documents/data/langMath.csv", header=TRUE)

# compute averages by school
dta_a <- dta %>%
        group_by(School) %>% # compute within each school
        summarize(ave_lang = mean(Lang, na.rm=TRUE),
                  ave_arith = mean(Arith, na.rm=TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)
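Since the question of interest is the correlation itself, it may help to compute it at both levels. The sketch below is an assumption-laden illustration, not part of the original analysis: `compare_cor()` is a hypothetical helper, the `toy` data frame is made up (it only borrows the column names `School`, `Lang`, and `Arith`), and base R `aggregate()` stands in for the `group_by()`/`summarize()` step so the block runs on its own.

```r
# Hypothetical helper: correlation on individual rows vs. on school averages.
# Expects columns School, Lang, Arith (names taken from the code above).
compare_cor <- function(df) {
  # school averages; note the formula interface drops rows with missing values
  school_means <- aggregate(cbind(Lang, Arith) ~ School, data = df, FUN = mean)
  c(individual = cor(df$Lang, df$Arith, use = "complete.obs"),
    school     = cor(school_means$Lang, school_means$Arith))
}

# Made-up toy data for illustration only (not from langMath.csv)
toy <- data.frame(
  School = rep(c("A", "B", "C"), each = 4),
  Lang   = c(30, 34, 28, 36, 38, 42, 37, 43, 45, 49, 44, 50),
  Arith  = c(14, 12, 15, 11, 18, 17, 20, 17, 23, 22, 25, 22)
)

res <- compare_cor(toy)
print(res)
```

With the real data, the same helper could be called as `compare_cor(dta)`, assuming `dta` has been read in as above.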

3 Plot

# superimpose two plots: individual pupils (light blue) and school averages (dark blue)
ggplot(data=dta, aes(x=Arith, y=Lang)) +
 # individual scores and their regression line
 geom_point(color="skyblue") +
 stat_smooth(method="lm", formula=y ~ x, se=FALSE, color="skyblue") +
 # school averages and their regression line
 geom_point(data=dta_a, aes(x=ave_arith, y=ave_lang), color="steelblue") +
 stat_smooth(data=dta_a, aes(x=ave_arith, y=ave_lang),
             method="lm", formula=y ~ x, se=FALSE, color="steelblue") +
 labs(x="Arithmetic score",
      y="Language score") +
 theme_bw()

The light blue regression line is based on individual scores. The dark blue regression line is based on the aggregated (school-average) data and has the steeper slope.
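The slope comparison can also be checked numerically rather than read off the figure. The following is a rough sketch under stated assumptions: `compare_slopes()` is a hypothetical helper, the `toy_scores` data frame is invented for illustration (only the column names match the data set), and `lm()` fits the same `Lang ~ Arith` model that `stat_smooth(method="lm")` draws.

```r
# Hypothetical helper: slope of Lang on Arith for individuals and for school averages.
compare_slopes <- function(df) {
  # school averages via base R, standing in for group_by()/summarize()
  school_means <- aggregate(cbind(Lang, Arith) ~ School, data = df, FUN = mean)
  c(individual = coef(lm(Lang ~ Arith, data = df))[["Arith"]],
    school     = coef(lm(Lang ~ Arith, data = school_means))[["Arith"]])
}

# Made-up toy data for illustration only (not from langMath.csv)
toy_scores <- data.frame(
  School = rep(c("A", "B", "C"), each = 4),
  Lang   = c(30, 34, 28, 36, 38, 42, 37, 43, 45, 49, 44, 50),
  Arith  = c(14, 12, 15, 11, 18, 17, 20, 17, 23, 22, 25, 22)
)

slopes <- compare_slopes(toy_scores)
print(slopes)
```

In this toy example the school-level slope is the larger of the two, which mirrors the pattern seen in the figure.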

#THE END

W1 in-class exercise 3: Individual correlation vs grouped correlation

Ching-Fang Wu

Thu Jan 07 15:58:54 2021
