1 Learning target

Ecological fallacy

2 Introduction

This dataset contains variables that address the relationship between public school expenditures and academic performance, as measured by the SAT. The scatterplot shows that SAT performance is lower, on average, in high-spending states than in low-spending states. This statistical relationship is misleading because of an omitted variable: the percentage of students taking the exam. Once that percentage is controlled for, the relationship between spending and performance reverses to become both positive and statistically significant.

The variables in this dataset, all aggregated to the state level, were extracted from the 1997 Digest of Education Statistics, an annual publication of the U.S. Department of Education.

Source: Guber, D.L. (1999). Getting What You Pay For: The Debate Over Equity in Public School Expenditures. The Journal of Statistics Education, 7(2).

Columns
1 - 16: Name of state (in quotation marks)
18 - 22: Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)
24 - 27: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994
29 - 34: Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars)
36 - 37: Percentage of all eligible students taking the SAT, 1994-95
39 - 41: Average verbal SAT score, 1994-95
43 - 45: Average math SAT score, 1994-95
47 - 50: Average total score on the SAT, 1994-95
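The column positions listed above describe a fixed-width layout, so the file could also be read by position rather than by whitespace. The following is a minimal sketch, not the method used in this document: the sample line is built with sprintf() to mimic the described layout (an assumption about the raw file), and the names sample_line, tmp, and rec are hypothetical.

```r
# Field widths taken from the column positions above; negative widths
# skip the single blank column between fields
widths <- c(16, -1, 5, -1, 4, -1, 6, -1, 2, -1, 3, -1, 3, -1, 4)
vars   <- c("State", "Expend", "Ratio", "Salary", "Frac", "Verbal", "Math", "Sat")

# One synthetic record formatted to match the described layout (assumption)
sample_line <- sprintf("%-16s %5.3f %4.1f %6.3f %2d %3d %3d %4d",
                       '"Alabama"', 4.405, 17.2, 31.144, 8L, 491L, 538L, 1029L)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_line, tmp)

# Read by column position, then drop any quotation marks and padding
# left around the state name
rec <- read.fwf(tmp, widths = widths, col.names = vars, strip.white = TRUE)
rec$State <- trimws(gsub('"', "", rec$State))
rec
```

Reading by position is mainly useful when a field may contain embedded blanks; for this file, whitespace-delimited reading with quoted state names is usually enough.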

3 Data management

# input data
dta <- read.table("http://www.amstat.org/publications/jse/datasets/sat.dat.txt")
# assign variable names
names(dta) <- c("State", "Expend", "Ratio", "Salary", "Frac", "Verbal", "Math","Sat")
# check data structure
str(dta)
## 'data.frame':    50 obs. of  8 variables:
##  $ State : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Expend: num  4.41 8.96 4.78 4.46 4.99 ...
##  $ Ratio : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
##  $ Salary: num  31.1 48 32.2 28.9 41.1 ...
##  $ Frac  : int  8 47 27 6 45 29 81 68 48 65 ...
##  $ Verbal: int  491 445 448 482 417 462 431 429 420 406 ...
##  $ Math  : int  538 489 496 523 485 518 477 468 469 448 ...
##  $ Sat   : int  1029 934 944 1005 902 980 908 897 889 854 ...
# look at the first 6 lines
head(dta)
##        State Expend Ratio Salary Frac Verbal Math  Sat
## 1    Alabama  4.405  17.2 31.144    8    491  538 1029
## 2     Alaska  8.963  17.6 47.951   47    445  489  934
## 3    Arizona  4.778  19.3 32.175   27    448  496  944
## 4   Arkansas  4.459  17.1 28.934    6    482  523 1005
## 5 California  4.992  24.0 41.078   45    417  485  902
## 6   Colorado  5.443  18.4 34.571   29    462  518  980
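
The omitted-variable argument from the introduction can be checked by comparing the slope for Expend in a regression with and without Frac as a covariate. The sketch below is an illustration under assumptions: expend_slopes() is a hypothetical helper, and demo holds only the six rows printed by head(dta), so its estimates are not those reported by Guber (1999).

```r
# Six rows transcribed from the head(dta) output above (illustration only)
demo <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  Expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  Frac   = c(8, 47, 27, 6, 45, 29),
  Sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Hypothetical helper: slope of Expend before and after adjusting for Frac
expend_slopes <- function(d) {
  unadjusted <- lm(Sat ~ Expend, data = d)
  adjusted   <- lm(Sat ~ Expend + Frac, data = d)
  c(unadjusted = unname(coef(unadjusted)["Expend"]),
    adjusted   = unname(coef(adjusted)["Expend"]))
}

expend_slopes(demo)
```

With the full data frame loaded, the same comparison would be expend_slopes(dta); summary() on either model gives the standard errors needed to judge significance.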

4 Visualization

# load data management and plotting package
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# create a factor variable, Fracf, with 3 levels from Frac
dta <- mutate(dta, Fracf = cut(Frac, breaks = c(0, 22, 49, 81), labels = c("Low", "Medium", "High")))
# plot
ggplot(data=dta, aes(x=Salary, y=Sat, label=State, group=Fracf)) +
 stat_smooth(method="lm", 
             formula= y ~ x,
             se=FALSE,
             color="gray", 
             linetype=2, 
             size=rel(.5)) +
 geom_text(aes(color=Fracf), 
           check_overlap=TRUE, 
           show.legend=FALSE, 
           size=rel(2)) +
 labs(x="Salary ($1000)", 
      y="SAT Score") +
 theme_bw()
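
The grouping in the plot depends on how cut() assigns each Frac value to a bin. As a small check, the sketch below applies the same breaks and labels to the first ten Frac values shown in the str(dta) output; the names frac and fracf are used only for this illustration.

```r
# First ten Frac values, as shown in the str(dta) output
frac <- c(8, 47, 27, 6, 45, 29, 81, 68, 48, 65)

# Same breaks and labels as in the mutate() call; by default the intervals
# are right-closed: (0,22], (22,49], (49,81]
fracf <- cut(frac, breaks = c(0, 22, 49, 81),
             labels = c("Low", "Medium", "High"))

# Pair each value with its bin, then count the values per bin
data.frame(Frac = frac, Fracf = fracf)
table(fracf)
```

Because the intervals are right-closed, a value of 81 falls in the top bin, while a value of 0 or anything above 81 would become NA and drop out of the grouped plot.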

5 The End