Ecological fallacy
This dataset contains variables that address the relationship between public school expenditures and academic performance, as measured by the SAT. The scatterplot shows that SAT performance is lower, on average, in high-spending states than in low-spending states. This statistical relationship is misleading, however, because of an omitted variable: the percentage of students taking the exam. Once that percentage is controlled for, the relationship between spending and performance reverses to become both positive and statistically significant.
The variables in this dataset, all aggregated to the state level, were extracted from the 1997 Digest of Education Statistics, an annual publication of the U.S. Department of Education.
Source: Guber, D.L. (1999). Getting What You Pay For: The Debate Over Equity in Public School Expenditures. Journal of Statistics Education, 7(2).
Columns:
1 - 16: Name of state (in quotation marks)
18 - 22: Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)
24 - 27: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994
29 - 34: Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars)
36 - 37: Percentage of all eligible students taking the SAT, 1994-95
39 - 41: Average verbal SAT score, 1994-95
43 - 45: Average math SAT score, 1994-95
47 - 50: Average total score on the SAT, 1994-95
# input data
dta <- read.table("http://www.amstat.org/publications/jse/datasets/sat.dat.txt")
# assign variable names
names(dta) <- c("State", "Expend", "Ratio", "Salary", "Frac", "Verbal", "Math","Sat")
# check data structure
str(dta)
## 'data.frame': 50 obs. of 8 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Expend: num 4.41 8.96 4.78 4.46 4.99 ...
## $ Ratio : num 17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
## $ Salary: num 31.1 48 32.2 28.9 41.1 ...
## $ Frac : int 8 47 27 6 45 29 81 68 48 65 ...
## $ Verbal: int 491 445 448 482 417 462 431 429 420 406 ...
## $ Math : int 538 489 496 523 485 518 477 468 469 448 ...
## $ Sat : int 1029 934 944 1005 902 980 908 897 889 854 ...
# look at the first 6 lines
head(dta)
## State Expend Ratio Salary Frac Verbal Math Sat
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
# load data management and plotting package
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# create a factor variable Fracf with 3 levels from Frac
# (cut() intervals are right-closed: (0, 22], (22, 49], (49, 81])
dta <- mutate(dta,
              Fracf = cut(Frac,
                          breaks = c(0, 22, 49, 81),
                          labels = c("Low", "Medium", "High")))
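As a quick check on the binning, here is a minimal sketch that applies the same breaks to the ten Frac values printed by str() above. The names frac_sample and fracf_sample are made up for this illustration and are not part of the original analysis.

```r
# first ten Frac values, copied from the str() output above
frac_sample <- c(8, 47, 27, 6, 45, 29, 81, 68, 48, 65)

# same breaks and labels as used for dta$Fracf
fracf_sample <- cut(frac_sample,
                    breaks = c(0, 22, 49, 81),
                    labels = c("Low", "Medium", "High"))

# count how many of the ten states fall in each group
table(fracf_sample)
```

Any value outside (0, 81] would be turned into NA by cut(), so on the full data it may be worth confirming that anyNA(dta$Fracf) returns FALSE.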
# plot SAT score against teacher salary,
# with a separate regression line for each Fracf group
ggplot(data=dta, aes(x=Salary, y=Sat, label=State, group=Fracf)) +
  stat_smooth(method="lm",
              formula=y ~ x,
              se=FALSE,
              color="gray",
              linetype=2,
              size=rel(.5)) +
  geom_text(aes(color=Fracf),
            check_overlap=TRUE,
            show.legend=FALSE,
            size=rel(2)) +
  labs(x="Salary ($1000)",
       y="SAT Score") +
  theme_bw()
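The reversal described in the introduction can also be examined with two regressions: one of Sat on Expend alone, and one that adds Frac as a control. The sketch below is only an illustration. To stay self-contained it uses the six rows printed by head(dta) above, stored in a hypothetical data frame dta6, so it says nothing about statistical significance.

```r
# six rows copied from the head(dta) output above (illustration only)
dta6 <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  Expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  Frac   = c(8, 47, 27, 6, 45, 29),
  Sat    = c(1029, 934, 944, 1005, 902, 980)
)

# SAT on spending alone
m0 <- lm(Sat ~ Expend, data = dta6)

# SAT on spending, controlling for the percentage of students taking the exam
m1 <- lm(Sat ~ Expend + Frac, data = dta6)

# compare the sign of the Expend coefficient in the two fits
coef(m0)["Expend"]
coef(m1)["Expend"]
```

With all 50 states, the same two calls can be run with data = dta, and summary(m1) reports the standard errors needed to judge significance.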