1. We will use the data on California schools used in class. For this, load the package AER (you will have to install it first, but do not copy that line in the lab), and load the dataset as in class.
library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("CASchools")
  1. Compute the distribution of the read score conditional on the county being “Los Angeles” (pay attention to quotes). Call it read_la.
library(tidyverse)
colnames(CASchools)
##  [1] "district"    "school"      "county"      "grades"      "students"   
##  [6] "teachers"    "calworks"    "lunch"       "computer"    "expenditure"
## [11] "income"      "english"     "read"        "math"
read_la <- CASchools[CASchools$county == "Los Angeles",]
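# Put the histogram on the density scale so it is comparable with the density curve
ggplot(read_la, aes(x = read)) + geom_histogram(aes(y = after_stat(density)), bins = 30) + geom_density()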

  1. Now create a new variable equal to the number of computers per student and add it to the original dataset CASchools.
# The column is named `computer` (see colnames above), not `computers`.
# Districts with zero computers are valid observations, so they are kept.
CASchools$students[CASchools$students == 0] <- NA  # guard against division by zero
CASchools$computers_per_student <- CASchools$computer / CASchools$students
  1. Define a new variable indicating if more than 10% of students have a computer.
# 1 if there is more than one computer per ten students, 0 otherwise
CASchools$computer_more_than_10 <- ifelse(CASchools$computers_per_student > 0.1, 1, 0)
  1. Plot two distributions of the read score: the distribution conditional on less than 10% of the students having a computer, and the distribution conditional on more than 10% of students having a computer. Make sure to define the labels.
library(ggplot2)
read_less_than_10 <- CASchools[CASchools$computer_more_than_10 == 0,]
read_more_than_10 <- CASchools[CASchools$computer_more_than_10 == 1,]
ggplot() +
  # Map fill inside aes() so that scale_fill_manual() can build the legend
  geom_density(data = read_less_than_10, aes(x = read, fill = "< 10%"), alpha = 0.5) +
  geom_density(data = read_more_than_10, aes(x = read, fill = "> 10%"), alpha = 0.5) +
  xlab("Read Score") +
  ylab("Density") +
  ggtitle("Reading Score Distribution") +
  # No x limits of c(0, 100): the read scores lie well above 100 and would all be dropped
  scale_fill_manual(name = "Computer Ratio", values = c("< 10%" = "red", "> 10%" = "blue"))

  1. Can you add the population distribution? How do the distributions differ?
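# A layer must be added to a plot; calling geom_density() on its own only prints the layer object
ggplot() +
  geom_density(data = read_less_than_10, aes(x = read, fill = "< 10%"), alpha = 0.5) +
  geom_density(data = read_more_than_10, aes(x = read, fill = "> 10%"), alpha = 0.5) +
  geom_density(data = CASchools, aes(x = read, fill = "All districts"), alpha = 0.5) +
  labs(x = "Read Score", y = "Density", fill = "Computer Ratio") +
  scale_fill_manual(values = c("< 10%" = "red", "> 10%" = "blue", "All districts" = "black"))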
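  1. Are the distributions different? How? Yes. All three are distributions of the read score, but the population distribution pools every district, so it is a weighted average of the two conditional distributions and lies between them. How far apart the two conditional densities sit shows how strongly read scores vary with computer access.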

  2. Make a table showing how the mean value of the read score differs by the share of students with a computer (more or less than 10%).
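One possible way to build such a table is sketched below with base R's aggregate(). The helper name mean_read_by_computer_share and the toy data frame are assumptions made for illustration only; they are not part of AER or of the lab, and the toy values are made up. The function expects the CASchools column names computer, students and read.

```r
# Sketch: mean read score by computer-share group, using base R only.
# `mean_read_by_computer_share` is a hypothetical helper name.
mean_read_by_computer_share <- function(df) {
  ratio <- df$computer / df$students
  # Same 10% threshold as in the lab: more than 0.1 computers per student
  group <- ifelse(ratio > 0.1, "> 10%", "<= 10%")
  out <- aggregate(df$read, by = list(computer_share = group), FUN = mean)
  names(out)[2] <- "mean_read"
  out
}

# Toy data with made-up values, only to show the shape of the result.
toy <- data.frame(
  computer = c(5, 30, 10, 50),
  students = c(100, 100, 200, 250),
  read     = c(640, 660, 650, 670)
)
print(mean_read_by_computer_share(toy))
```

With the lab's data, the same helper would presumably be called as mean_read_by_computer_share(CASchools) after data("CASchools"); the result has one row per group and a mean_read column.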