Star Wars characters with yellow eyes:
starwars %>% filter(eye_color == "yellow") %>% nrow()
## [1] 11
Basic plots: age, sex, race
qplot(
deathPenaltyData$Age,
xlab = "Age",
ylab = "Count",
main = "Age of executed prisoners"
)
qplot(Sex, data=deathPenaltyData,
xlab="Sex",
ylab="Count",
main = "Sex of executed prisoners")
Question: what percent of executed prisoners are women?
Answer: use filter() and nrow()!
# how many rows are in the whole data set?
nrow(deathPenaltyData)
## [1] 1442
head(deathPenaltyData)
## # A tibble: 6 x 17
## Date Name Age Sex Race Crime `Victim Count` `Victim Sex`
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 01/1… Gary… 36 Male White Murd… 1 Male
## 2 05/2… John… 30 Male White Murd… 1 Male
## 3 10/2… Jess… 46 Male White Murd… 1 Male
## 4 03/0… Stev… 24 Male White Murd… 4 2 Male, 2 F…
## 5 08/1… Fran… 38 Male White Murd… 1 Male
## 6 12/0… Char… 40 Male Black Murd… 1 Male
## # … with 9 more variables: `Victim Race` <chr>, County <chr>, State <chr>,
## # Region <chr>, Method <chr>, Juvenile <chr>, Volunteer <chr>,
## # Federal <chr>, `Foreign National` <chr>
deathPenaltyData %>% filter(Sex == "Female") %>% nrow()
## [1] 16
# find the proportion (multiply by 100 to get a percent)
16/1442
## [1] 0.0110957
Wow! Of the 1,442 executed prisoners in this data set, only 16 (about 1.1%) were women!
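Typing 16/1442 by hand works, but the numbers go stale if the data ever changes. Here is a minimal sketch of computing the same kind of proportion straight from the data. The tibble `toyExecutions` is made up for illustration (it is not the real data set), and Option 2 assumes the Sex column has no missing values.

```r
library(dplyr)

# Hypothetical toy data standing in for deathPenaltyData (not the real records)
toyExecutions <- tibble(
  Sex = c("Male", "Female", "Male", "Male")
)

# Option 1: count the matching rows, then divide by the total number of rows
nFemale <- toyExecutions %>% filter(Sex == "Female") %>% nrow()
propFemale <- nFemale / nrow(toyExecutions)

# Option 2: the mean of a TRUE/FALSE vector is the proportion of TRUEs
propFemale2 <- mean(toyExecutions$Sex == "Female")

# Multiply by 100 and round to report a percent
pctFemale <- round(100 * propFemale, 1)
pctFemale
```

Swapping `toyExecutions` for `deathPenaltyData` should give the same 1.1% found above, without retyping either count.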
select() lets us choose which variables (columns) we want to keep:
deathPenaltyData %>% head()
## # A tibble: 6 x 17
## Date Name Age Sex Race Crime `Victim Count` `Victim Sex`
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 01/1… Gary… 36 Male White Murd… 1 Male
## 2 05/2… John… 30 Male White Murd… 1 Male
## 3 10/2… Jess… 46 Male White Murd… 1 Male
## 4 03/0… Stev… 24 Male White Murd… 4 2 Male, 2 F…
## 5 08/1… Fran… 38 Male White Murd… 1 Male
## 6 12/0… Char… 40 Male Black Murd… 1 Male
## # … with 9 more variables: `Victim Race` <chr>, County <chr>, State <chr>,
## # Region <chr>, Method <chr>, Juvenile <chr>, Volunteer <chr>,
## # Federal <chr>, `Foreign National` <chr>
deathPenaltyData %>% select(Date, Age, Sex, Race, State, Method, `Victim Count`)
## # A tibble: 1,442 x 7
## Date Age Sex Race State Method `Victim Count`
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 01/17/1977 36 Male White UT Firing Squad 1
## 2 05/25/1979 30 Male White FL Electrocution 1
## 3 10/22/1979 46 Male White NV Gas Chamber 1
## 4 03/09/1981 24 Male White IN Electrocution 4
## 5 08/10/1982 38 Male White VA Electrocution 1
## 6 12/07/1982 40 Male Black TX Lethal Injection 1
## 7 04/22/1983 33 Male White AL Electrocution 1
## 8 09/02/1983 34 Male White MS Gas Chamber 1
## 9 11/30/1983 36 Male White FL Electrocution 1
## 10 12/14/1983 31 Male Black LA Electrocution 1
## # … with 1,432 more rows
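Listing every column by name is not the only way to use select(). Here is a small sketch of two other patterns, dropping a column with a minus sign and picking columns by name prefix with starts_with(). The tibble `toyData` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with a few of the same column names as above
toyData <- tibble(
  Name = c("A", "B"),
  Age = c(36, 30),
  `Victim Count` = c(1, 4),
  `Victim Sex` = c("Male", "Male")
)

# Drop a column by putting a minus sign in front of its name
noNames <- toyData %>% select(-Name)

# Keep every column whose name starts with "Victim"
victimCols <- toyData %>% select(starts_with("Victim"))

names(noNames)
names(victimCols)
```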
# We picked our variables, but don't forget to save the result!
# Here the right-assignment arrow -> stores it in myDeathPenaltyData
deathPenaltyData %>%
select(Date, Age, Sex, Race, State, Method, `Victim Count`) ->
myDeathPenaltyData
# Did it work? Does myDeathPenaltyData have the right stuff?
# Let's look with the head() command: it shows you the first 6 rows
myDeathPenaltyData %>% head()
## # A tibble: 6 x 7
## Date Age Sex Race State Method `Victim Count`
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 01/17/1977 36 Male White UT Firing Squad 1
## 2 05/25/1979 30 Male White FL Electrocution 1
## 3 10/22/1979 46 Male White NV Gas Chamber 1
## 4 03/09/1981 24 Male White IN Electrocution 4
## 5 08/10/1982 38 Male White VA Electrocution 1
## 6 12/07/1982 40 Male Black TX Lethal Injection 1
# Is there other stuff head() can do? Use the help() command!
help("head")
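One thing the help page describes is the `n` argument, which controls how many rows come back. Here is a quick sketch on a made-up data frame (`toyRows` is hypothetical):

```r
# Hypothetical toy data frame with ten rows
toyRows <- data.frame(id = 1:10)

# head() takes an n argument: ask for the first 3 rows instead of the default 6
firstThree <- head(toyRows, n = 3)

# tail() is the mirror image: the last 3 rows
lastThree <- tail(toyRows, n = 3)

firstThree$id
lastThree$id
```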
group_by() and summarise() let us “collapse” data along a variable. Let me show you!
#Find total victim count in each state
#i.e., group by state, then summarise by total:
myDeathPenaltyData %>% group_by(State) %>%
summarise(totalVictimCount = sum(`Victim Count`))
## # A tibble: 35 x 2
## State totalVictimCount
## <chr> <dbl>
## 1 AL 72
## 2 AR 58
## 3 AZ 57
## 4 CA 32
## 5 CO 1
## 6 CT 4
## 7 DE 25
## 8 FE 172
## 9 FL 145
## 10 GA 93
## # … with 25 more rows
# what about average victim count? Use mean() instead of sum()
myDeathPenaltyData %>% group_by(State) %>%
summarise(avgVictimCount = mean(`Victim Count`), totalVictimCount = sum(`Victim Count`))
## # A tibble: 35 x 3
## State avgVictimCount totalVictimCount
## <chr> <dbl> <dbl>
## 1 AL 1.24 72
## 2 AR 2.15 58
## 3 AZ 1.54 57
## 4 CA 2.46 32
## 5 CO 1 1
## 6 CT 4 4
## 7 DE 1.56 25
## 8 FE 57.3 172
## 9 FL 1.58 145
## 10 GA 1.35 93
## # … with 25 more rows
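An average is hard to interpret without knowing how many rows went into it (look at how far FE's average of 57.3 sits from the others). Here is a sketch that adds a row count with n() next to the mean and the sum. The tibble `toyVictims`, its state codes, and its numbers are all made up, so the results can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data: three rows for state "AA", one row for state "BB"
toyVictims <- tibble(
  State = c("AA", "AA", "AA", "BB"),
  `Victim Count` = c(1, 2, 3, 10)
)

toySummary <- toyVictims %>%
  group_by(State) %>%
  summarise(
    nExecutions = n(),                       # how many rows are in each group
    avgVictimCount = mean(`Victim Count`),   # average within the group
    totalVictimCount = sum(`Victim Count`)   # total within the group
  )

toySummary
```

In the toy data, "BB" has the highest average, but it rests on a single row. The `nExecutions` column makes that visible.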
Useful shortcut: count() does both group_by() and summarise() in one step, giving the number of rows (n) in each group:
myDeathPenaltyData %>% count(Race, State)
## # A tibble: 81 x 3
## Race State n
## <chr> <chr> <int>
## 1 Asian CA 1
## 2 Asian NV 1
## 3 Asian OK 2
## 4 Asian TX 2
## 5 Black AL 25
## 6 Black AR 7
## 7 Black AZ 1
## 8 Black CA 2
## 9 Black DE 7
## 10 Black FE 1
## # … with 71 more rows
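To see what the shortcut stands for, here is a sketch comparing count() with the longer group_by() plus summarise() version. The tibble `toyCounts` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toyCounts <- tibble(
  Race = c("Black", "White", "White", "Black", "White"),
  State = c("AA", "AA", "AA", "BB", "BB")
)

# The shortcut
shortWay <- toyCounts %>% count(Race, State)

# The long way: group, count the rows in each group with n(), then ungroup
longWay <- toyCounts %>%
  group_by(Race, State) %>%
  summarise(n = n()) %>%
  ungroup()

# Both approaches should give the same table
all.equal(as.data.frame(shortWay), as.data.frame(longWay))
```

Recent versions of dplyr may print a message about how the summarise() output is grouped. It is informational and does not change the result here.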
Often, we want to break up our visual summaries by group. Ex:
qplot(myDeathPenaltyData$State)
#Ok, what about by race?
qplot(data=myDeathPenaltyData, State, fill=Race)
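The above technique (using fill=Race) colors each bar by a categorical variable, so one plot shows the breakdown by group. Strictly speaking, ggplot2 uses the term “faceting” for splitting a plot into separate panels, but the goal is the same: break up a visual by some categorical variable.
Q: would it make sense to break up the plot by the variable Age?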
qplot(data=myDeathPenaltyData, State, fill=Age)
It doesn’t make sense to break up a plot by a quantitative variable! We need a categorical variable, ideally one without too many categories.
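One way around the problem with Age is to bin it into a few categories first. Below is a minimal sketch, assuming a made-up tibble `toyAges` and arbitrary cut points at 35 and 50 (both hypothetical, chosen only for illustration). It is written with ggplot() and geom_bar(), the longer form of the bar charts above. It also shows facet_wrap(), which draws one panel per category instead of coloring the bars.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (not the real records)
toyAges <- tibble(
  State = c("AA", "AA", "BB", "BB", "AA", "CC"),
  Age = c(24, 36, 41, 58, 63, 30)
)

# Turn the quantitative Age into a categorical AgeGroup.
# The cut points (35 and 50) are arbitrary choices for this example.
toyAges <- toyAges %>%
  mutate(AgeGroup = cut(
    Age,
    breaks = c(0, 35, 50, Inf),
    labels = c("35 or under", "36 to 50", "over 50")
  ))

# Bars colored by age group (same idea as fill=Race above)
plotFill <- ggplot(toyAges, aes(x = State, fill = AgeGroup)) +
  geom_bar()

# One panel per age group
plotPanels <- ggplot(toyAges, aes(x = State)) +
  geom_bar() +
  facet_wrap(~ AgeGroup)

plotFill
plotPanels
```

With only three age groups, both plots stay readable. With dozens of distinct ages, neither would.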