Go ahead and launch the R Studio Server, and open a new R Markdown file. Recall, to do this:
Once you have opened the document:
Finally, save your new document:
lab_02_lastname_firstname
You will hand in a knitted html file as your problem set. It is OK if your lab report includes the example code from the lab, as well as your Exercises. Just be sure to make a header to label each Exercise. Please type your code to answer the questions in a code chunk (gray part) under the exercise headers, and type (BRIEF) answers to any interpretation questions in the white part under the headers.
R Packages are like apps on your cell phone - they are tools for accomplishing common tasks. R is an open-source programming language, meaning that people can contribute packages that make our lives easier, and we can use them for free. For this problem set we will use the following R packages:
dplyr: for data wrangling
ggplot2: for data visualization
These packages are already installed for you in this project. Every time you open a new R session you need to load (open) any packages you want to use. We do this with the library function. Copy, paste and run the following in a code chunk (see the figure above if you forget how to insert a code chunk).
library(ggplot2)
library(dplyr)
Remember, “running code” means telling R “do this”. You tell R to do something by passing code to the console. You can run existing code in several ways:
Control-Enter on PC or Command-Return on a Mac (easiest method)
Today we will practice data visualization using data on police stops in San Jose from 2013 to 2018. Copy, paste and run the code below to load the data.
sanjose <- readRDS("tr137st9964_ca_san_jose_2019_02_25.rds")
The data set that shows up in your Environment is a large data frame. Each observation or case is a police stop.
You can see the dimensions of this data frame (# of rows and columns), the names of the variables, the variable types, and the first few observations using the glimpse function. Copy, paste, and run the following in a new code chunk.
glimpse(sanjose)
We can see that there are 158,935 observations and 17 variables in this data set. The variable names are date, time, location, etc. The output also shows the type of each variable: whole numbers are integers <int>, numbers with decimals are doubles <dbl>, categorical variables are factors <fct>, and text is stored as characters <chr>. It is good practice to check whether R is treating each variable as a factor, a number, or a character.
subject_race to be? What variable type is search_conducted? (answer with text)
You can view the data by clicking on the name sanjose in the Environment pane (upper right window). This will bring up an alternative display of the data set in the Data Viewer (upper left window). R has stored these data in a kind of spreadsheet called a data frame. Each row represents a different police stop: the first entry or column in each row is simply the row number, the rest are the different variables that were recorded for each stop. You can close the data viewer by clicking on the x in the upper left hand corner.
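Variable types can also be checked from code rather than read off the glimpse output. The sketch below uses a small made-up data frame (toy_stops and its values are hypothetical, not the sanjose data) to show class() on one column and on every column at once.

```r
# Hypothetical toy data: column names and values are invented for illustration.
toy_stops <- data.frame(
  type     = factor(c("pedestrian", "vehicular", "vehicular")),
  n_stops  = c(3L, 12L, 7L),
  flagged  = c(TRUE, FALSE, TRUE),
  location = c("1st St", "Main St", "Oak Ave"),
  stringsAsFactors = FALSE
)

# class() reports how R stores a single column
class(toy_stops$type)

# sapply() applies class() to every column and returns one type per column
col_types <- sapply(toy_stops, class)
col_types
```

Running the same two calls on your own data frame gives the type of each of its variables.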
It is a good idea to try knitting your document from time to time as you go along! Go ahead and make sure your document is knitting, and that your html file includes Exercise headers, text, and code. Note that knitting automatically saves your Rmd file too!
An initial question might be: what is the demographic breakdown of police stops? This question can be answered numerically and graphically.
Numerically, we can show the results in a table. The table() function takes as input the variable to be tabulated.
The following command has R compute the number of police stops broken down by subject_race for the entire sanjose data frame.
table(sanjose$subject_race)
##
## asian/pacific islander black hispanic
## 16578 13984 82693
## other/unknown white
## 12036 27217
If we prefer to see this summary as proportions of the total we can nest the table function inside the prop.table() function.
prop.table(table(sanjose$subject_race))
##
## asian/pacific islander black hispanic
## 0.10870249 0.09169355 0.54222074
## other/unknown white
## 0.07892045 0.17846277
If you want to see fewer decimal places you can round the entire prop.table using the round function.
round(prop.table(table(sanjose$subject_race)),2)
##
## asian/pacific islander black hispanic
## 0.11 0.09 0.54
## other/unknown white
## 0.08 0.18
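To see what each of the three nested functions contributes, here is a minimal sketch on a made-up vector of six stops (the vector race is hypothetical). It also checks that prop.table() is just each count divided by the total.

```r
# Hypothetical toy vector of six stops
race <- factor(c("hispanic", "white", "hispanic", "black", "hispanic", "white"))

counts  <- table(race)           # number of stops at each level
props   <- prop.table(counts)    # each count divided by the total
by_hand <- counts / sum(counts)  # the same calculation written out

counts
round(props, 2)                  # rounding is applied last, for display only
```

Because rounding happens last, the stored proportions stay exact and only the printed values are shortened.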
prop.table over the table and vice-versa.
The visual representation of a table is a bar graph, made with geom_bar in R. Note that in a bar graph, the x variable should be a <fct> and is supplied as an aesthetic of the ggplot. R calculates the totals for each of the levels of the factor.
ggplot(sanjose, aes(x=subject_race))+
geom_bar()+
geom_text(stat="count", aes(label=..count.., y=..count..+2000)) # adds labels above bars
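If you prefer to see the proportions, add y = ..prop.. and group = 1 to ggplot’s aesthetics. Setting group = 1 makes R compute each proportion out of all stops; without it, the proportion is computed within each bar, so every bar would have height 1.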
ggplot(sanjose, aes(x=subject_race, y= ..prop.., group=1))+
geom_bar()+
geom_text(stat="count", aes(label=round(..prop.., 2), y=..prop..+ 0.02)) ##adds labels
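Newer releases of ggplot2 offer after_stat() as a replacement for the ..count.. and ..prop.. notation. The sketch below assumes ggplot2 3.3.0 or later and uses a made-up data frame (toy_stops is hypothetical); it also uses vjust to lift the labels above the bars instead of adding a fixed offset to y.

```r
library(ggplot2)

# Hypothetical toy data: six stops
toy_stops <- data.frame(
  subject_race = factor(c("hispanic", "white", "hispanic", "black", "hispanic", "white"))
)

# Counts: after_stat(count) refers to the count computed by stat = "count"
p_count <- ggplot(toy_stops, aes(x = subject_race)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)

# Proportions: group = 1 makes each proportion a share of all rows
p_prop <- ggplot(toy_stops, aes(x = subject_race, y = after_stat(prop), group = 1)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = round(after_stat(prop), 2)), vjust = -0.5)

p_count
p_prop
```

If your installed ggplot2 is older than 3.3.0, after_stat() will not be available and the ..count.. form is the one to use.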
README.md file)
A natural next question is: what proportion of stops result in a search being conducted?
A follow-up question would be, do all races get searched at the same rate?
We can answer that by creating a contingency table with 2 factors, subject_race and search_conducted, which counts the stops in which a search was or was not conducted, broken down by race.
table(sanjose$subject_race, sanjose$search_conducted)
##
## FALSE TRUE
## asian/pacific islander 14210 2238
## black 9248 4584
## hispanic 53717 28449
## other/unknown 7547 1625
## white 20328 6651
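The prop.table function is a little more complicated with 2 factors, because there are three possible denominators. With no second argument, each cell is divided by the grand total; with 1, each cell is divided by its row total; with 2, each cell is divided by its column total. Let’s look at all three variations.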
prop.table(table(sanjose$subject_race, sanjose$search_conducted))
##
## FALSE TRUE
## asian/pacific islander 0.09562777 0.01506087
## black 0.06223544 0.03084854
## hispanic 0.36149451 0.19145070
## other/unknown 0.05078837 0.01093562
## white 0.13679953 0.04475864
prop.table(table(sanjose$subject_race, sanjose$search_conducted),1)
##
## FALSE TRUE
## asian/pacific islander 0.8639348 0.1360652
## black 0.6685946 0.3314054
## hispanic 0.6537619 0.3462381
## other/unknown 0.8228304 0.1771696
## white 0.7534749 0.2465251
prop.table(table(sanjose$subject_race, sanjose$search_conducted),2)
##
## FALSE TRUE
## asian/pacific islander 0.13526892 0.05139275
## black 0.08803427 0.10526558
## hispanic 0.51134698 0.65329414
## other/unknown 0.07184198 0.03731600
## white 0.19350785 0.15273153
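The three variations differ only in what each cell is divided by. A minimal sketch on made-up data (the vectors race and searched are hypothetical) confirms which totals add up to 1 in each case.

```r
# Hypothetical toy data: six stops
race     <- c("hispanic", "hispanic", "hispanic", "white", "white", "black")
searched <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

tab <- table(race, searched)

overall <- prop.table(tab)     # divide by the grand total: all cells sum to 1
by_row  <- prop.table(tab, 1)  # divide by row totals: each race sums to 1
by_col  <- prop.table(tab, 2)  # divide by column totals: FALSE and TRUE each sum to 1

sum(overall)
rowSums(by_row)
colSums(by_col)
```

To compare search rates across races, the row version is the one to read: its TRUE column is the share of each race's stops that involved a search.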
geom_bar
To create the bar graph, the x variable is the explanatory variable, and the levels of the response variable determine the height of each bar segment. Bar graphs of two factors can be made with at least three different position settings: stacked (the default), dodged, and filled.
ggplot(sanjose, aes(x=subject_race, fill=search_conducted))+
geom_bar()
Dodged Bar Graph
ggplot(sanjose, aes(x=subject_race, fill=search_conducted))+
geom_bar(position="dodge")
ggplot(sanjose, aes(x=subject_race, fill=search_conducted))+
geom_bar(position="fill")
subject_race and report on what you discover. I would like everyone to try an analysis on your own - I think this is where you can have some fun and really develop your skills. Your analysis should include at least a table and a graph with discussion of your observations.
sanjose_ped <- sanjose %>%
filter(type=="pedestrian")
ggplot(data=sanjose_ped, aes(x=subject_race, fill=outcome))+
geom_bar(position="dodge")
prop.table(table(sanjose_ped$subject_race,sanjose_ped$outcome, useNA="ifany"),1)
##
## warning citation summons arrest
## asian/pacific islander 0.00000000 0.19221557 0.00000000 0.12335329
## black 0.00000000 0.15320000 0.00000000 0.09820000
## hispanic 0.00000000 0.12231559 0.00000000 0.08720822
## other/unknown 0.00000000 0.14564831 0.00000000 0.12611012
## white 0.00000000 0.14097628 0.00000000 0.09116880
## <NA> 0.00000000 0.18485742 0.00000000 0.12094395
##
## <NA>
## asian/pacific islander 0.68443114
## black 0.74860000
## hispanic 0.79047619
## other/unknown 0.72824156
## white 0.76785492
## <NA> 0.69419862
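The useNA = "ifany" argument matters because table() drops missing values by default, which changes the denominator of every proportion. A small sketch on a made-up vector (outcome here is hypothetical data, not the lab's column) shows the difference.

```r
# Hypothetical toy vector: half of the outcomes are missing
outcome <- c("citation", NA, "arrest", NA, NA, "citation")

without_na <- table(outcome)                   # NAs dropped: 3 values counted
with_na    <- table(outcome, useNA = "ifany")  # adds an <NA> category: 6 values counted

prop.table(without_na)  # proportions of the 3 known outcomes
prop.table(with_na)     # proportions of all 6 rows, including <NA>
```

Which version is appropriate depends on the question: shares of recorded outcomes, or shares of all stops.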
Analyze whether there are any time-dependent trends in the data.
Use the lubridate package.
library(lubridate)
Create a new variable year for both data frames
sanjose <- sanjose %>%
mutate(year = year(date))
sanjose_ped <- sanjose_ped %>%
mutate(year = year(date))
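As a rough illustration of what the new year variable contains, the sketch below extracts the calendar year from a few made-up dates using base R only (stop_dates is hypothetical, and format() stands in for lubridate's year() so the example runs without extra packages). It also shows that a missing date produces a missing year.

```r
# Hypothetical dates, including one missing value
stop_dates <- as.Date(c("2013-09-01", "2018-03-31", NA))

# "%Y" asks for the four-digit year; as.integer() turns the text into a number
stop_years <- as.integer(format(stop_dates, "%Y"))
stop_years
```

A stop with no recorded date therefore ends up in its own NA group when the data are later grouped by year.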
How has the number of police stops per year changed? (note data for 2013 and 2018 is incomplete)
table(sanjose$year, sanjose$type)
##
## pedestrian vehicular
## 2013 4914 9782
## 2014 11597 22459
## 2015 8900 20536
## 2016 6647 18304
## 2017 5318 23229
## 2018 1471 6682
How has the racial breakdown of all police stops changed over the past 5 years?
summarysj <-sanjose %>%
group_by(year) %>%
summarize(prop.hispanic=round(length(which(subject_race=="hispanic"))/n(),2),
prop.black =round(length(which(subject_race=="black"))/n(),2),
prop.asian=round(length(which(subject_race=="asian/pacific islander"))/n(),2),
prop.white=round(length(which(subject_race=="white"))/n(),2))
summarysj
| year | prop.hispanic | prop.black | prop.asian | prop.white |
|---|---|---|---|---|
| 2013 | 0.42 | 0.08 | 0.08 | 0.14 |
| 2014 | 0.54 | 0.10 | 0.09 | 0.18 |
| 2015 | 0.55 | 0.09 | 0.10 | 0.17 |
| 2016 | 0.55 | 0.09 | 0.10 | 0.18 |
| 2017 | 0.51 | 0.09 | 0.13 | 0.18 |
| 2018 | 0.50 | 0.08 | 0.13 | 0.19 |
| NA | 1.00 | 0.00 | 0.00 | 0.00 |
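The proportions above divide by n(), the number of rows in each year, so stops with a missing subject_race still count in the denominator. A minimal sketch on a made-up vector (race is hypothetical, and length(race) plays the role of n()) contrasts this with a proportion of the known values only.

```r
# Hypothetical toy vector with one missing race
race <- c("hispanic", "white", NA, "hispanic")

# Same form as in the summarize() call: matches divided by all rows
share_all_rows <- length(which(race == "hispanic")) / length(race)

# Alternative: matches divided by rows with a known race
share_known <- mean(race == "hispanic", na.rm = TRUE)

share_all_rows  # 2 of 4 rows
share_known     # 2 of 3 known values
```

Neither is wrong, but they answer slightly different questions, so it helps to say which denominator a reported proportion uses.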
Visualize the table
ggplot(data=summarysj, aes(x=year))+
  geom_line(aes(y=prop.hispanic), color="green")+ # map each column to y inside aes()
  geom_line(aes(y=prop.black), color="red")+
  geom_line(aes(y=prop.asian), color="blue")+
  geom_line(aes(y=prop.white), color="yellow")+
  ylab("proportion of stops") # one shared label for all four lines
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
Now let’s focus on just the police stops of Hispanic subjects.
sanjose %>%
filter(subject_race == "hispanic") %>%
group_by(year) %>%
summarize(count=n(), prop_search =round(length(which(search_conducted=="TRUE"))/n(),2))
| year | count | prop_search |
|---|---|---|
| 2013 | 8185 | 0.40 |
| 2014 | 21140 | 0.39 |
| 2015 | 18384 | 0.35 |
| 2016 | 15440 | 0.31 |
| 2017 | 15225 | 0.29 |
| 2018 | 4318 | 0.31 |
| NA | 1 | 1.00 |
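The filter, group_by, summarize pattern can be tried on a tiny made-up data frame where the answer is easy to check by hand. In this sketch toy_stops is hypothetical, and it assumes search_conducted is stored as TRUE/FALSE with no missing values, so that mean() gives the proportion of searches directly.

```r
library(dplyr)

# Hypothetical toy data: five stops across two years
toy_stops <- data.frame(
  year             = c(2013, 2013, 2014, 2014, 2014),
  subject_race     = c("hispanic", "hispanic", "hispanic", "white", "hispanic"),
  search_conducted = c(TRUE, FALSE, TRUE, TRUE, TRUE)
)

toy_summary <- toy_stops %>%
  filter(subject_race == "hispanic") %>%       # keep only the rows of interest
  group_by(year) %>%                           # one group per year
  summarize(count = n(),                       # rows in each group
            prop_search = mean(search_conducted))  # share of TRUE values

toy_summary
```

Here 2013 has two stops with one search, and 2014 has two remaining stops that were both searched, which matches what the summary reports.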
Submit your problem set html file on Canvas. This involves downloading the html file from the R Studio Server to your personal computer. The steps to do this are as follows: