Loading Data Into a Dataframe

Overview

Using data from fivethirtyeight.com, we will load a dataset into a dataframe and take a first look at it. The data accompany the article “Where Police Have Killed Americans In 2015” by Ben Casselman, published June 3, 2015, which examines the Guardian’s database of Americans killed by police in 2015. At the time of publication, 467 Americans killed by police had been recorded in the database. Casselman looks at the race/ethnicity of the deceased, as well as the location of each killing, the white share of the population there, and the household income of the area. The insights gleaned focus on how many of the killings happened in some of the poorest areas of the U.S., measured by census tract: about 30% took place in census tracts in the bottom 20 percent nationally by household income. About a quarter of the killings took place in majority-black census tracts. The article was written on the heels of two widely covered events, the killings of Michael Brown in Ferguson, Missouri and Freddie Gray in Baltimore, Maryland.

Raw data taken from the GitHub repository

Loading data and viewing all columns

Here we use the tidyverse for ease of manipulation and the RCurl package to retrieve the data directly from GitHub.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RCurl)

## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete

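# Download the raw CSV as a single string, then parse it into a dataframe
x <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv")
police_killings <- read.csv(text = x)
# Preview every column with its type and first few values
glimpse(police_killings)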

## Rows: 467
## Columns: 34
## $ name                 <chr> "A'donte Washington", "Aaron Rutledge", "Aaron Si…
## $ age                  <chr> "16", "27", "26", "25", "29", "29", "22", "35", "…
## $ gender               <chr> "Male", "Male", "Male", "Male", "Male", "Male", "…
## $ raceethnicity        <chr> "Black", "White", "White", "Hispanic/Latino", "Wh…
## $ month                <chr> "February", "April", "March", "March", "March", "…
## $ day                  <int> 23, 2, 14, 11, 19, 7, 27, 26, 28, 7, 26, 12, 20, …
## $ year                 <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ streetaddress        <chr> "Clearview Ln", "300 block Iris Park Dr", "22nd A…
## $ city                 <chr> "Millbrook", "Pineville", "Kenosha", "South Gate"…
## $ state                <chr> "AL", "LA", "WI", "CA", "OH", "AZ", "CA", "CA", "…
## $ latitude             <dbl> 32.52958, 31.32174, 42.58356, 33.93930, 41.14857,…
## $ longitude            <dbl> -86.36283, -92.43486, -87.83571, -118.21946, -81.…
## $ state_fp             <int> 1, 22, 55, 6, 39, 4, 6, 6, 48, 26, 6, 6, 48, 18, …
## $ county_fp            <int> 51, 79, 59, 37, 153, 13, 29, 37, 41, 81, 31, 59, …
## $ tract_ce             <int> 30902, 11700, 1200, 535607, 530800, 111602, 700, …
## $ geo_id               <dbl> 1051030902, 22079011700, 55059001200, 6037535607,…
## $ county_id            <int> 1051, 22079, 55059, 6037, 39153, 4013, 6029, 6037…
## $ namelsad             <chr> "Census Tract 309.02", "Census Tract 117", "Censu…
## $ lawenforcementagency <chr> "Millbrook Police Department", "Rapides Parish Sh…
## $ cause                <chr> "Gunshot", "Gunshot", "Gunshot", "Gunshot", "Guns…
## $ armed                <chr> "No", "No", "No", "Firearm", "No", "No", "Firearm…
## $ pop                  <int> 3779, 2769, 4079, 4343, 6809, 4682, 5027, 5238, 4…
## $ share_white          <chr> "60.5", "53.8", "73.8", "1.2", "92.5", "7", "50.8…
## $ share_black          <chr> "30.5", "36.2", "7.7", "0.6", "1.4", "7.7", "0.3"…
## $ share_hispanic       <chr> "5.6", "0.5", "16.8", "98.8", "1.7", "79", "44.2"…
## $ p_income             <chr> "28375", "14678", "25286", "17194", "33954", "155…
## $ h_income             <int> 51367, 27972, 45365, 48295, 68785, 20833, 58068, …
## $ county_income        <int> 54766, 40930, 54930, 55909, 49669, 53596, 48552, …
## $ comp_income          <dbl> 0.9379359, 0.6834107, 0.8258693, 0.8638144, 1.384…
## $ county_bucket        <int> 3, 2, 2, 3, 5, 1, 4, 4, 2, 3, 4, 5, 3, 4, 3, 1, 3…
## $ nat_bucket           <int> 3, 1, 3, 3, 4, 1, 4, 4, 1, 2, 3, 5, 3, 2, 2, 1, 3…
## $ pov                  <chr> "14.1", "28.8", "14.6", "11.7", "1.9", "58", "17.…
## $ urate                <dbl> 0.09768638, 0.06572379, 0.16629314, 0.12482727, 0…
## $ college              <dbl> 0.16850951, 0.11140236, 0.14731227, 0.05013293, 0…
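As an aside, readr's read_csv() accepts either a local path or an http(s) URL, so the download and parse steps could in principle be combined without RCurl. The sketch below is a minimal illustration under that assumption; the helper name load_csv is hypothetical, and the two-row stand-in file reuses values from the output above so that the example runs offline.

```r
library(readr)

# Hypothetical helper: read_csv() takes a local path or an http(s) URL.
load_csv <- function(path_or_url) {
  read_csv(path_or_url, show_col_types = FALSE)
}

# Offline demonstration with a two-row stand-in file
# (values copied from the glimpse() output above).
demo_file <- tempfile(fileext = ".csv")
writeLines(
  c("name,day,state",
    "A'donte Washington,23,AL",
    "Aaron Rutledge,2,LA"),
  demo_file
)

demo <- load_csv(demo_file)
demo
```

For the real data, one would pass the raw GitHub URL used earlier to load_csv(). Column types may differ from those produced by read.csv(); for example, readr guesses whole numbers as <dbl> rather than <int>.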

Refined dataframe

Now we take the columns of interest and look at a subset of the data.

police_killings <- police_killings %>%
  select(name, age, gender, raceethnicity, month, day, year, city, state,
         tract_ce, cause, armed, share_white, share_black, h_income, urate) %>%
  rename(census_tract = tract_ce, cause_of_death = cause,
         percent_white = share_white, percent_black = share_black,
         household_income = h_income, unemployment_rate = urate)

glimpse(police_killings)

## Rows: 467
## Columns: 16
## $ name              <chr> "A'donte Washington", "Aaron Rutledge", "Aaron Siler…
## $ age               <chr> "16", "27", "26", "25", "29", "29", "22", "35", "44"…
## $ gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ raceethnicity     <chr> "Black", "White", "White", "Hispanic/Latino", "White…
## $ month             <chr> "February", "April", "March", "March", "March", "Mar…
## $ day               <int> 23, 2, 14, 11, 19, 7, 27, 26, 28, 7, 26, 12, 20, 25,…
## $ year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
## $ city              <chr> "Millbrook", "Pineville", "Kenosha", "South Gate", "…
## $ state             <chr> "AL", "LA", "WI", "CA", "OH", "AZ", "CA", "CA", "TX"…
## $ census_tract      <int> 30902, 11700, 1200, 535607, 530800, 111602, 700, 294…
## $ cause_of_death    <chr> "Gunshot", "Gunshot", "Gunshot", "Gunshot", "Gunshot…
## $ armed             <chr> "No", "No", "No", "Firearm", "No", "No", "Firearm", …
## $ percent_white     <chr> "60.5", "53.8", "73.8", "1.2", "92.5", "7", "50.8", …
## $ percent_black     <chr> "30.5", "36.2", "7.7", "0.6", "1.4", "7.7", "0.3", "…
## $ household_income  <int> 51367, 27972, 45365, 48295, 68785, 20833, 58068, 665…
## $ unemployment_rate <dbl> 0.09768638, 0.06572379, 0.16629314, 0.12482727, 0.06…
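The glimpse above still shows age, percent_white, and percent_black as <chr>, which suggests that some rows hold non-numeric placeholders. Below is a minimal sketch of one way to convert such columns before analysis. The toy values and the placeholder strings "Unknown" and "-" are assumptions for illustration, not confirmed contents of the file.

```r
library(dplyr)

# Toy stand-in for the refined dataframe. The placeholder strings
# "Unknown" and "-" are assumed here, not taken from the real file.
toy_killings <- tibble::tibble(
  age           = c("16", "27", "Unknown"),
  percent_white = c("60.5", "-", "73.8"),
  percent_black = c("30.5", "36.2", "7.7")
)

# as.numeric() maps any non-numeric string to NA; the coercion warning
# is suppressed because those NAs are the intended result.
toy_numeric <- toy_killings %>%
  mutate(across(c(age, percent_white, percent_black),
                ~ suppressWarnings(as.numeric(.x))))

glimpse(toy_numeric)
```

Before applying this to the real dataframe, it would be worth counting the NA values it introduces, since each one marks a row where the original value was not a number.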

Conclusion

In conclusion, we have shown the code needed to pull data from GitHub and load it into a dataframe, along with the steps to subset and rename columns for a ready-to-use dataframe for analysis. From here we can verify the article's finding that 139 of the 467 killings (about 30%) occurred in census tracts in the bottom 20% nationally by household income. With more time we can look at the rates of those killed by gender, race, and whether they were armed, and compare the percent white and percent black populations of the census tracts where the killings took place, to glean more insights.
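The bottom-20% check mentioned above could be sketched as follows. This assumes the raw column nat_bucket encodes the national household-income quintile with 1 as the lowest, which the output shown earlier does not confirm. The refined dataframe drops that column, so it would have to be kept in the select() call first. The toy values are made up for illustration.

```r
library(dplyr)

# Made-up stand-in for the raw data; only the column name comes from the
# glimpse() output. Treating 1 as the lowest national quintile is an assumption.
toy_raw <- tibble::tibble(nat_bucket = c(1L, 1L, 3L, 5L, NA))

bottom_share <- toy_raw %>%
  summarise(
    n_bottom = sum(nat_bucket == 1, na.rm = TRUE),  # rows in the lowest bucket
    n_total  = n(),                                 # all rows, including NA
    share    = n_bottom / n_total
  )

bottom_share
```

On the real data, the figure quoted in the article corresponds to 139 / 467, or roughly 0.30.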

Loading Data Into a Dataframe - Week1 Assignment

Isaias Soto

2025-02-02
