Today I hope everyone:

Learns about the open data landscape in DC.
Gains familiarity with R
Builds a foundation of data analysis and visualization in R
Learns about DC through open data!

Find the presentation and data we’re using here: http://bit.ly/2cBtK2S

Who Am I?

Oh, hi, I’m Kate.

I like to analyze and visualize data about the District for my website, DataLensDC. My work’s been featured on CityLab, GGWash, Washingtonian, and Washington City Paper, among others.

I’m the Data Lead for Code for DC, a civic hacking organization you should all join!

I’m also a co-organizer of this year’s Tech Lady Hackathon, a free one-day hackathon and learning workshop for women.

From 9-5 M-F I’m an Economist for the U.S. government.

You can find me on Twitter at the handle @datalensdc or by e-mail at datalensdc@gmail.com.

DC’s Open Data

There is a lot of data about DC publicly available through the federal or city government.

The most comprehensive sites are:

For federal-level data: data.gov
For city-level data: opendata.dc.gov

But these are by no means exhaustive!

Open data has many homes on the web. I’ve created a DC-specific list of data here. Keep an eye out for a new open data website from Code for DC which aims to be an exhaustive repository.

There is also a lot of data not publicly available.

If there is data you are interested in and you think the government has it, file a FOIA request for it. This is not as daunting as it may seem. Here’s my FOIA guide.

Here are some cool things made with open data:

All my DataLensDC blog posts.
Home Fire Risk Map, created by DataKindDC
Per-Student Funding, created by Code for DC
Tons and tons of news stories and visualizations.

RStudio: Where We’ll be Working Today

RStudio is the primary working environment for R programming, and it’s split into four panels.

The bottom left is the Console. Code entered here immediately executes and can’t be saved to file. Let’s try a few simple math equations in the Console!
The top left panel has tabs for R files and Data. Here you can create, edit, and view R files that hold your code. You can also view your data.
The top right panel has tabs for the Environment and History. The Environment tab lists the objects (mostly data) you have loaded for your session. The History tab shows every command you’ve run, whether or not it was saved to a file! History is also searchable.
The bottom right panel includes the Help and Plots tabs. Help holds descriptions and examples for R functions. Plots displays the graphics you’ve created. Other tabs show files and packages.
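
If you’d like something to type into the Console, here is a small sketch of the kind of math we might try. These particular numbers are just illustrations I picked, nothing special about them:

```r
# Basic arithmetic typed straight into the Console
2 + 3        # addition
10 / 4       # division
2^5          # exponent
sqrt(81)     # square root, using a built-in function
(4 + 6) * 3  # parentheses control the order of operations
```

Each line runs as soon as you press Enter, and the answer prints right below it.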

Some of R’s best features don’t come pre-loaded.

R is open source. There is base functionality that comes standard with R, but many of the best features in R come through packages. Packages are created by users and contain data or functions to address a given task in R. To use a package, it must first be installed on your computer. Then you need to load the package once per session.

Today we’re going to be using two packages:

dplyr for manipulating data
ggplot2 for visualizing data

Let’s get those packages!

install.packages("dplyr") #installs package to your computer (the name must be in quotes)
install.packages("ggplot2")

And load them in our current session.

require(dplyr) #loads package to session

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(ggplot2)

## Loading required package: ggplot2

A handful of things to always keep in mind with R

On syntax:

R is always case-sensitive. Syntax is also not always consistent across packages.
- I get many errors solely due to typos. This is always a good first check.
Functions are always formatted as such:
- function (argument1, argument2)
- The function name always goes first and the arguments follow, separated by commas and enclosed in parentheses.
- We’ll see lots of examples!
When your console returns ‘+’ it means the function is still open. You probably forgot a closing parenthesis.
Use # to make comments in R - text that won’t run as code. Be kind to your future self and comment often!
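
Here is a quick sketch of those syntax points in action. The object names (myValue, rounded) are made up for illustration:

```r
# R is case-sensitive: myValue and myvalue are different names
myValue <- 10
exists("myValue")   # TRUE, we just created it
exists("myvalue")   # FALSE in a fresh session, since the lowercase name was never created

# Function name first, then the arguments in parentheses, separated by commas
rounded <- round(3.14159, 2)
rounded             # 3.14
```

If you type myvalue by mistake, R reports that the object is not found, which is usually the first hint of a typo.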

On learning to code in R:

?function or help(function) are your best friends in RStudio.
Search engines and StackOverflow are your best friends outside of RStudio.
There can be many different approaches to solving a single problem.
This is a learned skill and it’s not always easy.

Let’s dive into the data!

The data we’re working with today can be found here. It is DC population statistics by ward from 1990, 2000, and 2010 from NeighborhoodInfoDC, an Urban Institute project.

Let’s get that data into R.

dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")

And we have our first data frame!

A data frame is a table of data made of rows (each observation) and columns (each variable). We’ve just created our first, named dcPop. You should see it in your environment tab to the right.

We also performed one of our first functions - read.csv!

How’d that work?

The new object you’re creating always goes first (dcPop).
<- is the assignment operator. This means everything to the right of <- is used to create the object (“assign value”) on the left.
read.csv reads in a comma-separated value file to create a data frame.
- The function is read.csv and the argument is the location of the file you’re reading in.
- The file can come from a URL or your computer, e.g. "/Users/katerabinowitz/Documents/Talks/CodeHer16/data/dcPopulation.csv"
- for more details and arguments do: ?read.csv
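
If you’re offline, you can still see how read.csv works by handing it CSV text directly. This is only a sketch: the object names csvText and miniPop are made up, and the two rows are the first two observations from our data with most columns dropped.

```r
# A tiny CSV held in a character string instead of a file
csvText <- "ward,totalPop,year
1,71005,1990
2,59457,1990"

# The text argument tells read.csv to read from the string rather than a file
miniPop <- read.csv(text = csvText)

# Check the result: 2 observations of 3 variables
str(miniPop)
```

The first line of the text becomes the variable names, and each following line becomes one row of the data frame.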

How does our data frame look? We can use the str() function to check the structure of the data.

str(dcPop)

## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : int  1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : int  1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...

Wait! Where did that <- go?

When we ran

dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")

It created the data frame dcPop, which appears in our environment and we can open or reference anytime this session. In fact, we just referenced it in the function str(dcPop)!

<- notation created dcPop. See what happens when we just run the read.csv function:

read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")

##    ward totalPop perUnder18 perOver65 perBlack perWhite perHisp perAsianPI
## 1     1    71005       19.0      10.0     60.0     20.0    18.0        1.7
## 2     2    59457        6.3      10.0     19.0     65.0     9.6        5.8
## 3     3    74271       11.0      16.0      4.9     84.0     7.0        4.3
## 4     4    78010       19.0      17.0     79.0     15.0     4.2        0.9
## 5     5    83198       20.0      17.0     86.0     11.0     1.4        0.6
## 6     6    75556       18.0      12.0     66.0     30.0     2.0        1.3
## 7     7    78966       25.0      13.0     96.0      2.4     0.9        0.2
## 8     8    86437       32.0       6.3     92.0      6.4     1.1        0.5
## 9     1    71711       18.0       7.6     47.0     24.0    25.0        3.6
## 10    2    63408        6.2       8.9     15.0     66.0    10.0        8.5
## 11    3    75375       12.0      14.0      6.2     80.0     6.7        6.4
## 12    4    75001       20.0      18.0     71.0     15.0    12.0        1.4
## 13    5    71604       21.0      17.0     88.0      7.4     3.0        0.8
## 14    6    70912       18.0      12.0     64.0     29.0     3.6        2.4
## 15    7    70011       27.0      14.0     97.0      1.5     0.9        0.3
## 16    8    74037       35.0       6.7     93.0      4.9     1.2        0.7
## 17    1    74462       12.0       7.1     33.0     40.0    21.0        5.0
## 18    2    76883        4.8       8.2      9.8     70.0     9.5       10.0
## 19    3    78887       13.0      14.0      5.6     78.0     7.5        8.2
## 20    4    75773       20.0      15.0     59.0     20.0    19.0        2.0
## 21    5    74308       17.0      15.0     77.0     15.0     6.3        1.7
## 22    6    76000       14.0      10.0     43.0     47.0     4.8        5.1
## 23    7    71748       24.0      13.0     95.0      1.5     2.7        0.3
## 24    8    73662       30.0       8.1     94.0      3.2     1.8        0.5
##    year
## 1  1990
## 2  1990
## 3  1990
## 4  1990
## 5  1990
## 6  1990
## 7  1990
## 8  1990
## 9  2000
## 10 2000
## 11 2000
## 12 2000
## 13 2000
## 14 2000
## 15 2000
## 16 2000
## 17 2010
## 18 2010
## 19 2010
## 20 2010
## 21 2010
## 22 2010
## 23 2010
## 24 2010

It works! But nothing is saved. All the data outputs to the console and it can’t be used or referenced again.

When we use the read.csv function we want to save that data to the environment so we can use it again.

With the str() function we don’t need to save or use its output, which describes the structure of our data frame.
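
The same idea works with a single number, which may be easier to see than a whole data frame. A minimal sketch, where root is just a name I made up:

```r
sqrt(16)          # prints 4 to the console, then the value is gone
root <- sqrt(16)  # stores the value in the environment; nothing prints
root              # typing the object's name prints it
root * 2          # and it can be reused in later code
```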

Okay, back to the structure of our data frame.

str(dcPop)

## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : int  1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : int  1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...

This is the structure of our ‘data frame’, or our table of data. Let’s take a moment to walk through it.

There are 9 variables and 24 observations.
Specific variables within a dataset can be referenced by putting a dollar sign between the data frame and variable names, like this: dataframe$variable.

Let’s try:

dcPop$ward

##  [1] 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Variables can come in three different types:
- character (for text variables)
- numeric (for numerical variables, which may be integer or numeric. What’s the difference?)
- factor (for categorical variables, which have a numeric value that refers to a category. For instance, 0 means no, 1 means yes.)
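
To check a type yourself, class() is handy. The sketch below uses made-up values. One thing to note: the codes R stores underneath a factor start at 1, not 0.

```r
class("Ward 1")   # "character"
class(71005L)     # "integer": whole numbers only (the L suffix marks an integer)
class(19.5)       # "numeric": numbers that can hold decimals

# A factor stores categories as labels with integer codes underneath
answers <- factor(c("no", "yes", "yes"))
class(answers)       # "factor"
levels(answers)      # "no" "yes"
as.integer(answers)  # 1 2 2
```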

Changing variable types

The variable ward is currently stored as a number. Does that make sense?

Let’s change it to a factor, along with the variable year.

dcPop<- mutate(dcPop, ward = factor(ward), year=factor(year))

Did it work? Let’s check.

str(dcPop)

## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : Factor w/ 3 levels "1990","2000",..: 1 1 1 1 1 1 1 1 2 2 ...
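
One thing to watch once a number has become a factor: converting it straight back with as.numeric() gives the internal codes, not the original values. A small sketch, with yearFactor as a made-up stand-in for our year variable:

```r
yearFactor <- factor(c(1990, 2000, 2010))

as.numeric(yearFactor)                # 1 2 3, the internal codes
as.numeric(as.character(yearFactor))  # 1990 2000 2010, the original values
```

Going through as.character() first is the usual way to get the real numbers back.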

Let’s get a quick summary of the data.

summary(dcPop)

##       ward      totalPop       perUnder18      perOver65     
##  1      :3   Min.   :59457   Min.   : 4.80   Min.   : 6.300  
##  2      :3   1st Qu.:71684   1st Qu.:12.75   1st Qu.: 8.725  
##  3      :3   Median :74385   Median :18.50   Median :12.500  
##  4      :3   Mean   :74195   Mean   :18.43   Mean   :12.079  
##  5      :3   3rd Qu.:76221   3rd Qu.:21.75   3rd Qu.:15.000  
##  6      :3   Max.   :86437   Max.   :35.00   Max.   :18.000  
##  (Other):6                                                   
##     perBlack        perWhite        perHisp         perAsianPI    
##  Min.   : 4.90   Min.   : 1.50   Min.   : 0.900   Min.   : 0.200  
##  1st Qu.:29.50   1st Qu.: 7.15   1st Qu.: 1.950   1st Qu.: 0.675  
##  Median :65.00   Median :20.00   Median : 5.550   Median : 1.700  
##  Mean   :58.35   Mean   :30.68   Mean   : 7.467   Mean   : 3.008  
##  3rd Qu.:89.00   3rd Qu.:51.50   3rd Qu.: 9.700   3rd Qu.: 5.025  
##  Max.   :97.00   Max.   :84.00   Max.   :25.000   Max.   :10.000  
##                                                                   
##    year  
##  1990:8  
##  2000:8  
##  2010:8  
##          
##          
##          
##

Maybe you’re only interested in a statistic for a single variable, like DC’s total population.

sum(dcPop$totalPop)

## [1] 1780682

Does that sum make sense?

Let’s think about how the data is structured….

Maybe even look at the data by clicking the data frame on the Environment tab.

We want DC’s total population, the sum of each Ward’s population, by year.

dcGroup <- group_by(dcPop, year)
totalPop <- summarise(dcGroup,
                        dcPop = sum(totalPop)
)
totalPop

## # A tibble: 3 x 2
##     year  dcPop
##   <fctr>  <int>
## 1   1990 606900
## 2   2000 572059
## 3   2010 601723
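
The same grouping and summing can also be written in one chain with dplyr’s pipe, %>%, which we haven’t covered today. This is a sketch on a four-row toy table (miniPop and popSum are made-up names; the values are Wards 1 and 2 from our data), so the sums are for two wards only, not all of DC:

```r
library(dplyr)

# Wards 1 and 2 only, for 1990 and 2000
miniPop <- data.frame(
  ward     = c(1, 2, 1, 2),
  totalPop = c(71005, 59457, 71711, 63408),
  year     = c(1990, 1990, 2000, 2000)
)

# Read the pipe as "then": take miniPop, then group by year, then sum within each group
miniTotals <- miniPop %>%
  group_by(year) %>%
  summarise(popSum = sum(totalPop))

miniTotals  # one row per year: 130462 for 1990, 135119 for 2000
```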

Let’s Graph It!

ggplot2 is the most popular R graphics package and what we’ll be using today.

The ggplot() function is how every graphic starts, and the first argument is the dataset you’re using. “aes” refers to the aesthetics of the graph, which at a minimum are the X and Y variables. This sets the graphical environment. Let’s see what that looks like.

ggplot(totalPop, aes(x=year, y=dcPop))

But where’s the data? We haven’t added that yet. After the ggplot() function you can add (literally with a “+”) layers, the first of which is the geometric objects we want to add to the graph. For a bar graph we use geom_bar. We want the bar height to equal DC’s population, our Y, hence stat="identity".

ggplot(totalPop, aes(x=year, y=dcPop)) + 
  geom_bar(stat="identity")

There are many different layers you can add on to change the styling and notation, but for now let’s keep it simple.
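
If you’re curious what those extra layers look like, here is one possible sketch. It rebuilds the yearly totals by hand (totals and p are made-up names) and adds labels with labs() and a cleaner look with theme_minimal(). The title and axis text are my own choices.

```r
library(ggplot2)

# The yearly totals we calculated above, typed in by hand
totals <- data.frame(
  year  = factor(c(1990, 2000, 2010)),
  dcPop = c(606900, 572059, 601723)
)

p <- ggplot(totals, aes(x = year, y = dcPop)) +
  geom_bar(stat = "identity") +                   # bar height equals dcPop
  labs(title = "DC population by year",
       x = "Year", y = "Population") +            # title and axis labels
  theme_minimal()                                 # lighter background styling

p  # printing the object draws the plot
```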

Are there any differences in population changes by ward?

Let’s say I only want to compare 1990 to 2010.

We’ll need to filter the data.

dcPop9010 <- filter(dcPop, year != 2000)

Let’s take a minute to discuss operators

Operator	Definition
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x | y	x OR y
x & y	x AND y
isTRUE(x)	test if x is TRUE
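
Here is a sketch of a few of these operators inside filter(). The toy table miniPop is a made-up name holding four 1990 rows from our data:

```r
library(dplyr)

# Four wards from the 1990 data
miniPop <- data.frame(
  ward       = c(1, 2, 7, 8),
  totalPop   = c(71005, 59457, 78966, 86437),
  perUnder18 = c(19, 6.3, 25, 32)
)

filter(miniPop, totalPop > 75000)                    # keeps wards 7 and 8
filter(miniPop, ward == 2)                           # keeps only ward 2
filter(miniPop, ward != 2)                           # keeps everything except ward 2
filter(miniPop, totalPop > 75000 & perUnder18 < 30)  # both must be true: ward 7
filter(miniPop, ward == 1 | ward == 8)               # either can be true: wards 1 and 8
```

Note the double equals sign for “exactly equal to”. A single = means something different in R.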

Let’s play a game of filters and operators!

Graphing population changes by ward

ggplot(dcPop9010, aes(x=year, y=totalPop, group=ward, colour=ward)) + 
  geom_line()

How much of the population is ‘working age’ (18-65)?

Let’s look at the variables we have.

names(dcPop9010)

## [1] "ward"       "totalPop"   "perUnder18" "perOver65"  "perBlack"  
## [6] "perWhite"   "perHisp"    "perAsianPI" "year"

Okay, so we have a variable for the percent of people under 18, and the percent of people over 65. How do we get the percent of people between 18 and 65?

We can create a new variable based off the variables we already have!

dcPop9010<-mutate(dcPop9010,btwn1865=100-perUnder18-perOver65)
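
To see the arithmetic on its own, here is a sketch with just Wards 1 and 8 from 1990. The names miniPop and wrkAgeCount are made up, and the count column is an extra step of my own that turns the percent back into an approximate number of people:

```r
library(dplyr)

miniPop <- data.frame(
  ward       = c(1, 8),
  totalPop   = c(71005, 86437),
  perUnder18 = c(19, 32),
  perOver65  = c(10, 6.3)
)

miniPop <- mutate(miniPop,
  btwn1865    = 100 - perUnder18 - perOver65,     # percent aged 18 to 65
  wrkAgeCount = round(totalPop * btwn1865 / 100)  # approximate head count
)

miniPop$btwn1865  # 71.0 and 61.7
```

Inside one mutate() call, a new variable can be used right away to build the next one.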

We have lots of variables we don’t need right now. Maybe only keep what we need?

Subsetting data

Filtering data removes specific rows based on values. Subsetting removes specific columns based on their names. Let’s only keep, or select, what we need for this graph.

dcWrkAge<-select(dcPop9010, ward, btwn1865, year)
names(dcWrkAge)

## [1] "ward"     "btwn1865" "year"

Arranging data: When and where are the lowest proportions of working-age population?

Sorting data can help us see the highest and lowest values, and what those observations are. We’ll arrange the data according to the variable btwn1865. Then we’ll output the top values with head() and the bottom with tail().

arranged <- arrange(dcWrkAge,btwn1865)
head(arranged)

##   ward btwn1865 year
## 1    8     61.7 1990
## 2    8     61.9 2010
## 3    7     62.0 1990
## 4    5     63.0 1990
## 5    7     63.0 2010
## 6    4     64.0 1990

tail(arranged)

##    ward btwn1865 year
## 11    3     73.0 1990
## 12    3     73.0 2010
## 13    6     76.0 2010
## 14    1     80.9 2010
## 15    2     83.7 1990
## 16    2     87.0 2010

arrange() defaults to sorting lowest to highest. To flip that, wrap the variable in desc() for descending order, like this:

arranged <- arrange(dcWrkAge,desc(btwn1865))
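
A small sketch of both sort orders side by side, using three rows taken from the output above (miniAge is a made-up name):

```r
library(dplyr)

miniAge <- data.frame(
  ward     = c(8, 2, 7),
  btwn1865 = c(61.7, 87.0, 62.0),
  year     = c(1990, 2010, 1990)
)

arrange(miniAge, btwn1865)        # lowest first: wards 8, 7, 2
arrange(miniAge, desc(btwn1865))  # highest first: wards 2, 7, 8
```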

Let’s see this in a graph

ggplot(dcPop9010, aes(x=ward, y=btwn1865, fill=year)) + 
  geom_bar(stat="identity",position=position_dodge())

Okay, now it’s time for you to play!

There are four datasets here that are structured just like the one we’ve been working with. Pick one and start doing analysis/visualizations on your own or with a partner! I’ll be floating around to help but don’t forget RStudio’s ? function too.

The datasets are:

dcHomeSales: Number of home sales and median sales price by Ward, 1995-2015
dcIncome: Poverty rates and average income by Ward for 1990, 2000, and 2010
dcPopulation: That dataset we’ve already been using! Includes overall population plus race and age by Ward for 1990, 2000, and 2010.
dcSchools: Number of schools and students, both overall and divided into Charter and DCPS, by Ward, 2001-13

Before you leave…

There’s a lot to learn with R and we couldn’t possibly cover it all in two hours! But I do hope everyone feels comfortable and excited about programming in R. There are a number of great free online resources for continuing your R learning after today:

Cookbook for R by Winston Chang: Guide to common tasks and problems in R
Advanced R by Hadley Wickham: Don’t let the ‘Advanced’ scare you
R-Bloggers: The most popular blog about R
R Cheatsheets: Cheatsheets for RStudio, data visualization, wrangling, and more
Swirl: An interactive way to learn R, in R
StackOverflow: All the questions and answers!

And of course, keep in touch!

Analyzing and Visualizing DC Open Data