Today I hope everyone:

Find the presentation and data we’re using here: http://bit.ly/2cBtK2S



Who Am I?

Oh, hi, I’m Kate.

I like to analyze and visualize data about the District for my website, DataLensDC. My work’s been featured on CityLab, GGWash, Washingtonian, and Washington City Paper, among others.

I’m the Data Lead for Code for DC, a civic hacking organization you should all join!

I’m also a co-organizer of this year’s Tech Lady Hackathon, a free one-day hackathon and learning workshop for women.

From 9-5 M-F I’m an Economist for the U.S. government.

You can find me on Twitter at the handle @datalensdc or by e-mail at datalensdc@gmail.com.



DC’s Open Data

There is a lot of data about DC publicly available through the federal or city government.

The most comprehensive sites are:

But these are by no means exhaustive!

Open data has many homes on the web. I’ve created a DC-specific list of data here. Keep an eye out for a new open data website from Code for DC which aims to be an exhaustive repository.

There is also a lot of data not publicly available.

If there is data you are interested in and you think the government has it, file a FOIA request for it. This is not as daunting as it may seem. Here’s my FOIA guide.

Here are some cool things made with open data:



RStudio: Where We’ll be Working Today

RStudio is the primary working environment for R programming, and it’s split into four panels.

  • The bottom left is the Console. Code entered here immediately executes and can’t be saved to file. Let’s try a few simple math equations in the Console!
  • The top left panel has tabs for R files and Data. Here you can create, edit, and view R files that hold your code. You can also view your data.
  • The top right panel has tabs for the Environment and History. The Environment tab lists the objects (mostly data) you have loaded for your session. The History tab shows every R command you’ve run, whether or not it was saved to a file! History is also searchable.
  • The bottom right panel includes the Help and Plots tabs. Help holds descriptions and examples for R functions. Plots displays the graphics you’ve created. Other tabs show files and packages.
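
For instance, here are a few expressions you might type into the Console, one line at a time. This is only an illustrative sketch; the numbers are arbitrary and not part of the workshop data.

```r
# Type each line into the Console and press Enter; R prints the result right away.
2 + 3        # addition: 5
10 / 4       # division: 2.5
3^2          # exponent: 9
sqrt(16)     # square root, using a built-in function: 4
(2 + 3) * 4  # parentheses control the order of operations: 20
```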



Some of R’s best features don’t come pre-loaded.

R is open source. There is base functionality that comes standard with R, but many of the best features in R come through packages. Packages are created by users and contain data or functions to address a given task in R. To use a package, it must first be installed on your computer. Then you need to load the package once per session.

Today we’re going to be using two packages:

  • dplyr for manipulating data
  • ggplot2 for visualizing data

Let’s get those packages!

install.packages("dplyr") #installs package to your computer (the name must be in quotes)
install.packages("ggplot2")

And load them in our current session.

require(dplyr) #loads package to session
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
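
Since installing only needs to happen once per computer, one optional pattern is to check which packages are missing before installing. The sketch below is an assumption about how you might do that, not part of the workshop script; missingPackages is a hypothetical helper name, and the install line is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: returns the names of packages that cannot be found/loaded.
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

toInstall <- missingPackages(c("dplyr", "ggplot2"))
toInstall   # an empty character vector means both are already installed

# Uncomment to install only what is missing:
# if (length(toInstall) > 0) install.packages(toInstall)
```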



A handful of things to always keep in mind with R

On syntax:

On learning to code in R:


Let’s dive into the data!

The data we’re working with today can be found here. It is DC population statistics by ward from 1990, 2000, and 2010 from NeighborhoodInfoDC, an Urban Institute project.

Let’s get that data into R.

dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")

And we have our first data frame!

A data frame is a table of data made of rows (each observation) and columns (each variable). We’ve just created our first, named dcPop. You should see it in your environment tab to the right.

We also ran our first function: read.csv!

How’d that work?

  • The new object you’re creating always goes first (dcPop).
  • <- is the assignment operator. This means everything to the right of <- is used to create the object (“assign value”) on the left.
  • read.csv reads in a comma-separated value file to create a data frame.
    • The function is read.csv and the argument is the location of the file you’re reading in.
    • The file can come from a URL or your computer, e.g. ("/Users/katerabinowitz/Documents/Talks/CodeHer16/data/dcPopulation.csv")
    • for more details and arguments do: ?read.csv
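
To see assignment and read.csv working together without a network connection, here is a small sketch that reads CSV text held in a string. The values and the names csvText and toyPop are made up for illustration; only read.csv and its text argument come from base R.

```r
# Made-up CSV content (not the workshop data)
csvText <- "ward,totalPop
1,100
2,200"

# Everything to the right of <- builds the object named on the left
toyPop <- read.csv(text = csvText)

toyPop          # typing an object's name prints it
nrow(toyPop)    # number of rows (observations)
names(toyPop)   # column (variable) names
```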



How does our data frame look? We can use the str() function to check the structure of the data.

str(dcPop)
## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : int  1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : int  1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...

Wait! Where did that <- go?

When we ran

dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")

It created the data frame dcPop, which appears in our environment and we can open or reference anytime this session. In fact, we just referenced it in the function str(dcPop)!

<- notation created dcPop. See what happens when we just run the read.csv function:

read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")
##    ward totalPop perUnder18 perOver65 perBlack perWhite perHisp perAsianPI
## 1     1    71005       19.0      10.0     60.0     20.0    18.0        1.7
## 2     2    59457        6.3      10.0     19.0     65.0     9.6        5.8
## 3     3    74271       11.0      16.0      4.9     84.0     7.0        4.3
## 4     4    78010       19.0      17.0     79.0     15.0     4.2        0.9
## 5     5    83198       20.0      17.0     86.0     11.0     1.4        0.6
## 6     6    75556       18.0      12.0     66.0     30.0     2.0        1.3
## 7     7    78966       25.0      13.0     96.0      2.4     0.9        0.2
## 8     8    86437       32.0       6.3     92.0      6.4     1.1        0.5
## 9     1    71711       18.0       7.6     47.0     24.0    25.0        3.6
## 10    2    63408        6.2       8.9     15.0     66.0    10.0        8.5
## 11    3    75375       12.0      14.0      6.2     80.0     6.7        6.4
## 12    4    75001       20.0      18.0     71.0     15.0    12.0        1.4
## 13    5    71604       21.0      17.0     88.0      7.4     3.0        0.8
## 14    6    70912       18.0      12.0     64.0     29.0     3.6        2.4
## 15    7    70011       27.0      14.0     97.0      1.5     0.9        0.3
## 16    8    74037       35.0       6.7     93.0      4.9     1.2        0.7
## 17    1    74462       12.0       7.1     33.0     40.0    21.0        5.0
## 18    2    76883        4.8       8.2      9.8     70.0     9.5       10.0
## 19    3    78887       13.0      14.0      5.6     78.0     7.5        8.2
## 20    4    75773       20.0      15.0     59.0     20.0    19.0        2.0
## 21    5    74308       17.0      15.0     77.0     15.0     6.3        1.7
## 22    6    76000       14.0      10.0     43.0     47.0     4.8        5.1
## 23    7    71748       24.0      13.0     95.0      1.5     2.7        0.3
## 24    8    73662       30.0       8.1     94.0      3.2     1.8        0.5
##    year
## 1  1990
## 2  1990
## 3  1990
## 4  1990
## 5  1990
## 6  1990
## 7  1990
## 8  1990
## 9  2000
## 10 2000
## 11 2000
## 12 2000
## 13 2000
## 14 2000
## 15 2000
## 16 2000
## 17 2010
## 18 2010
## 19 2010
## 20 2010
## 21 2010
## 22 2010
## 23 2010
## 24 2010

It works! But nothing is saved. All the data outputs to the console and it can’t be used or referenced again.

When we use the read.csv function we want to save that data to the environment so we can use it again.

With the str() function we don’t need to save or reuse its output, which simply describes the structure of our data frame.
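
The same idea in miniature, as a sketch with an arbitrary number: a result that isn’t assigned is printed and then gone, while an assigned result stays in the environment. The name rootOf49 is hypothetical.

```r
sqrt(49)              # prints the result, but nothing is stored
exists("rootOf49")    # FALSE in a fresh session: no object by that name yet

rootOf49 <- sqrt(49)  # assignment stores the result and prints nothing
exists("rootOf49")    # TRUE: the object now lives in the environment
rootOf49 + 1          # and it can be reused in later code
```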



Okay, back to the structure of our data frame.

str(dcPop)
## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : int  1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : int  1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...

This is the structure of our ‘data frame’, or our table of data. Let’s take a moment to walk through it.

  • There are 9 variables and 24 observations.

  • Specific variables within a dataset can be referenced by putting a dollar sign between the data frame and variable names, like this: dataframe$variable.

Let’s try:

dcPop$ward
##  [1] 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
  • Variables commonly come in three types:
    • character (for text variables)
    • numeric (for numerical variables, which may be stored as integer or numeric. What’s the difference?)
    • factor (for categorical variables, where each numeric code refers to a category. For instance, 0 means no and 1 means yes.)
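
A quick way to explore these types is the class() function. The sketch below uses made-up values; note that R numbers a factor’s categories starting from 1, so the 0/1 coding above is best read as a conceptual example.

```r
class("Ward 1")   # "character"
class(59457L)     # "integer": whole numbers (the L suffix marks an integer)
class(6.3)        # "numeric": numbers that can hold decimals

# A factor stores category labels plus an integer code for each one
answers <- factor(c("no", "yes", "yes"))
class(answers)       # "factor"
levels(answers)      # "no" "yes"
as.integer(answers)  # 1 2 2
```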



Changing variable types

The variable ward is currently stored as a number. Does that make sense?

Let’s change it to a factor, along with the variable year.

dcPop<- mutate(dcPop, ward = factor(ward), year=factor(year))

Did it work? Let’s check.

str(dcPop)
## 'data.frame':    24 obs. of  9 variables:
##  $ ward      : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ totalPop  : int  71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
##  $ perUnder18: num  19 6.3 11 19 20 18 25 32 18 6.2 ...
##  $ perOver65 : num  10 10 16 17 17 12 13 6.3 7.6 8.9 ...
##  $ perBlack  : num  60 19 4.9 79 86 66 96 92 47 15 ...
##  $ perWhite  : num  20 65 84 15 11 30 2.4 6.4 24 66 ...
##  $ perHisp   : num  18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
##  $ perAsianPI: num  1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
##  $ year      : Factor w/ 3 levels "1990","2000",..: 1 1 1 1 1 1 1 1 2 2 ...
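
Why does the type matter? One way to see it is to summarise the same ward numbers stored both ways. This is a base R sketch with a shortened, illustrative vector (wardNum and wardFac are hypothetical names), not the workshop’s mutate() call.

```r
wardNum <- c(1, 2, 3, 1, 2, 3)   # ward labels stored as numbers
summary(wardNum)                 # min, mean, max... not meaningful for labels

wardFac <- factor(wardNum)       # the same labels stored as categories
summary(wardFac)                 # a count of rows per ward instead
levels(wardFac)                  # "1" "2" "3"
```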



Let’s get a quick summary of the data.

summary(dcPop)
##       ward      totalPop       perUnder18      perOver65     
##  1      :3   Min.   :59457   Min.   : 4.80   Min.   : 6.300  
##  2      :3   1st Qu.:71684   1st Qu.:12.75   1st Qu.: 8.725  
##  3      :3   Median :74385   Median :18.50   Median :12.500  
##  4      :3   Mean   :74195   Mean   :18.43   Mean   :12.079  
##  5      :3   3rd Qu.:76221   3rd Qu.:21.75   3rd Qu.:15.000  
##  6      :3   Max.   :86437   Max.   :35.00   Max.   :18.000  
##  (Other):6                                                   
##     perBlack        perWhite        perHisp         perAsianPI    
##  Min.   : 4.90   Min.   : 1.50   Min.   : 0.900   Min.   : 0.200  
##  1st Qu.:29.50   1st Qu.: 7.15   1st Qu.: 1.950   1st Qu.: 0.675  
##  Median :65.00   Median :20.00   Median : 5.550   Median : 1.700  
##  Mean   :58.35   Mean   :30.68   Mean   : 7.467   Mean   : 3.008  
##  3rd Qu.:89.00   3rd Qu.:51.50   3rd Qu.: 9.700   3rd Qu.: 5.025  
##  Max.   :97.00   Max.   :84.00   Max.   :25.000   Max.   :10.000  
##                                                                   
##    year  
##  1990:8  
##  2000:8  
##  2010:8  
##          
##          
##          
## 

Maybe you’re only interested in a statistic for a single variable, like DC’s total population.

sum(dcPop$totalPop)
## [1] 1780682



Does that sum make sense?

Let’s think about how the data is structured….

Maybe even look at the data by clicking the data frame on the Environment tab.

We want DC’s total population, the sum of each Ward’s population, by year.

dcGroup <- group_by(dcPop, year)
totalPop <- summarise(dcGroup,
                        dcPop = sum(totalPop)
)
totalPop
## # A tibble: 3 x 2
##     year  dcPop
##   <fctr>  <int>
## 1   1990 606900
## 2   2000 572059
## 3   2010 601723
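
The arithmetic behind those numbers can be checked by hand. The sketch below re-enters the ward totals printed earlier and uses base R’s tapply() instead of group_by() and summarise(), purely as a cross-check; pop, yr, and byYear are illustrative names.

```r
# Ward populations copied from the printed data: wards 1-8 for each year
pop <- c(71005, 59457, 74271, 78010, 83198, 75556, 78966, 86437,   # 1990
         71711, 63408, 75375, 75001, 71604, 70912, 70011, 74037,   # 2000
         74462, 76883, 78887, 75773, 74308, 76000, 71748, 73662)   # 2010
yr <- rep(c(1990, 2000, 2010), each = 8)

sum(pop)                        # 1780682: three census years added together

byYear <- tapply(pop, yr, sum)  # sum the populations separately for each year
byYear                          # 1990: 606900, 2000: 572059, 2010: 601723
```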



Let’s graph it!

Are there any differences in population changes by ward?

Let’s say I only want to compare 1990 to 2010.

We’ll need to filter the data.

dcPop9010 <- filter(dcPop, year != 2000)

Let’s take a minute to discuss operators

Operator     Definition
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
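
Here is a sketch of a few of these operators applied to the 1990 ward populations printed earlier. It uses base R vectors rather than filter(), and the names ward and pop1990 are just for illustration.

```r
ward    <- 1:8
pop1990 <- c(71005, 59457, 74271, 78010, 83198, 75556, 78966, 86437)

pop1990 > 80000                             # one TRUE/FALSE per ward
ward[pop1990 > 80000]                       # wards above 80,000: 5 8
ward[pop1990 < 60000 | pop1990 > 80000]     # below 60,000 OR above 80,000: 2 5 8
ward[pop1990 >= 75000 & pop1990 <= 80000]   # between 75,000 AND 80,000: 4 6 7
ward[ward != 3]                             # every ward except 3
```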

Let’s play a game of filters and operators!



Graphing population changes by ward

ggplot(dcPop9010, aes(x=year, y=totalPop, group=ward, colour=ward)) + 
  geom_line()



How much of the population is ‘working age’ (18-65)?

Let’s look at the variables we have.

names(dcPop9010)
## [1] "ward"       "totalPop"   "perUnder18" "perOver65"  "perBlack"  
## [6] "perWhite"   "perHisp"    "perAsianPI" "year"

Okay, so we have a variable for the percent of people under 18, and one for the percent of people over 65. How do we get the percent of people between 18 and 65?

We can create a new variable based off the variables we already have!

dcPop9010<-mutate(dcPop9010,btwn1865=100-perUnder18-perOver65)
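
As a worked check of that formula, here are two rows from the 1990 data printed earlier (wards 8 and 2), computed with plain vectors. A sketch only; the vector names simply mirror the column names.

```r
perUnder18 <- c(32, 6.3)   # wards 8 and 2 in 1990
perOver65  <- c(6.3, 10)

# Whatever share is not under 18 and not over 65 is the working-age share
btwn1865 <- 100 - perUnder18 - perOver65
btwn1865                   # 61.7 83.7
```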

We have lots of variables we don’t need right now. Maybe only keep what we need?

Subsetting data

Filtering keeps or drops rows based on their values. Subsetting keeps or drops columns based on their names. Let’s keep, or select, only what we need for this graph.

dcWrkAge<-select(dcPop9010, ward, btwn1865, year)
names(dcWrkAge)
## [1] "ward"     "btwn1865" "year"
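
To make the rows-versus-columns distinction concrete, here is a base R sketch on a tiny made-up table. It uses square-bracket indexing, [rows, columns], rather than dplyr’s filter() and select(); the data frame toy and its values are hypothetical.

```r
toy <- data.frame(
  ward = c(1, 2, 1, 2),
  year = c(1990, 1990, 2000, 2000),
  pop  = c(10, 20, 30, 40)
)

# Filtering: keep the rows where a condition is TRUE; all columns stay
rowsKept <- toy[toy$year != 2000, ]
rowsKept

# Subsetting: keep the named columns; all rows stay
colsKept <- toy[, c("ward", "pop")]
colsKept
```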

Arranging data: When and where are the working-age shares of the population lowest?

Sorting data can help us see the highest and lowest values, and what those observations are. We’ll arrange the data according to the variable btwn1865. Then we’ll output the first rows (the lowest values) with head() and the last rows (the highest values) with tail().

arranged <- arrange(dcWrkAge,btwn1865)
head(arranged)
##   ward btwn1865 year
## 1    8     61.7 1990
## 2    8     61.9 2010
## 3    7     62.0 1990
## 4    5     63.0 1990
## 5    7     63.0 2010
## 6    4     64.0 1990
tail(arranged)
##    ward btwn1865 year
## 11    3     73.0 1990
## 12    3     73.0 2010
## 13    6     76.0 2010
## 14    1     80.9 2010
## 15    2     83.7 1990
## 16    2     87.0 2010

arrange() sorts from lowest to highest by default. To flip that, wrap the variable in desc() for descending order, like this:

arranged <- arrange(dcWrkAge,desc(btwn1865))
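
The same ascending-versus-descending idea can be sketched with base R’s sort() on four of the 1990 values shown earlier. The named vector btwn is illustrative and is not how the workshop code stores the data.

```r
btwn <- c(ward8 = 61.7, ward2 = 83.7, ward3 = 73.0, ward4 = 64.0)

sort(btwn)                     # lowest to highest (the default)
sort(btwn, decreasing = TRUE)  # highest to lowest
head(sort(btwn), 2)            # the two lowest values: ward8, then ward4
```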

Let’s see this in a graph

ggplot(dcPop9010, aes(x=ward, y=btwn1865, fill=year)) + 
  geom_bar(stat="identity",position=position_dodge())


Okay, now it’s time for you to play!

There are four datasets here that are structured just like the one we’ve been working with. Pick one and start doing analysis and visualizations on your own or with a partner! I’ll be floating around to help, but don’t forget R’s ? help function too.

The datasets are:

We’ll share what we’ve worked on before the workshop ends. But also keep it up after today and let me know what you find!


Before you leave…

There’s a lot to learn with R and we couldn’t possibly cover it all in two hours! But I do hope everyone feels comfortable and excited about programming in R. There are a number of great free online resources for continuing your R learning after today:

And of course, keep in touch!