I like to analyze and visualize data about the District for my website, DataLensDC. My work’s been featured on CityLab, GGWash, Washingtonian, and Washington City Paper, among others.
I’m the Data Lead for Code for DC, a civic hacking organization you should all join!
I’m also a co-organizer of this year’s Tech Lady Hackathon, a free one-day hackathon and learning workshop for women.
From 9-5 M-F I’m an Economist for the U.S. government.
You can find me on Twitter at the handle @datalensdc or by e-mail at datalensdc@gmail.com.
The most comprehensive sites are:
For federal-level data: data.gov
For city-level data: opendata.dc.gov
But these are by no means exhaustive!
Open data has many homes on the web. I’ve created a DC-specific list of data here. Keep an eye out for a new open data website from Code for DC which aims to be an exhaustive repository.
If there is data you are interested in and you think the government has it, file a FOIA request for it. This is not as daunting as it may seem. Here’s my FOIA guide.
R is open source. There is base functionality that comes standard with R, but many of the best features in R come through packages. Packages are created by users and contain data or functions to address a given task in R. To use a package, it must first be installed on your computer. Then you need to load the package once per session.
install.packages("dplyr") #installs package to your computer (the name must be in quotes)
install.packages("ggplot2")
And load them in our current session.
require(dplyr) #loads package to session
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
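Since installing only has to happen once per computer, it can be handy to check what is already there before installing again. Here is a minimal sketch of that check. The helper name missing_packages is my own invention for illustration and is not part of R, dplyr, or ggplot2; requireNamespace() is the base R function doing the real work.

```r
# Hypothetical helper (not part of any package): return the names in `pkgs`
# that are not yet installed on this computer.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE if not
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))

if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment this line to install them
} else {
  message("dplyr and ggplot2 are already installed.")
}
```

The install line is left commented out so the sketch never downloads anything unless you decide it should.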
On syntax:
When your console returns ‘+’ it means the function is still open. You probably forgot a closing parenthesis.
Use # to make comments in R - text that won’t run as code. Be kind to your future self and comment often!
On learning to code in R:
?function or help(function) are your best friends in RStudio.
Search engines and StackOverflow are your best friends outside of RStudio.
There can be many different approaches to solving a single problem.
This is a learned skill and it’s not always easy.
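In practice, "function" in the help tips above stands for the name of whatever function you are stuck on. A small sketch using read.csv and mean, two functions that come with R:

```r
?read.csv          # open the help page for read.csv
help("read.csv")   # the same page, written as a function call
example("mean")    # run the examples listed at the bottom of mean's help page
```

In RStudio the help page opens in the Help tab. Outside RStudio it may print to the console instead.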
The data we’re working with today can be found here. It is DC population statistics by ward from 1990, 2000, and 2010 from NeighborhoodInfoDC, an Urban Institute project.
dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")
A data frame is a table of data made of rows (each observation) and columns (each variable). We’ve just created our first, named dcPop. You should see it in your environment tab to the right.
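To make the rows-and-columns idea concrete, here is a small sketch that builds a data frame by hand. The name toy is made up for illustration. The values are the first three 1990 rows of the workshop data, with only three of the nine columns kept.

```r
# Each argument to data.frame() becomes one column (variable);
# each position within the vectors becomes one row (observation).
toy <- data.frame(
  ward     = c(1, 2, 3),
  totalPop = c(71005, 59457, 74271),
  year     = c(1990, 1990, 1990)
)

toy        # print the whole table
nrow(toy)  # number of rows (observations): 3
ncol(toy)  # number of columns (variables): 3
```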
How’d that work?
str(dcPop)
## 'data.frame': 24 obs. of 9 variables:
## $ ward : int 1 2 3 4 5 6 7 8 1 2 ...
## $ totalPop : int 71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
## $ perUnder18: num 19 6.3 11 19 20 18 25 32 18 6.2 ...
## $ perOver65 : num 10 10 16 17 17 12 13 6.3 7.6 8.9 ...
## $ perBlack : num 60 19 4.9 79 86 66 96 92 47 15 ...
## $ perWhite : num 20 65 84 15 11 30 2.4 6.4 24 66 ...
## $ perHisp : num 18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
## $ perAsianPI: num 1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
## $ year : int 1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...
When we ran
dcPop<-read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")
It created the data frame dcPop, which appears in our environment and we can open or reference anytime this session. In fact, we just referenced it in the function str(dcPop)!
The <- notation created dcPop. See what happens when we just run the read.csv function:
read.csv("https://raw.githubusercontent.com/katerabinowitz/CodeHer16/master/data/dcPopulation.csv")
## ward totalPop perUnder18 perOver65 perBlack perWhite perHisp perAsianPI
## 1 1 71005 19.0 10.0 60.0 20.0 18.0 1.7
## 2 2 59457 6.3 10.0 19.0 65.0 9.6 5.8
## 3 3 74271 11.0 16.0 4.9 84.0 7.0 4.3
## 4 4 78010 19.0 17.0 79.0 15.0 4.2 0.9
## 5 5 83198 20.0 17.0 86.0 11.0 1.4 0.6
## 6 6 75556 18.0 12.0 66.0 30.0 2.0 1.3
## 7 7 78966 25.0 13.0 96.0 2.4 0.9 0.2
## 8 8 86437 32.0 6.3 92.0 6.4 1.1 0.5
## 9 1 71711 18.0 7.6 47.0 24.0 25.0 3.6
## 10 2 63408 6.2 8.9 15.0 66.0 10.0 8.5
## 11 3 75375 12.0 14.0 6.2 80.0 6.7 6.4
## 12 4 75001 20.0 18.0 71.0 15.0 12.0 1.4
## 13 5 71604 21.0 17.0 88.0 7.4 3.0 0.8
## 14 6 70912 18.0 12.0 64.0 29.0 3.6 2.4
## 15 7 70011 27.0 14.0 97.0 1.5 0.9 0.3
## 16 8 74037 35.0 6.7 93.0 4.9 1.2 0.7
## 17 1 74462 12.0 7.1 33.0 40.0 21.0 5.0
## 18 2 76883 4.8 8.2 9.8 70.0 9.5 10.0
## 19 3 78887 13.0 14.0 5.6 78.0 7.5 8.2
## 20 4 75773 20.0 15.0 59.0 20.0 19.0 2.0
## 21 5 74308 17.0 15.0 77.0 15.0 6.3 1.7
## 22 6 76000 14.0 10.0 43.0 47.0 4.8 5.1
## 23 7 71748 24.0 13.0 95.0 1.5 2.7 0.3
## 24 8 73662 30.0 8.1 94.0 3.2 1.8 0.5
## year
## 1 1990
## 2 1990
## 3 1990
## 4 1990
## 5 1990
## 6 1990
## 7 1990
## 8 1990
## 9 2000
## 10 2000
## 11 2000
## 12 2000
## 13 2000
## 14 2000
## 15 2000
## 16 2000
## 17 2010
## 18 2010
## 19 2010
## 20 2010
## 21 2010
## 22 2010
## 23 2010
## 24 2010
It works! But nothing is saved. All the data outputs to the console and it can’t be used or referenced again.
When we use the read.csv function we want to save that data to the environment so we can use it again.
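The same difference between printing and saving shows up with any function, not just read.csv. A tiny sketch with sqrt(), where root is just a name I picked:

```r
sqrt(16)           # the result prints to the console and is then gone

root <- sqrt(16)   # the result is saved under the name `root`; nothing prints
root               # typing the name prints the saved value: 4
root + 1           # and the saved value can be reused: 5
```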
With the str() function we don’t need to save or use its output, which describes the structure of our data frame.
str(dcPop)
## 'data.frame': 24 obs. of 9 variables:
## $ ward : int 1 2 3 4 5 6 7 8 1 2 ...
## $ totalPop : int 71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
## $ perUnder18: num 19 6.3 11 19 20 18 25 32 18 6.2 ...
## $ perOver65 : num 10 10 16 17 17 12 13 6.3 7.6 8.9 ...
## $ perBlack : num 60 19 4.9 79 86 66 96 92 47 15 ...
## $ perWhite : num 20 65 84 15 11 30 2.4 6.4 24 66 ...
## $ perHisp : num 18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
## $ perAsianPI: num 1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
## $ year : int 1990 1990 1990 1990 1990 1990 1990 1990 2000 2000 ...
This is the structure of our ‘data frame’, or our table of data. Let’s take a moment to walk through it.
There are 9 variables and 24 observations.
Specific variables within a dataset can be referenced by putting a dollar sign between the data frame and variable names, like this: dataframe$variable.
Let’s try:
dcPop$ward
## [1] 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
The variable ward is currently stored as a number. Does that make sense?
Let’s change it to a factor, along with the variable year.
dcPop<- mutate(dcPop, ward = factor(ward), year=factor(year))
str(dcPop)
## 'data.frame': 24 obs. of 9 variables:
## $ ward : Factor w/ 8 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ totalPop : int 71005 59457 74271 78010 83198 75556 78966 86437 71711 63408 ...
## $ perUnder18: num 19 6.3 11 19 20 18 25 32 18 6.2 ...
## $ perOver65 : num 10 10 16 17 17 12 13 6.3 7.6 8.9 ...
## $ perBlack : num 60 19 4.9 79 86 66 96 92 47 15 ...
## $ perWhite : num 20 65 84 15 11 30 2.4 6.4 24 66 ...
## $ perHisp : num 18 9.6 7 4.2 1.4 2 0.9 1.1 25 10 ...
## $ perAsianPI: num 1.7 5.8 4.3 0.9 0.6 1.3 0.2 0.5 3.6 8.5 ...
## $ year : Factor w/ 3 levels "1990","2000",..: 1 1 1 1 1 1 1 1 2 2 ...
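Why bother with factors? A quick sketch on a made-up vector of ward labels shows the difference. Stored as numbers, R will happily average ward labels, which means nothing. Stored as a factor, R treats them as categories it can count.

```r
wardNum <- c(1, 2, 3, 1, 2, 3)   # ward labels stored as numbers

mean(wardNum)                    # runs, but "the average ward" is meaningless

wardFac <- factor(wardNum)       # the same labels stored as categories
levels(wardFac)                  # the distinct categories: "1" "2" "3"
table(wardFac)                   # how many rows fall in each category
```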
summary(dcPop)
## ward totalPop perUnder18 perOver65
## 1 :3 Min. :59457 Min. : 4.80 Min. : 6.300
## 2 :3 1st Qu.:71684 1st Qu.:12.75 1st Qu.: 8.725
## 3 :3 Median :74385 Median :18.50 Median :12.500
## 4 :3 Mean :74195 Mean :18.43 Mean :12.079
## 5 :3 3rd Qu.:76221 3rd Qu.:21.75 3rd Qu.:15.000
## 6 :3 Max. :86437 Max. :35.00 Max. :18.000
## (Other):6
## perBlack perWhite perHisp perAsianPI
## Min. : 4.90 Min. : 1.50 Min. : 0.900 Min. : 0.200
## 1st Qu.:29.50 1st Qu.: 7.15 1st Qu.: 1.950 1st Qu.: 0.675
## Median :65.00 Median :20.00 Median : 5.550 Median : 1.700
## Mean :58.35 Mean :30.68 Mean : 7.467 Mean : 3.008
## 3rd Qu.:89.00 3rd Qu.:51.50 3rd Qu.: 9.700 3rd Qu.: 5.025
## Max. :97.00 Max. :84.00 Max. :25.000 Max. :10.000
##
## year
## 1990:8
## 2000:8
## 2010:8
##
##
##
##
sum(dcPop$totalPop)
## [1] 1780682
You can also look at the data by clicking the data frame in the Environment tab.
We want DC’s total population, the sum of each Ward’s population, by year.
dcGroup <- group_by(dcPop, year)
totalPop <- summarise(dcGroup, dcPop = sum(totalPop))
totalPop
## # A tibble: 3 x 2
## year dcPop
## <fctr> <int>
## 1 1990 606900
## 2 2000 572059
## 3 2010 601723
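To see what group_by() and summarise() are doing, it can help to try them on a table small enough to add up by hand. This sketch uses only the Ward 1 and Ward 2 rows for 1990 and 2000, copied from the data above. The names toy, byYear, toyTotals, and pop are made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  ward     = c(1, 2, 1, 2),
  totalPop = c(71005, 59457, 71711, 63408),
  year     = c(1990, 1990, 2000, 2000)
)

byYear    <- group_by(toy, year)                    # one group per year
toyTotals <- summarise(byYear, pop = sum(totalPop)) # one output row per group

toyTotals
# 1990: 71005 + 59457 = 130462
# 2000: 71711 + 63408 = 135119
```

A similar sanity check works on the full result: 606900 + 572059 + 601723 = 1780682, the same figure sum(dcPop$totalPop) returned earlier.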
The ggplot() function is how every graphic starts, and the first argument is the dataset you’re using. “aes” refers to the aesthetic of the graph, which at a minimum is the X and Y variables. This sets the graphical environment. Let’s see what that looks like.
ggplot(totalPop, aes(x=year, y=dcPop))
But where’s the data? We haven’t added that yet. After the ggplot() function you can add (literally with a “+”) layers, the first of which is the geometric objects we want to add to the graph. For a bar graph we use geom_bar(). We want the bar height to equal DC’s population, our Y, hence stat="identity".
ggplot(totalPop, aes(x=year, y=dcPop)) +
geom_bar(stat="identity")
There are many different layers you can add on to change the styling and notation, but for now let’s keep it simple.
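For anyone curious what one extra layer looks like, here is a small sketch that adds a title and axis labels with labs(), a ggplot2 function. The data frame totals is rebuilt by hand from the yearly totals printed above so the example runs on its own. The label text is my own choice.

```r
library(ggplot2)

# Yearly totals copied from the summarise() output above
totals <- data.frame(
  year  = factor(c(1990, 2000, 2010)),
  dcPop = c(606900, 572059, 601723)
)

p <- ggplot(totals, aes(x = year, y = dcPop)) +
  geom_bar(stat = "identity") +                 # the bars
  labs(title = "DC population by year",         # one more layer: labels
       x = "Year", y = "Population")

p   # printing the saved plot draws it
```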
We’ll need to filter the data.
dcPop9010 <- filter(dcPop, year != 2000)
| Operator | Definition |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | not x |
| x \| y | x OR y |
| x & y | x AND y |
| isTRUE(x) | test if x is TRUE |
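Here is a short sketch of a few of these operators at work on a plain vector of years. Each comparison returns TRUE or FALSE for every element, and filter() keeps the rows where the answer is TRUE.

```r
year <- c(1990, 2000, 2010)

year != 2000                  # TRUE FALSE TRUE
year == 1990 | year == 2010   # TRUE FALSE TRUE
year > 1990 & year < 2010     # FALSE TRUE FALSE
!(year == 2000)               # TRUE FALSE TRUE
isTRUE(year[1] == 1990)       # TRUE
```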
ggplot(dcPop9010, aes(x=year, y=totalPop, group=ward, colour=ward)) +
geom_line()
Let’s look at the variables we have.
names(dcPop9010)
## [1] "ward" "totalPop" "perUnder18" "perOver65" "perBlack"
## [6] "perWhite" "perHisp" "perAsianPI" "year"
Okay, so we have a variable for the percent of people under 18, and the percent of people over 65. How do we get the percent of people between 18 and 65?
We can create a new variable based off the variables we already have!
dcPop9010<-mutate(dcPop9010,btwn1865=100-perUnder18-perOver65)
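As a quick check of that arithmetic, here is the calculation done by hand for a single row, Ward 8 in 1990, using the values from the data above. The last line goes one step further than the workshop code and turns the percent into an approximate head count. That step is my own addition.

```r
perUnder18 <- 32      # Ward 8, 1990
perOver65  <- 6.3
totalPop   <- 86437

btwn1865 <- 100 - perUnder18 - perOver65
btwn1865                            # 61.7

# Percent of the ward times the ward's population, rounded to whole people
round(totalPop * btwn1865 / 100)    # 53332
```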
We have lots of variables we don’t need right now. Maybe only keep what we need?
Filtering data removes specific rows based on values. Subsetting removes specific columns based on their names. Let’s only keep, or select, what we need for this graph.
dcWrkAge<-select(dcPop9010, ward, btwn1865, year)
names(dcWrkAge)
## [1] "ward" "btwn1865" "year"
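One way to keep filter() and select() straight is to watch what each does to the shape of a table. A sketch on a small hand-built data frame, where toy, rowsKept, and colsKept are names made up for illustration:

```r
library(dplyr)

toy <- data.frame(
  ward     = c(1, 2, 1, 2),
  totalPop = c(71005, 59457, 71711, 63408),
  year     = c(1990, 1990, 2000, 2000)
)

rowsKept <- filter(toy, year == 1990)   # fewer rows, same columns
colsKept <- select(toy, ward, year)     # same rows, fewer columns

dim(toy)        # 4 rows, 3 columns
dim(rowsKept)   # 2 rows, 3 columns
dim(colsKept)   # 4 rows, 2 columns
```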
Sorting data can help us see the highest and lowest values, and what those observations are. We’ll arrange the data according to the variable btwn1865. Then we’ll output the first rows with head() and the last rows with tail().
arranged <- arrange(dcWrkAge,btwn1865)
head(arranged)
## ward btwn1865 year
## 1 8 61.7 1990
## 2 8 61.9 2010
## 3 7 62.0 1990
## 4 5 63.0 1990
## 5 7 63.0 2010
## 6 4 64.0 1990
tail(arranged)
## ward btwn1865 year
## 11 3 73.0 1990
## 12 3 73.0 2010
## 13 6 76.0 2010
## 14 1 80.9 2010
## 15 2 83.7 1990
## 16 2 87.0 2010
arrange() defaults to sorting from lowest to highest. To flip that, wrap the variable in desc() to sort in descending order, like this:
arranged <- arrange(dcWrkAge,desc(btwn1865))
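Here is the same sort in both directions on three rows, small enough to check by eye. The values are three of the 1990 rows from the head() and tail() output above. The names toy and highFirst are made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  ward     = c(8, 2, 3),
  btwn1865 = c(61.7, 83.7, 73.0)
)

arrange(toy, btwn1865)                      # lowest first: wards 8, 3, 2
highFirst <- arrange(toy, desc(btwn1865))   # highest first: wards 2, 3, 8
head(highFirst, 1)                          # just the top row: ward 2
```

head() and tail() both take an optional second argument for how many rows to show. Without it they show six.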
ggplot(dcPop9010, aes(x=ward, y=btwn1865, fill=year)) +
geom_bar(stat="identity",position=position_dodge())
There are four datasets here that are structured just like the one we’ve been working with. Pick one and start doing analysis/visualizations on your own or with a partner! I’ll be floating around to help but don’t forget RStudio’s ? function too.
The datasets are:
There’s a lot to learn with R and we couldn’t possibly cover it all in two hours! But I do hope everyone feels comfortable and excited about programming in R. There are a number of great free, online resources for continuing your R learning after today: