I’ll explain what this is in a moment. For now, trust me. Go here:
https://api.census.gov/data/key_signup.html
… enter your name in the “Organization Name” box, an e-mail address that you can check momentarily in the “Email Address” box,” check the “I agree to the terms of service” box, and click “Submit Key Request.”
There are four main ways of getting data into R. You can key it in, you can retrieve it in .csv format from somewhere on the web, you can retrieve it from a file stored on your computer, or you can extract it from an API.
As is true when working with a spreadsheet, it’s best to avoid keying data in unless you have a foolproof way of detecting and correcting the errors you inevitably will make. In practice, you’ll be much more likely to retrieve data from the web or from a file stored somewhere on your computer or on an attached storage device, like a flash drive.
But it’s good to know how to enter data manually. It’s also a good introduction to how R works.
You can think of R as a language, like English, French or Spanish.
And if R is a language, then its most common verb is the symbol
<-
, which, roughly translated, means, “define this as.”
The “this” is whatever appears to the left of the symbol, and the “as”
is whatever appears to the right of the symbol. So this code:
FirstName <- "Ken"
… means, “Create an object called FirstName
, and define
it as ‘Ken.’” Once such an object has been created, you can do things
with it. For example, the code below will tell R to display the contents
of FirstName
on the computer’s screen. This R Markdown
document will show you the code first, followed by the output. Here you
go:
print(FirstName)
## [1] "Ken"
That’s not very efficient, though. It would be better to define
FirstName
as a “vector,” or a list of items that are of the
same type. Doing so lets you use FirstName
to store a whole
bunch of first names instead of just one. While we’re at it, lets add a
LastName
vector and use it to store the last name that goes
with each first name. Also, just to show you that vectors can be numbers
as well as text, let’s add a vector containing the age of each person
being named:
FirstName <- c("Ken","Jane","John")
LastName <- c("Blake","Doe","Smith")
Age <- c(58,20,21)
print(FirstName)
## [1] "Ken" "Jane" "John"
print(LastName)
## [1] "Blake" "Doe" "Smith"
print(Age)
## [1] 58 20 21
Well, that opens up a lot more possibilities. Still, though, we have to put these three vectors together somehow in order to have anything like a data set. Here’s how you do it in R:
mydata <- data.frame(FirstName,LastName,Age)
head(mydata)
## FirstName LastName Age
## 1 Ken Blake 58
## 2 Jane Doe 20
## 3 John Smith 21
… which tells R something like, “Make an object called
mydata
, and define it as a data frame consisting of the
vectors previously defined as FirstName
,
LastName
, and Age
. Then, print the contents of
mydata
out on the screen.”
After you run this code - and assuming you already have run the code
defining FirstName
, LastName
and
Age
- you will see mydata
appear in the
“Environment” area of RStudio, that is, the upper-right area of the
screen. Click on mydata
, and R Studio will show you the
data frame as a spreadsheet-like grid.
So, that’s how you can get data into R manually. You can key data into vectors, then combine the vectors into a data frame. It’s the equivalent of typing data into a spreadsheet.
Again, good to know. But it’s not how you typically will get data into R. Before looking at the other ways, let’s clean up a bit by deleting from the environment the objects and data frame we have made:
rm (mydata,
Age,
FirstName,
LastName)
More typically, you will get data into R by reading the data from a
file stored on the web. The approach is like using the
=importdata()
function in Google Sheets. This code will
define an object called mydata
(again) and define it as a
data frame containing the imported contents of the NashAreaPop.csv file
stored online in my web space. The head(mydata,10)
code
will tell R to print the first 10 rows of the data frame on the
screen.
mydata <- read.csv("https://drkblake.com/wp-content/uploads/2024/01/NashAreaPop.csv")
head(mydata,10)
## GEOID District County State From To Significance
## 1 4702190022 District 1 Cheatham County Tennessee 6683 7845 Significant
## 2 4702190212 District 2 Cheatham County Tennessee 6367 6646 Nonsignificant
## 3 4702190402 District 3 Cheatham County Tennessee 6557 6956 Nonsignificant
## 4 4702190592 District 4 Cheatham County Tennessee 6683 7408 Nonsignificant
## 5 4702190782 District 5 Cheatham County Tennessee 6739 6749 Nonsignificant
## 6 4702190972 District 6 Cheatham County Tennessee 6684 5580 Significant
## 7 4703790038 District 1 Davidson County Tennessee 17311 18394 Nonsignificant
## 8 4703790228 District 2 Davidson County Tennessee 18487 21064 Significant
## 9 4703790418 District 3 Davidson County Tennessee 19356 19560 Nonsignificant
## 10 4703790608 District 4 Davidson County Tennessee 19969 21668 Nonsignificant
If you’d like to see all of the mydata
data frame, click
on mydata
in the environment area (the upper-right area of
RStudio), and RStudio will open the data frame as a spreadsheet-like
grid that you can scroll around in.
You also can read data that has been stored on your computer. The
easiest place to retrieve stored data from is the directory of the R
project you are working on. Right now, you probably don’t have any data
there for R to read. So, let’s store the mydata
data frame
there, as a .csv file called NashAreaPop.csv. This code will do the
trick:
write.csv(mydata,"NashAreaPop.csv",
row.names = FALSE)
Translated, the code told R, “You know that mydata
data
frame presently in the environment? Store it as a .csv file in the
project directory as a file called NashAreaPop.csv
. Oh, and
when you write the file, don’t add a variable assigning each row a
numeric name. I won’t need that this time.”
Once you have run the code, click the “Files” tab in the lower-right area of RStudio, and you’ll see “NashAreaPop.csv” listed as one of the files there. How about that? Incidentally, if you wanted to, you could read the file into Google Sheets and work with it there.
But my goal here is to show you how to read a .csv file from your
computer, no matter how the .csv file got onto your computer. Here’s the
code for reader the file in as a data frame called
my_local_data
, assuming the .csv file you’re after is
available in your project directory. Note that it’s really the same code
as the code, above, that you used to read the file from my web space. It
just drops the URL and uses only the file name.
my_local_data <- read.csv("NashAreaPop.csv")
head(my_local_data,10)
## GEOID District County State From To Significance
## 1 4702190022 District 1 Cheatham County Tennessee 6683 7845 Significant
## 2 4702190212 District 2 Cheatham County Tennessee 6367 6646 Nonsignificant
## 3 4702190402 District 3 Cheatham County Tennessee 6557 6956 Nonsignificant
## 4 4702190592 District 4 Cheatham County Tennessee 6683 7408 Nonsignificant
## 5 4702190782 District 5 Cheatham County Tennessee 6739 6749 Nonsignificant
## 6 4702190972 District 6 Cheatham County Tennessee 6684 5580 Significant
## 7 4703790038 District 1 Davidson County Tennessee 17311 18394 Nonsignificant
## 8 4703790228 District 2 Davidson County Tennessee 18487 21064 Significant
## 9 4703790418 District 3 Davidson County Tennessee 19356 19560 Nonsignificant
## 10 4703790608 District 4 Davidson County Tennessee 19969 21668 Nonsignificant
Some of the best datasets available on the web come from Application Programming Interfaces, or APIs. These are systems set up to provide bulk data that can be read directly into a computer script. We’ll start with learning how to access the U.S. Census Bureau’s API. To do it, you’ll need a couple of R add-on programs. One is called tidyverse. It’s an add-on that provides easier-to-use alternatives to many of R’s functions. The other is tidycensus, which helps R communicate with the Census Bureau API. This code will install both programs - “packages,” in R lingo - if they haven’t been installed already, then load them so you can use them.
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
library(tidyverse)
library(tidycensus)
Next, check your e-mail for your Census API key. You’ll have to activate your API key by clicking a link in the e-mail. Then, copy your API key out of the e-mail, paste it into this code in place of PasteYourAPIKeyBetweenTheseQuoteMarks, and run the code.
census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")
Ready to get some Census data? This code will grab the American
Community Survey’s 2022 population estimates for each county subdivision
in Tennessee, put them in a data frame called mydata
, and
display the data frame’s first 10 rows. The B01001_001
code
used in the syntax is the name of the American Community Survey’s
population estimate variable. Below, I’ll show you how to look up the
name for any variable the American Community Survey measures.
In the resulting data frame, PopE
is the population
estimate. PopM
is the estimate’s error margin. I’ll explain
about error margins later. For now, just know that each row’s
PopE
figure is the estimated number of people who live in
whatever area the row represents.
mydata <- get_acs(geography = "county subdivision",
state = "TN",
variables = c(Pop = "B01001_001"),
year = 2022,
survey = "acs5",
output = "wide")
## Getting data from the 2018-2022 5-year ACS
head(mydata,10)
## # A tibble: 10 × 4
## GEOID NAME PopE PopM
## <chr> <chr> <dbl> <dbl>
## 1 4700190002 District 1, Anderson County, Tennessee 9093 815
## 2 4700190192 District 2, Anderson County, Tennessee 10094 473
## 3 4700190382 District 3, Anderson County, Tennessee 10780 853
## 4 4700190572 District 4, Anderson County, Tennessee 8800 686
## 5 4700190762 District 5, Anderson County, Tennessee 9741 846
## 6 4700190952 District 6, Anderson County, Tennessee 9614 840
## 7 4700191142 District 7, Anderson County, Tennessee 9276 771
## 8 4700191332 District 8, Anderson County, Tennessee 9939 894
## 9 4700390004 District 1, Bedford County, Tennessee 5795 649
## 10 4700390194 District 2, Bedford County, Tennessee 5300 814
Unhandily, the data frame we got crams each district’s number,
county, and state into a single column called NAME
. It
would be better to have one column for the district number, a second,
separate column for the county name, and a third column for the state
name.
Fortunately, there is a comma and a space separating the three pieces
of information. The separate_wider_delim()
function, which
is available through one of the subpackages in the
tidyverse
package we installed and activated a moment ago,
can exploit that comma-and-space pattern to rearrange the the data into
three columns. The syntax is a little involved. Translated, it says,
“Redefine the mydata
data frame as the original
mydata
data frame, but only after splitting the
NAME
column into three columns called
District
, County
, and State
and
using the comma-and-space between each piece of information in the
NAME
column to decide what goes in each new column.
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("District", "County", "State"))
How in the world is an R newbie supposed to guess that R can do a thing like that, let alone figure out how to get R to actually do it? Learning R won’t be easy. But all the help you need is just a Google search away. In this case, Google something like, “Separate an R data frame column into multiple columns,” and you’ll find a number of tutorials, videos and discussion board posts explaining the process. I sometimes say I am a self-taught coder. That’s not quite right, though. I am a web-and-YouTube-taught coder. Persist, and you can figure it out. And the more you do it, the easier it will get.
In data analysis, “sorting” refers to arranging data in alphabetical order (for text data) or numeric order (for numeric data). Sorting in “ascending” order means sorting it from a to Z in the case of an alphabetical sort, and from the smallest number to the largest number in the case of numeric data. Sorting in “descending” order means the opposite: Z to a, for text, and largest to smallest for numeric data.
In Google Sheets, sorting involves interacting with a point-and-click menu that you use to specify one or more sorts and, if more than one, the order in which they will execute. In R, you use, of course, code.
arrange()
Having installed and activated the tidyverse
package,
you can use this code to sort the mydata
Census data frame
by the NAME
variable in ascending order:
mydata_sorted <- arrange(mydata,PopE)
head(mydata_sorted,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4716991500 District 8 Trousdale County Tennessee 560 280
## 2 4716990550 District 3 Trousdale County Tennessee 610 277
## 3 4702591546 District 9 Claiborne County Tennessee 666 159
## 4 4716991690 District 9 Trousdale County Tennessee 708 329
## 5 4703390604 District 4 Crockett County Tennessee 735 196
## 6 4706790448 District 3 Hancock County Tennessee 736 231
## 7 4706790828 District 5 Hancock County Tennessee 739 185
## 8 4716991880 District 10 Trousdale County Tennessee 779 327
## 9 4706790638 District 4 Hancock County Tennessee 784 231
## 10 4706790068 District 1 Hancock County Tennessee 799 203
… which means, “Create mydata_sorted
, and define it as
the mydata
data frame, but sorted in ascending order by the
PopE
variable. Then, display the first 10 rows of
mydata_sorted
.” Easy, right?
arrange()
To do the same sort but in descending order, just wrap PopE variable
name in a desc()
function.
mydata_sorted <- arrange(mydata, desc(PopE))
head(mydata_sorted,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4715790348 District 2 Shelby County Tennessee 79698 2676
## 2 4715790728 District 4 Shelby County Tennessee 79370 2802
## 3 4715792438 District 13 Shelby County Tennessee 76196 3350
## 4 4715790538 District 3 Shelby County Tennessee 76147 1900
## 5 4715792058 District 11 Shelby County Tennessee 73172 3083
## 6 4715790918 District 5 Shelby County Tennessee 71264 3179
## 7 4715790158 District 1 Shelby County Tennessee 71132 3239
## 8 4715791868 District 10 Shelby County Tennessee 69398 2914
## 9 4715791108 District 6 Shelby County Tennessee 68449 4035
## 10 4715792248 District 12 Shelby County Tennessee 68209 2881
arrange()
Hierarchical sorts - that is, sorting by one variable, then by a
second variable within levels of the first variable - are similarly
easy. This code will sort the data in ascending order by
County
, then, within each county, in descending order by
the value of PopE
:
mydata_sorted <- arrange(mydata,County, desc(PopE))
head(mydata_sorted,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4700190382 District 3 Anderson County Tennessee 10780 853
## 2 4700190192 District 2 Anderson County Tennessee 10094 473
## 3 4700191332 District 8 Anderson County Tennessee 9939 894
## 4 4700190762 District 5 Anderson County Tennessee 9741 846
## 5 4700190952 District 6 Anderson County Tennessee 9614 840
## 6 4700191142 District 7 Anderson County Tennessee 9276 771
## 7 4700190002 District 1 Anderson County Tennessee 9093 815
## 8 4700190572 District 4 Anderson County Tennessee 8800 686
## 9 4700390954 District 6 Bedford County Tennessee 6781 933
## 10 4700391524 District 9 Bedford County Tennessee 6095 1094
filter()
Working in the tidyverse environment makes filtering a snap, too. If
you want just Rutherford County rows, for example, you simply tell R to
filter for all rows in which the County
variable equals
“Rutherford County.”
mydata_filtered <- filter(mydata, County == "Rutherford County")
head(mydata_filtered,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4714990150 District 1 Rutherford County Tennessee 16065 1436
## 2 4714990340 District 2 Rutherford County Tennessee 13855 1335
## 3 4714990530 District 3 Rutherford County Tennessee 15560 1545
## 4 4714990720 District 4 Rutherford County Tennessee 15809 1356
## 5 4714990910 District 5 Rutherford County Tennessee 16019 1543
## 6 4714991100 District 6 Rutherford County Tennessee 18576 1993
## 7 4714991290 District 7 Rutherford County Tennessee 19926 2174
## 8 4714991480 District 8 Rutherford County Tennessee 18111 1535
## 9 4714991670 District 9 Rutherford County Tennessee 18315 1698
## 10 4714991860 District 10 Rutherford County Tennessee 13142 1219
You can filter by two or more criteria, both of which have to
be met. For example, this code will filter for rows that have
“Rutherford County” in the County
column and values greater
than 7,000 in the PopE
column:
mydata_filtered <- filter(mydata, County == "Rutherford County",
PopE > 18000 )
head(mydata_filtered,10)
## # A tibble: 6 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4714991100 District 6 Rutherford County Tennessee 18576 1993
## 2 4714991290 District 7 Rutherford County Tennessee 19926 2174
## 3 4714991480 District 8 Rutherford County Tennessee 18111 1535
## 4 4714991670 District 9 Rutherford County Tennessee 18315 1698
## 5 4714993000 District 16 Rutherford County Tennessee 18973 1641
## 6 4714993760 District 20 Rutherford County Tennessee 26266 2179
You also can do partial match filtering by combining
the filter()
and str_detect()
functions.
Notice how this code finds all rows in which County
is
“Rutherford County,” even though “County” isn’t included in the search
criteria.
mydata_filtered <- filter(mydata,(str_detect(County, "Rutherford")))
head(mydata_filtered,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4714990150 District 1 Rutherford County Tennessee 16065 1436
## 2 4714990340 District 2 Rutherford County Tennessee 13855 1335
## 3 4714990530 District 3 Rutherford County Tennessee 15560 1545
## 4 4714990720 District 4 Rutherford County Tennessee 15809 1356
## 5 4714990910 District 5 Rutherford County Tennessee 16019 1543
## 6 4714991100 District 6 Rutherford County Tennessee 18576 1993
## 7 4714991290 District 7 Rutherford County Tennessee 19926 2174
## 8 4714991480 District 8 Rutherford County Tennessee 18111 1535
## 9 4714991670 District 9 Rutherford County Tennessee 18315 1698
## 10 4714991860 District 10 Rutherford County Tennessee 13142 1219
Sometimes, too, you need to filter by two or more criteria, any one of which must be met. Here, we get all rows for the counties in the Nashville area:
mydata_filtered <- filter(mydata, County == "Davidson County"|
County == "Cheatham County"|
County == "Robertson County"|
County == "Rutherford County"|
County == "Sumner County"|
County == "Williamson County"|
County == "Wilson County")
head(mydata_filtered,10)
## # A tibble: 10 × 6
## GEOID District County State PopE PopM
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 4702190022 District 1 Cheatham County Tennessee 7845 719
## 2 4702190212 District 2 Cheatham County Tennessee 6646 676
## 3 4702190402 District 3 Cheatham County Tennessee 6956 459
## 4 4702190592 District 4 Cheatham County Tennessee 7408 777
## 5 4702190782 District 5 Cheatham County Tennessee 6749 664
## 6 4702190972 District 6 Cheatham County Tennessee 5580 511
## 7 4703790038 District 1 Davidson County Tennessee 18394 1440
## 8 4703790228 District 2 Davidson County Tennessee 21064 1756
## 9 4703790418 District 3 Davidson County Tennessee 19560 1856
## 10 4703790608 District 4 Davidson County Tennessee 21668 2134
The American Community Survey measures a lot more than area headcounts. This syntax will retrieve all three codebooks for the 2022 five-year American Community Survey. Feel free to browse them and find something interesting to extract.
DetailedTables <- load_variables(2022, "acs5", cache = TRUE)
SubjectTables <- load_variables(2022, "acs5/subject", cache = TRUE)
ProfileTables <- load_variables(2022, "acs5/profile", cache = TRUE)
For example, the DP02_0068P
variable will give you the
percentage of people age 25 and older who have a bachelor’s degree or
higher. With a few tweaks, you could put together several of the code
snippets explained above into a script that gets education-level data
for county subdivisions in the Nashville area. To keep things as simple
as possible, I’ll call the data frame mydata
and just
overwrite it every time I make a filtering or sorting change. Also, I’ll
use the generic variable name MyVar_
, which, in the
resulting data frame, will end up as MyVar_E
for the
estimate, and MyVar_M
for the margin of error. You can use
a custom variable name if you like, of course.
#Getting subdivision-level 2022 data on
#the percent of those 25 and older with
#a bachelor's or higher
mydata <- get_acs(geography = "county subdivision",
state = "TN",
variables = c(MyVar_ = "DP02_0068P"),
year = 2022,
survey = "acs5",
output = "wide")
mydata <-
separate_wider_delim(mydata,
NAME,
delim = ", ",
names = c("District", "County", "State"))
mydata <- filter(mydata, County == "Davidson County" |
County == "Cheatham County" |
County == "Robertson County" |
County == "Rutherford County" |
County == "Sumner County" |
County == "Williamson County" |
County == "Wilson County")
mydata <- arrange(mydata,County, desc(MyVar_E))
Want to save your data in .csv format, so you can
work with it in Google Sheets? R’s write.csv()
function
will do the job. The code will save the mydata
data frame
in your project directory, under the filename “mydata.csv.”
write.csv(mydata,"mydata.csv", row.names = FALSE)