First: Go get your U.S. Census Bureau API key

I’ll explain what this is in a moment. For now, trust me. Go here:

https://api.census.gov/data/key_signup.html

… enter your name in the “Organization Name” box, an e-mail address that you can check momentarily in the “Email Address” box,” check the “I agree to the terms of service” box, and click “Submit Key Request.”

Four ways to get data into R

There are four main ways of getting data into R. You can key it in, you can retrieve it in .csv format from somewhere on the web, you can retrieve it from a file stored on your computer, or you can extract it from an API.

As is true when working with a spreadsheet, it’s best to avoid keying data in unless you have a foolproof way of detecting and correcting the errors you inevitably will make. In practice, you’ll be much more likely to retrieve data from the web or from a file stored somewhere on your computer or on an attached storage device, like a flash drive.

But it’s good to know how to enter data manually. It’s also a good introduction to how R works.

First way: Key it in

You can think of R as a language, like English, French or Spanish. And if R is a language, then its most common verb is the symbol <-, which, roughly translated, means, “define this as.” The “this” is whatever appears to the left of the symbol, and the “as” is whatever appears to the right of the symbol. So this code:

FirstName <- "Ken"

… means, “Create an object called FirstName, and define it as ‘Ken.’” Once such an object has been created, you can do things with it. For example, the code below will tell R to display the contents of FirstName on the computer’s screen. This R Markdown document will show you the code first, followed by the output. Here you go:

print(FirstName)
## [1] "Ken"

That’s not very efficient, though. It would be better to define FirstName as a “vector,” or a list of items that are of the same type. Doing so lets you use FirstName to store a whole bunch of first names instead of just one. While we’re at it, lets add a LastName vector and use it to store the last name that goes with each first name. Also, just to show you that vectors can be numbers as well as text, let’s add a vector containing the age of each person being named:

FirstName <- c("Ken","Jane","John")
LastName <- c("Blake","Doe","Smith")
Age <- c(58,20,21)
print(FirstName)
## [1] "Ken"  "Jane" "John"
print(LastName)
## [1] "Blake" "Doe"   "Smith"
print(Age)
## [1] 58 20 21

Well, that opens up a lot more possibilities. Still, though, we have to put these three vectors together somehow in order to have anything like a data set. Here’s how you do it in R:

mydata <- data.frame(FirstName,LastName,Age)
head(mydata)
##   FirstName LastName Age
## 1       Ken    Blake  58
## 2      Jane      Doe  20
## 3      John    Smith  21

… which tells R something like, “Make an object called mydata, and define it as a data frame consisting of the vectors previously defined as FirstName, LastName, and Age. Then, print the contents of mydata out on the screen.”

After you run this code - and assuming you already have run the code defining FirstName, LastName and Age - you will see mydata appear in the “Environment” area of RStudio, that is, the upper-right area of the screen. Click on mydata, and R Studio will show you the data frame as a spreadsheet-like grid.

So, that’s how you can get data into R manually. You can key data into vectors, then combine the vectors into a data frame. It’s the equivalent of typing data into a spreadsheet.

Again, good to know. But it’s not how you typically will get data into R. Before looking at the other ways, let’s clean up a bit by deleting from the environment the objects and data frame we have made:

rm (mydata,
    Age,
    FirstName,
    LastName)

Second way: Read it in .csv format from the web

More typically, you will get data into R by reading the data from a file stored on the web. The approach is like using the =importdata() function in Google Sheets. This code will define an object called mydata (again) and define it as a data frame containing the imported contents of the NashAreaPop.csv file stored online in my web space. The head(mydata,10) code will tell R to print the first 10 rows of the data frame on the screen.

mydata <- read.csv("https://drkblake.com/wp-content/uploads/2024/01/NashAreaPop.csv")
head(mydata,10)
##         GEOID   District          County     State  From    To   Significance
## 1  4702190022 District 1 Cheatham County Tennessee  6683  7845    Significant
## 2  4702190212 District 2 Cheatham County Tennessee  6367  6646 Nonsignificant
## 3  4702190402 District 3 Cheatham County Tennessee  6557  6956 Nonsignificant
## 4  4702190592 District 4 Cheatham County Tennessee  6683  7408 Nonsignificant
## 5  4702190782 District 5 Cheatham County Tennessee  6739  6749 Nonsignificant
## 6  4702190972 District 6 Cheatham County Tennessee  6684  5580    Significant
## 7  4703790038 District 1 Davidson County Tennessee 17311 18394 Nonsignificant
## 8  4703790228 District 2 Davidson County Tennessee 18487 21064    Significant
## 9  4703790418 District 3 Davidson County Tennessee 19356 19560 Nonsignificant
## 10 4703790608 District 4 Davidson County Tennessee 19969 21668 Nonsignificant

If you’d like to see all of the mydata data frame, click on mydata in the environment area (the upper-right area of RStudio), and RStudio will open the data frame as a spreadsheet-like grid that you can scroll around in.

Third way: Read it from your computer

You also can read data that has been stored on your computer. The easiest place to retrieve stored data from is the directory of the R project you are working on. Right now, you probably don’t have any data there for R to read. So, let’s store the mydata data frame there, as a .csv file called NashAreaPop.csv. This code will do the trick:

write.csv(mydata,"NashAreaPop.csv",
          row.names = FALSE)

Translated, the code told R, “You know that mydata data frame presently in the environment? Store it as a .csv file in the project directory as a file called NashAreaPop.csv. Oh, and when you write the file, don’t add a variable assigning each row a numeric name. I won’t need that this time.”

Once you have run the code, click the “Files” tab in the lower-right area of RStudio, and you’ll see “NashAreaPop.csv” listed as one of the files there. How about that? Incidentally, if you wanted to, you could read the file into Google Sheets and work with it there.

But my goal here is to show you how to read a .csv file from your computer, no matter how the .csv file got onto your computer. Here’s the code for reader the file in as a data frame called my_local_data, assuming the .csv file you’re after is available in your project directory. Note that it’s really the same code as the code, above, that you used to read the file from my web space. It just drops the URL and uses only the file name.

my_local_data <- read.csv("NashAreaPop.csv")
head(my_local_data,10)
##         GEOID   District          County     State  From    To   Significance
## 1  4702190022 District 1 Cheatham County Tennessee  6683  7845    Significant
## 2  4702190212 District 2 Cheatham County Tennessee  6367  6646 Nonsignificant
## 3  4702190402 District 3 Cheatham County Tennessee  6557  6956 Nonsignificant
## 4  4702190592 District 4 Cheatham County Tennessee  6683  7408 Nonsignificant
## 5  4702190782 District 5 Cheatham County Tennessee  6739  6749 Nonsignificant
## 6  4702190972 District 6 Cheatham County Tennessee  6684  5580    Significant
## 7  4703790038 District 1 Davidson County Tennessee 17311 18394 Nonsignificant
## 8  4703790228 District 2 Davidson County Tennessee 18487 21064    Significant
## 9  4703790418 District 3 Davidson County Tennessee 19356 19560 Nonsignificant
## 10 4703790608 District 4 Davidson County Tennessee 19969 21668 Nonsignificant

Fourth way: Extract it from an API

Some of the best datasets available on the web come from Application Programming Interfaces, or APIs. These are systems set up to provide bulk data that can be read directly into a computer script. We’ll start with learning how to access the U.S. Census Bureau’s API. To do it, you’ll need a couple of R add-on programs. One is called tidyverse. It’s an add-on that provides easier-to-use alternatives to many of R’s functions. The other is tidycensus, which helps R communicate with the Census Bureau API. This code will install both programs - “packages,” in R lingo - if they haven’t been installed already, then load them so you can use them.

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidycensus")) install.packages("tidycensus")
library(tidyverse)
library(tidycensus)

Next, check your e-mail for your Census API key. You’ll have to activate your API key by clicking a link in the e-mail. Then, copy your API key out of the e-mail, paste it into this code in place of PasteYourAPIKeyBetweenTheseQuoteMarks, and run the code.

census_api_key("PasteYourAPIKeyBetweenTheseQuoteMarks")

Ready to get some Census data? This code will grab the American Community Survey’s 2022 population estimates for each county subdivision in Tennessee, put them in a data frame called mydata, and display the data frame’s first 10 rows. The B01001_001 code used in the syntax is the name of the American Community Survey’s population estimate variable. Below, I’ll show you how to look up the name for any variable the American Community Survey measures.

In the resulting data frame, PopE is the population estimate. PopM is the estimate’s error margin. I’ll explain about error margins later. For now, just know that each row’s PopE figure is the estimated number of people who live in whatever area the row represents.

mydata <- get_acs(geography = "county subdivision",
                   state = "TN",
                   variables = c(Pop = "B01001_001"),
                   year = 2022,
                   survey = "acs5",
                   output = "wide")
## Getting data from the 2018-2022 5-year ACS
head(mydata,10)
## # A tibble: 10 × 4
##    GEOID      NAME                                    PopE  PopM
##    <chr>      <chr>                                  <dbl> <dbl>
##  1 4700190002 District 1, Anderson County, Tennessee  9093   815
##  2 4700190192 District 2, Anderson County, Tennessee 10094   473
##  3 4700190382 District 3, Anderson County, Tennessee 10780   853
##  4 4700190572 District 4, Anderson County, Tennessee  8800   686
##  5 4700190762 District 5, Anderson County, Tennessee  9741   846
##  6 4700190952 District 6, Anderson County, Tennessee  9614   840
##  7 4700191142 District 7, Anderson County, Tennessee  9276   771
##  8 4700191332 District 8, Anderson County, Tennessee  9939   894
##  9 4700390004 District 1, Bedford County, Tennessee   5795   649
## 10 4700390194 District 2, Bedford County, Tennessee   5300   814

A little surgery

Unhandily, the data frame we got crams each district’s number, county, and state into a single column called NAME. It would be better to have one column for the district number, a second, separate column for the county name, and a third column for the state name.

Fortunately, there is a comma and a space separating the three pieces of information. The separate_wider_delim() function, which is available through one of the subpackages in the tidyverse package we installed and activated a moment ago, can exploit that comma-and-space pattern to rearrange the the data into three columns. The syntax is a little involved. Translated, it says, “Redefine the mydata data frame as the original mydata data frame, but only after splitting the NAME column into three columns called District, County, and State and using the comma-and-space between each piece of information in the NAME column to decide what goes in each new column.

mydata <-
  separate_wider_delim(mydata,
                       NAME,
                       delim = ", ",
                       names = c("District", "County", "State"))

An aside: Getting help with R

How in the world is an R newbie supposed to guess that R can do a thing like that, let alone figure out how to get R to actually do it? Learning R won’t be easy. But all the help you need is just a Google search away. In this case, Google something like, “Separate an R data frame column into multiple columns,” and you’ll find a number of tutorials, videos and discussion board posts explaining the process. I sometimes say I am a self-taught coder. That’s not quite right, though. I am a web-and-YouTube-taught coder. Persist, and you can figure it out. And the more you do it, the easier it will get.

Sorting data in R

In data analysis, “sorting” refers to arranging data in alphabetical order (for text data) or numeric order (for numeric data). Sorting in “ascending” order means sorting it from a to Z in the case of an alphabetical sort, and from the smallest number to the largest number in the case of numeric data. Sorting in “descending” order means the opposite: Z to a, for text, and largest to smallest for numeric data.

In Google Sheets, sorting involves interacting with a point-and-click menu that you use to specify one or more sorts and, if more than one, the order in which they will execute. In R, you use, of course, code.

Sorting: An ascending sort with arrange()

Having installed and activated the tidyverse package, you can use this code to sort the mydata Census data frame by the NAME variable in ascending order:

mydata_sorted <- arrange(mydata,PopE)
head(mydata_sorted,10)
## # A tibble: 10 × 6
##    GEOID      District    County           State      PopE  PopM
##    <chr>      <chr>       <chr>            <chr>     <dbl> <dbl>
##  1 4716991500 District 8  Trousdale County Tennessee   560   280
##  2 4716990550 District 3  Trousdale County Tennessee   610   277
##  3 4702591546 District 9  Claiborne County Tennessee   666   159
##  4 4716991690 District 9  Trousdale County Tennessee   708   329
##  5 4703390604 District 4  Crockett County  Tennessee   735   196
##  6 4706790448 District 3  Hancock County   Tennessee   736   231
##  7 4706790828 District 5  Hancock County   Tennessee   739   185
##  8 4716991880 District 10 Trousdale County Tennessee   779   327
##  9 4706790638 District 4  Hancock County   Tennessee   784   231
## 10 4706790068 District 1  Hancock County   Tennessee   799   203

… which means, “Create mydata_sorted, and define it as the mydata data frame, but sorted in ascending order by the PopE variable. Then, display the first 10 rows of mydata_sorted.” Easy, right?

A descending sort with arrange()

To do the same sort but in descending order, just wrap PopE variable name in a desc() function.

mydata_sorted <- arrange(mydata, desc(PopE))
head(mydata_sorted,10)
## # A tibble: 10 × 6
##    GEOID      District    County        State      PopE  PopM
##    <chr>      <chr>       <chr>         <chr>     <dbl> <dbl>
##  1 4715790348 District 2  Shelby County Tennessee 79698  2676
##  2 4715790728 District 4  Shelby County Tennessee 79370  2802
##  3 4715792438 District 13 Shelby County Tennessee 76196  3350
##  4 4715790538 District 3  Shelby County Tennessee 76147  1900
##  5 4715792058 District 11 Shelby County Tennessee 73172  3083
##  6 4715790918 District 5  Shelby County Tennessee 71264  3179
##  7 4715790158 District 1  Shelby County Tennessee 71132  3239
##  8 4715791868 District 10 Shelby County Tennessee 69398  2914
##  9 4715791108 District 6  Shelby County Tennessee 68449  4035
## 10 4715792248 District 12 Shelby County Tennessee 68209  2881

A hierarchical sort with arrange()

Hierarchical sorts - that is, sorting by one variable, then by a second variable within levels of the first variable - are similarly easy. This code will sort the data in ascending order by County, then, within each county, in descending order by the value of PopE:

mydata_sorted <- arrange(mydata,County, desc(PopE))
head(mydata_sorted,10)
## # A tibble: 10 × 6
##    GEOID      District   County          State      PopE  PopM
##    <chr>      <chr>      <chr>           <chr>     <dbl> <dbl>
##  1 4700190382 District 3 Anderson County Tennessee 10780   853
##  2 4700190192 District 2 Anderson County Tennessee 10094   473
##  3 4700191332 District 8 Anderson County Tennessee  9939   894
##  4 4700190762 District 5 Anderson County Tennessee  9741   846
##  5 4700190952 District 6 Anderson County Tennessee  9614   840
##  6 4700191142 District 7 Anderson County Tennessee  9276   771
##  7 4700190002 District 1 Anderson County Tennessee  9093   815
##  8 4700190572 District 4 Anderson County Tennessee  8800   686
##  9 4700390954 District 6 Bedford County  Tennessee  6781   933
## 10 4700391524 District 9 Bedford County  Tennessee  6095  1094

Filtering data in R with filter()

Working in the tidyverse environment makes filtering a snap, too. If you want just Rutherford County rows, for example, you simply tell R to filter for all rows in which the County variable equals “Rutherford County.”

mydata_filtered <- filter(mydata, County == "Rutherford County")
head(mydata_filtered,10)
## # A tibble: 10 × 6
##    GEOID      District    County            State      PopE  PopM
##    <chr>      <chr>       <chr>             <chr>     <dbl> <dbl>
##  1 4714990150 District 1  Rutherford County Tennessee 16065  1436
##  2 4714990340 District 2  Rutherford County Tennessee 13855  1335
##  3 4714990530 District 3  Rutherford County Tennessee 15560  1545
##  4 4714990720 District 4  Rutherford County Tennessee 15809  1356
##  5 4714990910 District 5  Rutherford County Tennessee 16019  1543
##  6 4714991100 District 6  Rutherford County Tennessee 18576  1993
##  7 4714991290 District 7  Rutherford County Tennessee 19926  2174
##  8 4714991480 District 8  Rutherford County Tennessee 18111  1535
##  9 4714991670 District 9  Rutherford County Tennessee 18315  1698
## 10 4714991860 District 10 Rutherford County Tennessee 13142  1219

You can filter by two or more criteria, both of which have to be met. For example, this code will filter for rows that have “Rutherford County” in the County column and values greater than 7,000 in the PopE column:

mydata_filtered <- filter(mydata, County == "Rutherford County",
                         PopE > 18000 )
head(mydata_filtered,10)
## # A tibble: 6 × 6
##   GEOID      District    County            State      PopE  PopM
##   <chr>      <chr>       <chr>             <chr>     <dbl> <dbl>
## 1 4714991100 District 6  Rutherford County Tennessee 18576  1993
## 2 4714991290 District 7  Rutherford County Tennessee 19926  2174
## 3 4714991480 District 8  Rutherford County Tennessee 18111  1535
## 4 4714991670 District 9  Rutherford County Tennessee 18315  1698
## 5 4714993000 District 16 Rutherford County Tennessee 18973  1641
## 6 4714993760 District 20 Rutherford County Tennessee 26266  2179

You also can do partial match filtering by combining the filter() and str_detect() functions. Notice how this code finds all rows in which County is “Rutherford County,” even though “County” isn’t included in the search criteria.

mydata_filtered <- filter(mydata,(str_detect(County, "Rutherford"))) 
head(mydata_filtered,10)
## # A tibble: 10 × 6
##    GEOID      District    County            State      PopE  PopM
##    <chr>      <chr>       <chr>             <chr>     <dbl> <dbl>
##  1 4714990150 District 1  Rutherford County Tennessee 16065  1436
##  2 4714990340 District 2  Rutherford County Tennessee 13855  1335
##  3 4714990530 District 3  Rutherford County Tennessee 15560  1545
##  4 4714990720 District 4  Rutherford County Tennessee 15809  1356
##  5 4714990910 District 5  Rutherford County Tennessee 16019  1543
##  6 4714991100 District 6  Rutherford County Tennessee 18576  1993
##  7 4714991290 District 7  Rutherford County Tennessee 19926  2174
##  8 4714991480 District 8  Rutherford County Tennessee 18111  1535
##  9 4714991670 District 9  Rutherford County Tennessee 18315  1698
## 10 4714991860 District 10 Rutherford County Tennessee 13142  1219

Sometimes, too, you need to filter by two or more criteria, any one of which must be met. Here, we get all rows for the counties in the Nashville area:

mydata_filtered <- filter(mydata, County == "Davidson County"|
                         County == "Cheatham County"|
                           County == "Robertson County"|
                           County == "Rutherford County"|
                           County == "Sumner County"|
                           County == "Williamson County"|
                           County == "Wilson County")
head(mydata_filtered,10)
## # A tibble: 10 × 6
##    GEOID      District   County          State      PopE  PopM
##    <chr>      <chr>      <chr>           <chr>     <dbl> <dbl>
##  1 4702190022 District 1 Cheatham County Tennessee  7845   719
##  2 4702190212 District 2 Cheatham County Tennessee  6646   676
##  3 4702190402 District 3 Cheatham County Tennessee  6956   459
##  4 4702190592 District 4 Cheatham County Tennessee  7408   777
##  5 4702190782 District 5 Cheatham County Tennessee  6749   664
##  6 4702190972 District 6 Cheatham County Tennessee  5580   511
##  7 4703790038 District 1 Davidson County Tennessee 18394  1440
##  8 4703790228 District 2 Davidson County Tennessee 21064  1756
##  9 4703790418 District 3 Davidson County Tennessee 19560  1856
## 10 4703790608 District 4 Davidson County Tennessee 21668  2134

Bringing it all together

The American Community Survey measures a lot more than area headcounts. This syntax will retrieve all three codebooks for the 2022 five-year American Community Survey. Feel free to browse them and find something interesting to extract.

DetailedTables <- load_variables(2022, "acs5", cache = TRUE)
SubjectTables <- load_variables(2022, "acs5/subject", cache = TRUE)
ProfileTables <- load_variables(2022, "acs5/profile", cache = TRUE)

For example, the DP02_0068P variable will give you the percentage of people age 25 and older who have a bachelor’s degree or higher. With a few tweaks, you could put together several of the code snippets explained above into a script that gets education-level data for county subdivisions in the Nashville area. To keep things as simple as possible, I’ll call the data frame mydata and just overwrite it every time I make a filtering or sorting change. Also, I’ll use the generic variable name MyVar_, which, in the resulting data frame, will end up as MyVar_E for the estimate, and MyVar_M for the margin of error. You can use a custom variable name if you like, of course.

#Getting subdivision-level 2022 data on
#the percent of those 25 and older with
#a bachelor's or higher

mydata <- get_acs(geography = "county subdivision",
  state = "TN",
  variables = c(MyVar_ = "DP02_0068P"),
  year = 2022,
  survey = "acs5",
  output = "wide")

mydata <-
  separate_wider_delim(mydata,
                       NAME,
                       delim = ", ",
                       names = c("District", "County", "State"))

mydata <- filter(mydata, County == "Davidson County" |
    County == "Cheatham County" |
    County == "Robertson County" |
    County == "Rutherford County" |
    County == "Sumner County" |
    County == "Williamson County" |
    County == "Wilson County")

mydata <- arrange(mydata,County, desc(MyVar_E))

Want to save your data in .csv format, so you can work with it in Google Sheets? R’s write.csv() function will do the job. The code will save the mydata data frame in your project directory, under the filename “mydata.csv.”

write.csv(mydata,"mydata.csv", row.names = FALSE)