---
title: "Homework 10: Mapping Employment across the US"
output:
  html_notebook:
    theme: lumen
  html_document:
    df_print: paged
---

# Assignment

*Read:*

- the tmap package document and the replication code
- the tigris package document

*Map:*

- Use the tigris package to retrieve county-level spatial shapefiles;
- Use Social Explorer to retrieve county-level information of your own choice;
- Map your county-level information using the tmap package. What can you see from this map?
- Present the same county-level information in a more conventional, non-spatial way

*Discuss:*

- the strengths and weaknesses of the two approaches (i.e., spatial vs. non-spatial)
- What does the "cb = TRUE" do?
- What will happen if you change it to "cb = FALSE"?
- Show an example

# Introduction 

The National Historical Geographic Information System (NHGIS) is a project that aggregates census data for use in spatial mapping. The LEHD (Longitudinal Employer-Household Dynamics) data were last updated by NHGIS in 2015. The NHGIS data can be downloaded from Social Explorer; here I look at the total number of jobs per county across the continental United States. The analysis also needs map shapefiles, which can be obtained through the *tigris* package, as shown below.

# Data
```{r}
#get map files
library(tigris)
options(tigris_class = "sf")
#this next line does all the data downloading and cleaning
t_county <- counties(cb = TRUE)
names(t_county)
```

```{r message=FALSE, warning=FALSE}
# read the data file
library(readr)
jdata <- read_csv("/Users/meredithpowers/Desktop/alljobs.csv")
```

```{r}
# combine the two files
library(dplyr)
jdata <- jdata %>% 
  mutate(TotalJobs = as.integer(ORGRACN_B001_001),
         fips = parse_integer(Geo_FIPS))
# A common mistake when joining two datasets is using key columns of different
# types (character in one, integer in the other, for example).
# parse_integer() from readr converts both keys to integers so they match.
t_county <- t_county %>% 
  mutate(fips = parse_integer(GEOID)) 
# Use the map file as the base (left) table so that every county keeps its
# geometry even when it has no match in the jobs file.
comb_data <- t_county %>% 
  left_join(jdata, by = "fips")
```
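
The type issue flagged in the comments above can be seen on a toy example. The sketch below uses made-up FIPS codes and job counts (not the Social Explorer file) and base R's `merge()` as a stand-in for `left_join()`. It shows how leading zeros cause silent mismatches, and two ways to make the keys agree.

```r
# Toy data: made-up values, not taken from the Social Explorer file
shapes <- data.frame(GEOID = c("01001", "06037", "36061"),
                     stringsAsFactors = FALSE)
jobs <- data.frame(Geo_FIPS = c(1001L, 6037L, 36061L),
                   TotalJobs = c(10L, 20L, 30L))

# Naive conversion to character loses the leading zero: "01001" never
# equals "1001", so those counties end up with missing job counts
jobs_chr <- transform(jobs, GEOID = as.character(Geo_FIPS))
bad <- merge(shapes, jobs_chr, by = "GEOID", all.x = TRUE)
sum(is.na(bad$TotalJobs))   # 2 of the 3 rows fail to match

# Option 1 (the approach used in this notebook): make both keys integers
shapes$fips <- as.integer(shapes$GEOID)
jobs$fips <- jobs$Geo_FIPS
by_int <- merge(shapes, jobs, by = "fips", all.x = TRUE)

# Option 2: keep both keys as zero-padded five-character strings
jobs$GEOID <- sprintf("%05d", jobs$Geo_FIPS)
by_chr <- merge(shapes, jobs[, c("GEOID", "TotalJobs")],
                by = "GEOID", all.x = TRUE)
```

Either option gives a complete match here; the string version has the small advantage that the codes still look like the five-digit identifiers used elsewhere.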

```{r message=FALSE, warning=FALSE}
# subset of data limited to the continental US: drop Alaska (02), Hawaii (15),
# and the territories (60, 66, 69, 72, 78)
comb_data_sub <- comb_data %>% 
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))
```

# Mapping

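```{r message=FALSE, warning=FALSE}
library(tmap)
# us_states is not created anywhere else in this notebook; as an assumption,
# build the state outlines from tigris for the same set of states
us_states <- states(cb = TRUE) %>% 
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))
# draw the map with state borders and cleaned up county lines for visual impact
tm_shape(comb_data_sub) + 
  tm_polygons("TotalJobs", border.col = "grey", border.alpha = .4) + 
  tm_shape(us_states) + 
  tm_borders(lwd = .36, col = "black", alpha = 1)
```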

# Histogram

```{r message=FALSE, warning=FALSE}
library(ggplot2)
theme_set(theme_bw())
# histogram of counties by total number of jobs
ggplot(comb_data_sub, aes(x = TotalJobs)) +
  geom_histogram() +
  labs(title = "Distribution of Counties by Number of Jobs",
       x = "Number of Jobs", y = "Number of Counties",
       caption = "There are a few counties with a much greater number of jobs than the majority of counties") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5))
```

# Discussion

The map shows that most counties are the same shade of light yellow, indicating 0 to 1 million jobs, while a few counties stand out as much darker, with 2 to 3 million or 4 to 5 million jobs. This implies a positively skewed distribution, though the details are unclear. The map is useful for showing where the outliers are, which can point to geographic regions for further study and comparison. It's also great for presenting information to others, since a shaded (choropleth) map is easy to grasp. Still, there's a lot of room in that "0 to 1 million jobs" category, and the map does not show any of the variation within it.
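
To see how much variation the lowest class can hide, the sketch below bins simulated, right-skewed counts two ways: with equal-width classes like those in the map legend, and with quantile classes. The counts are randomly generated for illustration and are not the LEHD figures. In tmap 3, the `style` argument of `tm_polygons()` (for example, `style = "quantile"`) selects the classification method, though the exact interface depends on the tmap version installed.

```r
set.seed(1)
# Simulated right-skewed "jobs per county" values (illustrative only)
jobs <- round(rlnorm(3000, meanlog = 9, sdlog = 1.5))

# Equal-width classes: five bins spanning the full range
equal_breaks <- seq(0, max(jobs), length.out = 6)
equal_class <- cut(jobs, breaks = equal_breaks, include.lowest = TRUE)
table(equal_class)   # almost every county lands in the first bin

# Quantile classes: five bins holding roughly equal numbers of counties
quant_breaks <- quantile(jobs, probs = seq(0, 1, by = 0.2))
quant_class <- cut(jobs, breaks = quant_breaks, include.lowest = TRUE)
table(quant_class)
```

Quantile classes reveal differences among the many small counties, at the cost of a legend whose class widths are very uneven.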

The histogram confirms that this is a positively skewed distribution: a very small number of counties have far more jobs than the rest of the country. The histogram is useful for confirming the impression given by the map, and it also shows the specific shape of the distribution, giving a good estimate of how many counties have a given number of jobs. There are a lot of counties with under a million jobs, and the histogram shows just how many are at the low end of that range. Changing the scale of the x-axis would probably make this even clearer.
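
One way to act on the note about changing the scale is to bin the counts on a log scale. The sketch below uses simulated counts (not the LEHD data) and base R's `hist()` with `plot = FALSE` so that the bin counts can be compared directly; with ggplot2, adding `scale_x_log10()` to the histogram would be the analogous step.

```r
set.seed(1)
# Simulated right-skewed counts (illustrative only); +1 avoids log10(0)
jobs <- round(rlnorm(3000, meanlog = 9, sdlog = 1.5)) + 1

# Raw scale: the bulk of the counties is squeezed into the first bar
raw <- hist(jobs, breaks = 30, plot = FALSE)

# Log10 scale: the same counties spread across many bars
logged <- hist(log10(jobs), breaks = 30, plot = FALSE)

# Share of counties in the single fullest bar under each scale
max(raw$counts) / length(jobs)
max(logged$counts) / length(jobs)
```

Calling `plot(raw)` and `plot(logged)` draws the two histograms for a visual comparison.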


## What does "cb = TRUE" do? What happens if we use "cb = FALSE" instead?

When `cb` is set to TRUE, *tigris* downloads a generalized (1:500k) cartographic boundary file. When `cb` is set to FALSE, *tigris* downloads the most detailed TIGER/Line file, so the boundary lines in the county map are much more detailed. For a map of the whole continental US, this extra detail is probably unnecessary and may needlessly increase download and drawing time. However, for a zoomed-in view of one state or a few specific counties, setting `cb` to FALSE could provide valuable detail.
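
As a rough intuition for what "generalized" means, the toy sketch below draws the same circle with many vertices and with few, then compares the vertex counts and perimeters. It is only an illustration in base R: it does not use tigris or reproduce the Census Bureau's actual generalization procedure, and the helper names `circle_vertices` and `perimeter` are made up for this example.

```r
# Vertices of a unit circle approximated by n segments (closed ring,
# so the first point is repeated at the end)
circle_vertices <- function(n) {
  theta <- seq(0, 2 * pi, length.out = n + 1)
  cbind(x = cos(theta), y = sin(theta))
}

# Length of the outline defined by a matrix of consecutive vertices
perimeter <- function(xy) {
  sum(sqrt(rowSums(diff(xy)^2)))
}

detailed <- circle_vertices(720)     # stand-in for a detailed boundary
generalized <- circle_vertices(24)   # stand-in for a generalized boundary

# Many more vertices to download, store, and draw ...
nrow(detailed) / nrow(generalized)
# ... for an outline of nearly the same length
perimeter(generalized) / perimeter(detailed)
```

At the scale of a national map, the difference between the two outlines would be hard to see, which is why the generalized file is usually enough there.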

Even though I don't think it matters in this case, here is the same map drawn with `cb = FALSE`.

## Example of cb = FALSE

```{r}
#get detailed map files
options(tigris_class = "sf")
#this next line does all the data downloading and cleaning
c_county <- counties(cb = FALSE)
names(c_county)
```


More variables! More detail?!


```{r}
# combine the jobs data with the detailed county file
# (jdata already has the TotalJobs and fips columns from the Data section)
c_county <- c_county %>% 
  mutate(fips = parse_integer(GEOID)) 
newdata <- c_county %>% 
  left_join(jdata, by = "fips")
```


```{r}
# subset of data limited to the continental US: drop Alaska (02), Hawaii (15),
# and the territories (60, 66, 69, 72, 78)
newdata_sub <- newdata %>% 
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))
```

```{r}
#draw the new detailed map
tm_shape(newdata_sub) + 
  tm_polygons("TotalJobs", border.col = "grey", border.alpha = .4) + 
  tm_shape(us_states) + 
  tm_borders(lwd = .36, col = "black", alpha = 1)
```

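This map is definitely more detailed, and it's nice that some of the distracting lines (e.g., around the Great Lakes) are now gone! This is probably because the TIGER/Line county polygons extend out into the water, while the cartographic boundary polygons are clipped to the shoreline.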


