Instructions

For each question below, show code. Once you’ve completed things, don’t forget to upload this document (knitted version please!) to CANVAS.

A few tips:

Don’t forget to knit your document frequently!
Don’t forget to install.packages() and load them using library().
Don’t forget to use ? or help() if you’re unsure about a function
EXPLAIN WHAT YOUR RESULTS MEAN! Think about the numbers and visualizations and explain, in words, what they mean.
Make sure you label all axes and add a title to your plots. I will take off points if you fail to do this.

22 points total + 2 points extra-credit

Set up

Questions (19 points)

1. For a state of your choosing, use functions in tidycensus to extract (1) total population, (2) total number of people reporting their race as white only, and (3) median income from the 2014-2018 5-year ACS estimates in all census tracts. (2 points)

Hint: Use View(). The labels for each of the variables you’re extracting are listed below.
* Estimate!!Median household income in the past 12 months (in 2018 inflation-adjusted dollars)
* Estimate!!Total!!White alone
* Estimate!!Total (make sure the concept is TOTAL POPULATION)

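library(tidycensus)
# Look up variable codes for the 2014-2018 5-year ACS
v18 = load_variables(2018, "acs5", cache = TRUE)
#view(v18)
# Request median income, white-alone count, and total population for every NY tract
NY_ = get_acs(geography = "tract", variables = c(Median_Income = "B19013_001", White = "B02001_002", Total_Pop = "B01003_001"), year = 2018, state = "NY")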

## Getting data from the 2014-2018 5-year ACS

head(NY_)

## # A tibble: 6 x 5
##   GEOID       NAME                                   variable     estimate   moe
##   <chr>       <chr>                                  <chr>           <dbl> <dbl>
## 1 36001000100 Census Tract 1, Albany County, New Yo~ Total_Pop        2022   218
## 2 36001000100 Census Tract 1, Albany County, New Yo~ White             473   122
## 3 36001000100 Census Tract 1, Albany County, New Yo~ Median_Inco~    29063  8493
## 4 36001000200 Census Tract 2, Albany County, New Yo~ Total_Pop        4700   690
## 5 36001000200 Census Tract 2, Albany County, New Yo~ White             621   221
## 6 36001000200 Census Tract 2, Albany County, New Yo~ Median_Inco~    29470  6013
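The hint about checking the concept can be sketched on a small table. This is only an illustration: `toy_vars` is a hand-typed stand-in for a few rows of the `load_variables()` output (not the real table), and `total_pop_code` is a name made up for this example.

```r
library(dplyr)

# Hand-typed stand-in for a few rows of load_variables() output (assumption)
toy_vars <- tibble(
  name    = c("B01003_001", "B01001_001", "B02001_002"),
  label   = c("Estimate!!Total", "Estimate!!Total", "Estimate!!Total!!White alone"),
  concept = c("TOTAL POPULATION", "SEX BY AGE", "RACE")
)

# More than one table uses the label "Estimate!!Total", so match on concept too
total_pop_code <- toy_vars %>%
  filter(label == "Estimate!!Total", concept == "TOTAL POPULATION") %>%
  pull(name)

total_pop_code
```

Filtering on the label alone would return two codes here, which is why the concept column matters.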

2. Perform all of the necessary operations IN R to clean your data. (4 points)

Hint: The list below indicates data cleaning steps you should consider.
* remove all columns that are not needed for the analysis (we will only work with ACS estimates below)
* separate grouped geography variables into their own columns
* make sure all unnecessary spaces are removed from character variables
* remove rows with NA values
* pivot the dataframe

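library(tidyr)
# Split "Census Tract 1, Albany County, New York" into three columns
NY_ = separate(NY_, "NAME", into = c("tract", "county", "state"), sep = ",")
# Trim the leading space that the split leaves on county and state
NY_$county <- trimws(NY_$county)
NY_$state <- trimws(NY_$state)
# Drop the moe column (column 7), then pivot to one row per tract
NY_estimate = NY_[, 1:6] %>% pivot_wider(names_from = variable, values_from = estimate)
# Remove tracts with a missing estimate
NY_estimate = NY_estimate[complete.cases(NY_estimate), ]
head(NY_estimate)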

## # A tibble: 6 x 7
##   GEOID       tract           county       state   Total_Pop White Median_Income
##   <chr>       <chr>           <chr>        <chr>       <dbl> <dbl>         <dbl>
## 1 36001000100 Census Tract 1  Albany Coun~ New Yo~      2022   473         29063
## 2 36001000200 Census Tract 2  Albany Coun~ New Yo~      4700   621         29470
## 3 36001000300 Census Tract 3  Albany Coun~ New Yo~      5966  2525         37296
## 4 36001000401 Census Tract 4~ Albany Coun~ New Yo~      2479  2133         75809
## 5 36001000403 Census Tract 4~ Albany Coun~ New Yo~      4236  3108         70488
## 6 36001000501 Census Tract 5~ Albany Coun~ New Yo~      3215  1764         31480
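The cleaning steps listed in the hint can also be chained into a single pipeline. The sketch below runs on a tiny hand-made table, so `toy`, `toy_clean`, and all of the values are invented for illustration and are not ACS data.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format rows shaped like get_acs() output (values invented)
toy <- tibble(
  GEOID    = c("001", "001", "002", "002"),
  NAME     = c("Census Tract 1, Example County, Example State",
               "Census Tract 1, Example County, Example State",
               "Census Tract 2, Example County, Example State",
               "Census Tract 2, Example County, Example State"),
  variable = c("Total_Pop", "Median_Income", "Total_Pop", "Median_Income"),
  estimate = c(2000, 30000, 0, NA),
  moe      = c(200, 8000, 10, NA)
)

toy_clean <- toy %>%
  select(-moe) %>%                                     # keep estimates only
  separate(NAME, into = c("tract", "county", "state"), sep = ",") %>%
  mutate(across(c(tract, county, state), trimws)) %>%  # strip stray spaces
  pivot_wider(names_from = variable, values_from = estimate) %>%
  drop_na()                                            # drop tracts with missing values

toy_clean
```

Dropping moe before pivoting matters: pivot_wider() treats every remaining column as an identifier, so leaving moe in would keep each tract spread across several rows.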

3. Create a new column (called per_white) where you calculate the percent of the population reporting their race as white only in each census tract. Check your calculation by slicing your dataframe using a logical expression to return all rows where per_white > 100. This expression should return 0 rows. (2 points)

library(dplyr)
# Percent of each tract's population reporting white alone
NY_estimate = NY_estimate %>%
  mutate(per_white = White / Total_Pop * 100)
# Sanity check: no tract should exceed 100 percent
filter(NY_estimate, per_white > 100)

## # A tibble: 0 x 8
## # ... with 8 variables: GEOID <chr>, tract <chr>, county <chr>, state <chr>,
## #   Total_Pop <dbl>, White <dbl>, Median_Income <dbl>, per_white <dbl>
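The same check can be written as a base R slice with a logical expression. This is a minimal sketch on invented counts (`toy` and `bad_rows` are hypothetical names); it also shows one edge case worth knowing about, a tract with zero population.

```r
# Hypothetical tract counts (invented for illustration)
toy <- data.frame(
  White     = c(473, 621, 0),
  Total_Pop = c(2022, 4700, 0)
)
toy$per_white <- toy$White / toy$Total_Pop * 100

# 0 / 0 gives NaN, so wrap the test in which() to skip missing values
bad_rows <- toy[which(toy$per_white > 100), ]
nrow(bad_rows)
```

Without which(), base R indexing returns a row of NA values for the NaN comparison, whereas filter() silently drops it.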

3. Summarize average median income by county. What are the top 5 counties with highest average median income? What are the top 5 counties with lowest average median income? (3 points)

NY_estimate%>%
  group_by(county)%>%
  summarise(
    mean.Median_Income = mean(Median_Income)
  )%>%
  arrange(desc(mean.Median_Income))

## # A tibble: 62 x 2
##    county             mean.Median_Income
##    <chr>                           <dbl>
##  1 Nassau County                 118475.
##  2 Westchester County            106272.
##  3 Putnam County                 104310.
##  4 Suffolk County                102428.
##  5 New York County                94791.
##  6 Rockland County                93892.
##  7 Dutchess County                79964.
##  8 Saratoga County                79720.
##  9 Orange County                  78067.
## 10 Richmond County                78058.
## # ... with 52 more rows

NY_estimate%>%
  group_by(county)%>%
  summarise(
    mean.Median_Income = mean(Median_Income)
  )%>%
  arrange(mean.Median_Income)

## # A tibble: 62 x 2
##    county              mean.Median_Income
##    <chr>                            <dbl>
##  1 Bronx County                    43355.
##  2 Montgomery County               44574.
##  3 Chautauqua County               45896.
##  4 Allegany County                 47342.
##  5 Cattaraugus County              47753.
##  6 Delaware County                 48622.
##  7 Fulton County                   48737.
##  8 St. Lawrence County             49513.
##  9 Chenango County                 49962.
## 10 Broome County                   50290.
## # ... with 52 more rows
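To return exactly the top or bottom rows instead of reading them off a sorted table, dplyr's slice_max() and slice_min() are one option. The sketch below uses invented incomes for three made-up counties and n = 2; with the real data, n = 5 would give the five counties asked for.

```r
library(dplyr)

# Hypothetical tract-level incomes (invented for illustration)
toy <- tibble(
  county        = c("A", "A", "B", "B", "C", "C"),
  Median_Income = c(50000, 70000, 30000, 40000, 90000, 110000)
)

county_income <- toy %>%
  group_by(county) %>%
  summarise(mean_income = mean(Median_Income, na.rm = TRUE))

top_counties    <- slice_max(county_income, mean_income, n = 2)  # highest two
bottom_counties <- slice_min(county_income, mean_income, n = 2)  # lowest two

top_counties
bottom_counties
```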

4. Visualize the median income as a function of the percent of the population reporting their race as white only in two counties of your choosing using a smoothed line plot. Explain the patterns in your plot. (2 points)

Hint: Use geom_smooth() with x = % white and y = median income.

The pattern I can see in this plot is that in Nassau and Bronx counties a higher percentage of people who identify as white correlates with a higher median income.

library(dplyr)
library(ggplot2)

NY_estimate %>%
  filter(county %in% c("Nassau County", "Bronx County")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = per_white, y = Median_Income)) +
  labs(title = "Median Income Nassau and Bronx Counties",
       x = "White Only Percentage",
       y = "Median Income")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
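One way to compare the two counties directly is to map county to the color aesthetic, which makes geom_smooth() fit a separate curve for each. The sketch below is an assumption about how that could look, run on simulated data: `toy`, the county names, and every value are invented.

```r
library(ggplot2)

set.seed(1)
# Hypothetical tract data for two made-up counties (values invented)
toy <- data.frame(
  county    = rep(c("County A", "County B"), each = 30),
  per_white = runif(60, 0, 100)
)
toy$Median_Income <- 40000 + 500 * toy$per_white + rnorm(60, sd = 5000)

# color = county gives each county its own smoothed line
p <- ggplot(toy, aes(x = per_white, y = Median_Income, color = county)) +
  geom_smooth() +
  labs(title = "Median Income vs. Percent White Only (toy data)",
       x = "White Only Percentage",
       y = "Median Income",
       color = "County")
p
```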

5. Use a box and whisker plot (geom_boxplot()) to visualize differences in median income across 4 counties. What differences/similarities in median income do you notice across the 4 counties? (2 points)

Hint: Explain the meaning of each of the plot components: 1. bars, 2. boxes, 3. lines, 4. dots.

The bar inside each box marks the median of the data from each county. The boxes show the middle 50% of the data (the interquartile range). The lines (whiskers) extend from each box to the most extreme values that lie within 1.5 times the interquartile range of the box edges. The dots are outliers, or points that fall beyond the whiskers. Looking at these 4 counties, every part of the plot varies from county to county. The top two counties have higher medians, but the lower two have a smaller spread in the middle 50%. Nassau County has the highest median, but New York County has more data in the upper range covered by the whiskers. Bronx County has the lowest median, but it has outliers that reach the value of Nassau’s median. These plots are a great way to show how spread out the data really is, because a single number describing the median does not explain everything going on.

NY_estimate %>%
  filter(county %in% c("New York County", "Bronx County", "Nassau County", "Cattaraugus County")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = county, y = Median_Income)) +
  labs(title = "Median Income in select NY Counties",
       x = "County",
       y = "Median Income") +
  coord_flip()
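The four plot components can be computed by hand to see what geom_boxplot() draws with its default settings. This is a sketch on invented incomes for a single hypothetical county; the variable names are made up for the example.

```r
# Hypothetical median incomes for one county (invented for illustration)
income <- c(28000, 31000, 35000, 40000, 42000, 45000, 52000, 60000, 150000)

q   <- quantile(income, c(0.25, 0.5, 0.75))  # box edges and the median bar
iqr <- q[3] - q[1]                           # length of the box

# Whiskers stop at the most extreme values within 1.5 * IQR of the box
lower_whisker <- min(income[income >= q[1] - 1.5 * iqr])
upper_whisker <- max(income[income <= q[3] + 1.5 * iqr])

# Dots are the values beyond the whiskers
outliers <- income[income < lower_whisker | income > upper_whisker]
outliers  # 150000
```

Here the box runs from 35000 to 52000, the whiskers reach 28000 and 60000, and only the 150000 value is drawn as a dot.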

6. Create a scatterplot (geom_point()) showing the relationship between median income and the percent of the population reporting their race as white only. Explain the patterns in your plot. (2 points)

There seems to be a slight positive correlation, where tracts with a higher white percentage have higher median income, but overall there is not much of a trend. Points are scattered throughout most of the plot, so it is difficult to come to a conclusion.

NY_estimate %>%
  ggplot() +
  geom_point(mapping = aes(x = per_white, y = Median_Income), color = "blue") +
  labs(title = "Median Income by Census Tract",
       subtitle = "New York State",
       x = "White Only Percentage",
       y = "Median Income")

7. Compute the correlation between median income and the percent of the population reporting their race as white only. Explain the meaning of the correlation value. (2 points)

A correlation coefficient measures the strength and direction of the linear relationship between two variables. It describes the data well only when the relationship is roughly linear. In this case the coefficient is about 0.35 on a scale from -1 to 1, a weak positive correlation, which backs up the scatterplot’s impression that the two variables are only loosely related.

cor(NY_estimate$Median_Income, NY_estimate$per_white)

## [1] 0.3492649
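The scale of the coefficient is easiest to see at its extremes. The sketch below uses made-up vectors (none of it is ACS data) to show the two endpoints, and then cor.test(), which reports a confidence interval and p-value alongside the estimate.

```r
x <- 1:10

r_up   <- cor(x, 2 * x + 3)    # perfectly increasing line
r_down <- cor(x, -2 * x + 3)   # perfectly decreasing line
c(r_up, r_down)

# cor.test() adds a confidence interval and p-value for the estimate
set.seed(42)
y <- x + rnorm(10, sd = 3)     # hypothetical noisy data
result <- cor.test(x, y)
result$estimate
result$conf.int
```

A value near 0 means little linear association, while values near 1 or -1 mean the points fall close to a straight line.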

3. Reflection (3 points)

When conducting data analysis, you as a researcher not only have to demonstrate your technical skills but also consider your choices and their impacts. Reflecting on data analysis helps you document your process, what worked, what didn’t, and how you might improve. At the end of each class assignment you will be asked to write a short reflection.

For this reflection, please respond to the following prompts.

1. Why did you select the state and counties for your analysis in this assignment?

I have visited New York and really enjoyed it there, and I would like to go back sometime to see it again. The counties I chose were those with the highest, lowest, fifth-highest, and fifth-lowest average median incomes. I thought this would show an interesting comparison.

2. What additional information could you add to this analysis to better understand the relationships between race and income inequality?

It would be nice to look at box and whisker plots for more counties to see the spread within each one. Those plots are very helpful for showing that a single median doesn’t explain where most of the values in a dataset fall in relation to each other.

3. Describe how you could apply one data visualization skill from this week to your class research project.

I haven’t chosen all of my data yet, but I want to work with earthquakes. I think I could use the plotting tools to look at where earthquakes occur geographically and at whether death counts are correlated with magnitude.

4. Extra Credit (2 points)

Add a linear correlation line of best fit to your scatterplot of median income and the percent of the population reporting their race as white only. Do you think a linear relationship best describes your data?
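A minimal sketch of one way to add a straight line of best fit is geom_smooth(method = "lm") layered over geom_point(). The data here are simulated, so `toy`, `fit`, and every value are invented and say nothing about the New York results.

```r
library(ggplot2)

set.seed(2)
# Hypothetical tract data (values invented for illustration)
toy <- data.frame(per_white = runif(80, 0, 100))
toy$Median_Income <- 40000 + 500 * toy$per_white + rnorm(80, sd = 15000)

# Scatterplot with a straight least-squares line drawn on top
p <- ggplot(toy, aes(x = per_white, y = Median_Income)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(title = "Median Income vs. Percent White Only (toy data)",
       x = "White Only Percentage",
       y = "Median Income")
p

# The same line fitted directly, to read off the slope and R-squared
fit <- lm(Median_Income ~ per_white, data = toy)
coef(fit)
summary(fit)$r.squared
```

Comparing the straight line against the default loess curve, and checking R-squared, is one way to judge whether a linear relationship describes the data well.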

Knit your document to a .html file. Submit this knitted document on Canvas.

GEOG 4870/6870 Assignment 3

Julianne Atencio

2/2/2021
