For each question below, show code. Once you’ve completed things, don’t forget to upload this document (knitted version please!) to CANVAS.
A few tips:
* Install packages using install.packages() and load them using library().
* Use ? or help() if you’re unsure about a function.

22 points total + 2 points extra-credit
**1. For a state of your choosing, use functions in *tidycensus* to extract (1) total population, (2) total number of people reporting their race as white only, and (3) median income from the 2014-2018 5-year ACS estimates in all census tracts. (2 points)**
Hint: Use View(). The labels for each of the variables you’re extracting are listed below:
* Estimate!!Median household income in the past 12 months (in 2018 inflation-adjusted dollars)
* Estimate!!Total!!White alone
* Estimate!!Total (make sure the concept is TOTAL POPULATION)
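library(tidycensus)
library(tidyverse)
# List the 2014-2018 ACS variables, then pull tract-level estimates for New York
v18 = load_variables(2018, "acs5", cache = TRUE)
#view(v18)
NY_ = get_acs(geography = "tract", variables = c(Median_Income = "B19013_001", White = "B02001_002", Total_Pop = "B01003_001"), year = 2018, state = "NY")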
## Getting data from the 2014-2018 5-year ACS
head(NY_)
## # A tibble: 6 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 36001000100 Census Tract 1, Albany County, New Yo~ Total_Pop 2022 218
## 2 36001000100 Census Tract 1, Albany County, New Yo~ White 473 122
## 3 36001000100 Census Tract 1, Albany County, New Yo~ Median_Inco~ 29063 8493
## 4 36001000200 Census Tract 2, Albany County, New Yo~ Total_Pop 4700 690
## 5 36001000200 Census Tract 2, Albany County, New Yo~ White 621 221
## 6 36001000200 Census Tract 2, Albany County, New Yo~ Median_Inco~ 29470 6013
2. Perform all of the necessary operations IN R to clean your data. (4 points)
Hint: The list below indicates data cleaning steps you should consider.
* remove all columns that are not needed for the analysis (we will only work with ACS estimates below)
* separate grouped geography variables into their own columns
* make sure all unnecessary spaces are removed from character variables
* remove rows with NA values
* pivot the dataframe
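# Split "tract, county, state" into separate columns
NY_ = separate(NY_, "NAME", into = c("tract", "county", "state"), sep = ",")
# Strip the space left behind after each comma
NY_$county <- trimws(NY_$county)
NY_$state <- trimws(NY_$state)
# Drop the margin-of-error column, then spread the variables into columns
NY_estimate = NY_ %>% select(-moe) %>% pivot_wider(names_from = variable, values_from = estimate)
# Keep only tracts with no missing estimates
NY_estimate = NY_estimate[complete.cases(NY_estimate),]
head(NY_estimate)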
## # A tibble: 6 x 7
## GEOID tract county state Total_Pop White Median_Income
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 36001000100 Census Tract 1 Albany Coun~ New Yo~ 2022 473 29063
## 2 36001000200 Census Tract 2 Albany Coun~ New Yo~ 4700 621 29470
## 3 36001000300 Census Tract 3 Albany Coun~ New Yo~ 5966 2525 37296
## 4 36001000401 Census Tract 4~ Albany Coun~ New Yo~ 2479 2133 75809
## 5 36001000403 Census Tract 4~ Albany Coun~ New Yo~ 4236 3108 70488
## 6 36001000501 Census Tract 5~ Albany Coun~ New Yo~ 3215 1764 31480
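The cleaning steps listed in the hint can also be written as a single pipeline. The sketch below is only an illustration: it uses a small made-up table (`toy_long`, with a hypothetical "Example County") in the long layout that get_acs() returns, so it runs without a Census API key.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the long layout returned by get_acs(); values are illustrative only
toy_long <- tibble(
  GEOID    = rep(c("0001", "0002"), each = 3),
  NAME     = rep(c("Census Tract 1, Example County, New York",
                   "Census Tract 2, Example County, New York"), each = 3),
  variable = rep(c("Total_Pop", "White", "Median_Income"), times = 2),
  estimate = c(2000, 500, 30000, 4000, 1000, NA),
  moe      = c(200, 100, 8000, 600, 200, NA)
)

toy_clean <- toy_long %>%
  select(-moe) %>%                                      # keep the estimates only
  separate(NAME, into = c("tract", "county", "state"), sep = ",") %>%
  mutate(across(c(tract, county, state), trimws)) %>%   # strip stray spaces
  pivot_wider(names_from = variable, values_from = estimate) %>%
  drop_na()                                             # drop tracts with a missing estimate

toy_clean
```

Tract 2 has no median income in the toy data, so only tract 1 survives the final drop_na() step.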
3. Create a new column (called per_white) where you calculate the percent of the population reporting their race as white only in each census tract. Check your calculation by slicing your dataframe using a logical expression to return all rows where per_white > 100. This expression should return 0 rows. (2 points)
NY_estimate = NY_estimate %>%
  mutate(per_white = White / Total_Pop * 100)
filter(NY_estimate, per_white > 100)
## # A tibble: 0 x 8
## # ... with 8 variables: GEOID <chr>, tract <chr>, county <chr>, state <chr>,
## # Total_Pop <dbl>, White <dbl>, Median_Income <dbl>, per_white <dbl>
3. Summarize average median income by county. What are the top 5 counties with highest average median income? What are the top 5 counties with lowest average median income? (3 points)
NY_estimate %>%
  group_by(county) %>%
  summarise(
    mean.Median_Income = mean(Median_Income)
  ) %>%
  arrange(desc(mean.Median_Income))
## # A tibble: 62 x 2
## county mean.Median_Income
## <chr> <dbl>
## 1 Nassau County 118475.
## 2 Westchester County 106272.
## 3 Putnam County 104310.
## 4 Suffolk County 102428.
## 5 New York County 94791.
## 6 Rockland County 93892.
## 7 Dutchess County 79964.
## 8 Saratoga County 79720.
## 9 Orange County 78067.
## 10 Richmond County 78058.
## # ... with 52 more rows
NY_estimate %>%
  group_by(county) %>%
  summarise(
    mean.Median_Income = mean(Median_Income)
  ) %>%
  arrange(mean.Median_Income)
## # A tibble: 62 x 2
## county mean.Median_Income
## <chr> <dbl>
## 1 Bronx County 43355.
## 2 Montgomery County 44574.
## 3 Chautauqua County 45896.
## 4 Allegany County 47342.
## 5 Cattaraugus County 47753.
## 6 Delaware County 48622.
## 7 Fulton County 48737.
## 8 St. Lawrence County 49513.
## 9 Chenango County 49962.
## 10 Broome County 50290.
## # ... with 52 more rows
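Instead of sorting the full table twice and reading off the first rows, the top and bottom counties can be pulled out directly. The sketch below assumes dplyr 1.0 or later (for slice_max() and slice_min()) and uses made-up counties "A", "B", and "C" rather than the New York data.

```r
library(dplyr)

# Made-up tract-level incomes for three hypothetical counties
toy_income <- tibble(
  county        = c("A", "A", "B", "B", "C", "C"),
  Median_Income = c(50000, 70000, 30000, 40000, 90000, 110000)
)

county_means <- toy_income %>%
  group_by(county) %>%
  summarise(mean.Median_Income = mean(Median_Income))

# Highest and lowest averages; n = 2 here, n = 5 would match the question
top2    <- slice_max(county_means, mean.Median_Income, n = 2)
bottom2 <- slice_min(county_means, mean.Median_Income, n = 2)

top2
bottom2
```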
4. Visualize the median income as a function of the percent of the population reporting their race as white only in two counties of your choosing using a smoothed line plot. Explain the patterns in your plot. (2 points)
Hint: Use geom_smooth() with x = % white and y = median income.
The pattern I can see in this plot is that, in Nassau and Bronx counties, tracts with a higher percentage of people who identify as white tend to have a higher median income.
library(dplyr)
library(ggplot2)
NY_estimate %>%
  filter(county %in% c("Nassau County", "Bronx County")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = per_white, y = Median_Income)) +
  labs(title = "Median Income Nassau and Bronx Counties",
       x = "White Only Percentage",
       y = "Median Income")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
5. Use a box and whisker plot (geom_boxplot()) to visualize differences in median income across 4 counties. What differences/similarities in median income do you notice across the 4 counties? (2 points)
Hint: Explain the meaning of each of the plot components: 1. bars, 2. boxes, 3. lines, 4. dots.
The bars represent the median of the tract-level data in each county. The boxes show the middle 50% of the data (the interquartile range). The lines (whiskers) extend from each box to the most extreme value within 1.5 times the interquartile range of the box edge. The dots are outliers, or points that fall beyond the whiskers. Across these 4 counties, every component of the plot varies. The top two counties have higher medians, but the lower two have a smaller spread in the middle 50%. Nassau County has the highest median, but New York County has more data toward the upper end of the whisker range. Bronx County has the lowest median, but it has outliers that reach the value of Nassau’s median. These plots are a great way to show how spread out the data really is, because a single number describing the median does not explain everything going on.
NY_estimate %>%
  filter(county %in% c("New York County", "Bronx County", "Nassau County", "Cattaraugus County")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = county, y = Median_Income)) +
  labs(title = "Median Income in select NY Counties",
       x = "County",
       y = "Median Income") +
  coord_flip()
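With its default settings, geom_boxplot() draws the box from the first to the third quartile, draws the whiskers out to the most extreme values within 1.5 times the interquartile range of the box, and plots anything beyond that as a dot. The sketch below reproduces that rule by hand on nine made-up income values (in $1000s); the numbers are illustrative and are not taken from the New York data.

```r
# Made-up median incomes (in $1000s) for nine hypothetical tracts
income <- c(20, 30, 35, 40, 45, 50, 55, 60, 150)

# Box edges (25% and 75%) and the median bar (50%)
q   <- unname(quantile(income, c(0.25, 0.50, 0.75)))
iqr <- q[3] - q[1]

# Values beyond these fences are drawn as individual dots
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme values still inside the fences
whisker_low  <- min(income[income >= lower_fence])
whisker_high <- max(income[income <= upper_fence])
outliers     <- income[income < lower_fence | income > upper_fence]

c(median = q[2], whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Here the value 150 lies above the upper fence, so it would appear as a single dot while the upper whisker stops at 60.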
6. Create a scatterplot (geom_point()) showing the relationship between median income and the percent of the population reporting their race as white only. Explain the patterns in your plot. (2 points)
There seems to be a slight positive correlation, where the tracts with a higher white percentage have higher median income, but overall there is not much of a trend. The points are scattered throughout most of the plot, so it is difficult to come to a conclusion.
NY_estimate %>%
  ggplot() +
  geom_point(mapping = aes(x = per_white, y = Median_Income), color = "blue") +
  labs(title = "Median Income by Census Tract",
       subtitle = "New York State",
       x = "White Only Percentage",
       y = "Median Income")
7. Compute the correlation between median income and the percent of the population reporting their race as white only. Explain the meaning of the correlation value. (2 points)
The correlation coefficient measures the strength and direction of the linear relationship between two variables. It describes the data well only when the relationship is roughly linear. In this case the coefficient is about 0.35 on a scale from -1 to 1, a fairly weak positive correlation, which supports what the scatterplot shows: a higher white percentage is only loosely associated with higher median income.
cor(NY_estimate$Median_Income, NY_estimate$per_white)
## [1] 0.3492649
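To put a value like this in context, it helps to see what cor() returns at the extremes. The sketch below uses made-up vectors, not the ACS data: a perfectly increasing line, a perfectly decreasing line, and a noisier increasing pattern.

```r
x <- c(1, 2, 3, 4, 5)

r_up    <- cor(x, 2 * x + 1)         # perfectly linear, increasing
r_down  <- cor(x, -3 * x)            # perfectly linear, decreasing
r_noisy <- cor(x, c(2, 1, 4, 3, 5))  # increasing overall, but not on a line

round(c(r_up, r_down, r_noisy), 2)
```

The first two sit at the ends of the scale (1 and -1), and the noisy example falls in between at 0.8.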
When conducting data analysis, you as a researcher not only have to demonstrate your technical skills but also consider your choices and their impacts. Reflecting on data analysis helps you document your process, what worked, what didn’t, and how you might improve. At the end of each class assignment you will be asked to write a short reflection.
For this reflection, please respond to the following prompts.
1. Why did you select the state and counties for your analysis in this assignment?
I have visited New York and really enjoyed it there, and I would like to go back sometime. The counties I chose were those with the highest, lowest, 5th-highest, and 5th-lowest average median incomes. I thought this would show an interesting comparison.
2. What additional information could you add to this analysis to better understand the relationships between race and income inequality?
It would be nice to look at box and whisker plots for more counties to see the spread within each one. Those plots are very helpful for showing that a single median doesn’t explain where most of the values in a dataset fall in relation to each other.
3. Describe how you could apply one data visualization skill from this week to your class research project.
I haven’t chosen all of my data yet, but I want to work with earthquakes. I think using the plotting tools to look at where the earthquakes occur geographically, and/or how death counts compare to magnitude, might reveal a correlation.
Add a linear correlation line of best fit to your scatterplot of median income and the percent of the population reporting their race as white only. Do you think a linear relationship best describes your data?
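One way such a line could be added is geom_smooth() with method = "lm", layered on top of geom_point(). The sketch below is a minimal illustration on five made-up points (the data frame `toy` and its values are assumptions, not the New York tracts), and it also fits the same line with lm() so the slope can be inspected.

```r
library(ggplot2)

# Made-up values standing in for per_white and Median_Income
toy <- data.frame(
  per_white     = c(10, 30, 50, 70, 90),
  Median_Income = c(40000, 45000, 60000, 65000, 90000)
)

# Scatterplot with a straight line of best fit (no confidence band)
p <- ggplot(toy, aes(x = per_white, y = Median_Income)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "White Only Percentage", y = "Median Income")

# The same line fitted directly, to read off the slope and intercept
fit <- lm(Median_Income ~ per_white, data = toy)
coef(fit)
```

Printing `p` draws the plot. Comparing the straight line with the default smoothed curve is one way to judge whether a linear relationship describes the data well.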
Knit your document to a .html file. Submit this knitted document on Canvas.