Introduction and Hypothesis

Tracking and understanding the spread of the novel coronavirus (COVID-19) has been a difficult and important task. In this project, I look only at past data and do not attempt to make any predictions about the reach of this terrible virus. I chose to focus on New York, specifically New York City, because it is a major hotspot within the United States, so there are plenty of cases to analyze, and because of the quality and quantity of data available. The goal of this project is to investigate the different rates of infection between races and to suggest other possible contributing factors that may explain why COVID-19 seems to have a larger impact on Black and other minority communities.

Components of this analysis:

1. RStudio - tidytext, tidyverse
2. Tableau Desktop - create interactive maps linked in this report
3. Data -
  1. Census data provided by Tableau (broken down by zip code; includes population by race, income, household size)

  2. New York Cases (by zip code)

  3. New York Occupational Statistics

  4. New York Times COVID-19 data over time

  5. U.S. Bureau of Labor Statistics data on ability to work from home. To use this data, I created a new Excel document and copied in the needed statistics to make the analysis cleaner. I did not alter the information; I only copied what I needed into a more compatible format.

  6. U.S. Census data - average household size statistics. To use this data, I created a separate document with easier-to-use headers and copied only the information I needed.

Methodology:

For this analysis, my aim is to provide some background information and possible reasons why, in New York, there is a clear difference in how this virus is affecting minority communities. I chose to focus only on New York City because of the amount of data available for this area. While this allows me to include several possible factors, it also means that these results cannot be widely applied without further investigation of other geographic areas. The data surrounding COVID-19 varies greatly across the country and the world, and I wanted to stick to solid sources and one area rather than widen the scope with possibly incorrect data. Hopefully, as this virus progresses, more standardized data will become available at granular levels.

I began by looking at the New York City data by zip code. I chose the zip code level because it was the smallest level at which I could find both COVID-19 data and census data. This allows for more precise estimates. I overlaid cases over time on a map whose underlying layer I could switch to show information such as median household income, population by race, average household size, and occupations. I chose to include these variables for specific reasons.

Income: One speculated factor is that some people cannot afford to take time off work when they are sick, so people in lower-income jobs would presumably be more affected.

Race: This is the main variable of investigation so a visualization of this factor is helpful.

Household Size: We know that the more people who come into contact with one another, the more the virus spreads. There is also data to suggest that the average household size varies by race.

I found outside data sets to add information on occupation and education. I chose these variables because some types of occupations are at a higher risk than others. Certain jobs involve more contact with people, some are not flexible enough to be done from home, and some essential businesses that are still in operation need staff to keep functioning. This means that people in these fields are at a higher risk. In addition, some education levels lead to jobs that are more likely to allow working from home.

Overall, I focused on New York City and analyzed COVID-19 case data in comparison to race, household size, income, occupation and education in an effort to show that minorities are being more affected by this virus because of societal factors and inequalities.
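The zip-code-level comparison described above was done visually in Tableau. As a rough illustration only, the same idea can be sketched in R by joining case counts to census variables on zip code. The zip codes, column names, and values below are made-up placeholders, not the data used in this report, and the per-1,000 rate is an added assumption rather than part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: placeholder zip codes and values, not real figures
cases <- tibble::tribble(
  ~zip,    ~cases,
  "00001", 120,
  "00002", 45,
  "00003", 300
)

census <- tibble::tribble(
  ~zip,    ~population, ~median_income, ~avg_household_size,
  "00001", 40000,       52000,          2.9,
  "00002", 30000,       98000,          2.1,
  "00003", 60000,       41000,          3.4
)

# Attach the census variables to each zip code's case count,
# then express cases relative to population (an assumed extra step)
zip_data <- cases %>%
  left_join(census, by = "zip") %>%
  mutate(cases_per_1000 = cases / population * 1000)

zip_data
```

Keeping zip codes as character strings avoids losing leading zeros during the join.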

New York Data

The main reason that I chose to focus on New York City was the quantity of cases in that area. When looking at a visualization of the growth and spread of cases in the United States over time, it is clear that New York City is a dangerous hub for this disease. (see map of overall cases over time here)

After deciding where to focus, I searched for data on a smaller scale and more granular level. I was able to find data for New York City separated by zip code.

This is the data that I will primarily be working with, in comparison to other factors, for the remainder of this report. The purpose is to shed some light on why minorities are being affected on a larger scale by drawing connections between race and other factors that increase risk.

Occupation and Education

Essential workers and those who are unable to work from home are at a higher risk than those who are able to stay home and avoid daily contact. However, the reality is that many people cannot work from home or afford to take time off work to stay safe.

According to the U.S. Bureau of Labor Statistics, there is a strong correlation between education level and flexibility when it comes to working from home.

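# Load packages (tidyverse provides readr, ggplot2, and the %>% pipe)
library(tidytext)
library(tidyverse)
library(textdata)

# Work-from-home statistics with Education and Percent columns
workfromhome <- readr::read_csv("/Users/Genna/Desktop/work.csv")

# Bar chart of Percent by education level, ordered from lowest to highest
workfromhome %>% 
  ggplot(aes(reorder(Education, Percent), Percent)) +
  geom_col() +
  labs(x = "Education")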

To connect this back to the COVID-19 data, I overlaid the New York zip code data on Census data for the area showing occupation, specifically blue collar and service jobs. I chose these specific occupation categories because there is a lot of research supporting the idea that blue collar and service industry jobs are held by those with lower education levels.

You can see that in the area I chose to focus on, there is little variability in these occupations, which is why the underlying map appears about the same shade. But if you compare the small land area to the upper left of the map (Manhattan) with the main land area (Brooklyn), you can see that Manhattan has fewer service and blue collar jobs, as well as far fewer cases. The data in this section supports the idea that education, ability to work from home, and occupation are all interconnected and that they are all in some capacity connected to COVID-19 susceptibility.

Household Size

Another factor that I decided to investigate was the impact that household size has on the spread in New York City. I chose to incorporate an analysis of this variable because it does have reasonable ties to the race variable that is being cited most commonly.

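# Average household size by race (columns: race, avgsize)
households <- readr::read_csv("/Users/Genna/Desktop/households.csv")

# Bar chart of average household size, ordered from smallest to largest
households %>% 
  ggplot(aes(reorder(race, avgsize), avgsize)) +
  geom_col() +
  labs(x = "race")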

Now when looking at this variable in relation to the COVID-19 data, we can see a strong correlation between average household size and the number of cases in an area.

Here we see that most of the darker areas (those with larger average household sizes) have larger and darker circles (indicating more COVID-19 cases). This supports the idea that there is a correlation between household size and the number of cases in a particular area.
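The relationship above is read off the map by eye. One way to put a number on it, sketched here under assumptions, is a rank correlation between average household size and case counts across zip codes. The data frame below is a made-up placeholder, not the data behind the map, and Spearman's method is my choice for illustration because it does not assume a straight-line relationship.

```r
# Hypothetical toy data: one row per zip code, placeholder values only
zip_data <- data.frame(
  avg_household_size = c(2.1, 2.4, 2.6, 2.9, 3.1, 3.4),
  cases              = c(40, 65, 60, 120, 150, 300)
)

# Spearman rank correlation: values near 1 mean zip codes with larger
# households also tend to rank higher in case counts
rho <- cor(zip_data$avg_household_size, zip_data$cases, method = "spearman")

# cor.test() adds a p-value for the same statistic
test_result <- cor.test(zip_data$avg_household_size, zip_data$cases,
                        method = "spearman")

rho
test_result$p.value
```

With real data, a correlation like this would describe areas, not individuals, so it could support the pattern seen on the map without showing that household size causes more cases.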

Overall, the data included in this section shows that there is a connection between race and average household size, and taking that one step further, that there is also a connection between average household size in an area, and the number of cases.

Income

I chose to look at income as a possible variable because there have been reports of people with low incomes who cannot afford to take sick days or miss work. I therefore thought that this was an important variable to include, one that could potentially have a strong connection.

This map shows a strong connection between income and COVID-19 cases in New York City. Almost all of the larger, darker spots are in areas that are shaded lighter, meaning they have a lower median income. This supports the idea that lower income areas are more susceptible to the virus.
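As a hedged illustration of how the pattern on this map could be checked numerically, the sketch below splits zip codes into lower and higher income groups at the median and compares average case counts. The values and column names are hypothetical placeholders, not figures from this report, and splitting at the median is just one simple choice.

```r
library(dplyr)

# Hypothetical toy data: placeholder zip codes and values, not real figures
zip_data <- tibble::tribble(
  ~zip,    ~median_income, ~cases,
  "00001", 41000,          300,
  "00002", 48000,          150,
  "00003", 52000,          120,
  "00004", 75000,          60,
  "00005", 98000,          65,
  "00006", 120000,         40
)

# Label each zip code as below or at/above the median income,
# then average the case counts within each group
income_summary <- zip_data %>%
  mutate(income_group = if_else(median_income < median(median_income),
                                "lower", "higher")) %>%
  group_by(income_group) %>%
  summarise(n_zips = n(), mean_cases = mean(cases), .groups = "drop")

income_summary
```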

Race

Lastly, I wanted to include a visualization connecting race and COVID-19 cases. This connection has gained a lot of attention recently and is being increasingly reported on.

These two maps show the COVID-19 cases in New York City overlaid on a map shaded according to the racial makeup of the population. In both cases you can see that there is actually little variation between areas in the share of the population identified as Black or Hispanic. The background shading does vary, but not drastically and not in relation to the COVID-19 case indicators. This shows that while race could be a factor, it is not an overwhelming cause.

Conclusion

The point of this report was to examine different factors behind the spread of COVID-19 in New York City. I looked at variables including education, occupation, household size, income, and race, and tried to draw connections both among the variables and back to the spread of COVID-19. I was attempting to show that although Black and minority communities are being hit very hard by this disease, and there are increasing reports that cite a connection between race and the spread, that connection is due to several different factors, only some of which are included as examples here. This report saw strong visual connections between income and the virus, as well as between average household size and the virus. Both of these variables also have connections back to race, but that does not mean they are determined by it. I think this analysis shows that there are many small factors that, when tied together, create favorable conditions for the virus to spread. There is not just one factor that makes somebody more susceptible to the virus.