library(openintro)
library(tinytex)
library(tidyverse)
library(stringr)
library(magrittr)
library(gridExtra)
library(readxl)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.4

Project 2

Team members: Eric Hirsch, Dan Sullivan, Cassie Coste

Introduction

In this project we will tidy three datasets using TidyR and Dyplr and other handy R tools. After the transformation we will conduct some analyses. This is a joint project with Cassie Coste and Dan Sullivan. We each cleaned a data set and then showed each other what our challenges were and how we overcame them. This first data set was the data set that I tidied.

Dataset 1: Bureau of Labor Statistics Data

The data consists of six CSV files from the Bureau of Labor Statistics showing numbers of Americans involved in various occupations and industries, spanning the years 2015 through 2020. Each file is in the same format. The data set is “wide” – occupations run horizontally and industries run vertically.

Challenges with the data set:

  1. The data set needs to be converted from wide to long, and needs to include a year column and a demographic column.
  2. The industries which run vertically are repeated six times - for two genders, three races, and a total. These will need to be collected together.
  3. The race categories do not add up to the total because they don’t comprise all of the possible races. Therefore an “other race” category will need to calculated and then created.
  4. The occupations do not appear at the top of the raw data (they appear in the fifth row), which means they need to be extracted and inserted as column headings for the data set. There are a number of issues with these column headings - for example, they are too long, and they include the insertion of dashes and carriage returns which will need to be removed.
  5. Some of the rows are summary rows and will need to be removed. In some cases, remaining rows will need to be renamed as they don’t make sense standing alone without the summary row.
  6. All of the years of the data set need to be appended to the data frame
  7. The demographic categories (race, gender and total) need to be spread out as columns and arranged in order.

Dataset 2 - The Upshot - Prison Admissions by County

This dataset is a long dataset used by The Upshot NYT in their article “A Small Indiana County Sends More People to Prison Than San Francisco and Durham, N.C., Combined. Why?” to report on the increase in rural prison populations in recent years.

Challenges:

The primary issue with the data set is that it contains years in the variable names for three different variables. The goal is to get to a data set with columns for the three prison admission variables and one for the year. There are also some minor tidying edits such as converting columns to numeric or factor and or removing/adding words to columns or column names.

As this dataset is only part of the picture that was being looked at by The Upshot, to gain further insight into this dataset and look at some of the things that the article was referring to as well as comments made in the class discussion board, new variables need to be computed and more county data is needed that was not made available by The Upshot. For county data, an additional dataset (2.b) from the US Department of Agriculture (USDA) is joined with The Upshot dataset to perform the final analyses.

This county dataset had its own set of challenges. There is a rural-urban variable at two time points that is on a 1-9 spectrum and will be re-coded to match Metropolitan/Urban/Rural categories provided by the USDA. When these variables are incorporated into the original dataset, the rural-urban variable must then be made long through the creation of a new variable using ifelse statements to make sure that prison county data from the 2000s receives the urbanicity variable from the 2000 census and county data from the 2010s receives the urbanicity variable from the 2010 census.

Dataset 3 - Historic Epidemic Data

Dataset three Uses historic epidemic data as well as estimates and measurements on global population in the form of two separate tables for pre 1950 populations and post 1950 populations. It incorporates these three csv to get estimates on how impactful certain epidemics were based on the total deaths as compared to global population.

Challenges:

The biggest challenge with this data was data formation. It contained many pieces of data that were mixed strings and numbers, notes within data points, as well as hyperlink references. Because many columns were not standardized I used a lot of regex to tackle a lot of these issues ultimately paring things down to a point where the values could be standardized and used.

The next challenge was creating a metric to join population data that fluctuated depending on the year. I found population data that was yearly from 1950-2017, by decade from 1900-1950 and every century from 1AD to 1900. Because of this I had to make my own rounding function. where depending on what year the event was it rounded so that population metrics could be added accordingly.

Tidying Dataset #1: BLS Data

  1. First, the first year of data is read into a data frame:
dfBLS_raw <- read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-2015.csv", sep=",")
  1. Next, we drop any columns we don’t need and add any columns we do. In this case there is one of each:
  1. Drop the Totals column
  2. Add the Year column
dfBLS <- dfBLS_raw %>%
  select(-"X") %>%
  mutate(Year = "2015") 
  1. Now we need to fix the column names (the column names should be the occupations, but they are not because the occupations were not at the top of the file.) We will need to extract the column names into a vector, clean the names of any extraneous characters and other issues, and use the vector to rename the dataframe columns:
  1. Get a vector for the column names from the row that contains them
  2. Clean the column names by replacing the hyphens and paragraph breaks, taking out unneeded words like ‘occupations’, etc.
  3. Rename the columns using the vector
#Extract the names
vColumnNames <- as.character(dfBLS %>%
  filter(row_number() %in% 5) %>%
  select(-contains("Household") & -"Year"))

#Clean the names
vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "-   ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "  ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")

#Replace the dataframe names with the extracted names
dfBLS %<>%
  rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)
  1. The industries are repeated six times for total, female, male, Black, Asian, and White. Each of these units will need to be extracted into its own data frame, and the columns gathered into “long” format. We will also do some cleaning of extraneous rows:
  1. Select rows using ‘filter’
  2. Add the relevant demographic info as a column (.e.g Gender or Race)
  3. Handle rows which merely summarize other rows
    1. Remove the summary row
    2. Rename the remaining rows where necessary
  4. Gather columns (using handy vector names column) to convert from wide to long
DemographicUnit <- function(df, rowStart, rowEnd, demoContent)
{
 
#Extract units based on rows, insert a column with demographic category, remove summary rows and gather the dataframe into long format 
dfNew <- df %>% 
  filter(row_number() %in% rowStart:rowEnd) %>%
  mutate(demog := demoContent) %>%
  filter(Industry !="Manufacturing" & Industry !="Wholesale and retail trade" & Industry !="Other services") %>%
  gather(all_of(vColumnNames),  key="Occupation", value="NumberEmployed")

#The new "NumberEmployed" category needs to be turned into an integer from character. This means removing the comma, and recasting it as an integer. A unique row ID is also created here which will simplify merging the data frame with others.
  dfNew$NumberEmployed = str_replace_all(dfNew$NumberEmployed, ",", "")
  dfNew <- mutate_at(dfNew, vars(NumberEmployed), list(as.integer)) %>%
  mutate(RowID = row_number())

#Some of the rows which were summarized by summary rows need to be reworded.
  dfNew$Industry = str_replace_all(dfNew$Industry, "Durable goods", "Manufacturing, durable goods")
  dfNew$Industry = str_replace_all(dfNew$Industry, "Nondurable goods", "Manufacturing, nondurable goods")
  dfNew$Industry = str_replace_all(dfNew$Industry, "Private households", "Other services, private households only")

return (dfNew)
}

#Each of the units is extracted using the function we've just created
dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")
  1. Create the unified dataframe by binding all the race and gender dataframes together:
dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)
  1. Spread the dataframe by demographic unit and create the “Other Race” column by comparing the three race units to the total
  1. Spread the dataframe using TidyR so each demographic unit is a column
  2. Create a category called “Other” which is the Total minus the other ace categories
  3. Create Final dataframe to accept more years
dfAll %<>% 
  spread(demog, NumberEmployed) %<>%
  mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
  mutate(Other = Total-(Black + White + Asian))

dfFinal <- dfAll

This is the last step. We now have a clean data frame in long form. The only thing that remains is using the same steps to read in the other years and appending those dataframes to this. We do this with a function that brings all of the steps together.

ReadYear <- function(bls_Year)
{

fileName <- str_c("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-", bls_Year, ".csv")
dfBLS_raw <- read.delim(fileName, sep=",")

dfBLS <- dfBLS_raw %>%
  select(-"X") %>%
  mutate(Year = bls_Year) 

vColumnNames <- as.character(dfBLS %>%
  filter(row_number() %in% 5) %>%
  select(-contains("Household") & -"Year"))

vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "-   ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "-", "")
vColumnNames = str_replace_all(vColumnNames, "  ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")

dfBLS %<>%
  rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)

dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")

dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)

dfAll %<>% 
  spread(demog, NumberEmployed) %<>%
  mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
  mutate(Other = Total-(Black + White + Asian))

  return(dfAll)

}

dfs <- ReadYear("2016")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2017")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2018")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2019")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2020")
dfFinal <- rbind(dfFinal, dfs)

dfFinal %<>%
  select("Year", "Industry", "Occupation", "Female", "Male", "Black", "White", "Asian", "Other", "Total")

Thus we go from this:

head(dfBLS_raw,10) %>%
  kbl(caption = "Raw Data - BLS") %>%
  kable_styling(bootstrap_options = c("condensed"))
Raw Data - BLS
ï..HOUSEHOLD.DATA..ANNUAL.AVERAGES..17..Employed.persons.by.industry..sex..race..and.occupation X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11
[In thousands]
Industry, sex, and race 2015
Total em- ployed Management, professional, and related occupations Service occupations Sales and office occupations Natural resources, construction, and maintenance occupations Production, trans- portation, and material moving occupations
Management, business, and financial operations occupations Profes- sional and related occupa- tions Protec- tive service occupa- tions Service occupa- tions, except protec- tive Sales and related occupa- tions Office and adminis- trative support occupa- tions Farming, fishing, and forestry occupa- tions Construc- tion and extrac- tion occupa- tions Instal- lation, mainte- nance, and repair occupa- tions Produc- tion occupa- tions Transpor- tation and material moving occupa- tions
TOTAL
Agriculture and related 2,422 1,085 49 18 91 17 87 903 14 31 30 95
Mining, quarrying, and oil and gas extraction 917 150 148 2 5 12 73 1 292 73 62 100
Construction 9,935 1,762 237 10 39 98 509 4 6,347 534 153 244

to this:

head(dfFinal) %>%
  kbl(caption = "Final Data - BLS") %>%
  kable_styling(bootstrap_options = c("condensed"))
Final Data - BLS
Year Industry Occupation Female Male Black White Asian Other Total
2015 Agriculture and related Construction and extraction 0 14 0 13 1 0 14
2015 Agriculture and related Farming, fishing, and forestry 184 719 34 820 9 40 903
2015 Agriculture and related Installation, maintenance, and repair 0 31 0 30 0 1 31
2015 Agriculture and related Management, business, and financial operations 261 824 9 1052 6 18 1085
2015 Agriculture and related Office and administrative support 76 11 2 83 0 2 87
2015 Agriculture and related Production 6 24 3 25 0 2 30

Analysis of Dataset 1: BLS Data

Task - Name one significant shift in women’s employment from traditional occupations/industries to nontraditional from 2015 to 2020.

Thiswas harder than one might think!

We begin by examining the top 5 occupation/Industry combinations for women in 2015 and 2020. We can see that the same combinations appear in each year, though the percentages and order change slightly. These are sectors where women dominate: nannies and maids, education and health, and various office and admin support occupations.

dfFinal <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Final.csv")
dfFinal <- mutate_at(dfFinal, vars(Year), list(as.character))
dfFinal <- na.omit(dfFinal)
  
df2015 <- dfFinal %>%
  filter(Year=="2015" & Total>100) %>%
  mutate(PercentWomen = Female/Total) %>%
  select(Industry, Occupation, Total, PercentWomen) %>%
  arrange(desc(PercentWomen))
df2015 <- na.omit(df2015)

head(df2015) %>%
  kbl(caption = "Top Occupation/Industries - 2015") %>%
  kable_styling(bootstrap_options = c("condensed"))
Top Occupation/Industries - 2015
Industry Occupation Total PercentWomen
Other services, private households only Service, except protective 770 0.9428571
Education and health services Office and administrative support 3561 0.8885145
Construction Office and administrative support 509 0.8506876
Other services, except private households Office and administrative support 628 0.8471338
Financial activities Office and administrative support 2171 0.8060801
Education and health services Service, except protective 7018 0.8045027
df2020 <- dfFinal %>%
  filter(Year=="2020" & Total>100) %>%
  mutate(PercentWomen = Female/Total) %>%
  select(Industry, Occupation, Total, PercentWomen) %>%
  arrange(desc(PercentWomen))
df2020 <- na.omit(df2020)

head(df2020) %>%
  kbl(caption = "Top Occupation/Industries - 2020") %>%
  kable_styling(bootstrap_options = c("condensed"))
Top Occupation/Industries - 2020
Industry Occupation Total PercentWomen
Other services, private households only Service, except protective 611 0.9148936
Education and health services Office and administrative support 3258 0.8956415
Other services, except private households Office and administrative support 531 0.8305085
Construction Office and administrative support 545 0.8146789
Education and health services Service, except protective 6495 0.8096998
Financial activities Office and administrative support 1932 0.7774327

We can do a deeper dive into occupations and industries by year to see if we can track any significant changes there.

dfSummary <- dfFinal %>%
  group_by(Occupation, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)
dfSummary <- na.omit(dfSummary)

ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percent of Women Employed By Occupation") +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral") 

The average percent of women employed in various occupations in the US was very stable from 2105 to 2020. The only really noticeable shift is in transportation. In most sectors there is little change. There are some odd outliers - e.g. a large drop in the % of women employed in farming and fishing in 2017. Given the relative stability elsewhere this is probably a shift in how occupations are labeled rather than a real change in the workforce.

Looking at industries …

dfSummary2 <- dfFinal %>%
  group_by(Industry, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed By Industry")+
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

Industries show a tiny bit more movement. Women’s % in some traditional industries like nannying and maids fell, and rose slightly in nontraditional ones like construction and mining.

Do these minor industry shifts signify real change? Let’s look at the occupations women are occupying in the mining industry:

dfSummary <- dfFinal %>%
  filter(Industry=="Mining, quarrying, and oil and gas extraction") %>%
  group_by(Occupation, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)
dfSummary <- na.omit(dfSummary)

ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed in Mining") +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

Most of the increase in women’s % in mining is in traditional occupations like office, admin, and non-protective services.

We can also circle back to transportation, the occupation in which women have been making the most inroads. What is happening there?

dfSummary2 <- dfFinal %>%
  filter(Occupation=="Transportation and material moving") %>%
  group_by(Industry, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women in Transportation By Industry")+
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

There appears to be some real movement in women’s % of participation in transportation accross many industries. While mainly in service, education, information and retail, nonetheless women do appear to be making inroads into transportation occupations.

A google search reveals that in 2016 the DOT launched the Women and Girls Transportation Initiative to increase women’ participation in transportation. The initiative appears to have been successful.

While there was little shift of women’s percentages in most occupations between 2015 and 2020, transportation does appear to be one area where women made headway in a number of industries.

Analysis of Dataset 2: Prison Data

The prison dataset was tidied from this:

prison_admissions_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheUpshot/prison-admissions/master/county-prison-admissions.csv", 
                                                  header = TRUE, stringsAsFactors = FALSE, sep = ","))


head(prison_admissions_raw) %>%
  kbl(caption = "Raw Data - Prisons") %>%
  kable_styling(bootstrap_options = c("condensed"))
Raw Data - Prisons
fips county state admitsPer10k2006 admitsPer10k2013 admitsPer10k2014 valid06 valid13 valid14 population2006 population2013 population2014 admissions2006 admissions2013 admissions2014 source
1001 Autauga County AL 44.25665 19.225189 18.593736 true true true 51328 55136 55395 243 106 103 NCRP
1003 Baldwin County AL 24.63739 17.703371 16.540820 true true true 168121 195443 200111 461 346 331 NCRP
1005 Barbour County AL 75.39988 10.378827 12.273589 true true true 27861 26978 26887 206 28 33 NCRP
1007 Bibb County AL 21.97416 11.109136 6.664889 true true true 22099 22504 22506 50 25 15 NCRP
1009 Blount County AL 16.13939 11.781012 6.930127 true true true 55485 57720 57719 93 68 40 NCRP
1011 Bullock County AL 66.07110 8.486563 NA true true 10776 10605 10764 71 9 NA NCRP

to this:

dfPrisons <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/prison_df.csv")

head(dfPrisons) %>%
  kbl(caption = "Prison and County Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Prison and County Data
fips county state source percent_change year prison_admitsper10k county_population prison_admissions persistent_poverty urbanicity
1001 Autauga AL NCRP -57.99 2006 44.26 51328 243 0 Metropolitan
1003 Baldwin AL NCRP -32.86 2006 24.64 168121 461 0 Urban_Adjacent
1005 Barbour AL NCRP -83.72 2006 75.40 27861 206 1 Urban_Adjacent
1007 Bibb AL NCRP -69.67 2006 21.97 22099 50 1 Metropolitan
1009 Blount AL NCRP -57.06 2006 16.14 55485 93 0 Metropolitan
1011 Bullock AL NCRP NA 2006 66.07 10776 71 1 Urban_Adjacent

Task - What states account for the reduction in prison admissions from 2006 to 2014?

dfPrisons <- mutate_at(dfPrisons, vars(year), list(as.character))
dfPrisons <- na.omit(dfPrisons)

dfSummary2 <- dfPrisons %>%
  group_by(year) %>%
  summarize(Total = sum(prison_admissions))
## `summarise()` ungrouping output (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =year, y=Total)) +
  geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by Year")+
  scale_fill_brewer(palette = "Spectral") +
  ylab("Admissions")

We can see that overall prison admissions fell from 2006 to 2014.

dfSummary2 <- dfPrisons %>%
  group_by(state, year) %>%
  summarize(Total = sum(prison_admissions))
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =state, y=Total, group=year, fill=year)) +
  geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by State")+
  scale_fill_brewer(palette = "Spectral") +
  ylab("Admissions")

The lion’s share of decreases came from California - Alabama, Florida, New York and Ohio contributed as well. There were a few states that gained admissions as well, though the gains were rather small - Arizona, Minnesota, Kentucky, North Carolina, Oklahoma and Pennsylvania.

dfCounty <- dfPrisons %>%
  filter(state=="CA") %>%
  group_by(county, urbanicity) %>%
  summarize(Diff = sum(prison_admissions[year=="2006"] - sum(prison_admissions[year=="2014"]))) %>%
  filter(Diff>750)
## `summarise()` regrouping output by 'county' (override with `.groups` argument)
ggplot(data = dfCounty, aes(x = reorder(county, Diff),  y=Diff)) +
  geom_col(position=position_dodge()) + ggtitle("Decrease in Prison Admissions by CA County, 2006-2014")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  ylab("Decrease in Admissions") +
    xlab("County")

Many counties in California contributed to the decline, mostly in Southern California which accounted for the 5 counties over 5,000. Los Angeles, however, dominated the losses with over 22,000.

ggplot(data = dfCounty, aes(x = reorder(county, Diff),  y=urbanicity)) +
  geom_col(position=position_dodge()) + ggtitle("Counties by 'Urbanicity'")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  ylab("Urbanicity") +
  xlab("County")

When we look at the “urbancity” of admission declines we see all the counties in California with declines above 750 are urban metropolitan areas.

Thus we conclude that the bulk of admission declines are from California metro areas, especially in Southern California.

Analysis of Dataset 3: Epidemics

The epidemic dataset was tidied from this:

dfepidemic_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/epidemic.csv", 
                                                  header = TRUE, stringsAsFactors = FALSE, sep = ","))


head(dfepidemic_raw) %>%
  kbl(caption = "Raw Data - Epidemics") %>%
  kable_styling(bootstrap_options = c("condensed"))
Raw Data - Epidemics
Event Date Location Disease Death.toll..estimate. Ref.
Antonine Plague 165–180 (possibly up to 190) Roman Empire Unknown, possibly smallpox 5–10 million [23][24]
Jian’an Plague 217 Han Dynasty Unknown, possibly typhoid fever or viral hemorrhagic fever Unknown [25][26]
Plague of Cyprian 250–266 Europe Unknown, possibly smallpox Unknown [27][28]
Plague of Justinian (beginning of First plague pandemic) 541–549 Europe and West Asia Bubonic plague 15–100 million (25–60% of population of Europe) [11][29][30]
Roman Plague of 590 (part of First plague pandemic) 590 Rome, Byzantine Empire Bubonic plague Unknown [31]
Plague of Sheroe (part of First plague pandemic) 627–628 Bilad al-Sham Bubonic plague 25,000+

to this:

dfepidemics <- read.csv("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/a_clean_epidemic.csv")

head(dfepidemics) %>%
  kbl(caption = "Epidemic - clean") %>%
  kable_styling(bootstrap_options = c("condensed"))
Epidemic - clean
X event start_year end_year disease deaths_low_estimate global_population low_global_death_percent
1 Black Death 1346 1353 Bubonic plague 7.5e+07 3.920e+08 0.1913265
2 Plague of Justinian 541 549 Bubonic plague 1.5e+07 2.100e+08 0.0714286
3 Antonine Plague 165 180 Unknown 5.0e+06 2.020e+08 0.0247525
4 1520 Mexico smallpox epidemic 1519 1520 Smallpox 5.0e+06 4.610e+08 0.0108460
5 Cocoliztli Epidemic of 1545–1548 1545 1548 Possibly Salmonella enterica 5.0e+06 4.610e+08 0.0108460
6 1918 influenza pandemic 1918 1920 Influenza A virus subtype H1N1 1.7e+07 1.912e+09 0.0088912

Task - How does the Covid epidemic compare to other epidemics through history?

dfepidemics <- mutate_at(dfepidemics, vars(deaths_low_estimate), list(as.integer))

dfE <- dfepidemics %>%
  select(event, deaths_low_estimate) %>%
  na.omit() %>%
  filter(deaths_low_estimate > 200000)

dfHighlight <- dfE %>%
  filter(event=="COVID-19 pandemic")

ggplot(data = dfE, aes(x = reorder(event, deaths_low_estimate),  y=deaths_low_estimate)) +
  geom_col(position=position_dodge()) + ggtitle("Epidemics by Total Deaths")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  geom_col(data=dfHighlight, aes(x=event, y=deaths_low_estimate), color="black", fill="orange") +
  xlab("Event") +
  ylab("Number of Deaths")

In terms of total deaths COVID 19 has been very significant (in the top ten) but is dwarfed by some tof the big ones (AIDS, the Black Death and the 1918 Spanish Flu pandemic).

The dataset does not include population estimates for many years so we will fill NA values with the value above for a proxy. Then we can calculate % deaths per population for COVID and other events that have no global percentage calculated.

dfE <- dfepidemics %>% fill(global_population) %>%
   mutate(glob_percent = deaths_low_estimate/global_population*100) %>%
  select(event, glob_percent) %>%
    na.omit() %>%
  filter(glob_percent>.035)

dfHighlight <- dfE %>%
  filter(event=="COVID-19 pandemic")

ggplot(data = dfE, aes(x = reorder(event, glob_percent),  y=glob_percent)) +
  geom_col(position=position_dodge()) + ggtitle("Epidemics by % of Deaths in World Pop")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  geom_col(data=dfHighlight, aes(x=event, y=glob_percent), color="orange", fill="orange") +
  xlab("Event") +
  ylab("% of Deaths")

In terms of % of global population, COVID is again dwarfed by others but still makes the list at 20.

Conclusion

The three data sets Cassie, Dan and I chose all had different challenges. My data set had classic issues of wideness, repeating data, etc. Cassie also had a wide data set that needed a reworking of columns and some adding of new columns. Dan had a host of standardization issues – dates, numbers, and text were all written in different formats throughout the data set. Both Cassie and Dan added data from more than one type of data set, and I combined six data sets of the same type from different years.

Working in a team made it possible to learn many different techniques very quickly. We are all much more advanced in our ability to tidy data.