Project 2

Team members: Eric Hirsch, Dan Sullivan, Cassie Coste

Introduction

In this project we will tidy three datasets using TidyR and Dyplr and other handy R tools. After the transformation we will conduct some analyses. This is a joint project with Cassie Coste and Dan Sullivan. We each cleaned a data set and then showed each other what our challenges were and how we overcame them. This first data set was the data set that I tidied.

Dataset 1: Bureau of Labor Statistics Data

The data consists of six CSV files from the Bureau of Labor Statistics showing numbers of Americans involved in various occupations and industries, spanning the years 2015 through 2020. Each file is in the same format. The data set is “wide” – occupations run horizontally and industries run vertically.

Challenges with the data set:

The data set needs to be converted from wide to long, and needs to include a year column and a demographic column.
The industries which run vertically are repeated six times - for two genders, three races, and a total. These will need to be collected together.
The race categories do not add up to the total because they don’t comprise all of the possible races. Therefore an “other race” category will need to calculated and then created.
The occupations do not appear at the top of the raw data (they appear in the fifth row), which means they need to be extracted and inserted as column headings for the data set. There are a number of issues with these column headings - for example, they are too long, and they include the insertion of dashes and carriage returns which will need to be removed.
Some of the rows are summary rows and will need to be removed. In some cases, remaining rows will need to be renamed as they don’t make sense standing alone without the summary row.
All of the years of the data set need to be appended to the data frame
The demographic categories (race, gender and total) need to be spread out as columns and arranged in order.

Dataset 2 - The Upshot - Prison Admissions by County

This dataset is a long dataset used by The Upshot NYT in their article “A Small Indiana County Sends More People to Prison Than San Francisco and Durham, N.C., Combined. Why?” to report on the increase in rural prison populations in recent years.

Challenges:

The primary issue with the data set is that it contains years in the variable names for three different variables. The goal is to get to a data set with columns for the three prison admission variables and one for the year. There are also some minor tidying edits such as converting columns to numeric or factor and or removing/adding words to columns or column names.

As this dataset is only part of the picture that was being looked at by The Upshot, to gain further insight into this dataset and look at some of the things that the article was referring to as well as comments made in the class discussion board, new variables need to be computed and more county data is needed that was not made available by The Upshot. For county data, an additional dataset (2.b) from the US Department of Agriculture (USDA) is joined with The Upshot dataset to perform the final analyses.

This county dataset had its own set of challenges. There is a rural-urban variable at two time points that is on a 1-9 spectrum and will be re-coded to match Metropolitan/Urban/Rural categories provided by the USDA. When these variables are incorporated into the original dataset, the rural-urban variable must then be made long through the creation of a new variable using ifelse statements to make sure that prison county data from the 2000s receives the urbanicity variable from the 2000 census and county data from the 2010s receives the urbanicity variable from the 2010 census.

Dataset 3 - Historic Epidemic Data

Dataset three Uses historic epidemic data as well as estimates and measurements on global population in the form of two separate tables for pre 1950 populations and post 1950 populations. It incorporates these three csv to get estimates on how impactful certain epidemics were based on the total deaths as compared to global population.

Challenges:

The biggest challenge with this data was data formation. It contained many pieces of data that were mixed strings and numbers, notes within data points, as well as hyperlink references. Because many columns were not standardized I used a lot of regex to tackle a lot of these issues ultimately paring things down to a point where the values could be standardized and used.

The next challenge was creating a metric to join population data that fluctuated depending on the year. I found population data that was yearly from 1950-2017, by decade from 1900-1950 and every century from 1AD to 1900. Because of this I had to make my own rounding function. where depending on what year the event was it rounded so that population metrics could be added accordingly.

Tidying Dataset #1: BLS Data

First, the first year of data is read into a data frame:

dfBLS_raw <- read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-2015.csv", sep=",")

Next, we drop any columns we don’t need and add any columns we do. In this case there is one of each:

Drop the Totals column
Add the Year column

dfBLS <- dfBLS_raw %>%
  select(-"X") %>%
  mutate(Year = "2015")

Now we need to fix the column names (the column names should be the occupations, but they are not because the occupations were not at the top of the file.) We will need to extract the column names into a vector, clean the names of any extraneous characters and other issues, and use the vector to rename the dataframe columns:

Get a vector for the column names from the row that contains them
Clean the column names by replacing the hyphens and paragraph breaks, taking out unneeded words like ‘occupations’, etc.
Rename the columns using the vector

#Extract the names
vColumnNames <- as.character(dfBLS %>%
  filter(row_number() %in% 5) %>%
  select(-contains("Household") & -"Year"))

#Clean the names
vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "-   ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "  ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")

#Replace the dataframe names with the extracted names
dfBLS %<>%
  rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)

The industries are repeated six times for total, female, male, Black, Asian, and White. Each of these units will need to be extracted into its own data frame, and the columns gathered into “long” format. We will also do some cleaning of extraneous rows:

Select rows using ‘filter’
Add the relevant demographic info as a column (.e.g Gender or Race)
Handle rows which merely summarize other rows
1. Remove the summary row
2. Rename the remaining rows where necessary
Gather columns (using handy vector names column) to convert from wide to long

DemographicUnit <- function(df, rowStart, rowEnd, demoContent)
{
 
#Extract units based on rows, insert a column with demographic category, remove summary rows and gather the dataframe into long format 
dfNew <- df %>% 
  filter(row_number() %in% rowStart:rowEnd) %>%
  mutate(demog := demoContent) %>%
  filter(Industry !="Manufacturing" & Industry !="Wholesale and retail trade" & Industry !="Other services") %>%
  gather(all_of(vColumnNames),  key="Occupation", value="NumberEmployed")

#The new "NumberEmployed" category needs to be turned into an integer from character. This means removing the comma, and recasting it as an integer. A unique row ID is also created here which will simplify merging the data frame with others.
  dfNew$NumberEmployed = str_replace_all(dfNew$NumberEmployed, ",", "")
  dfNew <- mutate_at(dfNew, vars(NumberEmployed), list(as.integer)) %>%
  mutate(RowID = row_number())

#Some of the rows which were summarized by summary rows need to be reworded.
  dfNew$Industry = str_replace_all(dfNew$Industry, "Durable goods", "Manufacturing, durable goods")
  dfNew$Industry = str_replace_all(dfNew$Industry, "Nondurable goods", "Manufacturing, nondurable goods")
  dfNew$Industry = str_replace_all(dfNew$Industry, "Private households", "Other services, private households only")

return (dfNew)
}

#Each of the units is extracted using the function we've just created
dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")

Create the unified dataframe by binding all the race and gender dataframes together:

dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)

Spread the dataframe by demographic unit and create the “Other Race” column by comparing the three race units to the total

Spread the dataframe using TidyR so each demographic unit is a column
Create a category called “Other” which is the Total minus the other ace categories
Create Final dataframe to accept more years

dfAll %<>% 
  spread(demog, NumberEmployed) %<>%
  mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
  mutate(Other = Total-(Black + White + Asian))

dfFinal <- dfAll

This is the last step. We now have a clean data frame in long form. The only thing that remains is using the same steps to read in the other years and appending those dataframes to this. We do this with a function that brings all of the steps together.

ReadYear <- function(bls_Year)
{

fileName <- str_c("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-", bls_Year, ".csv")
dfBLS_raw <- read.delim(fileName, sep=",")

dfBLS <- dfBLS_raw %>%
  select(-"X") %>%
  mutate(Year = bls_Year) 

vColumnNames <- as.character(dfBLS %>%
  filter(row_number() %in% 5) %>%
  select(-contains("Household") & -"Year"))

vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "-   ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "-", "")
vColumnNames = str_replace_all(vColumnNames, "  ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")

dfBLS %<>%
  rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)

dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")

dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)

dfAll %<>% 
  spread(demog, NumberEmployed) %<>%
  mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
  mutate(Other = Total-(Black + White + Asian))

  return(dfAll)

}

dfs <- ReadYear("2016")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2017")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2018")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2019")
dfFinal <- rbind(dfFinal, dfs)

dfs <- ReadYear("2020")
dfFinal <- rbind(dfFinal, dfs)

dfFinal %<>%
  select("Year", "Industry", "Occupation", "Female", "Male", "Black", "White", "Asian", "Other", "Total")

Thus we go from this:

head(dfBLS_raw,10) %>%
  kbl(caption = "Raw Data - BLS") %>%
  kable_styling(bootstrap_options = c("condensed"))

Raw Data - BLS
ï..HOUSEHOLD.DATA..ANNUAL.AVERAGES..17..Employed.persons.by.industry..sex..race..and.occupation	X	X.1	X.2	X.3	X.4	X.5	X.6	X.7	X.8	X.9	X.10	X.11
[In thousands]

Industry, sex, and race	2015
	Total em- ployed	Management, professional, and related occupations		Service occupations		Sales and office occupations		Natural resources, construction, and maintenance occupations			Production, trans- portation, and material moving occupations
		Management, business, and financial operations occupations	Profes- sional and related occupa- tions	Protec- tive service occupa- tions	Service occupa- tions, except protec- tive	Sales and related occupa- tions	Office and adminis- trative support occupa- tions	Farming, fishing, and forestry occupa- tions	Construc- tion and extrac- tion occupa- tions	Instal- lation, mainte- nance, and repair occupa- tions	Produc- tion occupa- tions	Transpor- tation and material moving occupa- tions

TOTAL
Agriculture and related	2,422	1,085	49	18	91	17	87	903	14	31	30	95
Mining, quarrying, and oil and gas extraction	917	150	148	2	5	12	73	1	292	73	62	100
Construction	9,935	1,762	237	10	39	98	509	4	6,347	534	153	244

to this:

head(dfFinal) %>%
  kbl(caption = "Final Data - BLS") %>%
  kable_styling(bootstrap_options = c("condensed"))

Final Data - BLS
Year	Industry	Occupation	Female	Male	Black	White	Asian	Other	Total
2015	Agriculture and related	Construction and extraction	0	14	0	13	1	0	14
2015	Agriculture and related	Farming, fishing, and forestry	184	719	34	820	9	40	903
2015	Agriculture and related	Installation, maintenance, and repair	0	31	0	30	0	1	31
2015	Agriculture and related	Management, business, and financial operations	261	824	9	1052	6	18	1085
2015	Agriculture and related	Office and administrative support	76	11	2	83	0	2	87
2015	Agriculture and related	Production	6	24	3	25	0	2	30

Analysis of Dataset 1: BLS Data

Task - Name one significant shift in women’s employment from traditional occupations/industries to nontraditional from 2015 to 2020.

Thiswas harder than one might think!

We begin by examining the top 5 occupation/Industry combinations for women in 2015 and 2020. We can see that the same combinations appear in each year, though the percentages and order change slightly. These are sectors where women dominate: nannies and maids, education and health, and various office and admin support occupations.

dfFinal <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Final.csv")
dfFinal <- mutate_at(dfFinal, vars(Year), list(as.character))
dfFinal <- na.omit(dfFinal)
  
df2015 <- dfFinal %>%
  filter(Year=="2015" & Total>100) %>%
  mutate(PercentWomen = Female/Total) %>%
  select(Industry, Occupation, Total, PercentWomen) %>%
  arrange(desc(PercentWomen))
df2015 <- na.omit(df2015)

head(df2015) %>%
  kbl(caption = "Top Occupation/Industries - 2015") %>%
  kable_styling(bootstrap_options = c("condensed"))

Top Occupation/Industries - 2015
Industry	Occupation	Total	PercentWomen
Other services, private households only	Service, except protective	770	0.9428571
Education and health services	Office and administrative support	3561	0.8885145
Construction	Office and administrative support	509	0.8506876
Other services, except private households	Office and administrative support	628	0.8471338
Financial activities	Office and administrative support	2171	0.8060801
Education and health services	Service, except protective	7018	0.8045027

df2020 <- dfFinal %>%
  filter(Year=="2020" & Total>100) %>%
  mutate(PercentWomen = Female/Total) %>%
  select(Industry, Occupation, Total, PercentWomen) %>%
  arrange(desc(PercentWomen))
df2020 <- na.omit(df2020)

head(df2020) %>%
  kbl(caption = "Top Occupation/Industries - 2020") %>%
  kable_styling(bootstrap_options = c("condensed"))

Top Occupation/Industries - 2020
Industry	Occupation	Total	PercentWomen
Other services, private households only	Service, except protective	611	0.9148936
Education and health services	Office and administrative support	3258	0.8956415
Other services, except private households	Office and administrative support	531	0.8305085
Construction	Office and administrative support	545	0.8146789
Education and health services	Service, except protective	6495	0.8096998
Financial activities	Office and administrative support	1932	0.7774327

We can do a deeper dive into occupations and industries by year to see if we can track any significant changes there.

dfSummary <- dfFinal %>%
  group_by(Occupation, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))

## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)

dfSummary <- na.omit(dfSummary)

ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percent of Women Employed By Occupation") +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

The average percent of women employed in various occupations in the US was very stable from 2105 to 2020. The only really noticeable shift is in transportation. In most sectors there is little change. There are some odd outliers - e.g. a large drop in the % of women employed in farming and fishing in 2017. Given the relative stability elsewhere this is probably a shift in how occupations are labeled rather than a real change in the workforce.

Looking at industries …

dfSummary2 <- dfFinal %>%
  group_by(Industry, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))

## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)

dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed By Industry")+
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

Industries show a tiny bit more movement. Women’s % in some traditional industries like nannying and maids fell, and rose slightly in nontraditional ones like construction and mining.

Do these minor industry shifts signify real change? Let’s look at the occupations women are occupying in the mining industry:

dfSummary <- dfFinal %>%
  filter(Industry=="Mining, quarrying, and oil and gas extraction") %>%
  group_by(Occupation, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))

## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)

dfSummary <- na.omit(dfSummary)

ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed in Mining") +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

Most of the increase in women’s % in mining is in traditional occupations like office, admin, and non-protective services.

We can also circle back to transportation, the occupation in which women have been making the most inroads. What is happening there?

dfSummary2 <- dfFinal %>%
  filter(Occupation=="Transportation and material moving") %>%
  group_by(Industry, Year) %>%
  summarize(Percent = sum(Female)/sum(Total))

## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)

dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
  geom_col(position=position_dodge()) + ggtitle("Percentage of Women in Transportation By Industry")+
  coord_flip() +
  scale_fill_brewer(palette = "Spectral")

There appears to be some real movement in women’s % of participation in transportation accross many industries. While mainly in service, education, information and retail, nonetheless women do appear to be making inroads into transportation occupations.

A google search reveals that in 2016 the DOT launched the Women and Girls Transportation Initiative to increase women’ participation in transportation. The initiative appears to have been successful.

While there was little shift of women’s percentages in most occupations between 2015 and 2020, transportation does appear to be one area where women made headway in a number of industries.

Analysis of Dataset 2: Prison Data

The prison dataset was tidied from this:

prison_admissions_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheUpshot/prison-admissions/master/county-prison-admissions.csv", 
                                                  header = TRUE, stringsAsFactors = FALSE, sep = ","))


head(prison_admissions_raw) %>%
  kbl(caption = "Raw Data - Prisons") %>%
  kable_styling(bootstrap_options = c("condensed"))

Raw Data - Prisons
fips	county	state	admitsPer10k2006	admitsPer10k2013	admitsPer10k2014	valid06	valid13	valid14	population2006	population2013	population2014	admissions2006	admissions2013	admissions2014	source
1001	Autauga County	AL	44.25665	19.225189	18.593736	true	true	true	51328	55136	55395	243	106	103	NCRP
1003	Baldwin County	AL	24.63739	17.703371	16.540820	true	true	true	168121	195443	200111	461	346	331	NCRP
1005	Barbour County	AL	75.39988	10.378827	12.273589	true	true	true	27861	26978	26887	206	28	33	NCRP
1007	Bibb County	AL	21.97416	11.109136	6.664889	true	true	true	22099	22504	22506	50	25	15	NCRP
1009	Blount County	AL	16.13939	11.781012	6.930127	true	true	true	55485	57720	57719	93	68	40	NCRP
1011	Bullock County	AL	66.07110	8.486563	NA	true	true		10776	10605	10764	71	9	NA	NCRP

to this:

dfPrisons <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/prison_df.csv")

head(dfPrisons) %>%
  kbl(caption = "Prison and County Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Prison and County Data
fips	county	state	source	percent_change	year	prison_admitsper10k	county_population	prison_admissions	persistent_poverty	urbanicity
1001	Autauga	AL	NCRP	-57.99	2006	44.26	51328	243	0	Metropolitan
1003	Baldwin	AL	NCRP	-32.86	2006	24.64	168121	461	0	Urban_Adjacent
1005	Barbour	AL	NCRP	-83.72	2006	75.40	27861	206	1	Urban_Adjacent
1007	Bibb	AL	NCRP	-69.67	2006	21.97	22099	50	1	Metropolitan
1009	Blount	AL	NCRP	-57.06	2006	16.14	55485	93	0	Metropolitan
1011	Bullock	AL	NCRP	NA	2006	66.07	10776	71	1	Urban_Adjacent

Task - What states account for the reduction in prison admissions from 2006 to 2014?

dfPrisons <- mutate_at(dfPrisons, vars(year), list(as.character))
dfPrisons <- na.omit(dfPrisons)

dfSummary2 <- dfPrisons %>%
  group_by(year) %>%
  summarize(Total = sum(prison_admissions))

## `summarise()` ungrouping output (override with `.groups` argument)

dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =year, y=Total)) +
  geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by Year")+
  scale_fill_brewer(palette = "Spectral") +
  ylab("Admissions")

We can see that overall prison admissions fell from 2006 to 2014.

dfSummary2 <- dfPrisons %>%
  group_by(state, year) %>%
  summarize(Total = sum(prison_admissions))

## `summarise()` regrouping output by 'state' (override with `.groups` argument)

dfSummary2 <- na.omit(dfSummary2)

ggplot(data = dfSummary2, aes(x =state, y=Total, group=year, fill=year)) +
  geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by State")+
  scale_fill_brewer(palette = "Spectral") +
  ylab("Admissions")

The lion’s share of decreases came from California - Alabama, Florida, New York and Ohio contributed as well. There were a few states that gained admissions as well, though the gains were rather small - Arizona, Minnesota, Kentucky, North Carolina, Oklahoma and Pennsylvania.

dfCounty <- dfPrisons %>%
  filter(state=="CA") %>%
  group_by(county, urbanicity) %>%
  summarize(Diff = sum(prison_admissions[year=="2006"] - sum(prison_admissions[year=="2014"]))) %>%
  filter(Diff>750)

## `summarise()` regrouping output by 'county' (override with `.groups` argument)

ggplot(data = dfCounty, aes(x = reorder(county, Diff),  y=Diff)) +
  geom_col(position=position_dodge()) + ggtitle("Decrease in Prison Admissions by CA County, 2006-2014")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  ylab("Decrease in Admissions") +
    xlab("County")

Many counties in California contributed to the decline, mostly in Southern California which accounted for the 5 counties over 5,000. Los Angeles, however, dominated the losses with over 22,000.

ggplot(data = dfCounty, aes(x = reorder(county, Diff),  y=urbanicity)) +
  geom_col(position=position_dodge()) + ggtitle("Counties by 'Urbanicity'")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  ylab("Urbanicity") +
  xlab("County")

When we look at the “urbancity” of admission declines we see all the counties in California with declines above 750 are urban metropolitan areas.

Thus we conclude that the bulk of admission declines are from California metro areas, especially in Southern California.

Analysis of Dataset 3: Epidemics

The epidemic dataset was tidied from this:

dfepidemic_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/epidemic.csv", 
                                                  header = TRUE, stringsAsFactors = FALSE, sep = ","))


head(dfepidemic_raw) %>%
  kbl(caption = "Raw Data - Epidemics") %>%
  kable_styling(bootstrap_options = c("condensed"))

Raw Data - Epidemics
Event	Date	Location	Disease	Death.toll..estimate.	Ref.
Antonine Plague	165–180 (possibly up to 190)	Roman Empire	Unknown, possibly smallpox	5–10 million	[23][24]
Jian’an Plague	217	Han Dynasty	Unknown, possibly typhoid fever or viral hemorrhagic fever	Unknown	[25][26]
Plague of Cyprian	250–266	Europe	Unknown, possibly smallpox	Unknown	[27][28]
Plague of Justinian (beginning of First plague pandemic)	541–549	Europe and West Asia	Bubonic plague	15–100 million (25–60% of population of Europe)	[11][29][30]
Roman Plague of 590 (part of First plague pandemic)	590	Rome, Byzantine Empire	Bubonic plague	Unknown	[31]
Plague of Sheroe (part of First plague pandemic)	627–628	Bilad al-Sham	Bubonic plague	25,000+

to this:

dfepidemics <- read.csv("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/a_clean_epidemic.csv")

head(dfepidemics) %>%
  kbl(caption = "Epidemic - clean") %>%
  kable_styling(bootstrap_options = c("condensed"))

Epidemic - clean
X	event	start_year	end_year	disease	deaths_low_estimate	global_population	low_global_death_percent
1	Black Death	1346	1353	Bubonic plague	7.5e+07	3.920e+08	0.1913265
2	Plague of Justinian	541	549	Bubonic plague	1.5e+07	2.100e+08	0.0714286
3	Antonine Plague	165	180	Unknown	5.0e+06	2.020e+08	0.0247525
4	1520 Mexico smallpox epidemic	1519	1520	Smallpox	5.0e+06	4.610e+08	0.0108460
5	Cocoliztli Epidemic of 1545–1548	1545	1548	Possibly Salmonella enterica	5.0e+06	4.610e+08	0.0108460
6	1918 influenza pandemic	1918	1920	Influenza A virus subtype H1N1	1.7e+07	1.912e+09	0.0088912

Task - How does the Covid epidemic compare to other epidemics through history?

dfepidemics <- mutate_at(dfepidemics, vars(deaths_low_estimate), list(as.integer))

dfE <- dfepidemics %>%
  select(event, deaths_low_estimate) %>%
  na.omit() %>%
  filter(deaths_low_estimate > 200000)

dfHighlight <- dfE %>%
  filter(event=="COVID-19 pandemic")

ggplot(data = dfE, aes(x = reorder(event, deaths_low_estimate),  y=deaths_low_estimate)) +
  geom_col(position=position_dodge()) + ggtitle("Epidemics by Total Deaths")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  geom_col(data=dfHighlight, aes(x=event, y=deaths_low_estimate), color="black", fill="orange") +
  xlab("Event") +
  ylab("Number of Deaths")

In terms of total deaths COVID 19 has been very significant (in the top ten) but is dwarfed by some tof the big ones (AIDS, the Black Death and the 1918 Spanish Flu pandemic).

The dataset does not include population estimates for many years so we will fill NA values with the value above for a proxy. Then we can calculate % deaths per population for COVID and other events that have no global percentage calculated.

dfE <- dfepidemics %>% fill(global_population) %>%
   mutate(glob_percent = deaths_low_estimate/global_population*100) %>%
  select(event, glob_percent) %>%
    na.omit() %>%
  filter(glob_percent>.035)

dfHighlight <- dfE %>%
  filter(event=="COVID-19 pandemic")

ggplot(data = dfE, aes(x = reorder(event, glob_percent),  y=glob_percent)) +
  geom_col(position=position_dodge()) + ggtitle("Epidemics by % of Deaths in World Pop")+
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  geom_col(data=dfHighlight, aes(x=event, y=glob_percent), color="orange", fill="orange") +
  xlab("Event") +
  ylab("% of Deaths")

In terms of % of global population, COVID is again dwarfed by others but still makes the list at 20.

Conclusion

The three data sets Cassie, Dan and I chose all had different challenges. My data set had classic issues of wideness, repeating data, etc. Cassie also had a wide data set that needed a reworking of columns and some adding of new columns. Dan had a host of standardization issues – dates, numbers, and text were all written in different formats throughout the data set. Both Cassie and Dan added data from more than one type of data set, and I combined six data sets of the same type from different years.

Working in a team made it possible to learn many different techniques very quickly. We are all much more advanced in our ability to tidy data.

Eric_Hirsch_607_Project_2

Eric Hirsch

2021-03-13

Project 2

Introduction

Dataset 1: Bureau of Labor Statistics Data

Dataset 2 - The Upshot - Prison Admissions by County

Dataset 3 - Historic Epidemic Data

Tidying Dataset #1: BLS Data

Analysis of Dataset 1: BLS Data

Analysis of Dataset 2: Prison Data

Analysis of Dataset 3: Epidemics

Conclusion