library(openintro)
library(tinytex)
library(tidyverse)
library(stringr)
library(magrittr)
library(gridExtra)
library(readxl)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.4
Team members: Eric Hirsch, Dan Sullivan, Cassie Coste
In this project we will tidy three datasets using TidyR and Dyplr and other handy R tools. After the transformation we will conduct some analyses. This is a joint project with Cassie Coste and Dan Sullivan. We each cleaned a data set and then showed each other what our challenges were and how we overcame them. This first data set was the data set that I tidied.
The data consists of six CSV files from the Bureau of Labor Statistics showing numbers of Americans involved in various occupations and industries, spanning the years 2015 through 2020. Each file is in the same format. The data set is “wide” – occupations run horizontally and industries run vertically.
Challenges with the data set:
This dataset is a long dataset used by The Upshot NYT in their article “A Small Indiana County Sends More People to Prison Than San Francisco and Durham, N.C., Combined. Why?” to report on the increase in rural prison populations in recent years.
Challenges:
The primary issue with the data set is that it contains years in the variable names for three different variables. The goal is to get to a data set with columns for the three prison admission variables and one for the year. There are also some minor tidying edits such as converting columns to numeric or factor and or removing/adding words to columns or column names.
As this dataset is only part of the picture that was being looked at by The Upshot, to gain further insight into this dataset and look at some of the things that the article was referring to as well as comments made in the class discussion board, new variables need to be computed and more county data is needed that was not made available by The Upshot. For county data, an additional dataset (2.b) from the US Department of Agriculture (USDA) is joined with The Upshot dataset to perform the final analyses.
This county dataset had its own set of challenges. There is a rural-urban variable at two time points that is on a 1-9 spectrum and will be re-coded to match Metropolitan/Urban/Rural categories provided by the USDA. When these variables are incorporated into the original dataset, the rural-urban variable must then be made long through the creation of a new variable using ifelse statements to make sure that prison county data from the 2000s receives the urbanicity variable from the 2000 census and county data from the 2010s receives the urbanicity variable from the 2010 census.
Dataset three Uses historic epidemic data as well as estimates and measurements on global population in the form of two separate tables for pre 1950 populations and post 1950 populations. It incorporates these three csv to get estimates on how impactful certain epidemics were based on the total deaths as compared to global population.
Challenges:
The biggest challenge with this data was data formation. It contained many pieces of data that were mixed strings and numbers, notes within data points, as well as hyperlink references. Because many columns were not standardized I used a lot of regex to tackle a lot of these issues ultimately paring things down to a point where the values could be standardized and used.
The next challenge was creating a metric to join population data that fluctuated depending on the year. I found population data that was yearly from 1950-2017, by decade from 1900-1950 and every century from 1AD to 1900. Because of this I had to make my own rounding function. where depending on what year the event was it rounded so that population metrics could be added accordingly.
dfBLS_raw <- read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-2015.csv", sep=",")
dfBLS <- dfBLS_raw %>%
select(-"X") %>%
mutate(Year = "2015")
#Extract the names
vColumnNames <- as.character(dfBLS %>%
filter(row_number() %in% 5) %>%
select(-contains("Household") & -"Year"))
#Clean the names
vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, " ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")
#Replace the dataframe names with the extracted names
dfBLS %<>%
rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)
DemographicUnit <- function(df, rowStart, rowEnd, demoContent)
{
#Extract units based on rows, insert a column with demographic category, remove summary rows and gather the dataframe into long format
dfNew <- df %>%
filter(row_number() %in% rowStart:rowEnd) %>%
mutate(demog := demoContent) %>%
filter(Industry !="Manufacturing" & Industry !="Wholesale and retail trade" & Industry !="Other services") %>%
gather(all_of(vColumnNames), key="Occupation", value="NumberEmployed")
#The new "NumberEmployed" category needs to be turned into an integer from character. This means removing the comma, and recasting it as an integer. A unique row ID is also created here which will simplify merging the data frame with others.
dfNew$NumberEmployed = str_replace_all(dfNew$NumberEmployed, ",", "")
dfNew <- mutate_at(dfNew, vars(NumberEmployed), list(as.integer)) %>%
mutate(RowID = row_number())
#Some of the rows which were summarized by summary rows need to be reworded.
dfNew$Industry = str_replace_all(dfNew$Industry, "Durable goods", "Manufacturing, durable goods")
dfNew$Industry = str_replace_all(dfNew$Industry, "Nondurable goods", "Manufacturing, nondurable goods")
dfNew$Industry = str_replace_all(dfNew$Industry, "Private households", "Other services, private households only")
return (dfNew)
}
#Each of the units is extracted using the function we've just created
dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")
dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)
dfAll %<>%
spread(demog, NumberEmployed) %<>%
mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
mutate(Other = Total-(Black + White + Asian))
dfFinal <- dfAll
This is the last step. We now have a clean data frame in long form. The only thing that remains is using the same steps to read in the other years and appending those dataframes to this. We do this with a function that brings all of the steps together.
ReadYear <- function(bls_Year)
{
fileName <- str_c("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/bls-", bls_Year, ".csv")
dfBLS_raw <- read.delim(fileName, sep=",")
dfBLS <- dfBLS_raw %>%
select(-"X") %>%
mutate(Year = bls_Year)
vColumnNames <- as.character(dfBLS %>%
filter(row_number() %in% 5) %>%
select(-contains("Household") & -"Year"))
vColumnNames = str_replace_all(vColumnNames, "[\r\n]", " ")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "- ", "")
vColumnNames = str_replace_all(vColumnNames, "-", "")
vColumnNames = str_replace_all(vColumnNames, " ", "")
vColumnNames = str_replace_all(vColumnNames, " occupations", "")
dfBLS %<>%
rename(Industry = contains("Household"), !!vColumnNames[1] := X.1, !!vColumnNames[2] := X.2, !!vColumnNames[3] := X.3, !!vColumnNames[4] := X.4, !!vColumnNames[5] := X.5, !!vColumnNames[6] := X.6, !!vColumnNames[7] := X.7, !!vColumnNames[8] := X.8, !!vColumnNames[9] := X.9, !!vColumnNames[10] := X.10, !!vColumnNames[11] := X.11)
dfBLS_Men = DemographicUnit(dfBLS, 29, 47, "Male")
dfBLS_Women = DemographicUnit(dfBLS, 50, 68, "Female")
dfBLS_White = DemographicUnit(dfBLS, 71, 89, "White")
dfBLS_Black = DemographicUnit(dfBLS, 92, 110, "Black")
dfBLS_Asian = DemographicUnit(dfBLS, 113, 131, "Asian")
dfBLS_Total = DemographicUnit(dfBLS, 8, 26, "Total")
dfAll <- rbind(dfBLS_Women, dfBLS_Men, dfBLS_Black, dfBLS_White, dfBLS_Asian, dfBLS_Total)
dfAll %<>%
spread(demog, NumberEmployed) %<>%
mutate_at(vars(Black, White, Asian, Total), list(as.numeric)) %<>%
mutate(Other = Total-(Black + White + Asian))
return(dfAll)
}
dfs <- ReadYear("2016")
dfFinal <- rbind(dfFinal, dfs)
dfs <- ReadYear("2017")
dfFinal <- rbind(dfFinal, dfs)
dfs <- ReadYear("2018")
dfFinal <- rbind(dfFinal, dfs)
dfs <- ReadYear("2019")
dfFinal <- rbind(dfFinal, dfs)
dfs <- ReadYear("2020")
dfFinal <- rbind(dfFinal, dfs)
dfFinal %<>%
select("Year", "Industry", "Occupation", "Female", "Male", "Black", "White", "Asian", "Other", "Total")
Thus we go from this:
head(dfBLS_raw,10) %>%
kbl(caption = "Raw Data - BLS") %>%
kable_styling(bootstrap_options = c("condensed"))
| ï..HOUSEHOLD.DATA..ANNUAL.AVERAGES..17..Employed.persons.by.industry..sex..race..and.occupation | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | X.8 | X.9 | X.10 | X.11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [In thousands] | ||||||||||||
| Industry, sex, and race | 2015 | |||||||||||
| Total em- ployed | Management, professional, and related occupations | Service occupations | Sales and office occupations | Natural resources, construction, and maintenance occupations | Production, trans- portation, and material moving occupations | |||||||
| Management, business, and financial operations occupations | Profes- sional and related occupa- tions | Protec- tive service occupa- tions | Service occupa- tions, except protec- tive | Sales and related occupa- tions | Office and adminis- trative support occupa- tions | Farming, fishing, and forestry occupa- tions | Construc- tion and extrac- tion occupa- tions | Instal- lation, mainte- nance, and repair occupa- tions | Produc- tion occupa- tions | Transpor- tation and material moving occupa- tions | ||
| TOTAL | ||||||||||||
| Agriculture and related | 2,422 | 1,085 | 49 | 18 | 91 | 17 | 87 | 903 | 14 | 31 | 30 | 95 |
| Mining, quarrying, and oil and gas extraction | 917 | 150 | 148 | 2 | 5 | 12 | 73 | 1 | 292 | 73 | 62 | 100 |
| Construction | 9,935 | 1,762 | 237 | 10 | 39 | 98 | 509 | 4 | 6,347 | 534 | 153 | 244 |
to this:
head(dfFinal) %>%
kbl(caption = "Final Data - BLS") %>%
kable_styling(bootstrap_options = c("condensed"))
| Year | Industry | Occupation | Female | Male | Black | White | Asian | Other | Total |
|---|---|---|---|---|---|---|---|---|---|
| 2015 | Agriculture and related | Construction and extraction | 0 | 14 | 0 | 13 | 1 | 0 | 14 |
| 2015 | Agriculture and related | Farming, fishing, and forestry | 184 | 719 | 34 | 820 | 9 | 40 | 903 |
| 2015 | Agriculture and related | Installation, maintenance, and repair | 0 | 31 | 0 | 30 | 0 | 1 | 31 |
| 2015 | Agriculture and related | Management, business, and financial operations | 261 | 824 | 9 | 1052 | 6 | 18 | 1085 |
| 2015 | Agriculture and related | Office and administrative support | 76 | 11 | 2 | 83 | 0 | 2 | 87 |
| 2015 | Agriculture and related | Production | 6 | 24 | 3 | 25 | 0 | 2 | 30 |
Task - Name one significant shift in women’s employment from traditional occupations/industries to nontraditional from 2015 to 2020.
Thiswas harder than one might think!
We begin by examining the top 5 occupation/Industry combinations for women in 2015 and 2020. We can see that the same combinations appear in each year, though the percentages and order change slightly. These are sectors where women dominate: nannies and maids, education and health, and various office and admin support occupations.
dfFinal <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Final.csv")
dfFinal <- mutate_at(dfFinal, vars(Year), list(as.character))
dfFinal <- na.omit(dfFinal)
df2015 <- dfFinal %>%
filter(Year=="2015" & Total>100) %>%
mutate(PercentWomen = Female/Total) %>%
select(Industry, Occupation, Total, PercentWomen) %>%
arrange(desc(PercentWomen))
df2015 <- na.omit(df2015)
head(df2015) %>%
kbl(caption = "Top Occupation/Industries - 2015") %>%
kable_styling(bootstrap_options = c("condensed"))
| Industry | Occupation | Total | PercentWomen |
|---|---|---|---|
| Other services, private households only | Service, except protective | 770 | 0.9428571 |
| Education and health services | Office and administrative support | 3561 | 0.8885145 |
| Construction | Office and administrative support | 509 | 0.8506876 |
| Other services, except private households | Office and administrative support | 628 | 0.8471338 |
| Financial activities | Office and administrative support | 2171 | 0.8060801 |
| Education and health services | Service, except protective | 7018 | 0.8045027 |
df2020 <- dfFinal %>%
filter(Year=="2020" & Total>100) %>%
mutate(PercentWomen = Female/Total) %>%
select(Industry, Occupation, Total, PercentWomen) %>%
arrange(desc(PercentWomen))
df2020 <- na.omit(df2020)
head(df2020) %>%
kbl(caption = "Top Occupation/Industries - 2020") %>%
kable_styling(bootstrap_options = c("condensed"))
| Industry | Occupation | Total | PercentWomen |
|---|---|---|---|
| Other services, private households only | Service, except protective | 611 | 0.9148936 |
| Education and health services | Office and administrative support | 3258 | 0.8956415 |
| Other services, except private households | Office and administrative support | 531 | 0.8305085 |
| Construction | Office and administrative support | 545 | 0.8146789 |
| Education and health services | Service, except protective | 6495 | 0.8096998 |
| Financial activities | Office and administrative support | 1932 | 0.7774327 |
We can do a deeper dive into occupations and industries by year to see if we can track any significant changes there.
dfSummary <- dfFinal %>%
group_by(Occupation, Year) %>%
summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)
dfSummary <- na.omit(dfSummary)
ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
geom_col(position=position_dodge()) + ggtitle("Percent of Women Employed By Occupation") +
coord_flip() +
scale_fill_brewer(palette = "Spectral")
The average percent of women employed in various occupations in the US was very stable from 2105 to 2020. The only really noticeable shift is in transportation. In most sectors there is little change. There are some odd outliers - e.g. a large drop in the % of women employed in farming and fishing in 2017. Given the relative stability elsewhere this is probably a shift in how occupations are labeled rather than a real change in the workforce.
Looking at industries …
dfSummary2 <- dfFinal %>%
group_by(Industry, Year) %>%
summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)
ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed By Industry")+
coord_flip() +
scale_fill_brewer(palette = "Spectral")
Industries show a tiny bit more movement. Women’s % in some traditional industries like nannying and maids fell, and rose slightly in nontraditional ones like construction and mining.
Do these minor industry shifts signify real change? Let’s look at the occupations women are occupying in the mining industry:
dfSummary <- dfFinal %>%
filter(Industry=="Mining, quarrying, and oil and gas extraction") %>%
group_by(Occupation, Year) %>%
summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Occupation' (override with `.groups` argument)
dfSummary <- na.omit(dfSummary)
ggplot(data = dfSummary, aes(x =Occupation, y=Percent, group=Year, fill=Year)) +
geom_col(position=position_dodge()) + ggtitle("Percentage of Women Employed in Mining") +
coord_flip() +
scale_fill_brewer(palette = "Spectral")
Most of the increase in women’s % in mining is in traditional occupations like office, admin, and non-protective services.
We can also circle back to transportation, the occupation in which women have been making the most inroads. What is happening there?
dfSummary2 <- dfFinal %>%
filter(Occupation=="Transportation and material moving") %>%
group_by(Industry, Year) %>%
summarize(Percent = sum(Female)/sum(Total))
## `summarise()` regrouping output by 'Industry' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)
ggplot(data = dfSummary2, aes(x =Industry, y=Percent, group=Year, fill=Year)) +
geom_col(position=position_dodge()) + ggtitle("Percentage of Women in Transportation By Industry")+
coord_flip() +
scale_fill_brewer(palette = "Spectral")
There appears to be some real movement in women’s % of participation in transportation accross many industries. While mainly in service, education, information and retail, nonetheless women do appear to be making inroads into transportation occupations.
A google search reveals that in 2016 the DOT launched the Women and Girls Transportation Initiative to increase women’ participation in transportation. The initiative appears to have been successful.
While there was little shift of women’s percentages in most occupations between 2015 and 2020, transportation does appear to be one area where women made headway in a number of industries.
The prison dataset was tidied from this:
prison_admissions_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheUpshot/prison-admissions/master/county-prison-admissions.csv",
header = TRUE, stringsAsFactors = FALSE, sep = ","))
head(prison_admissions_raw) %>%
kbl(caption = "Raw Data - Prisons") %>%
kable_styling(bootstrap_options = c("condensed"))
| fips | county | state | admitsPer10k2006 | admitsPer10k2013 | admitsPer10k2014 | valid06 | valid13 | valid14 | population2006 | population2013 | population2014 | admissions2006 | admissions2013 | admissions2014 | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga County | AL | 44.25665 | 19.225189 | 18.593736 | true | true | true | 51328 | 55136 | 55395 | 243 | 106 | 103 | NCRP |
| 1003 | Baldwin County | AL | 24.63739 | 17.703371 | 16.540820 | true | true | true | 168121 | 195443 | 200111 | 461 | 346 | 331 | NCRP |
| 1005 | Barbour County | AL | 75.39988 | 10.378827 | 12.273589 | true | true | true | 27861 | 26978 | 26887 | 206 | 28 | 33 | NCRP |
| 1007 | Bibb County | AL | 21.97416 | 11.109136 | 6.664889 | true | true | true | 22099 | 22504 | 22506 | 50 | 25 | 15 | NCRP |
| 1009 | Blount County | AL | 16.13939 | 11.781012 | 6.930127 | true | true | true | 55485 | 57720 | 57719 | 93 | 68 | 40 | NCRP |
| 1011 | Bullock County | AL | 66.07110 | 8.486563 | NA | true | true | 10776 | 10605 | 10764 | 71 | 9 | NA | NCRP |
to this:
dfPrisons <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/prison_df.csv")
head(dfPrisons) %>%
kbl(caption = "Prison and County Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| fips | county | state | source | percent_change | year | prison_admitsper10k | county_population | prison_admissions | persistent_poverty | urbanicity |
|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | AL | NCRP | -57.99 | 2006 | 44.26 | 51328 | 243 | 0 | Metropolitan |
| 1003 | Baldwin | AL | NCRP | -32.86 | 2006 | 24.64 | 168121 | 461 | 0 | Urban_Adjacent |
| 1005 | Barbour | AL | NCRP | -83.72 | 2006 | 75.40 | 27861 | 206 | 1 | Urban_Adjacent |
| 1007 | Bibb | AL | NCRP | -69.67 | 2006 | 21.97 | 22099 | 50 | 1 | Metropolitan |
| 1009 | Blount | AL | NCRP | -57.06 | 2006 | 16.14 | 55485 | 93 | 0 | Metropolitan |
| 1011 | Bullock | AL | NCRP | NA | 2006 | 66.07 | 10776 | 71 | 1 | Urban_Adjacent |
Task - What states account for the reduction in prison admissions from 2006 to 2014?
dfPrisons <- mutate_at(dfPrisons, vars(year), list(as.character))
dfPrisons <- na.omit(dfPrisons)
dfSummary2 <- dfPrisons %>%
group_by(year) %>%
summarize(Total = sum(prison_admissions))
## `summarise()` ungrouping output (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)
ggplot(data = dfSummary2, aes(x =year, y=Total)) +
geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by Year")+
scale_fill_brewer(palette = "Spectral") +
ylab("Admissions")
We can see that overall prison admissions fell from 2006 to 2014.
dfSummary2 <- dfPrisons %>%
group_by(state, year) %>%
summarize(Total = sum(prison_admissions))
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
dfSummary2 <- na.omit(dfSummary2)
ggplot(data = dfSummary2, aes(x =state, y=Total, group=year, fill=year)) +
geom_col(position=position_dodge()) + ggtitle("Total Prison Admissions by State")+
scale_fill_brewer(palette = "Spectral") +
ylab("Admissions")
The lion’s share of decreases came from California - Alabama, Florida, New York and Ohio contributed as well. There were a few states that gained admissions as well, though the gains were rather small - Arizona, Minnesota, Kentucky, North Carolina, Oklahoma and Pennsylvania.
dfCounty <- dfPrisons %>%
filter(state=="CA") %>%
group_by(county, urbanicity) %>%
summarize(Diff = sum(prison_admissions[year=="2006"] - sum(prison_admissions[year=="2014"]))) %>%
filter(Diff>750)
## `summarise()` regrouping output by 'county' (override with `.groups` argument)
ggplot(data = dfCounty, aes(x = reorder(county, Diff), y=Diff)) +
geom_col(position=position_dodge()) + ggtitle("Decrease in Prison Admissions by CA County, 2006-2014")+
scale_fill_brewer(palette = "Spectral") +
coord_flip() +
ylab("Decrease in Admissions") +
xlab("County")
Many counties in California contributed to the decline, mostly in Southern California which accounted for the 5 counties over 5,000. Los Angeles, however, dominated the losses with over 22,000.
ggplot(data = dfCounty, aes(x = reorder(county, Diff), y=urbanicity)) +
geom_col(position=position_dodge()) + ggtitle("Counties by 'Urbanicity'")+
scale_fill_brewer(palette = "Spectral") +
coord_flip() +
ylab("Urbanicity") +
xlab("County")
When we look at the “urbancity” of admission declines we see all the counties in California with declines above 750 are urban metropolitan areas.
Thus we conclude that the bulk of admission declines are from California metro areas, especially in Southern California.
The epidemic dataset was tidied from this:
dfepidemic_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/epidemic.csv",
header = TRUE, stringsAsFactors = FALSE, sep = ","))
head(dfepidemic_raw) %>%
kbl(caption = "Raw Data - Epidemics") %>%
kable_styling(bootstrap_options = c("condensed"))
| Event | Date | Location | Disease | Death.toll..estimate. | Ref. |
|---|---|---|---|---|---|
| Antonine Plague | 165–180 (possibly up to 190) | Roman Empire | Unknown, possibly smallpox | 5–10 million | [23][24] |
| Jian’an Plague | 217 | Han Dynasty | Unknown, possibly typhoid fever or viral hemorrhagic fever | Unknown | [25][26] |
| Plague of Cyprian | 250–266 | Europe | Unknown, possibly smallpox | Unknown | [27][28] |
| Plague of Justinian (beginning of First plague pandemic) | 541–549 | Europe and West Asia | Bubonic plague | 15–100 million (25–60% of population of Europe) | [11][29][30] |
| Roman Plague of 590 (part of First plague pandemic) | 590 | Rome, Byzantine Empire | Bubonic plague | Unknown | [31] |
| Plague of Sheroe (part of First plague pandemic) | 627–628 | Bilad al-Sham | Bubonic plague | 25,000+ |
to this:
dfepidemics <- read.csv("https://raw.githubusercontent.com/TheSaltyCrab/Data607-Project2/main/a_clean_epidemic.csv")
head(dfepidemics) %>%
kbl(caption = "Epidemic - clean") %>%
kable_styling(bootstrap_options = c("condensed"))
| X | event | start_year | end_year | disease | deaths_low_estimate | global_population | low_global_death_percent |
|---|---|---|---|---|---|---|---|
| 1 | Black Death | 1346 | 1353 | Bubonic plague | 7.5e+07 | 3.920e+08 | 0.1913265 |
| 2 | Plague of Justinian | 541 | 549 | Bubonic plague | 1.5e+07 | 2.100e+08 | 0.0714286 |
| 3 | Antonine Plague | 165 | 180 | Unknown | 5.0e+06 | 2.020e+08 | 0.0247525 |
| 4 | 1520 Mexico smallpox epidemic | 1519 | 1520 | Smallpox | 5.0e+06 | 4.610e+08 | 0.0108460 |
| 5 | Cocoliztli Epidemic of 1545–1548 | 1545 | 1548 | Possibly Salmonella enterica | 5.0e+06 | 4.610e+08 | 0.0108460 |
| 6 | 1918 influenza pandemic | 1918 | 1920 | Influenza A virus subtype H1N1 | 1.7e+07 | 1.912e+09 | 0.0088912 |
Task - How does the Covid epidemic compare to other epidemics through history?
dfepidemics <- mutate_at(dfepidemics, vars(deaths_low_estimate), list(as.integer))
dfE <- dfepidemics %>%
select(event, deaths_low_estimate) %>%
na.omit() %>%
filter(deaths_low_estimate > 200000)
dfHighlight <- dfE %>%
filter(event=="COVID-19 pandemic")
ggplot(data = dfE, aes(x = reorder(event, deaths_low_estimate), y=deaths_low_estimate)) +
geom_col(position=position_dodge()) + ggtitle("Epidemics by Total Deaths")+
scale_fill_brewer(palette = "Spectral") +
coord_flip() +
geom_col(data=dfHighlight, aes(x=event, y=deaths_low_estimate), color="black", fill="orange") +
xlab("Event") +
ylab("Number of Deaths")
In terms of total deaths COVID 19 has been very significant (in the top ten) but is dwarfed by some tof the big ones (AIDS, the Black Death and the 1918 Spanish Flu pandemic).
The dataset does not include population estimates for many years so we will fill NA values with the value above for a proxy. Then we can calculate % deaths per population for COVID and other events that have no global percentage calculated.
dfE <- dfepidemics %>% fill(global_population) %>%
mutate(glob_percent = deaths_low_estimate/global_population*100) %>%
select(event, glob_percent) %>%
na.omit() %>%
filter(glob_percent>.035)
dfHighlight <- dfE %>%
filter(event=="COVID-19 pandemic")
ggplot(data = dfE, aes(x = reorder(event, glob_percent), y=glob_percent)) +
geom_col(position=position_dodge()) + ggtitle("Epidemics by % of Deaths in World Pop")+
scale_fill_brewer(palette = "Spectral") +
coord_flip() +
geom_col(data=dfHighlight, aes(x=event, y=glob_percent), color="orange", fill="orange") +
xlab("Event") +
ylab("% of Deaths")
In terms of % of global population, COVID is again dwarfed by others but still makes the list at 20.
The three data sets Cassie, Dan and I chose all had different challenges. My data set had classic issues of wideness, repeating data, etc. Cassie also had a wide data set that needed a reworking of columns and some adding of new columns. Dan had a host of standardization issues – dates, numbers, and text were all written in different formats throughout the data set. Both Cassie and Dan added data from more than one type of data set, and I combined six data sets of the same type from different years.
Working in a team made it possible to learn many different techniques very quickly. We are all much more advanced in our ability to tidy data.