Dataset 1 - Global TB rates
This dataset represents the tuberculosis cases reported in the past 30 years by country and demographic. The original dataset has multi-variate columns and is missing a substantial amount of data.
I will be tidying this dataset so that each row represents a single observation (i.e. the cases of tb for a particular country/year/gender, & age range). I will be identifying if there are any patterns in the data that is missing and dropping or imputing NA values accordingly. I will be looking up country name to replace the non-intuitive iso country code in the original dataset.
I will then perform some EDA and summary statistics on the tidied dataset.
Importing the Data
NAMES_tb <- read.table("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/tb.csv", nrow = 1, stringsAsFactors = FALSE, sep = ",")
DATA_tb <- read.table("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/tb.csv", skip = 1, stringsAsFactors = FALSE, sep = ",")
tb <- DATA_tb[, 1:length(NAMES_tb)]
names(tb) <- NAMES_tblookup_country <- read_csv("https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv", show_col_types = FALSE)glimpse(tb)## Rows: 5,769
## Columns: 23
## $ iso2 <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD~
## $ year <int> 1989, 1990, 1991, 1992, 1993, 1994, 1996, 1997, 1998, 199~
## $ new_sp <int> NA, NA, NA, NA, 15, 24, 8, 17, 1, 4, 1, 3, 2, 7, 3, 5, 8,~
## $ new_sp_m04 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_m514 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_m014 <int> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ new_sp_m1524 <int> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_m2534 <int> NA, NA, NA, NA, NA, NA, 0, 1, 0, 0, 1, NA, 0, 0, 0, 1, 1,~
## $ new_sp_m3544 <int> NA, NA, NA, NA, NA, NA, 4, 2, 1, 1, 0, 2, 1, 1, 1, 1, 2, ~
## $ new_sp_m4554 <int> NA, NA, NA, NA, NA, NA, 1, 2, 0, 1, 0, 1, 0, 2, 1, 0, 0, ~
## $ new_sp_m5564 <int> NA, NA, NA, NA, NA, NA, 0, 1, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_m65 <int> NA, NA, NA, NA, NA, NA, 0, 6, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_mu <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f04 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f514 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f014 <int> NA, NA, NA, NA, NA, NA, 0, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f1524 <int> NA, NA, NA, NA, NA, NA, 1, 1, NA, 0, NA, NA, 1, 1, 0, 1, ~
## $ new_sp_f2534 <int> NA, NA, NA, NA, NA, NA, 1, 2, NA, 0, NA, NA, 0, 1, 1, 1, ~
## $ new_sp_f3544 <int> NA, NA, NA, NA, NA, NA, 0, 3, NA, 1, NA, NA, 0, 1, 0, 1, ~
## $ new_sp_f4554 <int> NA, NA, NA, NA, NA, NA, 0, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f5564 <int> NA, NA, NA, NA, NA, NA, 1, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f65 <int> NA, NA, NA, NA, NA, NA, 0, 1, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_fu <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
Reshaping the Dataset
I am collapsing the columns that are storing multiple variables per column (gender, age range, and tb_cases) into 2 columns - demographic and case count. I am pivoting on iso2, year, and new_sp columns.
I chose to keep new_sp column as this column captures the total of all tb cases reported per country per year. I am renaming this column total_cases for better accuracy.
tb_clean <- tb %>%
pivot_longer(
cols = -c(iso2,year,new_sp),
names_to = "demo",
values_to = "cases") %>%
rename("total_cases" = new_sp) Since the disparate variables of the demo (demographic) column are not separated by any non-alphanumeric characters, I am using str_extract() inside a mutate to pull out variables gender and age range instead of a separate() verb.
Both new variables will be factor variables in that they have a fixed select number of values they could be, so I am also casting both variables to factor inside my mutate.
tb_clean <- tb_clean %>%
mutate(
gender = as.factor(str_extract(demo, "f|m")),
age_range = (str_extract(demo,"[0-9]{2,4}"))
) %>%
replace_na(list(age_range = "unknown")) %>%
select(-demo)Handling Missing Data
It is noticeable that there are many NA values in this dataset.
sum(is.na(tb_clean))## [1] 133850
vis_miss(tb_clean)Looking at the data, I suspect that the NA values are more often happening in earlier years where data may have been more difficult to ascertain. I am going to arrange by year and then call vis_miss. Should my suspicion be correct, the missing values should all show as sequential (ie. data is missing at random but not completely at random).
tb_clean_by_year <- tb_clean %>%
arrange(year)
vis_miss(tb_clean_by_year)It looks like my hunch was correct. To see at around what year did frequency of NA values drop off, I will filter and count only the observations with NA in the cases column and then plot.
tb_missing <- tb_clean %>%
filter(is.na(cases)) %>%
count(year, cases, name = "freq")
ggplot(tb_missing, aes(x=year, y=freq)) + geom_col()It appears that only around 1995 did data begin to be collected in a comprehensive manner. It could still be helpful to retain incidence rates from the 80s for certain countries where this data was collected successfully. However, I am looking to apply summary statistics across countries, and the quality/accuracy of this analysis will significantly decrease with so many missing values; therefore I am going to drop all observational records from 1980-1995.
tb_clean$year <- as.numeric(tb_clean$year)
exclude <- c(1980:1995)
tb_clean<- tb_clean %>%
filter(!(year %in% exclude))At this point, we have the choice to either assume that where values are NA in the case column, no tb cases were reported or no data was collected at all. We can discern from the dataset that there do exist records of cases reported at 0 indicating that NA in this dataset is not necessarily a proxy for 0 cases.
tb_clean %>% filter(cases == 0) %>% count(cases)## # A tibble: 1 x 2
## cases n
## <int> <int>
## 1 0 4096
tb_clean %>% filter(is.na(cases)) %>% count(cases)## # A tibble: 1 x 2
## cases n
## <int> <int>
## 1 NA 19998
Where total cases reported are also NA, records containing NA values in cases variable will be dropped. Where a total case count for country/year has been reported, cases will be imputed as 0.
tb_clean <- tb_clean %>% filter(!is.na(cases) & !is.na(total_cases))
tb_clean <- tb_clean %>% replace_na(list(cases = 0L))It was noted in our plot of missing values that there are a number of observations where the country code is missing. After dropping by year and dropping by cases = NA, there is only 184 records without an iso code. Reviewing this extracted data, it appears these values are not missing at random but are of a country code not included. We will need to drop all records missing the iso code as imputation does not appear possible.
sum(is.na(tb_clean$iso2))## [1] 184
missing_country <- tb_clean %>% filter(is.na(iso2))
missing_country %>% group_by(year) %>% tally()## # A tibble: 13 x 2
## year n
## <dbl> <int>
## 1 1996 14
## 2 1997 14
## 3 1998 14
## 4 1999 14
## 5 2000 14
## 6 2001 14
## 7 2002 14
## 8 2003 14
## 9 2004 14
## 10 2005 14
## 11 2006 14
## 12 2007 14
## 13 2008 16
Replacing Variable Values via Lookup
By importing the lookup table for country and iso codes, we can lookup and add the country name for each iso code listed via join(). This will make for easier interpretation under which country each tb case record is reporting.
tb_clean <- lookup_country %>% inner_join(tb_clean, by= c("Code"="iso2"))
tb_clean <- tb_clean %>% rename("country"= Name) %>% select(-Code)Exploratory Data Analysis on Age Ranges
If we list out the possible age ranges, we see that there is an age range value for unknown and age ranges for 0-4, 5-14, and 0-14 years of age. Lets do some digging on the cases reported in these unexpected age ranges.
tb_clean$age_range <- as.factor(tb_clean$age_range)
unique(tb_clean$age_range)## [1] 014 1524 2534 3544 4554 5564 65 04 514
## [10] unknown
## Levels: 014 04 1524 2534 3544 4554 514 5564 65 unknown
TB Cases Reported where Age Range Unknown
How many cases have been reported where the gender but not the age range was unknown?
Where/when were these cases highest?
tb_clean %>% filter(age_range =="unknown" & cases > 0) %>% arrange(desc(cases))## # A tibble: 53 x 6
## country year total_cases cases gender age_range
## <chr> <dbl> <int> <int> <fct> <fct>
## 1 Nepal 2008 14640 1416 m unknown
## 2 Nepal 2008 14640 631 f unknown
## 3 Argentina 2008 4758 30 m unknown
## 4 Italy 2005 1275 24 m unknown
## 5 Italy 2006 1377 19 m unknown
## 6 Italy 2007 979 11 m unknown
## 7 Italy 2006 1377 10 f unknown
## 8 Italy 2005 1275 9 f unknown
## 9 Argentina 2008 4758 8 f unknown
## 10 Egypt 2008 5102 8 m unknown
## # ... with 43 more rows
It appears there are 53 cases (within the subset timeframe_ in which tb cases were reported absent of age range. The largest numbers were reported in Nepal in 2008. It also appears that Italy has in multiple recent years reported a small number of cases absent of age range information.
TB Cases Reported in Children
How many cases of TB have been reported in the age range 0-4 years? It appears that cases involving children 0-4 have only been reported in the last years of this dataset: 2005-2008. The highest numbers recorded in any one year / country / gender were ~600 in South Africa 2006. Within the same period, 2006-2008, tb cases were reported at their highest among children 5-14 in the countries of South Africa and Indonesia. However, on the bright side, tb cases among children represent less than 2% of all reported cases in any year.
zero_to_four <- tb_clean %>% filter(age_range== "04" & cases > 0) %>%
mutate(percentage = scales::percent(cases/total_cases))
zero_to_four %>% arrange(desc(cases))## # A tibble: 281 x 7
## country year total_cases cases gender age_range percentage
## <chr> <dbl> <int> <int> <fct> <fct> <chr>
## 1 South Africa 2006 131099 655 m 04 0.4996224%
## 2 South Africa 2006 131099 620 f 04 0.4729250%
## 3 South Africa 2008 138803 344 m 04 0.2478333%
## 4 South Africa 2007 135604 340 m 04 0.2507301%
## 5 South Africa 2008 138803 298 f 04 0.2146928%
## 6 South Africa 2007 135604 293 f 04 0.2160703%
## 7 Indonesia 2007 160617 213 m 04 0.1326136%
## 8 Brazil 2007 38444 175 m 04 0.4552076%
## 9 Indonesia 2008 166376 161 m 04 0.0967688%
## 10 Indonesia 2007 160617 148 f 04 0.0921447%
## # ... with 271 more rows
summary(zero_to_four[c("cases","gender")])## cases gender
## Min. : 1.00 f:133
## 1st Qu.: 1.00 m:148
## Median : 3.00
## Mean : 21.26
## 3rd Qu.: 8.00
## Max. :655.00
zero_to_four %>% group_by(country) %>% tally() %>% arrange(desc(n))## # A tibble: 92 x 2
## country n
## <chr> <int>
## 1 Belgium 8
## 2 France 8
## 3 Romania 8
## 4 Spain 8
## 5 Turkey 8
## 6 United Kingdom 7
## 7 Uzbekistan 7
## 8 Argentina 6
## 9 Brazil 6
## 10 Italy 6
## # ... with 82 more rows
five_to_fourteen <- tb_clean %>% filter(age_range=="514" & year %in% (2004:max(year)) & cases > 0) %>%
mutate(percentage = scales::percent(cases/total_cases))
five_to_fourteen %>% arrange(desc(cases))## # A tibble: 480 x 7
## country year total_cases cases gender age_range percentage
## <chr> <dbl> <int> <int> <fct> <fct> <chr>
## 1 South Africa 2008 138803 2062 f 514 1.485559%
## 2 South Africa 2006 131099 1959 f 514 1.494291%
## 3 South Africa 2008 138803 1519 m 514 1.094357%
## 4 South Africa 2006 131099 1407 m 514 1.073235%
## 5 Indonesia 2008 166376 895 f 514 0.537938%
## 6 South Africa 2007 135604 894 f 514 0.659273%
## 7 Indonesia 2007 160617 772 f 514 0.480647%
## 8 Indonesia 2008 166376 710 m 514 0.426744%
## 9 Indonesia 2007 160617 636 m 514 0.395973%
## 10 South Africa 2007 135604 594 m 514 0.438040%
## # ... with 470 more rows
summary(five_to_fourteen[c("cases","gender")])## cases gender
## Min. : 1.00 f:256
## 1st Qu.: 2.00 m:224
## Median : 7.00
## Mean : 57.19
## 3rd Qu.: 31.00
## Max. :2062.00
five_to_fourteen %>% group_by(country) %>% tally() %>% arrange(desc(n))## # A tibble: 121 x 2
## country n
## <chr> <int>
## 1 France 8
## 2 Germany 8
## 3 Italy 8
## 4 Kyrgyzstan 8
## 5 Portugal 8
## 6 Romania 8
## 7 Spain 8
## 8 Turkey 8
## 9 United Kingdom 8
## 10 Uzbekistan 8
## # ... with 111 more rows
zero_to_four %>% ggplot(aes(x=year,y=cases,color=gender)) + geom_point()five_to_fourteen %>% ggplot(aes(x=year,y=cases,color=gender)) + geom_point()Calculating Total Cases
Calculate total cases by country and year by summarizing cases reported within each demographic.
Interesting enough, the manual summary of cases within each demographic by country/year is not equal to the total number of tb cases reported by country/year within the original dataset. Comparing the difference, India appears to be most off in total cases vs cases totalled across demographics. China, Ukraine, and Malaysia all appear to have cases of underreporting total tb cases vs cases totalled across demographics.
summaries <- tb_clean %>% group_by(country, year) %>% summarize(country, year, total_cases, total = sum(cases), difference = total_cases - total) %>% unique()## `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
summaries %>% arrange(desc(difference)) %>% head(10)## # A tibble: 10 x 5
## # Groups: country, year [10]
## country year total_cases total difference
## <chr> <dbl> <int> <int> <int>
## 1 India 1999 345150 52900 292250
## 2 India 1996 290953 6254 284699
## 3 India 1997 274877 7708 267169
## 4 India 1998 278275 12421 265854
## 5 India 2001 384827 185277 199550
## 6 India 2000 349374 193149 156225
## 7 Philippines 1996 86695 318 86377
## 8 Philippines 1997 80163 1952 78211
## 9 Philippines 1998 69476 1889 67587
## 10 South Africa 2000 75967 11490 64477
summaries %>% arrange(difference) %>% head(10)## # A tibble: 10 x 5
## # Groups: country, year [10]
## country year total_cases total difference
## <chr> <dbl> <int> <int> <int>
## 1 Brazil 2000 41186 80488 -39302
## 2 Ukraine 2007 11028 25679 -14651
## 3 Ukraine 2001 0 11952 -11952
## 4 China 1998 202817 214404 -11587
## 5 China 1999 201775 212258 -10483
## 6 China 2000 204765 213766 -9001
## 7 China 2001 204591 212766 -8175
## 8 Malaysia 2005 8446 16066 -7620
## 9 Malaysia 2004 7843 15394 -7551
## 10 Malaysia 2008 10441 17967 -7526
Summarizing Cases by Percentage of Yearly Total
It appears that reported tb case count has increased substantially in recent years. It would be interesting to know if this is due to an actual increase in cases or the result of more accurate & complete reporting.
by_year <- tb_clean %>%
group_by(year) %>%
mutate(yearly_total = sum(cases))
by_year %>%
ggplot(aes(x=year,y=yearly_total)) + geom_line()group_by_total <- by_year %>%
group_by(year,country) %>%
summarize(
percentage_of_all = round(total_cases/yearly_total,2)) %>%
arrange(desc(percentage_of_all)) %>%
unique() %>%
head(10)## `summarise()` has grouped output by 'year', 'country'. You can override using the `.groups` argument.
group_by_total## # A tibble: 10 x 3
## # Groups: year, country [10]
## year country percentage_of_all
## <dbl> <chr> <dbl>
## 1 1996 India 0.46
## 2 1997 India 0.38
## 3 1999 India 0.35
## 4 1998 India 0.33
## 5 1996 China 0.32
## 6 1997 China 0.32
## 7 2001 India 0.31
## 8 2000 India 0.3
## 9 2002 India 0.26
## 10 1998 China 0.24
It appears that China and India saw rapid increases in tb case reporting in the first decade of the 21st century. China and India both have had very high proportions of the yearly global tb case load in the past 2 decades.
top_c <- by_year %>% filter(country %in% c("China","India","Indonesia"))
top_c %>% ggplot(aes(x=year,y=total_cases)) + geom_line(aes(color=country)) +scale_color_brewer(palette="Set1")Dataset 2 - Titanic
I am importing the popular dataset of survival rates among passenger demographics of the HMS Titanic sinking in 1910.
NAMES <- read.table("https://raw.githubusercontent.com/cassie-boylan/DATA-607-Project-2/main/titanic.csv", nrow = 1, stringsAsFactors = FALSE, sep = ",")
DATA <- read.table("https://raw.githubusercontent.com/cassie-boylan/DATA-607-Project-2/main/titanic.csv", skip = 1, stringsAsFactors = FALSE, sep = ",", na.strings=c("", NA))
titanic <- DATA[, 1:12]
names(titanic) <- NAMESExtracting Multi-Variables
titanic_clean <- titanic %>%
separate(Name, into=c("Last.Name", "First.Name"), sep=",") %>%
mutate(Age = round(Age,0))
names(titanic_clean)[8] <- "Sibling.Spouse.Count"
names(titanic_clean)[9] <- "Parent.Child.Count"Handling Missing Data
There are 177 observations missing data. Calling a plot of the dataset utilizing vis_dat, I can see that all the NA values are within the variable Age and Cabin.
I can also see that all the dummy variables of this dataset are set as numeric. They should be represented as factors.
sum(is.na(titanic_clean))## [1] 866
vis_dat(titanic_clean)Since this age is a numeric value, imputing with the mean value seems reasonable. I ran a quick five num and confirmed from IQR and confirmed that mean is similar to median and can be considered a fair measure of center.
mean_age <- round(mean(titanic_clean$Age, na.rm=TRUE),0)
fivenum(titanic_clean$Age, na.rm=TRUE)## [1] 0 20 28 38 80
titanic_clean <- titanic_clean %>%
mutate(Age = replace_na(Age, mean_age))A quick look through the Cabin values does show that this variable does not carry any real meaning so I am dropping column.
titanic_clean <- titanic_clean %>%
select(-Cabin)How many of each gender & socio-economic class survived the Titanic crash?
It appears that if you were a female of first or second class, your odds of survival look pretty good. Women of first and second class had a survival rate of 97% and 92% Only 50% of woman of the third class survived, which was still better than men of the first class at 37%.
Most saddening is the survival rate of men of the third class at 15%
titanic_sum <- titanic_clean %>%
group_by(Pclass, Sex) %>%
summarize(survived=round(sum(Survived)/n(),2),
total_passengers = n(),
dead_passengers = total_passengers - round((survived * total_passengers),0))
titanic_sum## # A tibble: 6 x 5
## # Groups: Pclass [3]
## Pclass Sex survived total_passengers dead_passengers
## <int> <chr> <dbl> <int> <dbl>
## 1 1 female 0.97 94 3
## 2 1 male 0.37 122 77
## 3 2 female 0.92 76 6
## 4 2 male 0.16 108 91
## 5 3 female 0.5 144 72
## 6 3 male 0.14 347 298
Datset 3 - Religion vs Income
This dataset tabulates survey responses of self-reported religious affiliation and yearly income.
Importing Data
religion_income <- read_csv("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv", show_col_types = FALSE)Reshaping Data
moving religion_income dataset long to wide
religion_income <- religion_income %>%
gather(key="income", value="frequency", "<10k":"refused", -religion) %>%
filter(religion != "refused") %>%
arrange(desc(frequency))
religion_income## # A tibble: 170 x 3
## religion income frequency
## <chr> <chr> <dbl>
## 1 Evangelical Prot refused 1529
## 2 Catholic refused 1489
## 3 Evangelical Prot 50-75k 1486
## 4 Mainline Prot refused 1328
## 5 Catholic 50-75k 1116
## 6 Mainline Prot 50-75k 1107
## 7 Evangelical Prot 20-30k 1064
## 8 Evangelical Prot 30-40k 982
## 9 Catholic 75-100k 949
## 10 Evangelical Prot 75-100k 949
## # ... with 160 more rows
unique(religion_income$religion)## [1] "Evangelical Prot" "Catholic"
## [3] "Mainline Prot" "Unaffiliated"
## [5] "Historically Black Prot" "Jewish"
## [7] "Agnostic" "Mormon"
## [9] "Atheist" "Orthodox"
## [11] "Other Faiths" "Buddhist"
## [13] "Hindu" "Jehovah's Witness"
## [15] "Muslim" "Other Christian"
## [17] "Other World Religions"
Recoding Variables for Cleaner Analysis
religion_income <- religion_income %>%
mutate(income_level = case_when(
income %in% c("75-100k","100-150k", ">150k") ~ "Wealthy",
income %in% c("50-75k","40-50k","30-40k") ~"Middle Class",
income %in% c("20-30k","10-20k", "<10k")~ "Blue Collar",
TRUE ~ "unknown"))by_class <- religion_income %>%
group_by(income_level) %>%
summarise(ppl_total = sum(frequency)) %>%
arrange(desc(ppl_total))by_religion <- religion_income %>%
group_by(religion) %>%
summarise(ppl_total = sum(frequency)) %>%
arrange(desc(ppl_total))religion_income2 <- religion_income %>%
group_by(income_level, religion) %>%
mutate(total = sum(frequency)) %>%
select(religion, income_level, total) %>%
arrange(religion, income_level) %>%
unique()religion_income_wide <- religion_income2 %>% pivot_wider(names_from = religion, values_from = total)rel_inc_distr <- religion_income2 %>% group_by(religion) %>% mutate(sum = sum(total), percent = scales::percent(total/sum))Subsetting Data to DrillDown
religion_income_ca <- rel_inc_distr %>%
filter(religion == "Catholic")
religion_income_ca %>% select(income_level, percent)## Adding missing grouping variables: `religion`
## # A tibble: 4 x 3
## # Groups: religion [1]
## religion income_level percent
## <chr> <chr> <chr>
## 1 Catholic Blue Collar 21.94%
## 2 Catholic Middle Class 30.10%
## 3 Catholic unknown 18.49%
## 4 Catholic Wealthy 29.48%
religion_skeptics <- rel_inc_distr %>% filter(religion %in% c("Atheist","Agnostic","Jewish"))religion_protestant <- rel_inc_distr %>% filter(grepl("Prot",religion))Plotting Subsets
ggplot(religion_skeptics,
aes(fill=income_level, x=religion, y=total)) + geom_bar(position="stack",stat="identity") + scale_color_brewer(palette="Pastel1")ggplot(religion_protestant,
aes(fill=income_level, x=religion, y=total)) + geom_bar(position="stack",stat="identity") + scale_color_brewer(palette="Pastel1")