Dataset 1 - Global TB rates

This dataset represents the tuberculosis cases reported in the past 30 years by country and demographic. The original dataset has multi-variate columns and is missing a substantial amount of data.

I will be tidying this dataset so that each row represents a single observation (i.e. the cases of tb for a particular country/year/gender, & age range). I will be identifying if there are any patterns in the data that is missing and dropping or imputing NA values accordingly. I will be looking up country name to replace the non-intuitive iso country code in the original dataset.

I will then perform some EDA and summary statistics on the tidied dataset.

Importing the Data

NAMES_tb <- read.table("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/tb.csv", nrow = 1, stringsAsFactors = FALSE, sep = ",")

DATA_tb <- read.table("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/tb.csv", skip = 1, stringsAsFactors = FALSE, sep = ",")

tb <- DATA_tb[, 1:length(NAMES_tb)]
names(tb) <- NAMES_tb

lookup_country <- read_csv("https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv", show_col_types = FALSE)

glimpse(tb)

## Rows: 5,769
## Columns: 23
## $ iso2         <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD~
## $ year         <int> 1989, 1990, 1991, 1992, 1993, 1994, 1996, 1997, 1998, 199~
## $ new_sp       <int> NA, NA, NA, NA, 15, 24, 8, 17, 1, 4, 1, 3, 2, 7, 3, 5, 8,~
## $ new_sp_m04   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_m514  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_m014  <int> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ new_sp_m1524 <int> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_m2534 <int> NA, NA, NA, NA, NA, NA, 0, 1, 0, 0, 1, NA, 0, 0, 0, 1, 1,~
## $ new_sp_m3544 <int> NA, NA, NA, NA, NA, NA, 4, 2, 1, 1, 0, 2, 1, 1, 1, 1, 2, ~
## $ new_sp_m4554 <int> NA, NA, NA, NA, NA, NA, 1, 2, 0, 1, 0, 1, 0, 2, 1, 0, 0, ~
## $ new_sp_m5564 <int> NA, NA, NA, NA, NA, NA, 0, 1, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_m65   <int> NA, NA, NA, NA, NA, NA, 0, 6, 0, 0, 0, NA, 0, 0, 0, 0, 1,~
## $ new_sp_mu    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f04   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f514  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ new_sp_f014  <int> NA, NA, NA, NA, NA, NA, 0, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f1524 <int> NA, NA, NA, NA, NA, NA, 1, 1, NA, 0, NA, NA, 1, 1, 0, 1, ~
## $ new_sp_f2534 <int> NA, NA, NA, NA, NA, NA, 1, 2, NA, 0, NA, NA, 0, 1, 1, 1, ~
## $ new_sp_f3544 <int> NA, NA, NA, NA, NA, NA, 0, 3, NA, 1, NA, NA, 0, 1, 0, 1, ~
## $ new_sp_f4554 <int> NA, NA, NA, NA, NA, NA, 0, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f5564 <int> NA, NA, NA, NA, NA, NA, 1, 0, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_f65   <int> NA, NA, NA, NA, NA, NA, 0, 1, NA, 0, NA, NA, 0, 0, 0, 0, ~
## $ new_sp_fu    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

Reshaping the Dataset

I am collapsing the columns that are storing multiple variables per column (gender, age range, and tb_cases) into 2 columns - demographic and case count. I am pivoting on iso2, year, and new_sp columns.

I chose to keep new_sp column as this column captures the total of all tb cases reported per country per year. I am renaming this column total_cases for better accuracy.

tb_clean <- tb %>%
  pivot_longer(
    cols = -c(iso2,year,new_sp),
    names_to = "demo", 
    values_to = "cases") %>%   
  rename("total_cases" = new_sp)

Since the disparate variables of the demo (demographic) column are not separated by any non-alphanumeric characters, I am using str_extract() inside a mutate to pull out variables gender and age range instead of a separate() verb.

Both new variables will be factor variables in that they have a fixed select number of values they could be, so I am also casting both variables to factor inside my mutate.

tb_clean <- tb_clean %>%
  mutate(
    gender = as.factor(str_extract(demo, "f|m")),
    age_range =                              (str_extract(demo,"[0-9]{2,4}"))
    ) %>%
  replace_na(list(age_range = "unknown")) %>%
  select(-demo)

Handling Missing Data

It is noticeable that there are many NA values in this dataset.

sum(is.na(tb_clean))

## [1] 133850

vis_miss(tb_clean)

Looking at the data, I suspect that the NA values are more often happening in earlier years where data may have been more difficult to ascertain. I am going to arrange by year and then call vis_miss. Should my suspicion be correct, the missing values should all show as sequential (ie. data is missing at random but not completely at random).

tb_clean_by_year <- tb_clean %>%
  arrange(year)
vis_miss(tb_clean_by_year)

It looks like my hunch was correct. To see at around what year did frequency of NA values drop off, I will filter and count only the observations with NA in the cases column and then plot.

tb_missing <- tb_clean %>%
  filter(is.na(cases)) %>%
  count(year, cases, name = "freq")

ggplot(tb_missing, aes(x=year, y=freq)) + geom_col()

It appears that only around 1995 did data begin to be collected in a comprehensive manner. It could still be helpful to retain incidence rates from the 80s for certain countries where this data was collected successfully. However, I am looking to apply summary statistics across countries, and the quality/accuracy of this analysis will significantly decrease with so many missing values; therefore I am going to drop all observational records from 1980-1995.

tb_clean$year <- as.numeric(tb_clean$year)

exclude <- c(1980:1995)

tb_clean<- tb_clean %>%
  filter(!(year %in% exclude))

At this point, we have the choice to either assume that where values are NA in the case column, no tb cases were reported or no data was collected at all. We can discern from the dataset that there do exist records of cases reported at 0 indicating that NA in this dataset is not necessarily a proxy for 0 cases.

tb_clean %>% filter(cases == 0) %>% count(cases)

## # A tibble: 1 x 2
##   cases     n
##   <int> <int>
## 1     0  4096

tb_clean %>% filter(is.na(cases)) %>% count(cases)

## # A tibble: 1 x 2
##   cases     n
##   <int> <int>
## 1    NA 19998

Where total cases reported are also NA, records containing NA values in cases variable will be dropped. Where a total case count for country/year has been reported, cases will be imputed as 0.

tb_clean <- tb_clean %>% filter(!is.na(cases) & !is.na(total_cases))

tb_clean <- tb_clean %>% replace_na(list(cases = 0L))

It was noted in our plot of missing values that there are a number of observations where the country code is missing. After dropping by year and dropping by cases = NA, there is only 184 records without an iso code. Reviewing this extracted data, it appears these values are not missing at random but are of a country code not included. We will need to drop all records missing the iso code as imputation does not appear possible.

sum(is.na(tb_clean$iso2))

## [1] 184

missing_country <- tb_clean %>% filter(is.na(iso2))
missing_country %>% group_by(year) %>% tally()

## # A tibble: 13 x 2
##     year     n
##    <dbl> <int>
##  1  1996    14
##  2  1997    14
##  3  1998    14
##  4  1999    14
##  5  2000    14
##  6  2001    14
##  7  2002    14
##  8  2003    14
##  9  2004    14
## 10  2005    14
## 11  2006    14
## 12  2007    14
## 13  2008    16

Replacing Variable Values via Lookup

By importing the lookup table for country and iso codes, we can lookup and add the country name for each iso code listed via join(). This will make for easier interpretation under which country each tb case record is reporting.

tb_clean <- lookup_country %>% inner_join(tb_clean, by= c("Code"="iso2"))

tb_clean <- tb_clean %>% rename("country"= Name) %>% select(-Code)

Exploratory Data Analysis on Age Ranges

If we list out the possible age ranges, we see that there is an age range value for unknown and age ranges for 0-4, 5-14, and 0-14 years of age. Lets do some digging on the cases reported in these unexpected age ranges.

tb_clean$age_range <- as.factor(tb_clean$age_range)

unique(tb_clean$age_range)

##  [1] 014     1524    2534    3544    4554    5564    65      04      514    
## [10] unknown
## Levels: 014 04 1524 2534 3544 4554 514 5564 65 unknown

TB Cases Reported where Age Range Unknown

How many cases have been reported where the gender but not the age range was unknown?
Where/when were these cases highest?

tb_clean %>% filter(age_range =="unknown" & cases > 0) %>% arrange(desc(cases))

## # A tibble: 53 x 6
##    country    year total_cases cases gender age_range
##    <chr>     <dbl>       <int> <int> <fct>  <fct>    
##  1 Nepal      2008       14640  1416 m      unknown  
##  2 Nepal      2008       14640   631 f      unknown  
##  3 Argentina  2008        4758    30 m      unknown  
##  4 Italy      2005        1275    24 m      unknown  
##  5 Italy      2006        1377    19 m      unknown  
##  6 Italy      2007         979    11 m      unknown  
##  7 Italy      2006        1377    10 f      unknown  
##  8 Italy      2005        1275     9 f      unknown  
##  9 Argentina  2008        4758     8 f      unknown  
## 10 Egypt      2008        5102     8 m      unknown  
## # ... with 43 more rows

It appears there are 53 cases (within the subset timeframe_ in which tb cases were reported absent of age range. The largest numbers were reported in Nepal in 2008. It also appears that Italy has in multiple recent years reported a small number of cases absent of age range information.

TB Cases Reported in Children

How many cases of TB have been reported in the age range 0-4 years? It appears that cases involving children 0-4 have only been reported in the last years of this dataset: 2005-2008. The highest numbers recorded in any one year / country / gender were ~600 in South Africa 2006. Within the same period, 2006-2008, tb cases were reported at their highest among children 5-14 in the countries of South Africa and Indonesia. However, on the bright side, tb cases among children represent less than 2% of all reported cases in any year.

zero_to_four <- tb_clean %>% filter(age_range== "04" & cases > 0) %>%
  mutate(percentage = scales::percent(cases/total_cases))

zero_to_four %>% arrange(desc(cases))

## # A tibble: 281 x 7
##    country       year total_cases cases gender age_range percentage
##    <chr>        <dbl>       <int> <int> <fct>  <fct>     <chr>     
##  1 South Africa  2006      131099   655 m      04        0.4996224%
##  2 South Africa  2006      131099   620 f      04        0.4729250%
##  3 South Africa  2008      138803   344 m      04        0.2478333%
##  4 South Africa  2007      135604   340 m      04        0.2507301%
##  5 South Africa  2008      138803   298 f      04        0.2146928%
##  6 South Africa  2007      135604   293 f      04        0.2160703%
##  7 Indonesia     2007      160617   213 m      04        0.1326136%
##  8 Brazil        2007       38444   175 m      04        0.4552076%
##  9 Indonesia     2008      166376   161 m      04        0.0967688%
## 10 Indonesia     2007      160617   148 f      04        0.0921447%
## # ... with 271 more rows

summary(zero_to_four[c("cases","gender")])

##      cases        gender 
##  Min.   :  1.00   f:133  
##  1st Qu.:  1.00   m:148  
##  Median :  3.00          
##  Mean   : 21.26          
##  3rd Qu.:  8.00          
##  Max.   :655.00

zero_to_four %>% group_by(country) %>% tally() %>% arrange(desc(n))

## # A tibble: 92 x 2
##    country            n
##    <chr>          <int>
##  1 Belgium            8
##  2 France             8
##  3 Romania            8
##  4 Spain              8
##  5 Turkey             8
##  6 United Kingdom     7
##  7 Uzbekistan         7
##  8 Argentina          6
##  9 Brazil             6
## 10 Italy              6
## # ... with 82 more rows

five_to_fourteen <- tb_clean %>% filter(age_range=="514" & year %in% (2004:max(year)) & cases > 0) %>%
    mutate(percentage = scales::percent(cases/total_cases))

five_to_fourteen %>% arrange(desc(cases))

## # A tibble: 480 x 7
##    country       year total_cases cases gender age_range percentage
##    <chr>        <dbl>       <int> <int> <fct>  <fct>     <chr>     
##  1 South Africa  2008      138803  2062 f      514       1.485559% 
##  2 South Africa  2006      131099  1959 f      514       1.494291% 
##  3 South Africa  2008      138803  1519 m      514       1.094357% 
##  4 South Africa  2006      131099  1407 m      514       1.073235% 
##  5 Indonesia     2008      166376   895 f      514       0.537938% 
##  6 South Africa  2007      135604   894 f      514       0.659273% 
##  7 Indonesia     2007      160617   772 f      514       0.480647% 
##  8 Indonesia     2008      166376   710 m      514       0.426744% 
##  9 Indonesia     2007      160617   636 m      514       0.395973% 
## 10 South Africa  2007      135604   594 m      514       0.438040% 
## # ... with 470 more rows

summary(five_to_fourteen[c("cases","gender")])

##      cases         gender 
##  Min.   :   1.00   f:256  
##  1st Qu.:   2.00   m:224  
##  Median :   7.00          
##  Mean   :  57.19          
##  3rd Qu.:  31.00          
##  Max.   :2062.00

five_to_fourteen %>% group_by(country) %>% tally() %>% arrange(desc(n))

## # A tibble: 121 x 2
##    country            n
##    <chr>          <int>
##  1 France             8
##  2 Germany            8
##  3 Italy              8
##  4 Kyrgyzstan         8
##  5 Portugal           8
##  6 Romania            8
##  7 Spain              8
##  8 Turkey             8
##  9 United Kingdom     8
## 10 Uzbekistan         8
## # ... with 111 more rows

zero_to_four %>% ggplot(aes(x=year,y=cases,color=gender)) + geom_point()

five_to_fourteen %>% ggplot(aes(x=year,y=cases,color=gender)) + geom_point()

Calculating Total Cases

Calculate total cases by country and year by summarizing cases reported within each demographic.

Interesting enough, the manual summary of cases within each demographic by country/year is not equal to the total number of tb cases reported by country/year within the original dataset. Comparing the difference, India appears to be most off in total cases vs cases totalled across demographics. China, Ukraine, and Malaysia all appear to have cases of underreporting total tb cases vs cases totalled across demographics.

summaries <- tb_clean %>% group_by(country, year) %>% summarize(country, year, total_cases, total = sum(cases), difference = total_cases - total) %>% unique()

## `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.

summaries %>% arrange(desc(difference)) %>% head(10)

## # A tibble: 10 x 5
## # Groups:   country, year [10]
##    country       year total_cases  total difference
##    <chr>        <dbl>       <int>  <int>      <int>
##  1 India         1999      345150  52900     292250
##  2 India         1996      290953   6254     284699
##  3 India         1997      274877   7708     267169
##  4 India         1998      278275  12421     265854
##  5 India         2001      384827 185277     199550
##  6 India         2000      349374 193149     156225
##  7 Philippines   1996       86695    318      86377
##  8 Philippines   1997       80163   1952      78211
##  9 Philippines   1998       69476   1889      67587
## 10 South Africa  2000       75967  11490      64477

summaries %>% arrange(difference) %>% head(10)

## # A tibble: 10 x 5
## # Groups:   country, year [10]
##    country   year total_cases  total difference
##    <chr>    <dbl>       <int>  <int>      <int>
##  1 Brazil    2000       41186  80488     -39302
##  2 Ukraine   2007       11028  25679     -14651
##  3 Ukraine   2001           0  11952     -11952
##  4 China     1998      202817 214404     -11587
##  5 China     1999      201775 212258     -10483
##  6 China     2000      204765 213766      -9001
##  7 China     2001      204591 212766      -8175
##  8 Malaysia  2005        8446  16066      -7620
##  9 Malaysia  2004        7843  15394      -7551
## 10 Malaysia  2008       10441  17967      -7526

Summarizing Cases by Percentage of Yearly Total

It appears that reported tb case count has increased substantially in recent years. It would be interesting to know if this is due to an actual increase in cases or the result of more accurate & complete reporting.

by_year <- tb_clean %>%
  group_by(year) %>%
  mutate(yearly_total = sum(cases)) 

by_year %>%
  ggplot(aes(x=year,y=yearly_total)) + geom_line()

group_by_total <- by_year %>% 
  group_by(year,country) %>%
  summarize(
    percentage_of_all = round(total_cases/yearly_total,2)) %>%
  arrange(desc(percentage_of_all)) %>% 
  unique() %>%
  head(10)

## `summarise()` has grouped output by 'year', 'country'. You can override using the `.groups` argument.

group_by_total

## # A tibble: 10 x 3
## # Groups:   year, country [10]
##     year country percentage_of_all
##    <dbl> <chr>               <dbl>
##  1  1996 India                0.46
##  2  1997 India                0.38
##  3  1999 India                0.35
##  4  1998 India                0.33
##  5  1996 China                0.32
##  6  1997 China                0.32
##  7  2001 India                0.31
##  8  2000 India                0.3 
##  9  2002 India                0.26
## 10  1998 China                0.24

It appears that China and India saw rapid increases in tb case reporting in the first decade of the 21st century. China and India both have had very high proportions of the yearly global tb case load in the past 2 decades.

top_c <- by_year %>% filter(country %in% c("China","India","Indonesia"))

top_c %>% ggplot(aes(x=year,y=total_cases)) + geom_line(aes(color=country)) +scale_color_brewer(palette="Set1")

Dataset 2 - Titanic

I am importing the popular dataset of survival rates among passenger demographics of the HMS Titanic sinking in 1910.

NAMES <- read.table("https://raw.githubusercontent.com/cassie-boylan/DATA-607-Project-2/main/titanic.csv", nrow = 1, stringsAsFactors = FALSE, sep = ",")

DATA <- read.table("https://raw.githubusercontent.com/cassie-boylan/DATA-607-Project-2/main/titanic.csv", skip = 1, stringsAsFactors = FALSE, sep = ",", na.strings=c("", NA))

titanic <- DATA[, 1:12]
names(titanic) <- NAMES

Extracting Multi-Variables

titanic_clean <- titanic %>%
  separate(Name, into=c("Last.Name", "First.Name"), sep=",") %>%
  mutate(Age = round(Age,0))

names(titanic_clean)[8] <- "Sibling.Spouse.Count"
names(titanic_clean)[9] <- "Parent.Child.Count"

Handling Missing Data

There are 177 observations missing data. Calling a plot of the dataset utilizing vis_dat, I can see that all the NA values are within the variable Age and Cabin.

I can also see that all the dummy variables of this dataset are set as numeric. They should be represented as factors.

sum(is.na(titanic_clean))

## [1] 866

vis_dat(titanic_clean)

Since this age is a numeric value, imputing with the mean value seems reasonable. I ran a quick five num and confirmed from IQR and confirmed that mean is similar to median and can be considered a fair measure of center.

mean_age <- round(mean(titanic_clean$Age, na.rm=TRUE),0)

fivenum(titanic_clean$Age, na.rm=TRUE)

## [1]  0 20 28 38 80

titanic_clean <- titanic_clean %>%
  mutate(Age = replace_na(Age, mean_age))

A quick look through the Cabin values does show that this variable does not carry any real meaning so I am dropping column.

titanic_clean <- titanic_clean %>%
  select(-Cabin)

How many of each gender & socio-economic class survived the Titanic crash?

It appears that if you were a female of first or second class, your odds of survival look pretty good. Women of first and second class had a survival rate of 97% and 92% Only 50% of woman of the third class survived, which was still better than men of the first class at 37%.
Most saddening is the survival rate of men of the third class at 15%

titanic_sum <- titanic_clean %>%
  group_by(Pclass, Sex) %>% 
  summarize(survived=round(sum(Survived)/n(),2),
            total_passengers = n(),
            dead_passengers = total_passengers - round((survived * total_passengers),0))

titanic_sum

## # A tibble: 6 x 5
## # Groups:   Pclass [3]
##   Pclass Sex    survived total_passengers dead_passengers
##    <int> <chr>     <dbl>            <int>           <dbl>
## 1      1 female     0.97               94               3
## 2      1 male       0.37              122              77
## 3      2 female     0.92               76               6
## 4      2 male       0.16              108              91
## 5      3 female     0.5               144              72
## 6      3 male       0.14              347             298

Datset 3 - Religion vs Income

This dataset tabulates survey responses of self-reported religious affiliation and yearly income.

Importing Data

religion_income <- read_csv("https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv", show_col_types = FALSE)

Reshaping Data

moving religion_income dataset long to wide

religion_income <- religion_income %>%
  gather(key="income", value="frequency", "<10k":"refused", -religion) %>%
  filter(religion != "refused") %>%
  arrange(desc(frequency))

religion_income

## # A tibble: 170 x 3
##    religion         income  frequency
##    <chr>            <chr>       <dbl>
##  1 Evangelical Prot refused      1529
##  2 Catholic         refused      1489
##  3 Evangelical Prot 50-75k       1486
##  4 Mainline Prot    refused      1328
##  5 Catholic         50-75k       1116
##  6 Mainline Prot    50-75k       1107
##  7 Evangelical Prot 20-30k       1064
##  8 Evangelical Prot 30-40k        982
##  9 Catholic         75-100k       949
## 10 Evangelical Prot 75-100k       949
## # ... with 160 more rows

unique(religion_income$religion)

##  [1] "Evangelical Prot"        "Catholic"               
##  [3] "Mainline Prot"           "Unaffiliated"           
##  [5] "Historically Black Prot" "Jewish"                 
##  [7] "Agnostic"                "Mormon"                 
##  [9] "Atheist"                 "Orthodox"               
## [11] "Other Faiths"            "Buddhist"               
## [13] "Hindu"                   "Jehovah's Witness"      
## [15] "Muslim"                  "Other Christian"        
## [17] "Other World Religions"

Recoding Variables for Cleaner Analysis

religion_income <- religion_income %>%
  mutate(income_level = case_when(
    income %in% c("75-100k","100-150k", ">150k") ~ "Wealthy", 
    income %in% c("50-75k","40-50k","30-40k") ~"Middle Class", 
    income %in% c("20-30k","10-20k", "<10k")~ "Blue Collar",
    TRUE ~ "unknown"))

by_class <- religion_income %>%
  group_by(income_level) %>%
  summarise(ppl_total = sum(frequency)) %>%
  arrange(desc(ppl_total))

by_religion <- religion_income %>%
  group_by(religion) %>%
  summarise(ppl_total = sum(frequency)) %>%
  arrange(desc(ppl_total))

religion_income2 <- religion_income %>%
  group_by(income_level, religion) %>%
  mutate(total = sum(frequency)) %>%
  select(religion, income_level, total) %>%
  arrange(religion, income_level) %>%
  unique()

religion_income_wide <- religion_income2 %>% pivot_wider(names_from = religion, values_from = total)

rel_inc_distr <- religion_income2 %>% group_by(religion) %>% mutate(sum = sum(total),                          percent = scales::percent(total/sum))

Subsetting Data to DrillDown

religion_income_ca <- rel_inc_distr %>%
  filter(religion == "Catholic") 

religion_income_ca %>% select(income_level, percent)

## Adding missing grouping variables: `religion`

## # A tibble: 4 x 3
## # Groups:   religion [1]
##   religion income_level percent
##   <chr>    <chr>        <chr>  
## 1 Catholic Blue Collar  21.94% 
## 2 Catholic Middle Class 30.10% 
## 3 Catholic unknown      18.49% 
## 4 Catholic Wealthy      29.48%

religion_skeptics <- rel_inc_distr %>% filter(religion %in% c("Atheist","Agnostic","Jewish"))

religion_protestant <- rel_inc_distr %>% filter(grepl("Prot",religion))

Plotting Subsets

ggplot(religion_skeptics, 
        aes(fill=income_level, x=religion, y=total)) + geom_bar(position="stack",stat="identity") + scale_color_brewer(palette="Pastel1")

ggplot(religion_protestant, 
        aes(fill=income_level, x=religion, y=total)) + geom_bar(position="stack",stat="identity") + scale_color_brewer(palette="Pastel1")

Project 2

Cassie Boylan

10/3/2021

Dataset 1 - Global TB rates

Importing the Data

Reshaping the Dataset

Handling Missing Data

Replacing Variable Values via Lookup

Exploratory Data Analysis on Age Ranges

TB Cases Reported where Age Range Unknown

TB Cases Reported in Children

Calculating Total Cases

Calculate total cases by country and year by summarizing cases reported within each demographic.

Summarizing Cases by Percentage of Yearly Total

Dataset 2 - Titanic

Extracting Multi-Variables

Handling Missing Data

How many of each gender & socio-economic class survived the Titanic crash?

Datset 3 - Religion vs Income

Importing Data

Reshaping Data

Recoding Variables for Cleaner Analysis

Subsetting Data to DrillDown

Plotting Subsets