## Warning: package 'arsenal' was built under R version 4.4.3
The purpose of this lab is to tidy and demonstrate an analysis of three different data sets. The data sets chosen were the Pokemon, cheese, and cancer data sets from the Week 5 Discussion.
The goal for the Pokemon data set is to use the usage values as target variables. In theory, almost all of a Pokemon’s properties can work as a predictor variable; even a Pokemon’s name and generation can affect its usage rate. For our analysis, we we focus on “total_stats” and “usage”.
Load the Pokemon competitive usage data set. There are multiple columns including a Pokemon’s name, general properties, and usage information.
pokemon_usage_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/project2/pokemon_competitive_analysis.csv'
pokemon_usage_raw <- fread(pokemon_usage_url)
head(pokemon_usage_raw)
## index name type1 type2 ability1 ability2 hidden_ability hp
## <int> <char> <char> <char> <char> <char> <char> <int>
## 1: 1 bulbasaur grass poison overgrow No_ability chlorophyll 45
## 2: 2 ivysaur grass poison overgrow No_ability chlorophyll 60
## 3: 3 venusaur grass poison overgrow No_ability chlorophyll 80
## 4: 3 venusaur-mega grass poison thick-fat No_ability None 80
## 5: 3 venusaur-gmax grass poison overgrow No_ability chlorophyll 80
## 6: 4 charmander fire No_type blaze No_ability solar-power 39
## attack defense sp_atk sp_def speed total_stats legendary mythical
## <int> <int> <int> <int> <int> <int> <lgcl> <lgcl>
## 1: 49 49 65 65 45 318 FALSE FALSE
## 2: 62 63 80 80 60 405 FALSE FALSE
## 3: 82 83 100 100 80 525 FALSE FALSE
## 4: 100 123 122 120 80 625 FALSE FALSE
## 5: 82 83 100 100 80 525 FALSE FALSE
## 6: 52 43 60 50 65 309 FALSE FALSE
## generation Smogon_VGC_Usage_2022 Smogon_VGC_Usage_2023
## <char> <char> <char>
## 1: generation-i 0.0 NoUsage
## 2: generation-i 0.0 NoUsage
## 3: generation-i 20.83915 NoUsage
## 4: generation-i NoUsage NoUsage
## 5: generation-i NoUsage NoUsage
## 6: generation-i 0.0 0.0
## Smogon_VGC_Usage_2024 Worlds_VGC_Usage_2022 Worlds_VGC_Usage_2023
## <char> <char> <char>
## 1: 0.0 NoUsage NoUsage
## 2: 0.0 NoUsage NoUsage
## 3: 0.4441 19.62 NoUsage
## 4: NoUsage NoUsage NoUsage
## 5: NoUsage NoUsage NoUsage
## 6: 0.0 NoUsage NoUsage
## Worlds_VGC_Usage_2024
## <char>
## 1: NoUsage
## 2: NoUsage
## 3: 0.09
## 4: NoUsage
## 5: NoUsage
## 6: NoUsage
Most of the columns are base features of the listed Pokemon, but the last 6 are VGC data which denotes the competitive format usage. Since these are observations, they should not be mixed with the other variables. This dataset is a strong candidate for converting from wide to long format. Use “melt” to combine all the VGC columns together.
pokemon_usage_long <- pokemon_usage_raw %>% melt(id.vars = c(colnames(pokemon_usage_raw)[1:17]), variable.name = 'competitive_format', value.name = 'usage')
head(pokemon_usage_long)
## index name type1 type2 ability1 ability2 hidden_ability hp
## <int> <char> <char> <char> <char> <char> <char> <int>
## 1: 1 bulbasaur grass poison overgrow No_ability chlorophyll 45
## 2: 2 ivysaur grass poison overgrow No_ability chlorophyll 60
## 3: 3 venusaur grass poison overgrow No_ability chlorophyll 80
## 4: 3 venusaur-mega grass poison thick-fat No_ability None 80
## 5: 3 venusaur-gmax grass poison overgrow No_ability chlorophyll 80
## 6: 4 charmander fire No_type blaze No_ability solar-power 39
## attack defense sp_atk sp_def speed total_stats legendary mythical
## <int> <int> <int> <int> <int> <int> <lgcl> <lgcl>
## 1: 49 49 65 65 45 318 FALSE FALSE
## 2: 62 63 80 80 60 405 FALSE FALSE
## 3: 82 83 100 100 80 525 FALSE FALSE
## 4: 100 123 122 120 80 625 FALSE FALSE
## 5: 82 83 100 100 80 525 FALSE FALSE
## 6: 52 43 60 50 65 309 FALSE FALSE
## generation competitive_format usage
## <char> <fctr> <char>
## 1: generation-i Smogon_VGC_Usage_2022 0.0
## 2: generation-i Smogon_VGC_Usage_2022 0.0
## 3: generation-i Smogon_VGC_Usage_2022 20.83915
## 4: generation-i Smogon_VGC_Usage_2022 NoUsage
## 5: generation-i Smogon_VGC_Usage_2022 NoUsage
## 6: generation-i Smogon_VGC_Usage_2022 0.0
Every row is completely filled, but certain cells are effectively “NA” data: “No_type” in “type2”, “No_ability” in “ability2”, “None” in “hidden_ability”, and “NoUsage” in “usage”.
A Pokemon always has a “type1”, but might not have a “type2”. Two Pokemon with the same two types in opposite orders such as “grass” and “poison” or “poison” and “grass” are generally the same for our purposes.
For “No_ability” and “None”, “ability1” is always set and is the default if no other abilities exist. Unfortunately, there is no information within the dataset that determines which ability is being used and how often. For the scope of this analysis, these columns will not be changed.
In “usage”, “NoUsage” stands out because that should be the same as 0% usage. As 0% does exist within the “usage” column as “0.0”, what does “NoUsage” actually mean? We may be able to answer this, and thus decide how to proceed with tidying, by analyzing when we see each value. We will examine data for all Pokemon, used (any number above 0%), no usage (“NoUsage” value), and 0% usage.
used_df <- pokemon_usage_long %>% filter(usage != '0.0' & usage != 'NoUsage')
no_usage_df <- pokemon_usage_long %>% filter(usage == 'NoUsage')
zero_usage_df <- pokemon_usage_long %>% filter(usage == '0.0')
all_summar <- pokemon_usage_long %>% summarise(mean = mean(total_stats), median = median(total_stats))
used_summar <- used_df %>% summarise(mean = mean(total_stats), median = median(total_stats))
no_usage_summar <- no_usage_df %>% summarise(mean = mean(total_stats), median = median(total_stats))
zero_usage_summar <- zero_usage_df %>% summarise(mean = mean(total_stats), median = median(total_stats))
usage_means <- c(all_summar$mean, used_summar$mean, no_usage_summar$mean, zero_usage_summar$mean)
usage_medians <- c(all_summar$median, used_summar$median, no_usage_summar$median, zero_usage_summar$median)
usage_names <- c('all', 'used', 'no_usage', 'zero_usage')
barplot(usage_means, names.arg = usage_names, ylab = 'avg_total_stats')
## usage_type usage_mean usage_median
## 1 all 445.9593 470
## 2 used 510.0786 505
## 3 no_usage 445.1212 460
## 4 zero_usage 337.4237 320
The “total_stats” property is a collection of all the base stats of a Pokemon ranging from “hp” to “speed”. Logically, a Pokemon that is stronger, one that has higher stats, is more likely to be used. 0% usage Pokemon having the lowest total stats in the 300’s makes sense, but “no_usage” is close to the average of all Pokemon in the original dataset. The medians were also calculated as despite being numbers, “total_stats” are actually discrete and not continuous variables. Pokemon are distinctly given totals such as 470 or 505. The used set of Pokemon have the highest median at 505.
There are many potential directions at this juncture. While the original question was what effect does “total_stats” have on “usage”, a subgoal is to tidy the “NoUsage” values. One path forward is to see the distribution of each group’s “total_stats” to see if they have any odd distributions.
pokemon_usage_long %>%
select(c('name', 'total_stats')) %>%
unique() %>%
ggplot(aes(x = total_stats)) +
geom_histogram(bins = 15, binwidth = 20) +
labs(title = 'All Pokemon')
used_df %>%
select(c('name', 'total_stats')) %>%
unique() %>%
ggplot(aes(x = total_stats)) +
geom_histogram(bins = 15, binwidth = 20) +
labs(title = 'Used Pokemon')
no_usage_df %>%
select(c('name', 'total_stats')) %>%
unique() %>%
ggplot(aes(x = total_stats)) +
geom_histogram(bins = 15, binwidth = 20) +
labs(title = 'No Usage Pokemon')
zero_usage_df %>%
select(c('name', 'total_stats')) %>%
unique() %>%
ggplot(aes(x = total_stats)) +
geom_histogram(bins = 15, binwidth = 20) +
labs(title = 'Zero Usage Pokemon')
These histograms agree with what the averages implied. Used Pokemon usually had at least over 500 total stats. Zero usage Pokemon peaked at around 300 total stats. The plot of all Pokemon and no usage Pokemon are somehow still very similar. One more test can be done to verify this finding is the “comparef” function that will show how many observations, individual Pokemon, are not in the no usage group.
comparedf(pokemon_usage_long %>% select(c('name', 'total_stats')) %>% unique(),
no_usage_df %>% select(c('name', 'total_stats')) %>% unique())
## Compare Object
##
## Function Call:
## comparedf(x = pokemon_usage_long %>% select(c("name", "total_stats")) %>%
## unique(), y = no_usage_df %>% select(c("name", "total_stats")) %>%
## unique())
##
## Shared: 2 non-by variables and 1272 observations.
## Not shared: 0 variables and 31 observations.
##
## Differences found in 2/2 variables compared.
## 0 variables compared have non-identical attributes.
With this result that 1272/1303 Pokemon matched, it appears that “total_stats” has almost no correlation with the “NoUsage” value. Almost every Pokemon had a competitive format where it had “NoUsage” regardless of its stats. Given the original question of whether or not “total_stats” impacted usage, we will remove instances of “NoUsage” as they are merely noise for our purpose. This feels rudimentary, but “NoUsage” clearly does not mean 0% usage and cannot be trusted to provide meaningful information.
This leads to the next issue with the original data. The “usage” column treats its values as strings instead of numbers because of “NoUsage” and the small values that use “e” in them. We can start by making a data frame without “NoUsage” in it. The resulting “usage” column is still a character type. Turn the usage values into numerics so that they can be operated on.
pokemon_usage_df <- pokemon_usage_long %>% filter(usage != 'NoUsage')
# typeof(pokemon_usage_df$usage)
pokemon_usage_df_fixed <- transform(pokemon_usage_df, usage = as.numeric(usage))
head(pokemon_usage_df_fixed)
## index name type1 type2 ability1 ability2 hidden_ability hp
## <int> <char> <char> <char> <char> <char> <char> <int>
## 1: 1 bulbasaur grass poison overgrow No_ability chlorophyll 45
## 2: 2 ivysaur grass poison overgrow No_ability chlorophyll 60
## 3: 3 venusaur grass poison overgrow No_ability chlorophyll 80
## 4: 4 charmander fire No_type blaze No_ability solar-power 39
## 5: 5 charmeleon fire No_type blaze No_ability solar-power 58
## 6: 6 charizard fire flying blaze No_ability solar-power 78
## attack defense sp_atk sp_def speed total_stats legendary mythical
## <int> <int> <int> <int> <int> <int> <lgcl> <lgcl>
## 1: 49 49 65 65 45 318 FALSE FALSE
## 2: 62 63 80 80 60 405 FALSE FALSE
## 3: 82 83 100 100 80 525 FALSE FALSE
## 4: 52 43 60 50 65 309 FALSE FALSE
## 5: 64 58 80 65 80 405 FALSE FALSE
## 6: 84 78 109 85 100 534 FALSE FALSE
## generation competitive_format usage
## <char> <fctr> <num>
## 1: generation-i Smogon_VGC_Usage_2022 0.00000
## 2: generation-i Smogon_VGC_Usage_2022 0.00000
## 3: generation-i Smogon_VGC_Usage_2022 20.83915
## 4: generation-i Smogon_VGC_Usage_2022 0.00000
## 5: generation-i Smogon_VGC_Usage_2022 0.00000
## 6: generation-i Smogon_VGC_Usage_2022 7.17374
The main question was the impact of “total_stats” on “usage”. Let’s plot the correlation between these values.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pokemon_usage_df_fixed %>%
filter(usage > 1) %>%
ggplot(aes(x = total_stats, y = usage)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There does appear to be an effect on usage by total_stats. However, most usage rates are still clustered near 0% even among the Pokemon with higher total_stats. The highest usage rate Pokemon at around 60% usage and higher are generally in the 500s. Setting “usage” above 1% to remove the impact of barely used Pokemon shows a peak usage of Pokemon around 550 total stats. This implies that other predictors are also affecting usage.
The average usage percentage of a Pokemon is 1.6%. Keep this in mind for the final plot.
## mean
## 1 1.608681
aggr_df <- aggregate(usage ~ total_stats, data = pokemon_usage_df_fixed, mean)
aggr_df %>%
filter() %>%
ggplot(aes(x = total_stats, y = usage)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The individual Pokemon that were used a lot skew the results heavily. However, the trend here seems to fit what we had assumed. The average usage of lower “total_stat” Pokemon is extremely low compared to the higher ones. We can try to regroup the data into stat groups to reduce the impact of individual Pokemon outliers. This categorization is because “total_stat” is discrete and not continuous. A single Pokemon with a 525 stat total and 50% usage does not automatically mean we should see a higher usage rate for a 530 total Pokemon.
test_combine <- pokemon_usage_df_fixed %>% reframe(stat_group = round(total_stats, digits = -2), name = name, usage = usage) %>% aggregate(usage ~ stat_group, mean)
test_combine
## stat_group usage
## 1 200 0.04548513
## 2 300 0.02637072
## 3 400 0.16347696
## 4 500 1.41415392
## 5 600 6.19699620
## 6 700 5.42428733
test_combine %>%
ggplot() +
geom_bar(aes(x = stat_group, y = usage), stat = 'identity') +
labs(title = "Pokemon usage by total stat group (rounded)")
Combine each Pokemon by total_stat groups, rounded to the nearest hundred. This means 550-649 will round to 600. Notably, this is now the highest average usage group. The correlation between “usage” and “total_stats” is much more apparent. The 600 and 700 total stat groups are by far the most represented in battles with 6% and 5% respectively. The 500 total stat group is down at 1.4% now, while the smaller groups average usage rates of well below 1%. 1.4% is still quite close to the 1.6% mean. It is possible there are some extraneous factors that prevent the highest “stat_group” from being the most used of all such as a ban list that would disallow usage of certain Pokemon during specific tournaments, but those variables were not collected in this data set.
The data set chosen is a list of different cheeses and properties about them such as taste, origin, and nutritional facts. The goal is to compare the cheeses by region and determine if there’s a trend in flavor.
Start by loading the data.
cheese_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/project2/cheese.csv'
cheese_raw <- fread(cheese_url)
head(cheese_raw)
## cheese url milk
## <char> <char> <char>
## 1: Aarewasser https://www.cheese.com/aarewasser/ cow
## 2: Abbaye de Belloc https://www.cheese.com/abbaye-de-belloc/ sheep
## 3: Abbaye de Belval https://www.cheese.com/abbaye-de-belval/ cow
## 4: Abbaye de Citeaux https://www.cheese.com/abbaye-de-citeaux/ cow
## 5: Abbaye de Tamié https://www.cheese.com/tamie/ cow
## 6: Abbaye de Timadeuc https://www.cheese.com/abbaye-de-timadeuc/ cow
## country region family type
## <char> <char> <char> <char>
## 1: Switzerland <NA> <NA> semi-soft
## 2: France Pays Basque <NA> semi-hard, artisan
## 3: France <NA> <NA> semi-hard
## 4: France Burgundy <NA> semi-soft, artisan, brined
## 5: France Savoie <NA> soft, artisan
## 6: France province of Brittany <NA> semi-hard
## fat_content calcium_content texture rind color
## <char> <char> <char> <char> <char>
## 1: <NA> <NA> buttery washed yellow
## 2: <NA> <NA> creamy, dense, firm natural yellow
## 3: 40-46% <NA> elastic washed ivory
## 4: <NA> <NA> creamy, dense, smooth washed white
## 5: <NA> <NA> creamy, open, smooth washed white
## 6: <NA> <NA> soft washed pale yellow
## flavor aroma vegetarian vegan
## <char> <char> <lgcl> <lgcl>
## 1: sweet buttery FALSE FALSE
## 2: burnt caramel lanoline TRUE FALSE
## 3: <NA> aromatic FALSE FALSE
## 4: acidic, milky, smooth barnyardy, earthy FALSE FALSE
## 5: fruity, nutty perfumed, pungent FALSE FALSE
## 6: salty, smooth nutty FALSE FALSE
## synonyms alt_spellings
## <char> <char>
## 1: <NA> <NA>
## 2: Abbaye Notre-Dame de Belloc <NA>
## 3: <NA> <NA>
## 4: <NA> <NA>
## 5: <NA> Tamié, Trappiste de Tamie, Abbey of Tamie
## 6: <NA> <NA>
## producers
## <char>
## 1: Jumi
## 2: <NA>
## 3: <NA>
## 4: <NA>
## 5: <NA>
## 6: Abbaye Cistercienne NOTRE-DAME DE TIMADEUC
As one of the main variables we are analyzing today, we want to see how regions look. Nearly 28% of cheeses do not have a region. There is also very little consistency regarding the way regions are listed. There are cities, states, multiple municipalities, actual regions, regions within states, contracted names, names with prepositions, etc.
## [1] 27.96967
# Sample country for missing data twice because apparently they are not listed as "NA"
sum(is.na(cheese_raw$country)) / nrow(cheese_raw) * 100
## [1] 0
## [1] 0.926706
A key to this analysis is that each “region” is going to be within a country and fortunately all of those have values… or do they? Actually, there are countries that are missing, but they are listed as empty strings instead of “NA”. Over 99% of countries are available making them a reliable starting point. Let’s group the countries together and see what regions are associated with each one.
aggr_country <- aggregate(region ~ country, cheese_raw, \(cheese_raw) paste(unique(cheese_raw), collapse = ", "))
head(aggr_country)
## country
## 1
## 2 Albania, Bulgaria, Croatia, Greece, Israel, Macedonia, Romania, Serbia
## 3 Australia
## 4 Austria
## 5 Belgium
## 6 Belgium, Germany, Netherlands
## region
## 1 Carrigtwohill, Cornwall, North East Victoria, Veneto, Inagh, Co Clare
## 2 Trakia
## 3 South Australia, Brisbane, Mornington Peninsula, Melbourne, Pokolbin, Hunter Valley, Gippsland, Victoria, Victoria, Ballarat, Victoria, Tasmania, North East Victoria, New South Wales, Hunter Valley
## 4 Vorarlberg, Bregenzerwald, Kleinwalsertal, Großwalsertal, Laiblachtal (Pfänderstock) and Rheintal, Sulzberg
## 5 Flanders, Wallonia, Passendale, Postel
## 6 Duchy of Limburg
Actually, the countries are not tidy either. There are multiple countries grouped together and sometimes a country has a sovereign state (England as a part of the United Kingdom). Without additional contextual information, there are multiple possible paths forward. One logical assumption is that countries or regions that are co-credited for a cheese are probably either similar culturally or are neighboring entities. The flaw in that assumption is that you could argue that there are similarities between the United States and Italy and between the United States and Mexico, but that does not transitively mean that Mexico and Italy are related.
From a quick check, we find that there are 6 instances where no region or country is provided. Since we’re trying to see the impact of location on the cheese’s properties, we will remove these cheeses as we are unable to analyze them without more context.
cheese_reduced <- cheese_raw %>% filter(!(cheese_raw$country == '' & is.na(cheese_raw$region)))
cheese_longest <- cheese_reduced %>% select(cheese, country, region, flavor) %>% separate_longer_delim(country, ', ') %>% separate_longer_delim(region, ', ') %>% separate_longer_delim(flavor, ', ')
head(cheese_longest)
## cheese country region flavor
## 1 Aarewasser Switzerland <NA> sweet
## 2 Abbaye de Belloc France Pays Basque burnt caramel
## 3 Abbaye de Belval France <NA> <NA>
## 4 Abbaye de Citeaux France Burgundy acidic
## 5 Abbaye de Citeaux France Burgundy milky
## 6 Abbaye de Citeaux France Burgundy smooth
Going over the resulting “cheese_longest” dataframe, we get a spoiler regarding the fact that this is the extent that we will be lengthening this data set today. Let’s see what impacts country or region have on flavor.
distinct_country_cheese <- cheese_longest %>% filter(country != '' & !is.na(flavor)) %>% select(cheese, country, flavor) %>% distinct()
sum_country_cheese <- distinct_country_cheese %>%
select(country, flavor) %>%
group_by(country, flavor) %>%
summarise(total = n(), across(everything(), first), .groups = 'drop')
head(sum_country_cheese %>% arrange(desc(total)))
## # A tibble: 6 × 3
## country flavor total
## <chr> <chr> <int>
## 1 United States sweet 100
## 2 United States creamy 89
## 3 United States nutty 77
## 4 United States tangy 76
## 5 United States buttery 65
## 6 United States mild 52
aggregate(total ~ country + flavor, data = sum_country_cheese, sum) %>% ggplot(aes(x = country, y = total, fill = flavor)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Cheese flavor trends per country') +
coord_flip()
distinct_region_cheese <- cheese_longest %>% filter(!is.na(region) & !is.na(flavor)) %>% select(cheese, region, flavor) %>% distinct()
sum_region_cheese <- distinct_region_cheese %>%
select(region, flavor) %>%
group_by(region, flavor) %>%
summarise(total = n(), across(everything(), first), .groups = 'drop')
head(sum_region_cheese %>% arrange(desc(total)))
## # A tibble: 6 × 3
## region flavor total
## <chr> <chr> <int>
## 1 Wisconsin sweet 31
## 2 Wisconsin nutty 21
## 3 Quebec creamy 20
## 4 Wisconsin creamy 19
## 5 Quebec buttery 18
## 6 Wisconsin mild 18
This data set pushes the limits of what can be reasonably viewed within a bar plot. It appears that Canada, Italy and the United States were especially well represented in terms of flavors of cheese. Notable trends from examining the top of the country specific data is the prevalence of sweet flavors which matches a common stereotype.
There were 380 unique region names still remaining in this modified data frame, so I saved us the trouble of trying to read that plot. The region table shows that Wisconsin produces a lot of sweet cheese as well. As an American state, this corroborates the implications from the previous table.
While the final dataframes were still not perfectly tidy, we were able to see notable consistent trends when analyzing the most recurring data within the cheese data set.
This data set is about breast cancer cases from 2020-2024 at Houston’s MD Anderson Cancer Center. Tidying the data will focus on making the data longer. The analysis goal will be to see how race demographics impact case totals.
Start by loading the data.
cancer_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/project2/Untidydataset%20_624.csv'
cancer_raw <- fread(cancer_url)
head(cancer_raw)
## V1 V2 V3 V4
## <char> <char> <char> <char>
## 1: Untidy Breast Cancer Dataset (2020-2024)
## 2: Patient_ID Month Age_Group Race
## 3: P001 January 30-39 White
## 4: P002 January 40-49 Black
## 5: P003 January 50-59 Hispanic
## 6: P004 February 30-39 Asian
## V5 V6 V7 V8 V9 V10
## <char> <char> <char> <char> <char> <char>
## 1:
## 2: Cases_2020 Cases_2021 Cases_2022 Cases_2023 Cases_2024 Total_Patients
## 3: 50 55 60 65 70 300
## 4: 90 95 100 105 120 510
## 5: 80 85 88 90 110 453
## 6: 70 75 80 85 90 400
## V11
## <char>
## 1:
## 2: Type
## 3: Invasive Ductal
## 4: Lobular
## 5: Triple-Negative
## 6: HER2+
This data set was converted from xlsx to csv type to streamline it with the other data sets. Inadvertently, this has led to the form of the data getting untidier than before. We will need to remove the first two rows and make row 3 the new column names.
names(cancer_raw) <- unlist(cancer_raw[2])
cancer_fixed_col <- cancer_raw[-c(1:2),]
head(cancer_fixed_col)
## Patient_ID Month Age_Group Race Cases_2020 Cases_2021 Cases_2022
## <char> <char> <char> <char> <char> <char> <char>
## 1: P001 January 30-39 White 50 55 60
## 2: P002 January 40-49 Black 90 95 100
## 3: P003 January 50-59 Hispanic 80 85 88
## 4: P004 February 30-39 Asian 70 75 80
## 5: P005 February 40-49 White 110 115 120
## 6: P006 March 50-59 Black 130 135 140
## Cases_2023 Cases_2024 Total_Patients Type
## <char> <char> <char> <char>
## 1: 65 70 300 Invasive Ductal
## 2: 105 120 510 Lobular
## 3: 90 110 453 Triple-Negative
## 4: 85 90 400 HER2+
## 5: 125 140 610 Inflammatory
## 6: 145 160 710 Invasive Ductal
With the column names now correctly assigned, let’s remove “total_patients” as it will be useless for us after we lengthen the data. Then we will use the “melt” function to extract the years into a column with the values being another column. Clean up the new “Year” column by removing “Cases_” and making the resulting years into numeric values.
## Patient_ID Month Age_Group Race Cases_2020 Cases_2021 Cases_2022
## <char> <char> <char> <char> <char> <char> <char>
## 1: P001 January 30-39 White 50 55 60
## 2: P002 January 40-49 Black 90 95 100
## 3: P003 January 50-59 Hispanic 80 85 88
## 4: P004 February 30-39 Asian 70 75 80
## 5: P005 February 40-49 White 110 115 120
## 6: P006 March 50-59 Black 130 135 140
## Cases_2023 Cases_2024 Type
## <char> <char> <char>
## 1: 65 70 Invasive Ductal
## 2: 105 120 Lobular
## 3: 90 110 Triple-Negative
## 4: 85 90 HER2+
## 5: 125 140 Inflammatory
## 6: 145 160 Invasive Ductal
cancer_melt <- cancer_fixed_col %>% melt(id.vars = c(colnames(cancer_fixed_col)[1:4],colnames(cancer_fixed_col)[10]), variable.name = 'Year', value.name = 'Cases')
head(cancer_melt)
## Patient_ID Month Age_Group Race Type Year Cases
## <char> <char> <char> <char> <char> <fctr> <char>
## 1: P001 January 30-39 White Invasive Ductal Cases_2020 50
## 2: P002 January 40-49 Black Lobular Cases_2020 90
## 3: P003 January 50-59 Hispanic Triple-Negative Cases_2020 80
## 4: P004 February 30-39 Asian HER2+ Cases_2020 70
## 5: P005 February 40-49 White Inflammatory Cases_2020 110
## 6: P006 March 50-59 Black Invasive Ductal Cases_2020 130
cancer_melt <- cancer_melt %>% mutate(Year = as.numeric(str_remove(Year, 'Cases_')))
cancer_melt <- cancer_melt %>% mutate(Cases = as.numeric(Cases))
head(cancer_melt)
## Patient_ID Month Age_Group Race Type Year Cases
## <char> <char> <char> <char> <char> <num> <num>
## 1: P001 January 30-39 White Invasive Ductal 2020 50
## 2: P002 January 40-49 Black Lobular 2020 90
## 3: P003 January 50-59 Hispanic Triple-Negative 2020 80
## 4: P004 February 30-39 Asian HER2+ 2020 70
## 5: P005 February 40-49 White Inflammatory 2020 110
## 6: P006 March 50-59 Black Invasive Ductal 2020 130
With the data better formatted, let’s take advantage of having the cases and year properly split. Let’s see how many cancer cases each race had from 2020-2024.
## Race Year Cases
## 1 Asian 2020 140
## 2 Black 2020 310
## 3 Hispanic 2020 300
## 4 White 2020 220
## 5 Asian 2021 150
## 6 Black 2021 325
## 7 Hispanic 2021 315
## 8 White 2021 235
## 9 Asian 2022 158
## 10 Black 2022 340
## 11 Hispanic 2022 328
## 12 White 2022 250
## 13 Asian 2023 165
## 14 Black 2023 355
## 15 Hispanic 2023 340
## 16 White 2023 265
## 17 Asian 2024 190
## 18 Black 2024 400
## 19 Hispanic 2024 390
## 20 White 2024 295
aggregate(Cases ~ Race + Year, data = cancer_melt, sum) %>% ggplot(aes(x = Year, y = Cases, fill = Race)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Cancer rate from 2020-2024 by race')
Nothing looks out of place in this chart. The total cancer cases for each race increased gradually each year. This trend did not affect the ratio of cancer rates for each race against each other. Asians had the lowest rate while Blacks had the highest.
Not all untidy data look the same. Many similar techniques were used to tidy each data set, but different nuances were still required.
The Pokemon data set was a good candidate for a wide to long transformation. Since the goal was to compare “total_stats” and “usage” to determine if there was a correlation, the focus was on tidying the “usage” column. A choice was made to remove any data listed as “NoUsage” because through further analysis, there was not enough information to decide if a technique like imputation was better served. The need to analyze the data before determining a tidying strategy was a unique feature of this data set. Further work on the data could include breaking down the formats into specific years and tournaments to see time series trends.
The hardest data set to work with was the cheese data as without more context beyond what the data set provided, it was nearly impossible to account for every exception. Concessions had to be made when tidying because too many manual fixes and assumptions needed to be made. A good lesson from this work was that despite the issues, trend data still stood out. As more nuanced trends would be hard to detect, further work would be introducing geographic data sets that can help with cleaning and categorizing the countries and regions.
The main takeaway from working with the cancer data set was the importance of practice. It was the last data set to be tidied and that felt evident as many patterns were quicker to find. Further work could be to incorporate the other predictors.