DATA607_Project2

Introduction

In this assignment we select three data sets from the Week 5 discussions, clean them up and do the recommended analysis

1) Glassdoor job listings for data scientists - Isaias Soto

This data set is relevant to our field and looks at how the data field job market is across different companies and locations. It has some good information to work with once its cleaned up.

data <- read.csv("https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week6Assignment/Uncleaned_DS_jobs.csv")
glimpse(data)

## Rows: 672
## Columns: 15
## $ index             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Job.Title         <chr> "Sr Data Scientist", "Data Scientist", "Data Scienti…
## $ Salary.Estimate   <chr> "$137K-$171K (Glassdoor est.)", "$137K-$171K (Glassd…
## $ Job.Description   <chr> "Description\n\nThe Senior Data Scientist is respons…
## $ Rating            <dbl> 3.1, 4.2, 3.8, 3.5, 2.9, 4.2, 3.9, 3.5, 4.4, 3.6, 4.…
## $ Company.Name      <chr> "Healthfirst\n3.1", "ManTech\n4.2", "Analysis Group\…
## $ Location          <chr> "New York, NY", "Chantilly, VA", "Boston, MA", "Newt…
## $ Headquarters      <chr> "New York, NY", "Herndon, VA", "Boston, MA", "Bad Ra…
## $ Size              <chr> "1001 to 5000 employees", "5001 to 10000 employees",…
## $ Founded           <int> 1993, 1968, 1981, 2000, 1998, 2010, 1996, 1990, 1983…
## $ Type.of.ownership <chr> "Nonprofit Organization", "Company - Public", "Priva…
## $ Industry          <chr> "Insurance Carriers", "Research & Development", "Con…
## $ Sector            <chr> "Insurance", "Business Services", "Business Services…
## $ Revenue           <chr> "Unknown / Non-Applicable", "$1 to $2 billion (USD)"…
## $ Competitors       <chr> "EmblemHealth, UnitedHealth Group, Aetna", "-1", "-1…

Out of the 15 columns, we are mainly interested in the the title, salary estimate, location, industry and sector, to perform the recommended analysis. We can split the salary estimate that is in the format “$137K-$171K (Glassdoor est.)”
Into Minimum salary , Maximum salary and even find an average salary. We can also split the Location into City and State and calculate the age of the company from its Founded date

jobs <-  data %>%
  select(Job.Title, Salary.Estimate, Location, Founded, Sector)

jobs <- jobs %>%
  mutate(
    min_salary = as.numeric(str_extract(Salary.Estimate,  "([0-9]+)")), 
    max_salary = as.numeric(str_extract(Salary.Estimate, "(?<=-\\$)([0-9]+)")), 
    avg_salary = (min_salary + max_salary) / 2,  
    company_age = 2025 - Founded , 
  ) %>% 
  separate(Location, into = c("city", "state"), sep = ", ", remove = FALSE) %>%
  select(-Salary.Estimate, -Location,  -Founded)

now we can plot how the average salary changes by state

jobs_summary <- jobs %>%
  group_by(state) %>%
  summarise(avg_salary = mean(avg_salary, na.rm = TRUE)) %>%
  arrange(desc(avg_salary))  
jobs_de <- jobs %>% 
  filter(state == "DE")


ggplot(head(jobs_summary, 10), aes(x = reorder(state, -avg_salary), y = avg_salary, fill = state)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Salary by State",
    x = "State",
    y = "Average Salary (USD)"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Since DE only has one entry with a high salary job, it does far better than every other state. This is not indicative of the general job market ceiling. For example from personal experience New York should be up there in high paying jobs. But the state also has a lot of entry level jobs that would average it out to not being top 10. To maybe help decide where to move for a better pay we could shorten it down to the top 5 average salaries of each state to get a better picture of the highs.

jobs_summary_top5 <- jobs %>%
  group_by(state) %>%
  arrange(state, desc(avg_salary)) %>%
  slice_head(n = 5) %>%
  summarise(avg_top5_salary = mean(avg_salary, na.rm = TRUE)) %>%
  arrange(desc(avg_top5_salary))  
ggplot(head(jobs_summary_top5,10), aes(x = reorder(state, -avg_top5_salary), y = avg_top5_salary, fill = state)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average of Top 5 Salaries by State",
    x = "State",
    y = "Average Salary (Top 5)"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now we get a much better picture of where the top paying jobs in the field of data science are centered around. And as expected NY is right after DE and DC , then followed by VA, MA and CA the silicon valley of the country.

Isaias was also interested in discovering a relationship between company sectors and how the average salaries compare. We could use another box plot for this to show the distribution of various salaries across different sectors.

ggplot(jobs, aes(x = Sector, y = avg_salary)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Company Sector",
    x = "Company Sector",
    y = "Average Salary (in 1000 USD)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot gives us great insights into the distribution of salaries. While every sector has some high paying outliers, the consumer sector is clear ahead with their entire range falling above most other sectors. This is to be expected as data scientists are being increasingly employed by online marketplaces to analyse the vast amount of consumer data they have to market better.

2) MDA Cancer Centre Data - Woodelyne Durosier

This is a data set of breast cancer incidence rates in Houston across different racial and age demographics across the 4 year span between 2020 and 2024

data2 <- read.csv("https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week6Assignment/Uncleaned_MD_cancer.csv")
glimpse(data2)

## Rows: 14
## Columns: 11
## $ Patient_ID     <chr> "P001", "P002", "P003", "P004", "P005", "P006", "P007",…
## $ Month          <chr> "January", "January", "January", "February", "February"…
## $ Age_Group      <chr> "30-39", "40-49", "50-59", "30-39", "40-49", "50-59", "…
## $ Race           <chr> "White", "Black", "Hispanic", "Asian", "White", "Black"…
## $ Cases_2020     <int> 50, 90, 80, 70, 110, 130, 100, 70, 60, 90, 120, NA, NA,…
## $ Cases_2021     <int> 55, 95, 85, 75, 115, 135, 105, 75, 65, 95, 125, NA, NA,…
## $ Cases_2022     <int> 60, 100, 88, 80, 120, 140, 110, 78, 70, 100, 130, NA, N…
## $ Cases_2023     <int> 65, 105, 90, 85, 125, 145, 115, 80, 75, 105, 135, NA, N…
## $ Cases_2024     <int> 70, 120, 110, 90, 140, 160, 130, 100, 85, 120, 150, NA,…
## $ Total_Patients <int> 300, 510, 453, 400, 610, 710, 560, 403, 355, 510, 660, …
## $ Type           <chr> "Invasive Ductal", "Lobular", "Triple-Negative", "HER2+…

As mentioned in the discussion, columns Cases_2020 through 2024 can be pivoted into two columns Year: holding the year number and Cases: holding the number of cases. We can also drop irrelevant columns and focus on the Age groups, Race, Year and the type of cancer.

cancer_data_tidy <- data2 %>%

  pivot_longer(cols = starts_with("Cases"), 
               names_to = "Year",           
               values_to = "Cases",         
               names_prefix = "Cases_",   
               values_drop_na = TRUE) %>%  
  select( Age_Group, Race, Year, Cases, Type)

head(cancer_data_tidy)

## # A tibble: 6 × 5
##   Age_Group Race  Year  Cases Type           
##   <chr>     <chr> <chr> <int> <chr>          
## 1 30-39     White 2020     50 Invasive Ductal
## 2 30-39     White 2021     55 Invasive Ductal
## 3 30-39     White 2022     60 Invasive Ductal
## 4 30-39     White 2023     65 Invasive Ductal
## 5 30-39     White 2024     70 Invasive Ductal
## 6 40-49     Black 2020     90 Lobular

Now to analyse this data we could plot how cancer cases have evolved over the years for different age groups and for different races. This can help us identify any obvious trends when it comes to certain age groups or races.

ggplot(cancer_data_tidy, aes(x = Year, y = Cases, fill = Race)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(
    title = "Cancer Cases by Race and Year",
    x = "Year",
    y = "Number of Cases",
    fill = "Race"
  )

Here when plotting the cases by year and races, we can see that cancer cases have been linearly rising across all races. We can also observe that the cases have been lowest in the Asian community, followed by white, Hispanic and finally the most incidences seen in the black community. This might be indicative of certain genetic markers that predict cancer occurrences, or lifestyle changes between the races that might be conducive to breast cancer.

Our next suggestion was to analyse the patterns between age groups and breast cancer incidences in the women of Houtson. For this we can use a line and point plot as an extremely simple and easy to understand ggplot to that is able to identify relationships with relative ease.

ggplot(cancer_data_tidy, aes(x = Year, y = Cases, color = Age_Group, group = Age_Group)) +
  geom_line() +
  geom_point() +
  labs(title = "Cancer Cases by Age Group and Year",
       x = "Year",
       y = "Number of Cases")

in this plot each vertical line is an age group, with points on the line denoting each of the races. Plotting across the ages also shows us the highest number of cases being reported in the age group of 50s followed by the 60s, 40s , 70s and then the 30s.Here also we can see the number of cases of cancer rising through the years in an almost linear fashion.

3) Pokemon Usage Rates 2022-2024 for Smogon and Worlds VGC - Lawrence Yu

This data set contains information on usage statistics of Pokemon used in 2 separate tournaments Smogon and Worlds VGC across 2 years. Here as mentioned the Columns for the tournament data are written in a wide format like Smogon_VGC_Usage_2022 which includes the name of the tournament and the year. This can be pivoted longer to Tournament and Year columns to tidy up the data.

data3 <- read.csv("https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/project2/pokemon_competitive_analysis.csv")

data3_tidy <- data3 %>%
  pivot_longer(cols = c(Smogon_VGC_Usage_2022, Smogon_VGC_Usage_2023, Smogon_VGC_Usage_2024,
                        Worlds_VGC_Usage_2022, Worlds_VGC_Usage_2023, Worlds_VGC_Usage_2024),
               names_to = c("Tournament", "Year"),
               names_pattern = "(.*)_(\\d+)",
               values_to = "Usage") %>%
  mutate(
    Tournament = gsub("_Usage", "", Tournament),  # Remove '_Usage'
    Tournament = gsub("_VGC", " VGC", Tournament),  # Replace '_VGC' with ' VGC'
    Usage = as.numeric(Usage)  # turn data showing "NoUsage" to NA
  )

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Usage = as.numeric(Usage)`.
## Caused by warning:
## ! NAs introduced by coercion

glimpse(data3_tidy)

## Rows: 7,818
## Columns: 20
## $ index          <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3…
## $ name           <chr> "bulbasaur", "bulbasaur", "bulbasaur", "bulbasaur", "bu…
## $ type1          <chr> "grass", "grass", "grass", "grass", "grass", "grass", "…
## $ type2          <chr> "poison", "poison", "poison", "poison", "poison", "pois…
## $ ability1       <chr> "overgrow", "overgrow", "overgrow", "overgrow", "overgr…
## $ ability2       <chr> "No_ability", "No_ability", "No_ability", "No_ability",…
## $ hidden_ability <chr> "chlorophyll", "chlorophyll", "chlorophyll", "chlorophy…
## $ hp             <int> 45, 45, 45, 45, 45, 45, 60, 60, 60, 60, 60, 60, 80, 80,…
## $ attack         <int> 49, 49, 49, 49, 49, 49, 62, 62, 62, 62, 62, 62, 82, 82,…
## $ defense        <int> 49, 49, 49, 49, 49, 49, 63, 63, 63, 63, 63, 63, 83, 83,…
## $ sp_atk         <int> 65, 65, 65, 65, 65, 65, 80, 80, 80, 80, 80, 80, 100, 10…
## $ sp_def         <int> 65, 65, 65, 65, 65, 65, 80, 80, 80, 80, 80, 80, 100, 10…
## $ speed          <int> 45, 45, 45, 45, 45, 45, 60, 60, 60, 60, 60, 60, 80, 80,…
## $ total_stats    <int> 318, 318, 318, 318, 318, 318, 405, 405, 405, 405, 405, …
## $ legendary      <chr> "False", "False", "False", "False", "False", "False", "…
## $ mythical       <chr> "False", "False", "False", "False", "False", "False", "…
## $ generation     <chr> "generation-i", "generation-i", "generation-i", "genera…
## $ Tournament     <chr> "Smogon VGC", "Smogon VGC", "Smogon VGC", "Worlds VGC",…
## $ Year           <chr> "2022", "2023", "2024", "2022", "2023", "2024", "2022",…
## $ Usage          <dbl> 0.00000, NA, 0.00000, NA, NA, NA, 0.00000, NA, 0.00000,…

In the discussion, the suggested analysis was to see the relationship between usage statistics and total stats . I would like to focus on the top 50 used Pokemon from both tournaments in the year 2024 and see how their total_stats relate to their usage

smogon_top50 <- data3_tidy %>%
  filter(Year == "2024") %>%
  filter(Tournament == "Smogon VGC") %>%
  arrange(desc(Usage)) %>% 
  head(50)  

head(smogon_top50 %>% select(name, Usage))

## # A tibble: 6 × 2
##   name               Usage
##   <chr>              <dbl>
## 1 flutter-mane        59.4
## 2 incineroar          36.8
## 3 ogerpon-wellspring  35.1
## 4 tornadus-incarnate  26.2
## 5 raging-bolt         26.1
## 6 urshifu-rapid       25.6

worlds_top50 <- data3_tidy %>%
    filter(Year == "2024") %>%
  filter(Tournament == "Worlds VGC") %>%
  arrange(desc(Usage)) %>%  
  head(50)  

tourneys_data_combined <- bind_rows(smogon_top50, worlds_top50)

ggplot(tourneys_data_combined, aes(x = total_stats, y = Usage)) +
  geom_point(aes(color = Tournament)) + 
  labs(title = "Top 100 most used pokemon (2024) : Usage vs Total_stats",
       x = "Total Stats",
       y = "Usage",
       color = "Tournament")

This scatter plot tells us that the total_stats of the popularly used Pokemon all lie better 250 and 600. There also seems to be pretty defined line at 600 for both tournaments which lead me to believe that there might be a total_stats cutoff for these tournaments. But we also see some outliers towards the 700 but only for the Smogon VGC. Comparing the two tournaments, the players in Worlds tend to pick Pokemon with lower total stats than those playing Smogon. Players in Smogon also seem to have a clear favorite sitting all the way up by itself at 59.6, flutter-mane, a dual type ghost-fairy paradox Pokemon.

We can also plot the types of the most common types of Pokemon used with how they compare to each other.

type_colors <- c(
  ghost = "purple",
  fighting = "red",
  fire = "orange",
  grass = "darkgreen",
  steel = "lightgrey",
  flying = "cyan",
  electric = "yellow",
  dragon = "orange",
  psychic = "pink",
  water = "blue",
  ground = "brown",
  ice = "lightblue",
  rock = "black",
  bug = "magenta",
  fairy = "lightpink"
  )

## Order by descending highest usage Pokemon within each type 
tourneys_data_combined_desc <- tourneys_data_combined %>%
  group_by(type1) %>%
  mutate(max_usage = max(Usage)) %>% 
  ungroup() %>%
  mutate(type1 = factor(type1, levels = names(sort(tapply(max_usage, type1, max), decreasing = TRUE)))) %>%
  select(-max_usage)  

# Plotting types against the usage 
ggplot(tourneys_data_combined_desc, aes(x = type1, y = Usage, color = type1)) +
  geom_point(size = 3) + 
  scale_color_manual(values = type_colors) + 
  labs(title = "Top 100 Most Used Pokémon 2024: Usage vs Main Type",
       x = "Pokémon Type",
       y = "Usage",
       color = "Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

Here again we see Flutter-Mane making sure that the ghost type is way ahead of everyone else. Its followed closely by the fighting types , which makes sense for a tournament, and then my personal favorite when I was a kid, the fire types and their rivals the grass type.

Conclusion

We were able to clean up all three data sets, convert them into the standard from of observation rows and variable columns , and extract relevant insights from the data after plotting the important variables against each other.