DS Labs HW

Author

Arthur De Almeida

DSlabs Homework

I used the research_funding_rates data set from dslabs. According to dslabs, it shows gender bias in research funding in the Netherlands. That description is slightly misleading, since the data does not record funding amounts; it records applications, awards, and success rates by gender.

#Loading the library and looking for all the data sets in dslabs.
library(dslabs)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

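data(package = "dslabs") #Listing every data set in dslabs. dplyr is already attached by tidyverse, so it does not need to be loaded again.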

#Deciding which dataset to use and finding its summary.
data("research_funding_rates")
summary(research_funding_rates)

  discipline        applications_total applications_men applications_women
 Length:9           Min.   : 76.0      Min.   : 67.0    Min.   :  9       
 Class :character   1st Qu.:174.0      1st Qu.:105.0    1st Qu.: 39       
 Mode  :character   Median :251.0      Median :156.0    Median : 78       
                    Mean   :313.7      Mean   :181.7    Mean   :132       
                    3rd Qu.:396.0      3rd Qu.:230.0    3rd Qu.:166       
                    Max.   :834.0      Max.   :425.0    Max.   :409       
  awards_total      awards_men     awards_women   success_rates_total
 Min.   : 20.00   Min.   :12.00   Min.   : 2.00   Min.   :13.4       
 1st Qu.: 32.00   1st Qu.:22.00   1st Qu.:10.00   1st Qu.:15.8       
 Median : 43.00   Median :30.00   Median :17.00   Median :17.1       
 Mean   : 51.89   Mean   :32.22   Mean   :19.67   Mean   :18.9       
 3rd Qu.: 65.00   3rd Qu.:38.00   3rd Qu.:29.00   3rd Qu.:20.1       
 Max.   :112.00   Max.   :65.00   Max.   :47.00   Max.   :26.3       
 success_rates_men success_rates_women
 Min.   :11.4      Min.   :11.20      
 1st Qu.:15.3      1st Qu.:14.30      
 Median :18.8      Median :21.00      
 Mean   :19.2      Mean   :18.89      
 3rd Qu.:24.4      3rd Qu.:22.20      
 Max.   :26.9      Max.   :25.60
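
The column names suggest that each success rate is the percentage of applications that received an award. A minimal sketch for checking that reading on the overall rate, assuming `success_rates_total` is `awards_total` divided by `applications_total`, times 100 (an assumption, not something the summary states). The names `rate_check` and `recomputed_total` are made up for this example.

```r
library(dslabs)
library(dplyr)

data("research_funding_rates")

# Recompute the overall success rate from the raw counts and place it
# next to the published column so the two can be compared row by row.
# ASSUMPTION: success rate = 100 * awards / applications.
rate_check <- research_funding_rates %>%
  mutate(recomputed_total = round(100 * awards_total / applications_total, 1)) %>%
  select(discipline, applications_total, awards_total,
         success_rates_total, recomputed_total)

rate_check
```

If the two rate columns agree, the data set really describes how often applications succeed, not how much money was awarded.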

#Creating two new columns: the difference in success rates for each discipline and the gender with the higher rate.
data <- research_funding_rates %>%
  mutate(rate_difference = success_rates_men - success_rates_women,
         gender = ifelse(rate_difference > 0, "Men", "Women")) #Since this is men - women, a value above 0 outputs "Men" and anything else outputs "Women".

data1 <- data %>%
  mutate(abs_rate_difference = abs(rate_difference)) #Creating a named column that turns all negative values positive for the difference in success rates.

ggplot(data1, aes(x = awards_total, y = discipline, color = gender, size = abs_rate_difference)) +
  geom_point() +  #Creating a point graph with color for the dominant gender and size for the difference in success rates.
  labs(
    title = "Gender difference between disciplines",
    x = "Total Awards",
    caption = "Data found on DSlabs",
    y = "Disciplines"
  ) + #Setting the graph title, the axis titles, and the caption for the data set.
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    axis.title = element_text(size = 8)
  ) + #Changing the size of the title and the axis titles and making the title bold.
  scale_color_brewer(palette = "Set1") +  #Using the Set1 color palette for gender.
  guides(size = guide_legend(title = "Dominance difference")) #Renaming the size legend on the graph.
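
Point size makes it hard to read exact values from the graph, so a small table can go alongside it. This is only a sketch of one way to do it. It recomputes the same columns as above and sorts the disciplines by the size of the gap. The names `gap_table` and `abs_rate_difference` are chosen for this example.

```r
library(dslabs)
library(dplyr)

data("research_funding_rates")

# Same derived columns as the plot: signed gap, favored gender, absolute gap.
gap_table <- research_funding_rates %>%
  mutate(
    rate_difference = success_rates_men - success_rates_women,
    gender = ifelse(rate_difference > 0, "Men", "Women"),
    abs_rate_difference = abs(rate_difference)
  ) %>%
  select(discipline, awards_total, gender, abs_rate_difference) %>%
  arrange(desc(abs_rate_difference)) # largest gap first

gap_table
```

Each row matches one point in the graph: `gender` is the color and `abs_rate_difference` is the size.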

This dataset was very easy to work with since it does not have any NAs, so little cleaning was needed. The main problem I had was deciding how to show the data, since some of the values don’t show much on their own. The biggest “cleaning” step was computing the difference in success rates and removing the negative sign so it could be mapped to point size in the graph.
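
The claim about missing values can be checked directly. A short sketch using base R only. The names `has_missing` and `missing_per_column` are made up for this example.

```r
library(dslabs)

data("research_funding_rates")

# TRUE if any cell in the data frame is NA
has_missing <- anyNA(research_funding_rates)
has_missing

# Number of missing values in each column
missing_per_column <- colSums(is.na(research_funding_rates))
missing_per_column
```

A count of zero in every column would confirm that no rows need to be dropped or filled in before plotting.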