I used the research_funding_rates data set from dslabs. According to dslabs, it shows gender bias in research funding in the Netherlands. That description is slightly misleading, since the data does not contain funding amounts; it contains applications, awards, and success rates.
# Load the libraries and look for all the data sets in dslabs.
library(dslabs)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)  # already attached by tidyverse
data(package = "dslabs")
# Decide which data set to use and get its summary.
data("research_funding_rates")
summary(research_funding_rates)
discipline applications_total applications_men applications_women
Length:9 Min. : 76.0 Min. : 67.0 Min. : 9
Class :character 1st Qu.:174.0 1st Qu.:105.0 1st Qu.: 39
Mode :character Median :251.0 Median :156.0 Median : 78
Mean :313.7 Mean :181.7 Mean :132
3rd Qu.:396.0 3rd Qu.:230.0 3rd Qu.:166
Max. :834.0 Max. :425.0 Max. :409
awards_total awards_men awards_women success_rates_total
Min. : 20.00 Min. :12.00 Min. : 2.00 Min. :13.4
1st Qu.: 32.00 1st Qu.:22.00 1st Qu.:10.00 1st Qu.:15.8
Median : 43.00 Median :30.00 Median :17.00 Median :17.1
Mean : 51.89 Mean :32.22 Mean :19.67 Mean :18.9
3rd Qu.: 65.00 3rd Qu.:38.00 3rd Qu.:29.00 3rd Qu.:20.1
Max. :112.00 Max. :65.00 Max. :47.00 Max. :26.3
success_rates_men success_rates_women
Min. :11.4 Min. :11.20
1st Qu.:15.3 1st Qu.:14.30
Median :18.8 Median :21.00
Mean :19.2 Mean :18.89
3rd Qu.:24.4 3rd Qu.:22.20
Max. :26.9 Max. :25.60
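The summary lists applications, awards, and success rates, but no funding amounts. As a rough sketch, the success rate columns can be compared against a rate computed from the counts. This assumes the success rates are the percentage of applications that received an award, which dslabs does not state here; `computed_rate_total` and `rate_check` are names introduced only for this illustration.

```r
library(dslabs)
library(dplyr)

data("research_funding_rates")

# Assumption: success rate = 100 * awards / applications, rounded to one decimal.
rate_check <- research_funding_rates %>%
  transmute(
    discipline,
    computed_rate_total = round(100 * awards_total / applications_total, 1),
    success_rates_total
  )

# Show the computed and reported rates side by side.
rate_check
```

If the two rate columns agree, the reported success rates are derived from the counts and add no separate funding information.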
# Create a new column with the difference in success rates for each discipline.
# Since this is men - women, a value above 0 is labelled "Men"; anything else
# (including a difference of exactly 0) is labelled "Women".
data <- research_funding_rates %>%
  mutate(rate_difference = success_rates_men - success_rates_women,
         gender = ifelse(rate_difference > 0, "Men", "Women"))
# Add a named column with the absolute difference in success rates,
# so that negative values become positive.
data1 <- data %>%
  mutate(abs_rate_difference = abs(rate_difference))

# Point graph: color shows the gender with the higher success rate,
# size shows how large the difference is.
ggplot(data1, aes(x = awards_total, y = discipline, color = gender, size = abs_rate_difference)) +
  geom_point() +
  # Titles for the graph and each axis, plus the caption for the data set.
  labs(title = "Gender difference between disciplines",
       x = "Total Awards",
       caption = "Data found on DSlabs",
       y = "Disciplines") +
  theme_minimal() +
  # Change the size of the title and the axis titles, and make the title bold.
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.title = element_text(size = 8)) +
  # Use the Set1 color palette for gender.
  scale_color_brewer(palette = "Set1") +
  # Rename the size legend.
  guides(size = guide_legend(title = "Dominance difference"))
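One detail of the gender label used for the color: with `ifelse(rate_difference > 0, "Men", "Women")`, a discipline where both success rates are equal would be labelled "Women". A minimal sketch of one way to handle ties explicitly, using `case_when()` on made-up values (the `toy` data frame and the "Equal" label are hypothetical, not part of research_funding_rates):

```r
library(dplyr)

# Hypothetical toy values for illustration only.
toy <- data.frame(
  discipline = c("A", "B", "C"),
  success_rates_men = c(20.0, 15.0, 18.0),
  success_rates_women = c(15.0, 22.0, 18.0)
)

labelled <- toy %>%
  mutate(
    rate_difference = success_rates_men - success_rates_women,
    # Label by the sign of the difference, with a separate label for ties.
    gender = case_when(
      rate_difference > 0 ~ "Men",
      rate_difference < 0 ~ "Women",
      TRUE ~ "Equal"
    )
  )

labelled$gender
```

Whether a third label is worth it depends on the data; if no discipline has a difference of exactly 0, the `ifelse()` version gives the same result.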
This dataset is very easy to work with since it does not have any NAs, so little cleaning was needed. The main problem I had was deciding how to show the data, since some values do not say much on their own. So the biggest “cleaning” I had to do was computing the difference in success rates and taking its absolute value, so it could be mapped to point size in the graph.
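For reference, a quick way to confirm that a data set has no missing values is sketched below with base R functions; `any_missing` and `missing_per_column` are names chosen only for this example.

```r
library(dslabs)

data("research_funding_rates")

# TRUE if any value in the data frame is NA.
any_missing <- anyNA(research_funding_rates)

# Number of NA values in each column.
missing_per_column <- colSums(is.na(research_funding_rates))

any_missing
missing_per_column
```

A count of zero in every column means no rows need to be dropped or imputed before plotting.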