Creating a New Multivariable Graph Using a dslabs Dataset
I used the “Research Funding Rates” dataset from the dslabs package. It has 9 observations (one per discipline) and 10 variables: discipline, total applications, applications by men, applications by women, total awards, awards for men, awards for women, total success rate, men’s success rate, and women’s success rate.
To analyze the dataset, I installed the dslabs package, loaded the data in RStudio, and saved a copy on my computer.
I converted the dataset into long format and created one new variable, gender.
# load the tidyverse, which provides dplyr, tidyr, and ggplot2 used below
library(tidyverse)

# pivot the dataset to long format and add gender variable
research_long1 <- research_funding_rates_nv %>%
  pivot_longer(cols = c(applications_men, applications_women),
               names_to = "application_type",
               values_to = "applications",
               names_pattern = "applications_(.*)") %>%
  pivot_longer(cols = c(awards_men, awards_women),
               names_to = "awards_type",
               values_to = "awards",
               names_pattern = "awards_(.*)") %>%
  pivot_longer(cols = c(success_rates_men, success_rates_women),
               names_to = "success_rate_type",
               values_to = "success_rate",
               names_pattern = "success_rates_(.*)") %>%
  pivot_longer(cols = c(awards_rate_men, awards_rate_women),
               names_to = "awards_rate_type",
               values_to = "awards_rate",
               names_pattern = "awards_rate_(.*)") %>%
  mutate(gender = case_when(
    application_type == "men" ~ "Male",
    application_type == "women" ~ "Female"
  )) %>%
  # keep only rows where all four pivots refer to the same gender
  filter(application_type == awards_type &
           awards_type == success_rate_type &
           success_rate_type == awards_rate_type) %>%
  select(discipline, gender, applications, awards, success_rate, awards_rate) %>%
  filter(gender %in% c("Male", "Female")) # filter out total lines and other categories
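The chained pivots above work, but each one multiplies the rows before the final filter trims them back. As an alternative, here is a minimal sketch of a single `pivot_longer()` call that uses the `.value` sentinel to split each column name into a measure and a gender. The `wide` table and its numbers are made up purely for illustration and are not taken from the dataset.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide table; the values are invented for illustration.
wide <- tibble(
  discipline         = c("A", "B"),
  applications_men   = c(100, 50),
  applications_women = c(80, 40),
  awards_men         = c(20, 10),
  awards_women       = c(12, 8)
)

# ".value" tells pivot_longer() that the first regex group names the output
# column (applications, awards) and the second group fills the gender column.
long <- wide %>%
  pivot_longer(
    cols          = -discipline,
    names_to      = c(".value", "gender"),
    names_pattern = "(.*)_(men|women)$"
  ) %>%
  mutate(gender = if_else(gender == "men", "Male", "Female"))

long
```

With this approach every discipline yields exactly two rows, one per gender, so no matching filter is needed afterwards. Any total columns would have to be dropped or excluded from `cols` first, since their names do not match the pattern.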
Additionally, the total number of successes was calculated from success_rate and the number of applications. The new long dataset has 18 observations and 7 variables (discipline, gender, applications, success, awards, success_rate, awards_rate). The percentage values for the award and success rates are converted into proportions.
# calculate success_total and create new variable
research_long1$success <- research_long1$success_rate / 100 * research_long1$applications
I checked the cleaned dataset using the ‘head’ function.
# check out the first few lines
head(research_long1, 10)
From the summary table, the dataset comprising 18 observations and 7 variables has no missing values or duplicates. The variables represent different aspects of research funding, including the discipline, gender, applications, awards, success rate, and award rate.
The number of applications varies widely, with an average of 156.8 applications (standard deviation of 119.5). The minimum number of applications is 9, the median is 130.5, and the maximum is 425. The average number of awards is 25.9 (standard deviation of 16). The minimum number of awards is 2, the median is 24, and the maximum is 65. The success rate has a mean of 19% (standard deviation of 5.3%). The success rates range from 11.2% to 26.9%, with a median of 19.3%. The awards rate, another measure of funding success, has a mean of 9.5% (standard deviation of 5.2%). The awards rates range from 2.6% to 23.7%, with a median of 8.1%. The IQR is 5.4%, and the CV is 0.6, indicating significant variability in awards rates across disciplines.
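For readers who want to reproduce this kind of summary, the statistics quoted above (mean, standard deviation, median, IQR, and the coefficient of variation, CV = sd / mean) can be computed with base R. The sketch below uses a hypothetical vector `rates`, not the actual award rates from the dataset.

```r
# Hypothetical award rates in percent; these are not values from the dataset.
rates <- c(5, 8, 10, 12, 15)

stats <- c(
  mean   = mean(rates),
  sd     = sd(rates),
  median = median(rates),
  iqr    = IQR(rates),              # 75th minus 25th percentile
  cv     = sd(rates) / mean(rates)  # coefficient of variation (unitless)
)

round(stats, 2)
```

Because the CV divides the spread by the mean, it allows the variability of rates to be compared with that of counts such as applications, which are on a very different scale.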
Checking differences between genders
This summary does not show differences by gender, so I made two plots comparing the average success_rate and average awards_rate of female and male researchers.
# load the library to convert from data values to perceptual properties
library(scales)
Average_success_proportion across disciplines, grouped by gender.
# Aggregate data to get mean success_rate by discipline and gender
summary_data <- research_long1 %>%
  group_by(discipline, gender) %>%
  summarize(avg_success_rate = mean(success_rate),
            avg_awards_rate = mean(awards_rate)) %>%
  ungroup()
`summarise()` has grouped output by 'discipline'. You can override using the
`.groups` argument.
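The message above is informational: after summarising over two grouping variables, dplyr keeps the result grouped by the first one unless told otherwise. A minimal sketch of silencing it by passing `.groups = "drop"`, which also makes a trailing `ungroup()` unnecessary, is shown below. The `toy` table and its values are invented for illustration.

```r
library(dplyr)

# Toy data with invented values, standing in for the long dataset.
toy <- tibble(
  discipline   = c("A", "A", "B", "B"),
  gender       = c("Female", "Male", "Female", "Male"),
  success_rate = c(15, 20, 18, 17)
)

# .groups = "drop" returns an ungrouped result and suppresses the message.
summary_toy <- toy %>%
  group_by(discipline, gender) %>%
  summarise(avg_success_rate = mean(success_rate), .groups = "drop")

summary_toy
```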
# create a plot using geom_line
ggplot(summary_data, aes(x = discipline, y = avg_success_rate,
                         color = gender, group = gender)) +
  geom_line() +
  geom_point(aes(shape = gender), size = 2, alpha = 0.8) +
  labs(x = "Disciplines", y = "Average Success Rate",
       title = "Average Success Rate in Research Funding",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.caption = element_text(size = 10, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_y_continuous(labels = scales::percent) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(15, 16)) # Circle for Male (16), Square for Female (15)
Average_award_proportion across disciplines, grouped by gender.
# create a plot using geom_line
# (linewidth replaces the size aesthetic for lines, deprecated in ggplot2 3.4.0)
ggplot(summary_data, aes(x = discipline, y = avg_awards_rate,
                         color = gender, group = gender)) +
  geom_line(alpha = 0.8, linewidth = 0.6) +
  geom_point(aes(shape = gender), size = 3, alpha = 0.8) +
  labs(x = "Disciplines", y = "Average Award Rate",
       title = "Average Award Rate in Research Funding",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.caption = element_text(size = 10, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_y_continuous(labels = scales::percent) +
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(15, 16)) # Circle for Male (16), Square for Female (15)
From the two plots above, the curves of success and award rates show notable differences between genders.
Creating the visualizations
Visualization 1:
Success and Award Rates in Research Funding: Gender Perspective
Before plotting, I need to convert the variables holding percentage values into proportions so that the label_percent function works correctly.
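The conversion step itself is not shown here. A minimal sketch of what it could look like follows, assuming the rates are stored as percentages (for example 15 for 15%). The column names `success_proportion` and `awards_proportion` match those used in the plotting code, but the `toy_long` table and its values are invented for illustration.

```r
library(dplyr)
library(scales)

# Toy data with invented values; rates are stored as percentages.
toy_long <- tibble(
  gender       = c("Female", "Male"),
  success_rate = c(15, 20),
  awards_rate  = c(6, 9)
)

# Divide by 100 so that label_percent(), which multiplies by 100 by default,
# prints the intended values on the axis.
toy_prop <- toy_long %>%
  mutate(success_proportion = success_rate / 100,
         awards_proportion  = awards_rate / 100)

# Format the proportions as whole-number percentages.
label_percent(accuracy = 1)(toy_prop$success_proportion)
```

Passing the raw percentages to `label_percent()` would instead scale them by 100 a second time, which is why the proportions are needed.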
# create a plot without geom_point
plot1 <- ggplot(research_long, aes(x = discipline, y = success_proportion,
                                   color = gender, size = awards_proportion,
                                   group = gender)) +
  labs(x = "Disciplines", y = "Success rate",
       title = expression(bold("Success and Award Rates in Research Funding: Gender Perspective")),
       caption = "Source: Data Science Lab") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10),
        plot.caption = element_text(face = "italic")) + # Make the caption italic
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) + # Limits to whole numbers
  scale_color_brewer(palette = "Dark2")
# add the geom layer with five variables
plot2 <- plot1 + geom_point(alpha = 0.8)
# display the plot
print(plot2)
The line plots reveal that, on average, men have higher success and award rates compared to women across most disciplines. This finding suggests potential gender biases or other underlying factors affecting funding outcomes.
Visualization 2:
Relationship Between Applications, Success Rates, and Awards Rates by Discipline
# Calculate average number of applications, success rates, and awards rates by discipline
summary_dataset <- research_long %>%
  group_by(discipline) %>%
  summarise(avg_applications = mean(applications),
            avg_success_proportion = mean(success_proportion),
            avg_awards_proportion = mean(awards_proportion))
# Plot to visualize the relationship between the number of applications, success rates, and awards rates
# (linewidth replaces the size aesthetic for lines, deprecated in ggplot2 3.4.0)
ggplot(summary_dataset, aes(x = avg_applications)) +
  geom_point(aes(y = avg_success_proportion * 100, color = "Success Rate"), size = 2) +
  geom_point(aes(y = avg_awards_proportion * 100, color = "Awards Rate"), size = 2) +
  geom_smooth(aes(y = avg_success_proportion * 100, color = "Success Rate"),
              method = "lm", se = FALSE, linetype = "dotdash", linewidth = 0.5) +
  geom_smooth(aes(y = avg_awards_proportion * 100, color = "Awards Rate"),
              method = "lm", se = FALSE, linetype = "dotdash", linewidth = 0.5) +
  geom_text(aes(y = avg_success_proportion * 100, label = discipline),
            angle = 1, vjust = -1, size = 2) +
  geom_text(aes(y = avg_awards_proportion * 100, label = discipline),
            angle = 1, vjust = 1.5, size = 1.9) +
  labs(x = "Number of Applications",
       y = "Rate (%)",
       title = expression(bold("Variability in Applications, Success Rates, and Awards Rates Across Disciplines")),
       color = "Rate Type",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1),
        plot.title = element_text(size = 11, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9),
        plot.caption = element_text(size = 7, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_color_manual(values = c("Success Rate" = "magenta", "Awards Rate" = "blue"))
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Variability in Applications and Awards: There is substantial variability in the number of applications and awards across disciplines. Disciplines with a higher number of applications do not necessarily have higher success or award rates, indicating differences in competitiveness or funding availability.
Discipline-specific Trends: Certain disciplines exhibit higher success and award rates, highlighting disparities in funding success across research fields. For example, disciplines like Chemical Sciences and Physical Sciences show higher rates compared to Humanities and Social Sciences.
Conclusion
The analysis reveals significant variability in research funding success and award rates across disciplines and genders. While the dataset provides valuable insights, further investigation into the underlying causes of these disparities is necessary. Future research could explore factors such as the quality of applications, review processes, and institutional support to better understand and address the observed differences in funding outcomes.