DS Labs Assignment

Author

Su Thet Hninn

Creating A New Multivariable Graph using DSlabs Dataset

I used the “Research Funding Rates” dataset (research_funding_rates) from the dslabs package. It has 9 observations, one per discipline, and 10 variables: discipline, total applications, applications by men, applications by women, total awards, awards for men, awards for women, total success rate, men’s success rate, and women’s success rate.

To analyze the dataset, I installed the dslabs package, loaded the data in RStudio, and saved it on my computer.

# install.packages("dslabs")  # run once if the package is not installed yet
library(package = "dslabs")
list.files(system.file("script", package = "dslabs"))

 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"

# load the research_funding rates data
data("research_funding_rates")

The libraries ‘tidyverse’, ‘ggfortify’, ‘htmltools’, ‘plotly’, ‘scales’, and ‘summarytools’ have been loaded.

# load the required packages
library(tidyverse)
library(ggfortify)
library(htmltools)
library(plotly)
library(scales)
library(summarytools)

Cleaning the dataset

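Three new variables, awards_rate_men, awards_rate_women, and awards_rate_total, are created using the ‘mutate’ function. Each award rate is the number of awards divided by the total number of applications in the discipline, expressed as a percentage.

# create new variables using 'mutate' function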
research_funding_rates_nv <- research_funding_rates %>%
  mutate(
    awards_rate_men = (awards_men / applications_total) * 100,
    awards_rate_women = (awards_women / applications_total) * 100, 
    awards_rate_total = (awards_rate_men + awards_rate_women)
  )
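To make the difference between the new award rate and the dataset's own success rate concrete, here is a small worked sketch in base R. It uses the Chemical sciences counts that appear in the head() output later in this report; the variable names are only for illustration.

```r
# Chemical sciences counts, copied from the head() output in this report
applications_men   <- 83
applications_women <- 39
awards_men         <- 22
awards_women       <- 10
applications_total <- applications_men + applications_women  # 122

# award rate as defined above: awards to men relative to ALL applications
awards_rate_men <- awards_men / applications_total * 100

# success rate as stored in the dataset: awards to men relative to applications BY men
success_rate_men <- awards_men / applications_men * 100

round(awards_rate_men, 1)   # about 18.0
round(success_rate_men, 1)  # about 26.5
```

The two measures answer different questions, so the award rate of a gender is always at most its success rate.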

I converted the dataset into long format and created one more new variable, gender.

# pivot the dataset to long format and add gender variable
research_long1 <- research_funding_rates_nv %>%
  pivot_longer(cols = c(applications_men, applications_women),
               names_to = "application_type",
               values_to = "applications",
               names_pattern = "applications_(.*)") %>%
  pivot_longer(cols = c(awards_men, awards_women),
               names_to = "awards_type",
               values_to = "awards",
               names_pattern = "awards_(.*)") %>%
  pivot_longer(cols = c(success_rates_men, success_rates_women),
               names_to = "success_rate_type",
               values_to = "success_rate",
               names_pattern = "success_rates_(.*)") %>%
  pivot_longer(cols = c(awards_rate_men, awards_rate_women),
               names_to = "awards_rate_type",
               values_to = "awards_rate",
               names_pattern = "awards_rate_(.*)") %>%
  mutate(gender = case_when(
    application_type == "men" ~ "Male",
    application_type == "women" ~ "Female"
  )) %>%
  # the four pivots produce every combination of gender labels;
  # keep only the rows in which all four labels agree
  filter(application_type == awards_type,
         awards_type == success_rate_type,
         success_rate_type == awards_rate_type) %>%
  select(discipline, gender, applications, awards, success_rate, awards_rate) %>%
  filter(gender %in% c("Male", "Female"))  # safeguard: keep only rows with a known gender
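As an alternative sketch, the same long layout can probably be reached with a single pivot_longer() call by using the special ".value" name, which splits each column name into a measure and a gender. The toy tibble below is an assumption for illustration: it holds only two disciplines, with values copied from the head() output in this report.

```r
library(dplyr)
library(tidyr)

# two illustrative rows; values copied from the head() output in this report
toy <- tibble(
  discipline          = c("Chemical sciences", "Physical sciences"),
  applications_men    = c(83, 135),
  applications_women  = c(39, 39),
  awards_men          = c(22, 26),
  awards_women        = c(10, 9),
  success_rates_men   = c(26.5, 19.3),
  success_rates_women = c(25.6, 23.1),
  awards_rate_men     = c(18.0, 14.9),
  awards_rate_women   = c(8.20, 5.17)
)

toy_long <- toy %>%
  pivot_longer(
    cols = -discipline,
    names_to = c(".value", "gender"),     # ".value" keeps the measure as a column
    names_pattern = "(.*)_(men|women)$"   # split "awards_rate_men" into measure and gender
  ) %>%
  rename(success_rate = success_rates) %>%
  mutate(gender = if_else(gender == "men", "Male", "Female"))

toy_long
```

On the full dataset, the "_total" columns would have to be dropped first, for example with select(-ends_with("_total")), because their names do not match the pattern.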

Additionally, the number of successful applications (success) is calculated from success_rate and the number of applications for each gender. The new long dataset has 18 observations and 7 variables (discipline, gender, applications, awards, success_rate, awards_rate, success). The percentage values for the award and success rates are converted into proportions later, before plotting.

# calculate success_total and create new variable
research_long1$success <- research_long1$success_rate / 100 * research_long1$applications
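As a quick sanity check, the reconstructed success counts should agree with the awards column up to the rounding of success_rate. The sketch below repeats the formula in base R on the first ten rows, with values copied from the head() output shown later in this report.

```r
# first ten rows of research_long1, copied from the head() output in this report
applications <- c(83, 39, 135, 39, 67, 9, 230, 166, 189, 62)
awards       <- c(22, 10, 26, 9, 18, 2, 33, 32, 30, 13)
success_rate <- c(26.5, 25.6, 19.3, 23.1, 26.9, 22.2, 14.3, 19.3, 15.9, 21)

# same formula as above: rate (in percent) times applications
success <- success_rate / 100 * applications

# the reconstructed counts differ from the awards only by rounding error
all(round(success) == awards)
max(abs(success - awards))
```

For these rows, success carries the same information as awards, so either column could be used in the plots.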

I checked the cleaned dataset using the ‘head’ function.

# check out the first few lines
head(research_long1, 10)

# A tibble: 10 × 7
   discipline        gender applications awards success_rate awards_rate success
   <chr>             <chr>         <dbl>  <dbl>        <dbl>       <dbl>   <dbl>
 1 Chemical sciences Male             83     22         26.5       18.0    22.0 
 2 Chemical sciences Female           39     10         25.6        8.20    9.98
 3 Physical sciences Male            135     26         19.3       14.9    26.1 
 4 Physical sciences Female           39      9         23.1        5.17    9.01
 5 Physics           Male             67     18         26.9       23.7    18.0 
 6 Physics           Female            9      2         22.2        2.63    2.00
 7 Humanities        Male            230     33         14.3        8.33   32.9 
 8 Humanities        Female          166     32         19.3        8.08   32.0 
 9 Technical scienc… Male            189     30         15.9       12.0    30.1 
10 Technical scienc… Female           62     13         21          5.18   13.0

Checking the data Summary

Before creating the charts, I checked a summary of the data.

# Load the library for summary function
library(summarytools)

dfSummary(research_long1)

Data Frame Summary  
research_long1  
Dimensions: 18 x 7  
Duplicates: 0  

-------------------------------------------------------------------------------------------------------------
No   Variable       Stats / Values              Freqs (% of Valid)   Graph               Valid      Missing  
---- -------------- --------------------------- -------------------- ------------------- ---------- ---------
1    discipline     1. Chemical sciences        2 (11.1%)            II                  18         0        
     [character]    2. Earth/life sciences      2 (11.1%)            II                  (100.0%)   (0.0%)   
                    3. Humanities               2 (11.1%)            II                                      
                    4. Interdisciplinary        2 (11.1%)            II                                      
                    5. Medical sciences         2 (11.1%)            II                                      
                    6. Physical sciences        2 (11.1%)            II                                      
                    7. Physics                  2 (11.1%)            II                                      
                    8. Social sciences          2 (11.1%)            II                                      
                    9. Technical sciences       2 (11.1%)            II                                      

2    gender         1. Female                   9 (50.0%)            IIIIIIIIII          18         0        
     [character]    2. Male                     9 (50.0%)            IIIIIIIIII          (100.0%)   (0.0%)   

3    applications   Mean (sd) : 156.8 (119.5)   17 distinct values     :                 18         0        
     [numeric]      min < med < max:                                 : : : :             (100.0%)   (0.0%)   
                    9 < 130.5 < 425                                  : : : : .       .                       
                    IQR (CV) : 150 (0.8)                             : : : : :       :                       
                                                                     : : : : : :     :                       

4    awards         Mean (sd) : 25.9 (16)       17 distinct values     :                 18         0        
     [numeric]      min < med < max:                                   : :               (100.0%)   (0.0%)   
                    2 < 24 < 65                                      : : : :                                 
                    IQR (CV) : 18.8 (0.6)                            : : : : :                               
                                                                     : : : : :   :                           

5    success_rate   Mean (sd) : 19 (5.3)        16 distinct values       :               18         0        
     [numeric]      min < med < max:                                 :   :   :           (100.0%)   (0.0%)   
                    11.2 < 19.3 < 26.9                               :   :   : . . . .                       
                    IQR (CV) : 8.3 (0.3)                             :   :   : : : : :                       
                                                                     :   :   : : : : :                       

6    awards_rate    Mean (sd) : 9.5 (5.2)       18 distinct values     :                 18         0        
     [numeric]      min < med < max:                                   :                 (100.0%)   (0.0%)   
                    2.6 < 8.1 < 23.7                                   :                                     
                    IQR (CV) : 5.4 (0.6)                               :                                     
                                                                     . : : . .                               

7    success        Mean (sd) : 26 (16)         18 distinct values     :                 18         0        
     [numeric]      min < med < max:                                   :   :             (100.0%)   (0.0%)   
                    2 < 24 < 65                                      : : : :                                 
                    IQR (CV) : 18.7 (0.6)                            : : : : :                               
                                                                     : : : : :   :                           
-------------------------------------------------------------------------------------------------------------

view(dfSummary(research_long1))

Switching method to 'browser'

Output file written: /var/folders/mg/qdgwyzy52v98fm2nmhhjppnc0000gn/T//Rtmpsrlp0o/filec252c3bb114.html

Data Summary

From the summary table, the dataset has 18 observations and 7 variables, with no missing values or duplicates. The variables cover different aspects of research funding: discipline, gender, applications, awards, success rate, award rate, and the calculated number of successes.

The number of applications varies widely, with an average of 156.8 applications (standard deviation of 119.5). The minimum number of applications is 9, the median is 130.5, and the maximum is 425. The average number of awards is 25.9 (standard deviation of 16). The minimum number of awards is 2, the median is 24, and the maximum is 65. The success rate has a mean of 19% (standard deviation of 5.3%). The success rates range from 11.2% to 26.9%, with a median of 19.3%. The awards rate, another measure of funding success, has a mean of 9.5% (standard deviation of 5.2%). The awards rates range from 2.6% to 23.7%, with a median of 8.1%. The IQR is 5.4 percentage points, and the CV is 0.6, indicating substantial variability in awards rates across disciplines and genders.
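For readers unfamiliar with the CV column, the coefficient of variation is the standard deviation divided by the mean. A minimal sketch, recomputing it for applications from the rounded figures in the summary table (recomputing from rounded values may differ in the last digit for other variables):

```r
# summary statistics for `applications`, as reported by dfSummary() above
mean_applications <- 156.8
sd_applications   <- 119.5

# coefficient of variation: standard deviation relative to the mean
cv_applications <- sd_applications / mean_applications
round(cv_applications, 1)
```

The result matches the 0.8 shown for applications in the summary table.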

Checking differences between genders

This summary does not show the differences between genders, so I made two plots comparing the average success_rate and the average awards_rate of female and male researchers.

# load the library to convert from data values to perceptual properties
library(scales)

Average success rate across disciplines, grouped by gender.

# Aggregate data to get mean success_rate by discipline and gender
summary_data <- research_long1 %>%
  group_by(discipline, gender) %>%
  summarize(avg_success_rate = mean(success_rate),
            avg_awards_rate = mean(awards_rate)) %>%
  ungroup()

`summarise()` has grouped output by 'discipline'. You can override using the
`.groups` argument.
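The message above is informational. One way to avoid it, sketched here on a small toy tibble (an assumption for illustration, with values copied from the head() output), is to pass .groups = "drop" to summarise(), which also makes the separate ungroup() call unnecessary.

```r
library(dplyr)

# four illustrative rows; values copied from the head() output in this report
toy <- tibble(
  discipline   = c("Chemical sciences", "Chemical sciences", "Physics", "Physics"),
  gender       = c("Male", "Female", "Male", "Female"),
  success_rate = c(26.5, 25.6, 26.9, 22.2)
)

toy_summary <- toy %>%
  group_by(discipline, gender) %>%
  summarise(avg_success_rate = mean(success_rate), .groups = "drop")

# the result is an ordinary, ungrouped tibble
is_grouped_df(toy_summary)
```

Because each discipline and gender combination has exactly one row, the averages here equal the original values.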

# create a plot using geom_line
ggplot(summary_data, aes(x = discipline, 
                         y = avg_success_rate, 
                         color = gender,
                         group = gender)) +
  geom_line() +
  geom_point(aes(shape = gender), size = 2, alpha = 0.8) +
  labs(x = "Disciplines", 
       y = "Average Success Rate",
       title = "Average Success Rate in Research Funding",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12), 
        plot.caption = element_text(size = 10, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +  # values are already in percent
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(15, 16))  # Circle for Male (16), Square for Female (15)

Average award rate across disciplines, grouped by gender.

# create a plot using geom_line
ggplot(summary_data, aes(x = discipline, 
                         y = avg_awards_rate, 
                         color = gender,
                         group = gender)) +
  geom_line(alpha = 0.8, linewidth = 0.6) +  # `linewidth` replaces `size` for lines since ggplot2 3.4.0
  geom_point(aes(shape = gender), size = 3, alpha = 0.8) +
  labs(x = "Disciplines", 
       y = "Average Award Rate",
       title = "Average Award Rate in Research Funding",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12), 
        plot.caption = element_text(size = 10, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +  # values are already in percent
  scale_color_brewer(palette = "Dark2") +
  scale_shape_manual(values = c(15, 16))  # Circle for Male (16), Square for Female (15)

From the two plots above, the curves of success and award rates show notable differences between genders.

Creating the visualizations

Visualization 1:

Success and Award Rates in Research Funding: Gender Perspective

Before plotting, I need to convert the variables stored as percentages into proportions so that the label_percent function labels them correctly.

# prepare data
research_long1$success_rate <- research_long1$success_rate /100
research_long1$awards_rate <- research_long1$awards_rate /100

# rename the variables
research_long <- research_long1 %>%
  rename(success_proportion = success_rate,
         awards_proportion = awards_rate)
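The reason for the conversion is that label_percent() multiplies its input by 100 by default. A minimal sketch of the two options, using a success rate of 19.3% taken from the summary above as the example value:

```r
library(scales)

# value stored as a proportion: the default scale = 100 multiplies it up
label_percent(accuracy = 0.1)(0.193)

# same value stored as a percentage: scale = 1 leaves it unchanged
label_percent(scale = 1, accuracy = 0.1)(19.3)
```

Both calls should produce the same label, so either representation works as long as the scale argument matches how the values are stored.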

# create a plot without geom_point
plot1 <- ggplot(research_long, 
                aes(x = discipline, 
                    y = success_proportion, 
                    color = gender,
                    size = awards_proportion,
                    group = gender)) +
  labs(x = "Disciplines", 
       y = "Success rate",
       title = expression(bold("Success and Award Rates in Research Funding: Gender Perspective")),
       caption = "Source: Data Science Lab") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
        plot.title = element_text(size = 12, hjust = 0.5, margin = margin(t = 10, b = 10)), 
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10), 
        plot.caption = element_text(face = "italic")) +  # Make the caption italic
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +  # Limits to whole numbers
  scale_color_brewer(palette = "Dark2")

# add the point layer; x, y, color, size, and group are inherited from plot1
plot2 <- plot1 + geom_point(alpha = 0.8)

# display the plot
print(plot2)

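The plots above suggest that, on average, men have higher success and award rates than women in most disciplines. Because the award rate is calculated against the total number of applications in a discipline, it also reflects how many men and women applied, not only how often their applications succeeded. These differences may point to gender bias or to other underlying factors affecting funding outcomes.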

Visualization 2:

Relationship Between Applications, Success Rates, and Awards Rates by Discipline

# Calculate average number of applications, success rates, and awards rates by discipline
summary_dataset <- research_long %>%
  group_by(discipline) %>%
  summarise(avg_applications = mean(applications),
            avg_success_proportion = mean(success_proportion),
            avg_awards_proportion = mean(awards_proportion))

# Plot to visualize the relationship between the number of applications, success rates, and awards rates
ggplot(summary_dataset, aes(x = avg_applications)) +
  geom_point(aes(y = avg_success_proportion * 100, color = "Success Rate"), size = 2) +
  geom_point(aes(y = avg_awards_proportion * 100, color = "Awards Rate"), size = 2) +
  geom_smooth(aes(y = avg_success_proportion * 100, color = "Success Rate"), method = "lm", se = FALSE, linetype = "dotdash", linewidth = 0.5) +
  geom_smooth(aes(y = avg_awards_proportion * 100, color = "Awards Rate"), method = "lm", se = FALSE, linetype = "dotdash", linewidth = 0.5) +
  geom_text(aes(y = avg_success_proportion * 100, label = discipline), angle = 1, vjust = -1, size = 2) +
  geom_text(aes(y = avg_awards_proportion * 100, label = discipline), angle = 1, vjust = 1.5, size = 1.9) +
  labs(x = "Number of Applications",
       y = "Rate (%)",
       title = expression(bold("Variability in Applications, Success Rates, and Awards Rates Across Disciplines")),
       color = "Rate Type",
       caption = "Source: Data Science Lab") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10, hjust = 1),
        plot.title = element_text(size = 11, hjust = 0.5, margin = margin(t = 10, b = 10)),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9), 
        plot.caption = element_text(size = 7, face = "italic"),
        legend.position = "bottom",
        legend.box = "horizontal") +
  scale_color_manual(values = c("Success Rate" = "magenta", "Awards Rate" = "blue"))

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Variability in Applications and Awards: There is substantial variability in the number of applications and awards across disciplines. Disciplines with a higher number of applications do not necessarily have higher success or award rates, indicating differences in competitiveness or funding availability.

Discipline-specific Trends: Certain disciplines exhibit higher success and award rates, highlighting disparities in funding success across research fields. For example, disciplines like Chemical Sciences and Physical Sciences show higher rates compared to Humanities and Social Sciences.

Conclusion

The analysis reveals significant variability in research funding success and award rates across disciplines and genders. While the dataset provides valuable insights, further investigation into the underlying causes of these disparities is necessary. Future research could explore factors such as the quality of applications, review processes, and institutional support to better understand and address the observed differences in funding outcomes.