Data Introduction & Description

I have chosen to work with the Nobel Prize Data Set for my final project. This data set contains information about Nobel Prize winners from it’s beginning in 1901. I became interested in this data while reading a book called Cassandra Speaks, which is a book about rewriting women’s stories in the past and present. In one of the chapters she talked through the underrepresentation of women Nobel prize winners and particularly those in STEM categories. I thought it would be interesting to be able to explore the claims that she made and to bring light a subject that I feel personally compelled by. I ended up also looking at age of winner with other variables because I wanted to be able to perform addtional statistical analyses.
For my data cleaning process, I uploaded my data into excel. I am excluding rows that are not exclusive to a person (any organizational winners have been removed), I also had to remove duplicates, join birth year into this data set and fill in additional missing information. I discovered some errors in my data set that needed to be corrected as I looked at the summary statistics.
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Data Set

data <- read.csv("~/Desktop/Statistics 2 Homework/2023.06.17.Glover.NBPEDA/NBPDataClean.csv")

Data Structure

Subject Matter: Nobel Prize Winners by Name
Number of Rows: 884
# of Columns: 12
Columns I utilized: ID, firstname, surname, birth year
str(data)
## 'data.frame':    884 obs. of  12 variables:
##  $ index          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id             : int  846 783 230 918 428 773 597 615 782 553 ...
##  $ firstname      : chr  "Elinor" "Wangari Muta" "Dorothy Crowfoot" "Youyou" ...
##  $ surname        : chr  "Ostrom" "Maathai" "Hodgkin" "Tu" ...
##  $ Birth.Year     : int  1933 1940 1910 1930 1902 1947 1871 1889 1946 1945 ...
##  $ bornCountry    : chr  "USA" "Kenya" "Egypt" "China" ...
##  $ bornCountryCode: chr  "US" "KE" "EG" "CN" ...
##  $ bornCity       : chr  "Los Angeles, CA" "Nyeri" "Cairo" "Zhejiang Ningbo" ...
##  $ gender         : chr  "female" "female" "female" "female" ...
##  $ year           : int  2009 2004 1964 2015 1983 2003 1926 1945 2004 1991 ...
##  $ Winner.Age     : int  76 64 54 85 81 56 55 56 58 46 ...
##  $ category       : chr  "economics" "peace" "chemistry" "medicine" ...
summary(data)
##      index             id         firstname           surname         
##  Min.   :  1.0   Min.   :  1.0   Length:884         Length:884        
##  1st Qu.:231.8   1st Qu.:222.0   Class :character   Class :character  
##  Median :477.5   Median :442.5   Mode  :character   Mode  :character  
##  Mean   :473.3   Mean   :457.1                                        
##  3rd Qu.:715.2   3rd Qu.:696.2                                        
##  Max.   :945.0   Max.   :934.0                                        
##    Birth.Year   bornCountry        bornCountryCode      bornCity        
##  Min.   :1817   Length:884         Length:884         Length:884        
##  1st Qu.:1889   Class :character   Class :character   Class :character  
##  Median :1915   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1910                                                           
##  3rd Qu.:1933                                                           
##  Max.   :1997                                                           
##     gender               year        Winner.Age      category        
##  Length:884         Min.   :1901   Min.   :17.00   Length:884        
##  Class :character   1st Qu.:1946   1st Qu.:50.00   Class :character  
##  Mode  :character   Median :1975   Median :60.00   Mode  :character  
##                     Mean   :1969   Mean   :59.43                     
##                     3rd Qu.:1997   3rd Qu.:69.00                     
##                     Max.   :2016   Max.   :90.00

Winner Ages

df <- data
summary(df$Winner.Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   50.00   60.00   59.43   69.00   90.00
winner_age <- df$Winner.Age
hist(winner_age, breaks = "FD", col = "skyblue", main = "Histogram of Nobel Prize- Age at time of Award", xlab = "Age")

Summary Statistics for Winner Age

summary(df$Winner.Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   50.00   60.00   59.43   69.00   90.00
mean(df$Winner.Age)
## [1] 59.4276
sd(df$Winner.Age)
## [1] 12.3864
var(df$Winner.Age)
## [1] 153.4228
length(df$Winner.Age)
## [1] 884
quantile(df$Winner.Age, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##  50  60  69
df$year <- as.numeric(df$year)
df$Winner.Age <- as.numeric(df$Winner.Age)

Bivariate Comparison of 2 Variables

plot(df$year, df$Winner.Age, pch = 16, col = "blue", xlab = "Year", ylab = "Winner Age", main = "Scatter Plot: Winner Age vs. Year")

boxplot(Winner.Age ~ gender, data = df, col = "blue", xlab = "Gender", ylab = "Winner Age", main = "Box Plot: Winner Age by Gender")

ggplot(df, aes(x = gender, y = Winner.Age, fill = gender)) +
  geom_violin() +
  labs(x = "Gender", y = "Winner Age", title = "Violin Plot: Winner Age by Gender")

boxplot(Winner.Age ~ category, data = df, col = "blue", xlab = "Category", ylab = "Winner Age", main = "Box Plot: Winner Age by Category")

Exploring Gender in Dataset

ggplot(data, aes(x = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution", x = "Gender", y = "Count")

Gender Counts

gender_counts <- table(data$gender)
most_common_gender <- names(gender_counts)[which.max(gender_counts)]
print(most_common_gender)
## [1] "male"
print(gender_counts)
## 
## female   male 
##     49    835

Exploring Gender and Category Variables

ggplot(df, aes(x = category, fill = gender)) +
  geom_bar(position = "dodge") +
  labs(x = "Category", y = "Frequency", title = "Grouped Bar Plot: Frequency of Categories by Gender")

cont_table <- table(df$gender, df$category)
chi_square <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
print(chi_square)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08
df$female <- ifelse(df$gender == "female", 1, 0)
cont_table <- table(df$female, df$category)
chi_square <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
print(chi_square)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08

ANOVA for Age of Winner and Gender

female <- df$Winner.Age[df$gender == "female"]
male <- df$Winner.Age[df$gender == "male"]
anova_result <- aov(winner_age ~ gender, data = df)
summary(anova_result)
##              Df Sum Sq Mean Sq F value Pr(>F)
## gender        1    106   105.7   0.689  0.407
## Residuals   882 135367   153.5
print(anova_result)
## Call:
##    aov(formula = winner_age ~ gender, data = df)
## 
## Terms:
##                    gender Residuals
## Sum of Squares     105.72 135366.64
## Deg. of Freedom         1       882
## 
## Residual standard error: 12.38858
## Estimated effects may be unbalanced

Scatterplot for Female Winners

female_data <- df[df$gender == "female", ]
plot(female_data$year, female_data$Winner.Age, 
     xlab = "Year", ylab = "Age", 
     main = "Scatter Plot: Age vs Year (Females)")

text(female_data$year, female_data$age, labels = female_data$index, pos = 3)

Exploring Ratios of Female vs Male by Category

ratio_by_category <- with(df, table(category, gender)) / rowSums(with(df, table(category, gender)))

ratio_by_category <- round(ratio_by_category * 100, 2)

print(ratio_by_category)
##             gender
## category     female  male
##   chemistry    2.27 97.73
##   economics    1.32 98.68
##   literature  12.50 87.50
##   medicine     5.66 94.34
##   peace       15.38 84.62
##   physics      0.98 99.02

Chi-Square Test

In exploring my data, I wanted to see if there was statistical significance in the Categories women will win a Nobel Prize in. According to my Chi-Square and small p-value their is evidence against the null hypothesis and that female gender and category of the Nobel prize is not independent of each other.

female_table <- table(data$gender[data$gender == "female"], data$category[data$gender == "female"])

chi_sq_test <- chisq.test(female_table)

print(chi_sq_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  female_table
## X-squared = 26.551, df = 5, p-value = 6.975e-05

Exploring Different visualizations of this Data

transposed_ratio <- t(ratio_by_category)

barplot(transposed_ratio, beside = TRUE,
        xlab = "", ylab = "Male-to-Female Ratio (%)",
        main = "Male-to-Female Ratios by Category",
        col = c("pink", "blue"))

axis(1, at = 1:ncol(transposed_ratio), labels = FALSE)

Data by Country

country_counts <- table(data$bornCountry)

most_common_country <- names(country_counts)[which.max(country_counts)]

print(most_common_country)
## [1] "USA"
usa_data <- data[data$bornCountry == "USA", ]

category_counts <- table(usa_data$category)

most_common_category <- names(category_counts)[which.max(category_counts)]

print(most_common_category)
## [1] "medicine"