Data Introduction & Description
I have chosen to work with the Nobel Prize Data Set for my final
project. This data set contains information about Nobel Prize winners
from it’s beginning in 1901. I became interested in this data while
reading a book called Cassandra Speaks, which is a book about rewriting
women’s stories in the past and present. In one of the chapters she
talked through the underrepresentation of women Nobel prize winners and
particularly those in STEM categories. I thought it would be interesting
to be able to explore the claims that she made and to bring light a
subject that I feel personally compelled by. I ended up also looking at
age of winner with other variables because I wanted to be able to
perform addtional statistical analyses.
Winner Ages
df <- data
summary(df$Winner.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 50.00 60.00 59.43 69.00 90.00
winner_age <- df$Winner.Age
hist(winner_age, breaks = "FD", col = "skyblue", main = "Histogram of Nobel Prize- Age at time of Award", xlab = "Age")

Summary Statistics for Winner Age
summary(df$Winner.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 50.00 60.00 59.43 69.00 90.00
mean(df$Winner.Age)
## [1] 59.4276
sd(df$Winner.Age)
## [1] 12.3864
var(df$Winner.Age)
## [1] 153.4228
length(df$Winner.Age)
## [1] 884
quantile(df$Winner.Age, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 50 60 69
df$year <- as.numeric(df$year)
df$Winner.Age <- as.numeric(df$Winner.Age)
Bivariate Comparison of 2 Variables
plot(df$year, df$Winner.Age, pch = 16, col = "blue", xlab = "Year", ylab = "Winner Age", main = "Scatter Plot: Winner Age vs. Year")

boxplot(Winner.Age ~ gender, data = df, col = "blue", xlab = "Gender", ylab = "Winner Age", main = "Box Plot: Winner Age by Gender")

ggplot(df, aes(x = gender, y = Winner.Age, fill = gender)) +
geom_violin() +
labs(x = "Gender", y = "Winner Age", title = "Violin Plot: Winner Age by Gender")

boxplot(Winner.Age ~ category, data = df, col = "blue", xlab = "Category", ylab = "Winner Age", main = "Box Plot: Winner Age by Category")

Exploring Gender in Dataset
ggplot(data, aes(x = gender)) +
geom_bar() +
labs(title = "Gender Distribution", x = "Gender", y = "Count")

Gender Counts
gender_counts <- table(data$gender)
most_common_gender <- names(gender_counts)[which.max(gender_counts)]
print(most_common_gender)
## [1] "male"
print(gender_counts)
##
## female male
## 49 835
Exploring Gender and Category Variables
ggplot(df, aes(x = category, fill = gender)) +
geom_bar(position = "dodge") +
labs(x = "Category", y = "Frequency", title = "Grouped Bar Plot: Frequency of Categories by Gender")

cont_table <- table(df$gender, df$category)
chi_square <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
print(chi_square)
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08
df$female <- ifelse(df$gender == "female", 1, 0)
cont_table <- table(df$female, df$category)
chi_square <- chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
print(chi_square)
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08
ANOVA for Age of Winner and Gender
female <- df$Winner.Age[df$gender == "female"]
male <- df$Winner.Age[df$gender == "male"]
anova_result <- aov(winner_age ~ gender, data = df)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 106 105.7 0.689 0.407
## Residuals 882 135367 153.5
print(anova_result)
## Call:
## aov(formula = winner_age ~ gender, data = df)
##
## Terms:
## gender Residuals
## Sum of Squares 105.72 135366.64
## Deg. of Freedom 1 882
##
## Residual standard error: 12.38858
## Estimated effects may be unbalanced
Scatterplot for Female Winners
female_data <- df[df$gender == "female", ]
plot(female_data$year, female_data$Winner.Age,
xlab = "Year", ylab = "Age",
main = "Scatter Plot: Age vs Year (Females)")
text(female_data$year, female_data$age, labels = female_data$index, pos = 3)

Exploring Ratios of Female vs Male by Category
ratio_by_category <- with(df, table(category, gender)) / rowSums(with(df, table(category, gender)))
ratio_by_category <- round(ratio_by_category * 100, 2)
print(ratio_by_category)
## gender
## category female male
## chemistry 2.27 97.73
## economics 1.32 98.68
## literature 12.50 87.50
## medicine 5.66 94.34
## peace 15.38 84.62
## physics 0.98 99.02
Chi-Square Test
In exploring my data, I wanted to see if there was statistical
significance in the Categories women will win a Nobel Prize in.
According to my Chi-Square and small p-value their is evidence against
the null hypothesis and that female gender and category of the Nobel
prize is not independent of each other.
female_table <- table(data$gender[data$gender == "female"], data$category[data$gender == "female"])
chi_sq_test <- chisq.test(female_table)
print(chi_sq_test)
##
## Chi-squared test for given probabilities
##
## data: female_table
## X-squared = 26.551, df = 5, p-value = 6.975e-05
Exploring Different visualizations of this Data
transposed_ratio <- t(ratio_by_category)
barplot(transposed_ratio, beside = TRUE,
xlab = "", ylab = "Male-to-Female Ratio (%)",
main = "Male-to-Female Ratios by Category",
col = c("pink", "blue"))
axis(1, at = 1:ncol(transposed_ratio), labels = FALSE)

Data by Country
country_counts <- table(data$bornCountry)
most_common_country <- names(country_counts)[which.max(country_counts)]
print(most_common_country)
## [1] "USA"
usa_data <- data[data$bornCountry == "USA", ]
category_counts <- table(usa_data$category)
most_common_category <- names(category_counts)[which.max(category_counts)]
print(most_common_category)
## [1] "medicine"