Data Introduction & Description

I have chosen to work with the Nobel Prize data set for my final project. This data set contains information about Nobel Prize winners from the prize's beginning in 1901 through 2016. I became interested in this data while reading Cassandra Speaks, a book about rewriting women's stories in the past and present. In one of the chapters, the author discusses the underrepresentation of women among Nobel Prize winners, particularly in the STEM categories. I thought it would be interesting to explore the claims she made and to bring light to a subject that I feel personally compelled by. I also looked at the age of winners alongside other variables because I wanted to be able to perform additional statistical analyses.

Questions Explored:

- How are women represented in the Nobel Prize recipient data, and in which categories have they been awarded prizes?
- How does a winner's age relate to other variables, such as gender?

Data Sourcing

I found my data on Kaggle (http://www.kaggle.com/datasets/thedevastator/a-complete-history-of-nobel-prize-winners?resource=download) and used Wikipedia to fill in any missing dates in the data set.

Data Notes

My analysis is based on counts of prizes awarded. Among both male and female recipients, several individuals have won the award twice; I did not remove these multiple wins, so each win is counted separately. I also removed any instances where the Nobel Prize went to an organization instead of an individual.

Data Cleaning

For my data cleaning process, I loaded the data into Excel. I excluded rows that do not correspond to a person (all organizational winners were removed), removed duplicates, joined birth year into the data set, and filled in additional missing information. While reviewing the summary statistics, I also discovered some errors in the data set and corrected them.
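The cleaning steps above were done by hand in Excel, so no script exists for them. As a rough illustration only, the same logic could be sketched in Python as shown here. The column names (`id`, `gender`, `category`, `year`, `birth_year`), the `"org"` marker, the ID values, and the definition of winner age as award year minus birth year are all assumptions made for this sketch, not details taken from the Kaggle file.

```python
# Hypothetical mini-sample; column names and ids are assumptions for illustration.
rows = [
    {"id": 1, "name": "Marie Curie", "gender": "female",
     "category": "physics", "year": 1903, "birth_year": 1867},
    {"id": 1, "name": "Marie Curie", "gender": "female",
     "category": "chemistry", "year": 1911, "birth_year": 1867},
    # Exact duplicate of the row above (should be dropped).
    {"id": 1, "name": "Marie Curie", "gender": "female",
     "category": "chemistry", "year": 1911, "birth_year": 1867},
    # Organizational winner (should be excluded).
    {"id": 2, "name": "International Committee of the Red Cross", "gender": "org",
     "category": "peace", "year": 1917, "birth_year": None},
    {"id": 3, "name": "Malala Yousafzai", "gender": "female",
     "category": "peace", "year": 2014, "birth_year": 1997},
]


def clean(rows):
    """Keep individual winners, drop duplicate rows, and add winner_age."""
    seen = set()
    cleaned = []
    for row in rows:
        # Exclude organizations: keep only rows with a person's gender.
        if row["gender"] not in ("female", "male"):
            continue
        # A repeat win (same person, different year) is kept;
        # only an identical (id, category, year) row counts as a duplicate.
        key = (row["id"], row["category"], row["year"])
        if key in seen:
            continue
        seen.add(key)
        # Assumed definition: age = award year - birth year.
        cleaned.append({**row, "winner_age": row["year"] - row["birth_year"]})
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row["name"], row["category"], row["winner_age"])
```

In this sketch the two Marie Curie wins both survive, which mirrors the choice in the Data Notes to keep multiple wins by the same person.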

Data Structure

Subject Matter: Nobel Prize Winners by Name
Number of Rows: 884
Number of Columns: 12
Columns I utilized: ID, firstname, surname, birth year, gender, winner age

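Summary Statistics for Winner Age

The output below shows, in order: the five-number summary with the mean, the mean, the standard deviation, the variance, the number of observations, and the quartiles of winner age.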

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   50.00   60.00   59.43   69.00   90.00
## [1] 59.4276
## [1] 12.3864
## [1] 153.4228
## [1] 884
## 25% 50% 75% 
##  50  60  69
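As a quick sanity check on the numbers above, the variance should equal the standard deviation squared, and the interquartile range follows from the quartiles. The short Python sketch below re-derives both from the rounded values printed in the output. It is only a consistency check under that rounding, not the code that produced the report.

```python
# Values copied from the summary output above (already rounded).
mean_age = 59.4276
sd_age = 12.3864
var_age = 153.4228
n = 884
q1, median, q3 = 50, 60, 69

# Variance is the square of the standard deviation.
variance_from_sd = sd_age ** 2

# Interquartile range: spread of the middle 50% of winner ages.
iqr = q3 - q1

print(f"sd^2 = {variance_from_sd:.4f} (reported variance: {var_age})")
print(f"IQR  = {iqr} years")
```

The small gap between the squared standard deviation and the reported variance comes from rounding the standard deviation to four decimals.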

Nobel Prize Ages Histogram

Bivariate Comparison of 2 Variables

Exploring Gender in Dataset

Gender Counts

Gender Count
female 49
male 835
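To put these counts in proportion, the share of each gender can be computed directly from the table. This is a minimal Python sketch using only the two counts reported above.

```python
# Counts copied from the gender table above.
counts = {"female": 49, "male": 835}
total = sum(counts.values())

# Share of all individual awards going to each gender.
shares = {gender: count / total for gender, count in counts.items()}

for gender, share in shares.items():
    print(f"{gender}: {counts[gender]} of {total} ({share:.1%})")
```

The total matches the 884 rows listed under Data Structure.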

Exploring Gender and Category Variables

Chi-Square Test

In exploring my data, I wanted to see whether there is a statistically significant association between gender and the category in which a Nobel Prize is awarded. The chi-square test returned a very small p-value (2.429e-08), which is strong evidence against the null hypothesis that gender and prize category are independent of each other.

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08
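The reported p-value can be cross-checked from the test statistic alone. The degrees of freedom are (rows - 1) x (columns - 1), which gives 5 for a table of 2 genders by 6 prize categories. The table shape is inferred from df = 5, since the contingency table itself is not shown. The Python sketch below uses the closed-form upper-tail probability of a chi-square distribution with 5 degrees of freedom, so it needs only the standard library. The function name `chi2_sf_df5` is made up for this example.

```python
import math


def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 5 df.

    Closed form for df = 5:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * (1 + x/3)
    """
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3))


# Degrees of freedom for an assumed 2 x 6 contingency table.
n_genders, n_categories = 2, 6
df = (n_genders - 1) * (n_categories - 1)

# Test statistic copied from the output above.
x_squared = 43.898
p_value = chi2_sf_df5(x_squared)

print(f"df = {df}, p-value = {p_value:.3e}")
```

Because the formula is specific to 5 degrees of freedom, it would have to be replaced by a general chi-square routine for a table of any other shape.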

ANOVA for Age of Winner and Gender

I also ran an ANOVA on this data to see whether there was any significant difference between the mean ages of female and male winners. The p-value is 0.407, which is greater than the typical significance level of 0.05. Therefore, we fail to reject the null hypothesis: the data do not show a significant difference in the mean ages of winners between the “female” and “male” groups.

##              Df Sum Sq Mean Sq F value Pr(>F)
## gender        1    106   105.7   0.689  0.407
## Residuals   882 135367   153.5
## Call:
##    aov(formula = Winner.Age ~ gender, data = df)
## 
## Terms:
##                    gender Residuals
## Sum of Squares     105.72 135366.64
## Deg. of Freedom         1       882
## 
## Residual standard error: 12.38858
## Estimated effects may be unbalanced
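The entries in the ANOVA table are related by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, the F value is the ratio of the two mean squares, and the residual standard error is the square root of the residual mean square. The Python sketch below re-derives those values from the sums of squares printed above, as a consistency check rather than a re-run of the analysis.

```python
import math

# Sums of squares and degrees of freedom copied from the aov output above.
ss_gender, df_gender = 105.72, 1
ss_resid, df_resid = 135366.64, 882

# Mean square = sum of squares / degrees of freedom.
ms_gender = ss_gender / df_gender
ms_resid = ss_resid / df_resid

# F statistic: between-group mean square over within-group mean square.
f_value = ms_gender / ms_resid

# Residual standard error: square root of the residual mean square.
residual_se = math.sqrt(ms_resid)

# Total sum of squares over (n - 1) gives the overall sample variance of age.
total_variance = (ss_gender + ss_resid) / (df_gender + df_resid)

print(f"F value = {f_value:.3f}")
print(f"Residual standard error = {residual_se:.5f}")
print(f"Overall variance of winner age = {total_variance:.4f}")
```

The overall variance recovered this way agrees with the variance in the summary statistics for winner age, which suggests both outputs come from the same 884 rows.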

Conclusions

As expected, women are underrepresented across all categories of Nobel Prizes (49 of the 884 awards to individuals in this data set), and the chi-square test shows a statistically significant association between Nobel Prize category and gender. There is no significant difference in the mean ages of winners between the “female” and “male” groups.

If I continue with this data set in the future, I’d like to look at the emergence of female Nobel Prize winners over time.