Data Introduction & Description

I have chosen to work with the Nobel Prize data set for my final project. This data set contains information about Nobel Prize winners since the prizes were first awarded in 1901. I became interested in this data while reading Cassandra Speaks, a book about rewriting women’s stories, past and present. In one chapter, the author discusses the underrepresentation of women among Nobel Prize winners, particularly in the STEM categories. I thought it would be interesting to explore her claims and to bring to light a subject that I find personally compelling. I also examined winner age alongside other variables so that I could perform additional statistical analyses.

Data Sourcing

I found my data on Kaggle at http://www.kaggle.com/datasets/thedevastator/a-complete-history-of-nobel-prize-winners?resource=download and used Wikipedia to fill in any missing dates in this data set.

Data Cleaning

For my data cleaning process, I loaded my data into Excel. I excluded rows that do not correspond to an individual person (all organizational winners were removed). I also removed duplicates, joined birth year into this data set, and filled in additional missing information. While reviewing the summary statistics, I discovered some errors in the data set that needed to be corrected.
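The cleaning itself was done in Excel, but the same steps can be sketched in R for readers who want a scripted equivalent. This is only an illustration: the rows below are toy examples, the column names (`birth_year`, `award_year`, `winner_age`) are assumptions rather than the Kaggle schema, and computing age as award year minus birth year is one plausible definition, not necessarily the one used for this project.

```r
# Toy rows for illustration only; column names are assumptions, not the Kaggle schema
winners <- data.frame(
  id         = c(1, 1, 2, 3),
  firstname  = c("Marie", "Marie", "Example Organization", "Malala"),
  surname    = c("Curie", "Curie", NA, "Yousafzai"),
  gender     = c("female", "female", "org", "female"),
  birth_year = c(1867, 1867, NA, 1997),
  award_year = c(1903, 1903, 1917, 2014),
  stringsAsFactors = FALSE
)

# 1. Keep individual laureates only (drop organizational winners)
people <- winners[winners$gender %in% c("female", "male"), ]

# 2. Drop exact duplicate rows
people <- unique(people)

# 3. Derive age at the time of the award (assumed definition)
people$winner_age <- people$award_year - people$birth_year

print(people)
```

Scripting the steps this way makes the cleaning repeatable if the source file is downloaded again.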

Data Structure

Subject Matter: Nobel Prize Winners by Name
Number of Rows: 884
Number of Columns: 12
Columns I utilized: ID, firstname, surname, birth year, gender, winner age

Winner Ages

Summary Statistics for Winner Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   50.00   60.00   59.43   69.00   90.00
Mean: 59.4276
Standard Deviation: 12.3864
Variance: 153.4228
Number of Observations: 884
Quartiles (25%, 50%, 75%): 50, 60, 69
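
The figures above appear to come from the usual base R helpers (`summary`, `mean`, `sd`, `var`, `length`, `quantile`). As a quick consistency check, the sketch below works only from the reported numbers, not the raw data, and the variable names are illustrative assumptions. It confirms that the variance is the squared standard deviation and derives the interquartile range and the standard error of the mean.

```r
# Figures copied from the summary output above (not recomputed from the raw data)
age_sd  <- 12.3864
age_var <- 153.4228
n       <- 884
q       <- c("25%" = 50, "50%" = 60, "75%" = 69)

# The variance should be the square of the standard deviation (up to rounding)
sd_squared <- age_sd^2

# Interquartile range: spread of the middle half of the winner ages
iqr <- unname(q["75%"] - q["25%"])

# Standard error of the mean: how precisely the mean age is estimated
se_mean <- age_sd / sqrt(n)

cat("sd^2 =", round(sd_squared, 4), "vs reported variance", age_var, "\n")
cat("IQR =", iqr, "\n")
cat("SE of mean =", round(se_mean, 3), "\n")
```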

Nobel Prize Ages Histogram

Bivariate Comparison of 2 Variables

Exploring Gender in Dataset

Gender Counts

Gender Count
female 49
male 835
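
To put these counts in proportion, the following sketch converts them to percentages. It uses only the two counts shown above; with the raw data, a call such as `table(df$gender)` would presumably produce the same counts, although that is an assumption about how the table was built.

```r
# Counts taken from the table above
gender_counts <- c(female = 49, male = 835)

# Total should match the number of rows in the cleaned data set
total <- sum(gender_counts)

# Percentage of winners by gender, rounded to one decimal place
shares <- round(100 * gender_counts / total, 1)

print(total)
print(shares)
```

Women make up roughly 5.5% of the individual winners in this data set, which is the imbalance the rest of the analysis explores.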

Exploring Gender and Category Variables

Chi-Square Test

In exploring my data, I wanted to see whether there is a statistically significant association between a winner’s gender and the category in which the Nobel Prize was won. The Chi-Square test returns a very small p-value (2.429e-08), so there is strong evidence against the null hypothesis of independence: gender and Nobel Prize category are not independent of each other.

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 43.898, df = 5, p-value = 2.429e-08
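
The reported statistic and degrees of freedom can be sanity-checked without the underlying contingency table. The sketch below takes the printed values as given (the variable names are illustrative assumptions) and uses base R's `pchisq` to compute the upper-tail probability, which should come out on the order of 1e-08, in line with the very small p-value in the output. It also shows why df = 5 is consistent with two gender levels and six prize categories.

```r
# Values copied from the test output above
x_squared <- 43.898
df_test   <- 5

# Degrees of freedom for an r x c table are (r - 1) * (c - 1).
# With two gender levels, df = 5 implies six prize categories.
n_gender     <- 2
n_categories <- df_test / (n_gender - 1) + 1

# Upper-tail probability of the chi-squared distribution at the observed statistic
p_value <- pchisq(x_squared, df = df_test, lower.tail = FALSE)

cat("Implied number of categories:", n_categories, "\n")
cat("Upper-tail probability:", signif(p_value, 3), "\n")
```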

ANOVA for Age of Winner and Gender

I also ran an ANOVA to see whether there was any significant difference in the mean age of winners between the female and male groups. The p-value is 0.407, which is greater than the typical significance level of 0.05. Therefore, we fail to reject the null hypothesis: the data do not show a significant difference in the mean ages of winners between the “female” and “male” groups.

##              Df Sum Sq Mean Sq F value Pr(>F)
## gender        1    106   105.7   0.689  0.407
## Residuals   882 135367   153.5
## Call:
##    aov(formula = Winner.Age ~ gender, data = df)
## 
## Terms:
##                    gender Residuals
## Sum of Squares     105.72 135366.64
## Deg. of Freedom         1       882
## 
## Residual standard error: 12.38858
## Estimated effects may be unbalanced
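
The pieces of the ANOVA table fit together arithmetically, and that can be checked from the printed sums of squares alone. The sketch below is a worked check, not the code that produced the output; the variable names are illustrative assumptions. It rebuilds the F value, the p-value, and the residual standard error, and confirms that the total sum of squares divided by n - 1 recovers the overall variance of winner age reported earlier (153.4228).

```r
# Values copied from the ANOVA output above
ss_gender <- 105.72
ss_resid  <- 135366.64
df_gender <- 1
df_resid  <- 882

# Mean squares are sums of squares divided by their degrees of freedom
ms_gender <- ss_gender / df_gender
ms_resid  <- ss_resid / df_resid

# F statistic and its upper-tail probability under the F distribution
f_value <- ms_gender / ms_resid
p_value <- pf(f_value, df_gender, df_resid, lower.tail = FALSE)

# Residual standard error is the square root of the residual mean square
resid_se <- sqrt(ms_resid)

# Total sum of squares over (n - 1) should match the overall variance of age
total_var <- (ss_gender + ss_resid) / (df_gender + df_resid)

cat("F =", round(f_value, 3), " p =", round(p_value, 3), "\n")
cat("Residual SE =", round(resid_se, 5), "\n")
cat("Total variance =", round(total_var, 4), "\n")
```

Because only two groups are compared, this one-way ANOVA is equivalent to a two-sample t-test with equal variances; the F value is the square of that t statistic.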