About the data

The data covers the Best Actor and Best Actress Oscar winners from 1929 to 2018. The variables are the columns shown in the preview under "Load the data".

Data Source

The data comes from the Journal of Statistics Education (http://jse.amstat.org/datasets/oscars.dat.txt), updated through 2019 using information from Oscars.org and Wikipedia.org. The copy used here was taken from https://www.openintro.org/data/index.php?data=oscars.

Load the data

The dataset is stored as a csv file in a GitHub repository and loaded here with the read.csv function. The kable function from the knitr package is used to display data tables.

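# load the packages used in this report
library(dplyr)    # %>%, rename, mutate, group_by, summarise, filter
library(knitr)    # kable
library(ggplot2)  # plots

# raw GitHub URL of the csv file
theURL <- "https://raw.githubusercontent.com/amit-kapoor/data607/master/week6/oscdata.csv"

# read the data from csv
oscars_df <- read.csv(theURL)

# show initial rows through head
kable(head(oscars_df), align = "l")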
| oscar_no | oscar_yr | award | name | movie | age | birth_pl | birth_date | birth_mo | birth_d | birth_y |
|:---------|:---------|:------|:-----|:------|:----|:---------|:-----------|:---------|:--------|:--------|
| 1 | 1929 | Best actress | Janet Gaynor | 7th Heaven | 22 | Pennsylvania | 1906-10-06 | 10 | 6 | 1906 |
| 2 | 1930 | Best actress | Mary Pickford | Coquette | 37 | Canada | 1892-04-08 | 4 | 8 | 1892 |
| 3 | 1931 | Best actress | Norma Shearer | The Divorcee | 28 | Canada | 1902-08-10 | 8 | 10 | 1902 |
| 4 | 1932 | Best actress | Marie Dressler | Min and Bill | 63 | Canada | 1868-11-09 | 11 | 9 | 1868 |
| 5 | 1933 | Best actress | Helen Hayes | The Sin of Madelon Claudet | 32 | Washington DC | 1900-10-10 | 10 | 10 | 1900 |
| 6 | 1934 | Best actress | Katharine Hepburn | Morning Glory | 26 | Connecticut | 1907-05-12 | 5 | 12 | 1907 |
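
The tibble output later in this report shows text columns such as name and birth_date stored as factors. A minimal sketch of reading them as plain text and parsing the birth date is shown here. It uses an inline csv string (csv_text and toy_df are illustrative names, and the comma-separated layout is an assumption about the file) so that it runs without the GitHub URL.

```r
# Toy csv text standing in for oscdata.csv; the two rows are copied from the
# preview above, and the comma-separated layout is an assumption
csv_text <- "oscar_no,oscar_yr,award,name,movie,age,birth_pl,birth_date,birth_mo,birth_d,birth_y
1,1929,Best actress,Janet Gaynor,7th Heaven,22,Pennsylvania,1906-10-06,10,6,1906
2,1930,Best actress,Mary Pickford,Coquette,37,Canada,1892-04-08,4,8,1892"

# stringsAsFactors = FALSE keeps text columns as character vectors
toy_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# parse the ISO-formatted birth date into a Date column
toy_df$birth_date <- as.Date(toy_df$birth_date, format = "%Y-%m-%d")

str(toy_df)
```

Whether this is needed depends on the R version: from R 4.0.0 onward read.csv no longer converts strings to factors by default.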

Tidying the data

Approach: My goal is to clean the data and create new columns for further analysis.

# rename columns (new_name = old_name)
oscars_df <- oscars_df %>% 
  rename(
    oscar_year = oscar_yr,
    birth_place = birth_pl,
    birth_month = birth_mo,
    birth_day = birth_d,
    birth_year = birth_y
  )

# show initial rows through head
kable(head(oscars_df), align = "l")
| oscar_no | oscar_year | award | name | movie | age | birth_place | birth_date | birth_month | birth_day | birth_year |
|:---------|:-----------|:------|:-----|:------|:----|:------------|:-----------|:------------|:----------|:-----------|
| 1 | 1929 | Best actress | Janet Gaynor | 7th Heaven | 22 | Pennsylvania | 1906-10-06 | 10 | 6 | 1906 |
| 2 | 1930 | Best actress | Mary Pickford | Coquette | 37 | Canada | 1892-04-08 | 4 | 8 | 1892 |
| 3 | 1931 | Best actress | Norma Shearer | The Divorcee | 28 | Canada | 1902-08-10 | 8 | 10 | 1902 |
| 4 | 1932 | Best actress | Marie Dressler | Min and Bill | 63 | Canada | 1868-11-09 | 11 | 9 | 1868 |
| 5 | 1933 | Best actress | Helen Hayes | The Sin of Madelon Claudet | 32 | Washington DC | 1900-10-10 | 10 | 10 | 1900 |
| 6 | 1934 | Best actress | Katharine Hepburn | Morning Glory | 26 | Connecticut | 1907-05-12 | 5 | 12 | 1907 |

Load the state data from the built-in R datasets package. Its state.name vector is used to check the values in the birth_place column, where most of the entries are US states.

data(state)
state.name
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Create a new column from_US with the values yes or no. If the value in the birth_place column is a valid US state, from_US is yes, otherwise no. The value is then set to yes where the birth place is Washington DC, which is not in state.name.

# create new column
oscars_df <- oscars_df %>% 
  mutate(from_US = ifelse(birth_place %in% state.name,"yes", "no"))

# Washington DC is not in state.name, so set from_US to yes for it explicitly
oscars_df$from_US <- ifelse(oscars_df$birth_place == "Washington DC", "yes", oscars_df$from_US)

kable(head(oscars_df))
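| oscar_no | oscar_year | award | name | movie | age | birth_place | birth_date | birth_month | birth_day | birth_year | from_US |
|---------:|-----------:|:------|:-----|:------|----:|:------------|:-----------|------------:|----------:|-----------:|:--------|
| 1 | 1929 | Best actress | Janet Gaynor | 7th Heaven | 22 | Pennsylvania | 1906-10-06 | 10 | 6 | 1906 | yes |
| 2 | 1930 | Best actress | Mary Pickford | Coquette | 37 | Canada | 1892-04-08 | 4 | 8 | 1892 | no |
| 3 | 1931 | Best actress | Norma Shearer | The Divorcee | 28 | Canada | 1902-08-10 | 8 | 10 | 1902 | no |
| 4 | 1932 | Best actress | Marie Dressler | Min and Bill | 63 | Canada | 1868-11-09 | 11 | 9 | 1868 | no |
| 5 | 1933 | Best actress | Helen Hayes | The Sin of Madelon Claudet | 32 | Washington DC | 1900-10-10 | 10 | 10 | 1900 | yes |
| 6 | 1934 | Best actress | Katharine Hepburn | Morning Glory | 26 | Connecticut | 1907-05-12 | 5 | 12 | 1907 | yes |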
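
Because from_US depends on exact matches against state.name, it can help to list the birth places that do not match any state and look for US locations among them, as happened with Washington DC. The sketch below uses a small hypothetical birth_place vector rather than the full dataset.

```r
# Hypothetical birth places, including the non-state label "Washington DC"
birth_place <- c("Pennsylvania", "Canada", "Washington DC", "Connecticut", "England")

# values that do not match any of the 50 names in state.name
# (state.name comes from the datasets package, which R attaches by default)
not_states <- setdiff(unique(birth_place), state.name)
print(not_states)
```

On the real data the same call would be `setdiff(unique(oscars_df$birth_place), state.name)`, and any US territory or district in the result would need the same manual handling as Washington DC.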

Next, we create a new data frame with each name and the number of times that person won the award, then keep only the names that won more than once.

# group by name and then summarize
oscars_grybyname_df <- oscars_df %>% 
  group_by(name) %>% 
  summarise(n = n())

# get the names that won more than once
oscars_grybyname_df[oscars_grybyname_df$n>1,]
## # A tibble: 22 x 2
##    name                  n
##    <fct>             <int>
##  1 Bette Davis           2
##  2 Daniel Day-Lewis      3
##  3 Dustin Hoffman        2
##  4 Frances McDormand     2
##  5 Fredric March         2
##  6 Gary Cooper           2
##  7 Glenda Jackson        2
##  8 Hilary Swank          2
##  9 Ingrid Bergman        2
## 10 Jack Nicholson        2
## # … with 12 more rows
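
The same count can be written more compactly with dplyr's count function, which combines group_by and summarise(n = n()). This is a sketch on hypothetical toy rows (toy_df and repeat_winners are illustrative names), not a replacement for the code above.

```r
library(dplyr)

# Toy data: one row per award won (hypothetical subset)
toy_df <- data.frame(
  name = c("Bette Davis", "Bette Davis", "Janet Gaynor",
           "Daniel Day-Lewis", "Daniel Day-Lewis", "Daniel Day-Lewis"),
  stringsAsFactors = FALSE
)

# count() tallies rows per name; sort = TRUE puts the largest counts first
repeat_winners <- toy_df %>%
  count(name, sort = TRUE) %>%
  filter(n > 1)

print(repeat_winners)
```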

Below are the full details for the actors and actresses whose names appear in the dataset more than once.

oscars_df %>% group_by(name) %>% filter(n() > 1)
## # A tibble: 47 x 12
## # Groups:   name [22]
##    oscar_no oscar_year award name  movie   age birth_place birth_date
##       <int>      <int> <fct> <fct> <fct> <int> <fct>       <fct>     
##  1        6       1934 Best… Kath… Morn…    26 Connecticut 1907-05-12
##  2        8       1936 Best… Bett… Dang…    27 Massachuse… 1908-04-05
##  3        9       1937 Best… Luis… The …    26 Germany     1910-01-12
##  4       10       1938 Best… Luis… The …    27 Germany     1910-01-12
##  5       11       1939 Best… Bett… Jeze…    30 Massachuse… 1908-04-05
##  6       12       1940 Best… Vivi… Gone…    26 India       1913-11-05
##  7       17       1945 Best… Ingr… Gasl…    29 Sweden      1915-08-29
##  8       19       1947 Best… Oliv… To E…    30 Japan       1916-07-01
##  9       22       1950 Best… Oliv… The …    33 Japan       1916-07-01
## 10       24       1952 Best… Vivi… A St…    38 India       1913-11-05
## # … with 37 more rows, and 4 more variables: birth_month <int>,
## #   birth_day <int>, birth_year <int>, from_US <chr>

Next, we create labels for ten-year age ranges, which are used below to fill a new column based on the age of each actor or actress.

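# upper bounds must run to 100 so that both sequences have ten elements;
# with seq(10, 99, by = 10) paste() recycles and yields the label "90-10"
ranges <- c(paste(seq(0, 90, by = 10), seq(10, 100, by = 10), sep = "-"), paste(100, "+", sep = ""))
ranges
##  [1] "0-10"   "10-20"  "20-30"  "30-40"  "40-50"  "50-60"  "60-70" 
##  [8] "70-80"  "80-90"  "90-100" "100+"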

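To assign each winner to an age group, we create a new column age_group with the cut function, using the ranges defined in the previous step as labels. With right = FALSE each interval includes its lower bound and excludes its upper bound, so an age of 30 falls in "30-40".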

oscars_df$age_group <- cut(oscars_df$age, breaks = c(seq(0, 100, by = 10), Inf), labels = ranges, right = FALSE)
head(oscars_df)
##   oscar_no oscar_year        award              name
## 1        1       1929 Best actress      Janet Gaynor
## 2        2       1930 Best actress     Mary Pickford
## 3        3       1931 Best actress     Norma Shearer
## 4        4       1932 Best actress    Marie Dressler
## 5        5       1933 Best actress       Helen Hayes
## 6        6       1934 Best actress Katharine Hepburn
##                        movie age   birth_place birth_date birth_month
## 1                 7th Heaven  22  Pennsylvania 1906-10-06          10
## 2                   Coquette  37        Canada 1892-04-08           4
## 3               The Divorcee  28        Canada 1902-08-10           8
## 4               Min and Bill  63        Canada 1868-11-09          11
## 5 The Sin of Madelon Claudet  32 Washington DC 1900-10-10          10
## 6              Morning Glory  26   Connecticut 1907-05-12           5
##   birth_day birth_year from_US age_group
## 1         6       1906     yes     20-30
## 2         8       1892      no     30-40
## 3        10       1902      no     20-30
## 4         9       1868      no     60-70
## 5        10       1900     yes     30-40
## 6        12       1907     yes     20-30
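
Since the labels share their end points ("20-30", "30-40"), the right argument of cut decides where a boundary age such as 30 lands. The following sketch compares both settings on a few hypothetical ages; ages, age_labels, left_closed and right_closed are illustrative names.

```r
# Hypothetical ages sitting on and next to the interval boundaries
ages <- c(29, 30, 39, 40)
age_labels <- c("20-30", "30-40", "40-50")
age_breaks <- c(20, 30, 40, 50)

# right = FALSE: intervals are [20,30), [30,40), [40,50)
left_closed <- cut(ages, breaks = age_breaks, labels = age_labels, right = FALSE)

# right = TRUE (the default): intervals are (20,30], (30,40], (40,50]
right_closed <- cut(ages, breaks = age_breaks, labels = age_labels, right = TRUE)

print(data.frame(ages, left_closed, right_closed))
```

With right = FALSE, as used in this report, a 30-year-old is counted in "30-40"; with the default they would be counted in "20-30".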

Data Analysis

Approach: The plots below summarize the oscars dataset by birth place, age group and award type.

# number of winners per birth place (horizontal bars)
ggplot(oscars_df, aes(birth_place, fill=birth_place))+
  geom_bar() + coord_flip()

# number of winners per age group
ggplot(oscars_df, aes(age_group, fill=age_group))+
  geom_bar() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

# share of best actor vs best actress awards within each birth place
ggplot(oscars_df, aes(birth_place, fill = award)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

# number of awards per birth place, split by award type
ggplot(oscars_df, aes(birth_place, fill=award))+
  geom_bar() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

# share of best actor vs best actress awards for US-born and non-US-born winners
ggplot(oscars_df, aes(from_US, fill = award)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)
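
The bar charts can be backed by a small table giving the number of wins per birth place and how many distinct award types occur there. The sketch below uses hypothetical toy rows (toy_df, place_summary and single_award_places are illustrative names); the counts are not taken from the real data.

```r
library(dplyr)

# Toy data: hypothetical birth place / award combinations
toy_df <- data.frame(
  birth_place = c("England", "England", "England", "Iowa", "California", "California"),
  award = c("Best actor", "Best actress", "Best actor", "Best actor",
            "Best actress", "Best actor"),
  stringsAsFactors = FALSE
)

# wins per birth place and the number of distinct award types seen there
place_summary <- toy_df %>%
  group_by(birth_place) %>%
  summarise(wins = n(), award_types = n_distinct(award)) %>%
  arrange(desc(wins))

# birth places that appear with only one kind of award
single_award_places <- filter(place_summary, award_types == 1)

print(place_summary)
print(single_award_places)
```

Applied to oscars_df, the first table would give the ranking of birth places and the second the places that appear with only best actor or only best actress awards.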

On further analysis, two years contain a tie: 1933 has two best actor winners (Fredric March and Wallace Beery) and 1969 has two best actress winners (Barbra Streisand and Katharine Hepburn).

# oscar_year is an integer column, so compare against numbers rather than strings
filter(oscars_df, oscar_year %in% c(1933, 1969))
##   oscar_no oscar_year        award              name
## 1        5       1933 Best actress       Helen Hayes
## 2       41       1969 Best actress  Barbra Streisand
## 3       41       1969 Best actress Katharine Hepburn
## 4        5       1933   Best actor     Fredric March
## 5        5       1933   Best actor     Wallace Beery
## 6       41       1969   Best actor   Cliff Robertson
##                        movie age   birth_place birth_date birth_month
## 1 The Sin of Madelon Claudet  32 Washington DC 1900-10-10          10
## 2                 Funny Girl  26      New York 1942-04-24           4
## 3         The Lion in Winter  61   Connecticut 1907-05-12           5
## 4    Dr. Jekyll and Mr. Hyde  35     Wisconsin 1897-08-31           8
## 5                  The Champ  47      Missouri 1885-04-01           4
## 6                     Charly  43    California 1925-09-09           9
##   birth_day birth_year from_US age_group
## 1        10       1900     yes     30-40
## 2        24       1942     yes     20-30
## 3        12       1907     yes     60-70
## 4        31       1897     yes     30-40
## 5         1       1885     yes     40-50
## 6         9       1925     yes     40-50
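
Instead of filtering on known years, ties can be found directly by looking for year and award combinations with more than one row. This is a minimal sketch on four rows copied from the tables in this report (toy_df and ties are illustrative names).

```r
library(dplyr)

# Four rows taken from the tables above: 1933 has two best actor winners
toy_df <- data.frame(
  oscar_year = c(1932, 1933, 1933, 1933),
  award = c("Best actress", "Best actress", "Best actor", "Best actor"),
  name = c("Marie Dressler", "Helen Hayes", "Fredric March", "Wallace Beery"),
  stringsAsFactors = FALSE
)

# keep only the year/award groups that contain more than one winner
ties <- toy_df %>%
  group_by(oscar_year, award) %>%
  filter(n() > 1) %>%
  ungroup()

print(ties)
```

The same pipeline on oscars_df would return every tied award without the years having to be known in advance.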

Summary/Conclusion

From the analysis above, the most common birth places among winners are England, California and New York. The age group with the most winners is 30 to 40 years. Some birth places, such as Australia, Iowa and Hungary, appear only with best actor awards, and likewise some appear only with best actress awards. Finally, 1933 (best actor) and 1969 (best actress) were the years in which an award was tied.