About the data

The data covers the best actor and best actress Oscar winners from 1929 to 2018. The variables in the data are described below.

oscar_no - Oscar ceremony number.
oscar_yr - Year the Oscar ceremony was held.
award - Best actress or Best actor.
name - Name of winning actor or actress.
movie - Name of movie actor or actress got the Oscar for.
age - Age at which the actor or actress won the Oscar.
birth_pl - US State where the actor or actress was born, country if foreign.
birth_date - Full birth date of actor or actress (YYYY-MM-DD).
birth_mo - Birth month of actor or actress.
birth_d - Birth day of actor or actress.
birth_y - Birth year of actor or actress.
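As a quick sanity check on these variables, the birth_date column that appears in the loaded table can be compared with the separate month, day and year columns. The sketch below is only an illustration: sample_df, rebuilt and dates_match are hypothetical names, and the three rows are copied by hand from the table shown in this document rather than read from the file.

```r
# Three rows copied by hand from the table shown in this document
# (sample_df is a hypothetical name used only for this sketch)
sample_df <- data.frame(
  name       = c("Janet Gaynor", "Mary Pickford", "Norma Shearer"),
  birth_date = c("1906-10-06", "1892-04-08", "1902-08-10"),
  birth_mo   = c(10, 4, 8),
  birth_d    = c(6, 8, 10),
  birth_y    = c(1906, 1892, 1902),
  stringsAsFactors = FALSE
)

# Rebuild a YYYY-MM-DD string from the three component columns
rebuilt <- sprintf("%04d-%02d-%02d",
                   as.integer(sample_df$birth_y),
                   as.integer(sample_df$birth_mo),
                   as.integer(sample_df$birth_d))

# TRUE for every row whose components agree with birth_date
dates_match <- rebuilt == sample_df$birth_date
dates_match
```

On the full dataset the same comparison could be run on oscars_df, assuming birth_date is first converted with as.character if it was read as a factor.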

Data Source

The data source is the Journal of Statistics Education, http://jse.amstat.org/datasets/oscars.dat.txt, updated through 2019 using information from Oscars.org and Wikipedia.org. Here it was taken from https://www.openintro.org/data/index.php?data=oscars.

Load the data

The dataset is stored as a csv file in a GitHub repository and loaded here with the read.csv function. The kable function from the knitr package is used to display data tables.

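# load the packages used in this document
library(knitr)    # kable
library(dplyr)    # %>%, rename, mutate, group_by, summarise, filter
library(ggplot2)  # ggplot, geom_bar

# raw GitHub URL of the csv file
theURL <- "https://raw.githubusercontent.com/amit-kapoor/data607/master/week6/oscdata.csv"

# read the data from csv
oscars_df <- read.csv(theURL)

# show initial rows through head
kable(head(oscars_df), align = "l")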

oscar_no	oscar_yr	award	name	movie	age	birth_pl	birth_date	birth_mo	birth_d	birth_y
1	1929	Best actress	Janet Gaynor	7th Heaven	22	Pennsylvania	1906-10-06	10	6	1906
2	1930	Best actress	Mary Pickford	Coquette	37	Canada	1892-04-08	4	8	1892
3	1931	Best actress	Norma Shearer	The Divorcee	28	Canada	1902-08-10	8	10	1902
4	1932	Best actress	Marie Dressler	Min and Bill	63	Canada	1868-11-09	11	9	1868
5	1933	Best actress	Helen Hayes	The Sin of Madelon Claudet	32	Washington DC	1900-10-10	10	10	1900
6	1934	Best actress	Katharine Hepburn	Morning Glory	26	Connecticut	1907-05-12	5	12	1907
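The tibble output later in this document lists the text columns as factors, which suggests read.csv converted strings to factors (the default before R 4.0.0). A minimal sketch of making that choice explicit is shown below. It inlines two rows from the table above through the text argument so that it runs without network access; csv_text, demo_df and col_classes are hypothetical names.

```r
# Two rows copied from the table above, inlined so the sketch needs no network
csv_text <- "oscar_no,oscar_yr,award,name,birth_pl
1,1929,Best actress,Janet Gaynor,Pennsylvania
2,1930,Best actress,Mary Pickford,Canada"

# Ask for character columns explicitly so the result is the same on any R version
demo_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Column classes: integer for the numeric fields, character for the text fields
col_classes <- sapply(demo_df, class)
col_classes
```

Whether factors or character columns are preferable depends on the later steps; the comparisons against state.name used in this document work with either.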

Tidying the data

Approach: My goal is to clean the data and create new columns for further analysis.

Rename the columns whose names are abbreviated: oscar_yr, birth_pl, birth_mo, birth_d and birth_y.
Load the built-in R dataset state (from the datasets package). Its state.name vector is compared with the values in the birth_place column, where most of the entries are US states.
Use state.name to create a new column ‘from_US’ in the oscars dataset with the values yes or no: yes if the value in birth_place is a valid US state, otherwise no. This column is used later in the data analysis to see whether the best actor/actress winners were born in the US.
Create a summary dataframe to count the actors/actresses who received the Oscar more than once.
Create a new column age_group holding the age range that corresponds to the age of the actor or actress. This column is used later in the data analysis to find which age range has the most best actor/actress winners.

# rename the abbreviated columns (new_name = old_name)
oscars_df <- oscars_df %>% 
  rename(oscar_year  = oscar_yr,
         birth_place = birth_pl,
         birth_month = birth_mo,
         birth_day   = birth_d,
         birth_year  = birth_y)

# show initial rows through head
kable(head(oscars_df), align = "l")

oscar_no	oscar_year	award	name	movie	age	birth_place	birth_date	birth_month	birth_day	birth_year
1	1929	Best actress	Janet Gaynor	7th Heaven	22	Pennsylvania	1906-10-06	10	6	1906
2	1930	Best actress	Mary Pickford	Coquette	37	Canada	1892-04-08	4	8	1892
3	1931	Best actress	Norma Shearer	The Divorcee	28	Canada	1902-08-10	8	10	1902
4	1932	Best actress	Marie Dressler	Min and Bill	63	Canada	1868-11-09	11	9	1868
5	1933	Best actress	Helen Hayes	The Sin of Madelon Claudet	32	Washington DC	1900-10-10	10	10	1900
6	1934	Best actress	Katharine Hepburn	Morning Glory	26	Connecticut	1907-05-12	5	12	1907

Load the built-in R dataset state (from the datasets package). Its state.name vector is used to check the values in the birth_place column, where most of the entries are US states.

data(state)
state.name

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Create a new column ‘from_US’ with the values yes or no. If the value in the birth_place column is a valid US state, from_US is yes, otherwise no. Washington DC is not in state.name, so its value is then set to yes separately.

# create new column
oscars_df <- oscars_df %>% 
  mutate(from_US = ifelse(birth_place %in% state.name,"yes", "no"))

# put yes for birth place as Washington DC
oscars_df["from_US"] <- ifelse(oscars_df$birth_place =="Washington DC", "yes",oscars_df$from_US)

kable(head(oscars_df))

oscar_no	oscar_year	award	name	movie	age	birth_place	birth_date	birth_month	birth_day	birth_year	from_US
1	1929	Best actress	Janet Gaynor	7th Heaven	22	Pennsylvania	1906-10-06	10	6	1906	yes
2	1930	Best actress	Mary Pickford	Coquette	37	Canada	1892-04-08	4	8	1892	no
3	1931	Best actress	Norma Shearer	The Divorcee	28	Canada	1902-08-10	8	10	1902	no
4	1932	Best actress	Marie Dressler	Min and Bill	63	Canada	1868-11-09	11	9	1868	no
5	1933	Best actress	Helen Hayes	The Sin of Madelon Claudet	32	Washington DC	1900-10-10	10	10	1900	yes
6	1934	Best actress	Katharine Hepburn	Morning Glory	26	Connecticut	1907-05-12	5	12	1907	yes
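The two steps above could also be folded into one by adding Washington DC to the lookup vector before the comparison. The following is a minimal base R sketch of that idea, not the code used for the results in this document; us_places is a hypothetical name and the birth places are copied from the rows shown above.

```r
# state.name ships with base R (datasets package); add the one non-state US entry
us_places <- c(state.name, "Washington DC")

# Birth places copied from the rows shown above
birth_place <- c("Pennsylvania", "Canada", "Washington DC", "Connecticut")

# yes when the birth place is a US state or Washington DC, otherwise no
from_US <- ifelse(birth_place %in% us_places, "yes", "no")
from_US
```

The match is exact, so this assumes the dataset always spells the capital as "Washington DC"; any other spelling would be labelled no.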

Next I created a new dataframe holding each name and the number of times he/she won the award, then filtered it to the names that won more than once.

# group by name and then summarize
oscars_grybyname_df <- oscars_df %>% 
  group_by(name) %>% 
  summarise(n = n())

# get the name who won more than once
oscars_grybyname_df[oscars_grybyname_df$n>1,]

## # A tibble: 22 x 2
##    name                  n
##    <fct>             <int>
##  1 Bette Davis           2
##  2 Daniel Day-Lewis      3
##  3 Dustin Hoffman        2
##  4 Frances McDormand     2
##  5 Fredric March         2
##  6 Gary Cooper           2
##  7 Glenda Jackson        2
##  8 Hilary Swank          2
##  9 Ingrid Bergman        2
## 10 Jack Nicholson        2
## # … with 12 more rows
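For comparison, the same kind of count can be obtained in base R with table. The sketch below uses only a handful of names copied from rows shown elsewhere in this document, so its counts describe that hand-picked subset and not the full dataset; winner_names, wins and repeat_winners are hypothetical names.

```r
# Names copied from rows shown elsewhere in this document (a small subset only)
winner_names <- c("Helen Hayes", "Katharine Hepburn", "Barbra Streisand",
                  "Katharine Hepburn", "Fredric March", "Wallace Beery")

# Count how often each name occurs
wins <- table(winner_names)

# Keep the names that occur more than once, most frequent first
repeat_winners <- sort(wins[wins > 1], decreasing = TRUE)
repeat_winners
```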

Below are the full rows for the actors and actresses whose names appear in the dataset more than once.

oscars_df %>% group_by(name) %>% filter(n() > 1)

## # A tibble: 47 x 12
## # Groups:   name [22]
##    oscar_no oscar_year award name  movie   age birth_place birth_date
##       <int>      <int> <fct> <fct> <fct> <int> <fct>       <fct>     
##  1        6       1934 Best… Kath… Morn…    26 Connecticut 1907-05-12
##  2        8       1936 Best… Bett… Dang…    27 Massachuse… 1908-04-05
##  3        9       1937 Best… Luis… The …    26 Germany     1910-01-12
##  4       10       1938 Best… Luis… The …    27 Germany     1910-01-12
##  5       11       1939 Best… Bett… Jeze…    30 Massachuse… 1908-04-05
##  6       12       1940 Best… Vivi… Gone…    26 India       1913-11-05
##  7       17       1945 Best… Ingr… Gasl…    29 Sweden      1915-08-29
##  8       19       1947 Best… Oliv… To E…    30 Japan       1916-07-01
##  9       22       1950 Best… Oliv… The …    33 Japan       1916-07-01
## 10       24       1952 Best… Vivi… A St…    38 India       1913-11-05
## # … with 37 more rows, and 4 more variables: birth_month <int>,
## #   birth_day <int>, birth_year <int>, from_US <chr>

In the next step, I created the age range labels, which are used to fill a new column based on the age of the actor/actress.

# upper bounds run to 100 so that both sequences have 10 elements
ranges <- c(paste(seq(0, 90, by = 10), seq(10, 100, by = 10), sep = "-"), paste(100, "+", sep = ""))
ranges

##  [1] "0-10"   "10-20"  "20-30"  "30-40"  "40-50"  "50-60"  "60-70" 
##  [8] "70-80"  "80-90"  "90-100" "100+"

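To assign actor/actress ages to age groups, we create a new column age_group and use the cut function to bin age into the ranges defined in the previous step. With right = FALSE each interval includes its lower bound and excludes its upper bound, so an age of 30 falls in the 30-40 group.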

oscars_df$age_group <- cut(oscars_df$age, breaks = c(seq(0, 100, by = 10), Inf), labels = ranges, right = FALSE)
head(oscars_df)

##   oscar_no oscar_year        award              name
## 1        1       1929 Best actress      Janet Gaynor
## 2        2       1930 Best actress     Mary Pickford
## 3        3       1931 Best actress     Norma Shearer
## 4        4       1932 Best actress    Marie Dressler
## 5        5       1933 Best actress       Helen Hayes
## 6        6       1934 Best actress Katharine Hepburn
##                        movie age   birth_place birth_date birth_month
## 1                 7th Heaven  22  Pennsylvania 1906-10-06          10
## 2                   Coquette  37        Canada 1892-04-08           4
## 3               The Divorcee  28        Canada 1902-08-10           8
## 4               Min and Bill  63        Canada 1868-11-09          11
## 5 The Sin of Madelon Claudet  32 Washington DC 1900-10-10          10
## 6              Morning Glory  26   Connecticut 1907-05-12           5
##   birth_day birth_year from_US age_group
## 1         6       1906     yes     20-30
## 2         8       1892      no     30-40
## 3        10       1902      no     20-30
## 4         9       1868      no     60-70
## 5        10       1900     yes     30-40
## 6        12       1907     yes     20-30
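Because neighbouring labels share a boundary value, it helps to see where boundary ages land. The sketch below is a small illustration under the same breaks as above, with right = FALSE; the labels are rebuilt inside the sketch so that it is self-contained, and the ages are made-up test values rather than data from the file.

```r
# Same breaks as above: 0, 10, ..., 100, Inf
breaks <- c(seq(0, 100, by = 10), Inf)

# One label per interval (11 intervals for 12 breaks)
labels <- c(paste(seq(0, 90, by = 10), seq(10, 100, by = 10), sep = "-"), "100+")

# Made-up ages on and next to interval boundaries
ages <- c(29, 30, 39, 40)

# right = FALSE gives intervals of the form [lower, upper)
groups <- as.character(cut(ages, breaks = breaks, labels = labels, right = FALSE))
groups
```

An age equal to a boundary is therefore counted in the group that starts at that value, not the one that ends there.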

Data Analysis

Approach: To analyze the data, I used the plots described below to draw insights from the oscars dataset.

The first plot is a bar chart of the number of winners for each birth_place.
The second plot shows the number of Oscar awards in each age_group.
The third plot shows the share of best actor and best actress awards for each birth place, and the fourth shows the same split as counts.
The last plot shows the share of best actor and best actress awards by the from_US column.

ggplot(oscars_df, aes(birth_place, fill=birth_place))+
  geom_bar() + coord_flip()

ggplot(oscars_df, aes(age_group, fill=age_group))+
  geom_bar() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(oscars_df, aes(birth_place, fill = award)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(oscars_df, aes(birth_place, fill=award))+
  geom_bar() + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

ggplot(oscars_df, aes(from_US, fill = award)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)
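The shares drawn by position = "fill" in the last plot can also be computed as numbers with table and prop.table. The sketch below is illustrative only: sample_df, counts and shares are hypothetical names, and the six rows are copied by hand from tables shown in this document, so the resulting shares describe that small subset and say nothing about the full dataset.

```r
# Six rows copied by hand from tables in this document (not the full dataset):
# Gaynor, Pickford, Shearer, March, Beery, Robertson
sample_df <- data.frame(
  award   = c("Best actress", "Best actress", "Best actress",
              "Best actor", "Best actor", "Best actor"),
  from_US = c("yes", "no", "no", "yes", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are from_US, columns are award
counts <- table(sample_df$from_US, sample_df$award)

# Convert each row to shares that sum to 1, as position = "fill" does per bar
shares <- prop.table(counts, margin = 1)
shares
```

Running the same two calls on oscars_df$from_US and oscars_df$award would give the values behind the bars of the last plot.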

On further analysis, I noticed ties in two years: 1933 for best actor and 1969 for best actress.

filter(oscars_df, oscar_year %in% c(1933, 1969))

##   oscar_no oscar_year        award              name
## 1        5       1933 Best actress       Helen Hayes
## 2       41       1969 Best actress  Barbra Streisand
## 3       41       1969 Best actress Katharine Hepburn
## 4        5       1933   Best actor     Fredric March
## 5        5       1933   Best actor     Wallace Beery
## 6       41       1969   Best actor   Cliff Robertson
##                        movie age   birth_place birth_date birth_month
## 1 The Sin of Madelon Claudet  32 Washington DC 1900-10-10          10
## 2                 Funny Girl  26      New York 1942-04-24           4
## 3         The Lion in Winter  61   Connecticut 1907-05-12           5
## 4    Dr. Jekyll and Mr. Hyde  35     Wisconsin 1897-08-31           8
## 5                  The Champ  47      Missouri 1885-04-01           4
## 6                     Charly  43    California 1925-09-09           9
##   birth_day birth_year from_US age_group
## 1        10       1900     yes     30-40
## 2        24       1942     yes     20-30
## 3        12       1907     yes     60-70
## 4        31       1897     yes     30-40
## 5         1       1885     yes     40-50
## 6         9       1925     yes     40-50
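Ties like these could also be found programmatically by counting winners per year and award and keeping the combinations with more than one. The following base R sketch is one way to do that under the assumption that each row is one winner; sample_df, counts and ties are hypothetical names, and the rows are the six shown in the output above.

```r
# The six rows from the output above (year, award and name only)
sample_df <- data.frame(
  oscar_year = c(1933, 1969, 1969, 1933, 1933, 1969),
  award      = c("Best actress", "Best actress", "Best actress",
                 "Best actor", "Best actor", "Best actor"),
  name       = c("Helen Hayes", "Barbra Streisand", "Katharine Hepburn",
                 "Fredric March", "Wallace Beery", "Cliff Robertson"),
  stringsAsFactors = FALSE
)

# Number of winners for every year/award combination (column Freq)
counts <- as.data.frame(
  table(oscar_year = sample_df$oscar_year, award = sample_df$award),
  stringsAsFactors = FALSE
)

# A tie is any combination with more than one winner
ties <- counts[counts$Freq > 1, ]
ties
```

Applied to the whole of oscars_df, the same steps would list every tied award without having to know the years in advance.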

Summary/Conclusion

After the data analysis above, we can conclude that the most common birth places among the winners are England, California and New York. The age group in which most actors and actresses won their awards is 30-40 years. A few birth places, such as Australia, Iowa and Hungary, appear only with best actor awards, and the same holds for some birth places that appear only with best actress awards. Finally, 1933 (best actor) and 1969 (best actress) were years in which an award was tied.

Data607 - Assignment6

Amit Kapoor