Objective

The objective is to perform a statistical analysis of the normalized data set view_year_numeric_CLEANED.csv to answer the following question:

Is being nominated for Film Editing the best predictor of winning the Best Picture?

Hypothesis

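Each sample is a vector of 0/1 flags, one per nomination in a Category (e.g. Film Editing, Directing), where 1 means the Nominee also won Best Picture. The mean of a sample is therefore the proportion of that Category's nominees that won Best Picture.

Ho = Mean of the Film Editing sample is equal to the mean of the comparison sample
Ha = Mean of the Film Editing sample is different from the mean of the comparison sample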

In other words…

If Ho is not rejected, there is no evidence that the Film Editing and comparison groups have different probabilities of winning the Best Picture award.
If Ho is rejected, then the Film Editing and comparison groups have different probabilities of winning the Best Picture award.

Methodology

T-Test

For this analysis, the two-sample t-test will be used to compare the means of two samples. The results will be used to infer whether Film Editing is the best predictor of winning Best Picture.

In this test, a P-Value:

Greater than 0.05 means the difference between the two group means is not statistically significant, so we fail to reject the Ho hypothesis.
Less than 0.05 means the two group means are significantly different, so we reject the Ho hypothesis in favor of the Ha hypothesis.
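
The decision rule above can be sketched on toy data. This is only an illustration: the vectors group_a and group_b and the name alpha are made up here and are not part of the original analysis.

```r
# Toy 0/1 samples, made up for illustration (1 = won Best Picture)
group_a <- c(rep(1, 15), rep(0, 85))
group_b <- c(rep(1, 12), rep(0, 88))

alpha <- 0.05  # significance level used throughout this analysis

# Two-sample t-test; t.test() defaults to the Welch (unequal variance) form
result <- t.test(group_a, group_b)

# A p-value below alpha rejects Ho; otherwise we fail to reject it
decision <- if (result$p.value < alpha) "reject Ho" else "fail to reject Ho"
decision
```

With these toy values the means are 0.15 and 0.12, and the difference is too small relative to the sample size to reject Ho.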

F-Test

The F-Test compares the sample variances of the two groups to check for homoskedasticity (homogeneity of variances). This is a necessary step before applying the t-test, because its result determines the value of the t-test's var.equal argument.

In this test, a P-Value:

Greater than 0.05 means there is no evidence that the two variances differ, so we treat them as homogeneous and set the t-test argument var.equal = TRUE.
Less than 0.05 means the two variances are not homogeneous, so we set the t-test argument var.equal = FALSE (Welch's t-test).
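
The two-step procedure (F-Test first, then the t-test with the matching var.equal setting) could be wrapped as follows. The helper f_then_t and the toy vectors are hypothetical names introduced for this sketch; the analysis below runs the two tests by hand instead.

```r
# Hypothetical helper (not in the original analysis): run the F-test first,
# then pass its outcome to t.test() through var.equal.
f_then_t <- function(x, y, alpha = 0.05) {
  f_res <- var.test(x, y)
  equal_var <- f_res$p.value > alpha   # TRUE when variances look homogeneous
  t_res <- t.test(x, y, var.equal = equal_var, paired = FALSE)
  list(f_p = f_res$p.value, var_equal = equal_var, t_p = t_res$p.value)
}

# Toy 0/1 samples, made up for illustration
x <- c(rep(1, 20), rep(0, 80))
y <- c(rep(1, 18), rep(0, 82))

out <- f_then_t(x, y)
out$var_equal  # which t-test variant was selected
out$t_p        # p-value of that t-test
```

Returning both p-values keeps the choice of t-test variant visible next to the final result.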

Required Libraries

library(tidyr)
library(dplyr)
library(RCurl)
library(stringr)

Loading the Data

To ensure the results of this analysis are reproducible, the data set is read directly from GitHub.

#Downloading the raw CSV text from GitHub
url <- getURL("https://raw.githubusercontent.com/mkds/IS607_Project3/gh-pages/Data/project_view_year_numeric_CLEANED.csv")

#Reading the data and assigning it to mydata
mydata <- read.csv(text = url, stringsAsFactors = FALSE) 

#Taking a peek at the dataset
head(mydata)

##   Year                  Category                 Nominee Won
## 1 1934   ACTOR IN A LEADING ROLE   It Happened One Night yes
## 2 1934   ACTOR IN A LEADING ROLE The Affairs of Cellini   no
## 3 1934   ACTOR IN A LEADING ROLE           The Thin Man   no
## 4 1934 ACTRESS IN A LEADING ROLE   It Happened One Night yes
## 5 1934 ACTRESS IN A LEADING ROLE       Of Human Bondage   no
## 6 1934 ACTRESS IN A LEADING ROLE                    <NA>  no

#Reviewing the data structure
str(mydata)

## 'data.frame':    3417 obs. of  4 variables:
##  $ Year    : int  1934 1934 1934 1934 1934 1934 1934 1934 1934 1934 ...
##  $ Category: chr  "ACTOR IN A LEADING ROLE" "ACTOR IN A LEADING ROLE" "ACTOR IN A LEADING ROLE" "ACTRESS IN A LEADING ROLE" ...
##  $ Nominee : chr  "It Happened One Night" "The Affairs of Cellini " "The Thin Man " "It Happened One Night" ...
##  $ Won     : chr  "yes" "no" "no" "yes" ...

Data Transformation

In the view_year_numeric_CLEANED.csv data set, the column Won indicates whether the Nominee won the award in the listed Category. For rows in the BEST PICTURE category, it indicates whether the film won Best Picture. The values are:

yes = Won the award
no = Did not win the award

This text flag is not suitable for analysis, so we need to transform it to numerical values:

yes = 1
no = 0

This transformation is done with the ifelse function inside dplyr's mutate function. A new column called TEST is created to store the new values.

mydata2 <- mutate(mydata, TEST = ifelse(Won == "yes", 1, 0))

head(mydata2)

##   Year                  Category                 Nominee Won TEST
## 1 1934   ACTOR IN A LEADING ROLE   It Happened One Night yes    1
## 2 1934   ACTOR IN A LEADING ROLE The Affairs of Cellini   no    0
## 3 1934   ACTOR IN A LEADING ROLE           The Thin Man   no    0
## 4 1934 ACTRESS IN A LEADING ROLE   It Happened One Night yes    1
## 5 1934 ACTRESS IN A LEADING ROLE       Of Human Bondage   no    0
## 6 1934 ACTRESS IN A LEADING ROLE                    <NA>  no    0
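
As a small illustration of the recoding, the sketch below applies the same ifelse call to a toy data frame and compares it with a logical comparison coerced to integer. The data frame toy and the column TEST_ALT are made up for this example.

```r
library(dplyr)

# Toy rows mimicking the Won column (values made up for illustration)
toy <- data.frame(Won = c("yes", "no", "no", "yes"), stringsAsFactors = FALSE)

# ifelse() version, as used above
toy <- mutate(toy, TEST = ifelse(Won == "yes", 1, 0))

# Equivalent shortcut: a logical comparison coerced to integer
toy <- mutate(toy, TEST_ALT = as.integer(Won == "yes"))

# Both columns hold the same 0/1 flags
all(toy$TEST == toy$TEST_ALT)
```

Either form works for the tests that follow, since both produce numeric 0/1 values.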

Subsetting

In this step, the data set is split into the subsets needed for the analysis. The key subset is Best Picture, which is used in the next step to flag whether a Nominee also won the Best Picture award.

#Rest of the data excluding Film Editing and Best Picture
rt <- filter(mydata2, Category != "FILM EDITING" & Category != "BEST PICTURE")

#Some values in the Nominee column have extra spaces to the right, so they are trimmed using str_trim.
#The trimmed values overwrite the old Nominee column in place.
rt$Nominee <- str_trim(rt$Nominee, side = "right")
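
The effect of the trailing spaces can be seen on three titles taken from the str() output above, two of which are padded on the right. This is a sketch only; whether the other subsets (Film Editing, Best Picture, and so on) contain padded titles is not shown in the source, so the same check may be worth running on them.

```r
library(stringr)

# Three titles copied from the str() output above; two end in a space
titles <- c("It Happened One Night", "The Affairs of Cellini ", "The Thin Man ")

# Count values that end in whitespace before trimming
padded <- sum(str_detect(titles, "\\s$"))
padded

# Trim the right-hand side only, as in the step above
trimmed <- str_trim(titles, side = "right")

# A padded title is a different string, so it would not match in a join
"The Thin Man " == "The Thin Man"
trimmed[3] == "The Thin Man"
```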

#Film Editing
ed <- filter(mydata2, Category == "FILM EDITING")

#Best Picture
bp <- filter(mydata2, Category == "BEST PICTURE")

#Cinematography
cm <- filter(mydata2, Category == "CINEMATOGRAPHY")

#Directing
dr <- filter(mydata2, Category == "DIRECTING")

#Sound Mixing
sm <- filter(mydata2, Category == "SOUND MIXING")

#Actor in leading role
ar <- filter(mydata2, Category == "ACTOR IN A LEADING ROLE")

Left Joins

In the above sections, the Won values were transformed to numbers and the data was subsetted into specific subgroups. These subgroups are needed to create the samples that will be passed on to the F and T tests.

Essentially, these joins build our samples. Because a left join is used, a Nominee with no matching row in the Best Picture subset gets NA in the joined columns. These NAs are converted to zero: the Nominee was not nominated for Best Picture, so it did not win Best Picture.
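
The NA behavior described above can be seen in a minimal sketch. The data frames category_toy and bp_toy and their film titles are invented for illustration.

```r
library(dplyr)

# Toy subsets (titles and flags made up for illustration)
category_toy <- data.frame(Nominee = c("Film A", "Film B", "Film C"),
                           TEST = c(1, 0, 0), stringsAsFactors = FALSE)
bp_toy <- data.frame(Nominee = c("Film A", "Film B"),
                     TEST = c(1, 0), stringsAsFactors = FALSE)

# Both tables have a TEST column, so dplyr suffixes them .x and .y;
# TEST.y is the Best Picture flag
joined <- left_join(category_toy, bp_toy, by = "Nominee")

# "Film C" has no Best Picture row, so its TEST.y is NA
joined$TEST.y

# Treat "not nominated for Best Picture" as "did not win Best Picture"
joined$TEST.y[is.na(joined$TEST.y)] <- 0
joined$TEST.y
```

Note that the join key here is the title alone, as in the analysis; two different films sharing a title would be matched to each other.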

Rest vs. Best Picture

Joining the Rest of the data subset, excluding Film Editing and Best Picture Categories, with Best Picture subset. This will create the Rest sample.

#Left join for Rest vs. Best Picture
lj1a <- left_join(rt,bp, by = "Nominee")


#Selecting and unlisting the TEST column
lj1b <- unlist(select(lj1a,TEST.y))

#Converting NAs to zeros
lj1b[is.na(lj1b)] <- 0

#Converting to Vector
lj1 <- as.vector(lj1b)

Film Editing vs. Best Picture

Joining the Film Editing subset with Best Picture subset. This will create the Film Editing sample.

#Left join for Film Editing vs. Best Picture
lj2a <- left_join(ed,bp, by = "Nominee")


#Selecting and unlisting the TEST column
lj2b <- unlist(select(lj2a,TEST.y))


#Converting NAs to zeros
lj2b[is.na(lj2b)] <- 0


#Converting to Vector
lj2 <- as.vector(lj2b)

Cinematography vs. Best Picture

Joining the Cinematography subset with Best Picture subset. This will create the Cinematography sample.

#Left join for Cinematography vs. Best Picture
lj3a <- left_join(cm,bp, by = "Nominee")

#Selecting and unlisting the TEST column
lj3b <- unlist(select(lj3a,TEST.y))

#Converting NAs to zeros
lj3b[is.na(lj3b)] <- 0 

#Converting to Vector
lj3 <- as.vector(lj3b)

Directing vs. Best Picture

Joining the Directing subset with Best Picture subset. This will create the Directing sample.

#Left join for Directing vs. Best Picture
lj4a <- left_join(dr,bp, by = "Nominee")

#Selecting and unlisting the TEST column
lj4b <- unlist(select(lj4a,TEST.y))

#Converting NAs to zeros
lj4b[is.na(lj4b)] <- 0 

#Converting to Vector
lj4 <- as.vector(lj4b)

Sound Mixing vs. Best Picture

Joining the Sound Mixing subset with Best Picture subset. This will create the Sound Mixing sample.

#Left join for Sound Mixing vs. Best Picture
lj6a <- left_join(sm,bp, by = "Nominee")

#Selecting and unlisting the TEST column
lj6b <- unlist(select(lj6a,TEST.y))

#Converting NAs to zeros
lj6b[is.na(lj6b)] <- 0

#Converting to Vector
lj6 <- as.vector(lj6b)

Actor in Leading Role vs. Best Picture

Joining the Actor in Leading Role subset with Best Picture subset. This will create the Actor in Leading Role sample.

#Left join for Actor in Leading Role vs. Best Picture
lj7a <- left_join(ar,bp, by = "Nominee")

#Selecting and unlisting the TEST column
lj7b <- unlist(select(lj7a,TEST.y))

#Converting NAs to zeros
lj7b[is.na(lj7b)] <- 0 

#Converting to Vector
lj7 <- as.vector(lj7b)
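
The same four steps (join, select TEST.y, replace NAs, convert to vector) are repeated for every category above. One way to avoid the repetition is a small helper; build_sample is a hypothetical name, and the toy data frames below only stand in for the real subsets.

```r
library(dplyr)

# Hypothetical helper: join one category subset to the Best Picture subset
# and return the 0/1 Best Picture flags as a plain numeric vector.
build_sample <- function(category_df, bp_df) {
  joined <- left_join(category_df, bp_df, by = "Nominee")
  flags <- joined$TEST.y
  flags[is.na(flags)] <- 0   # no Best Picture row means no Best Picture win
  as.vector(flags)
}

# Toy stand-ins for a category subset and the Best Picture subset
cat_toy <- data.frame(Nominee = c("Film A", "Film B", "Film C"),
                      TEST = c(0, 1, 0), stringsAsFactors = FALSE)
bp_toy <- data.frame(Nominee = c("Film A", "Film B"),
                     TEST = c(1, 0), stringsAsFactors = FALSE)

sample_toy <- build_sample(cat_toy, bp_toy)
sample_toy
```

With the objects defined earlier, a call such as build_sample(ed, bp) would be expected to reproduce lj2, and likewise for the other categories.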

F and T-Test Analysis

Film Editing vs. Rest

F-Test

var.test(lj2,lj1)

## 
##  F test to compare two variances
## 
## data:  lj2 and lj1
## F = 1.3243, num df = 402, denom df = 2644, p-value = 0.0001147
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.146220 1.542761
## sample estimates:
## ratio of variances 
##           1.324333

T-Test

t.test(lj2,lj1, var.equal=FALSE, paired=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  lj2 and lj1
## t = 2.2666, df = 498.82, p-value = 0.02384
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005737063 0.080405151
## sample estimates:
## mean of x mean of y 
## 0.1538462 0.1107750
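
Because both samples are 0/1 flags, their means are proportions, so a two-sample test of proportions is a common cross-check for a t-test like the one above. The sketch below is not part of the original analysis. It assumes the sample sizes equal the F-Test degrees of freedom plus one, and it reconstructs the win counts from the printed means rather than from the data.

```r
# Sample sizes implied by the F-test degrees of freedom (df + 1)
n_ed   <- 402 + 1
n_rest <- 2644 + 1

# Win counts implied by the printed sample means (rounded to whole counts)
wins_ed   <- round(0.1538462 * n_ed)
wins_rest <- round(0.1107750 * n_rest)

# Two-sample test of equal proportions (base R, stats package)
pt <- prop.test(x = c(wins_ed, wins_rest), n = c(n_ed, n_rest))
pt$estimate
pt$p.value
```

If the reconstruction is right, the estimated proportions should match the sample means reported by the t-test.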

Film Editing vs. Cinematography

F-Test

var.test(lj2,lj3)

## 
##  F test to compare two variances
## 
## data:  lj2 and lj3
## F = 1.2669, num df = 402, denom df = 266, p-value = 0.0368
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.014697 1.573889
## sample estimates:
## ratio of variances 
##           1.266876

T-Test

t.test(lj2,lj3, var.equal=FALSE, paired=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  lj2 and lj3
## t = 1.4168, df = 613.8, p-value = 0.1571
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01457316  0.09005573
## sample estimates:
## mean of x mean of y 
## 0.1538462 0.1161049

Film Editing vs. Directing

F-Test

var.test(lj2,lj4)

## 
##  F test to compare two variances
## 
## data:  lj2 and lj4
## F = 0.92441, num df = 402, denom df = 400, p-value = 0.4316
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.759837 1.124581
## sample estimates:
## ratio of variances 
##          0.9244133

T-Test

t.test(lj2,lj4, var.equal=TRUE, paired=FALSE)

## 
##  Two Sample t-test
## 
## data:  lj2 and lj4
## t = -0.60511, df = 802, p-value = 0.5453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06675627  0.03529646
## sample estimates:
## mean of x mean of y 
## 0.1538462 0.1695761

Film Editing vs. Sound Mixing

F-Test

var.test(lj2,lj6)

## 
##  F test to compare two variances
## 
## data:  lj2 and lj6
## F = 1.2085, num df = 402, denom df = 57, p-value = 0.3831
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7900273 1.7423976
## sample estimates:
## ratio of variances 
##           1.208508

T-Test

t.test(lj2,lj6, var.equal=TRUE, paired=FALSE)

## 
##  Two Sample t-test
## 
## data:  lj2 and lj6
## t = 0.66066, df = 459, p-value = 0.5092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06546761  0.13178061
## sample estimates:
## mean of x mean of y 
## 0.1538462 0.1206897

Film Editing vs. Actor in Leading Role

F-Test

var.test(lj2,lj7)

## 
##  F test to compare two variances
## 
## data:  lj2 and lj7
## F = 1.1877, num df = 402, denom df = 406, p-value = 0.08413
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9770518 1.4439575
## sample estimates:
## ratio of variances 
##           1.187723

T-Test

t.test(lj2,lj7, var.equal=TRUE, paired=FALSE)

## 
##  Two Sample t-test
## 
## data:  lj2 and lj7
## t = 1.1717, df = 808, p-value = 0.2417
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01927229  0.07635034
## sample estimates:
## mean of x mean of y 
## 0.1538462 0.1253071

Recap of Results & Conclusion

Recapping the Results of the t-test

Film Editing vs. Rest

T-test p-value = 0.02384 is less than 0.05, so I can reject the Ho hypothesis. The averages of the two groups are significantly different (0.154 for Film Editing vs. 0.111 for the Rest). This means that these two samples have different probabilities of winning the Best Picture award.

Film Editing vs. Cinematography

T-test p-value = 0.1571 is greater than 0.05, so I fail to reject the Ho hypothesis. The difference between the averages of the two groups is not statistically significant. This means there is no evidence that these two samples have different probabilities of winning the Best Picture award.

Film Editing vs. Directing

T-test p-value = 0.5453 is greater than 0.05, so I fail to reject the Ho hypothesis. The difference between the averages of the two groups is not statistically significant. This means there is no evidence that these two samples have different probabilities of winning the Best Picture award.

Film Editing vs. Sound Mixing

T-test p-value = 0.5092 is greater than 0.05, so I fail to reject the Ho hypothesis. The difference between the averages of the two groups is not statistically significant. This means there is no evidence that these two samples have different probabilities of winning the Best Picture award.

Film Editing vs. Actor in Leading Role

T-test p-value = 0.2417 is greater than 0.05, so I fail to reject the Ho hypothesis. The difference between the averages of the two groups is not statistically significant. This means there is no evidence that these two samples have different probabilities of winning the Best Picture award.
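
The five comparisons can also be collected into one table for a side-by-side view. The means and p-values below are copied from the t-test output above; the data frame recap and its column names are introduced only for this sketch.

```r
# Means and p-values copied from the t-test output above
recap <- data.frame(
  comparison = c("Rest", "Cinematography", "Directing",
                 "Sound Mixing", "Actor in Leading Role"),
  mean_film_editing = 0.1538462,
  mean_comparison = c(0.1107750, 0.1161049, 0.1695761, 0.1206897, 0.1253071),
  p_value = c(0.02384, 0.1571, 0.5453, 0.5092, 0.2417),
  stringsAsFactors = FALSE
)

# Flag comparisons where Ho is rejected at the 0.05 level
recap$reject_Ho <- recap$p_value < 0.05
recap
```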

Conclusion

Based on the results of the above analysis, we can infer that a Film Editing nomination is a better predictor of winning the Best Picture award than a nomination in the rest of the categories put together (mean of 0.154 vs. 0.111, p-value = 0.02384).

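Comparing the Film Editing sample to individual categories (Cinematography, Directing, Sound Mixing, and Actor in Leading Role) did not yield the same result: none of those differences was statistically significant, and the Directing sample had a slightly higher mean (0.170 vs. 0.154). The analysis therefore does not show that Film Editing is the best single predictor of winning Best Picture.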

Project_3_t-test

Diego Diaz

October 22, 2015
