Project 1: Factorial Design Experiment

1. Setting

This is a factorial design experiment, in which multiple factors are varied at the same time. It uses a fixed-effects model, so the aim is to determine the effect of the specific levels of four categorical variables; no inferences will be made about a wider population. The experiment studies disbursements by the 2016 presidential candidates to determine whether candidate, state, type of disbursement, and year have an effect on disbursement amount. The null hypothesis is that disbursement amount does not differ across the levels of these factors; under the null hypothesis, observed variation is due to randomization. The dataset was chosen from the list of 100 interesting datasets for Statistics: http://www.fec.gov/disclosurep/PDownload.do

#Load in the CSV data (read.csv is provided by the utils package)
library(utils)
setwd("C:/Users/Alexis/Documents/Alexis/RPI/DoE")
election_data <- read.csv("Project1_PresidentialElectionData3.csv", header = TRUE)

#Show first and last ten rows of data table
head(election_data, 10)
tail(election_data, 10)

#The 4 factors (independent variables) being studied are cand_nm (presidential candidate), recipient_st (state where donations were made/money disbursed), disb_desc (description of the disbursements), and disb_dt (year money disbursed). disb_dt was changed from d-m-y to YEAR, so there are no continuous variables used here.

cand_nm levels: Hillary Clinton, Bernie Sanders, Donald Trump, Ted Cruz
recipient_st levels: DC, CA, NY, MA, KY, UT, GA
disb_desc levels: in-kind contribution, office supplies, travel, online advertising
disb_dt levels: 2015, 2016
The response variable (dependent variable) is disb_amt (disbursement amount).
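
The conversion of disb_dt from a d-m-y date to a year is described above but not shown. Below is a minimal sketch of one way it could be done. It assumes the raw dates are strings whose final hyphen-separated field is a two-digit year (for example "12-OCT-15"). The vector raw_dates and the helper extract_year are hypothetical names and are not part of the original analysis.

```r
# Hypothetical raw date strings in d-m-y form (assumed format, not taken from the data file)
raw_dates <- c("12-OCT-15", "03-JAN-16", "28-FEB-16")

# Keep only the text after the last hyphen and prefix the century.
# Assumes every date falls in the 2000s and ends in a two-digit year.
extract_year <- function(x) {
  as.integer(paste0("20", sub(".*-", "", x)))
}

disb_year <- extract_year(raw_dates)
disb_year
```

Extracting the year as text avoids depending on the locale used to parse month abbreviations. If the real file stores dates differently, the pattern would need to change.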

#Select the final 2 Democrats and final 2 Republicans (also select recipient state, disbursement description, and disbursement date (simplified to year))
election_data_subset <- subset(election_data, cand_nm %in% c("Clinton, Hillary Rodham", "Sanders, Bernard", "Cruz, Ted", "Trump, Donald"), select = cand_nm:disb_amt)
election_data_subset2 <- subset(election_data_subset, recipient_st %in% c("DC", "MA", "NY", "CA", "KY", "GA", "UT"))
election_data_subset3 <- subset(election_data_subset2, disb_desc %in% c("IN-KIND CONTRIBUTION", "OFFICE SUPPLIES", "TRAVEL", "ONLINE ADVERTISING"))
election_data_subset4 <- subset(election_data_subset3, disb_dt %in% c("2015", "2016"))

#Show first and last ten rows of new subsetted data table
head(election_data_subset4, 10)
tail(election_data_subset4, 10)

sample_size <- nrow(election_data_subset4)
sample_size

#The data has five columns, 1 for each of the factors and 1 for the response variable. For the subsetted data, the factors are all categorical variables with a set number of levels. The data has 9662 rows.
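
With 4, 7, 4, and 2 levels, the full factorial has 4 x 7 x 4 x 2 = 224 factor-level combinations, and observational data will not necessarily fill all of them. The sketch below shows how cell counts could be inspected. It uses a small made-up data frame (toy) in place of the real table, so the counts are illustrative only.

```r
# Toy data standing in for the subsetted table (values are illustrative only)
toy <- data.frame(
  cand_nm  = c("A", "A", "B", "B", "B"),
  disb_dt  = c(2015, 2016, 2015, 2015, 2016),
  disb_amt = c(10, 20, 30, 40, 50)
)

# Number of factor-level combinations in the full 4 x 7 x 4 x 2 design
n_cells <- 4 * 7 * 4 * 2
n_cells

# Observations per cell for two of the factors; zeros would mark empty cells
cell_counts <- table(toy$cand_nm, toy$disb_dt)
cell_counts
```

On the real data the same idea would apply to all four factors at once, for example with xtabs(~ cand_nm + recipient_st + disb_desc + disb_dt, data = election_data_subset4).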

2. (Experimental) Design

The main effect will be estimated for each factor, along with the interaction effects for all two-factor interactions.

Randomization: As there was no control over randomization in data collection, the data will be randomized without replacement for analysis. Factorial design experiments assume that data are randomized (in object selection, assignment to treatment, and experimental run order).

election_data_randomized = election_data_subset4[sample(1:nrow(election_data_subset4), size = sample_size, replace = FALSE),]

#Show first and last ten rows of randomized data table
head(election_data_randomized, 10)
tail(election_data_randomized, 10)
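
Because sample() draws a different permutation on every run, the row order above changes each time the analysis is executed. A minimal sketch of a reproducible shuffle follows. It uses a toy data frame, and the seed value 2016 is arbitrary; neither comes from the original analysis.

```r
# Toy data frame standing in for election_data_subset4
toy <- data.frame(id = 1:6, disb_amt = c(5, 10, 15, 20, 25, 30))

# Fix the random number generator so the same shuffle is produced on every run.
# The seed value 2016 is arbitrary.
set.seed(2016)

# sample(n) returns a random permutation of 1..n, i.e. sampling without replacement
toy_randomized <- toy[sample(nrow(toy)), ]
toy_randomized
```

Every original row appears exactly once in the shuffled table; only the order changes.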

Replication: As all disbursements by each candidate were made on different days in different states, there are no repeated measurements. However, as the candidates purchase the same items on different days in different states, those disbursements can be counted as replicates.

Blocking: Blocking is used to reduce the variability of a sample. Typically, there are nuisance factors for a given experiment that are suspected to have an effect on the response variable but are not considered to be one of the main factors. To block for a nuisance factor, it is held constant during the experiment. As the data used for this study was selected after completion and the original data included many factors, data were blocked to include only the four factors listed above.

3. (Statistical) Analysis


**Exploratory Boxplots**
fifteen_data = na.omit(subset(election_data_randomized, disb_dt == "2015"))
fifteen_data$cand_nm <- factor(fifteen_data$cand_nm)
fifteen_data$recipient_st <- factor(fifteen_data$recipient_st)
fifteen_data$disb_desc <- factor(fifteen_data$disb_desc)
fifteen_data$disb_dt <- factor(fifteen_data$disb_dt)

#Boxplots examining levels of each factor for exploratory analysis (2015 disbursements)
boxplot(disb_amt ~ cand_nm, data = fifteen_data, xlab = "Candidate", ylab = "Disbursement Amount", main = "Candidate Effect on Disbursement")
boxplot(disb_amt ~ recipient_st, data = fifteen_data, xlab = "State", ylab = "Disbursement Amount", main = "State Effect on Disbursement")
boxplot(disb_amt ~ disb_desc, data = fifteen_data, xlab = "Disbursement Type", ylab = "Disbursement Amount", main = "Disbursement Type Effect on Disbursement")
#The year boxplot uses the full randomized data so that both 2015 and 2016 appear
boxplot(disb_amt ~ disb_dt, data = election_data_randomized, xlab = "Year", ylab = "Disbursement Amount", main = "Year Effect on Disbursement")


#Calculation of Main Effect
HC_data = subset(election_data_randomized, cand_nm == "Clinton, Hillary Rodham")
max_HC = max(HC_data$disb_amt)
max_HC
min_HC = min(HC_data$disb_amt)
min_HC
mean_HC = mean(HC_data$disb_amt)
mean_HC
median_HC = median(HC_data$disb_amt)
median_HC

DT_data = subset(election_data_randomized, cand_nm == "Trump, Donald")
max_DT = max(DT_data$disb_amt)
max_DT
min_DT = min(DT_data$disb_amt)
min_DT
mean_DT = mean(DT_data$disb_amt)
mean_DT
median_DT = median(DT_data$disb_amt)
median_DT

TC_data = subset(election_data_randomized, cand_nm == "Cruz, Ted")
max_TC = max(TC_data$disb_amt)
max_TC
min_TC = min(TC_data$disb_amt)
min_TC
mean_TC = mean(TC_data$disb_amt)
mean_TC
median_TC = median(TC_data$disb_amt)
median_TC

BS_data = subset(election_data_randomized, cand_nm == "Sanders, Bernard")
max_BS = max(BS_data$disb_amt)
max_BS
min_BS = min(BS_data$disb_amt)
min_BS
mean_BS = mean(BS_data$disb_amt)
mean_BS
median_BS = median(BS_data$disb_amt)
median_BS

max_cand = max(median_HC, median_DT, median_BS, median_TC)
max_cand

min_cand = min(median_HC, median_DT, median_BS, median_TC)
min_cand

Based on max and min, Hillary Clinton had both the highest and lowest disbursement. Thus, means were used as a second assessment. As the mean is more skewed by outliers, the median was also used to determine which levels would be used to calculate the main effect. Donald Trump had the highest median disbursement, while Hillary Clinton had the lowest. Thus, these levels were used to calculate the main effect of candidate.
main_effect_candidate = median_DT - median_HC
main_effect_candidate
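
The per-level summaries and the main effect as defined here (highest level median minus lowest level median) can also be computed in one pass over all levels of a factor. The sketch below uses a made-up data frame (toy) with three levels in place of election_data_randomized, so the numbers are illustrative only.

```r
# Toy data standing in for election_data_randomized (values are illustrative only)
toy <- data.frame(
  cand_nm  = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  disb_amt = c(1, 2, 300, 10, 20, 30, 5, 6, 7)
)

# Median response for every level of the factor in one call
level_medians <- tapply(toy$disb_amt, toy$cand_nm, median)
level_medians

# Main effect as defined in the text: highest level median minus lowest level median
main_effect <- max(level_medians) - min(level_medians)
main_effect
```

Because tapply() groups by whatever levels are present in the column, no level has to be listed by hand, which helps when a factor has many levels (such as the seven states).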

Calculate the main effect of state
NY_data = subset(election_data_randomized, recipient_st == "NY")
max_NY = max(NY_data$disb_amt)
max_NY
min_NY = min(NY_data$disb_amt)
min_NY
mean_NY = mean(NY_data$disb_amt)
mean_NY
median_NY = median(NY_data$disb_amt)
median_NY

MA_data = subset(election_data_randomized, recipient_st == "MA")
max_MA = max(MA_data$disb_amt)
max_MA
min_MA = min(MA_data$disb_amt)
min_MA
mean_MA = mean(MA_data$disb_amt)
mean_MA
median_MA = median(MA_data$disb_amt)
median_MA

CA_data = subset(election_data_randomized, recipient_st == "CA")
max_CA = max(CA_data$disb_amt)
max_CA
min_CA = min(CA_data$disb_amt)
min_CA
mean_CA = mean(CA_data$disb_amt)
mean_CA
median_CA = median(CA_data$disb_amt)
median_CA

GA_data = subset(election_data_randomized, recipient_st == "GA")
max_GA = max(GA_data$disb_amt)
max_GA
min_GA = min(GA_data$disb_amt)
min_GA
mean_GA = mean(GA_data$disb_amt)
mean_GA
median_GA = median(GA_data$disb_amt)
median_GA

KY_data = subset(election_data_randomized, recipient_st == "KY")
max_KY = max(KY_data$disb_amt)
max_KY
min_KY = min(KY_data$disb_amt)
min_KY
mean_KY = mean(KY_data$disb_amt)
mean_KY
median_KY = median(KY_data$disb_amt)
median_KY

UT_data = subset(election_data_randomized, recipient_st == "UT")
max_UT = max(UT_data$disb_amt)
max_UT
min_UT = min(UT_data$disb_amt)
min_UT
mean_UT = mean(UT_data$disb_amt)
mean_UT
median_UT = median(UT_data$disb_amt)
median_UT

max_st = max(median_NY, median_MA, median_CA, median_GA, median_UT, median_KY)
max_st

min_st = min(median_NY, median_MA, median_CA, median_GA, median_UT, median_KY)
min_st

#Based on max and min, NY had both the highest and lowest disbursement. Thus, means were used as a second assessment. As the mean is more skewed by outliers, the median was also used to determine which levels would be used to calculate the main effect. KY had the highest median disbursement, while NY had the lowest. Thus, these levels were used to calculate the main effect of state.
main_effect_state = median_KY - median_NY
main_effect_state

inkind_data = na.omit(subset(election_data_randomized, disb_desc == "IN-KIND CONTRIBUTION"))
max_inkind = max(inkind_data$disb_amt)
max_inkind
min_inkind = min(inkind_data$disb_amt)
min_inkind
mean_inkind = mean(inkind_data$disb_amt)
mean_inkind
median_inkind = median(inkind_data$disb_amt)
median_inkind

office_data = na.omit(subset(election_data_randomized, disb_desc == "OFFICE SUPPLIES"))
max_office = max(office_data$disb_amt)
max_office
min_office = min(office_data$disb_amt)
min_office
mean_office = mean(office_data$disb_amt)
mean_office
median_office = median(office_data$disb_amt)
median_office

travel_data = na.omit(subset(election_data_randomized, disb_desc == "TRAVEL"))
max_travel = max(travel_data$disb_amt)
max_travel
min_travel = min(travel_data$disb_amt)
min_travel
mean_travel = mean(travel_data$disb_amt)
mean_travel
median_travel = median(travel_data$disb_amt)
median_travel

online_data = na.omit(subset(election_data_randomized, disb_desc == "ONLINE ADVERTISING"))
max_online = max(online_data$disb_amt)
max_online
min_online = min(online_data$disb_amt)
min_online
mean_online = mean(online_data$disb_amt)
mean_online
median_online = median(online_data$disb_amt)
median_online

max_type = max(median_travel, median_online, median_office, median_inkind)
max_type

min_type = min(median_travel, median_online, median_office, median_inkind)
min_type

#To be consistent with other conditions, median was the final method used to determine which levels will be used to calculate the main effect. Online Advertising had the highest median disbursement, while office supplies had the lowest median disbursement. Thus, these levels were used to calculate the main effect of type of disbursement. 
main_effect_type = median_online - median_office
main_effect_type

fifteen_data = subset(election_data_randomized, disb_dt == "2015")
max_fifteen = max(fifteen_data$disb_amt)
max_fifteen
min_fifteen= min(fifteen_data$disb_amt)
min_fifteen
mean_fifteen= mean(fifteen_data$disb_amt)
mean_fifteen
median_fifteen= median(fifteen_data$disb_amt)
median_fifteen

sixteen_data = subset(election_data_randomized, disb_dt == "2016")
max_sixteen= max(sixteen_data$disb_amt)
max_sixteen
min_sixteen= min(sixteen_data$disb_amt)
min_sixteen
mean_sixteen= mean(sixteen_data$disb_amt)
mean_sixteen
median_sixteen = median(sixteen_data$disb_amt)
median_sixteen

max_year = max(median_fifteen, median_sixteen)
max_year

#To be consistent with other conditions, median was the final method used to determine which levels will be used to calculate the main effect. Year 2016 had the highest median disbursement, while year 2015 had the lowest median disbursement. Thus, these levels were used to calculate the main effect of year.
main_effect_year = median_sixteen - median_fifteen
main_effect_year

#**All main effects**
main_effect_candidate
main_effect_state
main_effect_type
main_effect_year
#**The largest main effect is due to the type of disbursement.**


**ANOVA**
#Compute Analysis of Variance for all main effects (me) and two factor interactions (2fi)
# cand_nm
anova_cand <- aov(election_data_randomized$disb_amt ~ election_data_randomized$cand_nm)
summary(anova_cand)

# recipient_st
anova_state <- aov(election_data_randomized$disb_amt ~ election_data_randomized$recipient_st)
summary(anova_state)

# disb_desc
anova_description <- aov(election_data_randomized$disb_amt ~ election_data_randomized$disb_desc)
summary(anova_description)

# disb_dt
anova_date <- aov(election_data_randomized$disb_amt ~ election_data_randomized$disb_dt)
summary(anova_date)

#**By ANOVA, the main effects of type of disbursement and state both indicate that the variation in disbursement amount across their levels is explained by the factor rather than by randomization alone.**

# cand_nm and recipient_st
anova_cand_state <- aov(election_data_randomized$disb_amt ~ election_data_randomized$cand_nm*election_data_randomized$recipient_st)
summary(anova_cand_state)

# cand_nm and disb_desc
anova_cand_description <- aov(election_data_randomized$disb_amt ~ election_data_randomized$cand_nm*election_data_randomized$disb_desc)
summary(anova_cand_description)

# cand_nm and disb_dt
anova_cand_date <- aov(election_data_randomized$disb_amt ~ election_data_randomized$cand_nm*election_data_randomized$disb_dt)
summary(anova_cand_date)

# recipient_st and disb_desc
anova_state_description <- aov(election_data_randomized$disb_amt ~ election_data_randomized$recipient_st*election_data_randomized$disb_desc)
summary(anova_state_description)

# recipient_st and disb_dt
anova_state_date <- aov(election_data_randomized$disb_amt ~ election_data_randomized$recipient_st*election_data_randomized$disb_dt)
summary(anova_state_date)

#disb_desc and disb_dt
anova_description_date <- aov(election_data_randomized$disb_amt ~ election_data_randomized$disb_desc*election_data_randomized$disb_dt)
summary(anova_description_date)

#**By studying the F statistic results from the ANOVA, many of the interactions have large F statistics, suggesting that we can reject the null hypothesis and that the observed variance is greater than the variance expected from randomization alone.**
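
The ANOVAs above fit one factor or one pair of factors at a time. An alternative, which is not the approach used in this report, is a single model containing all main effects and all two-factor interactions. The sketch below shows the formula syntax on simulated data; the factors f1, f2, f3 and the response y are hypothetical stand-ins for the real columns.

```r
# Simulated data: three two-level factors with 5 replicates per cell (illustrative only)
set.seed(1)
toy <- expand.grid(
  f1 = c("low", "high"),
  f2 = c("low", "high"),
  f3 = c("low", "high"),
  replicate = 1:5
)
toy$y <- rnorm(nrow(toy))

# (f1 + f2 + f3)^2 expands to all main effects plus all two-factor interactions
fit <- aov(y ~ (f1 + f2 + f3)^2, data = toy)
summary(fit)

# Terms included in the model
term_labels <- attr(terms(fit), "term.labels")
term_labels
```

For the election data the analogous formula would be disb_amt ~ (cand_nm + recipient_st + disb_desc + disb_dt)^2 with data = election_data_randomized. Note that aov() reports sequential sums of squares, so with unbalanced data the results can depend on the order of the terms.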


**Interaction Plots**
#Interaction plots for factors are first split by year, 2015 and 2016. The interactions between candidate, state, and type of disbursement are then plotted.

#interaction of candidate and type of disbursement in 2015
interaction.plot(fifteen_data$disb_desc, fifteen_data$cand_nm, fifteen_data$disb_amt, xlab = "Disbursement Type", ylab = "Mean Disbursement Amount", trace.label = "Candidate")

#interaction of candidate and state in 2015
interaction.plot(fifteen_data$cand_nm, fifteen_data$recipient_st, fifteen_data$disb_amt, xlab = "Candidate", ylab = "Mean Disbursement Amount", trace.label = "State")

#interaction of type of disbursement and state in 2015
interaction.plot(fifteen_data$disb_desc, fifteen_data$recipient_st, fifteen_data$disb_amt, xlab = "Disbursement Type", ylab = "Mean Disbursement Amount", trace.label = "State")

#Subset the 2016 disbursements and drop unused factor levels
sixteen_data = na.omit(subset(election_data_randomized, disb_dt == "2016"))
sixteen_data$cand_nm <- factor(sixteen_data$cand_nm)
sixteen_data$recipient_st <- factor(sixteen_data$recipient_st)
sixteen_data$disb_desc <- factor(sixteen_data$disb_desc)

#interaction of candidate and state in 2016
interaction.plot(sixteen_data$cand_nm, sixteen_data$recipient_st, sixteen_data$disb_amt, xlab = "Candidate", ylab = "Mean Disbursement Amount", trace.label = "State")

#interaction of candidate and type of disbursement in 2016
interaction.plot(sixteen_data$disb_desc, sixteen_data$cand_nm, sixteen_data$disb_amt, xlab = "Disbursement Type", ylab = "Mean Disbursement Amount", trace.label = "Candidate")

#interaction of state and type of disbursement in 2016
interaction.plot(sixteen_data$disb_desc, sixteen_data$recipient_st, sixteen_data$disb_amt, xlab = "Disbursement Type", ylab = "Mean Disbursement Amount", trace.label = "State")

In conclusion, this was a fixed-effects, factorial design experiment. The factors candidate, state, type of disbursement, and disbursement year were all studied to determine whether they had an effect on disbursement amount. Ultimately, many of these factors had main effects as well as interaction effects. ANOVA was used to confirm that the observed variance was greater than would result from random variation alone. However, the results must be examined with caution, as the appropriateness of the model was not studied.
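
Model appropriateness was not assessed in this report. As an illustration of what such a check could look like, the sketch below draws two common residual plots for a one-way ANOVA on simulated data. The data frame toy and its columns are hypothetical, and this is not a result from the election data.

```r
# Simulated one-way layout: three groups of 10 observations (illustrative only)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 10)),
  y = rnorm(30)
)

fit <- aov(y ~ group, data = toy)
res <- residuals(fit)

# Residuals vs fitted values: a funnel shape would suggest unequal variances
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: strong departures from the line would suggest non-normal residuals
qqnorm(res)
qqline(res)
```

The same calls would work on any of the aov objects fitted above, for example residuals(anova_cand).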

4. References

Federal Election Commission, “Presidential Campaign Finance Download,” 2016 Presidential Campaign Finance. [Online]. Available: http://www.fec.gov/disclosurep/PDownload.do. [Accessed: 12-Oct-2016].
D. C. Montgomery, Design and Analysis of Experiments, 8th ed. Hoboken, NJ: John Wiley & Sons, Inc., 2013.


Alexis Ziemba

October 03, 2016
