source("http://stat.duke.edu/~kkl13/courses/sta102F13/labs/inference.R")
cont = read.csv("http://stat.duke.edu/~kkl13/courses/sta102F13/labs/contributions.csv")
The candidates with contributions are Bachmann, Cain, Gingrich, Huntsman, Johnson, McCotter, Obama, Paul, Pawlenty, Perry, Roemer, Romney, and Santorum. Barack Obama had the greatest number of contributions, with 7,454. Thaddeus McCotter had the fewest, with only one.
table(cont$cand_nm)
##
## Bachmann, Michele Cain, Herman
## 34 64
## Gingrich, Newt Huntsman, Jon
## 171 16
## Johnson, Gary Earl McCotter, Thaddeus G
## 8 1
## Obama, Barack Paul, Ron
## 7454 445
## Pawlenty, Timothy Perry, Rick
## 15 46
## Roemer, Charles E. 'Buddy' III Romney, Mitt
## 14 1579
## Santorum, Rick
## 153
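The largest and smallest counts can also be pulled out of the table programmatically instead of being read off by eye. A minimal sketch, assuming a small made-up factor `cand` in place of `cont$cand_nm`:

```r
# Toy stand-in for cont$cand_nm (hypothetical data, not the lab file)
cand = factor(c("A", "B", "B", "C", "C", "C"))

counts = table(cand)

# Names of the most and least frequent levels
most = names(counts)[which.max(counts)]
least = names(counts)[which.min(counts)]

most
least

# Full ranking, largest count first
sort(counts, decreasing = TRUE)
```

Applied to the real data, the same calls on `table(cont$cand_nm)` should pick out the candidates named above.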
Interestingly, all four major Republican candidates have some negative contribution amounts (see the boxplots below), some more than others.
# subset for major Republican candidates
rep_mjr = subset(cont, (cont$cand_nm == "Romney, Mitt" | cont$cand_nm == "Paul, Ron" |
cont$cand_nm == "Gingrich, Newt" | cont$cand_nm == "Santorum, Rick"))
# subset for primary election
rep_mjr_pri = subset(rep_mjr, rep_mjr$election_tp == "P2012")
table(rep_mjr_pri$cand_nm)
##
## Bachmann, Michele Cain, Herman
## 0 0
## Gingrich, Newt Huntsman, Jon
## 165 0
## Johnson, Gary Earl McCotter, Thaddeus G
## 0 0
## Obama, Barack Paul, Ron
## 0 445
## Pawlenty, Timothy Perry, Rick
## 0 0
## Roemer, Charles E. 'Buddy' III Romney, Mitt
## 0 952
## Santorum, Rick
## 151
pri = droplevels(rep_mjr_pri)
table(pri$cand_nm)
##
## Gingrich, Newt Paul, Ron Romney, Mitt Santorum, Rick
## 165 445 952 151
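The subsetting and `droplevels()` steps above can be written more compactly with `%in%`. This is only a sketch on a made-up data frame `toy`; the column names mirror the lab data, but the values are hypothetical:

```r
# Toy stand-in for the contributions data (hypothetical values)
toy = data.frame(
  cand_nm = factor(c("Romney, Mitt", "Paul, Ron", "Obama, Barack", "Romney, Mitt")),
  election_tp = c("P2012", "P2012", "G2012", "G2012"),
  contb_receipt_amt = c(250, 50, 100, 500)
)

major = c("Romney, Mitt", "Paul, Ron", "Gingrich, Newt", "Santorum, Rick")

# Inside subset(), columns can be named directly without the toy$ prefix
toy_pri = subset(toy, cand_nm %in% major & election_tp == "P2012")

# droplevels() removes factor levels that no longer occur in the subset
toy_pri = droplevels(toy_pri)
table(toy_pri$cand_nm)
```

Without `droplevels()`, `table()` would still list every original candidate with a count of zero, as in the first table above.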
par(mfrow = c(2, 2))
boxplot(pri$contb_receipt_amt[pri$cand_nm == "Romney, Mitt"], main = "Romney")
boxplot(pri$contb_receipt_amt[pri$cand_nm == "Paul, Ron"], main = "Paul")
boxplot(pri$contb_receipt_amt[pri$cand_nm == "Gingrich, Newt"], main = "Gingrich")
boxplot(pri$contb_receipt_amt[pri$cand_nm == "Santorum, Rick"], main = "Santorum")
Romney has the highest total contribution amount, 519044.3.
neg_index = which(pri$contb_receipt_amt < 0)
pri$receipt_desc[neg_index]
## [1] REDESIGNATION TO GENERAL
## [3] REDESIGNATION TO GENERAL REATTRIBUTION TO SPOUSE
## [5] Refund REDESIGNATION TO GENERAL
## [7] REDESIGNATION TO GENERAL REDESIGNATION TO GENERAL
## [9] REDESIGNATION TO GENERAL REATTRIBUTED BELOW
## [11] REATTRIBUTION TO SPOUSE REDESIGNATION TO GENERAL
## [13] REATTRIBUTION TO SPOUSE REDESIGNATION TO GENERAL
## [15] REATTRIBUTION TO SPOUSE
## [17] Refund
## [19] Refund REDESIGNATION TO GENERAL
## [21] REDESIGNATION TO GENERAL
## [23] Refund
## [25] REDESIGNATION TO GENERAL REDESIGNATION TO GENERAL
## [27] REATTRIBUTION TO SPOUSE
## [29] Refund REDESIGNATION TO GENERAL
## [31] REDESIGNATION TO GENERAL REDESIGNATION TO GENERAL
## [33] REDESIGNATION TO GENERAL REDESIGNATION TO GENERAL
## [35] REDESIGNATION TO GENERAL REDESIGNATION TO GENERAL
## [37] REDESIGNATION TO GENERAL
## 9 Levels: REATTRIBUTED BELOW ... SEE REATTRIBUTION
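The long listing above is easier to read as counts per description. A minimal sketch, assuming a made-up data frame `toy` with the same two column names as the lab data:

```r
# Toy stand-in for pri (hypothetical values)
toy = data.frame(
  contb_receipt_amt = c(100, -50, 250, -250, -25),
  receipt_desc = factor(c("", "Refund", "", "REDESIGNATION TO GENERAL", "Refund"))
)

# Row positions of the negative amounts
neg_index = which(toy$contb_receipt_amt < 0)

# droplevels() keeps only the descriptions that occur among the negative rows
neg_counts = table(droplevels(toy$receipt_desc[neg_index]))
neg_counts
```

This shows at a glance whether the negative amounts are mostly refunds or mostly redesignations and reattributions.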
sum(pri$contb_receipt_amt[pri$cand_nm == "Romney, Mitt"])
## [1] 519044
sum(pri$contb_receipt_amt[pri$cand_nm == "Paul, Ron"])
## [1] 67228
sum(pri$contb_receipt_amt[pri$cand_nm == "Gingrich, Newt"])
## [1] 23638
sum(pri$contb_receipt_amt[pri$cand_nm == "Santorum, Rick"])
## [1] 31747
Romney also has the highest average contribution, 545.2146.
mean(pri$contb_receipt_amt[pri$cand_nm == "Romney, Mitt"])
## [1] 545.2
mean(pri$contb_receipt_amt[pri$cand_nm == "Paul, Ron"])
## [1] 151.1
mean(pri$contb_receipt_amt[pri$cand_nm == "Gingrich, Newt"])
## [1] 143.3
mean(pri$contb_receipt_amt[pri$cand_nm == "Santorum, Rick"])
## [1] 210.2
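The per-candidate totals and means computed one at a time above can be obtained in a single call each with `tapply()`. A minimal sketch on a made-up data frame `toy` (hypothetical values; only the column names follow the lab data):

```r
# Toy stand-in for pri (hypothetical values)
toy = data.frame(
  cand_nm = factor(c("A", "A", "B", "B", "B")),
  contb_receipt_amt = c(100, 300, 50, -50, 150)
)

# One total and one mean per candidate
totals = tapply(toy$contb_receipt_amt, toy$cand_nm, sum)
means = tapply(toy$contb_receipt_amt, toy$cand_nm, mean)

totals
means

# Candidate with the largest total
names(which.max(totals))
```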
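Ho: the mean contribution amount is the same for every candidate, so the variation among the sample means is no greater than that due to normal variation of characteristics and error in measurement. Ha: at least one candidate's mean contribution amount is different, so the variation among the sample means is greater than that due to normal variation and error in measurement, that is, it depends on the candidates as individuals.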
The conditions of an ANOVA test are: (1) it must be reasonable to regard the groups of observations as random samples from their respective populations (Samuels et al.); (2) the "I" samples must be independent of each other; and (3) the "I" population distributions must be normal with equal standard deviations.
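The data are roughly normal, although all of the normal probability plots show signs of short tails. The sample standard deviations also differ noticeably across candidates (from 277.3 to 968.5 in the summary statistics), so the equal-standard-deviation condition is questionable. The conditions appear to be met in other respects.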
qqnorm(pri$contb_receipt_amt[pri$cand_nm == "Romney, Mitt"], main = "Romney")
qqline(pri$contb_receipt_amt[pri$cand_nm == "Romney, Mitt"])
qqnorm(pri$contb_receipt_amt[pri$cand_nm == "Paul, Ron"], main = "Paul")
qqline(pri$contb_receipt_amt[pri$cand_nm == "Paul, Ron"])
qqnorm(pri$contb_receipt_amt[pri$cand_nm == "Gingrich, Newt"], main = "Gingrich")
qqline(pri$contb_receipt_amt[pri$cand_nm == "Gingrich, Newt"])
qqnorm(pri$contb_receipt_amt[pri$cand_nm == "Santorum, Rick"], main = "Santorum")
qqline(pri$contb_receipt_amt[pri$cand_nm == "Santorum, Rick"])
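The normal probability plots address the normality condition; the equal-standard-deviation condition can be checked numerically. The sketch below uses the four group standard deviations reported in the ANOVA summary statistics of this report. The cutoff of 2 for the ratio is a common rule of thumb and an assumption here, not something the lab specifies:

```r
# Group standard deviations as reported in the ANOVA summary statistics
sds = c(Gingrich = 432.8, Paul = 277.3, Romney = 968.5, Santorum = 411.1)

# Ratio of the largest to the smallest standard deviation
sd_ratio = max(sds) / min(sds)
sd_ratio

# Rule of thumb (assumed, not from the lab): be cautious if the ratio exceeds 2
sd_ratio > 2
```

With the raw data, `tapply(pri$contb_receipt_amt, pri$cand_nm, sd)` would give the same standard deviations directly.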
With four candidates there are six pairwise comparisons, so the Bonferroni-corrected significance level is 0.05/6 = 0.008333333.
inference(data = pri$contb_receipt_amt, group = pri$cand_nm, est = "mean", type = "ht",
alternative = "greater", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_Gingrich, Newt = 165, mean_Gingrich, Newt = 143.3, sd_Gingrich, Newt = 432.8
## n_Paul, Ron = 445, mean_Paul, Ron = 151.1, sd_Paul, Ron = 277.3
## n_Romney, Mitt = 952, mean_Romney, Mitt = 545.2, sd_Romney, Mitt = 968.5
## n_Santorum, Rick = 151, mean_Santorum, Rick = 210.2, sd_Santorum, Rick = 411.1
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: data
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 6.29e+07 20951773 36.5 <2e-16
## Residuals 1709 9.82e+08 574777
##
## Pairwise tests: t tests with pooled SD
## Gingrich, Newt Paul, Ron Romney, Mitt
## Paul, Ron 0.9100 NA NA
## Romney, Mitt 0.0000 0.0000 NA
## Santorum, Rick 0.4328 0.4074 0
0.05/6
## [1] 0.008333
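The divisor 6 is the number of candidate pairs. A short sketch of where it comes from, along with the equivalent approach of inflating the p-values instead of shrinking the significance level (the three p-values are the non-zero pairwise values from the output above):

```r
k = 4                          # number of candidates being compared
n_comparisons = choose(k, 2)   # number of distinct pairs
alpha_star = 0.05 / n_comparisons

n_comparisons
alpha_star

# Equivalent view: multiply each p-value by the number of comparisons (capped at 1)
p_raw = c(0.9100, 0.4328, 0.4074)
p_adj = p.adjust(p_raw, method = "bonferroni", n = n_comparisons)
p_adj
```

Comparing `p_raw` with `alpha_star` and comparing `p_adj` with 0.05 lead to the same decisions.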
inference(data = pri$contb_receipt_amt, group = pri$cand_nm, est = "mean", type = "ht",
alternative = "greater", method = "theoretical", siglevel = 0.008333333)
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_Gingrich, Newt = 165, mean_Gingrich, Newt = 143.3, sd_Gingrich, Newt = 432.8
## n_Paul, Ron = 445, mean_Paul, Ron = 151.1, sd_Paul, Ron = 277.3
## n_Romney, Mitt = 952, mean_Romney, Mitt = 545.2, sd_Romney, Mitt = 968.5
## n_Santorum, Rick = 151, mean_Santorum, Rick = 210.2, sd_Santorum, Rick = 411.1
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: data
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 6.29e+07 20951773 36.5 <2e-16
## Residuals 1709 9.82e+08 574777
##
## Pairwise tests: t tests with pooled SD
## Gingrich, Newt Paul, Ron Romney, Mitt
## Paul, Ron 0.9100 NA NA
## Romney, Mitt 0.0000 0.0000 NA
## Santorum, Rick 0.4328 0.4074 0
At both significance levels, Romney is still the only candidate whose average contribution differs significantly from those of the other candidates.
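As a check on the ANOVA table, the F statistic and its p-value can be recomputed from the mean squares and degrees of freedom printed in the output. A minimal sketch using only those printed numbers:

```r
# Values copied from the Analysis of Variance Table above
ms_group = 20951773   # Mean Sq for group
ms_resid = 574777     # Mean Sq for residuals
df_group = 3
df_resid = 1709

# F is the ratio of between-group to within-group mean squares
f_stat = ms_group / ms_resid
f_stat

# Upper-tail probability under the F distribution
p_value = pf(f_stat, df1 = df_group, df2 = df_resid, lower.tail = FALSE)
p_value
```

The recomputed F rounds to the 36.5 shown in the table.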
# subset for general elections and Obama, Romney, and Johnson
pres_temp1 = subset(cont, cont$election_tp == "G2012")
pres_temp2 = subset(pres_temp1, (pres_temp1$cand_nm == "Obama, Barack" | pres_temp1$cand_nm ==
"Romney, Mitt" | pres_temp1$cand_nm == "Johnson, Gary Earl"))
# droplevels
pres = droplevels(pres_temp2)
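# note: pres_temp2 still carries the unused candidate levels, hence the NA rows in the output;
# passing pres (the droplevels() result) instead should leave only the three candidates
inference(data = pres_temp2$contb_receipt_amt, group = pres_temp2$cand_nm, est = "mean",
type = "ht", alternative = "greater", method = "theoretical")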
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_Bachmann, Michele = NA, mean_Bachmann, Michele = NA, sd_Bachmann, Michele = NA
## n_Cain, Herman = NA, mean_Cain, Herman = NA, sd_Cain, Herman = NA
## n_Gingrich, Newt = NA, mean_Gingrich, Newt = NA, sd_Gingrich, Newt = NA
## n_Huntsman, Jon = NA, mean_Huntsman, Jon = NA, sd_Huntsman, Jon = NA
## n_Johnson, Gary Earl = 6, mean_Johnson, Gary Earl = 230, sd_Johnson, Gary Earl = 226.1
## n_McCotter, Thaddeus G = NA, mean_McCotter, Thaddeus G = NA, sd_McCotter, Thaddeus G = NA
## n_Obama, Barack = 2008, mean_Obama, Barack = 159.1, sd_Obama, Barack = 441.4
## n_Paul, Ron = NA, mean_Paul, Ron = NA, sd_Paul, Ron = NA
## n_Pawlenty, Timothy = NA, mean_Pawlenty, Timothy = NA, sd_Pawlenty, Timothy = NA
## n_Perry, Rick = NA, mean_Perry, Rick = NA, sd_Perry, Rick = NA
## n_Roemer, Charles E. 'Buddy' III = NA, mean_Roemer, Charles E. 'Buddy' III = NA, sd_Roemer, Charles E. 'Buddy' III = NA
## n_Romney, Mitt = 627, mean_Romney, Mitt = 500.1, sd_Romney, Mitt = 795.2
## n_Santorum, Rick = NA, mean_Santorum, Rick = NA, sd_Santorum, Rick = NA
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: data
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 5.56e+07 27782585 93.1 <2e-16
## Residuals 2638 7.87e+08 298380
##
## Pairwise tests: t tests with pooled SD
## Johnson, Gary Earl Obama, Barack
## Obama, Barack 0.7510 NA
## Romney, Mitt 0.2281 0
Obama and Romney differ significantly from each other in average contribution (pairwise p-value of 0), but neither differs significantly from Johnson (p-values of 0.7510 and 0.2281).
table(pres$cand_nm)
##
## Johnson, Gary Earl Obama, Barack Romney, Mitt
## 6 2008 627
While Obama and Romney have many contributions (2008 and 627), Johnson has only six, which is far too few to meet the conditions, making the ANOVA test unreliable.
# subset for general elections and Obama, and Romney
pres_temp3 = subset(cont, cont$election_tp == "G2012")
pres_temp4 = subset(pres_temp3, (pres_temp3$cand_nm == "Obama, Barack" | pres_temp3$cand_nm ==
"Romney, Mitt"))
# droplevels
pres2 = droplevels(pres_temp4)
Romney has a larger average contribution amount than Obama (500.1 versus 159.1), but a slightly lower total contribution amount (313580 versus 319497). This is because Romney's individual contributions tend to be larger, while Obama has many more contributions (2008 versus 627), so his smaller contributions add up to a larger total regardless of the negative values.
neg_index = which(pres2$contb_receipt_amt < 0)
pres2$receipt_desc[neg_index]
## [1] Refund Refund Refund Refund Refund Refund Refund Refund
## [11] Refund
## 5 Levels: REATTRIBUTION FROM SPOUSE ... SEE REATTRIBUTION
sum(pres2$contb_receipt_amt[pres2$cand_nm == "Romney, Mitt"])
## [1] 313580
sum(pres2$contb_receipt_amt[pres2$cand_nm == "Obama, Barack"])
## [1] 319497
mean(pres2$contb_receipt_amt[pres2$cand_nm == "Romney, Mitt"])
## [1] 500.1
mean(pres2$contb_receipt_amt[pres2$cand_nm == "Obama, Barack"])
## [1] 159.1
Because we are only comparing two groups, Obama and Romney, and because I do not already know the population standard deviation, we should use a t-test.
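The p-value is so small that the null hypothesis is rejected. Note that the call below has no `group` argument, so it reports a single-mean test (H0: mu = 0) on the pooled contribution amounts; the earlier pairwise test (p-value of 0 for Romney versus Obama) is what shows sufficient evidence of a significant difference between Romney's and Obama's average contribution amounts.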
inference(data = pres2$contb_receipt_amt, est = "mean", siglevel = 0.05, null = 0,
alternative = "twosided", type = "ht", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 240.257 ; sd = 565.5362 ; n = 2635
## H0: mu = 0
## HA: mu != 0
## Standard error = 11.02
## Test statistic: Z = 21.808
## p-value = 0
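For a direct comparison of the two candidates, a two-sample t statistic can be computed from the summary statistics already reported in this document (n, mean, and sd for Romney and Obama in the general election). This is a sketch of Welch's unequal-variance t-test, which is my choice for illustration and not necessarily what `inference()` does internally:

```r
# Summary statistics from the general-election output in this report
n_r = 627;  mean_r = 500.1; sd_r = 795.2   # Romney
n_o = 2008; mean_o = 159.1; sd_o = 441.4   # Obama

# Per-group variance of the sample mean
v_r = sd_r^2 / n_r
v_o = sd_o^2 / n_o

# Standard error of the difference and the t statistic
se_diff = sqrt(v_r + v_o)
t_stat = (mean_r - mean_o) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v_r + v_o)^2 / (v_r^2 / (n_r - 1) + v_o^2 / (n_o - 1))

# Two-sided p-value
p_value = 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

t_stat
df_welch
p_value
```

With the raw data, `t.test(contb_receipt_amt ~ cand_nm, data = pres2)` runs the same kind of test without the manual arithmetic.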
The 95% confidence interval, ( 218.6637 , 261.8503 ), does not include 0, which agrees with the hypothesis test. Because the call has no `group` argument, this interval is for the pooled mean contribution (240.257), not for the difference between the two candidates' average contribution amounts.
inference(data = pres2$contb_receipt_amt, est = "mean", siglevel = 0.05, null = 0,
alternative = "twosided", type = "ci", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 240.257 ; sd = 565.5362 ; n = 2635
## Standard error = 11.0172
## 95 % Confidence interval = ( 218.6637 , 261.8503 )
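An interval for the difference between the two candidates' means can be sketched from the summary statistics reported earlier in this document. The normal critical value is used here because the sample sizes are large and the `inference()` output reports a Z statistic; treating that as the right choice is an assumption on my part:

```r
# Summary statistics from the general-election output in this report
n_r = 627;  mean_r = 500.1; sd_r = 795.2   # Romney
n_o = 2008; mean_o = 159.1; sd_o = 441.4   # Obama

# Standard error of the difference in sample means
se_diff = sqrt(sd_r^2 / n_r + sd_o^2 / n_o)

# 95% interval: point estimate plus or minus critical value times standard error
z_star = qnorm(0.975)
ci = (mean_r - mean_o) + c(-1, 1) * z_star * se_diff
ci
```

If this interval excludes 0, it supports a difference between Romney's and Obama's average contribution amounts.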