Task 1

Question 1

skewness = function(x) {
  # Cube each standardized deviation, then average over the observations
  skewness = sum(((x-mean(x))/sd(x))^3)/length(x)
  return(skewness)
}
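As a sanity check on the definition, here is a minimal sketch that computes the mean cubed standardized deviation on two small made-up vectors. The helper name `skew.check` and both vectors are assumptions for illustration and are not part of the lab data. A symmetric sample should give a value near zero and a sample with a long right tail should give a positive value.

```r
# Hypothetical helper name, chosen so it does not overwrite skewness() above
skew.check = function(x) {
  z = (x - mean(x)) / sd(x)   # standardized deviations
  mean(z^3)                   # average of the cubed standardized deviations
}

symmetric = c(1, 2, 3, 4, 5)     # made-up symmetric data
right.tail = c(1, 2, 3, 4, 10)   # made-up data with one large value

skew.symmetric = skew.check(symmetric)
skew.right = skew.check(right.tail)
```

Note that the cube has to be applied to each standardized deviation before summing. The raw deviations `x - mean(x)` always sum to zero up to rounding error, so any version that sums them before cubing returns a value that is effectively zero for every dataset.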

Question 2

d2012 = read.csv("http://faraway.neu.edu/biostats/lab2_dataset1.csv")
d2008 = read.csv("http://faraway.neu.edu/biostats/lab2_dataset2.csv")
d2004 = read.csv("http://faraway.neu.edu/biostats/lab2_dataset3.csv")
mean(d2004$population.size)
## [1] 1017.972
median(d2004$population.size)
## [1] 814.5
skewness(d2004$population.size)
## [1] 5.087492e-23
mean(d2008$population.size)
## [1] 1038.789
median(d2008$population.size)
## [1] 868
skewness(d2008$population.size)
## [1] 4.581423e-23
mean(d2012$population.size)
## [1] 993.953
median(d2012$population.size)
## [1] 969
skewness(d2012$population.size)
## [1] 2.068825e-21

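In every year the mean precinct population exceeds the median, so most voting precincts are small and a few large ones pull the mean up. The gap shrinks from about 203 in 2004 (1017.972 vs. 814.5) to about 171 in 2008 (1038.789 vs. 868) and about 25 in 2012 (993.953 vs. 969), so precinct sizes have become more balanced over time. The printed skewness values are all effectively zero (at most about 2e-21), so on their own they do not show that skew or polarization is increasing.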

Question 3

# Percentage voting Democrat per precinct, 2004 (light blue) and 2012 (blue)
hist(d2004$voted.democrat/d2004$population.size * 100, col = adjustcolor("steelblue1", alpha.f = 0.3), xlim = c(0,100), xlab = "Percentage of Votes", main = "Democratic vote percentage, 2004 and 2012", breaks = seq(0, 100, by = 1))
# add = TRUE draws on the existing axes, so par(new = TRUE) is not needed
hist(d2012$voted.democrat/d2012$population.size * 100, col = adjustcolor("blue", alpha.f = 0.3), breaks = seq(0, 100, by = 1), add = TRUE)

Question 4

There are temporal trends in the distribution of votes. As the histogram above shows, the distribution of votes is becoming more concentrated around the 55% mark, with fewer outliers, and is moving away from the 50% mark. This suggests that there are fewer swing precincts and that precincts are becoming more and more polarized.

Task 2

Question 1

first.digit = substr(as.character(d2012$voted.democrat), start=1, stop=1)
first.digit = as.numeric(first.digit)
observed = table(first.digit)
observed
## first.digit
##   1   2   3   4   5   6   7   8   9 
##   4  12 104 257 314 190  92  26   1
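The first-digit extraction above works on the text form of each count. As a cross-check, here is a minimal sketch comparing that approach with an arithmetic alternative on a few made-up counts. The vector `votes` and the names `digit.text` and `digit.math` are assumptions for illustration, not part of the lab data.

```r
votes = c(5, 42, 317, 1999)   # made-up vote counts, all positive

# Method used above: take the first character of the number written as text
digit.text = as.numeric(substr(as.character(votes), start = 1, stop = 1))

# Arithmetic alternative: divide by the largest power of 10 not exceeding the value
digit.math = floor(votes / 10^floor(log10(votes)))
```

Both approaches assume strictly positive counts. A count of zero would give a first digit of 0 with the text method and an undefined value with the arithmetic one, and Benford's Law only covers the digits 1 to 9.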

Question 2

# Benford's law
expected = log10(1+1/(1:9))
# Expected count for each digit based on Benford's Law
expected = round(expected*sum(observed))
expected
## [1] 301 176 125  97  79  67  58  51  46

Question 3

bp = barplot(observed, names=1:9, xlab = "Digit", ylab = "Frequency", main = "Expected vs Observed digit frequency in 2012 election")
# Overlay the Benford expected counts as red points on the same axes as the bars
points(bp, expected, col = "red")

Question 4

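The first digits of the vote counts are not consistent with Benford’s Law. The observed counts peak at digits 4 to 6 (257, 314, and 190 of 1,000), whereas Benford’s Law predicts a peak at digit 1 (301 expected versus 4 observed).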

Question 5

H0: The first digits of the votes are consistent with Benford’s Law.
HA: The first digits of the votes are not consistent with Benford’s Law.

Question 6

expected = expected/sum(expected)
chisq.test(observed, y=NULL, correct=TRUE, p=expected)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 1714.4, df = 8, p-value < 2.2e-16

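The p-value of this test is less than 2.2e-16, which is far below the 0.05 significance level, so we reject the null hypothesis and conclude that the first digits of the votes are not consistent with Benford’s Law.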
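To make the test statistic less of a black box, here is a minimal sketch that recomputes it by hand from the observed and expected counts printed earlier. The variable names (`obs.counts`, `exp.counts`, `chi.sq`, `deg.free`, `p.value`) are assumptions for illustration. The counts are copied from the output above rather than read from the data.

```r
# Counts copied from the printed output above
obs.counts = c(4, 12, 104, 257, 314, 190, 92, 26, 1)
exp.counts = c(301, 176, 125, 97, 79, 67, 58, 51, 46)

# Pearson statistic: sum over digits of (observed - expected)^2 / expected
chi.sq = sum((obs.counts - exp.counts)^2 / exp.counts)

# Degrees of freedom: number of digit categories minus one
deg.free = length(obs.counts) - 1

# Upper-tail probability of a statistic at least this large under the null
p.value = pchisq(chi.sq, df = deg.free, lower.tail = FALSE)
```

The statistic comes out at about 1714.4 on 8 degrees of freedom, in line with the `chisq.test` output. Most of it comes from digit 5 (314 observed versus 79 expected) and digit 1 (4 observed versus 301 expected).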

Task 3

Question 1

# P(Win)
pW=0.5
# P(Favored|Win)
pF.W=0.75
# P(Favored|Loss)
pF.L=0.20

Question 2

# Create a vector of pW values (i.e., P(W))
pWvals=seq(0, 1, length=101)
# Initialize the vector of pW.Fvals (i.e., P(W|F))
pW.Fvals=numeric(101)
for (i in 1:length(pWvals)) {
  # P(F) by the law of total probability
  pF = (pF.W*pWvals[i]) + (pF.L*(1-pWvals[i]))
  # P(W|F) by Bayes' theorem
  pW.Fvals[i] = pF.W*pWvals[i]/pF
}
pW.Fvals
##   [1] 0.00000000 0.03649635 0.07109005 0.10392610 0.13513514 0.16483516
##   [7] 0.19313305 0.22012579 0.24590164 0.27054108 0.29411765 0.31669866
##  [13] 0.33834586 0.35911602 0.37906137 0.39823009 0.41666667 0.43441227
##  [19] 0.45150502 0.46798030 0.48387097 0.49920761 0.51401869 0.52833078
##  [25] 0.54216867 0.55555556 0.56851312 0.58106169 0.59322034 0.60500695
##  [31] 0.61643836 0.62753036 0.63829787 0.64875491 0.65891473 0.66878981
##  [37] 0.67839196 0.68773234 0.69682152 0.70566948 0.71428571 0.72267920
##  [43] 0.73085847 0.73883162 0.74660633 0.75418994 0.76158940 0.76881134
##  [49] 0.77586207 0.78274760 0.78947368 0.79604579 0.80246914 0.80874873
##  [55] 0.81488934 0.82089552 0.82677165 0.83252191 0.83815029 0.84366063
##  [61] 0.84905660 0.85434174 0.85951941 0.86459286 0.86956522 0.87443946
##  [67] 0.87921847 0.88390501 0.88850174 0.89301122 0.89743590 0.90177815
##  [73] 0.90604027 0.91022444 0.91433278 0.91836735 0.92233010 0.92622294
##  [79] 0.93004769 0.93380615 0.93750000 0.94113091 0.94470046 0.94821021
##  [85] 0.95166163 0.95505618 0.95839525 0.96168018 0.96491228 0.96809282
##  [91] 0.97122302 0.97430407 0.97733711 0.98032326 0.98326360 0.98615917
##  [97] 0.98901099 0.99182004 0.99458728 0.99731363 1.00000000

Question 3

plot(pWvals, pW.Fvals, type="l", col="red", lwd=5, xlab="P(W)", ylab="P(W|F)", main="Probability Win given Favored")
# Reference line P(W|F) = P(W): intercept 0, slope 1
abline(a=0, b=1)

The red line lies on or above the black 1:1 line for every value of P(W), touching it only at P(W) = 0 and P(W) = 1. This means that the probability that a candidate wins given that they are favored in the polls is greater than or equal to their probability of winning in the first place. Because P(F|W) = 0.75 is larger than P(F|L) = 0.20, being favored in the polls cannot make a candidate less likely to win.
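The same comparison can be checked numerically without reading it off the plot. The following is a minimal vectorized sketch of the Bayes calculation. The names `prior`, `posterior`, `min.gap`, and `posterior.half` are assumptions for illustration, and the two conditional probabilities are repeated from Question 1 so that the block runs on its own.

```r
pF.W = 0.75   # P(Favored|Win), as in Question 1
pF.L = 0.20   # P(Favored|Loss), as in Question 1

prior = seq(0, 1, length = 101)          # candidate values of P(W)
pF = pF.W * prior + pF.L * (1 - prior)   # P(F) by the law of total probability
posterior = pF.W * prior / pF            # P(W|F) by Bayes' theorem

# Smallest difference between P(W|F) and P(W) across the grid
min.gap = min(posterior - prior)

# P(W|F) at P(W) = 0.5, the 51st grid point
posterior.half = posterior[51]
```

Algebraically, P(W|F) - P(W) = P(W) (1 - P(W)) (P(F|W) - P(F|L)) / P(F), which is never negative as long as P(F|W) is at least P(F|L). If the polls favored losers more often than winners, the red line would fall below the 1:1 line instead.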