NOTE THIS ASSESSMENT IS DUE ON 5 September BY 11:59 PM.


For this Assessment we will use the following dataset:

The dataset episodes included in the MXB107 package for R contains records for 704 episodes of Star Trek that aired between 1966 and 2005. (Type ?episodes for a detailed description of the data.)

Part 1: Summarising Data

Question 1

  1. Name three principles for good practice when creating graphical summaries of data.

Type your answer here:
  1. Choose the right visualization
  2. Provide context
  3. Use visual cues to show relationships

  1. Identify three elements of the following graphical summary of data that should be corrected.

Type your answer here:

  1. The plots have no axis labels or title.
  2. The bin width needs to be increased to better illustrate the data.
  3. The series names should be identified in a title or legend, and the two series need distinct aesthetics (for example, fill colour).

  1. Create a set of boxplots showing the IMDB rankings for each series of Star Trek. Discuss the results.

Show your code here:

library(MXB107)
data("episodes")

ggplot(episodes, aes(x = Series.Name, y = IMDB.Ranking)) +
  geom_boxplot() + 
  xlab ("Series Names") + 
  ylab ("IMDB User Ratings (0-10)") +
  ggtitle ("IMDB Ranking by Series") +
  theme (plot.title = element_text(hjust = 0.5))

Two features stand out in the boxplots. First, several episodes are plotted as outliers, meaning they fall more than 1.5 times the interquartile range beyond the quartiles; the clearest is a single episode of The Next Generation with a score much closer to 3 than any episode of the other series. Second, Enterprise appears to have the best ratings on average, although the centre line of a boxplot marks the median rather than the mean.

  1. Create a pair of histograms comparing the IMDB rankings for episodes of Star Trek: The Next Generation that pass the Bechdel-Wallace Test versus those that failed. Discuss the results.

Show your code here:

library (MXB107)
data ("episodes")

episodes%>%
  count(Female.Director,Bechdel.Wallace.Test)%>%
  group_by(Bechdel.Wallace.Test)%>%
  pivot_wider(names_from = Female.Director, values_from = n)%>%
  kable()
Bechdel.Wallace.Test FALSE TRUE
FALSE 323 15
TRUE 346 20
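episodes %>%
  filter(Series == "TNG") %>% # assumes "TNG" is the Series code for The Next Generation, by analogy with "VOY"
  ggplot(aes(x = IMDB.Ranking, fill = Bechdel.Wallace.Test)) +
  geom_histogram(bins = 10) +
  facet_wrap(vars(Bechdel.Wallace.Test)) +
  xlab("IMDB User Rating") +
  ylab("Episode Count") +
  ggtitle("The Next Generation: IMDB Rankings by Bechdel-Wallace Test Result")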

Type your answer here:

There appears to be a slight association between the rankings and the Bechdel-Wallace Test result: episodes that pass the test receive, on average, a somewhat higher score than those that do not.

Question 2

  1. Identify and define three numerical summaries of centrality for data.

Type your answer here:

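  1. Mean: the arithmetic average, the sum of the observations divided by the number of observations.

  2. Median: the middle value of the ordered data, with half of the observations on either side.

  3. Mode: the most frequently occurring value.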
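
The three measures can be compared on a small example. This is a minimal sketch on made-up ratings (the vector `ratings` is hypothetical and not taken from the episodes data); base R has no built-in function for the statistical mode, so it is obtained here by tabulating the values.

```r
# Hypothetical ratings, not taken from the episodes data
ratings <- c(7.1, 7.6, 7.6, 8.0, 6.4, 7.6, 8.3)

mean_rating <- mean(ratings)      # arithmetic average
median_rating <- median(ratings)  # middle value of the sorted data

# Tabulate the values and take the most frequent one as the mode
freq <- table(ratings)
mode_rating <- as.numeric(names(freq)[which.max(freq)])

mean_rating
median_rating
mode_rating
```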

  1. Identify and define three numerical summaries of dispersion for data.

Type your answer here:

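  1. Standard deviation: the square root of the variance, expressed in the same units as the data.

  2. Variance: the average squared deviation of the observations from the mean.

  3. Interquartile range: the distance between the first and third quartiles. (Skew describes the asymmetry of a distribution rather than its dispersion.)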

Question 3

  1. For all 704 episodes of Star Trek compute the standard deviation of their IMDB rankings using the definition of standard deviation and then use the empirical rule to estimate the standard deviation. Compare and discuss the results.

Show your code here:

library (MXB107)
data ("episodes")

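IMDB_Data <- episodes$IMDB.Ranking
# Standard deviation from the definition (population form, dividing by n)
sd_IMDB <- sqrt(sum((IMDB_Data - mean(IMDB_Data))^2)/length(IMDB_Data))

sd_IMDB
## [1] 0.7754944
# Empirical rule: about 95% of values lie within 2 standard deviations of the mean,
# so the range spans roughly 4 standard deviations
diff(range(IMDB_Data))/4

Type your answer here:

The standard deviation computed from the definition is 0.775. The empirical rule states that, for roughly bell-shaped data, about 68% of observations lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three, so the standard deviation can be estimated as the range divided by 4. Multiplying the mean by 0.341 does not estimate the standard deviation: 34.1% is the share of observations between the mean and one standard deviation above it, not a scaling factor. The range-based estimate is sensitive to extreme values, so low-rated outliers such as those visible in the boxplots will tend to pull it above the definition-based value.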

  1. For all 704 episodes of Star Trek compute the mean and median of their IMDB rankings. Do the data appear to be skewed? Compute the skew of the data and plot a histogram of the episodes’ IMDB rankings, do they appear skewed? Compare and discuss the numerical results and your histogram.

Show your code here:

library (MXB107)
data ("episodes")

mean(IMDB_Data)
## [1] 7.55071
median(IMDB_Data)
## [1] 7.6
# Skew
(1/length(IMDB_Data)) * sum(((IMDB_Data - mean(IMDB_Data))/sd(IMDB_Data))^3)
## [1] -0.3873874
ggplot(episodes, aes(x = IMDB.Ranking)) +
  geom_histogram(binwidth = 0.1) + # assumes ratings are recorded to one decimal place, so narrower bins leave gaps
  xlab("User Ratings") + ylab("Episode Count") +
  ggtitle("Distribution of IMDB Rankings") +
  theme(plot.title = element_text(hjust = 0.5))

Type your answer here: The data are skewed left. The mean (7.55) is below the median (7.6), the computed skew is negative (-0.387), and the histogram shows a longer tail towards low ratings.
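
The link between the sign of the skew and the ordering of mean and median can be illustrated on toy data. This is a minimal sketch: the helper `skewness()` and both vectors are hypothetical, and the formula is the same average of cubed z-scores used in the code above.

```r
# Moment-based skewness: the average cubed z-score
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

left_tailed <- c(2, 7, 8, 8, 9, 9, 9)  # hypothetical data with a long lower tail
symmetric <- c(1, 2, 3, 4, 5)          # hypothetical symmetric data

skewness(left_tailed)  # negative, and the mean is below the median
skewness(symmetric)    # approximately zero
```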

Part 2: Computing Basic Probabilities for Events

Question 1

  1. What is the classical definition of probability?

The probability of an event is the ratio of the number of cases favourable to it to the number of all cases possible, when nothing leads us to expect that any one of these cases should occur more than any other, which renders them, for us, equally possible.

  1. What is the probability that a randomly selected episode of Star Trek will pass the Bechdel-Wallace Test?

Show your code here:

library (MXB107)
data ("episodes")

episodes%>%
  count(Female.Director,Bechdel.Wallace.Test)%>%
  group_by(Bechdel.Wallace.Test)%>%
  pivot_wider(names_from = Female.Director, values_from = n)%>%
  kable()
Bechdel.Wallace.Test FALSE TRUE
FALSE 323 15
TRUE 346 20
366/704
## [1] 0.5198864

The probability of an episode passing the Bechdel-Wallace Test is approximately 52.0%, meaning a little over half of the episodes pass the test.

Question 2

  1. What is the definition of joint probability?

Joint probability is the probability that two events both occur. In general Pr(AB) = Pr(A|B) * Pr(B); this reduces to Pr(AB) = Pr(A) * Pr(B) only when A and B are independent.
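
Whether a joint probability equals the product of the marginals can be checked against counts. The following is a minimal sketch using the Female.Director by Bechdel.Wallace.Test counts tabulated earlier in this document; the object names are illustrative only.

```r
# Counts from the earlier table: rows = Bechdel.Wallace.Test, columns = Female.Director
counts <- matrix(c(323, 15,
                   346, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Bechdel = c("FALSE", "TRUE"),
                                 FemaleDirector = c("FALSE", "TRUE")))
n <- sum(counts)  # 704 episodes

p_joint <- counts["TRUE", "TRUE"] / n   # Pr(pass and female director)
p_pass <- sum(counts["TRUE", ]) / n     # Pr(pass)
p_female <- sum(counts[, "TRUE"]) / n   # Pr(female director)

p_joint            # joint probability taken directly from the counts
p_pass * p_female  # value the joint probability would take under independence
```

The two printed values differ, so the product of the marginals is only an approximation here.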

  1. What is the probability that an original series episode passes the Bechdel-Wallace Test?

Show your code here:

library (MXB107)
data ("episodes")

episodes %>%
  filter(Series == "TOS") %>% # assumes "TOS" is the Series code for The Original Series
  count(Bechdel.Wallace.Test)
5/80 # Pr(pass | TOS): 5 passes and 75 fails, matching the 0.0625 used in Part 3
## [1] 0.0625
5/704 # Pr(TOS and pass): joint probability over all 704 episodes
## [1] 0.007102273

Only 6.25% of episodes of The Original Series pass the Bechdel-Wallace Test (5 of 80). As a joint probability over all 704 episodes, the probability that a randomly selected episode is both from The Original Series and passes the test is 5/704, or about 0.71%.

Question 3

  1. What is the definition of conditional probability?

The probability of an event occurring, given that another event has already occurred.

Pr(A|B) = Pr(AB)/Pr(B), where Pr(AB) is the joint probability of A and B, and Pr(B) > 0
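
As a worked instance of the definition, the sketch below computes the probability that an episode passes given that it had a female director, using the counts from the Female.Director by Bechdel.Wallace.Test table shown earlier. The variable names are illustrative.

```r
# Counts from the Female.Director by Bechdel.Wallace.Test table
n_total <- 704
n_female <- 15 + 20      # episodes with a female director
n_pass_and_female <- 20  # of those, the episodes that pass

# Pr(pass | female director) = Pr(pass and female director) / Pr(female director)
p_pass_given_female <- (n_pass_and_female / n_total) / (n_female / n_total)
p_pass_given_female      # identical to 20/35
```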

  1. What is the probability that an episode fails the Bechdel-Wallace Test given that it is an episode from Star Trek: Deep Space Nine?

Show your code here:

library (MXB107)
data ("episodes")
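episodes %>%
  group_by(Series.Name) %>%
  summarise(Pr.Fail = mean(!Bechdel.Wallace.Test)) # Pr(fail | series), assuming the test column is logical
176/704 # Pr(DS9): marginal probability that an episode is from Deep Space Nine
## [1] 0.25

The conditional probability is Pr(fail | DS9) = Pr(fail and DS9)/Pr(DS9), which is the number of Deep Space Nine episodes that fail divided by the 176 Deep Space Nine episodes. This is the Pr.Fail value in the Deep Space Nine row of the summary. The value 0.25 is the marginal probability that an episode is from Deep Space Nine, not the conditional probability asked for.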

Question 4

  1. What is Bayes’ Theorem

Bayes’ Theorem states that the conditional probability of an event, given the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event, divided by the probability of the second event.

Type your answer here: \[ Pr(B|A) = \dfrac{Pr(A|B)Pr(B)}{Pr(A)} \]
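
Bayes’ Theorem can be checked numerically against direct counts. A minimal sketch, taking A as "episode passes" and B as "episode has a female director" with the counts from the table in Part 1; the names are illustrative.

```r
# A = episode passes, B = episode has a female director
p_A <- 366 / 704        # Pr(A)
p_B <- 35 / 704         # Pr(B)
p_A_given_B <- 20 / 35  # Pr(A|B)

# Bayes' Theorem: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
p_B_given_A <- p_A_given_B * p_B / p_A
p_B_given_A             # agrees with the direct count ratio 20/366
```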

  1. Given that an episode passes the Bechdel-Wallace Test what is the probability that it was from Season 3 of Star Trek: Voyager?

Show your code here:

episodes%>%
filter(Bechdel.Wallace.Test == TRUE)%>% # Remove this filter for all episodes
group_by(Series.Name, Season)%>%
tally()%>%
pivot_wider(names_from = Series.Name, values_from = n)%>%
bind_rows(summarise_all(., ~sum(., na.rm=TRUE)))%>% # Total column
mutate(Total = rowSums(.[setdiff(names(.),"Season")], na.rm = TRUE)) # Total row
# A = episode passes, B = episode is from Season 3 of Voyager
# 17 of the 145 passing Voyager episodes are from Season 3
17/704 # Pr(A and B)
## [1] 0.02414773
366/704 # Pr(A)
## [1] 0.5198864
(17/704)/(366/704) # Pr(B|A) = Pr(A and B)/Pr(A)
## [1] 0.04644809

Given that an episode passes the Bechdel-Wallace Test, there is a 4.64% chance that it is from Season 3 of Star Trek: Voyager.

  1. Is this probability greater or less than the marginal probability that a randomly selected episode is from Season 3 of Star Trek: Voyager? Why?

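This probability is greater than the marginal probability that a randomly selected episode is from Season 3 of Star Trek: Voyager. Conditioning on passing restricts the sample space to the 366 passing episodes, and Season 3 of Voyager makes up a larger share of those than of all 704 episodes because its episodes pass at a higher rate than the overall 52%.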

Part 3: Modelling with Probability Distributions

Question 1

  1. Define a Bernoulli random variable.

A Bernoulli random variable is the simplest kind of random variable. It can take on two values, 1 and 0. It takes on a 1 if an experiment with probability p resulted in success and a 0 otherwise.

  1. Assume I have a fair coin, What is the probability that I will need more than two coin tosses to get a “heads”?

Show your code here:

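1 - pgeom(1, prob = 0.5) # pgeom counts failures (tails) before the first head
## [1] 0.25

There is a 25% chance that it will require more than 2 tosses to get a “heads”, which is the probability that the first two tosses are both tails.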
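
The parameterisation matters here: R's pgeom gives the probability for the number of failures before the first success, not the number of tosses. The sketch below, with illustrative variable names, computes the probability of needing more than two tosses in two ways.

```r
p <- 0.5

# pgeom(q, prob) is Pr(number of failures before the first success <= q).
# Needing more than two tosses means the first two tosses are both tails,
# that is, at least two failures.
p_more_than_two <- 1 - pgeom(1, prob = p)

# Direct check: two tails in a row
p_direct <- (1 - p)^2

p_more_than_two
p_direct
```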

  1. Define a geometrically distributed random variable and Write out the probability mass distribution for a geometric probability distribution. Define the process that gives rise to a geometrically distributed random variable in terms of Bernoulli trials.

Show your code here:

Type your answer here:

A geometric random variable can be defined in terms of Bernoulli trials. Consider a sequence of independent Bernoulli trials, each with probability of success \(p\), where \(0 < p \leq 1\). The number of failures, \(X\), before the first success occurs follows a geometric distribution with p.m.f.:

Pr(X = k) = ((1-p)^k)p, k = 0,1,2,3,…

A geometric distribution is also the special case of the negative binomial distribution in which the number of required successes is 1.

  1. If the overall proportion of Star Trek episodes that pass the Bechdel-Wallace Test is \(0.52\) then assume I begin watching episodes selecting them at random, how many episodes do I have to watch until the probability I see at least one episode that passes the Bechdel-Wallace Test is more than 95%?

Show your code here:

qgeom(0.95, 0.52)
## [1] 4

Type your answer here:

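It would require that you watch 5 episodes. qgeom returns the number of failures before the first success (4), so the number of episodes is 4 + 1 = 5. As a check, 1 - 0.48^4 = 0.947 is below 95%, while 1 - 0.48^5 = 0.975 exceeds it.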
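
The required number of episodes can also be found by direct search. This is a small sketch under the stated assumption that episodes are independent with pass probability 0.52; it increases n until the probability of at least one pass exceeds 95%.

```r
p <- 0.52
n <- 1

# Pr(at least one pass in n episodes) = 1 - (1 - p)^n
while (1 - (1 - p)^n <= 0.95) {
  n <- n + 1
}

n                   # smallest number of episodes
qgeom(0.95, p) + 1  # failures before the first success, plus the success itself
```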

Question 2

  1. I have a coin that comes up heads for a given coin toss with probability \(p\). If I toss the coin \(n\) times, on average how many heads should I get? What is the standard deviation for the random variable \(X=\) number of heads in \(n\) coin tosses?

Type your answer here: \(X \sim Binomial(n,p)\)

\(Pr(X = x) = \binom{n}{x} p^x q^{n-x}\), where \(q = 1-p\)

Expected value: \(E[X] = np\)

Variance: \(Var[X] = npq\)

On average you would expect \(np\) heads, with a standard deviation of \(\sqrt{np(1-p)}\). For a fair coin (\(p = 0.5\)) this is \(n/2\) heads with a standard deviation of \(\sqrt{n}/2\).

  1. Describe a binomial random variable in terms of Bernoulli trials. For what value of \(p\) is the variance for a binomial random variable maximised?

Type your answer here:

Indicator random variables are Bernoulli random variables, with p = P(A). A binomial random variable is random variable that represents the number of successes in ‘n’ successive independent trials of a Bernoulli experiment.

The variance \(np(1-p)\) is maximised at \(p = 0.5\). Setting its derivative with respect to \(p\), \(n(1-2p)\), equal to zero gives \(p = 1/2\), where the variance equals \(n/4\).
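
The maximising value of p can be confirmed numerically. A minimal sketch, assuming a hypothetical n of 10 trials; `binom_var` is an illustrative helper and `optimize` is the one-dimensional optimiser from base R's stats package.

```r
n <- 10  # hypothetical number of trials

# Variance of a Binomial(n, p) random variable as a function of p
binom_var <- function(p) n * p * (1 - p)

best <- optimize(binom_var, interval = c(0, 1), maximum = TRUE)
best$maximum    # close to 0.5
best$objective  # close to n / 4
```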

  1. What proportion of Star Trek: The Original Series episodes pass the Bechdel-Wallace Test? If I select 10 episodes of Star Trek: The Original Series at random, what is the probability that I will see 2 or fewer episodes that pass the Bechdel-Wallace Test?

Show your code here:

pbinom(2,10,0.0625)
## [1] 0.9789929

Type your answer here:

The proportion of Original Series episodes that pass is 0.0625. With \(X \sim Binomial(10, 0.0625)\):

\(Pr(X \leq 2)\) = 97.899%

  1. Now assume that I sample episodes at random from all 704 episodes of Star Trek and the proportion of all episodes that pass the Bechdel-Wallace Test is \(0.52\). If I select 100 episodes at random from all the episodes of Star Trek what is probability that I see less than 50 episodes that pass the Bechdel-Wallace Test. Compute this using the binomial probability distribution, the Poisson probability distribution, and the Gaussian distribution. Compare and contrast the results.

Show your code here:

pbinom(49,100,0.52)
## [1] 0.3081545
ppois(49, 52)
## [1] 0.3721497

Type your answer here:

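\(Pr(X < 50) = Pr(X \leq 49)\)

Binomial: 30.815% Poisson: 37.215%

Gaussian: \(\mu = np = 52\), \(Var(X) = np(1-p) = 24.96\), \(\sigma = \sqrt{24.96} = 4.995998\)

p(y) = 0.3081016

The Gaussian and binomial results are nearly identical, while the Poisson result is noticeably higher. The Poisson approximation assumes a variance equal to the mean (52 rather than 24.96) and is only accurate when \(p\) is small, which is not the case for \(p = 0.52\).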
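
The three calculations can be placed side by side. This is a sketch rather than the method used for the figures above: it assumes the Gaussian approximation is applied with a continuity correction (evaluating at 49.5), which is one common choice, and the variable names are illustrative.

```r
n <- 100
p <- 0.52
mu <- n * p                      # 52
sigma <- sqrt(n * p * (1 - p))   # square root of 24.96

exact <- pbinom(49, n, p)            # binomial Pr(X <= 49)
normal_cc <- pnorm(49.5, mu, sigma)  # Gaussian with continuity correction
poisson <- ppois(49, mu)             # Poisson with the same mean

c(binomial = exact, gaussian = normal_cc, poisson = poisson)
```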

Question 3

  1. Show that as \(n\rightarrow \infty\) and \(p\rightarrow 0\) the probability distribution of a random variable \(X\sim Binom(n,p)\) converges to a Poisson probability distribution.

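Type your answer here: Recall the binomial probability mass function:

\[ Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Define \(\lambda\) as:

\(\lambda = np\), so that \(p = \lambda/n\) and \(p\rightarrow 0\) as \(n\rightarrow \infty\) with \(\lambda\) held fixed.

Substitute this value of \(p\) into the binomial probability mass function:

\[ \lim_{n\rightarrow \infty} Pr(X = k) = \lim_{n\rightarrow \infty} \dfrac{n!}{k!(n-k)!} \left(\dfrac{\lambda}{n}\right)^k \left(1-\dfrac{\lambda}{n}\right)^{n-k} \]

Move the factors that do not depend on \(n\) outside the limit:

\[ \dfrac{\lambda^k}{k!} \lim_{n\rightarrow \infty} \dfrac{n!}{(n-k)!\,n^k} \left(1-\dfrac{\lambda}{n}\right)^{n} \left(1-\dfrac{\lambda}{n}\right)^{-k} \]

Take the limit of each of the three remaining factors in turn. First:

\[ \lim_{n\rightarrow \infty} \dfrac{n!}{(n-k)!\,n^k} = \lim_{n\rightarrow \infty} \dfrac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} = 1 \]

Second, recall:

\(e = \lim_{x\rightarrow \pm\infty}\left(1+\dfrac{1}{x}\right)^x\)

Let \(x = \dfrac{-n}{\lambda}\), so that \(n = -\lambda x\) and:

\[ \lim_{n\rightarrow \infty} \left(1-\dfrac{\lambda}{n}\right)^n = \lim_{x\rightarrow -\infty} \left(1+\dfrac{1}{x}\right)^{-\lambda x} = e^{-\lambda} \]

Third, since \(k\) is fixed:

\[ \lim_{n\rightarrow \infty} \left(1-\dfrac{\lambda}{n}\right)^{-k} = 1 \]

Combining the three limits:

\[ \dfrac{\lambda^k}{k!} \lim_{n\rightarrow \infty} \dfrac{n!}{(n-k)!\,n^k} \left(1-\dfrac{\lambda}{n}\right)^{n} \left(1-\dfrac{\lambda}{n}\right)^{-k} = \dfrac{\lambda^k}{k!} \cdot 1 \cdot e^{-\lambda} \cdot 1 \]

Simplifies to:

\[ Pr(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!} \]

The result is the Poisson probability mass function with rate \(\lambda\).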
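
The convergence can also be seen numerically. A minimal sketch, assuming a hypothetical fixed rate of 2: it measures the largest gap between the Binomial(n, lambda/n) and Poisson(lambda) probabilities over k = 0, ..., 10 as n grows.

```r
lambda <- 2  # hypothetical rate, held fixed while n grows
k <- 0:10

# Largest pointwise difference between the two probability mass functions
max_gap <- sapply(c(10, 100, 1000, 10000), function(n) {
  max(abs(dbinom(k, n, lambda / n) - dpois(k, lambda)))
})

max_gap  # shrinks towards 0 as n increases
```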

  1. For Star Trek: The Original Series plot the probability distribution for the number of episodes out of ten that would pass the Bechdel-Wallace Test. Use the Binomial and the Poisson distributions. Compare and discuss the results.

Show your code here:

n <- 10
p <- 0.0625 # proportion of Original Series episodes that pass (the value used with pbinom above)
number_of_successes <- 0:n # include the outcome of zero successes
# Generate the dataframe with 3 columns: Successes, Binomial, Poisson
data <- data.frame(Successes = number_of_successes,
                   Binomial = dbinom(number_of_successes, n, p),
                   Poisson = dpois(number_of_successes, n * p))
# Plot side-by-side plots
data %>%
  pivot_longer(cols = -c(Successes), names_to = "Distribution") %>%
  ggplot(aes(x = Successes, y = value)) +
  scale_x_continuous(breaks = data$Successes) +
  geom_col() +
  ylab("Probability") +
  facet_wrap(~Distribution)

Type your answer here:

With \(p = 0.0625\) the two distributions are nearly identical, and both place most of their probability on 0 or 1 passing episodes. The Poisson distribution is slightly more spread out, with variance \(\lambda = np = 0.625\) compared with \(np(1-p) = 0.586\) for the binomial, and it assigns a small positive probability to counts above 10, which the binomial cannot produce.

  1. What is the relationship between the Poisson and Exponential probability distributions?

Type your answer here:

If the number of events per unit time follows a Poisson distribution, then the amount of time between events follows the exponential distribution.
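
This relationship can be verified directly: the probability of no events in a window of length t under the Poisson model equals the probability that the exponential waiting time exceeds t. A minimal sketch, using the rate of 0.693 per hour from the passage that follows and an illustrative window of 3 hours.

```r
rate <- 0.693  # events per hour
t <- 3         # hours

# No events in (0, t] under the Poisson model ...
p_no_events <- dpois(0, rate * t)

# ... is the same as the waiting time to the first event exceeding t
p_wait_longer <- pexp(t, rate, lower.tail = FALSE)

p_no_events
p_wait_longer
```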

Assume that the average episode is 45 minutes long. Given that an episode passes the Bechdel-Wallace Test with probability \(p=0.52\), this is equivalent to \(0.693\) instances of passing the Bechdel-Wallace Test per hour of Star Trek viewing.

  1. If I watch ten hours of Star Trek (assume the hours are completely random), what is the probability that I see more than 7 instances of passing the Bechdel-Wallace Test.

Type your answer here:

\(Pr(X > 7)\)

\(X \sim Pois(\lambda t)\) with \(\lambda = 0.6933\) per hour and \(t = 10\) hours, so \(E[X] = 10 \times 0.6933 = 6.933\)

x_7 <- 1 - ppois(7, 10*0.6933)

\(=0.3913041\)

There is a 39.13% chance of seeing more than 7 instances of passing the Bechdel-Wallace Test in ten hours of viewing.

  1. What is the probability that I will have to watch more than three hours to see one instance of passing the Bechdel-Wallace Test

Type your answer here:

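\(Pr(T > 3)\)

\(T \sim Exponential(0.6933)\), where \(T\) is the waiting time in hours until the first instance

x_3 <- 1 - pexp(3, 0.6933) # equals exp(-3 * 0.6933), approximately 0.125

There is approximately a 12.5% chance that I will have to watch more than three hours to see one instance of passing the Bechdel-Wallace Test.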

Question 4

  1. Define the \(Z\)-score, or how we convert a Gaussian random variable to a Standard Gaussian random variable.

Type your answer here:

A z-score describes the position of a raw score in terms of its distance from the mean, when measured in standard deviation units. \(Z = (x - \mu)/\sigma\)

Any point (x) from a normal distribution can be converted to the standard normal distribution (z) with the formula z = (x-mean) / standard deviation. z for any particular x value shows how many standard deviations x is away from the mean for all x values.

For \(X\sim N(\mu,\sigma^2)\), \[ Z = \dfrac{(x - \mu)}{\sigma} \] where \(Z\sim N(0,1)\).
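
Standardising does not change the probability, only the scale on which it is computed. The sketch below uses hypothetical values (mean 10, variance 4, x = 12) and takes the square root of the variance before dividing, since the second parameter of \(N(\mu,\sigma^2)\) is a variance.

```r
mu <- 10            # hypothetical mean
sigma <- sqrt(4)    # standard deviation is the square root of the variance
x <- 12

z <- (x - mu) / sigma  # Z-score

pnorm(z)             # Pr(Z < z) on the standard scale
pnorm(x, mu, sigma)  # the same probability on the original scale
```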

  1. For \(X\sim N(4.3,2.7)\) find \(Pr(X>5)\)

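Type your answer here: Taking 2.7 as the variance, \(\sigma = \sqrt{2.7} = 1.643\), so \(Z = \dfrac{5 - 4.3}{1.643} = 0.426\) and \(Pr(X > 5) = 1 - \Phi(0.426) = 0.335\)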

  1. Assume that the IMDB rankings for episodes of Star Trek follow a Gaussian distribution with \(\mu = 7.55\) and \(\sigma^2=0.60\) based on the Gaussian distribution, what is the probability that a randomly selected episode will have an IMDB ranking of less than 7?

Type your answer here:

\(Z = \dfrac{7 - 7.55}{\sqrt{0.60}} = -0.710\), so \(Pr(X < 7) = \Phi(-0.710) = 0.239\)

  1. Assume that the IMDB rankings for episodes of Star Trek follow a Gaussian distribution with \(\mu = 7.55\) and \(\sigma^2=0.60\) based on the Gaussian distribution, what proportion of episodes have an IMDB ranking of over 7.9? What is the actual proportion of episodes with an IMDB ranking of over 7.9? Compare your results.

Type your answer here:

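\(Z = \dfrac{7.9 - 7.55}{\sqrt{0.60}} = 0.452\), so \(Pr(X > 7.9) = 1 - \Phi(0.452) = 0.326\)

The actual proportion can be computed as mean(episodes$IMDB.Ranking > 7.9) for comparison with this value.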