NOTE THIS ASSESSMENT IS DUE ON 5 September BY 11:59 PM.
For this Assessment we will use the following dataset:
The dataset episodes, included in the MXB107 package for R, contains records for 704 episodes of Star Trek aired between 1966 and 2005. (Type ?episodes for a detailed description of the data.)
Type your answer here:
1. Choose the right visualization
2. Provide context
3. Use visual cues to show relationships
Type your answer here:
Show your code here:
library(MXB107)
data("episodes")
ggplot(episodes, aes(x = Series.Name, y = IMDB.Ranking)) +
geom_boxplot() +
xlab ("Series Names") +
ylab ("IMDB User Ratings (0-10)") +
ggtitle ("IMDB Ranking by Series") +
theme (plot.title = element_text(hjust = 0.5))
Some features that are evident from the box plot include the outliers, that is, the episodes that lie beyond the whiskers (more than 1.5 times the interquartile range from the quartiles). The clearest example is the single episode of The Next Generation with a rating much closer to 3 than any episode of the other series. Another obvious trend is that Enterprise appears to have the best ratings on average.
Show your code here:
library (MXB107)
data ("episodes")
episodes%>%
count(Female.Director,Bechdel.Wallace.Test)%>%
group_by(Bechdel.Wallace.Test)%>%
pivot_wider(names_from = Female.Director, values_from = n)%>%
kable()
| Bechdel.Wallace.Test | FALSE | TRUE |
|---|---|---|
| FALSE | 323 | 15 |
| TRUE | 346 | 20 |
ggplot(episodes,aes(x=IMDB.Ranking, fill = Bechdel.Wallace.Test))+
geom_histogram(bins = 10)+
facet_wrap(vars(Bechdel.Wallace.Test))+
ylab("Episode Count")+ # the y-axis of a histogram shows counts
ggtitle("Bechdel Wallace Test/The Next Generation")
Type your answer here:
There appears to be a slight association between the IMDB ratings and the Bechdel-Wallace Test result: on average, episodes that pass the test receive slightly higher ratings than those that do not.
Type your answer here:
Mean
Median
Mode
Type your answer here:
Standard deviation
Variance
Skew
Show your code here:
library (MXB107)
data ("episodes")
IMDB_Data <- episodes$IMDB.Ranking
sd_IMDB <- sqrt(sum((IMDB_Data - mean(IMDB_Data))^2)/length(IMDB_Data))
sd_IMDB
## [1] 0.7754944
mean(IMDB_Data)*0.341
## [1] 2.574792
Type your answer here:
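There is a stark contrast between the two values: the standard deviation computed from its definition is about 0.78, whereas 34.1% of the mean is about 2.57. The 34.1% figure comes from the empirical rule, which states that for a normal distribution about 34.1% of observations lie between the mean and one standard deviation on either side of it (about 68% within one standard deviation in total). It is a proportion of the observations, not of the mean, so the value computed from the definition is the accurate measure of the standard deviation.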
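The 34.1% and 68% figures quoted for the empirical rule can be checked against the standard normal distribution. The following is a minimal sketch using only base R; the variable names are illustrative and are not part of the assessment.

```r
# Proportion of a normal distribution within one standard deviation of the mean
within_one_sd <- pnorm(1) - pnorm(-1)

# Proportion between the mean and one standard deviation above it
one_side <- pnorm(1) - pnorm(0)

within_one_sd  # roughly 0.683
one_side       # roughly 0.341
```

Both quantities are proportions of the observations, which is why they cannot be turned into a standard deviation by multiplying them by the mean.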
Show your code here:
library (MXB107)
data ("episodes")
mean(IMDB_Data)
## [1] 7.55071
median(IMDB_Data)
## [1] 7.6
# Skew
(1/length(IMDB_Data)) * sum(((IMDB_Data - mean(IMDB_Data))/sd(IMDB_Data))^3)
## [1] -0.3873874
ggplot(episodes, aes(x = IMDB.Ranking)) +
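geom_histogram(aes(y = after_stat(count)), binwidth = 0.05) + # after_stat() replaces the deprecated ..count..
xlab("User Ratings") + ylab ("Episode Count") +
ggtitle ("Distribution of IMDB Ratings") + # all series are pooled in this plot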
theme (plot.title = element_text(hjust = 0.5))
Type your answer here: The data are skewed left (negatively skewed). This is visible in the histogram and is confirmed by the negative skewness of about -0.39 and by the mean (7.55) lying below the median (7.6).
The probability of an event is the ratio of the number of cases favourable to it to the number of cases possible, when nothing leads us to expect that any one of these cases should occur more than any other, which renders them, for us, equally possible.
Show your code here:
library (MXB107)
data ("episodes")
episodes%>%
count(Female.Director,Bechdel.Wallace.Test)%>%
group_by(Bechdel.Wallace.Test)%>%
pivot_wider(names_from = Female.Director, values_from = n)%>%
kable()
| Bechdel.Wallace.Test | FALSE | TRUE |
|---|---|---|
| FALSE | 323 | 15 |
| TRUE | 346 | 20 |
366/704
## [1] 0.5198864
The probability of an episode passing the Bechdel-Wallace Test is approximately 52.0%, meaning that a little over half of the episodes pass the test.
Joint probability is a statistical measure of the likelihood of two events occurring together. In general Pr(AB) = Pr(A|B) * Pr(B); this reduces to Pr(AB) = Pr(A) * Pr(B) only when A and B are independent.
Show your code here:
library (MXB107)
data ("episodes")
episodes %>%
count(Series, Bechdel.Wallace.Test) %>%
filter(Series == "VOY") %>%
group_by(Bechdel.Wallace.Test)
5/75
## [1] 0.06666667
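(5/75)*(75/704)
## [1] 0.007102273
The Original Series only has 6.67% of its episodes pass the Bechdel-Wallace Test (5 of the 75 episodes counted above). The joint probability that a randomly selected episode is from The Original Series and passes the test is Pr(pass | TOS) * Pr(TOS) = (5/75)(75/704) = 5/704, so only about 0.71% of all episodes are both from The Original Series and pass the test.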
Conditional probability is the probability of an event occurring, given that another event has already occurred.
Pr(A|B) = Pr(AB)/Pr(B)
Show your code here:
library (MXB107)
data ("episodes")
episodes %>%
count(Series.Name,Bechdel.Wallace.Test)
176/704
## [1] 0.25
(0.25*0.48116)/0.48116
## [1] 0.25
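Using this calculation, the probability that an episode fails the Bechdel-Wallace Test given that it is an episode from Star Trek: Deep Space Nine is 25%. Note that the factor 0.48116 cancels, so this value equals Pr(A) = 176/704 and implicitly treats the two events as independent.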
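As a small illustration of a conditional probability computed directly from counts, the sketch below uses the Bechdel.Wallace.Test by Female.Director table shown earlier. The matrix is typed in by hand from that table, and the object names are illustrative assumptions rather than anything provided by the MXB107 package.

```r
# Counts copied from the table above:
# rows = Bechdel-Wallace result, columns = female director
counts <- matrix(c(323, 15,
                   346, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Bechdel = c("FALSE", "TRUE"),
                                 FemaleDirector = c("FALSE", "TRUE")))

n_total <- sum(counts)                          # total number of episodes

# Joint probability: passes the test AND has a female director
p_joint <- counts["TRUE", "TRUE"] / n_total

# Marginal probability: has a female director
p_female <- sum(counts[, "TRUE"]) / n_total

# Conditional probability: Pr(pass | female director) = Pr(AB) / Pr(B)
p_pass_given_female <- p_joint / p_female
p_pass_given_female
```

Dividing the joint probability by the marginal probability is the same as restricting attention to the 35 episodes with a female director and asking what fraction of them pass.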
Bayes’ Theorem states that the conditional probability of an event, given the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event, divided by the probability of the second event.
Type your answer here: \[ Pr(B|A) = \dfrac{Pr(A|B)Pr(B)}{Pr(A)} \]
Show your code here:
episodes%>%
filter(Bechdel.Wallace.Test == TRUE)%>% # Remove this filter for all episodes
group_by(Series.Name, Season)%>%
tally()%>%
pivot_wider(names_from = Series.Name, values_from = n)%>%
bind_rows(summarise_all(., ~sum(., na.rm=TRUE)))%>% # Total row
mutate(Total = rowSums(.[setdiff(names(.),"Season")], na.rm = TRUE)) # Total column
17/145 # Pr(A|B)
## [1] 0.1172414
366/704 # Pr(B)
## [1] 0.5198864
145/366 # Pr(A)
## [1] 0.3961749
(0.1172414*0.5198864)/0.3961749 # Pr(B|A) = Pr(A|B)Pr(B)/Pr(A)
## [1] 0.1538518
There is a 15.39% chance that an episode from Season 3 of Star Trek: Voyager passes the Bechdel-Wallace Test.
This probability is greater than the marginal probability that a randomly selected episode is from Season 3 of Star Trek: Voyager. This is due to the sample space being restricted.
A Bernoulli random variable is the simplest kind of random variable. It can take on two values, 1 and 0. It takes on a 1 if an experiment with probability p resulted in success and a 0 otherwise.
Show your code here:
pgeom(2, prob = 0.5)
## [1] 0.875
1-0.875
## [1] 0.125
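There is a 12.5% chance of more than 2 tails before the first “heads”, that is, that more than 3 tosses are required, because pgeom() counts the failures before the first success. (The probability that more than 2 tosses are required is 1 - pgeom(1, prob = 0.5) = 0.25.)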
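Because R's geometric functions count failures before the first success, not the total number of trials, it can help to compute both readings side by side. This is a minimal sketch for a fair coin; the variable names are illustrative.

```r
p <- 0.5  # probability of heads on a single toss

# pgeom(q, prob) is Pr(X <= q), where X = number of failures (tails)
# before the first success (heads)

# More than 2 tails before the first heads, i.e. more than 3 tosses needed
p_more_than_2_failures <- 1 - pgeom(2, prob = p)

# More than 2 tosses needed, i.e. at least 2 tails before the first heads
p_more_than_2_tosses <- 1 - pgeom(1, prob = p)

# Direct check: the first 3 (or first 2) tosses must all be tails
c(p_more_than_2_failures, (1 - p)^3)
c(p_more_than_2_tosses, (1 - p)^2)
```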
Show your code here:
Type your answer here:
A geometric random variable can be defined in terms of Bernoulli trials: consider a sequence of independent Bernoulli trials, each with probability of success p, where p is in (0, 1). The number of failures, X, before the first success occurs follows a geometric distribution with the p.m.f.:
Pr(X = k) = ((1-p)^k)p, k = 0,1,2,3,…
A geometric distribution is also the special case of the negative binomial distribution in which the number of successes required is 1.
Show your code here:
qgeom(0.95, 0.52)
## [1] 4
Type your answer here:
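qgeom() returns the number of failures before the first success, so up to 4 episodes may fail the test before one passes. It would therefore require that you watch 5 episodes.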
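The quantile returned by qgeom() can be converted into a number of episodes and checked directly. The sketch below assumes the goal is to be at least 95% sure of seeing one passing episode, with p = 0.52 as above; the variable names are illustrative.

```r
p <- 0.52  # probability that a single episode passes the test

# Smallest number of failures x such that Pr(X <= x) >= 0.95
failures_needed <- qgeom(0.95, prob = p)

# Total episodes watched = failures + the one episode that passes
episodes_needed <- failures_needed + 1

# Check: probability of at least one pass within k episodes is 1 - (1 - p)^k
p_within_4 <- 1 - (1 - p)^4
p_within_5 <- 1 - (1 - p)^5
c(failures_needed, episodes_needed, p_within_4, p_within_5)
```

The probability of at least one pass within 4 episodes falls just short of 0.95, while 5 episodes exceeds it.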
Type your answer here: X ~ Binomial(n,p)
\(Pr(X = x) = \binom{n}{x}p^x q^{n-x}\), where \(q = 1-p\)
Expected Value: E[X] = np
Variance: Var[X] = npq
On average you would expect to have n/2 (or n × 0.5) heads. The variance would be n/4 (or n × 0.5 × 0.5), so the standard deviation would be \(\sqrt{n}/2\).
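The formulas for the mean and variance can be verified numerically by summing over the binomial p.m.f. The sketch below uses n = 100 tosses of a fair coin purely as an illustrative value; it is not a value taken from the assessment.

```r
n <- 100   # illustrative number of tosses (an assumption)
p <- 0.5   # fair coin

x <- 0:n
pmf <- dbinom(x, size = n, prob = p)

mean_x <- sum(x * pmf)                 # should equal n * p = n / 2
var_x  <- sum((x - mean_x)^2 * pmf)    # should equal n * p * (1 - p) = n / 4
sd_x   <- sqrt(var_x)                  # should equal sqrt(n) / 2

c(mean = mean_x, variance = var_x, sd = sd_x)
```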
Type your answer here:
Indicator random variables are Bernoulli random variables, with p = P(A). A binomial random variable is a random variable that represents the number of successes in ‘n’ successive independent trials of a Bernoulli experiment.
The variance of a binomial random variable, np(1-p), is maximised when p = 0.5, because p(1-p) is largest at that value. The variance can never exceed the mean, since np(1-p) is at most np.
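One way to see where the binomial variance peaks is to evaluate np(1-p) over a grid of p values. This is a rough sketch; n = 10 and the grid spacing are arbitrary choices made for illustration.

```r
n <- 10                              # illustrative number of trials
p_grid <- seq(0, 1, by = 0.01)       # candidate values of p

variance <- n * p_grid * (1 - p_grid)
mean_val <- n * p_grid

# Value of p at which the variance is largest
p_at_max <- p_grid[which.max(variance)]
p_at_max
```

The same grid also shows that the variance never exceeds the mean for any p.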
Show your code here:
pbinom(2,10,0.0625)
## [1] 0.9789929
Type your answer here:
\(Pr(X \leq 2)\), where X ~ Binomial(n = 10, p = 0.0625)
=97.899%
Show your code here:
pbinom(49,100,0.52)
## [1] 0.3081545
ppois(49,52)
## [1] 0.3721497
Type your answer here:
\(Pr(X<50)\)
Binomial: 30.815% Poisson: 37.215%
Gaussian: \(\mu = np = 52\), \(Var(X) = np(1-p) = 24.96\), \(\sigma = \sqrt{24.96} = 4.995998\)
p(y) = 0.3081016
It can be seen that the Gaussian and Binomial results are nearly identical, whereas the Poisson result shows a noticeable discrepancy. This is because the Poisson distribution has a variance equal to its mean (52), roughly double the binomial variance (24.96).
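The three approximations can be placed side by side in a few lines. In the sketch below the normal probability uses a continuity correction (evaluating at 49.5), which is an assumption on my part and may not be how the Gaussian value above was obtained.

```r
n <- 100
p <- 0.52

# Pr(X < 50) = Pr(X <= 49) under each model
p_binom <- pbinom(49, size = n, prob = p)
p_pois  <- ppois(49, lambda = n * p)

# Normal approximation with continuity correction (assumed, see lead-in)
p_norm  <- pnorm(49.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

c(binomial = p_binom, poisson = p_pois, normal = p_norm)
```

The normal model matches both the mean and the variance of the binomial, whereas the Poisson model matches only the mean.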
Type your answer here: Recall the Binomial Distribution:
\(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\)
Define lambda as:
\(\lambda = np\), so that \(p = \lambda/n\)
Substitute this value of \(p\) into the binomial distribution and let \(n \rightarrow \infty\):
\(\lim_{n\rightarrow \infty} P(X = k) = \lim_{n\rightarrow \infty} \dfrac{n!}{k!(n-k)!}\left(\dfrac{\lambda}{n}\right)^k\left(1-\dfrac{\lambda}{n}\right)^{n-k}\)
Take the constant factor outside the limit:
\(\dfrac{\lambda^k}{k!}\)
New term:
\(\dfrac{\lambda^k}{k!} \lim_{n\rightarrow \infty}\dfrac{n!}{(n-k)!\,n^k}\left(1-\dfrac{\lambda}{n}\right)^n\left(1-\dfrac{\lambda}{n}\right)^{-k}\)
Evaluate the limit of each factor in turn. First factor:
\(\lim_{n\rightarrow \infty} \dfrac{n!}{(n-k)!\,n^k} = \lim_{n\rightarrow \infty} \dfrac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} = 1\)
Second factor:
\(\lim_{n\rightarrow \infty} \left(1-\dfrac{\lambda}{n}\right)^n\)
Recall:
\(e = \lim_{x\rightarrow \infty}\left(1+\dfrac{1}{x}\right)^x\)
Let \(x = \dfrac{-n}{\lambda}\), so that \(n = -\lambda x\) (the same limit holds as \(x \rightarrow -\infty\)):
\(\lim_{n\rightarrow \infty} \left(1-\dfrac{\lambda}{n}\right)^n = \lim \left(1+\dfrac{1}{x}\right)^{-\lambda x} = e^{-\lambda}\)
Third factor:
\(\lim_{n\rightarrow \infty} \left(1-\dfrac{\lambda}{n}\right)^{-k} = 1\)
Combining the three limits:
\(\lim_{n\rightarrow \infty} P(X = k) = \dfrac{\lambda^k}{k!} \cdot 1 \cdot e^{-\lambda} \cdot 1\)
Simplifies to:
\(P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}\)
The result is the Poisson p.m.f.
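The limit can also be illustrated numerically by holding \(\lambda = np\) fixed while n grows. The sketch below uses \(\lambda = 5.2\) and an arbitrary set of n values; both are illustrative assumptions.

```r
lambda <- 5.2                         # fixed mean (illustrative)
k <- 0:20                             # counts at which to compare the p.m.f.s
n_values <- c(10, 100, 1e4, 1e6)      # increasing numbers of trials

# Largest absolute gap between Binomial(n, lambda / n) and Poisson(lambda)
max_diff <- sapply(n_values, function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
})

data.frame(n = n_values, max_diff = max_diff)
```

The gap shrinks steadily as n increases, which is the behaviour the derivation predicts.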
Show your code here:
n <- 10
p <- 0.52
number_of_successes <- 0:n # a count of successes can be 0, so start at 0
# Generate the dataframe with 3 columns: Successes, Binomial, Poisson
data <- data.frame(Successes = number_of_successes,
Binomial = dbinom(number_of_successes,n,p),
Poisson = dpois(number_of_successes,n*p))
# Plot side-by-side plots
data %>%
pivot_longer(cols = -c(Successes), names_to = "Distribution") %>%
ggplot(aes(x = Successes, y = value))+
scale_x_continuous(breaks=data$Successes)+
geom_bar(stat="identity")+
facet_wrap(~Distribution)
Type your answer here:
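The most obvious observation when comparing the two distributions is that both are centred on the same mean (np = 5.2), but the Binomial distribution is noticeably narrower: its variance is np(1-p) = 2.496, whereas the variance of the Poisson distribution equals its mean, 5.2. The Poisson distribution also assigns some probability to counts above n = 10, which the Binomial distribution cannot.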
Type your answer here:
If the number of events per unit time follows a Poisson distribution, then the amount of time between events follows the exponential distribution.
Assume that the average episode is 45 minutes long. Given that an episode passes the Bechdel-Wallace Test with probability \(p=0.52\), this is equivalent to \(0.52/0.75 \approx 0.693\) instances of passing the Bechdel-Wallace Test per hour of Star Trek viewing.
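The hourly rate follows from dividing the per-episode probability by the episode length in hours. A minimal sketch of that arithmetic, with illustrative variable names:

```r
p_pass <- 0.52            # probability that an episode passes the test
episode_hours <- 45 / 60  # episode length in hours

# Expected number of passing episodes per hour of viewing
rate_per_hour <- p_pass / episode_hours
rate_per_hour
```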
Type your answer here:
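\(Pr(X > 7)\)
\(X \sim Pois(0.693)\) per hour, \(E[X] = 0.693\); therefore over 10 hours the expected count is \(10 \times 0.693\)
x_7 <- 1 - ppois(7, 10*0.6933)
\(=0.3913041\)
There is a 39.13% chance that you will see more than 7 episodes that pass the Bechdel-Wallace Test within 10 hours of viewing. (For \(Pr(X \geq 7)\), use 1 - ppois(6, 10*0.6933) instead.)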
Type your answer here:
\(Pr(X > 3)\)
\(X \sim Exponential(0.693)\)
x_3 <- 1 - pexp(3, 0.6933) = 1 - 0.875 = 0.125
There is approximately a 12.5% chance that you will wait more than 3 hours before seeing an episode that passes the Bechdel-Wallace Test, that is, that no episode passes within the 3 hours.
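The link between the exponential waiting time and the Poisson count can be checked directly: waiting longer than 3 hours for the first event is the same as observing zero events in 3 hours. The sketch below uses the rate 0.6933 from above; the variable names are illustrative.

```r
rate <- 0.6933   # passing episodes per hour, as used above
hours <- 3

# Pr(T > 3) for an exponential waiting time T
p_wait_longer <- pexp(hours, rate = rate, lower.tail = FALSE)

# Closed form of the exponential survival function
p_closed_form <- exp(-rate * hours)

# Pr(no events in 3 hours) for a Poisson count with mean rate * hours
p_zero_events <- dpois(0, lambda = rate * hours)

c(p_wait_longer, p_closed_form, p_zero_events)
```

All three expressions give the same value, close to 0.125.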
Type your answer here:
A z-score describes the position of a raw score in terms of its distance from the mean, when measured in standard deviation units. \(Z = (x - \mu)/\sigma\)
Any point (x) from a normal distribution can be converted to the standard normal distribution (z) with the formula z = (x-mean) / standard deviation. The z value for a particular x shows how many standard deviations x lies above or below the mean.
For \(X\sim N(\mu,\sigma^2)\), \[ Z = \dfrac{(x - \mu)}{\sigma} \] where \(Z\sim N(0,1)\).
Type your answer here: \(Z = 0.629\) \(Pr(Z \leq 0.63) = 0.7357\)
Type your answer here:
\(Z = -1.08\) \(Pr(Z \leq -1.08) = 0.14007\)
Type your answer here:
\(Z = \dfrac{8 - 7.55}{0.6}\) \(Z = 0.75\) \(Pr(Z \leq 0.75) = 0.7734\)
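The table lookup for the last z-score can be reproduced with pnorm(), either on the standardised value or directly on the original scale. This is a small sketch using the mean 7.55 and standard deviation 0.6 from the answer above.

```r
mu <- 7.55
sigma <- 0.6
x <- 8

# Standardise, then look up the standard normal cumulative probability
z <- (x - mu) / sigma
p_below <- pnorm(z)

# Equivalent calculation on the original scale
p_below_direct <- pnorm(x, mean = mu, sd = sigma)

c(z = z, p_below = p_below, p_below_direct = p_below_direct)
```

The upper-tail probability, if that is what a question asks for, is 1 - p_below.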