Assignment 1

Topics: Probabilities & Distributions & Central Limit Theorem

In the course syllabus you can find the deadline for the assignment and the suggested literature.

Probabilities

1.

What is meant by probabilities being ‘mutually exclusive’?

This concerns the Sum Rule of probabilities, which is only applicable when the events in question are mutually exclusive. Events are mutually exclusive when they cannot occur at the same time, so \(P(A \; and \; B) = 0\) and \(P(A \; or \; B) = P(A) + P(B)\).

2.

Think of a (fictional) example for which the following statement holds: \(P(A|B) = P(A)\).

The probability of rolling a six with a fair die, given that it is raining outside. The weather has no influence on the die, so knowing that it rains does not change the probability of a six: \(P(six|rain) = P(six) = 1/6\).
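A statement of the form \(P(A|B) = P(A)\) can be checked by listing all outcomes. The following is a minimal sketch, assuming a made-up example with two fair dice (the events and variable names are illustrative and not part of the assignment):

```r
# Hypothetical example: two fair dice, all 36 outcomes equally likely.
rolls <- expand.grid(first = 1:6, second = 1:6)

A <- rolls$first == 6          # event A: the first die shows a six
B <- rolls$second %% 2 == 0    # event B: the second die shows an even number

p_A         <- mean(A)               # P(A)
p_A_given_B <- sum(A & B) / sum(B)   # P(A|B) = P(A and B) / P(B)

c(p_A = p_A, p_A_given_B = p_A_given_B)
```

Both values equal 1/6, because the second die carries no information about the first.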

3.

Show, using a (fictional) example, that the following statement holds: \(P(A|B) \neq P(B|A)\).

The probability of being rich given that you’ve won the lottery is not equal to the probability of winning the lottery given that you’re rich: almost every lottery winner is rich, but only very few rich people have won the lottery.
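The difference between the two conditional probabilities becomes concrete with numbers. The counts below are invented purely for illustration; only the way the two probabilities are computed matters:

```r
# Hypothetical counts for a fictional town (made up for illustration).
n_rich_and_winner <- 9      # people who are rich AND won the lottery
n_winner          <- 10     # people who won the lottery
n_rich            <- 1000   # people who are rich

p_rich_given_win <- n_rich_and_winner / n_winner  # P(rich | won)
p_win_given_rich <- n_rich_and_winner / n_rich    # P(won | rich)

c(p_rich_given_win = p_rich_given_win, p_win_given_rich = p_win_given_rich)
```

The numerator is the same in both cases; the two probabilities differ because they condition on groups of very different size.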

4.

Calculate the probability of selecting a male PM student from the total population of 3rd year Psychology students (\(P(male \; and \; PM)\)). Consider the following facts: population = 400; # male students = 80; # PM students = 20; # male PM students = 15.

\(15/400 = 0.0375\)
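The same answer follows from the product rule, \(P(male \; and \; PM) = P(PM) \times P(male|PM)\). A short sketch using only the counts given in the question (the variable names are my own):

```r
N         <- 400   # population size
n_male    <- 80    # male students
n_pm      <- 20    # PM students
n_male_pm <- 15    # male PM students

p_direct  <- n_male_pm / N                      # counting directly
p_product <- (n_pm / N) * (n_male_pm / n_pm)    # P(PM) * P(male | PM)

# Multiplying the marginal probabilities would only be valid under independence.
p_if_independent <- (n_male / N) * (n_pm / N)

c(p_direct = p_direct, p_product = p_product, p_if_independent = p_if_independent)
```

The first two values agree (0.0375), while the third is 0.01, which shows that being male and being a PM student are not independent in this population.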

5.

Simulate the selection of 1 student from the population of 400 3rd year Psychology students (\(N_{male} = 80; N_{PM} = 20\)), and do this 1000 times. To do this, download assignment1.Rdata (from Canvas) and load it into your workspace in R (use the load() function and check with ls()). How many PM students, males, and male PM students did you select? The following script points you in the right direction. Does the result confirm your expectations?

#setwd("C:/Users/Max/Desktop/Premaster BDS/Basic Skills in Mathematics Programming and Statistics/Statistics/Week1")
load(file = "assignment1 for assignment.Rdata")

ls()

## [1] "stu"

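n_PM <- 0    # number of PM students selected
n_M <- 0     # number of males selected
n_PM_M <- 0  # number of male PM students selected
for (i in 1:1000) {
  # select the row number of 1 random student
  mySample <- sample(1:nrow(stu), size = 1)
  if (stu$pml[mySample] == "PML" && stu$sekse[mySample] == "Man") {
    n_PM_M <- n_PM_M + 1
  }
  if (stu$pml[mySample] == "PML") {
    n_PM <- n_PM + 1
  }
  if (stu$sekse[mySample] == "Man") {
    n_M <- n_M + 1
  }
}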

n_PM

## [1] 50

n_M

## [1] 213

n_PM_M

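## [1] 42

The expected counts are \(1000 \times 20/400 = 50\) PM students, \(1000 \times 80/400 = 200\) males, and \(1000 \times 15/400 = 37.5\) male PM students. The simulated counts (50, 213, and 42) are close to these values, so the result confirms the expectations.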
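The loop can also be written without explicit counters. The sketch below is a vectorised variant under stated assumptions: because the course file is not available here, it builds a stand-in data frame `stu_demo` from the counts in question 4. The labels "other" and "Vrouw" are hypothetical; only "PML" and "Man" appear in the script above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for `stu`: 15 male PM, 5 other PM,
# 65 male non-PM and 315 remaining students (400 in total).
group_sizes <- c(15, 5, 65, 315)
stu_demo <- data.frame(
  pml   = rep(c("PML", "PML", "other", "other"), times = group_sizes),
  sekse = rep(c("Man", "Vrouw", "Man", "Vrouw"), times = group_sizes)
)

# 1000 independent selections of 1 student (so with replacement across draws).
idx <- sample(nrow(stu_demo), size = 1000, replace = TRUE)

is_pm  <- stu_demo$pml[idx] == "PML"
is_man <- stu_demo$sekse[idx] == "Man"

observed <- c(PM = sum(is_pm), male = sum(is_man), male_PM = sum(is_pm & is_man))
expected <- 1000 * c(PM = 20, male = 80, male_PM = 15) / 400

rbind(observed, expected)
```

Comparing the two rows shows how far a single run of 1000 selections typically lands from the expected counts.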

6.

What is the probability of selecting two PM students in a sample of 2 (without replacement)? Are the probabilities of the first and second selection independent? Explain.

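No, they are not independent: the sample is drawn without replacement, so selecting a PM student first leaves 19 PM students among the remaining 399 students, which changes the probability for the second selection.

\(20/400 \times 19/399 = 0.002380952\)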
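This product can be cross-checked in two other ways: by counting pairs with choose(), and with the hypergeometric density dhyper(), which describes drawing without replacement. A brief sketch (the variable names are my own):

```r
# Product rule for two draws without replacement.
p_product <- (20 / 400) * (19 / 399)

# Counting: pairs of PM students divided by all possible pairs of students.
p_choose <- choose(20, 2) / choose(400, 2)

# Hypergeometric: 2 PM students in 2 draws from 20 PM and 380 non-PM students.
p_hyper <- dhyper(2, m = 20, n = 380, k = 2)

# For comparison: with replacement the two draws would be independent.
p_with_replacement <- (20 / 400)^2

c(p_product, p_choose, p_hyper, p_with_replacement)
```

The first three values are identical; the fourth (0.0025) is slightly larger, which is the effect of the dependence between the two draws.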

7.

Check your calculation using a simulation. Give the R code and explain it.

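N_PM_Q6 <- 0  # number of samples with two PM students

for (i in 1:10000) {
  # sample() draws without replacement by default: 2 different students
  sampleQ6 <- sample(1:nrow(stu), size = 2)
  if (stu$pml[sampleQ6[1]] == "PML" && stu$pml[sampleQ6[2]] == "PML") {
    N_PM_Q6 <- N_PM_Q6 + 1
  }
}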

N_PM_Q6/10000

## [1] 0.0028

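The loop draws 10000 samples of two different students and counts the samples in which both students are PM students; dividing this count by 10000 gives the estimated probability. Comparing the calculation in question 6 (0.0024) with the output of question 7 (0.0028), I conclude that both approaches give approximately the same probability of selecting two PM students in a sample of two.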
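Whether a difference of this size is plausible can be judged from the Monte Carlo standard error of a simulated proportion, \(\sqrt{p(1-p)/n_{sim}}\). A small sketch, using the exact probability from question 6 and the estimate reported above (the variable names are my own):

```r
p_theory <- (20 / 400) * (19 / 399)  # exact probability from question 6
n_sim    <- 10000                    # number of simulated samples
p_hat    <- 0.0028                   # simulated estimate reported above

# Standard error of a proportion estimated from n_sim independent samples.
mc_se <- sqrt(p_theory * (1 - p_theory) / n_sim)

# Distance between estimate and theory, in standard errors.
z <- (p_hat - p_theory) / mc_se

c(mc_se = mc_se, z = z)
```

The estimate lies within one standard error of the theoretical value, so the simulation is consistent with the calculation.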

Probability Distributions

The following questions are based on students taking a Xhosa exam. The exam consists of 10 two-choice questions. None of the students understands a single Xhosa word. Using R, we can simulate and plot the scores of 250 students as follows:

scores <- rbinom(250, 10, .5)
plot(table(scores))

8.

What is the (theoretical) probability that a student answers a random question correctly? Why?

.5: each question has two answer options and the students can only guess, so both options are equally likely.

9.

Is the (theoretical) probability of obtaining the series 0101010101 smaller, bigger, or equal to obtaining the series 1111111111? Why?

It’s equal: each specific series has probability \(0.5^{10}\), because the 10 answers are independent and every single answer (0 or 1) has probability .5.
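The probability of one specific series is the product of the probabilities of its individual answers. A minimal sketch with a hypothetical helper function `p_series()` (not part of the assignment):

```r
series_a <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
series_b <- rep(1, 10)

# Probability of one specific series of independent answers,
# where p is the probability of a correct answer (1).
p_series <- function(x, p = 0.5) prod(ifelse(x == 1, p, 1 - p))

p_a <- p_series(series_a)
p_b <- p_series(series_b)

c(p_a = p_a, p_b = p_b, theory = 0.5^10)
```

With p = .5 every series has the same probability. This would no longer hold for another value of p: for example, `p_series(series_b, p = 0.75)` is larger than `p_series(series_a, p = 0.75)`.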

10.

Is the (theoretical) probability of obtaining sum score 5 smaller, bigger, or equal to obtaining sum score 10? Why?

The probability of answering a question correctly is .5. To get all 10 questions right, there is only 1 series that gives this result: \(P(Sum10) = P(correct)^{10} = 0.5^{10} = 0.0009765625\). A sum score of 5 can be obtained in more than one way.

dbinom(10, 10, 0.5)

## [1] 0.0009765625

dbinom(5, 10, 0.5)

## [1] 0.2460938

The probability of sum score 5 is bigger than the probability of sum score 10. Every specific series has the same probability \(0.5^{10}\), but only 1 series gives sum score 10, while \(\binom{10}{5} = 252\) different series give sum score 5: \(252 \times 0.5^{10} = 0.2460938\).

11.

How many different series have sum score 6? Calculate this using a quick function in R.

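scores <- rbinom(250, 10, .5)

all6 <- sum(scores == 6) # number of simulated students with sum score 6

Initially I read the question wrong and thought I had to count how many sum scores of 6 the simulated scores data contained (the code above). The number of different series with sum score 6 is given by the choose function: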

choose(10,6)

## [1] 210
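The result of choose() can be verified by brute force, because there are only \(2^{10} = 1024\) possible answer series. A sketch that enumerates all of them (the variable names are my own):

```r
# All 2^10 = 1024 possible answer series (0 = wrong, 1 = correct).
all_series <- expand.grid(rep(list(0:1), 10))

# Sum score of every series, and the number of series per sum score.
sum_scores <- rowSums(all_series)
n_series   <- table(sum_scores)

n_series[["6"]]    # number of series with sum score 6
n_series[["5"]]    # number of series with sum score 5
n_series[["10"]]   # number of series with sum score 10
```

The counts per sum score match `choose(10, 0:10)`, and dividing them by 1024 gives the same probabilities as `dbinom(0:10, 10, 0.5)`.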

12.

What is the (theoretical) probability of sum score 5? Calculate this using the binomial formula (and the choose function), and using the dbinom function in R.

p <- 0.5
k <- 5
n <- 10

(factorial(n)/(factorial(k) * factorial(n - k))) * p^k * (1 - p)^(n - k) # binomial formula

## [1] 0.2460938

choose(n, k) * p^k * (1 - p)^(n - k) # binomial formula with the choose function

## [1] 0.2460938

dbinom(k, n, p) # dbinom function

## [1] 0.2460938

13.

It’s found that the meaning of a few words in Xhosa can be deduced from English. The probability of answering the two-choice questions correctly is now \(p\left(correct\right)=.75\). You don’t need to simulate this using the quincunx.

What is the (theoretical) probability of sum score 5? Why is it lower than, higher than, or equal to the probability in Q12?

dbinom(5, 10, 0.75)

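It’s lower now (about 0.058 instead of 0.246): because the probability of answering correctly has increased to .75, higher sum scores around the expected value \(10 \times .75 = 7.5\) have become more likely and a sum score of 5 has become less likely.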
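The shift becomes visible when the two binomial distributions are put side by side. A short sketch (the names `p_guess` and `p_hint` are my own):

```r
k <- 0:10

p_guess <- dbinom(k, size = 10, prob = 0.5)   # pure guessing
p_hint  <- dbinom(k, size = 10, prob = 0.75)  # some words can be deduced

# Probabilities of every sum score under both scenarios.
round(rbind(k, p_guess, p_hint), 4)

# Most likely sum score and expected sum score in each scenario.
mode_guess <- k[which.max(p_guess)]
mode_hint  <- k[which.max(p_hint)]
mean_guess <- sum(k * p_guess)
mean_hint  <- sum(k * p_hint)

c(mode_guess = mode_guess, mode_hint = mode_hint,
  mean_guess = mean_guess, mean_hint = mean_hint)
```

With p = .75 the distribution moves to the right: the expected sum score is 7.5 and the most likely sum score is 8, so less probability remains for a sum score of 5.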

14.

What’s the (theoretical) probability of sum score 9 or higher? Calculate this using a formula (use R as a calculator), and show a quick function in R.

p <- 0.75
sum9_choose <- choose(10,9)*p^9*(1-p)^1
sum10_choose <- choose(10,10)*p^10*(1-p)^0
sum9or10_choose <- sum9_choose + sum10_choose
sum9or10_choose

## [1] 0.2440252

1-pbinom(8,10,0.75)

## [1] 0.2440252

Central limit theorem (CLT)

Look at the graph from http://www.washingtonpost.com/blogs/wonkblog/wp/2014/09/25/think-you-drink-a-lot-this-chart-will-tell-you/.

15.

What distribution provides an accurate description of the means of the number of alcoholic beverages that an American drinks, when we ask 100 random Americans?

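A normal distribution (approximately). The number of drinks per person is strongly right-skewed in the population (a small part of the population accounts for most of the consumption, as in the Pareto principle), but according to the central limit theorem the means of samples of 100 Americans are approximately normally distributed around the population mean.

This distribution is the distribution of sample means (the sampling distribution of the mean).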

16.

Explain what population distributions, sample distributions, and distributions of sample means are. Clearly describe how these differ.

Population distributions: the distribution of a variable over all individuals in the population. This is usually unknown, and it is what we try to learn about by drawing a sample.
Sample distributions: the distribution of a variable over the individuals in one sample. A sample is a part of the population.
Distributions of sample means: the distribution of a sample statistic (here the mean) over many repeated samples of the same size. Each observation in this distribution is the mean of one sample, not the score of one individual.
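The three distributions can be generated next to each other in a simulation. The sketch below assumes a made-up, right-skewed population (exponential with rate 1); the population and all variable names are illustrative only:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical skewed population of 100000 individuals.
population <- rexp(100000, rate = 1)

# One sample of 100 individuals from that population.
one_sample <- sample(population, 100)

# Distribution of sample means: the means of 2000 samples of size 100.
sample_means <- replicate(2000, mean(sample(population, 100)))

rbind(
  population   = c(mean = mean(population),   sd = sd(population)),
  one_sample   = c(mean = mean(one_sample),   sd = sd(one_sample)),
  sample_means = c(mean = mean(sample_means), sd = sd(sample_means))
)
```

All three have roughly the same mean, but the sample means vary far less than the individual values: their standard deviation is about one tenth of the population standard deviation, because \(\sqrt{100} = 10\). Calling `hist()` on each object shows that the first two are skewed, while the sample means look approximately normal.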

17.

Read up to chapter 3.1. Why do you expect that the sample mean is close to the population mean?

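The sample mean is an unbiased estimate of the population mean: the distribution of sample means is centred on the population mean, and its spread (the standard error) becomes narrower as the sample size increases. For a reasonably large sample, most sample means therefore lie close to the population mean.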

18.

Plot a distribution of sample means (choose your favorite distribution). Show what happens to this distribution if you increase or decrease the sample sizes of those samples.

reps <- 10000
MEANS <- numeric(reps)
sampleSize <- 100

for (i in 1:reps) {
  MEANS[i] <- mean(rnorm(sampleSize, 0, 1))
}

hist(MEANS)
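To show the effect of the sample size, the same simulation can be repeated for several sample sizes and plotted side by side. This is one possible sketch; the sample sizes 5, 25 and 100 and the number of repetitions are arbitrary choices:

```r
set.seed(123)  # arbitrary seed, only for reproducibility

reps         <- 5000
sample_sizes <- c(5, 25, 100)

# For every sample size: the means of `reps` samples from a standard normal.
means_by_n <- lapply(sample_sizes, function(n) {
  replicate(reps, mean(rnorm(n, 0, 1)))
})
names(means_by_n) <- paste("n =", sample_sizes)

# Histograms on a common x-axis, so the spreads can be compared directly.
old_par <- par(mfrow = c(1, 3))
for (label in names(means_by_n)) {
  hist(means_by_n[[label]], main = label, xlab = "sample mean", xlim = c(-1.5, 1.5))
}
par(old_par)

# Standard deviation of the sample means next to the theoretical 1 / sqrt(n).
sd_means <- sapply(means_by_n, sd)
rbind(simulated = sd_means, theoretical = 1 / sqrt(sample_sizes))
```

The histograms stay centred on 0, but they become narrower as the sample size increases, and wider as it decreases.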

19.

Show, based on the formula of the standard error, what happens with the SE if you increase the sample size.

10/sqrt(10)

## [1] 3.162278

10/sqrt(100)

## [1] 1

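The standard error is \(SE = \sigma/\sqrt{n}\) (here with \(\sigma = 10\)). When the sample size increases, the denominator \(\sqrt{n}\) increases, so the standard error decreases.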
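The relation can be made more precise by evaluating the formula for a range of sample sizes. A small sketch with a hypothetical helper function `se()`, again assuming a standard deviation of 10:

```r
# Standard error of the mean for standard deviation sigma and sample size n.
se <- function(sigma, n) sigma / sqrt(n)

n_values  <- c(10, 40, 100, 400)
se_values <- se(10, n_values)

rbind(n = n_values, SE = round(se_values, 3))
```

Because of the square root, the sample size has to be made four times as large to halve the standard error (compare n = 10 with n = 40, and n = 100 with n = 400).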

20.

Why is the SE dependent on the sample size? Clearly explain. Hint: compare the distribution of the sample means with the population distribution.

The population distribution (and the distribution within a single sample) shows the variation between individuals. The distribution of sample means shows the variation between samples instead. Within each sample mean, high and low individual values partly cancel out, and the more observations a sample contains, the more they cancel out. The sample means therefore vary less than the individual scores, and their standard deviation (the SE) becomes smaller as the sample size increases.
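The dependence on the sample size also follows from a short derivation. It assumes that the observations \(X_1, \dots, X_n\) are independent and all have the same variance \(\sigma^2\) (the symbols are introduced here for illustration):

\[
Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\]

\[
SE = \sqrt{Var(\bar{X})} = \frac{\sigma}{\sqrt{n}}
\]

The population variance \(\sigma^2\) does not depend on the sample size, but the variance of the sample mean is that variance divided by \(n\).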

Vasishth & Broe

A selection of Vasishth & Broe’s questions about the literature.

1.

Imagine that you have a biased coin, where the probability of obtaining a heads is not 0.5 but 0.1. When the coin is tossed four times, what are the probabilities of obtaining 0, 1, 2, 3, 4 heads? [Tip: Use R to do the calculations for you!]

for (i in 0:4) {
  probability <- dbinom(i, 4, 0.1)
  print(probability)
}

## [1] 0.6561
## [1] 0.2916
## [1] 0.0486
## [1] 0.0036
## [1] 1e-04
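Since dbinom() is vectorised, the same five probabilities can be obtained in a single call without a loop. A minimal sketch (the name `probs` is my own):

```r
# Probabilities of 0, 1, 2, 3 and 4 heads in 4 tosses with P(heads) = 0.1.
probs <- dbinom(0:4, size = 4, prob = 0.1)
names(probs) <- 0:4

probs

# The five outcomes cover all possibilities, so the probabilities sum to 1.
total <- sum(probs)
total
```

Checking that the probabilities sum to 1 is a quick way to confirm that no outcome has been left out.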

2.

What is the probability of obtaining any of the numbers 2, 4, or 6 if a die is tossed three times in a row?

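The probability of throwing a 2, 4 or 6 in one toss is
\(1/6 + 1/6 + 1/6 = 1/2\), so the probability of an odd number in one toss is also \(1/2\). The three throws are independent, so the probability of three odd numbers in a row is
\(1/2 \times 1/2 \times 1/2 = 1/8\)

The probability of obtaining a 2, 4 or 6 at least once in three throws is therefore \(1 - 1/8 = 0.875\)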
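The question can be read in two ways, and both readings can be checked by listing all \(6^3 = 216\) equally likely outcomes of three tosses. A sketch of that enumeration (the variable names are my own):

```r
# All 216 possible outcomes of three tosses of a fair die.
tosses <- as.matrix(expand.grid(t1 = 1:6, t2 = 1:6, t3 = 1:6))

# Number of even results (2, 4 or 6) in every outcome.
n_even <- rowSums(tosses %% 2 == 0)

p_at_least_one_even <- mean(n_even >= 1)  # a 2, 4 or 6 at least once
p_all_even          <- mean(n_even == 3)  # a 2, 4 or 6 on every toss

c(p_at_least_one_even = p_at_least_one_even, p_all_even = p_all_even)
```

Reading the question as "at least once" gives 0.875; reading it as "on all three tosses" gives 1/8 = 0.125.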

Basic Skills in Statistics | 2020-2021
