Assignment3_answers

Part II Assignment

Because we have three weeks to complete this assignment, we are doing things a little differently. It is a challenge to create datasets that are of interest to all students, so for this lab we are going to get you to find your own. This can be as simple or as complex as you want to make it. The key point is that you collect/create the dataset yourself (i.e., it is not taken from an example elsewhere).

Try to get a sample size of at least 30. If collecting this dataset requires you to go outside and measure something, even better. Or it can be something related to sports, music, or whatever interests you. Use Piazza to ask questions about suitable datasets if you are not sure. You could, for example, check the temperature at Environment Canada on this day for each of the past 30 years. Or you could check the goals-against average for Toronto Maple Leafs goalies over the past 30 years. It is up to you, but the dataset needs to be unique to you, and you will have to explain where it comes from in your write-up.

Write a paragraph describing your dataset. Include how you acquired it, why you selected it, what the variable of interest is, whether there is any measurement error, sources of sampling error, bias, etc. Also include a histogram and a table with basic summary statistics describing your dataset. The histogram and table should have descriptive captions. Include your dataset with your R commands. (out of 10)

Here students need to think about the type of dataset they want to collect. They have to be able to find the population mean and standard deviation for it. For example, they can ask their friends how tall they are and construct a dataset from the answers; a quick Google search will give them the standard deviation of Canadian men's/women's height. The number of accidents in Toronto over the past 30 months, or counts of natural events such as earthquakes, wildfires, and floods, can easily be found on the internet and can also be used. Just let them know that they need to find the population standard deviation as well.

Another workaround is for them to find a dataset with, say, 100 members, calculate its mean and standard deviation, and treat it as the population. They then randomly select 30 of those 100 members and treat them as a random sample.
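
The workaround above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `population` vector is simulated placeholder data (students would substitute their own 100 collected values), and the seed is there only to make the random draw reproducible.

```r
# Hypothetical stand-in for a collected dataset of 100 values;
# replace with the real measurements.
set.seed(42)
population <- round(rnorm(100, mean = 170, sd = 10), 1)

# Treat the full dataset as the population.
population_mean <- mean(population)

# sd() divides by n - 1, so rescale to get the population
# (divide-by-n) standard deviation.
n_pop <- length(population)
population_sd <- sd(population) * sqrt((n_pop - 1) / n_pop)

# Draw a simple random sample of 30 members without replacement.
x <- sample(population, size = 30, replace = FALSE)

length(x)
mean(x)
```

With 100 members the rescaling changes the standard deviation only slightly, so using `sd(population)` directly would give nearly the same intervals.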

To build a table of basic summary statistics, they can use the following code:

mean(x)
sd(x)
summary(x)
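
One possible way to collect these statistics into a single table is sketched below. The values in `x` are placeholders (an assumption), and the choice of rows is only a suggestion, not a required layout.

```r
# Placeholder sample; replace with the 30 collected values.
x <- c(12.1, 9.8, 11.4, 10.2, 13.5, 8.9, 10.7, 11.9)

# One row per statistic, rounded to two decimals for display.
summary_table <- data.frame(
  Statistic = c("n", "Mean", "Standard deviation",
                "Minimum", "Median", "Maximum"),
  Value = c(length(x), mean(x), sd(x), min(x), median(x), max(x))
)
summary_table$Value <- round(summary_table$Value, 2)

print(summary_table, row.names = FALSE)
```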

To plot a histogram, students can use the hist function; I showed how to use it in Lab 1. They can also type ?hist to learn how it works. Students need to mention the source of the data, and mention where they could find the
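
A minimal hist call with descriptive labels might look like the sketch below. The data, title, axis label, and colours are all placeholders (assumptions) that students would replace with their own.

```r
# Placeholder sample; replace with the 30 collected values.
x <- c(12.1, 9.8, 11.4, 10.2, 13.5, 8.9, 10.7, 11.9)

# hist() draws the plot and also returns the bin counts invisibly.
h <- hist(x,
          main = "Histogram of the variable of interest (placeholder title)",
          xlab = "Variable of interest (units)",
          ylab = "Frequency",
          col = "grey80",
          border = "white")
```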

Some links for the students:

https://earthquakescanada.nrcan.gc.ca/stndon/NEDB-BNDS/bulletin-en.php

https://tsunami.gov

https://climate.weather.gc.ca/historical_data/search_historic_data_e.html

Minimum expectation:

One histogram (2.5 marks)
A summary table (2.5 marks)
A paragraph about dataset (2.5 marks)
Code (2.5 marks)

Create 90%, 95% and 99% confidence intervals for your data (using a sample mean). Plot these on a graph. Include a sentence describing what these mean and the commands used to generate the answer. (out of 10)

They need to define their sample using code like this. It should contain 30 numbers (the five below are only a placeholder).

x = c(1,2,3,4,5)

They also need the population standard deviation, so:

population_sd <- 10 #based on their selected topic

Once they have these two values, they can plot confidence intervals using code similar to the following (they need to read the confidence intervals section to answer this question):

#for a 95% interval alpha = 1 - 0.95 = 0.05; use 0.01 for a 99% interval, 0.10 for a 90% interval, and so on
z_crit = qnorm(0.05/2, lower.tail = FALSE)
lower = mean(x) - (z_crit * (population_sd/sqrt(length(x)))) 
upper = mean(x) + (z_crit * (population_sd/sqrt(length(x))))
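#xlim covers both the data and the interval so the line is not clipped when the interval is wider than the data range
plot(x=c(lower, upper), y=c(1, 1), type="l", xlim = range(x, lower, upper), xlab="Variable title (students must change this)", ylab="", col = "Black")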
points(x = mean(x), y = 1, pch=20, cex=3)

#a second interval (here 90%) can be added to the same plot
z_crit = qnorm(0.10/2, lower.tail = FALSE)
lower = mean(x) - (z_crit * (population_sd/sqrt(length(x))))
upper = mean(x) + (z_crit * (population_sd/sqrt(length(x))))
lines(x = c(lower, upper), y=c(1.2, 1.2), col="red")
points(x = mean(x), y = 1.2, pch=20, cex=3)
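
As an alternative to repeating the block for each level, all three intervals can be computed and drawn in one pass. The sketch below is one way to do it, not the required solution; the sample values, the population standard deviation, the colours, and the axis label are placeholders (assumptions).

```r
# Placeholder sample and population standard deviation.
x <- c(653, 646, 654, 153, 305, 1200, 1193, 172)
population_sd <- 270

# Critical values and margins of error for all three levels at once.
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels
z_crit <- qnorm(alpha / 2, lower.tail = FALSE)
margin <- z_crit * population_sd / sqrt(length(x))

ci <- data.frame(level = conf_levels,
                 lower = mean(x) - margin,
                 upper = mean(x) + margin)
print(ci)

# One horizontal segment per interval, stacked at y = 1, 2, 3.
plot(NA, xlim = range(ci$lower, ci$upper), ylim = c(0.5, 3.5),
     xlab = "Variable title (placeholder)", ylab = "", yaxt = "n")
segments(x0 = ci$lower, y0 = 1:3, x1 = ci$upper, y1 = 1:3,
         col = c("red", "black", "blue"), lwd = 2)
points(x = rep(mean(x), 3), y = 1:3, pch = 20, cex = 2)
axis(side = 2, at = 1:3, labels = paste0(conf_levels * 100, "%"), las = 1)
```

The intervals widen as the confidence level increases, which is the main point students should bring out in their explanatory sentence.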

Minimum expectation:

One plot (2.5 marks)
Three confidence intervals (2.5 marks)
A sentence explaining the outputs (2.5 marks)
Code (2.5 marks)

Conduct your own analysis of the dataset. This could include calculating z-scores to find unusual values or groups of unusual values and any plots you want to explore. Write a paragraph commenting on the results and interpreting your findings. (out of 10)

Here I want to see what students know so far. They can find the z-score for each member of the sample and see which values are more or less likely to occur (values near the mean have a higher probability of being selected). They can also use the central limit theorem to find the probability that a sample mean falls in a given interval, for example within a specific distance of the population mean, or what percentage of sample means fall in a specific interval.
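
For the per-observation z-scores mentioned above, a minimal sketch could look like this. The sample and population values are placeholders, and the cutoff of 2 standard deviations for "unusual" is a common convention assumed here, not something the assignment prescribes.

```r
# Placeholder sample and population parameters.
x <- c(653, 646, 654, 153, 305, 1200, 1193, 172)
population_mean <- 860
population_sd <- 270

# z-score of each individual observation relative to the population.
z_scores <- (x - population_mean) / population_sd

# Flag observations more than 2 standard deviations from the mean.
unusual <- abs(z_scores) > 2

data.frame(value = x, z = round(z_scores, 2), unusual = unusual)
```

Note that these use the population standard deviation, not the standard error, because each z-score describes a single observation and not a sample mean.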

For the charts, they can plot bar charts and histograms, and they can also search to learn about other chart types.

This is an example of a plot:

#define the sample, then the population mean and standard deviation
x = c(653, 646, 654, 153, 305, 1200, 1193, 172)
sample_mean <- mean(x)

population_mean <- 860
population_sd <- 270

sampling_distributions_sd <- population_sd/sqrt(length(x))
sampling_distributions_mean <- population_mean
#Create a sequence of 1000 x values spanning 4 standard errors either side of the population mean
sampling_distributions <- seq(-4, 4, length = 1000) * sampling_distributions_sd + sampling_distributions_mean
#normal density at each of those x values
y <- dnorm(sampling_distributions, sampling_distributions_mean, sampling_distributions_sd)
#plot normal distribution with customized x-axis labels
plot(sampling_distributions,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
sd_axis_bounds = 5 # we will plot until 5 stdev
axis_bounds <- seq(-sd_axis_bounds * sampling_distributions_sd + sampling_distributions_mean,
                   sd_axis_bounds * sampling_distributions_sd + sampling_distributions_mean,
                   by = sampling_distributions_sd)

axis(side = 1, at = axis_bounds, pos = 0)
#mark 1 standard error above and below the mean and shade the tails beyond these bounds
#(a sample mean 3 or more standard errors from the mean would have very low probability)
upper_b <- sampling_distributions_mean + 1 * sampling_distributions_sd
lower_b <- sampling_distributions_mean - 1 * sampling_distributions_sd
#tail probability beyond each bound (the two tails are equal by symmetry)
tail_p <- round(pnorm(upper_b, sampling_distributions_mean, sampling_distributions_sd, lower.tail = FALSE), 2)

abline(v=upper_b,col="orange")
polygon(c(sampling_distributions[sampling_distributions>=upper_b], max(sampling_distributions), upper_b), c(y[sampling_distributions>=upper_b], 0, 0), col="red")
text(upper_b+10, 0.001, paste0("P(X>", round(upper_b), ")=", tail_p))

abline(v=lower_b,col="orange")
polygon(c(min(sampling_distributions),sampling_distributions[sampling_distributions<=lower_b],lower_b ), c(0,y[sampling_distributions<=lower_b], 0), col="red")
text(lower_b-10, 0.001, paste0("P(X<", round(lower_b), ")=", tail_p))


abline(v=sample_mean,col="green")
text(sample_mean+10, 0.004, "sample mean")

To calculate the z-score of the sample mean:

#define the sample, then the population mean and standard deviation
x = c(653, 646, 654, 153, 305, 1200, 1193, 172)
sample_mean <- mean(x)

population_mean <- 860
population_sd <- 270

sampling_distributions_sd <- population_sd/sqrt(length(x))
z_score_for_sample_mean <-(sample_mean - population_mean)/sampling_distributions_sd
pnorm(z_score_for_sample_mean, lower.tail = TRUE) #probability of a z-score this low or lower

## [1] 0.006329766
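
The same sampling-distribution idea extends to the probability that a sample mean falls inside an interval. The sketch below assumes the sampling distribution of the mean is approximately normal (with a sample as small as 8 this relies on the population itself being roughly normal). The interval endpoints `a` and `b` are hypothetical values chosen only for illustration.

```r
# Population parameters and sample size (placeholders).
population_mean <- 860
population_sd <- 270
n <- 8

# Standard error of the sample mean.
se <- population_sd / sqrt(n)

# Hypothetical interval of interest for the sample mean.
a <- 800
b <- 920

# P(a < sample mean < b) under the normal sampling distribution.
p_between <- pnorm(b, mean = population_mean, sd = se) -
  pnorm(a, mean = population_mean, sd = se)
p_between
```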

Minimum expectation:

At least one plot (2.5 marks)
Different z-scores (2.5 marks)
Explaining the outputs (2.5 marks)
Code (2.5 marks)