Exercise 1. A number of individuals are interested in the proportion of citizens within a county who will vote to use tax money to upgrade a professional football stadium in the upcoming vote. Consider the following methods:
The Football Team Owner surveyed 10,000 people attending one of the football games held in the stadium. Seventy three percent (73%) of respondents said they supported the use of tax money to upgrade the stadium.
The Pollster generated 1,000 random numbers between 1-40,768 (number of county voters in last election) and surveyed the 1,000 citizens who corresponded to those numbers on the voting roll. Forty seven percent (47%) of respondents said they supported the use of tax money to upgrade the stadium.
- What is the population of interest? What is the parameter of interest? Will this parameter ever be calculated?
Population: The collection of Yes/No responses from all county citizens who will vote in next election
Parameter: The proportion of citizens within a county who will vote to use tax money to upgrade a professional football stadium in the upcoming vote
We’ll know this parameter value after the vote.
- What were the sample sizes used and statistics calculated from those samples? Are these simple random samples from the population of interest?
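Sample sizes: Football Team Owner survey, n = 10,000; Pollster survey, n = 1,000.
Statistics: the sample proportions supporting the upgrade, 73% (Football Team Owner survey) and 47% (Pollster survey).
The Pollster survey is a simple random sample drawn from the roll of voters in the last election, which is a reasonable stand-in for the population of interest. The Football Team Owner survey is not a random sample: only people attending a game at the stadium could be selected.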
- The football team owner claims that the survey done at the football stadium will better predict the voting outcome because the sample size was much larger. What is your response?
Even though the Football Team Owner survey had a much larger sample size, it is not representative of the population because the respondents were not chosen at random. The survey is biased because it only includes people who were attending a football game at the stadium. This likely leads to an overestimate of the population proportion, since people at the stadium are mostly interested in football and may be more supportive of the stadium upgrade than the general voting public. A larger sample size reduces random variation, but it does not remove bias.
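The point about sample size versus bias can be illustrated with a small simulation. This is only a sketch: the share of voters who attend games and the support rates in each group are hypothetical values chosen for illustration, not numbers from the exercise. Only the roll size (40,768) and the two sample sizes come from the question.

```r
# All rates below are hypothetical assumptions for illustration only.
set.seed(1)
N <- 40768                                 # county voters, from the exercise
is_fan <- rbinom(N, 1, 0.30) == 1          # assumed: 30% of voters attend games
p_support <- ifelse(is_fan, 0.75, 0.35)    # assumed support rates by group
support <- rbinom(N, 1, p_support)         # 1 = supports the upgrade
true_p <- mean(support)                    # the population parameter

# Large sample taken only from game attendees (biased toward fans)
stadium_sample <- sample(support[is_fan], 10000)
# Small simple random sample taken from the whole voting roll
srs_sample <- sample(support, 1000)

# Compare each sample proportion with the parameter
c(parameter = true_p,
  stadium   = mean(stadium_sample),
  srs       = mean(srs_sample))
```

Under these assumptions the stadium sample stays far from the parameter no matter how large it is, while the much smaller random sample lands close to it.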
Exercise 2. There are 12 numbers in a sample, and the mean is \(\bar{x}=24\). The minimum of the sample is accidentally changed from 11.9 to 1.19.
- Is it possible to determine the direction in which the mean (\(\bar{x}\)) changes (increase/decrease/same)? Or how much the mean changes? If so, by how much does it change? If not, why not?
The mean would decrease because the sum of the sample values with the error is lower than the original sum, and dividing a smaller sum by the same n = 12 gives a smaller mean.
24*12 #original sum
## [1] 288
288-11.9+1.19 #sum with error: remove the old minimum, add the mistyped value
## [1] 277.29
277.29/12 #mean with error value
## [1] 23.1075
24-23.1075 #difference in mean values
## [1] 0.8925
The mean decreases by 0.8925, from 24 to 23.1075.
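As a cross-check on the arithmetic above, the change in the mean can also be computed directly: changing one value shifts the mean by (new value - old value) / n. The variable names here are illustrative only.

```r
n <- 12
old_value <- 11.9
new_value <- 1.19

mean_shift <- (new_value - old_value) / n  # change in the mean
new_mean <- 24 + mean_shift                # original mean plus the shift

mean_shift  # -0.8925
new_mean    # 23.1075
```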
- Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?
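The median would not change (the change is 0). With 12 values, the median is the average of the 6th and 7th ordered values. Lowering the minimum leaves it as the smallest value, so the order of the data and the two middle values are unaffected.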
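A quick check with made-up numbers shows the same thing. The sample below is hypothetical: it was constructed only to match the facts given in the exercise (12 values, mean 24, minimum 11.9) and is not the actual data.

```r
# Hypothetical sample: 12 values, mean 24, minimum 11.9
x <- c(11.9, 18, 20, 21, 22, 23, 25, 26, 27, 29, 31, 34.1)

# Introduce the error: the minimum becomes 1.19
x_err <- replace(x, which.min(x), 1.19)

mean(x)        # matches the stated mean of 24
median(x)      # average of the 6th and 7th ordered values
median(x_err)  # same as median(x): the middle values did not move
```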
- Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.
The standard deviation will increase: the minimum is already below the mean, and the error moves it even further away, which increases the spread of the data. It is difficult to predict by how much the standard deviation will change because we are not given the whole data set, so we do not know the deviations of the other values or the original standard deviation.
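One way to see both parts of this answer is through the sum of squared deviations, SS = sum(x^2) - n * mean^2. The change in SS depends only on quantities given in the exercise, but the resulting change in the standard deviation depends on the original standard deviation, which is unknown. In the sketch below, the `old_sd` values are hypothetical and used only for illustration.

```r
n <- 12
old_value <- 11.9
new_value <- 1.19
old_mean <- 24
new_mean <- old_mean + (new_value - old_value) / n

# SS = sum(x^2) - n * mean^2, so the change in SS needs only known numbers
ss_change <- (new_value^2 - old_value^2) - n * (new_mean^2 - old_mean^2)
ss_change  # positive, so the standard deviation must go up

# Hypothetical original standard deviations (not given in the exercise)
old_sd <- c(4, 6, 10)
new_sd <- sqrt(old_sd^2 + ss_change / (n - 1))

data.frame(old_sd, new_sd, increase = new_sd - old_sd)
```

The size of the increase is different for each assumed starting value, which is why the amount cannot be determined without the full data set.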
Exercise 3. After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.
| Number of Defects | Number of Disks |
|---|---|
| 0 | 41 |
| 1 | 31 |
| 2 | 15 |
| 3 | 8 |
| 4 | 5 |
- Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.
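The data are quantitative and discrete: each observation is a count of the errors on one disk, so it can only take whole-number values (0, 1, 2, ...).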
- A frequency histogram showing the frequency for number of errors on the 100 disks is given below. Write the R code to produce this frequency histogram. Be sure to create useful labels. Hint: the rep() function will simplify your code.
Defects= c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4,5))
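hist(Defects, breaks=seq(-0.5,4.5), labels=TRUE, ylim=c(0,50),
     main="Errors Detected per Disk", xlab="Number of defects", ylab="Number of disks") #one bar per count, centered on 0-4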
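Before plotting, it can help to confirm that the vector built with rep() reproduces the frequency table. The sketch below uses the vectorized form of rep() with a `times` argument, which is equivalent to the five separate rep() calls above; the names `values` and `counts` are illustrative.

```r
values <- 0:4                       # number of defects
counts <- c(41, 31, 15, 8, 5)       # number of disks, from the table

# Repeat each value by its count: 41 zeros, 31 ones, and so on
Defects <- rep(values, times = counts)

length(Defects)  # 100 disks in total
table(Defects)   # should match the frequency table in the exercise
```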
- What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?
The histogram is right-skewed. This makes sense because the number of defects cannot go below 0, most disks have no defects or only one, and disks with more defects are progressively rarer.
- Calculate the mean and median number of errors detected on the 100 disks by hand and with R. How do the mean and median values compare and is that consistent with what we would guess based on the shape?
sum(Defects)/length(Defects) #mean by hand: total defects divided by number of disks
## [1] 1.05
((41*0)+(1*31)+(2*15)+(3*8)+(4*5))/100
## [1] 1.05
mean(Defects) #mean with R
## [1] 1.05
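100/2 #median by hand: with n = 100, average the 50th and 51st ordered values
## [1] 50
#the 41 zeros fill positions 1-41 and the 31 ones fill positions 42-72, so the 50th and 51st values are both 1 and the median is 1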
median(Defects) #median with R
## [1] 1
The mean (1.05) is greater than the median (1), which is consistent with the shape being right-skewed: the few disks with many defects pull the mean upward but do not affect the median.
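The by-hand steps can also be carried out directly from the frequency table, without expanding it into 100 values. This is a sketch; `values`, `counts`, and the helper `value_at` are illustrative names, not part of the assignment.

```r
values <- 0:4
counts <- c(41, 31, 15, 8, 5)

# Mean: weight each value by the number of disks that have it
weighted.mean(values, counts)  # 1.05

# Median: cumulative counts show which value sits at each ordered position
cum_counts <- cumsum(counts)   # 41 72 87 95 100

# Value at a given position = first value whose cumulative count reaches it
value_at <- function(pos) values[which(cum_counts >= pos)[1]]
(value_at(50) + value_at(51)) / 2  # 1
```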
- Calculate the sample standard deviation “by hand” and using R. Are the values consistent between the two methods? How would our calculation differ if instead we know that these 100 values were the whole population?
sqrt(((((0-1.05)^2)*(41))+((1-1.05)^2)*(31)+((2-1.05)^2)*(15)+((3-1.05)^2)*(8)+((4-1.05)^2)*(5))/(100-1)) #standard deviation by hand
## [1] 1.157976
sd(Defects) #standard deviation with R
## [1] 1.157976
The values are consistent between the two methods. We would divide by 100 instead of 100-1 if these 100 values were the whole population.
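The difference between the two divisors can be sketched as follows. Note that R's sd() always uses the n - 1 divisor, so the population version has to be computed by hand; the variable names are illustrative.

```r
values <- 0:4
counts <- c(41, 31, 15, 8, 5)
N <- sum(counts)                        # 100 disks

mu <- sum(values * counts) / N          # mean, 1.05
ss <- sum(counts * (values - mu)^2)     # sum of squared deviations

sample_sd <- sqrt(ss / (N - 1))         # sample formula, same as sd()
pop_sd    <- sqrt(ss / N)               # population formula

c(sample = sample_sd, population = pop_sd)
```

With 100 values the two results differ only slightly (the population value is a little smaller), because dividing by 100 instead of 99 changes the variance by about 1%.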
- Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data (identified in (c)) can be seen from the boxplot.
boxplot(Defects, horizontal=TRUE, main="Errors Detected per Disk", xlab="Number of defects") #horizontal boxplot with title and axis label
text(x=fivenum(Defects), labels=fivenum(Defects), y=1.25)
quantile(Defects, type=2)
## 0% 25% 50% 75% 100%
## 0 0 1 2 4
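The right skew can be seen by comparing the two sides of the boxplot. The minimum and the first quartile are both 0, so there is no whisker on the left, while the right whisker extends from 2 to 4. The median (1) is also closer to the left edge of the data than to the right. The longer right side indicates a right-skewed distribution.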
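The numbers that boxplot() draws can be inspected with boxplot.stats(), which makes the asymmetry explicit. This is a sketch that rebuilds the data from the frequency table so it can be run on its own.

```r
Defects <- rep(0:4, times = c(41, 31, 15, 8, 5))

bs <- boxplot.stats(Defects)

# Order: lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$stats

# Points drawn individually as outliers (beyond 1.5 * IQR from the hinges)
bs$out
```

Here the lower whisker end and the lower hinge coincide at 0, while the upper whisker reaches the maximum of 4, so all of the visible whisker length is on the right.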
- Explain why the histogram is better able to show the discrete nature of the data than a boxplot.
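The histogram shows a separate bar for each possible value (0, 1, 2, 3, 4), so it is easy to see that the data take only whole-number values and how many disks have each one. The boxplot reduces the data to a five-number summary drawn on a continuous axis, which hides where the individual values lie and makes the data look as though they could fall anywhere between 0 and 4.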
| Sadia | Chelsey Green | chelseygreen@wisc.edu |
|---|---|---|
| Emma Holmes | holmes8@wisc.edu |