*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.
*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.
*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.
*You must include an explanation and/or intermediate calculations for an exercise to be complete.
*Be sure to submit the HWK1 Autograde Quiz which will give you ~20 of your 40 accuracy points.
*50 points total: 40 points accuracy, and 10 points completion
Exercise 1. A number of individuals are interested in the proportion of citizens within a county who will vote to use tax money to upgrade a professional baseball stadium in the upcoming vote. Consider the following methods:
The Baseball Team Owner surveyed 8,000 people attending one of the baseball games held in the stadium. Seventy eight percent (78%) of respondents said they supported the use of tax money to upgrade the stadium.
The Pollster generated 1,000 random numbers between 1-52,661 (number of county voters in last election) and surveyed the 1,000 citizens who corresponded to those numbers on the voting roll. Forty three percent (43%) of respondents said they supported the use of tax money to upgrade the stadium.
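The Pollster's selection step can be sketched in R. This is only an illustration: the seed is arbitrary, and it assumes the 1,000 numbers are drawn without replacement so that no citizen is surveyed twice, which the exercise does not state.

```r
# Sketch of the Pollster's sampling step (seed and no-replacement are assumptions).
set.seed(1)
n_voters <- 52661   # county voters in the last election (from the exercise)
n_sample <- 1000    # number of citizens to survey

# Draw 1,000 distinct positions on the voting roll; every set of 1,000
# positions is equally likely, which is what makes this a simple random sample.
roll_ids <- sample(n_voters, size = n_sample, replace = FALSE)

length(unique(roll_ids))   # number of distinct citizens selected
range(roll_ids)            # smallest and largest roll position drawn
```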
- What is the population of interest? What is the parameter of interest? Will this parameter ever be calculated?
Population of interest: All citizens of the county who will vote in the upcoming stadium vote.
Parameter of interest: The proportion of those voters who will vote to use tax money to upgrade the stadium.
Yes, the parameter will be calculated: once the vote has been held and all ballots are counted, the true proportion is known.
- What were the sample sizes used and statistics calculated from those samples? Are these simple random samples from the population of interest?
The sample sizes were 8,000 (Baseball Team Owner) and 1,000 (Pollster).
The statistics are the sample proportions: 78% of the 8,000 baseball game attendees and 43% of the 1,000 citizens sampled from the voting roll supported the use of tax money to upgrade the stadium.
The Baseball Team Owner's sample is not a simple random sample: only people attending a game could be selected, and attendees are likely biased in favor of the stadium. The Pollster's sample is a simple random sample of the voting roll from the last election, which is close to, but may not exactly match, the population of people who will vote in the upcoming vote.
- The baseball team owner claims that the survey done at the baseball stadium will better predict the voting outcome because the sample size was much larger. What is your response?
A larger sample size improves precision, but it does not correct for bias. A sample must also be representative of the population. Because the owner's sample consisted entirely of baseball game attendees, who are more likely to support spending tax money on the stadium, its 78% probably overestimates support in the voting population, no matter how many attendees were surveyed.
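To make the precision-versus-bias point concrete, here is a rough sketch comparing the standard errors of the two sample proportions. The helper name `se_prop` is made up for this illustration, and the formula assumes a simple random sample, which is exactly the assumption the stadium survey fails.

```r
# Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n).
# se_prop is a hypothetical helper name used only for this sketch.
se_prop <- function(p_hat, n) sqrt(p_hat * (1 - p_hat) / n)

se_owner    <- se_prop(0.78, 8000)  # stadium survey
se_pollster <- se_prop(0.43, 1000)  # voting-roll survey

c(owner = se_owner, pollster = se_pollster)

# Gap between the two estimates, measured in units of the larger standard error.
gap_in_se <- (0.78 - 0.43) / se_pollster
gap_in_se
```

Both standard errors are small compared with the 35-point gap between the two estimates, so sampling variability alone cannot explain the disagreement. The larger sample only makes the owner's biased estimate more precise.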
Exercise 2. There are 12 numbers in a sample, and the mean is \(\bar{x}=24\). The minimum of the sample is accidentally changed from 11.9 to 1.19.
- Is it possible to determine the direction in which (increase/decrease) the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not?
Yes, both the direction and the amount can be determined. The mean decreases because one value gets smaller while the other 11 values stay the same, so the total drops by exactly the amount of the change.
Value Reduction = 11.9 - 1.19 = 10.71
Mean Reduction = \(\frac{\text{Value Reduction}}{\text{Total values}}\) = \(\frac{10.71}{12}\) = 0.8925, so the new mean is 24 - 0.8925 = 23.1075
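The same arithmetic can be checked in R using only the sample size, the original mean, and the two versions of the minimum. The variable names are chosen for this sketch.

```r
n        <- 12
old_mean <- 24

old_total <- n * old_mean               # sum of the original 12 values
new_total <- old_total - 11.9 + 1.19    # swap the correct minimum for the typo
new_mean  <- new_total / n

old_mean - new_mean   # amount by which the mean decreases
new_mean              # mean after the change
```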
- Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?
Yes. The median stays the same. The changed value was the minimum before the change and is still the minimum afterward, so the ordering of the 12 values does not change. With 12 values, the median is the average of the 6th and 7th ordered values, and neither of them is affected.
- Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.
Yes, the direction can be predicted: the standard deviation gets larger. The minimum was already below the mean, and lowering it moves it even farther from the mean (even after accounting for the mean dropping by 0.8925), so the spread of the values increases. It is difficult to predict by exactly how much the standard deviation increases because we do not know the other 11 values, and the size of the change depends on how spread out they are.
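As an illustration only, the sketch below builds two hypothetical samples (invented for this example, not the actual data) that both have 12 values, a mean of 24, and a minimum of 11.9. Applying the same typo to each shows that the median is unchanged and the standard deviation grows in both, but by different amounts.

```r
# Two hypothetical samples: n = 12, mean 24, minimum 11.9 in both.
tight  <- c(11.9, rep(25.1, 11))
spread <- c(11.9, 12, 13, 14, 20, 25, 25.1, 30, 32, 32, 35, 38)

# Replace the minimum with the mistyped value 1.19.
corrupt_min <- function(x) {
  x[which.min(x)] <- 1.19
  x
}

# Collect the before/after summaries for one sample.
summarise_change <- function(x) {
  y <- corrupt_min(x)
  c(mean_before   = mean(x),
    median_before = median(x), median_after = median(y),
    sd_before     = sd(x),     sd_after     = sd(y),
    sd_increase   = sd(y) - sd(x))
}

result <- rbind(tight = summarise_change(tight),
                spread = summarise_change(spread))
round(result, 3)
```

The two `sd_increase` values differ even though the same change was made to the same minimum, which is why the size of the change cannot be stated without knowing the other 11 values.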
Exercise 3: After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.
| Number of Defects | Number of Disks |
|---|---|
| 0 | 41 |
| 1 | 31 |
| 2 | 15 |
| 3 | 8 |
| 4 | 5 |
- Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.
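The variable recorded for each disk is its number of defects, which is quantitative and discrete (a count that can only take whole-number values such as 0, 1, 2, ...). The number of disks in the table is the frequency with which each defect count occurred.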
- A frequency histogram showing the number of errors on the 100 disks is given below. Write the R code to produce this frequency histogram. Be sure to create useful labels. Hints: use the rep() function to define your defect data. Also use ylim and breaks to format your graph.
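# One entry per disk: each defect count repeated by its frequency in the table
Defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
# Breaks at the half-integers center one bar on each whole-number defect count
hist(Defects, breaks = -0.5:4.5, labels = TRUE, ylim = c(0, 50), main = "Number of Defects per Disk", xlab = "Number of Defects", ylab = "Number of Disks")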
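Before plotting, it can help to confirm that the expanded vector reproduces the frequency table. The sketch below uses the `times` argument of `rep()`, an equivalent and shorter way to build the same data.

```r
# Same data as above, built with a single rep() call.
Defects <- rep(0:4, times = c(41, 31, 15, 8, 5))

table(Defects)    # counts per defect value; should match the table in the exercise
length(Defects)   # total number of disks
```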
- What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?
The histogram is right-skewed. This makes sense because manufacturers try to keep defects to a minimum and the count cannot go below 0, so most disks have few or no defects and only a few disks have many.
- Calculate the mean and median number of errors detected on the 100 disks by hand and with R. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use LaTeX such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly.]
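Mean (Hand): \(\bar{x}=\frac{(41*0) + (31*1) + (15*2) + (8*3) + (5*4)}{41 + 31 + 15 + 8 + 5} = \frac{105}{100}\)
Mean (Hand) = Mean number of errors/defects = 1.05
Median (Hand):
Median position: \(\frac{100+1}{2}\) = 50.5, so the median is the average of the 50th and 51st ordered values. The 41 zeros fill positions 1-41 and the 31 ones fill positions 42-72, so both of those values are 1.
Median (Hand) = Median errors/defects = 1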
mean(Defects)
## [1] 1.05
median(Defects)
## [1] 1
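The hand calculation can also be reproduced directly from the frequency table, without expanding it into 100 values. The names `defects`, `disks`, and `value_at` are chosen for this sketch.

```r
defects <- 0:4
disks   <- c(41, 31, 15, 8, 5)
n       <- sum(disks)

# Weighted mean: multiply each value by its frequency instead of adding repeatedly.
weighted_mean <- sum(defects * disks) / n

# Cumulative counts show which ordered positions each defect value occupies.
cum_disks <- cumsum(disks)

# Defect value found at a given ordered position (hypothetical helper).
value_at <- function(pos) defects[which(cum_disks >= pos)[1]]

# n is even, so the median averages the values at positions n/2 and n/2 + 1.
median_hand <- (value_at(n / 2) + value_at(n / 2 + 1)) / 2

weighted_mean
median_hand
```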
- Calculate the sample standard deviation "by hand" and using R. Are the values consistent between the two methods? How would our calculation differ if instead we know that these 100 values were the whole population? [hint: use multiplication instead of repeated addition]
Standard Deviation (Hand)
\(\sum f_i (x_i - \bar{x})^{2}\), where \(x_i\) is the number of defects and \(f_i\) is the number of disks with that many defects =
\(41(0-1.05)^{2} + 31(1-1.05)^{2} + 15(2-1.05)^{2} + 8(3-1.05)^{2} + 5(4-1.05)^{2}\)
= 132.75
Standard Deviation = \(s = \sqrt{\frac{132.75}{100-1}}\) = 1.157976
Standard Deviation (R)
sd(Defects)
## [1] 1.157976
The values are consistent between the two methods. If the 100 values were the whole population, the n - 1 in the denominator would be replaced by n, giving a slightly smaller standard deviation: \(\sqrt{\frac{132.75}{100}}\) ≈ 1.152.
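A short sketch of both versions of the calculation, working from the frequency table. Note that R's `sd()` always uses the n - 1 denominator, so the population version has to be computed explicitly.

```r
defects <- 0:4
disks   <- c(41, 31, 15, 8, 5)
n       <- sum(disks)

x_bar <- sum(defects * disks) / n
ss    <- sum(disks * (defects - x_bar)^2)   # frequency-weighted sum of squares

sd_sample     <- sqrt(ss / (n - 1))   # sample standard deviation
sd_population <- sqrt(ss / n)         # population standard deviation

# Cross-check the sample version against sd() on the expanded data.
Defects <- rep(defects, times = disks)
all.equal(sd_sample, sd(Defects))

c(sample = sd_sample, population = sd_population)
```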
- Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data (identified in (c)) can be seen from the boxplot using words such as median and quartile.
boxplot(Defects, horizontal=TRUE, xlab="Number of Defects", main="Number of Defects on 100 Disks")
text(x = fivenum(Defects), y = 1.25, labels = fivenum(Defects))
quantile(Defects)
## 0% 25% 50% 75% 100%
## 0 0 1 2 4
range(Defects)
## [1] 0 4
IQR(Defects)
## [1] 2
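The boxplot shows the same right skew as the histogram. The median (1) sits in the middle of the box (first quartile 0, third quartile 2), but the first quartile equals the minimum, so there is no lower whisker, while the upper whisker stretches from the third quartile (2) to the maximum (4). At least a quarter of the disks are piled up at 0, and the upper quarter of the data is spread over a longer range, which is the long right tail.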
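The whisker and box lengths that the boxplot draws can be read off numerically with `boxplot.stats()`, which returns the five values used for the plot and any points flagged as outliers. The variable names are chosen for this sketch.

```r
Defects <- rep(0:4, times = c(41, 31, 15, 8, 5))

bs   <- boxplot.stats(Defects)
five <- bs$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end

lower_whisker <- five[2] - five[1]   # lower hinge minus lower whisker end
upper_whisker <- five[5] - five[4]   # upper whisker end minus upper hinge
left_of_box   <- five[3] - five[2]   # median minus lower hinge
right_of_box  <- five[4] - five[3]   # upper hinge minus median

c(lower_whisker = lower_whisker, upper_whisker = upper_whisker,
  left_of_box = left_of_box, right_of_box = right_of_box)
bs$out   # values plotted as outliers, if any
```

Comparing `lower_whisker` with `upper_whisker` gives a numerical version of the skew described above.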
- Explain why the histogram is better able to show the discrete nature of the data than a boxplot.
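The histogram shows the discrete nature of the data better because, with one bar per whole-number value, it displays exactly which values occur (0, 1, 2, 3, 4) and how many disks have each. The boxplot shows only the five-number summary drawn along a continuous axis, so it does not reveal that the data can take only whole-number values or how often each value occurs.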