*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.
*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.
*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.
*You must include an explanation and/or intermediate calculations for an exercise to be complete.
*Be sure to submit the HWK1 Autograde Quiz which will give you ~20 of your 40 accuracy points.
*50 points total: 40 points accuracy, and 10 points completion
Exercise 1. A number of individuals are interested in the proportion of citizens within a county who will vote to use tax money to upgrade a professional football stadium in the upcoming vote. Consider the following methods:
The Football Team Owner surveyed 10,000 people attending one of the football games held in the stadium. Seventy three percent (73%) of respondents said they supported the use of tax money to upgrade the stadium.
The Pollster generated 1,000 random numbers between 1-40,768 (number of county voters in last election) and surveyed the 1,000 citizens who corresponded to those numbers on the voting roll. Forty seven percent (47%) of respondents said they supported the use of tax money to upgrade the stadium.
- What is the population of interest? What is the parameter of interest? Will this parameter ever be calculated?
The population of interest is the voting citizens of the county. The parameter of interest is the proportion of those citizens who support using tax money to upgrade the stadium. Yes, this parameter will be calculated once the vote is held and the ballots are counted.
- What were the sample sizes used and statistics calculated from those samples? Are these simple random samples from the population of interest?
The sample sizes were 10,000 and 1,000. The statistics calculated were the sample proportions supporting the use of tax money to upgrade the stadium: 73% in the Football Team Owner's survey and 47% in the Pollster's survey. The Football Team Owner's survey is not an SRS, but the Pollster's is.
- The football team owner claims that the survey done at the football stadium will better predict the voting outcome because the sample size was much larger. What is your response?
A larger sample does not fix a biased sampling method. The owner only surveyed people attending a game at the stadium, and those people use the stadium directly, so it makes sense that they would support upgrading it more often than county voters in general. The Pollster's smaller random sample should predict the vote better.
Exercise 2: After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.
| Number of Defects | Number of Disks |
|---|---|
| 0 | 41 |
| 1 | 31 |
| 2 | 15 |
| 3 | 8 |
| 4 | 5 |
- Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.
The data collected here is quantitative and discrete.
- Code for a frequency histogram showing the frequency for number of errors on the 100 disks is given below.
bi. Knit the document and confirm that the histogram displays in the knitted file.
error.data=c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)),
xlab="Defects", main="Number of Defects",
labels=TRUE, ylim=c(0,50))
bii. Describe what the rep() function does in this code chunk.
The rep() function replicates the values in x as many times as specified. For example, rep(0,41) will repeat the number 0 41 times in the vector. This is both easier and more efficient than typing 0 41 times.
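The behavior of rep() can be checked directly. This is a small sketch that rebuilds the Exercise 2 data vector and confirms it matches the frequency table:

```r
# rep(x, times) returns x repeated `times` times
rep(0, 3)            # three zeros

# Rebuild the data vector from the Exercise 2 frequency table
error.data <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

length(error.data)   # total number of disks
table(error.data)    # counts per number of defects, should match the table
```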
biii. Describe how this breaks command affects the histogram’s appearance in this code chunk.
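The breaks argument places the bin edges halfway between the integer defect counts (-0.5, 0.5, 1.5, ..., 4.5), so each bar is one unit wide and centered on a single value. Without it, R chooses its own bin edges, which need not be centered on the integers, so the bars can be misaligned with the defect counts they represent.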
biv. Describe how setting ylim=c(0,30) instead of ylim=c(0,50) would change the histogram’s appearance. Which value for ylim is preferable for clear communication of the data?
If the y limit is set to 30 instead of 50, the bars for 0 and 1 defects (41 and 31 disks) extend past the top of the plot and are cut off, so we lose the ability to tell how many disks have that many defects without additional knowledge. The preferable limit is 50 because it shows every column in full, which makes the plot both easier and quicker to read.
Exercise 3. There are 12 numbers in a sample, and the mean is \(\bar{x}=27\). The minimum of the sample is accidentally changed from 13.8 to 1.38.
- Is it possible to determine the direction in which (increase/decrease) the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not?
Yes, it is possible to predict both the direction and the amount of the shift. Since we are decreasing a value, the mean must also decrease. We can find the new mean by taking the total sum (mean times number of observations), subtracting the old minimum, adding the new minimum, and dividing by the number of observations. Doing this gives a new mean of 25.965, so the mean decreased by 1.035.
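The arithmetic for the new mean can be sketched in R. The variable names here are only for illustration; the numbers come from the exercise statement:

```r
n        <- 12      # sample size
old.mean <- 27      # original sample mean
old.min  <- 13.8    # minimum before the accidental change
new.min  <- 1.38    # minimum after the accidental change

old.sum  <- n * old.mean                  # total of all 12 values
new.sum  <- old.sum - old.min + new.min   # swap out the minimum
new.mean <- new.sum / n

new.mean              # 25.965
old.mean - new.mean   # decrease of 1.035
```

The decrease is simply the change in the one value spread over all 12 observations: (13.8 - 1.38) / 12 = 1.035.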
- Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?
It is possible to predict. The value that changed was the minimum, and it was decreased, so it remains the minimum. Thus the order of the data values does not change, and the median does not change.
- Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.
It is possible to predict the direction in which the standard deviation changes, but we cannot find the new value without knowing the full data set. The minimum was already below the mean, and lowering it moves it even farther from the mean, so the typical distance from the mean increases and the SD gets larger. The size of the change is hard to predict because the standard deviation depends on the squared deviations of all 12 values from the (also shifted) mean, and we only know one of those values.
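The three answers in this exercise can be illustrated with a made-up sample. The data below are hypothetical (the exercise does not give the actual 12 values); they were chosen only so that n = 12, the minimum is 13.8, and the mean is 27:

```r
# Hypothetical sample, NOT the exercise's data: n = 12, minimum 13.8, mean 27
x.old <- c(13.8, rep(28.2, 11))

x.new <- x.old
x.new[which.min(x.new)] <- 1.38   # the accidental change to the minimum

mean(x.old);   mean(x.new)     # mean drops by 12.42 / 12 = 1.035
median(x.old); median(x.new)   # median is unchanged
sd(x.old);     sd(x.new)       # standard deviation increases
```

With a different set of 11 other values the mean and median results would be the same, but the two standard deviations would be different numbers, which is why the amount of the SD change cannot be found from the summary information alone.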
Exercise 4: Recall the computer disk error data used in Exercise 2. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.
| Number of Defects | Number of Disks |
|---|---|
| 0 | 42 |
| 1 | 30 |
| 2 | 16 |
| 3 | 7 |
| 4 | 5 |
A frequency histogram showing the frequency for number of errors on the 100 disks is given below.
error.data=c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)), xlab="Defects", main="Number of Defects", labels=TRUE, ylim=c(0,60))
- What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?
The histogram is right-skewed. This makes sense because we would expect (and hope) that most of the disks have few defects, with only a handful of disks having many.
- Calculate the mean and median number of errors detected on the 100 disks ‘by hand’ and using the built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use the text such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly].
mean(error.data)
## [1] 1.03
By hand: \(\bar{x} = \frac{(0 \cdot 42)+(1 \cdot 30)+(2 \cdot 16)+(3 \cdot 7)+(4 \cdot 5)}{100} = \frac{103}{100} = 1.03\)
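The question also asks for the median. As a sketch, both measures can be computed straight from the frequency table; the names `defects`, `disks`, and `value.at` are illustrative helpers, not part of the assignment:

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)

# Mean: weight each defect count by how many disks have it
x.bar <- sum(defects * disks) / sum(disks)

# Median: with n = 100, average the 50th and 51st ordered values
cum.disks <- cumsum(disks)   # running totals: 42 72 88 95 100
value.at  <- function(k) defects[which(cum.disks >= k)[1]]
med <- (value.at(50) + value.at(51)) / 2

x.bar   # 1.03
med     # 1
```

The 50th and 51st ordered values both fall among the disks with 1 defect (positions 43 through 72), so the median is 1. The mean (1.03) is slightly larger than the median, which is the relationship expected for right-skewed data.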
- Calculate the sample standard deviation ‘by hand’ and using the built-in R function. Are the values consistent between the two methods? How would our calculation differ if instead we considered these 100 values the whole population?
sd(error.data)
## [1] 1.149923
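By hand: \(s = \sqrt{\frac{42(0-1.03)^2+30(1-1.03)^2+16(2-1.03)^2+7(3-1.03)^2+5(4-1.03)^2}{99}} = \sqrt{\frac{130.91}{99}} \approx 1.1499\). This agrees with the value from sd(). If the 100 values were the whole population, we would divide by \(N = 100\) instead of \(n - 1 = 99\).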
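A sketch of the same calculation in R, using the frequency table rather than the full data vector. The names `ss`, `s.sample`, and `s.pop` are illustrative:

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)
n       <- sum(disks)
x.bar   <- sum(defects * disks) / n

# Sum of squared deviations from the mean, weighted by frequency
ss <- sum(disks * (defects - x.bar)^2)   # 130.91

s.sample <- sqrt(ss / (n - 1))   # sample SD: divide by n - 1 = 99
s.pop    <- sqrt(ss / n)         # population SD: divide by N = 100

s.sample   # matches sd(error.data)
s.pop      # slightly smaller, about 1.144
```

Note that the square root is taken of the whole fraction, after dividing by 99, not of the numerator alone.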
- Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data identified in (a) can be seen from the boxplot.
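boxplot(error.data, horizontal = TRUE, xlab = "Defects", main = "Number of Defects per Disk")
The right skew seen in the histogram also shows in the boxplot: the box sits at the left end of the plot, just as the tallest bars are on the left of the histogram, and the longer whisker extends to the right toward the few disks with many defects.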
- Explain why the histogram is better able to show the discrete nature of the data than a boxplot.
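A boxplot only displays five summary values (minimum, quartiles, median, maximum) along a continuous axis, so it does not show that the data can only take the values 0, 1, 2, 3, and 4, or how many disks fall at each value. The histogram has a separate bar for each possible value, so the discrete nature of the data is visible.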
- Use R to create histograms to display the sample data from each model (any kind of histogram that you want since sample sizes are the same). Have identical x and y axis scales so the two groups’ values are more easily compared. Include useful titles.
current.model <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)
new.model <- c(1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)
hist(current.model, xlim=c(0.5,2.5), ylim=c(0,6), breaks=c(seq(from=0, 2.5, by=0.1)), xlab="Gallons of water used per flush", main="Current model", labels=TRUE)
hist(new.model, xlim=c(0.5,2.5), ylim=c(0,6), breaks=c(seq(from=0, 2.5, by=0.1)), xlab="Gallons of water used per flush", main="New model", labels=TRUE)
- Compare the shapes of the distributions of amount of water used by the two models observed in the sample.
The current model has a slight right skew while the new model has a slight left skew.
- Compute the mean and median gallons flushed for the Current and New Model toilets using the built-in R function. Compare both measures of center within each group and comment on how that relationship corresponds to the shapes of the data. Also compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.
mean(current.model)
## [1] 1.5765
mean(new.model)
## [1] 1.065789
median(current.model)
## [1] 1.595
median(new.model)
## [1] 1.03
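Within each group the mean and median are close: for the current model the mean (1.5765) is slightly less than the median (1.595), and for the new model the mean (1.0658) is slightly greater than the median (1.03). A mean below the median is the pattern usually associated with left skew, and a mean above the median with right skew, so the summary statistics do not reinforce the skew directions read from the histograms; the gaps are small enough that both samples are best described as only slightly skewed. Across groups, the current model's mean and median are both roughly half a gallon higher than the new model's, which appears in the histograms as the current model's bars sitting farther to the right on the shared x-axis.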
- Compute (using built-in R function) and compare the sample standard deviation of gallons flushed by the current and new model toilets. Comment on how the relative size of these values can be identified from the histograms.
sd(current.model)
## [1] 0.2456843
sd(new.model)
## [1] 0.2058941
We see that the standard deviation for the new model is less than the current model, meaning that the gallons of water used in the new model is more consistent than the current model. We can see this in the histograms as the new model’s histogram is less spread out/more compact.
- Use R to create side-by-side boxplots of the two sets in R so they are easily comparable.
boxplot(current.model, new.model, horizontal=TRUE, names=c("Current model","New model"))
- Explain why there are no values shown as a dot on the Current Model flush boxplot. To what values do the Current model flush boxplot whiskers extend? (Use R for your boxplot calculations and type=2 for quantiles)
There are no values shown as dots because there are no outliers in the data: every value lies within 1.5 times the IQR of the quartiles.
quantile(current.model, c(.25,.75), type=2)
## 25% 75%
## 1.39 1.73
min(current.model)
## [1] 1.23
max(current.model)
## [1] 2.11
The whiskers in the current model go from the minimum (1.23) to the first quartile (1.39) and from the third quartile (1.73) to the maximum (2.11).
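The no-outlier claim can be checked with the usual 1.5 × IQR rule. This is a sketch using the type = 2 quartiles the question asks for; `lower.fence` and `upper.fence` are illustrative names. R's boxplot() builds its box from hinges, which happen to equal these quartiles for this sample but can differ slightly for other data:

```r
current.model <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54,
                   1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)

q   <- unname(quantile(current.model, c(0.25, 0.75), type = 2))   # 1.39, 1.73
iqr <- q[2] - q[1]                                                # 0.34

lower.fence <- q[1] - 1.5 * iqr   # 0.88
upper.fence <- q[2] + 1.5 * iqr   # 2.24

# Any value outside the fences would be drawn as a dot
any(current.model < lower.fence | current.model > upper.fence)   # FALSE

# Whiskers reach the most extreme values that are still inside the fences
range(current.model[current.model >= lower.fence &
                    current.model <= upper.fence])   # 1.23 2.11
```

Because the minimum (1.23) is above 0.88 and the maximum (2.11) is below 2.24, the whiskers run all the way to the minimum and maximum.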
- What would be the mean and median gallons flushed if we combined the two data sets into one large data set with 39 observations? Show how the mean can be calculated from the summary measures in part (c) along with the sample sizes and explain why the median of the combined set cannot be computed based on (c).
combined.data=c(current.model, new.model)
mean(combined.data)
## [1] 1.327692
median(combined.data)
## [1] 1.28
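We can calculate the combined mean from the summaries in part (c) by weighting each group's mean by its sample size: \(\frac{20(1.5765)+19(1.065789)}{39} \approx 1.3277\). Simply averaging the two means would be slightly off because the groups have different sizes. We cannot do the same for the median because the median depends on the ordering of the individual values, not on a sum, so the two group medians do not determine the middle value of the combined data.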
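A sketch of the combined-mean calculation from the summary values alone, using the sample sizes (20 and 19) and the means reported in part (c). The variable names are illustrative:

```r
n.current <- 20; mean.current <- 1.5765
n.new     <- 19; mean.new     <- 1.065789

# Weight each mean by its sample size to recover the overall total
combined.mean <- (n.current * mean.current + n.new * mean.new) /
                 (n.current + n.new)

combined.mean                   # about 1.327692, matching mean(combined.data)
(mean.current + mean.new) / 2   # about 1.3211, not the combined mean
```

The unweighted average of the two means differs from the true combined mean because the groups are not the same size.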