*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK1 Autograde Quiz, which will give you ~20 of your 40 accuracy points.

*50 points total: 40 points accuracy, and 10 points completion

Basics of Statistics and Summarizing Data Numerically and Graphically (I)

Exercise 1. A number of individuals are interested in the proportion of county voters who will vote to use tax money to upgrade a professional baseball stadium in the upcoming vote. Consider the following methods:

The Baseball Team Owner surveyed 8,000 people attending one of the baseball games held in the stadium. Seventy-eight percent (78%) of respondents said they supported the use of tax money to upgrade the stadium.

The Pollster generated 1,000 random numbers between 1 and 52,661 (the number of county voters in the last election) and surveyed the 1,000 county voters who corresponded to those numbers on the voting roll. Forty-three percent (43%) of respondents said they supported the use of tax money to upgrade the stadium.

  1. What is the population of interest? What is the parameter of interest? Will this parameter ever be calculated?

The population of interest is all county voters who are eligible to vote. The parameter of interest is the proportion of county voters who will vote to use tax money to upgrade the stadium. This parameter will never be calculated because doing so would require surveying every single county voter.

  1. What were the sample sizes used and statistics calculated from those samples? Are these simple random samples from the population of interest?

The sample sizes were 8,000 (people attending one of the baseball games held in the stadium) and 1,000 (randomly selected county voters). The statistics were the sample proportions supporting the use of tax money: 78% and 43%, respectively. The first sample is not a simple random sample of the population because only people attending a baseball game were surveyed. The second is a simple random sample because respondents were selected using random numbers from the full voting roll.

  1. The baseball team owner claims that the survey done at the baseball stadium will better predict the voting outcome because the sample size was much larger. What is your response?

No. A larger sample size does not make the stadium survey a better predictor of the voting outcome. The sample is still heavily biased because it includes only people attending a baseball game, who are more likely to support the stadium upgrade. A larger sample reduces random sampling variability, but it does not remove bias that comes from how the sample was selected.
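
The point about bias versus sample size can be illustrated with a small simulation. This is only a sketch: the true support rate (0.43) and the attendance probabilities (0.60 and 0.15) are made-up assumptions, not values from the exercise, and all object names are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_voters <- 52661      # size of the voting roll, from the exercise
true_support <- 0.43   # ASSUMED true proportion, for illustration only

# 1 = supports the upgrade, 0 = does not
supports <- rbinom(n_voters, size = 1, prob = true_support)

# ASSUMPTION: supporters are much more likely to attend a game.
p_attend <- ifelse(supports == 1, 0.60, 0.15)
attends <- rbinom(n_voters, size = 1, prob = p_attend) == 1

# Owner's approach: 8,000 people drawn only from game attendees.
owner_sample <- sample(supports[attends], size = 8000)

# Pollster's approach: simple random sample of 1,000 from the full roll.
poll_sample <- sample(supports, size = 1000)

population_prop <- mean(supports)
owner_estimate <- mean(owner_sample)
poll_estimate <- mean(poll_sample)

print(c(population = population_prop,
        owner = owner_estimate,
        pollster = poll_estimate))
```

Under these assumptions the larger stadium sample lands far from the population proportion, while the smaller random sample lands close to it.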

Exercise 2. There are 12 numbers in a sample, and the mean is \(\bar{x}=24\). The minimum of the sample is accidentally changed from 11.9 to 1.19.

  1. Is it possible to determine the direction in which (increase/decrease) the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not?

Yes, both the direction and the amount can be determined. The sample size is 12 and the original mean is 24. The minimum value goes from 11.9 to 1.19, a decrease of 11.9 - 1.19 = 10.71. The change in the mean is the change in the value divided by the sample size: 10.71/12 = 0.89. So the mean decreases by about 0.89, and the new mean is 24 - 0.89 = 23.11.
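
The same result follows from writing the mean as a sum. A short derivation, where \(\bar{x}_{\text{new}}\) (notation introduced here for illustration) is the mean after the change:

\[\bar{x}_{\text{new}} = \frac{\sum_{i=1}^{12} x_i - 11.9 + 1.19}{12} = \frac{12(24) - 10.71}{12} = 24 - 0.8925 = 23.1075\]

Only the sum \(12\bar{x} = 288\) is needed, so the other 11 values do not have to be known.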

  1. Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

Yes. Decreasing the minimum does not change the median at all, so the change is 0. With 12 values, the median is the average of the 6th and 7th ordered values. Lowering the minimum leaves it as the smallest value, so the ordering and the middle values stay the same. Changing other values could shift the median, but that is not what happened here.

  1. Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.

It is possible to predict the direction of change in the standard deviation: it will increase. Changing the minimum from 11.9 to 1.19 moves it much farther from the mean. It is hard to predict by how much the standard deviation changes because, without knowing the other 11 data values, we cannot compute the original standard deviation or the exact change in it.
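
A minimal R sketch of these effects, using two hypothetical samples invented for illustration (each has 12 values, a minimum of 11.9, and a mean of 24). The helper names corrupt_min and compare are also illustrative.

```r
# Two made-up samples: n = 12, minimum 11.9, mean 24, different spreads.
sample_a <- c(11.9, rep(25.1, 11))
sample_b <- c(11.9, 12, 14, 16, 20, 24, 26, 28, 30, 32, 34, 40.1)

# Replace the minimum with the mistyped value 1.19.
corrupt_min <- function(x) {
  x[which.min(x)] <- 1.19
  x
}

# Mean, median, and standard deviation before and after the change.
compare <- function(x) {
  y <- corrupt_min(x)
  c(mean_before = mean(x), mean_after = mean(y),
    median_before = median(x), median_after = median(y),
    sd_before = sd(x), sd_after = sd(y))
}

result_a <- compare(sample_a)
result_b <- compare(sample_b)
print(round(rbind(sample_a = result_a, sample_b = result_b), 3))
```

In both samples the mean drops by the same amount and the median is unchanged. The standard deviation rises in both, but by different amounts, which is why the size of the change cannot be predicted from the summary values alone.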

Exercise 3. After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

| Number of Defects | Number of Disks |
|---|---|
| 0 | 41 |
| 1 | 31 |
| 2 | 15 |
| 3 | 8 |
| 4 | 5 |

  1. Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.

The data recorded about the sample of 100 disks are quantitative (numerical) and discrete. Each observation is a count of defects on a disk, which can only take whole-number values (0, 1, 2, ...).

  1. A frequency histogram showing the number of errors on the 100 disks is given below. Write the R code to produce this frequency histogram. Be sure to create useful labels. Hints: use the c() and rep() functions to help define your defect data. Also use ylim() and breaks() to format your graph.
defects = c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
# Bin edges at -0.5, 0.5, ..., 4.5 center each bar on an integer count
hist(defects,
     breaks = seq(-0.5, 4.5, 1),
     labels = TRUE,
     ylim = c(0, 50),
     main = "Number of Defects",
     xlab = "Defects",
     ylab = "Frequency")

Figure: Defect Histogram

  1. What is the shape of the histogram for the number of defects observed in this sample? Explain why that shape makes sense in the context of the question.

The histogram is right skewed. Most disks have few or no defects, and the count cannot go below 0. Disks with many defects are uncommon, so the frequency drops as the defect count rises, leaving a tail on the right.

  1. Calculate both the mean and median number of errors detected on the 100 disks by hand and using built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use LaTeX such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly.]
mean(defects)
## [1] 1.05
median(defects)
## [1] 1

The mean (1.05) is slightly larger than the median (1). This is what we expect for a right-skewed distribution, because the few large values pull the mean upward.
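
The by-hand calculation, using multiplication for the repeated values in the table, gives the same results:

\[\bar{x} = \frac{0(41) + 1(31) + 2(15) + 3(8) + 4(5)}{100} = \frac{105}{100} = 1.05\]

For the median, \(n = 100\) is even, so the median is the average of the 50th and 51st ordered values. The first 41 ordered values are 0 and positions 42 through 72 are 1, so both middle values are 1 and the median is 1.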

  1. Calculate the sample standard deviation first "by hand" [hint: use multiplication instead of repeated addition] and then again using a built-in R function. Are the values consistent between the two methods? How would our calculation differ if instead we know that these 100 values were the whole population of interest?

\(\bar{x} = 1.05\), \(n = 100\)

Sample standard deviation formula: \[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{132.75}{99}} \approx 1.16\]
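
The sum of squared deviations, 132.75, comes from weighting each squared deviation by its frequency in the table instead of adding 100 separate terms:

\[\sum_{i=1}^{n}(x_i-\bar{x})^2 = 41(0-1.05)^2 + 31(1-1.05)^2 + 15(2-1.05)^2 + 8(3-1.05)^2 + 5(4-1.05)^2\]

\[= 45.2025 + 0.0775 + 13.5375 + 30.42 + 43.5125 = 132.75\]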

sd(defects)
## [1] 1.157976

The two methods give consistent values. If these 100 values were the whole population of interest, we would divide by \(N = 100\) instead of \(n - 1 = 99\): \(\sigma = \sqrt{132.75/100} = \sqrt{1.3275} \approx 1.15\). The sample value (1.16) is slightly larger than the population value (1.15) because dividing by \(n-1\) corrects for the tendency to underestimate variability when working from a sample.
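
A minimal R sketch of both divisors, rebuilding the defects vector so the block runs on its own. R's sd() always divides by n - 1, so the population version is computed directly here. The names ss, sample_sd, and population_sd are illustrative.

```r
defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
n <- length(defects)

# Sum of squared deviations from the mean
ss <- sum((defects - mean(defects))^2)

sample_sd <- sqrt(ss / (n - 1))   # divides by n - 1, same as sd(defects)
population_sd <- sqrt(ss / n)     # divides by n, treating the data as a population

print(c(ss = ss, sample_sd = sample_sd, population_sd = population_sd))
```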

  1. Use the built-in R function to construct a boxplot for the number of errors data. Be sure to include helpful labels. Explain how the shape of the data (identified in (c)) can be seen from the boxplot using words such as median and quartile.
boxplot(defects,
        main = "Boxplot of Number of Defects Per Disk",
        ylab = "Number of Defects")

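The boxplot reflects the right-skewed shape of the data. The minimum and the lower quartile are both 0, so there is no lower whisker, while the upper whisker stretches from the upper quartile (2) to the maximum (4). The median (1) sits midway between the quartiles, so the skew shows up in the whiskers rather than inside the box.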
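
The five values that the boxplot is drawn from can be checked with fivenum(). This is a minimal sketch that rebuilds the defects vector so the block runs on its own; the labels attached to the result are illustrative.

```r
defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(defects)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)
```

The lower hinge equals the minimum, while the upper hinge sits two units below the maximum, which is the kind of asymmetry a right-skewed sample produces.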

  1. Explain why the histogram is better able to show the discrete nature of the data than a boxplot.

The histogram is better able to show the discrete nature of the data than a boxplot because it actually displays the frequency at each integer value of the variable.

Exercise 4. A machine produces metal rods used in an automobile suspension system. A random sample of 10 rods is selected, and the diameter is measured for each. The resulting data (in millimeters) are as follows: \[8.24, 8.25, 8.20, 8.23, 8.24, 8.21, 8.26, 8.26, 8.20, 8.25\]

  1. Define a vector Diameters to store the 10 diameter measurements.
Diameters = c(8.24, 8.25, 8.20, 8.23, 8.24, 8.21, 8.26, 8.26, 8.20, 8.25)
  1. Define a second object Radii=Diameters/2 to calculate and store the radii of the 10 metal rods.
Radii = Diameters / 2
  1. Use the built-in R functions to compute the sample mean, sample standard deviation, and sample variances for the Diameters and Radii vectors.
mean(Diameters)
## [1] 8.234
sd(Diameters)
## [1] 0.02319004
var(Diameters)
## [1] 0.0005377778
mean(Radii)
## [1] 4.117
sd(Radii)
## [1] 0.01159502
var(Radii)
## [1] 0.0001344444
  1. Compare the magnitudes of the sample means, sample standard deviations, and sample variances for the Diameters and Radii vectors. Explain how the statistics’ relative magnitudes relate to the relationship between the Diameters and Radii values.
mean(Radii) / mean(Diameters)
## [1] 0.5
sd(Radii) / sd(Diameters)
## [1] 0.5
var(Radii) / var(Diameters)
## [1] 0.25

The sample mean of Radii is half the sample mean of Diameters. The mean scales linearly, so dividing every value by 2 divides the mean by 2.

The sample standard deviation of Radii is also half that of Diameters, because every distance from the mean is scaled by the same factor of 1/2.

The sample variance of Radii is 1/4 that of Diameters, because the variance is the square of the standard deviation and \((1/2)^2 = 1/4\).
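
These ratios follow from how each statistic responds to rescaling. A short derivation, writing \(d_i\) for the diameters and \(r_i = d_i/2\) for the radii (symbols introduced here for illustration):

\[\bar{r} = \frac{1}{n}\sum_{i=1}^{n}\frac{d_i}{2} = \frac{\bar{d}}{2}\]

\[s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{d_i}{2}-\frac{\bar{d}}{2}\right)^2 = \frac{1}{4}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(d_i-\bar{d})^2 = \frac{s_d^2}{4}\]

\[s_r = \sqrt{\frac{s_d^2}{4}} = \frac{s_d}{2}\]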