*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or use manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK2 Autograde Quiz which will give you ~20 of your 40 accuracy points.

*50 points total: 40 points accuracy, and 10 points completion

Basics of Statistics and Summarizing Data Numerically and Graphically (I)

Exercise 1. There are 12 numbers in a sample, and the mean is \(\bar{x}=27\). The minimum of the sample is accidentally changed from 13.8 to 1.38.

  1. Is it possible to determine the direction (increase/decrease) in which the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not? How do you know?

Yes. The mean will decrease, because one value got smaller and the other 11 values are unchanged. The sum of the sample drops by 13.8 - 1.38 = 12.42, so the mean drops by 12.42/12 = 1.035, from 27 to 25.965. The check below uses an illustrative sample with mean 27 and minimum 13.8, since the exercise does not give the actual data.

samp<-c(13.8,25,25,26,27,27,27,27,28,29,29,40.2)
samp2<-c(1.38,25,25,26,27,27,27,27,28,29,29,40.2)
mean(samp)
## [1] 27
mean(samp2)
## [1] 25.965
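The same result follows from the summary values alone, without any data. A minimal sketch in R (the variable names are illustrative and not part of the assignment):

```r
n        <- 12     # sample size
xbar_old <- 27     # mean before the mistake
old_min  <- 13.8   # minimum before the mistake
new_min  <- 1.38   # minimum after the mistake

# Only one value changes, so the sum changes by (new_min - old_min)
delta_mean <- (new_min - old_min) / n
xbar_new   <- xbar_old + delta_mean

delta_mean   # change in the mean
xbar_new     # mean after the mistake
```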
  1. Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not? How do you know?

The median does not change. The altered value was the minimum and is still the minimum, so the ordering of the data is the same and the two middle values (the 6th and 7th of 12) are unchanged. The median is resistant to changes in extreme values.

  1. Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? How do you know?

The standard deviation will increase. The changed value was already below the mean and moves even farther below it, so the data are more spread out around the new mean. How much the standard deviation changes cannot be determined from the information given, because that requires the original standard deviation or the full data set. The check below uses the illustrative sample samp defined above.

sd(samp)
## [1] 5.772033
sd(samp2)
## [1] 8.716597
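The direction of the change can also be confirmed from the summary values alone. The sketch below relies on the identity that the sum of squared deviations equals sum(x^2) - n*xbar^2; the variable names are illustrative. It suggests that the sum of squared deviations grows by a fixed positive amount, while the new standard deviation itself still depends on the original one, which the exercise does not give.

```r
n        <- 12
old_min  <- 13.8
new_min  <- 1.38
xbar_old <- 27
xbar_new <- xbar_old + (new_min - old_min) / n

# Change in the sum of squared deviations, using SS = sum(x^2) - n * xbar^2.
# Only one x changes, so only one squared term and the mean term differ.
ss_change <- (new_min^2 - old_min^2) - n * (xbar_new^2 - xbar_old^2)
ss_change   # positive, so the variance and SD must increase

# Cross-check against the illustrative sample used in this exercise
samp  <- c(13.8, 25, 25, 26, 27, 27, 27, 27, 28, 29, 29, 40.2)
samp2 <- c(1.38, 25, 25, 26, 27, 27, 27, 27, 28, 29, 29, 40.2)
ss_check <- (n - 1) * (var(samp2) - var(samp))
ss_check
```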

Exercise 2: Recall the computer disk error data given in HWK 1. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects Number of Disks
0 42
1 30
2 16
3 7
4 5

A frequency histogram showing the frequency for number of errors on the 100 disks is given below.

error.data=c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4, 5))
hist(error.data, 
     breaks=c(seq(from=-0.5, 4.5, by=1)), 
     xlab="Defects", main="Number of Defects", 
     labels=TRUE, ylim=c(0,60))

  1. What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?

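This is a right-skewed histogram: most disks have 0 or 1 defects, and the counts fall off toward 4. This makes sense because a production process should usually produce disks without errors, and the number of defects cannot go below 0, so the tail can only extend to the right.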

  1. Calculate the mean and median number of errors detected on the 100 disks ‘by hand’ and using the built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use the text such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly].

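\(\bar{x}=\frac{(0)(42)+(1)(30)+(2)(16)+(3)(7)+(4)(5)}{42+30+16+7+5}=\frac{103}{100}=1.03\)

Median: with 100 values, the median is the average of the 50th and 51st ordered values. The first 42 ordered values are 0 and the 43rd through 72nd are 1, so both middle values are 1 and the median is 1.

The mean (1.03) is slightly larger than the median (1), which is consistent with the right skew.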

mean(error.data)
## [1] 1.03
median(error.data)
## [1] 1
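The by-hand steps can be mirrored in R directly from the frequency table. This is a sketch with illustrative variable names (defects, disks); it is not the only way to do it.

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)

# Mean: total number of defects divided by total number of disks
xbar <- sum(defects * disks) / sum(disks)

# Median: locate the 50th and 51st ordered values using cumulative counts
cum_disks <- cumsum(disks)                      # 42 72 88 95 100
middle <- sapply(c(50, 51), function(p) defects[which(cum_disks >= p)[1]])
med <- mean(middle)

xbar
med
```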
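  1. Calculate the sample standard deviation “by hand” and using the built-in R function. Are the values consistent between the two methods? How would our calculation differ if instead we considered these 100 values the whole population? Hint: use multiplication instead of repeated addition.

\(s=\sqrt{\frac{(0-1.03)^2(42)+(1-1.03)^2(30)+(2-1.03)^2(16)+(3-1.03)^2(7)+(4-1.03)^2(5)}{100-1}}=\sqrt{\frac{130.91}{99}}\approx 1.150\)

This agrees with sd() in R. If the 100 values were the whole population, we would divide by 100 instead of 99: \(\sigma=\sqrt{\frac{130.91}{100}}\approx 1.144\).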

sd(error.data)
## [1] 1.149923
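The difference between the sample and population formulas can be checked from the frequency table. A minimal sketch, with illustrative variable names:

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)
n       <- sum(disks)
xbar    <- sum(defects * disks) / n

# Sum of squared deviations, using multiplication instead of repeated addition
ss <- sum(disks * (defects - xbar)^2)

s_sample <- sqrt(ss / (n - 1))   # sample SD, the formula sd() uses
s_pop    <- sqrt(ss / n)         # population SD

s_sample
s_pop
```

The two values differ only in the divisor, so with n = 100 they are close.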
  1. Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data identified in (a) can be seen from the boxplot.
boxplot(error.data, horizontal = TRUE, main = "Errors per disk", xlab = "Errors")
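One way to read the shape from the boxplot is through the five-number summary that it draws. The sketch below uses base R's fivenum(). The minimum and the lower hinge are both 0, so there is no lower whisker, while the upper whisker runs from 2 out to 4; that lopsidedness points to right skew.

```r
error.data <- c(rep(0, 42), rep(1, 30), rep(2, 16), rep(3, 7), rep(4, 5))

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(error.data)
five
```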

  1. Describe why the histogram is better able to show the discrete nature of the data than a boxplot.

The boxplot is not able to show how many disks are in each error amount. The histogram is able to better show this distribution, because it shows frequency in each error amount

Exercise 3: A company that manufactures toilets claims that its new pressure-assisted toilet reduces the average amount of water used by more than 0.5 gallons per flush when compared to its current model. The company selects 20 toilets of the current type and 19 of the new type and measures the amount of water used when each toilet is flushed once. The number of gallons measured for each flush is recorded below. The measurements are also given in flush.csv.

Current Model: 1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56

New Model: 1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36

  1. Use R to create histograms to display the sample data from each model (any kind of histogram that you want since sample sizes are similar). Have identical x and y axis scales so the two groups’ values are more easily compared. Include useful titles.
currentModel<-c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)
newModel<-c(1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)

hist(currentModel, breaks=seq(0.4,2.4,0.1), ylim=c(0,5), xlim=c(0.5,2.5), main="Gallons per Flush on the Current Toilet Model",xlab="Gallons per Flush")

hist(newModel, breaks=seq(0.4,2.4,0.1), ylim=c(0,5), xlim=c(0.5,2.5), main="Gallons per Flush on the New Toilet Model", xlab="Gallons per Flush")

  1. Compare the shape of the gallons flushed from the two models of toilets samples.

The current model’s distribution is more spread out and centered at a higher value than the new model’s. The current model’s largest value, 2.11, is separated from the rest of the data by an empty bin, but it is not extreme enough to count as a boxplot outlier. The new model’s distribution is tighter.

  1. Compute the mean and median gallons flushed for the Current and New Model toilets using the built-in R function. Compare both measures of center within each group and comment on how that relationship corresponds to the datas’ shapes. Also compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.
mean(currentModel)
## [1] 1.5765
mean(newModel)
## [1] 1.065789
median(currentModel)
## [1] 1.595
median(newModel)
## [1] 1.03

For the current model, the mean (1.5765) and median (1.595) are close, with the mean slightly below the median, so the distribution is roughly symmetric and the high value of 2.11 does not pull the mean upward much. For the new model, the mean (1.066) is slightly above the median (1.03), which is consistent with the values being a little more stretched out on the high side. Across the two groups, both the mean and the median of the new model are roughly 0.5 to 0.6 gallons lower than those of the current model. This is consistent with the histograms, where the new model’s values sit to the left of the current model’s, showing lower water use for the new model.

  1. Compute (using built-in R function) and compare the sample standard deviation of gallons flushed by the current and new model toilets. Comment on how the relative size of these values can be identified from the histograms.
sd(currentModel)
## [1] 0.2456843
sd(newModel)
## [1] 0.2058941

The sample standard deviations show that the current model’s spread (0.246) is greater than the new model’s (0.206). This is consistent with the current model’s histogram covering a wider range of values than the new model’s histogram.

  1. Use R to create side-by-side boxplots of the two sets in R so they are easy to compare.
boxplot(currentModel,newModel, names = c("Current Model","New Model"), main = "Current vs New model of Toilet", ylab="gallons/flush")

  1. Explain why there are no values shown as a dot (outlier) on the Current Model flush boxplot. To what values do the Current model flush boxplot whiskers extend? (Use R for your boxplot calculations and type=2 for quantiles)
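q<-quantile(currentModel, c(.25,.75), type=2)
q+c(-1.5,1.5)*diff(q)
##  25%  75% 
## 0.88 2.24

A point is drawn as an outlier only if it lies below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). With type=2 quantiles, Q1 = 1.39 and Q3 = 1.73, so IQR = 0.34 and the fences are 0.88 and 2.24. Every current model value lies between the fences, so no outliers are shown. The whiskers extend to the most extreme data points inside the fences, which here are the minimum, 1.23, and the maximum, 2.11.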
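As a cross-check, base R's boxplot.stats() reports the whisker ends and any points that boxplot() would draw as dots. Note that it uses the hinges from fivenum() rather than type=2 quantiles; for these 20 values the two give the same quartiles. A short sketch:

```r
currentModel <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54,
                  1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)

bs <- boxplot.stats(currentModel)   # default coef = 1.5, same rule as boxplot()

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the fences (empty if there are no outliers)
```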

  1. What would be the mean and median gallons flushed if we combined the two data sets into one large data set with 39 observations? Show how the mean can be calculated using R and then from the summary measures in part (c) along with the sample sizes. Explain why the median of the combined set cannot be computed based on the summaries in part (c).
both<-c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56,1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)
mean(both)
## [1] 1.327692
(mean(currentModel)*20+mean(newModel)*19)/39
## [1] 1.327692

Even with different sample sizes, we can find the mean of the combined data from the means in part (c) by weighting each mean by its sample size. The median, however, depends on the ordering of the individual data points, so the median of the combined data cannot be found from the two group medians.
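The combined median has to come from the pooled data. The sketch below computes it and, for contrast, shows that weighting the two group medians by sample size (a shortcut that works for means) does not reproduce it. The name weighted_medians is illustrative.

```r
currentModel <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54,
                  1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)
newModel <- c(1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14,
              1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36)

both <- c(currentModel, newModel)

# Median of the 39 pooled values (the 20th ordered value)
med_both <- median(both)

# Weighting the group medians by sample size gives a different number
weighted_medians <- (median(currentModel) * 20 + median(newModel) * 19) / 39

med_both
weighted_medians
```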

Exercise 4: An elementary school surveys its families and tabulates the number of children reported in each household. A frequency histogram summarizes the data received:

Children=c(rep(1, 47), rep(2, 70), rep(3, 45), rep(4, 23), rep(5,11), rep(6,5), rep(7,2))
hist(Children, 
     breaks=seq(0.5, 7.5, 1), 
     labels=TRUE, 
     ylim=c(0,80),
     main="Number of Children in Household",
     ylab="Number of Households")

  1. Consider a randomly chosen household, Household A. Identify whether the events “Household A has 1 Child” and “Household A has 2 Children” are (i) independent, (ii) mutually exclusive, (iii) both independent and mutually exclusive or (iv) neither independent nor mutually exclusive. Explain how you know.

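These events are (ii) mutually exclusive but not independent. A household reports exactly one number of children, so Household A cannot have both 1 child and 2 children, and P(1 child and 2 children) = 0. For independence we would need P(1 child and 2 children) = P(1 child) × P(2 children) = (47/203)(70/203), which is not 0. Knowing that one event occurred tells us the other did not, so the events are dependent.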
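Both properties can be checked numerically from the survey counts. A minimal sketch, with illustrative variable names: events are mutually exclusive when the joint probability is 0, and independent when the joint probability equals the product of the individual probabilities.

```r
Children <- c(rep(1, 47), rep(2, 70), rep(3, 45), rep(4, 23),
              rep(5, 11), rep(6, 5), rep(7, 2))

p_one  <- mean(Children == 1)                    # P(1 child)
p_two  <- mean(Children == 2)                    # P(2 children)
p_both <- mean(Children == 1 & Children == 2)    # P(both at once)

mutually_exclusive <- p_both == 0
independent        <- isTRUE(all.equal(p_both, p_one * p_two))

mutually_exclusive
independent
```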

  1. Suppose the principal chooses a random family from those at the school to call each day and each family is equally likely to be chosen on the first day of school. What is the probability that the family has more than two (2) children?

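Households with more than two children have 3, 4, 5, 6, or 7 children, and each of the 203 households is equally likely to be chosen: (45+23+11+5+2)/203 = 86/203 ≈ 0.424.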
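The same count can be read off the data vector in R. A short sketch (the name p_more_than_two is illustrative):

```r
Children <- c(rep(1, 47), rep(2, 70), rep(3, 45), rep(4, 23),
              rep(5, 11), rep(6, 5), rep(7, 2))

n_more_than_two <- sum(Children > 2)                      # households with 3+ children
p_more_than_two <- n_more_than_two / length(Children)     # equally likely households

n_more_than_two
p_more_than_two
```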

  1. Suppose the principal randomly chooses a family to call from those at the school that they have not already called. What is the probability that all of the families called the first 5 days of school had a single (1) child in the household? Is this a highly likely or unlikely event?

Assuming the principal calls one family per day and never repeats a family, the calls are made without replacement, so the probability is (47/203)(46/202)(45/201)(44/200)(43/199) ≈ 0.00056. This is a very unlikely event.
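Because the principal only calls families that have not already been called, the five draws are made without replacement. The sketch below computes that probability two equivalent ways and, for comparison, the slightly larger value that would result if families could be called again. Variable names are illustrative.

```r
# Sequential product: 47/203 * 46/202 * 45/201 * 44/200 * 43/199
p_seq <- prod((47:43) / (203:199))

# Same probability by counting: choose 5 of the 47 one-child families
# out of all ways to choose 5 of the 203 families
p_comb <- choose(47, 5) / choose(203, 5)

# If families could be called again (with replacement), for comparison
p_with_replacement <- (47 / 203)^5

p_seq
p_comb
p_with_replacement
```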