*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK1 Autograde Quiz which will give you ~20 of your 40 accuracy points.

*50 points total: 40 points accuracy, and 10 points completion

Basics of Statistics and Summarizing Data Graphically (I)

Exercise 1. A number of individuals are interested in the proportion of citizens within a county who will vote to use tax money to upgrade a professional football stadium in the upcoming vote. Consider the following methods:

The Football Team Owner surveyed 10,000 people attending one of the football games held in the stadium. Seventy three percent (73%) of respondents said they supported the use of tax money to upgrade the stadium.

The Pollster generated 1,000 random numbers between 1 and 40,768 (the number of county voters in the last election) and surveyed the 1,000 citizens who corresponded to those numbers on the voting roll. Forty seven percent (47%) of respondents said they supported the use of tax money to upgrade the stadium.

  1. What is the population of interest? What is the parameter of interest? Will this parameter ever be calculated?

The population of interest is all eligible voters in the county. The parameter of interest is the actual proportion of those voters who will support using tax money to upgrade the football stadium. This parameter cannot be calculated ahead of the vote because surveying every eligible voter in the county is not feasible; it can only be estimated from sample data.

  1. What were the sample sizes used and statistics calculated from those samples? Are these simple random samples from the population of interest?

Football Team Owner: sample size 10,000 people; statistic 73% support. This is not a simple random sample because it is biased toward people who already attend football games and may be more likely to support stadium improvements.

Pollster: sample size 1,000 people; statistic 47% support. This is a simple random sample because individuals were randomly selected from the full list of county voters.

  1. The football team owner claims that the survey done at the football stadium will better predict the voting outcome because the sample size was much larger. What is your response?

The football team owner surveyed more people, but the respondents were football fans attending a game, so the surveyed group does not represent the county's voting population. The pollster selected 1,000 voters through random sampling, which produced results that better represent the county population even though the sample was smaller. A smaller unbiased sample is more reliable than a larger biased one, because increasing the sample size reduces random sampling error but does not remove bias. The pollster’s results are therefore more trustworthy for predicting the actual voting outcome.
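The point about bias versus sample size can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the share of football fans (30%) and the support rates among fans and non-fans (75% and 35%) are hypothetical values chosen for illustration, not figures from the exercise.

```r
set.seed(1)

N <- 40768                              # county voters (from the exercise)
fan <- rbinom(N, 1, 0.30) == 1          # hypothetical: about 30% are football fans
p_support <- ifelse(fan, 0.75, 0.35)    # hypothetical support rates by group
support <- rbinom(N, 1, p_support)      # 1 = supports the upgrade

true_p <- mean(support)                 # population parameter in this simulation

# Owner-style sample: 10,000 people drawn only from fans (biased)
owner_est <- mean(support[sample(which(fan), 10000)])

# Pollster-style sample: simple random sample of 1,000 from all voters
poll_est <- mean(support[sample(N, 1000)])

round(c(true = true_p, owner = owner_est, pollster = poll_est), 3)
```

Under these assumptions the larger sample lands far from the simulated population proportion, while the smaller random sample lands close to it.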

Exercise 2: After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects    Number of Disks
0                    41
1                    31
2                    15
3                    8
4                    5
  1. Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.

The recorded data are discrete quantitative: the number of defects on each disk is a count that takes only whole-number values (0, 1, 2, ...) rather than a measurement on a continuous scale.

  1. Code for a frequency histogram showing the frequency for number of errors on the 100 disks is given below.
x=rep(x=c(0,1,2,3,4), times=c(41,31,15,8,5))
hist(x, breaks=seq(-0.5,4.5,1), main="Number of defects")

bi. Knit the document and confirm that the histogram displays in the knitted file.

error.data=c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)), 
     xlab="Defects", main="Number of Defects", 
     labels=TRUE, ylim=c(0,50))

bii. Describe what the rep() function does in this code chunk.

The rep() function repeats values. The function rep(0,41) generates a sequence of 41 repeated occurrences of the number 0. The code uses this function to create a data set that includes 41 disks with 0 defects and 31 disks with 1 defect and continues in this pattern.
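A short sketch of rep() on a smaller input may help, followed by a check that the two constructions used in this exercise produce the same vector. The small input is made up for illustration.

```r
# Small illustration: repeat each value a given number of times
rep(c(0, 1, 2), times = c(2, 3, 1))
# → 0 0 1 1 1 2

# The two constructions used in this exercise build the same vector
x <- rep(x = c(0, 1, 2, 3, 4), times = c(41, 31, 15, 8, 5))
error.data <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
identical(x, error.data)   # TRUE
length(x)                  # 100
```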

biii. Describe how this breaks command affects the histogram’s appearance in this code chunk.

The breaks argument sets the bin edges of the histogram. Placing the edges at -0.5, 0.5, ..., 4.5 puts one break before and after each whole number, so each bar is centered on a defect count (0, 1, 2, 3, 4) and has width 1. This makes the histogram easier to interpret.
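To see the effect of the breaks without drawing anything, one option is to ask hist() for the binning only (plot = FALSE) and inspect the bin centers and counts. This is a minimal sketch using the Exercise 2 counts.

```r
x <- rep(c(0, 1, 2, 3, 4), times = c(41, 31, 15, 8, 5))

edges <- seq(-0.5, 4.5, by = 1)            # -0.5 0.5 1.5 2.5 3.5 4.5
h <- hist(x, breaks = edges, plot = FALSE) # compute bins without plotting

h$mids     # bin centers: 0 1 2 3 4
h$counts   # disks per bin: 41 31 15 8 5
```

Each bin center is a whole number, so every bar corresponds to exactly one defect count.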

biv. Describe how setting ylim=c(0,30) instead of ylim=c(0,50) would change the histogram’s appearance. Which value for ylim is preferable for clear communication of the data?

Setting ylim=c(0,30) would cut off the tallest bars, because the bars for 0 and 1 defects (41 and 31 disks) exceed 30; the result would be misleading and hard to read. ylim=c(0,50) is preferable because every bar fits within the plot area.

Exercise 3. There are 12 numbers in a sample, and the mean is \(\bar{x}=27\). The minimum of the sample is accidentally changed from 13.8 to 1.38.

  1. Is it possible to determine the direction in which (increase/decrease) the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not?

We can determine both the direction and the magnitude of the change in the mean.

Original mean = 27, sample size = 12
Original total: 12*27 = 324
New total: 324 - 13.8 + 1.38 = 311.58
New mean: 311.58/12 = 25.965
Change in mean: 27 - 25.965 = 1.035

Hence, the mean decreases by 1.035.
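As a cross-check, the same result follows without computing the totals, since changing a single value shifts the mean by that change divided by the sample size:

\[ \Delta\bar{x} = \frac{1.38 - 13.8}{12} = \frac{-12.42}{12} = -1.035 \]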

  1. Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

Yes. The median does not change. In a sample of 12 numbers, the median is the average of the 6th and 7th values after sorting. The altered value was the minimum (13.8) and is still the minimum after the change (1.38), so the order of the other 11 values, including the 6th and 7th, is unaffected. We cannot state the median's value without the full data set, but we know the change is 0.

  1. Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.

Yes, we can predict the direction of change. Replacing 13.8 with 1.38 moves a value that was already below the mean (27) even farther from it, which increases the spread of the data. Hence, the standard deviation increases. We cannot determine the exact amount of change because the standard deviation depends on the deviations of all 12 values from the mean, and we do not know the other 11 values. The size of the change depends on how spread out the rest of the data are: if the other values are tightly clustered, the altered minimum will inflate the standard deviation much more than if they are already widely dispersed.
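The three answers in this exercise can be checked on a concrete data set. The sample below is hypothetical: the exercise does not give the 12 values, so these were made up to have a minimum of 13.8 and a mean of 27.

```r
# Hypothetical sample: 12 values, minimum 13.8, mean 27
x <- c(13.8, 20, 22, 24, 25, 26, 28, 29, 30, 32, 34, 40.2)
y <- replace(x, which.min(x), 1.38)   # minimum mistyped as 1.38

mean(x) - mean(y)      # 1.035 (up to floating-point rounding)
median(x); median(y)   # both 27: the 6th and 7th values are untouched
sd(x); sd(y)           # sd(y) is larger
```

The drop in the mean (1.035) and the unchanged median hold for any such sample; the size of the increase in the standard deviation is specific to these made-up values.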

Exercise 4: Recall the computer disk error data used in Exercise 2. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects    Number of Disks
0                    42
1                    30
2                    16
3                    7
4                    5

A frequency histogram showing the frequency for number of errors on the 100 disks is given below.

error.data=c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)), xlab="Defects", main="Number of Defects", labels=TRUE, ylim=c(0,60))

  1. What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?

The histogram is right-skewed (positively skewed). This makes sense because (1) the number of defects can't go below 0, and (2) most disks have few defects, with fewer and fewer disks having more, so we see a tall bar on the left and gradually shorter bars to the right.

  1. Calculate the mean and median number of errors detected on the 100 disks ‘by hand’ and using the built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use the text such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly].

Mean by hand: \(\bar{x}=\frac{42(0)+30(1)+16(2)+7(3)+5(4)}{100}=\frac{103}{100}=1.03\)

Median by hand: there are 100 values, so the median is the average of the 50th and 51st values. The first 42 values are 0 and the next 30 values (the 43rd through 72nd) are 1, so the 50th and 51st values are both 1. Therefore, the median = 1.

 mean(error.data)
## [1] 1.03
median(error.data)
## [1] 1

The mean (1.03) is greater than the median (1), which is consistent with the right-skewed shape.
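The by-hand steps can also be written directly from the frequency table, without expanding it into 100 values. This is a sketch; the variable names `defects` and `disks` are chosen here for illustration.

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)

# Mean: weighted sum of defect counts divided by the number of disks
xbar <- sum(defects * disks) / sum(disks)   # 103 / 100 = 1.03

# Median: cumulative counts show where the 50th and 51st values fall
cum <- cumsum(disks)                        # 42 72 88 95 100
v50 <- defects[which(cum >= 50)[1]]         # defect count of the 50th value
v51 <- defects[which(cum >= 51)[1]]         # defect count of the 51st value
med <- (v50 + v51) / 2                      # 1

c(mean = xbar, median = med)
```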

  1. Calculate the sample standard deviation “by hand” and using the built-in R function. Are the values consistent between the two methods? How would our calculation differ if instead we considered these 100 values the whole population?

Standard deviation by hand: \[ s=\sqrt{\frac{42(0-1.03)^2+30(1-1.03)^2+16(2-1.03)^2+7(3-1.03)^2+5(4-1.03)^2}{99}}=1.1499 \]

 sd(error.data)
## [1] 1.149923

Yes, the results match. If these 100 values were the whole population, we would divide the sum of squared deviations by 100 instead of 99, which would produce a slightly smaller result.
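The difference between the two divisors can be sketched from the frequency table. Base R's sd() always uses the n - 1 divisor, so the population version is computed manually here; the variable names are chosen for illustration.

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)

n    <- sum(disks)                          # 100
xbar <- sum(defects * disks) / n            # 1.03
ss   <- sum(disks * (defects - xbar)^2)     # sum of squared deviations, 130.91

s_sample <- sqrt(ss / (n - 1))   # divides by 99; matches sd()
s_pop    <- sqrt(ss / n)         # divides by 100; slightly smaller

c(sample = s_sample, population = s_pop)
```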

  1. Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data identified in part (a) can be seen from the boxplot.
 boxplot(error.data, horizontal=TRUE, xlab="Number of Defects")

The right skew is visible in the boxplot: the upper whisker (extending toward 4) is longer, most of the data are clustered on the left side (Q1 = 0, median = 1), and the distance from Q3 to the maximum is greater than the distance from the minimum to Q1. This matches the histogram’s shape.
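The numbers behind the boxplot can be inspected with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. A minimal sketch using the Exercise 4 counts:

```r
error.data <- rep(0:4, times = c(42, 30, 16, 7, 5))

f <- fivenum(error.data)   # min, lower hinge, median, upper hinge, max
f                          # 0 0 1 2 4

f[2] - f[1]                # minimum to lower hinge: 0 (no lower whisker)
f[5] - f[4]                # upper hinge to maximum: 2 (long upper whisker)
```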

  1. Explain why the histogram is better able to show the discrete nature of the data than a boxplot.

The histogram shows each specific defect count as its own bar, along with the exact number of disks in each defect category. A boxplot presents only summary values (minimum, quartiles, median, maximum) and does not display the individual data points, which makes the whole-number defect counts less apparent. Therefore, the histogram makes the discrete nature of the data obvious.

Exercise 5: A company that manufactures toilets claims that its new pressure-assisted toilet reduces the average amount of water used by more than 0.5 gallons per flush when compared to its current model. The company selects 20 toilets of the current type and 19 of the new type and measures the amount of water used when each toilet is flushed once. The number of gallons measured for each flush is recorded below. The measurements are also given in flush.csv.

Current Model: 1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54, 1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56

New Model: 1.28, 1.19, 0.90, 1.24, 1.00, 0.80, 0.71, 1.03, 1.27, 1.14, 1.36, 0.91, 1.09, 1.36, 0.91, 0.91, 0.86, 0.93, 1.36

  1. Use R to create histograms to display the sample data from each model (any kind of histogram that you want since sample sizes are similar). Have identical x and y axis scales so the two groups’ values are more easily compared. Include useful titles.
 # Step 1: Read the CSV file
flush_data <- read.csv("flush.csv", header = TRUE)

# Step 2: Check what's inside
head(flush_data)
##     Model gallons
## 1 Current    1.63
## 2 Current    1.25
## 3 Current    1.23
## 4 Current    1.49
## 5 Current    2.11
## 6 Current    1.48
str(flush_data)
## 'data.frame':    39 obs. of  2 variables:
##  $ Model  : chr  "Current" "Current" "Current" "Current" ...
##  $ gallons: num  1.63 1.25 1.23 1.49 2.11 1.48 1.94 1.72 1.85 1.54 ...
 # Split the data into two groups
Curr_df <- subset(flush_data, Model == "Current")
New_df <- subset(flush_data, Model == "New")

# Extract gallon values
Curr <- Curr_df$gallons
New <- New_df$gallons
 par(mfrow = c(1,2))
hist(Curr, breaks=seq(0.5,2.5,0.2), main="Current Model", xlab="Gallons", ylim=c(0,8))
hist(New, breaks=seq(0.5,2.5,0.2), main="New Model", xlab="Gallons", ylim=c(0,8))

par(mfrow = c(1,1))
  1. Compare the shapes of the distributions of amount of water used by the two models observed in the sample.

The Current model's distribution is roughly symmetric and unimodal, centered near 1.6 gallons. The New model's distribution is also roughly symmetric and unimodal, but it is centered near 1.0 gallons and is more compact than the Current model's.

  1. Compute the mean and median gallons flushed for the Current and New Model toilets using the built-in R function. Compare both measures of center within each group and comment on how that relationship corresponds to the data’s shapes. Also compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.
 mean(Curr); median(Curr)
## [1] 1.5765
## [1] 1.595
mean(New); median(New)
## [1] 1.065789
## [1] 1.03

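Current: mean = 1.5765, median = 1.595. New: mean = 1.0658, median = 1.03. Within each group the mean and median are nearly equal, which is consistent with the roughly symmetric shapes. Across groups, both measures of center are roughly half a gallon lower for the New model, so it uses less water on average; this shows in the histograms as the New model's bars sitting farther to the left.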

  1. Compute (using built-in R function) and compare the sample standard deviation of gallons flushed by the current and new model toilets. Comment on how the relative size of these values can be identified from the histograms.
 sd(Curr)
## [1] 0.2456843
sd(New)
## [1] 0.2058941

Current: SD = 0.2457. New: SD = 0.2059. The New model has lower variability, which matches its more tightly clustered histogram bars.

  1. Use R to create side-by-side boxplots of the two sets in R so they are easily comparable.
 boxplot(Curr, New, names=c("Current", "New"),
        ylab="Water Flushed (Gallons)",
        main="Toilet Water Flushed by Model")

  1. Explain why there are no values shown as a dot on the Current Model flush boxplot. To what values do the Current model flush boxplot whiskers extend? (Use R for your boxplot calculations and type=2 for quantiles)
quantile(Curr, type=2)
##    0%   25%   50%   75%  100% 
## 1.230 1.390 1.595 1.730 2.110
IQR(Curr, type=2)
## [1] 0.34
sort(Curr)
##  [1] 1.23 1.23 1.25 1.25 1.32 1.46 1.48 1.49 1.54 1.56 1.63 1.63 1.67 1.67 1.72
## [16] 1.74 1.76 1.85 1.94 2.11

Q1 = 1.39, Q3 = 1.73, IQR = Q3 - Q1 = 0.34
Lower bound = Q1 - 1.5(IQR) = 1.39 - 0.51 = 0.88
Upper bound = Q3 + 1.5(IQR) = 1.73 + 0.51 = 2.24

All values in the sorted Current data fall between 1.23 and 2.11, which is inside the bounds (0.88, 2.24). So there are no outliers, which is why no dots appear on the boxplot, and the whiskers extend to the minimum (1.23) and the maximum (2.11).
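The outlier check above can be sketched in R using the same type=2 quantiles the exercise asks for. The data are typed in from the problem statement so the block does not depend on flush.csv; the names `lower`, `upper`, and `inside` are chosen for illustration.

```r
Curr <- c(1.63, 1.25, 1.23, 1.49, 2.11, 1.48, 1.94, 1.72, 1.85, 1.54,
          1.67, 1.76, 1.46, 1.32, 1.23, 1.67, 1.74, 1.63, 1.25, 1.56)

q   <- quantile(Curr, probs = c(0.25, 0.75), type = 2)   # Q1 and Q3
iqr <- unname(q[2] - q[1])                                # 0.34

lower <- unname(q[1]) - 1.5 * iqr    # lower bound, 0.88
upper <- unname(q[2]) + 1.5 * iqr    # upper bound, 2.24

# Whiskers reach the most extreme observations inside the bounds
inside <- Curr[Curr >= lower & Curr <= upper]
range(inside)                         # 1.23 2.11

any(Curr < lower | Curr > upper)      # FALSE: no points drawn as dots
```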

  1. What would be the mean and median gallons flushed if we combined the two data sets into one large data set with 39 observations? Show how the mean can be calculated from the summary measures in part (c) along with the sample sizes and explain why the median of the combined set cannot be computed based on (c).

The data set combines the 20 Current Model values with the 19 New Model values:

combined <- c(Curr, New)
mean(combined)
## [1] 1.327692
median(combined)
## [1] 1.28

Mean calculation: \(\bar{x}=\frac{(20*1.5765)+(19*1.0658)}{39}=1.3277\)

The median depends on the ordering of all individual values, not just the medians of each group. We cannot determine the combined median by averaging the two medians because we do not know how the individual values from the Current and New groups overlap.

Instead, we must compute it directly:

median(combined)
## [1] 1.28

The combined mean is 1.3277 and the combined median is 1.28, but only the mean can be calculated from the group summaries.
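Both halves of this answer can be sketched in R. The first part recomputes the combined mean from the group means and sample sizes in part (c). The second part uses small hypothetical groups (made up for illustration, not taken from flush.csv) that share the same group medians yet give different combined medians.

```r
# Combined mean from group means and sample sizes (values from part (c))
n_curr <- 20; mean_curr <- 1.5765
n_new  <- 19; mean_new  <- 1.065789
combined_mean <- (n_curr * mean_curr + n_new * mean_new) / (n_curr + n_new)
combined_mean                 # about 1.3277

# Hypothetical groups: same group medians (2 and 5) in both scenarios
a1 <- c(1, 2, 3);   b1 <- c(4, 5, 6)
a2 <- c(1, 2, 100); b2 <- c(4, 5, 6)

median(c(a1, b1))             # 3.5
median(c(a2, b2))             # 4.5
```

Because the two scenarios have identical group medians and sizes but different combined medians, the group medians alone cannot determine the combined median.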