Stat 371 Homework #2

Summarizing Data Numerically and Graphically

Exercise 1. There are \(n = 12\) numbers in a sample, and the mean is \(\bar{x} = 24\). The minimum of the sample is accidentally changed from 11.9 to 1.19.

Is it possible to determine the direction (increase/decrease) in which the mean \(\bar{x}\) changes? And how much the mean changes? If so, by how much does it change? If not, why not?

24*12

## [1] 288

288-11.9

## [1] 276.1

276.1+1.19

## [1] 277.29

277.29/12

## [1] 23.1075
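As a cross-check on the steps above, the change in the mean can also be computed directly: replacing one value shifts the sum by the difference between the new and old values, so the mean shifts by that difference divided by the sample size. A minimal sketch (the variable names are illustrative and not part of the assignment):

```r
n        <- 12     # sample size
old_mean <- 24     # reported sample mean
old_min  <- 11.9   # minimum before the accidental change
new_min  <- 1.19   # minimum after the accidental change

# Replacing one value changes the sum by (new - old), so the mean moves by that amount / n
shift    <- (new_min - old_min) / n
new_mean <- old_mean + shift

shift      # negative, so the mean decreases
new_mean
```

The sign of `shift` gives the direction of the change and its size gives the amount, without needing any of the other 11 values.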

The sample mean will decrease because the sample minimum has decreased, which lowers the sum of the observations. The amount can be determined: the original sum is the mean times the sample size (24 × 12 = 288), so removing the old minimum, adding the new one, and dividing by the same sample size of 12 gives a new mean of 23.1075, a decrease of 0.8925.

b. Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

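The median does not change. The altered value was the smallest observation before the change and is still the smallest afterward, so the ordering of the other values is the same. With \(n = 12\), the median is the average of the 6th and 7th ordered values, and neither of them is affected. The value of the median itself cannot be determined from the information given, but its change is 0.

c. Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.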

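The standard deviation will increase because the spread increases. Whenever the maximum or minimum is moved farther out, the standard deviation increases, because that value moves away from the mean by more than the mean itself shifts. It is difficult to predict the size of the increase because the standard deviation depends on the squared deviations of all 12 values, and only the mean and the minimum are known.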
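To illustrate why the direction is predictable but the amount is not, here is a small sketch using two made-up samples. Both are hypothetical (they are not the homework data); each has 12 values, a mean of 24, and a minimum of 11.9, and the helper `sd_change` is an illustrative name.

```r
# Two hypothetical samples: n = 12, mean = 24, minimum = 11.9 in both
tight  <- c(11.9, rep(25.1, 11))
spread <- c(11.9, 12, 13, 14, 16, 20, 24, 28, 32, 36, 40, 41.1)

# Replace the minimum with 1.19 and report the SD before and after
sd_change <- function(x) {
  y <- x
  y[which.min(y)] <- 1.19
  c(before = sd(x), after = sd(y), increase = sd(y) - sd(x))
}

change_tight  <- sd_change(tight)
change_spread <- sd_change(spread)

change_tight
change_spread
```

The standard deviation rises in both samples, but by different amounts, even though the mean and minimum are identical.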

Exercise 2. Recall the computer disk error data used in Homework 1. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects	Number of Disks
0	41
1	31
2	15
3	8
4	5

A frequency histogram (without the typo from Homework 1) showing the frequency for number of errors on the 100 disks is given below.

error.data <- c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4, 5))

hist(error.data, breaks=c(seq(from = -0.5, 4.5, by = 1)),
     xlab = "Defects", main = "Number of Defects", labels = TRUE,
     ylim = c(0, 50))

What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?

The shape of the histogram is right skewed. This makes sense because it is more likely that disks are produced with no flaws or minimal flaws. In order to stay in business, those making the disks must produce a certain number of functional disks to make a profit.

b. Calculate the mean and median number of errors detected on the 100 disks ‘by hand’ and using the built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape?

mean(error.data)

## [1] 1.05

median(error.data)

## [1] 1

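By hand:

Mean: (0)(41) + (1)(31) + (2)(15) + (3)(8) + (4)(5) = 0 + 31 + 30 + 24 + 20 = 105, and 105 divided by 100 is 1.05.

Median: with 100 values, the median is the average of the 50th and 51st ordered values. Both of these are 1, so the median is 1.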
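The by-hand arithmetic can be mirrored in R by working from the frequency table instead of the expanded data vector. This is only a sketch; the names `defects`, `disks`, and `value_at` are chosen here for illustration.

```r
defects <- 0:4
disks   <- c(41, 31, 15, 8, 5)

# Mean: total number of defects divided by the number of disks
hand_mean <- sum(defects * disks) / sum(disks)

# Median: with n = 100, average the 50th and 51st ordered values.
# cumsum(disks) gives the last ordered position occupied by each defect count.
last_pos    <- cumsum(disks)
value_at    <- function(k) defects[which(last_pos >= k)[1]]
hand_median <- (value_at(50) + value_at(51)) / 2

hand_mean
hand_median
```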

The mean value is slightly larger than the median value. This makes sense because the data is skewed right, meaning the mean would be influenced to be larger due to some disks having 3 or 4 defects.

c. Calculate the sample standard deviation “by hand” and using the built-in R function. Are the values consistent between the two methods?

sd(error.data)

## [1] 1.157976

By hand, the squared deviations from the mean of 1.05 are multiplied by the number of disks at each value:

all zeros: 41 × (0 − 1.05)² = 45.2025

all ones: 31 × (1 − 1.05)² = 0.0775

all twos: 15 × (2 − 1.05)² = 13.5375

all threes: 8 × (3 − 1.05)² = 30.42

all fours: 5 × (4 − 1.05)² = 43.5125

Add all those numbers up to get 132.75. Divide that by n − 1 = 99 to get approximately 1.3409. Take the square root to get approximately 1.158.

The values are consistent between the two methods. Note that the sample standard deviation divides by n − 1, not n; dividing by 100 instead would give about 1.152, which differs from the R value by about 0.006.
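As a check on the by-hand steps, the same calculation can be written out from the frequency table. The sketch below assumes the sample standard deviation with an n − 1 divisor, which is the definition R's `sd()` uses; the variable names are illustrative.

```r
defects <- 0:4
disks   <- c(41, 31, 15, 8, 5)
n       <- sum(disks)

xbar <- sum(defects * disks) / n
ss   <- sum(disks * (defects - xbar)^2)   # sum of squared deviations

sd_sample      <- sqrt(ss / (n - 1))  # sample SD, matches sd()
sd_divide_by_n <- sqrt(ss / n)        # dividing by n gives a slightly smaller value

ss
sd_sample
sd_divide_by_n
```

Comparing `sd_sample` with `sd(rep(defects, disks))` confirms that the two methods agree when the n − 1 divisor is used.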

Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data identified in (a) can be seen from the boxplot.

boxplot(error.data, main="Boxplot of Disk Errors in Production", ylab="Number of Errors")

The right skew can be seen in the box plot because the majority of the data is between zero and two, while the upper whisker extends to four errors, a value that only a minority of the disks reach. Because of this, the tail of the data is on the side with the highest number of errors (the right side of the histogram, the top of the box plot).

e. Explain why the histogram is better able to show the discrete nature of the data than a boxplot.

The histogram better displays the discrete nature of the data because the bars clearly show that the number of errors can only take whole-number values. The box plot shows the range of the data as well as where the majority of it is located, but does not show how many disks fall at each exact value.

Exercise 3. A certain reaction was run several times using each of two catalysts, A and B. The catalysts are supposed to control the yield of an undesirable side product. Results, in units of percentage yield, for 25 runs of catalyst A and 23 runs of catalyst B are given below and also in Catalysts.csv.

Catalyst A: 4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1, 3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5

Catalyst B: 3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1, 3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2

Use R to create a histogram for the percentage yield of the undesirable side product for the two catalysts (any kind of histogram that you want since sample sizes are similar). Have identical x and y axis scales so the two groups’ values are more easily compared. Include useful titles.

CatalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1, 3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)
# Same breaks and the same ylim in both calls keep the x and y axis scales identical
hist(CatalystA, main = "% Yield of Undesired Side Products for Catalyst A", ylab = "Frequency", xlab = "Percentage yield (bins of width 0.5)", breaks = seq(0, 7, 0.5), ylim = c(0, 8))

CatalystB <- c(3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1, 3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2)
hist(CatalystB, main = "% Yield of Undesired Side Products for Catalyst B", ylab = "Frequency", xlab = "Percentage yield (bins of width 0.5)", breaks = seq(0, 7, 0.5), ylim = c(0, 8))
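Identical x-axis scales come from using the same `breaks` in both calls. For the y-axis, one option is to count the observations per bin first and use the largest count as a shared upper limit. The sketch below shows that approach; `bins`, `countsA`, `countsB`, and `y_top` are names introduced here for illustration.

```r
CatalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1,
               3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)
CatalystB <- c(3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1,
               3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2)
bins <- seq(0, 7, 0.5)

# Count the observations in each bin without drawing anything
countsA <- hist(CatalystA, breaks = bins, plot = FALSE)$counts
countsB <- hist(CatalystB, breaks = bins, plot = FALSE)$counts
y_top   <- max(countsA, countsB)

# Draw both histograms with the same bins and the same y-axis limit
hist(CatalystA, breaks = bins, ylim = c(0, y_top),
     main = "Catalyst A", xlab = "Percentage yield")
hist(CatalystB, breaks = bins, ylim = c(0, y_top),
     main = "Catalyst B", xlab = "Percentage yield")
```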

Compare the shape of the percentage yields from the two catalysts observed in this sample.

Catalyst A has a smaller range than catalyst B. Catalyst A is unimodal, with most values between 3.5 and 5 and a single high value (6.7) that stretches its right tail. Catalyst B is more spread out and looks roughly bimodal, with one cluster between 3 and 4 and another between 5 and 6.

c. Compute the mean and median percentage yields observed for catalyst A and catalyst B using R. Compare both measures of center within each group and comment on how that relationship corresponds to the shape of the data. Also compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.

mean(CatalystA)

## [1] 4.148

median(CatalystA)

## [1] 4.3

mean(CatalystB)

## [1] 4.073913

median(CatalystB)

## [1] 3.8

The median for catalyst A is larger than its mean, and the mean for catalyst B is larger than its median. For A, this makes sense because the middle of the data falls in the 4 to 4.5 bar, yet the mean is pulled down by the sizable share of data in the 2.5 to 4 range. For B, this makes sense because the mean is pulled up by the frequency of data in the 5 to 7 range, while the median falls between 3 and 4. Both the mean and the median are higher for catalyst A because its data are more concentrated around 4 to 4.5 than the data in the B histogram.

d. Compute (in R) and compare the sample standard deviation of percentage yield from catalyst A and catalyst B. Comment on how the relative size of these values can be identified from the histograms. Describe in words what these values mean when considering which catalyst to use for your experiment.

sd(CatalystA)

## [1] 0.9760123

sd(CatalystB)

## [1] 1.715496

The standard deviation for catalyst A is much smaller than it is for B (about 0.98 versus 1.72). The relative size of these values makes sense when you look at the two histograms because catalyst B has far more points that are far from its mean, while most of catalyst A's values sit close to its mean. Catalyst A would be a better choice to move forward with because its yield of the side product is more predictable.

e. Use R to create side-by-side boxplots of the two sets so they are easily comparable.

# Passing both vectors to one boxplot() call draws them side by side on a shared axis.
# boxplot() has no breaks argument, so it is omitted here.
boxplot(CatalystA, CatalystB, names = c("Catalyst A", "Catalyst B"), horizontal = TRUE, main = "% Yield of Undesired Side Products by Catalyst", xlab = "Percentage yield")

f. Explain why the highest value in the catalyst A boxplot is shown as a point. That is, explain what calculations determined that 6.7 was an outlying value. Also specify to what value the upper catalyst A percentage yield whisker extends.

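The whiskers of a box plot extend to the most extreme observations that lie within 1.5 times the interquartile range of the box. For catalyst A the upper fence is 4.6 + 1.5(1) = 6.1. The highest value, 6.7, lies beyond this fence, so it is drawn as a point and treated as an outlying value. The upper whisker extends to 5.7, the largest observation that does not exceed 6.1.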

IQR(CatalystA)

## [1] 1

quantile(CatalystA)

##   0%  25%  50%  75% 100% 
##  2.6  3.6  4.3  4.6  6.7

4.6+ 1*(1.5)

## [1] 6.1
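The fence calculation above can be extended to find where the whisker actually ends and which points are flagged. This is a sketch using the quartiles from `quantile()`; `boxplot()` itself works from hinges, which coincide with the quartiles for this sample. The names `upper_fence`, `whisker_end`, and `outliers` are illustrative.

```r
CatalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1,
               3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)

q1 <- quantile(CatalystA, 0.25, names = FALSE)
q3 <- quantile(CatalystA, 0.75, names = FALSE)

# Upper fence: third quartile plus 1.5 times the IQR
upper_fence <- q3 + 1.5 * (q3 - q1)

# The whisker stops at the largest observation at or below the fence
whisker_end <- max(CatalystA[CatalystA <= upper_fence])

# Anything above the fence is drawn as a separate point
outliers <- CatalystA[CatalystA > upper_fence]

whisker_end
outliers

# The same five plotted values as computed by R:
# lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(CatalystA)$stats
```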

What would be the mean and median percentage yield if we combined the two data sets into one large data set? Show how the mean can be calculated from the summary measures in part (c) along with the sample sizes and explain why the median of the combined set cannot be computed based on (c).

mean(CatalystA)

## [1] 4.148

mean(CatalystB)

## [1] 4.073913

4.148*25

## [1] 103.7

4.073913*23

## [1] 93.7

103.7+93.7

## [1] 197.4

197.4/48

## [1] 4.1125

The mean for the two sets together is 4.1125. This was calculated by multiplying each set’s mean by its sample size to recover the two sums, adding these together, and dividing by the total sample size. The median cannot be calculated this way because it depends on the order of the individual values, not on their sums, so the two group medians and sample sizes are not enough. To find the median, you must combine both sets into a new 48-point data set and then use R or count to the middle (the average of the 24th and 25th ordered values).
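The two parts of this answer can be put side by side in a short sketch: the combined mean needs only the summary values, while the combined median needs the raw data. The names `nA`, `nB`, `meanA`, `meanB`, `combined_mean`, and `combined_median` are introduced here for illustration.

```r
# Combined mean from the summary measures alone (weighted by sample size)
nA <- 25; meanA <- 4.148
nB <- 23; meanB <- 4.073913
combined_mean <- (nA * meanA + nB * meanB) / (nA + nB)

# The combined median needs the raw values, so the two samples are pooled first
CatalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1,
               3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)
CatalystB <- c(3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1,
               3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2)
combined_median <- median(c(CatalystA, CatalystB))

combined_mean
combined_median
```

Note that the pooled median is not an average of the two group medians (4.3 and 3.8); it has to be found from the 48 pooled values.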

Exercise 4. You are adding Badger-themed bedazzle to your striped overalls and are using both red and white beads. You are interested in how the size of the bag of beads you select your beads from changes the probability of outcomes of interest. Compute the probability for the outcomes in (a) and (b) for all three different sampling strategies.

(Small Pop): Drawing without replacement from a small bag of beads with 7 White beads and 3 Red beads.

(Large Pop): Drawing without replacement from a large bag of beads with 700 White beads and 300 Red beads.

(Same Pop): Drawing from a bag of beads that always contains exactly 70% White and 30% Red beads.

For example, consider choosing 3 beads. Calculate the probability of selecting no white beads.

Small Pop: P(RRR) = \(\frac{3}{10}*\frac{2}{9}*\frac{1}{8}=0.008333333\)

Large Pop: P(RRR) = \(\frac{300}{1000}*\frac{299}{999}*\frac{298}{998}=0.02681098\)

Same Pop: P(RRR) = \(0.30*0.30*0.30=0.027\)

Consider choosing 3 beads. Calculate the probability of selecting exactly 1 white bead.

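Order will be listed small, large, then same. The single white bead can be drawn first, second, or third, and each ordering has the same probability, so the probability of one ordering is multiplied by 3.

3*((7/10)*(3/9)*(2/8))

## [1] 0.175

3*((700/1000)*(300/999)*(299/998))

## [1] 0.1889364

3*((7/10)*(3/10)*(3/10))

## [1] 0.189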
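One way to cross-check the probability of exactly 1 white bead is with R's built-in distributions, which account for every order in which the white bead can appear. The sketch below uses `dhyper()` for the two without-replacement bags and `dbinom()` for the fixed 70/30 bag; the variable names are illustrative.

```r
# dhyper(x, m, n, k): x = white beads drawn, m = white beads in the bag,
# n = red beads in the bag, k = number of beads drawn
small_pop <- dhyper(1, m = 7,   n = 3,   k = 3)
large_pop <- dhyper(1, m = 700, n = 300, k = 3)

# Fixed 70% white on every draw: binomial with 3 draws
same_pop  <- dbinom(1, size = 3, prob = 0.7)

c(small = small_pop, large = large_pop, same = same_pop)
```

Each value is 3 times the probability of one specific ordering such as white, red, red, because the white bead can be the first, second, or third bead drawn.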

Consider choosing 3 beads. Calculate the probability of selecting at least 1 white bead.

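Order will be listed small, large, then same. For each strategy, the probabilities of exactly one, exactly two, and exactly three white beads are added. The one-white and two-white outcomes can each occur in 3 orderings, so those products are multiplied by 3.

3*((7/10)*(3/9)*(2/8))

## [1] 0.175

3*((7/10)*(6/9)*(3/8))

## [1] 0.525

((7/10)*(6/9)*(5/8))

## [1] 0.2916667

0.175+0.525+0.2916667

## [1] 0.9916667

3*((700/1000)*(300/999)*(299/998))

## [1] 0.1889364

3*((700/1000)*(699/999)*(300/998))

## [1] 0.4416942

((700/1000)*(699/999)*(698/998))

## [1] 0.3425584

0.1889364+0.4416942+0.3425584

## [1] 0.973189

3*((7/10)*(3/10)*(3/10))

## [1] 0.189

3*((7/10)*(7/10)*(3/10))

## [1] 0.441

((7/10)*(7/10)*(7/10))

## [1] 0.343

0.189+0.441+0.343

## [1] 0.973

Each total equals 1 minus the probability of selecting no white beads from the example above (for the small population, 1 − 0.008333333 = 0.9916667).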
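A shorter route to "at least 1 white bead" is the complement rule: the only way to miss that outcome is to draw three red beads, so the probability is 1 minus P(RRR) from the worked example. A minimal sketch, with illustrative variable names:

```r
# P(no white beads) = P(RRR), as in the worked example for each sampling strategy
p_none_small <- (3/10) * (2/9) * (1/8)
p_none_large <- (300/1000) * (299/999) * (298/998)
p_none_same  <- 0.30^3

# Complement rule: P(at least 1 white) = 1 - P(no white)
at_least_small <- 1 - p_none_small
at_least_large <- 1 - p_none_large
at_least_same  <- 1 - p_none_same

c(small = at_least_small, large = at_least_large, same = at_least_same)
```

Summing the probabilities of exactly one, two, and three white beads should give the same three values, as long as every ordering of each outcome is counted.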

Consider sampling without replacement. Does drawing from a population that is small or large relative to your sample size result in a probability that is closest to the probability when sampling with replacement?

Drawing from a population that is large relative to the sample gives probabilities closer to those from sampling with replacement. This makes sense because removing 1, 2, or 3 beads from the larger bag barely changes the proportion of each color on later draws. For example, with replacement the probability of drawing two red beads when 40% of the beads are red is 0.4 × 0.4 = 0.16. If you have 400 red beads out of 1000 and draw 2 without replacement, the probability that both are red is about 0.1598. If the bag held 4 red beads out of 10, the probability would be about 0.133.
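The two-red-bead example can be extended to several bag sizes to show the without-replacement probability approaching the with-replacement value as the bag grows. The function `p_both_red` and the bag sizes below are illustrative choices, not part of the assignment.

```r
# Probability that 2 beads drawn without replacement are both red,
# when a fraction prop_red of the N beads in the bag are red
p_both_red <- function(N, prop_red = 0.4) {
  red <- prop_red * N
  (red / N) * ((red - 1) / (N - 1))
}

sizes <- c(10, 100, 1000, 10000)
data.frame(N                   = sizes,
           without_replacement = p_both_red(sizes),
           with_replacement    = 0.4^2)
```

The gap between the two columns shrinks as N increases, which matches the pattern seen when comparing the small and large bags to the fixed 70/30 bag.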


Owen Crowell
