Stat 371 Homework #2

*Submit your homework to Canvas by the due date and time. Email your lecturer if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK2 Autograde Quiz which will give you ~20 of your 40 accuracy points.

*50 points total: 40 points accuracy, and 10 points completion

Summarizing Data Numerically and Graphically

Exercise 1. There are 12 numbers in a sample, and the mean is \(\bar{x}=24\). The minimum of the sample is accidentally changed from 11.9 to 1.19.

Is it possible to determine the direction (increase/decrease) in which the mean (\(\bar{x}\)) changes? Or how much the mean changes? If so, by how much does it change? If not, why not?

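Yes, both the direction and the size of the change can be determined. Only one value changes, so the sum of the sample changes by \(1.19-11.9=-10.71\), and the mean changes by \(\frac{-10.71}{12}=-0.8925\). The mean decreases by 0.8925, from \(\bar{x}=24\) to \(\bar{x}=23.1075\).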
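The shift in the mean can be checked with a short R sketch. The individual sample values are unknown, so the code uses only the sample size, the original mean, and the two versions of the minimum given in the exercise; the variable names are illustrative.

```r
n <- 12            # sample size
xbar_old <- 24     # original sample mean
old_min <- 11.9    # minimum before the mistake
new_min <- 1.19    # minimum after the mistake

# Only one value changes, so the sum changes by (new_min - old_min)
change_in_mean <- (new_min - old_min) / n
xbar_new <- xbar_old + change_in_mean

change_in_mean
xbar_new
```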

Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

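Yes. The median does not change. With 12 values, the median is the average of the 6th and 7th ordered values. The changed value was the minimum before the change and is still the minimum after it, so the ordering of the middle values is unaffected.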

Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.

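Yes, the direction can be predicted: the standard deviation increases. The minimum was already the value farthest below the mean, and it moves even farther away (the minimum drops by 10.71 while the mean drops by only 0.8925), so the overall spread grows. The size of the increase is hard to predict because the standard deviation depends on the squared deviations of all 12 values from the mean, and we do not know the other 11 values.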
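To illustrate why the size of the change depends on the unknown values, here is a minimal sketch using two made-up samples. Both `tight` and `spread` are hypothetical (they are not the data from the exercise); each has 12 values, a mean of 24, and a minimum of 11.9. The helper functions `corrupt` and `sd_change` are also illustrative names.

```r
# Two hypothetical samples: 12 values, mean 24, minimum 11.9
tight  <- c(11.9, rep(25.1, 11))
spread <- c(11.9, 12, 14, 16, 18, 20, 26, 30, 32, 34, 36, 38.1)

# Replace the minimum with 1.19, as in the exercise
corrupt <- function(x) {
  x[which.min(x)] <- 1.19
  x
}

# Change in the sample standard deviation caused by the mistake
sd_change <- function(x) sd(corrupt(x)) - sd(x)

sd_change(tight)
sd_change(spread)
```

In both cases the standard deviation goes up, but by different amounts, even though the mean and minimum were the same to begin with.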

Exercise 2: Recall the computer disk error data used in HWK 1. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects	Number of Disks
0	41
1	31
2	15
3	8
4	5

A frequency histogram showing the frequency for number of errors on the 100 disks is given below.

error.data=c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)),
     xlab="Defects", main="Number of Defects", labels=TRUE, ylim=c(0,60))

What is the shape of the histogram for the number of defects observed in this sample? Why does that make sense in the context of the question?

The histogram is right-skewed: most of the disks fall in the lowest bars (0 or 1 defect), and the frequencies taper off toward 4 defects. This makes sense in context because a production process should yield mostly error-free or nearly error-free disks, the count cannot go below 0, and only a few disks have many defects.

Calculate the mean and median number of errors detected on the 100 disks ‘by hand’ and using the built-in R functions. How do the mean and median values compare and is that consistent with what we would guess based on the shape? [You can use the text such as \(\bar{x}=\frac{value1}{value2}\) to help you show your work neatly].

mean(error.data)

## [1] 1.05

median(error.data)

## [1] 1

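By hand, \(\bar{x}=\frac{0(41)+1(31)+2(15)+3(8)+4(5)}{100}=\frac{105}{100}=1.05\). The median is the average of the 50th and 51st ordered values; the first 41 values are 0 and the next 31 are 1, so both are 1 and the median is 1. Both match the R output. The mean is slightly larger than the median, which is consistent with a right-skewed shape because the few disks with 3 or 4 defects pull the mean up.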
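The "by hand" calculation can also be sketched in R directly from the frequency table, without expanding it into 100 values. The names `defects`, `disks`, and `cum_disks` are illustrative.

```r
defects <- 0:4
disks   <- c(41, 31, 15, 8, 5)

# Weighted mean: sum of (value * frequency) divided by the total count
mean_by_hand <- sum(defects * disks) / sum(disks)

# Median: average of the 50th and 51st ordered values,
# located with the cumulative frequencies (41, 72, 87, 95, 100)
cum_disks <- cumsum(disks)
value_50 <- defects[which(cum_disks >= 50)[1]]
value_51 <- defects[which(cum_disks >= 51)[1]]
median_by_hand <- (value_50 + value_51) / 2

mean_by_hand
median_by_hand
```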

Calculate the sample standard deviation ‘by hand’ and using the built-in R function. Are the values consistent between the two methods? How would our calculation differ if instead we considered these 100 values the whole population?

sd(error.data)

## [1] 1.157976

The two methods give the same value. By hand, \(s=\sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}}=\sqrt{\frac{132.75}{99}}=1.158\), which matches sd(error.data). If the 100 values were the whole population, we would divide by \(N=100\) instead of \(n-1=99\), giving \(\sigma=\sqrt{\frac{132.75}{100}}=1.152\), which is slightly smaller.
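A minimal sketch of the two divisors in R is shown here. Note that `sd()` always divides by \(n-1\), so the population version has to be computed explicitly; `ss`, `s_sample`, and `sigma_pop` are illustrative names.

```r
error.data <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
n <- length(error.data)

# Sum of squared deviations from the mean
ss <- sum((error.data - mean(error.data))^2)

s_sample  <- sqrt(ss / (n - 1))   # sample standard deviation (divide by n - 1)
sigma_pop <- sqrt(ss / n)         # population standard deviation (divide by n)

s_sample
sigma_pop
```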

Construct a boxplot for the number of errors data using R with helpful labels. Explain how the shape of the data identified in (a) can be seen from the boxplot.

boxplot(error.data, horizontal=TRUE, xlab="Defects", main="Number of Defects", ylim=c(0,5))

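The right skew shows up in the boxplot as asymmetry: the box runs from 0 to 2 with the median at 1, there is no lower whisker because the minimum equals the first quartile (0), and the upper whisker stretches out to 4. The long right tail of the plot matches the right-skewed histogram.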
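The values that the boxplot draws can be checked numerically. This sketch uses `fivenum()`, which returns the minimum, lower hinge, median, upper hinge, and maximum, and `boxplot.stats()`, which reports any points that would be drawn as outliers.

```r
error.data <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(error.data)
five

# Points beyond 1.5 * IQR from the hinges (none for these data)
out <- boxplot.stats(error.data)$out
out
```

The distance from the median to the maximum (3) is much larger than the distance from the median to the minimum (1), which is the numerical version of the right skew.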

Explain why the histogram is better able to show the discrete nature of the data than a boxplot.

The histogram shows the discrete nature of the data better because each bar is centered on one possible value (0, 1, 2, 3, or 4) and is labeled with its count, so you can see exactly how many disks had each number of defects. The boxplot only displays five summary values on a continuous axis, so it hides the fact that the data take only whole-number values.

Exercise 3: A certain reaction was run several times using each of two catalysts, A and B. The catalysts are supposed to control the yield of an undesirable side product. Results, in units of percentage yield, for 25 runs of catalyst A and 23 runs of catalyst B are given below and also in Catalysts.csv.

Catalyst A: 4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1, 3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5

Catalyst B: 3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1, 3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2

Use R to create a histogram for the percentage yield of the undesirable side product for the two catalysts (any kind of histogram that you want since sample sizes are similar). Have identical x and y axis scales so the two groups’ values are more easily compared. Include useful titles.

catalystA<-c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1, 3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)
catalystB<-c(3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1, 3.1, 1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2)
hist(catalystA, labels=TRUE, ylim=c(0,10), xlim=c(0,7), xlab="% Undesirable Product", main="Catalyst A", breaks=seq(0,9,0.5))

hist(catalystB, labels=TRUE, ylim=c(0,10), xlim=c(0,7), xlab="% Undesirable Product", main="Catalyst B", breaks=seq(0,9,0.5))

Compare the shape of the percentage yields from the two catalysts observed in this sample.

Catalyst A's histogram is unimodal and roughly mound-shaped, with a peak between 4 and 4.5, a cluster of low values near 2.6, and one high value at 6.7. Catalyst B's histogram is much more spread out and appears bimodal, with one group of values between about 3 and 4 and another between about 5 and 6.

Compute the mean and median percentage yields observed for Catalyst A and Catalyst B using R. Compare both measures of center within each group and comment on how that relationship corresponds to the datas’ shapes. Also compare the measures of center across the two groups and comment on how that relationship is evident in the histograms.

mean(catalystA)

## [1] 4.148

mean(catalystB)

## [1] 4.073913

median(catalystA)

## [1] 4.3

median(catalystB)

## [1] 3.8

Within Catalyst A, the mean (4.148) is slightly below the median (4.3), which fits a mound-shaped histogram whose low values (several runs at 2.6) pull the mean down a little. Within Catalyst B, the mean (4.074) is above the median (3.8) by about 0.27; with a bimodal shape, neither measure describes a typical run well, because there is a gap in the data between 4 and 5. Across the two groups, the means are nearly equal (4.148 versus 4.074), while the medians differ by 0.5. This matches the histograms: both are centered near 4, but Catalyst A has a single peak just above 4, while Catalyst B has values piled up on either side of 4.

Compute (in R) and compare the sample standard deviation of percentage yield from Catalyst A and Catalyst B. Comment on how the relative size of these values can be identified from the histograms. Describe in words what these values mean when considering which catalyst to use for your experiment.

sd(catalystA)

## [1] 0.9760123

sd(catalystB)

## [1] 1.715496

Catalyst B's standard deviation (1.715) is about 1.76 times that of Catalyst A (0.976). This can be seen in the histograms: most of Catalyst A's values are concentrated between about 3.5 and 5, while Catalyst B's values are spread from about 1 to 7. Since the two means are similar, Catalyst A gives a more consistent yield of the undesirable side product from run to run, so it would be the better choice for an experiment.

Use R to create side-by-side boxplots of the two sets in R so they are easily comparable.

boxplot(catalystA, catalystB, horizontal=TRUE, names=c("A","B"), main="Percent Undesirable product by Catalyst")

Explain why the highest value in the Catalyst A boxplot is shown as a point. That is, explain what calculations determined that 6.7 was an outlying value. Also specify to what value the upper Catalyst A percentage yield whisker extends.

The highest value in the Catalyst A boxplot, 6.7, is drawn as a point because R flags it as an outlier. The boxplot uses the 1.5 × IQR rule, not standard deviations: the quartiles of Catalyst A are \(Q_1=3.6\) and \(Q_3=4.6\), so \(IQR=1.0\) and the upper fence is \(4.6+1.5(1.0)=6.1\). Since \(6.7>6.1\), it is plotted separately. The upper whisker extends to the largest observation that is still inside the fence, which is 5.7.
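The outlier rule can be reproduced step by step in R. This is a sketch of the 1.5 × IQR calculation using the hinges from `fivenum()`; the names `hinges`, `upper_fence`, and `upper_whisker` are illustrative.

```r
catalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1,
               3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)

# Minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(catalystA)
iqr <- hinges[4] - hinges[2]

# Values above the upper fence are drawn as individual points
upper_fence <- hinges[4] + 1.5 * iqr
outliers <- catalystA[catalystA > upper_fence]

# The whisker stops at the largest value that is not an outlier
upper_whisker <- max(catalystA[catalystA <= upper_fence])

upper_fence
outliers
upper_whisker
```

The same results are available from `boxplot.stats(catalystA)`, whose `stats` component ends with the upper whisker and whose `out` component lists the outliers.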

What would be the mean and median percentage yield if we combined the two data sets into one large data set? Show how the mean can be calculated from the summary measures in part (c) along with the sample sizes and explain why the median of the combined set cannot be computed based on (c).

The combined mean can be computed from the group means in part (c) and the sample sizes. Multiplying each mean by its sample size recovers each group's sum: \(4.148 \times 25 = 103.7\) and \(4.073913 \times 23 = 93.7\). The combined sum is \(103.7+93.7=197.4\), so the combined mean is \(\frac{197.4}{25+23}=4.1125\). The median cannot be computed from the summaries in (c) because it depends on the order of the individual values, not on their sums; the two group medians (4.3 and 3.8) do not tell us which values fall in the middle of the combined set. To find it, all 48 values must be sorted together, and the median is the average of the 24th and 25th ordered values, which gives \(\frac{4.1+4.3}{2}=4.2\).

mean(catalystA)

## [1] 4.148

mean(catalystB)

## [1] 4.073913
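As a check, the weighted-mean calculation can be compared with the mean of the pooled data, and the combined median can be found by pooling the raw values. This is a minimal sketch; `combined` and `combined_mean` are illustrative names.

```r
catalystA <- c(4.3, 4.4, 3.4, 2.6, 3.8, 4.9, 4.6, 5.2, 4.7, 4.1, 2.6, 6.7, 4.1,
               3.6, 2.9, 2.6, 4.0, 4.3, 3.9, 4.8, 4.5, 4.4, 3.1, 5.7, 4.5)
catalystB <- c(3.4, 5.9, 1.2, 2.1, 5.5, 6.4, 5.0, 5.8, 2.5, 3.7, 3.8, 5.1, 3.1,
               1.6, 3.5, 5.9, 6.7, 5.2, 5.8, 2.2, 4.3, 3.8, 1.2)

n_A <- length(catalystA)
n_B <- length(catalystB)

# Weighted average of the two group means, using only summary values
combined_mean <- (n_A * mean(catalystA) + n_B * mean(catalystB)) / (n_A + n_B)

# The median needs the raw values pooled and sorted
combined <- c(catalystA, catalystB)
combined_median <- median(combined)

combined_mean
combined_median
```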

Probability

Exercise 4 You are adding Badger-themed bedazzle to your striped overalls and are using both red and white beads. You are interested in how the size of the bag of beads you select your beads from changes the probability of outcomes of interest. Compute the probability for outcomes a and b using three different sampling strategies each time.

(Small Pop) drawing without replacement from a small population where the bag of beads contains 7 White beads and 3 Red beads.

(Large Pop) drawing without replacement from a large population where the bag of beads contains 700 White beads and 300 Red beads.

(Same Pop) drawing from a population where the bag of beads always contains 70% White and 30% Red beads.

Example: Consider choosing 3 beads. Calculate the probability of selecting no white beads.

Small Pop: P([RRR])=\(\frac{3}{10}*\frac{2}{9}*\frac{1}{8}=0.008333333\)

Large Pop: P([RRR])=\(\frac{300}{1000}*\frac{299}{999}*\frac{298}{998}=0.02681098\)

Same Pop: P([RRR])=\(0.30*.30*.30=0.027\)

Consider choosing 3 beads. Calculate the probability of selecting exactly 1 white bead.

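For each population, I computed the probability of one ordering (WRR) and multiplied by 3, since the white bead can be drawn first, second, or third (WRR, RWR, RRW) and each ordering has the same probability.

Small Pop: P([WRR])=\(\frac{7}{10}*\frac{3}{9}*\frac{2}{8}=0.0583\), so P(exactly 1 W)=\(3*0.0583=0.175\)

Large Pop: P([WRR])=\(\frac{700}{1000}*\frac{300}{999}*\frac{299}{998}=0.06298\), so P(exactly 1 W)=\(3*0.06298=0.18894\)

Same Pop: P([WRR])=\(0.70*0.30*0.30=0.063\), so P(exactly 1 W)=\(3*0.063=0.189\)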
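These probabilities can be cross-checked with R's built-in distributions. The sketch below assumes the number of white beads drawn without replacement follows a hypergeometric distribution (`dhyper`) and the number drawn from the fixed 70%/30% population follows a binomial distribution (`dbinom`); the variable names are illustrative.

```r
# dhyper(x, m, n, k): x white beads drawn, m white and n red in the bag, k draws
small_pop <- dhyper(1, 7, 3, 3)
large_pop <- dhyper(1, 700, 300, 3)

# dbinom(x, size, prob): x white beads in 3 draws with P(white) = 0.7 each time
same_pop <- dbinom(1, 3, 0.7)

# The same small-population value from one ordering multiplied by 3
small_by_hand <- 3 * (7/10) * (3/9) * (2/8)

c(small_pop, large_pop, same_pop)
```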

Consider choosing 3 beads. Calculate the probability of selecting at least 1 white bead.

The only outcome with no white bead is RRR, so for each population I computed P([RRR]) and subtracted it from 1 to get the probability of at least 1 white bead.

Small Pop: P([RRR])=\(\frac{3}{10}*\frac{2}{9}*\frac{1}{8}=0.00833\), so P(at least 1 W)=\(1-0.00833=0.992\)

Large Pop: P([RRR])=\(\frac{300}{1000}*\frac{299}{999}*\frac{298}{998}=0.0268\), so P(at least 1 W)=\(1-0.0268=0.9732\)

Same Pop: P([RRR])=\(0.30*0.30*0.30=0.027\), so P(at least 1 W)=\(1-0.027=0.973\)
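The complement rule can be sketched in R the same way, again assuming a hypergeometric model for drawing without replacement and a binomial model for the fixed 70%/30% population. The variable names are illustrative.

```r
# P(at least 1 white) = 1 - P(0 white)
small_pop <- 1 - dhyper(0, 7, 3, 3)       # 7 white, 3 red, 3 draws
large_pop <- 1 - dhyper(0, 700, 300, 3)   # 700 white, 300 red, 3 draws
same_pop  <- 1 - dbinom(0, 3, 0.7)        # P(white) = 0.7 on every draw

c(small_pop, large_pop, same_pop)
```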

Consider sampling without replacement. Does drawing from a population that is small or large relative to your sample size result in a probability that is closest to the probability when sampling with replacement?

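Drawing from a population that is large relative to the sample size gives a probability closest to sampling with replacement. Removing a few beads from a bag of 1000 barely changes the proportions of white and red beads left in the bag, so each draw is almost like drawing from the fixed 70%/30% population. Removing beads from a bag of only 10 changes the remaining proportions noticeably from draw to draw, so the probabilities move farther from the with-replacement values. The results above show this: for exactly 1 white bead, the Large Pop value (0.18894) is nearly equal to the Same Pop value (0.189), while the Small Pop value (0.175) is not.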
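The effect of population size can be illustrated with a short sketch that repeats the "exactly 1 white bead" calculation for bags of increasing size, all 70% white. The bag sizes 100 and 10000 are added here only for illustration and are not part of the exercise.

```r
bag_size <- c(10, 100, 1000, 10000)
white <- 7 * bag_size / 10   # number of white beads in each bag
red   <- 3 * bag_size / 10   # number of red beads in each bag

# Without replacement: hypergeometric probability of exactly 1 white in 3 draws
p_without <- dhyper(1, white, red, 3)

# With replacement (fixed 70%/30% population): binomial probability
p_with <- dbinom(1, 3, 0.7)

# Gap between the two sampling schemes for each bag size
gap <- abs(p_without - p_with)
data.frame(bag_size, p_without, gap)
```

The gap shrinks as the bag grows, which is why the large population behaves almost like sampling with replacement.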


Alexa Schram
