Homework 2

Rachel Walsh

1. If you wanted to estimate the mean height of all the students at UW Madison, which one of the following sampling strategies would be best? Why? (Note that none of the methods are true simple random samples.)

a. Measure the heights of 100 students found in the gym during basketball intramurals.
b. Measure the heights of the engineering majors.
c. Measure the heights of the students selected by choosing the first name on each page of a list of students enrolled that semester.

C. Choosing the first name on each page draws from the entire enrolled student body, and where a name falls on a page should be unrelated to height. Options a and b are limited to subgroups (intramural basketball players, engineering majors) whose heights may differ systematically from those of students overall.

There are 12 numbers on a list, and the mean is 24. The smallest number on the list is changed from 11.9 to 1.19.

Is it possible to determine the direction in which (increase/decrease) the mean changes? Or how much the mean changes? If so, by how much does it change? If not, why not?

Yes. The mean decreases because one value in the list decreases while the others stay the same. Writing x for the sum of the other 11 numbers, 24 = (11.9 + x)/12, and the change can be calculated as follows:

> 24*12
[1] 288
> 288-11.9
[1] 276.1
> (1.19+276.1)/12
[1] 23.1075
> 24-23.1075
[1] 0.8925

The mean decreases by 0.8925, from 24 to 23.1075.

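Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

Yes. The median does not change. With 12 numbers, the median is the average of the 6th and 7th values in sorted order. The changed number was the smallest on the list and is still the smallest after the change, so the 6th and 7th values are unaffected.
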
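Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case.

Yes, the direction can be predicted: the standard deviation gets larger. The smallest value moves down by 10.71 while the mean drops by only 0.8925, so that value ends up farther from the mean and the sum of squared deviations increases. The size of the change cannot be determined, because the new standard deviation depends on the original sum of squared deviations, and computing that requires the other 11 numbers.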
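
As a check on the three questions above, here is a minimal sketch using a made-up list. Only the mean (24), the list length (12) and the smallest value (11.9) come from the problem; the other 11 values are hypothetical and were chosen only so that the list sums to 288.

```r
# Hypothetical list: 12 numbers with mean 24 whose smallest value is 11.9.
# The other 11 values are invented for illustration.
before <- c(11.9, 15, 18, 20, 22, 24, 25, 26, 28, 30, 33, 35.1)

after <- before
after[which.min(after)] <- 1.19   # change the smallest value from 11.9 to 1.19

mean(before) - mean(after)        # drop in the mean, equal to (11.9 - 1.19) / 12
median(after) - median(before)    # the two middle values are untouched
sd(after) > sd(before)            # the spread increases
```

Any other list with the same mean and smallest value would give the same change in the mean and an unchanged median, but a different change in the standard deviation.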

Physical education researchers interested in the development of the overarm throw measured the horizontal velocity of a thrown ball at the time of release. The results for first-grade children (in feet/sec) (courtesy of L. Halverson and M. Roberton*) are:

Males<-c(54.2, 39.6, 52.3, 48.4, 35.9, 30.4, 25.2, 45.4, 48.9, 48.9, 45.8, 44.0, 52.5, 48.3, 59.9, 51.7, 38.6, 39.1, 49.9, 38.3)
Females<-c(30.3, 43.0, 25.7, 26.7, 27.3, 31.9, 53.7, 32.9, 19.4, 23.7, 23.3, 23.3, 37.8, 39.5, 33.5, 30.4, 28.5)
sort(Males)

##  [1] 25.2 30.4 35.9 38.3 38.6 39.1 39.6 44.0 45.4 45.8 48.3 48.4 48.9 48.9 49.9
## [16] 51.7 52.3 52.5 54.2 59.9

sort(Females)

##  [1] 19.4 23.3 23.3 23.7 25.7 26.7 27.3 28.5 30.3 30.4 31.9 32.9 33.5 37.8 39.5
## [16] 43.0 53.7

Use R to create a histogram for the males and a histogram for the females (any kind of histogram that you want since sample sizes are similar). Adjust the x axis and y scales so the two groups are more easily compared.

# Same x and y limits on both plots so the two groups can be compared directly
hist(Males, xlim = c(15,60), ylim = c(0,8))

hist(Females, xlim = c(15,60), ylim = c(0,8))

b. Compare the shape of the throws from the male and female students observed in this sample.

The female histogram sits further to the left, with most velocities between 20 and 35 ft/sec and a tail to the right that includes one high value (53.7). The male histogram sits further to the right, with most velocities between 35 and 55 ft/sec and a tail to the left toward the lowest values (25.2 and 30.4).

c. Compute the mean and median throw velocities observed separately for the male and female students using R. Compare both measures of center across the two groups.

mean(Males)

## [1] 44.865

median(Males)

## [1] 47.05

mean(Females)

## [1] 31.22941

median(Females)

## [1] 30.3

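Males have the higher center on both measures: the mean is 44.865 ft/sec for males versus 31.229 for females, and the median is 47.05 versus 30.3.

d. Compute and compare the standard deviation in throw velocities observed in the male and female students.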

sd(Males)

## [1] 8.513845

sd(Females)

## [1] 8.519666

The standard deviations are very similar (8.514 for males, 8.520 for females), with the females' being slightly larger.

e. Use R to create side-by-side boxplots of the two sets so they are easily comparable.

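genders<-c("Males","Females")
# names = labels each box; xlab gives the units of the measurement
boxplot(Males, Females, names = genders, horizontal = TRUE, main = "Males and Females", xlab = "Velocity (feet/sec)")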
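
The boxplot above draws any value more than 1.5 × IQR beyond a quartile as a separate point. A minimal sketch of that rule for the female data follows. The variable names `q`, `iqr` and `upper_fence` are introduced here for illustration. `boxplot()` works from hinges rather than `quantile()` quartiles, but the two coincide for these 17 values.

```r
Females <- c(30.3, 43.0, 25.7, 26.7, 27.3, 31.9, 53.7, 32.9, 19.4, 23.7,
             23.3, 23.3, 37.8, 39.5, 33.5, 30.4, 28.5)

q <- unname(quantile(Females, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]                             # interquartile range
upper_fence <- q[2] + 1.5 * iqr                # cutoff for high outliers

Females[Females > upper_fence]                 # values drawn as separate points
max(Females[Females <= upper_fence])           # where the upper whisker ends

# Cross-check against the statistics boxplot() itself uses
boxplot.stats(Females)$out
```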

f. Explain why the highest value in the Female Velocity boxplot is shown as a point. That is, explain what calculations determined that 53.7 was an outlying value. Also specify to what value the upper female whisker extends.

A value is drawn as a separate point when it lies more than 1.5 × IQR beyond a quartile. For the females, Q1 = 25.7 and Q3 = 33.5, so IQR = 33.5 - 25.7 = 7.8 and the upper fence is 33.5 + 1.5(7.8) = 45.2. Since 53.7 > 45.2, it is plotted as an outlier. The upper whisker extends to 43.0, the largest female value that does not exceed the fence.

g. What would be the mean and median throw velocity if we combined the throw velocities into one large data set? Show how only one of the mean or median can be calculated from your female and male summary measures in part (c).

Throw<-c(54.2, 39.6, 52.3, 48.4, 35.9, 30.4, 25.2, 45.4, 48.9, 48.9, 45.8, 44.0, 52.5, 48.3, 59.9, 51.7, 38.6, 39.1,
49.9, 38.3,30.3, 43.0, 25.7, 26.7, 27.3, 31.9, 53.7, 32.9, 19.4, 23.7, 23.3, 23.3, 37.8, 39.5, 33.5, 30.4, 28.5)
mean(Throw)

## [1] 38.6

median(Throw)

## [1] 38.6
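
Of the two, only the combined mean can be recovered from the part (c) summaries, because a group mean multiplied by its sample size gives back that group's total. The combined median depends on how all 37 values interleave when sorted, which the two group medians do not reveal. A short sketch of the mean calculation, using the rounded means as printed above (the variable names are illustrative):

```r
# Sample sizes and means from part (c); mean_f is rounded as printed by R
n_m <- 20; mean_m <- 44.865
n_f <- 17; mean_f <- 31.22941

# Combined mean = total of all velocities / total number of throws
combined_mean <- (n_m * mean_m + n_f * mean_f) / (n_m + n_f)
combined_mean
```

Up to rounding in the female mean, this agrees with mean(Throw).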

What would be the standard deviation of the throw velocities if we combined the throw velocities into one large data set? How does this value compare to the standard deviations you calculated in part (d) and why does this make sense based on the Male and Female data sets?

sd(Throw)

## [1] 10.86166

The standard deviation increased because the combined data set covers a wider spread of values than either group alone. This makes sense because the males threw at higher velocities than the females, so combining two groups with different centers leaves many values far from the combined mean of 38.6.

4. After manufacture, computer disks are tested for errors. The table below tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects:   0   1   2   3   4
Number of Disks:    41  31  15   8   5

a. Describe the type of data that is being recorded about the sample of 100 disks, being as specific as possible.

The data are quantitative and discrete: each value is a count of the defects on one disk, so it can only be a whole number (0, 1, 2, ...) and cannot take the in-between values that a continuous measurement could.

b. A frequency histogram showing the frequency for number of errors on the 100 disks is given below. Write the R code to produce this frequency histogram by creating bins at [-.5, .5), [.5, 1.5), etc. Be sure to create useful labels.

Zerodefects<-rep(0,41)
Onedefect<-rep(1,31)
Twodefects<-rep(2,15)
Threedefects<-rep(3,8)
Fourdefects<-rep(4,5)
Defects<-c(Zerodefects, Onedefect, Twodefects,Threedefects, Fourdefects)
hist(Defects, breaks = c(-.5,.5,1.5,2.5,3.5,4.5), main = "Errors Detected per Disk", xlab = "Number of Defects", ylab = "Number of Disks")

What is the shape of the histogram for the number of defects observed in this sample? Right skewed: most disks have 0 or 1 defects, and the counts tail off toward the higher values.
Calculate the mean and median number of errors detected on the 100 disks by hand and with R. How do the mean and median values compare and is that consistent with what we would guess based on the shape?

mean(Defects)

## [1] 1.05

median(Defects)

## [1] 1
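
The same two values can also be reached directly from the frequency table, which is what a by-hand calculation does. A minimal sketch, with `defects`, `disks` and `positions` as illustrative names:

```r
# Frequency table from the problem statement
defects <- 0:4
disks <- c(41, 31, 15, 8, 5)

# Mean: total number of defects divided by the number of disks
table_mean <- sum(defects * disks) / sum(disks)

# Median: with 100 disks it is the average of the 50th and 51st sorted values.
# cumsum() gives the position of the last disk at each defect count.
positions <- cumsum(disks)
value_50 <- defects[which(positions >= 50)[1]]
value_51 <- defects[which(positions >= 51)[1]]
table_median <- mean(c(value_50, value_51))

table_mean
table_median
```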

mean by hand: (0(41)+1(31)+2(15)+3(8)+4(5))/100 = (0+31+30+24+20)/100 = 105/100 = 1.05. median by hand: with 100 values, the median is the average of the 50th and 51st values in sorted order. The 41 zeros fill positions 1 to 41 and the 31 ones fill positions 42 to 72, so the 50th and 51st values are both 1 and the median is 1. The mean (1.05) is slightly larger than the median (1), which is consistent with a right-skewed shape: the few disks with 3 or 4 defects pull the mean up.

e. Calculate the sample standard deviation “by hand” and using R. Are the values consistent between the two methods?

sd(Defects)

## [1] 1.157976

by hand:

> 41*(0-1.05)^2+31*(1-1.05)^2+15*(2-1.05)^2+8*(3-1.05)^2+5*(4-1.05)^2
[1] 132.75
> 132.75/99
[1] 1.340909
> (1.340909)^(1/2)
[1] 1.157976

The values are exactly the same.

f. Explain why the histogram is better able to show the discrete nature of the data than a boxplot.

The histogram has a separate bar for each possible count (0, 1, 2, 3, 4), so it shows both that the data take only whole-number values and how many disks fall at each value. A boxplot reduces the data to a five-number summary and does not display how often each individual value occurs.

g. Suppose a customer came to pick up a single computer disk from the 100 produced on that day. What is the probability that disk has at least 1 defect?

(31+15+8+5)/100 = 59/100 = 59%

h. Suppose a customer came to pick up three computer disks from the 100 produced on that day. What is the probability that at least 1 of the three disks has at least 1 defect?

1-(41/100)*(40/99)*(39/98) = 0.9340754
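
Both probabilities in parts g and h can be checked in R with the complement rule. This sketch assumes the disks are picked at random and without replacement, and the names `total`, `clean`, `p_one` and `p_three` are illustrative.

```r
# Counts from the frequency table: 41 of the 100 disks have no defects
total <- 100
clean <- 41

# g. One disk: P(at least 1 defect) = 1 - P(no defects)
p_one <- 1 - clean / total

# h. Three disks drawn without replacement:
# P(at least one has a defect) = 1 - P(all three are defect-free)
p_three <- 1 - choose(clean, 3) / choose(total, 3)

p_one
p_three
```

Counting the defect-free selections with choose() is equivalent to multiplying 41/100, 40/99 and 39/98.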