Jack Hirstein

title: "HOMEWORK 1"
output: html_document
date: "2025-06-17"

Exercise 1

Part a

The population of interest is all county citizens who are eligible to vote in the election. The parameter of interest is the proportion of those eligible voters who vote yes on using tax money to upgrade the stadium. We will know the parameter only after the vote is done.

Part b

The first sample was 1,000 citizens who were eligible to vote, selected by the pollster. The statistic calculated was that 47% of them answered yes to the use of tax money to upgrade the stadium. The second sample was 10,000 people who attended one of the football games. The statistic calculated was that 73% of them answered yes to the use of tax money to upgrade the stadium.

The pollster’s sample is a simple random sample, while the people who attended the football game are not a simple random sample.

Part c

I would say that the owner is wrong. The sample of 10,000 citizens could be a biased sample because most of the citizens at the game probably like football and already support the team. They would then be more likely to want tax money used to upgrade the stadium than someone who doesn’t like or watch football, who might not support the team and might not want the tax money to go to the stadium. Because of this bias, a bigger sample size does not mean the sample will better predict the voting outcome.
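The point about bias can be illustrated with a small simulation. Everything in it is an assumption made up for illustration, not data from the exercise: a hypothetical county of 200,000 eligible voters, 30% of whom are football fans, with fans supporting the upgrade at a rate of 0.73 and everyone else at 0.36. Under those assumptions, a sketch of the two sampling schemes could look like this:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical county (all numbers below are assumptions, not exercise data)
n.pop     <- 200000
is.fan    <- rbinom(n.pop, size = 1, prob = 0.30) == 1
p.yes     <- ifelse(is.fan, 0.73, 0.36)             # assumed support rates
votes.yes <- rbinom(n.pop, size = 1, prob = p.yes) == 1
true.prop <- mean(votes.yes)                        # the parameter

# Pollster: simple random sample of 1,000 eligible voters
srs.est <- mean(votes.yes[sample(n.pop, 1000)])

# Stadium: 10,000 people drawn only from the fans
game.est <- mean(votes.yes[sample(which(is.fan), 10000)])

round(c(parameter = true.prop, pollster = srs.est, stadium = game.est), 3)
```

In runs like this the small random sample tends to land near the parameter, while the much larger stadium sample lands near the fans' support rate instead.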

Exercise 2

Part a

The data being recorded are the number of defects on each disk, summarized as the number of disks with each defect count. A defect count is quantitative discrete data.

Part b

bi.

error.data=c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)),
xlab="Defects", main="Number of Defects",
labels=TRUE, ylim=c(0,50))

bii. The rep() function replicates values, so it repeats a number a given number of times. rep(0,41) repeats the number zero 41 times, rep(1,31) repeats the number one 31 times, etc.
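As a quick check of that description, the sketch below calls rep() on a small case and then rebuilds the vector from part bi. The name compact is introduced here only for illustration; the times argument is part of base R's rep().

```r
rep(0, 3)                      # 0 0 0: the value 0 repeated three times

error.data <- c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4,5))
length(error.data)             # 100 disks in total
table(error.data)              # counts per defect value: 41 31 15 8 5

# Same values in one call: each of 0..4 repeated its own number of times
compact <- rep(0:4, times = c(41, 31, 15, 8, 5))
all(error.data == compact)     # TRUE
```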

biii. The breaks argument sets the bin edges at -0.5, 0.5, ..., 4.5, so each bar on the histogram is centered on a whole number on the x-axis. It makes the histogram more accurate and more pleasing to the eye.
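To see what breaks does without drawing anything, hist() can be called with plot=FALSE and the returned bin edges, midpoints, and counts inspected. This is only a sketch; h is a name chosen here.

```r
error.data <- c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4,5))
h <- hist(error.data, breaks = seq(from = -0.5, to = 4.5, by = 1), plot = FALSE)

h$breaks   # -0.5 0.5 1.5 2.5 3.5 4.5: bin edges halfway between the integers
h$mids     # 0 1 2 3 4: each bar is centered on a defect count
h$counts   # 41 31 15 8 5
```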

biv. If you changed the upper ylim from 50 to 30, the bars for zero defects and one defect would be cut off at 30. This is because both zero and one defects have more than 30 disks (41 and 31). Someone looking at the histogram would then not be able to tell accurately how many disks have zero defects or one defect. An upper ylim value for clear communication would be either 45 or 50, because both are greater than the largest number of disks in any defect category and not so large that the histogram becomes visually unpleasing.

Exercise 3

Part a

It is possible to tell that the mean would decrease. If one of the numbers in the sample decreases, the mean also decreases, because the mean is the average of the numbers in the sample. Yes, we can also tell by how much the mean changes. Multiplying the mean by 12 gives a total of 324. The minimum drops from 13.8 to 1.38, a decrease of 13.8-1.38=12.42. Subtracting 12.42 from 324 gives 311.58. Dividing 311.58 by the number of observations, 12, gives 25.965. This number is the new mean.
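The arithmetic can be retraced in a few lines of R. The variable names are chosen here for illustration; the numbers are the ones used in this part.

```r
n        <- 12     # number of observations
old.mean <- 27     # mean before the change
old.min  <- 13.8   # original minimum
new.min  <- 1.38   # minimum after the change

old.total <- n * old.mean                      # 324
new.total <- old.total - (old.min - new.min)   # 324 - 12.42 = 311.58
new.mean  <- new.total / n                     # 25.965
old.mean - new.mean                            # drop of 12.42 / 12 = 1.035
```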

Part b

The median does not change at all, so there is no direction to predict. The minimum is the only value that changes, and it only becomes smaller, so it stays at the bottom of the sorted data and the middle values that determine the median are unaffected.

Part c

Yes, we can predict that the standard deviation would increase. This is because the minimum goes from 13.8 to 1.38, which is further away from the mean of 27. It is difficult to predict how much the standard deviation will change because we do not know the values of the other observations. Without this information we cannot determine the size of the change.
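Parts b and c can be checked on a made-up data set. The twelve values below are hypothetical, chosen only so that the mean is 27 and the minimum is 13.8 as in the exercise; the real observations are not given here, so the size of the change in standard deviation that this prints should not be read as the answer.

```r
# Hypothetical data: 12 values with mean 27 and minimum 13.8 (an assumption)
x.old <- c(13.8, 20, 22, 24, 25, 26, 28, 29, 30, 32, 35, 39.2)
x.new <- replace(x.old, which.min(x.old), 1.38)   # only the minimum changes

c(mean(x.old), mean(x.new))       # 27 and 25.965
c(median(x.old), median(x.new))   # both 27: the middle values are untouched
c(sd(x.old), sd(x.new))           # the second value is larger
sd(x.new) > sd(x.old)             # TRUE: the spread grows
```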

Exercise 4

error.data=c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4, 5))
hist(error.data, breaks=c(seq(from=-0.5, 4.5, by=1)),
xlab="Defects", main="Number of Defects",
labels=TRUE, ylim=c(0,60))

Part a

The histogram is right-skewed. This makes sense because a disk is less likely to have a large number of defects than to have none or only a few. The disks are made to have as few defects as possible.

Part b

By hand: mean = 103/100 = 1.03 and median = 1. The median was found by a “checking off” method, that is, by finding the middle value of the sorted data (the 50th and 51st values are both 1).

mean(error.data)
## [1] 1.03
median(error.data)
## [1] 1

The mean and median are very close. Since the mean is slightly bigger than the median, the histogram should be right-skewed, which is what we see, so the mean and median are consistent with the shape of the histogram.
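The hand calculation can be written out from the frequency table as a rough check. The names defects, disks, cum, and middle are chosen here for illustration.

```r
defects <- 0:4
disks   <- c(42, 30, 16, 7, 5)                      # disks with each defect count

mean.by.hand <- sum(defects * disks) / sum(disks)   # 103 / 100 = 1.03

# Median of 100 values: average of the 50th and 51st sorted values
cum    <- cumsum(disks)                             # 42 72 88 95 100
middle <- sapply(c(50, 51), function(p) defects[which(cum >= p)[1]])
median.by.hand <- mean(middle)                      # both fall in the "1" group, so 1
```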

Part c

sqrt(((0-1.03)^2*42+(1-1.03)^2*30+(2-1.03)^2*16+(3-1.03)^2*7+(4-1.03)^2*5)/99)=1.15

The standard deviation by hand is about 1.15.

sd(error.data)
## [1] 1.149923

The values from the two methods are consistent, as they are almost the same number. If the 100 disks were the whole population, you would divide by 100 and not 99 because the disks are no longer a sample but a population. This would give a slightly smaller standard deviation.
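A short sketch of the two divisors, using the same data; ss, sd.sample, and sd.pop are names introduced here.

```r
error.data <- c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4,5))
n  <- length(error.data)                        # 100
ss <- sum((error.data - mean(error.data))^2)    # sum of squared deviations, 130.91

sd.sample <- sqrt(ss / (n - 1))   # divide by 99: matches sd(error.data)
sd.pop    <- sqrt(ss / n)         # divide by 100: population version, about 1.144
c(sd.sample, sd.pop)
```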

Part d

boxplot(error.data, horizontal=TRUE,main="Number of Errors per Disk",xlab="Number of Errors",ylab="Disks")

We can see the shape identified in part a in the boxplot: the dashed whisker extends far out to the right, up to 4, while the box and the median sit toward the left. The dashed line is a whisker, not a set of outliers; no point lies more than 1.5*IQR beyond the box. This also tells us that the data are skewed right.
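The numbers behind the plot can be pulled out with boxplot.stats(), which returns the five values that boxplot() draws. A brief sketch, with five as a name chosen here:

```r
error.data <- c(rep(0,42), rep(1,30), rep(2,16), rep(3,7), rep(4,5))
five <- boxplot.stats(error.data)$stats
five   # 0 0 1 2 4: lower whisker end, lower hinge, median, upper hinge, upper whisker end
```

The box runs from 0 to 2 with the median at 1, the left whisker has no length because the minimum equals the lower hinge, and the right whisker reaches 4, so the long side is on the right.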

Part e

The histogram shows the data better than the boxplot because you can see the frequency of each number of errors for the 100 disks. The boxplot does not show how many disks have each number of errors, whereas the histogram does.

Exercise 5

Part a

current.model=c(1.63,1.25,1.23,1.49,2.11,1.48,1.94,1.72,1.85,1.54,1.67,1.76,1.46,1.32,1.23,1.67,1.74,1.63,1.25,1.56)
new.model=c(1.28,1.19,0.90,1.24,1.00,0.80,0.71,1.03,1.27,1.14,1.36,0.91,1.09,1.36,0.91,0.91,0.86,0.93,1.36)
hist(current.model,main="Current Model",breaks=c(seq(from=0.6, 2.2, by=0.1)),xlab="Gallons of Water",ylim=c(0,6))

hist(new.model,main="New Model",breaks=c(seq(from=0.6, 2.2, by=0.1)),xlab="Gallons of Water",ylim=c(0,6))

Part b

The shape of the current model distribution is slightly right skewed with the center of the histogram at about 1.6-1.7. The shape of the new model distribution is slightly left skewed with the center of the histogram at about 0.90.

Part c

mean(current.model)
## [1] 1.5765
median(current.model)
## [1] 1.595
mean(new.model)
## [1] 1.065789
median(new.model)
## [1] 1.03

For the current model the median (1.595) is slightly larger than the mean (1.5765), which by itself would point toward a slight left skew; for the new model the mean (1.066) is larger than the median (1.03), which would point toward a slight right skew. These directions are the opposite of the slight skews read from the histograms in part b, and the gaps between mean and median are small, so both distributions are best described as close to symmetric. The new model also has a smaller mean and median than the current model, which supports the company’s claim that the new toilets use less water on an average flush.
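How much less water the new model used in these samples can be computed directly. This is just a sketch on the two vectors from part a; saved is a name chosen here.

```r
current.model <- c(1.63,1.25,1.23,1.49,2.11,1.48,1.94,1.72,1.85,1.54,
                   1.67,1.76,1.46,1.32,1.23,1.67,1.74,1.63,1.25,1.56)
new.model <- c(1.28,1.19,0.90,1.24,1.00,0.80,0.71,1.03,1.27,1.14,
               1.36,0.91,1.09,1.36,0.91,0.91,0.86,0.93,1.36)

saved <- mean(current.model) - mean(new.model)   # about 0.51 gallons per flush
saved / mean(current.model)                      # roughly a 32% reduction
median(current.model) - median(new.model)        # 1.595 - 1.03 = 0.565
```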

Part d

sd(current.model)
## [1] 0.2456843
sd(new.model)
## [1] 0.2058941

We can see that the standard deviation of the current model is larger than that of the new model. We can see this on the histograms because the current model has one high value of over 2 gallons in a single flush (2.11) that sits apart from the rest of the data, whereas the new model has no value separated from the rest. The current model also has a wider spread of values than the new model.

Part e

boxplot(current.model,new.model,names=c("Current Model", "New Model"),horizontal=TRUE,ylim=c(0,2.4),main="Water Usage per Flush",xlab="Gallons of Water")

Part f

quantile(current.model,type=2)
##    0%   25%   50%   75%  100% 
## 1.230 1.390 1.595 1.730 2.110
quantile(new.model,type=2)
##   0%  25%  50%  75% 100% 
## 0.71 0.91 1.03 1.27 1.36

No values are shown as dots because no values fall outside the 1.5*IQR fences, so there are no outliers. For the current model the IQR is 1.730-1.390=0.34, so the lower fence is 1.390-1.5*0.34=0.88 and the upper fence is 1.730+1.5*0.34=2.24, and all of the values fit into that range. The whiskers for the current model therefore extend from the minimum, 1.230, to the maximum, 2.110.
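The fences can be reproduced from the type 2 quartiles. In the sketch below, fences() and outside() are hypothetical helpers written for this check, not base R functions.

```r
current.model <- c(1.63,1.25,1.23,1.49,2.11,1.48,1.94,1.72,1.85,1.54,
                   1.67,1.76,1.46,1.32,1.23,1.67,1.74,1.63,1.25,1.56)
new.model <- c(1.28,1.19,0.90,1.24,1.00,0.80,0.71,1.03,1.27,1.14,
               1.36,0.91,1.09,1.36,0.91,0.91,0.86,0.93,1.36)

# Hypothetical helper: 1.5 * IQR fences from type 2 quartiles
fences <- function(x) {
  q   <- unname(quantile(x, probs = c(0.25, 0.75), type = 2))
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Hypothetical helper: values that fall outside the fences
outside <- function(x) x[x < fences(x)["lower"] | x > fences(x)["upper"]]

fences(current.model)            # lower 0.88, upper 2.24
fences(new.model)                # lower 0.37, upper 1.81
length(outside(current.model))   # 0
length(outside(new.model))       # 0
```

boxplot() itself computes hinges with fivenum(), which can differ slightly from type 2 quartiles; for these data the conclusion is the same.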

Part g

total.model=c(current.model,new.model)
mean(total.model)
## [1] 1.327692
median(total.model)
## [1] 1.28

The combined mean can be calculated from part c by multiplying each model’s mean by its number of observations, adding the two totals, and dividing by 39, the number of observations from both models together. Current model: 1.5765*20=31.53. New model: 1.065789*19=20.25. 31.53+20.25=51.78. 51.78/39=1.3277. The combined mean is about 1.33, which matches mean(total.model).
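The same calculation can be written as a weighted mean, using only the part c means and the two sample sizes. weighted.mean() is base R; the variable names are chosen here.

```r
means <- c(current = 1.5765, new = 1.065789)   # from part c
sizes <- c(current = 20,     new = 19)         # observations in each sample

sum(means * sizes)                             # 31.53 + 20.25 = 51.78
combined <- weighted.mean(means, w = sizes)
combined                                       # about 1.3277
```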

The median cannot be determined from the values in part c because finding the median requires the individual values sorted in numerical order. With only the medians of the current and new models, there are no individual values to sort, so the median of the combined data cannot be found.
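A toy example, with numbers invented purely for illustration, shows why the two medians are not enough: the pairs below have the same group medians but different combined medians.

```r
# Toy data (hypothetical, not from the exercise)
a1 <- c(1, 2, 3);  b1 <- c(4, 5, 6)
a2 <- c(1, 2, 10); b2 <- c(4, 5, 6)

c(median(a1), median(b1))   # 2 5
c(median(a2), median(b2))   # 2 5, the same group medians
median(c(a1, b1))           # 3.5
median(c(a2, b2))           # 4.5, a different combined median
```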