library(openintro)
library(ggplot2)
library(lattice)
Question 1
Systolic blood pressure readings for treated hypertensive adults in the U.S. follow a roughly bell-shaped (normal) distribution with mean = 130 mmHg and standard deviation = 7.1 mmHg.
a. Label the bell curve (to the left) with appropriate raw data values for systolic blood pressure.
b. Calculate the z-score (using z = (y - μ)/σ) for a treated hypertensive American adult with a systolic blood pressure of 148. Label this point on your curve above.
c. What percentage of this adult population have systolic blood pressure readings between 100 and 135? Draw/Label/Shade to show this answer. Show how you arrive at your solution, and use proper notation with your solution.
d. What blood pressure reading marks the top 5% or higher for this population of adults? Draw/Label/Shade to show this answer.
Answer
1A - 1C: See below. 1D: The top 5% begins at the 95th percentile, where z ≈ 1.645, so y = 130 + 1.645(7.1) ≈ 141.7 mmHg. Readings of about 141.7 mmHg or higher fall in the top 5%.
# 1A. Bell curve and raw data
#population mean and standard deviation
population_mean <- 130
population_sd <- 7.1
#upper and lower bound
lower_bound <- population_mean - population_sd
upper_bound <- population_mean + population_sd
#Create a sequence of 50 x values based on population mean and standard deviation
x <- seq(-4, 4, length = 50) * population_sd + population_mean
#create a vector of values that shows the height of the probability distribution
#for each value in x
y <- dnorm(x, population_mean, population_sd)
#plot normal distribution with customized x-axis labels
plot(x, y, type = "l", lwd = 2, axes = FALSE, xlab = "mmHg", ylab = "")
sd_axis_bounds <- 5
axis_bounds <- seq(-sd_axis_bounds * population_sd + population_mean,
                   sd_axis_bounds * population_sd + population_mean,
                   by = population_sd)
axis(side = 1, at = axis_bounds, pos = 0)

# 1B. z-score for a systolic blood pressure of 148
z <- (148 - population_mean) / population_sd
list(z) #With a z-score of 2.535, the probability of a systolic pressure of 148 or greater is P(Y >= 148) = .005619 (very small).
## [[1]]
## [1] 2.535211
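The tail probability quoted in the comment above can be checked directly with `pnorm()`. This is a minimal sketch that assumes the same mean and standard deviation as in 1A; the names `z_148`, `p_upper`, and `p_upper_direct` are introduced here for illustration only.

```r
# Assumed inputs, repeated from 1A so the block runs on its own
population_mean <- 130
population_sd <- 7.1

# z-score for a reading of 148
z_148 <- (148 - population_mean) / population_sd

# Upper-tail probability P(Y >= 148) from the standard normal curve
p_upper <- pnorm(z_148, lower.tail = FALSE)

# Same probability without standardizing first
p_upper_direct <- pnorm(148, mean = population_mean, sd = population_sd,
                        lower.tail = FALSE)

print(c(z = z_148, p_upper = p_upper, p_upper_direct = p_upper_direct))
```

Both routes should agree, since standardizing only rescales the axis.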
# 1C. z-scores for the bounds 100 and 135, used to find P(100 < Y < 135)
z <- (100 - population_mean) / population_sd
list(z)
## [[1]]
## [1] -4.225352
z <- (135 - population_mean) / population_sd
list(z)
## [[1]]
## [1] 0.7042254
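#With a z-score (number of standard deviations from the mean) of -4.225, P(Y < 100) = .00001; P(Y > 100) = .99999; P(100 < Y < 130) = .4999
#With a z-score of .704, P(Y < 135) = .759; P(Y > 135) = .241; P(130 < Y < 135) = .259
#Therefore P(100 < Y < 135) = P(Y < 135) - P(Y < 100) = .759 - .00001 = .759, so about 75.9% of this population have readings between 100 and 135.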
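The probabilities for 1C and the cutoff for 1D can also be computed and shaded in R. The sketch below assumes the same normal model (mean 130, standard deviation 7.1); `p_between`, `cutoff_top5`, `shade_x`, and `shade_y` are illustrative names, and the plot styling is only one possible choice.

```r
# Assumed inputs, repeated from 1A so the block runs on its own
population_mean <- 130
population_sd <- 7.1

# 1C: P(100 < Y < 135) as a difference of two cumulative probabilities
p_between <- pnorm(135, population_mean, population_sd) -
  pnorm(100, population_mean, population_sd)

# 1D: the reading that cuts off the top 5% (the 95th percentile)
cutoff_top5 <- qnorm(0.95, mean = population_mean, sd = population_sd)

# Draw the curve, shade the 1C region, and mark the 1D cutoff
x <- seq(population_mean - 4 * population_sd,
         population_mean + 4 * population_sd, length.out = 200)
plot(x, dnorm(x, population_mean, population_sd), type = "l", lwd = 2,
     xlab = "mmHg", ylab = "Density")
shade_x <- seq(max(100, min(x)), 135, length.out = 100)
shade_y <- dnorm(shade_x, population_mean, population_sd)
polygon(c(shade_x[1], shade_x, 135), c(0, shade_y, 0),
        col = rgb(0, 0, 1, alpha = 0.3), border = NA)
abline(v = cutoff_top5, lty = 2)  # dashed line at the top-5% cutoff

print(c(p_between = p_between, cutoff_top5 = cutoff_top5))
```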
Question 2
A student researcher tries to determine the proportion of students with advanced degrees who take at least one class at Montgomery College. In a post-pandemic environment, she stands outside the MC library and asks the first 20 people who walk by whether they have advanced degrees or not. She found that 12/20 people said they did have advanced degrees, so she concluded that 60% of all students at Montgomery College have advanced degrees. Briefly describe at least three things that made this a poor study (use proper statistical terminology/vocabulary).
Answer
Sample size: The number of individuals to sample depends on several factors, including the size and variability of the MC population. Selecting only the first 20 people to pass a single building on a particular day at a particular time makes for a poorly designed study. As we recently learned, when the sample size increases, the sampling distribution becomes more nearly normal and its spread (the standard error) shrinks, so the sample statistic estimates the population value more precisely. A larger sample is therefore preferable.
Number of samples: In this case there is only one sample from the entire MC population, so there is no information about how the sample proportion varies from sample to sample. With repeated random samples (for example, 25 samples, each with 50 observations), each sample would give its own proportion, and a histogram of those proportions would display the sampling distribution. More random samples would give a more accurate estimate of the population proportion.
Data provided: 12/20 (60%) isn't representative of the population. The student researcher should look at the mean, standard deviation, 5-number summary, histograms, box plots, and the Shapiro-Wilk test to determine whether the sample data come from a population that is normally distributed.
In summary, the sample is biased, small, and not representative of the population. The researcher stood in front of the library, where only some students pass by (a convenience sample). Finally, participation is voluntary, because students do not need to answer the question, which invites nonresponse bias. The study results are therefore not useful.
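The sample-size point can be illustrated with the standard error of a sample proportion, sqrt(p̂(1 - p̂)/n). The sketch below uses the observed 12/20 and two hypothetical larger sample sizes (100 and 400, chosen only for illustration). A larger n shrinks the spread, but it does not remove the bias from standing outside the library.

```r
# Observed proportion from the study
p_hat <- 12 / 20

# n = 20 is the study's sample size; 100 and 400 are hypothetical comparisons
n_values <- c(20, 100, 400)

# Standard error of the sample proportion for each sample size
se <- sqrt(p_hat * (1 - p_hat) / n_values)

print(data.frame(n = n_values, standard_error = round(se, 3)))
```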
Question 3 - SEE WRITTEN EXAM FOR SOLUTION
Question 4
SEE WRITTEN EXAM FOR REST OF SOLUTION - I attempted to do some of it in R.
The prevalence of color blindness in males is 3%. Let Y denote the number with color blindness in a random sample of seven males.
What would the expected mean and standard deviation of this binomial distribution be? Also, would it be appropriate to use normal distribution calculations for this type of sample – why or why not?
Complete the table above.
Find Pr{Y ≥ 4} and Pr{Y ≤ 2}.
testB <- 0:7
plot(testB, dbinom(testB, size = 7, prob = 0.03),
     type = 'h',
     main = 'Binomial Distribution (n=7, p=0.03)',
     ylab = 'Probability',
     xlab = 'Number with color blindness (Y)',
     lwd = 3)

# display the probability of each number of successes (0 to 7)
# dbinom(testB, size = 7, prob = 0.03)
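The remaining parts of Question 4 can be sketched in R as follows, assuming Y ~ Binomial(n = 7, p = 0.03). The variable names are illustrative, and the np ≥ 5 and n(1 - p) ≥ 5 check is a common rule of thumb rather than necessarily the criterion used in the written exam.

```r
n <- 7
p <- 0.03

# Mean and standard deviation of a binomial random variable
binom_mean <- n * p
binom_sd <- sqrt(n * p * (1 - p))

# Probability of each possible count, 0 through 7
prob_table <- data.frame(y = 0:n,
                         probability = dbinom(0:n, size = n, prob = p))

# Pr{Y >= 4} = 1 - Pr{Y <= 3}, and Pr{Y <= 2}
p_at_least_4 <- pbinom(3, size = n, prob = p, lower.tail = FALSE)
p_at_most_2 <- pbinom(2, size = n, prob = p)

# Rule-of-thumb check for the normal approximation (assumed threshold of 5)
normal_ok <- (n * p >= 5) && (n * (1 - p) >= 5)

print(prob_table)
print(c(mean = binom_mean, sd = binom_sd,
        p_at_least_4 = p_at_least_4, p_at_most_2 = p_at_most_2))
print(normal_ok)
```

Because np = 0.21 is far below 5, the distribution is strongly right-skewed and the normal approximation would not be appropriate here.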
Question 5
Twenty-one years of deaths due to the flu virus were recorded for a particular region. The data below are given for each year from 1985 to 2005:
35 41 55 65 29 60 49 24 37 110 28 40 47 58 30 37 15 50 28 37 41
a. Create a histogram of the data, including a scale for the vertical frequency axis.
b. Find the mean, standard deviation, and 5-number summary for the data. Be sure to include appropriate symbols as well.
c. Create a box plot of the data. Show an appropriate scale along the number line. Also perform the outlier test for the upper fence (Q3 + 1.5*IQR).
d. Create a Normal Quantile Plot (qqnorm and qqline) for the flu data, then compute the Shapiro-Wilk test. In a short paragraph, provide a detailed explanation of how these three components fit together to describe the shape of the distribution. Be sure to include any conclusions about the results of your findings.
ANSWER: SEE BELOW
The histogram shows the distribution and spread of the variable (deaths due to flu). In the histogram below, the distribution is skewed to the right, with a long upper tail produced by the value 110. Since a histogram uses area instead of height to represent values, the width of the bars can vary. The box plot summarizes the distribution using the five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum) and flags values beyond the upper fence as outliers. The Normal Quantile Plot checks normality by plotting the data against the quantiles of a standard normal distribution. A sample of standard normal values the same size as the flu data was also generated for comparison. With the Shapiro-Wilk test, the p-value was not less than .05, so my final conclusion is that the sample data are consistent with a population that is normally distributed.
#Put deaths into a data frame and check data by printing
i_COHb <- c(35, 41, 55, 65, 29, 60, 49, 24, 37, 110, 28, 40, 47, 58, 30, 37, 15, 50, 28, 37, 41)
df <- data.frame(i_COHb)
print(df)
## i_COHb
## 1 35
## 2 41
## 3 55
## 4 65
## 5 29
## 6 60
## 7 49
## 8 24
## 9 37
## 10 110
## 11 28
## 12 40
## 13 47
## 14 58
## 15 30
## 16 37
## 17 15
## 18 50
## 19 28
## 20 37
## 21 41
# 5B - Calculations
#Since I don't know how to draw symbols using R, I included extra calculations for the data.
#The 5-number summary describes the distribution of the observations without
#having to choose a single summary statistic: it gives the location (median),
#the spread (quartiles), and the range (sample minimum and maximum).
i_COHb <- as.numeric(i_COHb)
mean(i_COHb) #mean
## [1] 43.61905
## [1] 40
sd(i_COHb) #standard deviation
## [1] 19.84055
## [1] 20
fivenum(i_COHb) #5-number summary: min, Q1, median, Q3, max (110 is above the upper fence Q3 + 1.5*IQR = 80, so it is an outlier)
## [1] 15 30 40 50 110
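The outlier test asked for in part (c) can be written out from the five-number summary. This sketch re-enters the data under the illustrative name `flu_deaths` so it runs on its own, and it takes Q1 and Q3 from `fivenum()` (the hinges), which is an assumption about how the quartiles are defined in the course.

```r
flu_deaths <- c(35, 41, 55, 65, 29, 60, 49, 24, 37, 110, 28, 40, 47, 58,
                30, 37, 15, 50, 28, 37, 41)

# Five-number summary: min, Q1, median, Q3, max
five <- fivenum(flu_deaths)
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

# Fences for the 1.5 * IQR outlier rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences
outliers <- flu_deaths[flu_deaths < lower_fence | flu_deaths > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr,
        lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
```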
# 5C - Box Plot
df <- as.numeric(i_COHb)
boxplot(df, notch=TRUE)

boxplot(df,
        main = "Deaths Due to Flu")
# legend
legend("topright", legend = "Boxplot",
       fill = rgb(1, 0, 0, alpha = 0.4), # Color
       inset = c(0.03, 0.05), # Modify margins
       bg = "white") # Legend background color

#110 lies above the upper fence (Q3 + 1.5*IQR = 80), so the box plot shows it as an outlier
# 5A - HISTOGRAM
hist(df, # Frequency histogram with about 50 breaks
     breaks = 50)

hist(df, # Density-scale histogram so the density curve is on the same scale
     freq = FALSE,
     main = "Flu Virus Deaths: 1985 - 2005")
lines(density(df), col = "red")

# 5D - qqnorm and qqline
set.seed(5332) # Set seed
x <- rnorm(length(df)) #Simulate 21 standard normal values for comparison
qqnorm(df) # Normal quantile plot of the flu data
qqline(df, col = "red") # Add qqline to plot

shapiro.test(x) #Shapiro-Wilk test. Note that x holds the simulated normal values, so this call tests x rather than the flu data; shapiro.test(df) would test the flu deaths. Since the p-value is not less than .05, normality is not rejected for x.
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.92051, p-value = 0.08884
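Because `shapiro.test(x)` above is applied to the simulated values in `x`, a test of the flu data itself would pass the observed deaths instead. The sketch below re-enters the data under the illustrative name `flu_deaths`. No result is quoted here because it was not part of the original output.

```r
flu_deaths <- c(35, 41, 55, 65, 29, 60, 49, 24, 37, 110, 28, 40, 47, 58,
                30, 37, 15, 50, 28, 37, 41)

# Shapiro-Wilk test on the observed flu deaths (not on simulated values)
sw <- shapiro.test(flu_deaths)

# W statistic and p-value; reject normality at the .05 level if p < .05
print(sw$statistic)
print(sw$p.value)
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```

A p-value at or above .05 would mean normality is not rejected, which is weaker than showing that the population is normal.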
Question 6
- Back to the breast cancer example (be sure to DRAW/LABEL/SHADE for parts (b) and (c)). Suppose it is known that in a certain country 16% of women will develop breast cancer, and we select a random sample of 100 women from that country.
a. Use the binomial distribution calculations to find the probability that exactly 20 will develop breast cancer.
b. Now use the normal approximation to the binomial distribution with continuity correction to calculate that same probability that exactly 20 out of the 100 women will develop breast cancer.
c. Finally, use the sampling distribution for the normal approximation to the binomial distribution to calculate the probability that p̂ is within ±0.05 of p = 0.16. Do not use the continuity correction.
- The FEV, or forced expiratory volume of air in one second, is an important indicator of lung capacity. Assuming that the population distribution is normal, the mean for young women is 3,000 ml and the standard deviation is 400 ml.
a. With a random sample of 50 young women, what are the mean and standard error of this sampling distribution? Draw and label a curve to show that a random variable Y is approximately normally distributed with this mean and standard error.
b. For this sampling distribution, what proportion of the group would you expect to have FEV levels below 2900? Write proper notation. DRAW/LABEL/SHADE.
Answer
See below
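The three breast cancer calculations can be sketched in R as follows, assuming Y ~ Binomial(n = 100, p = 0.16). The variable names are illustrative, and the written exam may round intermediate values differently.

```r
n <- 100
p <- 0.16

# Exact binomial probability that exactly 20 develop breast cancer
p_exact <- dbinom(20, size = n, prob = p)

# Normal approximation with continuity correction: P(19.5 < Y < 20.5)
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
p_normal_cc <- pnorm(20.5, mean = mu, sd = sigma) -
  pnorm(19.5, mean = mu, sd = sigma)

# Sampling distribution of p-hat, no continuity correction:
# P(0.11 < p-hat < 0.21) with standard error sqrt(p(1 - p)/n)
se_phat <- sqrt(p * (1 - p) / n)
p_within <- pnorm(p + 0.05, mean = p, sd = se_phat) -
  pnorm(p - 0.05, mean = p, sd = se_phat)

print(c(p_exact = p_exact, p_normal_cc = p_normal_cc, p_within = p_within))
```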
#population mean and standard deviation
population_mean <- 3000
population_sd <- 400
#upper and lower bound
lower_bound <- population_mean - population_sd
upper_bound <- population_mean + population_sd
#Create a sequence of 50 x values based on population mean and standard deviation
x <- seq(-4, 4, length = 50) * population_sd + population_mean
#create a vector of values that shows the height of the probability distribution
#for each value in x
y <- dnorm(x, population_mean, population_sd)
#plot normal distribution with customized x-axis labels
plot(x, y, type = "l", lwd = 2, axes = FALSE, xlab = "FEV/ml", ylab = "")
sd_axis_bounds <- 5
axis_bounds <- seq(-sd_axis_bounds * population_sd + population_mean,
                   sd_axis_bounds * population_sd + population_mean,
                   by = population_sd)
axis(side = 1, at = axis_bounds, pos = 0)

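#z-score for an individual FEV of 2900, using the population standard deviation
z <- (2900 - population_mean) / population_sd
list(z) # For an individual young woman, P(Y < 2900) = 0.40129; that is, about 40% of young women have FEV levels below 2900 ml. For the mean of a sample of 50, the standard error 400/sqrt(50) would replace the population sd.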
## [[1]]
## [1] -0.25
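The z-score above uses the population standard deviation, so it describes an individual woman. For the sampling distribution of the mean of 50 women, the standard error replaces the standard deviation. This is a minimal sketch under that reading of the question; `n`, `standard_error`, `z_mean`, and `p_mean_below` are illustrative names.

```r
population_mean <- 3000
population_sd <- 400
n <- 50

# Standard error of the sample mean for n = 50
standard_error <- population_sd / sqrt(n)

# z-score and probability that the sample mean falls below 2900
z_mean <- (2900 - population_mean) / standard_error
p_mean_below <- pnorm(z_mean)

print(c(standard_error = standard_error, z = z_mean,
        p_mean_below = p_mean_below))
```

Under this reading, a sample mean below 2900 ml is much less likely than an individual value below 2900 ml, because averaging 50 observations narrows the distribution.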
