Due 10/11/25 at 11 pm
Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F25/blob/main/homework/HW_1_F25_bsta511.qmd
The exercises listed below will be graded for this assignment. You are strongly encouraged to complete the entire assignment. You will receive feedback on exercises you turn in that are not being graded.
* Starred exercises in the section Book exercises may be completed by hand (such as on paper or using a tablet) instead of using Quarto.
It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your Qmd file and rendering frequently helps you catch your errors more quickly.
To help me learn your names and faces, please upload a photo of yourself on Sakai. You will find the Upload Photo “assignment” in the Assignments section of Sakai. These photos will only be seen by me and the TA.
On Monday, September 22, 2025, President Trump and Health Secretary Robert F. Kennedy Jr. claimed that taking acetaminophen (the active ingredient in the pain reliever Tylenol) during pregnancy was a cause of autism in the child. This led to an extensive debate on the topic, much of which has focused on a systematic review of observational studies researching the association between prenatal acetaminophen use and autism spectrum disorder (Evaluation of the evidence on acetaminophen use and neurodevelopmental disorders using the Navigation Guide methodology). There are many news reports that discussed this issue, one of which is from the New York Times, “Debate Flares Over an Unproven Link Between Tylenol and Autism”.
Can causation be deduced based on the observational studies in the systematic review linked to above? Explain why or why not.
No, causation cannot be deduced from these observational studies. The studies show associations, but an association alone does not establish causation because confounding variables, biases, and systematic uncertainty cannot be ruled out.
Would it be ethical to conduct an experiment to study the effects of prenatal acetaminophen use on the development of autism spectrum disorder?
No, a randomized controlled trial would not be ethical, because it would mean experimenting on pregnant women when there could be harmful side effects.
Describe a stratified sampling method that could be used to study this topic in a hypothetical study.
One stratified sampling method would be to define strata such as mothers in different trimesters and randomize acetaminophen usage within each trimester. The researchers could then vary the dosage each mother uses, taking into account any fevers or other ailments during the pregnancy that would require acetaminophen.
Describe a cluster sampling method that could be used to study this topic in a hypothetical study.
A cluster sampling method could focus on prenatal clinics and hospitals across the United States: randomly select a handful of clinics from each state, then measure acetaminophen exposure and developmental outcomes in a longitudinal study.
Describe a multistage sampling method that could be used to study this topic in a hypothetical study.
For the multistage sample, I would first randomly select 20 countries, each from a different part of the world: 10 from among the countries with the highest acetaminophen usage and 10 from among those with the lowest. I would then randomly choose 5 cities from each country and follow the babies postpartum in a longitudinal study.
One of the studies included in the systematic review included “All Singleton live born children in Sweden with linkable personal identifiers with follow-up until Dec 31,2021.” What type of sampling method did they use?
No random sampling was used: the study included every singleton live-born child in Sweden with linkable personal identifiers, so it is effectively a census of that population rather than a simple random or cluster sample. It was also an observational study, not a randomized trial.
Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file.
Write your answers in complete sentences as if communicating the results to a collaborator.
If you are having difficulty with exercise 1.2, take a look at exercise 1.1, whose answers are at the back of the book.
#a
66/85
[1] 0.7764706
65/81
[1] 0.8024691
a. 77.65% of patients in the treatment group experienced an improvement in symptoms.
0.7764706
b. 80.25% of patients experienced an improvement in symptoms in the control group.
0.8024691
c.
In the control group there was a higher percentage of patients experiencing improvement in symptoms.
d.
A possible explanation could be antibiotic immunity in the body due to constant exposure. It could also be because the condition was acute sinusitis, where the body's immune system can clear the infection without needing antibiotics.
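The proportions in parts a through c can also be computed from named counts, which keeps the numerators and denominators labeled. This is a minimal sketch using the counts from the calculations above; the variable names are illustrative and not part of the assignment.

```r
# Counts taken from the calculations above (improved / total in each group)
improved <- c(treatment = 66, control = 65)
total    <- c(treatment = 85, control = 81)

# Proportion of patients who improved in each group (parts a and b)
prop_improved <- improved / total
round(prop_improved, 4)

# Difference, treatment minus control (part c);
# a negative value means the control group improved at a higher rate
diff_prop <- prop_improved[["treatment"]] - prop_improved[["control"]]
diff_prop
```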
a. The main research question is whether the Buteyko method is effective, that is, whether it reduces asthma symptoms and improves quality of life.
b. The subjects are 600 people aged 18-69 who relied on medication for asthma.
c. The variables are quality of life, activity, asthma symptoms, and medication reduction. These are all numerical discrete, as they are measured on a scale of 0 to 10.
a. The population of interest is healthy volunteers and the sample size is 437 subjects.
b. The results cannot be generalized, as the sample is small and includes only healthy young adults, so the findings might not apply to an older population. There have also been multiple conflicting results, so the question needs further study with more established results before generalizing to a larger population.
c. No, the results cannot establish a causal relationship. The sample would need to be larger and should examine the effects of Echinacea in different age groups, as each group will have different immune responses.
a. The median would be more representative, as it is more robust against outliers; the mean would include the 2 new salaries and be skewed by them.
b. The IQR would be a better representation of variability, as it spans the 25th through 75th percentiles and excludes the outliers (the 2 new people’s salaries). The IQR is therefore a more robust measure, whereas the standard deviation takes the new salaries into account.
The midrange is affected by outliers, which can skew it heavily. Because it is computed from the maximum and minimum, unlike the mode, whose values are not near those extremes, it is not a robust statistic.
See Section 1.6.2 for more on how the relative risk is calculated.
51/215
[1] 0.2372093
67/215
[1] 0.3116279
54/215
[1] 0.2511628
43/215
[1] 0.2
#b
51/94
[1] 0.5425532
54/121
[1] 0.446281
#c
(51/94)/(54/121)
[1] 1.215721
a. 23.72% of subjects were smokers with aortic stenosis. 31.16% were non-smokers without aortic stenosis. 25.12% were non-smokers with aortic stenosis. 20% were smokers without aortic stenosis.
b. 0.5425 is the proportion of smokers that had aortic stenosis.
0.4462 is the proportion of non-smokers that had aortic stenosis.
c. The relative risk was calculated to be 1.2157, which is greater than 1.2 and suggests an association between smoking and an increased probability of aortic stenosis.
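The same quantities can be read off a 2x2 table of counts. The sketch below rebuilds the table from the numerators and denominators used in the calculations above (51 and 43 smokers, 54 and 67 non-smokers) and uses base R's `prop.table()`; the object names are illustrative assumptions.

```r
# 2x2 table of counts implied by the calculations above
counts <- matrix(c(51, 43,
                   54, 67),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(smoking  = c("smoker", "non-smoker"),
                                 stenosis = c("yes", "no")))

# Part a: joint proportions, each cell divided by the grand total (215)
joint <- prop.table(counts)

# Part b: row proportions, i.e. proportion with stenosis within each smoking group
by_smoking <- prop.table(counts, margin = 1)

# Part c: relative risk = risk among smokers / risk among non-smokers
rr <- by_smoking["smoker", "yes"] / by_smoking["non-smoker", "yes"]
round(rr, 4)
```

Keeping the counts in one table makes it harder to mix up the denominator (grand total versus group total) between parts a and b.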
# 14.6% of Americans live below the poverty line, 20.7% speak a language other than English (foreign language) at home, and 4.2% fall into both categories
#c P(P)-P(PnL)
(.146-0.042)
[1] 0.104
#d P(A)+P(B) - P(A and B) <-P(PuL)
(0.146+0.207-0.042)
[1] 0.311
#e P(P u L)^c
(1-0.311)
[1] 0.689
#f P(P n L) = P(P)*P(L)
0.146*0.207
[1] 0.030222
a. The two events of living below the poverty line and speaking a foreign language are not disjoint events as they can both occur at the same time.
b. exempt
c. The percent of Americans that live below the poverty line and only speak English at home is 10.4%.
d. The percent of Americans that live below the poverty line or speak a foreign language at home is 31.1%.
e. The percent of Americans that live above the poverty line and only speak English at home is 68.9%.
f. Living below the poverty line and speaking a foreign language at home are not independent events, as the product P(P)*P(L) = 0.0302 does not equal the given value P(P n L) = 0.042.
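Parts c through f all follow from the three given percentages, so it can help to store them once and reuse them. This is a minimal sketch; the variable names are hypothetical, and the independence check simply compares the given joint probability with the product of the marginals.

```r
# Given values from the exercise
p_poverty <- 0.146   # P(P): lives below the poverty line
p_lang    <- 0.207   # P(L): speaks a foreign language at home
p_both    <- 0.042   # P(P n L): both

p_poverty_only <- p_poverty - p_both            # part c: P(P) - P(P n L)
p_either       <- p_poverty + p_lang - p_both   # part d: general addition rule
p_neither      <- 1 - p_either                  # part e: complement of P(P u L)
p_if_indep     <- p_poverty * p_lang            # part f: P(P) * P(L)

# Independence would require P(P n L) to equal P(P) * P(L)
independent <- isTRUE(all.equal(p_both, p_if_indep))
independent
```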
#a P(none) = 1-P(OuTuF)
(1-(.25+.15+.28))
[1] 0.32
#b P(no more than 1)= P(Z)+P(O)
0.32+0.25
[1] 0.57
#c P(at least 1) = 1-P(no days)
1-0.32
[1] 0.68
#d P(none) * P(none)
0.32*0.32
[1] 0.1024
#e P(at least 1)* P(at least 1)
0.68^2
[1] 0.4624
a. The probability that a student chosen at random hasn’t missed any days of school is 0.32.
b. The probability that a student chosen at random misses no more than one day is 0.57.
c. The probability that a student chosen at random misses at least one day of school is 0.68.
d. The probability that two kids at a DeKalb County elementary school both miss no days of school is 0.1024.
e. The probability that two kids at DeKalb both miss at least one day is 0.4624.
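The chain of calculations for parts a through e can be sketched with named values. The names `p_O`, `p_T`, and `p_F` follow the abbreviations in the code comments above and are otherwise assumptions; parts d and e also assume the two students' absences are independent, which is what multiplying the probabilities relies on.

```r
# Given probabilities, labeled as in the comments above (O, T, F)
p_O <- 0.25
p_T <- 0.15
p_F <- 0.28

p_none         <- 1 - (p_O + p_T + p_F)   # part a: complement of missing any days
p_at_most_one  <- p_none + p_O            # part b: none or exactly one day
p_at_least_one <- 1 - p_none              # part c: complement of none

# Parts d and e: two students, assuming their absences are independent
p_both_none         <- p_none^2
p_both_at_least_one <- p_at_least_one^2

round(c(a = p_none, b = p_at_most_one, c = p_at_least_one,
        d = p_both_none, e = p_both_at_least_one), 4)
```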
#a
459/20000
[1] 0.02295
#b P(excellent health)+P(no coverage)-P(excellent and no)
(4657+2524-459)/20000
[1] 0.3361
a. The probability that the respondent has excellent health and no health coverage is 0.02295.
b. The probability that the respondent has excellent health or doesn’t have health coverage is 0.3361.
#c
0.2099/0.8738
[1] 0.2402152
#d
0.0230/0.1262
[1] 0.1822504
#e P(A ∩ B) = P(A) ⋅ P(B)
0.2329*0.8738
[1] 0.203508
a. Being in excellent health and having health coverage are not mutually exclusive, as the probability of them both occurring together is greater than 0.
b. The probability that a randomly chosen individual has excellent health is 0.2329.
c. The probability that a randomly chosen individual has excellent health given they have health coverage is 0.2402.
d. The probability that a randomly chosen individual has excellent health given they don’t have health coverage is 0.1823.
e. Having excellent health and having health coverage are not independent events, as the product P(excellent health) ⋅ P(coverage) is 0.2035 while the joint probability used in part c is 0.2099. Equivalently, the conditional probability from part c (0.2402) differs from the marginal probability 0.2329.
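The conditional probabilities and the independence check can be sketched from the four proportions used in the code above. The variable names are hypothetical; the values are the ones that appear in the calculations for parts c through e.

```r
# Proportions used in the calculations above
p_excellent     <- 0.2329   # P(excellent health)
p_coverage      <- 0.8738   # P(health coverage)
p_exc_and_cov   <- 0.2099   # P(excellent health and coverage)
p_exc_and_nocov <- 0.0230   # P(excellent health and no coverage)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_exc_given_cov   <- p_exc_and_cov / p_coverage            # part c
p_exc_given_nocov <- p_exc_and_nocov / (1 - p_coverage)    # part d

# Part e: under independence, P(A and B) would equal P(A) * P(B)
p_if_indep <- p_excellent * p_coverage
round(c(given_cov = p_exc_given_cov, given_nocov = p_exc_given_nocov,
        joint = p_exc_and_cov, product = p_if_indep), 4)
```

Comparing `joint` with `product`, or `given_cov` with `p_excellent`, gives the same conclusion about independence.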
Write a sentence (or a few) using all the different types of formatting text shown in slide 29 of the Day 1 slides. Your choice of text does not matter or even need to make sense. Although the TA will appreciate it if you make them laugh.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet, weekly exercise, possible tobacco use, and health care coverage.
The dataset `cdc` is a sample of 20,000 people from the survey conducted in 2000, and contains responses from a subset of the questions asked on the survey.
Load the `cdc` dataset from the web using the `source()` command below:
source("http://www.openintro.org/stat/data/cdc.R")
`cdc` dataset.
nrow(cdc)
[1] 20000
ncol(cdc)
[1] 9
rows: 20000
columns: 9
For each variable, identify both its “statistical” variable type (numerical (discrete, continuous) or categorical (nominal, ordinal)) and its R variable type.
Fill in your answers in the table I created below. I recommend using the Visual editor in RStudio for filling in the table.
variable name | R type | variable type |
---|---|---|
genhlth | factor | categorical - ordinal |
exerany | numeric | categorical - nominal |
hlthplan | numeric | categorical - nominal |
smoke100 | numeric | categorical - nominal |
height | integer | numerical - discrete |
weight | integer | numerical - discrete |
wtdesire | integer | numerical - discrete |
age | integer | numerical - discrete |
gender | factor | categorical - nominal |
What is the difference between the average weight and the average desired weight?
mean(cdc$weight)
[1] 169.683
mean(cdc$wtdesire)
[1] 155.0939
mean(cdc$weight) - mean(cdc$wtdesire)
[1] 14.5891
The difference between average weight and desired weight is 14.6 pounds.
Which of the height, weight, and desired weight variables has the most variability? Which has the least variability?
sd(cdc$height)
[1] 4.125954
sd(cdc$weight)
[1] 40.08097
sd(cdc$wtdesire)
[1] 32.01331
Weight has the most variability and height has the least variability.
The coefficient of variation (CV) divides the standard deviation by the mean so that we have a measure of variation relative to the mean. This makes it easier to compare variability of measures that are on very different scales or even units since the CV is unitless. Calculate the CV for the height, weight, and desired weight variables. Which has the most and which has the least variability? Are these answers consistent with part d)?
sd(cdc$weight)/mean(cdc$weight)
[1] 0.2362109
sd(cdc$height)/mean(cdc$height)
[1] 0.06141376
sd(cdc$wtdesire)/mean(cdc$wtdesire)
[1] 0.2064125
The CV for weight is 0.2362, the CV for height is 0.0614, and the CV for desired weight is 0.2064. Weight has the most variability, with the highest calculated value, and height has the least variability, with the lowest calculated value. This is consistent with part d).
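Since the same ratio is computed three times, a small helper function avoids repeating it. This is a sketch: `cv()` is a hypothetical helper (not a base R function), the `sapply()` call runs only if `cdc` has already been loaded as described above, and the toy vector holds made-up values used to illustrate that the CV does not depend on units.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Apply to the three variables at once, if cdc is already loaded
if (exists("cdc")) {
  print(sapply(cdc[c("height", "weight", "wtdesire")], cv))
}

# Made-up weights in pounds, then converted to kilograms:
# the CV is the same on both scales because the units cancel
toy_lb <- c(150, 160, 170, 180, 190)
toy_kg <- toy_lb * 0.4536
all.equal(cv(toy_lb), cv(toy_kg))
```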
hlthplan
Calculate the mean of the `hlthplan` variable. How do we interpret this mean? In other words, what does this mean measure?
mean(cdc$hlthplan)
[1] 0.8738
The average of the hlthplan variable is 0.8738 and it means that they are rating the