HW 1: BSTA 511/611 F25

Author

Rosalyn Minh

Published

October 11, 2025

Due 10/11/25 at 11 pm

Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F25/blob/main/homework/HW_1_F25_bsta511.qmd

Graded exercises

The exercises listed below will be graded for this assignment. You are strongly encouraged to complete the entire assignment. You will receive feedback on exercises you turn in that are not being graded.

  • Non-Book exercises
    • NBE 2: Tylenol during pregnancy?
  • Book exercises
    • 1.12, 1.31, 2.6, 2.14
  • R exercises
    • R2: BRFSS

Directions

Important
  • * Starred exercises in the section Book exercises may be completed by hand (such as on paper or using a tablet) instead of using Quarto.
  • If you complete this part of the assignment not using Quarto, you will be uploading 3 files on Sakai for this HW: qmd & html files for your R work, and a pdf with your written work.
  • If you are completing the homework on paper, you can use a scanning app, such as Adobe Scan, to create a pdf of your assignment.
  • Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.
    • Use the assignment .qmd file linked to above as a template for your own assignment.
  • For each question, make sure to show all of your work. This includes all code and resulting output in the html file to support your answers for exercises requiring work done in R (including any arithmetic calculations).
  • For each question, include a sentence summarizing the answer for that question in the context of the research question.
Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your Qmd file and rendering frequently helps you catch your errors more quickly.

Non-book exercises

NBE 1

a) Upload a photo using Sakai submission

To help me learn your names and faces, please upload a photo of yourself on Sakai. You will find the Upload Photo “assignment” in the Assignments section of Sakai. These photos will only be seen by me and the TA.

b) Background survey

c) Slack post

NBE 2: Tylenol during pregnancy?

On Monday, September 22, 2025, President Trump and Health Secretary Robert F. Kennedy Jr. claimed that taking acetaminophen (the active ingredient in the pain reliever Tylenol) during pregnancy was a cause of autism in the child. This led to an extensive debate on the topic, much of which has focused on a systematic review of observational studies researching the association between prenatal acetaminophen use and autism spectrum disorder (Evaluation of the evidence on acetaminophen use and neurodevelopmental disorders using the Navigation Guide methodology). Many news reports have discussed this issue, one of which is from the New York Times, Debate Flares Over an Unproven Link Between Tylenol and Autism.

a) Causation?

Can causation be deduced based on the observational studies in the systematic review linked to above? Explain why or why not.

No, causation cannot be deduced from these observational studies. They show associations, but associations alone do not establish causation because confounding variables, biases, and systematic uncertainty cannot be ruled out.

b) Experiment?

Would it be ethical to conduct an experiment to study the effects of prenatal acetaminophen use on the development of autism spectrum disorder?

No, a randomized controlled trial would not be ethical, because it would mean conducting an experiment on pregnant women that could have harmful side effects.

c) Sampling: stratified

Describe a stratified sampling method that could be used to study this topic in a hypothetical study.

One stratified sampling method would be to define strata such as mothers in different trimesters and to randomize acetaminophen usage within each trimester. The researchers could then vary the dosage each mother uses, taking into account any fevers or other ailments during the pregnancy that would require acetaminophen.
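The selection step of a stratified sample can be sketched in R. This is only an illustration under stated assumptions: the sampling frame `frame`, its `trimester` column, and the sample size of 10 per stratum are all hypothetical and simulated, not taken from any study.

```r
set.seed(511)

# Hypothetical sampling frame: 300 pregnant patients, 100 per trimester
frame <- data.frame(
  id = 1:300,
  trimester = rep(c("first", "second", "third"), each = 100)
)

# Split the frame into strata, then draw a simple random sample of 10 from each
strata <- split(frame, frame$trimester)
stratified_sample <- do.call(
  rbind,
  lapply(strata, function(s) s[sample(nrow(s), 10), ])
)

# Every stratum contributes the same number of sampled patients
table(stratified_sample$trimester)
```

Sampling separately within each stratum guarantees that every trimester is represented, which a single simple random sample of 30 would not.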

d) Sampling: cluster

Describe a cluster sampling method that could be used to study this topic in a hypothetical study.

A cluster sampling method could treat prenatal clinics and hospitals across the United States as clusters. A handful of clinics would be randomly selected from each state, and acetaminophen exposure and developmental outcomes would be measured for their patients in a longitudinal study.
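A minimal sketch of cluster sampling in R, assuming a hypothetical frame of 50 clinics with 20 patients each (the data frame `patients` and the clinic names are invented for illustration). Whole clinics are selected at random, and every patient in a selected clinic is included.

```r
set.seed(511)

# Hypothetical frame: 50 clinics (clusters) with 20 patients each
patients <- data.frame(
  patient_id = 1:1000,
  clinic = rep(paste0("clinic_", 1:50), each = 20)
)

# Randomly select 5 clinics
chosen_clinics <- sample(unique(patients$clinic), 5)

# Keep every patient who belongs to a selected clinic
cluster_sample <- patients[patients$clinic %in% chosen_clinics, ]

nrow(cluster_sample)  # 5 clinics x 20 patients = 100
```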

e) Sampling: Multistage sample

Describe a multistage sample sampling method that could be used to study this topic in a hypothetical study.

For the multistage sample, I would first randomly select 20 countries, each from a different part of the world: 10 with higher acetaminophen usage and 10 with the lowest usage. I would then randomly choose 5 cities from each country and follow the babies postpartum in a longitudinal study.
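The two selection stages can be sketched in R as follows. The frame is hypothetical (6 countries with 8 cities each, chosen only to keep the example small), so the counts differ from the 20 countries and 5 cities described above.

```r
set.seed(511)

# Hypothetical frame: 6 countries, each with 8 cities
frame <- expand.grid(
  city = paste0("city_", 1:8),
  country = paste0("country_", 1:6)
)

# Stage 1: randomly select 3 countries
stage1 <- sample(unique(as.character(frame$country)), 3)

# Stage 2: within each selected country, randomly select 2 cities
stage2 <- do.call(rbind, lapply(stage1, function(cn) {
  cities <- frame[frame$country == cn, ]
  cities[sample(nrow(cities), 2), ]
}))

stage2
```

Unlike cluster sampling, the second stage samples within each selected unit instead of keeping everything in it.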

f) Sampling method type?

One of the studies included in the systematic review included “All Singleton live born children in Sweden with linkable personal identifiers with follow-up until Dec 31,2021.” What type of sampling method did they use?

The study included every singleton live-born child in Sweden with linkable personal identifiers, so it is closer to a census of that population than to a simple random sample. No random selection of children or randomized clinical trial was involved.

Book exercises

  • Exercises are in the last section of the chapter.
  • Exercises are numbered as chapter#.exercise#. For example, exercise 1.2 is Chapter 1 #2, which is on pg. 75.

1.2 Sinusitis and antibiotics, Part I.

  • Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file.

  • Write your answers in complete sentences as if communicating the results to a collaborator.

  • If you are having difficulty with exercise 1.2, take a look at exercise 1.1, whose answers are at the back of the book.

    #a 
    66/85
    [1] 0.7764706
    65/81
    [1] 0.8024691

a. 77.65% of patients in the treatment group experienced an improvement in symptoms.

0.7764706

b. 80.25% of patients experienced an improvement in symptoms in the control group.

0.8024691

c.

In the control group there was a higher percentage of patients experiencing improvement in symptoms.

d.

A possible explanation is that our bodies have built up immunity to antibiotics through constant exposure. Another is that, because the sinusitis was acute, the immune system did not need such strong antibiotics to recover.

1.4 Buteyko method, study components

a. The main research question is whether the Buteyko method is effective, that is, whether it reduces asthma symptoms and improves quality of life.

b. The subjects are 600 people aged 18-69 who relied on medication for asthma.

c. The variables are quality of life, activity, asthma symptoms, and medication reduction. These are all discrete numerical variables, as they are measured on a scale of 0 to 10.

1.12 Herbal remedies

a. The population of interest is healthy volunteers, and the sample size is 437 subjects.

b. The results cannot be generalized, as the sample is small and includes only healthy young adults, so the findings might not apply to an older population. There have also been multiple conflicting results, so the topic needs further study with more established results before generalizing to a larger population.

c. No, the results cannot establish a causal relationship. The sample would need to be larger and examine the effects of Echinacea on different age groups, as each group will have different immune responses.

1.31 Income at the coffee shop

a. The median would be more representative because it is more robust against outliers; the mean would include the 2 new salaries and be skewed by them.

b. The IQR would be a better measure of variability because it spans the 25th through 75th percentiles and excludes the outliers (the 2 new people's salaries). The IQR is therefore more robust than the standard deviation, which takes the new salaries into account.
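The difference in robustness can be checked numerically. The incomes below are hypothetical values chosen for illustration, not the figures from the textbook exercise.

```r
# Hypothetical incomes in dollars (not the textbook's data)
incomes <- c(60000, 62000, 63000, 65000, 66000, 68000, 70000)

# Add two very high salaries
with_outliers <- c(incomes, 225000, 250000)

# Helper (hypothetical name) that collects the four summaries
summarize_income <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

summarize_income(incomes)
summarize_income(with_outliers)
```

In this sketch the mean and standard deviation change sharply once the two high salaries are added, while the median and IQR move only slightly.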

1.32 Midrange

The midrange is not robust. Because it is computed from only the maximum and minimum, outliers directly affect its value and can skew it severely, even when most of the values are nowhere near those extremes.
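A short sketch of this behavior, using a helper `midrange()` (a hypothetical function written here, since base R does not provide one) and made-up values:

```r
# Midrange: the average of the minimum and maximum
midrange <- function(x) (min(x) + max(x)) / 2

values <- c(2, 3, 3, 4, 5)      # hypothetical data
midrange(values)                # 3.5

# A single extreme value drags the midrange far from the bulk of the data
midrange(c(values, 100))        # 51

# The median barely moves
median(values)                  # 3
median(c(values, 100))          # 3.5
```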

1.38 Smoking and stenosis

See Section 1.6.2 for more on how the relative risk is calculated.

51/215
[1] 0.2372093
67/215
[1] 0.3116279
54/215
[1] 0.2511628
43/215
[1] 0.2
#b
51/94
[1] 0.5425532
54/121
[1] 0.446281
#c
(51/94)/(54/121)
[1] 1.215721

a. 23.72% of the patients were smokers with aortic stenosis, 31.16% were non-smokers without aortic stenosis, 25.12% were non-smokers with aortic stenosis, and 20% were smokers without aortic stenosis.

b. 0.5425 is the proportion of smokers that had aortic stenosis.

0.4462 is the proportion of non-smokers that have aortic stenosis.

c. The relative risk was calculated to be 1.2157, which is greater than 1, suggesting an association between smoking and an increased probability of stenosis.
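The same calculations can be organized as a 2x2 table. This is a sketch using the counts that appear in the calculations above (51, 43, 54, and 67, totaling 215); the object names are illustrative.

```r
# Counts taken from the calculations above: rows are smoking status,
# columns are aortic stenosis status
stenosis <- matrix(
  c(51, 43,
    54, 67),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking = c("smoker", "non-smoker"),
                  stenosis = c("yes", "no"))
)

# Part a: joint proportions out of all 215 patients
prop.table(stenosis)

# Part b: risk of stenosis within each smoking group
risk <- stenosis[, "yes"] / rowSums(stenosis)
risk

# Part c: relative risk, smokers versus non-smokers
relative_risk <- unname(risk["smoker"] / risk["non-smoker"])
relative_risk
```

Keeping the counts in one table makes it harder to mix up the denominators: 215 for the joint proportions, and the row totals 94 and 121 for the risks.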

* 2.6 Poverty and language

# 14.6% of Americans live below the poverty line, 20.7% speak a language other than English (foreign language) at home, and 4.2% fall into both categories
#c P(P)-P(PnL)
(.146-0.042)
[1] 0.104
#d P(A)+P(B) - P(A and B) <-P(PuL)
(0.146+0.207-0.042)
[1] 0.311
#e P(P u L)^c
(1-0.311)
[1] 0.689
#f P(P n L) = P(P)*P(L)
0.146*0.207
[1] 0.030222

a. The two events of living below the poverty line and speaking a foreign language are not disjoint events as they can both occur at the same time.

b. exempt

c. The percent of people that live below the poverty line and only speak English at home is 10.4%.

d. The percent of Americans that live below the poverty line or speak a foreign language at home is 31.1%.

e. The percent of Americans that live above the poverty line and only speak English at home is 68.9%.

f. Living below the poverty line and speaking a foreign language at home are not independent events, as the product P(P) × P(L) = 0.0302 does not equal the given joint probability of 0.042.
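The independence check can be written out explicitly. This sketch uses the three percentages given in the exercise; the variable names are illustrative.

```r
p_poverty  <- 0.146   # P(P): below the poverty line
p_language <- 0.207   # P(L): foreign language at home
p_both     <- 0.042   # P(P and L), given in the exercise

# If the events were independent, the joint probability would equal this product
p_if_independent <- p_poverty * p_language
p_if_independent

# Compare with a tolerance instead of ==, since these are floating-point values
isTRUE(all.equal(p_both, p_if_independent))   # FALSE, so not independent

# Equivalent check: P(P | L) differs from P(P)
p_both / p_language
```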

* 2.8 School absences

#a P(none) = 1-P(OuTuF)
(1-(.25+.15+.28))
[1] 0.32
#b P(no more than 1)= P(Z)+P(O)
0.32+0.25
[1] 0.57
#c P(at least 1) = 1-P(no days)
1-0.32
[1] 0.68
#d P(none) * P(none)
0.32*0.32
[1] 0.1024
#e P(at least 1)* P(at least 1)
0.68^2
[1] 0.4624

a. The probability that a student chosen at random hasn’t missed any days of school is 0.32.

b. The probability that a student chosen at random misses no more than one day is 0.57.

c. The probability that a student chosen at random misses at least one day of school is 0.68.

d. The probability that two kids at a DeKalb County elementary school both miss no days of school is 0.1024.

e. The probability that two kids at DeKalb both miss at least one day is 0.4624.

* 2.10 Health coverage, frequencies

#a 
459/20000
[1] 0.02295
#b P(excellent health)+P(no coverage)-P(excellent and no)
(4657+2524-459)/20000
[1] 0.3361

a. The probability that the respondent has excellent health and no health coverage is 0.02295.

b. The probability that the respondent has excellent health or doesn’t have health coverage is 0.3361.

* 2.14 Health coverage, relative frequencies

#c
0.2099/0.8738
[1] 0.2402152
#d
0.0230/0.1262
[1] 0.1822504
#e P(A ∩ B) = P(A) ⋅ P(B)
0.2329*0.8738
[1] 0.203508

a. Being in excellent health and having health coverage are not mutually exclusive, as the probability of both occurring together is greater than 0.

b. The probability that a randomly chosen individual has excellent health is 0.2329.

c. The probability that a randomly chosen individual has excellent health given they have health coverage is 0.2402.

d. The probability that a randomly chosen individual has excellent health given they don’t have health coverage is 0.1823.

e. Having excellent health and having health coverage are not independent events, as the product P(excellent health) × P(coverage) is 0.2035 while the joint probability used in part c is 0.2099.

R exercises

R1: Formatting text practice

Write a sentence (or a few) using all the different types of formatting text shown in slide 29 of the Day 1 slides. Your choice of text does not matter or even need to make sense. Although the TA will appreciate it if you make them laugh.

R2: BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet, weekly exercise, possible tobacco use, and health care coverage.

The dataset cdc is a sample of 20,000 people from the survey conducted in 2000, and contains responses from a subset of the questions asked on the survey.

Load the cdc dataset from the web using the source() command below:

source("http://www.openintro.org/stat/data/cdc.R")
  • Answer the questions below about the cdc dataset.
  • Please do not delete the statements of the questions so that they remain numbered in the correct order.
  • Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the knitted html file.
  • Write your answers in complete sentences as if communicating the results to a collaborator.

a) How many rows and columns are in the dataset?

nrow(cdc)
[1] 20000
ncol(cdc)
[1] 9

rows: 20000

columns: 9

b) Variable types

For each variable, identify both its “statistical” variable type (numerical (discrete, continuous) or categorical (nominal, ordinal)) and its R variable type.
Fill in your answers in the table I created below. I recommend using the Visual editor in RStudio for filling in the table.

variable name   R type    variable type
genhlth         factor    categorical - ordinal
exerany         numeric   categorical - nominal
hlthplan        numeric   categorical - nominal
smoke100        numeric   categorical - nominal
height          integer   numerical - discrete
weight          integer   numerical - discrete
wtdesire        integer   numerical - discrete
age             integer   numerical - discrete
gender          factor    categorical - nominal
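One way to look up R types for every column at once is `sapply()` with `class()`. The sketch below uses a small hypothetical data frame `toy` as a stand-in so it runs without downloading anything; it is not the cdc data, and its values are made up.

```r
# Hypothetical stand-in with the same kinds of columns as cdc (not the real data)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair")),
  exerany = c(0, 1, 1),          # stored as numeric (double)
  weight  = c(175L, 125L, 105L)  # stored as integer
)

# Rows and columns in one call
dim(toy)

# R class of each column
sapply(toy, class)

# With the real dataset loaded, the same calls would be:
# dim(cdc)
# sapply(cdc, class)
```

Note that a 0/1 column such as `exerany` is numeric in R even though it is a categorical variable statistically.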

c) Average weight vs. desired weight

What is the difference between the average weight and the average desired weight?

mean(cdc$weight)
[1] 169.683
mean(cdc$wtdesire)
[1] 155.0939
mean(cdc$weight) - mean(cdc$wtdesire)
[1] 14.5891

The difference between average weight and desired weight is 14.6 pounds.

d) Compare variability

Which of the height, weight, and desired weight variables has the most variability? Which has the least variability?

sd(cdc$height)
[1] 4.125954
sd(cdc$weight)
[1] 40.08097
sd(cdc$wtdesire)
[1] 32.01331

Weight has the most variability and height has the least variability.

e) Coefficient of variation

The coefficient of variation (CV) divides the standard deviation by the mean so that we have a measure of variation relative to the mean. This makes it easier to compare variability of measures that are on very different scales or even units since the CV is unitless. Calculate the CV for the height, weight, and desired weight variables. Which has the most and which has the least variability? Are these answers consistent with part d)?

sd(cdc$weight)/mean(cdc$weight)
[1] 0.2362109
sd(cdc$height)/mean(cdc$height)
[1] 0.06141376
sd(cdc$wtdesire)/mean(cdc$wtdesire)
[1] 0.2064125

The CV for weight is 0.2362, the CV for height is 0.0614, and the CV for desired weight is 0.2064. Weight has the most variability, with the highest CV, and height has the least variability, with the lowest CV. This is consistent with part d).
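The three calculations can be wrapped in a small helper. `cv()` is a hypothetical function defined here (base R has no built-in with this name), and the heights are made-up values used only to show that the CV does not depend on units.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Hypothetical heights in inches, then converted to centimeters
heights_in <- c(60, 64, 66, 70, 72)
heights_cm <- heights_in * 2.54

# The SD changes with the units, but the CV does not
sd(heights_in)
sd(heights_cm)
cv(heights_in)
cv(heights_cm)

# With the cdc dataset loaded, all three CVs could be computed in one call:
# sapply(cdc[c("height", "weight", "wtdesire")], cv)
```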

f) Mean of the hlthplan

Calculate the mean of the hlthplan variable. How do we interpret this mean? In other words, what does this mean measure?

mean(cdc$hlthplan)
[1] 0.8738

The average of the hlthplan variable is 0.8738 and it means that they are rating the