Instructions

The .Rmd file for this homework assignment is posted on GLOW. Please download this document and type your answers directly in the document in the space that is marked. This will keep everything neat and in order.

The problems in this homework were adapted from problems in “Stats: data and models” (our course textbook) as well as problems from the “OpenIntro” statistics textbook, which is freely available online (https://www.openintro.org/book/os/).

A few RMarkdown reminders

Anything written in this document on a plain white background is interpreted as text (write here as you would write in a Microsoft Word document), and any code written on a gray background is interpreted as R code (write here as you would write in the R console, with one command per line). We call the gray areas code chunks.

Knitting an RMarkdown document means turning the input file (.Rmd) into a nicely formatted output file (.pdf in this case). To knit your current template document, select Knit from the buttons at the top of the RMarkdown file. A nicely formatted document will open up in a new window. When the document is knit, R automatically saves the output file in the same directory as your template file. So you should now have a file called HW1.pdf in the same folder as HW1.Rmd on your computer. If you can’t find these files, check your downloads folder (this is the most likely location since you downloaded the template file from GLOW; feel free to move these files to any folder that you wish).

Note that, when used on white background, the # symbol creates titles and headers that show up in large font in the output document. We use text such as # Exercise 1 to label the exercises. When used inside of a code chunk, the # symbol in R creates a code comment. This can be used to write regular text inside of a code chunk. Any text written in a code chunk after the # symbol is ignored; it is not run as R code. We use comments to leave you messages inside of lab and HW templates, such as “write your answer here”.

Just typing code into a gray chunk does not run the code. To run our code we need to either knit our document (which runs all code chunks in order), or else run a single chunk. To run a single chunk, you can use the Run button on the chunk, or you can highlight the code and click Run on the top right corner of the R Markdown editor.

While it is good to run individual chunks to test out ideas and explore, we also strongly recommend that you knit your document frequently as you work on the homework. The knitting process can sometimes lead to errors. Students often run into trouble if they wait to try knitting their document until right before the assignment deadline. Knitting after each exercise ensures that you catch errors as you go.

Reminder: When an R Markdown document is knitting, it will run every chunk in order. To execute properly when knitting, a code chunk can only rely on variables or packages that were defined/loaded inside of the same RMarkdown document, above the current chunk. On the other hand, when running a chunk locally, the chunk has access to any variables/packages that are currently in your environment. Forgetting to define variables in order inside of the document is the most common reason for knitting errors.
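As a small illustration of the ordering rule, here is a sketch of two chunks written one after the other. The variable names scores and avg_score are hypothetical and are not part of the homework template. The second chunk knits only because scores is defined in a chunk above it.

```r
# Chunk 1: define the variable inside the document
scores <- c(82, 91, 77)  # hypothetical values, for illustration only

# Chunk 2 (further down the document): uses the variable defined above
avg_score <- mean(scores)
avg_score
```

If the two chunks were swapped, running them one at a time might still work (scores could already be in your environment), but knitting would stop with an "object not found" error.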

Exercise 1

Ian Walker, a psychologist at the University of Bath, wondered whether drivers treat bicycle riders differently when they wear helmets. He rigged his bicycle with an ultrasonic sensor that could measure how close each car was that passed him. He then rode on alternating days with and without a helmet. Out of 2500 cars passing him, he found that when he wore his helmet, motorists passed 3.35 inches closer to him, on average, than when his head was bare (Source: NYTimes, Dec. 10, 2006).

1a: What is the observational unit in his dataset? In other words, what does a single row of data correspond to?

A single car passing Ian Walker while he rode his bike. Each row of data corresponds to one passing car.

1b: What is the response variable in this dataset? Is it quantitative or categorical?

The response variable is the distance at which each car passed him, in inches. This is quantitative.

1c: What is the predictor variable in this dataset? Is it quantitative or categorical?

Whether Ian Walker was riding with or without his helmet. This is categorical.
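To make the structure of such a dataset concrete, here is a minimal sketch of how a few rows might be stored in R. The data frame name passes, the column names, and all of the distances are made up for illustration; they are not Walker's measurements.

```r
# Hypothetical rows: one row per passing car (values are invented)
passes <- data.frame(
  helmet      = c("yes", "yes", "no", "no"),   # predictor (categorical)
  distance_in = c(45.2, 48.0, 50.1, 49.7)      # response (quantitative), in inches
)

# Inspect the variable types
str(passes)

# Mean passing distance for each helmet condition
tapply(passes$distance_in, passes$helmet, mean)
```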

Exercise 2

Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM10) in \(\mu\)g/m\(^3\). Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.

2a: What was the main research question of the study?

Is there a relationship between air pollutants and preterm births in Southern California?

2b: Who are the subjects in this study, and how many are included?

The subjects are babies born in Southern California between 1989 and 1993; 143,196 births are included.

2c: What is the response variable in this study? Is the variable being used as a numerical or a categorical variable?

Length of gestation, which is being used as a numerical variable.

2d: What are the predictor variables in this study? For each variable, is it being used as a numerical or a categorical variable?

The predictor variables are the levels of carbon monoxide, nitrogen dioxide, ozone, and coarse particulate matter (PM10). All four are being used as numerical variables.

2e: Comment on whether or not the results of the study can be generalized to a larger population, and whether or not they can be used to establish causal relationships.

These data cannot be used to establish a causal relationship because this is an observational study: confounding variables such as health habits and family history may contribute to preterm births and would need to be accounted for. The large sample size is a point in favor of generalizing the results; however, I think the research is missing a vital part: the effect air pollution has on the health of those giving birth, and how that effect contributes to preterm births. Because of this gap, I would not generalize the results or use them to show causal relationships.

Exercise 3

Identify the flaw(s) in reasoning in the following scenarios. Explain the likely consequence of the flawed reasoning, and what the researchers in the study should have done differently if they wanted to make such strong conclusions.

3a: Students at an elementary school are given a questionnaire that they are asked to return after their parents have completed it. One of the questions asked is, “Do you find that your work schedule makes it difficult for you to spend time with your kids after school?” Of the parents who replied, 85% said “no”. Based on these results, the school officials conclude that a great majority of the parents have no difficulty spending time with their kids after school.

There is response bias because parents do not want the school to know that work makes it difficult for them to prioritize their children, or do not want to look like “bad parents”. This leads more parents to answer “no”. There may also be non-response bias: only the parents who returned the questionnaire are counted, and parents with demanding work schedules may be less likely to reply. As a consequence, the 85% figure likely overstates how many parents have no difficulty. Rephrasing the question may lead to less biased responses, especially because the current wording makes a parent feel guilty for saying yes. Something like “how many hours do you spend with your kids after school” would make the question more neutral.

3b: An orthopedist administers a questionnaire to 30 of his patients who do not have any joint problems and finds that 20 of them regularly go running. He concludes that running decreases the risk of joint problems.

The patients he surveyed were not randomly selected from all of his patients, which would have produced more variation in responses. By surveying only patients who do not have joint problems, he excludes the people who go running and do have joint problems, so he has nothing to compare against. To collect more conclusive results, he should increase the sample size and randomly select patients both with and without joint problems.

Exercise 4: Exploring New York City flight data

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

The nycflights dataset is built into the openintro package, so we first need to load this package. We should also load tidyverse, since that is our main tool for data exploration.

library(openintro)
library(tidyverse)

Once the openintro package is loaded, you should be able to load the data simply by running the following code chunk. Please make sure that, after running this code chunk, the dataset appears in your environment.

data(nycflights)

The dataset nycflights is stored as a data frame. Each row represents an observation and each column represents a variable. To view the names of the variables, we can run the following chunk:

names(nycflights)
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"

This returns the names of the variables in this data frame. To view the dataset or a subset of the dataset, try typing View(nycflights), head(nycflights), or glimpse(nycflights) in your console.

4a: What is the observational unit in this dataset?

A single flight. Each row of the data frame corresponds to one flight.

4b: How many observational units and how many variables are in this dataset?

32,735 observational units (flights) and 16 variables.
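One way to check these counts is with nrow(), ncol(), and dim(). The sketch below uses a small made-up data frame called toy_flights (a hypothetical stand-in, not the real data) so that it runs on its own; the same three calls can be applied to nycflights once it is loaded.

```r
# Toy stand-in for a flights table (all values are made up)
toy_flights <- data.frame(
  dep_delay = c(-3, 12, 0, 45, 7),
  origin    = c("JFK", "LGA", "EWR", "JFK", "LGA"),
  dest      = c("LAX", "ORD", "LAX", "SFO", "BOS")
)

nrow(toy_flights)  # number of observational units (rows)
ncol(toy_flights)  # number of variables (columns)
dim(toy_flights)   # both at once: rows first, then columns
```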

4c: Using ggplot, make a plot that displays the distribution of flight delays (stored in the variable dep_delay). Comment on what you see.

Hint: check out what we did in class when we wanted to visualize a single quantitative variable.

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of departure delays is unimodal and strongly skewed to the right. Most flights depart close to their scheduled time, so the bulk of the data sits near zero, while a long right tail of outliers extends to delays of several hundred minutes.
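The stat_bin() message above suggests choosing a bin width explicitly. Here is a minimal sketch of how that might look, using a made-up data frame (toy_delays is hypothetical, and the 15-minute bin width is an arbitrary choice for illustration).

```r
library(ggplot2)

# Made-up delays in minutes, with one large outlier
toy_delays <- data.frame(dep_delay = c(-5, -2, 0, 3, 8, 14, 27, 60, 180))

# binwidth = 15 groups the delays into 15-minute bins
p <- ggplot(data = toy_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
p
```

Setting binwidth also silences the message, since ggplot no longer has to fall back on the default of 30 bins.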

4d: Compute both the overall average delay time, as well as the average delay time among flights headed to Los Angeles.

Hint: check out the use of filter and summarize from class.

# Overall mean (and median) departure delay, dropping any missing values first
nycflights %>%
  filter(!is.na(dep_delay)) %>%
  summarize(mean_delay = mean(dep_delay), median_delay = median(dep_delay))

# Mean departure delay among flights headed to Los Angeles (LAX)
nycflights %>%
  filter(dest == "LAX") %>%
  summarize(mean_delay = mean(dep_delay))
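The filter-then-summarize pattern can be tried out on a small made-up table first. In this sketch, toy_flights and its values are hypothetical, and na.rm = TRUE is used as an alternative way of skipping missing delays.

```r
library(dplyr)

# Made-up flights, including one missing delay
toy_flights <- data.frame(
  dep_delay = c(-3, 12, NA, 45, 7),
  dest      = c("LAX", "ORD", "LAX", "LAX", "BOS")
)

# Overall mean delay, skipping the missing value
overall <- toy_flights %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

# Mean delay among flights to LAX only
lax <- toy_flights %>%
  filter(dest == "LAX") %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

overall
lax
```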

Exercise 5: More about the New York City flight data

5a: Create a visual summary of the dep_delay variable, broken down by departing airport (origin).

Hint: Since dep_delay is quantitative and origin is categorical, a nice option is side-by-side boxplots of dep_delay for each origin airport. To create side by side boxplots using ggplot, we should specify:

  • data = nycflights
  • aes(x=origin, y=dep_delay)
  • geom_boxplot

Create this graphic and describe what you see. Is this a useful graph? Your answer should discuss the shape of the distribution of the dep_delay variable.

  ggplot(data = nycflights, aes(x=origin, y=dep_delay)) + geom_boxplot()

The departure delay distribution appears to be skewed right for all three airports (EWR, JFK, and LGA). Because the vast majority of flights are delayed by far less than 500 minutes, the boxes are compressed near zero and it is difficult to see where the departure delays are centered. Changing the scale of the box plot may help; however, a boxplot still does not show how many flights fall at each delay time. Because the data are so heavily skewed, it is difficult to get useful information from this graph.

5b: Next, let’s compare numerical summaries of dep_delay across the three origin airports. Using group_by() and summarize(), print out the mean and median dep_delay for each of the three origin airports.

nycflights %>%
  group_by(origin) %>%
  summarize(mean(dep_delay), median(dep_delay))

5c: What explains the big differences between the means and the medians? Based on the shape of the distributions, do you prefer to compare the airports using the mean or the median?

The mean and median differ because the distribution is skewed right: planes are more likely to be very late than very early. The median is not affected by the skew, but the mean is pulled toward the long right tail, which is why the mean is larger. Because the data are right skewed, it is more reasonable to compare the airports using the medians.
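A tiny made-up example shows how a single extreme value moves the mean but not the median. The vector delays below is hypothetical and is not taken from nycflights.

```r
# Made-up delays in minutes, with one extreme late departure
delays <- c(-2, 0, 1, 3, 5, 240)

mean(delays)    # pulled upward by the 240-minute delay
median(delays)  # stays near the typical flight
```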

5d: Creating a categorical variable.

Practically speaking, most travelers probably don’t care if their flight departs late if the amount of the delay is 10 minutes or less. Let’s define a delayed_departure flight to be one that leaves more than 10 minutes late (dep_delay > 10).

Using mutate(), create a variable called delayed_departure that stores TRUE if dep_delay > 10 and FALSE otherwise. Please see the notes from class if you are stuck!

  nycflights <- nycflights %>%
  mutate(delayed_departure = dep_delay > 10)

5e: How many total flights had a delayed departure according to this definition?

Hint: you could take your dataset and group_by(delayed_departure) %>% summarize(n()) to count up the number of TRUE and FALSE values for this variable.

  nycflights %>% 
  group_by(delayed_departure) %>%
  summarize(n())

8271 flights
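As a cross-check on a count like this, a logical variable can also be summed directly, since R treats TRUE as 1 and FALSE as 0. The sketch below uses made-up delays rather than the real data.

```r
# Made-up departure delays in minutes
dep_delay <- c(-3, 12, 0, 45, 7, 11)

# TRUE when the flight left more than 10 minutes late
delayed_departure <- dep_delay > 10

sum(delayed_departure)   # number of delayed flights
mean(delayed_departure)  # proportion of flights that were delayed
```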

5f: In parts a, b, and c of this question, you treated delay as a quantitative variable. In part d, you treated delay as a categorical variable. What are the real life pros and cons of each approach?

Delay data can be heavily skewed, because planes tend to be late far more often than they are very early, so a quantitative analysis has to take into account large outliers that may not be useful. Treating delay as a categorical variable is useful in this case because a few extreme delays no longer dominate the summary. The cost is lost information: a flight that leaves 11 minutes late is counted the same as one that leaves hours late. A categorical variable is less useful for more symmetric data, where the mean already describes the center well and splitting the flights into two groups adds little. Quantitative variables are more useful for understanding general trends, which helps in understanding the greater context.