The .Rmd file for this homework assignment is posted on GLOW. Please download this document and type your answers directly in the document in the space that is marked. This will keep everything neat and in order.
The problems in this homework were adapted from problems in “Stats: data and models” (our course textbook) as well as problems from the “OpenIntro” statistics textbook, which is freely available online (https://www.openintro.org/book/os/).
Anything written in this document on plain white background is interpreted as text (write here as you would write in a Microsoft Word document), and any code written on gray background is interpreted as R code (write here as you would write in the R console, with one command per line). We call the gray areas code chunks.
Knitting an RMarkdown document means turning the input file
(.Rmd) into a nicely formatted output file
(.pdf in this case). To knit your current template
document, select knit from the buttons at the top of the
RMarkdown file. A nicely formatted document will open up in a new
window. When the document is knit, R automatically saves the output file
in the same directory as your template file. So you should now have a
file called HW1.pdf in the same folder as
HW1.Rmd on your computer. If you can’t find these files,
check your downloads folder (this is the most likely location since you
downloaded the template file from GLOW; feel free to move these files to
any folder that you wish).
Note that, when used on white background, the # symbol
creates titles and headers that show up in large font in the output
document. We use text such as # Exercise 1 to label the
exercises. When used inside of a code chunk, the # symbol
in R creates a code comment. This can be used to write
regular text inside of a code chunk. Any text written in a code chunk
after the # symbol is ignored; it is not run as R code. We
use comments to leave you messages inside of lab and HW templates, such
as “write your answer here”.
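As a small illustration of the comment behavior described above, here is a sketch of what the inside of a code chunk might look like. The variable name `total` is made up for this example.

```r
# This whole line is a comment, so R skips it.
total <- 2 + 3  # A comment can also follow code on the same line.
total
```

Only `total <- 2 + 3` and `total` are run as R code; everything after each `#` is ignored.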
Just typing code into a gray chunk does not run the code. To
run our code we need to either knit our document (which runs all code
chunks in order), or else run a single chunk. To run a single chunk, you
can use the Run button on the chunk, or you can highlight
the code and click Run on the top right corner of the R
Markdown editor.
While it is good to run individual chunks to test out ideas and explore, we also strongly recommend that you knit your document frequently as you work on the homework. The knitting process can sometimes lead to errors. Students often run into trouble if they wait to try knitting their document until right before the assignment deadline. Knitting after each exercise ensures that you catch errors as you go.
Reminder: When an R Markdown document is knitting, it will run every chunk in order. To execute properly when knitting, a code chunk can only rely on variables or packages that were defined/loaded inside of the same R Markdown document, above the current chunk. On the other hand, when running a chunk locally, the chunk has access to any variables/packages that are currently in your environment. Forgetting to define variables in order inside of the document is the most common reason for knitting errors.
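The ordering rule can be sketched with two hypothetical chunks, shown here back to back. The names `quiz_scores` and `mean_score` are invented for illustration and are not part of the homework.

```r
# Earlier chunk: define the data that a later chunk will need.
quiz_scores <- c(7, 9, 10)  # hypothetical values

# Later chunk: this works when knitting only because quiz_scores
# was defined above it in the same document.
mean_score <- mean(quiz_scores)
mean_score
```

If `quiz_scores` had only been typed into the console, running the later chunk by hand would still work, but knitting would stop with an "object not found" error.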
Ian Walker, a psychologist at the University of Bath, wondered whether drivers treat bicycle riders differently when they wear helmets. He rigged his bicycle with an ultrasonic sensor that could measure how close each car was that passed him. He then rode on alternating days with and without a helmet. Out of 2500 cars passing him, he found that when he wore his helmet, motorists passed 3.35 inches closer to him, on average, than when his head was bare (Source: NYTimes, Dec. 10, 2006).
Ian Walker riding his bike
The response variable is the distance, in inches, between each passing car and his bicycle. This is quantitative.
Whether Ian Walker rode with or without his helmet. This is categorical.
Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM10) in \(\mu\)g/m\(^3\). Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.
Is there a relationship between air pollutants and preterm births in Southern California?
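The cases are births (babies); data were collected on 143,196 births.
The response variable is length of gestation, which is numerical.
The explanatory variables are the levels of carbon monoxide (parts per million), nitrogen dioxide and ozone (parts per hundred million), and coarse particulate matter PM10 (\(\mu\)g/m\(^3\)). These are all numerical.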
These data cannot be used to establish a causal relationship because this is an observational study: confounding variables such as health habits and family history may also contribute to preterm births and would need to be accounted for. Given the large sample size, it is tempting to generalize the results; however, I think the research is missing a vital part: the effects that air pollution has on those giving birth (such as their health) and how these effects contribute to preterm births. Because of this, the results should not be generalized or used to show causal relationships.
Identify the flaw(s) in reasoning in the following scenarios. Explain the likely consequence of the flawed reasoning, and what the researchers in the study should have done differently if they wanted to make such strong conclusions.
There is a response bias because parents may not want the school to know that they have difficulty prioritizing their children because of work, or may not want to look like “bad parents”. This leads more parents to answer “no”. Rephrasing the question may lead to less biased responses, especially because the current wording makes respondents feel guilty if they say yes. A more neutral question, such as “How many hours do you spend with your kids after school?”, would help.
The subjects he surveys are not a random sample of all patients, which would produce more variation in responses. Because he includes only patients who do not have joint problems, he excludes people who run and may have joint pain. He should draw a larger, random sample to reach more conclusive results.
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.
The nycflights dataset is built into the
openintro package, so we first need to load this package.
We should also load tidyverse, since that is our main tool
for data exploration.
library(openintro)
library(tidyverse)
Once the openintro package is loaded, you should be able
to load the data simply by running the following code chunk. Please make
sure that, after running this code chunk, the dataset appears in your
environment.
data(nycflights)
The dataset nycflights is stored as a data
frame. Each row represents an observation and each
column represents a variable. To view the names of the
variables, we can run the following chunk:
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
This returns the names of the variables in this data frame. To view
the dataset or a subset of the dataset, try typing
View(nycflights), head(nycflights), or
glimpse(nycflights) in your console.
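To get a feel for these inspection commands without loading the full dataset, here is a minimal sketch on a made-up data frame. `toy_flights` and its values are hypothetical; only the column names are borrowed from nycflights.

```r
# Hypothetical miniature data frame using three of the nycflights column names.
toy_flights <- data.frame(
  origin    = c("JFK", "LGA", "EWR", "JFK"),
  dest      = c("LAX", "ORD", "SFO", "BOS"),
  dep_delay = c(-3, 12, 45, 0)
)

names(toy_flights)    # column (variable) names
head(toy_flights, 2)  # first two rows
dim(toy_flights)      # number of rows (observations), then columns (variables)
```

The same calls applied to `nycflights` report its variables and its numbers of rows and columns.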
Each observational unit is a single flight.
There are 32,735 observational units and 16 variables.
dep_delay). Comment
on what you see. Hint: check out what we did in class when we wanted to visualize a single quantitative variable.
ggplot(data = nycflights, aes(x = air_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram shows air_time rather than dep_delay, so it describes how long flights are in the air, not how late they depart. The distribution is bimodal: the tallest peak is near an air time of 60 minutes, with a second peak near 150 minutes. The air times span a range of about 600 minutes, and the few flights with air times near 600 minutes are possible outliers.
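Since the exercise asks about dep_delay, a histogram of that variable might be set up as in the sketch below. The data frame `toy_delays` and its values are invented for illustration (they are not the real nycflights delays), and the bin width of 15 minutes is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical departure delays in minutes (not the real nycflights values).
toy_delays <- data.frame(dep_delay = c(-5, -2, 0, 3, 4, 8, 15, 40, 120, 300))

# Setting binwidth replaces the default of 30 bins with 15-minute bins.
delay_hist <- ggplot(data = toy_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

delay_hist  # printing the object draws the histogram
```

Choosing `binwidth` explicitly also avoids the `stat_bin()` message about picking a better value. For the homework, `data = nycflights` would take the place of `toy_delays`.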
Hint: check out the use of filter and
summarize from class.
nycflights_clean <- nycflights %>%
  filter(!is.na(dep_delay))  # drop flights with no recorded departure delay
nycflights_clean %>%
  summarize(mean_delay = mean(dep_delay), median_delay = median(dep_delay))
nycflights %>% summarize(mean_delay = mean(dep_delay))
nycflights %>% filter(dest == "LAX") %>% summarize(mean_delay = mean(dep_delay))
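As a hedged alternative to filtering out missing values first, `mean()` and `median()` accept `na.rm = TRUE`. The sketch below uses a made-up tibble, `toy_flights`, with one missing delay included on purpose; the values are not from nycflights.

```r
library(dplyr)

# Hypothetical flights; one dep_delay is missing on purpose.
toy_flights <- tibble(
  dest      = c("LAX", "LAX", "ORD", "LAX", "SFO"),
  dep_delay = c(10, 20, 5, NA, 0)
)

# Keep only LAX flights, then summarize while ignoring the missing value.
lax_summary <- toy_flights %>%
  filter(dest == "LAX") %>%
  summarize(
    mean_delay   = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE)
  )

lax_summary
```

Without `na.rm = TRUE`, a single missing delay would make both summaries `NA`.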
dep_delay variable,
broken down by departing airport (origin). Hint: Since dep_delay is quantitative
and origin is categorical, a nice option is side-by-side
boxplots of dep_delay for each origin airport. To create
side-by-side boxplots using ggplot, we should specify:
data = nycflights
aes(x=origin, y=dep_delay)
geom_boxplot
Create this graphic and describe what you see. Is this a useful
graph? Your answer should discuss the shape of the
distribution of the dep_delay variable.
ggplot(data = nycflights, aes(x=origin, y=dep_delay)) + geom_boxplot()
The departure delays are skewed right for EWR, JFK, and LGA. Because so many flights are clustered well below 500 minutes of delay, it is difficult to tell where the departure delays are centered. Changing the scale of the boxplot may help with that; however, a boxplot still does not show how many flights fall at each delay time. Because the data are so heavily skewed, it is difficult to get useful information from this graph.
dep_delay across the three origin airports.
Using group_by() and summarize(), print out
the mean and median dep_delay for each of the three
origin airports.
nycflights %>%
  group_by(origin) %>%
  summarize(mean(dep_delay), median(dep_delay))
The mean and median differ because of the skew: planes are more likely to be very late than very early. The median is not affected by the skew, but the mean is, which is why the mean is larger. Because the data are right skewed, it is more reasonable to compare the medians.
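The effect of skew on the two summaries can be seen in a small made-up example. The vector `delays` below is hypothetical: mostly small delays plus one very late flight.

```r
# Hypothetical delays in minutes: mostly small, with one very late flight.
delays <- c(-4, -2, 0, 1, 3, 5, 240)

mean(delays)    # pulled upward by the single 240-minute delay
median(delays)  # the middle value, which ignores how extreme 240 is
```

Replacing 240 with an even larger value would raise the mean further while leaving the median unchanged.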
Practically speaking, most travelers probably don’t care if their
flight departs late if the amount of the delay is 10 minutes or less.
Let’s define a delayed_departure flight to be one that
leaves more than 10 minutes late (dep_delay > 10).
Using mutate(), create a variable called
delayed_departure that stores TRUE if
dep_delay > 10 and FALSE otherwise. Please
see the notes from class if you are stuck!
nycflights <- nycflights %>%
  mutate(delayed_departure = dep_delay > 10)
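To see what the comparison inside `mutate()` produces, here is a sketch on a tiny made-up tibble. `toy_flights` and its four delay values are hypothetical, chosen to include the boundary case of exactly 10 minutes.

```r
library(dplyr)

# Hypothetical delays, including one that is exactly 10 minutes.
toy_flights <- tibble(dep_delay = c(-3, 10, 11, 45))

# dep_delay > 10 is evaluated row by row and stored as TRUE or FALSE.
toy_flights <- toy_flights %>%
  mutate(delayed_departure = dep_delay > 10)

toy_flights$delayed_departure
```

A delay of exactly 10 minutes is stored as `FALSE`, because the definition requires more than 10 minutes.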
Hint: you could take your dataset and
group_by(delayed_departure) %>% summarize(n()) to count
up the number of TRUE and FALSE values for this variable.
nycflights %>%
  group_by(delayed_departure) %>%
  summarize(n())
8271 flights
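A count is often easier to interpret next to a proportion. The sketch below uses dplyr's `count()`, which is shorthand for grouping and then counting, on a made-up tibble. `toy_flights` and its five values are hypothetical.

```r
library(dplyr)

# Hypothetical TRUE/FALSE delay indicator for five flights.
toy_flights <- tibble(delayed_departure = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# count() gives one row per category with its size in column n;
# dividing by the total turns the counts into proportions.
delay_counts <- toy_flights %>%
  count(delayed_departure) %>%
  mutate(proportion = n / sum(n))

delay_counts
```

Applied to `nycflights`, the same pipeline would report both the number and the share of flights that left more than 10 minutes late.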
delay as a quantitative variable. In part d, you treated
delay as a categorical variable. What are the real life
pros and cons of each approach?
Because planes are more likely to be very late than very early, delay data can be skewed, so we need to account for large outliers that may distort our summaries. Treating delay as a categorical variable is useful in this case because extreme delays no longer dominate the results. A categorical variable is less useful for more symmetric data, where collapsing the values into groups discards detail without much benefit. Treating delay as a quantitative variable keeps more information and is better for understanding general trends and the greater context.