We will grade the knitted PDF or HTML document from within your private GitHub repository. Remember to make regular, small commits (e.g., at least one commit per question) to save your work. We will grade the latest knit, as long as it occurs before the start of the class in which we advance to the next chapter. As always, reach out with questions via GitHub Issues or during office hours.
You are probably sick of seeing the ozone data, but there’s still more to do with the file. Ozone concentration is a univariate measurement, so we can use basic exploratory data analysis approaches to examine the data.
Load the necessary R packages into your R session.
Recreate the pipe of dplyr functions that you used to
import the data, select and rename the variables listed below, drop
missing observations, and assign the output with a good name.
sample_measurement renamed as ozone_ppm (ozone measurement in ppm)
datetime (date in YYYY-MM-DD format and time of measurement in HH:MM:SS)
Check that the data imported correctly.
## # A tibble: 16,914 × 2
## ozone_ppm datetime
## <dbl> <dttm>
## 1 0.017 2019-01-01 07:00:00
## 2 0.017 2019-01-01 08:00:00
## 3 0.017 2019-01-01 09:00:00
## 4 0.017 2019-01-01 10:00:00
## 5 0.015 2019-01-01 11:00:00
## 6 0.017 2019-01-01 12:00:00
## 7 0.028 2019-01-01 13:00:00
## 8 0.03 2019-01-01 14:00:00
## 9 0.036 2019-01-01 15:00:00
## 10 0.036 2019-01-01 16:00:00
## # ℹ 16,904 more rows
Through Question 5, you will use all of the available ozone measurements from January 2019 through January 2020. Starting in Question 6, you will use a subset of the dataset: ozone concentration measurements on July 4, 2019.
Guess the location, dispersion, and shape of ozone concentration data, based on the definitions of each described in the coursebook. No code needed; just use your intuition. For shape, take a look at the coursebook appendix on reference distributions.
I think the location (central tendency) of the ozone concentration data will be around 0.04 to 0.05 ppm. The ozone concentration data taken in Fort Collins has a range of 0 to 1.0, and the maximum concentration reported was 0.096 ppm, so I hypothesize that most of the data will fall near the median. I think the dispersion will be small, with most of the data falling between 0.03 and 0.06 ppm, because I assume a typical day in Fort Collins has relatively moderate ozone concentrations. I think the shape of the data will be linear and will not vary much.
Calculate the quartiles of ozone_ppm. What is the
minimum? Maximum? Median?
## 0% 25% 50% 75% 100%
## 0.000 0.023 0.033 0.043 0.096
Create a
similar table for ozone_ppm. Hint: You will need to
investigate table options in the knitr package.
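One possible way to build such a table is sketched below with knitr::kable(). The table the question calls "similar" is not shown here, so the layout, the column labels, and the object names (ozone_quartiles, quartile_table) are assumptions. The values are copied from the quantile() output above.

```r
library(knitr)

# quartile values copied from the quantile() output above
ozone_quartiles <- data.frame(
  statistic = c("Minimum", "25th percentile", "Median",
                "75th percentile", "Maximum"),
  ozone_ppm = c(0.000, 0.023, 0.033, 0.043, 0.096)
)

# kable() formats a data frame as a table in the knitted document;
# digits, col.names, and caption are standard kable() arguments
quartile_table <- knitr::kable(ozone_quartiles,
                               digits = 3,
                               col.names = c("Statistic", "Ozone (ppm)"),
                               caption = "Quartiles of ozone concentration")
quartile_table
```

In a real answer the data frame would be computed from the imported data rather than typed in by hand.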
Using either relevant ggplot2 geom option,
create a cumulative distribution plot of ozone_ppm. Tweak
the axis ranges for optimal data representation, using
scale_*_continuous() with breaks = and
minor_breaks = arguments. Add axis labels, title, subtitle,
and theme.
Create a histogram of ozone_ppm. Within the
geom, mess with the number of bins (e.g., 20, 50, 75, 100,
200) to explore the true shape and granularity of the data. Match the
plot style (e.g., title, subtitle, axis labels, theme) you chose in
Question 3, with the relevant adjustments such as “Histogram” instead of
“Cumulative Distribution Plot”.
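A minimal sketch of how the bin counts suggested above could be compared side by side. The data frame ozone_example is simulated stand-in data (not the Fort Collins measurements), and the helper plot_hist() is a hypothetical name introduced only for illustration.

```r
library(ggplot2)

set.seed(1)
# simulated stand-in data, rounded to 0.001 ppm like the printed measurements
ozone_example <- data.frame(
  ozone_ppm = pmax(0, round(rnorm(500, mean = 0.033, sd = 0.012), 3))
)

# hypothetical helper: draw one histogram for a given number of bins
plot_hist <- function(n_bins) {
  ggplot(ozone_example, aes(x = ozone_ppm)) +
    geom_histogram(bins = n_bins, fill = "lightblue", color = "black") +
    labs(x = "Ozone Concentration (ppm)",
         y = "Count",
         title = paste("Histogram with", n_bins, "bins")) +
    theme_minimal()
}

# one plot per bin count listed in the question
bin_options <- c(20, 50, 75, 100, 200)
hist_list <- lapply(bin_options, plot_hist)

# print a single plot to inspect it, e.g. the 50-bin version
hist_list[[2]]
```

Because the measurements appear to be recorded to three decimal places, a range of 0 to 0.096 ppm holds at most 97 distinct values, so a 200-bin histogram will necessarily contain empty bins.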
What mathematical concept is a histogram (Q4) attempting to visualize?
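The histogram shows how often different ozone concentration values were recorded; that is, it visualizes the frequency distribution of the data, which approximates the underlying probability density function.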
Based on the histogram (Q4), does ozone concentration appear to be normally distributed?
Yes, I think the ozone concentration appears to be approximately normally distributed, with a couple of outliers. The histogram is relatively symmetrical, with 0.03 ppm close to the axis of symmetry.
Based on the histogram (Q4), do you see any possible outliers? Skewness? How might this affect the spread and central tendency?
There appears to be an outlier close to zero, which could pull the mean lower than it would otherwise be and make it a less reliable measure of central tendency.
Generate a boxplot of ozone concentration on the y-axis with a title,
subtitle, y-axis label, and theme consistent with the style of the
previous two plots. Use quotes ("") as the x
arguments within the calls to the aesthetic and labels to remove the
x-axis scale and label.
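As a rough check on what the boxplot will flag, the sketch below applies the 1.5 × IQR rule that geom_boxplot() uses by default for its whiskers. The quartiles are copied from the quantile() output above; the variable names are illustrative.

```r
# quartiles copied from the quantile() output above
q_min <- 0.000
q1    <- 0.023
q3    <- 0.043
q_max <- 0.096

# interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence = lower_fence, upper_fence = upper_fence)

# would the extreme values be drawn as individual points?
min_flagged <- q_min < lower_fence
max_flagged <- q_max > upper_fence
c(min_flagged = min_flagged, max_flagged = max_flagged)
```

With these numbers the minimum of 0 ppm sits inside the lower fence (about -0.007 ppm), while the maximum of 0.096 ppm lies above the upper fence (about 0.073 ppm), so only high values would be drawn as individual points.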
Use the following code to create a dataframe for use in the remaining
questions. These ozone concentration measurements were taken on July 4,
2019 in Fort Collins, CO. This code detects certain characters with the
datetime object and filters to observations containing
those characters. There are other ways this could have been done (e.g.,
dplyr::filter() with %in% operator).
## # A tibble: 48 × 2
## ozone_ppm datetime
## <dbl> <dttm>
## 1 0.058 2019-07-04 00:00:00
## 2 0.049 2019-07-04 01:00:00
## 3 0.043 2019-07-04 02:00:00
## 4 0.043 2019-07-04 03:00:00
## 5 0.036 2019-07-04 04:00:00
## 6 0.033 2019-07-04 05:00:00
## 7 0.036 2019-07-04 06:00:00
## 8 0.034 2019-07-04 07:00:00
## 9 0.036 2019-07-04 08:00:00
## 10 0.033 2019-07-04 09:00:00
## # ℹ 38 more rows
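The prompt mentions that the single-day subset could be built in other ways, such as dplyr::filter() with the %in% operator. A minimal sketch of that alternative follows. It uses a small made-up tibble (ozone_example) in place of the imported data, and it assumes the timestamps are in UTC; the real data's time zone may differ.

```r
library(dplyr)

# made-up stand-in rows; only the column names match the real data
ozone_example <- tibble::tibble(
  ozone_ppm = c(0.031, 0.058, 0.049, 0.040),
  datetime  = as.POSIXct(c("2019-07-03 23:00:00",
                           "2019-07-04 00:00:00",
                           "2019-07-04 01:00:00",
                           "2019-07-05 00:00:00"),
                         tz = "UTC")
)

# convert each timestamp to a calendar date, then keep the dates of interest
target_dates <- as.Date("2019-07-04")
ozone_day_alt <- ozone_example %>%
  dplyr::filter(as.Date(datetime, tz = "UTC") %in% target_dates)

ozone_day_alt
```

Comparing dates rather than matching characters makes it easy to extend target_dates to several days.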
Define autocorrelation as it relates to ozone concentration measurement.
Autocorrelation is the correlation of a series with itself across time: it describes how strongly a measurement is related to earlier measurements of the same variable, and therefore whether the current pattern in the series is likely to continue. With the ozone concentration data taken in Fort Collins, we can predict that the data taken in the following years will likely look very similar to this.
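To make the definition concrete, the sketch below computes the lag-1 autocorrelation by hand and compares it with stats::acf(). It uses only the first ten July 4 values printed above, so the result is illustrative rather than the answer for the full day.

```r
# first ten July 4 measurements, copied from the printed tibble above
x <- c(0.058, 0.049, 0.043, 0.043, 0.036,
       0.033, 0.036, 0.034, 0.036, 0.033)
n <- length(x)
x_bar <- mean(x)

# lag-1 autocorrelation: pair each value with the next one,
# then scale by the total sum of squared deviations
lag1_manual <- sum((x[-n] - x_bar) * (x[-1] - x_bar)) / sum((x - x_bar)^2)

# acf() stores lag 0 first, so lag 1 is the second element
lag1_acf <- stats::acf(x, lag.max = 1, plot = FALSE)$acf[2]

all.equal(lag1_manual, lag1_acf)
```

A value near 1 would mean that each hourly reading closely tracks the previous one.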
Create an autocorrelation plot of ozone concentration, using
stats::acf() and include axis labels and title. Describe
what you see based on the features of interest outlined in the
coursebook.
Define partial autocorrelation as it relates to ozone concentration measurement.
Partial autocorrelation shows the direct relationship between a specific observation in a time series and its past observations at a given lag, after accounting for the observations in between, and helps identify patterns and relationships in time series data. By looking at the partial autocorrelation of the Fort Collins ozone concentration data, we can get a better understanding of what future data recordings will look like.
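One way to see how the two measures relate: at lag 1 there are no intermediate observations to account for, so the partial autocorrelation equals the ordinary autocorrelation. The sketch below checks this with the first ten July 4 values printed earlier; it is illustrative only, and the object names are assumptions.

```r
# first ten July 4 measurements, copied from the printed tibble
x <- c(0.058, 0.049, 0.043, 0.043, 0.036,
       0.033, 0.036, 0.034, 0.036, 0.033)

# acf() output starts at lag 0, so lag 1 is the second element
acf_lag1 <- stats::acf(x, lag.max = 3, plot = FALSE)$acf[2]

# pacf() output starts at lag 1, so lag 1 is the first element
pacf_vals <- stats::pacf(x, lag.max = 3, plot = FALSE)$acf
pacf_lag1 <- pacf_vals[1]

all.equal(acf_lag1, pacf_lag1)
```

The two measures can differ from lag 2 onward, because the partial autocorrelation removes the part of the correlation that is carried through the shorter lags.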
Now create a partial autocorrelation plot of the July 4 ozone concentration with axis labels. Describe what you see. How does this compare to the autocorrelation plot in the previous question?
# set global options for figures, code, warnings, and messages
knitr::opts_chunk$set(fig.width = 6, fig.height = 4, fig.path = "../figs/",
                      echo = FALSE, warning = FALSE, message = FALSE)
# load packages
library(tidyverse) # attaches dplyr, ggplot2, readr, tidyr, and stringr
# ozone: import, select, drop missing observations, rename
ozone_data <- readr::read_csv(file = "ftc_o3.csv") %>%
  dplyr::select(sample_measurement, datetime) %>%
  tidyr::drop_na() %>%
  dplyr::rename(ozone_ppm = sample_measurement)
# examine dataframe object (printing the tibble shows dimensions and column types)
ozone_data
# calculate quantiles of ozone concentration
quartiles <- quantile(ozone_data$ozone_ppm, probs = seq(0, 1, 0.25))
quartiles
# plot cumulative distribution of ozone concentration
ozone_data %>%
  ggplot2::ggplot(mapping = aes(x = ozone_ppm)) +
  geom_step(stat = "ecdf") +
  labs(x = "Ozone Concentration (ppm)",
       y = "Cumulative Fraction",
       title = "Cumulative Distribution of Ozone Concentration") +
  scale_y_continuous(limits = c(-0.05, 1.05),
                     expand = c(0, 0),
                     breaks = seq(from = 0,
                                  to = 1,
                                  by = 0.1)) +
  scale_x_continuous(minor_breaks = seq(from = 0,
                                        to = 0.1,
                                        by = 0.005)) +
  geom_segment(data = data.frame(x = quantile(ozone_data$ozone_ppm),
                                 y = rep.int(-.05, 5),
                                 xend = quantile(ozone_data$ozone_ppm),
                                 yend = seq(from = 0, to = 1, by = 0.25)),
               aes(x = x, y = y, xend = xend, yend = yend),
               color = "red",
               linetype = "dashed") +
  theme_minimal()
# create histogram of ozone concentration
ggplot(data = ozone_data, aes(x = ozone_ppm)) +
  geom_histogram(bins = 50,
                 fill = "lightblue",
                 color = "black") +
  labs(x = "Ozone Concentration (ppm)",
       y = "Count",
       title = "Histogram of Ozone Concentration") +
  theme_minimal()
# create ozone boxplot
ggplot(data = ozone_data,
       aes(x = "",
           y = ozone_ppm)) +
  geom_boxplot(width = 0.5,
               fill = "lightblue") +
  labs(x = "",
       y = "Ozone Concentration (ppm)",
       title = "Boxplot of Ozone Concentration") +
  theme_minimal()
# create subset of data with only one day to examine daily pattern
# I did not ask you to code this because we have not discussed dates or stringr
# You need to uncomment the below three lines and run it; check object names
ozone_day <- ozone_data %>%
  dplyr::filter(stringr::str_detect(string = datetime,
                                    pattern = "2019-07-04"))
ozone_day
# create autocorrelation plot with ozone_day df
stats::acf(ozone_day$ozone_ppm,
           main = "Autocorrelation of Ozone Concentration on July 4th, 2019",
           xlab = "Lag (hours)",
           ylab = "Correlation Coefficient")
# create partial autocorrelation plot
stats::pacf(ozone_day$ozone_ppm,
            main = "Partial Autocorrelation of Ozone Concentration on July 4th, 2019",
            xlab = "Lag (hours)",
            ylab = "Partial Correlation Coefficient")