Grading

We will grade the knitted PDF or HTML document from within your private GitHub repository. Remember to make regular, small commits (e.g., at least one commit per question) to save your work. We will grade the latest knit, as long as it occurs before the start of the class in which we advance to the next chapter. As always, reach out with questions via GitHub Issues or during office hours.

Data

You are probably sick of seeing the ozone data, but there’s still more to do with the file. Ozone concentration is a univariate measurement, so we can use basic exploratory data analysis approaches to examine the data.

Preparation

Load the necessary R packages into your R session.

Recreate the pipe of dplyr functions that you used to import the data, select and rename the variables listed below, drop missing observations, and assign the output to an object with a descriptive name.

Check that the data imported correctly.

## # A tibble: 16,914 × 2
##    ozone_ppm datetime           
##        <dbl> <dttm>             
##  1     0.017 2019-01-01 07:00:00
##  2     0.017 2019-01-01 08:00:00
##  3     0.017 2019-01-01 09:00:00
##  4     0.017 2019-01-01 10:00:00
##  5     0.015 2019-01-01 11:00:00
##  6     0.017 2019-01-01 12:00:00
##  7     0.028 2019-01-01 13:00:00
##  8     0.03  2019-01-01 14:00:00
##  9     0.036 2019-01-01 15:00:00
## 10     0.036 2019-01-01 16:00:00
## # ℹ 16,904 more rows

Chapter 5 Homework: Exploring Univariate Data

Through Question 8, you will use all of the available ozone measurements from January 2019 through January 2020. Starting in Question 9, you will use a subset of the dataset: ozone concentration measurements on July 4, 2019.

Question 1: Definitions

Guess the location, dispersion, and shape of ozone concentration data, based on the definitions of each described in the coursebook. No code needed; just use your intuition. For shape, take a look at the coursebook appendix on reference distributions.

I think the ozone concentration data will have a central tendency of around 0.04 to 0.05 ppm. The ozone concentration data taken in Fort Collins spans roughly 0 to 0.1 ppm, with a maximum reported concentration of 0.096 ppm, so I hypothesize that a majority of the data will fall near the median. I think the dispersion of the data will, for the most part, be small: I hypothesize that most of the data will fall between 0.03 and 0.06 ppm, assuming that a typical day in Fort Collins has relatively moderate ozone concentration levels. I think the shape of the data will be linear and not vary much.

Question 2: Quartiles

Calculate the quartiles of ozone_ppm. What is the minimum? Maximum? Median?

##    0%   25%   50%   75%  100% 
## 0.000 0.023 0.033 0.043 0.096
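
As a minimal sketch of how the quartile output maps onto the minimum, median, and maximum, the snippet below runs stats::quantile() on a small made-up vector. The name toy_ppm and its values are hypothetical and are not the Fort Collins data.

```r
# toy_ppm is a hypothetical vector used only for illustration
toy_ppm <- c(0.010, 0.020, 0.030, 0.040, 0.050)

# quartiles: 0% is the minimum, 50% the median, 100% the maximum
toy_quartiles <- stats::quantile(toy_ppm, probs = seq(0, 1, 0.25))
toy_quartiles

# the named elements agree with the dedicated summary functions
all.equal(unname(toy_quartiles["0%"]), min(toy_ppm))
all.equal(unname(toy_quartiles["50%"]), stats::median(toy_ppm))
all.equal(unname(toy_quartiles["100%"]), max(toy_ppm))
```

The same reading applies to the ozone_ppm output: the 0% and 100% entries are the extremes and the 50% entry is the median.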

Extra Credit

Create a similar table for ozone_ppm. Hint: You will need to investigate table options in the knitr package.
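
One possible approach to the table, sketched under the assumption that the knitr package is installed, is to reshape the quartiles into a small data frame and pass it to knitr::kable(). The vector toy_ppm, the object quartile_table, and the caption text are hypothetical choices for illustration.

```r
# toy_ppm is a hypothetical vector used only for illustration
toy_ppm <- c(0.010, 0.020, 0.030, 0.040, 0.050)
toy_quartiles <- stats::quantile(toy_ppm, probs = seq(0, 1, 0.25))

# reshape the named vector into a two-column data frame
quartile_table <- data.frame(quantile  = names(toy_quartiles),
                             ozone_ppm = unname(toy_quartiles))

# render a formatted table when knitr is available
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(quartile_table,
               digits = 3,
               caption = "Quartiles of ozone concentration (ppm)")
}
```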

Question 3: Cumulative Distribution Plot

Using either relevant ggplot2 geom option, create a cumulative distribution plot of ozone_ppm. Tweak the axis ranges for optimal data representation, using scale_*_continuous() with breaks = and minor_breaks = arguments. Add axis labels, title, subtitle, and theme.
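
To make concrete what a cumulative distribution plot encodes, the sketch below uses stats::ecdf() on a made-up vector: each point on the curve is the fraction of observations at or below a given concentration. The name toy_ppm and its values are hypothetical.

```r
# toy_ppm is a hypothetical vector used only for illustration
toy_ppm <- c(0.010, 0.020, 0.020, 0.030, 0.050)

# stats::ecdf() returns a function giving the fraction of values <= its input
toy_ecdf <- stats::ecdf(toy_ppm)

toy_ecdf(0.020)  # fraction of observations at or below 0.020 ppm
toy_ecdf(0.050)  # the largest observation always maps to 1
```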

Question 4: Histogram

Create a histogram of ozone_ppm. Within the geom, mess with the number of bins (e.g., 20, 50, 75, 100, 200) to explore the true shape and granularity of the data. Match the plot style (e.g., title, subtitle, axis labels, theme) you chose in Question 3, with the relevant adjustments such as “Histogram” instead of “Cumulative Distribution Plot”.

Question 5: Concept

What mathematical concept is a histogram (Q4) attempting to visualize?

The histogram approximates the probability density function of the data: it shows how often different ozone concentration values were recorded.
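
The link between bin counts and a density can be checked numerically. The sketch below uses simulated values (sim_ppm is an assumption, not the ozone data) and computes the histogram bins without plotting them.

```r
set.seed(1)
# sim_ppm is simulated data standing in for ozone measurements (an assumption)
sim_ppm <- stats::rnorm(n = 1000, mean = 0.033, sd = 0.012)

# compute histogram bins without drawing the plot
h <- graphics::hist(sim_ppm, breaks = 20, plot = FALSE)

# counts per bin sum to the number of observations
sum(h$counts)

# density times bin width sums to 1, as for a probability density
sum(h$density * diff(h$breaks))
```

Dividing each count by the number of observations times the bin width is what turns a frequency histogram into a density estimate.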

Question 6: Distribution

Based on the histogram (Q4), does ozone concentration appear to be normally distributed?

Yes, the ozone concentration appears to be approximately normally distributed, with a couple of outliers. The histogram is relatively symmetrical, with 0.03 ppm close to the axis of symmetry.
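
A visual impression of normality can be supplemented with a quick numerical and graphical check. The sketch below is not part of the assignment; it uses simulated data (sim_ppm is an assumption) to show a simple skewness estimate and a normal Q-Q plot.

```r
set.seed(42)
# sim_ppm is simulated, normally distributed data (an assumption for illustration)
sim_ppm <- stats::rnorm(n = 500, mean = 0.033, sd = 0.012)

# a simple sample skewness estimate; values near 0 suggest symmetry
skewness <- mean((sim_ppm - mean(sim_ppm))^3) / stats::sd(sim_ppm)^3
skewness

# normal Q-Q plot: points close to the reference line are consistent with normality
stats::qqnorm(sim_ppm, main = "Normal Q-Q Plot of Simulated Concentrations")
stats::qqline(sim_ppm, col = "red")
```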

Question 7: Outliers

Based on the histogram (Q4), do you see any possible outliers? Skewness? How might this affect the spread and central tendency?

There appears to be an outlier close to zero, which could pull the mean downward and make it an unreliable measure of central tendency.
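
The effect of a low outlier on the mean versus the median can be sketched with a made-up vector in which a zero reading is planted deliberately. The name toy_ppm and its values are hypothetical; grDevices::boxplot.stats() applies the same 1.5 times IQR rule that a boxplot uses to flag points.

```r
# toy_ppm is hypothetical; the value 0.000 is planted as an obvious low outlier
toy_ppm <- c(0.000, 0.028, 0.030, 0.031, 0.033, 0.034, 0.036, 0.038)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges
flagged <- grDevices::boxplot.stats(toy_ppm)$out
flagged

# compare each summary with and without the flagged points
kept <- toy_ppm[!toy_ppm %in% flagged]
mean_shift   <- mean(kept) - mean(toy_ppm)
median_shift <- stats::median(kept) - stats::median(toy_ppm)
c(mean_shift = mean_shift, median_shift = median_shift)
```

In this toy case the mean moves several times further than the median, which is why the median is the more robust measure of location when outliers are present.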

Question 8: Boxplot

Generate a boxplot of ozone concentration on the y-axis with a title, subtitle, y-axis label, and theme consistent with the style of the previous two plots. Use quotes ("") as the x arguments within the calls to the aesthetic and labels to remove the x-axis scale and label.

Subset Data

Use the following code to create a dataframe for use in the remaining questions. These ozone concentration measurements were taken on July 4, 2019 in Fort Collins, CO. This code detects certain characters within the datetime object and filters to observations containing those characters. There are other ways this could have been done (e.g., dplyr::filter() with %in% operator).

## # A tibble: 48 × 2
##    ozone_ppm datetime           
##        <dbl> <dttm>             
##  1     0.058 2019-07-04 00:00:00
##  2     0.049 2019-07-04 01:00:00
##  3     0.043 2019-07-04 02:00:00
##  4     0.043 2019-07-04 03:00:00
##  5     0.036 2019-07-04 04:00:00
##  6     0.033 2019-07-04 05:00:00
##  7     0.036 2019-07-04 06:00:00
##  8     0.034 2019-07-04 07:00:00
##  9     0.036 2019-07-04 08:00:00
## 10     0.033 2019-07-04 09:00:00
## # ℹ 38 more rows
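
As one sketch of the %in% alternative mentioned above, the snippet below formats each timestamp as a calendar date and keeps the rows whose date is in a target set. The objects toy_df, target_days, and toy_day are hypothetical, and base R subsetting is used so the example needs no extra packages.

```r
# toy_df is a hypothetical stand-in for the imported ozone dataframe
toy_df <- data.frame(
  ozone_ppm = c(0.041, 0.058, 0.049, 0.037),
  datetime  = as.POSIXct(c("2019-07-03 23:00:00", "2019-07-04 00:00:00",
                           "2019-07-04 01:00:00", "2019-07-05 00:00:00"),
                         tz = "UTC")
)

# format each timestamp as a calendar date, then keep rows matching the target
target_days <- c("2019-07-04")
toy_day <- toy_df[format(toy_df$datetime, "%Y-%m-%d") %in% target_days, ]
toy_day
```

The same logical condition could be passed to dplyr::filter() inside a pipe; adding more dates to target_days would select several days at once.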

Question 9: Autocorrelation Plot

Define autocorrelation as it relates to ozone concentration measurement.

Autocorrelation is the correlation of a series with itself across time; it describes how likely what is happening in the series now is to keep happening in the future. For the ozone concentration data taken in Fort Collins, this means we can predict that data taken in the following years will likely look very similar to this.

Create an autocorrelation plot of ozone concentration, using stats::acf() and include axis labels and title. Describe what you see based on the features of interest outlined in the coursebook.
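
To show what stats::acf() computes at a single lag, the sketch below calculates the lag-1 autocorrelation by hand and compares it with the function's result. The vector x holds the first ten values printed in the July 4 subset output; the variable names are illustrative.

```r
# first ten hourly values shown in the July 4, 2019 subset output
x <- c(0.058, 0.049, 0.043, 0.043, 0.036, 0.033, 0.036, 0.034, 0.036, 0.033)

n    <- length(x)
xbar <- mean(x)

# lag-1 autocorrelation: products of deviations one hour apart,
# divided by the total sum of squared deviations
lag1_by_hand <- sum((x[-n] - xbar) * (x[-1] - xbar)) / sum((x - xbar)^2)

# stats::acf() stores lag 0 first, so lag 1 is the second element
lag1_acf <- stats::acf(x, plot = FALSE)$acf[2]

all.equal(lag1_by_hand, lag1_acf)
```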

Question 10: Partial Autocorrelation Plot

Define partial autocorrelation as it relates to ozone concentration measurement.

Partial autocorrelation shows the direct relationship between a specific observation in a time series and its past observations, with the influence of the intermediate observations removed, and it helps identify patterns and relationships in time series data. By looking at the partial autocorrelation of the Fort Collins ozone concentration data, we can get a better understanding of what future data recordings will look like.

Now create a partial autocorrelation plot of the one-day ozone concentration data with axis labels. Describe what you see. How does this compare to the autocorrelation plot in the previous question?
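
The relationship between the two plots can be sketched numerically: at lag 1 the partial autocorrelation equals the ordinary autocorrelation, and at lag 2 it is the lag-2 autocorrelation after the lag-1 effect is accounted for. The vector x holds the first ten values printed in the July 4 subset output; the variable names are illustrative.

```r
# first ten hourly values shown in the July 4, 2019 subset output
x <- c(0.058, 0.049, 0.043, 0.043, 0.036, 0.033, 0.036, 0.034, 0.036, 0.033)

# ordinary autocorrelations at lags 1 and 2 (element 1 of $acf is lag 0)
r <- stats::acf(x, lag.max = 2, plot = FALSE)$acf[2:3]

# partial autocorrelations at lags 1 and 2 (pacf output starts at lag 1)
phi <- stats::pacf(x, lag.max = 2, plot = FALSE)$acf[1:2]

# lag 1: partial and ordinary autocorrelation coincide
all.equal(phi[1], r[1])

# lag 2: the lag-2 correlation with the lag-1 contribution removed
all.equal(phi[2], (r[2] - r[1]^2) / (1 - r[1]^2))
```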

Appendix

# set global options for figures, code, warnings, and messages
knitr::opts_chunk$set(fig.width = 6, fig.height = 4, fig.path = "../figs/",
                      echo = FALSE, warning = FALSE, message = FALSE)
# load packages
library(tidyverse)
library(dplyr)
library(ggplot2)
# ozone: import, select, drop missing observations, rename
ozone_data <- readr::read_csv(file = "ftc_o3.csv") %>%
  dplyr::select("sample_measurement","datetime") %>%
  drop_na() %>%
  dplyr::rename(ozone_ppm = sample_measurement)
# examine dataframe object 
ozone_data
# calculate quantiles of ozone concentration
quartiles <- quantile(ozone_data$ozone_ppm, probs = seq(0, 1, 0.25))
quartiles
# plot cumulative distribution of ozone concentration
ozone_data %>% 
  ggplot2::ggplot(mapping = aes(x = ozone_ppm)) +
  geom_step(stat = "ecdf") + 
  labs(x = "Ozone Concentration (ppm)", 
       y = "Cumulative Fraction",
       title = "Cumulative Distribution of Ozone Concentration") +
  
  scale_y_continuous(limits = c(-0.05, 1.05), 
                     expand = c(0,0),
                     breaks = seq(from = 0, 
                                  to = 1, 
                                  by = 0.1)) +
  scale_x_continuous(minor_breaks = seq(from = 0,
                                        to = 0.1,
                                        by = 0.005)) +
  geom_segment(data = data.frame(x = quantile(ozone_data$ozone_ppm),
                                 y = rep.int(-.05, 5),
                                 xend = quantile(ozone_data$ozone_ppm),
                                 yend = seq(from = 0, to = 1, by = 0.25)),
               aes(x = x, y = y, xend = xend, yend = yend),
               color = "red",
               linetype = "dashed") +
  theme_minimal()

# create histogram of ozone concentration
ggplot(data = ozone_data, aes(x = ozone_ppm)) +
  geom_histogram(bins = 50,
                 fill = "lightblue",
                 color = "black") +
  labs(x = "Ozone Concentration (ppm)", 
       y = "Count",
       title = "Histogram of Ozone Concentration") +
  theme_minimal()
# create ozone boxplot
ggplot(data = ozone_data,
       aes(x = "",
           y = ozone_ppm)) +
  geom_boxplot(width = 0.5,
               fill = "lightblue") +
  labs(x = "",
       y = "Ozone Concentration (ppm)",
       title = "Boxplot of Ozone Concentration")+
  theme_minimal()


# create subset of data with only one day to examine daily pattern
# instructor-provided code: dates and stringr have not been discussed yet
# check that the object names match your own before running
ozone_day <- ozone_data %>% 
  dplyr::filter(stringr::str_detect(string = datetime,
                                    pattern = "2019-07-04"))
ozone_day
# create autocorrelation plot with ozone_day df
stats::acf(ozone_day$ozone_ppm,
           main = "Autocorrelation of Ozone Concentration on July 4th, 2019",
           xlab = "Lag (hours)",
           ylab = "Correlation Coefficient")
# create partial autocorrelation plot

stats::pacf(ozone_day$ozone_ppm,
            main = "Partial Autocorrelation of Ozone Concentration on July 4th, 2019",
            xlab = "Lag (hours)",
            ylab = "Partial Correlation Coefficient")