The data was collected from the U.S. embassy in China. Here’s a link to some of the information regarding measurements and how the data was collected: http://www.stateair.net/web/historical/1/1.html. The data includes hourly readings of the air pollution level in Beijing over eight years, between 2008 and 2015.

The motivation for this writeup is to visualize the fluctuation of the levels of air pollution in Beijing.

We use the packages: TTR, dplyr, rbokeh and lubridate. The code below is for loading packages, reading data, filtering out missing values, formating datetime values of the observations,

suppressMessages(library(TTR))
suppressMessages(library(rbokeh))
suppressMessages(library(dplyr))
suppressMessages(library(lubridate))

# reads csv files
bj08 <- read.csv("~/Documents/datasets/Beijing_2008_HourlyPM2.5_created20140325.csv")
bj09 <- read.csv("~/Documents/datasets/Beijing_2009_HourlyPM25_created20140709.csv")
bj10 <- read.csv("~/Documents/datasets/Beijing_2010_HourlyPM25_created20140709.csv")
bj11 <- read.csv("~/Documents/datasets/Beijing_2011_HourlyPM25_created20140709.csv")
bj12 <- read.csv("~/Documents/datasets/Beijing_2012_HourlyPM2.5_created20140325.csv")
bj13 <- read.csv("~/Documents/datasets/Beijing_2013_HourlyPM2.5_created20140325.csv")
bj14 <- read.csv("~/Documents/datasets/Beijing_2014_HourlyPM25_created20150203.csv")
bj15 <- read.csv("~/Documents/datasets/Beijing_2015_HourlyPM25_created20160201.csv")

# binds the data frames above by row, creates a new data frame
BJ_AQI <- rbind(bj08, bj09, bj10, bj11, bj12, bj13, bj14, bj15)
# Removes all observations with missing values
BJ_AQI <- BJ_AQI %>% filter(QC.Name == "Valid")

# checks for missing values # returns logical values
#any(BJ_AQI$QC.Name == "Missing")) ##FALSE

# creates a new variable date, which is of type date
BJ_AQI$date <- mdy_hm(BJ_AQI$Date..LST.)
# We create a new data frame with just the variables of interest
ts.BJ_AQI <- BJ_AQI %>% select(date, Year, Month, Value)

Plots Using the Base Package in R

Because the readings/measurements were taken hourly in a span of eight years, plotting the measurements Value against time using the R base package produces an unintelligible graph. We include such graph below to show the shortcomings of the base package plotting abilities.

plot(Value ~ date, data = ts.BJ_AQI, 
     main = "Air Pollution readings over 2008-2015")

Plots Using rbokeh

We also include a plot using rbokeh. The graph looks slightly better cosmetically, but still very hard to read.

figure(width = 800, height = 400, 
       title = "Air Pollution readings over 2008-2015") %>%
   ly_lines(date, Value, data = ts.BJ_AQI, alpha = 1) %>%
   ly_points(date, Value, data = ts.BJ_AQI, size = 2,
             color = "green")

Solution: Plotting Moving Averages

The solution to the problem above is to calculate the simple moving averages of the readings, attribute Value over monthly intervals. We use the funciton SMA in the TTR package to achieve the moving averages. We then plot them using the package rbokeh, for of its interactive capabilities. We can zoom in and out of color plots, and we can also hover over points on the graph to get the dates represented by that point.

The plot below shows that there is quite a bit of variablity in Value over time. Since this general graph doesn’t offer us a whole lot of insight in the data, we proceed to plot all of the years separately. We won’t include the code for all those years.

For the graph immediately below, we’ve collapsed hourly readings of 8 years down to 672 points. For the reast of the graphs, we’ve collapsed hourly reading of each year down to 168 points. We can see that the graphs for each year are clear and helpful when visually analyzing the pollution levels.

#############################################################
# Moving averages of value plotted over the span of 8 years #
#############################################################
ts.BJ_AQI$SMA <- SMA(ts.BJ_AQI$Value, n =672) # Simple Moving Average of PM 2.5 values vs time

figure(width = 800, height = 400, title = "Readings over 2008-2015") %>%
  ly_lines(date, SMA, data = ts.BJ_AQI, alpha = 1) %>%
  ly_points(date, SMA, data = ts.BJ_AQI, size = 2,
            color = "green")

Year 2008

Year 2009

Year 2010

Year 2011

Year 2012

Year 2013

Year 2014

Year 2015