Hi there! Welcome to my data visualization land. Today I will use PM 2.5 data to do some data exploration.

1) Import and Preprocess Data

classes <- c("character","character","character","character","character","numeric","numeric","numeric","numeric","numeric","numeric","numeric")
data99 <- read.csv(url("https://bit.ly/3c4AHbL"),header = TRUE, colClasses = classes)

require(data.table) ## for fread()
## Loading required package: data.table
data12 <- fread("https://bit.ly/3nZicL2",header = TRUE, colClasses = classes)
str(data12)
## Classes 'data.table' and 'data.frame':   1304287 obs. of  12 variables:
##  $ X..RD          : chr  "RD" "RD" "RD" "RD" ...
##  $ Action.Code    : chr  "I" "I" "I" "I" ...
##  $ State.Code     : chr  "01" "01" "01" "01" ...
##  $ County.Code    : chr  "003" "003" "003" "003" ...
##  $ Site.ID        : chr  "0010" "0010" "0010" "0010" ...
##  $ Parameter      : num  88101 88101 88101 88101 88101 ...
##  $ POC            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sample.Duration: num  7 7 7 7 7 7 7 7 7 7 ...
##  $ Unit           : num  105 105 105 105 105 105 105 105 105 105 ...
##  $ Method         : num  118 118 118 118 118 118 118 118 118 118 ...
##  $ Date           : num  20120101 20120104 20120107 20120110 20120113 ...
##  $ Sample.Value   : num  6.7 9 6.5 7 5.8 8 7.9 8 6 9.6 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Let’s take a look at the data I imported:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
glimpse(data99)
## Rows: 117,421
## Columns: 12
## $ X..RD           <chr> "RD", "RD", "RD", "RD", "RD", "RD", "RD", "RD", "RD", …
## $ Action.Code     <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I",…
## $ State.Code      <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", …
## $ County.Code     <chr> "027", "027", "027", "027", "027", "027", "027", "027"…
## $ Site.ID         <chr> "0001", "0001", "0001", "0001", "0001", "0001", "0001"…
## $ Parameter       <dbl> 88101, 88101, 88101, 88101, 88101, 88101, 88101, 88101…
## $ POC             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Sample.Duration <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
## $ Unit            <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105,…
## $ Method          <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,…
## $ Date            <dbl> 19990103, 19990106, 19990109, 19990112, 19990115, 1999…
## $ Sample.Value    <dbl> NA, NA, NA, 8.841, 14.920, 3.878, 9.042, 5.464, 20.170…
head(data99,3)
##   X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 1    RD           I         01         027    0001     88101   1
## 2    RD           I         01         027    0001     88101   1
## 3    RD           I         01         027    0001     88101   1
##   Sample.Duration Unit Method     Date Sample.Value
## 1               7  105    120 19990103           NA
## 2               7  105    120 19990106           NA
## 3               7  105    120 19990109           NA
summary(summary(data99$Sample.Value)) # note: a plain summary(data99$Sample.Value) would show the NA count directly
## Number of cases in table: 13424.44 
## Number of factors: 0

It seems there are some missing values in the PM 2.5 observations. Let's find out what proportion of the observations is missing:

require(scales)
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
percent(sum(is.na(data99$Sample.Value)) / length(data99$Sample.Value))
## [1] "11%"

Now, we are going to bind the two years of data together and do some data transforming:

pm <- rbind(data99,data12)                # stack the 1999 and 2012 observations
library("lubridate")
pm$year <- as.factor(year(ymd(pm$Date)))  # extract the year from the yyyymmdd date
pm <- rename(pm,PM = Sample.Value)        # give the PM2.5 column a shorter name
head(pm,3)
##   X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 1    RD           I         01         027    0001     88101   1
## 2    RD           I         01         027    0001     88101   1
## 3    RD           I         01         027    0001     88101   1
##   Sample.Duration Unit Method     Date PM year
## 1               7  105    120 19990103 NA 1999
## 2               7  105    120 19990106 NA 1999
## 3               7  105    120 19990109 NA 1999

I will draw a sample of the data and visualize it with boxplots:

set.seed(2021)                    # make the sampling reproducible
sam <- sample(1:nrow(pm), 1000)   # draw 1000 random row indices
pm_sample <- pm[sam,]
head(pm_sample,3)
##        X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 444582    RD           I         11         001    0043     88101   4
## 876986    RD           I         30         075    0001     88101   3
## 439180    RD           I         10         003    2004     88101   3
##        Sample.Duration Unit Method     Date   PM year
## 444582               1  105    170 20120212 8.00 2012
## 876986               1  105    170 20120305 8.60 2012
## 439180               1  105    184 20120323 8.88 2012
require(ggplot2)
x1 <- ggplot(pm_sample, aes(year, log2(PM), color = year)) +
  geom_boxplot() +
  theme(plot.title = element_text(hjust = 0)) +
  labs(title = "Boxplot of PM values in 1999 and 2012",x = "Year",y = "log2 PM2.5") +
  ylim(-2.5,6.25) +
  theme_bw();x1
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning: Removed 90 rows containing non-finite values (`stat_boxplot()`).
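
The "NaNs produced" warning comes from log2(): a few sampled PM readings are negative, and log2() of a negative number is NaN; those points, along with missing values and points outside the y-axis limits, are the rows the second warning reports as removed. A quick check of how many such readings are in the sample (count not shown):

# sampled PM readings that are zero or negative, for which log2() is not finite
sum(pm_sample$PM <= 0, na.rm = TRUE)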

As the chart shows, the PM value in 2012 is, on average, lower than in 1999.

In 1999, the distribution is centered between 2.945 and 4.177 (on the log2 scale), whereas in 2012 it is centered between 2.138 and 3.585.

In 1999, the values are spread from a minimum of 0.848 to a maximum of 5.629. In 2012, the PM values are spread over a wider range than before.

To sum up, most sites in 2012 have lower PM values than in 1999. However, judging by the variance, the PM values are more dispersed in 2012.
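
For reference, the quartiles quoted above can be recomputed from the same sample; a minimal sketch (the exact numbers depend on the sampled rows):

# quartiles of log2(PM) by year, dropping missing and non-positive readings
pm_sample %>%
  filter(!is.na(PM), PM > 0) %>%
  group_by(year) %>%
  summarize(Q1 = quantile(log2(PM), 0.25),
            Median = median(log2(PM)),
            Q3 = quantile(log2(PM), 0.75))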

To get more insight from this data, the first task is to identify a monitor in New York State that has data in both 1999 and 2012.

data_sub <- unique(select(filter(pm,State.Code == "36"),State.Code,County.Code,Site.ID))
head(data_sub,3)
##     State.Code County.Code Site.ID
## 1           36         001    0005
## 123         36         001    0012
## 184         36         005    0073
data_sub$Site.Code <- paste(data_sub$County.Code,data_sub$Site.ID,sep = ".")
head(data_sub,3)
##     State.Code County.Code Site.ID Site.Code
## 1           36         001    0005  001.0005
## 123         36         001    0012  001.0012
## 184         36         005    0073  005.0073

Because not all of the monitors operated in both 1999 and 2012, I need to take the intersection of the sites (i.e., monitors) between 1999 and 2012, which gives the list of New York monitors that operated in both years.

data_sub1 <- unique(select(filter(pm,State.Code == "36"),State.Code,County.Code,Site.ID,year))
data_sub1$Site.Code <- paste(data_sub1$County.Code,data_sub1$Site.ID,sep = ".")
data_sp <- split(data_sub1,data_sub1$year)   # one data frame per year
intersect(data_sp[[1]][5],data_sp[[2]][5])   # Site.Code values present in both years
##    Site.Code
## 1   001.0005
## 2   001.0012
## 3   005.0080
## 4   013.0011
## 5   029.0005
## 6   031.0003
## 7   063.2008
## 8   067.1015
## 9   085.0055
## 10  101.0003
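
For reference, the same list could be obtained with a grouped dplyr query instead of split() and intersect(); a sketch of one possible alternative (output not shown, but it should match the ten monitors above):

# keep the site codes that appear in both years
data_sub1 %>%
  group_by(Site.Code) %>%
  summarize(n_years = n_distinct(year)) %>%
  filter(n_years == 2)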

We observe that the list contains 10 monitors. Rather than choosing a monitor at random, it would make more sense to choose one that had the most observations.

Now I am going to identify the monitor in the original data (i.e., pm) that had the most data using mutate(), filter(), group_by(), summarize(), and arrange().

data_temp <- mutate(pm,Site.Code = paste(County.Code,Site.ID,sep = "."))
data_fil <- filter(data_temp,Site.Code %in% c("001.0005","001.0012","005.0080","013.0011","029.0005","031.0003","063.2008","067.1015","085.0055","101.0003"))
data_rank <- group_by(data_fil,Site.Code) %>%
  summarize(number_of_sites = n()) %>%
  arrange(desc(number_of_sites));data_rank
## # A tibble: 10 × 2
##    Site.Code number_of_sites
##    <chr>               <int>
##  1 101.0003              527
##  2 013.0011              213
##  3 031.0003              198
##  4 001.0005              186
##  5 067.1015              153
##  6 063.2008              152
##  7 029.0005               94
##  8 001.0012               92
##  9 005.0080               92
## 10 085.0055               38

It seems that monitor 101.0003 collected the most data among these ten New York monitors across 1999 and 2012 (n = 527).
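
To be safe, the counts can also be split by year to confirm that monitor 101.0003 has a reasonable number of observations in both 1999 and 2012; a quick sketch (output omitted):

# observation counts per monitor and year
count(data_fil, Site.Code, year) %>%
  arrange(desc(n))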

Therefore, I will focus on the monitor just identified (State.Code = 36 & County.Code = 101 & Site.ID = 0003) and assign the subset to an object called 'pmsub'.

pmsub <- subset(pm,State.Code == "36" & County.Code == "101" & Site.ID == "0003")
head(pmsub,3)
##       X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 68260    RD           I         36         101    0003     88101   1
## 68261    RD           I         36         101    0003     88101   1
## 68262    RD           I         36         101    0003     88101   1
##       Sample.Duration Unit Method     Date   PM year
## 68260               7  105    118 19990802   NA 1999
## 68261               7  105    118 19990803  3.9 1999
## 68262               7  105    118 19990804 11.8 1999
pmsub$Date <- as.Date(ymd(pmsub$Date))   # convert yyyymmdd numbers to Date objects
pmsub$yday <- yday(pmsub$Date)           # day of the year (1-366)
head(pmsub,3)
##       X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 68260    RD           I         36         101    0003     88101   1
## 68261    RD           I         36         101    0003     88101   1
## 68262    RD           I         36         101    0003     88101   1
##       Sample.Duration Unit Method       Date   PM year yday
## 68260               7  105    118 1999-08-02   NA 1999  214
## 68261               7  105    118 1999-08-03  3.9 1999  215
## 68262               7  105    118 1999-08-04 11.8 1999  216

To visualize it, let's draw a scatter plot that maps the day of the year to the x-axis and the PM2.5 level to the y-axis, faceted by year (1999 vs. 2012).

g <- ggplot(pmsub, aes(yday,PM))
g1 <- g +
  geom_point() +
  labs(x = "Day of the Year") +
  facet_grid(. ~year) + 
  theme_bw();g1
## Warning: Removed 45 rows containing missing values (`geom_point()`).

From this result, we can see that the concentration of PM 2.5 at this monitor has decreased over the 13 years.
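
To put a rough number on that decrease, one could compare the yearly median PM2.5 at this monitor; a minimal sketch (values not shown here):

# median PM2.5 at monitor 101.0003 for each year
pmsub %>%
  group_by(year) %>%
  summarize(median_PM = median(PM, na.rm = TRUE))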

That’s all for my observations. See you next time!