Hi there! Welcome to my data visualization corner. Today I will explore PM2.5 data from 1999 and 2012.
1) Import and Preprocess Data
## column classes for the 12 columns shared by both raw files
classes <- c(rep("character", 5), rep("numeric", 7))
## 1999 data, read with base R
data99 <- read.csv(url("https://bit.ly/3c4AHbL"), header = TRUE, colClasses = classes)
require(data.table) ## for fread()
## Loading required package: data.table
data12 <- fread("https://bit.ly/3nZicL2",header = TRUE, colClasses = classes)
str(data12)
## Classes 'data.table' and 'data.frame': 1304287 obs. of 12 variables:
## $ X..RD : chr "RD" "RD" "RD" "RD" ...
## $ Action.Code : chr "I" "I" "I" "I" ...
## $ State.Code : chr "01" "01" "01" "01" ...
## $ County.Code : chr "003" "003" "003" "003" ...
## $ Site.ID : chr "0010" "0010" "0010" "0010" ...
## $ Parameter : num 88101 88101 88101 88101 88101 ...
## $ POC : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Sample.Duration: num 7 7 7 7 7 7 7 7 7 7 ...
## $ Unit : num 105 105 105 105 105 105 105 105 105 105 ...
## $ Method : num 118 118 118 118 118 118 118 118 118 118 ...
## $ Date : num 20120101 20120104 20120107 20120110 20120113 ...
## $ Sample.Value : num 6.7 9 6.5 7 5.8 8 7.9 8 6 9.6 ...
## - attr(*, ".internal.selfref")=<externalptr>
Let’s take a look at the data I imported:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
glimpse(data99)
## Rows: 117,421
## Columns: 12
## $ X..RD <chr> "RD", "RD", "RD", "RD", "RD", "RD", "RD", "RD", "RD", …
## $ Action.Code <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I",…
## $ State.Code <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", …
## $ County.Code <chr> "027", "027", "027", "027", "027", "027", "027", "027"…
## $ Site.ID <chr> "0001", "0001", "0001", "0001", "0001", "0001", "0001"…
## $ Parameter <dbl> 88101, 88101, 88101, 88101, 88101, 88101, 88101, 88101…
## $ POC <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Sample.Duration <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
## $ Unit <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105,…
## $ Method <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,…
## $ Date <dbl> 19990103, 19990106, 19990109, 19990112, 19990115, 1999…
## $ Sample.Value <dbl> NA, NA, NA, 8.841, 14.920, 3.878, 9.042, 5.464, 20.170…
head(data99,3)
## X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 1 RD I 01 027 0001 88101 1
## 2 RD I 01 027 0001 88101 1
## 3 RD I 01 027 0001 88101 1
## Sample.Duration Unit Method Date Sample.Value
## 1 7 105 120 19990103 NA
## 2 7 105 120 19990106 NA
## 3 7 105 120 19990109 NA
summary(data99$Sample.Value) ## the summary also reports the number of NA values
There seem to be some missing values in the PM2.5 observations. Let’s find out what proportion of the observations is missing:
require(scales)
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
##     discard
## The following object is masked from 'package:readr':
##
##     col_factor
percent(sum(is.na(data99$Sample.Value)) / length(data99$Sample.Value))
## [1] "11%"
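The rest of the post works with a combined object called pm that stacks both years, renames Sample.Value to PM, and adds a year column. The combining step itself is not shown above, so here is a minimal sketch of how pm could be built from data99 and data12 (my reconstruction, not the original code):
## sanity check: both files should share the same 12-column layout
identical(names(data99), names(data12))
## stack the two years, rename Sample.Value to PM,
## and derive the year from the YYYYMMDD Date column
pm <- rbind(as.data.frame(data99), as.data.frame(data12))
names(pm)[names(pm) == "Sample.Value"] <- "PM"
pm$year <- substr(as.character(pm$Date), 1, 4)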
Next, I will draw a random sample from pm and visualize it with boxplots:
set.seed(2021)
sam <- sample(1:nrow(pm), 1000)
pm_sample <- pm[sam,]
head(pm_sample,3)
## X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 444582 RD I 11 001 0043 88101 4
## 876986 RD I 30 075 0001 88101 3
## 439180 RD I 10 003 2004 88101 3
## Sample.Duration Unit Method Date PM year
## 444582 1 105 170 20120212 8.00 2012
## 876986 1 105 170 20120305 8.60 2012
## 439180 1 105 184 20120323 8.88 2012
require(ggplot2) ## already attached via tidyverse; kept here for clarity
x1 <- ggplot(pm_sample, aes(year, log2(PM), color = year)) +
  geom_boxplot() +
  labs(title = "Boxplot of PM values in 1999 and 2012", x = "Year", y = "log2 PM2.5") +
  ylim(-2.5, 6.25) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0)) ## placed after theme_bw() so it is not overridden
x1
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning: Removed 90 rows containing non-finite values (`stat_boxplot()`).
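A quick note on these warnings: the dropped rows are sampled observations whose PM value is missing or non-positive (log2() returns NaN or -Inf there), plus any points falling outside the ylim() range. For example, the missing or non-positive readings can be counted with:
## count sampled rows whose PM value is missing or non-positive
sum(is.na(pm_sample$PM) | pm_sample$PM <= 0)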

As we can see in this chart, the typical PM value in 2012 is lower than in 1999. On the log2 scale, the 1999 values are centered between 2.945 and 4.177, whereas the 2012 values are centered between 2.138 and 3.585. In 1999 the values range from a minimum of 0.848 to a maximum of 5.629, while in 2012 they are spread over a wider range.
To sum up, most sites have lower PM values in 2012 than in 1999, but the dispersion of the values is noticeably larger in 2012.
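These numbers can be checked directly from the sample with a per-year summary of the log2 values, e.g. (a quick check, not part of the original output):
## five-number summary (plus mean) of log2(PM) for each year in the sample
with(pm_sample, tapply(log2(PM), year, summary))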
To gain more insight from the data, the first task is to identify a monitor in New York State that has data in both 1999 and 2012.
data_sub <- unique(select(filter(pm, State.Code == 36), State.Code, County.Code, Site.ID))
head(data_sub,3)
## State.Code County.Code Site.ID
## 1 36 001 0005
## 123 36 001 0012
## 184 36 005 0073
data_sub$Site.Code <- paste(data_sub$County.Code,data_sub$Site.ID,sep = ".")
head(data_sub,3)
## State.Code County.Code Site.ID Site.Code
## 1 36 001 0005 001.0005
## 123 36 001 0012 001.0012
## 184 36 005 0073 005.0073
Because not all of the monitors operated in both 1999 and 2012, I need to take the intersection of the sites (i.e., monitors) between the two years, which gives us the list of monitors in New York that operated in both 1999 and 2012.
data_sub1 <- unique(select(filter(pm, State.Code == 36), State.Code, County.Code, Site.ID, year))
data_sub1$Site.Code <- paste(data_sub1$County.Code, data_sub1$Site.ID, sep = ".")
data_sp <- split(data_sub1, data_sub1$year) ## one data frame per year
intersect(data_sp[[1]][5], data_sp[[2]][5]) ## column 5 is Site.Code
## Site.Code
## 1 001.0005
## 2 001.0012
## 3 005.0080
## 4 013.0011
## 5 029.0005
## 6 031.0003
## 7 063.2008
## 8 067.1015
## 9 085.0055
## 10 101.0003
We observe that the list contains 10 monitors. Rather than choosing
a monitor at random, it would make more sense to choose one that had the
most observations.
Now I am going to identify the monitor in the original data (i.e.,
pm) that had the most data using mutate(), filter(), group_by(),
summarize(), and arrange().
data_temp <- mutate(pm, Site.Code = paste(County.Code, Site.ID, sep = "."))
data_fil <- filter(data_temp, Site.Code %in% c("001.0005", "001.0012", "005.0080", "013.0011", "029.0005",
                                               "031.0003", "063.2008", "067.1015", "085.0055", "101.0003"))
data_rank <- group_by(data_fil, Site.Code) %>%
  summarize(number_of_sites = n()) %>% ## n() counts observations per monitor
  arrange(desc(number_of_sites))
data_rank
## # A tibble: 10 × 2
## Site.Code number_of_sites
## <chr> <int>
## 1 101.0003 527
## 2 013.0011 213
## 3 031.0003 198
## 4 001.0005 186
## 5 067.1015 153
## 6 063.2008 152
## 7 029.0005 94
## 8 001.0012 92
## 9 005.0080 92
## 10 085.0055 38
It seems that monitor 101.0003 collected the most data among these ten New York monitors across 1999 and 2012 (n = 527).
Therefore, I focus on this monitor (State.Code = 36 & County.Code = 101 & Site.ID = 0003) and assign the subset of the data to an object called ‘pmsub’.
pmsub <- subset(pm,State.Code == "36" & County.Code == "101" & Site.ID == "0003")
head(pmsub,3)
## X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 68260 RD I 36 101 0003 88101 1
## 68261 RD I 36 101 0003 88101 1
## 68262 RD I 36 101 0003 88101 1
## Sample.Duration Unit Method Date PM year
## 68260 7 105 118 19990802 NA 1999
## 68261 7 105 118 19990803 3.9 1999
## 68262 7 105 118 19990804 11.8 1999
library(lubridate) ## for ymd() and yday(); not attached by tidyverse 1.3.x
pmsub$Date <- ymd(pmsub$Date) ## ymd() already returns a Date object
pmsub$yday <- yday(pmsub$Date)
head(pmsub,3)
## X..RD Action.Code State.Code County.Code Site.ID Parameter POC
## 68260 RD I 36 101 0003 88101 1
## 68261 RD I 36 101 0003 88101 1
## 68262 RD I 36 101 0003 88101 1
## Sample.Duration Unit Method Date PM year yday
## 68260 7 105 118 1999-08-02 NA 1999 214
## 68261 7 105 118 1999-08-03 3.9 1999 215
## 68262 7 105 118 1999-08-04 11.8 1999 216
To visualize this, let’s draw a scatter plot with the day of the year on the x-axis and the PM2.5 level on the y-axis, faceted by year (1999 vs. 2012).
g <- ggplot(pmsub, aes(yday, PM))
g1 <- g +
  geom_point() +
  labs(x = "Day of the Year") +
  facet_grid(. ~ year) +
  theme_bw()
g1
## Warning: Removed 45 rows containing missing values (`geom_point()`).

From this result, we can see that the PM2.5 concentration at this monitor decreased over the 13 years between 1999 and 2012.
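As a quick numeric check of that impression (not shown in the original analysis), the per-year medians at this monitor can be compared:
## median PM2.5 at monitor 101.0003, by year
with(pmsub, tapply(PM, year, median, na.rm = TRUE))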
That’s all of my observations. See you next time!