class: center, middle, inverse, title-slide # Air Quality Monitoring Data Analysis and Visualisation in R ### Dr Mahmudur Rahman ### Data Analyst and Atmospheric Scientist, Sydney, Australia ### 07/01/2022 --- # Why R - It is free! - It runs on a variety of platforms including Windows, Unix and MacOS. - It provides a lot of useful packages for air quality data analysis, visualisation and statistical modelling. - It has state-of-the-art graphics capabilities. - It is very popular among data science professionals. - It is reproducible. - It can be learned with the help of so many free educational materials. <img src="data:image/png;base64,#./R_logo.png" width="643" /> --- # R Console - [__Download R - https://cran.r-project.org/bin/windows/base/__](https://cran.r-project.org/bin/windows/base/) <img src="data:image/png;base64,#./R_consol.png" width="473" /> --- # RStudio - __It's the most populat integrated development environment (IDE) for R__ - [__It's freely available - www.rstudio.com__](https://www.rstudio.com/products/rstudio/download/) <img src="data:image/png;base64,#./RStudio.png" width="690" /> --- # Basic R commands ```r a <- c(1,2,3,4) a ``` ``` ## [1] 1 2 3 4 ``` ```r a + 5 ``` ``` ## [1] 6 7 8 9 ``` ```r a - 10 ``` ``` ## [1] -9 -8 -7 -6 ``` ```r a*5 ``` ``` ## [1] 5 10 15 20 ``` ```r a/5 ``` ``` ## [1] 0.2 0.4 0.6 0.8 ``` --- # Basic R commands ```r a ``` ``` ## [1] 1 2 3 4 ``` ```r sqrt(a) ``` ``` ## [1] 1.000000 1.414214 1.732051 2.000000 ``` ```r a^2 ``` ``` ## [1] 1 4 9 16 ``` ```r a[4] ``` ``` ## [1] 4 ``` ```r b <- c(5,6,7,8) b ``` ``` ## [1] 5 6 7 8 ``` ```r a+b ``` ``` ## [1] 6 8 10 12 ``` --- # Install and Load R Packages Install a package `install.packages("....")` Load a library `library(library_name)` __We will use a few popular R packages for today's demonestration__ - `library(tidyverse)` __for data manipulation and plotting__ <br> [https://www.tidyverse.org/packages/](https://www.tidyverse.org/packages/) - `library(openair)` - __for air quality data analysis and visualisation__ - `library(lubridate)` __working with date and time__ - __R packages come with the example data sets.__ We will use those data to make some plots in the following sections. --- # Reading and exploring data in R ```r #Reading a sample air quality data aq <- read.csv("./airquality.csv") # Column names colnames(aq) ``` ``` ## [1] "date" "ws" "wd" "nox" "no2" "o3" "pm10" "so2" "co" "pm25" ``` ```r #First few rows of the data head(aq) ``` ``` ## date ws wd nox no2 o3 pm10 so2 co pm25 ## 1 1998-01-01 00:00:00 0.60 280 285 39 1 29 4.7225 3.3725 NA ## 2 1998-01-01 01:00:00 2.16 230 NA NA NA 37 NA NA NA ## 3 1998-01-01 02:00:00 2.76 190 NA NA 3 34 6.8300 9.6025 NA ## 4 1998-01-01 03:00:00 2.16 170 493 52 3 35 7.6625 10.2175 NA ## 5 1998-01-01 04:00:00 2.40 180 468 78 2 34 8.0700 8.9125 NA ## 6 1998-01-01 05:00:00 3.00 190 264 42 0 16 5.5050 3.0525 NA ``` ```r # Dimention dim(aq) ``` ``` ## [1] 65533 10 ``` --- # Reading and exploring data in R ```r str(aq) ``` ``` ## 'data.frame': 65533 obs. of 10 variables: ## $ date: chr "1998-01-01 00:00:00" "1998-01-01 01:00:00" "1998-01-01 02:00:00" "1998-01-01 03:00:00" ... ## $ ws : num 0.6 2.16 2.76 2.16 2.4 3 3 3 3.36 3.96 ... ## $ wd : int 280 230 190 170 180 190 140 170 170 170 ... ## $ nox : int 285 NA NA 493 468 264 171 195 137 113 ... ## $ no2 : int 39 NA NA 52 78 42 38 51 42 39 ... ## $ o3 : int 1 NA 3 3 2 0 0 0 1 2 ... ## $ pm10: int 29 37 34 35 34 16 11 12 12 12 ... ## $ so2 : num 4.72 NA 6.83 7.66 8.07 ... ## $ co : num 3.37 NA 9.6 10.22 8.91 ... ## $ pm25: int NA NA NA NA NA NA NA NA NA NA ... ``` ```r # Let's do some basic date format library(lubridate) aq$date <- ymd_hms(aq$date) head(aq$date) ``` ``` ## [1] "1998-01-01 00:00:00 UTC" "1998-01-01 01:00:00 UTC" ## [3] "1998-01-01 02:00:00 UTC" "1998-01-01 03:00:00 UTC" ## [5] "1998-01-01 04:00:00 UTC" "1998-01-01 05:00:00 UTC" ``` --- # Reading and exploring data in R ```r library(lubridate) library(tidyverse) aq$year <- year(aq$date) aq$hour <- hour(aq$date) aq %>% select(date, year, hour) %>% head() ``` ``` ## date year hour ## 1 1998-01-01 00:00:00 1998 0 ## 2 1998-01-01 01:00:00 1998 1 ## 3 1998-01-01 02:00:00 1998 2 ## 4 1998-01-01 03:00:00 1998 3 ## 5 1998-01-01 04:00:00 1998 4 ## 6 1998-01-01 05:00:00 1998 5 ``` --- # Reading and exploring data in R ```r library(openair) library(tidyverse) # Very easy approach for doing time average aq_daily <- timeAverage(aq, avg.time = "day") head(aq_daily) ``` ``` ## # A tibble: 6 × 12 ## date ws wd nox no2 o3 pm10 so2 co pm25 ## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1998-01-01 00:00:00 6.84 188. 154. 39.4 6.87 18.2 3.15 2.70 NaN ## 2 1998-01-02 00:00:00 7.07 223. 132. 39.5 6.48 27.8 3.94 1.77 NaN ## 3 1998-01-03 00:00:00 11.0 226. 120. 38.0 8.41 20.2 3.20 1.74 NaN ## 4 1998-01-04 00:00:00 11.5 223. 105. 35.3 9.61 21.0 2.96 1.62 NaN ## 5 1998-01-05 00:00:00 6.61 237. 175. 46.0 4.96 24.2 4.52 2.13 NaN ## 6 1998-01-06 00:00:00 4.38 197. 214. 45.3 1.35 34.6 5.70 2.53 NaN ## # … with 2 more variables: year <dbl>, hour <dbl> ``` --- # Reading and exploring data in R ```r aq_monthly <- timeAverage(aq, avg.time = "month") head(aq_monthly) ``` ``` ## # A tibble: 6 × 12 ## date ws wd nox no2 o3 pm10 so2 co pm25 ## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1998-01-01 00:00:00 5.09 213. 168. 42.2 4.35 29.2 5.18 1.95 NaN ## 2 1998-02-01 00:00:00 4.26 227. 284. 58.3 2.49 40.2 9.65 2.90 NaN ## 3 1998-03-01 00:00:00 4.67 233. 194. 49.7 5.29 32.7 8.98 2.05 NaN ## 4 1998-04-01 00:00:00 4.20 207. 177. 47.9 8.81 28.9 6.60 1.78 NaN ## 5 1998-05-01 00:00:00 3.46 348. 122. 43.7 9.46 32.5 5.02 1.25 21.0 ## 6 1998-06-01 00:00:00 4.93 229. 197. 47.1 5.51 33.0 6.30 2.05 20 ## # … with 2 more variables: year <dbl>, hour <dbl> ``` --- # Reading and exploring data in R ```r # Summary of the data summary(aq) ``` ``` ## date ws wd nox ## Min. :1998-01-01 00:00:00 Min. : 0.000 Min. : 0 Min. : 0.0 ## 1st Qu.:1999-11-14 15:00:00 1st Qu.: 2.600 1st Qu.:140 1st Qu.: 82.0 ## Median :2001-09-27 06:00:00 Median : 4.100 Median :210 Median : 153.0 ## Mean :2001-09-27 06:00:00 Mean : 4.489 Mean :200 Mean : 178.8 ## 3rd Qu.:2003-08-10 21:00:00 3rd Qu.: 5.760 3rd Qu.:270 3rd Qu.: 249.0 ## Max. :2005-06-23 12:00:00 Max. :20.160 Max. :360 Max. :1144.0 ## NA's :632 NA's :219 NA's :2423 ## no2 o3 pm10 so2 ## Min. : 0.00 Min. : 0.000 Min. : 1.00 Min. : 0.000 ## 1st Qu.: 33.00 1st Qu.: 2.000 1st Qu.: 22.00 1st Qu.: 2.175 ## Median : 46.00 Median : 4.000 Median : 31.00 Median : 4.000 ## Mean : 49.13 Mean : 7.122 Mean : 34.38 Mean : 4.795 ## 3rd Qu.: 61.00 3rd Qu.:10.000 3rd Qu.: 44.00 3rd Qu.: 6.500 ## Max. :206.00 Max. :70.000 Max. :801.00 Max. :63.205 ## NA's :2438 NA's :2589 NA's :2162 NA's :10450 ## co pm25 year hour ## Min. : 0.000 Min. : 0.0 Min. :1998 Min. : 0.0 ## 1st Qu.: 0.635 1st Qu.: 13.0 1st Qu.:1999 1st Qu.: 5.0 ## Median : 1.140 Median : 20.0 Median :2001 Median :11.0 ## Mean : 1.464 Mean : 21.7 Mean :2001 Mean :11.5 ## 3rd Qu.: 1.980 3rd Qu.: 28.0 3rd Qu.:2003 3rd Qu.:17.0 ## Max. :19.705 Max. :398.0 Max. :2005 Max. :23.0 ## NA's :1936 NA's :8775 ``` --- # Openair package - Windrose .pull-left[ ```r library(openair) #Windrose windRose(aq) ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-12-1.png" width="100%" /> ```r # ??windRose ``` ] -- .pull-right[ __windRose__ (mydata, ws = __"ws"__, wd = __"wd"__, ws2 = NA, wd2 = NA, ws.int = 2, angle = 30, type = "default", bias.corr = TRUE, cols = "default", grid.line = NULL, width = 1, seg = NULL, auto.text = TRUE, breaks = 4, offset = 10, normalise = FALSE, max.freq = NULL, paddle = TRUE, key.header = NULL, key.footer = "(m/s)", key.position = "bottom", key = TRUE, dig.lab = 5, statistic = "prop.count", pollutant = NULL, annotate = TRUE, angle.scale = 315, border = NA, ...) ] --- # Openair package plots - pollutionRose .center[ ```r pollutionRose(aq, pollutant = "pm10") ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-13-1.png" width="60%" /> ] --- # Openair package plots - percentileRose .center[ ```r percentileRose(aq, pollutant = "so2", percentile = c(25, 50, 75, 90, 95, 99, 99.9), col = "brewer1", key.position = "right", smooth = TRUE) ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-14-1.png" width="50%" /> ] --- # Openair package plots - percentileRose .center[ ```r percentileRose(aq, poll="so2", percentile = 95, method = "cpf", col = "darkorange", smooth = TRUE) ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-15-1.png" width="50%" /> ] --- # Openair package plots - polarPlot .center[ ```r polarPlot(aq, pollutant = "nox", col = "jet", key.position = "bottom", key.header = "mean nox (ug/m3)", key.footer = NULL) ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-16-1.png" width="50%" /> ] --- # Openair package plots - polarPlot .center[ ```r timePlot(selectByDate(mydata, year = 2003, month = "aug"), pollutant = c("nox", "o3", "pm25", "pm10", "ws"), y.relation = "free") ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-17-1.png" width="50%" /> ] --- # Openair package plots - timeVariation .center[ ```r timeVariation(subset(mydata, ws > 3 & wd > 100 & wd < 270), pollutant = "pm10", ylab = "pm10 (ug/m3)") ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-18-1.svg" width="40%" /> ] --- class: inverse, center, middle # Openair package practice __There are so many useful functions available in the Openair package__ .font190[I highly recommend to have a look at the user mannual of the [Opeanair package - https://davidcarslaw.com/files/openairmanual.pdf](https://davidcarslaw.com/files/openairmanual.pdf) ] --- # Tidyverse package - ggplot ```r library(tidyverse) ggplot(aq, aes(x = date, y = pm10)) + geom_line(colour = "blue", alpha = 0.5) + theme_bw() + theme(axis.title = element_text(size = 20), axis.text=element_text(size=20)) + labs(x = "Date", y = expression(PM[10])) ``` <!-- --> --- # Tidyverse package - ggplot .pull-left[ ```r plot<- ggplot(aq, aes(x = pm25, y = pm10)) + geom_point(colour = "red", size = 3) + theme_bw() + geom_smooth(method = "lm") + theme( axis.title = element_text(size = 20), axis.text=element_text(size=20)) + xlab (expression(PM[2.5])) + ylab (expression(PM[10])) ``` ] .pull-right[ ```r plot ``` <!-- --> ] --- # Histogram .center[ ```r ggplot(aq, aes(so2)) + geom_histogram() + theme_bw() + theme(axis.title = element_text(size = 20), axis.text= element_text(size=20)) ``` <img src="data:image/png;base64,#Air_quality_R_v2_files/figure-html/unnamed-chunk-22-1.png" width="50%" /> ] --- # Boxplot .pull-left[ ```r # create a dataset data <- data.frame( name=c( rep("A",500), rep("B",500), rep("B",500), rep("C",20), rep('D', 100) ), value=c( rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1) ) ) # Plot plot<- data %>% ggplot( aes(x=name, y=value, fill=name)) + geom_boxplot() + theme_bw() + theme( legend.position="none", plot.title = element_text(size=11) ) + xlab("") ``` ] .pull-right[ ```r plot ``` <!-- --> ] --- # Regression - Linear model example ```r lm_aq <- aq %>% filter(year == "2000") lm_aq <- lm (data = aq, formula = pm25 ~ pm10 + co + hour) summary(lm_aq) ``` ``` ## ## Call: ## lm(formula = pm25 ~ pm10 + co + hour, data = aq) ## ## Residuals: ## Min 1Q Median 3Q Max ## -195.207 -2.966 -0.424 2.556 258.144 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.351575 0.067277 49.82 <2e-16 *** ## pm10 0.442714 0.001495 296.09 <2e-16 *** ## co 2.336936 0.029483 79.26 <2e-16 *** ## hour -0.050235 0.004126 -12.17 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.352 on 53962 degrees of freedom ## (11567 observations deleted due to missingness) ## Multiple R-squared: 0.7332, Adjusted R-squared: 0.7332 ## F-statistic: 4.944e+04 on 3 and 53962 DF, p-value: < 2.2e-16 ``` --- # Regression - Linear model example ```r library(broom) tidy(lm_aq) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.35 0.0673 49.8 0 ## 2 pm10 0.443 0.00150 296. 0 ## 3 co 2.34 0.0295 79.3 0 ## 4 hour -0.0502 0.00413 -12.2 4.72e-34 ``` ```r augment(lm_aq) ``` ``` ## # A tibble: 53,966 × 11 ## .rownames pm25 pm10 co hour .fitted .resid .hat .sigma .cooksd ## <chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2888 16 24 0.81 7 15.5 0.482 0.0000311 6.35 4.48e- 8 ## 2 2889 16 23 0.91 8 15.3 0.741 0.0000286 6.35 9.74e- 8 ## 3 2890 15 21 1.18 9 15.0 0.0343 0.0000290 6.35 2.11e-10 ## 4 2891 17 23 1.48 10 16.5 0.498 0.0000283 6.35 4.34e- 8 ## 5 2892 18 25 1.16 11 16.6 1.42 0.0000227 6.35 2.84e- 7 ## 6 2893 21 27 1.16 12 17.4 3.59 0.0000215 6.35 1.72e- 6 ## 7 2894 19 26 1.14 13 16.9 2.14 0.0000235 6.35 6.65e- 7 ## 8 2895 20 28 1.44 14 18.4 1.60 0.0000233 6.35 3.70e- 7 ## 9 2896 22 30 1.38 15 19.1 2.88 0.0000245 6.35 1.26e- 6 ## 10 2897 19 30 1.81 16 20.1 -1.06 0.0000297 6.35 2.09e- 7 ## # … with 53,956 more rows, and 1 more variable: .std.resid <dbl> ``` --- # What's next - Here I presented a very few tips and tricks to analyse air quality data in R. - There are many free R data science e-books available online to sharpen skills. - If you are interested in analysing data with the `tidyverse` package, I highly recommend [R for Data Science - https://r4ds.had.co.nz/](https://r4ds.had.co.nz/) e-book. - `Rmarkdown` is a great tool for sharing and documenting codes. - This presentation is prepared by `rmarkdown` and `xaringan` package in R. - Contact me if you have any questions (e-mail `mahmud05@gmail.com`). Feel free to connect with me on [https://www.linkedin.com/in/drmahmudurrahman/](https://www.linkedin.com/in/drmahmudurrahman/)