We will create a data analysis report about air pollution in Nis, with an explanation of the R code used. The data used for the analysis comes from stations set up by the independent citizen initiative Vazduh Gradjanima.
The data is not freely available with a single click, so some investigation is needed. From the map https://maps.sensor.community/#9/43.4655/21.3952 we can identify the Klimerko stations in Nis and download their readings from https://archive.sensor.community/2023-05-22/. The readings are then combined into a single csv file ‘data_nis.csv’ that is made available from our GitHub.
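For reference, the combining step could be scripted in R along the following lines. This is only a sketch: the sensor IDs below are illustrative placeholders rather than the actual list of Nis stations, and it assumes the archive’s usual `<date>_sds011_sensor_<id>.csv` file naming and semicolon-separated format.
> # sketch: download each station's CSV from the archive and row-bind them;
> # the sensor IDs here are hypothetical examples, not the real Nis list
> library(dplyr)
> library(purrr)
>
> sensor_ids <- c(21152, 29856)
> base_url <- "https://archive.sensor.community/2023-05-22"
>
> data_nis <- map_dfr(sensor_ids, function(id) {
+   url <- sprintf("%s/2023-05-22_sds011_sensor_%s.csv", base_url, id)
+   read.csv(url, sep = ";")  # archive files are semicolon-separated
+ })
>
> write.csv(data_nis, "data/data_nis.csv", row.names = FALSE)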
Once we have downloaded our data, we will start our workshop by installing R and RStudio, following the instructions available at this link: Install R and RStudio.
To learn how to use R and develop a report like this, visit the Data Challenge Platform, with learning material developed in partnership with UNDP Serbia. In particular, we will focus on the material for Day 1 and Day 2.
After we get familiar with basic R syntax and data wrangling, using data available from the open data portal of the Republic of Serbia https://data.gov.rs/sr/, we can dig into the air pollution data set following the instructions given below.
We will start the analysis by loading the necessary packages and data into R.
If you do not have the packages used in the following code installed, you will need to uncomment the first few lines (delete the # symbol) in the code below.
> #install.packages("rmarkdown")
> #install.packages("leaflet")
> #install.packages("lubridate")
> #install.packages("dplyr")
> #install.packages("tidyverse")
> #install.packages("DT")
>
> library(leaflet)
> suppressPackageStartupMessages(library(lubridate))
> suppressPackageStartupMessages(library(dplyr))
> suppressPackageStartupMessages(library(tidyverse))
>
> mydata <- read.csv('data/data_nis.csv',
+                    header = TRUE)
It is always a good idea to have a look at the data you load before you start using it for your analysis.
> # scan data
> glimpse(mydata)
## Rows: 10,108
## Columns: 12
## $ sensor_id <int> 21152, 21152, 21152, 21152, 21152, 21152, 21152, 21152, 21…
## $ sensor_type <chr> "SDS011", "SDS011", "SDS011", "SDS011", "SDS011", "SDS011"…
## $ location <int> 10734, 10734, 10734, 10734, 10734, 10734, 10734, 10734, 10…
## $ lat <dbl> 44.792, 44.792, 44.792, 44.792, 44.792, 44.792, 44.792, 44…
## $ lon <dbl> 20.482, 20.482, 20.482, 20.482, 20.482, 20.482, 20.482, 20…
## $ timestamp <chr> "2023-05-22T00:01:04", "2023-05-22T00:03:32", "2023-05-22T…
## $ P1 <dbl> 5.90, 6.10, 8.65, 9.07, 8.45, 10.45, 9.27, 5.38, 6.90, 12.…
## $ durP1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ratioP1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ P2 <dbl> 3.42, 3.70, 4.20, 4.53, 4.28, 4.40, 4.78, 3.47, 4.05, 4.78…
## $ durP2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ratioP2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
We can do some tweaking: removing a few variables and splitting the timestamp into separate date and time columns.
> # separate date from time
> mydata <- separate(mydata, timestamp, c("date", "time"), sep = "T")
> ## str_sub() with end = -1 keeps the whole string; if a trailing
> ## character (e.g. a "Z") ever needed trimming, end = -2 would drop it
> mydata$time <- str_sub(mydata$time, end = -1)
> # scan data
> glimpse(mydata)
## Rows: 10,108
## Columns: 13
## $ sensor_id <int> 21152, 21152, 21152, 21152, 21152, 21152, 21152, 21152, 21…
## $ sensor_type <chr> "SDS011", "SDS011", "SDS011", "SDS011", "SDS011", "SDS011"…
## $ location <int> 10734, 10734, 10734, 10734, 10734, 10734, 10734, 10734, 10…
## $ lat <dbl> 44.792, 44.792, 44.792, 44.792, 44.792, 44.792, 44.792, 44…
## $ lon <dbl> 20.482, 20.482, 20.482, 20.482, 20.482, 20.482, 20.482, 20…
## $ date <chr> "2023-05-22", "2023-05-22", "2023-05-22", "2023-05-22", "2…
## $ time <chr> "00:01:04", "00:03:32", "00:05:59", "00:08:26", "00:10:55"…
## $ P1 <dbl> 5.90, 6.10, 8.65, 9.07, 8.45, 10.45, 9.27, 5.38, 6.90, 12.…
## $ durP1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ratioP1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ P2 <dbl> 3.42, 3.70, 4.20, 4.53, 4.28, 4.40, 4.78, 3.47, 4.05, 4.78…
## $ durP2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ ratioP2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> ## remove the sensor_type and location columns (2nd and 3rd)
> mydata <- mydata[, -c(2, 3)]
We will check how many unique values each variable in our data has.
> (uniq <- unlist(lapply(mydata, function(x) length(unique(x)))))
## sensor_id lat lon date time P1 durP1 ratioP1
## 19 19 19 1 9348 1154 1 1
## P2 durP2 ratioP2
## 810 1 1
For us the most interesting thing to notice is that we have 19 stations in total. Next, we’ll check whether all of the stations are active by examining the number of readings for each station.
> # how many readings each station makes
> mydata %>%
+ group_by(sensor_id) %>%
+ summarise(no_readings = n()) %>%
+ arrange(no_readings) %>%
+ DT::datatable()
They are all active and providing readings.
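As an extra sanity check, we can also look at the time span each station covers. This is just a quick sketch; at this point time is still a plain HH:MM:SS string, which sorts alphabetically in the right order, so min() and max() return the first and last reading of the day.
> # sketch: first and last reading time per station
> mydata %>%
+   group_by(sensor_id) %>%
+   summarise(first_reading = min(time), last_reading = max(time))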
To see where the stations are located, we will plot them on an interactive map. We will use the leaflet package to do this, and create a subset of the data with only the list of stations and their positions.
> # treat sensor_id as a categorical variable
> mydata[,1] <- as.factor(mydata[,1])
> summary(mydata)
## sensor_id lat lon date
## 76714 : 580 Min. :44.73 Min. :20.22 Length:10108
## 77933 : 579 1st Qu.:44.80 1st Qu.:20.31 Class :character
## 57223 : 577 Median :44.82 Median :20.46 Mode :character
## 56435 : 576 Mean :44.84 Mean :20.44
## 29856 : 575 3rd Qu.:44.89 3rd Qu.:20.49
## 66175 : 574 Max. :44.99 Max. :20.66
## (Other):6647
## time P1 durP1 ratioP1
## Length:10108 Min. : 0.100 Mode:logical Mode:logical
## Class :character 1st Qu.: 6.400 NA's:10108 NA's:10108
## Mode :character Median : 8.480
## Mean : 9.866
## 3rd Qu.: 10.885
## Max. :726.300
##
## P2 durP2 ratioP2
## Min. : 0.100 Mode:logical Mode:logical
## 1st Qu.: 3.700 NA's:10108 NA's:10108
## Median : 4.800
## Mean : 5.882
## 3rd Qu.: 6.530
## Max. :578.450
##
> # create a data frame with only the id and the lat & lon of each station
> stations <- data.frame(mydata[, c(1:3)]) %>%
+   drop_na()
> stations[, 2] <- as.numeric(as.character(stations[, 2]))
> stations[, 3] <- as.numeric(as.character(stations[, 3]))
> summary(stations)
## sensor_id lat lon
## 76714 : 580 Min. :44.73 Min. :20.22
## 77933 : 579 1st Qu.:44.80 1st Qu.:20.31
## 57223 : 577 Median :44.82 Median :20.46
## 56435 : 576 Mean :44.84 Mean :20.44
## 29856 : 575 3rd Qu.:44.89 3rd Qu.:20.49
## 66175 : 574 Max. :44.99 Max. :20.66
## (Other):6647
Once we have the subset data, we can plot it using the leaflet package as below.
> minlat <- min(stations$lat)
> maxlat <- max(stations$lat)
> minlng <- min(stations$lon)
> maxlng <- max(stations$lon)
>
> stations %>%
+ group_by(sensor_id, lat, lon) %>%
+ leaflet() %>%
+ addTiles() %>%
+ fitBounds(~minlng, ~minlat, ~maxlng, ~maxlat) %>%
+ addCircles(lng = ~lon, lat = ~lat,
+ radius = 250, weight = 5, color = "black",
+ fillColor = "green", fillOpacity = 0.7,
+ popup = ~paste("<b>", sensor_id)
+ )
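Note that stations contains one row per reading, so each station’s circle is drawn many times in the same spot. The map renders fine regardless, but if you prefer a tidier object you could keep one row per station, as in this sketch:
> # keep a single row per station
> stations_unique <- stations %>%
+   distinct(sensor_id, lat, lon)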
In this section we will look through the readings and try to make some sense of them. If you are not familiar with the information collected by the stations, i.e. with particulate matter (PM), you can check the following link: https://www.irceline.be/en/documentation/faq/what-is-pm10-and-pm2.5.
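As a quick, purely illustrative check, we can count how many individual readings exceed 50 µg/m³, the EU 24-hour limit value for PM10. Strictly speaking that limit applies to daily means, not single readings, so treat this only as a way to get a feel for the data:
> # illustrative only: how many P1 (PM10) readings exceed 50 µg/m³
> mydata %>%
+   summarise(n_above = sum(P1 > 50, na.rm = TRUE),
+             pct_above = mean(P1 > 50, na.rm = TRUE) * 100)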
> # convert time from character to a lubridate hms period so that we
> # can extract the hour of each reading
> mydata$time <- hms(mydata$time)
>
> mydata %>%
+ group_by(hour(time)) %>%
+ summarise(no_readings = n()) %>%
+ DT::datatable()
> mydata %>%
+ group_by(hour(time)) %>%
+ summarise(mean_P1 = mean(P1, na.rm = TRUE), mean_P2 = mean(P2, na.rm = TRUE)) %>%
+ gather("pm_no", "mean_pm", -`hour(time)`, factor_key = TRUE) %>%
+ ggplot(aes(x = `hour(time)`, y = mean_pm, fill = pm_no )) +
+ geom_bar(stat="identity", position = "dodge", color = "black") +
+ coord_flip() +
+ theme(plot.title = element_text(size = 14, vjust = 2, hjust=0.5)) +
+ labs(title = "Average value of P1 and P2 per hour",
+      caption = "Data from: https://vazduhgradjanima.rs",
+      x = "hour of day", y = "average pm") +
+ scale_fill_brewer(palette="Paired") +
+ theme(legend.position="bottom")
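For context, in the SDS011 output P1 is the PM10 and P2 the PM2.5 concentration. If you would rather compare stations than hours, a variation on the same idea (a sketch, not part of the report above) draws one line per station:
> # sketch: hourly mean PM2.5 (P2), one line per station
> mydata %>%
+   group_by(sensor_id, hour = hour(time)) %>%
+   summarise(mean_P2 = mean(P2, na.rm = TRUE), .groups = "drop") %>%
+   ggplot(aes(x = hour, y = mean_P2, group = sensor_id)) +
+   geom_line(alpha = 0.4) +
+   labs(x = "hour of day", y = "average PM2.5")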