To begin the exam, we have to load some packages we are going to work with:
library(tidyverse)
library(lubridate)
library(stringr)
library(ggplot2)
library(ggpubr)
Then we have load the databases we are going to deal with:
calidadaire_19_20 <- readRDS("~/Desktop/3 BIOQUIMICA/DATA ANALYSIS/Exam/calidadaire_19_20.rds")
meteo_19_20 <- readRDS("~/Desktop/3 BIOQUIMICA/DATA ANALYSIS/Exam/meteo_19_20.rds")
First we analyse them to see what is inside:
str(meteo_19_20)
## tibble [731 × 7] (S3: tbl_df/tbl/data.frame)
## $ Fecha : Date[1:731], format: "2019-01-01" "2019-01-02" ...
## $ Temperatura: num [1:731] 5.7 3.6 3.3 2.9 4.8 3.5 2.1 5.2 6.2 2.2 ...
## $ Humedad : num [1:731] 72 73 60 50 43 65 72 75 71 60 ...
## $ lluvia : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ Rad : num [1:731] 87.1 77.4 82.9 86.8 85.7 86.3 59.1 42.6 39.8 74.5 ...
## $ Insol : num [1:731] 27568 17570 25906 27622 26602 ...
## $ viento : num [1:731] 8.2 15 5 4.9 5.8 7 12.6 11.6 15.8 16 ...
We see that meteo_19_20 includes data about 7 variables: Fecha (Date format: “2019-01-01”), Temperatura, Humedad, Lluvia, Rad, Insol, viento (all these numerical data)
str(calidadaire_19_20)
## tibble [175,440 × 13] (S3: tbl_df/tbl/data.frame)
## $ Fecha_Hora: POSIXct[1:175440], format: "2019-01-01 01:00:00" "2019-01-01 02:00:00" ...
## $ NO : num [1:175440] 4 2 3 2 2 3 3 2 9 32 ...
## $ NOx : num [1:175440] 24 16 13 14 11 18 20 14 27 69 ...
## $ SO2 : num [1:175440] 2 2 2 2 2 2 3 2 2 7 ...
## $ O3 : num [1:175440] 18 24 27 26 23 23 19 21 18 22 ...
## $ PM10 : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num [1:175440] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ NO2 : num [1:175440] 18 14 9 12 9 13 14 11 14 20 ...
## $ xileno : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
## $ tolueno : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
## $ benceno : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
## $ PM2.5 : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
## $ Estacion : Factor w/ 11 levels "Alsasua","Funes",..: 1 1 1 1 1 1 1 1 1 1 ...
calidadaire_19_20 it includes data about 13 variables: Fecha_Hora (format –> “2019-01-01 01:00:00”), NO, SO2, O3, etc (numerical values) and also Estacion (factor w/13 levels)
We are going to work with calidadaire_19_20 data and first we are
going to introduce 2 new columns in this database with
that tells us to which day of the week
(“weekday” or “weekend”) correspond the data and also
another one for the year “2019” or “2020” with functions
from lubridate and tidyverse
weekdays <- 1:5
weekends <- 6:7
year_2020 <- 2020
year_2019 <- 2019
calidadaire<-calidadaire_19_20 %>%
mutate(type_of_day=ifelse(wday(Fecha_Hora,week_start = 1) %in% weekdays, "weekday","weekend")) %>%
mutate(Year=ifelse(year(Fecha_Hora) %in% year_2020, "2020","2019"))
Before doing the boxplot, we are going to check the mean for weekdays and weekendas for each station in 2020 (what we are going to represent in boxplot but numerical here)
calidad_aire_2020_mean <- calidadaire %>%
group_by(Estacion, type_of_day) %>%
filter(Year == 2020) %>%
summarize(mean_NO2 = mean(NO2, na.rm = TRUE)) %>%
view()
Then we extract NO2 for both weekdays and weekends,
for each station and filtering only for
2020 and do a boxplot with this
information using ggplot2 functions.
calidad_aire_2020 <- calidadaire %>%
group_by(Estacion, type_of_day) %>%
filter(Year == 2020) %>%
select(NO2) %>%
ggplot(aes(x=Estacion, y=NO2, fill=type_of_day)) +
geom_boxplot() +
facet_wrap(~Estacion, scale="free") +
labs(title = "Quantity of NO2 between weekends and weekdays for each station") +
xlab("Estacion") + ylab("NO2 levels") +
theme(plot.title = element_text(hjust = 0.5))
calidad_aire_2020
Last thing we need to do was to perform an appropriate statistical procedure to test this comparison for Iturrama Station. We assume that the distribution is not normal (we have checked w/ t.test and the p-value was very high) so we perform a Wilcox test
Iturrama_Station_NO2 <- calidadaire %>%
filter(Estacion == "Iturrama") %>%
select(NO2)
wilcox.test(Iturrama_Station_NO2$NO2, mu= mean(Iturrama_Station_NO2$NO2, na.rm = TRUE))
##
## Wilcoxon signed rank test with continuity correction
##
## data: Iturrama_Station_NO2$NO2
## V = 59113824, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 20.2953
In “calidadaire” data crated for the above exercise, we already have a column with the year, so we will use it now. We are going to filter only for Iturrama Station and diferenciate the NO2 data from 2019 and 2020
IturramaNO2 <- calidadaire %>%
filter(Estacion == "Iturrama") %>%
group_by(Year) %>%
select(NO2)
Then we do a plot to compare the NO2 levels in Iturrama Station in each year (2019 and 2020)
ggplot(IturramaNO2, aes(x=Year, y=NO2)) +
geom_boxplot(fill = "#0099f8") +
labs(title = "Quantity of NO2 in Iturrama Station in 2019 & 2020") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_minimal()
We are going to use calidadaire data frame (customized for previous exercises with Year column) and we are going to filtrate by year 2020 and plot O3 levels for each hour (hourly average) for each station.
Ozone2020 <-calidadaire %>%
filter(Year=="2020") %>%
ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) +
geom_point(alpha=0.02)+
theme_minimal() +
geom_smooth() +
labs(title = "Hourly Average of O3 for each station in 2020") +
xlab("Hours") +
ylab("O3 levels") +
theme(plot.title = element_text(hjust = 0.5))
Ozone2020
Then, taking advantage of the column introduced with type of day in
calidadaire data frame, we are going to separate the information
for weekdays and weekends. In that purpose, we are going to
create 2 new data frames (for weekdays and weekends)
with the appropriate filters and conditions, create a plot for each, and
then merge both plots with ggarrange
Ozone2020_weekdays <- calidadaire %>%
filter(Year == 2020) %>%
filter(type_of_day == "weekday") %>%
group_by(Estacion) %>%
select(O3, Fecha_Hora)
pweekdays <- Ozone2020_weekdays %>%
ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) +
geom_point(alpha=0.05)+
theme_minimal() +
geom_smooth() +
labs(title = "Hourly Average of O3 on weekdays for each station in 2020") +
xlab("Hours") +
ylab("O3 levels") +
theme(plot.title = element_text(hjust = 0.5))
Ozone2020_weekends <- calidadaire %>%
filter(Year == 2020) %>%
filter(type_of_day == "weekend") %>%
group_by(Estacion) %>%
select(O3, Fecha_Hora)
pweekends <- Ozone2020_weekends %>%
ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) +
geom_point(alpha=0.05)+
theme_minimal() +
geom_smooth() +
labs(title = "Hourly Average of O3 on weekends for each station in 2020") +
xlab("Hours") +
ylab("O3 levels") +
theme(plot.title = element_text(hjust = 0.5))
figure_3 <- ggarrange(pweekdays, pweekends,
labels = c("A", "B"),
ncol = 1, nrow = 2,
common.legend = TRUE, legend="bottom")
figure_3
First, we add a new column to “calidadaire” with the month and another one with the hours
calidadaire_new<-calidadaire %>%
mutate(date_month=month(Fecha_Hora)) %>%
mutate(date_hour=hour(Fecha_Hora))
Then, we create a graph with the info we want
O3_Iturrama <- calidadaire_new %>%
filter(Estacion == "Iturrama") %>%
ggplot(aes(x=date_hour,y=O3)) +
facet_wrap(~date_month, scale="free") +
geom_point(col="slategray4", size = 1, alpha = 0.05) +
theme_classic() +
geom_smooth(colour="black") +
labs(title = "Iturrama station's hourly average of Ozone for each month of 2020") +
xlab("Hours") + ylab("O3 levels") +
theme(plot.title = element_text(hjust = 0.5))
O3_Iturrama
We can see that, in general, during the first 10 days (aprox) of each month there is a decrease in O3 levels while from days 10 to 15 there is usually a rise. In the winter season (11-2), O3 levels are lower than in summer months (6-8) as it can be seen in the scale used for each graph.
In this exercise we need to work with both databases. To begin with, we have to customize the new one, meteo_19_20 to include a column with the Year to be able to filtrate only the Global Radiation from 2020.
GR_2020 <- meteo_19_20 %>%
mutate(Year = year(Fecha)) %>%
filter(Year == 2020) %>%
select(Fecha,Rad)
Then we are going to filter the data from 2020 in calidadaire database (already customized with a column for the Year for previous exercises) and we are going to add a new column with the Date.
O3_2020 <- calidadaire %>%
mutate(Fecha = date(Fecha_Hora)) %>%
filter(Year == 2020) %>%
select(Fecha, O3)
Once we have the info we want in O3_2020, we are going to calculate the daily average and extract this values related to the date
AverageO3 <- O3_2020 %>%
group_by(Fecha) %>%
summarize(O3_average= mean(O3,na.rm=TRUE))
We have to merge both data frames by the common column “fecha”
O3_GR_2020 <- merge(GR_2020, AverageO3, by="Fecha")
Now that we have average O3 and global radiation in same data frame, we are going to plot them using a linear fit and a smooth fit
figure5 <- O3_GR_2020 %>%
ggplot(aes(x=Rad, y=O3_average)) +
geom_point() +
geom_smooth(method='lm', level=0.5, color="aquamarine4")+
geom_smooth(level=0.5, color="brown4") +
labs(title = "Daily average of Ozone vs Global Radiation for 2020") +
theme(plot.title=element_text(hjust=0.5)) +
xlab("Global Radiation") + ylab("Average O3") +
theme_minimal()
figure5
Last step in this exercise is obtaining the equation of the
corresponding linear model (Ozone in term of Global Radiation)
so Global Radiation is the predictor and Ozone
is the response. The linear model above is achieved by using
the lm() function and the output by
summary()function on the model.
linear_model <- lm(O3_average ~ Rad, data = O3_GR_2020)
summary(linear_model)
##
## Call:
## lm(formula = O3_average ~ Rad, data = O3_GR_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.530 -9.452 0.583 10.006 32.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.737548 1.406865 27.54 <2e-16 ***
## Rad 0.092425 0.007389 12.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.62 on 364 degrees of freedom
## Multiple R-squared: 0.3006, Adjusted R-squared: 0.2987
## F-statistic: 156.4 on 1 and 364 DF, p-value: < 2.2e-16
Interpretation:
Estimated regression line equation can be written as follow: Ozone = 38.737548 + 0.092425*Rad. From the residuals we see that the median is close to 0 and that the Min and Max values are similar in absolute values. As the p-value obtained is <2e-16 (using an alpha level of 0.05) we can say that Global Radiation (predictor variable) is significantly associated to Ozone (response variable), or in other words, that Global Radiation is a statistically significant predictor (based also on *** signif. code).
However, the Residual Standard Error is quite high (13.62) and we know that the the lower, the best the model fits to our data. Our model also has a adjusted R2 of 0.2987 which is closer to 0 than to 1 indicating that the regression model did not explain much of the variability in the outcome.