Exam Data Analysis

To begin the exam, we have to load some packages we are going to work with:

library(tidyverse)
library(lubridate)
library(stringr)
library(ggplot2)
library(ggpubr)

Then we have load the databases we are going to deal with:

calidadaire_19_20 <- readRDS("~/Desktop/3 BIOQUIMICA/DATA ANALYSIS/Exam/calidadaire_19_20.rds")

meteo_19_20 <- readRDS("~/Desktop/3 BIOQUIMICA/DATA ANALYSIS/Exam/meteo_19_20.rds")

First we analyse them to see what is inside:

str(meteo_19_20)

## tibble [731 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Fecha      : Date[1:731], format: "2019-01-01" "2019-01-02" ...
##  $ Temperatura: num [1:731] 5.7 3.6 3.3 2.9 4.8 3.5 2.1 5.2 6.2 2.2 ...
##  $ Humedad    : num [1:731] 72 73 60 50 43 65 72 75 71 60 ...
##  $ lluvia     : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Rad        : num [1:731] 87.1 77.4 82.9 86.8 85.7 86.3 59.1 42.6 39.8 74.5 ...
##  $ Insol      : num [1:731] 27568 17570 25906 27622 26602 ...
##  $ viento     : num [1:731] 8.2 15 5 4.9 5.8 7 12.6 11.6 15.8 16 ...

We see that meteo_19_20 includes data about 7 variables: Fecha (Date format: “2019-01-01”), Temperatura, Humedad, Lluvia, Rad, Insol, viento (all these numerical data)

str(calidadaire_19_20)

## tibble [175,440 × 13] (S3: tbl_df/tbl/data.frame)
##  $ Fecha_Hora: POSIXct[1:175440], format: "2019-01-01 01:00:00" "2019-01-01 02:00:00" ...
##  $ NO        : num [1:175440] 4 2 3 2 2 3 3 2 9 32 ...
##  $ NOx       : num [1:175440] 24 16 13 14 11 18 20 14 27 69 ...
##  $ SO2       : num [1:175440] 2 2 2 2 2 2 3 2 2 7 ...
##  $ O3        : num [1:175440] 18 24 27 26 23 23 19 21 18 22 ...
##  $ PM10      : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
##  $ CO        : num [1:175440] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ NO2       : num [1:175440] 18 14 9 12 9 13 14 11 14 20 ...
##  $ xileno    : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
##  $ tolueno   : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
##  $ benceno   : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
##  $ PM2.5     : num [1:175440] NA NA NA NA NA NA NA NA NA NA ...
##  $ Estacion  : Factor w/ 11 levels "Alsasua","Funes",..: 1 1 1 1 1 1 1 1 1 1 ...

calidadaire_19_20 it includes data about 13 variables: Fecha_Hora (format –> “2019-01-01 01:00:00”), NO, SO2, O3, etc (numerical values) and also Estacion (factor w/13 levels)

Exercise 1

Boxplot comparing the quantity of NO2 between weekends and weekdays for each station that measured this contaminant in 2020. Perform an appropriate statistical procedure to test this comparison for Iturrama Station.

We are going to work with calidadaire_19_20 data and first we are going to introduce 2 new columns in this database with that tells us to which day of the week (“weekday” or “weekend”) correspond the data and also another one for the year “2019” or “2020” with functions from lubridate and tidyverse

weekdays <- 1:5
weekends <- 6:7
year_2020 <- 2020
year_2019 <- 2019

calidadaire<-calidadaire_19_20 %>%   
 mutate(type_of_day=ifelse(wday(Fecha_Hora,week_start = 1) %in% weekdays, "weekday","weekend")) %>%
 mutate(Year=ifelse(year(Fecha_Hora) %in% year_2020, "2020","2019"))

Before doing the boxplot, we are going to check the mean for weekdays and weekendas for each station in 2020 (what we are going to represent in boxplot but numerical here)

calidad_aire_2020_mean <- calidadaire %>%
  group_by(Estacion, type_of_day) %>%
  filter(Year == 2020) %>%
  summarize(mean_NO2 = mean(NO2, na.rm = TRUE)) %>%
  view()

Then we extract NO2 for both weekdays and weekends, for each station and filtering only for 2020 and do a boxplot with this information using ggplot2 functions.

calidad_aire_2020 <- calidadaire %>%
  group_by(Estacion, type_of_day) %>%
  filter(Year == 2020) %>%
  select(NO2) %>%
  ggplot(aes(x=Estacion, y=NO2, fill=type_of_day)) + 
  geom_boxplot() +
  facet_wrap(~Estacion, scale="free") +
  labs(title = "Quantity of NO2 between weekends and weekdays for each station") +
  xlab("Estacion") + ylab("NO2 levels") +
  theme(plot.title = element_text(hjust = 0.5))

calidad_aire_2020

Last thing we need to do was to perform an appropriate statistical procedure to test this comparison for Iturrama Station. We assume that the distribution is not normal (we have checked w/ t.test and the p-value was very high) so we perform a Wilcox test

Iturrama_Station_NO2 <- calidadaire %>%
  filter(Estacion == "Iturrama") %>%
  select(NO2)

wilcox.test(Iturrama_Station_NO2$NO2, mu= mean(Iturrama_Station_NO2$NO2, na.rm = TRUE))

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Iturrama_Station_NO2$NO2
## V = 59113824, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 20.2953

Exercise 2

Produce a single graph with boxplots comparing the quantity of NO2 in the Iturrama station in the years 2019 and 2020.

In “calidadaire” data crated for the above exercise, we already have a column with the year, so we will use it now. We are going to filter only for Iturrama Station and diferenciate the NO2 data from 2019 and 2020

IturramaNO2 <- calidadaire %>%
  filter(Estacion == "Iturrama") %>%
  group_by(Year) %>%
  select(NO2)

Then we do a plot to compare the NO2 levels in Iturrama Station in each year (2019 and 2020)

ggplot(IturramaNO2, aes(x=Year, y=NO2)) + 
  geom_boxplot(fill = "#0099f8") +
  labs(title = "Quantity of NO2 in Iturrama Station in 2019 & 2020") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_minimal()

Exercise 3

Plot the global hourly average of Ozone for all stations (2020).

We are going to use calidadaire data frame (customized for previous exercises with Year column) and we are going to filtrate by year 2020 and plot O3 levels for each hour (hourly average) for each station.

Ozone2020 <-calidadaire %>%
  filter(Year=="2020") %>% 
  ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) + 
  geom_point(alpha=0.02)+ 
  theme_minimal() +
  geom_smooth() +
  labs(title = "Hourly Average of O3 for each station in 2020") +
  xlab("Hours") + 
  ylab("O3 levels") +
  theme(plot.title = element_text(hjust = 0.5))

Ozone2020

Repeat the plot separating weekdays and weekends (plotting with different colors).

Then, taking advantage of the column introduced with type of day in calidadaire data frame, we are going to separate the information for weekdays and weekends. In that purpose, we are going to create 2 new data frames (for weekdays and weekends) with the appropriate filters and conditions, create a plot for each, and then merge both plots with ggarrange

Ozone2020_weekdays <- calidadaire %>%
  filter(Year == 2020) %>%
  filter(type_of_day == "weekday") %>%
  group_by(Estacion) %>%
  select(O3, Fecha_Hora)
  
pweekdays <- Ozone2020_weekdays %>%
    ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) + 
    geom_point(alpha=0.05)+ 
    theme_minimal() +
    geom_smooth() +
    labs(title = "Hourly Average of O3 on weekdays for each station in 2020") +
    xlab("Hours") + 
    ylab("O3 levels") +
    theme(plot.title = element_text(hjust = 0.5))
  
Ozone2020_weekends <- calidadaire %>%
    filter(Year == 2020) %>%
    filter(type_of_day == "weekend") %>%
    group_by(Estacion) %>%
    select(O3, Fecha_Hora)

pweekends <- Ozone2020_weekends %>%
  ggplot(aes(x=Fecha_Hora, y=O3, color = Estacion)) + 
  geom_point(alpha=0.05)+ 
  theme_minimal() +
  geom_smooth() +
  labs(title = "Hourly Average of O3 on weekends for each station in 2020") +
  xlab("Hours") + 
  ylab("O3 levels") +
  theme(plot.title = element_text(hjust = 0.5))

figure_3 <- ggarrange(pweekdays, pweekends,
                    labels = c("A", "B"),
                    ncol = 1, nrow = 2,
                    common.legend = TRUE, legend="bottom")
figure_3

Exercise 4

Plot in a single graph the Iturrama hourly average of Ozone for each month of 2020. (1 panel for each month).

First, we add a new column to “calidadaire” with the month and another one with the hours

calidadaire_new<-calidadaire %>%   
  mutate(date_month=month(Fecha_Hora)) %>%
  mutate(date_hour=hour(Fecha_Hora))

Then, we create a graph with the info we want

O3_Iturrama <- calidadaire_new %>%
  filter(Estacion == "Iturrama") %>%
  ggplot(aes(x=date_hour,y=O3)) +
  facet_wrap(~date_month, scale="free") +
  geom_point(col="slategray4", size = 1, alpha = 0.05) +
  theme_classic() +
  geom_smooth(colour="black") +
  labs(title = "Iturrama station's hourly average of Ozone for each month of 2020") +
  xlab("Hours") + ylab("O3 levels") +
  theme(plot.title = element_text(hjust = 0.5))

O3_Iturrama

Can you see a season trend?

We can see that, in general, during the first 10 days (aprox) of each month there is a decrease in O3 levels while from days 10 to 15 there is usually a rise. In the winter season (11-2), O3 levels are lower than in summer months (6-8) as it can be seen in the scale used for each graph.

Exercise 5

Plot in the same graph the daily average of Ozone vs Global Radiation for 2020. Draw a linear fit and a smooth fit using ggplot.

In this exercise we need to work with both databases. To begin with, we have to customize the new one, meteo_19_20 to include a column with the Year to be able to filtrate only the Global Radiation from 2020.

GR_2020 <- meteo_19_20 %>%
  mutate(Year = year(Fecha)) %>%
  filter(Year == 2020) %>%
  select(Fecha,Rad)

Then we are going to filter the data from 2020 in calidadaire database (already customized with a column for the Year for previous exercises) and we are going to add a new column with the Date.

O3_2020 <- calidadaire %>%
  mutate(Fecha = date(Fecha_Hora)) %>%
  filter(Year == 2020) %>%
  select(Fecha, O3)

Once we have the info we want in O3_2020, we are going to calculate the daily average and extract this values related to the date

AverageO3 <- O3_2020 %>%
  group_by(Fecha) %>%
  summarize(O3_average= mean(O3,na.rm=TRUE))

We have to merge both data frames by the common column “fecha”

O3_GR_2020 <- merge(GR_2020, AverageO3, by="Fecha")

Now that we have average O3 and global radiation in same data frame, we are going to plot them using a linear fit and a smooth fit

figure5 <- O3_GR_2020 %>%
  ggplot(aes(x=Rad, y=O3_average)) +
  geom_point() +
  geom_smooth(method='lm', level=0.5, color="aquamarine4")+
  geom_smooth(level=0.5, color="brown4") +
  labs(title = "Daily average of Ozone vs Global Radiation for 2020") +
  theme(plot.title=element_text(hjust=0.5)) +
  xlab("Global Radiation") + ylab("Average O3") +
  theme_minimal()

figure5

Write the equation of the corresponding linear model (Ozone in term of Global Radiation) and explain it

Last step in this exercise is obtaining the equation of the corresponding linear model (Ozone in term of Global Radiation) so Global Radiation is the predictor and Ozone is the response. The linear model above is achieved by using the lm() function and the output by summary()function on the model.

linear_model <- lm(O3_average ~ Rad, data = O3_GR_2020)

summary(linear_model)

## 
## Call:
## lm(formula = O3_average ~ Rad, data = O3_GR_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.530  -9.452   0.583  10.006  32.214 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.737548   1.406865   27.54   <2e-16 ***
## Rad          0.092425   0.007389   12.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.62 on 364 degrees of freedom
## Multiple R-squared:  0.3006, Adjusted R-squared:  0.2987 
## F-statistic: 156.4 on 1 and 364 DF,  p-value: < 2.2e-16

Interpretation:

Estimated regression line equation can be written as follow: Ozone = 38.737548 + 0.092425*Rad. From the residuals we see that the median is close to 0 and that the Min and Max values are similar in absolute values. As the p-value obtained is <2e-16 (using an alpha level of 0.05) we can say that Global Radiation (predictor variable) is significantly associated to Ozone (response variable), or in other words, that Global Radiation is a statistically significant predictor (based also on *** signif. code).

However, the Residual Standard Error is quite high (13.62) and we know that the the lower, the best the model fits to our data. Our model also has a adjusted R2 of 0.2987 which is closer to 0 than to 1 indicating that the regression model did not explain much of the variability in the outcome.