Now we are going to subset our dataset by applying some quality filters.

We are going to remove cities that lack points in 2019, 2020, and 2021, and cities that have fewer than 200 points in total.
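A minimal sketch of how such a filter could look, using made-up data. The table `toy`, its values, and the way `total` is computed here are assumptions for illustration. The sketch also assumes that a city must have points in each of 2019, 2020, and 2021 to be kept, and that a missing count means no points.

```r
library(dplyr)

# Toy data (hypothetical values): one row per city, yearly point counts
toy <- tibble(
  ciudad = c("A", "B", "C", "D"),
  `2016` = c(30, 90, 5, NA),
  `2017` = c(40, 90, 5, 20),
  `2018` = c(50, 90, 5, 30),
  `2019` = c(60, 0, 10, 50),
  `2020` = c(70, 50, 10, 60),
  `2021` = c(80, 60, 10, 70),
  `2022` = c(90, 70, 5, 40)
)

kept <- toy %>%
  # Total number of points per city over all years, ignoring missing years
  mutate(total = rowSums(across(`2016`:`2022`), na.rm = TRUE)) %>%
  filter(
    # Keep a city only if it has points in every one of the three years;
    # coalesce() treats a missing count as zero points
    if_all(c(`2019`, `2020`, `2021`), ~ coalesce(.x, 0) > 0),
    # Keep a city only if it has at least 200 points in total
    total >= 200
  )

kept$ciudad
```

In this toy table, city B is dropped because it has no points in 2019 and city C is dropped because its total is below 200.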

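library(dplyr)

# Per-city summary statistics over the yearly counts (columns 2016 to 2022)
filtered_data <- filtered_data %>%
  mutate(
    Mean = apply(select(., `2016`:`2022`), 1, mean, na.rm = TRUE),  # row-wise mean
    sd = apply(select(., `2016`:`2022`), 1, sd, na.rm = TRUE),      # row-wise standard deviation
    CV = sd / Mean                                                  # coefficient of variation
  )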

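This gives a table with the number of values per year for each city, plus the mean, standard deviation (sd), and coefficient of variation (CV = sd / mean) of those yearly counts. The larger the sd and CV, the more unevenly the data are spread across the years. The CV is the better measure for comparing cities, because it is scaled by the mean and so does not depend on how many points a city has.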
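To illustrate why the CV is easier to compare than the sd, here is a small sketch with two made-up cities whose yearly counts differ only in scale. The vectors `small_city` and `large_city` are hypothetical and not taken from the dataset.

```r
# Hypothetical yearly counts; the second city is the first one scaled by 100
small_city <- c(8, 10, 12, 9, 11, 10, 10)
large_city <- small_city * 100

# The sd grows with the size of the counts...
sd(small_city)
sd(large_city)

# ...while the CV (sd / mean) is the same for both cities
cv_small <- sd(small_city) / mean(small_city)
cv_large <- sd(large_city) / mean(large_city)
all.equal(cv_small, cv_large)
```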

Now that we have a CV for every city in the dataset, we can calculate the mean and SD of the CV to identify the outliers.
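A minimal sketch of that calculation, assuming the bounds are the overall mean of the CV plus or minus three standard deviations. The table `cv_table` and its values are made up for illustration, and the names `lower_bound`, `upper_bound`, and `outliers` are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values): one CV per city, with one extreme value
cv_table <- tibble(
  ciudad = paste0("city_", 1:21),
  CV = c(rep(c(0.80, 0.85, 0.90, 0.95), times = 5), 2.50)
)

# Overall mean and standard deviation of the CV across cities
cv_mean <- mean(cv_table$CV, na.rm = TRUE)
cv_sd   <- sd(cv_table$CV, na.rm = TRUE)

# Bounds at three standard deviations around the mean
lower_bound <- cv_mean - 3 * cv_sd
upper_bound <- cv_mean + 3 * cv_sd

cat("Overall Mean of CV:", cv_mean, "\n")
cat("Overall SD of CV:", cv_sd, "\n")
cat("Lower Bound (3 SD):", lower_bound, "\n")
cat("Upper Bound (3 SD):", upper_bound, "\n")

# Cities whose CV falls outside the bounds are flagged as outliers
outliers <- cv_table %>%
  filter(CV < lower_bound | CV > upper_bound)
```

With the real data, `cv_table$CV` would be replaced by the `CV` column of `filtered_data`.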

Overall, the mean CV is 0.88 and the standard deviation of the CV across the dataset is 0.22, which puts the 3 SD bounds at roughly 0.22 and 1.55.

## Overall Mean of CV: 0.8831521
## Overall SD of CV: 0.2209343
## Lower Bound (3 SD): 0.2203492
## Upper Bound (3 SD): 1.545955

# Requires the plotly package; install it with install.packages("plotly") if needed
library(plotly)

# Sort the data by the 'total' column in descending order and keep the top 8 cities
top_8_cities <- head(filtered_data[order(-filtered_data$total), ], 8)

# Reshape the data to long format (one row per city and year) for plot_ly
top_8_cities_long <- tidyr::pivot_longer(
  top_8_cities,
  cols = c("2016", "2017", "2018", "2019", "2020", "2021", "2022"),
  names_to = "Year",
  values_to = "Value"
)

# Create the plotly line graph, one line per city
plot_ly(top_8_cities_long, x = ~as.factor(Year), y = ~Value, color = ~ciudad,
        type = 'scatter', mode = 'lines+markers') %>%
  layout(title = "Evolution of 2016 to 2022 - Top 8 Cities",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Observations"))