Climate data analysis project for the Caserta city area

knitr::opts_chunk$set(echo = TRUE)

# Libraries
library(knitr)
library(rmdformats)
library(ggplot2)
library(dplyr)
library(raster)
library(ncdf4)
library(tidyverse)
library(ncdf.tools)
library(magrittr)
library(PCICt) 
library(readr)
library(DT)
library(kableExtra)

Abstract

During my third year of graduate school, I was an intern at the Caserta office of the Euro-Mediterranean Center on Climate Change (CMCC), a scientific research facility that works in the field of climate science. The CMCC aims to deepen the understanding of climate variability, its causes and its consequences through the development of high-resolution simulations with global Earth system models and with regional models, with a focus on the Mediterranean area.

The purpose of the internship was to carry out an in-depth climate analysis so as to construct a detailed representation of the current climate in the Caserta area (specifically, the coordinates of the area under analysis are N: 41.19046, S: 41.00669, E: 14.39992, W: 14.20776). The analysis makes use of a set of indicators commonly used in the literature to characterize the climate and its evolution, both in terms of average values, such as temperature trends on an annual and seasonal scale, and in terms of trends in the most extreme values, i.e., the values assumed by the variables of interest (e.g., temperature) that depart most from the average values for the area in a given reference period. The indicators most commonly used to describe the intensity and frequency of such events are those defined by the Expert Team on Climate Change Detection and Indices (ETCCDI).

It is important to point out that the study of climate implies, by definition, the use of long time scales; in particular, the World Meteorological Organization (WMO 2007) establishes 30 years as the standard duration over which to carry out statistical analyses that can be considered representative of the climate of a certain area.

The different indicators are calculated from atmospheric data derived from a very high spatial resolution (about 2 km) climate reanalysis simulation produced by the CMCC Foundation and available over Italy for the period 1989-2020. This simulation (hereafter referred to as ERA5-2km) is obtained by dynamically downscaling the ERA5 reanalysis with the COSMO-CLM regional climate model (RCM), a climate model developed by the CLM-Community, with which the CMCC Foundation collaborates.

The results presented in this paper indicate that the temperature has been steadily increasing, especially over the past 10 years, an increase driven particularly by the summer and spring seasons. The number of extremely hot days and warm nights is increasing, while days with particularly low temperatures are decreasing. The hottest locations are precisely those in urban areas, while the coolest are in hilly and sparsely populated areas.

While the fall months have shown a continuous and gradual increase in temperature over time, the summer months have shown a drastic increase over the past 10 years; the spring months, in contrast, showed this change between 30 and 20 years ago. Spring and summer show an increase in the between-year temperature variance, while the autumn months show a decrease in variance, and some months have shown an increase in the average temperature difference over time.

Introduction

As part of my third year of the degree, I did an internship at the Caserta office of the CMCC.

The CMCC (Euro-Mediterranean Centre on Climate Change) collaborates with several international organisations specialising in advanced and applied research to carry out studies and models of our climate system and its interactions with society, in order to ensure reliable, timely and rigorous results to stimulate sustainable growth, protect the environment and develop science-based adaptation and mitigation policies in the context of climate change.

Other bodies of particular relevance to the organisation and worth naming are the IPCC and the REMHI division:

  • IPCC (Intergovernmental Panel on Climate Change): its reports are politically neutral, and it does not conduct research or monitor climate-related data or parameters. The purpose of the IPCC is to report on the state of scientific, technical and socio-economic knowledge about climate change, its impacts and future risks, as well as options for reducing the rate of climate change, and to provide governments at all levels with scientific information to use in developing climate policies.

  • REMHI (REgional Models and geo-Hydrological Impacts): its mission is to address climate problems at the local level, using regional observations to obtain very detailed information so that statistical models can provide quantitative (not just qualitative) estimates of climate change trends, including an assessment of their uncertainty.


Before addressing the topic of statistical models, it is important to understand the difference between some concepts that are often misused or treated as synonyms: the difference between weather and climate, the distinction between climate projections and weather forecasts, and the differentiation between mitigation and adaptation.

  1. Regarding the difference between weather and climate: weather is what meteorology studies on a daily basis, while climate is the totality of weather conditions at a given location over a long period of at least 30 years.

  2. The second distinction concerns climate projections and weather forecasts. Climate projections typically start from the past, running from the pre-industrial world to the present, and are guided by estimates of past human-induced and natural climate forcings. Weather forecasts, instead, predict the conditions of the atmosphere for a given place and time; they are made by collecting quantitative data on the current state of the atmosphere at a given location and using a series of equations to estimate how the atmosphere will change.

  3. The third differentiation is based on the two approaches to the problem of climate change, mitigation and adaptation. Mitigation seeks to reduce the causes of climate change; in other words, it is the set of policies that serve to reduce CO2 and other harmful gases in the atmosphere. Adaptation seeks to manage the impacts of climate change that are already occurring, so it is the set of policies that ask society to adapt in order to reduce the negative impacts we already face.

Climate represents the set of weather conditions that characterise a geographical region, and its variability is defined as the fluctuation of a specific climate variable around its mean value, obtained from long-term measurements of at least thirty years. More specifically, annual or decadal fluctuations involve year-to-year or decade-to-decade variations superimposed on the long-term mean.



World of Change: Global Temperatures

Earth’s air temperature has been rising since the industrial revolution. Although natural variability plays some role, the preponderance of evidence indicates that human activities, particularly greenhouse gas emissions, are primarily responsible for warming our planet. According to an ongoing temperature analysis by scientists at NASA’s Goddard Institute for Space Studies (GISS), Earth’s global average temperature has increased by at least 1.1° Celsius (1.9° Fahrenheit) since 1880. Most of the warming has occurred since 1975, at a rate of about 0.15-0.20°C per decade. The image below shows global temperature anomalies in 2021, the sixth warmest year on record. Nine of the world’s 10 warmest years have occurred in the past decade.

knitr::include_graphics("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/img/Immagine1.jpg")

As the maps show, global warming does not mean that temperatures increase everywhere and always at the same rate; for example, exceptionally cold winters in one place might be balanced by extremely warm winters in another part of the world. Generally, warming is greater over land than over the oceans because water absorbs and releases heat more slowly (thermal inertia).

In the bar graph below, the years from 1880 to 1939 tend to be cooler, and the trend then levels off through the 1950s. The decades within the base period (1951-1980) do not appear particularly warm or cold because they are the standard against which other years are measured.

The leveling off of temperatures in the mid-20th century can be explained by natural variability and the cooling effects of aerosols generated by factories, power plants, and motor vehicles in the years of rapid economic growth after World War II. Fossil fuel use also increased after the war (5 percent per year), increasing greenhouse gases. Cooling due to aerosol pollution occurred rapidly; in contrast, greenhouse gases accumulated slowly, but remain in the atmosphere for a much longer period of time.

knitr::include_graphics("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/img/Immagine2.jpg")

Experts generally agree that the Earth is warming. Globally, 14 of the 15 warmest years on record have occurred in the 21st century, and each of the last three decades has been warmer than the previous one. The climate changes recorded so far depend on changes in the concentration of climate-altering gases in the atmosphere due to anthropogenic activities; these changes vary geographically, and the impacts also differ depending on the atmospheric variable considered. The human influence on the climate system is clear and undeniable, as the IPCC states: after decades of observations, and thanks to statistical models and new technologies, it has been concluded that human activity is indeed the cause of climate change.

Since 1950, some changes have been observed in all sectors of the Earth’s climate system:

  • Atmospheric concentrations of greenhouse gases are increasing;
  • The atmosphere and oceans have warmed;
  • The extent and volume of ice has shrunk;
  • Sea levels have risen;
  • The increase in CO2 has caused the pH of the oceans to decrease;
  • Snow cover in the Northern Hemisphere has decreased.

This is why global warming is classified as “Virtually Certain” (probability > 99%).



Dataset Description

To understand current climate change and extreme weather events, it is important that observations of the Earth system go back as far in time as possible. However, the observations have always been unevenly distributed and are accompanied by errors.

As part of the Copernicus Climate Change Service (C3S), ECMWF has produced the ERA5 reanalysis, which provides a detailed record of the global atmosphere, land surface and ocean waves since 1950, updated daily with a latency of about 5 days. ERA5 benefits from a decade of developments in model physics, core dynamics and data assimilation and is the fifth generation of ECMWF reanalyses of global weather and climate over the past 4-7 decades.

Reanalyses combine model data with observations from around the world into a comprehensive and globally consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centers, where every few hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called an analysis, from which an updated and improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow the provision of a dataset spanning several decades. In addition, it does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to include improved versions of the original observations, which benefits the quality of the reanalysis product.

Besides a significantly improved horizontal resolution of 31 km, compared to the 80 km of ERA-Interim, ERA5 has hourly output over the entire domain.

In addition to these global reanalysis projects, there are also high-resolution regional reanalysis activities for different regions, such as North America, Europe, or Australia. These regional reanalyses are usually based on a regional weather forecast model and use the boundary conditions of a global reanalysis.

With the Highlander project, CMCC, particularly the REMHI division, and its partners present a new dataset for the recent climate developed by dynamically downscaling the ERA5 reanalyses, originally available at 31 km horizontal resolution, to a resolution of 2.2 km. The dynamical downscaling was conducted with the COSMO-CLM regional climate model (RCM). The temporal resolution of the output is hourly (as for ERA5), and the runs cover the entire Italian territory (and neighboring areas as needed) to provide a very detailed (in terms of spatio-temporal resolution) and complete (in terms of meteorological fields) picture of climatological data for at least the last 30 years (01/1989-12/2020).

In this case, the dataset contains the dynamically downscaled ERA5 reanalyses at 2.2 km x 2.2 km for the bounded Caserta area: N: 41.19046, S: 41.00669, E: 14.39992, W: 14.20776. In this area, temperatures are available at 80 different grid points, one every 2.2 km, forming a 10x8 grid that covers the entire city and its boundaries.

library(GiNA)
library(ggmap)
library(rgdal)
library(sf)


 Data_first <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_first.csv")
 Data_second <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_second.csv")


 
lon_lat = Data_first %>% dplyr::select(long, lat)

# st_as_sf expects coordinates in (x, y) = (longitude, latitude) order
long_lat <- lon_lat %>% st_as_sf(coords = c("long", "lat"), crs = 4326)

stations <- st_as_sf(Data_first, coords = c("long", "lat"), crs = 4326)
 


library(pdp)
library(grDevices)
library(leaflet)
library(RColorBrewer)
library(htmltools)

mybins <- c(0,14,14.5,15,15.5,16,16.5,17,17.5,18,18.5,19, Inf)
mypalette <- colorBin( palette="YlOrBr", domain=Data_first$mean_T_2M, na.color="transparent", bins=mybins)


library(cowplot)



tag.map.title <- tags$style(HTML("
  .leaflet-control.map-title { 
    left: 20%;
    top: -90px;
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    font-size: 28px;
  }
"))

title1 <- tags$div(
  tag.map.title, HTML("Temperature sensors")
)  

 
df_long_tot_plot2 = Data_first %>% dplyr::select(long, lat)
df_long_tot_plot2$mean_T_2M <- 30  # constant placeholder value used only to colour the markers

lonlat_round1=Data_first %>%
   dplyr::select(long,lat)

lonlat_round1$long <- round(lonlat_round1$long, digits = 3) # round the longitude to 3 decimals for the labels
lonlat_round1$lat <- round(lonlat_round1$lat, digits = 3) # round the latitude to 3 decimals for the labels



mytext <- paste(
    "lon: ", lonlat_round1$long,"<br/>", 
    "lat: ", lonlat_round1$lat,"<br/>", 
     sep="") %>%
  lapply(htmltools::HTML)
# Final map
l1 = leaflet(Data_first) %>%
  addTiles() %>%
  setView(lng = mean(Data_first$long), lat = mean(Data_first$lat), zoom = 11) %>%
 addCircleMarkers(data = stations,
    fillColor = ~mypalette(df_long_tot_plot2$mean_T_2M), 
    stroke=TRUE, 
    fillOpacity = 0.8, 
    color="white", 
    weight=0.3,
    label = mytext,
    labelOptions = labelOptions( 
      style = list("font-weight" = "normal", padding = "3px 8px"), 
      textsize = "13px", 
      direction = "auto"
    )
  ) %>%
  
  addControl(title1, position = "topleft", className="map-title")

l1

The runs cover the entire territory to provide a very detailed (in terms of spatio-temporal resolution) and complete (in terms of meteorological fields) dataset of climatological data for at least the last 30 years (01/1989-12/2020). The temporal coverage of the dataset is from 01/01/1989 00:00 to 31/12/2020 23:00 and the temporal resolution is 1 hour.

Specifically, the data we are going to analyse represent:

mean_T_2M = the hourly average over all grid points of the area under analysis;

max_T_2M = the maximum value reached in the entire area under analysis in that hour;

min_T_2M = the minimum value reached in the entire area under analysis in that hour.
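
To make these definitions concrete, the sketch below shows how the three hourly series could be derived by aggregating over grid points. It uses synthetic data in place of the real gridded export, and the column names (point, hour, T_2M) are illustrative assumptions, not those of the raw files:

# Minimal sketch: area-wide hourly mean/max/min from per-point temperatures.
# Synthetic data stand in for the real 10x8 grid export.
library(dplyr)

set.seed(1)
grid_hourly <- expand.grid(point = 1:80, hour = 0:23) %>%   # 80 grid points x 24 hours
  mutate(T_2M = 15 + 5 * sin(2 * pi * hour / 24) + rnorm(n()))

hourly_series <- grid_hourly %>%
  group_by(hour) %>%
  summarise(mean_T_2M = mean(T_2M),  # hourly average over all points
            max_T_2M  = max(T_2M),   # warmest point in the area that hour
            min_T_2M  = min(T_2M),   # coolest point in the area that hour
            .groups = "drop")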

Link (CMCC DDS) = https://dds.cmcc.it/#/dataset/era5-downscaled-over-italy/VHR-REA_IT_1989_2020_hourly

Database Temperature by Hour:

 df_1989_2020_by_hour <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/df_1989_2020_by_hour.csv")

df_1989_2020_by_hour_plot= df_1989_2020_by_hour %>% dplyr::select(Date, Time, mean_T_2M, min_T_2M, max_T_2M)

df_1989_2020_by_hour_plot=df_1989_2020_by_hour_plot %>% mutate_if(is.numeric, round, digits=3)
 datatable(df_1989_2020_by_hour_plot, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
df_1989_2020_by_day1 <- df_1989_2020_by_hour %>%
  group_by(Date, Year) %>%
  summarise(
    mean_T_2M = mean(mean_T_2M),
    min_T_2M = min(min_T_2M),
    max_T_2M = max(max_T_2M),
    .groups = "drop")

df_1989_2020_by_day1 <- mutate(df_1989_2020_by_day1, Date = as.Date(Date, format= "%Y-%m-%d"))
 

df_1989_1999_by_day= df_1989_2020_by_day1 %>% dplyr::select(Date, mean_T_2M, min_T_2M, max_T_2M, Year) %>% dplyr::filter(Year < 2000)
df_2000_2009_by_day= df_1989_2020_by_day1 %>% dplyr::select(Date, mean_T_2M, min_T_2M, max_T_2M, Year) %>% dplyr::filter(Year > 1999, Year < 2010)
df_2010_2020_by_day= df_1989_2020_by_day1 %>% dplyr::select(Date, mean_T_2M, min_T_2M, max_T_2M, Year) %>% dplyr::filter(Year > 2009)

 
df_1989_1999_by_day <- subset( df_1989_1999_by_day, select = -c( Year ) )
df_2000_2009_by_day <- subset( df_2000_2009_by_day, select = -c( Year ) )
df_2010_2020_by_day <- subset( df_2010_2020_by_day, select = -c( Year ) )

df_1989_1999_by_day$Period <- "Data_1989_1999"
df_2000_2009_by_day$Period <- "Data_2000_2009"
df_2010_2020_by_day$Period <- "Data_2010_2020"
 
 

df_1989_2009_by_day = rbind(df_1989_1999_by_day,df_2000_2009_by_day)
df_1989_2020_by_day = rbind(df_1989_2009_by_day,df_2010_2020_by_day)

Temperature Time Series Distributions

This graph shows the temperature behaviour over the period 1989-2020, hour by hour. As can be seen, this first distribution shows how the temperature follows the period of the year in a seasonal pattern. Seasonality, as the name suggests, refers to the seasonal characteristics of time series data: a predictable pattern that repeats with a certain frequency within a year, e.g. weekly, monthly or quarterly. In general, we expect the temperature to be higher in summer and lower in winter in most places on Earth.

The highlighted blue stripe is the hourly average temperature, while the blue band containing it represents the maximum and minimum temperatures recorded hour by hour in the same area.

Daily and Diurnal Time Range

The graph above shows a trend repeated each year. Although the peaks reached each summer differ from one another, we can still see that the temperature is highest during summer and lowest during winter. In turn, the days also follow a cycle, reaching the maximum peak during the day and the minimum during the night and early morning.

In particular, the highlighted blue stripe is the average of all daily temperature measurements in the Caserta area bounded by N: 41.19046, S: 41.00669, E: 14.39992, W: 14.20776. The blue band containing it represents the maximum and minimum temperatures recorded day by day in the same area.
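
A minimal ggplot2 sketch of the band plot described above, drawn from the df_1989_2020_by_day1 data frame built earlier (the exact aesthetics of the original figure are assumptions):

library(ggplot2)

ggplot(df_1989_2020_by_day1, aes(x = Date)) +
  geom_ribbon(aes(ymin = min_T_2M, ymax = max_T_2M),
              fill = "steelblue", alpha = 0.4) +                      # daily min-max band
  geom_line(aes(y = mean_T_2M), colour = "steelblue4", size = 0.3) +  # daily mean stripe
  labs(y = "Temperature (°C)", title = "Daily temperature range, 1989-2020") +
  theme_minimal()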


Basic Statistical Values

Temperature variable:

These are the main statistical indices we can derive from our hourly time series.

stats = brotools::describe(df_1989_2020_by_hour, mean_T_2M, min_T_2M,max_T_2M )

stats$mode=as.double(stats$mode)
stats=stats %>% mutate_if(is.numeric, round, digits=3)
stats1 = stats[,1:8]
stats2 = stats[,11]
stats = cbind(stats1,stats2)

 
df_1989_2020_by_hour2= df_1989_2020_by_hour %>% dplyr::select(Date,mean_T_2M, min_T_2M, max_T_2M)

df_1989_2020_by_hour2_long = df_1989_2020_by_hour2 %>%
  pivot_longer(!Date, names_to = "Type", values_to = "tmp")


stats2 <- df_1989_2020_by_hour2_long %>%
  group_by(Type) %>%
  summarize(
    sample_size = n(),
    minimum = min(tmp),
    mean = mean(tmp),
    median = median(tmp),
    max = max(tmp),
    .groups = 'drop')
Index      sample_size    minimum     mean       median     max
Max Tmp    280480         -4.437811   18.91033   18.57701   44.62814
Mean Tmp   280480         -5.739857   17.30942   17.02921   43.02329
Min Tmp    280480         -7.819739   15.46890   15.22518   40.35836

As we can see, we have 280,480 hourly measurements (≈ 24 h × 365 d × 32 y) of temperature for each coordinate point included in the dataset.
The average temperature of the city is 17.31 °C; it varies from a minimum of -5.74 °C to a maximum of 43.02 °C, and its most common value is 5.26 °C.
The highest hourly value reached is 44.63 °C, while the lowest is -7.82 °C.

Date Range :

df_1989_2020_by_hour$Date_Time=as.POSIXct(paste(df_1989_2020_by_hour$Date, df_1989_2020_by_hour$Time), format="%Y-%m-%d %H:%M:%S")

range(df_1989_2020_by_hour$Date_Time)
## [1] "1989-01-01 00:00:00 CET" "2020-12-31 23:00:00 CET"

The observations are taken over a time period from 1989-01-01 at 00:00:00 until 2020-12-31 at 23:00:00.



Climate Change Indicators

An important task of modern climatology is to assess extreme weather and climate events along with various aspects of the average climate. The analysis of these extreme events has become increasingly important because of their significant impact on society and natural systems. Climate indices are relatively simple but statistically consistent quantitative indicators of climate extremes: they are statistically robust, cover a wide range of climatic conditions, and have high signal-to-noise ratios. The most widely used modern set of such indices is the collection developed by the Expert Team on Climate Change Detection and Indices (ETCCDI), sponsored by the World Meteorological Organization's (WMO) Commission for Climatology (CCl), the Climate Variability and Predictability (CLIVAR) project, and the Joint Technical Commission for Oceanography and Marine Meteorology (JCOMM). The WMO encourages the application of the indices, noting that by using the same definitions of extremes and analyzing the data in a standardized way, it is possible to compare results from different places and obtain consistent pictures of changes around the world.

Although there is no unambiguous definition of climate extremes, the use of a common theoretical basis for defining extreme events enables their systematic study. In this regard, the ETCCDI has developed a set of 27 indices (16 associated with temperature and 11 with precipitation) to detect changes in the behavior of climate extremes. This set of indices has been defined to calculate changes in climate extremes in exactly the same way around the world, with the goal of having a common global baseline; this baseline allows comparison of studies done anywhere on the planet and encourages further such research, especially in less developed countries.

A wide range of ETCCDI indices is documented and explained on the dedicated page, Link = http://etccdi.pacificclimate.org/ .


Temperature Anomalies over Time

In climate change studies, temperature anomalies are more important than absolute temperature. A temperature anomaly is the difference from an average, or baseline, temperature. The baseline temperature is usually calculated by averaging 30 or more years of temperature data. A positive anomaly indicates that the observed temperature has been warmer than the baseline, while a negative anomaly indicates that the observed temperature has been cooler than the baseline. When averaging absolute temperatures, factors such as station location or altitude influence the data (for example, higher altitudes tend to be cooler than lower altitudes, and urban areas tend to be warmer than rural areas). However, when looking at anomalies, these factors are less critical. For example, a summer month over an area may be cooler than average both on top of a mountain and in a nearby valley, but absolute temperatures will be very different in the two locations.

The graph below shows the anomalies for 1989-2020, the difference between the annual average values and the average for the entire period under analysis.
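
The anomaly series itself takes only a few lines to compute; a sketch using the hourly data frame loaded earlier:

library(dplyr)

yearly_anomalies <- df_1989_2020_by_hour %>%
  group_by(Year) %>%
  summarise(mean_T = mean(mean_T_2M), .groups = "drop") %>%
  mutate(anomaly = mean_T - mean(mean_T))  # yearly mean minus the 1989-2020 mean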

The graph of anomalies over time shows that the temperature appears to be increasing; in particular, from 2010 onward every year has an average temperature above the average for the entire time period under analysis.

This means that the average temperature is gradually increasing over time; this result, which can be visualized by looking at the graph, was then confirmed by the Mann-Kendall test for the monotonic trend.

Mann-Kendall Test For Monotonic Trend:

The purpose of the Mann-Kendall (MK) test (Mann 1945, Kendall 1975, Gilbert 1987) is to statistically assess if there is a monotonic upward or downward trend of the variable of interest over time. A monotonic upward (downward) trend means that the variable consistently increases (decreases) through time, but the trend may or may not be linear. The MK test can be used in place of a parametric linear regression analysis, which can be used to test if the slope of the estimated linear regression line is different from zero. The regression analysis requires that the residuals from the fitted regression line be normally distributed; an assumption not required by the MK test, that is, the MK test is a non-parametric (distribution-free) test.

The MK test assesses whether to reject the null hypothesis (H0) in favour of the alternative hypothesis (H1), where

  • H0: There is no monotonic trend in the series;

  • H1: A monotonic trend, either upward or downward, exists in the series.

MK Test                   p-value
Temperature over Years    ***
library(Kendall)


#MK_df_1989_2020_diff_mean = MannKendall(df_1989_2020_diff_mean$C)
#summary(MK_df_1989_2020_diff_mean)

The p-value is less than 0.05, so we reject the null hypothesis and conclude that there is a trend in the data; viewing the graph confirms that this trend is positive.
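
For reference, this is how the test is run with the Kendall package, here applied for illustration to the yearly anomaly series from the sketch above:

library(Kendall)

mk <- MannKendall(yearly_anomalies$anomaly)
summary(mk)  # tau > 0 with p-value < 0.05 indicates a significant upward trend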


Max, Mean and Min Temperature over Year

Other significant indices are the average, maximum and minimum annual temperatures over time. Given that the maximum temperature is recorded in summer and the minimum in winter, these two indices capture particularly extreme events, even if they are not necessarily representative of the season in which they occurred. The time series at the top and bottom represent the maximum and minimum temperatures recorded over the years, while the one in the middle represents the average annual temperature.

It is not difficult to see that both maximum and average annual temperatures are slightly increasing, while annual minima, especially in the last 10 years and with the exception of the last year, are decreasing, with a minimum peak in 2017. However, these considerations visualized by the graph are not statistically confirmed, so let us now analyze some climate indices.


Climate Change Indices

Through climate indicators, specific characteristics of the city's climate are analysed. Four temperature-related indicators are considered here (the third bullet covers two counts; a sketch of how they are computed follows the list):

  • Tropical Nights : This indicates the number of days with a minimum temperature greater than 20°C.
    Hot summer days and heat waves become especially stressful if nighttime temperatures do not provide relief. Tropical nights make it harder for the body to cool down and recover from hot days. The elderly, the homeless, and those living in homes or apartments without air conditioning are especially vulnerable during these heat events, especially if they last longer than a few days.

  • Very hot days : Indicates the number of days on which the daily maximum temperature exceeds 35°C.
    In general, the hottest days of the year occur during the summer months. High temperatures are important; they determine how plants and animals thrive, limit or allow outdoor activities, define how we design buildings and vehicles, and shape transportation and energy use. However, when temperatures are very high, people, especially the elderly, are more likely to suffer from heat exhaustion and heat stroke; many outdoor activities become dangerous or impossible at very high temperatures.

  • Cold days : The number of days on which the average temperature falls below 0°C and the number of days on which the minimum temperature falls below 0°C.
    This index is an indicator of the length and severity of the winter season.
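
As anticipated above, the yearly counts behind these indicators can be computed directly from the daily data frame built earlier; a sketch:

library(dplyr)

indicator_counts <- df_1989_2020_by_day1 %>%
  group_by(Year) %>%
  summarise(
    max_over_35 = sum(max_T_2M > 35),   # very hot days
    min_over_20 = sum(min_T_2M > 20),   # tropical nights
    min_less_0  = sum(min_T_2M < 0),    # cold days (daily minimum below 0 °C)
    mean_less_0 = sum(mean_T_2M < 0),   # cold days (daily mean below 0 °C)
    .groups = "drop")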

Mann-Kendall Test For Monotonic Trend:

Indicators     p-value
Max Over 35    **
Min Over 20    ***
Min Less 0     NS
Mean Less 0    NS

Combining these results with those obtained and discussed above, we can confirm that:

warm days are increasing along with the average annual temperature and the annual maxima, while cold days are decreasing even though the annual minimum temperatures are falling.

#MK_max_Over_35 = MannKendall(Day_max_over_35_per_year$count)
#summary(MK_max_Over_35)
#MK_min_over_20 = MannKendall(Day_min_over_20_per_year$count)
#summary(MK_min_over_20)
#MK_min_less_0 = MannKendall(Day_min_less_0_per_year$count)
#summary(MK_min_less_0)
#MK_mean_less_0 = MannKendall(Day_mean_temp_less_0_per_year$count)
#summary(MK_mean_less_0)


Number of Days above/below the average

Now, using a tornado diagram, we want to see whether the number of days with a temperature above and below the overall average for the entire period remains constant over time.

As the red line in the center of the diagram shows, days with below-average temperatures outnumber those with above-average temperatures until 2006/2007; after that the trend reverses, and warmer days predominate in the second half of the period.
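
A sketch of how such a tornado-style diagram can be built with ggplot2 from the daily data frame (colours and layout are assumptions):

library(dplyr)
library(ggplot2)

overall_mean <- mean(df_1989_2020_by_day1$mean_T_2M)

days_above_below <- df_1989_2020_by_day1 %>%
  group_by(Year) %>%
  summarise(above = sum(mean_T_2M > overall_mean),
            below = -sum(mean_T_2M <= overall_mean),  # negative, so bars point left
            .groups = "drop")

ggplot(days_above_below, aes(y = factor(Year))) +
  geom_col(aes(x = above), fill = "firebrick") +
  geom_col(aes(x = below), fill = "steelblue") +
  geom_vline(xintercept = 0, colour = "red") +  # the red centre line
  labs(x = "Days below / above the 1989-2020 mean", y = "Year") +
  theme_minimal()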



Temperature Evolution over Time

Temperatures fluctuate throughout the day, week, month, and year, as well as varying substantially depending on where you are.
Calculating average temperatures provides a more accurate picture of the temperature at a specific location than a single measurement.
Average daily temperature is an environmental indicator with many applications in agriculture, engineering, health, energy management, recreation, and more. Let us now look at the trend of this indicator from 1989 to 2020 using the “geom_raster” function, which allows us to create a colored map according to a color scale of temperature.
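
A minimal sketch of such a geom_raster map built from the daily data frame (the colour scale of the original figure is an assumption):

library(dplyr)
library(ggplot2)

df_raster <- df_1989_2020_by_day1 %>%
  mutate(doy = as.numeric(format(Date, "%j")))  # day of the year, 1-366

ggplot(df_raster, aes(x = doy, y = Year, fill = mean_T_2M)) +
  geom_raster() +
  scale_fill_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = median(df_raster$mean_T_2M), name = "°C") +
  labs(x = "Day of the year", y = "Year") +
  theme_minimal()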


Daily Temperature over Time

As you can easily see from the colour scale, a red tile means that the average daily temperature was very high on that day, while a blue tile means that the temperature was very low. The colours between red and blue indicate intermediate temperatures, with the various shades marking slightly warmer or colder conditions.


Let us now analyse the graph starting from the top. The first tiles are coloured blue; moving downwards we go from light blue to yellow to red, ending in blue again.

This is because the uppermost tiles show the winter months, with lower average daily temperatures; spring is then coloured light blue and yellow, summer red, and autumn, like spring, light blue and yellow.


Detailed Temperature Plot by Periods

Let’s look in more detail to see whether we can catch a glimpse of temperature changes here too, by comparing two time periods, the first running from 1989 to 1999 and the second from 2010 to 2020.

#days = 1:365
#define the total number of years as indicated by the data
#years = 1989:1999

#first=df_1989_2020_by_day1 %>%
#  filter(Year < 2000)   # 1989-1999

#second=df_1989_2020_by_day1 %>%
#  filter(Year> 2009)


#par(mfrow = c(2,1))

## make a matrix using the origin temperature values
#all.mat = matrix(first$mean_T_2M, nrow = length(days)) 
## plot the matrix
#imagep(y = years, x = days, z = all.mat, filledContour = F, ylim = c(1989,1999),zlab = expression(Temperature~(degree *C)), xlab = "Day")
#contour(y = years, x = days, z = all.mat, add = TRUE, col = 1, nlevels = 3)

## interpolate the missing values
#interpolated.temperature = interpBarnes(x = second$day.year, y = second$year, z = second$mean_T_2M, xgl = 366, ygl = 21)

### plot the interpolated values
#imagep(x = interpolated.temperature$xg, y = interpolated.temperature$yg, z = interpolated.temperature$zg, filledContour = F, ylim = c(2010,2020),zlab = #expression(Temperature~(degree *C)), xlim = c(1,360), xlab = "Day")
#contour(x = interpolated.temperature$xg, y = interpolated.temperature$yg, z = interpolated.temperature$zg, col = 1, nlevels = 4,add = TRUE)

Just as expected, the second graph has many more yellow-coloured tiles, corresponding to temperatures greater than or equal to 30°C, than the first, and at the same time the dark blue tiles representing cold temperatures have decreased dramatically.

This suggests a scenario in which, in the future, we will have more and more hot days with increasing temperatures and a decrease in cold days.


Caserta Temperature Maps

A recent global study conducted by the Joint Research Centre analyzes the difference between surface temperatures in urban areas and those in nearby rural areas in summer. Worldwide, more than half of all people live in cities, and the share of urban dwellers is expected to grow further. Cities often suffer from “heat islands”, the phenomenon of higher temperatures within cities than in nearby rural areas; this amplifies the effect of heat waves in cities and increases the risk to human health.

Scientists from the European Commission’s Joint Research Centre examined the difference between land surface temperatures in urban areas with a population of more than 50,000 people and surrounding rural areas in the summers between 2003 and 2020.
Working with satellite data, the scientists found that surface temperatures in cities were sometimes up to 10-15°C higher than in rural areas. The study also estimated that extreme heat island temperatures in cities around the world have increased by an average of 1°C since 2003.

Hot spots are often found in industrial areas, where waste heat, the use of dark building materials and the absence of vegetation can result in very high surface temperatures.

The report provides concrete advice to city authorities on how to implement a range of measures to counteract the urban heat island effect. By creating ventilation corridors, designing green roofs and facades for buildings, using lighter colours on buildings, planting more vegetation and making better use of water, urban temperatures can be reduced, the living conditions of city dwellers improved, and the urban heat island effect mitigated.

With this graph, we can see the average temperature at each longitude and latitude point for which temperature measurements are available in our data set. The graph is divided in two according to the two time periods 1989-2000 and 2010-2020, and the points are coloured according to the average temperature recorded at each point over the whole period: the darker the colour, the higher the temperature.

library(pdp)
library(grDevices)
library(leaflet)
library(RColorBrewer)
library(htmltools)

mybins <- c(0,14,14.5,15,15.5,16,16.5,17,17.5,18,18.5,19, Inf)
mypalette <- colorBin( palette="YlOrBr", domain=Data_first$mean_T_2M, na.color="transparent", bins=mybins)


library(cowplot)
Data_first <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_first.csv")
Data_second <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_second.csv")


tag.map.title <- tags$style(HTML("
  .leaflet-control.map-title { 
     left: 20%;
     background: rgba(255,255,255,0.75);
    font-weight: bold;
    font-size: 28px;
  }
"))

title1 <- tags$div(
  tag.map.title, HTML("Caserta 1989-2000")
)  

 


lonlat_round1=Data_first %>%
   dplyr::select(long,lat)

lonlat_round1$long <- round(lonlat_round1$long, digits = 3) # round the longitude to 3 decimals for the labels
lonlat_round1$lat <- round(lonlat_round1$lat, digits = 3) # round the latitude to 3 decimals for the labels



mytext <- paste(
    "lon: ", lonlat_round1$long,"<br/>", 
    "lat: ", lonlat_round1$lat,"<br/>", 
    "tmp: ", round(Data_first$mean_T_2M, 2), 
    sep="") %>%
  lapply(htmltools::HTML)
# Final map for 1989-2000
l1 = leaflet(Data_first) %>%
  addTiles() %>%
  setView(lng = mean(Data_first$long), lat = mean(Data_first$lat), zoom = 11) %>%
 addCircleMarkers(data = stations,
    fillColor = ~mypalette(Data_first$mean_T_2M), 
    stroke=TRUE, 
    fillOpacity = 0.8, 
    color="white", 
    weight=0.3,
    label = mytext,
    labelOptions = labelOptions( 
      style = list("font-weight" = "normal", padding = "3px 8px"), 
      textsize = "13px", 
      direction = "auto"
    )
  ) %>%
  leaflet::addLegend( pal=mypalette, values=~Data_first$mean_T_2M, opacity=0.9, title = "temperature:", position = "bottomleft" )%>%
  addControl(title1, position = "topleft", className="map-title")



 



tag.map.title <- tags$style(HTML("
  .leaflet-control.map-title { 
    left: 20%;
    top: -90px;
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    font-size: 28px;
  }
"))





lonlat_round2=Data_second %>%
   dplyr::select(long,lat)

lonlat_round2$long <- round(lonlat_round2$long, digits = 3) # round the longitude to 3 decimals for the labels
lonlat_round2$lat <- round(lonlat_round2$lat, digits = 3) # round the latitude to 3 decimals for the labels




title2 <- tags$div(
  tag.map.title, HTML("Caserta 2010-2020")
)  

 
mytext <- paste(
    "lon: ", lonlat_round2$long,"<br/>", 
    "lat: ", lonlat_round2$lat,"<br/>", 
    "tmp: ", round(Data_second$mean_T_2M, 2), 
    sep="") %>%
  lapply(htmltools::HTML)
# Final Map
library(leaflet)
l2=leaflet(Data_second) %>% 
  addTiles() %>% 
  setView(lng = mean(Data_second$long), lat = mean(Data_second$lat), zoom = 11) %>%
  addCircleMarkers( data = stations, 
    fillColor = ~mypalette(Data_second$mean_T_2M), 
    stroke=TRUE, 
    fillOpacity = 0.8, 
    color="white", 
    weight=0.3,
    label = mytext,
    labelOptions = labelOptions( 
      style = list("font-weight" = "normal", padding = "3px 8px"), 
      textsize = "13px", 
      direction = "auto"
    )
  )  


library(leafsync)

latticeView(l1,l2)
#knitr::include_graphics("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/img/Rplot13.png")



Comparing the two maps, we see that, in general, temperatures in the second map are higher in all the monitored areas under analysis.

As can be seen from the map, the hottest places coincide exactly with the urban areas, particularly to the south and west of Caserta. In contrast, the coolest areas are those to the north and northeast, which are hilly and sparsely populated.


Difference between Periods

We now want to check whether the median temperatures in the two time periods 1989-1999 and 2010-2020 are statistically different. To do this, we first compare their distributions graphically and then use the Wilcoxon signed-rank test to obtain statistical confirmation of their difference or equality.

The Wilcoxon signed-rank test is a non-parametric test that assesses whether two paired samples of measurements come from the same distribution:

  • H0: The two populations are equal;

  • H1: The two populations are not equal.

It is an alternative to the paired t-test and focuses on medians: the more symmetrically the data are distributed around the median (almost the same number of values above and below it), the more similar the group averages are.

Unlike the t-test, it requires neither normality nor homogeneity of variances.

library(rstatix)
library(infer)
 
 
one= df_1989_1999_by_day %>% dplyr::select(mean_T_2M)
two= df_2010_2020_by_day %>% dplyr::select(mean_T_2M)

df_1989_1999_by_day$Period <- "Data_1989_1999"
df_2010_2020_by_day$Period <- "Data_2010_2020"
 
df_2010_2020_by_day <- df_2010_2020_by_day[-c(4018), ]  # drop one extra day so the two periods have the same length, as needed to pair the observations
 

df_1989_1999_2010_2020_by_day = rbind(df_1989_1999_by_day,df_2010_2020_by_day)

 
median <- df_1989_1999_2010_2020_by_day %>%
  group_by(Period) %>%
  summarize(median=median(mean_T_2M ))

ggplot(df_1989_1999_2010_2020_by_day, aes(x= mean_T_2M , fill=Period))+
  geom_density(alpha=0.3,size=1)+ 
  scale_fill_brewer(palette  = "Set1")+
  geom_vline(data = median, aes(xintercept = median, 
                                       color = Period), size=1.5)+ 
  labs(x= "Temperature",
       subtitle="Temperature Distribution respect to Period")+
  theme_minimal()

sampled_means <- df_1989_1999_2010_2020_by_day %>% 
  group_by(Period) %>%
  summarize(
     minimum = min(min_T_2M),
     mean = mean(mean_T_2M),
     median = median(mean_T_2M),
     max = max(max_T_2M),
     sd = sd(mean_T_2M),
    .groups = 'drop')
Period            minimum     mean       median     max       sd
Data_1989_1999    -5.544257   17.02074   16.57759   42.5032   7.218123
Data_2010_2020    -7.819739   17.61742   17.38216   42.8803   7.434116

As we said, the Wilcoxon signed-rank test uses the median as the comparison statistic; in this case we have 16.57759 for the measurements recorded from 1989 to 1999 and 17.38216 for those recorded from 2010 to 2020.

Now let us check whether our data are normally distributed using the Shapiro-Wilk test.

Normality Test

before= df_1989_1999_2010_2020_by_day %>% dplyr::select(Date, Period, mean_T_2M) %>%
  filter(Period == "Data_1989_1999")

after= df_1989_1999_2010_2020_by_day %>% dplyr::select(Date, Period, mean_T_2M)%>%
  filter(Period == "Data_2010_2020")


diff_obs <- with(df_1989_1999_2010_2020_by_day, 
        mean_T_2M[Period == "Data_1989_1999"] - mean_T_2M[Period == "Data_2010_2020"])
shapiro.test(diff_obs)
## 
##  Shapiro-Wilk normality test
## 
## data:  diff_obs
## W = 0.99606, p-value = 7.45e-09

Normality Plot

qqnorm(diff_obs, main = "QQ plot of paired temperature differences", pch = 19) 
qqline(diff_obs) 

The p-value of the Shapiro-Wilk test is approximately 0. Since this value is less than .05, we have sufficient evidence to say that the paired differences do not come from a normally distributed population; we therefore avoid the t-test and continue with the Wilcoxon signed-rank test.

Paired Two-Sample Wilcoxon Test

#res <- wilcox.test(mean_T_2M ~ Period, data = df_1989_1999_2010_2020_by_day, paired = TRUE)
#res
Wilcoxon test on the 1989-1999 and 2010-2020 temperature distributions

Wilcoxon Test              p-value
1989-1999 & 2010-2020      ***

The p-value of the test is close to 0, which is less than the significance level alpha = 0.05. We can therefore reject the null hypothesis and conclude that the true median difference is significantly different from 0.



Seasonal Analysis


Seasonal Distribution

Obviously, the temperature of a region, especially one located in the Mediterranean area, changes over time following a seasonal pattern. In our case, since the area under study is located in Italy, the temperature values follow the area's four seasons: Winter, Spring, Summer and Autumn.

Now let us see what temperature values are assumed by these seasons during each hour of the day:

Season         mean        minimum      max        sd
Data_Autumn    13.814115    -6.0252747   34.75335   5.690038
Data_Spring    19.791396    -0.9895996   42.16229   5.719066
Data_Summer    25.848926     9.5848938   44.62814   4.713983
Data_Winter     9.661274    -7.8197388   29.23516   4.380895

As expected from the density graph, the seasons with the greatest variance are autumn and spring.

These seasons therefore have the largest temperature range, which can be explained by the fact that they are the two transition seasons between summer and winter. The main factor is the sun: in summer the sun strongly drives daily temperatures, as it is present for longer during the day, generally raising them, whereas in winter, when sunshine is scarce, temperatures are lower. In autumn and spring, on the other hand, day and night ‘balance out’, resulting in a wider temperature range during the day.


Monthly Temperature

The temperatures reached by each month over the entire observation period are shown here. The months are coloured according to the season they belong to, with the same colours as in the previous graph. I have chosen this graph so that one can simultaneously appreciate both the values reached by the different months and the cyclicality of temperature over the course of a year. In each distribution, a bar highlights the average temperature of the month.

To simplify, if a month lies between two seasons, it has been represented as being part of the previous season.

Obviously, August and July are the hottest months while January and February are the coldest.

In the next sections we explore whether, over time, any shift in these distributions has occurred or is occurring, both at the seasonal and at the monthly level, and which among them is most affected by this effect and in what way, whether the temperature is increasing or decreasing.


Trend Analysis

As the Earth warms globally, average temperatures increase throughout the year, but the increases may be greater in some seasons than in others. The change in seasons is directly related to the increase in global temperatures. This indicator analyzes changes in average temperatures in each season, here defined by calendar months (e.g., winter consists of January, February and March).

To better understand how the seasonal temperature changes over time, we display the difference between the seasonal values per year and the season’s average over the entire time period, then with the Mann-Kendall Test we check whether there is indeed a positive or negative trend.

With the Mann-Kendall test we thus obtain that, while the warm seasons, Spring and Summer, present a positive trend, the cold seasons (Autumn and Winter) do not present a statistically significant trend.

Mann-Kendall Test For Monotonic Trend:

Season    p-value
Winter    NS
Spring    ***
Summer    ***
Autumn    NS
library(Kendall)


#MK_Winter = MannKendall(df_Winter_diff_mean$mean)
#summary(MK_Winter)
#MK_Spring = MannKendall(df_Spring_diff_mean$mean)
#summary(MK_Spring)
#MK_Summer = MannKendall(df_Summer_diff_mean$mean)
#summary(MK_Summer)
#MK_Autumn = MannKendall(df_Autumn_diff_mean$mean)
#summary(MK_Autumn)

Winter does not seem particularly vulnerable to rising temperatures. In general, minimum temperatures have decreased, while the number of days with low temperatures is gradually decreasing; as a result, much of the warming is occurring during the period when it is usually coldest, that is, mostly at night.

Warming in other seasons can also affect daily life in various ways. For example, warmer temperatures in spring and fall extend the growing season to benefit some forms of agriculture, but they also lead to a longer and more intense pollen season for allergy sufferers. These changes could also extend the fire season. A warmer summer could mean more money spent on air conditioning, while a warmer winter could mean less money spent on home heating. Seasonal warming could also alter the timing of important events for plants and animals, such as budding, flowering, hibernation, spawning, hatching and migration.

A slight change in temperature is enough to bring the spring thaw forward and delay the first frosts until autumn. These environmental changes result in an earlier spring, longer summers and a later autumn. Scientists are convinced that the earlier arrival of spring events is related to recent warming trends in the global climate. Altering the timing of these events can have a range of impacts on ecosystems and human society: an earlier spring and an extended summer can lengthen growing seasons, increase the abundance of invasive species and pests, make allergy seasons longer and earlier, create a “false spring” that triggers new plant growth too early and leaves it vulnerable to later frosts, increase drought risk through earlier snowmelt and a longer summer, increase forest fires, and shift planting areas progressively northward.


Seasons during Time

The k-means algorithm is a traditional and widely used clustering algorithm that requires the number of clusters k to be specified in advance. It organises the data into k clusters, each identified by the vector of the mean values (the centroid) of each variable for the observations within the cluster, and returns the index of the cluster to which each observation is assigned, so that the objects within each cluster are as close to each other as possible and as far as possible from objects in other clusters. Within each cluster, k-means minimises the sum of the distances between the centroid and all member objects of the cluster. It is then possible to extract the cluster number that has been assigned to each of our observations.

Having ascertained that the temperature is increasing over time, in this chapter we want to see whether indeed the seasonally increasing temperature can be matched to time periods and by which months of these seasons this increasing trend is driven.

We now want to see whether the three clusters representing the low, medium and high temperature values for each year can be matched to the time periods T1, T2 and T3; in particular, the cluster C_Cold represents the cold years, C_Medium the years with an average temperature and C_Hot the years with a particularly high temperature, while T1 collects the data from 1989 to 1999, T2 those from 2000 to 2009 and T3 those from 2010 to 2020.
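
A minimal sketch of this clustering for a single month (January), under the assumption that each year is one observation whose features are its daily mean temperatures; the period labels follow the T1/T2/T3 split above:

library(dplyr)
library(tidyr)

jan_wide <- df_1989_2020_by_day1 %>%
  ungroup() %>%
  mutate(Month = format(Date, "%m"), Day = format(Date, "%d")) %>%
  filter(Month == "01") %>%
  dplyr::select(Year, Day, mean_T_2M) %>%
  pivot_wider(names_from = Day, values_from = mean_T_2M)  # one row per year, 31 day columns

set.seed(2021)
km <- kmeans(dplyr::select(jan_wide, -Year), centers = 3, nstart = 20)

# Cross-tabulate the clusters against the three time periods
table(cluster = km$cluster,
      period  = cut(jan_wide$Year, c(1988, 1999, 2009, 2020),
                    labels = c("T1", "T2", "T3")))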

# 
# set.seed(2021) 
# 
# library(scater)
# library(clValid)
# library(factoextra)
# library(FactoMineR)
# library(cluster)
# library(vegan)
# library(Rtsne)
# library(rstatix)
# library(ggpubr)
# library(readxl)
#  
# Data_plot <- read_csv("C:/Users/User/Desktop/STAGE/PROJECT/Report/Data_plot.csv")
# 
# 
#    
# ## JANUARY
#   
#   
#   
# 
# 
#  
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="January")
# 
# 
# 
# Data_july1=Data_july %>% filter(Year < 2000)                # T1: 1989-1999
# Data_july2=Data_july %>% filter(Year >= 2000 & Year < 2010) # T2: 2000-2009
# Data_july3=Data_july %>% filter(Year >= 2010 & Year < 2021) # T3: 2010-2020
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_january = rbind(Data1,Data_july3)
# 
# 
# Data2 = Data_january %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# Data2 <- Data2[,-1] #remove Year variable
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# Clust_Data_1$Period=as.factor(Clust_Data_1$Period)
# Clust_Data_1$Year=as.factor(Clust_Data_1$Year)
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #use the Year as row names for the cluster plots
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Period column
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Year column
#   
# 
# ### K-MEANS ON JANUARY
# 
#  set.seed(14)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c1 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#   
# 
# library(DT)
# library(yardstick)
# 
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data1=my_data %>%dplyr::rename( Period= Var1 )
#  
# 
# 
# 
# 
# 
# ## FEBRUARY   
#   
#     
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="February")
# 
# 
# Data_july1=Data_july %>% filter(Year < 2000) %>% ungroup()                # T1: 1989-1999
# Data_july2=Data_july %>% filter(Year >= 2000 & Year < 2010) %>% ungroup() # T2: 2000-2009
# Data_july3=Data_july %>% filter(Year >= 2010 & Year < 2021) %>% ungroup() # T3: 2010-2020
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data1 = rbind(Data1,Data_july3)
# 
#  Data1$Day=as.numeric(Data1$Day)
# 
#  Data1=Data1 %>% dplyr::filter(Day!= 28)
# Data_february=Data1 %>% dplyr::filter(Day!= 29)
# 
# 
# 
# Data2 = Data_february %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #use the Year as row names for the cluster plots
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Period column
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Year column
#  
# Clust_Data_2 = Clust_Data_2[,1:27]
#  
#  
# ### K-MEANS ON FEBRUARY
# 
#  set.seed(37)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c2 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data2=my_data %>%dplyr::rename( Period= Var1 )
# 
# 
#  
#  
# 
# 
# 
# 
# ## MARCH   
#   
#     
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="March")
# 
# 
# Data_july1=Data_july %>% filter(Year < 2000) %>% ungroup()                # T1: 1989-1999
# Data_july2=Data_july %>% filter(Year >= 2000 & Year < 2010) %>% ungroup() # T2: 2000-2009
# Data_july3=Data_july %>% filter(Year >= 2010 & Year < 2021) %>% ungroup() # T3: 2010-2020
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data1 = rbind(Data1,Data_july3)
# 
#  
# Data_march = as.data.frame(Data1)
# Data_march$Day=as.numeric(Data_march$Day)
#  
# 
# Data2 = Data_march %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #use the Year as row names for the cluster plots
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Period column
# Clust_Data_2 <- Clust_Data_2[,-1] #remove the Year column
#  
# 
#  
#  
# ### K-MEANS ON MARCH
# 
#  set.seed(17)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c3 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data3=my_data %>%dplyr::rename( Period= Var1 )
# 
#  
#  
# 
# 
# 
# 
# 
# 
# ## APRIL   
#   
#     
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="April")
# 
# 
# Data_july1=Data_july %>% filter(Year < 2000) %>% ungroup()                # T1: 1989-1999
# Data_july2=Data_july %>% filter(Year >= 2000 & Year < 2010) %>% ungroup() # T2: 2000-2009
# Data_july3=Data_july %>% filter(Year >= 2010 & Year < 2021) %>% ungroup() # T3: 2010-2020
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_april = rbind(Data1,Data_july3)
# 
# 
# Data2 = Data_april %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
# ### K-MEANS ON APRIL
# 
#  set.seed(12)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c4 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data4=my_data %>%dplyr::rename( Period= Var1 )
# 
#  
#  
#   
#   
#   
# 
#   
#   
# 
# 
# 
# ## MAY   
#   
#    
#  Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="May")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data1 = rbind(Data1,Data_july3)
# 
#  
# Data_may = as.data.frame(Data1)
# Data_may$Day=as.numeric(Data_may$Day)
#  
# 
# 
# Data2 = Data_may %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#   
# 
#  
# ### K-MEANS ON MAY
# 
#  set.seed(5)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c5 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data5=my_data %>%dplyr::rename( Period= Var1 )
#   
#   
#   
#   
#   
# ## JUNE   
#    
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="June")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_june = rbind(Data1,Data_july3)
# 
#   
# 
# Data2 = Data_june %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
#  
# ### K-MEANS ON JUNE
# 
#  set.seed(7)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c6 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#   
# 
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data6=my_data %>%dplyr::rename( Period= Var1 )
#   
# 
#  
# 
#  
# ## JULY   
#   
#    
# 
#  Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="July")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_July = rbind(Data1,Data_july3)
# 
#  
#  
# 
# 
# Data2 = Data_July %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
# 
#  
# ### K-MEANS ON JULY
# 
#  set.seed(5)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c7 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data7=my_data %>%dplyr::rename( Period= Var1 )
#  
#  
# 
# 
# 
# 
# 
#  
# 
# 
# 
# 
# 
# ## AUGUST   
#   
#    
# 
# 
#  Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="August")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_august = rbind(Data1,Data_july3)
# 
#  
#  
# 
# Data2 = Data_august %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
# 
#  
# ### K-MEANS ON AUGUST
# 
#  set.seed(2)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c8 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data8=my_data %>%dplyr::rename( Period= Var1 )
#   
# 
#  
# 
# 
# 
# 
# 
# ## SEPTEMBER   
#   
#    
# 
#  
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="September")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_september = rbind(Data1,Data_july3)
# 
#  
#  
# 
# Data2 = Data_september %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
# ### K-MEANS ON SEPTEMBER
# 
#  set.seed(7)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c9 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit) 
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data9=my_data %>%dplyr::rename( Period= Var1 )
#  
#  
# 
# 
# 
# 
#  
# 
# 
# 
# 
# 
# ## OCTOBER   
#   
#    
# 
# 
#   
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="October")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data1 = rbind(Data1,Data_july3)
# 
#  
# Data_october = as.data.frame(Data1)
# Data_october$Day=as.numeric(Data_october$Day) 
# 
# 
# 
# Data2 = Data_october %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
# 
#  
# ### K-MEANS ON OCTOBER
# 
#  set.seed(1)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c10 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data10=my_data %>%dplyr::rename( Period= Var1 )
# 
#  
#  
# 
#  
# 
# 
# ## NOVEMBER   
#   
#    
# 
#  
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="November")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_november = rbind(Data1,Data_july3)
# 
#  
#  
# 
# Data2 = Data_november %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
# ### K-MEANS ON NOVEMBER
# 
#  set.seed(2)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c11 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit) 
# table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data11=my_data %>%dplyr::rename( Period= Var1 )
# 
#  
# 
# 
# 
# 
# 
#  
# 
# 
# 
#  ## DECEMBER   
# 
# Data_plot1= Data_plot %>% 
#    group_by(Date,Year, Month) %>%
#   summarise(mean_T_2M=mean(mean_T_2M))
# 
# Data_plot1$Day <- format(Data_plot1$Date,format=" %d")
# Data_july = Data_plot1 %>% filter(Month=="December")
# 
# 
# Data_july1=Data_july %>% filter(Year<1999)   %>% ungroup()
# Data_july2=Data_july %>% filter(Year>1999 && Year<2010 ) %>% ungroup()
# Data_july3=Data_july %>% filter(Year>2010 && Year<2021) %>% ungroup()
# 
# 
# Data_july1$Period <- "T1"
# Data_july2$Period <- "T2"
# Data_july3$Period <- "T3"
# 
# 
# Data_july1 <- tibble::rowid_to_column(Data_july1, "ID")
# Data_july2 <- tibble::rowid_to_column(Data_july2, "ID")
# Data_july3 <- tibble::rowid_to_column(Data_july3, "ID")
# 
# 
# 
# Data1 = rbind(Data_july1,Data_july2)
# Data_december = rbind(Data1,Data_july3)
# 
#  
# 
# 
# Data2 = Data_december %>% dplyr::select(Period, Day, Year, mean_T_2M )   %>% ungroup()
# 
# Clust_Data_1=Data2 %>%
#   pivot_wider(names_from = Day , values_from = mean_T_2M )
# 
# 
# 
# Clust_Data_2 <- as.data.frame(Clust_Data_1) #change from tibble to dataframe since tibble does not have row names, and we need it for the plots of PCA
# #row.names(Clust_Data_2) <- Clust_Data_1$Period #write the row names according to the country variable
# #Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# 
# row.names(Clust_Data_2) <- Clust_Data_2$Year #write the row names according to the country variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
# Clust_Data_2 <- Clust_Data_2[,-1] #remove Year variable
#  
#  
#  
# ### K-MEANS ON DECEMBER
# 
#  set.seed(5)
# kk=kmeans(Clust_Data_2, centers = 3, iter.max = 10, nstart = 20)
# c12 = fviz_cluster(kk, Clust_Data_2,ellipse.type = "norm")
# 
# 
# #fit <- princomp(Clust_Data_1, cor=TRUE)
# #fviz_pca_biplot(fit)
#  table = table(Clust_Data_1$Period, kk$cluster)
# my_data = as.data.frame(table)
# my_data12=my_data %>%dplyr::rename( Period= Var1 )
# 
#  
# 
# TABLE_winter=rbind(my_data1,my_data2)
# TABLE_winter=rbind(my_data3,TABLE_winter)
# TABLE_spring=rbind(my_data5,my_data4)
# TABLE_spring=rbind(my_data6,TABLE_spring)
# TABLE_summer=rbind(my_data7,my_data8)
# TABLE_summer=rbind(my_data9,TABLE_summer)
# TABLE_autumn=rbind(my_data10,my_data11)
# TABLE_autumn=rbind(my_data12,TABLE_autumn)
#  
# 
# TABLE_winter= TABLE_winter %>% 
#    group_by(Period,Var2) %>%
#   summarise(Freq=sum(Freq)) %>% ungroup()
# 
# Clust_Data_winter=TABLE_winter %>%
#   pivot_wider(names_from = Var2 , values_from = Freq )
# 
# Clust_Data_winter =  Clust_Data_winter%>%dplyr::rename( C_Cold= `1` ) %>%dplyr::rename( C_Medium= `2` ) %>%dplyr::rename( C_Hot= `3`)
# 
# 
# 
# TABLE_spring= TABLE_spring %>% 
#    group_by(Period,Var2) %>%
#   summarise(Freq=sum(Freq)) %>% ungroup()
# 
# Clust_Data_spring=TABLE_spring %>%
#   pivot_wider(names_from = Var2 , values_from = Freq )
# 
# Clust_Data_spring =  Clust_Data_spring%>%dplyr::rename( C_Cold= `1` ) %>%dplyr::rename( C_Medium= `2` ) %>%dplyr::rename( C_Hot= `3`)
# 
# 
# 
# TABLE_summer= TABLE_summer %>% 
#    group_by(Period,Var2) %>%
#   summarise(Freq=sum(Freq)) %>% ungroup()
# 
# Clust_Data_summer=TABLE_summer %>%
#   pivot_wider(names_from = Var2 , values_from = Freq )
# 
# Clust_Data_summer =  Clust_Data_summer%>%dplyr::rename( C_Cold= `1` ) %>%dplyr::rename( C_Medium= `2` ) %>%dplyr::rename( C_Hot= `3`)
# 
# 
# 
# 
# TABLE_autumn= TABLE_autumn %>% 
#    group_by(Period,Var2) %>%
#   summarise(Freq=sum(Freq)) %>% ungroup()
# 
# Clust_Data_autumn=TABLE_autumn %>%
#   pivot_wider(names_from = Var2 , values_from = Freq )
# 
# Clust_Data_autumn =  Clust_Data_autumn%>%dplyr::rename( C_Cold= `1` ) %>%dplyr::rename( C_Medium= `2` ) %>%dplyr::rename( C_Hot= `3`)
# 
# 
# 
# Clust_Data_winter <- Clust_Data_winter %>%
#       mutate(across(C_Cold:C_Hot, ~ ./30)*100)
# 
# 
# Clust_Data_spring <- Clust_Data_spring %>%
#       mutate(across(C_Cold:C_Hot, ~ ./30)*100)
# 
# Clust_Data_summer <- Clust_Data_summer %>%
#       mutate(across(C_Cold:C_Hot, ~ ./30)*100)
# 
# Clust_Data_autumn <- Clust_Data_autumn %>%
#       mutate(across(C_Cold:C_Hot, ~ ./30)*100)
# 
# 
# Clust_Data_winter = Clust_Data_winter %>% 
#  mutate_if(is.numeric, round)
# 
# Clust_Data_spring = Clust_Data_spring %>% 
#  mutate_if(is.numeric, round)
# 
# Clust_Data_summer = Clust_Data_summer %>% 
#  mutate_if(is.numeric, round)
# 
# Clust_Data_autumn = Clust_Data_autumn %>% 
#  mutate_if(is.numeric, round)
#  
# Clust_Data_winter$C_Cold <- paste0(as.matrix(Clust_Data_winter$C_Cold), '%')
# Clust_Data_winter$C_Medium <- paste0(as.matrix(Clust_Data_winter$C_Medium), '%')
# Clust_Data_winter$C_Hot <- paste0(as.matrix(Clust_Data_winter$C_Hot), '%')
# 
# Clust_Data_spring$C_Cold <- paste0(as.matrix(Clust_Data_spring$C_Cold), '%')
# Clust_Data_spring$C_Medium <- paste0(as.matrix(Clust_Data_spring$C_Medium), '%')
# Clust_Data_spring$C_Hot <- paste0(as.matrix(Clust_Data_spring$C_Hot), '%')
# 
# Clust_Data_summer$C_Cold <- paste0(as.matrix(Clust_Data_summer$C_Cold), '%')
# Clust_Data_summer$C_Medium <- paste0(as.matrix(Clust_Data_summer$C_Medium), '%')
# Clust_Data_summer$C_Hot <- paste0(as.matrix(Clust_Data_summer$C_Hot), '%')
# 
# Clust_Data_autumn$C_Cold <- paste0(as.matrix(Clust_Data_autumn$C_Cold), '%')
# Clust_Data_autumn$C_Medium <- paste0(as.matrix(Clust_Data_autumn$C_Medium), '%')
# Clust_Data_autumn$C_Hot <- paste0(as.matrix(Clust_Data_autumn$C_Hot), '%')
# 
# library(kableExtra)
#  
# 
# t_winter =Clust_Data_winter %>%
#   kbl(caption = "Recreating booktabs style table") %>%
#   kable_classic(full_width = F, html_font = "Cambria")
# 
# t_spring =Clust_Data_spring %>%
#   kbl(caption = "Recreating booktabs style table") %>%
#   kable_classic(full_width = F, html_font = "Cambria")
# t_summer =Clust_Data_summer %>%
#   kbl(caption = "Recreating booktabs style table") %>%
#   kable_classic(full_width = F, html_font = "Cambria")
# t_autumn =Clust_Data_autumn %>%
#   kbl(caption = "Recreating booktabs style table") %>%
#   kable_classic(full_width = F, html_font = "Cambria")
# 
# 
#  
# library(knitr)
#  Clust_Data_winter = as.data.frame(Clust_Data_winter)
# Clust_Data_spring = as.data.frame(Clust_Data_spring)
# Clust_Data_summer = as.data.frame(Clust_Data_summer)
# Clust_Data_autumn = as.data.frame(Clust_Data_autumn)
# 
#  Clust_Data_winter=as.data.frame(Clust_Data_winter) %>% ungroup()
# row.names(Clust_Data_winter) <- Clust_Data_winter$Period #write the row names according to the country variable
# Clust_Data_winter <- Clust_Data_winter[,-1] #remove Year variable
#  
# Clust_Data_spring=as.data.frame(Clust_Data_spring) %>% ungroup()
# row.names(Clust_Data_spring) <- Clust_Data_spring$Period #write the row names according to the country variable
# Clust_Data_spring <- Clust_Data_spring[,-1] #remove Year variable
#  
# Clust_Data_summer=as.data.frame(Clust_Data_summer) %>% ungroup()
# row.names(Clust_Data_summer) <- Clust_Data_summer$Period #write the row names according to the country variable
# Clust_Data_summer <- Clust_Data_summer[,-1] #remove Year variable
#  
# Clust_Data_autumn=as.data.frame(Clust_Data_autumn) %>% ungroup()
# row.names(Clust_Data_autumn) <- Clust_Data_autumn$Period #write the row names according to the country variable
# Clust_Data_autumn <- Clust_Data_autumn[,-1] #remove Year variable
#  
# 
# 
# df1 =cbind(Clust_Data_winter,Clust_Data_autumn)
# df2 =cbind(Clust_Data_spring,Clust_Data_summer)
# 
# #kbl(df1) %>%
# #  kable_classic() %>%
# #  add_header_above(c(" " = 1, "Winter" = 3, "Autumn" = 3 ))
# 
# #kbl(df2) %>%
# #  kable_classic() %>%
# #  add_header_above(c(" " = 1, "Spring" = 3, "Summer" = 3 ))
# 

By performing the clustering month by month, each year is assigned to one of three groups: the years with the lowest temperatures, those around the mean, and those with the highest values. The cluster memberships obtained for each month are then grouped by the season they belong to, in order to obtain a summary table. We therefore want to see for which seasons the cluster with the lowest values corresponds to the first period, the cluster with the middle values to the second, and the cluster with the highest values to the third.

As the tables show, spring and summer are the seasons with the largest share of correctly classified years: the years of the first period (T1) fall in the coldest cluster, those of the second period (T2) in the cluster with average temperatures, and those of the third period (T3) in the hottest cluster.
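
To make "correctly classified" concrete, here is a small sketch with invented frequencies: with the three clusters ordered cold-medium-hot, the share of years on the diagonal of the period-by-cluster table measures how well periods and clusters agree.

# Illustrative only: invented counts for one season (30 year-months in total)
tab <- matrix(c(8, 1, 1,
                2, 7, 1,
                0, 2, 8),
              nrow = 3, byrow = TRUE,
              dimnames = list(Period  = c("T1", "T2", "T3"),
                              Cluster = c("C_Cold", "C_Medium", "C_Hot")))
sum(diag(tab)) / sum(tab)  # fraction of correctly classified years, here ~0.77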


Monthly temperature change

One possible approach is to measure the temperatures of each month in the three time periods (T1, T2 and T3) and test the hypothesis that the temperature changed across these periods, to see for which months the change is statistically significant at the 95% confidence level.

Data_plot <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_plot.csv")

 
Data_plot= Data_plot %>% 
   group_by(Date,Year, Month) %>%
  summarise(mean_T_2M=mean(mean_T_2M))

Data_plot1=Data_plot %>% filter(Year<1999) %>% ungroup()
Data_plot2=Data_plot %>% dplyr::filter(Year>1999 & Year<2010) %>% ungroup()
Data_plot3=Data_plot %>% filter(Year>2010 & Year<2021) %>% ungroup()


Data_plot1$Period <- "T1"
Data_plot2$Period <- "T2"
Data_plot3$Period <- "T3"

Data_plot = rbind(Data_plot1,Data_plot2)
Data_plot = rbind(Data_plot,Data_plot3)
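
# A sketch of the same split in one step with cut() (not the code used in
# this report); note that, unlike the >/< filters above, this variant would
# also assign the boundary years 1999 and 2010 instead of dropping them:
# Data_plot$Period <- cut(Data_plot$Year,
#                         breaks = c(1988, 1998, 2009, 2020),
#                         labels = c("T1", "T2", "T3"))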
 


  
Data_plot$Month[Data_plot$Month == "January"] <- "Jan"
Data_plot$Month[Data_plot$Month == "February"] <- "Feb"
Data_plot$Month[Data_plot$Month == "March"] <- "Mar"
Data_plot$Month[Data_plot$Month == "April"] <- "Apr"
Data_plot$Month[Data_plot$Month == "May"] <- "May"
Data_plot$Month[Data_plot$Month == "June"] <- "Jun"
Data_plot$Month[Data_plot$Month == "July"] <- "Jul"
Data_plot$Month[Data_plot$Month == "August"] <- "Aug"
Data_plot$Month[Data_plot$Month == "September"] <- "Sep"
Data_plot$Month[Data_plot$Month == "October"] <- "Oct"
Data_plot$Month[Data_plot$Month == "November"] <- "Nov"
Data_plot$Month[Data_plot$Month == "December"] <- "Dec"

Data_plot$Month <- factor(Data_plot$Month , levels=c("Jan"  , "Feb", "Mar" , "Apr" , "May" ,"Jun" , "Jul", "Aug", "Sep", "Oct", "Nov" , "Dec"))
 
# Grouped boxplot: for each month, one box per period (T1, T2, T3)
par(mar=c(3,4,3,1))
myplot <- boxplot(mean_T_2M ~ Period*Month , data=Data_plot  , 
        boxwex=0.4 , ylab="Temperature (°C)",  outline=FALSE,
        main="Monthly temperatures by period" , 
        col=c("slateblue1" , "tomato", "green") ,  
        xaxt="n")
 
 
# Build the x-axis labels: one month name per group of three boxes
my_names <- sapply(strsplit(myplot$names , '\\.') , function(x) x[[2]] )
my_names <- my_names[seq(1 , length(my_names) , 3)]
 
 

axis(1, 
     at = seq(1 , 36 , 3), 
     labels = my_names , 
     tick=FALSE , cex=0.3)

 
# Add grey vertical separators between the month groups
for(i in seq(0.5 , 36.5 , 3)){ 
  abline(v=i,lty=1, col="grey")
  }
# Add a legend
legend("topright", legend = c("T1", "T2", "T3"), 
       col=c("slateblue1" , "tomato", "green"),
       pch = 15, bty = "n", pt.cex = 3, cex = 1.2,  horiz = F )
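
For comparison, the same grouped boxplot can be sketched with ggplot2 (already loaded), assuming Data_plot as prepared above:

ggplot(Data_plot, aes(x = Month, y = mean_T_2M, fill = Period)) +
  geom_boxplot(outlier.shape = NA) +  # hide outliers, as in the base-R plot
  scale_fill_manual(values = c("slateblue1", "tomato", "green")) +
  labs(y = "Temperature (°C)", title = "Monthly temperatures by period")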

Now, with the pairwise t-test, we test which months show a difference in temperature between the periods. The paired-samples t-test, sometimes called the dependent-samples t-test, is a statistical procedure used to determine whether the mean difference between two sets of paired observations is zero.

The paired-samples t-test has the following prerequisites:

  • the dependent variable must be numeric and continuous;
  • the pairs of observations are independent of each other;
  • the dependent variable must be approximately normally distributed;
  • the dependent variable must contain no outliers.

Like many statistical procedures, the paired-samples t-test involves two competing hypotheses, the null hypothesis and the alternative hypothesis. The null hypothesis assumes that the true mean difference between paired samples is zero. According to this model, all observable differences are explained by random variation. The alternative hypothesis, on the other hand, assumes that the true mean difference between the paired samples is non-zero.

Hypotheses:

  • H0: there are no temperature differences between the three time periods;
  • H1: the time period has affected the temperature.
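
As a minimal sketch of the mechanics on simulated data (all numbers invented): the means of the same 30 calendar days in two periods are compared with a paired t-test, and with three periods the three pairwise p-values are then Bonferroni-adjusted.

set.seed(1)
t1 <- rnorm(30, mean = 8.0, sd = 2.0)       # hypothetical T1 daily means (°C)
t3 <- t1 + rnorm(30, mean = 1.2, sd = 0.5)  # the same days in a warmer T3
t.test(t3, t1, paired = TRUE)               # H0: true mean difference is zero
p.adjust(c(0.012, 0.048, 0.300), method = "bonferroni")  # toy pairwise p-values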
# # library(rstatix)   # anova_test, pairwise_t_test
# 
# # The repeated-measures ANOVA and the Bonferroni-adjusted pairwise t-tests
# # were run with the same template for all twelve months, so the template is
# # wrapped in a helper
# 
# pairwise_by_month <- function(month_data) {
#   Data_anova <- month_data %>%
#     dplyr::select(ID, Period, mean_T_2M) %>%
#     ungroup() %>%
#     mutate(ID = as.factor(ID), Period = as.factor(Period))
# 
#   # Repeated-measures ANOVA (computed as in the original runs; only the
#   # pairwise comparisons are used further below)
#   res.aov <- anova_test(data = Data_anova, dv = mean_T_2M, wid = ID,
#                         within = Period)
# 
#   pwc <- Data_anova %>%
#     pairwise_t_test(
#       mean_T_2M ~ Period, paired = TRUE,
#       p.adjust.method = "bonferroni"
#       )
# 
#   cbind(pwc[2:3], pwc[10])  # group1, group2 and p.adj.signif
# }
# 
# month_data_list <- list(Data_january, Data_february, Data_march, Data_april,
#                         Data_may, Data_june, Data_July, Data_august,
#                         Data_september, Data_october, Data_november,
#                         Data_december)
# 
# # One block of three pairwise comparisons per month, January to December
# anov <- do.call(rbind, lapply(month_data_list, pairwise_by_month))

#pwc_table <- data.frame(
#  MONTH = c(rbind(month.name, " ", " ")),  # month label on the first of its three rows
#  Group1 = rep(c("T1", "T1", "T2"), 12),
#  Group2 = rep(c("T2", "T3", "T3"), 12),
#  p.adj.signif = c("**",   "NS",   "NS",    # January
#                   "*",    "NS",   "*",     # February
#                   "NS",   "NS",   "NS",    # March
#                   "****", "****", "***",   # April
#                   "****", "NS",   "****",  # May
#                   "****", "****", "NS",    # June
#                   "NS",   "***",  "**",    # July
#                   "NS",   "****", "***",   # August
#                   "NS",   "****", "****",  # September
#                   "NS",   "***",  "NS",    # October
#                   "*",    "****", "*",     # November
#                   "NS",   "NS",   "NS"))   # December

#pwc_table %>%
#  kbl(caption = "Pairwise t-tests by month and period") %>%
#  kable_classic(full_width = F, html_font = "Cambria") %>%
#  kable_styling(bootstrap_options = c("striped", "hover"))

The table below shows, month by month and for each pair of periods, whether the difference is statistically significant (starred) or not (NS); the stars follow the usual convention (* p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001).

Thus, only the summer and spring months appear to present a statistically significant increase in temperature, with changes both between the first and second periods (T1-T2) and between the first and third (T1-T3).

The autumn months, on the other hand, show significance only between the first and third periods (T1-T3).

This means that while the autumn months have presented a continuous and gradual increase in temperature over time, the summer months have undergone a drastic increase over the last 10 years, with the shift occurring between periods T2 and T3; the spring months are instead most significant between T1 and T2, and therefore presented their shift between 30 and 20 years ago. March and December, on the contrary, show no particular changes between the periods under analysis, while January and February show only small differences overall.

Below is the same table with the months grouped according to their behaviour over time:

  • → : Constant
  • ↗ : Increase
  • ↘ : Decrease

As can be seen, January and February decreased in temperature during the first 20 years and increased afterwards, so the two changes balance out and the months result constant overall. The reverse happens for May. March and December remain constant throughout. April, October and November increase regularly in all periods. June increases only in the first 20 years, while July, August and September increase significantly only in the last 20 years.

One aspect that may interest us is to analyze the jump in temperature that occurred between two consecutive months, where the first remains constant over time and the second increases in temperature.

This may be why people say that the half seasons no longer exist: April, the first month of spring, now has a much higher temperature than March compared to the past. The same happens between November and December, where November's temperature has increased.

 Data_plot <- read_csv("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Data_plot.csv")

Data_plot= Data_plot %>% 
   group_by( Year, Month) %>%
  summarise(mean_T_2M=mean(mean_T_2M))

Data_plot1=Data_plot %>% filter(Year<1999) %>% ungroup()
Data_plot2=Data_plot %>% filter(Year>2010) %>% ungroup()

Data_plot1$Period <- "T1"
Data_plot2$Period <- "T3"

# Keep only the two months of interest in each period
Data_plot1=Data_plot1 %>% filter(Month=="November" | Month=="December")
Data_plot2=Data_plot2 %>% filter(Month=="November" | Month=="December")

Data_plot1= Data_plot1 %>% 
   group_by(Period, Month) %>%
  summarise(mean_T_2M=mean(mean_T_2M))

Data_plot2= Data_plot2 %>% 
   group_by(Period, Month) %>%
  summarise(mean_T_2M=mean(mean_T_2M))

Data_plot = rbind(Data_plot1,Data_plot2)

Data_plot=Data_plot %>%
  pivot_wider(names_from = Period , values_from = mean_T_2M )

# Differences in mean temperature between the month pairs April-March and
# November-December, transcribed from the output of the pipeline above (and
# from the analogous run for March and April)
Data_plot <- data.frame(
  Months = c("April-March", "November-December"),
  T1 = c(0.10704, 0.409939),
  T3 = c(1.9029, 1.442972)
)
row.names(Data_plot) <- Data_plot$Months
Data_plot <- as.matrix(Data_plot[,-1])

ze_barplot <- barplot(Data_plot, beside=T, legend.text=T,
                      col=c("blue", "skyblue"), ylim=c(0,2.3),
                      ylab="Temperature difference (°C)",
                      main="Difference in Mean Temperature")
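
The bar heights above were transcribed from the output of the pipeline. For reference, here is a hedged sketch of how the same month-to-month jumps could be computed directly, assuming a data frame with Period, Month and mean_T_2M columns as built earlier in this section (the helper name is illustrative):

month_jump <- function(df, m1, m2) {
  # mean temperature difference m2 minus m1 within each period
  df %>%
    filter(Month %in% c(m1, m2)) %>%
    group_by(Period, Month) %>%
    summarise(mean_T_2M = mean(mean_T_2M), .groups = "drop") %>%
    pivot_wider(names_from = Month, values_from = mean_T_2M) %>%
    mutate(jump = .data[[m2]] - .data[[m1]]) %>%
    dplyr::select(Period, jump)
}
# month_jump(period_data, "March", "April")
# month_jump(period_data, "November", "December")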

To better understand the behaviour of these distributions, let us analyse, for each month, the variation between one year and the previous one (see the sketch after this list). Specifically, among the months that showed significant changes over time, two groups of distributions can be distinguished based on their trends:

  • those with a decreasing variance over time: September, October and November;
  • those with an increasing variance over time: April, June, July and August.
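
The series plotted below are loaded from a precomputed spreadsheet. As a sketch (assuming Data_plot.csv re-read as at the start of this section), the same year-to-year changes could be derived directly with lag():

yearly_change <- Data_plot %>%
  group_by(Year, Month) %>%
  summarise(mean_T_2M = mean(mean_T_2M), .groups = "drop") %>%
  arrange(Month, Year) %>%
  group_by(Month) %>%
  mutate(tmp = mean_T_2M - lag(mean_T_2M)) %>%  # change from the previous year
  ungroup()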
library(readxl)
Variazioni_medie_mese_anno <- read_excel("C:/Users/claud_kcmwfzd/Desktop/STAGE/PROJECT/Report/Variazioni_medie_mese_anno.xlsx")
 
 

Data_decr = Variazioni_medie_mese_anno %>% dplyr::select(Year,September, October,November  )
Data_cre = Variazioni_medie_mese_anno %>% dplyr::select(Year,April, June, July, August  )



Data_decr = Data_decr %>%
  pivot_longer(!Year, names_to = "Month", values_to = "tmp")


Data_cre = Data_cre %>%
  pivot_longer(!Year, names_to = "Month", values_to = "tmp")


library(ggnewscale)
library(shadowtext)
library(ggtext)




BROWN <- "#AD8C97"
BROWN_DARKER <- "#7d3a46"
GREEN <- "#2FC1D3"
BLUE <- "#076FA1"
GREY <- "#C7C9CB"
GREY_DARKER <- "#5C5B5D"
RED <- "#E3120B"
 
# Aesthetics defined in the `ggplot()` call are reused in the 
# `geom_line()` and `geom_point()` calls.
plt1 <- ggplot(Data_decr, aes(Year, tmp)) +
  geom_line(aes(color = Month), size = 1.5) +
  geom_point(
    aes(fill = Month), 
    size = 3.5, 
    pch = 21, # Type of point that allows us to have both color (border) and fill.
    color = "white", 
    stroke = 1 # The width of the border, i.e. stroke.
   ) +
  # Set values for the color and the fill
  scale_color_manual(values = c(BLUE, GREEN, BROWN)) +
  scale_fill_manual(values = c(BLUE, GREEN, BROWN)) + 
  # Place the legend on the right
  theme(legend.position = "right")

 

plt1 <- plt1 + 
    theme(
    # Set background color to white
    panel.background = element_rect(fill = "white"),
    # Remove all grid lines
    panel.grid = element_blank(),
    # But add grid lines for the vertical axis, customizing color and size 
    panel.grid.major.y = element_line(color = "#A8BAC4", size = 0.3),
    # Remove tick marks on the vertical axis by setting their length to 0
    axis.ticks.length.y = unit(0, "mm"), 
    # But keep tick marks on horizontal axis
    axis.ticks.length.x = unit(2, "mm"),
    # Remove the title for both axes
    axis.title = element_blank(),
    # Only the bottom line of the vertical axis is painted in black
    axis.line.x.bottom = element_line(color = "black"),
    # Remove labels from the vertical axis
    axis.text.y = element_blank(),
    # But customize labels for the horizontal axis
    axis.text.x = element_text(family = "Econ Sans Cnd", size = 16)
  )

  
 

# Add labels for the horizontal lines
plt1 <- plt1 + 
  geom_text(
    data = data.frame(x = 2021.5, y = seq(-4, 4, by = 2)),
    aes(x, y, label = y),
    hjust = 1, # Align to the right
    vjust = 0, # Align to the bottom
    nudge_y = 32 * 0.01, # small vertical pad above each grid line
    family = "Econ Sans Cnd",
    size = 6
  )
Decr =plt1 









BROWN <- "#CD9600"
BROWN_DARKER <- "#7d3a46"
GREEN <- "#00BA38"
BLUE <- "#076FA1"
GREY <- "#C7C9CB"
GREY_DARKER <- "#ffc425"
RED <- "#E3120B"
 
# Aesthetics defined in the `ggplot()` call are reused in the 
# `geom_line()` and `geom_point()` calls.
plt1 <- ggplot(Data_cre, aes(Year, tmp)) +
  geom_line(aes(color = Month), size = 1.5) +
  geom_point(
    aes(fill = Month), 
    size = 3.5, 
    pch = 21, # Type of point that allows us to have both color (border) and fill.
    color = "white", 
    stroke = 1 # The width of the border, i.e. stroke.
   ) +
  # Set values for the color and the fill
  scale_color_manual(values = c(GREY_DARKER, GREEN,BROWN_DARKER,RED)) +
  scale_fill_manual(values = c(GREY_DARKER, GREEN, BROWN_DARKER, RED)) + 
  # Place the legend on the right
  theme(legend.position = "right")

 

plt1 <- plt1 + 
    theme(
    # Set background color to white
    panel.background = element_rect(fill = "white"),
    # Remove all grid lines
    panel.grid = element_blank(),
    # But add grid lines for the vertical axis, customizing color and size 
    panel.grid.major.y = element_line(color = "#A8BAC4", size = 0.3),
    # Remove tick marks on the vertical axis by setting their length to 0
    axis.ticks.length.y = unit(0, "mm"), 
    # But keep tick marks on horizontal axis
    axis.ticks.length.x = unit(2, "mm"),
    # Remove the title for both axes
    axis.title = element_blank(),
    # Only the bottom line of the vertical axis is painted in black
    axis.line.x.bottom = element_line(color = "black"),
    # Remove labels from the vertical axis
    axis.text.y = element_blank(),
    # But customize labels for the horizontal axis
    axis.text.x = element_text(family = "Econ Sans Cnd", size = 16)
  )

  
 
# Add labels for the horizontal lines
plt1 <- plt1 + 
  geom_text(
    data = data.frame(x = 2021.5, y = seq(-4, 4, by = 2)),
    aes(x, y, label = y),
    hjust = 1, # Align to the right
    vjust = 0, # Align to the bottom
    nudge_y = 32 * 0.01, # small vertical pad above each grid line
    family = "Econ Sans Cnd",
    size = 6
  )
Cre = plt1 

# ggarrange(Decr, Cre, ncol = 2, nrow = 2)
Decr

Cre

Spring and summer are therefore experiencing an increase in the temperature deviation between one year and the next: years with particularly hot spring and/or summer months can be followed by years in which the same months are particularly cool. The autumn months, in contrast, show a reduction in variance.

In this way, spring months with particularly high temperatures could just as well belong to the summer season, while those with particularly low temperatures could belong to the winter season.

Likewise, the summer months can show, on average, extremely high temperatures in some years and, in other years, temperatures low enough to be typical of spring.

In contrast, autumn months show a decrease in variance that compacts them into a range of similar values.



Conclusion

We can therefore come to the following conclusions:

The results presented in this paper indicate that the temperature is steadily increasing, especially in the last 10 years. The number of extremely warm days and warm nights is increasing, while days with particularly low temperatures are decreasing.

The hottest locations are precisely those in urban areas, particularly in the south and west of Caserta. In contrast, the coolest are the areas to the north and north-east, which are hilly and sparsely populated.

The seasons with a statistically significant, drastic increase in average temperature are summer and spring; autumn also appears to warm, although not significantly, while winter slightly lowers its annual average.

Only the summer and spring months show a statistically significant increase in temperature; in particular, they changed between the first and second periods (T1-T2) and/or between the first and third (T1-T3). The autumn months, on the other hand, were significant only between the first and third periods (T1-T3).

This means that while the autumn months have presented a continuous and steady increase in temperature over time, the summer months have experienced a drastic increase over the past 10 years, with the shift occurring between periods T2 and T3; the spring months are instead most significant between T1 and T2, and thus presented their shift between 30 and 20 years ago.

In addition, we have seen that the difference in average temperature between some consecutive months has widened, especially between March and April and between November and December. This produces a drastic increase in temperature between the first pair and a drastic decrease between the second, fueling the gap between seasons, particularly between winter and spring and between autumn and winter.

Thus, spring and summer record an increase in the temperature shift between years, while the autumn months show a decrease in variance. In this way, spring months with particularly high temperatures could be included in the summer season, while those with particularly low temperatures could be included in the winter season; likewise, the summer months may show, on average, particularly low temperatures in some years, common to spring, and extremely high temperatures in others. The autumn months, on the other hand, exhibit a decrease in variance that compacts them into a range of similar values each year.



Acknowledgements

I would like to thank my university tutor, Dr. Rosanna Verde, who put me in touch with the organisation and guided me throughout my training. A big thank you to my company tutor, Paola Mercogliano, who supervised me during my internship and welcomed me into the team at CMCC, particularly the REMHI division, and without whom this internship would not have been possible. I would also like to thank the entire CMCC and REMHI staff for welcoming me and teaching me so much about life as a scientific researcher.