Hevans Vinicius Pereira
João Matheus Slujala Krüger Taborda Hneda
Marcos da Silva Medeiros
Vanessa de Oliveira Lima
Professor Vyacheslav Lyubchich
Professor Eniuce Menezes
After importing the data, we inspect its structure.
## Rows: 1,242,662
## Columns: 22
## $ WTEMP_avg <dbl> 11.80897, 12.09988, 12.20975, 11.34375, 10.01036, 10.07240…
## $ WTEMP_min <dbl> 11.524287, 12.038053, 11.977629, 10.581435, 9.753779, 9.70…
## $ WTEMP_max <dbl> 12.07597, 12.21082, 12.36400, 11.92079, 10.42982, 10.38063…
## $ WTEMP_sd <dbl> 0.209204162, 0.066137404, 0.158417522, 0.525706448, 0.2586…
## $ SAL_avg <dbl> 26.86879, 27.62771, 28.25292, 26.35504, 24.36928, 25.07213…
## $ SAL_min <dbl> 26.32957, 27.37041, 27.82235, 24.98701, 23.62500, 23.56463…
## $ SAL_max <dbl> 27.41063, 28.03810, 28.60699, 27.78203, 25.05359, 26.32559…
## $ SAL_sd <dbl> 0.41049358, 0.25444444, 0.30748888, 1.32687911, 0.59092177…
## $ TPOC_avg <dbl> 0.4474859, 0.4845421, 0.4721686, 0.5109920, 0.5433232, 0.5…
## $ TPOC_min <dbl> 0.4309263, 0.4746964, 0.4493208, 0.4934338, 0.4904417, 0.5…
## $ TPOC_max <dbl> 0.4664134, 0.4928869, 0.5139369, 0.5401617, 0.6037775, 0.6…
## $ TPOC_sd <dbl> 0.012511266, 0.007685450, 0.024912795, 0.018752936, 0.0500…
## $ CHLAVEG_avg <dbl> 1.795635, 1.894775, 1.779504, 2.330629, 3.270122, 3.748852…
## $ CHLAVEG_min <dbl> 1.721618, 1.869838, 1.690495, 1.991545, 2.797004, 3.416998…
## $ CHLAVEG_max <dbl> 1.859602, 1.928245, 1.928089, 2.878195, 3.921468, 4.087770…
## $ CHLAVEG_sd <dbl> 0.05235933, 0.02543517, 0.08790034, 0.36788327, 0.49513495…
## $ DOAVEG_avg <dbl> 7.824160, 7.678386, 8.013510, 8.437577, 9.373971, 9.646327…
## $ DOAVEG_min <dbl> 7.729130, 7.634193, 7.843742, 8.122249, 9.000209, 9.563031…
## $ DOAVEG_max <dbl> 7.938271, 7.739849, 8.122721, 8.869912, 9.730952, 9.747666…
## $ DOAVEG_sd <dbl> 0.07595991, 0.03805551, 0.10140850, 0.27087610, 0.31028372…
## $ CellID <int> 332, 332, 332, 332, 332, 332, 332, 332, 332, 333, 333, 333…
## $ Date <chr> "2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2…
After importing the data, we kept only the columns ‘CellID’, ‘Date’ and ‘DOAVEG_avg’.
We then filtered the data to the year 2012, which leaves daily observations. We developed a Shiny app to visualize the trend of DOAVEG_avg by CellID.
DataViz in Shiny App: DOAVG Time Series by Species
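The column selection and the 2012 filter can be sketched in base R as follows. This is a minimal illustration on made-up data, not the code behind the report: the object names `dados`, `do_dat` and `do_2012` are hypothetical, and only the column names come from the output above.

```r
# Toy stand-in for the imported table (values are made up for illustration).
set.seed(1)
dates <- seq(as.Date("2011-12-30"), as.Date("2012-01-03"), by = "day")
dados <- data.frame(
  CellID     = rep(c(332L, 333L), each = length(dates)),
  Date       = rep(as.character(dates), times = 2),  # Date is stored as <chr>
  DOAVEG_avg = runif(2 * length(dates), 7, 10),
  SAL_avg    = runif(2 * length(dates), 24, 28),
  stringsAsFactors = FALSE
)

# Keep only the three columns used in this part of the analysis.
do_dat <- dados[, c("CellID", "Date", "DOAVEG_avg")]

# Parse the character dates and keep the rows that fall in 2012.
do_dat$Date <- as.Date(do_dat$Date)
do_2012 <- do_dat[format(do_dat$Date, "%Y") == "2012", ]
```

Parsing `Date` once with `as.Date()` also makes it easy to derive the month later on.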
Next, we computed the mean of DOAVEG_avg for each CellID and month (the reduced dataset).
We then selected 100 cells at random.
This gives a data.frame of 1200 rows (100 cells × 12 months) and 3 columns.
## [1] 1200 3
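The monthly aggregation and the random draw of cells can be sketched in base R as below. The data are simulated, and the names `do_2012`, `monthly`, `keep` and `reduced` are assumptions made for the example. The report may have used different tools for this step.

```r
set.seed(42)
# Toy daily data for 2012: 150 hypothetical cells, one value per day.
days  <- seq(as.Date("2012-01-01"), as.Date("2012-12-31"), by = "day")
cells <- 1:150
do_2012 <- data.frame(
  CellID = rep(cells, each = length(days)),
  Date   = rep(days, times = length(cells))
)
doy <- as.numeric(format(do_2012$Date, "%j"))
do_2012$DOAVEG_avg <- 8 + 2 * cos(2 * pi * doy / 366) +
  rnorm(nrow(do_2012), sd = 0.3)

# Month index (1 to 12) derived from the date.
do_2012$Month <- as.integer(format(do_2012$Date, "%m"))

# Mean DOAVEG_avg per cell and month.
monthly <- aggregate(DOAVEG_avg ~ CellID + Month, data = do_2012, FUN = mean)

# Draw 100 cells at random and keep their 12 monthly rows each.
keep    <- sample(unique(monthly$CellID), 100)
reduced <- monthly[monthly$CellID %in% keep, ]
dim(reduced)
```

With 100 cells and 12 months, `dim(reduced)` has 1200 rows and 3 columns, the same shape as the report's output.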
Then we reshaped the data.frame from long to wide format. Now we can apply time-series clustering methods.
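One way to do the long-to-wide step is base R's `reshape()`, sketched here on a small made-up table. The orientation (one row per cell, one column per month) is an assumption, since the source does not show the resulting layout. The names `long`, `wide` and `series` are hypothetical.

```r
# Toy long table: 3 hypothetical cells x 12 months (values are made up).
long <- expand.grid(Month = 1:12, CellID = c(332L, 333L, 334L))
long$DOAVEG_avg <- round(8 + sin(long$Month) + long$CellID / 1000, 3)

# Long to wide: one row per cell, one DOAVEG_avg column per month.
wide <- reshape(long, idvar = "CellID", timevar = "Month", direction = "wide")

# Numeric matrix with one time series per row, labelled by cell.
series <- as.matrix(wide[, setdiff(names(wide), "CellID")])
rownames(series) <- wide$CellID
dim(series)
```

A numeric matrix with one series per row is a convenient input for clustering functions that treat rows as observations.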
##
## Precomputing distance matrix...
##
## Iteration 1: Changes / Distsum = 12 / 1286.42
## Iteration 2: Changes / Distsum = 1 / 1097.847
## Iteration 3: Changes / Distsum = 0 / 1097.847
##
## Elapsed time is 0.21 seconds.
## partitional clustering with 3 clusters
## Using dtw_basic distance
## Using pam centroids
##
## Time required for analysis:
## user system elapsed
## 0.11 0.08 0.21
##
## Cluster sizes with average intra-cluster distance:
##
## size av_dist
## 1 2 52.35864
## 2 6 104.98856
## 3 4 90.79962
##
## Calculating distance matrix...
## Performing hierarchical clustering...
## Extracting centroids...
##
## Elapsed time is 0.17 seconds.
## hierarchical clustering with 3 clusters
## Using sbd distance
## Using PAM (Hierarchical) centroids
## Using method average
##
## Time required for analysis:
## user system elapsed
## 0.14 0.00 0.17
##
## Cluster sizes with average intra-cluster distance:
##
## size av_dist
## 1 6 0.004315081
## 2 4 0.008560076
## 3 2 0.006757188
## Iteration 1: Objective = 0.1240
## Iteration 2: Objective = 0.1032
## Iteration 3: Objective = 0.0688
## Iteration 4: Objective = 0.0462
## Iteration 5: Objective = 0.0385
## Iteration 6: Objective = 0.0378
##
## Elapsed time is 0.04 seconds.
## fuzzy clustering with 3 clusters
## Using l2 distance
## Using fcm centroids
## Using acf_fun preprocessing
##
## Time required for analysis:
## user system elapsed
## 0.03 0.01 0.04
##
## Head of fuzzy memberships:
##
## cluster_1 cluster_2 cluster_3
## 1 0.85275296 0.11840863 0.028838412
## 2 0.01393058 0.98417808 0.001891333
## 3 0.05681618 0.93535412 0.007829701
## 4 0.04096201 0.95319926 0.005838727
## 5 0.92919688 0.05190126 0.018901858
## 6 0.48552754 0.12076132 0.393711142
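The three outputs above have the form printed by `tsclust()` from the dtwclust package. The calls below are a sketch of settings that match the reported distances and centroids. They run on simulated series, and the seeds, the `lag.max` value and the `acf_fun` body are assumptions, not the report's actual code. Note that `tsclust()` treats each row of a matrix as one series, so the orientation of the wide table determines what gets clustered.

```r
library(dtwclust)

set.seed(123)
# Toy input: 12 hypothetical series of 12 monthly values, one series per row.
months <- 1:12
series <- t(sapply(1:12, function(i) {
  phase <- c(0, 2, 4)[(i - 1) %% 3 + 1]
  8 + 2 * cos(2 * pi * (months + phase) / 12) + rnorm(12, sd = 0.2)
}))

# Partitional clustering: DTW distance with PAM (medoid) centroids.
pc <- tsclust(series, type = "partitional", k = 3L,
              distance = "dtw_basic", centroid = "pam", seed = 1L)

# Hierarchical clustering: shape-based distance with average linkage.
hc <- tsclust(series, type = "hierarchical", k = 3L,
              distance = "sbd",
              control = hierarchical_control(method = "average"))

# Preprocessing for the fuzzy run: replace each series by its autocorrelations.
acf_fun <- function(series, ...) {
  lapply(series, function(x) {
    as.numeric(acf(x, lag.max = 6L, plot = FALSE)$acf)
  })
}

# Fuzzy c-means on the ACF-transformed series with Euclidean (L2) distance.
fc <- tsclust(series, type = "fuzzy", k = 3L,
              preproc = acf_fun, distance = "L2", centroid = "fcm", seed = 1L)

table(pc@cluster)   # hard cluster assignments
head(fc@fcluster)   # membership degrees, one column per cluster
```

The fuzzy run returns a membership matrix rather than a single label per series, which is why its output shows one column per cluster with rows that sum to 1.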
After importing the data, we selected the columns ‘WTEMP_avg’, ‘SAL_avg’, ‘TPOC_avg’, ‘CHLAVEG_avg’ and ‘DOAVEG_avg’ and kept the first 40,000 rows of the dataset.
Then we split the dataset into training and test sets.
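A random split can be sketched in base R as follows. The 70/30 ratio, the seed and the name `treino` are assumptions, since the source gives neither the proportion nor the code. `teste` is the test-set name used later in the report, and the data here are simulated.

```r
set.seed(2024)
n <- 500
# Toy table with the five selected columns (values are made up).
dat <- data.frame(
  WTEMP_avg   = runif(n, 5, 28),
  SAL_avg     = runif(n, 20, 30),
  TPOC_avg    = runif(n, 0.3, 0.8),
  CHLAVEG_avg = runif(n, 1, 5),
  DOAVEG_avg  = runif(n, 6, 11)
)

# Sample 70% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
treino <- dat[train_idx, ]
teste  <- dat[-train_idx, ]
```

Indexing with `-train_idx` guarantees that no row appears in both sets.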
After that, we trained a random forest model with WTEMP_avg as the response variable and the remaining columns as explanatory variables.
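A fit of this kind can be sketched with the randomForest package, whose regression importance measure is the `IncNodePurity` column reported further down. The data, the seed and `ntree = 200` are assumptions for illustration, and `treino` is a hypothetical name for the training set.

```r
library(randomForest)

set.seed(7)
n <- 300
# Toy training set; WTEMP_avg is first so that columns 2:5 are the predictors.
treino <- data.frame(
  WTEMP_avg   = NA_real_,
  SAL_avg     = runif(n, 20, 30),
  TPOC_avg    = runif(n, 0.3, 0.8),
  CHLAVEG_avg = runif(n, 1, 5),
  DOAVEG_avg  = runif(n, 6, 11)
)
# Made-up relationship so the forest has something to learn.
treino$WTEMP_avg <- 40 - 3 * treino$DOAVEG_avg + 0.2 * treino$SAL_avg +
  rnorm(n, sd = 0.5)

# Regression forest: WTEMP_avg explained by all remaining columns.
model <- randomForest(WTEMP_avg ~ ., data = treino, ntree = 200)

# Total decrease in node impurity attributable to each predictor.
importance(model)
```

The formula `WTEMP_avg ~ .` uses every other column as a predictor, which matches the description of "the remaining columns as explanatory variables".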
# Making predictions
prev <- predict(model, teste[,2:5])
# Evaluating the model (RMSE on the test set)
rmse(teste[,1], prev)
## [1] 0.4841574
## [1] 0.9709738
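For reference, the root mean squared error can be computed by hand as in the sketch below, on made-up numbers. The coefficient of determination is shown alongside it as another common summary. This is an assumption for illustration only: the source does not say which statistic the second printed value is.

```r
# Made-up observed and predicted values.
obs  <- c(10.2, 11.5, 12.1, 9.8, 10.9)
pred <- c(10.0, 11.9, 12.0, 10.1, 10.6)

# Root mean squared error: typical size of a prediction error, in data units.
rmse_manual <- sqrt(mean((obs - pred)^2))

# Coefficient of determination: 1 minus residual over total sum of squares.
r2_manual <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(rmse = rmse_manual, r2 = r2_manual)
```

RMSE is expressed in the units of the response (here, the units of WTEMP_avg), while R² is unitless.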
# Observed vs. predicted values; the blue line marks perfect agreement (y = x)
ggplot(aes(x=prev,y=WTEMP_avg), data=as.data.frame(cbind(prev,teste))) +
geom_point() +
geom_abline(slope=1, intercept=0, col='blue') +
xlab("prediction") + ylab("observed")
##             IncNodePurity
## SAL_avg          60023.34
## TPOC_avg         17953.17
## CHLAVEG_avg      15005.18
## DOAVEG_avg       37470.71