Introduction to Machine Learning with Applications

Students:

Hevans Vinicius Pereira

João Matheus Slujala Krüger Taborda Hneda

Marcos da Silva Medeiros

Vanessa de Oliveira Lima

Report presented as part of the assessment for the course “Introduction to Machine Learning with Applications”.

Professor Vyacheslav Lyubchich

Professor Eniuce Menezes

Data Visualization in Shiny App and Slides Presentation

DOAVG Time Series by Species

Slides Presentation

Presentation Codes

Clustering Time Series

After importing the data, we inspect its structure.

## Rows: 1,242,662
## Columns: 22
## $ WTEMP_avg   <dbl> 11.80897, 12.09988, 12.20975, 11.34375, 10.01036, 10.07240…
## $ WTEMP_min   <dbl> 11.524287, 12.038053, 11.977629, 10.581435, 9.753779, 9.70…
## $ WTEMP_max   <dbl> 12.07597, 12.21082, 12.36400, 11.92079, 10.42982, 10.38063…
## $ WTEMP_sd    <dbl> 0.209204162, 0.066137404, 0.158417522, 0.525706448, 0.2586…
## $ SAL_avg     <dbl> 26.86879, 27.62771, 28.25292, 26.35504, 24.36928, 25.07213…
## $ SAL_min     <dbl> 26.32957, 27.37041, 27.82235, 24.98701, 23.62500, 23.56463…
## $ SAL_max     <dbl> 27.41063, 28.03810, 28.60699, 27.78203, 25.05359, 26.32559…
## $ SAL_sd      <dbl> 0.41049358, 0.25444444, 0.30748888, 1.32687911, 0.59092177…
## $ TPOC_avg    <dbl> 0.4474859, 0.4845421, 0.4721686, 0.5109920, 0.5433232, 0.5…
## $ TPOC_min    <dbl> 0.4309263, 0.4746964, 0.4493208, 0.4934338, 0.4904417, 0.5…
## $ TPOC_max    <dbl> 0.4664134, 0.4928869, 0.5139369, 0.5401617, 0.6037775, 0.6…
## $ TPOC_sd     <dbl> 0.012511266, 0.007685450, 0.024912795, 0.018752936, 0.0500…
## $ CHLAVEG_avg <dbl> 1.795635, 1.894775, 1.779504, 2.330629, 3.270122, 3.748852…
## $ CHLAVEG_min <dbl> 1.721618, 1.869838, 1.690495, 1.991545, 2.797004, 3.416998…
## $ CHLAVEG_max <dbl> 1.859602, 1.928245, 1.928089, 2.878195, 3.921468, 4.087770…
## $ CHLAVEG_sd  <dbl> 0.05235933, 0.02543517, 0.08790034, 0.36788327, 0.49513495…
## $ DOAVEG_avg  <dbl> 7.824160, 7.678386, 8.013510, 8.437577, 9.373971, 9.646327…
## $ DOAVEG_min  <dbl> 7.729130, 7.634193, 7.843742, 8.122249, 9.000209, 9.563031…
## $ DOAVEG_max  <dbl> 7.938271, 7.739849, 8.122721, 8.869912, 9.730952, 9.747666…
## $ DOAVEG_sd   <dbl> 0.07595991, 0.03805551, 0.10140850, 0.27087610, 0.31028372…
## $ CellID      <int> 332, 332, 332, 332, 332, 332, 332, 332, 332, 333, 333, 333…
## $ Date        <chr> "2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2…

We use only the columns ‘CellID’, ‘Date’ and ‘DOAVEG_avg’.

Data manipulations

After importing the data, we selected the columns ‘CellID’, ‘Date’ and ‘DOAVEG_avg’.

After this, we kept only the data for 2012, which are daily observations. We developed a Shiny app to visualize the trend of DOAVEG_avg by CellID.

DataViz in Shiny App: DOAVG Time Series by Species

After this, we computed the mean of DOAVEG_avg for each CellID and month, which gives a reduced dataset.
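The filtering and aggregation steps can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame `daily`, its random values and the helper column `Month` are hypothetical, since the code we actually ran is not reproduced in this report.

```r
# Toy stand-in for the imported data (hypothetical values): two cells, daily records
set.seed(1)
dates <- seq(as.Date("2011-12-01"), as.Date("2012-12-31"), by = "day")
daily <- data.frame(
  CellID     = rep(c(332L, 333L), each = length(dates)),
  Date       = rep(as.character(dates), times = 2),
  DOAVEG_avg = runif(2 * length(dates), min = 5, max = 12)
)

# Keep only the observations from 2012
d2012 <- daily[format(as.Date(daily$Date), "%Y") == "2012", ]

# Derive the month, then average DOAVEG_avg per cell and month
d2012$Month <- as.integer(format(as.Date(d2012$Date), "%m"))
monthly <- aggregate(DOAVEG_avg ~ CellID + Month, data = d2012, FUN = mean)

dim(monthly)  # one row per (cell, month) pair
```

With two cells and twelve months, `monthly` has one row for each of the 24 cell and month combinations.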

Then, we randomly selected 100 cells.

This gives us a data.frame with 1,200 rows (100 cells × 12 months) and 3 columns.

## [1] 1200    3
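A minimal sketch of the random selection of cells, assuming a monthly table with one row per cell and month. The table `monthly`, the 500 candidate cells and the seed are hypothetical and chosen only for illustration.

```r
# Toy monthly table (hypothetical): 500 cells x 12 months
set.seed(42)
monthly <- expand.grid(CellID = 1:500, Month = 1:12)
monthly$DOAVEG_avg <- runif(nrow(monthly), min = 5, max = 12)

# Draw 100 distinct cells at random and keep all of their monthly rows
sampled_cells <- sample(unique(monthly$CellID), size = 100)
reduced <- monthly[monthly$CellID %in% sampled_cells, ]

dim(reduced)  # 100 cells x 12 months = 1200 rows, 3 columns
```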

Then, we reshaped the data.frame from long to wide format. We can now apply time series clustering methods.
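The long-to-wide step can be illustrated with base R's `reshape()`. The toy values below are hypothetical. Putting one cell per row and one month per column is also an assumption, because the report does not state which dimension was treated as the series.

```r
# Toy long table (hypothetical values): 3 cells x 12 months
long <- expand.grid(CellID = c(332L, 333L, 334L), Month = 1:12)
long$DOAVEG_avg <- round(8 + 2 * cos(2 * pi * long$Month / 12) + long$CellID / 1000, 3)

# Long to wide: one row per CellID, one column per Month
wide <- reshape(long, idvar = "CellID", timevar = "Month", direction = "wide")

# Numeric matrix with one series per row, ready for a clustering routine
series <- as.matrix(wide[, -1])
rownames(series) <- wide$CellID
dim(series)
```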

Methods and Results (Partitional)

## 
##  Precomputing distance matrix...
## 
## Iteration 1: Changes / Distsum = 12 / 1286.42
## Iteration 2: Changes / Distsum = 1 / 1097.847
## Iteration 3: Changes / Distsum = 0 / 1097.847
## 
##  Elapsed time is 0.21 seconds.

## partitional clustering with 3 clusters
## Using dtw_basic distance
## Using pam centroids
## 
## Time required for analysis:
##    user  system elapsed 
##    0.11    0.08    0.21 
## 
## Cluster sizes with average intra-cluster distance:
## 
##   size   av_dist
## 1    2  52.35864
## 2    6 104.98856
## 3    4  90.79962
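The output above reports a DTW-based distance with PAM (medoid) centroids. To show what these two ingredients do, here is a base R sketch under stated assumptions: `dtw_distance` and `k_medoids` are hypothetical helper names, the DTW uses unweighted steps (the software behind the output above may weight steps differently), the k-medoids loop is a simple alternating scheme, and the six toy series are invented.

```r
# DTW distance by dynamic programming: absolute difference as local cost,
# unweighted steps (assumption, see the text above)
dtw_distance <- function(x, y) {
  n <- length(x); m <- length(y)
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- d + min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, m + 1]
}

# Toy data (hypothetical): six monthly series forming three obvious groups
month <- 1:12
series <- rbind(
  sin(2 * pi * month / 12),     sin(2 * pi * month / 12) + 0.1,
  5 + cos(2 * pi * month / 12), 5.1 + cos(2 * pi * month / 12),
  rep(10, 12),                  rep(10.2, 12)
)

# Pairwise DTW distance matrix (upper triangle computed, then mirrored)
n <- nrow(series)
D <- matrix(0, n, n)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  D[i, j] <- D[j, i] <- dtw_distance(series[i, ], series[j, ])
}

# Alternating k-medoids: assign to nearest medoid, then update each medoid
k_medoids <- function(D, medoids, max_iter = 20) {
  for (iter in seq_len(max_iter)) {
    cluster <- apply(D[, medoids, drop = FALSE], 1, which.min)
    new_medoids <- sapply(seq_along(medoids), function(g) {
      members <- which(cluster == g)
      # The new medoid is the member with the smallest total distance to the others
      members[which.min(rowSums(D[members, members, drop = FALSE]))]
    })
    if (all(new_medoids == medoids)) break
    medoids <- new_medoids
  }
  list(cluster = cluster, medoids = medoids)
}

fit <- k_medoids(D, medoids = c(2, 4, 6))
fit$cluster
fit$medoids
```

Because a medoid is always one of the observed series, each cluster centre can be read directly as a real cell rather than as an averaged curve.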


Methods and Results (Hierarchical)

## 
## Calculating distance matrix...
## Performing hierarchical clustering...
## Extracting centroids...
## 
##  Elapsed time is 0.17 seconds.

## hierarchical clustering with 3 clusters
## Using sbd distance
## Using PAM (Hierarchical) centroids
## Using method average 
## 
## Time required for analysis:
##    user  system elapsed 
##    0.14    0.00    0.17 
## 
## Cluster sizes with average intra-cluster distance:
## 
##   size     av_dist
## 1    6 0.004315081
## 2    4 0.008560076
## 3    2 0.006757188
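The hierarchical output above reports the shape-based distance (SBD) with average linkage. A minimal base R sketch of that combination follows. It assumes equal-length, z-normalized series. The helper names `sbd` and `znorm` and the toy series are hypothetical, and the sketch is not meant to reproduce the numbers shown above.

```r
# Shape-based distance: 1 minus the maximum normalized cross-correlation over all shifts
sbd <- function(x, y) {
  n <- length(x)
  shifts <- -(n - 1):(n - 1)
  cc <- sapply(shifts, function(s) {
    if (s >= 0) sum(x[(1 + s):n] * y[1:(n - s)])
    else        sum(x[1:(n + s)] * y[(1 - s):n])
  })
  1 - max(cc) / sqrt(sum(x^2) * sum(y^2))
}

znorm <- function(x) (x - mean(x)) / sd(x)

# Toy data (hypothetical): three shapes, each with a slightly noisy copy
set.seed(1)
month <- 1:12
shapes <- rbind(
  sin(2 * pi * month / 12), sin(2 * pi * month / 12) + rnorm(12, sd = 0.05),
  month,                    month + rnorm(12, sd = 0.05),
  sin(2 * pi * month / 3),  sin(2 * pi * month / 3) + rnorm(12, sd = 0.05)
)
series <- t(apply(shapes, 1, znorm))

# Pairwise SBD matrix
n <- nrow(series)
D <- matrix(0, n, n)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  D[i, j] <- D[j, i] <- sbd(series[i, ], series[j, ])
}

# Average-linkage hierarchical clustering, cut into 3 clusters
hc <- hclust(as.dist(D), method = "average")
cl <- cutree(hc, k = 3)
cl
```

SBD takes values between 0 and 2 and compares shapes regardless of shifts. This is why the average intra-cluster distances in the hierarchical output are on a much smaller scale than the DTW distances of the partitional run.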

Methods and Results (Fuzzy)

## Iteration 1: Objective = 0.1240
## Iteration 2: Objective = 0.1032
## Iteration 3: Objective = 0.0688
## Iteration 4: Objective = 0.0462
## Iteration 5: Objective = 0.0385
## Iteration 6: Objective = 0.0378
## 
##  Elapsed time is 0.04 seconds.

## fuzzy clustering with 3 clusters
## Using l2 distance
## Using fcm centroids
## Using acf_fun preprocessing
## 
## Time required for analysis:
##    user  system elapsed 
##    0.03    0.01    0.04 
## 
## Head of fuzzy memberships:
## 
##    cluster_1  cluster_2   cluster_3
## 1 0.85275296 0.11840863 0.028838412
## 2 0.01393058 0.98417808 0.001891333
## 3 0.05681618 0.93535412 0.007829701
## 4 0.04096201 0.95319926 0.005838727
## 5 0.92919688 0.05190126 0.018901858
## 6 0.48552754 0.12076132 0.393711142

Methods and Results (Fuzzy)
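The fuzzy run reports autocorrelation preprocessing, an L2 distance and fuzzy c-means (fcm) centroids. The sketch below illustrates that pipeline in base R under stated assumptions: `acf_features` and `fcm_memberships` are hypothetical helper names, the fuzziness exponent `m = 2`, the number of lags, the starting rows and the toy series are all chosen for illustration only.

```r
# Preprocessing: represent each series by its sample autocorrelations at lags 1..lag_max
acf_features <- function(x, lag_max = 5) {
  as.numeric(acf(x, lag.max = lag_max, plot = FALSE)$acf)[-1]
}

# Fuzzy c-means with Euclidean (L2) distance; centres start at the rows in init_rows
fcm_memberships <- function(X, init_rows, m = 2, iters = 50) {
  k <- length(init_rows)
  p <- 2 / (m - 1)
  centers <- X[init_rows, , drop = FALSE]
  for (it in seq_len(iters)) {
    # n x k matrix of distances from every series to every centre
    d <- sapply(seq_len(k), function(j) sqrt(rowSums(sweep(X, 2, centers[j, ])^2)))
    d <- pmax(d, 1e-12)                  # guard against division by zero
    u <- 1 / (d^p * rowSums(d^(-p)))     # membership update, each row sums to 1
    w <- u^m
    centers <- t(w) %*% X / colSums(w)   # membership-weighted centroids
  }
  u
}

# Toy data (hypothetical): seasonal, alternating and white-noise series, two of each
set.seed(7)
idx <- 1:120
series <- rbind(
  sin(2 * pi * idx / 12) + rnorm(120, sd = 0.1),
  sin(2 * pi * idx / 12) + rnorm(120, sd = 0.1),
  (-1)^idx + rnorm(120, sd = 0.1),
  (-1)^idx + rnorm(120, sd = 0.1),
  rnorm(120),
  rnorm(120)
)

X <- t(apply(series, 1, acf_features))
u <- fcm_memberships(X, init_rows = c(1, 3, 5))
round(u, 3)
hard <- apply(u, 1, which.max)  # hard assignment from the largest membership
```

As in the membership table of the fuzzy output, every row of `u` sums to 1. A row with two comparable memberships, like series 6 in that table, marks a series that sits between clusters.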

Random Forest

Data manipulations

After importing the data, we selected the columns ‘WTEMP_avg’, ‘SAL_avg’, ‘TPOC_avg’, ‘CHLAVEG_avg’ and ‘DOAVEG_avg’ and the first 40,000 rows of the dataset.

Then, we split the dataset into training and test sets.
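A minimal sketch of such a split in base R. The 70/30 proportion, the seed, the toy data and the name `treino` are assumptions made for illustration. Only the name `teste` for the test set appears in our code.

```r
# Toy dataset (hypothetical values) standing in for the selected columns
set.seed(123)
dataset <- data.frame(
  WTEMP_avg = rnorm(1000, mean = 15, sd = 5),
  SAL_avg   = rnorm(1000, mean = 25, sd = 2)
)

# Random 70/30 split (the proportion is an assumption)
n <- nrow(dataset)
train_idx <- sample(n, size = round(0.7 * n))
treino <- dataset[train_idx, ]   # training set
teste  <- dataset[-train_idx, ]  # test set

c(train = nrow(treino), test = nrow(teste))
```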

Methods and Results

After that, we trained the random forest model, with WTEMP_avg as the response variable and the remaining columns as explanatory variables.
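For illustration, a model of this kind can be fitted as sketched below. The use of the `randomForest` package is an assumption (the report does not name the package), as are the synthetic data, the invented relationship between the variables, the 70/30 split, `ntree = 200` and the name `treino`.

```r
library(randomForest)  # assumption: the package is not named in this report

# Synthetic data (hypothetical): response in column 1, explanatory variables in columns 2 to 5
set.seed(2024)
n <- 300
toy <- data.frame(
  SAL_avg     = runif(n, 20, 30),
  TPOC_avg    = runif(n, 0.3, 0.7),
  CHLAVEG_avg = runif(n, 1, 5),
  DOAVEG_avg  = runif(n, 6, 10)
)
# Invented relationship, used only so that the model has something to learn
toy$WTEMP_avg <- 30 - 2 * toy$DOAVEG_avg + 0.3 * toy$SAL_avg + rnorm(n, sd = 0.3)
toy <- toy[, c("WTEMP_avg", "SAL_avg", "TPOC_avg", "CHLAVEG_avg", "DOAVEG_avg")]

train_idx <- sample(n, size = round(0.7 * n))
treino <- toy[train_idx, ]
teste  <- toy[-train_idx, ]

# Regression forest: WTEMP_avg explained by all remaining columns
model <- randomForest(WTEMP_avg ~ ., data = treino, ntree = 200)

prev <- predict(model, teste[, 2:5])
importance(model)  # node-purity importance of each explanatory variable
```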

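# Making predictions on the test set (columns 2 to 5 hold the explanatory variables)
prev <- predict(model, teste[,2:5])

# Evaluating the model: RMSE between observed WTEMP_avg (column 1) and the predictions
rmse(teste[,1], prev)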

## [1] 0.4841574
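The RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as WTEMP_avg. A minimal base R equivalent is shown below. `rmse_manual` is a hypothetical helper name and the numbers are made up for the example.

```r
# Root mean squared error between observed and predicted values
rmse_manual <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up example: errors of 1, 0 and -1 give sqrt(2/3)
rmse_manual(c(10, 12, 14), c(11, 12, 13))
```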


## [1] 0.9709738

# Predictions (x) against observed WTEMP_avg (y); the blue line marks perfect agreement
ggplot(aes(x=prev,y=WTEMP_avg), data=as.data.frame(cbind(prev,teste))) +
    geom_point() +
    geom_abline(slope=1, intercept=0, col='blue') +
    xlab("prediction") + ylab("observed")

##             IncNodePurity
## SAL_avg          60023.34
## TPOC_avg         17953.17
## CHLAVEG_avg      15005.18
## DOAVEG_avg       37470.71