Using data on movies and their reviews drawn from MovieLens plus IMDb/Rotten Tomatoes, we will try to predict the variable rtAudienceScore (the share of viewers who liked the movie according to Rotten Tomatoes, on a scale from 0 to 100).

Because the data come from several sources, we have plenty of information about each movie, such as its title, director, genres, actors, and filming locations, along with its ratings on the two largest social media sites about movies.

Inspecting the data in the movies.dat file:

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
movies <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movies_m.dat")

# Data set
head(movies)
##   id                       title imdbID
## 1  1                   Toy story 114709
## 2  2                     Jumanji 113497
## 3  3              Grumpy Old Men 107050
## 4  4           Waiting to Exhale 114885
## 5  5 Father of the Bride Part II 113041
## 6  6                        Heat 113277
##                                         spanishTitle
## 1                               Toy story (juguetes)
## 2                                            Jumanji
## 3                                Dos viejos grunones
## 4                               Esperando un respiro
## 5 Vuelve el padre de la novia (Ahora tambien abuelo)
## 6                                               Heat
##                                                                                                     imdbPictureURL
## 1 http://ia.media-imdb.com/images/M/MV5BMTMwNDU0NTY2Nl5BMl5BanBnXkFtZTcwOTUxOTM5Mw@@._V1._SX214_CR0,0,214,314_.jpg
## 2 http://ia.media-imdb.com/images/M/MV5BMzM5NjE1OTMxNV5BMl5BanBnXkFtZTcwNDY2MzEzMQ@@._V1._SY314_CR3,0,214,314_.jpg
## 3     http://ia.media-imdb.com/images/M/MV5BMTI5MTgyMzE0OF5BMl5BanBnXkFtZTYwNzAyNjg5._V1._SX214_CR0,0,214,314_.jpg
## 4 http://ia.media-imdb.com/images/M/MV5BMTczMTMyMTgyM15BMl5BanBnXkFtZTcwOTc4OTQyMQ@@._V1._SY314_CR4,0,214,314_.jpg
## 5 http://ia.media-imdb.com/images/M/MV5BMTg1NDc2MjExOF5BMl5BanBnXkFtZTcwNjU1NDAzMQ@@._V1._SY314_CR5,0,214,314_.jpg
## 6 http://ia.media-imdb.com/images/M/MV5BMTM1NDc4ODkxNV5BMl5BanBnXkFtZTcwNTI4ODE3MQ@@._V1._SY314_CR1,0,214,314_.jpg
##   year                        rtID rtAllCriticsRating
## 1 1995                   toy_story                  9
## 2 1995             1068044-jumanji                5.6
## 3 1993              grumpy_old_men                5.9
## 4 1995           waiting_to_exhale                5.6
## 5 1995 father_of_the_bride_part_ii                5.3
## 6 1995                1068182-heat                7.7
##   rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten
## 1                     73                   73                     0
## 2                     28                   13                    15
## 3                     36                   24                    12
## 4                     25                   14                    11
## 5                     19                    9                    10
## 6                     58                   50                     8
##   rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews
## 1               100                8.5                     17
## 2                46                5.8                      5
## 3                66                  7                      6
## 4                56                5.5                     11
## 5                47                5.4                      5
## 6                86                7.2                     17
##   rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore
## 1                   17                     0               100
## 2                    2                     3                40
## 3                    5                     1                83
## 4                    5                     6                45
## 5                    1                     4                20
## 6                   14                     3                82
##   rtAudienceRating rtAudienceNumRatings rtAudienceScore
## 1              3.7               102338              81
## 2              3.2                44587              61
## 3              3.2                10489              66
## 4              3.3                 5666              79
## 5                3                13761              64
## 6              3.9                42785              92
##                                                   rtPictureURL
## 1 http://content7.flixster.com/movie/10/93/63/10936393_det.jpg
## 2  http://content8.flixster.com/movie/56/79/73/5679734_det.jpg
## 3      http://content6.flixster.com/movie/25/60/256020_det.jpg
## 4 http://content9.flixster.com/movie/10/94/17/10941715_det.jpg
## 5      http://content8.flixster.com/movie/25/54/255426_det.jpg
## 6      http://content9.flixster.com/movie/26/80/268099_det.jpg
# Number of distinct movies
length(unique(movies$id))
## [1] 10197
# Release year of the movies
summary(movies$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1903    1981    1995    1988    2002    2011

Some movies have no usable rtAudienceScore: once the column is converted to numeric, their value is NA. We drop these movies (note that complete.cases() removes a row if any of its columns is NA).

movies$rtAudienceScore <- as.numeric(as.character(movies$rtAudienceScore))
movies <- movies[complete.cases(movies),]

# Histogram of the Audience Score
hist(movies$rtAudienceScore, main = "Histogram of Audience Score", xlab = "Audience Score")
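
The as.numeric(as.character(...)) idiom above matters when the column was read as a factor: calling as.numeric() directly on a factor returns the internal level codes, not the scores. A minimal sketch with made-up values (the "n/a" placeholder is an assumption, not necessarily what the file contains):

```r
# Hypothetical column standing in for rtAudienceScore as read from the file.
score <- factor(c("81", "61", "n/a", "92"))

# Converting a factor directly yields its level codes (positions in levels(score)).
codes <- as.numeric(score)

# Going through character first recovers the numbers; non-numeric entries become NA.
values <- suppressWarnings(as.numeric(as.character(score)))

print(codes)
print(values)
```

If the column is already a character vector (the default for read.delim() since R 4.0), the as.character() step is harmless.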

Movies with rtAudienceScore equal to 0 turned out to have no audience ratings at all, so we remove them from the sample as well.

movies <- filter(movies, rtAudienceScore > 0)

hist(movies$rtAudienceScore, main = "Histogram of Audience Score", xlab = "Audience Score")

Because the data are spread across several tables, the next step is to join them into a single table. Only then do we build the model that predicts each movie's Audience Score.

movie_countries <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movie_countries.dat")

movie_directors <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movie_directors.dat")

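# Encode directorID as integer codes (assuming it is a text key in the file).
# as.factor() keeps this working on R >= 4.0, where read.delim() returns strings.
movie_directors$directorID <- as.numeric(as.factor(movie_directors$directorID))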

movies_data <- movies %>% 
  select(id, title, year, rtAudienceScore) %>%
  left_join(movie_countries, c("id" = "movieID")) %>%
  left_join(movie_directors, c("id" = "movieID"))

head(movies_data)
##   id                       title year rtAudienceScore country directorID
## 1  1                   Toy story 1995              81     USA       2209
## 2  2                     Jumanji 1995              61     USA       2130
## 3  3              Grumpy Old Men 1993              66     USA       1267
## 4  4           Waiting to Exhale 1995              79     USA       1477
## 5  5 Father of the Bride Part II 1995              64     USA        887
## 6  6                        Heat 1995              92     USA       2834
##      directorName
## 1   John Lasseter
## 2    Joe Johnston
## 3   Donald Petrie
## 4 Forest Whitaker
## 5   Charles Shyer
## 6    Michael Mann

Before building the model, we check whether any movie has an NA in any column. If so, we remove that movie from the dataset.

# Remove rows that contain NA values
movies_data <- movies_data[complete.cases(movies_data),]
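
A left join keeps every movie and fills the columns of unmatched rows with NA, which is why the complete.cases() filter above can drop movies. A minimal sketch on made-up tables, using base R's merge(..., all.x = TRUE) as a dependency-free stand-in for dplyr's left_join() (movies_toy and countries_toy are hypothetical):

```r
# Toy tables; all values are made up for illustration.
movies_toy <- data.frame(id = c(1, 2, 3), rtAudienceScore = c(81, 61, 66))
countries_toy <- data.frame(movieID = c(1, 3), country = c("USA", "France"))

# Left join: keep every movie, attach the country where a match exists.
joined <- merge(movies_toy, countries_toy,
                by.x = "id", by.y = "movieID", all.x = TRUE)

# Movie 2 has no country, so its row contains NA and is dropped.
cleaned <- joined[complete.cases(joined), ]

print(joined)
print(nrow(cleaned))
```

Comparing nrow() before and after the filter is a quick way to see how many movies each join costs.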

Next we split the data into training and test sets.

require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     movies
require(lattice)
require(ggplot2)

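# Convert the categorical variables to integer codes.
# as.factor() keeps this working on R >= 4.0, where read.delim() returns strings.
# The codes follow alphabetical order and carry no meaningful magnitude.
movies_data$title <- as.numeric(as.factor(movies_data$title))
movies_data$country <- as.numeric(as.factor(movies_data$country))
movies_data$directorID <- as.numeric(movies_data$directorID)
movies_data$directorName <- as.numeric(as.factor(movies_data$directorName))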

movies_data_new <- movies_data

set.seed(12345)
split<-createDataPartition(y = movies_data$rtAudienceScore, 
                           p = 0.7, 
                           list = FALSE)

# Split into training and test sets
movies_data.treino <- movies_data[split,]
movies_data.teste <- movies_data[-split,]

# Resampling setup: 10-fold cross-validation on the training set
ctrl <- trainControl(method = "cv", number = 10)
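
For a numeric outcome, createDataPartition() samples within quantile groups of the outcome so that the training and test sets have similar score distributions. The sketch below imitates that idea in base R on made-up scores; it is a simplified illustration under that assumption, not caret's implementation, and the quartile binning is our own choice.

```r
set.seed(12345)

# Made-up scores standing in for rtAudienceScore.
y <- round(runif(200, min = 1, max = 100))

# Bin the outcome into quartile groups, then sample 70% of the indices in each bin.
breaks <- unique(quantile(y, probs = seq(0, 1, 0.25)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.7 * length(idx)))]
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

print(summary(train_y))
print(summary(test_y))
```

The two summaries should look alike, which is the point of stratifying on the outcome.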

Now we fit our first model on the table we built.

# Training
lm <- train(rtAudienceScore ~. , 
               data = movies_data.treino, 
               method = "lm", 
               trControl = ctrl,
               metric = "Rsquared")

lm
## Linear Regression 
## 
## 5139 samples
##    6 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4625, 4624, 4625, 4625, 4624, 4626, ... 
## 
## Resampling results
## 
##   RMSE      Rsquared    RMSE SD    Rsquared SD
##   16.97969  0.09961777  0.4196769  0.02768347 
## 
## 

The cross-validated Rsquared is about 0.100 (0.0996 in the output above), which we read as the share of variance explained by the model. We cannot yet say whether this value is good or bad; for that we need to compare it with other models. We can also see which variables matter most to the model.

plot(varImp(lm))

summary(lm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.888 -11.996   2.148  13.169  38.201 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.864e+02  3.046e+01  22.530  < 2e-16 ***
## id            7.009e-05  1.359e-05   5.156 2.61e-07 ***
## title         7.638e-05  8.643e-05   0.884    0.377    
## year         -3.069e-01  1.529e-02 -20.072  < 2e-16 ***
## country      -1.917e-01  1.429e-02 -13.415  < 2e-16 ***
## directorID    3.472e-04  4.767e-04   0.728    0.466    
## directorName  1.254e-05  4.627e-04   0.027    0.978    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.97 on 5132 degrees of freedom
## Multiple R-squared:  0.09967,    Adjusted R-squared:  0.09862 
## F-statistic: 94.69 on 6 and 5132 DF,  p-value: < 2.2e-16
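
The country coefficient above is fitted on integer codes, so the linear model assumes the score changes by a fixed amount from one code to the next, even though the codes only reflect alphabetical order. A minimal sketch on made-up data (the three countries and their mean scores are hypothetical) comparing integer codes with a proper factor:

```r
set.seed(1)

# Made-up data: three countries whose mean scores are not ordered alphabetically.
country <- factor(rep(c("Brazil", "France", "USA"), each = 20))
group_mean <- c(Brazil = 70, France = 50, USA = 65)
score <- unname(group_mean[as.character(country)]) + rnorm(60, sd = 2)

# Integer codes (Brazil = 1, France = 2, USA = 3): a single slope across the codes.
fit_codes <- lm(score ~ as.numeric(country))

# Factor: one dummy variable per non-reference country.
fit_factor <- lm(score ~ country)

r2_codes <- summary(fit_codes)$r.squared
r2_factor <- summary(fit_factor)$r.squared
print(c(codes = r2_codes, factor = r2_factor))
```

For columns with thousands of levels, such as title or directorName, dummy coding is impractical; that is a separate problem from the arbitrary ordering shown here.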

Before comparing with other models, let us check the Rsquared on the test data.

predictedVal <- predict(lm, movies_data.teste)
modelvalueslm <-data.frame(obs = movies_data.teste$rtAudienceScore, pred = predictedVal)

summary(movies_data.teste$rtAudienceScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   52.00   68.00   65.67   80.00  100.00
defaultSummary(modelvalueslm)
##       RMSE   Rsquared 
## 16.7035884  0.1053235
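
To our understanding, defaultSummary() reports RMSE as the root mean squared error and, by default, Rsquared as the squared Pearson correlation between observed and predicted values. A minimal sketch on made-up numbers (obs and pred are hypothetical) that computes both by hand, next to the "1 - SSE/SST" definition for contrast:

```r
# Made-up observed and predicted scores.
obs <- c(81, 61, 66, 79, 64, 92)
pred <- c(70, 65, 60, 72, 66, 80)

# Root mean squared error.
rmse <- sqrt(mean((obs - pred)^2))

# Squared correlation between observations and predictions.
rsquared <- cor(obs, pred)^2

# Alternative definition: 1 - residual sum of squares / total sum of squares.
r2_traditional <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(RMSE = rmse, Rsquared = rsquared, R2_traditional = r2_traditional))
```

The squared correlation ignores bias and scale errors in the predictions, so it is never lower than the second definition; RMSE is the better guide to how far off the predicted scores are.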

The test-set Rsquared (0.105) differs slightly from the cross-validated value on the training set (0.100), which is expected. We now repeat the procedure for other models, and only then choose one and improve it.

Another model: Boosted Linear Model.

require(bst)
## Loading required package: bst
# Training
bstLs <- train(rtAudienceScore ~. , 
               data = movies_data.treino, 
               method = "bstLs", 
               trControl = ctrl,
               metric = "Rsquared")

bstLs
## Boosted Linear Model 
## 
## 5139 samples
##    6 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4626, 4626, 4624, 4625, 4625, 4626, ... 
## 
## Resampling results across tuning parameters:
## 
##   mstop  RMSE      Rsquared    RMSE SD    Rsquared SD
##    50    17.82715  0.01518627  0.2631804  0.006972554
##   100    17.79750  0.01839019  0.2639842  0.008070009
##   150    17.77121  0.02148814  0.2631823  0.009112617
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using  the largest value.
## The final values used for the model were mstop = 150 and nu = 0.1.
plot(varImp(bstLs))

predictedVal <- predict(bstLs, movies_data.teste)
modelvaluesbstLs <-data.frame(obs = movies_data.teste$rtAudienceScore, pred = predictedVal)

defaultSummary(modelvaluesbstLs)
##        RMSE    Rsquared 
## 17.54204844  0.02409413

k-Nearest Neighbors (k-NN):

# Training
knn <- train(rtAudienceScore ~. , 
                data = movies_data.treino, 
                method = "knn", 
                trControl = ctrl,
                preProcess = c("center","scale"), 
                tuneGrid = expand.grid(.k = 3:6),
                metric = "Rsquared")

knn
## k-Nearest Neighbors 
## 
## 5139 samples
##    6 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4625, 4625, 4624, 4625, 4625, 4626, ... 
## 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared    RMSE SD    Rsquared SD
##   3  18.86263  0.05842262  0.4335260  0.01448581 
##   4  18.33816  0.06415954  0.4456524  0.01504604 
##   5  18.04841  0.06698007  0.3559604  0.01441641 
##   6  17.83106  0.07116678  0.3311527  0.01561322 
## 
## Rsquared was used to select the optimal model using  the largest value.
## The final value used for the model was k = 6.
predictedVal <- predict(knn, movies_data.teste)
modelvaluesknn <-data.frame(obs = movies_data.teste$rtAudienceScore, pred = predictedVal)

plot(varImp(knn))

defaultSummary(modelvaluesknn)
##        RMSE    Rsquared 
## 17.40348300  0.08785419
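
k-NN regression predicts a movie's score as the average score of its k closest training movies, which is why the predictors are centered and scaled first. The function below is a simplified base-R sketch of that idea (knn_predict and the toy data are hypothetical; caret's own k-NN implementation may handle ties and distances differently):

```r
# Predict by averaging the outcomes of the k nearest training rows.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Centre and scale both sets with the training means and standard deviations.
  mu <- colMeans(train_x)
  sdv <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = sdv)
  nw <- scale(new_x, center = mu, scale = sdv)
  apply(nw, 1, function(row) {
    # Euclidean distance from this new row to every training row.
    d <- sqrt(colSums((t(tr) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data: two well-separated clusters on a single predictor.
train_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
train_y <- c(10, 20, 30, 100, 110, 120)
new_x <- matrix(c(2, 11), ncol = 1)

preds <- knn_predict(train_x, train_y, new_x, k = 3)
print(preds)
```

With integer-coded titles and directors as predictors, "closest" means alphabetically nearby, which may explain part of the weak k-NN result.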

Boosted Tree:

# Training
bstTree <- train(rtAudienceScore ~. , 
                data = movies_data.treino, 
                method = "bstTree", 
                trControl = ctrl,
                metric = "Rsquared")

bstTree
## Boosted Tree 
## 
## 5139 samples
##    6 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4624, 4624, 4625, 4625, 4624, 4627, ... 
## 
## Resampling results across tuning parameters:
## 
##   maxdepth  mstop  RMSE      Rsquared   RMSE SD    Rsquared SD
##   1          50    16.67926  0.1344349  0.3900484  0.02643764 
##   1         100    16.59078  0.1423434  0.3930347  0.02623926 
##   1         150    16.54617  0.1463535  0.3912763  0.02635218 
##   2          50    16.46734  0.1552640  0.4346779  0.03320780 
##   2         100    16.35321  0.1658153  0.4437705  0.03352897 
##   2         150    16.30016  0.1707879  0.4794858  0.03669232 
##   3          50    16.33290  0.1685160  0.4552249  0.03414061 
##   3         100    16.20553  0.1803845  0.5033434  0.03751579 
##   3         150    16.16259  0.1843452  0.5369949  0.04029391 
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using  the largest value.
## The final values used for the model were mstop = 150, maxdepth = 3 and
##  nu = 0.1.
predictedVal <- predict(bstTree, movies_data.teste)
modelvaluesbstTree <-data.frame(obs = movies_data.teste$rtAudienceScore, pred = predictedVal)

plot(varImp(bstTree))

defaultSummary(modelvaluesbstTree)
##      RMSE  Rsquared 
## 15.795556  0.200198
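
The tuning table above varies mstop (number of boosting iterations) and maxdepth (tree depth), with the shrinkage nu fixed at 0.1. The sketch below illustrates the general idea of boosting with squared-error loss: each small tree is fitted to the current residuals and added with weight nu. It uses depth-1 trees on a single made-up predictor; fit_stump, predict_stump, and boost_stumps are hypothetical helpers, not the bst package's implementation.

```r
# Fit a one-split regression stump to residuals r on a numeric predictor x.
fit_stump <- function(x, r) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between distinct values
  sse <- sapply(cuts, function(cc) {
    left <- r[x <= cc]
    right <- r[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

# Boosting: start from the mean, then repeatedly fit a stump to the residuals.
boost_stumps <- function(x, y, mstop = 150, nu = 0.1) {
  pred <- rep(mean(y), length(y))
  stumps <- vector("list", mstop)
  for (m in seq_len(mstop)) {
    stumps[[m]] <- fit_stump(x, y - pred)
    pred <- pred + nu * predict_stump(stumps[[m]], x)
  }
  list(stumps = stumps, fitted = pred)
}

set.seed(42)
x <- runif(200, 0, 10)
y <- 10 * sin(x) + rnorm(200)

fit <- boost_stumps(x, y)
rmse_start <- sqrt(mean((y - mean(y))^2))
rmse_end <- sqrt(mean((y - fit$fitted)^2))
print(c(start = rmse_start, end = rmse_end))
```

The training error shrinks with every iteration, which is why mstop is chosen by cross-validation rather than by training fit.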

Having fitted several models, we pick the one with the highest Rsquared.

results <- resamples(list(lm = lm, bstLs = bstLs, Knn = knn, bstTree = bstTree))

bwplot(results)

bwplot(results, xlim=0:1)
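
The box plots compare the cross-validation resamples. As a complement, the test-set metrics already printed by defaultSummary() for each model can be collected into one table and ranked. A small sketch (the data frame test_metrics is our own construction; the numbers are copied from the outputs above):

```r
# Test-set metrics copied from the defaultSummary() outputs of each model.
test_metrics <- data.frame(
  model = c("lm", "bstLs", "knn", "bstTree"),
  RMSE = c(16.7035884, 17.54204844, 17.40348300, 15.795556),
  Rsquared = c(0.1053235, 0.02409413, 0.08785419, 0.200198)
)

# Rank the models from highest to lowest Rsquared.
ranked <- test_metrics[order(-test_metrics$Rsquared), ]
best_model <- as.character(ranked$model[1])

print(ranked)
print(best_model)
```

Choosing a model on test-set numbers makes those numbers slightly optimistic, so the cross-validation comparison should remain the main selection criterion.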

Removing objects that are no longer needed, to free up memory:

rm(modelvaluesbstLs)
rm(modelvaluesknn)
rm(movie_countries)
rm(movies)
rm(movies_data)
rm(modelvalueslm)
rm(movies_data.teste)
rm(movies_data.treino)
rm(split)
rm(bstLs)
rm(knn)
rm(lm)

bstTree achieved the highest Rsquared of the four models, both in cross-validation (0.184) and on the test set (0.200), so we take it as the best model for our problem so far. To try to raise the Rsquared further, we create new variables.

# Number of movies per director
movies_director <- as.data.frame(table(movie_directors$directorID))
colnames(movies_director) <- c("directorID", "nFilmes")

# table() returns the IDs as factor labels; go through character to recover the values
movies_director$directorID <- as.numeric(as.character(movies_director$directorID))

summary(movies_director$nFilmes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.501   3.000  48.000
movie_actors <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movie_actors.dat")

# Sum of the actor rankings in each movie
groupbyActors <-  group_by(movie_actors, movieID)
sumActores <- summarise(groupbyActors, sum(ranking))
colnames(sumActores) <- c("movieID", "nRanking")

# Highest actor ranking in each movie
maxActores <- summarise(groupbyActors, max(ranking))
colnames(maxActores) <- c("movieID", "maxRanking")

# Mean actor ranking in each movie
meanActores <- summarise(groupbyActors, mean(ranking))
colnames(meanActores) <- c("movieID", "meanRanking")

# Median actor ranking in each movie
medianActores <- summarise(groupbyActors, median(ranking))
colnames(medianActores) <- c("movieID", "medianRanking")

# Genres of each movie
movie_genres <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movie_genres.dat")

# Integer-code the genres; as.factor() keeps this working when genre is read as strings
movie_genres$genre <- as.numeric(as.factor(movie_genres$genre))

groupbyGenres <-  group_by(movie_genres, movieID)
firstGenres <- summarise(groupbyGenres, max(genre))
colnames(firstGenres) <- c("movieID", "firstGenres")

lastGenres <- summarise(groupbyGenres, min(genre))
colnames(lastGenres) <- c("movieID", "lastGenres")

medianGenres <- summarise(groupbyGenres, median(genre))
colnames(medianGenres) <- c("movieID", "medianGenres")

meanGenres <- summarise(groupbyGenres, mean(genre))
colnames(meanGenres) <- c("movieID", "meanGenres")

# Tags of each movie
movie_tags <- read.delim("~/Projetos/DataAnalysis/Assignment7/hetrec2011-movielens-2k-v2/movie_tags.dat")

movie_tags$tagID <- as.numeric(movie_tags$tagID)

groupbyTags <-  group_by(movie_tags, movieID)
firstTags <- summarise(groupbyTags, max(tagID))
colnames(firstTags) <- c("movieID", "firstTags")

lastTags <- summarise(groupbyTags, min(tagID))
colnames(lastTags) <- c("movieID", "lastTags")

medianTags <- summarise(groupbyTags, median(tagID))
colnames(medianTags) <- c("movieID", "medianTags")

meanTags <- summarise(groupbyTags, mean(tagID))
colnames(meanTags) <- c("movieID", "meanTags")

With the new variables created, we join them to our table.

movies_data_new <- movies_data_new %>% 
  left_join(movies_director, c("directorID" = "directorID")) %>%
  left_join(sumActores, c("id" = "movieID")) %>%
  left_join(maxActores, c("id" = "movieID")) %>%
  left_join(meanActores, c("id" = "movieID")) %>%
  left_join(medianActores, c("id" = "movieID")) %>%
  left_join(firstGenres, c("id" = "movieID")) %>%
  left_join(lastGenres, c("id" = "movieID")) %>%
  left_join(medianGenres, c("id" = "movieID")) %>%
  left_join(meanGenres, c("id" = "movieID")) %>%
  left_join(firstTags, c("id" = "movieID")) %>%
  left_join(lastTags, c("id" = "movieID")) %>%
  left_join(medianTags, c("id" = "movieID")) %>%
  left_join(meanTags, c("id" = "movieID"))

head(movies_data_new)
##   id title year rtAudienceScore country directorID directorName nFilmes
## 1  1  8709 1995              81      68       2209         2025       5
## 2  2  3743 1995              61      68       2130         1936       6
## 3  3  2933 1993              66      68       1267         1020      12
## 4  4  9031 1995              79      68       1477         1239       3
## 5  5  2437 1995              64      68        887          609       8
## 6  6  3073 1995              92      68       2834         2718      10
##   nRanking maxRanking meanRanking medianRanking firstGenres lastGenres
## 1      300         24        12.5          12.5           9          2
## 2      171         18         9.5           9.5           9          2
## 3      136         16         8.5           8.5          15          5
## 4      210         20        10.5          10.5          15          5
## 5      351         26        13.5          13.5           5          5
## 6     1128         47        24.0          24.0          18          1
##   medianGenres meanGenres firstTags lastTags medianTags meanTags
## 1            4   4.600000     15170        7     1925.0 4198.884
## 2            4   5.000000     14371       13     1893.5 3589.500
## 3           10  10.000000     13668      380     3219.0 5109.000
## 4            8   9.333333        NA       NA         NA       NA
## 5            5   5.000000      4953      125     2185.0 2523.375
## 6            6   8.333333     15248      351     2773.0 4287.542
# Replace NAs with 0
movies_data_new[is.na(movies_data_new)] <- 0
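
The four actor features were built with four separate summarise() calls followed by renaming; a single grouped summarise() can return all of them at once. A minimal sketch on a made-up table (movie_actors_toy and its values are hypothetical; the column names follow the ones used above):

```r
library(dplyr)

# Toy stand-in for movie_actors: one row per (movie, actor) with a ranking.
movie_actors_toy <- data.frame(
  movieID = c(1, 1, 1, 2, 2),
  ranking = c(1, 2, 3, 1, 2)
)

# One grouped summary that produces all four features with their final names.
actor_features <- movie_actors_toy %>%
  group_by(movieID) %>%
  summarise(
    nRanking = sum(ranking),
    maxRanking = max(ranking),
    meanRanking = mean(ranking),
    medianRanking = median(ranking)
  )

print(actor_features)
```

In the rows printed above, meanRanking always equals medianRanking and nRanking equals maxRanking * (maxRanking + 1) / 2 (for example 24 and 300). That would be consistent with ranking running 1, 2, ..., n within each movie, in which case all four features mostly encode the size of the cast.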

Now we fit the new model:

set.seed(12345)
split<-createDataPartition(y = movies_data_new$rtAudienceScore, 
                           p = 0.7, 
                           list = FALSE)

# Split into training and test sets
movies_data_new.treino <- movies_data_new[split,]
movies_data_new.teste <- movies_data_new[-split,]

# Resampling setup: 10-fold cross-validation on the training set
ctrl <- trainControl(method = "cv", number = 10)

# Training
bstTree_new <- train(rtAudienceScore ~. , 
                data = movies_data_new.treino, 
                method = "bstTree", 
                trControl = ctrl,
                metric = "Rsquared")

bstTree_new
## Boosted Tree 
## 
## 5139 samples
##   19 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4625, 4624, 4625, 4625, 4624, 4626, ... 
## 
## Resampling results across tuning parameters:
## 
##   maxdepth  mstop  RMSE      Rsquared   RMSE SD    Rsquared SD
##   1          50    15.77575  0.2579118  0.3740444  0.02872593 
##   1         100    15.26956  0.2905114  0.4177861  0.03112831 
##   1         150    15.03854  0.3045682  0.4544440  0.03423687 
##   2          50    14.97435  0.3131390  0.4398069  0.03315997 
##   2         100    14.67772  0.3308739  0.5040734  0.03735540 
##   2         150    14.59160  0.3362230  0.5156930  0.03704260 
##   3          50    14.68480  0.3324162  0.4519338  0.03315860 
##   3         100    14.50532  0.3439243  0.5037983  0.03754901 
##   3         150    14.41783  0.3510464  0.5329146  0.04010407 
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using  the largest value.
## The final values used for the model were mstop = 150, maxdepth = 3 and
##  nu = 0.1.
predictedVal <- predict(bstTree_new, movies_data_new.teste)
modelvaluesbstTree_new <-data.frame(obs = movies_data_new.teste$rtAudienceScore, pred = predictedVal)

plot(varImp(bstTree_new))

defaultSummary(modelvaluesbstTree_new)
##       RMSE   Rsquared 
## 13.9993938  0.3726231

The new model shows a substantial improvement: on the test set, Rsquared rises from 0.200 to 0.373 and RMSE falls from 15.80 to 14.00, so it is the best model we obtained for this problem.
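
The size of the improvement can be worked out from the two defaultSummary() outputs on the test set. A short sketch (the vectors old and new are our own names; the numbers are copied from those outputs):

```r
# Test-set metrics of the first boosted tree and of the one with the new variables.
old <- c(RMSE = 15.795556, Rsquared = 0.200198)
new <- c(RMSE = 13.9993938, Rsquared = 0.3726231)

# Relative drop in RMSE, in percent, and absolute gain in Rsquared.
rmse_drop_pct <- 100 * (old[["RMSE"]] - new[["RMSE"]]) / old[["RMSE"]]
r2_gain <- new[["Rsquared"]] - old[["Rsquared"]]

print(round(rmse_drop_pct, 1))
print(round(r2_gain, 3))
```

So the typical prediction error shrinks by roughly 11%, while Rsquared gains about 0.17.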

# Old model
bstTree
## Boosted Tree 
## 
## 5139 samples
##    6 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4624, 4624, 4625, 4625, 4624, 4627, ... 
## 
## Resampling results across tuning parameters:
## 
##   maxdepth  mstop  RMSE      Rsquared   RMSE SD    Rsquared SD
##   1          50    16.67926  0.1344349  0.3900484  0.02643764 
##   1         100    16.59078  0.1423434  0.3930347  0.02623926 
##   1         150    16.54617  0.1463535  0.3912763  0.02635218 
##   2          50    16.46734  0.1552640  0.4346779  0.03320780 
##   2         100    16.35321  0.1658153  0.4437705  0.03352897 
##   2         150    16.30016  0.1707879  0.4794858  0.03669232 
##   3          50    16.33290  0.1685160  0.4552249  0.03414061 
##   3         100    16.20553  0.1803845  0.5033434  0.03751579 
##   3         150    16.16259  0.1843452  0.5369949  0.04029391 
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using  the largest value.
## The final values used for the model were mstop = 150, maxdepth = 3 and
##  nu = 0.1.
# New model
bstTree_new
## Boosted Tree 
## 
## 5139 samples
##   19 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 4625, 4624, 4625, 4625, 4624, 4626, ... 
## 
## Resampling results across tuning parameters:
## 
##   maxdepth  mstop  RMSE      Rsquared   RMSE SD    Rsquared SD
##   1          50    15.77575  0.2579118  0.3740444  0.02872593 
##   1         100    15.26956  0.2905114  0.4177861  0.03112831 
##   1         150    15.03854  0.3045682  0.4544440  0.03423687 
##   2          50    14.97435  0.3131390  0.4398069  0.03315997 
##   2         100    14.67772  0.3308739  0.5040734  0.03735540 
##   2         150    14.59160  0.3362230  0.5156930  0.03704260 
##   3          50    14.68480  0.3324162  0.4519338  0.03315860 
##   3         100    14.50532  0.3439243  0.5037983  0.03754901 
##   3         150    14.41783  0.3510464  0.5329146  0.04010407 
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using  the largest value.
## The final values used for the model were mstop = 150, maxdepth = 3 and
##  nu = 0.1.