
1. Import the spam dataset and print the first six rows.
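
The import step itself is not shown here. A minimal sketch of the setup, assuming a local CSV with headers and the packages used throughout this document:

library(corrplot)      # correlation plot (question 3a)
library(factoextra)    # fviz_cluster (question 3b)
library(psych)         # wkappa (questions 3b, 11, 12)
library(caret)         # train / trainControl (questions 8-13)
library(randomForest)  # randomForest (questions 11, 12)

data <- read.csv("spambase.csv")  # hypothetical file name; any source with these column names works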

head(data,6)
##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 2           0.21              0.28          0.50            0
## 3           0.06              0.00          0.71            0
## 4           0.00              0.00          0.00            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32           0.00             0.00               0.00
## 2          0.14           0.28             0.21               0.07
## 3          1.23           0.19             0.19               0.12
## 4          0.63           0.00             0.31               0.63
## 5          0.63           0.00             0.31               0.63
## 6          1.85           0.00             0.00               1.85
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 2            0.00           0.94              0.21           0.79
## 3            0.64           0.25              0.38           0.45
## 4            0.31           0.63              0.31           0.31
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00             0.00                0.00           0.32
## 2             0.65             0.21                0.14           0.14
## 3             0.12             0.00                1.75           0.06
## 4             0.31             0.00                0.00           0.31
## 5             0.31             0.00                0.00           0.31
## 6             0.00             0.00                0.00           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1               0.00            1.29          1.93             0.00
## 2               0.07            0.28          3.47             0.00
## 3               0.06            1.03          1.36             0.32
## 4               0.00            0.00          3.18             0.00
## 5               0.00            0.00          3.18             0.00
## 6               0.00            0.00          0.00             0.00
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0          0.00            0.00            0
## 2           1.59              0          0.43            0.43            0
## 3           0.51              0          1.16            0.06            0
## 4           0.31              0          0.00            0.00            0
## 5           0.31              0          0.00            0.00            0
## 6           0.00              0          0.00            0.00            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 2             0                0             0             0
## 3             0                0             0             0
## 4             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0              0
## 2              0                0             0              0
## 3              0                0             0              0
## 4              0                0             0              0
## 5              0                0             0              0
## 6              0                0             0              0
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0           0.00
## 2             0            0                    0           0.07
## 3             0            0                    0           0.00
## 4             0            0                    0           0.00
## 5             0            0                    0           0.00
## 6             0            0                    0           0.00
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0             0.00            0
## 2               0            0             0.00            0
## 3               0            0             0.06            0
## 4               0            0             0.00            0
## 5               0            0             0.00            0
## 6               0            0             0.00            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0               0.00                 0         0.00
## 2                 0               0.00                 0         0.00
## 3                 0               0.12                 0         0.06
## 4                 0               0.00                 0         0.00
## 5                 0               0.00                 0         0.00
## 6                 0               0.00                 0         0.00
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1          0.00               0                    0        0.00
## 2          0.00               0                    0        0.00
## 3          0.06               0                    0        0.01
## 4          0.00               0                    0        0.00
## 5          0.00               0                    0        0.00
## 6          0.00               0                    0        0.00
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 2         0.132             0         0.372         0.180         0.048
## 3         0.143             0         0.276         0.184         0.010
## 4         0.137             0         0.137         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 2                      5.114                        101
## 3                      9.821                        485
## 4                      3.537                         40
## 5                      3.537                         40
## 6                      3.000                         15
##   capital_run_length_total spam
## 1                      278    1
## 2                     1028    1
## 3                     2259    1
## 4                      191    1
## 5                      191    1
## 6                       54    1

2. Which three variables in the dataset do you think will be important predictors in a model of spam? Why?

Probably the word frequencies for “money”, “you” and “mail”, since spam messages typically solicit money and address the reader directly.

3. In R, make a copy of the spam dataset. Delete the dependent variable, spam, from the copied dataset.

data_copy <- data[, 1:57]  # keep all 57 predictors; column 58 is the spam DV

a) How could you summarize the relationships between all the variables in this data in a single matrix? Summarize the data in a single matrix and then create a visualization of these relationships.

names(data_copy)[1:54] <- sub('_([^_])*', "", names(data_copy)[1:54])  # drop the "_freq" piece so plot labels stay readable

M <- cor(data_copy, use = "complete.obs")  # pairwise correlations in a single 57 x 57 matrix


corrplot(M, method = "circle", type = "lower", title = "Correlation Matrix",
         mar = c(0, 0, 2, 0), tl.cex = 0.3, tl.col = "black")
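
To read the strongest relationships off the matrix numerically rather than visually, one option (a sketch) is to rank the off-diagonal entries:

cors <- as.data.frame(as.table(M))                             # long format: Var1, Var2, Freq
cors <- subset(cors, as.character(Var1) < as.character(Var2))  # drop the diagonal and duplicate pairs
head(cors[order(-abs(cors$Freq)), ], 5)                        # five strongest correlations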

b) Use an unsupervised learning technique to classify the data into two categories. Evaluate how well these unsupervised categories match the true original values of the spam dependent variable.

df <- scale(data_copy)  # standardize features so no single variable dominates the distances


k2 <- kmeans(df, centers = 2, nstart = 50)  # k-means with 2 clusters, 50 random starts
p2 <- fviz_cluster(k2, data = df, main = "2 Clusters", palette = "jco",
                   ggtheme = theme_minimal(), repel = TRUE)

comparison <- data.frame(cluster = k2$cluster, spam = data$spam)
# cluster labels are arbitrary, so map cluster 2 to 1 (spam) and cluster 1 to 0 before comparing
comparison$cluster <- ifelse(comparison$cluster == 2, 1, 0)
p <- table(comparison$cluster, comparison$spam)
p
##    
##        0    1
##   0 2754 1813
##   1   34    0
wkappa(p, w = NULL)
## $kappa
## [1] -0.01472089
## 
## $weighted.kappa
## [1] -0.01472089

The kappa statistic is quite low, as the confusion table also shows: the k-means clustering puts almost every observation into a single cluster, effectively classifying everything as not spam.

4. Name five supervised learning models that we learned this semester that are used to predict dependent variables like “spam”.

We could use logistic regression, since the dependent variable is binary, as well as k-nearest neighbors, classification trees, random forests, and neural networks.

5. What metric(s) would you use to evaluate prediction error for classification models? (Choose two, define them, and discuss which approach, if any, you think is better)

You can use accuracy, which measures the share of observations assigned to the correct class, or the kappa coefficient, a statistic that measures agreement between predicted and true labels beyond what would be expected by chance. While accuracy focuses only on the proportion of right predictions, kappa corrects for chance agreement, so it also reflects how many false positives and false negatives the model produces; this makes it the more informative metric when the classes are imbalanced.
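
For a binary confusion matrix with true/false positives and negatives (TP, FP, TN, FN), the two metrics can be written as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement (the accuracy) and $p_e$ is the agreement expected by chance given the marginal frequencies of each class.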

6. What metric(s) would you use to evaluate prediction error for regression models with continuous dependent variables? (Choose three, define them, and discuss which approach, if any, you think is better)

Mean squared error (MSE) sums the squared residuals (the differences between predicted and actual values) and divides by the number of observations. Root mean squared error (RMSE) is the square root of the MSE, which puts the error back into the units of the dependent variable. Mean absolute error (MAE) is similar, but averages the absolute values of the residuals instead of their squares. Which is better depends on the dataset: because MSE and RMSE square the residuals, they penalize large errors heavily, so MAE does a better job when outliers are present.
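
In symbols, for actual values $y_i$, predictions $\hat{y}_i$, and $n$ observations:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert.$$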

7. What is k-fold cross-validation and what do we use it for?

It is a method that splits the original sample into k random subsamples (folds). The model is trained on k-1 of these folds and validated on the remaining fold, and the process is repeated k times until every fold has been used once for validation; the k error estimates are then averaged. We use it to estimate out-of-sample prediction error and to tune model parameters without touching the test set.
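
A minimal sketch of the procedure done by hand on the spam data (the caret calls below automate this via trainControl):

k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
errors <- numeric(k)
for (i in 1:k) {
  train_i <- data[folds != i, ]                     # k-1 folds for training
  test_i  <- data[folds == i, ]                     # held-out fold for validation
  fit <- glm(spam ~ word_freq_money, data = train_i, family = "binomial")
  pred <- predict(fit, test_i, type = "response") > 0.5
  errors[i] <- mean(pred != (test_i$spam == 1))     # misclassification rate on the fold
}
mean(errors)                                        # cross-validated error estimate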

8. Choose one model from question four. Select three variables in the dataset that you think will be good predictors of “spam”. Run the model and evaluate prediction error on test data using an appropriate model evaluation technique.

Data<-data



num.vars <- sapply(Data, is.numeric) #identify numeric variables
Data[num.vars] <- lapply(Data[num.vars], scale) #scale numeric variables
Data$spam<- factor(Data$spam)


#Here we set tuneLength to tell function to try many possibilities for k


set.seed(400)

test.index <- sample.int(4601,1000)
train <- Data[-test.index, ] #full training data with normalized features
test <- Data[test.index, ]  

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)  # 5-fold CV, repeated twice
knnFit <- train(spam ~ word_freq_money + word_freq_free + word_freq_you,
                data = train, method = "knn", trControl = ctrl, tuneLength = 9)


#Output of kNN fit
knnFit  # train() selects the k value with the highest CV accuracy
## k-Nearest Neighbors 
## 
## 3601 samples
##    3 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2881, 2880, 2881, 2881, 2881, 2880, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7995003  0.5672311
##    7  0.8067175  0.5819644
##    9  0.8033859  0.5742764
##   11  0.8039413  0.5752676
##   13  0.8057470  0.5786600
##   15  0.8072750  0.5813298
##   17  0.8092181  0.5845855
##   19  0.8088018  0.5829427
##   21  0.8106066  0.5868026
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 21.
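
The question asks about prediction error on test data, so the CV estimate above can be checked against the held-out 1,000 rows (a sketch; output not shown):

knn_pred <- predict(knnFit, newdata = test)  # class predictions on the test set
confusionMatrix(knn_pred, test$spam)         # caret: accuracy, kappa, and the confusion table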

9. Choose a second model from question four. Using the same three variables in the dataset that you think will be good predictors of “spam”, run the model and evaluate prediction error on test data using an appropriate model evaluation technique. Did this model predict test data better than your previous model?

df<-data
df$spam<- factor(df$spam)



## Let's predict on a test set of 1,000 observations, using the rest of the data
# as training observations

test.index <- sample.int(4601,1000)
train <- df[-test.index, ] 
test <- df[test.index, ]  


data_ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 10)

model_caret <- train(spam ~ word_freq_money + word_freq_free + word_freq_you,
                     data = train,          # model to fit
                     method = "glm",
                     family = "binomial",
                     trControl = data_ctrl, # folds
                     na.action = na.pass)   # pass missing data to the model; glm drops missing rows
# Might get errors if a resample has too few zeros or ones in the DV; not an issue here

#Examine average prediction error of each fold's model built from training data applied to test data

model_caret 
## Generalized Linear Model 
## 
## 3601 samples
##    3 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 10 times) 
## Summary of sample sizes: 2401, 2400, 2401, 2400, 2401, 2401, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7636486  0.4662533
#prediction error for each model
model_caret$resample # How well did training model predict test data in each fold?
##     Accuracy     Kappa    Resample
## 1  0.7816667 0.5099216 Fold1.Rep01
## 2  0.7560366 0.4496211 Fold2.Rep01
## 3  0.7600000 0.4549601 Fold3.Rep01
## 4  0.7676936 0.4741623 Fold1.Rep02
## 5  0.7650000 0.4716347 Fold2.Rep02
## 6  0.7658333 0.4706503 Fold3.Rep02
## 7  0.7558333 0.4424570 Fold1.Rep03
## 8  0.7726894 0.4846069 Fold2.Rep03
## 9  0.7525000 0.4478870 Fold3.Rep03
## 10 0.7835137 0.5093553 Fold1.Rep04
## 11 0.7683333 0.4752060 Fold2.Rep04
## 12 0.7400000 0.4192846 Fold3.Rep04
## 13 0.7525000 0.4405094 Fold1.Rep05
## 14 0.7583333 0.4548343 Fold2.Rep05
## 15 0.7835137 0.5109965 Fold3.Rep05
## 16 0.7568693 0.4535533 Fold1.Rep06
## 17 0.7691667 0.4737909 Fold2.Rep06
## 18 0.7658333 0.4745994 Fold3.Rep06
## 19 0.7641667 0.4682150 Fold1.Rep07
## 20 0.7641667 0.4623928 Fold2.Rep07
## 21 0.7543714 0.4486273 Fold3.Rep07
## 22 0.7783333 0.5049029 Fold1.Rep08
## 23 0.7558333 0.4429269 Fold2.Rep08
## 24 0.7618651 0.4620962 Fold3.Rep08
## 25 0.7566667 0.4506166 Fold1.Rep09
## 26 0.7641667 0.4708599 Fold2.Rep09
## 27 0.7618651 0.4575600 Fold3.Rep09
## 28 0.7516667 0.4350872 Fold1.Rep10
## 29 0.7535387 0.4395389 Fold2.Rep10
## 30 0.7875000 0.5267436 Fold3.Rep10
#How much variation was there across all predictions?

sd(model_caret$resample$Accuracy) 
## [1] 0.01102273
summary(model_caret$resample$Accuracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7400  0.7559  0.7630  0.7636  0.7682  0.7875
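
Based on cross-validated accuracy (about 0.76 here versus roughly 0.81 for kNN), this logistic regression did not predict better than the previous model. The same held-out check as before (a sketch; output not shown):

glm_pred <- predict(model_caret, newdata = test)
confusionMatrix(glm_pred, test$spam)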

10. Choose a third model from question four. Using the same three variables in the dataset that you think will be good predictors of “spam”, run the model and evaluate prediction error on test data using an appropriate model evaluation technique. Did this model predict test data better than your previous models?

df<-data
df$spam<- factor(df$spam)
control <- trainControl(method = "cv")  # 10-fold cross-validation (the default)



rpartFit <- train(spam ~ word_freq_money + word_freq_free + word_freq_you,
                  data = df,
                  method = "rpart",
                  trControl = control,
                  tuneLength = 9)

rpartFit
## CART 
## 
## 4601 samples
##    3 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4141, 4141, 4141, 4140, 4141, 4141, ... 
## Resampling results across tuning parameters:
## 
##   cp            Accuracy   Kappa    
##   0.0006618864  0.8139527  0.5946099
##   0.0007354293  0.8137353  0.5933620
##   0.0011031440  0.8150410  0.5953024
##   0.0013789300  0.8150396  0.5957660
##   0.0014708586  0.8156918  0.5974205
##   0.0024820739  0.8165609  0.6012791
##   0.0029417172  0.8161261  0.6008237
##   0.1252068395  0.7776596  0.5080156
##   0.4087148373  0.6815727  0.2286378
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002482074.
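
Cross-validated accuracy here (about 0.82) edges out kNN and clearly beats the logistic regression. Note, though, that this chunk trains on all 4,601 rows rather than holding out a test set; a held-out evaluation would refit on the training split (a sketch, assuming the train/test split from question 9):

rpartFit2 <- train(spam ~ word_freq_money + word_freq_free + word_freq_you,
                   data = train, method = "rpart",
                   trControl = control, tuneLength = 9)
confusionMatrix(predict(rpartFit2, newdata = test), test$spam)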

11. Choose a fourth model from question four. Using the same three variables in the dataset that you think will be good predictors of “spam”, run the model and evaluate prediction error on test data using an appropriate model evaluation technique. Did this model predict test data better than your previous models?

df<-data
df$spam<- factor(df$spam)
test.index <- sample.int(4601,1000)
train <- df[-test.index, ] 
test <- df[test.index, ]


# Random forest classification model
set.seed(200)

forrest <- randomForest(spam ~ word_freq_money + word_freq_free + word_freq_you,
                        data = train, ntree = 200, na.action = na.exclude)





predictions <- predict(forrest, test)  # class predictions from the forest on the test set
test$predictions<-predictions

 forrest
## 
## Call:
##  randomForest(formula = spam ~ word_freq_money + word_freq_free +      word_freq_you, data = train, ntree = 200, na.action = na.exclude) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 17.97%
## Confusion matrix:
##      0   1 class.error
## 0 2024 164   0.0749543
## 1  483 930   0.3418259
p <- table(test$predictions, test$spam)
p
##    
##       0   1
##   0 559 133
##   1  41 267
wkappa(p, w = NULL)
## $kappa
## [1] 0.6230503
## 
## $weighted.kappa
## [1] 0.6230503
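
To see which of the three predictors the forest actually leans on (a sketch; plot not shown):

varImpPlot(forrest)  # mean decrease in Gini impurity per predictor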

12. Now rerun your best model from questions 8 through 11, but this time add three new variables to the model that you think will increase prediction accuracy. Did this model predict test data better than your previous models?

Judging by the kappa metric, the most accurate model so far is the random forest. I will add the word frequencies for “free”, “email” and “receive”; since “free” is already among the predictors, this effectively adds two new variables.

df<-data
df$spam<- factor(df$spam)
test.index <- sample.int(4601,1000)
train <- df[-test.index, ] 
test <- df[test.index, ]


# Random forest classification model
set.seed(200)

forrest <- randomForest(spam ~ word_freq_money + word_freq_free + word_freq_you +
                          word_freq_free + word_freq_email + word_freq_receive,
                        data = train, ntree = 200, na.action = na.exclude)
# word_freq_free appears twice in the formula; R drops the duplicate, leaving 5 distinct predictors





predictions <- predict(forrest, test)  # class predictions from the expanded forest on the test set
test$predictions<-predictions

 forrest
## 
## Call:
##  randomForest(formula = spam ~ word_freq_money + word_freq_free +      word_freq_you + word_freq_free + word_freq_email + word_freq_receive,      data = train, ntree = 200, na.action = na.exclude) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.44%
## Confusion matrix:
##      0    1 class.error
## 0 1977  201   0.0922865
## 1  391 1032   0.2747716
p <- table(test$predictions, test$spam)
p
##    
##       0   1
##   0 555  98
##   1  55 292
wkappa(p, w = NULL)
## $kappa
## [1] 0.6719132
## 
## $weighted.kappa
## [1] 0.6719132
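
Test-set kappa rose from about 0.62 to 0.67 and the OOB error fell from roughly 18% to 16%, so the added variables did improve prediction. The error trajectory can also be inspected (a sketch; plot not shown):

plot(forrest)  # OOB error rate as trees are added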

13. Rerun all your other models with this final set of six variables, evaluate prediction error, and choose a final model. Why did you select this model among all of the models that you ran?

df<-data
df$spam<- factor(df$spam)
control <- trainControl(method = "cv")  # 10-fold cross-validation (the default)



rpartFit <- train(spam ~ word_freq_money + word_freq_free + word_freq_you +
                    word_freq_free + word_freq_email + word_freq_receive,
                  data = df,
                  method = "rpart",
                  trControl = control,
                  tuneLength = 9)

rpartFit
## CART 
## 
## 4601 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4141, 4142, 4141, 4142, 4141, 4140, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.001470859  0.8322165  0.6395488
##   0.001838573  0.8324358  0.6408218
##   0.003309432  0.8311305  0.6372933
##   0.003585218  0.8311305  0.6372933
##   0.004228719  0.8272208  0.6300256
##   0.005239934  0.8243932  0.6240411
##   0.008273580  0.8152666  0.6049807
##   0.125206839  0.7728964  0.4967135
##   0.408714837  0.6950225  0.2709886
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001838573.
df<-data
df$spam<- factor(df$spam)



## Let's predict on a test set of 1,000 observations, using the rest of the data
# as training observations

test.index <- sample.int(4601,1000)
train <- df[-test.index, ] 
test <- df[test.index, ]  


data_ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 10)

model_caret <- train(spam ~ word_freq_money + word_freq_free + word_freq_you +
                       word_freq_free + word_freq_email + word_freq_receive,
                     data = train,          # model to fit
                     method = "glm",
                     family = "binomial",
                     trControl = data_ctrl, # folds
                     na.action = na.pass)   # pass missing data to the model; glm drops missing rows
# Might get errors if a resample has too few zeros or ones in the DV; not an issue here

#Examine average prediction error of each fold's model built from training data applied to test data

model_caret 
## Generalized Linear Model 
## 
## 3601 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 10 times) 
## Summary of sample sizes: 2400, 2401, 2401, 2401, 2401, 2400, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7807001  0.5142949
#prediction error for each model
model_caret$resample # How well did training model predict test data in each fold?
##     Accuracy     Kappa    Resample
## 1  0.7926728 0.5418510 Fold1.Rep01
## 2  0.7841667 0.5264043 Fold2.Rep01
## 3  0.7741667 0.4962076 Fold3.Rep01
## 4  0.7766667 0.5027706 Fold1.Rep02
## 5  0.7950000 0.5485907 Fold2.Rep02
## 6  0.7718568 0.4940732 Fold3.Rep02
## 7  0.7858333 0.5293270 Fold1.Rep03
## 8  0.7793505 0.5074083 Fold2.Rep03
## 9  0.7691667 0.4903091 Fold3.Rep03
## 10 0.7808333 0.5149272 Fold1.Rep04
## 11 0.7858333 0.5244967 Fold2.Rep04
## 12 0.7776853 0.5079629 Fold3.Rep04
## 13 0.7925000 0.5450448 Fold1.Rep05
## 14 0.7793505 0.5120304 Fold2.Rep05
## 15 0.7750000 0.4954654 Fold3.Rep05
## 16 0.7776853 0.5117834 Fold1.Rep06
## 17 0.7708333 0.4895813 Fold2.Rep06
## 18 0.7858333 0.5241211 Fold3.Rep06
## 19 0.7775000 0.5028550 Fold1.Rep07
## 20 0.7860117 0.5282371 Fold2.Rep07
## 21 0.7783333 0.5118908 Fold3.Rep07
## 22 0.7900000 0.5357634 Fold1.Rep08
## 23 0.7793505 0.5093447 Fold2.Rep08
## 24 0.7775000 0.5087095 Fold3.Rep08
## 25 0.7650000 0.4759663 Fold1.Rep09
## 26 0.8100000 0.5839011 Fold2.Rep09
## 27 0.7676936 0.4838292 Fold3.Rep09
## 28 0.7841667 0.5211743 Fold1.Rep10
## 29 0.7835137 0.5251362 Fold2.Rep10
## 30 0.7675000 0.4796847 Fold3.Rep10
#How much variation was there across all predictions?

sd(model_caret$resample$Accuracy) 
## [1] 0.009488517
summary(model_caret$resample$Accuracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7650  0.7754  0.7794  0.7807  0.7858  0.8100
Data<-data



num.vars <- sapply(Data, is.numeric) #identify numeric variables
Data[num.vars] <- lapply(Data[num.vars], scale) #scale numeric variables
Data$spam<- factor(Data$spam)


#Here we set tuneLength to tell function to try many possibilities for k


set.seed(400)

test.index <- sample.int(4601,1000)
train <- Data[-test.index, ] #full training data with normalized features
test <- Data[test.index, ]  

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)  # 5-fold CV, repeated twice
knnFit <- train(spam ~ word_freq_money + word_freq_free + word_freq_you +
                  word_freq_free + word_freq_email + word_freq_receive,
                data = train, method = "knn", trControl = ctrl, tuneLength = 9)


#Output of kNN fit
knnFit  # train() selects the k value with the highest CV accuracy
## k-Nearest Neighbors 
## 
## 3601 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2881, 2880, 2881, 2881, 2881, 2880, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8282428  0.6320280
##    7  0.8264369  0.6271547
##    9  0.8254656  0.6258662
##   11  0.8244926  0.6234629
##   13  0.8233807  0.6208745
##   15  0.8224087  0.6192706
##   17  0.8246300  0.6241910
##   19  0.8250468  0.6250665
##   21  0.8247686  0.6242884
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
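
As before, the CV estimate can be checked against the held-out (scaled) test set (a sketch; output not shown):

postResample(predict(knnFit, newdata = test), test$spam)  # test-set accuracy and kappa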

The random forest is still the most accurate model in terms of the kappa metric, so it is the model I would select. However, in terms of ease of interpretation, the GLM could also be used, since it is easier for the user to understand how the variables affect the outcome.

14. What variable that currently is not in the data, if included, would be likely to increase your final model’s predictive power?

Probably a list of common contacts, or a dummy variable for whether the sender has already exchanged emails with the receiver. Another option would be a list of email addresses commonly associated with spam.

15. Lastly, are feature selection and feature engineering important to building models that predict test data well? If so, discuss why they are important.

Both are important. Selecting the right variables is the basic component of every model: if you lack the right data, even the most advanced technique will fail. That leads to the second issue, feature engineering, which I think is also important but less crucial. With the right data you can construct a model that does the job; however, if you require precision, feature engineering and tuning the right model parameters become crucial.