1. Introduction

1.1 Problem Statement

This is the project report for the Santander Customer Transaction Prediction project. The objective of this case is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

1.2 Data

The goal is to build classification models that predict customer outcomes: Is a customer satisfied? Will a customer buy the product? Can a customer repay the loan?

(Package-loading output omitted: each of the 10 require() calls returned TRUE.)
## [1] 200000    202
## [1] 200000    201
## 'data.frame':    200000 obs. of  202 variables:
##  $ ID_code: Factor w/ 200000 levels "train_0","train_1",..: 1 2 111113 122224 133335 144446 155557 166668 177779 188890 ...
##  $ target : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ var_0  : num  8.93 11.5 8.61 11.06 9.84 ...
##  $ var_1  : num  -6.79 -4.15 -2.75 -2.15 -1.48 ...
##  $ var_2  : num  11.91 13.86 12.08 8.95 12.87 ...
##  $ var_3  : num  5.09 5.39 7.89 7.2 6.64 ...
##  $ var_4  : num  11.5 12.4 10.6 12.6 12.3 ...
##  $ var_5  : num  -9.28 7.04 -9.08 -1.84 2.45 ...
##  $ var_6  : num  5.12 5.62 6.94 5.84 5.94 ...
##  $ var_7  : num  18.6 16.5 14.6 14.9 19.3 ...
##  $ var_8  : num  -4.92 3.15 -4.92 -5.86 6.27 ...
##  $ var_9  : num  5.75 8.09 5.95 8.24 7.68 ...
##  $ var_10 : num  2.925 -0.403 -0.325 2.306 -9.446 ...
##  $ var_11 : num  3.18 8.06 -11.26 2.81 -12.14 ...
##  $ var_12 : num  14 14 14.2 13.8 13.8 ...
##  $ var_13 : num  0.575 8.414 7.312 11.97 7.889 ...
##  $ var_14 : num  8.8 5.43 7.52 6.46 7.79 ...
##  $ var_15 : num  14.6 13.7 14.6 14.8 15.1 ...
##  $ var_16 : num  5.75 13.83 7.68 10.74 8.49 ...
##  $ var_17 : num  -7.24 -15.58 -1.74 -0.43 -3.07 ...
##  $ var_18 : num  4.28 7.8 4.7 15.94 6.53 ...
##  $ var_19 : num  30.7 28.6 20.5 13.7 11.3 ...
##  $ var_20 : num  10.54 3.43 17.76 20.3 21.42 ...
##  $ var_21 : num  16.22 2.74 18.14 12.56 18.96 ...
##  $ var_22 : num  2.58 8.55 1.21 6.82 10.11 ...
##  $ var_23 : num  2.47 3.37 3.51 2.72 2.71 ...
##  $ var_24 : num  14.38 6.98 5.68 12.14 14.21 ...
##  $ var_25 : num  13.4 13.9 13.2 13.7 13.5 ...
##  $ var_26 : num  -5.149 -11.768 -7.994 0.814 3.174 ...
##  $ var_27 : num  -0.407 -2.559 -2.903 -0.906 -3.342 ...
##  $ var_28 : num  4.93 5.05 5.85 5.91 5.9 ...
##  $ var_29 : num  5.997 0.548 6.144 2.841 7.935 ...
##  $ var_30 : num  -0.308 -9.299 -11.102 -15.24 -3.158 ...
##  $ var_31 : num  12.9 7.88 12.49 10.44 9.47 ...
##  $ var_32 : num  -3.8766 1.2859 -2.2871 -2.5731 -0.0083 ...
##  $ var_33 : num  16.89 19.37 19.04 6.18 19.32 ...
##  $ var_34 : num  11.2 11.4 11 10.6 12.4 ...
##  $ var_35 : num  10.579 0.74 4.109 -5.916 0.633 ...
##  $ var_36 : num  0.676 2.8 4.697 8.172 2.792 ...
##  $ var_37 : num  7.89 5.84 6.93 2.85 5.82 ...
##  $ var_38 : num  4.67 10.82 10.89 9.17 19.3 ...
##  $ var_39 : num  3.874 3.678 0.9 0.666 1.445 ...
##  $ var_40 : num  -5.24 -11.11 -13.52 -3.83 -5.6 ...
##  $ var_41 : num  7.37 1.87 2.24 -1.04 14.07 ...
##  $ var_42 : num  11.58 9.88 11.53 11.78 11.92 ...
##  $ var_43 : num  12 11.8 12 11.3 11.5 ...
##  $ var_44 : num  11.64 1.24 4.1 8.05 6.91 ...
##  $ var_45 : num  -7.02 -47.38 -7.91 -24.68 -65.49 ...
##  $ var_46 : num  5.92 7.37 11.14 12.74 13.87 ...
##  $ var_47 : num  -14.2136 0.1948 -5.7864 -35.1659 0.0444 ...
##  $ var_48 : num  16.028 34.401 20.748 0.761 -0.135 ...
##  $ var_49 : num  5.33 25.7 6.89 8.38 14.43 ...
##  $ var_50 : num  12.9 11.8 12.9 12.7 13.3 ...
##  $ var_51 : num  29.05 13.23 19.59 9.55 10.49 ...
##  $ var_52 : num  -0.694 -4.108 0.727 1.79 -1.437 ...
##  $ var_53 : num  5.17 6.69 6.41 5.21 5.76 ...
##  $ var_54 : num  -0.747 -8.095 9.312 8.091 -8.541 ...
##  $ var_55 : num  14.83 18.6 6.28 12.4 14.15 ...
##  $ var_56 : num  11.3 19.3 15.6 14.5 17 ...
##  $ var_57 : num  5.38 7.01 5.82 6.58 6.18 ...
##  $ var_58 : num  2.02 1.92 1.1 3.32 1.95 ...
##  $ var_59 : num  10.12 8.87 9.19 9.46 9.2 ...
##  $ var_60 : num  16.18 8.01 12.6 15.78 8.66 ...
##  $ var_61 : num  4.96 -7.24 -10.37 -25.02 -27.74 ...
##  $ var_62 : num  2.077 1.794 0.875 3.442 -0.495 ...
##  $ var_63 : num  -0.215 -1.315 5.804 -4.392 -1.784 ...
##  $ var_64 : num  8.67 8.1 3.72 8.65 5.27 ...
##  $ var_65 : num  9.53 1.54 -1.1 6.31 -4.32 ...
##  $ var_66 : num  5.81 5.4 7.37 5.62 6.99 ...
##  $ var_67 : num  22.43 7.93 9.86 23.61 1.62 ...
##  $ var_68 : num  5.01 5.02 5.02 5.02 5.03 ...
##  $ var_69 : num  -4.7 2.23 -5.78 -4 -3.24 ...
##  $ var_70 : num  21.64 40.56 2.36 4.05 40.12 ...
##  $ var_71 : num  0.566 0.513 0.852 0.25 0.774 ...
##  $ var_72 : num  5.2 3.17 6.358 1.252 -0.726 ...
##  $ var_73 : num  8.86 20.11 12.17 24.42 4.59 ...
##  $ var_74 : num  43.11 7.78 19.73 4.53 -4.53 ...
##  $ var_75 : num  18.38 7.05 19.45 15.42 23.35 ...
##  $ var_76 : num  -2.34 3.27 4.5 11.69 1.03 ...
##  $ var_77 : num  23.4 23.5 23.2 23.6 19.2 ...
##  $ var_78 : num  6.52 5.51 6.32 4.08 7.17 ...
##  $ var_79 : num  12.2 13.8 12.8 15.3 14.4 ...
##  $ var_80 : num  13.647 2.546 7.473 0.784 2.96 ...
##  $ var_81 : num  13.8 18.2 15.8 10.5 13.3 ...
##  $ var_82 : num  1.367 0.368 13.353 1.621 -9.259 ...
##  $ var_83 : num  2.94 -4.82 10.19 -5.29 -6.71 ...
##  $ var_84 : num  -4.52 -5.49 5.46 1.6 7.9 ...
##  $ var_85 : num  21.5 13.8 19.1 18 14.5 ...
##  $ var_86 : num  9.32 -13.59 -4.46 -2.32 7.08 ...
##  $ var_87 : num  16.46 11.1 9.54 15.63 20.17 ...
##  $ var_88 : num  8 7.9 11.91 4.55 8.01 ...
##  $ var_89 : num  -1.71 12.23 2.14 7.55 3.8 ...
##  $ var_90 : num  -21.449 0.477 -22.404 -7.587 -39.8 ...
##  $ var_91 : num  6.78 6.89 7.09 7.04 7.01 ...
##  $ var_92 : num  11.09 8.09 14.16 14.4 9.36 ...
##  $ var_93 : num  9.99 10.96 10.51 10.78 10.43 ...
##  $ var_94 : num  14.84 11.76 14.26 7.29 14.06 ...
##  $ var_95 : num  0.1812 -1.2722 0.2647 -1.093 0.0213 ...
##  $ var_96 : num  8.96 24.79 20.4 11.36 14.72 ...
##   [list output truncated]

2. Methods used

2.1 Pre-processing

Any predictive modeling requires that we look at the data before we start modeling. In data mining terms, however, looking at data means much more than just looking: it means exploring the data, cleaning it, and visualizing it through graphs and plots. This process is often called Exploratory Data Analysis. To start, we look at the probability distributions of the variables. Most analyses, such as regression, require the data to be normally distributed, and we can assess that at a glance from the probability distributions or probability density functions of the variables.

## [1] 0
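The zero above is presumably the count of missing values in the training data. A minimal sketch of that check, together with the density plots just described, assuming the training data is loaded as train:

sum(is.na(train))                  # total missing values; 0 in the run above

# Probability density of one feature as an example; the same call can be
# repeated (or looped) over all 200 feature columns
plot(density(train$var_0), main = "Density of var_0")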

2.2 Distribution of all types of variables

We check the distributions shown by the various variables. For this, we draw plots such as bar graphs, histograms, scatter plots, and box plots.
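As an illustration, and assuming base R graphics on a data frame named train, such plots can be drawn per variable like this:

hist(train$var_0, main = "Histogram of var_0", xlab = "var_0")
boxplot(train$var_0, main = "Boxplot of var_0")
barplot(table(train$target), main = "Bar graph of target classes")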

2.3 Check for outliers

The next step is to check for outliers. Outliers are points in the data that do not follow the pattern shown by the other data points. These outliers need to be removed from the data.
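The report does not fix a detection rule, so as one common choice, here is a sketch using the 1.5 * IQR rule for a single feature:

# Flag values more than 1.5 * IQR beyond the quartiles (an assumed rule)
q <- quantile(train$var_0, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
is_out <- train$var_0 < q[1] - 1.5 * iqr | train$var_0 > q[2] + 1.5 * iqr
sum(is_out)                      # number of flagged points for var_0
train_clean <- train[!is_out, ]  # drop the flagged rows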

## 
##      0      1 
## 179902  20098
## 
##      0      1 
## 89.951 10.049
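The two tables above show that the target is heavily imbalanced: roughly 90% of customers belong to class 0 and only 10% to class 1. They can be reproduced with:

table(train$target)                     # counts per class
prop.table(table(train$target)) * 100   # percentages per class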

Distribution of feature variables

From the above layout, we can select any of the independent variables and see the corresponding relations between the train and test datasets.

Distribution of mean values per row and column in the train and test datasets

Distribution of standard deviation values per row and column in the train and test datasets

Distribution of skewness values per row and column in the train and test datasets

Correlation Analysis

Correlations in train data

The correlation between the train attributes is very small.

Correlations in test data

The correlation between the test attributes is also very small.
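A sketch of the underlying computation, assuming ID_code and target occupy the first two columns of train and ID_code the first column of test:

# Largest absolute off-diagonal correlation among the train features
cor_train <- cor(train[, -(1:2)])
diag(cor_train) <- 0
max(abs(cor_train))

# Same check for the test features
cor_test <- cor(test[, -1])
diag(cor_test) <- 0
max(abs(cor_test))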

2.4 Feature Selection

Before performing any type of modeling, we need to assess the importance of each predictor variable in our analysis. It is possible that many variables are not important at all for the class-prediction problem. There are several methods for doing this; below we use Random Forests to perform feature selection.

Since we could not get any relevant information from the correlation analysis, and our objective is to find the variables that stand out so that we can concentrate on those variables only, we take a sample of the data and try to estimate variable importance.

## [1] 150000    202
## [1] 50000   202
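The dimensions above correspond to a 150,000 / 50,000 row split of the training data. Assuming simple random sampling (the seed and object names are hypothetical), one way to produce it:

set.seed(123)                        # hypothetical seed
idx <- sample(nrow(train), 150000)
train_sample <- train[idx, ]         # 150000 x 202
holdout_sample <- train[-idx, ]      # 50000 x 202
dim(train_sample); dim(holdout_sample)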

Random forest classifier

Feature importance by random forest

Here the features are sorted by their importance, and we can observe that the most significant variable is var_81.

We can also visualise this using a plot, as in the sketch below.
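A sketch of the importance computation and its plot, assuming the randomForest package and the train_sample from above (ntree and the other parameters are assumptions, not the report's settings):

library(randomForest)
rf_fit <- randomForest(as.factor(target) ~ . - ID_code,
                       data = train_sample, ntree = 100, importance = TRUE)
imp <- importance(rf_fit, type = 2)                     # mean decrease in Gini
imp <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
head(imp)                    # var_81 ranks first in the report's run
varImpPlot(rf_fit)           # importance plot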

We examine the impact of the main features discovered in the previous section using the pdp package.
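A sketch of a partial dependence plot for the top feature, assuming the rf_fit object from the previous sketch:

library(pdp)
# Partial dependence of the predicted probability of class "1" on var_81
pd <- partial(rf_fit, pred.var = "var_81", prob = TRUE,
              which.class = 2, train = train_sample)
plotPartial(pd)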

Applying models to the data

3. Model Selection

The dependent variable can fall into one of four categories:

1. Nominal

2. Ordinal

3. Interval

4. Ratio

If the dependent variable, in our case target, is nominal, the only predictive analysis we can perform is classification; if the dependent variable is interval or ratio, the normal method is regression analysis, or classification after binning.

Splitting the data

## [1] 160000    202
## [1] 40000   202

Training and validation datasets
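The dimensions above correspond to an 80/20 split. A sketch using caret (the seed and object names are assumptions):

library(caret)
set.seed(42)                                  # hypothetical seed
idx <- createDataPartition(train$target, p = 0.8, list = FALSE)
train_set <- train[idx, ]                     # 160000 x 202
valid_set <- train[-idx, ]                    # 40000 x 202
dim(train_set); dim(valid_set)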

Logistic Regression

##            Length Class     Mode     
## a0            61  -none-    numeric  
## beta       12200  dgCMatrix S4       
## df            61  -none-    numeric  
## dim            2  -none-    numeric  
## lambda        61  -none-    numeric  
## dev.ratio     61  -none-    numeric  
## nulldev        1  -none-    numeric  
## npasses        1  -none-    numeric  
## jerr           1  -none-    numeric  
## offset         1  -none-    logical  
## classnames     2  -none-    character
## call           4  -none-    call     
## nobs           1  -none-    numeric
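A component summary like the one above can be produced by fitting a regularized logistic regression with glmnet; a sketch, assuming train_set from the split (X_t and y_t match the names in the cv.glmnet call below):

library(glmnet)
X_t <- as.matrix(train_set[, -(1:2)])   # the 200 feature columns
y_t <- train_set$target
lr_fit <- glmnet(X_t, y_t, family = "binomial")
summary(lr_fit)                         # prints a summary like the one above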

Cross-validation prediction

## 
## Call:  cv.glmnet(x = X_t, y = y_t, type.measure = "class", family = "binomial") 
## 
## Measure: Misclassification Error 
## 
##        Lambda Measure        SE Nonzero
## min 0.0003163 0.08590 0.0005722     195
## 1se 0.0008800 0.08632 0.0006442     184

Plotting the misclassification error vs. log(lambda), where lambda is the regularization parameter.

## [1] 0.000316272
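The cross-validated fit, its error curve, and the optimal lambda reported above (the call itself is taken from the printed model):

cv_fit <- cv.glmnet(x = X_t, y = y_t, type.measure = "class",
                    family = "binomial")
plot(cv_fit)         # misclassification error vs log(lambda)
cv_fit$lambda.min    # 0.000316272 in the run above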

Observation: the misclassification error increases as log(lambda) increases.

Model performance on validation dataset

Creating Confusion Matrix for Logistic Regression

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2
##          1 35618  2973
##          2   434   975
##                                          
##                Accuracy : 0.9148         
##                  95% CI : (0.912, 0.9175)
##     No Information Rate : 0.9013         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3292         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9880         
##             Specificity : 0.2470         
##          Pos Pred Value : 0.9230         
##          Neg Pred Value : 0.6920         
##              Prevalence : 0.9013         
##          Detection Rate : 0.8904         
##    Detection Prevalence : 0.9648         
##       Balanced Accuracy : 0.6175         
##                                          
##        'Positive' Class : 1              
## 
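A sketch of how such a confusion matrix can be produced with caret, assuming the validation objects from the split above (in the printed output the classes 0/1 appear recoded as 1/2):

library(caret)
X_v <- as.matrix(valid_set[, -(1:2)])
y_v <- valid_set$target
pred_class <- predict(cv_fit, newx = X_v, type = "class", s = "lambda.min")
confusionMatrix(factor(pred_class, levels = c(0, 1)),
                factor(y_v, levels = c(0, 1)))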

Receiver operating characteristic (ROC) curve and area under the curve (AUC) score

The area under the curve comes out to 0.6175.
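A sketch of the ROC/AUC computation with the pROC package, assuming predicted probabilities on the validation set:

library(pROC)
pred_prob <- predict(cv_fit, newx = X_v, type = "response", s = "lambda.min")
roc_obj <- roc(y_v, as.numeric(pred_prob))
auc(roc_obj)      # AUC score
plot(roc_obj)     # ROC curve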

Now, this model needs to be tested on the test data.

Converting predictions to probabilities
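A sketch of scoring the test data and assembling sub_df, the submission frame referenced in the conclusion (any names beyond ID_code, target, and sub_df are assumptions):

X_test <- as.matrix(test[, -1])         # drop ID_code
test_prob <- predict(cv_fit, newx = X_test, type = "response",
                     s = "lambda.min")
sub_df <- data.frame(ID_code = test$ID_code,
                     target = as.numeric(test_prob))
head(sub_df)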

4. Conclusion

sub_df is the final data frame that captures the predictions made by the logistic regression model on the test dataset, with an AUC of 0.6175.