Requirements
A work sample report should be redacted, reflect your individual work effort, and illustrate your capability as a data analyst. Please include a brief summary that identifies the project goals, methodology, data sample, tools, etc. Please submit the document in PDF format no later than noon on Thursday, May 30th.
Summary
This work sample was created using R (R Core Team 2016), a language and environment for statistical computing and graphics that is well suited to statistical and data analysis and to sharing results in various forms.
This sample encompasses two different projects: one involving time series forecasting, the other a more methodical approach to a given data set.
Work Samples
Example 1
Overview
Example 1 consists of a simple data set of residential power usage from January 1998 through December 2013. The data are given in a single file. The variable "KWH" is power consumption in kilowatt hours; the remaining columns are straightforward.
Objective
The objective is to model the data and to perform a monthly forecast for 2014.
Procedure
First, let's take a quick look at what the data look like:
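A minimal sketch of how these data could be loaded and inspected in R (the file name is hypothetical):
# Hypothetical file name; read the raw data as a data frame
power <- read.csv("power_usage.csv", stringsAsFactors = FALSE)
head(power)     # first few readings, as shown in the table below
summary(power)  # summary statistics, shown in the next step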
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 733 | 1998-Jan | 6862583 |
| 734 | 1998-Feb | 5838198 |
| 735 | 1998-Mar | 5420658 |
| 736 | 1998-Apr | 5010364 |
| 737 | 1998-May | 4665377 |
| 738 | 1998-Jun | 6467147 |
From the above, we notice 3 columns as follows:
CaseSequence: Indicates the sequence of the readings.
YYYY-MMM: Indicates the date of the reading.
KWH: Indicates the value of the reading in KWH.
Second, let’s have a description of the data:
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
From above, there seems to be a missing value as reported in the summary table under NA’s.
Third, I would like to have a visualization of the missing data since there’s an indication of NA. For this purpose, I will make use of the function vis_miss() from the library naniar.
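A sketch of that call, assuming the data frame is named power as above:
library(naniar)
# Visual overview of missing values across all columns
vis_miss(power)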
Let’s have a better understanding of the missing data.
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 861 | 2008-Sep | NA |
We are not sure why there is a missing value for September 2008. At this point, I am not sure whether I should simply remove the missing value or replace it with a more meaningful reading, perhaps the mean of all September values. I will come back to this issue as I go further.
Fourth, let's create a time series object.
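A sketch of the construction, assuming the data frame from the previous steps:
# Monthly time series of the KWH readings, starting January 1998
power.ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)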
Let’s have a better understanding of the time series.
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 NA 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
From the above table, it is evident that we need to replace the NA with a more "meaningful" value rather than eliminate it; my approach will be to calculate the mean of all September readings across the years and replace the NA with that value.
Below is the time series after replacing the missing value with the mean for the respective month; in this case, September 2008 was replaced with 7702333.
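A sketch of the replacement step, using the object names above:
# Mean of all September readings, ignoring the missing one
sep.mean <- mean(power.ts[cycle(power.ts) == 9], na.rm = TRUE)
# Replace the single NA (September 2008) with the rounded mean
power.ts[is.na(power.ts)] <- round(sep.mean)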
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 7702333 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
Let’s visualize our data.
In this particular case, I am not sure why there is a very low reading for July 2010; it is unusually low (and corresponds to a large value in the remainder component of the series). One possibility is a power outage during the summer, which seems plausible. I did some research, but since there is no reference to the geographical area of the data set, I could not confirm this. I will assume this to be the cause, treat it as an accurate reading, and leave the value unchanged.
Also, the data are clearly non-stationary, as the series wanders up and down for long periods. Consequently, we will take a first difference of the data. The differenced data are shown below.
Let’s have a visualization of the differences.
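A sketch of how the differenced series and its ACF/PACF could be displayed, using the forecast package:
library(forecast)
# First difference of the series together with its ACF and PACF
ggtsdisplay(diff(power.ts))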
In the above plot, we notice some autocorrelations at the early lags; the PACF suggests an AR(3) model, so an initial candidate model is ARIMA(3,1,0).
Training/test: In this section, I will split the given data into training and test sets. These will be used to assess the accuracy of the models.
power.train <- window(power.ts, end=c(2012,12))
power.test <- window(power.ts, start=c(2013,1))
ARIMA
Let's find an ARIMA model.
Regular fit, with no transformation, because there seems to be no evidence of changing variance.
power.fit.manual.arima <- Arima(power.train, c(3,1,0))
power.fit.auto.arima <- auto.arima(power.train, seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
Let's see the results:
Manual Arima fit
## Series: power.train
## ARIMA(3,1,0)
##
## Coefficients:
## ar1 ar2 ar3
## -0.0852 -0.2971 -0.4079
## s.e. 0.0681 0.0643 0.0678
##
## sigma^2 estimated as 1.707e+12: log likelihood=-2773.71
## AIC=5555.43 AICc=5555.66 BIC=5568.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -185.2833 1292085 954369.9 -6.655492 19.56656 1.381861
## ACF1
## Training set -0.1858258
Auto Arima fit
## Series: power.train
## ARIMA(3,1,1) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 drift
## 0.4654 -0.3206 -0.2597 -0.9759 6359.358
## s.e. 0.0782 0.0759 0.0789 0.0527 2604.902
##
## sigma^2 estimated as 1.191e+12: log likelihood=-2742.1
## AIC=5496.2 AICc=5496.68 BIC=5515.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -28105.18 1072783 791147.6 -7.011082 16.8035 1.145527
## ACF1
## Training set -0.01324442
If we compare both models, we notice that the RMSE is considerably better for the automated ARIMA model; another indication is the AICc, which is also better for the automated model than for our manually selected one. Hence, I will pick the automated ARIMA model.
Accuracy
Let's find out how accurate the models are:
Manual Arima(3,1,0)
Let’s visualize the manually selected Arima(3,1,0) model forecast results:
| Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 | |
|---|---|---|---|---|---|
| Jan 2013 | 7297359 | 5622774 | 8971944 | 4736302.5 | 9858415 |
| Feb 2013 | 6914690 | 4645149 | 9184231 | 3443725.9 | 10385654 |
| Mar 2013 | 6384942 | 3885764 | 8884120 | 2562779.0 | 10207105 |
| Apr 2013 | 6263306 | 3724433 | 8802180 | 2380434.5 | 10146178 |
| May 2013 | 6587161 | 3953365 | 9220957 | 2559117.4 | 10615204 |
| Jun 2013 | 6811776 | 3974484 | 9649068 | 2472511.9 | 11151040 |
| Jul 2013 | 6746019 | 3667713 | 9824325 | 2038155.9 | 11453882 |
| Aug 2013 | 6552789 | 3324221 | 9781356 | 1615121.1 | 11490457 |
| Sep 2013 | 6497179 | 3169416 | 9824942 | 1407804.7 | 11586554 |
| Oct 2013 | 6586154 | 3156549 | 10015759 | 1341026.3 | 11831282 |
| Nov 2013 | 6673909 | 3110534 | 10237285 | 1224196.9 | 12123622 |
| Dec 2013 | 6662675 | 2957089 | 10368262 | 995469.7 | 12329881 |
Let’s take a look at the accuracy table and let’s focus on the RMSE results for the manually selected Arima model. In this particular case, the test set results are not very promising.
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U | |
|---|---|---|---|---|---|---|---|---|
| Training set | -185.2833 | 1292085 | 954369.9 | -6.655492 | 19.56656 | 1.381861 | -0.1858258 | NA |
| Test set | 953505.3038 | 1709364 | 1376213.1 | 9.233251 | 16.48537 | 1.992661 | 0.1195800 | 0.817616 |
Let’s have a visualization of the manually selected Arima model forecasts.
In effect, the forecast curve does not seem to follow the pattern of the data.
Automated Arima(3,1,1)
Let’s visualize the automatically selected Arima(3,1,1) with drift model forecast results:
| Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 | |
|---|---|---|---|---|---|
| Jan 2013 | 7695337 | 6297001 | 9093673 | 5556767 | 9833907 |
| Feb 2013 | 7886045 | 6329165 | 9442925 | 5505002 | 10267088 |
| Mar 2013 | 7405840 | 5845999 | 8965681 | 5020270 | 9791411 |
| Apr 2013 | 6846288 | 5177271 | 8515305 | 4293747 | 9398829 |
| May 2013 | 6697353 | 4983444 | 8411263 | 4076155 | 9318552 |
| Jun 2013 | 6939245 | 5224004 | 8654485 | 4316011 | 9562479 |
| Jul 2013 | 7252012 | 5502568 | 9001456 | 4576469 | 9927556 |
| Aug 2013 | 7365814 | 5595131 | 9136498 | 4657787 | 10073841 |
| Sep 2013 | 7262770 | 5491779 | 9033761 | 4554273 | 9971267 |
| Oct 2013 | 7104174 | 5328558 | 8879790 | 4388604 | 9819744 |
| Nov 2013 | 7040922 | 5262050 | 8819793 | 4320373 | 9761470 |
| Dec 2013 | 7096182 | 5317239 | 8875126 | 4375523 | 9816842 |
Let's take a look at the accuracy table and focus on the RMSE results for the automatically selected ARIMA model. In this particular case, the test set results are not very promising either. Comparing the RMSE values to our manually selected model, there is an improvement, but it still seems that the forecasts from this model are not very accurate.
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U | |
|---|---|---|---|---|---|---|---|---|
| Training set | -28105.18 | 1072783 | 791147.6 | -7.011082 | 16.80350 | 1.145527 | -0.0132444 | NA |
| Test set | 402336.77 | 1477517 | 1270174.5 | 1.774910 | 16.20176 | 1.839124 | 0.0851104 | 0.7294665 |
Let’s have a visualization of the automatically selected Arima model forecasts.
Let’s compare side by side the test forecasts, compared to our test data.
In effect, the forecasts are not very accurate and perhaps another model should be selected.
STL model: Based on the previous results, I will focus on the STL model.
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”.
STL has several advantages over the classical, SEATS and X11 decomposition methods:
Unlike SEATS and X11, STL will handle any type of seasonality, not only monthly and quarterly data.
The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
The smoothness of the trend-cycle can also be controlled by the user.
It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
Let’s visualize the STL decomposition.
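A sketch of the decomposition; the s.window and robust settings are assumptions:
# STL decomposition of the training series
power.fit.stl <- stl(power.train, s.window = "periodic", robust = TRUE)
plot(power.fit.stl)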
Let’s forecast with the naive and snaive method.
# Calculating forecasts for naive and snaive
power.fit.naive <- forecast(power.fit.stl, method="naive", h =12)
power.fit.snaive <- snaive(power.train, h = 12)
# Calculating accuracy
power.accuracy.naive <- accuracy(power.fit.naive, power.test)
power.accuracy.snaive <- accuracy(power.fit.snaive, power.test)
Let's see the respective accuracy results for both models.
Naive Forecast accuracy results.
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U | |
|---|---|---|---|---|---|---|---|---|
| Training set | 8710.895 | 984708.5 | 579079.1 | -5.226505 | 13.670208 | 0.8384662 | -0.3862982 | NA |
| Test set | 621948.397 | 1140409.9 | 710563.5 | 6.869135 | 8.314329 | 1.0288465 | 0.0602833 | 0.6368615 |
SNaive Forecast accuracy results.
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U | |
|---|---|---|---|---|---|---|---|---|
| Training set | 72034.37 | 1182065 | 690640.9 | -4.526757 | 14.989184 | 1.0000000 | 0.2550212 | NA |
| Test set | 405195.25 | 1035538 | 618605.6 | 4.547778 | 7.058492 | 0.8956979 | -0.0313027 | 0.6156093 |
In this particular case, the snaive method offers a much better RMSE value, making this model the most accurate of them all.
Let’s visualize the results.
Forecasting 2014
Employing the snaive model.
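A sketch of the forecast call, now fit on the full series; the object name is hypothetical:
# Seasonal naive forecast for the 12 months of 2014
power.forecast.2014 <- snaive(power.ts, h = 12)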
Forecast results
| Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 | |
|---|---|---|---|---|---|
| Jan 2014 | 10655730 | 9152641 | 12158819 | 8356954 | 12954506 |
| Feb 2014 | 7681798 | 6178709 | 9184887 | 5383022 | 9980574 |
| Mar 2014 | 6517514 | 5014425 | 8020603 | 4218738 | 8816290 |
| Apr 2014 | 6105359 | 4602270 | 7608448 | 3806583 | 8404135 |
| May 2014 | 5940475 | 4437386 | 7443564 | 3641699 | 8239251 |
| Jun 2014 | 7920627 | 6417538 | 9423716 | 5621851 | 10219403 |
| Jul 2014 | 8415321 | 6912232 | 9918410 | 6116545 | 10714097 |
| Aug 2014 | 9080226 | 7577137 | 10583315 | 6781450 | 11379002 |
| Sep 2014 | 7968220 | 6465131 | 9471309 | 5669444 | 10266996 |
| Oct 2014 | 5759367 | 4256278 | 7262456 | 3460591 | 8058143 |
| Nov 2014 | 5769083 | 4265994 | 7272172 | 3470307 | 8067859 |
| Dec 2014 | 9606304 | 8103215 | 11109393 | 7307528 | 11905080 |
Forecast visualization
Conclusion
From the above analysis, we can conclude that a good prediction model is the STL decomposition employing the snaive method, given the similar pattern between the test data and the predicted future values.
Example 2
Overview
Example 2 consists of exploring, analyzing, and modeling a data set containing approximately \(8000\) records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a \(1\) or a \(0\): a "\(1\)" means that the person was in a car crash, while a zero means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash their car, this number will be a value greater than zero.
Objective
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. I can only use the variables given (or variables that I derive from the variables provided).
Procedure
In this example, I will make use of a version control and collaboration tool called GitHub, alongside R. GitHub, a subsidiary of Microsoft, is a web-based hosting service for version control using Git. It is mostly used for computer code and offers all of the distributed version control and source code management (SCM) functionality of Git, as well as its own added features.
It provides access control and several collaboration features such as bug tracking, feature requests, task management, and wikis for every project.
Dataset description
Let's start by taking a look at the data and its respective dictionary.
Variable definitions
The below list represents the definitions for each given variable.
Theoretical effect of variables
The below list represents the theoretical effect of each given variable.
Data Exploration
Let's take a look at the structure and composition of the data set.
Data acquisition
For reproducibility purposes, I have included the original data sets in my GitHub account; I will read them as data frames from that location.
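The helper get_data() is not shown in this report; a minimal sketch of what it might look like (the raw-GitHub URL layout is an assumption):
# Hypothetical helper: read a csv file from a raw GitHub location
get_data <- function(git_user, git_dir, file_name) {
  url <- paste0("https://raw.githubusercontent.com/", git_user, "/", git_dir, "/master/", file_name)
  read.csv(url, stringsAsFactors = FALSE)
}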
data.train <- get_data(git_user, git_dir, 'insurance_training_data.csv')
data.eval <- get_data(git_user, git_dir, 'insurance-evaluation-data.csv')
General exploration
The below process will help us obtain insights from our given data.
Dimensions
Let’s see the dimensions of our training data set.
From the above table, we can see that the training data set has a total of 8161 records and 26 variables, including INDEX, TARGET_FLAG, and TARGET_AMT. These last variables do not contribute much to the initial insights, since TARGET_FLAG and TARGET_AMT are our response variables and INDEX is simply an identifier.
For simplicity reasons, I will discard the INDEX column.
remove_cols <- names(data.train) %in% c('INDEX')
data.train <- data.train[!remove_cols]
Structure
The below structure is currently present in the data; for simplicity, I have loaded and treated this data set as a data frame in which all variables with decimals are numeric.
From the above table, we can notice that we need to take care of certain strings that are read as factors but in reality represent numbers and should not be treated as factors. This will be addressed in more detail as we advance.
Summaries
Let's compute some summary statistics for our given data; for that, I will gather a few more insights for all the columns, including the TARGET_FLAG and TARGET_AMT variables.
Combined Summary
In this section, we will explore the combined results as introductory insights.
data.train.summary <- get_df_summary(data.train)
data.train.summary
Please note that this is for introductory insights and should not be considered complete results.
TARGET_FLAG Summaries
Next, I will split the data into two data sets depending on TARGET_FLAG. Let's see some summaries for each group.
TARGET_FLAG_0 <- data.train[data.train$TARGET_FLAG == 0,]
TARGET_FLAG_1 <- data.train[data.train$TARGET_FLAG == 1,]
Number of records by group
The below table shows how many records each group has.
TARGET_FLAG = 0
Let’s have a better look at the individualized summaries by having TARGET_FLAG = 0.
data.train.summary_0 <- get_df_summary(TARGET_FLAG_0)
data.train.summary_0
TARGET_FLAG = 1
Let’s have a better look at the individualized summaries by having TARGET_FLAG = 1.
data.train.summary_1 <- get_df_summary(TARGET_FLAG_1)
data.train.summary_1
From the above reports, we can see that we need to "transform" our data in order to make it more workable.
In order to do so, I will start to prep the data. Graphical visualizations and correlations will be provided later on since we need to address a few things in order to make our data more workable.
Findings
From the above tables, it is interesting to note the following:
The training data set shows the presence of missing values (NAs) in some columns; this can be seen in the Other column. This will be addressed as we prepare our data down the road.
"(Other)" means that there are factor values that could not be grouped accordingly.
It is interesting to see that CAR_AGE shows a minimum value of -3. This needs to be investigated since it does not seem accurate.
The maximum value for TARGET_AMT seems to be very far from the mean and the median. This needs to be evaluated to determine whether it is accurate.
Data Preparation
In this section, I will prepare our given data-set. For that I will need to address a few things, like factors and missing data.
Data conversion
In this section, I will describe the conversion of the data that is required in order to have a more manageable understanding of it.
FACTOR to NUMERIC
This section explains the conversion of currency values that were interpreted as factors when in reality they should have been treated as numeric.
The variables that need to be converted are: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM.
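A sketch of the conversion, assuming the usual "$1,234" currency formatting; the helper and vector names are hypothetical:
# Remove "$" and "," and convert the currency columns to numeric
to_numeric <- function(x) as.numeric(gsub("[$,]", "", as.character(x)))
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
data.train[money_cols] <- lapply(data.train[money_cols], to_numeric)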
FACTOR to “Dummy”
In this section, I will transform the remaining factor variables into various binary “Dummy” variables with values 1 for “Yes” and 0 for “No”.
The variables that need to be converted are: PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED and URBANICITY.
Please note that there will be a new set of variables summarizing the above data as follows (a sketch of the recoding appears after the list):
IS_SINGLE_PARENT will represent if PARENT1 = “Yes” with a value of \(1\); \(0\) otherwise.
IS_MARRIED will represent if MSTATUS = “Yes”" with a value of \(1\); \(0\) otherwise.
IS_FEMALE will represent if SEX = “z_F” with a value of \(1\); \(0\) otherwise.
EDUCATION will represent diverse EDUCATION levels, “<High School” is the default and this column was not included.
JOB will represent diverse JOB levels, “Blank” is the default and this column was not included.
IS_CAR_PRIVATE_USE will represent if CAR_USE = “Private” with a value of \(1\); \(0\) otherwise.
CAR_TYPE will represent diverse CAR_TYPE levels, “Minivan” is the default and this column was not included.
IS_CAR_RED will represent if RED_CAR = “Yes” with a value of \(1\); \(0\) otherwise.
IS_LIC_REVOKED will represent if REVOKED = “Yes” with a value of \(1\); \(0\) otherwise.
IS_URBAN will represent if URBANICITY = “Highly Urban/ Urban” with a value of \(1\); \(0\) otherwise.
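A sketch of the recoding for the two-level factors (level labels as listed above); the multi-level factors EDUCATION, JOB and CAR_TYPE are expanded into one indicator per non-default level, for example via model.matrix():
# Binary indicators derived from the two-level factor variables
data.train$IS_SINGLE_PARENT   <- ifelse(data.train$PARENT1 == "Yes", 1, 0)
data.train$IS_MARRIED         <- ifelse(data.train$MSTATUS == "Yes", 1, 0)
data.train$IS_FEMALE          <- ifelse(data.train$SEX == "z_F", 1, 0)
data.train$IS_CAR_PRIVATE_USE <- ifelse(data.train$CAR_USE == "Private", 1, 0)
data.train$IS_CAR_RED         <- ifelse(data.train$RED_CAR == "Yes", 1, 0)
data.train$IS_LIC_REVOKED     <- ifelse(data.train$REVOKED == "Yes", 1, 0)
data.train$IS_URBAN           <- ifelse(data.train$URBANICITY == "Highly Urban/ Urban", 1, 0)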
Let’s see our resulting table.
NAs prep
First, let’s see how our values are represented after the above transformations.
In order to handle the missing values, I will proceed as follows.
Proportion findings
Let’s calculate the proportion of missing values in order to determine the best approach for these variables.
The below list displays the combined missing-percentage values for each variable.
NAs by TARGET_FLAG group
For this, let’s see how many records each group has.
Since those percentages are low compared to the size of our data set, I will replace the missing NA values with randomly selected values between the respective minimum and maximum values, with the exception of CAR_AGE: since it shows a negative value, I will use 0 as the minimum for that particular variable.
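A sketch of the random imputation; the seed and the helper name are assumptions:
set.seed(123)  # assumed seed, added for reproducibility
# Replace NAs with random draws between the observed minimum and maximum
impute_random <- function(x, lower = min(x, na.rm = TRUE), upper = max(x, na.rm = TRUE)) {
  x[is.na(x)] <- round(runif(sum(is.na(x)), lower, upper))
  x
}
# Applied to each column with missing values; CAR_AGE gets a floor of 0
data.train$CAR_AGE <- impute_random(data.train$CAR_AGE, lower = 0)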
New Structure
In order to visualize our new structure, I will put together the new set of variables with the transformations. Let’s see our structure once again, but this time after the transformation of the data.
CAR_AGE investigation
Let's find out why CAR_AGE has a minimum value of \(-3\), which seems to be incorrect. For this, I will select the records with CAR_AGE < 0, with the goal of identifying other possibly unrealistic values.
From the above results, there seems to be no apparent reason why this value was entered. A possible explanation is that the person who typed the record entered a wrong number; it could have been 3, 0, or any other value. In order to keep data integrity, I will remove that record from our data set.
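A sketch of the removal:
# Drop the record with the negative CAR_AGE value
data.train <- subset(data.train, CAR_AGE >= 0)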
Visualizations
From previous summary tables, we established that the TARGET_FLAG groups have different mean values, from which we could start forming some hypotheses, such as:
\(H_0\): The means for the divided data-set in which TARGET_FLAG = 1 and TARGET_FLAG = 0 are the same.
\(H_1\): The means for the divided data-set in which TARGET_FLAG = 1 and TARGET_FLAG = 0 are not the same.
Let’s create some visualizations and see how this data behave.
TARGET_FLAG vs other variables.
In this case, I will compare our data by separating the TARGET_FLAG values. That is, the light-green color represents "0" and the red color represents "1", meaning that red records were involved in a car accident while light-green records were not.
Box plots
From the previous visualizations, we can notice that some sort of relationship exists between some of the variables. In order to build a better understanding, let's visualize their behavior by analyzing individual cases.
Now, if we compare our previous plots to the given theoretical effects, we can see the following:
AGE: AGE definitely seems to be an important factor; we can notice a considerable difference in the mean age for the group that had an accident, implying that younger people are riskier.
BLUEBOOK: The mean values are considerably different for records with accidents versus those without. The BLUEBOOK mean is lower when there is an accident, which makes sense for the data, since the car should have less value after an accident happens.
CAR_AGE: This is very interesting; it seems that newer cars are involved in accidents more than older cars.
INCOME: The provided data agree with the theoretical effect. That is, the mean income of the people who are involved in accidents is lower than that of those who are not. From my perspective, this is something very interesting that could be studied in more detail. Perhaps this is a factor in economic growth: lower-income individuals tend to have more accidents, which in turn limits their income growth due to repair expenses, fees, fines, loss of time, and insurance premium hikes.
MVR_PTS: The data agree with this theoretical effect; it is noted that the mean of MVR_PTS is higher when accidents are reported.
TIF: This seems to be true; the data report a higher mean for those who do not have an accident.
TRAVTIME: Not much of a difference, but the data seem to agree.
Correlations
Let’s create some visualizations for the correlation matrix.
Let's start with a combined correlation, that is, with no distinction between the TARGET_FLAG groups.
Combined Graphical visualization
First, let’s create a visual representation of correlations with a heat map as a guide.
Something interesting to note from the above graph is the existing moderate negative correlation between IS_FEMALE and IS_CAR_RED; the value for this correlation is -0.6666. At this point we should not make any inference from this, since IS_FEMALE only indicates sex and IS_CAR_RED only indicates whether the car is red. Further analysis needs to be performed to reach any conclusion related to those two variables.
Also, we can notice some moderately strong correlations in our given data set, such as the relationship between EDUCATIONPhD and JOBDoctor with a correlation value of 0.5633. Another moderate correlation noticed in the data set is between EDUCATIONMasters and JOBLawyer with a correlation value of 0.5993.
Combined Numerical visualization
From the above graph, we can identify some correlations between the response variables TARGET_FLAG and TARGET_AMT and the other variables.
Let’s read our correlations table to gain extra insights.
Combined Correlations histogram
Something very interesting to note from the above table is that the correlations within the data set seem to be very low. We can notice that in the below density distributions for the respective correlations.
TARGET_FLAG = Accidents
Let’s do a correlation analysis for the data set in which TARGET_FLAG = 1 (meaning it had an accident).
Accidents Graphical visualization
Let’s create a visual representation of correlations with a heat map as a guide for all records in which TARGET_FLAG = 1.
Please note that the row named TARGET_FLAG is not included since all records have TARGET_FLAG value of 1, making it redundant.
Same as before, it is interesting to note from the above graph the existing moderate negative correlation between IS_FEMALE and IS_CAR_RED; the value for this correlation is -0.6678. At this point we should not make any inference from this, since IS_FEMALE only indicates sex and IS_CAR_RED only indicates whether the car is red. Further analysis needs to be performed to reach any conclusion related to those two variables.
Also, we can notice how new moderate correlations appeared that were not present before the split; that is:
IS_FEMALE seems to be moderately correlated to CAR_TYPEz_SUV with a value of 0.5529.
BLUEBOOK seems to be moderately correlated to CAR_TYPEPanel.Truck with a value of 0.5484.
OLDCLAIM seems to be moderately correlated to IS_LIC_REVOKED with a value of 0.5427.
Now, if we think about the data and their correlations, some of these relationships seem to make sense. I will not extrapolate too much from this since our main goal is to create a model that predicts the probability that a person will crash their car and also how much money it will cost if the person does crash the car. Hence, I will continue, but will keep these correlations in mind.
Accidents Numerical visualization
From the above graph, we can identify some correlations between the response variable TARGET_AMT and the other variables.
Let’s read our correlations table to gain extra insights.
Accidents Correlations histogram
Something very interesting to note from the above table is that the correlations within this data set also seem to be very low. We can notice that behavior in the below density distribution for the respective correlations.
Comparing Means
The below table compares the means for the two groups indicated by the TARGET_FLAG variable, that is, 1 = Accident vs 0 = No Accident.
It is very important to note how some insights can be taken from the two data sets by comparing them side by side. In order to highlight those results, I have created two extra columns, labeled Obs, which denote how some variables could imply positive or negative outcomes related to accidents; the indicators depend on the percentage difference as follows:
Pos: Two options, "+" or "-", meaning the data show an increase or decrease between the two groups.
Level: Has four levels as follows:
– “.”: Percentage of difference is less than \(25\%\).
– “*”: Percentage of difference is between \([25, 75[\%\).
– “**”: Percentage of difference is between \([75, 100[\%\).
– “***”: Percentage of difference is greater than or equal to \(100\%\).
Also, it is important to note that some variables seem to have a beneficial role in avoiding accidents, and that can be seen in the above table as well.
Something worth mentioning is that our theoretical effect states: "urban legend says that women have fewer crashes than men". Looking at the above table, this legend appears to be false: the proportion of females among those involved in car accidents is about \(3\%\) to \(4\%\) higher than among those not involved, that is, an increase from about \(0.53\) to about \(0.55\). Also, we would expect a rate of about \(50\%\), since the split of insured drivers in America is considered roughly even.
BUILD MODELS
At this point, we are getting ready to start building models; however, I would like to point out that in this case it is a little difficult to determine what data transformations could be used in order to refine our models.
Binary Logistic Regression Models
I would like to point out that since this work requires binary logistic regression, we are going to use the logit function as our link function for the logistic regression, assuming that the response follows a binomial distribution as follows:
\[y_i | x_i \sim Bin(m_i,\theta(x_i))\]
so that,
\[P(Y_i=y_i | x_i)= \binom{m_i}{y_i} \theta(x_i)^{y_i}(1-\theta(x_i))^{m_i-y_i} \]
Now, in order to solve our problem, we need to build a linear predictor model in which the individual trials that compose the response \(Y_i\) are all subject to the same \(q\) predictors \((x_{i1}, …, x_{iq})\). Please note that such groups of predictors are commonly known as covariate classes. In this case, we need a model that describes the relationship of \(x_1, …, x_q\) to \(p\). In order to do so, we will construct a linear predictor as follows:
\[\mathfrak{N}_i = \beta_0 + \beta_1x_{i1}+...+\beta_qx_{iq} \]
Logit link function
In this case, since we need to set \(\mathfrak{N}_i = p_i\); with \(0 \le p_i \le 1\), I will use the link function \(g\) such that \(\mathfrak{N}_i = g(p_i)\) with \(0 \le g^{-1}(\mathfrak{N}) \le 1\) for any \(\mathfrak{N}\). In order to do so, I will pick the Logit link function \(\mathfrak{N} = log(p/(1 - p))\).
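Inverting this link gives the fitted probability explicitly:
\[p_i = g^{-1}(\mathfrak{N}_i) = \frac{e^{\mathfrak{N}_i}}{1 + e^{\mathfrak{N}_i}}\]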
An alternative would be to employ the \(\chi^2\) (chi-square) distribution; for the purposes of this project, I will use the binomial distribution or the \(\chi^2\), depending on which one is a better choice. I will also assume that all \(Y_i\) are independent of each other.
Binomial NULL Model
In this section I will build a Binary Logistic Regression null model, that is, an intercept-only model; please note that I won't do any transformations. This model will be considered valid and will be used for comparison as we advance. I will not include the TARGET_AMT variable since that will be employed in the next model build-up.
##
## Call:
## glm(formula = TARGET_FLAG ~ 1, family = binomial(link = "logit"),
## data = data.train.bin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7825 -0.7825 -0.7825 1.6327 1.6327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.02669 0.02512 -40.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 9415.3 on 8159 degrees of freedom
## AIC: 9417.3
##
## Number of Fisher Scoring iterations: 4
I will assume this to be a valid model.
Binomial FULL Model
In this section I will build a Binary Logistic Regression full model utilizing all the variables and data; please note that I won't do any transformations. This model will be considered valid and will be revisited as we advance.
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"),
## data = data.train.bin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5767 -0.7119 -0.4036 0.6227 3.1503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.994e+00 3.314e-01 -9.032 < 2e-16 ***
## KIDSDRIV 3.840e-01 6.107e-02 6.287 3.23e-10 ***
## AGE -7.555e-04 3.975e-03 -0.190 0.849257
## HOMEKIDS 4.966e-02 3.692e-02 1.345 0.178586
## YOJ -7.653e-03 7.788e-03 -0.983 0.325777
## INCOME -1.625e-06 6.267e-07 -2.593 0.009526 **
## HOME_VAL -6.858e-07 2.348e-07 -2.921 0.003489 **
## TRAVTIME 1.441e-02 1.881e-03 7.660 1.85e-14 ***
## BLUEBOOK -2.297e-05 5.212e-06 -4.407 1.05e-05 ***
## TIF -5.522e-02 7.330e-03 -7.534 4.92e-14 ***
## OLDCLAIM -1.422e-05 3.907e-06 -3.639 0.000274 ***
## CLM_FREQ 1.999e-01 2.849e-02 7.014 2.31e-12 ***
## MVR_PTS 1.153e-01 1.360e-02 8.477 < 2e-16 ***
## CAR_AGE -7.370e-04 6.266e-03 -0.118 0.906362
## IS_SINGLE_PARENT 3.825e-01 1.094e-01 3.497 0.000471 ***
## IS_MARRIED -5.582e-01 7.783e-02 -7.172 7.39e-13 ***
## IS_FEMALE -8.132e-02 1.118e-01 -0.727 0.466943
## EDUCATIONBachelors -4.376e-01 1.128e-01 -3.881 0.000104 ***
## EDUCATIONMasters -3.537e-01 1.726e-01 -2.049 0.040433 *
## EDUCATIONPhD -3.174e-01 2.051e-01 -1.548 0.121597
## EDUCATIONz_High.School -3.047e-03 9.486e-02 -0.032 0.974379
## JOBClerical 5.050e-01 1.942e-01 2.601 0.009297 **
## JOBDoctor -4.103e-01 2.654e-01 -1.546 0.122087
## JOBHome.Maker 4.230e-01 2.036e-01 2.078 0.037726 *
## JOBLawyer 1.266e-01 1.686e-01 0.751 0.452621
## JOBManager -5.246e-01 1.705e-01 -3.077 0.002089 **
## JOBProfessional 1.895e-01 1.775e-01 1.068 0.285511
## JOBStudent 4.288e-01 2.087e-01 2.055 0.039873 *
## JOBz_Blue.Collar 3.499e-01 1.845e-01 1.896 0.057940 .
## IS_CAR_PRIVATE_USE -7.639e-01 9.170e-02 -8.330 < 2e-16 ***
## CAR_TYPEPanel.Truck 5.505e-01 1.614e-01 3.411 0.000647 ***
## CAR_TYPEPickup 5.510e-01 1.007e-01 5.472 4.46e-08 ***
## CAR_TYPESports.Car 1.029e+00 1.296e-01 7.941 2.01e-15 ***
## CAR_TYPEVan 6.114e-01 1.261e-01 4.847 1.25e-06 ***
## CAR_TYPEz_SUV 7.703e-01 1.111e-01 6.933 4.12e-12 ***
## IS_CAR_RED -2.506e-03 8.618e-02 -0.029 0.976805
## IS_LIC_REVOKED 8.894e-01 9.117e-02 9.756 < 2e-16 ***
## IS_URBAN 2.387e+00 1.129e-01 21.147 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 7316.8 on 8122 degrees of freedom
## AIC: 7392.8
##
## Number of Fisher Scoring iterations: 5
In this particular case, we notice how some variables are not statistically significant; for study purposes, I will assume that this is a valid model.
Binomial STEP Model
In this case, I will create multiple models using the STEP function from R.
bin_Model_STEP <- step(bin_Model_NULL,
scope = list(upper=bin_Model_FULL),
direction="both",
test="Chisq",
data=data.train.bin)
For simplicity, I have decided not to include the automatic step-by-step output; instead I will present the final model results.
ANOVA results
Let’s check an ANOVA table based on the above testing results.
From the above results and calculations, it was concluded that the best model is as follows:
##
## Call:
## glm(formula = TARGET_FLAG ~ IS_URBAN + MVR_PTS + HOME_VAL + IS_CAR_PRIVATE_USE +
## BLUEBOOK + IS_SINGLE_PARENT + IS_LIC_REVOKED + JOBManager +
## TRAVTIME + TIF + KIDSDRIV + CLM_FREQ + CAR_TYPESports.Car +
## CAR_TYPEz_SUV + IS_MARRIED + INCOME + JOBClerical + OLDCLAIM +
## CAR_TYPEPickup + CAR_TYPEVan + CAR_TYPEPanel.Truck + JOBDoctor +
## EDUCATIONBachelors + EDUCATIONMasters + EDUCATIONPhD + HOMEKIDS +
## YOJ, family = binomial(link = "logit"), data = data.train.bin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5863 -0.7155 -0.4047 0.6283 3.1234
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.592e+00 1.839e-01 -14.095 < 2e-16 ***
## IS_URBAN 2.372e+00 1.124e-01 21.092 < 2e-16 ***
## MVR_PTS 1.141e-01 1.357e-02 8.408 < 2e-16 ***
## HOME_VAL -7.446e-07 2.309e-07 -3.225 0.001260 **
## IS_CAR_PRIVATE_USE -7.864e-01 7.499e-02 -10.487 < 2e-16 ***
## BLUEBOOK -2.527e-05 4.647e-06 -5.437 5.41e-08 ***
## IS_SINGLE_PARENT 3.760e-01 1.087e-01 3.458 0.000544 ***
## IS_LIC_REVOKED 8.868e-01 9.105e-02 9.740 < 2e-16 ***
## JOBManager -7.463e-01 1.072e-01 -6.961 3.39e-12 ***
## TRAVTIME 1.449e-02 1.879e-03 7.714 1.21e-14 ***
## TIF -5.541e-02 7.323e-03 -7.567 3.82e-14 ***
## KIDSDRIV 3.820e-01 5.997e-02 6.369 1.91e-10 ***
## CLM_FREQ 1.997e-01 2.845e-02 7.019 2.23e-12 ***
## CAR_TYPESports.Car 9.832e-01 1.065e-01 9.232 < 2e-16 ***
## CAR_TYPEz_SUV 7.235e-01 8.523e-02 8.489 < 2e-16 ***
## IS_MARRIED -5.487e-01 7.748e-02 -7.081 1.43e-12 ***
## INCOME -1.924e-06 6.116e-07 -3.146 0.001653 **
## JOBClerical 1.720e-01 8.898e-02 1.934 0.053174 .
## OLDCLAIM -1.413e-05 3.902e-06 -3.622 0.000293 ***
## CAR_TYPEPickup 5.293e-01 9.852e-02 5.373 7.76e-08 ***
## CAR_TYPEVan 6.065e-01 1.196e-01 5.069 3.99e-07 ***
## CAR_TYPEPanel.Truck 5.395e-01 1.431e-01 3.769 0.000164 ***
## JOBDoctor -5.278e-01 2.494e-01 -2.117 0.034268 *
## EDUCATIONBachelors -4.805e-01 7.559e-02 -6.356 2.07e-10 ***
## EDUCATIONMasters -5.420e-01 9.198e-02 -5.892 3.80e-09 ***
## EDUCATIONPhD -5.061e-01 1.458e-01 -3.472 0.000517 ***
## HOMEKIDS 5.740e-02 3.391e-02 1.693 0.090442 .
## YOJ -1.185e-02 7.098e-03 -1.669 0.095071 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 7323.5 on 8132 degrees of freedom
## AIC: 7379.5
##
## Number of Fisher Scoring iterations: 5
From the above results, we notice that most of the predictors are statistically significant; we also notice that HOME_VAL and INCOME are not as statistically significant as the other variables.
Plot of standardized residuals
The below plot shows our fitted models vs the deviance standardized residuals.
Confusion Matrix
Let’s start by building a confusion matrix in order to obtain valuable insights.
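A sketch of how such a matrix could be produced with the caret package; the 0.5 probability cut-off is an assumption:
library(caret)
# Classify the fitted probabilities at a 0.5 threshold
bin_pred <- ifelse(predict(bin_Model_STEP, type = "response") > 0.5, 1, 0)
confusionMatrix(factor(bin_pred), factor(data.train.bin$TARGET_FLAG), positive = "1")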
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5550 1257
## 1 458 895
##
## Accuracy : 0.7898
## 95% CI : (0.7808, 0.7986)
## No Information Rate : 0.7363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3856
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4159
## Specificity : 0.9238
## Pos Pred Value : 0.6615
## Neg Pred Value : 0.8153
## Prevalence : 0.2637
## Detection Rate : 0.1097
## Detection Prevalence : 0.1658
## Balanced Accuracy : 0.6698
##
## 'Positive' Class : 1
##
It is interesting to note that the reported accuracy is 0.7898284.
From the above results, we obtain the following:
ROC and AUC
As we know, the Receiver Operating Characteristic (ROC) curve is a great quantitative assessment tool for the model. In order to quantify our model, I will proceed as follows:
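A sketch using the pROC package; the package choice and object name are assumptions:
library(pROC)
# ROC curve, AUC, and its confidence interval for the fitted probabilities
roc_step <- roc(data.train.bin$TARGET_FLAG, predict(bin_Model_STEP, type = "response"))
plot(roc_step)
auc(roc_step)
ci.auc(roc_step)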
Let’s see our confidence intervals for the area under the curve.
Binary STEP MODIFIED Model
In this case, I will add 1 to the following variables and then take the log, in order to avoid errors since some entries are 0 and log(0) produces errors; I will also remove the variable HOMEKIDS. Let's take a look:
- Log(1 + INCOME)
- Log(1 + HOME_VAL)
- Log(1 + BLUEBOOK)
- Log(1 + OLDCLAIM)
- HOMEKIDS <– Remove
Let’s see the results:
##
## Call:
## glm(formula = TARGET_FLAG ~ IS_URBAN + MVR_PTS + log(1 + HOME_VAL) +
## IS_CAR_PRIVATE_USE + log(1 + BLUEBOOK) + IS_SINGLE_PARENT +
## IS_LIC_REVOKED + JOBManager + TRAVTIME + TIF + KIDSDRIV +
## CLM_FREQ + CAR_TYPESports.Car + CAR_TYPEz_SUV + IS_MARRIED +
## log(1 + INCOME) + JOBClerical + log(1 + OLDCLAIM) + CAR_TYPEPickup +
## CAR_TYPEVan + CAR_TYPEPanel.Truck + JOBDoctor + EDUCATIONBachelors +
## EDUCATIONMasters + EDUCATIONPhD + YOJ, family = binomial(link = "logit"),
## data = data.train.bin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6106 -0.7169 -0.4012 0.6251 3.1464
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.449796 0.525004 0.857 0.39158
## IS_URBAN 2.408764 0.113906 21.147 < 2e-16 ***
## MVR_PTS 0.103476 0.013992 7.395 1.41e-13 ***
## log(1 + HOME_VAL) -0.023995 0.006356 -3.775 0.00016 ***
## IS_CAR_PRIVATE_USE -0.807898 0.075476 -10.704 < 2e-16 ***
## log(1 + BLUEBOOK) -0.328587 0.054484 -6.031 1.63e-09 ***
## IS_SINGLE_PARENT 0.432665 0.094510 4.578 4.69e-06 ***
## IS_LIC_REVOKED 0.700961 0.081276 8.624 < 2e-16 ***
## JOBManager -0.722699 0.107335 -6.733 1.66e-11 ***
## TRAVTIME 0.014845 0.001883 7.885 3.14e-15 ***
## TIF -0.054347 0.007329 -7.415 1.21e-13 ***
## KIDSDRIV 0.423039 0.054971 7.696 1.41e-14 ***
## CLM_FREQ 0.091849 0.043361 2.118 0.03416 *
## CAR_TYPESports.Car 0.945197 0.107131 8.823 < 2e-16 ***
## CAR_TYPEz_SUV 0.732312 0.085011 8.614 < 2e-16 ***
## IS_MARRIED -0.491494 0.081374 -6.040 1.54e-09 ***
## log(1 + INCOME) -0.072339 0.013133 -5.508 3.63e-08 ***
## JOBClerical 0.275260 0.090260 3.050 0.00229 **
## log(1 + OLDCLAIM) 0.022103 0.012436 1.777 0.07552 .
## CAR_TYPEPickup 0.535625 0.098545 5.435 5.47e-08 ***
## CAR_TYPEVan 0.595850 0.119427 4.989 6.06e-07 ***
## CAR_TYPEPanel.Truck 0.418935 0.135944 3.082 0.00206 **
## JOBDoctor -0.484072 0.248086 -1.951 0.05103 .
## EDUCATIONBachelors -0.469375 0.075173 -6.244 4.27e-10 ***
## EDUCATIONMasters -0.548683 0.090087 -6.091 1.13e-09 ***
## EDUCATIONPhD -0.628394 0.139306 -4.511 6.46e-06 ***
## YOJ 0.017980 0.008905 2.019 0.04348 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 7301.2 on 8133 degrees of freedom
## AIC: 7355.2
##
## Number of Fisher Scoring iterations: 5
As we can see, this transformation produced estimates similar to our automatically selected model, but it seems to be slightly better since its AIC is lower than that of the model selected automatically by the STEP procedure.
Plot of standardized residuals
The below plot shows our fitted models vs the deviance standardized residuals.
Confusion Matrix
Let’s start by building a confusion matrix in order to obtain valuable insights.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5561 1268
## 1 447 884
##
## Accuracy : 0.7898
## 95% CI : (0.7808, 0.7986)
## No Information Rate : 0.7363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3833
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4108
## Specificity : 0.9256
## Pos Pred Value : 0.6642
## Neg Pred Value : 0.8143
## Prevalence : 0.2637
## Detection Rate : 0.1083
## Detection Prevalence : 0.1631
## Balanced Accuracy : 0.6682
##
## 'Positive' Class : 1
##
It is interesting to note that the reported accuracy is 0.7898284.
From the above results, we obtain the following:
ROC and AUC
As we know, the Receiver Operating Characteristic Curves (ROC) is a great quantitative assessment tool of the model. In order to quantify our model, I will employ as follows:
Let’s see our confidence intervals for the area under the curve.
IMPORTANT
If we check the theory, AIC is defined such that the smaller the AIC value, the better the model; in this case, we can observe how, by transforming certain variables, our AIC value decreases, making this a better model.
Binary MODEL SELECTION
In this case, I will select the model returned in the STEP procedure, that is:
bin_Model_FINAL <- bin_Model_STEP
The reasons are explained below:
This model returned the second lowest Akaike’s Information Criterion AIC.
This model returned the second nearest to zero median value.
This model displayed the smallest standard errors for the considered predictor variables.
This model presents the smallest rate of change for all predictor variables.
This model returned the second lowest residual deviance.
From the below table we can see that the probabilities of exceeding the \(\chi^2\) statistic are very low.
Test Binary model
From the above chosen model, I will create a reduced data frame containing only the variables needed in order to run our model. The selected variables are:
Final Model Comparisons
From here, I will define a NULL model with the chosen variables in order to compare results with the FINAL model.
##
## Call:
## glm(formula = TARGET_FLAG ~ 1, family = binomial(link = "logit"),
## data = data.train.final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7825 -0.7825 -0.7825 1.6327 1.6327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.02669 0.02512 -40.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 9415.3 on 8159 degrees of freedom
## AIC: 9417.3
##
## Number of Fisher Scoring iterations: 4
Analysis of Deviance Table
The below table will display a deviance analysis employing the \(\chi^2\) test.
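A sketch of the comparison; the name of the null fit shown above is hypothetical:
# Chi-square analysis of deviance: null model vs. the chosen final model
anova(bin_Model_NULL_final, bin_Model_FINAL, test = "Chisq")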
In the above results, we can easily compare the residual deviances: our model has better results than the null model, since the deviance decreases by 2091.7941687 units from the null model to our final model.
Multiple Linear Regression Model
Now that we have built our binary model, I will proceed to build a multiple linear regression model in order to predict TARGET_AMT for the records indicating that an accident happened.
Summaries
In order to have a better understanding of our current data, I will present a summary table for all the records that have accidents only.
Transformations
Notice how some variables have \(0\) as their minimum value, namely INCOME, HOME_VAL and OLDCLAIM. This is very important since I will perform some transformations as follows (a sketch of the code appears after the list):
TARGET_AMT: will be transformed to log(TARGET_AMT).
INCOME: will be transformed to log(1 + INCOME) <- To avoid \(log(0)\) problem.
HOME_VAL: will be transformed to log(1 + HOME_VAL) <- To avoid \(log(0)\) problem.
BLUEBOOK: will be transformed to log(BLUEBOOK).
OLDCLAIM: will be transformed to log(1 + OLDCLAIM) <- To avoid \(log(0)\) problem.
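A sketch of these transformations on the accident-only data frame; log1p(x) computes log(1 + x):
# Log transformations; log1p avoids the log(0) problem
TARGET_FLAG_1$log_TARGET_AMT <- log(TARGET_FLAG_1$TARGET_AMT)
TARGET_FLAG_1$log_INCOME     <- log1p(TARGET_FLAG_1$INCOME)
TARGET_FLAG_1$log_HOME_VAL   <- log1p(TARGET_FLAG_1$HOME_VAL)
TARGET_FLAG_1$log_BLUEBOOK   <- log(TARGET_FLAG_1$BLUEBOOK)
TARGET_FLAG_1$log_OLDCLAIM   <- log1p(TARGET_FLAG_1$OLDCLAIM)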
Visualizations
Let's see if we can find some linear relationships between TARGET_AMT and the other variables.
From the above graphs, we seem to be able to identify some degree of linearity in the given data set; we also notice how the log(1 + VARIABLE) transformation has some effect on the plots.
Leverage and Outliers
In this section, I will try to identify leverage points and outliers.
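The exact flagging rule is not shown in the report; a sketch of one possible approach, using the usual Q3 + 1.5 × IQR cut-off (an assumption), follows:
# Flag values above Q3 + 1.5 * IQR as outliers (this rule is an assumption)
flag_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  as.numeric(x > q[2] + 1.5 * (q[2] - q[1]))
}
TARGET_FLAG_1$INCOME_Outlier   <- flag_outlier(TARGET_FLAG_1$INCOME)
TARGET_FLAG_1$HOME_VAL_Outlier <- flag_outlier(TARGET_FLAG_1$HOME_VAL)
TARGET_FLAG_1$BLUEBOOK_Outlier <- flag_outlier(TARGET_FLAG_1$BLUEBOOK)
TARGET_FLAG_1$OLDCLAIM_Outlier <- flag_outlier(TARGET_FLAG_1$OLDCLAIM)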
Multiple Regression NULL Model
Let's start with a null model in order to build a better understanding. This model will be considered valid and will be revisited as we advance.
##
## Call:
## lm(formula = log_TARGET_AMT ~ 1, data = TARGET_FLAG_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8660 -0.4091 0.0443 0.3871 3.3096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.27644 0.01751 472.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8125 on 2151 degrees of freedom
Multiple Regression FULL Model
In this section, I will build a FULL model in order to keep building a better understanding. This model will be considered valid and will be revisited as we advance.
##
## Call:
## lm(formula = log_TARGET_AMT ~ ., data = TARGET_FLAG_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0505 -0.3791 0.0624 0.4248 2.9125
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6219045 0.3637622 18.204 < 2e-16 ***
## TARGET_FLAG NA NA NA NA
## KIDSDRIV -0.0301002 0.0329426 -0.914 0.360970
## AGE 0.0008240 0.0021749 0.379 0.704840
## HOMEKIDS 0.0136086 0.0215630 0.631 0.528037
## YOJ -0.0001170 0.0053961 -0.022 0.982707
## TRAVTIME -0.0003177 0.0011534 -0.275 0.783026
## TIF -0.0011951 0.0044208 -0.270 0.786927
## CLM_FREQ -0.0530680 0.0242192 -2.191 0.028549 *
## MVR_PTS 0.0139056 0.0072827 1.909 0.056345 .
## CAR_AGE -0.0007791 0.0037051 -0.210 0.833478
## IS_SINGLE_PARENT 0.0232964 0.0611578 0.381 0.703299
## IS_MARRIED -0.1201347 0.0526775 -2.281 0.022673 *
## IS_FEMALE -0.0859461 0.0658086 -1.306 0.191694
## EDUCATIONBachelors -0.0649046 0.0647233 -1.003 0.316072
## EDUCATIONMasters 0.1029520 0.1090695 0.944 0.345323
## EDUCATIONPhD 0.1537568 0.1269872 1.211 0.226106
## EDUCATIONz_High.School 0.0059409 0.0535503 0.111 0.911673
## JOBClerical 0.0743435 0.1231961 0.603 0.546270
## JOBDoctor -0.0131399 0.1839070 -0.071 0.943047
## JOBHome.Maker 0.0303227 0.1325648 0.229 0.819094
## JOBLawyer -0.0354852 0.1069424 -0.332 0.740061
## JOBManager 0.0208008 0.1108565 0.188 0.851179
## JOBProfessional 0.1130966 0.1170222 0.966 0.333928
## JOBStudent 0.1260001 0.1361558 0.925 0.354858
## JOBz_Blue.Collar 0.0564755 0.1186269 0.476 0.634069
## IS_CAR_PRIVATE_USE 0.0136397 0.0543605 0.251 0.801907
## CAR_TYPEPanel.Truck 0.0368181 0.0916320 0.402 0.687869
## CAR_TYPEPickup 0.0376195 0.0620571 0.606 0.544441
## CAR_TYPESports.Car 0.0718072 0.0765241 0.938 0.348167
## CAR_TYPEVan -0.0187183 0.0791974 -0.236 0.813184
## CAR_TYPEz_SUV 0.0954950 0.0669215 1.427 0.153736
## IS_CAR_RED 0.0359232 0.0516642 0.695 0.486931
## IS_LIC_REVOKED -0.0337374 0.0440626 -0.766 0.443958
## IS_URBAN 0.0473326 0.0785668 0.602 0.546939
## log_INCOME -0.0011698 0.0090670 -0.129 0.897352
## log_HOME_VAL 0.0056977 0.0039472 1.443 0.149033
## log_BLUEBOOK 0.1578179 0.0342402 4.609 4.28e-06 ***
## log_OLDCLAIM 0.0114251 0.0070581 1.619 0.105657
## INCOME_Outlier -1.2126046 0.4517518 -2.684 0.007327 **
## HOME_VAL_Outlier 0.7621985 0.6101766 1.249 0.211751
## BLUEBOOK_Outlier 0.7088887 0.1902301 3.726 0.000199 ***
## OLDCLAIM_Outlier 0.1474221 0.6087918 0.242 0.808683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7996 on 2110 degrees of freedom
## Multiple R-squared: 0.0499, Adjusted R-squared: 0.03144
## F-statistic: 2.703 on 41 and 2110 DF, p-value: 4.402e-08
It is interesting to see only a few statistically significant variables while the \(R^2\) is low. The overall p-value is very low and the median residual is near zero.
Multiple Regression STEP Model
In this section, I will build a model by employing the STEP function from R, in order to keep building a better understanding of the model. This model will be considered valid and will be revisited as we advance.
##
## Call:
## lm(formula = log_TARGET_AMT ~ BLUEBOOK_Outlier + log_BLUEBOOK +
## IS_MARRIED + INCOME_Outlier + HOME_VAL_Outlier + MVR_PTS +
## EDUCATIONBachelors + IS_FEMALE + CLM_FREQ, data = TARGET_FLAG_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1036 -0.3798 0.0669 0.4300 2.9004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.955403 0.250897 27.722 < 2e-16 ***
## BLUEBOOK_Outlier 0.712254 0.188213 3.784 0.000158 ***
## log_BLUEBOOK 0.145953 0.026226 5.565 2.95e-08 ***
## IS_MARRIED -0.081364 0.034463 -2.361 0.018319 *
## INCOME_Outlier -1.182167 0.422567 -2.798 0.005195 **
## HOME_VAL_Outlier 0.874650 0.403540 2.167 0.030311 *
## MVR_PTS 0.016472 0.006979 2.360 0.018352 *
## EDUCATIONBachelors -0.076328 0.040307 -1.894 0.058403 .
## IS_FEMALE -0.053820 0.034684 -1.552 0.120874
## CLM_FREQ -0.021031 0.014426 -1.458 0.145046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7961 on 2142 degrees of freedom
## Multiple R-squared: 0.04399, Adjusted R-squared: 0.03997
## F-statistic: 10.95 on 9 and 2142 DF, p-value: < 2.2e-16
Something interesting from the above results is that the model includes variables that are not statistically significant; that is, IS_FEMALE and CLM_FREQ are not statistically significant.
From the above graphs, we can quickly identify the lobster-shaped figure, which indicates that this model seems somewhat appropriate since it is roughly homoscedastic; also, the Normal Q-Q line seems to differ on the lower end but not by much on the upper end, which would be more problematic given the nature of paying out insurance money.
ANOVA results
Let’s check an ANOVA table based on the above testing results.
Multiple Regression STEP MODIFIED Model
In this section, I will create a manual model in order to try to overcome the previously identified problems. In this case, I will add an interaction AGE:IS_FEMALE and I will keep the statistically significant variables provided above.
##
## Call:
## lm(formula = log_TARGET_AMT ~ BLUEBOOK_Outlier + log_BLUEBOOK +
## IS_MARRIED + INCOME_Outlier + HOME_VAL_Outlier + MVR_PTS +
## EDUCATIONBachelors, data = TARGET_FLAG_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1177 -0.3756 0.0705 0.4232 2.9049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.873667 0.247386 27.785 < 2e-16 ***
## BLUEBOOK_Outlier 0.707848 0.188193 3.761 0.000174 ***
## log_BLUEBOOK 0.149567 0.026142 5.721 1.21e-08 ***
## IS_MARRIED -0.080899 0.034480 -2.346 0.019052 *
## INCOME_Outlier -1.206908 0.422566 -2.856 0.004330 **
## HOME_VAL_Outlier 0.903923 0.403548 2.240 0.025197 *
## MVR_PTS 0.013377 0.006664 2.008 0.044817 *
## EDUCATIONBachelors -0.076464 0.040322 -1.896 0.058047 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7965 on 2144 degrees of freedom
## Multiple R-squared: 0.04195, Adjusted R-squared: 0.03882
## F-statistic: 13.41 on 7 and 2144 DF, p-value: < 2.2e-16
Something interesting to note is that in this case all predictors are statistically significant; the Normal Q-Q plot seems to follow a line for almost all values, with only a slight over-prediction at the top and an under-prediction at the bottom left. The Residuals vs Fitted lobster-shaped figure seems to be homoscedastic. My only concern is the Multiple \(R^2\), which is very low, but I will consider this to be OK since the previous correlation analysis showed that the correlations are very low as well.
Multiple Linear Regression MODEL SELECTION
Below, I will describe why I have chosen the Multiple Regression STEP MODIFIED Model to be my selected model for the multiple linear regression model.
lm_Model_FINAL <- lm_Model_STEP_Modified
The generated ANOVA table shows this combination of variables to have the lowest AIC.
The coefficients make sense.
The Median is near Zero.
The coefficients are considered low, alongside the standard errors as well.
The Residuals vs the Fitted values seems to be homoscedastic.
The residuals and the normal Q-Q plot also make sense and follow “good” standards for data analysis.
My only concern is that the \(R^2\) is too low, but as noted previously, the correlations among the variables are low as well.
Predictions
In this section, I will proceed to predict values from the evaluation data set.
Evaluation data transformations
In this section, I will apply to the evaluation data the same transformations that were applied to the original data.
Predict TARGET_FLAG
In this section, I will predict the probability of having an accident or no accident and categorize it in the TARGET_FLAG variable.
Accident Predictions Table
In this section, I will predict the values on the evaluation data set employing the training data set.
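A sketch of the classification step; the 0.5 cut-off and the intermediate name eval_prob are assumptions:
# Probability of an accident from the final binary model, classified at 0.5
eval_prob <- predict(bin_Model_FINAL, newdata = data.eval, type = "response")
data.eval$TARGET_FLAG <- ifelse(eval_prob > 0.5, 1, 0)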
Let’s see a table for the first 20 records.
Predict TARGET_AMT
In this section, I will predict the amount based on the final linear model selected. In order to accomplish this goal, I need to do the following:
Since there is no way to indicate whether the new values should be considered outliers or not, I will assign a zero instead; then I will re-evaluate with the resulting values (see the sketch after the list below).
- data.eval$BLUEBOOK_Outlier <- 0
- data.eval$INCOME_Outlier <- 0
- data.eval$HOME_VAL_Outlier <- 0
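A sketch of the amount prediction; since the final linear model was fit on log(TARGET_AMT), the prediction is back-transformed with exp(), and the intermediate name eval_log_amt is an assumption:
# Predicted cost on the log scale, back-transformed; zero when no accident is predicted
eval_log_amt <- predict(lm_Model_FINAL, newdata = data.eval)
data.eval$TARGET_AMT <- ifelse(data.eval$TARGET_FLAG == 1, exp(eval_log_amt), 0)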
Let’s see the first 20 records of the first run for the predicted data set.
Now, the above values were calculated assuming that all outlier flags were zero. Let's recalculate and see if new outliers can be found in order to refine our values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 646.2 0.0 4932.3
Let’s see the first 40 records of the final predicted data set.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 719.6 0.0 23918.2
Export file
The following provides a CSV output for the predictions table.
write.csv(data.eval_ORIGINAL, file = "insurance-my-evaluated-data.csv", row.names=FALSE)
Conclusion
In the above example, we can appreciate the importance of understanding the data in order to provide meaningful results. Not all data sets are alike, and different approaches need to be taken in order to extract valuable information from them.
References
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.