The goal is to target existing customers for a re-purchase campaign, by sending a communication to customers who are highly likely to purchase a new vehicle (all customers have purchased at least one vehicle).
#import data
repurchase_training <- read_csv("raw_data/repurchase_training.csv")
## Parsed with column specification:
## cols(
## ID = col_integer(),
## Target = col_integer(),
## age_band = col_character(),
## gender = col_character(),
## car_model = col_character(),
## car_segment = col_character(),
## age_of_vehicle_years = col_integer(),
## sched_serv_warr = col_integer(),
## non_sched_serv_warr = col_integer(),
## sched_serv_paid = col_integer(),
## non_sched_serv_paid = col_integer(),
## total_paid_services = col_integer(),
## total_services = col_integer(),
## mth_since_last_serv = col_integer(),
## annualised_mileage = col_integer(),
## num_dealers_visited = col_integer(),
## num_serv_dealer_purchased = col_integer()
## )
#NOTE:this validation set needs an ID added
repurchase_validation <- read_csv("raw_data/repurchase_validationCORRECTED.csv")
## Parsed with column specification:
## cols(
## ID = col_integer(),
## age_band = col_character(),
## gender = col_character(),
## car_model = col_character(),
## car_segment = col_character(),
## age_of_vehicle_years = col_integer(),
## sched_serv_warr = col_integer(),
## non_sched_serv_warr = col_integer(),
## sched_serv_paid = col_integer(),
## non_sched_serv_paid = col_integer(),
## total_paid_services = col_integer(),
## total_services = col_integer(),
## mth_since_last_serv = col_integer(),
## annualised_mileage = col_integer(),
## num_dealers_visited = col_integer(),
## num_serv_dealer_purchased = col_integer()
## )
View(repurchase_training)
View(repurchase_validation)
There is a specified data dictionary – my *** are instinctive as to what might predict buying again BUYER CHARACTERISTICS . ID unique customer number . TARGET is 0 for people who bought 1 car, 1 if they bought more than 1 . AGE_BAND mostly missing data, the age of the customer . GENDER Male, female or missing - but mostly missing CAR TYPES . CAR_MODEL categorical, 18 models in total . CAR_SEGMENT the type of vehicle .. AGE_OF_VEHICLE_YEARS age of last vehicle in deciles SERVICING OF CAR . SCHED_SERV_WARR number of scheduled services used under warranty . NON_SCHED_SERV_WARR number of non-scheduled services used under warranty - something broke . SCHED_SERV_PAID Amount paid for scheduled services in deciles . NON_SCHED_SERV_PAID amount paid for non scheduled services . TOTAL_PAID_SERVICES amount paid in total for services, in deciles . TOTAL_SERVICES Total number of services in deciles . MTH_SINCE_LAST_SERV the number of months since the last service (the opposite of high service people) CAR & DEALER BEHAVIOUR .. ANNUALISED_MILEAGE annual vehicle mileage in deciles . NUM_DEALERS_VISITED Different dealers visited for servicing .. NUM_SERV_DEALER_PURCHASED nu,ber of services at the same dealer where the vehicle was purchased ***
Do we need to encode the categorical variables as factors?
Try glm and then anova to analyse the table of deviance
Try a model “binomial logistic regression”, since the variable to predict is binary. Logistic regression can also be used to predict a dependent variable which can assume more than 2 values. In this second case, we call the model “multinomial logistic regression”. A typical example, for instance, would be classifying films between “Entertaining”, “borderline” or “boring”
So given that we want to send people communications, we want a binary classification of INTERESTED and NOT INTERESTED???
Logistic regression models the probability that a response variable \(Y\) belongs to a particular category given a set of predictor variables. R uses the glm() function to do logistic regression.
str(repurchase_training)
## Classes 'tbl_df', 'tbl' and 'data.frame': 131337 obs. of 17 variables:
## $ ID : int 1 2 3 5 6 7 8 9 10 11 ...
## $ Target : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age_band : chr "3. 35 to 44" "NULL" "NULL" "NULL" ...
## $ gender : chr "Male" "NULL" "Male" "NULL" ...
## $ car_model : chr "model_1" "model_2" "model_3" "model_3" ...
## $ car_segment : chr "LCV" "Small/Medium" "Large/SUV" "Large/SUV" ...
## $ age_of_vehicle_years : int 9 6 9 5 8 7 8 7 1 3 ...
## $ sched_serv_warr : int 2 10 10 8 9 4 2 4 2 1 ...
## $ non_sched_serv_warr : int 10 3 9 5 4 10 8 9 1 1 ...
## $ sched_serv_paid : int 3 10 10 8 10 5 2 6 1 2 ...
## $ non_sched_serv_paid : int 7 4 9 4 7 7 9 9 3 1 ...
## $ total_paid_services : int 5 9 10 5 9 6 9 8 1 2 ...
## $ total_services : int 6 10 10 6 8 8 4 6 2 1 ...
## $ mth_since_last_serv : int 9 6 7 4 5 8 7 9 1 1 ...
## $ annualised_mileage : int 8 10 10 10 4 5 6 5 1 1 ...
## $ num_dealers_visited : int 10 7 6 9 4 10 10 5 2 1 ...
## $ num_serv_dealer_purchased: int 4 10 10 7 9 4 4 8 3 1 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 17
## .. ..$ ID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Target : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ age_band : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ car_model : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ car_segment : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ age_of_vehicle_years : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ sched_serv_warr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ non_sched_serv_warr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ sched_serv_paid : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ non_sched_serv_paid : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ total_paid_services : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ total_services : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ mth_since_last_serv : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ annualised_mileage : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ num_dealers_visited : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ num_serv_dealer_purchased: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
str(repurchase_validation)
## Classes 'tbl_df', 'tbl' and 'data.frame': 50000 obs. of 16 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age_band : chr "NULL" "4. 45 to 54" "NULL" "NULL" ...
## $ gender : chr "NULL" "Male" "Female" "Male" ...
## $ car_model : chr "model_2" "model_7" "model_2" "model_6" ...
## $ car_segment : chr "Small/Medium" "LCV" "Small/Medium" "Small/Medium" ...
## $ age_of_vehicle_years : int 3 6 10 8 3 4 1 9 8 8 ...
## $ sched_serv_warr : int 3 6 8 4 4 4 2 9 8 8 ...
## $ non_sched_serv_warr : int 5 10 10 8 3 4 1 4 9 9 ...
## $ sched_serv_paid : int 2 8 8 4 4 4 1 10 5 9 ...
## $ non_sched_serv_paid : int 5 10 7 5 4 10 3 4 5 9 ...
## $ total_paid_services : int 3 10 7 4 3 10 2 8 4 9 ...
## $ total_services : int 3 10 10 5 3 3 2 7 9 9 ...
## $ mth_since_last_serv : int 4 3 8 8 5 8 2 8 6 5 ...
## $ annualised_mileage : int 2 10 9 6 4 3 2 8 4 8 ...
## $ num_dealers_visited : int 8 9 10 9 9 3 3 8 8 10 ...
## $ num_serv_dealer_purchased: int 4 10 5 6 4 5 3 8 9 8 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 16
## .. ..$ ID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ age_band : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ car_model : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ car_segment : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ age_of_vehicle_years : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ sched_serv_warr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ non_sched_serv_warr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ sched_serv_paid : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ non_sched_serv_paid : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ total_paid_services : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ total_services : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ mth_since_last_serv : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ annualised_mileage : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ num_dealers_visited : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ num_serv_dealer_purchased: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
#need to change ID to be "character" in both datasets, so the IDs are not counted as numbers
#Find nulls in the data to work out which columns have a lot of empty records
total_nulls <- data.frame(type = names(repurchase_training[, -1]),
nulls = colSums(repurchase_training[, -1] == "NULL"),
values = colSums(repurchase_training[, -1] != "NULL")
)
total_nulls <- melt(total_nulls, id.var="type")
# We can see from this that 'age_band' has very few values across the training set
# We can also see that gender only has value in about 50% of rows
# Neither is likely very usable for our model
ggplot(total_nulls, aes(x = type, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
#match to the data dictionary on the assignment sheet (note something about transforming variables)
#already done in data
#Let's look at some characteristics of the cars to see if we can detect any patterns
#First write a function that will take a characteristic and return a data frame that gives us the count of entries with target = 1 and target = 0
get_counts <- function (attribute_df, count_df) {
output = data.frame(matrix(ncol = 3, nrow = 0), stringsAsFactors = FALSE)
attribute_name = colnames(attribute_df)[1]
names(output) = c(attribute_name, "target", "non_target")
for(i in 1:nrow(attribute_df)) {
my_attribute = as.character(attribute_df[i, attribute_name])
my_target_count = count_df %>%
filter(Target == 1) %>%
filter(.[,2] == my_attribute)
my_no_target_count = count_df %>%
filter(Target == 0) %>%
filter(.[,2] == my_attribute)
if(nrow(my_target_count) > 0) {
target_count = as.character(my_target_count$count)
} else {
target_count = "0"
}
if(nrow(my_no_target_count) > 0) {
no_target_count = as.character(my_no_target_count$count)
} else {
no_target_count = "0"
}
#Can't divide by zero!
# ratio <- ifelse(no_target_count > 0, as.integer(target_count)/as.integer(no_target_count), 0)
temp <- data.frame(my_attribute, target_count, no_target_count)
names(temp) <- c(attribute_name, "target", "non_target")
output = rbind(output, temp)
output[,1] <- as.character(output[,1])
output[,2] <- as.integer(output[,2])
output[,3] <- as.integer(output[,3])
}
return(output)
}
#get data for car models
car_models <- repurchase_training %>%
group_by(car_model) %>%
summarize(count = n())
car_models_counts <- repurchase_training %>%
group_by(Target, car_model) %>%
summarize(count = n())
car_model_data <- get_counts(car_models, car_models_counts)
car_model_data <- melt(car_model_data, id.var="car_model")
ggplot(car_model_data, aes(x = car_model, y = value, fill = variable)) +
geom_bar(stat = "identity")
#car model is likely to be a key predictor variable
#ggplot(car_models, aes(x=model, y=value, color=variable), data = car_models) +
#geom_point() + geom_smooth(se = FALSE, color = "red") + labs(title="Car models", x="Model", y="Value") + scale_y_continuous(labels = scales::comma)
car_segments <- repurchase_training %>%
group_by(car_segment) %>%
summarize(count = n())
car_segments_counts <- repurchase_training %>%
group_by(Target, car_segment) %>%
summarize(count = n())
car_segment_data <- get_counts(car_segments, car_segments_counts)
car_segment_data <- melt(car_segment_data, id.var="car_segment")
ggplot(car_segment_data, aes(x = car_segment, y = value, fill = variable)) +
geom_bar(stat = "identity")
#age_of_vehicle_years <- repurchase_training %>%
# group_by(age_of_vehicle_years) %>%
# summarize(count = n())
#age_of_vehicle_years_counts <- repurchase_training %>%
# group_by(Target, age_of_vehicle_years) %>%
#summarize(count = n())
#age_of_vehicle_years_data <- get_counts(age_of_vehicle_years, age_of_vehicle_years_counts)
#age_of_vehicle_years_data <- melt(age_of_vehicle_years_data, id.var="age_of_vehicle_years")
#ggplot(age_of_vehicle_years_data, aes(x = age_of_vehicle_years, y = value, fill = variable)) +
#geom_bar(stat = "identity")
comments and chrts
#look at some service history characteristics of the cars to see if we can detect any patterns
#service_characteristics <- repurchase_training %>%
# select(Target, sched_serv_warr, non_sched_serv_warr, sched_serv_paid, non_sched_serv_paid, total_paid_services, total_services, #mth_since_last_serv, annualised_mileage, num_dealers_visited, num_serv_dealer_purchased) %>%
#group_by(Target, car_models, car_segment, age_of_vehicle_years) %>%
#summarize(count = n())
#View(car_characteristics)
#targeted <- filter(car_characteristics, Target == 1)
#not_targeted <- filter(car_characteristics, Target == 0)
We need to assess the distribution for the outcome variable. This table lists off some of the common link functions and the variances in the outcome variable that they describe:
| Family | Variance | Link |
|---|---|---|
| gaussian | gaussian | identity |
| binomial | binomial | logit, probit or cloglogc |
| poisson | poisson | log, identity or sqrt |
| Gamma | Gamma | inverse, identity or log |
| inverse.gaussian | inverse.gaussian | 1/mu^2 |
| quasi | user-defined | user-defined |
Run a linear model with the training data with an experimental classification target, trying to find the right target/predictor variable.
Hypothesis
#Linear Model 1 will try car_model as the predictor
#car_model.lm <- lm(car_model~annualised_mileage, data=repurchase_training)
#summary(car_models.lm)
#ggplot(data = repurchase_training, mapping = aes(x = car_model, y = annualised_mileage)) +
#geom_point() +
#geom_smooth(method = "lm", se = FALSE, color = "red") + labs(title="The linear relationship between car model and annualised mileage", x="car model", y="annualised mileage")
What is the variable we are trying to predict? We are going to try and predict that (i.e. classify ) how likely a customer is to purchase a new vehicle. If likelihood of purchase is the dependent variable, it has three categories - unlikely, somewhat likely and very likely. THIS IS ORDINAL REGRESSION
Multinomial Logistic Regression (MLR) is a form of linear regression analysis conducted when the dependent variable is nominal with more than two levels. It is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level (interval or ratio scale) independent variables. You can understand nominal variable as, a variable which has no intrinsic ordering.
I AM TRYING ORDINAL LOGISTIC REGRESSION - If likelihood of purchase is the dependent variable, it has three categories - unlikely = 0, somewhat likely = 1 and very likely = 2. These scores could be <0.3, =>0.3-,<0.6, =>0.6. Hmm. Maybe try this with just 2 likelihoods, 0 and 1.
# Load library
library(haven)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(klaR)
# Read in data
ordinal <- repurchase_training
summary(ordinal)
## ID Target age_band gender
## Min. : 1 Min. :0.00000 Length:131337 Length:131337
## 1st Qu.: 38563 1st Qu.:0.00000 Class :character Class :character
## Median : 77132 Median :0.00000 Mode :character Mode :character
## Mean : 77097 Mean :0.02681
## 3rd Qu.:115668 3rd Qu.:0.00000
## Max. :154139 Max. :1.00000
## car_model car_segment age_of_vehicle_years
## Length:131337 Length:131337 Min. : 1.000
## Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 5.493
## 3rd Qu.: 8.000
## Max. :10.000
## sched_serv_warr non_sched_serv_warr sched_serv_paid non_sched_serv_paid
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 5.000 Median : 5.000 Median : 5.000 Median : 5.000
## Mean : 5.452 Mean : 5.473 Mean : 5.452 Mean : 5.497
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## total_paid_services total_services mth_since_last_serv
## Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.00
## Median : 5.000 Median : 5.000 Median : 5.00
## Mean : 5.482 Mean : 5.455 Mean : 5.47
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.00
## Max. :10.000 Max. :10.000 Max. :10.00
## annualised_mileage num_dealers_visited num_serv_dealer_purchased
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 5.000 Median : 5.000 Median : 5.000
## Mean : 5.503 Mean : 5.485 Mean : 5.481
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000
#turn all decile variables into a factor for modelling purposes
ordinal$car_segment <-factor(ordinal$car_segment)
ordinal$annualised_mileage <-factor(ordinal$annualised_mileage)
ordinal$age_of_vehicle_years <-factor(ordinal$age_of_vehicle_years)
ordinal$age_band <- factor(ordinal$age_band)
ordinal$Target <- factor(ordinal$Target)
ordinal$car_model<- factor(ordinal$car_model)
ordinal$sched_serv_warr <- factor(ordinal$sched_serv_warr)
ordinal$non_sched_serv_paid <- factor(ordinal$non_sched_serv_paid)
ordinal$sched_serv_paid <- factor(ordinal$sched_serv_paid)
ordinal$non_sched_serv_paid <- factor(ordinal$non_sched_serv_paid)
ordinal$total_paid_services <- factor(ordinal$total_paid_services)
ordinal$total_services <- factor(ordinal$total_services)
ordinal$mth_since_last_serv <- factor(ordinal$mth_since_last_serv)
ordinal$num_dealers_visited <- factor(ordinal$num_dealers_visited)
ordinal$num_serv_dealer_purchased <- factor(ordinal$num_serv_dealer_purchased)
ordinal_model <- polr(ordinal$car_segment ~ ordinal$annualised_mileage + + ordinal$age_of_vehicle_years)
summary(ordinal_model)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = ordinal$car_segment ~ ordinal$annualised_mileage +
## +ordinal$age_of_vehicle_years)
##
## Coefficients:
## Value Std. Error t value
## ordinal$annualised_mileage2 0.16308 0.02335 6.985
## ordinal$annualised_mileage3 0.32679 0.02543 12.851
## ordinal$annualised_mileage4 0.15264 0.02508 6.086
## ordinal$annualised_mileage5 0.12644 0.02545 4.968
## ordinal$annualised_mileage6 -0.02227 0.02571 -0.866
## ordinal$annualised_mileage7 -0.13211 0.02576 -5.129
## ordinal$annualised_mileage8 -0.25036 0.02566 -9.756
## ordinal$annualised_mileage9 -0.31730 0.02579 -12.305
## ordinal$annualised_mileage10 -0.50641 0.02576 -19.662
## ordinal$age_of_vehicle_years2 0.09176 0.02566 3.576
## ordinal$age_of_vehicle_years3 0.20059 0.02650 7.570
## ordinal$age_of_vehicle_years4 0.49902 0.02813 17.740
## ordinal$age_of_vehicle_years5 0.65940 0.02833 23.279
## ordinal$age_of_vehicle_years6 0.76425 0.02865 26.679
## ordinal$age_of_vehicle_years7 0.74422 0.02868 25.953
## ordinal$age_of_vehicle_years8 0.75258 0.02884 26.096
## ordinal$age_of_vehicle_years9 0.59809 0.02882 20.753
## ordinal$age_of_vehicle_years10 0.60453 0.02879 20.996
##
## Intercepts:
## Value Std. Error t value
## Large/SUV|LCV 0.0117 0.0210 0.5550
## LCV|Other 0.7872 0.0212 37.2050
## Other|Small/Medium 0.7891 0.0212 37.2923
##
## Residual Deviance: 272155.94
## AIC: 272197.94
#let's try using some different variables using ordinal models to compare AIC
ordinal_model2 <- polr(ordinal$car_model ~ ordinal$age_band + ordinal$Target + ordinal$car_segment)
summary(ordinal_model2)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = ordinal$car_model ~ ordinal$age_band + ordinal$Target +
## ordinal$car_segment)
##
## Coefficients:
## Value Std. Error t value
## ordinal$age_band2. 25 to 34 -0.34802 0.06643 -5.2386
## ordinal$age_band3. 35 to 44 -0.50410 0.06543 -7.7047
## ordinal$age_band4. 45 to 54 -0.64358 0.06514 -9.8803
## ordinal$age_band5. 55 to 64 -0.70099 0.06626 -10.5790
## ordinal$age_band6. 65 to 74 -0.88343 0.06983 -12.6517
## ordinal$age_band7. 75+ -0.77017 0.07896 -9.7535
## ordinal$age_bandNULL -0.72527 0.05946 -12.1969
## ordinal$Target1 0.01071 0.03040 0.3523
## ordinal$car_segmentLCV -3.16036 0.02070 -152.6670
## ordinal$car_segmentOther -3.00129 0.18610 -16.1270
## ordinal$car_segmentSmall/Medium -1.50996 0.01101 -137.2032
##
## Intercepts:
## Value Std. Error t value
## model_1|model_10 -4.2278 0.0611 -69.2392
## model_10|model_11 -3.9203 0.0608 -64.4478
## model_11|model_12 -3.8703 0.0608 -63.6572
## model_12|model_13 -3.8217 0.0608 -62.8880
## model_13|model_14 -3.7675 0.0607 -62.0268
## model_14|model_15 -3.7617 0.0607 -61.9351
## model_15|model_16 -3.7373 0.0607 -61.5456
## model_16|model_17 -3.7290 0.0607 -61.4136
## model_17|model_18 -3.7179 0.0607 -61.2368
## model_18|model_19 -3.7106 0.0607 -61.1207
## model_19|model_2 -3.7104 0.0607 -61.1183
## model_2|model_3 -1.9951 0.0600 -33.2454
## model_3|model_4 -1.3603 0.0598 -22.7313
## model_4|model_5 -0.7937 0.0598 -13.2779
## model_5|model_6 0.3786 0.0599 6.3208
## model_6|model_7 0.5711 0.0600 9.5228
## model_7|model_8 1.3782 0.0605 22.7652
## model_8|model_9 3.4808 0.0673 51.6941
##
## Residual Deviance: 514229.65
## AIC: 514287.65
#one more ordinal trial
ordinal_model3 <- polr(ordinal$annualised_mileage ~ ordinal$age_band + ordinal$Target + ordinal$car_segment + age_of_vehicle_years + ordinal$total_paid_services + ordinal$total_services + ordinal$num_serv_dealer_purchased + num_dealers_visited, data = repurchase_training, Hess = TRUE)
summary(ordinal_model3)
## Call:
## polr(formula = ordinal$annualised_mileage ~ ordinal$age_band +
## ordinal$Target + ordinal$car_segment + age_of_vehicle_years +
## ordinal$total_paid_services + ordinal$total_services + ordinal$num_serv_dealer_purchased +
## num_dealers_visited, data = repurchase_training, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## ordinal$age_band2. 25 to 34 -0.154243 0.064925 -2.3757
## ordinal$age_band3. 35 to 44 -0.190132 0.064559 -2.9451
## ordinal$age_band4. 45 to 54 -0.153871 0.064229 -2.3957
## ordinal$age_band5. 55 to 64 -0.337437 0.065366 -5.1623
## ordinal$age_band6. 65 to 74 -0.702870 0.069227 -10.1531
## ordinal$age_band7. 75+ -1.216590 0.080187 -15.1719
## ordinal$age_bandNULL -0.289388 0.058048 -4.9853
## ordinal$Target1 1.558563 0.038958 40.0062
## ordinal$car_segmentLCV -0.010395 0.014404 -0.7217
## ordinal$car_segmentOther -0.031285 0.251036 -0.1246
## ordinal$car_segmentSmall/Medium -0.173825 0.011285 -15.4026
## age_of_vehicle_years -0.000855 0.002212 -0.3865
## ordinal$total_paid_services2 0.629825 0.025937 24.2831
## ordinal$total_paid_services3 1.384594 0.035074 39.4768
## ordinal$total_paid_services4 1.338981 0.036720 36.4647
## ordinal$total_paid_services5 1.309007 0.037302 35.0921
## ordinal$total_paid_services6 1.328508 0.037586 35.3458
## ordinal$total_paid_services7 1.379606 0.038054 36.2539
## ordinal$total_paid_services8 1.459998 0.038540 37.8831
## ordinal$total_paid_services9 1.531594 0.039109 39.1620
## ordinal$total_paid_services10 1.758225 0.039988 43.9690
## ordinal$total_services2 1.230395 0.026911 45.7208
## ordinal$total_services3 3.129024 0.037525 83.3843
## ordinal$total_services4 4.245219 0.040709 104.2813
## ordinal$total_services5 5.124690 0.042526 120.5068
## ordinal$total_services6 5.659473 0.043262 130.8181
## ordinal$total_services7 6.072768 0.044065 137.8147
## ordinal$total_services8 6.446074 0.044884 143.6171
## ordinal$total_services9 6.921090 0.045790 151.1483
## ordinal$total_services10 7.614743 0.048179 158.0526
## ordinal$num_serv_dealer_purchased2 0.635373 0.024744 25.6779
## ordinal$num_serv_dealer_purchased3 0.631433 0.024808 25.4525
## ordinal$num_serv_dealer_purchased4 0.505511 0.029158 17.3371
## ordinal$num_serv_dealer_purchased5 0.266534 0.028912 9.2187
## ordinal$num_serv_dealer_purchased6 0.260159 0.028662 9.0767
## ordinal$num_serv_dealer_purchased7 0.202331 0.028470 7.1069
## ordinal$num_serv_dealer_purchased8 0.206257 0.028219 7.3092
## ordinal$num_serv_dealer_purchased9 0.213876 0.028503 7.5037
## ordinal$num_serv_dealer_purchased10 0.215531 0.030537 7.0580
## num_dealers_visited 0.087024 0.002509 34.6810
##
## Intercepts:
## Value Std. Error t value
## 1|2 1.0567 0.0640 16.5035
## 2|3 3.3422 0.0671 49.7981
## 3|4 4.9905 0.0684 72.9099
## 4|5 6.0501 0.0689 87.8539
## 5|6 6.8967 0.0691 99.7691
## 6|7 7.6280 0.0693 110.0645
## 7|8 8.3393 0.0695 120.0510
## 8|9 9.1175 0.0697 130.8635
## 9|10 10.1609 0.0701 144.9651
##
## Residual Deviance: 454976.78
## AIC: 455074.78
Ordinal model 1 is AIC: 272197.94 and Ordinal model 2 is AIC: 514287.65 and ordinal model 3 is AIC: 455074.78. The first model is the best by AIC standards, 3rd is better and 2nd worst .
#trying generalized linear model version 1
glm1 = glm(ordinal$age_of_vehicle_years ~ ordinal$car_model, family=binomial(logit), data = repurchase_training)
summary(glm1)
##
## Call:
## glm(formula = ordinal$age_of_vehicle_years ~ ordinal$car_model,
## family = binomial(logit), data = repurchase_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0381 0.3623 0.3693 0.4602 0.8664
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.72932 0.02258 76.598 < 2e-16 ***
## ordinal$car_modelmodel_10 -0.94276 0.04423 -21.314 < 2e-16 ***
## ordinal$car_modelmodel_11 0.90450 0.16325 5.541 3.02e-08 ***
## ordinal$car_modelmodel_12 -0.05598 0.11293 -0.496 0.62008
## ordinal$car_modelmodel_13 1.26642 0.17718 7.148 8.82e-13 ***
## ordinal$car_modelmodel_14 13.83675 164.79132 0.084 0.93308
## ordinal$car_modelmodel_15 13.83675 79.63581 0.174 0.86206
## ordinal$car_modelmodel_16 13.83675 136.31047 0.102 0.91915
## ordinal$car_modelmodel_17 -0.75746 0.18256 -4.149 3.34e-05 ***
## ordinal$car_modelmodel_18 2.87585 1.00524 2.861 0.00422 **
## ordinal$car_modelmodel_19 13.83675 1029.12147 0.013 0.98927
## ordinal$car_modelmodel_2 0.96139 0.03158 30.447 < 2e-16 ***
## ordinal$car_modelmodel_3 0.58051 0.03497 16.602 < 2e-16 ***
## ordinal$car_modelmodel_4 0.92182 0.03977 23.181 < 2e-16 ***
## ordinal$car_modelmodel_5 0.46270 0.03095 14.948 < 2e-16 ***
## ordinal$car_modelmodel_6 1.03534 0.07969 12.992 < 2e-16 ***
## ordinal$car_modelmodel_7 0.88727 0.04939 17.963 < 2e-16 ***
## ordinal$car_modelmodel_8 1.17076 0.06041 19.380 < 2e-16 ***
## ordinal$car_modelmodel_9 0.74799 0.12078 6.193 5.90e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79805 on 131336 degrees of freedom
## Residual deviance: 77048 on 131318 degrees of freedom
## AIC: 77086
##
## Number of Fisher Scoring iterations: 14
#Second generalised linear model, with target
glm2 = glm(Target ~ ordinal$car_model, family=binomial(logit), data = repurchase_training)
summary(glm2)
##
## Call:
## glm(formula = Target ~ ordinal$car_model, family = binomial(logit),
## data = repurchase_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3352 -0.2351 -0.2190 -0.2185 3.0424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.574507 0.049592 -72.078 < 2e-16 ***
## ordinal$car_modelmodel_10 -0.680317 0.158105 -4.303 1.69e-05 ***
## ordinal$car_modelmodel_11 -0.630185 0.339454 -1.856 0.063387 .
## ordinal$car_modelmodel_12 -1.043908 0.413244 -2.526 0.011533 *
## ordinal$car_modelmodel_13 0.723356 0.171994 4.206 2.60e-05 ***
## ordinal$car_modelmodel_14 -10.991561 99.951017 -0.110 0.912434
## ordinal$car_modelmodel_15 0.096349 0.324878 0.297 0.766796
## ordinal$car_modelmodel_16 0.260321 0.511420 0.509 0.610741
## ordinal$car_modelmodel_17 -0.749625 0.713500 -1.051 0.293428
## ordinal$car_modelmodel_18 0.619597 0.461385 1.343 0.179302
## ordinal$car_modelmodel_19 -10.991561 624.193829 -0.018 0.985951
## ordinal$car_modelmodel_2 -0.144335 0.060935 -2.369 0.017852 *
## ordinal$car_modelmodel_3 0.371834 0.063400 5.865 4.50e-09 ***
## ordinal$car_modelmodel_4 -0.268321 0.075309 -3.563 0.000367 ***
## ordinal$car_modelmodel_5 -0.148658 0.064955 -2.289 0.022101 *
## ordinal$car_modelmodel_6 -0.375741 0.141537 -2.655 0.007937 **
## ordinal$car_modelmodel_7 0.293926 0.077232 3.806 0.000141 ***
## ordinal$car_modelmodel_8 0.222674 0.084895 2.623 0.008718 **
## ordinal$car_modelmodel_9 -0.003854 0.201322 -0.019 0.984725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32432 on 131336 degrees of freedom
## Residual deviance: 32194 on 131318 degrees of freedom
## AIC: 32232
##
## Number of Fisher Scoring iterations: 13
#GLM3
glm3 = glm(ordinal$age_of_vehicle_years ~ ordinal$car_segment, family=binomial(logit), data = repurchase_training)
summary(glm3)
##
## Call:
## glm(formula = ordinal$age_of_vehicle_years ~ ordinal$car_segment,
## family = binomial(logit), data = repurchase_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3423 0.3648 0.4676 0.4676 1.6049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.15839 0.01438 150.131 < 2e-16 ***
## ordinal$car_segmentLCV -0.16222 0.02434 -6.663 2.67e-11 ***
## ordinal$car_segmentOther -3.12347 0.29414 -10.619 < 2e-16 ***
## ordinal$car_segmentSmall/Medium 0.51835 0.02261 22.928 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79805 on 131336 degrees of freedom
## Residual deviance: 78833 on 131333 degrees of freedom
## AIC: 78841
##
## Number of Fisher Scoring iterations: 5
#GLM4
glm4 = glm(ordinal$num_serv_dealer_purchased ~ ordinal$car_model, family=binomial(logit), data = repurchase_training)
summary(glm4)
##
## Call:
## glm(formula = ordinal$num_serv_dealer_purchased ~ ordinal$car_model,
## family = binomial(logit), data = repurchase_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3266 0.4327 0.4373 0.4735 0.7010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.131638 0.026229 81.270 < 2e-16 ***
## ordinal$car_modelmodel_10 -0.188923 0.059372 -3.182 0.001462 **
## ordinal$car_modelmodel_11 -0.689254 0.106090 -6.497 8.20e-11 ***
## ordinal$car_modelmodel_12 0.505676 0.163779 3.088 0.002018 **
## ordinal$car_modelmodel_13 -0.257409 0.113269 -2.273 0.023053 *
## ordinal$car_modelmodel_14 -0.522200 0.304952 -1.712 0.086823 .
## ordinal$car_modelmodel_15 0.221378 0.196085 1.129 0.258902
## ordinal$car_modelmodel_16 0.008428 0.306309 0.028 0.978048
## ordinal$car_modelmodel_17 -0.347483 0.231897 -1.498 0.134020
## ordinal$car_modelmodel_18 -0.853232 0.242489 -3.519 0.000434 ***
## ordinal$car_modelmodel_19 -11.697595 51.238982 -0.228 0.819417
## ordinal$car_modelmodel_2 0.167635 0.032216 5.204 1.96e-07 ***
## ordinal$car_modelmodel_3 -0.125335 0.035334 -3.547 0.000389 ***
## ordinal$car_modelmodel_4 0.181957 0.038647 4.708 2.50e-06 ***
## ordinal$car_modelmodel_5 0.189771 0.034438 5.510 3.58e-08 ***
## ordinal$car_modelmodel_6 0.183549 0.068330 2.686 0.007226 **
## ordinal$car_modelmodel_7 0.008018 0.044583 0.180 0.857274
## ordinal$car_modelmodel_8 -0.088013 0.047079 -1.869 0.061560 .
## ordinal$car_modelmodel_9 0.317838 0.120163 2.645 0.008168 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 85128 on 131336 degrees of freedom
## Residual deviance: 84843 on 131318 degrees of freedom
## AIC: 84881
##
## Number of Fisher Scoring iterations: 8
library(mgcv) ordinal_gam <- gam(car_model ~ target + car_segment + age_of_vehicle_years, data = repurchase_training)