The goal of this homework is hands-on practice with linear regression, logistic regression, classification, and model selection.
The Assignment
The accompanying file “car_data.csv” (posted on Canvas) contains data from 10,062 car auctions. Auto dealers purchase used cars at auctions with the plan to sell them to consumers, but sometimes these auctioned vehicles have severe issues that prevent them from being resold. The data contains information about each auctioned vehicle (for instance: the make, color, and age, among other variables). A full data dictionary is given in carvana_data_dictionary.txt (only a subset of the variables in the original data set is included here). See http://www.kaggle.com/c/DontGetKicked for documentation on the problem.
Your task is to develop models to predict the target variable “IsBadBuy”, which labels whether a car purchased at auction was a “bad buy” or not. The intended use case for this model is to help an auto dealership decide whether or not to purchase an individual vehicle. Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.
RUBRIC: There are three possible grades on this assignment: Fail (F), Pass (P), and High Pass (H). If you receive an F then you will have one more chance to turn it in to receive a P. If you receive H on 3 out of the 4 assignments this semester you’ll get a bonus point on your final average.
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.
library(tidyverse) #loads readr (read_csv), dplyr, and ggplot2, all used below
car <- read_csv("car_data.csv") #read the car auction data into R
## Rows: 10062 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): Auction, Make, Color, WheelType, Size
## dbl (5): VehicleAge, VehOdo, MMRAcquisitionAuctionAveragePrice, MMRAcquisiti...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(car) #variables used in dataset
## [1] "Auction" "VehicleAge"
## [3] "Make" "Color"
## [5] "WheelType" "VehOdo"
## [7] "Size" "MMRAcquisitionAuctionAveragePrice"
## [9] "MMRAcquisitionRetailAveragePrice" "IsBadBuy"
What is the mean of the VehicleAge variable?
ANSWER: The mean age of a vehicle in this dataset is 4.504969 years.
age_mean <- car %>%
  summarise(mean_age = mean(VehicleAge))
ANSWER TO QUESTION 1a HERE:
#Plotting the boxplot
box_VehOdo <- car %>%
  ggplot(aes(x = IsBadBuy, y = VehOdo, group = IsBadBuy)) +
  geom_point(color = "orange") +
  geom_boxplot(fill = "peachpuff2", alpha = 0.6) +
  theme_classic()
#Improving the boxplot
box_VehOdo +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size = 15),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_line(color = "grey95"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(face = "bold")) +
  labs(title = "Distribution of Odometer Readings by IsBadBuy?",
       x = "Is a Bad Buy? (1 = Yes, 0 = No)", y = "Miles Driven") +
  scale_x_continuous(breaks = c(0, 1))
#Plotting the boxplot
box_VehicleAge <- car %>%
  ggplot(aes(x = IsBadBuy, y = VehicleAge, group = IsBadBuy)) +
  geom_point(color = "saddlebrown") +
  geom_boxplot(fill = "burlywood4", alpha = 0.6) +
  theme_classic()
#Improving the boxplot
box_VehicleAge +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size = 15),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_line(color = "grey95"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title = element_text(face = "bold"),
        plot.caption = element_text(face = "italic")) +
  labs(title = "Distribution of Vehicle Age by IsBadBuy?",
       x = "Is a Bad Buy? (1 = Yes, 0 = No)", y = "Vehicle Age") +
  scale_x_continuous(breaks = c(0, 1))
Comparing the two box plots, neither VehOdo nor VehicleAge shows a strong relationship with IsBadBuy. In both cases, however, the median is somewhat higher for bad buys, suggesting a weak positive association between each variable and the chance of a bad buy.
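As a quick numeric check of this reading of the plots, the group medians can be computed directly; a small verification sketch using the same data frame:

car %>%
  group_by(IsBadBuy) %>%
  summarise(median_odo = median(VehOdo), #median odometer reading per class
            median_age = median(VehicleAge)) #median vehicle age per class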
ANSWER TO QUESTION 1b HERE:
table_badbuy_make <- table(car$Make, car$IsBadBuy)
table_badbuy_make #Printing the table
##
## 0 1
## ACURA 4 5
## BUICK 43 60
## CADILLAC 1 2
## CHEVROLET 1191 930
## CHRYSLER 604 613
## DODGE 911 742
## FORD 774 990
## GMC 42 43
## HONDA 41 36
## HYUNDAI 115 124
## INFINITI 2 8
## ISUZU 10 5
## JEEP 108 134
## KIA 203 169
## LEXUS 0 8
## LINCOLN 7 16
## MAZDA 73 95
## MERCURY 61 91
## MINI 3 5
## MITSUBISHI 81 65
## NISSAN 138 191
## OLDSMOBILE 12 31
## PLYMOUTH 0 1
## PONTIAC 317 280
## SATURN 132 165
## SCION 11 7
## SUBARU 1 3
## SUZUKI 84 110
## TOYOTA 78 65
## VOLKSWAGEN 8 10
## VOLVO 3 0
From the table above, makes such as Chrysler, Ford, Jeep, and Suzuki have more bad buys than good ones. Additionally, makes such as Acura, Cadillac, Infiniti, Lexus, Mini, Plymouth, Subaru, and Volvo appear only a handful of times in the data. These rare makes should either be excluded from the analysis or collapsed into a single "other" category (which is done below).
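Because the makes differ so much in volume, raw counts are hard to compare directly; a quick follow-up sketch converts the table to row proportions (the bad-buy rate within each make), keeping in mind that rates for the rare makes are unstable:

round(prop.table(table_badbuy_make, margin = 1), 2) #each row sums to 1: share of good (0) vs. bad (1) buys per make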
Also, modify the two existing variables Make and Color so that rarely occurring levels are collapsed into "other" categories:
ANSWER TO QUESTION 1c HERE:
car_clean <- car %>%
  mutate(
    #indicator for above-median miles driven per year
    MPYind = factor(ifelse(VehOdo / VehicleAge > median(VehOdo / VehicleAge), 1, 0)),
    #collapse the Size categories into four vehicle types
    VehType = case_when(
      Size %in% c("LARGE SUV", "MEDIUM SUV", "SMALL SUV") ~ "SUV",
      Size %in% c("LARGE TRUCK", "MEDIUM TRUCK", "SMALL TRUCK") ~ "Truck",
      Size %in% c("VAN", "CROSSOVER", "LARGE", "MEDIUM") ~ "Regular",
      Size %in% c("COMPACT", "SPECIALTY", "SPORTS") ~ "Small"),
    #indicator for either acquisition price being recorded as 0
    Price0 = factor(ifelse(MMRAcquisitionRetailAveragePrice == 0 |
                             MMRAcquisitionAuctionAveragePrice == 0, 1, 0))
  ) %>%
  group_by(Make) %>%
  mutate(make_count = n()) %>%
  ungroup() %>%
  mutate(Make = ifelse(make_count < 20, "other_make", Make)) %>%
  group_by(Color) %>%
  mutate(color_count = n()) %>%
  ungroup() %>%
  mutate(Color = ifelse(color_count < 20, "other_color", Color))
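A quick sanity check (an optional sketch) confirms that the new variables look as intended:

car_clean %>% count(VehType) #should show the four collapsed vehicle types
table(car_clean$Price0) #number of vehicles with a zero acquisition price
table(car_clean$MPYind) #should split roughly in half, by construction of the median cutoff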
Hint: this one is a little tricky. Consider using the special value NA to replace the 0’s.
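As a minimal illustration of the hint, na_if() from dplyr replaces every occurrence of a given value with NA (toy values only):

na_if(c(0, 5200, 0, 8150), 0) #returns NA 5200 NA 8150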
ANSWER TO QUESTION 1d HERE:
car_clean <- car_clean %>%
  mutate(
    MPYind = as.factor(ifelse(VehOdo / VehicleAge > median(VehOdo / VehicleAge), 1, 0)),
    VehType = as.factor(case_when(
      Size %in% c("LARGE SUV", "MEDIUM SUV", "SMALL SUV") ~ "SUV",
      Size %in% c("LARGE TRUCK", "MEDIUM TRUCK", "SMALL TRUCK") ~ "Truck",
      Size %in% c("VAN", "CROSSOVER", "LARGE", "MEDIUM") ~ "Regular",
      Size %in% c("COMPACT", "SPECIALTY", "SPORTS") ~ "Small")),
    Price0 = as.factor(ifelse(MMRAcquisitionRetailAveragePrice == 0 |
                                MMRAcquisitionAuctionAveragePrice == 0, 1, 0))
  ) %>%
  group_by(Make) %>%
  mutate(make_count = n()) %>%
  ungroup() %>%
  mutate(Make = as.factor(ifelse(make_count < 20, "other_make", Make))) %>%
  group_by(Color) %>%
  mutate(color_count = n()) %>%
  ungroup() %>%
  mutate(Color = as.factor(ifelse(color_count < 20, "other_color", Color))) %>%
  #replace the placeholder 0 prices with NA, per the hint
  mutate(
    MMRAcquisitionAuctionAveragePrice = na_if(MMRAcquisitionAuctionAveragePrice, 0),
    MMRAcquisitionRetailAveragePrice = na_if(MMRAcquisitionRetailAveragePrice, 0)
  ) %>%
  #impute missing prices with the mean price within each Make
  #(the bare column name respects the grouping; car_clean$... would ignore it and use the pre-pipeline data)
  group_by(Make) %>%
  mutate(
    MMRAcquisitionAuctionAveragePrice = ifelse(is.na(MMRAcquisitionAuctionAveragePrice),
                                               mean(MMRAcquisitionAuctionAveragePrice, na.rm = TRUE),
                                               MMRAcquisitionAuctionAveragePrice),
    MMRAcquisitionRetailAveragePrice = ifelse(is.na(MMRAcquisitionRetailAveragePrice),
                                              mean(MMRAcquisitionRetailAveragePrice, na.rm = TRUE),
                                              MMRAcquisitionRetailAveragePrice)
  ) %>%
  ungroup()
summary(car_clean)
## Auction VehicleAge Make Color
## Length:10062 Min. :1.000 CHEVROLET:2121 SILVER :2081
## Class :character 1st Qu.:3.000 FORD :1764 WHITE :1653
## Mode :character Median :4.000 DODGE :1653 BLUE :1386
## Mean :4.505 CHRYSLER :1217 GREY :1054
## 3rd Qu.:6.000 PONTIAC : 597 BLACK :1013
## Max. :9.000 KIA : 372 RED : 881
## (Other) :2338 (Other):1994
## WheelType VehOdo Size
## Length:10062 Min. : 9446 Length:10062
## Class :character 1st Qu.: 63489 Class :character
## Mode :character Median : 74942 Mode :character
## Mean : 72904
## 3rd Qu.: 83662
## Max. :115717
##
## MMRAcquisitionAuctionAveragePrice MMRAcquisitionRetailAveragePrice
## Min. : 884 Min. : 1455
## 1st Qu.: 3956 1st Qu.: 5981
## Median : 5684 Median : 8150
## Mean : 5883 Mean : 8271
## 3rd Qu.: 7450 3rd Qu.:10315
## Max. :35722 Max. :39080
##
## IsBadBuy MPYind VehType Price0 make_count
## Min. :0.0000 0:5031 Regular:6283 0:9939 Min. : 23
## 1st Qu.:0.0000 1:5031 Small :1508 1: 123 1st Qu.: 372
## Median :0.0000 SUV :1687 Median :1653
## Mean :0.4973 Truck : 584 Mean :1272
## 3rd Qu.:1.0000 3rd Qu.:1764
## Max. :1.0000 Max. :2121
##
## color_count
## Min. : 2
## 1st Qu.: 881
## Median :1386
## Mean :1275
## 3rd Qu.:1653
## Max. :2081
##
ANSWER TO QUESTION 2a HERE:
model1 <- lm(IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) + factor(Color) +
               factor(WheelType) + VehOdo + factor(MPYind) + factor(VehType) +
               MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
             data = car_clean)
summary(model1)
##
## Call:
## lm(formula = IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) +
## factor(Color) + factor(WheelType) + VehOdo + factor(MPYind) +
## factor(VehType) + MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
## data = car_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2697 -0.3950 -0.1620 0.4688 0.9560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.270e-02 1.091e-01 -0.116 0.907300
## factor(Auction)MANHEIM 4.284e-02 1.201e-02 3.568 0.000361 ***
## factor(Auction)OTHER 7.337e-03 1.367e-02 0.537 0.591418
## VehicleAge 5.026e-02 5.517e-03 9.111 < 2e-16 ***
## factor(Make)CHEVROLET -3.849e-02 4.595e-02 -0.837 0.402339
## factor(Make)CHRYSLER 4.944e-02 4.685e-02 1.055 0.291318
## factor(Make)DODGE 4.746e-03 4.643e-02 0.102 0.918577
## factor(Make)FORD 2.673e-02 4.617e-02 0.579 0.562673
## factor(Make)GMC -3.755e-02 6.694e-02 -0.561 0.574852
## factor(Make)HONDA -1.229e-01 6.842e-02 -1.796 0.072494 .
## factor(Make)HYUNDAI 8.449e-03 5.367e-02 0.157 0.874912
## factor(Make)JEEP 9.915e-03 5.437e-02 0.182 0.855300
## factor(Make)KIA 2.576e-02 5.110e-02 0.504 0.614167
## factor(Make)LINCOLN 6.727e-02 1.045e-01 0.644 0.519713
## factor(Make)MAZDA 3.541e-02 5.680e-02 0.623 0.533010
## factor(Make)MERCURY 4.231e-02 5.779e-02 0.732 0.464084
## factor(Make)MITSUBISHI -1.113e-01 5.850e-02 -1.903 0.057054 .
## factor(Make)NISSAN 3.383e-02 5.123e-02 0.660 0.509117
## factor(Make)OLDSMOBILE 8.039e-02 8.224e-02 0.978 0.328319
## factor(Make)other_make 4.915e-02 6.549e-02 0.751 0.452959
## factor(Make)PONTIAC -1.001e-02 4.856e-02 -0.206 0.836728
## factor(Make)SATURN 3.882e-02 5.202e-02 0.746 0.455535
## factor(Make)SUZUKI 1.415e-01 5.628e-02 2.514 0.011945 *
## factor(Make)TOYOTA -2.325e-02 5.885e-02 -0.395 0.692831
## factor(Color)BEIGE -7.511e-03 9.360e-02 -0.080 0.936048
## factor(Color)BLACK 1.465e-02 9.008e-02 0.163 0.870765
## factor(Color)BLUE 5.836e-03 8.978e-02 0.065 0.948174
## factor(Color)BROWN 2.807e-02 1.052e-01 0.267 0.789634
## factor(Color)GOLD 4.684e-02 9.050e-02 0.518 0.604730
## factor(Color)GREEN -7.624e-03 9.153e-02 -0.083 0.933617
## factor(Color)GREY 7.381e-03 9.007e-02 0.082 0.934694
## factor(Color)MAROON 8.210e-02 9.296e-02 0.883 0.377167
## factor(Color)ORANGE -1.621e-02 1.127e-01 -0.144 0.885639
## factor(Color)OTHER -1.531e-01 1.157e-01 -1.323 0.185998
## factor(Color)other_color -4.843e-01 3.320e-01 -1.459 0.144684
## factor(Color)PURPLE 5.833e-02 1.073e-01 0.543 0.586809
## factor(Color)RED 3.190e-02 9.027e-02 0.353 0.723832
## factor(Color)SILVER 3.332e-02 8.950e-02 0.372 0.709686
## factor(Color)WHITE 2.865e-02 8.966e-02 0.320 0.749349
## factor(Color)YELLOW -8.274e-02 1.163e-01 -0.712 0.476756
## factor(WheelType)Covers -2.524e-02 1.110e-02 -2.275 0.022940 *
## factor(WheelType)NULL 5.193e-01 1.508e-02 34.448 < 2e-16 ***
## factor(WheelType)Special -1.037e-02 4.584e-02 -0.226 0.820983
## VehOdo 2.410e-06 3.967e-07 6.075 1.28e-09 ***
## factor(MPYind)1 -1.113e-02 1.512e-02 -0.736 0.461643
## factor(VehType)Small 6.806e-02 1.375e-02 4.949 7.58e-07 ***
## factor(VehType)SUV 1.237e-02 1.600e-02 0.773 0.439345
## factor(VehType)Truck -2.927e-02 2.205e-02 -1.327 0.184444
## MMRAcquisitionAuctionAveragePrice -2.344e-06 5.396e-06 -0.434 0.663942
## MMRAcquisitionRetailAveragePrice 3.210e-07 3.592e-06 0.089 0.928804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 10012 degrees of freedom
## Multiple R-squared: 0.1894, Adjusted R-squared: 0.1854
## F-statistic: 47.75 on 49 and 10012 DF, p-value: < 2.2e-16
summary(model1)$r.squared
## [1] 0.1894114
The R^2 for this linear model is 0.1894114, i.e., the predictors explain about 19% of the variance in IsBadBuy.
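One caveat of using lm() as a linear probability model is that its fitted values are not constrained to [0, 1]; a quick check (sketch) shows whether any predictions fall outside that range, which foreshadows the comparison with logistic regression in Question 3:

range(fitted(model1)) #fitted values below 0 or above 1 cannot be read as probabilities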
ANSWER TO QUESTION 2b HERE:
#MPYind = 0 for this vehicle: 32000 miles / 4 years = 8000 miles per year, below the median miles-per-year in the data
pred_data <- data.frame(Auction = "MANHEIM", VehicleAge = 4, Color = "BLUE",
                        VehType = "Small", Make = "other_make", VehOdo = 32000,
                        MPYind = 0, WheelType = "Special",
                        MMRAcquisitionAuctionAveragePrice = 8000,
                        MMRAcquisitionRetailAveragePrice = 12000)
predict(model1, newdata = pred_data)
## 1
## 0.4060868
The predicted value for this vehicle is approximately 0.41; since the classification cutoff is 0.5, it would be classified as a good buy (IsBadBuy = 0).
ANSWER TO QUESTION 2c HERE:
This prediction should not be fully trusted because the model has not been validated: it was fit and evaluated on the same training data. To assess whether the model generalizes, the data set should be partitioned into training and validation sets, with the model fit on the training data and its performance measured on the held-out validation data (this is done in Question 4).
ANSWER TO QUESTION 3a HERE:
model2 <- glm(IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) + factor(Color) +
                factor(WheelType) + VehOdo + factor(MPYind) + factor(VehType) +
                MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
              data = car_clean, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) +
## factor(Color) + factor(WheelType) + VehOdo + factor(MPYind) +
## factor(VehType) + MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
## family = "binomial", data = car_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1104 -0.9833 -0.5302 1.0960 2.1348
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.705e+00 6.476e-01 -4.177 2.95e-05 ***
## factor(Auction)MANHEIM 1.914e-01 5.978e-02 3.202 0.00137 **
## factor(Auction)OTHER 2.404e-02 7.198e-02 0.334 0.73839
## VehicleAge 2.597e-01 2.778e-02 9.348 < 2e-16 ***
## factor(Make)CHEVROLET -1.983e-01 2.295e-01 -0.864 0.38752
## factor(Make)CHRYSLER 2.174e-01 2.340e-01 0.929 0.35300
## factor(Make)DODGE -2.782e-03 2.317e-01 -0.012 0.99042
## factor(Make)FORD 1.173e-01 2.305e-01 0.509 0.61081
## factor(Make)GMC -2.128e-01 3.253e-01 -0.654 0.51292
## factor(Make)HONDA -6.113e-01 3.515e-01 -1.739 0.08200 .
## factor(Make)HYUNDAI 2.965e-02 2.675e-01 0.111 0.91173
## factor(Make)JEEP 2.930e-02 2.699e-01 0.109 0.91354
## factor(Make)KIA 1.188e-01 2.552e-01 0.465 0.64172
## factor(Make)LINCOLN 2.665e-01 5.259e-01 0.507 0.61229
## factor(Make)MAZDA 1.441e-01 2.820e-01 0.511 0.60929
## factor(Make)MERCURY 1.866e-01 2.857e-01 0.653 0.51361
## factor(Make)MITSUBISHI -5.976e-01 2.957e-01 -2.021 0.04331 *
## factor(Make)NISSAN 1.507e-01 2.548e-01 0.591 0.55419
## factor(Make)OLDSMOBILE 3.832e-01 4.222e-01 0.908 0.36408
## factor(Make)other_make 2.403e-01 3.247e-01 0.740 0.45927
## factor(Make)PONTIAC -5.656e-02 2.422e-01 -0.234 0.81532
## factor(Make)SATURN 1.847e-01 2.586e-01 0.714 0.47517
## factor(Make)SUZUKI 6.932e-01 2.804e-01 2.472 0.01344 *
## factor(Make)TOYOTA -1.725e-01 2.923e-01 -0.590 0.55496
## factor(Color)BEIGE 1.988e-02 5.848e-01 0.034 0.97288
## factor(Color)BLACK 1.585e-01 5.694e-01 0.278 0.78075
## factor(Color)BLUE 1.048e-01 5.681e-01 0.184 0.85371
## factor(Color)BROWN 2.196e-01 6.241e-01 0.352 0.72492
## factor(Color)GOLD 3.155e-01 5.709e-01 0.553 0.58052
## factor(Color)GREEN 5.893e-02 5.747e-01 0.103 0.91832
## factor(Color)GREY 1.214e-01 5.693e-01 0.213 0.83120
## factor(Color)MAROON 4.979e-01 5.802e-01 0.858 0.39087
## factor(Color)ORANGE 7.697e-03 6.710e-01 0.011 0.99085
## factor(Color)OTHER -1.165e+00 7.317e-01 -1.592 0.11144
## factor(Color)other_color -3.246e+00 1.597e+00 -2.032 0.04211 *
## factor(Color)PURPLE 4.656e-01 6.435e-01 0.724 0.46936
## factor(Color)RED 2.377e-01 5.701e-01 0.417 0.67665
## factor(Color)SILVER 2.466e-01 5.671e-01 0.435 0.66362
## factor(Color)WHITE 2.237e-01 5.677e-01 0.394 0.69358
## factor(Color)YELLOW -3.500e-01 6.720e-01 -0.521 0.60242
## factor(WheelType)Covers -6.587e-02 5.278e-02 -1.248 0.21204
## factor(WheelType)NULL 3.469e+00 1.368e-01 25.364 < 2e-16 ***
## factor(WheelType)Special -5.133e-02 2.108e-01 -0.244 0.80761
## VehOdo 1.257e-05 1.977e-06 6.358 2.05e-10 ***
## factor(MPYind)1 -4.337e-02 7.386e-02 -0.587 0.55704
## factor(VehType)Small 3.419e-01 6.807e-02 5.023 5.10e-07 ***
## factor(VehType)SUV 5.561e-02 7.840e-02 0.709 0.47815
## factor(VehType)Truck -1.436e-01 1.069e-01 -1.344 0.17895
## MMRAcquisitionAuctionAveragePrice -2.731e-06 2.685e-05 -0.102 0.91897
## MMRAcquisitionRetailAveragePrice -9.533e-07 1.773e-05 -0.054 0.95712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13949 on 10061 degrees of freedom
## Residual deviance: 11678 on 10012 degrees of freedom
## AIC: 11778
##
## Number of Fisher Scoring iterations: 5
summary(model2)$aic
## [1] 11777.91
The AIC of the above model is 11777.91
ANSWER TO QUESTION 3b HERE:
The coefficient on VehicleAge is 0.2597. On average, holding all other variables constant, each additional year of vehicle age multiplies the odds of a bad buy by exp(0.2597) ≈ 1.297, i.e., the odds increase by about 30% per year.
ANSWER TO QUESTION 3c HERE:
The coefficient on VehType = Small is 0.3419. On average, holding all other variables constant, the odds of a bad buy for a small vehicle are exp(0.3419) ≈ 1.408 times the odds for the baseline category (VehType = Regular, the reference level of the factor).
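Both odds-ratio interpretations can be read directly off the exponentiated coefficients; a small verification sketch (the coefficient names follow the factor() terms in the model formula):

exp(coef(model2)[c("VehicleAge", "factor(VehType)Small")]) #approx. 1.297 and 1.408, matching the interpretations above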
ANSWER TO QUESTION 3d HERE:
predict(model2, newdata = pred_data, type = 'response')
## 1
## 0.3845428
The predicted probability of a bad buy from the logistic model is approximately 0.38, so at a 0.5 cutoff this vehicle would again be classified as a good buy.
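As a consistency check, the same probability can be recovered from the default link-scale prediction by applying the inverse logit:

plogis(predict(model2, newdata = pred_data)) #inverse logit of the predicted log-odds; equivalent to type = 'response'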
ANSWER TO QUESTION 3e HERE:
A logistic model does a better job of modeling the relationship between the features and the target here, because the target is binary: the logistic model's predictions are probabilities constrained to [0, 1], whereas the linear model can produce predicted values outside that range, which have no meaningful interpretation as probabilities.
ANSWER TO QUESTION 4a HERE:
set.seed(1) #an arbitrary seed so the 70/30 split is reproducible
train_insts <- sample(nrow(car_clean), 0.7 * nrow(car_clean))
data_train <- car_clean[train_insts,]
data_valid <- car_clean[-train_insts,]
model1_tr <- lm(IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) + factor(Color) +
                  factor(WheelType) + VehOdo + factor(MPYind) + factor(VehType) +
                  MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
                data = data_train)
model2_tr <- glm(IsBadBuy ~ factor(Auction) + VehicleAge + factor(Make) + factor(Color) +
                   factor(WheelType) + VehOdo + factor(MPYind) + factor(VehType) +
                   MMRAcquisitionAuctionAveragePrice + MMRAcquisitionRetailAveragePrice,
                 data = data_train, family = "binomial")
summary(model1_tr)$r.squared
## [1] 0.1930742
summary(model2_tr)$aic
## [1] 8242.946
On the training data, the R^2 for the linear model is 0.1930742 and the AIC for the logistic model is 8242.946.
ANSWER TO QUESTION 4b HERE:
#RMSE helper: root mean squared error of model predictions against actual values
rmse_cal <- function(model, data, actual){
  pred <- predict(model, newdata = data)
  sqrt(mean((pred - actual)^2))
}
rmse_tr <- rmse_cal(model1_tr, data_train, data_train$IsBadBuy)
rmse_val <- rmse_cal(model1_tr, data_valid, data_valid$IsBadBuy)
rmse_tr
## [1] 0.449139
rmse_val
## [1] 0.4541079
The RMSE is approximately the same on the training data (0.449) and the validation data (0.454). The validation RMSE is the more meaningful measure of performance: the model was fit to minimize error on the training data, so the training RMSE is an optimistically biased estimate of how the model will perform on new vehicles.
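For context, a naive baseline that always predicts the training-set mean of IsBadBuy should have an RMSE near 0.5 here, since the classes are roughly balanced; a hedged comparison sketch:

baseline_pred <- mean(data_train$IsBadBuy) #constant prediction, ignoring all features
sqrt(mean((baseline_pred - data_valid$IsBadBuy)^2)) #validation RMSE of the no-model baseline

The model's validation RMSE of about 0.45 is therefore only a modest improvement over predicting the base rate.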
ANSWER TO QUESTION 4c HERE:
#convert numeric scores into 0/1 classifications at cutoff c
classify <- function(score, c){
  classifications <- ifelse(score > c, 1, 0)
  return(classifications)
}
probs1 <- predict(model1_tr, newdata = data_valid) #linear-model scores
classifications1 <- classify(probs1, .5)
probs2 <- predict(model2_tr, newdata = data_valid, type = "response") #probabilities; the default link scale would threshold log-odds, not probabilities
classifications2 <- classify(probs2, .5)
valid_actuals <- data_valid$IsBadBuy
valid_classifications1 <- classifications1
valid_classifications2 <- classifications2
#confusion-matrix helper: returns the counts c(TP, TN, FP, FN)
confusion_matrix <- function(actuals, classifications){
  CM <- table(actuals, classifications)
  TP <- CM[2, 2]
  TN <- CM[1, 1]
  FP <- CM[1, 2]
  FN <- CM[2, 1]
  return(c(TP, TN, FP, FN))
}
CM1 = confusion_matrix(valid_actuals, valid_classifications1)
TP1 <- CM1[1]
TN1 <- CM1[2]
FP1 <- CM1[3]
FN1 <- CM1[4]
accuracy1 <- (TP1+TN1)/(TP1+TN1+FP1+FN1)
TPR1 <- TP1/(TP1+FN1)
TNR1 <- TN1/(TN1+FP1)
FPR1 <- 1-TNR1
CM1_metrics <- c(accuracy1, TPR1, FPR1)
CM2 = confusion_matrix(valid_actuals, valid_classifications2)
TP2 <- CM2[1]
TN2 <- CM2[2]
FP2 <- CM2[3]
FN2 <- CM2[4]
accuracy2 <- (TP2+TN2)/(TP2+TN2+FP2+FN2)
TPR2 <- TP2/(TP2+FN2)
TNR2 <- TN2/(TN2+FP2)
FPR2 <- 1-TNR2
CM2_metrics <- c(accuracy2, TPR2, FPR2)
#Confusion Matrix and Metrics for Linear Model 1
table(valid_actuals, valid_classifications1)
## valid_classifications1
## valid_actuals 0 1
## 0 1198 320
## 1 673 828
CM1_metrics
## [1] 0.6710831 0.5516322 0.2108037
#Confusion Matrix and Metrics for Logistic Model 2
table(valid_actuals, valid_classifications2)
## valid_classifications2
## valid_actuals 0 1
## 0 1418 100
## 1 939 562
CM2_metrics
## [1] 0.65584631 0.37441706 0.06587615
For the linear model (Model 1) with a cutoff of 0.5, the accuracy is 0.671, the TPR is 0.552, and the FPR is 0.211.
For the logistic model (Model 2) with a cutoff of 0.5, the accuracy is 0.656, the TPR is 0.374, and the FPR is 0.066.
At this cutoff the linear model actually has slightly higher accuracy; the logistic model trades a much lower FPR for a much lower TPR.
ANSWER TO QUESTION 4d HERE:
counter <- 1
cutoff <- c(0.25, 0.75)
accs <- rep(NA, 2)
tprs <- rep(NA, 2)
tnrs <- rep(NA, 2)
fprs <- rep(NA, 2)
for (c in cutoff){
  #classify given the cutoff
  predicted_classes <- classify(probs1, c)
  #make the confusion matrix
  conf_mat <- confusion_matrix(valid_actuals, predicted_classes)
  TP <- conf_mat[1]
  TN <- conf_mat[2]
  FP <- conf_mat[3]
  FN <- conf_mat[4]
  #get the relevant quantities
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  TPR <- TP / (TP + FN)
  TNR <- TN / (TN + FP)
  FPR <- 1 - TNR
  #save the metrics in the vectors we made
  accs[counter] <- accuracy
  tprs[counter] <- TPR
  tnrs[counter] <- TNR
  fprs[counter] <- FPR
  counter <- counter + 1
}
accs #Accuracy Values for Cutoff = 0.25, 0.75
## [1] 0.5382577 0.6127857
tprs #TPRs for Cutoff = 0.25, 0.75
## [1] 0.9626915 0.2345103
tnrs #TNRs Values for Cutoff = 0.25, 0.75
## [1] 0.1185771 0.9868248
fprs #FPRs Values for Cutoff = 0.25, 0.75
## [1] 0.88142292 0.01317523
ANSWER TO QUESTION 4e HERE:
Of the two, the higher cutoff of 0.75 looks preferable to 0.25: its accuracy (0.613 vs 0.538) and true negative rate (0.987 vs 0.119) are much higher, and its false positive rate (0.013 vs 0.881) is far lower, so the model rarely flags a good buy as a bad one. The trade-off is a very low true positive rate (0.235): most bad buys go undetected at this cutoff, so the best choice ultimately depends on the relative cost to the dealership of buying a bad car versus passing on a good one.
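To see the full trade-off rather than just two cutoffs, the same loop can be run over a fine grid of cutoffs and the resulting (FPR, TPR) pairs plotted as an ROC-style curve. A sketch reusing the classify() and confusion_matrix() helpers defined above; it assumes both classes are predicted at every cutoff in the grid (otherwise table() drops a column and the helper would need guarding):

cutoffs <- seq(0.05, 0.95, by = 0.05)
roc_tpr <- rep(NA, length(cutoffs))
roc_fpr <- rep(NA, length(cutoffs))
for (i in seq_along(cutoffs)) {
  cm <- confusion_matrix(valid_actuals, classify(probs1, cutoffs[i]))
  roc_tpr[i] <- cm[1] / (cm[1] + cm[4]) #TP / (TP + FN)
  roc_fpr[i] <- cm[3] / (cm[3] + cm[2]) #FP / (FP + TN)
}
roc_df <- data.frame(fpr = roc_fpr, tpr = roc_tpr)
ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line() +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + #random-guess diagonal
  labs(title = "ROC-style curve for the linear model (validation data)",
       x = "False Positive Rate", y = "True Positive Rate")

Cutoffs near the top-left corner of this curve balance catching bad buys against false alarms.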