This file is still written for a coder/analyst; you will have to polish it (edit the text and hide code chunks) to make it an effective, client-ready report. Experiment with code chunk options, such as suppressing warnings or output.
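For example, a chunk header like the following (a minimal sketch using standard knitr options) hides the code and suppresses warnings and package start-up messages, while results = 'hide' additionally suppresses printed output:
{r, echo = FALSE, warning = FALSE, message = FALSE, results = 'hide'}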
First, empty the variables and functions in the Environment tab/window, set the working directory, and load the training and evaluation data.
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531483 28.4 1183890 63.3 NA 669265 35.8
## Vcells 978152 7.5 8388608 64.0 16384 1840436 14.1
cat("\f") # Clear the console
graphics.off() # Clear all graphs
# Set working directory and path to data
setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")
df_train <- read.csv("insurance-training-data.csv")
df_eval <- read.csv("insurance-evaluation-data.csv")
Now, I will load the packages.
# Prepare needed libraries
packages <- c("psych", # quick summary stats for data exploration,
"mice", # for imputation of missing values and vis of missing data,
"stargazer", # summary stats,
"vtable", # summary stats,
"summarytools",# summary stats,
"naniar", # for visualisation of missing data,
"visdat", # for visualisation of missing data,
"VIM", # for visualisation of missing data,
"DataExplorer",# for visualisation of missing data,
"tidyverse", # data manipulation like selecting variables,
"fastDummies", # Create dummy variables using fastDummies,
"corrplot", # correlation plots,
"ggplot2", # graphing,
"data.table", # reshape for graphing,
"car" # vif for multicollinearity
)
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i]
, repos = "http://cran.rstudio.com/"
, dependencies = TRUE
)
}
library(packages[i], character.only = TRUE)
}
rm(packages)
Be sure to talk about some insights from your summary statistics tables and graphic visualizations.
A good practice is to tell the reader what you are planning to do in the section right at the beginning/top.
I will first check for missing values and impute them, as only two variables have roughly 6% missing values. While I could impute with the median/mean or simply drop all rows with any missing observation, I will impute the mean using the mice package.
After I have imputed the missing values, I will create my summary statistics table and visualize the data through charts.
Throughout the analysis, I will describe what I am doing in the text and also comment my code. Also, make sure to align your code, i.e. place = or " below each other - use spaces if you have to.
YOJ and CAR_AGE have about 6% missing values. AGE has some missing values too.
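A quick way to quantify the missingness per variable (a minimal sketch on the loaded df_train) is:
# Count and share of missing values in each variable
missing_counts <- colSums(is.na(df_train))
missing_share  <- round(100 * missing_counts / nrow(df_train), 2)
sort(missing_share[missing_share > 0], decreasing = TRUE) # variables with any NA, in %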
You can use any of the packages below. I personally prefer Amelia and naniar, as they work on big data too, but visdat is also nice (it works on small datasets only).
?naniar
## No documentation for 'naniar' in specified packages and libraries:
## you could try '??naniar'
naniar::gg_miss_var(df_train)
naniar::gg_miss_upset(df_train)
naniar::vis_miss(df_train)
# visdat::vis_miss(df_train) ## same output, but works on small datasets only
?Amelia::missmap
Amelia::missmap(obj = df_train)
# VIM::aggr(df_train)
VIM::matrixplot(df_train)
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
DataExplorer::plot_missing(df_train)
The mice package (Multivariate Imputations by Chained Equations) implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.
First, we will create an imputed_model using the mice() function, and then fill in the missing values using the complete() function to obtain the completed data set (here called train). The resulting data frame contains the original df_train with the missing values replaced by imputed values.
?mice
## No documentation for 'mice' in specified packages and libraries:
## you could try '??mice'
imputed_model <-
mice::mice(data = df_train,
m = 1, # default value is 5
method = "mean", # univariate imputation method - can play with
seed = 7 # integer argument for offsetting the random number generator
)
##
## iter imp variable
## 1 1 AGE YOJ CAR_AGE
## 2 1 AGE YOJ CAR_AGE
## 3 1 AGE YOJ CAR_AGE
## 4 1 AGE YOJ CAR_AGE
## 5 1 AGE YOJ CAR_AGE
## Warning: Number of logged events: 14
?complete
## Help on topic 'complete' was found in the following packages:
##
## Package Library
## tidyr /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## mice /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
##
## Using the first match ...
train <- mice::complete(data = imputed_model)
Keep in mind that the mice package is designed for multivariate imputation methods that take the relationships between variables into account; here I have used simple (univariate) mean imputation. It is essential to consider the assumptions and limitations of the imputation method and to perform appropriate validation to ensure the imputed values are reasonable and suitable for your analysis.
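A minimal sanity check (a sketch that simply compares the imputed variables before and after imputation) could look like this:
# Compare the imputed variables before and after imputation;
# with mean imputation the means are unchanged, but the spread shrinks slightly
summary(df_train[, c("AGE", "YOJ", "CAR_AGE")])
summary(train[, c("AGE", "YOJ", "CAR_AGE")])
# mice also offers lattice-based diagnostics, e.g. mice::stripplot(imputed_model)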
As you can see, there are no missing values in the training data now.
?vis_miss
## Help on topic 'vis_miss' was found in the following packages:
##
## Package Library
## visdat /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## naniar /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
##
## Using the first match ...
naniar::vis_miss(train)
We will use the sumtable function from the vtable package to explore the raw data. A nice feature of the command is that, for categorical variables, we can quickly see the different categories they contain.
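A minimal call (a sketch on the completed training data) is:
# Quick summary table that also lists the categories of character variables
vtable::sumtable(data = train)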
There are some spellings that I would like to fix immediately, as they will affect the default graph labels later on.
# Change spellings
?recode
## Help on topic 'recode' was found in the following packages:
##
## Package Library
## car /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## dplyr /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
##
## Using the first match ...
train$SEX <- dplyr::recode(train$SEX,
"M" = "Male",
"z_F" = "Female"
)
train$CAR_TYPE <- dplyr::recode(train$CAR_TYPE,
"z_SUV" = "SUV"
)
train$EDUCATION <- dplyr::recode(train$EDUCATION,
"z_High School" = "High School"
)
train$MSTATUS <- dplyr::recode(train$MSTATUS,
"z_No" = "No"
)
train$RED_CAR <- dplyr::recode(train$RED_CAR,
"yes" = "Yes",
"no" = "No"
)
train$URBANICITY <- dplyr::recode(train$URBANICITY,
"Highly Urban/ Urban" = "Highly Urban / Urban",
"z_Highly Rural/ Rural" = "Highly Rural / Rural"
)
?dplyr::case_when
train$JOB <- dplyr::case_when(
train$JOB == "z_Blue Collar" ~ "Blue Collar",
is.na(train$JOB) | train$JOB == "" ~ "Missing Values",
TRUE ~ as.character(train$JOB) # Keep other values as they are
)
table(train$JOB)
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 1825 1271 246 641 835
## Manager Missing Values Professional Student
## 988 526 1117 712
There are some variables that I would like to rename immediately to something more informative.
?dplyr::rename
train <- train |>
dplyr::rename(Parent_Single = PARENT1,
Urban_City = URBANICITY
)
In this code,
gsub("[\\$,]", "", train$INCOME) is used
to remove the dollar sign ($) and commas (,) from the
INCOME variable using the
gsub() function with a regular expression.
The regular expression [\\$,] matches both
the dollar sign and the comma, and they are replaced with an empty
string "". Then,
as.numeric() is used to convert the
character values to numeric.
# Remove dollar sign and convert to numeric
train$INCOME <- as.numeric(gsub(pattern = "[\\$,]",
replacement = "",
x = train$INCOME)
)
# Instead of copy-pasting the code again for one or more other variables, you can simply try -
train <- train |>
dplyr::mutate(
INCOME = as.numeric( gsub("[\\$,]", "", INCOME)) ,
HOME_VAL = as.numeric( gsub("[\\$,]", "", HOME_VAL)),
BLUEBOOK = as.numeric( gsub("[\\$,]", "", BLUEBOOK)),
OLDCLAIM = as.numeric( gsub("[\\$,]", "", OLDCLAIM))
)
We will use stargazer again to create a summary statistics table. However, note that stargazer requires integer or numeric data. If we have categorical variables, we will have to convert them into either integer or numeric data types. There are 10 variables we will have to treat. Read up on the fastDummies syntax.
CASE I: Binary categorical variables can be renamed, converted into dummies, and summarized without any issues: Parent_Single, MSTATUS, SEX, RED_CAR, REVOKED, Urban_City.
CASE II: Categorical variables that take on many different values can still be summarized in a stargazer table: EDUCATION, JOB, CAR_TYPE.
clean_train_stargazer_table <- dplyr::select(.data = train,
-c("INDEX")
) # remove index variable
# Reorder the variables
clean_train_stargazer_table <- dplyr::select(clean_train_stargazer_table,
-"JOB",
-"CAR_TYPE",
everything()
)
?fastDummies::dummy_cols()
clean_train_stargazer_table <- fastDummies::dummy_cols(.data = clean_train_stargazer_table,
remove_selected_columns = TRUE
)
clean_train_stargazer_table <- dplyr::select(.data = clean_train_stargazer_table ,
-c("Parent_Single_No",
"MSTATUS_No",
"SEX_Male",
"RED_CAR_No",
"REVOKED_No",
"Urban_City_Highly Rural / Rural"
)
) # remove the reference-category dummies (the omitted base levels)
# Drop the JOB and CAR dummy columns - these variables are charted separately below
clean_train_stargazer_table <- clean_train_stargazer_table |>
dplyr::select(-starts_with("JOB")) |>
dplyr::select(-starts_with("CAR"))
# Visualize the remaining missing values, then add the categorical variables back and drop incomplete rows
naniar::gg_miss_upset(clean_train_stargazer_table)
clean_train_stargazer_table$JOB <- train$JOB
clean_train_stargazer_table$CAR_TYPE <- train$CAR_TYPE
clean_train_stargazer_table <- na.omit(clean_train_stargazer_table)
stargazer::stargazer(clean_train_stargazer_table[,1:25], type = "text")
##
## ================================================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------------------------------
## TARGET_FLAG 7,285 0.265 0.441 0 1
## TARGET_AMT 7,285 1,504.017 4,720.400 0.000 107,586.100
## KIDSDRIV 7,285 0.172 0.515 0 4
## AGE 7,285 44.757 8.658 16.000 81.000
## HOMEKIDS 7,285 0.726 1.121 0 5
## YOJ 7,285 10.509 3.962 0.000 23.000
## INCOME 7,285 61,894.360 47,406.000 0 367,030
## HOME_VAL 7,285 155,195.200 129,411.500 0 885,282
## TRAVTIME 7,285 33.513 15.901 5 142
## BLUEBOOK 7,285 15,770.290 8,429.700 1,500 69,740
## TIF 7,285 5.361 4.145 1 25
## OLDCLAIM 7,285 4,031.242 8,788.665 0 57,037
## CLM_FREQ 7,285 0.796 1.162 0 5
## MVR_PTS 7,285 1.701 2.154 0 13
## Parent_Single_Yes 7,285 0.134 0.341 0 1
## MSTATUS_Yes 7,285 0.596 0.491 0 1
## SEX_Female 7,285 0.536 0.499 0 1
## EDUCATION_<High School 7,285 0.146 0.353 0 1
## EDUCATION_Bachelors 7,285 0.273 0.446 0 1
## EDUCATION_High School 7,285 0.286 0.452 0 1
## EDUCATION_Masters 7,285 0.206 0.404 0 1
## EDUCATION_PhD 7,285 0.089 0.284 0 1
## RED_CAR_Yes 7,285 0.291 0.454 0 1
## REVOKED_Yes 7,285 0.122 0.328 0 1
## Urban_City_Highly Urban / Urban 7,285 0.798 0.402 0 1
## --------------------------------------------------------------------------------
summarytools::descr(clean_train_stargazer_table)
## Non-numerical variable(s) ignored: JOB, CAR_TYPE
## Descriptive Statistics
## clean_train_stargazer_table
## N: 7285
##
## AGE BLUEBOOK CLM_FREQ EDUCATION_<High School EDUCATION_Bachelors
## ----------------- --------- ---------- ---------- ------------------------ ---------------------
## Mean 44.76 15770.29 0.80 0.15 0.27
## Std.Dev 8.66 8429.70 1.16 0.35 0.45
## Min 16.00 1500.00 0.00 0.00 0.00
## Q1 39.00 9400.00 0.00 0.00 0.00
## Median 45.00 14500.00 0.00 0.00 0.00
## Q3 51.00 20890.00 2.00 0.00 1.00
## Max 81.00 69740.00 5.00 1.00 1.00
## MAD 8.90 8421.17 0.00 0.00 0.00
## IQR 12.00 11490.00 2.00 0.00 1.00
## CV 0.19 0.53 1.46 2.42 1.63
## Skewness -0.03 0.80 1.22 2.00 1.02
## SE.Skewness 0.03 0.03 0.03 0.03 0.03
## Kurtosis -0.06 0.84 0.31 2.01 -0.97
## N.Valid 7285.00 7285.00 7285.00 7285.00 7285.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## EDUCATION_High School EDUCATION_Masters EDUCATION_PhD HOME_VAL HOMEKIDS
## ----------------- ----------------------- ------------------- --------------- ----------- ----------
## Mean 0.29 0.21 0.09 155195.25 0.73
## Std.Dev 0.45 0.40 0.28 129411.54 1.12
## Min 0.00 0.00 0.00 0.00 0.00
## Q1 0.00 0.00 0.00 0.00 0.00
## Median 0.00 0.00 0.00 161774.00 0.00
## Q3 1.00 0.00 0.00 239235.00 1.00
## Max 1.00 1.00 1.00 885282.00 5.00
## MAD 0.00 0.00 0.00 148380.09 0.00
## IQR 1.00 0.00 0.00 239235.00 1.00
## CV 1.58 1.97 3.21 0.83 1.54
## Skewness 0.95 1.46 2.89 0.49 1.33
## SE.Skewness 0.03 0.03 0.03 0.03 0.03
## Kurtosis -1.10 0.12 6.37 -0.02 0.61
## N.Valid 7285.00 7285.00 7285.00 7285.00 7285.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## INCOME KIDSDRIV MSTATUS_Yes MVR_PTS OLDCLAIM Parent_Single_Yes
## ----------------- ----------- ---------- ------------- --------- ---------- -------------------
## Mean 61894.36 0.17 0.60 1.70 4031.24 0.13
## Std.Dev 47406.00 0.52 0.49 2.15 8788.67 0.34
## Min 0.00 0.00 0.00 0.00 0.00 0.00
## Q1 28123.00 0.00 0.00 0.00 0.00 0.00
## Median 54109.00 0.00 1.00 1.00 0.00 0.00
## Q3 86211.00 0.00 1.00 3.00 4604.00 0.00
## Max 367030.00 4.00 1.00 13.00 57037.00 1.00
## MAD 41865.66 0.00 0.00 1.48 0.00 0.00
## IQR 58088.00 0.00 1.00 3.00 4604.00 0.00
## CV 0.77 2.99 0.82 1.27 2.18 2.54
## Skewness 1.17 3.35 -0.39 1.34 3.12 2.15
## SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03
## Kurtosis 2.09 11.79 -1.85 1.35 9.84 2.62
## N.Valid 7285.00 7285.00 7285.00 7285.00 7285.00 7285.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## RED_CAR_Yes REVOKED_Yes SEX_Female TARGET_AMT TARGET_FLAG TIF
## ----------------- ------------- ------------- ------------ ------------ ------------- ---------
## Mean 0.29 0.12 0.54 1504.02 0.26 5.36
## Std.Dev 0.45 0.33 0.50 4720.40 0.44 4.14
## Min 0.00 0.00 0.00 0.00 0.00 1.00
## Q1 0.00 0.00 0.00 0.00 0.00 1.00
## Median 0.00 0.00 1.00 0.00 0.00 4.00
## Q3 1.00 0.00 1.00 1069.00 1.00 7.00
## Max 1.00 1.00 1.00 107586.14 1.00 25.00
## MAD 0.00 0.00 0.00 0.00 0.00 4.45
## IQR 1.00 0.00 1.00 1069.00 1.00 6.00
## CV 1.56 2.68 0.93 3.14 1.67 0.77
## Skewness 0.92 2.30 -0.14 8.93 1.07 0.89
## SE.Skewness 0.03 0.03 0.03 0.03 0.03 0.03
## Kurtosis -1.15 3.30 -1.98 118.01 -0.87 0.42
## N.Valid 7285.00 7285.00 7285.00 7285.00 7285.00 7285.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## TRAVTIME Urban_City_Highly Urban / Urban YOJ
## ----------------- ---------- --------------------------------- ---------
## Mean 33.51 0.80 10.51
## Std.Dev 15.90 0.40 3.96
## Min 5.00 0.00 0.00
## Q1 23.00 1.00 9.00
## Median 33.00 1.00 11.00
## Q3 44.00 1.00 13.00
## Max 142.00 1.00 23.00
## MAD 16.31 0.00 2.97
## IQR 21.00 0.00 4.00
## CV 0.47 0.50 0.38
## Skewness 0.44 -1.48 -1.23
## SE.Skewness 0.03 0.03 0.03
## Kurtosis 0.63 0.20 1.44
## N.Valid 7285.00 7285.00 7285.00
## Pct.Valid 100.00 100.00 100.00
Of course, I expect you to clean up the table more - label the variables and draw some inferences. You can hide some commands like summarytools::descr(clean_train_stargazer_table), but use the skewness/kurtosis/IQR calculations to describe certain variables. In fact, I would visualize the raw data too - it might be easier to see a few things there!
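As a sketch of the variable labelling (the labels below are placeholders and must follow the column order of the data passed to stargazer), you could do something like:
# Relabel a few variables in the summary table via covariate.labels
stargazer::stargazer(clean_train_stargazer_table[, c("TARGET_FLAG", "TARGET_AMT",
                                                     "AGE", "INCOME")],
                     type             = "text",
                     covariate.labels = c("Crash indicator (1 = crash)",
                                          "Claim amount ($)",
                                          "Driver age (years)",
                                          "Annual income ($)")
)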
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
df7 <- reshape2::melt(data = clean_train_stargazer_table)
## Using JOB, CAR_TYPE as id variables
ggplot(df7, aes(x = value)) +
geom_histogram() +
facet_wrap(~variable, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Car Types and Jobs are broken down into too many dummies, so I do not want to put them in the summary statistics tables. Instead, I would prefer to have them in dedicated charts below.
You can try to create conditional charts (split by whether the car crash happened) - see the sketch after the chart below.
# Create a table of frequencies
freq_table_CAR_TYPE <- table(clean_train_stargazer_table$CAR_TYPE)
freq_table_CAR_TYPE
##
## Minivan Panel Truck Pickup Sports Car SUV Van
## 1936 603 1240 813 2038 655
# Convert the table to a data frame
df_CAR_TYPE <- data.frame(Value = names(freq_table_CAR_TYPE),
Frequency = freq_table_CAR_TYPE
)
df_CAR_TYPE
## Value Frequency.Var1 Frequency.Freq
## 1 Minivan Minivan 1936
## 2 Panel Truck Panel Truck 603
## 3 Pickup Pickup 1240
## 4 Sports Car Sports Car 813
## 5 SUV SUV 2038
## 6 Van Van 655
# Create a vector of 6 distinct colors
num_colors <- length(df_CAR_TYPE$Value)
my_colors <- rainbow(num_colors)
# Assign the colors to the 'ColorVar' factor (note: ggplot uses its default palette unless you add scale_fill_manual(values = my_colors))
ColorVar <- factor(x = 1:num_colors,
levels = 1:num_colors,
labels = my_colors
)
# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_CAR_TYPE, aes(x = reorder(x = Value,
X = -Frequency.Freq),
y = Frequency.Freq,
fill = ColorVar
)
) +
geom_bar(stat = "identity",
show.legend = FALSE
) +
labs(title = "Bar Chart of Occupations",
x = "",
y = "Frequency"
) +
theme(axis.text.x = element_text(angle = 45, # fit the labels
hjust = 1
)
)
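One way to make this chart conditional on whether a crash happened (a sketch that builds directly on clean_train_stargazer_table, which still contains TARGET_FLAG and CAR_TYPE) is to facet by the crash indicator:
# Car type frequencies, split by whether a crash occurred
ggplot(data = clean_train_stargazer_table,
       mapping = aes(x = CAR_TYPE)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~ TARGET_FLAG,
             labeller = labeller(TARGET_FLAG = c("0" = "No Crash", "1" = "Crash"))
             ) +
  labs(title = "Car Type by Crash Outcome",
       x     = "",
       y     = "Frequency"
       ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))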
Again, you can try to create conditional charts (split by whether the car crash happened).
# Create a table of frequencies
freq_table_JOB <- table(clean_train_stargazer_table$JOB)
freq_table_JOB
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 1630 1135 222 565 763
## Manager Missing Values Professional Student
## 887 466 1000 617
# Convert the table to a data frame
df_JOB <- data.frame(Value = names(freq_table_JOB),
Frequency = freq_table_JOB
)
df_JOB
## Value Frequency.Var1 Frequency.Freq
## 1 Blue Collar Blue Collar 1630
## 2 Clerical Clerical 1135
## 3 Doctor Doctor 222
## 4 Home Maker Home Maker 565
## 5 Lawyer Lawyer 763
## 6 Manager Manager 887
## 7 Missing Values Missing Values 466
## 8 Professional Professional 1000
## 9 Student Student 617
# Create a vector of 9 distinct colors
num_colors <- 9
my_colors <- rainbow(num_colors)
# Assign the colors to the 'ColorVar' variable
ColorVar <- factor(x = 1:num_colors,
levels = 1:num_colors,
labels = my_colors
)
# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_JOB, aes(x = reorder(x = Value,
X = -Frequency.Freq),
y = Frequency.Freq,
fill = ColorVar
)
) +
geom_bar(stat = "identity",
show.legend = FALSE
) +
labs(title = "Bar Chart of Occupations",
x = "",
y = "Frequency"
) +
theme(axis.text.x = element_text(angle = 45, # fit the labels
hjust = 1
)
)
It would be a good idea to plot the distribution of all the variables like you did in HW1 - melt the data into long format and use ggplot2 to see the raw data. In this assignment, it would also be informative to split the charts above by crashed versus not crashed.
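A sketch of that split (assuming we keep TARGET_FLAG, JOB, and CAR_TYPE as id variables when melting) could look like:
# Distributions of the numeric variables, split by crash outcome
df_long <- reshape2::melt(data    = clean_train_stargazer_table,
                          id.vars = c("TARGET_FLAG", "JOB", "CAR_TYPE"))
ggplot(df_long, aes(x = value, fill = factor(TARGET_FLAG))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(fill = "Crash (1 = yes)")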
Run basic regression models and quickly see what trends you are getting - then you can create charts that support the story from the regressions by decomposing the raw data and verifying why you find those results.
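A minimal starting point (a sketch with an arbitrary handful of predictors; model_df and the specification are assumptions to adapt):
# Working copy without the rows that still have missing values
model_df <- na.omit(train)

# Basic linear regression for the claim amount
lm_basic <- lm(TARGET_AMT ~ AGE + INCOME + MVR_PTS + CLM_FREQ + Urban_City,
               data = model_df)
summary(lm_basic)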
See basic multivariate regression: https://rpubs.com/sharmaar/HW1_solution
Then run some logistic and/or probit regressions in R: https://rpubs.com/sharmaar/Logistic_regression_implementation
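A sketch of a logistic (and probit) specification, reusing the hypothetical model_df and predictors from the linear sketch above:
# Logistic regression for the crash indicator
logit_model <- glm(TARGET_FLAG ~ AGE + INCOME + MVR_PTS + CLM_FREQ + Urban_City,
                   data   = model_df,
                   family = binomial(link = "logit"))
summary(logit_model)

# Probit version: only the link function changes
probit_model <- glm(TARGET_FLAG ~ AGE + INCOME + MVR_PTS + CLM_FREQ + Urban_City,
                    data   = model_df,
                    family = binomial(link = "probit"))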
Try some lasso/ridge regression for bonus points: https://rpubs.com/sharmaar/Lasso_Ridge
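A lasso sketch with glmnet (an additional package, not in the list above; alpha = 0 would give ridge instead), again using the hypothetical model_df:
# Lasso-penalised logistic regression chosen by cross-validation
library(glmnet)
X <- model.matrix(TARGET_FLAG ~ . - TARGET_AMT - INDEX, data = model_df)[, -1]
y <- model_df$TARGET_FLAG
cv_lasso <- glmnet::cv.glmnet(x = X, y = y, family = "binomial", alpha = 1)
coef(cv_lasso, s = "lambda.min") # coefficients at the best lambda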
Present the regression results with the stargazer package - make sure the table is readable and nicely edited. Talk about basic metrics like adjusted R-squared and RMSE, and tell us which is your preferred model.
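A sketch of the comparison table and the metrics (the model names are the hypothetical objects from the sketches above; the RMSE here is in-sample):
# Side-by-side regression table for the candidate models
stargazer::stargazer(lm_basic, logit_model, type = "text")

# Adjusted R-squared and in-sample RMSE of the linear model
summary(lm_basic)$adj.r.squared
sqrt(mean(residuals(lm_basic)^2))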
Predict the outcome variable.
Create a confusion matrix through the caret package. Set the event of interest to a car crash.
Note that you should get the same answers for metrics like sensitivity, specificity, and accuracy whether you construct them by hand or use the package.
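A sketch of both approaches (the 0.5 cutoff is an assumption, and logit_model/model_df are the hypothetical objects from the sketches above):
# Predicted crash probabilities and classes at a 0.5 cutoff
pred_prob  <- predict(logit_model, newdata = model_df, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
actual     <- factor(model_df$TARGET_FLAG, levels = c(0, 1))

# Confusion matrix with the crash (1) set as the event of interest
caret::confusionMatrix(data = pred_class, reference = actual, positive = "1")

# The same metrics constructed by hand
tab         <- table(Predicted = pred_class, Actual = actual)
sensitivity <- tab["1", "1"] / sum(tab[, "1"]) # true positives / all actual crashes
specificity <- tab["0", "0"] / sum(tab[, "0"]) # true negatives / all actual non-crashes
accuracy    <- sum(diag(tab)) / sum(tab)
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)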