This file is still written for a coder/analyst, and you will have to polish it (change text and hide chunks) to make it an effective client ready report. Play around with code chunk options like suppressing warnings or output.
Setup
Working Directory and Data
Empty variables and functions in the environment tab/window, set working directory and load the training/testing data.
# Clear the workspacerm(list =ls())# Clear environmentgc()# Clear unused memory
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 597615 32.0 1357351 72.5 NA 700240 37.4
Vcells 1106754 8.5 8388608 64.0 49152 1963592 15.0
graphics.off()# Clear all graphs# Set working directory and path to datasetwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")
Load packages
Now, I will load the packages.
# Prepare needed librariespackages<-c("reader", # importing data, "psych", # quick summary stats for data exploration,"mice", # for imputation of missing values and vis of missing data,"stargazer", # summary stats,"vtable", # summary stats,"summarytools", # summary stats,"naniar", # for visualisation of missing data,"visdat", # for visualisation of missing data,"VIM", # for visualisation of missing data,"DataExplorer", # for visualisation of missing data,"tidyverse", # data manipulation like selecting variables,"fastDummies", # Create dummy variables using fastDummies,"corrplot", # correlation plots,"ggplot2", # graphing,"data.table", # reshape for graphing, "car"# vif for multicollinearity)for(iin1:length(packages)){if(!packages[i]%in%rownames(installed.packages())){install.packages(packages[i] , repos ="http://cran.rstudio.com/" , dependencies =TRUE)}library(packages[i], character.only =TRUE)}rm(packages)
Be sure to talk about some insights from your summary statistics tables and graphic visualizations.
A good practice is to tell the reader what you are planning to do in the section right at the beginning/top.
I will first check for missing values and impute them, as there are only 2 variables with less about 6% missing values. While I can impute with median/mean or even just drop all rows with any missing observation, I will impute the mean with the mice package.
After I have imputed the mean, I will create my summary statistics table and visualize my data through charts.
Throughout the analysis, I will be describing what I am doing with test and also be commenting my code. Also, make sure to align your code i.e. = or " below each other - use space if your have to.
Missing Data
YOJ and CAR_AGE have about \(6\%\) missing values. AGE has some missing values too.
You can use any of the packages below. I personally prefer Amelia and naniar as they works on big data too, but like visdat
?naniar
No documentation for 'naniar' in specified packages and libraries:
you could try '??naniar'
The mice package (Multivariate Imputations by Chained Equations) implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.
First, we will create an imputation_model using the mice() function, and then complete the missing values using the complete() function to obtain the imputed_dataset (with the missing values completed). The imputed data frame will contain the original data frame df_train with missing values replaced with imputed values.
?mice
No documentation for 'mice' in specified packages and libraries:
you could try '??mice'
imputed_model<-mice::mice(data =df_train, m =1, # default value is 5 method ="mean", # univariate imputation method - can play with seed =7# integer argument for offsetting the random number generator)
iter imp variable
1 1 AGE YOJ CAR_AGE
2 1 AGE YOJ CAR_AGE
3 1 AGE YOJ CAR_AGE
4 1 AGE YOJ CAR_AGE
5 1 AGE YOJ CAR_AGE
Warning: Number of logged events: 14
?complete
Help on topic 'complete' was found in the following packages:
Package Library
mice /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
tidyr /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
Using the first match ...
Keep in mind that the mice package uses a multivariate imputation method that takes into account relationships between variables to impute missing values. It’s essential to consider the assumptions and limitations of the imputation method and perform appropriate validation and analysis to ensure the imputed values are reasonable and suitable for your analysis.
As you can see now, there are no missing values in the training data now.
?vis_miss
Help on topic 'vis_miss' was found in the following packages:
Package Library
naniar /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
visdat /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
Using the first match ...
We will use the sumtable function from vtable function to explore the raw data. The nice feature of the command is that if we have categorical data, we will quickly be able to see the different categories in them.
Re-coding data values
There are some spelling mistakes that I would like to fix immediately - as they will have implications in default graph labels later on.
# Change spellings?recode
Help on topic 'recode' was found in the following packages:
Package Library
dplyr /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
car /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
Using the first match ...
train$SEX<-dplyr::recode(train$SEX,"M"="Male","z_F"="Female")train$CAR_TYPE<-dplyr::recode(train$CAR_TYPE,"z_SUV"="SUV")train$EDUCATION<-dplyr::recode(train$EDUCATION,"z_High School"="High School")train$MSTATUS<-dplyr::recode(train$MSTATUS,"z_No"="No")train$RED_CAR<-dplyr::recode(train$RED_CAR,"yes"="Yes","no"="No")train$URBANICITY<-dplyr::recode(train$URBANICITY,"Highly Urban/ Urban"="Highly Urban / Urban","z_Highly Rural/ Rural"="Highly Rural / Rural")?dplyr::case_whentrain$JOB<-dplyr::case_when(train$JOB=="z_Blue Collar"~"Blue Collar",is.na(train$JOB)|train$JOB==""~"Missing Values",TRUE~as.character(train$JOB)# Keep other values as they are)table(train$JOB)
Blue Collar Clerical Doctor Home Maker Lawyer
1452 1054 190 512 688
Manager Missing Values Professional Student
758 419 887 568
Renaming variables
There are some variables that I would like to rename immediately to something more informative.
In this code, gsub("[\\$,]", "", train$INCOME) is used to remove the dollar sign ($) and commas (,) from the INCOME variable using the gsub() function with a regular expression. The regular expression [\\$,] matches both the dollar sign and the comma, and they are replaced with an empty string "". Then, as.numeric() is used to convert the character values to numeric.
# Remove dollar sign and convert to numerictrain$INCOME<-as.numeric(gsub(pattern ="[\\$,]", replacement ="", x =train$INCOME))# Instead of copy-pasting the ocde again for another or more varaibles, you can simply try -train<-train|>dplyr::mutate( INCOME =as.numeric(gsub("[\\$,]", "", INCOME)) , HOME_VAL =as.numeric(gsub("[\\$,]", "", HOME_VAL)), BLUEBOOK =as.numeric(gsub("[\\$,]", "", BLUEBOOK)), OLDCLAIM =as.numeric(gsub("[\\$,]", "", OLDCLAIM)))
Stargazer
We will use stargazer again to create summary statistics table. However, note that stargazer requires integer or numeric data. If we have categorical data, we will have to convert them into either integer or numeric data type. There are 10 variables we will have to treat. Read up on fastDummies syntax.
CASE I: If there are binary categorical variables, we can rename them and convert them into dummies and summarize them without any issues. Parent_Single, MSTATUS, SEX, RED_CAR, REVOKED, Urban_City
CASE II: If we have categorical variables summing up to many different values, we can still summarize them in with a stargazer table. EDUCATION, JOB, CAR.
clean_train_stargazer_table<-dplyr::select(.data =train,-c("INDEX"))# remove index variable# Reorder the variablesclean_train_stargazer_table<-dplyr::select(clean_train_stargazer_table, -"JOB",-"CAR_TYPE",everything())?fastDummies::dummy_cols()clean_train_stargazer_table<-fastDummies::dummy_cols(.data =clean_train_stargazer_table, remove_selected_columns =TRUE)clean_train_stargazer_table<-dplyr::select(.data =clean_train_stargazer_table ,-c("Parent_Single_No", "MSTATUS_No","SEX_Male","RED_CAR_No","REVOKED_No","Urban_City_Highly Rural / Rural"))# remove index variable# Assuming your dataframe is named 'your_data'clean_train_stargazer_table<-clean_train_stargazer_table|>dplyr::select(-starts_with("JOB"))|>dplyr::select(-starts_with("CAR"))# drop missing valuesnaniar::gg_miss_upset(clean_train_stargazer_table)
clean_train_stargazer_table$JOB<-train$JOBclean_train_stargazer_table$CAR_TYPE<-train$CAR_TYPEclean_train_stargazer_table<-na.omit(clean_train_stargazer_table)stargazer::stargazer(clean_train_stargazer_table[,1:25], type ="text")
Of course, I am expecting you to clean up the table more - like label the variables, draw some inferences.
You can hide some commands like summarytools::descr(clean_train_stargazer_table) but use the skewness/kurtosis/IQR calculations to describe certain variables. Might have to download X11 from https://www.xquartz.org/ if using Mac.
Worry about skewness and/or kurtosis only if it is a major issue. For skewness, if this value is between:
-0.5 and 0.5, the distribution of the value is almost symmetrical
-1 and -0.5, the data is negatively skewed, and if it is between 0.5 to 1, the data is positively skewed. The skewness is moderate.
If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed. I might consider to log the varaible in this case and check for skewness on this new distribution.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Graphing Categorical Variables
Car Types and Jobs are broken down into too many dummies - so I do not want to put then in the summary statistics tables. Instead, I would prefer to have them in dedicated charts below.
Car Types
You can try to create conditional charts (if the car crash happened).
# Create a table of frequenciesfreq_table_CAR_TYPE<-table(clean_train_stargazer_table$CAR_TYPE)freq_table_CAR_TYPE
Minivan Panel Truck Pickup Sports Car SUV Van
1582 468 993 634 1634 522
# Convert the table to a data framedf_CAR_TYPE<-data.frame(Value =names(freq_table_CAR_TYPE), Frequency =freq_table_CAR_TYPE)df_CAR_TYPE
Value Frequency.Var1 Frequency.Freq
1 Minivan Minivan 1582
2 Panel Truck Panel Truck 468
3 Pickup Pickup 993
4 Sports Car Sports Car 634
5 SUV SUV 1634
6 Van Van 522
# Create a vector of 6 distinct colorsnum_colors<-length(df_CAR_TYPE$Value)my_colors<-rainbow(num_colors)# Assign the colors to the 'ColorVar' variableColorVar<-factor(x =1:num_colors, levels =1:num_colors, labels =my_colors)# ACTUAL PLOT - I want to sort the columns by height and color them differentlyggplot(data =df_CAR_TYPE, aes(x =reorder(x =Value, X =-Frequency.Freq), y =Frequency.Freq, fill =ColorVar))+geom_bar(stat ="identity", show.legend =FALSE)+labs(title ="Bar Chart of Occupations", x ="", y ="Frequency")+theme(axis.text.x =element_text(angle =45, # fit the labels hjust =1))
Jobs
You can try to create conditional charts (if the car crash happened).
# Create a table of frequenciesfreq_table_JOB<-table(clean_train_stargazer_table$JOB)freq_table_JOB
Blue Collar Clerical Doctor Home Maker Lawyer
1302 933 172 452 635
Manager Missing Values Professional Student
674 378 798 489
# Convert the table to a data framedf_JOB<-data.frame(Value =names(freq_table_JOB), Frequency =freq_table_JOB)df_JOB
Value Frequency.Var1 Frequency.Freq
1 Blue Collar Blue Collar 1302
2 Clerical Clerical 933
3 Doctor Doctor 172
4 Home Maker Home Maker 452
5 Lawyer Lawyer 635
6 Manager Manager 674
7 Missing Values Missing Values 378
8 Professional Professional 798
9 Student Student 489
# Create a vector of 9 distinct colorsnum_colors<-9my_colors<-rainbow(num_colors)# Assign the colors to the 'ColorVar' variableColorVar<-factor(x =1:num_colors, levels =1:num_colors, labels =my_colors)# ACTUAL PLOT - I want to sort the columns by height and color them differentlyggplot(data =df_JOB, aes(x =reorder(x =Value, X =-Frequency.Freq), y =Frequency.Freq, fill =ColorVar))+geom_bar(stat ="identity", show.legend =FALSE)+labs(title ="Bar Chart of Occupations", x ="", y ="Frequency")+theme(axis.text.x =element_text(angle =45, # fit the labels hjust =1))
Visualization
It would be a good idea to plot the distribution of all the variables like you did in HW1 - melt the data into long format and use ggplot2 to see the raw data. In this assignment, it would be informative to split the charts above by crashed or not crashed too. Run basic regression models and quickly see what trends you are getting - then you can create the charts to support the story from the regressions by decomposing the raw data and verifying why you find the results.
MODELS
See basic multivariate regression from Homework 1 solution.