MATH1324 Assignment 2

Risk Factor Analysis for Cars

Karthik Kolume (s3857825), Melvin Meshach Ruban Angeljoy (s3884479), Prateek Kumar Singh (s3890089), Reema Sunder Kumble (s3880556)

Last updated: 23 May, 2021

Introduction

- We analyze a data set that describes different attributes of automobiles.

- The data set contains details of different types of cars together with their risk factors.

- We want to find out whether there is a relationship between the body style of a car (hatchback or sedan) and its risk factor.

- To do so, we first load and pre-process the data: some variables need a data type conversion, missing values are replaced, and outliers are removed if necessary. Because the two variables of interest are both categorical, we then conduct a Chi-square test of association and interpret its result.

RPubs link information

- RPubs link: https://rpubs.com/s3857825/772903

Introduction Cont.

Hatchback Car

Sedan Car

Problem Statement

The purpose of this investigation is to find out whether there is a relationship between the body style of a car and its risk factor. We chose the factor variables symboling (risk factor) and body-style (restricted to hatchback and sedan) for the analysis. We inspect the risk factor within each body style and use a Chi-square test of association to determine whether any dependence of the risk factor on body style is statistically significant.

Data

Data Cont.

Data Attribute Information and Range

  1. symboling: -3, -2, -1, 0, 1, 2, 3. Each car is first assigned a risk factor symbol associated with its price. If the car is more (or less) risky than that price suggests, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the car is risky, -3 that it is probably quite safe. We will convert this variable into an ordered factor.
  2. normalized-losses: continuous from 65 to 256. This is the relative average loss payment per insured vehicle year.
  3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo. This would be a factor variable.
  4. fuel-type: diesel, gas. This would be a factor variable.
  5. aspiration: std, turbo. This would be a factor variable.
  6. num-of-doors: four, two. This would be a factor variable.
  7. body-style: hardtop, wagon, sedan, hatchback, convertible. This would be a factor variable.
  8. drive-wheels: 4wd, fwd, rwd. This would be a factor variable.

Data Cont.

  9. engine-location: front, rear. This would be a factor variable.
  10. wheel-base: continuous from 86.6 to 120.9.
  11. length: continuous from 141.1 to 208.1.
  12. width: continuous from 60.3 to 72.3.
  13. height: continuous from 47.8 to 59.8.
  14. curb-weight: continuous from 1488 to 4066.
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor. This would be a factor variable.
  16. num-of-cylinders: eight, five, four, six, three, twelve, two. This would be a factor variable.
  17. engine-size: continuous from 61 to 326.
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. This would be a factor variable.
  19. bore: continuous from 2.54 to 3.94.
  20. stroke: continuous from 2.07 to 4.17.
  21. compression-ratio: continuous from 7 to 23.

Data Cont.

  22. horsepower: continuous from 48 to 288.
  23. peak-rpm: continuous from 4150 to 6600.
  24. city-mpg: continuous from 13 to 49.
  25. highway-mpg: continuous from 16 to 54.
  26. price: continuous from 5118 to 45400.

As part of data pre-processing, we:

  1. Converted ‘?’ entries to NA values.

  2. Filtered the data to keep only hatchback and sedan cars.

  3. Corrected the data types of the variables, e.g. character to numeric.
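The three pre-processing steps can be sketched on a few made-up rows. This is a minimal base R illustration, not the chunk used in the report: the data frame `raw` and its values are hypothetical, and only three of the 26 columns are mimicked.

```r
# Hypothetical toy rows that mimic the raw file, where "?" marks a missing value
raw <- data.frame(
  symboling  = c(3, 1, 2, 0, -1),
  body.style = c("convertible", "hatchback", "sedan", "wagon", "sedan"),
  price      = c("13495", "16500", "?", "12940", "13415"),
  stringsAsFactors = FALSE
)

# Step 1: recode "?" as NA
raw[raw == "?"] <- NA

# Step 2: keep hatchbacks and sedans only
cars <- raw[raw$body.style %in% c("hatchback", "sedan"), ]

# Step 3: correct the data types (price was read as character because of "?")
cars$price      <- as.numeric(cars$price)
cars$body.style <- factor(cars$body.style)

str(cars)
```

Converting `price` only after the "?" entries have become NA avoids the coercion warning that `as.numeric()` would otherwise raise.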

Descriptive Statistics and Visualisation

R chunks

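# Packages used in the chunks below
library(dplyr)         # %>% and filter()
library(Hmisc)         # impute()
library(RColorBrewer)  # brewer.pal()

# The file has no header row, so the column names are supplied here;
# read.table() turns the hyphens into dots (e.g. "body-style" becomes body.style)
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"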
df <- read.table(url,sep = ",", col.names = c("symboling",
                                              "normalized-losses",
                                              "make",
                                              "fuel-type",
                                              "aspiration",
                                              "num-of-doors",
                                              "body-style",
                                              "drive-wheels",
                                              "engine-location",
                                              "wheel-base",
                                              "length",
                                              "width",
                                              "height",
                                              "curb-weight",
                                              "engine-type",
                                              "num-of-cylinders",
                                              "engine-size",
                                              "fuel-system",
                                              "bore",
                                              "stroke",
                                              "compression-ratio",
                                              "horsepower",
                                              "peak-rpm",
                                              "city-mpg",
                                              "highway-mpg",
                                              "price"))

R chunks

# Recode the "?" placeholders as missing values
df[df == "?"] <- NA

# Keep only sedans and hatchbacks
df1 <- df %>% filter(body.style %in% c("sedan", "hatchback"))

str(df1)
## 'data.frame':    166 obs. of  26 variables:
##  $ symboling        : int  1 2 2 2 1 1 0 2 0 0 ...
##  $ normalized.losses: chr  NA "164" "164" NA ...
##  $ make             : chr  "alfa-romero" "audi" "audi" "audi" ...
##  $ fuel.type        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration       : chr  "std" "std" "std" "std" ...
##  $ num.of.doors     : chr  "two" "four" "four" "two" ...
##  $ body.style       : chr  "hatchback" "sedan" "sedan" "sedan" ...
##  $ drive.wheels     : chr  "rwd" "fwd" "4wd" "fwd" ...
##  $ engine.location  : chr  "front" "front" "front" "front" ...
##  $ wheel.base       : num  94.5 99.8 99.4 99.8 105.8 ...
##  $ length           : num  171 177 177 177 193 ...
##  $ width            : num  65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
##  $ height           : num  52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
##  $ curb.weight      : int  2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
##  $ engine.type      : chr  "ohcv" "ohc" "ohc" "ohc" ...
##  $ num.of.cylinders : chr  "six" "four" "five" "five" ...
##  $ engine.size      : int  152 109 136 136 136 131 131 108 108 164 ...
##  $ fuel.system      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ bore             : chr  "2.68" "3.19" "3.19" "3.19" ...
##  $ stroke           : chr  "3.47" "3.40" "3.40" "3.40" ...
##  $ compression.ratio: num  9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
##  $ horsepower       : chr  "154" "102" "115" "110" ...
##  $ peak.rpm         : chr  "5000" "5500" "5500" "5500" ...
##  $ city.mpg         : int  19 24 18 19 19 17 16 23 23 21 ...
##  $ highway.mpg      : int  26 30 22 25 25 20 22 29 29 28 ...
##  $ price            : chr  "16500" "13950" "17450" "15250" ...
# Correct all the incorrect data types

df1$body.style        <- as.factor(df1$body.style)
df1$bore              <- as.numeric(df1$bore)
df1$stroke            <- as.numeric(df1$stroke)
df1$horsepower        <- as.numeric(df1$horsepower)
df1$peak.rpm          <- as.numeric(df1$peak.rpm)
df1$normalized.losses <- as.numeric(df1$normalized.losses)
df1$price             <- as.numeric(df1$price)
df1$make              <- as.factor(df1$make)
df1$fuel.type         <- as.factor(df1$fuel.type) 
df1$aspiration        <- as.factor(df1$aspiration)  
df1$drive.wheels      <- as.factor(df1$drive.wheels)  
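# engine.location is converted from its own column; the str() output below was produced
# with engine.type on the right-hand side, so it lists 7 engine types for engine.location
df1$engine.location   <- as.factor(df1$engine.location)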
df1$engine.type       <- as.factor(df1$engine.type)
df1$fuel.system       <- as.factor(df1$fuel.system)
df1$num.of.cylinders  <- as.factor(df1$num.of.cylinders) 

df1$num.of.doors      <- factor(df1$num.of.doors,levels = c("two","four"),ordered = TRUE)
levels(df1$num.of.doors)
## [1] "two"  "four"
df1$symboling         <- factor(df1$symboling,levels = c(-2,-1,0,1,2,3),ordered = TRUE)
levels(df1$symboling)
## [1] "-2" "-1" "0"  "1"  "2"  "3"
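Why an ordered factor suits symboling can be shown on a small example. The vector below is hypothetical; the level set is the one used in the chunk above.

```r
# Hypothetical symboling values
symboling <- c(1, 2, 0, -1, 3, -2)

# Same level order as in the report: -2 is the safest, 3 the riskiest
risk <- factor(symboling, levels = c(-2, -1, 0, 1, 2, 3), ordered = TRUE)

levels(risk)

# Ordered factors support comparisons against a level ...
riskier_than_zero <- risk > "0"
riskier_than_zero

# ... and order-based summaries such as max()
max(risk)
```

With an unordered factor, `risk > "0"` would return NA with a warning, so the ordering has to be declared explicitly.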

R chunks

str(df1)
## 'data.frame':    166 obs. of  26 variables:
##  $ symboling        : Ord.factor w/ 6 levels "-2"<"-1"<"0"<..: 4 5 5 5 4 4 3 5 3 3 ...
##  $ normalized.losses: num  NA 164 164 NA 158 158 NA 192 192 188 ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 2 2 2 2 2 2 3 3 3 ...
##  $ fuel.type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 2 2 1 1 1 ...
##  $ num.of.doors     : Ord.factor w/ 2 levels "two"<"four": 1 2 2 1 2 2 1 1 2 1 ...
##  $ body.style       : Factor w/ 2 levels "hatchback","sedan": 1 2 2 2 2 2 1 2 2 2 ...
##  $ drive.wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 2 1 2 2 2 1 3 3 3 ...
##  $ engine.location  : Factor w/ 7 levels "dohc","dohcv",..: 6 4 4 4 4 4 4 4 4 4 ...
##  $ wheel.base       : num  94.5 99.8 99.4 99.8 105.8 ...
##  $ length           : num  171 177 177 177 193 ...
##  $ width            : num  65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
##  $ height           : num  52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
##  $ curb.weight      : int  2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
##  $ engine.type      : Factor w/ 7 levels "dohc","dohcv",..: 6 4 4 4 4 4 4 4 4 4 ...
##  $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 4 3 2 2 2 2 2 3 3 4 ...
##  $ engine.size      : int  152 109 136 136 136 131 131 108 108 164 ...
##  $ fuel.system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  2.68 3.19 3.19 3.19 3.19 3.13 3.13 3.5 3.5 3.31 ...
##  $ stroke           : num  3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 2.8 3.19 ...
##  $ compression.ratio: num  9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
##  $ horsepower       : num  154 102 115 110 110 140 160 101 101 121 ...
##  $ peak.rpm         : num  5000 5500 5500 5500 5500 5500 5500 5800 5800 4250 ...
##  $ city.mpg         : int  19 24 18 19 19 17 16 23 23 21 ...
##  $ highway.mpg      : int  26 30 22 25 25 20 22 29 29 28 ...
##  $ price            : num  16500 13950 17450 15250 17710 ...

R chunks

# Count the missing values in each column, then replace them
colSums(is.na(df1))
##         symboling normalized.losses              make         fuel.type 
##                 0                26                 0                 0 
##        aspiration      num.of.doors        body.style      drive.wheels 
##                 0                 2                 0                 0 
##   engine.location        wheel.base            length             width 
##                 0                 0                 0                 0 
##            height       curb.weight       engine.type  num.of.cylinders 
##                 0                 0                 0                 0 
##       engine.size       fuel.system              bore            stroke 
##                 0                 0                 4                 4 
## compression.ratio        horsepower          peak.rpm          city.mpg 
##                 0                 1                 1                 0 
##       highway.mpg             price 
##                 0                 4
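# Numeric columns: fill the gaps with the mean or median of the observed values
df1$bore              <- impute(df1$bore, fun = mean)
df1$bore              <- round(df1$bore, 2)
df1$stroke            <- impute(df1$stroke, fun = mean)
df1$stroke            <- round(df1$stroke, 2)
df1$horsepower        <- impute(df1$horsepower, fun = median)
df1$peak.rpm          <- impute(df1$peak.rpm, fun = median)
df1$price             <- impute(df1$price, fun = mean)
# Factor column: impute() ignores fun for factors and fills with the most frequent level
# (base R's mode() returns the storage mode, not the most common value)
df1$num.of.doors      <- impute(df1$num.of.doors)
df1$normalized.losses <- impute(df1$normalized.losses, fun = median)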

sum(is.na(df1))
## [1] 0
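The imputation rules can also be sketched without any package. This is an assumption-laden illustration, not a reimplementation of the `impute()` calls above: `impute_with()` and `most_frequent()` are hypothetical helpers, and the vectors are made up.

```r
# Hypothetical toy vectors with one missing value each
bore       <- c(3.19, NA, 3.47, 2.68, 3.46)
horsepower <- c(102, 154, NA, 110, 160)
doors      <- factor(c("two", "four", NA, "four", "four"), levels = c("two", "four"))

# Replace every NA with a single value computed from the observed entries
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x[!is.na(x)])
  x
}

# Statistical mode: the most frequent level (base R's mode() does something else)
most_frequent <- function(x) names(which.max(table(x)))

bore       <- round(impute_with(bore, mean), 2)
horsepower <- impute_with(horsepower, median)
doors      <- impute_with(doors, most_frequent)

sum(is.na(bore)) + sum(is.na(horsepower)) + sum(is.na(doors))
```

The mean is sensitive to extreme values, which is one reason to prefer the median for skewed variables such as horsepower.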

Descriptive Statistics Cont.

R chunks

summary(df1$symboling)
## -2 -1  0  1  2  3 
##  3 15 51 50 27 20
summary(df1$body.style)
## hatchback     sedan 
##        70        96
# Cross-tabulate risk factor (rows) by body style (columns), with totals
tab1 <- table(df1$symboling, df1$body.style)
tab1 %>% addmargins()
##      
##       hatchback sedan Sum
##   -2          0     3   3
##   -1          2    13  15
##   0           8    43  51
##   1          27    23  50
##   2          13    14  27
##   3          20     0  20
##   Sum        70    96 166
# Column proportions: share of each risk level within a body style
tab2 <- tab1 %>% prop.table(margin = 2)
tab2
##     
##       hatchback      sedan
##   -2 0.00000000 0.03125000
##   -1 0.02857143 0.13541667
##   0  0.11428571 0.44791667
##   1  0.38571429 0.23958333
##   2  0.18571429 0.14583333
##   3  0.28571429 0.00000000
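To make the meaning of `margin = 2` concrete, this sketch rebuilds the counts printed above as a plain matrix (the object names `counts` and `props` are illustrative) and checks that the proportions are taken within each body style.

```r
# Observed counts copied from the contingency table above
counts <- matrix(
  c(0, 2, 8, 27, 13, 20,    # hatchback
    3, 13, 43, 23, 14, 0),  # sedan
  ncol = 2,
  dimnames = list(symboling  = c("-2", "-1", "0", "1", "2", "3"),
                  body.style = c("hatchback", "sedan"))
)

# margin = 2 divides every cell by its column total
props <- prop.table(counts, margin = 2)

colSums(props)            # each body style sums to 1
props["3", "hatchback"]   # 20 of the 70 hatchbacks have symboling 3
```

Using `margin = 1` instead would give the split between hatchbacks and sedans within each risk level, which answers a different question.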

Bar Plot

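# Grouped bars: one group per body style, one bar per risk level
barplot(tab2, main = "Body style of a Car by Risk factor",
        ylab = "Proportion within body style", xlab = "Body-Style",
        ylim = c(0, 1), beside = TRUE, legend.text = rownames(tab2),
        args.legend = list(x = "topright", horiz = TRUE, title = "Risk-factor"),
        col = brewer.pal(6, name = "Blues"))  # six risk levels, so six colours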

Hypothesis Testing

H0: There is no association between body style and risk factor.

HA: There is an association between body style and risk factor.

Assumption: No more than 25% of expected cell counts are below 5.
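The assumption can be checked directly from the observed counts. The sketch below computes each expected count as row total times column total divided by the grand total; the counts are those of the contingency table in the descriptive statistics, and the object names are illustrative.

```r
# Observed counts (rows: symboling, columns: body style)
observed <- matrix(
  c(0, 2, 8, 27, 13, 20,    # hatchback
    3, 13, 43, 23, 14, 0),  # sedan
  ncol = 2,
  dimnames = list(c("-2", "-1", "0", "1", "2", "3"), c("hatchback", "sedan"))
)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 3)

# Number and share of cells with an expected count below 5
n_below    <- sum(expected < 5)
share_below <- mean(expected < 5)
c(cells_below_5 = n_below, share = share_below)

# The rule is satisfied when the share does not exceed 25%
share_below <= 0.25
```

The test is applied to the expected counts, not the observed ones, so the zero cells in the observed table do not by themselves violate the assumption.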

R chunk of Hypothesis testing (Chi-square Test of Association)

chi2 <-chisq.test(tab1)
chi2
## 
##  Pearson's Chi-squared test
## 
## data:  tab1
## X-squared = 52.663, df = 5, p-value = 3.944e-10
chi2$observed
##     
##      hatchback sedan
##   -2         0     3
##   -1         2    13
##   0          8    43
##   1         27    23
##   2         13    14
##   3         20     0
chi2$expected
##     
##      hatchback     sedan
##   -2  1.265060  1.734940
##   -1  6.325301  8.674699
##   0  21.506024 29.493976
##   1  21.084337 28.915663
##   2  11.385542 15.614458
##   3   8.433735 11.566265
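# Two of the 12 cells (both in the -2 row) have expected counts below 5.
# That is about 17% of the cells, under the 25% limit, so the assumption holds.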
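The statistic reported by `chisq.test()` can be reproduced by hand, which shows what the test measures. This is a sketch under one assumption the slides do not state: a significance level of 0.05. The object names are illustrative.

```r
# Observed counts (rows: symboling, columns: body style)
observed <- matrix(
  c(0, 2, 8, 27, 13, 20,    # hatchback
    3, 13, 43, 23, 14, 0),  # sedan
  ncol = 2,
  dimnames = list(c("-2", "-1", "0", "1", "2", "3"), c("hatchback", "sedan"))
)

# Expected counts under H0 (no association)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value and the critical value for the assumed 0.05 level
p_value  <- pchisq(x2, df = dof, lower.tail = FALSE)
critical <- qchisq(0.95, df = dof)

c(statistic = x2, df = dof, p.value = p_value, critical = critical)

# H0 is rejected when the statistic exceeds the critical value
x2 > critical
```

At that assumed level, the statistic of about 52.66 on 5 degrees of freedom lies well beyond the critical value, so H0 of no association between body style and risk factor would be rejected.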

Hypothesis Testing Cont.

Discussion

References