MATH1324 Assignment 2

Risk Factor Analysis for Cars

Karthik Kolume (s3857825), Melvin Meshach Ruban Angeljoy (s3884479), Prateek Kumar Singh (s3890089), Reema Sunder Kumble (s3880556)

Last updated: 23 May, 2021

Introduction

- We are analyzing a data set that has information on different attributes of automobiles.

- The data set contains details of different types of cars together with their risk factors.

- We want to find out whether there is a relationship between the body style (hatchback or sedan) of a car and its risk factor.

- To do so, we first load and pre-process the data, converting several variables to the correct data type. We then handle missing values by imputation and consider whether outliers need to be removed. Finally, because both variables of interest are categorical, we conduct a Chi-square test of association to test our hypothesis and interpret the results.

RPubs link information

- RPubs link: https://rpubs.com/s3857825/772903

Introduction Cont.

Hatchback Car

Sedan Car

Problem Statement

The purpose of this investigation is to find out whether there is a relationship between the body style of a car and its risk factor. We have chosen the factor variables symboling (risk factor) and body style (hatchback or sedan) for our analysis. We inspect the risk factor within each body style and ask whether the risk factor depends on the body style. We use the Chi-square test of association to determine whether any relationship we observe is statistically significant.

Data

Data source: UCI Machine learning Repository: Automobile Data Set
Link: http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This data set consists of three types of entities:
1. the specification of an auto in terms of various characteristics,
2. its assigned insurance risk rating,
3. its normalized losses in use as compared to other cars.

Data Cont.

Data Attribute Information and Range

symboling: -3, -2, -1, 0, 1, 2, 3. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. We will convert this variable into an ordered factor. Among the hatchbacks and sedans analysed here, the observed values range from -2 to 3.
normalized-losses: continuous from 65 to 256. This is the relative average loss payment per insured vehicle year.
make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo. This would be a factor variable.
fuel-type: diesel, gas. This would be a factor variable.
aspiration: std, turbo. This would be a factor variable.
num-of-doors: four, two. This would be a factor variable.
body-style: hardtop, wagon, sedan, hatchback, convertible. This would be a factor variable.
drive-wheels: 4wd, fwd, rwd. This would be a factor variable.

Data Cont.

engine-location: front, rear. This would be a factor variable.
wheel-base: continuous from 86.6 to 120.9.
length: continuous from 141.1 to 208.1.
width: continuous from 60.3 to 72.3.
height: continuous from 47.8 to 59.8.
curb-weight: continuous from 1488 to 4066.
engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor. This would be a factor variable.
num-of-cylinders: eight, five, four, six, three, twelve, two. This would be a factor variable.
engine-size: continuous from 61 to 326.
fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. This would be a factor variable.
bore: continuous from 2.54 to 3.94.
stroke: continuous from 2.07 to 4.17.
compression-ratio: continuous from 7 to 23.

Data Cont.

horsepower: continuous from 48 to 288.
peak-rpm: continuous from 4150 to 6600.
city-mpg: continuous from 13 to 49.
highway-mpg: continuous from 16 to 54.
price: continuous from 5118 to 45400.

As part of data pre-processing:

Converted ‘?’ entries to NA values.
Filtered the data to keep only hatchback and sedan cars.
Corrected the data types of the variables, e.g. character to numeric.
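The pre-processing steps listed above can be sketched on a tiny example. This is only an illustration under stated assumptions: the three rows and the four column names below are hypothetical stand-ins in the same comma-separated layout, not rows from the real file, and `na.strings = "?"` is an alternative way of treating ‘?’ as missing at import time rather than the approach used in the slides.

```r
# Hypothetical three-row sample in the same comma-separated layout (not real rows)
sample_text <- "1,?,hatchback,154
2,164,sedan,?
0,158,wagon,110"

cars <- read.table(text = sample_text, sep = ",",
                   col.names = c("symboling", "normalized.losses",
                                 "body.style", "horsepower"),
                   na.strings = "?",          # treat '?' as NA while reading
                   stringsAsFactors = FALSE)

# Keep only the two body styles under study
cars <- cars[cars$body.style %in% c("hatchback", "sedan"), ]

# Columns that contained '?' are read as numeric, so only the
# categorical column still needs a type conversion
cars$body.style <- factor(cars$body.style)

str(cars)
```

Reading ‘?’ as NA during import means the numeric columns never become character vectors, so fewer as.numeric() conversions are needed afterwards.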

Descriptive Statistics and Visualisation

The key variables in our investigation are symboling and body style (hatchback or sedan).
Loaded the data directly from the URL into a data frame using the read.table function and supplied appropriate column names.
Converted the missing-value marker ‘?’ to NA.
Used the impute function (from the Hmisc package) to fill in the missing values.
Counted the missing values per column with colSums before imputation and confirmed with sum(is.na(df1)) that none remained afterwards.
Since both variables under comparison are categorical, we do not check for outliers.

R chunks

library(dplyr)         # %>% and filter()
library(Hmisc)         # impute()
library(RColorBrewer)  # brewer.pal()

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df <- read.table(url,sep = ",", col.names = c("symboling",
                                              "normalized-losses",
                                              "make",
                                              "fuel-type",
                                              "aspiration",
                                              "num-of-doors",
                                              "body-style",
                                              "drive-wheels",
                                              "engine-location",
                                              "wheel-base",
                                              "length",
                                              "width",
                                              "height",
                                              "curb-weight",
                                              "engine-type",
                                              "num-of-cylinders",
                                              "engine-size",
                                              "fuel-system",
                                              "bore",
                                              "stroke",
                                              "compression-ratio",
                                              "horsepower",
                                              "peak-rpm",
                                              "city-mpg",
                                              "highway-mpg",
                                              "price"))

R chunks

# Mark the data set's '?' placeholders as missing
df[df == "?"] <- NA

# Keep only sedans and hatchbacks
df1 <- df %>% filter(body.style %in% c("sedan", "hatchback"))

str(df1)

## 'data.frame':    166 obs. of  26 variables:
##  $ symboling        : int  1 2 2 2 1 1 0 2 0 0 ...
##  $ normalized.losses: chr  NA "164" "164" NA ...
##  $ make             : chr  "alfa-romero" "audi" "audi" "audi" ...
##  $ fuel.type        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration       : chr  "std" "std" "std" "std" ...
##  $ num.of.doors     : chr  "two" "four" "four" "two" ...
##  $ body.style       : chr  "hatchback" "sedan" "sedan" "sedan" ...
##  $ drive.wheels     : chr  "rwd" "fwd" "4wd" "fwd" ...
##  $ engine.location  : chr  "front" "front" "front" "front" ...
##  $ wheel.base       : num  94.5 99.8 99.4 99.8 105.8 ...
##  $ length           : num  171 177 177 177 193 ...
##  $ width            : num  65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
##  $ height           : num  52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
##  $ curb.weight      : int  2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
##  $ engine.type      : chr  "ohcv" "ohc" "ohc" "ohc" ...
##  $ num.of.cylinders : chr  "six" "four" "five" "five" ...
##  $ engine.size      : int  152 109 136 136 136 131 131 108 108 164 ...
##  $ fuel.system      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ bore             : chr  "2.68" "3.19" "3.19" "3.19" ...
##  $ stroke           : chr  "3.47" "3.40" "3.40" "3.40" ...
##  $ compression.ratio: num  9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
##  $ horsepower       : chr  "154" "102" "115" "110" ...
##  $ peak.rpm         : chr  "5000" "5500" "5500" "5500" ...
##  $ city.mpg         : int  19 24 18 19 19 17 16 23 23 21 ...
##  $ highway.mpg      : int  26 30 22 25 25 20 22 29 29 28 ...
##  $ price            : chr  "16500" "13950" "17450" "15250" ...

# Correct all the incorrect data types

df1$body.style        <- as.factor(df1$body.style)
df1$bore              <- as.numeric(df1$bore)
df1$stroke            <- as.numeric(df1$stroke)
df1$horsepower        <- as.numeric(df1$horsepower)
df1$peak.rpm          <- as.numeric(df1$peak.rpm)
df1$normalized.losses <- as.numeric(df1$normalized.losses)
df1$price             <- as.numeric(df1$price)
df1$make              <- as.factor(df1$make)
df1$fuel.type         <- as.factor(df1$fuel.type) 
df1$aspiration        <- as.factor(df1$aspiration)  
df1$drive.wheels      <- as.factor(df1$drive.wheels)  
df1$engine.location   <- as.factor(df1$engine.location)
df1$engine.type       <- as.factor(df1$engine.type)
df1$fuel.system       <- as.factor(df1$fuel.system)
df1$num.of.cylinders  <- as.factor(df1$num.of.cylinders)

df1$num.of.doors      <- factor(df1$num.of.doors,levels = c("two","four"),ordered = TRUE)
levels(df1$num.of.doors)

## [1] "two"  "four"

df1$symboling         <- factor(df1$symboling,levels = c(-2,-1,0,1,2,3),ordered = TRUE)
levels(df1$symboling)

## [1] "-2" "-1" "0"  "1"  "2"  "3"

R chunks

str(df1)

## 'data.frame':    166 obs. of  26 variables:
##  $ symboling        : Ord.factor w/ 6 levels "-2"<"-1"<"0"<..: 4 5 5 5 4 4 3 5 3 3 ...
##  $ normalized.losses: num  NA 164 164 NA 158 158 NA 192 192 188 ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 2 2 2 2 2 2 3 3 3 ...
##  $ fuel.type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 2 2 1 1 1 ...
##  $ num.of.doors     : Ord.factor w/ 2 levels "two"<"four": 1 2 2 1 2 2 1 1 2 1 ...
##  $ body.style       : Factor w/ 2 levels "hatchback","sedan": 1 2 2 2 2 2 1 2 2 2 ...
##  $ drive.wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 2 1 2 2 2 1 3 3 3 ...
##  $ engine.location  : Factor w/ 1 level "front": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel.base       : num  94.5 99.8 99.4 99.8 105.8 ...
##  $ length           : num  171 177 177 177 193 ...
##  $ width            : num  65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
##  $ height           : num  52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
##  $ curb.weight      : int  2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
##  $ engine.type      : Factor w/ 7 levels "dohc","dohcv",..: 6 4 4 4 4 4 4 4 4 4 ...
##  $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 4 3 2 2 2 2 2 3 3 4 ...
##  $ engine.size      : int  152 109 136 136 136 131 131 108 108 164 ...
##  $ fuel.system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  2.68 3.19 3.19 3.19 3.19 3.13 3.13 3.5 3.5 3.31 ...
##  $ stroke           : num  3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 2.8 3.19 ...
##  $ compression.ratio: num  9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
##  $ horsepower       : num  154 102 115 110 110 140 160 101 101 121 ...
##  $ peak.rpm         : num  5000 5500 5500 5500 5500 5500 5500 5800 5800 4250 ...
##  $ city.mpg         : int  19 24 18 19 19 17 16 23 23 21 ...
##  $ highway.mpg      : int  26 30 22 25 25 20 22 29 29 28 ...
##  $ price            : num  16500 13950 17450 15250 17710 ...

R chunks

# Count the missing values in each column, then replace them
colSums(is.na(df1))

##         symboling normalized.losses              make         fuel.type 
##                 0                26                 0                 0 
##        aspiration      num.of.doors        body.style      drive.wheels 
##                 0                 2                 0                 0 
##   engine.location        wheel.base            length             width 
##                 0                 0                 0                 0 
##            height       curb.weight       engine.type  num.of.cylinders 
##                 0                 0                 0                 0 
##       engine.size       fuel.system              bore            stroke 
##                 0                 0                 4                 4 
## compression.ratio        horsepower          peak.rpm          city.mpg 
##                 0                 1                 1                 0 
##       highway.mpg             price 
##                 0                 4

df1$bore              <- impute(df1$bore,fun = mean) 
df1$bore              <- round(df1$bore,2)
df1$stroke            <- impute(df1$stroke,fun = mean) 
df1$stroke            <- round(df1$stroke,2)
df1$horsepower        <- impute(df1$horsepower, fun= median)
df1$peak.rpm          <- impute(df1$peak.rpm, fun= median)
df1$price             <- impute(df1$price,fun = mean) 
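# For a factor, Hmisc::impute() fills with the most frequent level;
# base R's mode() returns the storage mode, not the statistical mode
df1$num.of.doors      <- impute(df1$num.of.doors)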
df1$normalized.losses <- impute(df1$normalized.losses, fun= median)

sum(is.na(df1))

## [1] 0
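For readers without the Hmisc package, the same kind of imputation can be sketched in base R. The vectors below are hypothetical illustrations, not values taken from the data set, and the variable names (`bore`, `hp`, `doors`, `most_common`) are chosen only for this example. Note that base R has no built-in statistical mode function, so the most frequent level is found with table().

```r
# Hypothetical vectors with gaps (not taken from the data set)
bore  <- c(3.19, NA, 3.47, 2.68)
hp    <- c(102, 154, NA, 110, 115)
doors <- factor(c("four", "two", NA, "four"), levels = c("two", "four"))

# Fill numeric gaps with the mean or the median of the observed values
bore[is.na(bore)] <- round(mean(bore, na.rm = TRUE), 2)
hp[is.na(hp)]     <- median(hp, na.rm = TRUE)

# Fill a categorical gap with the most frequent level
most_common <- names(which.max(table(doors)))
doors[is.na(doors)] <- most_common

# Number of missing values left in each vector
c(bore = sum(is.na(bore)), hp = sum(is.na(hp)), doors = sum(is.na(doors)))
```

The mean is sensitive to extreme values, so the median is often the safer fill for skewed variables such as horsepower or price.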

Descriptive Statistics Cont.

Used the summary function to obtain the counts for each level of the two variables.
Produced a cross-tabulation of the counts and column proportions of risk factor (symboling) by body style (hatchback or sedan).
Visualized the association between body style and risk factor (symboling) using a clustered bar chart.

R chunks

summary(df1$symboling)

## -2 -1  0  1  2  3 
##  3 15 51 50 27 20

summary(df1$body.style)

## hatchback     sedan 
##        70        96

tab1 <- table(df1$symboling, df1$body.style)
tab1 %>% addmargins()

##      
##       hatchback sedan Sum
##   -2          0     3   3
##   -1          2    13  15
##   0           8    43  51
##   1          27    23  50
##   2          13    14  27
##   3          20     0  20
##   Sum        70    96 166

tab2 <- tab1 %>% prop.table(margin = 2)
tab2

##     
##       hatchback      sedan
##   -2 0.00000000 0.03125000
##   -1 0.02857143 0.13541667
##   0  0.11428571 0.44791667
##   1  0.38571429 0.23958333
##   2  0.18571429 0.14583333
##   3  0.28571429 0.00000000
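The column proportions above can be reproduced without downloading the data by typing in the counts from the cross-tabulation. This is a minimal sketch: the object names (`counts`, `col_prop`, `high`) are introduced only for illustration, and the grouping of ratings 1 to 3 as "positive" is an assumption made here to summarise the table.

```r
# Counts copied from the cross-tabulation (rows: symboling, columns: body style)
counts <- matrix(c(0, 2, 8, 27, 13, 20,     # hatchback
                   3, 13, 43, 23, 14, 0),   # sedan
                 ncol = 2,
                 dimnames = list(symboling  = c("-2", "-1", "0", "1", "2", "3"),
                                 body.style = c("hatchback", "sedan")))

# margin = 2 divides every cell by its column total,
# so the proportions are within body style
col_prop <- prop.table(as.table(counts), margin = 2)
colSums(col_prop)

# Share of each body style with a positive risk rating (1, 2 or 3)
high <- colSums(col_prop[c("1", "2", "3"), ])
round(high, 3)
```

Because the proportions are computed within each body style, the two columns can be compared directly even though there are more sedans than hatchbacks.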

Bar Plot

barplot(tab2, main = "Body style of a Car by Risk factor",
        ylab = "Proportion within body style", xlab = "Body-Style",
        ylim = c(0, 1), beside = TRUE, legend.text = rownames(tab2),
        args.legend = list(x = "topright", horiz = TRUE, title = "Risk-factor"),
        col = brewer.pal(6, name = "Blues"))  # one colour per symboling level

Hypothesis Testing

Performed a Chi-square test of association to determine whether a car’s risk factor is associated with its body style (hatchback or sedan), using the 0.05 level of significance.
The null and alternative hypotheses and the assumption of the test are stated below.

H0: There is no association between body style and risk factor.

HA: There is an association between body style and risk factor.

Assumption: No more than 25% of expected cell counts are below 5.

Reported the statistic X-squared, df, and p-value from the results of the hypothesis test, and commented on the assumption checked.
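The assumption can be checked directly from the observed counts. Under the null hypothesis, the expected count in each cell is (row total × column total) / grand total. The sketch below applies that formula to the counts from the cross-tabulation; the object names (`observed`, `expected`, `below5`) are illustrative only.

```r
# Observed counts from the cross-tabulation (rows: symboling, columns: body style)
observed <- matrix(c(0, 2, 8, 27, 13, 20,
                     3, 13, 43, 23, 14, 0),
                   ncol = 2,
                   dimnames = list(c("-2", "-1", "0", "1", "2", "3"),
                                   c("hatchback", "sedan")))

n <- sum(observed)

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 3)

# Number and share of cells with an expected count below 5
below5 <- expected < 5
sum(below5)
mean(below5)

# The assumption holds if at most 25% of cells fall below 5
mean(below5) <= 0.25
```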

R chunk of Hypothesis testing(Chi-square Test of Association)

chi2 <- chisq.test(tab1)
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  tab1
## X-squared = 52.663, df = 5, p-value = 3.944e-10

chi2$observed

##     
##      hatchback sedan
##   -2         0     3
##   -1         2    13
##   0          8    43
##   1         27    23
##   2         13    14
##   3         20     0

chi2$expected

##     
##      hatchback     sedan
##   -2  1.265060  1.734940
##   -1  6.325301  8.674699
##   0  21.506024 29.493976
##   1  21.084337 28.915663
##   2  11.385542 15.614458
##   3   8.433735 11.566265

# Two of the 12 cells (16.7%), both in the -2 row, have expected counts below 5.
# This is under the 25% threshold, so the assumption is met.

Hypothesis Testing Cont.

Used the p-value to make a decision about the null hypothesis.
Since p-value = 3.944e-10 < 0.05, we reject H0.
The result of the Chi-square test of association was statistically significant.
There was sufficient evidence of an association between symboling and the body style (hatchback or sedan) of a car.
In other words, the distribution of risk factors differs between hatchbacks and sedans.
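The decision can be reproduced from the reported statistic and degrees of freedom alone. The sketch below is only an illustration: it uses the rounded X-squared value from the output, so the p-value may differ slightly in the last digits, and the variable names are chosen for this example.

```r
x_squared <- 52.663   # reported test statistic (rounded)
df_chi    <- 5        # (6 rows - 1) * (2 columns - 1)
alpha     <- 0.05     # significance level

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df_chi, lower.tail = FALSE)
p_value

# Equivalent critical-value approach
critical <- qchisq(1 - alpha, df = df_chi)
critical

# Both comparisons lead to the same decision: reject H0 when TRUE
p_value < alpha
x_squared > critical
```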

Discussion

The major finding of our investigation is a statistically significant association between symboling (risk factor) and the body style (hatchback or sedan) of a car. The test shows association only; it does not show that body style causes the risk rating.
In the sample, 60 of 70 hatchbacks (85.7%) have a positive risk rating (1 to 3), whereas 59 of 96 sedans (61.5%) are rated 0 or below.
Limitation: the analysis considers body style alone. Other factors, such as weather conditions, road conditions and human error, may also affect the risk factor of an automobile.
For future investigations, we could examine the relationship between the risk factor and other attributes of a car using other statistical methods.
The conclusion of this investigation is that a car's risk rating is associated with its body style: hatchbacks tend to receive higher risk ratings than sedans.
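A significant p-value says that an association exists but not how strong it is. One common way to gauge strength for a contingency table is Cramér's V, which was not part of the original analysis and is shown here only as a hedged supplement. The sketch uses the reported statistic and sample size; the variable names are illustrative.

```r
x_squared <- 52.663   # reported chi-square statistic (rounded)
n         <- 166      # number of cars in the filtered data
n_rows    <- 6        # symboling levels
n_cols    <- 2        # body styles

# Cramér's V = sqrt(X^2 / (n * min(rows - 1, cols - 1))), ranging from 0 to 1
cramers_v <- sqrt(x_squared / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 3)
```

Values closer to 1 indicate a stronger association; how to label a particular value (small, moderate, large) depends on the convention adopted.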

References

Categorical Outliers Don’t Exist https://medium.com/owl-analytics/categorical-outliers-dont-exist-8f4e82070cb2
Data set details:
- Creator/Donor: Jeffrey C. Schlimmer (Jeffrey.Schlimmer ‘@’ a.gp.cs.cmu.edu)
- Sources:
  1. 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook.
  2. Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038.
  3. Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037.