Karthik Kolume (s3857825), Melvin Meshach Ruban Angeljoy (s3884479), Prateek Kumar Singh (s3890089), Reema Sunder Kumble (s3880556)
Last updated: 23 May, 2021
-We analyze a data set that describes various attributes of automobiles.
-The data set records different types of cars together with their risk factors.
-We want to find out whether there is a relationship between the body style (hatchback or sedan) of a car and its risk factor.
-To do so, we first load and pre-process the data, since some variables need data type conversion. We then clean the data by replacing any missing values and removing outliers if necessary, and finally carry out the hypothesis test. Because both variables under analysis are categorical, we use a Chi-square test of association.
RPubs link information
-RPubs link: https://rpubs.com/s3857825/772903
Hatchback Car
Sedan Car
The purpose of this investigation is to find out whether there is a relationship between the body style of a car and its risk factor. We have chosen the factor variables symboling (risk factor) and body-style (hatchback or sedan) for our analysis. We inspect the risk factor for each body style to see whether the risk factor depends on the body style, and we use the Chi-square test of association to determine whether any association is statistically significant.
Data
Data Attribute Information and Range
As part of data pre-processing, we:
Converted ‘?’ entries to NA values.
Filtered the data to keep only hatchback and sedan cars.
Corrected the data types of the variables, e.g. character to numeric.
library(dplyr)  # provides %>% and filter()
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
# read.table() turns the hyphens in these names into dots (e.g. body.style)
df <- read.table(url, sep = ",", col.names = c("symboling",
  "normalized-losses",
  "make",
  "fuel-type",
  "aspiration",
  "num-of-doors",
  "body-style",
  "drive-wheels",
  "engine-location",
  "wheel-base",
  "length",
  "width",
  "height",
  "curb-weight",
  "engine-type",
  "num-of-cylinders",
  "engine-size",
  "fuel-system",
  "bore",
  "stroke",
  "compression-ratio",
  "horsepower",
  "peak-rpm",
  "city-mpg",
  "highway-mpg",
  "price"))
# The raw file marks missing values with "?"
df[df == "?"] <- NA
# Keep only sedans and hatchbacks
df1 <- df %>% filter(body.style %in% c("sedan", "hatchback"))
str(df1)
## 'data.frame': 166 obs. of 26 variables:
## $ symboling : int 1 2 2 2 1 1 0 2 0 0 ...
## $ normalized.losses: chr NA "164" "164" NA ...
## $ make : chr "alfa-romero" "audi" "audi" "audi" ...
## $ fuel.type : chr "gas" "gas" "gas" "gas" ...
## $ aspiration : chr "std" "std" "std" "std" ...
## $ num.of.doors : chr "two" "four" "four" "two" ...
## $ body.style : chr "hatchback" "sedan" "sedan" "sedan" ...
## $ drive.wheels : chr "rwd" "fwd" "4wd" "fwd" ...
## $ engine.location : chr "front" "front" "front" "front" ...
## $ wheel.base : num 94.5 99.8 99.4 99.8 105.8 ...
## $ length : num 171 177 177 177 193 ...
## $ width : num 65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
## $ height : num 52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
## $ curb.weight : int 2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
## $ engine.type : chr "ohcv" "ohc" "ohc" "ohc" ...
## $ num.of.cylinders : chr "six" "four" "five" "five" ...
## $ engine.size : int 152 109 136 136 136 131 131 108 108 164 ...
## $ fuel.system : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
## $ bore : chr "2.68" "3.19" "3.19" "3.19" ...
## $ stroke : chr "3.47" "3.40" "3.40" "3.40" ...
## $ compression.ratio: num 9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
## $ horsepower : chr "154" "102" "115" "110" ...
## $ peak.rpm : chr "5000" "5500" "5500" "5500" ...
## $ city.mpg : int 19 24 18 19 19 17 16 23 23 21 ...
## $ highway.mpg : int 26 30 22 25 25 20 22 29 29 28 ...
## $ price : chr "16500" "13950" "17450" "15250" ...
# Correct all the incorrect data types
df1$body.style <- as.factor(df1$body.style)
df1$bore <- as.numeric(df1$bore)
df1$stroke <- as.numeric(df1$stroke)
df1$horsepower <- as.numeric(df1$horsepower)
df1$peak.rpm <- as.numeric(df1$peak.rpm)
df1$normalized.losses <- as.numeric(df1$normalized.losses)
df1$price <- as.numeric(df1$price)
df1$make <- as.factor(df1$make)
df1$fuel.type <- as.factor(df1$fuel.type)
df1$aspiration <- as.factor(df1$aspiration)
df1$drive.wheels <- as.factor(df1$drive.wheels)
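# Bug fix: the source column was engine.type, which copied the engine-type levels
# into engine.location (still visible in the str() output below).
df1$engine.location <- as.factor(df1$engine.location)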
df1$engine.type <- as.factor(df1$engine.type)
df1$fuel.system <- as.factor(df1$fuel.system)
df1$num.of.cylinders <- as.factor(df1$num.of.cylinders)
df1$num.of.doors <- factor(df1$num.of.doors, levels = c("two", "four"), ordered = TRUE)
levels(df1$num.of.doors)
## [1] "two" "four"
df1$symboling <- factor(df1$symboling, levels = c(-2, -1, 0, 1, 2, 3), ordered = TRUE)
levels(df1$symboling)
## [1] "-2" "-1" "0" "1" "2" "3"
str(df1)
## 'data.frame': 166 obs. of 26 variables:
## $ symboling : Ord.factor w/ 6 levels "-2"<"-1"<"0"<..: 4 5 5 5 4 4 3 5 3 3 ...
## $ normalized.losses: num NA 164 164 NA 158 158 NA 192 192 188 ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 2 2 2 2 2 2 3 3 3 ...
## $ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 2 2 1 1 1 ...
## $ num.of.doors : Ord.factor w/ 2 levels "two"<"four": 1 2 2 1 2 2 1 1 2 1 ...
## $ body.style : Factor w/ 2 levels "hatchback","sedan": 1 2 2 2 2 2 1 2 2 2 ...
## $ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 2 1 2 2 2 1 3 3 3 ...
## $ engine.location : Factor w/ 7 levels "dohc","dohcv",..: 6 4 4 4 4 4 4 4 4 4 ...
## $ wheel.base : num 94.5 99.8 99.4 99.8 105.8 ...
## $ length : num 171 177 177 177 193 ...
## $ width : num 65.5 66.2 66.4 66.3 71.4 71.4 67.9 64.8 64.8 64.8 ...
## $ height : num 52.4 54.3 54.3 53.1 55.7 55.9 52 54.3 54.3 54.3 ...
## $ curb.weight : int 2823 2337 2824 2507 2844 3086 3053 2395 2395 2710 ...
## $ engine.type : Factor w/ 7 levels "dohc","dohcv",..: 6 4 4 4 4 4 4 4 4 4 ...
## $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 4 3 2 2 2 2 2 3 3 4 ...
## $ engine.size : int 152 109 136 136 136 131 131 108 108 164 ...
## $ fuel.system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : num 2.68 3.19 3.19 3.19 3.19 3.13 3.13 3.5 3.5 3.31 ...
## $ stroke : num 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 2.8 3.19 ...
## $ compression.ratio: num 9 10 8 8.5 8.5 8.3 7 8.8 8.8 9 ...
## $ horsepower : num 154 102 115 110 110 140 160 101 101 121 ...
## $ peak.rpm : num 5000 5500 5500 5500 5500 5500 5500 5800 5800 4250 ...
## $ city.mpg : int 19 24 18 19 19 17 16 23 23 21 ...
## $ highway.mpg : int 26 30 22 25 25 20 22 29 29 28 ...
## $ price : num 16500 13950 17450 15250 17710 ...
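The column-by-column conversions above could also be written by listing each column once and converting groups of columns together, which reduces the chance of copying from the wrong column. The following is a minimal sketch on a small hypothetical data frame (`toy`, `numeric_cols`, and `factor_cols` are illustrative names, not part of the original analysis):

```r
# Hypothetical three-row data frame that mimics the character columns
# produced by reading the raw file.
toy <- data.frame(
  bore  = c("3.19", NA, "2.68"),
  price = c("16500", "13950", NA),
  make  = c("audi", "bmw", "audi"),
  stringsAsFactors = FALSE
)

# Name each column exactly once, grouped by target type.
numeric_cols <- c("bore", "price")
factor_cols  <- c("make")

# Convert every column in a group with the same function.
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)
toy[factor_cols]  <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Ordered factors such as symboling still need their own `factor(..., ordered = TRUE)` call, because their levels have to be given explicitly.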
## symboling normalized.losses make fuel.type
## 0 26 0 0
## aspiration num.of.doors body.style drive.wheels
## 0 2 0 0
## engine.location wheel.base length width
## 0 0 0 0
## height curb.weight engine.type num.of.cylinders
## 0 0 0 0
## engine.size fuel.system bore stroke
## 0 0 4 4
## compression.ratio horsepower peak.rpm city.mpg
## 0 1 1 0
## highway.mpg price
## 0 4
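The per-column counts of missing values shown above have the shape of a `colSums(is.na(...))` result. As a hedged illustration of how such counts can be obtained, here is a minimal sketch on a hypothetical data frame (`toy` and `na_counts` are illustrative names, and the values are made up):

```r
# Hypothetical data frame that uses "?" as the missing-value marker,
# as the raw automobile file does.
toy <- data.frame(
  bore  = c("3.19", "?", "2.68"),
  price = c("16500", "13950", "?"),
  make  = c("audi", "bmw", "audi"),
  stringsAsFactors = FALSE
)

# Replace the marker with NA, then count the NA values in each column.
toy[toy == "?"] <- NA
na_counts <- colSums(is.na(toy))
na_counts
```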
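library(Hmisc)  # provides impute()
# Numeric columns: fill missing values with the mean or the median
df1$bore <- impute(df1$bore, fun = mean)
df1$bore <- round(df1$bore, 2)
df1$stroke <- impute(df1$stroke, fun = mean)
df1$stroke <- round(df1$stroke, 2)
df1$horsepower <- impute(df1$horsepower, fun = median)
df1$peak.rpm <- impute(df1$peak.rpm, fun = median)
df1$price <- impute(df1$price, fun = mean)
# Caution: base R mode() returns the storage mode, not the most frequent value.
# For a factor, impute() is expected to fill with the most frequent level.
df1$num.of.doors <- impute(df1$num.of.doors, fun = mode)
df1$normalized.losses <- impute(df1$normalized.losses, fun = median)
sum(is.na(df1))
## [1] 0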
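To make the imputation step concrete without relying on a package, the sketch below fills missing values in base R. It is an illustration under assumptions, not the method used in this report: `impute_value` and `most_frequent` are hypothetical helpers, and the data are made up.

```r
# Replace the NA entries of x with fun() applied to the observed values.
impute_value <- function(x, fun) {
  x[is.na(x)] <- fun(x[!is.na(x)])
  x
}

# Statistical mode: the most frequent value, returned as a string.
# (Base R's mode() reports the storage mode instead.)
most_frequent <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up example vectors with one missing value each.
horsepower <- c(102, NA, 110, 154)
doors <- factor(c("two", "four", NA, "four"), levels = c("two", "four"))

horsepower_filled <- impute_value(horsepower, median)
doors_filled <- impute_value(doors, most_frequent)

horsepower_filled
doors_filled
```

Median imputation is less sensitive to extreme values than mean imputation, which is one reason to prefer it for skewed variables such as horsepower.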
Used the summary function to obtain summary statistics of the data.
Produced a cross-tabulation of the counts and proportions of hatchback and sedan cars by risk factor (symboling).
Visualized the association between body style and risk factor (symboling) using a clustered bar chart.
## -2 -1 0 1 2 3
## 3 15 51 50 27 20
## hatchback sedan
## 70 96
##
## hatchback sedan Sum
## -2 0 3 3
## -1 2 13 15
## 0 8 43 51
## 1 27 23 50
## 2 13 14 27
## 3 20 0 20
## Sum 70 96 166
##
## hatchback sedan
## -2 0.00000000 0.03125000
## -1 0.02857143 0.13541667
## 0 0.11428571 0.44791667
## 1 0.38571429 0.23958333
## 2 0.18571429 0.14583333
## 3 0.28571429 0.00000000
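The tables above can be reproduced from the observed counts alone. The sketch below rebuilds the cross-tabulation from the counts printed in this report. It assumes `tab1` holds the counts and `tab2` the proportions within each body style, which is consistent with each column of the proportion table summing to 1.

```r
# Observed counts copied from the cross-tabulation above
# (rows: symboling -2..3, columns: body style).
tab1 <- as.table(matrix(
  c(0, 2, 8, 27, 13, 20,    # hatchback
    3, 13, 43, 23, 14, 0),  # sedan
  ncol = 2,
  dimnames = list(symboling = c("-2", "-1", "0", "1", "2", "3"),
                  body.style = c("hatchback", "sedan"))
))

# Counts with row and column totals.
tab1_totals <- addmargins(tab1)

# Proportions within each body style (each column sums to 1).
tab2 <- prop.table(tab1, margin = 2)

tab1_totals
round(tab2, 4)
```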
library(RColorBrewer)  # provides brewer.pal()
# tab2 holds proportions within each body style; six colours for six risk levels
barplot(tab2, main = "Body style of a Car by Risk factor", ylab = "Proportion within body style",
  ylim = c(0, 1), legend = rownames(tab2), beside = TRUE,
  args.legend = list(x = "topright", horiz = TRUE, title = "Risk-factor"), xlab = "Body-Style", col = brewer.pal(6, name = "Blues"))
Performed an appropriate hypothesis test (Chi-square test of association) to determine whether a car’s risk factor is associated with its body style (hatchback or sedan). Used the 0.05 level of significance.
The null and alternative hypotheses for the test, and its assumption, are stated below.
H0: There is no association between body style and risk factor.
HA: There is an association between body style and risk factor.
Assumption: No more than 25% of expected cell counts are below 5. In the table of expected counts at the end of the test output, 2 of the 12 cells (about 17%) fall below 5, so the assumption holds.
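As a worked illustration of the quantities behind this test, the standard Pearson chi-square formulas are given below. The notation is introduced here for illustration only: $O_{ij}$ is an observed count, $E_{ij}$ an expected count, $R_i$ a row total, $C_j$ a column total, and $n$ the sample size.

```latex
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{6} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (6 - 1)(2 - 1) = 5

% Example: risk factor -2, hatchback (row total 3, column total 70, n = 166)
E_{-2,\,\text{hatchback}} = \frac{3 \times 70}{166} \approx 1.265
```

The example value matches the first entry of the expected-count table in the test output, and the degrees of freedom match the reported df = 5.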
##
## Pearson's Chi-squared test
##
## data: tab1
## X-squared = 52.663, df = 5, p-value = 3.944e-10
##
## hatchback sedan
## -2 0 3
## -1 2 13
## 0 8 43
## 1 27 23
## 2 13 14
## 3 20 0
##
## hatchback sedan
## -2 1.265060 1.734940
## -1 6.325301 8.674699
## 0 21.506024 29.493976
## 1 21.084337 28.915663
## 2 11.385542 15.614458
## 3 8.433735 11.566265
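The test output above can be checked from the observed counts alone. The following is a minimal sketch, assuming `tab1` is the 6 x 2 table of counts shown in the output. It rebuilds the table by hand, so it does not depend on the downloaded data.

```r
# Observed counts copied from the test output above
# (rows: symboling -2..3, columns: body style).
tab1 <- matrix(
  c(0, 2, 8, 27, 13, 20,    # hatchback
    3, 13, 43, 23, 14, 0),  # sedan
  ncol = 2,
  dimnames = list(symboling = c("-2", "-1", "0", "1", "2", "3"),
                  body.style = c("hatchback", "sedan"))
)

# chisq.test() warns when some expected counts are small; the warning is
# suppressed here because the assumption is checked explicitly below.
res <- suppressWarnings(chisq.test(tab1))

res$statistic            # Pearson X-squared
res$parameter            # degrees of freedom
res$p.value < 0.05       # TRUE means H0 is rejected at the 0.05 level
mean(res$expected < 5)   # share of expected counts below 5
```

With the reported p-value of 3.944e-10, well below 0.05, the test rejects H0: the data show a statistically significant association between body style and risk factor.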
Categorical Outliers Don’t Exist https://medium.com/owl-analytics/categorical-outliers-dont-exist-8f4e82070cb2
Data set details: