Context
This dataset of car sale advertisements from 2016 was collected from kaggle.com. Although there are a couple of well-known car feature datasets, they seem quite simple and outdated.
Cars are an interesting topic, and this dataset gives a chance to practice with real raw data, including its inconveniences (NAs, for example).
The dataset covers more than 9.5K cars for sale in Ukraine. Most of them are used cars, which opens the possibility of analyzing features related to car operation.
Content
The dataset contains 9576 rows and 10 variables with the following meanings:
car: manufacturer brand
price: seller’s price in advertisement (in USD)
body: car body type
mileage: as mentioned in advertisement (’000 Km)
engV: rounded engine volume (’000 cubic cm)
engType: type of fuel (“Other” in this case should be treated as NA)
registration: whether car registered in Ukraine or not
year: year of production
model: specific model name
drive: drive type
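Since "Other" in engType should be treated as NA, the recoding can be sketched as below. The data frame toy.df and its values are made up for illustration and are not taken from car_ad.csv.

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  car     = c("Ford", "BMW", "Toyota", "Audi"),
  engType = c("Petrol", "Other", "Diesel", "Other"),
  stringsAsFactors = FALSE
)

# Recode the placeholder category "Other" as a missing value
toy.df$engType[toy.df$engType == "Other"] <- NA

# Count the missing values in each column
na_counts <- colSums(is.na(toy.df))
print(na_counts)
```

The same two lines would apply to carad.df after it is read in, assuming engType is stored as character rather than as a factor.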
Inspiration
The data will be handy for studying and practicing different models and approaches. As a further step, you can compare patterns in the Ukrainian market with the characteristics of your own domestic car market.
We can also compare the advertised price of a car with the features involved.
1. Read your dataset in R and visualize the length and breadth of your dataset.
setwd("C:/Users/Jaya/Desktop/intership/project")
carad.df <- read.csv("car_ad.csv")
View(carad.df)
dim(carad.df)
## [1] 9576 10
-> The dataset has 9576 rows (length) and 10 columns (breadth).
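dim() returns the number of rows first and the number of columns second. A minimal sketch on a made-up data frame (toy.df and its values are illustrative only):

```r
# Toy data frame with 3 rows and 2 columns (illustrative values only)
toy.df <- data.frame(
  price = c(15500, 20500, 35000),
  year  = c(2010, 2011, 2008)
)

size   <- dim(toy.df)   # c(rows, columns)
n_rows <- nrow(toy.df)  # same as size[1]
n_cols <- ncol(toy.df)  # same as size[2]

cat("rows:", n_rows, "columns:", n_cols, "\n")
```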
3. Create one-way contingency tables for the categorical variables in your dataset.
a) engine type
mytable <- with(carad.df, table(engType))
prop.table(mytable)*100 #percentage
## engType
## Diesel Gas Other Petrol
## 31.464077 17.982456 4.824561 45.728906
mytable #COUNT
## engType
## Diesel Gas Other Petrol
## 3013 1722 462 4379
b) Body type of car
mytable <- with(carad.df, table(body))
prop.table(mytable)*100 #percentage
## body
## crossover hatch other sedan vagon van
## 21.606099 13.074353 8.751044 38.074353 7.539683 10.954470
mytable #count
## body
## crossover hatch other sedan vagon van
## 2069 1252 838 3646 722 1049
c) Registration
mytable <- with(carad.df, table(registration))
prop.table(mytable)*100 #percentage
## registration
## no yes
## 5.858396 94.141604
mytable #count
## registration
## no yes
## 561 9015
d) Drive
mytable <- with(carad.df, table(drive))
prop.table(mytable)*100 #percentage
## drive
## front full rear
## 5.336257 54.177109 26.106934 14.379699
mytable #count
## drive
## front full rear
## 511 5188 2500 1377
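The four tables above repeat the same two calls, so they can be produced with one helper. The drive table prints four counts under three labels, which suggests that one level is an empty string; that is an assumption, and the sketch below turns empty strings into NA so they are reported explicitly. The helper one_way() and the data frame toy.df are hypothetical names with made-up values.

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  engType = c("Petrol", "Diesel", "Petrol", "Gas"),
  drive   = c("front", "", "rear", "front"),
  stringsAsFactors = FALSE
)

# Assumption: blank entries mean "not stated", so treat them as missing
toy.df$drive[toy.df$drive == ""] <- NA

# Count and percentage for every level of one column, NA included
one_way <- function(x) {
  counts <- table(x, useNA = "ifany")
  data.frame(
    level   = names(counts),
    count   = as.vector(counts),
    percent = round(100 * as.vector(prop.table(counts)), 2)
  )
}

# One summary per categorical column
summaries <- lapply(toy.df, one_way)
print(summaries$drive)
```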
4. Create two-way contingency tables for the categorical variables in your dataset, along with percentages.
a) Car and body type
mytable1 <- xtabs(~ car+body, data=carad.df)
addmargins(mytable1) #count
addmargins(prop.table(mytable1)*100) #percentage
b) Car and Engine type
mytable2 <- xtabs(~ car+engType, data=carad.df)
addmargins(mytable2) #count
addmargins(prop.table(mytable2)*100) #percentage
c) Car and Registration
mytable3 <- xtabs(~ car+registration, data=carad.df)
addmargins(mytable3) #count
addmargins(prop.table(mytable3)*100) #percentage
d) Car and drive type
mytable4 <- xtabs(~ car+drive, data=carad.df)
addmargins(mytable4) #count
addmargins(prop.table(mytable4)*100) #percentage
e) Body and engine type
mytable5 <- xtabs(~ body+engType, data=carad.df)
addmargins(mytable5) #count
addmargins(prop.table(mytable5)*100) #percentage
f) Body and Registration
mytable6 <- xtabs(~ body+registration, data=carad.df)
addmargins(mytable6) #count
addmargins(prop.table(mytable6)*100) #percentage
g) body and drive type
mytable7 <- xtabs(~ body+drive, data=carad.df)
addmargins(mytable7) #count
addmargins(prop.table(mytable7)*100) #percentage
h) Engine and Registration
mytable8 <- xtabs(~ engType+registration, data=carad.df)
addmargins(mytable8) #count
addmargins(prop.table(mytable8)*100) #percentage
i) Engine and Drive type
mytable9 <- xtabs(~ engType+drive, data=carad.df)
addmargins(mytable9) #count
addmargins(prop.table(mytable9)*100) #percentage
j) Registration and Drive type
mytable10 <- xtabs(~ registration+drive, data=carad.df)
addmargins(mytable10) #count
addmargins(prop.table(mytable10)*100) #percentage
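A two-way table with margins can be sketched on a small made-up data frame as below. The names toy.df, tab and row_pct are illustrative; passing margin = 1 to prop.table() gives percentages within each row instead of percentages of the grand total.

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  body         = c("sedan", "sedan", "hatch", "van", "hatch", "sedan"),
  registration = c("yes", "yes", "no", "yes", "yes", "no")
)

# Cross-tabulate the two categorical variables
tab <- xtabs(~ body + registration, data = toy.df)

# Counts with row and column totals
counts_with_totals <- addmargins(tab)

# Percentages of the grand total, with totals
pct_of_total <- addmargins(prop.table(tab) * 100)

# Percentages within each body type (each row sums to 100)
row_pct <- prop.table(tab, margin = 1) * 100

print(counts_with_totals)
print(round(row_pct, 1))
```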
5. Draw a boxplot of the variables that belong to your study.
a) Boxplot of the advertised price of the car.
boxplot(carad.df$price)
b) Boxplot of the mileage of the car.
boxplot(carad.df$mileage)
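A boxplot can also be split by a categorical variable with a formula, which makes groups comparable side by side. A minimal sketch on made-up data (toy.df and its values are illustrative only):

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  price = c(5000, 7000, 9000, 3000, 4000, 60000),
  body  = c("sedan", "sedan", "sedan", "hatch", "hatch", "hatch")
)

# One box per body type; boxplot() also returns the statistics it draws
bp <- boxplot(price ~ body, data = toy.df,
              xlab = "body", ylab = "price (USD)")

# Row 3 of bp$stats holds the median of each group
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```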

6. Draw Histograms for your suitable data fields.
a) For car and body
library(lattice)
histogram(~car | body, data=carad.df)
b) For engine type and body
histogram(~engType | body, data=carad.df)
c) For car and drive
histogram(~car | drive, data=carad.df)
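As one possible alternative, a histogram can be drawn for a numeric field such as price, and counts of a categorical field can be shown with barchart(). This is a sketch on made-up data, assuming the lattice package is installed; toy.df and its values are illustrative only.

```r
library(lattice)

# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  price   = c(5000, 7000, 9000, 3000, 4000, 12000),
  body    = factor(c("sedan", "sedan", "sedan", "hatch", "hatch", "hatch")),
  engType = factor(c("Petrol", "Diesel", "Petrol", "Gas", "Petrol", "Diesel"))
)

# Histogram of a numeric variable, one panel per body type
p_hist <- histogram(~ price | body, data = toy.df, xlab = "price (USD)")

# Bar chart of counts for two categorical variables
counts <- xtabs(~ engType + body, data = toy.df)
p_bar  <- barchart(counts, stack = FALSE, auto.key = TRUE, xlab = "count")

# lattice objects must be printed explicitly when run from a script
print(p_hist)
print(p_bar)
```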

7. Draw suitable plots for your data fields.
11. Chi-squared test
a) Car and body type
chisq.test(mytable1)
## Warning in chisq.test(mytable1): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable1
## X-squared = 6544.5, df = 430, p-value < 2.2e-16
b) Car and Engine type
chisq.test(mytable2)
## Warning in chisq.test(mytable2): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable2
## X-squared = 3511.8, df = 258, p-value < 2.2e-16
c) Car and Registration
chisq.test(mytable3)
## Warning in chisq.test(mytable3): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable3
## X-squared = 605.8, df = 86, p-value < 2.2e-16
d) Car and drive type
chisq.test(mytable4)
## Warning in chisq.test(mytable4): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable4
## X-squared = 7250.9, df = 258, p-value < 2.2e-16
e) Body and engine type
chisq.test(mytable5)
##
## Pearson's Chi-squared test
##
## data: mytable5
## X-squared = 2533, df = 15, p-value < 2.2e-16
f) Body and Registration
chisq.test(mytable6)
##
## Pearson's Chi-squared test
##
## data: mytable6
## X-squared = 283.93, df = 5, p-value < 2.2e-16
g) body and drive type
chisq.test(mytable7)
##
## Pearson's Chi-squared test
##
## data: mytable7
## X-squared = 6291.6, df = 15, p-value < 2.2e-16
h) Engine and Registration
chisq.test(mytable8)
##
## Pearson's Chi-squared test
##
## data: mytable8
## X-squared = 307.3, df = 3, p-value < 2.2e-16
i) Engine and Drive type
chisq.test(mytable9)
##
## Pearson's Chi-squared test
##
## data: mytable9
## X-squared = 442.43, df = 9, p-value < 2.2e-16
j) Registration and Drive type
chisq.test(mytable10)
##
## Pearson's Chi-squared test
##
## data: mytable10
## X-squared = 92.759, df = 3, p-value < 2.2e-16
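The warning "Chi-squared approximation may be incorrect" in the tests involving car appears when some expected cell counts are small. One way to check this, and to get a p-value that does not rely on the approximation, is sketched below on made-up data (toy.df and its values are illustrative; B = 2000 is an arbitrary choice).

```r
# Toy data with a sparse category (illustrative values only)
toy.df <- data.frame(
  body  = c(rep("sedan", 12), rep("hatch", 8), rep("van", 3)),
  drive = c(rep("front", 9), rep("rear", 3),
            rep("front", 7), "rear",
            "rear", "rear", "front")
)

tab <- xtabs(~ body + drive, data = toy.df)

# Expected counts under independence; cells below 5 trigger the warning
expected    <- suppressWarnings(chisq.test(tab))$expected
small_cells <- sum(expected < 5)
cat("cells with expected count < 5:", small_cells, "\n")

# Monte Carlo p-value instead of the chi-squared approximation
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(sim)
```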
12. t-test
a) Car and body type
t.test(mytable1)
##
## One Sample t-test
##
## data: mytable1
## t = 8.8793, df = 521, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 14.28605 22.40360
## sample estimates:
## mean of x
## 18.34483
b) Car and Engine type
t.test(mytable2)
##
## One Sample t-test
##
## data: mytable2
## t = 7.6262, df = 347, p-value = 2.344e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20.42048 34.61400
## sample estimates:
## mean of x
## 27.51724
c) Car and Registration
t.test(mytable3)
##
## One Sample t-test
##
## data: mytable3
## t = 5.2714, df = 173, p-value = 3.996e-07
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 34.42808 75.64088
## sample estimates:
## mean of x
## 55.03448
d) Car and drive type
t.test(mytable4)
##
## One Sample t-test
##
## data: mytable4
## t = 6.539, df = 347, p-value = 2.218e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 19.24047 35.79401
## sample estimates:
## mean of x
## 27.51724
e) Body and engine type
t.test(mytable5)
##
## One Sample t-test
##
## data: mytable5
## t = 4.2046, df = 23, p-value = 0.0003381
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 202.6946 595.3054
## sample estimates:
## mean of x
## 399
f) Body and Registration
t.test(mytable6)
##
## One Sample t-test
##
## data: mytable6
## t = 2.7029, df = 11, p-value = 0.02055
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 148.193 1447.807
## sample estimates:
## mean of x
## 798
g) body and drive type
t.test(mytable7)
##
## One Sample t-test
##
## data: mytable7
## t = 3.2922, df = 23, p-value = 0.003189
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 148.2914 649.7086
## sample estimates:
## mean of x
## 399
h) Engine and Registration
t.test(mytable8)
##
## One Sample t-test
##
## data: mytable8
## t = 2.1828, df = 7, p-value = 0.06537
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -99.69983 2493.69983
## sample estimates:
## mean of x
## 1197
i) Engine and Drive type
t.test(mytable9)
##
## One Sample t-test
##
## data: mytable9
## t = 3.596, df = 15, p-value = 0.002647
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 243.7563 953.2437
## sample estimates:
## mean of x
## 598.5
j) Registration and Drive type
t.test(mytable10)
##
## One Sample t-test
##
## data: mytable10
## t = 1.9925, df = 7, p-value = 0.08658
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -223.5856 2617.5856
## sample estimates:
## mean of x
## 1197
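The calls above pass a contingency table to t.test(), which treats the cell counts as one numeric sample and tests whether their mean differs from 0 (for the car and body table the mean is 9576 / 522 = 18.34). To compare a numeric variable between two groups, a two-sample test with a formula is a common choice. The sketch below uses made-up data; toy.df and its values are illustrative only.

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  price        = c(5200, 6100, 7300, 8000, 9100,
                   2500, 3100, 3900, 4400, 5000),
  registration = rep(c("yes", "no"), each = 5)
)

# Welch two-sample t-test: does mean price differ by registration status?
res <- t.test(price ~ registration, data = toy.df)
print(res)

# Group means reported by the test
group_means <- res$estimate
```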
13. Regression model 1
sow1 <- lm(carad.df$year~carad.df$price+carad.df$mileage+carad.df$engV)
summary(sow1)
##
## Call:
## lm(formula = carad.df$year ~ carad.df$price + carad.df$mileage +
## carad.df$engV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.830 -1.198 1.165 3.230 28.596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.010e+03 1.284e-01 15654.946 < 2e-16 ***
## carad.df$price 6.853e-05 2.665e-06 25.714 < 2e-16 ***
## carad.df$mileage -3.074e-02 6.549e-04 -46.931 < 2e-16 ***
## carad.df$engV -4.003e-02 1.028e-02 -3.895 9.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.804 on 9138 degrees of freedom
## (434 observations deleted due to missingness)
## Multiple R-squared: 0.3078, Adjusted R-squared: 0.3076
## F-statistic: 1355 on 3 and 9138 DF, p-value: < 2.2e-16
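The same kind of model can be written with the data argument, which gives shorter coefficient names and lets predict() work with new data. This is a sketch on made-up data, not a re-run of the model above; toy.df, new_car and their values are illustrative only.

```r
# Toy data standing in for car_ad.csv (illustrative values only)
toy.df <- data.frame(
  year    = c(2003, 2006, 2008, 2010, 2012, 2014, 2015, 2016),
  price   = c(4000, 6500, 8000, 11000, 14000, 19000, 23000, 30000),
  mileage = c(250, 210, 180, 150, 120, 80, 50, 10)
)

# Columns are looked up in toy.df, so no toy.df$ prefix is needed
fit <- lm(year ~ price + mileage, data = toy.df)

coefs <- coef(fit)     # point estimates
ci    <- confint(fit)  # 95 percent confidence intervals

# Predicted year of production for a hypothetical advertisement
new_car <- data.frame(price = 10000, mileage = 160)
pred    <- predict(fit, newdata = new_car)

print(coefs)
print(pred)
```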