Context

This dataset was collected from kaggle.com about car sale advertisements in 2016. Though there is couple well known car features datasets they seems quite simple and outdated.

Car topic is really interesting. But practicing with real raw data which has all inconvenient moments (as NA’s for example).

This dataset contains data for more than 9.5K cars sale in Ukraine. Most of them are used cars so it opens the possibility to analyze features related to car operation.

Content

Dataset contains 9576 rows and 10 variables with essential meanings:

car: manufacturer brand

price: seller’s price in advertisement (in USD)

body: car body type

mileage: as mentioned in advertisement (’000 Km)

engV: rounded engine volume (’000 cubic cm)

engType: type of fuel (“Other” in this case should be treated as NA)

registration: whether car registered in Ukraine or not

year: year of production

model: specific model name

drive: drive type

Inspiration

Data will be handy to study and practice different models and approaches. As a further step you can compare patters in Ukrainian market to your own domestic car market characteristics.

We can also compare the prices of car with the features invovled and the cost of advertisment.

1. Read your dataset in R and visualize the length and breadth of your dataset.

 setwd("C:/Users/Jaya/Desktop/intership/project")
carad.df <- read.csv(paste("car_ad.csv", sep=""))
View(carad.df)
dim(carad.df) 
## [1] 9576   10

->Lenght-10 Breadth-9576

2. Create a descriptive statistics (min, max, median etc) of each variable.

##             car           price               body         mileage     
##  Volkswagen   : 936   Min.   :     0   crossover:2069   Min.   :  0.0  
##  Mercedes-Benz: 921   1st Qu.:  4999   hatch    :1252   1st Qu.: 70.0  
##  BMW          : 694   Median :  9200   other    : 838   Median :128.0  
##  Toyota       : 541   Mean   : 15633   sedan    :3646   Mean   :138.9  
##  VAZ          : 489   3rd Qu.: 16700   vagon    : 722   3rd Qu.:194.0  
##  Renault      : 469   Max.   :547800   van      :1049   Max.   :999.0  
##  (Other)      :5526                                                    
##       engV          engType     registration      year     
##  Min.   : 0.100   Diesel:3013   no : 561     Min.   :1953  
##  1st Qu.: 1.600   Gas   :1722   yes:9015     1st Qu.:2004  
##  Median : 2.000   Other : 462                Median :2008  
##  Mean   : 2.646   Petrol:4379                Mean   :2007  
##  3rd Qu.: 2.500                              3rd Qu.:2012  
##  Max.   :99.990                              Max.   :2016  
##  NA's   :434                                               
##         model        drive     
##  E-Class   : 199        : 511  
##  A6        : 172   front:5188  
##  Camry     : 134   full :2500  
##  Vito ïàññ.: 131   rear :1377  
##  Lanos     : 127               
##  X5        : 119               
##  (Other)   :8694
##               vars    n     mean       sd median  trimmed     mad    min
## car*             1 9576    51.09    25.63   53.0    52.65   34.10    1.0
## price            2 9576 15633.32 24106.52 9200.0 10981.42 7709.52    0.0
## body*            3 9576     3.30     1.60    4.0     3.25    1.48    1.0
## mileage          4 9576   138.86    98.63  128.0   131.20   91.92    0.0
## engV             5 9142     2.65     5.93    2.0     2.10    0.74    0.1
## engType*         6 9576     2.65     1.33    3.0     2.69    1.48    1.0
## registration*    7 9576     1.94     0.23    2.0     2.00    0.00    1.0
## year             8 9576  2006.61     7.07 2008.0  2007.52    5.93 1953.0
## model*           9 9576   455.45   255.42  486.5   456.62  328.40    1.0
## drive*          10 9576     2.50     0.80    2.0     2.44    0.00    1.0
##                     max     range  skew kurtosis     se
## car*              87.00     86.00 -0.42    -1.04   0.26
## price         547800.00 547800.00  7.13    93.67 246.34
## body*              6.00      5.00 -0.06    -1.05   0.02
## mileage          999.00    999.00  1.30     5.15   1.01
## engV              99.99     99.89 15.18   239.61   0.06
## engType*           4.00      3.00 -0.14    -1.76   0.01
## registration*      2.00      1.00 -3.76    12.13   0.00
## year            2016.00     63.00 -1.55     3.87   0.07
## model*           888.00    887.00 -0.03    -1.18   2.61
## drive*             4.00      3.00  0.54    -0.46   0.01

3. Create one-way contingency tables for the categorical variables in your dataset.

a) engine type

mytable <- with(carad.df, table(engType))
prop.table(mytable)*100 #percentage
## engType
##    Diesel       Gas     Other    Petrol 
## 31.464077 17.982456  4.824561 45.728906
mytable #COUNT
## engType
## Diesel    Gas  Other Petrol 
##   3013   1722    462   4379

b) Body type of car

mytable <- with(carad.df, table(body))
prop.table(mytable)*100 #percentage
## body
## crossover     hatch     other     sedan     vagon       van 
## 21.606099 13.074353  8.751044 38.074353  7.539683 10.954470
mytable #count
## body
## crossover     hatch     other     sedan     vagon       van 
##      2069      1252       838      3646       722      1049

c) Registration

mytable <- with(carad.df, table(registration))
prop.table(mytable)*100 #percentage
## registration
##        no       yes 
##  5.858396 94.141604
mytable #count
## registration
##   no  yes 
##  561 9015

d) Drive

mytable <- with(carad.df, table(drive))
prop.table(mytable)*100 #percentage
## drive
##               front      full      rear 
##  5.336257 54.177109 26.106934 14.379699
mytable #count
## drive
##       front  full  rear 
##   511  5188  2500  1377

4. Create two-way contingency tables for the categorical variables in your dataset.Along with percentages

a) Car and body type

mytable1 <- xtabs(~ car+body, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

b) Car and Engine type

mytable2 <- xtabs(~ car+engType, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

c) Car and Registration

mytable3 <- xtabs(~ car+registration, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

d) Car and drive type

mytable4 <- xtabs(~ car+drive, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

e) Body and engine type

mytable5 <- xtabs(~ body+engType, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

f) body and Registraion

mytable6 <- xtabs(~ body+registration, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

g) body and drive type

mytable7 <- xtabs(~ body+drive, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

h)Engine and Registration

mytable8 <- xtabs(~ engType+registration, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

i) Engine and Drive type

mytable9 <- xtabs(~ engType+drive, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

j) Registation and Drive type

mytable10 <- xtabs(~ registration+drive, data=carad.df)
addmargins(mytable) #count
## drive
##       front  full  rear   Sum 
##   511  5188  2500  1377  9576
addmargins(prop.table(mytable)*100) #percentage
## drive
##                 front       full       rear        Sum 
##   5.336257  54.177109  26.106934  14.379699 100.000000

5. Draw a boxplot of the variables that belong to your study.

a) Boxplot of Advertisment cost of car.

boxplot(carad.df$price)

##b) Boxplot for mileage of car.

boxplot(carad.df$mileage)

##c) Boxplot of Advertisment cost of car.

boxplot(carad.df$price)

6. Draw Histograms for your suitable data fields.

a) For car and body

library(lattice)
histogram(~car | body, data=carad.df)

##b) For engine type and body

library(lattice)
histogram(~engType | body, data=carad.df)

##c) For car and drive

library(lattice)
histogram(~car | drive, data=carad.df)

7. Draw suitable plot for your data fields.

a) For price of advertising cars.

library(lattice)
plot(x=carad.df$price, y=carad.df$car)

#b) For milleage

library(lattice)
barplot(carad.df$mileage)

#C) For cars

library(lattice)
barplot(carad.df$engV)

9. Create a correlation matrix.

library(corrgram)
corrgram(carad.df, order=FALSE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of Variables")

10. Create a scatter plot matrix for your data set.

plot(carad.df[1:5])

11. Chi square test

a) Car and body type

  chisq.test(mytable1)
## Warning in chisq.test(mytable1): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable1
## X-squared = 6544.5, df = 430, p-value < 2.2e-16

b) Car and Engine type

  chisq.test(mytable2)
## Warning in chisq.test(mytable2): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable2
## X-squared = 3511.8, df = 258, p-value < 2.2e-16

c) Car and Registration

  chisq.test(mytable3)
## Warning in chisq.test(mytable3): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable3
## X-squared = 605.8, df = 86, p-value < 2.2e-16

d) Car and drive type

  chisq.test(mytable4)
## Warning in chisq.test(mytable4): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable4
## X-squared = 7250.9, df = 258, p-value < 2.2e-16

e) Body and engine type

  chisq.test(mytable5)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable5
## X-squared = 2533, df = 15, p-value < 2.2e-16

f) body and Registraion

  chisq.test(mytable6)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable6
## X-squared = 283.93, df = 5, p-value < 2.2e-16

g) body and drive type

  chisq.test(mytable7)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable7
## X-squared = 6291.6, df = 15, p-value < 2.2e-16

h)Engine and Registration

  chisq.test(mytable8)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable8
## X-squared = 307.3, df = 3, p-value < 2.2e-16

i) Engine and Drive type

  chisq.test(mytable9)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable9
## X-squared = 442.43, df = 9, p-value < 2.2e-16

j) Registation and Drive type

  chisq.test(mytable10)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable10
## X-squared = 92.759, df = 3, p-value < 2.2e-16

12. t-test

a) Car and body type

  t.test(mytable1)
## 
##  One Sample t-test
## 
## data:  mytable1
## t = 8.8793, df = 521, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  14.28605 22.40360
## sample estimates:
## mean of x 
##  18.34483

b) Car and Engine type

  t.test(mytable2)
## 
##  One Sample t-test
## 
## data:  mytable2
## t = 7.6262, df = 347, p-value = 2.344e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20.42048 34.61400
## sample estimates:
## mean of x 
##  27.51724

c) Car and Registration

  t.test(mytable3)
## 
##  One Sample t-test
## 
## data:  mytable3
## t = 5.2714, df = 173, p-value = 3.996e-07
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  34.42808 75.64088
## sample estimates:
## mean of x 
##  55.03448

d) Car and drive type

  t.test(mytable4)
## 
##  One Sample t-test
## 
## data:  mytable4
## t = 6.539, df = 347, p-value = 2.218e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.24047 35.79401
## sample estimates:
## mean of x 
##  27.51724

e) Body and engine type

  t.test(mytable5)
## 
##  One Sample t-test
## 
## data:  mytable5
## t = 4.2046, df = 23, p-value = 0.0003381
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  202.6946 595.3054
## sample estimates:
## mean of x 
##       399

f) body and Registraion

  t.test(mytable6)
## 
##  One Sample t-test
## 
## data:  mytable6
## t = 2.7029, df = 11, p-value = 0.02055
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   148.193 1447.807
## sample estimates:
## mean of x 
##       798

g) body and drive type

  t.test(mytable7)
## 
##  One Sample t-test
## 
## data:  mytable7
## t = 3.2922, df = 23, p-value = 0.003189
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  148.2914 649.7086
## sample estimates:
## mean of x 
##       399

h)Engine and Registration

  t.test(mytable8)
## 
##  One Sample t-test
## 
## data:  mytable8
## t = 2.1828, df = 7, p-value = 0.06537
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   -99.69983 2493.69983
## sample estimates:
## mean of x 
##      1197

i) Engine and Drive type

  t.test(mytable9)
## 
##  One Sample t-test
## 
## data:  mytable9
## t = 3.596, df = 15, p-value = 0.002647
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  243.7563 953.2437
## sample estimates:
## mean of x 
##     598.5

j) Registation and Drive type

  t.test(mytable10)
## 
##  One Sample t-test
## 
## data:  mytable10
## t = 1.9925, df = 7, p-value = 0.08658
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -223.5856 2617.5856
## sample estimates:
## mean of x 
##      1197

13) Regretion model 1

sow1 <- lm(carad.df$year~carad.df$price+carad.df$price+carad.df$mileage+carad.df$engV)
summary(sow1)
## 
## Call:
## lm(formula = carad.df$year ~ carad.df$price + carad.df$price + 
##     carad.df$mileage + carad.df$engV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.830  -1.198   1.165   3.230  28.596 
## 
## Coefficients:
##                    Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)       2.010e+03  1.284e-01 15654.946  < 2e-16 ***
## carad.df$price    6.853e-05  2.665e-06    25.714  < 2e-16 ***
## carad.df$mileage -3.074e-02  6.549e-04   -46.931  < 2e-16 ***
## carad.df$engV    -4.003e-02  1.028e-02    -3.895 9.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.804 on 9138 degrees of freedom
##   (434 observations deleted due to missingness)
## Multiple R-squared:  0.3078, Adjusted R-squared:  0.3076 
## F-statistic:  1355 on 3 and 9138 DF,  p-value: < 2.2e-16

Residual model 1 plot

  library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
  residualPlot(sow1)

##14) Regression model 2

sow2 <- lm(carad.df$price~carad.df$year+carad.df$mileage+carad.df$engV)
summary(sow2)
## 
## Call:
## lm(formula = carad.df$price ~ carad.df$year + carad.df$mileage + 
##     carad.df$engV)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45885  -9463  -3802   3672 515786 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.955e+06  7.704e+04 -25.378  < 2e-16 ***
## carad.df$year     9.847e+02  3.829e+01  25.714  < 2e-16 ***
## carad.df$mileage -4.265e+01  2.730e+00 -15.626  < 2e-16 ***
## carad.df$engV     2.894e+02  3.887e+01   7.445 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22000 on 9138 degrees of freedom
##   (434 observations deleted due to missingness)
## Multiple R-squared:  0.1641, Adjusted R-squared:  0.1638 
## F-statistic: 597.8 on 3 and 9138 DF,  p-value: < 2.2e-16

Residual model 2 plot

  library(car)
  residualPlot(sow2)

15) Regression model 3

sow3 <- lm(carad.df$mileage~carad.df$year+carad.df$price+carad.df$engV)
summary(sow3)
## 
## Call:
## lm(formula = carad.df$mileage ~ carad.df$year + carad.df$price + 
##     carad.df$engV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -448.89  -44.62   -5.09   44.03  912.72 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.283e+04  2.700e+02  47.514  < 2e-16 ***
## carad.df$year  -6.319e+00  1.346e-01 -46.931  < 2e-16 ***
## carad.df$price -6.102e-04  3.905e-05 -15.626  < 2e-16 ***
## carad.df$engV   5.896e-01  1.474e-01   4.001 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.22 on 9138 degrees of freedom
##   (434 observations deleted due to missingness)
## Multiple R-squared:  0.2774, Adjusted R-squared:  0.2772 
## F-statistic:  1169 on 3 and 9138 DF,  p-value: < 2.2e-16

Residual model 3 plot

  library(car)
  residualPlot(sow3)

##14) Regression model 4

sow4 <- lm(carad.df$price~carad.df$year+carad.df$mileage+carad.df$engV)
summary(sow4)
## 
## Call:
## lm(formula = carad.df$price ~ carad.df$year + carad.df$mileage + 
##     carad.df$engV)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45885  -9463  -3802   3672 515786 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.955e+06  7.704e+04 -25.378  < 2e-16 ***
## carad.df$year     9.847e+02  3.829e+01  25.714  < 2e-16 ***
## carad.df$mileage -4.265e+01  2.730e+00 -15.626  < 2e-16 ***
## carad.df$engV     2.894e+02  3.887e+01   7.445 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22000 on 9138 degrees of freedom
##   (434 observations deleted due to missingness)
## Multiple R-squared:  0.1641, Adjusted R-squared:  0.1638 
## F-statistic: 597.8 on 3 and 9138 DF,  p-value: < 2.2e-16

Residual model 4 plot

  library(car)
  residualPlot(sow4)