Fifty Shades of Cars

Are white, silver, grey or black the most popular colours for new cars?

Luke Browne s3689420 and Simon Boyd s3046049

Last updated: 28 October, 2018

Introduction

In March 2014, journalist Nina Nina Hardy(2014) published an article claiming 73% of new cars sold are coloured white, silver, grey or black.

As car enthusiasts, we’ve decided to test this assertation by examining car registrations recorded by Vic Roads for the month of August 2018.

The hypothesis is based on Hardy’s article that 73% of new cars sold are coloured white, silver, grey or black.

Problem Statement

The downloaded data was filtered to produce a list of suspected new cars.

This created a subset of the total data.

The assumption is that cars newly registered for the month were either built in 2018 or 2017 and the sellers were registered LMCT providers.

A chi-squared “Goodness of fit” test was applied to the data subset.

The results prompted the authors to test the colour distribution ratio to all older cars registered in the month of August 2018 and all cars registered in 2014 that remain on the register.

The same test was applied to these two additional subsets.

Data

The car registrations transfer data were taken from https://www.data.vic.gov.au/data/dataset/motor_vehicle_transfers and produced by Vic Roads.

carreg <- read_xlsx('Monthly Vehicle Transfer data.xlsx',skip=3,sheet="VACC Monthly Vehicle Transfers")

Data Cont.

## Classes 'tbl_df', 'tbl' and 'data.frame':    61276 obs. of  8 variables:
##  $ LMCT acquirer         : chr  "n" "n" "n" "n" ...
##  $ LMCT disposer         : chr  "n" "n" "n" "n" ...
##  $ Vehicle Make          : chr  "ALFA R" "ALFA R" "ALFA R" "ALFA R" ...
##  $ Model Name            : chr  "147" "147" "147" "147" ...
##  $ Colour primary        : chr  "BLK" "BLK" "BLU" "GLD" ...
##  $ Year of Manufacture   : num  2008 2009 2005 2001 2005 ...
##  $ Resi. address postcode: num  3083 3004 3089 3058 3337 ...
##  $ Count Transfer Id     : num  1 1 1 1 1 1 1 1 1 1 ...
LMCT acquirer LMCT disposer Vehicle Make Model Name Colour primary Year of Manufacture Resi. address postcode Count Transfer Id
n n ALFA R 147 BLK 2008 3083 1
n n ALFA R 147 BLK 2009 3004 1
n n ALFA R 147 BLU 2005 3089 1
n n ALFA R 147 GLD 2001 3058 1
n n ALFA R 147 RED 2005 3337 1
n n ALFA R 147 SIL 2005 3034 1

Data Cont.

    carreg$`LMCT acquirer` <- factor(carreg$`LMCT acquirer`,levels=c("n","y"),labels=c("No","Yes"))
    carreg$`LMCT disposer` <- factor(carreg$`LMCT disposer`,levels=c("n","y"),labels=c("No","Yes"))
    carreg$`Colour primary` <- factor(carreg$`Colour primary`)
    carreg <- carreg %>% mutate(Age=year(now())-`Year of Manufacture`)

Data Cont.

## Classes 'tbl_df', 'tbl' and 'data.frame':    61276 obs. of  9 variables:
##  $ LMCT acquirer         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LMCT disposer         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Vehicle Make          : chr  "ALFA R" "ALFA R" "ALFA R" "ALFA R" ...
##  $ Model Name            : chr  "147" "147" "147" "147" ...
##  $ Colour primary        : Factor w/ 17 levels "BLK","BLU","BRN",..: 1 1 2 6 14 15 15 14 1 1 ...
##  $ Year of Manufacture   : num  2008 2009 2005 2001 2005 ...
##  $ Resi. address postcode: num  3083 3004 3089 3058 3337 ...
##  $ Count Transfer Id     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Age                   : num  10 9 13 17 13 13 10 14 16 14 ...
LMCT acquirer LMCT disposer Vehicle Make Model Name Colour primary Year of Manufacture Resi. address postcode Count Transfer Id Age
No No ALFA R 147 BLK 2008 3083 1 10
No No ALFA R 147 BLK 2009 3004 1 9
No No ALFA R 147 BLU 2005 3089 1 13
No No ALFA R 147 GLD 2001 3058 1 17
No No ALFA R 147 RED 2005 3337 1 13
No No ALFA R 147 SIL 2005 3034 1 13

Descriptive Statistics and Visualisation

Examination of the age variable highlights an error - some cars had no year of manufacture, creating a age outlier of 2018 years.

We have made the assumption that the year of manuafacture for these cars are 2018, and, using a ifelse statement to change the year of manufacture.

Decsriptive Statistics Cont.

We then created a subset of data with cars that are made in 2017 and 2018.

carreg_subsetnew <- carreg %>% filter(Age<=1 & `LMCT disposer`=="Yes")
table1<- head(carreg_subsetnew)
knitr::kable(table1)
LMCT acquirer LMCT disposer Vehicle Make Model Name Colour primary Year of Manufacture Resi. address postcode Count Transfer Id Age
No Yes ALFA R 4C RED 2017 3101 1 1
No Yes ALFA R GIULIA BLK 2017 3002 1 1
No Yes ALFA R GIULIA BLK 2018 3145 1 0
No Yes ALFA R GIULIA GRY 2017 3107 1 1
No Yes ALFA R GIULIA RED 2018 3002 1 0
No Yes ALFA R GIULIA WHI 2017 3113 1 1

Hypothesis Testing

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\] To peform the test we needed to build the expected frequency table based on the article, this gave us specific values for 4 colors and 73% of new cars and stated the remaing colors were distributed evenly across the remaing 27%. This yielded expected colors of:

expcol <- c(0.17,0.02076923,0.02076923,0.02076923,0.02076923,0.02076923,0.02076923,
            0.18,0.02076923,0.02076923,0.02076923,0.02076923,0.02076923,0.02076923,0.17,0.21,0.02076923)

Hypthesis Testing Cont.

We then conducted a goodness of fit test to test our Hypothesis. The result of 26.30 tells us that \(x^2\) above this number has less than a 0.05 probability of occuring if \(H_0\) is True.

qchisq(p=0.05,df=16,lower.tail = FALSE)
## [1] 26.29623
carnewtab <- table(carreg_subsetnew$`Colour primary`)
carnewtab
## 
##  BLK  BLU  BRN  CRM  FWN  GLD  GRN  GRY  MRN  MVE  OGE  PNK  PUR  RED  SIL 
##  614  476   55    2    2   15   19  875    4    0   23    1    1  425  690 
##  WHI  YLW 
## 1616   24
prop.table(carnewtab)
## 
##          BLK          BLU          BRN          CRM          FWN 
## 0.1268071045 0.0983064849 0.0113589426 0.0004130525 0.0004130525 
##          GLD          GRN          GRY          MRN          MVE 
## 0.0030978934 0.0039239983 0.1807104502 0.0008261049 0.0000000000 
##          OGE          PNK          PUR          RED          SIL 
## 0.0047501033 0.0002065262 0.0002065262 0.0877736473 0.1425030979 
##          WHI          YLW 
## 0.3337463858 0.0049566295

Hypthesis Testing Cont.

GOFT <- chisq.test(carnewtab,p=expcol)
GOFT$observed
## 
##  BLK  BLU  BRN  CRM  FWN  GLD  GRN  GRY  MRN  MVE  OGE  PNK  PUR  RED  SIL 
##  614  476   55    2    2   15   19  875    4    0   23    1    1  425  690 
##  WHI  YLW 
## 1616   24
GOFT$expected
##       BLK       BLU       BRN       CRM       FWN       GLD       GRN 
##  823.1400  100.5646  100.5646  100.5646  100.5646  100.5646  100.5646 
##       GRY       MRN       MVE       OGE       PNK       PUR       RED 
##  871.5600  100.5646  100.5646  100.5646  100.5646  100.5646  100.5646 
##       SIL       WHI       YLW 
##  823.1400 1016.8200  100.5646
GOFT
## 
##  Chi-squared test for given probabilities
## 
## data:  carnewtab
## X-squared = 3737.4, df = 16, p-value < 2.2e-16

Hypthesis Testing Cont.

We then tested 2014, which is when the article was written. Again, if \(H_0\) is true, then any \(X^2\) value > 26.30 would have a 0.05 probability of occuring.

qchisq(p=0.05,df=16,lower.tail = FALSE)
## [1] 26.29623
carreg_subset2014 <- carreg %>% filter(`Year of Manufacture`==2014 |  `Year of Manufacture`==2013 )
car2014tab <- table(carreg_subset2014$`Colour primary`)
car2014tab
## 
##  BLK  BLU  BRN  CRM  FWN  GLD  GRN  GRY  MRN  MVE  OGE  PNK  PUR  RED  SIL 
## 1151  669  102    4    9   52   77 1201   26    0   99    6   14  615 1267 
##  WHI  YLW 
## 3032   28
prop.table(car2014tab)
## 
##          BLK          BLU          BRN          CRM          FWN 
## 0.1378113027 0.0801005747 0.0122126437 0.0004789272 0.0010775862 
##          GLD          GRN          GRY          MRN          MVE 
## 0.0062260536 0.0092193487 0.1437978927 0.0031130268 0.0000000000 
##          OGE          PNK          PUR          RED          SIL 
## 0.0118534483 0.0007183908 0.0016762452 0.0736350575 0.1517001916 
##          WHI          YLW 
## 0.3630268199 0.0033524904

Hypthesis Testing Cont.

GOFT3 <- chisq.test(car2014tab,p=expcol)
GOFT3$observed
## 
##  BLK  BLU  BRN  CRM  FWN  GLD  GRN  GRY  MRN  MVE  OGE  PNK  PUR  RED  SIL 
## 1151  669  102    4    9   52   77 1201   26    0   99    6   14  615 1267 
##  WHI  YLW 
## 3032   28
GOFT3$expected
##       BLK       BLU       BRN       CRM       FWN       GLD       GRN 
## 1419.8400  173.4646  173.4646  173.4646  173.4646  173.4646  173.4646 
##       GRY       MRN       MVE       OGE       PNK       PUR       RED 
## 1503.3600  173.4646  173.4646  173.4646  173.4646  173.4646  173.4646 
##       SIL       WHI       YLW 
## 1419.8400 1753.9200  173.4646
GOFT3
## 
##  Chi-squared test for given probabilities
## 
## data:  car2014tab
## X-squared = 4849.6, df = 16, p-value < 2.2e-16

Hypthesis Testing Cont.

carreg_subsetold <- carreg %>% filter(Age>1) # this subset excludes cars that have been removed from the register
qchisq(p=0.05,df=16,lower.tail = FALSE)
## [1] 26.29623
caroldtab <- table(carreg_subsetold$`Colour primary`)
caroldtab
## 
##   BLK   BLU   BRN   CRM   FWN   GLD   GRN   GRY   MRN   MVE   OGE   PNK 
##  7547  5770   464    67   264  1463  1453  6787   397    12   334    43 
##   PUR   RED   SIL   WHI   YLW 
##   201  4184 11341 14500   264
prop.table(caroldtab)
## 
##          BLK          BLU          BRN          CRM          FWN 
## 0.1369915231 0.1047358008 0.0084224283 0.0012161696 0.0047920713 
##          GLD          GRN          GRY          MRN          MVE 
## 0.0265560618 0.0263745439 0.1231961663 0.0072062587 0.0002178214 
##          OGE          PNK          PUR          RED          SIL 
## 0.0060626963 0.0007805268 0.0036485088 0.0759470694 0.2058593963 
##          WHI          YLW 
## 0.2632008858 0.0047920713

Hypthesis Testing Cont.

GOFT2 <- chisq.test(caroldtab,p=expcol)
GOFT2$observed
## 
##   BLK   BLU   BRN   CRM   FWN   GLD   GRN   GRY   MRN   MVE   OGE   PNK 
##  7547  5770   464    67   264  1463  1453  6787   397    12   334    43 
##   PUR   RED   SIL   WHI   YLW 
##   201  4184 11341 14500   264
GOFT2$expected
##       BLK       BLU       BRN       CRM       FWN       GLD       GRN 
##  9365.470  1144.198  1144.198  1144.198  1144.198  1144.198  1144.198 
##       GRY       MRN       MVE       OGE       PNK       PUR       RED 
##  9916.380  1144.198  1144.198  1144.198  1144.198  1144.198  1144.198 
##       SIL       WHI       YLW 
##  9365.470 11569.110  1144.198
GOFT2
## 
##  Chi-squared test for given probabilities
## 
## data:  caroldtab
## X-squared = 36241, df = 16, p-value < 2.2e-16

As \(x^2=36241\) and \(36241 > 26.30\) we reject \(H_0\)

Discussion

A Chi-squared goodness of fit test was conducted to determine whether the distribution of colours for new cards match the figures given by Nina Hardy in 2014. The test was statistically significant, with the \(x^2=3747.4\), \(df=16\) and \(p<0.001\).

The results suggest that for 2018. the color distribution of new cars does not match the figures given in Nina Hardy’s article.

We then conducted a second test on cars sold in 2014, the year the article was written. This test was also statistically significant, with \(x^2=4849.6\), \(df=16\) and \(p<0.001\).

Finally, we tested all calls greater then 1 year in age, and again we found statistically significant results, with \(x^2=386241\), \(df=16\) and \(p<0.001\)

The results obtained do not support the figures given in Nina Hardy’s article with sufficent statistical reliability.

The limitations for this test is that the data is limited to Victoria only, and the assumption that the cars sold were new cars only, not second-hand cars.

References

Hardy, Nina “Any colour as long as it’s white”, https://www.smh.com.au/money/borrowing/any-colour-you-like-as-long-as-its-white-20140308-34e0p.html , Sydney Morning Herald, 10 March 2014.