As an international city, Shanghai has become the center of attention of East Asia, along with the expedited development, traffic and air quality became one of the most cumbersome issues. We will do some exploratory data analysis for overall car license by year,by season and by month, and perform the ARIMA time series.
Note: the price is in ????RMB . 1 Dollar = 7 RMB
We will clean the missing data, assign the correct seasons ,months and days.
$b <- as.character.Date(shcar$Date)
shcar$a <- strsplit(as.character(shcar$b), ".", fixed=TRUE)
shcar
#seperate year and month
$year <- substr(shcar$b, 1,4)
shcar$t1 <- substr(shcar$b, 6,8)
shcar
#clean 20-90
$t <- ifelse ((shcar$t %in% c("20","30","40","50","60","70","80","90")),
shcar
substr(shcar$t,1,1), shcar$t)
$o <- lead(shcar$t, 1)
shcar
#clean 10 (oct, Jan)
$tt <- ifelse( shcar$t=="10" & shcar$o=="2", substr(shcar$t,1,1), shcar$t)
shcar
$ttt <- ifelse( shcar$t=="10" & is.na(shcar$o) ,substr(shcar$t,1,2), shcar$tt)
shcar
#clean 01-09
$ttt <- ifelse (shcar$ttt %in% c("01" , "02", "03", "04","05", "06","07","08" ,"09"),
shcar
substr(shcar$ttt,2,3), shcar$ttt)
#assign season winter(12,1,2), spring(3,4,5) , summer(6,7,8), fall(9,10,11)
$s <- ifelse(shcar$ttt %in% c("1" , "2", "12"), "Winter",
shcar
ifelse( shcar$ttt %in% c("3" , "4" , "5"), "Spring",
ifelse ( shcar$ttt %in% c("6", "7", "8"), "Summer",
"Fall")))
#combine year and month
$d <- paste( as.numeric(shcar$year), as.numeric(shcar$ttt), sep="-")
shcar
$date <- as.yearmon(shcar$d)
shcar
$date1 <- as.Date(as.yearmon(shcar$d))
shcar
$tttt <- as.numeric(shcar$ttt)
shcar
$ttt3 <- factor(shcar$tttt,
shcar
labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul",
"Aug","Sep","Oct","Nov","Dec"))
<- shcar[,c("X","date1","s","avg_deal_price","lowest_deal_price","num_bidder","num_plates")]
msample
head(msample, 20)%>% DT::datatable()
Color Scheme: Set 3
#colorRampPalette(brewer.pal(9,"Set3"))(100)
display.brewer.pal(16, "Set3")
we will perform our analysis in three steps, we first look the overall price over the 12 years from 2002-2017.
then we will perform an analysis by each year. followed by we will group all the seasons over 12 years
see if the price will affected by the season. If so, we will break down to analysis by month.
From the graph we can see the license plate price increased drastically, especially from 2014. It went up from around 20,000Yuan
to nearly 90,000Yuan with in 16 years. Anther feature we can see from the graph is lowest bid price is very close to the
average bid price, and difference is getting closer and closer espcially after 2013, the difference is within 200 yuan.
The second plot showed the participants of license bid has an outrageous increasing rate, particually happen around 2014.
However, the number of plate offer to public is significantly less than the bid number.The number of bids went from average
of 4,373 a year to 253,335 applications a year. Netherless, the license plate is only offer 2,654 to 11,023 per year on average.
<- ggplot(shcar, aes(x=tttt,y=avg_deal_price, size=num_bidder, colour=as.factor(ttt3),alpha=0.05, group = 1))+ geom_point()+
tyeargeom_line(aes(x=tttt, y=avg_deal_price, colour=as.factor(ttt3)))+
facet_wrap(~year, ncol=3)+ theme_minimal(base_size = 8) +theme(axis.text.x=element_text(angle=45,hjust=1))+ guides(colour=FALSE)+
labs(x="Year of Review", y="Average Rating", title="Chocolate Average Rating by Year")
tyear
#plate price
<- cbind(shcar$avg_deal_price, shcar$lowest_deal_price)
car1
<-xts(car1, shcar$date)
car2
dygraph(car2, main="ShangHai Car License Bid Price") %>%
dyOptions(colors=RColorBrewer::brewer.pal(3, "Set3"), fillGraph=TRUE, fillAlpha=0.4 ,
axisLineWidth=1.5) %>%
dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.2, hideOnMouseOut=FALSE)%>% dyAxis("x", label = "year") %>% dyLegend(show="follow") %>%
dySeries("V1", label="Average Deal Price") %>% dySeries("V2", label="Lowest Deal Price")
#plate price difference
<- shcar %>% mutate(pricediff= avg_deal_price -lowest_deal_price)
car3
<- xts(car3$pricediff, car3$date)
car4
dygraph(car4, main="ShangHai Car License Bid Price Difference") %>%
dyOptions(colors=RColorBrewer::brewer.pal(3, "Set2"), fillGraph=TRUE, fillAlpha=0.4 , axisLineWidth=1.5) %>%
dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.6, hideOnMouseOut=FALSE)%>% dyAxis("x", label = "year") %>% dyLegend(show="follow") %>%
dySeries("V1", label="Bid Price Difference")
#plate number
<- cbind(shcar$num_bidder, shcar$num_plates)
plate1
<- xts(plate1, shcar$date)
plate2
dygraph(plate2, main="Shanghai Car License bid number") %>%
dyOptions(colors=RColorBrewer::brewer.pal(4, "Set3"), fillGraph=TRUE, fillAlpha=0.4 ,
axisLineWidth=1.5) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.2)%>% dyAxis("x", label = "year") %>%
dySeries("V1", label="Number of Bidder") %>% dySeries("V2", label="Number of Plates")
#line plot
<- select(shcar, avg_deal_price, ttt3, year)
line1 colnames(line1) <-c("avgprice","mon","year")
$avgprice <- as.numeric(line1$avgprice)
line1
<- ggplot(line1, aes(x=mon, y=avgprice, colour=year, group=year))+geom_point()+geom_line()+
l1scale_y_continuous( breaks=c(0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000))+
geom_dl(aes(label= year), method=list("first.points", alpha=0.6 ,rot=-45))+
labs(x="Month", y="Avgerage Price", title="Average Price of ShangHai Car License Plate From 2002-2017")+ theme_minimal()
l1
Now we will investigate more into the data, by compare them in year. We will group the data by year then take the average price of each year. We first show a bar plot of the relationship between car license plate applicants and number of car plates. The graph demonstrated from 2014, the number of applicants grow excessively out of the plates. Then the growth rate plot showed the number of license plates are relatively stable compare the the number of applicants. The Car Plate Winning rate supported the previous conclusion, before 2014, the winning rate is 30% or more. from 2014, the winning rate decreased to less than 10%, nowadays it is around 5%. Moreover, the growth of average price is within 10% from 2015. Then we will plot the growth rate with 2002 as the baseline, see how it changes every year respect to 2002. We can see, the winning rate, the average price is gradually converges to stable from 2015. I guess the main reason is the market is gradually saturated.
ggplot(shcar,aes(x=avg_deal_price, fill=as.factor(year)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(year))+coord_cartesian(ylim = c(0, 2e-04))+guides(fill="none") +labs(x="Price", y="Density") +ggtitle("Average Car License Plate Price Year")
ggplot(shcar,aes(x=num_bidder, fill=as.factor(year)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(year))+coord_cartesian(ylim = c(0, 1e-04))+ guides(fill="none") +labs(x="Bidder Number", y="Density") +ggtitle("Average Car License Plate Bidder Number Year")
#dlyr aggregate them in year.
<- group_by(shcar, year)
year1
#Calculate the average of deal price, lowest price, number of bid and number of plates
<- summarise(year1, count=n(),
year2
aprice= mean(avg_deal_price),
lprice=mean(lowest_deal_price), abid=mean(num_bidder),
anum=mean(num_plates))
#Calucate the chance getting win the plate
<- year2 %>% mutate(win=anum/abid*100,
year2
gbid=(abid-lag(abid))*100/lag(abid),
gnum=(anum-lag(anum))*100/lag(anum),
gaprice=(aprice-lag(aprice))*100/lag(aprice),
gbaprice=round(((aprice-20956)*100/20956),0), gbbid=round((abid-4373)*100/4373,0),
gbnum=round((anum-2654)*100/2654,0),
mon1=12)
#Compare the average bid number and plate number
<- ggplot(year2, aes(x=year, fill="white"))+
p1 geom_bar(aes(y=abid,fill="#ADB2A9"), stat="identity", position="dodge",col="#ADB2A9")+
geom_bar(aes(y=anum), stat="identity", position="dodge",col="#F6B762")+
labs(x="Year", y="Number of license Plate",
title="Number of Bids Vs Number of License Plates")+ theme_minimal()+ theme(legend.position="none", axis.text.x = element_text(angle = 60, hjust = 1)) #+ scale_x_continuous(breaks=year2$year)
p1
#Growth Rate of number plate and bid number
$d <- paste( as.numeric(year2$year), year2$mon1, sep="-")
year2$d1 <- as.yearmon(year2$d)
year2
<- cbind(year2$gbid, year2$gnum)
num <-xts( num , order.by=as.POSIXct(as.Date(year2$d1)))
num1
<-dygraph(num1, main="Shanghai Car License plate growth rate vs bid growth rate") %>%
gw0
dyOptions(colors=RColorBrewer::brewer.pal(4, "Set3"), fillGraph=TRUE, fillAlpha=0.4, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Growth rate of Bids") %>% dySeries("V2", label="Growth rate of Car Plates")
gw0
#Growth Rate of number plate and bid number with baseline 2002
<- cbind(year2$gbbid, year2$gbnum)
bnum
<-xts( bnum , order.by=as.POSIXct(as.Date(year2$d1)))
bnum1
<-dygraph(bnum1, main="Shanghai License Plate vs Bid growth rate with baseline 2002") %>%
bgw0
dyOptions(colors=RColorBrewer::brewer.pal(4, "Set2"), fillGraph=TRUE, fillAlpha=0.4, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Growth rate of Bids based on 2002") %>% dySeries("V2", label="Growth rate of Car Plates based on 2002")
bgw0
#winning rate
$d <- paste( as.numeric(year2$year), year2$mon1, sep="-")
year2
$d1 <- as.yearmon(year2$d)
year2
<- xts( year2$win , order.by=as.POSIXct(as.Date(year2$d1)))
win1
<-dygraph(win1, main="Shanghai Car License Lottery Winning Chance (%)") %>%
gw1
dyOptions(colors="pink" ,fillGraph=TRUE, fillAlpha=0.2, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1, hideOnMouseOut=FALSE)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Car Plate Winning (%)")
gw1
#Compare the average price and lowest bid price
<- cbind(year2$lprice, year2$aprice)
price
<- xts( price , order.by=as.POSIXct(as.Date(year2$d1)))
price1
<-dygraph(price1, main="Shanghai Car License Average Price VS Lowest Price by Year") %>%
gp1
dyOptions(colors=RColorBrewer::brewer.pal(4, "Set3"), fillGraph=TRUE, fillAlpha=0.4, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Lowest Deal Price") %>%
dySeries("V2", label="Average Deal Price")
gp1
#Growth Rate of average price
<- xts( year2$gaprice , order.by=as.POSIXct(as.Date(year2$d1)))
gaprice1
<-dygraph(gaprice1, main="Shanghai car License Plate average rate of growth") %>%
gap1
dyOptions(colors="green" ,fillGraph=TRUE, fillAlpha=0.2, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1, hideOnMouseOut=FALSE)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Rate of Average Car Plate price growth(%) by Year")
gap1
#Growth Rate of average price from baseline
<- xts( year2$gbaprice , order.by=as.POSIXct(as.Date(year2$d1)))
gbaprice1
<-dygraph(gbaprice1,
gbap1
main="Shanghai car License Plate average rate of growth from 2002") %>% dyOptions(colors="purple" ,fillGraph=TRUE, fillAlpha=0.2, drawXAxis = TRUE ) %>% dyHighlight(highlightCircleSize=5, highlightSeriesBackgroundAlpha=0.1, hideOnMouseOut=FALSE)%>% dyAxis("x", label = "Year from 2002-2017", c(1002,2017)) %>% dyLegend(show="follow") %>%
dySeries("V1", label="Rate of average price growth(%) from 2002")
gbap1
We will group the data by season, as we can see, Winter has the highest winning rate with the lowest average bid price overall.
ggplot(shcar,aes(x=avg_deal_price, fill=as.factor(s)))+geom_density(alpha=0.6)+scale_fill_brewer(palette="Set3")+
theme_minimal()+facet_wrap(~as.factor(s))+coord_cartesian(ylim = c(0, 0.000025))+guides(fill="none") +labs(x="Price", y="Density") +ggtitle("Average Car License Plate Price Season")
ggplot(shcar,aes(x=num_bidder, fill=as.factor(s)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(s))+coord_cartesian(ylim = c(0, 3e-05))+ guides(fill="none") + scale_fill_brewer(palette="Set3")+labs(x="Bidder Number", y="Density") +ggtitle("Average Car License Plate Bidder Number Season")
#dlyr aggregate them in Season.
<- group_by(shcar, s)
s1
#Calculate the average of deal price, lowest price, number of bid and number of plates by Month
<- summarise(s1, count=n(),
s2
aprice= mean(avg_deal_price), lprice=mean(lowest_deal_price), abid=mean(num_bidder), anum=mean(num_plates))
#Calucate the chance getting win the plate
<- s2 %>% mutate(swin=anum/abid*100)
s2
#Plot the average price for each Month
<- ggplot(s2, aes(x=s))+
ps1 geom_bar(aes(y=aprice,fill=s), stat="identity", position="dodge")+
labs(x="Season", y="Average Price of License Plate",
title="Average Price of License Plate by Season")+ scale_fill_brewer(palette="Set3")+ theme_minimal()+
theme(legend.position="none", axis.text.x = element_text(angle = 60, hjust = 1))
ps1
#Plot the winning rate by season
<- ggplot(s2, aes(x=s , y=swin )) + geom_point(aes( size=aprice , colour=factor(aprice)), alpha=1)+ scale_color_brewer(palette = "Set3")+ theme_minimal()+
wins1 labs(Season="Month", y="Probability of Winning car license plate (%)" , title="Shanghai Car Plate Winning Rate by Season") + theme(legend.position="none")
wins1
We will group the data by months now. As it suggested in previous section, Winter has the highest winning rate and lowest bidding price, January in particular. March has the lowest winning rate overall.
ggplot(shcar,aes(x=avg_deal_price, fill=as.factor(ttt3)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(ttt3))+coord_cartesian(ylim = c(0, 0.000025))+guides(fill="none") +scale_fill_brewer(palette="Set3")+labs(x="Price", y="Density") +ggtitle("Average Car License Plate Price Month")
ggplot(shcar,aes(x=num_bidder, fill=as.factor(ttt3)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(ttt3))+coord_cartesian(ylim = c(0, 3e-05))+ guides(fill="none") +labs(x="Bidder Number", y="Density") +ggtitle("Average Car License Plate Bidder Number Month")
#dlyr aggregate them in Month.
$m1 <- substr(shcar$date,1,3)
shcar
<- group_by(shcar, m1 ,ttt)
month1
#Calculate the average of deal price, lowest price, number of bid and number of plates by Month
<- summarise(month1, count=n(), aprice= mean(avg_deal_price), lprice=mean(lowest_deal_price), abid=mean(num_bidder), anum=mean(num_plates))
month2
#Calucate the chance getting win the plate
<- month2 %>% mutate(win=anum/abid*100)
month2
$ttt <- factor(as.numeric(month2$ttt),
month2labels=c("jan","feb","Mar","apr","may","jun","jul",
"aug","sep","oct","nov","dec"))
<-arrange(month2,ttt)
month2
#Plot the average price for each Month
<- ggplot(month2, aes(x=ttt))+
pm1 geom_bar(aes(y=aprice,fill=m1), stat="identity", position="dodge")+
labs(x="Year", y="Average Price of License Plate by Month",
title="Number of Bids Vs Number of License Plates")+scale_fill_brewer(palette="Set3")+theme_minimal()+ theme(legend.position="none", axis.text.x = element_text(angle = 60, hjust = 1))+scale_x_discrete(name = "Month")
pm1
#Plot the winning rate by Month
<- ggplot(month2, aes(x=ttt , y=win , group=1)) + geom_point(color="orangered1") +geom_line(color="#9FDAC3")+
winm1 labs(x="Month", y="Probability of Winning car license plate (%)" , title="Winning rate by month") +theme_minimal()
winm1
#heatmap
ggplot(shcar, aes(year, m1, fill=avg_deal_price))+
geom_tile(colour = "white")+
scale_fill_gradient(low="#CCEBC5", high="#FCCDE5") +
labs(x="Year ",y= "Month" , title="Shanghai Car Plate Average Deal Price Heatmap" ,subtitle=" Year vs Month " , fill="Avg Price")+theme_minimal()
We know the average deal price is very close to the lowest deal price. So we are trying to use
number of bidder to predict the lowest deal price. We can see when the bid number > 50,000,
there might be a linear relationship. Our Model supported our conjecture, the p-value smaller than 0.05. So if we know the total number of bidder then we can calculate the lowest price, and by the lowest price we add 300-500Yuan , it will be in the Lottery Range.
<- ggplot(shcar, aes(x=num_bidder, y=lowest_deal_price))+geom_point(color="darkseagreen2")+
mod1theme_minimal()+
labs(x="Number of Bidder", y="Lowest price of car license plate", title="Number of Bidder Vs Lowest price of car license plate")
mod1
<- filter(shcar, num_bidder >50000)
mod.shcar
<- ggplot(mod.shcar, aes(x=num_bidder, y=lowest_deal_price))+geom_point(color="darkseagreen3")+ stat_smooth(color="violetred2", method="lm")+ theme_minimal()+
mod2labs(x="Number of Bidder", y="Lowest price of car license plate", title="Number of Bidder Vs Lowest price of car license plate")
mod2
## `geom_smooth()` using formula 'y ~ x'
<- lm(lowest_deal_price ~ num_bidder, data=mod.shcar)
lm.car summary(lm.car)
##
## Call:
## lm(formula = lowest_deal_price ~ num_bidder, data = mod.shcar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33432 -438 1229 4138 9291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.631e+04 3.623e+03 15.540 < 2e-16 ***
## num_bidder 1.326e-01 1.871e-02 7.089 8.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8106 on 44 degrees of freedom
## Multiple R-squared: 0.5332, Adjusted R-squared: 0.5226
## F-statistic: 50.26 on 1 and 44 DF, p-value: 8.462e-09
In this section, we will plot a Time Series graph (average price, time), Decompose the series, Select the models based on correlogram, Fitting the model, and Forecasting.
<-xts(shcar$avg_deal_price, shcar$date)
acar2
dygraph(acar2, xlab = "Time", ylab = "Averge Deal Price", main = "Time Series") %>%
# dySeries(labels.default()) %>%
# dyOptions(colors = c("red")) %>%
dyRangeSelector()
= xts(shcar$avg_deal_price, order.by=shcar$date1)
xts attr(xts, 'frequency') <- length(xts)/12
= as.ts(xts, start = c(2002)) ts
<- decompose(ts, type = "additive")
tscomponents_add <- decompose(ts, type = "multiplicative")
tscomponents_mul plot(tscomponents_add, col = "#BC80BD")
plot(tscomponents_mul, col = "#FDB462")
<- diff(xts, differences=1)
xtsdiff1 <- diff(ts, differences=1)
tsdiff1 plot.xts(xtsdiff1, col = "#8DD3C7")
### Correlogram
# plot a correlogram
Acf(xtsdiff1, lag.max=60)
# plot a partial correlogram
Pacf(xtsdiff1, lag.max=60)
### Fitting Model
<- auto.arima(head(xts,-100), max.p = 3, max.q = 3, max.d = 3) # excluding last 240 time series as test data
tsarima100 print(tsarima100)
## Series: head(xts, -100)
## ARIMA(0,1,0)
##
## sigma^2 = 24818479: log likelihood = -874.06
## AIC=1750.12 AICc=1750.16 BIC=1752.6
autoplot(tsarima100)
<- auto.arima(head(xts, -7), max.p = 3, max.q = 3, max.d = 3) #7
tsarima7 print(tsarima7)
## Series: head(xts, -7)
## ARIMA(0,1,2) with drift
##
## Coefficients:
## ma1 ma2 drift
## -0.2841 -0.1543 406.9820
## s.e. 0.0719 0.0700 197.7925
##
## sigma^2 = 22561087: log likelihood = -1787.72
## AIC=3583.44 AICc=3583.67 BIC=3596.24
autoplot(tsarima7)
### Forecasting
# forecast the next 100 time series
<- forecast(tsarima100, h = 100)
tsforecasts100 # forecast the next 7 time series
<- forecast(tsarima7, h = 7) tsforecasts7
autoplot(tsforecasts100)
accuracy(tsforecasts100, head(tail(xts, 100), 100))
## ME RMSE MAE MPE MAPE MASE
## Training set 175.7611 4953.748 2940.952 -0.4711306 9.86983 0.2824895
## Test set 37026.1800 41639.525 37331.300 49.4313811 51.29239 3.5858121
## ACF1
## Training set -0.1123756
## Test set NA
checkresiduals(tsforecasts100)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)
## Q* = 12.72, df = 18, p-value = 0.8079
##
## Model df: 0. Total lags used: 18
autoplot(tsforecasts7)
accuracy(tsforecasts7, head(tail(xts, 7), 7))
## ME RMSE MAE MPE MAPE MASE
## Training set 0.9646739 4697.365 2710.474 -1.521765 7.642390 0.2491679
## Test set 1203.3550687 1437.658 1227.129 1.309712 1.336265 0.1128072
## ACF1
## Training set 0.01003699
## Test set NA
checkresiduals(tsforecasts7)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 14.621, df = 28.5, p-value = 0.9852
##
## Model df: 3. Total lags used: 31.5
Although, the Car License Plate is expensive, and the chances of getting the plate is very low, ShangHai is still a young and energized city, the Pearl of East.