This dataset from Zillow seemed interesting to me as I can see very interesting features. Also, the data has good depth. I think it shows us how the numbers differ before and after 2008 state-wise. I downloaded int from the Zillow researd data.
I tried to do some illustrative data visualizations with short analysis. I have also performed Linear regression on the New york state, ZHVI_Allhome prices.
library(dplyr)
library(ggplot2)
library(reshape)## Warning: package 'reshape' was built under R version 4.1.2
library(wesanderson)## Warning: package 'wesanderson' was built under R version 4.1.2
library(stringr)## Warning: package 'stringr' was built under R version 4.1.2
library(doBy)## Warning: package 'doBy' was built under R version 4.1.2
library(plotly)## Warning: package 'plotly' was built under R version 4.1.2
library(corrplot)## Warning: package 'corrplot' was built under R version 4.1.2
library(wesanderson)
library(RColorBrewer)
library(gridExtra)## Warning: package 'gridExtra' was built under R version 4.1.2
data=read.csv("C:/Users/neeya/Downloads/State_time_series.csv")
sprintf("The data set has %d rows and %d columns", nrow(data), ncol(data) )## [1] "The data set has 13212 rows and 82 columns"
The data set has 13212 rows and 82 columns. We will get an overview of the data set using the str, and summary.
We find :
str(data)## 'data.frame': 13212 obs. of 82 variables:
## $ Date : chr "1996-04-30" "1996-04-30" "1996-04-30" "1996-04-30" ...
## $ RegionName : chr "Alabama" "Arizona" "Arkansas" "California" ...
## $ DaysOnZillow_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ InventorySeasonallyAdjusted_AllHomes : int NA NA NA NA NA NA NA NA NA NA ...
## $ InventoryRaw_AllHomes : int NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_1Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_2Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_3Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_4Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_5BedroomOrMore : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_DuplexTriplex : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPricePerSqft_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_1Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_2Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_3Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_4Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_5BedroomOrMore : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_DuplexTriplex : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianListingPrice_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPctOfPriceReduction_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPctOfPriceReduction_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPctOfPriceReduction_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPriceCutDollar_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPriceCutDollar_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianPriceCutDollar_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_1Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_2Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_3Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_4Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_5BedroomOrMore : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_DuplexTriplex : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPricePerSqft_Studio : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_1Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_2Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_3Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_4Bedroom : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_5BedroomOrMore : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_DuplexTriplex : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_MultiFamilyResidence5PlusUnits : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ MedianRentalPrice_Studio : num NA NA NA NA NA NA NA NA NA NA ...
## $ ZHVIPerSqft_AllHomes : int 50 62 42 102 82 85 71 56 55 185 ...
## $ PctOfHomesDecreasingInValues_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfHomesIncreasingInValues_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfHomesSellingForGain_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfHomesSellingForLoss_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductionsSeasAdj_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductionsSeasAdj_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence: num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductions_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductions_CondoCoop : num NA NA NA NA NA NA NA NA NA NA ...
## $ PctOfListingsWithPriceReductions_SingleFamilyResidence : num NA NA NA NA NA NA NA NA NA NA ...
## $ PriceToRentRatio_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ Sale_Counts : num NA NA NA NA NA NA NA NA NA NA ...
## $ Sale_Counts_Seas_Adj : num NA NA NA NA NA NA NA NA NA NA ...
## $ Sale_Prices : num NA NA NA NA NA NA NA NA NA NA ...
## $ ZHVI_1bedroom : int 61500 59200 53000 93700 77800 64700 90100 45400 74900 152300 ...
## $ ZHVI_2bedroom : int 48900 86400 54500 123400 97500 97000 88200 65400 64700 186600 ...
## $ ZHVI_3bedroom : int 78200 96100 76800 150900 129000 130400 103500 89100 88000 231800 ...
## $ ZHVI_4bedroom : int 146500 128400 135100 196100 176100 194800 157800 133600 149700 303400 ...
## $ ZHVI_5BedroomOrMore : int 206300 190500 186000 265300 212900 299800 176100 199900 212800 345500 ...
## $ ZHVI_AllHomes : int 79500 103600 64400 157900 128100 132000 106800 86300 92000 227400 ...
## $ ZHVI_BottomTier : int 45600 67100 38400 95100 82700 83700 77200 52500 57200 144500 ...
## $ ZHVI_CondoCoop : int 99500 78900 70300 136100 99400 85000 NA 70600 89300 177000 ...
## $ ZHVI_MiddleTier : int 79500 103600 64400 157900 128100 132000 106800 86300 92000 227400 ...
## $ ZHVI_SingleFamilyResidence : int 79000 107500 64500 162000 133600 141000 107400 92100 92400 262600 ...
## $ ZHVI_TopTier : int 140200 168700 115200 270600 209300 231600 161600 155300 163900 374700 ...
## $ ZRI_AllHomes : int NA NA NA NA NA NA NA NA NA NA ...
## $ ZRI_AllHomesPlusMultifamily : int NA NA NA NA NA NA NA NA NA NA ...
## $ ZriPerSqft_AllHomes : num NA NA NA NA NA NA NA NA NA NA ...
## $ Zri_MultiFamilyResidenceRental : int NA NA NA NA NA NA NA NA NA NA ...
## $ Zri_SingleFamilyResidenceRental : int NA NA NA NA NA NA NA NA NA NA ...
summary(data)## Date RegionName DaysOnZillow_AllHomes
## Length:13212 Length:13212 Min. : 49.25
## Class :character Class :character 1st Qu.: 90.25
## Mode :character Mode :character Median :108.50
## Mean :110.12
## 3rd Qu.:126.75
## Max. :251.62
## NA's :8367
## InventorySeasonallyAdjusted_AllHomes InventoryRaw_AllHomes
## Min. : 972 Min. : 911
## 1st Qu.: 9828 1st Qu.: 9756
## Median : 21713 Median : 21289
## Mean : 33293 Mean : 33299
## 3rd Qu.: 47453 3rd Qu.: 46891
## Max. :260687 Max. :268055
## NA's :8316 NA's :8316
## MedianListingPricePerSqft_1Bedroom MedianListingPricePerSqft_2Bedroom
## Min. : 57.14 Min. : 60.00
## 1st Qu.:125.69 1st Qu.: 92.16
## Median :162.75 Median :121.30
## Mean :182.47 Mean :135.49
## 3rd Qu.:202.63 3rd Qu.:152.24
## Max. :627.55 Max. :550.64
## NA's :9626 NA's :8678
## MedianListingPricePerSqft_3Bedroom MedianListingPricePerSqft_4Bedroom
## Min. : 56.48 Min. : 61.8
## 1st Qu.: 93.28 1st Qu.: 99.6
## Median :116.63 Median :119.8
## Mean :129.45 Mean :133.5
## 3rd Qu.:143.19 3rd Qu.:144.8
## Max. :460.46 Max. :480.8
## NA's :8605 NA's :8535
## MedianListingPricePerSqft_5BedroomOrMore MedianListingPricePerSqft_AllHomes
## Min. : 63.78 Min. : 62.14
## 1st Qu.: 99.08 1st Qu.: 96.01
## Median :119.75 Median :120.58
## Mean :135.67 Mean :136.66
## 3rd Qu.:149.35 3rd Qu.:153.44
## Max. :617.96 Max. :520.72
## NA's :8643 NA's :8538
## MedianListingPricePerSqft_CondoCoop MedianListingPricePerSqft_DuplexTriplex
## Min. : 61.92 Min. : 32.14
## 1st Qu.: 113.08 1st Qu.: 60.89
## Median : 141.00 Median : 81.21
## Mean : 163.55 Mean : 97.09
## 3rd Qu.: 177.07 3rd Qu.:113.52
## Max. :1000.00 Max. :446.43
## NA's :9063 NA's :9248
## MedianListingPricePerSqft_SingleFamilyResidence MedianListingPrice_1Bedroom
## Min. : 63.27 Min. : 49900
## 1st Qu.: 95.55 1st Qu.: 99000
## Median :120.09 Median :130000
## Mean :133.37 Mean :147083
## 3rd Qu.:149.84 3rd Qu.:169900
## Max. :475.36 Max. :399000
## NA's :8573 NA's :10205
## MedianListingPrice_2Bedroom MedianListingPrice_3Bedroom
## Min. : 57000 Min. :109900
## 1st Qu.:109500 1st Qu.:149000
## Median :147000 Median :189900
## Mean :158873 Mean :209226
## 3rd Qu.:179900 3rd Qu.:240000
## Max. :599000 Max. :687000
## NA's :8839 NA's :8842
## MedianListingPrice_4Bedroom MedianListingPrice_5BedroomOrMore
## Min. :169000 Min. : 159900
## 1st Qu.:238745 1st Qu.: 310000
## Median :283020 Median : 369700
## Mean :310850 Mean : 416375
## 3rd Qu.:339900 3rd Qu.: 452173
## Max. :950000 Max. :1847500
## NA's :8876 NA's :8989
## MedianListingPrice_AllHomes MedianListingPrice_CondoCoop
## Min. :112944 Min. : 82500
## 1st Qu.:159900 1st Qu.:152363
## Median :209000 Median :184900
## Mean :223379 Mean :202333
## 3rd Qu.:259900 3rd Qu.:228000
## Max. :610000 Max. :754500
## NA's :8966 NA's :9402
## MedianListingPrice_DuplexTriplex MedianListingPrice_SingleFamilyResidence
## Min. : 64900 Min. :112900
## 1st Qu.:129900 1st Qu.:159900
## Median :178900 Median :209900
## Mean :207475 Mean :228170
## 3rd Qu.:245000 3rd Qu.:265000
## Max. :939000 Max. :725000
## NA's :9323 NA's :9082
## MedianPctOfPriceReduction_AllHomes MedianPctOfPriceReduction_CondoCoop
## Min. :1.744 Min. : 1.676
## 1st Qu.:3.239 1st Qu.: 3.261
## Median :3.718 Median : 3.807
## Mean :3.848 Mean : 4.015
## 3rd Qu.:4.350 3rd Qu.: 4.561
## Max. :8.340 Max. :10.000
## NA's :8724 NA's :9340
## MedianPctOfPriceReduction_SingleFamilyResidence MedianPriceCutDollar_AllHomes
## Min. :1.716 Min. : 5000
## 1st Qu.:3.233 1st Qu.: 5100
## Median :3.737 Median : 7500
## Mean :3.847 Mean : 8034
## 3rd Qu.:4.349 3rd Qu.:10000
## Max. :8.347 Max. :24000
## NA's :8724 NA's :8724
## MedianPriceCutDollar_CondoCoop MedianPriceCutDollar_SingleFamilyResidence
## Min. : 2050 Min. : 5000
## 1st Qu.: 5000 1st Qu.: 5300
## Median : 6800 Median : 7900
## Mean : 7453 Mean : 8245
## 3rd Qu.:10000 3rd Qu.:10000
## Max. :27754 Max. :26000
## NA's :9340 NA's :8724
## MedianRentalPricePerSqft_1Bedroom MedianRentalPricePerSqft_2Bedroom
## Min. :0.716 Min. :0.521
## 1st Qu.:0.992 1st Qu.:0.806
## Median :1.211 Median :0.952
## Mean :1.397 Mean :1.106
## 3rd Qu.:1.606 3rd Qu.:1.262
## Max. :3.369 Max. :3.113
## NA's :9588 NA's :9065
## MedianRentalPricePerSqft_3Bedroom MedianRentalPricePerSqft_4Bedroom
## Min. :0.533 Min. :0.489
## 1st Qu.:0.751 1st Qu.:0.678
## Median :0.863 Median :0.781
## Mean :0.968 Mean :0.847
## 3rd Qu.:1.073 3rd Qu.:0.938
## Max. :2.453 Max. :2.134
## NA's :8985 NA's :9808
## MedianRentalPricePerSqft_5BedroomOrMore MedianRentalPricePerSqft_AllHomes
## Min. :0.363 Min. :0.580
## 1st Qu.:0.629 1st Qu.:0.760
## Median :0.745 Median :0.880
## Mean :0.763 Mean :1.015
## 3rd Qu.:0.891 3rd Qu.:1.125
## Max. :1.267 Max. :3.147
## NA's :11752 NA's :8864
## MedianRentalPricePerSqft_CondoCoop MedianRentalPricePerSqft_DuplexTriplex
## Min. :0.637 Min. :0.499
## 1st Qu.:0.896 1st Qu.:0.740
## Median :1.075 Median :0.871
## Mean :1.297 Mean :1.066
## 3rd Qu.:1.414 3rd Qu.:1.119
## Max. :4.824 Max. :3.862
## NA's :10004 NA's :10293
## MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits
## Min. :0.585
## 1st Qu.:0.837
## Median :1.000
## Mean :1.170
## 3rd Qu.:1.300
## Max. :3.380
## NA's :9189
## MedianRentalPricePerSqft_SingleFamilyResidence MedianRentalPricePerSqft_Studio
## Min. :0.581 Min. :0.565
## 1st Qu.:0.749 1st Qu.:0.739
## Median :0.861 Median :0.853
## Mean :0.939 Mean :1.176
## 3rd Qu.:1.031 3rd Qu.:1.260
## Max. :2.440 Max. :3.982
## NA's :8923 NA's :10875
## MedianRentalPrice_1Bedroom MedianRentalPrice_2Bedroom
## Min. : 495.0 Min. : 575
## 1st Qu.: 650.0 1st Qu.: 775
## Median : 860.0 Median : 925
## Mean : 978.6 Mean :1098
## 3rd Qu.:1195.0 3rd Qu.:1279
## Max. :2690.0 Max. :3215
## NA's :9686 NA's :9168
## MedianRentalPrice_3Bedroom MedianRentalPrice_4Bedroom
## Min. : 750 Min. : 950
## 1st Qu.:1050 1st Qu.:1395
## Median :1200 Median :1595
## Mean :1357 Mean :1722
## 3rd Qu.:1515 3rd Qu.:1950
## Max. :3550 Max. :3850
## NA's :9075 NA's :9856
## MedianRentalPrice_5BedroomOrMore MedianRentalPrice_AllHomes
## Min. : 795 Min. : 750
## 1st Qu.:1750 1st Qu.:1050
## Median :1995 Median :1200
## Mean :2139 Mean :1362
## 3rd Qu.:2495 3rd Qu.:1590
## Max. :4500 Max. :3600
## NA's :11994 NA's :9060
## MedianRentalPrice_CondoCoop MedianRentalPrice_DuplexTriplex
## Min. : 697.5 Min. : 500.0
## 1st Qu.:1050.0 1st Qu.: 685.0
## Median :1295.0 Median : 800.0
## Mean :1410.2 Mean : 963.6
## 3rd Qu.:1595.0 3rd Qu.:1100.0
## Max. :3200.0 Max. :2895.0
## NA's :10437 NA's :10068
## MedianRentalPrice_MultiFamilyResidence5PlusUnits
## Min. : 550
## 1st Qu.: 750
## Median : 950
## Mean :1091
## 3rd Qu.:1296
## Max. :2895
## NA's :9029
## MedianRentalPrice_SingleFamilyResidence MedianRentalPrice_Studio
## Min. : 750 Min. : 490
## 1st Qu.:1050 1st Qu.: 975
## Median :1205 Median :1150
## Mean :1363 Mean :1221
## 3rd Qu.:1582 3rd Qu.:1400
## Max. :3400 Max. :2500
## NA's :9120 NA's :10211
## ZHVIPerSqft_AllHomes PctOfHomesDecreasingInValues_AllHomes
## Min. : 35.0 Min. : 0.18
## 1st Qu.: 77.0 1st Qu.:16.98
## Median : 98.0 Median :27.45
## Mean :116.4 Mean :33.29
## 3rd Qu.:141.0 3rd Qu.:45.08
## Max. :499.0 Max. :99.38
## NA's :620 NA's :4292
## PctOfHomesIncreasingInValues_AllHomes PctOfHomesSellingForGain_AllHomes
## Min. : 0.47 Min. : 50.99
## 1st Qu.:44.22 1st Qu.: 79.08
## Median :63.54 Median : 89.97
## Mean :59.03 Mean : 85.88
## 3rd Qu.:76.30 3rd Qu.: 95.64
## Max. :99.76 Max. :100.00
## NA's :4292 NA's :12609
## PctOfHomesSellingForLoss_AllHomes
## Min. : 0.000
## 1st Qu.: 4.365
## Median :10.030
## Mean :14.121
## 3rd Qu.:20.915
## Max. :49.010
## NA's :12609
## PctOfListingsWithPriceReductionsSeasAdj_AllHomes
## Min. : 4.509
## 1st Qu.:10.908
## Median :12.309
## Mean :12.382
## 3rd Qu.:13.917
## Max. :20.636
## NA's :8724
## PctOfListingsWithPriceReductionsSeasAdj_CondoCoop
## Min. : 2.233
## 1st Qu.: 8.528
## Median :10.381
## Mean :10.394
## 3rd Qu.:12.189
## Max. :19.608
## NA's :9164
## PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence
## Min. : 4.438
## 1st Qu.:11.094
## Median :12.536
## Mean :12.602
## 3rd Qu.:14.114
## Max. :20.762
## NA's :8724
## PctOfListingsWithPriceReductions_AllHomes
## Min. : 3.564
## 1st Qu.:10.225
## Median :12.302
## Mean :12.364
## 3rd Qu.:14.366
## Max. :20.918
## NA's :8724
## PctOfListingsWithPriceReductions_CondoCoop
## Min. : 1.788
## 1st Qu.: 8.246
## Median :10.249
## Mean :10.379
## 3rd Qu.:12.381
## Max. :22.904
## NA's :9164
## PctOfListingsWithPriceReductions_SingleFamilyResidence
## Min. : 3.745
## 1st Qu.:10.423
## Median :12.498
## Mean :12.584
## 3rd Qu.:14.624
## Max. :21.689
## NA's :8724
## PriceToRentRatio_AllHomes Sale_Counts Sale_Counts_Seas_Adj
## Min. : 7.05 Min. : 130 Min. : 242
## 1st Qu.: 9.73 1st Qu.: 1672 1st Qu.: 1713
## Median :11.14 Median : 4546 Median : 4764
## Mean :11.44 Mean : 7066 Mean : 7049
## 3rd Qu.:12.74 3rd Qu.: 9247 3rd Qu.: 9393
## Max. :21.55 Max. :50275 Max. :41779
## NA's :8912 NA's :7837 NA's :7837
## Sale_Prices ZHVI_1bedroom ZHVI_2bedroom ZHVI_3bedroom
## Min. : 83800 Min. : 30900 Min. : 32800 Min. : 49600
## 1st Qu.:137000 1st Qu.: 74600 1st Qu.: 86700 1st Qu.:116400
## Median :180900 Median :100400 Median :115400 Median :141200
## Mean :194552 Mean :117060 Mean :135169 Mean :167063
## 3rd Qu.:235775 3rd Qu.:142300 3rd Qu.:166800 3rd Qu.:204400
## Max. :543100 Max. :390200 Max. :542400 Max. :639700
## NA's :9218 NA's :2607 NA's :1467 NA's :425
## ZHVI_4bedroom ZHVI_5BedroomOrMore ZHVI_AllHomes ZHVI_BottomTier
## Min. : 64700 Min. : 68600 Min. : 38200 Min. : 32600
## 1st Qu.:174900 1st Qu.: 217900 1st Qu.:114500 1st Qu.: 66600
## Median :218000 Median : 288000 Median :144750 Median : 87400
## Mean :243830 Mean : 323734 Mean :169753 Mean :102670
## 3rd Qu.:281000 3rd Qu.: 365300 3rd Qu.:207600 3rd Qu.:128200
## Max. :850400 Max. :1497000 Max. :620400 Max. :335600
## NA's :853 NA's :1398 NA's :774 NA's :896
## ZHVI_CondoCoop ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier
## Min. : 42200 Min. : 38200 Min. : 37900 Min. : 70900
## 1st Qu.:111300 1st Qu.:114500 1st Qu.:115000 1st Qu.:194700
## Median :134700 Median :144750 Median :147300 Median :251100
## Mean :156770 Mean :169753 Mean :174154 Mean :293974
## 3rd Qu.:175800 3rd Qu.:207600 3rd Qu.:211775 3rd Qu.:349400
## Max. :782900 Max. :620400 Max. :737500 Max. :988100
## NA's :1530 NA's :774 NA's :774 NA's :688
## ZRI_AllHomes ZRI_AllHomesPlusMultifamily ZriPerSqft_AllHomes
## Min. : 799 Min. : 799 Min. :0.560
## 1st Qu.:1047 1st Qu.:1036 1st Qu.:0.728
## Median :1210 Median :1210 Median :0.862
## Mean :1321 Mean :1318 Mean :0.930
## 3rd Qu.:1474 3rd Qu.:1477 3rd Qu.:1.074
## Max. :2690 Max. :2653 Max. :2.294
## NA's :8958 NA's :8876 NA's :8876
## Zri_MultiFamilyResidenceRental Zri_SingleFamilyResidenceRental
## Min. : 713.0 Min. : 799
## 1st Qu.: 959.8 1st Qu.:1039
## Median :1126.0 Median :1220
## Mean :1233.0 Mean :1328
## 3rd Qu.:1399.5 3rd Qu.:1468
## Max. :2606.0 Max. :2754
## NA's :8876 NA's :8958
Now, we will investigate missing values in detail.
rbind( table(is.na(data)) , round(prop.table(table(is.na(data)))*100,1) )## FALSE TRUE
## [1,] 449546.0 633838.0
## [2,] 41.5 58.5
58.5% of values are missing !
missing_value <- lapply(data, function(x) { round((sum(is.na(x)) / length(x)) * 100, 1) })
melt(data.frame(missing_value))%>%
ggplot(aes(x= reorder(variable, -value), y= value))+
geom_col(width=1, fill= "darkred", alpha=0.7 )+
labs(y="Missing pct(%)", x=NULL, title = "Missing Values by feature")+
theme(axis.ticks.x=element_line(colour="gray90"),
axis.text.y=element_text(size=5.5))+
coord_flip()Now we are ready to start visualizing the data. Let’s start with some overview plots for individual features, to explore the data set.
library(lubridate)
data$Year<-year(ymd(data$Date))Now we are ready to start visualizing the data. Let’s start with some overview plots for individual features, to explore the data set.
gbp1<-wes_palette("GrandBudapest2")[1]
ggplot(data, aes(x=Sale_Prices))+
geom_histogram(fill=gbp1, alpha=.9, binwidth=10000)+
labs(x=NULL, y=NULL, title = "Histogram of Sale Price")+
scale_x_continuous(breaks= seq(0,600000, by=100000))+
theme_minimal() + theme(plot.title=element_text(vjust=3, size=15) )As you can see, the sale prices are right-skewed. It means we might need to transform this variable and make it more normally distributed before modeling.
gbp2<-wes_palette("GrandBudapest2")[4]
ggplot(data, aes(x=MedianRentalPrice_AllHomes))+
geom_histogram(fill=gbp2, alpha=.9, binwidth=100)+
labs(x=NULL, y=NULL, title = "Histograms of Rental Price")+
scale_x_continuous(breaks= seq(0,4000, by=500))+
theme( plot.title=element_text(vjust=3, size=15) ) + theme_minimal()As you can see, the rental prices are right-skewed.
subset(data, select=c(Sale_Counts, Sale_Prices, MedianRentalPrice_AllHomes))%>%
na.omit()%>% cor()%>%
corrplot.mixed(lower = "number", upper = "ellipse", tl.cex=0.7,
cl.ratio=0.2, cl.cex=0.6, tl.col="black")We can guess that house value for sale and rent will go the same way.
state<- read.csv("C:/Users/neeya/Downloads/State_time_series.csv")
state_new_york <- state %>%
filter(RegionName=='NewYork')
state_new_york %>%
ggplot(aes(x=ZHVI_AllHomes))+
geom_histogram()na_count <- function(x){sum(is.na(x))}
names_of_columns_to_keep <- unlist(apply(X=state_new_york,FUN = na_count,MARGIN = 2))
names_of_columns_to_keep <- names_of_columns_to_keep[names_of_columns_to_keep==0]
names_of_columns_to_keep <- names(names_of_columns_to_keep)
names_of_columns_to_keep## [1] "Date" "RegionName" "ZHVI_2bedroom"
## [4] "ZHVI_3bedroom" "ZHVI_5BedroomOrMore" "ZHVI_BottomTier"
state_new_york_lm <- state_new_york[names_of_columns_to_keep]
# Remove the RegionName
state_new_york_lm <- state_new_york_lm %>% select(-RegionName)
state_new_york_lm_date <- state_new_york_lm %>%
select(-Date)
mod_lm <-lm(data=state_new_york_lm_date,ZHVI_BottomTier~.)
summary(mod_lm)##
## Call:
## lm(formula = ZHVI_BottomTier ~ ., data = state_new_york_lm_date)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3080.75 -917.90 34.26 713.69 3078.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.927e+03 8.467e+02 -9.362 <2e-16 ***
## ZHVI_2bedroom -3.097e-02 5.628e-02 -0.550 0.583
## ZHVI_3bedroom 1.173e+00 5.851e-02 20.047 <2e-16 ***
## ZHVI_5BedroomOrMore -1.509e-01 4.294e-03 -35.133 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1357 on 257 degrees of freedom
## Multiple R-squared: 0.9938, Adjusted R-squared: 0.9937
## F-statistic: 1.364e+04 on 3 and 257 DF, p-value: < 2.2e-16
Steps to be followed for ARIMA modeling:
We need to understand the three components of a time series data. I used the following R code to find out the components of this time series:
train2 = na.omit(with(data,data.frame( ds = Date, y= log(ZHVI_AllHomes), rn = RegionName)))
m_train = aggregate(y ~ds, train2, mean)
NY<-ts(train2$y[train2$rn=="NewYork"], start=c(1996,04), freq=12)
plot(decompose(NY), color)m_train## ds y
## 1 1996-04-30 11.47763
## 2 1996-05-31 11.47665
## 3 1996-06-30 11.47869
## 4 1996-07-31 11.48061
## 5 1996-08-31 11.48241
## 6 1996-09-30 11.48456
## 7 1996-10-31 11.48653
## 8 1996-11-30 11.48931
## 9 1996-12-31 11.49239
## 10 1997-01-31 11.49657
## 11 1997-02-28 11.49656
## 12 1997-03-31 11.50119
## 13 1997-04-30 11.50453
## 14 1997-05-31 11.50751
## 15 1997-06-30 11.51054
## 16 1997-07-31 11.51418
## 17 1997-08-31 11.51757
## 18 1997-09-30 11.52125
## 19 1997-10-31 11.52530
## 20 1997-11-30 11.52955
## 21 1997-12-31 11.53364
## 22 1998-01-31 11.53650
## 23 1998-02-28 11.54119
## 24 1998-03-31 11.54560
## 25 1998-04-30 11.54889
## 26 1998-05-31 11.55140
## 27 1998-06-30 11.55425
## 28 1998-07-31 11.56962
## 29 1998-08-31 11.57267
## 30 1998-09-30 11.57546
## 31 1998-10-31 11.57804
## 32 1998-11-30 11.58077
## 33 1998-12-31 11.58480
## 34 1999-01-31 11.58969
## 35 1999-02-28 11.59504
## 36 1999-03-31 11.60033
## 37 1999-04-30 11.60592
## 38 1999-05-31 11.61132
## 39 1999-06-30 11.61934
## 40 1999-07-31 11.62403
## 41 1999-08-31 11.62794
## 42 1999-09-30 11.63238
## 43 1999-10-31 11.63746
## 44 1999-11-30 11.64251
## 45 1999-12-31 11.64695
## 46 2000-01-31 11.65220
## 47 2000-02-29 11.65809
## 48 2000-03-31 11.66380
## 49 2000-04-30 11.66849
## 50 2000-05-31 11.67407
## 51 2000-06-30 11.68006
## 52 2000-07-31 11.68558
## 53 2000-08-31 11.69121
## 54 2000-09-30 11.69656
## 55 2000-10-31 11.70122
## 56 2000-11-30 11.70555
## 57 2000-12-31 11.71104
## 58 2001-01-31 11.71754
## 59 2001-02-28 11.72355
## 60 2001-03-31 11.72899
## 61 2001-04-30 11.73493
## 62 2001-05-31 11.74131
## 63 2001-06-30 11.74733
## 64 2001-07-31 11.75282
## 65 2001-08-31 11.75754
## 66 2001-09-30 11.76242
## 67 2001-10-31 11.76736
## 68 2001-11-30 11.77199
## 69 2001-12-31 11.77668
## 70 2002-01-31 11.78203
## 71 2002-02-28 11.78749
## 72 2002-03-31 11.79258
## 73 2002-04-30 11.79734
## 74 2002-05-31 11.80197
## 75 2002-06-30 11.80694
## 76 2002-07-31 11.81281
## 77 2002-08-31 11.81935
## 78 2002-09-30 11.82621
## 79 2002-10-31 11.83270
## 80 2002-11-30 11.83916
## 81 2002-12-31 11.84505
## 82 2003-01-31 11.85087
## 83 2003-02-28 11.85731
## 84 2003-03-31 11.86389
## 85 2003-04-30 11.86993
## 86 2003-05-31 11.87613
## 87 2003-06-30 11.88247
## 88 2003-07-31 11.88859
## 89 2003-08-31 11.89464
## 90 2003-09-30 11.90480
## 91 2003-10-31 11.91114
## 92 2003-11-30 11.91816
## 93 2003-12-31 11.92560
## 94 2004-01-31 11.93240
## 95 2004-02-29 11.93905
## 96 2004-03-31 11.94650
## 97 2004-04-30 11.95448
## 98 2004-05-31 11.96270
## 99 2004-06-30 11.97120
## 100 2004-07-31 11.98010
## 101 2004-08-31 11.98992
## 102 2004-09-30 11.99933
## 103 2004-10-31 12.00158
## 104 2004-11-30 12.00997
## 105 2004-12-31 12.01900
## 106 2005-01-31 12.02763
## 107 2005-02-28 12.03580
## 108 2005-03-31 12.04390
## 109 2005-04-30 12.05285
## 110 2005-05-31 12.06233
## 111 2005-06-30 12.07197
## 112 2005-07-31 12.08091
## 113 2005-08-31 12.08832
## 114 2005-09-30 12.09488
## 115 2005-10-31 12.10194
## 116 2005-11-30 12.10882
## 117 2005-12-31 12.11507
## 118 2006-01-31 12.12034
## 119 2006-02-28 12.12462
## 120 2006-03-31 12.12824
## 121 2006-04-30 12.13167
## 122 2006-05-31 12.13457
## 123 2006-06-30 12.13676
## 124 2006-07-31 12.13819
## 125 2006-08-31 12.13923
## 126 2006-09-30 12.14066
## 127 2006-10-31 12.14211
## 128 2006-11-30 12.14233
## 129 2006-12-31 12.14181
## 130 2007-01-31 12.14223
## 131 2007-02-28 12.14311
## 132 2007-03-31 12.14365
## 133 2007-04-30 12.14426
## 134 2007-05-31 12.14479
## 135 2007-06-30 12.14461
## 136 2007-07-31 12.14399
## 137 2007-08-31 12.15475
## 138 2007-09-30 12.15328
## 139 2007-10-31 12.15118
## 140 2007-11-30 12.14847
## 141 2007-12-31 12.14549
## 142 2008-01-31 12.14242
## 143 2008-02-29 12.13827
## 144 2008-03-31 12.13370
## 145 2008-04-30 12.12890
## 146 2008-05-31 12.12368
## 147 2008-06-30 12.11813
## 148 2008-07-31 12.11334
## 149 2008-08-31 12.09963
## 150 2008-09-30 12.09829
## 151 2008-10-31 12.09686
## 152 2008-11-30 12.09536
## 153 2008-12-31 12.09472
## 154 2009-01-31 12.09468
## 155 2009-02-28 12.09398
## 156 2009-03-31 12.09192
## 157 2009-04-30 12.08896
## 158 2009-05-31 12.08536
## 159 2009-06-30 12.08121
## 160 2009-07-31 12.07715
## 161 2009-08-31 12.07354
## 162 2009-09-30 12.07006
## 163 2009-10-31 12.06700
## 164 2009-11-30 12.06545
## 165 2009-12-31 12.05926
## 166 2010-01-31 12.04927
## 167 2010-02-28 12.04215
## 168 2010-03-31 12.03978
## 169 2010-04-30 12.03714
## 170 2010-05-31 12.03447
## 171 2010-06-30 12.03125
## 172 2010-07-31 12.02740
## 173 2010-08-31 12.02278
## 174 2010-09-30 12.01743
## 175 2010-10-31 12.01233
## 176 2010-11-30 12.00799
## 177 2010-12-31 12.00527
## 178 2011-01-31 12.00314
## 179 2011-02-28 12.00046
## 180 2011-03-31 11.99679
## 181 2011-04-30 11.99271
## 182 2011-05-31 11.98858
## 183 2011-06-30 11.98516
## 184 2011-07-31 11.98197
## 185 2011-08-31 11.97886
## 186 2011-09-30 11.97587
## 187 2011-10-31 11.97299
## 188 2011-11-30 11.96994
## 189 2011-12-31 11.96873
## 190 2012-01-31 11.96895
## 191 2012-02-29 11.96926
## 192 2012-03-31 11.96905
## 193 2012-04-30 11.96964
## 194 2012-05-31 11.97094
## 195 2012-06-30 11.97301
## 196 2012-07-31 11.97565
## 197 2012-08-31 11.97840
## 198 2012-09-30 11.98133
## 199 2012-10-31 11.98410
## 200 2012-11-30 11.98677
## 201 2012-12-31 11.99035
## 202 2013-01-31 11.99481
## 203 2013-02-28 11.99886
## 204 2013-03-31 12.00251
## 205 2013-04-30 12.00660
## 206 2013-05-31 12.01116
## 207 2013-06-30 12.01626
## 208 2013-07-31 12.02150
## 209 2013-08-31 12.02597
## 210 2013-09-30 12.02998
## 211 2013-10-31 12.03375
## 212 2013-11-30 12.03678
## 213 2013-12-31 12.03991
## 214 2014-01-31 12.04336
## 215 2014-02-28 12.04649
## 216 2014-03-31 12.04929
## 217 2014-04-30 12.05235
## 218 2014-05-31 12.05556
## 219 2014-06-30 12.05872
## 220 2014-07-31 12.06198
## 221 2014-08-31 12.06487
## 222 2014-09-30 12.06716
## 223 2014-10-31 12.06921
## 224 2014-11-30 12.07119
## 225 2014-12-31 12.07354
## 226 2015-01-31 12.07637
## 227 2015-02-28 12.08004
## 228 2015-03-31 12.08447
## 229 2015-04-30 12.08935
## 230 2015-05-31 12.09411
## 231 2015-06-30 12.09871
## 232 2015-07-31 12.10361
## 233 2015-08-31 12.10817
## 234 2015-09-30 12.11224
## 235 2015-10-31 12.11591
## 236 2015-11-30 12.11936
## 237 2015-12-31 12.12298
## 238 2016-01-31 12.12683
## 239 2016-02-29 12.13107
## 240 2016-03-31 12.13579
## 241 2016-04-30 12.14025
## 242 2016-05-31 12.14393
## 243 2016-06-30 12.14802
## 244 2016-07-31 12.15248
## 245 2016-08-31 12.15678
## 246 2016-09-30 12.16159
## 247 2016-10-31 12.16699
## 248 2016-11-30 12.17218
## 249 2016-12-31 12.17737
## 250 2017-01-31 12.18266
## 251 2017-02-28 12.18806
## 252 2017-03-31 12.19305
## 253 2017-04-30 12.19767
## 254 2017-05-31 12.20201
## 255 2017-06-30 12.20559
## 256 2017-07-31 12.20846
## 257 2017-08-31 12.21206
## 258 2017-09-30 12.21700
## 259 2017-10-31 12.22196
## 260 2017-11-30 12.22722
## 261 2017-12-31 12.23260
Here we get 4 components:
Observed – the actual data plot Trend – the overall upward or downward movement of the data points Seasonal – any monthly/yearly pattern of the data points Random – unexplainable part of the data Observing these 4 graphs closely, we can find out if the data satisfies all the assumptions of ARIMA modeling, mainly, stationarity and seasonality.
Various plots and functions that help in detecting seasonality:
A seasonal subseries plot
library(forecast)
NY_adjusted <- NY- (decompose(NY))$seasonal
ndiffs(NY_adjusted)## [1] 2
The autocorrelation function (acf()) gives the autocorrelation at all possible lags. The autocorrelation at lag 0 is included by default which always takes the value 1 as it represents the correlation between the data and themselves.
acf(NY,lag.max=40)pacf(NY,lag.max=40)library(tseries)
kpss.test(NY,lshort = TRUE)##
## KPSS Test for Level Stationarity
##
## data: NY
## KPSS Level = 0.76198, Truncation lag parameter = 4, p-value = 0.01
As we can infer from the acf graph above, the autocorrelation continues to exponentially decrease as the lag increases, confirming that there is no linear association between observations separated by larger lags.
pacf() at lag k is autocorrelation function which describes the correlation between all data points that are exactly k steps apart- after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregression (AR) coefficients(p-value) in an ARIMA model.
fitARIMA <- arima(NY, order=c(1,1,1),seasonal = list(order = c(1,0,0), period = 12),method="ML")
library(lmtest)
coeftest(fitARIMA) ##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.852615 0.045789 18.6206 < 2.2e-16 ***
## ma1 0.664270 0.052247 12.7141 < 2.2e-16 ***
## sar1 0.376175 0.105187 3.5763 0.0003485 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(fitARIMA)## 2.5 % 97.5 %
## ar1 0.7628706 0.9423589
## ma1 0.5618685 0.7666711
## sar1 0.1700133 0.5823377
I picked the model with the lowest AIC values.
mod1 = arima(NY,order=c(0,1,1))
mod2 = arima(NY,order=c(0,1,0))
mod3 = arima(NY,order=c(1,1,0))
summary(mod1)##
## Call:
## arima(x = NY, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## 0.8836
## s.e. 0.0276
##
## sigma^2 estimated as 1.232e-05: log likelihood = 524.16, aic = -1044.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0002381121 0.003676344 0.002659507 -0.001881159 0.0212097
## MASE ACF1
## Training set 0.589279 0.6812555
summary(mod2)##
## Call:
## arima(x = NY, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 3.851e-05: log likelihood = 454.25, aic = -906.51
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.000546988 0.006284733 0.004578796 -0.004329216 0.03650425
## MASE ACF1
## Training set 1.014544 0.9006195
summary(mod3)##
## Call:
## arima(x = NY, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## 0.9303
## s.e. 0.0301
##
## sigma^2 estimated as 4.757e-06: log likelihood = 582.91, aic = -1161.83
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 8.11102e-05 0.002452145 0.001594108 0.0006579105 0.01270284
## MASE ACF1
## Training set 0.3532137 0.2761833
Try Out auto.arima() Auto ARIMA says MA3 performs the best, with 2 orders of differencing Note that KPSS test suggested that a second difference may be preferred as well (still non-stationary after one difference)
mod_auto = auto.arima(NY,stepwise=FALSE,approximation=FALSE,stationary=FALSE)
summary(mod_auto)## Series: NY
## ARIMA(0,2,3)(0,0,1)[12]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.7762 -0.1869 -0.4297 0.3512
## s.e. 0.0894 0.1143 0.0785 0.1164
##
## sigma^2 = 5.343e-06: log likelihood = 616.39
## AIC=-1222.78 AICc=-1222.26 BIC=-1208.72
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -7.666097e-05 0.002255378 0.001337053 -0.0005862646 0.01065657
## MASE ACF1
## Training set 0.02532907 -0.1590147
It is a test of independence at all lags up to the one specified. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags It is applied to the residuals of a fitted ARIMA model, not the original series, and in such applications the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation.
R code to obtain the box test results:
checkresiduals(mod_auto)##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,3)(0,0,1)[12]
## Q* = 12.285, df = 20, p-value = 0.9059
##
## Model df: 4. Total lags used: 24
Box.test(mod_auto$residuals,type = 'Ljung-Box')##
## Box-Ljung test
##
## data: mod_auto$residuals
## X-squared = 3.2372, df = 1, p-value = 0.07198
Box.test(mod_auto$residuals,type = 'Ljung-Box',lag=5)##
## Box-Ljung test
##
## data: mod_auto$residuals
## X-squared = 6.0532, df = 5, p-value = 0.3011
I calculated the RMSE value
in_sample_pred = mod_auto$residuals+NY
RMSE = sqrt(mean((in_sample_pred - NY)^2,na.rm=T))
RMSE## [1] 0.002255378
The parameters of that ARIMA model can be used as a predictive model for making forecasts for future values of the time series once the best-suited model is selected for time series data. There is a function called predict() which is used for predictions from the results of various model fitting functions. It takes an argument n.ahead() specifying how many time steps ahead to predict.
predict(mod_auto,n.ahead = 5)## $pred
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2006 12.64041 12.64502 12.65156 12.66003
## 2007 12.66780
##
## $se
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2006 0.002311537 0.006820873
## 2007 0.023457124
## Nov Dec
## 2006 0.012180012 0.017648411
## 2007
library(forecast)
futurVal <- forecast(mod_auto,h=10, level=c(99.5))
plot(futurVal)The forecasts are shown as a blue line, with the 80% prediction intervals as a dark shaded area, and the 95% prediction intervals as a light shaded area.
THANK YOU