library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
## # A tibble: 8,602 x 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ... with 8,592 more rows
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date
## Min. :2000
## 1st Qu.:2004
## Median :2008
## Mean :2008
## 3rd Qu.:2012
## Max. :2016
##
I will be using the Texas housing dataset referenced in last week's slides, which was published by the TAMU Real Estate Center. The dataset includes 9 variables: the city, the year and month, the number of sales, the volume of sales, the median sale price, the number of listings, the amount of inventory, and a full date field.
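The printed tibble has the same shape (8,602 rows, 9 columns) as the `txhousing` data that ships with ggplot2. Here is a minimal sketch of loading and inspecting the data. It assumes `txhousing` is the source, and it uses the data frame name `df` from the later code.

```r
library(ggplot2)

# Assumption: the data is the txhousing tibble bundled with ggplot2.
df <- ggplot2::txhousing

df                  # prints the first rows of the tibble
summary(df)         # per-column summaries, including NA counts
head(df$date, 10)   # first ten values of the decimal date column
```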
## [1] 2000.000 2000.083 2000.167 2000.250 2000.333 2000.417 2000.500 2000.583
## [9] 2000.667 2000.750
Our date variable is expressed as a decimal year (for example, 2000.083). We can convert it to a date-time so that it is easier for the reader to interpret.
df$date <- date_decimal(df$date, tz = "UTC")
This conversion will help us interpret our model more effectively.
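One detail is worth illustrating. The decimal dates appear to follow a year + (month - 1) / 12 pattern, so `date_decimal()` generally returns timestamps that do not fall exactly on the first of a month. The sketch below uses a few made-up values in that style and shows an alternative that builds month-start dates with `make_date()`. Treat the alternative as a suggestion, not as part of the original analysis.

```r
library(lubridate)

# Illustrative decimal dates in a year + (month - 1) / 12 style
decimal_dates <- c(2015, 2015 + 1 / 12, 2015.5)

# date_decimal() returns POSIXct timestamps; a fractional year
# usually maps to a time of day other than midnight
converted <- date_decimal(decimal_dates, tz = "UTC")
converted

# Alternative (assumption: month starts are what we want):
# build Date values directly from year and month
month_start <- make_date(year = 2015, month = c(1, 2, 7), day = 1)
month_start
```

With the real data, the same idea would be `make_date(df$year, df$month, 1)`, which gives one clean date per month at the cost of dropping the time component.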
## city year month sales
## Length:322 Min. :2015 Min. :1 Min. : 16.0
## Class :character 1st Qu.:2015 1st Qu.:2 1st Qu.: 91.0
## Mode :character Median :2015 Median :4 Median : 194.5
## Mean :2015 Mean :4 Mean : 658.6
## 3rd Qu.:2015 3rd Qu.:6 3rd Qu.: 523.0
## Max. :2015 Max. :7 Max. :8945.0
## NA's :6
## volume median listings inventory
## Min. :2.516e+06 Min. : 78800 Min. : 83.0 Min. : 0.800
## 1st Qu.:1.533e+07 1st Qu.:135650 1st Qu.: 529.0 1st Qu.: 2.800
## Median :3.228e+07 Median :156750 Median : 947.5 Median : 4.350
## Mean :1.713e+08 Mean :168665 Mean : 2055.4 Mean : 5.553
## 3rd Qu.:1.113e+08 3rd Qu.:198425 3rd Qu.: 1953.5 3rd Qu.: 7.300
## Max. :2.568e+09 Max. :304200 Max. :23875.0 Max. :20.100
## NA's :6 NA's :6 NA's :8 NA's :8
## date
## Min. :2015-01-01 00:00:00
## 1st Qu.:2015-01-31 10:00:00
## Median :2015-04-02 06:00:00
## Mean :2015-04-02 06:00:00
## 3rd Qu.:2015-06-02 02:00:00
## Max. :2015-07-02 12:00:00
##
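The summary above covers 322 rows, all from 2015, and the plot code refers to a data frame named `df2`. The filtering step itself is not shown, so the following is only a sketch of one way to produce such a subset. It assumes a simple filter on `year` and the `txhousing` data bundled with ggplot2.

```r
library(ggplot2)
library(lubridate)

# Assumption: the txhousing data bundled with ggplot2
df <- ggplot2::txhousing
df$date <- date_decimal(df$date, tz = "UTC")

# Keep only the 2015 records; subset() drops rows where the
# condition is FALSE or NA
df2 <- subset(df, year == 2015)
summary(df2)
```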
The graphic below plots the median value against the total number of listings, with one panel per month. It could be improved by color coding each city by the region it belongs to, since there are too many individual cities to meaningfully discern their effect on the plotted variables.
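The region color coding could be sketched as follows. The dataset has no region column, so the `region_lookup` vector here is a hypothetical, hand-written mapping for a handful of cities. Its labels are illustrative placeholders, not an official classification. The plot as actually produced follows after this sketch.

```r
library(ggplot2)

# Assumption: df2 is the 2015 subset of the txhousing data
df2 <- subset(ggplot2::txhousing, year == 2015)

# Hypothetical lookup: illustrative region labels for a few cities only
region_lookup <- c(
  "Houston" = "Gulf Coast",
  "Dallas"  = "North Texas",
  "Austin"  = "Central Texas",
  "El Paso" = "West Texas"
)

# Cities missing from the lookup get a catch-all label
df2$region <- unname(region_lookup[df2$city])
df2$region[is.na(df2$region)] <- "Unassigned"

p <- ggplot(df2, aes(listings, median, colour = region)) +
  geom_point(na.rm = TRUE) +   # silently skip rows with missing values
  facet_wrap(~month) +
  ggtitle("Listings vs. Median Value by Month, Colored by Region")
p
```

A complete version would need a region for every city, ideally taken from an authoritative source and not typed by hand.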
# Scatter plot of listings vs. median value, one panel per month (2015 data)
ggplot(data = df2, aes(listings, median)) +
  geom_point() +
  facet_wrap(~month) +
  ggtitle("Listings vs. Median Value by Month",
          subtitle = "Real Estate Data from 2015 only, from the TAMU Real Estate Center")