Code
sales_tbl_ts %>% ggplot() +
geom_line(aes(date, value)) +
theme_bw() +
xlab("Year Week") +
ylab("Median Sales Price")There are huge finances that are involved while buying a property. In order to make that investment in the right time and right property, it is very important to understand the factors that drives the price and more important what would be the forecast price in future for that property.One of biggest player in the real state industry is Zillow who deals in buying, selling and renting a home.
The Zestimate is marketed as a tool designed to take the mystery out of real estate for consumers who would otherwise have to rely on brokers and guesswork. However, Zestimates often give a single time point value which might not be an accurate way to base the decision of purchase on.
The dataset provides detailed information on home sales in the area on a weekly basis, which can be used to create accurate short-term forecasts of the real estate market. By analyzing trends in median home prices, it is possible to make predictions about the future performance of the market.
We have median weekly sales data for the houses sold in San Francisco region from 1st week of Jan 2018 to 1st week of Jan 2023.
sales_tbl_ts %>% ggplot() +
geom_line(aes(date, value)) +
theme_bw() +
xlab("Year Week") +
ylab("Median Sales Price")Seasonality ( Up trend during summer months and downtrend during holiday period)
Mean value is increasing over the year
summary(sales_tbl_ts$value) Min. 1st Qu. Median Mean 3rd Qu. Max.
846500 927840 977750 971991 1004750 1099750
x<-sd(sales_tbl_ts$value)
x[1] 58875.3
Mean of sales price is $971991 and median is $977750 with standard deviation of $58875.3
hist <- sales_tbl_ts %>%
ggplot() +
geom_histogram(aes(value),colour = 4, fill = "white", bins = 20) +
theme_bw()
dens <- sales_tbl_ts %>%
ggplot() +
geom_density(aes(value),colour = 4, fill = "white") +
theme_bw()
violin <- sales_tbl_ts %>%
ggplot() +
geom_violin(aes("", value),colour = 4, fill = "white") +
theme_bw()
boxplot <- sales_tbl_ts %>%
ggplot() +
geom_boxplot(aes("", value),colour = 4, fill = "white") +
theme_bw()
ggarrange(hist, dens, violin,boxplot ,
labels = c("A", "B", "C","D"),
ncol = 2, nrow = 2)We don’t see any outlines in the data
Mean and Median are very close.
Data is evenly distributed and its close to normal distribution
sales_tbl_ts %>%
as.data.frame() %>%
ggplot(aes(x=date, y=value)) +
ylab("Weekly PriceMedian") +
xlab("Weekly Date") +
geom_line() +
geom_smooth(method="lm", se=FALSE)`geom_smooth()` using formula = 'y ~ x'
Linear regression is not handling the seasonality, we should check the moving average method to check which rolling period we should use to calculate the MA.
sales_ma_data <- sales_tbl_ts %>%
arrange(date) %>%
mutate(
ma_13_left = rollapply(
value,
13,
FUN = mean,
align = "left", fill = NA
),
ma_13_right = rollapply(
value,
13,
FUN = mean,
align = "right", fill = NA
),
ma_13_center = rollapply(
value,
13,
FUN = mean,
align = "center", fill = NA
)
) %>%
mutate(
value_ma_3 = rollapply(value, 3, FUN = mean, align = "center", fill = NA),
value_ma_5 = rollapply(value, 5, FUN = mean, align = "center", fill = NA),
value_ma_7 = rollapply(value, 7, FUN = mean, align = "center", fill = NA),
value_ma_13 = rollapply(value, 13, FUN = mean, align = "center", fill = NA),
value_ma_25 = rollapply(value, 25, FUN = mean, align = "center", fill = NA),
value_ma_49 = rollapply(value, 49, FUN = mean, align = "center", fill = NA)
)
sales_tbl_pivot <- sales_ma_data %>%
pivot_longer(
cols = ma_13_left:value_ma_49,
values_to = "value_ma",
names_to = "ma_order"
) %>%
mutate(ma_order = factor(
ma_order,
levels = c(
"ma_13_center",
"ma_13_left",
"ma_13_right",
"value_ma_3",
"value_ma_5",
"value_ma_7",
"value_ma_13",
"value_ma_25",
"value_ma_49"
),
labels = c(
"ma_13_center",
"ma_13_left",
"ma_13_right",
"value_ma_3",
"value_ma_5",
"value_ma_7",
"value_ma_13",
"value_ma_25",
"value_ma_49"
)
))
x <- sales_tbl_pivot %>%
filter(
!ma_order %in% c(
"ma_13_center",
"ma_13_left",
"ma_13_right",
"value_ma_7",
"value_ma_49"
)
) %>%
mutate(ma_order = case_when(
ma_order=='value_ma_3'~'3rd Order',
ma_order=='value_ma_5'~'5th Order',
ma_order=='value_ma_13'~'13th Order',
ma_order=='value_ma_25'~'25th Order')
) %>%
mutate(
ma_order = factor(
ma_order,
labels = c('3rd Order',
'5th Order',
'13th Order',
'25th Order'),
levels = c('3rd Order',
'5th Order',
'13th Order',
'25th Order')
)
) %>%
ggplot() +
geom_line(aes(date, value), size = 1) +
geom_line(aes(date, value_ma, color = ma_order), size = 1) +
scale_color_discrete(name = 'MA Order')+
theme_bw()+
ylab('Median Sales Price')
xWe are trying to see the moving average based on the order. For the analysis we have taken 3rd,5th, 13th and 25th order.
From the graph 13th order moving avg. explains the seasonality. We
sales_tbl_ts %>%
model(
classical_decomposition(value)
) %>%
components() %>%
autoplot()sales_tbl_ts %>%
model(
STL(value)
) %>%
components() %>%
autoplot()The classical and STL decomposition are able to explain the seasonality but still we see a cyclic trend in the residual as it moves up and down.We won’t be able to certify it as a white noise.
file_ts <- ts(sales_tbl_ts$value, start = c(2018, 1), frequency = 52)
naive_forecast <- naive(file_ts, h = 6)
autoplot(naive_forecast) + theme_bw() + ylab('Median Sales Price')Using the Naive forecast we are forecasting for next 6 weeks but it doesn’t look like the correct way to forecast.
sales_tbl_ts_1<-ts(sales_tbl_ts$value, sales_tbl_ts$date)
par(mfrow=c(1,2))
acf(sales_tbl_ts_1, lag.max = 30)
pacf(sales_tbl_ts_1,lag.max = 5)Dampening effect of ACF suggests that our time series has a “long memory” since future values can be explained partially by values deep in past. But when we check the PACF lag period 2 so we need to use AR(2).