My Y variable is the median sale price. The skewness of the Y variable is 3.233, indicating that it is a substantially positively skewed distribution.
skewness(data_complete$median_sale_price)
## [1] 3.233043
hist(data_complete$median_sale_price)
The kurtosis is 20.452, far above that of a normal distribution, indicating a heavy-tailed (leptokurtic) distribution. This is the result of infrequent extreme deviations (a scattering of very high prices).
kurtosis(data_complete$median_sale_price)
## [1] 20.45204
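To make the two statistics above concrete, here is a minimal sketch of the moment-based definitions of skewness and kurtosis. It assumes the `skewness()` and `kurtosis()` calls come from a package that uses these definitions (the source does not name the package, and some packages use slightly different estimators or report excess kurtosis). The helper names `moment_skewness` and `moment_kurtosis` and the toy vectors are hypothetical, not part of the dataset.

```r
# Moment-based definitions (an assumption; other estimators differ slightly)
moment_skewness <- function(x) {
  d <- x - mean(x)                 # deviations from the mean
  mean(d^3) / mean(d^2)^(3 / 2)    # third central moment, standardized
}

moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2          # fourth central moment, standardized
}

x <- c(1, 2, 3, 4, 5)              # symmetric toy data
moment_skewness(x)                 # 0: no skew
moment_kurtosis(x)                 # 1.7

y <- c(1, 2, 3, 4, 50)             # same data with one extreme high value
moment_skewness(y)                 # positive: long right tail
moment_kurtosis(y)                 # larger than for x
```

A single extreme high value is enough to push both statistics up, which matches the pattern seen in the median sale prices.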
My X variable is the start date of the house listing; in the dataset, it is period_begin. For most periods, the median sale prices remain under $500,000, with some outliers above that level. In April and August, there are extreme outliers that exceed $2,500,000.
boxplot(median_sale_price ~ period_begin, data = data_complete)
Out of curiosity, I changed X to the state in which the median price was recorded; in the dataset, it is region_state. Median sale prices vary from state to state. California's median is above $500,000, with many outliers around $1,500,000. New York, although its volume is not as high as California's, has many extreme outliers above $2,000,000, some exceeding $2,500,000.
boxplot(median_sale_price ~ region_state, data = data_complete)
The scatter plot of sale price against listing begin date is shown below. It is a more detailed view of the first boxplot above.
plot(median_sale_price ~ period_begin, data = data_complete)
I embedded both date and state into the plot. Most high-priced homes are in CA, and some are in NY. The limitation is that with so many states as categories, the colors are not easy to distinguish. I will work on trimming the dataset for a future project.
ggplot(data_complete, aes(period_begin, median_sale_price, color = region_state)) + geom_point()
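One possible way to address the color limitation, sketched here under assumptions, is to keep only the most frequent states as separate categories and group the rest as "Other" before plotting. The helper `lump_states`, the column `state_group`, and the toy data frame are hypothetical and only illustrate the idea; the values are made up and do not come from the dataset.

```r
# Toy data standing in for data_complete (made-up values for illustration)
toy <- data.frame(
  region_state = c("CA", "CA", "CA", "NY", "NY", "TX", "FL", "WA"),
  median_sale_price = c(900000, 750000, 1200000, 2100000,
                        650000, 300000, 350000, 550000)
)

# Keep the n most frequent states; relabel all others as "Other"
lump_states <- function(states, n = 2) {
  counts <- sort(table(states), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  ifelse(states %in% keep, as.character(states), "Other")
}

toy$state_group <- lump_states(toy$region_state, n = 2)
table(toy$state_group)   # counts per group: CA 3, NY 2, Other 3
```

With a grouped column like this, the scatter plot could map `color = state_group` instead of `color = region_state`, leaving only a few colors to tell apart. Whether two groups or more is the right cutoff would depend on the actual data.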