1. Create Univariate analysis for the variable of your interest (your Y variable). Calculate skewness and kurtosis and describe the results.

My Y variable is the median sale price. The skewness of the Y variable is 3.233, indicating a substantially positively skewed (right-skewed) distribution: most values are relatively low, with a long tail of high prices.

skewness(data_complete$median_sale_price)
## [1] 3.233043
hist(data_complete$median_sale_price)

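The kurtosis is 20.452, far above the value expected for a normal distribution (3 for Pearson kurtosis, or 0 if the function reports excess kurtosis), indicating a heavy-tailed distribution. It is the result of infrequent extreme deviations (a scattering of very high prices).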

kurtosis(data_complete$median_sale_price)
## [1] 20.45204
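
To make the two statistics concrete, here is a minimal sketch of the moment-based definitions in base R. The helper names `moment_skewness` and `moment_kurtosis` and the toy vector `prices` are hypothetical and not part of the original analysis, and the report does not say which package supplied `skewness()` and `kurtosis()`, so these helpers may not reproduce the values above exactly.

```r
# Hypothetical helpers illustrating the moment-based definitions.
# They are assumptions for illustration, not the functions used above.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5   # third standardized moment; 0 for symmetric data
}

moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2     # Pearson kurtosis; 3 for a normal distribution
}

# Toy data (not from the housing dataset): one large value creates a right tail
prices <- c(1, 2, 3, 4, 10)
moment_skewness(prices)       # positive, because the tail is on the right
moment_kurtosis(prices) - 3   # excess kurtosis relative to a normal distribution
```

Subtracting 3 converts Pearson kurtosis to excess kurtosis, which is the convention some packages report by default.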

2. Create Bivariate plot Box Plot for your Y variable and one of other important metrics (your X). Describe figure.

My X variable is the start date of the house listing. In the dataset, it is period_begin. For most periods, most median sale prices remain under $500,000, with some outliers above $500,000. In April and August, there are extreme outliers that exceed $2,500,000.

boxplot(median_sale_price ~ period_begin, data = data_complete)

Out of curiosity, I changed the X to the state in which the homes were sold. In the dataset, it is region_state. We can see that median sale prices vary from state to state. California's median is above $500,000, with many outliers around $1,500,000. New York, on the other hand, has fewer observations than California but many extreme outliers above $2,000,000, some exceeding $2,500,000.

boxplot(median_sale_price ~ region_state, data = data_complete)
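
The points drawn as outliers in a box plot follow a fixed rule: by default, `boxplot()` flags values lying more than 1.5 times the interquartile range beyond the hinges. As a small sketch on toy data (the vector `x` is made up, not taken from the housing dataset), `boxplot.stats()` returns the same flagged values without drawing anything:

```r
# Toy data (an assumption for illustration): five ordinary values and one extreme value
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default
stats <- boxplot.stats(x)

hinges   <- stats$stats[c(2, 4)]  # lower and upper hinges (approximately Q1 and Q3)
outliers <- stats$out             # values beyond the whiskers

hinges
outliers
```

The same call could be applied per state, for example with `tapply()`, to count how many observations each state contributes beyond its whiskers.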

3. If your variables are continuous - Create a scatter plot between your Y and your X. If your variables are categorical - Create a bar plot. Describe figure.

The scatter plot of the median sale price against the listing start date is shown below. It is a more detailed view of the first box plot above, showing each observation individually rather than a summary for each period.

plot(median_sale_price ~ period_begin, data = data_complete)

4. Create a multivariate plot - Use the same plot as in 3 but add another important variable using colored symbols. Describe Figure. Make sure to add legend.

I embedded both date and state into the plot, with the state shown by point color. Most high-priced homes are in CA, and some are in NY. The limitation is that with so many states as categories, the colors are not easy to tell apart. I will work on trimming the dataset for a future project.

library(ggplot2)
ggplot(data_complete, aes(period_begin, median_sale_price, color = region_state)) + geom_point()
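
One possible way to trim the dataset so the colors stay distinguishable is to keep only a handful of states before plotting. The sketch below is an assumption about how that could be done, not the approach used in this report: `keep_top_groups` is a hypothetical helper, the toy data frame is invented, and ranking states by median price is only one of several reasonable criteria.

```r
# Hypothetical helper: keep rows belonging to the n groups with the highest median value
keep_top_groups <- function(df, group_col, value_col, n = 5) {
  medians <- tapply(df[[value_col]], df[[group_col]], median, na.rm = TRUE)
  top <- head(names(sort(medians, decreasing = TRUE)), n)
  df[df[[group_col]] %in% top, , drop = FALSE]
}

# Toy data (not from the housing dataset), reusing the report's column names
toy <- data.frame(
  region_state      = c("CA", "CA", "NY", "NY", "TX", "TX", "OH", "OH"),
  median_sale_price = c(800, 900, 700, 750, 300, 350, 200, 250)
)

top2 <- keep_top_groups(toy, "region_state", "median_sale_price", n = 2)
unique(top2$region_state)
```

The trimmed data frame could then be passed to the same `ggplot()` call in place of `data_complete`, so the legend lists only the retained states.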