The dataset has 14 variables, so let's start with a summary of the data. Count of missing values: 104,479. Type of missing values: NA (R's missing-value marker). Percent missing per variable: see the tables and graphs below.
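library(naniar)  # miss_case_summary(), miss_var_summary(), gg_miss_upset(), geom_miss_point()
library(visdat)  # vis_miss(), vis_dat()
data <- weekly_housing_market_data_1124_state_1w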
summary(data)
## region_id region_name region_state region_type
## Min. : 2 Length:51574 Length:51574 Length:51574
## 1st Qu.:1109 Class :character Class :character Class :character
## Median :1876 Mode :character Mode :character Mode :character
## Mean :1775
## 3rd Qu.:2603
## Max. :3234
##
## period_begin total_homes_sold median_sale_price
## Min. :2020-01-06 00:00:00 Min. : 1.00 Min. : 1
## 1st Qu.:2020-03-16 00:00:00 1st Qu.: 5.00 1st Qu.: 163000
## Median :2020-06-01 00:00:00 Median : 17.00 Median : 224900
## Mean :2020-05-31 19:30:05 Mean : 63.29 Mean : 255838
## 3rd Qu.:2020-08-17 00:00:00 3rd Qu.: 64.00 3rd Qu.: 300000
## Max. :2020-10-26 00:00:00 Max. :2363.00 Max. :6350000
## NA's :4031 NA's :4031
## total_new_listings median_new_listing_price active_listings
## Min. : 1.00 Min. : 800 Min. : 1.0
## 1st Qu.: 6.00 1st Qu.: 173000 1st Qu.: 63.0
## Median : 19.00 Median : 237000 Median : 192.0
## Mean : 73.47 Mean : 271807 Mean : 699.2
## 3rd Qu.: 73.00 3rd Qu.: 318000 3rd Qu.: 641.0
## Max. :2513.00 Max. :12587450 Max. :21084.0
## NA's :3299 NA's :3306 NA's :28
## median_active_list_price average_of_median_list_price
## Min. : 23000 Min. : 15000
## 1st Qu.: 173199 1st Qu.: 300000
## Median : 243900 Median : 399000
## Mean : 279694 Mean : 447801
## 3rd Qu.: 329900 3rd Qu.: 525000
## Max. :3997000 Max. :3498000
## NA's :46 NA's :42670
## average_of_median_offer_price median_days_on_market
## Min. : 9000 Min. : 0.0
## 1st Qu.: 300000 1st Qu.: 26.0
## Median : 398700 Median : 47.0
## Mean : 453276 Mean : 66.4
## 3rd Qu.: 530000 3rd Qu.: 75.5
## Max. :16825370 Max. :6369.0
## NA's :43010 NA's :4058
sum(is.na(data))
## [1] 104479
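To put the total count in proportion, it can be divided by the number of cells in the data frame. The short sketch below only redoes the arithmetic with the figures printed above (51,574 rows, 14 variables, 104,479 missing values); the variable names are mine and are not part of the original analysis.

```r
# Figures taken from the output above
n_rows    <- 51574   # rows reported by summary() / miss_case_summary()
n_cols    <- 14      # variables reported by miss_var_summary()
n_missing <- 104479  # sum(is.na(data))

# Share of all cells that are missing, in percent
pct_missing <- 100 * n_missing / (n_rows * n_cols)
round(pct_missing, 1)

# On the real data frame, mean(is.na(data)) * 100 gives the same share directly
```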
(colMeans(is.na(data)))*100
## region_id region_name
## 0.00000000 0.00000000
## region_state region_type
## 0.00000000 0.00000000
## period_begin total_homes_sold
## 0.00000000 7.81595378
## median_sale_price total_new_listings
## 7.81595378 6.39663396
## median_new_listing_price active_listings
## 6.41020669 0.05429092
## median_active_list_price average_of_median_list_price
## 0.08919223 82.73548687
## average_of_median_offer_price median_days_on_market
## 83.39473378 7.86830574
miss_case_summary(data)
## # A tibble: 51,574 x 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 3794 8 57.1
## 2 11837 8 57.1
## 3 13848 8 57.1
## 4 25619 8 57.1
## 5 34358 8 57.1
## 6 34359 8 57.1
## 7 34365 8 57.1
## 8 34367 8 57.1
## 9 34370 8 57.1
## 10 34372 8 57.1
## # ... with 51,564 more rows
miss_var_summary(data)
## # A tibble: 14 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 average_of_median_offer_price 43010 83.4
## 2 average_of_median_list_price 42670 82.7
## 3 median_days_on_market 4058 7.87
## 4 total_homes_sold 4031 7.82
## 5 median_sale_price 4031 7.82
## 6 median_new_listing_price 3306 6.41
## 7 total_new_listings 3299 6.40
## 8 median_active_list_price 46 0.0892
## 9 active_listings 28 0.0543
## 10 region_id 0 0
## 11 region_name 0 0
## 12 region_state 0 0
## 13 region_type 0 0
## 14 period_begin 0 0
vis_miss(data, warn_large_data = FALSE)
gg_miss_upset(data)
vis_dat(data, warn_large_data = FALSE)
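Missing data come mainly from two columns: average_of_median_offer_price (83.4% missing) and average_of_median_list_price (82.7% missing). In addition, median_days_on_market, total_homes_sold, median_sale_price, median_new_listing_price, and total_new_listings are each missing about 6 to 8% of their values, while active_listings and median_active_list_price are missing less than 0.1%.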
library(ggplot2)  # ggplot(), aes()
ggplot(data, aes(x = region_id, y = average_of_median_list_price)) +
  geom_miss_point()
I used region_id as the x-axis only because it has no missing values. It acts as a canvas on which the missing values of average_of_median_list_price can be plotted.
I could not install either “BaylorEdPsych” or “littleMCAR” under my current R version, so I could not run Little's MCAR test. Instead I used the md.pattern function from the mice package. Judging from the missingness graphs in #3, the missing values are spread fairly evenly across the dataset.
#install.packages("BaylorEdPsych")
#install.packages("littleMCAR", repos = "http://cran.us.r-project.org")
#LittleMCAR(x)
#vis_dat(data)
#install.packages("VIM")
#library(VIM)
library(mice)  # provides md.pattern()
md.pattern(data)
## region_id region_name region_state region_type period_begin
## 8548 1 1 1 1 1
## 338 1 1 1 1 1
## 37085 1 1 1 1 1
## 19 1 1 1 1 1
## 14 1 1 1 1 1
## 2 1 1 1 1 1
## 2262 1 1 1 1 1
## 3 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 1512 1 1 1 1 1
## 8 1 1 1 1 1
## 1732 1 1 1 1 1
## 1 1 1 1 1 1
## 17 1 1 1 1 1
## 28 1 1 1 1 1
## 0 0 0 0 0
## active_listings median_active_list_price total_new_listings
## 8548 1 1 1
## 338 1 1 1
## 37085 1 1 1
## 19 1 1 1
## 14 1 1 1
## 2 1 1 1
## 2262 1 1 1
## 3 1 1 1
## 3 1 1 1
## 2 1 1 0
## 1512 1 1 0
## 8 1 1 0
## 1732 1 1 0
## 1 1 0 1
## 17 1 0 0
## 28 0 0 0
## 28 46 3299
## median_new_listing_price total_homes_sold median_sale_price
## 8548 1 1 1
## 338 1 1 1
## 37085 1 1 1
## 19 1 1 1
## 14 1 0 0
## 2 1 0 0
## 2262 1 0 0
## 3 0 1 1
## 3 0 0 0
## 2 0 1 1
## 1512 0 1 1
## 8 0 1 1
## 1732 0 0 0
## 1 0 0 0
## 17 0 0 0
## 28 0 1 1
## 3306 4031 4031
## median_days_on_market average_of_median_list_price
## 8548 1 1
## 338 1 1
## 37085 1 0
## 19 0 0
## 14 0 1
## 2 0 1
## 2262 0 0
## 3 1 0
## 3 0 0
## 2 1 1
## 1512 1 0
## 8 0 0
## 1732 0 0
## 1 0 0
## 17 0 0
## 28 1 0
## 4058 42670
## average_of_median_offer_price
## 8548 1 0
## 338 0 1
## 37085 0 2
## 19 0 3
## 14 1 3
## 2 0 4
## 2262 0 5
## 3 0 3
## 3 0 6
## 2 1 2
## 1512 0 4
## 8 0 5
## 1732 0 7
## 1 0 7
## 17 0 8
## 28 0 6
## 43010 104479
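Little's MCAR test is also available as mcar_test() in naniar, the package that provides miss_var_summary() above, so it may be an option where BaylorEdPsych cannot be installed. The sketch below runs it on R's built-in airquality data, not on the housing data. The requireNamespace() guard and the idea of restricting the housing data to its numeric columns are my assumptions, not part of the original analysis.

```r
# Minimal sketch: Little's MCAR test via naniar::mcar_test()
# airquality ships with R and has missing values in Ozone and Solar.R
if (requireNamespace("naniar", quietly = TRUE)) {
  res <- naniar::mcar_test(airquality)
  print(res)  # a one-row tibble that includes the test statistic and p.value
}

# Assumed adaptation for the housing data (not run here):
# num_data <- data[sapply(data, is.numeric)]  # keep numeric columns only
# naniar::mcar_test(num_data)
```

A small p-value would count as evidence against MCAR; a large one is consistent with MCAR but does not prove it.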
Missing values make up about 14.5% of all data points, and the missingness does not appear to be related to the values of any other variable; it may stem from recording errors. I therefore treat the data as missing completely at random (MCAR), although I could not confirm this with a formal test.
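One informal way to look for a relationship between missingness and another variable is to compare that variable across rows where the value is missing and rows where it is present. The sketch below uses a made-up toy data frame (the numbers are invented for illustration and are not from the housing data); the column names are borrowed from the dataset only to keep the example readable.

```r
# Toy data: invented values, for illustration only
toy <- data.frame(
  active_listings   = c(10, 20, 30, 40, 50, 60),
  median_sale_price = c(NA, NA, 250000, 300000, NA, 320000)
)

# Indicator: is the sale price missing in this row?
toy$price_missing <- is.na(toy$median_sale_price)

# Mean of an observed variable within each missingness group
group_means <- tapply(toy$active_listings, toy$price_missing, mean)
group_means
```

If the group means differ a lot, as they do in this toy example, the missingness may depend on the observed variable, which would argue against MCAR.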
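I assume no relationship between the missingness in average_of_median_list_price or average_of_median_offer_price and any other variable. Treating the data as MCAR, I used listwise deletion, which keeps only the 8,548 rows with no missing values (the first row of the md.pattern output).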
test_data <- na.omit(data)
sum(is.na(test_data))
## [1] 0
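Because listwise deletion drops every row with at least one missing value, it is worth checking how many rows survive. A minimal sketch on a made-up toy data frame (values invented for illustration; not the housing data):

```r
# Toy data: rows 2, 3 and 5 each have one missing value
toy <- data.frame(
  region_id             = 1:5,
  median_sale_price     = c(200000, NA, 310000, 150000, NA),
  median_days_on_market = c(30, 45, NA, 20, 60)
)

kept      <- na.omit(toy)            # listwise deletion
n_dropped <- nrow(toy) - nrow(kept)  # rows removed

nrow(kept)                 # rows remaining
n_dropped                  # rows dropped
sum(complete.cases(toy))   # same count as nrow(kept)
```

The same three lines applied to data and test_data would show how much of the original dataset remains after na.omit().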