699 Week 4 Missing Data and Outliers Analysis

1. Describe the missing data and provide a summary, similar to the analysis in Chapter 2 (Table 3): count and percent of missing data per variable, type of missing data (NA, null), and total percent of missingness in the dataset

The dataset has many variables, so I start with a summary of the data. Count of missing data: 104,479. Type of missing data: NA (reported as NA's in the summary output). Total missingness: about 14.5% of all cells (104,479 of 51,574 × 14 = 722,036). Percent per variable: see the tables and the graph below.

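# naniar: miss_case_summary, miss_var_summary, gg_miss_upset, geom_miss_point
# visdat: vis_miss, vis_dat; ggplot2: ggplot
library(naniar)
library(visdat)
library(ggplot2)

data <- weekly_housing_market_data_1124_state_1w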

summary(data)

##    region_id    region_name        region_state       region_type       
##  Min.   :   2   Length:51574       Length:51574       Length:51574      
##  1st Qu.:1109   Class :character   Class :character   Class :character  
##  Median :1876   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1775                                                           
##  3rd Qu.:2603                                                           
##  Max.   :3234                                                           
##                                                                         
##   period_begin                 total_homes_sold  median_sale_price
##  Min.   :2020-01-06 00:00:00   Min.   :   1.00   Min.   :      1  
##  1st Qu.:2020-03-16 00:00:00   1st Qu.:   5.00   1st Qu.: 163000  
##  Median :2020-06-01 00:00:00   Median :  17.00   Median : 224900  
##  Mean   :2020-05-31 19:30:05   Mean   :  63.29   Mean   : 255838  
##  3rd Qu.:2020-08-17 00:00:00   3rd Qu.:  64.00   3rd Qu.: 300000  
##  Max.   :2020-10-26 00:00:00   Max.   :2363.00   Max.   :6350000  
##                                NA's   :4031      NA's   :4031     
##  total_new_listings median_new_listing_price active_listings  
##  Min.   :   1.00    Min.   :     800         Min.   :    1.0  
##  1st Qu.:   6.00    1st Qu.:  173000         1st Qu.:   63.0  
##  Median :  19.00    Median :  237000         Median :  192.0  
##  Mean   :  73.47    Mean   :  271807         Mean   :  699.2  
##  3rd Qu.:  73.00    3rd Qu.:  318000         3rd Qu.:  641.0  
##  Max.   :2513.00    Max.   :12587450         Max.   :21084.0  
##  NA's   :3299       NA's   :3306             NA's   :28       
##  median_active_list_price average_of_median_list_price
##  Min.   :  23000          Min.   :  15000             
##  1st Qu.: 173199          1st Qu.: 300000             
##  Median : 243900          Median : 399000             
##  Mean   : 279694          Mean   : 447801             
##  3rd Qu.: 329900          3rd Qu.: 525000             
##  Max.   :3997000          Max.   :3498000             
##  NA's   :46               NA's   :42670               
##  average_of_median_offer_price median_days_on_market
##  Min.   :    9000              Min.   :   0.0       
##  1st Qu.:  300000              1st Qu.:  26.0       
##  Median :  398700              Median :  47.0       
##  Mean   :  453276              Mean   :  66.4       
##  3rd Qu.:  530000              3rd Qu.:  75.5       
##  Max.   :16825370              Max.   :6369.0       
##  NA's   :43010                 NA's   :4058

sum(is.na(data))

## [1] 104479

(colMeans(is.na(data)))*100

##                     region_id                   region_name 
##                    0.00000000                    0.00000000 
##                  region_state                   region_type 
##                    0.00000000                    0.00000000 
##                  period_begin              total_homes_sold 
##                    0.00000000                    7.81595378 
##             median_sale_price            total_new_listings 
##                    7.81595378                    6.39663396 
##      median_new_listing_price               active_listings 
##                    6.41020669                    0.05429092 
##      median_active_list_price  average_of_median_list_price 
##                    0.08919223                   82.73548687 
## average_of_median_offer_price         median_days_on_market 
##                   83.39473378                    7.86830574
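The total percent of missingness for the whole dataset can be computed in the same style as the per-variable percentages. A minimal sketch on a toy data frame (`toy` and its column values are made up for illustration and are not taken from the housing data):

```r
# Toy data frame standing in for the real dataset (hypothetical values).
toy <- data.frame(
  id    = 1:5,
  price = c(100, NA, 300, NA, 500),
  days  = c(10, 20, NA, 40, 50)
)

n_missing   <- sum(is.na(toy))             # total count of NA cells
pct_per_var <- colMeans(is.na(toy)) * 100  # percent missing per column
pct_total   <- mean(is.na(toy)) * 100      # percent missing over all cells

print(n_missing)
print(pct_per_var)
print(pct_total)
```

On the real data, the corresponding call would be `mean(is.na(data)) * 100`.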

miss_case_summary(data)

## # A tibble: 51,574 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1  3794      8     57.1
##  2 11837      8     57.1
##  3 13848      8     57.1
##  4 25619      8     57.1
##  5 34358      8     57.1
##  6 34359      8     57.1
##  7 34365      8     57.1
##  8 34367      8     57.1
##  9 34370      8     57.1
## 10 34372      8     57.1
## # ... with 51,564 more rows

miss_var_summary(data)

## # A tibble: 14 x 3
##    variable                      n_miss pct_miss
##    <chr>                          <int>    <dbl>
##  1 average_of_median_offer_price  43010  83.4   
##  2 average_of_median_list_price   42670  82.7   
##  3 median_days_on_market           4058   7.87  
##  4 total_homes_sold                4031   7.82  
##  5 median_sale_price               4031   7.82  
##  6 median_new_listing_price        3306   6.41  
##  7 total_new_listings              3299   6.40  
##  8 median_active_list_price          46   0.0892
##  9 active_listings                   28   0.0543
## 10 region_id                          0   0     
## 11 region_name                        0   0     
## 12 region_state                       0   0     
## 13 region_type                        0   0     
## 14 period_begin                       0   0

vis_miss(data, warn_large_data = FALSE)

2. Plot visualization of missing data pattern

gg_miss_upset(data)

vis_dat(data, warn_large_data = FALSE)

3. Describe if you have observed any patterns

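Most of the missing data come from two columns: average_of_median_list_price (82.7% missing) and average_of_median_offer_price (83.4% missing). In addition, total_homes_sold, median_sale_price, median_days_on_market, total_new_listings, and median_new_listing_price are each missing in roughly 6% to 8% of rows. total_homes_sold and median_sale_price have identical missing counts (4,031 each).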

ggplot(data, aes(x = region_id, y = average_of_median_list_price)) +
  geom_miss_point()

I used region_id as the x-axis only because it has no missing values. It serves as a complete reference axis against which the missing values of the other variable can be plotted.
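To check whether two sparse columns tend to be missing in the same rows, their missingness indicators can be cross-tabulated. A minimal base R sketch on toy data (the column names and values below are hypothetical stand-ins, not the real housing data):

```r
# Toy data (hypothetical) in which two columns are often missing in the same rows.
toy <- data.frame(
  list_price  = c(NA, NA, 300, NA, 500, NA),
  offer_price = c(NA, NA, 310, NA, NA, 620)
)

# Rows: is list_price missing? Columns: is offer_price missing?
co_missing <- table(
  list_missing  = is.na(toy$list_price),
  offer_missing = is.na(toy$offer_price)
)
print(co_missing)

# Share of rows where both columns are missing together.
both_missing <- mean(is.na(toy$list_price) & is.na(toy$offer_price))
print(both_missing)
```

On the real data, the same calls would use `data$average_of_median_list_price` and `data$average_of_median_offer_price`.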

4. Run statistical analysis to determine if your data is MCAR or MAR. For example, LittleMCAR - https://www.rdocumentation.org/packages/BaylorEdPsych/versions/0.5/topics/LittleMCAR

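Neither “BaylorEdPsych” nor “littleMCAR” is compatible with my current R version, so I used the md.pattern() function from the mice package instead. Note that md.pattern() tabulates the missing-data patterns; it is not a formal test of MCAR. In the geom_miss_point() plot from #3, the missing values are spread fairly evenly across the dataset.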

# Little's MCAR test (not run: package not compatible with this R version)
#install.packages("BaylorEdPsych")
#LittleMCAR(data)

# md.pattern() is provided by the mice package, not VIM
#install.packages("mice")
library(mice)
md.pattern(data)

##       region_id region_name region_state region_type period_begin
## 8548          1           1            1           1            1
## 338           1           1            1           1            1
## 37085         1           1            1           1            1
## 19            1           1            1           1            1
## 14            1           1            1           1            1
## 2             1           1            1           1            1
## 2262          1           1            1           1            1
## 3             1           1            1           1            1
## 3             1           1            1           1            1
## 2             1           1            1           1            1
## 1512          1           1            1           1            1
## 8             1           1            1           1            1
## 1732          1           1            1           1            1
## 1             1           1            1           1            1
## 17            1           1            1           1            1
## 28            1           1            1           1            1
##               0           0            0           0            0
##       active_listings median_active_list_price total_new_listings
## 8548                1                        1                  1
## 338                 1                        1                  1
## 37085               1                        1                  1
## 19                  1                        1                  1
## 14                  1                        1                  1
## 2                   1                        1                  1
## 2262                1                        1                  1
## 3                   1                        1                  1
## 3                   1                        1                  1
## 2                   1                        1                  0
## 1512                1                        1                  0
## 8                   1                        1                  0
## 1732                1                        1                  0
## 1                   1                        0                  1
## 17                  1                        0                  0
## 28                  0                        0                  0
##                    28                       46               3299
##       median_new_listing_price total_homes_sold median_sale_price
## 8548                         1                1                 1
## 338                          1                1                 1
## 37085                        1                1                 1
## 19                           1                1                 1
## 14                           1                0                 0
## 2                            1                0                 0
## 2262                         1                0                 0
## 3                            0                1                 1
## 3                            0                0                 0
## 2                            0                1                 1
## 1512                         0                1                 1
## 8                            0                1                 1
## 1732                         0                0                 0
## 1                            0                0                 0
## 17                           0                0                 0
## 28                           0                1                 1
##                           3306             4031              4031
##       median_days_on_market average_of_median_list_price
## 8548                      1                            1
## 338                       1                            1
## 37085                     1                            0
## 19                        0                            0
## 14                        0                            1
## 2                         0                            1
## 2262                      0                            0
## 3                         1                            0
## 3                         0                            0
## 2                         1                            1
## 1512                      1                            0
## 8                         0                            0
## 1732                      0                            0
## 1                         0                            0
## 17                        0                            0
## 28                        1                            0
##                        4058                        42670
##       average_of_median_offer_price       
## 8548                              1      0
## 338                               0      1
## 37085                             0      2
## 19                                0      3
## 14                                1      3
## 2                                 0      4
## 2262                              0      5
## 3                                 0      3
## 3                                 0      6
## 2                                 1      2
## 1512                              0      4
## 8                                 0      5
## 1732                              0      7
## 1                                 0      7
## 17                                0      8
## 28                                0      6
##                               43010 104479

Missing values make up about 14.5% of all data points, and they do not appear to be related to the values of another variable; they could be due to recording errors. Based on this visual inspection rather than a formal test, I treat the data as missing completely at random (MCAR).
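One informal way to probe the MCAR assumption is to compare a fully observed variable between rows where another variable is missing and rows where it is observed; a clear difference would point toward MAR rather than MCAR. A minimal sketch using base R's t.test() on simulated data (the variable names and the simulated missingness mechanism are assumptions for illustration, and this is not Little's test):

```r
set.seed(1)

# Simulated data (hypothetical): 'listings' is fully observed, and
# 'offer_price' is set to NA more often when 'listings' is small,
# which is a MAR mechanism by construction.
n <- 500
listings    <- rexp(n, rate = 1 / 100)
offer_price <- rnorm(n, mean = 300000, sd = 50000)
p_missing   <- ifelse(listings < 50, 0.9, 0.2)
offer_price[runif(n) < p_missing] <- NA

# Missingness indicator for offer_price.
offer_missing <- factor(is.na(offer_price))

# Welch t-test: does 'listings' differ between missing and observed rows?
result <- t.test(listings ~ offer_missing)
print(result$p.value)
```

A small p-value here indicates that missingness in one variable depends on the observed values of another, which would argue against MCAR. On the real data, the grouping could be built from `is.na(data$average_of_median_offer_price)` and compared against a nearly complete column such as `active_listings`.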

5. Explain what type of imputation will be performed: list-wise/pair-wise deletions, mean imputation, regression imputation etc

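I assume no relationship between average_of_median_list_price or average_of_median_offer_price and any other variable. Treating the data as MCAR, I adopted list-wise deletion, which drops every row that contains at least one NA. According to the md.pattern() output above, this keeps only the 8,548 complete rows out of 51,574 (about 16.6%).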

test_data <- na.omit(data)
sum(is.na(test_data))

## [1] 0
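Since list-wise deletion can discard many rows, it helps to check how much data survives it. A minimal base R sketch on a toy data frame (`toy` and its values are hypothetical):

```r
# Toy data frame (hypothetical values) with NAs scattered across columns.
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, 2, 3, 4, NA),
  c = c(1, 2, 3, NA, 5)
)

complete  <- na.omit(toy)                 # list-wise deletion
kept_rows <- nrow(complete)               # rows with no NA in any column
kept_pct  <- 100 * kept_rows / nrow(toy)  # share of rows retained

print(kept_rows)
print(kept_pct)

# complete.cases() flags the same rows that na.omit() keeps.
stopifnot(sum(complete.cases(toy)) == kept_rows)
```

On the real data, `nrow(test_data) / nrow(data)` gives the retained share.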