Overview

Our data set includes about 100K rows of rental property data from across major markets throughout the US.

Our goal is to look into the data and determine what variables are having the largest impact on our rental prices, both positive and negative.

The plan is to clean and shape the data set, then use some exploratory methods to see if we can find any patterns and relationships.

Finally, we will create an interactive correlation map as a representation of what the data is telling us related to feature sets and pricing.

Let’s get into it!

Environment Setup

First, the libraries we will be using to work this data:

library(tidyverse)
library(janitor)
library(corrplot)
library(ggpubr)
library(magrittr)
library(moments)
library(lsr)
library(psych)
library(RColorBrewer)
library(heatmaply)

Load in the data and we will format the variable names, attach the data so that we can call the variables, and finally create a table to work with.

data_1 <- read.csv("Rental_Properties.csv") %>% 
  clean_names()

attach(data_1)

as_tibble(data_1)
## # A tibble: 100,089 × 18
##       id city  state stree…¹   zip house…²  beds baths  sqft price secur…³ pets 
##    <int> <chr> <chr> <chr>   <int> <chr>   <int> <int> <int> <int>   <int> <lgl>
##  1 92869 West… CA    6409 L… 91307 Apartm…     1     1   499  2500    5000 FALSE
##  2 89644 Bost… MA    1 Nass…  2111 Condo/…     1     1   499  2350    2350 FALSE
##  3 88767 Ocea… CA    3339 L… 92054 Apartm…     1     1   499  2300    2350 TRUE 
##  4 87598 Broo… NY    999 Li… 11208 Apartm…     1     1   499  2252    2350 TRUE 
##  5 86715 Broo… NY    999 Li… 11207 Apartm…     1     1   499  2204    2300 TRUE 
##  6 46299 Long… CA    1334-1… 90813 Apartm…     1     1   499  1399    2000 FALSE
##  7 44356 Medf… OR    406 W … 97501 Apartm…     1     1   499  1375     500 FALSE
##  8 41117 Medf… OR    406 W … 97501 Apartm…     1     1   499  1325     500 TRUE 
##  9 11483 Dall… TX    4107 A… 75219 Apartm…     1     1   499   850     850 FALSE
## 10 99721 Lagu… CA    31834 … 92651 Apartm…     1     1   500  2995    3000 FALSE
## # … with 100,079 more rows, 6 more variables: smoking_ind <lgl>, pool <lgl>,
## #   dishwasher <lgl>, washer_dryer <lgl>, air_conditioning <lgl>,
## #   parking <lgl>, and abbreviated variable names ¹​street_address, ²​house_type,
## #   ³​security_deposit

Cleanup

Reviewing the tibble, I notice right away that we have Boolean variables. Let’s go ahead and convert them to numeric so that we can use the variables in our analysis.

data_1 %<>% mutate_if(is.logical,as.numeric) 

Next, I like to see if we have any NAs that we will need to address before moving on. All good there, so let’s keep digging in.

colSums(is.na(data_1))
##               id             city            state   street_address 
##                0                0                0                0 
##              zip       house_type             beds            baths 
##                0                0                0                0 
##             sqft            price security_deposit             pets 
##                0                0                0                0 
##      smoking_ind             pool       dishwasher     washer_dryer 
##                0                0                0                0 
## air_conditioning          parking 
##                0                0

Visualize Data for Shape & Center

For the next step, let’s begin some visualization of the data to see what we are working with in terms of symmetry and shape. This is a good time to discover if we have any anomalies or dirty data that may affect our analysis. We will create some histograms, and add a mean line to help us gauge our data.

Ok, it looks like there is something funky with the sqft variable. We will take a closer look.

par(mfrow = c(1,3))

 hist(price)
 abline(v = mean(price),                       
        col = "red",
        lwd = 3)
 
 hist(sqft)
 abline(v = mean(sqft),                       
        col = "red",
        lwd = 3)
 
 
 hist(beds)
 abline(v = mean(beds),                       
        col = "red",
        lwd = 3)

We can already see that something is going on with “sqft”, but we will go ahead and check the skewness on a few of these variables for further analysis of our symmetry and centers. Typically anything within 3 to -3 is acceptable to run with. We can see “sqft” is way out of whack.

skewness(price)
## [1] 0.5086513
skewness(sqft)
## [1] 276.8053
skewness(beds)
## [1] 0.1386674
skewness(baths)
## [1] 0.9892749

We will sort the data to see if there are any anomalies

sorted <- sort(desc(sqft))
head(sorted, 10)
##  [1] -509230  -84123  -33400  -23288  -15000  -14000  -14000  -12000  -11225
## [10]  -11152

Yes, sure enough, we have some large outliers in the bottom portion of the data. Given that most of our data sqft data fall within the range of 400 to 4,000, we will remove these few outliers.

data_1 <- subset(data_1, sqft < 4000)
data_1 <- na.omit(data_1)

tail(data_1, n = 10)
##            id             city state           street_address   zip house_type
## 100029  84680           Peyton    CO        11370 Cranston Dr 80831  Apartment
## 100030  98976      Bloomington    IN 3301 N Ramble Road Court 47404  Apartment
## 100031  95002           Tucson    AZ     5556 E Circulo Terra 85750  Apartment
## 100032  72668    Mechanicsburg    PA     1210 Bent Creek Blvd 17050   Townhome
## 100033  99390     Grain Valley    MO        1800 NE Aegean Ct 64029      House
## 100034  97752 Colorado Springs    CO       5973 Revelstoke Dr 80924  Apartment
## 100035 100049         Prescott    AZ  214 N. McCormick Street 86301  Apartment
## 100036  93962         Plymouth    MN       41xx Nathan Lane N 55441  Apartment
## 100037  47185         Burleson    TX         821 Greenwood Dr 76028  Apartment
## 100038  98082          Windsor    CO       1493 Ridge West Dr 80550  Apartment
##        beds baths sqft price security_deposit pets smoking_ind pool dishwasher
## 100029    3     3 3700  2150             2150    1           0    0          0
## 100030    3     3 3720  2900             2900    0           0    0          0
## 100031    2     3 3739  2600             2900    0           0    0          0
## 100032    3     4 3774  1800             1800    0           0    0          1
## 100033    3     4 3800  2950             2950    0           0    0          1
## 100034    3     3 3844  2800             2800    0           0    0          0
## 100035    2     2 3853  2995             2995    0           0    0          0
## 100036    3     3 3866  2530             5060    1           0    0          0
## 100037    2     1 3900  1400             2800    1           0    0          1
## 100038    3     3 3958  2800             2800    1           0    0          0
##        washer_dryer air_conditioning parking
## 100029            0                0       0
## 100030            0                0       0
## 100031            0                0       0
## 100032            1                0       0
## 100033            1                1       1
## 100034            0                1       1
## 100035            0                0       0
## 100036            0                0       0
## 100037            1                0       0
## 100038            0                0       0

Let’s take a look at the histogram now. Yep, that looks better. You can compare it to the original chart above for visual confirmation. We could continue some scaling techniques to further process that variable data, but for this demonstration, we will move ahead.

 hist(data_1$sqft)
 abline(v = mean(sqft),                       
        col = "red",
        lwd = 3)

Stats

Let’s take a look at our descriptive stats. We are not using the factor variables in this analysis, so we won’t pay much attention to those stats. The rest of the numeric stats look reasonable with most of the standard deviations and center stats in a place where we can expect some reliable results with further analysis for this demonstration.

describeBy(data_1)
##                  vars      n     mean       sd  median  trimmed      mad  min
## id                  1 100038 50036.29 28891.72 50031.5 50034.84 37093.17    1
## city*               2 100038  1976.55  1221.31  1902.0  1962.43  1673.86    1
## state*              3 100038    28.36    15.66    32.0    28.99    17.79    1
## street_address*     4 100038 34715.73 20148.64 34657.5 34729.14 26002.58    1
## zip                 5 100038 61865.33 26601.82 72730.0 63813.88 25797.24 1013
## house_type*         6 100038     1.18     0.51     1.0     1.05     0.00    1
## beds                7 100038     1.92     0.77     2.0     1.90     1.48    1
## baths               8 100038     1.49     0.62     1.0     1.41     0.00    1
## sqft                9 100038  1018.58   372.34   942.0   974.20   336.55  499
## price              10 100038  1516.16   570.63  1450.0  1479.09   593.04  275
## security_deposit   11 100038   955.87   868.37   800.0   860.62  1037.82    0
## pets               12 100038     0.68     0.47     1.0     0.73     0.00    0
## smoking_ind        13 100038     0.00     0.04     0.0     0.00     0.00    0
## pool               14 100038     0.21     0.40     0.0     0.13     0.00    0
## dishwasher         15 100038     0.36     0.48     0.0     0.32     0.00    0
## washer_dryer       16 100038     0.36     0.48     0.0     0.32     0.00    0
## air_conditioning   17 100038     0.23     0.42     0.0     0.16     0.00    0
## parking            18 100038     0.20     0.40     0.0     0.12     0.00    0
##                     max  range  skew kurtosis    se
## id               100089 100088  0.00    -1.20 91.35
## city*              4298   4297  0.07    -1.24  3.86
## state*               51     50 -0.27    -1.49  0.05
## street_address*   69196  69195 -0.01    -1.21 63.70
## zip               99901  98888 -0.57    -0.77 84.11
## house_type*           4      3  3.51    13.91  0.00
## beds                  3      2  0.14    -1.28  0.00
## baths                 7      6  0.99     0.53  0.00
## sqft               3958   3459  1.35     2.86  1.18
## price              2999   2724  0.51    -0.39  1.80
## security_deposit   5990   5990  0.82     0.38  2.75
## pets                  1      1 -0.78    -1.40  0.00
## smoking_ind           1      1 26.49   699.48  0.00
## pool                  1      1  1.45     0.11  0.00
## dishwasher            1      1  0.60    -1.64  0.00
## washer_dryer          1      1  0.58    -1.66  0.00
## air_conditioning      1      1  1.31    -0.29  0.00
## parking               1      1  1.51     0.29  0.00

Correlation

We could run some correlations individually such as:

cor(price,beds)
## [1] 0.2474355
cor(price,sqft)
## [1] 0.08759964
cor(price,zip)
## [1] 0.07727921

Instead, let’s go ahead and grab all of the numeric data and create a correlation map so that we can visualize some meaningful insights out of this data.

There we go. We now have summed up 100k rows into some useful information. Here we can see the biggest feature price drivers such as sqft as well as the features that are having a negative impact on price such as air conditioning.

In this data context, air conditioning is negatively impacting the price because a majority of the properties in this data set do not have AC.

cor_map_dat <- cor(select_if(data_1, is.numeric))

cor_map_dat <- round(x = cor(cor_map_dat), digits = 2)

corrplot(cor_map_dat
         , method = 'color'
         , order = 'hclust' #cluster orders the variables so that the ones similar are placed next to each other
         , addCoef.col = 'black'
         , number.cex = .6 #lower values decrease the size of the cell
)

table(air_conditioning)
## air_conditioning
## FALSE  TRUE 
## 77434 22655

This heat map is great for our analysis, but we wrap our data into a JavaScript widget to create an interactive map that we can embed in online media or a dashboard.

rentalHeat <- heatmaply(
  cor_map_dat,
  colors = colorRampPalette(brewer.pal(3, "RdBu"))(256),
  k_col = 2, 
  k_row = 2
)
rentalHeat
# Sys.setenv("plotly_username" = "Jnorm")
# Sys.setenv("plotly_api_key" = "dpoij1564")
# api_create(rentalHeat, "Rental Data Heat Map")

Summary