Our data set includes about 100K rows of rental property data from major markets across the US.
Our goal is to determine which variables have the largest impact on rental prices, both positive and negative.
The plan is to clean and shape the data set, then use some exploratory methods to look for patterns and relationships.
Finally, we will create an interactive correlation map to visualize how property features relate to pricing.
Let’s get into it!
First, the libraries we will be using to work with this data:
library(tidyverse)
library(janitor)
library(corrplot)
library(ggpubr)
library(magrittr)
library(moments)
library(lsr)
library(psych)
library(RColorBrewer)
library(heatmaply)
Load in the data, clean up the variable names, attach the data so that we can call the variables directly, and finally create a tibble to work with.
# Read the data and standardize the column names (janitor::clean_names)
data_1 <- read.csv("Rental_Properties.csv") %>%
  clean_names()
# Attach so columns can be referenced by name. Note that attach() works on a
# copy, so later changes to data_1 will not update these attached variables.
attach(data_1)
as_tibble(data_1)
## # A tibble: 100,089 × 18
## id city state stree…¹ zip house…² beds baths sqft price secur…³ pets
## <int> <chr> <chr> <chr> <int> <chr> <int> <int> <int> <int> <int> <lgl>
## 1 92869 West… CA 6409 L… 91307 Apartm… 1 1 499 2500 5000 FALSE
## 2 89644 Bost… MA 1 Nass… 2111 Condo/… 1 1 499 2350 2350 FALSE
## 3 88767 Ocea… CA 3339 L… 92054 Apartm… 1 1 499 2300 2350 TRUE
## 4 87598 Broo… NY 999 Li… 11208 Apartm… 1 1 499 2252 2350 TRUE
## 5 86715 Broo… NY 999 Li… 11207 Apartm… 1 1 499 2204 2300 TRUE
## 6 46299 Long… CA 1334-1… 90813 Apartm… 1 1 499 1399 2000 FALSE
## 7 44356 Medf… OR 406 W … 97501 Apartm… 1 1 499 1375 500 FALSE
## 8 41117 Medf… OR 406 W … 97501 Apartm… 1 1 499 1325 500 TRUE
## 9 11483 Dall… TX 4107 A… 75219 Apartm… 1 1 499 850 850 FALSE
## 10 99721 Lagu… CA 31834 … 92651 Apartm… 1 1 500 2995 3000 FALSE
## # … with 100,079 more rows, 6 more variables: smoking_ind <lgl>, pool <lgl>,
## # dishwasher <lgl>, washer_dryer <lgl>, air_conditioning <lgl>,
## # parking <lgl>, and abbreviated variable names ¹street_address, ²house_type,
## # ³security_deposit
Reviewing the tibble, I notice right away that we have logical (Boolean) variables. Let’s go ahead and convert them to numeric so that we can use them in our correlation analysis.
data_1 %<>% mutate_if(is.logical,as.numeric)
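Side note: mutate_if() still works but has been superseded in newer dplyr releases. If you are on dplyr 1.0 or later, the across() idiom below is the equivalent, shown here just as a sketch of the same conversion.
# Equivalent conversion with the newer across() helper
data_1 <- data_1 %>% mutate(across(where(is.logical), as.numeric))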
Next, I like to see if we have any NAs that we will need to address before moving on. All good there, so let’s keep digging in.
colSums(is.na(data_1))
## id city state street_address
## 0 0 0 0
## zip house_type beds baths
## 0 0 0 0
## sqft price security_deposit pets
## 0 0 0 0
## smoking_ind pool dishwasher washer_dryer
## 0 0 0 0
## air_conditioning parking
## 0 0
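No action is needed here since every count is zero, but if NAs had shown up, the simplest remedy is to drop the affected rows. A minimal sketch using tidyr (loaded with the tidyverse):
# Hypothetical: drop any rows containing NAs before analysis
data_complete <- tidyr::drop_na(data_1)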
For the next step, let’s visualize the data to see what we are working with in terms of symmetry and shape. This is a good time to discover any anomalies or dirty data that may affect our analysis. We will create some histograms and add a mean line to each to help us gauge the data.
Ok, it looks like there is something funky with the sqft variable. We will take a closer look.
par(mfrow = c(1,3))  # show the three histograms side by side
hist(price)
abline(v = mean(price),  # red line marks the mean
       col = "red",
       lwd = 3)
hist(sqft)
abline(v = mean(sqft),
       col = "red",
       lwd = 3)
hist(beds)
abline(v = mean(beds),
       col = "red",
       lwd = 3)
We can already see that something is going on with “sqft”, but we will go ahead and check the skewness on a few of these variables for a numeric read on symmetry and center. Typically, anything between -3 and 3 is acceptable to run with. We can see “sqft” is way out of whack.
skewness(price)
## [1] 0.5086513
skewness(sqft)
## [1] 276.8053
skewness(beds)
## [1] 0.1386674
skewness(baths)
## [1] 0.9892749
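You can also scan every numeric column in one shot rather than calling skewness() variable by variable; a quick sketch:
# Skewness for all numeric columns at once
sapply(select_if(data_1, is.numeric), skewness)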
Next, we will sort the sqft data to see if there are any anomalies.
# desc() negates the values, so an ascending sort puts the largest sqft
# first (displayed as negatives)
sorted <- sort(desc(sqft))
head(sorted, 10)
## [1] -509230 -84123 -33400 -23288 -15000 -14000 -14000 -12000 -11225
## [10] -11152
Yes, sure enough, we have some extreme outliers. Remember the values above are negated by desc(), so these are the largest sqft entries, topping out at 509,230 square feet. Given that most of our sqft data falls within the range of 400 to 4,000, we will remove these few outliers.
data_1 <- subset(data_1, sqft < 4000)
data_1 <- na.omit(data_1)
tail(data_1, n = 10)
## id city state street_address zip house_type
## 100029 84680 Peyton CO 11370 Cranston Dr 80831 Apartment
## 100030 98976 Bloomington IN 3301 N Ramble Road Court 47404 Apartment
## 100031 95002 Tucson AZ 5556 E Circulo Terra 85750 Apartment
## 100032 72668 Mechanicsburg PA 1210 Bent Creek Blvd 17050 Townhome
## 100033 99390 Grain Valley MO 1800 NE Aegean Ct 64029 House
## 100034 97752 Colorado Springs CO 5973 Revelstoke Dr 80924 Apartment
## 100035 100049 Prescott AZ 214 N. McCormick Street 86301 Apartment
## 100036 93962 Plymouth MN 41xx Nathan Lane N 55441 Apartment
## 100037 47185 Burleson TX 821 Greenwood Dr 76028 Apartment
## 100038 98082 Windsor CO 1493 Ridge West Dr 80550 Apartment
## beds baths sqft price security_deposit pets smoking_ind pool dishwasher
## 100029 3 3 3700 2150 2150 1 0 0 0
## 100030 3 3 3720 2900 2900 0 0 0 0
## 100031 2 3 3739 2600 2900 0 0 0 0
## 100032 3 4 3774 1800 1800 0 0 0 1
## 100033 3 4 3800 2950 2950 0 0 0 1
## 100034 3 3 3844 2800 2800 0 0 0 0
## 100035 2 2 3853 2995 2995 0 0 0 0
## 100036 3 3 3866 2530 5060 1 0 0 0
## 100037 2 1 3900 1400 2800 1 0 0 1
## 100038 3 3 3958 2800 2800 1 0 0 0
## washer_dryer air_conditioning parking
## 100029 0 0 0
## 100030 0 0 0
## 100031 0 0 0
## 100032 1 0 0
## 100033 1 1 1
## 100034 0 1 1
## 100035 0 0 0
## 100036 0 0 0
## 100037 1 0 0
## 100038 0 0 0
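We eyeballed the 4,000 sqft cutoff from the histogram. A more data-driven alternative, not used here but worth knowing, is a standard 1.5 × IQR fence applied before the subset above; a sketch:
# Hypothetical alternative: fence outliers at 1.5 * IQR beyond the quartiles
q <- quantile(data_1$sqft, c(0.25, 0.75))
fence <- 1.5 * IQR(data_1$sqft)
data_iqr <- subset(data_1, sqft >= q[1] - fence & sqft <= q[2] + fence)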
Let’s take a look at the histogram now. Yep, that looks better; you can compare it to the original chart above for visual confirmation. We could apply further scaling techniques to process this variable, but for this demonstration, we will move ahead. One caveat in the code below: the attached sqft still points at the original, unfiltered data, so we reference data_1$sqft explicitly.
hist(data_1$sqft)
abline(v = mean(data_1$sqft),  # use the filtered data, not the attached copy
       col = "red",
       lwd = 3)
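If we did want to pursue scaling, a log transform is a common first move for right-skewed size data. A minimal sketch, not used in the rest of this analysis:
# Compress the right tail of sqft with a log transform
log_sqft <- log(data_1$sqft)
hist(log_sqft, main = "Histogram of log(sqft)")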
Let’s take a look at our descriptive stats. We are not using the factor variables (marked with *) in this analysis, so we won’t pay much attention to those rows. The rest of the numeric stats look reasonable, with the standard deviations and measures of center in a range where we can expect reliable results from further analysis.
describe(data_1)  # psych's describe(); describeBy() is meant for grouped summaries
## vars n mean sd median trimmed mad min
## id 1 100038 50036.29 28891.72 50031.5 50034.84 37093.17 1
## city* 2 100038 1976.55 1221.31 1902.0 1962.43 1673.86 1
## state* 3 100038 28.36 15.66 32.0 28.99 17.79 1
## street_address* 4 100038 34715.73 20148.64 34657.5 34729.14 26002.58 1
## zip 5 100038 61865.33 26601.82 72730.0 63813.88 25797.24 1013
## house_type* 6 100038 1.18 0.51 1.0 1.05 0.00 1
## beds 7 100038 1.92 0.77 2.0 1.90 1.48 1
## baths 8 100038 1.49 0.62 1.0 1.41 0.00 1
## sqft 9 100038 1018.58 372.34 942.0 974.20 336.55 499
## price 10 100038 1516.16 570.63 1450.0 1479.09 593.04 275
## security_deposit 11 100038 955.87 868.37 800.0 860.62 1037.82 0
## pets 12 100038 0.68 0.47 1.0 0.73 0.00 0
## smoking_ind 13 100038 0.00 0.04 0.0 0.00 0.00 0
## pool 14 100038 0.21 0.40 0.0 0.13 0.00 0
## dishwasher 15 100038 0.36 0.48 0.0 0.32 0.00 0
## washer_dryer 16 100038 0.36 0.48 0.0 0.32 0.00 0
## air_conditioning 17 100038 0.23 0.42 0.0 0.16 0.00 0
## parking 18 100038 0.20 0.40 0.0 0.12 0.00 0
## max range skew kurtosis se
## id 100089 100088 0.00 -1.20 91.35
## city* 4298 4297 0.07 -1.24 3.86
## state* 51 50 -0.27 -1.49 0.05
## street_address* 69196 69195 -0.01 -1.21 63.70
## zip 99901 98888 -0.57 -0.77 84.11
## house_type* 4 3 3.51 13.91 0.00
## beds 3 2 0.14 -1.28 0.00
## baths 7 6 0.99 0.53 0.00
## sqft 3958 3459 1.35 2.86 1.18
## price 2999 2724 0.51 -0.39 1.80
## security_deposit 5990 5990 0.82 0.38 2.75
## pets 1 1 -0.78 -1.40 0.00
## smoking_ind 1 1 26.49 699.48 0.00
## pool 1 1 1.45 0.11 0.00
## dishwasher 1 1 0.60 -1.64 0.00
## washer_dryer 1 1 0.58 -1.66 0.00
## air_conditioning 1 1 1.31 -0.29 0.00
## parking 1 1 1.51 0.29 0.00
We could run some correlations individually, such as:
cor(price,beds)
## [1] 0.2474355
cor(price,sqft)
## [1] 0.08759964
cor(price,zip)
## [1] 0.07727921
Instead, let’s go ahead and grab all of the numeric data and create a correlation map so that we can visualize all of the relationships at once.
There we go. We have now summed up 100K rows into some useful information. Here we can see the biggest positive price drivers, such as sqft, as well as the features having a negative impact on price, such as air conditioning.
In this data context, air conditioning correlates negatively with price because a majority of the properties in this data set do not have AC.
# Correlation matrix of the numeric columns, rounded for display
cor_map_dat <- cor(select_if(data_1, is.numeric))
cor_map_dat <- round(cor_map_dat, digits = 2)
corrplot(cor_map_dat
         , method = 'color'
         , order = 'hclust' # hierarchical clustering places similar variables next to each other
         , addCoef.col = 'black'
         , number.cex = .6 # smaller values shrink the coefficient labels
         )
table(air_conditioning)  # the attached copy is pre-filter and still logical, hence FALSE/TRUE
## air_conditioning
## FALSE TRUE
## 77434 22655
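To put a number on that claim, we can compare the average rent with and without AC; a quick sanity check using base R's aggregate():
# Mean price for properties without (0) and with (1) air conditioning
aggregate(price ~ air_conditioning, data = data_1, FUN = mean)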
This heat map is great for our analysis, but let’s also create an interactive version that we can embed in online media or a dashboard.
rentalHeat <- heatmaply(
  cor_map_dat,
  colors = colorRampPalette(brewer.pal(3, "RdBu"))(256), # diverging red-blue palette
  k_col = 2, # color the column dendrogram into two clusters
  k_row = 2  # color the row dendrogram into two clusters
)
rentalHeat
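To actually embed it, one option is to save the widget as a standalone HTML file with htmlwidgets (which heatmaply loads); the file name here is just a placeholder:
# Save the interactive heat map as a self-contained HTML file for embedding
htmlwidgets::saveWidget(rentalHeat, "rental_heatmap.html", selfcontained = TRUE)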
We cleaned, shaped, and processed our data for analysis.
We created a static heat map and an interactive heat map to help us understand what impact feature sets are having on our rents.
Square footage has the greatest positive impact, and the lack of AC has the greatest negative impact. If we wanted to increase our blended rent ask, we could have the acquisitions department look into rentals with AC as an option.