Texas Realty Insights: exploratory

First visualization of the dataset

city year month sales volume median_price listings months_inventory
Beaumont 2010 1 83 14.162 163800 1533 9.5
Beaumont 2010 2 108 17.690 138200 1586 10.0
Beaumont 2010 3 182 28.701 122400 1689 10.6
Beaumont 2010 4 200 26.819 123200 1708 10.6
Beaumont 2010 5 202 28.833 123100 1771 10.9

1. Variable analysis

  • city: categorical variable, can be used to study which city has the highest median price in general or by year, and also if some city has a positive trend to do some prediction on the future.
  • year: numerical countinous variable. It’s possible the historical trend, so if there is some trend that’s declining or some positive once that can be useful for new business strategy.
  • month: numerical countinous variable, infact the number as for the years represent the time period. As well as the year is possible to do some historical analysis, but also is possible to visualize if some trend depends on specific period of time like summer/winter/holidays, end or start of the year.
  • sales: numerical discrete variable. From this data is possible to analyze correlation between number of sales and certain period of time or number of listings as well as defining if the real estate business is growing or not on a city level and on a state level. It’s also possible to understand if sales and volume (revenue from the sales in milion of dollars) are linear correlated or not.
  • volume: numerical discrete variable. It’s possible to analyze the amount of milions made historically, but also by city. Also it’s possible to analyze the correlation between volume and listings as well as the median price.
  • median_price: numerical discrete variable. It’s possible to evaluate the correlation between median price and listings/volume/sales. Also it’s possible to understand which city has the highest median price and if it’s always the same or not and also we can analyze the data using year and month checking the historical trends.
  • listings: numerical discrete variable. Is possible to check which year or month has the highest listings and if the trend repeat every year or during some months. Also can be seen the correlation between number of listings and median price/volume or sales.
  • months_inventory: numerical continous variable. It’s possible to analyze the difference between every year and within the year. It can be seen also the correlation between number of listings and number of months needed to sell every house.

2. Position, variable and form indexes

Position Indexes

City Distribution
n f
Beaumont 60 0.25
Bryan-College Station 60 0.25
Tyler 60 0.25
Wichita Falls 60 0.25
Year Distribution
n f
2010 48 0.2
2011 48 0.2
2012 48 0.2
2013 48 0.2
2014 48 0.2

From the frequency distribution is possible to see that the dataset is evenly distributed in the different city and in the different year with 60 row per city and 48 for every year.

Below is possible to have an overview of the positional indexes.

Position Indexes overview
volume sales median_price listings months_inventory
Min. : 8.166 Min. : 79.0 Min. : 73800 Min. : 743 Min. : 3.400
1st Qu.:17.660 1st Qu.:127.0 1st Qu.:117300 1st Qu.:1026 1st Qu.: 7.800
Median :27.062 Median :175.5 Median :134500 Median :1618 Median : 8.950
Mean :31.005 Mean :192.3 Mean :132665 Mean :1738 Mean : 9.193
3rd Qu.:40.893 3rd Qu.:247.0 3rd Qu.:150050 3rd Qu.:2056 3rd Qu.:10.950
Max. :83.547 Max. :423.0 Max. :180000 Max. :3296 Max. :14.900

Positional indexes Grouped

Grouped by City

Sales by city
city Mean Median Max Min Sum
Beaumont 177.38 176.5 273 83 10643
Bryan-College Station 205.97 186.5 403 89 12358
Tyler 269.75 271.0 423 143 16185
Wichita Falls 116.07 114.5 167 79 6964
Volume by city
city Mean Median Max Min Sum
Beaumont 26.13 25.6185 42.028 13.496 1567.896
Bryan-College Station 38.19 33.5925 83.547 15.151 2291.496
Tyler 45.77 45.0835 80.814 21.050 2746.043
Wichita Falls 13.93 13.7110 20.881 8.166 835.810
Median Price by city
city Mean Median Max Min Sum
Beaumont 129988.3 130750 163800 106700 7799300
Bryan-College Station 157488.3 155400 180000 140700 9449300
Tyler 141441.7 142200 161600 120600 8486500
Wichita Falls 101743.3 102300 135300 73800 6104600

It’s possible to see above that Wichita Falls has the lowest sales by at least 30% and Tyler has the highest sales in all of the value (max, min, sum, mean and median). Beaumont and Bryan-College Station are in the middle with similar value, with Bryan-College Station with greater value than Beaumont. In terms of volume the same distribution appear with Wichita Falls at the last place, with the lowest market and the less dinamic one. Instead on the Median Price Level Bryan-College Station is at first place with the highest market value, but not the most dinamic one as seen in the field above.

Grouped by Year

Sales by year
year Mean Median Max Min Sum
2010 168.67 162.0 316 83 8096
2011 164.12 144.5 313 79 7878
2012 186.15 171.0 322 90 8935
2013 211.92 193.5 402 79 10172
2014 230.60 215.0 423 89 11069
Volume by year
year Mean Median Max Min Sum
2010 25.68 23.6840 49.914 8.951 1232.443
2011 25.16 23.2400 52.319 8.166 1207.575
2012 29.27 25.8475 57.388 9.695 1404.843
2013 35.15 32.6760 76.116 9.666 1687.315
2014 39.77 36.8275 83.547 9.626 1909.069
Median Price by year
year Mean Median Max Min Sum
2010 130191.7 133050 165300 86400 6249200
2011 127854.2 131650 157300 73800 6137000
2012 130077.1 134200 170000 82100 6243700
2013 135722.9 140250 167300 85900 6514700
2014 139481.2 142300 180000 90000 6695100

From 2010 and 2014 sales, volume and median_price increase year by year with the only drop in 2011. for sales and volume mean and median values tells us that the distribution has always outliers in the highest side of the curve, infact the mean value is greater than the median. Opposite behavior can be seen for the median_price.

Grouped by month

Sales by month
month Mean Median Max Min Sum
1 127.40 112.5 238 79 2548
2 140.85 124.5 244 79 2817
3 189.45 175.5 298 102 3789
4 211.70 199.0 323 111 4234
5 238.85 246.0 388 102 4777
6 243.55 258.0 423 111 4871
7 235.75 209.0 403 104 4715
8 231.45 228.0 357 123 4629
9 182.35 165.5 361 95 3647
10 179.90 163.5 369 97 3598
11 156.85 157.0 300 93 3137
12 169.40 155.5 332 81 3388
Volume by month
month Mean Median Max Min Sum
1 19.00 17.1385 36.916 8.166 380.015
2 21.65 19.3315 42.553 8.747 433.030
3 29.38 26.7675 50.404 13.104 587.694
4 33.30 30.9325 60.581 13.524 666.089
5 39.70 38.7930 71.456 12.451 794.042
6 41.30 41.2580 80.814 13.031 826.063
7 39.12 32.8395 83.547 12.355 782.438
8 38.01 36.8970 67.244 14.003 760.283
9 29.60 26.2430 68.744 11.792 591.983
10 29.08 26.5180 65.316 9.507 581.572
11 24.81 24.3195 52.314 10.844 496.143
12 27.09 25.2870 61.032 9.400 541.893
Median Price by month
month Mean Median Max Min Sum
1 124250 128400 163800 82100 2485000
2 130075 132000 168500 89400 2601500
3 127415 129800 161000 85900 2548300
4 131490 132300 169500 92200 2629800
5 134485 135300 165200 97500 2689700
6 137620 137350 169600 95000 2752400
7 134750 137450 172600 96700 2695000
8 136675 143850 172200 96000 2733500
9 134040 133100 180000 90000 2680800
10 133480 136150 176100 73800 2669600
11 134305 140050 172800 86400 2686100
12 133400 134250 177300 87500 2668000

During the year we can assert that the distribution of the sales, volume and median_price has a slow start with the lowest values in the first month of the year a peak in June and a descent after with a little bump in December. median_price and volume also has a bump in August, maybe caused by the end of the summer.

Variable Indexes

Below it’s possible to find a summary of the variable indexes for the columns for which they can be evaluated:

volume sales median_price listings months_inventory
Range 75.38 344.00 106200.00 2553.00 11.50
Interquartile Range 23.23 120.00 32750.00 1029.50 3.15
Variance 277.27 6344.30 513572983.09 566568.97 5.31
Standard Deviation 16.65 79.65 22662.15 752.71 2.30
Variance Coefficient 53.71 41.42 17.08 43.31 25.06
Gini Index 1.00 1.00 1.00 1.00 0.99

“Volume” has the most variance coefficient, so has the most variability, instead the most stable variable is “median_price”. From the Gini index can be seen that for all of the columns the index is equal to one (or very close to one), so the distribution is uniform for all of the columns.

Variable Indexes grouped

Grouped by city:

Sales grouped by city
city Range IQR Standard_Deviation Variance VC
Beaumont 190 52.00 41.48 1720.92 23.39
Bryan-College Station 314 147.75 84.98 7222.24 41.26
Tyler 280 86.75 61.96 3839.51 22.97
Wichita Falls 88 33.00 22.15 490.71 19.09

Grouping by city “sales” tells has that Wichita Falls has the most stable data; Bryan-College Station instead is the most unstable city. Tyler and Beaumont has very similar stability. Throungh standard deviation and interquartile range it’s possible to list the city from the most stable to the most unstable in term of number of sales and it’s: Wichita, Tyler, Beaumont, Bryan-College.

It’s also possible to study this stability in term of volume to have a better view on the stability, because maybe an house in Bryan-College create more volume than one in Wichita.

Volume grouped by city
city Range IQR Standard_Deviation Variance VC
Beaumont 28.53 8.21850 6.97 48.59 26.67
Bryan-College Station 68.40 23.71700 17.25 297.51 45.16
Tyler 59.76 17.57025 13.11 171.80 28.64
Wichita Falls 12.71 4.80175 3.24 10.50 23.26

In terms of volume Beaumont and Tyler switch position, infact can be seen that Beaumont it’s more stable than Tyler. So from a strategic perspective Beaumont it’s better if the goal is to create more value on the total sales of the company.

Grouped by year

Sales grouped by year
year Range IQR Standard_Deviation Variance VC
2010 233 81.25 60.54 3664.74 35.89
2011 234 86.75 63.87 4079.43 38.92
2012 232 114.00 70.91 5027.53 38.09
2013 323 124.50 84.00 7055.40 39.64
2014 334 151.25 95.51 9123.10 41.42

From a sales perspective 2011, 2012 and 2013 were very similar year. 2014, instead, has the most unstable month, with a Variance coefficient greater than the other years.

Volume grouped by year
year Range IQR Standard_Deviation Variance CV
2010 40.96 14.43950 10.80 116.53 42.04
2011 44.15 17.15750 12.20 148.93 48.51
2012 47.69 20.02350 14.52 210.91 49.62
2013 66.45 30.75825 17.93 321.65 51.02
2014 73.92 37.32250 21.19 448.86 53.27

Also from a “volume” view can be seen the same trend, with 2014 as the most unstable year.

This historical trend can be seen also in terms of company growth, because we know that from 2010 to 2014 the company increase the amount of sales and volume (with the exception of 2011), so more instability in the year mean simply mean selling more house in some month due to the growth of the company.

Form Indexes

Overview of the form indexes
volume sales median_price listings months_inventory
Fischer Index 0.88 0.72 -0.36 0.65 0.04
Kurtosis Index 3.18 2.69 2.38 2.21 2.83
  • Fischer: “median_price” has a negative asymmetric distibution, meanwhile “listings”, “sales” and “volume” has a positive asymmetric distibution with mean greater than median greater than mode. “months_inventory” has the most simmetric distribution with a Fischer index almost equal to zero.

  • Kurtosis index: “volume” is the only one with a value greater than 3 so his distribution is leptokutic. Instead the other variable has a value smaller than 3 so they are platikurtic with a wider look than the normal distribution. “months_inventory”, as expected, is the most similar to a normal distribution.

3. Identification of variables with greater variability and skewness

Variability:

The variability can be define through the variance coefficient, because is a normalize parameter that let compare different variable with differen scale, so the variable with the highest variance coefficient has the most variability. Here a list of the variable from the one with the most variability (greatest variance coefficient) to the most stable (smaller variance coefficient):

  • “volume”
  • “listings”
  • “sales”
  • “months_inventory”
  • “median_price”

Skewness:

Here a list of the variable from the most sharp (greater kurtosis index) to the most flat (smaller kurtosis index):

  • “volume” (\(\gamma_2\) > 3)
  • “months_inventory” (most similar to a normal distribution, \(\gamma_2\) = 3)
  • “sales” (\(\gamma_2\) < 3)
  • “median_price” (\(\gamma_2\) < 3)
  • “listings” (\(\gamma_2\) < 3)

Here a list of the variable with the most positve asymmetrical distribution (greater Fischer Index), so that the peak of the distribution is on the right side of the distribution, to the negative asymettrical distribution (smaller Fischer Index), with the peak of the distribution to the left side:

  • “volume” (\(\gamma_1\) > 0)
  • “sales” (\(\gamma_1\) > 0)
  • “listings” (\(\gamma_1\) > 0)
  • “months_inventory” (most similar to a normal distribution, \(\gamma_1\) = 0)
  • “median_price” (\(\gamma_1\) < 0)

4. Creating classes for a quantitative variable

Volume Class

Frequency Count for volume class
Volume Class Count
0-20 Milion 78
20-35 Milion 81
35-50 Milion 45
50+ Milion 36

The distribution of the class is towards the lower class, in fact both first range have almost the same count than the last two together. From the bar graph it’s possible to see this behavior. The Gini index is equal to 0.96 very close to 1 and tells us that the distribution is very heterogeneous, in fact there isn’t one single class that has all the values.

Here an example of a class distribution that correspond to a Gini index close to 0

5. Probability calculation

What is the probability that, if a row is randomly selected from this dataset, it shows the city ‘Beaumont’?

The probability is calculated as number of rows with Beaumont as city divided by total number of rows. From the table below it’s possible to calculate with the same process even more, in fact due to the fact the the number of rows by city is equal for every city evaluating the probability for one city it’s the same for every city (and it’s going to be equal to 25%)

City Count
City Count
Beaumont 60
Bryan-College Station 60
Tyler 60
Wichita Falls 60
City Probability
City Count Probability
Beaumont 60 25
Bryan-College Station 60 25
Tyler 60 25
Wichita Falls 60 25

What is the probability that, if a row is randomly selected from this dataset, it corresponds to the month of July?

Month Count
# Month Count
April 20
August 20
December 20
February 20
January 20
July 20
June 20
March 20
May 20
November 20
October 20
September 20

It’s possible to apply the same logic as before:

Month Probability
# Month Count Probability
April 20 8.33
August 20 8.33
December 20 8.33
February 20 8.33
January 20 8.33
July 20 8.33
June 20 8.33
March 20 8.33
May 20 8.33
November 20 8.33
October 20 8.33
September 20 8.33

What is the probability that, if a row is randomly selected from this dataset, it shows the month of December 2012?

Count by Month and Year
2010 2011 2012 2013 2014
April 4 4 4 4 4
August 4 4 4 4 4
December 4 4 4 4 4
February 4 4 4 4 4
January 4 4 4 4 4
July 4 4 4 4 4
June 4 4 4 4 4
March 4 4 4 4 4
May 4 4 4 4 4
November 4 4 4 4 4
October 4 4 4 4 4
September 4 4 4 4 4

The same logic can be apply as before. So it will be 4 divided by 240.

Probability result
The probability for every Year-Month combination is equal to 1.67 %

So summarizing:

  1. Probability that a row contains Beamout is equal to 25%
  2. Probability that a row contains July is equal to 8.33%
  3. Probability that a row contains December-2012 is equal to 1.67%%

6. Creating new variables

Mean Price

The mean price can be evaluated through “sales”, that is the number of sales done and “volume” that is the total value of the sales in milions of dollar. The result will be converted in the same scale of the “median_price”

Let’s visualize some Mean Price:

Mean Price values grouped by city
city Mean Sum Max Min
Beaumont 146640 8798426 174724 120713
Bryan-College Station 183534 11012058 213234 151816
Tyler 167677 10060604 193787 143464
Wichita Falls 119430 7165803 148775 97010

Bryan-College has the highest value, so we can assert that it’s an high value neighborhood. The opposite can be said for Wichita Falls that has the smallest values.

Mean Price values grouped by Year
year Mean Sum Max Min
2010 150189 7209054 201000 100573
2011 148251 7116026 195387 97010
2012 150899 7243139 191798 97848
2013 158705 7617852 200085 105126
2014 163559 7850820 213234 108157

The historical trend confirm what was found before. The market increase it’s value, with the only drop in 2011.

Listings Efficiency

Listing efficiency can be evaluate, for example, relating number of sales and number of listings.

Dividing the number of sales by number of listing we will have a number of sales for every listings for that specific row. This efficiency can be also normalize to have a number between 0 and 1 where 1 correspond with the max efficiency and 0 correspond to the opposite.

Row that correspond with the max listing’s efficiency
city year month sales volume median_price listings months_inventory vol_class vol_class_new month_name mean_price list_efficiency list_efficiency_norm
115 Bryan-College Station 2014 7 403 83.547 172600 1041 4.1 50+ Milion 80+ Milion July 207313 0.3871278 1
Row that correspond with the minimum listing’s efficiency
city year month sales volume median_price listings months_inventory vol_class vol_class_new month_name mean_price list_efficiency list_efficiency_norm
133 Tyler 2011 1 143 21.05 120600 2852 12.6 20-35 Milion 9-70 Milion January 147203 0.0501403 0

The same logic can be applied to “volume”, where the efficiency will be equal to the milions of dollar generated per listing.

Row that correspond with the max listing’s efficiency
city year month sales volume median_price listings months_inventory vol_class vol_class_new month_name mean_price list_efficiency list_efficiency_norm
115 Bryan-College Station 2014 7 403 83.547 172600 1041 4.1 50+ Milion 80+ Milion July 207313 0.0802565 1
Row that correspond with the minimum listing’s efficiency
city year month sales volume median_price listings months_inventory vol_class vol_class_new month_name mean_price list_efficiency list_efficiency_norm
133 Tyler 2011 1 143 21.05 120600 2852 12.6 20-35 Milion 9-70 Milion January 147203 0.0073808 0

The two way to evaluate the efficiency return the same rows, confirming that efficiency of the listing was at it’s peak and it’s bottom in the two scenarios describe by the two rows.

The minimum value happen in 2011, very consistently with the fact that 2011 was a bad year for the house market, and in January as seen, confirming that the start of the year correspond to worst month of the year. The maximum efficiency correspond to Bryan-College, confirming the high value neighborhood, during 2014, as expected due to the increase of the market during the years, and in July, proving the fact that June and July are the best month of the year for the house market.

7. Conditional analysis

With the same logic as before it’s possible to do some conditional analysis grouping by city, year and month.

City

Year

Month

8. Creating visualizations with ggplot2

BOXPLOT

Mean and Median price by city

The boxplot reflect what we have seen before: - Wichita has the lowest values. In median price term is also the most variable city. - Beaumont and Tyler are in between of Wichita and Bryan-College with Tyler that has highest value, but is more unstable with a larger range than Beaumont. - Bryan-College has the highest values, but in terms of mean price is the most variable.

BAR CHART

Total Sales by city

As seen before Tyler has the greater number of sales, follow by Bryan-College and Beaumont. Wichita Falls is at the bottom with almost half the sales of the second to last city in term of number of sales.

Total Sales by time period

The two bar chart above confirm the historical trend analyze before: - In terms of year only 2011 falls behind the year before, otherwise every year has more sales than the one before. - In terms of month June has the highest number of sales, january and December are at the bottom, with a bell-shaped trend during the year.

Volume

As expected Bryan-College and Tyler are the two best city. Tyler is more consistent from in the end of the year, with very high values in relation with the other city. Beaumont is very consistent during the year, opposite of that Bryan-College is the most variable with the highest peak and a lowest point similar to Beaumont. Wichita Falls confirm the low value market during the entire year.

The same behavior can be seen in this chart.

To deep dive, is also possible to see use also the year to study the difference between the city during the month.

The has the same behavior seen above and is very clear that during the 5 year Wichita remains at the same level, instead Tyler is the city with the greatest growth.

LINE CHART

Volume Historical Trend

The line charts confirm the historical trends seen before, also from the charts it’s possible to visualize the relation betweenn “sales” and “volume”.

Historical trends by city

From here is very clear the growth done by Tyler, that is almost above Bryan-College in the last timeframe. It’s also very clear that Wichita is a very stable market, but it’s not a good investment in terms of growth for the company.

The same trend can be seen in terms of number of sales

9. Conclusion

From the analysis can be deducted various things:

  1. The best city to sell is Bryan-College Station, because the house price are the best. Also Tyler is a very good city, because the number of sales is the highest so it’s the most dynamic market, and also the price are the second best
  2. Wichita Falls is a small market, in terms of company it’s not a good investment, ROI is very small.
  3. The start and the end of the year are the worst month, meanwhile the center of the year is the best time to sell. So in terms of company strategy can be beneficial to invest in listing and new home during the summer, where the market is more alive.
  4. Year by year the market is growing and so the company, so investing and be aggressive can be a good way to increase the company growth.
  5. Listings have a good efficiency when the market is at it’s peak, so though the year it’s beneficial to invest in listing publish during the center of the year as well as hiring some sales man during the summer.
  6. The greatest investment for the future it might be Tyler as a city, because of the growth during the time period of the dataset and the stability at the end of the year unlike other cities.