Texas Realty Insights seeks to analyze real estate market trends in Texas, leveraging historical home sales data. The goal is to provide statistical and visual insights to support strategic decisions regarding sales and optimization of real estate listings.
Project Objectives:
Identify and interpret historical real estate sales trends in Texas.
Evaluate the effectiveness of real estate listing marketing strategies.
Provide a graphical representation of the data that highlights the distribution of prices and sales across cities, months, and years.
## [1] "city" "year" "month" "sales"
## [5] "volume" "median_price" "listings" "months_inventory"
This dataset containse the following variables:
city: reference city
year: reference year
month: reference month
sales: number of total sales
volume: value of total sales (in milions of dollar)
median_price: median price of sale (in dollar)
listings: number of total active advertisements
months_inventory: The amount of time required to sell all current listings, expressed in months.
For better data management, we create a yearmon column called date using the year and month columns.
head(realestate_texas, 5)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## date
## 1 Jan 2010
## 2 Feb 2010
## 3 Mar 2010
## 4 Apr 2010
## 5 May 2010
There are three types of variables:
Nominal qualitative variables: city.
Discrete numerical quantitative variables, such as year and month.
Continuous numerical quantitative variables: sales, volume, median price, month inventory, date.
Regarding the discrete numerical quantitative variables identified (year and month), the following considerations can be made depending on the type of analysis being performed:
If the goal is to analyse temporal trends (e.g. the evolution of volume over time), these variables can be considered continuous numerical variables, particularly when combined into a single temporal variable (“date”), which allows time series analysis.
However, if the objective is to compare values between distinct categories (e.g. different years or months), these variables can be considered ordinal categorical, where the order between categories is relevant.
Creating a ‘date’ column that combines year and month into a single date value enables more efficient time series analysis management, facilitating filtering, aggregation and temporal data visualisation operations.
Similar considerations can be made for the remaining numeric variables; depending on the analysis being performed, these variables can be considered ordinal categorical.
The aim of the analysis is to provide insights into how to optimise sales volume.
The relationship between the variables will be assessed, and the following KPIs calculated:
Listings per month: this represents the rate at which the market absorbs the current advertisements.
\(listing per month = listing/months inventory\)
Mean price: to also consider outliers, the average price will be calculated:
\(mean price = volume/sales\)
When examining the results of the analyses of the above variables, the time dimension will be explored in more detail, taking into account the level of temporal detail (month or year).
The position, variability and shape indices are calculated for the following variables::
sales
volume
median_price
listing
months_inventory
Positional indices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 79.0 127.0 175.5 192.3 247.0 423.0
Variability Indices
## iqr var sd cv
## [1,] 120 6344.3 79.65111 41.42203
Shape Indices
## Fisher Kutosis
## [1,] 0.718104 -0.3131764
From the calculated indices, it appears that the sales variable exhibits a high degree of variability around its mean value, as confirmed by the coefficient of variation.
The mean is slightly higher than the median, indicating a positively skewed distribution and suggesting that some higher sales values are pulling the average upwards.
Furthermore, the negative kurtosis suggests a platykurtic distribution, meaning the data are more evenly spread around the mean and there are fewer extreme values.
Positional indices:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.166 17.660 27.062 31.005 40.893 83.547
Variability Indices
## iqr var sd cv
## [1,] 23.2335 277.2707 16.65145 53.70536
Shape Indices
## Fisher Kutosis
## [1,] 0.884742 0.176987
From the calculated indices, it can be inferred that the volume exhibits a noticeable variability around the mean value. The distribution is positively skewed, indicating a longer right tail, and slightly leptokurtic, suggesting a moderate concentration of values around the mean with few extreme observations.
Positional indices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 73800 117300 134500 132665 150050 180000
Variability Indices
## iqr var sd cv
## [1,] 32750 513572983 22662.15 17.08218
Shape Indices
## Fisher Kutosis
## [1,] -0.3645529 -0.6229618
From the calculated indices, the distribution of median_price values does not exhibit excessive variability, as confirmed by the relatively low coefficient of variation. The mean and median are very close, suggesting that the distribution is essentially centered, albeit slightly shifted to the left, as indicated by the negative value of the skewness index. Furthermore, the negative kurtosis highlights a platykurtic shape, which is flatter than a normal distribution. This means that the data are more uniformly distributed around the mean and that there are no marked outliers or excessive concentrations of values.
Positional indices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 743 1026 1618 1738 2056 3296
Variability Indices
## iqr var sd cv
## [1,] 1029.5 566569 752.7078 43.30833
Shape Indices
## Fisher Kutosis
## [1,] 0.6494982 -0.79179
The calculated indices show that the listings variable exhibits considerable variability, as evidenced by the wide range of values. The distribution is skewed to the right and includes some values that can be considered outliers. Furthermore, the negative kurtosis indicates a platykurtic shape, which is flatter than a normal distribution and has relatively wide dispersion with a lower concentration of values around the mean.
Positional indices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.400 7.800 8.950 9.193 10.950 14.900
Variability Indices
## iqr var sd cv
## [1,] 3.15 5.306889 2.303669 25.06031
Shape Indices
## Fisher Kutosis
## [1,] 0.04097527 -0.1744475
The months_inventory variable displays a substantially symmetric distribution, as indicated by the close-to-zero skewness value. The kurtosis value is also very close to that of the normal distribution and suggests a mesokurtic shape: not particularly sharp or flat. Overall, the variability is moderate, with values distributed evenly around the mean.
For this variable the frequency distribution is implemented:
##
## 2010 2011 2012 2013 2014
## Beaumont 12 12 12 12 12
## Bryan-College Station 12 12 12 12 12
## Tyler 12 12 12 12 12
## Wichita Falls 12 12 12 12 12
Calculating the relative frequency for this variable would not provide significant additional information, so it is not necessary.
From the distribution of absolute frequencies, we can see that there are 12 observations per year for each city, corresponding to the twelve months.
This confirms that a complete monthly measurement of all the previously analysed quantitative variables is available for each city.
The most asymmetric variable is volume, as indicated by its Fisher index of approximately 0.88, which is higher than that of the other variables.
In addition, this variable also shows the highest relative variability, with a coefficient of variation (CV) of about 53,70%, considerably greater than that of the other measures.
Both indices confirm that volume is the most dispersed and right-skewed variable in the dataset.
A class partioning wil be created using the variable median_price.
For this variable has been shown the range [min, max]:
## [1] 73800 180000
the class partioning has been implemented with the creation of 10 classes applying a measure unit conversion of the median price from $ to k$:
realestate_texas$median_price_class = cut(realestate_texas$median_price/10^3, breaks = 10)
fn = table(realestate_texas$median_price_class)
fi = table(realestate_texas$median_price_class)/length(realestate_texas$median_price)
Fn = cumsum(fn)
Fi = cumsum(fi)
distr_freq_midian_price_class = cbind(fn, Fn, fi, Fi)
distr_freq_midian_price_class
## fn Fn fi Fi
## (73.7,84.4] 2 2 0.008333333 0.008333333
## (84.4,95] 16 18 0.066666667 0.075000000
## (95,106] 23 41 0.095833333 0.170833333
## (106,116] 17 58 0.070833333 0.241666667
## (116,127] 25 83 0.104166667 0.345833333
## (127,138] 48 131 0.200000000 0.545833333
## (138,148] 38 169 0.158333333 0.704166667
## (148,159] 46 215 0.191666667 0.895833333
## (159,169] 16 231 0.066666667 0.962500000
## (169,180] 9 240 0.037500000 1.000000000
The following graph shows the frequency distribution make above:
The frequency distribution shows a slight tendency towards bimodality, with two local peaks observed in the [127, 138) and [148, 159) classes. This suggests that there are two prevailing ranges of median prices within the dataset.
Also the Gini Index is calculated:
## [1] 0.998431
The resulting Gini index is very close to one. This indicates a high level of heterogeneity, meaning that the median_price variable is well distributed across the different categories, with no dominant concentration in a single group.
The following calculation will determine the probability of a random row in the dataset realestate_texas containing:
The city “Beaumont”, indicates with P(Beaumont)
Tee month of JUly, indicates with P(Luglio)
The date december 2012, indicates with P(dicembre 2012)
## fn fi
## Beaumont 60 0.25
## Bryan-College Station 60 0.25
## Tyler 60 0.25
## Wichita Falls 60 0.25
As shown in the frequency distribution of the city variable, there is a 25% probability that a randomly selected row from the ‘realestate_texas’ dataset will contain the city ‘Beaumont’.
## fn fi
## 1 20 0.08333333
## 2 20 0.08333333
## 3 20 0.08333333
## 4 20 0.08333333
## 5 20 0.08333333
## 6 20 0.08333333
## 7 20 0.08333333
## 8 20 0.08333333
## 9 20 0.08333333
## 10 20 0.08333333
## 11 20 0.08333333
## 12 20 0.08333333
As shown in the frequency distribution of the month variable, the probability of a randomly selected row from the ‘realestate_texas’ dataset containing the month of July (7) is 8.33%.
The probability of randomly selecting a row corresponding to December 2012 is simply:
\(P(dicembre 2012) = P(dicembre) * P(2012)\)
in fact, implementing the combined frequency distribution of year and month:
## 2010 2011 2012 2013 2014 2010 2011 2012 2013
## 1 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 2 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 3 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 4 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 5 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 6 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 7 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 8 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 9 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 10 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 11 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 12 4 4 4 4 4 0.01666667 0.01666667 0.01666667 0.01666667
## 2014
## 1 0.01666667
## 2 0.01666667
## 3 0.01666667
## 4 0.01666667
## 5 0.01666667
## 6 0.01666667
## 7 0.01666667
## 8 0.01666667
## 9 0.01666667
## 10 0.01666667
## 11 0.01666667
## 12 0.01666667
As shown in the combined frequency distribution of the month and year variables, the probability of a randomly selected row from the ‘realestate_texas’ dataset containing the date of december 2012 is 8.33%.
Two new variables are calculated:
average_price
listing_per_month
The vaiable is calculated using the variables volume (in mln$) and sales (number of sales), multiplied by 10⁶ to obtain the value in $.
\(average price=volume/price*10^6\)
realestate_texas$average_price = realestate_texas$volume / realestate_texas$sales * 10^6
For this new variable has been calculated:
Positional Indices:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 97010 132939 156588 154320 173915 213234
Variability indices:
## iqr var sd cv
## [1,] 40976.22 736984385 27147.46 17.59162
Shape indices:
## Fisher Kutosis
## [1,] -0.06870528 -0.7784329
Gini index:
## [1] 1
The distribution is nearly symmetrical, with a comparable mean and median and low variability. The kurtosis index indicates that the distribution is flatter than a normal distribution and is therefore platykurtic.
The variable values are evenly distributed across the different categories.
Below is a comparison of the average and median price distributions.:
It should be noted that the average_price is more influenced by changes in its values than the median price.
A frequency distribution was implemented, dividing the variable into 10 classes:
## fn Fn fi Fi
## (96.9,109] 10 10 0.04166667 0.04166667
## (109,120] 21 31 0.08750000 0.12916667
## (120,132] 27 58 0.11250000 0.24166667
## (132,143] 21 79 0.08750000 0.32916667
## (143,155] 37 116 0.15416667 0.48333333
## (155,167] 41 157 0.17083333 0.65416667
## (167,178] 32 189 0.13333333 0.78750000
## (178,190] 27 216 0.11250000 0.90000000
## (190,202] 16 232 0.06666667 0.96666667
## (202,213] 8 240 0.03333333 1.00000000
Implementing the bar chart:
the modal class is identified (155, 167] k$.
The speed with which the market absorbs listings is evaluated by calculating the ratio of listings to monthly inventory [listings/month].
\(listing per month = listing/months inventory\)
realestate_texas$listing_per_month = realestate_texas$listings / realestate_texas$months_inventory
Positional indices:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 106.8 140.1 176.0 187.3 236.0 329.3
Furthermore, since the variable represents a speed, the harmonic mean is evaluated
## [1] 170.0409
As can be seen, the calculated value is between the mean and the median, providing a more accurate estimate of the average speed at which the market absorbs sales announcements.
Variability indices:
## iqr var sd cv
## [1,] 95.83326 3343.947 57.82687 30.86756
Shape indices:
## Fisher Kutosis
## [1,] 0.3672083 -0.8832775
The listing_per_month variable shows a fair amount of variability and a rightward skew, with some very high values representing outliers. The distribution is platykurtic, meaning it is relatively flat and diffuse compared to a normal distribution.
A frequency distribution was created by dividing the variable into ten classes:
## fn Fn fi Fi
## (107,129] 59 59 0.24583333 0.2458333
## (129,151] 13 72 0.05416667 0.3000000
## (151,174] 43 115 0.17916667 0.4791667
## (174,196] 26 141 0.10833333 0.5875000
## (196,218] 20 161 0.08333333 0.6708333
## (218,240] 26 187 0.10833333 0.7791667
## (240,263] 28 215 0.11666667 0.8958333
## (263,285] 10 225 0.04166667 0.9375000
## (285,307] 11 236 0.04583333 0.9833333
## (307,329] 4 240 0.01666667 1.0000000
Costruito il grafico a barre:
The distribution has a modal class of [155, 167] listings/month.
The analysis now moves on to a more detailed level, making comparisons across years, months and cities within the dataset.
The first aspect to be explored is the relationship between sales and volume.
The graph shows a strong correlation between the two variables. As the number of sales increases, sales volume increases with a nearly linear trend. However, there is a residual dispersion of the data points, indicating that, for the same number of sales, variations in sales volume can occur.
For each variable, sales and volume, a box plot is constructed that allows a comparison of sales and sales volume for each city, respectively. For each variable, the mean value is shown, indicated by the marker (*), which shows the mean value for:
The graphs show three phenomena:
Low profit: The city with the lowest overall sales is Wichita Falls. This city has low variability in sales and corresponding sales volume.
High profit: The city with the highest overall sales is Tyler, which has moderate variability in sales and corresponding sales volume.
Medium profit: The cities with moderate sales and corresponding sales volume are Beaumonth and Bryan-College Station.
Both show comparable median values, but the former has much lower variability than the latter, which nevertheless appears to have the greatest fluctuations in sales compared to all the others. In fact, the mean value, represented by (*), highlights this phenomenon.
The phenomena identified above are evaluated in more detail:
In this detailed analysis, the asterisk (*) no longer represents the average value per city but rather the median value, thus simplifying the comparison of the overall annual sales trend for each city. It is noted that, with the exception of Wichita Falls, where the number of sales remains relatively constant, the sales trend has generally increased over time, particularly in Tyler, which consistently shows significant increases compared to Beaumont and Bryan-College Station, which experienced small sales declines in 2010 and 2011.
The same considerations can be made for sales volume.
To better explain the identified phenomena, the distribution of the median price among cities over the various years is considered:
It’s noticeable that for each city, the median price tends to increase over the years.
Considering what we’ve identified so far, we can conclude that the median price is not the key variable that allowed for maximizing sales. In fact, Wichita Falls, despite having a low median price, did not have a high number of sales and consequently a low sales volume.
We now evaluate the impact of listings on sales.
The graph shows a segmentation of the number of listings across three groups, with a significant spread in the number of sales for each group. This spread may be due to the fact that we are considering all cities together and therefore mixing different markets.
A more detailed segmentation is considered for each city:
The following summary table and its graphical representation shows the relationship between listings and sales volume:
## # A tibble: 4 × 4
## city mean_listings sd_listings mean_volume
## <chr> <dbl> <dbl> <dbl>
## 1 Beaumont 1679. 91.1 26.1
## 2 Bryan-College Station 1458. 253. 38.2
## 3 Tyler 2905. 227. 45.8
## 4 Wichita Falls 910. 73.8 13.9
At this stage, it is clear that the number of advertisements, rather than the price, is the key factor in increasing sales and, by extension, sales volume. To boost sales volume, it is crucial to expand the number of lists.