Exploratory Analysis of the Texas Real Estate Market

Texas Realty Insights seeks to analyze real estate market trends in Texas, leveraging historical home sales data. The goal is to provide statistical and visual insights to support strategic decisions regarding sales and optimization of real estate listings.

Project Objectives:

Identify and interpret historical real estate sales trends in Texas.
Evaluate the effectiveness of real estate listing marketing strategies.
Provide a graphical representation of the data that highlights the distribution of prices and sales across cities, months, and years.

Dataset: “Real Estate Texas.csv”

## [1] "city"             "year"             "month"            "sales"           
## [5] "volume"           "median_price"     "listings"         "months_inventory"

This dataset containse the following variables:

city: reference city
year: reference year
month: reference month
sales: number of total sales
volume: value of total sales (in milions of dollar)
median_price: median price of sale (in dollar)
listings: number of total active advertisements
months_inventory: The amount of time required to sell all current listings, expressed in months.

Variables Analysis

For better data management, we create a yearmon column called date using the year and month columns.

head(realestate_texas, 5)

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
##       date
## 1 Jan 2010
## 2 Feb 2010
## 3 Mar 2010
## 4 Apr 2010
## 5 May 2010

There are three types of variables:

Nominal qualitative variables: city.
Discrete numerical quantitative variables, such as year and month.
Continuous numerical quantitative variables: sales, volume, median price, month inventory, date.

Regarding the discrete numerical quantitative variables identified (year and month), the following considerations can be made depending on the type of analysis being performed:

If the goal is to analyse temporal trends (e.g. the evolution of volume over time), these variables can be considered continuous numerical variables, particularly when combined into a single temporal variable (“date”), which allows time series analysis.
However, if the objective is to compare values between distinct categories (e.g. different years or months), these variables can be considered ordinal categorical, where the order between categories is relevant.

Creating a ‘date’ column that combines year and month into a single date value enables more efficient time series analysis management, facilitating filtering, aggregation and temporal data visualisation operations.

Similar considerations can be made for the remaining numeric variables; depending on the analysis being performed, these variables can be considered ordinal categorical.

Aim of the analysis

The aim of the analysis is to provide insights into how to optimise sales volume.

The relationship between the variables will be assessed, and the following KPIs calculated:

Listings per month: this represents the rate at which the market absorbs the current advertisements.

$listing per month = listing/months inventory$
Mean price: to also consider outliers, the average price will be calculated:

$mean price = volume/sales$

When examining the results of the analyses of the above variables, the time dimension will be explored in more detail, taking into account the level of temporal detail (month or year).

Position, variability and shape indices

The position, variability and shape indices are calculated for the following variables::

sales
volume
median_price
listing
months_inventory

Sales

Positional indices

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0

Variability Indices

##      iqr    var       sd       cv
## [1,] 120 6344.3 79.65111 41.42203

Shape Indices

##        Fisher    Kutosis
## [1,] 0.718104 -0.3131764

From the calculated indices, it appears that the sales variable exhibits a high degree of variability around its mean value, as confirmed by the coefficient of variation.

The mean is slightly higher than the median, indicating a positively skewed distribution and suggesting that some higher sales values are pulling the average upwards.

Furthermore, the negative kurtosis suggests a platykurtic distribution, meaning the data are more evenly spread around the mean and there are fewer extreme values.

Volume

Positional indices:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.166  17.660  27.062  31.005  40.893  83.547

Variability Indices

##          iqr      var       sd       cv
## [1,] 23.2335 277.2707 16.65145 53.70536

Shape Indices

##        Fisher  Kutosis
## [1,] 0.884742 0.176987

From the calculated indices, it can be inferred that the volume exhibits a noticeable variability around the mean value. The distribution is positively skewed, indicating a longer right tail, and slightly leptokurtic, suggesting a moderate concentration of values around the mean with few extreme observations.

Median_price

Positional indices

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73800  117300  134500  132665  150050  180000

Variability Indices

##        iqr       var       sd       cv
## [1,] 32750 513572983 22662.15 17.08218

Shape Indices

##          Fisher    Kutosis
## [1,] -0.3645529 -0.6229618

From the calculated indices, the distribution of median_price values does not exhibit excessive variability, as confirmed by the relatively low coefficient of variation. The mean and median are very close, suggesting that the distribution is essentially centered, albeit slightly shifted to the left, as indicated by the negative value of the skewness index. Furthermore, the negative kurtosis highlights a platykurtic shape, which is flatter than a normal distribution. This means that the data are more uniformly distributed around the mean and that there are no marked outliers or excessive concentrations of values.

Listings

Positional indices

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     743    1026    1618    1738    2056    3296

Variability Indices

##         iqr    var       sd       cv
## [1,] 1029.5 566569 752.7078 43.30833

Shape Indices

##         Fisher  Kutosis
## [1,] 0.6494982 -0.79179

The calculated indices show that the listings variable exhibits considerable variability, as evidenced by the wide range of values. The distribution is skewed to the right and includes some values that can be considered outliers. Furthermore, the negative kurtosis indicates a platykurtic shape, which is flatter than a normal distribution and has relatively wide dispersion with a lower concentration of values around the mean.

Months_inventory

Positional indices

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.400   7.800   8.950   9.193  10.950  14.900

Variability Indices

##       iqr      var       sd       cv
## [1,] 3.15 5.306889 2.303669 25.06031

Shape Indices

##          Fisher    Kutosis
## [1,] 0.04097527 -0.1744475

The months_inventory variable displays a substantially symmetric distribution, as indicated by the close-to-zero skewness value. The kurtosis value is also very close to that of the normal distribution and suggests a mesokurtic shape: not particularly sharp or flat. Overall, the variability is moderate, with values distributed evenly around the mean.

City

For this variable the frequency distribution is implemented:

##                        
##                         2010 2011 2012 2013 2014
##   Beaumont                12   12   12   12   12
##   Bryan-College Station   12   12   12   12   12
##   Tyler                   12   12   12   12   12
##   Wichita Falls           12   12   12   12   12

Calculating the relative frequency for this variable would not provide significant additional information, so it is not necessary.

From the distribution of absolute frequencies, we can see that there are 12 observations per year for each city, corresponding to the twelve months.

This confirms that a complete monthly measurement of all the previously analysed quantitative variables is available for each city.

Identification of variables with greater variability and asymmetry

The most asymmetric variable is volume, as indicated by its Fisher index of approximately 0.88, which is higher than that of the other variables.

In addition, this variable also shows the highest relative variability, with a coefficient of variation (CV) of about 53,70%, considerably greater than that of the other measures.

Both indices confirm that volume is the most dispersed and right-skewed variable in the dataset.

Class creation for a quantitative variable

A class partioning wil be created using the variable median_price.

For this variable has been shown the range [min, max]:

## [1]  73800 180000

the class partioning has been implemented with the creation of 10 classes applying a measure unit conversion of the median price from $ to k$:

realestate_texas$median_price_class = cut(realestate_texas$median_price/10^3, breaks = 10)
fn = table(realestate_texas$median_price_class)
fi = table(realestate_texas$median_price_class)/length(realestate_texas$median_price)
Fn = cumsum(fn)
Fi = cumsum(fi)

distr_freq_midian_price_class = cbind(fn, Fn, fi, Fi)
distr_freq_midian_price_class

##             fn  Fn          fi          Fi
## (73.7,84.4]  2   2 0.008333333 0.008333333
## (84.4,95]   16  18 0.066666667 0.075000000
## (95,106]    23  41 0.095833333 0.170833333
## (106,116]   17  58 0.070833333 0.241666667
## (116,127]   25  83 0.104166667 0.345833333
## (127,138]   48 131 0.200000000 0.545833333
## (138,148]   38 169 0.158333333 0.704166667
## (148,159]   46 215 0.191666667 0.895833333
## (159,169]   16 231 0.066666667 0.962500000
## (169,180]    9 240 0.037500000 1.000000000

The following graph shows the frequency distribution make above:

The frequency distribution shows a slight tendency towards bimodality, with two local peaks observed in the [127, 138) and [148, 159) classes. This suggests that there are two prevailing ranges of median prices within the dataset.

Also the Gini Index is calculated:

## [1] 0.998431

The resulting Gini index is very close to one. This indicates a high level of heterogeneity, meaning that the median_price variable is well distributed across the different categories, with no dominant concentration in a single group.

Probability calculation

The following calculation will determine the probability of a random row in the dataset realestate_texas containing:

The city “Beaumont”, indicates with P(Beaumont)
Tee month of JUly, indicates with P(Luglio)
The date december 2012, indicates with P(dicembre 2012)

P(Beaumont)

##                       fn   fi
## Beaumont              60 0.25
## Bryan-College Station 60 0.25
## Tyler                 60 0.25
## Wichita Falls         60 0.25

As shown in the frequency distribution of the city variable, there is a 25% probability that a randomly selected row from the ‘realestate_texas’ dataset will contain the city ‘Beaumont’.

P(Luglio)

##    fn         fi
## 1  20 0.08333333
## 2  20 0.08333333
## 3  20 0.08333333
## 4  20 0.08333333
## 5  20 0.08333333
## 6  20 0.08333333
## 7  20 0.08333333
## 8  20 0.08333333
## 9  20 0.08333333
## 10 20 0.08333333
## 11 20 0.08333333
## 12 20 0.08333333

As shown in the frequency distribution of the month variable, the probability of a randomly selected row from the ‘realestate_texas’ dataset containing the month of July (7) is 8.33%.

P(dicembre 2012)

The probability of randomly selecting a row corresponding to December 2012 is simply:

$P(dicembre 2012) = P(dicembre) * P(2012)$

in fact, implementing the combined frequency distribution of year and month:

##    2010 2011 2012 2013 2014       2010       2011       2012       2013
## 1     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 2     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 3     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 4     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 5     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 6     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 7     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 8     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 9     4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 10    4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 11    4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
## 12    4    4    4    4    4 0.01666667 0.01666667 0.01666667 0.01666667
##          2014
## 1  0.01666667
## 2  0.01666667
## 3  0.01666667
## 4  0.01666667
## 5  0.01666667
## 6  0.01666667
## 7  0.01666667
## 8  0.01666667
## 9  0.01666667
## 10 0.01666667
## 11 0.01666667
## 12 0.01666667

As shown in the combined frequency distribution of the month and year variables, the probability of a randomly selected row from the ‘realestate_texas’ dataset containing the date of december 2012 is 8.33%.

Creates new variables

Two new variables are calculated:

average_price
listing_per_month

average_price

The vaiable is calculated using the variables volume (in mln$) and sales (number of sales), multiplied by 10⁶ to obtain the value in $.

$average price=volume/price*10^6$

realestate_texas$average_price = realestate_texas$volume / realestate_texas$sales * 10^6

For this new variable has been calculated:

Positional Indices:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   97010  132939  156588  154320  173915  213234

Variability indices:

##           iqr       var       sd       cv
## [1,] 40976.22 736984385 27147.46 17.59162

Shape indices:

##           Fisher    Kutosis
## [1,] -0.06870528 -0.7784329

Gini index:

## [1] 1

The distribution is nearly symmetrical, with a comparable mean and median and low variability. The kurtosis index indicates that the distribution is flatter than a normal distribution and is therefore platykurtic.

The variable values are evenly distributed across the different categories.

Below is a comparison of the average and median price distributions.:

It should be noted that the average_price is more influenced by changes in its values than the median price.

A frequency distribution was implemented, dividing the variable into 10 classes:

##            fn  Fn         fi         Fi
## (96.9,109] 10  10 0.04166667 0.04166667
## (109,120]  21  31 0.08750000 0.12916667
## (120,132]  27  58 0.11250000 0.24166667
## (132,143]  21  79 0.08750000 0.32916667
## (143,155]  37 116 0.15416667 0.48333333
## (155,167]  41 157 0.17083333 0.65416667
## (167,178]  32 189 0.13333333 0.78750000
## (178,190]  27 216 0.11250000 0.90000000
## (190,202]  16 232 0.06666667 0.96666667
## (202,213]   8 240 0.03333333 1.00000000

Implementing the bar chart:

the modal class is identified (155, 167] k$.

listing_per_month

The speed with which the market absorbs listings is evaluated by calculating the ratio of listings to monthly inventory [listings/month].

$listing per month = listing/months inventory$

realestate_texas$listing_per_month = realestate_texas$listings / realestate_texas$months_inventory

Positional indices:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   106.8   140.1   176.0   187.3   236.0   329.3

Furthermore, since the variable represents a speed, the harmonic mean is evaluated

## [1] 170.0409

As can be seen, the calculated value is between the mean and the median, providing a more accurate estimate of the average speed at which the market absorbs sales announcements.

Variability indices:

##           iqr      var       sd       cv
## [1,] 95.83326 3343.947 57.82687 30.86756

Shape indices:

##         Fisher    Kutosis
## [1,] 0.3672083 -0.8832775

The listing_per_month variable shows a fair amount of variability and a rightward skew, with some very high values representing outliers. The distribution is platykurtic, meaning it is relatively flat and diffuse compared to a normal distribution.

A frequency distribution was created by dividing the variable into ten classes:

##           fn  Fn         fi        Fi
## (107,129] 59  59 0.24583333 0.2458333
## (129,151] 13  72 0.05416667 0.3000000
## (151,174] 43 115 0.17916667 0.4791667
## (174,196] 26 141 0.10833333 0.5875000
## (196,218] 20 161 0.08333333 0.6708333
## (218,240] 26 187 0.10833333 0.7791667
## (240,263] 28 215 0.11666667 0.8958333
## (263,285] 10 225 0.04166667 0.9375000
## (285,307] 11 236 0.04583333 0.9833333
## (307,329]  4 240 0.01666667 1.0000000

Costruito il grafico a barre:

The distribution has a modal class of [155, 167] listings/month.

Texas Realty Insights

The analysis now moves on to a more detailed level, making comparisons across years, months and cities within the dataset.

The first aspect to be explored is the relationship between sales and volume.

The graph shows a strong correlation between the two variables. As the number of sales increases, sales volume increases with a nearly linear trend. However, there is a residual dispersion of the data points, indicating that, for the same number of sales, variations in sales volume can occur.

For each variable, sales and volume, a box plot is constructed that allows a comparison of sales and sales volume for each city, respectively. For each variable, the mean value is shown, indicated by the marker (*), which shows the mean value for:

The graphs show three phenomena:

Low profit: The city with the lowest overall sales is Wichita Falls. This city has low variability in sales and corresponding sales volume.
High profit: The city with the highest overall sales is Tyler, which has moderate variability in sales and corresponding sales volume.
Medium profit: The cities with moderate sales and corresponding sales volume are Beaumonth and Bryan-College Station.

Both show comparable median values, but the former has much lower variability than the latter, which nevertheless appears to have the greatest fluctuations in sales compared to all the others. In fact, the mean value, represented by (*), highlights this phenomenon.

The phenomena identified above are evaluated in more detail:

In this detailed analysis, the asterisk (*) no longer represents the average value per city but rather the median value, thus simplifying the comparison of the overall annual sales trend for each city. It is noted that, with the exception of Wichita Falls, where the number of sales remains relatively constant, the sales trend has generally increased over time, particularly in Tyler, which consistently shows significant increases compared to Beaumont and Bryan-College Station, which experienced small sales declines in 2010 and 2011.

The same considerations can be made for sales volume.

To better explain the identified phenomena, the distribution of the median price among cities over the various years is considered:

It’s noticeable that for each city, the median price tends to increase over the years.

Considering what we’ve identified so far, we can conclude that the median price is not the key variable that allowed for maximizing sales. In fact, Wichita Falls, despite having a low median price, did not have a high number of sales and consequently a low sales volume.

We now evaluate the impact of listings on sales.

The graph shows a segmentation of the number of listings across three groups, with a significant spread in the number of sales for each group. This spread may be due to the fact that we are considering all cities together and therefore mixing different markets.

A more detailed segmentation is considered for each city:

The following summary table and its graphical representation shows the relationship between listings and sales volume:

## # A tibble: 4 × 4
##   city                  mean_listings sd_listings mean_volume
##   <chr>                         <dbl>       <dbl>       <dbl>
## 1 Beaumont                      1679.        91.1        26.1
## 2 Bryan-College Station         1458.       253.         38.2
## 3 Tyler                         2905.       227.         45.8
## 4 Wichita Falls                  910.        73.8        13.9

At this stage, it is clear that the number of advertisements, rather than the price, is the key factor in increasing sales and, by extension, sales volume. To boost sales volume, it is crucial to expand the number of lists.

Exploratory Analysis of the Texas Real Estate Market

Giovanni Rocca

2025-09-26

Dataset: “Real Estate Texas.csv”

Variables Analysis

Aim of the analysis

Position, variability and shape indices

Sales

Volume

Median_price

Listings

Months_inventory

City

Identification of variables with greater variability and asymmetry

Class creation for a quantitative variable

Probability calculation

P(Beaumont)

P(Luglio)

P(dicembre 2012)

Creates new variables

average_price

listing_per_month

Texas Realty Insights