Texas Real Estate Market Analysis

Background

The aim of this report is to provide statistical and visual insights that support strategic real estate decisions on sales and listings. In more detail:

Identify and interpret historical trends in Texas real estate sales.
Evaluate the effectiveness of real estate listing marketing strategies.
Provide a graphical representation of the data that highlights the distribution of prices and sales across cities, months, and years.

Setting up the environment

1. Variable Identification and Description

city	year	month	sales	volume	median_price	listings	months_inventory
Beaumont	2010	1	83	14.162	163800	1533	9.5
Beaumont	2010	2	108	17.690	138200	1586	10.0
Beaumont	2010	3	182	28.701	122400	1689	10.6
Beaumont	2010	4	200	26.819	123200	1708	10.6
Beaumont	2010	5	202	28.833	123100	1771	10.9

The imported dataset realestate_texas consists of eight variables, covering both qualitative and quantitative data types. Understanding the nature of these variables is crucial for determining appropriate statistical and analytical approaches.

Variable Types

City: A qualitative nominal variable representing the name of the city. Since there is no inherent ordering among city names, it can only be analyzed in terms of frequency distributions and comparisons, in particular by identifying its unique values.
Year: Although numerical, it is preferable to consider it as a qualitative ordinal variable, as it represents a temporal sequence rather than a true continuous measure. This allows for trend analysis, year-over-year comparisons, and time series modeling.
Month: Similar to Year, this variable is best treated as a qualitative ordinal variable, as months follow a natural order in a yearly cycle.
Sales: A quantitative discrete variable representing the number of properties sold. Since sales data consist of whole numbers (integer values), analysis techniques such as descriptive statistics, frequency distributions, and time series modeling can be applied.
Volume: A quantitative continuous variable indicating the total sales value (in millions of dollars). Being a continuous variable, it allows for analysis techniques such as descriptive statistics and frequency distributions.
Median Price: A quantitative discrete variable representing the median property sales price in dollars. Although price could theoretically be continuous, it is treated here as discrete. It can be used for comparing price distributions across cities. Generally speaking, it allows for analysis techniques such as descriptive statistics and frequency distributions.
Listings: A quantitative discrete variable representing the total number of active property listings at a given time. It helps in understanding the relationship between inventory and sales. Generally speaking, it allows for analysis techniques such as descriptive statistics and frequency distributions.
Months Inventory: A quantitative continuous variable indicating the estimated number of months required to sell all active listings at the current sales pace. It is a critical metric for real estate market cycles. Generally speaking, it allows for analysis techniques such as descriptive statistics and frequency distributions.
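
The typing choices above can be made explicit in R. The following is a minimal sketch, not the report's actual setup code: it rebuilds the five rows shown in the data preview in a hypothetical data frame named `sample_rows`, then encodes city as an unordered factor and year and month as ordered factors.

```r
# Hypothetical data frame holding the five rows printed in the preview
sample_rows <- data.frame(
  city = rep("Beaumont", 5),
  year = rep(2010, 5),
  month = 1:5,
  sales = c(83, 108, 182, 200, 202),
  volume = c(14.162, 17.690, 28.701, 26.819, 28.833),
  median_price = c(163800, 138200, 122400, 123200, 123100),
  listings = c(1533, 1586, 1689, 1708, 1771),
  months_inventory = c(9.5, 10.0, 10.6, 10.6, 10.9)
)

# Nominal variable: unordered factor
sample_rows$city <- factor(sample_rows$city)

# Ordinal variables: ordered factors with explicit level sets
sample_rows$year <- factor(sample_rows$year, levels = 2010:2014, ordered = TRUE)
sample_rows$month <- factor(sample_rows$month, levels = 1:12, ordered = TRUE)

# Inspect the resulting column types
str(sample_rows)
```

Declaring the full level sets (2010 to 2014, 1 to 12) keeps frequency tables aligned even when a subset of the data lacks some years or months.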

2. Measures of Central Tendency, Variability and Shapes

Analysis of the variable city

## city
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60

Since city is a qualitative nominal variable, analyzing its frequency distribution provides insights into how the dataset is structured. The frequency table shows that each city —Beaumont, Bryan-College Station, Tyler, and Wichita Falls— appears exactly 60 times. This indicates that the dataset is evenly distributed across cities, meaning no single city is overrepresented or underrepresented. As a result, there is no mode in the distribution. This suggests that the data collection process was designed to ensure equal representation across different locations, allowing for comparative analysis among cities.

Analysis of the variable year

## year
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48

The year variable represents the time period over which sales data were collected. The dataset spans five years, covering the period from January 2010 to December 2014. The frequency distribution shows that each year contains exactly 48 observations, confirming that the data collection was evenly distributed over time. Given that the dataset covers four cities, this suggests that for each city, an equal number of records were recorded per year, allowing for consistent year-over-year comparisons and reliable trend analysis.

Analysis of the variable month

## month
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20

Examining the frequency distribution of the month variable confirms the structured nature of the dataset. Each month appears exactly 20 times, and when divided by the four cities, this indicates that data collection was evenly distributed across all months for each city.

Furthermore, this pattern aligns with the yearly distribution, reinforcing that data were systematically recorded every month from January 2010 to December 2014 without gaps.

Analysis of the variable sales

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0

The minimum value of sales is 79, the maximum is 423. The mean (192.3) is higher than the median (175.5), which suggests a right-skewed (positively skewed) distribution. This implies that while most values are concentrated in the lower range, there are some larger values (potential outliers) pulling the mean upward.

To further confirm the findings highlighted by the measures of position, an analysis of variability is essential. This allows us to understand the spread and fluctuations in the sales data.

	Range	IQR	Variance	Sd	CV
Sales	344	120	6344.3	79.65	41.42

Range: 344

The difference between the maximum and minimum sales values shows substantial spread in the dataset, reinforcing the observation that sales are highly variable across cities and months.

Interquartile Range: 120

The difference between the 75th percentile (Q3) and the 25th percentile (Q1), which captures the middle 50% of the data, is significantly smaller than the range. This could indicate that extreme values are pulling the range outward.

Variance: 6344.3

The variance measures how much individual sales values deviate from the mean. A large variance (6344.3) suggests a wide spread of data points, indicating high variability in the number of properties sold.

Standard Deviation: 79.65

The standard deviation (79.65) measures the typical deviation of sales from the mean. The number of properties sold deviates by about 79.65 units from the mean, which again shows the significant fluctuations in sales activity.

Coefficient of Variation (CV): 41.42%

The Coefficient of Variation (CV) is calculated as the ratio of the standard deviation to the mean, expressed as a percentage. With a CV of 41.42%, we can conclude that there is high relative variability in sales, meaning that sales fluctuate significantly across different locations (cities) and periods (months and years). This suggests that market conditions are highly volatile and may be influenced by external factors.
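
As an illustration of how the five measures relate, the sketch below computes them in base R for the five Beaumont sales values shown in the data preview, so the results differ from the full-dataset figures above. The helper name `variability_summary` is an assumption, not a function used in the report.

```r
# Hypothetical helper returning the five variability measures for a numeric vector
variability_summary <- function(x) {
  c(
    Range = max(x) - min(x),
    IQR = IQR(x),
    Variance = var(x),
    Sd = sd(x),
    CV = sd(x) / mean(x) * 100  # coefficient of variation, in percent
  )
}

# First five sales values from the data preview
sales_preview <- c(83, 108, 182, 200, 202)
round(variability_summary(sales_preview), 2)

# The reported CV can be re-derived from the reported sd and mean
round(79.65 / 192.3 * 100, 2)
```

Because the CV divides the standard deviation by the mean, it is unit-free, which is what makes it suitable for comparing variables measured on different scales.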

Conclusion: The analysis of both measures of position and measures of variability confirms the substantial variability in the real estate market. The right-skewed distribution, large range, and high coefficient of variation all point to a dynamic market where the number of properties sold can vary greatly across different locations and time periods.

To strengthen these findings, a graphical representation and an in-depth analysis of skewness and kurtosis are essential.

	Skewness	Kurtosis
Sales	0.72	-0.31

Skewness: 0.72

As expected, there is positive skewness (right skew). The boxplot confirms that the median is indeed closer to the first quartile than to the third.

Kurtosis: -0.31

Kurtosis describes the shape of the frequency distribution, in particular how peaked it is and how heavy its tails are. The value of -0.31 indicates slightly platykurtic behavior, meaning that the curve has a lower, flatter peak than the normal distribution.

Boxplot: Looking at the boxplot, it is possible to notice that there are no outliers. However, the length of the upper whisker, which is noticeably longer than the lower whisker, implies a right-skewed distribution. This suggests that while most values are concentrated in the lower range, there are some higher values extending the distribution without being classified as outliers.
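
The report does not show how skewness and kurtosis were computed. One common choice, sketched here as an assumption, is the moment-based definition with excess kurtosis (kurtosis minus 3), under which 0 corresponds to a normal distribution. The function names are illustrative and do not come from the report.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth standardized moment minus 3 (0 for a normal curve)
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# A symmetric, flat sample: zero skewness and negative (platykurtic) kurtosis
x <- c(1, 2, 3, 4, 5)
skewness_moment(x)
excess_kurtosis(x)

# A sample with a long right tail: positive skewness
skewness_moment(c(1, 1, 1, 2, 10))
```

Under this convention, negative values indicate a platykurtic (flatter) curve and positive values a leptokurtic (more peaked) one.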

Analysis of the variable volume

summary(volume)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.166  17.660  27.062  31.005  40.893  83.547

The minimum value for volume is 8.166 million dollars, the maximum is 83.547, whereas the mean is 31.005 million dollars. Median (27.062) is slightly lower than the mean (31.005). This suggests a right (positive) skew, meaning there are higher values (outliers) pulling the mean upward.

	Range	IQR	Variance	Sd	CV
Volume	75.38	23.23	277.27	16.65	53.71

Range: 75.38

The difference between the maximum (83.547) and the minimum (8.166) indicates a wide spread in the data.

Interquartile Range: 23.23

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 277.27

A variance of 277.27 represents a relatively high value, indicating a substantial spread.

Standard Deviation: 16.65

The standard deviation measures the typical deviation of volume from the mean. Sales volumes deviate by about 16.65 million dollars from the mean.

Coefficient of Variation: 53.71%

A Coefficient of Variation above 50% signals high relative variability, meaning sales amounts are inconsistent. This suggests potential market volatility, with some regions/times experiencing significantly higher/lower sales.

To confirm the evidence highlighted by the measures of variability, a graphical representation and an in-depth analysis of skewness and kurtosis are essential.

	Skewness	Kurtosis
Volume	0.88	0.18

Skewness: 0.88

The value, being greater than 0, confirms a positive skewness (right skew). The boxplot confirms that the median is indeed closer to the first quartile than to the third.

Kurtosis: 0.18

A value of 0.18 indicates a slightly leptokurtic distribution, meaning that the curve has a higher peak than the normal distribution, with a greater concentration of values near the center.

Boxplot:

The boxplot confirms the presence of outliers, as there are four values that exceed the black line, which represents the upper threshold (Q3 + 1.5 × IQR). The upper whisker is noticeably longer than the lower whisker, indicating a right-skewed distribution. This suggests that while most values are concentrated in the lower range, some higher values extend the distribution, with a few classified as outliers.
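
The outlier threshold mentioned above can be checked directly from the quartiles printed by `summary()`. The sketch below is illustrative: the helper name `tukey_fences` is an assumption, and because it uses the rounded quartiles from the output, the fences are approximate.

```r
# Tukey fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
tukey_fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Quartiles taken from the summary() outputs for sales and volume
sales_fences <- tukey_fences(q1 = 127.0, q3 = 247.0)
volume_fences <- tukey_fences(q1 = 17.660, q3 = 40.893)

# Does the reported maximum exceed the upper fence?
423.0 > sales_fences[["upper"]]     # sales
83.547 > volume_fences[["upper"]]   # volume
```

With these rounded quartiles, the upper fence is 427 for sales, above the maximum of 423, and about 75.74 for volume, below the maximum of 83.547. This matches the boxplot readings: no outliers for sales, upper outliers for volume.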

Analysis of the variable Median_price

summary(median_price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73800  117300  134500  132665  150050  180000

The minimum value is 73800 dollars, while the maximum is 180000. Median (134500) is slightly higher than the Mean (132665). This suggests a slight left (negative) skew, meaning there may be some lower values pulling the mean down.

	Range	IQR	Variance	Sd	CV
Median Price	106200	32750	513572983	22662.15	17.08

Range: 106200

The range, namely the difference between the minimum (73800) and the maximum (180000), indicates a wide spread in the data.

Interquartile range: 32750

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 513572983

A variance of 513572983 is a relatively high value, indicating a substantial spread.

Standard Deviation: 22662.15

The average deviation from the mean price is approximately $22,662, indicating a noticeable spread.

Coefficient of Variation: 17.08%

A Coefficient of Variation lower than 20% indicates that the variability is moderate, meaning median home prices do not fluctuate excessively relative to their mean.

	Skewness	Kurtosis
Median Price	-0.36	-0.62

Skewness: -0.36

A skewness value of -0.36 indicates a negative skewness (left-skewed).

Kurtosis: -0.62

A value of -0.62 indicates a platykurtic distribution, that is, a curve with a lower peak than the normal curve and less concentration of values around the center.

Boxplot:

The boxplot shows that there are no outliers. Additionally, it confirms that the data are slightly left-skewed (negative skew): the lower whisker is longer and the median sits slightly closer to Q3. The median is still almost centered within the box, indicating that the middle 50% of the data is quite symmetrically distributed. Looking at the whiskers, we can see that the data are more spread out below the first quartile, indicating mild left-tail skewness. This pattern implies that while the core of the data is quite balanced, there are a few lower values causing greater dispersion at the bottom. These values are not extreme enough to be classified as outliers.

Analysis of the variable listings

summary(listings)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     743    1026    1618    1738    2056    3296

The minimum number of listings is 743, whereas the maximum is 3,296. The mean (1,738) is higher than the median (1,618). This suggests a right (positive) skew, meaning there are higher values (possible outliers) pulling the mean upward. The gap between Q3 (2,056) and the maximum (3,296) is large, indicating potential outliers on the higher end.

	Range	IQR	Variance	Sd	CV
Listings	2553	1029.5	566569	752.71	43.31

Range: 2553

The range, the difference between the minimum value (743) and the maximum value (3,296), is 2,553, indicating a wide spread in the data.

Interquartile Range: 1029.5

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 566569

A variance of 566,569 indicates a significant dispersion in the number of listings.

Standard Deviation: 752.71

A Standard Deviation of 752.71 means that on average, the number of active listings deviates about 753 listings from the mean.

Coefficient of Variation: 43.31%

A coefficient of variation greater than 40% indicates high relative variability. The number of active listings fluctuates significantly, suggesting market instability or differences across cities.

	Skewness	Kurtosis
Listings	0.65	-0.79

Skewness: 0.65

A skewness value higher than 0 indicates a positive skewness (right skew).

Kurtosis: -0.79

A kurtosis value of -0.79 indicates a platykurtic distribution, that is, a curve with a lower peak than the normal curve and less concentration of values around the center.

Boxplot:

The median is almost centered within the box, indicating that the middle 50% of the data is quite symmetrically distributed. Looking at the whiskers, it is possible to notice that the data are more spread out above the third quartile, which indicates mild right-tail skewness. This pattern implies that while the core of the data is quite balanced, there are a few higher values causing greater dispersion at the top. However, these values are not extreme enough to be classified as outliers.

Analysis of the variable Months Inventory

summary(months_inventory)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.400   7.800   8.950   9.193  10.950  14.900

The minimum value of months inventory is 3.400, the maximum is 14.900. The mean (9.193) is higher than the median (8.950). This suggests a slight right (positive) skew, meaning there are some higher values pulling the mean upward.

	Range	IQR	Variance	Sd	CV
Months Inventory	11.5	3.15	5.31	2.3	25.06

Range: 11.5

A range of 11.5 months represents a wide spread in the data.

Interquartile range: 3.15

The interquartile range is lower than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 5.31

A variance of 5.31 is a relatively low value, indicating moderate spread.

Standard Deviation: 2.3

A standard deviation of 2.30 indicates that on average the months required to sell listings deviate about 2.30 months from the mean.

Coefficient of Variation: 25.06%

A coefficient of variation between 20% and 30% means that the relative variability is moderate. This suggests that inventory levels are somewhat stable, though regional or seasonal factors may still influence fluctuations.

	Skewness	Kurtosis
Months Inventory	0.04	-0.17

Skewness: 0.04

A skewness value of 0.04 indicates an almost symmetric distribution, since it is very close to zero.

Kurtosis: -0.17

A kurtosis value of -0.17 indicates an almost mesokurtic distribution, meaning that the peak of the curve is close to that of the normal curve.

Boxplot:

The median is not centered in the box; it lies nearer the first quartile, indicating that the middle 50% of the data is not symmetrically distributed. The whiskers of similar length imply that extreme values are evenly distributed at both ends, even if the density of observations is greater in the lower range.

3. Identification of variables with greater variability and asymmetry

	Sales	Volume	Listings	Median Price	Months Inventory
CV	41.42	53.71	43.31	17.08	25.06
Skewness	0.72	0.88	0.65	-0.36	0.04

Comparing the five quantitative variables, Volume has the highest relative variability (CV: 53.71), indicating significant fluctuations in total sales value. Listings (CV: 43.31) and Sales (CV: 41.42) also show substantial variation, reflecting changes in market activity. In contrast, Median Price (CV: 17.08) is the most stable, with relatively minor fluctuations. Months Inventory (CV: 25.06) falls in between, suggesting moderate variability in the time required to sell available properties.

In terms of distribution, Volume is the most skewed (0.88), highlighting the presence of extreme sales values. Sales (0.72) and Listings (0.65) are also moderately right-skewed, indicating an asymmetrical distribution with a long tail toward higher values. Median Price (-0.36) is slightly left-skewed. Months Inventory (0.04) is nearly symmetric, emphasizing its consistency over time.

Taken as a whole, the picture suggests that total sales value (Volume) may be heavily influenced by factors such as Listings and seasonality, while Months Inventory appears relatively independent of other market forces.

4. Creating classes for a quantitative variable

# Define the number of bins (6 equal-width intervals)
sales_bins <- cut(sales, 
                  breaks = seq(min(sales), max(sales), length.out = 7), 
                  include.lowest = TRUE)

# Compute frequency distribution
freq_table <- table(sales_bins)

# Compute relative frequencies
rel_freq <- prop.table(freq_table)

# Compute cumulative frequencies
cum_freq <- cumsum(freq_table)

# Compute cumulative relative frequencies
cum_rel_freq <- cumsum(rel_freq)

# Combine into a distribution table
distribution_table <- data.frame(
  Sales_Category = names(freq_table),
  Frequency = as.vector(freq_table),
  Relative_Frequency = round(as.vector(rel_freq), 4),
  Cumulative_Frequency = as.vector(cum_freq),
  Cumulative_Relative_Frequency = round(as.vector(cum_rel_freq), 4)
)

# Print the distribution table
print(distribution_table)

##   Sales_Category Frequency Relative_Frequency Cumulative_Frequency
## 1       [79,136]        74             0.3083                   74
## 2      (136,194]        67             0.2792                  141
## 3      (194,251]        40             0.1667                  181
## 4      (251,308]        36             0.1500                  217
## 5      (308,366]        15             0.0625                  232
## 6      (366,423]         8             0.0333                  240
##   Cumulative_Relative_Frequency
## 1                        0.3083
## 2                        0.5875
## 3                        0.7542
## 4                        0.9042
## 5                        0.9667
## 6                        1.0000

# Bar plot of sales distribution
barplot(freq_table, 
        col = c("lightcoral","lightpink","lightblue","lightgreen","lightgoldenrodyellow","lightyellow"),
        main = "Sales Distribution (Equal Bins)", 
        xlab = "Sales Categories", 
        ylab = "Frequency")

# Compute and print Gini Index
gini_index_sales <- round(gini.index(as.numeric(sales_bins)), 2)

gini_index_sales_matrix <- matrix(c(gini_index_sales), 
  nrow = 1, byrow = TRUE
)

# Adding names to rows and columns
colnames(gini_index_sales_matrix) <- c("**Gini Index**")
rownames(gini_index_sales_matrix) <- c("**Sales**")

# Pretty print the matrix with bold column and row names
kable(gini_index_sales_matrix, format = "markdown") %>% 
  kable_styling() %>%
  row_spec(0, bold = TRUE) %>%  # Bold column names
  column_spec(1, bold = TRUE) %>%  # Bold first column (row names)
  column_spec(2:ncol(gini_index_sales_matrix), bold = FALSE)  # Keep matrix values in normal font

	Gini Index
Sales	0.93

Histogram:

As shown in the boxplot, the histogram with 6 bins confirms the right-skewed distribution of Sales. The first three bins, representing lower sales counts, account for about 75% of observations, indicating that most months register sales between 79 and 251 properties. Higher sales figures (beyond 251) may be influenced by seasonal trends or market anomalies. To further investigate these patterns, a conditional analysis is recommended to assess potential external drivers.

Gini Index: 0.93

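The Gini index reported here is consistent with the normalized Gini heterogeneity index, which measures how evenly observations are spread across classes (0 when all fall in one class, 1 when they are equally distributed). A value of 0.93, which is very close to 1, indicates that the observations are spread over all six classes rather than concentrated in a single one. The spread is not perfectly even, however: the majority of the observations (181 of 240) are contained in the first three categories.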
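
The function `gini.index` is not defined in the code shown. Assuming it is the normalized Gini heterogeneity index (1 minus the sum of squared relative frequencies, divided by (J - 1)/J for J classes), the reported value can be reproduced from the frequency table. The function below is a sketch under that assumption, and its name is illustrative.

```r
# Normalized Gini heterogeneity index from a vector of class frequencies
gini_heterogeneity <- function(freq) {
  f <- freq / sum(freq)            # relative frequencies
  J <- length(freq)                # number of classes
  (1 - sum(f^2)) / ((J - 1) / J)   # rescaled to the interval [0, 1]
}

# Class frequencies copied from the distribution table above
class_freq <- c(74, 67, 40, 36, 15, 8)
round(gini_heterogeneity(class_freq), 2)
```

Under this definition the index is 0 when all observations fall in one class and 1 when they are spread equally across classes. The class frequencies above give 0.93.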

5. Probability

The probability that a row taken at random from this dataset will carry the city “Beaumont” can be calculated as follows:

tot_num_observations <- length(city)
tot_num_observations

## [1] 240

num_observations_Beaumont <- sum(city == "Beaumont")
num_observations_Beaumont

## [1] 60

Considering there are 4 cities and 240 observations and for each city there are 60 observations, the probability that, taken a random row in this dataset, it will carry the city “Beaumont” is:

probability_Beaumont <- num_observations_Beaumont/tot_num_observations
probability_Beaumont

## [1] 0.25

The probability that a row taken at random from this dataset will report the month of July can be calculated as follows:

tot_num_observations_month <- length(month)
tot_num_observations_month

## [1] 240

num_observations_July <- sum(month == 7)  # July is month 7
num_observations_July

## [1] 20

The probability is:

probability_July <- num_observations_July/tot_num_observations_month
probability_July

## [1] 0.08333333

The probability that it reports December 2012 can be calculated as follows:

num_observations_December_2012 <- sum(month == 12 & year == 2012)
num_observations_December_2012

## [1] 4

probability_december_2012 <- num_observations_December_2012/tot_num_observations_month
probability_december_2012

## [1] 0.01666667
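
Since each probability above is a relative frequency, it can also be obtained as the mean of a logical condition. The sketch below does not use the real dataset: it builds a hypothetical balanced panel named `panel` with the same structure (4 cities, 5 years, 12 months) to show the idea.

```r
# Hypothetical balanced panel: 4 cities x 5 years x 12 months = 240 rows
panel <- expand.grid(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  year = 2010:2014,
  month = 1:12,
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the share of TRUE values
p_beaumont <- mean(panel$city == "Beaumont")
p_july <- mean(panel$month == 7)
p_dec_2012 <- mean(panel$month == 12 & panel$year == 2012)

c(p_beaumont, p_july, p_dec_2012)
```

Because the panel is balanced, these values coincide with 1/4, 1/12, and 4/240, the same results obtained above by counting rows.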

6. New variable creation

# Create a new column for average price
# (volume is in millions of dollars, so avg_price is in millions of dollars per sale)
realestate_texas$avg_price <- volume / sales

kable(head(realestate_texas,5))

city	year	month	sales	volume	median_price	listings	months_inventory	avg_price
Beaumont	2010	1	83	14.162	163800	1533	9.5	0.1706265
Beaumont	2010	2	108	17.690	138200	1586	10.0	0.1637963
Beaumont	2010	3	182	28.701	122400	1689	10.6	0.1576978
Beaumont	2010	4	200	26.819	123200	1708	10.6	0.1340950
Beaumont	2010	5	202	28.833	123100	1771	10.9	0.1427376

# Create a new column for ad effectiveness
realestate_texas$ad_effectiveness <- sales / listings

# View the first few rows to check the new columns
kable(head(realestate_texas,5))

city	year	month	sales	volume	median_price	listings	months_inventory	avg_price	ad_effectiveness
Beaumont	2010	1	83	14.162	163800	1533	9.5	0.1706265	0.0541422
Beaumont	2010	2	108	17.690	138200	1586	10.0	0.1637963	0.0680958
Beaumont	2010	3	182	28.701	122400	1689	10.6	0.1576978	0.1077561
Beaumont	2010	4	200	26.819	123200	1708	10.6	0.1340950	0.1170960
Beaumont	2010	5	202	28.833	123100	1771	10.9	0.1427376	0.1140599

round(summary(realestate_texas$ad_effectiveness),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    0.09    0.11    0.12    0.13    0.39

Ad effectiveness

The calculated Sales-to-Listings ratio provides insight into the effectiveness of listings in converting to actual property sales. The values range from 0.05 (minimum) to 0.39 (maximum), indicating significant variation across different periods or locations.

Median: 0.11

The median suggests that in a typical scenario, around 11% of listed properties are sold.

Mean: 0.12

The mean is slightly higher than the median, indicating a right-skewed distribution, where a few periods or markets exhibit higher listing effectiveness.

1st quartile: 0.09

3rd quartile: 0.13

The 1st quartile (0.09) and 3rd quartile (0.13) show that 50% of observations fall between 9% and 13%, reinforcing that most of the time, a relatively small portion of listings translate into sales.

Max: 0.39

The maximum value (0.39) suggests that in some cases, nearly 39% of listings resulted in sales, possibly due to high demand, lower inventory, or seasonal effects.

Min: 0.05

The minimum ratio (0.05) implies that during certain times or in specific markets, only 5% of listed properties were sold, which may indicate an oversupply of listings or weaker demand.

The effectiveness of advertising is generally low, with most values clustering below 15%, meaning a large portion of properties remain unsold in a given period. This suggests potential challenges such as oversupply, pricing mismatches, or seasonal fluctuations in buyer demand. Further analysis could explore whether certain months, cities, or price ranges exhibit better conversion rates, helping real estate professionals refine their pricing and marketing strategies.
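
The two derived columns can be re-derived by hand for the five rows shown in the preview. The sketch below is illustrative only: the data frame `preview` and the columns `avg_price_usd` and `ad_effectiveness_pct` are assumptions introduced to express the average price in dollars and the ratio as a percentage.

```r
# First five rows from the data preview (volume in millions of dollars)
preview <- data.frame(
  sales = c(83, 108, 182, 200, 202),
  volume = c(14.162, 17.690, 28.701, 26.819, 28.833),
  listings = c(1533, 1586, 1689, 1708, 1771)
)

# Average price per sale, converted from millions of dollars to dollars
preview$avg_price_usd <- preview$volume * 1e6 / preview$sales

# Share of active listings sold in the month
preview$ad_effectiveness <- preview$sales / preview$listings
preview$ad_effectiveness_pct <- round(preview$ad_effectiveness * 100, 1)

print(preview)
```

For the first row this gives an average price of roughly $170,627 and a ratio of about 0.054, in line with the values 0.1706265 and 0.0541422 in the tables above.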

7-8. Conditional analysis and Graphical Representation

To gain deeper insight into the market dynamics, a conditional analysis is necessary. The following sections will therefore focus on:

Property sales per city
Property sales by city over the years
Property sales by city across months (seasonality trends)
The relationship between sales and listings across cities
Distribution of median price among cities

In addition, the effectiveness of listings—calculated earlier as the ratio of sales to listings—will be analyzed in more detail. This analysis will provide a clearer understanding of how efficiently different cities convert their property listings into actual sales, offering valuable insights for market strategies and inventory management.

Property sold per city

## 
## 
## |city                  | total_sales| mean_sales_year| sd_sales_year|
## |:---------------------|-----------:|---------------:|-------------:|
## |Beaumont              |       10643|          177.38|         41.48|
## |Bryan-College Station |       12358|          205.97|         84.98|
## |Tyler                 |       16185|          269.75|         61.96|
## |Wichita Falls         |        6964|          116.07|         22.15|

# Create the boxplot using ggplot2
ggplot(realestate_texas, aes(x = city, y = sales, fill = city)) +
  geom_boxplot() +  # Create the boxplot
  labs(title = "Boxplot of Sales by City", 
       x = "City", 
       y = "Sales") +
  theme_minimal() +  # Minimal theme for better clarity
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

The boxplot highlights that Tyler is the city with the highest number of properties sold, as indicated by its position in the upper range of the plot. In addition to higher sales, Tyler stands out for its market strength and stability, evidenced by its relatively narrow interquartile range, suggesting consistent performance over time.

On the other hand, Wichita Falls shows the lowest volume of property sales, with its boxplot situated near the bottom of the graph. Notably, the lower whisker drops below 100, indicating that in some months, the number of sales was particularly low.

Bryan-College Station exhibits a very wide IQR, reflecting high variability in monthly sales. This irregular pattern may be linked to seasonal factors, such as the academic calendar or local events, which could cause fluctuations in housing demand.

Lastly, Beaumont presents a moderate sales volume compared to the other cities. Its boxplot indicates a fairly consistent market, with a smaller spread and fewer extreme values. This suggests a relatively stable housing market without dramatic month-to-month changes in property sales.

Property sales by city over the years

## 
## 
## | year|city                  | total_sales| mean_sales_year| sd_sales_year|
## |----:|:---------------------|-----------:|---------------:|-------------:|
## | 2010|Beaumont              |        1874|          156.17|         36.92|
## | 2010|Bryan-College Station |        2011|          167.58|         70.75|
## | 2010|Tyler                 |        2730|          227.50|         48.98|
## | 2010|Wichita Falls         |        1481|          123.42|         26.62|
## | 2011|Beaumont              |        1728|          144.00|         22.66|
## | 2011|Bryan-College Station |        2009|          167.42|         62.19|
## | 2011|Tyler                 |        2866|          238.83|         49.62|
## | 2011|Wichita Falls         |        1275|          106.25|         19.76|
## | 2012|Beaumont              |        2063|          171.92|         28.39|
## | 2012|Bryan-College Station |        2361|          196.75|         74.28|
## | 2012|Tyler                 |        3162|          263.50|         46.40|
## | 2012|Wichita Falls         |        1349|          112.42|         14.25|
## | 2013|Beaumont              |        2414|          201.17|         37.73|
## | 2013|Bryan-College Station |        2854|          237.83|         95.85|
## | 2013|Tyler                 |        3449|          287.42|         53.05|
## | 2013|Wichita Falls         |        1455|          121.25|         26.00|
## | 2014|Beaumont              |        2564|          213.67|         36.49|
## | 2014|Bryan-College Station |        3123|          260.25|         86.69|
## | 2014|Tyler                 |        3978|          331.50|         56.85|
## | 2014|Wichita Falls         |        1404|          117.00|         21.09|

# Create bar chart
ggplot(total_sales_year, aes(x = year, y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Sales by Year and City",
    x = NULL,
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5)  # Center the title
  )

# Create line chart with clean data labels
ggplot(total_sales_year, aes(x = year, y = total_sales, color = city, group = city)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_text(
    aes(label = round(total_sales, 0)),
    position = position_dodge(width = 0.5),
    vjust = -0.7,
    size = 3,
    show.legend = FALSE  # Prevents labels from showing in legend
  ) +
  labs(
    x = NULL,
    y = "Total Sales",
    title = "Time Series – Total Sales per City and Year"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5)
  )

To provide a clearer overview of annual property sales across different cities, both a bar chart and a time series line chart were used.
Both visualizations show that Tyler has the most active real estate market. The number of properties sold there increased steadily over time, rising from approximately 2,730 in 2010 to 3,978 in 2014.
The cities of Bryan-College Station and Beaumont also experienced a general upward trend in property sales throughout the period. However, both saw a slight decline between 2010 and 2011 before resuming growth.
Wichita Falls, by contrast, appears to have the weakest real estate market among the cities analyzed. In every year of the dataset, fewer than 1,500 properties were sold. Notably, sales fell sharply from 1,481 in 2010 to 1,275 in 2011, a drop of about 14%. Although sales recovered partially afterward, they had not returned to the 2010 level by the end of 2014.
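
The year-over-year movements described above can be checked with a short calculation. The following is a minimal sketch, not part of the original analysis: it rebuilds the Wichita Falls series from the 2010 figure quoted in the text and the 2011-2014 totals in the summary table, and the column names yoy_change and yoy_pct are illustrative assumptions.

```r
library(dplyr)

# Wichita Falls annual totals: the 2010 value is quoted in the text,
# the 2011-2014 values come from the summary table above.
wichita_falls <- data.frame(
  year        = 2010:2014,
  total_sales = c(1481, 1275, 1349, 1455, 1404)
)

# yoy_change and yoy_pct are illustrative names (assumptions),
# not columns of the original dataset.
wichita_yoy <- wichita_falls %>%
  arrange(year) %>%
  mutate(
    yoy_change = total_sales - dplyr::lag(total_sales),
    yoy_pct    = round(100 * yoy_change / dplyr::lag(total_sales), 1)
  )

print(wichita_yoy)
```

The same pipeline could be applied to total_sales_year after adding group_by(city), so that each city's change is computed against its own previous year.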

Property sales by city across months (seasonality trends)

Both the time series and the stacked bar chart of total property sales per city from 2010 to 2014 reveal distinct patterns in market behavior. Among all cities, Tyler consistently leads the market, showing the highest number of properties sold across the years. Tyler also displays a strong seasonal trend, with sales peaking between April and August each year, which suggests heightened market activity during the spring and summer months.
Bryan-College Station, while showing generally lower sales than Tyler, is marked by high volatility, potentially influenced by external factors. This city also follows a similar seasonal pattern, though with more irregular peaks.
Beaumont demonstrates a stable and moderate sales trend, with visible but less pronounced seasonal increases in the middle of the year. This stability may point to a more consistent real estate demand throughout the year.
On the other hand, Wichita Falls exhibits the lowest and flattest sales trend, with limited seasonal fluctuation and consistently lower sales volume. Its modest peaks during mid-year months are less defined compared to the other cities, reflecting a relatively less dynamic housing market.
Overall, the plot highlights a recurring seasonal effect, where the period from April to August consistently registers higher sales across all cities, underlining the importance of these months in the Texas real estate market.
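
One way to quantify the seasonal effect described above is to compare each month's sales with the city's average month. The sketch below is only an illustration under stated assumptions: it uses the five preview rows of realestate_texas (Beaumont, January to May 2010) rather than the full dataset, and seasonal_index is a hypothetical measure introduced here, not one used in the original report.

```r
library(dplyr)

# Preview rows of realestate_texas (Beaumont, January-May 2010);
# the full dataset would be used in practice.
sample_rows <- data.frame(
  city  = "Beaumont",
  year  = 2010,
  month = 1:5,
  sales = c(83, 108, 182, 200, 202)
)

# seasonal_index is an illustrative measure (an assumption):
# monthly sales divided by the city's average monthly sales.
monthly_pattern <- sample_rows %>%
  group_by(city, month) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  group_by(city) %>%
  mutate(seasonal_index = round(total_sales / mean(total_sales), 2)) %>%
  ungroup()

print(monthly_pattern)
```

Values above 1 mark months that are busier than the city's average, so on the full data the April to August peak would appear as a run of indices above 1.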

Relationship between sales and listings across cities

## # A tibble: 4 × 3
##   city                  mean_effectiveness sd_effectiveness
##   <chr>                              <dbl>            <dbl>
## 1 Beaumont                          0.106            0.0267
## 2 Bryan-College Station             0.147            0.0729
## 3 Tyler                             0.0935           0.0235
## 4 Wichita Falls                     0.128            0.0247

The chart illustrates the mean advertisement effectiveness (sales/listings) by city, along with the corresponding standard deviations as error bars.
Bryan-College Station stands out with the highest average ad effectiveness, suggesting that listings in this city tend to convert into property sales more efficiently than in other locations. The higher variability (large error bar) may indicate fluctuating market responsiveness, possibly influenced by seasonal or event-driven factors (e.g., university cycles).
Wichita Falls also shows relatively strong effectiveness, with more consistency (smaller standard deviation) than Bryan-College Station, implying steady performance from its listings.
Beaumont and Tyler have lower average effectiveness. Interestingly, Tyler, despite having the highest total sales (as shown in the time series and boxplots), has the lowest ad effectiveness rate of the four cities. This suggests that while a large number of properties are being sold, the number of listings is disproportionately high, possibly due to oversupply, less targeted advertising, or listing redundancy.
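
The effectiveness measure and the error-bar chart discussed above can be sketched as follows. This is a hedged illustration rather than the report's actual code: the first part computes sales/listings on the five preview rows of realestate_texas only, the second part redraws the chart from the summary values printed in the table, and the object names (sample_rows, reported_summary, effectiveness_plot) are assumptions.

```r
library(dplyr)
library(ggplot2)

# Preview rows of realestate_texas (Beaumont, January-May 2010).
sample_rows <- data.frame(
  city     = "Beaumont",
  sales    = c(83, 108, 182, 200, 202),
  listings = c(1533, 1586, 1689, 1708, 1771)
)

# Effectiveness as defined in the text: sales divided by listings,
# then averaged per city with its standard deviation.
effectiveness_by_city <- sample_rows %>%
  mutate(effectiveness = sales / listings) %>%
  group_by(city) %>%
  summarise(
    mean_effectiveness = mean(effectiveness),
    sd_effectiveness   = sd(effectiveness),
    .groups = "drop"
  )
print(effectiveness_by_city)

# Summary values copied from the table printed above.
reported_summary <- data.frame(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  mean_effectiveness = c(0.106, 0.147, 0.0935, 0.128),
  sd_effectiveness   = c(0.0267, 0.0729, 0.0235, 0.0247)
)

# Bars show the mean; error bars span one standard deviation either side.
effectiveness_plot <- ggplot(
  reported_summary,
  aes(x = city, y = mean_effectiveness, fill = city)
) +
  geom_col(show.legend = FALSE) +
  geom_errorbar(
    aes(
      ymin = mean_effectiveness - sd_effectiveness,
      ymax = mean_effectiveness + sd_effectiveness
    ),
    width = 0.2
  ) +
  labs(x = NULL, y = "Mean effectiveness (sales / listings)") +
  theme_minimal()
```

Because the first part uses only five months, its mean will differ from Beaumont's full-period value in the table; printing effectiveness_plot renders the chart.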

Distribution of Median Price among Cities

## 
## 
## |city                  | tot_median_price| min_median_price| max_median_price| mean_median_price| sd_median_price|
## |:---------------------|----------------:|----------------:|----------------:|-----------------:|---------------:|
## |Beaumont              |          7799300|           106700|           163800|          129988.3|       10104.993|
## |Bryan-College Station |          9449300|           140700|           180000|          157488.3|        8852.235|
## |Tyler                 |          8486500|           120600|           161600|          141441.7|        9336.538|
## |Wichita Falls         |          6104600|            73800|           135300|          101743.3|       11320.034|

The aggregated table indicates that Bryan-College Station has the highest median property prices among the four cities. This is visually confirmed by the boxplot, where its box is positioned highest on the y-axis and includes an outlier, suggesting particularly high property values in some months. Tyler ranks second in terms of price levels, with median prices ranging from $120,600 to $161,600 around a mean of roughly $141,400. This city shows a relatively tight interquartile range (IQR), implying price consistency.

In contrast, Wichita Falls stands out as the city with the lowest property prices. Its distribution spans from $73,800 to $135,300, although the upper bound is influenced by an outlier. Excluding that outlier, the general price range is noticeably lower than in the other cities, reflecting a more affordable housing market.

Beaumont occupies a middle position, with moderate prices and a fairly symmetric distribution, suggesting stable property valuation over time.
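
A summary with the same columns as the table above, together with the outlier rule a boxplot applies, can be sketched as follows. This is a minimal illustration under stated assumptions: it runs on the five preview rows of realestate_texas only (Beaumont, January to May 2010), so its numbers do not match the full-period table, and the object names are hypothetical.

```r
library(dplyr)

# Preview rows of realestate_texas (Beaumont, January-May 2010).
sample_rows <- data.frame(
  city         = "Beaumont",
  median_price = c(163800, 138200, 122400, 123200, 123100)
)

# Same summary columns as the table printed in the report.
price_summary <- sample_rows %>%
  group_by(city) %>%
  summarise(
    tot_median_price  = sum(median_price),
    min_median_price  = min(median_price),
    max_median_price  = max(median_price),
    mean_median_price = mean(median_price),
    sd_median_price   = sd(median_price),
    .groups = "drop"
  )
print(price_summary)

# Values a boxplot would draw as outliers: points lying more than
# 1.5 times the hinge spread beyond the lower or upper hinge.
price_outliers <- boxplot.stats(sample_rows$median_price)$out
print(price_outliers)
```

With so few rows the flagged value says little about the market itself; on the full dataset the same call, applied per city, would identify the outliers visible in the boxplot.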

9. Conclusion

Considering the above analysis, Tyler emerges as the city with the most dynamic market, combining the highest property sales with the second-highest median prices. However, advertising in this area appears to be less effective, and sales follow a clear seasonal pattern, peaking from mid-spring through summer. Texas Realty Insights should maintain its focus on Tyler, strategically intensifying listings during these high-activity periods.

Bryan-College Station stands out with the highest average advertising effectiveness and the highest median property prices. Despite high variability in market responsiveness, likely influenced by seasonal factors or local events, this city presents strong potential. It represents a promising opportunity for strategic investment by Texas Realty Insights.

Beaumont shows increasing property sales over the years and a stable median price range that, while not as high as Bryan-College Station or Tyler, remains above that of Wichita Falls. This indicates emerging potential, and the city may offer an attractive investment opportunity for Texas Realty Insights.

On the other hand, Wichita Falls reflects a less dynamic and less prosperous market. While it shows relatively strong advertising effectiveness, it consistently ranks lowest in both property sales and median prices. Texas Realty Insights may consider prioritizing investment in the other three cities, where growth and profitability appear more promising.

Texas Real Estate Market Analysis

Claudia Rigola

15/04/2025
