Background

The aim of this report is to provide statistical and visual insights that support strategic real estate decisions on sales and listings.

Setting up the environment

1. Variable Identification and Description

city year month sales volume median_price listings months_inventory
Beaumont 2010 1 83 14.162 163800 1533 9.5
Beaumont 2010 2 108 17.690 138200 1586 10.0
Beaumont 2010 3 182 28.701 122400 1689 10.6
Beaumont 2010 4 200 26.819 123200 1708 10.6
Beaumont 2010 5 202 28.833 123100 1771 10.9

The imported dataset realestate_texas consists of eight variables, covering both qualitative and quantitative data types. Understanding the nature of these variables is crucial for determining appropriate statistical and analytical approaches.

Variable Types
  • City: A qualitative nominal variable representing the name of the city. Since there is no inherent ordering among city names, it can only be analyzed in terms of frequency distributions and comparisons, in particular to identify its unique values.
  • Year: Although numerical, it is preferable to consider it as a qualitative ordinal variable, as it represents a temporal sequence rather than a true continuous measure. This allows for trend analysis, year-over-year comparisons, and time series modeling.
  • Month: Similar to Year, this variable is best treated as a qualitative ordinal variable, as months follow a natural order in a yearly cycle.
  • Sales: A quantitative discrete variable representing the number of properties sold. Since sales data consist of whole numbers (integer values), analysis techniques such as descriptive statistics, frequency distributions, and time series modeling can be applied.
  • Volume: A quantitative continuous variable indicating the total sales value (in millions of dollars). Being a continuous variable, it allows for analysis techniques such as descriptive statistics and frequency distributions.
  • Median Price: A quantitative variable representing the median property sales price in dollars. Although price could theoretically be continuous, it is treated as discrete here. It can be used for comparing price distributions across cities and, more generally, supports descriptive statistics and frequency distributions.
  • Listings: A quantitative discrete variable representing the total number of active property listings at a given time. It helps in understanding the relationship between inventory and sales and, more generally, supports descriptive statistics and frequency distributions.
  • Months Inventory: A quantitative continuous variable indicating the estimated number of months required to sell all active listings at the current sales pace. It is a critical metric for real estate market cycles and, more generally, supports descriptive statistics and frequency distributions.
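The classification above could be encoded in R roughly as follows. This is a minimal sketch built on the five sample rows shown earlier; the data frame name `sample_rows` is an assumption made for illustration, while the report itself works on `realestate_texas`.

```r
# Five sample rows copied from the table above (illustrative subset only)
sample_rows <- data.frame(
  city = rep("Beaumont", 5),
  year = rep(2010L, 5),
  month = 1:5,
  sales = c(83L, 108L, 182L, 200L, 202L),
  volume = c(14.162, 17.690, 28.701, 26.819, 28.833),
  median_price = c(163800, 138200, 122400, 123200, 123100),
  listings = c(1533L, 1586L, 1689L, 1708L, 1771L),
  months_inventory = c(9.5, 10.0, 10.6, 10.6, 10.9)
)

# Nominal variable: unordered factor
sample_rows$city <- factor(sample_rows$city)

# Ordinal variables: ordered factors with an explicit level order
sample_rows$year <- factor(sample_rows$year, levels = 2010:2014, ordered = TRUE)
sample_rows$month <- factor(sample_rows$month, levels = 1:12, ordered = TRUE)

# Inspect the resulting column types
str(sample_rows)
```

Keeping `year` and `month` as ordered factors suits frequency tables and grouped plots; for arithmetic on time (for example, building a date index) the original integer columns would be the better choice.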

2. Measures of Central Tendency, Variability and Shapes

Analysis of the variable city
## city
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60

Since city is a qualitative nominal variable, analyzing its frequency distribution provides insights into how the dataset is structured. The frequency table shows that each city —Beaumont, Bryan-College Station, Tyler, and Wichita Falls— appears exactly 60 times. This indicates that the dataset is evenly distributed across cities, meaning no single city is overrepresented or underrepresented. As a result, there is no mode in the distribution. This suggests that the data collection process was designed to ensure equal representation across different locations, allowing for comparative analysis among cities.

Analysis of the variable year
## year
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48

The year variable represents the time period over which sales data were collected. The dataset spans five years, covering the period from January 2010 to December 2014. The frequency distribution shows that each year contains exactly 48 observations, confirming that the data collection was evenly distributed over time. Given that the dataset covers four cities, this suggests that for each city, an equal number of records were recorded per year, allowing for consistent year-over-year comparisons and reliable trend analysis.

Analysis of the variable month
## month
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20

Examining the frequency distribution of the month variable confirms the structured nature of the dataset. Each month appears exactly 20 times: with four cities, that is five observations per month for each city, one for each year.

Furthermore, this pattern aligns with the yearly distribution, reinforcing that data were systematically recorded every month from January 2010 to December 2014 without gaps.
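The balanced structure described above can be checked with a few frequency tables. The sketch below rebuilds only the city/year/month design (no sales figures) so that it runs on its own; the object name `design` is hypothetical, and on the real data the same calls would be applied to `realestate_texas`.

```r
# Rebuild the city/year/month design described above (no sales figures needed)
cities <- c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls")
design <- expand.grid(month = 1:12, year = 2010:2014, city = cities,
                      stringsAsFactors = FALSE)

nrow(design)          # 4 cities x 5 years x 12 months
table(design$city)    # observations per city
table(design$year)    # observations per year
table(design$month)   # observations per month

# A balanced panel has exactly one row per city/year/month combination
cell_counts <- table(design$city, design$year, design$month)
all(cell_counts == 1)
```

The three-way table is the stricter check: equal marginal counts alone would not rule out a duplicated month in one city offset by a missing month in another.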

Analysis of the variable sales
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0

The minimum value of sales is 79, the maximum is 423. The mean (192.3) is higher than the median (175.5), which suggests a right-skewed (positively skewed) distribution. This implies that while most values are concentrated in the lower range, there are some larger values (potential outliers) pulling the mean upward.

To further confirm the findings highlighted by the measures of position, an analysis of variability is essential. This allows us to understand the spread and fluctuations in the sales data.

Range IQR Variance Sd CV
Sales 344 120 6344.3 79.65 41.42

Range: 344

The difference between the maximum and minimum sales values shows substantial spread in the dataset, reinforcing the observation that sales are highly variable across cities and months.

Interquartile Range: 120

The difference between the 75th percentile (Q3) and the 25th percentile (Q1), which captures the middle 50% of the data, is significantly smaller than the range. This could indicate that extreme values are pulling the range outward.

Variance: 6344.3

The variance measures how much individual sales values deviate from the mean. A large variance (6344.3) suggests a wide spread of data points, indicating high variability in the number of properties sold.

Standard Deviation: 79.65

The standard deviation (79.65) measures the typical deviation of sales from the mean. On average, the number of properties sold deviates by about 79.65 units from the mean, which again shows the significant fluctuations in sales activity.

Coefficient of Variation (CV): 41.42%

The Coefficient of Variation (CV) is calculated as the ratio of the standard deviation to the mean, expressed as a percentage. With a CV of 41.42%, we can conclude that there is high relative variability in sales, meaning that sales fluctuate significantly across different locations (cities) and periods (months and years). This suggests that market conditions are highly volatile and may be influenced by external factors.

Conclusion

The analysis of both measures of position and measures of variability confirms the substantial variability in the real estate market. The right-skewed distribution, large range, and high coefficient of variation all point to a dynamic market where the number of properties sold can vary greatly across different locations and time periods.
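The variability measures discussed above can be gathered in one small helper. This is a sketch, not the report's original code: the function name `variability_summary` is hypothetical, and it is demonstrated on the five sample sales values from the first table rather than on the full `sales` vector.

```r
# Range, IQR, variance, standard deviation and coefficient of variation (in %)
variability_summary <- function(x) {
  c(
    Range = diff(range(x)),
    IQR = IQR(x),
    Variance = var(x),
    Sd = sd(x),
    CV = sd(x) / mean(x) * 100
  )
}

# Five sample sales values from the table at the top of the report
sample_sales <- c(83, 108, 182, 200, 202)
round(variability_summary(sample_sales), 2)

# The reported CV follows directly from the reported sd and mean
round(79.65 / 192.3 * 100, 2)
```

Because the CV is unit-free, it is the only one of these measures that can be compared directly across variables measured on different scales, which is how it is used later in the report.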

To strengthen these findings, a graphical representation and an in-depth analysis of skewness and kurtosis are essential.

Skewness Kurtosis
Sales 0.72 -0.31

Skewness: 0.72

As expected, there is positive skewness (right skew). The boxplot confirms that the median is closer to the first quartile than to the third.

Kurtosis: -0.31

Kurtosis describes the shape of the frequency distribution, in particular how peaked it is and how heavy its tails are relative to the normal distribution. The value of -0.31 indicates slightly platykurtic behavior, meaning that the curve has a lower, flatter peak than the normal distribution.

Boxplot: Looking at the boxplot, it is possible to notice that there are no outliers. However, the length of the upper whisker, which is noticeably longer than the lower whisker, implies a right-skewed distribution. This suggests that while most values are concentrated in the lower range, there are some higher values extending the distribution without being classified as outliers.
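For reference, skewness and kurtosis of this kind can be computed from central moments in base R. The sketch below assumes the reported values are moment-based coefficients with kurtosis expressed as excess over the normal distribution (so that 0 is mesokurtic); the function names are hypothetical, and the report may have used a package function instead.

```r
# k-th central moment (population form, dividing by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skewness_moment <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Excess kurtosis: fourth central moment over squared variance, minus 3
excess_kurtosis_moment <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

# A symmetric vector has zero skewness; a long right tail gives a positive value
skewness_moment(c(-2, -1, 0, 1, 2))
skewness_moment(c(1, 1, 1, 10))

# Negative excess kurtosis means flatter than normal (platykurtic)
excess_kurtosis_moment(c(-2, -1, 0, 1, 2))
```

Sample-adjusted estimators differ slightly from these population formulas, so small discrepancies in the second decimal are possible when comparing with other software.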

Analysis of the variable volume
summary(volume)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.166  17.660  27.062  31.005  40.893  83.547

The minimum value for volume is 8.166 million dollars, the maximum is 83.547, whereas the mean is 31.005 million dollars. Median (27.062) is slightly lower than the mean (31.005). This suggests a right (positive) skew, meaning there are higher values (outliers) pulling the mean upward.

Range IQR Variance Sd CV
Volume 75.38 23.23 277.27 16.65 53.71

Range: 75.38

The difference between the maximum (83.547) and the minimum (8.166) is 75.38 million dollars.

Interquartile Range: 23.23

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 277.27

A variance of 277.27 is relatively high, indicating a substantial spread.

Standard Deviation: 16.65

The standard deviation measures the typical deviation of volume from the mean. On average, sales volume deviates by about 16.65 million dollars from the mean.

Coefficient of Variation: 53.71%

A Coefficient of Variation above 50% signals high relative variability, meaning sales amounts are inconsistent. This suggests potential market volatility, with some regions/times experiencing significantly higher/lower sales.

To confirm the evidence highlighted by the measures of variability, a graphical representation and an in-depth analysis of skewness and kurtosis are essential.

Skewness Kurtosis
Volume 0.88 0.18

Skewness: 0.88

A value higher than 0 confirms positive skewness (right skew). The boxplot confirms that the median is closer to the first quartile than to the third.

Kurtosis: 0.18

A value of 0.18 indicates a slightly leptokurtic distribution, meaning that the curve has a higher peak than the normal distribution, with somewhat more observations concentrated near the central value.

Boxplot:

The boxplot confirms the presence of outliers, as there are four values that exceed the black line, which represents the upper threshold (Q3 + 1.5 × IQR). The upper whisker is noticeably longer than the lower whisker, indicating a right-skewed distribution. This suggests that while most values are concentrated in the lower range, some higher values extend the distribution, with a few classified as outliers.
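The upper threshold mentioned above can be worked out from the quartiles printed in the summary. This is a sketch based on the rounded summary values; R's `boxplot()` computes its box from hinges, which can differ slightly from these quantiles, so the fence drawn in the plot may not match to the last decimal.

```r
# Quartiles and maximum of volume, taken from the summary output above
q1 <- 17.660
q3 <- 40.893
max_volume <- 83.547
min_volume <- 8.166

iqr <- q3 - q1                    # interquartile range
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged as outliers
lower_fence <- q1 - 1.5 * iqr     # values below this are flagged as outliers

upper_fence
lower_fence

# The maximum lies beyond the upper fence; the minimum is well inside the lower one
max_volume > upper_fence
min_volume < lower_fence
```

The lower fence is negative, which is impossible for a sales value, so outliers can only appear on the upper side for this variable.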

Analysis of the variable Median_price
summary(median_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73800  117300  134500  132665  150050  180000

The minimum value is 73800 dollars, while the maximum is 180000. Median (134500) is slightly higher than the Mean (132665). This suggests a slight left (negative) skew, meaning there may be some lower values pulling the mean down.

Range IQR Variance Sd CV
Median Price 106200 32750 513572983 22662.15 17.08

Range: 106200

The range, namely the difference between the minimum (73800) and the maximum (180000), indicates a wide spread in the data.

Interquartile range: 32750

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 513572983

A variance of 513,572,983 is relatively high, indicating a substantial spread. Because it is expressed in squared dollars, the standard deviation below is easier to interpret.

Standard Deviation: 22662.15

The average deviation from the mean price is approximately $22,662, indicating a noticeable spread.

Coefficient of Variation: 17.08%

A Coefficient of Variation lower than 20% indicates that the variability is moderate, meaning median home prices do not fluctuate excessively relative to their mean.

Skewness Kurtosis
Median Price -0.36 -0.62

Skewness: -0.36

A skewness value of -0.36 indicates a negative skewness (left-skewed).

Kurtosis: -0.62

A value of -0.62 indicates a platykurtic distribution, that is, a curve with a lower peak than the normal curve and less concentration of observations around the central value.

Boxplot:

The boxplot shows that there are no outliers. It also confirms that the data are slightly left-skewed (negative skew): the lower whisker is longer and the median is slightly closer to Q3. The median is nonetheless almost centered within the box, indicating that the middle 50% of the data is quite symmetrically distributed. The whiskers show that the data are more spread out below the first quartile, indicating mild left-tail skewness. While the core of the data is quite balanced, a few lower values cause greater dispersion at the bottom, without being extreme enough to be classified as outliers.

Analysis of the variable listings
summary(listings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     743    1026    1618    1738    2056    3296

The minimum number of listings is 743, whereas the maximum is 3296. The mean (1738) is higher than the median (1618). This suggests a right (positive) skew, meaning there are higher values (possible outliers) pulling the mean upward. The gap between Q3 (2056) and the maximum (3296) is large, indicating potential outliers at the higher end.

Range IQR Variance Sd CV
Listings 2553 1029.5 566569 752.71 43.31

Range: 2553

The range, the difference between the minimum value (743) and the maximum value (3296), is 2553, indicating a wide spread in the data.

Interquartile Range: 1029.5

The interquartile range is much smaller than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 566569

A variance of 566,569 indicates significant dispersion in the number of listings.

Standard Deviation: 752.71

A Standard Deviation of 752.71 means that on average, the number of active listings deviates about 753 listings from the mean.

Coefficient of Variation: 43.31%

A coefficient of variation greater than 40% indicates high relative variability. The number of active listings fluctuates significantly, suggesting market instability or differences across cities.

Skewness Kurtosis
Listings 0.65 -0.79

Skewness: 0.65

A skewness value higher than 0 indicates positive skewness (right skew).

Kurtosis: -0.79

A kurtosis value of -0.79 indicates a platykurtic distribution, that is, a curve with a lower peak than the normal curve and less concentration of observations around the central value.

Boxplot:

The median is almost centered within the box, indicating that the middle 50% of the data is quite symmetrically distributed. The whiskers show that the data are more spread out above the third quartile, which indicates mild right-tail skewness. While the core of the data is quite balanced, a few higher values cause greater dispersion at the top. However, these values are not extreme enough to be classified as outliers.

Analysis of the variable Months Inventory
summary(months_inventory)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.400   7.800   8.950   9.193  10.950  14.900

The minimum value of months inventory is 3.4 and the maximum is 14.9. The mean (9.193) is higher than the median (8.950). This suggests a slight right (positive) skew, with some higher values pulling the mean upward.

Range IQR Variance Sd CV
Months Inventory 11.5 3.15 5.31 2.3 25.06

Range: 11.5

A range of 11.5 months (from 3.4 to 14.9) represents a wide spread in the data.

Interquartile range: 3.15

The interquartile range is lower than the range. This suggests that while the middle 50% of the data is relatively concentrated, some extreme values or high variability exist in the dataset.

Variance: 5.31

A variance of 5.31 is relatively low, indicating a moderate spread.

Standard Deviation: 2.3

A standard deviation of 2.30 indicates that on average the months required to sell listings deviate about 2.30 months from the mean.

Coefficient of Variation: 25.06%

A coefficient of variation between 20% and 30% means that the relative variability is moderate. This suggests that inventory levels are somewhat stable, though regional or seasonal factors may still influence fluctuations.

Skewness Kurtosis
Months Inventory 0.04 -0.17

Skewness: 0.04

A skewness value of 0.04 is very close to zero and indicates an almost symmetric distribution.

Kurtosis: -0.17

A kurtosis value of -0.17 indicates an almost mesokurtic distribution: the peak and tails are close to those of the normal curve.

Boxplot:

The median is not centered in the box; it lies nearer the first quartile, indicating that the middle 50% of the data is not symmetrically distributed. The whiskers of equal length imply that extreme values are evenly distributed at both ends, even though the density of observations is greater in the lower range.

3. Identification of variables with greater variability and asymmetry

Sales Volume Listings Median Price Months Inventory
CV 41.42 53.71 43.31 17.08 25.06
Skewness 0.72 0.88 0.65 -0.36 0.04

Comparing the five quantitative variables, Volume has the highest relative variability (CV: 53.71), indicating significant fluctuations in total sales value. Listings (CV: 43.31) and Sales (CV: 41.42) also show substantial variation, reflecting changes in market activity. In contrast, Median Price (CV: 17.08) is the most stable, with relatively minor fluctuations. Months Inventory (CV: 25.06) falls in between, suggesting moderate variability in the time required to sell available properties.

In terms of distribution, Volume is the most skewed (0.88), highlighting the presence of extreme sales values. Sales (0.72) and Listings (0.65) are also moderately right-skewed, with a longer tail toward higher values. Median Price (-0.36) is slightly left-skewed. Months Inventory (0.04) is nearly symmetric.

Taken together, these results suggest that total sales value (Volume) may be strongly influenced by factors such as Listings and seasonality, while Months Inventory appears comparatively stable.

4. Creating classes for a quantitative variable

# Define the number of bins (6 equal-width intervals)
sales_bins <- cut(sales, 
                  breaks = seq(min(sales), max(sales), length.out = 7), 
                  include.lowest = TRUE)

# Compute frequency distribution
freq_table <- table(sales_bins)

# Compute relative frequencies
rel_freq <- prop.table(freq_table)

# Compute cumulative frequencies
cum_freq <- cumsum(freq_table)

# Compute cumulative relative frequencies
cum_rel_freq <- cumsum(rel_freq)

# Combine into a distribution table
distribution_table <- data.frame(
  Sales_Category = names(freq_table),
  Frequency = as.vector(freq_table),
  Relative_Frequency = round(as.vector(rel_freq), 4),
  Cumulative_Frequency = as.vector(cum_freq),
  Cumulative_Relative_Frequency = round(as.vector(cum_rel_freq), 4)
)

# Print the distribution table
print(distribution_table)
##   Sales_Category Frequency Relative_Frequency Cumulative_Frequency
## 1       [79,136]        74             0.3083                   74
## 2      (136,194]        67             0.2792                  141
## 3      (194,251]        40             0.1667                  181
## 4      (251,308]        36             0.1500                  217
## 5      (308,366]        15             0.0625                  232
## 6      (366,423]         8             0.0333                  240
##   Cumulative_Relative_Frequency
## 1                        0.3083
## 2                        0.5875
## 3                        0.7542
## 4                        0.9042
## 5                        0.9667
## 6                        1.0000
# Bar plot of sales distribution
barplot(freq_table, 
        col = c("lightcoral","lightpink","lightblue","lightgreen","lightgoldenrodyellow","lightyellow"),
        main = "Sales Distribution (Equal Bins)", 
        xlab = "Sales Categories", 
        ylab = "Frequency")

# Compute and print Gini Index
gini_index_sales <- round(gini.index(as.numeric(sales_bins)), 2)

gini_index_sales_matrix <- matrix(c(gini_index_sales), 
  nrow = 1, byrow = TRUE
)

# Adding names to rows and columns
colnames(gini_index_sales_matrix) <- c("**Gini Index**")
rownames(gini_index_sales_matrix) <- c("**Sales**")

# Pretty print the matrix with bold column and row names
kable(gini_index_sales_matrix, format = "markdown") %>% 
  kable_styling() %>%
  row_spec(0, bold = TRUE) %>%  # Bold column names
  column_spec(1, bold = TRUE) %>%  # Bold first column (row names)
  column_spec(2:ncol(gini_index_sales_matrix), bold = FALSE)  # Keep matrix values in normal font
Gini Index
Sales 0.93

Histogram:

Consistent with the boxplot, the bar plot of the six classes confirms the right-skewed distribution of Sales. The first three classes, representing lower sales counts, account for about 75% of the observations (181 of 240), indicating that most city-months register between 79 and 251 sales. Higher sales figures, beyond 251, may be influenced by seasonal trends or market anomalies. To further investigate these patterns, a conditional analysis is recommended to assess potential external drivers.

Gini Index: 0.93

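The Gini index is used here as a measure of heterogeneity across the six sales classes. The value of 0.93 is consistent with the normalized Gini heterogeneity index computed on the class frequencies, which equals 0 when all observations fall in a single class and 1 when they are spread evenly across classes. A value this close to 1 therefore indicates high heterogeneity rather than concentration in one class. The distribution is still not uniform: the majority of the observations (181 of 240) are contained in the first three categories.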
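The function `gini.index` used above is not defined in the report. As an assumption, the sketch below implements the normalized Gini heterogeneity index, 1 minus the sum of squared relative frequencies, divided by (k - 1) / k for k classes, and applies it to the class frequencies from the distribution table. Under that assumption it reproduces the printed value; the function name `gini_heterogeneity` is hypothetical.

```r
# Normalized Gini heterogeneity index (assumed definition, see lead-in):
# 0 when all observations fall in one class, 1 when classes are equally frequent
gini_heterogeneity <- function(counts) {
  f <- counts / sum(counts)   # relative frequencies
  k <- length(counts)         # number of classes
  g <- 1 - sum(f^2)           # raw heterogeneity
  g / ((k - 1) / k)           # rescale so the maximum is 1
}

# Class frequencies from the distribution table above
sales_class_freq <- c(74, 67, 40, 36, 15, 8)
round(gini_heterogeneity(sales_class_freq), 2)

# Reference points: perfectly even classes and a single occupied class
gini_heterogeneity(rep(40, 6))
gini_heterogeneity(c(240, 0, 0, 0, 0, 0))
```

On this scale a value close to 1 means the observations are spread across many classes. It should not be confused with the Gini concentration coefficient used for income inequality, where values near 1 mean strong concentration.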

5. Probability

The probability that a row taken at random from this dataset carries the city “Beaumont” can be calculated as follows:

tot_num_observations <- length(city)
tot_num_observations
## [1] 240
num_observations_Beaumont <- sum(city == "Beaumont")
num_observations_Beaumont
## [1] 60

Since there are 240 observations across 4 cities, with 60 observations for each city, the probability that a randomly selected row carries the city “Beaumont” is:

probability_Beaumont <- num_observations_Beaumont/tot_num_observations
probability_Beaumont
## [1] 0.25

The probability that a row taken at random from this dataset reports the month of July can be calculated as follows:

tot_num_observations_month <- length(month)
tot_num_observations_month
## [1] 240
num_observations_July <- sum(month == 7)  # July is month 7
num_observations_July
## [1] 20

The probability is:

probability_July <- num_observations_July/tot_num_observations_month
probability_July
## [1] 0.08333333

The probability that it reports the month of December 2012 can be calculated as follows:

num_observations_December_2012 <- sum(month == 12 & year == 2012)
num_observations_December_2012
## [1] 4
probability_december_2012 <- num_observations_December_2012/tot_num_observations_month
probability_december_2012
## [1] 0.01666667
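Because the dataset is balanced, these probabilities can also be read as simple proportions, and the joint probability for December 2012 equals the product of the month and year probabilities. The sketch below rebuilds only the city/year/month design so that it runs on its own; the object name `design` is hypothetical, and on the real data the same expressions apply to the columns of `realestate_texas`.

```r
# Rebuild the balanced city/year/month design (one row per combination)
cities <- c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls")
design <- expand.grid(month = 1:12, year = 2010:2014, city = cities,
                      stringsAsFactors = FALSE)

# mean() of a logical vector is the proportion of TRUE values
p_beaumont <- mean(design$city == "Beaumont")
p_july <- mean(design$month == 7)
p_dec_2012 <- mean(design$month == 12 & design$year == 2012)

p_beaumont
p_july
p_dec_2012

# In a balanced design, month and year are independent:
# P(December and 2012) = P(December) * P(2012)
all.equal(p_dec_2012, mean(design$month == 12) * mean(design$year == 2012))
```

The product rule in the last line holds only because every month appears equally often in every year; with missing months it would no longer be exact.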

6. New variable creation

# Create a new column for average price per sale
# (volume is in millions of dollars, so avg_price is in millions of dollars too)
realestate_texas$avg_price <- volume / sales

kable(head(realestate_texas,5))
city year month sales volume median_price listings months_inventory avg_price
Beaumont 2010 1 83 14.162 163800 1533 9.5 0.1706265
Beaumont 2010 2 108 17.690 138200 1586 10.0 0.1637963
Beaumont 2010 3 182 28.701 122400 1689 10.6 0.1576978
Beaumont 2010 4 200 26.819 123200 1708 10.6 0.1340950
Beaumont 2010 5 202 28.833 123100 1771 10.9 0.1427376
# Create a new column for ad effectiveness
realestate_texas$ad_effectiveness <- sales / listings

# View the first few rows to check the new columns
kable(head(realestate_texas,5))
city year month sales volume median_price listings months_inventory avg_price ad_effectiveness
Beaumont 2010 1 83 14.162 163800 1533 9.5 0.1706265 0.0541422
Beaumont 2010 2 108 17.690 138200 1586 10.0 0.1637963 0.0680958
Beaumont 2010 3 182 28.701 122400 1689 10.6 0.1576978 0.1077561
Beaumont 2010 4 200 26.819 123200 1708 10.6 0.1340950 0.1170960
Beaumont 2010 5 202 28.833 123100 1771 10.9 0.1427376 0.1140599
round(summary(realestate_texas$ad_effectiveness),2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    0.09    0.11    0.12    0.13    0.39
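The two derived columns can be reproduced on the five sample rows, with the average price converted to dollars for readability. This is a sketch: the data frame `sample_rows` and the column name `avg_price_usd` are hypothetical and not part of the report's dataset.

```r
# Five sample rows (sales, volume in millions of dollars, listings)
sample_rows <- data.frame(
  sales = c(83, 108, 182, 200, 202),
  volume = c(14.162, 17.690, 28.701, 26.819, 28.833),
  listings = c(1533, 1586, 1689, 1708, 1771)
)

# Average price per sale in dollars: convert volume from millions first
sample_rows$avg_price_usd <- sample_rows$volume * 1e6 / sample_rows$sales

# Share of active listings that resulted in a sale in the month
sample_rows$ad_effectiveness <- sample_rows$sales / sample_rows$listings

sample_rows
```

Expressed in dollars, the average price can be compared directly with `median_price`; an average well above the median in the same month would point to a few expensive sales pulling the mean upward.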

Ad effectiveness

The calculated Sales-to-Listings ratio provides insight into the effectiveness of listings in converting to actual property sales. The values range from 0.05 (minimum) to 0.39 (maximum), indicating significant variation across different periods or locations.

Median: 0.11

The median suggests that in a typical scenario, around 11% of listed properties are sold.

Mean: 0.12

The mean is slightly higher than the median, indicating a right-skewed distribution, where a few periods or markets exhibit higher listing effectiveness.

1st quartile: 0.09 3rd quartile: 0.13

The 1st quartile (0.09) and 3rd quartile (0.13) show that 50% of observations fall between 9% and 13%, reinforcing that most of the time, a relatively small portion of listings translate into sales.

Max: 0.39

The maximum value (0.39) suggests that in some cases, nearly 39% of listings resulted in sales, possibly due to high demand, lower inventory, or seasonal effects.

Min: 0.05

The minimum ratio (0.05) implies that during certain times or in specific markets, only 5% of listed properties were sold, which may indicate an oversupply of listings or weaker demand.

The effectiveness of advertising is generally low, with most values clustering below 15%, meaning a large portion of properties remain unsold in a given period. This suggests potential challenges such as oversupply, pricing mismatches, or seasonal fluctuations in buyer demand. Further analysis could explore whether certain months, cities, or price ranges exhibit better conversion rates, helping real estate professionals refine their pricing and marketing strategies.

7-8. Conditional analysis and Graphical Representation


To gain deeper insight into the market dynamics, a conditional analysis is necessary. The following sections therefore examine property sales by city and by city over the years.

In addition, the effectiveness of listings—calculated earlier as the ratio of sales to listings—will be analyzed in more detail. This analysis will provide a clearer understanding of how efficiently different cities convert their property listings into actual sales, offering valuable insights for market strategies and inventory management.

Properties sold per city


## 
## 
## |city                  | total_sales| mean_sales_year| sd_sales_year|
## |:---------------------|-----------:|---------------:|-------------:|
## |Beaumont              |       10643|          177.38|         41.48|
## |Bryan-College Station |       12358|          205.97|         84.98|
## |Tyler                 |       16185|          269.75|         61.96|
## |Wichita Falls         |        6964|          116.07|         22.15|
# Create the boxplot using ggplot2
ggplot(realestate_texas, aes(x = city, y = sales, fill = city)) +
  geom_boxplot() +  # Create the boxplot
  labs(title = "Boxplot of Sales by City", 
       x = "City", 
       y = "Sales") +
  theme_minimal() +  # Minimal theme for better clarity
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )


The boxplot highlights that Tyler is the city with the highest number of properties sold, as indicated by its position in the upper range of the plot. In addition to higher sales, Tyler stands out for its market strength and stability, evidenced by its relatively narrow interquartile range, suggesting consistent performance over time.
On the other hand, Wichita Falls shows the lowest number of property sales, with its boxplot situated near the bottom of the graph. Notably, the lower whisker drops below 100, indicating that in some months, the number of sales was particularly low.
Bryan-College Station exhibits a very wide IQR, reflecting high variability in monthly sales. This irregular pattern may be linked to seasonal factors, such as the academic calendar or local events, which could cause fluctuations in housing demand.
Lastly, Beaumont presents a moderate number of sales compared to the other cities. Its boxplot indicates a fairly consistent market, with a smaller spread and fewer extreme values. This suggests a relatively stable housing market without dramatic month-to-month changes in property sales.

Property sales by city over the years
## 
## 
## | year|city                  | total_sales| mean_sales_year| sd_sales_year|
## |----:|:---------------------|-----------:|---------------:|-------------:|
## | 2010|Beaumont              |        1874|          156.17|         36.92|
## | 2010|Bryan-College Station |        2011|          167.58|         70.75|
## | 2010|Tyler                 |        2730|          227.50|         48.98|
## | 2010|Wichita Falls         |        1481|          123.42|         26.62|
## | 2011|Beaumont              |        1728|          144.00|         22.66|
## | 2011|Bryan-College Station |        2009|          167.42|         62.19|
## | 2011|Tyler                 |        2866|          238.83|         49.62|
## | 2011|Wichita Falls         |        1275|          106.25|         19.76|
## | 2012|Beaumont              |        2063|          171.92|         28.39|
## | 2012|Bryan-College Station |        2361|          196.75|         74.28|
## | 2012|Tyler                 |        3162|          263.50|         46.40|
## | 2012|Wichita Falls         |        1349|          112.42|         14.25|
## | 2013|Beaumont              |        2414|          201.17|         37.73|
## | 2013|Bryan-College Station |        2854|          237.83|         95.85|
## | 2013|Tyler                 |        3449|          287.42|         53.05|
## | 2013|Wichita Falls         |        1455|          121.25|         26.00|
## | 2014|Beaumont              |        2564|          213.67|         36.49|
## | 2014|Bryan-College Station |        3123|          260.25|         86.69|
## | 2014|Tyler                 |        3978|          331.50|         56.85|
## | 2014|Wichita Falls         |        1404|          117.00|         21.09|
# Create bar chart
ggplot(total_sales_year, aes(x = year, y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Sales by Year and City",
    x = NULL,
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5)  # Center the title
  )


# Create line chart with clean data labels
ggplot(total_sales_year, aes(x = year, y = total_sales, color = city, group = city)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_text(
    aes(label = round(total_sales, 0)),
    position = position_dodge(width = 0.5),
    vjust = -0.7,
    size = 3,
    show.legend = FALSE  # Prevents labels from showing in legend
  ) +
  labs(
    x = NULL,
    y = "Total Sales",
    title = "Time Series – Total Sales per City and Year"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5)
  )


To provide a clearer overview of annual property sales across different cities, both a bar chart and a time series line chart were used.
From both visualizations, it is evident that Tyler has the most active real estate market. The number of properties sold in this city increased steadily over time, rising from 2,730 in 2010 to 3,978 in 2014.
The cities of Bryan-College Station and Beaumont also experienced a general upward trend in property sales throughout the period. However, neither grew between 2010 and 2011: Beaumont declined from 1,874 to 1,728, while Bryan-College Station was essentially flat (2,011 to 2,009), before both resumed growth.
On the other hand, Wichita Falls appears to have the weakest real estate market among the cities analyzed. In every year of the dataset, the number of properties sold remained below 1,500. Notably, there was a sharp decline from 1,481 in 2010 to 1,275 in 2011. Although sales improved slightly afterward, they had not returned to 2010 levels by the end of 2014.
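The trends described above can be quantified as year-over-year percentage changes, using the annual totals from the table. This is a sketch: the matrix `total_sales` is rebuilt by hand from the printed table, and the object names are hypothetical.

```r
# Annual total sales per city, copied from the table above
total_sales <- matrix(
  c(1874, 1728, 2063, 2414, 2564,
    2011, 2009, 2361, 2854, 3123,
    2730, 2866, 3162, 3449, 3978,
    1481, 1275, 1349, 1455, 1404),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
    as.character(2010:2014)
  )
)

# Year-over-year change in percent (columns 2011 to 2014)
yoy_change <- (total_sales[, -1] - total_sales[, -5]) / total_sales[, -5] * 100
round(yoy_change, 1)

# Overall change from 2010 to 2014 in percent
overall_change <- (total_sales[, "2014"] / total_sales[, "2010"] - 1) * 100
round(overall_change, 1)

# Row sums match the five-year totals in the per-city table
rowSums(total_sales)
```

Percentage changes put the cities on a common footing: a drop of about 200 sales matters far more in Wichita Falls than it would in Tyler, where annual totals are roughly twice as large.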

9. Conclusion


Considering the above analysis, Tyler emerges as the city with the most dynamic market, both in terms of property sales and house prices. However, advertising in this area appears to be less effective, and sales trends follow a clear seasonal pattern, peaking in mid-spring and summer. Texas Realty Insights should maintain focus on Tyler, strategically intensifying listings during these high-activity periods.

Bryan-College Station stands out with the highest average advertising effectiveness and the highest median property prices. Despite a high variability in market responsiveness—likely influenced by seasonal factors or local events—this city presents strong potential. It represents a promising opportunity for strategic investment by Texas Realty Insights.

Beaumont shows increasing property sales over the years and a stable median price range that, while not as high as Bryan-College Station or Tyler, remains above that of Wichita Falls. This indicates emerging potential, and the city may offer an attractive investment opportunity for Texas Realty Insights.

On the other hand, Wichita Falls reflects a less dynamic and less prosperous market. While it shows relatively strong advertising effectiveness, it consistently ranks lower in terms of both property sales and median prices. Texas Realty Insights may consider prioritizing investment in the other three cities, where growth and profitability appear more promising.