library(readr)
realestate_texas <- read_csv("C:/Users/Utente/Downloads/realestate_texas.csv")
View(realestate_texas)
The imported dataset contains 240 observations across 8 columns. Below is an identification and description of the statistical variables in the dataset:
-city (Type: Qualitative Nominal): Represents the city associated with the observation. A categorical variable with no intrinsic order. Possible analysis: distribution of observations by city, comparisons between cities for quantitative variables (e.g., sales).
-year (Type: Quantitative, recorded in whole years and treated as Ordinal Qualitative in this case): Represents the year of the observation. Implies a temporal dimension. Possible analysis: assessing yearly trends in other variables.
-month (Type: Qualitative Nominal (cyclic) but coded in numbers): Represents the month of the observation, with 12 distinct values (1 to 12). Implies a seasonal component. Possible analysis: identifying seasonal patterns in sales, volume, or median price.
-sales (Type: Quantitative Discrete): Indicates the number of real estate sales. Possible analysis: distribution analysis, identifying trends over time, or city-based comparisons.
-volume (Type: Quantitative Continuous): Represents sales volume in millions. Possible analysis: evaluation of total market activity, mean or temporal variation analysis.
-median_price (Type: Quantitative Continuous): Indicates the median price of real estate sales. Possible analysis: distribution analysis, spatial and temporal variations, identifying correlations with other variables.
-listings (Type: Quantitative Discrete): Represents the number of active sales listings. Possible analysis: temporal trends, relationship with sales or available inventory.
-months_inventory (Type: Quantitative Continuous): Represents the number of months required to sell the current inventory at the current sales rate. Possible analysis: assessment of market balance (supply vs. demand).
The variables year and month imply a time dimension. Combining year and month: these two variables can be combined into a single temporal variable (e.g., a timestamp or year-month) for more sophisticated time-based analyses, such as time series modeling. Seasonal patterns and trends: seasonal patterns (monthly) and long-term trends (yearly) can be analyzed separately using tools like time series decomposition.
Categorical Variables (e.g., city): Distribution of observations across categories. Comparisons with summary statistics (mean, median).
Temporal Variables: Trends, seasonality, and time series analysis.
Quantitative Variables (sales, volume, median_price, listings, months_inventory): Correlations. Trend analysis (e.g., linear regressions over time). Spatial comparisons (between cities).
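As a minimal sketch of the year-month combination described above, the following builds a `Date` column from `year` and `month` and decomposes a monthly series. The data frame `toy` and its `sales` values are synthetic placeholders, not rows from `realestate_texas`, and fixing the day to the first of the month is an assumption made only so that a valid date exists.

```r
# Synthetic single-city panel: 5 years x 12 months (placeholder values)
set.seed(1)
toy <- expand.grid(month = 1:12, year = 2010:2014)
toy$sales <- 150 + 2 * seq_len(nrow(toy)) +   # upward trend
  20 * sin(2 * pi * toy$month / 12) +         # seasonal swing
  rnorm(nrow(toy), sd = 5)                    # noise

# Combine year and month into one Date (day fixed to 1 by assumption)
toy$date <- as.Date(sprintf("%d-%02d-01", toy$year, toy$month))

# Monthly time series starting in January 2010, then classical decomposition
sales_ts <- ts(toy$sales, start = c(2010, 1), frequency = 12)
parts <- decompose(sales_ts)

head(toy$date, 3)
plot(parts)   # panels: observed, trend, seasonal, random
```

With the real data, the series would have to be built for one city at a time, since each year-month pair appears once per city.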
attach(realestate_texas) #in order to use dataset variables, without $
#POSITION INDEXES
#Modal values
library(knitr)
quant_variables <- c("sales", "volume", "median_price", "listings", "months_inventory")
get_mode <- function(x) {
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x, uniqx)))]
}
modal_values <- lapply(quant_variables, function(var) get_mode(get(var)))
format_number <- function(x) {
ifelse(x %% 1 == 0, as.character(x), sprintf("%.2f", round(x, 2)))
}
formatted_modal_values <- sapply(modal_values, format_number)
mode_table <- data.frame(
Variable = c("Sales", "Volume", "Median Price", "Listings", "Months Inventory"),
Modal_value = formatted_modal_values
)
kable(mode_table, caption = "Quantitative variables modal values")
| Variable | Modal_value |
|---|---|
| Sales | 124 |
| Volume | 35.34 |
| Median Price | 130000 |
| Listings | 1581 |
| Months Inventory | 8.10 |
Position indexes are summary measures: each one condenses the distribution of a variable into a single value. The mode can be defined for every type of variable: it is the value (modality) with the highest absolute frequency. For continuous variables such as volume, where exact repeats are rare, the mode is of limited use.
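The `get_mode()` helper above returns a single value, so when several values tie for the highest frequency it reports only the first one it encounters. The sketch below illustrates this on a made-up vector and shows one possible variant, `get_all_modes()` (a hypothetical helper, not part of the original analysis), that returns every tied value.

```r
# Same helper as in the analysis: first value with the highest frequency
get_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Hypothetical variant: return all values tied for the highest frequency
get_all_modes <- function(x) {
  uniqx <- unique(x)
  counts <- tabulate(match(x, uniqx))
  uniqx[counts == max(counts)]
}

toy <- c(3, 1, 3, 2, 1)   # made-up data: 3 and 1 both appear twice
get_mode(toy)             # 3 (first of the tied values)
get_all_modes(toy)        # 3 1
```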
#Maximum and minimum
get_min_max <- function(x) {
return(c(min(x), max(x)))
}
min_max_values <- sapply(quant_variables, function(var) get_min_max(get(var)))
formatted_min_max_values <- apply(min_max_values, 2, function(x) sapply(x, format_number))
min_max_table <- data.frame(
Min = formatted_min_max_values[1, ],
Max = formatted_min_max_values[2, ]
)
kable(min_max_table, caption = "Maximum and Minimum Values for Each Variable")
| | Min | Max |
|---|---|---|
| sales | 79 | 423 |
| volume | 8.17 | 83.55 |
| median_price | 73800 | 180000 |
| listings | 743 | 3296 |
| months_inventory | 3.40 | 14.90 |
They represent the maximum value and the minimum value of a specific variable.
#Median values
get_median <- function(x) {
return(median(x))
}
median_values <- sapply(quant_variables, function(var) get_median(get(var)))
formatted_median_values <- sapply(median_values, format_number)
median_table <- data.frame(
Median = formatted_median_values
)
kable(median_table, caption = "Median Values for Each Variable")
| | Median |
|---|---|
| sales | 175.50 |
| volume | 27.06 |
| median_price | 134500 |
| listings | 1618.50 |
| months_inventory | 8.95 |
The median is a robust index: it is not affected by outliers within the distribution. It can be calculated for both quantitative and ordinal qualitative variables and corresponds to the 50th percentile. It is obtained by sorting the data in ascending or descending order and taking the value exactly in the middle of the series (with an even number of observations, the average of the two middle values).
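A small illustration of this robustness, using made-up values rather than the dataset: adding one extreme observation leaves the median unchanged while the mean shifts substantially. It also shows why, with 240 observations (an even number), medians such as 175.50 can fall halfway between two observed values.

```r
clean   <- c(1, 3, 5, 7)     # made-up values
outlier <- c(1, 3, 5, 700)   # same data with one extreme value

median(clean)     # 4 (average of the two middle values, 3 and 5)
median(outlier)   # 4 (unchanged by the extreme value)
mean(clean)       # 4
mean(outlier)     # 177.25
```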
#Quantiles and quartiles
get_quantiles_and_summary <- function(var) {
quantiles_10 <- quantile(var, seq(0, 1, 0.1))
quantiles_100 <- quantile(var, seq(0, 1, 0.01))
summary_stats <- summary(var)
return(list(quantiles_10 = quantiles_10, quantiles_100 = quantiles_100, summary = summary_stats))
}
results <- lapply(quant_variables, function(var) get_quantiles_and_summary(get(var)))
names(results) <- quant_variables
results
## $sales
## $sales$quantiles_10
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 79.0 101.9 120.6 135.0 155.0 175.5 197.0 228.5 271.0 302.1 423.0
##
## $sales$quantiles_100
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
## 79.00 80.39 87.68 89.17 91.56 93.00 94.34 96.73 98.24 101.00 101.90
## 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21%
## 104.29 107.68 108.07 110.00 111.00 113.00 114.00 115.02 117.82 120.60 123.00
## 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32%
## 124.00 124.00 125.00 127.00 128.14 130.00 130.92 132.62 135.00 137.27 140.48
## 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43%
## 143.00 144.52 147.65 149.00 149.43 150.00 151.21 155.00 159.00 160.00 161.54
## 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54%
## 163.00 164.00 165.94 167.66 169.72 173.11 175.50 176.89 180.28 181.67 182.00
## 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65%
## 186.00 186.84 189.00 191.86 196.00 197.00 198.00 200.36 202.00 205.92 208.35
## 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76%
## 211.48 213.65 219.04 224.91 228.50 233.69 238.00 239.94 244.00 247.00 253.00
## 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87%
## 254.03 258.84 262.00 271.00 272.59 277.94 282.00 283.52 287.30 289.00 292.93
## 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98%
## 295.32 298.00 302.10 314.47 321.40 326.54 333.98 347.30 357.00 367.64 372.32
## 99% 100%
## 396.54 423.00
##
## $sales$summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 79.0 127.0 175.5 192.3 247.0 423.0
##
##
## $volume
## $volume$quantiles_10
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 8.1660 13.0967 16.1206 19.0344 23.9976 27.0625 31.8436 36.9307 45.5920 53.7391
## 100%
## 83.5470
##
## $volume$quantiles_100
## 0% 1% 2% 3% 4% 5% 6% 7%
## 8.16600 9.11909 9.48346 9.67093 10.51088 11.15940 11.36852 11.81244
## 8% 9% 10% 11% 12% 13% 14% 15%
## 12.05580 12.46069 13.09670 13.45837 13.57160 13.82113 13.90286 14.00300
## 16% 17% 18% 19% 20% 21% 22% 23%
## 14.52280 15.04925 15.28384 15.51278 16.12060 16.16999 16.35950 16.90221
## 24% 25% 26% 27% 28% 29% 30% 31%
## 17.21956 17.65950 17.79860 18.09290 18.24536 18.67634 19.03440 19.68418
## 32% 33% 34% 35% 36% 37% 38% 39%
## 19.78016 20.33667 20.69896 21.17025 22.19824 22.78383 22.96914 23.55640
## 40% 41% 42% 43% 44% 45% 46% 47%
## 23.99760 24.22659 24.88318 24.94887 25.29348 25.44155 25.67790 25.89993
## 48% 49% 50% 51% 52% 53% 54% 55%
## 26.36344 26.82692 27.06250 27.33559 28.44688 28.63698 28.89610 29.44380
## 56% 57% 58% 59% 60% 61% 62% 63%
## 30.07740 30.31716 31.02136 31.65611 31.84360 32.12492 32.70494 33.50099
## 64% 65% 66% 67% 68% 69% 70% 71%
## 34.04908 34.71695 34.89208 35.23495 35.36620 36.23819 36.93070 37.74989
## 72% 73% 74% 75% 76% 77% 78% 79%
## 38.69228 39.37236 40.23348 40.89300 41.09704 42.42399 42.56602 43.81661
## 80% 81% 82% 83% 84% 85% 86% 87%
## 45.59200 46.60949 47.43920 48.45832 49.25504 49.97910 50.75608 51.57237
## 88% 89% 90% 91% 92% 93% 94% 95%
## 51.90736 52.75139 53.73910 55.67585 59.66368 60.74511 62.27676 63.83685
## 96% 97% 98% 99% 100%
## 66.95336 68.64489 70.54574 77.25487 83.54700
##
## $volume$summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.166 17.660 27.062 31.005 40.893 83.547
##
##
## $median_price
## $median_price$quantiles_10
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 73800 99960 110000 121650 130700 134500 141220 147960 152360 158850 180000
##
## $median_price$quantiles_100
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
## 73800 86095 87156 88736 90000 91180 92472 94730 96796 98418 99960
## 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21%
## 100700 102012 102314 102684 104565 105048 105578 108006 109223 110000 111290
## 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32%
## 113474 114291 116180 117300 118800 119624 120692 121131 121650 122827 123488
## 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43%
## 124287 126126 127715 128812 129286 129928 130000 130700 130800 131438 132331
## 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54%
## 132548 133065 133294 133899 134172 134311 134500 134967 135412 135901 136330
## 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65%
## 137870 138468 139246 139848 140501 141220 142358 142772 143828 144100 144670
## 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76%
## 144874 145917 146952 147582 147960 148369 148516 148994 149300 150050 150892
## 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87%
## 151506 151942 152100 152360 153100 153888 154696 155276 155515 155654 156386
## 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98%
## 156500 157097 158850 159547 161000 161454 163766 165340 167828 169583 172288
## 99% 100%
## 174813 180000
##
## $median_price$summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 73800 117300 134500 132665 150050 180000
##
##
## $listings
## $listings$quantiles_10
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 743.0 899.9 968.0 1208.7 1525.2 1618.5 1687.8 1796.0 2721.4 2946.7 3296.0
##
## $listings$quantiles_100
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
## 743.00 775.17 799.90 822.53 841.36 849.70 855.70 866.11 877.60 891.08
## 10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
## 899.90 904.29 907.00 914.00 918.68 932.70 938.48 941.00 950.10 961.82
## 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 968.00 973.00 994.74 1004.97 1018.16 1026.50 1030.14 1046.83 1126.00 1168.64
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
## 1208.70 1261.90 1328.72 1385.00 1439.78 1443.95 1462.72 1486.00 1496.92 1502.89
## 40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
## 1525.20 1533.99 1540.90 1559.01 1570.80 1576.10 1581.00 1586.66 1598.16 1604.99
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1618.50 1623.56 1638.80 1646.67 1655.12 1660.35 1668.52 1672.69 1676.24 1681.02
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
## 1687.80 1694.95 1708.00 1722.57 1729.76 1737.45 1749.74 1762.39 1769.08 1784.46
## 70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
## 1796.00 1817.04 1830.16 1833.47 1844.30 2056.00 2485.60 2609.48 2643.50 2690.30
## 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 2721.40 2733.72 2762.62 2789.11 2825.44 2852.45 2862.94 2875.93 2903.40 2932.78
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
## 2946.70 2966.72 2996.56 3041.27 3051.24 3094.35 3164.36 3214.26 3263.66 3270.05
## 100%
## 3296.00
##
## $listings$summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 743 1026 1618 1738 2056 3296
##
##
## $months_inventory
## $months_inventory$quantiles_10
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 3.40 6.69 7.50 7.97 8.40 8.95 9.40 10.53 11.40 12.21 14.90
##
## $months_inventory$quantiles_100
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
## 3.400 4.000 4.078 4.551 4.956 5.000 5.068 6.073 6.124 6.400 6.690
## 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21%
## 6.929 7.000 7.100 7.100 7.200 7.224 7.300 7.402 7.500 7.500 7.600
## 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32%
## 7.600 7.600 7.736 7.800 7.800 7.853 7.900 7.900 7.970 8.000 8.000
## 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43%
## 8.087 8.100 8.100 8.100 8.143 8.300 8.321 8.400 8.400 8.500 8.577
## 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54%
## 8.600 8.700 8.700 8.733 8.800 8.900 8.950 9.000 9.000 9.067 9.100
## 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65%
## 9.100 9.200 9.300 9.362 9.400 9.400 9.500 9.618 9.814 9.996 10.000
## 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76%
## 10.174 10.213 10.400 10.491 10.530 10.600 10.700 10.747 10.800 10.950 11.100
## 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87%
## 11.200 11.300 11.300 11.400 11.400 11.500 11.600 11.600 11.615 11.700 11.700
## 88% 89% 90% 91% 92% 93% 94% 95% 96% 97% 98%
## 11.932 12.000 12.210 12.300 12.400 12.600 12.898 13.015 13.344 13.483 13.844
## 99% 100%
## 14.500 14.900
##
## $months_inventory$summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.400 7.800 8.950 9.193 10.950 14.900
Quantiles are values that split a set of ordered data into several parts. Quartiles are the three values that divide the ordered series into four equal parts; they correspond to the 25th, 50th and 75th percentiles, respectively. A percentile Xp is the value that divides the distribution into two parts, such that p% of the values are less than Xp and (100-p)% are greater than Xp.
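The definition can be checked on a small made-up vector. This sketch relies on the default interpolation method of `quantile()` (type 7), and the vector `x` is purely illustrative.

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)   # made-up, 9 values

q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q                        # quartiles: 4, 7, 9

# Share of observations at or below the median
mean(x <= q[["50%"]])    # 5 of the 9 values
```

With few observations the shares on either side of a quantile are only approximately p% and (100-p)%, because the data are discrete.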
#Mean values
get_mean <- function(x) {
return(mean(x))
}
mean_values <- sapply(quant_variables, function(var) get_mean(get(var)))
formatted_mean_values <- sapply(mean_values, format_number)
mean_table <- data.frame(
Mean = formatted_mean_values
)
kable(mean_table, caption = "Mean Values for Each Variable")
| | Mean |
|---|---|
| sales | 192.29 |
| volume | 31.01 |
| median_price | 132665.42 |
| listings | 1738.02 |
| months_inventory | 9.19 |
It is that value which is obtained by summing all the values of a quantitative variable, and then dividing by the number of values.
#VARIABILITY INDEXES
#Values range
get_range <- function(x) {
return(range(x))
}
range_values <- sapply(quant_variables, function(var) get_range(get(var)))
formatted_range_values <- apply(range_values, 2, function(x) sapply(x, format_number))
range_table <- data.frame(
Min = formatted_range_values[1, ],
Max = formatted_range_values[2, ]
)
kable(range_table, caption = "Range (Min and Max) for Each Variable")
| | Min | Max |
|---|---|---|
| sales | 79 | 423 |
| volume | 8.17 | 83.55 |
| median_price | 73800 | 180000 |
| listings | 743 | 3296 |
| months_inventory | 3.40 | 14.90 |
Position indexes alone are not enough, since they give no information about variability, that is, how and how much the data are spread over their domain. The variability of a distribution measures the tendency of the units to take different values (or modalities) of the variable. Variability indexes summarize the diversity among units and thus make it possible to compare different distributions with each other. The range is the difference between the maximum and the minimum value of a variable; it is based only on the observed values, and in particular on the two most extreme ones. The table above reports the two endpoints rather than their difference.
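To express the range as a single number per variable, the maximum and minimum can simply be subtracted. The sketch below uses the rounded endpoints copied from the table above, so the results are approximate for the non-integer variables; the names `min_vals`, `max_vals` and `range_width` are illustrative.

```r
# Minimum and maximum values copied from the table above (rounded)
min_vals <- c(sales = 79, volume = 8.17, median_price = 73800,
              listings = 743, months_inventory = 3.40)
max_vals <- c(sales = 423, volume = 83.55, median_price = 180000,
              listings = 3296, months_inventory = 14.90)

# Range as a single number: maximum minus minimum
range_width <- max_vals - min_vals
range_width

# On a raw vector x the same quantity is diff(range(x))
```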
#InterQuartile range (IQR)
get_iqr <- function(x) {
return(IQR(x))
}
iqr_values <- sapply(quant_variables, function(var) get_iqr(get(var)))
formatted_iqr_values <- sapply(iqr_values, format_number)
iqr_table <- data.frame(
IQR = formatted_iqr_values
)
kable(iqr_table, caption = "Interquartile Range (IQR) for Each Variable")
| | IQR |
|---|---|
| sales | 120 |
| volume | 23.23 |
| median_price | 32750 |
| listings | 1029.50 |
| months_inventory | 3.15 |
It is the difference between the 3rd quartile and the 1st quartile and shows the spread of the central body of the data, i.e., the 50% of observations lying between the two quartiles. It is a robust index, since it is not affected by outliers in the tails of the distribution.
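As a worked check on `sales`, the quartiles reported by `summary()` reproduce the IQR in the table. The sketch also applies the conventional 1.5 * IQR fences for flagging outliers; that rule is an assumption added here for illustration and is not used in the original analysis.

```r
# Quartiles of sales as reported by summary() above
q1 <- 127
q3 <- 247
iqr <- q3 - q1                  # 120, matching the IQR table

# Conventional 1.5 * IQR fences (an assumption, not used in the original)
lower_fence <- q1 - 1.5 * iqr   # -53
upper_fence <- q3 + 1.5 * iqr   # 427

# Observed extremes of sales from the min/max table
c(min_inside = 79 >= lower_fence, max_inside = 423 <= upper_fence)
```

Both checks return `TRUE`, so under this rule of thumb no value of sales in the pooled data would be flagged as an outlier.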
#Variance and standard deviation
options(scipen = 999)
get_variance_sd <- function(x) {
return(c(variance = var(x), sd = sd(x)))
}
variance_sd_values <- sapply(quant_variables, function(var) get_variance_sd(get(var)))
formatted_variance_values <- sapply(variance_sd_values["variance", ], format_number)
formatted_sd_values <- sapply(variance_sd_values["sd", ], format_number)
variance_sd_table <- data.frame(
Variance = formatted_variance_values,
Standard_Deviation = formatted_sd_values
)
kable(variance_sd_table, caption = "Variance and Standard Deviation for Each Variable")
| | Variance | Standard_Deviation |
|---|---|---|
| sales | 6344.30 | 79.65 |
| volume | 277.27 | 16.65 |
| median_price | 513572983.09 | 22662.15 |
| listings | 566568.97 | 752.71 |
| months_inventory | 5.31 | 2.30 |
Variance shows the degree of dispersion around the mean value. Standard deviation represents the square root of the variance and gives us an index of variability in the same unit as the observed data (a measure more consistent with the source data). The latter is an absolute index and, therefore, is affected by both the unit of measurement of the variable and the order of magnitude of the data. If the mean values are very different, the standard deviation may not be a suitable measure for comparing different data.
#Coefficient of variation
CV <- function(x) {
return(sd(x) / mean(x) * 100)
}
cv_values <- sapply(quant_variables, function(var) CV(get(var)))
formatted_cv_values <- sapply(cv_values, format_number)
cv_table <- data.frame(
CV = formatted_cv_values
)
kable(cv_table, caption = "Coefficient of Variation (CV) for Each Variable")
| | CV |
|---|---|
| sales | 41.42 |
| volume | 53.71 |
| median_price | 17.08 |
| listings | 43.31 |
| months_inventory | 25.06 |
The coefficient of variation makes it possible to compare the variability of a sample relative to two different variables, or the variability of two samples relative to the same variable. This coefficient, however, is problematic if the variable has both positive and negative values, or in the case of a conventional zero in the measurement scale.
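In symbols, with $\bar{x}$ the mean and $s$ the standard deviation, the quantity computed by `CV()` is shown below, followed by a check on `sales` that uses the rounded mean and standard deviation from the tables above (so the result is approximate).

```latex
\mathrm{CV} = \frac{s}{\bar{x}} \times 100,
\qquad
\mathrm{CV}_{\text{sales}} \approx \frac{79.65}{192.29} \times 100 \approx 41.42
```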
#SHAPE INDEXES
#Skewness index
library(moments)
skew_values <- sapply(quant_variables, function(var) skewness(get(var)))
formatted_skew_values <- sapply(skew_values, format_number)
skewness_table <- data.frame(
Skewness = formatted_skew_values
)
kable(skewness_table, caption = "Skewness for Each Variable")
| | Skewness |
|---|---|
| sales | 0.72 |
| volume | 0.88 |
| median_price | -0.36 |
| listings | 0.65 |
| months_inventory | 0.04 |
Shape indexes describe skewness and kurtosis, features of the distribution that are based on the central moments of order three and four of a variable, respectively. A distribution is said to be symmetric if a vertical axis can be identified that cuts it into two mirror-image halves (the variable must be orderable).
#Kurtosis coefficient
kurt_values <- sapply(quant_variables, function(var) kurtosis(get(var)) - 3)
formatted_kurt_values <- sapply(kurt_values, format_number)
kurtosis_table <- data.frame(
Excess_Kurtosis = formatted_kurt_values
)
kable(kurtosis_table, caption = "Kurtosis for Each Variable")
| | Excess_Kurtosis |
|---|---|
| sales | -0.31 |
| volume | 0.18 |
| median_price | -0.62 |
| listings | -0.79 |
| months_inventory | -0.17 |
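Kurtosis measures how peaked or flat, and how heavy-tailed, the shape of a distribution is relative to the normal distribution. The table reports excess kurtosis (kurtosis minus 3), so negative values indicate a flatter, lighter-tailed shape than the normal and positive values a more peaked, heavier-tailed one.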
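To make the link with the third and fourth central moments concrete, the two shape indexes can be written out in base R. The functions below are illustrative re-implementations; they are assumed to match the population-moment formulas behind `moments::skewness()` and `moments::kurtosis()`, but the sketch does not call the package, and the input vectors are made up.

```r
# Moment-based shape indexes written out in base R
central_moment <- function(x, k) mean((x - mean(x))^k)

skew_manual <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
}

excess_kurt_manual <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

symmetric  <- c(1, 2, 3, 4, 5)    # made-up, mirror-image around 3
right_tail <- c(1, 1, 2, 2, 10)   # made-up, long right tail

skew_manual(symmetric)            # 0
skew_manual(right_tail) > 0       # TRUE
excess_kurt_manual(symmetric)     # -1.3
```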
for (var in quant_variables) {
plot(density(get(var)), main = paste(var, "density plot"), xlab = var)
}
These density plots give a graphical view of the distribution of each
quantitative variable.
#frequency distribution for qualitative variables
freq_distr <- function(x){
N <- length(x)
freq <- table(x)
rel_freq <- freq / N
cum_freq <- cumsum(freq)
cum_rel_freq <- cumsum(freq) / N
return(cbind(freq, rel_freq, cum_freq, cum_rel_freq))
}
city_freq <- freq_distr(city)
year_freq <- freq_distr(year)
month_freq <- freq_distr(month)
formatted_city_freq <- apply(city_freq, 2, function(col) sapply(col, format_number))
formatted_year_freq <- apply(year_freq, 2, function(col) sapply(col, format_number))
formatted_month_freq <- apply(month_freq, 2, function(col) sapply(col, format_number))
city_freq_table <- data.frame(formatted_city_freq)
year_freq_table <- data.frame(formatted_year_freq)
month_freq_table <- data.frame(formatted_month_freq)
kable(city_freq_table, caption = "Frequency Distribution for City")
| | freq | rel_freq | cum_freq | cum_rel_freq |
|---|---|---|---|---|
| Beaumont | 60 | 0.25 | 60 | 0.25 |
| Bryan-College Station | 60 | 0.25 | 120 | 0.50 |
| Tyler | 60 | 0.25 | 180 | 0.75 |
| Wichita Falls | 60 | 0.25 | 240 | 1 |
kable(year_freq_table, caption = "Frequency Distribution for Year")
| | freq | rel_freq | cum_freq | cum_rel_freq |
|---|---|---|---|---|
| 2010 | 48 | 0.20 | 48 | 0.20 |
| 2011 | 48 | 0.20 | 96 | 0.40 |
| 2012 | 48 | 0.20 | 144 | 0.60 |
| 2013 | 48 | 0.20 | 192 | 0.80 |
| 2014 | 48 | 0.20 | 240 | 1 |
kable(month_freq_table, caption = "Frequency Distribution for Month")
| freq | rel_freq | cum_freq | cum_rel_freq |
|---|---|---|---|
| 20 | 0.08 | 20 | 0.08 |
| 20 | 0.08 | 40 | 0.17 |
| 20 | 0.08 | 60 | 0.25 |
| 20 | 0.08 | 80 | 0.33 |
| 20 | 0.08 | 100 | 0.42 |
| 20 | 0.08 | 120 | 0.50 |
| 20 | 0.08 | 140 | 0.58 |
| 20 | 0.08 | 160 | 0.67 |
| 20 | 0.08 | 180 | 0.75 |
| 20 | 0.08 | 200 | 0.83 |
| 20 | 0.08 | 220 | 0.92 |
| 20 | 0.08 | 240 | 1 |
Here, we have the frequency tables for the qualitative variables (city, year and month).
“volume” is the variable with the highest relative variability: its coefficient of variation (CV) is 53.70536, or about 53.71%. A high CV means that the standard deviation is large relative to the mean, so individual observations tend to lie far from the mean. This can be due to several factors, such as the presence of outliers or strong skewness. The CV is used for this comparison rather than the standard deviation because the standard deviation is an absolute measure: it always carries the unit of measurement and order of magnitude of the variable on which it is calculated, so it is not always suitable for comparing variability across different variables. “volume” is also the variable with the highest skewness (positive, with a value of 0.884742). In a positively skewed distribution low values (or modes) are more frequent, the right tail is longer, and typically mean > median > mode. The distribution therefore departs from the symmetric shape of the normal distribution, in which a vertical axis bisects the distribution into two mirror-image halves.
realestate_texas$cl_sales <- cut(
realestate_texas$sales,
breaks = c(79, 100, 150, 200, 250, 300, 350, 400, 423),
right = TRUE,
include.lowest = TRUE
)
View(realestate_texas)
attach(realestate_texas)
freq_distr_cl_sales <- freq_distr(cl_sales)
View(freq_distr_cl_sales)
freq_distr_cl_sales <- as.data.frame(freq_distr_cl_sales)
A frequency distribution was created starting from the quantitative variable “sales”.
create_barplot <- function(data, freq_col, ylab, ylim, main) {
barplot_heights <- barplot(data[[freq_col]],
main = main,
xlab = "Sales Class",
ylab = ylab,
ylim = ylim,
names.arg = rownames(data),
cex.names = 0.7)
text(x = barplot_heights,
y = data[[freq_col]],
labels = round(data[[freq_col]], 3),
cex = 0.8,
col = "black",
pos = 3)
}
par(mfrow = c(2, 2))
create_barplot(freq_distr_cl_sales, "freq", "Frequency", c(0, 90), "Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "rel_freq", "Relative Frequency", c(0, 0.35), "Relative Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "cum_freq", "Cumulative Frequency", c(0, 300), "Cumulative Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "cum_rel_freq", "Cumulative Relative Frequency", c(0, 1.2), "Cumulative Relative Frequency Class Distribution")
par(mfrow = c(1, 1))
The class (100, 150] has the highest absolute frequency and, consequently, the highest relative frequency.
#Gini coefficient
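gini_index <- function(x) {
  ni <- table(x)                        # absolute frequencies
  fi <- ni / length(x)                  # relative frequencies
  fi2 <- fi^2
  J <- length(table(x))                 # number of classes
  gini <- 1 - sum(fi2)                  # Gini heterogeneity index
  normalized_gini <- gini / ((J - 1) / J)
  return(normalized_gini)
}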
gini_index_cl_sales <- gini_index(cl_sales)
gini_index_cl_sales_rounded <- round(gini_index_cl_sales, 2)
gini_index_cl_sales_table <- data.frame(Gini_Index = gini_index_cl_sales_rounded)
kable(gini_index_cl_sales_table, col.names = c("cl_sales Gini index"))
| cl_sales Gini index |
|---|
| 0.92 |
The normalized Gini heterogeneity index for the variable “cl_sales” is about 0.92, close to its maximum of 1. This indicates almost maximum heterogeneity: the statistical units are spread nearly evenly across the classes of the variable (close to equidistribution).
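The two extremes of the normalized index can be seen on made-up data. The sketch restates the function so that it is self-contained; the factors `balanced` and `concentrated` are illustrative and do not come from the dataset.

```r
# Same index as in the analysis, restated so the example is self-contained
gini_index <- function(x) {
  fi <- table(x) / length(x)        # relative frequencies
  J <- length(fi)                   # number of categories
  (1 - sum(fi^2)) / ((J - 1) / J)   # normalized Gini heterogeneity
}

# Made-up factors sharing the same four levels
lv <- c("a", "b", "c", "d")
balanced     <- factor(rep(lv, each = 60), levels = lv)
concentrated <- factor(rep("a", 240), levels = lv)

gini_index(balanced)       # 1: units spread evenly over the categories
gini_index(concentrated)   # 0: all units in one category
```

The inputs are factors on purpose: `table()` then counts empty levels too, so `J` stays at 4. On a plain character vector with a single distinct value `J` would be 1 and the normalization would divide zero by zero.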
library(ggplot2)
total_obs <- nrow(realestate_texas)
num_beaumont <- sum(city == "Beaumont")
prob_beaumont <- num_beaumont / total_obs
num_july <- sum(month == "7")
prob_july <- num_july / total_obs
year_month <- paste(year, month, sep = "-")
num_dec_2012 <- sum(year_month == "2012-12")
prob_dec_2012 <- num_dec_2012 / total_obs
prob_data <- data.frame(
Category = c("Beaumont", "July", "December 2012"),
Probability = c(prob_beaumont, prob_july, prob_dec_2012)
)
combined_plot <- ggplot(prob_data, aes(x = Category, y = Probability)) +
geom_bar(stat = "identity", col = "black", fill = "gray", width = 0.6) +
geom_text(aes(y = Probability, label = round(Probability, 3)),
color = "black", vjust = -0.5, size = 4) +
labs(x = "Category", y = "Probability") +
theme(aspect.ratio = 1, plot.margin = margin(5, 30, 5, 5))
combined_plot
The probability that a randomly drawn row of this dataset refers to the
city “Beaumont” is 0.25, or 25% of the cases; the probability that it
refers to the month “July” is about 0.083, or about 8.3% of the cases;
and the probability that it refers to the month and year “December 2012”
is about 0.017, or about 1.7% of the cases.
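These probabilities follow directly from the balanced structure shown in the frequency tables (4 cities, 5 years, 12 months). A compact way to reproduce them, sketched here on a reconstructed grid rather than on the real file, is to take the mean of a logical condition. The object `panel` is an illustrative stand-in that contains only the three classification columns.

```r
# Balanced panel with the same structure as the frequency tables above:
# 4 cities x 5 years x 12 months = 240 rows (no other columns needed)
panel <- expand.grid(
  month = 1:12,
  year = 2010:2014,
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the share of TRUE values
p_beaumont <- mean(panel$city == "Beaumont")                 # 60 / 240
p_july     <- mean(panel$month == 7)                         # 20 / 240
p_dec_2012 <- mean(panel$year == 2012 & panel$month == 12)   #  4 / 240

round(c(Beaumont = p_beaumont, July = p_july, Dec2012 = p_dec_2012), 3)
```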
realestate_texas$mean_price <- volume/sales * 1000000
“mean_price” represents the average sales price per unit. It was calculated as the ratio of the total sales volume (volume) to the total number of sales (sales), multiplied by 1,000,000 to express the result in dollars (volume is expressed in millions of dollars). Basically, this value gives you the average sales price of a property for each completed transaction.
realestate_texas$sales_efficiency <- round(realestate_texas$sales / realestate_texas$listings, 3)
View(realestate_texas)
“sales_efficiency” measures the efficiency with which properties are sold. It is calculated as the ratio of the number of sales (sales) to the number of active listings (listings), so it indicates how many sales are made per active listing. A high value means that a large share of the listed properties is sold, i.e. listings are effective in bringing about a transaction; this could be the case during periods or in cities with high demand for real estate, where listings find buyers quickly. A low value means that there are many listings without correspondingly high sales, which could indicate that properties remain unsold for longer periods because of low demand, prices that are too high, or other issues related to the properties for sale. Finally, we see the new variables in the dataset.
library(dplyr)
library(ggplot2)
library(tidyr)
summary_data_city_sales <- realestate_texas %>%
group_by(city) %>%
summarise(mean = mean(sales),
st_dev = sd(sales))
summary_data_city_sales_plot <- ggplot(summary_data_city_sales) +
geom_bar(aes(x = city, y = mean, fill = "Mean"),
stat = "identity", position = "dodge") +
geom_errorbar(aes(x = city, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
scale_fill_manual(values = c("Mean" = "#00BFFF")) +
scale_color_manual(values = c("Standard Deviation" = "red1")) +
labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by City",
x = "City",
y = "Value") +
theme_minimal() +
theme(legend.position = "none")
summary_data_city_sales_plot
“Sales” shows its highest mean in “Tyler” and its highest variability in
“Bryan-College Station”.
summary_data_year_sales <- realestate_texas %>%
group_by(year) %>%
summarise(mean = mean(sales),
st_dev = sd(sales))
summary_data_year_sales_plot <- ggplot(summary_data_year_sales) +
geom_bar(aes(x = year, y = mean, fill = "Mean"),
stat = "identity", position = "dodge") +
geom_errorbar(aes(x = year, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
scale_fill_manual(values = c("Mean" = "#00BFFF")) +
scale_color_manual(values = c("Standard Deviation" = "red1")) +
labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by Year",
x = "Year",
y = "Value") +
theme_minimal() +
theme(legend.position = "none")
summary_data_year_sales_plot
“Sales” shows its highest mean and its highest variability in “2014”.
summary_data_month_sales <- realestate_texas %>%
group_by(month) %>%
summarise(mean = mean(sales),
st_dev = sd(sales)) %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
summary_data_month_sales_plot <- ggplot(summary_data_month_sales) +
geom_bar(aes(x = month, y = mean, fill = "Mean"),
stat = "identity", position = "dodge") +
geom_errorbar(aes(x = month, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
scale_fill_manual(values = c("Mean" = "#00BFFF")) +
scale_color_manual(values = c("Standard Deviation" = "red1")) +
labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by Month",
x = "Month",
y = "Value") +
theme_minimal() +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
summary_data_month_sales_plot
“Sales” shows its highest mean and its highest variability in “June”.
summary_data_city_year_sales <- realestate_texas %>%
group_by(city, year) %>%
summarise(mean = mean(sales),
st_dev = sd(sales),
.groups = "keep")
summary_data_city_year_sales_plot <- ggplot(summary_data_city_year_sales) +
geom_bar(aes(x = interaction(city, year), y = mean, fill = "Mean"),
stat = "identity", position = "dodge") +
geom_errorbar(aes(x = interaction(city, year),
ymin = mean - st_dev,
ymax = mean + st_dev,
color = "Standard Deviation"),
width = 0.2, linewidth = 1.5) +
scale_fill_manual(values = c("Mean" = "#00BFFF")) +
scale_color_manual(values = c("Standard Deviation" = "red1")) +
labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by City and Year",
x = "City and Year",
y = "Value") +
theme_minimal() +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
summary_data_city_year_sales_plot
“Sales” shows its highest mean in “Tyler 2014” and its highest variability in
“Bryan-College Station 2013”.
library(ggplot2)
library(lubridate)
library(dplyr)
# Median Prices by City box-plot
median_prices_boxplot <- ggplot(data = realestate_texas) +
geom_boxplot(aes(x = city, y = median_price),
fill = "gray80",
color = "black",
outlier.colour = "red",
outlier.size = 2,
width = 0.7) +
labs(title = "Distribution of Median Prices by City",
x = "City",
y = "Median Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")
median_prices_boxplot
Bryan-College Station has a wider range of variation than the other
cities, together with a higher Q1, median and Q3. Beaumont and Tyler
have similar ranges of variation, although Tyler has a higher median.
Wichita Falls has the smallest range of variation. Bryan-College
Station, Beaumont and Wichita Falls show outliers, whereas Tyler shows
none.
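The quartiles and outliers read off the box plot can also be tabulated. The sketch below uses the 1.5 × IQR rule, which is the default that `geom_boxplot()` applies when flagging outliers. The data frame `toy_prices` and its values are hypothetical; the real analysis would use `realestate_texas`.

```r
library(dplyr)

# Hypothetical toy data (made-up values, two fictitious cities)
toy_prices <- data.frame(
  city = rep(c("A", "B"), each = 6),
  median_price = c(100, 102, 104, 106, 108, 200,  # city A: 200 is far from the rest
                   100, 105, 110, 115, 120, 125)  # city B: evenly spread
)

# Quartiles, IQR and number of points outside the 1.5 * IQR fences, per city
outlier_summary <- toy_prices %>%
  group_by(city) %>%
  summarise(
    q1  = quantile(median_price, 0.25, names = FALSE),
    med = median(median_price),
    q3  = quantile(median_price, 0.75, names = FALSE),
    iqr = q3 - q1,
    n_outliers = sum(median_price < q1 - 1.5 * iqr |
                     median_price > q3 + 1.5 * iqr),
    .groups = "drop"
  )
```

A table like this makes statements such as "Tyler shows no outliers" directly verifiable, since `n_outliers` would be 0 for that city.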
Sys.setlocale("LC_TIME", "en_US.UTF-8")
## [1] "en_US.UTF-8"
# Average Sales Volume / Number of Sales by Month and City dodged barplots
dodged_barplot_1 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_bar(aes(x = month, y = mean_value), stat = "identity", position = "dodge", fill = "gray") +
facet_wrap(~ city, scales = "free_y") +
labs(title = title,
x = "Month", y = y_label) +
scale_x_discrete(labels = month.name) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "none"
)
}
dodged_barplot_volume_1 <- dodged_barplot_1("volume", "Volume", "Average Sales Volume by Month and City")
dodged_barplot_sales_1 <- dodged_barplot_1("sales", "Sales", "Average Number of Sales by Month and City")
dodged_barplot_volume_1
dodged_barplot_sales_1
These graphics consist of 4 bar charts, one for each of the 4 cities in
the dataset, used to assess how sales volumes/number of sales vary by
month within each city (each bar is the average of that month's values
across all the years in the dataset):
- In Beaumont, sales volumes/number of sales tend to be higher and more
concentrated in the spring and summer periods;
- In Bryan-College Station, the situation is very similar, with an even
greater concentration of sales volumes/number of sales in the spring and
summer periods than in Beaumont;
- In Tyler, the situation is similar to the previous ones, with a
significant increase in sales volumes/number of sales in June;
- In Wichita Falls, sales volumes/number of sales again rise around the
spring/summer periods but, unlike in the other cities, the trend does
not appear to be increasing: it appears stable over time, with a
conspicuous concentration in the spring period.
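The "concentration in the spring/summer periods" can be quantified as the share of sales that falls in those months. The sketch below rests on two assumptions that are not in the original analysis: spring/summer is taken to mean March through August, and `toy_monthly` is a made-up data frame for a single fictitious city.

```r
library(dplyr)

# Hypothetical toy data: one city, one value per calendar month (made-up numbers)
toy_monthly <- data.frame(
  city  = "A",
  month = 1:12,
  sales = c(50, 50, 100, 100, 100, 150, 150, 100, 50, 50, 50, 50)
)

# Assumption: "spring/summer" means March through August
season_share <- toy_monthly %>%
  mutate(season = if_else(month %in% 3:8, "spring_summer", "fall_winter")) %>%
  group_by(city, season) %>%
  summarise(total = sum(sales), .groups = "drop_last") %>%
  mutate(share = total / sum(total)) %>%  # share within each city
  ungroup()
```

Comparing the `spring_summer` share across cities would give a single number per city to support the comparisons made in the list above.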
dodged_barplot_2 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_bar(aes(x = month, y = mean_value, fill = city), stat = "identity", position = "dodge") +
labs(title = title,
x = "Month", y = y_label) +
scale_x_discrete(labels = month.name) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "right"
)
}
dodged_barplot_volume_2 <- dodged_barplot_2("volume", "Volume", "Average Sales Volume by Month and City")
dodged_barplot_sales_2 <- dodged_barplot_2("sales", "Sales", "Average Number of Sales by Month and City")
dodged_barplot_volume_2
dodged_barplot_sales_2
Unlike before, here we can compare the distributions of sales
volumes/number of sales across cities and how they evolve over time
(each month is calculated as in the previous dodged barplots): Tyler and
Bryan-College Station turn out to have the highest and most strongly
increasing sales volumes/number of sales over time, with the same
concentration of higher values during the spring/summer periods.
# Average Sales Volume / Number of Sales by Month and City time series
time_series_plot_1 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_line(aes(x = month, y = mean_value, group = city),
linewidth = 1, alpha = 0.8, linetype = "solid", color = "black") +
facet_wrap(~city, scales = "free_y") +
labs(
title = title,
x = "Month", y = y_label
) +
scale_x_discrete(labels = month.name) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "none"
)
}
time_series_plot_sales_1 <- time_series_plot_1("sales", "Sales", "Average Number of Sales Trend by City and Month")
time_series_plot_volume_1 <- time_series_plot_1("volume", "Volume", "Average Sales Volume Trend by City and Month")
time_series_plot_sales_1
time_series_plot_volume_1
Again, the data are broken down by the 4 cities to assess the trend in
sales volumes/number of sales over time, but this time line charts are
used (same calculation for every month):
- In Beaumont, there is an increasing trend over time for sales
volumes/number of sales, with the usual concentration in the
spring/summer periods; there is also a very conspicuous increase in the
spring period, which is then consolidated in the following months;
- In Bryan-College Station, the swings between the spring/summer and
fall/winter periods are very pronounced for sales volumes/number of
sales; the trend appears stable despite these conspicuous peaks and
troughs in the number of sales;
- In Tyler, the trend and concentration of values are very similar to
those in Beaumont;
- In Wichita Falls, values are again highly concentrated in the
spring/summer periods.
time_series_plot_2 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_line(aes(x = month, y = mean_value, color = city, group = city),
linewidth = 1, alpha = 0.8, linetype = "solid") +
labs(
title = title,
x = "Month", y = y_label,
color = "City"
) +
scale_x_discrete(labels = month.name) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "top"
)
}
time_series_plot_sales_2 <- time_series_plot_2("sales", "Sales", "Average Number of Sales Trend by City and Month")
time_series_plot_volume_2 <- time_series_plot_2("volume", "Volume", "Average Sales Volume Trend by City and Month")
time_series_plot_sales_2
time_series_plot_volume_2
Here, we compare the sales volumes/number of sales of the different
cities over time (same calculation for every month): Bryan-College
Station and Tyler show very similar trends and concentrations of values;
in particular, they share a steadily increasing trend over time in the
number of sales. Note, however, that Tyler shows a more “jagged”
distribution of data around the spring/summer periods than Bryan-College
Station.
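Because the charts above average each calendar month across years, they describe the seasonal profile. To look at the evolution from one year to the next, `year` and `month` can be combined into a single date, as suggested in the variable description at the start of the document. A minimal sketch, assuming made-up data (`toy_ts` and its values are hypothetical):

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with the same year/month layout as realestate_texas
toy_ts <- data.frame(
  city  = "A",
  year  = rep(c(2010, 2011), each = 3),
  month = rep(1:3, times = 2),
  sales = c(80, 90, 100, 95, 105, 120)
)

# Combine year and month into a Date (first day of each month)
toy_ts <- toy_ts %>%
  mutate(date = make_date(year, month, 1)) %>%
  arrange(city, date)

# Average sales per city and year, to separate the yearly trend
# from the within-year seasonal pattern
yearly_trend <- toy_ts %>%
  group_by(city, year) %>%
  summarise(mean_sales = mean(sales), .groups = "drop")
```

The `date` column could then be used as the x aesthetic in `geom_line()` to draw the full series, while `yearly_trend` summarizes whether the level rises between years.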
# Total Sales Volume / Number of Sales by Month and City stacked barplots
stacked_barplot_1 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(sum_value = sum(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_bar(aes(x = month, y = sum_value, fill = city), stat = "identity", position = "stack") +
labs(title = title,
x = "Month", y = y_label) +
scale_x_discrete(labels = month.name) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "top"
)
}
stacked_barplot_volume_1 <- stacked_barplot_1("volume", "Total Volume", "Total Sales Volume by Month and City")
stacked_barplot_sales_1 <- stacked_barplot_1("sales", "Total Sales", "Total Number of Sales by Month and City")
stacked_barplot_volume_1
stacked_barplot_sales_1
Here, on the other hand, we have stacked bar charts showing sales
volumes/number of sales over time, with each bar segment representing
the sum of sales volumes/number of sales for one city in a given month
(each monthly value is the sum of the values for that month across all
the years in the dataset, so the full bar is the total of these sums
over all cities): it can be seen that Tyler and Bryan-College Station
have the largest sales volumes/number of sales compared to the other
cities, and both show increasing trends over time, with a consistently
strong concentration of increases during the spring/summer periods.
stacked_barplot_2 <- function(variable, y_label, title) {
monthly_data <- realestate_texas %>%
group_by(city, month) %>%
summarise(sum_value = sum(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
mutate(month = factor(month, levels = 1:12, labels = month.name))
ggplot(data = monthly_data) +
geom_bar(aes(x = month, y = sum_value, fill = city), stat = "identity", position = "fill") +
labs(title = title,
x = "Month", y = y_label) +
scale_x_discrete(labels = month.name) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "top"
)
}
stacked_barplot_volume_2 <- stacked_barplot_2("volume", "Normalized Total Sales Volume", "Normalized Total Sales Volume by Month and City")
stacked_barplot_sales_2 <- stacked_barplot_2("sales", "Normalized Total Sales", "Normalized Total Number of Sales by Month and City")
stacked_barplot_volume_2
stacked_barplot_sales_2
Here, we have normalized stacked bar charts showing sales volumes/number
of sales over time, where each bar segment represents one city's share
of the total sales volumes/number of sales in a given month (same
calculation for every month but, unlike before, each bar sums to 1, or
100%, because the evaluation is relative rather than absolute): the same
considerations apply as for the previously discussed stacked bar charts.
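The normalization that `position = "fill"` performs can be reproduced by hand: within each month, every city's total is divided by the sum over all cities. The sketch below uses made-up totals (`toy_totals` and its values are hypothetical) to show that the resulting shares add up to 1 per month.

```r
library(dplyr)

# Hypothetical monthly totals for two fictitious cities (made-up values)
toy_totals <- data.frame(
  city      = rep(c("A", "B"), times = 2),
  month     = rep(1:2, each = 2),
  sum_value = c(30, 70, 50, 50)
)

# Share of each city within its month
shares <- toy_totals %>%
  group_by(month) %>%
  mutate(share = sum_value / sum(sum_value)) %>%
  ungroup()

# Check that the shares add up to 1 within each month
share_totals <- shares %>%
  group_by(month) %>%
  summarise(total_share = sum(share), .groups = "drop")
```

Having the shares as a column also makes it possible to print them as percentages alongside the chart.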
# Average Sales Efficiency by Month and City time series
time_series_plot_sales_efficiency_1 <- time_series_plot_1("sales_efficiency", "Sales Efficiency", "Average Sales Efficiency Trend by City and Month")
time_series_plot_sales_efficiency_1
Here, a chart has been created showing 4 line charts, each representing
the trend over time of the new variable ‘sales_efficiency’
(sales/listings), one for each of the 4 cities (months are calculated as
in the previous time series):
- In Wichita Falls, it can be seen that active listings turned into
actual sales more consistently during the spring period. The highest
concentrations of efficiency are typically observed during the
spring/summer periods;
- In Beaumont, there is a continuously increasing trend, always keeping
in mind the same concentration observed in Wichita Falls;
- In Bryan-College Station, it starts with relatively low
sales_efficiency values, then reaches significant peaks in the summer
period;
- In Tyler, the trend is very similar to the one observed in
Beaumont.
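For reference, the ratio behind ‘sales_efficiency’ (sales/listings) can be sketched as follows. This is only an illustration on made-up data, assuming the column is derived with a plain `mutate()`; `toy_listings` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (made-up values, two fictitious cities)
toy_listings <- data.frame(
  city     = c("A", "A", "B", "B"),
  month    = c(1, 2, 1, 2),
  sales    = c(100, 150, 80, 120),
  listings = c(1000, 1000, 1600, 1200)
)

# Sales efficiency: fraction of active listings that turned into sales
toy_listings <- toy_listings %>%
  mutate(sales_efficiency = sales / listings)

# Average efficiency per city
efficiency_by_city <- toy_listings %>%
  group_by(city) %>%
  summarise(mean_efficiency = mean(sales_efficiency), .groups = "drop")
```

A value of 0.10 means that sales in that month equal 10% of the active listings, so higher values indicate that listings convert into sales more readily.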
time_series_plot_sales_efficiency_2 <- time_series_plot_2("sales_efficiency", "Sales Efficiency", "Average Sales Efficiency Trend by City and Month")
time_series_plot_sales_efficiency_2
Here, a single chart shows 4 lines, each representing the trend over
time of the new variable ‘sales_efficiency’ (sales/listings) for one
city (months are calculated as in the previous time series). In this
case, the 4 line charts are combined into the same chart rather than
being separate as before. The same observations made earlier apply here,
but it can also be added that Bryan-College Station shows higher,
steadily increasing, and more consistent sales_efficiency values
compared to the other cities, reaching a very pronounced peak in July.
volume_boxplots <- ggplot(data = realestate_texas) +
geom_boxplot(aes(x = city, y = volume),
fill = "gray80",
color = "black",
outlier.colour = "red",
outlier.size = 2,
width = 0.7) +
labs(title = "Distribution of Total Sales Volume by City for Each Year",
x = "City",
y = "Volume") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 4),
legend.position = "none"
) +
facet_wrap(~ year, scales = "free_y")
volume_boxplots
Several box plots have been created, one panel per year; for each year,
the medians, Q1, Q3, and ranges of sales volumes are compared across
cities. Generally, for each year, Tyler and Bryan-College Station show
fairly similar distributions. However, in 2012 and 2013, although
Bryan-College Station has a lower median than Tyler, its Q3 is only
slightly lower (2012) or even slightly higher (2013) than Tyler's.
It is worth noting that Beaumont shows several outliers in 2012.
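Comparisons of medians and third quartiles such as these are easier to verify from a table than from small faceted panels. A minimal sketch on made-up data (`toy_volume` and its values are hypothetical) that computes the quartiles per year and city:

```r
library(dplyr)

# Hypothetical toy data: two fictitious cities in a single year (made-up values)
toy_volume <- data.frame(
  city   = rep(c("A", "B"), each = 5),
  year   = 2012,
  volume = c(10, 20, 30, 40, 50,
             20, 25, 30, 35, 60)
)

# Quartiles of volume for each year and city
quartiles_by_city_year <- toy_volume %>%
  group_by(year, city) %>%
  summarise(
    q1  = quantile(volume, 0.25, names = FALSE),
    med = median(volume),
    q3  = quantile(volume, 0.75, names = FALSE),
    .groups = "drop"
  )
```

In the toy data the two cities share the same median but differ in Q3, the same kind of situation described for Tyler and Bryan-College Station.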
Based on the charts above, Bryan-College Station and Tyler have the highest value distributions, especially for sales volume and number of sales. Nevertheless, all cities show a very high concentration of values during the spring/summer periods, and this pattern repeats every year: real estate sales tend, as a rule, to rise during the warmer months, when more property viewings (for example, those organized by real estate agencies) also take place. Thus, based on the insights from the analyses done so far, the effectiveness of real estate listings could be increased by publishing more listings during the “hottest” periods (spring/summer) and by diversifying the channels, including digital channels, where the listings are published, in order to attract more customers. These conclusions are particularly relevant for cities like Beaumont and Wichita Falls, which have significant room for improvement, especially in terms of sales volumes and number of sales.