Texas Real Estate Analysis

1. Preliminary Data Analysis

#import dataset
setwd("D:/Users/Anton/Desktop/ProfessionAI/02-Statistica descrittiva/reporitory/progetto finale/Progetto")
data=read.csv("Real Estate Texas.csv")
attach(data)

#show first dataset row
head(data)

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1

The dataset contains the following variables:

city: reference city (Qualitative variable on a nominal scale)
year: reference year (Quantitative discrete variable)
month: reference month (Qualitative variable on an ordinal scale, in this case, the month is coded with numbers from 1 January to 12 December)
sales: total number of sales (Quantitative discrete variable)
volume: total sales value in millions of dollars (Quantitative continuous variable)
median price: median sales price expressed in dollars (Quantitative continuous variable)
listings: total number of listings (Quantitative discrete variable)
months inventory: amount of time needed to sell all current listings at the current sales pace, expressed in months (Quantitative continuous variable)

1.1 Data synthesis

tot_observation=length(data$city)
sprintf("tot_obs=%s.",tot_observation)

## [1] "tot_obs=240."

summary(data)

##      city                year          month           sales      
##  Length:240         Min.   :2010   Min.   : 1.00   Min.   : 79.0  
##  Class :character   1st Qu.:2011   1st Qu.: 3.75   1st Qu.:127.0  
##  Mode  :character   Median :2012   Median : 6.50   Median :175.5  
##                     Mean   :2012   Mean   : 6.50   Mean   :192.3  
##                     3rd Qu.:2013   3rd Qu.: 9.25   3rd Qu.:247.0  
##                     Max.   :2014   Max.   :12.00   Max.   :423.0  
##      volume        median_price       listings    months_inventory
##  Min.   : 8.166   Min.   : 73800   Min.   : 743   Min.   : 3.400  
##  1st Qu.:17.660   1st Qu.:117300   1st Qu.:1026   1st Qu.: 7.800  
##  Median :27.062   Median :134500   Median :1618   Median : 8.950  
##  Mean   :31.005   Mean   :132665   Mean   :1738   Mean   : 9.193  
##  3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:2056   3rd Qu.:10.950  
##  Max.   :83.547   Max.   :180000   Max.   :3296   Max.   :14.900

The dataset consists of 240 observations. The data refers to sales in the period 2010-2014. Each row of the dataset represents the sales for a given city, in the reference year and month.

For a better data comprehension, it is possible to identify the unique value of variable city.

matrix(unique(data$city,length((data$city)),1))

##      [,1]                   
## [1,] "Beaumont"             
## [2,] "Bryan-College Station"
## [3,] "Tyler"                
## [4,] "Wichita Falls"

The data refers to 4 city:

Beaumont
Bryan-College Station
Tyler
Wichita Falls

2. Positional, Variability and Shape Index

# Define the function to calculate variability index
variability_index = function(df) {
  # If df is a vector, convert it to a dataframe
  if (is.vector(df)) {
    df = data.frame(col = df)
  }
  
  # Initialize an empty dataframe to store the results
  variability = data.frame(
    Index = c("Range", "IQR", "SD", "CV"),
    stringsAsFactors = FALSE
  )
  
  # Loop through each column in the dataframe
  for (col_name in names(df)) {
    col_data = df[[col_name]]
    
    # Check if the column is numeric
    if (is.numeric(col_data)) {
      # Calculate the variability index
      range_val = round(diff(range(col_data)),2)
      iqr_val = round(IQR(col_data),2)
      sd_val = round(sd(col_data),2)
      cv_val = round(sd_val / mean(col_data),2)
      
      # Create a new column with the results
      new_column = data.frame(
        col_name = c(range_val, iqr_val, sd_val, cv_val),
        stringsAsFactors = FALSE
      )
      
      # Set the column name
      colnames(new_column) = col_name
      # Combine the new column with the variability dataframe
      variability = cbind(variability, new_column)
    }
  }
  
  return(variability)
}

# Load necessary library
library(moments)

# Define the function to calculate shape index
shape_index = function(df) {
  # If df is a vector, convert it to a dataframe
  if (is.vector(df)) {
    df = data.frame(col = df)
  }
  
  # Initialize an empty dataframe to store the results
  shape_ind = data.frame(
    Index = c("Skewness", "Kurtosis"),
    stringsAsFactors = FALSE
  )
  
  # Loop through each column in the dataframe
  for (col_name in names(df)) {
    col_data = df[[col_name]]
    
    # Check if the column is numeric
    if (is.numeric(col_data)) {
      # Calculate the shape index
      skewness_val = round(skewness(col_data), 2)
      kurtosis_val = round(kurtosis(col_data)-3, 2)
      
      # Create a new column with the results
      new_column = data.frame(
        col_name = c(skewness_val, kurtosis_val),
        stringsAsFactors = FALSE
      )
      
      # Set the column name
      colnames(new_column) = col_name
      # Combine the new column with the shape_index dataframe
      shape_ind = cbind(shape_ind, new_column)
    }
  }
  
  return(shape_ind)
}

summary(data[,c(4,5,6,8)])

##      sales           volume        median_price    months_inventory
##  Min.   : 79.0   Min.   : 8.166   Min.   : 73800   Min.   : 3.400  
##  1st Qu.:127.0   1st Qu.:17.660   1st Qu.:117300   1st Qu.: 7.800  
##  Median :175.5   Median :27.062   Median :134500   Median : 8.950  
##  Mean   :192.3   Mean   :31.005   Mean   :132665   Mean   : 9.193  
##  3rd Qu.:247.0   3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:10.950  
##  Max.   :423.0   Max.   :83.547   Max.   :180000   Max.   :14.900

variability_index(data[,c(4,5,6,8)])

##   Index  sales volume median_price months_inventory
## 1 Range 344.00  75.38    106200.00            11.50
## 2   IQR 120.00  23.23     32750.00             3.15
## 3    SD  79.65  16.65     22662.15             2.30
## 4    CV   0.41   0.54         0.17             0.25

shape_index(data[,c(4,5,6,8)])

##      Index sales volume median_price months_inventory
## 1 Skewness  0.72   0.88        -0.36             0.04
## 2 Kurtosis -0.31   0.18        -0.62            -0.17

The variables sales and volume have a high range and a positive asymmetric distribution ($\gamma_{sales}=0.72 , \gamma_{volume}=0.88$). By their nature these variables are linked to each other through the median price which, however is negatively asymmetric ($\gamma_{median\_price}=-0.36$).

Comparing CV index appears that sales ($CV_{sales}=0.41$) and volume ($CV_{volume}=0.54$) has the higher variability, median price has the lowest ($CV_{median\_price}=0.17$).

The variable months inventory is the only nearly symmetric ($\gamma_{months\_inventory}=0.04$), symptom of a balanced distribution of the data.

3. Highest variability and most asymmetric variables identification

From the tables relating to the variability and asymmetry index it is possible to identify the variable with the greatest variability and the most asymmetric one.

For what concern variability, since the analyzed variables present standard deviations with different magnitude index, it is necessary to focus on the Coefficient of Variation in order to make comparisons.

Focusing on CV index, it is clear that volume is the variable with highest variability.

$CV_{volume}$ = 0.54

Focusing on Skeweness index, appears that volume is also the most asymmetric variable.

$\gamma_{volume}$= 0.88

The variable volume has a positive asymmetry ( $\gamma>0$ ), which means that there is a higher frequency of lower values.

4. Class Sales

The variable sales has a range from $Min_{sales}$=79 to $Max_{sales}$=423. It is possible to divide the variable in classes, ranging from 50 to 450 by steps of 50.

# Sales class from 50 to 450 by 50
breaks_sales = seq(50, 450, by = 50)
labels_sales = paste(head(breaks_sales, -1), "-", tail(breaks_sales, -1), sep="")

#Sales partition by sales class
data$class_sales =  cut(sales, 
                         breaks = breaks_sales, 
                         include.lowest = TRUE,
                         labels = labels_sales)

# Create frequency distribution
freq_distribution_sales = table(data$class_sales)

# Convert frequency distribution in dataframe
df_freq_sales = as.data.frame(freq_distribution_sales)

# Bar chart sales classes
library(ggplot2)
ggplot(df_freq_sales, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Sales Class", y = "Absolute Frequency", title = "Sales Frequency Distribution") +
  scale_y_continuous(breaks = seq(0, 80, by = 10), limits = c(0, 80)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14),  # Center Title
        axis.title.x = element_text(size = 12),  # axis name size
        axis.title.y = element_text(size = 12))

Fig. 1. Sales frequency distribution in classes

Fig. 1 shows that the lowest sales classes (100-150, 150-200) have a high frequency, which could indicate a prevalence of months and cities with low number of sales. The higher classes (300-350, 350-400) have very low frequencies, suggesting that it is rare to have months and cities with such a high number of sales. By further investigations, it is possible to find out if there are trends related to seasonality or if there are areas that on average record higher sales than others.

#Gini Index

gini.index = function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J = length(table(x))
  
  gini = 1-sum(fi2)
  gini.norm = gini/((J-1)/J)
  
  return(gini.norm)
}

Calculating the Gini index for the variable sales divided into classes, it comes out that $G'_{sales}$=0.92. The value obtained is very close to the maximum value of 1, which indicates that the variable sales divided into classes has a very high heterogeneity, confirming the results shown in the graph.

5. Probability

What is the probability that, if you take a random row from this dataset, it will show the city “Beaumont”? And the probability that it will bring back the month of July? And the probability that it will carry over the month of December 2012?

From the dataset nature, it is intuitive to find the answer to these questions. The dataset consists of 240 observations. An observation for each city in the month and year of reference. The probability that taken a row will report the city of Beumont will be equal to the total number of observations(240), divided by the total number of cities(4) reported in dataset. The result will be 25%.

It is possible to verify this statement by applying the definition of probability:

$p= \frac{Number~of~favorable~outcomes}{Total~number~of~outcomes}$

p_Beumont=(sum(data$city == "Beaumont")/tot_observation)*100

$p_{Beumont}=$25%

similarly we have that

p_july=round((sum(data$month == "7")/tot_observation)*100,2)
p_dec2012=round((sum(data$month == "12" & data$year==2012)/tot_observation)*100,2)

$p_{july}=$8.33%

$p_{dec2012}=$1.67%

6. New Variable

Starting from available data, it is possible to define new variables that helps in analyzing the data.

Define average price as:

$mean_{price}=\frac{volume}{sales}$

Define listings efficiency as total number of sales versus listings. Multiplying by 100, the output will represent the percentage of listings converted in effective sales.

$\eta_{listings}=\frac{sales}{listings} *100$

data$mean_price=(data$volume*1000000)/data$sales
data$eta_listings=round((data$sales/data$listings)*100,2)

summary(data[,c(4,6,10,11)])

##      sales        median_price      mean_price      eta_listings  
##  Min.   : 79.0   Min.   : 73800   Min.   : 97010   Min.   : 5.01  
##  1st Qu.:127.0   1st Qu.:117300   1st Qu.:132939   1st Qu.: 8.98  
##  Median :175.5   Median :134500   Median :156588   Median :10.96  
##  Mean   :192.3   Mean   :132665   Mean   :154320   Mean   :11.87  
##  3rd Qu.:247.0   3rd Qu.:150050   3rd Qu.:173915   3rd Qu.:13.49  
##  Max.   :423.0   Max.   :180000   Max.   :213234   Max.   :38.71

All positional index calculated on the mean price are about 20-30 k$ greater than the respective positional index calculated on the variable median price. This implies that for a given city, in a given month and year, among all the sales made, there are some (in limited number) that have recorded higher revenues than the others.

For what concern listings efficiency, although the range of this variable is high, ranging from a minimum of 5% to a maximum of about 40%, it is important to focus on the $3^{rd} Qu.$ index.

This parameter has a value of 13.5%. This means that in 75% of the cases observed, the effectiveness of the ads was less than 15% and only in rare cases has been recorded higher sales.

Deeper analyses will allow to understand if there is a relationship between the increase in sales and variables such as cities or a certain historical period.

7. Conditional Analysis

# Median Price by city
ggplot(data, aes(x = city, y = median_price)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 1)), vjust = -0.5, size = 3, color = "red") + #add median label
  labs(title = "Median Price by City", subtitle = "2010-2014", x = "City", y = "Median Price [$]") +
  scale_y_continuous(breaks = seq(70000, 180000, by = 10000),
                     limits = c(70000, 180000))+ # Set y-axis limits
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5, size=14),  # Center the title
        plot.subtitle = element_text(hjust = 0.5),
        axis.title.x = element_text(size = 12),  # axis name size
        axis.title.y = element_text(size = 12))

Fig. 2. Median Price by City

Bryan-College Station has the highest median prices among the cities represented, with a median price of around 155400$. Follows Tyler with a median price of 142200$, and Beaumont with 130750$. Wichita Falls has the lowest median prices 102300$.

Fig. 2 also shows that Brian-College is the city with the lowest IQR. Probably, properties in this city are in greater demand than in other cities, managing to keep prices relatively stable.

Wichita Falls, in addition to having the lowest median price, also has the highest IQR. It is likely that this area is more subject to market trends. Another point to underline regarding Wichita Falls is the presence of a relative minimum 25% lower than the median price, suggesting that there have been periods in this area when property sale prices have collapsed drastically.

#Volume by year
ggplot(data, aes(x = factor(year), y = volume, fill = city)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = c("Beaumont" = "lightblue", "Bryan-College Station" = "lightgreen", "Tyler" = "lightcoral", "Wichita Falls" = "lightgoldenrod")) +
  labs(title = "Volume by Year",
       x = "Year",
       y = "Volume [Billion $]") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size=14),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12))+
  scale_y_continuous(breaks = seq(0, 100, by = 10),
                     limits = c(0, 100))  # Set y-axis limits

Fig. 3. Volume by year

Fig. 3 shows that excep Whicita Falls, all cities show an uptrend in the analyzed time period.

Wichita Falls is the city with the lowest sales volume, with a relatively stable trend. Considering that all the other cities show an increasing trend, the fact that its sales volume has been stable over the years could be a symptom of a lack of interest in this area by buyers.

Bryan-College Station is the city with the highest sales volume, with a peak value exceeding $80 million in 2014. It is also the city with the highest average variability. This suggests the presence of a possible seasonal trend, with months characterized by higher number of sales.

Tyler has sales volumes with an uptred. It is the city with the highest median volume in the entire period analyzed. In addition, the variability is more contained than in Bryan-College city, suggesting the presence of a more stable market.

Beumont despite having lower sales volumes, among the cities registering an uptrend, is the one with the lowest variability. The market is stable and growing.

#sales by month (2010-2014)
ggplot(data, aes(x = factor(month), y = sales, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = sales), position = position_stack(vjust = 0.5), size = 3) +
  facet_wrap(~ year, ncol = 1) +
  labs(title = "Sales by Month (2010-2014)",
       x = "Month",
       y = "Sales") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5,size=16),
    axis.title.x = element_text(size=14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.y = element_text(size=14)) +
  scale_x_discrete(labels = month.abb)

Fig. 4. Sales by Month (2010-2014)

Analyzing the sales trend in the various months, it can be seen that starting from 2011, the most favorable months for sales are those ranging from May to August. The months with lowest sales are January and February.

library(dplyr)
library(gridExtra)
# Filter the data for the year 2014
data_2014 = data %>% filter(year == 2014)
data_2014_normalized = data_2014 %>%
  group_by(month) %>%
  mutate(sales_normalized = sales / sum(sales))

#sales by months
p1=ggplot(data_2014, aes(x = factor(month), y = sales, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = sales), position = position_stack(vjust = 0.5), size = 4) +
  labs(title = "Sales by Month (2014)",
       x = "Month",
       y = "Sales") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5,size=14),
        legend.position = c(0.9,0.9),
        axis.title.x = element_text(size=12),
        axis.title.y=element_text(size=12))+
  scale_y_continuous(breaks = seq(0, 1200, by = 100),
                     limits = c(0, 1200))

#sales by months normalized
p2=ggplot(data_2014_normalized, aes(x = factor(month), y = sales_normalized, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = round((sales_normalized*100), 1)), position = position_stack(vjust = 0.5), size = 4) +
  labs(title = "Normalized Sales by Month (2014)",
       x = "Month",
       y = "Normalized Sales") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5,size=14),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12),
        legend.position = "none")+
  scale_y_continuous(labels = scales::percent)

# Arrange the two plots side by side
grid.arrange(p1, p2, ncol = 2)

Fig. 5. Sales by Month (2014)

Graph on the left clearly shows the presence of a seasonal trend, with the presence of a sales peak, for all cities, recorded in the summer months.

Analyzing the graph on the right, it is possible to observe that the distribution of sales between the various cities remains more or less stable as the months vary.

Wichita Falls is the city with the lowest but also most stable sales percentage in 2014. The sales percentage varies from a minimum of 10.5% to a maximum of 14.6% ( $\Delta _{\%sales}=4.1\%$ )

Tyler is the city with the highest percentage of sales and also the second in terms of stability. The sales percentage fluctuates from a minimum of 32.7% to a maximum of 40.2% ( $\Delta _{\%sales}=7.5\%$ ). This suggests that Tyler is a good area to invest in, as demand is high and stable throughout the year.

# Median price line chart
ggplot(data, aes(x = year + month/12, y = median_price, color = city, group = city)) +
  geom_line(size = 1) +
  labs(title = "Median Price Trends",
       x = "year",
       y = "Median Price [$]",
       color = "city") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5,size=16),
        axis.title.x = element_text(size=14),
        axis.title.y = element_text(size=14))+
  theme(legend.position = "bottom",
        legend.title = element_text(size=14),
        legend.text = element_text(size=12))

Fig. 6 Median Price Trends

All cities show an uptrend in the median price over the period analyzed. This suggests a growth in the Texas real estate market. It is also possible to notice the presence of seasonal variations with peaks and declines that repeat cyclically from year to year.

Bryan - College Station, generally, has the highest median prices, symptom of higher demand or greater attractiveness of this city. Furthermore, it is the city with the least volatility with a more or less stable trend and limited annual price fluctuations.

Wichita Falls has the lowest median prices, which may suggest less buyer interest due to local economic conditions. It also the city with the highest volatility among the cities analyzed, with numerous annual peaks and drops.

Tyler is the second city by median price, has an upward trend and intermediate variability. The presence of seasonal trends could serve as the basis for a possible buying and selling strategy.

It could be useful to investigate possible relationship between Tyler median price and sales trends.

# Tyler data filter
tyler_data = data %>% filter(city == "Tyler")

# Find min median price value
min_median_price = tyler_data %>%
  group_by(year) %>%
  summarize(min_price = min(median_price, na.rm = TRUE))

# Join tyler data with min median price
tyler_data = tyler_data %>%
  left_join(min_median_price, by = "year")

# Tyler sales line chart
plot_sales = ggplot(tyler_data, aes(x = year + month/12, y = sales)) +
  geom_line(color = "blue", size = 1) +
  labs(title = "Tyler Sales over years",
       x = "Year",
       y = "Sales") +
  scale_y_continuous(breaks = seq(100, 500, by = 100), limits = c(100, 500)) +
  theme_minimal()+
   theme(plot.title = element_text(hjust = 0.5,size=14),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12))

# Tyler median price line chart
plot_median_price = ggplot(tyler_data, aes(x = year + month/12, y = median_price)) +
  geom_line(color = "red", size = 1) +
  geom_text(aes(label = ifelse(median_price == min_price & year>2010, round(median_price, 1), "")),
            hjust = 0, vjust = +1, size = 3, color = "black") +
  labs(title = "Tyler Median Price over years",
       x = "Year",
       y = "Median Price [$]") +
  scale_y_continuous(breaks = seq(120000, 170000, by=10000), limits = c(120000, 170000))+
  theme_minimal()+
   theme(plot.title = element_text(hjust = 0.5,size=14),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12))


# Join plots
grid.arrange(plot_sales, plot_median_price, ncol = 1)

Fig. 7. Tyler median price vs sales

As already seen previously, the trend of the median price for the city of Tyler presents an annual cyclical nature. Although the overall behavior is uptrend, in the years from 2011 to 2014, the price recorded its annual minimum close to the beginning of the year. Comparing the two graphs, an inverse proportionality relationship emerges between the price and the number of sales. In the summer months, where the price is highest, the greatest number of sales are also recorded, so it is better to buy properties at the beginning of the year and then resell them at a higher price in the summer months.

8. Conclusion

The analysis of the Texas real estate market from 2010 to 2014 revealed that:

Volume and sales are the parameters with the highest variability, while the median price has the lowest variability.
Listing efficiency (defined as sales on listings) in 75% of observed cases is lower than 15%.
Lower sales classes (100-150, 150-200) have the highest frequency, indicating it is rare to have months and cities with a high number of sales.
Volume by year analysis shows that except for Wichita Falls, all cities registered an increase in volume from 2010 to 2014, indicating that the market is growing.
Sales by month, both for the entire time period and focused on 2014, highlight the presence of a seasonal trend with the highest sales registered in the summer months (May-August). It also shows that Tyler city’s sales percentage is the highest and relatively stable.
The median price generally grew in the last year. It is also characterized by seasonal trends, with drops at the beginning of the year and peaks in the middle of the year.
Comparing Tyler city’s sales and median price over the years highlights an inverse proportionality between these two parameters. In the summer months when the median price reaches its peak, the highest number of sales is also recorded.

Considering that Tyler has a generally high median price, a market affected by an uptrend, a relatively stable percentage of sales, and the highest sales compared to other cities, in addition to the presence of seasonal trends, it is suggested to invest in this city.