We start by importing the libraries we need:

library(ggplot2)
library(dplyr)
## 
## Caricamento pacchetto: 'dplyr'
## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
## 
##     intersect, setdiff, setequal, union

This project aim to analyze the Texas real estate market using historical data from 2010 and 2014. The goal is to identify trends in property sales, evaluate listing effectiveness, and create informative visualizations to support decision-making. To do that we will combine the use of base R syntax, ggplot and dplyr libraries.

data = read.csv("realestate_texas.csv",sep=",")
head(data,5)
##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9

Step 1 - Variable Analysis

Identify and describe the variables type

  • City (character): Nominal categorical variable
  • Year (integer): Discrete numerical variable
  • Month (integer): Ordinal categorical variable
  • Sales (integer): Discrete quantitative variable
  • Volume (numeric): Continous quantitative variable
  • Median_price (numeric): Continous quantitative variable
  • Listings (integer): Discrete numerical variable
  • Months_inventory(numeric):Continous quantitative variable

How to treat time variables?

Year and month are not really helpful alone, it makes more sense to analyze them together with other variables like sales or median price, to identify trends over different years or in different seasons (by grouping the month). In the previous paragraph year is a defined as a categorical variable, it will be used as a grouping item rather than something for arithmetic operations (for example we do not usually calculate year 2013 minus year 2012)

What analysis is more suitable for each variable?

-City: We will use the function “table” to identify frequencies. -Year and month: line chart combined with other variables like sales, volume or median price -Sales and Volume: Variability measures to study the dispersion of the data. We will use boxplot to detect possible outliers. -Median Price: we will use a density plot to explore shape (Positve skewness is expected, due to the fact that more low-price houses should be on the market) We can also explore possible links between median price and city/year. -Listings: We will study the relationship between listing and sales plus a brief analysis on variability. -Month_inventory: We will study the variabilty compared to the variable city or year.

Step 2 - Index of position, variabilty and shape

attach(data)
table(city)
## city
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60

All 4 categories have the same absolute frequency, so we have a quadrimodal distribution. We also expect the Gini Index to be 1, so Maximum Heterogeneity.

table(year)
## year
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48
table(month)
## month
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20

Due to the structure of the dataset also year and month have all equal frequencies.

summary(sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0
#The sales variable shows a concentraded range of value, from 79 to 423, with the mean a bit higher than the median value. Also the SD is pretty high compared to the mean, this indicates an important variation in the number of sales per month. 
sales_range = max(sales)-min(sales)
sigma2_sales = var(sales)
sigma_sales = sd(sales)
print(sigma_sales)
## [1] 79.65111
IQR_sales = IQR(sales)
boxplot(sales, 
        main = "Boxplot of Sales",
        col = "lightblue4")

plot(density(sales))

The boxplot confirms the spread of values while the density plot show a right skewed distribution with more values concentraded in the lower range. This also explain why the mean is slighty higher than the median value. In this case the median may be a better indicator for a typical month than the mean.

summary(volume)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.166  17.660  27.062  31.005  40.893  83.547
volume_range = max(volume)-min(volume)
sigma2_volume = var(volume)
sigma_volume = sd(volume)
print(sigma_volume)
## [1] 16.65145
IQR_sales = IQR(volume)
boxplot(volume, 
        main = "Boxplot of volume",
        col = "lightblue4")

plot(density(volume))

The volume variable reflects the total dollar value of property sales each month. It shows a positive skew, very similar to sales, but with even more outliers that are visible in the boxplot. This is expected, since volume depends on both the number of sales and the price of properties, making it sensitive to extreme values. The high standard deviation, relative to the mean, confirms a high level of variability. In this case, the median could again serve as a more realistic indicator of central tendency than the mean.

summary(median_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73800  117300  134500  132665  150050  180000
plot(density(median_price))

Contrary to expectations, the density plot reveals a slightly left skewed distribution, with high prices more common than low prices. This represent the situation of texas real estate market and is probably connected to the high outliers that we saw before with the volume’s boxplot.

summary(listings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     743    1026    1618    1738    2056    3296
listings_range = max(listings) - min(listings)
listings_sd = sd(listings)
listings_iqr= IQR(listings)

boxplot(listings, 
        main = "Boxplot of Listings",
        col = "lightblue4")

plot(density(listings), 
     main = "Density of Listings")

The boxplot show a pretty simmetric distribution, with the median positioned near the center of the interquartile range. There are no outliers and the interquartile range is pretty large, suggesting moderate variability in the data. The density plot instead shows a multimodal distribution with three different peaks. These may be explained with other analysis, especially link the listings number with the city or the period.

summary(months_inventory)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.400   7.800   8.950   9.193  10.950  14.900
months_inventory_range = max(months_inventory) - min(months_inventory)
months_inventory_sd = sd(months_inventory)
print(months_inventory_sd)
## [1] 2.303669
months_inventory_iqr= IQR(months_inventory)

boxplot(months_inventory, 
        main = "Boxplot of months_inventory",
        col = "lightblue4")

plot(density(months_inventory), 
     main = "Density of months_inventory")

The boxplot shows a compact distribution, with low variabilty (sd=2 with mean around 8). The median is right below the center so I’m expecting a very slight right skewed distribution, that in fact is confirmed by the density plot, that is really close to the normal distribution.

Step 3 - Variables with higher asimmetry and variability

Even if Listings has the largest interquartile range, the variable with the highest variability is likely to be volume. The standard deviation is more than 50% of the mean and median values, in addiction it shows both a large interquartile range and many outliers. All these factors represent the dispersion of the data.

The variable with most skewed distribution is also volume, since it has a right skewed distribution with the highest peak of density.

Step 4 - Class creation for a quantitative variable

summary(sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0
data$classi = cut(sales,
                  breaks = c(78,139,199,259,319,379,440)
)
ggplot(data=data)+
  geom_bar(aes(x=classi),
           stat = "count",
           fill = "lightblue4",
           col = "black") +
  labs(title="Frequency distribution of sales",
       x = "Class",
       y = "Frequency")+
  theme_classic()

ggplot(data = data)+
  geom_boxplot(aes(x=city,y=sales),
               fill="lightblue4")+
  labs(title="Boxplot of sales by city",
       x="City",
       y="Number of sales")+
  theme_classic()

gini.index = function(x){
  ni=table(x)
  fi=ni/length(x)
  fi2=fi^2
  J = length(table(x))
  gini = 1-sum(fi2)
  gini.normalizzato= gini / ((J-1)/J)
  
  return (gini.normalizzato)
}


gini.classi = gini.index(data$classi)
print(gini.classi)
## [1] 0.913125

The variable sales was grouped into 6 classes, the frequency distribution graph show a right skew with most value concentred in the 2 lower classes. The Gini index confirms the former conclusions and indicates a very high level of heterogenity.

Step 5 - Probability

N = length(data$city)
sum_city = sum(data$city == "Beaumont")
cat("Number of Beaumont occurencies: ",sum_city,"\n")
## Number of Beaumont occurencies:  60
cat("Probabilty equals to: ",(sum_city/N)*100,"%","\n\n")
## Probabilty equals to:  25 %
sum_month = sum(data$month == "7")
cat("Number of July occurencies: ",sum_month)
## Number of July occurencies:  20
cat("Probabilty equals to: ",(sum_month/N)*100,"%","\n\n")
## Probabilty equals to:  8.333333 %
sum_monthyear = sum(data$month == "12" & data$year =="2012")
cat("Number of December 2012 occurencies: ",sum_monthyear)
## Number of December 2012 occurencies:  4
cat("Probabilty equals to: ",(sum_monthyear/N)*100,"%","\n\n")
## Probabilty equals to:  1.666667 %

Step 6 - New Variables

data$avg_price= (volume*1000000)/sales
data$selling_ratio = sales/listings

ggplot(data = data)+
  geom_boxplot(aes(x=city,y=avg_price),
               fill="lightblue4")+
  labs(title="Boxplot of avg_price by city",
       x="City",
       y="Avg_price")+
  theme_classic()

ggplot(data = data)+
  geom_boxplot(aes(x=city,y=median_price),
               fill="lightblue4")+
  labs(title="Boxplot of median_price by city",
       x="City",
       y="median_price")+
  theme_classic()

ggplot(data = data)+
  geom_boxplot(aes(x=city,y=selling_ratio),
               fill="lightblue4")+
  labs(title="Boxplot of selling-ratio by city",
       x="City",
       y="selling_ratio")+
  theme_classic()

data %>%
  group_by(year) %>%
  summarise(mean_price = mean(avg_price)) %>%
  ggplot(aes(x = year, y = mean_price)) +
  geom_line() +
  labs(title = "Average Price Over Years", x = "Year", y = "Avg Price")

From the grapichs we can observe that “Bryan-College Station” has the highest average price. The median price distribution is similar to that avg_price one, but a little more balanced between the four cities. The selling_ratio (sales divided by listings) shows clear differences in listing effectiveness across cities. Bryan–College Station shows a higher average ratio, suggesting a more efficient market. Other cities show lower and more consistent ratios,also with less variability. The last line chart shows a clear upward trend and the growth of prices in the analized period.

Step 7 - Conditional Analysis

data %>%
  group_by(city) %>%
  summarise(
    avg_sales = mean(sales),
    sd_sales = sd(sales),
    avg_price = mean(avg_price),
    selling_efficiency = mean(selling_ratio)
  )
## # A tibble: 4 × 5
##   city                  avg_sales sd_sales avg_price selling_efficiency
##   <chr>                     <dbl>    <dbl>     <dbl>              <dbl>
## 1 Beaumont                   177.     41.5   146640.             0.106 
## 2 Bryan-College Station      206.     85.0   183534.             0.147 
## 3 Tyler                      270.     62.0   167677.             0.0935
## 4 Wichita Falls              116.     22.2   119430.             0.128
data %>%
  group_by(city) %>%
  summarise(avg_sales = mean(sales)) %>%
  ggplot(aes(x = city, y = avg_sales, fill = city)) +
  geom_col() +
  labs(title = "Average Sales by City", x = "City", y = "Average Sales") +
  theme_classic()

Step 8 - Final Visualizations

data %>%
  group_by(city, month) %>%
  summarise(total_sales = sum(sales)) %>%
  ggplot(aes(x = month, y = total_sales, fill = city)) +
  geom_col(position = "dodge") +
  labs(title = "Total Sales by Month and City",
       x = "Month", y = "Total Sales") +
  scale_x_continuous(breaks = seq(1,12,1))+
  theme_classic()
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

data %>%
  group_by(year, city) %>%
  summarise(avg_sales = mean(sales)) %>%
  ggplot(aes(x = year, y = avg_sales, color = city)) +
  geom_line() +
  labs(title = "Average Sales Over Time by City",
       x = "Year", y = "Average Sales") +
  theme_minimal()
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

The sales by month and city shows a clear trend for “Beaumont” and “Bryan–College Station” with higher sales during the summer season, while the other two cities have a quite stable situation over the year.

The second graph with average sales by city over time shows a clear upward trend for all cities from 2011 to 2014 except fot “Wichita falls”, that had the max peak in 2010.

Extra

monthly_sales = aggregate(sales ~ month + city, data = data, sum)

ggplot(data=monthly_sales)+
  geom_col(aes(x = month, y = sales, fill = city),
           col = "black", position = "fill") +
  labs(itle = "Total Sales by Month and City",
    x = "Month",
    y = "Sales") +
  theme_classic() +
  scale_x_continuous(breaks= seq(1,12,1))

monthly_volume = aggregate(volume ~ month + city, data = data, sum)

ggplot(data=monthly_volume)+
  geom_col(aes(x = month, y = volume, fill = city),
           col = "black", position = "fill") +
  labs(itle = "Total Sales by Month and City",
    x = "Month",
    y = "Sales") +
  theme_classic() +
  scale_x_continuous(breaks= seq(1,12,1))

monthly_listings = aggregate(listings ~ month + city, data = data, sum)

ggplot(data=monthly_listings)+
  geom_col(aes(x = month, y = listings, fill = city),
           col = "black", position = "fill") +
  labs(itle = "Total Listings by Month and City",
    x = "Month",
    y = "Listings") +
  theme_classic() +
  scale_x_continuous(breaks= seq(1,12,1))

The last two graphs display how sales and volume are distributed for the four cities during the months of the year. The charts are normalized to 100%, allowing for a clear comparison of the relative contribution of each city. The graphics suggests that while the total number of sales or volume may vary monthly, the relative weight of each city remains stable, with Tyler consistently having the largest share of the market.

Step 9 - Final Comment and Recap

The analysis made by using descriptive statics gave us some good insights on the texal real estate market. Sales and Volume: There was an overall increase in sales volume over the years, representing a growing market.

Listing: The listing distribution reflects what we saw from the other two normalized graphic, with same percentage distribution for the four cities.

Temporal Trends: Monthly analysis showed that certain months (often summer) were associated with higher sales volumes, indicating a potential seasonal effect on the market.