1) Loading Dataset

setwd("C:/Users/tomma/Desktop/Profession AI")

texas <- read_csv("realestate_texas.csv")

## Rows: 240 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): city
## dbl (7): year, month, sales, volume, median_price, listings, months_inventory
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

attach(texas)

2) Type of Variables present in the Dataset

str(texas)

## spc_tbl_ [240 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ city            : chr [1:240] "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
##  $ year            : num [1:240] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month           : num [1:240] 1 2 3 4 5 6 7 8 9 10 ...
##  $ sales           : num [1:240] 83 108 182 200 202 189 164 174 124 150 ...
##  $ volume          : num [1:240] 14.2 17.7 28.7 26.8 28.8 ...
##  $ median_price    : num [1:240] 163800 138200 122400 123200 123100 ...
##  $ listings        : num [1:240] 1533 1586 1689 1708 1771 ...
##  $ months_inventory: num [1:240] 9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   city = col_character(),
##   ..   year = col_double(),
##   ..   month = col_double(),
##   ..   sales = col_double(),
##   ..   volume = col_double(),
##   ..   median_price = col_double(),
##   ..   listings = col_double(),
##   ..   months_inventory = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

City: nominal qualitative variable
Year: even if expressed with integers it is to be considered as an ordinal qualitative variable
Month: for the reason expressed previously it is an ordinal qualitative variable
Sales: quantitative variable
Volume: quantitative variable
Median Price: quantitative variable
Listings: quantitative variable
Months Inventory: quantitative variable

3) Indices

#Variable "city"
ni.city<-table(city)
fi.city<-table(city)/dim(texas)[1]
Tabcity<-cbind(ni.city,fi.city)

#Variable "year"
ni.year<-table(year)
fi.year<-table(year)/dim(texas)[1]
Ni.year<-cumsum(table(year))
Fi.year<-cumsum(table(year))/dim(texas)[1]
Tabyear<-cbind(ni.year,fi.year,Ni.year,Fi.year)

#Variable "month"
ni.month<-table(month)
fi.month<-table(month)/dim(texas)[1]
Ni.month<-cumsum(table(month))
Fi.month<-cumsum(table(month))/dim(texas)[1]
Tabmonth<-cbind(ni.month,fi.month,Ni.month,Fi.month)

#Other variables
quant_var <- colnames(texas[4:8])
lista_tab <- c()
for (i in quant_var) {
  #Position Indices
  Mean<-mean(texas[[i]]) #average
  Mode<-as.numeric(names(which.max(table(texas[[i]])))) #mode
  Median<-median(texas[[i]]) #median
  Min<-min(texas[[i]])
  Max<-max(texas[[i]])
  #Variability Indices
  Range<-Max-Min
  IQR<-IQR(texas[[i]])
  SD<-sd(texas[[i]]) #variance
  #Variance.sales<-(var(sales)*(length(sales)-1))/length(sales) or
  Variance<-SD^2
  CV<-SD/Mean
  #Shape Indices
  Kurtosis<-kurtosis(texas[[i]])-3
  Skewness<-skewness(texas[[i]])
  Tab<-cbind(Mean,Mode,Median,Min,Max,Range,IQR,Variance,SD,CV,Skewness,Kurtosis)
  
  lista_tab[[i]] <- Tab
}
Tabcompl_quant<-rbind(lista_tab[[(1)]], lista_tab[[(2)]], lista_tab[[(3)]], lista_tab[[(4)]], lista_tab[[(5)]])
Tabcompl_quant<-as.data.frame(Tabcompl_quant)
rownames(Tabcompl_quant)<-colnames(texas[4:8])

For the quantitative variables I calculated the main position, variability and shape indices. I then grouped them in a summary table. We can compare the variables with the skewness and coefficient of variation indices to better understand their distribution and variability. Most of the variables have a positive skewness, where the values are concentrated mostly in the lower part of the distribution, leaving a longer tail to the right. The variable “median_price” is an exception. From the point of view of relative variability, measured by the coefficient of variation (CV), the variable with the greatest dispersion with respect to the mean is volume. This variable is also the one with the highest degree of skewness, as indicated by the skewness index in the table, reflecting a distribution particularly unbalanced towards higher values.

kable(round(Tabcompl_quant, 2))

	Mean	Mode	Median	Min	Max	Range	IQR	Variance	SD	CV	Skewness	Kurtosis
sales	192.29	124.0	175.50	79.00	423.00	344.00	120.00	6.34430e+03	79.65	0.41	0.72	-0.31
volume	31.01	14.0	27.06	8.17	83.55	75.38	23.23	2.77270e+02	16.65	0.54	0.88	0.18
median_price	132665.42	130000.0	134500.00	73800.00	180000.00	106200.00	32750.00	5.13573e+08	22662.15	0.17	-0.36	-0.62
listings	1738.02	1581.0	1618.50	743.00	3296.00	2553.00	1029.50	5.66569e+05	752.71	0.43	0.65	-0.79
months_inventory	9.19	8.1	8.95	3.40	14.90	11.50	3.15	5.31000e+00	2.30	0.25	0.04	-0.17

For the variables “city”, “years” and “month”, being qualitative, I created only the frequency distributions. As you can see, within the dataset, the number of observations for each city is identical and data were collected for each month from 2010 to 2014, for the cities of Beaumont, Bryan-College Station, Tyler, Wichita Falls.

kable(Tabcity, caption = "Frequency Table for the variable City", align = "c")

Frequency Table for the variable City
	ni.city	fi.city
Beaumont	60	0.25
Bryan-College Station	60	0.25
Tyler	60	0.25
Wichita Falls	60	0.25

kable(round(Tabmonth,2), caption = "Frequency Table for the variable Month", align = "c")

Frequency Table for the variable Month
ni.month	fi.month	Ni.month	Fi.month
20	0.08	20	0.08
20	0.08	40	0.17
20	0.08	60	0.25
20	0.08	80	0.33
20	0.08	100	0.42
20	0.08	120	0.50
20	0.08	140	0.58
20	0.08	160	0.67
20	0.08	180	0.75
20	0.08	200	0.83
20	0.08	220	0.92
20	0.08	240	1.00

kable(Tabyear, caption = "Frequency Table for the variable Year", align = "c")

Frequency Table for the variable Year
	ni.year	fi.year	Ni.year	Fi.year
2010	48	0.2	48	0.2
2011	48	0.2	96	0.4
2012	48	0.2	144	0.6
2013	48	0.2	192	0.8
2014	48	0.2	240	1.0

4) Which is the variable with the highest variability?

The variable with index 2, or “volume”, has a higher CV, therefore higher variability.

kable(rownames(Tabcompl_quant[which.max(Tabcompl_quant$CV),]))

x
volume

The “volume” variable has a higher skewness index in absolute value, therefore it is more asymmetric.

kable(rownames(Tabcompl_quant[which.max(abs(Tabcompl_quant$Skewness)),]))

x
volume

5) Division of one of the Quantitative Variables into classes

For the number of classes, Sturges’ formula is used.

S<-round(1+10/3*log(length(sales)))
W<-ceiling((max(sales)-min(sales))/S)
class.sales<-cut(sales,breaks=c(seq(min(sales)-1,max(sales),W),max(sales)))
ni.class.sales<-table(class.sales)
Ni.class.sales<-cumsum(table(class.sales))
fi.class.sales<-table(class.sales)/length(sales)
Fi.class.sales<-cumsum(table(class.sales)/length(sales))
Tabclass.sales<-cbind(ni.class.sales,Ni.class.sales,fi.class.sales,Fi.class.sales)
kable(round(Tabclass.sales, 2), caption = "Frequency Table of the Variable Sales divided into Classes", align = "c")

Frequency Table of the Variable Sales divided into Classes
	ni.class.sales	Ni.class.sales	fi.class.sales	Fi.class.sales
(78,97]	19	19	0.08	0.08
(97,116]	26	45	0.11	0.19
(116,135]	29	74	0.12	0.31
(135,154]	21	95	0.09	0.40
(154,173]	23	118	0.10	0.49
(173,192]	21	139	0.09	0.58
(192,211]	19	158	0.08	0.66
(211,230]	10	168	0.04	0.70
(230,249]	12	180	0.05	0.75
(249,268]	11	191	0.05	0.80
(268,287]	13	204	0.05	0.85
(287,306]	13	217	0.05	0.90
(306,325]	5	222	0.02	0.92
(325,344]	5	227	0.02	0.95
(344,363]	5	232	0.02	0.97
(363,382]	4	236	0.02	0.98
(382,401]	1	237	0.00	0.99
(401,420]	2	239	0.01	1.00
(420,423]	1	240	0.00	1.00

barplot(table(class.sales),ylim = c(0,30),xlab = "Sales Classes",ylab = "Abs.Freq.")

There are 19 classes, 18 of which have a width of 19 and the last one has a width of 3. In most months there is a total number of sales belonging to the interval (116, 135], with a probability of 12%.

The Gini index is equal to 0.97, very close to the upper limit value. This means that there is almost an equal distribution of the classes of total sales, as expected, given the low frequency of the most frequent class.

gini.index <- function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J = length(table(x))
  gini = 1 - sum(fi2)
  gini.normalizzato = gini/((J-1)/J)
  return(gini.normalizzato)}

kable(round(gini.index(class.sales),2))

x
0.97

6) Identifying the Gini index for the variable “City”

The Gini index will be equal to 1, since there is an equal distribution of the absolute frequencies of the modes of the city variable.

7) What is the probability that a random row of this dataset reports the city “Beaumont”? And the probability that it reports the month of July? And the probability that it reports the month of December 2012?

p1 <- sum(city=="Beaumont")/nrow(texas)
p2 <- sum(month==7)/nrow(texas)
p3 <- sum(month==12 & year==2012)/nrow(texas)

p1 <- sum(city == "Beaumont") / nrow(texas)
p2 <- sum(month == 7) / nrow(texas)
p3 <- sum(month == 12 & year == 2012) / nrow(texas)

probabilities <- data.frame(
  p1,
  p2,
  p3)

kable(
  round(probabilities, 2),
  align = "c",
  col.names = c("P(city='Beaumont')", "P(month=7)", "P(month=12 & year=2012)"))

P(city=‘Beaumont’)	P(month=7)	P(month=12 & year=2012)
0.25	0.08	0.02

8) Column with Median Price

texas$mean_price<-(volume/sales)*1000000

9) Create another column that gives an idea of the “effectiveness” of the sales announcements

texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))

To create a column that reflects the effectiveness of sales ads, I calculated, for each observation, the percentage of ads that actually led to a sale.

texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))

ggplot(data = texas) +
  geom_line(aes(x = date, y = efflistings, colour = city)) +
  geom_point(aes(x = date, y = efflistings, colour = city)) +
  labs(
    title = "Time Series of ADS effectiveness by City",
    x = "Period",
    y = "Percentage of Sales on active ADS"
  ) +
  scale_x_date(
    breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
    date_labels = "%m/%Y", 
    expand = expansion(mult = c(0.001, 0.001)) 
  ) +
  theme_classic()

The most effective sales ADS are those in the Bryan-Colege Station city. In general, it is observed that positive and negative trends occur simultaneously in the four cities but with different magnitudes.

10) Creating Summary

summary_texas <- texas %>%
  group_by(year, city) %>%
  summarise(
    media.di.sales = round(mean(sales),2),
    sd.di.volume = round(sd(volume),2))

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

kable(summary_texas)

year	city	media.di.sales	sd.di.volume
2010	Beaumont	156.17	4.95
2010	Bryan-College Station	167.58	10.82
2010	Tyler	227.50	8.39
2010	Wichita Falls	123.42	4.07
2011	Beaumont	144.00	4.30
2011	Bryan-College Station	167.42	10.31
2011	Tyler	238.83	9.41
2011	Wichita Falls	106.25	2.52
2012	Beaumont	171.92	4.92
2012	Bryan-College Station	196.75	13.49
2012	Tyler	263.50	10.23
2012	Wichita Falls	112.42	2.66
2013	Beaumont	201.17	6.44
2013	Bryan-College Station	237.83	19.54
2013	Tyler	287.42	10.33
2013	Wichita Falls	121.25	3.11
2014	Beaumont	213.67	7.05
2014	Bryan-College Station	260.25	17.97
2014	Tyler	331.50	12.76
2014	Wichita Falls	117.00	3.13

ggplot(summary_texas, aes(x = year, y = media.di.sales, color = city, group = city)) +
  geom_line(linewidth = 1) +           
  geom_point(size = 2) +           
  labs(
    title = "Average monthly Sales by City and Year",
    x = "Year",
    y = "Average Monthly Sales",
    color = "City"
  ) +
  theme_classic()

There is a decline in average monthly sales in 2011 for all cities in the sample, then an increase every year, except for the city of Wichita Falls, with a decrease in 2014.

GGPLOT Graphs

ggplot(data=texas)+
  geom_boxplot(aes(y=median_price,x=city))+
  labs(title = "Median Price Blox",
       x="City",
       y="Median Price")+
  theme_classic()

The graph shows that the variable “median_price” presents a greater variability for the city of Wichita Falls, where there are also houses with the lowest prices, while the same variable is distributed with a lower variability for the observations relating to the city of Bryan-College Station, where the most expensive houses are present.

ggplot(data=texas)+
  geom_boxplot(aes(y=volume,x=city,fill=as.factor(year)))+
  labs(title = "BoxPlot Total Sales value by City",
       x="City",
       y="Total Sales Value",
       fill="Year")+
  theme_classic()

From the graph, it can be observed that the variable “volume” is characterized by greater variability when it comes to the city of Bryan-College Station. In particular, the total value of sales has a greater variability in the year 2013. Instead, “volume” is less variable for the city of Wichita Falls. Finally, in the year 2012, limited to the city of Beumont, the volume of sales is characterized, in absolute terms, by a lower variability.

for (i in 2010:2014) {
  grafico <- ggplot(data=subset(texas, year == i)) +
    geom_col(aes(x = month, y = sales, fill = city), position = "stack") +
    labs(
      title = paste("Total Sales by Month and City in the year", i),
      x = "Month",
      y = "Total Sales",
      fill = "City"
    ) +
    scale_x_continuous(breaks = seq(1, 12, 1)) +
    scale_y_continuous(breaks = seq(100, 1200, 100)) +
    theme_classic() +
    theme(plot.title = element_text(size = 12))
  print(grafico)
}

for (i in 2010:2014) {
  grafico1 <- ggplot(data=subset(texas, year == i)) +
    geom_col(aes(x = month, y = sales, fill = city), position = "fill") +
    labs(
      title = paste("Total Sales Percentage by Month and City in the year", i),
      x = "Month",
      y = "Percentage of Sales",
      fill = "City"
    ) +
    scale_x_continuous(breaks = seq(1, 12, 1)) +
    scale_y_continuous(breaks = seq(0, 1, 0.05)) +
    theme_classic() +
    theme(plot.title = element_text(size = 10))
  print(grafico1)
}

From the graphs it is noted that for most of the months examined the highest percentage of sales is recorded in the city of Tyler while the lowest in the city of Wichita Falls. Furthermore, the percentages of sales for each city appear to be more or less constant over time. Another consideration to add concerns the fact that sales are concentrated, for each year, during the summer period.

ggplot(data = texas) +
  geom_line(aes(x = date, y = mean_price, colour = city)) +
  geom_point(aes(x = date, y = mean_price, colour = city)) + 
  labs(
    title = "Historical Series of Average Price between Cities",
    x = "Period",
    y = "Average Price",
    colour = "City"
  ) +
  scale_x_date(
    breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
    date_labels = "%m/%y",
    expand = expansion(mult = c(0.01, 0.01)) 
  ) +
  theme_classic()

I chose to represent the average price of sales in each month from 2010 to 2015 for the cities in the dataset. You can see that the highest average prices are in the city of Bryan-College Station while the lowest are in the city of Wichita Falls. At the beginning of 2012 there was a sharp drop in prices that involved all cities. The average price undergoes strong variations, for all cities, undergoes strong variations, represented as broken lines. Often, the variations of the average prices of the cities of Beaumont and Tyler go in the same direction.

Final Aspects

Interesting insights have been extracted from the descriptive analyses conducted on the dataset. First of all, we note the continuous price variations that involve all the cities included in the sample, therefore they are shocks that concern the real estate market itself. On the sales side, it should be noted that the trends are generally increasing, with transactions that are concluded more frequently during the summer months of the year. From the point of view of the individual cities, it is observed that the highest average price is recorded in the city of Bryan-College Station while the lowest is in the city of Wichita Falls. In the city of Tyler, more real estate sales are made than in the others and at the same time, a higher overall turnover in the sector is recorded. Going down a level further, considering the quality of real estate agencies, it is necessary to mention the agents of the city of Bryan-College Station who outperform their colleagues in other cities in terms of percentage of sales on active ADS.

Texas Real Estate Market Analysis

Tommaso

2025-02-22