1) Loading Dataset

setwd("C:/Users/tomma/Desktop/Profession AI")
texas <- read_csv("realestate_texas.csv")
## Rows: 240 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): city
## dbl (7): year, month, sales, volume, median_price, listings, months_inventory
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(texas)

2) Type of Variables present in the Dataset

str(texas)
## spc_tbl_ [240 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ city            : chr [1:240] "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
##  $ year            : num [1:240] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month           : num [1:240] 1 2 3 4 5 6 7 8 9 10 ...
##  $ sales           : num [1:240] 83 108 182 200 202 189 164 174 124 150 ...
##  $ volume          : num [1:240] 14.2 17.7 28.7 26.8 28.8 ...
##  $ median_price    : num [1:240] 163800 138200 122400 123200 123100 ...
##  $ listings        : num [1:240] 1533 1586 1689 1708 1771 ...
##  $ months_inventory: num [1:240] 9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   city = col_character(),
##   ..   year = col_double(),
##   ..   month = col_double(),
##   ..   sales = col_double(),
##   ..   volume = col_double(),
##   ..   median_price = col_double(),
##   ..   listings = col_double(),
##   ..   months_inventory = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

3) Indices

#Variable "city"
ni.city<-table(city)
fi.city<-table(city)/dim(texas)[1]
Tabcity<-cbind(ni.city,fi.city)

#Variable "year"
ni.year<-table(year)
fi.year<-table(year)/dim(texas)[1]
Ni.year<-cumsum(table(year))
Fi.year<-cumsum(table(year))/dim(texas)[1]
Tabyear<-cbind(ni.year,fi.year,Ni.year,Fi.year)

#Variable "month"
ni.month<-table(month)
fi.month<-table(month)/dim(texas)[1]
Ni.month<-cumsum(table(month))
Fi.month<-cumsum(table(month))/dim(texas)[1]
Tabmonth<-cbind(ni.month,fi.month,Ni.month,Fi.month)

#Other variables
quant_var <- colnames(texas[4:8])
lista_tab <- c()
for (i in quant_var) {
  #Position Indices
  Mean<-mean(texas[[i]]) #average
  Mode<-as.numeric(names(which.max(table(texas[[i]])))) #mode
  Median<-median(texas[[i]]) #median
  Min<-min(texas[[i]])
  Max<-max(texas[[i]])
  #Variability Indices
  Range<-Max-Min
  IQR<-IQR(texas[[i]])
  SD<-sd(texas[[i]]) #variance
  #Variance.sales<-(var(sales)*(length(sales)-1))/length(sales) or
  Variance<-SD^2
  CV<-SD/Mean
  #Shape Indices
  Kurtosis<-kurtosis(texas[[i]])-3
  Skewness<-skewness(texas[[i]])
  Tab<-cbind(Mean,Mode,Median,Min,Max,Range,IQR,Variance,SD,CV,Skewness,Kurtosis)
  
  lista_tab[[i]] <- Tab
}
Tabcompl_quant<-rbind(lista_tab[[(1)]], lista_tab[[(2)]], lista_tab[[(3)]], lista_tab[[(4)]], lista_tab[[(5)]])
Tabcompl_quant<-as.data.frame(Tabcompl_quant)
rownames(Tabcompl_quant)<-colnames(texas[4:8])

For the quantitative variables I calculated the main position, variability and shape indices. I then grouped them in a summary table. We can compare the variables with the skewness and coefficient of variation indices to better understand their distribution and variability. Most of the variables have a positive skewness, where the values are concentrated mostly in the lower part of the distribution, leaving a longer tail to the right. The variable “median_price” is an exception. From the point of view of relative variability, measured by the coefficient of variation (CV), the variable with the greatest dispersion with respect to the mean is volume. This variable is also the one with the highest degree of skewness, as indicated by the skewness index in the table, reflecting a distribution particularly unbalanced towards higher values.

kable(round(Tabcompl_quant, 2))
Mean Mode Median Min Max Range IQR Variance SD CV Skewness Kurtosis
sales 192.29 124.0 175.50 79.00 423.00 344.00 120.00 6.34430e+03 79.65 0.41 0.72 -0.31
volume 31.01 14.0 27.06 8.17 83.55 75.38 23.23 2.77270e+02 16.65 0.54 0.88 0.18
median_price 132665.42 130000.0 134500.00 73800.00 180000.00 106200.00 32750.00 5.13573e+08 22662.15 0.17 -0.36 -0.62
listings 1738.02 1581.0 1618.50 743.00 3296.00 2553.00 1029.50 5.66569e+05 752.71 0.43 0.65 -0.79
months_inventory 9.19 8.1 8.95 3.40 14.90 11.50 3.15 5.31000e+00 2.30 0.25 0.04 -0.17

For the variables “city”, “years” and “month”, being qualitative, I created only the frequency distributions. As you can see, within the dataset, the number of observations for each city is identical and data were collected for each month from 2010 to 2014, for the cities of Beaumont, Bryan-College Station, Tyler, Wichita Falls.

kable(Tabcity, caption = "Frequency Table for the variable City", align = "c")
Frequency Table for the variable City
ni.city fi.city
Beaumont 60 0.25
Bryan-College Station 60 0.25
Tyler 60 0.25
Wichita Falls 60 0.25
kable(round(Tabmonth,2), caption = "Frequency Table for the variable Month", align = "c")
Frequency Table for the variable Month
ni.month fi.month Ni.month Fi.month
20 0.08 20 0.08
20 0.08 40 0.17
20 0.08 60 0.25
20 0.08 80 0.33
20 0.08 100 0.42
20 0.08 120 0.50
20 0.08 140 0.58
20 0.08 160 0.67
20 0.08 180 0.75
20 0.08 200 0.83
20 0.08 220 0.92
20 0.08 240 1.00
kable(Tabyear, caption = "Frequency Table for the variable Year", align = "c")
Frequency Table for the variable Year
ni.year fi.year Ni.year Fi.year
2010 48 0.2 48 0.2
2011 48 0.2 96 0.4
2012 48 0.2 144 0.6
2013 48 0.2 192 0.8
2014 48 0.2 240 1.0

4) Which is the variable with the highest variability?

The variable with index 2, or “volume”, has a higher CV, therefore higher variability.

kable(rownames(Tabcompl_quant[which.max(Tabcompl_quant$CV),]))
x
volume

The “volume” variable has a higher skewness index in absolute value, therefore it is more asymmetric.

kable(rownames(Tabcompl_quant[which.max(abs(Tabcompl_quant$Skewness)),]))
x
volume

5) Division of one of the Quantitative Variables into classes

For the number of classes, Sturges’ formula is used.

S<-round(1+10/3*log(length(sales)))
W<-ceiling((max(sales)-min(sales))/S)
class.sales<-cut(sales,breaks=c(seq(min(sales)-1,max(sales),W),max(sales)))
ni.class.sales<-table(class.sales)
Ni.class.sales<-cumsum(table(class.sales))
fi.class.sales<-table(class.sales)/length(sales)
Fi.class.sales<-cumsum(table(class.sales)/length(sales))
Tabclass.sales<-cbind(ni.class.sales,Ni.class.sales,fi.class.sales,Fi.class.sales)
kable(round(Tabclass.sales, 2), caption = "Frequency Table of the Variable Sales divided into Classes", align = "c")
Frequency Table of the Variable Sales divided into Classes
ni.class.sales Ni.class.sales fi.class.sales Fi.class.sales
(78,97] 19 19 0.08 0.08
(97,116] 26 45 0.11 0.19
(116,135] 29 74 0.12 0.31
(135,154] 21 95 0.09 0.40
(154,173] 23 118 0.10 0.49
(173,192] 21 139 0.09 0.58
(192,211] 19 158 0.08 0.66
(211,230] 10 168 0.04 0.70
(230,249] 12 180 0.05 0.75
(249,268] 11 191 0.05 0.80
(268,287] 13 204 0.05 0.85
(287,306] 13 217 0.05 0.90
(306,325] 5 222 0.02 0.92
(325,344] 5 227 0.02 0.95
(344,363] 5 232 0.02 0.97
(363,382] 4 236 0.02 0.98
(382,401] 1 237 0.00 0.99
(401,420] 2 239 0.01 1.00
(420,423] 1 240 0.00 1.00
barplot(table(class.sales),ylim = c(0,30),xlab = "Sales Classes",ylab = "Abs.Freq.")

There are 19 classes, 18 of which have a width of 19 and the last one has a width of 3. In most months there is a total number of sales belonging to the interval (116, 135], with a probability of 12%.

The Gini index is equal to 0.97, very close to the upper limit value. This means that there is almost an equal distribution of the classes of total sales, as expected, given the low frequency of the most frequent class.

gini.index <- function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J = length(table(x))
  gini = 1 - sum(fi2)
  gini.normalizzato = gini/((J-1)/J)
  return(gini.normalizzato)}

kable(round(gini.index(class.sales),2))
x
0.97

6) Identifying the Gini index for the variable “City”

The Gini index will be equal to 1, since there is an equal distribution of the absolute frequencies of the modes of the city variable.

7) What is the probability that a random row of this dataset reports the city “Beaumont”? And the probability that it reports the month of July? And the probability that it reports the month of December 2012?

p1 <- sum(city=="Beaumont")/nrow(texas)
p2 <- sum(month==7)/nrow(texas)
p3 <- sum(month==12 & year==2012)/nrow(texas)

p1 <- sum(city == "Beaumont") / nrow(texas)
p2 <- sum(month == 7) / nrow(texas)
p3 <- sum(month == 12 & year == 2012) / nrow(texas)

probabilities <- data.frame(
  p1,
  p2,
  p3)

kable(
  round(probabilities, 2),
  align = "c",
  col.names = c("P(city='Beaumont')", "P(month=7)", "P(month=12 & year=2012)"))
P(city=‘Beaumont’) P(month=7) P(month=12 & year=2012)
0.25 0.08 0.02

8) Column with Median Price

texas$mean_price<-(volume/sales)*1000000

9) Create another column that gives an idea of the “effectiveness” of the sales announcements

texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))

To create a column that reflects the effectiveness of sales ads, I calculated, for each observation, the percentage of ads that actually led to a sale.

texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))
ggplot(data = texas) +
  geom_line(aes(x = date, y = efflistings, colour = city)) +
  geom_point(aes(x = date, y = efflistings, colour = city)) +
  labs(
    title = "Time Series of ADS effectiveness by City",
    x = "Period",
    y = "Percentage of Sales on active ADS"
  ) +
  scale_x_date(
    breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
    date_labels = "%m/%Y", 
    expand = expansion(mult = c(0.001, 0.001)) 
  ) +
  theme_classic()

The most effective sales ADS are those in the Bryan-Colege Station city. In general, it is observed that positive and negative trends occur simultaneously in the four cities but with different magnitudes.

10) Creating Summary

summary_texas <- texas %>%
  group_by(year, city) %>%
  summarise(
    media.di.sales = round(mean(sales),2),
    sd.di.volume = round(sd(volume),2))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
kable(summary_texas)
year city media.di.sales sd.di.volume
2010 Beaumont 156.17 4.95
2010 Bryan-College Station 167.58 10.82
2010 Tyler 227.50 8.39
2010 Wichita Falls 123.42 4.07
2011 Beaumont 144.00 4.30
2011 Bryan-College Station 167.42 10.31
2011 Tyler 238.83 9.41
2011 Wichita Falls 106.25 2.52
2012 Beaumont 171.92 4.92
2012 Bryan-College Station 196.75 13.49
2012 Tyler 263.50 10.23
2012 Wichita Falls 112.42 2.66
2013 Beaumont 201.17 6.44
2013 Bryan-College Station 237.83 19.54
2013 Tyler 287.42 10.33
2013 Wichita Falls 121.25 3.11
2014 Beaumont 213.67 7.05
2014 Bryan-College Station 260.25 17.97
2014 Tyler 331.50 12.76
2014 Wichita Falls 117.00 3.13
ggplot(summary_texas, aes(x = year, y = media.di.sales, color = city, group = city)) +
  geom_line(linewidth = 1) +           
  geom_point(size = 2) +           
  labs(
    title = "Average monthly Sales by City and Year",
    x = "Year",
    y = "Average Monthly Sales",
    color = "City"
  ) +
  theme_classic()

There is a decline in average monthly sales in 2011 for all cities in the sample, then an increase every year, except for the city of Wichita Falls, with a decrease in 2014.

GGPLOT Graphs

ggplot(data=texas)+
  geom_boxplot(aes(y=median_price,x=city))+
  labs(title = "Median Price Blox",
       x="City",
       y="Median Price")+
  theme_classic()

The graph shows that the variable “median_price” presents a greater variability for the city of Wichita Falls, where there are also houses with the lowest prices, while the same variable is distributed with a lower variability for the observations relating to the city of Bryan-College Station, where the most expensive houses are present.

ggplot(data=texas)+
  geom_boxplot(aes(y=volume,x=city,fill=as.factor(year)))+
  labs(title = "BoxPlot Total Sales value by City",
       x="City",
       y="Total Sales Value",
       fill="Year")+
  theme_classic()

From the graph, it can be observed that the variable “volume” is characterized by greater variability when it comes to the city of Bryan-College Station. In particular, the total value of sales has a greater variability in the year 2013. Instead, “volume” is less variable for the city of Wichita Falls. Finally, in the year 2012, limited to the city of Beumont, the volume of sales is characterized, in absolute terms, by a lower variability.

for (i in 2010:2014) {
  grafico <- ggplot(data=subset(texas, year == i)) +
    geom_col(aes(x = month, y = sales, fill = city), position = "stack") +
    labs(
      title = paste("Total Sales by Month and City in the year", i),
      x = "Month",
      y = "Total Sales",
      fill = "City"
    ) +
    scale_x_continuous(breaks = seq(1, 12, 1)) +
    scale_y_continuous(breaks = seq(100, 1200, 100)) +
    theme_classic() +
    theme(plot.title = element_text(size = 12))
  print(grafico)
}

for (i in 2010:2014) {
  grafico1 <- ggplot(data=subset(texas, year == i)) +
    geom_col(aes(x = month, y = sales, fill = city), position = "fill") +
    labs(
      title = paste("Total Sales Percentage by Month and City in the year", i),
      x = "Month",
      y = "Percentage of Sales",
      fill = "City"
    ) +
    scale_x_continuous(breaks = seq(1, 12, 1)) +
    scale_y_continuous(breaks = seq(0, 1, 0.05)) +
    theme_classic() +
    theme(plot.title = element_text(size = 10))
  print(grafico1)
}

From the graphs it is noted that for most of the months examined the highest percentage of sales is recorded in the city of Tyler while the lowest in the city of Wichita Falls. Furthermore, the percentages of sales for each city appear to be more or less constant over time. Another consideration to add concerns the fact that sales are concentrated, for each year, during the summer period.

ggplot(data = texas) +
  geom_line(aes(x = date, y = mean_price, colour = city)) +
  geom_point(aes(x = date, y = mean_price, colour = city)) + 
  labs(
    title = "Historical Series of Average Price between Cities",
    x = "Period",
    y = "Average Price",
    colour = "City"
  ) +
  scale_x_date(
    breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
    date_labels = "%m/%y",
    expand = expansion(mult = c(0.01, 0.01)) 
  ) +
  theme_classic()

I chose to represent the average price of sales in each month from 2010 to 2015 for the cities in the dataset. You can see that the highest average prices are in the city of Bryan-College Station while the lowest are in the city of Wichita Falls. At the beginning of 2012 there was a sharp drop in prices that involved all cities. The average price undergoes strong variations, for all cities, undergoes strong variations, represented as broken lines. Often, the variations of the average prices of the cities of Beaumont and Tyler go in the same direction.

Final Aspects

Interesting insights have been extracted from the descriptive analyses conducted on the dataset. First of all, we note the continuous price variations that involve all the cities included in the sample, therefore they are shocks that concern the real estate market itself. On the sales side, it should be noted that the trends are generally increasing, with transactions that are concluded more frequently during the summer months of the year. From the point of view of the individual cities, it is observed that the highest average price is recorded in the city of Bryan-College Station while the lowest is in the city of Wichita Falls. In the city of Tyler, more real estate sales are made than in the others and at the same time, a higher overall turnover in the sector is recorded. Going down a level further, considering the quality of real estate agencies, it is necessary to mention the agents of the city of Bryan-College Station who outperform their colleagues in other cities in terms of percentage of sales on active ADS.