Analysis of dataframe’s structure and dimensions

my_table <- df %>%
  summarise(across(where(is.numeric),
                   list(
                     Min = ~min(.x, na.rm = TRUE),
                     Q1 = ~quantile(.x, 0.25, na.rm = TRUE),
                     Median = ~median(.x, na.rm = TRUE),
                     Mean = ~mean(.x, na.rm = TRUE),
                     Q3 = ~quantile(.x, 0.75, na.rm = TRUE),
                     Max = ~max(.x, na.rm = TRUE)
                   ))) %>%
  pivot_longer(
    everything(),
    names_to = c("Variables", ".value"),
    names_pattern = "(.+)_(Min|Q1|Median|Mean|Q3|Max)"
  )
my_table %>%
  mutate(across(-Variables, ~round(as.numeric(.), 2))) %>%
  kable(caption = "Variable analysis") %>%
  kable_styling(full_width = FALSE)

Variable analysis
Variables	Min	Q1	Median	Mean	Q3	Max
year	2010.00	2011.00	2012.00	2012.00	2013.00	2014.00
month	1.00	3.75	6.50	6.50	9.25	12.00
sales	79.00	127.00	175.50	192.29	247.00	423.00
volume	8.17	17.66	27.06	31.01	40.89	83.55
median_price	73800.00	117300.00	134500.00	132665.42	150050.00	180000.00
listings	743.00	1026.50	1618.50	1738.02	2056.00	3296.00
months_inventory	3.40	7.80	8.95	9.19	10.95	14.90

kable(data.frame(Dimension = c("Rows", "Columns"), Value = dim(df)),
      caption = "Dimension of dataset") %>%
  kable_styling(full_width = FALSE)

Dimension of dataset
Dimension	Value
Rows	240
Columns	8

Quick analysis of the dataset’s variables

kable(head(df), caption = "Firs rows of dataset") %>%
  kable_styling(full_width = FALSE)

Firs rows of dataset
city	year	month	sales	volume	median_price	listings	months_inventory	month_year
Beaumont	2010	1	83	14.162	163800	1533	9.5	1 / 2010
Beaumont	2010	2	108	17.690	138200	1586	10.0	2 / 2010
Beaumont	2010	3	182	28.701	122400	1689	10.6	3 / 2010
Beaumont	2010	4	200	26.819	123200	1708	10.6	4 / 2010
Beaumont	2010	5	202	28.833	123100	1771	10.9	5 / 2010
Beaumont	2010	6	189	27.219	122800	1803	11.1	6 / 2010

The dataset contains 240 rows and 8 variables:

-‘city’ nominal qualitative variable

-‘year’ continuous quantitative to be treated as ordinal qualitative variable

-‘month’ nominal qualitative variable coded with numbers

-‘sales’ discrete quantitative variable

-‘volume’ continuous quantitative variable on a ratio scale

-‘median_price’ continuous quantitative variable on a ratio scale

-‘listings’ discrete quantitative variable

-‘months_inventory’ continuous quantitative variable on a ratio scale

I decided to add the ‘month_year’ variable so I can do analysis on time dimension

We can do analysis on the distribution of quantitative variables by qualitative variables

Position, variability and shape indexes for quantitative variables

We would like to analyse the distribution of quantitative variables, especially ‘sales’ and ‘volume’; with that we can see the sales, both in quantity and in amount

stats_table <- data.frame(
  Variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  Mean = round(c(mean(sales), mean(volume), mean(median_price), mean(listings), mean(months_inventory)), 2),
  SD = round(c(sd(sales), sd(volume), sd(median_price), sd(listings), sd(months_inventory)), 2),
  IQR = round(c(IQR(sales), IQR(volume), IQR(median_price), IQR(listings), IQR(months_inventory)), 2),
  Skewness = round(c(skewness(sales), skewness(volume), skewness(median_price), skewness(listings), skewness(months_inventory)), 2),
  Kurtosis = round(c(kurtosis(sales)-3, kurtosis(volume)-3, kurtosis(median_price)-3, kurtosis(listings)-3, kurtosis(months_inventory)-3), 2)
)

kable(stats_table, caption = "Statistical indexes") %>%
  kable_styling(full_width = FALSE)

Statistical indexes
Variable	Mean	SD	IQR	Skewness	Kurtosis
sales	192.29	79.65	120.00	0.72	-0.31
volume	31.01	16.65	23.23	0.88	0.18
median_price	132665.42	22662.15	32750.00	-0.36	-0.62
listings	1738.02	752.71	1029.50	0.65	-0.79
months_inventory	9.19	2.30	3.15	0.04	-0.17

Most of the variables have positive skewness and are platicurtic with the exception of ‘median_price’ that has negative skewness and ‘volume’ that is leptocurtic

The positive skeweness indicates that the lowest values are more frequent

If we take as an example the ‘sales’ variabile we can see a good amount of variance between datas with an extense range of datas.

Graphic analysis for ‘volume’

boxplot(volume, col = "lightblue",
        main = "Volume Boxplot (mln)",
        ylab = "Volume") # few outliers

hist(volume, probability = T, col = "lightblue",
     main = "Volume trend analysis (mln)",
     xlab = "Volume",
     ylab = "Density")
lines(density(volume)) # we can clearly see the positive skewness

I analysed the ‘volume’ variable with 2 graphs: a boxplot to see the quantiles and the outliers and a histogram (with the density curve) to see the frequency for each class (to see the positive skewness)

‘City’, ‘year’, ‘month’ and ‘month_year’ frequency

[[1]]

City frequencies
city	Freq
Beaumont	60
Bryan-College Station	60
Tyler	60
Wichita Falls	60

[[2]]

Year frequencies
year	Freq
2010	48
2011	48
2012	48
2013	48
2014	48

[[3]]

Month frequencies
month	Freq
1	20
2	20
3	20
4	20
5	20
6	20
7	20
8	20
9	20
10	20
11	20
12	20

Month-Year frequencies (pivoted)
Month	2010	2011	2012	2013	2014
1	4	4	4	4	4
2	4	4	4	4	4
3	4	4	4	4	4
4	4	4	4	4	4
5	4	4	4	4	4
6	4	4	4	4	4
7	4	4	4	4	4
8	4	4	4	4	4
9	4	4	4	4	4
10	4	4	4	4	4
11	4	4	4	4	4
12	4	4	4	4	4

Comparisons between variables for variability and skewness

It is essential to analyze the variability of the variables in order to understand how much the data deviate from the central value (e.g. to identify anomalous behaviour). Low variability indicates relatively homogeneous observations, while high dispersion signals the presence of heterogeneity

Coefficient of variation function; it is necessary in order to compare variables with different units of measurement

CV <- function(datas, column){
  sd_v = sd(datas[[column]])
  mean_v = mean(datas[[column]])
  
  round(sd_v/mean_v*100, 2)
}

Comparison between some variables to see the one with the highest variability

cols <- c("sales", "volume", "median_price", "listings", "months_inventory")

cv_table <- data.frame(
  Variable = cols,
  CV_percent = unname(sapply(cols, CV, datas=df))
)

kable(cv_table, caption = "Coefficient of Variation (%)", align = "c") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))

Coefficient of Variation (%)
Variable	CV_percent
sales	41.42
volume	53.71
median_price	17.08
listings	43.31
months_inventory	25.06

Histograms of the 5 variables above (to analyse the skewness)

par(mfrow = c(3, 2))
hist(sales, col = "lightblue")
hist(volume, col = "lightgreen")
hist(median_price, col = "lightgrey")
hist(listings, col = "brown")
hist(months_inventory, col = "yellow")
par(mfrow = c(1, 1))

In order to compare the variability for columns with different unit of measurement we must use the coefficient of variation

From a statistical point of view the variable ‘volume’ is the one whose values vary greatly relative to the average

If we see above the skewness functions (where I executed the position, variability and shape indexes) and the histograms, ‘volume’ is the variable with the highest skewness

Classes for ‘median_price’

By dividing ‘median_price’ into classes, we can group them to identify those with the highest frequency. In Texas, average prices are concentrated in the center of the distribution

median_price_class = cut(median_price, breaks = 10, labels = FALSE)
breaks_vec <- seq(min(median_price), max(median_price), length.out=11)

Barplot for ‘median_price’ divided in 10 classes

label <- sapply(1:(length(breaks_vec)-1), function(i){
  paste0(round(breaks_vec[i]/1000), "-", round(breaks_vec[i+1]/1000), "k")
})

barplot(table(median_price_class), names.arg = label, las=2, col = "lightblue",
        main = "Barplot for median_price in classes",
        ylab = "Frequency")

Calculation of Gini index

sum_median_price_class = sum(table(median_price_class))
freq_rel_median_price = round(table(median_price_class)/sum_median_price_class, digits = 3)
Gini <- 1-sum(freq_rel_median_price^2)

cat("Gini Index:", round(Gini, 2), "%") # 0.86 -> high eterogenity

Gini Index: 0.86 %

Variable ‘median_price’, divided in classes and plotted with a barplot, has high heterogeneity.

It can be seen also through the Gini index (0.86).

Probabilities

Probability for ‘city’ = ‘Beaumont’

P_Beaumont = sum(city == 'Beaumont')/length(city)
cat("Probability Beaumont:", round(P_Beaumont, 2), "%") # P=1/4

Probability Beaumont: 0.25 %

Probability for ‘month’ = ‘July’ (7)

P_July = sum(month == 7)/length(month)
cat("Probability July:", round(P_July, 2), "%") # P=1/12

Probability July: 0.08 %

Probability for ‘month’ = ‘December’ (12) and ‘year’ = 2012

P_Dec_2012 = sum((month == 7) & (year == 2012))/nrow(df)
cat("Probability Dec_2012:", round(P_Dec_2012, 2), "%") # P=1/60

Probability Dec_2012: 0.02 %

Adding new variables

Variable ‘mean_price’ calculated by ‘volume’/‘sales’

Variable ‘listings_efficiency’ calculated by ‘sales’/‘listings’. We can obtain how well listings have performed

Visualize the results of the new variables

‘Listings_efficiency’ is a perfect indicator of how well the real estate market is performing in Texas. A high value indicates a fluid and responsive market, where demand is ableto quickly absorb supply. Conversely, low efficiency suggests that many listings remain unsold, indicating possible imbalances, such as uncompetitive prices or weak demand

hist(df$mean_price, col = "lightgreen",
     main = "Histogram of mean_prices",
     xlab = "Mean_prices")

max_val <- max(df$listings_efficiency, na.rm = TRUE)
idx_max <- which.max(df$listings_efficiency)
city_max <- df$city[idx_max]
month_year_max <- df$month_year[idx_max]

boxplot(df$listings_efficiency, col = "lightblue",
        main = "Listings_efficiency Boxplot",
        xlab = "Listings_efficiency",
        horizontal = TRUE)
points(max_val, 1, col="red", pch=19, cex=1.5)
text(max_val - 0.08, 1.1, labels = paste0(round(max_val,2), " (", city_max, ", ", month_year_max, ")"), col="red")

From the histogram we can see that ‘mean_prices’ between 140k and 170k are the ones with the highest frequency.

From the boxplot we can see the distribution of ‘listings_efficiency’, calculated by ‘sales’/‘listings’.

The maximum value for ‘listings_efficency’ was registrated in 07/2014 for Bryan-College Station where there were 39 sales for every 100 listings.

Statistical analysis

With the analysis below we can see different aspects of the market. We can see which city has the highest amount of sales, which year has the highest quantity of sales and which month has the highest listings. This way we can decide in which city we can decide to buy a new house for example, or in which period of the year

‘Volume’ by cities

We can clearly see the cities with higher variability (Tyler and Bryan-College Station)

volume_per_city <- df %>%
  group_by(city) %>%
  summarise(mean_val_city = mean(volume),
            dev_std_city = sd(volume)
  ) %>%
  mutate(label = paste0(city, " (SD=", round(dev_std_city, 2), ")"))

kable(
  volume_per_city %>% select(label, mean_val_city),
  caption = "Mean Volume per City with SD",
  col.names = c("City (SD)", "Mean Volume"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )

Mean Volume per City with SD
City (SD)	Mean Volume
Beaumont (SD=6.97)	26.13
Bryan-College Station (SD=17.25)	38.19
Tyler (SD=13.11)	45.77
Wichita Falls (SD=3.24)	13.93

ggplot(df, aes(x = volume, color = city, fill = city)) +
  geom_density(alpha = 0.3) +
  geom_vline(data = volume_per_city, aes(xintercept = mean_val_city, color = city),
             linetype = "dashed", size = 1) +
  scale_color_manual(
    values = scales::hue_pal()(nrow(volume_per_city)),
    labels = volume_per_city$label
  ) +
  labs(x = "Volume", y = "Density",
       title = "Density curve for cities with highlighting means") +
  theme_minimal()

‘Sales’ by years

We can see the variability between the years and the trend over time; higher variability and sales (in average) for 2013 and 2014

sales_per_year <- df %>%
  group_by(year) %>%
  summarise(mean_val_year = mean(sales),
            dev_std_year = sd(sales),
            .groups = "drop"
            ) %>%
  mutate(label = paste0(year, " (SD=", round(dev_std_year, 2), ")"))

kable(
  sales_per_year %>% select(label, mean_val_year),
  caption = "Mean Sales per Year with SD",
  col.names = c("Year (SD)", "Mean Sales"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )

Mean Sales per Year with SD
Year (SD)	Mean Sales
2010 (SD=60.54)	168.67
2011 (SD=63.87)	164.12
2012 (SD=70.91)	186.15
2013 (SD=84)	211.92
2014 (SD=95.51)	230.60

ggplot(df, aes(x=year, y=sales)) +
  geom_jitter(width=0.2, alpha=0.5) +
  stat_summary(fun=mean, geom="point", color="red", size=3) +
  labs(title="Volume per città con punti e media")

‘Listings’ by months

We can see the 4° month as the one with the highest mean and the 7° with the highest SD

listings_per_month <- df %>%
  group_by(month) %>%
  summarise(mean_val_month = mean(listings),
            dev_std_month = sd(listings),
            .groups = "drop"
            ) %>%
  mutate(
    label = paste0(month, " (SD=", round(dev_std_month, 2), ")")
  )

kable(
  listings_per_month %>% select(label, mean_val_month),
  caption = "Mean Listings per Month with SD",
  col.names = c("Month (SD)", "Mean Listings"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )

Mean Listings per Month with SD
Month (SD)	Mean Listings
1 (SD=704.61)	1647.05
2 (SD=711.2)	1692.50
3 (SD=727.35)	1756.70
4 (SD=770.43)	1825.70
5 (SD=790.22)	1823.85
6 (SD=811.63)	1833.25
7 (SD=826.72)	1821.20
8 (SD=815.87)	1786.30
9 (SD=802.66)	1748.90
10 (SD=779.16)	1710.35
11 (SD=741.25)	1652.70
12 (SD=692.57)	1557.75

boxplot(listings~month, col = "lightblue",
        main = "Boxplot of listings over months",
        xlab = "Month", ylab = "Listings")

Boxplot to compare ‘median_price’ by city

‘Median_price’ has increased during the years. Given that, we can decide in which city of Texas is better to invest if we would like to buy a new house

Bryan-College Station has the highest median prices, while Wichita Falls has the lowest median prices.

ggplot(df, aes(x=city, y=median_price, fill=factor(year))) +
  geom_boxplot() +
  labs(x="City", y="Median_price",
       title="Boxplot of median_price by city for each year")

Barplots: ‘sales’ by ‘month’ and ‘sales’ by ‘city’ for each ‘year’

Same analysis can be made with the visualization below, that can lead to same conclusions; cities with lower prices registered lower sales. A special case could be Bryan-College Station that, even if prices are higher tha Beaumont, registered a higher quantity of sales for each year, maybe due to some attractions in the city or due to more professional occasions

The highest amount of sales are between May and August and for Tyler, while the lowest sales are in January and for Wichita Falls

I also created a stacked barplot to see the ‘sales’ bu ‘city’ and ‘month’ (Tyler and Bryan-College Station have the highest values)

ggplot(df, aes(x=month, y=sales)) +
  geom_col(aes(fill = factor(year))) +
  labs(x="Month", y="Sales",
       title="Barplot of sales by month for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  scale_x_continuous(
    breaks = 1:12,
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

ggplot(df, aes(x=month, y=sales, fill=city)) +
  geom_bar(stat="identity") +
  labs(x="Month", y="Sales",
       title="Stacked barplot of sales by month and city for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  scale_x_continuous(
    breaks = 1:12,
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

ggplot(df, aes(x=city, y=sales)) +
  geom_col(fill="violet") +
  labs(x="City", y="Sales",
       title="Barplot of sales by city for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

Linecharts for ‘sales’ trend over the years by ‘city’

For every city, the 2014 and 2013 where the years with highest sales

df$month_year <- gsub(" ", "", df$month_year)
df$date <- as.Date(paste0("01/", df$month_year), format = "%d/%m/%Y")

ggplot(df, aes(x = df$date, y = sales, color = city, group = city)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Monthly sales trend by city",
    x = "Month",
    y = "Sales",
    color = "City"
  ) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  theme_minimal()

Linecharts of ‘volume’ by ‘year’ (in average) for every ‘city’

By examining volume over time for each city, we can illustrate how different markets develop and change. Not all cities follow the same trajectory: most of them show a gradual growth, while Wichita Falls go through phases of expansion and contraction. These differences could reflect local dynamics, such as variations in demand, prices, or market attractiveness

They all seem to have an upward trend. Tyler has the highest values, while Whicita Falls has a great drop in 2011

par(mfrow = c(2,2))
df_Beaumont <- subset(df, city == "Beaumont")
df_agg_Beaumont <- aggregate(volume ~ year, data = df_Beaumont, FUN = mean)
plot(df_agg_Beaumont$year, df_agg_Beaumont$volume, type="l", col="blue", lwd=2,
     main="Beaumont city",
     xlab="Year", ylab="Volume (mln)")

df_Bryan_College_Station <- subset(df, city == "Bryan-College Station")
df_agg_Bryan_College_Station <- aggregate(volume ~ year, data = df_Bryan_College_Station, FUN = mean)
plot(df_agg_Bryan_College_Station$year, df_agg_Bryan_College_Station$volume, type="l", col="red", lwd=2,
     main="Bryan-College Station city",
     xlab="Year", ylab="Volume (mln)")

df_Tyler <- subset(df, city == "Tyler")
df_agg_Tyler <- aggregate(volume ~ year, data = df_Tyler, FUN = mean)
plot(df_agg_Tyler$year, df_agg_Tyler$volume, type="l", col="green", lwd=2,
     main="Tyler city",
     xlab="Year", ylab="Volume (mln)")

df_Wichita_Falls <- subset(df, city == "Wichita Falls")
df_agg_Wichita_Falls <- aggregate(volume ~ year, data = df_Wichita_Falls, FUN = mean)
plot(df_agg_Wichita_Falls$year, df_agg_Wichita_Falls$volume, type="l", col="orange", lwd=2,
     main="Wichita Falls city",
     xlab="Year", ylab="Volume (mln)")

par(mfrow = c(1,1))

————————–

CONCLUSIONS

————————–

From the historical trends, volumes of sales (in mln) increased from 2010 to 2014 for every city, with Tyler having the highest values and Wichita Falls the lowest values.

We can also highlight a fall in volumes for Wichita Falls in 2011.

If we compare monthly every year for every city, quantity of sales have the highest values in 2013 and 2014, with a peak in the central months (May-Aug) for Bryan-College Station.

The efficacy of listings (that become sales) is between 5% and 20% for almost every observation, with few outliers.

We can highlight the efficacy of 39% for Bryan-College Station in July 2014. That is the maximum value registered.

The highest classes of median price registered were 127k-138k and 148k-159k. These amounts are nearer to the highest value of median price than to the lowest.

Month	2010	2011	2012	2013	2014
1	4	4	4	4	4
2	4	4	4	4	4
3	4	4	4	4	4
4	4	4	4	4	4
5	4	4	4	4	4
6	4	4	4	4	4
7	4	4	4	4	4
8	4	4	4	4	4
9	4	4	4	4	4
10	4	4	4	4	4
11	4	4	4	4	4
12	4	4	4	4	4

Month	2010	2011	2012	2013	2014
1	4	4	4	4	4
2	4	4	4	4	4
3	4	4	4	4	4
4	4	4	4	4	4
5	4	4	4	4	4
6	4	4	4	4	4
7	4	4	4	4	4
8	4	4	4	4	4
9	4	4	4	4	4
10	4	4	4	4	4
11	4	4	4	4	4
12	4	4	4	4	4

Real Estate Analysis on Texas

Tommaso Scortecci

27/03/2026

Analysis of dataframe’s structure and dimensions

Quick analysis of the dataset’s variables

The dataset contains 240 rows and 8 variables:

-‘city’ nominal qualitative variable

-‘year’ continuous quantitative to be treated as ordinal qualitative variable

-‘month’ nominal qualitative variable coded with numbers

-‘sales’ discrete quantitative variable

-‘volume’ continuous quantitative variable on a ratio scale

-‘median_price’ continuous quantitative variable on a ratio scale

-‘listings’ discrete quantitative variable

-‘months_inventory’ continuous quantitative variable on a ratio scale

I decided to add the ‘month_year’ variable so I can do analysis on time dimension

We can do analysis on the distribution of quantitative variables by qualitative variables

Position, variability and shape indexes for quantitative variables

We would like to analyse the distribution of quantitative variables, especially ‘sales’ and ‘volume’; with that we can see the sales, both in quantity and in amount

Most of the variables have positive skewness and are platicurtic with the exception of ‘median_price’ that has negative skewness and ‘volume’ that is leptocurtic

The positive skeweness indicates that the lowest values are more frequent

If we take as an example the ‘sales’ variabile we can see a good amount of variance between datas with an extense range of datas.

Graphic analysis for ‘volume’

I analysed the ‘volume’ variable with 2 graphs: a boxplot to see the quantiles and the outliers and a histogram (with the density curve) to see the frequency for each class (to see the positive skewness)

‘City’, ‘year’, ‘month’ and ‘month_year’ frequency

Comparisons between variables for variability and skewness

It is essential to analyze the variability of the variables in order to understand how much the data deviate from the central value (e.g. to identify anomalous behaviour). Low variability indicates relatively homogeneous observations, while high dispersion signals the presence of heterogeneity

Coefficient of variation function; it is necessary in order to compare variables with different units of measurement

Comparison between some variables to see the one with the highest variability

Histograms of the 5 variables above (to analyse the skewness)

In order to compare the variability for columns with different unit of measurement we must use the coefficient of variation

From a statistical point of view the variable ‘volume’ is the one whose values vary greatly relative to the average

If we see above the skewness functions (where I executed the position, variability and shape indexes) and the histograms, ‘volume’ is the variable with the highest skewness

Classes for ‘median_price’

By dividing ‘median_price’ into classes, we can group them to identify those with the highest frequency. In Texas, average prices are concentrated in the center of the distribution

Barplot for ‘median_price’ divided in 10 classes

Calculation of Gini index

Variable ‘median_price’, divided in classes and plotted with a barplot, has high heterogeneity.

It can be seen also through the Gini index (0.86).

Probabilities

Probability for ‘city’ = ‘Beaumont’

Probability for ‘month’ = ‘July’ (7)

Probability for ‘month’ = ‘December’ (12) and ‘year’ = 2012

Adding new variables

Variable ‘mean_price’ calculated by ‘volume’/‘sales’

Variable ‘listings_efficiency’ calculated by ‘sales’/‘listings’. We can obtain how well listings have performed

Visualize the results of the new variables

From the histogram we can see that ‘mean_prices’ between 140k and 170k are the ones with the highest frequency.

From the boxplot we can see the distribution of ‘listings_efficiency’, calculated by ‘sales’/‘listings’.

The maximum value for ‘listings_efficency’ was registrated in 07/2014 for Bryan-College Station where there were 39 sales for every 100 listings.

Statistical analysis

‘Volume’ by cities

We can clearly see the cities with higher variability (Tyler and Bryan-College Station)

‘Sales’ by years

We can see the variability between the years and the trend over time; higher variability and sales (in average) for 2013 and 2014

‘Listings’ by months

We can see the 4° month as the one with the highest mean and the 7° with the highest SD

Boxplot to compare ‘median_price’ by city

‘Median_price’ has increased during the years. Given that, we can decide in which city of Texas is better to invest if we would like to buy a new house

Bryan-College Station has the highest median prices, while Wichita Falls has the lowest median prices.

Barplots: ‘sales’ by ‘month’ and ‘sales’ by ‘city’ for each ‘year’

The highest amount of sales are between May and August and for Tyler, while the lowest sales are in January and for Wichita Falls

I also created a stacked barplot to see the ‘sales’ bu ‘city’ and ‘month’ (Tyler and Bryan-College Station have the highest values)

Linecharts for ‘sales’ trend over the years by ‘city’

For every city, the 2014 and 2013 where the years with highest sales

Linecharts of ‘volume’ by ‘year’ (in average) for every ‘city’

They all seem to have an upward trend. Tyler has the highest values, while Whicita Falls has a great drop in 2011

————————–

CONCLUSIONS

————————–

From the historical trends, volumes of sales (in mln) increased from 2010 to 2014 for every city, with Tyler having the highest values and Wichita Falls the lowest values.

We can also highlight a fall in volumes for Wichita Falls in 2011.

If we compare monthly every year for every city, quantity of sales have the highest values in 2013 and 2014, with a peak in the central months (May-Aug) for Bryan-College Station.

The efficacy of listings (that become sales) is between 5% and 20% for almost every observation, with few outliers.

We can highlight the efficacy of 39% for Bryan-College Station in July 2014. That is the maximum value registered.

The highest classes of median price registered were 127k-138k and 148k-159k. These amounts are nearer to the highest value of median price than to the lowest.

Month	2010	2011	2012	2013	2014
1	4	4	4	4	4
2	4	4	4	4	4
3	4	4	4	4	4
4	4	4	4	4	4
5	4	4	4	4	4
6	4	4	4	4	4
7	4	4	4	4	4
8	4	4	4	4	4
9	4	4	4	4	4
10	4	4	4	4	4
11	4	4	4	4	4
12	4	4	4	4	4