The goal of the project, developed for Texas Realty Insights, is to analyze historical real estate data in the state of Texas to extract statistical insights and create visual representations that can support the company in making strategic decisions.

The analysis focuses on understanding trends in home sales, price distribution, and market dynamics across cities and over time (months and years). By exploring the relationship between key variables such as number of sales, total volume, median price, active listings, and inventory months, the project aims to:

Evaluate the effectiveness of real estate listings and marketing efforts.

Identify cities or periods with growth potential or underperformance.

Help the company optimize property pricing and inventory management.

Deliver clear and data-driven recommendations for sales strategy.

The outcome will be a comprehensive statistical report with summaries, graphs, and interpretations to guide data-informed decision-making in the Texas housing market.

dataframe creation

df <- read.csv("realestate_texas.csv")

packages:

install.packages("moments")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ineq")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(moments)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ineq)

1° assignment, Types of Variables:

“city” (of reference): Qualitative (Nominal), possibility to create a distribution of frequencies and a barplot.

“year” (of reference): Temporal (discrete), time series

“month” (of reference): Temporal (discrete), time series

“sales” (total of sales): Quantitative discrete, position indexes, variability indexes, Shape Indexes

“volume” (total value (in M dollars) of the sales): Quantitative continuous, position indexes, variability indexes, Shape Indexes

“median_price” (median sale price in dollars): Quantitative continuous (eve if for simplicity rounded), position indexes, variability indexes, Shape Indexes

“listing” (total number of active advertisements): Quantitative discrete, position indexes, variability indexes, Shape Indexes

“months_inventory” (months needed to sell all the active advertisements): Quantitative continuous, position indexes, variability indexes, Shape Indexes

2° assignment, calculate indexes for all the variables where is possible,

for the others create a distribution of freq, comment briefly:

Mode function:

mode_value <- function(x) names(which.max(table(x)))

—- sales —-

summary_sales = summary(df$sales)                    # Summary statistics: min, 1st Qu., median, mean, 3rd Qu., max
mean_sales   = mean(df$sales)                        # Mean: average sales
median_sales = median(df$sales)                      # Median: middle value of sales
mode_sales = mode_value(df$sales)                    # Most frequent number of sales
variance_sales = var(df$sales)                       # Variance: how much sales values vary
sd_sales = sd(df$sales)                              # Standard deviation: average distance from the mean
CoV_sales = sd(df$sales) / mean(df$sales)            # CoV: relative measure of variability
skewness_sales = skewness(df$sales)                  # Skewness: asymmetry of the sales distribution
kurtosis_sales = kurtosis(df$sales) - 3              # Excess kurtosis: peakedness vs. normal distribution

Plot density of sales

plot(density(df$sales),                                         # Density plot
     main = "Density of Sales", xlab = "Sales")
abline(v = mean_sales, col = "red", lwd = 2)                    # Red = mean
abline(v = median_sales, col = "green", lwd = 2)                # Green = median
abline(v = mode_sales, col = "blue", lwd = 2)
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)

comments:

looks like it is positively skewed from the density plot

—- volume —-

summary_volume  <- summary(df$volume)                            # Summary statistics
mean_volume     <- mean(df$volume)                               # Mean
median_volume   <- median(df$volume)                             # Median
mode_volume     <- mode_value(df$volume)                         # Mode
variance_volume <- var(df$volume)                                # Variance
sd_volume       <- sd(df$volume)                                 # Standard deviation
CoV_volume      <- sd_volume / mean_volume                       # Coefficient of Variation
skewness_volume <- skewness(df$volume)                           # Skewness
kurtosis_volume <- kurtosis(df$volume) - 3                       # Excess kurtosis

plot(density(df$volume),                                         # Density plot
     main = "Density of Volume", xlab = "Volume")
abline(v = mean_volume, col = "red", lwd = 2)                    # Red = mean
abline(v = median_volume, col = "green", lwd = 2)                # Green = median
abline(v = mode_volume, col = "blue", lwd = 2)                   # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)

comments:

looks like it is positively skewed from the density plot

—- median_price —-

summary_median_price  <- summary(df$median_price)                           # Summary statistics
mean_median_price     <- mean(df$median_price)                              # Mean
median_median_price   <- median(df$median_price)                            # Median
mode_median_price     <- mode_value(df$median_price)                        # Mode
variance_median_price <- var(df$median_price)                               # Variance
sd_median_price       <- sd(df$median_price)                                # Standard deviation
CoV_median_price      <- sd_median_price / mean_median_price                # Coefficient of Variation
skewness_median_price <- skewness(df$median_price)                          # Skewness
kurtosis_median_price <- kurtosis(df$median_price) - 3                      # Excess kurtosis

plot(density(df$median_price),                                              # Density plot
     main = "Density of Median Price", xlab = "Median Price")
abline(v = mean_median_price, col = "red", lwd = 2)                          # Red = mean
abline(v = median_median_price, col = "green", lwd = 2)                      # Green = median
abline(v = mode_median_price, col = "blue", lwd = 2)                         # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)

comments: # looks like it is negatively skewed, also kurtosis is negative, so it is a platykurtic distribution.

—- listings —-

summary_listings  <- summary(df$listings)                          # Summary statistics
mean_listings     <- mean(df$listings)                             # Mean
median_listings   <- median(df$listings)                           # Median
mode_listings     <- mode_value(df$listings)                       # Mode
variance_listings <- var(df$listings)                              # Variance
sd_listings       <- sd(df$listings)                               # Standard deviation
CoV_listings      <- sd_listings / mean_listings                   # Coefficient of Variation
skewness_listings <- skewness(df$listings)                         # Skewness
kurtosis_listings <- kurtosis(df$listings) - 3                     # Excess kurtosis

plot(density(df$listings),                                         # Density plot
     main = "Density of Listings", xlab = "Listings")
abline(v = mean_listings, col = "red", lwd = 2)                    # Red = mean
abline(v = median_listings, col = "green", lwd = 2)                # Green = median
abline(v = mode_listings, col = "blue", lwd = 2)                   # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)

comments:

it looks like there are three modes, anyway the bigger one is 1581. About the skewness, it is moderately positive.

anyway it’s clearly not a normal distribution, because of the three modes. data says that’s platykurtic and positively skewed.

—- months_inventory —-

summary_months_inventory  <- summary(df$months_inventory)                           # Summary statistics
mean_months_inventory     <- mean(df$months_inventory)                              # Mean
median_months_inventory   <- median(df$months_inventory)                            # Median
mode_months_inventory     <- mode_value(df$months_inventory)                        # Mode
variance_months_inventory <- var(df$months_inventory)                               # Variance
sd_months_inventory       <- sd(df$months_inventory)                                # Standard deviation
CoV_months_inventory      <- sd_months_inventory / mean_months_inventory            # Coefficient of Variation
skewness_months_inventory <- skewness(df$months_inventory)                          # Skewness
kurtosis_months_inventory <- kurtosis(df$months_inventory) - 3                      # Excess kurtosis

plot(density(df$months_inventory),                                                  # Density plot
     main = "Density of Months Inventory", xlab = "Months of Inventory")
abline(v = mean_months_inventory, col = "red", lwd = 2)                             # Red = mean
abline(v = median_months_inventory, col = "green", lwd = 2)                         # Green = median
abline(v = mode_months_inventory, col = "blue", lwd = 2)                            # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)

comments: # it is clearly not a perfect normal distribution, it looks like there are three modes again # but it is not so bad like the previous variable. # The skewness is close to 0, and the kurtosis is very slightly negative (-0.17).

—- city (absolute and relative frequency) —-

table(df$city)                        # Absolute frequency of each city
## 
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60
prop.table(table(df$city))            # Relative frequency of each city
## 
##              Beaumont Bryan-College Station                 Tyler 
##                  0.25                  0.25                  0.25 
##         Wichita Falls 
##                  0.25

—- year (absolute and relative frequency) —-

table(df$year)                        # Absolute frequency of each year
## 
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48
prop.table(table(df$year))            # Relative frequency of each year
## 
## 2010 2011 2012 2013 2014 
##  0.2  0.2  0.2  0.2  0.2

—- month (absolute and relative frequency) —-

table(df$month)                       # Absolute frequency of each month
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20
prop.table(table(df$month))           # Relative frequency of each month
## 
##          1          2          3          4          5          6          7 
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 
##          8          9         10         11         12 
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333

comments: each modality in city, year and month is equally represented.

3° assignment, which one is the variable with the highest variability?

so let’s find the variable with highest CoV:

Vector of CoV values (previously calculated)

CoV_values <- c(
  sales            = CoV_sales,
  volume           = CoV_volume,
  median_price     = CoV_median_price,
  listings         = CoV_listings,
  months_inventory = CoV_months_inventory
)

print(sort(CoV_values, decreasing = TRUE))
##           volume         listings            sales months_inventory 
##        0.5370536        0.4330833        0.4142203        0.2506031 
##     median_price 
##        0.1708218

The variable with the highest variability is ‘volume’ with a CoV of 0.53.

3° assignment part two: which variable has the highest skewness?

Vector of skewness values (previously calculated)

skewness_values <- c(
  sales            = skewness_sales,
  volume           = skewness_volume,
  median_price     = skewness_median_price,
  listings         = skewness_listings,
  months_inventory = skewness_months_inventory
)

print(sort(abs(skewness_values), decreasing = TRUE))
##           volume            sales         listings     median_price 
##       0.88474203       0.71810402       0.64949823       0.36455288 
## months_inventory 
##       0.04097527

The variable with the highest variability is ‘volume’ with a skewness of 0.88.

statistical considerations: Data indicates that, in most cases, sales occur in relatively low quantities

and are associated with moderate total volumes, while median housing prices remain relatively high.

This suggests that the market activity is driven by not so big amount of transactions,

each involving properties of relatively high individual value.

4° assignment: select a quantitative variable, divide it in classes and calculate distribution of freq.

# Define 6 intervals using 7 breakpoints for median_price
breaks <- seq(min(df$median_price), max(df$median_price), length.out = 7)
labels <- c("Very Low", "Low", "Mid-Low", "Mid-High", "High", "Very High")

# Assign labeled classes based on median_price
df$price_class <- cut(df$median_price, breaks = breaks, labels = labels, include.lowest = TRUE)

# Frequency table
freq_table <- df %>%
  group_by(price_class) %>%
  summarise(frequency = n()) %>%
  mutate(relative_freq = frequency / sum(frequency))

print(freq_table)
## # A tibble: 6 × 3
##   price_class frequency relative_freq
##   <fct>           <int>         <dbl>
## 1 Very Low           13        0.0542
## 2 Low                33        0.138 
## 3 Mid-Low            37        0.154 
## 4 Mid-High           73        0.304 
## 5 High               68        0.283 
## 6 Very High          16        0.0667
# Bar plot
ggplot(freq_table, aes(x = price_class, y = frequency)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Median Price Frequency Distribution",
       x = "Price Class",
       y = "Frequency") +
  theme_minimal()

# Create frequency table for price_class (absolute counts)
freq_table <- df %>%
  count(price_class) %>%
  arrange(price_class)  # Ensure proper class order

# Extract frequency counts
frequencies <- freq_table$n

# Original Gini function
gini_function <- function(x) {
  x <- sort(x)
  n <- length(x)
  G <- (2 * sum(x * seq_along(x))) / (n * sum(x)) - (n + 1) / n
  return(G)
}

# Calculate Gini index on class distribution
gini_val <- gini_function(frequencies)
cat("Gini Index for 'price_class' distribution:", round(gini_val, 3))
## Gini Index for 'price_class' distribution: 0.319

The median_price variable was divided into six labeled classes ranging from Very Low to Very High. The frequency table shows a clear concentration in the central-upper classes: Mid-High (30.4%) and High (28.3%) together account for 58,7% of the observations. In contrast, the extreme categories (Very Low at 5.4% and Very High at 6.7%) are sparsely represented.

This pattern is visually confirmed by the bar chart, which shows a peak in the mid-to-high range and a decline toward the extremes. This suggests that most houses fall within an intermediate or moderately high median price range, with fewer cases of very low or very high prices.

The Gini index calculated on the class frequency distribution is 0.319, indicating a moderate level of inequality. This value is consistent with the data: while a few classes dominate in terms of frequency, the imbalance is not extreme, as all six categories are represented to some degree.

In summary, the distribution of median_price is skewed toward mid-high price classes, reflecting a moderately uneven but not polarized housing market.

# Summary of ranges per price_class
class_ranges <- df %>%
  group_by(price_class) %>%
  summarise(
    min_value = min(median_price),
    max_value = max(median_price)
  )

print(class_ranges)
## # A tibble: 6 × 3
##   price_class min_value max_value
##   <fct>           <dbl>     <dbl>
## 1 Very Low        73800     91200
## 2 Low             91700    109100
## 3 Mid-Low        109400    126200
## 4 Mid-High       127000    144600
## 5 High           144800    161600
## 6 Very High      163700    180000

assignment 5°: calculation of probabilities:

calculation of probabilities:

probabilities to sample a particular city is 25%:

—- city (absolute and relative frequency) —-

table(df$city)                        # Absolute frequency of each city
## 
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60
prop.table(table(df$city))            # Relative frequency of each city
## 
##              Beaumont Bryan-College Station                 Tyler 
##                  0.25                  0.25                  0.25 
##         Wichita Falls 
##                  0.25

probability to sample a row with month 7 (July) is 8.33%:

—- month (absolute and relative frequency) —-

table(df$month)                       # Absolute frequency of each month
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20
prop.table(table(df$month))           # Relative frequency of each month
## 
##          1          2          3          4          5          6          7 
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 
##          8          9         10         11         12 
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333

probability to sample a row with month 12 pf 2012 os 1.66%:

—- year (absolute and relative frequency) —-

table(df$year)                        # Absolute frequency of each year
## 
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48
prop.table(table(df$year))            # Relative frequency of each year
## 
## 2010 2011 2012 2013 2014 
##  0.2  0.2  0.2  0.2  0.2

then…

0.2*0.0833
## [1] 0.01666

or

4/240
## [1] 0.01666667

assignment 6°:

Create a new column that calculates the average property price using the available variables

df$avg_price <- (df$volume * 1e6) / df$sales

Calculate average house price by city

avg_price_by_city <- aggregate(avg_price ~ city, data = df, mean)
print(avg_price_by_city)
##                    city avg_price
## 1              Beaumont  146640.4
## 2 Bryan-College Station  183534.3
## 3                 Tyler  167676.8
## 4         Wichita Falls  119430.0

Barplot of average prices

barplot(avg_price_by_city$avg_price,
        names.arg = avg_price_by_city$city,
        las = 2,                           # Rotate city names vertically
        col = "blue",
        main = "Average House Price by City",
        ylab = "Average Price ($)")

— efficiency —:

Create a column that measures the effectiveness of property listings (sell-efficiency)

df$sell_efficiency <- df$listings / df$months_inventory

Normalize sell-efficiency (min-max scaling)

df$sell_efficiency_norm <- (
  df$sell_efficiency - min(df$sell_efficiency)
) / (
  max(df$sell_efficiency) - min(df$sell_efficiency)
)

Calculate average normalized sell-efficiency by city

efficiency_by_city_norm <- aggregate(sell_efficiency_norm ~ city, data = df, mean)
print(efficiency_by_city_norm)
##                    city sell_efficiency_norm
## 1              Beaumont           0.29320961
## 2 Bryan-College Station           0.41693448
## 3                 Tyler           0.69298448
## 4         Wichita Falls           0.04451062

Barplot of average normalized sell-efficiency by city

barplot(efficiency_by_city_norm$sell_efficiency_norm,
        names.arg = efficiency_by_city_norm$city,
        las = 2,                               # Rotate city names vertically
        col = "seagreen3",
        main = "Average Normalized Sell Efficiency by City",
        ylab = "Average Normalized Sell Efficiency")

comments:

Bryan-College Station

Highest average price (~$183.5k) with moderate efficiency (average normalized ~0.42).

A solid market with the most expensive properties, but sales occur at a relatively moderate pace.

Tyler

Lower average price (~$167.7k) than Bryan-College Station but highest efficiency (~0.69).

A dynamic market where properties are still quite expensive and sold very quickly. The company performs well here.

Beaumont

relatively low average price (~$146.6k) and low efficiency (~0.29).

A stagnant market with relatively inexpensive properties that are harder to sell. May indicate excess supply or weak demand.

Wichita Falls

Lowest average price (~$119.4k) and minimal efficiency (~0.04).

A depressed market with low-value properties and extremely low turnover. The company likely loses resources here.

Assignments 7°: use dplyr or basic R for conditional statistical analysis per city, year, month.

Generate summaries (means, sd) and make charts.

Sales by city

sales_by_city <- df %>%         # %>% means take the output of the previous function and use it as input for the next one
  group_by(city) %>%
  summarise(
    count = n(),                # Number of observations
    total_sales = sum(sales),   # Total sales
    mean_sales = mean(sales),   # Mean sales
    sd_sales = sd(sales),       # Standard deviation
    min_sales = min(sales),     # Minimum sales
    max_sales = max(sales)      # Maximum sales
  )
print(sales_by_city)
## # A tibble: 4 × 7
##   city                 count total_sales mean_sales sd_sales min_sales max_sales
##   <chr>                <int>       <int>      <dbl>    <dbl>     <int>     <int>
## 1 Beaumont                60       10643       177.     41.5        83       273
## 2 Bryan-College Stati…    60       12358       206.     85.0        89       403
## 3 Tyler                   60       16185       270.     62.0       143       423
## 4 Wichita Falls           60        6964       116.     22.2        79       167

Barplot of mean sales by city

barplot(
  sales_by_city$mean_sales,
  names.arg = sales_by_city$city,
  col = "steelblue",
  main = "Average Sales by City",
  ylab = "Mean Sales",
  xlab = "City",
  las = 2,            # Rotate city names vertically
  cex.names = 0.8     # Adjust label size
)

Barplot of total sales by city

barplot(
  sales_by_city$total_sales,
  names.arg = sales_by_city$city,
  col = "steelblue",
  main = "Total Sales by City",
  ylab = "Total Sales",
  xlab = "City",
  las = 2,            # Rotate city names vertically
  cex.names = 0.8     # Adjust label size
)

comments: Tyler is the city where there are the most of the sales (totally and in average), according to the considerations made in

the code above. BCStation is the second by mean sales and has the highest standard deviation. WFalls

is the city with the minimum amount of sales and also with the minimum mean of sales.

again these results accord with the previous ones, where Tyler is the most dynamic market, followed by BCStation

and WFalls is the most stagnant one, probably because it is closer to the desert and distant from main cities

like Dallas and Houston (maybe??).

Sales by year

sales_by_year <- df %>%
  group_by(year) %>%
  summarise(
    count = n(),                # Number of observations
    total_sales = sum(sales),   # Total sales
    mean_sales = mean(sales),   # Mean sales
    sd_sales = sd(sales),       # Standard deviation
    min_sales = min(sales),     # Minimum sales
    max_sales = max(sales)      # Maximum sales
  )
print(sales_by_year)
## # A tibble: 5 × 7
##    year count total_sales mean_sales sd_sales min_sales max_sales
##   <int> <int>       <int>      <dbl>    <dbl>     <int>     <int>
## 1  2010    48        8096       169.     60.5        83       316
## 2  2011    48        7878       164.     63.9        79       313
## 3  2012    48        8935       186.     70.9        90       322
## 4  2013    48       10172       212.     84.0        79       402
## 5  2014    48       11069       231.     95.5        89       423

Barplot of total sales by year

barplot(
  sales_by_year$total_sales,
  names.arg = sales_by_year$year,
  col = "darkorange",
  main = "N. of Sales by Year",
  ylab = "Total Sales",
  xlab = "Year"
)

comments: looking at the charts it looks like the total sales are increasing over the years, indicating a general

growth of the company (probably they hired more people making more deals) or the market became more fluid. That’s a good

Sales by month

sales_by_month <- df %>%
  group_by(month) %>%
  summarise(
    count = n(),                # Number of observations
    total_sales = sum(sales),   # Total sales
    mean_sales = mean(sales),   # Mean sales
    sd_sales = sd(sales),       # Standard deviation
    min_sales = min(sales),     # Minimum sales
    max_sales = max(sales)      # Maximum sales
  )
print(sales_by_month)
## # A tibble: 12 × 7
##    month count total_sales mean_sales sd_sales min_sales max_sales
##    <int> <int>       <int>      <dbl>    <dbl>     <int>     <int>
##  1     1    20        2548       127.     43.4        79       238
##  2     2    20        2817       141.     51.1        79       244
##  3     3    20        3789       189.     59.2       102       298
##  4     4    20        4234       212.     65.4       111       323
##  5     5    20        4777       239.     83.1       102       388
##  6     6    20        4871       244.     95.0       111       423
##  7     7    20        4715       236.     96.3       104       403
##  8     8    20        4629       231.     79.2       123       357
##  9     9    20        3647       182.     72.5        95       361
## 10    10    20        3598       180.     75.0        97       369
## 11    11    20        3137       157.     55.5        93       300
## 12    12    20        3388       169.     60.7        81       332

Barplot of total sales by month

barplot(
  sales_by_month$total_sales,
  names.arg = sales_by_month$month,
  col = "darkgreen",
  main = "N. of Sales by Month",
  ylab = "Total Sales",
  xlab = "Month"
)

Housing market peaks in late spring and summer probably due to:

Families moving when schools are closed, so no interruption of school for children

Stable weather, ideal for visits and people to relocate.

assignment 8°:

part 1: use boxplot to compare median_price across cities.

boxplot of median_price by city

ggplot(df, aes(x = city, y = median_price)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "Comparison of Median Price by City",
    x = "City",
    y = "Median Price"
  )

comments: Tyler has the highest median price, followed by BCStation and Beaumont. Wichita Falls is the cheapest city.

we have already seen this pattern in the code above (with average price by city above)

but here is more clean and also median price is a more stable measure than average price to outliers.

even if outliers, we can see from the chart, here are not so a problem and boxplots are quite simmetric.

part 2: use ggplot2 to create a barplot to compare total sales across month and cities:

Total sales by city

Summarise total sales by city

sales_by_city <- df %>%
  group_by(city) %>%
  summarise(total_sales = sum(sales))

Barplot

ggplot(sales_by_city, aes(x = city, y = total_sales)) +
  geom_col(fill = "steelblue") +
  labs(title = "Total Sales by City", x = "City", y = "Total Sales")

Summarise total sales by month

sales_by_month <- df %>%
  group_by(month) %>%
  summarise(total_sales = sum(sales))

Barplot

ggplot(sales_by_month, aes(x = factor(month), y = total_sales)) +
  geom_col(fill = "darkgreen") +
  labs(title = "Total Sales by Month", x = "Month", y = "Total Sales")

Summarise total sales by month and city

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
print(sales_by_month_city)
## # A tibble: 48 × 3
## # Groups:   month [12]
##    month city                  total_sales
##    <int> <chr>                       <int>
##  1     1 Beaumont                      608
##  2     1 Bryan-College Station         591
##  3     1 Tyler                         907
##  4     1 Wichita Falls                 442
##  5     2 Beaumont                      677
##  6     2 Bryan-College Station         628
##  7     2 Tyler                        1058
##  8     2 Wichita Falls                 454
##  9     3 Beaumont                      855
## 10     3 Bryan-College Station         949
## # ℹ 38 more rows

let’s create a variable that combines year and month to make a complete time series:

df$date <- as.Date(paste(df$year, df$month, "1", sep = "-")) # "1" means 1st of the month

Grouped barplot

df$year <- format(df$date, "%Y")
sales_by_month_city <- df %>%
  group_by(year, month, city) %>%
  summarise(total_sales = sum(sales), .groups = "drop")
df$month <- format(df$date, "%m")
ggplot(sales_by_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
  geom_col(position = "dodge") +
  labs(
    title = "Total Sales by Month and City",
    x = "Month",
    y = "Total Sales",
    fill = "City"
  ) +
  facet_wrap(~year)

comments: It looks like BCStation and Tyler are more sensible to seasons rather than the other two cities

above all WFalls but can’t understand why. Probably due to weather reasons but I won’t deep more into it.

part 3: line charts for time series:

Total sales by year

sales_by_year <- df %>%
  group_by(year) %>%
  summarise(total_sales = sum(sales))
print(sales_by_year)
## # A tibble: 5 × 2
##   year  total_sales
##   <chr>       <int>
## 1 2010         8096
## 2 2011         7878
## 3 2012         8935
## 4 2013        10172
## 5 2014        11069

Line-Chart:

# Ensure 'year' is numeric to allow line connection
sales_by_year$year <- as.numeric(as.character(sales_by_year$year))

# Create line plot of total sales per year
ggplot(sales_by_year, aes(x = year, y = total_sales)) +
  geom_line(color = "blue", size = 1.2) +       # Line connecting the points
  geom_point(color = "blue") +                  # Points on each year
  labs(
    title = "Total Sales per Year",             # Title of the plot
    x = "Year",                                 # X-axis label
    y = "Total Sales"                           # Y-axis label
  ) +
  scale_x_continuous(breaks = unique(sales_by_year$year)) +  # Show all years on x-axis
  theme_minimal()                                # Clean theme
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

comments: as we have seen before above in the code sales did increase over years.

Total sales by month

sales_by_month <- df %>%
  group_by(month) %>%
  summarise(total_sales = sum(sales))
print(sales_by_month)
## # A tibble: 12 × 2
##    month total_sales
##    <chr>       <int>
##  1 01           2548
##  2 02           2817
##  3 03           3789
##  4 04           4234
##  5 05           4777
##  6 06           4871
##  7 07           4715
##  8 08           4629
##  9 09           3647
## 10 10           3598
## 11 11           3137
## 12 12           3388

Create a line chart to show total sales by month

sales_by_month$month <- as.numeric(sales_by_month$month)


ggplot(sales_by_month, aes(x = month, y = total_sales)) +
  geom_line(color = "darkgreen", size = 1.2) +   
  geom_point(color = "darkgreen") +              
  scale_x_continuous(breaks = 1:12) +            # 12 months on x-axis
  labs(
    title = "Total Sales per Month",             
    x = "Month",                                
    y = "Total Sales"                         
  )

comments: as we have seen before above in the code sales increase during spring-summer and have a minimum in Genuary.

probably due to weather conditions.

Group data by time and calculate total sales globally (across all of cities)

sales_over_time <- df %>%
  group_by(date) %>%
  summarise(total_sales = sum(sales))
print(sales_over_time)
## # A tibble: 60 × 2
##    date       total_sales
##    <date>           <int>
##  1 2010-01-01         421
##  2 2010-02-01         487
##  3 2010-03-01         755
##  4 2010-04-01         916
##  5 2010-05-01         931
##  6 2010-06-01         866
##  7 2010-07-01         712
##  8 2010-08-01         738
##  9 2010-09-01         598
## 10 2010-10-01         565
## # ℹ 50 more rows

Create a time series line chart of total sales over time

ggplot(sales_over_time, aes(x = date, y = total_sales)) +
  geom_line(color = "purple", size = 1.2) +
  geom_point(color = "purple") +                
  labs(
    title = "Total Sales Over Time",            
    x = "Date (Year-Month)",                    
    y = "Total Sales"                           
  )

comments: now it is clear, sales increase over seasons and over years.

Total sales by city and time

sales_time_city <- df %>%
  group_by(city, date) %>%
  summarise(total_sales = sum(sales))
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
print(sales_time_city)
## # A tibble: 240 × 3
## # Groups:   city [4]
##    city     date       total_sales
##    <chr>    <date>           <int>
##  1 Beaumont 2010-01-01          83
##  2 Beaumont 2010-02-01         108
##  3 Beaumont 2010-03-01         182
##  4 Beaumont 2010-04-01         200
##  5 Beaumont 2010-05-01         202
##  6 Beaumont 2010-06-01         189
##  7 Beaumont 2010-07-01         164
##  8 Beaumont 2010-08-01         174
##  9 Beaumont 2010-09-01         124
## 10 Beaumont 2010-10-01         150
## # ℹ 230 more rows

Line chart: one line per city

ggplot(sales_time_city, aes(x = date, y = total_sales, color = city)) +
  geom_line(size = 0.3) +
  geom_point() +
  labs(
    title = "Monthly Sales Over Time by City",
    x = "Date",
    y = "Total Sales"
  )

comments: the numbers of sales are increasing over time but not thanks to WFalls, only the other three cities

are developing a bigger market (or company did put more effort into them over time).

The very interesting thing is seeing that BCStation has the biggest variation over seasons in term of sales.

Doing some research online taking curiosity by the name I have seen that’s a very important university city

compared to the other ones, the logical problem is that students rent houses, they don’t buy/sell them.

this doesn’t explain then the oddity of the data (message to reader: if you know the reason please let me know).

let’s see the variable median price over time by city to see if there are some trends:

Group by city and date, then calculate the median_price for each point in time.

trend_price_city <- df %>%
  group_by(city, date) %>%
  summarise(median_price)
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
print(trend_price_city)
## # A tibble: 240 × 3
## # Groups:   city [4]
##    city     date       median_price
##    <chr>    <date>            <dbl>
##  1 Beaumont 2010-01-01       163800
##  2 Beaumont 2010-02-01       138200
##  3 Beaumont 2010-03-01       122400
##  4 Beaumont 2010-04-01       123200
##  5 Beaumont 2010-05-01       123100
##  6 Beaumont 2010-06-01       122800
##  7 Beaumont 2010-07-01       124300
##  8 Beaumont 2010-08-01       136800
##  9 Beaumont 2010-09-01       121100
## 10 Beaumont 2010-10-01       138500
## # ℹ 230 more rows

Line chart: median_price over time, one line per city

ggplot(trend_price_city, aes(x = date, y = median_price, color = city)) +
  geom_line(size = 0.3) +
  geom_point() +
  labs(
    title = "Median Price Over Time by City",
    x = "Date",
    y = "Median Price"
  )

comments:

Just by eyes it looks like after 2012’s low peaks, median prices of the cities but Beaumont are growing…

BCStation and Tyler median prices at the end reach values evidently higher than initial ones (before 2012), instead WFalls is more difficult

to be interpreted cause of the big variation it has over time.

Anyways, median prices of the two cities with the most number of sales (also growing over time, as we have seen before) are growing.

That means that the market for those cities is becoming more fluid, also houses more expensive.

The company is doing well in those cities so they should keep bet on them.