Exploratory Analysis of the Texas Real Estate Market

Task 1:

Load the “Real Estate Texas.csv” dataset into an R dataframe named df and display its head to verify the loading.

df <- read.csv('Real Estate Texas.csv')
head(df)

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1

Inspect structure, data types and initial descriptive statistics of the loaded dataset, using str(df), summary(df), and dplyr::glimpse(df).

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

str(df)

## 'data.frame':    240 obs. of  8 variables:
##  $ city            : chr  "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
##  $ year            : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sales           : int  83 108 182 200 202 189 164 174 124 150 ...
##  $ volume          : num  14.2 17.7 28.7 26.8 28.8 ...
##  $ median_price    : num  163800 138200 122400 123200 123100 ...
##  $ listings        : int  1533 1586 1689 1708 1771 1803 1857 1830 1829 1779 ...
##  $ months_inventory: num  9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...

summary(df)

##      city                year          month           sales      
##  Length:240         Min.   :2010   Min.   : 1.00   Min.   : 79.0  
##  Class :character   1st Qu.:2011   1st Qu.: 3.75   1st Qu.:127.0  
##  Mode  :character   Median :2012   Median : 6.50   Median :175.5  
##                     Mean   :2012   Mean   : 6.50   Mean   :192.3  
##                     3rd Qu.:2013   3rd Qu.: 9.25   3rd Qu.:247.0  
##                     Max.   :2014   Max.   :12.00   Max.   :423.0  
##      volume        median_price       listings    months_inventory
##  Min.   : 8.166   Min.   : 73800   Min.   : 743   Min.   : 3.400  
##  1st Qu.:17.660   1st Qu.:117300   1st Qu.:1026   1st Qu.: 7.800  
##  Median :27.062   Median :134500   Median :1618   Median : 8.950  
##  Mean   :31.005   Mean   :132665   Mean   :1738   Mean   : 9.193  
##  3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:2056   3rd Qu.:10.950  
##  Max.   :83.547   Max.   :180000   Max.   :3296   Max.   :14.900

dplyr::glimpse(df)

## Rows: 240
## Columns: 8
## $ city             <chr> "Beaumont", "Beaumont", "Beaumont", "Beaumont", "Beau…
## $ year             <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
## $ month            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,…
## $ sales            <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150,…
## $ volume           <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.70…
## $ median_price     <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 12430…
## $ listings         <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829,…
## $ months_inventory <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, …

Categorize its variables by type (e.g., qualitative, quantitative). Based on the data inspection, here is a classification of the variables:

Qualitative Variables:

city: This is a nominal variable representing different cities in Texas. It will be useful for grouping data and comparing trends across different locations.

Quantitative Variables:

year: An integer variable representing the year. Although stored as an integer, it is inherently a time component that identifies the period of observation. It can be treated as discrete quantitative for some analyses (e.g., aggregations by year) or as part of a date when combined with month.
month: An integer variable representing the month of the year. Similar to year, it’s a discrete quantitative variable but carries significant temporal meaning. It’s crucial for identifying seasonal trends.
sales: An integer variable representing the total number of sales. This is a discrete quantitative variable suitable for calculating sums, averages, and analyzing distribution.
volume: A numeric variable representing the total value of sales (in millions of dollars). This is a continuous quantitative variable, suitable for various statistical calculations like mean, standard deviation, and for trend analysis.
median_price: A numeric variable representing the median price of sales (in dollars). This is a continuous quantitative variable, key for understanding price trends and distributions.
listings: An integer variable representing the total number of active listings. This is a discrete quantitative variable, useful for assessing market supply.
months_inventory: A numeric variable representing the number of months needed to sell all current listings at the current pace of sales. This is a continuous quantitative variable, critical for understanding the balance between supply and demand.

Discuss how to handle time-related variables like ‘year’ and ‘month’.

Handling Time-Related Variables (year and month):
The year and month variables, while currently integers, represent a temporal dimension. For robust time-series analysis or more intuitive plotting (e.g., chronological order on an axis), it is often beneficial to combine them into a single date object. This can be done by creating a new column, for instance date, that holds the first day of each month.

This will allow for:

Time Series Analysis: Easily plotting trends over time.
Seasonality Analysis: Grouping by month to observe seasonal patterns, or by year to observe annual changes.
Filtering and Aggregation: Efficiently filtering data by specific periods or aggregating data over longer timeframes (e.g., quarterly or annually).
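The date column described above could be built roughly as follows. This is a minimal sketch on a toy data frame (the rows are illustrative, not taken from the dataset); the column names date and quarter are assumptions, and the same two lines would apply to df.

```r
# Toy rows mirroring the integer year/month layout of df (values are illustrative).
toy <- data.frame(
  year  = c(2010L, 2010L, 2011L),
  month = c(1L, 12L, 7L)
)

# Build a Date for the first day of each month; %02d zero-pads the month.
toy$date <- as.Date(sprintf("%d-%02d-01", toy$year, toy$month))

# A Date column sorts chronologically and supports period helpers such as quarters().
toy$quarter <- quarters(toy$date)
print(toy)
```

Using the first day of the month is only a convention; any fixed day would give the same ordering on a time axis.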

Task 2:

Calculate descriptive statistics (mean, median, standard deviation, IQR, variance, skewness, and kurtosis) for the quantitative variables: sales, volume, median_price, listings, and months_inventory. Ensure the e1071 package is installed and loaded for skewness and kurtosis.

library(e1071)

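If e1071 is not yet installed, install it first:
–> In R (for OSX), select Tools –> Install Packages from the menu, then search for the package and install it via Install selected.
–> Alternatively, open the Packages pane in the right-hand window, click Install, type e1071, and click Install. Running install.packages("e1071") in the console has the same effect.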

Proceed with calculating the descriptive statistics: for each variable, check that it is numeric, compute the statistics, and append them as a new row to the results dataframe.

variables_to_analyze <- c("sales", "volume", "median_price", "listings", "months_inventory")

results <- data.frame()

for (var_name in variables_to_analyze) {
  
  if (is.numeric(df[[var_name]])) {
    mean_val <- mean(df[[var_name]], na.rm = TRUE)
    median_val <- median(df[[var_name]], na.rm = TRUE)
    sd_val <- sd(df[[var_name]], na.rm = TRUE)
    var_val <- var(df[[var_name]], na.rm = TRUE)
    iqr_val <- IQR(df[[var_name]], na.rm = TRUE)
    skew_val <- skewness(df[[var_name]], na.rm = TRUE)
    kurt_val <- kurtosis(df[[var_name]], na.rm = TRUE)
    
    results <- rbind(results, data.frame(
      Variable = var_name,
      Mean = mean_val,
      Median = median_val,
      SD = sd_val,
      Variance = var_val,
      IQR = iqr_val,
      Skewness = skew_val,
      Kurtosis = kurt_val
    ))
  } else {
    cat(paste0("Variable '", var_name, "' is not numeric and will be skipped.\n"))
  }
}

print("Descriptive Statistics for Quantitative Variables:")

## [1] "Descriptive Statistics for Quantitative Variables:"

print(results)

##           Variable         Mean      Median           SD     Variance
## 1            sales    192.29167    175.5000    79.651111 6.344300e+03
## 2           volume     31.00519     27.0625    16.651447 2.772707e+02
## 3     median_price 132665.41667 134500.0000 22662.148687 5.135730e+08
## 4         listings   1738.02083   1618.5000   752.707756 5.665690e+05
## 5 months_inventory      9.19250      8.9500     2.303669 5.306889e+00
##          IQR    Skewness   Kurtosis
## 1   120.0000  0.71362055 -0.3355200
## 2    23.2335  0.87921815  0.1505673
## 3 32750.0000 -0.36227680 -0.6427292
## 4  1029.5000  0.64544309 -0.8101534
## 5     3.1500  0.04071944 -0.1979448
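To make the Skewness and Kurtosis columns easier to interpret, the sketch below computes moment-based versions in base R. It assumes that e1071's default estimator (type = 3) divides the third and fourth central moments by powers of the sample standard deviation and subtracts 3 for kurtosis; the helper names skew_b1 and kurt_b2 are hypothetical, and the input is a toy vector.

```r
# Moment-based skewness: third central moment over sd^3 (sd uses n - 1).
skew_b1 <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Excess kurtosis: fourth central moment over sd^4, minus 3 (normal = 0).
kurt_b2 <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

x <- c(1, 2, 3, 4, 5)  # toy symmetric sample, not from the dataset
skew_b1(x)             # 0 for a symmetric sample
kurt_b2(x)             # negative: flatter than a normal distribution
```

Under this convention a kurtosis of 0 corresponds to the normal distribution, which is why negative values in the table are read as platykurtic.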

Generate frequency distributions for the qualitative and discrete quantitative variables city, year, and month.
The e1071 package has been previously installed for skewness and kurtosis calculations.

cat("\nFrequency Distribution for 'city':\n")

## 
## Frequency Distribution for 'city':

print(table(df$city))

## 
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60

cat("\nFrequency Distribution for 'year':\n")

## 
## Frequency Distribution for 'year':

print(table(df$year))

## 
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48

cat("\nFrequency Distribution for 'month':\n")

## 
## Frequency Distribution for 'month':

print(table(df$month))

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20

Finally, comment on the statistical results, highlighting significant findings and patterns for each variable.

Data Analysis Key Findings:

Sales and Volume show positive skewness:
sales (mean: 192.29, median: 175.5, skewness: 0.714) and volume (mean: 31.01 million, median: 27.06 million, skewness: 0.879) both have means higher than their medians: most observations sit at lower values, and occasional high values pull the mean up.
Median price is slightly negatively skewed:
The median_price (mean: $132,665, median: $134,500, skewness: -0.362) has its median slightly above its mean, suggesting a concentration of prices at the higher end with a tail towards lower prices. Its standard deviation of $22,662 is large in absolute terms but only about 17% of the mean, the lowest relative variability among the five variables.
Listings are positively skewed with high variability:
listings (mean: 1738.02, median: 1618.5, skewness: 0.645) also shows a positive skew and substantial variability (standard deviation: 752.71, about 43% of the mean).
Months inventory is nearly symmetrical:
months_inventory (mean: 9.19, median: 8.95, skewness: 0.041) has an almost symmetrical distribution. Its standard deviation (2.30) is the smallest in absolute terms, but the variables are measured in different units; relative to its mean (about 25%), it is more variable than median_price and less variable than sales, volume, and listings.
Platykurtic distributions are common:
sales, median_price, and listings all have negative excess kurtosis (-0.336, -0.643, and -0.810 respectively), i.e. they are platykurtic, with lighter tails than a normal distribution. volume (0.151) and months_inventory (-0.198) are close to mesokurtic.
Dataset is perfectly balanced across city, year, and month:
Each of the four cities, five years, and twelve months has an equal number of observations (60 per city, 48 per year, and 20 per month), ensuring a balanced dataset for comparative analyses across these dimensions.
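Because the five variables are measured in different units, their standard deviations are not directly comparable. One way to put them on a common scale is the coefficient of variation (SD divided by mean). The sketch below uses the means and standard deviations printed in the results table above; the stats data frame and the CV column are names introduced here for illustration.

```r
# Means and standard deviations copied from the printed results table.
stats <- data.frame(
  Variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  Mean = c(192.29167, 31.00519, 132665.41667, 1738.02083, 9.19250),
  SD   = c(79.651111, 16.651447, 22662.148687, 752.707756, 2.303669)
)

# Coefficient of variation: unit-free measure of relative dispersion.
stats$CV <- stats$SD / stats$Mean

# Sort from least to most variable in relative terms.
print(stats[order(stats$CV), c("Variable", "CV")])
```

On these figures median_price has the smallest CV and volume the largest, so the ranking by relative variability differs from the ranking by raw standard deviation.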

Insights or Next Steps:

The prevalent positive skew in sales, volume, and listings suggests that while most periods have moderate activity, there are occasional spikes that significantly boost averages. Further investigation into the factors driving these high-activity periods could be valuable.
The balanced distribution of observations across city, year, and month variables makes the dataset highly suitable for time-series analysis, seasonality studies, and city-specific comparisons without the need for weighting or normalization due to unequal representation.

Task 4:

Select the median_price variable, divide it into meaningful classes (bins) based on its quartiles, create a new categorical column named median_price_class in the dataframe, and display the defined bin edges and a sample of the new column.

First, calculate the quartiles of the median_price variable to define the bin edges for classification. Then create a new categorical column named median_price_class with the cut() function, based on these quartiles, and display the bin edges and a sample of the new column to verify its creation.

quartile_breaks <- quantile(df$median_price, probs = c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE)

labels <- c("Low", "Medium-Low", "Medium-High", "High")

df$median_price_class <- cut(df$median_price, breaks = quartile_breaks, labels = labels, include.lowest = TRUE)

cat("Defined Quartile Breaks (Bin Edges):\n")

## Defined Quartile Breaks (Bin Edges):

print(quartile_breaks)

##     0%    25%    50%    75%   100% 
##  73800 117300 134500 150050 180000

cat("\nSample of new 'median_price_class' column:\n")

## 
## Sample of new 'median_price_class' column:

print(head(df$median_price_class))

## [1] High        Medium-High Medium-Low  Medium-Low  Medium-Low  Medium-Low 
## Levels: Low Medium-Low Medium-High High
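The include.lowest = TRUE argument in the cut() call matters because cut() builds right-closed intervals (a, b] by default, so the minimum value would otherwise fall outside the first bin. A small sketch on toy values (not taken from the dataset) shows the difference:

```r
prices <- c(10, 20, 30, 40, 50, 60, 70, 80)  # toy values, not from the dataset
breaks <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))
labels <- c("Low", "Medium-Low", "Medium-High", "High")

# Default intervals are (a, b], so the minimum is not assigned to any class.
without_lowest <- cut(prices, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [a, b].
with_lowest <- cut(prices, breaks = breaks, labels = labels, include.lowest = TRUE)

sum(is.na(without_lowest))  # one NA: the minimum value
table(with_lowest)          # every value is classified
```

Note that cut() stops with an error when the breaks are not unique, which can happen with quantile-based edges on heavily tied data.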

Generate and display the frequency distribution (table) for the newly created median_price_class variable, showing the count of observations falling into each price class.

Use the table() function on the df$median_price_class column and then print the resulting table.

print(table(df$median_price_class))

## 
##         Low  Medium-Low Medium-High        High 
##          60          61          59          60

Create and display a bar chart using ggplot2 to visually represent the frequency distribution of median_price_class. The chart will include appropriate labels and a title.

Use ggplot2 to create a bar chart.
This involves loading the library, mapping median_price_class to the x-axis, and adding appropriate labels and a title.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:e1071':
## 
##     element

# Create a bar chart for median_price_class

ggplot(df, aes(x = median_price_class)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Frequency Distribution of Median Price Classes",
    x = "Median Price Class",
    y = "Frequency"
  ) +
  theme_minimal()

Calculate the Gini heterogeneity index for the median_price_class variable:
determine the proportion of observations in each category of median_price_class and apply the Gini index formula: Gini = 1 - sum(p_i^2), where p_i is the proportion of observations in category i.

freq_table <- table(df$median_price_class)
proportions <- freq_table / sum(freq_table)
gini_index <- 1 - sum(proportions^2)

cat("Frequency Table for median_price_class:\n")

## Frequency Table for median_price_class:

print(freq_table)

## 
##         Low  Medium-Low Medium-High        High 
##          60          61          59          60

cat("\nProportions for median_price_class:\n")

## 
## Proportions for median_price_class:

print(proportions)

## 
##         Low  Medium-Low Medium-High        High 
##   0.2500000   0.2541667   0.2458333   0.2500000

cat("\nGini Heterogeneity Index for median_price_class: ", gini_index, "\n")

## 
## Gini Heterogeneity Index for median_price_class:  0.7499653

Discuss the interpretation of the calculated index in the context of the price class distribution. The calculated Gini heterogeneity index for median_price_class is approximately 0.750.

Discussion:

Gini Index Range:
The Gini index ranges from 0 to (k - 1)/k, where k is the number of categories; with the k = 4 classes used here, the maximum is 0.75.
A value of 0 indicates perfect homogeneity, meaning all observations fall into a single category.
A value at the maximum indicates perfect heterogeneity, meaning observations are spread evenly across all categories. The maximum approaches 1 only as the number of categories grows.

Context of median_price_class:
In this case, the median_price_class variable was created by dividing median_price into four classes (Low, Medium-Low, Medium-High, High) based on quartiles.
If the division into quartiles were perfectly balanced, each class would contain 25% of the observations.
This would yield the maximum Gini index of 0.75.

Result Interpretation:
A Gini index of 0.750 is essentially the maximum for four classes (divided by 0.75, it is about 0.9999), indicating near-perfect heterogeneity in the median_price_class variable.
This is expected given that the classes were created using quartiles, which aim for an approximately equal distribution of observations across the four categories.
The frequency table shows that the classes are indeed very evenly distributed (60, 61, 59, 60 observations respectively out of 240 total). No single price class dominates the dataset.

Implication:
Because the classes are defined by quartiles, this high heterogeneity is a consequence of the binning rather than a finding about the market. It confirms that the four classes are of nearly equal size, which makes them convenient segments for comparing other variables across the price spectrum.
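For k categories the index 1 - sum(p_i^2) cannot exceed (k - 1)/k, so it can help to report a normalized version alongside the raw value. The following sketch uses the class counts from the frequency table above; the function name gini_heterogeneity and its return layout are assumptions made for illustration.

```r
# Gini heterogeneity with its theoretical maximum and a normalized version.
gini_heterogeneity <- function(counts) {
  p <- counts / sum(counts)   # category proportions
  k <- length(counts)         # number of categories
  g <- 1 - sum(p^2)
  c(gini = g, max = (k - 1) / k, normalized = g / ((k - 1) / k))
}

# Class counts copied from the frequency table of median_price_class.
counts <- c(Low = 60, `Medium-Low` = 61, `Medium-High` = 59, High = 60)
res <- gini_heterogeneity(counts)
print(res)
```

A normalized value close to 1 says the classes are almost perfectly balanced, whatever the number of categories.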

Data Analysis Key Findings:

The frequency distribution of the median_price_class variable was visually represented using a bar chart, clearly showing the distribution across its four categories.
The frequency table for median_price_class revealed a very even distribution across the categories:
Low (60 observations), Medium-Low (61 observations), Medium-High (59 observations), and High (60 observations).
The calculated Gini heterogeneity index for median_price_class is approximately 0.750.
This value is essentially the maximum attainable with four classes (0.75), indicating near-perfect heterogeneity (diversity) within the median_price_class variable.
It is consistent with the design of the classes, which were based on quartiles to ensure an approximately equal distribution of observations across the categories.

Insights or Next Steps:

The near-maximal heterogeneity of median_price_class follows from the quartile-based design: it confirms that the four price segments are of nearly equal size rather than revealing something new about the market.
These balanced segments allow for comparative studies of other housing characteristics or market behaviors across price levels.

Task 5:

Calculate the probability that a randomly selected row from the dataset corresponds to the city ‘Beaumont’.
This involves counting rows where ‘city’ is ‘Beaumont’ and dividing by the total number of rows.

Filter the DataFrame to count rows for ‘Beaumont’, get the total number of rows, and then divide the ‘Beaumont’ count by the total count.

count_beaumont <- sum(df$city == "Beaumont")
total_rows <- nrow(df)
probability_beaumont <- count_beaumont / total_rows

cat("Number of rows for 'Beaumont':", count_beaumont, "\n")

## Number of rows for 'Beaumont': 60

cat("Total number of rows:", total_rows, "\n")

## Total number of rows: 240

cat("Probability of selecting a row for 'Beaumont':", probability_beaumont, "\n")

## Probability of selecting a row for 'Beaumont': 0.25

Calculate the probability that a randomly selected row corresponds to the month of ‘July’ using a similar approach.

count_july <- sum(df$month == 7)
total_rows <- nrow(df)
probability_july <- count_july / total_rows

cat("Number of rows for 'July':", count_july, "\n")

## Number of rows for 'July': 20

cat("Total number of rows:", total_rows, "\n")

## Total number of rows: 240

cat("Probability of selecting a row for 'July':", probability_july, "\n")

## Probability of selecting a row for 'July': 0.08333333

Determine the probability of selecting a row that corresponds to ‘December 2012’ by filtering the data for both month 12 and year 2012, and then dividing this count by the total number of rows.

count_december_2012 <- sum(df$month == 12 & df$year == 2012)
total_rows <- nrow(df)
probability_december_2012 <- count_december_2012 / total_rows

cat("Number of rows for 'December 2012':", count_december_2012, "\n")

## Number of rows for 'December 2012': 4

cat("Total number of rows:", total_rows, "\n")

## Total number of rows: 240

cat("Probability of selecting a row for 'December 2012':", probability_december_2012, "\n")

## Probability of selecting a row for 'December 2012': 0.01666667

Probabilities:

Probability for ‘Beaumont’:
The probability of randomly selecting a row corresponding to the city ‘Beaumont’ is 0.25 (or 25%).
Probability for ‘July’:
The probability of randomly selecting a row corresponding to the month of ‘July’ is approximately 0.0833 (or 8.33%).
Probability for ‘December 2012’:
The probability of randomly selecting a row corresponding to ‘December 2012’ is approximately 0.0167 (or 1.67%).

Data Analysis Key Findings and Interpretation:

City ‘Beaumont’ (P = 0.25):

This probability of 0.25 indicates that 25% of the total observations in the dataset belong to the city ‘Beaumont’.
Given that there are 4 unique cities in the dataset and each city has an equal number of observations (60 out of 240 total rows), this result is expected (1/4 = 0.25). This confirms the balanced representation of cities in the dataset.

Month ‘July’ (P ≈ 0.0833):

This probability indicates that approximately 8.33% of the total observations correspond to the month of July.
Since there are 12 unique months in the dataset and each month has an equal number of observations (20 out of 240 total rows), this result is expected (20/240 = 1/12 ≈ 0.0833).
This confirms the balanced representation of months across the dataset.

‘December 2012’ (P ≈ 0.0167):

This probability indicates that approximately 1.67% of the total observations specifically correspond to December 2012.
This is an intersection of two conditions (month is December AND year is 2012). The dataset contains 4 cities * 5 years * 12 months = 240 rows, and each month-year combination appears once per city, so ‘December 2012’ accounts for 4 rows out of 240 (4/240 = 1/60 ≈ 0.0167). Because the design is balanced, this equals the product of the two marginal probabilities (1/12 * 1/5 = 1/60). The probability of picking a very specific time slice is low, as expected given the granularity of the data.
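The counting argument can be checked without the CSV by rebuilding the balanced city × year × month skeleton. This is a sketch under the assumption that the dataset is fully crossed, as the frequency tables suggest; grid is a stand-in for the key columns of df, and only the city names, years, and months come from the output above.

```r
# Rebuild the balanced design: 4 cities x 5 years x 12 months.
cities <- c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls")
grid <- expand.grid(city = cities, year = 2010:2014, month = 1:12,
                    stringsAsFactors = FALSE)

# The mean of a logical vector is the share of TRUE values, i.e. the probability.
p_beaumont <- mean(grid$city == "Beaumont")
p_dec      <- mean(grid$month == 12)
p_2012     <- mean(grid$year == 2012)
p_dec_2012 <- mean(grid$month == 12 & grid$year == 2012)

# In a fully crossed, balanced design the joint probability factorizes.
all.equal(p_dec_2012, p_dec * p_2012)
```

Using mean() on a logical condition is equivalent to the sum()/nrow() pattern used in the code above and avoids recomputing total_rows each time.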

Insights:

The calculated probabilities are consistent with the balanced nature of the dataset across cities, years, and months, as identified in previous steps. This reinforces confidence in using this dataset for comparative analyses without concerns about skewed representation.
These probabilities serve as a fundamental understanding of the dataset’s composition and can be used as baseline references for more complex conditional probability calculations or hypothesis testing in future analyses.

Task 6:

Create two new variables in the df dataframe: average_price (calculated as (volume * 1,000,000) divided by sales) and listing_effectiveness (calculated as sales divided by listings). For each new variable, display the head of the updated dataframe and summary statistics.
Then, provide a commentary on their meaning, distributions, and initial insights into the real estate market.
Finally, summarize the process of creating these new variables and the insights gained from their analysis, linking them to the project’s objectives.

Calculate the average_price as specified, multiplying volume by 1,000,000 to convert it to dollars, then divide by sales.
Display the head of the updated dataframe and summary statistics for the new column to verify the operation.

df$average_price <- (df$volume * 1000000) / df$sales

cat("Head of DataFrame with new 'average_price' column:\n")

## Head of DataFrame with new 'average_price' column:

print(head(df))

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price
## 1               High      170626.5
## 2        Medium-High      163796.3
## 3         Medium-Low      157697.8
## 4         Medium-Low      134095.0
## 5         Medium-Low      142737.6
## 6         Medium-Low      144015.9

cat("\nSummary statistics for 'average_price' column:\n")

## 
## Summary statistics for 'average_price' column:

print(summary(df$average_price))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   97010  132939  156588  154320  173915  213234

Now that the average_price column has been created and inspected, create the second new column, listing_effectiveness, calculated as sales divided by listings, and display the head of the updated dataframe and summary statistics for this new column to verify its creation and initial properties.

df$listing_effectiveness <- df$sales / df$listings

cat("Head of DataFrame with new 'listing_effectiveness' column:\n")

## Head of DataFrame with new 'listing_effectiveness' column:

print(head(df))

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price listing_effectiveness
## 1               High      170626.5            0.05414220
## 2        Medium-High      163796.3            0.06809584
## 3         Medium-Low      157697.8            0.10775607
## 4         Medium-Low      134095.0            0.11709602
## 5         Medium-Low      142737.6            0.11405985
## 6         Medium-Low      144015.9            0.10482529

cat("\nSummary statistics for 'listing_effectiveness' column:\n")

## 
## Summary statistics for 'listing_effectiveness' column:

print(summary(df$listing_effectiveness))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05014 0.08980 0.10963 0.11874 0.13492 0.38713

Comments:

average_price:

Meaning:
The average_price variable represents the average price per real estate transaction, calculated as the total volume of sales (in dollars) divided by the total sales count for a given period and city. Unlike median_price, which indicates the midpoint of individual property prices, average_price reflects the average value of each transaction. This can be significantly influenced by a few high-value sales.
Distribution (from Summary Statistics):
> Min.: $97,010
> 1st Qu.: $132,939
> Median: $156,588
> Mean: $154,320
> 3rd Qu.: $173,915
> Max.: $213,234
Analysis:
The average_price ranges from approximately $97,010 to $213,234.
The mean ($154,320) is slightly lower than the median ($156,588), suggesting a very slight negative skew: values are somewhat more concentrated towards the higher end, with a few lower values pulling the mean below the median.
This mirrors median_price, which also showed a slight negative skew, with its mean ($132,665) slightly lower than its median ($134,500).
However, the range and quartiles of average_price are generally higher than those of median_price, indicating that the average transaction value is often higher than the typical (median) property price, possibly because higher-value properties contribute disproportionately to the total volume. The difference between the mean and median is small, indicating a relatively symmetrical distribution for average_price compared to some other variables like sales or volume.
Initial Insights:
Revenue Perspective: average_price is a direct indicator of the average revenue generated per sale. Monitoring this alongside median_price provides a more complete picture of the market’s value proposition. A growing average_price suggests increasing revenue per transaction, which is positive for profitability.
Market Segmentation: Discrepancies between average_price and median_price can highlight different market dynamics. For instance, if average_price is significantly higher than median_price, it could imply a robust luxury market or a few very expensive properties driving the overall volume.
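The gap between average_price and median_price can be illustrated with a handful of hypothetical sale prices for a single city-month. The figures below are invented for illustration and do not come from the dataset; they only show how one expensive sale lifts the average while leaving the median unchanged.

```r
# Hypothetical sale prices for one city-month (illustrative, not from the dataset).
prices <- c(90000, 110000, 120000, 130000, 400000)

sales  <- length(prices)      # number of transactions
volume <- sum(prices) / 1e6   # total value in millions of dollars, as in df

# Same formula as in the report: volume converted to dollars, divided by sales.
average_price <- (volume * 1e6) / sales
median_price  <- median(prices)

print(c(average_price = average_price, median_price = median_price))
```

Here the single 400,000 sale raises the average well above the median, which is the pattern the report observes when comparing the two columns.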

listing_effectiveness:

Meaning:
listing_effectiveness is calculated as sales divided by listings. This variable measures how many sales are generated per active listing. It can be interpreted as a proxy for market efficiency or the success rate of listings in leading to a sale. A higher value indicates that a greater proportion of listings are resulting in sales, suggesting a more dynamic market or more effective marketing.
Distribution (from Summary Statistics):
> Min.: 0.05014
> 1st Qu.: 0.08980
> Median: 0.10963
> Mean: 0.11874
> 3rd Qu.: 0.13492
> Max.: 0.38713
Analysis:
The listing_effectiveness ranges from approximately 0.05 (5%) to 0.387 (38.7%).
Read as a monthly conversion rate, this means that at its lowest about 5 sales are recorded per 100 active listings in a month, while at its peak almost 39 are. This represents a wide range of effectiveness.
The mean (0.11874) is higher than the median (0.10963), indicating a positive skew.
This suggests that while most periods have lower to moderate listing effectiveness, there are occasional instances where listings are remarkably effective, pulling the average up.
Initial Insights:
Market Demand vs. Supply: Higher listing_effectiveness generally points to stronger buyer demand relative to the supply of listings, or a faster-moving market.
Conversely, lower values might suggest an oversupply of properties or weaker demand.
Marketing Success: This metric can also reflect the overall success of marketing and listing strategies.
Periods or cities with consistently higher listing_effectiveness might have more competitive pricing, better property presentation, or more targeted advertising.
Variability: The significant range and positive skew highlight that listing_effectiveness is not constant. Identifying the factors contributing to high effectiveness (e.g., specific cities, seasons, or market conditions) would be valuable for strategic decision-making.
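Both derived variables are ratios, so a zero denominator would produce Inf or NaN. The minimum values reported by summary(df) (79 sales, 743 listings) rule that out here, but a guarded version may be useful if the same code is reused on other data. The sketch below recomputes both columns for the six Beaumont rows shown in the head(df) output; safe_ratio is a hypothetical helper, not part of the original analysis.

```r
# First six Beaumont rows, copied from the head(df) output.
head_rows <- data.frame(
  sales    = c(83, 108, 182, 200, 202, 189),
  volume   = c(14.162, 17.690, 28.701, 26.819, 28.833, 27.219),
  listings = c(1533, 1586, 1689, 1708, 1771, 1803)
)

# Hypothetical helper: return NA instead of Inf/NaN when the denominator is zero.
safe_ratio <- function(num, den) ifelse(den > 0, num / den, NA_real_)

head_rows$average_price <- safe_ratio(head_rows$volume * 1e6, head_rows$sales)
head_rows$listing_effectiveness <- safe_ratio(head_rows$sales, head_rows$listings)
print(head_rows)
```

The recomputed values should agree with the average_price and listing_effectiveness columns printed in the report for the same rows.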

Data Analysis Key Findings:

Average_price vs. median_price:
The average_price (mean: $154,320; median: $156,588) was found to be generally higher than the median_price (mean: $132,665; median: $134,500).
This suggests that higher-priced homes contribute more significantly to the total sales volume.
The average_price showed a slight negative skew, indicating a concentration of values towards the higher end, with a few lower values pulling the mean below the median.
Listing_effectiveness Variability:
This metric exhibited considerable variability, ranging from approximately 5% to 38.7% (mean: 0.11874; median: 0.10963).
The positive skew (mean greater than median) suggests that while there is a baseline level of effectiveness, some periods experience notably higher sales conversion rates relative to active listings, indicating dynamic market conditions.

Insights:

Decision Support:
The newly created variables provide quantitative metrics vital for understanding market dynamics and supporting strategic sales decisions. average_price helps in assessing revenue potential per transaction, while listing_effectiveness measures the success rate of listings.
Optimization and Further Analysis:
Texas Realty Insights can utilize listing_effectiveness to identify optimal periods or cities for listing properties, thereby improving conversion rates. Further analysis should explore factors influencing high listing_effectiveness and the relationship between average_price and median_price to gain deeper insights into market value and segmentation.

Task 7:

Conditional Analysis by City:

Perform conditional statistical analysis (mean and standard deviation) for key quantitative variables grouped by ‘city’ using dplyr. Display the summary statistics and create appropriate graphical representations (e.g., bar charts for means, box plots for distributions) to visualize city-specific trends.

Calculate the mean and standard deviation for the specified quantitative variables, grouped by ‘city’, as instructed.

library(dplyr)

city_summary <- df %>% 
  group_by(city) %>% 
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(city_summary)

## # A tibble: 4 × 15
##   city               mean_sales sd_sales mean_volume sd_volume mean_median_price
##   <chr>                   <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
## 1 Beaumont                 177.     41.5        26.1      6.97           129988.
## 2 Bryan-College Sta…       206.     85.0        38.2     17.2            157488.
## 3 Tyler                    270.     62.0        45.8     13.1            141442.
## 4 Wichita Falls            116.     22.2        13.9      3.24           101743.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
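
The fourteen near-identical lines in the summarise() call above could be written more compactly with dplyr::across(). This is a sketch on a toy data frame (the values are illustrative, not taken from the dataset); the metrics vector would be extended to all seven variables in the real analysis.

```r
library(dplyr)

# Toy stand-in for df (illustrative values only).
toy <- data.frame(
  city   = c("A", "A", "B", "B"),
  sales  = c(100, 120, 200, 260),
  volume = c(10, 14, 30, 42)
)

# Columns to summarise; extend with the remaining variables as needed.
metrics <- c("sales", "volume")

toy_summary <- toy %>%
  group_by(city) %>%
  summarise(
    across(
      all_of(metrics),
      list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"  # yields mean_sales, sd_sales, ...
    )
  )

print(toy_summary)
```

The .names pattern reproduces the mean_* and sd_* column names used in city_summary, so downstream plotting code would not need to change.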

Conditional Analysis by Year:

Perform the conditional statistical analysis grouped by ‘year’. Display the summary statistics and create graphical representations (e.g., line charts, bar charts) to visualize annual trends.

Calculate the mean and standard deviation for the specified quantitative variables.

library(dplyr)

year_summary <- df %>%
  group_by(year) %>%
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(year_summary)

## # A tibble: 5 × 15
##    year mean_sales sd_sales mean_volume sd_volume mean_median_price
##   <int>      <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
## 1  2010       169.     60.5        25.7      10.8           130192.
## 2  2011       164.     63.9        25.2      12.2           127854.
## 3  2012       186.     70.9        29.3      14.5           130077.
## 4  2013       212.     84.0        35.2      17.9           135723.
## 5  2014       231.     95.5        39.8      21.2           139481.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
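
To quantify the annual trend, the year-over-year percentage change can be computed from the summary. The sketch below uses the mean_sales values as displayed (rounded) in the output above rather than the full-precision year_summary object.

```r
# Mean sales per year, rounded as displayed in year_summary above.
years      <- 2010:2014
mean_sales <- c(169, 164, 186, 212, 231)

# Percentage change relative to the previous year.
yoy_pct <- diff(mean_sales) / head(mean_sales, -1) * 100
names(yoy_pct) <- years[-1]

round(yoy_pct, 1)
```

With these rounded inputs the change is slightly negative for 2011 and positive for every later year.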

Conditional Analysis by Month:

Perform conditional statistical analysis (mean and standard deviation) for key quantitative variables grouped by ‘month’. Display the summary statistics and create graphical representations (e.g., line charts, bar charts) to visualize seasonal trends.

Calculate the mean and standard deviation for the specified quantitative variables.

library(dplyr)

month_summary <- df %>%
  group_by(month) %>%
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(month_summary)

## # A tibble: 12 × 15
##    month mean_sales sd_sales mean_volume sd_volume mean_median_price
##    <int>      <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
##  1     1       127.     43.4        19.0      8.37            124250
##  2     2       141.     51.1        21.7     10.1             130075
##  3     3       189.     59.2        29.4     12.0             127415
##  4     4       212.     65.4        33.3     14.5             131490
##  5     5       239.     83.1        39.7     19.0             134485
##  6     6       244.     95.0        41.3     21.1             137620
##  7     7       236.     96.3        39.1     21.4             134750
##  8     8       231.     79.2        38.0     18.0             136675
##  9     9       182.     72.5        29.6     15.2             134040
## 10    10       180.     75.0        29.1     15.1             133480
## 11    11       157.     55.5        24.8     11.2             134305
## 12    12       169.     60.7        27.1     12.6             133400
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
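
One simple way to express seasonality is an index that divides each month's mean by the average of the twelve monthly means, so that values above 1 mark busier-than-average months. The sketch below uses the mean_sales values as displayed (rounded) in the output above; seasonal_index is an illustrative name, not an object created elsewhere in this report.

```r
# Mean sales per month, rounded as displayed in month_summary above.
mean_sales <- c(127, 141, 189, 212, 239, 244, 236, 231, 182, 180, 157, 169)
names(mean_sales) <- month.abb

# Seasonal index: each month relative to the average of the 12 monthly means.
seasonal_index <- mean_sales / mean(mean_sales)

round(seasonal_index, 2)
names(which.max(seasonal_index))  # busiest month
names(which.min(seasonal_index))  # quietest month
```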

Conditional Analysis by Month:

Data Analysis Key Findings:

Peak Sales Season:
Mean sales and volume rise from January, peak in May through July, and then decline towards the end of the year. For example, mean sales are lowest in January (127) and highest in June (244). This suggests strong seasonal demand in the warmer months.
Price Trends:
Mean median_price and average_price generally follow the sales trend, with prices tending to be higher during the peak selling season. Mean median price is lowest in January ($124,250) and peaks in June ($137,620), indicating that properties sold during the peak months tend to fetch higher prices.
Listing Activity:
Mean listings generally increase from the beginning of the year, peaking around spring/summer (e.g., June with 2110 listings) before slightly decreasing. This indicates that sellers are more likely to put their homes on the market during periods of high demand.
Market Efficiency:
Mean listing_effectiveness is notably higher in the spring and summer months (e.g., peaks in June with 0.134), indicating that listings convert into sales more efficiently during these periods. This is a critical insight for optimizing marketing strategies.
Inventory Levels:
Mean months_inventory shows an inverse relationship to sales and effectiveness, generally decreasing from winter through summer (lowest in July at 7.7 months) and rising again towards winter (highest in January at 10.1 months). This confirms a tighter, faster-moving market in the warmer months.
Distribution Insights (from Box Plots):
The box plots visually confirm these seasonal trends, showing shifts in the median, interquartile range, and presence of outliers across months. For instance, sales and volume box plots typically show higher medians and potentially wider spreads during peak seasons, while months_inventory shows lower medians and narrower spreads.

Insights or Next Steps:

Optimal Listing Periods:
Texas Realty Insights should focus their marketing and listing efforts heavily on the spring and summer months (March to August) to capitalize on peak sales activity, higher prices, and improved listing_effectiveness. This directly addresses the project objective of optimizing listing strategies.
Pricing Strategy:
Dynamic pricing strategies could be implemented to align with seasonal price fluctuations, potentially maximizing revenue during peak periods.
Inventory Management:
Understanding seasonal months_inventory trends can help agents advise sellers on the best time to list their properties to achieve quicker sales or higher prices.
Targeted Marketing:
Marketing campaigns can be tailored to seasonal variations, perhaps focusing on buyer attraction in spring/summer and seller education or strategic listings in fall/winter.

Conditional Analysis (by City, Year, and Month):

Data Analysis Key Findings:

City Performance Disparities:
Cities exhibit distinct market characteristics. For instance, Tyler is a high-activity market with robust sales, while Bryan-College Station commands higher prices. Wichita Falls consistently shows lower activity and prices across most metrics. Beaumont generally falls in the middle.
Market Growth Over Time:
The Texas real estate market grew and became more efficient between 2010 and 2014. After a slight dip in 2011, key indicators like sales, volume, and prices rose each year, while months_inventory (the number of months needed to sell the current listings at the prevailing sales pace) decreased, signaling a strengthening seller’s market.
Seasonal Influence:
The real estate market is highly seasonal, with a strong preference for spring and summer sales. This period is characterized by higher transaction volumes, greater market efficiency (higher listing_effectiveness), and elevated prices, followed by a slowdown in colder months.

Insights or Next Steps:

Targeted Strategies:
Texas Realty Insights should develop city-specific marketing and sales strategies. High-volume markets like Tyler might benefit from aggressive listing campaigns, while higher-priced markets like Bryan-College Station could focus on premium services. Wichita Falls may require strategies to stimulate demand or differentiate listings.
Capitalize on Growth and Seasonality:
Leveraging the observed annual growth and seasonal peaks is crucial. Marketing efforts should be intensified from March to August to maximize sales and revenue. Understanding these patterns allows for proactive inventory management, staffing adjustments, and promotional timing.
Further Granular Analysis:
While this conditional analysis provides a strong overview, combining these factors (e.g., city-year, city-month trends) could yield even more precise insights. For example, investigating why a particular city might deviate from the general annual or seasonal trend could uncover unique local market dynamics.
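
The combined analysis suggested above could be sketched by grouping on two keys at once. The example below uses a toy data frame with illustrative values; in the report, df would take its place and more metrics would be added to summarise().

```r
library(dplyr)

# Toy stand-in for df (illustrative values only).
toy <- data.frame(
  city  = rep(c("A", "B"), each = 4),
  year  = rep(c(2010, 2010, 2011, 2011), times = 2),
  sales = c(100, 120, 90, 95, 200, 220, 240, 260)
)

city_year_summary <- toy %>%
  group_by(city, year) %>%
  summarise(mean_sales = mean(sales, na.rm = TRUE), .groups = "drop") %>%
  group_by(city) %>%
  # Change against the same city's previous year (NA for the first year).
  mutate(change_vs_prev_year = mean_sales - lag(mean_sales)) %>%
  ungroup()

print(city_year_summary)
```

In this toy data, city A declines while city B grows, which is the kind of city-level deviation from the overall trend that the combined view is meant to surface.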

Integrated Summary of Conditional Analyses and Strategic Implications:

The conditional analyses conducted by city, year, and month provide a comprehensive understanding of the dynamics shaping the Texas real estate market. These insights highlight structural differences across cities, a clear multi‑year growth trajectory, and strong seasonal patterns that influence sales activity, pricing, inventory behavior, and listing performance.

City-Level Differences:

The analyses reveal substantial variation across local markets:

Tyler: the most active market, with the highest mean sales and volume, as well as strong listing_effectiveness, indicating efficient conversion of listings into sales.
Bryan-College Station: the leader in median_price and average_price, positioning it as a premium-value market.
Beaumont: moderate performance across most metrics.
Wichita Falls: consistently the lowest values in sales, prices, and listing effectiveness, along with higher and more variable months_inventory, reflecting a slower and less competitive market.

These differences underscore the need for city-specific strategies tailored to each market’s structure and performance.

Annual Trends (2010–2014):

Across the five-year period, the Texas real estate market demonstrates strong and sustained growth:

Mean sales, volume, and median_price dip slightly in 2011 and then rise every year through 2014; average_price and listing_effectiveness also end the period higher than they began.
Mean months_inventory declines from approximately 9.9 to 8.4 months, signaling a tightening seller’s market where properties sell more quickly.
Price appreciation is evident, with mean median price rising from $130,192 in 2010 to $139,481 in 2014.

These trends point to improving market efficiency and stronger demand, and are consistent with favorable economic conditions.

Seasonal Patterns:

The analyses also reveal pronounced seasonality:

Sales and volume peak between May and July, with June typically showing the highest activity.
Median_price and average_price follow similar seasonal peaks, reaching their highest levels in late spring and early summer.
Listing_effectiveness is strongest during the same period, indicating that marketing efforts convert more effectively when demand is highest.
Months_inventory displays inverse seasonality, reaching its lowest levels in summer and highest in winter, confirming faster turnover during peak demand months.
Listings increase in spring and peak around May–July, suggesting that sellers strategically enter the market when conditions are most favorable.

Strategic Implications:

The combined insights from the conditional analyses support several strategic recommendations for Texas Realty Insights:

Develop City-Specific Strategies:
Focus on high-volume campaigns in Tyler, premium positioning in Bryan-College Station, and targeted, conservative approaches in slower markets like Wichita Falls.
Capitalize on Seasonality:
Concentrate marketing investments, listing promotions, and staffing resources between March and August to leverage peak demand, higher prices, and stronger listing effectiveness.
Implement Dynamic Pricing:
Adjust pricing strategies to reflect seasonal fluctuations, maximizing revenue during high-activity months.
Monitor Inventory Conditions:
Use months_inventory as a key indicator of market competitiveness and to guide recommendations on optimal listing timing.
Leverage Multi-Year Growth Trends:
Invest in expanding operations in high-performing markets and refine sales forecasts using the consistent upward trends observed from 2010 to 2014.

This integrated analysis provides a clear, data-driven foundation for strategic decision-making, enabling Texas Realty Insights to optimize marketing efforts, improve forecasting, and tailor approaches to the unique characteristics of each market.

Task 8:

ggplot2 Visualizations:

Create a bar chart using ggplot2 to visualize the mean of sales across different cities. This chart will help in comparing the average sales performance of each city.

library(ggplot2)

Bar chart for mean sales by city

ggplot(city_summary, aes(x = city, y = mean_sales, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by City",
    x = "City",
    y = "Mean Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of average sales volume per city.

Bar chart for mean volume by city

ggplot(city_summary, aes(x = city, y = mean_volume, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by City",
    x = "City",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average median price per city.

Bar chart for mean median_price by city

ggplot(city_summary, aes(x = city, y = mean_median_price, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by City",
    x = "City",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average number of listings per city.

Bar chart for mean listings by city

ggplot(city_summary, aes(x = city, y = mean_listings, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by City",
    x = "City",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average months inventory per city.

Bar chart for mean months_inventory by city

ggplot(city_summary, aes(x = city, y = mean_months_inventory, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by City",
    x = "City",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average price per city.

Bar chart for mean average_price by city

ggplot(city_summary, aes(x = city, y = mean_average_price, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by City",
    x = "City",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average listing effectiveness per city.

Bar chart for mean listing_effectiveness by city

ggplot(city_summary, aes(x = city, y = mean_listing_effectiveness, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by City",
    x = "City",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
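
The bar charts above differ only in the column plotted and the labels, so they could be generated by one small function. This is a sketch: plot_mean_bar is a hypothetical helper, and the data frame holds the mean_sales values as displayed (rounded) in the city_summary output.

```r
library(ggplot2)

# Hypothetical helper: bar chart of one summary column per group.
plot_mean_bar <- function(summary_df, group_col, value_col,
                          title, x_label, y_label) {
  ggplot(
    summary_df,
    aes(
      x = factor(.data[[group_col]]),
      y = .data[[value_col]],
      fill = factor(.data[[group_col]])
    )
  ) +
    geom_col() +  # equivalent to geom_bar(stat = "identity")
    labs(title = title, x = x_label, y = y_label) +
    theme_minimal() +
    theme(
      legend.position = "none",
      axis.text.x = element_text(angle = 45, hjust = 1)
    )
}

# Mean sales per city, rounded as displayed in city_summary.
city_means <- data.frame(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  mean_sales = c(177, 206, 270, 116)
)

p <- plot_mean_bar(city_means, "city", "mean_sales",
                   "Mean Sales by City", "City", "Mean Sales")
print(p)
```

The same function could be called with year_summary or month_summary and any of the mean_* columns, since the grouping column is converted to a factor inside aes().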

Create a box plot using ggplot2 to visualize the distribution of sales across different cities. This chart will provide insights into the spread, central tendency, and outliers of sales for each city.

Box plot for sales by city

ggplot(df, aes(x = city, y = sales, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by City",
    x = "City",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
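
For reference, the points a box plot draws individually are those lying more than 1.5 times the interquartile range beyond the box. Base R's boxplot.stats() applies that rule and makes it easy to list the flagged values. The vector below is a toy example, and geom_boxplot() may compute the box edges slightly differently (quantiles rather than Tukey hinges), so treat this as an approximation of what the plot shows.

```r
# Toy monthly sales for one hypothetical city (illustrative values only).
sales <- c(150, 160, 165, 170, 172, 175, 180, 185, 190, 320)

stats <- boxplot.stats(sales)  # default coef = 1.5

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond the whiskers, drawn as individual points
```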

This will provide insights into the spread, central tendency, and outliers of sales volume for each city.

Box plot for volume by city

ggplot(df, aes(x = city, y = volume, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by City",
    x = "City",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of median price for each city.

Box plot for median_price by city

ggplot(df, aes(x = city, y = median_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by City",
    x = "City",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of listings for each city.

Box plot for listings by city

ggplot(df, aes(x = city, y = listings, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by City",
    x = "City",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of months inventory for each city.

Box plot for months_inventory by city

ggplot(df, aes(x = city, y = months_inventory, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by City",
    x = "City",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of average price for each city.

Box plot for average_price by city

ggplot(df, aes(x = city, y = average_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by City",
    x = "City",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of listing effectiveness for each city.

Box plot for listing_effectiveness by city

ggplot(df, aes(x = city, y = listing_effectiveness, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by City",
    x = "City",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This chart will help in comparing the average sales performance for each year.

Bar chart for mean sales by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_sales, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by Year",
    x = "Year",
    y = "Mean Sales"
  ) +
  theme_minimal() + 
  theme(legend.position = "none")

This will allow for easy comparison of average sales volume per year.

Bar chart for mean volume by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_volume, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by Year",
    x = "Year",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average median price per year.

Bar chart for mean median_price by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_median_price, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by Year",
    x = "Year",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average number of listings per year.

Bar chart for mean listings by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_listings, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by Year",
    x = "Year",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average months inventory per year.

Bar chart for mean months_inventory by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_months_inventory, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by Year",
    x = "Year",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average price per year.

Bar chart for mean average_price by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_average_price, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by Year",
    x = "Year",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average listing effectiveness per year.

Bar chart for mean listing_effectiveness by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_listing_effectiveness, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by Year",
    x = "Year",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
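
Because year is an ordered, evenly spaced variable, a line chart is a natural alternative to the bar charts above for showing the annual trend. A minimal sketch, using the mean_sales values as displayed (rounded) in the year_summary output:

```r
library(ggplot2)

# Yearly means, rounded as displayed in year_summary.
yearly <- data.frame(
  year = 2010:2014,
  mean_sales = c(169, 164, 186, 212, 231)
)

p_line <- ggplot(yearly, aes(x = year, y = mean_sales)) +
  geom_line() +
  geom_point() +
  labs(title = "Mean Sales by Year", x = "Year", y = "Mean Sales") +
  theme_minimal()

print(p_line)
```

Here year stays numeric so that geom_line() connects the points; with as.factor(year) on the x axis, a group = 1 aesthetic would be needed for the line to be drawn.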

Create a box plot using ggplot2 to visualize the distribution of sales across different years. This will provide insights into the spread, central tendency, and outliers of sales for each year.

Box plot for sales by year

ggplot(df, aes(x = as.factor(year), y = sales, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by Year",
    x = "Year",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of sales volume for each year.

Box plot for volume by year

ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Year",
    x = "Year",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of median price for each year.

Box plot for median_price by year

ggplot(df, aes(x = as.factor(year), y = median_price, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by Year",
    x = "Year",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listings for each year.

Box plot for listings by year

ggplot(df, aes(x = as.factor(year), y = listings, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by Year",
    x = "Year",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of months inventory for each year.

Box plot for months_inventory by year

ggplot(df, aes(x = as.factor(year), y = months_inventory, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by Year",
    x = "Year",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of average price for each year.

Box plot for average_price by year

ggplot(df, aes(x = as.factor(year), y = average_price, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by Year",
    x = "Year",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listing effectiveness for each year.

Box plot for listing_effectiveness by year

ggplot(df, aes(x = as.factor(year), y = listing_effectiveness, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by Year",
    x = "Year",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This chart will help in comparing the average sales performance for each month.

Bar chart for mean sales by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_sales, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by Month",
    x = "Month",
    y = "Mean Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of average sales volume per month.

Bar chart for mean volume by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_volume, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by Month",
    x = "Month",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average median price per month.

Bar chart for mean median_price by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_median_price, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by Month",
    x = "Month",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average number of listings per month.

Bar chart for mean listings by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_listings, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by Month",
    x = "Month",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average months inventory per month.

Bar chart for mean months_inventory by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_months_inventory, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by Month",
    x = "Month",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average price per month.

Bar chart for mean average_price by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_average_price, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by Month",
    x = "Month",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average listing effectiveness per month.

Bar chart for mean listing_effectiveness by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_listing_effectiveness, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by Month",
    x = "Month",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
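
The monthly charts above label the x axis 1 to 12. If month names are preferred, the numbers can be mapped to an ordered factor with base R's month.abb constant. A small sketch (month_label is an illustrative name):

```r
# Map month numbers to labels while keeping calendar order.
month_num   <- 1:12
month_label <- factor(month.abb[month_num], levels = month.abb)

levels(month_label)      # calendar order, not alphabetical
as.integer(month_label)  # underlying codes still run 1 to 12
```

Setting levels explicitly matters: without it the factor would sort alphabetically and the bars would appear out of calendar order. The same expression, factor(month.abb[month], levels = month.abb), could be used directly as the x aesthetic.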

This will provide insights into the spread, central tendency, and outliers of sales for each month.

Box plot for sales by month

ggplot(df, aes(x = as.factor(month), y = sales, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by Month",
    x = "Month",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of sales volume for each month.

Box plot for volume by month

ggplot(df, aes(x = as.factor(month), y = volume, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Month",
    x = "Month",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of median price for each month.

Box plot for median_price by month

ggplot(df, aes(x = as.factor(month), y = median_price, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by Month",
    x = "Month",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listings for each month.

Box plot for listings by month

ggplot(df, aes(x = as.factor(month), y = listings, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by Month",
    x = "Month",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of months inventory for each month.

Box plot for months_inventory by month

ggplot(df, aes(x = as.factor(month), y = months_inventory, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by Month",
    x = "Month",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of average price for each month.

Box plot for average_price by month

ggplot(df, aes(x = as.factor(month), y = average_price, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by Month",
    x = "Month",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Box Plot for listing_effectiveness by Month:

Create a box plot to visualize the distribution of listing effectiveness across different months, to provide insights into the spread, central tendency, and outliers of listing effectiveness for each month.

library(ggplot2)

Box plot for listing_effectiveness by month

ggplot(df, aes(x = as.factor(month), y = listing_effectiveness, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by Month",
    x = "Month",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
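
The box plots by month above all follow the same template, so they could also be drawn as a single faceted figure. The sketch below shows one possible way to do that. It assumes the tidyr package is installed (it is not loaded elsewhere in this report), and `toy` is a made-up stand-in for df filled with random placeholder values, not real market data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for df: random placeholder values, 4 cities x 12 months
set.seed(1)
toy <- expand.grid(city = c("A", "B", "C", "D"), month = 1:12)
toy$sales    <- rpois(nrow(toy), lambda = 190)
toy$listings <- rpois(nrow(toy), lambda = 1700)

# Reshape to long format: one row per (observation, variable) pair
toy_long <- toy %>%
  pivot_longer(cols = c(sales, listings),
               names_to = "variable", values_to = "value")

# One box plot panel per variable, each panel with its own y scale
p <- ggplot(toy_long, aes(x = as.factor(month), y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Month", y = NULL) +
  theme_minimal()
print(p)
```

With the real data, `toy` would be replaced by df and the `cols` argument extended to the other numeric columns.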

Data Analysis Key Findings:

Sales and Volume Growth:
Both mean sales and mean volume show a consistent increase over the years. This suggests a healthy and expanding real estate market in Texas during this period. For example, mean sales rose from 169 in 2010 to 231 in 2014, and mean sales volume increased from 25.7 million to 39.8 million in the same period.
Price Appreciation:
Mean median_price and mean average_price also demonstrated an upward trend, indicating increasing property values. Mean median price increased from $130,192 in 2010 to $139,481 in 2014, and mean average price followed a similar pattern.
Increased Listings:
The mean listings also increased steadily from 1572 in 2010 to 2106 in 2014, suggesting more properties were coming onto the market, potentially driven by the improving market conditions.
Improved Market Efficiency:
Mean listing_effectiveness generally improved from 0.108 in 2010 to 0.123 in 2014, indicating that a higher proportion of listings were resulting in sales each year. This suggests a more active and demand-driven market.
Decreasing Inventory:
Mean months_inventory decreased from 9.9 in 2010 to 8.4 in 2014. A falling months-of-inventory figure generally means properties are selling faster relative to supply, which shifts the market in sellers' favor and reflects strong demand.
Distribution Insights (from Box Plots):
The box plots illustrate that while the median values for sales, volume, and prices generally increased, the spread (IQR) and the presence of outliers also varied by year. Some years show tighter distributions, while others reveal a wider range of values, indicating periods of higher market volatility or varied performance across different segments.
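
To put these changes on a common scale, the 2010 and 2014 means quoted above can be converted to percentage changes. The sketch below is only a worked calculation on those rounded figures; the data frame `yearly` and its column names are introduced here for illustration.

```r
# Yearly means quoted in the findings above (2010 vs 2014)
yearly <- data.frame(
  metric = c("sales", "volume", "median_price", "listings",
             "listing_effectiveness", "months_inventory"),
  y2010  = c(169, 25.7, 130192, 1572, 0.108, 9.9),
  y2014  = c(231, 39.8, 139481, 2106, 0.123, 8.4)
)

# Relative change over the period, in percent, rounded to one decimal
yearly$pct_change <- round(100 * (yearly$y2014 - yearly$y2010) / yearly$y2010, 1)
print(yearly)
```

Because the inputs are rounded means, the resulting percentages are approximate; they mainly help compare the size of the changes across metrics measured in different units.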

Insights or Next Steps:

Confirmation of Growth:
The analysis clearly confirms a positive growth trajectory in the Texas real estate market between 2010 and 2014 across multiple key indicators. This information is crucial for Texas Realty Insights to understand macro-level market performance.
Strategic Planning:
The observed trends can inform strategic decisions, such as anticipating future market conditions, adjusting pricing strategies, and optimizing listing efforts based on annual performance patterns.
Deeper Dive:
Further analysis could explore the drivers behind the increasing listing_effectiveness and decreasing months_inventory, such as economic indicators, population growth, or interest rates, to provide more granular actionable insights.

Create a stacked bar chart to compare the total sales across months, with cities stacked within each month.

First, aggregate sales by month and city to obtain the total_sales value for each combination; this prepares the data for the stacked bar chart.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
  ungroup()

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

print(sales_by_month_city)

## # A tibble: 48 × 3
##    month city                  total_sales
##    <int> <chr>                       <int>
##  1     1 Beaumont                      608
##  2     1 Bryan-College Station         591
##  3     1 Tyler                         907
##  4     1 Wichita Falls                 442
##  5     2 Beaumont                      677
##  6     2 Bryan-College Station         628
##  7     2 Tyler                        1058
##  8     2 Wichita Falls                 454
##  9     3 Beaumont                      855
## 10     3 Bryan-College Station         949
## # ℹ 38 more rows

Use the aggregated sales_by_month_city data to create a stacked bar chart to visualize total sales by month and city.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = 'drop')

Added .groups = 'drop' to suppress the summarise() grouping message
Create a stacked bar chart for total sales by month and city

ggplot(sales_by_month_city, aes(x = as.factor(month), y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Total Sales by Month and City",
    x = "Month",
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Box Plot for median_price by City:

Create a box plot to compare the distribution of median prices across cities, to highlight differences in central tendency, variability, and potential outliers.

library(ggplot2)

Create a boxplot to compare the distribution of median_price between cities

ggplot(df, aes(x = city, y = median_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by City",
    x = "City",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplot Analysis: Distribution of Median Price by City

The boxplot illustrates the distribution of median_price across the four cities: Beaumont, Bryan-College Station, Tyler, and Wichita Falls.

Central Tendency (Median):

Bryan-College Station shows the highest median price, indicating a significantly higher-priced market.
Tyler has the second-highest median, above Beaumont and Wichita Falls.
Beaumont displays a lower median price, suggesting more moderate property values.
Wichita Falls has the lowest median, making it the most affordable market.

Variability (IQR and Whiskers):

Bryan-College Station exhibits the widest IQR and longest whiskers, reflecting large month-to-month dispersion in its median price.
Tyler shows moderate variability, lower than Bryan-College Station but higher than Beaumont and Wichita Falls.
Beaumont has a more compact box, indicating lower variability in property prices.
Wichita Falls shows the narrowest IQR and shortest whiskers, suggesting highly uniform and tightly clustered prices.

Outliers:

Bryan-College Station contains several upper and lower outliers, that is, months in which the median price was unusually high or unusually low.
Tyler, Beaumont, and Wichita Falls show few or no significant outliers, suggesting more predictable and stable price ranges.

Preliminary Conclusions:

Bryan-College Station is the most expensive and heterogeneous market, with strong price dispersion and distinct segments.
Tyler represents a mid‑range market with moderate variability.
Beaumont offers more moderate prices with limited dispersion.
Wichita Falls is the most affordable and homogeneous market, with consistently uniform pricing.

These differences are essential for Texas Realty Insights when tailoring marketing strategies, positioning properties, and advising clients based on each city’s specific market dynamics.

Boxplots for volume by City:

Create a boxplot to compare the distribution of total sales volume across cities, highlighting differences in spread, central tendency, and outliers.

library(ggplot2)

Box plot for volume by city

ggplot(df, aes(x = city, y = volume, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by City",
    x = "City",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplots for volume by Year:

Create a boxplot to compare the distribution of total sales volume across years, to identify temporal trends, variability, and unusual values.

library(ggplot2)

Box plot for volume by year

ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Year",
    x = "Year",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

To create a normalized stacked bar chart, first calculate the proportion of sales for each city within each month: group the previously aggregated data by month, then divide each city's sales by the total monthly sales.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city (repeated from the previous step so that this chunk runs on its own)

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = 'drop')

Calculate the proportion of sales for each city within each month

sales_by_month_city_normalized <- sales_by_month_city %>%
  group_by(month) %>%
  mutate(proportion = total_sales / sum(total_sales)) %>%
  ungroup()

print(sales_by_month_city_normalized)

## # A tibble: 48 × 4
##    month city                  total_sales proportion
##    <int> <chr>                       <int>      <dbl>
##  1     1 Beaumont                      608      0.239
##  2     1 Bryan-College Station         591      0.232
##  3     1 Tyler                         907      0.356
##  4     1 Wichita Falls                 442      0.173
##  5     2 Beaumont                      677      0.240
##  6     2 Bryan-College Station         628      0.223
##  7     2 Tyler                        1058      0.376
##  8     2 Wichita Falls                 454      0.161
##  9     3 Beaumont                      855      0.226
## 10     3 Bryan-College Station         949      0.250
## # ℹ 38 more rows

Generate the normalized stacked bar chart to visually represent the proportion of sales contributed by each city for every month.

ggplot(sales_by_month_city_normalized, aes(x = as.factor(month), y = proportion, fill = city)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(
    title = "Normalized Total Sales by Month and City",
    x = "Month",
    y = "Proportion of Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
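
As a quick check of the normalization, the sketch below recomputes the shares for the first eight rows printed above (months 1 and 2) and confirms that they add up to 1 within each month. It also illustrates that position = "fill" rescales each bar on its own, so the raw totals could be plotted directly. `sales_sample` and `shares` are names introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# First eight rows of sales_by_month_city as printed above (months 1 and 2)
sales_sample <- data.frame(
  month = rep(1:2, each = 4),
  city  = rep(c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"), 2),
  total_sales = c(608, 591, 907, 442, 677, 628, 1058, 454)
)

# Same normalization as above: each city's share of its month's total
shares <- sales_sample %>%
  group_by(month) %>%
  mutate(proportion = total_sales / sum(total_sales)) %>%
  ungroup()

# Sanity check: shares within a month should add up to 1
month_totals <- shares %>%
  group_by(month) %>%
  summarise(total = sum(proportion), .groups = "drop")
print(month_totals)

# position = "fill" rescales each bar to 1, so raw totals can be used as y
p <- ggplot(sales_sample, aes(x = as.factor(month), y = total_sales, fill = city)) +
  geom_col(position = "fill") +
  labs(x = "Month", y = "Proportion of Sales") +
  theme_minimal()
print(p)
```

Keeping the explicit proportion column is still useful when the shares themselves need to be printed or reported, as in the table above.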

Line Chart for sales Over Time by City:

Generate a line chart using ggplot2 to visualize the trend of sales over time, differentiated by city.
This plot will help compare sales trajectories across locations during different historical periods and highlight how each market evolves over time.

Purpose:
Visualize temporal sales patterns for each city to identify growth trends, seasonal fluctuations, and deviations in performance.
Task:

Create a line chart mapping date (or year/month) on the x-axis and sales on the y-axis.
Differentiate cities using color, line type, or grouping.
Add titles, axis labels, and styling for clarity.

Interpretation:
Comment on the observed trends, noting:

Long-term growth or decline in each city.
Seasonal or cyclical patterns.
Deviations, anomalies, or divergences between cities.

library(dplyr)

df$date <- as.Date(paste0(df$year, "-", df$month, "-01"), format = "%Y-%m-%d")

cat("Head of DataFrame with new 'date' column:\n")

## Head of DataFrame with new 'date' column:

print(head(df))

##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price listing_effectiveness       date
## 1               High      170626.5            0.05414220 2010-01-01
## 2        Medium-High      163796.3            0.06809584 2010-02-01
## 3         Medium-Low      157697.8            0.10775607 2010-03-01
## 4         Medium-Low      134095.0            0.11709602 2010-04-01
## 5         Medium-Low      142737.6            0.11405985 2010-05-01
## 6         Medium-Low      144015.9            0.10482529 2010-06-01

Create a line chart for sales trend over time by city

library(ggplot2)

ggplot(df, aes(x = date, y = sales, color = city, group = city)) +
  geom_line() +
  labs(
    title = "Sales Trend Over Time by City",
    x = "Date",
    y = "Sales"
  ) +
  theme_minimal()

Comprehensive Summary of ggplot2 Visualizations (Project Step 8):

This summary consolidates insights from all ggplot2 visualizations, showing how they enhance understanding of the Texas real estate market and support strategic decisions for Texas Realty Insights.

Boxplot of median_price by City

Bryan-College Station shows the highest median prices, indicating a premium market.
Tyler follows, while Beaumont and Wichita Falls display progressively lower prices.
Bryan-College Station has the widest IQR and many outliers, reflecting high price heterogeneity; Wichita Falls shows the tightest, most predictable distribution.
These visuals reinforce conditional analyses and highlight the need for city‑specific pricing and segmentation strategies.

Boxplots of volume by City and by Year

2.1. Volume by City

Tyler leads in total sales volume, confirming its role as the most active market.
Bryan-College Station shows substantial but more variable volume; Beaumont and Wichita Falls remain lower and more stable.
Findings align with conditional analyses and suggest high‑volume strategies for Tyler and variability‑management approaches for Bryan-College Station.

2.2. Volume by Year
Clear upward trend in sales volume from 2010 to 2014, with higher medians and wider IQRs in later years.
Increased upper outliers in 2013–2014 indicate growing market dynamism.
These visuals confirm annual growth patterns and support expansion-oriented planning.

Stacked and Normalized Bar Charts of sales by Month and City

Strong seasonal peaks appear from May to July, with Tyler consistently contributing the largest share of total sales.
Normalized charts show stable relative market shares across cities despite seasonal fluctuations.
These visuals reinforce monthly conditional analyses and highlight the need for seasonal marketing intensification.

Line Chart of sales Over Time by City

Shows integrated annual and seasonal trends: steady year‑over‑year growth and recurring spring/summer peaks.
Tyler maintains the highest trajectory, followed by Bryan-College Station, Beaumont, and Wichita Falls.
Parallel movement across cities suggests shared macroeconomic influences combined with local market differences.
Supports long‑term planning and seasonal tactical adjustments.

Overall Strategic Value:

The visualizations validate statistical findings with clear graphical evidence.
They simplify communication of complex market dynamics for stakeholders.
They provide a strong foundation for decisions on segmentation, resource allocation, marketing timing, and listing strategies, directly supporting project objectives.

Task 9:

Final Conclusions and Recommendations for Texas Realty Insights

This project delivered a comprehensive statistical and visual analysis of the Texas real estate market from 2010 to 2014 using the “Real Estate Texas.csv” dataset.
The objectives—identifying trends, evaluating marketing effectiveness, and providing visual insights for strategic decision‑making—were fully achieved through descriptive statistics, probability calculations, new variable creation, conditional analyses, and extensive ggplot2 visualizations.

Key Findings and Emerging Trends:

Data Robustness and Statistical Validity:

The dataset is perfectly balanced across city, year, and month, ensuring reliable conditional comparisons.
Sales and volume show positive skewness, reflecting occasional high‑activity periods.
Median_price shows slight negative skewness.
The Gini heterogeneity index for median_price_class (~0.750) confirms strong price diversity and effective segmentation.
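
For reference, the Gini heterogeneity index of a categorical variable is commonly defined as one minus the sum of squared relative frequencies. The sketch below uses that definition, which is assumed here rather than taken from the project's own code. `gini_heterogeneity` is a hypothetical helper, and the example vector assumes four equally frequent price classes (the label "Low" is also an assumption).

```r
# Hypothetical helper: Gini heterogeneity index, G = 1 - sum(f_i^2)
gini_heterogeneity <- function(x) {
  f <- prop.table(table(x))   # relative frequency of each class
  1 - sum(f^2)
}

# Assumed example: four classes with 60 observations each (240 in total)
classes <- rep(c("Low", "Medium-Low", "Medium-High", "High"), each = 60)

g <- gini_heterogeneity(classes)
k <- length(unique(classes))
g_norm <- g / ((k - 1) / k)   # normalized index, rescaled to the range 0 to 1

print(c(gini = g, normalized = g_norm))
```

With four categories the index cannot exceed 0.75, so a value near 0.750 mainly indicates that the classes are about equally populated.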

Multi‑Year Growth (2010–2014):

Steady increases in sales, volume, median_price, and average_price each year.
Months_inventory declines from 9.9 to 8.4 months, indicating a tightening seller’s market.
Listing_effectiveness improves from 0.108 to 0.123, showing more efficient listing‑to‑sale conversion.
Boxplots of volume by year reinforce this trend, with higher medians and more upper outliers in later years.

Pronounced Seasonal Patterns:

Clear peaks in sales, volume, median_price, average_price, and listing_effectiveness from May to July.
Months_inventory reaches its lowest levels during these high‑activity months and peaks in winter.
Line charts confirm recurring seasonal cycles across all cities.

City‑Specific Dynamics:

Tyler: Highest activity and volume, excellent listing effectiveness.
Bryan-College Station: Highest prices and greatest variability, indicating a premium and segmented market.
Beaumont: Moderate and stable market profile.
Wichita Falls: Lowest prices and activity, most homogeneous market.

Insights from New Variables:

Average_price (~$154,320) exceeds median_price (~$132,665), showing that high‑value transactions disproportionately influence total volume.
Listing_effectiveness reveals meaningful differences in market efficiency across cities and seasons.
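
The two derived variables can be reproduced from the raw columns. The formulas in the sketch below are inferred from the head(df) output printed earlier in this report (they match the six Beaumont rows shown there); they are not quoted from the step that created the variables, so treat them as an assumption.

```r
# The six Beaumont rows printed by head(df) earlier in this report
beaumont <- data.frame(
  sales    = c(83, 108, 182, 200, 202, 189),
  volume   = c(14.162, 17.690, 28.701, 26.819, 28.833, 27.219),  # millions of dollars
  listings = c(1533, 1586, 1689, 1708, 1771, 1803)
)

# Assumed formulas, inferred from the printed values:
# average price per sale in dollars, and sales per active listing
beaumont$average_price         <- beaumont$volume * 1e6 / beaumont$sales
beaumont$listing_effectiveness <- beaumont$sales / beaumont$listings

print(beaumont)
```

Under these formulas, average_price is a mean sale price, which explains why it sits above median_price when a few expensive sales pull the mean upward.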

Strategic Value and Practical Implications:

The analysis provides a statistically solid foundation for strategic planning.
Findings are consistent across multiple analytical methods and visualizations, reinforcing their reliability.

Actionable Recommendations for Texas Realty Insights:

7.1. City‑Specific Marketing and Sales Strategies:

Evidence: Strong differences in means, medians, variability, and distributions across cities.
Implications:
- Tyler: Focus on high‑volume, fast‑turnover marketing.
- Bryan-College Station: Highlight premium properties and tailor services to high‑value clients.
- Beaumont & Wichita Falls: Emphasize affordability, long‑term value, and community features.
7.2. Proactive Seasonal Optimization:
Evidence: Consistent peaks in sales, volume, and listing effectiveness from May to July.
Implications:
- Intensify marketing and listing acquisition from March to August.
- Use winter months for seller preparation, long‑term planning, and expectation management.
7.3. Strategic Use of Listing Effectiveness and Months Inventory:
Evidence: Listing effectiveness improves annually and seasonally; months inventory is inversely correlated with market activity.
Implications:
- Integrate these KPIs into performance dashboards and agent training.
- Use them to guide pricing, timing, and client advisement.
7.4. Capitalizing on Multi‑Year Growth:
Evidence: Strong upward trends across all key metrics.
Implications:
- Invest in expansion, especially in high‑growth cities.
- Use historical trends to support realistic sales forecasts and portfolio growth.
7.5. Dynamic Pricing Strategy Considerations:
Evidence: Statistically significant seasonal and annual price fluctuations.
Implications:
- Implement dynamic pricing models aligned with seasonal peaks.
- Recommend more aggressive pricing in high‑demand months and cautious adjustments in slower periods.

Final Summary:

This analysis transforms raw data into a clear, actionable understanding of Texas real estate dynamics. The recommendations—grounded in strong statistical and visual evidence—equip Texas Realty Insights to make informed decisions, optimize strategies, and strengthen their position in a competitive and evolving market.

Analisi Esplorativa del Mercato Immobiliare del Texas

Filippo Roppo

2026-01-26
