Data Dive 6

Loading our Texas Housing Dataset:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

Creating Pairs and Columns

In order to understand how multiple columns in our dataset are correlated, we will collect pairs of columns and observe how the values behave with respect to each other.

Pair1: Listings and Sales

The Listings column represents the total number of units available for a given city at a given month. Meanwhile the sales column refers to the total number of units sold for that given city at a given month.

Creating new column - Unsold

We will subtract the total Sales from Listings to create a column of Unsold units:

## # A tibble: 7,126 × 1
##    unsold
##     <dbl>
##  1    629
##  2    648
##  3    654
##  4    687
##  5    653
##  6    624
##  7    590
##  8    634
##  9    667
## 10    663
## # ℹ 7,116 more rows

Visualisation:

txhousing |> ggplot()+ geom_point(mapping = aes(x = listings, y = sales), color='darkblue') + labs(title = "Listings vs Sales",
       x = "Total Units Sold",
       y = "Total Units Listed")

At first glance, the 2 columns appear to have a Positive Correlation with a strong cluster of points in the bottom left of the graph. It gives me the impression that the higher the number of listed units, the more units are sold.

Calculating Correlation Coefficient:

round(cor(txhousing$listings,txhousing$sales),2)
## [1] 0.92

As seen above the Correlation Coefficient between the 2 values is 0.92, which is a considerably strong and positive relationship between Total Units Listed and Total Units Sold. This relationship seems logical and consistent as an increase in Supply can often result in More people buying houses. The strength of this relationshiop might have been even stronger if we limitted our value to between 0 - 2500 listings.

Pair 2: Median and Inventory

Median represents the median sales price for that city in a given month. Meanwhile, Inventory refers to amount of time (in months) it would take to sell all current listings at current pace of sales.

Creating a new column - Price_Category

We will use the Median column that contains the median sales prices to create a new categorical column - Price_Category that consists of 5 values [‘Low_End’, ‘Lower_Medium’,‘Upper_Medium’,‘High_End’,‘Premium’]

price_ranges = c('Low_End', 'Lower_Medium','Upper_Medium','High_End','Premium')

txhousing <-
  txhousing |> 
  mutate(price_category = cut(median, breaks = 5, 
                       labels = price_ranges, right = TRUE))

txhousing = na.omit(txhousing)
select(txhousing,price_category)
## # A tibble: 7,126 × 1
##    price_category
##    <fct>         
##  1 Low_End       
##  2 Low_End       
##  3 Low_End       
##  4 Low_End       
##  5 Low_End       
##  6 Low_End       
##  7 Low_End       
##  8 Low_End       
##  9 Low_End       
## 10 Low_End       
## # ℹ 7,116 more rows

The number of listings in each Price Category can be seen below:

The most frequent category appears to be - Lower Medium while the least frequent category appears to be Premium

### Number of Each Category:
table(txhousing$price_category)
## 
##      Low_End Lower_Medium Upper_Medium     High_End      Premium 
##         1673         3577         1560          268           48

Visualising the Columns:

txhousing |> ggplot()+ geom_point(mapping = aes(x = median, y = inventory)) + labs(title = "Inventory vs Median Sales Price",
       y = "Inventory",
       x = "Median Sales Price")

From the above graph, it appears as though there is a neutral or very slightly negative relationship between Median Sales Price and Total Inventory. There are, however, a considerable number of Outliers in this graph which make the relationship slightly difficult to ascertain.

Calculating Correlation Coefficient:

cor(txhousing$median, txhousing$inventory)
## [1] -0.1421893

Our Correlation Coefficient of -0.142 confirms our observation of weak negative relationship between Median Sales Price and Inventory. Logically, this value seems consistent, given the fact that Inventory represents the amount of time needed to to sell out current listings given the current pace of sales.

The Median Price does not itself lend us much insight into the pace of sales or the number of remaining units. As a result a very weak relationshiop between the 2 columns seems logical and understandable.

Pair 3 - Volume and Sales

The Volume columns represents the product of Sales Price multiplied by the number of units sold, while Sales represents the total number of units sold in a given city for a given year. In a sense, Sales can be seen as the explanatory variable while Volume is the response variable

Creating a new column - Sales_Value

The Volume column represents the total number of sold Units times the Sales price of each unit. Since we don’t have a column for the exact dollar amount of sales in the dataset we can calculate it by dividing the Volume column by the Sales column

txhousing = txhousing |> mutate(sales_value = volume/sales)

txhousing = na.omit(txhousing)
select(txhousing,sales_value)
## # A tibble: 7,126 × 1
##    sales_value
##          <dbl>
##  1      74722.
##  2      66378.
##  3      71423.
##  4      99286.
##  5      75106.
##  6      89167.
##  7      83125 
##  8      81756.
##  9      73221.
## 10      69703.
## # ℹ 7,116 more rows

Visualisation

txhousing |> ggplot()+ geom_point(mapping = aes(x = sales, y = volume)) + labs(title = "Volume($) vs Units Sold",
       y = "Total Volume $",
       x = "Total Units Sold")

From the above graph wee can observe a very strong, positive and linear relationship between the Total Volume and Total columns sold. This is very obviously predictable given the fact that sales is an explanatory variable and volume is a response variable.

It is very reasonable expect a high, positive correlation coefficient between the 2 variables

Correlation Coefficient for Sales and Volume

cor(txhousing$sales,txhousing$volume)
## [1] 0.9808091

As we observe above, the calculated correlation coefficient is 0.98 - which is in line with our expectation of a strong, positive and linear between the 2 variables.

Calculating the Confidence Intervals for Total Volume:

We set P as 95% and calculate the confidence interval for the Mean of Volume which is our Response Variable

sample_n <- length(txhousing$volume)
sample_sd <- sd(txhousing$volume)
sample_se <- sample_sd/sqrt(sample_n)

P = 0.95
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)

sample_mean = mean(txhousing$volume)

c(sample_mean - z_score, sample_mean + z_score)
## [1] 118252317 118252321

As we see above the 95% confidence-interval for the mean of Volume is [118252317, 118252321]. This means that we can say that for every 100 samples of the population we take from the population of Home Sales in Texas, we will find the true mean in this range in 95 samples.

Conclusion and Further Questions:

In this Data Dive we created 3 Pairs of columns and observed how these columns behaved in relation to each other. In some cases we observed very strong relationships, with high correlation-coefficients (Sales Count - Volume ) while in some cases we observed very low correlation (Median Sales Price - Inventory).

In the case of Sales and Volume, the two columns represent a case of Explanatory and Response Variables respectively, which is well demonstrated by their high correlation coefficients. The mechanism of this correllation is also logical and consistent. However, the case of Listings vs Sales is also worth noting in that produced a correlation coefficient of 0.92, which is fairly high; but the causal mechanism of why having more listings resulted in more sales is rather difficult to ascertain. This might be a question worth pondering in the future.