library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
In order to understand how multiple columns in our dataset are correlated, we will collect pairs of columns and observe how the values behave with respect to each other.
The Listings column represents the total number of units available for a given city at a given month. Meanwhile the sales column refers to the total number of units sold for that given city at a given month.
We will subtract the total Sales from Listings to create a column of Unsold units:
## # A tibble: 7,126 × 1
## unsold
## <dbl>
## 1 629
## 2 648
## 3 654
## 4 687
## 5 653
## 6 624
## 7 590
## 8 634
## 9 667
## 10 663
## # ℹ 7,116 more rows
txhousing |> ggplot()+ geom_point(mapping = aes(x = listings, y = sales), color='darkblue') + labs(title = "Listings vs Sales",
x = "Total Units Sold",
y = "Total Units Listed")
At first glance, the 2 columns appear to have a Positive Correlation with a strong cluster of points in the bottom left of the graph. It gives me the impression that the higher the number of listed units, the more units are sold.
round(cor(txhousing$listings,txhousing$sales),2)
## [1] 0.92
As seen above the Correlation Coefficient between the 2 values is 0.92, which is a considerably strong and positive relationship between Total Units Listed and Total Units Sold. This relationship seems logical and consistent as an increase in Supply can often result in More people buying houses. The strength of this relationshiop might have been even stronger if we limitted our value to between 0 - 2500 listings.
Median represents the median sales price for that city in a given month. Meanwhile, Inventory refers to amount of time (in months) it would take to sell all current listings at current pace of sales.
We will use the Median column that contains the median sales prices to create a new categorical column - Price_Category that consists of 5 values [‘Low_End’, ‘Lower_Medium’,‘Upper_Medium’,‘High_End’,‘Premium’]
price_ranges = c('Low_End', 'Lower_Medium','Upper_Medium','High_End','Premium')
txhousing <-
txhousing |>
mutate(price_category = cut(median, breaks = 5,
labels = price_ranges, right = TRUE))
txhousing = na.omit(txhousing)
select(txhousing,price_category)
## # A tibble: 7,126 × 1
## price_category
## <fct>
## 1 Low_End
## 2 Low_End
## 3 Low_End
## 4 Low_End
## 5 Low_End
## 6 Low_End
## 7 Low_End
## 8 Low_End
## 9 Low_End
## 10 Low_End
## # ℹ 7,116 more rows
The number of listings in each Price Category can be seen below:
The most frequent category appears to be - Lower Medium while the least frequent category appears to be Premium
### Number of Each Category:
table(txhousing$price_category)
##
## Low_End Lower_Medium Upper_Medium High_End Premium
## 1673 3577 1560 268 48
txhousing |> ggplot()+ geom_point(mapping = aes(x = median, y = inventory)) + labs(title = "Inventory vs Median Sales Price",
y = "Inventory",
x = "Median Sales Price")
From the above graph, it appears as though there is a neutral or very slightly negative relationship between Median Sales Price and Total Inventory. There are, however, a considerable number of Outliers in this graph which make the relationship slightly difficult to ascertain.
cor(txhousing$median, txhousing$inventory)
## [1] -0.1421893
Our Correlation Coefficient of -0.142 confirms our observation of weak negative relationship between Median Sales Price and Inventory. Logically, this value seems consistent, given the fact that Inventory represents the amount of time needed to to sell out current listings given the current pace of sales.
The Median Price does not itself lend us much insight into the pace of sales or the number of remaining units. As a result a very weak relationshiop between the 2 columns seems logical and understandable.
The Volume columns represents the product of Sales Price multiplied by the number of units sold, while Sales represents the total number of units sold in a given city for a given year. In a sense, Sales can be seen as the explanatory variable while Volume is the response variable
The Volume column represents the total number of sold Units times the Sales price of each unit. Since we don’t have a column for the exact dollar amount of sales in the dataset we can calculate it by dividing the Volume column by the Sales column
txhousing = txhousing |> mutate(sales_value = volume/sales)
txhousing = na.omit(txhousing)
select(txhousing,sales_value)
## # A tibble: 7,126 × 1
## sales_value
## <dbl>
## 1 74722.
## 2 66378.
## 3 71423.
## 4 99286.
## 5 75106.
## 6 89167.
## 7 83125
## 8 81756.
## 9 73221.
## 10 69703.
## # ℹ 7,116 more rows
txhousing |> ggplot()+ geom_point(mapping = aes(x = sales, y = volume)) + labs(title = "Volume($) vs Units Sold",
y = "Total Volume $",
x = "Total Units Sold")
From the above graph wee can observe a very strong, positive and linear relationship between the Total Volume and Total columns sold. This is very obviously predictable given the fact that sales is an explanatory variable and volume is a response variable.
It is very reasonable expect a high, positive correlation coefficient between the 2 variables
cor(txhousing$sales,txhousing$volume)
## [1] 0.9808091
As we observe above, the calculated correlation coefficient is 0.98 - which is in line with our expectation of a strong, positive and linear between the 2 variables.
We set P as 95% and calculate the confidence interval for the Mean of Volume which is our Response Variable
sample_n <- length(txhousing$volume)
sample_sd <- sd(txhousing$volume)
sample_se <- sample_sd/sqrt(sample_n)
P = 0.95
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
sample_mean = mean(txhousing$volume)
c(sample_mean - z_score, sample_mean + z_score)
## [1] 118252317 118252321
As we see above the 95% confidence-interval for the mean of Volume is [118252317, 118252321]. This means that we can say that for every 100 samples of the population we take from the population of Home Sales in Texas, we will find the true mean in this range in 95 samples.
In this Data Dive we created 3 Pairs of columns and observed how these columns behaved in relation to each other. In some cases we observed very strong relationships, with high correlation-coefficients (Sales Count - Volume ) while in some cases we observed very low correlation (Median Sales Price - Inventory).
In the case of Sales and Volume, the two columns represent a case of Explanatory and Response Variables respectively, which is well demonstrated by their high correlation coefficients. The mechanism of this correllation is also logical and consistent. However, the case of Listings vs Sales is also worth noting in that produced a correlation coefficient of 0.92, which is fairly high; but the causal mechanism of why having more listings resulted in more sales is rather difficult to ascertain. This might be a question worth pondering in the future.