Data Dive 8

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

ANOVA: How does the number of house sales vary across years?

For the purposes of this Data Dive, we will explore how house sales (a continuous variable) are affected by the year of sale (treated as a categorical variable).

Devising a Null Hypothesis:

"The difference between the mean sales of houses in Texas from 2000 to 2010 is 0."

In other words, \[ H_0 : \theta_{2000} = \theta_{2001} = \cdots = \theta_{2010} \] where \(\theta_i\) refers to the mean of sales for year \(i\).

The alternative hypothesis is that there exists at least one year \(i\) whose mean sales \(\theta_i\) differs from that of the other years. In other words, we are testing whether the response variable (sales) is independent of the grouping factor (year).

First, let us create a subset of the txhousing data frame that only includes the years 2000 to 2010, and then explore how the distributions of sales vary by year:

## Warning: Removed 560 rows containing non-finite values (`stat_boxplot()`).
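The chunk that built ten_yr_df and drew the boxplot is not shown in this output. A minimal sketch of what it may have looked like follows, assuming the subset is a plain filter on year and that year is converted to a factor. The factor conversion is an assumption, but it is consistent with the 10 degrees of freedom reported for year in the ANOVA table (11 groups).

```r
library(tidyverse)

# Assumption: keep the years 2000-2010 and treat year as a categorical factor,
# so that aov() compares 11 group means rather than fitting a linear trend.
ten_yr_df <- txhousing %>%
  filter(year >= 2000, year <= 2010) %>%
  mutate(year = factor(year))

# One box per year; rows with missing sales are dropped with a warning.
ggplot(ten_yr_df, aes(x = year, y = sales)) +
  geom_boxplot()
```

With this subset, nrow(ten_yr_df) is 6,072 (46 cities, 12 months, 11 years), which matches the 5,512 complete observations plus 560 missing ones reported by the ANOVA output.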

Analysis of Variance Calculation:

anova  <- aov(sales ~ year, data = ten_yr_df)
summary(anova)
##               Df    Sum Sq Mean Sq F value  Pr(>F)   
## year          10 2.794e+07 2793586   2.408 0.00747 **
## Residuals   5501 6.381e+09 1159909                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 560 observations deleted due to missingness

The p-value we observe is 0.00747, which is below the conventional 0.05 significance level. This implies that, if we assume the null hypothesis to be true, the probability of obtaining an F value of 2.408 or greater is low (under 1%).

We can therefore reject the null hypothesis and conclude that at least one year has a mean that is different from the rest.
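As a check on the ANOVA table, the F statistic is the ratio of the between-year mean square to the residual mean square, where each mean square is the sum of squares divided by its degrees of freedom. Using the values printed by summary(anova):

\[ F = \frac{MS_{\text{year}}}{MS_{\text{residual}}} = \frac{SS_{\text{year}} / 10}{SS_{\text{residual}} / 5501} = \frac{2793586}{1159909} \approx 2.408 \]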

Linear Regression:

We will continue analysing the sales column as the response variable and explore how it is affected by other variables in this data set.

Regressing Sales on Number of Listings.

The listings variable represents the number of houses listed in the housing market and, as such, could have a positive linear relationship with sales. Let us use a linear regression to observe this relationship:

ggplot(txhousing, aes(listings, sales)) +
      geom_point()
## Warning: Removed 1426 rows containing missing values (`geom_point()`).

model <- lm(sales ~ listings, txhousing)
model$coefficients
## (Intercept)    listings 
##  22.6613744   0.1792617
# Check model coefficients:
coef(model)
## (Intercept)    listings 
##  22.6613744   0.1792617

From the above model we can observe that the slope coefficient is approximately 0.179, which implies that there is a positive relationship between the independent variable listings and the response variable sales.

This means that each additional listing is associated with an average increase of about 0.179 in the number of houses sold in that period.
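To make the slope concrete, here is a small sketch that uses predict() on two hypothetical listing counts, 1,000 and 1,001. These values are chosen only for illustration and do not come from the analysis above. The difference between the two predictions is exactly the slope coefficient.

```r
library(tidyverse)

# Same model as above: sales regressed on listings.
model <- lm(sales ~ listings, data = txhousing)

# Hypothetical listing counts, one unit apart (illustrative values only).
new_listings <- tibble(listings = c(1000, 1001))
preds <- predict(model, newdata = new_listings)

preds        # predicted sales at 1000 and 1001 listings
diff(preds)  # equals the slope coefficient for listings
```

From the coefficients printed above, the prediction at 1,000 listings is roughly 22.66 + 0.1793 × 1000, or about 202 sales.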

We can visualise this relationship below:

# Add regression line to plot:
ggplot(txhousing, aes(listings, sales)) +
      geom_point() + 
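      # Pass intercept and slope as fixed parameters (not inside aes()),
      # so a single line is drawn instead of one per row of data.
      geom_abline(intercept = coef(model)[1], slope = coef(model)[2],
                colour = "red")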
## Warning: Removed 1426 rows containing missing values (`geom_point()`).

Conclusion and Further Questions:

From the above regression we can observe that listings and sales have a positive association, although a regression on observational data does not by itself establish a causal relationship. A few questions arise from this: