library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
For the purposes of this Data Dive we will explore how house sales (a continuous variable) are affected by the year of sale (treated as a categorical variable).
“The difference between the mean sales of houses in Texas across the years 2000 to 2010 is 0.”
In other words, \[ H_0 : \theta_{2000} = \theta_{2001} = \cdots = \theta_{2010} \] where \(\theta_i\) refers to the mean of sales for year \(i\).
The alternative hypothesis would be that there exists at least one year \(i\) where the mean sales \(\theta_i\) is different from other years. In other words, we are testing whether a group factor is independent of the response variable.
First, let us create a subset of the txhousing data frame that only includes the years 2000 to 2010, and then explore how the distributions of sales vary by year:
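The chunk that builds the subset is not shown in this output, so the following is only a sketch of what it might look like. It assumes `year` is converted to a factor (which would explain the 10 degrees of freedom reported for `year` in the ANOVA) and that the plot is one boxplot per year. The name `ten_yr_df` is taken from the `aov()` call; everything else is an assumption.

```r
library(dplyr)
library(ggplot2)  # also provides the txhousing data set

# Assumption: keep 2000-2010 and treat year as a categorical factor
ten_yr_df <- txhousing %>%
  filter(year >= 2000, year <= 2010) %>%
  mutate(year = factor(year))

# One box per year; rows with missing sales are dropped with a warning
ggplot(ten_yr_df, aes(x = year, y = sales)) +
  geom_boxplot()
```

Converting `year` to a factor matters here: left as an integer, `aov()` would fit a single linear trend in year rather than compare eleven group means.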
## Warning: Removed 560 rows containing non-finite values (`stat_boxplot()`).
anova <- aov(sales ~ year, data = ten_yr_df)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## year 10 2.794e+07 2793586 2.408 0.00747 **
## Residuals 5501 6.381e+09 1159909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 560 observations deleted due to missingness
The p-value we observe is 0.00747, which is below the conventional 0.05 significance level. If we assume the null hypothesis to be true, the probability of obtaining an F value of 2.408 or greater is less than 1%.
This implies that we can reject the null hypothesis and conclude that at least one year has a mean that is different from the rest.
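As a quick sanity check, the F value and p-value can be recomputed by hand from the mean squares and degrees of freedom printed in the ANOVA table. This is a minimal sketch using only base R; the numbers are copied from the table above, and the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ms_year  <- 2793586   # Mean Sq for year
ms_resid <- 1159909   # Mean Sq for residuals
df_year  <- 10
df_resid <- 5501

# F statistic: ratio of between-group to within-group mean squares
f_value <- ms_year / ms_resid

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_value, df1 = df_year, df2 = df_resid, lower.tail = FALSE)

f_value
p_value
```

Both results should agree with the `F value` and `Pr(>F)` columns of the table up to rounding.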
We will continue analysing the sales column as a response variable and explore how it is affected by other variables in this data set.
The listings variable represents the number of houses listed on the housing market and, as such, could have a positive linear relationship with sales. Let us use a linear regression to observe this relationship:
ggplot(txhousing, aes(listings, sales)) +
  geom_point()
## Warning: Removed 1426 rows containing missing values (`geom_point()`).
model <- lm(sales ~ listings, txhousing)
model$coefficients
## (Intercept) listings
## 22.6613744 0.1792617
# Check model coefficients:
coef(model)
## (Intercept) listings
## 22.6613744 0.1792617
From the above model we can observe that the slope coefficient is approximately 0.179, which implies a positive relationship between the independent variable listings and the response variable sales.
This means that each additional listing is associated with an average increase of about 0.179 in the number of houses sold in that period.
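One way to make this interpretation concrete is to compare the model's predictions for two listing counts that differ by exactly one. The sketch below refits the same model so that it is self-contained; the listing counts 1000 and 1001 are arbitrary illustrative values, not taken from the analysis.

```r
library(ggplot2)  # provides the txhousing data set

# Same model as above: sales regressed on listings
model <- lm(sales ~ listings, data = txhousing)

# Two hypothetical months whose listings differ by exactly one
new_listings <- data.frame(listings = c(1000, 1001))
predicted <- predict(model, newdata = new_listings)

# The gap between the two predictions equals the slope coefficient
diff(predicted)
coef(model)["listings"]
```

Because the model is linear, this gap is the same whichever pair of consecutive listing counts is chosen.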
We can visualise this relationship below:
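# Add regression line to plot (constants are passed directly, not through aes(),
# so that a single line is drawn rather than one per row):
ggplot(txhousing, aes(listings, sales)) +
  geom_point() +
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2],
              colour = "red")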
## Warning: Removed 1426 rows containing missing values (`geom_point()`).
From the above regression we can observe that listings and sales have a positive association, although a regression on observational data does not by itself establish a causal relationship. However, a few questions arise from this: