# Data Dive 2
We load the tidyverse, whose ggplot2 package ships our dataset, `txhousing`. The Texas Housing dataset contains information about the housing market in Texas, provided by the TAMU Real Estate Center.
## Loading the tidyverse library and the txhousing dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
summary(txhousing)
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date
## Min. :2000
## 1st Qu.:2004
## Median :2008
## Mean :2008
## 3rd Qu.:2012
## Max. :2016
##
# Gap between mean and median monthly sales for each city and year
out <- txhousing |>
  group_by(city, year) |>
  summarise(yearly_diff = mean(sales) - median(sales))
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
out
## # A tibble: 736 × 3
## # Groups: city [46]
## city year yearly_diff
## <chr> <int> <dbl>
## 1 Abilene 2000 12.1
## 2 Abilene 2001 2.25
## 3 Abilene 2002 -7.17
## 4 Abilene 2003 -1
## 5 Abilene 2004 0
## 6 Abilene 2005 -2.75
## 7 Abilene 2006 8.42
## 8 Abilene 2007 0.917
## 9 Abilene 2008 1.58
## 10 Abilene 2009 -0.833
## # ℹ 726 more rows
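The summary output reports 568 missing values in `sales`, and `mean()` and `median()` return `NA` for any city-year group that contains one. The following is a minimal sketch of the same comparison with missing months skipped, assuming that dropping them is acceptable here; the names `out_complete` and `n_missing` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(ggplot2)  # provides the txhousing dataset

# Mean-minus-median gap per city and year, ignoring missing months.
# Assumption: dropping NA months is acceptable for this comparison.
out_complete <- txhousing |>
  group_by(city, year) |>
  summarise(
    yearly_diff = mean(sales, na.rm = TRUE) - median(sales, na.rm = TRUE),
    n_missing = sum(is.na(sales)),  # number of months skipped in this group
    .groups = "drop"                # ungrouped result, no grouping message
  )

out_complete
```

Keeping `n_missing` next to the result makes it easy to see which city-years rest on fewer than twelve months of data.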
### What is the distribution of the sales column?
hist(txhousing$sales)
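The summary output shows a mean of 549.6 against a median of 169 and a maximum of 8,945, so the distribution is strongly right-skewed and most of a linear-scale histogram is taken up by the long tail. As a rough sketch, the same histogram can be drawn on a log10 axis; the bin count and the names `sales_known` and `p_sales` are arbitrary choices made for illustration.

```r
library(ggplot2)

# Drop missing sales explicitly so ggplot2 does not warn about removed rows.
sales_known <- subset(txhousing, !is.na(sales))

# Histogram of monthly sales on a log10 x-axis.
# Assumption: 30 bins is an arbitrary illustrative choice.
p_sales <- ggplot(sales_known, aes(x = sales)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(x = "Monthly sales (log10 scale)", y = "Number of city-months")

p_sales
```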
### How does the number of listings change over time for each city?
txhousing |>
  ggplot() +
  geom_line(mapping = aes(x = year, y = listings, color = city))
## Warning: Removed 518 rows containing missing values (`geom_line()`).
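Because each city has up to twelve observations per `year`, mapping `year` to the x-axis places all of a year's months at the same x position, which draws vertical segments rather than a trend. One possible alternative, sketched below, uses the fractional-year `date` column so that every month gets its own position. It also replaces the 46-colour legend with a `group` mapping and transparency; that is a readability choice, not something the original does, and `p_listings` is an illustrative name.

```r
library(ggplot2)

# Monthly listings per city, using the fractional-year `date` column so that
# each month gets its own x position. One line per city via `group`.
p_listings <- ggplot(txhousing, aes(x = date, y = listings, group = city)) +
  geom_line(alpha = 0.4, na.rm = TRUE) +  # na.rm = TRUE silences the missing-value warning
  labs(x = "Date", y = "Listings")

p_listings
```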
### How do housing sales change over time for each city?
txhousing |>
  ggplot() +
  geom_line(mapping = aes(x = year, y = sales, color = city))
## Warning: Removed 430 rows containing missing values (`geom_line()`).
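If the goal is a year-over-year trend, one option is to aggregate the monthly figures to a single value per city and year before plotting. The sketch below sums sales within each year, assuming a yearly total is a fair summary even when some months are missing; `yearly_sales`, `total_sales`, and `p_yearly` are illustrative names. Note that 8,602 rows across 46 cities is 187 months per city, i.e. 15 full years plus 7 months, so the 2015 totals cover only part of the year and will look low.

```r
library(dplyr)
library(ggplot2)

# Total sales per city and year; missing months are skipped.
# Assumption: a yearly total is a fair summary even when some months are missing.
yearly_sales <- txhousing |>
  group_by(city, year) |>
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")

# One line per city; 2015 is only partially observed, so its totals are understated.
p_yearly <- ggplot(yearly_sales, aes(x = year, y = total_sales, group = city)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year", y = "Total sales")

p_yearly
```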