# Data Dive 2

Loading Packages and Dataset:

We load the tidyverse, whose ggplot2 package ships the dataset used here. The dataset is txhousing (the ‘Texas Housing’ dataset); it contains information about the housing market in Texas, provided by the TAMU Real Estate Center.

## Loading the tidyverse library as well as the TXHousing Dataset
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

txhousing

## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

Summarizing Key Columns:
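The call that produced the table below is not shown in the source. Its layout matches what base R's `summary()` prints for a data frame, so a minimal sketch, assuming that is how it was generated, would be the following (the name `col_summary` is only illustrative):

```r
library(ggplot2)  # ships the txhousing dataset

# Assumption: the summary table in this section came from a call like this one.
# summary() reports quartiles, the mean, and the NA count for numeric columns,
# and length/class/mode for character columns such as city.
col_summary <- summary(txhousing)
col_summary
```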

##      city                year          month            sales       
##  Length:8602        Min.   :2000   Min.   : 1.000   Min.   :   6.0  
##  Class :character   1st Qu.:2003   1st Qu.: 3.000   1st Qu.:  86.0  
##  Mode  :character   Median :2007   Median : 6.000   Median : 169.0  
##                     Mean   :2007   Mean   : 6.406   Mean   : 549.6  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.: 467.0  
##                     Max.   :2015   Max.   :12.000   Max.   :8945.0  
##                                                     NA's   :568     
##      volume              median          listings       inventory     
##  Min.   :8.350e+05   Min.   : 50000   Min.   :    0   Min.   : 0.000  
##  1st Qu.:1.084e+07   1st Qu.:100000   1st Qu.:  682   1st Qu.: 4.900  
##  Median :2.299e+07   Median :123800   Median : 1283   Median : 6.200  
##  Mean   :1.069e+08   Mean   :128131   Mean   : 3217   Mean   : 7.175  
##  3rd Qu.:7.512e+07   3rd Qu.:150000   3rd Qu.: 2954   3rd Qu.: 8.150  
##  Max.   :2.568e+09   Max.   :304200   Max.   :43107   Max.   :55.900  
##  NA's   :568         NA's   :616      NA's   :1424    NA's   :1467    
##       date     
##  Min.   :2000  
##  1st Qu.:2004  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2012  
##  Max.   :2016  
##

Three key questions about our data:

1) Which city has the highest median home prices?

2) In which years were sales highest and lowest?

3) What is the difference between the mean and the median of monthly house sales for a given city in a given year?
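The first two questions are not pursued further in this write-up. As a rough sketch of how they could be approached with dplyr, the code below assumes one particular reading of each: "highest median home prices" is taken as the median of a city's monthly median prices, and "sales" is taken as the total number of sales across all cities in a year. The names `city_prices` and `yearly_sales` are illustrative, not from the original analysis.

```r
library(dplyr)
library(ggplot2)  # ships the txhousing dataset

# Question 1 (assumed reading): rank cities by the median of their
# monthly median sale prices, ignoring months with missing values.
city_prices <- txhousing |>
  group_by(city) |>
  summarise(typical_median = stats::median(median, na.rm = TRUE),
            .groups = "drop") |>
  arrange(desc(typical_median))

# Question 2 (assumed reading): total number of sales per year across cities.
# Caveat: 8,602 rows / 46 cities = 187 months, so the final year (2015)
# covers only 7 months and is not directly comparable to full years.
yearly_sales <- txhousing |>
  group_by(year) |>
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")

head(city_prices, 1)                         # city with the highest value
slice_max(yearly_sales, total_sales, n = 1)  # year with the most sales
slice_min(yearly_sales, total_sales, n = 1)  # year with the fewest sales
```

Other readings are equally defensible, for example the single highest monthly median or the dollar volume instead of the sales count, and they may give different answers.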

Let us explore the 3rd question below:

out <- txhousing |>
  group_by(city, year) |>
  summarise(yearly_diff = mean(sales) - median(sales))

## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

out

## # A tibble: 736 × 3
## # Groups:   city [46]
##    city     year yearly_diff
##    <chr>   <int>       <dbl>
##  1 Abilene  2000      12.1  
##  2 Abilene  2001       2.25 
##  3 Abilene  2002      -7.17 
##  4 Abilene  2003      -1    
##  5 Abilene  2004       0    
##  6 Abilene  2005      -2.75 
##  7 Abilene  2006       8.42 
##  8 Abilene  2007       0.917
##  9 Abilene  2008       1.58 
## 10 Abilene  2009      -0.833
## # ℹ 726 more rows
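One caveat: the column summary reports 568 missing values in sales, and both `mean()` and `median()` return `NA` when any input is missing, so every city-year with at least one missing month gets an `NA` difference. A hedged variant that drops missing months and records how many months remain might look like this (`out_na_rm` and `n_months` are illustrative names):

```r
library(dplyr)
library(ggplot2)  # ships the txhousing dataset

out_na_rm <- txhousing |>
  group_by(city, year) |>
  summarise(
    # Number of months with a recorded sales figure in this city-year
    n_months = sum(!is.na(sales)),
    # Mean minus median, computed over the non-missing months only
    yearly_diff = mean(sales, na.rm = TRUE) - median(sales, na.rm = TRUE),
    # Return an ungrouped tibble, which also silences the grouping message
    .groups = "drop"
  )

out_na_rm
```

Keeping `n_months` makes it easy to see which differences rest on only a few observations. A positive difference means the mean exceeds the median, which is what a few unusually busy months in that year would produce.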

Let us explore this dataset using some visualisations:

### What is the distribution of the sales column?

hist(txhousing$sales)
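The column summary shows a median of 169 sales against a mean of 549.6 and a maximum of 8,945, so the distribution is strongly right-skewed and a plain histogram piles most of the data into the first few bars. As one possible alternative, the sketch below draws the histogram on a log10 x-axis with ggplot2; the bin count of 40 is an arbitrary choice, and `p_sales` is an illustrative name.

```r
library(ggplot2)  # ships the txhousing dataset

p_sales <- ggplot(txhousing, aes(x = sales)) +
  # na.rm = TRUE drops the months with missing sales without a warning
  geom_histogram(bins = 40, na.rm = TRUE) +
  # Log scale spreads out the small cities; safe here because min(sales) is 6
  scale_x_log10() +
  labs(x = "Monthly sales (log10 scale)", y = "Number of city-months")

p_sales
```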

### How does the number of listings change over time for each city?

txhousing |>  ggplot() +
    geom_line(mapping = aes(x = year, y = listings, color = city))

## Warning: Removed 518 rows containing missing values (`geom_line()`).
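Because the data are monthly, mapping `x = year` places up to twelve observations at the same x position for each city, which tends to show up as vertical jumps in the line. The dataset also has a fractional-year `date` column, so a sketch that plots one point per month could look like the following. Dropping the colour legend in favour of `group = city` is an assumption about readability with 46 cities, not part of the original plot, and `p_listings` is an illustrative name.

```r
library(ggplot2)  # ships the txhousing dataset

p_listings <- ggplot(txhousing, aes(x = date, y = listings, group = city)) +
  # One semi-transparent line per city; na.rm = TRUE suppresses the
  # missing-value warning while still leaving gaps where data are missing
  geom_line(alpha = 0.4, na.rm = TRUE) +
  labs(x = "Date", y = "Active listings")

p_listings
```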

### How do housing sales change over time for each city?

txhousing |>  ggplot() +
    geom_line(mapping = aes(x = year, y = sales, color = city))

## Warning: Removed 430 rows containing missing values (`geom_line()`).
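If a yearly x-axis is the goal, another option is to aggregate the monthly figures to one value per city and year before plotting. The sketch below sums sales within each city-year, treating missing months as absent rather than zero. That is an assumption, and it understates totals for incomplete years, including 2015, which covers only 7 months (8,602 rows / 46 cities = 187 months). The names `yearly_city_sales` and `p_yearly` are illustrative.

```r
library(dplyr)
library(ggplot2)  # ships the txhousing dataset

# One row per city and year, with the total number of recorded sales
yearly_city_sales <- txhousing |>
  group_by(city, year) |>
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")

p_yearly <- ggplot(yearly_city_sales,
                   aes(x = year, y = total_sales, group = city)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year", y = "Total sales")

p_yearly
```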

Shresht Venkatraman

2024-01-23