This homework uses txhousing dataset from ggplot2 package, which is a part of the tidyverse. Additionally, it uses Consumer Price Index (CPI) dataset, which I downloaded from the BLS website (https://data.bls.gov/timeseries/CUUR0000SA0) and cleaned. I will provide the code for cleaning CPI data in the solution to this homework.

txhousing consists of the following variables:

names(txhousing)
## [1] "city"      "year"      "month"     "sales"     "volume"    "median"   
## [7] "listings"  "inventory" "date"

Before you start working on this homework, study the variables in txhousing as well as the structure of the dataset by typing this command in your console.

help(txhousing)

Take a peek at the data by typing:

head(txhousing)

Save the CPI dataset in your project folder where you have also saved this homework file. Read the CPI dataset into this session. It consists of the following columns:

#setwd("C:/Users/brend/OneDrive/Desktop/Data Viz/Homework/Homework 0/Homework_0_Final/Homework 0_v3") #set my path
cpi=read_rds("cpi.rds") #read CPI file

names(cpi)
## [1] "year"       "month_name" "cpi"

Take a look at the first few observations by using head() function.

head(cpi)
year month_name cpi
1999 jan 164.3
1999 feb 164.5
1999 mar 165.0
1999 apr 166.2
1999 may 166.2
1999 jun 166.2

This homework consists of 10 questions and each carries one point. Your objective is to reproduce the output shown in the HTML file for Q1 through Q9. For Q10 just print the name of the city and the value.

Dollar values over a long time series make comparison difficult due to inflation. $100 in January 2000 is worth $154 in July 2020. In txhousing, there are two variables—volume and median—which are specified in unadjusted USD. Questions 1 to 4 are designed to inflation-adjust these two variables to July 2020 dollars.

Q1

Create a new data frame month_map with 12 rows and 2 columns titled month_name and month. month_name in month_map should have only the unique values from month_name column in CPI data frame. month column should contain the month numbers from 1 to 12.

Try to not type out this data frame manually and instead try to do it algorithmically. This is how it will look:

view(txhousing)
view(month.name)

month_map <- data.frame(month_name = tolower(month.abb),
                        month = (1:12))


head(month_map,4)
month_name month
jan 1
feb 2
mar 3
apr 4

Q2

Merge month_map to the CPI data frame. Explicitly identify the common key variable on which you will perform the merge. Store the resulting merged data frame as cpi_merge. Print the first six rows.

Here are the first six rows of cpi_merge:

cpi_merge <- cpi %>% 
            inner_join(month_map, by = ("month_name"))

head(cpi_merge)
year month_name cpi month
1999 jan 164.3 1
1999 feb 164.5 2
1999 mar 165.0 3
1999 apr 166.2 4
1999 may 166.2 5
1999 jun 166.2 6

Q3

We want to add a column to the txhousing data frame that holds the information on the CPI. Rather than altering txhousing, create a new data frame housing by merging txhousing and cpi_merge. The common keys for merging are year and month. Note that the resulting data frame is essentially txhousing with just one more column of CPI.

Here are the top six rows of housing:

#a tibble 6x10

housing <- txhousing %>% 
  inner_join(cpi_merge, by = c("year" , "month"))

housing %>% select(-c("month_name")) %>% head()
city year month sales volume median listings inventory date cpi
Abilene 2000 1 72 5380000 71400 701 6.3 2000.000 168.8
Abilene 2000 2 98 6505000 58700 746 6.6 2000.083 169.8
Abilene 2000 3 130 9285000 58100 784 6.8 2000.167 171.2
Abilene 2000 4 98 9730000 68600 785 6.9 2000.250 171.3
Abilene 2000 5 141 10590000 67300 794 6.8 2000.333 171.5
Abilene 2000 6 156 13910000 66900 780 6.6 2000.417 172.4

Q4

Modify housing by adding these new columns:

  1. cpi_latest - Contains the CPI of July 2020. This entire column will have the same value appearing in each cell.

  2. multiplier - Ratio of cpi_latest and cpi

  3. volume_adj - Adjusted volume as a product of volume and multiplier

  4. median_adj - Adjusted median sale price as a product of median and multiplier

Here I show the top six rows with only a select few columns to help you ensure your output resembles this:

 #separate july 2020 

july2020 <- cpi %>% filter(month_name == "jul" & year == "2020")

housing <- mutate(housing,
                          cpi_latest = july2020$cpi,
                          multiplier = cpi_latest/cpi,
                          volume_adj = volume * multiplier,
                          median_adj = median * multiplier)
                          
housing %>% select(c("city", "year", "month", "volume", "median", "cpi_latest", "multiplier", "volume_adj", "median_adj")) %>% head()
city year month volume median cpi_latest multiplier volume_adj median_adj
Abilene 2000 1 5380000 71400 259.101 1.534958 8258077 109596.04
Abilene 2000 2 6505000 58700 259.101 1.525919 9926101 89571.43
Abilene 2000 3 9285000 58100 259.101 1.513440 14052294 87930.89
Abilene 2000 4 9730000 68600 259.101 1.512557 14717179 103761.40
Abilene 2000 5 10590000 67300 259.101 1.510793 15999298 101676.37
Abilene 2000 6 13910000 66900 259.101 1.502906 20905423 100544.41

Q5

Using housing from Q4, create a new data frame housing_sum1 with this information for each city across all years and months:

  1. Maximum and minimum volume_adj
  2. Maximum and minimum median_adj sale price

Hint: If you group by city, you will get the summary across all the years and months.

Merge housing_sum1 into housing by city and save it as a new dataset housing_1.

Here I show first six rows of housing_1 and only a select columns:

housing_sum1 <- housing %>% 
                   group_by(city) %>% 
                   summarize(volume_adj_max = max(volume_adj),
                             volume_adj_min = min(volume_adj),
                             median_adj_max = max(median_adj),
                             median_adj_min = min(median_adj),
                    .groups = "drop")

housing_1 <- housing %>% 
              inner_join(housing_sum1, by = "city")
            
  
housing_1 %>%  select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% head()
city year month volume_adj_max volume_adj_min median_adj_max median_adj_min
Abilene 2000 1 49773624 7678915 161440.1 84890.23
Abilene 2000 2 49773624 7678915 161440.1 84890.23
Abilene 2000 3 49773624 7678915 161440.1 84890.23
Abilene 2000 4 49773624 7678915 161440.1 84890.23
Abilene 2000 5 49773624 7678915 161440.1 84890.23
Abilene 2000 6 49773624 7678915 161440.1 84890.23

Here I show last six rows of housing_1 and only a select columns:

housing_1 %>%  select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% tail()
city year month volume_adj_max volume_adj_min median_adj_max median_adj_min
Wichita Falls 2015 2 33832205 8337061 147083.7 84451.77
Wichita Falls 2015 3 33832205 8337061 147083.7 84451.77
Wichita Falls 2015 4 33832205 8337061 147083.7 84451.77
Wichita Falls 2015 5 33832205 8337061 147083.7 84451.77
Wichita Falls 2015 6 33832205 8337061 147083.7 84451.77
Wichita Falls 2015 7 33832205 8337061 147083.7 84451.77

Q6

  1. Using housing_1 from Q5, create a new data frame housing_min which will retain only the rows of housing_1 where volume_adj of a city was equal to the minimum adjusted volume.

Here I show first six rows of housing_min and only a select columns:

housing_min <-(housing_1[housing_1$volume_adj == housing_1$volume_adj_min,])

housing_min %>%  select(c("city", "year", "month", "volume_adj","volume_adj_min", "volume_adj_max")) %>% head()
city year month volume_adj volume_adj_min volume_adj_max
Abilene 2003 1 7678915 7678915 49773624
Amarillo 2005 10 9397614 9397614 67596289
Arlington 2011 1 30008080 30008080 137867140
Austin 2009 1 252191554 252191554 1248942028
Bay Area 2000 1 45009066 45009066 214278169
Beaumont 2001 1 13117820 13117820 57413784
  1. Using housing_1 from Q5, create a new data frame housing_max which will retain only the rows of housing_1 where median_adj of a city was equal to the maximum adjusted median sale price.

Here I show first six rows of housing_max and only a select columns:

housing_max <-(housing_1[housing_1$median_adj == housing_1$median_adj_max,])

housing_max %>%  select(c("city", "year", "month", "median_adj","median_adj_max", "median_adj_min")) %>% head()
city year month median_adj median_adj_max median_adj_min
Abilene 2015 7 161440.1 161440.1 84890.23
Amarillo 2015 5 172040.3 172040.3 113161.58
Arlington 2015 6 195434.8 195434.8 131799.55
Austin 2015 4 296007.2 296007.2 200091.09
Bay Area 2015 7 218003.8 218003.8 154570.32
Beaumont 2010 1 195862.0 195862.0 108772.85

Q7

Using housing_1 from Q5, create a new data frame housing_sum2 with this information for each year and month pair across all cities:

  1. Median listings
  2. Median sales

Hint: If even a single value for listings or sales of a city is NA, the median of that variable will be NA. In order to avoid this, use na.rm = TRUE argument in median().

Here I show first six rows of housing_sum2 and all the columns:

housing_sum2 <- housing_1 %>%
                 group_by(year,month) %>% 
                 summarize(listings_med = median(listings,na.rm = TRUE),
                              sales_med = median(sales,na.rm = TRUE),
                 .groups = "drop")

housing_sum2 %>%  select(c("year","month","listings_med", "sales_med")) %>% head()
year month listings_med sales_med
2000 1 972.0 99
2000 2 916.5 134
2000 3 946.5 167
2000 4 985.0 153
2000 5 978.5 165
2000 6 863.5 188

Merge housing_sum2 into housing_1 and save a new data frame housing_2.

Here I show first six rows of housing_2 and some of the columns:

housing_2 <- housing_1 %>% 
              inner_join(housing_sum2, by = c("month","year"))

housing_2 %>% select(c("city","year","month","listings","sales","listings_med","sales_med")) %>% head()
city year month listings sales listings_med sales_med
Abilene 2000 1 701 72 972.0 99
Abilene 2000 2 746 98 916.5 134
Abilene 2000 3 784 130 946.5 167
Abilene 2000 4 785 98 985.0 153
Abilene 2000 5 794 141 978.5 165
Abilene 2000 6 780 156 863.5 188

Q8

Modify housing_2 from Q7 to add these indicator variables (also called dummy variables):

  1. listings_ind - If a city’s listings is less than or equal to the median listings for that year and month across all the cities, the value should be 0 else it should be 1.

  2. sales_ind - If a city’s sales is less than or equal to the median sales for that year and month across all the cities, the value should be 0 else it should be 1.

Hint: This can be achieved using ifelse() function from R along with mutate() from dplyr

Here I show first six rows of housing_2 and some of the columns:

housing_2 =  mutate (housing_2,
                     listings_ind = ifelse(listings <= listings_med,0,1),
                     sales_ind = ifelse(sales <= sales_med,0,1))

housing_2 %>% select(c("city", "year", "month", "listings","listings_med", "listings_ind", "sales","sales_med","sales_ind")) %>% head()
city year month listings listings_med listings_ind sales sales_med sales_ind
Abilene 2000 1 701 972.0 0 72 99 0
Abilene 2000 2 746 916.5 0 98 134 0
Abilene 2000 3 784 946.5 0 130 167 0
Abilene 2000 4 785 985.0 0 98 153 0
Abilene 2000 5 794 978.5 0 141 165 0
Abilene 2000 6 780 863.5 0 156 188 0

Q9

Using housing_2 from Q8, add a new variable market_hotness as follows:

housing_2 = mutate(housing_2,
                   market_hotness = 
                   ifelse((listings_ind == 0) & (sales_ind == 0), print("Low"),
                   ifelse((listings_ind == 0) & (sales_ind == 1), print("High"),
                   ifelse((listings_ind == 1) & (sales_ind == 0), print("Very Low"),
                   ifelse((listings_ind == 1) & (sales_ind == 1), print("Average"), print("NA")
                          )))))
## [1] "Low"
## [1] "High"
## [1] "Very Low"
## [1] "Average"
## [1] "NA"

Here I show first six rows of housing_2 and some of the columns:

housing_2 %>% select(c("city", "year", "month", "listings","sales","listings_ind", "sales_ind", "market_hotness")) %>% head() 
city year month listings sales listings_ind sales_ind market_hotness
Abilene 2000 1 701 72 0 0 Low
Abilene 2000 2 746 98 0 0 Low
Abilene 2000 3 784 130 0 0 Low
Abilene 2000 4 785 98 0 0 Low
Abilene 2000 5 794 141 0 0 Low
Abilene 2000 6 780 156 0 0 Low

Q10

Which city has the highest average median_adj sale price and what is that price?

average_median_adj <- housing_2 %>% 
              group_by(city) %>% 
              summarize(average = mean(median_adj, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
              .groups = "drop"
              
average_median_adj %>%  arrange(desc(average)) %>% head(.,1)
city average
Collin County 252325.2