This homework uses txhousing dataset from ggplot2 package, which is a part of the tidyverse. Additionally, it uses Consumer Price Index (CPI) dataset, which I downloaded from the BLS website (https://data.bls.gov/timeseries/CUUR0000SA0) and cleaned. I will provide the code for cleaning CPI data in the solution to this homework.
txhousing consists of the following variables:
names(txhousing)
## [1] "city" "year" "month" "sales" "volume" "median"
## [7] "listings" "inventory" "date"
Before you start working on this homework, study the variables in txhousing as well as the structure of the dataset by typing this command in your console.
help(txhousing)
Take a peek at the data by typing:
head(txhousing)
Save the CPI dataset in your project folder where you have also saved this homework file. Read the CPI dataset into this session. It consists of the following columns:
#setwd("C:/Users/brend/OneDrive/Desktop/Data Viz/Homework/Homework 0/Homework_0_Final/Homework 0_v3") #set my path
cpi=read_rds("cpi.rds") #read CPI file
names(cpi)
## [1] "year" "month_name" "cpi"
Take a look at the first few observations by using head() function.
head(cpi)
| year | month_name | cpi |
|---|---|---|
| 1999 | jan | 164.3 |
| 1999 | feb | 164.5 |
| 1999 | mar | 165.0 |
| 1999 | apr | 166.2 |
| 1999 | may | 166.2 |
| 1999 | jun | 166.2 |
This homework consists of 10 questions and each carries one point. Your objective is to reproduce the output shown in the HTML file for Q1 through Q9. For Q10 just print the name of the city and the value.
Dollar values over a long time series make comparison difficult due to inflation. $100 in January 2000 is worth $154 in July 2020. In txhousing, there are two variables—volume and median—which are specified in unadjusted USD. Questions 1 to 4 are designed to inflation-adjust these two variables to July 2020 dollars.
Create a new data frame month_map with 12 rows and 2 columns titled month_name and month. month_name in month_map should have only the unique values from month_name column in CPI data frame. month column should contain the month numbers from 1 to 12.
Try to not type out this data frame manually and instead try to do it algorithmically. This is how it will look:
view(txhousing)
view(month.name)
month_map <- data.frame(month_name = tolower(month.abb),
month = (1:12))
head(month_map,4)
| month_name | month |
|---|---|
| jan | 1 |
| feb | 2 |
| mar | 3 |
| apr | 4 |
Merge month_map to the CPI data frame. Explicitly identify the common key variable on which you will perform the merge. Store the resulting merged data frame as cpi_merge. Print the first six rows.
Here are the first six rows of cpi_merge:
cpi_merge <- cpi %>%
inner_join(month_map, by = ("month_name"))
head(cpi_merge)
| year | month_name | cpi | month |
|---|---|---|---|
| 1999 | jan | 164.3 | 1 |
| 1999 | feb | 164.5 | 2 |
| 1999 | mar | 165.0 | 3 |
| 1999 | apr | 166.2 | 4 |
| 1999 | may | 166.2 | 5 |
| 1999 | jun | 166.2 | 6 |
We want to add a column to the txhousing data frame that holds the information on the CPI. Rather than altering txhousing, create a new data frame housing by merging txhousing and cpi_merge. The common keys for merging are year and month. Note that the resulting data frame is essentially txhousing with just one more column of CPI.
Here are the top six rows of housing:
#a tibble 6x10
housing <- txhousing %>%
inner_join(cpi_merge, by = c("year" , "month"))
housing %>% select(-c("month_name")) %>% head()
| city | year | month | sales | volume | median | listings | inventory | date | cpi |
|---|---|---|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 72 | 5380000 | 71400 | 701 | 6.3 | 2000.000 | 168.8 |
| Abilene | 2000 | 2 | 98 | 6505000 | 58700 | 746 | 6.6 | 2000.083 | 169.8 |
| Abilene | 2000 | 3 | 130 | 9285000 | 58100 | 784 | 6.8 | 2000.167 | 171.2 |
| Abilene | 2000 | 4 | 98 | 9730000 | 68600 | 785 | 6.9 | 2000.250 | 171.3 |
| Abilene | 2000 | 5 | 141 | 10590000 | 67300 | 794 | 6.8 | 2000.333 | 171.5 |
| Abilene | 2000 | 6 | 156 | 13910000 | 66900 | 780 | 6.6 | 2000.417 | 172.4 |
Modify housing by adding these new columns:
cpi_latest - Contains the CPI of July 2020. This entire column will have the same value appearing in each cell.
multiplier - Ratio of cpi_latest and cpi
volume_adj - Adjusted volume as a product of volume and multiplier
median_adj - Adjusted median sale price as a product of median and multiplier
Here I show the top six rows with only a select few columns to help you ensure your output resembles this:
#separate july 2020
july2020 <- cpi %>% filter(month_name == "jul" & year == "2020")
housing <- mutate(housing,
cpi_latest = july2020$cpi,
multiplier = cpi_latest/cpi,
volume_adj = volume * multiplier,
median_adj = median * multiplier)
housing %>% select(c("city", "year", "month", "volume", "median", "cpi_latest", "multiplier", "volume_adj", "median_adj")) %>% head()
| city | year | month | volume | median | cpi_latest | multiplier | volume_adj | median_adj |
|---|---|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 5380000 | 71400 | 259.101 | 1.534958 | 8258077 | 109596.04 |
| Abilene | 2000 | 2 | 6505000 | 58700 | 259.101 | 1.525919 | 9926101 | 89571.43 |
| Abilene | 2000 | 3 | 9285000 | 58100 | 259.101 | 1.513440 | 14052294 | 87930.89 |
| Abilene | 2000 | 4 | 9730000 | 68600 | 259.101 | 1.512557 | 14717179 | 103761.40 |
| Abilene | 2000 | 5 | 10590000 | 67300 | 259.101 | 1.510793 | 15999298 | 101676.37 |
| Abilene | 2000 | 6 | 13910000 | 66900 | 259.101 | 1.502906 | 20905423 | 100544.41 |
Using housing from Q4, create a new data frame housing_sum1 with this information for each city across all years and months:
volume_adjmedian_adj sale priceHint: If you group by city, you will get the summary across all the years and months.
Merge housing_sum1 into housing by city and save it as a new dataset housing_1.
Here I show first six rows of housing_1 and only a select columns:
housing_sum1 <- housing %>%
group_by(city) %>%
summarize(volume_adj_max = max(volume_adj),
volume_adj_min = min(volume_adj),
median_adj_max = max(median_adj),
median_adj_min = min(median_adj),
.groups = "drop")
housing_1 <- housing %>%
inner_join(housing_sum1, by = "city")
housing_1 %>% select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% head()
| city | year | month | volume_adj_max | volume_adj_min | median_adj_max | median_adj_min |
|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 49773624 | 7678915 | 161440.1 | 84890.23 |
| Abilene | 2000 | 2 | 49773624 | 7678915 | 161440.1 | 84890.23 |
| Abilene | 2000 | 3 | 49773624 | 7678915 | 161440.1 | 84890.23 |
| Abilene | 2000 | 4 | 49773624 | 7678915 | 161440.1 | 84890.23 |
| Abilene | 2000 | 5 | 49773624 | 7678915 | 161440.1 | 84890.23 |
| Abilene | 2000 | 6 | 49773624 | 7678915 | 161440.1 | 84890.23 |
Here I show last six rows of housing_1 and only a select columns:
housing_1 %>% select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% tail()
| city | year | month | volume_adj_max | volume_adj_min | median_adj_max | median_adj_min |
|---|---|---|---|---|---|---|
| Wichita Falls | 2015 | 2 | 33832205 | 8337061 | 147083.7 | 84451.77 |
| Wichita Falls | 2015 | 3 | 33832205 | 8337061 | 147083.7 | 84451.77 |
| Wichita Falls | 2015 | 4 | 33832205 | 8337061 | 147083.7 | 84451.77 |
| Wichita Falls | 2015 | 5 | 33832205 | 8337061 | 147083.7 | 84451.77 |
| Wichita Falls | 2015 | 6 | 33832205 | 8337061 | 147083.7 | 84451.77 |
| Wichita Falls | 2015 | 7 | 33832205 | 8337061 | 147083.7 | 84451.77 |
housing_1 from Q5, create a new data frame housing_min which will retain only the rows of housing_1 where volume_adj of a city was equal to the minimum adjusted volume.Here I show first six rows of housing_min and only a select columns:
housing_min <-(housing_1[housing_1$volume_adj == housing_1$volume_adj_min,])
housing_min %>% select(c("city", "year", "month", "volume_adj","volume_adj_min", "volume_adj_max")) %>% head()
| city | year | month | volume_adj | volume_adj_min | volume_adj_max |
|---|---|---|---|---|---|
| Abilene | 2003 | 1 | 7678915 | 7678915 | 49773624 |
| Amarillo | 2005 | 10 | 9397614 | 9397614 | 67596289 |
| Arlington | 2011 | 1 | 30008080 | 30008080 | 137867140 |
| Austin | 2009 | 1 | 252191554 | 252191554 | 1248942028 |
| Bay Area | 2000 | 1 | 45009066 | 45009066 | 214278169 |
| Beaumont | 2001 | 1 | 13117820 | 13117820 | 57413784 |
housing_1 from Q5, create a new data frame housing_max which will retain only the rows of housing_1 where median_adj of a city was equal to the maximum adjusted median sale price.Here I show first six rows of housing_max and only a select columns:
housing_max <-(housing_1[housing_1$median_adj == housing_1$median_adj_max,])
housing_max %>% select(c("city", "year", "month", "median_adj","median_adj_max", "median_adj_min")) %>% head()
| city | year | month | median_adj | median_adj_max | median_adj_min |
|---|---|---|---|---|---|
| Abilene | 2015 | 7 | 161440.1 | 161440.1 | 84890.23 |
| Amarillo | 2015 | 5 | 172040.3 | 172040.3 | 113161.58 |
| Arlington | 2015 | 6 | 195434.8 | 195434.8 | 131799.55 |
| Austin | 2015 | 4 | 296007.2 | 296007.2 | 200091.09 |
| Bay Area | 2015 | 7 | 218003.8 | 218003.8 | 154570.32 |
| Beaumont | 2010 | 1 | 195862.0 | 195862.0 | 108772.85 |
Using housing_1 from Q5, create a new data frame housing_sum2 with this information for each year and month pair across all cities:
Hint: If even a single value for listings or sales of a city is NA, the median of that variable will be NA. In order to avoid this, use na.rm = TRUE argument in median().
Here I show first six rows of housing_sum2 and all the columns:
housing_sum2 <- housing_1 %>%
group_by(year,month) %>%
summarize(listings_med = median(listings,na.rm = TRUE),
sales_med = median(sales,na.rm = TRUE),
.groups = "drop")
housing_sum2 %>% select(c("year","month","listings_med", "sales_med")) %>% head()
| year | month | listings_med | sales_med |
|---|---|---|---|
| 2000 | 1 | 972.0 | 99 |
| 2000 | 2 | 916.5 | 134 |
| 2000 | 3 | 946.5 | 167 |
| 2000 | 4 | 985.0 | 153 |
| 2000 | 5 | 978.5 | 165 |
| 2000 | 6 | 863.5 | 188 |
Merge housing_sum2 into housing_1 and save a new data frame housing_2.
Here I show first six rows of housing_2 and some of the columns:
housing_2 <- housing_1 %>%
inner_join(housing_sum2, by = c("month","year"))
housing_2 %>% select(c("city","year","month","listings","sales","listings_med","sales_med")) %>% head()
| city | year | month | listings | sales | listings_med | sales_med |
|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 701 | 72 | 972.0 | 99 |
| Abilene | 2000 | 2 | 746 | 98 | 916.5 | 134 |
| Abilene | 2000 | 3 | 784 | 130 | 946.5 | 167 |
| Abilene | 2000 | 4 | 785 | 98 | 985.0 | 153 |
| Abilene | 2000 | 5 | 794 | 141 | 978.5 | 165 |
| Abilene | 2000 | 6 | 780 | 156 | 863.5 | 188 |
Modify housing_2 from Q7 to add these indicator variables (also called dummy variables):
listings_ind - If a city’s listings is less than or equal to the median listings for that year and month across all the cities, the value should be 0 else it should be 1.
sales_ind - If a city’s sales is less than or equal to the median sales for that year and month across all the cities, the value should be 0 else it should be 1.
Hint: This can be achieved using ifelse() function from R along with mutate() from dplyr
Here I show first six rows of housing_2 and some of the columns:
housing_2 = mutate (housing_2,
listings_ind = ifelse(listings <= listings_med,0,1),
sales_ind = ifelse(sales <= sales_med,0,1))
housing_2 %>% select(c("city", "year", "month", "listings","listings_med", "listings_ind", "sales","sales_med","sales_ind")) %>% head()
| city | year | month | listings | listings_med | listings_ind | sales | sales_med | sales_ind |
|---|---|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 701 | 972.0 | 0 | 72 | 99 | 0 |
| Abilene | 2000 | 2 | 746 | 916.5 | 0 | 98 | 134 | 0 |
| Abilene | 2000 | 3 | 784 | 946.5 | 0 | 130 | 167 | 0 |
| Abilene | 2000 | 4 | 785 | 985.0 | 0 | 98 | 153 | 0 |
| Abilene | 2000 | 5 | 794 | 978.5 | 0 | 141 | 165 | 0 |
| Abilene | 2000 | 6 | 780 | 863.5 | 0 | 156 | 188 | 0 |
Using housing_2 from Q8, add a new variable market_hotness as follows:
housing_2 = mutate(housing_2,
market_hotness =
ifelse((listings_ind == 0) & (sales_ind == 0), print("Low"),
ifelse((listings_ind == 0) & (sales_ind == 1), print("High"),
ifelse((listings_ind == 1) & (sales_ind == 0), print("Very Low"),
ifelse((listings_ind == 1) & (sales_ind == 1), print("Average"), print("NA")
)))))
## [1] "Low"
## [1] "High"
## [1] "Very Low"
## [1] "Average"
## [1] "NA"
Here I show first six rows of housing_2 and some of the columns:
housing_2 %>% select(c("city", "year", "month", "listings","sales","listings_ind", "sales_ind", "market_hotness")) %>% head()
| city | year | month | listings | sales | listings_ind | sales_ind | market_hotness |
|---|---|---|---|---|---|---|---|
| Abilene | 2000 | 1 | 701 | 72 | 0 | 0 | Low |
| Abilene | 2000 | 2 | 746 | 98 | 0 | 0 | Low |
| Abilene | 2000 | 3 | 784 | 130 | 0 | 0 | Low |
| Abilene | 2000 | 4 | 785 | 98 | 0 | 0 | Low |
| Abilene | 2000 | 5 | 794 | 141 | 0 | 0 | Low |
| Abilene | 2000 | 6 | 780 | 156 | 0 | 0 | Low |
Which city has the highest average median_adj sale price and what is that price?
average_median_adj <- housing_2 %>%
group_by(city) %>%
summarize(average = mean(median_adj, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
.groups = "drop"
average_median_adj %>% arrange(desc(average)) %>% head(.,1)
| city | average |
|---|---|
| Collin County | 252325.2 |