Homework 0

This homework uses txhousing dataset from ggplot2 package, which is a part of the tidyverse. Additionally, it uses Consumer Price Index (CPI) dataset, which I downloaded from the BLS website (https://data.bls.gov/timeseries/CUUR0000SA0) and cleaned. I will provide the code for cleaning CPI data in the solution to this homework.

txhousing consists of the following variables:

names(txhousing)

## [1] "city"      "year"      "month"     "sales"     "volume"    "median"   
## [7] "listings"  "inventory" "date"

Before you start working on this homework, study the variables in txhousing as well as the structure of the dataset by typing this command in your console.

help(txhousing)

Take a peek at the data by typing:

head(txhousing)

Save the CPI dataset in your project folder where you have also saved this homework file. Read the CPI dataset into this session. It consists of the following columns:

#setwd("C:/Users/brend/OneDrive/Desktop/Data Viz/Homework/Homework 0/Homework_0_Final/Homework 0_v3") #set my path
cpi=read_rds("cpi.rds") #read CPI file

names(cpi)

## [1] "year"       "month_name" "cpi"

Take a look at the first few observations by using head() function.

head(cpi)

year	month_name	cpi
1999	jan	164.3
1999	feb	164.5
1999	mar	165.0
1999	apr	166.2
1999	may	166.2
1999	jun	166.2

This homework consists of 10 questions and each carries one point. Your objective is to reproduce the output shown in the HTML file for Q1 through Q9. For Q10 just print the name of the city and the value.

Dollar values over a long time series make comparison difficult due to inflation. $100 in January 2000 is worth $154 in July 2020. In txhousing, there are two variables—volume and median—which are specified in unadjusted USD. Questions 1 to 4 are designed to inflation-adjust these two variables to July 2020 dollars.

Q1

Create a new data frame month_map with 12 rows and 2 columns titled month_name and month. month_name in month_map should have only the unique values from month_name column in CPI data frame. month column should contain the month numbers from 1 to 12.

Try to not type out this data frame manually and instead try to do it algorithmically. This is how it will look:

view(txhousing)
view(month.name)

month_map <- data.frame(month_name = tolower(month.abb),
                        month = (1:12))


head(month_map,4)

month_name	month
jan	1
feb	2
mar	3
apr	4

Q2

Merge month_map to the CPI data frame. Explicitly identify the common key variable on which you will perform the merge. Store the resulting merged data frame as cpi_merge. Print the first six rows.

Here are the first six rows of cpi_merge:

cpi_merge <- cpi %>% 
            inner_join(month_map, by = ("month_name"))

head(cpi_merge)

year	month_name	cpi	month
1999	jan	164.3	1
1999	feb	164.5	2
1999	mar	165.0	3
1999	apr	166.2	4
1999	may	166.2	5
1999	jun	166.2	6

Q3

We want to add a column to the txhousing data frame that holds the information on the CPI. Rather than altering txhousing, create a new data frame housing by merging txhousing and cpi_merge. The common keys for merging are year and month. Note that the resulting data frame is essentially txhousing with just one more column of CPI.

Here are the top six rows of housing:

#a tibble 6x10

housing <- txhousing %>% 
  inner_join(cpi_merge, by = c("year" , "month"))

housing %>% select(-c("month_name")) %>% head()

city	year	month	sales	volume	median	listings	inventory	date	cpi
Abilene	2000	1	72	5380000	71400	701	6.3	2000.000	168.8
Abilene	2000	2	98	6505000	58700	746	6.6	2000.083	169.8
Abilene	2000	3	130	9285000	58100	784	6.8	2000.167	171.2
Abilene	2000	4	98	9730000	68600	785	6.9	2000.250	171.3
Abilene	2000	5	141	10590000	67300	794	6.8	2000.333	171.5
Abilene	2000	6	156	13910000	66900	780	6.6	2000.417	172.4

Q4

Modify housing by adding these new columns:

cpi_latest - Contains the CPI of July 2020. This entire column will have the same value appearing in each cell.
multiplier - Ratio of cpi_latest and cpi
volume_adj - Adjusted volume as a product of volume and multiplier
median_adj - Adjusted median sale price as a product of median and multiplier

Here I show the top six rows with only a select few columns to help you ensure your output resembles this:

 #separate july 2020 

july2020 <- cpi %>% filter(month_name == "jul" & year == "2020")

housing <- mutate(housing,
                          cpi_latest = july2020$cpi,
                          multiplier = cpi_latest/cpi,
                          volume_adj = volume * multiplier,
                          median_adj = median * multiplier)
                          
housing %>% select(c("city", "year", "month", "volume", "median", "cpi_latest", "multiplier", "volume_adj", "median_adj")) %>% head()

city	year	month	volume	median	cpi_latest	multiplier	volume_adj	median_adj
Abilene	2000	1	5380000	71400	259.101	1.534958	8258077	109596.04
Abilene	2000	2	6505000	58700	259.101	1.525919	9926101	89571.43
Abilene	2000	3	9285000	58100	259.101	1.513440	14052294	87930.89
Abilene	2000	4	9730000	68600	259.101	1.512557	14717179	103761.40
Abilene	2000	5	10590000	67300	259.101	1.510793	15999298	101676.37
Abilene	2000	6	13910000	66900	259.101	1.502906	20905423	100544.41

Q5

Using housing from Q4, create a new data frame housing_sum1 with this information for each city across all years and months:

Maximum and minimum volume_adj
Maximum and minimum median_adj sale price

Hint: If you group by city, you will get the summary across all the years and months.

Merge housing_sum1 into housing by city and save it as a new dataset housing_1.

Here I show first six rows of housing_1 and only a select columns:

housing_sum1 <- housing %>% 
                   group_by(city) %>% 
                   summarize(volume_adj_max = max(volume_adj),
                             volume_adj_min = min(volume_adj),
                             median_adj_max = max(median_adj),
                             median_adj_min = min(median_adj),
                    .groups = "drop")

housing_1 <- housing %>% 
              inner_join(housing_sum1, by = "city")
            
  
housing_1 %>%  select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% head()

city	year	month	volume_adj_max	volume_adj_min	median_adj_max	median_adj_min
Abilene	2000	1	49773624	7678915	161440.1	84890.23
Abilene	2000	2	49773624	7678915	161440.1	84890.23
Abilene	2000	3	49773624	7678915	161440.1	84890.23
Abilene	2000	4	49773624	7678915	161440.1	84890.23
Abilene	2000	5	49773624	7678915	161440.1	84890.23
Abilene	2000	6	49773624	7678915	161440.1	84890.23

Here I show last six rows of housing_1 and only a select columns:

housing_1 %>%  select(c("city","year", "month", "volume_adj_max", "volume_adj_min","median_adj_max", "median_adj_min")) %>% tail()

city	year	month	volume_adj_max	volume_adj_min	median_adj_max	median_adj_min
Wichita Falls	2015	2	33832205	8337061	147083.7	84451.77
Wichita Falls	2015	3	33832205	8337061	147083.7	84451.77
Wichita Falls	2015	4	33832205	8337061	147083.7	84451.77
Wichita Falls	2015	5	33832205	8337061	147083.7	84451.77
Wichita Falls	2015	6	33832205	8337061	147083.7	84451.77
Wichita Falls	2015	7	33832205	8337061	147083.7	84451.77

Q6

Using housing_1 from Q5, create a new data frame housing_min which will retain only the rows of housing_1 where volume_adj of a city was equal to the minimum adjusted volume.

Here I show first six rows of housing_min and only a select columns:

housing_min <-(housing_1[housing_1$volume_adj == housing_1$volume_adj_min,])

housing_min %>%  select(c("city", "year", "month", "volume_adj","volume_adj_min", "volume_adj_max")) %>% head()

city	year	month	volume_adj	volume_adj_min	volume_adj_max
Abilene	2003	1	7678915	7678915	49773624
Amarillo	2005	10	9397614	9397614	67596289
Arlington	2011	1	30008080	30008080	137867140
Austin	2009	1	252191554	252191554	1248942028
Bay Area	2000	1	45009066	45009066	214278169
Beaumont	2001	1	13117820	13117820	57413784

Using housing_1 from Q5, create a new data frame housing_max which will retain only the rows of housing_1 where median_adj of a city was equal to the maximum adjusted median sale price.

Here I show first six rows of housing_max and only a select columns:

housing_max <-(housing_1[housing_1$median_adj == housing_1$median_adj_max,])

housing_max %>%  select(c("city", "year", "month", "median_adj","median_adj_max", "median_adj_min")) %>% head()

city	year	month	median_adj	median_adj_max	median_adj_min
Abilene	2015	7	161440.1	161440.1	84890.23
Amarillo	2015	5	172040.3	172040.3	113161.58
Arlington	2015	6	195434.8	195434.8	131799.55
Austin	2015	4	296007.2	296007.2	200091.09
Bay Area	2015	7	218003.8	218003.8	154570.32
Beaumont	2010	1	195862.0	195862.0	108772.85

Q7

Using housing_1 from Q5, create a new data frame housing_sum2 with this information for each year and month pair across all cities:

Median listings
Median sales

Hint: If even a single value for listings or sales of a city is NA, the median of that variable will be NA. In order to avoid this, use na.rm = TRUE argument in median().

Here I show first six rows of housing_sum2 and all the columns:

housing_sum2 <- housing_1 %>%
                 group_by(year,month) %>% 
                 summarize(listings_med = median(listings,na.rm = TRUE),
                              sales_med = median(sales,na.rm = TRUE),
                 .groups = "drop")

housing_sum2 %>%  select(c("year","month","listings_med", "sales_med")) %>% head()

year	month	listings_med	sales_med
2000	1	972.0	99
2000	2	916.5	134
2000	3	946.5	167
2000	4	985.0	153
2000	5	978.5	165
2000	6	863.5	188

Merge housing_sum2 into housing_1 and save a new data frame housing_2.

Here I show first six rows of housing_2 and some of the columns:

housing_2 <- housing_1 %>% 
              inner_join(housing_sum2, by = c("month","year"))

housing_2 %>% select(c("city","year","month","listings","sales","listings_med","sales_med")) %>% head()

city	year	month	listings	sales	listings_med	sales_med
Abilene	2000	1	701	72	972.0	99
Abilene	2000	2	746	98	916.5	134
Abilene	2000	3	784	130	946.5	167
Abilene	2000	4	785	98	985.0	153
Abilene	2000	5	794	141	978.5	165
Abilene	2000	6	780	156	863.5	188

Q8

Modify housing_2 from Q7 to add these indicator variables (also called dummy variables):

listings_ind - If a city’s listings is less than or equal to the median listings for that year and month across all the cities, the value should be 0 else it should be 1.
sales_ind - If a city’s sales is less than or equal to the median sales for that year and month across all the cities, the value should be 0 else it should be 1.

Hint: This can be achieved using ifelse() function from R along with mutate() from dplyr

Here I show first six rows of housing_2 and some of the columns:

housing_2 =  mutate (housing_2,
                     listings_ind = ifelse(listings <= listings_med,0,1),
                     sales_ind = ifelse(sales <= sales_med,0,1))

housing_2 %>% select(c("city", "year", "month", "listings","listings_med", "listings_ind", "sales","sales_med","sales_ind")) %>% head()

city	year	month	listings	listings_med	sales	sales_med
Abilene	2000	1	701	972.0	72	99
Abilene	2000	2	746	916.5	98	134
Abilene	2000	3	784	946.5	130	167
Abilene	2000	4	785	985.0	98	153
Abilene	2000	5	794	978.5	141	165
Abilene	2000	6	780	863.5	156	188

Q9

Using housing_2 from Q8, add a new variable market_hotness as follows:

housing_2 = mutate(housing_2,
                   market_hotness = 
                   ifelse((listings_ind == 0) & (sales_ind == 0), print("Low"),
                   ifelse((listings_ind == 0) & (sales_ind == 1), print("High"),
                   ifelse((listings_ind == 1) & (sales_ind == 0), print("Very Low"),
                   ifelse((listings_ind == 1) & (sales_ind == 1), print("Average"), print("NA")
                          )))))

## [1] "Low"
## [1] "High"
## [1] "Very Low"
## [1] "Average"
## [1] "NA"

Here I show first six rows of housing_2 and some of the columns:

housing_2 %>% select(c("city", "year", "month", "listings","sales","listings_ind", "sales_ind", "market_hotness")) %>% head()

city	year	month	listings	sales	market_hotness
Abilene	2000	1	701	72	Low
Abilene	2000	2	746	98	Low
Abilene	2000	3	784	130	Low
Abilene	2000	4	785	98	Low
Abilene	2000	5	794	141	Low
Abilene	2000	6	780	156	Low

Q10

Which city has the highest average median_adj sale price and what is that price?

average_median_adj <- housing_2 %>% 
              group_by(city) %>% 
              summarize(average = mean(median_adj, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

              .groups = "drop"
              
average_median_adj %>%  arrange(desc(average)) %>% head(.,1)

city	average
Collin County	252325.2

Homework 0

Brenda Parnin

9/13/2020

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Q8

Q9

Q10