In this project, I’m evaluating the txhousing dataset, which contains information about housing prices in Texas.

txhousing is a data frame with 8602 observations and 9 variables: city, year, month, sales, volume, median, listings, inventory, and date.

1. Create a new dataset, txh, that contains all of the variables in txhousing, plus three more that you will create:

  1. The txhousing dataset includes the median sale price of all sales in a city in a given month, but it does not include the mean sale price. Create a new variable, mean_price, that is the mean sale price of all sales in a city in a given month, calculated from the total volume and the number of sales.
  2. The median sale price in a given month will generally be different from the mean_price. Create a new variable, price_dif, that is the difference between the two (mean_price - median).
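  3. Create a new variable, sales_prop, that calculates the proportion of listings that resulted in sales in a given month.

library(dplyr)    # mutate(), filter(), summarise(), %>%
library(ggplot2)  # txhousing data and the plotting functions used below

txh <- txhousing %>%
  mutate(mean_price = volume / sales,      # total value of sales / number of sales
         price_dif = mean_price - median,  # positive when the mean exceeds the median
         sales_prop = sales / listings)    # share of listings that sold
txh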
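
To make the arithmetic behind the three new variables concrete, here is a minimal sketch on a single made-up row. The numbers are hypothetical and are not taken from txhousing; the column names simply mirror the ones used above.

```r
library(dplyr)

# Hypothetical city-month: 2 sales worth $300,000 in total,
# a median sale price of $140,000, and 8 active listings.
toy <- data.frame(sales = 2, volume = 300000, median = 140000, listings = 8) %>%
  mutate(mean_price = volume / sales,       # 300000 / 2 = 150000
         price_dif  = mean_price - median,  # 150000 - 140000 = 10000
         sales_prop = sales / listings)     # 2 / 8 = 0.25
toy
```

A positive price_dif, as in this row, means the mean sits above the median.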

2. Are there any observations where sales_prop is greater than one? List them.

txh %>%
  filter(sales_prop > 1) %>%
  select(city, year, month, sales_prop, everything())

3. Find the total number of sales, the total volume of sales, and the number of cities in this dataset.

txh %>%
  summarise(sales_total = sum(sales, na.rm = TRUE), 
            volume_total = sum(volume, na.rm = TRUE), 
            cities_num = n_distinct(city))

4. Find the mean number of sales per month, the median of the monthly median price, and the median of the monthly mean_price.

txh %>%
  summarise(sales_avg = mean(sales, na.rm = TRUE), 
            median_med = median(median, na.rm = TRUE), 
            median_mean = median(mean_price, na.rm = TRUE))

5. For each city, find the median price_dif and list them in descending order of their magnitude.

txh %>%
  group_by(city) %>%
  summarise(med_price_dif = median(price_dif, na.rm = TRUE)) %>%
  arrange(desc(med_price_dif))
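
The code above sorts by the signed value of med_price_dif. If "magnitude" in the exercise is read as absolute value (an assumption about the intent, not something the exercise states), a variant that sorts by abs() might look like the sketch below. The name by_magnitude is introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)  # provides the txhousing data

by_magnitude <- txhousing %>%
  mutate(price_dif = volume / sales - median) %>%   # same definition as in txh
  group_by(city) %>%
  summarise(med_price_dif = median(price_dif, na.rm = TRUE)) %>%
  arrange(desc(abs(med_price_dif)))                 # largest absolute difference first
by_magnitude
```

The two orderings differ only if some city has a negative median price_dif.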

6. Sales vary over the course of a single year. For each city and each year, find the mean number of monthly sales and the median of the price variables: median, mean_price, price_dif.

txh6 <- txh %>%
  group_by(city, year) %>%
  summarise(sales_avg = mean(sales, na.rm = TRUE),
            median_med = median(median, na.rm = TRUE),
            median_mean = median(mean_price, na.rm = TRUE),
            median_dif = median(price_dif, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped result (dplyr >= 1.0.0)
txh6

7. Use what you did in the previous exercise to create a line plot of the average monthly sales per year for each city, using a different color line for each city.

txh6 %>%
  ggplot() +
    geom_line(mapping = aes(x = year, y = sales_avg, color = city))

8. Use what you did in exercise 6 to create side-by-side boxplots of the median price_dif for each city.

txh6 %>%
  ggplot() +
    geom_boxplot(mapping = aes(x = median_dif, y = city))
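
With many cities on one axis, the boxplots can be easier to compare when the cities are ordered by their typical price_dif instead of alphabetically. The following is a sketch of one way to do that with base R's reorder(); the names yearly and p are illustrative, and the horizontal layout assumes ggplot2 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Rebuild the per-city, per-year median price_dif from txhousing
yearly <- txhousing %>%
  mutate(price_dif = volume / sales - median) %>%
  group_by(city, year) %>%
  summarise(median_dif = median(price_dif, na.rm = TRUE), .groups = "drop") %>%
  filter(!is.na(median_dif))  # drop city-years with no usable data

# Order cities by the median of their yearly values, then draw horizontal boxplots
p <- ggplot(yearly) +
  geom_boxplot(aes(x = median_dif, y = reorder(city, median_dif, FUN = median))) +
  labs(y = "city")
p
```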