DS 2870 Homework 5

Question 1) Conditional Bar Chart

Data Description: Housing Data

The homes data set has 8 variables on 1299 homes that sold in January of 2023:

type: The type of property
zip: The zip code of the property
price: The price the home sold for
bed and baths: The number of beds and baths in the home, respectively
sqft: The square footage of the home
lot: The lot size of the home
year: The year the home was built

Run the line of code below to add the city of the home (Burlington/DC/Nashville):

##                       type   n
## 1                    condo 309
## 2 mobile/manufactured home   5
## 3             multi-family  67
## 4                    other   2
## 5                  parking   5
## 6            single family 608
## 7                townhouse 296
## 8              vacant land   7

Question 1a) Create the data set

Create a data set with the proportion of homes of each property type (condos, townhouses, and single family homes only), conditional on the city. Round the proportions to 4 decimal places and save the result as homes_q1.

If done correctly, the proportion of DC homes that are single family is 0.2037.

Make sure to display it in the knitted document.

homes_q1 <-
  homes |> 
  # Keeping the 3 home types
  filter(type %in% c("condo", "townhouse", "single family")) |> 
  # Counting the number of homes per type and city combo
  summarize(
    .by = c(city, type),
    houses = n()
  ) |> 
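  # Calculating the proportion of each property type within its city
  # and ordering the property types for the chart
  mutate(
    .by = city,
    prop_city = round(houses/sum(houses), digits = 4),
    type = factor(type, levels = c("condo", "townhouse", "single family"))
  ) |> 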
  arrange(city, type)
  

homes_q1

##         city          type houses prop_city
## 1 Burlington         condo     85    0.2787
## 2 Burlington     townhouse     56    0.1836
## 3 Burlington single family    164    0.5377
## 4         DC         condo    171    0.4005
## 5         DC     townhouse    169    0.3958
## 6         DC single family     87    0.2037
## 7  Nashville         condo     53    0.1102
## 8  Nashville     townhouse     71    0.1476
## 9  Nashville single family    357    0.7422
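
As a quick sanity check, conditional proportions should sum to (roughly) 1 within each city. The sketch below rebuilds the counts by hand from the output above, so it does not need the homes data set; the names homes_check and city_check are made up for this illustration.

```r
library(dplyr)

# Counts copied from the homes_q1 output above
homes_check <- data.frame(
  city   = rep(c("Burlington", "DC", "Nashville"), each = 3),
  type   = rep(c("condo", "townhouse", "single family"), times = 3),
  houses = c(85, 56, 164, 171, 169, 87, 53, 71, 357)
)

# Recomputing the conditional proportions, then adding them up per city
city_check <-
  homes_check |>
  mutate(
    .by = city,
    prop_city = round(houses / sum(houses), digits = 4)
  ) |>
  summarize(
    .by = city,
    homes = sum(houses),
    total_prop = sum(prop_city)
  )

city_check
```

Because of rounding, total_prop may differ from 1 in the fourth decimal place for other data. The three city totals should also add up to the combined number of condos, single family homes, and townhouses in the type counts shown earlier (309 + 608 + 296 = 1213).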

Question 1b) Create the side-by-side bar charts

Create the side-by-side bar charts displaying the proportions of property type conditional on the city. Which city has the lowest rate of single family homes?

ggplot(
  data = homes_q1,
  mapping = aes(
    x = city,
    fill = type,
    y = prop_city
  )
) + 
  
  # Using geom_col() to add the bars
  # and positioning them side-by-side with position = "dodge2"
  geom_col(
    color = "black",
    position = "dodge2"
  ) + 
  
  # Adding the number of homes at the top of each bar using geom_bar_text() (ggfittext package)
  geom_bar_text(
    mapping = aes(label = houses),
    position = "dodge2"
  ) +
  
  # Changing the labels and adding a caption
  labs(
    x = NULL,
    y = NULL,
    fill = "Property Type",
    caption = "Data: Redfin.com"
  ) + 
  
  # Explaining what the number at the top of each bar represents
  annotate(
    geom = "text",
    label = "Number of homes displayed\nat the top of each bar",
    y = 0.70,
    x = 2,
    fontface = "bold",
    size = 5
  ) +
  
  # Changing what appears on the y-axis:
  # 1) expand removes the blank space
  # 2) labels changes the proportions to percentages
  # 3) breaks changes where the tick marks are located
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    labels = scales::label_percent(),
    breaks = (0:7)/10
  ) + 
  
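  # Changing the labels on the x-axis
  # (named so each label is matched to its city rather than relying on order)
  scale_x_discrete(
    labels = c("Burlington" = "Burlington, VT",
               "DC" = "Washington DC",
               "Nashville" = "Nashville, TN")
  ) +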
  
  # Changing the default theme
  theme_classic() + 
  
  # Moving the legend to the top
  theme(legend.position = "top")
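
The question in 1b can also be checked numerically rather than read off the chart. The sketch below uses the single family proportions copied from the homes_q1 output; single_family and lowest_city are names invented for this illustration.

```r
library(dplyr)

# Single family proportions copied from the homes_q1 output
single_family <- data.frame(
  city      = c("Burlington", "DC", "Nashville"),
  prop_city = c(0.5377, 0.2037, 0.7422)
)

# Keeping the city with the smallest single family proportion
lowest_city <-
  single_family |>
  slice_min(prop_city, n = 1)

lowest_city
```

Based on these values, DC has the lowest rate of single family homes (about 20%), which should match the shortest single family bar in the chart.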

Question 2) Map of education levels

Part 2A) Creating the states data set

The code chunk above creates a data set called state_education that contains the proportion of residents in each state with no high school degree, only a high school degree, some college but no degree, and at least a college degree.

Use the state_education data set to create a new data set named states2 with columns:

long, lat, and group: The 3 columns needed to draw the state outlines
region: The name of the state (in lowercase)
education: A factor with 4 levels ordered “No High School Degree”, “Only High School Degree”, “Some College Experience”, “Bachelor Degree or Higher” (hint: use the factor() function with the levels and labels arguments)
proportion: The proportion of residents in the state with that education level
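
To illustrate the hint about factor(), here is a minimal sketch on a small made-up vector (education_raw is hypothetical; the codes match the column names used in the solution below). The levels argument sets the order of the groups and labels renames them in that same order.

```r
# A small, made-up vector of education codes in no particular order
education_raw <- c("bachelor", "no_hs", "some_college", "high_school")

education <- factor(
  x = education_raw,
  # levels fixes the order of the groups
  levels = c("no_hs", "high_school", "some_college", "bachelor"),
  # labels renames each level, matched by position to levels
  labels = c("No High School Degree",
             "Only High School Degree",
             "Some College Experience",
             "Bachelor Degree or Higher")
)

# The level order follows the levels argument, not the order of the data
levels(education)

# Each value keeps its position in the data but shows the new label
as.character(education)
```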

Make sure to remove the row for Washington DC.

If done correctly, it should have 62108 rows and 6 columns.
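
The expected row count can be reasoned out before building states2: every outline point outside Washington DC should appear once per education level, so the total is the number of non-DC outline rows times 4. The sketch below assumes state_education has exactly one row per state (so the join does not duplicate outline rows) and that the maps package is installed; outline_rows and expected_rows are illustrative names.

```r
library(ggplot2)  # map_data() needs the maps package to be installed

# Outline points for the lower 48 states plus Washington DC
state_data <- map_data(map = "state")

# Outline rows that remain after dropping Washington DC
outline_rows <- sum(state_data$region != "district of columbia")

# One copy of each remaining outline row per education level
expected_rows <- outline_rows * 4

expected_rows
```

If this number differs from the number of rows in states2, the join or the filter is the first place to look.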

# Creating a data set with the state outlines
state_data <- map_data(map = "state")


# Adding the education info to the state_data set
states2 <- 
  left_join(x = state_data,
            y = state_education |> 
                mutate(region = tolower(state)),
            by = "region") |> 
  # Removing washington dc
  filter(state != "District of Columbia") |> 
  # placing all 4 education proportions into 1 column named proportion
  pivot_longer(
    cols = no_hs:bachelor,
    names_to = "education",
    values_to = "proportion"
  ) |> 
  # Reordering the groups for education and changing the label for it
  mutate(
    education = factor(
      x = education,
      levels = c("no_hs", "high_school", "some_college", "bachelor"),
      labels = c("No High School Degree",
                 "Only High School Degree",
                 "Some College Experience",
                 "Bachelor Degree or Higher")
    )
  ) |> 
  # Picking the relevant columns
  dplyr::select(long:group, region, education, proportion)

as_tibble(states2)

## # A tibble: 62,108 × 6
##     long   lat group region  education                 proportion
##    <dbl> <dbl> <dbl> <chr>   <fct>                          <dbl>
##  1 -87.5  30.4     1 alabama No High School Degree          0.142
##  2 -87.5  30.4     1 alabama Only High School Degree        0.309
##  3 -87.5  30.4     1 alabama Some College Experience        0.299
##  4 -87.5  30.4     1 alabama Bachelor Degree or Higher      0.250
##  5 -87.5  30.4     1 alabama No High School Degree          0.142
##  6 -87.5  30.4     1 alabama Only High School Degree        0.309
##  7 -87.5  30.4     1 alabama Some College Experience        0.299
##  8 -87.5  30.4     1 alabama Bachelor Degree or Higher      0.250
##  9 -87.5  30.4     1 alabama No High School Degree          0.142
## 10 -87.5  30.4     1 alabama Only High School Degree        0.309
## # ℹ 62,098 more rows
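
Each outline point repeats four times in the output because pivot_longer() stacks the four education columns into one. A minimal sketch with a single outline point shows the reshaping; the values are the rounded Alabama figures from the output above, and one_point and one_point_long are names made up for this illustration.

```r
library(tidyr)

# One outline point for Alabama, using the rounded values printed above
one_point <- data.frame(
  long = -87.5,
  lat = 30.4,
  group = 1,
  region = "alabama",
  no_hs = 0.142,
  high_school = 0.309,
  some_college = 0.299,
  bachelor = 0.250
)

# Stacking the 4 education columns: 1 wide row becomes 4 long rows
one_point_long <-
  one_point |>
  pivot_longer(
    cols = no_hs:bachelor,
    names_to = "education",
    values_to = "proportion"
  )

one_point_long
```

The long, lat, group, and region values are copied onto every new row, which is why facet_wrap() can later draw a complete map for each education level.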

Part 2B) Creating Education Maps

Create the maps seen in Brightspace using ggplot(). Make sure to exclude the row for the District of Columbia!

ggplot(
  data = states2,
  mapping = aes(
    x = long,
    y = lat,
    fill = proportion,
    group = group
  )
) + 
  # Drawing the state outline
  geom_polygon(
    color = "black", 
    linewidth = 0.2
  ) + 
  
  # Map theme
  theme_map() + 
  
  # Separate map for each education level
  facet_wrap(
    facets = ~ education
  ) +
  
  # Changing the color levels to different shades of green
  scale_fill_fermenter(
    labels = scales::label_percent(),
    palette = "Greens",
    direction = 1
  ) +
  
  # Changing the coordinate to an albers projection
  coord_map(
    projection = "albers",
    lat0 = 39, lat1 = 45
  ) + 
  
  # Removing the buffer space around the sides
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0)) + 
  
  # Removing the label for fill and adding a title
  labs(
    title = "Education Levels per State",
    fill = NULL
  ) + 
  
  # Centering the title, removing the background of the facet panels, and
  # moving the legend to the center of the graph
  theme(
    plot.title = element_text(size = 16,
                              hjust = 0.5),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold"),
    legend.position = c(0.45, 0.45)
  )

DS 2870 Homework 5 - Solutions

Jacob Martin