ANLY 512 - Problem Set 6

Objectives

In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
- Sat - satisfaction of householders with their present housing circumstances
- Infl - perceived degree of influence householders have on the management of the property
- Type - type of rental accommodation
- Cont - contact residents are afforded with other residents
- Freq - Frequencies: the numbers of residents in each class
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of which should use facet_wrap or facet_grid) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.

To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and publish it to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.

Look at the data

library(MASS)  # assumed source of the housing data set
str(housing)

## 'data.frame':    72 obs. of  5 variables:
##  $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
##  $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  21 21 28 34 22 36 10 11 36 61 ...

head(housing)

##      Sat   Infl  Type Cont Freq
## 1    Low    Low Tower  Low   21
## 2 Medium    Low Tower  Low   21
## 3   High    Low Tower  Low   28
## 4    Low Medium Tower  Low   34
## 5 Medium Medium Tower  Low   22
## 6   High Medium Tower  Low   36

1. First plot

# The frequency of each house type with different satisfaction levels
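library(dplyr)
library(ggplot2)
library(ggpubr)  # provides theme_cleveland() and theme_pubr(), used in later plots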

housing %>% 
  dplyr::select(c(Sat,Type, Freq)) %>% 
  group_by(Type, Sat) %>% 
  mutate(total_freq = sum(Freq)) %>%
  dplyr::select(c(Sat,Type,total_freq)) %>%
  unique()-> df1


ggplot(df1, aes(x = Type, y = total_freq)) +
  geom_bar(stat = 'identity', aes(fill = Type)) +
  facet_wrap(. ~ Sat) +
  scale_fill_manual(values = c("skyblue", "royalblue", "blue", "navy")) +
  xlab('Housing Type') +
  ylab('Total Number of Residents') +
  ggtitle('Residents Satisfaction of Different Housing Types') +
  theme_light()

Apartments are the most common housing type of the four. They also have the highest count in each of the low, medium, and high satisfaction panels, which means that while many residents are highly satisfied with apartments, a substantial number are dissatisfied with them. A similar logic applies to towers. A likely reason is that far more residents live in apartments and towers than in atriums and terraces, so their bars are taller in every panel. Among atrium residents, the high-satisfaction group is the largest and the low-satisfaction group is the smallest.

2. Second plot

df1

## # A tibble: 12 x 3
## # Groups:   Type, Sat [12]
##    Sat    Type      total_freq
##    <ord>  <fct>          <int>
##  1 Low    Tower             99
##  2 Medium Tower            101
##  3 High   Tower            200
##  4 Low    Apartment        271
##  5 Medium Apartment        192
##  6 High   Apartment        302
##  7 Low    Atrium            64
##  8 Medium Atrium            79
##  9 High   Atrium            96
## 10 Low    Terrace          133
## 11 Medium Terrace           74
## 12 High   Terrace           70

# theme_cleveland() is provided by the ggpubr package
ggplot(df1, aes(Sat, total_freq)) +
  geom_col(aes(fill = Sat)) +
  facet_grid(. ~ Type) +
  theme_cleveland() +
  xlab('Resident Satisfaction') +
  ylab('Total Number of Residents') +
  ggtitle('Residents Satisfaction of Different Housing Types')

Based on the plot, Tower and Atrium residents most often report high satisfaction, while Apartment residents are split, with large numbers reporting both high and low satisfaction. Terrace is the only housing type where low satisfaction is the most common response.
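
Because the housing types differ considerably in size, raw counts can hide how satisfaction is distributed within each type. The following is a minimal sketch, not part of the original analysis, that converts the df1 totals printed above into within-type proportions using base R. The names counts and sat_share are introduced here only for illustration.

```r
# Counts copied from the df1 printout above (total residents by Type and Sat).
counts <- matrix(
  c( 99, 101, 200,
    271, 192, 302,
     64,  79,  96,
    133,  74,  70),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Type = c("Tower", "Apartment", "Atrium", "Terrace"),
    Sat  = c("Low", "Medium", "High")
  )
)

# Row-wise proportions: the three satisfaction shares of each type sum to 1.
sat_share <- prop.table(counts, margin = 1)
round(sat_share, 2)
```

With these counts, half of Tower residents (200 of 400) fall in the High category, while about 48% of Terrace residents (133 of 277) fall in Low.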

3. Third plot

# How residents' perceived influence on property management relates to their satisfaction
housing %>% 
  dplyr::select(c(Sat,Infl,Freq)) %>% 
  group_by(Infl, Sat) %>% 
  mutate(total_freq = sum(Freq)) %>%
  dplyr::select(c(Sat,Infl,total_freq)) %>%
  unique()-> df2

# theme_pubr() is provided by the ggpubr package
ggplot(df2, aes(x = Infl, y = total_freq)) +
  geom_bar(stat = 'identity', aes(fill = Infl)) +
  facet_wrap(. ~ Sat) +
  scale_fill_brewer(palette = "Blues") +
  theme_pubr() +
  xlab('Perceived Influence on Management') +
  ylab('Total Number of Residents') +
  ggtitle('Resident Satisfaction by Perceived Influence on Management')

Based on the plot, residents who perceive a low degree of influence over the management of the property form the largest group in the low-satisfaction panel, so low perceived influence goes together with low satisfaction. The converse does not hold as cleanly: in the high-satisfaction panel, the medium-influence group is the largest, and the high-influence group is smaller. Because these are raw counts, the smaller high-influence bar may partly reflect that fewer residents report high influence overall.
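
One way to check this reading is to look at proportions within each influence level rather than raw counts. The sketch below assumes that housing is the data set shipped with the MASS package (the report never shows where it is loaded from). The names infl_tab and infl_share are illustrative only.

```r
# Assumption: `housing` is MASS::housing (Copenhagen Housing Conditions Survey).
data("housing", package = "MASS")

# Total residents cross-tabulated by influence level and satisfaction.
infl_tab <- xtabs(Freq ~ Infl + Sat, data = housing)

# Share of each satisfaction level within an influence level (rows sum to 1).
infl_share <- prop.table(infl_tab, margin = 1)
round(infl_share, 2)
```

If the High column of infl_share rises from the Low row to the High row, satisfaction increases with perceived influence even though the high-influence bars are shorter in absolute terms.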

4. Fourth plot

housing %>% 
  dplyr::select(c(Type,Infl, Freq)) %>% 
  group_by(Type, Infl) %>% 
  mutate(total_freq = sum(Freq)) %>%
  dplyr::select(c(Type,Infl,total_freq)) %>%
  unique()-> df3

ggplot(df3, aes(Infl, total_freq)) +
  geom_col(aes(fill = Infl)) +
  facet_grid(. ~ Type) +
  theme_minimal() +
  xlab('Perceived Influence on Management') +
  ylab('Total Number of Residents') +
  ggtitle('Perceived Influence on Management by Housing Type')

In general, few residents report a high degree of influence over management in any of the four housing types, while a medium degree of influence is more common in terms of the number of residents. Atrium and Terrace residents tend to report lower levels of influence.

5. Fifth plot

housing %>% 
  dplyr::select(c(Type,Cont, Freq)) %>% 
  group_by(Type, Cont) %>% 
  mutate(total_freq = sum(Freq)) %>%
  dplyr::select(c(Type,Cont,total_freq)) %>%
  unique()-> df4

ggplot(df4,aes(x=Cont, y = total_freq)) + geom_bar(aes(fill = Cont), stat = 'identity', color = 'black', position = position_dodge(0.9))  +
   facet_wrap(Type ~ .) + theme_light() + xlab('Afforded Contact') + ylab('Number of Residents')

Based on the plot, the number of residents with low afforded contact is roughly half the number with high afforded contact.
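
The size of this gap can be checked numerically for each housing type. This is a hedged sketch that assumes housing is the data set from the MASS package, which the report does not state explicitly. The names cont_tab and low_to_high are hypothetical.

```r
# Assumption: `housing` is MASS::housing (Copenhagen Housing Conditions Survey).
data("housing", package = "MASS")

# Total residents cross-tabulated by housing type and afforded contact.
cont_tab <- xtabs(Freq ~ Type + Cont, data = housing)

# Ratio of low-contact to high-contact residents within each housing type.
low_to_high <- cont_tab[, "Low"] / cont_tab[, "High"]
round(low_to_high, 2)
```

A ratio near 0.5 for a housing type would support the "half" reading for that type, while a value closer to 1 would suggest the gap is smaller than the plot appears to show.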