In this week’s lecture, we discussed a number of visualization approaches in order to explore a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about the data (patterns).
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
To submit this homework, create the document in RStudio using the knitr package (button included in RStudio), then publish it to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see each visualization and the code required to create it.
library(MASS)  # the housing data set ships with the MASS package
str(housing)
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
head(housing)
## Sat Infl Type Cont Freq
## 1 Low Low Tower Low 21
## 2 Medium Low Tower Low 21
## 3 High Low Tower Low 28
## 4 Low Medium Tower Low 34
## 5 Medium Medium Tower Low 22
## 6 High Medium Tower Low 36
# Total number of residents for each housing type and satisfaction level
library(dplyr)
library(ggplot2)  # ggplot(), facet_wrap(), facet_grid()
library(ggpubr)   # theme_cleveland() and theme_pubr(), used in later plots
df1 <- housing %>%
  dplyr::select(Sat, Type, Freq) %>%
  group_by(Type, Sat) %>%
  mutate(total_freq = sum(Freq)) %>%   # total per Type x Sat group
  dplyr::select(Sat, Type, total_freq) %>%
  unique()
ggplot(df1, aes(x = Type, y = total_freq)) +
  geom_col(aes(fill = Type)) +
  facet_wrap(~ Sat) +
  scale_fill_manual(values = c("skyblue", "royalblue", "blue", "navy")) +
  labs(x = "Housing Type", y = "Total Number of Residents",
       title = "Resident Satisfaction by Housing Type") +
  theme_light()
Apartment is the most common of the four housing types, and it has the highest count in every satisfaction category (low, medium, and high). Many Apartment residents are highly satisfied (302), but nearly as many report low satisfaction (271). Tower shows a different pattern: high satisfaction (200) is about twice as common as either low (99) or medium (101) satisfaction. One reason Apartment and Tower dominate every panel is simply that more residents live in them than in Atrium or Terrace housing. Among Atrium residents, high satisfaction is the largest group (96) and low satisfaction the smallest (64).
df1
## # A tibble: 12 x 3
## # Groups: Type, Sat [12]
## Sat Type total_freq
## <ord> <fct> <int>
## 1 Low Tower 99
## 2 Medium Tower 101
## 3 High Tower 200
## 4 Low Apartment 271
## 5 Medium Apartment 192
## 6 High Apartment 302
## 7 Low Atrium 64
## 8 Medium Atrium 79
## 9 High Atrium 96
## 10 Low Terrace 133
## 11 Medium Terrace 74
## 12 High Terrace 70
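As a cross-check on the totals above, the same table can be built in one step by counting with Freq as a weight. This is a sketch of an alternative, not the pipeline used in this report; the name df1_check is introduced here purely for illustration.

```r
library(MASS)   # housing data set
library(dplyr)

# count() with wt = Freq sums Freq within each Type x Sat combination,
# standing in for the mutate() + select() + unique() steps.
df1_check <- housing %>%
  count(Type, Sat, wt = Freq, name = "total_freq")

print(df1_check)
```

Unlike df1, the result is an ungrouped data frame, which avoids carrying the grouping into later steps.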
ggplot(df1, aes(Sat, total_freq)) +
  geom_col(aes(fill = Sat)) +
  facet_grid(. ~ Type) +
  theme_cleveland() +  # from the ggpubr package
  labs(x = "Resident Satisfaction", y = "Total Number of Residents",
       title = "Resident Satisfaction by Housing Type")
Based on the plot, Tower and Atrium residents most often report high satisfaction, while Apartment has large groups at both the high and low ends. Terrace is the only housing type where low satisfaction is the most common response (133 of 277 residents).
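Because the housing types differ a lot in size, it can help to look at shares within each type alongside the raw counts. The following is a minimal sketch using base R; type_tab and type_shares are names chosen here for illustration.

```r
library(MASS)  # housing data set

# Cross-tabulate residents by housing type (rows) and satisfaction (columns),
# weighting each row of the data by its Freq count.
type_tab <- xtabs(Freq ~ Type + Sat, data = housing)

# margin = 1 divides each row by its row total, so every housing type sums to 1.
type_shares <- prop.table(type_tab, margin = 1)

print(round(100 * type_shares, 1))  # percentages within each housing type
```

For example, 200 of the 400 Tower residents report high satisfaction, which is a share of 50%.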
# How residents' perceived influence on property management relates to their satisfaction
df2 <- housing %>%
  dplyr::select(Sat, Infl, Freq) %>%
  group_by(Infl, Sat) %>%
  mutate(total_freq = sum(Freq)) %>%   # total per Infl x Sat group
  dplyr::select(Sat, Infl, total_freq) %>%
  unique()
ggplot(df2, aes(x = Infl, y = total_freq)) +
  geom_col(aes(fill = Infl)) +
  facet_wrap(~ Sat) +
  scale_fill_brewer(palette = "Blues") +
  theme_pubr() +  # from the ggpubr package
  labs(x = "Perceived Influence on Management", y = "Total Number of Residents",
       title = "Resident Satisfaction by Perceived Influence on Management")
Note that Infl records how much influence residents feel they have over the management of the property. The plot suggests that residents who perceive little influence tend to report low satisfaction: the low-influence bar is the tallest in the low-satisfaction panel. The reverse does not show up as cleanly in the raw counts. In the high-satisfaction panel, the medium-influence bar is the tallest, not the high-influence one. This should not be read as high influence lowering satisfaction, because the bars show counts and the influence groups differ in size. Comparing the share of each satisfaction level within an influence group would be a fairer test.
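Raw counts make it hard to compare groups of different sizes. One option, sketched here as an illustration and not as part of the original analysis, is to rescale every bar to the same height with position = "fill", so that each segment shows a share instead of a count. The object name p_share is an assumption made for this example.

```r
library(MASS)     # housing data set
library(ggplot2)

# position = "fill" stacks the Freq values within each Infl level and rescales
# the stack to 1, so segment heights are shares of that influence group.
p_share <- ggplot(housing, aes(x = Infl, y = Freq, fill = Sat)) +
  geom_col(position = "fill") +
  scale_fill_brewer(palette = "Blues") +
  labs(x = "Infl (influence on management)",
       y = "Share of Residents",
       fill = "Satisfaction") +
  theme_light()

p_share
```

If the high-satisfaction segment grows from the low to the high influence level, that supports a positive association regardless of how many residents fall in each group.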
df3 <- housing %>%
  dplyr::select(Type, Infl, Freq) %>%
  group_by(Type, Infl) %>%
  mutate(total_freq = sum(Freq)) %>%   # total per Type x Infl group
  dplyr::select(Type, Infl, total_freq) %>%
  unique()
ggplot(df3, aes(Infl, total_freq)) +
  geom_col(aes(fill = Infl)) +
  facet_grid(. ~ Type) +
  theme_minimal() +
  labs(x = "Perceived Influence on Management", y = "Total Number of Residents",
       title = "Perceived Influence on Management by Housing Type")
Across all four housing types, high perceived influence on management is the least common level. Medium is the most common level in Tower and Apartment housing, while in Atrium and Terrace housing low influence is the most common.
df4 <- housing %>%
  dplyr::select(Type, Cont, Freq) %>%
  group_by(Type, Cont) %>%
  mutate(total_freq = sum(Freq)) %>%   # total per Type x Cont group
  dplyr::select(Type, Cont, total_freq) %>%
  unique()
ggplot(df4, aes(x = Cont, y = total_freq)) +
  geom_col(aes(fill = Cont), color = "black", position = position_dodge(0.9)) +
  facet_wrap(~ Type) +
  theme_light() +
  labs(x = "Afforded Contact", y = "Number of Residents")
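Based on the plot, the pattern is not uniform across housing types. In Atrium and Terrace housing, residents with low afforded contact number roughly half of those with high afforded contact, and Apartment shows a smaller gap in the same direction. Tower is the exception: low contact (219 residents) is more common there than high contact (181).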
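The comparison between low and high afforded contact can also be checked numerically, one housing type at a time. This is a rough sketch; cont_tab and low_to_high are illustrative names, not objects used elsewhere in this report.

```r
library(MASS)  # housing data set

# Residents by housing type (rows) and afforded contact (columns).
cont_tab <- xtabs(Freq ~ Type + Cont, data = housing)

# Ratio of low-contact to high-contact residents within each housing type.
# A value near 0.5 means low contact is about half as common as high contact;
# a value above 1 means low contact is the more common level.
low_to_high <- cont_tab[, "Low"] / cont_tab[, "High"]

print(cont_tab)
print(round(low_to_high, 2))
```

Reading the ratio per housing type avoids generalizing from one panel to all four.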