The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.
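The difference between the two faceting functions named in the task can be sketched as follows. This is a minimal illustration rather than part of the assignment solution; it assumes the `housing` data frame is the one shipped with the MASS package, and the object names `base_plot`, `p_wrap`, and `p_grid` are made up for the example.

```r
library(ggplot2)

# Assumption: the survey data is MASS::housing; load it without attaching MASS
data("housing", package = "MASS")

# A shared base plot: bars of resident counts per satisfaction level
base_plot <- ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_col()

# facet_wrap: panels for ONE variable, wrapped into a rectangular layout
p_wrap <- base_plot + facet_wrap(~ Type, ncol = 2)

# facet_grid: a full grid for TWO variables (rows by Cont, columns by Type)
p_grid <- base_plot + facet_grid(Cont ~ Type)

p_wrap
p_grid
```

`facet_wrap` is usually the better fit when there is a single faceting variable with several levels, while `facet_grid` makes row and column comparisons explicit when two variables are crossed.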
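library(dplyr)
library(ggplot2)
# Load the housing data from MASS without attaching the package (avoids masking dplyr::select)
data("housing", package = "MASS")
str(housing)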
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
# place code for vis here
first_housing <- housing %>%
  group_by(Type, Sat) %>%
  summarise(Freq = sum(Freq))
ggplot(first_housing, aes(Sat, Freq)) +
  geom_point(aes(color = Type)) +
  facet_grid(Type ~ ., scales = "free", space = "free") +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = "none") +
  labs(y = "Total Number of Residents",
       x = "Satisfaction",
       title = "Resident Satisfaction by Rental Property Type")
# In tower and atrium properties, highly satisfied residents form the largest group and residents with low satisfaction the smallest.
# In contrast, terrace properties have more residents with low satisfaction than with medium or high satisfaction.
# Apartment residents tend toward the extremes: high and low satisfaction are both more common than medium. This property type also has the largest sample size.
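The sample-size remark above can be checked directly from the counts. The following is a small sketch using base R only; it assumes `housing` is the MASS data set, and `type_totals` is a name introduced here for illustration.

```r
# Assumption: the survey data is MASS::housing
data("housing", package = "MASS")

# Total number of residents per property type
type_totals <- aggregate(Freq ~ Type, data = housing, FUN = sum)

# Sort from largest to smallest group
type_totals[order(-type_totals$Freq), ]
```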
# place code for vis here
second_housing <- housing %>%
  group_by(Infl, Sat) %>%
  summarise(Freq = sum(Freq))
ggplot(second_housing, aes(Sat, Freq)) +
  geom_point(aes(color = Infl)) +
  facet_grid(Infl ~ ., scales = "free", space = "free") +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = "none") +
  labs(y = "Total Number of Residents",
       x = "Satisfaction",
       title = "Resident Satisfaction",
       subtitle = "by householders' perceived degree of influence on management")
# Resident satisfaction appears to be positively associated with the influence householders perceive they have on the management of the property. When perceived influence is low, the largest group of residents reports low satisfaction, and when perceived influence is high, most residents report high satisfaction.
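Because the three influence groups differ in size, the pattern described above is easier to judge from within-group shares than from raw counts. A minimal sketch in base R, assuming `housing` is the MASS data set (the names `infl_sat` and `infl_sat_share` are illustrative):

```r
# Assumption: the survey data is MASS::housing
data("housing", package = "MASS")

# Cross-tabulate counts: rows are influence levels, columns are satisfaction levels
infl_sat <- xtabs(Freq ~ Infl + Sat, data = housing)

# Share of each satisfaction level within each influence level (each row sums to 1)
infl_sat_share <- prop.table(infl_sat, margin = 1)

round(infl_sat_share, 2)
```

If the association holds, the share in the `High` satisfaction column should rise from the low-influence row to the high-influence row.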
# place code for vis here
third_housing <- housing %>%
  group_by(Infl, Cont) %>%
  summarise(Freq = sum(Freq))
# The x-axis shows Cont (contact with other residents); the facets show Infl
ggplot(third_housing, aes(Cont, Freq)) +
  geom_point(aes(color = Infl)) +
  facet_grid(Infl ~ ., scales = "free", space = "free") +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = "none") +
  labs(y = "Number of Residents",
       x = "Contact with Other Residents",
       title = "Resident Contact with Other Residents by Perceived Influence")
# Cont records the degree of contact residents have with other residents and has two levels, low and high. In the low and medium perceived-influence groups, residents with high contact clearly outnumber those with low contact, whereas in the high-influence group the two contact levels are about equally common. Overall, more residents in this dataset have high contact than low contact.
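The statement above about how perceived influence is distributed within each level of Cont can be checked with a short dplyr summary. This is a sketch under the assumption that `housing` is the MASS data set; `infl_by_cont` and the `share` column are names introduced for illustration.

```r
library(dplyr)

# Assumption: the survey data is MASS::housing
data("housing", package = "MASS")

infl_by_cont <- housing %>%
  group_by(Cont, Infl) %>%
  # Sum counts per Cont/Infl cell; the result stays grouped by Cont
  summarise(Freq = sum(Freq), .groups = "drop_last") %>%
  # Share of each influence level within its Cont level
  mutate(share = Freq / sum(Freq)) %>%
  ungroup()

infl_by_cont
```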
# place code for vis here
fourth_housing <- housing %>%
  group_by(Sat, Type, Infl) %>%
  summarise(Freq = sum(Freq))
# Prefix the influence levels so the x-axis labels are self-explanatory
levels(fourth_housing$Infl) <- list('Inf-Low' = 'Low', 'Inf-Medium' = 'Medium', 'Inf-High' = 'High')
ggplot(fourth_housing, aes(x = Infl, y = Freq)) +
  geom_bar(
    aes(fill = Sat), stat = 'identity', color = 'black',
    position = position_dodge(0.9)
  ) +
  theme_light() +
  facet_wrap(~Type) +
  guides(fill = guide_legend(title = 'Satisfaction')) +
  labs(x = 'Perceived Influence',
       y = 'Number of Residents',
       title = 'Number of Residents',
       subtitle = 'by Satisfaction, Rental Accommodation and Management Influence') +
  # fill maps Sat, so the legend keeps the satisfaction levels (Low, Medium, High)
  scale_fill_manual(values = c('azure2', 'azure3', 'azure4'))
# place code for vis here
fifth_housing <- housing %>%
  group_by(Cont, Type, Infl) %>%
  summarise(Freq = sum(Freq))
ggplot(fifth_housing, aes(x = Infl, y = Freq)) +
  geom_bar(
    aes(fill = Cont), stat = "identity", color = "black",
    position = position_dodge(0.9)
  ) +
  facet_wrap(~Type) +
  theme_light() +
  guides(fill = guide_legend(title = 'Contact')) +
  labs(x = 'Perceived Influence',
       y = 'Number of Residents',
       title = 'Number of Residents',
       subtitle = 'by Contact, Rental Property Type, and Perceived Influence') +
  scale_fill_manual(labels = c('Low', 'High'),
                    values = c('azure2', 'azure3'))
# When contact with other residents (Cont) is high:
# In all rental property types, the fewest residents report a high perceived degree of influence on the management of the property (Infl). For tower and apartment properties, slightly more residents report medium than low influence, while for atrium properties slightly more report low than medium. For terrace properties, the differences are much larger: the majority of residents report low influence, followed by medium and then high.
# When contact with other residents (Cont) is low, there is no clear trend in perceived influence across property types.
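Since the property types differ considerably in sample size, comparing the mix of influence levels across panels can be easier with bars normalized to 100%. The following is one possible sketch, not a replacement for the count plot above; it assumes `housing` is the MASS data set, and `p_share` is an illustrative name.

```r
library(ggplot2)

# Assumption: the survey data is MASS::housing
data("housing", package = "MASS")

p_share <- ggplot(housing, aes(x = Cont, y = Freq, fill = Infl)) +
  # position = "fill" stacks the bars and rescales each one to a total of 1
  geom_col(position = "fill") +
  facet_wrap(~ Type) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_light() +
  labs(x = "Contact with Other Residents",
       y = "Share of Residents",
       fill = "Perceived Influence")

p_share
```

The trade-off is that the normalized view hides how many residents each bar represents, so it works best alongside the count plot rather than instead of it.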