The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (its patterns).
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, you need to write a few sentences about the trends and patterns that you discover from it.
To submit this homework, you will create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, you will submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.
library(MASS) # provides the housing data set
library(ggplot2) # plotting functions used below
str(housing)
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
ggplot(housing, aes(Infl, Freq)) +
  geom_bar(aes(fill = Cont), stat = "identity", position = position_dodge()) +
  labs(y = 'Numbers of Residents',
       x = 'Degree of influence',
       title = 'Degree of infl householders have on management by Type and No of Residents') +
  facet_wrap(~Type)
The graph shows the relationship between the degree of influence and the number of residents, broken down by type of rental and by the contact residents have with one another. From the chart, terrace residents appear to have relatively low influence, while apartment residents appear to have high influence.
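One caveat about the plot above: `housing` holds a separate row for each satisfaction level, so each dodged bar is drawn from three rows that overlap rather than add up, and only the tallest one remains visible. The sketch below is one possible way to plot totals instead, assuming that summing `Freq` over `Sat` is the intended quantity. The names `totals` and `p` are illustrative and not part of the original analysis.

```r
library(MASS)     # housing data set
library(ggplot2)

# Sum the resident counts over the three satisfaction levels so that
# each (Infl, Cont, Type) combination contributes exactly one bar.
totals <- aggregate(Freq ~ Infl + Cont + Type, data = housing, FUN = sum)

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(totals, aes(Infl, Freq)) +
  geom_col(aes(fill = Cont), position = position_dodge()) +
  facet_wrap(~Type) +
  labs(y = 'Numbers of Residents', x = 'Degree of influence')
print(p)
```

Aggregating before plotting keeps the bar heights equal to the group totals, which makes comparisons across panels easier to read.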
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
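# Score each row: Low = 1, Medium = 2, High = 3, weighted by resident count
housing_dt <- data.table(housing)
housing_dt[Sat == "Low", sat_score := 1 * Freq]
housing_dt[Sat == "Medium", sat_score := 2 * Freq]
housing_dt[Sat == "High", sat_score := 3 * Freq]
# Total score and total residents per rental type and influence level
housing_dt <- housing_dt[, .(sat_score = sum(sat_score),
                             Freq = sum(Freq)),
                         by = .(Type, Infl)]
# Frequency-weighted average satisfaction score
housing_dt[, avg := round(sat_score / Freq, 2)]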
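The average computed above is a frequency-weighted mean of the scores 1, 2 and 3. As a cross-check, the sketch below reaches the same quantity in one step with `weighted.mean()`, relying on the fact that `as.integer()` on the ordered factor `Sat` returns 1, 2 and 3 for Low, Medium and High. The names `dt` and `avg_dt` are illustrative only.

```r
library(MASS)        # housing data set
library(data.table)

dt <- as.data.table(housing)

# as.integer(Sat) maps the ordered levels Low < Medium < High to 1, 2, 3;
# weighting by Freq gives the average score per resident in each group.
avg_dt <- dt[, .(Freq = sum(Freq),
                 avg  = round(weighted.mean(as.integer(Sat), Freq), 2)),
             by = .(Type, Infl)]
print(avg_dt)
```

This avoids the three separate assignment steps, at the cost of depending on the level order of `Sat`.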
ggplot(housing_dt, aes(x = avg, y = Infl)) +
  geom_point(aes(size = Freq, color = Type)) +
  facet_grid(Type ~ .) +
  labs(x = "Average Sat Scores",
       y = "Management Influence",
       title = "Residents' count by Influence on Management, Type of Rental, and Average Satisfaction Scores")
This graph shows how average satisfaction varies with the level of influence on management, broken down by rental type. It shows that Tower residents are more satisfied on average.
ggplot(housing, aes(Freq, Cont)) +
  geom_point(aes(color = Type)) +
  facet_grid(Type ~ Infl, space = "fixed", scales = "fixed") +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = 'none') +
  labs(y = 'Afforded Contact',
       x = 'Numbers of Residents',
       title = 'Number of Residents by Afforded Contact and Type of Rental')
For both low and high levels of contact afforded with other residents, the Atrium housing type has the lowest number of residents in each class, and the Apartment housing type has the highest.
housing_dt4 <- data.table(housing)
ggplot(housing_dt4[, .(Freq = sum(Freq)), by = .(Sat, Type)], aes(y = Freq, x = Sat)) +
  geom_bar(aes(fill = Type), stat = 'identity') +
  coord_flip() +
  facet_grid(Type ~ ., scales = 'fixed', space = 'fixed') +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = 'none') +
  # labs() names aesthetics, not screen axes: x is Sat and y is Freq even after coord_flip()
  labs(x = 'Satisfaction',
       y = 'Numbers of Residents',
       title = 'Number of Residents by Satisfaction and Rental Accommodation')
For the Tower, Apartment and Atrium housing types, residents who are highly satisfied with their present housing circumstances form the largest group. For the Terrace housing type, however, residents with low satisfaction form the largest group.
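Because the housing types differ in size, raw counts can be hard to compare across panels. The sketch below shows one way to check the pattern numerically by converting the counts to within-type shares. The names `tab` and `shares` are illustrative and not part of the original analysis.

```r
library(MASS)  # housing data set

# Cross-tabulate resident counts by rental type and satisfaction level
tab <- xtabs(Freq ~ Type + Sat, data = housing)

# Convert each row (one rental type) to proportions that sum to 1
shares <- prop.table(tab, margin = 1)
print(round(shares, 2))
```

Reading across a row then shows which satisfaction level dominates within that rental type, independent of how many residents the type has.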
ggplot(housing, aes(x = Type, y = Freq)) +
  geom_boxplot(aes(fill = Infl), shape = 1) +
  facet_grid(. ~ Infl) +
  labs(y = 'Numbers of Residents',
       x = 'Type of rental accommodation',
       title = 'Type of rental accommodation by Influence and No of Residents')
This diagram shows the relationship between the type of rental accommodation and residents' influence on management. As the figure shows, Terrace has a large number of residents and little influence. The distribution of Atrium residents is relatively even.