The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop 5 report-quality visualizations (at least 4 of the visualizations should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, you need to write a few sentences about the trends and patterns that you discover from it.
To submit this homework, you will create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, you will submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.
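library(MASS)    # assumed source of the housing data set; loaded before dplyr so select() is not masked
library(dplyr)   # %>%, group_by(), summarise()
library(ggplot2)
library(ggpubr)  # ggballoonplot()
str(housing)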
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
# Visualization 1: resident counts at each satisfaction level, by property type
dataset1 = housing %>%
  group_by(Sat, Type) %>%
  summarise(Freq = sum(Freq), .groups = "drop")
ggplot(dataset1, aes(x = Sat, y = Freq)) +
  geom_bar(
    aes(fill = Sat), stat = "identity", color = "white",
    position = position_dodge(0.9)
  ) +
  facet_wrap(~Type) +
  labs(y = "Number of Residents",
       x = "Satisfaction",
       title = "Resident Satisfaction",
       subtitle = "Number of residents at each satisfaction level, by rental property type") +
  scale_fill_discrete(name = "Satisfaction", labels = c("Sat-Low", "Sat-Medium", "Sat-High")) +
  theme_bw()
This visualization shows resident satisfaction for each rental property type. Apartment is the most common property type, so it has higher counts at every satisfaction level than the other types. Relative to its resident count, Tower has the most satisfied residents. Terrace residents are the least satisfied, whereas Atrium residents show an average level of satisfaction.
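Because the property types differ a lot in size, raw counts make it hard to compare satisfaction between them. The following is a minimal sketch of one way to put the types on a common footing by converting counts to within-type shares. It assumes the housing data set comes from the MASS package; the names sat_share and Share are introduced here for illustration only.

```r
library(MASS)     # assumed source of the housing data set
library(dplyr)
library(ggplot2)

# Total residents per Type/Sat cell, then each cell's share of its Type
sat_share <- housing %>%
  group_by(Type, Sat) %>%
  summarise(Freq = sum(Freq), .groups = "drop") %>%
  group_by(Type) %>%
  mutate(Share = Freq / sum(Freq)) %>%
  ungroup()

# Same faceted layout as above, but the y-axis is a proportion, not a count
ggplot(sat_share, aes(x = Sat, y = Share, fill = Sat)) +
  geom_col(color = "white") +
  facet_wrap(~Type) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of Residents within Property Type",
       x = "Satisfaction",
       title = "Resident Satisfaction as a Share of Each Property Type") +
  theme_bw()
```

With shares on the y-axis, every panel sums to 100%, so the shape of each panel can be compared directly regardless of how many residents the property type has.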
# Visualization 2: resident counts at each perceived influence level, by property type
dataset2 = housing %>%
  group_by(Infl, Type) %>%
  summarise(Freq = sum(Freq), .groups = "drop")
ggplot(dataset2, aes(x = Infl, y = Freq)) +
  geom_bar(
    aes(fill = Infl), stat = "identity", color = "white",
    position = position_dodge(0.9)
  ) +
  facet_wrap(~Type) +
  labs(y = "Number of Residents",
       x = "Perceived Degree of Influence",
       title = "Resident Perceived Degree of Influence",
       subtitle = "Number of residents at each perceived degree of influence, by rental property type") +
  scale_fill_discrete(name = "Influence", labels = c("Infl-Low", "Infl-Medium", "Infl-High")) +
  theme_bw()
This visualization shows residents’ perceived degree of influence on the management of the property. Atrium and Terrace residents report less influence over management, whereas Apartment has the largest number of residents, so its counts for perceived degree of influence are also higher than those of the other property types.
# Visualization 3: satisfaction counts in a grid of influence level (rows) by property type (columns)
dataset3 = housing %>%
  group_by(Sat, Infl, Type) %>%
  summarise(Freq = sum(Freq), .groups = "drop")
ggplot(dataset3, aes(x = Sat, y = Freq)) +
  geom_bar(
    aes(fill = Infl), stat = "identity", color = "white",
    position = position_dodge(0.9)
  ) +
  facet_grid(Infl ~ Type) +
  labs(y = "Number of Residents",
       x = "Satisfaction",
       title = "Resident Satisfaction and Influence by Rental Property Type") +
  scale_fill_discrete(name = "Influence", labels = c("Infl-Low", "Infl-Medium", "Infl-High")) +
  theme_bw()
This visualization shows how resident counts vary with satisfaction and perceived degree of influence across property types. Atrium properties show lower counts for both influence and satisfaction than the other property types. Terrace shows low satisfaction, whereas Tower shows high satisfaction. Apartment residents who perceive a medium or high degree of influence are highly satisfied. In Tower, residents who perceive a low degree of influence are less satisfied.
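With a shared y-axis, the panels for the smaller property types are hard to read next to Apartment. The following is a rough sketch of one possible adjustment: facet_grid() can free the y-axis per row when scales = "free_y" is set, so placing Type on the rows gives each property type its own scale. It assumes the housing data set comes from the MASS package; infl_sat and p_free are illustrative names.

```r
library(MASS)     # assumed source of the housing data set
library(dplyr)
library(ggplot2)

# Resident counts for every satisfaction / influence / type combination
infl_sat <- housing %>%
  group_by(Sat, Infl, Type) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

# Type on rows: with scales = "free_y" each row (property type) gets its own y-axis
p_free <- ggplot(infl_sat, aes(x = Sat, y = Freq, fill = Infl)) +
  geom_col(color = "white") +
  facet_grid(Type ~ Infl, scales = "free_y") +
  labs(y = "Number of Residents",
       x = "Satisfaction",
       fill = "Influence",
       title = "Satisfaction by Influence, with a Separate Count Scale per Property Type") +
  theme_bw()
p_free
```

The trade-off is that bar heights are no longer comparable between rows, so this layout suits reading the pattern within each property type rather than comparing sizes across types.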
# Visualization 4: satisfaction counts by contact level and property type
dataset4 = housing %>%
  group_by(Cont, Sat, Type) %>%
  summarise(Freq = sum(Freq), .groups = "drop")
ggplot(dataset4, aes(x = Sat, y = Freq)) +
  geom_point(aes(color = Cont)) +
  facet_grid(~Type) +
  labs(y = "Number of Residents",
       x = "Satisfaction",
       title = "Resident Satisfaction and Contact by Rental Property Type",
       color = "Contact") +
  # Cont has only two levels (Low, High), so two colors and two labels are supplied
  scale_color_manual(values = c("red", "green"), labels = c("Con-Low", "Con-High")) +
  theme_bw()
In every property type except Tower, residents who are afforded high contact with other residents outnumber those with low contact. Apartment has a large number of residents with high contact and high satisfaction. Terrace and Atrium residents with low contact show lower satisfaction. Highly satisfied Tower residents tend to have high contact.
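Comparisons of contact levels across property types can be checked numerically as well as visually. The following is a small sketch using base R's xtabs() and prop.table(); it assumes the housing data set comes from the MASS package, and contact_tab and contact_share are illustrative names.

```r
library(MASS)  # assumed source of the housing data set

# Cross-tabulate resident counts: rows are property types, columns are contact levels
contact_tab <- xtabs(Freq ~ Type + Cont, data = housing)
contact_tab

# Row-wise shares: the fraction of each property type's residents at each contact level
contact_share <- prop.table(contact_tab, margin = 1)
round(contact_share, 2)
```

Reading across a row of contact_tab shows directly whether high-contact or low-contact residents are more numerous for that property type.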
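# Visualization 5: balloon plot of satisfaction against influence, by property type
# (reuses dataset3 from Visualization 3; ggballoonplot() is from the ggpubr package)
ggballoonplot(dataset3, x = 'Infl', y = 'Sat', size = 'Freq', facet.by = 'Type',
              fill = 'Freq', ggtheme = theme_light()) +
  scale_fill_viridis_c(option = 'D')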
This balloon plot shows that most residents live in apartments, and that highly satisfied apartment residents have a medium or high level of influence on property management. Tower residents are highly satisfied, with influence on property management spread across all levels. Terrace residents with low satisfaction have low influence on property management, whereas Atrium has fewer residents than the other property types.