The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.
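# Load the packages used below; `housing` is assumed to be the copy shipped with MASS
library(MASS)
library(dplyr)
library(ggplot2)
library(wesanderson)
str(housing)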
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
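The `str()` output shows 72 rows, but each row is one combination of factor levels with a `Freq` count, not one respondent. A minimal sketch of how to confirm this, assuming `housing` is the data set shipped with the MASS package (the names `n_cells` and `n_residents` are illustrative and not part of the original analysis):

```r
library(MASS)  # assumption: `housing` is the copy shipped with MASS

# Number of possible level combinations: 3 (Sat) x 3 (Infl) x 4 (Type) x 2 (Cont)
n_cells <- prod(sapply(housing[c("Sat", "Infl", "Type", "Cont")], nlevels))

# One row per combination, so the row count should equal n_cells
nrow(housing) == n_cells

# The number of surveyed residents is the sum of the counts, not the row count
n_residents <- sum(housing$Freq)
n_residents
```

Because of this layout, every plot has to map `Freq` to an aesthetic (or aggregate it first) rather than count rows.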
# Define a reusable clean-up theme (a theme object, not a function)
clean_up <- theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
# Plot the percentage of residents at each satisfaction level
# by type of accommodation
p <- ggplot(housing, aes(x = Type, y = Freq, fill = Sat))
p <- p +
  geom_bar(stat = "identity", width = 0.7, position = "fill") +
  # position = "fill" yields proportions in [0, 1]; show them as percentages
  scale_y_continuous(labels = scales::percent) +
  # labs() takes `y`, not `ylab`
  labs(y = "Percentage", title = "Resident Satisfaction",
       subtitle = "Percentage of residents by accommodation type") +
  scale_fill_manual(values = wes_palette("Moonrise3")) +
  clean_up +
  guides(fill = guide_legend(title = "Satisfaction"))
p
Tower residents have the highest percentage of people who are highly satisfied with their present housing circumstances. They are also less likely than residents of the other housing types to report low satisfaction.
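The proportions behind the stacked bars can also be checked numerically. The sketch below is not part of the original analysis; it assumes `housing` is the MASS data set and uses base R's `xtabs()` and `prop.table()`. The object names `counts_by_type` and `sat_by_type` are illustrative.

```r
library(MASS)  # assumption: `housing` is the copy shipped with MASS

# Cross-tabulate the counts: rows = housing type, columns = satisfaction level
counts_by_type <- xtabs(Freq ~ Type + Sat, data = housing)

# Convert to row proportions so that each housing type sums to 1
sat_by_type <- prop.table(counts_by_type, margin = 1)

round(sat_by_type, 3)
```

Each row of the result corresponds to one bar in the chart, so the `High` column can be compared across housing types directly.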
# Plot the percentage of residents at each satisfaction level
# by degree of influence
p1 <- ggplot(housing) +
  geom_col(aes(x = 1, y = Freq, fill = Sat), position = "fill") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = wes_palette("FantasticFox1"))
p1 <- p1 +
  facet_wrap(~Infl) +
  theme_bw() +
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank()) +
  guides(fill = guide_legend(title = "Satisfaction")) +
  labs(title = "Resident Satisfaction",
       subtitle = "Percentage of residents by degree of influence")
p1
The more influence householders feel they have over the management of the property, the more likely they are to report high satisfaction with their housing circumstances.
# Plot the number of residents at each satisfaction level
# by type of accommodation and by contact with other residents
# (in the MASS data set, Cont records contact with other residents)
# weight = Freq sums the counts over the influence levels; with
# stat = "identity" and position = "dodge" those bars would overlap instead
p2 <- ggplot(housing, aes(x = Sat, weight = Freq, fill = Cont)) +
  geom_bar(position = "dodge") +
  labs(y = "Number of Residents", x = "Satisfaction", title = "Resident Satisfaction",
       subtitle = "Number of residents by accommodation type and contact with other residents") +
  scale_fill_manual(values = wes_palette("Royal1")) +
  facet_wrap(~Type) +
  clean_up +
  guides(fill = guide_legend(title = "Contact"))
p2
When grouped by level of contact with other residents, terrace residents display the opposite pattern of satisfaction to tower, apartment and atrium residents. Specifically, when contact with other residents is high, fewer terrace residents report high satisfaction with their housing circumstances than report low satisfaction.
# Total number of residents for each combination of influence and housing type
housing2 <- housing %>%
  group_by(Infl, Type) %>%
  summarise(Freq = sum(Freq))
ggplot(housing2, aes(Freq, Infl)) +
  geom_point(aes(color = Type)) +
  facet_grid(Type ~ ., scales = "fixed", space = "fixed") +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0), legend.position = "none") +
  labs(y = "Influence", x = "Number of Residents", title = "Resident Influence",
       subtitle = "Number of residents by accommodation type")
Among tower and apartment residents, the largest group report a medium degree of influence over the management of the property, whereas among atrium and terrace residents the largest group report low influence.
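The most common influence level for each housing type can also be read off a cross-tabulation instead of the plot. A small sketch in base R, assuming `housing` is the MASS data set; `counts_by_infl` and `modal_infl` are illustrative names that do not appear in the original analysis:

```r
library(MASS)  # assumption: `housing` is the copy shipped with MASS

# Rows = housing type, columns = perceived influence, cells = number of residents
counts_by_infl <- xtabs(Freq ~ Type + Infl, data = housing)

# For each housing type, pick the influence level with the largest count
modal_infl <- apply(counts_by_infl, 1, function(row) names(which.max(row)))
modal_infl
```

The result is a named character vector with one entry per housing type, which can be compared against the position of the right-most point in each facet.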
# Total number of residents for each combination of influence and contact
# (in the MASS data set, Cont records contact with other residents)
housing3 <- housing %>%
  group_by(Infl, Cont) %>%
  summarise(Freq = sum(Freq))
ggplot(housing3, aes(Freq, Infl)) +
  geom_point(aes(color = Cont)) +
  facet_grid(Cont ~ ., scales = "fixed", space = "fixed") +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0), legend.position = "none") +
  labs(y = "Influence", x = "Number of Residents", title = "Resident Influence",
       subtitle = "Number of residents by contact with other residents")
When contact with other residents is low, the largest group of householders report a medium degree of influence over the management of the property. When contact is high, however, the largest group report low influence. In addition, the gap between the number of residents reporting low influence and the number reporting high influence is much larger when contact is high. Householders with a lot of contact with other residents are therefore more likely to feel that they have little say in the management of the property.
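The gap between the low-influence and high-influence groups can be computed directly for each level of `Cont`. This is a sketch under the assumption that `housing` is the MASS data set; `counts_by_cont` and `influence_gap` are illustrative names.

```r
library(MASS)  # assumption: `housing` is the copy shipped with MASS

# Rows = level of Cont, columns = perceived influence, cells = number of residents
counts_by_cont <- xtabs(Freq ~ Cont + Infl, data = housing)

# Residents reporting low influence minus residents reporting high influence,
# computed separately for each level of Cont
influence_gap <- counts_by_cont[, "Low"] - counts_by_cont[, "High"]
influence_gap
```

A larger value for the `High` level of `Cont` than for the `Low` level corresponds to the wider spread of points in the lower facet of the plot.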