The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop 5 report-quality visualizations (at least 4 of the visualizations should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, you need to write a few sentences about the trends and patterns that you discover from the visualization.
To submit this homework, you will create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, you will submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ----------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.3
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
str(housing)
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
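The str() output above shows that housing is already tabulated: 72 rows, one per combination of Sat, Infl, Type and Cont, with Freq holding the count. A minimal sketch of what that implies for plotting, assuming dplyr is available: any chart that leaves out one of the four factors needs Freq summed over that factor first. The name sat_infl_type is hypothetical and used only for this illustration.

```r
library(MASS)   # provides the housing data frame
library(dplyr)

# Collapse the two Cont levels so that each Sat x Infl x Type cell
# appears exactly once, with Freq holding the combined count.
# sat_infl_type is a hypothetical name for this sketch.
sat_infl_type <- housing %>%
  group_by(Sat, Infl, Type) %>%
  summarize(Freq = sum(Freq)) %>%
  ungroup()

# 3 satisfaction levels x 3 influence levels x 4 housing types
nrow(sat_infl_type)

# The aggregation should not lose any respondents.
sum(sat_infl_type$Freq) == sum(housing$Freq)
```

Plotting the aggregated table instead of housing itself keeps bars or points that share the same position from hiding one another.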
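ggplot(housing)+
  #stat='summary' with fun=sum adds up the low- and high-contact rows behind each bar; stat='identity' would draw them on top of each other
  geom_bar(aes(x=Sat, y=Freq, fill=Infl), stat='summary', fun=sum, width=0.5, position="dodge")+
  labs(title="Resident satisfaction by rental type & degree of perceived influence")+
  xlab("Satisfaction Level")+
  ylab("Number of residents")+
  theme_dark()+
  facet_wrap(~Type)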
#This bar plot shows that terrace residents report low satisfaction with their living conditions more often than medium or high satisfaction. I was surprised to see that although there is a high number of apartment residents who are highly satisfied with their living conditions, there is also a high number who are not satisfied and who have a low degree of perceived influence.
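#Sum Freq over satisfaction and influence so that each point is a total (house2 is a helper name added here)
house2=housing%>%
  group_by(Cont,Type)%>%
  summarize(Freq=sum(Freq))
ggplot(house2, aes(Cont,Type))+
  geom_point(aes(size = Freq))+
  theme_light()+
  facet_wrap(Type~.)+
  labs(x="Afforded contact",
       y="Type of residence",
       title="Number of residents by afforded contact")
#After summing over satisfaction and influence, more apartment residents have high afforded contact with other residents than low contact.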
house3=housing%>%
  group_by(Cont,Type,Sat)%>%
  summarize(Freq=sum(Freq))
ggplot(house3, aes(x= Cont, y = Freq, color = Sat))+
  geom_point()+
  facet_wrap(~Type)+
  theme_gray()+
  labs(x='Afforded Contact',
       y='Number of Residents')
#This plot shows that the largest group of apartment residents is highly satisfied with their housing conditions and also has a high degree of contact with other residents.
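To put numbers behind a comparison like this one, the counts in each panel can be converted to shares of that housing type. This is only a sketch, assuming dplyr is loaded; type_shares and Share are hypothetical names that do not appear in the analysis above.

```r
library(MASS)   # provides the housing data frame
library(dplyr)

# Total residents per Type x Cont x Sat cell, then each cell's share
# of all residents living in that housing type.
# type_shares and Share are hypothetical names for this sketch.
type_shares <- housing %>%
  group_by(Type, Cont, Sat) %>%
  summarize(Freq = sum(Freq)) %>%
  group_by(Type) %>%
  mutate(Share = Freq / sum(Freq)) %>%
  ungroup()

# Apartment cells, ordered from largest to smallest share.
type_shares %>%
  filter(Type == "Apartment") %>%
  arrange(desc(Share))
```

A share above 0.5 would support calling a cell the majority; anything lower is better described as the largest group.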
ggplot(housing)+
  #fill=Sat so the bars match the title; stat='summary' with fun=sum adds up the three influence rows behind each bar
  geom_bar(aes(x=Cont, y=Freq, fill=Sat), stat='summary', fun=sum, width=0.6, position="dodge")+
  labs(title="Relationship between residents' contact & satisfaction level")+
  xlab("Degree of contact")+
  ylab("Number of residents")+
  theme_light()+
  facet_grid(~Type)
#This bar graph shows that a higher number of apartment residents have a high degree of contact with other residents and are also highly satisfied with their housing circumstances.
house4=housing%>%
  group_by(Infl,Type,Cont)%>%
  summarize(Freq=sum(Freq))
ggplot(house4, aes(x=Infl,y=Freq))+
  geom_point(aes(col=Cont))+
  theme_gray()+
  facet_grid(Type~.)+
  labs(x='Degree of Influence',
       y='Number of Residents',
       title='Degree of contact by housing type & number of residents')