The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey, available in R as the housing data set in the MASS package.
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of larger-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
To submit this homework, create the document in RStudio using the knitr package (the Knit button is included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.
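library(MASS)     # provides the housing data set
library(ggplot2)  # provides ggplot, facet_wrap and facet_grid
str(housing)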
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
#head(housing)
# place code for vis here
ggplot(housing, aes(x=Type, y=Freq)) +
  geom_col() +
  facet_wrap(~Infl) +
  labs(x = 'Type of Rental Accommodation',
       y = 'Number of residents',
       title = 'Resident count in each rental type, faceted by influence level')
#This plot shows the three influence levels (Low, Medium and High) side by side and the number of residents in each rental type (Tower, Apartment, Atrium and Terrace) within each level. Apartments have the most residents at every influence level, and the apartment bar is tallest in the Medium panel, so apartment residents most often report a medium degree of influence over property management.
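The reading of the bar heights can be checked numerically. The sketch below is one possible check, not part of the original analysis: it cross-tabulates the resident counts by rental type and influence level, which are the totals that geom_col stacks into each bar. The names type_by_infl and top_infl are introduced here for illustration only, and the data are read directly from MASS::housing.

```r
# Total residents per rental type and influence level
# (the same sums that geom_col() stacks into each bar).
type_by_infl <- xtabs(Freq ~ Type + Infl, data = MASS::housing)
type_by_infl

# For each rental type, the influence level with the largest total.
top_infl <- colnames(type_by_infl)[apply(type_by_infl, 1, which.max)]
names(top_infl) <- rownames(type_by_infl)
top_infl
```

Comparing the rows of the table against the plot panels is a quick way to confirm which bar is tallest before writing up the pattern.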
ggplot(housing, aes(x=Type, y=Freq)) + geom_col() + facet_grid(Cont ~ Infl) + labs(title = 'Residents count in each rental type categorized by influence & Contact levels')
#This plot extends the first one by also splitting the residents by the degree of contact they are afforded with other residents (Cont). With this new dimension, the tallest bar belongs to apartment residents with high contact and medium influence over property management.
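With twelve bars per row of panels, the tallest bar is easier to identify from a sorted table than by eye. A minimal sketch, assuming the data are taken from MASS::housing (the name bars is hypothetical):

```r
# One row per bar in the facet_grid(Cont ~ Infl) plot:
# residents summed over the satisfaction levels.
bars <- aggregate(Freq ~ Type + Infl + Cont, data = MASS::housing, FUN = sum)

# The three tallest bars, largest first.
head(bars[order(-bars$Freq), ], 3)
```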
# place code for vis here
levels(housing$Sat)[levels(housing$Sat)=="Low"] <- "Low-Satisfaction"
levels(housing$Sat)[levels(housing$Sat)=="Medium"] <- "Med-Satisfaction"
levels(housing$Sat)[levels(housing$Sat)=="High"] <- "High-Satisfaction"
ggplot(housing, aes(x=Freq)) + geom_histogram(binwidth=4,colour="black") + facet_grid(Infl ~ Sat) + labs(title = 'Residents distribution among the range of Satisfaction and Influence')
#This plot is a matrix of histograms of the cell counts (Freq), with satisfaction in the columns and influence in the rows. Each histogram bar counts cells (combinations of rental type and contact level), not residents. The largest cell count (more than 75 residents) appears in the medium-influence, high-satisfaction panel. Interestingly, in the low-influence, high-satisfaction panel every cell count is below 50, and the counts are spread over several bins.
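Because every row of housing is one cell of the survey table, each histogram panel is built from only a handful of values. The sketch below, an illustration rather than part of the original analysis, counts the cells behind each panel and finds the largest cell count per panel. It reads MASS::housing directly, so the satisfaction levels keep their original names (Low, Medium, High); cells_per_panel and max_per_panel are hypothetical names.

```r
cells <- MASS::housing

# Number of rows (cells) that feed each Infl x Sat histogram panel.
cells_per_panel <- xtabs(~ Infl + Sat, data = cells)
cells_per_panel

# Largest single cell count in each panel.
max_per_panel <- tapply(cells$Freq,
                        list(Infl = cells$Infl, Sat = cells$Sat),
                        max)
max_per_panel
```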
# place code for vis here
ggplot(housing,aes(x=Sat,y=Freq,fill=Cont)) + geom_bar(stat="identity",width=0.7,position="fill")+labs(y="Proportion of residents",title ="Stacked Plot of Resident Satisfaction by contact levels") + coord_flip()
#This plot compares contact levels across satisfaction levels by showing the share of residents in each combination. High-contact residents are the majority at all three satisfaction levels, making up more than 50 percent of each bar.
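The shares drawn by position="fill" can be reproduced as a table of row proportions. A minimal sketch, assuming the data are read from MASS::housing with its original satisfaction level names (sat_by_cont and shares are names introduced here):

```r
# Residents by satisfaction level and contact level.
sat_by_cont <- xtabs(Freq ~ Sat + Cont, data = MASS::housing)

# Share of low- and high-contact residents within each satisfaction level;
# each row sums to 1, matching one filled bar in the plot.
shares <- prop.table(sat_by_cont, margin = 1)
round(shares, 2)
```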
# place code for vis here
ggplot(housing, aes(x=Cont,y=Freq)) + geom_point()+facet_grid(~Type)
#This plot shows the distribution of resident counts at low and high contact levels within each rental type. The “Apartment” rental type has the widest range of counts at both contact levels, and the “Atrium” rental type has the narrowest.
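The spread of the points in each panel can be summarized as the difference between the largest and smallest count. The following is a sketch under the assumption that "range" means max minus min of Freq within each rental type and contact level; freq_spread is a hypothetical name.

```r
# Spread (max - min) of the cell counts for each rental type
# and contact level, i.e. the vertical extent of each point column.
freq_spread <- aggregate(Freq ~ Type + Cont, data = MASS::housing,
                         FUN = function(x) diff(range(x)))
freq_spread[order(freq_spread$Cont, -freq_spread$Freq), ]
```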