The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data contains and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop five report-quality visualizations (at least four of them should use the facet_wrap or facet_grid functions) and to identify interesting patterns and trends within the data that may be indicative of larger-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
To submit this homework, create the document in RStudio using the knitr package (the Knit button in RStudio), then publish it to your RPubs account. Once it is uploaded, submit the link to the document on Canvas. Please make sure that the link is clickable and that I can see both the visualizations and the code required to create them.
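library(MASS)     # provides the housing data set
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # ggplot(), facet_grid(), facet_wrap()
str(housing)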
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
The plot shows the number of residents at each satisfaction level, faceted by the four property types and colored by the perceived degree of influence that householders have on the management of the property. In the Apartment panel, the tallest bar belongs to highly satisfied residents with medium influence.
ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_bar(aes(fill = Infl), stat = "identity", color = "white", position = position_dodge(0.9)) +
  facet_grid(~ Type)
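Each combination of Sat, Infl and Type appears twice in housing (once per Cont level), so dodged bars drawn with stat = "identity" are overlaid rather than summed, and the tallest bar in a panel may change once the counts are totalled. The following is a minimal sketch, assuming MASS, dplyr and ggplot2 are installed, that totals Freq across Cont before plotting. The names housing_by_infl and p_infl are illustrative and not part of the original analysis.

```r
library(MASS)     # provides the housing data set
library(dplyr)
library(ggplot2)

# Total the counts over Cont so each Type/Infl/Sat cell has exactly one row
housing_by_infl <- housing %>%
  group_by(Type, Infl, Sat) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

# Same layout as the plot above, but each bar is now a single total
p_infl <- ggplot(housing_by_infl, aes(x = Sat, y = Freq, fill = Infl)) +
  geom_col(color = "white", position = position_dodge(0.9)) +
  facet_grid(~ Type)
p_infl
```

The same pattern applies to the Cont plot: grouping by Type, Cont and Sat gives one bar per cell there as well.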
This plot shows the number of residents at each level of satisfaction with their present housing circumstances (Sat), by the four types of rental property (Type) and by the amount of contact residents have with other residents (Cont: low, high). When contact is high, the number of terrace residents decreases as satisfaction rises, while the number of tower residents increases with satisfaction.
ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_bar(aes(fill = Cont), stat = "identity", color = "white", position = position_dodge(0.9)) +
  facet_wrap(~ Type)
This plot shows the number of residents at each level of satisfaction with their present housing circumstances (Sat), by the four types of rental property (Type).
Overall, the number of terrace residents decreases as satisfaction rises, while the number of tower residents increases with satisfaction.
housing2 <- housing %>%
  group_by(Type, Sat) %>%
  summarise(Freq = sum(Freq))
## `summarise()` regrouping output by 'Type' (override with `.groups` argument)
ggplot(housing2, aes(Sat, Freq)) +
  geom_point(aes(color = Type)) +
  facet_grid(Type ~ ., scales = "free", space = "free")
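The summarise() message printed above says that the result is still grouped by Type and that this can be overridden with the .groups argument. As a small sketch (the name housing2_flat is illustrative), passing .groups = "drop" returns an ungrouped table and silences the message without changing the totals:

```r
library(MASS)     # provides the housing data set
library(dplyr)

# Same totals as housing2, but returned without any residual grouping
housing2_flat <- housing %>%
  group_by(Type, Sat) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

housing2_flat
```

Dropping the grouping is optional for plotting, but it avoids surprises if the summary table is later passed to further dplyr verbs.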
The plot shows the number of residents at each satisfaction level, by the perceived degree of influence that householders have on the management of the property.
It indicates that satisfaction increases with influence: residents with the lowest influence most often report low satisfaction, and residents with the highest influence most often report high satisfaction.
housing3 <- housing %>%
  group_by(Infl, Sat) %>%
  summarise(Freq = sum(Freq))
## `summarise()` regrouping output by 'Infl' (override with `.groups` argument)
ggplot(housing3, aes(Sat, Freq)) +
  geom_point(aes(color = Infl)) +
  facet_grid(Infl ~ ., scales = "free", space = "free")
The plot suggests that, for the Apartment type, satisfaction does not depend on the amount of contact with other residents (Cont): in both contact groups and at every satisfaction level, the highest point belongs to the Apartment type.
ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_point(aes(color = Type)) +
  facet_grid(Cont ~ .)
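Raw counts partly reflect how many households of each type were surveyed, which makes it hard to compare types directly. As a hedged sketch, not part of the original analysis, the counts can be converted to shares within each Type and Cont combination before plotting. The names housing_share, Share and p_share are illustrative.

```r
library(MASS)     # provides the housing data set
library(dplyr)
library(ggplot2)

# Share of each satisfaction level within every Type/Cont combination
housing_share <- housing %>%
  group_by(Type, Cont, Sat) %>%
  summarise(Freq = sum(Freq), .groups = "drop") %>%
  group_by(Type, Cont) %>%
  mutate(Share = Freq / sum(Freq)) %>%
  ungroup()

# Same facets as the plot above, with shares instead of raw counts
p_share <- ggplot(housing_share, aes(x = Sat, y = Share)) +
  geom_point(aes(color = Type)) +
  facet_grid(Cont ~ .)
p_share
```

With shares on the y axis, each type's three points sum to 1 within a panel, so differences between panels reflect the satisfaction mix rather than the sample size.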