Categorical Data Visualization

Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
- Sat - satisfaction of householders with their present housing circumstances
- Infl - perceived degree of influence householders have on the management of the property
- Type - type of rental accommodation
- Cont - contact residents are afforded with other residents
- Freq - Frequencies: the numbers of residents in each class
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop 5 report-quality visualizations (at least 4 of the visualizations should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.

To submit this homework, create the document in RStudio using the knitr package (a Knit button is included in RStudio), then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualizations and the code required to create them.

Look at the data

str(MASS::housing)  # housing ships with the MASS package, which is loaded below

## 'data.frame':    72 obs. of  5 variables:
##  $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
##  $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  21 21 28 34 22 36 10 11 36 61 ...
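
Each of the 72 rows in the output above is one combination of the four factors (3 x 3 x 4 x 2 levels), and Freq is the number of residents in that combination, not one resident per row. A minimal sketch of checking this, using only base R and the MASS copy of the data; the object names cell_count, total_residents and type_totals are illustrative and not part of the original analysis:

```r
library(MASS)  # provides the housing data set

# One row per combination of Sat, Infl, Type and Cont levels
cell_count <- nlevels(housing$Sat) * nlevels(housing$Infl) *
  nlevels(housing$Type) * nlevels(housing$Cont)
cell_count == nrow(housing)

# Freq is a count, so totals come from summing it, not from counting rows
total_residents <- sum(housing$Freq)
type_totals <- xtabs(Freq ~ Type, data = housing)
type_totals
```

The per-type totals returned by xtabs() should match the hard-coded totals used as denominators in the plots that follow.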

# libraries

library(MASS)
library(tidyverse)
library(ggplot2)

1. First plot

# Create dataset : Group satisfaction levels by property type

data1 = housing %>%
      group_by(Sat,Type) %>%
      summarise(Freq = sum(Freq))

# Add total freq count to dataset and calculate percentage

data1$total <- NA

tower_t   <- 400
apart_t   <- 765
atrium_t  <- 239
terrace_t <- 277

data1$total <- ifelse(data1$Type == 'Tower', tower_t,
               ifelse(data1$Type == 'Apartment', apart_t,
               ifelse(data1$Type == 'Atrium', atrium_t, terrace_t)))

data1$percentage <- data1$Freq / data1$total

# Plot the graph

ggplot(data1, aes(x = Sat, y = percentage)) +
          geom_bar(
                  aes(fill = Sat), stat = "identity", color = "white",
                  position = position_dodge(0.9)
              ) +
          facet_grid(. ~ Type) +
          # percentage holds proportions (0 to 1), so format the axis as percent
          scale_y_continuous(labels = scales::percent) +
          labs(y = "Percentage of Residents",
               x = "Satisfaction",
               title = "Resident Satisfaction",
               subtitle = "Graph depicting resident satisfaction rates for each rental property type") +
          scale_fill_discrete(name = "Satisfaction", labels = c("Low", "Medium", "High")) +
          theme_bw()

Towers have the highest satisfaction rate, with 50% of residents highly satisfied with the property, whereas Terrace has the lowest share (less than 30%) of highly satisfied residents.
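
The totals for each property type in this section are typed in by hand. As an alternative sketch, not the method used above, the same percentages can be derived from the data with a grouped mutate(), so the denominators cannot drift from the counts. The name sat_by_type is illustrative, and the .groups argument assumes dplyr 1.0.0 or later:

```r
library(MASS)   # provides the housing data set
library(dplyr)

# Sum counts per property type and satisfaction level, then divide each
# count by the total for its property type, computed from the data itself
sat_by_type <- housing %>%
  group_by(Type, Sat) %>%
  summarise(Freq = sum(Freq), .groups = "drop") %>%
  group_by(Type) %>%
  mutate(total = sum(Freq), percentage = Freq / total) %>%
  ungroup()

sat_by_type
```

The resulting data frame has the same columns as data1 (Sat, Type, Freq, total, percentage), so it could be passed to the same ggplot() call.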

2. Second plot

# Create dataset : Group contact levels by property type

data2 = housing %>%
      group_by(Cont,Type) %>%
      summarise(Freq = sum(Freq))

# Add total freq count to dataset and calculate percentage

data2$total <- NA

tower_t   <- 400
apart_t   <- 765
atrium_t  <- 239
terrace_t <- 277

data2$total <- ifelse(data2$Type == 'Tower', tower_t,
               ifelse(data2$Type == 'Apartment', apart_t,
               ifelse(data2$Type == 'Atrium', atrium_t, terrace_t)))

data2$percentage <- data2$Freq / data2$total

# Plot the graph

ggplot(data2, aes(x = Cont, y = percentage)) +
          geom_bar(
                  aes(fill = Cont), stat = "identity", color = "white",
                  position = position_dodge(0.9)
              ) +
          facet_wrap(~Type) +
          # percentage holds proportions (0 to 1), so format the axis as percent
          scale_y_continuous(labels = scales::percent) +
          labs(y = "Percentage of Residents",
               x = "Contact",
               title = "Resident Contact",
               subtitle = "Graph depicting contact residents are afforded with other residents for each property type") +
          scale_fill_discrete(name = "Contact", labels = c("Low", "High")) +
          theme_bw()

Of the four property types, Tower is the only one where a majority of residents report low contact with other residents. In the other three types, a majority of residents report high contact.

3. Third plot

# Create dataset : Group Influence by property type

data3 = housing %>%
      group_by(Infl,Type) %>%
      summarise(Freq = sum(Freq))

# Add total freq count to dataset and calculate percentage

data3$total <- NA

tower_t   <- 400
apart_t   <- 765
atrium_t  <- 239
terrace_t <- 277

data3$total <- ifelse(data3$Type == 'Tower', tower_t,
               ifelse(data3$Type == 'Apartment', apart_t,
               ifelse(data3$Type == 'Atrium', atrium_t, terrace_t)))

data3$percentage <- data3$Freq / data3$total

# Plot the graph

ggplot(data3, aes(x = Infl, y = percentage)) +
          geom_bar(
                  aes(fill = Infl), stat = "identity", color = "white",
                  position = position_dodge(0.9)
              ) +
          facet_wrap(~Type) +
          # percentage holds proportions (0 to 1), so format the axis as percent
          scale_y_continuous(labels = scales::percent) +
          labs(y = "Percentage of Residents",
               x = "Influence",
               title = "Resident Influence",
               subtitle = "Graph depicting resident perceived influence over management for each property type") +
          scale_fill_discrete(name = "Influence", labels = c("Low", "Medium", "High")) +
          theme_bw()

At all four property types, residents who felt they had high influence over property management were the smallest of the three groups. More than 40% of Terrace residents felt they had low influence.

4. Fourth plot

#  Create dataset 

data4 = housing %>%
  group_by(Infl,Sat,Type) %>%
  summarise(Freq = sum(Freq))

# Plot the graph 

ggplot(data4, aes(x = Sat, y = Freq)) +
   geom_point(
      aes(size = Freq), shape = 21, colour = "black", fill = "skyblue"
     ) +
   # fill is a constant here, so a fill scale has no effect; instead,
   # label the row facets with the variable name, e.g. "Infl: Low"
   facet_grid(Infl ~ Type, labeller = labeller(Infl = label_both)) +
   labs(y = "Number of Residents",
        x = "Satisfaction",
        title = "Residents' Satisfaction and Influence by Rental Property Type") +
   theme_bw()

Many apartment residents who reported high or medium influence also reported high satisfaction, and many who reported low satisfaction also felt they had low influence. This suggests that influence over management could be a factor in resident satisfaction.
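
One way to check this reading is to look at satisfaction shares within each influence level rather than raw counts, since the influence groups differ in size. A minimal sketch for the Apartment panel, using base R; apt, tab and sat_share are illustrative names:

```r
library(MASS)  # provides the housing data set

# Apartment residents only, cross-tabulated by influence and satisfaction
apt <- subset(housing, Type == "Apartment")
tab <- xtabs(Freq ~ Infl + Sat, data = apt)

# Row proportions: within each influence level, the share of residents
# at each satisfaction level (each row sums to 1)
sat_share <- prop.table(tab, margin = 1)
round(sat_share, 2)
```

If the pattern holds, the share in the High satisfaction column rises from the Low influence row to the High influence row.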

5. Fifth plot

# Create dataset : Group residents by property type

data5 = housing %>%
       group_by(Type) %>%
       summarise(Freq = sum(Freq))

# Plot the graph

ggplot(data5, aes(x = Type, y = Freq)) +
          geom_bar(
                  aes(fill = Type), stat = "identity", color = "white",
                  position = position_dodge(0.9)
              ) +
          labs(y = "Number of Residents",
               x = "Type",
               title = "Resident Count",
               subtitle = "Graph depicting number of residents for each property type") +
          # legend labels must follow the factor level order of Type
          scale_fill_discrete(name = "Type", labels = c("Tower", "Apartment", "Atrium", "Terrace")) +
          theme_bw()

Most of the residents surveyed lived in apartments, followed by towers, terraces and atriums, in that order.
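
The bars in the plot above follow the factor order of Type (Tower, Apartment, Atrium, Terrace), not the order described. A sketch of sorting the bars by count with reorder(), so the plot reads in the same order as the text; type_counts and p are illustrative names, and geom_col() is used here as the shorthand for geom_bar(stat = "identity"):

```r
library(MASS)     # provides the housing data set
library(dplyr)
library(ggplot2)

# Total residents per property type, with Type re-leveled so the
# largest count comes first
type_counts <- housing %>%
  group_by(Type) %>%
  summarise(Freq = sum(Freq)) %>%
  mutate(Type = reorder(Type, -Freq))

p <- ggplot(type_counts, aes(x = Type, y = Freq)) +
  geom_col(fill = "skyblue", color = "white") +
  labs(y = "Number of Residents",
       x = "Type",
       title = "Resident Count") +
  theme_bw()
p
```

A single fill colour is used because the x axis already names each property type, which makes a legend unnecessary.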

Categorical Data Visualization - Problem Set 6

Abhilasha Vyas

2020-03-27