Categorical Data Visualization

Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set with categorical variables. This assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and which patterns in it are interesting.
For this week, we will be exploring the Copenhagen Housing Conditions Survey:
- Sat - satisfaction of householders with their present housing circumstances
- Infl - perceived degree of influence householders have on the management of the property
- Type - type of rental accommodation
- Cont - contact residents are afforded with other residents
- Freq - Frequencies: the numbers of residents in each class
Your task for this assignment is to use ggplot and the facet_grid and facet_wrap functions to explore the Copenhagen Housing Conditions Survey. Your objective is to develop 5 report-quality visualizations (at least 4 of the visualizations should use the facet_wrap or facet_grid functions) and identify interesting patterns and trends within the data that may be indicative of large-scale trends. For each visualization, write a few sentences about the trends and patterns that you discover from it.
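The task names both facet_wrap and facet_grid. As a minimal sketch of how the two differ, assuming the survey data is the `housing` data frame shipped with the MASS package and using a bar layer chosen purely for illustration, facet_wrap wraps the panels of one variable into rows and columns, while facet_grid crosses two variables as rows and columns:

```r
library(ggplot2)
data(housing, package = "MASS")  # assumption: the survey data is MASS::housing

# Base plot shared by both layouts. geom_col() stacks the cells that share a
# satisfaction level, so each bar height is the total for that panel.
base <- ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_col() +
  labs(x = "Satisfaction", y = "Number of Residents")

# facet_wrap(): one faceting variable, panels wrapped into a 2-column layout
p_wrap <- base + facet_wrap(~ Type, ncol = 2)

# facet_grid(): two faceting variables, rows = Cont, columns = Type
p_grid <- base + facet_grid(Cont ~ Type, labeller = label_both)

print(p_wrap)
print(p_grid)
```

facet_wrap is usually the better fit for a single variable with many levels; facet_grid is the natural choice when two categorical variables should be compared along shared axes.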

To submit this homework, create the document in RStudio using the knitr package (button included in RStudio) and then publish the document to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.

Look at the data

data(housing, package = "MASS")  # the survey data ships with the MASS package
str(housing)

## 'data.frame':    72 obs. of  5 variables:
##  $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
##  $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  21 21 28 34 22 36 10 11 36 61 ...

1. First plot

library(ggplot2)

# One point per survey cell (each Infl x Cont combination), faceted by housing type
ggplot(housing, aes(x = Sat, y = Freq)) +
  geom_point() +
  facet_grid(. ~ Type) +
  labs(x = 'Satisfaction', y = 'Number of Residents',
       title = 'Number of Residents in Each Housing Type Measured by Satisfaction')

#Summary - This plot shows the number of residents in each housing type by satisfaction level. In apartments, the largest number of residents report high satisfaction, while in Atrium housing the smallest number report low satisfaction. The gap between the number of most satisfied and least satisfied residents is smallest for Atrium.
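Because the scatter plot draws one point per survey cell rather than one total per satisfaction level, the reading above is easier to check against aggregated counts. A minimal sketch, assuming the data is MASS::housing (the names `totals` and `gap` are illustrative only):

```r
data(housing, package = "MASS")  # assumption: the survey data is MASS::housing

# Sum the cell counts over influence and contact: one total per Type x Sat
totals <- xtabs(Freq ~ Type + Sat, data = housing)
print(totals)

# Gap between the largest and smallest satisfaction total within each type
gap <- apply(totals, 1, function(row) max(row) - min(row))
print(gap)
```

The housing type with the smallest value in `gap` is the one where satisfaction levels are most evenly represented.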

2. Second plot

library(dplyr)
library(ggpubr)  # provides ggballoonplot()

# Total residents for each influence x type x contact combination
p2 <- housing %>%
  group_by(Infl, Type, Cont) %>%
  summarise(Freq = sum(Freq)) %>%
  ungroup()

# Relabel factor levels so the axes read clearly
levels(p2$Infl) <- list("Low Influence" = "Low", "Medium Influence" = "Medium", "High Influence" = "High")
levels(p2$Cont) <- list("Low Contact" = "Low", "High Contact" = "High")

ggballoonplot(p2, x = "Infl", y = "Cont", size = "Freq", facet.by = "Type",
              fill = "Freq", ggtheme = theme_bw()) +
  labs(x = 'Influence', y = 'Contact',
       title = 'Number of Residents in Each Housing Type by Influence & Contact') +
  scale_fill_viridis_c(option = "B")

#Summary - This plot shows the number of residents in each housing type by influence and contact. The smallest group is Atrium residents with high influence but low contact. Most residents in the study lived in apartments, while the fewest lived in Atrium housing. Apartments also had more residents with both high influence and high contact than any other housing type.

3. Third plot

# Relabel satisfaction levels for the facet strips
p3 <- housing
levels(p3$Sat) <- list("Low Satisfaction" = "Low", "Med Satisfaction" = "Medium", "High Satisfaction" = "High")

ggplot(p3, aes(x = Freq)) +
  geom_histogram(binwidth = 4, colour = "black") +
  facet_grid(Infl ~ Sat) +
  labs(x = "Number of Residents", y = "Count",
       title = 'Residents Distribution Based on Satisfaction and Influence')

#Summary - This plot shows a grid of histograms of the number of residents per survey cell, arranged by satisfaction and influence. One cell with more than 75 residents, the maximum, appears in the medium influence, high satisfaction panel. In the high influence row, no cell with low or medium satisfaction has more than 30 residents.
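Histogram bins can hide exact values, so the two thresholds mentioned above (75 and 30) can be checked directly on the data. A minimal sketch, assuming the data is MASS::housing (the names `large_cells` and `high_infl` are illustrative only):

```r
data(housing, package = "MASS")  # assumption: the survey data is MASS::housing

# Survey cells with more than 75 residents
large_cells <- subset(housing, Freq > 75)
print(large_cells)

# Largest cell among high-influence households with low or medium satisfaction
high_infl <- subset(housing, Infl == "High" & Sat %in% c("Low", "Medium"))
print(max(high_infl$Freq))
```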

4. Fourth plot

# Total residents for each satisfaction x type x contact combination
p4 <- housing %>%
  group_by(Sat, Type, Cont) %>%
  summarise(Freq = sum(Freq))

ggplot(p4, aes(Sat, Cont)) +
  geom_point(aes(size = Freq, color = Freq)) +
  facet_grid(Type ~ .) +
  theme_light() +
  guides(
    size = guide_legend(title = 'Number of Residents'),
    color = guide_legend(title = 'Number of Residents')) +
  labs(x = 'Satisfaction',
       y = 'Contact',
       title = 'Number of Residents in Each Housing Type by Satisfaction & Contact')

#Summary - This plot shows the satisfaction of residents by housing type and contact. Apartments have the highest number of residents with high satisfaction and high contact, while Atrium and Terrace have the lowest numbers of residents with medium satisfaction and low contact.
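Raw counts favor the housing types with the most respondents, so shares within each group can help when comparing types. A minimal sketch, assuming the data is MASS::housing (the names `counts` and `shares` are illustrative only):

```r
data(housing, package = "MASS")  # assumption: the survey data is MASS::housing

# Resident counts for every Type x Cont x Sat combination
counts <- xtabs(Freq ~ Type + Cont + Sat, data = housing)

# Share of each satisfaction level within every Type x Cont group
shares <- prop.table(counts, margin = c(1, 2))
print(ftable(round(shares, 2)))
```

Each row of the printed table sums to roughly 1, so satisfaction profiles can be compared across housing types of different sizes.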

5. Fifth plot

# Total residents for each contact x type combination
p5 <- housing %>%
  group_by(Cont, Type) %>%
  summarise(Freq = sum(Freq))

ggplot(p5, aes(Freq, Cont)) +
  geom_point(aes(color = Type)) +
  facet_grid(Type ~ ., scales = 'fixed', space = 'fixed') +
  theme_light() +
  theme(strip.text.y = element_text(angle = 0),
        legend.position = 'none') +
  labs(x = 'Number of Residents',
       y = "Contact",
       title = 'Number of Residents in Each Housing Type by Contact')

#Summary - This plot shows the number of residents in each housing type by level of contact. More than 400 residents in Apartments have high contact, while fewer than 100 residents in each of Atrium and Terrace have low contact.

Categorical Data Visualization - Problem Set 6

Anjana Ananthraman

2020-03-24