Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. So far, we have discussed a number of visualization approaches to exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. So, take your time to look and look again, adjusting and refining your analyses each time.

Load Packages

library(ggplot2)
library(dplyr)
library(Hmisc)
library(ggthemes)

Load Data

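# Read the SPSS file, converting value labels to factor levels
gss <- spss.get("gss00_10allcases.sav", use.value.labels = TRUE)
# Keep only the 2010 survey year
gss10 <- subset(gss, year == 2010)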

Exploratory data analysis

Figure 1 Gender and Level of education

plot(gss10$sex, gss10$degree, main = "Gender and Level of education", xlab="Gender", ylab="Education", ann = FALSE, yaxt = "n", col=terrain.colors(8))
legend("right", cex=0.5, title="Level of education", c("LT High school", "High school", "Junior college", "Bachelor", "Graduate"), fill=terrain.colors(8))

I made a stacked bar (spine) plot to show the distribution of education level by gender. The distributions look very similar for males and females.

Figure 2 Confidence in Banks and Financial Institutions by Sex

finan <- gss10 %>%
  select(sex,confinan) %>%
  filter(!is.na(sex)) %>%
  filter(!is.na(confinan)) %>%
  group_by(sex,confinan) %>%
  summarise(count=n())
finan
## # A tibble: 6 x 3
## # Groups:   sex [?]
##      sex     confinan count
##   <fctr>       <fctr> <int>
## 1   MALE A GREAT DEAL    54
## 2   MALE    ONLY SOME   278
## 3   MALE   HARDLY ANY   262
## 4 FEMALE A GREAT DEAL    90
## 5 FEMALE    ONLY SOME   378
## 6 FEMALE   HARDLY ANY   302
finan <- finan %>%
  group_by(sex) %>%
  mutate(percentBF = 100*(count/sum(count)))
finan
## # A tibble: 6 x 4
## # Groups:   sex [2]
##      sex     confinan count percentBF
##   <fctr>       <fctr> <int>     <dbl>
## 1   MALE A GREAT DEAL    54  9.090909
## 2   MALE    ONLY SOME   278 46.801347
## 3   MALE   HARDLY ANY   262 44.107744
## 4 FEMALE A GREAT DEAL    90 11.688312
## 5 FEMALE    ONLY SOME   378 49.090909
## 6 FEMALE   HARDLY ANY   302 39.220779
g <-ggplot(finan, aes(x=sex, y=percentBF, fill=confinan))+
 geom_bar(stat="identity")
plot(g)

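To compare confidence in banks and financial institutions by sex, I created a new variable called percentBF, the percentage of each sex giving each response. Females are slightly more confident in banks and financial institutions: 11.7% of females report a great deal of confidence versus 9.1% of males, and 39.2% report hardly any versus 44.1% of males.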
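As a rough cross-check, the percentages can be recomputed in base R directly from the counts printed in the table above. The chi-square test at the end is an added step that the original analysis does not include; it is only one possible way to ask whether the gap between the sexes is larger than chance would suggest. The object names (counts, pct, indep_test) are chosen for this sketch.

```r
# Counts copied from the summarised table printed above (GSS 2010)
counts <- matrix(
  c(54, 278, 262,
    90, 378, 302),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("MALE", "FEMALE"),
    confinan = c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY")
  )
)

# Row percentages: each sex sums to 100, which reproduces percentBF
pct <- 100 * prop.table(counts, margin = 1)
print(round(pct, 2))

# Pearson chi-square test of independence between sex and confinan
# (an added check, not part of the original analysis)
indep_test <- chisq.test(counts)
print(indep_test)
```

Working with row percentages rather than raw counts matters here because the sample contains more females (770) than males (594).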

Figure 3 Sex vs Favourability of Sex Education

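# Keep the two variables of interest, then drop incomplete rows and unused levels
data <- gss10 %>%
  select(sex, sexeduc)
dataset <- na.omit(data)
dataset <- droplevels(dataset)
# Pass counts rather than proportions: shade = TRUE colors tiles by residuals
# from an independence model, which are only meaningful for counts
mosaicplot(table(dataset), shade = TRUE, main = 'Sex vs Favourability of Sex Education')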

We can tell from the mosaic plot that there is almost no difference by sex in favourability of sex education.
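For readers unfamiliar with shaded mosaic plots, the shading is based on Pearson residuals from an independence model. The sketch below computes those residuals by hand. It uses R's built-in UCBAdmissions table as a stand-in, not the GSS data, and the names tab, expected, and pearson are chosen for this example.

```r
# Gender x Admit counts from the built-in UCBAdmissions table
# (a stand-in for illustration; not the GSS data)
tab <- margin.table(UCBAdmissions, c(2, 1))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals: the quantity that mosaicplot(shade = TRUE) bins into colors
pearson <- (tab - expected) / sqrt(expected)
print(round(pearson, 2))

# Shaded mosaic plot of the same table; it takes counts, not proportions
mosaicplot(tab, shade = TRUE, main = "Gender vs Admission (UCBAdmissions)")
```

Residuals close to zero leave the tiles unshaded, which is how a plot with almost no association between the two variables would look.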

Figure 4 Density of Party ID vs Religion

ggplot(data=gss10, aes(x=relig,fill=partyid,color=partyid))+
  geom_density(alpha=0.2)+
  labs(x="Religion", y="Density", title="Party ID VS Religion")+
  theme_economist()

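The chart shows the density of party identification (partyid) across religions. Because relig is a categorical variable, the smoothed curves are only a rough guide to where each party group's responses concentrate.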
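Since both variables are categorical, a proportion-stacked bar chart is one alternative that may be easier to read than density curves. The sketch below uses a small made-up data frame (toy, with placeholder category labels), not the GSS data; with the real data, gss10 would presumably take its place.

```r
library(ggplot2)

# Made-up example data with placeholder labels (NOT the GSS data)
toy <- data.frame(
  relig   = c("A", "A", "A", "B", "B", "C"),
  partyid = c("X", "Y", "X", "Y", "Y", "X")
)

# position = "fill" rescales every bar to height 1, so the colored
# segments show the share of each party ID within a religion
p <- ggplot(toy, aes(x = relig, fill = partyid)) +
  geom_bar(position = "fill") +
  labs(x = "Religion", y = "Share of respondents", fill = "Party ID")
print(p)
```

Each bar then reads as a within-religion breakdown, independent of how many respondents belong to that religion.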

Figure 5 Pie Chart of Respondents' Regions

ggplot(data=gss10, aes(x=factor(1), fill=region))+
  geom_bar(width = 1)+
  coord_polar("y")+
  ggtitle("Region")

The pie chart shows the share of respondents from each region, which gives a rough check on whether the sample is spread across regions.

Figure 6 Prayer Frequency by Party ID

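# Count respondents in each party ID / prayer frequency combination
ggplot(data = gss10, aes(x = partyid, color = pray)) +
  geom_point(stat = "count") +
  ggtitle("Prayer Frequency by Party ID") +
  xlab("Party ID") +
  ylab("Number of respondents") +
  # The points are mapped to color, so the color scale sets the legend title
  scale_color_discrete(name = "pray frequency")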

This visualization shows how often people with different party identifications pray. Democrats and Independents appear to pray more often than respondents with other party IDs, although the plot shows raw counts, so part of the difference may reflect group size.

Summary

Data visualization is a helpful and powerful part of exploratory data analysis: it shows the analyst the distributions and trends in the original data. With its help, patterns that are hard to see in the raw data become much easier to notice.
