The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. So far, we have discussed a number of visualization approaches to exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. So, take your time to look and look again, adjusting and refining your analyses each time.
library(ggplot2)
library(dplyr)
library(Hmisc)
library(ggthemes)
gss <- spss.get("gss00_10allcases.sav", use.value.labels=TRUE)
gss10<-subset(gss, year==2010)
plot(gss10$sex, gss10$degree, main = "Gender and Level of education", xlab="Gender", ylab="Education", ann = FALSE, yaxt = "n", col=terrain.colors(8))
legend("right", cex=0.5, title="Level of education", c("LT High school", "High school", "Junior college", "Bachelor", "Graduate"), fill=terrain.colors(8))
I made a stacked bar plot to show the distribution of education level by gender. There appears to be little difference in the distribution of education level between males and females.
finan <- gss10 %>%
select(sex,confinan) %>%
filter(!is.na(sex)) %>%
filter(!is.na(confinan)) %>%
group_by(sex,confinan) %>%
summarise(count=n())
finan
## # A tibble: 6 x 3
## # Groups:   sex [?]
##      sex     confinan count
##   <fctr>       <fctr> <int>
## 1   MALE A GREAT DEAL    54
## 2   MALE    ONLY SOME   278
## 3   MALE   HARDLY ANY   262
## 4 FEMALE A GREAT DEAL    90
## 5 FEMALE    ONLY SOME   378
## 6 FEMALE   HARDLY ANY   302
finan <- finan %>%
group_by(sex) %>%
mutate(percentBF = 100*(count/sum(count)))
finan
## # A tibble: 6 x 4
## # Groups:   sex [2]
##      sex     confinan count percentBF
##   <fctr>       <fctr> <int>     <dbl>
## 1   MALE A GREAT DEAL    54  9.090909
## 2   MALE    ONLY SOME   278 46.801347
## 3   MALE   HARDLY ANY   262 44.107744
## 4 FEMALE A GREAT DEAL    90 11.688312
## 5 FEMALE    ONLY SOME   378 49.090909
## 6 FEMALE   HARDLY ANY   302 39.220779
g <-ggplot(finan, aes(x=sex, y=percentBF, fill=confinan))+
geom_bar(stat="identity")
plot(g)
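In order to compare confidence in banks and financial institutions by sex, I created a new variable called percentBF, the percentage of respondents of each sex who gave each answer. Females are slightly more confident in banks and financial institutions: 11.7% of females report a great deal of confidence compared with 9.1% of males, and 39.2% of females report hardly any compared with 44.1% of males.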
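The within-sex percentages can be cross-checked in base R. The following is a minimal sketch, not part of the original analysis: it re-enters the six counts from the printed summary table by hand, and the names `finan_counts`, `tab`, and `row_pct` are introduced here only for illustration.

```r
# Counts copied by hand from the printed summary table (sex x confinan)
finan_counts <- data.frame(
  sex      = rep(c("MALE", "FEMALE"), each = 3),
  confinan = rep(c("A GREAT DEAL", "ONLY SOME", "HARDLY ANY"), times = 2),
  count    = c(54, 278, 262, 90, 378, 302)
)

# Build a sex-by-confinan contingency table from the counts
tab <- xtabs(count ~ sex + confinan, data = finan_counts)

# margin = 1 divides each row by its row total, so each sex sums to 100
row_pct <- 100 * prop.table(tab, margin = 1)
round(row_pct, 1)
```

Normalising within each sex, rather than over the whole table, is what makes the two groups comparable even though more females than males answered the question.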
data <- gss10 %>%
  select(sex, sexeduc)
dataset <- na.omit(data)
dataset <- droplevels(dataset)
mosaicplot(prop.table(table(dataset),1), shade = TRUE, color = TRUE, legend = TRUE, main = 'Sex vs Favourability of Sex Education')
We can tell from the mosaic plot that there is almost no difference by sex in favourability of sex education.
ggplot(data=gss10, aes(x=relig,fill=partyid,color=partyid))+
geom_density(alpha=0.2)+
labs(x="Religion", y="Density", title="Party ID VS Religion")+
theme_economist()
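The chart shows the density of party identification (partyid) across religions. Because relig is a categorical variable, the curves are best read as a rough picture of how each party group is spread over the religion categories, not as a continuous density.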
ggplot(data=gss10, aes(x=factor(1), fill=region))+
geom_bar(width = 1)+
coord_polar("y")+
ggtitle("Region")
The pie chart shows the share of respondents from each region, which lets us check how the sample is spread across regions.
ggplot(data=gss10,aes(x=partyid,color=pray))+
geom_point(stat="count")+
ggtitle("Prayer Frequency by Party ID")+
xlab("Party ID")+
ylab("Number of respondents")+
scale_color_discrete(name="Prayer frequency")
This visualization shows how often people with different party identifications pray. Democrats and Independents appear to pray more than respondents with other party IDs, although the plot shows raw counts, so the size of each group also affects the height of its points.
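A comparison like this can be checked by converting counts to within-group shares, since raw counts also reflect how many respondents each group has. The sketch below uses a small table of hypothetical counts, invented purely for illustration and not taken from the GSS; the names `toy` and `share` are likewise assumptions.

```r
# Hypothetical counts, invented for illustration only (NOT GSS results)
toy <- matrix(
  c(30, 10,
    60, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(party = c("Party A", "Party B"),
                  pray  = c("Daily", "Never"))
)

# Raw counts: Party B has more daily prayers simply because it is larger
toy[, "Daily"]

# Within-party shares: each row is divided by its own total
share <- prop.table(toy, margin = 1)
share[, "Daily"]
```

In this made-up table the raw counts differ by a factor of two while the daily-prayer share is identical in both groups, which is why shares are the safer basis for saying one group prays more.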
The data visualization step of exploratory data analysis is very helpful and powerful for showing the analyst the distributions and trends in the original data. With the help of data visualization, we can uncover information that is hidden in the raw data.