STAT 545A Homework 6

In this project, we will use adult data set which is available from this UCI Machine Learning Repository. The data is already being transformed and cleaned. However, we need to perform further cleanings in order to demonstrate the results that we have obtained. You can replicate my analysis following the README manual from Github.

Transforming the `adult` data

For a particular project, we usually end up analyzing a collection of data, transforming it and storing different forms of information about the data. It is convenient to introduce encapsulation which packages all of our data into a container that is used at a later time and thus protecting our raw data. In this project, we use R's environment concept and give it a name as below:

main.en <- new.env()

After that, data is placed into the container and we can access the data using $ notations. However, to operate on variables within an environment without using the $ notation we can use evalq(). For example, let us load our data in main.en container and assign the variables with suitable names as follows:

require(package = xtable)

## Loading required package: xtable

evalq({
    adult.data <- read.table("adult_raw.txt")
    names(adult.data) <- c("age", "workclass", "fnlwgt", "education", "educationNum", 
        "maritalStatus", "occupation", "relationship", "race", "sex", "capitalGain", 
        "capitalLoss", "hoursPerWeek", "nativeCountry", "grossIncome")
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)

	age	workclass	fnlwgt	education	educationNum	maritalStatus	occupation	relationship	race	sex	capitalGain	capitalLoss	hoursPerWeek	nativeCountry	grossIncome
1	39,	State-gov,	77516,	Bachelors,	13,	Never-married,	Adm-clerical,	Not-in-family,	White,	Male,	2174,	0,	40,	United-States,	< =50K
2	50,	Self-emp-not-inc,	83311,	Bachelors,	13,	Married-civ-spouse,	Exec-managerial,	Husband,	White,	Male,	0,	0,	13,	United-States,	< =50K
3	38,	Private,	215646,	HS-grad,	9,	Divorced,	Handlers-cleaners,	Not-in-family,	White,	Male,	0,	0,	40,	United-States,	< =50K
4	53,	Private,	234721,	11th,	7,	Married-civ-spouse,	Handlers-cleaners,	Husband,	Black,	Male,	0,	0,	40,	United-States,	< =50K
5	28,	Private,	338409,	Bachelors,	13,	Married-civ-spouse,	Prof-specialty,	Wife,	Black,	Female,	0,	0,	40,	Cuba,	< =50K
6	37,	Private,	284582,	Masters,	14,	Married-civ-spouse,	Exec-managerial,	Wife,	White,	Female,	0,	0,	40,	United-States,	< =50K

We remove , from each value in the data and change the mode of the following variables to numerics: age, fnlwgt, educationNum, capitalGain, capitalLoss, and hoursPerWeek.

evalq({
    adult.data$age <- as.numeric(gsub(",", "", adult.data$age))
    adult.data$workclass <- gsub(",", "", adult.data$workclass)
    adult.data$fnlwgt <- as.numeric(gsub(",", "", adult.data$fnlwgt))
    adult.data$education <- gsub(",", "", adult.data$education)
    adult.data$educationNum <- as.numeric(gsub(",", "", adult.data$educationNum))
    adult.data$maritalStatus <- gsub(",", "", adult.data$maritalStatus)
    adult.data$occupation <- gsub(",", "", adult.data$occupation)
    adult.data$relationship <- gsub(",", "", adult.data$relationship)
    adult.data$race <- gsub(",", "", adult.data$race)
    adult.data$sex <- gsub(",", "", adult.data$sex)
    adult.data$capitalGain <- as.numeric(gsub(",", "", adult.data$capitalGain))
    adult.data$capitalLoss <- as.numeric(gsub(",", "", adult.data$capitalLoss))
    adult.data$hoursPerWeek <- as.numeric(gsub(",", "", adult.data$hoursPerWeek))
    adult.data$nativeCountry <- gsub(",", "", adult.data$nativeCountry)
    adult.data$grossIncome <- gsub(",", "", adult.data$grossIncome)
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)

	age	workclass	fnlwgt	education	educationNum	maritalStatus	occupation	relationship	race	sex	capitalGain	hoursPerWeek	nativeCountry	grossIncome
1	39.00	State-gov	77516.00	Bachelors	13.00	Never-married	Adm-clerical	Not-in-family	White	Male	2174.00	40.00	United-States	< =50K
2	50.00	Self-emp-not-inc	83311.00	Bachelors	13.00	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.00	13.00	United-States	< =50K
3	38.00	Private	215646.00	HS-grad	9.00	Divorced	Handlers-cleaners	Not-in-family	White	Male	0.00	40.00	United-States	< =50K
4	53.00	Private	234721.00	11th	7.00	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0.00	40.00	United-States	< =50K
5	28.00	Private	338409.00	Bachelors	13.00	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0.00	40.00	Cuba	< =50K
6	37.00	Private	284582.00	Masters	14.00	Married-civ-spouse	Exec-managerial	Wife	White	Female	0.00	40.00	United-States	< =50K

Our data is already being cleaned and the missing values are filled with either ?, for categorial values, or 0, for numerical data. However, we treate any ? as missing values and remove them from the data.

evalq({
    adult.data$workclass <- gsub("[?]", NA, adult.data$workclass)
    adult.data$education <- gsub("[?]", NA, adult.data$education)
    adult.data$maritalStatus <- gsub("[?]", NA, adult.data$maritalStatus)
    adult.data$occupation <- gsub("[?]", NA, adult.data$occupation)
    adult.data$relationship <- gsub("[?]", NA, adult.data$relationship)
    adult.data$race <- gsub("[?]", NA, adult.data$race)
    adult.data$sex <- gsub("[?]", NA, adult.data$sex)
    adult.data$nativeCountry <- gsub("[?]", NA, adult.data$nativeCountry)
    adult.data <- na.omit(adult.data)
}, main.en)

Now, let us explore the transformed data to discover various insightful knowledge such as dimensions, the variables names, its boundaries, its characteristics…etc.

evalq({
    head(adult.data)
    tail(adult.data)
    dim(adult.data)
    names(adult.data)
    str(adult.data)
    summary(adult.data)
}, main.en)

To obtain more detailed summary of the numerical data, we apply the basicStats() function from fBasics package.

require(fBasics)

## Loading required package: fBasics
## Loading required package: MASS
## Loading required package: timeDate
## 
## Attaching package: 'timeDate'
## 
## The following object is masked from 'package:xtable':
## 
##     align
## 
## Loading required package: timeSeries
## 
## Attaching package: 'fBasics'
## 
## The following object is masked from 'package:base':
## 
##     norm

evalq({
    print(xtable(basicStats(adult.data$age)), type = "html", include.colnames = FALSE, 
        include.rownames = TRUE)
}, main.en)

nobs	30162.00
NAs	0.00
Minimum	17.00
Maximum	90.00
1. Quartile	28.00
3. Quartile	47.00
Mean	38.44
Median	37.00
Sum	1159364.00
SE Mean	0.08
LCL Mean	38.29
UCL Mean	38.59
Variance	172.52
Stdev	13.13
Skewness	0.53
Kurtosis	-0.15

In analyzing data, it is very useful to transform continuous numeric variables into a specific set of meaningful categoric values for subsequent analysis especially to reduce the effects of minor observation errors. This operation is called binning. Lets apply binning to our age variable by first recording the range of this continous data.

evalq({
    print(range(adult.data$age))
}, main.en)

[1] 17 90

We observe that the range of age in our dataset is 17 and maximum is 90. Therefore, we apply binning to age variable to categorize it into four distinct intervals that are chosen subjectively as follows:

evalq({
    adult.data$agedesc = ifelse(test = adult.data[, "age"] <= 22, yes = "youth", 
        no = ifelse(test = adult.data[, "age"] > 22 & adult.data[, "age"] <= 
            35, yes = "adult", no = ifelse(test = adult.data[, "age"] > 35 & 
            adult.data[, "age"] <= 55, yes = "adultToOld", no = "old")))
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)

	age	workclass	fnlwgt	education	educationNum	maritalStatus	occupation	relationship	race	sex	capitalGain	hoursPerWeek	nativeCountry	grossIncome	agedesc
1	39.00	State-gov	77516.00	Bachelors	13.00	Never-married	Adm-clerical	Not-in-family	White	Male	2174.00	40.00	United-States	< =50K	adultToOld
2	50.00	Self-emp-not-inc	83311.00	Bachelors	13.00	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.00	13.00	United-States	< =50K	adultToOld
3	38.00	Private	215646.00	HS-grad	9.00	Divorced	Handlers-cleaners	Not-in-family	White	Male	0.00	40.00	United-States	< =50K	adultToOld
4	53.00	Private	234721.00	11th	7.00	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0.00	40.00	United-States	< =50K	adultToOld
5	28.00	Private	338409.00	Bachelors	13.00	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0.00	40.00	Cuba	< =50K	adult
6	37.00	Private	284582.00	Masters	14.00	Married-civ-spouse	Exec-managerial	Wife	White	Female	0.00	40.00	United-States	< =50K	adultToOld

We keep certain variables in our data and remove the redundant ones as follows:

evalq({
    adult.data <- data.frame(adult.data[, c("age", "workclass", "education", 
        "educationNum", "maritalStatus", "occupation", "relationship", "race", 
        "sex", "nativeCountry")])
}, main.en)

Plotting simple figures using `ggplot2`

We create one additional container to hold data for ploting purposes and to perform other applications on our data without altering the main data. Doing so we demonstrate the inheritance structure in our coding scheme.

plot.en <- new.env(parent = main.en)

We require the following packages.

require(plyr)

## Loading required package: plyr

require(ggplot2)

## Loading required package: ggplot2

Let us plot our first figure which depicts the education level per gender

evalq({
    edulevelPergender <- ddply(.data = adult.data, .variables = ~educationNum + 
        sex, .fun = function(x) {
        count <- nrow(x)
        return(data.frame(count))
    })
    plotbar <- ggplot(adult.data, aes(x = factor(education), fill = factor(sex)))
    plotbar <- plotbar + ggtitle("Education level per gender")
    plotbar <- plotbar + geom_bar(position = "dodge") + coord_flip()
    plotbar <- plotbar + xlab("Education") + ylab("Number of observations")
    plotbar <- plotbar + scale_fill_discrete(name = "Gender")
    plotbar
}, envir = plot.en)

plot of chunk unnamed-chunk-13

We see that the above figure does not represent the ratio of each gender with the level of education in the dataset for that particular gender. Therefore, we use proportions of each gender in the dataset according to their level of education and re-plot the results.

evalq({
    nPerSex <- count(df = adult.data, ~sex)
    adult.data.modified <- ddply(adult.data, ~education + sex, summarise, proportion = length(education))
    adult.data.modified$proportion = ifelse(test = adult.data.modified$sex == 
        "Male", yes = adult.data.modified$proportion/nPerSex$freq[2], no = adult.data.modified$proportion/nPerSex$freq[1])
    plotbarprop <- ggplot(adult.data.modified, aes(x = factor(education), y = proportion, 
        fill = factor(sex)))
    plotbarprop <- plotbarprop + ggtitle("Education level per gender")
    plotbarprop <- plotbarprop + geom_bar(position = "dodge") + coord_flip()
    plotbarprop <- plotbarprop + xlab("Education") + ylab("Proportion")
    plotbarprop <- plotbarprop + scale_fill_discrete(name = "Gender")
    plotbarprop
}, envir = plot.en)

## Mapping a variable to y and also using stat="bin".
##   With stat="bin", it will attempt to set the y value to the count of cases in each group.
##   This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
##   If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
##   If you want y to represent values in the data, use stat="identity".
##   See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)

plot of chunk unnamed-chunk-14

Now, we observe similar proportions for each gender with the level of education. In fact, females are more educated in some levels such as at 11th grade. This piece of information did not reveal to us in our previous figure.

We use boxplot that provides a graphical overview of how the observations of a particular variable, age in our case, are distributed per gender.

evalq({
    plotbox <- ggplot(adult.data) + aes(x = factor(agedesc), y = age)
    plotbox <- plotbox + geom_boxplot(aes(fill = sex), outlier.colour = "red", 
        outlier.shape = 4)
    plotbox <- plotbox + ggtitle("Education level with Age per gender")
    plotbox <- plotbox + xlab("Education") + ylab("Age")
    plotbox <- plotbox + scale_fill_discrete(name = "Gender") + coord_flip()
    plotbox
}, envir = plot.en)

plot of chunk unnamed-chunk-15

Here we plot education level with specific age intervals.

evalq({
    p1 <- ggplot(adult.data) + aes(education, fill = factor(sex))
    p1 <- p1 + ggtitle("Education level with each Age interval per gender")
    p1 <- p1 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p1 <- p1 + scale_fill_discrete(name = "Gender") + coord_flip()
    p1 <- p1 + facet_wrap(~agedesc)
    p1
}, envir = plot.en)

plot of chunk unnamed-chunk-16

Plotting the number of educated people based on their ethnicity.

evalq({
    p2 <- ggplot(adult.data) + aes(education, fill = race)
    p2 <- p2 + ggtitle("Education level with Race")
    p2 <- p2 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p2 <- p2 + scale_fill_brewer("Race") + coord_flip()
    p2
}, envir = plot.en)

plot of chunk unnamed-chunk-17

Gradient colors are not a convenient solution in plotting the above figure. We plot the status of relationships with level of educations and we use palette option:

evalq({
    p3 <- ggplot(adult.data) + aes(education, fill = maritalStatus)
    p3 <- p3 + ggtitle("Education level with Marital Status")
    p3 <- p3 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p3 <- p3 + scale_fill_brewer("Marital Status", palette = "Dark2") + coord_flip()
    p3
}, envir = plot.en)

plot of chunk unnamed-chunk-18

Exploring the type of employment with level of educations:

evalq({
    p4 <- ggplot(adult.data) + aes(education, fill = workclass)
    p4 <- p4 + ggtitle("Education level with Employment type")
    p4 <- p4 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p4 <- p4 + scale_fill_brewer("Work Class", palette = "Dark2") + coord_flip()
    p4
}, envir = plot.en)

plot of chunk unnamed-chunk-19

gridExtra is an interesting package that puts several figures onto one panel. As an example:

require(gridExtra)

## Loading required package: gridExtra
## Loading required package: grid

evalq({
    grid.arrange(plotbox, p2, p3, p4, ncol = 2, main = "Plotting in one panel")
}, envir = plot.en)

plot of chunk unnamed-chunk-21

Plotting world map using `ggplot2`

We load the following packages.

require(ggmap)

## Loading required package: ggmap

require(mapproj)

## Loading required package: mapproj
## Loading required package: maps

We copy map data into plot.en container

evalq({
    map.world <- map_data(map = "world")
}, envir = plot.en)

Let us draw a map using ggplot2. We first note that the countries in adult data need to be mapped with regions in the map data. Therefore, several issues need to be handled properly:

some countries in our data does not exist in map data, for example Scotland. We merge these countries with their closest neighbours.
names of countries are different or abbreviated in both data, for example USA is used in map data and it is renamed to United-States in adult data.
countries in adult data are misspelled or are not complete, for example in our data there is a country named by south but we do not have enough information to guess whether it is the south korea or the south africa. Another example, the country Outlying-US(Guam-USVI-etc) does not have any equivalent with map data.

evalq({
    adult.data$nativeCountry <- gsub("(Outlying-US\\(Guam-USVI-etc\\))", NA, 
        adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("England", "UK", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Scotland", "UK", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Hong", "China", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Taiwan", "China", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Trinadad&Tobago", "Tobago", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("United-States", "USA", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Puerto-Rico", "Puerto Rico", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Holand-Netherlands", "Netherlands", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("El-Salvador", "El Salvador", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Dominican-Republic", "Dominican Republic", 
        adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Columbia", "Colombia", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("South", "South Korea", adult.data$nativeCountry)
    adult.data <- na.omit(adult.data)
}, plot.en)

After we address the above shortcomings, let's begin to plot our first world map.

evalq({
    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = region))
    plot <- plot + ggtitle("World, filled with regions")
    plot <- plot + geom_polygon() + theme(legend.position = "none")
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-25

Interestingly, we can also represent the filled countries that are only recorded in adult data as follows:

evalq({
    map.world$availableCountry <- ifelse(test = map.world$region %in% adult.data$nativeCountry, 
        yes = "yes", no = "no")

    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = availableCountry))
    plot <- plot + ggtitle("World, filled with interested countries")
    plot <- plot + geom_polygon() + theme(legend.position = "none")
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-26

Let us perform more fancy work by plotting the world map according to the median age per country.

evalq({
    tmp.df <- ddply(adult.data, ~nativeCountry, summarise, medianAge = median(age), 
        meanAge = mean(age))
    map.world <- merge(x = map.world, y = tmp.df, by.x = "region", by.y = "nativeCountry", 
        all = TRUE)
    map.world$medianAge[is.na(map.world$medianAge)] <- 0  # other countries median
    # are set to 0
    map.world$meanAge[is.na(map.world$meanAge)] <- 0  # other countries mean
    # are set to 0
    map.world <- arrange(map.world, order)  # We need to sort the map.world data 
    # according to the order variable for the purpose of plotting
    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = medianAge))
    plot <- plot + ggtitle("Median age for specific countries")
    plot <- plot + geom_polygon() + scale_fill_gradientn("Median Age", colours = rainbow(7), 
        limits = c(10, 60))
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-27

Plotting dendrogram using `ggplot2`

In our dataset, we identify some kind of relationships or correlations between the observations of continous variables. Let us explore the relationships between numerical variables using a tree-like structure known as a dendrogram. For this, we need to load the following packages:

require(grid)
require(ggdendro)

## Loading required package: ggdendro
## 
## Attaching package: 'ggdendro'
## 
## The following object is masked from 'package:xtable':
## 
##     label

We first identify the numerical variables and then perform correlation between them. Afterwards, we apply hclust function to the results in order to create an object with hierarchical structure that will be used to plot dendrogram.

evalq({
    numerics <- c(1, 5, 11:13)
    cr <- cor(x = adult.data[numerics], use = "pairwise", method = "pearson")
    hc <- hclust(dist(cr), method = "average")
    ddata <- dendro_data(hc)

    plot <- ggplot() + geom_segment(data = segment(ddata), aes_string(x = "x", 
        y = "y", xend = "xend", yend = "yend"), colour = "grey", lwd = 1)
    plot <- plot + geom_text(data = label(ddata), aes_string(x = "x", y = "y", 
        label = "label"), fontface = 3, colour = "tomato", cex = 3.5, hjust = 1, 
        vjust = 0, angle = 0)
    plot <- plot + scale_y_continuous(expand = c(0.2, 0))
    plot <- plot + labs(title = "Variable Correlation Clusters\nadult data using Pearson")
    plot <- plot + theme_bw() + theme(axis.ticks = element_blank(), axis.text.y = element_blank())
    plot <- plot + geom_hline(data = segment(ddata), xintercept = "x", colour = "orange") + 
        ylab(NULL) + xlab(NULL)
    plot <- plot + coord_flip()
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-29

The plot lists the variables in the left column. The variables are then linked together according to the strength of their relationships. The range of x-axis, (0, 1.5), gives an indication of the level of correlation between variables, with shorter heights indicating stronger correlations. We can observe that hoursPerWeek and educationNum are closely correlated. The variables age and capitalLoss have some higher level of correlation amongst themselves.

Although, the above correlation defines the relationships between the numerical data, this does not provide true information because much of our data are missing and replaced by 0. But to illustrate the correlation, we plot the relationship between hours per week with the education level ignoring the whether information is accurate or not.

evalq({
    plot <- ggplot(data = adult.data, aes(x = hoursPerWeek, y = educationNum, 
        colour = hoursPerWeek))
    plot <- plot + geom_point(alpha = 1/10)
    plot <- plot + ggtitle("Hours per Week vs Level of Education")
    plot <- plot + stat_smooth(method = "lm", se = FALSE, colour = "red", size = 1)
    plot <- plot + xlab("Education Level") + ylab("Hours per Week")
    plot <- plot + theme(legend.position = "none")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-30

Thank you Jenny for being the person who taught us how to appreciate the art of R language. We had a great time last six weeks.

STAT 545A Homework 6

Transforming the adult data

Plotting simple figures using ggplot2

Plotting world map using ggplot2

Plotting dendrogram using ggplot2

Transforming the `adult` data

Plotting simple figures using `ggplot2`

Plotting world map using `ggplot2`

Plotting dendrogram using `ggplot2`