STAT 545A Homework 6

In this project, we will use adult data set which is available from this UCI Machine Learning Repository. The data is already being transformed and cleaned. However, we need to perform further cleanings in order to demonstrate the results that we have obtained. You can replicate my analysis following the README manual from Github.

Transforming the adult data

For a particular project, we usually end up analyzing a collection of data, transforming it and storing different forms of information about the data. It is convenient to introduce encapsulation which packages all of our data into a container that is used at a later time and thus protecting our raw data. In this project, we use R's environment concept and give it a name as below:

main.en <- new.env()

After that, data is placed into the container and we can access the data using $ notations. However, to operate on variables within an environment without using the $ notation we can use evalq(). For example, let us load our data in main.en container and assign the variables with suitable names as follows:

require(package = xtable)
## Loading required package: xtable
evalq({
    adult.data <- read.table("adult_raw.txt")
    names(adult.data) <- c("age", "workclass", "fnlwgt", "education", "educationNum", 
        "maritalStatus", "occupation", "relationship", "race", "sex", "capitalGain", 
        "capitalLoss", "hoursPerWeek", "nativeCountry", "grossIncome")
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
age workclass fnlwgt education educationNum maritalStatus occupation relationship race sex capitalGain capitalLoss hoursPerWeek nativeCountry grossIncome
1 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, < =50K
2 50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, < =50K
3 38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, < =50K
4 53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, < =50K
5 28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, < =50K
6 37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, < =50K

We remove , from each value in the data and change the mode of the following variables to numerics: age, fnlwgt, educationNum, capitalGain, capitalLoss, and hoursPerWeek.

evalq({
    adult.data$age <- as.numeric(gsub(",", "", adult.data$age))
    adult.data$workclass <- gsub(",", "", adult.data$workclass)
    adult.data$fnlwgt <- as.numeric(gsub(",", "", adult.data$fnlwgt))
    adult.data$education <- gsub(",", "", adult.data$education)
    adult.data$educationNum <- as.numeric(gsub(",", "", adult.data$educationNum))
    adult.data$maritalStatus <- gsub(",", "", adult.data$maritalStatus)
    adult.data$occupation <- gsub(",", "", adult.data$occupation)
    adult.data$relationship <- gsub(",", "", adult.data$relationship)
    adult.data$race <- gsub(",", "", adult.data$race)
    adult.data$sex <- gsub(",", "", adult.data$sex)
    adult.data$capitalGain <- as.numeric(gsub(",", "", adult.data$capitalGain))
    adult.data$capitalLoss <- as.numeric(gsub(",", "", adult.data$capitalLoss))
    adult.data$hoursPerWeek <- as.numeric(gsub(",", "", adult.data$hoursPerWeek))
    adult.data$nativeCountry <- gsub(",", "", adult.data$nativeCountry)
    adult.data$grossIncome <- gsub(",", "", adult.data$grossIncome)
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
age workclass fnlwgt education educationNum maritalStatus occupation relationship race sex capitalGain capitalLoss hoursPerWeek nativeCountry grossIncome
1 39.00 State-gov 77516.00 Bachelors 13.00 Never-married Adm-clerical Not-in-family White Male 2174.00 0.00 40.00 United-States < =50K
2 50.00 Self-emp-not-inc 83311.00 Bachelors 13.00 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 13.00 United-States < =50K
3 38.00 Private 215646.00 HS-grad 9.00 Divorced Handlers-cleaners Not-in-family White Male 0.00 0.00 40.00 United-States < =50K
4 53.00 Private 234721.00 11th 7.00 Married-civ-spouse Handlers-cleaners Husband Black Male 0.00 0.00 40.00 United-States < =50K
5 28.00 Private 338409.00 Bachelors 13.00 Married-civ-spouse Prof-specialty Wife Black Female 0.00 0.00 40.00 Cuba < =50K
6 37.00 Private 284582.00 Masters 14.00 Married-civ-spouse Exec-managerial Wife White Female 0.00 0.00 40.00 United-States < =50K

Our data is already being cleaned and the missing values are filled with either ?, for categorial values, or 0, for numerical data. However, we treate any ? as missing values and remove them from the data.

evalq({
    adult.data$workclass <- gsub("[?]", NA, adult.data$workclass)
    adult.data$education <- gsub("[?]", NA, adult.data$education)
    adult.data$maritalStatus <- gsub("[?]", NA, adult.data$maritalStatus)
    adult.data$occupation <- gsub("[?]", NA, adult.data$occupation)
    adult.data$relationship <- gsub("[?]", NA, adult.data$relationship)
    adult.data$race <- gsub("[?]", NA, adult.data$race)
    adult.data$sex <- gsub("[?]", NA, adult.data$sex)
    adult.data$nativeCountry <- gsub("[?]", NA, adult.data$nativeCountry)
    adult.data <- na.omit(adult.data)
}, main.en)

Now, let us explore the transformed data to discover various insightful knowledge such as dimensions, the variables names, its boundaries, its characteristics…etc.

evalq({
    head(adult.data)
    tail(adult.data)
    dim(adult.data)
    names(adult.data)
    str(adult.data)
    summary(adult.data)
}, main.en)

To obtain more detailed summary of the numerical data, we apply the basicStats() function from fBasics package.

require(fBasics)
## Loading required package: fBasics
## Loading required package: MASS
## Loading required package: timeDate
## 
## Attaching package: 'timeDate'
## 
## The following object is masked from 'package:xtable':
## 
##     align
## 
## Loading required package: timeSeries
## 
## Attaching package: 'fBasics'
## 
## The following object is masked from 'package:base':
## 
##     norm
evalq({
    print(xtable(basicStats(adult.data$age)), type = "html", include.colnames = FALSE, 
        include.rownames = TRUE)
}, main.en)
nobs 30162.00
NAs 0.00
Minimum 17.00
Maximum 90.00
1. Quartile 28.00
3. Quartile 47.00
Mean 38.44
Median 37.00
Sum 1159364.00
SE Mean 0.08
LCL Mean 38.29
UCL Mean 38.59
Variance 172.52
Stdev 13.13
Skewness 0.53
Kurtosis -0.15

In analyzing data, it is very useful to transform continuous numeric variables into a specific set of meaningful categoric values for subsequent analysis especially to reduce the effects of minor observation errors. This operation is called binning. Lets apply binning to our age variable by first recording the range of this continous data.

evalq({
    print(range(adult.data$age))
}, main.en)

[1] 17 90

We observe that the range of age in our dataset is 17 and maximum is 90. Therefore, we apply binning to age variable to categorize it into four distinct intervals that are chosen subjectively as follows:

evalq({
    adult.data$agedesc = ifelse(test = adult.data[, "age"] <= 22, yes = "youth", 
        no = ifelse(test = adult.data[, "age"] > 22 & adult.data[, "age"] <= 
            35, yes = "adult", no = ifelse(test = adult.data[, "age"] > 35 & 
            adult.data[, "age"] <= 55, yes = "adultToOld", no = "old")))
    print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
age workclass fnlwgt education educationNum maritalStatus occupation relationship race sex capitalGain capitalLoss hoursPerWeek nativeCountry grossIncome agedesc
1 39.00 State-gov 77516.00 Bachelors 13.00 Never-married Adm-clerical Not-in-family White Male 2174.00 0.00 40.00 United-States < =50K adultToOld
2 50.00 Self-emp-not-inc 83311.00 Bachelors 13.00 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 13.00 United-States < =50K adultToOld
3 38.00 Private 215646.00 HS-grad 9.00 Divorced Handlers-cleaners Not-in-family White Male 0.00 0.00 40.00 United-States < =50K adultToOld
4 53.00 Private 234721.00 11th 7.00 Married-civ-spouse Handlers-cleaners Husband Black Male 0.00 0.00 40.00 United-States < =50K adultToOld
5 28.00 Private 338409.00 Bachelors 13.00 Married-civ-spouse Prof-specialty Wife Black Female 0.00 0.00 40.00 Cuba < =50K adult
6 37.00 Private 284582.00 Masters 14.00 Married-civ-spouse Exec-managerial Wife White Female 0.00 0.00 40.00 United-States < =50K adultToOld

We keep certain variables in our data and remove the redundant ones as follows:

evalq({
    adult.data <- data.frame(adult.data[, c("age", "workclass", "education", 
        "educationNum", "maritalStatus", "occupation", "relationship", "race", 
        "sex", "nativeCountry")])
}, main.en)

Plotting simple figures using ggplot2

We create one additional container to hold data for ploting purposes and to perform other applications on our data without altering the main data. Doing so we demonstrate the inheritance structure in our coding scheme.

plot.en <- new.env(parent = main.en)

We require the following packages.

require(plyr)
## Loading required package: plyr
require(ggplot2)
## Loading required package: ggplot2

Let us plot our first figure which depicts the education level per gender

evalq({
    edulevelPergender <- ddply(.data = adult.data, .variables = ~educationNum + 
        sex, .fun = function(x) {
        count <- nrow(x)
        return(data.frame(count))
    })
    plotbar <- ggplot(adult.data, aes(x = factor(education), fill = factor(sex)))
    plotbar <- plotbar + ggtitle("Education level per gender")
    plotbar <- plotbar + geom_bar(position = "dodge") + coord_flip()
    plotbar <- plotbar + xlab("Education") + ylab("Number of observations")
    plotbar <- plotbar + scale_fill_discrete(name = "Gender")
    plotbar
}, envir = plot.en)

plot of chunk unnamed-chunk-13

We see that the above figure does not represent the ratio of each gender with the level of education in the dataset for that particular gender. Therefore, we use proportions of each gender in the dataset according to their level of education and re-plot the results.

evalq({
    nPerSex <- count(df = adult.data, ~sex)
    adult.data.modified <- ddply(adult.data, ~education + sex, summarise, proportion = length(education))
    adult.data.modified$proportion = ifelse(test = adult.data.modified$sex == 
        "Male", yes = adult.data.modified$proportion/nPerSex$freq[2], no = adult.data.modified$proportion/nPerSex$freq[1])
    plotbarprop <- ggplot(adult.data.modified, aes(x = factor(education), y = proportion, 
        fill = factor(sex)))
    plotbarprop <- plotbarprop + ggtitle("Education level per gender")
    plotbarprop <- plotbarprop + geom_bar(position = "dodge") + coord_flip()
    plotbarprop <- plotbarprop + xlab("Education") + ylab("Proportion")
    plotbarprop <- plotbarprop + scale_fill_discrete(name = "Gender")
    plotbarprop
}, envir = plot.en)
## Mapping a variable to y and also using stat="bin".
##   With stat="bin", it will attempt to set the y value to the count of cases in each group.
##   This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
##   If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
##   If you want y to represent values in the data, use stat="identity".
##   See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)

plot of chunk unnamed-chunk-14

Now, we observe similar proportions for each gender with the level of education. In fact, females are more educated in some levels such as at 11th grade. This piece of information did not reveal to us in our previous figure.

We use boxplot that provides a graphical overview of how the observations of a particular variable, age in our case, are distributed per gender.

evalq({
    plotbox <- ggplot(adult.data) + aes(x = factor(agedesc), y = age)
    plotbox <- plotbox + geom_boxplot(aes(fill = sex), outlier.colour = "red", 
        outlier.shape = 4)
    plotbox <- plotbox + ggtitle("Education level with Age per gender")
    plotbox <- plotbox + xlab("Education") + ylab("Age")
    plotbox <- plotbox + scale_fill_discrete(name = "Gender") + coord_flip()
    plotbox
}, envir = plot.en)

plot of chunk unnamed-chunk-15

Here we plot education level with specific age intervals.

evalq({
    p1 <- ggplot(adult.data) + aes(education, fill = factor(sex))
    p1 <- p1 + ggtitle("Education level with each Age interval per gender")
    p1 <- p1 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p1 <- p1 + scale_fill_discrete(name = "Gender") + coord_flip()
    p1 <- p1 + facet_wrap(~agedesc)
    p1
}, envir = plot.en)

plot of chunk unnamed-chunk-16

Plotting the number of educated people based on their ethnicity.

evalq({
    p2 <- ggplot(adult.data) + aes(education, fill = race)
    p2 <- p2 + ggtitle("Education level with Race")
    p2 <- p2 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p2 <- p2 + scale_fill_brewer("Race") + coord_flip()
    p2
}, envir = plot.en)

plot of chunk unnamed-chunk-17

Gradient colors are not a convenient solution in plotting the above figure. We plot the status of relationships with level of educations and we use palette option:

evalq({
    p3 <- ggplot(adult.data) + aes(education, fill = maritalStatus)
    p3 <- p3 + ggtitle("Education level with Marital Status")
    p3 <- p3 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p3 <- p3 + scale_fill_brewer("Marital Status", palette = "Dark2") + coord_flip()
    p3
}, envir = plot.en)

plot of chunk unnamed-chunk-18

Exploring the type of employment with level of educations:

evalq({
    p4 <- ggplot(adult.data) + aes(education, fill = workclass)
    p4 <- p4 + ggtitle("Education level with Employment type")
    p4 <- p4 + geom_bar() + xlab("Education") + ylab("Number of observations")
    p4 <- p4 + scale_fill_brewer("Work Class", palette = "Dark2") + coord_flip()
    p4
}, envir = plot.en)

plot of chunk unnamed-chunk-19

gridExtra is an interesting package that puts several figures onto one panel. As an example:

require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
evalq({
    grid.arrange(plotbox, p2, p3, p4, ncol = 2, main = "Plotting in one panel")
}, envir = plot.en)

plot of chunk unnamed-chunk-21

Plotting world map using ggplot2

We load the following packages.

require(ggmap)
## Loading required package: ggmap
require(mapproj)
## Loading required package: mapproj
## Loading required package: maps

We copy map data into plot.en container

evalq({
    map.world <- map_data(map = "world")
}, envir = plot.en)

Let us draw a map using ggplot2. We first note that the countries in adult data need to be mapped with regions in the map data. Therefore, several issues need to be handled properly:

evalq({
    adult.data$nativeCountry <- gsub("(Outlying-US\\(Guam-USVI-etc\\))", NA, 
        adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("England", "UK", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Scotland", "UK", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Hong", "China", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Taiwan", "China", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Trinadad&Tobago", "Tobago", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("United-States", "USA", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Puerto-Rico", "Puerto Rico", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Holand-Netherlands", "Netherlands", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("El-Salvador", "El Salvador", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Dominican-Republic", "Dominican Republic", 
        adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("Columbia", "Colombia", adult.data$nativeCountry)
    adult.data$nativeCountry <- gsub("South", "South Korea", adult.data$nativeCountry)
    adult.data <- na.omit(adult.data)
}, plot.en)

After we address the above shortcomings, let's begin to plot our first world map.

evalq({
    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = region))
    plot <- plot + ggtitle("World, filled with regions")
    plot <- plot + geom_polygon() + theme(legend.position = "none")
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-25

Interestingly, we can also represent the filled countries that are only recorded in adult data as follows:

evalq({
    map.world$availableCountry <- ifelse(test = map.world$region %in% adult.data$nativeCountry, 
        yes = "yes", no = "no")

    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = availableCountry))
    plot <- plot + ggtitle("World, filled with interested countries")
    plot <- plot + geom_polygon() + theme(legend.position = "none")
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-26

Let us perform more fancy work by plotting the world map according to the median age per country.

evalq({
    tmp.df <- ddply(adult.data, ~nativeCountry, summarise, medianAge = median(age), 
        meanAge = mean(age))
    map.world <- merge(x = map.world, y = tmp.df, by.x = "region", by.y = "nativeCountry", 
        all = TRUE)
    map.world$medianAge[is.na(map.world$medianAge)] <- 0  # other countries median
    # are set to 0
    map.world$meanAge[is.na(map.world$meanAge)] <- 0  # other countries mean
    # are set to 0
    map.world <- arrange(map.world, order)  # We need to sort the map.world data 
    # according to the order variable for the purpose of plotting
    plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = medianAge))
    plot <- plot + ggtitle("Median age for specific countries")
    plot <- plot + geom_polygon() + scale_fill_gradientn("Median Age", colours = rainbow(7), 
        limits = c(10, 60))
    plot <- plot + xlab("Longitude") + ylab("Latitude")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-27

Plotting dendrogram using ggplot2

In our dataset, we identify some kind of relationships or correlations between the observations of continous variables. Let us explore the relationships between numerical variables using a tree-like structure known as a dendrogram. For this, we need to load the following packages:

require(grid)
require(ggdendro)
## Loading required package: ggdendro
## 
## Attaching package: 'ggdendro'
## 
## The following object is masked from 'package:xtable':
## 
##     label

We first identify the numerical variables and then perform correlation between them. Afterwards, we apply hclust function to the results in order to create an object with hierarchical structure that will be used to plot dendrogram.

evalq({
    numerics <- c(1, 5, 11:13)
    cr <- cor(x = adult.data[numerics], use = "pairwise", method = "pearson")
    hc <- hclust(dist(cr), method = "average")
    ddata <- dendro_data(hc)

    plot <- ggplot() + geom_segment(data = segment(ddata), aes_string(x = "x", 
        y = "y", xend = "xend", yend = "yend"), colour = "grey", lwd = 1)
    plot <- plot + geom_text(data = label(ddata), aes_string(x = "x", y = "y", 
        label = "label"), fontface = 3, colour = "tomato", cex = 3.5, hjust = 1, 
        vjust = 0, angle = 0)
    plot <- plot + scale_y_continuous(expand = c(0.2, 0))
    plot <- plot + labs(title = "Variable Correlation Clusters\nadult data using Pearson")
    plot <- plot + theme_bw() + theme(axis.ticks = element_blank(), axis.text.y = element_blank())
    plot <- plot + geom_hline(data = segment(ddata), xintercept = "x", colour = "orange") + 
        ylab(NULL) + xlab(NULL)
    plot <- plot + coord_flip()
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-29

The plot lists the variables in the left column. The variables are then linked together according to the strength of their relationships. The range of x-axis, (0, 1.5), gives an indication of the level of correlation between variables, with shorter heights indicating stronger correlations. We can observe that hoursPerWeek and educationNum are closely correlated. The variables age and capitalLoss have some higher level of correlation amongst themselves.

Although, the above correlation defines the relationships between the numerical data, this does not provide true information because much of our data are missing and replaced by 0. But to illustrate the correlation, we plot the relationship between hours per week with the education level ignoring the whether information is accurate or not.

evalq({
    plot <- ggplot(data = adult.data, aes(x = hoursPerWeek, y = educationNum, 
        colour = hoursPerWeek))
    plot <- plot + geom_point(alpha = 1/10)
    plot <- plot + ggtitle("Hours per Week vs Level of Education")
    plot <- plot + stat_smooth(method = "lm", se = FALSE, colour = "red", size = 1)
    plot <- plot + xlab("Education Level") + ylab("Hours per Week")
    plot <- plot + theme(legend.position = "none")
    plot
}, envir = plot.en)

plot of chunk unnamed-chunk-30

Thank you Jenny for being the person who taught us how to appreciate the art of R language. We had a great time last six weeks.