In this project, we will use adult data set which is available from this UCI Machine Learning Repository. The data is already being transformed and cleaned. However, we need to perform further cleanings in order to demonstrate the results that we have obtained. You can replicate my analysis following the README manual from Github.
adult dataFor a particular project, we usually end up analyzing a collection of data, transforming it and storing different forms of information about the data. It is convenient to introduce encapsulation which packages all of our data into a container that is used at a later time and thus protecting our raw data. In this project, we use R's environment concept and give it a name as below:
main.en <- new.env()
After that, data is placed into the container and we can access the data using $ notations. However, to operate on variables within an environment without using the $ notation we can use evalq(). For example, let us load our data in main.en container and assign the variables with suitable names as follows:
require(package = xtable)
## Loading required package: xtable
evalq({
adult.data <- read.table("adult_raw.txt")
names(adult.data) <- c("age", "workclass", "fnlwgt", "education", "educationNum",
"maritalStatus", "occupation", "relationship", "race", "sex", "capitalGain",
"capitalLoss", "hoursPerWeek", "nativeCountry", "grossIncome")
print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
| age | workclass | fnlwgt | education | educationNum | maritalStatus | occupation | relationship | race | sex | capitalGain | capitalLoss | hoursPerWeek | nativeCountry | grossIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39, | State-gov, | 77516, | Bachelors, | 13, | Never-married, | Adm-clerical, | Not-in-family, | White, | Male, | 2174, | 0, | 40, | United-States, | < =50K |
| 2 | 50, | Self-emp-not-inc, | 83311, | Bachelors, | 13, | Married-civ-spouse, | Exec-managerial, | Husband, | White, | Male, | 0, | 0, | 13, | United-States, | < =50K |
| 3 | 38, | Private, | 215646, | HS-grad, | 9, | Divorced, | Handlers-cleaners, | Not-in-family, | White, | Male, | 0, | 0, | 40, | United-States, | < =50K |
| 4 | 53, | Private, | 234721, | 11th, | 7, | Married-civ-spouse, | Handlers-cleaners, | Husband, | Black, | Male, | 0, | 0, | 40, | United-States, | < =50K |
| 5 | 28, | Private, | 338409, | Bachelors, | 13, | Married-civ-spouse, | Prof-specialty, | Wife, | Black, | Female, | 0, | 0, | 40, | Cuba, | < =50K |
| 6 | 37, | Private, | 284582, | Masters, | 14, | Married-civ-spouse, | Exec-managerial, | Wife, | White, | Female, | 0, | 0, | 40, | United-States, | < =50K |
We remove , from each value in the data and change the mode of the following variables to numerics: age, fnlwgt, educationNum, capitalGain, capitalLoss, and hoursPerWeek.
evalq({
adult.data$age <- as.numeric(gsub(",", "", adult.data$age))
adult.data$workclass <- gsub(",", "", adult.data$workclass)
adult.data$fnlwgt <- as.numeric(gsub(",", "", adult.data$fnlwgt))
adult.data$education <- gsub(",", "", adult.data$education)
adult.data$educationNum <- as.numeric(gsub(",", "", adult.data$educationNum))
adult.data$maritalStatus <- gsub(",", "", adult.data$maritalStatus)
adult.data$occupation <- gsub(",", "", adult.data$occupation)
adult.data$relationship <- gsub(",", "", adult.data$relationship)
adult.data$race <- gsub(",", "", adult.data$race)
adult.data$sex <- gsub(",", "", adult.data$sex)
adult.data$capitalGain <- as.numeric(gsub(",", "", adult.data$capitalGain))
adult.data$capitalLoss <- as.numeric(gsub(",", "", adult.data$capitalLoss))
adult.data$hoursPerWeek <- as.numeric(gsub(",", "", adult.data$hoursPerWeek))
adult.data$nativeCountry <- gsub(",", "", adult.data$nativeCountry)
adult.data$grossIncome <- gsub(",", "", adult.data$grossIncome)
print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
| age | workclass | fnlwgt | education | educationNum | maritalStatus | occupation | relationship | race | sex | capitalGain | capitalLoss | hoursPerWeek | nativeCountry | grossIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39.00 | State-gov | 77516.00 | Bachelors | 13.00 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.00 | 0.00 | 40.00 | United-States | < =50K |
| 2 | 50.00 | Self-emp-not-inc | 83311.00 | Bachelors | 13.00 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.00 | 0.00 | 13.00 | United-States | < =50K |
| 3 | 38.00 | Private | 215646.00 | HS-grad | 9.00 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.00 | 0.00 | 40.00 | United-States | < =50K |
| 4 | 53.00 | Private | 234721.00 | 11th | 7.00 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.00 | 0.00 | 40.00 | United-States | < =50K |
| 5 | 28.00 | Private | 338409.00 | Bachelors | 13.00 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.00 | 0.00 | 40.00 | Cuba | < =50K |
| 6 | 37.00 | Private | 284582.00 | Masters | 14.00 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0.00 | 0.00 | 40.00 | United-States | < =50K |
Our data is already being cleaned and the missing values are filled with either ?, for categorial values, or 0, for numerical data. However, we treate any ? as missing values and remove them from the data.
evalq({
adult.data$workclass <- gsub("[?]", NA, adult.data$workclass)
adult.data$education <- gsub("[?]", NA, adult.data$education)
adult.data$maritalStatus <- gsub("[?]", NA, adult.data$maritalStatus)
adult.data$occupation <- gsub("[?]", NA, adult.data$occupation)
adult.data$relationship <- gsub("[?]", NA, adult.data$relationship)
adult.data$race <- gsub("[?]", NA, adult.data$race)
adult.data$sex <- gsub("[?]", NA, adult.data$sex)
adult.data$nativeCountry <- gsub("[?]", NA, adult.data$nativeCountry)
adult.data <- na.omit(adult.data)
}, main.en)
Now, let us explore the transformed data to discover various insightful knowledge such as dimensions, the variables names, its boundaries, its characteristics…etc.
evalq({
head(adult.data)
tail(adult.data)
dim(adult.data)
names(adult.data)
str(adult.data)
summary(adult.data)
}, main.en)
To obtain more detailed summary of the numerical data, we apply the basicStats() function from fBasics package.
require(fBasics)
## Loading required package: fBasics
## Loading required package: MASS
## Loading required package: timeDate
##
## Attaching package: 'timeDate'
##
## The following object is masked from 'package:xtable':
##
## align
##
## Loading required package: timeSeries
##
## Attaching package: 'fBasics'
##
## The following object is masked from 'package:base':
##
## norm
evalq({
print(xtable(basicStats(adult.data$age)), type = "html", include.colnames = FALSE,
include.rownames = TRUE)
}, main.en)
| nobs | 30162.00 |
| NAs | 0.00 |
| Minimum | 17.00 |
| Maximum | 90.00 |
| 1. Quartile | 28.00 |
| 3. Quartile | 47.00 |
| Mean | 38.44 |
| Median | 37.00 |
| Sum | 1159364.00 |
| SE Mean | 0.08 |
| LCL Mean | 38.29 |
| UCL Mean | 38.59 |
| Variance | 172.52 |
| Stdev | 13.13 |
| Skewness | 0.53 |
| Kurtosis | -0.15 |
In analyzing data, it is very useful to transform continuous numeric variables into a specific set of meaningful categoric values for subsequent analysis especially to reduce the effects of minor observation errors. This operation is called binning. Lets apply binning to our age variable by first recording the range of this continous data.
evalq({
print(range(adult.data$age))
}, main.en)
[1] 17 90
We observe that the range of age in our dataset is 17 and maximum is 90. Therefore, we apply binning to age variable to categorize it into four distinct intervals that are chosen subjectively as follows:
evalq({
adult.data$agedesc = ifelse(test = adult.data[, "age"] <= 22, yes = "youth",
no = ifelse(test = adult.data[, "age"] > 22 & adult.data[, "age"] <=
35, yes = "adult", no = ifelse(test = adult.data[, "age"] > 35 &
adult.data[, "age"] <= 55, yes = "adultToOld", no = "old")))
print(xtable(head(adult.data)), type = "html", include.rownames = TRUE)
}, main.en)
| age | workclass | fnlwgt | education | educationNum | maritalStatus | occupation | relationship | race | sex | capitalGain | capitalLoss | hoursPerWeek | nativeCountry | grossIncome | agedesc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39.00 | State-gov | 77516.00 | Bachelors | 13.00 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.00 | 0.00 | 40.00 | United-States | < =50K | adultToOld |
| 2 | 50.00 | Self-emp-not-inc | 83311.00 | Bachelors | 13.00 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.00 | 0.00 | 13.00 | United-States | < =50K | adultToOld |
| 3 | 38.00 | Private | 215646.00 | HS-grad | 9.00 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.00 | 0.00 | 40.00 | United-States | < =50K | adultToOld |
| 4 | 53.00 | Private | 234721.00 | 11th | 7.00 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.00 | 0.00 | 40.00 | United-States | < =50K | adultToOld |
| 5 | 28.00 | Private | 338409.00 | Bachelors | 13.00 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.00 | 0.00 | 40.00 | Cuba | < =50K | adult |
| 6 | 37.00 | Private | 284582.00 | Masters | 14.00 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0.00 | 0.00 | 40.00 | United-States | < =50K | adultToOld |
We keep certain variables in our data and remove the redundant ones as follows:
evalq({
adult.data <- data.frame(adult.data[, c("age", "workclass", "education",
"educationNum", "maritalStatus", "occupation", "relationship", "race",
"sex", "nativeCountry")])
}, main.en)
ggplot2We create one additional container to hold data for ploting purposes and to perform other applications on our data without altering the main data. Doing so we demonstrate the inheritance structure in our coding scheme.
plot.en <- new.env(parent = main.en)
We require the following packages.
require(plyr)
## Loading required package: plyr
require(ggplot2)
## Loading required package: ggplot2
Let us plot our first figure which depicts the education level per gender
evalq({
edulevelPergender <- ddply(.data = adult.data, .variables = ~educationNum +
sex, .fun = function(x) {
count <- nrow(x)
return(data.frame(count))
})
plotbar <- ggplot(adult.data, aes(x = factor(education), fill = factor(sex)))
plotbar <- plotbar + ggtitle("Education level per gender")
plotbar <- plotbar + geom_bar(position = "dodge") + coord_flip()
plotbar <- plotbar + xlab("Education") + ylab("Number of observations")
plotbar <- plotbar + scale_fill_discrete(name = "Gender")
plotbar
}, envir = plot.en)
We see that the above figure does not represent the ratio of each gender with the level of education in the dataset for that particular gender. Therefore, we use proportions of each gender in the dataset according to their level of education and re-plot the results.
evalq({
nPerSex <- count(df = adult.data, ~sex)
adult.data.modified <- ddply(adult.data, ~education + sex, summarise, proportion = length(education))
adult.data.modified$proportion = ifelse(test = adult.data.modified$sex ==
"Male", yes = adult.data.modified$proportion/nPerSex$freq[2], no = adult.data.modified$proportion/nPerSex$freq[1])
plotbarprop <- ggplot(adult.data.modified, aes(x = factor(education), y = proportion,
fill = factor(sex)))
plotbarprop <- plotbarprop + ggtitle("Education level per gender")
plotbarprop <- plotbarprop + geom_bar(position = "dodge") + coord_flip()
plotbarprop <- plotbarprop + xlab("Education") + ylab("Proportion")
plotbarprop <- plotbarprop + scale_fill_discrete(name = "Gender")
plotbarprop
}, envir = plot.en)
## Mapping a variable to y and also using stat="bin".
## With stat="bin", it will attempt to set the y value to the count of cases in each group.
## This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
## If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
## If you want y to represent values in the data, use stat="identity".
## See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
Now, we observe similar proportions for each gender with the level of education. In fact, females are more educated in some levels such as at 11th grade. This piece of information did not reveal to us in our previous figure.
We use boxplot that provides a graphical overview of how the observations of a particular variable, age in our case, are distributed per gender.
evalq({
plotbox <- ggplot(adult.data) + aes(x = factor(agedesc), y = age)
plotbox <- plotbox + geom_boxplot(aes(fill = sex), outlier.colour = "red",
outlier.shape = 4)
plotbox <- plotbox + ggtitle("Education level with Age per gender")
plotbox <- plotbox + xlab("Education") + ylab("Age")
plotbox <- plotbox + scale_fill_discrete(name = "Gender") + coord_flip()
plotbox
}, envir = plot.en)
Here we plot education level with specific age intervals.
evalq({
p1 <- ggplot(adult.data) + aes(education, fill = factor(sex))
p1 <- p1 + ggtitle("Education level with each Age interval per gender")
p1 <- p1 + geom_bar() + xlab("Education") + ylab("Number of observations")
p1 <- p1 + scale_fill_discrete(name = "Gender") + coord_flip()
p1 <- p1 + facet_wrap(~agedesc)
p1
}, envir = plot.en)
Plotting the number of educated people based on their ethnicity.
evalq({
p2 <- ggplot(adult.data) + aes(education, fill = race)
p2 <- p2 + ggtitle("Education level with Race")
p2 <- p2 + geom_bar() + xlab("Education") + ylab("Number of observations")
p2 <- p2 + scale_fill_brewer("Race") + coord_flip()
p2
}, envir = plot.en)
Gradient colors are not a convenient solution in plotting the above figure. We plot the status of relationships with level of educations and we use palette option:
evalq({
p3 <- ggplot(adult.data) + aes(education, fill = maritalStatus)
p3 <- p3 + ggtitle("Education level with Marital Status")
p3 <- p3 + geom_bar() + xlab("Education") + ylab("Number of observations")
p3 <- p3 + scale_fill_brewer("Marital Status", palette = "Dark2") + coord_flip()
p3
}, envir = plot.en)
Exploring the type of employment with level of educations:
evalq({
p4 <- ggplot(adult.data) + aes(education, fill = workclass)
p4 <- p4 + ggtitle("Education level with Employment type")
p4 <- p4 + geom_bar() + xlab("Education") + ylab("Number of observations")
p4 <- p4 + scale_fill_brewer("Work Class", palette = "Dark2") + coord_flip()
p4
}, envir = plot.en)
gridExtra is an interesting package that puts several figures onto one panel. As an example:
require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
evalq({
grid.arrange(plotbox, p2, p3, p4, ncol = 2, main = "Plotting in one panel")
}, envir = plot.en)
ggplot2We load the following packages.
require(ggmap)
## Loading required package: ggmap
require(mapproj)
## Loading required package: mapproj
## Loading required package: maps
We copy map data into plot.en container
evalq({
map.world <- map_data(map = "world")
}, envir = plot.en)
Let us draw a map using ggplot2. We first note that the countries in adult data need to be mapped with regions in the map data. Therefore, several issues need to be handled properly:
Outlying-US(Guam-USVI-etc) does not have any equivalent with map data.evalq({
adult.data$nativeCountry <- gsub("(Outlying-US\\(Guam-USVI-etc\\))", NA,
adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("England", "UK", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Scotland", "UK", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Hong", "China", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Taiwan", "China", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Trinadad&Tobago", "Tobago", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("United-States", "USA", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Puerto-Rico", "Puerto Rico", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Holand-Netherlands", "Netherlands", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("El-Salvador", "El Salvador", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Dominican-Republic", "Dominican Republic",
adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("Columbia", "Colombia", adult.data$nativeCountry)
adult.data$nativeCountry <- gsub("South", "South Korea", adult.data$nativeCountry)
adult.data <- na.omit(adult.data)
}, plot.en)
After we address the above shortcomings, let's begin to plot our first world map.
evalq({
plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = region))
plot <- plot + ggtitle("World, filled with regions")
plot <- plot + geom_polygon() + theme(legend.position = "none")
plot <- plot + xlab("Longitude") + ylab("Latitude")
plot
}, envir = plot.en)
Interestingly, we can also represent the filled countries that are only recorded in adult data as follows:
evalq({
map.world$availableCountry <- ifelse(test = map.world$region %in% adult.data$nativeCountry,
yes = "yes", no = "no")
plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = availableCountry))
plot <- plot + ggtitle("World, filled with interested countries")
plot <- plot + geom_polygon() + theme(legend.position = "none")
plot <- plot + xlab("Longitude") + ylab("Latitude")
plot
}, envir = plot.en)
Let us perform more fancy work by plotting the world map according to the median age per country.
evalq({
tmp.df <- ddply(adult.data, ~nativeCountry, summarise, medianAge = median(age),
meanAge = mean(age))
map.world <- merge(x = map.world, y = tmp.df, by.x = "region", by.y = "nativeCountry",
all = TRUE)
map.world$medianAge[is.na(map.world$medianAge)] <- 0 # other countries median
# are set to 0
map.world$meanAge[is.na(map.world$meanAge)] <- 0 # other countries mean
# are set to 0
map.world <- arrange(map.world, order) # We need to sort the map.world data
# according to the order variable for the purpose of plotting
plot <- ggplot(map.world, aes(x = long, y = lat, group = group, fill = medianAge))
plot <- plot + ggtitle("Median age for specific countries")
plot <- plot + geom_polygon() + scale_fill_gradientn("Median Age", colours = rainbow(7),
limits = c(10, 60))
plot <- plot + xlab("Longitude") + ylab("Latitude")
plot
}, envir = plot.en)
ggplot2In our dataset, we identify some kind of relationships or correlations between the observations of continous variables. Let us explore the relationships between numerical variables using a tree-like structure known as a dendrogram. For this, we need to load the following packages:
require(grid)
require(ggdendro)
## Loading required package: ggdendro
##
## Attaching package: 'ggdendro'
##
## The following object is masked from 'package:xtable':
##
## label
We first identify the numerical variables and then perform correlation between them. Afterwards, we apply hclust function to the results in order to create an object with hierarchical structure that will be used to plot dendrogram.
evalq({
numerics <- c(1, 5, 11:13)
cr <- cor(x = adult.data[numerics], use = "pairwise", method = "pearson")
hc <- hclust(dist(cr), method = "average")
ddata <- dendro_data(hc)
plot <- ggplot() + geom_segment(data = segment(ddata), aes_string(x = "x",
y = "y", xend = "xend", yend = "yend"), colour = "grey", lwd = 1)
plot <- plot + geom_text(data = label(ddata), aes_string(x = "x", y = "y",
label = "label"), fontface = 3, colour = "tomato", cex = 3.5, hjust = 1,
vjust = 0, angle = 0)
plot <- plot + scale_y_continuous(expand = c(0.2, 0))
plot <- plot + labs(title = "Variable Correlation Clusters\nadult data using Pearson")
plot <- plot + theme_bw() + theme(axis.ticks = element_blank(), axis.text.y = element_blank())
plot <- plot + geom_hline(data = segment(ddata), xintercept = "x", colour = "orange") +
ylab(NULL) + xlab(NULL)
plot <- plot + coord_flip()
plot
}, envir = plot.en)
The plot lists the variables in the left column. The variables are then linked together according to the strength of their relationships. The range of x-axis, (0, 1.5), gives an indication of the level of correlation between variables, with shorter heights indicating stronger correlations. We can observe that hoursPerWeek and educationNum are closely correlated. The variables age and capitalLoss have some higher level of correlation amongst themselves.
Although, the above correlation defines the relationships between the numerical data, this does not provide true information because much of our data are missing and replaced by 0. But to illustrate the correlation, we plot the relationship between hours per week with the education level ignoring the whether information is accurate or not.
evalq({
plot <- ggplot(data = adult.data, aes(x = hoursPerWeek, y = educationNum,
colour = hoursPerWeek))
plot <- plot + geom_point(alpha = 1/10)
plot <- plot + ggtitle("Hours per Week vs Level of Education")
plot <- plot + stat_smooth(method = "lm", se = FALSE, colour = "red", size = 1)
plot <- plot + xlab("Education Level") + ylab("Hours per Week")
plot <- plot + theme(legend.position = "none")
plot
}, envir = plot.en)
Thank you Jenny for being the person who taught us how to appreciate the art of R language. We had a great time last six weeks.