In this segmentation study, we will try to retrieve actionable insights from the
given datasets by performing an exploratory data analysis using various
visualization tools and libraries contained in R, and coming up with pertinent
questions throughout the process to guide our analysis.
We will start by analyzing the potential relationship between people’s age and
their income potential, especially wether or not they make over $50,000 per year. Since we are dealing two continuous variables in this case, we will use a
histogram to accurately answer the question.
setwd("C:/Users/jkevi_000/Documents")
rm(list=ls(all=TRUE)) # Clear memory
#Load Visualization Tools
library(ggplot2)
library(gridExtra)
library(plyr)
#Load Datasets
dat.train <- read.csv("au_train.csv")
dat.test <- read.csv("au_test.csv")
dat.full <- rbind(dat.train,dat.test) #Merge both datasets
# Age vs Income class
ggplot(dat.train) + aes(x=as.numeric(age), group= class, fill= class) +
geom_histogram(binwidth=1, color='black')
#Age vs Sex
ggplot(dat.full) + aes(x=as.numeric(age), group=sex, fill=sex) +
geom_histogram(binwidth=1, color='black')
boxplot (age ~ class, data = dat.train,
main = "Age distribution for different income levels",
xlab = "Income Levels", ylab = "Age", col = "blue")
From the graphs, we notice that The majority of participants make less or equal
to $50,000. The people that make over 50k a year are between 30 and 55 years
old. We can assume that they are mostly mid-career professionals with a few
years of experience under their belt.So the more time spent working, the more
likely a person will make over 50k per year.
More experience in the workplace leads to a higher salary.
We also notice that females are very underepresented in the dataset which might
signal a possible bias during the survey.
In this case, we take a closer look at the relationship between occupation and
income potential.
qplot(class, data = dat.train, fill = occupation) + facet_grid (. ~ occupation)
It’s pretty clear that highly specialized jobs do not guarantee a higher income
In fact, we can see a lot people with less specialized jobs making over 50k per
year and vice versa. As expected, executives and managers have the largest
amount of people making over 50k, closely followed by professors.
We will explore this very popular question using our usual visualization
tools(histograms and bar charts) to explain the not-so-obvious relationship
between income and education.
qplot(class, data = dat.train, fill = education) + facet_grid (. ~ education)
We observe that the majority of people making over 50,000 a year have at
Bachelor’s degree closely followed by a high school degree.
We also notice that they are higher in number when compared to master’s and phD degree holders which is expected because a typical class of graduates will be
about 90% bachelor’s degree holders, 30% master’s and 10 % PhD.
Therefore we can imply that education level for the most part affects a person’s income potential.
Let’s analyze how relationship and race are connected a person’s income
potential.
qplot(class, data = dat.train, fill = relationship) + facet_grid (. ~ race)
We notice that the majority of the people making over 50k a year are white male
married workers
closely followed by black married workers and pacific islanders.