In this project I have analyzed a U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. Instruction to be followed:
–>Read your dataset in R and visualize the length and breadth of your dataset. –>Create a descriptive statistics (min, max, median etc) of each variable. –>Create one-way contingency tables for the categorical variables in your dataset. –>Create two-way contingency tables for the categorical variables in your dataset. –>Draw a boxplot of the variables that belong to your study. –>Draw Histograms for your suitable data fields. –>Draw suitable plot for your data fields. –>Create a correlation matrix. –>Visualize your correlation matrix using corrgram. –>Create a scatter plot matrix for your data set. –>Run a suitable test to check your hypothesis for your suitable assumptions. –>Run a t-test to analyse your hypothesis.
First we load the packages that we will use:
library(ggplot2)
library(plyr)
library(gridExtra)
library(gmodels)
library(grid)
library(vcd)
library(scales)
library(ggthemes)
library(knitr)
First i have download the census dataset. In order to do this i set the working directory:
setwd("C:/Users/ASHUTOSH/Documents/R/iim")
Then we download the file containing the training data:
if(!file.exists("./adult.data")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
download.file(fileUrl, destfile = "./adult.data")
}
and the file that contains the test data:
if(!file.exists("./adult.test")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
download.file(fileUrl, destfile = "./adult.test")
}
My next step is to read the data from the file “adult.data” into the “db.adult” data frame, which is the training dataset:
db.adult <- read.table("adult.data", sep = ",", header = FALSE)
View(db.adult)
The number of observations in the “db.adult” data frame is:
dim(db.adult)[1]
## [1] 32561
and the number of variables is:
dim(db.adult)[2]
## [1] 15
summary(db.adult)
## V1 V2 V3
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## V4 V5 V6
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## V7 V8
## Prof-specialty :4140 Husband :13193
## Craft-repair :4099 Not-in-family : 8305
## Exec-managerial:4066 Other-relative: 981
## Adm-clerical :3770 Own-child : 5068
## Sales :3650 Unmarried : 3446
## Other-service :3295 Wife : 1568
## (Other) :9541
## V9 V10 V11
## Amer-Indian-Eskimo: 311 Female:10771 Min. : 0
## Asian-Pac-Islander: 1039 Male :21790 1st Qu.: 0
## Black : 3124 Median : 0
## Other : 271 Mean : 1078
## White :27816 3rd Qu.: 0
## Max. :99999
##
## V12 V13 V14 V15
## Min. : 0.0 Min. : 1.00 United-States:29170 <=50K:24720
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643 >50K : 7841
## Median : 0.0 Median :40.00 ? : 583
## Mean : 87.3 Mean :40.44 Philippines : 198
## 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137
## Max. :4356.0 Max. :99.00 Canada : 121
## (Other) : 1709
I took the names of the variables from the attributes list, which is available at https://archive.ics.uci.edu/ml/datasets/Census+Income or from the file adult.names provided in the UCI repository:
colnames(db.adult) <- c("age", "workclass", "fnlwgt",
"education", "education_num",
"marital_status", "occupation",
"relationship", "race", "sex",
"capital_gain", "capital_loss",
"hours_per_week", "native_country", "income")
Let us take a look at the data frame:
head(db.adult, 10)
## age workclass fnlwgt education education_num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## marital_status occupation relationship race sex
## 1 Never-married Adm-clerical Not-in-family White Male
## 2 Married-civ-spouse Exec-managerial Husband White Male
## 3 Divorced Handlers-cleaners Not-in-family White Male
## 4 Married-civ-spouse Handlers-cleaners Husband Black Male
## 5 Married-civ-spouse Prof-specialty Wife Black Female
## 6 Married-civ-spouse Exec-managerial Wife White Female
## 7 Married-spouse-absent Other-service Not-in-family Black Female
## 8 Married-civ-spouse Exec-managerial Husband White Male
## 9 Never-married Prof-specialty Not-in-family White Female
## 10 Married-civ-spouse Exec-managerial Husband White Male
## capital_gain capital_loss hours_per_week native_country income
## 1 2174 0 40 United-States <=50K
## 2 0 0 13 United-States <=50K
## 3 0 0 40 United-States <=50K
## 4 0 0 40 United-States <=50K
## 5 0 0 40 Cuba <=50K
## 6 0 0 40 United-States <=50K
## 7 0 0 16 Jamaica <=50K
## 8 0 0 45 United-States >50K
## 9 14084 0 50 United-States >50K
## 10 5178 0 40 United-States >50K
and its structure:
str(db.adult, vec.len = 2, strict.width = "no", width = 30)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 ...
## $ education_num : int 13 13 9 7 13 ...
## $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 ...
## $ capital_gain : int 2174 0 0 0 0 ...
## $ capital_loss : int 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 ...
## $ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 ...
## $ income : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 ...
boxplot(db.adult$age, xlab= "hours_per_week", ylab = "income", main= "income vs working-hours", horizontal = TRUE)
boxplot(db.adult$age, xlab= "age", ylab = "income", main= "Income vs age", horizontal = TRUE)
After I have marked the missing values, I can clean them with the function “na.omit()” which deletes all rows (i.e. observations) containig missing values:
db.adult <- na.omit(db.adult)
and after that i re-enumerate the rows of the data frame:
row.names(db.adult) <- 1:nrow(db.adult)
summary(db.adult$hours_per_week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 40.00 40.00 40.44 45.00 99.00
From the box plot i also observe that there are a lot of outliers:
ggplot(aes(x = factor(0), y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100, 5)) +
xlab(label = "") +
ylab(label = "Working hours per week") +
ggtitle("Box Plot of Working Hours per Week")
Therefore, i will group the working hours in 5 categories which we consider relevant:
less than 40 hours per week between 40 and 45 hours per week between 45 and 60 hours per week between 60 and 80 hours per week, and more than 80 hours per week, and i will create a new factor variable called “hours_w” with 5 levels corresponding to these 5 categories:
db.adult$hours_w[db.adult$hours_per_week < 40] <- " less_than_40"
db.adult$hours_w[db.adult$hours_per_week >= 40 &
db.adult$hours_per_week <= 45] <- " between_40_and_45"
db.adult$hours_w[db.adult$hours_per_week > 45 &
db.adult$hours_per_week <= 60 ] <- " between_45_and_60"
db.adult$hours_w[db.adult$hours_per_week > 60 &
db.adult$hours_per_week <= 80 ] <- " between_60_and_80"
db.adult$hours_w[db.adult$hours_per_week > 80] <- " more_than_80"
In the latter i create the category " between_40_and_45" with greater than or equal and less than or equal sign, because we want the range of this category to correspond to the span of the data points between the first and the third quartile.
i want to make the new variable “hours_w” a factor variable, therefore i have used the “factor()” function to do this:
db.adult$hours_w <- factor(db.adult$hours_w,
ordered = FALSE,
levels = c(" less_than_40",
" between_40_and_45",
" between_45_and_60",
" between_60_and_80",
" more_than_80"))
From the summary below i can see how many people belong to each category of the factor variable “hours_w”:
summary(db.adult$hours_w)
## less_than_40 between_40_and_45 between_45_and_60
## 7763 17659 6029
## between_60_and_80 more_than_80
## 902 208
df <- db.adult[db.adult$capital_gain > 0, ]
ggplot(data = df,
aes(x = df$capital_gain)) +
geom_histogram(binwidth = 5000,
colour = "black",
fill = "lightblue",
alpha = 0.4) +
scale_y_continuous(breaks = seq(0, 4000, 100)) +
labs(x = "Capital gain", y = "Count") +
ggtitle("Histogram of Nonzero Capital Gain")
Create a correlation matrix.
db<- db.adult[,1:6] # take a subset of 6 columns
dim(db) # it has 50 rows, 6 columns
## [1] 32561 6
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(db)
## vars n mean sd median trimmed mad
## age 1 32561 38.58 13.64 37 37.69 14.83
## workclass* 2 32561 4.87 1.46 5 4.96 0.00
## fnlwgt 3 32561 189778.37 105549.98 178356 180802.36 88798.84
## education* 4 32561 11.30 3.87 12 11.81 2.97
## education_num 5 32561 10.08 2.57 10 10.19 1.48
## marital_status* 6 32561 3.61 1.51 3 3.65 2.97
## min max range skew kurtosis se
## age 17 90 73 0.56 -0.17 0.08
## workclass* 1 9 8 -0.75 1.68 0.01
## fnlwgt 12285 1484705 1472420 1.45 6.22 584.94
## education* 1 16 15 -0.93 0.68 0.02
## education_num 1 16 15 -0.31 0.62 0.01
## marital_status* 1 7 6 -0.01 -0.54 0.01
Visualize your correlation matrix using corrgram.
library(corrgram)
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
##
## baseball
corrgram(db.adult, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of db.adult intercorrelations")
attach(db.adult)
## The following object is masked from package:psych:
##
## income
plot(age, capital_gain, main="Scatterplot Example",
xlab="age ", ylab="capital_gain ", pch=19)
chisq.test(db.adult$capital_gain,db.adult$education_num)
## Warning in chisq.test(db.adult$capital_gain, db.adult$education_num): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: db.adult$capital_gain and db.adult$education_num
## X-squared = 4825.8, df = 1770, p-value < 2.2e-16
chisq.test(db.adult$capital_loss,db.adult$fnlwgt)
## Warning in chisq.test(db.adult$capital_loss, db.adult$fnlwgt): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: db.adult$capital_loss and db.adult$fnlwgt
## X-squared = 2162200, df = 1969900, p-value < 2.2e-16
t.test(db.adult$capital_loss,db.adult$fnlwgt)
##
## Welch Two Sample t-test
##
## data: db.adult$capital_loss and db.adult$fnlwgt
## t = -324.29, df = 32561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -190837.6 -188544.6
## sample estimates:
## mean of x mean of y
## 87.30383 189778.36651
t.test(db.adult$capital_gain,db.adult$education_num)
##
## Welch Two Sample t-test
##
## data: db.adult$capital_gain and db.adult$education_num
## t = 26.084, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 987.3481 1147.7882
## sample estimates:
## mean of x mean of y
## 1077.64884 10.08068
t.test(db.adult$hours_per_week,db.adult$age)
##
## Welch Two Sample t-test
##
## data: db.adult$hours_per_week and db.adult$age
## t = 18.201, df = 64485, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.655961 2.055657
## sample estimates:
## mean of x mean of y
## 40.43746 38.58165
t.test(db.adult$capital_loss,db.adult$fnlwgt)
##
## Welch Two Sample t-test
##
## data: db.adult$capital_loss and db.adult$fnlwgt
## t = -324.29, df = 32561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -190837.6 -188544.6
## sample estimates:
## mean of x mean of y
## 87.30383 189778.36651
summ=lm(db.adult$income~db.adult$age+db.adult$fnlwgt)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
summ
##
## Call:
## lm(formula = db.adult$income ~ db.adult$age + db.adult$fnlwgt)
##
## Coefficients:
## (Intercept) db.adult$age db.adult$fnlwgt
## 9.504e-01 7.357e-03 3.454e-08
summary(summ)$r.squared
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## [1] NA
summ1=lm(db.adult$income~db.adult$capital_gain+db.adult$hours_per_week)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
summ1
##
## Call:
## lm(formula = db.adult$income ~ db.adult$capital_gain + db.adult$hours_per_week)
##
## Coefficients:
## (Intercept) db.adult$capital_gain db.adult$hours_per_week
## 9.290e-01 1.196e-05 7.393e-03
t.test(db.adult$age,db.adult$fnwlgt)
##
## One Sample t-test
##
## data: db.adult$age
## t = 510.39, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 38.43348 38.72981
## sample estimates:
## mean of x
## 38.58165
t.test(db.adult$capital_gain,db.adult$hours_per_week)
##
## Welch Two Sample t-test
##
## data: db.adult$capital_gain and db.adult$hours_per_week
## t = 25.342, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 956.9912 1117.4316
## sample estimates:
## mean of x mean of y
## 1077.64884 40.43746