1 Introduction

In this project I have analyzed a U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. Instruction to be followed:

–>Read your dataset in R and visualize the length and breadth of your dataset. –>Create a descriptive statistics (min, max, median etc) of each variable. –>Create one-way contingency tables for the categorical variables in your dataset. –>Create two-way contingency tables for the categorical variables in your dataset. –>Draw a boxplot of the variables that belong to your study. –>Draw Histograms for your suitable data fields. –>Draw suitable plot for your data fields. –>Create a correlation matrix. –>Visualize your correlation matrix using corrgram. –>Create a scatter plot matrix for your data set. –>Run a suitable test to check your hypothesis for your suitable assumptions. –>Run a t-test to analyse your hypothesis.

First we load the packages that we will use:

library(ggplot2)
library(plyr)
library(gridExtra)
library(gmodels)
library(grid)
library(vcd)
library(scales)
library(ggthemes)
library(knitr)

Downloading and Reading the Data

First i have download the census dataset. In order to do this i set the working directory:

setwd("C:/Users/ASHUTOSH/Documents/R/iim")

Then we download the file containing the training data:

if(!file.exists("./adult.data")){
  
    fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  
    
    download.file(fileUrl, destfile = "./adult.data")
    
}

and the file that contains the test data:

if(!file.exists("./adult.test")){
  
    fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"  
    
    download.file(fileUrl, destfile = "./adult.test")
    
}

My next step is to read the data from the file “adult.data” into the “db.adult” data frame, which is the training dataset:

db.adult <- read.table("adult.data", sep = ",", header = FALSE)
View(db.adult)

The number of observations in the “db.adult” data frame is:

dim(db.adult)[1]
## [1] 32561

and the number of variables is:

dim(db.adult)[2]
## [1] 15

Summary

summary(db.adult)
##        V1                        V2              V3         
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##              V4              V5                             V6       
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##                 V7                     V8       
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8305  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3770    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                    V9             V10             V11       
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21790   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27816                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##       V12              V13                    V14            V15       
##  Min.   :   0.0   Min.   : 1.00    United-States:29170    <=50K:24720  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643    >50K : 7841  
##  Median :   0.0   Median :40.00    ?            :  583                 
##  Mean   :  87.3   Mean   :40.44    Philippines  :  198                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137                 
##  Max.   :4356.0   Max.   :99.00    Canada       :  121                 
##                                   (Other)       : 1709

I took the names of the variables from the attributes list, which is available at https://archive.ics.uci.edu/ml/datasets/Census+Income or from the file adult.names provided in the UCI repository:

colnames(db.adult) <- c("age", "workclass", "fnlwgt", 
                        "education", "education_num", 
                        "marital_status", "occupation",
                        "relationship", "race", "sex", 
                        "capital_gain", "capital_loss", 
                        "hours_per_week", "native_country", "income")

Let us take a look at the data frame:

head(db.adult, 10)
##    age         workclass fnlwgt  education education_num
## 1   39         State-gov  77516  Bachelors            13
## 2   50  Self-emp-not-inc  83311  Bachelors            13
## 3   38           Private 215646    HS-grad             9
## 4   53           Private 234721       11th             7
## 5   28           Private 338409  Bachelors            13
## 6   37           Private 284582    Masters            14
## 7   49           Private 160187        9th             5
## 8   52  Self-emp-not-inc 209642    HS-grad             9
## 9   31           Private  45781    Masters            14
## 10  42           Private 159449  Bachelors            13
##            marital_status         occupation   relationship   race     sex
## 1           Never-married       Adm-clerical  Not-in-family  White    Male
## 2      Married-civ-spouse    Exec-managerial        Husband  White    Male
## 3                Divorced  Handlers-cleaners  Not-in-family  White    Male
## 4      Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 5      Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 6      Married-civ-spouse    Exec-managerial           Wife  White  Female
## 7   Married-spouse-absent      Other-service  Not-in-family  Black  Female
## 8      Married-civ-spouse    Exec-managerial        Husband  White    Male
## 9           Never-married     Prof-specialty  Not-in-family  White  Female
## 10     Married-civ-spouse    Exec-managerial        Husband  White    Male
##    capital_gain capital_loss hours_per_week native_country income
## 1          2174            0             40  United-States  <=50K
## 2             0            0             13  United-States  <=50K
## 3             0            0             40  United-States  <=50K
## 4             0            0             40  United-States  <=50K
## 5             0            0             40           Cuba  <=50K
## 6             0            0             40  United-States  <=50K
## 7             0            0             16        Jamaica  <=50K
## 8             0            0             45  United-States   >50K
## 9         14084            0             50  United-States   >50K
## 10         5178            0             40  United-States   >50K

and its structure:

str(db.adult, vec.len = 2, strict.width = "no", width = 30)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 ...
##  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 ...
##  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 ...
##  $ education_num : int  13 13 9 7 13 ...
##  $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 ...
##  $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 ...
##  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 ...
##  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 ...
##  $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 ...
##  $ capital_gain  : int  2174 0 0 0 0 ...
##  $ capital_loss  : int  0 0 0 0 0 ...
##  $ hours_per_week: int  40 13 40 40 40 ...
##  $ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 ...
##  $ income        : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 ...

Box-plot

boxplot(db.adult$age, xlab= "hours_per_week", ylab = "income", main= "income vs working-hours", horizontal = TRUE)

boxplot(db.adult$age, xlab= "age", ylab = "income", main= "Income vs age", horizontal = TRUE)

After I have marked the missing values, I can clean them with the function “na.omit()” which deletes all rows (i.e. observations) containig missing values:

db.adult <- na.omit(db.adult)

and after that i re-enumerate the rows of the data frame:

row.names(db.adult) <- 1:nrow(db.adult)
summary(db.adult$hours_per_week)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   40.00   40.00   40.44   45.00   99.00

From the box plot i also observe that there are a lot of outliers:

ggplot(aes(x = factor(0), y = hours_per_week),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  scale_x_discrete(breaks = NULL) +
  scale_y_continuous(breaks = seq(0, 100, 5)) + 
  xlab(label = "") +
  ylab(label = "Working hours per week") +
  ggtitle("Box Plot of Working Hours per Week") 

Therefore, i will group the working hours in 5 categories which we consider relevant:

less than 40 hours per week between 40 and 45 hours per week between 45 and 60 hours per week between 60 and 80 hours per week, and more than 80 hours per week, and i will create a new factor variable called “hours_w” with 5 levels corresponding to these 5 categories:

db.adult$hours_w[db.adult$hours_per_week < 40] <- " less_than_40"
db.adult$hours_w[db.adult$hours_per_week >= 40 & 
                 db.adult$hours_per_week <= 45] <- " between_40_and_45"
db.adult$hours_w[db.adult$hours_per_week > 45 &
                 db.adult$hours_per_week <= 60  ] <- " between_45_and_60"
db.adult$hours_w[db.adult$hours_per_week > 60 &
                 db.adult$hours_per_week <= 80  ] <- " between_60_and_80"
db.adult$hours_w[db.adult$hours_per_week > 80] <- " more_than_80"

In the latter i create the category " between_40_and_45" with greater than or equal and less than or equal sign, because we want the range of this category to correspond to the span of the data points between the first and the third quartile.

i want to make the new variable “hours_w” a factor variable, therefore i have used the “factor()” function to do this:

db.adult$hours_w <- factor(db.adult$hours_w,
                           ordered = FALSE,
                           levels = c(" less_than_40", 
                                      " between_40_and_45", 
                                      " between_45_and_60",
                                      " between_60_and_80",
                                      " more_than_80"))

From the summary below i can see how many people belong to each category of the factor variable “hours_w”:

summary(db.adult$hours_w)
##       less_than_40  between_40_and_45  between_45_and_60 
##               7763              17659               6029 
##  between_60_and_80       more_than_80 
##                902                208

Histogram

df <- db.adult[db.adult$capital_gain > 0, ]

ggplot(data = df, 
       aes(x = df$capital_gain)) +
  geom_histogram(binwidth = 5000,
                 colour = "black",
                 fill = "lightblue",
                 alpha = 0.4) +
  scale_y_continuous(breaks = seq(0, 4000, 100)) +
  labs(x = "Capital gain", y = "Count") +
  ggtitle("Histogram of Nonzero Capital Gain") 

Create a correlation matrix.

db<- db.adult[,1:6] # take a subset of 6 columns
dim(db)  # it has 50 rows, 6 columns
## [1] 32561     6
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(db)
##                 vars     n      mean        sd median   trimmed      mad
## age                1 32561     38.58     13.64     37     37.69    14.83
## workclass*         2 32561      4.87      1.46      5      4.96     0.00
## fnlwgt             3 32561 189778.37 105549.98 178356 180802.36 88798.84
## education*         4 32561     11.30      3.87     12     11.81     2.97
## education_num      5 32561     10.08      2.57     10     10.19     1.48
## marital_status*    6 32561      3.61      1.51      3      3.65     2.97
##                   min     max   range  skew kurtosis     se
## age                17      90      73  0.56    -0.17   0.08
## workclass*          1       9       8 -0.75     1.68   0.01
## fnlwgt          12285 1484705 1472420  1.45     6.22 584.94
## education*          1      16      15 -0.93     0.68   0.02
## education_num       1      16      15 -0.31     0.62   0.01
## marital_status*     1       7       6 -0.01    -0.54   0.01

Visualize your correlation matrix using corrgram.

library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
corrgram(db.adult, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of db.adult intercorrelations")

attach(db.adult)
## The following object is masked from package:psych:
## 
##     income
plot(age, capital_gain, main="Scatterplot Example", 
    xlab="age ", ylab="capital_gain ", pch=19)

Chi-squared hypothesis

chisq.test(db.adult$capital_gain,db.adult$education_num)
## Warning in chisq.test(db.adult$capital_gain, db.adult$education_num): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  db.adult$capital_gain and db.adult$education_num
## X-squared = 4825.8, df = 1770, p-value < 2.2e-16
chisq.test(db.adult$capital_loss,db.adult$fnlwgt)
## Warning in chisq.test(db.adult$capital_loss, db.adult$fnlwgt): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  db.adult$capital_loss and db.adult$fnlwgt
## X-squared = 2162200, df = 1969900, p-value < 2.2e-16
t.test(db.adult$capital_loss,db.adult$fnlwgt)
## 
##  Welch Two Sample t-test
## 
## data:  db.adult$capital_loss and db.adult$fnlwgt
## t = -324.29, df = 32561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -190837.6 -188544.6
## sample estimates:
##    mean of x    mean of y 
##     87.30383 189778.36651

T-test

t.test(db.adult$capital_gain,db.adult$education_num)
## 
##  Welch Two Sample t-test
## 
## data:  db.adult$capital_gain and db.adult$education_num
## t = 26.084, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   987.3481 1147.7882
## sample estimates:
##  mean of x  mean of y 
## 1077.64884   10.08068
t.test(db.adult$hours_per_week,db.adult$age)
## 
##  Welch Two Sample t-test
## 
## data:  db.adult$hours_per_week and db.adult$age
## t = 18.201, df = 64485, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.655961 2.055657
## sample estimates:
## mean of x mean of y 
##  40.43746  38.58165
t.test(db.adult$capital_loss,db.adult$fnlwgt)
## 
##  Welch Two Sample t-test
## 
## data:  db.adult$capital_loss and db.adult$fnlwgt
## t = -324.29, df = 32561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -190837.6 -188544.6
## sample estimates:
##    mean of x    mean of y 
##     87.30383 189778.36651
summ=lm(db.adult$income~db.adult$age+db.adult$fnlwgt)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
summ
## 
## Call:
## lm(formula = db.adult$income ~ db.adult$age + db.adult$fnlwgt)
## 
## Coefficients:
##     (Intercept)     db.adult$age  db.adult$fnlwgt  
##       9.504e-01        7.357e-03        3.454e-08
summary(summ)$r.squared
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## [1] NA
summ1=lm(db.adult$income~db.adult$capital_gain+db.adult$hours_per_week)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
summ1
## 
## Call:
## lm(formula = db.adult$income ~ db.adult$capital_gain + db.adult$hours_per_week)
## 
## Coefficients:
##             (Intercept)    db.adult$capital_gain  db.adult$hours_per_week  
##               9.290e-01                1.196e-05                7.393e-03
t.test(db.adult$age,db.adult$fnwlgt)
## 
##  One Sample t-test
## 
## data:  db.adult$age
## t = 510.39, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.43348 38.72981
## sample estimates:
## mean of x 
##  38.58165
t.test(db.adult$capital_gain,db.adult$hours_per_week)
## 
##  Welch Two Sample t-test
## 
## data:  db.adult$capital_gain and db.adult$hours_per_week
## t = 25.342, df = 32560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   956.9912 1117.4316
## sample estimates:
##  mean of x  mean of y 
## 1077.64884   40.43746