Reading the input CSV and investigating the structure in raw data

df <- read.csv("germany_income.csv")
str(df)
## 'data.frame':    137 obs. of  7 variables:
##  $ age         : int  30 22 22 26 59 61 46 30 35 26 ...
##  $ fnlwgt      : int  77143 34918 151790 109186 212448 134768 134727 111415 178322 152046 ...
##  $ sex         : Factor w/ 2 levels " Female"," Male": 2 1 1 2 1 2 2 2 2 1 ...
##  $ race        : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 3 5 5 5 5 5 1 5 5 5 ...
##  $ capital.gain: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ income      : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 2 1 1 2 1 ...

By above structure, we can conclude that, all the attributes are in proper format & all data is avaialbe to analysis. No need of Data Preprocessing.

summary(df)
##       age            fnlwgt            sex                      race    
##  Min.   :18.00   Min.   : 21306    Female:60    Amer-Indian-Eskimo:  1  
##  1st Qu.:29.00   1st Qu.:116391    Male  :77    Asian-Pac-Islander:  3  
##  Median :36.00   Median :178322                 Black             :  8  
##  Mean   :39.26   Mean   :189325                 Other             :  1  
##  3rd Qu.:47.00   3rd Qu.:231604                 White             :124  
##  Max.   :74.00   Max.   :606111                                         
##   capital.gain      capital.loss        income  
##  Min.   :    0.0   Min.   :   0.00    <=50K:93  
##  1st Qu.:    0.0   1st Qu.:   0.00    >50K :44  
##  Median :    0.0   Median :   0.00              
##  Mean   :  887.1   Mean   :  77.98              
##  3rd Qu.:    0.0   3rd Qu.:   0.00              
##  Max.   :27828.0   Max.   :1977.00

By the above summary, we can conclude that, race attribute is not properly distributed over all categories & capital.gain , capital.loss attributes are scwed. Let see the summary in plots.

ANALYSING CONTIONUS VARIABLES :-

par(mfrow=c(2,4))
boxplot(df$age, horizontal = TRUE)
boxplot(df$fnlwgt/1000, horizontal = TRUE)
boxplot(df$capital.gain, horizontal = TRUE)
boxplot(df$capital.loss, horizontal = TRUE)

main_string <- function(text , size ){
  return(paste(text ,
               "\n size:" , size , collapse = " "))
}

sub_string <- function(vector){
  return(paste("Mean : " , round(mean(vector),2) ,
               ";\nMedian : " , round(median(vector),2) ,
               ";\nStd.Dev : " , round(sd(vector),2), collapse = " "))
}

hist(df$age , main = main_string("Age Distribution" , nrow(df)) , sub = sub_string(df$age)  , labels=TRUE , xlab = "" )
hist(df$fnlwgt/1000, main = main_string("FnlWgt Distribution in 1000's" , nrow(df)), sub = sub_string(df$fnlwgt/1000)  , labels=TRUE , xlab = "" )
hist(df$capital.gain , main = main_string("CPTL gain Distribution" , nrow(df)), sub = sub_string(df$capital.gain)  , labels=TRUE , xlab = "" )
hist(df$capital.loss , main = main_string("CPTL loss Distribution" , nrow(df)), sub = sub_string(df$capital.loss)  , labels=TRUE , xlab = "" )

By the above plots, we can conclude that, the distribution of Age and FnlWgt are better and there are lot of outliers in CPTL gain & CPTL loss.

ANALYSING CATEGORICAL VARIABLES :-

xtabs(~df$sex)
## df$sex
##  Female    Male 
##      60      77
xtabs(~df$race)
## df$race
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black               Other 
##                   1                   3                   8                   1 
##               White 
##                 124
xtabs(~df$income)
## df$income
##  <=50K   >50K 
##     93     44

By above cross tabs, we can conclude that, race categories are not usefull.

matrix <- xtabs(~df$sex+df$race)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
##          Amer-Indian-Eskimo  Asian-Pac-Islander  Black  Other  White Total
##  Female                   0                   1      3      1     55    60
##  Male                     1                   2      5      0     69    77
## Total                     1                   3      8      1    124   137

In race, all the data points are in white category. But with in white, the varience is good.

matrix <- xtabs(~df$income+df$race)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
##         Amer-Indian-Eskimo  Asian-Pac-Islander  Black  Other  White Total
##  <=50K                   1                   3      6      1     82    93
##  >50K                    0                   0      2      0     42    44
## Total                    1                   3      8      1    124   137
matrix <- xtabs(~df$sex+df$income)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
##          <=50K  >50K Total
##  Female     53     7    60
##  Male       40    37    77
## Total       93    44   137
mosaicplot(xtabs(~df$sex+df$income) , shade = TRUE , main = "SEX & Income" , xlab = "SEX" , ylab = "INCOME" , labels = TRUE)
## Warning: In mosaicplot.default(xtabs(~df$sex + df$income), shade = TRUE, 
##     main = "SEX & Income", xlab = "SEX", ylab = "INCOME", labels = TRUE) :
##  extra argument 'labels' will be disregarded

IN Male category, the varience is good for income… in female category, the varience is not good.

par(mfrow = c(2,1))
boxplot(age ~ income , data = df , horizontal = TRUE)
boxplot(age ~ sex , data = df , horizontal = TRUE)

Fitting Logistic regression Model :-

model <- glm(income ~ . , family = "binomial" , data = df)
summary(model)
## 
## Call:
## glm(formula = income ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6933  -0.7862  -0.3402   0.6080   2.2898  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.232e+01  3.956e+03  -0.006 0.995499    
## age                      5.529e-02  1.998e-02   2.767 0.005653 ** 
## fnlwgt                   1.348e-06  2.331e-06   0.579 0.562873    
## sex Male                 2.028e+00  5.458e-01   3.716 0.000202 ***
## race Asian-Pac-Islander  1.536e+00  4.450e+03   0.000 0.999725    
## race Black               1.776e+01  3.956e+03   0.004 0.996418    
## race Other               3.108e+00  5.595e+03   0.001 0.999557    
## race White               1.748e+01  3.956e+03   0.004 0.996474    
## capital.gain             2.295e-04  1.292e-04   1.777 0.075624 .  
## capital.loss             1.376e-03  8.084e-04   1.702 0.088785 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 172.00  on 136  degrees of freedom
## Residual deviance: 121.19  on 127  degrees of freedom
## AIC: 141.19
## 
## Number of Fisher Scoring iterations: 16
coef(model)
##             (Intercept)                     age                  fnlwgt 
##           -2.231923e+01            5.529056e-02            1.348433e-06 
##                sex Male race Asian-Pac-Islander              race Black 
##            2.028126e+00            1.535938e+00            1.776032e+01 
##              race Other              race White            capital.gain 
##            3.107710e+00            1.748300e+01            2.294946e-04 
##            capital.loss 
##            1.375704e-03

By this we can fit a Logistic regression to get the categories of income. from above model summary, we can find out that, AGE & SEX-MALE are two variables which are effecting the income category.

Thank you