Reading the input CSV and investigating the structure in raw data
df <- read.csv("germany_income.csv")
str(df)
## 'data.frame': 137 obs. of 7 variables:
## $ age : int 30 22 22 26 59 61 46 30 35 26 ...
## $ fnlwgt : int 77143 34918 151790 109186 212448 134768 134727 111415 178322 152046 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 1 1 2 1 2 2 2 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 3 5 5 5 5 5 1 5 5 5 ...
## $ capital.gain: int 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss: int 0 0 0 0 0 0 0 0 0 0 ...
## $ income : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 2 1 1 2 1 ...
By above structure, we can conclude that, all the attributes are in proper format & all data is avaialbe to analysis. No need of Data Preprocessing.
summary(df)
## age fnlwgt sex race
## Min. :18.00 Min. : 21306 Female:60 Amer-Indian-Eskimo: 1
## 1st Qu.:29.00 1st Qu.:116391 Male :77 Asian-Pac-Islander: 3
## Median :36.00 Median :178322 Black : 8
## Mean :39.26 Mean :189325 Other : 1
## 3rd Qu.:47.00 3rd Qu.:231604 White :124
## Max. :74.00 Max. :606111
## capital.gain capital.loss income
## Min. : 0.0 Min. : 0.00 <=50K:93
## 1st Qu.: 0.0 1st Qu.: 0.00 >50K :44
## Median : 0.0 Median : 0.00
## Mean : 887.1 Mean : 77.98
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :27828.0 Max. :1977.00
By the above summary, we can conclude that, race attribute is not properly distributed over all categories & capital.gain , capital.loss attributes are scwed. Let see the summary in plots.
ANALYSING CONTIONUS VARIABLES :-
par(mfrow=c(2,4))
boxplot(df$age, horizontal = TRUE)
boxplot(df$fnlwgt/1000, horizontal = TRUE)
boxplot(df$capital.gain, horizontal = TRUE)
boxplot(df$capital.loss, horizontal = TRUE)
main_string <- function(text , size ){
return(paste(text ,
"\n size:" , size , collapse = " "))
}
sub_string <- function(vector){
return(paste("Mean : " , round(mean(vector),2) ,
";\nMedian : " , round(median(vector),2) ,
";\nStd.Dev : " , round(sd(vector),2), collapse = " "))
}
hist(df$age , main = main_string("Age Distribution" , nrow(df)) , sub = sub_string(df$age) , labels=TRUE , xlab = "" )
hist(df$fnlwgt/1000, main = main_string("FnlWgt Distribution in 1000's" , nrow(df)), sub = sub_string(df$fnlwgt/1000) , labels=TRUE , xlab = "" )
hist(df$capital.gain , main = main_string("CPTL gain Distribution" , nrow(df)), sub = sub_string(df$capital.gain) , labels=TRUE , xlab = "" )
hist(df$capital.loss , main = main_string("CPTL loss Distribution" , nrow(df)), sub = sub_string(df$capital.loss) , labels=TRUE , xlab = "" )
By the above plots, we can conclude that, the distribution of Age and FnlWgt are better and there are lot of outliers in CPTL gain & CPTL loss.
ANALYSING CATEGORICAL VARIABLES :-
xtabs(~df$sex)
## df$sex
## Female Male
## 60 77
xtabs(~df$race)
## df$race
## Amer-Indian-Eskimo Asian-Pac-Islander Black Other
## 1 3 8 1
## White
## 124
xtabs(~df$income)
## df$income
## <=50K >50K
## 93 44
By above cross tabs, we can conclude that, race categories are not usefull.
matrix <- xtabs(~df$sex+df$race)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
## Amer-Indian-Eskimo Asian-Pac-Islander Black Other White Total
## Female 0 1 3 1 55 60
## Male 1 2 5 0 69 77
## Total 1 3 8 1 124 137
In race, all the data points are in white category. But with in white, the varience is good.
matrix <- xtabs(~df$income+df$race)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
## Amer-Indian-Eskimo Asian-Pac-Islander Black Other White Total
## <=50K 1 3 6 1 82 93
## >50K 0 0 2 0 42 44
## Total 1 3 8 1 124 137
matrix <- xtabs(~df$sex+df$income)
matrix <- cbind(matrix , "Total" = rowSums(matrix))
matrix <- rbind(matrix , "Total" = colSums(matrix))
matrix
## <=50K >50K Total
## Female 53 7 60
## Male 40 37 77
## Total 93 44 137
mosaicplot(xtabs(~df$sex+df$income) , shade = TRUE , main = "SEX & Income" , xlab = "SEX" , ylab = "INCOME" , labels = TRUE)
## Warning: In mosaicplot.default(xtabs(~df$sex + df$income), shade = TRUE,
## main = "SEX & Income", xlab = "SEX", ylab = "INCOME", labels = TRUE) :
## extra argument 'labels' will be disregarded
IN Male category, the varience is good for income⦠in female category, the varience is not good.
par(mfrow = c(2,1))
boxplot(age ~ income , data = df , horizontal = TRUE)
boxplot(age ~ sex , data = df , horizontal = TRUE)
Fitting Logistic regression Model :-
model <- glm(income ~ . , family = "binomial" , data = df)
summary(model)
##
## Call:
## glm(formula = income ~ ., family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6933 -0.7862 -0.3402 0.6080 2.2898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.232e+01 3.956e+03 -0.006 0.995499
## age 5.529e-02 1.998e-02 2.767 0.005653 **
## fnlwgt 1.348e-06 2.331e-06 0.579 0.562873
## sex Male 2.028e+00 5.458e-01 3.716 0.000202 ***
## race Asian-Pac-Islander 1.536e+00 4.450e+03 0.000 0.999725
## race Black 1.776e+01 3.956e+03 0.004 0.996418
## race Other 3.108e+00 5.595e+03 0.001 0.999557
## race White 1.748e+01 3.956e+03 0.004 0.996474
## capital.gain 2.295e-04 1.292e-04 1.777 0.075624 .
## capital.loss 1.376e-03 8.084e-04 1.702 0.088785 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 172.00 on 136 degrees of freedom
## Residual deviance: 121.19 on 127 degrees of freedom
## AIC: 141.19
##
## Number of Fisher Scoring iterations: 16
coef(model)
## (Intercept) age fnlwgt
## -2.231923e+01 5.529056e-02 1.348433e-06
## sex Male race Asian-Pac-Islander race Black
## 2.028126e+00 1.535938e+00 1.776032e+01
## race Other race White capital.gain
## 3.107710e+00 1.748300e+01 2.294946e-04
## capital.loss
## 1.375704e-03
By this we can fit a Logistic regression to get the categories of income. from above model summary, we can find out that, AGE & SEX-MALE are two variables which are effecting the income category.
Thank you