Download and load the ‘Clem3Training’ data.
mydata <- read.csv("Clem3Training.csv", na.strings = "?", stringsAsFactors = TRUE) # Load the Clem3Training CSV; "?" is read as NA and strings become factors (the default before R 4.0)
What do you think ‘education.num’, ‘capital.gain’, ‘capital.loss’ and ‘hours.per.week’ signify? While answering these questions, see if there are any relationships between:
- ‘education’ and ‘education.num’
- ‘capital.gain’ and ‘capital.loss’
education.num: the level of education coded as a number from 1 to 16, where 1 corresponds to Pre school and 16 to Doctorate.
capital.gain: the profit realized from the sale of a capital asset.
capital.loss: the loss incurred when a capital asset is sold for less than its purchase price.
hours.per.week: the total number of hours that person works per week.
Looking at the dataset, I can observe a one-to-one relationship between ‘education’ and ‘education.num’: each number in education.num corresponds to exactly one level of the education variable.
Pre school referred as 1
1st-4th referred as 2
5th-6th referred as 3
7th-8th referred as 4
9th referred as 5
10th referred as 6
11th referred as 7
12th referred as 8
HS-grad referred as 9
Some-college referred as 10
Assoc-voc referred as 11
Assoc-acdm referred as 12
Bachelors referred as 13
Masters referred as 14
Prof-school referred as 15
Doctorate referred as 16
Also, I can observe no apparent relationship between ‘capital.gain’ and ‘capital.loss’.
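Both observations can be checked in code rather than by eye. The sketch below uses a small made-up data frame (`toy`, which only borrows the column names and is not the real Clem3Training data): a cross-tabulation with exactly one non-zero cell per row indicates that each education level maps to a single number, and counting the rows where both capital variables are non-zero shows whether a gain and a loss are ever reported together.

```r
# Toy stand-in for mydata (values are illustrative, not from Clem3Training)
toy <- data.frame(
  education     = c("HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"),
  education.num = c(9, 13, 9, 14, 13),
  capital.gain  = c(0, 2174, 0, 0, 15024),
  capital.loss  = c(0, 0, 1902, 0, 0)
)

# Cross-tabulate the two education variables
edu_tab <- table(toy$education, toy$education.num)
print(edu_tab)

# TRUE if every education level maps to exactly one education.num value
one_to_one <- all(rowSums(edu_tab > 0) == 1)
print(one_to_one)

# Number of rows where a gain and a loss are reported together
both_nonzero <- sum(toy$capital.gain > 0 & toy$capital.loss > 0)
print(both_nonzero)
```

On the real data, the same calls would be run with `mydata` in place of `toy`.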
Provide the descriptive statistics for the variables.
- Are there any variables that are technically correct but don’t make much sense?
- Change the variable type wherever you see fit.
summary(mydata)
##       age                   workclass      demogweight
##  Min.   :17.00   Private         :17385   Min.   :  12285
##  1st Qu.:28.00   Self-emp-not-inc: 1978   1st Qu.: 117963
##  Median :37.00   Local-gov       : 1624   Median : 178353
##  Mean   :38.61   State-gov       :  993   Mean   : 189742
##  3rd Qu.:48.00   Self-emp-inc    :  857   3rd Qu.: 236861
##  Max.   :90.00   (Other)         :  764   Max.   :1484705
##                  NA's            : 1399
##         education     education.num                 marital.status
##  HS-grad     :8120   Min.   : 1.00   Divorced             : 3435
##  Some-college:5597   1st Qu.: 9.00   Married-AF-spouse    :   16
##  Bachelors   :4140   Median :10.00   Married-civ-spouse   :11441
##  Masters     :1300   Mean   :10.08   Married-spouse-absent:  328
##  Assoc-voc   :1059   3rd Qu.:12.00   Never-married        : 8225
##  11th        : 909   Max.   :16.00   Separated            :  786
##  (Other)     :3875                   Widowed              :  769
##            occupation           relationship                   race
##  Prof-specialty :3180   Husband       :10064   Amer-Indian-Eskimo:  241
##  Craft-repair   :3122   Not-in-family : 6443   Asian-Pac-Islander:  775
##  Exec-managerial:3084   Other-relative:  729   Black             : 2379
##  Adm-clerical   :2975   Own-child     : 3911   Other             :  214
##  Sales          :2815   Unmarried     : 2640   White             :21391
##  (Other)        :8420   Wife          : 1213
##  NA's           :1404
##      sex         capital.gain    capital.loss    hours.per.week
##  Female: 8291   Min.   :    0   Min.   :   0.0   Min.   : 1.00
##  Male  :16709   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00
##                 Median :    0   Median :   0.0   Median :40.00
##                 Mean   : 1089   Mean   :  86.5   Mean   :40.41
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00
##
##        native.country     income
##  United-States:22421   <=50K.:19016
##  Mexico       :  488   >50K. : 5984
##  Philippines  :  151
##  Germany      :  102
##  Canada       :   99
##  (Other)      : 1294
##  NA's         :  445
mydata$education.num <- as.factor(mydata$education.num) # education.num is a code for the education level, so its mean and quartiles are not meaningful; convert it from numeric to categorical with as.factor()
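Because the education levels have a natural order, an ordered factor is one possible alternative to a plain factor. This is an assumption about what later analysis might need, not something the exercise asks for. A minimal sketch on a made-up vector (`edu_num` and `edu_ord` are illustrative names):

```r
# Illustrative education.num values (not the real data)
edu_num <- c(9, 13, 10, 16, 1, 9)

# Ordered factor: categorical, but comparisons such as < and > still work
edu_ord <- factor(edu_num, levels = 1:16, ordered = TRUE)

print(is.ordered(edu_ord))       # is the factor ordered?
print(edu_ord[2] > edu_ord[1])   # compares level 13 with level 9
print(table(edu_ord)[["9"]])     # how many records have level 9
```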
Explore whether there are missing values for any of the variables.
sum(is.na(mydata$age))
## [1] 0
sum(is.na(mydata$workclass))
## [1] 1399
sum(is.na(mydata$demogweight))
## [1] 0
sum(is.na(mydata$education))
## [1] 0
sum(is.na(mydata$education.num))
## [1] 0
sum(is.na(mydata$marital.status))
## [1] 0
sum(is.na(mydata$occupation))
## [1] 1404
sum(is.na(mydata$relationship))
## [1] 0
sum(is.na(mydata$race))
## [1] 0
sum(is.na(mydata$sex))
## [1] 0
sum(is.na(mydata$capital.gain))
## [1] 0
sum(is.na(mydata$capital.loss))
## [1] 0
sum(is.na(mydata$hours.per.week))
## [1] 0
sum(is.na(mydata$native.country))
## [1] 445
sum(is.na(mydata$income))
## [1] 0
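The per-column checks above can also be written as a single call. The sketch below shows the idea on a small made-up data frame (`toy` is illustrative, not the real data); `colSums(is.na(...))` returns one count per column.

```r
# Toy data frame with "?" already converted to NA (illustrative values only)
toy <- data.frame(
  age        = c(25, 38, 52, 41),
  workclass  = c("Private", NA, "State-gov", NA),
  occupation = c("Sales", NA, NA, "Craft-repair")
)

# Count missing values in every column in one call
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

Applied to `mydata`, this would be expected to report the same counts as the individual `sum(is.na(...))` calls.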
Explore the capital.gain and hours.per.week variables in further detail. Discuss any apparent abnormalities.
hist(mydata$capital.gain)
table(mydata$capital.gain)
##
##     0   114   401   594   914   991  1055  1086  1111  1151  1173  1409
## 22924     5     2    28     7     4    22     2     1     6     1     6
##  1424  1455  1471  1506  1639  1797  1831  1848  2009  2036  2050  2062
##     3     1     5    10     1     5     6     6     2     4     5     2
##  2105  2174  2176  2202  2228  2290  2329  2346  2354  2407  2414  2463
##     8    38    14    14     4     5     4     4     9    15     6    10
##  2538  2580  2597  2635  2653  2829  2885  2907  2936  2961  2964  2977
##     1     9    13     9     3    25    19     8     2     2     8     5
##  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464  3471  3674
##     1    72    23     4    40    16     5     3     1    18     5    11
##  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650  4687
##     9     7     5    27    12    34    15    57    11     9    32     2
##  4787  4865  4931  4934  5013  5178  5455  5556  5721  6097  6360  6418
##    17    14     1     5    53    67     9     3     3     1     3     7
##  6497  6514  6723  6767  6849  7298  7430  7443  7688  7896  7978  8614
##     9     4     1     3    23   182     8     3   214     2     1    41
##  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831
##    17     4    31     6    10     2    20    34    19     4   263     5
## 18481 20051 22040 25124 25236 27828 34095 41310 99999
##     2    28     1     4     9    22     5     2   126
hist(mydata$hours.per.week)
table(mydata$hours.per.week)
##
##     1     2     3     4     5     6     7     8     9    10    11    12
##    17    26    29    37    36    50    23   114    15   218     8   126
##    13    14    15    16    17    18    19    20    21    22    23    24
##    18    26   307   158    21    61    11   942    24    32    15   203
##    25    26    27    28    29    30    31    32    33    34    35    36
##   514    26    27    76     6   873     2   196    29    21  1003   173
##    37    38    39    40    41    42    43    44    45    46    47    48
##   109   370    32 11713    31   165   118   172  1380    56    38   402
##    49    50    51    52    53    54    55    56    57    58    59    60
##    22  2146    10   102    19    30   512    78    13    23     4  1150
##    61    62    63    64    65    66    67    68    70    72    73    74
##     1    13     9    11   195    11     4     7   216    53     1     1
##    75    76    77    78    80    81    82    84    85    86    87    88
##    54     2     5     5   101     2     1    34    11     2     1     2
##    89    90    91    92    94    95    96    97    98    99
##     1    20     2     1     1     1     3     2     9    60
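The capital.gain table above shows that most values are 0 and that 99999 occurs 126 times, far more often than the values just below it. That pattern suggests a top-coded or placeholder value, though this is an assumption, since no data dictionary is quoted here. A minimal sketch of how such checks might look, using a made-up vector (`gain` and `gain_clean` are illustrative names):

```r
# Illustrative capital.gain values (not the real data)
gain <- c(0, 0, 0, 2174, 0, 99999, 0, 7688, 99999, 0)

# Share of records with no capital gain at all
share_zero <- mean(gain == 0)
print(share_zero)

# How often the suspicious maximum occurs
n_max <- sum(gain == max(gain))
print(n_max)

# One option: treat the suspected placeholder as missing before summarising
gain_clean <- replace(gain, gain == 99999, NA)
print(max(gain_clean, na.rm = TRUE))
```

Whether to recode 99999 as missing on the real data is a judgment call that depends on what the value actually represents.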
Use a graph to visually determine whether there are any outliers in the hours.per.week variable
boxplot(mydata$hours.per.week)
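The points a boxplot draws beyond its whiskers can also be listed numerically with `boxplot.stats()`, which by default applies the same 1.5 × IQR rule as `boxplot()`. A small sketch on made-up values (`hours` is illustrative, not the real column):

```r
# Illustrative hours.per.week values (not the real data)
hours <- c(40, 40, 38, 45, 40, 50, 35, 40, 99, 1, 40, 42)

# $stats holds the lower whisker, lower hinge, median, upper hinge, upper whisker
print(boxplot.stats(hours)$stats)

# $out holds the values that would be plotted as individual outlier points
out_vals <- boxplot.stats(hours)$out
print(out_vals)
```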
Transform the hours.per.week attribute using Z-score standardization
Hours_per_week <- (mydata$hours.per.week - mean(mydata$hours.per.week)) / sd(mydata$hours.per.week) # Z-score standardization: subtract the variable's mean from each value and divide by its standard deviation
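As a sanity check, the manual formula should agree with base R's `scale()`, and the standardized values should have mean 0 and standard deviation 1. A minimal sketch on made-up values (`hours`, `z_manual` and `z_scale` are illustrative names):

```r
# Illustrative hours.per.week values (not the real data)
hours <- c(40, 20, 60, 45, 35)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (hours - mean(hours)) / sd(hours)

# Built-in equivalent; scale() returns a one-column matrix, so flatten it
z_scale <- as.numeric(scale(hours))

print(all.equal(z_manual, z_scale))  # do the two approaches agree?
print(round(mean(z_manual), 10))     # mean of the standardized values
print(sd(z_manual))                  # standard deviation of the standardized values
```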
Do you think the age variable should be transformed to be normalized? If so, which method yields the best result?
hist(mydata$age) #Plotting histogram for variable age
# Applying transformations
log_age <- log(mydata$age) # Natural log of age
hist(log_age, main = "Log Transformation") # Histogram of log_age
sqrt_age <- sqrt(mydata$age) # Square root of age
hist(sqrt_age, main = "Square-root Transformation") # Histogram of sqrt_age
inv_age <- 1 / sqrt(mydata$age) # Inverse square root of age
hist(inv_age, main = "Inverse Square-root Transformation") # Histogram of inv_age
To compare the transformations more precisely, we calculate the skewness of each transformed variable.
s_log_age <- 3 * (mean(log_age) - median(log_age)) / sd(log_age) # Skewness (3 * (mean - median) / sd) for the log transformation
s_log_age
## [1] -0.1756795
s_sqrt_age <- 3 * (mean(sqrt_age) - median(sqrt_age)) / sd(sqrt_age) # Skewness for the square-root transformation
s_sqrt_age
## [1] 0.09310181
s_inv_age <- 3 * (mean(inv_age) - median(inv_age)) / sd(inv_age) # Skewness for the inverse square-root transformation
s_inv_age
## [1] 0.4366356
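The three calculations above repeat the same formula, 3 * (mean - median) / sd, which is Pearson's second (median) skewness coefficient. A small helper keeps the comparison in one place and makes it easy to include the untransformed variable as a baseline. In the sketch below, `pearson_skew` is a hypothetical name and `age` is a made-up vector, not the real column.

```r
# Hypothetical helper: Pearson's second (median) skewness coefficient
pearson_skew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

# Illustrative right-skewed ages (not the real data)
age <- c(18, 19, 21, 23, 25, 28, 33, 41, 55, 80)

# Skewness of the raw variable and of each candidate transformation
skews <- c(
  raw      = pearson_skew(age),
  log      = pearson_skew(log(age)),
  sqrt     = pearson_skew(sqrt(age)),
  inv_sqrt = pearson_skew(1 / sqrt(age))
)
print(round(skews, 3))

# Name of the transformation whose skewness is closest to zero
print(names(which.min(abs(skews))))
```

A value closer to zero indicates a more symmetric distribution, so the entry with the smallest absolute skewness would be the preferred candidate under this measure.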