ASSIGNMENT 3 - ANSWERS

Download and load the ‘Clem3Training’ data.

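mydata = read.csv("Clem3Training.csv", na.strings = "?", stringsAsFactors = TRUE) # Load the Clem3Training CSV file; "?" marks missing values; strings are read as factors so summary() reports level counts (needed explicitly in R >= 4.0)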

Question 1:

What do you think ‘education.num’, ‘capital.gain’, ‘capital.loss’ and ‘hours.per.week’ signify? While answering these questions, see if there are any relationships between: - ‘education’ and ‘education.num’ - ‘capital.gain’ and ‘capital.loss’

Answer 1:

education.num: signifies the level of education, coded from 1 to 16, where 1 is Preschool and 16 is Doctorate.

capital.gain: signifies the profit that results from the sale of a capital asset.

capital.loss: signifies the loss incurred when a capital asset decreases in value.

hours.per.week: signifies the total number of hours worked per week by that person.

Looking at the dataset, there is a relationship between ‘education’ and ‘education.num’: each number in education.num corresponds to one level of the education variable.

Preschool: 1
1st-4th: 2
5th-6th: 3
7th-8th: 4
9th: 5
10th: 6
11th: 7
12th: 8
HS-grad: 9
Some-college: 10
Assoc-voc: 11
Assoc-acdm: 12
Bachelors: 13
Masters: 14
Prof-school: 15
Doctorate: 16

Also, I do not observe any relationship between ‘capital.gain’ and ‘capital.loss’.
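
Both observations can be checked in code rather than by eye. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the CSV); the same calls should work on mydata.

```r
# Toy stand-in for mydata; the values are illustrative only.
toy <- data.frame(
  education     = c("HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"),
  education.num = c(9, 13, 9, 14, 13),
  capital.gain  = c(0, 2174, 0, 0, 7688),
  capital.loss  = c(0, 0, 1902, 0, 0)
)

# Cross-tabulate the two education columns. If every row and every column
# has exactly one non-zero cell, each label maps to exactly one code.
edu_tab <- table(toy$education, toy$education.num)
one_to_one <- all(rowSums(edu_tab > 0) == 1) && all(colSums(edu_tab > 0) == 1)

# Count the rows that report both a capital gain and a capital loss.
both_nonzero <- sum(toy$capital.gain > 0 & toy$capital.loss > 0)
```

On the real data, `one_to_one` being TRUE would confirm the mapping listed above, and `both_nonzero` shows how often the two capital variables are non-zero together.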

Question 2:

Provide the descriptive statistics for the variables. - Are there any variables that are technically correct but don’t make much sense? - Change the variable type wherever you see fit.

Answer 2:

summary(mydata)
##       age                   workclass      demogweight     
##  Min.   :17.00   Private         :17385   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 1978   1st Qu.: 117963  
##  Median :37.00   Local-gov       : 1624   Median : 178353  
##  Mean   :38.61   State-gov       :  993   Mean   : 189742  
##  3rd Qu.:48.00   Self-emp-inc    :  857   3rd Qu.: 236861  
##  Max.   :90.00   (Other)         :  764   Max.   :1484705  
##                  NA's            : 1399                    
##         education    education.num                 marital.status 
##  HS-grad     :8120   Min.   : 1.00   Divorced             : 3435  
##  Some-college:5597   1st Qu.: 9.00   Married-AF-spouse    :   16  
##  Bachelors   :4140   Median :10.00   Married-civ-spouse   :11441  
##  Masters     :1300   Mean   :10.08   Married-spouse-absent:  328  
##  Assoc-voc   :1059   3rd Qu.:12.00   Never-married        : 8225  
##  11th        : 909   Max.   :16.00   Separated            :  786  
##  (Other)     :3875                   Widowed              :  769  
##            occupation           relationship                   race      
##  Prof-specialty :3180   Husband       :10064   Amer-Indian-Eskimo:  241  
##  Craft-repair   :3122   Not-in-family : 6443   Asian-Pac-Islander:  775  
##  Exec-managerial:3084   Other-relative:  729   Black             : 2379  
##  Adm-clerical   :2975   Own-child     : 3911   Other             :  214  
##  Sales          :2815   Unmarried     : 2640   White             :21391  
##  (Other)        :8420   Wife          : 1213                             
##  NA's           :1404                                                    
##      sex         capital.gain    capital.loss    hours.per.week 
##  Female: 8291   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :16709   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1089   Mean   :  86.5   Mean   :40.41  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native.country     income     
##  United-States:22421   <=50K.:19016  
##  Mexico       :  488   >50K. : 5984  
##  Philippines  :  151                 
##  Germany      :  102                 
##  Canada       :   99                 
##  (Other)      : 1294                 
##  NA's         :  445
education.num is technically correct but does not make much sense as stored. It is read as a numerical variable, but its values are codes for levels of education (for example, Bachelors is 13 and Masters is 14), so it should be a categorical variable. Computing the mean or median of education.num, as summary() does above, is not meaningful.
mydata$education.num <- as.factor(mydata$education.num) # Convert education.num from numerical to categorical using as.factor()
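
Because the education codes have a natural order, an ordered factor is one possible alternative to a plain factor. This is only a sketch on a hypothetical vector `edu_num`, not a change to the answer above.

```r
# Hypothetical education.num codes, for illustration only.
edu_num <- c(9, 13, 14, 9, 16)

# An ordered factor is still categorical (no mean or median),
# but it keeps the ranking of the levels.
edu_ord <- factor(edu_num, levels = 1:16, ordered = TRUE)

# Order comparisons remain valid on an ordered factor.
higher <- edu_ord[2] > edu_ord[1]          # is code 13 above code 9?
n_bachelors_plus <- sum(edu_ord >= "13")   # rows at level 13 or above
```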

Question 3:

Explore whether there are missing values for any of the variables.

Answer 3:

sum(is.na(mydata$age))
## [1] 0
sum(is.na(mydata$workclass))
## [1] 1399
sum(is.na(mydata$demogweight))
## [1] 0
sum(is.na(mydata$education))
## [1] 0
sum(is.na(mydata$education.num))
## [1] 0
sum(is.na(mydata$marital.status))
## [1] 0
sum(is.na(mydata$occupation))
## [1] 1404
sum(is.na(mydata$relationship))
## [1] 0
sum(is.na(mydata$race))
## [1] 0
sum(is.na(mydata$sex))
## [1] 0
sum(is.na(mydata$capital.gain))
## [1] 0
sum(is.na(mydata$capital.loss))
## [1] 0
sum(is.na(mydata$hours.per.week))
## [1] 0
sum(is.na(mydata$native.country))
## [1] 445
sum(is.na(mydata$income))
## [1] 0
There are no missing values in the variables age, demogweight, education, education.num, marital.status, relationship, race, sex, capital.gain, capital.loss, hours.per.week and income. There are 1399 missing values in workclass, 1404 in occupation and 445 in native.country.
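
The per-column checks above can also be done in one call. A minimal sketch on a made-up data frame (`toy` is an illustrative assumption); on the real data the call would be colSums(is.na(mydata)).

```r
# Toy stand-in for mydata; the values are illustrative only.
toy <- data.frame(
  age        = c(25, 38, 52, 41),
  workclass  = c("Private", NA, "State-gov", NA),
  occupation = c("Sales", NA, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values.
na_counts[na_counts > 0]
```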

Question 4:

Explore the capital.gain and hours.per.week variables in further detail. Discuss any apparent abnormalities.

Answer 4:

hist(mydata$capital.gain)

The histogram shows that capital.gain is heavily right-skewed: most values are at 0, and a few very large values appear as outliers.
table(mydata$capital.gain)
## 
##     0   114   401   594   914   991  1055  1086  1111  1151  1173  1409 
## 22924     5     2    28     7     4    22     2     1     6     1     6 
##  1424  1455  1471  1506  1639  1797  1831  1848  2009  2036  2050  2062 
##     3     1     5    10     1     5     6     6     2     4     5     2 
##  2105  2174  2176  2202  2228  2290  2329  2346  2354  2407  2414  2463 
##     8    38    14    14     4     5     4     4     9    15     6    10 
##  2538  2580  2597  2635  2653  2829  2885  2907  2936  2961  2964  2977 
##     1     9    13     9     3    25    19     8     2     2     8     5 
##  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464  3471  3674 
##     1    72    23     4    40    16     5     3     1    18     5    11 
##  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650  4687 
##     9     7     5    27    12    34    15    57    11     9    32     2 
##  4787  4865  4931  4934  5013  5178  5455  5556  5721  6097  6360  6418 
##    17    14     1     5    53    67     9     3     3     1     3     7 
##  6497  6514  6723  6767  6849  7298  7430  7443  7688  7896  7978  8614 
##     9     4     1     3    23   182     8     3   214     2     1    41 
##  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831 
##    17     4    31     6    10     2    20    34    19     4   263     5 
## 18481 20051 22040 25124 25236 27828 34095 41310 99999 
##     2    28     1     4     9    22     5     2   126
The table shows 126 entries with capital.gain equal to 99999. No other value lies between 41310 and 99999, so this looks like an abnormality in the capital.gain variable, possibly a cap or placeholder code rather than a real amount. We can eliminate the rows that have capital.gain equal to 99999.
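
If the 99999 entries are placeholder codes rather than real gains (an assumption; the data itself does not document this), an alternative to dropping the rows is to recode the value as missing. A minimal sketch on a hypothetical vector `gain`:

```r
# Hypothetical capital.gain values, for illustration only.
gain <- c(0, 2174, 99999, 0, 7688, 99999)

# Count how many entries carry the suspicious value.
n_flagged <- sum(gain == 99999)

# Recode the suspicious value as NA instead of deleting the rows.
gain_clean <- replace(gain, gain == 99999, NA)

# Summaries then need na.rm = TRUE to skip the recoded entries.
max_clean <- max(gain_clean, na.rm = TRUE)
```

Recoding keeps the other columns of those rows available for analysis, at the cost of handling NA values later.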
hist(mydata$hours.per.week)

The histogram shows that hours.per.week is concentrated around 40 hours, with outliers at both the low and the high end.
table(mydata$hours.per.week)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##    17    26    29    37    36    50    23   114    15   218     8   126 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##    18    26   307   158    21    61    11   942    24    32    15   203 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##   514    26    27    76     6   873     2   196    29    21  1003   173 
##    37    38    39    40    41    42    43    44    45    46    47    48 
##   109   370    32 11713    31   165   118   172  1380    56    38   402 
##    49    50    51    52    53    54    55    56    57    58    59    60 
##    22  2146    10   102    19    30   512    78    13    23     4  1150 
##    61    62    63    64    65    66    67    68    70    72    73    74 
##     1    13     9    11   195    11     4     7   216    53     1     1 
##    75    76    77    78    80    81    82    84    85    86    87    88 
##    54     2     5     5   101     2     1    34    11     2     1     2 
##    89    90    91    92    94    95    96    97    98    99 
##     1    20     2     1     1     1     3     2     9    60
The table shows 60 entries with hours.per.week equal to 99. This count is much higher than for the neighbouring values (for example, only 9 entries at 98), and working 99 hours a week seems implausible, so this looks like an abnormality in the hours.per.week variable. We can eliminate the rows that have hours.per.week equal to 99.

Question 5:

Use a graph to visually determine whether there are any outliers in the hours.per.week variable

Answer 5:

boxplot(mydata$hours.per.week)

The boxplot for hours.per.week shows that there are many outliers, both above and below the whiskers.
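
The points that a boxplot draws as outliers can also be listed numerically with boxplot.stats(), which uses the same 1.5 x IQR rule by default. A minimal sketch on a hypothetical vector `hours`:

```r
# Hypothetical hours.per.week values, for illustration only.
hours <- c(40, 40, 38, 45, 40, 50, 35, 40, 99, 1, 40, 42)

# boxplot.stats() returns the five-number summary in $stats and the
# points lying beyond the whiskers in $out.
bp <- boxplot.stats(hours)

outliers <- sort(bp$out)   # values the boxplot would draw as points
n_outliers <- length(outliers)
```

On the real data, boxplot.stats(mydata$hours.per.week)$out would give the full list of flagged values.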

Question 6:

Transform the hours.per.week attribute using Z-score standardization

Answer 6:

# Z-score standardization: subtract the mean of the variable from each original value and divide by the standard deviation of the variable
Hours_per_week = (mydata$hours.per.week - mean(mydata$hours.per.week)) / sd(mydata$hours.per.week)
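
Base R's scale() performs the same centering and scaling by default, so it can serve as a cross-check on the manual formula. A minimal sketch on a hypothetical vector `hours`:

```r
# Hypothetical hours.per.week values, for illustration only.
hours <- c(40, 40, 38, 45, 40, 50, 35, 40, 99, 1, 40, 42)

# Manual Z-score: subtract the mean, divide by the standard deviation.
z_manual <- (hours - mean(hours)) / sd(hours)

# scale() centers and scales by default; it returns a one-column matrix,
# so convert it back to a plain numeric vector.
z_scale <- as.numeric(scale(hours))

# The two results should agree up to floating-point error.
same <- isTRUE(all.equal(z_manual, z_scale))
```

After standardization the variable has mean 0 and standard deviation 1, which is a quick sanity check on the result.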

Question 7:

Do you think the age variable should be transformed to be normalized? If so, see which method will yield the best result.

Answer 7:

hist(mydata$age) #Plotting histogram for variable age

The histogram for age is right-skewed, so I think the age variable should be transformed to bring it closer to a normal distribution.
# Applying transformations
log_age = log(mydata$age) # Log of the variable age
hist(log_age, main = "Log Transformation") # Histogram of log_age

sqrt_age = sqrt(mydata$age) # Square root of the variable age
hist(sqrt_age, main = "Square-root Transformation") # Histogram of sqrt_age

inv_age = 1/sqrt(mydata$age) # Inverse square root of the variable age
hist(inv_age, main = "Inverse Square-root Transformation") # Histogram of inv_age

From the transformed histograms alone, it is hard to decide which method yields the best result. The log and inverse square-root transformations appear to improve symmetry more than the square-root transformation.

To get a better understanding, we calculate the skewness of each transformed variable as 3 * (mean - median) / standard deviation.

s_log_age=(3*(mean(log_age)-median(log_age)))/sd(log_age) # Skewness for Log Transformation
s_log_age
## [1] -0.1756795
s_sqrt_age=(3*(mean(sqrt_age)-median(sqrt_age)))/sd(sqrt_age)# Skewness for Square-root Transformation
s_sqrt_age
## [1] 0.09310181
s_inv_age=(3*(mean(inv_age)-median(inv_age)))/sd(inv_age)# Skewness for Inverse Square-root Transformation
s_inv_age
## [1] 0.4366356
A skewness closer to 0 indicates a more symmetric distribution. The square-root transformation has the smallest absolute skewness (0.093), compared with the log transformation (-0.176) and the inverse square-root transformation (0.437), so I think the square-root transformation is the method that yields the best result.
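
The skewness comparison can be wrapped in a small helper so that the untransformed variable is included as a baseline and the value closest to 0 is picked automatically. This is a sketch only: `pearson_skew` is a helper name introduced here, and the `age` vector is hypothetical.

```r
# Same skewness formula as above: 3 * (mean - median) / sd.
pearson_skew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

# Hypothetical ages, for illustration only.
age <- c(17, 19, 22, 25, 28, 31, 35, 37, 40, 44, 48, 55, 63, 90)

# Skewness of the raw variable and of each candidate transformation.
skews <- c(
  raw      = pearson_skew(age),
  log      = pearson_skew(log(age)),
  sqrt     = pearson_skew(sqrt(age)),
  inv_sqrt = pearson_skew(1 / sqrt(age))
)

# The most symmetric candidate is the one with the smallest absolute skewness.
best <- names(which.min(abs(skews)))
```

Comparing absolute values matters here because the sign only indicates the direction of the skew, not its size.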