How to download and prepare a dataset form the UCL machine learning repository

Publicly available datasets are very helpful for self-studying statistical models. The UCL machine learning repository has a wide list of datasets to be used for exercise. You can accede to the repository from here: https://archive.ics.uci.edu/ml/datasets.html

Most datasets are for ‘regression’ or ‘classification’. For each dataset, you’ll found a ‘Data Folder’ containing both the data and its description. The description report information on the source of the data and past usage of it. There is also a list of the variables (usually: Name, Data Type, Measure, and so on).

To be used, a dataset must be downloaded from the repository, uploaded in your R console, renamed and cleaned. The renaming of the dataset is necessary, since in general variables in the dataset have not their proper head.

So, what to do?

As an example, I will consider the “Adult Data Set” or “Census Income” dataset, which was suggested in this post in StackoverflowR:

http://stackoverflow.com/questions/29213915/prediction-using-rpart-on-new-factor-categorical-variables

Download

Click on the dataset and save it into your folder.

The data will be saved as “adultData.txt” but it is, indeed, a csv file.

Uploading

This is the most tricky part. Files in csv from outside Europe can be uploaded with the library(foreign), using the read.csv command. Files in csv from outside Europe can be uploaded with the library(foreign), using the read.csv2 command.

Let updload the file. I tend to use the command “file.choose()”, which opens a dialogue window (easier than set the working directory). You have to set “header” to FALSE since the dataset only has the values of the variables,

The head (name) of the variables have to be added, and when there are less than 30 variables, the simplest way is to do this by hand.

  library(foreign)
    adultData <- read.csv(file.choose(), header = FALSE)

Then you need to add the head to the variables.

names(adultData) <- c("age",
    "workclass",
    "fnlwgt",
    "education", 
    "education.num",     
    "marital.status",         
    "occupation",   
    "relationship",   
    "race",     
    "sex",
    "capital.gain",  
    "capital.loss", 
    "hours.per.week", 
    "native.country",  
    "salary")

Check the data.

dim(adultData)
head(adultData)
names(adultData)

Now you can save again your reworked dataset for future use.

I suggest use row.names=FALSE to avoid the creation of a list of rownames with numbers, quote=FALSE to avoid the saving add quote to some string. I use write.csv2 since I am in Europe. If you are not in Europe, use write.csv. By default, the file is saved in the current working directory.

write.csv2(dat, file = "adultDataRenamed.csv", row.names=FALSE, quote=FALSE)

Problems with downloaded/uploaded data

When I downloaded and then updloaded the data, I’ve recognized a problem that is common with factors: the label had extra-space, as a consequence, when you call the labels (e.g., “Bachelors”), the system does not recognize it, since in the factor this levels has an extra-space:

" Bachelors“.

You can see this by calling the levels of the factor: levels(adultData$education).

You can remove whitespaces in a read call by setting the strip.white parameter to TRUE.

The dataset is too wide to be published here. My clean dataset can be dowloaded from here: http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r

Since I am in an European country, you should use read.csv2 to use my dataset.

If you upload the dataset in the standard way, you can say that the factors’ labels have extra space.

adultData <- read.csv2(file.choose(), header = TRUE)

levels(adultData$education)

[1] " 10th" " 11th" " 12th" " 1st-4th"
[5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
[13] " Masters" " Preschool" " Prof-school" " Some-college"

If you upload the dataset with strip.white = TRUE, you can see that factors’ labels have no extra space.

adultData <- read.csv2(file.choose(), header = TRUE, strip.white = TRUE)

levels(adultData$education)

[1] “10th” “11th” “12th” “1st-4th” “5th-6th”
[6] “7th-8th” “9th” “Assoc-acdm” “Assoc-voc” “Bachelors”
[11] “Doctorate” “HS-grad” “Masters” “Preschool” “Prof-school” [16] “Some-college”

Now, let keep the data from the repository I’ve created

        adultData <- read.csv2("http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r", header = TRUE, strip.white = TRUE)

Always take a look at the data

    dim(adultData)
## [1] 32561    15
        head(adultData)
##   age        workclass fnlwgt education education.num     marital.status
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital.gain capital.loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hours.per.week native.country salary
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K
        str(adultData)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education.num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital.gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native.country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ salary        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
        summary(adultData)
##       age                  workclass         fnlwgt       
##  Min.   :17.0   Private         :22696   Min.   :  12285  
##  1st Qu.:28.0   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.0   Local-gov       : 2093   Median : 178356  
##  Mean   :38.6   ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.0   State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.0   Self-emp-inc    : 1116   Max.   :1484705  
##                 (Other)         :  981                    
##         education     education.num                marital.status 
##  HS-grad     :10501   Min.   : 1.0   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.0   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.0   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.1   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.0   Never-married        :10683  
##  11th        : 1175   Max.   :16.0   Separated            : 1025  
##  (Other)     : 5134                  Widowed              :  993  
##            occupation           relationship                   race      
##  Prof-specialty :4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   :4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial:4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   :3770   Own-child     : 5068   Other             :  271  
##  Sales          :3650   Unmarried     : 3446   White             :27816  
##  Other-service  :3295   Wife          : 1568                             
##  (Other)        :9541                                                    
##      sex         capital.gain    capital.loss  hours.per.week
##  Female:10771   Min.   :    0   Min.   :   0   Min.   : 1.0  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0   1st Qu.:40.0  
##                 Median :    0   Median :   0   Median :40.0  
##                 Mean   : 1078   Mean   :  87   Mean   :40.4  
##                 3rd Qu.:    0   3rd Qu.:   0   3rd Qu.:45.0  
##                 Max.   :99999   Max.   :4356   Max.   :99.0  
##                                                              
##        native.country    salary     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  ?            :  583                
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709

A simple function to check the dataset

Assign the dataset to an easy vector.

    df <- adultData

Redefine the dataset (you do not want ‘fnlwgt’ in your dataset, believe me).

    df <- df[, c(1:2,4:15)]

Apply a for cycle, using the number of columns as a counter. You can specify this by using dim(df)[2], because ‘dim’ return just two numbers: the number of rows (how may cases you have), the number of columns (how many variables you have in the dataset).

    for (i in 1:dim(df)[2]) { 

        na <- names(df[i])

        tab <- table(df[,i])
        
        print(i)        ## the number of the column where is the variable in the dataset
        print(na)       ## name of the variable
        print(class(df[,i]))    ## class of the variable
        print(tab)      ## frequency distribution of the values in the variable
        cat("\n")

        }
## [1] 1
## [1] "age"
## [1] "integer"
## 
##  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34 
## 395 550 712 753 720 765 877 798 841 785 835 867 813 861 888 828 875 886 
##  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52 
## 876 898 858 827 816 794 808 780 770 724 734 737 708 543 577 602 595 478 
##  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70 
## 464 415 419 366 358 366 355 312 300 258 230 208 178 150 151 120 108  89 
##  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88 
##  72  67  64  51  45  46  29  23  22  22  20  12   6  10   3   1   1   3 
##  90 
##  43 
## 
## [1] 2
## [1] "workclass"
## [1] "factor"
## 
##                ?      Federal-gov        Local-gov     Never-worked 
##             1836              960             2093                7 
##          Private     Self-emp-inc Self-emp-not-inc        State-gov 
##            22696             1116             2541             1298 
##      Without-pay 
##               14 
## 
## [1] 3
## [1] "education"
## [1] "factor"
## 
##         10th         11th         12th      1st-4th      5th-6th 
##          933         1175          433          168          333 
##      7th-8th          9th   Assoc-acdm    Assoc-voc    Bachelors 
##          646          514         1067         1382         5355 
##    Doctorate      HS-grad      Masters    Preschool  Prof-school 
##          413        10501         1723           51          576 
## Some-college 
##         7291 
## 
## [1] 4
## [1] "education.num"
## [1] "integer"
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##    51   168   333   646   514   933  1175   433 10501  7291  1382  1067 
##    13    14    15    16 
##  5355  1723   576   413 
## 
## [1] 5
## [1] "marital.status"
## [1] "factor"
## 
##              Divorced     Married-AF-spouse    Married-civ-spouse 
##                  4443                    23                 14976 
## Married-spouse-absent         Never-married             Separated 
##                   418                 10683                  1025 
##               Widowed 
##                   993 
## 
## [1] 6
## [1] "occupation"
## [1] "factor"
## 
##                 ?      Adm-clerical      Armed-Forces      Craft-repair 
##              1843              3770                 9              4099 
##   Exec-managerial   Farming-fishing Handlers-cleaners Machine-op-inspct 
##              4066               994              1370              2002 
##     Other-service   Priv-house-serv    Prof-specialty   Protective-serv 
##              3295               149              4140               649 
##             Sales      Tech-support  Transport-moving 
##              3650               928              1597 
## 
## [1] 7
## [1] "relationship"
## [1] "factor"
## 
##        Husband  Not-in-family Other-relative      Own-child      Unmarried 
##          13193           8305            981           5068           3446 
##           Wife 
##           1568 
## 
## [1] 8
## [1] "race"
## [1] "factor"
## 
## Amer-Indian-Eskimo Asian-Pac-Islander              Black 
##                311               1039               3124 
##              Other              White 
##                271              27816 
## 
## [1] 9
## [1] "sex"
## [1] "factor"
## 
## Female   Male 
##  10771  21790 
## 
## [1] 10
## [1] "capital.gain"
## [1] "integer"
## 
##     0   114   401   594   914   991  1055  1086  1111  1151  1173  1409 
## 29849     6     2    34     8     5    25     4     1     8     3     7 
##  1424  1455  1471  1506  1639  1797  1831  1848  2009  2036  2050  2062 
##     3     1     7    15     1     7     7     6     3     4     5     2 
##  2105  2174  2176  2202  2228  2290  2329  2346  2354  2387  2407  2414 
##     9    48    23    16     5     5     6     6    11     1    19     8 
##  2463  2538  2580  2597  2635  2653  2829  2885  2907  2936  2961  2964 
##    11     1    12    20    11     5    31    24    11     3     3     9 
##  2977  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464  3471 
##     8     2    97    37     6    53    24     5     4     2    23     8 
##  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650 
##    14    12     7     6    32    14    42    20    70    12    12    41 
##  4687  4787  4865  4931  4934  5013  5060  5178  5455  5556  5721  6097 
##     3    23    17     1     7    69     1    97    11     5     3     1 
##  6360  6418  6497  6514  6723  6767  6849  7298  7430  7443  7688  7896 
##     3     9    11     5     2     5    27   246     9     5   284     3 
##  7978  8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 
##     1    55    22     4    43     6    12     2    27    41    26     5 
## 15024 15831 18481 20051 22040 25124 25236 27828 34095 41310 99999 
##   347     6     2    37     1     4    11    34     5     2   159 
## 
## [1] 11
## [1] "capital.loss"
## [1] "integer"
## 
##     0   155   213   323   419   625   653   810   880   974  1092  1138 
## 31042     1     4     3     3    12     3     2     6     2     7     2 
##  1258  1340  1380  1408  1411  1485  1504  1539  1564  1573  1579  1590 
##     4     7     7    21     1    51    18     1    25     6    20    40 
##  1594  1602  1617  1628  1648  1651  1668  1669  1672  1719  1721  1726 
##     8    47     9    15     2     9     4    24    34    22    18     4 
##  1735  1740  1741  1755  1762  1816  1825  1844  1848  1876  1887  1902 
##     2    42    24     2    14     2     4     1    51    39   159   202 
##  1944  1974  1977  1980  2001  2002  2042  2051  2057  2080  2129  2149 
##     1    18   168    23    24    21     9    21     6     1     3     2 
##  2163  2174  2179  2201  2205  2206  2231  2238  2246  2258  2267  2282 
##     1     7    15     1     9     6     3     2     6    25     3     1 
##  2339  2352  2377  2392  2415  2444  2457  2467  2472  2489  2547  2559 
##    17     2    20     9    49    12     3     1     1     1     4    12 
##  2603  2754  2824  3004  3683  3770  3900  4356 
##     5     2    10     2     2     2     2     3 
## 
## [1] 12
## [1] "hours.per.week"
## [1] "integer"
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##    20    32    39    54    60    64    26   145    18   278    11   173 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##    23    34   404   205    29    75    14  1224    24    44    21   252 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##   674    30    30    86     7  1149     5   266    39    28  1297   220 
##    37    38    39    40    41    42    43    44    45    46    47    48 
##   149   476    38 15217    36   219   151   212  1824    82    49   517 
##    49    50    51    52    53    54    55    56    57    58    59    60 
##    29  2819    13   138    25    41   694    97    17    28     5  1475 
##    61    62    63    64    65    66    67    68    70    72    73    74 
##     2    18    10    14   244    17     4    12   291    71     2     1 
##    75    76    77    78    80    81    82    84    85    86    87    88 
##    66     3     6     8   133     3     1    45    13     2     1     2 
##    89    90    91    92    94    95    96    97    98    99 
##     2    29     3     1     1     2     5     2    11    85 
## 
## [1] 13
## [1] "native.country"
## [1] "factor"
## 
##                          ?                   Cambodia 
##                        583                         19 
##                     Canada                      China 
##                        121                         75 
##                   Columbia                       Cuba 
##                         59                         95 
##         Dominican-Republic                    Ecuador 
##                         70                         28 
##                El-Salvador                    England 
##                        106                         90 
##                     France                    Germany 
##                         29                        137 
##                     Greece                  Guatemala 
##                         29                         64 
##                      Haiti         Holand-Netherlands 
##                         44                          1 
##                   Honduras                       Hong 
##                         13                         20 
##                    Hungary                      India 
##                         13                        100 
##                       Iran                    Ireland 
##                         43                         24 
##                      Italy                    Jamaica 
##                         73                         81 
##                      Japan                       Laos 
##                         62                         18 
##                     Mexico                  Nicaragua 
##                        643                         34 
## Outlying-US(Guam-USVI-etc)                       Peru 
##                         14                         31 
##                Philippines                     Poland 
##                        198                         60 
##                   Portugal                Puerto-Rico 
##                         37                        114 
##                   Scotland                      South 
##                         12                         80 
##                     Taiwan                   Thailand 
##                         51                         18 
##            Trinadad&Tobago              United-States 
##                         19                      29170 
##                    Vietnam                 Yugoslavia 
##                         67                         16 
## 
## [1] 14
## [1] "salary"
## [1] "factor"
## 
## <=50K  >50K 
## 24720  7841

It will always happen that some variable have to be explicitly called as factor, and some other variables have to be reorganized.

Further inspection of the dataset

It is a good thing to plot the associations between variables.

In this dataset you may be interested in looking whether there is a link between marital status and salary.

    sel <- c("marital.status", "salary")        # criterion for selection
    dat <- subset(adultData, select = sel)      # subset based on the predefined selection

                            # Rename the labels of the factor 'marital.status' for plotting
                            # Added \n' to make room for the label when plotted
    levels(dat$marital.status) <- c("Divorced", "Married \n AF-spouse","Married \n civ-spouse", 
        "Married \nspouse-absent", "Never-married", "Separated", "Widowed")         
    
    library(wesanderson)                # Did I mention I'm a fan of Wes Anderson's movies?
## Warning: package 'wesanderson' was built under R version 3.1.3
    pal <- wes_palette(name = "GrandBudapest", type = "discrete") # this does not work in knitr, unless you have installed the package in RStudio

    par(las=2)                  # Rotated labels
    par(mar = c(9, 5, 5, 3) + 0.1)          # Increase bottom margin to make room for rotated labels
                            # In the label of y axis added '\n' to make room for the label
                            # Use of expression(italic("Your label")) to have the lables printed in italic
    plot(dat, col = pal, ylab=expression(italic("Salary\n")),xlab="", main=expression(italic("Marital status")), cex=0.8)
## Warning: metrica carattere sconosciuta per carattere 0xa
## Warning: metrica carattere sconosciuta per carattere 0xa

plot of chunk unnamed-chunk-6

Essentially, if you are ‘Married-civ-spouse’ you are more likely to get a salary with more than 50K US dollars.

Now, something more difficult: worked hours per week by country.

    sel <- c("native.country", "hours.per.week")        # criterion for selection
    dat <- subset(adultData, select = sel)      # subset based on the predefined selection

    
    library(wesanderson)                # I've already mentioned I'm a fan of Wes Anderson's movies
    pal <- wes_palette(name = "Royal1", type = "continuous")

    par(las=2)                  # Rotated labels
    par(mar = c(9, 5, 5, 3) + 0.1)          # Increased bottom margin to make room for rotated labels
                            # Use of expression(italic("Your label")) to have the lables printed in italic
    plot(dat, col = pal, ylab=expression(italic("Worked hours per week")),xlab="",  cex.axis=0.7,cex.lab=1.25, col.lab="#D67236", 
        main=expression(italic("Native country")), col.main="#D67236")

plot of chunk unnamed-chunk-7

    dev.off()
## null device 
##           1

According to the native country, some people work more hours per week than other, and yes, there is a link between worked hours per week and salary. You work more, you get a better salary (there are a lot of outliers, though).

    sel <- c("salary", "hours.per.week")        # criterion for selection
    dat <- subset(adultData, select = sel)      # subset based on the predefined selection

    
    library(wesanderson)                # I've already mentioned I'm a fan of Wes Anderson's movies
    pal <- wes_palette(name = "Cavalcanti", type = "continuous")

    par(las=2)                  # Rotated labels
    par(mar = c(9, 5, 5, 3) + 0.1)          # Increased bottom margin to make room for rotated labels
                            # Use of expression(italic("Your label")) to have the lables printed in italic
    plot(dat, col = pal, ylab=expression(italic("Worked hours per week")),xlab=expression(italic("Salary")), 
        cex.axis=0.7,cex.lab=1.25, col.lab="#81A88D")

plot of chunk unnamed-chunk-8

In this dataset you may be also interested in looking whether there is a link between education and salary.

        # reordering of the levels of the 'education' factor, to have the levels ordered from lowest to highest education

    df$education<- ordered(df$education,levels = c("Preschool","1st-4th","5th-6th","7th-8th","9th","10th","11th","12th","HS-grad","Some-college","Assoc-voc","Assoc-acdm","Bachelors","Masters","Prof-school","Doctorate"))
      par(mar = c(9, 8, 5, 3) + 0.1)
      par(las=2)
    
    # creation of a palette with colors between two intervals (in this example, blue to red)
        colfunc <- colorRampPalette(c("blue", "red"))

    # the colfunc is used to create 16 intervals of colors between the two main colors

        plot(df$education~df$salary, col=colfunc(16), ylab="",xlab="Salary", cex.axis=0.7)

plot of chunk unnamed-chunk-9

Apparently, as higher the education, as more likely to get a salary with more than 50K US dollars.

Any gender difference in salary distribution?

        plot(df$sex~df$salary, col=c("green", "yellow"), ylab="Sex",xlab="Salary")

plot of chunk unnamed-chunk-10

Here we go: female workers are less likely to get a salary with more than 50K US dollars.

Any link between gender and education?

      par(mar = c(9, 8, 5, 3) + 0.1)
      par(las=2)

# creation of a palette with colors between two intervals (in this example, white to black)
      
        colfunc <- colorRampPalette(c("white", "black"))

    # the colfunc is used to create 16 intervals of colors between the two main colors

        plot(df$education~df$sex, col=colfunc(16), ylab="",xlab="Sex")

plot of chunk unnamed-chunk-11

Apparently, males had higher education than females.

        prop <- prop.table(table(df$sex,df$education))
        round(prop*100,1)
##         
##          Preschool 1st-4th 5th-6th 7th-8th  9th 10th 11th 12th HS-grad
##   Female       0.0     0.1     0.3     0.5  0.4  0.9  1.3  0.4    10.4
##   Male         0.1     0.4     0.8     1.5  1.1  2.0  2.3  0.9    21.8
##         
##          Some-college Assoc-voc Assoc-acdm Bachelors Masters Prof-school
##   Female          8.6       1.5        1.3       5.0     1.6         0.3
##   Male           13.8       2.7        2.0      11.5     3.6         1.5
##         
##          Doctorate
##   Female       0.3
##   Male         1.0
        options(scipen=999) # this will eliminate the scientific notation with exponentials
        chisq.test(df$sex,df$education)
## 
##  Pearson's Chi-squared test
## 
## data:  df$sex and df$education
## X-squared = 297.7, df = 15, p-value < 0.00000000000000022

Is this difference a reasonable explanation for gender differences in salary?

A smart guy, Duy Bui, posted an interesting model of this dataset, which was based on the ‘caret’ package (a very interesting tool for predictive analytics and machine learning). You may found information on this package at: http://topepo.github.io/caret/index.html

Here the model of Duy Bui:

    library(caret)      # same as before, this does not work in knitr, unless you have installed the package in RStudio
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2

The variables that were used in the model were a selection of the attributes.

    selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
    adultData <- subset(adultData, select = selected)

The training set and the test set were created by using a useful feature of the ‘caret’ package, the command ‘createDataPartition’. A random index is created, which can be used to select the training set and the test set. I’ve reduced the data set to 60% only (which is accetable for training).

    trainIndex = createDataPartition(adultData$salary, p=0.60, list=FALSE)
    training = adultData[ trainIndex, ]

I also added a test-set.

    test = adultData[ -trainIndex, ]

Model fitting

The model is based on the recursive partitioning for classification, regression and survival trees (rpart). See here for details: http://cran.r-project.org/web/packages/rpart/index.html

    set.seed(33833) # for reproducibility

    modFit <- train(salary ~ ., method = "rpart", data=training)
## Loading required package: rpart

This analysis will take some time, so I add an alarm sound to warn me when the analysis had completed.

Why I do not use alarm()? Because, for no clear reasons, it does not work in my PC.

You can skip this

library(audio)

beep <- function(n = 4){
        for(i in seq(n)){
        play(sin(1:10000/20))
        Sys.sleep(.5)
        }
    }


beep()
### done
### done

Overall accuracy (tested on the test set, a NEW dataset than the one in which the model had been evaluated). This script is based on ‘Practical Data Science with R’, authored by Nina Zumel and John Mount, a very good reading.

    prediction <- predict(modFit, newdata=test)

    tab <- table(prediction, test$salary)

    sum(diag(tab))/sum(tab)
## [1] 0.8142

Better testing with the caret package.

    rpartPred<-predict(modFit,test)

    confusionMatrix(rpartPred,test$salary) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  9419 1951
##      >50K    469 1185
##                                              
##                Accuracy : 0.814              
##                  95% CI : (0.807, 0.821)     
##     No Information Rate : 0.759              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.394              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.953              
##             Specificity : 0.378              
##          Pos Pred Value : 0.828              
##          Neg Pred Value : 0.716              
##              Prevalence : 0.759              
##          Detection Rate : 0.723              
##    Detection Prevalence : 0.873              
##       Balanced Accuracy : 0.665              
##                                              
##        'Positive' Class : <=50K              
## 

Plot the model (not really clear).

    library(rattle)
## Warning: package 'rattle' was built under R version 3.1.3
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
    fancyRpartPlot(modFit$finalModel)

plot of chunk unnamed-chunk-20

Alternative.

    library(partykit)
## Warning: package 'partykit' was built under R version 3.1.3
## Loading required package: grid
    finalModel <-as.party(modFit$finalModel)
  par(mar = c(5, 8, 5, 3) + 0.1)
    plot(finalModel)

plot of chunk unnamed-chunk-21

We are interested in the prediction with the new data value as specified by the investigator:

    fem <- data.frame(age = 40, education = "Masters", marital.status = "Married-civ-spouse", relationship = "Own-child", sex = "Female", hours.per.week = 40)

    predict(modFit, newdata = fem)
## [1] >50K
## Levels: <=50K >50K
    mal <- data.frame(age = 40, education = "Masters", marital.status = "Married-civ-spouse", relationship = "Own-child", sex = "Male", hours.per.week = 40)

    predict(modFit, newdata = mal)
## [1] >50K
## Levels: <=50K >50K

Same socio-demographic profile but different gender seems not related to differences in salary. Uhm!… Probably more analyses are required.

I hope you’ve enjoyed this!

Twitter: @AntoViral

    library(knitr)