Chapter 1 - Programming in R

Slicing and Dicing Data in R - Visualizations

For this tutorial, we will use the ‘lattice’ package, which provides the functions needed to create some interesting visualizations.

We will work with three different datasets to show how the data can be manipulated and understood better.

We will not perform hypothesis testing or statistical analysis on the datasets. Instead, we will get familiar with the tools of the trade and see how to use R more effectively to give shape to the data we have.

library(lattice)
library(nutshell)

## Warning: package 'nutshell' was built under R version 3.4.3

## Loading required package: nutshell.bbdb

## Warning: package 'nutshell.bbdb' was built under R version 3.4.3

## Loading required package: nutshell.audioscrobbler

## Warning: package 'nutshell.audioscrobbler' was built under R version 3.4.3

Loading and Viewing Data into R workspace

In this example, we load the data table into the workspace from a package (nutshell) rather than reading it from a CSV file.

data(births2006.smpl)
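
For contrast, here is a minimal sketch of the two loading routes. It assumes the built-in mtcars dataset (chosen only because it ships with R) and a temporary CSV file created purely for illustration; neither is part of this tutorial's data.

```r
# Route 1: load a dataset that ships with a package (here base R's 'datasets').
data(mtcars)

# Route 2: read the same table from a CSV file.
# The temporary file is hypothetical and created here only for illustration.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)
cars_from_csv <- read.csv(csv_path, row.names = 1)

# Both routes give a data frame with the same shape.
dim(mtcars)
dim(cars_from_csv)
```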

Viewing the Data

We can view the data in a few different ways, depending on what is most convenient:

head(births2006.smpl, n = 10)

##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 591430       9      1    25       2     NA   F     NA
## 1827276      2      6    28       2     26   M      9
## 1705673      2      2    18       2     25   F      9
## 3368269     10      5    21       2      6   M      9
## 2990253      7      7    25       1     36   M     10
## 966967       3      3    28       3     35   M      8
## 3527960      5      2    33       2     26   M      9
## 1204799      4      7    31       3     25   F      9
## 3690420     10      3    18       1     46   F      9
## 2016880      4      4    24       2     43   M      9
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 591430                    NULL      10      99   Vaginal 1 Single 3800
## 1827276     2 years of college      10      37   Vaginal 1 Single 3625
## 1705673                   NULL      14      38   Vaginal 1 Single 3650
## 3368269                   NULL      22      38   Vaginal 1 Single 3045
## 2990253 2 years of high school      15      40   Vaginal 1 Single 3827
## 966967                    NULL      18      39   Vaginal 1 Single 3090
## 3527960                   NULL      10      38 C-section 1 Single 3430
## 1204799     2 years of college      19      38 C-section 1 Single 3204
## 3690420                   NULL      15      40 C-section 1 Single 3227
## 2016880 2 years of high school      13      40   Vaginal 1 Single 3459

#This shows the first n rows of the data frame passed as the first argument

tail(births2006.smpl, n = 10)

##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 3415553      1      2    21       2     20   M      9
## 34644        5      6    17       1     36   F      9
## 1784359      7      3    34       2     21   F      9
## 3099555      3      5    28       2     38   M      9
## 219509       1      3    20       2     NA   F     NA
## 2050468      7      4    20       1     58   F      8
## 2961820     12      3    30       2     13   F      9
## 887843      11      2    34       2     28   M      9
## 2076215      9      5    32       5     28   M      9
## 4146535     11      2    31       4     NA   F     10
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 3415553                   NULL      13      37   Vaginal 1 Single 2955
## 34644    1 year of high school       7      38   Vaginal 1 Single 3175
## 1784359 4 years of high school      13      39 C-section 1 Single 3634
## 3099555                   NULL      99      99   Vaginal 1 Single 3231
## 219509                    NULL       7      99   Vaginal 1 Single 2760
## 2050468 2 years of high school      15      39   Vaginal 1 Single 2187
## 2961820                   NULL       7      38   Vaginal 1 Single 3210
## 887843  3 years of high school       7      39   Vaginal 1 Single 3799
## 2076215 4 years of high school      18      38 C-section 1 Single 4290
## 4146535                   NULL       7      40   Vaginal 1 Single 3770

#This shows the last n rows of the data frame passed as the first argument

View(births2006.smpl)
#This opens the entire data set in a spreadsheet-style viewer (a new tab in RStudio)

births2006.smpl[1:5,]

##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 591430       9      1    25       2     NA   F     NA
## 1827276      2      6    28       2     26   M      9
## 1705673      2      2    18       2     25   F      9
## 3368269     10      5    21       2      6   M      9
## 2990253      7      7    25       1     36   M     10
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 591430                    NULL      10      99   Vaginal 1 Single 3800
## 1827276     2 years of college      10      37   Vaginal 1 Single 3625
## 1705673                   NULL      14      38   Vaginal 1 Single 3650
## 3368269                   NULL      22      38   Vaginal 1 Single 3045
## 2990253 2 years of high school      15      40   Vaginal 1 Single 3827

#This is the [row, column] way of indexing the data table. 1:5 selects rows 1 to 5. Leaving the position after the comma blank selects ALL the columns.

births2006.smpl[1:5,1:2]

##         DOB_MM DOB_WK
## 591430       9      1
## 1827276      2      6
## 1705673      2      2
## 3368269     10      5
## 2990253      7      7

#This command will give only the first 5 rows and the first 2 columns.
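
The same bracket notation also accepts column names and logical conditions. The sketch below uses a small, made-up data frame (df and its columns are hypothetical, not taken from births2006.smpl) so that it runs on its own.

```r
# Small hypothetical data frame, used only to illustrate indexing.
df <- data.frame(
  id     = 1:6,
  sex    = c("F", "M", "F", "M", "M", "F"),
  weight = c(3800, 3625, 3650, 3045, 3827, 3090)
)

# Rows 1 to 3, all columns (blank after the comma).
first_three <- df[1:3, ]
first_three

# All rows, columns selected by name instead of position.
by_name <- df[, c("sex", "weight")]
by_name

# Rows that satisfy a condition: only the records with sex == "M".
males <- df[df$sex == "M", ]
males
```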

dim(births2006.smpl)

## [1] 427323     13

#This command gives the total number of rows and columns present in the data set that was loaded
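
Besides head(), tail() and dim(), base R offers str() and summary() for a compact overview of a data frame. The sketch below assumes the built-in iris dataset only so that it is self-contained; the same calls should work on births2006.smpl.

```r
# Built-in dataset, assumed here only so the sketch runs without 'nutshell'.
data(iris)

# Row and column counts individually (dim() returns both at once).
n_rows <- nrow(iris)
n_cols <- ncol(iris)
c(n_rows, n_cols)

# Column types plus the first few values of each column.
str(iris)

# Per-column summaries: quartiles for numeric columns, counts for factors.
summary(iris)
```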

Moving to Visualizations

#Command for a simple histogram
hist(births2006.smpl$DOB_WK)

##Creates a frequency table for the 7 days of the week. The table() function (lowercase; R is case-sensitive) is used for creating a frequency table.
births.dow <- table(births2006.smpl$DOB_WK)
View(births.dow)

#Command for a barchart
barchart(births.dow, ylab = "Days of the Week", col='black')
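
A frequency table is easier to read when the numeric codes carry labels. The following is a minimal sketch on made-up day codes. The mapping of code 1 to Sunday is an assumption and should be checked against the dataset documentation before it is applied to DOB_WK.

```r
# Hypothetical day-of-week codes (1 to 7), standing in for a column such as DOB_WK.
day_code <- c(1, 2, 2, 3, 5, 5, 5, 7, 7, 4, 6, 2)

# Assumption: code 1 means Sunday. Verify this before using it on real data.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day <- factor(day_code, levels = 1:7, labels = day_labels)

# table() counts how often each level occurs.
day_counts <- table(day)
day_counts

# Convert the counts to proportions.
prop.table(day_counts)
```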

Using the table() function to create an R x C Contingency Table

## This creates a Week X Method table from the given data
dob.dm.method <- table(week=births2006.smpl$DOB_WK,method = births2006.smpl$DMETH_REC)
dob.dm.method

##     method
## week C-section Unknown Vaginal
##    1      8836      90   31348
##    2     20454     272   42031
##    3     22921     247   46607
##    4     23103     252   46935
##    5     22825     258   47081
##    6     23233     289   44858
##    7     10696     109   34878

## We need to delete the Column Unknown since it doesn't help us with our analysis
dob.dm.method <- dob.dm.method[,-2]
dob.dm.method

##     method
## week C-section Vaginal
##    1      8836   31348
##    2     20454   42031
##    3     22921   46607
##    4     23103   46935
##    5     22825   47081
##    6     23233   44858
##    7     10696   34878
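
Dropping a column by position (-2) breaks silently if the column order ever changes. As a hedged alternative, the sketch below removes the column by name. The counts are made up and only mimic the layout of the week x method table.

```r
# Hypothetical counts arranged like the week x method table in the text.
counts <- matrix(
  c(10, 1, 30,
    20, 2, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(week = c("1", "2"),
                  method = c("C-section", "Unknown", "Vaginal"))
)
tab <- as.table(counts)

# Keep every column whose name is not "Unknown".
tab_known <- tab[, colnames(tab) != "Unknown"]
tab_known
```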

Playing with Barcharts

##Drawing a barchart - type 1 (horizontal bars; by default the delivery methods are stacked within each bar)
barchart(dob.dm.method,ylab="Day of Week")

## Drawing a barchart - type 2 (vertical bars; groups=FALSE draws a separate panel for each delivery method)
barchart(dob.dm.method,horizontal=FALSE,groups=FALSE,xlab="Day of Week",col="black")

Histograms - Multiple ways of plotting

histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")

histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col="black")

Density Plots

## We can use the | operator to condition the plot on a variable: one panel is drawn for each category of DPLURAL
densityplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),plot.points=FALSE,col="black")

##Different categories overlaid in the same plot - using the groups argument instead of the | operator
densityplot(~DBWT,groups=DPLURAL,data=births2006.smpl,plot.points=FALSE)
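
The difference between conditioning (|) and grouping can be tried out without the nutshell package. This sketch assumes the built-in iris dataset, and adds auto.key = TRUE, which asks lattice for a legend so that the overlaid curves can be told apart.

```r
library(lattice)

# Built-in 'iris' data, assumed here only so the sketch runs without 'nutshell'.
# Conditioning with |: one panel per species, stacked in a single column.
p_panels <- densityplot(~ Sepal.Length | Species, data = iris,
                        layout = c(1, 3), plot.points = FALSE)

# Grouping: all species overlaid in one panel, with a legend.
p_groups <- densityplot(~ Sepal.Length, groups = Species, data = iris,
                        plot.points = FALSE, auto.key = TRUE)

# Lattice functions return objects; print() draws them.
print(p_panels)
print(p_groups)
```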

Dotplots and xyPlots

## Plot each data point along the DBWT axis
# In this case, the DBWT points are split into panels by DPLURAL using the | operator
# (plot.points is a densityplot argument, so it is not needed here)
dotplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),col="black")

xyplot(DBWT~DOB_WK,data=births2006.smpl,col="black")

xyplot(DBWT~DOB_WK|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")

xyplot(DBWT~WTGAIN,data=births2006.smpl,col="black")

xyplot(DBWT~WTGAIN|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")

## Smooth Scatter Plot
smoothScatter(births2006.smpl$WTGAIN,births2006.smpl$DBWT)

Boxplots (standard function and lattice package function)

boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="DBWT",xlab="APGAR5")

boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="DBWT",xlab="Day of Week")

## bwplot is the command for a box plot in the lattice graphics
## package. There you need to declare the conditioning variables
## as factors
bwplot(DBWT~factor(APGAR5)|factor(SEX),data=births2006.smpl, xlab="APGAR5")

bwplot(DBWT~factor(DOB_WK),data=births2006.smpl, xlab="Day of Week")

Applying some Functions to the data columns in order to get better insights

## Compute the mean birth weight (DBWT) for each of the DPLURAL categories
fac=factor(births2006.smpl$DPLURAL)
res=births2006.smpl$DBWT
t4=tapply(res,fac,mean,na.rm=TRUE)
t4

##               1 Single                 2 Twin              3 Triplet 
##               3298.263               2327.478               1677.017 
##           4 Quadruplet 5 Quintuplet or higher 
##               1196.105               1142.800

## Splitting for Gender
t5=tapply(births2006.smpl$DBWT,INDEX=list(births2006.smpl$DPLURAL,births2006.smpl$SEX),FUN=mean,na.rm=TRUE)
t5

##                               F        M
## 1 Single               3242.302 3351.637
## 2 Twin                 2279.508 2373.819
## 3 Triplet              1697.822 1655.348
## 4 Quadruplet           1319.556 1085.000
## 5 Quintuplet or higher 1007.667 1345.500
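
To see what tapply() does with one and with two grouping factors, here is a small sketch on made-up values (weight, plural and sex are hypothetical vectors, not columns of births2006.smpl). The na.rm = TRUE argument is passed through to mean() so that missing weights are ignored.

```r
# Hypothetical weights (grams) with one missing value.
weight <- c(3300, 3500, NA, 2300, 2400, 3100)
plural <- factor(c("Single", "Single", "Single", "Twin", "Twin", "Single"))
sex    <- factor(c("F", "M", "F", "F", "M", "M"))

# One grouping factor: mean weight per plurality, ignoring the NA.
by_plural <- tapply(weight, plural, mean, na.rm = TRUE)
by_plural

# Two grouping factors: a plurality x sex matrix of means.
by_plural_sex <- tapply(weight, list(plural, sex), mean, na.rm = TRUE)
by_plural_sex
```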

## Plotting the results of the above Data operations
barplot(t4,ylab="DBWT")

barplot(t5,beside=TRUE,ylab="DBWT")

t6=table(births2006.smpl$ESTGEST)
t6

## 
##     12     15     17     18     19     20     21     22     23     24 
##      1      2     18     43     69    116    162    209    288    401 
##     25     26     27     28     29     30     31     32     33     34 
##    445    461    566    670    703   1000   1243   1975   2652   4840 
##     35     36     37     38     39     40     41     42     43     44 
##   7954  15874  33310  76794 109046  84890  23794   1931    133     32 
##     45     46     47     48     51     99 
##      6      5      5      2      1  57682

The frequency table above for the estimated gestation period shows that “99” is used as the code for “unknown” (57682 records). For the subsequent calculations, we omit all records with an unknown gestation period (i.e., value 99).

new=births2006.smpl[births2006.smpl$ESTGEST != 99,]
t51=table(new$ESTGEST)
t51

## 
##     12     15     17     18     19     20     21     22     23     24 
##      1      2     18     43     69    116    162    209    288    401 
##     25     26     27     28     29     30     31     32     33     34 
##    445    461    566    670    703   1000   1243   1975   2652   4840 
##     35     36     37     38     39     40     41     42     43     44 
##   7954  15874  33310  76794 109046  84890  23794   1931    133     32 
##     45     46     47     48     51 
##      6      5      5      2      1
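
Dropping the rows is one way to handle a sentinel code. Another is to keep the rows and recode the sentinel as NA. The sketch below compares both on a made-up vector (gest is hypothetical) and shows why the code must not be left in place.

```r
# Hypothetical gestation values where 99 stands for "unknown".
gest <- c(38, 40, 99, 37, 99, 41)

# Option 1: drop the records, as done in the text.
gest_known <- gest[gest != 99]

# Option 2: keep the records but mark the code as missing.
gest_na <- gest
gest_na[gest_na == 99] <- NA

mean(gest)                     # distorted by the 99 codes
mean(gest_known)               # mean of the known values only
mean(gest_na, na.rm = TRUE)    # same result as option 1
```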

t7=tapply(new$DBWT,INDEX=list(cut(new$WTGAIN,breaks=10),cut(new$ESTGEST,breaks=10)),FUN=mean,na.rm=TRUE)
t7

##              (12,15.9] (15.9,19.8] (19.8,23.7] (23.7,27.6] (27.6,31.5]
## (-0.098,9.8]       227    321.3125    486.7534    799.5614    1398.234
## (9.8,19.6]        2649    592.8235    546.7738    813.4179    1421.181
## (19.6,29.4]         NA    585.8889    590.1368    882.4800    1452.186
## (29.4,39.2]       2977   1891.0000    731.5957    866.0294    1521.757
## (39.2,49]           NA   2485.2500    803.8667    955.7639    1513.215
## (49,58.8]           NA          NA    434.7500    950.8039    1506.355
## (58.8,68.6]         NA          NA    352.0000   1285.6250    1469.508
## (68.6,78.4]         NA          NA          NA    805.5714    1463.391
## (78.4,88.2]         NA          NA          NA   1110.0000    1487.846
## (88.2,98.1]         NA          NA          NA    768.0000    1434.333
##              (31.5,35.4] (35.4,39.3] (39.3,43.2] (43.2,47.1] (47.1,51]
## (-0.098,9.8]    2275.316    3166.748    3443.652    3911.667      3310
## (9.8,19.6]      2289.950    3171.085    3434.708    3206.400        NA
## (19.6,29.4]     2307.429    3213.362    3475.328    3007.800      3969
## (29.4,39.2]     2323.002    3276.400    3535.965    3326.143      4042
## (39.2,49]       2368.520    3329.068    3605.645    3447.200        NA
## (49,58.8]       2358.658    3370.630    3650.549    3501.000        NA
## (58.8,68.6]     2367.365    3389.672    3681.233    3435.500        NA
## (68.6,78.4]     2368.205    3418.076    3694.160    3118.000        NA
## (78.4,88.2]     2447.250    3496.495    3708.868          NA        NA
## (88.2,98.1]     2481.105    3406.835    3688.067          NA        NA

levelplot(t7,scales = list(x = list(rot = 90)))

contourplot(t7,scales = list(x = list(rot = 90)))
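
The interval labels in t7 come from cut(), which bins a numeric vector into a factor. As a minimal sketch on made-up values (gain and weight are hypothetical), the following contrasts breaks given as a number of bins with breaks given as explicit cut points.

```r
# Hypothetical values, used only to show how cut() bins a numeric vector.
gain   <- c(5, 12, 18, 25, 33, 41, 8, 29)
weight <- c(3000, 3100, 3200, 3300, 3400, 3500, 2900, 3350)

# breaks = 2 asks for two equal-width intervals over the observed range.
gain_bin <- cut(gain, breaks = 2)
table(gain_bin)

# Explicit break points give bins with round, readable edges.
gain_bin_fixed <- cut(gain, breaks = c(0, 20, 40, 60))
table(gain_bin_fixed)

# Mean weight within each explicit bin.
bin_means <- tapply(weight, gain_bin_fixed, mean)
bin_means
```

Explicit break points also keep the bins comparable across datasets, since they do not depend on the observed range.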

One may wish to predict the birth weight from characteristics such as the estimated gestation period and the weight gain of the mother; for that, one could use regression and regression trees. Or one may want to identify births that lead to very low APGAR scores, for which one could use classification methods.