For this tutorial, we will use the lattice package, which provides the functions we need to create some interesting visualizations.
We will work with three different datasets to show how we can manipulate the data and understand it better.
We will not perform hypothesis testing or statistical analysis on the datasets. Instead, we will familiarize ourselves with the tools of the trade and see how to use R more effectively to give shape to the data we have.
library(lattice)
library(nutshell)
## Warning: package 'nutshell' was built under R version 3.4.3
## Loading required package: nutshell.bbdb
## Warning: package 'nutshell.bbdb' was built under R version 3.4.3
## Loading required package: nutshell.audioscrobbler
## Warning: package 'nutshell.audioscrobbler' was built under R version 3.4.3
In this example, we load the data set into the workspace from a package (the nutshell package) rather than reading it from a CSV file.
data(births2006.smpl)
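As a small illustration of the two routes, the sketch below loads a data set that ships with a package and, for contrast, reads the same table back from a CSV file. It uses `iris` from R's built-in datasets package and a temporary file as stand-ins; neither is part of this tutorial's data.

```r
# Route 1: load a data set bundled with a package.
# iris ships with the 'datasets' package (a stand-in for births2006.smpl).
data(iris)
class(iris)            # the workspace now holds a data frame

# Route 2: read a table from a CSV file.
# A temporary file keeps the example self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
iris_from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)

# Both routes give a data frame with the same shape.
dim(iris)
dim(iris_from_csv)
```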
We can view the data in a few different ways, depending on what we need:
head(births2006.smpl, n = 10)
##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 591430       9      1    25       2     NA   F     NA
## 1827276      2      6    28       2     26   M      9
## 1705673      2      2    18       2     25   F      9
## 3368269     10      5    21       2      6   M      9
## 2990253      7      7    25       1     36   M     10
## 966967       3      3    28       3     35   M      8
## 3527960      5      2    33       2     26   M      9
## 1204799      4      7    31       3     25   F      9
## 3690420     10      3    18       1     46   F      9
## 2016880      4      4    24       2     43   M      9
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 591430                    NULL      10      99   Vaginal 1 Single 3800
## 1827276     2 years of college      10      37   Vaginal 1 Single 3625
## 1705673                   NULL      14      38   Vaginal 1 Single 3650
## 3368269                   NULL      22      38   Vaginal 1 Single 3045
## 2990253 2 years of high school      15      40   Vaginal 1 Single 3827
## 966967                    NULL      18      39   Vaginal 1 Single 3090
## 3527960                   NULL      10      38 C-section 1 Single 3430
## 1204799     2 years of college      19      38 C-section 1 Single 3204
## 3690420                   NULL      15      40 C-section 1 Single 3227
## 2016880 2 years of high school      13      40   Vaginal 1 Single 3459
#This shows the first n rows of the data table we pass as parameter
tail(births2006.smpl, n = 10)
##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 3415553      1      2    21       2     20   M      9
## 34644        5      6    17       1     36   F      9
## 1784359      7      3    34       2     21   F      9
## 3099555      3      5    28       2     38   M      9
## 219509       1      3    20       2     NA   F     NA
## 2050468      7      4    20       1     58   F      8
## 2961820     12      3    30       2     13   F      9
## 887843      11      2    34       2     28   M      9
## 2076215      9      5    32       5     28   M      9
## 4146535     11      2    31       4     NA   F     10
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 3415553                   NULL      13      37   Vaginal 1 Single 2955
## 34644    1 year of high school       7      38   Vaginal 1 Single 3175
## 1784359 4 years of high school      13      39 C-section 1 Single 3634
## 3099555                   NULL      99      99   Vaginal 1 Single 3231
## 219509                    NULL       7      99   Vaginal 1 Single 2760
## 2050468 2 years of high school      15      39   Vaginal 1 Single 2187
## 2961820                   NULL       7      38   Vaginal 1 Single 3210
## 887843  3 years of high school       7      39   Vaginal 1 Single 3799
## 2076215 4 years of high school      18      38 C-section 1 Single 4290
## 4146535                   NULL       7      40   Vaginal 1 Single 3770
#This shows the last n rows of the data table we pass as parameter
View(births2006.smpl)
#This will show the entire data set in a new tab
births2006.smpl[1:5,]
##         DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 591430       9      1    25       2     NA   F     NA
## 1827276      2      6    28       2     26   M      9
## 1705673      2      2    18       2     25   F      9
## 3368269     10      5    21       2      6   M      9
## 2990253      7      7    25       1     36   M     10
##                         DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 591430                    NULL      10      99   Vaginal 1 Single 3800
## 1827276     2 years of college      10      37   Vaginal 1 Single 3625
## 1705673                   NULL      14      38   Vaginal 1 Single 3650
## 3368269                   NULL      22      38   Vaginal 1 Single 3045
## 2990253 2 years of high school      15      40   Vaginal 1 Single 3827
#This is the [x,y] way of accessing the data table. 1:5 here means we are selecting rows 1 to 5. The blank after , implies ALL the columns need to be selected.
births2006.smpl[1:5,1:2]
##         DOB_MM DOB_WK
## 591430       9      1
## 1827276      2      6
## 1705673      2      2
## 3368269     10      5
## 2990253      7      7
#This command will give only the first 5 rows and the first 2 columns.
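The same `[rows, columns]` pattern accepts names, negative positions, and logical conditions as well as ranges. A minimal sketch on a toy data frame follows; the column names and values are made up for illustration and are not taken from births2006.smpl.

```r
# Toy data frame (hypothetical values, not from births2006.smpl)
df <- data.frame(id     = 1:6,
                 sex    = c("F", "M", "F", "M", "M", "F"),
                 weight = c(3800, 3625, 3650, 3045, 3827, 3090))

first_rows <- df[1:3, ]                   # rows 1 to 3, all columns
two_cols   <- df[1:3, c("id", "weight")]  # rows 1 to 3, columns chosen by name
no_sex     <- df[, -2]                    # all rows, every column except the 2nd
heavy      <- df[df$weight > 3600, ]      # rows chosen by a logical condition

heavy$id
# [1] 1 2 3 5
```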
dim(births2006.smpl)
## [1] 427323 13
#This command gives the total number of rows and columns present in the data set that was loaded
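Alongside `dim()`, a few other base R functions give a quick picture of a data frame's shape and contents. The sketch below runs them on a toy data frame with made-up values.

```r
# Toy data frame (hypothetical values)
df <- data.frame(weight = c(3800, 3625, NA, 3045),
                 sex    = factor(c("F", "M", "F", "M")))

dim(df)      # rows and columns together
nrow(df)     # rows only
ncol(df)     # columns only
names(df)    # column names
str(df)      # type and first values of each column
summary(df)  # per-column summary, including the number of NAs
```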
#Command for a simple histogram
hist(births2006.smpl$DOB_WK)
## Creates a frequency table for the 7 days of the week. The table() function is used to create a frequency table.
births.dow <- table(births2006.smpl$DOB_WK)
View(births.dow)
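To see what `table()` returns, here is a minimal sketch on a short vector of made-up day codes. `prop.table()` and `as.data.frame()` are base R helpers that are often used next; they are additions for illustration, not steps from this tutorial.

```r
# Hypothetical day-of-week codes (not taken from births2006.smpl)
dow <- c(1, 2, 2, 3, 3, 3, 5, 5, 7)

counts <- table(dow)          # how often each distinct value occurs
counts

shares <- prop.table(counts)  # the same table as proportions that sum to 1
round(shares, 2)

# A table converts to a data frame with one row per distinct value
as.data.frame(counts)
```

Values that never occur (4 and 6 here) do not appear in the table at all.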
#Command for a barchart
barchart(births.dow, ylab = "Days of the Week", col='black')
## This creates a Week X Method table from the given data
dob.dm.method <- table(week=births2006.smpl$DOB_WK,method = births2006.smpl$DMETH_REC)
dob.dm.method
##     method
## week C-section Unknown Vaginal
##    1      8836      90   31348
##    2     20454     272   42031
##    3     22921     247   46607
##    4     23103     252   46935
##    5     22825     258   47081
##    6     23233     289   44858
##    7     10696     109   34878
## We need to delete the Column Unknown since it doesn't help us with our analysis
dob.dm.method <- dob.dm.method[,-2]
dob.dm.method
##     method
## week C-section Vaginal
##    1      8836   31348
##    2     20454   42031
##    3     22921   46607
##    4     23103   46935
##    5     22825   47081
##    6     23233   44858
##    7     10696   34878
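Dropping a column by position, as with `[,-2]`, depends on the column order. A sketch of an alternative that drops the column by name is shown here on a toy table with made-up counts; both forms give the same result when "Unknown" happens to be the second column.

```r
# Toy two-way table (hypothetical observations)
week   <- c(1, 1, 2, 2, 2, 3)
method <- c("Vaginal", "Unknown", "C-section", "Vaginal", "Vaginal", "C-section")
tab <- table(week = week, method = method)

# Drop by position: relies on "Unknown" being the 2nd column
by_position <- tab[, -2]

# Drop by name: still correct if the column order changes
by_name <- tab[, colnames(tab) != "Unknown"]

colnames(by_name)
identical(by_position, by_name)
```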
##Drawing a barchart - type 1 (Horizontal alignment, no groups within the single bar)
barchart(dob.dm.method,ylab="Day of Week")
## Drawing a barchart - type 2
barchart(dob.dm.method,horizontal=FALSE,groups=FALSE,xlab="Day of Week",col="black")
histogram(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")
histogram(~DBWT|DMETH_REC,data=births2006.smpl,layout=c(1,3),col="black")
## We can use the | operator to give a command for separating different categories of the plots based on variable DPLURAL
densityplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),plot.points=FALSE,col="black")
##Different categories in the same plot - without using any category separator
densityplot(~DBWT,groups=DPLURAL,data=births2006.smpl,plot.points=FALSE)
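The difference between conditioning with `|` and overlaying with `groups` is easier to see on a small example. The sketch below uses simulated weights for two hypothetical groups, not the births data; `auto.key = TRUE` is an optional lattice argument that adds a legend.

```r
library(lattice)

# Simulated weights for two hypothetical groups (not real birth data)
set.seed(1)
toy <- data.frame(
  weight = c(rnorm(200, mean = 3300, sd = 500),
             rnorm(200, mean = 2300, sd = 500)),
  plural = factor(rep(c("Single", "Twin"), each = 200))
)

# '|' draws one panel per level of 'plural'
p_panels <- densityplot(~ weight | plural, data = toy,
                        layout = c(1, 2), plot.points = FALSE)

# 'groups' overlays the levels in a single panel; auto.key adds a legend
p_groups <- densityplot(~ weight, groups = plural, data = toy,
                        plot.points = FALSE, auto.key = TRUE)

print(p_panels)
print(p_groups)
```

Separate panels make each distribution's shape easy to read, while the overlaid version makes the shift between groups easier to compare.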
## Plot each point in the graph where data is present
# In this case, dividing the DBWT dot points into various categories based on DPLURAL using | operator
dotplot(~DBWT|DPLURAL,data=births2006.smpl,layout=c(1,5),plot.points=FALSE,col="black")
xyplot(DBWT~DOB_WK,data=births2006.smpl,col="black")
xyplot(DBWT~DOB_WK|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")
xyplot(DBWT~WTGAIN,data=births2006.smpl,col="black")
xyplot(DBWT~WTGAIN|DPLURAL,data=births2006.smpl,layout=c(1,5), col="black")
## Smooth Scatter Plot
smoothScatter(births2006.smpl$WTGAIN,births2006.smpl$DBWT)
boxplot(DBWT~APGAR5,data=births2006.smpl,ylab="DBWT",xlab="APGAR5")
boxplot(DBWT~DOB_WK,data=births2006.smpl,ylab="DBWT",xlab="Day of Week")
## bwplot is the command for a box plot in the lattice graphics
## package. There you need to declare the conditioning variables
## as factors
bwplot(DBWT~factor(APGAR5)|factor(SEX),data=births2006.smpl, xlab="APGAR5")
bwplot(DBWT~factor(DOB_WK),data=births2006.smpl, xlab="Day of Week")
## Apply mean function to each of the weights for each of the DPLURAL categories
fac=factor(births2006.smpl$DPLURAL)
res=births2006.smpl$DBWT
t4=tapply(res,fac,mean,na.rm=TRUE)
t4
##               1 Single                 2 Twin              3 Triplet
##               3298.263               2327.478               1677.017
##           4 Quadruplet 5 Quintuplet or higher
##               1196.105               1142.800
## Splitting for Gender
t5=tapply(births2006.smpl$DBWT,INDEX=list(births2006.smpl$DPLURAL,births2006.smpl$SEX),FUN=mean,na.rm=TRUE)
t5
##                               F        M
## 1 Single               3242.302 3351.637
## 2 Twin                 2279.508 2373.819
## 3 Triplet              1697.822 1655.348
## 4 Quadruplet           1319.556 1085.000
## 5 Quintuplet or higher 1007.667 1345.500
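To make the mechanics of `tapply()` concrete, here is a minimal sketch on six made-up observations, small enough to check the group means by hand. The vector names are hypothetical.

```r
# Toy data (hypothetical values)
weight <- c(3200, 3400, 2300, 2500, 3000, NA)
plural <- c("Single", "Single", "Twin", "Twin", "Single", "Twin")
sex    <- c("F", "M", "F", "M", "F", "M")

# One grouping variable: mean weight per plurality, ignoring missing values
by_plural <- tapply(weight, plural, mean, na.rm = TRUE)
by_plural
# Single   Twin
#   3200   2400

# Two grouping variables: rows are plurality, columns are sex
by_both <- tapply(weight, list(plural, sex), mean, na.rm = TRUE)
by_both
#           F    M
# Single 3100 3400
# Twin   2300 2500
```

Without `na.rm = TRUE`, any group containing an `NA` would have a mean of `NA`.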
## Plotting the results of the above Data operations
barplot(t4,ylab="DBWT")
barplot(t5,beside=TRUE,ylab="DBWT")
t6=table(births2006.smpl$ESTGEST)
t6
##
##     12     15     17     18     19     20     21     22     23     24
##      1      2     18     43     69    116    162    209    288    401
##     25     26     27     28     29     30     31     32     33     34
##    445    461    566    670    703   1000   1243   1975   2652   4840
##     35     36     37     38     39     40     41     42     43     44
##   7954  15874  33310  76794 109046  84890  23794   1931    133     32
##     45     46     47     48     51     99
##      6      5      5      2      1  57682
The frequency table of the estimated gestation period above indicates that “99” is used as the code for “unknown”. For the subsequent calculations, we omit all records with an unknown gestation period (i.e., value 99).
new=births2006.smpl[births2006.smpl$ESTGEST != 99,]
t51=table(new$ESTGEST)
t51
##
##     12     15     17     18     19     20     21     22     23     24
##      1      2     18     43     69    116    162    209    288    401
##     25     26     27     28     29     30     31     32     33     34
##    445    461    566    670    703   1000   1243   1975   2652   4840
##     35     36     37     38     39     40     41     42     43     44
##   7954  15874  33310  76794 109046  84890  23794   1931    133     32
##     45     46     47     48     51
##      6      5      5      2      1
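One detail to watch when filtering out a code such as 99: a comparison against `NA` returns `NA`, and indexing rows with `NA` produces an all-`NA` row. The sketch below shows this on a toy data frame with made-up values. Whether ESTGEST itself contains any `NA` values is not shown by the tables above, so this is a general precaution rather than a known problem with this data set.

```r
# Toy gestation values; 99 stands for "unknown" (values are hypothetical)
toy <- data.frame(gest   = c(38, 99, 40, NA, 39),
                  weight = c(3400, 3100, 3600, 3000, 3500))

# A plain '!=' filter keeps an extra all-NA row for the NA comparison
kept_plain <- toy[toy$gest != 99, ]
nrow(kept_plain)
# [1] 4

# which() keeps only the rows where the comparison is TRUE
kept_which <- toy[which(toy$gest != 99), ]
nrow(kept_which)
# [1] 3

# Alternative: recode 99 as NA and let na.rm handle it downstream
toy$gest[toy$gest %in% 99] <- NA
mean(toy$gest, na.rm = TRUE)
# [1] 39
```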
t7=tapply(new$DBWT,INDEX=list(cut(new$WTGAIN,breaks=10),cut(new$ESTGEST,breaks=10)),FUN=mean,na.rm=TRUE)
t7
##              (12,15.9] (15.9,19.8] (19.8,23.7] (23.7,27.6] (27.6,31.5]
## (-0.098,9.8]       227    321.3125    486.7534    799.5614    1398.234
## (9.8,19.6]        2649    592.8235    546.7738    813.4179    1421.181
## (19.6,29.4]         NA    585.8889    590.1368    882.4800    1452.186
## (29.4,39.2]       2977   1891.0000    731.5957    866.0294    1521.757
## (39.2,49]           NA   2485.2500    803.8667    955.7639    1513.215
## (49,58.8]           NA          NA    434.7500    950.8039    1506.355
## (58.8,68.6]         NA          NA    352.0000   1285.6250    1469.508
## (68.6,78.4]         NA          NA          NA    805.5714    1463.391
## (78.4,88.2]         NA          NA          NA   1110.0000    1487.846
## (88.2,98.1]         NA          NA          NA    768.0000    1434.333
##              (31.5,35.4] (35.4,39.3] (39.3,43.2] (43.2,47.1] (47.1,51]
## (-0.098,9.8]    2275.316    3166.748    3443.652    3911.667      3310
## (9.8,19.6]      2289.950    3171.085    3434.708    3206.400        NA
## (19.6,29.4]     2307.429    3213.362    3475.328    3007.800      3969
## (29.4,39.2]     2323.002    3276.400    3535.965    3326.143      4042
## (39.2,49]       2368.520    3329.068    3605.645    3447.200        NA
## (49,58.8]       2358.658    3370.630    3650.549    3501.000        NA
## (58.8,68.6]     2367.365    3389.672    3681.233    3435.500        NA
## (68.6,78.4]     2368.205    3418.076    3694.160    3118.000        NA
## (78.4,88.2]     2447.250    3496.495    3708.868          NA        NA
## (88.2,98.1]     2481.105    3406.835    3688.067          NA        NA
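The interval labels in the table come from `cut()`, which turns a numeric variable into a factor of bins. A minimal sketch on made-up values shows the two common ways of choosing the bins: a number of equal-width intervals, as used above, or explicit break points.

```r
# Toy values (hypothetical)
gain   <- c(5, 12, 18, 25, 33, 41, 47, 50)
weight <- c(3000, 3100, 3150, 3300, 3350, 3500, 3450, 3600)

# breaks = 5 splits the observed range into 5 equal-width intervals
gain_bin <- cut(gain, breaks = 5)
nlevels(gain_bin)
# [1] 5

# Explicit break points give intervals with round, readable limits
gain_bin2 <- cut(gain, breaks = c(0, 10, 20, 30, 40, 50))
table(gain_bin2)

# Mean weight within each explicit interval
bin_means <- tapply(weight, gain_bin2, mean)
bin_means
```

With a single number of breaks, the interval limits depend on the range of the data, which is why the table above has limits such as 15.9 and 19.8.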
levelplot(t7,scales = list(x = list(rot = 90)))
contourplot(t7,scales = list(x = list(rot = 90)))
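Both functions accept a matrix directly, which is why the `tapply()` result can be passed in as is. The sketch below shows the same calls on a small matrix of made-up numbers; the dimnames are hypothetical labels.

```r
library(lattice)

# Toy matrix of means (hypothetical numbers); dimnames supply the tick labels
m <- matrix(c(500, 800, 1400, 2300, 3200,
              550, 850, 1450, 2350, 3300,
              600, 900, 1500, 2400, 3400),
            nrow = 3, byrow = TRUE,
            dimnames = list(gain = c("low", "mid", "high"),
                            gest = c("20", "25", "30", "35", "40")))

# Rows of the matrix go on the x-axis, columns on the y-axis
p_level   <- levelplot(m, scales = list(x = list(rot = 90)))
p_contour <- contourplot(m, scales = list(x = list(rot = 90)))

print(p_level)
print(p_contour)
```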
One may wish to predict the birth weight from characteristics such as the estimated gestation period and the mother's weight gain; for that, one could use regression and regression trees. Or one may want to identify births that lead to very low APGAR scores, for which one could use classification methods.
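As a rough idea of what the regression route could look like, here is a minimal sketch using `lm()` on simulated data. The variable names, the coefficients used to generate the data, and the linear form itself are all assumptions made for illustration; nothing here is fitted to births2006.smpl.

```r
# Simulated data (assumption: a roughly linear relationship; not real births)
set.seed(42)
n    <- 200
gest <- round(runif(n, 30, 42))   # hypothetical gestation in weeks
gain <- round(runif(n, 5, 60))    # hypothetical weight gain
dbwt <- -3000 + 160 * gest + 5 * gain + rnorm(n, sd = 300)
toy  <- data.frame(dbwt, gest, gain)

# Linear regression of weight on gestation and weight gain
fit <- lm(dbwt ~ gest + gain, data = toy)
coef(fit)

# Predicted weight for one hypothetical new case
predict(fit, newdata = data.frame(gest = 39, gain = 30))
```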