Recap

  • Invocation and the Read-Eval-Print Loop
  • Constants
  • Parsing of R code
  • Strings and names
  • Functions and operators
  • Comments
  • Working directory
  • Workspace
  • Search path
  • Documentation

3rd-Party Libraries (Packages)

  • Compilation of functions and/or data
  • Made to ease specific (repetitive) tasks
  • Some are general-purpose, others domain-specific
  • Currently, 17678 packages on CRAN
  • Other locations - Bioconductor, GitHub
  • See CRAN Task Views

A hypothetical package called pkg has a function, pause():

  • To install from CRAN - install.packages("pkg")
  • To learn about the package - help(package = "pkg")
  • To use pkg in an R session - library(pkg) (or require(pkg))
  • Function pause() can be used one-off - pkg::pause()
  • To detach pkg from search path - detach("package:pkg")
  • To uninstall pkg use remove.packages("pkg")
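
Since pkg is hypothetical, the attach/detach steps can be tried with tools, a package that ships with R but is normally not attached at startup. This is only a sketch: the choice of tools and the object names (ext, attached_after_library, attached_after_detach) are assumptions made for illustration. Installing and uninstalling are left out because tools is already part of R.

```r
# One-off use with `::`, without attaching the package
ext <- tools::file_ext("whostat2005_coverage.xls")
ext
## [1] "xls"

# Attach the package, then check that it is on the search path
library(tools)
attached_after_library <- "package:tools" %in% search()

# Detach it again and check once more
detach("package:tools")
attached_after_detach <- "package:tools" %in% search()

attached_after_library
## [1] TRUE
attached_after_detach
## [1] FALSE
```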

Importing Data

  • Note: There are inbuilt datasets
  • Import from MS Excel using the readxl package
  • The imported object is called a data frame
  • Has observations (rows) and variables (columns)

dat <- readxl::read_excel("whostat2005_coverage.xls")

# Alternatively:
# library(readxl)
# dat <- read_excel("whostat2005_coverage.xls")
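
If no Excel file is at hand, the note about inbuilt datasets offers a way to get a data frame straight away. A minimal sketch, assuming the mtcars dataset from the datasets package that is attached by default; the name builtin_dat is made up for this example.

```r
# mtcars ships with R; run data() to list the other inbuilt datasets
builtin_dat <- mtcars

# It is a data frame: observations in rows, variables in columns
is.data.frame(builtin_dat)
## [1] TRUE
dim(builtin_dat)
## [1] 32 11
```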

Data exploration

  • View() (to find out about it - ?View)

  • dim()
  • head() or tail()
  • colnames()
  • str()

dim(dat)
## [1] 216  21

head(dat)
##   ...1 ...2        ...3   ...4 ...5
## 1   NA   NA        <NA>    WHO   NA
## 2   NA   NA     Country region   NA
## 3   NA   NA        <NA>   <NA>   NA
## 4   NA   NA        <NA>   <NA>   NA
## 5   NA   NA        <NA>   <NA>   NA
## 6    1   NA Afghanistan    EMR   NA
##   Immunization coverage (%) among 1-year-oldsa ...7  ...8      Antenatal ...10
## 1                                      Measles DTP3 HepB3 care coverageb  <NA>
## 2                                         <NA> <NA>  <NA>            (%)  year
## 3                                         <NA> <NA>  <NA>           <NA>  <NA>
## 4                                         2003 2003  2003           <NA>  <NA>
## 5                                         <NA> <NA>  <NA>           <NA>  <NA>
## 6                                           50   54     0             52  2003
##          Births attended by ...12    Contraceptive ...14
## 1 skilled health personnelb  <NA> prevalence rateb  <NA>
## 2                       (%)  year              (%)  year
## 3                      <NA>  <NA>             <NA>  <NA>
## 4                      <NA>  <NA>             <NA>  <NA>
## 5                      <NA>  <NA>             <NA>  <NA>
## 6                        14  2003                4  2000
##      Children under-5 using ...16     TB detection ...18        TB treatment
## 1 insecticide-treated netsc  <NA> rate under DOTSd  <NA> success under DOTSd
## 2                       (%)  year              (%)  year                 (%)
## 3                      <NA>  <NA>             <NA>  <NA>                <NA>
## 4                      <NA>  <NA>             <NA>  <NA>                <NA>
## 5                      <NA>  <NA>             <NA>  <NA>                <NA>
## 6                       ...   ...               18  2003                  87
##   ...20    Antiretroviral
## 1  <NA> therapy coveragee
## 2  year               (%)
## 3  <NA>              <NA>
## 4  <NA>          Dec 2004
## 5  <NA>              <NA>
## 6  2002               ...

tail(dat)
##     ...1 ...2
## 211   NA   NA
## 212   NA   NA
## 213   NA   NA
## 214   NA   NA
## 215   NA   NA
## 216   NA   NA
##                                                                                                                                                                                               ...3
## 211                                                                                                                                                        … data not available or not applicable.
## 212  aWorld Health Organization, Department of Immunization Vaccines and Biologicals, Vaccine Assessment and Monitoring Team. (http//www.who.int/vaccines-surveillance, accessed on 16 April 2005)
## 213                                         bThe World Health Report 2005: make every mother and child count. Geneva, World Health Organization, 2005. (http://www.who.int/whr/2005/en/index.html)
## 214                                                                                       cThe WHO Global Roll Back Malaria database. (http://www.who.int/globalatlas/autologin/malaria_login.asp)
## 215 dWHO report 2005. Global Tuberculosis Control; Surveillance, Planning, Financing. Geneva, World Health Organization, 2005.(http://www.who.int/tb/publications/global_report/2005/pdf/Full.pdf)
## 216                                                                                                  eThe WHO Global Database on Child Growth and Malnutrition. (http://www.who.int/nutgrowthdb)\n
##     ...4 ...5 Immunization coverage (%) among 1-year-oldsa ...7 ...8 Antenatal
## 211 <NA>   NA                                         <NA> <NA> <NA>      <NA>
## 212 <NA>   NA                                         <NA> <NA> <NA>      <NA>
## 213 <NA>   NA                                         <NA> <NA> <NA>      <NA>
## 214 <NA>   NA                                         <NA> <NA> <NA>      <NA>
## 215 <NA>   NA                                         <NA> <NA> <NA>      <NA>
## 216 <NA>   NA                                         <NA> <NA> <NA>      <NA>
##     ...10 Births attended by ...12 Contraceptive ...14 Children under-5 using
## 211  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
## 212  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
## 213  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
## 214  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
## 215  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
## 216  <NA>               <NA>  <NA>          <NA>  <NA>                   <NA>
##     ...16 TB detection ...18 TB treatment ...20 Antiretroviral
## 211  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>
## 212  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>
## 213  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>
## 214  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>
## 215  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>
## 216  <NA>         <NA>  <NA>         <NA>  <NA>           <NA>

colnames(dat)
##  [1] "...1"                                        
##  [2] "...2"                                        
##  [3] "...3"                                        
##  [4] "...4"                                        
##  [5] "...5"                                        
##  [6] "Immunization coverage (%) among 1-year-oldsa"
##  [7] "...7"                                        
##  [8] "...8"                                        
##  [9] "Antenatal"                                   
## [10] "...10"                                       
## [11] "Births attended by"                          
## [12] "...12"                                       
## [13] "Contraceptive"                               
## [14] "...14"                                       
## [15] "Children under-5 using"                      
## [16] "...16"                                       
## [17] "TB detection"                                
## [18] "...18"                                       
## [19] "TB treatment"                                
## [20] "...20"                                       
## [21] "Antiretroviral"

str(dat)
## 'data.frame':    216 obs. of  21 variables:
##  $ ...1                                        : num  NA NA NA NA NA 1 2 3 4 5 ...
##  $ ...2                                        : logi  NA NA NA NA NA NA ...
##  $ ...3                                        : chr  NA "Country" NA NA ...
##  $ ...4                                        : chr  "WHO" "region" NA NA ...
##  $ ...5                                        : logi  NA NA NA NA NA NA ...
##  $ Immunization coverage (%) among 1-year-oldsa: chr  "Measles" NA NA "2003" ...
##  $ ...7                                        : chr  "DTP3" NA NA "2003" ...
##  $ ...8                                        : chr  "HepB3" NA NA "2003" ...
##  $ Antenatal                                   : chr  "care coverageb" "(%)" NA NA ...
##  $ ...10                                       : chr  NA "year" NA NA ...
##  $ Births attended by                          : chr  "skilled health personnelb" "(%)" NA NA ...
##  $ ...12                                       : chr  NA "year" NA NA ...
##  $ Contraceptive                               : chr  "prevalence rateb" "(%)" NA NA ...
##  $ ...14                                       : chr  NA "year" NA NA ...
##  $ Children under-5 using                      : chr  "insecticide-treated netsc" "(%)" NA NA ...
##  $ ...16                                       : chr  NA "year" NA NA ...
##  $ TB detection                                : chr  "rate under DOTSd" "(%)" NA NA ...
##  $ ...18                                       : chr  NA "year" NA NA ...
##  $ TB treatment                                : chr  "success under DOTSd" "(%)" NA NA ...
##  $ ...20                                       : chr  NA "year" NA NA ...
##  $ Antiretroviral                              : chr  "therapy coveragee" "(%)" NA "Dec 2004" ...

Subsetting

  • $ operator allows us to pick out a column by name
  • A new object can be created from the column
  • The column can also be modified with this operator
  • Concept of getting and setting

# Get
anc <- dat$Antenatal

To see the value of the object bound to anc, run the name on its own:

anc
##   [1] "care coverageb" "(%)"            NA               NA              
##   [5] NA               "52"             "81"             "79"            
##   [9] "..."            "..."            "..."            "..."           
##  [13] "82"             "..."            "..."            "70"            
##  [17] "..."            "63"             "39"             "89"            
##  [21] "..."            "..."            "..."            "88"            
##  [25] "..."            "84"             "99"             "99"            
##  [29] "84"             "..."            "..."            "72"            
##  [33] "93"             "44"             "77"             "..."           
##  [37] "..."            "…"              "51"             "..."           
##  [41] "..."            "90"             "87"             "..."           
##  [45] "..."            "..."            "84"             "..."           
##  [49] "..."            "..."            "…"              "98"            
##  [53] "72"             "..."            "..."            "..."           
##  [57] "100"            "56"             "54"             "..."           
##  [61] "..."            "..."            "..."            "27"            
##  [65] "..."            "..."            "..."            "94"            
##  [69] "92"             "91"             "..."            "90"            
##  [73] "..."            "..."            "86"             "74"            
##  [77] "89"             "88"             "79"             "..."           
##  [81] "..."            "..."            "65"             "97"            
##  [85] "..."            "..."            "..."            "..."           
##  [89] "..."            "..."            "..."            "99"            
##  [93] "82"             "88"             "..."            "83"            
##  [97] "88"             "44"             "..."            "..."           
## [101] "91"             "..."            "..."            "..."           
## [105] "..."            "91"             "94"             "..."           
## [109] "98"             "53"             "..."            "..."           
## [113] "63"             "..."            "..."            "..."           
## [117] "..."            "..."            "…"              "71"            
## [121] "..."            "85"             "..."            "49"            
## [125] "..."            "..."            "85"             "39"            
## [129] "61"             "..."            "..."            "77"            
## [133] "36"             "..."            "..."            "..."           
## [137] "..."            "85"             "94"             "..."           
## [141] "..."            "62"             "..."            "99"            
## [145] "89"             "96"             "93"             "..."           
## [149] "..."            "..."            "..."            "..."           
## [153] "91"             "77"             "82"             "..."           
## [157] "..."            "82"             "..."            "..."           
## [161] "..."            "..."            "..."            "89"            
## [165] "..."            "..."            "..."            "91"            
## [169] "..."            "..."            "..."            "..."           
## [173] "75"             "..."            "..."            "..."           
## [177] "78"             "..."            "96"             "..."           
## [181] "67"             "87"             "..."            "92"            
## [185] "90"             "97"             "..."            "96"            
## [189] "..."            "..."            "95"             "..."           
## [193] "..."            "70"             "34"             "94"            
## [197] "82"             NA               NA               NA              
## [201] NA               "70"             "84"             "66"            
## [205] "84"             "46"             "77"             NA              
## [209] NA               NA               NA               NA              
## [213] NA               NA               NA               NA

Hmmm, it’s rather long. Let’s shorten the output a little.

head(anc)
## [1] "care coverageb" "(%)"            NA               NA              
## [5] NA               "52"
tail(anc)
## [1] NA NA NA NA NA NA

  • These objects we have extracted from our data frame are known as atomic vectors.
  • [] and [[]] also allow sub-setting with numerical indices
  • Getting and setting apply to these operators too

anc2 <- dat[['Antenatal']]
anc3 <- dat[[9]]
identical(anc, anc2)
identical(anc, anc3)
## [1] TRUE
## [1] TRUE
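
The slides show getting only. Setting can be sketched on a small toy data frame; the data frame toy, its values and the column name anc_pct are invented for illustration and are not part of the WHO data.

```r
# A toy data frame standing in for dat
toy <- data.frame(
  country = c("A", "B", "C"),
  anc     = c("52", "81", "..."),
  stringsAsFactors = FALSE
)

# Get: copy a column into a new object
x <- toy$anc

# Set: overwrite the column in place with numeric values
toy$anc <- c(52, 81, NA)

# Set with [[ ]]: create a new column from an existing one (picked by position)
toy[["anc_pct"]] <- toy[[2]] / 100

# [ ] with numerical indices: rows 1 to 2 of column 2
toy[1:2, 2]
## [1] 52 81
```

Note that x still holds the original character values, because it was copied before the column was modified.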

Data Cleaning

  • Time-consuming
  • Development of elegant constructs
  • R scripts allow documentation of data-cleaning steps
  • Scripting: quality assurance, knowledge sharing and correction/reversal where necessary
  • Scripting -> package development -> Automation
  • 3rd party packages available (limited value)

Attempt to improve data import (DEMO)
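
The demo itself is not reproduced in these notes. As a rough base-R sketch of one cleaning step it might include, the snippet below takes values like the first ten entries of anc printed earlier, drops the five rows that came from the spreadsheet header, and turns the "..." placeholder into NA (the sheet's own footnote describes it as data not available or not applicable). The object names anc_raw, anc_values and anc_num are assumptions for this example.

```r
# Stand-in for the first ten entries of dat$Antenatal
anc_raw <- c("care coverageb", "(%)", NA, NA, NA, "52", "81", "79", "...", "...")

# Drop the five rows that held the spreadsheet header
anc_values <- anc_raw[-(1:5)]

# Treat the "..." placeholder as missing, then convert to numbers
anc_values[anc_values == "..."] <- NA
anc_num <- as.numeric(anc_values)
anc_num
## [1] 52 81 79 NA NA
```

The real column also contains a single-character ellipsis in a few places, so a full clean-up would need to handle that variant as well.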

Simple summary statistics

  • summary()
  • For categorical data - table()
  • For numerical data - fivenum()
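
A brief sketch of these three functions on toy vectors. The coverage values are a handful of the antenatal figures seen earlier, already converted to numbers; the region labels are made up for illustration.

```r
coverage <- c(52, 81, 79, 82, 70)
region   <- c("EMR", "EUR", "AFR", "EUR")

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(coverage)
## [1] 52 70 79 81 82

# Counts per category
table(region)
## region
## AFR EMR EUR 
##   1   1   2

# summary() adapts to the type of its argument
coverage_summary <- summary(coverage)
```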

Visualization

  • Outstanding graphics capabilities
  • Base R, other packages like lattice and ggplot2
  • Run demo("graphics") and demo("image")

Concluding your Session

  • You will be prompted to save your session.
  • 99% of the time DON’T do this!
  • Rather than saving objects, save the scripts that created them.
  • You can save individual objects with the function saveRDS()
  • To recall the object, use readRDS().

# Create a variable with the filename. Why?
myfile <- "who_data.rds"

# Save the object on disk
saveRDS(dat, file = myfile)

# Read the object from disk
diskdat <- readRDS(file = myfile)

# Check if it's the same as the earlier object
identical(dat, diskdat)
## [1] TRUE

Resources

TO DO

  • Pick any old Excel file and import it into R. Give it the name mydat:
    • You may have to install the readxl package.
    • What is the result of nrow(mydat)?
    • What is the result of ncol(mydat)?
    • Run lapply(mydat, typeof) and inspect the result. What do you think happened? (Tip: You are free to check ?typeof and ?lapply, or even Google!)