Using RMarkdown

Text

Text can be decorated with bold or italics. It is also possible to

  • create links
  • include mathematics like \(e=mc^2\) or \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

Be sure to put a space after the * when you are creating bullets and a space after # when creating section headers, but not between $ and the mathematical formulas.

Download Info

We will be using RStudio Server for now. Later on, as you ease in, you might want to use RStudio, which is an interface (IDE) to the program R, together with R on your own PC or laptop. In order to do so:

  • First, download R using the link provided and install it on your computer.

  • Then download RStudio from the link provided and install it on your computer.

Graphics

If the code of an R chunk produces a plot, this plot can be displayed in the resulting file.

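library(mosaic)   # assumed setup: supplies favstats() and the Births78 data; also load lattice if xyplot() is not found
xyplot(births ~ date, data=Births78)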

R output

Other forms of R output are also displayed as they are produced.

favstats(~ births, data=Births78)
##   min   Q1 median   Q3   max     mean       sd   n missing
##  7135 8554   9218 9705 10711 9132.162 817.8821 365       0

Destination formats

This file can be knit to HTML, PDF, or Word. In RStudio, just select the desired output file type and click on Knit HTML, Knit PDF, or Knit Word. Use the dropdown menu next to that to change the desired file type.

The Goal of this lab:

To introduce you to R and RStudio, which you’ll be using throughout the course to learn the statistical concepts discussed.

R is the name of the programming language itself and RStudio is a convenient interface (IDE).

The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered.

Any plots that you generate will show up in the panel in the lower right corner. In addition, that is where you can install and check packages, see your working directory, etc.

The panel on the left is called the console. It shows the prompt “>”, which is literally a request for a command.

The symbol <- is the assignment operator. Using it, we can create and store different types of R objects. Anything after the pound (hashtag) symbol is a comment: explanatory text that is not executed.

a <- c(1.8, 4.5)   #numeric
b <- c(1 + 2i, 3 - 6i) #complex
d <- c(23L, 44L)   #integer (the L suffix makes the values integers; without it they are stored as numeric)
e <- vector("logical", length = 5)
a
## [1] 1.8 4.5
b
## [1] 1+2i 3-6i
d
## [1] 23 44
e
## [1] FALSE FALSE FALSE FALSE FALSE
class(qt) # to check the class of an R object
## [1] "function"
my_list <- list(22, "ab", TRUE, 1 + 2i) #  list: its elements can have different types
my_list
## [[1]]
## [1] 22
## 
## [[2]]
## [1] "ab"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+2i
my_matrix <- matrix(1:6, nrow=3, ncol=2) #matrix
my_matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
dim(my_matrix) #checking dimensions of the matrix 
## [1] 3 2
b1 <- 0:5
b1
## [1] 0 1 2 3 4 5
class(b1)
## [1] "integer"
b2<-as.factor(b1)
b2
## [1] 0 1 2 3 4 5
## Levels: 0 1 2 3 4 5
class(b2)
## [1] "factor"
a1<-3+2 # Simple arithmetic, a1 is the object name which will store the result 
a1
## [1] 5
a2<-c(1,2,3,4) # Note that, as above, c is the concatenation function used to create vectors
a2 # the object a2 is called a vector
## [1] 1 2 3 4
a3<-sum(a2) # adds up all elements of a2
a3
## [1] 10
a4<-20:30 # creating a sequence incrementing by 1
a4
##  [1] 20 21 22 23 24 25 26 27 28 29 30
#To obtain help on any of the commands, type the name of the command you wish help on:
?hist

head(Births78) # to see the first few rows
##         date births wday year month day_of_year day_of_month day_of_week
## 1 1978-01-01   7701  Sun 1978     1           1            1           1
## 2 1978-01-02   7527  Mon 1978     1           2            2           2
## 3 1978-01-03   8825  Tue 1978     1           3            3           3
## 4 1978-01-04   8859  Wed 1978     1           4            4           4
## 5 1978-01-05   9043  Thu 1978     1           5            5           5
## 6 1978-01-06   9208  Fri 1978     1           6            6           6
#To check the size (dimension) of the data frame, type
dim(Births78)
## [1] 365   8
str(Births78) # compactly displays the internal structure of an R object, in this case a data frame
## 'data.frame':    365 obs. of  8 variables:
##  $ date        : Date, format: "1978-01-01" "1978-01-02" ...
##  $ births      : int  7701 7527 8825 8859 9043 9208 8084 7611 9172 9089 ...
##  $ wday        : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ year        : num  1978 1978 1978 1978 1978 ...
##  $ month       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_of_year : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_month: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_week : num  1 2 3 4 5 6 7 1 2 3 ...
summary(Births78)
##       date                births       wday         year     
##  Min.   :1978-01-01   Min.   : 7135   Sun:53   Min.   :1978  
##  1st Qu.:1978-04-02   1st Qu.: 8554   Mon:52   1st Qu.:1978  
##  Median :1978-07-02   Median : 9218   Tue:52   Median :1978  
##  Mean   :1978-07-02   Mean   : 9132   Wed:52   Mean   :1978  
##  3rd Qu.:1978-10-01   3rd Qu.: 9705   Thu:52   3rd Qu.:1978  
##  Max.   :1978-12-31   Max.   :10711   Fri:52   Max.   :1978  
##                                       Sat:52                 
##      month         day_of_year   day_of_month    day_of_week   
##  Min.   : 1.000   Min.   :  1   Min.   : 1.00   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.: 92   1st Qu.: 8.00   1st Qu.:2.000  
##  Median : 7.000   Median :183   Median :16.00   Median :4.000  
##  Mean   : 6.526   Mean   :183   Mean   :15.72   Mean   :3.992  
##  3rd Qu.:10.000   3rd Qu.:274   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :12.000   Max.   :365   Max.   :31.00   Max.   :7.000  
## 

If you strike the Enter key before typing a complete expression, you will see the continuation prompt, the plus sign (+).

The columns are the variables. There are two types of variables: numeric, for example the number of births (births), and factor (also called categorical), for example the day of the week (wday). The rows are called observations or cases.
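To check which columns are numeric and which are factors, one option is to ask for the class of each column. The data frame toy below is made up for illustration (it is not Births78), so treat this as a minimal sketch:

```r
# Hypothetical toy data frame, made up for illustration
toy <- data.frame(
  births = c(7701L, 7527L, 8825L),
  wday   = factor(c("Sun", "Mon", "Tue"))
)

# sapply() applies class() to every column and returns a named character vector
col_classes <- sapply(toy, class)
col_classes

is.numeric(toy$births)  # TRUE for integer and double columns
is.factor(toy$wday)     # TRUE for factor columns
```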

Tables, bar charts and histograms

Remarks:

  • The $ is one way to access the variables of a data frame. Another option, which we will learn later, is to create a new vector object by subsetting.

  • R is case-sensitive! For example, a1 and A1 are considered different objects.

We will do more descriptive stats and initial data exploration in the next lab.
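Both remarks can be tried on a small example. The data frame toy is hypothetical and only stands in for a real data set:

```r
# Hypothetical toy data frame, made up for illustration
toy <- data.frame(births = c(7701, 7527, 8825))

# $ extracts one column as a vector
toy$births

# Names are case-sensitive: a1 exists, A1 is a different name
a1 <- 5
exists("a1")  # TRUE
exists("A1")  # FALSE, unless an object called A1 was created earlier
```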

To visualize the distribution of a factor variable, create a bar chart

barplot(table(Births78$wday))

To see the distribution of a numeric variable, create a histogram

hist(Births78$births)

The shape of the distribution of this variable is left-skewed. Note that it is a count variable, not a continuous variable!
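One quick numeric check of skewness is to compare the mean with the median: in a left-skewed distribution a few small values pull the mean below the median. The vector x below is made up for illustration; the births summaries reported earlier (mean 9132.162, median 9218) point the same way.

```r
# Hypothetical values with one unusually small observation (left-skewed)
x <- c(2, 7, 8, 9, 9, 10, 10, 10)

mean(x)    # pulled down by the small value
median(x)  # much less affected by it

mean(x) < median(x)  # TRUE is consistent with left skew
```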

Boxplots give a visualization of the 5-number summary of a variable.

boxplot(Births78$births)

boxplot(births ~ wday, data = Births78) #births grouped by week day
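The five numbers behind a boxplot can also be printed directly. This is a small sketch on a made-up vector x; fivenum() and boxplot.stats() are base R functions.

```r
# Hypothetical values, made up for illustration
x <- c(1, 3, 5, 7, 9, 11, 13)

fivenum(x)              # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(x)$stats  # the five values that boxplot() draws for x
```

The hinges returned by fivenum() can differ slightly from the quartiles reported by quantile(), because the two functions use different interpolation rules.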

Remarks

  • The boxplot command offers the option of using a formula syntax. Here, since we can specify the data set, we don’t have to use the $ to access the variables.

  • In general, the variables in a data frame are not immediately accessible by name. Try calling births on its own and you will see an error:

    Error: object ‘births’ not found

  • If you use a variable frequently, you may want to extract it and store it as a vector

b1<-Births78$births
mean(b1)
## [1] 9132.162
median(b1)
## [1] 9218
range(b1)
## [1]  7135 10711
var(b1)
## [1] 668931.1
quantile(b1)
##    0%   25%   50%   75%  100% 
##  7135  8554  9218  9705 10711
#The tapply command allows you to compute numeric summaries on values based on levels
#of a factor variable. For instance, find the mean or median births by wday,
bb<-tapply(b1, Births78$wday, median)
bb
##    Sun    Mon    Tue    Wed    Thu    Fri    Sat 
## 7936.0 9321.0 9667.5 9361.5 9397.0 9544.5 8260.5
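To see exactly what tapply does, it can help to run it on a few values that are easy to check by hand. The vectors below are made up for illustration; aggregate() is shown as one alternative that returns a data frame.

```r
# Hypothetical toy data, made up for illustration
births <- c(7701, 7527, 8825, 7900, 7600)
wday   <- factor(c("Sun", "Mon", "Tue", "Sun", "Mon"),
                 levels = c("Sun", "Mon", "Tue"))

# mean of births within each level of wday
means <- tapply(births, wday, mean)
means

# the same summary as a data frame, using the formula interface of aggregate()
aggregate(births ~ wday, data = data.frame(births, wday), FUN = mean)
```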

Functions in R are called by typing their name followed by arguments inside parentheses, e.g., mean(b1).

Typing a function name without parentheses will give the code for the function.

sd
## function (x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", 
##     FALSE)) 
## {
##     if (lazyeval::is_formula(x)) {
##         if (is.null(data)) 
##             data <- lazyeval::f_env(x)
##         formula <- mosaicCore::mosaic_formula_q(x, groups = groups, 
##             max.slots = 3)
##         return(maggregate(formula, data = data, FUN = stats::sd, 
##             ..., na.rm = na.rm, .multiple = FALSE))
##     }
##     stats::sd(x, ..., na.rm = na.rm)
## }
## <bytecode: 0x7f96f7ef26d0>
## <environment: namespace:mosaic>
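The same two behaviours apply to functions you write yourself. The function spread below is a hypothetical helper, made up for illustration:

```r
# A hypothetical helper, made up for illustration
spread <- function(x, na.rm = FALSE) {
  # difference between the largest and smallest value
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

spread(c(7135, 9218, 10711))  # call: name followed by arguments in parentheses
spread                        # no parentheses: prints the function's code
args(spread)                  # shows only the argument list
```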

Determine the working directory of your R session by typing getwd(), or by going to Files -> More in the bottom right panel of RStudio.

getwd() # you must set your WD to the current project folder
## [1] "/Users/tzaihra/Downloads"
#setwd() # to set working directory
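A cautious way to experiment with setwd() is to save the current directory first and restore it afterwards. In this sketch tempdir() only stands in for your project folder, and "lab1.Rmd" is a hypothetical file name.

```r
old_wd <- getwd()        # remember where we started

# tempdir() is used only as a stand-in for your project folder
setwd(tempdir())
getwd()

# file.path() builds a path with the correct separators
file.path(getwd(), "lab1.Rmd")   # "lab1.Rmd" is a hypothetical file name

setwd(old_wd)            # go back to the original directory
```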

Subsetting data and vectors

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.

For subsetting a vector, the basic syntax is vector[index].

z <- c(8, 3, 0, 9, 9, 2, 1, 3)
z1<-z[4] 
z1 #z1 is the fourth element of z
## [1] 9
z2<-z[c(1, 4, 5)]
z2 #The first, fourth and fifth element
## [1] 8 9 9
z3<-z[-c(1, 3, 4)] 
z3 #All elements except the first, third and fourth
## [1] 3 9 2 1 3
z4<-z[8:1] 
z4 #The elements of z in reverse
## [1] 3 1 2 9 9 0 3 8
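The index can also be a logical condition instead of a set of positions. A short sketch using the same vector z:

```r
z <- c(8, 3, 0, 9, 9, 2, 1, 3)

z[z > 5]          # elements greater than 5
which(z > 5)      # their positions
z[z > 2 & z < 9]  # combine conditions with & (and) or | (or)
rev(z)            # same result as z[8:1]
```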

For subsetting data

Selecting (keeping) variables (columns) and observations (rows)

Keep the variables/columns age, race group, sex, substance and treatment group from the HELPrct data

head(HELPrct)
##   age anysubstatus anysub cesd d1 daysanysub dayslink drugrisk e2b female
## 1  37            1    yes   49  3        177      225        0  NA      0
## 2  37            1    yes   30 22          2       NA        0  NA      0
## 3  26            1    yes   39  0          3      365       20  NA      0
## 4  39            1    yes   15  2        189      343        0   1      1
## 5  32            1    yes   39 12          2       57        0   1      0
## 6  47            1    yes    6  1         31      365        0  NA      1
##      sex g1b homeless i1 i2 id indtot linkstatus link       mcs      pcs
## 1   male yes   housed 13 26  1     39          1  yes 25.111990 58.41369
## 2   male yes homeless 56 62  2     43         NA <NA> 26.670307 36.03694
## 3   male  no   housed  0  0  3     41          0   no  6.762923 74.80633
## 4 female  no   housed  5  5  4     28          0   no 43.967880 61.93168
## 5   male  no homeless 10 13  5     38          1  yes 21.675755 37.34558
## 6 female  no   housed  4  4  6     29          0   no 55.508991 46.47521
##   pss_fr racegrp satreat sexrisk substance treat avg_drinks max_drinks
## 1      0   black      no       4   cocaine   yes         13         26
## 2      1   white      no       7   alcohol   yes         56         62
## 3     13   black      no       2    heroin    no          0          0
## 4     11   white     yes       4    heroin    no          5          5
## 5     10   black      no       6   cocaine    no         10         13
## 6      5   black      no       5   cocaine   yes          4          4
myvars<-c("age","racegrp","sex","substance","treat")
newdata<-HELPrct[myvars]
head(newdata)
##   age racegrp    sex substance treat
## 1  37   black   male   cocaine   yes
## 2  37   white   male   alcohol   yes
## 3  26   black   male    heroin    no
## 4  39   white female    heroin    no
## 5  32   black   male   cocaine    no
## 6  47   black female   cocaine   yes

Select the 1st and 6th to 8th columns/variables (a single index without a comma selects columns)

newdata1<-HELPrct[c(1,6:8)]
head(newdata1)
##   age daysanysub dayslink drugrisk
## 1  37        177      225        0
## 2  37          2       NA        0
## 3  26          3      365       20
## 4  39        189      343        0
## 5  32          2       57        0
## 6  47         31      365        0

Exclude the 1st and 2nd columns of newdata (age and racegrp)

newdata3<-newdata[c(-1,-2)]
head(newdata3)
##      sex substance treat
## 1   male   cocaine   yes
## 2   male   alcohol   yes
## 3   male    heroin    no
## 4 female    heroin    no
## 5   male   cocaine    no
## 6 female   cocaine   yes
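With data frames, a single index without a comma refers to columns, while an index placed before a comma refers to rows. The data frame toy is made up for illustration; a minimal sketch of the difference:

```r
# Hypothetical toy data frame, made up for illustration
toy <- data.frame(age   = c(37, 26, 39, 47),
                  sex   = c("male", "male", "female", "female"),
                  treat = c("yes", "no", "no", "yes"))

toy[c(1, 3)]      # one index, no comma: columns 1 and 3
toy[, c(1, 3)]    # the same columns, written with the comma
toy[c(1, 3), ]    # index before the comma: rows 1 and 3
toy[-c(1, 2), ]   # negative row index: drop rows 1 and 2
```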

Selecting observations

newdata4<-newdata[1:5,]
head(newdata4)
##   age racegrp    sex substance treat
## 1  37   black   male   cocaine   yes
## 2  37   white   male   alcohol   yes
## 3  26   black   male    heroin    no
## 4  39   white female    heroin    no
## 5  32   black   male   cocaine    no
#based on variable value
newdata5<-newdata[which(newdata$sex=="female"),]
head(newdata5)
##    age racegrp    sex substance treat
## 4   39   white female    heroin    no
## 6   47   black female   cocaine   yes
## 7   49   black female   cocaine    no
## 9   50   white female   alcohol    no
## 11  34   white female    heroin   yes
## 12  58   black female   alcohol    no
#or 
attach(newdata)
newdata6<-newdata[which(sex=="female" & age>45),]
head(newdata6)
##     age racegrp    sex substance treat
## 6    47   black female   cocaine   yes
## 7    49   black female   cocaine    no
## 9    50   white female   alcohol    no
## 12   58   black female   alcohol    no
## 25   48   black female   cocaine    no
## 122  50   white female   alcohol    no
detach(newdata)

Selecting using subset command

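The basic syntax is subset(data, subset = row.condition, select = columns). The row condition must be a logical statement (something that evaluates to TRUE or FALSE).

# note: with | (or), age>=45|age<55 is TRUE for every age, so all rows are kept;
# use & (and) instead to restrict the result to ages 45 to 54
newdata7<-subset(newdata,age>=45|age<55,select=c(age,sex,substance))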
head(newdata7)
##   age    sex substance
## 1  37   male   cocaine
## 2  37   male   alcohol
## 3  26   male    heroin
## 4  39 female    heroin
## 5  32   male   cocaine
## 6  47 female   cocaine
newdata8<-subset(newdata,sex=="female" & age>50,select=c(age,sex,substance))
head(newdata8)
##     age    sex substance
## 12   58 female   alcohol
## 156  57 female   alcohol
## 225  55 female    heroin
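The choice between & and | in the row condition matters. The data frame toy below is made up for illustration and shows how the two operators behave for an age range:

```r
# Hypothetical toy data frame, made up for illustration
toy <- data.frame(age = c(37, 47, 50, 58),
                  sex = c("male", "female", "female", "female"))

# & keeps rows where both conditions hold (ages 45 to 54)
in_range <- subset(toy, age >= 45 & age < 55, select = c(age, sex))
in_range

# | keeps rows where at least one condition holds; every age is
# either >= 45 or < 55, so no row is dropped
nrow(subset(toy, age >= 45 | age < 55))
```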

Random Samples

Use the sample( ) function to take a random sample of size n from a dataset.

# take a random sample of size 50 from a dataset 
# sample without replacement
mysample <- HELPrct[sample(1:nrow(HELPrct), 50,
   replace=FALSE),] 
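A random sample changes every time the code runs. If you want the same sample again (for example when re-knitting a report), one common approach is to fix the seed first. The seed value 123 and the range 1:10 below are arbitrary choices for illustration.

```r
set.seed(123)   # any fixed number works; 123 is an arbitrary choice
rows_a <- sample(1:10, 4, replace = FALSE)

set.seed(123)   # resetting the seed reproduces the same draw
rows_b <- sample(1:10, 4, replace = FALSE)

identical(rows_a, rows_b)    # TRUE
anyDuplicated(rows_a) == 0   # TRUE: without replacement no row is drawn twice
```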

Graphing

R has some powerful functions for making graphics.

Use source to import the arbuthnot data from the URL specified below. The workspace area in the upper right-hand corner of the RStudio window will list a data set called arbuthnot that has 82 observations on 3 variables.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.

Create a simple plot of the number of girls baptized per year with the command plot. To draw a line instead of points, add a third argument, type = "l" (the letter l, for line).

To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in.

source("http://www.openintro.org/stat/data/arbuthnot.R")
head(arbuthnot)
##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820
plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")
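Since Arbuthnot was interested in the ratio of boys to girls, a natural next step is to compute that ratio and plot it. So that it runs without downloading anything, this sketch uses only the six rows printed by head(arbuthnot), typed in by hand; the names arb6 and ratio are made up for illustration.

```r
# The first six rows shown by head(arbuthnot), typed in by hand
arb6 <- data.frame(year  = 1629:1634,
                   boys  = c(5218, 4858, 4422, 4994, 5158, 5035),
                   girls = c(4683, 4457, 4102, 4590, 4839, 4820))

# new column: boys baptized per girl baptized
arb6$ratio <- arb6$boys / arb6$girls

plot(x = arb6$year, y = arb6$ratio, type = "l",
     xlab = "Year", ylab = "Boys per girl")
```

With the full data set loaded, the same two steps should work with arbuthnot in place of arb6.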