IntroR_part1

The Goal of this lab:

To introduce you to R and RStudio, which you’ll be using throughout the course to learn the statistical concepts it covers.

R is the name of the programming language itself, and RStudio is a convenient interface (IDE).

The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered.

Any plots that you generate will show up in the panel in the lower right corner. In addition, that is where you can install and check packages, see your working directory, etc.

The panel on the left is called the console. It displays the prompt “>”, which is literally a request for a command.

The symbol <- is the assignment operator. Using it, we can create and store different types of R objects. Anything after the pound (hashtag) symbol # is a comment: explanatory text that is not executed.

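a <- c(1.8, 4.5)   #numeric
b <- c(1 + 2i, 3 - 6i) #complex
d <- c(23L, 44L)   #integer (the L suffix makes the values integers; c(23, 44) would be numeric)
e <- vector("logical", length = 5) #logical, filled with FALSE
a
b
d
e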

## [1] 1.8 4.5

## [1] 1+2i 3-6i

## [1] 23 44

## [1] FALSE FALSE FALSE FALSE FALSE

class(qt) # to check the class of an R object (here qt, a built-in function)

## [1] "function"

my_list <- list(22, "ab", TRUE, 1 + 2i) # list: its elements can be of different types
my_list

## [[1]]
## [1] 22
## 
## [[2]]
## [1] "ab"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+2i

my_matrix <- matrix(1:6, nrow=3, ncol=2) #matrix
my_matrix

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

dim(my_matrix) #checking dimensions of the matrix

## [1] 3 2

b1 <- 0:5
b1

## [1] 0 1 2 3 4 5

class(b1)

## [1] "integer"

b2<-as.factor(b1)
b2

## [1] 0 1 2 3 4 5
## Levels: 0 1 2 3 4 5

class(b2)

## [1] "factor"
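A common stumbling block with factors is converting them back to numbers. The sketch below uses a hypothetical factor `f` (not an object from this lab) to show that `as.integer()` returns the internal level codes, while going through `as.character()` first recovers the original values.

```r
# Hypothetical factor whose labels are numbers other than 1, 2, 3, ...
f <- factor(c("10", "20", "20", "40"))

levels(f)                               # the distinct labels: "10" "20" "40"
codes  <- as.integer(f)                 # internal level codes, not the labels
values <- as.numeric(as.character(f))   # the original numbers

codes
values
```

Here `codes` holds the positions of each value among the levels, which is rarely what is wanted in a calculation.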

a1<-3+2 # Simple arithmetic, a1 is the object name which will store the result 
a1

## [1] 5

a2<-c(1,2,3,4) # Note that, as above, c() is the combine (concatenation) function used to create vectors
a2 # the object a2 is called a vector

## [1] 1 2 3 4

a3<-sum(a2) # adds up all elements of a2
a3

## [1] 10

a4<-20:30 # creating a sequence incrementing by 1
a4

##  [1] 20 21 22 23 24 25 26 27 28 29 30
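The colon operator always steps by 1. For other step sizes, base R offers `seq()` and `rep()`. The following is a small sketch; the object names `s1`, `s2` and `s3` are only illustrative.

```r
s1 <- seq(20, 30, by = 2)         # from 20 to 30 in steps of 2
s2 <- seq(0, 1, length.out = 5)   # 5 equally spaced values from 0 to 1
s3 <- rep(c(1, 2), times = 3)     # repeat the vector (1, 2) three times

s1
s2
s3
```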

#To obtain help on any command, type ? followed by the name of the command:
?hist

head(Births78) # to see the first six rows (Births78 is from the mosaicData package, loaded with mosaic)

##         date births wday year month day_of_year day_of_month day_of_week
## 1 1978-01-01   7701  Sun 1978     1           1            1           1
## 2 1978-01-02   7527  Mon 1978     1           2            2           2
## 3 1978-01-03   8825  Tue 1978     1           3            3           3
## 4 1978-01-04   8859  Wed 1978     1           4            4           4
## 5 1978-01-05   9043  Thu 1978     1           5            5           5
## 6 1978-01-06   9208  Fri 1978     1           6            6           6

#To check the size (dimension) of the data frame, type
dim(Births78)

## [1] 365   8

str(Births78) # compactly displays the internal structure of an R object, in this case a data frame

## 'data.frame':    365 obs. of  8 variables:
##  $ date        : Date, format: "1978-01-01" "1978-01-02" ...
##  $ births      : int  7701 7527 8825 8859 9043 9208 8084 7611 9172 9089 ...
##  $ wday        : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ year        : num  1978 1978 1978 1978 1978 ...
##  $ month       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_of_year : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_month: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_week : num  1 2 3 4 5 6 7 1 2 3 ...

summary(Births78)

##       date                births       wday         year     
##  Min.   :1978-01-01   Min.   : 7135   Sun:53   Min.   :1978  
##  1st Qu.:1978-04-02   1st Qu.: 8554   Mon:52   1st Qu.:1978  
##  Median :1978-07-02   Median : 9218   Tue:52   Median :1978  
##  Mean   :1978-07-02   Mean   : 9132   Wed:52   Mean   :1978  
##  3rd Qu.:1978-10-01   3rd Qu.: 9705   Thu:52   3rd Qu.:1978  
##  Max.   :1978-12-31   Max.   :10711   Fri:52   Max.   :1978  
##                                       Sat:52                 
##      month         day_of_year   day_of_month    day_of_week   
##  Min.   : 1.000   Min.   :  1   Min.   : 1.00   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.: 92   1st Qu.: 8.00   1st Qu.:2.000  
##  Median : 7.000   Median :183   Median :16.00   Median :4.000  
##  Mean   : 6.526   Mean   :183   Mean   :15.72   Mean   :3.992  
##  3rd Qu.:10.000   3rd Qu.:274   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :12.000   Max.   :365   Max.   :31.00   Max.   :7.000  
##

If you press the Enter key before typing a complete expression, you will see the continuation prompt, the plus sign.

The columns are the variables. There are two types of variables: numeric, for example the number of births, and factor (also called categorical), for example the day of the week. The rows are called observations or cases.
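To check which columns are numeric and which are factors, one option is to apply `class()` to every column. The sketch below uses a small made-up data frame called `toy` (an assumption for illustration, not the lab's data).

```r
# Hypothetical toy data frame with one numeric and one factor variable
toy <- data.frame(
  births = c(7701L, 7527L, 8825L, 8859L),
  wday   = factor(c("Sun", "Mon", "Tue", "Wed"))
)

var_classes <- sapply(toy, class)   # class of each column (variable)
var_classes

nrow(toy)   # number of observations (rows)
ncol(toy)   # number of variables (columns)
```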

Tables, bar charts and histograms

Remarks:

The $ is one way to access the variables of a data frame. Another option, which we will learn later, is to create a new vector object by subsetting.
R is case-sensitive! For example, a1 and A1 are considered different objects.
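Both remarks can be tried out on a small made-up data frame. In this sketch `toy` and its columns are hypothetical names chosen for illustration.

```r
# Hypothetical toy data frame
toy <- data.frame(births = c(7701, 7527, 8825),
                  wday   = c("Sun", "Mon", "Tue"))

v1 <- toy$births        # access a variable by name with $
v2 <- toy[["births"]]   # equivalent access with [[ ]]
identical(v1, v2)       # both give the same vector

# Names are case-sensitive: there is no column called Births
wrong_case <- toy$Births
is.null(wrong_case)     # $ returns NULL for a name that does not exist
```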

We will do more descriptive statistics and initial data exploration in the next lab.

To visualize the distribution of a factor variable, create a bar chart.

barplot(table(Births78$wday))
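The bar chart is drawn from the counts that `table()` produces, so it can help to look at the table itself. A minimal sketch with a made-up factor `days` (an assumption, not the lab data):

```r
# Hypothetical factor with an explicit level order
days <- factor(c("Sun", "Mon", "Mon", "Tue", "Sun", "Mon"),
               levels = c("Sun", "Mon", "Tue"))

counts <- table(days)          # frequency of each level
props  <- prop.table(counts)   # relative frequencies, which sum to 1

counts
props
```

Passing `counts` to `barplot()` would draw one bar per level, in the order of the factor levels.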

To see the distribution of a numeric variable, create a histogram

hist(Births78$births)

The distribution of this variable is left-skewed. Note that births is a count variable, not a continuous variable!

Boxplots give a visualization of the 5-number summary of a variable.

boxplot(Births78$births)

boxplot(births ~ wday, data = Births78) #births grouped by week day
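The numbers behind a boxplot can also be computed directly. The sketch below uses a made-up vector `x`; `fivenum()` returns the five-number summary and `boxplot.stats()` reports the points that `boxplot()` would draw individually as outliers. Note that boxplots use hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)    # hypothetical values with one large value

five <- fivenum(x)              # minimum, lower hinge, median, upper hinge, maximum
five

iqr <- IQR(x)                   # interquartile range based on quantile()
out <- boxplot.stats(x)$out     # values beyond 1.5 times the box length from the hinges
out
```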

Remarks

The boxplot command offers the option of using a formula syntax. Here, since we can specify the data set, we don’t have to use the $ to access the variables.
In general, the variables in a data frame are not immediately accessible for use. Try calling births and you will see an error:
Error: object ‘births’ not found
If you use a variable frequently, you may want to extract it and store it as a vector.

b1<-Births78$births
mean(b1)

## [1] 9132.162

median(b1)

## [1] 9218

range(b1)

## [1]  7135 10711

var(b1)

## [1] 668931.1

quantile(b1)

##    0%   25%   50%   75%  100% 
##  7135  8554  9218  9705 10711

#The tapply command allows you to compute numeric summaries on values based on levels
#of a factor variable. For instance, find the mean or median births by wday,
bb<-tapply(b1, Births78$wday, median)
bb

##    Sun    Mon    Tue    Wed    Thu    Fri    Sat 
## 7936.0 9321.0 9667.5 9361.5 9397.0 9544.5 8260.5
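To see how `tapply()` groups values, it can help to try it on a few made-up numbers. In this sketch the vectors `births` and `wday` are hypothetical, and `aggregate()` is shown as one alternative that returns a data frame instead of a named vector.

```r
# Hypothetical values and grouping factor
births <- c(7700, 9300, 9600, 7900, 9400, 9700)
wday   <- factor(c("Sun", "Mon", "Tue", "Sun", "Mon", "Tue"),
                 levels = c("Sun", "Mon", "Tue"))

means <- tapply(births, wday, mean)   # mean of births within each level of wday
means

d   <- data.frame(births, wday)
agg <- aggregate(births ~ wday, data = d, FUN = mean)   # same summary as a data frame
agg
```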

Functions in R are called by typing their name followed by arguments inside parentheses, e.g., mean(b1).

Typing a function name without parentheses will give the code for the function.

sd

## function (x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", 
##     FALSE)) 
## {
##     if (lazyeval::is_formula(x)) {
##         if (is.null(data)) 
##             data <- lazyeval::f_env(x)
##         formula <- mosaicCore::mosaic_formula_q(x, groups = groups, 
##             max.slots = 3)
##         return(maggregate(formula, data = data, FUN = stats::sd, 
##             ..., na.rm = na.rm, .multiple = FALSE))
##     }
##     stats::sd(x, ..., na.rm = na.rm)
## }
## <bytecode: 0x7f96f7ef26d0>
## <environment: namespace:mosaic>
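You can also write your own functions and inspect them the same way. The sketch below defines a hypothetical helper named `spread` (not part of any package) that returns the difference between the largest and smallest value.

```r
# Hypothetical helper: the distance between the maximum and the minimum
spread <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

r1 <- spread(c(3, 8, 5))                  # arguments matched by position
r2 <- spread(c(3, NA, 8), na.rm = TRUE)   # na.rm passed by name
r1
r2

spread   # without parentheses, R prints the definition
```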

Determine the working directory of your R session by typing getwd() or by going to the Files -> More tab in the bottom right panel of RStudio.

getwd() # you must set your WD to the current project folder

## [1] "/Users/tzaihra/Downloads"

#setwd() # to set working directory
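A short sketch of changing the working directory and then restoring it. Here `tempdir()` only stands in for a project folder; in practice you would pass the path to your own folder.

```r
old_wd <- getwd()     # remember the current working directory
setwd(tempdir())      # tempdir() is a stand-in for your project folder
new_wd <- getwd()
new_wd

list.files()          # files in the new working directory
setwd(old_wd)         # restore the original working directory
```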

Subsetting data and vectors

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.

For subsetting a vector, the basic syntax is vector[index].

z <- c(8, 3, 0, 9, 9, 2, 1, 3)
z1<-z[4] 
z1 #z1 is the fourth element of z

## [1] 9

z2<-z[c(1, 4, 5)]
z2 #The first, fourth and fifth element

## [1] 8 9 9

z3<-z[-c(1, 3, 4)] 
z3 #All elements except the first, third and fourth

## [1] 3 9 2 1 3

z4<-z[8:1] 
z4 #The elements of z in reverse

## [1] 3 1 2 9 9 0 3 8
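Besides positions, the index can be a logical condition. The sketch below reuses the vector `z` from above; the names `big`, `pos` and `n_big` are only illustrative.

```r
z <- c(8, 3, 0, 9, 9, 2, 1, 3)

big   <- z[z > 3]       # keep the elements for which the condition is TRUE
pos   <- which(z > 3)   # positions of those elements
n_big <- sum(z > 3)     # TRUE counts as 1, so this counts the matches

big
pos
n_big
```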

For subsetting data

Selecting (keeping) variables (columns) and observations (rows)

Keep the variables/columns age, race group, sex, substance and treatment group from the HELPrct data

head(HELPrct)

##   age anysubstatus anysub cesd d1 daysanysub dayslink drugrisk e2b female
## 1  37            1    yes   49  3        177      225        0  NA      0
## 2  37            1    yes   30 22          2       NA        0  NA      0
## 3  26            1    yes   39  0          3      365       20  NA      0
## 4  39            1    yes   15  2        189      343        0   1      1
## 5  32            1    yes   39 12          2       57        0   1      0
## 6  47            1    yes    6  1         31      365        0  NA      1
##      sex g1b homeless i1 i2 id indtot linkstatus link       mcs      pcs
## 1   male yes   housed 13 26  1     39          1  yes 25.111990 58.41369
## 2   male yes homeless 56 62  2     43         NA <NA> 26.670307 36.03694
## 3   male  no   housed  0  0  3     41          0   no  6.762923 74.80633
## 4 female  no   housed  5  5  4     28          0   no 43.967880 61.93168
## 5   male  no homeless 10 13  5     38          1  yes 21.675755 37.34558
## 6 female  no   housed  4  4  6     29          0   no 55.508991 46.47521
##   pss_fr racegrp satreat sexrisk substance treat avg_drinks max_drinks
## 1      0   black      no       4   cocaine   yes         13         26
## 2      1   white      no       7   alcohol   yes         56         62
## 3     13   black      no       2    heroin    no          0          0
## 4     11   white     yes       4    heroin    no          5          5
## 5     10   black      no       6   cocaine    no         10         13
## 6      5   black      no       5   cocaine   yes          4          4

myvars<-c("age","racegrp","sex","substance","treat")
newdata<-HELPrct[myvars]
head(newdata)

##   age racegrp    sex substance treat
## 1  37   black   male   cocaine   yes
## 2  37   white   male   alcohol   yes
## 3  26   black   male    heroin    no
## 4  39   white female    heroin    no
## 5  32   black   male   cocaine    no
## 6  47   black female   cocaine   yes

Select the 1st and the 6th to 8th columns/variables (a single index without a comma selects columns)

newdata1<-HELPrct[c(1,6:8)]
head(newdata1)

##   age daysanysub dayslink drugrisk
## 1  37        177      225        0
## 2  37          2       NA        0
## 3  26          3      365       20
## 4  39        189      343        0
## 5  32          2       57        0
## 6  47         31      365        0

Exclude the 1st and 2nd columns

newdata3<-newdata[c(-1,-2)]
head(newdata3)

##      sex substance treat
## 1   male   cocaine   yes
## 2   male   alcohol   yes
## 3   male    heroin    no
## 4 female    heroin    no
## 5   male   cocaine    no
## 6 female   cocaine   yes
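Whether an index refers to rows or columns depends on the comma. The following sketch uses a made-up data frame `toy` (an assumption for illustration) to contrast the forms.

```r
# Hypothetical toy data frame
toy <- data.frame(age   = c(37, 26, 39, 47),
                  sex   = c("male", "male", "female", "female"),
                  treat = c("yes", "no", "no", "yes"))

cols      <- toy[c(1, 3)]       # single index, no comma: columns 1 and 3
rows      <- toy[c(1, 3), ]     # index before the comma: rows 1 and 3
drop_rows <- toy[-c(3, 4), ]    # negative index before the comma: drop rows 3 and 4

cols
rows
drop_rows
```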

Selecting observations

newdata4<-newdata[1:5,]
head(newdata4)

##   age racegrp    sex substance treat
## 1  37   black   male   cocaine   yes
## 2  37   white   male   alcohol   yes
## 3  26   black   male    heroin    no
## 4  39   white female    heroin    no
## 5  32   black   male   cocaine    no

#based on variable value
newdata5<-newdata[which(newdata$sex=="female"),]
head(newdata5)

##    age racegrp    sex substance treat
## 4   39   white female    heroin    no
## 6   47   black female   cocaine   yes
## 7   49   black female   cocaine    no
## 9   50   white female   alcohol    no
## 11  34   white female    heroin   yes
## 12  58   black female   alcohol    no

#or 
attach(newdata)
newdata6<-newdata[which(sex=="female" & age>45),]
head(newdata6)

##     age racegrp    sex substance treat
## 6    47   black female   cocaine   yes
## 7    49   black female   cocaine    no
## 9    50   white female   alcohol    no
## 12   58   black female   alcohol    no
## 25   48   black female   cocaine    no
## 122  50   white female   alcohol    no

detach(newdata)

Selecting using subset command

The basic syntax is subset(data, subset = row.condition, select = columns). The row condition must be a logical statement (something that evaluates to TRUE/FALSE).

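# Note: age>=45|age<55 is TRUE for every non-missing age, so all rows are kept;
# use & instead of | to restrict the result to ages 45 through 54
newdata7<-subset(newdata,age>=45|age<55,select=c(age,sex,substance))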
head(newdata7)

##   age    sex substance
## 1  37   male   cocaine
## 2  37   male   alcohol
## 3  26   male    heroin
## 4  39 female    heroin
## 5  32   male   cocaine
## 6  47 female   cocaine

newdata8<-subset(newdata,sex=="female" & age>50,select=c(age,sex,substance))
head(newdata8)

##     age    sex substance
## 12   58 female   alcohol
## 156  57 female   alcohol
## 225  55 female    heroin
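The difference between | (or) and & (and) in a row condition is easy to overlook. A minimal sketch on a made-up data frame `toy` (hypothetical, not the HELPrct data):

```r
# Hypothetical toy data frame
toy <- data.frame(age = c(26, 37, 47, 50, 58),
                  sex = c("male", "male", "female", "female", "female"))

either <- subset(toy, age >= 45 | age < 55)   # every age meets at least one condition
both   <- subset(toy, age >= 45 & age < 55)   # only ages from 45 up to, but not including, 55

nrow(either)
both$age
```

With |, every row is returned because any age is either at least 45 or below 55; with &, only the rows that satisfy both conditions remain.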

Random Samples

Use the sample() function to take a random sample of size n from a dataset.

# take a random sample of size 50 from a dataset 
# sample without replacement
mysample <- HELPrct[sample(1:nrow(HELPrct), 50,
   replace=FALSE),]
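Because the sample is random, it changes every time the code is run. If you want to reproduce a result, one option is to fix the random seed with `set.seed()` first. In the sketch below the seed value 123 and the range 1:100 are arbitrary choices.

```r
set.seed(123)                                # arbitrary seed value
idx1 <- sample(1:100, 10, replace = FALSE)   # 10 distinct row numbers

set.seed(123)                                # same seed again
idx2 <- sample(1:100, 10, replace = FALSE)

identical(idx1, idx2)    # the same seed gives the same sample
anyDuplicated(idx1)      # 0 means no row number is repeated
```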

Graphing

R has some powerful functions for making graphics.

Use source to import the arbuthnot data from the URL specified below. The workspace area in the upper right-hand corner of the RStudio window will list a data set called arbuthnot that has 82 observations on 3 variables.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.

Create a simple plot of the number of girls baptized per year with the command plot. We can add a third argument, type = "l", where the letter l stands for line.

To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in.

source("http://www.openintro.org/stat/data/arbuthnot.R")
head(arbuthnot)

##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820

plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")

?plot
# Plot of the total number of baptisms per year with the command
plot(arbuthnot$year, arbuthnot$boys + arbuthnot$girls, type = "l")

# plot of the proportion of boys over time
plot(x = arbuthnot$year, y = arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls), type = "l")
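Plots are easier to read with axis labels and a title. The sketch below uses made-up counts in a data frame called `toy` (not the arbuthnot data) and adds a dashed reference line at 0.5 with `abline()`.

```r
# Hypothetical counts, not the arbuthnot data
toy <- data.frame(year  = 2001:2005,
                  boys  = c(510, 520, 505, 530, 525),
                  girls = c(490, 495, 500, 505, 510))

# Proportion of boys among all births in each year
toy$prop_boys <- toy$boys / (toy$boys + toy$girls)

plot(toy$year, toy$prop_boys, type = "b",
     xlab = "Year", ylab = "Proportion of boys",
     main = "Proportion of boys by year (toy data)")
abline(h = 0.5, lty = 2)   # dashed horizontal line at one half
```

Here type = "b" draws both points and lines; see ?plot for the other options.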

Is there an apparent trend in the number of girls baptized over the years? How would you describe it?

R LAB Questions

Open up a new project, call it RLAb1 and save it in your working directory.
Load the present-day birth records for the USA with the following command.

source("http://www.openintro.org/stat/data/present.R")

The data are stored in a data frame called present.

What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
How do these counts compare to Arbuthnot’s? Are they on a similar scale?
Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Is it right to claim that boys are born in greater proportion than girls in the U.S.? Include the plot in your response.
In what year did we see the largest total number of births in the U.S.? You can refer to the help files or the R reference card to find helpful commands.
Obtain a subset of the data that includes only the girls, born in the year 1956 or later.
These data come from a report by the Centers for Disease Control.
If you’re interested in learning more, more labs are available for practice.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

Documenting file creation

It’s useful to record some information about how your file was created.

File creation date: 2019-09-03
R version 3.5.2 (2018-12-20)
R version (short form): 3.5.2
mosaic package version: 1.5.0
Additional session information

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mosaic_1.5.0      Matrix_1.2-15     mosaicData_0.17.0 ggformula_0.9.1  
## [5] ggstance_0.3.1    ggplot2_3.1.0     lattice_0.20-38   dplyr_0.7.8      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.7         purrr_0.2.5      splines_3.5.2   
##  [5] colorspace_1.4-0 generics_0.0.2   htmltools_0.3.6  yaml_2.2.0      
##  [9] rlang_0.3.1      pillar_1.3.1     later_0.8.0      glue_1.3.0      
## [13] withr_2.1.2      bindrcpp_0.2.2   bindr_0.1.1      plyr_1.8.4      
## [17] mosaicCore_0.6.0 stringr_1.3.1    munsell_0.5.0    gtable_0.2.0    
## [21] htmlwidgets_1.3  evaluate_0.14    knitr_1.23       httpuv_1.5.1    
## [25] crosstalk_1.0.0  broom_0.5.2      Rcpp_1.0.0       readr_1.3.1     
## [29] xtable_1.8-4     scales_1.0.0     backports_1.1.4  promises_1.0.1  
## [33] leaflet_2.0.2    mime_0.6         gridExtra_2.3    hms_0.4.2       
## [37] digest_0.6.18    stringi_1.2.4    ggrepel_0.8.1    shiny_1.3.2     
## [41] grid_3.5.2       tools_3.5.2      magrittr_1.5     lazyeval_0.2.1  
## [45] tibble_2.0.1     ggdendro_0.1-20  crayon_1.3.4     tidyr_0.8.3     
## [49] pkgconfig_2.0.2  MASS_7.3-51.1    assertthat_0.2.0 rmarkdown_1.13  
## [53] R6_2.3.0         nlme_3.1-137     compiler_3.5.2