This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

About Rmd file

Text

Text can be decorated with bold or italics. It is also possible to

  • create links
  • include mathematics like \(e=mc^2\) or \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

Be sure to put a space after the * when you are creating bullets and a space after # when creating section headers, but not between $ and the mathematical formulas.

You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

knitr settings to control how R chunks work.

require(knitr)
opts_chunk$set(
  tidy=FALSE,     # display code as typed
  size="small"    # slightly smaller font for code
)
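
To check that such settings took effect, they can be read back. The sketch below assumes the knitr package is installed and uses `opts_chunk$get()`; the variable name `current` is only illustrative.

```r
library(knitr)

# Set the same two chunk options as in the settings above.
opts_chunk$set(
  tidy = FALSE,   # display code as typed
  size = "small"  # slightly smaller font for code
)

# Read the options back to confirm what later chunks will use.
current <- opts_chunk$get(c("tidy", "size"))
print(current)
```

Options set this way apply to every chunk that follows, unless a chunk overrides them in its own header (as with echo = FALSE earlier).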

The most important template is

  • goal(y ~ x, data = mydata)
  • What you want R to do is the goal. This determines the function to use (favstats, mean, sd, lm).
  • What R must know to do that is the formula and the data. These determine the inputs to the function: you must identify the variables and the data frame.
  • It produces single- and multiple-variable graphical summaries
  • It produces single- and multiple-variable numerical summaries
  • It fits linear models
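
As a minimal sketch of the template in action, the same formula and data frame can be handed to three different goals. This assumes the mosaic package is installed (loading it also makes the HELPrct data available); the object names `group_means`, `age_plot`, and `age_model` are illustrative.

```r
library(mosaic)  # loading mosaic also makes the HELPrct data frame available

# Same formula and data, three different goals.
group_means <- mean(age ~ sex, data = HELPrct)    # numerical summary
age_plot    <- bwplot(age ~ sex, data = HELPrct)  # graphical summary
age_model   <- lm(age ~ sex, data = HELPrct)      # linear model

group_means        # mean age for each sex
print(age_plot)    # draw the boxplots
coef(age_model)    # intercept and the male/female difference
```

Only the function name changes from line to line; the formula and the data argument stay the same.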

Univariate Summaries

Numerical Summaries for one variable

favstats(~ age, data=HELPrct)
##  min Q1 median Q3 max     mean       sd   n missing
##   19 30     35 40  60 35.65342 7.710266 453       0
tally(~ sex, data=HELPrct)
## sex
## female   male 
##    107    346
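
The individual statistics that favstats() reports can also be requested one at a time with the one-variable form of the template, goal(~ x, data = mydata). A short sketch, assuming the mosaic package is loaded; the variable names are illustrative.

```r
library(mosaic)

# One-variable template: nothing on the left of the ~.
age_mean   <- mean(~ age, data = HELPrct)
age_sd     <- sd(~ age, data = HELPrct)
age_median <- median(~ age, data = HELPrct)

# Collect the three values for comparison with the favstats() output.
c(mean = unname(age_mean), sd = unname(age_sd), median = unname(age_median))
```

These values should agree with the mean, sd, and median columns of the favstats() output.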

Graphical Summaries for one variable

# graphing a quantitative (numeric) variable
histogram(~age,data=HELPrct)

densityplot(~age,data=HELPrct)

bwplot(~age,data=HELPrct)

qqmath(~age,data=HELPrct)

freqpolygon(~age,data=HELPrct)

bargraph(~age,data=HELPrct)

bargraph(~sex, data=HELPrct) #graphing categorical variable

Bivariate Summaries

Categorical variable vs. categorical variable

tally(homeless~sex,data=HELPrct)
##           sex
## homeless   female male
##   homeless     40  169
##   housed       67  177
bargraph(~sex,groups=homeless, data=HELPrct,auto.key=TRUE)

Numerical summaries of two variables

Quantitative variable vs. quantitative variable

i1: average number of drinks consumed per day in the past 30 days

cor(i1~age, data=HELPrct)
## [1] 0.2069538
xyplot(i1~age, data=HELPrct)
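
For comparison, the formula version of cor() can be checked against the base R call that takes the two vectors directly. This is a sketch assuming the mosaic package is loaded; `r_formula` and `r_base` are illustrative names.

```r
library(mosaic)

r_formula <- cor(i1 ~ age, data = HELPrct)  # formula template
r_base    <- with(HELPrct, cor(i1, age))    # base R equivalent

# Both calls compute the same Pearson correlation.
all.equal(r_formula, r_base)
```

The formula version keeps the same shape as xyplot(i1 ~ age, data = HELPrct), so the number and the plot come from matching calls.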

Categorical Variable vs. Quantitative Variable

a1<-favstats(age~substance|sex,data=HELPrct)
a1
##              sex min Q1 median   Q3 max     mean       sd   n missing
## 1 alcohol.female  23 33   37.0 45.0  58 39.16667 7.980333  36       0
## 2 cocaine.female  24 31   34.0 38.0  49 34.85366 6.195002  41       0
## 3  heroin.female  21 29   34.0 39.0  55 34.66667 8.035839  30       0
## 4   alcohol.male  20 32   38.0 42.0  58 37.95035 7.575644 141       0
## 5   cocaine.male  23 30   33.0 37.0  60 34.36036 6.889772 111       0
## 6    heroin.male  19 27   32.5 39.0  53 33.05319 7.973568  94       0
## 7         female  21 31   35.0 40.5  58 36.25234 7.584858 107       0
## 8           male  19 30   35.0 40.0  60 35.46821 7.750110 346       0
a2<-favstats(age~ racegrp, data=HELPrct)
a2
##    racegrp min    Q1 median    Q3 max     mean       sd   n missing
## 1    black  20 31.00     35 39.00  60 35.68246 7.083759 211       0
## 2 hispanic  21 28.25     32 36.25  55 33.20000 7.989789  50       0
## 3    other  22 30.00     34 40.50  48 34.96154 7.660187  26       0
## 4    white  19 30.00     36 42.00  58 36.46386 8.281152 166       0
bwplot(age~racegrp, data=HELPrct) # boxplot

a3<-mean(age~substance|sex,data=HELPrct,.format="table") #tabular form
a3
##   substance    sex     mean
## 1   alcohol female 39.16667
## 2   alcohol   male 37.95035
## 3   cocaine female 34.85366
## 4   cocaine   male 34.36036
## 5    heroin female 34.66667
## 6    heroin   male 33.05319

Categorical Variable vs. Categorical Variable

Numerical summaries

Cross Tabulations

tally(sex~substance,data=HELPrct)
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94
summary(sex~substance,data=HELPrct) # summary() does not follow the template; it only describes the formula object
##  Length   Class    Mode 
##       3 formula    call
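
Counts are often easier to compare as proportions. The sketch below assumes the mosaic package is loaded and uses the `format` and `margins` arguments of tally(); `props` is an illustrative name.

```r
library(mosaic)

# Counts with totals added.
tally(sex ~ substance, data = HELPrct, margins = TRUE)

# Proportion of each sex within each substance group.
props <- tally(~ sex | substance, data = HELPrct, format = "proportion")
props

# Each substance column sums to 1 because we conditioned on substance.
colSums(props)
```

Conditioning with | substance makes the proportions comparable across groups of different sizes.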

Graphical summaries two variables

xyplot(i1~age,data=HELPrct)

bwplot(age~substance,data=HELPrct)

Tips

To turn a numerical summary into a plot, replace the summary function's name with a plot function's name and keep the formula:

bwplot(age~substance|sex,data=HELPrct, .format="table")

Add groups = group to overlay the groups within a panel.

Use y ~ x | z to create multipanel plots.

densityplot(~age|sex,data=HELPrct,groups=substance, auto.key=TRUE)

Some other generic functions that will come in handy as we progress in the course

The mosaic package includes datasets and extras such as xchisq.test(), xpnorm(), mplot(), and xqqmath().

xpnorm( 700, mean=500, sd=100)
## 
## If X ~ N(500, 100), then
##  P(X <= 700) = P(Z <= 2) = 0.9772
##  P(X >  700) = P(Z >  2) = 0.02275
## 

## [1] 0.9772499
xpnorm( c(300, 700), mean=500, sd=100)
## 
## If X ~ N(500, 100), then
##  P(X <= 300) = P(Z <= -2) = 0.02275  P(X <= 700) = P(Z <=  2) = 0.97725
##  P(X >  300) = P(Z >  -2) = 0.97725  P(X >  700) = P(Z >   2) = 0.02275
## 

## [1] 0.02275013 0.97724987
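
The two cumulative probabilities above can be combined to get the probability of falling between 300 and 700. A minimal sketch using base R's pnorm(), which returns the same numbers as xpnorm() without the printout or plot; the variable names are illustrative.

```r
# Base R only: pnorm() gives P(X <= q) for a normal distribution.
p_low  <- pnorm(300, mean = 500, sd = 100)
p_high <- pnorm(700, mean = 500, sd = 100)

# Probability of falling between 300 and 700,
# that is, within 2 standard deviations of the mean.
p_between <- p_high - p_low
p_between  # about 0.9545
```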

Modelling

  • linear models: lm(), glm()
a<-lm(age~substance*sex, data=HELPrct)
plot(a)
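
Beyond plot(), a fitted model can be inspected with the usual base R extractor functions. The sketch below refits the same model so that it runs on its own, assuming the mosaic package is loaded for the HELPrct data; `old_par` is an illustrative name.

```r
library(mosaic)

a <- lm(age ~ substance * sex, data = HELPrct)

coef(a)   # intercept, main effects, and interaction terms
anova(a)  # sequential F-tests for substance, sex, and their interaction

# Show the four default diagnostic plots in a single 2-by-2 figure.
old_par <- par(mfrow = c(2, 2))
plot(a)
par(old_par)  # restore the previous plotting layout
```

With three substances and two sexes, the interaction model estimates six coefficients, one for each combination of groups.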

***************************

Your turn

****************************

Refer to the Quiz handout given in class and follow the instructions to complete it.