Some R basics

General considerations

R takes a programmer's perspective – makes if difficult for

In this presentation

Focus on

Some hints

Getting data into R

Lots of ways. We'll mostly use two of them:

.csv file

Generally looks like

column1,column2,column3
1,2,3
2,3,4
3,4,5

If these values are in a file called data/demo1.csv then you get these data into R by:

R> csvdata = read.csv('data/demo1.csv', stringsAsFactor=FALSE)
R> csvdata
  column1 column2 column3
1       1       2       3
2       2       3       4
3       3       4       5

.rdata file

.rdata files are created by the save() function
(e.g., save(csvdata, file='data/demo1.rdata'))

If we had saved csvdata in data/demo1.rdata then we can get it into R by:

R> load('data/demo1.rdata')
R> csvdata
  column1 column2 column3
1       1       2       3
2       2       3       4
3       3       4       5

Data in R

Data in R are represented in various ways.

Rather than dwelling on all the forms, for now it is sufficient to think of data in a table of values where the rows are the individuals who were measured and the columns are the names of the measures.

In R this is called a data frame.

We can make this concrete with some actual data loaded from the HANDLS website:

R> (load(url('http://handls.nih.gov/zrdata/w00Demographics.rdata')))
[1] "w00Dem"

The function is in parenthesis so that it prints the function's output (the name of the data frame is the output of the load function).

What's in the data frame (head)

head() prints the first few lines of data.

R> head(w00Dem)
       HNDid        DOB Age0 Age0grp Age0med   Sex  Race PovStat
1 8031042901 1948-08-17   56   55-59   Above Women White   Above
2 8031077401 1946-02-05   58   55-59   Above Women White   Above
3 8031079801 1940-08-28   63   60-64   Above Women White   Above
4 8031086302 1970-01-02   34   30-34   Below   Men White   Above
5 8031107801 1941-08-02   63   60-64   Above   Men White   Above
6 8031107802 1951-06-09   53   50-54   Above Women White   Above

Each row (line) is a HANDLS participant in this example. The columns are characteristics of the participants (e.g., Age0: initial age; Race, etc).

What's in the data frame (tail)

tail() prints the last few lines of data.

R> tail(w00Dem)
          HNDid        DOB Age0 Age0grp Age0med   Sex  Race PovStat
3715 8224500501 1964-02-06   41   40-44   Below   Men White   Below
3716 8224505601 1947-07-04   58   55-59   Above Women White   Above
3717 8224515701 1952-07-04   53   50-54   Above   Men White   Above
3718 8224520701 1953-01-13   52   50-54   Above Women White   Above
3719 8224521901 1966-09-19   38   35-39   Below Women White   Below
3720 8224521902 1971-01-17   34   30-34   Below   Men White   Below

Both head & tail print 5 lines by default.

If want want more or less then include n=# where # is the number of lines you want to print (e.g., tail(w00Dem, n=10) to print the last 10 lines.

Descriptive statistics

The summary() is used for created a variety of reports. At its most basic, summary(dataframe) provides basic statistics (but not n).

R> summary(w00Dem)
     HNDid                 DOB                  Age0           Age0grp     Age0med        Sex          Race       PovStat    
 Min.   :8031042901   Min.   :1939-07-28   Min.   :30.00000   30-34:392   Below:1860   Women:2035   White:1522   Above:2185  
 1st Qu.:8133082301   1st Qu.:1950-11-27   1st Qu.:40.00000   35-39:461   Above:1860   Men  :1685   AfrAm:2198   Below:1535  
 Median :8162502101   Median :1958-02-09   Median :48.00000   40-44:509                                                      
 Mean   :8160956892   Mean   :1958-07-25   Mean   :47.71263   45-39:692                                                      
 3rd Qu.:8192566076   3rd Qu.:1966-01-17   3rd Qu.:55.00000   50-54:631                                                      
 Max.   :8224521902   Max.   :1978-07-21   Max.   :64.00000   55-59:577                                                      
                                                              60-64:458                                                      

Descriptive statistics (continued)

The summary provided by the summary() function is not very satisfying.

An alternative is the describe() function from the Hmisc package (neither installed nor loaded by default).

R> library(Hmisc)
R> describe(w00Dem[,c('Age0','Sex')])
w00Dem[, c("Age0", "Sex")] 

 2  Variables      3720  Observations
------------------------------------------------------------------------------------------------------------------------------------
Age0 : Initial age 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
   3720       0      35   47.71      32      34      40      48      55      60      62 

lowest : 30 31 32 33 34, highest: 60 61 62 63 64 
------------------------------------------------------------------------------------------------------------------------------------
Sex 
      n missing  unique 
   3720       0       2 

Women (2035, 55%), Men (1685, 45%) 
------------------------------------------------------------------------------------------------------------------------------------

Hmisc is an example of a package, a collection of user defined functions available from CRAN, a public archive a R packages.

Descriptive statistics (continued again)

Another descriptive display is available in the psych package, which has a variety of other functions useful for psychometrics (e.g., component analysis with rotations such as varimax and promax, alpha internal consistency, path diagrams).

R> library(psych)
R> describe(w00Dem[,3:8])   #remove 1st and 2nd columns (HNDid & DOB) because HNDid isn't meaningful and DOB chokes this function
         var    n  mean   sd median trimmed   mad min max range  skew kurtosis   se
Age0       1 3720 47.71 9.34   48.0   47.85 11.86  30  64    34 -0.11    -1.00 0.15
Age0grp*   2 3720  4.15 1.86    4.0    4.19  2.97   1   7     6 -0.12    -1.05 0.03
Age0med*   3 3720  1.50 0.50    1.5    1.50  0.74   1   2     1  0.00    -2.00 0.01
Sex*       4 3720  1.45 0.50    1.0    1.44  0.00   1   2     1  0.19    -1.96 0.01
Race*      5 3720  1.59 0.49    2.0    1.61  0.00   1   2     1 -0.37    -1.86 0.01
PovStat*   6 3720  1.41 0.49    1.0    1.39  0.00   1   2     1  0.35    -1.87 0.01

Merging data frames

Merging (continued)

Merge wave 1 neuropsych data with demographics selecting just age, sex, Benton total errors, & Trails B seconds.

R> (load(url('http://handls.nih.gov/zrdata/w01Neupsy.rdata')))
[1] "w01Neupsy"
R> w01 = merge(w00Dem[,c('HNDid','Age0','Sex')], w01Neupsy[,c('HNDid','BVRtot','TrailsBtestSec')], 'HNDid')
R> summary(w01)
     HNDid                 Age0             Sex           BVRtot          TrailsBtestSec    
 Min.   :8031079801   Min.   :30.00000   Women:1574   Min.   : 0.000000   Min.   : 25.0000  
 1st Qu.:8133205802   1st Qu.:41.00000   Men  :1226   1st Qu.: 2.000000   1st Qu.: 64.0000  
 Median :8162109301   Median :48.00000                Median : 6.000000   Median : 92.0000  
 Mean   :8160707107   Mean   :48.03786                Mean   : 6.540666   Mean   :159.9653  
 3rd Qu.:8192393501   3rd Qu.:56.00000                3rd Qu.:10.000000   3rd Qu.:150.0000  
 Max.   :8224521902   Max.   :64.00000                Max.   :30.000000   Max.   :600.0000  
                                                      NA's   :9           NA's   :121       

Selected subsets from a data frame

Separate elements of a data frame are identified as row and column locations.

Rows are identified by

Columns are identified by

Selecting specific rows

Let's say that we want to summarize just the cognitive data for women.

We can do this with the summary command on the merged data frame w01.

We find just the women with the logical expression w01$Sex=='Women'.

We use this expression in the summary function (or anywhere else) by enclosing it in a which() function (the reasons are a bit obscure but are related to what R does when there are missing data).

R> summary(w01[which(w01$Sex=='Women'),])
     HNDid                 Age0             Sex           BVRtot          TrailsBtestSec    
 Min.   :8031079801   Min.   :30.00000   Women:1574   Min.   : 0.000000   Min.   : 26.0000  
 1st Qu.:8133176326   1st Qu.:41.00000   Men  :   0   1st Qu.: 3.000000   1st Qu.: 62.0000  
 Median :8161571702   Median :48.00000                Median : 6.000000   Median : 88.0000  
 Mean   :8160317346   Mean   :48.06544                Mean   : 6.942602   Mean   :152.4196  
 3rd Qu.:8191904502   3rd Qu.:56.00000                3rd Qu.:10.000000   3rd Qu.:146.5000  
 Max.   :8224521901   Max.   :64.00000                Max.   :30.000000   Max.   :600.0000  
                                                      NA's   :6           NA's   :63        

Correlations

Calculate a correlation matrix. In the base R package,

R> cor(w01[,c('Age0','BVRtot','TrailsBtestSec')])
               Age0 BVRtot TrailsBtestSec
Age0              1     NA             NA
BVRtot           NA      1             NA
TrailsBtestSec   NA     NA              1

doesn't seem useful when there are missing data.

Correlations on missing data (pairwise deletion)

zCor from my package zStat isn't tripped up by missing data:

R> library(zStat)
R> zCor(w01[,c('Age0','BVRtot','TrailsBtestSec')])
               Age0     BVRtot   TrailsBtestSec
Age0            1.00     0.21***  0.18***      
BVRtot          0.21***  1.00     0.39***      
TrailsBtestSec  0.18***  0.39***  1.00         

n
               Age0 BVRtot TrailsBtestSec
Age0           2800   2791           2679
BVRtot         2791   2791           2672
TrailsBtestSec 2679   2672           2679

Descriptive statistics (again)

While we're using zStat, you might try:

R> library(zStat)
R> zDesc(w01)
                  N          Mean          SD        Min        Max                              
HNDid          2800 8160707107.07 41831641.12 8031079801 8224521902 HANDLS Identification        
Age0           2800         48.04        9.24         30         64 Initial age                  
BVRtot         2791          6.54        5.32          0         30 NeuPsy: BVR Total errors     
TrailsBtestSec 2679        159.97      169.62         25        600 NeuPsy: Trails B test seconds

Descriptive statistics breakdown by group

R> library(doBy)
R> library(zStat)
R> summaryBy(BVRtot ~ Sex, w01, FUN=zByFUN)
    Sex BVRtot.n BVRtot.mean   BVRtot.sd BVRtot.min BVRtot.max
1 Women     1568 6.942602041 5.311534821          0         30
2   Men     1223 6.025347506 5.285484417          0         23
R> summaryBy(TrailsBtestSec ~ Sex, w01, FUN=zByFUN)
    Sex TrailsBtestSec.n TrailsBtestSec.mean TrailsBtestSec.sd TrailsBtestSec.min TrailsBtestSec.max
1 Women             1511         152.4195897       163.1331232                 26                600
2   Men             1168         169.7268836       177.2463897                 25                600

Linear regression

Although R has routines for t tests (t.test), oneway ANOVA (oneway in the lattice package), and ANOVA (aov), we tend to use general linear model formulations most frequently.

Note the aov secretly performs regression and displays the results as a tradional ANOVA table.

A simple multiple regression is done with lm – to regress age, sex, and age × sex on Benton total score:

R> (mod.result1 = lm(BVRtot~Age0*Sex, w01))

Call:
lm(formula = BVRtot ~ Age0 * Sex, data = w01)

Coefficients:
(Intercept)         Age0       SexMen  Age0:SexMen  
 1.48932686   0.11345614  -1.64274336   0.01527688  

Linear regression summary

A more interpretable display is available by applying the summary function to the object holding the linear regression results (mod.results1).

R> summary(mod.result1)

Call:
lm(formula = BVRtot ~ Age0 * Sex, data = w01)

Residuals:
NeuPsy: BVR Total errors 
       Min         1Q     Median         3Q        Max 
-8.7505199 -3.9563269 -0.6395695  3.2167654 21.3629363 

Coefficients:
               Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)  1.48932686  0.68618323  2.17045   0.030057
Age0         0.11345614  0.01401388  8.09599 8.3978e-16
SexMen      -1.64274336  1.05038263 -1.56395   0.117943
Age0:SexMen  0.01527688  0.02148000  0.71121   0.477011

Residual standard error: 5.184251 on 2787 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.0509401, Adjusted R-squared:  0.04991851 
F-statistic:  49.8634 on 3 and 2787 DF,  p-value: < 2.2204e-16

Linear regression (just main effects)

For just main effects:

R> mod.result2 = lm(BVRtot~Age0+Sex, w01)
R> summary(mod.result2)

Call:
lm(formula = BVRtot ~ Age0 + Sex, data = w01)

Residuals:
NeuPsy: BVR Total errors 
       Min         1Q     Median         3Q        Max 
-8.8541374 -3.9854093 -0.6948813  3.2254907 21.2658212 

Coefficients:
               Estimate  Std. Error  t value    Pr(>|t|)
(Intercept)  1.17678212  0.52695343  2.23318    0.025616
Age0         0.11995868  0.01061964 11.29593  < 2.22e-16
SexMen      -0.90905876  0.19776256 -4.59672 0.000004483

Residual standard error: 5.183791 on 2788 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.05076785,    Adjusted R-squared:  0.05008691 
F-statistic:  74.5554 on 2 and 2788 DF,  p-value: < 2.2204e-16

Comparing models

Although not entirely useful here, we can compare the two linear models to determine whether the interaction term adds useful information. This command is general in that it will compare models from other types of regression (e.g., nonlinear, mixed-model).

R> anova(mod.result1, mod.result2)
Analysis of Variance Table

Model 1: BVRtot ~ Age0 * Sex
Model 2: BVRtot ~ Age0 + Sex
  Res.Df       RSS Df  Sum of Sq       F  Pr(>F)
1   2787 74904.680                              
2   2788 74918.275 -1 -13.594804 0.50583 0.47701

Note that in this simple case the probablility for the difference between the two models (0.477) is the same as the probability of t in the regression analysis).

Categories

Most variables in R are numeric on a continuous scale (e.g., height).

In R, these variables are called factors.

Creating factors

In our example data, age is a continous variable with a (more or less) rectangular distribution per the HANDLS sampling design.

R> hist(w01$Age0, main='Frequency histogram for baseline age')

plot of chunk hist1

Factors (continued)

Imagine that we have some reason to believe that there are meaningful differences in Benton errors among three age groups, 30-39, 40-49, & 50-64).

We can create these categories in several ways, perhaps the most elegant of which is

R> w01$age.grp = factor((w01$Age0>=30) + (w01$Age0>=40) + (w01$Age0>=50), levels=1:3, labels=c('30-39', '40-49', '50-64'))
R> summaryBy(Age0~age.grp, w01, FUN=zByFUN)
  age.grp Age0.n   Age0.mean     Age0.sd Age0.min Age0.max
1   30-39    598 34.84448161 2.862141492       30       39
2   40-49    914 44.88074398 2.899182401       40       49
3   50-64   1288 56.40372671 4.237515335       50       64

Regression models with factors as predictors

Let's see if we were right about age differences in Benton performance.

We might also suspect that age interacts with sex.


R> summary(lm(BVRtot ~ age.grp*Sex, w01))

Call:
lm(formula = BVRtot ~ age.grp * Sex, data = w01)

Residuals:
NeuPsy: BVR Total errors 
       Min         1Q     Median         3Q        Max 
-7.7732962 -3.9902913 -0.8383838  3.1616162 22.2267038 

Coefficients:
                      Estimate Std. Error  t value   Pr(>|t|)
(Intercept)          5.0808383  0.2843144 17.87049 < 2.22e-16
age.grp40-49         1.9094529  0.3650473  5.23070 1.8140e-07
age.grp50-64         2.6924579  0.3440715  7.82529 7.1342e-15
SexMen              -0.9249448  0.4283593 -2.15927   0.030914
age.grp40-49:SexMen -0.2269626  0.5514483 -0.41158   0.680682
age.grp50-64:SexMen  0.1800173  0.5185676  0.34714   0.728510

Residual standard error: 5.196036 on 2785 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.04730441,    Adjusted R-squared:  0.04559401 
F-statistic: 27.65685 on 5 and 2785 DF,  p-value: < 2.2204e-16

Were we right?

Default contrasts

By default, R uses the first level as the reference (e.g., 40-49 v. 30-39, 50-64 v. 30-39).

Like most things, there are ways to change this behavior, including the contrast coding (e.g., effect instead of dummy coding).

What we did and what we didn't

This presentation provides some of the basics but didn't touch on lots of other things

Assignment symbols & other (quirky) details

Some conveniences