R takes a programmer's perspective – makes if difficult for
Focus on
Some hints
Lots of ways. We'll mostly use two of them:
.csv
.rdata
Generally looks like
column1,column2,column3
1,2,3
2,3,4
3,4,5
If these values are in a file called data/demo1.csv then you get these data into R by:
R> csvdata = read.csv('data/demo1.csv', stringsAsFactor=FALSE)
R> csvdata
column1 column2 column3
1 1 2 3
2 2 3 4
3 3 4 5
.rdata files are created by the save() function
(e.g., save(csvdata, file='data/demo1.rdata'))
If we had saved csvdata in data/demo1.rdata then we can get it into R by:
R> load('data/demo1.rdata')
R> csvdata
column1 column2 column3
1 1 2 3
2 2 3 4
3 3 4 5
Data in R are represented in various ways.
Rather than dwelling on all the forms, for now it is sufficient to think of data in a table of values where the rows are the individuals who were measured and the columns are the names of the measures.
In R this is called a data frame.
We can make this concrete with some actual data loaded from the HANDLS website:
R> (load(url('http://handls.nih.gov/zrdata/w00Demographics.rdata')))
[1] "w00Dem"
The function is in parenthesis so that it prints the function's output (the name of the data frame is the output of the load function).
head() prints the first few lines of data.
R> head(w00Dem)
HNDid DOB Age0 Age0grp Age0med Sex Race PovStat
1 8031042901 1948-08-17 56 55-59 Above Women White Above
2 8031077401 1946-02-05 58 55-59 Above Women White Above
3 8031079801 1940-08-28 63 60-64 Above Women White Above
4 8031086302 1970-01-02 34 30-34 Below Men White Above
5 8031107801 1941-08-02 63 60-64 Above Men White Above
6 8031107802 1951-06-09 53 50-54 Above Women White Above
Each row (line) is a HANDLS participant in this example. The columns are characteristics of the participants (e.g., Age0: initial age; Race, etc).
tail() prints the last few lines of data.
R> tail(w00Dem)
HNDid DOB Age0 Age0grp Age0med Sex Race PovStat
3715 8224500501 1964-02-06 41 40-44 Below Men White Below
3716 8224505601 1947-07-04 58 55-59 Above Women White Above
3717 8224515701 1952-07-04 53 50-54 Above Men White Above
3718 8224520701 1953-01-13 52 50-54 Above Women White Above
3719 8224521901 1966-09-19 38 35-39 Below Women White Below
3720 8224521902 1971-01-17 34 30-34 Below Men White Below
Both head & tail print 5 lines by default.
If want want more or less then include n=# where # is the number of lines you want to print (e.g., tail(w00Dem, n=10) to print the last 10 lines.
The summary() is used for created a variety of reports. At its most basic, summary(dataframe) provides basic statistics (but not n).
R> summary(w00Dem)
HNDid DOB Age0 Age0grp Age0med Sex Race PovStat
Min. :8031042901 Min. :1939-07-28 Min. :30.00000 30-34:392 Below:1860 Women:2035 White:1522 Above:2185
1st Qu.:8133082301 1st Qu.:1950-11-27 1st Qu.:40.00000 35-39:461 Above:1860 Men :1685 AfrAm:2198 Below:1535
Median :8162502101 Median :1958-02-09 Median :48.00000 40-44:509
Mean :8160956892 Mean :1958-07-25 Mean :47.71263 45-39:692
3rd Qu.:8192566076 3rd Qu.:1966-01-17 3rd Qu.:55.00000 50-54:631
Max. :8224521902 Max. :1978-07-21 Max. :64.00000 55-59:577
60-64:458
The summary provided by the summary() function is not very satisfying.
An alternative is the describe() function from the Hmisc package (neither installed nor loaded by default).
R> library(Hmisc)
R> describe(w00Dem[,c('Age0','Sex')])
w00Dem[, c("Age0", "Sex")]
2 Variables 3720 Observations
------------------------------------------------------------------------------------------------------------------------------------
Age0 : Initial age
n missing unique Mean .05 .10 .25 .50 .75 .90 .95
3720 0 35 47.71 32 34 40 48 55 60 62
lowest : 30 31 32 33 34, highest: 60 61 62 63 64
------------------------------------------------------------------------------------------------------------------------------------
Sex
n missing unique
3720 0 2
Women (2035, 55%), Men (1685, 45%)
------------------------------------------------------------------------------------------------------------------------------------
Hmisc is an example of a package, a collection of user defined functions available from CRAN, a public archive a R packages.
Another descriptive display is available in the psych package, which has a variety of other functions useful for psychometrics (e.g., component analysis with rotations such as varimax and promax, alpha internal consistency, path diagrams).
R> library(psych)
R> describe(w00Dem[,3:8]) #remove 1st and 2nd columns (HNDid & DOB) because HNDid isn't meaningful and DOB chokes this function
var n mean sd median trimmed mad min max range skew kurtosis se
Age0 1 3720 47.71 9.34 48.0 47.85 11.86 30 64 34 -0.11 -1.00 0.15
Age0grp* 2 3720 4.15 1.86 4.0 4.19 2.97 1 7 6 -0.12 -1.05 0.03
Age0med* 3 3720 1.50 0.50 1.5 1.50 0.74 1 2 1 0.00 -2.00 0.01
Sex* 4 3720 1.45 0.50 1.0 1.44 0.00 1 2 1 0.19 -1.96 0.01
Race* 5 3720 1.59 0.49 2.0 1.61 0.00 1 2 1 -0.37 -1.86 0.01
PovStat* 6 3720 1.41 0.49 1.0 1.39 0.00 1 2 1 0.35 -1.87 0.01
Merge wave 1 neuropsych data with demographics selecting just age, sex, Benton total errors, & Trails B seconds.
R> (load(url('http://handls.nih.gov/zrdata/w01Neupsy.rdata')))
[1] "w01Neupsy"
R> w01 = merge(w00Dem[,c('HNDid','Age0','Sex')], w01Neupsy[,c('HNDid','BVRtot','TrailsBtestSec')], 'HNDid')
R> summary(w01)
HNDid Age0 Sex BVRtot TrailsBtestSec
Min. :8031079801 Min. :30.00000 Women:1574 Min. : 0.000000 Min. : 25.0000
1st Qu.:8133205802 1st Qu.:41.00000 Men :1226 1st Qu.: 2.000000 1st Qu.: 64.0000
Median :8162109301 Median :48.00000 Median : 6.000000 Median : 92.0000
Mean :8160707107 Mean :48.03786 Mean : 6.540666 Mean :159.9653
3rd Qu.:8192393501 3rd Qu.:56.00000 3rd Qu.:10.000000 3rd Qu.:150.0000
Max. :8224521902 Max. :64.00000 Max. :30.000000 Max. :600.0000
NA's :9 NA's :121
Separate elements of a data frame are identified as row and column locations.
Rows are identified by
Columns are identified by
Let's say that we want to summarize just the cognitive data for women.
We can do this with the summary command on the merged data frame w01.
We find just the women with the logical expression w01$Sex=='Women'.
We use this expression in the summary function (or anywhere else) by enclosing it in a which() function (the reasons are a bit obscure but are related to what R does when there are missing data).
R> summary(w01[which(w01$Sex=='Women'),])
HNDid Age0 Sex BVRtot TrailsBtestSec
Min. :8031079801 Min. :30.00000 Women:1574 Min. : 0.000000 Min. : 26.0000
1st Qu.:8133176326 1st Qu.:41.00000 Men : 0 1st Qu.: 3.000000 1st Qu.: 62.0000
Median :8161571702 Median :48.00000 Median : 6.000000 Median : 88.0000
Mean :8160317346 Mean :48.06544 Mean : 6.942602 Mean :152.4196
3rd Qu.:8191904502 3rd Qu.:56.00000 3rd Qu.:10.000000 3rd Qu.:146.5000
Max. :8224521901 Max. :64.00000 Max. :30.000000 Max. :600.0000
NA's :6 NA's :63
Calculate a correlation matrix. In the base R package,
R> cor(w01[,c('Age0','BVRtot','TrailsBtestSec')])
Age0 BVRtot TrailsBtestSec
Age0 1 NA NA
BVRtot NA 1 NA
TrailsBtestSec NA NA 1
doesn't seem useful when there are missing data.
zCor from my package zStat isn't tripped up by missing data:
R> library(zStat)
R> zCor(w01[,c('Age0','BVRtot','TrailsBtestSec')])
Age0 BVRtot TrailsBtestSec
Age0 1.00 0.21*** 0.18***
BVRtot 0.21*** 1.00 0.39***
TrailsBtestSec 0.18*** 0.39*** 1.00
n
Age0 BVRtot TrailsBtestSec
Age0 2800 2791 2679
BVRtot 2791 2791 2672
TrailsBtestSec 2679 2672 2679
While we're using zStat, you might try:
R> library(zStat)
R> zDesc(w01)
N Mean SD Min Max
HNDid 2800 8160707107.07 41831641.12 8031079801 8224521902 HANDLS Identification
Age0 2800 48.04 9.24 30 64 Initial age
BVRtot 2791 6.54 5.32 0 30 NeuPsy: BVR Total errors
TrailsBtestSec 2679 159.97 169.62 25 600 NeuPsy: Trails B test seconds
R> library(doBy)
R> library(zStat)
R> summaryBy(BVRtot ~ Sex, w01, FUN=zByFUN)
Sex BVRtot.n BVRtot.mean BVRtot.sd BVRtot.min BVRtot.max
1 Women 1568 6.942602041 5.311534821 0 30
2 Men 1223 6.025347506 5.285484417 0 23
R> summaryBy(TrailsBtestSec ~ Sex, w01, FUN=zByFUN)
Sex TrailsBtestSec.n TrailsBtestSec.mean TrailsBtestSec.sd TrailsBtestSec.min TrailsBtestSec.max
1 Women 1511 152.4195897 163.1331232 26 600
2 Men 1168 169.7268836 177.2463897 25 600
Although R has routines for t tests (t.test), oneway ANOVA (oneway in the lattice package), and ANOVA (aov), we tend to use general linear model formulations most frequently.
Note the aov secretly performs regression and displays the results as a tradional ANOVA table.
A simple multiple regression is done with lm – to regress age, sex, and age × sex on Benton total score:
R> (mod.result1 = lm(BVRtot~Age0*Sex, w01))
Call:
lm(formula = BVRtot ~ Age0 * Sex, data = w01)
Coefficients:
(Intercept) Age0 SexMen Age0:SexMen
1.48932686 0.11345614 -1.64274336 0.01527688
A more interpretable display is available by applying the summary function to the object holding the linear regression results (mod.results1).
R> summary(mod.result1)
Call:
lm(formula = BVRtot ~ Age0 * Sex, data = w01)
Residuals:
NeuPsy: BVR Total errors
Min 1Q Median 3Q Max
-8.7505199 -3.9563269 -0.6395695 3.2167654 21.3629363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.48932686 0.68618323 2.17045 0.030057
Age0 0.11345614 0.01401388 8.09599 8.3978e-16
SexMen -1.64274336 1.05038263 -1.56395 0.117943
Age0:SexMen 0.01527688 0.02148000 0.71121 0.477011
Residual standard error: 5.184251 on 2787 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.0509401, Adjusted R-squared: 0.04991851
F-statistic: 49.8634 on 3 and 2787 DF, p-value: < 2.2204e-16
For just main effects:
R> mod.result2 = lm(BVRtot~Age0+Sex, w01)
R> summary(mod.result2)
Call:
lm(formula = BVRtot ~ Age0 + Sex, data = w01)
Residuals:
NeuPsy: BVR Total errors
Min 1Q Median 3Q Max
-8.8541374 -3.9854093 -0.6948813 3.2254907 21.2658212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.17678212 0.52695343 2.23318 0.025616
Age0 0.11995868 0.01061964 11.29593 < 2.22e-16
SexMen -0.90905876 0.19776256 -4.59672 0.000004483
Residual standard error: 5.183791 on 2788 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.05076785, Adjusted R-squared: 0.05008691
F-statistic: 74.5554 on 2 and 2788 DF, p-value: < 2.2204e-16
Although not entirely useful here, we can compare the two linear models to determine whether the interaction term adds useful information. This command is general in that it will compare models from other types of regression (e.g., nonlinear, mixed-model).
R> anova(mod.result1, mod.result2)
Analysis of Variance Table
Model 1: BVRtot ~ Age0 * Sex
Model 2: BVRtot ~ Age0 + Sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2787 74904.680
2 2788 74918.275 -1 -13.594804 0.50583 0.47701
Note that in this simple case the probablility for the difference between the two models (0.477) is the same as the probability of t in the regression analysis).
Most variables in R are numeric on a continuous scale (e.g., height).
In R, these variables are called factors.
In our example data, age is a continous variable with a (more or less) rectangular distribution per the HANDLS sampling design.
R> hist(w01$Age0, main='Frequency histogram for baseline age')
Imagine that we have some reason to believe that there are meaningful differences in Benton errors among three age groups, 30-39, 40-49, & 50-64).
We can create these categories in several ways, perhaps the most elegant of which is
R> w01$age.grp = factor((w01$Age0>=30) + (w01$Age0>=40) + (w01$Age0>=50), levels=1:3, labels=c('30-39', '40-49', '50-64'))
R> summaryBy(Age0~age.grp, w01, FUN=zByFUN)
age.grp Age0.n Age0.mean Age0.sd Age0.min Age0.max
1 30-39 598 34.84448161 2.862141492 30 39
2 40-49 914 44.88074398 2.899182401 40 49
3 50-64 1288 56.40372671 4.237515335 50 64
Let's see if we were right about age differences in Benton performance.
We might also suspect that age interacts with sex.
R> summary(lm(BVRtot ~ age.grp*Sex, w01))
Call:
lm(formula = BVRtot ~ age.grp * Sex, data = w01)
Residuals:
NeuPsy: BVR Total errors
Min 1Q Median 3Q Max
-7.7732962 -3.9902913 -0.8383838 3.1616162 22.2267038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0808383 0.2843144 17.87049 < 2.22e-16
age.grp40-49 1.9094529 0.3650473 5.23070 1.8140e-07
age.grp50-64 2.6924579 0.3440715 7.82529 7.1342e-15
SexMen -0.9249448 0.4283593 -2.15927 0.030914
age.grp40-49:SexMen -0.2269626 0.5514483 -0.41158 0.680682
age.grp50-64:SexMen 0.1800173 0.5185676 0.34714 0.728510
Residual standard error: 5.196036 on 2785 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.04730441, Adjusted R-squared: 0.04559401
F-statistic: 27.65685 on 5 and 2785 DF, p-value: < 2.2204e-16
Were we right?
By default, R uses the first level as the reference (e.g., 40-49 v. 30-39, 50-64 v. 30-39).
Like most things, there are ways to change this behavior, including the contrast coding (e.g., effect instead of dummy coding).
This presentation provides some of the basics but didn't touch on lots of other things