R is a great tool for getting to know your data. This R script will demonstrate some common data exploration techniques with R.
This R script uses the brain head data set: http://www.stat.ufl.edu/~winner/data/brainhead.dat
The data set is associated with the following paper: A Study of the Relations of the Brain to the Size of the Head, by R.J. Gladstone, published in Biometrika, 1905. It's a rather quaint data set, created well over a century ago.
First, we'll download the data from within R.
The following code checks if the file already exists in the data folder. If not, it will download the file.
data_file = "data/brainhead.dat"
if (!file.exists(data_file)) {
dir.create(dirname(data_file), showWarnings = FALSE) # create the data folder if needed
download.file("http://www.stat.ufl.edu/~winner/data/brainhead.dat", destfile=data_file)
}
Next, the read.table() function is used to read the file into the variable brain. R also has functions to read csv files, Excel files and much more.
The head() and tail() functions display the first and last few lines. Each line of the file represents data for one individual. This data set has four columns, in this order: gender (1 = male, 2 = female), age range (1 = younger group, 2 = over 46), head size in cm^3, and brain weight in grams.
brain = read.table(data_file, header=FALSE)
head(brain)
## V1 V2 V3 V4
## 1 1 1 4512 1530
## 2 1 1 3738 1297
## 3 1 1 4261 1335
## 4 1 1 3777 1282
## 5 1 1 4177 1590
## 6 1 1 3585 1300
tail(brain, n=2) # list the last 2 rows
## V1 V2 V3 V4
## 236 2 2 3352 1170
## 237 2 2 3391 1120
dim(brain) # gives the number of rows and columns
## [1] 237 4
R has two data structures for data tables: the matrix and the data frame. A matrix is a data object in which all variables (columns) contain the same type of data. A data frame is a data object in which the variables (columns) can have different data types: numeric, character, logical. The read.table() function returns a data frame, and we will leave it as one.
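The difference between the two structures is easy to see by converting a mixed-type data frame to a matrix. The sketch below uses a small made-up data frame (people is a hypothetical name, and the "male"/"female" labels are illustrative, not part of the file):

```r
# A data frame can hold columns of different types.
people <- data.frame(
  Gender = c("male", "female"),   # character column (illustrative labels)
  Brain  = c(1530, 1170),         # numeric column
  stringsAsFactors = FALSE
)
sapply(people, class)   # reports one class per column

# A matrix allows only one type, so the numbers are converted to text.
m <- as.matrix(people)
typeof(m)               # "character"
```

Because of this coercion, a data frame is usually the safer choice whenever the columns mix numbers with labels.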
Notice that our data does not have any column headings. We can add them as shown below.
If we run head() again, we see the column headers are in place.
colnames(brain) <- c("Gender", "Age", "Head","Brain")
head(brain,n=2)
## Gender Age Head Brain
## 1 1 1 4512 1530
## 2 1 1 3738 1297
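As an alternative to renaming the columns afterwards, read.table() accepts a col.names argument, so the names can be supplied while reading. The sketch below reads three rows of the data from an inline string through the text argument instead of the downloaded file, so that it runs on its own (raw_text and small are hypothetical names):

```r
# Three rows copied from the output shown earlier, as inline text.
raw_text <- "1 1 4512 1530
1 1 3738 1297
2 2 3352 1170"

# col.names sets the headings at read time.
small <- read.table(text = raw_text, header = FALSE,
                    col.names = c("Gender", "Age", "Head", "Brain"))
names(small)
```

With the real file, the same col.names argument could be passed alongside data_file.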
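Exploring the data with the code below indicates that the data set covers 237 individuals, 134 of them (about 57%) male; that about 54% are in the over-46 age group; and that brain weight has a mean of about 1283 g, a median of 1280 g, a standard deviation of about 120 g, and a range of 955 g to 1635 g.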
length(brain$Gender)
## [1] 237
sum(brain$Gender == 1)
## [1] 134
pct_male = sum(brain$Gender == 1) / length(brain$Gender)
pct_male
## [1] 0.5654008
pct_over46 = sum(brain$Age == 2) / length(brain$Age)
pct_over46
## [1] 0.535865
mean(brain$Brain)
## [1] 1282.873
median(brain$Brain)
## [1] 1280
sd(brain$Brain)
## [1] 120.3404
range(brain$Brain)
## [1] 955 1635
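The proportions above divide a count by a length. Since R treats TRUE as 1 and FALSE as 0, the mean of a logical vector gives the same proportion in one step, and table() with prop.table() gives every category at once. A minimal sketch on a made-up gender column (small and its values are hypothetical):

```r
# Hypothetical gender codes: four individuals coded 1, two coded 2.
small <- data.frame(Gender = c(1, 1, 1, 1, 2, 2))

# The mean of a logical vector is the proportion of TRUE values.
prop_male <- mean(small$Gender == 1)

# table() counts each code; prop.table() turns counts into proportions.
counts <- table(small$Gender)
prop.table(counts)
```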
The attach() function will allow us to simply type the variable (column) name such as “Head” instead of “brain$Head”
attach(brain) # attach the data set
mean(Head) # now we can access column as Head instead of brain$Head
## [1] 3633.992
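One caveat with attach(): it places a copy of the data frame on the search path, so later changes to the data frame are not visible through the bare column names. The sketch below shows this with a made-up two-row data frame (heads is a hypothetical name; the values are the first two head sizes shown earlier), and uses with() as an alternative that always reads the current data:

```r
heads <- data.frame(Head = c(4512, 3738))
attach(heads)

# Change the data frame after attaching it.
heads$Head <- heads$Head / 1000

# The bare name still refers to the attached copy, not the updated column.
attached_head <- Head

detach(heads)   # remove the copy from the search path

# with() evaluates the expression against the data frame as it is now.
with_head <- with(heads, mean(Head))
```

Calling detach() once the attached names are no longer needed avoids this kind of surprise.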
Data frames and matrices are indexed by [row, col] and counting starts at 1.
When the col is missing, as in “brain[1,]” it selects the entire row. When the row is missing, as in “brain[,3]” it selects the entire column.
If you want a portion of a row or column use the [start:stop] notation.
row1 = brain[1,]
row1 # display row 1
## Gender Age Head Brain
## 1 1 1 4512 1530
col3 = brain[,3]
col3[1:5] # display the first 5 elements of column 3
## [1] 4512 3738 4261 3777 4177
brain[15,4] # brain weight (column 4) of the 15th individual
## [1] 1208
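Columns can also be selected by name rather than by position, which makes it harder to confuse one column with another. A minimal sketch on the first three rows of the data (small is a hypothetical name):

```r
small <- data.frame(
  Gender = c(1, 1, 1),
  Age    = c(1, 1, 1),
  Head   = c(4512, 3738, 4261),
  Brain  = c(1530, 1297, 1335)
)

small[2, "Brain"]             # same cell as small[2, 4], but self-documenting
small[, c("Head", "Brain")]   # several columns by name

col_vec <- small[, "Head"]                 # one column drops to a plain vector
col_df  <- small[, "Head", drop = FALSE]   # drop = FALSE keeps a data frame
```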
The summary function gives important statistics about each variable in the data frame.
summary(brain)
## Gender Age Head Brain
## Min. :1.000 Min. :1.000 Min. :2720 Min. : 955
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3389 1st Qu.:1207
## Median :1.000 Median :2.000 Median :3614 Median :1280
## Mean :1.435 Mean :1.536 Mean :3634 Mean :1283
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3876 3rd Qu.:1350
## Max. :2.000 Max. :2.000 Max. :4747 Max. :1635
Is there a correlation between head size and brain weight? Yes, the correlation coefficient is nearly 0.8, which indicates a fairly strong positive linear relationship.
cor(Head, Brain)
## [1] 0.7995697
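By default, cor() computes the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that relationship on the first six rows of the data (head_size and brain_wt are hypothetical names):

```r
# First six rows of the data set, as shown by head(brain).
head_size <- c(4512, 3738, 4261, 3777, 4177, 3585)
brain_wt  <- c(1530, 1297, 1335, 1282, 1590, 1300)

r_builtin <- cor(head_size, brain_wt)

# Pearson's r from its definition: covariance over the product of the sds.
r_manual <- cov(head_size, brain_wt) / (sd(head_size) * sd(brain_wt))

all.equal(r_builtin, r_manual)
```

Six rows are far too few to say anything about the full data set; the point is only that the two computations agree.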
A histogram shows the distribution of brain weight.
hist(Brain)
R has a plot function to help visualize the data. The following command creates a scatterplot (xy plot), with Head as x and Brain as y. We will see more about the R notation Brain~Head when we talk about linear regression in a later post.
plot(Brain~Head, brain, xlab="Head Size in cm^3", ylab="Brain Weight in grams", main="Plot of Brain Weight as a Function of Head Size")
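R provides many ways to look at a portion of the data, such as selecting columns by name, dropping columns by position, slicing rows, and filtering rows with a condition.
The selected portion of the data frame can then be passed to other functions such as summary() and plot().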
# select Head and Brain only
sel_col <- c("Head", "Brain")
df <- brain[sel_col]
summary(df)
## Head Brain
## Min. :2720 Min. : 955
## 1st Qu.:3389 1st Qu.:1207
## Median :3614 Median :1280
## Mean :3634 Mean :1283
## 3rd Qu.:3876 3rd Qu.:1350
## Max. :4747 Max. :1635
# select all but Age
names(brain)
## [1] "Gender" "Age" "Head" "Brain"
df <- brain[-2] # omit col 2
names(df)
## [1] "Gender" "Head" "Brain"
df <- brain[6:10,] # get 5 rows (rows 6 through 10)
length(df) # length() of a data frame counts columns, not rows; use nrow(df) for rows
## [1] 4
df <- brain[which(Gender==2),]
mean(df$Brain)
## [1] 1219.146
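Wrapping the condition in which() matters mainly when the column contains missing values: a bare logical index turns each NA into a row of NAs, whereas which() keeps only the positions that are TRUE. The brain data shown here does not appear to need this, so the sketch below uses a made-up data frame with one missing gender code (small is a hypothetical name):

```r
small <- data.frame(
  Gender = c(1, 2, NA, 2),            # one missing code (hypothetical)
  Brain  = c(1530, 1170, 1300, 1120)
)

by_logical <- small[small$Gender == 2, ]          # includes an all-NA row
by_which   <- small[which(small$Gender == 2), ]   # only the matching rows

nrow(by_logical)   # 3
nrow(by_which)     # 2
```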
The with() function has the form: with(data, expr)
where data typically is a list or data frame, and expr is one or more R expressions over data. Note that there is also a within() function which returns a new object that reflects any revisions that were made by expr.
The by() function has the form: by(data, indices, func, …)
where data is typically a data frame, indices is a factor (or a list of factors), and func is a function to apply to each subset of the data. In other words, by() applies a function to the rows belonging to each level of a factor.
# Logical vector: TRUE where the head size is within 10 cm^3 of 3000, 3500 or 4000
in_band <- with(brain,
(2990 < Head) & (Head <= 3010) |
(3490 < Head) & (Head <= 3510) |
(3990 < Head) & (Head <= 4010))
plot(Brain~Head, data=brain, subset=in_band)
by(brain, brain$Gender, function(x) mean(x$Brain))
## brain$Gender: 1
## [1] 1331.858
## --------------------------------------------------------
## brain$Gender: 2
## [1] 1219.146
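For a single numeric column, tapply() and aggregate() produce the same group means as by() in a more compact form. A minimal sketch on five rows taken from the outputs above (small is a hypothetical name):

```r
small <- data.frame(
  Gender = c(1, 1, 1, 2, 2),
  Brain  = c(1530, 1297, 1335, 1170, 1120)
)

# tapply() returns a named vector: one mean per gender code.
means_vec <- tapply(small$Brain, small$Gender, mean)

# aggregate() returns a data frame: one row per gender code.
means_df <- aggregate(Brain ~ Gender, data = small, FUN = mean)
```

The named vector is convenient for quick lookups, while the data frame form is easier to merge or plot.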
The form of subset() for a data frame is: subset(data, subset, select, drop, …)
where data is the object from which the subset is drawn, subset is a logical expression indicating which rows to keep, select indicates which columns to keep, and drop is passed on to the indexing operator.
df <- subset(brain, Gender==1 & Age==1, select=Brain:Head)
tail(df)
## Brain Head
## 52 1350 3793
## 53 1335 4270
## 54 1390 4063
## 55 1400 4012
## 56 1225 3458
## 57 1310 3890
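The same selection can be written with bracket indexing, which is more verbose but makes the row filter and the column choice explicit. A minimal sketch on five rows taken from the outputs in this post (small is a hypothetical name):

```r
small <- data.frame(
  Gender = c(1, 1, 1, 2, 2),
  Age    = c(1, 1, 2, 1, 2),
  Head   = c(4512, 3738, 3532, 3292, 3352),
  Brain  = c(1530, 1297, 1335, 1075, 1170)
)

# subset(): column names are looked up inside the data frame.
via_subset <- subset(small, Gender == 1 & Age == 1, select = c(Brain, Head))

# Bracket indexing: rows by logical condition, columns by name.
via_bracket <- small[small$Gender == 1 & small$Age == 1, c("Brain", "Head")]

all.equal(via_subset, via_bracket)
```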
The sample() function has the form: sample(data, size, replace=FALSE, prob=NULL)
where data typically is a vector, size is the number of items to choose, replace indicates whether or not it is sampling with replacement, and prob is a vector of probability weights for obtaining the elements.
df <- brain[sample(1:nrow(brain), 50, replace=FALSE),]
head(df)
## Gender Age Head Brain
## 4 1 1 3777 1282
## 164 2 1 3292 1075
## 132 1 2 3532 1335
## 51 1 1 3891 1224
## 232 2 2 3704 1220
## 58 1 2 4166 1560
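Because sample() is random, the rows above will differ from run to run. Calling set.seed() first makes the draw repeatable. A minimal sketch, assuming an arbitrary seed of 42 and the 237 rows reported by dim(brain):

```r
n_rows <- 237          # number of rows in the brain data set

set.seed(42)           # arbitrary seed, chosen only for illustration
idx_a <- sample(seq_len(n_rows), 5, replace = FALSE)

set.seed(42)           # the same seed reproduces the same draw
idx_b <- sample(seq_len(n_rows), 5, replace = FALSE)

identical(idx_a, idx_b)   # TRUE
```

The resulting index vector can be used as the row selector, in the same way as the sample() call above.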
In the first plot below, R treats the Gender variable as a numeric vector. The plot suggests that male brains tend to be a little heavier (but not necessarily better!). The Gender variable is actually categorical data, just encoded as numbers. The choice of 1 or 2 for gender is purely arbitrary, and the codes should not be interpreted as quantities. So we can tell R to treat Gender as a qualitative variable by using the as.factor() function.
Now R will create a box and whisker chart with the same command we used earlier.
plot(Brain~Gender) # creates a scatter plot
Gender = as.factor(Gender)
plot(Brain~Gender) # creates a box and whisker plot
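The factor() function can also attach readable labels to the codes, so that plots and tables show names instead of 1 and 2. The sketch below assumes that 1 means male, as the pct_male calculation earlier implies, and that 2 means female (gender_codes and gender are hypothetical names):

```r
gender_codes <- c(1, 1, 2, 2, 1)   # made-up sample of codes

# levels lists the raw codes; labels gives each code a readable name.
gender <- factor(gender_codes, levels = c(1, 2),
                 labels = c("Male", "Female"))

levels(gender)
table(gender)   # counts per label
```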
Let's do the same thing with Age.
Yikes! It appears that brains shrink a little with age, at least as measured by weight.
Age = as.factor(Age)
plot(Brain~Age)
That's all for this post. In the next post we will explore this same data with linear regression.