Data Exploration

Karen Mazidi

R is a great tool for getting to know your data. This R script will demonstrate some common data exploration techniques with R.

Get some data

This R script uses the brain head data set: http://www.stat.ufl.edu/~winner/data/brainhead.dat

The data set is associated with the following paper: A Study of the Relations of the Brain to the Size of the Head, by R.J. Gladstone, published in Biometrika, 1905. It's a rather quaint data set, created well over a century ago.

First, we'll download the data from within R.

The following code checks if the file already exists in the data folder. If not, it will download the file.

data_file = "data/brainhead.dat"
if (!file.exists(data_file)) {
  dir.create(dirname(data_file), showWarnings = FALSE)  # create the data folder if needed
  download.file("http://www.stat.ufl.edu/~winner/data/brainhead.dat", destfile=data_file)
}

Next, the read.table() function reads the file into the variable brain. R also has functions for reading CSV files, Excel files, and much more.

The head() and tail() functions display the first/last few lines. Each line of the file represents data for one individual. This data set has the following columns:

col 1: 1 for male, 2 for female
col 2: 1 for ages 20-46, 2 for over 46
col 3: head size in cubic cm
col 4: brain weight in grams

brain = read.table(data_file, header=FALSE)
head(brain)

##   V1 V2   V3   V4
## 1  1  1 4512 1530
## 2  1  1 3738 1297
## 3  1  1 4261 1335
## 4  1  1 3777 1282
## 5  1  1 4177 1590
## 6  1  1 3585 1300

tail(brain, n=2)  # list the last 2 rows

##     V1 V2   V3   V4
## 236  2  2 3352 1170
## 237  2  2 3391 1120

dim(brain) # gives the number of rows and columns

## [1] 237   4

Data frames and Matrices

R has two data structures for tabular data: the matrix and the data frame. A matrix is a data object in which all variables (columns) contain the same type of data. A data frame is a data object in which the variables (columns) can have different data types: numeric, character, logical. The read.table() function returns a data frame, so we will leave it as one.
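
As a minimal sketch of that difference, the toy values below (made up for illustration, not taken from the brain data) show a matrix coercing mixed columns to a single type while a data frame keeps one type per column.

```r
# Toy values for illustration only (not from the brain data set).
m <- cbind(id = 1:3, name = c("a", "b", "c"))
typeof(m)          # "character": the numbers were coerced to strings

d <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(d, class)   # id stays integer, name stays character
```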

Notice that our data does not have any column headings. We can add them as shown below.

If we run head() again, we see the column headers are in place.

colnames(brain) <- c("Gender", "Age", "Head","Brain")
head(brain,n=2)

##   Gender Age Head Brain
## 1      1   1 4512  1530
## 2      1   1 3738  1297
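
The names can also be supplied while reading, through the col.names argument of read.table(). The sketch below types the first two rows of the file inline (via the text argument) so that it runs without the download; the variable names txt and toy are only for illustration.

```r
# First two rows of the file, typed inline so the sketch runs offline.
txt <- "1 1 4512 1530
1 1 3738 1297"

# col.names sets the column headings at read time.
toy <- read.table(text = txt, header = FALSE,
                  col.names = c("Gender", "Age", "Head", "Brain"))
names(toy)
```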

R functions for data exploration

Exploring the data with the code below indicates that:

there are 237 individuals in the data
about 57% are male
about 54% are over 46
the mean brain weight is 1282.873 grams
the median brain weight is 1280 grams
the standard deviation of the brain weight is 120.3404 grams
brain weight ranges from 955 to 1635 grams

length(brain$Gender)

## [1] 237

sum(brain$Gender == 1)

## [1] 134

pct_male = sum(brain$Gender == 1) / length(brain$Gender)
pct_male

## [1] 0.5654008

pct_over46 = sum(brain$Age == 2) / length(brain$Age)
pct_over46

## [1] 0.535865

mean(brain$Brain)

## [1] 1282.873

median(brain$Brain)

## [1] 1280

sd(brain$Brain)

## [1] 120.3404

range(brain$Brain)

## [1]  955 1635
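
A couple of shortcuts give the same kind of counts and proportions. This is only a sketch on an 8-row toy data frame built from the rows printed by head() and tail() above, so the numbers differ from the full data set: the mean of a logical vector is the proportion of TRUE values, and table() with prop.table() tabulates every category at once.

```r
# Toy data: rows 1-6 and 236-237 as printed earlier (not the full data set).
toy <- data.frame(
  Gender = c(1, 1, 1, 1, 1, 1, 2, 2),
  Brain  = c(1530, 1297, 1335, 1282, 1590, 1300, 1170, 1120)
)

pct_male <- mean(toy$Gender == 1)  # same as sum(...) / length(...)
counts   <- table(toy$Gender)      # count per category
props    <- prop.table(counts)     # counts converted to proportions
props
```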

Attaching data

The attach() function allows us to simply type the variable (column) name, such as “Head”, instead of “brain$Head”.

attach(brain) # attach the data set
mean(Head)    # now we can access column as Head instead of brain$Head

## [1] 3633.992
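
One caveat is worth a quick sketch: attach() makes a copy of the columns, so later changes to the data frame are not visible through the attached names. The two-row toy data frame below is only for illustration; with() is a base R alternative that always reads the current data frame.

```r
# Two-row toy data frame for illustration.
toy <- data.frame(Head = c(4512, 3738), Brain = c(1530, 1297))

attach(toy)
toy$Head[1] <- 0                    # modify the data frame after attaching
attached_mean <- mean(Head)         # still uses the attached copy: 4125
detach(toy)                         # remove the copy from the search path

with_mean <- with(toy, mean(Head))  # reads the current toy: 1869
```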

Indexing data

Data frames and matrices are indexed by [row, col] and counting starts at 1.

When the col is missing, as in “brain[1,]” it selects the entire row. When the row is missing, as in “brain[,3]” it selects the entire column.

If you want a portion of a row or column use the [start:stop] notation.

row1 = brain[1,]
row1  # display row 1

##   Gender Age Head Brain
## 1      1   1 4512  1530

col3 = brain[,3]
col3[1:5]  # display the first 5 elements of column 3

## [1] 4512 3738 4261 3777 4177

brain[15,4]  # brain weight of 15th individual

## [1] 1208
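
Columns can also be indexed by name, and rows by a logical condition. The sketch below uses a toy data frame made from the rows printed earlier. It also shows that selecting a single column returns a plain vector unless drop = FALSE is given.

```r
# Toy data: rows 1-6 as printed by head() above.
toy <- data.frame(
  Head  = c(4512, 3738, 4261, 3777, 4177, 3585),
  Brain = c(1530, 1297, 1335, 1282, 1590, 1300)
)

second_brain <- toy[2, "Brain"]           # index a column by name
heads_vec <- toy[, "Head"]                # a plain numeric vector
heads_df  <- toy[, "Head", drop = FALSE]  # stays a one-column data frame
heavy     <- toy[toy$Brain > 1500, ]      # rows where the condition is TRUE
```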

Summary

The summary() function gives key statistics for each variable in the data frame.

summary(brain)

##      Gender           Age             Head          Brain     
##  Min.   :1.000   Min.   :1.000   Min.   :2720   Min.   : 955  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3389   1st Qu.:1207  
##  Median :1.000   Median :2.000   Median :3614   Median :1280  
##  Mean   :1.435   Mean   :1.536   Mean   :3634   Mean   :1283  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3876   3rd Qu.:1350  
##  Max.   :2.000   Max.   :2.000   Max.   :4747   Max.   :1635

Correlations

Is there a correlation between head size and brain weight? Yes, the correlation is nearly 0.8.

cor(Head, Brain)

## [1] 0.7995697
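
By default cor() computes the Pearson correlation, which is the covariance divided by the product of the two standard deviations. The sketch below checks this on the six rows printed by head() earlier; the vector names are only for illustration, and the value for six rows will differ from the full-data result above.

```r
# Rows 1-6 as printed by head() above.
head_size <- c(4512, 3738, 4261, 3777, 4177, 3585)
brain_wt  <- c(1530, 1297, 1335, 1282, 1590, 1300)

r_builtin <- cor(head_size, brain_wt)
r_manual  <- cov(head_size, brain_wt) / (sd(head_size) * sd(brain_wt))

all.equal(r_builtin, r_manual)  # the two computations agree
```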

Visualizing data

A histogram shows the distribution of brain weight.

hist(Brain)
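
The bin boundaries can be set with the breaks argument, and plot = FALSE returns the bin counts without drawing anything. This is a sketch on eight brain weights taken from the rows printed earlier; the break points are an arbitrary choice for illustration.

```r
# Eight brain weights from the rows printed earlier.
x <- c(1530, 1297, 1335, 1282, 1590, 1300, 1170, 1120)

# 100-gram bins from 1100 to 1600; plot = FALSE only computes the histogram.
h <- hist(x, breaks = seq(1100, 1600, by = 100), plot = FALSE)
h$breaks  # bin boundaries
h$counts  # number of observations in each bin
```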

Plot

R has a plot function to help visualize the data. The following command creates a scatterplot (xy plot), with Head as x and Brain as y. We will see more about the R notation Brain~Head when we talk about linear regression in a later post.

plot(Brain~Head, brain, xlab="Head Size in cm^3", ylab="Brain Weight in grams", main="Plot of Brain Weight as a Function of Head Size")

Subsetting

R provides many ways to look at a portion of the data:

selecting and/or excluding variables
selecting and/or excluding observations
using the with() and by() functions
using the subset() function
random samples

The selected portion of the data frame can then be passed to other functions such as summary() and plot().

Selecting and/or excluding variables

# select Head and Brain only
sel_col <- c("Head", "Brain")
df <- brain[sel_col]
summary(df)

##       Head          Brain     
##  Min.   :2720   Min.   : 955  
##  1st Qu.:3389   1st Qu.:1207  
##  Median :3614   Median :1280  
##  Mean   :3634   Mean   :1283  
##  3rd Qu.:3876   3rd Qu.:1350  
##  Max.   :4747   Max.   :1635

# select all but Age
names(brain)

## [1] "Gender" "Age"    "Head"   "Brain"

df <- brain[-2] # omit col 2
names(df)

## [1] "Gender" "Head"   "Brain"
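
Columns can also be excluded by name rather than by position, which keeps working if the column order changes. Two possible ways are sketched below on a two-row toy data frame.

```r
# Two-row toy data frame with the same column names as brain.
toy <- data.frame(Gender = c(1, 1), Age = c(1, 1),
                  Head = c(4512, 3738), Brain = c(1530, 1297))

no_age_a <- toy[, !(names(toy) %in% "Age")]  # logical test on the names
no_age_b <- subset(toy, select = -Age)       # subset() accepts -name
names(no_age_a)
```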

Selecting and/or excluding observations

df <- brain[6:10,]  # get 5 rows (rows 6 through 10)
nrow(df)            # length(df) would return the number of columns instead

## [1] 5

df <- brain[which(Gender==2),]
mean(df$Brain)

## [1] 1219.146
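
The which() wrapper matters when the column contains missing values. The brain data has none, so the sketch below uses a toy data frame with one NA added on purpose: plain logical indexing returns an extra all-NA row, while which() silently drops it.

```r
# Toy data with a deliberately missing Gender value.
toy <- data.frame(Gender = c(1, 2, NA, 2),
                  Brain  = c(1530, 1170, 1300, 1120))

by_logical <- toy[toy$Gender == 2, ]         # includes a row of NAs
by_which   <- toy[which(toy$Gender == 2), ]  # NA comparison is dropped

nrow(by_logical)
nrow(by_which)
```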

Using the with() and by() functions

The with() function has the form: with(data, expr)

where data typically is a list or data frame, and expr is one or more R expressions over data. Note that there is also a within() function which returns a new object that reflects any revisions that were made by expr.

The by() function has the form: by(data, indices, func, …)

where data is typically a data frame, indices is a factor or a list of factors, and func is a function to apply to each subset of the data. The by() function applies the function to the subset of data for each level of the factor.

# logical vector: TRUE where Head falls in one of three narrow bands
sel <- with(brain,
            (2990 < Head) & (Head <= 3010) |
            (3490 < Head) & (Head <= 3510) |
            (3990 < Head) & (Head <= 4010))
plot(Brain~Head, data=brain, subset=sel)

by(brain, brain$Gender, function(x) mean(x$Brain))

## brain$Gender: 1
## [1] 1331.858
## -------------------------------------------------------- 
## brain$Gender: 2
## [1] 1219.146
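
Here is a small sketch of with() and within() side by side, on a toy data frame built from the rows printed earlier. The Ratio column is a made-up example. The last line uses tapply(), which is swapped in here as a compact base R alternative to by() when the function returns a single number per group.

```r
# Toy data: rows 1-6 and 236-237 as printed earlier (not the full data set).
toy <- data.frame(
  Gender = c(1, 1, 1, 1, 1, 1, 2, 2),
  Head   = c(4512, 3738, 4261, 3777, 4177, 3585, 3352, 3391),
  Brain  = c(1530, 1297, 1335, 1282, 1590, 1300, 1170, 1120)
)

mean_brain <- with(toy, mean(Brain))         # evaluate an expression over toy
toy2 <- within(toy, Ratio <- Brain / Head)   # returns a modified copy of toy
by_gender <- tapply(toy$Brain, toy$Gender, mean)  # one mean per Gender level
by_gender
```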

Using the subset() function

The form of subset() is: subset(data, subset, select, drop, …)

where data is the object from which the subset is drawn, subset is a logical expression indicating which rows to keep, select indicates which columns to keep, and drop is passed on to the [ indexing operator.

df <- subset(brain, Gender==1 & Age==1, select=Brain:Head)
tail(df)

##    Brain Head
## 52  1350 3793
## 53  1335 4270
## 54  1390 4063
## 55  1400 4012
## 56  1225 3458
## 57  1310 3890
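
For comparison, the same selection can be written with bracket indexing. The sketch below checks on a toy data frame (rows printed earlier, not the full data) that the two forms agree.

```r
# Toy data: rows 1-6 and 236-237 as printed earlier.
toy <- data.frame(
  Gender = c(1, 1, 1, 1, 1, 1, 2, 2),
  Age    = c(1, 1, 1, 1, 1, 1, 2, 2),
  Head   = c(4512, 3738, 4261, 3777, 4177, 3585, 3352, 3391),
  Brain  = c(1530, 1297, 1335, 1282, 1590, 1300, 1170, 1120)
)

a <- subset(toy, Gender == 1 & Age == 1, select = Brain:Head)
b <- toy[toy$Gender == 1 & toy$Age == 1, c("Brain", "Head")]

all.equal(a, b)  # both give the Brain and Head columns for the same rows
```

The subset() form is shorter because column names can be used directly; the bracket form needs the toy$ prefix.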

Random samples

The sample() function has the form: sample(data, size, replace=FALSE, prob=NULL)

where data typically is a vector, size is the number of items to choose, replace indicates whether or not it is sampling with replacement, and prob is a vector of probability weights for obtaining the elements.

df <- brain[sample(1:nrow(brain), 50, replace=FALSE),]
head(df)

##     Gender Age Head Brain
## 4        1   1 3777  1282
## 164      2   1 3292  1075
## 132      1   2 3532  1335
## 51       1   1 3891  1224
## 232      2   2 3704  1220
## 58       1   2 4166  1560
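
Because the sample is random, the rows shown above will differ from run to run. Calling set.seed() first makes the draw repeatable, as sketched below; the seed value 1234 and the toy vector are arbitrary choices for illustration.

```r
# Toy vector standing in for the row numbers of a data frame.
rows <- 1:10

set.seed(1234)                      # arbitrary seed, chosen for illustration
first_draw <- sample(rows, 3, replace = FALSE)

set.seed(1234)                      # same seed again
second_draw <- sample(rows, 3, replace = FALSE)

identical(first_draw, second_draw)  # the same rows are drawn both times
```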

Quantitative versus qualitative data

In the first plot below, R treats the Gender variable as a numeric vector. The plot generally tells us that male brains tend to be a little heavier (but not necessarily better!). The Gender variable is actually categorical data, just encoded as numbers. The choice of 1 or 2 for gender is purely arbitrary, so Gender should not be interpreted as a quantitative variable. We can tell R to treat Gender as a qualitative variable by using the as.factor() function.

After the conversion, the same plot() command creates a box-and-whisker plot instead of a scatter plot.

plot(Brain~Gender)  # creates a scatter plot

Gender = as.factor(Gender)
plot(Brain~Gender)  # creates a box and whisker plot

Let's do the same thing with Age.

Yikes! It appears that brains shrink a little with age.

Age = as.factor(Age)
plot(Brain~Age)
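
The factor() function can also attach readable labels to the codes, so plots and tables show names instead of 1 and 2. This is a minimal sketch on toy codes; the label text follows the column description given at the top of the post.

```r
# Toy gender codes for illustration (1 = male, 2 = female).
gender_codes <- c(1, 1, 2, 2, 1)

gender <- factor(gender_codes, levels = c(1, 2),
                 labels = c("male", "female"))
levels(gender)  # the labels replace the numeric codes
table(gender)   # counts per labelled category
```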

Looking ahead

That's all for this post. In the next post we will explore this same data with linear regression.