Data Exploration

R offers many convenient ways to look at your data and summarize it. This tutorial covers a few of my favorites. It is by no means exhaustive, but it should give you a useful starting point.

If you’re interested in other options and code, check out resources online, especially Quick R.

View your data

View all data

We will be using the iris dataset, a freely available example dataset that can be called from your download of R. See the tutorial Getting Started in R for how to read in your own data using read.csv().

Use the code below to view iris:

View(iris)

That code should have opened a new window showing the dataframe. In this window, you can view all of the data. Importantly, you can click on a column header to sort the dataframe by that column. Click again and it will sort in the reverse direction. This is a great way to check for outliers!
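If you would rather stay in the console, the same kind of sort can be done with order(). This is a minimal sketch; the object names sorted.asc and sorted.desc are my own and are not used elsewhere in this tutorial.

```r
# Sort iris by Sepal.Length, smallest values first
sorted.asc <- iris[order(iris$Sepal.Length), ]
head(sorted.asc, 3)   # the three rows with the shortest sepals

# Sort with the largest values first to inspect the other extreme
sorted.desc <- iris[order(iris$Sepal.Length, decreasing = TRUE), ]
head(sorted.desc, 3)  # the three rows with the longest sepals
```

Looking at both ends of a sorted column is a quick check for suspiciously small or large values.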

View some data

If you want to quickly get an idea of your data, and hate that popup that View() produces, there are several options for you below. Check them out:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

head() returns the first 6 rows and tail() returns the last 6 rows by default.
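Both functions accept an n argument if six rows is not what you want. A short sketch (first10 and last3 are names I chose for illustration):

```r
# Ask for a specific number of rows with the n argument
first10 <- head(iris, n = 10)  # first 10 rows
last3   <- tail(iris, n = 3)   # last 3 rows

first10
last3
```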

Summary of data (structure)

If you want to see the “structure” of the data, use the code below. It tells you the number of observations and variables, the name and class of each variable (numeric, integer, factor, etc.), and the first few values of each.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Column names

What if you can’t remember the names of your variables? Was there an underscore or a hyphen? Never fear, the code below solves all, and quickly.

colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

See? It returns all of the column names. Super handy if you have 35 variables.

Checking type of data, object, or variable

class(iris) #tells us that iris is a dataframe

## [1] "data.frame"

class(iris$Species) #tells us that the variable Species is a factor

## [1] "factor"
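To check every column at once instead of one variable at a time, class() can be applied across the dataframe with sapply(). A small sketch; col.classes is a name I made up.

```r
# Apply class() to each column of iris and collect the results
col.classes <- sapply(iris, class)
col.classes
```

The result is a named character vector, so col.classes["Species"] looks up a single column.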

Changing data type

So what if you’re checking out your data and you see that a variable should be a different type than what R is recognizing?

If your variable should be numeric and it is coming in as a factor (or as character, which is what read.csv() produces by default in R 4.0 and later), you likely have a mistake in the data. Make sure that your missing data is entered as NA, not as periods or anything else. Make sure that there aren’t any extra spaces in your numeric data (I’ve accidentally included a space before a number and that hosed the whole operation).
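If the damage is already done, the column can be repaired in R. The sketch below uses a made-up vector (raw) standing in for a messy column; the values are invented for illustration and are not from iris. One thing to watch: calling as.numeric() directly on a factor returns the internal level codes, not the numbers you see, so convert to character first.

```r
# Hypothetical messy column: a stray leading space and a period for missing data
raw <- factor(c("5.1", " 4.9", ".", "4.7"))

cleaned <- as.character(raw)     # factor -> text (avoids getting level codes)
cleaned <- trimws(cleaned)       # remove leading/trailing spaces
cleaned[cleaned == "."] <- NA    # recode periods as missing values
fixed <- as.numeric(cleaned)     # text -> numbers
fixed

# When reading a file, na.strings can handle the periods up front, e.g.
# read.csv("mydata.csv", na.strings = c("NA", "."))   # "mydata.csv" is a placeholder
```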

If you want numeric data to be read as factor (like you want plot number to be categorical, not quantitative), that is easy peasy! Use the code below.

iris$Sepal.Len.factor<-as.factor(iris$Sepal.Length)

Here we made a new variable in the dataframe iris called ‘Sepal.Len.factor’. We used the function as.factor(). Note: what I just did, convert Sepal length to a factor, is nonsensical, but I just wanted to show you how to do it.

Scatterplots

Simple Scatterplot

plot(Sepal.Length~Sepal.Width,data=iris)

plot(Sepal.Length~Sepal.Width,data=iris,xlab="Sepal Width (cm)",ylab="Sepal Length (cm)",main="") #Change x and y labels (iris measurements are in centimeters), specify that no main title should be written

Scatterplot Matrix

Scatterplot matrices are good to check a lot of variables for correlations, all at the same time. We won’t have many variables in this example, but you can get the idea - you can examine a lot all at once.

You simply use the function pairs(), listing all of the variables you are interested in with the formula syntax (~var1+var2+…). Then you specify your data and, optionally, add a title to the top using main.

pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris,
   main="Simple Scatterplot Matrix")
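Since iris contains three species, coloring the points by species can make the matrix easier to read. This is a sketch using the non-formula form of pairs(), where the columns are passed directly; species.cols is a name I introduced, and the colors come from whatever R's default palette is on your system.

```r
# Integer codes 1, 2, 3 for the factor levels setosa, versicolor, virginica
species.cols <- as.integer(iris$Species)

# Pass the four numeric columns directly and color each point by its species code
pairs(iris[, 1:4], col = species.cols,
      main = "Scatterplot Matrix Colored by Species")
```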

Correlation Matrix

cor() returns Pearson correlations (by default) for each pair of variables.

cor(iris[,c(1:4)],use="complete.obs") #Pearson correlation matrix only of numeric variables (columns 1-4); rows with missing values are dropped

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

You can switch to Kendall or Spearman correlations by changing the method argument. Below I change the method to Spearman.

cor(iris[,c(1:4)],use="complete.obs",method="spearman") #Spearman correlation matrix only of numeric variables (columns 1-4)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1667777    0.8818981   0.8342888
## Sepal.Width    -0.1667777   1.0000000   -0.3096351  -0.2890317
## Petal.Length    0.8818981  -0.3096351    1.0000000   0.9376668
## Petal.Width     0.8342888  -0.2890317    0.9376668   1.0000000
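With your own data you may not know which column positions are numeric. One way around hard-coding 1:4 is to pick the numeric columns programmatically, sketched here with the Kendall method for variety. The names numeric.cols and cor.kendall are my own.

```r
# TRUE for each numeric column, FALSE otherwise (Species is a factor)
numeric.cols <- sapply(iris, is.numeric)

# Kendall correlation matrix using only the numeric columns
cor.kendall <- cor(iris[, numeric.cols], use = "complete.obs", method = "kendall")
round(cor.kendall, 2)
```

This also keeps cor() from failing if a factor column sits among the numeric ones.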

So that you have options for correlation matrices, check out another method below. This one shows the same information, but with colored graphics. Remember to use install.packages("corrplot") ONCE to install the package.

library(corrplot)

correlations <- cor(iris[,1:4]) #makes correlations for the 1st through 4th columns of the data iris
corrplot(correlations, method="circle") #produces colored correlation plot

Basic data summarization

R offers many ways to summarize data; I present the simplest methods below. I calculate the mean, standard deviation, and median of sepal length. I store each of these in its own object (named sl.mean, sl.SD, and sl.median, so that the objects don't share names with the functions mean() and median()). I then bring them together into one summary table.

*For more specialized data summary, see the Data Management tutorial.

sl.mean<-mean(iris$Sepal.Length)
sl.SD<-sd(iris$Sepal.Length)
sl.median<-median(iris$Sepal.Length)

labels<-c("mean", "SD", "median") #making labels for table
summaryStats<-c(sl.mean,sl.SD,sl.median) #bringing together mean, SD, and median for the table

#Run next two lines together
labels #calls header labels

## [1] "mean"   "SD"     "median"

summaryStats #calls the summary stats (mean, standard deviation, and median)

## [1] 5.8433333 0.8280661 5.8000000
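An alternative that keeps each label attached to its value is a named vector, so there is only one object to print. A minimal sketch; sl and summary.table are names I chose.

```r
sl <- iris$Sepal.Length

# Name each element as it is created so labels and values print together
summary.table <- c(mean = mean(sl), SD = sd(sl), median = median(sl))
round(summary.table, 2)
```

Individual values can then be pulled out by name, for example summary.table["SD"].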

Histograms

Histograms are a great way to visualize data, see outliers, and just generally get to know your data.

hist(iris$Sepal.Length)

hist(iris$Petal.Width)

We can see that sepal length is fairly normally distributed and that petal width is not normal and likely bimodal (2 peaks).
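The number of bins affects how clearly a second peak shows up. The sketch below asks for more bins with the breaks argument (R treats a single number as a suggestion, not an exact count) and also stores the histogram without drawing it so the bin counts can be inspected. pw.hist is a name I made up.

```r
# Redraw the petal width histogram with finer bins and an axis label
hist(iris$Petal.Width, breaks = 20, xlab = "Petal Width (cm)", main = "")

# plot = FALSE returns the histogram object instead of drawing it
pw.hist <- hist(iris$Petal.Width, breaks = 20, plot = FALSE)
pw.hist$counts  # number of observations in each bin
```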

Residual plot and Quantile-Quantile Plot (Q-Q plot)

This produces four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location (square root of standardized residuals vs fitted values), and Residuals vs Leverage.

This is a quick way to see residual and Q-Q plots for any linear model (the kind of model behind ANOVA and regression).

The most useful (to me) is the q-q norm plot (the 2nd one). If the errors of the model are normal, the data points will fall closely on the line. In this particular example, it’s pretty much perfect!

For more information, google these nested functions: plot(lm()) and see this link.

lm.ir<-lm(Sepal.Length~Petal.Length,data=iris) #make a 'linear model' object
                #in this model, y=Sepal Length, x=Petal Length
par(mfrow = c(2, 2)) #makes the plotting panel 2x2
plot(lm.ir)

#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.

par(mfrow = c(1, 1)) #returns the plotting panel to normal (1x1)

Don’t forget to run the last par line - this returns your graphing window to normal.
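If the Q-Q plot is the only one you need, the which argument of plot() for lm objects draws a single panel, so no change to par() is required. The sketch below refits the model so it runs on its own; std.res is a name of my choosing. The by-hand version uses standardized residuals to match what plot() draws.

```r
# Same model as above: y = Sepal Length, x = Petal Length
lm.ir <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Draw only the Normal Q-Q panel (panel 2 of the diagnostic plots)
plot(lm.ir, which = 2)

# A similar plot built by hand from the standardized residuals
std.res <- rstandard(lm.ir)
qqnorm(std.res)
qqline(std.res)  # reference line; points close to it suggest normal errors
```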

Good luck with your data exploration! Don’t forget to check out Quick R, Stack Exchange, and other online resources for help on questions that go beyond this tutorial.

Simple Data Exploration

Michael Sinclair

July 20, 2019