R offers many convenient ways to look at your data and summarize it. This tutorial covers a few of my favorites. It is by no means exhaustive, but it should be a useful starting point.
If you’re interested in other options and code, check out resources online, especially Quick R.
We will be using the iris dataset, a freely available example dataset that comes bundled with R, so you can call it without loading anything. See the tutorial Getting Started in R for how to read in your own data using read.csv().
Use the code below to view iris:
View(iris)
That code should have produced a new dataframe window. In this window, you can view all of the data. Importantly, you can click on the headers to sort the dataframe by the column you’re interested in. Click again and it will sort in the inverse direction. Great way to check for outliers!
If you want to quickly get an idea of your data, and hate that popup that View() produces, there are several options for you below. Check them out:
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
head() returns the first 6 rows and tail() returns the last 6 rows by default; pass a second argument, n, to change how many rows you get.
If you want to see the “structure” of the data, use the code below. It tells you the names of the variables, the class of each variable (numeric, integer, factor, etc.), and the first few records.
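As a quick sketch of changing the number of rows shown, the n argument of head() and tail() can be set explicitly. The object names first10 and last3 are just placeholders I picked for illustration.

```r
first10 <- head(iris, n = 10)  # first 10 rows instead of the default 6
last3   <- tail(iris, n = 3)   # last 3 rows only

first10
last3
```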
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
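If you only need the dimensions or a quick numeric overview rather than the full structure, a few other base R functions may help. This is a minimal sketch; these functions are standard in base R, but they are my addition and not part of the walkthrough above.

```r
dim(iris)      # number of rows, then number of columns
nrow(iris)     # number of rows only
ncol(iris)     # number of columns only
summary(iris)  # min, quartiles, mean, and max for numeric columns; counts for factors
```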
What if you can’t remember the names of your variables? Was there an underscore or a hyphen? Never fear, the code below solves all, and quickly.
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
See? It returns all of the column names. Super handy if you have 35 variables. To check the class of the whole object, or of a single variable, use class():
class(iris) #tells us that iris is a dataframe
## [1] "data.frame"
class(iris$Species) #tells us that the variable Species is a factor
## [1] "factor"
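Checking one variable at a time gets tedious with many columns. One possible shortcut, sketched here with sapply() from base R, applies class() to every column at once. The name colClasses is a placeholder of mine.

```r
# Apply class() to each column; the result is a named character vector
colClasses <- sapply(iris, class)
colClasses
```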
So what if you’re checking out your data and you see that a variable should be a different type than what R is recognizing?
If your variable should be numeric and it is coming in as a factor (or as character, the default for text columns since R 4.0), you likely have a mistake in the data. Make sure that your missing data are entered as NA, not as periods or anything else. Make sure that there aren’t any stray characters or extra spaces in your numeric data (I’ve accidentally included a space before a number and that hosed the whole operation).
If you want numeric data to be read as a factor (for example, you want plot number to be categorical, not quantitative), that is easy peasy! Use the code below.
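Here is a rough sketch of cleaning up such a column after the fact. The vector raw is made-up data standing in for a badly entered column, and "mydata.csv" in the commented line is a placeholder file name, not a real file.

```r
# Hypothetical raw column: a period for missing data and a stray leading space
raw <- c("5.1", " 4.9", ".", "4.7")

# Strip surrounding whitespace, then recode the period as NA
clean <- trimws(raw)
clean[clean == "."] <- NA

# Convert the cleaned text to numeric
nums <- as.numeric(clean)
nums

# When reading a file, na.strings and strip.white can handle this up front:
# mydata <- read.csv("mydata.csv", na.strings = c("NA", "."), strip.white = TRUE)
```

Fixing the source file is still the safer route, but this can help you find out which entries are causing the trouble.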
iris$Sepal.Len.factor<-as.factor(iris$Sepal.Length)
Here we made a new variable in the dataframe iris called ‘Sepal.Len.factor’ using the function as.factor(). Note: what I just did, converting sepal length to a factor, is nonsensical; I just wanted to show you how to do it.
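Going the other way, from a factor back to numbers, has a well-known catch, sketched below. The object names f, codes, and values are placeholders for illustration.

```r
f <- as.factor(iris$Sepal.Length)

# as.numeric() on a factor returns the internal level codes, not the original values
codes <- as.numeric(f)
head(codes)

# Converting through character first recovers the original numbers
values <- as.numeric(as.character(f))
head(values)
```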
plot(Sepal.Length~Sepal.Width,data=iris)
plot(Sepal.Length~Sepal.Width,data=iris,xlab="Sepal Width (cm)",ylab="Sepal Length (cm)",main="") #Change x and y labels (iris measurements are in centimeters), specify that no main title should be written
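One optional extension is to color the points by species so that groups stand out. This is only a sketch using base graphics; speciesCol is a placeholder name, and the colors come from R's default palette.

```r
# Factor codes 1, 2, 3 pick the first three colors of the default palette
speciesCol <- as.integer(iris$Species)

plot(Sepal.Length ~ Sepal.Width, data = iris,
     col = speciesCol, pch = 19,
     xlab = "Sepal Width (cm)", ylab = "Sepal Length (cm)")

# Legend entries follow the same level order as the factor codes
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```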
Scatterplot matrices are good to check a lot of variables for correlations, all at the same time. We won’t have many variables in this example, but you can get the idea - you can examine a lot all at once.
You simply use the function pairs, include all of the variables you are interested in investigating following the syntax of (~var1+var2+…). Then you specify your data and can add a title to the top using main.
pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris,
main="Simple Scatterplot Matrix")
cor() returns Pearson correlations (by default) for each variable combination.
cor(iris[,c(1:4)],use="complete.obs") #Pearson correlation matrix only of numeric variables (columns 1-4)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
You can switch to Kendall or Spearman correlations by changing the method argument. Below I change the method to Spearman correlations.
cor(iris[,c(1:4)],use="complete.obs",method="spearman") #Spearman correlation matrix only of numeric variables (columns 1-4)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1667777    0.8818981   0.8342888
## Sepal.Width    -0.1667777   1.0000000   -0.3096351  -0.2890317
## Petal.Length    0.8818981  -0.3096351    1.0000000   0.9376668
## Petal.Width     0.8342888  -0.2890317    0.9376668   1.0000000
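For completeness, here is a sketch of the Kendall option, plus cor.test() for a single pair of variables. cor.test() is a base R (stats) function that I am adding here; it is not used elsewhere in this tutorial. The names kendallCor and ct are placeholders.

```r
# Kendall rank correlations for the four numeric columns
kendallCor <- cor(iris[, 1:4], use = "complete.obs", method = "kendall")
kendallCor

# cor.test() works on one pair at a time and adds a p-value and confidence interval
ct <- cor.test(iris$Sepal.Length, iris$Petal.Length)  # Pearson by default
ct$estimate  # the correlation coefficient
ct$p.value   # p-value for the null hypothesis of zero correlation
ct$conf.int  # 95% confidence interval for the coefficient
```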
So that you have options for correlation matrices, check out another method below. This one shows the same information, but with colored graphics. Remember to run install.packages("corrplot") ONCE to install the package.
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
correlations <- cor(iris[,1:4]) #makes correlations for the 1st through 4th columns of the data iris
corrplot(correlations, method="circle") #produces colored correlation plot
R offers many ways to summarize data; I present the simplest methods below. I calculate the mean, standard deviation, and median. I store each of these in an object (named meanSL, sdSL, and medianSL, so that they do not mask R's own mean() and median() functions). I then bring them together into one summary table.
*For more specialized data summary, see the Data Management tutorial.
meanSL<-mean(iris$Sepal.Length)
sdSL<-sd(iris$Sepal.Length)
medianSL<-median(iris$Sepal.Length)
labels<-(c("mean", "SD", "median")) #making labels for table
summaryStats<-c(meanSL,sdSL,medianSL) #bringing together mean, SD, and median for the table
#Run next two lines together
labels #calls header labels
## [1] "mean" "SD" "median"
summaryStats #calls the summary stats (mean, standard deviation, and median)
## [1] 5.8433333 0.8280661 5.8000000
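An alternative sketch keeps the labels attached to the values by using a named vector, and then gets the mean per species with aggregate(). The names sepalStats and speciesMeans are placeholders; aggregate() is base R but goes a little beyond the simple summary above.

```r
# Named vector: labels and values stay together
sepalStats <- c(mean   = mean(iris$Sepal.Length),
                SD     = sd(iris$Sepal.Length),
                median = median(iris$Sepal.Length))
round(sepalStats, 2)

# Mean sepal length for each species.
# Passing the function name as a string sidesteps any object of your own called mean.
speciesMeans <- aggregate(Sepal.Length ~ Species, data = iris, FUN = "mean")
speciesMeans
```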
Histograms are a great way to visualize data, see outliers, and just generally get to know your data.
hist(iris$Sepal.Length)
hist(iris$Petal.Width)
We can see that sepal length is fairly normally distributed and that petal width is not normal and likely bimodal (2 peaks).
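The shape you see depends partly on the bin width, so it can be worth trying a different breaks value. One possible explanation for the two peaks is that the species differ in petal width, which the sketch below checks by drawing one histogram per species. The value breaks = 20 and the name h are arbitrary choices of mine.

```r
# breaks is a suggestion for the number of bins; R may pick a nearby "pretty" number
h <- hist(iris$Petal.Width, breaks = 20,
          main = "Petal width", xlab = "Petal width (cm)")

# One histogram per species, side by side
par(mfrow = c(1, 3))
for (sp in levels(iris$Species)) {
  hist(iris$Petal.Width[iris$Species == sp],
       main = sp, xlab = "Petal width (cm)")
}
par(mfrow = c(1, 1))  # return the plotting panel to normal
```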
This is a quick way to see residual and normal Q-Q plots for any linear model (the kind of model you fit for ANOVAs and regressions).
The most useful (to me) is the normal Q-Q plot (the 2nd one). If the errors of the model are normal, the data points will fall close to the line. In this particular example, it’s pretty much perfect!
For more information, google these nested functions: plot(lm()) and see this link.
lm.ir<-lm(Sepal.Length~Petal.Length,data=iris) #make a 'linear model' object
#in this model, y=Sepal Length, x=Petal Length
par(mfrow = c(2, 2)) #makes the plotting panel 2x2
plot(lm.ir)
#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.
par(mfrow = c(1, 1)) #returns the plotting panel to normal (1x1)
Don’t forget to run the last par line - this returns your graphing window to normal.
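If you only want the normal Q-Q plot, here is a sketch of two ways to get it on its own. The object name fit is a placeholder; the which argument of plot() for lm objects selects individual diagnostic plots, with 2 being the normal Q-Q plot.

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Option 1: ask plot() for the second diagnostic plot only
plot(fit, which = 2)

# Option 2: build the Q-Q plot by hand from the model residuals
res <- resid(fit)
qqnorm(res)  # sample quantiles of the residuals against normal quantiles
qqline(res)  # reference line through the first and third quartiles
```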
Good luck with your data exploration! Don’t forget to check out Quick R, Stack Exchange, and other online resources for help on questions that go beyond this tutorial.