Section 5: Exploring Your Data

Loading your data is an important first step, but it is just as important to familiarize yourself with its structure and content. Once you are comfortable with this step, working in R becomes much easier!

Let’s use the diamonds data from the ggplot2 package that we installed in the previous section.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.3

As we saw before, you can view the first six entries of your data using the head() command.

head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Not surprisingly, the tail() command allows you to look at the last six entries.

tail(diamonds)

##       carat       cut color clarity depth table price    x    y    z
## 53935  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 53936  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
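
By default, head() and tail() return six rows. As a small sketch, the n argument changes how many rows come back; the values 10 and 3 below are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the diamonds data

# First 10 rows instead of the default 6
head(diamonds, n = 10)

# Last 3 rows
tail(diamonds, n = 3)
```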

As we discussed before, R allows you to work with a variety of data structures. So, after you load a file, it may be helpful to use the class() command to find out what sort of data structure you have.

class(diamonds)

## [1] "data.frame"

How big is this data frame?

dim(diamonds)

## [1] 53940    10

What are the variables in this data frame?

colnames(diamonds)

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
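
The three checks above can also be done in one step. As a sketch, str() prints the class, the dimensions, and the type of every column, while nrow() and ncol() return the two dimensions separately.

```r
library(ggplot2)

# Compact overview: class, size, and one line per column
str(diamonds)

# The two values reported by dim(), one at a time
nrow(diamonds)
ncol(diamonds)
```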

Now that we’re familiar with the basic structure of the diamonds data, let’s take a look at the first 10 entries of the carat variable. Note that we call the variable within the data frame using a $. We can also index the data frame by row and column position, where [1:10, 1] means rows 1 to 10 of column 1. Below I use both methods.

diamonds$carat[1:10]

##  [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23

diamonds[1:10, 1]

##  [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
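
A column can also be selected by name instead of position, which keeps working if the column order ever changes. The sketch below shows two name-based forms. One caveat, stated as an assumption about your setup: in recent versions of ggplot2 the diamonds data is stored as a tibble, and there diamonds[1:10, 1] returns a one-column table rather than a plain vector, so the $ and [[ ]] forms are the safer way to get a vector.

```r
library(ggplot2)

# Select the column by name, then take the first 10 values
diamonds[["carat"]][1:10]

# Row/column indexing with a column name instead of a number
diamonds[1:10, "carat"]
```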

What type of value is carat?

class(diamonds$carat)

## [1] "numeric"

What’s the maximum carat value in the dataset? The minimum value? The mean value?

max(diamonds$carat)

## [1] 5.01

min(diamonds$carat)

## [1] 0.2

mean(diamonds$carat)

## [1] 0.7979397
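
If you want several of these numbers at once, here is a sketch using two base R functions: range() returns the minimum and maximum together, and summary() adds the quartiles, median, and mean.

```r
library(ggplot2)

# Minimum and maximum in a single call
range(diamonds$carat)

# Min, 1st quartile, median, mean, 3rd quartile, max
summary(diamonds$carat)
```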

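What’s the price of the diamonds with the smallest and largest carat values? To find this out, we can use the which() command, which returns the row positions where a condition is TRUE, and then use those positions to index our data frame.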

max_val <- max(diamonds$carat)

max_index <- which(diamonds$carat == max_val)
max_index

## [1] 27416

diamonds$price[max_index]

## [1] 18018

min_val <- min(diamonds$carat)

min_index <- which(diamonds$carat == min_val)
min_index

##  [1]    15 31592 31593 31594 31595 31596 31597 31598 31599 31600 31601
## [12] 31602

diamonds$price[min_index[1]]

## [1] 345
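
Note that min_index holds 12 positions, and the last command only looks at the first of them. As a sketch of two alternatives: which.max() and which.min() return the position of the first maximum or minimum directly, and indexing price with the full set of tied positions shows every price at the smallest carat value.

```r
library(ggplot2)

# Position of the first largest and first smallest carat value
which.max(diamonds$carat)
which.min(diamonds$carat)

# Prices of all diamonds tied for the smallest carat value
min_index <- which(diamonds$carat == min(diamonds$carat))
diamonds$price[min_index]
```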

Let’s say I don’t know anything about diamonds (which I don’t). How do I find out what types of cuts exist?

table(diamonds$cut)

## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
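
The counts are easier to compare as shares of the whole. A minimal sketch: prop.table() converts a table of counts into proportions, and round() is used here only to keep the output short (two decimal places is an arbitrary choice).

```r
library(ggplot2)

# Count diamonds per cut, then convert counts to proportions
cut_counts <- table(diamonds$cut)
round(prop.table(cut_counts), 2)
```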

How do I find out how many colors of diamonds there are within the dataset? One way to do this is to make a table. Another is to find the unique values of that variable.

unique_colors <- unique(diamonds$color)

length(unique_colors)

## [1] 7
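
Here is the table approach mentioned above, as a sketch. Because color is stored as a factor in the diamonds data, levels() and nlevels() are a third option: they list and count the categories the variable can take.

```r
library(ggplot2)

# One count per color; the number of entries is the number of colors
color_counts <- table(diamonds$color)
color_counts
length(color_counts)

# The categories defined for the factor, and how many there are
levels(diamonds$color)
nlevels(diamonds$color)
```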