Loading your data is an important first step, but getting familiar with its structure and content matters just as much. Once you’re comfortable with this step, working in R becomes much easier!

Let’s use the diamonds data from the ggplot2 package that we installed in the previous section.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3

As we saw before, you can view the first six entries of your data using the head() command.

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Not surprisingly, the tail() command allows you to look at the last six entries.

tail(diamonds)
##       carat       cut color clarity depth table price    x    y    z
## 53935  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 53936  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 53937  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 53938  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 53939  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 53940  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64
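
Both head() and tail() accept an optional second argument, n, that sets how many rows come back. A minimal sketch, assuming ggplot2 is installed as in the previous section (the names first_three and last_ten are just illustrative):

```r
library(ggplot2)  # provides the diamonds data

# First three rows instead of the default six
first_three <- head(diamonds, n = 3)
first_three

# Last ten rows
last_ten <- tail(diamonds, n = 10)
nrow(last_ten)
```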

As we discussed before, R allows you to work with a variety of data structures. So, after you load a file, it may be helpful to use the class() command to tell you what sort of data structure you have.

class(diamonds)
## [1] "data.frame"

How big is this data frame?

dim(diamonds)
## [1] 53940    10

What are the variables in this data frame?

colnames(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
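
The three questions above (what kind of object, how big, which variables) can also be answered in one call with str(). This is a sketch of an alternative, not a replacement for the commands above; n_rows and n_cols are illustrative names.

```r
library(ggplot2)

# One-call overview: class, dimensions, and the type of each column
str(diamonds)

# Row and column counts on their own
n_rows <- nrow(diamonds)
n_cols <- ncol(diamonds)
```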

Now that we’re familiar with the basic structure of the diamonds data, let’s take a look at the first 10 entries of the carat variable. Note that we call the variable within the data frame using a $. We can also index the data frame by row and column position. Below I use both methods.

diamonds$carat[1:10]
##  [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
diamonds[1:10, 1]
##  [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
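
A third option is to select the column by name with double square brackets, which avoids having to remember that carat is column 1. The sketch below assumes the same diamonds data; note that in some newer versions of ggplot2 the data is stored as a tibble, in which case diamonds[1:10, 1] may return a one-column table rather than a plain vector, while the $ and [[ ]] forms still return a vector.

```r
library(ggplot2)

# Select the column by name with [[ ]], then take the first 10 values
by_name <- diamonds[["carat"]][1:10]

# Same values as the $ form
by_dollar <- diamonds$carat[1:10]
identical(by_name, by_dollar)
```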

What type of value is carat?

class(diamonds$carat)
## [1] "numeric"

What’s the maximum carat value in the dataset? The minimum value? The mean value?

max(diamonds$carat)
## [1] 5.01
min(diamonds$carat)
## [1] 0.2
mean(diamonds$carat)
## [1] 0.7979397
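
If you want these numbers in fewer calls, base R offers range() and summary(). A short sketch, with carat_range as an illustrative name:

```r
library(ggplot2)

# range() returns the minimum and maximum as a two-element vector
carat_range <- range(diamonds$carat)
carat_range

# summary() reports the minimum, quartiles, median, mean, and maximum
summary(diamonds$carat)
```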

What’s the price of the diamonds with the smallest and largest carat values? To find out, we can use the which() command to get the row positions of those values, and then use those positions to index our data frame.

max_val <- max(diamonds$carat)

# row position(s) where carat equals the maximum
max_index <- which(diamonds$carat == max_val)
max_index
## [1] 27416
diamonds$price[max_index]
## [1] 18018

min_val <- min(diamonds$carat)

# several diamonds tie for the smallest carat value
min_index <- which(diamonds$carat == min_val)
min_index
##  [1]    15 31592 31593 31594 31595 31596 31597 31598 31599 31600 31601
## [12] 31602
# price of the first of the tied diamonds
diamonds$price[min_index[1]]
## [1] 345
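
Two variations on this lookup may be useful. which.max() and which.min() return the position of the first extreme value directly, and a logical index keeps every tied row instead of only the first. This is a sketch under the same assumptions as above; biggest, smallest, and min_prices are illustrative names.

```r
library(ggplot2)

# Position of the first largest and first smallest carat value
biggest  <- which.max(diamonds$carat)
smallest <- which.min(diamonds$carat)
diamonds$price[c(smallest, biggest)]

# Prices of all diamonds tied for the smallest carat value
min_prices <- diamonds$price[diamonds$carat == min(diamonds$carat)]
range(min_prices)
```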

Let’s say I don’t know anything about diamonds (which I don’t). How do I find out what types of cuts exist?

table(diamonds$cut)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
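
The categories appear in quality order rather than alphabetical order, which suggests cut is stored as a factor. Assuming that is the case, levels() lists the categories without counting them, and prop.table() turns the counts into shares. The names cut_levels and cut_share are illustrative.

```r
library(ggplot2)

# Categories of the cut variable, in their stored order
cut_levels <- levels(diamonds$cut)
cut_levels

# Share of diamonds in each cut category
cut_share <- prop.table(table(diamonds$cut))
round(cut_share, 3)
```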

How do I find out how many colors of diamonds there are within the dataset? One way to do this is to make a table. Another is to find the unique values of that variable.

unique_colors <- unique(diamonds$color)

length(unique_colors)
## [1] 7
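
The table approach mentioned above can be sketched as follows: build the table for the color variable, then count its entries. The names color_counts and n_colors are illustrative.

```r
library(ggplot2)

# One entry per color, with the number of diamonds of that color
color_counts <- table(diamonds$color)
color_counts

# Number of distinct colors
n_colors <- length(color_counts)
n_colors
```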