Uploading your data is an important first step (obviously), but familiarizing yourself with the structure and content of the data is also very important. Once you’ve mastered this step, using R will become much easier!
Let’s use the diamonds data from the ggplot2 package that we installed in the previous section.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
As we saw before, you can view the first six entries of your data using the head() command.
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Not surprisingly, the tail() command lets you look at the last six entries.
tail(diamonds)
## carat cut color clarity depth table price x y z
## 53935 0.72 Premium D SI1 62.7 59 2757 5.69 5.73 3.58
## 53936 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
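Both head() and tail() return six rows by default, and both accept an n argument if you want a different number. Here is a minimal sketch; toy is a hypothetical stand-in I built from three columns of the six rows that head(diamonds) printed above, so the example runs without ggplot2.

```r
# Toy stand-in for diamonds (hypothetical name), built from the
# six rows printed by head(diamonds) above
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326, 326, 327, 334, 335, 336)
)

first_three  <- head(toy, n = 3)   # first 3 rows instead of the default 6
last_two     <- tail(toy, n = 2)   # last 2 rows
all_but_last <- head(toy, n = -1)  # a negative n drops rows from the end

first_three
last_two
```

The same calls should work on the real data, for example head(diamonds, n = 10).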
As we discussed before, R allows you to work with a variety of data structures. So, after you upload a file it may be helpful to use the class() command to tell you what sort of data structure you have.
class(diamonds)
## [1] "data.frame"
How big is this data frame?
dim(diamonds)
## [1] 53940 10
What are the variables in this data frame?
colnames(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
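If you would rather get the class, size, and column types in one go, str() prints a compact summary, and nrow(), ncol(), and names() return the individual pieces. This is a small sketch on a hypothetical toy data frame (a few columns of the rows printed by head(diamonds) earlier), not on the full dataset.

```r
# Toy stand-in for diamonds (hypothetical name)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326, 326, 327, 334, 335, 336)
)

str(toy)                 # class, dimensions, and the type of every column

n_rows    <- nrow(toy)   # first element of dim(toy)
n_cols    <- ncol(toy)   # second element of dim(toy)
col_names <- names(toy)  # same result as colnames(toy) for a data frame
```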
Now that we’re familiar with the basic structure of the diamonds data, let’s take a look at the first 10 entries of the carat variable. Note that we access a variable within the data frame using $. We can also index the data frame by row and column. Below I use both methods.
diamonds$carat[1:10]
## [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
diamonds[1:10, 1]
## [1] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
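A few more ways to pull out the same values, sketched on a hypothetical toy data frame. Indexing by column name tends to be safer than indexing by position, because it keeps working if the column order changes. One caveat: in recent ggplot2 releases diamonds is, as far as I know, stored as a tibble, where diamonds[1:10, 1] returns a one-column table rather than a plain vector; the $ and [[ ]] forms return a vector either way.

```r
# Toy stand-in for diamonds (hypothetical name)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  price = c(326, 326, 327, 334, 335, 336)
)

by_dollar   <- toy$carat[1:3]        # column by name, then the first 3 values
by_position <- toy[1:3, 1]           # rows 1-3 of column 1
by_name     <- toy[1:3, "carat"]     # rows 1-3 of the column called "carat"
by_brackets <- toy[["carat"]][1:3]   # [[ ]] extracts the column as a vector

# drop = FALSE keeps the result as a one-column data frame
kept_df <- toy[1:3, "carat", drop = FALSE]
```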
What type of value is carat?
class(diamonds$carat)
## [1] "numeric"
What’s the maximum carat value in the dataset? The minimum value? The mean value?
max(diamonds$carat)
## [1] 5.01
min(diamonds$carat)
## [1] 0.2
mean(diamonds$carat)
## [1] 0.7979397
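Instead of calling each function separately, range() returns the minimum and maximum together, and summary() adds the quartiles, median, and mean. A quick sketch on the six carat values printed by head(diamonds) earlier (the vector name carat is just for illustration):

```r
# The six carat values shown by head(diamonds)
carat <- c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24)

carat_range <- range(carat)  # c(minimum, maximum)
carat_mean  <- mean(carat)

carat_range
summary(carat)               # min, quartiles, median, mean, and max at once
```

On the real data, summary(diamonds$carat) should give the same kind of overview for all 53940 rows.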
What’s the price of the diamonds with the smallest and largest carat values? To find out, we can use the which() command to get the row index of each value, and then use that index to look up the price.
max_val<-max(diamonds$carat)
max_index<-which(diamonds$carat==max_val)
max_index
## [1] 27416
diamonds$price[max_index]
## [1] 18018
min_val<-min(diamonds$carat)
min_index<-which(diamonds$carat==min_val)
min_index
## [1] 15 31592 31593 31594 31595 31596 31597 31598 31599 31600 31601
## [12] 31602
diamonds$price[min_index[1]]
## [1] 345
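Two things are worth knowing here. First, which.max() and which.min() are shortcuts that return the position of the first maximum or minimum directly. Second, when several rows tie (as with the twelve smallest diamonds above), which() returns all of them, so you can look up every price rather than just the first. The sketch below uses made-up toy values with a tie added on purpose; the numbers are not from the diamonds data.

```r
# Hypothetical toy data with a deliberate tie for the smallest carat
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.21),
  price = c(326, 326, 327, 334, 335, 340)
)

# which.max() gives the index of the (first) largest value
i_max            <- which.max(toy$carat)
price_of_largest <- toy$price[i_max]

# which() with == returns every row that ties for the minimum
min_rows           <- which(toy$carat == min(toy$carat))
prices_of_smallest <- toy$price[min_rows]

price_of_largest
prices_of_smallest
```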
Let’s say I don’t know anything about diamonds (which I don’t). How do I find out what types of cuts exist?
table(diamonds$cut)
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
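A table can also be sorted or turned into proportions. Here is a small sketch on a hypothetical character vector holding the cut values from the six rows printed by head(diamonds). Note that a plain character vector is tabulated in alphabetical order, whereas diamonds$cut is a factor, which is why the table above runs from Fair to Ideal.

```r
# Cut values from the six rows shown by head(diamonds) (vector name is illustrative)
cut_values <- c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good")

counts        <- table(cut_values)
sorted_counts <- sort(counts, decreasing = TRUE)  # most common categories first
shares        <- prop.table(counts)               # counts divided by the total

names(counts)  # the distinct categories
sorted_counts
shares
```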
How do I find out how many colors of diamonds there are within the dataset? One way to do this is to make a table. Another is to find the unique values of that variable.
unique_colors<-unique(diamonds$color)
length(unique_colors)
## [1] 7
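One thing to watch for when counting distinct values: unique() has to be applied to a single column. Applied to a whole data frame it returns the unique rows, and length() of a data frame counts its columns, not its rows. The sketch below shows the difference on a hypothetical toy data frame, using the color values from the six rows printed by head(diamonds).

```r
# Toy stand-in for diamonds (hypothetical name)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  color = factor(c("E", "E", "E", "I", "J", "J"))
)

# Distinct values of one column
unique_colors <- unique(toy$color)
n_colors      <- length(unique_colors)

# Two alternatives that give the same count here
n_levels   <- nlevels(toy$color)        # counts factor levels, even unused ones
n_in_table <- length(table(toy$color))  # one table entry per level

# Pitfall: length() of a data frame is its number of columns
n_columns <- length(unique(toy))

n_colors
n_columns
```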