First, load the ggplot2 package, which contains a data frame called “diamonds”.
#Install the package if needed, using the command:
#install.packages("ggplot2")
#Load the ggplot2 library
library("ggplot2")
Let us display a few sample rows (the top rows) of the diamonds dataset, using the head function.
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
To see which variables are numeric and which are categorical, use R's “str” function.
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Let us use the summary() function to display high-level characteristics of each variable in the diamonds dataset. For categorical variables, the count of each level is displayed; for numeric variables, the minimum, 25th percentile (1st quartile), median, mean, 75th percentile (3rd quartile), and maximum are displayed.
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
In the output above, “cut”, “color”, and “clarity” are non-numeric variables (ordered factors), and the remaining variables are numeric.
To determine the frequency of each diamond color:
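As a rough cross-check of this split, the column classes can also be queried programmatically. The sketch below assumes ggplot2 is installed; the names numeric_vars and factor_vars are only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Names of the columns that hold numbers (num or int)
numeric_vars <- names(diamonds)[sapply(diamonds, is.numeric)]

# Names of the columns stored as factors (ordered factors included)
factor_vars <- names(diamonds)[sapply(diamonds, is.factor)]

print(numeric_vars)
print(factor_vars)
```

This is handy when a data set has too many columns to scan the str() output by eye.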
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
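Raw counts are sometimes easier to compare as shares of the whole. A minimal sketch, assuming ggplot2 is installed, that converts the same table to percentages with prop.table (the variable names are illustrative):

```r
library(ggplot2)  # provides the diamonds data frame

color_counts <- table(diamonds$color)

# Convert counts to proportions of all diamonds (values sum to 1)
color_share <- prop.table(color_counts)

# Show as percentages rounded to one decimal place
print(round(100 * color_share, 1))
```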
To determine the frequency of each diamond cut, by color:
table(diamonds$cut,diamonds$color)
##
## D E F G H I J
## Fair 163 224 312 314 303 175 119
## Good 662 933 909 871 702 522 307
## Very Good 1513 2400 2164 2299 1824 1204 678
## Premium 1603 2337 2331 2924 2360 1428 808
## Ideal 2834 3903 3826 4884 3115 2093 896
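Because the colors have very different totals, the two-way counts can be hard to compare across columns. The sketch below, assuming ggplot2 is installed, normalises each color column so that it shows the share of each cut within that color. The variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

cut_by_color <- table(diamonds$cut, diamonds$color)

# margin = 2 normalises each column, so every color column sums to 1
cut_share_within_color <- prop.table(cut_by_color, margin = 2)

print(round(cut_share_within_color, 3))
```

Using margin = 1 instead would normalise each row, giving the share of each color within a cut.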
To create a histogram of a single numeric variable:
hist(diamonds$carat)
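The same histogram can be sketched with ggplot2, which gives direct control over the bin width. The bin width of 0.1 carat, the fill color, and the object name p_carat are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the diamonds data frame

# Histogram of carat with an explicit bin width of 0.1 carat
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "cornflowerblue", color = "black") +
  labs(title = "Distribution of carat", x = "carat", y = "count")

print(p_carat)
```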
To create a scatterplot of two numeric variables (the first argument goes on the x axis):
plot(log(diamonds$price),log(diamonds$carat))
Let us play with ggplot2 options:
ggplot(data=diamonds,aes(x=log(price),y=log(carat), color = color)) +
geom_point() +
geom_smooth(method="lm",color="black",linetype=1) +
labs(title="Regression between log(price) and log(carat)", x="log(price)",y="log(carat)")
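To see the numbers behind a fitted line, the same two variables can be passed to lm() directly. This is a minimal sketch of a single pooled fit over all diamonds, assuming ggplot2 is installed; the names fit, coefs, and r2 are illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Pooled least-squares fit with the same variables as the plot:
# log(carat) as the response, log(price) as the predictor
fit <- lm(log(carat) ~ log(price), data = diamonds)

coefs <- coef(fit)               # intercept and slope
r2    <- summary(fit)$r.squared  # proportion of variance explained

cat(sprintf("log(carat) = %.3f + %.3f * log(price), R^2 = %.3f\n",
            coefs[1], coefs[2], r2))
```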
Obtained the following code from [Stack Overflow](http://stackoverflow.com/questions/9681765/display-regression-equation-and-r2-for-each-scatter-plot-when-using-facet-wrap)
The following commands produce box plots of price for each color. The x variable must be a factor, and the y variable must be numeric.
ggplot(diamonds, aes(x=color,y=price)) + geom_boxplot(fill="cornflowerblue",
color="black", notch=TRUE)
ggplot(diamonds, aes(x=color,y=price)) + geom_boxplot(fill="cornflowerblue",
color="black", notch=TRUE) +
geom_point(position="jitter", color="blue", alpha=.5)+
geom_rug(sides="l", color="black")
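The centre line of each box is the median price for that color, and the height of the box is the interquartile range. As a rough numeric companion to the plots, these values can be computed with tapply. The sketch assumes ggplot2 is installed, and the names price_median, price_iqr, and price_summary are illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Median and interquartile range of price within each color
price_median <- tapply(diamonds$price, diamonds$color, median)
price_iqr    <- tapply(diamonds$price, diamonds$color, IQR)

price_summary <- data.frame(
  color  = names(price_median),
  median = as.vector(price_median),
  iqr    = as.vector(price_iqr)
)

print(price_summary)
```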
The following command produces density plots of price, one for each diamond color. The x variable must be numeric (here, price); color is mapped to the fill aesthetic, and the y axis shows the estimated density.
ggplot(data=diamonds, aes(x=price, fill=color)) +
geom_density(alpha=.3)
The diamonds data set has a large number of entries, so let us use a different data set to demonstrate other graphics. We will use the Salaries data set, which becomes available after installing the “car” package (in recent versions of car, the data set itself lives in the carData package, which is loaded along with car). Use the following R command to install the car package (uncomment the code):
#install.packages("car")
#To load the car package, use the following command:
library("car")
In the following graph, we draw a scatter plot of the “yrs.since.phd” and salary variables of the Salaries data set, and use two categorical variables to set the color and the shape of the points.
The Salaries data set contains the salaries of professors. The rank (or designation) is represented by the color of the points, and the sex, another categorical variable, by their shape.
ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank,shape=sex)) + geom_point()
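Before reading the plot, it can help to know how many points fall into each color and shape combination. A minimal sketch, assuming the carData package (installed as a dependency of car) is available; the name rank_by_sex is illustrative.

```r
# Salaries ships with the carData package, which car loads automatically
data("Salaries", package = "carData")

# Number of professors for each rank (point color) and sex (point shape)
rank_by_sex <- table(Salaries$rank, Salaries$sex)

print(rank_by_sex)
```

Combinations with few observations will appear as only a handful of points in the scatter plot, so patterns for those groups should be read with caution.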