Week_6_Hands_on

First, load the ggplot2 package, which includes a data frame called “diamonds”.

#Install the package if needed, using the command:
#install.packages("ggplot2")

#Load the ggplot2 library
library("ggplot2")

Let us display the first few rows of the diamonds dataset using the head() function.

head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

To see which variables are numeric and which are categorical, use R's str() function.

str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Let us use the summary() function to display high-level characteristics of each variable in the diamonds dataset. For categorical variables it shows the count of each level; for numeric variables it shows the minimum, 1st quartile (25th percentile), median, mean, 3rd quartile (75th percentile), and maximum.

summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

The output above shows that “cut”, “color”, and “clarity” are non-numeric variables (ordered factors), and the remaining seven variables are numeric.
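The same split can be checked programmatically rather than read off the str() output. The following is a minimal sketch using base R's sapply() and is.numeric(); the variable name is_num is only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# TRUE for numeric (num or int) columns, FALSE for factors
is_num <- sapply(diamonds, is.numeric)

# Names of the numeric variables
print(names(diamonds)[is_num])

# Names of the categorical (factor) variables
print(names(diamonds)[!is_num])
```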

To determine the frequency of each diamond color:

table(diamonds$color)

## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
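Raw counts are often easier to compare as shares of the total. As a sketch, base R's prop.table() converts the frequency table above into proportions; the names color_counts and color_share are only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Frequency of each diamond color, as in the table above
color_counts <- table(diamonds$color)

# Convert the counts to proportions of all diamonds, rounded to 3 decimals
color_share <- round(prop.table(color_counts), 3)
print(color_share)
```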

To determine the frequency of each diamond cut, broken down by color:

table(diamonds$cut,diamonds$color)

##            
##                D    E    F    G    H    I    J
##   Fair       163  224  312  314  303  175  119
##   Good       662  933  909  871  702  522  307
##   Very Good 1513 2400 2164 2299 1824 1204  678
##   Premium   1603 2337 2331 2924 2360 1428  808
##   Ideal     2834 3903 3826 4884 3115 2093  896
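Because the cuts differ greatly in size, the two-way table can be easier to read with totals and row proportions. The sketch below uses base R's addmargins() and prop.table() with margin = 1; the names cut_by_color and row_share are only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Two-way frequency table: cuts in rows, colors in columns
cut_by_color <- table(diamonds$cut, diamonds$color)

# Append row and column totals
print(addmargins(cut_by_color))

# Share of each color within each cut (every row sums to 1)
row_share <- prop.table(cut_by_color, margin = 1)
print(round(row_share, 3))
```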

To create a histogram of a single numeric variable:

hist(diamonds$carat)
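The default histogram can be refined with a few arguments. In this sketch, breaks = 50 is only a suggested number of bins (R treats it as a hint), and the title and axis label are illustrative choices.

```r
library(ggplot2)  # provides the diamonds data frame

# Same histogram with narrower bins, a title, and an axis label
h <- hist(diamonds$carat, breaks = 50,
          main = "Distribution of diamond carat", xlab = "carat")

# hist() also returns the bin counts, which can be inspected directly
print(sum(h$counts))
```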

To create a scatterplot of two numeric variables (here on a log scale, with log(price) on the x-axis and log(carat) on the y-axis):

plot(log(diamonds$price),log(diamonds$carat))

Let us play with ggplot2 options:

ggplot(data=diamonds, aes(x=log(price), y=log(carat), color=color)) +
  geom_point() +
  geom_smooth(method="lm", color="black", linetype=1) +
  labs(title="Regression between log(price) and log(carat)", x="log(price)", y="log(carat)")
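The relationship shown in the plot can also be examined numerically. The following is a sketch that assumes one linear model fitted to all diamonds with base R's lm(), using the same axes as the plot (log(carat) as the response, log(price) as the predictor); the name fit is only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Linear model of log(carat) on log(price), matching the plot's axes
fit <- lm(log(carat) ~ log(price), data = diamonds)

# Intercept and slope of the fitted line
print(coef(fit))

# Proportion of variance in log(carat) explained by log(price)
print(summary(fit)$r.squared)
```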

Obtained the following code from Stack Overflow (http://stackoverflow.com/questions/9681765/display-regression-equation-and-r2-for-each-scatter-plot-when-using-facet-wrap)

The following command produces box plots of price for each color. The x variable must be a factor, and the y variable must be numeric.

ggplot(diamonds, aes(x=color,y=price)) + geom_boxplot(fill="cornflowerblue",
color="black", notch=TRUE)

#Same box plots, with jittered points and a rug along the left axis added
ggplot(diamonds, aes(x=color,y=price)) + geom_boxplot(fill="cornflowerblue",
color="black", notch=TRUE) +
geom_point(position="jitter", color="blue", alpha=.5)+
geom_rug(sides="l", color="black")
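The center line and box height drawn for each color can also be computed directly. The sketch below uses base R's tapply() with median() and IQR(); the names median_price and iqr_price are only illustrative.

```r
library(ggplot2)  # provides the diamonds data frame

# Median price for each color (the center line of each box)
median_price <- tapply(diamonds$price, diamonds$color, median)
print(median_price)

# Interquartile range of price for each color (the height of each box)
iqr_price <- tapply(diamonds$price, diamonds$color, IQR)
print(iqr_price)
```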

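The following command produces a density plot of price for each diamond color. The x variable must be numeric (here, price); the color variable is mapped to fill, so each color gets its own semi-transparent curve, and the y-axis shows the estimated density.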

ggplot(data=diamonds, aes(x=price, fill=color)) +
geom_density(alpha=.3)

The diamonds data set has a large number of entries (53,940 rows), so let us use a different data set to demonstrate other graphics. We will use the Salaries data set, which becomes available after installing and loading the “car” package. Use the following R command to install the car package (uncomment the code):

#install.packages("car")
#To load the car package, use the following command:
library("car")

In the following graph, we draw a scatter plot of the “yrs.since.phd” and “salary” variables of the Salaries data set. Two further categorical variables are encoded as the color and the shape of the points.

The Salaries data set contains the salaries of professors. In the following graph, the rank (or designation) is represented by the color of each point, and the sex, another categorical variable, is represented by its shape.

ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank,shape=sex)) + geom_point()
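Before reading the plot, it can help to confirm that the variables mapped to color and shape are factors, and to see how many points fall into each combination. This is a sketch that assumes the car package is loaded as above; the name rank_by_sex is only illustrative.

```r
library(car)  # as in the text; makes the Salaries data frame available

# The variables mapped to color and shape should both be factors
str(Salaries[, c("rank", "sex")])

# Number of professors in each rank/sex combination
rank_by_sex <- table(Salaries$rank, Salaries$sex)
print(rank_by_sex)
```

Combinations with small counts will appear as only a few points of the corresponding color and shape in the scatter plot.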

Week_6_Hands_on

Sekhar Mekala

Tuesday, March 03, 2015