This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library("ggplot2")
data(diamonds)
dim(diamonds)
## [1] 53940    10
str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

Q2) Create a histogram of the price of all the diamonds in the diamonds data set.
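The write-up does not show code for this question. Below is a minimal sketch of one way to draw the histogram, assuming ggplot2 is installed. The bin width of 500 and the colors are borrowed from the later exercises and are not prescribed by the question.

```r
library(ggplot2)

# Histogram of price for every diamond in the data set.
# binwidth = 500 is an illustrative choice, not part of the question.
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, color = "black", fill = "#099DD9")

# print(p) draws the plot
```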

Q3) The distribution is right-skewed, so the mean is greater than the median (it lies further to the right).

median = 2401 and mean = 3933

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18820
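Why a long right tail pulls the mean above the median can be seen on a tiny made-up price vector. The numbers below are illustrative only and are not taken from the diamonds data.

```r
# Six hypothetical prices; one very expensive item forms the right tail
prices <- c(300, 400, 500, 600, 700, 15000)

median(prices)  # middle of the sorted values, unaffected by the outlier
mean(prices)    # dragged upward by the single large value

mean(prices) > median(prices)
```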

Q4) Diamonds Count

dim(subset(diamonds, price <500))
## [1] 1729   10
dim(subset(diamonds, price <250))
## [1]  0 10
dim(subset(diamonds, price >=15000))
## [1] 1656   10
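The same counts can be obtained without building a subset, since summing a logical vector counts its TRUE values. A sketch, assuming ggplot2 is installed for the diamonds data:

```r
library(ggplot2)

# Each comparison yields TRUE/FALSE per diamond; sum() counts the TRUEs
n_under_500 <- sum(diamonds$price < 500)
n_under_250 <- sum(diamonds$price < 250)
n_15000_up  <- sum(diamonds$price >= 15000)

c(n_under_500, n_under_250, n_15000_up)
```

Each value should agree with the first element of the corresponding dim() result above.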

Q5) Cheaper Diamonds

Explore the largest peak in the price histogram you created earlier.

Try limiting the x-axis, altering the bin width, and setting different breaks on the x-axis.

There won’t be a solution video for this question so go to the discussions to share your thoughts and discover what other people find.

You can save images by using the ggsave() command. ggsave() will save the last plot created. For example:

qplot(x = price, data = diamonds)
ggsave('priceHistogram.png')

ggsave currently recognises the extensions eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).
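Because ggsave() defaults to the last plot displayed, it can be safer to pass the plot explicitly. A minimal sketch, assuming ggplot2 is installed; the file name, size, and temporary directory are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Save into a temporary directory; the file extension selects the device
out_file <- file.path(tempdir(), "priceHistogram.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)  # width/height in inches by default

file.exists(out_file)
```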

Submit your final code when you are ready.

qplot(x = price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 50) + 
  scale_x_continuous(limits= c(0,2400), breaks=seq(0,2400,200))

Notes: I plotted diamond prices up to roughly the median (the x-axis is limited to 0 to 2400 USD). The peak is around 800 USD, almost no diamonds are cheaper than 400 USD, and, interestingly, there are almost no diamonds priced around 1500 USD.

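As a side note, here is a small function that finds the mode (the most frequent value) and its frequency:

onerMode <- function(vctor) {
  # Tabulate once, then keep every value that attains the highest count
  counts <- table(vctor)
  freq <- max(counts)
  m <- names(counts)[counts == freq]
  list(mode = m, frequency = freq)
}

onerMode(diamonds$price)
## $mode
## [1] "605"
## 
## $frequency
## [1] 132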
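Since the function keeps every value whose count equals the maximum, it returns all modes when there is a tie. The check below uses a compact restatement of the function on a made-up vector; the input is illustrative only.

```r
onerMode <- function(vctor) {
  counts <- table(vctor)
  freq <- max(counts)
  list(mode = names(counts)[counts == freq], frequency = freq)
}

# 2 and 3 both occur twice, so both are reported as modes
res <- onerMode(c(1, 2, 2, 3, 3))
res$mode
res$frequency
```

The modes come back as character strings because they are taken from the names of the table; wrap them in as.numeric() if numbers are needed.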

Q6) Price by Cut Histograms

Break out the histogram of diamond prices by cut.

You should have five histograms in separate panels on your resulting plot.

qplot(x =price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 500) +
  facet_wrap(~cut)

by(diamonds$price,diamonds$cut,summary, digits = max(getOption('digits')))
## diamonds$cut: Fair
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   337.000  2050.250  3282.000  4358.758  5205.500 18574.000 
## -------------------------------------------------------- 
## diamonds$cut: Good
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   327.000  1145.000  3050.500  3928.864  5028.000 18788.000 
## -------------------------------------------------------- 
## diamonds$cut: Very Good
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   336.00   912.00  2648.00  3981.76  5372.75 18818.00 
## -------------------------------------------------------- 
## diamonds$cut: Premium
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   326.000  1046.000  3185.000  4584.258  6296.000 18823.000 
## -------------------------------------------------------- 
## diamonds$cut: Ideal
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   326.000   878.000  1810.000  3457.542  4678.500 18806.000
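When only one statistic per group is needed, tapply() returns it as a named vector, which is easier to compare than the printed by() blocks. A sketch, assuming ggplot2 is installed:

```r
library(ggplot2)

# Median price within each cut, in the factor's level order
med_by_cut <- tapply(diamonds$price, diamonds$cut, median)
med_by_cut
```

The values should agree with the Median column of the by() output above.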

Q7) Scales and Multiple Histograms

# In the last two exercises, we looked at
# the distribution for diamonds by cut.

# Run the code below in R Studio to generate
# the histogram as a reminder.

# ===============================================================
qplot(x = price, data = diamonds) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

# ===============================================================

# In the last exercise, we looked at the summary statistics
# for diamond price by cut. If we look at the output table,
# the median and quartiles are reasonably close to each other.

# diamonds$cut: Fair
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     337    2050    3282    4359    5206   18570 
# ------------------------------------------------------------------------ 
# diamonds$cut: Good
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     327    1145    3050    3929    5028   18790 
# ------------------------------------------------------------------------ 
# diamonds$cut: Very Good
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     336     912    2648    3982    5373   18820 
# ------------------------------------------------------------------------ 
# diamonds$cut: Premium
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     326    1046    3185    4584    6296   18820 
# ------------------------------------------------------------------------ 
# diamonds$cut: Ideal
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     326     878    1810    3458    4678   18810 

# This means the distributions should be somewhat similar,
# but the histograms we created don't show that.

# The 'Fair' and 'Good' diamonds appear to have 
# different distributions compared to the better
# cut diamonds. They seem somewhat uniform
# on the left with long tails on the right.

# Let's look into this more.

# Look up the documentation for facet_wrap in R Studio.
# Then, scroll back up and add a parameter to facet_wrap so that
# the y-axis in the histograms is not fixed. You want the y-axis to
# be different for each histogram.
qplot(x =price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 600) +
  facet_wrap(~cut, scales="free_y")
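One reason a fixed y-axis hides the shape of the smaller groups is that the cuts contain very different numbers of diamonds. A quick check, assuming ggplot2 is installed:

```r
library(ggplot2)

# Number of diamonds per cut; these match the cut column of summary(diamonds)
cut_counts <- table(diamonds$cut)
cut_counts

# Ratio of the largest group to the smallest
max(cut_counts) / min(cut_counts)
```

With a shared y-axis, the Fair panel is drawn on a scale set by the much larger Ideal group, so its bars look flat.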

Q8) Price per Carat by Cut

# Create a histogram of price per carat
# and facet it by cut. You can make adjustments
# to the code from the previous exercise to get
# started.

# Adjust the bin width and transform the scale
# of the x-axis using log10.

# Submit your final code when you are ready.

# ENTER YOUR CODE BELOW THIS LINE.
# ===========================================================================
qplot(x =price/carat, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 600) +
  facet_wrap(~cut, scales="free_y") 

qplot(x =log10(price/carat), data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 0.05) +
  facet_wrap(~cut, scales="free_y") 
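An alternative to wrapping the variable in log10() is to transform the axis with scale_x_log10(), which keeps the tick labels in the original units. The sketch below uses ggplot() directly and assumes ggplot2 is installed. The bin width is still given in log10 units because binning happens after the scale transformation.

```r
library(ggplot2)

p_log <- ggplot(diamonds, aes(x = price / carat)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "#099DD9") +
  scale_x_log10() +                     # log10 axis, labels in price per carat
  facet_wrap(~cut, scales = "free_y")   # one panel per cut, independent y-axes

# print(p_log) draws the plot
```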

Q9) Price Box Plots

# Investigate the price of diamonds using box plots,
# numerical summaries, and one of the following categorical
# variables: cut, clarity, or color.

# There won't be a solution video for this
# exercise so go to the discussion thread for either
# BOXPLOTS BY CLARITY, BOXPLOT BY COLOR, or BOXPLOTS BY CUT
# to share your thoughts and to
# see what other people found.

# You can save images by using the ggsave() command.
# ggsave() will save the last plot created.
# For example...
#                  qplot(x = price, data = diamonds)
#                  ggsave('priceHistogram.png')

# ggsave currently recognises the extensions eps/ps, tex (pictex),
# pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).

# Copy and paste all of the code that you used for
# your investigation, and submit it when you are ready.
# =================================================================
qplot(x = color, y = price, data = diamonds, geom = "boxplot") + 
  coord_cartesian(ylim = c(0,8000))

qplot(x = cut, y = price, data = diamonds, geom = "boxplot")  + 
  coord_cartesian(ylim = c(0,7000))

qplot(x = clarity, y = price, data = diamonds, geom = "boxplot") + 
  coord_cartesian(ylim = c(0,7000))

Q10) IQR

by(diamonds$price,diamonds$color,summary)
## diamonds$color: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1838    3170    4214   18690 
## -------------------------------------------------------- 
## diamonds$color: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     882    1739    3077    4003   18730 
## -------------------------------------------------------- 
## diamonds$color: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     342     982    2344    3725    4868   18790 
## -------------------------------------------------------- 
## diamonds$color: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     354     931    2242    3999    6048   18820 
## -------------------------------------------------------- 
## diamonds$color: H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337     984    3460    4487    5980   18800 
## -------------------------------------------------------- 
## diamonds$color: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    3730    5092    7202   18820 
## -------------------------------------------------------- 
## diamonds$color: J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4234    5324    7695   18710
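The summaries above give the quartiles, from which the interquartile range follows as the 3rd quartile minus the 1st. It can also be computed directly with IQR(). A sketch, assuming ggplot2 is installed:

```r
library(ggplot2)

# IQR of price within each color grade (D is the best color, J the worst)
iqr_by_color <- tapply(diamonds$price, diamonds$color, IQR)
iqr_by_color

# The same quantity from the quartiles, for color D only
q <- quantile(diamonds$price[diamonds$color == "D"], c(0.25, 0.75))
iqr_d <- unname(q[2] - q[1])
iqr_d
```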

Q11) Price per Carat Box Plots by Color

# Investigate the price per carat of diamonds across
# the different colors of diamonds using boxplots.
qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot") 

Note: The boxplots above and below show that the best color, D, has many outliers, and the number of outliers tends to decrease toward the worse colors. The IQRs for all colors are fairly similar, and the medians are close to each other.

qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot") + 
  coord_cartesian(ylim = c(0,6000))

Q12) Carat Frequency Polygon

Investigate the weight of diamonds (carat) using a frequency polygon. Use different bin widths to see how the frequency polygon changes. What carat size has a count greater than 2000?

qplot(x = carat, 
      data = diamonds, 
      binwidth = 0.01, 
      geom = 'freqpoly') + 
  scale_x_continuous(limits = c(0, 3), breaks = seq(0, 3, 0.3))
## Warning: Removed 2 rows containing missing values (geom_path).

# Below, each bin's count is shown as a proportion of the total count
qplot(x = carat, y = ..count../sum(..count..), 
      data = diamonds, 
      binwidth = 0.01, 
      geom = 'freqpoly') + 
  scale_x_continuous(limits = c(0, 3), breaks = seq(0, 3, 0.3))
## Warning: Removed 2 rows containing missing values (geom_path).
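To answer the question numerically rather than by reading the plot, the exact carat values can be tabulated. This is a sketch that assumes ggplot2 is installed. It counts exact values, which corresponds to the plot only approximately because the plot groups values into bins of width 0.01.

```r
library(ggplot2)

# Count diamonds at each distinct carat value
carat_counts <- table(diamonds$carat)

# Keep only the carat sizes that occur more than 2000 times
big <- carat_counts[carat_counts > 2000]
big
```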