This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
library("ggplot2")
data(diamonds)
dim(diamonds)
## [1] 53940 10
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
q2) Create a histogram of the price of all the diamonds in the diamond data set.
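No code for this question appears in the write-up. A minimal sketch of one way to draw the histogram follows. It uses ggplot() with geom_histogram() rather than qplot(), and the bin width of 500 and the object name price_hist are assumptions chosen for illustration, not values from the original.

```r
library(ggplot2)

# Histogram of price for every diamond in the data set.
# binwidth = 500 is an illustrative choice, not taken from the original.
price_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, color = "black", fill = "#099DD9")

price_hist
```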
q3) The distribution is right-skewed, so the mean lies to the right of (is greater than) the median.
median = 2401 and mean = 3933
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18820
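The skew can also be checked numerically: in a right-skewed distribution the long upper tail pulls the mean above the median. A small sketch (the variable names are illustrative):

```r
library(ggplot2)  # provides the diamonds data set

price_mean   <- mean(diamonds$price)
price_median <- median(diamonds$price)

# A positive gap (mean above median) is consistent with a right-skewed distribution.
price_mean - price_median
```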
q4) Diamonds Count
dim(subset(diamonds, price <500))
## [1] 1729 10
dim(subset(diamonds, price <250))
## [1] 0 10
dim(subset(diamonds, price >=15000))
## [1] 1656 10
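An equivalent way to get these counts, sketched here as an alternative rather than the approach used above, is to sum a logical vector. Each TRUE counts as 1, so no subset has to be built. The variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# sum() over a logical vector counts the TRUE values.
n_under_500  <- sum(diamonds$price < 500)
n_under_250  <- sum(diamonds$price < 250)
n_15000_plus <- sum(diamonds$price >= 15000)

c(under_500 = n_under_500, under_250 = n_under_250, at_least_15000 = n_15000_plus)
```

The three values should agree with the row counts reported by the dim() calls.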
q5) Cheaper Diamonds
Explore the largest peak in the price histogram you created earlier.
Try limiting the x-axis, altering the bin width, and setting different breaks on the x-axis.
There won’t be a solution video for this question so go to the discussions to share your thoughts and discover what other people find.
You can save images by using the ggsave() command. ggsave() will save the last plot created. For example:
qplot(x = price, data = diamonds)
ggsave('priceHistogram.png')
ggsave currently recognises the extensions eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).
Submit your final code when you are ready.
qplot(x = price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 50) +
scale_x_continuous(limits= c(0,2400), breaks=seq(0,2400,200))
Notes: I plotted all diamond prices up to the median. The peak is around 800 USD, almost no diamonds are cheaper than 400 USD, and, interestingly, there are almost no diamonds around the 1500 USD range.
As a side note, here is a function that finds the mode:
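# Return the most frequent value(s) of a vector and how often they occur.
onerMode <- function(vctor) {
  counts <- table(vctor)              # tabulate once and reuse the result
  freq <- max(counts)                 # highest frequency
  m <- names(counts)[counts == freq]  # value(s) that reach it, as character
  list(mode = m, frequency = freq)
}
onerMode(diamonds$price)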
## $mode
## [1] "605"
##
## $frequency
## [1] 132
Q6) Price by Cut Histograms
Break out the histogram of diamond prices by cut.
You should have five histograms in separate panels on your resulting plot.
qplot(x =price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 500) +
facet_wrap(~cut)
by(diamonds$price,diamonds$cut,summary, digits = max(getOption('digits')))
## diamonds$cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337.000 2050.250 3282.000 4358.758 5205.500 18574.000
## --------------------------------------------------------
## diamonds$cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327.000 1145.000 3050.500 3928.864 5028.000 18788.000
## --------------------------------------------------------
## diamonds$cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336.00 912.00 2648.00 3981.76 5372.75 18818.00
## --------------------------------------------------------
## diamonds$cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326.000 1046.000 3185.000 4584.258 6296.000 18823.000
## --------------------------------------------------------
## diamonds$cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326.000 878.000 1810.000 3457.542 4678.500 18806.000
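When only one statistic per group is needed, tapply() returns it as a named vector, which is easier to sort or compare than the printed by() output. A minimal sketch using the median (the name median_by_cut is illustrative):

```r
library(ggplot2)  # provides the diamonds data set

# Median price for each cut, returned as a named numeric vector.
median_by_cut <- tapply(diamonds$price, diamonds$cut, median)

# Order the cuts from the lowest to the highest median price.
sort(median_by_cut)
```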
Q7) Scales and Multiple Histograms
# In the last two exercises, we looked at
# the distribution for diamonds by cut.
# Run the code below in R Studio to generate
# the histogram as a reminder.
# ===============================================================
qplot(x = price, data = diamonds) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
# ===============================================================
# In the last exercise, we looked at the summary statistics
# for diamond price by cut. If we look at the output table,
# the median and quartiles are reasonably close to each other.
# diamonds$cut: Fair
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 337 2050 3282 4359 5206 18570
# ------------------------------------------------------------------------
# diamonds$cut: Good
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 327 1145 3050 3929 5028 18790
# ------------------------------------------------------------------------
# diamonds$cut: Very Good
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 336 912 2648 3982 5373 18820
# ------------------------------------------------------------------------
# diamonds$cut: Premium
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 326 1046 3185 4584 6296 18820
# ------------------------------------------------------------------------
# diamonds$cut: Ideal
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 326 878 1810 3458 4678 18810
# This means the distributions should be somewhat similar,
# but the histograms we created don't show that.
# The 'Fair' and 'Good' diamonds appear to have
# different distributions compared to the better
# cut diamonds. They seem somewhat uniform
# on the left with long tails on the right.
# Let's look into this more.
# Look up the documentation for facet_wrap in R Studio.
# Then, scroll back up and add a parameter to facet_wrap so that
# the y-axis in the histograms is not fixed. You want the y-axis to
# be different for each histogram.
qplot(x =price, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 600) +
facet_wrap(~cut, scales="free_y")
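One likely reason the fixed-axis panels are hard to compare is that the cuts differ greatly in size, so the smaller groups are squashed against the x-axis when every panel shares one y-axis. A quick sketch to see the group sizes (the name cut_counts is illustrative):

```r
library(ggplot2)  # provides the diamonds data set

# Number of diamonds in each cut.
cut_counts <- table(diamonds$cut)
cut_counts

# How many times larger the biggest group is than the smallest.
max(cut_counts) / min(cut_counts)
```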
Q8) Price per Carat by Cut
# Create a histogram of price per carat
# and facet it by cut. You can make adjustments
# to the code from the previous exercise to get
# started.
# Adjust the bin width and transform the scale
# of the x-axis using log10.
# Submit your final code when you are ready.
# ENTER YOUR CODE BELOW THIS LINE.
# ===========================================================================
qplot(x =price/carat, data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 600) +
facet_wrap(~cut, scales="free_y")
qplot(x =log10(price/carat), data = diamonds, color = I('black'), fill = I('#099DD9'), binwidth = 0.05) +
facet_wrap(~cut, scales="free_y")
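The second plot above applies log10() to the variable itself. An alternative sketch, closer to the wording of the exercise, keeps price/carat as the variable and transforms the axis with scale_x_log10(). The bin width is then interpreted in log10 units, while the axis labels stay in the original units. This sketch uses ggplot() with geom_histogram() rather than qplot(), and the name ppc_hist is illustrative.

```r
library(ggplot2)

# Histogram of price per carat on a log10-scaled x-axis, one panel per cut.
# binwidth = 0.05 is measured on the log10 scale because the axis is transformed.
ppc_hist <- ggplot(diamonds, aes(x = price / carat)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "#099DD9") +
  facet_wrap(~cut, scales = "free_y") +
  scale_x_log10()

ppc_hist
```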
Q9) Price Box Plots
# Investigate the price of diamonds using box plots,
# numerical summaries, and one of the following categorical
# variables: cut, clarity, or color.
# There won't be a solution video for this
# exercise so go to the discussion thread for either
# BOXPLOTS BY CLARITY, BOXPLOT BY COLOR, or BOXPLOTS BY CUT
# to share your thoughts and to
# see what other people found.
# You can save images by using the ggsave() command.
# ggsave() will save the last plot created.
# For example...
# qplot(x = price, data = diamonds)
# ggsave('priceHistogram.png')
# ggsave currently recognises the extensions eps/ps, tex (pictex),
# pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).
# Copy and paste all of the code that you used for
# your investigation, and submit it when you are ready.
# =================================================================
qplot(x = color, y = price, data = diamonds, geom = "boxplot") +
coord_cartesian(ylim = c(0,8000))
qplot(x = cut, y = price, data = diamonds, geom = "boxplot") +
coord_cartesian(ylim = c(0,7000))
qplot(x = clarity, y = price, data = diamonds, geom = "boxplot") +
coord_cartesian(ylim = c(0,7000))
Q10) IQR
by(diamonds$price,diamonds$color,summary)
## diamonds$color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18690
## --------------------------------------------------------
## diamonds$color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1739 3077 4003 18730
## --------------------------------------------------------
## diamonds$color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2344 3725 4868 18790
## --------------------------------------------------------
## diamonds$color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2242 3999 6048 18820
## --------------------------------------------------------
## diamonds$color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4487 5980 18800
## --------------------------------------------------------
## diamonds$color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5092 7202 18820
## --------------------------------------------------------
## diamonds$color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
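The summaries above list the quartiles, and the interquartile range is the 3rd quartile minus the 1st quartile. As a sketch, IQR() computes it directly for each color (the name iqr_by_color is illustrative):

```r
library(ggplot2)  # provides the diamonds data set

# Interquartile range of price within each color grade.
iqr_by_color <- tapply(diamonds$price, diamonds$color, IQR)
iqr_by_color
```

Because summary() rounds its printed values, the results may differ slightly from differences taken by hand from the table.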
Q11) Price per Carat Box Plots by Color
# Investigate the price per carat of diamonds across
# the different colors of diamonds using boxplots.
qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot")
Note: The boxplots above and below show that the best color, D, has many outliers, and the number of outliers tends to decrease toward the worse colors. The IQRs for all colors are fairly similar, and the medians are close to each other.
qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot") +
coord_cartesian(ylim = c(0,6000))
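To put numbers next to the boxplots, the median and IQR of price per carat can be computed for each color. This is only a sketch; the names ppc and ppc_summary are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Price per carat for every diamond.
ppc <- diamonds$price / diamonds$carat

# One row per color, with the median and IQR of price per carat.
ppc_summary <- cbind(
  median = tapply(ppc, diamonds$color, median),
  IQR    = tapply(ppc, diamonds$color, IQR)
)
ppc_summary
```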
Q12) Carat Frequency Polygon
Investigate the weight of diamonds (carat) using a frequency polygon. Use different bin widths to see how the frequency polygon changes. What carat size has a count greater than 2000?
qplot(x = carat,
data = diamonds,
binwidth =0.01,
geom = 'freqpoly') +
scale_x_continuous(limits = c(0,3), breaks = seq(0,3,0.3))
## Warning: Removed 2 rows containing missing values (geom_path).
# Below, the y-axis shows each bin's share of the total count instead of the raw count
qplot(x = carat, y= ..count../sum(..count..),
data = diamonds,
binwidth =0.01,
geom = 'freqpoly') +
scale_x_continuous(limits = c(0,3), breaks = seq(0,3,0.3))
## Warning: Removed 2 rows containing missing values (geom_path).
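The question of which carat sizes occur more than 2000 times can also be answered without a plot by tabulating the exact carat values. A minimal sketch (the names carat_counts and common_carats are illustrative):

```r
library(ggplot2)  # provides the diamonds data set

# Count how often each distinct carat value occurs.
carat_counts <- table(diamonds$carat)

# Keep only the carat values that appear more than 2000 times.
common_carats <- carat_counts[carat_counts > 2000]
common_carats
```

This counts exact carat values, which matches the polygon drawn with a bin width of 0.01 only approximately, since bin edges need not align with the recorded values.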