diamonds

I created the .Rmd file that produced this RPubs document so that you can examine the RMarkdown code I used to format this page. Download the original RMarkdown document from a Piazza note I just published. Double-click the file you download, Diamonds_Plot_Demo.Rmd; it should open in RStudio. Compare the code I wrote with the information on the RMarkdown Cheat Sheet. ~dlp
diamonds Data Frame

The diamonds data frame is available when the ggplot2 package is loaded. The data frame records 10 characteristics (variables in columns) of 53,940 diamonds (observations in rows).
Characteristics of variables in diamonds include:
carat - The weight of a diamond is measured in carats, as it is for all gemstones. A carat is divided into 100 points, much as a pound is divided into ounces: 25 points = 1/4 carat, 50 points = 1/2 carat, and so on. The higher the carat weight, the more you can expect to pay, but the price does not rise in proportion to the weight. A 2 carat diamond will cost more than twice as much as a 1 carat diamond, because larger diamonds are rarer and the price climbs steeply as weight increases.
cut - A diamond cut is a style or design guide used when shaping a diamond for polishing. Cut does not refer to shape (pear, oval), but the symmetry, proportioning, and polish of a diamond. The cut of a diamond greatly affects a diamond’s brilliance; this means if it is cut poorly, it will be less luminous. This variable focuses on a judgment about the quality of the diamond’s cut: Fair; Good; Very Good; Premium; Ideal.
color - Most commercially available diamonds are classified by color, or more appropriately, the lack of color. The most valuable diamonds are those classified as colorless, yet there are stones with rich colors, including yellow, red, green, and even black, that are extremely rare and valuable. Color is graded on a letter scale from D to Z, with D representing a colorless diamond. The diamonds data frame includes grades D through J.
[Image: Diamond Color Chart, http://www.diamondse.info/images/diamond-color-chart.gif. Colorless = D, E, F; near colorless = G, H, I, J; faint yellow = K, L, M; very light yellow = N, O, P, Q, R; light yellow = S to Z; fancy yellow = FANCY.]
clarity - A judgment about how clear the diamond is, graded on an ordered scale with eight levels: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).
x - Length of the diamond in millimeters.
y - Width of the diamond in millimeters.
z - Height of the diamond in millimeters, measured from the bottom of the diamond to its table (the flat surface on the top of the diamond); also called the depth of the diamond.
depth - The total depth percentage of the diamond: z divided by the mean of x and y, that is, 2(z) / (x + y), expressed as a percentage.
table - The width of the top of the diamond (its table) relative to the diamond's widest point.
price - The retail price of the diamond in U.S. dollars.
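You can check the definition of depth against the data yourself. Here is a minimal sketch, assuming depth is 100 times z divided by the mean of x and y. The names measured and depth_calc are mine, and I drop rows with a zero dimension because the ratio is undefined for them.

```r
library(ggplot2)  # provides the diamonds data frame

# Keep only rows with positive dimensions; the ratio is undefined otherwise.
measured <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Total depth percentage: z divided by the mean of x and y, times 100.
depth_calc <- 100 * 2 * measured$z / (measured$x + measured$y)

# Compare the recomputed values with the recorded depth column.
head(round(depth_calc, 1))
head(measured$depth)

# Differences are expected to be small because x, y, and z are rounded.
summary(abs(depth_calc - measured$depth))
```

Small discrepancies are normal here, since the recorded dimensions are rounded to hundredths of a millimeter.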
Here is a listing of the first 10 lines in diamonds:
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
# ... with 53,930 more rows
Notice that the data types of the variables in diamonds include dbl, int, and ord. dbl and int indicate numeric variables (double-precision and integer values), but we have not seen the ord data type yet. Here is a list of the variables in the data frame, along with their data types:
Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
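Both listings above come from standard calls. This is a minimal sketch rather than a copy of the code in the .Rmd file; it assumes dplyr is installed, and the header text that gets printed varies a little between package versions.

```r
library(ggplot2)  # provides the diamonds data frame
library(dplyr)    # loads the tibble print method and glimpse()

# First 10 rows, with the column types shown under the column names.
print(diamonds, n = 10)

# Every variable with its class and its first few values.
str(diamonds)

# A more compact alternative to str() from dplyr.
glimpse(diamonds)
```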
The ord data type indicates an ordered factor. It is a factor, meaning that the variable has a small number of values that represent categories. The values are ordered, meaning that, for example, the diamonds in cut are classified in an order ranging from “Fair,” the lowest quality cut, to “Ideal,” the highest quality cut.
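To see what the ordering buys you, try a few base R calls on cut. This is a small sketch; the name better_than_good is mine.

```r
library(ggplot2)  # provides the diamonds data frame

# Confirm that cut is an ordered factor and list its levels, lowest first.
is.ordered(diamonds$cut)
levels(diamonds$cut)

# Because the levels are ordered, comparisons such as ">" are meaningful.
better_than_good <- diamonds$cut > "Good"
table(better_than_good)

# The lowest and highest cut present in the data.
range(diamonds$cut)
```

The same comparison on an unordered factor returns NA with a warning, which is a quick way to tell the two kinds of factor apart.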
A variable is an entity that has two or more mutually exclusive values. A frequency distribution displays the number of observations for each value of a variable. Here are a few examples. Try others on your own. Start by copying my code; then, modify the statements to include other variables.
color.freq
D 6775
E 9797
F 9542
G 11292
H 8304
I 5422
J 2808
clarity.freq
I1 741
SI2 9194
SI1 13065
VS2 12258
VS1 8171
VVS2 5066
VVS1 3655
IF 1790
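One way to get frequency tables laid out like the two above is table() followed by cbind(). Treat this as a sketch and check the .Rmd file for the exact statements; the cut.freq example at the end is my own addition to show how to swap in another variable.

```r
library(ggplot2)  # provides the diamonds data frame

# Count the observations at each level of color.
color.freq <- table(diamonds$color)
cbind(color.freq)  # cbind() prints the counts as a single column

# The same two steps for clarity.
clarity.freq <- table(diamonds$clarity)
cbind(clarity.freq)

# Modify the statements to use another variable, for example cut.
cut.freq <- table(diamonds$cut)
cbind(cut.freq)
```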
The range of the prices (variable price) of the 53,940 diamonds:
[1] 326 18823
The frequency distribution of prices (in scientific notation) grouped in $1,000 intervals:
price.freq
[326,1.33e+03) 18790
[1.33e+03,2.33e+03) 7517
[2.33e+03,3.33e+03) 5466
[3.33e+03,4.33e+03) 4381
[4.33e+03,5.33e+03) 4303
[5.33e+03,6.33e+03) 2743
[6.33e+03,7.33e+03) 2057
[7.33e+03,8.33e+03) 1502
[8.33e+03,9.33e+03) 1257
[9.33e+03,1.03e+04) 1015
[1.03e+04,1.13e+04) 935
[1.13e+04,1.23e+04) 737
[1.23e+04,1.33e+04) 685
[1.33e+04,1.43e+04) 561
[1.43e+04,1.53e+04) 512
[1.53e+04,1.63e+04) 479
[1.63e+04,1.73e+04) 439
[1.73e+04,1.83e+04) 393
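The range and the grouped table can be sketched with range(), seq(), and cut(). The break points below are my assumption, chosen because they reproduce the interval labels shown above; the names breaks and price.cut are mine as well.

```r
library(ggplot2)  # provides the diamonds data frame

# Lowest and highest price.
range(diamonds$price)

# Break points every $1,000, starting at the minimum price.
breaks <- seq(326, 18823, by = 1000)

# Assign each price to an interval that is closed on the left.
price.cut <- cut(diamonds$price, breaks, right = FALSE)

# Count the diamonds in each interval and print the counts as a column.
price.freq <- table(price.cut)
cbind(price.freq)

# Prices at or above the last break fall outside every interval.
sum(is.na(price.cut))
```

With these breaks the last interval ends at 18,326, so the most expensive diamonds are not counted in the table. Extending the breaks past the maximum price would include them.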
I computed all of the following statistics with dplyr. Look at the code in the .Rmd file, my friends... look at the code.
# A tibble: 7 x 3
color average_price number
<ord> <dbl> <int>
1 D 3169.954 6775
2 E 3076.752 9797
3 F 3724.886 9542
4 G 3999.136 11292
5 H 4486.669 8304
6 I 5091.875 5422
7 J 5323.818 2808
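A summary like the one above takes two dplyr verbs: group_by() to split the rows by color, and summarise() to compute one row of statistics per group. Here is a minimal sketch; the name by_color is mine.

```r
library(ggplot2)  # provides the diamonds data frame
library(dplyr)

by_color <- diamonds %>%
  group_by(color) %>%                  # one group per color grade
  summarise(
    average_price = mean(price),       # mean price within the group
    number = n()                       # number of diamonds in the group
  )

by_color
```

Swap color for clarity or cut, or price for carat, to produce the other summaries on this page.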
# A tibble: 8 x 3
clarity average_price number
<ord> <dbl> <int>
1 I1 3924.169 741
2 SI2 5063.029 9194
3 SI1 3996.001 13065
4 VS2 3924.989 12258
5 VS1 3839.455 8171
6 VVS2 3283.737 5066
7 VVS1 2523.115 3655
8 IF 2864.839 1790
# A tibble: 8 x 5
clarity average_price minimum maximum number
<ord> <dbl> <dbl> <dbl> <int>
1 I1 3924.169 345 18531 741
2 SI2 5063.029 326 18804 9194
3 SI1 3996.001 326 18818 13065
4 VS2 3924.989 334 18823 12258
5 VS1 3839.455 327 18795 8171
6 VVS2 3283.737 336 18768 5066
7 VVS1 2523.115 336 18777 3655
8 IF 2864.839 369 18806 1790
# A tibble: 8 x 5
clarity average_price minimum maximum number
<ord> <dbl> <dbl> <dbl> <int>
1 SI2 5063.029 326 18804 9194
2 SI1 3996.001 326 18818 13065
3 VS2 3924.989 334 18823 12258
4 I1 3924.169 345 18531 741
5 VS1 3839.455 327 18795 8171
6 VVS2 3283.737 336 18768 5066
7 IF 2864.839 369 18806 1790
8 VVS1 2523.115 336 18777 3655
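The last two tables contain the same rows; the second is sorted from the highest average price to the lowest. A sketch of both steps, assuming dplyr, with by_clarity and sorted as my own names:

```r
library(ggplot2)  # provides the diamonds data frame
library(dplyr)

by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    average_price = mean(price),
    minimum = min(price),
    maximum = max(price),
    number = n()
  )

# Rows appear in the order of the clarity levels.
by_clarity

# Sort the same summary by average price, highest first.
sorted <- by_clarity %>% arrange(desc(average_price))
sorted
```

The column types in your output may print differently from those shown above, depending on your dplyr version.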
# A tibble: 5 x 3
cut average_carat number
<ord> <dbl> <int>
1 Fair 1.0461366 1610
2 Good 0.8491847 4906
3 Very Good 0.8063814 12082
4 Premium 0.8919549 13791
5 Ideal 0.7028370 21551
Test of the hypothesis that there is no difference in the price of diamonds with clarity = VVS1 and clarity = IF.
Examine the code in the .Rmd file, and try this yourself with other variables. ~ dlp
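# I will show the R code in the RPubs document
# Prices of the diamonds in each of the two clarity groups
price.vvs1 <- diamonds[diamonds$clarity == "VVS1", ]$price
price.if <- diamonds[diamonds$clarity == "IF", ]$price
# Two-sample t-test that assumes equal variances in the two groups
t.test(price.vvs1, price.if, var.equal = TRUE)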
Two Sample t-test
data: price.vvs1 and price.if
t = -3.3481, df = 5443, p-value = 0.0008193
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-541.8145 -141.6344
sample estimates:
mean of x mean of y
2523.115 2864.839
The difference in the average price of diamonds with clarity VVS1 (M = $2,523) and diamonds with clarity IF (M = $2,865) was -$342, which was statistically significant at the .05 level, t(5443) = -3.35, p < .001, 95% CI [-$542, -$142].
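To see where the numbers in that sentence come from, you can rebuild the pooled-variance t statistic by hand and compare it with the output of t.test(). This is a sketch of the standard equal-variance formulas; every name other than price.vvs1 and price.if is mine.

```r
library(ggplot2)  # provides the diamonds data frame

price.vvs1 <- diamonds$price[diamonds$clarity == "VVS1"]
price.if   <- diamonds$price[diamonds$clarity == "IF"]

n1 <- length(price.vvs1)
n2 <- length(price.if)

# Degrees of freedom for the equal-variance test.
df <- n1 + n2 - 2

# Pooled variance: a weighted average of the two sample variances.
pooled_var <- ((n1 - 1) * var(price.vvs1) + (n2 - 1) * var(price.if)) / df

# Standard error of the difference between the two means.
se <- sqrt(pooled_var * (1 / n1 + 1 / n2))

mean_diff <- mean(price.vvs1) - mean(price.if)
t_manual  <- mean_diff / se

# Two-sided p-value and 95% confidence interval.
p_value <- 2 * pt(-abs(t_manual), df)
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se

# Compare with the built-in test.
tt <- t.test(price.vvs1, price.if, var.equal = TRUE)
c(manual = t_manual, t.test = unname(tt$statistic))
ci
```

Setting var.equal = FALSE (the default) runs Welch's test instead, which does not pool the variances and usually reports fractional degrees of freedom.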
ggplot2

Again, examine the R code in the .Rmd file to see how these plots were produced. Compare my code with the ggplot2 Cheat Sheet.
Give it a try! By the way: A 5 carat diamond?
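If you want a starting point, here is a minimal sketch of one plot you could try: a scatterplot of price against carat. It is an illustration of my choosing, not necessarily one of the plots in the .Rmd file, and the labels and transparency setting are assumptions you should adjust.

```r
library(ggplot2)  # provides ggplot() and the diamonds data frame

# Price against carat; alpha makes overlapping points easier to read.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  labs(x = "Carat", y = "Price (US dollars)", title = "Diamond price by carat")

p

# The heaviest diamond in the data, and the row or rows above 5 carats.
max(diamonds$carat)
diamonds[diamonds$carat > 5, ]
```

Try mapping color, cut, or clarity to the color aesthetic inside aes() to see how the ordered factors separate the points.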