First, we load the diamonds dataset in RStudio. It ships with ggplot2, which is installed as part of the tidyverse.

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("ggplot2")
data("diamonds")

Next, we install and load the other packages we need.

We install the “here” package to make file referencing easier.

install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("here")
## here() starts at /cloud/project
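
As a minimal sketch of what easier file referencing looks like: here() builds paths relative to the project root, so a script does not depend on the current working directory. The folder name "data" and the file name "diamonds.csv" are hypothetical and used only for illustration.

```r
library("here")

# Build a path relative to the project root reported by here().
# "data" and "diamonds.csv" are hypothetical names, not files from this project.
csv_path <- here("data", "diamonds.csv")
print(csv_path)

# The file could then be read with, for example:
# diamonds_csv <- read.csv(csv_path)
```

The same call returns a valid path on any machine where the project folder is opened, which is the main reason to prefer it over hard-coded absolute paths.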

We install the “skimr” package to summarize and skim through the data quickly.

install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("skimr")

We install the “janitor” package to do some cleaning.

install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("janitor")
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

We will also load the “dplyr” package, since we will be using some of its functions.

install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
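
Calling install.packages() on every run reinstalls packages that are already present. One possible alternative, sketched below, installs only what is missing. The helper missing_packages() is hypothetical (it is not part of any package) and is defined here for illustration.

```r
# Packages this walkthrough relies on.
needed <- c("ggplot2", "here", "skimr", "janitor", "dplyr")

# Return the subset of `pkgs` that is not installed yet.
# missing_packages() is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only the packages that are not available.
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

The library() calls are still needed afterwards, because installing a package does not attach it.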

Now we will look at an overview of the diamonds dataset, including basic statistics for each column.

Data summary

Name                      diamonds
Number of rows            53940
Number of columns         10

Column type frequency:
  factor                  3
  numeric                 7

Group variables           None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
cut                    0              1  TRUE            5  Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color                  0              1  TRUE            7  G: 11292, E: 9797, F: 9542, H: 8304
clarity                0              1  TRUE            8  SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd     p0     p25      p50      p75      p100
carat                  0              1     0.80     0.47    0.2    0.40     0.70     1.04      5.01
depth                  0              1    61.75     1.43   43.0   61.00    61.80    62.50     79.00
table                  0              1    57.46     2.23   43.0   56.00    57.00    59.00     95.00
price                  0              1  3932.80  3989.44  326.0  950.00  2401.00  5324.25  18823.00
x                      0              1     5.73     1.12    0.0    4.71     5.70     6.54     10.74
y                      0              1     5.73     1.14    0.0    4.72     5.71     6.54     58.90
z                      0              1     3.54     0.71    0.0    2.91     3.53     4.04     31.80
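
The call that produced the summary above is not shown. Its layout matches skimr output without the histogram column, so a call along the following lines is a reasonable guess. This is an assumption, not something the original confirms.

```r
library("ggplot2")  # provides the diamonds dataset
library("skimr")

# Summarise every column; skim_without_charts() omits the inline histograms.
diamonds_summary <- skim_without_charts(diamonds)
print(diamonds_summary)
```

The result has one row per variable, so it can also be filtered or sorted like an ordinary data frame.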

Let's look at the structure of the dataset.

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We will now take a glimpse of the diamonds dataset to see the total number of rows and columns, along with some sample values from each column.

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
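
The row and column counts reported by glimpse() can also be checked directly with base R. A small sketch:

```r
library("ggplot2")  # provides the diamonds dataset

# dim() returns the number of rows followed by the number of columns.
dim(diamonds)

nrow(diamonds)  # rows only
ncol(diamonds)  # columns only
```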

To see just the column names, we can use colnames().

colnames(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
##  [8] "x"       "y"       "z"

Now, let's fetch the first few rows of the diamonds dataset.

head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

To make the column names stand out, we convert them to uppercase.

DIAMOND <- rename_with(diamonds, toupper)
head(DIAMOND)
## # A tibble: 6 × 10
##   CARAT CUT       COLOR CLARITY DEPTH TABLE PRICE     X     Y     Z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
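
rename_with() applies a function to column names, and its .cols argument restricts which names are changed. As a sketch (the choice of x, y and z is only for illustration), the following uppercases just the three dimension columns:

```r
library("ggplot2")  # provides the diamonds dataset
library("dplyr")

# Uppercase only the x, y and z column names; the rest stay unchanged.
dims_upper <- rename_with(diamonds, toupper, .cols = c(x, y, z))
names(dims_upper)
```

Leaving .cols out, as in the call above, applies the function to every column.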

Next, we change the spelling of the column named “COLOR” to “COLOUR”.

DIAMOND <- DIAMOND %>% rename(COLOUR = COLOR)
head(DIAMOND)
## # A tibble: 6 × 10
##   CARAT CUT       COLOUR CLARITY DEPTH TABLE PRICE     X     Y     Z
##   <dbl> <ord>     <ord>  <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E      SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E      SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E      VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I      VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J      SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J      VVS2     62.8    57   336  3.94  3.96  2.48

We will keep only the columns we need, which drops TABLE, X, Y and Z.

DIAMOND <- DIAMOND %>% select(CARAT, CUT, COLOUR, CLARITY, DEPTH, PRICE)
head(DIAMOND)
## # A tibble: 6 × 6
##   CARAT CUT       COLOUR CLARITY DEPTH PRICE
##   <dbl> <ord>     <ord>  <ord>   <dbl> <int>
## 1  0.23 Ideal     E      SI2      61.5   326
## 2  0.21 Premium   E      SI1      59.8   326
## 3  0.23 Good      E      VS1      56.9   327
## 4  0.29 Premium   I      VS2      62.4   334
## 5  0.31 Good      J      SI2      63.3   335
## 6  0.24 Very Good J      VVS2     62.8   336
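
select() can either list the columns to keep or, with a minus sign, the columns to drop. The sketch below rebuilds the renamed table under the hypothetical name DIAMOND_ALL so that it runs on its own, then compares the two approaches.

```r
library("ggplot2")  # provides the diamonds dataset
library("dplyr")

# Recreate the uppercase, renamed table from the previous steps.
# DIAMOND_ALL is a hypothetical name used only in this example.
DIAMOND_ALL <- diamonds %>%
  rename_with(toupper) %>%
  rename(COLOUR = COLOR)

# Keep columns by listing them ...
kept <- DIAMOND_ALL %>% select(CARAT, CUT, COLOUR, CLARITY, DEPTH, PRICE)

# ... or drop the unwanted ones instead.
dropped <- DIAMOND_ALL %>% select(-c(TABLE, X, Y, Z))

# Both approaches leave the same columns in the same order.
names(kept)
names(dropped)
```

Dropping is usually shorter when only a few columns are unwanted, while listing makes the final column order explicit.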

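Let's make sure the column names contain only letters, numbers and underscores. Note that clean_names() also converts the names to lowercase, and because the result is not assigned here, DIAMOND itself keeps its uppercase names.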

clean_names(DIAMOND)
## # A tibble: 53,940 × 6
##    carat cut       colour clarity depth price
##    <dbl> <ord>     <ord>  <ord>   <dbl> <int>
##  1  0.23 Ideal     E      SI2      61.5   326
##  2  0.21 Premium   E      SI1      59.8   326
##  3  0.23 Good      E      VS1      56.9   327
##  4  0.29 Premium   I      VS2      62.4   334
##  5  0.31 Good      J      SI2      63.3   335
##  6  0.24 Very Good J      VVS2     62.8   336
##  7  0.24 Very Good I      VVS1     62.3   336
##  8  0.26 Very Good H      SI1      61.9   337
##  9  0.22 Fair      E      VS2      65.1   337
## 10  0.23 Very Good H      VS1      59.4   338
## # … with 53,930 more rows
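
To show what clean_names() does with names that actually need cleaning, here is a small sketch on a toy data frame. The data frame and its column names are made up for this example.

```r
library("janitor")

# A toy data frame with deliberately messy column names (invented for illustration).
messy <- data.frame(
  `Carat Weight` = 0.23,
  `Cut.Type` = "Ideal",
  PRICE = 326,
  check.names = FALSE
)

# clean_names() returns a new data frame; assign the result to keep the cleaned names.
cleaned <- clean_names(messy)
names(cleaned)
```

Spaces and dots become underscores and everything is lowercased, while the original `messy` object is left untouched.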

Let's visualize this data.

library("ggplot2")

We will visualise the relation between cut and clarity. The height of each bar segment gives the number of diamonds.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=CUT, fill=CLARITY))
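
geom_bar() counts rows for each group before drawing. The numbers behind the stacked bars can be inspected with dplyr's count(), as in this sketch. It uses the original lowercase diamonds columns so that it runs on its own.

```r
library("ggplot2")  # provides the diamonds dataset
library("dplyr")

# One row per cut/clarity combination; n is the number of diamonds in it.
cut_clarity_counts <- diamonds %>% count(cut, clarity)
head(cut_clarity_counts)
```

Each value of n corresponds to the height of one coloured segment in the chart.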

Next, we plot the relation between diamond colour and cut.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=CUT, fill=COLOUR)) 
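
Because the cut types differ a lot in size, stacked counts make the colour mix hard to compare across bars. One option, sketched here with the original lowercase column names, is position = "fill", which rescales every bar to the same height so that segments show proportions.

```r
library("ggplot2")  # provides the diamonds dataset

# position = "fill" rescales each bar to height 1, so segments show proportions.
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color), position = "fill") +
  labs(y = "proportion")
print(p)
```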

Let’s split the bar chart into one panel per cut type, with colour on the x-axis.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=COLOUR, fill = CUT)) + 
  facet_wrap(~CUT)

We will limit the cut type to “Ideal” (as these are the most preferred ones) and sort the result by carat in descending order.

Ideal_diamonds <- DIAMOND %>% 
  filter(CUT == 'Ideal') %>%
  arrange(desc(CARAT))
head(Ideal_diamonds)
## # A tibble: 6 × 6
##   CARAT CUT   COLOUR CLARITY DEPTH PRICE
##   <dbl> <ord> <ord>  <ord>   <dbl> <int>
## 1  3.5  Ideal H      I1       62.8 12587
## 2  3.22 Ideal I      I1       62.6 12545
## 3  3.01 Ideal J      SI2      61.7 16037
## 4  3.01 Ideal J      I1       65.4 16538
## 5  2.75 Ideal D      I1       60.9 13156
## 6  2.72 Ideal H      I1       59.6 11594

We can see that the largest Ideal-cut diamond, at 3.50 carats, costs $12,587. It is not the most expensive of the six rows shown: the two 3.01-carat diamonds cost $16,037 and $16,538.
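
Because the table above is sorted by carat, finding the most expensive Ideal-cut diamond requires ordering by price instead. A minimal sketch using dplyr's slice_max(), written against the original lowercase diamonds columns so that it runs on its own:

```r
library("ggplot2")  # provides the diamonds dataset
library("dplyr")

# Keep Ideal-cut diamonds and pick the row with the highest price.
# with_ties = FALSE returns a single row even if several share the maximum.
priciest_ideal <- diamonds %>%
  filter(cut == "Ideal") %>%
  slice_max(price, n = 1, with_ties = FALSE)
priciest_ideal
```

Replacing slice_max() with arrange(desc(price)) would instead return the full table ordered from most to least expensive.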