In this lab we will be using the ‘diamonds’ dataset which is included in the ggplot2 package. You’ll need to load the library!

  1. In the console, type View(diamonds) to see the dataframe.
  1. Display the first 3 rows of the data frame. How many variables are given? How can you find out what they mean?
#install.packages("ggplot2")
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 4.0.3
head(diamonds, 3)  # first 3 rows (head() alone would show 6)
diamonds[1:3, ]    # equivalent, using row indexing
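The remaining parts of the question (how many variables there are and what they mean) can be answered with a few base R calls. This is a minimal sketch, assuming ggplot2 is installed; the help page opened by `?diamonds` describes each variable, and it is left commented out here because it only makes sense in an interactive session.

```r
library(ggplot2)   # provides the diamonds data set

ncol(diamonds)     # number of variables (columns)
names(diamonds)    # the variable names
str(diamonds)      # type and first few values of each variable
# ?diamonds        # help page describing each variable (interactive use)
```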
  1. Use an R command to determine the number of diamonds in this dataset.
nrow(diamonds)
## [1] 53940
  1. What percentage of the diamonds are Fair? What percentage are Ideal?
print("percent fair:")
## [1] "percent fair:"
mean(diamonds$cut == "Fair") * 100
## [1] 2.984798
print("percent ideal:")
## [1] "percent ideal:"
mean(diamonds$cut == "Ideal") * 100
## [1] 39.95365
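Both percentages can also be read off a single table instead of being computed one cut at a time. The sketch below uses `table()` and `prop.table()` from base R; the name `cut_pct` is only an illustrative choice.

```r
library(ggplot2)

# Share of each cut as a percentage of all diamonds
cut_pct <- prop.table(table(diamonds$cut)) * 100
round(cut_pct, 2)
```

The Fair and Ideal entries of this table should match the two values computed above.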
  1. Use ggplot2 to create a bar chart of the color variable, colored by cut.
ggplot(diamonds, aes(x = color, fill = cut)) + geom_bar()
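A stacked bar chart makes it hard to compare the mix of cuts between colors, because the bars have different heights. One possible variation (not required by the exercise) is `position = "fill"`, which rescales every bar to height 1 so the proportions can be compared directly. The object name `p_fill` is just for illustration.

```r
library(ggplot2)

# Same bar chart, but each bar is rescaled to height 1 so the share of
# each cut within a color can be compared across colors.
p_fill <- ggplot(diamonds, aes(x = color, fill = cut)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
p_fill
```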

2. Now use ggplot2 to create a histogram of the carat variable. Try changing the bin width with the ‘binwidth’ argument of geom_histogram(), and color by cut. What are your observations?

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 5)

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 10)

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 15)

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 1)

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 0.01)

ggplot(diamonds, aes(x = carat, fill = cut)) + geom_histogram(binwidth = 0.1)
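The `binwidth` value is measured in the units of the x variable, here carats. Checking the range of `carat` first helps explain the plots above: the values run from 0.2 to 5.01, so bin widths of 5, 10, or 15 put nearly every diamond into one or two bars, while widths of 0.1 or less reveal the shape of the distribution. The sketch below prints the range and draws one more histogram; the width of 0.05 is an arbitrary choice for illustration, not a recommended setting.

```r
library(ggplot2)

range(diamonds$carat)  # smallest and largest carat value

# A bin width well below the spread of the data shows the shape
p_hist <- ggplot(diamonds, aes(x = carat, fill = cut)) +
  geom_histogram(binwidth = 0.05)
p_hist
```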

3. Get side-by-side boxplots of the carat variable, grouped by cut. Do the boxplots tell you anything that the histogram does not?

ggplot(diamonds, aes(x = cut, y = carat, fill = cut)) + geom_boxplot()
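The quantities a boxplot draws (quartiles and median) can also be printed as numbers, which makes it easier to compare cuts precisely. A minimal sketch using base R's `tapply()` and `summary()`; the name `carat_by_cut` is illustrative.

```r
library(ggplot2)

# Minimum, quartiles, median, mean, and maximum of carat within each cut
carat_by_cut <- tapply(diamonds$carat, diamonds$cut, summary)
carat_by_cut
```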

4. Use an R command to subset the diamonds that weigh over 2 carats. How many are there? Which cuts do they correspond to? Get side-by-side boxplots of this subset to get an idea of how the cuts are dispersed.

sum(diamonds$carat > 2)
## [1] 1889
heavyDiamonds <- subset(diamonds, carat > 2, select = c("carat", "cut"))
# fill must be mapped inside aes(); outside it, ggplot() silently ignores it
ggplot(heavyDiamonds, aes(x = cut, y = carat, fill = cut)) + geom_boxplot()

print("there are 1889 of them. they correspond to cuts as shown in the boxplot.")
## [1] "there are 1889 of them. they correspond to cuts as shown in the boxplot."
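The question of which cuts the heavy diamonds belong to can also be answered with counts rather than read off the plot. The sketch below rebuilds the subset so that it runs on its own and tabulates it by cut; `cut_counts` is an illustrative name.

```r
library(ggplot2)

heavyDiamonds <- subset(diamonds, carat > 2, select = c("carat", "cut"))

nrow(heavyDiamonds)                     # number of diamonds over 2 carats
cut_counts <- table(heavyDiamonds$cut)  # how many of them fall in each cut
cut_counts
```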
  1. Finally, replace the geom_boxplot() command with geom_violin(). What do you think is happening?
ggplot(heavyDiamonds, aes(x = cut, y = carat, fill = cut)) + geom_violin()

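print("the width of each violin shows how many diamonds have that weight. if a diamond is in this set, it is most likely to be just over 2 carats")
## [1] "the width of each violin shows how many diamonds have that weight. if a diamond is in this set, it is most likely to be just over 2 carats"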
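A violin plot draws a smoothed density of the values, while a boxplot draws the quartiles, so the two can be layered to see both at once. This is one possible way to combine them; the narrow box width of 0.1 and the white fill are arbitrary styling choices, and `p_violin` is an illustrative name.

```r
library(ggplot2)

heavyDiamonds <- subset(diamonds, carat > 2, select = c("carat", "cut"))

# Violin outlines show the density; the narrow boxplots inside show
# the median and quartiles of the same data.
p_violin <- ggplot(heavyDiamonds, aes(x = cut, y = carat, fill = cut)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white")
p_violin
```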