The following uses the diamonds dataset from the ggplot2 package.
library(ggplot2)
mydiamonds <- diamonds
You can insert a new “code chunk” by pressing Ctrl+Alt+I. For each question listed below, insert a new code chunk to answer that question. Knit your notebook to Word and submit the Word document on Blackboard.
# Knowing my data
summary(mydiamonds)
     carat               cut        color        clarity
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
 Max.   :5.0100                     I: 5422   VVS1   : 3655
                                    J: 2808   (Other): 2531
     depth            table            price             x
 Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
 1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
 Median :61.80   Median :57.00   Median : 2401   Median : 5.700
 Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
 3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
 Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
       y                z
 Min.   : 0.000   Min.   : 0.000
 1st Qu.: 4.720   1st Qu.: 2.910
 Median : 5.710   Median : 3.530
 Mean   : 5.735   Mean   : 3.539
 3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :58.900   Max.   :31.800
str(mydiamonds)
tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(mydiamonds)
tail(mydiamonds)
table(mydiamonds$cut)
     Fair      Good Very Good   Premium     Ideal
     1610      4906     12082     13791     21551
par(mfrow=c(1,1))
plot(mydiamonds$color)
ggplot(data = mydiamonds, aes(x = price, y = carat, color = cut)) + geom_point()
How many records are in the dataset?
nrow(mydiamonds)
[1] 53940
NROW(mydiamonds)
[1] 53940
What is the largest diamond by weight (carat)?
max(mydiamonds$carat)
[1] 5.01
What are the prices of the most and least expensive diamonds?
most_expensive <- max(mydiamonds$price)
print(most_expensive)
[1] 18823
least_expensive <- min(mydiamonds$price)
print(least_expensive)
[1] 326
Plot a bar chart of count of diamonds vs cut.
ggplot(data = mydiamonds, aes(x = cut)) + geom_bar()
Let’s explore the data a bit. What attributes does the most expensive diamond have? Change max(price) to min(price) to see the least expensive.
subset(mydiamonds, price == max(price))
subset(mydiamonds, price == min(price))
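As a complementary view, the rows can be sorted by price so that several diamonds at each extreme are visible at once, not only the single maximum or minimum. This is a minimal sketch that reads `diamonds` directly from ggplot2; the names `top5` and `bottom5` are illustrative and not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset

# Sort by price, highest first, and keep the five most expensive diamonds.
top5 <- head(diamonds[order(diamonds$price, decreasing = TRUE), ], 5)
print(top5)

# Same idea, lowest first, for the five least expensive diamonds.
bottom5 <- head(diamonds[order(diamonds$price), ], 5)
print(bottom5)
```

Looking at a handful of rows at each end makes it easier to judge whether the extreme diamonds share attributes such as cut, color, or clarity.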
Create a plot of carat vs price.
ggplot(data = mydiamonds, aes(x = carat, y = price)) + geom_point(color = "red", alpha = 0.8)
Does it look like carat and price have a linear relationship?
#Answer: Not exactly. The graphs show that price rises with carat, but the relationship is non-linear: price climbs faster than a straight line predicts, so it could be exponential. The variance of price also increases as carat size increases. Carat takes on apparent discrete values, which show up as vertical strips on the graph. Moreover, the linear trend line does not pass through the center of the data in several key places.
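One way to back up the visual impression with numbers is to fit a straight line on the original scale and on a log-log scale, then look at how the residual spread changes with carat. The following is a rough sketch, not a definitive model: the object names (`fit_linear`, `fit_loglog`, `carat_band`) and the band edges are assumptions chosen for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Straight-line fit on the original scale.
fit_linear <- lm(price ~ carat, data = diamonds)

# Straight-line fit after log-transforming both variables.
fit_loglog <- lm(log10(price) ~ log10(carat), data = diamonds)

# R-squared of each fit. The two values are on different response scales,
# so treat the comparison as a rough guide only.
r2 <- c(linear = summary(fit_linear)$r.squared,
        loglog = summary(fit_loglog)$r.squared)
print(round(r2, 3))

# Spread of the linear-fit residuals within three carat bands
# (the band edges 0, 1, 2, 6 are arbitrary).
carat_band <- cut(diamonds$carat, breaks = c(0, 1, 2, 6))
print(tapply(resid(fit_linear), carat_band, sd))
```

If the residual standard deviation grows from the lowest band to the highest, that supports the observation that the variance increases with carat size.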
# Scatter plot with a linear trend line. Both axes stop at the 99th percentile,
# so points beyond those limits are dropped before the line is fitted.
ggplot(data = mydiamonds, aes(x = carat, y = price)) +
  geom_point(fill = "#F79420", color = "red", shape = 21) +
  stat_smooth(method = "lm") +
  scale_x_continuous(limits = c(0, quantile(mydiamonds$carat, 0.99))) +
  scale_y_continuous(limits = c(0, quantile(mydiamonds$price, 0.99))) +
  ggtitle("Relationship Between Price and Carat") + xlab("CARAT") + ylab("PRICE")
ggplot(data = mydiamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.6) + stat_smooth(method = "lm", color = "blue")
ggplot(data = mydiamonds, aes(x = carat, y = price)) + geom_line() + geom_smooth(method = "lm", color = "green")
ggplot(data = mydiamonds, aes(x = carat, y = price)) + geom_point() + coord_fixed() + scale_x_log10() + scale_y_log10() + geom_smooth(method = "lm", color = "green") + xlab("Carat (log scale)") + ylab("Price (log scale)")
Create three other plots of other variables vs price. The point of exploratory analysis (know your data) is to do just that: explore. You might have to plot more than three to find variables that plot well. Keep in mind that scatter plots (or line plots) are for continuous variables, not categorical ones. See the ggplot2 intro for references. Try to pick three variables that you think have a strong influence on the price of the diamond, since the main point of this exercise is to build a model later on.
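Before choosing which variables to plot, it can help to check which columns are continuous and which are categorical, and to get a first numeric hint of how each relates to price. This is a minimal sketch under the assumption that Pearson correlation and per-level medians are acceptable first-pass summaries; the names `d`, `is_num`, `cors`, and `median_by_cut` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

d <- as.data.frame(diamonds)

# Numeric columns suit scatter plots; factor columns suit box plots or bar charts.
is_num <- sapply(d, is.numeric)
print(names(d)[is_num])   # continuous variables
print(names(d)[!is_num])  # categorical variables

# Pearson correlation of every numeric column with price, strongest first.
cors <- cor(d[, is_num])[, "price"]
print(sort(cors, decreasing = TRUE))

# For a categorical variable, compare the median price across its levels.
median_by_cut <- aggregate(price ~ cut, data = d, FUN = median)
print(median_by_cut)
```

Correlation only captures linear association, so a variable with a modest coefficient may still be worth plotting.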
ggplot(data = diamonds) + geom_histogram(binwidth = 500, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price (US$), binwidth 500") + ylab("Frequency") + theme_minimal()
# Price is continuous, so bin it with a histogram and fill the bars by cut.
ggplot(data = diamonds) + aes(x = price, fill = cut) + geom_histogram(binwidth = 500) + ylab("Frequency") + xlab("Diamond Price (US$)")
ggplot(data = mydiamonds) + aes(x = cut, y = price, color = cut) +
  geom_boxplot() + ggtitle("Relationship of Price and Cut") + xlab("Cut") + ylab("Price")
ggplot(data = mydiamonds) + aes(x = price) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~cut, scales = "free")