5, BONUS - Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
datasets <- read.csv(file="https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv", header=TRUE, sep=",")
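The summary output below reports counts per level for cut, color and clarity, which is what `summary()` prints when text columns are factors. Since R 4.0.0, `read.csv()` leaves text columns as character by default, so reproducing that output may need `stringsAsFactors = TRUE`. A minimal sketch on a small inline CSV (the `csv_text` rows are hypothetical stand-ins, not rows from the diamonds file):

```r
# Hypothetical three-column CSV used only to illustrate the arguments.
csv_text <- "carat,cut,price\n0.5,Ideal,1500\n1.0,Premium,5000\n"

# text= reads from a string instead of a file or URL; the other
# arguments are the same ones used for the GitHub link above.
toy <- read.csv(text = csv_text, header = TRUE, sep = ",",
                stringsAsFactors = TRUE)

# Inspect column types and dimensions before summarising.
str(toy)
print(dim(toy))
```

Running `str()` right after loading is a cheap way to confirm that the header was parsed and that each column has the expected type.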
1, Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
summary(datasets)
##        X             carat              cut         color
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304
##  Max.   :53940   Max.   :5.0100                     I: 5422
##                                                     J: 2808
##     clarity          depth           table           price
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823
##  (Other): 2531
##        x                y                z
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.700   Median : 5.710   Median : 3.530
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :10.740   Max.   :58.900   Max.   :31.800
##
mean(datasets$carat)
## [1] 0.7979397
mean(datasets$price)
## [1] 3932.8
median(datasets$carat)
## [1] 0.7
median(datasets$price)
## [1] 2401
The dataset has 53940 observations, each representing an individual diamond. The mean carat is 0.80 and the mean price is 3932.80; the medians are lower, at 0.70 carat and 2401. Prices range from 326 to 18823. (Larger numbers mean a higher price.)
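The quartiles in the summary also say something about spread and skew. The sketch below works only from the price values printed above; the 1.5 × IQR cutoff is an assumed rule of thumb for flagging high values, not something the assignment prescribes.

```r
# Price summary values copied from the summary(datasets) output above.
price_q1     <- 950
price_median <- 2401
price_mean   <- 3932.8
price_q3     <- 5324
price_max    <- 18823

# Interquartile range: the spread of the middle 50% of prices.
price_iqr <- price_q3 - price_q1

# Upper fence under the 1.5 * IQR rule of thumb (an assumption here).
upper_fence <- price_q3 + 1.5 * price_iqr

# A mean well above the median suggests a right-skewed distribution.
mean_over_median <- price_mean / price_median

cat("IQR:", price_iqr, "\n")
cat("Upper fence:", upper_fence, "\n")
cat("Max exceeds fence:", price_max > upper_fence, "\n")
cat("Mean / median:", round(mean_over_median, 2), "\n")
```

Because the maximum price lies far above the fence and the mean is well above the median, the price distribution looks right-skewed, with a long tail of expensive diamonds.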
2, Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense you could sum two columns together).
dfnew <- data.frame(subset(datasets, carat == 1))
Rename the columns of the new data frame
colnames(dfnew) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7","X8","X9","X10","X11")
Print a summary of the new data frame
summary(dfnew)
##        X1              X2            X3         X4          X5
##  Min.   :  285   Min.   :1   Fair     :166   D:199   SI1    :463
##  1st Qu.: 7277   1st Qu.:1   Good     :333   E:317   SI2    :409
##  Median :11136   Median :1   Ideal    :208   F:350   VS2    :335
##  Mean   :11812   Mean   :1   Premium  :462   G:312   VS1    :163
##  3rd Qu.:15451   3rd Qu.:1   Very Good:389   H:224   VVS2   :100
##  Max.   :53864   Max.   :1                   I: 99   VVS1   : 39
##                                              J: 57   (Other): 49
##        X6             X7              X8              X9
##  Min.   :43.0   Min.   :49.00   Min.   : 1681   Min.   :0.000
##  1st Qu.:60.8   1st Qu.:57.00   1st Qu.: 4155   1st Qu.:6.320
##  Median :62.2   Median :59.00   Median : 4864   Median :6.380
##  Mean   :62.0   Mean   :58.58   Mean   : 5242   Mean   :6.376
##  3rd Qu.:63.2   3rd Qu.:60.00   3rd Qu.: 6073   3rd Qu.:6.440
##  Max.   :70.2   Max.   :68.00   Max.   :16469   Max.   :6.770
##
##       X10             X11
##  Min.   :0.000   Min.   :0.000
##  1st Qu.:6.310   1st Qu.:3.900
##  Median :6.380   Median :3.960
##  Mean   :6.364   Mean   :3.947
##  3rd Qu.:6.440   3rd Qu.:4.000
##  Max.   :6.720   Max.   :4.380
##
Print the mean and median of the same two attributes (carat is now X2 and price is now X8)
mean(dfnew$X2)
## [1] 1
mean(dfnew$X8)
## [1] 5241.59
median(dfnew$X2)
## [1] 1
median(dfnew$X8)
## [1] 4864
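Generic names such as X1 to X11 make later code harder to read, so an alternative is to rename only the column that needs it and add a derived column. This is a minimal sketch on a toy data frame; the rows, the new name `id`, and the derived column `price_per_carat` are all hypothetical choices for illustration, not part of the original analysis.

```r
# Toy data frame standing in for the diamonds data (values are made up).
toy <- data.frame(
  X     = 1:3,
  carat = c(0.5, 1.0, 1.5),
  price = c(1500, 5000, 12000)
)

# Rename a single column by matching on its current name.
names(toy)[names(toy) == "X"] <- "id"

# Derived column: price divided by carat.
toy$price_per_carat <- toy$price / toy$carat

# Subset the rows at exactly one carat, as in the section above.
one_carat <- subset(toy, carat == 1)

print(toy)
print(one_carat)
```

Renaming by name rather than by position keeps the code working even if the column order changes.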
3, Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
>Scatter Plot of Diamond Carat vs. Price
plot(datasets$price, datasets$carat, main = "Diamond Carat vs Price")
>Histogram of Diamond Price
hist(datasets$price, breaks = 500, main = "Diamond Price in Study")
>Boxplots of Diamond Price
boxplot(datasets$price, main = "Diamond Price in Study")
library(ggplot2)
ggplot(datasets, aes(x = price, y = carat)) + geom_point(aes(color = color))
>Density plot of diamond carats
ggplot(datasets) + geom_density(aes(x = carat)) + ggtitle("Density plot of Diamond Carat")
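The histogram above passes `breaks = 500`. In base R a single number for `breaks` is only a suggestion, because `hist()` rounds the bin edges to "pretty" values. A minimal sketch of inspecting the bins without drawing anything, using explicit break points (the `toy_prices` values are hypothetical, not rows from the diamonds data):

```r
# Hypothetical prices for illustration only.
toy_prices <- c(400, 900, 2400, 3900, 5300, 18800)

# Explicit break points give full control over the bin edges.
# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(toy_prices, breaks = seq(0, 20000, by = 5000), plot = FALSE)

print(h$breaks)  # bin edges actually used
print(h$counts)  # number of values falling in each bin
```

Checking `h$counts` first can help pick a bin width that is neither too coarse nor so fine that most bins are empty.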
4, Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
Question: Do diamond prices go up as the carat increases?
summary(datasets)
##        X             carat              cut         color
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304
##  Max.   :53940   Max.   :5.0100                     I: 5422
##                                                     J: 2808
##     clarity          depth           table           price
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823
##  (Other): 2531
##        x                y                z
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.700   Median : 5.710   Median : 3.530
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :10.740   Max.   :58.900   Max.   :31.800
##
When the diamond carat = 1.00
dfnew <- data.frame(subset(datasets, carat == 1))
summary(dfnew)
##        X            carat           cut       color      clarity
##  Min.   :  285   Min.   :1   Fair     :166   D:199   SI1    :463
##  1st Qu.: 7277   1st Qu.:1   Good     :333   E:317   SI2    :409
##  Median :11136   Median :1   Ideal    :208   F:350   VS2    :335
##  Mean   :11812   Mean   :1   Premium  :462   G:312   VS1    :163
##  3rd Qu.:15451   3rd Qu.:1   Very Good:389   H:224   VVS2   :100
##  Max.   :53864   Max.   :1                   I: 99   VVS1   : 39
##                                              J: 57   (Other): 49
##      depth          table           price             x
##  Min.   :43.0   Min.   :49.00   Min.   : 1681   Min.   :0.000
##  1st Qu.:60.8   1st Qu.:57.00   1st Qu.: 4155   1st Qu.:6.320
##  Median :62.2   Median :59.00   Median : 4864   Median :6.380
##  Mean   :62.0   Mean   :58.58   Mean   : 5242   Mean   :6.376
##  3rd Qu.:63.2   3rd Qu.:60.00   3rd Qu.: 6073   3rd Qu.:6.440
##  Max.   :70.2   Max.   :68.00   Max.   :16469   Max.   :6.770
##
##        y               z
##  Min.   :0.000   Min.   :0.000
##  1st Qu.:6.310   1st Qu.:3.900
##  Median :6.380   Median :3.960
##  Mean   :6.364   Mean   :3.947
##  3rd Qu.:6.440   3rd Qu.:4.000
##  Max.   :6.720   Max.   :4.380
##
When the diamond carat = 0.90
dnew <- data.frame(subset(datasets, carat == 0.90))
summary(dnew)
##        X             carat            cut       color      clarity
##  Min.   :  113   Min.   :0.9   Fair     :104   D:203   SI1    :493
##  1st Qu.: 3938   1st Qu.:0.9   Good     :330   E:233   SI2    :379
##  Median : 5960   Median :0.9   Ideal    :215   F:270   VS2    :334
##  Mean   : 7666   Mean   :0.9   Premium  :314   G:293   VS1    :174
##  3rd Qu.: 8433   3rd Qu.:0.9   Very Good:522   H:265   VVS2   : 44
##  Max.   :53918   Max.   :0.9                   I:159   VVS1   : 30
##                                                J: 62   (Other): 31
##      depth           table           price            x
##  Min.   :55.90   Min.   :52.00   Min.   :1599   Min.   :5.740
##  1st Qu.:61.40   1st Qu.:57.00   1st Qu.:3447   1st Qu.:6.090
##  Median :62.40   Median :58.00   Median :3881   Median :6.140
##  Mean   :62.26   Mean   :58.42   Mean   :3939   Mean   :6.145
##  3rd Qu.:63.20   3rd Qu.:60.00   3rd Qu.:4334   3rd Qu.:6.190
##  Max.   :72.90   Max.   :68.00   Max.   :9182   Max.   :6.450
##
##        y               z
##  Min.   :5.670   Min.   :3.570
##  1st Qu.:6.100   1st Qu.:3.790
##  Median :6.150   Median :3.830
##  Mean   :6.151   Mean   :3.827
##  3rd Qu.:6.210   3rd Qu.:3.860
##  Max.   :6.420   Max.   :4.160
##
ggplot(dfnew, aes(x = carat, y = price)) + geom_boxplot(aes(color = color))
ggplot(dnew, aes(x = carat, y = price)) + geom_boxplot(aes(color = color))
Conclusion: Diamond prices go up as the carat increases, but the increase is not smooth. For example, the size difference between a 0.90 carat diamond and a 1.00 carat diamond is not easy to notice, yet 1.00 carat diamonds tend to cost much more than 0.90 carat diamonds: the mean price rises from 3939 to 5242 and the median from 3881 to 4864. So the bigger the carat, the higher the price tends to be.
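To put numbers on the jump between the two subsets, the sketch below compares the relative change in carat with the relative change in price, using only the mean and median prices printed in the two summaries above (3939 and 3881 at 0.90 carat, 5242 and 4864 at 1.00 carat). The variable names are illustrative.

```r
# Values copied from summary(dnew) (0.90 carat) and summary(dfnew) (1.00 carat).
mean_price_090   <- 3939
mean_price_100   <- 5242
median_price_090 <- 3881
median_price_100 <- 4864

# Relative increase in weight from 0.90 to 1.00 carat, in percent.
carat_increase_pct <- (1.00 - 0.90) / 0.90 * 100

# Relative increase in mean and median price, in percent.
mean_increase_pct   <- (mean_price_100 - mean_price_090) / mean_price_090 * 100
median_increase_pct <- (median_price_100 - median_price_090) / median_price_090 * 100

cat("Carat increase (%):", round(carat_increase_pct, 1), "\n")
cat("Mean price increase (%):", round(mean_increase_pct, 1), "\n")
cat("Median price increase (%):", round(median_increase_pct, 1), "\n")
```

A weight increase of about 11% goes with a mean price increase of roughly 33%, which is consistent with the uneven, step-like rise in price described in the conclusion.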
My RPubs link is http://rpubs.com/Zchen116/458021