5, BONUS - Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

datasets <- read.csv(file="https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv", header=TRUE, sep=",")
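A slightly more defensive version of this read, as a minimal sketch. The helper name `read_csv_source` is hypothetical, and the demo reads a temporary local file with made-up rows so that it runs offline; `read.csv` accepts a URL or a local path in the same argument. `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later default to `FALSE`, in which case `summary()` would not list level counts for `cut`, `color`, and `clarity`.

```r
# Hypothetical helper: read a CSV from a URL or a local path,
# stopping with a readable message if the source cannot be opened.
read_csv_source <- function(src) {
  tryCatch(
    read.csv(file = src, header = TRUE, sep = ",", stringsAsFactors = TRUE),
    error = function(e) {
      stop("Could not read CSV from ", src, ": ", conditionMessage(e))
    }
  )
}

# Offline demo with made-up rows; pass the raw GitHub URL to read the real file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price", "0.5,Ideal,1000", "1.0,Good,5000"), tmp)
demo_df <- read_csv_source(tmp)
str(demo_df)
```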

1,Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text

summary(datasets)
##        X             carat               cut        color    
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775  
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797  
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542  
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292  
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304  
##  Max.   :53940   Max.   :5.0100                     I: 5422  
##                                                     J: 2808  
##     clarity          depth           table           price      
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
##  (Other): 2531                                                  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 
mean(datasets$carat)
## [1] 0.7979397
mean(datasets$price)
## [1] 3932.8
median(datasets$carat)
## [1] 0.7
median(datasets$price)
## [1] 2401

The dataset has 53,940 observations, each representing an individual diamond. The mean carat is 0.80 and the mean price is 3932.80. Both medians are lower (0.70 carat and 2401), which suggests that carat and price are right-skewed. Price ranges from 326 to 18823.
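The same location and spread figures can be collected for one column in a single call. This is a minimal sketch: `describe_numeric` is a hypothetical helper, and the vector below is toy data for illustration only, not taken from the diamonds file.

```r
# Hypothetical helper: mean, median, quartiles and IQR of one numeric column.
describe_numeric <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(mean = mean(x), median = median(x), q1 = q[1], q3 = q[2], iqr = q[2] - q[1])
}

# Toy values for illustration only; with the real data, pass datasets$price.
toy <- c(1, 2, 3, 4, 10)
describe_numeric(toy)
```

A mean above the median, as in this toy vector, is the same pattern seen in the price column.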

2,Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)

dfnew <- data.frame(subset(datasets, carat == 1))
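The exact comparison `carat == 1` works here because the values are read directly from the file. If a column were produced by arithmetic, an exact `==` could miss rows that differ only by floating-point rounding. A minimal sketch with toy rows (made up for illustration) and an assumed tolerance of `1e-8`:

```r
# Toy rows for illustration only; 0.3 * 3 is stored as a value just below 0.9.
toy <- data.frame(carat = c(0.9, 0.3 * 3, 1.0), price = c(10, 20, 30))

exact_match <- subset(toy, carat == 0.9)             # exact equality
near_match  <- subset(toy, abs(carat - 0.9) < 1e-8)  # equality within a tolerance

nrow(exact_match)
nrow(near_match)
```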

Rename the columns

colnames(dfnew) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7","X8","X9","X10","X11")

Summarize the new data frame

summary(dfnew)
##        X1              X2            X3      X4            X5     
##  Min.   :  285   Min.   :1   Fair     :166   D:199   SI1    :463  
##  1st Qu.: 7277   1st Qu.:1   Good     :333   E:317   SI2    :409  
##  Median :11136   Median :1   Ideal    :208   F:350   VS2    :335  
##  Mean   :11812   Mean   :1   Premium  :462   G:312   VS1    :163  
##  3rd Qu.:15451   3rd Qu.:1   Very Good:389   H:224   VVS2   :100  
##  Max.   :53864   Max.   :1                   I: 99   VVS1   : 39  
##                                              J: 57   (Other): 49  
##        X6             X7              X8              X9       
##  Min.   :43.0   Min.   :49.00   Min.   : 1681   Min.   :0.000  
##  1st Qu.:60.8   1st Qu.:57.00   1st Qu.: 4155   1st Qu.:6.320  
##  Median :62.2   Median :59.00   Median : 4864   Median :6.380  
##  Mean   :62.0   Mean   :58.58   Mean   : 5242   Mean   :6.376  
##  3rd Qu.:63.2   3rd Qu.:60.00   3rd Qu.: 6073   3rd Qu.:6.440  
##  Max.   :70.2   Max.   :68.00   Max.   :16469   Max.   :6.770  
##                                                                
##       X10             X11       
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:6.310   1st Qu.:3.900  
##  Median :6.380   Median :3.960  
##  Mean   :6.364   Mean   :3.947  
##  3rd Qu.:6.440   3rd Qu.:4.000  
##  Max.   :6.720   Max.   :4.380  
## 

Print the mean and median for the same two attributes (X2 is carat, X8 is price)

mean(dfnew$X2)
## [1] 1
mean(dfnew$X8)
## [1] 5241.59
median(dfnew$X2)
## [1] 1
median(dfnew$X8)
## [1] 4864
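Renaming every column to `X1` through `X11` makes the later code harder to read, so an alternative is to rename only the column that needs it and to add a derived column. This is a sketch under assumptions: the toy rows are made up, and `id` and `price_per_carat` are names chosen here for illustration.

```r
# Toy rows for illustration only; the real data frame is `datasets`.
toy <- data.frame(
  X     = 1:3,
  carat = c(0.5, 1.0, 2.0),
  price = c(1000, 5000, 15000)
)

# Rename just the row-index column instead of overwriting all names.
names(toy)[names(toy) == "X"] <- "id"

# Derived column: price per carat.
toy$price_per_carat <- toy$price / toy$carat
toy
```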

3, Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2

>Scatter Plot of Diamond Carat vs. Price

plot(datasets$price, datasets$carat, main = "Diamond Carat vs Price")

>Histogram of Diamond Price

hist(datasets$price, breaks = 500, main = "Diamond Price in Study")

>Boxplots of Diamond Price

boxplot(datasets$price, main = "Diamond Price in Study")

library(ggplot2)
ggplot(datasets, aes(x = price, y = carat)) + geom_point(aes(color = color))

>Density plot of diamond carats

ggplot(datasets) + geom_density(aes(x = carat)) + ggtitle("Density plot of Diamond Carat")

4, Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end

Question: Do diamond prices go up as carat increases?

summary(datasets)
##        X             carat               cut        color    
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775  
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797  
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542  
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292  
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304  
##  Max.   :53940   Max.   :5.0100                     I: 5422  
##                                                     J: 2808  
##     clarity          depth           table           price      
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
##  (Other): 2531                                                  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 

When the diamond carat = 1.00

dfnew <- data.frame(subset(datasets, carat == 1))
summary(dfnew)
##        X             carat          cut      color      clarity   
##  Min.   :  285   Min.   :1   Fair     :166   D:199   SI1    :463  
##  1st Qu.: 7277   1st Qu.:1   Good     :333   E:317   SI2    :409  
##  Median :11136   Median :1   Ideal    :208   F:350   VS2    :335  
##  Mean   :11812   Mean   :1   Premium  :462   G:312   VS1    :163  
##  3rd Qu.:15451   3rd Qu.:1   Very Good:389   H:224   VVS2   :100  
##  Max.   :53864   Max.   :1                   I: 99   VVS1   : 39  
##                                              J: 57   (Other): 49  
##      depth          table           price             x        
##  Min.   :43.0   Min.   :49.00   Min.   : 1681   Min.   :0.000  
##  1st Qu.:60.8   1st Qu.:57.00   1st Qu.: 4155   1st Qu.:6.320  
##  Median :62.2   Median :59.00   Median : 4864   Median :6.380  
##  Mean   :62.0   Mean   :58.58   Mean   : 5242   Mean   :6.376  
##  3rd Qu.:63.2   3rd Qu.:60.00   3rd Qu.: 6073   3rd Qu.:6.440  
##  Max.   :70.2   Max.   :68.00   Max.   :16469   Max.   :6.770  
##                                                                
##        y               z        
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:6.310   1st Qu.:3.900  
##  Median :6.380   Median :3.960  
##  Mean   :6.364   Mean   :3.947  
##  3rd Qu.:6.440   3rd Qu.:4.000  
##  Max.   :6.720   Max.   :4.380  
## 

When the diamond carat = 0.90

dnew <- data.frame(subset(datasets, carat == 0.90))
summary(dnew)
##        X             carat            cut      color      clarity   
##  Min.   :  113   Min.   :0.9   Fair     :104   D:203   SI1    :493  
##  1st Qu.: 3938   1st Qu.:0.9   Good     :330   E:233   SI2    :379  
##  Median : 5960   Median :0.9   Ideal    :215   F:270   VS2    :334  
##  Mean   : 7666   Mean   :0.9   Premium  :314   G:293   VS1    :174  
##  3rd Qu.: 8433   3rd Qu.:0.9   Very Good:522   H:265   VVS2   : 44  
##  Max.   :53918   Max.   :0.9                   I:159   VVS1   : 30  
##                                                J: 62   (Other): 31  
##      depth           table           price            x        
##  Min.   :55.90   Min.   :52.00   Min.   :1599   Min.   :5.740  
##  1st Qu.:61.40   1st Qu.:57.00   1st Qu.:3447   1st Qu.:6.090  
##  Median :62.40   Median :58.00   Median :3881   Median :6.140  
##  Mean   :62.26   Mean   :58.42   Mean   :3939   Mean   :6.145  
##  3rd Qu.:63.20   3rd Qu.:60.00   3rd Qu.:4334   3rd Qu.:6.190  
##  Max.   :72.90   Max.   :68.00   Max.   :9182   Max.   :6.450  
##                                                                
##        y               z        
##  Min.   :5.670   Min.   :3.570  
##  1st Qu.:6.100   1st Qu.:3.790  
##  Median :6.150   Median :3.830  
##  Mean   :6.151   Mean   :3.827  
##  3rd Qu.:6.210   3rd Qu.:3.860  
##  Max.   :6.420   Max.   :4.160  
## 
ggplot(dfnew, aes(x = carat, y = price)) + geom_boxplot(aes(color = color))

ggplot(dnew, aes(x = carat, y = price)) + geom_boxplot(aes(color = color))

Conclusion: Diamond prices go up as carat increases, and the increase is not smooth. For example, the difference in size between a 0.90 carat diamond and a 1.00 carat diamond is not easy to notice, but the 1.00 carat diamonds tend to be priced much higher: their mean price is 5242 (median 4864), compared with 3939 (median 3881) for the 0.90 carat diamonds. So price rises with carat, and it rises sharply around the 1.00 carat mark.
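The comparison of two carat weights can be extended to every carat value at once. The sketch below is one possible approach, not the analysis used above: it takes the median price per carat with `aggregate` and a Spearman rank correlation with `cor`. The rows are toy values made up for illustration; with the real data, replace `toy` with `datasets`.

```r
# Toy rows for illustration only; with the real data, use `datasets` instead.
toy <- data.frame(
  carat = c(0.9, 0.9, 0.9, 1.0, 1.0, 1.0),
  price = c(3500, 3900, 4300, 4200, 4900, 6100)
)

# Median price at each carat weight.
by_carat <- aggregate(price ~ carat, data = toy, FUN = median)
by_carat

# Rank correlation between carat and price; a positive value means
# heavier diamonds tend to cost more.
rho <- cor(toy$carat, toy$price, method = "spearman")
rho
```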

My RPubs link is http://rpubs.com/Zchen116/458021