Diamond Analysis

What will you learn in this lab?

  • Working with packages
  • Installing and loading R packages
  • Practicing using functions to
    • view
    • clean
    • visualize

Diamonds Dataset Description

The diamonds dataset is a collection of data on the attributes of diamonds, commonly used for data analysis and machine learning tasks. It contains 53,940 diamonds described by 10 variables: carat, cut, color, clarity, depth, table, price, and the dimensions x, y, and z. The dataset is often used for tasks like regression, classification, and data visualization to explore relationships between different attributes of diamonds.

Step A: Using R packages

  • Packages are a key part of working with R.
  • We will be using a package called tidyverse.
  • The tidyverse package is actually a collection of individual packages that can help you perform a wide variety of analysis tasks.

Step B: Installing and loading R packages

Many of the tidyverse packages contain sample datasets that you can use to practice your R skills. The diamonds dataset in the ggplot2 package is a great example for previewing R functions.
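A minimal sketch of installing and loading the package, assuming a CRAN mirror is reachable. The requireNamespace() guard is an addition here so that the install runs only when tidyverse is missing. Loading the package prints a startup message like the one below.

```r
# Install tidyverse only if it is not already available (needs internet access)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the core tidyverse packages, including ggplot2 and dplyr
library(tidyverse)

# diamonds ships with ggplot2, so it is available once the package is attached
data(diamonds)
```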

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step C: Practicing using functions

1) Viewing data

Commonly used functions

These functions help you explore a dataset and understand its structure and contents: obtaining statistical summaries, displaying the first and last rows, and checking for missing values.

1. summary(): Provides a statistical summary of the dataset (minimum, quartiles, median, mean, and maximum for numeric columns; level counts for factor columns).
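A call along these lines would produce the summary shown next. It is a sketch that assumes diamonds is taken from ggplot2.

```r
library(ggplot2)  # provides the diamonds dataset

# Numeric columns get min, quartiles, median, mean, and max;
# factor columns get a count per level
summary(diamonds)
```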

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

2. head(): Displays the first few rows of the data (default is 6 rows) for quick inspection.
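As a short sketch, head() can be called with or without the n argument. The value n = 3 is only an illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# First 6 rows (the default)
head(diamonds)

# First 3 rows, using the n argument
head(diamonds, n = 3)
```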

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

3. tail(): Displays the last few rows of the data for inspection.
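The matching call for the end of the data might look like this. Like head(), tail() returns 6 rows unless n is given.

```r
library(ggplot2)  # provides the diamonds dataset

# Last 6 rows (the default)
tail(diamonds)
```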

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.72 Premium   D     SI1      62.7    59  2757  5.69  5.73  3.58
## 2  0.72 Ideal     D     SI1      60.8    57  2757  5.75  5.76  3.5 
## 3  0.72 Good      D     SI1      63.1    55  2757  5.69  5.75  3.61
## 4  0.7  Very Good D     SI1      62.8    60  2757  5.66  5.68  3.56
## 5  0.86 Premium   H     SI2      61      58  2757  6.15  6.12  3.74
## 6  0.75 Ideal     D     SI2      62.2    55  2757  5.83  5.87  3.64

4. dim(): Displays the dimensions of the data (number of rows and columns).
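A minimal sketch of checking the size of the data. nrow() and ncol() are included here as a convenience and return the two parts of dim() separately.

```r
library(ggplot2)  # provides the diamonds dataset

# Rows and columns as a length-2 vector
dim(diamonds)

# The same information, one value at a time
nrow(diamonds)
ncol(diamonds)
```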

## [1] 53940    10

5. str(): Displays the structure of the data, showing column types, and the number of rows and columns.
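The structure listing that follows can be obtained with a call like this one, again assuming diamonds comes from ggplot2.

```r
library(ggplot2)  # provides the diamonds dataset

# One line per column: name, type, and the first few values
str(diamonds)
```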

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

6. names(): Displays the column names of the data.
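For example, the column names could be listed like this:

```r
library(ggplot2)  # provides the diamonds dataset

# Character vector of column names
names(diamonds)
```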

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

7. colnames(): Displays the column names (same as names()).
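A quick sketch that also confirms the two functions agree for a data frame. The identical() check is an addition for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Column names via colnames()
colnames(diamonds)

# For a data frame, names() and colnames() return the same vector
identical(names(diamonds), colnames(diamonds))
```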

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

8. table(): Shows the count of each category in a categorical column.
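The counts shown next are for the cut column, so the call was presumably something like this sketch:

```r
library(ggplot2)  # provides the diamonds dataset

# Count how many diamonds fall into each cut category
table(diamonds$cut)
```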

## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551

9. summary() for individual columns: Provides a statistical summary for a specific column.
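The values that follow match the price column, so a sketch of the call would be:

```r
library(ggplot2)  # provides the diamonds dataset

# Summary of a single column, selected with $
summary(diamonds$price)
```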

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

10. anyNA() or is.na(): anyNA() checks whether the data contain any missing (NA) values; is.na() flags each value individually.
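A minimal sketch of both checks. Combining is.na() with colSums() to count missing values per column is an addition here, not something the lab prescribes.

```r
library(ggplot2)  # provides the diamonds dataset

# TRUE if at least one value anywhere in the data is NA
anyNA(diamonds)

# Number of NA values in each column
colSums(is.na(diamonds))
```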

## [1] FALSE

11. glimpse() (from dplyr): Provides a quick overview of the data, showing column names, types, and as many leading values of each column as fit on one line.
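A sketch of the call, loading dplyr explicitly because glimpse() is not part of base R:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse()

# Columns run down the page, values run across
glimpse(diamonds)
```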

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

2) Cleaning data

One of the most frequent tasks you will have to perform as an analyst is to clean and organize your data. R makes this easy! There are many functions you can use to help you perform important tasks easily and quickly.

Commonly used functions

1. na.omit(): Removes rows with missing (NA) values from the data.

2. complete.cases(): Identifies rows with complete cases (no missing values).

3. replace(): Replaces specific values in a column or dataset.

4. subset(): Selects a subset of data based on conditions.

5. droplevels(): Drops unused factor levels in a categorical column.

6. mutate() (from dplyr): Adds or modifies columns in the dataset.

7. filter() (from dplyr): Filters rows based on conditions.

8. arrange() (from dplyr): Sorts the dataset by one or more columns.

9. select() (from dplyr): Selects specific columns from the dataset.
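The functions listed above can be combined as in the following sketch. Treating a zero in x, y, or z as a missing measurement is an assumption made for illustration, and the names d, d_clean, top_cuts, cleaned, and price_per_carat are hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides mutate(), filter(), arrange(), select()

# Base R: mark zero dimensions as NA (illustrative assumption)
d <- diamonds
d$x <- replace(d$x, d$x == 0, NA)
d$y <- replace(d$y, d$y == 0, NA)
d$z <- replace(d$z, d$z == 0, NA)

n_complete <- sum(complete.cases(d))  # number of rows without any NA
d_clean <- na.omit(d)                 # keep only those rows

# Base R: keep two cut categories, then drop the factor levels that no longer occur
top_cuts <- droplevels(subset(d_clean, cut %in% c("Premium", "Ideal")))
levels(top_cuts$cut)

# dplyr: similar cleaning steps written as one pipeline
cleaned <- diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%               # drop rows with a zero dimension
  mutate(price_per_carat = price / carat) %>%   # add a derived column
  arrange(desc(price_per_carat)) %>%            # sort, highest first
  select(carat, cut, price, price_per_carat)    # keep only these columns
head(cleaned)
```

The base R and dplyr versions remove the same rows here; which style to use is mostly a matter of preference.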

Another handy function for summarizing your data is summarize(). You can use it to generate a wide range of summary statistics for your data. For example, if you wanted to know what the mean for carat was in this dataset, you could run the code in the chunk below:
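A sketch of that chunk, reconstructed from the mean_carat column name in the output. The use of the pipe is an assumption.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides summarize() and the pipe

# Collapse the whole dataset to a single row holding the mean carat
diamonds %>%
  summarize(mean_carat = mean(carat))
```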

## # A tibble: 1 × 1
##   mean_carat
##        <dbl>
## 1      0.798

3) Visualizing data

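To build a visualization with ggplot2, start with ggplot(), map variables to aesthetics with aes(), and add a geom layer such as geom_point().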
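As a minimal sketch, a scatter plot of price against carat could be built like this. The choice of variables, the transparency value, and the labels are assumptions for illustration; the lab does not say which plot was intended.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Map carat to the x axis and price to the y axis, then draw one point per diamond
p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +  # transparency helps with overlapping points
  labs(title = "Diamond price by carat", x = "Carat", y = "Price (USD)")

print(p)
```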

Documentation

You have been working in an R Markdown file, which allows you to put code and writing in the same place. Markdown is a lightweight markup language for adding formatting to text documents.

Activity Wrap-up

You have had a chance to explore more R tools that you can start using on your own. You learned how to install and load R packages; how to use functions for viewing, cleaning, and visualizing data; and how to use R Markdown to export you