Task 2 Data Wrangling, Renad Galal

Author

Renad Galal (Ray)

Published

October 10, 2024

Week 3 post session exercise

A few important notes to remember from chapter 4,5,6 R for Graduates book:

  1. to load tidyverse use: (its installed already) library(tidyverse)
  2. Pipe operator %>% used to pass the result of one function to another in a readable way.
  3. to view diamonds dataset: view (diamonds) ## Exercises for data analysis
  4. to display the structure or internal details: str(diamonds)
  5. if u need help u can access the help page by: (?diamonds)
  6. We can take a quick view of the variable names using: names(diamonds)
  7. functions: mutate, summarize, group and ungroup, filter, select, arrange

Exercises for Data analysis

  1. make sure to load tidy verse: library (tidyverse).

  2. now we type the codes given in the exercise manually and explain what they do:

    Problems:

    A

    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
    ✔ purrr     1.0.2     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    midwest %>% 
      group_by(state) %>%  # Group the dataset by 'state'.
      summarize(poptotalmean = mean(poptotal),  # Calculate the mean of 'poptotal' for each state.
                poptotalmed = median(poptotal),  # Calculate the median of 'poptotal' for each state.
                popmax = max(poptotal),  # Find the maximum 'poptotal' value for each state.
                popmin = min(poptotal),  # Find the minimum 'poptotal' value for each state.
                popdistinct = n_distinct(poptotal),  # Count the number of unique 'poptotal' values for each state.
                popfirst = first(poptotal),  # Get the first value of 'poptotal' in each state.
                popany = any(poptotal < 5000),  # Check if any 'poptotal' value is less than 5000 in each state.
                popany2 = any(poptotal > 2000000)) %>%  # Check if any 'poptotal' value is greater than 2,000,000.
      ungroup()  # Remove the grouping, returning a non-grouped dataset.
    # A tibble: 5 × 9
      state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
      <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
    1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
    2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
    3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
    4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
    5 WI          67941.      33528   959275   3890          72    15682 TRUE  
    # ℹ 1 more variable: popany2 <lgl>

    B

    midwest %>% 
      group_by(state) %>%  # Group the data by 'state'.
      summarize(num5k = sum(poptotal < 5000),  # Count how many 'poptotal' values are less than 5000 for each state.
                num2mil = sum(poptotal > 2000000),  # Count how many 'poptotal' values are greater than 2,000,000 for each state.
                numrows = n()) %>%  # Count the total number of rows (observations) in each state.
      ungroup()  # Remove the grouping, returning the dataset without state groups.
    # A tibble: 5 × 4
      state num5k num2mil numrows
      <chr> <int>   <int>   <int>
    1 IL        1       1     102
    2 IN        0       0      92
    3 MI        1       1      83
    4 OH        0       0      88
    5 WI        2       0      72

    C

    # part I
    midwest %>%                               # Start with the 'midwest' dataset and pipe it to the next function
      group_by(county) %>%                    # Group the data by the 'county' column, so the operations apply to each county separately
      summarize(x = n_distinct(state)) %>%    # Create a new column 'x' that counts the distinct number of states within each county
      arrange(desc(x)) %>%                    # Sort the result in descending order of 'x' (the distinct state count)
      ungroup()                               # Remove the grouping after summarizing to return an ungrouped data frame
    # A tibble: 320 × 2
       county         x
       <chr>      <int>
     1 CRAWFORD       5
     2 JACKSON        5
     3 MONROE         5
     4 ADAMS          4
     5 BROWN          4
     6 CLARK          4
     7 CLINTON        4
     8 JEFFERSON      4
     9 LAKE           4
    10 WASHINGTON     4
    # ℹ 310 more rows
    # part II
    midwest %>%                               # Start with the 'midwest' dataset
      group_by(county) %>%                    # Group the data by 'county'
      summarize(x = n()) %>%                  # Create a new column 'x' that counts the total number of rows in each county
      ungroup()                               # Remove the grouping after summarizing
    # A tibble: 320 × 2
       county        x
       <chr>     <int>
     1 ADAMS         4
     2 ALCONA        1
     3 ALEXANDER     1
     4 ALGER         1
     5 ALLEGAN       1
     6 ALLEN         2
     7 ALPENA        1
     8 ANTRIM        1
     9 ARENAC        1
    10 ASHLAND       2
    # ℹ 310 more rows
    # part III
    midwest %>%                               # Start with the 'midwest' dataset
      group_by(county) %>%                    # Group the data by 'county'
      summarize(x = n_distinct(county)) %>%   # Create a new column 'x' that counts the distinct number of 'county' values within each county (which will always be 1)
      ungroup()                               # Remove the grouping to return an ungrouped data frame
    # A tibble: 320 × 2
       county        x
       <chr>     <int>
     1 ADAMS         1
     2 ALCONA        1
     3 ALEXANDER     1
     4 ALGER         1
     5 ALLEGAN       1
     6 ALLEN         1
     7 ALPENA        1
     8 ANTRIM        1
     9 ARENAC        1
    10 ASHLAND       1
    # ℹ 310 more rows

    D

    diamonds %>%                                # Start with the 'diamonds' dataset
      group_by(clarity) %>%                     # Group the data by the 'clarity' column (each unique value of clarity will form a group)
      summarize(a = n_distinct(color),          # Create a new column 'a' that counts the number of distinct colors in each clarity group
                b = n_distinct(price),          # Create a new column 'b' that counts the number of distinct prices in each clarity group
                c = n()) %>%                    # Create a new column 'c' that counts the total number of diamonds in each clarity group (the total number of rows)
      ungroup()                                 # Remove the grouping to return an ungrouped data frame
    # A tibble: 8 × 4
      clarity     a     b     c
      <ord>   <int> <int> <int>
    1 I1          7   632   741
    2 SI2         7  4904  9194
    3 SI1         7  5380 13065
    4 VS2         7  5051 12258
    5 VS1         7  3926  8171
    6 VVS2        7  2409  5066
    7 VVS1        7  1623  3655
    8 IF          7   902  1790

    E

    # part I
    diamonds %>%                               # Start with the 'diamonds' dataset
      group_by(color, cut) %>%                 # Group the data by both 'color' and 'cut' columns (each unique combination of color and cut forms a group)
      summarize(m = mean(price),               # Create a new column 'm' that calculates the mean (average) price for each color-cut combination
                s = sd(price)) %>%             # Create a new column 's' that calculates the standard deviation of the price for each color-cut combination
      ungroup()                                # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'color'. You can override using the
    `.groups` argument.
    # A tibble: 35 × 4
       color cut           m     s
       <ord> <ord>     <dbl> <dbl>
     1 D     Fair      4291. 3286.
     2 D     Good      3405. 3175.
     3 D     Very Good 3470. 3524.
     4 D     Premium   3631. 3712.
     5 D     Ideal     2629. 3001.
     6 E     Fair      3682. 2977.
     7 E     Good      3424. 3331.
     8 E     Very Good 3215. 3408.
     9 E     Premium   3539. 3795.
    10 E     Ideal     2598. 2956.
    # ℹ 25 more rows
    # part II
    diamonds %>%                                # Start with the 'diamonds' dataset
      group_by(cut, color) %>%                  # Group the data by 'cut' and 'color' (note the order is now 'cut' first, then 'color')
      summarize(m = mean(price),                # Create a new column 'm' that calculates the mean (average) price for each cut-color combination
                s = sd(price)) %>%              # Create a new column 's' that calculates the standard deviation of the price for each cut-color combination
      ungroup()                                 # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'cut'. You can override using the `.groups`
    argument.
    # A tibble: 35 × 4
       cut   color     m     s
       <ord> <ord> <dbl> <dbl>
     1 Fair  D     4291. 3286.
     2 Fair  E     3682. 2977.
     3 Fair  F     3827. 3223.
     4 Fair  G     4239. 3610.
     5 Fair  H     5136. 3886.
     6 Fair  I     4685. 3730.
     7 Fair  J     4976. 4050.
     8 Good  D     3405. 3175.
     9 Good  E     3424. 3331.
    10 Good  F     3496. 3202.
    # ℹ 25 more rows
    # part III
    diamonds %>%                                   # Start with the 'diamonds' dataset
      group_by(cut, color, clarity) %>%            # Group the data by 'cut', 'color', and 'clarity' (each unique combination of these forms a group)
      summarize(m = mean(price),                   # Create a new column 'm' that calculates the mean (average) price for each cut-color-clarity combination
                s = sd(price),                     # Create a new column 's' that calculates the standard deviation of the price for each group
                msale = m * 0.80) %>%              # Create a new column 'msale' where the sale price is 80% of the mean price (i.e., a 20% discount)
      ungroup()                                    # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'cut', 'color'. You can override using the
    `.groups` argument.
    # A tibble: 276 × 6
       cut   color clarity     m     s msale
       <ord> <ord> <ord>   <dbl> <dbl> <dbl>
     1 Fair  D     I1      7383  5899. 5906.
     2 Fair  D     SI2     4355. 3260. 3484.
     3 Fair  D     SI1     4273. 3019. 3419.
     4 Fair  D     VS2     4513. 3383. 3610.
     5 Fair  D     VS1     2921. 2550. 2337.
     6 Fair  D     VVS2    3607  3629. 2886.
     7 Fair  D     VVS1    4473  5457. 3578.
     8 Fair  D     IF      1620.  525. 1296.
     9 Fair  E     I1      2095.  824. 1676.
    10 Fair  E     SI2     4172. 3055. 3338.
    # ℹ 266 more rows

    F

    diamonds %>%                                  # Start with the 'diamonds' dataset
      group_by(cut) %>%                           # Group the data by the 'cut' column (each unique value of cut forms a group)
      summarize(potato = mean(depth),             # Create a new column 'potato' that calculates the mean of the 'depth' for each cut group
                pizza = mean(price),              # Create a new column 'pizza' that calculates the mean of the 'price' for each cut group
                popcorn = median(y),              # Create a new column 'popcorn' that calculates the median of the 'y' dimension (diamond height) for each cut group
                pineapple = potato - pizza,       # Create a new column 'pineapple' that calculates the difference between the mean depth and mean price
                papaya = pineapple ^ 2,           # Create a new column 'papaya' that calculates the square of the 'pineapple' value (the squared difference between depth and price)
                peach = n()) %>%                  # Create a new column 'peach' that counts the total number of diamonds (rows) for each cut group
      ungroup()                                   # Remove the grouping to return an ungrouped data frame
    # A tibble: 5 × 7
      cut       potato pizza popcorn pineapple    papaya peach
      <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
    1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
    2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
    3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
    4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
    5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

    G

    # part I
    diamonds %>%                                    # Start with the 'diamonds' dataset
      group_by(color) %>%                           # Group the data by 'color' (each unique diamond color forms a group)
      summarize(m = mean(price)) %>%                # Create a new column 'm' that calculates the mean price for each color
      mutate(x1 = str_c("Diamond color ", color),   # Create a new column 'x1' by concatenating the string "Diamond color " with the diamond's color value
             x2 = 5) %>%                            # Create a new column 'x2' that assigns the value 5 to each row (same for every color group)
      ungroup()                                     # Remove the grouping to return an ungrouped data frame
    # A tibble: 7 × 4
      color     m x1                 x2
      <ord> <dbl> <chr>           <dbl>
    1 D     3170. Diamond color D     5
    2 E     3077. Diamond color E     5
    3 F     3725. Diamond color F     5
    4 G     3999. Diamond color G     5
    5 H     4487. Diamond color H     5
    6 I     5092. Diamond color I     5
    7 J     5324. Diamond color J     5
    # part II
    diamonds %>%                                    # Start with the 'diamonds' dataset
      group_by(color) %>%                           # Group the data by 'color'
      summarize(m = mean(price)) %>%                # Create a new column 'm' that calculates the mean price for each color
      ungroup() %>%                                 # Remove the grouping after summarizing, so the next operations are not grouped
      mutate(x1 = str_c("Diamond color ", color),   # Create a new column 'x1' by concatenating "Diamond color " with the color value
             x2 = 5)                                 # Create a new column 'x2' with a constant value of 5 for all rows
    # A tibble: 7 × 4
      color     m x1                 x2
      <ord> <dbl> <chr>           <dbl>
    1 D     3170. Diamond color D     5
    2 E     3077. Diamond color E     5
    3 F     3725. Diamond color F     5
    4 G     3999. Diamond color G     5
    5 H     4487. Diamond color H     5
    6 I     5092. Diamond color I     5
    7 J     5324. Diamond color J     5

    H

    # part I
    diamonds %>%                                     # Start with the 'diamonds' dataset
      group_by(color) %>%                            # Group the data by 'color' (each unique diamond color forms a group)
      mutate(x1 = price * 0.5) %>%                   # Create a new column 'x1' that calculates half the price of each diamond in the group
      summarize(m = mean(x1)) %>%                    # Create a new column 'm' that calculates the mean of the 'x1' values (mean half price) for each color group
      ungroup()                                      # Remove the grouping to return an ungrouped data frame
    # A tibble: 7 × 2
      color     m
      <ord> <dbl>
    1 D     1585.
    2 E     1538.
    3 F     1862.
    4 G     2000.
    5 H     2243.
    6 I     2546.
    7 J     2662.
    # part II
    diamonds %>%                                     # Start with the 'diamonds' dataset
      group_by(color) %>%                            # Group the data by 'color'
      mutate(x1 = price * 0.5) %>%                   # Create a new column 'x1' that calculates half the price of each diamond in the group
      ungroup() %>%                                  # Remove the grouping to return an ungrouped data frame
      summarize(m = mean(x1))                        # Calculate the mean of the 'x1' values (mean half price) across the entire dataset, since it is now ungrouped
    # A tibble: 1 × 1
          m
      <dbl>
    1 1966.
  3. after that we answer these questions:

    Q1: Why is grouping data necessary?

    Grouping allows performing aggregations on subsets of the data rather than on the entire dataset, and It enables us to analyze different segments of the data independently. It is also essential to calculate statistics that are specific to a subset of the data.

    Q2: Why is ungrouping data necessary?

    To return to a standard data frame structure for subsequent operations. Once ungrouped, you can freely manipulate the data frame without being constrained by the previous grouping.

    Q3: When should you ungroup data?

    1. {r}

    2. After Aggregation

    3. Before Operations that Shouldn’t Be Grouped

    4. Before Merging or Joining

    Q4: If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

    No, you do not need to use ungroup() if you have not previously used group_by().

  4. Extra exercise:

    view (diamonds)

    # Arrange diamonds by various criteria

    Lowest to highest price

    diamonds_low_high <- diamonds %>%
      arrange(price)

    Highest to lowest price

    diamonds_high_low <- diamonds %>%
      arrange(desc(price))

    Lowest price and cut

    diamonds_low_price_cut <- diamonds %>%
      arrange(price, cut)

    Highest price and cut

    diamonds_high_price_cut <- diamonds %>%
      arrange(desc(price), cut)

    Arrange by lowest to highest price and worst to best clarity

    # Arrange diamonds by lowest to highest price and worst to best clarity
    diamonds_sorted <- diamonds %>%
      arrange(price, clarity)
    
    # View the sorted dataset
    head(diamonds_sorted)
    # A tibble: 6 × 10
      carat cut       color clarity depth table price     x     y     z
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

    Create a new variable ‘salePrice’ reflecting a $250 discount

    # Create a new variable 'salePrice' reflecting a $250 discount
    diamonds_with_sale_price <- diamonds %>%
      mutate(salePrice = price - 250)
    
    # View the updated dataset
    head(diamonds_with_sale_price)
    # A tibble: 6 × 11
      carat cut       color clarity depth table price     x     y     z salePrice
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86

    Remove the x, y, and z variables from the diamonds dataset (hint: select()).

    # Load the necessary library
    library(tidyverse)
    
    # Remove the x, y, and z variables from the dataset
    diamonds_cleaned <- diamonds %>%
      select(-x, -y, -z)
    
    # View the updated dataset
    head(diamonds_cleaned)
    # A tibble: 6 × 7
      carat cut       color clarity depth table price
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
    1  0.23 Ideal     E     SI2      61.5    55   326
    2  0.21 Premium   E     SI1      59.8    61   326
    3  0.23 Good      E     VS1      56.9    65   327
    4  0.29 Premium   I     VS2      62.4    58   334
    5  0.31 Good      J     SI2      63.3    58   335
    6  0.24 Very Good J     VVS2     62.8    57   336

    Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

    # Load the necessary library
    library(tidyverse)
    
    # Group by cut and count the number of diamonds for each cut
    diamonds_by_cut <- diamonds %>%
      group_by(cut) %>%
      summarize(number_of_diamonds = n())
    
    # View the result
    print(diamonds_by_cut)
    # A tibble: 5 × 2
      cut       number_of_diamonds
      <ord>                  <int>
    1 Fair                    1610
    2 Good                    4906
    3 Very Good              12082
    4 Premium                13791
    5 Ideal                  21551

    Create a new column named totalNum that calculates the total number of diamonds.

    # Load the necessary library
    library(tidyverse)
    
    # Calculate the total number of diamonds and add it as a new column
    diamonds_with_totalNum <- diamonds %>%
      mutate(totalNum = n())
    
    # View the first few rows of the dataset with the new column
    head(diamonds_with_totalNum)
    # A tibble: 6 × 11
      carat cut       color clarity depth table price     x     y     z totalNum
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940

Exercises for Research methods

Good question:

What percentage of diamonds fall within a specific price range (e.g., $1,000 to $5,000)?

Bad question:

Are diamonds shiny?