A few important notes to remember from chapter 4,5,6 R for Graduates book:
to load tidyverse use: (its installed already) library(tidyverse)
Pipe operator %>% used to pass the result of one function to another in a readable way.
to view diamonds dataset: view (diamonds) ## Exercises for data analysis
to display the structure or internal details: str(diamonds)
if u need help u can access the help page by: (?diamonds)
We can take a quick view of the variable names using: names(diamonds)
functions: mutate, summarize, group and ungroup, filter, select, arrange
Exercises for Data analysis
make sure to load tidy verse: library (tidyverse).
now we type the codes given in the exercise manually and explain what they do:
Problems:
A
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
midwest %>%group_by(state) %>%# Group the dataset by 'state'.summarize(poptotalmean =mean(poptotal), # Calculate the mean of 'poptotal' for each state.poptotalmed =median(poptotal), # Calculate the median of 'poptotal' for each state.popmax =max(poptotal), # Find the maximum 'poptotal' value for each state.popmin =min(poptotal), # Find the minimum 'poptotal' value for each state.popdistinct =n_distinct(poptotal), # Count the number of unique 'poptotal' values for each state.popfirst =first(poptotal), # Get the first value of 'poptotal' in each state.popany =any(poptotal <5000), # Check if any 'poptotal' value is less than 5000 in each state.popany2 =any(poptotal >2000000)) %>%# Check if any 'poptotal' value is greater than 2,000,000.ungroup() # Remove the grouping, returning a non-grouped dataset.
midwest %>%group_by(state) %>%# Group the data by 'state'.summarize(num5k =sum(poptotal <5000), # Count how many 'poptotal' values are less than 5000 for each state.num2mil =sum(poptotal >2000000), # Count how many 'poptotal' values are greater than 2,000,000 for each state.numrows =n()) %>%# Count the total number of rows (observations) in each state.ungroup() # Remove the grouping, returning the dataset without state groups.
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
C
# part Imidwest %>%# Start with the 'midwest' dataset and pipe it to the next functiongroup_by(county) %>%# Group the data by the 'county' column, so the operations apply to each county separatelysummarize(x =n_distinct(state)) %>%# Create a new column 'x' that counts the distinct number of states within each countyarrange(desc(x)) %>%# Sort the result in descending order of 'x' (the distinct state count)ungroup() # Remove the grouping after summarizing to return an ungrouped data frame
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
# part IImidwest %>%# Start with the 'midwest' datasetgroup_by(county) %>%# Group the data by 'county'summarize(x =n()) %>%# Create a new column 'x' that counts the total number of rows in each countyungroup() # Remove the grouping after summarizing
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
# part IIImidwest %>%# Start with the 'midwest' datasetgroup_by(county) %>%# Group the data by 'county'summarize(x =n_distinct(county)) %>%# Create a new column 'x' that counts the distinct number of 'county' values within each county (which will always be 1)ungroup() # Remove the grouping to return an ungrouped data frame
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
D
diamonds %>%# Start with the 'diamonds' datasetgroup_by(clarity) %>%# Group the data by the 'clarity' column (each unique value of clarity will form a group)summarize(a =n_distinct(color), # Create a new column 'a' that counts the number of distinct colors in each clarity groupb =n_distinct(price), # Create a new column 'b' that counts the number of distinct prices in each clarity groupc =n()) %>%# Create a new column 'c' that counts the total number of diamonds in each clarity group (the total number of rows)ungroup() # Remove the grouping to return an ungrouped data frame
# part Idiamonds %>%# Start with the 'diamonds' datasetgroup_by(color, cut) %>%# Group the data by both 'color' and 'cut' columns (each unique combination of color and cut forms a group)summarize(m =mean(price), # Create a new column 'm' that calculates the mean (average) price for each color-cut combinations =sd(price)) %>%# Create a new column 's' that calculates the standard deviation of the price for each color-cut combinationungroup() # Remove the grouping to return an ungrouped data frame
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
# part IIdiamonds %>%# Start with the 'diamonds' datasetgroup_by(cut, color) %>%# Group the data by 'cut' and 'color' (note the order is now 'cut' first, then 'color')summarize(m =mean(price), # Create a new column 'm' that calculates the mean (average) price for each cut-color combinations =sd(price)) %>%# Create a new column 's' that calculates the standard deviation of the price for each cut-color combinationungroup() # Remove the grouping to return an ungrouped data frame
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
# part IIIdiamonds %>%# Start with the 'diamonds' datasetgroup_by(cut, color, clarity) %>%# Group the data by 'cut', 'color', and 'clarity' (each unique combination of these forms a group)summarize(m =mean(price), # Create a new column 'm' that calculates the mean (average) price for each cut-color-clarity combinations =sd(price), # Create a new column 's' that calculates the standard deviation of the price for each groupmsale = m *0.80) %>%# Create a new column 'msale' where the sale price is 80% of the mean price (i.e., a 20% discount)ungroup() # Remove the grouping to return an ungrouped data frame
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
F
diamonds %>%# Start with the 'diamonds' datasetgroup_by(cut) %>%# Group the data by the 'cut' column (each unique value of cut forms a group)summarize(potato =mean(depth), # Create a new column 'potato' that calculates the mean of the 'depth' for each cut grouppizza =mean(price), # Create a new column 'pizza' that calculates the mean of the 'price' for each cut grouppopcorn =median(y), # Create a new column 'popcorn' that calculates the median of the 'y' dimension (diamond height) for each cut grouppineapple = potato - pizza, # Create a new column 'pineapple' that calculates the difference between the mean depth and mean pricepapaya = pineapple ^2, # Create a new column 'papaya' that calculates the square of the 'pineapple' value (the squared difference between depth and price)peach =n()) %>%# Create a new column 'peach' that counts the total number of diamonds (rows) for each cut groupungroup() # Remove the grouping to return an ungrouped data frame
# part Idiamonds %>%# Start with the 'diamonds' datasetgroup_by(color) %>%# Group the data by 'color' (each unique diamond color forms a group)summarize(m =mean(price)) %>%# Create a new column 'm' that calculates the mean price for each colormutate(x1 =str_c("Diamond color ", color), # Create a new column 'x1' by concatenating the string "Diamond color " with the diamond's color valuex2 =5) %>%# Create a new column 'x2' that assigns the value 5 to each row (same for every color group)ungroup() # Remove the grouping to return an ungrouped data frame
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
# part IIdiamonds %>%# Start with the 'diamonds' datasetgroup_by(color) %>%# Group the data by 'color'summarize(m =mean(price)) %>%# Create a new column 'm' that calculates the mean price for each colorungroup() %>%# Remove the grouping after summarizing, so the next operations are not groupedmutate(x1 =str_c("Diamond color ", color), # Create a new column 'x1' by concatenating "Diamond color " with the color valuex2 =5) # Create a new column 'x2' with a constant value of 5 for all rows
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
H
# part Idiamonds %>%# Start with the 'diamonds' datasetgroup_by(color) %>%# Group the data by 'color' (each unique diamond color forms a group)mutate(x1 = price *0.5) %>%# Create a new column 'x1' that calculates half the price of each diamond in the groupsummarize(m =mean(x1)) %>%# Create a new column 'm' that calculates the mean of the 'x1' values (mean half price) for each color groupungroup() # Remove the grouping to return an ungrouped data frame
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
# part IIdiamonds %>%# Start with the 'diamonds' datasetgroup_by(color) %>%# Group the data by 'color'mutate(x1 = price *0.5) %>%# Create a new column 'x1' that calculates half the price of each diamond in the groupungroup() %>%# Remove the grouping to return an ungrouped data framesummarize(m =mean(x1)) # Calculate the mean of the 'x1' values (mean half price) across the entire dataset, since it is now ungrouped
# A tibble: 1 × 1
m
<dbl>
1 1966.
after that we answer these questions:
Q1: Why is grouping data necessary?
Grouping allows performing aggregations on subsets of the data rather than on the entire dataset, and It enables us to analyze different segments of the data independently. It is also essential to calculate statistics that are specific to a subset of the data.
Q2: Why is ungrouping data necessary?
To return to a standard data frame structure for subsequent operations. Once ungrouped, you can freely manipulate the data frame without being constrained by the previous grouping.
Q3: When should you ungroup data?
{r}
After Aggregation
Before Operations that Shouldn’t Be Grouped
Before Merging or Joining
Q4: If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No, you do not need to use ungroup() if you have not previously used group_by().
Arrange by lowest to highest price and worst to best clarity
# Arrange diamonds by lowest to highest price and worst to best claritydiamonds_sorted <- diamonds %>%arrange(price, clarity)# View the sorted datasethead(diamonds_sorted)
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Create a new variable ‘salePrice’ reflecting a $250 discount
# Create a new variable 'salePrice' reflecting a $250 discountdiamonds_with_sale_price <- diamonds %>%mutate(salePrice = price -250)# View the updated datasethead(diamonds_with_sale_price)
# A tibble: 6 × 11
carat cut color clarity depth table price x y z salePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
Remove the x, y, and z variables from the diamonds dataset (hint: select()).
# Load the necessary librarylibrary(tidyverse)# Remove the x, y, and z variables from the datasetdiamonds_cleaned <- diamonds %>%select(-x, -y, -z)# View the updated datasethead(diamonds_cleaned)
# A tibble: 6 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).
# Load the necessary librarylibrary(tidyverse)# Group by cut and count the number of diamonds for each cutdiamonds_by_cut <- diamonds %>%group_by(cut) %>%summarize(number_of_diamonds =n())# View the resultprint(diamonds_by_cut)
# A tibble: 5 × 2
cut number_of_diamonds
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
Create a new column named totalNum that calculates the total number of diamonds.
# Load the necessary librarylibrary(tidyverse)# Calculate the total number of diamonds and add it as a new columndiamonds_with_totalNum <- diamonds %>%mutate(totalNum =n())# View the first few rows of the dataset with the new columnhead(diamonds_with_totalNum)
# A tibble: 6 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
Exercises for Research methods
Good question:
What percentage of diamonds fall within a specific price range (e.g., $1,000 to $5,000)?