# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
## Problem C
# part I
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(state)) %>%
  arrange(desc(x)) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
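# part II
# How does n() differ from n_distinct()?
# n_distinct() counts the number of unique values of a variable within the current group,
# whereas n() counts the number of rows in the current group.
# When would they be the same? different?
# They match when every row in a group has a different value of the variable;
# n() is larger whenever a value repeats within the group.
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%
  ungroup()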
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
# part III
# hint:
# - How many distinctly different counties are there for each county? 1
# - Can there be more than 1 (county) county in each county? No.
# - What if we replace 'county' with 'state'? Every state gets x = 1, because a group
#   defined by state contains exactly one distinct state.
midwest %>%
  group_by(state) %>%
  summarize(x = n_distinct(state)) %>%
  ungroup()
# A tibble: 5 × 2
state x
<chr> <int>
1 IL 1
2 IN 1
3 MI 1
4 OH 1
5 WI 1
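The difference between n() and n_distinct() in Problem C can be seen on a much smaller table. The sketch below uses a made-up data frame (`toy` and its values are hypothetical, not taken from `midwest`) in which one county-state pair is repeated, so the two counts disagree for that county.

```r
library(dplyr)

# Hypothetical toy data: the ADAMS/IN pair appears twice on purpose
toy <- data.frame(
  county = c("ADAMS", "ADAMS", "ADAMS", "LAKE", "LAKE"),
  state  = c("IL", "IN", "IN", "IL", "OH")
)

counts <- toy %>%
  group_by(county) %>%
  summarize(
    rows   = n(),               # number of rows in each county group
    states = n_distinct(state)  # number of unique states in each county group
  ) %>%
  ungroup()

counts
```

For ADAMS, `rows` is 3 but `states` is 2 because IN repeats; for LAKE both are 2 because every row has a different state.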
## Problem D
diamonds %>%
  group_by(clarity) %>%
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  ungroup()
## Problem E
# part I
diamonds %>%
  group_by(color, cut) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
# part II
diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
# part III
# hint:
# - How good is the sale if the price of diamonds equaled msale?
# - e.g. The diamonds are x% off original price in msale.
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>%
  ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
## Problem G
# part I
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  mutate(x1 = str_c("Diamond color ", color),
         x2 = 5) %>%
  ungroup()
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
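# part II
# What does the first ungroup() do? Is it useful here? Why/why not?
# Why isn't there a closing ungroup() after the mutate()?
# The ungroup() after summarize() removes any grouping that is still attached to the data.
# With a single grouping variable, summarize() already drops that grouping, so here the
# ungroup() changes nothing and acts only as a safeguard. No closing ungroup() is needed
# after mutate() because the data are already ungrouped at that point and mutate() does
# not add groups.
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup() %>%
  mutate(x1 = str_c("Diamond color ", color),
         x2 = 5)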
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
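Whether an ungroup() is still needed after summarize() depends on how many grouping variables were used, which is also why Problem E prints the "has grouped output by" message. The sketch below checks this with `group_vars()` on a small made-up data frame (`toy` and its values are hypothetical); it assumes dplyr 1.0 or later, where the `.groups` argument is available.

```r
library(dplyr)

# Hypothetical toy data standing in for diamonds
toy <- data.frame(
  color = c("D", "D", "E", "E"),
  cut   = c("Fair", "Good", "Fair", "Good"),
  price = c(100, 200, 300, 400)
)

# One grouping variable: summarize() drops it, so the result is ungrouped
one_key <- toy %>%
  group_by(color) %>%
  summarize(m = mean(price))

# Two grouping variables: only the last one (cut) is dropped,
# so the result is still grouped by color
two_keys <- toy %>%
  group_by(color, cut) %>%
  summarize(m = mean(price))

# .groups = "drop" removes all grouping inside summarize() itself
dropped <- toy %>%
  group_by(color, cut) %>%
  summarize(m = mean(price), .groups = "drop")

group_vars(one_key)   # character(0)
group_vars(two_keys)  # "color"
group_vars(dropped)   # character(0)
```

Ending the pipeline with ungroup(), as the problems above do, gives the same ungrouped result as `.groups = "drop"` regardless of how many grouping variables were used.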
## Problem H
# part I
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1)) %>%
  ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
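# part II
# What's the difference between part I and II?
# Part I summarizes while the data are still grouped by color, so it returns one mean
# per color (7 rows). Part II ungroups before summarize(), so it returns a single
# overall mean (1 row).
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))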
# A tibble: 1 × 1
m
<dbl>
1 1966.
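The effect of where ungroup() sits in Problem H can be reproduced by hand on three rows. This is a minimal sketch with a made-up data frame (`toy` and its prices are hypothetical), small enough that both results can be checked mentally.

```r
library(dplyr)

# Hypothetical toy data: two D diamonds and one E diamond
toy <- data.frame(
  color = c("D", "D", "E"),
  price = c(100, 300, 500)
)

# Like part I: summarize while still grouped, one mean per color
by_color <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1)) %>%
  ungroup()

# Like part II: ungroup first, one mean over all rows
overall <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))

by_color  # D: 100, E: 250
overall   # 150
```

The overall mean (150) is not the average of the per-color means (175), because color D contributes two rows and color E only one.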
Why is grouping data necessary?
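Grouping is needed when a calculation should be done separately for each group: after group_by(), functions such as summarize() and mutate() operate within each group instead of on the whole data set.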
Why is ungrouping data necessary?
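Ungrouping data is essential because grouping persists until it is removed; if it is left in place, later steps silently keep operating per group, which can produce unexpected results.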
When should you ungroup data?
You should ungroup data as soon as the per-group calculations are finished, before any step that should apply to the whole data set.
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No. If the code does not contain group_by() and the input data are not already grouped, there are no groups to remove, so ungroup() at the end is unnecessary.
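The answers above can be checked with a short experiment. The sketch below uses a made-up data frame (`toy`, `share`, and the prices are hypothetical) to show how a grouping that is left in place changes a later mutate(), and that a pipeline without group_by() is never grouped in the first place.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  color = c("D", "D", "E"),
  price = c(100, 300, 600)
)

# Grouping left in place: sum(price) is computed within each color
still_grouped <- toy %>%
  group_by(color) %>%
  mutate(share = price / sum(price))

# Grouping removed first: sum(price) is computed over all rows
ungrouped <- toy %>%
  group_by(color) %>%
  ungroup() %>%
  mutate(share = price / sum(price))

# No group_by() at all: the result is not grouped, so ungroup() has nothing to do
plain <- toy %>%
  mutate(newVar = 1 + 2)

still_grouped$share    # 0.25 0.75 1.00
ungrouped$share        # 0.1 0.3 0.6
is_grouped_df(plain)   # FALSE
```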