ESC&R Workbook 2024

Author

Nick Park (N1343061)

FM Lecture on 8/10/24. Post-session exercises taken from:

Chapter 6.6.1 of “R for Graduate Students” by Y. Wendy Huynh (2019)

Start up stuff

For example:

  • setting the working directory,

  • installing packages (no need in this case as the tidyverse (Wickham et al. 2019) is already installed),

  • loading packages,

  • importing data (in this case the diamonds and midwest datasets that ship with ggplot2)

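getwd() # reports the current working directory (setwd() would change it)
library(tidyverse) # loads dplyr, ggplot2 and the other core tidyverse packages
library(ggplot2) # already attached by tidyverse, so harmless to repeat
data("diamonds") # example datasets that ship with ggplot2
data("midwest")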

Checking the data

It is always worth looking at the data before we start analysing it.

For example, we could use:

  • view(diamonds), but it’s a big data file so it might be better to look at a small section of it

  • the select() function to show only some of the columns

  • str(diamonds), which is really useful for looking at the structure

# Exploring the data
diamonds %>% 
  select(1:8)
# A tibble: 53,940 × 8
   carat cut       color clarity depth table price     x
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95
 2  0.21 Premium   E     SI1      59.8    61   326  3.89
 3  0.23 Good      E     VS1      56.9    65   327  4.05
 4  0.29 Premium   I     VS2      62.4    58   334  4.2 
 5  0.31 Good      J     SI2      63.3    58   335  4.34
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95
 8  0.26 Very Good H     SI1      61.9    55   337  4.07
 9  0.22 Fair      E     VS2      65.1    61   337  3.87
10  0.23 Very Good H     VS1      59.4    61   338  4   
# ℹ 53,930 more rows
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Exercises 6.6.1

Useful Tip!

It is a good idea to run a pipeline one line at a time to see how each step changes the output. When doing so, leave off the trailing pipe (%>%) or comma on the last line you run, otherwise R will wait for more input instead of running the code!
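
One way to follow this tip, sketched on the diamonds data, is to store each step in its own object and print it before adding the next. The object names step1 and step2 are illustrative only and do not come from the book.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Step 1: group only. The rows look the same, but the printed header reports the groups
step1 <- diamonds %>%
  group_by(cut)
step1

# Step 2: add the summary. One row per cut is returned
step2 <- step1 %>%
  summarise(mean_price = mean(price))
step2
```

Printing step1 before step2 makes it easy to see which line introduced a change.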

1. 1st problem on the Diamonds dataset

This problem shows how to group and ungroup a data frame using the group_by() and ungroup() functions; create and add an extra column using the mutate() function; choose which columns to display using the select() function; and list the values of a variable in ascending or descending order using the arrange() function.

diamonds %>% #utilises the diamonds dataset
  group_by(color, clarity) %>% #groups by color and clarity variables
  mutate(price200 = mean(price)) %>% #creates new variable (average price by groups)
  ungroup() %>% #data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>% #new variable, original price +$10
  select(cut, color, clarity, price, price200, random10) %>% # retain only these columns
  arrange(color) %>% #visualise data ordered by color
  group_by(cut) %>% #group data by cut
  mutate(dis = n_distinct(price), #number of distinct prices within each cut
         rowID = row_number()) %>% #numbers each row consecutively within each cut
  ungroup() #then final ungroup
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
Warning!

Always ungroup after grouping!

2. Problem 2a. on the Midwest dataset

This problem shows how to collapse all rows into a one-row summary using the summarise() function. It can also be used along with the group_by() function, as per below, to give one summary row per group.

midwest %>% 
  group_by(state) %>%
  summarise(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>% 
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem 2b. on the Midwest dataset

This problem shows how to put conditions inside the summarise() function to count how many rows in each group meet those conditions, as per below.

## Problem B
midwest %>% 
  group_by(state) %>% 
  summarise(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>% 
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
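
The counts above work because R treats TRUE as 1 and FALSE as 0 when a logical vector is summed. A minimal sketch with made-up figures (the vector pop is hypothetical and is not taken from midwest):

```r
# Hypothetical population totals for five counties
pop <- c(4200, 15000, 3100, 2500000, 48000)

pop < 5000        # logical vector: TRUE where the condition holds
sum(pop < 5000)   # number of TRUE values, i.e. counties under 5,000
mean(pop < 5000)  # proportion of counties under 5,000
```

The same idea explains any() in Problem 2a: it asks whether at least one value in the logical vector is TRUE.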

Problem 2c.1 on the Midwest dataset

This problem counts, for each county name, the number of distinct states that contain a county with that name. For example, CRAWFORD, JACKSON and MONROE each appear in all 5 states.

midwest %>% 
  group_by(county) %>% 
  summarise(x = n_distinct(state)) %>% 
  arrange(desc(x)) %>% 
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
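
To see what the count means for a single county name, it may help to list the rows behind it. A small sketch using CRAWFORD, the first name in the output above (the object name crawford is illustrative):

```r
library(dplyr)
library(ggplot2) # provides the midwest dataset

# Every row whose county name is CRAWFORD, with the state it belongs to
crawford <- midwest %>%
  filter(county == "CRAWFORD") %>%
  select(county, state)
crawford
```

Each row is a different state, which is why n_distinct(state) reports 5 for this name.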

Problem 2c.1.1 on the Midwest dataset

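This problem shows how n() differs from n_distinct(). n() counts the rows in each group, whereas n_distinct(state) counts the unique values of state in each group. In midwest the two agree because each county name appears only once per state; they would differ if a group contained repeated values.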

midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
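
A case where n() and n_distinct() give different answers can be sketched with a small made-up table. The tibble toy is hypothetical: it deliberately records the name ADAMS twice for OH, which does not happen in midwest.

```r
library(dplyr)

# Hypothetical data: "ADAMS" appears twice in OH and once in IL
toy <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALLEN"),
  state  = c("OH", "OH", "IL", "IN")
)

toy_summary <- toy %>%
  group_by(county) %>%
  summarise(rows   = n(),                # rows in the group
            states = n_distinct(state))  # unique state values in the group
toy_summary
```

For ADAMS the row count is larger than the number of distinct states, because OH is repeated within the group.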

Problem 3. Still on the Midwest dataset

This one shows that grouping by county and then counting distinct counties can only ever return 1 for each group! If we group by state instead, we get a more sensible response: the number of distinct counties in each state.

midwest %>%
  group_by(county) %>%
  summarise(x=n_distinct(county)) %>%
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows
midwest %>%
  group_by(state) %>%
  summarise(x=n_distinct(county)) %>%
  ungroup()
# A tibble: 5 × 2
  state     x
  <chr> <int>
1 IL      102
2 IN       92
3 MI       83
4 OH       88
5 WI       72

Problem 4. Looking at the Diamonds dataset

This one shows how to group by a specific variable using the group_by() function and then, within each group, count how many unique values there are for 2 other variables (using the n_distinct() function), as well as the total number of rows (using the n() function).

diamonds %>%
  group_by(clarity) %>% 
  summarise(a=n_distinct(color),
            b=n_distinct(price),
            c=n()) %>%
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem 5.1 Still on the Diamonds dataset

This one shows how you can use the summarise() function to show the mean and sd of price for every combination of 2 grouping variables.

diamonds %>% 
  group_by(color, cut) %>% 
  summarise(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Problem 5.2 What happens if you put the 2 variables in a different order?

If you reverse the order of cut and color in group_by(), the values are the same but the rows and the first two columns are presented in a different order.

diamonds %>% 
  group_by(cut, color) %>% 
  summarize(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows
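
One way to check that only the presentation changes is to put both results into the same order and compare them. This is a sketch rather than part of the book exercise: the object names are illustrative, and .groups = "drop" is an argument of summarise() that returns an ungrouped result, used here in place of a trailing ungroup().

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

by_color_cut <- diamonds %>%
  group_by(color, cut) %>%
  summarise(m = mean(price), s = sd(price), .groups = "drop")

by_cut_color <- diamonds %>%
  group_by(cut, color) %>%
  summarise(m = mean(price), s = sd(price), .groups = "drop")

# Put the first result into the row and column order of the second
reordered <- by_color_cut %>%
  select(cut, color, m, s) %>%
  arrange(cut, color)

# Compare the summary columns
all.equal(reordered$m, by_cut_color$m)
all.equal(reordered$s, by_cut_color$s)
```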

Problem 5.3 Still in diamonds

This problem shows a practical application: alongside the mean price of each group, we can see what that mean becomes after a 20% discount.

diamonds %>% 
  group_by(cut, color, clarity) %>% 
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>% 
  ungroup()
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem 6. Still in Diamonds

This shows how you can call the output variables whatever you want!

diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>% 
  ungroup()
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem 7.1 Still in Diamonds….

What is the difference between 7.1 and 7.2?

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  summarize(m = mean(x1)) %>% 
  ungroup()
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Problem 7.2…..

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>%  
  summarize(m = mean(x1))
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

The top one (7.1) summarises while the data are still grouped, so it shows the mean of the halved prices (price multiplied by 0.5) for each colour, i.e. one value per colour. The bottom one (7.2) ungroups before summarising, so it shows a single mean of all the diamonds’ halved prices regardless of colour.
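
The two results are linked: weighting each colour's mean by the number of diamonds of that colour should give back the single overall figure. A minimal sketch (the names per_color and overall are illustrative):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# One mean per colour, plus the number of diamonds behind each mean
per_color <- diamonds %>%
  mutate(x1 = price * 0.5) %>%
  group_by(color) %>%
  summarise(m = mean(x1), n = n())

# One mean across all diamonds
overall <- diamonds %>%
  summarise(m = mean(price * 0.5))

# Weighting each colour's mean by its row count recovers the overall mean
weighted.mean(per_color$m, per_color$n)
overall$m
```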

Further questions:

  • Why is grouping the data necessary? This allows rows that share the same values of specific variables to be brought together for future operations. For example, grouping a dataset by age and sex might be useful if we were looking to see what effect age and sex had on a particular test result. It is then possible to conduct further manipulation and analysis of the grouped data.

  • Why is ungrouping data necessary? If you don’t ungroup once you’ve finished your calculations, the grouping stays attached to the data, so any further functions such as mutate() or summarise() will still work group by group. This rarely raises an error; more often it quietly gives results you did not intend.

  • When should you ungroup data? At the end of your grouped calculation, before moving on to anything that should use the whole dataset

  • If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? No
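
The effect of a forgotten ungroup() can be sketched on diamonds. The object and column names below (grouped, forgot, fixed, cut_mean, overall_mean) are illustrative only.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# A grouped mutate() with no ungroup() afterwards: the result stays grouped by cut
grouped <- diamonds %>%
  group_by(cut) %>%
  mutate(cut_mean = mean(price))

# Intended as the mean of all prices, but it is still computed within each cut
forgot <- grouped %>%
  mutate(overall_mean = mean(price))

# Ungrouping first gives the single overall mean
fixed <- grouped %>%
  ungroup() %>%
  mutate(overall_mean = mean(price))

n_distinct(forgot$overall_mean)  # one value per cut
n_distinct(fixed$overall_mean)   # a single value
```

No error or warning is raised in the forgot case, which is why the habit of ungrouping is worth keeping.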

Further practice from the same book section 6.7

Q1. View all of the variable names in diamonds. An easy one to get started! Could also use the head(), tail(), slice_head() or slice_tail() functions, as below:

view(diamonds) #this brings up the table in a separate tab at the top
head(diamonds) #brings up the first 6 rows
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
diamonds %>%
  slice_head(n=5) # to bring up the first 5 rows
# A tibble: 5 × 10
  carat cut     color clarity depth table price     x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
diamonds %>%
  slice_tail(n=5) #shows the last 5 rows
# A tibble: 5 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.72 Ideal     D     SI1      60.8    57  2757  5.75  5.76  3.5 
2  0.72 Good      D     SI1      63.1    55  2757  5.69  5.75  3.61
3  0.7  Very Good D     SI1      62.8    60  2757  5.66  5.68  3.56
4  0.86 Premium   H     SI2      61      58  2757  6.15  6.12  3.74
5  0.75 Ideal     D     SI2      62.2    55  2757  5.83  5.87  3.64
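
If only the variable names are wanted, rather than the rows, two further options can be sketched as follows: names() from base R and glimpse() from dplyr.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

names(diamonds)    # character vector of the column names
glimpse(diamonds)  # names, types and the first few values, one column per line
```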

Q2. Arrange the diamonds by:

  • Lowest to highest price (hint: arrange())

  • Highest to lowest price (hint: arrange() with desc())

  • Lowest price and cut

  • Highest price and cut

diamonds %>%
  arrange(price) #lowest to highest
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
diamonds %>% 
  arrange(desc(price)) #highest to lowest
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, price) # by lowest price and cut
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, desc(price)) #by highest price and cut
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.01 Fair  G     SI1      70.6    64 18574  7.43  6.64  4.69
 2  2.02 Fair  H     VS2      64.5    57 18565  8     7.95  5.14
 3  4.5  Fair  J     I1       65.8    58 18531 10.2  10.2   6.72
 4  2    Fair  G     VS2      67.6    58 18515  7.65  7.61  5.16
 5  2.51 Fair  H     SI2      64.7    57 18308  8.44  8.5   5.48
 6  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94  5.9 
 7  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94  5.9 
 8  2.32 Fair  H     SI1      62      62 18026  8.47  8.31  5.2 
 9  5.01 Fair  J     I1       65.5    59 18018 10.7  10.5   6.98
10  1.93 Fair  F     VS1      58.9    62 17995  8.17  7.97  4.75
# ℹ 53,930 more rows

Q3. Arrange the diamonds by lowest to highest price and worst to best clarity

diamonds %>%
  arrange(price, clarity)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
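
Sorting by clarity runs from worst to best because clarity is stored as an ordered factor, and arrange() follows the order of its levels. A quick check, as a sketch:

```r
library(ggplot2) # provides the diamonds dataset

is.ordered(diamonds$clarity)  # checks that the factor has a defined order
levels(diamonds$clarity)      # levels listed from worst (I1) to best (IF)
```

Note that arrange(price, clarity) sorts by price first and uses clarity only to break ties between diamonds of equal price; arrange(clarity, price) would sort by clarity first.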

Q4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate())

diamonds %>%
  mutate(salePrice = price - 250)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z salePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

Q5. Remove the x, y, and z variables from the diamonds dataset (hint: select())

diamonds %>%
  select(1:7)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows
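
Selecting by position works here because x, y and z happen to be the last three columns. An alternative, sketched below, drops them by name so the code does not depend on column order (the object names are illustrative):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

by_position <- diamonds %>% select(1:7)          # keep the first seven columns
by_name     <- diamonds %>% select(-c(x, y, z))  # drop the three columns by name

identical(names(by_position), names(by_name))
```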

Q6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

diamonds %>%
  group_by(cut) %>%
  summarise(number=n()) %>%
  ungroup()
# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551
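
dplyr also offers count() as a shortcut for this group, count and ungroup pattern. A sketch, assuming the same column name as above (the object name cut_counts is illustrative):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# count() groups by cut, counts the rows, and returns an ungrouped result
cut_counts <- diamonds %>%
  count(cut, name = "number")
cut_counts
```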

Q7. Create a new column named totalNum that calculates the total number of diamonds.

diamonds %>%
  mutate(total_diamonds =n()) # see the 11th column on the far right (the exercise asks for the name totalNum)
# A tibble: 53,940 × 11
   carat cut    color clarity depth table price     x     y     z total_diamonds
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>          <int>
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43          53940
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31          53940
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31          53940
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63          53940
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75          53940
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48          53940
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47          53940
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53          53940
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49          53940
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39          53940
# ℹ 53,930 more rows
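
The same step with the column name the exercise asks for can be sketched as follows (the object name with_total is illustrative):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

with_total <- diamonds %>%
  mutate(totalNum = n())  # on ungrouped data, n() is the total number of rows

nrow(diamonds)            # the same figure, without adding a column
```

If the data were still grouped at this point, n() would return the size of each group instead of the overall total.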

Research Methods subsection

Exercise: Generate a good question and a bad question about the diamonds data set you’ve just explored. Try to use the principles discussed to simulate both types of questions

diamonds

What makes a good question?

Good questions should be clear and interesting; relevant and meaningful; feasible and manageable in terms of available resources; have measurable variables; not answerable with yes/no; often contain what/how rather than is/are/do/does; be novel and push the boundaries of knowledge forwards; tend to be specific rather than vague; aim to discover/explain/explore; be ethical; be systematically constructed

Examples of good questions might include:

  • why does a diamond have differing clarities/depths of colours/cuts?

  • what makes a diamond attractive to buyers?

  • how does colour affect the price?

  • which is the most important determinant of price?

Examples of bad questions might include:

  • is there a difference in price between diamonds of different cuts/clarity/colours etc?

  • are poor clarity diamonds cheaper?

  • does my wife want another diamond? This is a terrible question!!

References

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.