getwd()
library(tidyverse)
library(ggplot2)
data("diamonds")
data("midwest")
ESC&R Workbook 2024
FM lecture on 8/10/24. Post-session exercises taken from:
Chapter 6.6.1 of “R for Graduate Students” by Y. Wendy Huynh (2019)
Start up stuff
For example:
setting the working directory,
installing packages (no need in this case, as the tidyverse (Wickham et al. 2019) is already installed),
loading packages,
importing data (in this case the diamonds and midwest datasets that come with ggplot2)
Checking the data
It is always worth looking at the data before we start analysing it.
For example, we could use:
view(diamonds), but it is a big data file, so it may be better to look at a small section of it
using the select() function.
It is also really useful to look at the structure using str(diamonds).
# Exploring the data
diamonds %>%
  select(1:8)
# A tibble: 53,940 × 8
carat cut color clarity depth table price x
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95
2 0.21 Premium E SI1 59.8 61 326 3.89
3 0.23 Good E VS1 56.9 65 327 4.05
4 0.29 Premium I VS2 62.4 58 334 4.2
5 0.31 Good J SI2 63.3 58 335 4.34
6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 0.24 Very Good I VVS1 62.3 57 336 3.95
8 0.26 Very Good H SI1 61.9 55 337 4.07
9 0.22 Fair E VS2 65.1 61 337 3.87
10 0.23 Very Good H VS1 59.4 61 338 4
# ℹ 53,930 more rows
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Exercises 6.6.1
It is a good idea to execute a pipeline one step at a time to see how each line changes the output. When running only part of a pipeline, remember not to leave a pipe (%>%) or a comma at the end of the last line you run: the command is then incomplete and won’t run!
1. 1st problem on the Diamonds dataset
This problem shows how to group and ungroup the rows of a dataframe using the group_by() and ungroup() functions; create and add an extra column using the mutate() function; choose which columns to display using the select() function; and sort rows by the values of a variable in ascending or descending order using the arrange() function.
diamonds %>%                                  # start from the diamonds dataset
  group_by(color, clarity) %>%                # group by the color and clarity variables
  mutate(price200 = mean(price)) %>%          # new variable: mean price within each color/clarity group
  ungroup() %>%                               # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%           # new variable: original price + $10
  select(cut, color, clarity, price, price200, random10) %>% # retain only these columns
  arrange(color) %>%                          # sort the rows by color
  group_by(cut) %>%                           # group data by cut
  mutate(dis = n_distinct(price),             # number of distinct prices within each cut
         rowID = row_number()) %>%            # numbers each row consecutively within each cut
  ungroup()                                   # then final ungroup
# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
Always Ungroup after grouping!
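To see why this matters, here is a minimal sketch on a made-up four-row table (the toy data and the column names group_mean and overall_mean are my own, not from the book). It suggests that a leftover grouping silently changes what a later mutate() calculates.

```r
library(dplyr)

# Hypothetical toy data: two groups with very different price levels
toy <- tibble(
  grp   = c("a", "a", "b", "b"),
  price = c(10, 20, 100, 200)
)

# Grouping is left in place after the first mutate()
still_grouped <- toy %>%
  group_by(grp) %>%
  mutate(group_mean = mean(price)) %>%
  mutate(overall_mean = mean(price))   # still calculated within each group

# Grouping is removed before the second mutate()
ungrouped <- toy %>%
  group_by(grp) %>%
  mutate(group_mean = mean(price)) %>%
  ungroup() %>%
  mutate(overall_mean = mean(price))   # calculated across all four rows

still_grouped$overall_mean   # 15 15 150 150
ungrouped$overall_mean       # 82.5 82.5 82.5 82.5
```

No error or warning appears in the first version, which is why the habit of ungrouping is worth keeping.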
2. Problem 2a. on the Midwest dataset
This problem shows how to collapse many rows into a summary using the summarise() function. Used together with the group_by() function, as below, it returns one summary row per group.
midwest %>%
  group_by(state) %>%
  summarise(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>%
  ungroup()
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem 2b. on the Midwest dataset
This problem shows how to count the rows that meet a condition inside summarise(). Summing a logical test such as poptotal < 5000 counts the rows in each group for which the test is TRUE, and n() gives the total number of rows in each group, as per below.
## Problem B
midwest %>%
  group_by(state) %>%
  summarise(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>%
  ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
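The sum() trick works because R treats TRUE as 1 and FALSE as 0. A small sketch on invented numbers (the states "X" and "Y" and the share5k column are made up for illustration) shows the same idea, plus mean() of a logical test to get a proportion:

```r
library(dplyr)

# Hypothetical toy data, not taken from midwest
pops <- tibble(
  state    = c("X", "X", "X", "Y", "Y"),
  poptotal = c(1200, 8000, 4900, 7000, 300000)
)

small_counts <- pops %>%
  group_by(state) %>%
  summarise(
    num5k   = sum(poptotal < 5000),    # TRUE counts as 1, FALSE as 0
    share5k = mean(poptotal < 5000),   # proportion of rows below 5000
    numrows = n()                      # total rows in the group
  )

small_counts
```

State X should show 2 of its 3 rows below 5000, and state Y none of its 2.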
Problem 2c.1 on the Midwest dataset
This problem counts, for each county name, how many different states contain a county with that name. For example, there is a county called CRAWFORD in all five states.
midwest %>%
  group_by(county) %>%
  summarise(x = n_distinct(state)) %>%
  arrange(desc(x)) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Problem 2c.1.1 on the Midwest dataset
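This problem shows how n() differs from n_distinct(). n() counts the rows in each group, whereas n_distinct(state) counts the unique values of state in each group. Here the two agree (for example, ADAMS gives 4 either way) because each county name appears only once per state; they would differ if a group contained repeated values.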
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
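A small sketch with an invented table (the duplicated ADAMS/IL row is deliberate and not something found in midwest) shows a case where n() and n_distinct() give different answers:

```r
library(dplyr)

# Hypothetical toy data: ADAMS/IL appears twice on purpose
places <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "BROWN"),
  state  = c("IL", "IL", "WI", "OH")
)

counts <- places %>%
  group_by(county) %>%
  summarise(
    rows   = n(),               # number of rows in the group
    states = n_distinct(state)  # number of unique states in the group
  )

counts
```

ADAMS should have 3 rows but only 2 distinct states, while BROWN has 1 of each.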
Problem 3. Still on the Midwest dataset
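This one shows that, once the data are grouped by county, there can’t be more than one distinct county in each group, so every count is 1! If we group by state instead of county, we get a more sensible response: the number of distinct counties in each state.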
midwest %>%
  group_by(county) %>%
  summarise(x = n_distinct(county)) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
midwest %>%
  group_by(state) %>%
  summarise(x = n_distinct(county)) %>%
  ungroup()
# A tibble: 5 × 2
state x
<chr> <int>
1 IL 102
2 IN 92
3 MI 83
4 OH 88
5 WI 72
Problem 4. Looking at the Diamonds dataset
This one shows how to group by a specific variable using the group_by() function and then identify, within each group, how many unique values there are for 2 other variables (using the n_distinct() function), as well as the total number of rows (using the n() function).
diamonds %>%
  group_by(clarity) %>%
  summarise(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem 5.1 Still on the Diamonds dataset
This one shows how you can use the summarise() function to get the mean and sd of price for each combination of 2 grouping variables.
diamonds %>%
  group_by(color, cut) %>%
  summarise(m = mean(price),
            s = sd(price)) %>%
  ungroup()
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
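One thing worth checking when grouping by 2 variables: as far as I understand it, summarise() only removes the last grouping variable by default, so the result is still grouped by the first one until ungroup() is called. The sketch below uses a made-up four-row table and the .groups argument of summarise() (available in dplyr 1.0.0 and later) as an alternative to a trailing ungroup().

```r
library(dplyr)

# Hypothetical toy data, not taken from diamonds
sales <- tibble(
  color = c("D", "D", "E", "E"),
  cut   = c("Fair", "Good", "Fair", "Good"),
  price = c(100, 200, 300, 400)
)

# Default behaviour: the last grouping variable (cut) is dropped,
# but the result stays grouped by color
default_groups <- sales %>%
  group_by(color, cut) %>%
  summarise(m = mean(price))

# Asking summarise() to drop all grouping itself
dropped <- sales %>%
  group_by(color, cut) %>%
  summarise(m = mean(price), .groups = "drop")

group_vars(default_groups)   # "color"
group_vars(dropped)          # character(0)
```

Either approach (a final ungroup() or .groups = "drop") should leave an ungrouped table; the workbook sticks with ungroup().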
Problem 5.2 This one shows what happens if you put the 2 grouping variables in a different order!
If you reverse the order of cut and color in group_by(), the values are the same but the rows are presented in a different order.
diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Problem 5.3 Still in diamonds
This problem shows a practical application: we can see what the mean price would be after a 20% discount (msale) alongside the original mean price (m).
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>%
  ungroup()
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
Problem 6. Still in Diamonds
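This shows how you can call the output variables whatever you want! It also shows that a later summary can reuse an earlier one (pineapple is calculated from potato and pizza).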
diamonds %>%
  group_by(cut) %>%
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>%
  ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
Problem 7.1 Still in Diamonds….
What is the difference between 7.1 and 7.2?
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1)) %>%
  ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Problem 7.2…..
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))
# A tibble: 1 × 1
m
<dbl>
1 1966.
The top one (7.1) summarises while the data are still grouped, so it shows the mean of the halved prices (price multiplied by 0.5) separately for each colour, one row per colour. The bottom one (7.2) ungroups before summarising, so it shows a single mean of the halved prices across all diamonds, regardless of colour.
Further questions:
Why is grouping the data necessary? Grouping lets later operations be carried out separately within each group. For example, grouping a dataset by age and sex might be useful if we were looking to see what effect age and sex had on a particular test result. It is then possible to conduct further manipulation and analysis of the grouped data.
Why is ungrouping data necessary? If you don’t ungroup once you’ve finished your grouped calculations, any further functions will still be applied group by group, which is likely to produce unexpected results, often without any error message.
When should you ungroup data? At the end of your grouped calculation.
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? No.
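If in doubt about whether a table is still grouped, dplyr’s is_grouped_df() can be used to check. A minimal sketch on invented data (the tables plain and grouped are hypothetical):

```r
library(dplyr)

# No group_by() anywhere, so nothing to ungroup
plain <- tibble(x = 1:3) %>%
  mutate(newVar = 1 + 2)

# group_by() without a later ungroup() leaves the table grouped
grouped <- tibble(g = c("a", "b", "b"), x = 1:3) %>%
  group_by(g) %>%
  mutate(newVar = 1 + 2)

is_grouped_df(plain)     # FALSE
is_grouped_df(grouped)   # TRUE
```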
Further practice from the same book section 6.7
Q1. View all of the variable names in diamonds. An easy one to get started! We could also use the head(), tail(), slice_head() or slice_tail() functions as below:
view(diamonds) # this brings up the table in a separate tab at the top
head(diamonds) # brings up the first 6 rows
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
diamonds %>%
  slice_head(n = 5) # to bring up the first 5 rows
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
diamonds %>%
  slice_tail(n = 5) # shows the last 5 rows
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.5
2 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
3 0.7 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
4 0.86 Premium H SI2 61 58 2757 6.15 6.12 3.74
5 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
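Since Q1 asks specifically for the variable names, base R’s names() function is another option that returns just the names without any rows. A short sketch, assuming ggplot2 is installed so that diamonds is available (var_names is just an illustrative object name):

```r
library(ggplot2)  # provides the diamonds dataset

# Character vector of the column (variable) names only
var_names <- names(diamonds)
var_names
```

This should list the same 10 names seen in the header of the tables above, from carat through to z.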
Q2. Arrange the diamonds by:
Lowest to highest price (hint: arrange())
Highest to lowest price (hint: arrange(), desc())
Lowest price and cut
Highest price and cut
diamonds %>%
  arrange(price) # lowest to highest
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
  arrange(desc(price)) # highest to lowest
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, price) # by cut (worst to best), then lowest to highest price within each cut
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, desc(price)) # by cut (worst to best), then highest to lowest price within each cut
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.01 Fair G SI1 70.6 64 18574 7.43 6.64 4.69
2 2.02 Fair H VS2 64.5 57 18565 8 7.95 5.14
3 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
4 2 Fair G VS2 67.6 58 18515 7.65 7.61 5.16
5 2.51 Fair H SI2 64.7 57 18308 8.44 8.5 5.48
6 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9
7 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9
8 2.32 Fair H SI1 62 62 18026 8.47 8.31 5.2
9 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
10 1.93 Fair F VS1 58.9 62 17995 8.17 7.97 4.75
# ℹ 53,930 more rows
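The order of the variables inside arrange() matters: rows are sorted by the first variable, and the second is only used to break ties. A small sketch on a made-up table (gems is hypothetical) compares the two orders:

```r
library(dplyr)

# Hypothetical toy data with an ordered cut factor, as in diamonds
gems <- tibble(
  cut   = factor(c("Good", "Fair", "Good", "Fair"),
                 levels = c("Fair", "Good"), ordered = TRUE),
  price = c(300, 400, 100, 200)
)

# Sort by cut first, then by price within each cut
by_cut_then_price <- gems %>% arrange(cut, price)

# Sort by price first; cut would only break ties in price
by_price_then_cut <- gems %>% arrange(price, cut)

by_cut_then_price$price   # 200 400 100 300
by_price_then_cut$price   # 100 200 300 400
```

So arrange(cut, price), as used above, keeps each cut together, whereas arrange(price, cut) would put the cheapest diamonds first whatever their cut.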
Q3. Arrange the diamonds by lowest to highest price and worst to best clarity
diamonds %>%
  arrange(price, clarity)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Q4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate())
diamonds %>%
  mutate(salePrice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z salePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
Q5. Remove the x, y, and z variables from the diamonds dataset (hint: select())
diamonds %>%
  select(1:7)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
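select(1:7) relies on x, y and z being the last three columns. An alternative sketch, assuming ggplot2 and dplyr are installed, removes the columns by name with a minus sign, which should still work if the column order ever changes (no_xyz is just an illustrative object name):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Drop x, y and z by name rather than by position
no_xyz <- diamonds %>%
  select(-x, -y, -z)

names(no_xyz)
```

The remaining columns should be the same seven as in the output above.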
Q6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).
diamonds %>%
  group_by(cut) %>%
  summarise(number = n()) %>%
  ungroup()
# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
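The group_by() plus summarise(n()) pattern is common enough that dplyr has a shortcut for it, count(). A brief sketch, assuming ggplot2 and dplyr are installed; the name argument is used here only to match the column name chosen above:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# One row per cut, with the number of diamonds in each
cut_counts <- diamonds %>%
  count(cut, name = "number")

cut_counts
```

This should give the same five counts as the table above, and the result is not grouped, so no ungroup() is needed.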
Q7. Create a new column named totalNum that calculates the total number of diamonds.
diamonds %>%
  mutate(total_diamonds = n()) # see 11th column on far right (named total_diamonds here rather than totalNum)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z total_diamonds
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Research Methods subsection
Exercise: Generate a good question and a bad question about the diamonds data set you’ve just explored. Try to use the principles discussed to simulate both types of questions
diamonds
Good questions should be clear and interesting; relevant and meaningful; feasible and manageable in terms of available resources; have measurable variables; not answerable with yes/no; often contain what/how rather than is/are/do/does; be novel and push the boundaries of knowledge forwards; tend to be specific rather than vague; aim to discover/explain/explore; be ethical; be systematically constructed
Examples of good questions might include:
why does a diamond have differing clarities/depths of colours/cuts?
what makes a diamond attractive to buyers?
how does colour affect the price?
which is the most important determinant of price?
Examples of bad questions might include:
is there a difference in price between diamonds of different cuts/clarity/colours etc?
are poor clarity diamonds cheaper?
does my wife want another diamond? This is a terrible question!!