ARES40011 Workbook

Author

Martin Cooper

Week 1

Formatting notes

Remember the cheat sheet under the help menu!

Tip

One * for italics, two for bold (e.g. *italics*, **bold**).

Once you are comfortable/knowledgeable, you can edit in either the Visual editor (which has buttons to help with new things) or the Source editor (which looks more like code).

For example, I do not know how to insert tables or images in the source code. The Visual editor can help with this.

Bog Orchid - Creative Commons license

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

library(tidyverse)

The above code would usually display startup messages listing the packages loaded by tidyverse; however, as we have added an output: false instruction it is not doing so here.

These called-out code sections are known as "chunks".
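As a reminder of the source syntax, a chunk with that option looks roughly like this in the Source editor (a sketch; output: false is a standard Quarto chunk option written on a #| line):

```{r}
#| output: false
library(tidyverse)
```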

From the pre-work for this week, I had successfully imported a data set from Dropbox using the following code. However, it then started returning the error "more columns than column names". What is the fix for this?

SOLVED: I had typed the web address incorrectly on several attempts!

cyan <- read.table(file="https://www.dropbox.com/s/5q3hdyxcouw6ifa/cyanistes.txt?dl=1",header=TRUE,dec=".")

Add a plot

ggplot(diamonds, aes(x=cut)) +
  geom_bar()
more descriptive stuff
Figure 1: A bar chart of diamond cuts

You can refer to areas. This is a reference: Figure 1
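For reference, the cross-reference above is produced by giving the plot chunk a label starting with fig- plus a caption, and then citing that label in the text. A sketch of the source syntax (the label name here is just my own example):

#| label: fig-diamond-cuts
#| fig-cap: "A bar chart of diamond cuts"

Writing @fig-diamond-cuts in the prose then renders as the clickable "Figure 1" reference.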

Using LaTeX

We can put formulas inline like this: \(\alpha\)

We can display equations on their own line like this:

\[\pi \in \mathbb{R}\]

Google LaTeX for more options and guides.
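Another display-mode example (my own, just to show a fraction and a summation):

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\]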

Other display options

For different output types, you can change the format of the document at the top of the file (the YAML header).

Examples include:

- PDF

- Slides - use the revealjs format for this. It looks like MS Sway. Be cautious with this though, as some changes are required to make the display of code sections work; from the example video, adding echo: true to the code sorts this. There are lots of options with the slides.

watch more tutorials on this

You can also add additional options to the file types. Under format in the YAML header, you can select themes. Web search for more Quarto themes.

To add a table of contents, you can add toc: true within the HTML format area. Further customization is available - use the Ctrl+Space options to explore.

You can also have multiple outputs at once, e.g. HTML and PDF.
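Pulling those options together, here is a sketch of what the YAML header at the top of this document might look like (the theme name is just an example; any built-in Quarto theme should work):

---
title: "ARES40011 Workbook"
author: "Martin Cooper"
format:
  html:
    toc: true
    theme: flatly      # example theme; swap for any other Quarto theme
  pdf: default         # second output format, rendered at the same time
execute:
  echo: true           # a global (default) option applied to every chunk
---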

Other useful stuff!

-use Ctrl + Space to see autofill possibilities within code or YAML sections

-you can set global options, i.e. defaults that apply across all the code, in the YAML header area too.

-you can add code in other languages by changing the language name between the {} in the chunk headers.

Using python example

This is where I got stuck: error - Python binding not loaded. Code removed as I was unable to render/publish with it.
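If I come back to this, a likely first check (an assumption on my part, not verified) is whether R can actually find a Python installation. The reticulate package provides the R-to-Python binding and can report what it sees:

# install.packages("reticulate")  # if not already installed
library(reticulate)  # assumption: reticulate supplies the Python binding needed here
py_config()          # reports which Python installation (if any) R will use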

Week 2 - During session notes

-Added callout boxes to highlight areas in Week 1's notes.

library(palmerpenguins)
library(tidyverse) # added again as was not running
Warning

Tasks for this week:

Play around with penguin data

Insert a picture into the workbook - done

Playing with Penguin data

penguins %>%
  glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

Week 2 - Post Session

Playing with Blue Tit data

For this data, re-run the command to import the blue tit data, as detailed in the chunk above. This assigns the data frame to the object cyan.

The first step in data analysis is data exploration. This helps to identify any issues in the dataset and to ensure any analysis is valid.

The steps suggested for data exploration are to identify:

  • Outliers in response and independent variables

  • Normality and homogeneity of the response variable

  • An excess of zeros in the response variable

  • Multicollinearity among independent variables

  • Relationships among response and independent variables

  • Independence of the response variable

These look familiar to me. Unsurprising as many of these conditions are assumptions of analyses that we may want to run!

Once we have the data, we can inspect it using the structure function, str().

str(cyan)
'data.frame':   438 obs. of  7 variables:
 $ id    : chr  "B1" "B10" "B111" "B112" ...
 $ zone  : chr  "B" "B" "B" "B" ...
 $ year  : int  2002 2001 2002 2002 2001 2001 2002 2002 2002 2002 ...
 $ multi : chr  "yes" "yes" "no" "no" ...
 $ height: num  2.7 2 2 1.9 2.1 2.5 1.9 2.2 2.4 2.3 ...
 $ day   : int  8 25 10 6 28 23 10 10 11 7 ...
 $ depth : num  0.33 0.25 0.2 0.25 0.33 0.4 0.2 0.2 0.2 0.33 ...

This tells us a few things that we should note and that link back to the study design.

  • id is the unique identifier of the nest box in the study area

  • the variable zone identifies the plot within the study area

  • year is the year the data was collected

  • multi is a binomial variable stating whether a nest was used more than once ("yes" or "no")

  • height is the height at which the box is attached to the tree

  • day is the number of days after the 1st of April that the first egg was laid in the nest

  • depth is a measure of nest size (as a fraction of nest box filled) and may indicate body condition and resources of the nest box occupants. In this example this is the main variable of interest

We could run a similar analysis using the glimpse function used above.

glimpse(cyan)
Rows: 438
Columns: 7
$ id     <chr> "B1", "B10", "B111", "B112", "B115", "B116", "B118", "B121", "B~
$ zone   <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"~
$ year   <int> 2002, 2001, 2002, 2002, 2001, 2001, 2002, 2002, 2002, 2002, 200~
$ multi  <chr> "yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "yes",~
$ height <dbl> 2.7, 2.0, 2.0, 1.9, 2.1, 2.5, 1.9, 2.2, 2.4, 2.3, 2.6, 1.5, 2.7~
$ day    <int> 8, 25, 10, 6, 28, 23, 10, 10, 11, 7, 18, 19, 11, 32, 7, 27, 4, ~
$ depth  <dbl> 0.33, 0.25, 0.20, 0.25, 0.33, 0.40, 0.20, 0.20, 0.20, 0.33, 0.5~
Tip

Note that the output also tells you the data type

-chr is a categorical data type defined by characters

-int shows the variable is an integer (whole number).

-num indicates continuous numeric data. Note that in the glimpse output num is replaced by dbl. This is not an important difference as far as we are concerned.

Remember

Data types can be categorical or continuous

There may be missing data in the set, which can cause difficulties in analysis. R treats missing data as NA. To find them we can run the following:

colSums(is.na(cyan)) #this returns, for each column of the dataset cyan, the number of times NA appears
    id   zone   year  multi height    day  depth 
     0      0      0      0      0      0      0 

There are no missing data in this dataset.
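If there had been missing values, one common tidyverse option (a sketch only - not needed for this dataset) is to drop the incomplete rows before analysis:

cyan.complete <- cyan %>% drop_na()  # drop_na() comes from tidyr (loaded with tidyverse)
nrow(cyan.complete)                  # how many rows remain after removing NAs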

Checking for outliers

One of the best ways to inspect data is visually. For outliers we can do this using a boxplot.

boxplot(depth ~ zone, ylab = 'Nest Depth', xlab = 'Woodland Zone', data = cyan, col = 'lightblue',outpch = 16, las=1)

# the boxplot parameters are as follows:
#depth ~ zone tells R to split the values of the depth by the category zone 
#xlab and ylab are the axis labels
#data is which dataset to use for the plot
#col is the colour of the boxes
#outpch sets the plotting character used for outliers (16 is a filled circle)
#las = 1 tells R to keep the axis labels horizontal

We can see from this plot that there are 4 data points that sit outside of the expected range. The plot also shows that nest depth differs between the different zones of the woodland. Visual inspection like this therefore gives us several pieces of information at once.
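To see which rows those four outlying points come from, one option (my own sketch; note it checks depth overall rather than within each zone) is boxplot.stats():

out.vals <- boxplot.stats(cyan$depth)$out  # the depth values plotted as outliers
cyan[cyan$depth %in% out.vals, ]           # the rows of cyan containing those values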

Normality

To test for normality we can use a frequency polygon.

p<-ggplot()
p<-p+xlab("Nest Depth")
p<-p+ylab("Frequency")
p<-theme(text=element_text(size=15))
p<-theme(panel.background = element_blank())
p<-theme(panel.border = element_rect(fill=NA, colour="black", size=1))
p<-p+theme(strip.background = element_rect(fill="white", colour="white", size=1))
p<-p+theme(text=element_text(15))
p<-p+theme(legend.position = 'none')
p<-p+geom_freqpoly(data=cyan,aes(depth),bins=7)

I could not get this to work with the book's code or with a variety of online searches! I wanted to plot a histogram or frequency polygon.

Warning

This was the error: Error in `+.gg`: ! Can't add `geom_freqpoly(data = cyan, aes(depth), bins = 7)` to a theme object. Backtrace: 1. ggplot2:::`+.gg`(p, geom_freqpoly(data = cyan, aes(depth), bins = 7))
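Looking back at the chunk, the likely cause (my reading of the error, not from the course materials) is that three lines assign p<-theme(...) rather than p<-p+theme(...), which replaces p with a bare theme object, so the later + geom_freqpoly() has nothing to add to. A corrected sketch (with the deprecated size argument dropped from element_rect()):

p <- ggplot()
p <- p + xlab("Nest Depth")
p <- p + ylab("Frequency")
p <- p + theme(text = element_text(size = 15))
p <- p + theme(panel.background = element_blank())
p <- p + theme(panel.border = element_rect(fill = NA, colour = "black"))
p <- p + theme(legend.position = 'none')
p <- p + geom_freqpoly(data = cyan, aes(depth), bins = 7)
p  # print the frequency polygon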

Another way to test for normality is the Shapiro-Wilk normality test. For this test, the null hypothesis is that the data are normally distributed - if the p-value is <0.05 we reject that null hypothesis and conclude the data are not normal.

shapiro.test(cyan$depth)

    Shapiro-Wilk normality test

data:  cyan$depth
W = 0.89872, p-value < 2.2e-16

This test shows that the data are not normally distributed (p < 0.05, so we reject the null hypothesis of normality).

The next step in data exploration is to check the homogeneity of variance of the data. This is another assumption of many statistical tests. It can be checked visually (covariate boxplot) or with tests such as Levene's Test.

#|label: levene test 

library(lawstat)
Warning: package 'lawstat' was built under R version 4.1.3
#I had to add the lawstat package to run this test

levene.test(cyan$depth, cyan$zone)

    Modified robust Brown-Forsythe Levene-type test based on the absolute
    deviations from the median

data:  cyan$depth
Test Statistic = 0.6051, p-value = 0.7738

Levene's test does not assume normality and so is suitable for this dataset. As the p-value is >0.05, we can say that the variances do not deviate significantly from homogeneity.

An excess of zeros in a response variable is called zero inflation. If this is a problem there are lots of ways to deal with it; however, let's test for it first!

sum(cyan$depth==0, na.rm=TRUE)*100/nrow(cyan)
[1] 0

The result is 0% zeros, so there is no zero inflation and we can move on to the next exploration step.

Multicollinearity among covariates

Coll <- c("zone", "year", "height", "day")


panel.cor <- function(x, y, digits=1, prefix="", cex.cor = 6)
{usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r1=cor(x,y,use="pairwise.complete.obs")
r <- abs(cor(x, y,use="pairwise.complete.obs"))
txt <- format(c(r1, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) { cex <- 0.9/strwidth(txt) } else {
  cex = cex.cor}
text(0.5, 0.5, txt, cex = cex * r)}
pairs(cyan[, Coll], lower.panel = panel.cor)
Warning

This code is copied directly from the zip files supplied via the Statistics in R for Biodiversity Conservation Dropbox, but it would not run and I am unsure why. I tried to Google some other ways of doing it but was unsuccessful.
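One possible explanation (an assumption on my part) is that zone is a character variable, and the cor() call inside panel.cor only accepts numeric input, so pairs() fails on that column. A sketch that recodes zone as numbers first, reusing panel.cor as defined in the chunk above:

cyan$zone.num <- as.numeric(as.factor(cyan$zone))   # hypothetical numeric recoding of zone
Coll.num <- c("zone.num", "year", "height", "day")  # numeric covariates only
pairs(cyan[, Coll.num], lower.panel = panel.cor)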

Relationships among dependent and independent variables

Visually inspecting plotted data is a useful way to check these relationships as well.

#| label: multipanel scatter plot of depth and nestbox height across zones

ggplot(cyan, aes(x = height, y = depth)) +
  geom_point() +
  geom_smooth(method = 'lm', colour = 'red', se = FALSE) +
  theme(panel.background = element_blank()) +
  theme(panel.border = element_rect(fill = NA, size = 1)) +
  theme(strip.background = element_rect(fill = "white", 
                                        color = "white", size = 1)) +
  theme(text = element_text(size=16)) +
  facet_wrap(~zone)
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
i Please use the `linewidth` argument instead.
`geom_smooth()` using formula = 'y ~ x'

The best-fit lines fitted to the above data suggest that the panels for zones CP and O might show some interaction, which suggests that further investigation is needed.

Independence of response variable

In this dataset, there are relationships between responses - nests are in the same boxes in different years. For the purposes of this example, we do not need to worry about this.

Week 3 Pre work

Week 3 Session

No notes made during session

Week 3 Post Session

First let’s ensure we have called the necessary packages

library(tidyverse)

6.1.1.01

Task Mutations of the midwest dataset

midwest.new<- #defines new object name
  midwest %>% #calls the existing dataset and pipes to the next commands
  mutate(avg.pop.den=mean(popdensity), #this adds the column for task a
avg.area=mean(area),
# this is for b
totadult=sum(popadults), # the totals for all these are the same throughout the data as they are averages or totals from the entire dataset
tot.minus.white=(totadult-popwhite), # this is now different per line as the popwhite varies across the lines of the data. You could achieve the same result by adding together the other pop* datafields but this is more efficient
child.to.adult=(percchildbelowpovert/percadultpoverty), # as with the tot.minus.white data, these vary by row due to difference between the rows of data
ratio.adult=(popadults/poptotal), # dividing the 2 fields to get the ratio
perc.adult=(ratio.adult*100)) # multiplying by 100 to convert ratio to percentage

midwest.new  # this then displays the newly amended table
# A tibble: 437 x 35
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN~ IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA~ IL    0.058   173025      2983.   146506    16559           331
# i 427 more rows
# i 26 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, ...
#note that combining all the mutations in a single multi-step chunk is neater than having several chunks running one after the other.

Task Mutations of the presidential dataset

presidential.new<- #defines new object name
  presidential %>% # utilizes the presidential dataset
  mutate(duration=(end-start)) # duration is the end date minus the start date

presidential.new # to display the new data
# A tibble: 12 x 5
   name       start      end        party      duration 
   <chr>      <date>     <date>     <chr>      <drtn>   
 1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
 2 Kennedy    1961-01-20 1963-11-22 Democratic 1036 days
 3 Johnson    1963-11-22 1969-01-20 Democratic 1886 days
 4 Nixon      1969-01-20 1974-08-09 Republican 2027 days
 5 Ford       1974-08-09 1977-01-20 Republican  895 days
 6 Carter     1977-01-20 1981-01-20 Democratic 1461 days
 7 Reagan     1981-01-20 1989-01-20 Republican 2922 days
 8 Bush       1989-01-20 1993-01-20 Republican 1461 days
 9 Clinton    1993-01-20 2001-01-20 Democratic 2922 days
10 Bush       2001-01-20 2009-01-20 Republican 2922 days
11 Obama      2009-01-20 2017-01-20 Democratic 2922 days
12 Trump      2017-01-20 2021-01-20 Republican 1461 days

Task Mutations with the economic dataset

economic.new<- #defines new object name
  economics %>% # utilizes the economics dataset
  mutate(perc.unemploy=((unemploy/pop)*100)) # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this

economic.new # to display the new data
# A tibble: 574 x 7
   date         pce    pop psavert uempmed unemploy perc.unemploy
   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>         <dbl>
 1 1967-07-01  507. 198712    12.6     4.5     2944          1.48
 2 1967-08-01  510. 198911    12.6     4.7     2945          1.48
 3 1967-09-01  516. 199113    11.9     4.6     2958          1.49
 4 1967-10-01  512. 199311    12.9     4.9     3143          1.58
 5 1967-11-01  517. 199498    12.8     4.7     3066          1.54
 6 1967-12-01  525. 199657    11.8     4.8     3018          1.51
 7 1968-01-01  531. 199808    11.7     5.1     2878          1.44
 8 1968-02-01  534. 199920    12.3     4.5     3001          1.50
 9 1968-03-01  544. 200056    11.7     4.1     2877          1.44
10 1968-04-01  544  200208    12.3     4.6     2709          1.35
# i 564 more rows

Task Mutations with the txhousing dataset

txhousing.new<- #defines new object name
  txhousing %>% # utilizes the txhousing dataset
  mutate(successrate=((sales/listings)*100), # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
failrate=(100-successrate)) # fail rate is the difference between the whole (100) and the successrate previously calculated.
txhousing.new # to display the new data
# A tibble: 8,602 x 11
   city     year month sales  volume median listings inventory  date successrate
   <chr>   <int> <int> <dbl>   <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
 1 Abilene  2000     1    72  5.38e6  71400      701       6.3 2000         10.3
 2 Abilene  2000     2    98  6.51e6  58700      746       6.6 2000.        13.1
 3 Abilene  2000     3   130  9.28e6  58100      784       6.8 2000.        16.6
 4 Abilene  2000     4    98  9.73e6  68600      785       6.9 2000.        12.5
 5 Abilene  2000     5   141  1.06e7  67300      794       6.8 2000.        17.8
 6 Abilene  2000     6   156  1.39e7  66900      780       6.6 2000.        20  
 7 Abilene  2000     7   152  1.26e7  73500      742       6.2 2000.        20.5
 8 Abilene  2000     8   131  1.07e7  75000      765       6.4 2001.        17.1
 9 Abilene  2000     9   104  7.62e6  64500      771       6.5 2001.        13.5
10 Abilene  2000    10   101  7.04e6  59300      764       6.6 2001.        13.2
# i 8,592 more rows
# i 1 more variable: failrate <dbl>

Tasks from 6.6.1

Recreating code from the guide

## Example illustrating how to answer these problems:
diamonds.new<- #added name of new dataset
diamonds %>%                         # utilizes the diamonds dataset
  group_by(color, clarity) %>%       # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
  ungroup() %>%                      # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%  # new variable, original price + $10
  select(cut, color,                 # retain only these listed columns
         clarity, price, 
         price200, random10) %>% 
  arrange(color) %>%                 # visualize data ordered by color
  group_by(cut) %>%                  # group data by cut
  mutate(dis = n_distinct(price),    # counts the number of unique price values per cut 
         rowID = row_number()) %>%   # numbers each row consecutively for each cut
  ungroup()                          # final ungrouping of data

diamonds.new                         # display the dataset
# A tibble: 53,940 x 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# i 53,930 more rows

The functions all make sense, but the ordering does not make sense to me. The rowID column is confusing when the data is ordered as it is.

midwest %>% # utilizes the midwest dataset
  group_by(state) %>%  # groups the data by state
  summarize(poptotalmean = mean(poptotal), # create new field and calculate the mean of the population
            poptotalmed = median(poptotal), # create new field and calculate the median of the population
            popmax = max(poptotal), # create new field and return the maximum of the population
            popmin = min(poptotal), # create new field and return the minimum of the population
            popdistinct = n_distinct(poptotal), # create new field and count the number of different poptotals
            popfirst = first(poptotal), # create new field and return the first value of total population
            popany = any(poptotal < 5000), # are there any items where the total population is under 5000? (TRUE FALSE response)
            popany2 = any(poptotal > 2000000)) %>% # are there any items where the total population is over 2000000(TRUE FALSE response)
  ungroup()
# A tibble: 5 x 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# i 1 more variable: popany2 <lgl>
midwest %>% # utilizes the midwest dataset
  group_by(state) %>% # groups the data by state
  summarize(num5k = sum(poptotal < 5000), # return the number of items where the poptotal is less than 5000
            num2mil = sum(poptotal > 2000000), # return the number of items where the poptotal is more than 2000000
            numrows = n()) %>% # number of rows (counties) in each state
  ungroup()
# A tibble: 5 x 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
midwest %>% # utilizes the midwest dataset
  group_by(county) %>% # group by county
  summarize(x = n_distinct(state)) %>% # set field x to the number of different states in which that county name appears
  arrange(desc(x)) %>% # arrange data with the highest counts first
  ungroup()
# A tibble: 320 x 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# i 310 more rows
#| Label: Problem C part 2

midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% # set field x to number of times that county appears total
  ungroup()
# A tibble: 320 x 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# i 310 more rows
Important

Questions

How does n() differ from n_distinct()?

n() counts the number of rows in each group (the number of times the county name appears), while n_distinct(state) counts the number of different states in which that county name appears.

When would they be the same? different?

If all the counties only appear in each state once, they would be the same. However, if a county appears twice in one state it will only be counted once by n_distinct but twice by n
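A tiny toy example (my own, not from the exercises) showing the difference when a county name repeats within a single state:

tibble(county = c("ADAMS", "ADAMS"),
       state  = c("IL", "IL")) %>%
  summarize(n_rows   = n(),                 # counts both rows: 2
            n_states = n_distinct(state))   # only one distinct state: 1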

midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(county)) %>% 
  arrange(desc(x)) %>% # this line is added to solve part 2 of the questions below
  ungroup()
# A tibble: 320 x 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# i 310 more rows
Important

Questions

How many distinctly different counties are there for each county?

Each county group contains exactly one distinct county name (x = 1 for every row), and there are 320 rows, representing 320 different county names.

Can there be more than 1 (county) county in each county?

There cannot be. By sorting the data in 3i from largest to smallest, we would have identified any counts higher than 1.

What if we replace ‘county’ with ‘state’? See below

#| Label: Problem C part 3ii

midwest %>% 
  group_by(state) %>% 
  summarize(x = n_distinct(county)) %>% 
  ungroup()
# A tibble: 5 x 2
  state     x
  <chr> <int>
1 IL      102
2 IN       92
3 MI       83
4 OH       88
5 WI       72
# this changed code now returns the number of counties by state
#| Label: Problem D

diamonds %>% 
  group_by(clarity) %>% # items are grouped in the category clarity
  summarize(a = n_distinct(color), # no of unique colors by clarity
            b = n_distinct(price),# no of unique prices by clarity
            c = n()) %>% # total number of items
  ungroup()
# A tibble: 8 x 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790
#| label: Problem E part 1

diamonds %>% 
  group_by(color, cut) %>% 
  summarize(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 x 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# i 25 more rows
#this has grouped the items in the dataset by color and cut, then returned the mean price of items in those categories along with the standard deviation
#| Label: Problem E part 2

diamonds %>% 
  group_by(cut, color) %>% 
  summarize(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 x 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# i 25 more rows
#this has grouped the items in the dataset by cut then colour, returning the mean price of items and standard deviation. It is the same summary as part 1, just in a different order
#| Label: Problem E part 3

diamonds %>% 
  group_by(cut, color, clarity) %>% # after summarize(), the last grouping level (clarity) is dropped, so the result stays grouped by cut and color - see the message below (it is informational, not an error)
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>% # multiples the mean price by 0.8 (20% off)
  ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 x 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# i 266 more rows
# the msale data is the mean data with 20% off.
Important

Questions

How good is the sale if the price of diamonds equaled msale?

Not bad! 20% off is nothing to be sniffed at!

#| label: Problem F

diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth), # data potato is the mean depth
            pizza = mean(price), # data pizza is the mean price
            popcorn = median(y), # data popcorn is the median of y
            pineapple = potato - pizza, # gives the difference between the mean depth and the mean price
            papaya = pineapple ^ 2, # the square of the difference in the means
            peach = n()) %>% # the number of items
  ungroup()
# A tibble: 5 x 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551
# these labels are stupid
#| label: Problem G part I


diamonds %>% 
  group_by(color) %>% # groups items by color
  summarize(m = mean(price)) %>% # gives the mean of the items 
  mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
         x2 = 5) %>% # creates new column "x2" with the value 5 for each row
  ungroup()
# A tibble: 7 x 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
#| label: Problem G part ii


diamonds %>% 
  group_by(color) %>% # groups items by color
  summarize(m = mean(price)) %>% # gives the mean of the items 
  ungroup() %>%
  mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
         x2 = 5)  # creates new column "x2" with the value 5 for each row
# A tibble: 7 x 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
Important

Questions

What does the first ungroup() do? Is it useful here? Why/why not?

In this example, it makes no practical difference: summarize() has already dropped the only grouping level (color), so the data are no longer grouped and the mutate() that follows behaves identically with or without the ungroup().

No further ungroup is required as nothing is grouped at this point

#| label: Problem H part i

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # creates column x1 which gives the price divided by 2
  summarize(m = mean(x1)) %>% # gives the mean of the values in x1, grouped by color per the initial group command
  ungroup() 
# A tibble: 7 x 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.
#| label: part ii

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>%  
  summarize(m = mean(x1)) # returns the mean of all data
# A tibble: 1 x 1
      m
  <dbl>
1 1966.
Important

Questions

What’s the difference between part I and II?

In part I, the mean is calculated for each group of items; in part II it is the mean of the whole dataset. That is why you only have a single value in part II compared to part I.

Important

Questions

Why is grouping data necessary?

So that operations such as summarize() and mutate() are applied within each group of rows rather than across the whole dataset.

Why is ungrouping data necessary?

To prevent later operations from unintentionally being applied group by group, which adds complexity and can cause errors.

When should you ungroup data?

In the same code chunk - while learning R, probably after each grouped function or group of functions is finished.

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No, no grouping has been created therefore there is no need to undo it

Week 3 Post session - Research Methods

Bad Question on the Diamonds Dataset

How does diamond price vary?

Good Question on the Diamonds Dataset

What effect does diamond color have on price?