library(tidyverse)
ARES40011 Workbook
Week 1
Formatting notes
Remember the cheat sheet under the help menu!
One * for italics, two for bold.
Once comfortable/knowledgeable, you can edit in either Visual mode (has buttons to help with new things) or Source mode (more code-like).
For example, I do not know how to insert tables or images in the source code. The Visual editor can help with this.
Running Code
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
The above code would usually display a table of data relating to the tidyverse package; however, as we have added an output: false instruction, it is not doing so here.
These call-out sections are called “chunks”.
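As a reminder to myself, a chunk with its output suppressed might look something like the sketch below (the tidyverse_packages() line is just my own example for illustration, not the actual chunk from above):
#| output: false
library(tidyverse)      # loads the package; any printed output is hidden by output: false
tidyverse_packages()    # would normally print a list of tidyverse packages, but nothing is shown here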
From the pre-work for this week, I had successfully imported a data set from Dropbox using the following code. However, this is now returning the error "more columns than column names". What is the fix for this?
SOLVED: I had typed the web address incorrectly on several attempts!
cyan <- read.table(file="https://www.dropbox.com/s/5q3hdyxcouw6ifa/cyanistes.txt?dl=1", header=TRUE, dec=".")
Add a plot
ggplot(diamonds, aes(x=cut)) +
geom_bar()
You can refer to areas. This is a reference: Figure 1
Using LaTeX
We can put formulas inline like this: \(\alpha\)
We can display equations on their own line like this:
\[\pi \in \mathbb{R}\] Google LaTeX for more options and guides.
Other display options
For different output types, you can change the format of the document at the top of the code.
Examples include:
- PDF
- Slides: use the revealjs format for this. Looks like MS Sway. Be cautious with this though, as some changes will be required to ensure code sections display properly. From the example video, adding echo: true to the code chunks sorts this. There are lots of options with the slides.
watch more tutorials on this
You can also add additional options to the file types. Under format in the YAML header, you can select themes. Web search for more Quarto themes.
To add a table of contents, you can add toc within the HTML format area. Further customization is available - use the Ctrl+Space options to explore.
You can also have multiple outputs at once, e.g. HTML and PDF.
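A minimal YAML header sketch of these ideas (my own example; the theme name is just an assumption, many others exist):
format:
  html:
    theme: cosmo   # example theme choice
    toc: true      # adds a table of contents to the HTML output
  pdf: default     # a second output format rendered alongside the HTML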
Other useful stuff!
-use Ctrl + Space to see autofill possibilities within code or YAML sections
-you can set global options, as in the defaults across the code, in the YAML header area too.
-you can add additional code in other languages by changing the info between the {} in the chunks.
Using python example
This is where I got stuck: error - Python binding not loaded. Code removed as I was unable to render/publish.
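A guess at a fix to try later (my assumption, not verified): Python chunks in an R-based Quarto document are usually run through the reticulate package, so installing it and checking that it can find a Python installation might clear the binding error.
install.packages("reticulate")   # R package that provides the R-to-Python binding for {python} chunks
reticulate::py_config()          # reports which Python installation (if any) reticulate can find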
Week 2 - During session notes
-Added callout boxes to highlight areas in week 1's notes.
library(palmerpenguins)
library(tidyverse) # added again as it was not running
Tasks for this week:
Play around with penguin data
Insert a picture into the workbook - done
Playing with Penguin data
penguins %>%
glimpse()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex <fct> male, female, female, NA, female, male, female, male~
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Week 2 - Post Session
Playing with Blue Tit data
For this data, re-run the command to get the blue tit data as detailed in the chunk above. This sets the dataframe to cyan.
The first step in data analysis is data exploration. This helps identify any issues in the dataset and ensures any subsequent analysis is valid.
The steps suggested for data exploration are to identify:
Outliers in response and independent variables
Normality and homogeneity of the response variable
An excess of zeros in the response variable
Multicollinearity among independent variables
Relationships among response and independent variables
Independence of the response variable
These look familiar to me. Unsurprising as many of these conditions are assumptions of analyses that we may want to run!
Once we have the data, we can inspect it using the structure function str()
str(cyan)
'data.frame': 438 obs. of 7 variables:
$ id : chr "B1" "B10" "B111" "B112" ...
$ zone : chr "B" "B" "B" "B" ...
$ year : int 2002 2001 2002 2002 2001 2001 2002 2002 2002 2002 ...
$ multi : chr "yes" "yes" "no" "no" ...
$ height: num 2.7 2 2 1.9 2.1 2.5 1.9 2.2 2.4 2.3 ...
$ day : int 8 25 10 6 28 23 10 10 11 7 ...
$ depth : num 0.33 0.25 0.2 0.25 0.33 0.4 0.2 0.2 0.2 0.33 ...
This tells us a few things that we should note and that link to the study design.
id is the unique identifier of the nest box in the study area
zone is the plot within the study area
year is the year the data was collected
multi is a binary variable stating whether a nest was used more than once ("no"/"yes")
height is the height at which the box is attached to the tree
day is the number of days after 1st of April that the first egg was laid in the nest
depth is a measure of nest size (as a fraction of nest box filled) and may indicate body condition and resources of the nest box occupants. In this example this is the main variable of interest.
We could run a similar analysis using the glimpse function used above.
glimpse(cyan)
Rows: 438
Columns: 7
$ id <chr> "B1", "B10", "B111", "B112", "B115", "B116", "B118", "B121", "B~
$ zone <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"~
$ year <int> 2002, 2001, 2002, 2002, 2001, 2001, 2002, 2002, 2002, 2002, 200~
$ multi <chr> "yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "yes",~
$ height <dbl> 2.7, 2.0, 2.0, 1.9, 2.1, 2.5, 1.9, 2.2, 2.4, 2.3, 2.6, 1.5, 2.7~
$ day <int> 8, 25, 10, 6, 28, 23, 10, 10, 11, 7, 18, 19, 11, 32, 7, 27, 4, ~
$ depth <dbl> 0.33, 0.25, 0.20, 0.25, 0.33, 0.40, 0.20, 0.20, 0.20, 0.33, 0.5~
Note that the output also tells you the data type
-chr is a categorical data type defined by characters
-int shows the variable is an integer (whole number).
-num are continuous numeric data. Note that in the glimpse function num is replaced by dbl. This is not an important difference as far as we are concerned.
Remember
Data types can be categorical or continuous
There may be missing data in the set, which can cause difficulties in analysis. R treats missing data as NA. To find them we can run the following:
colSums(is.na(cyan)) # what this is saying is: return the sum of the number of times NA appears in each column of the dataset cyan
    id   zone   year  multi height    day  depth
     0      0      0      0      0      0      0
There are no missing data in this dataset.
Checking for outliers
One of the best ways to inspect data is visually. For outliers we can do this using a boxplot.
boxplot(depth ~ zone, ylab = 'Nest Depth', xlab = 'Woodland Zone', data = cyan, col = 'lightblue', outpch = 16, las = 1)
# the boxplot parameters are as follows:
# depth ~ zone tells R to split the values of depth by the category zone
# xlab and ylab are the axis labels
# data is which dataset to use for the plot
# col is the colour of the boxes
# outpch is a graphical parameter that changes the symbol used for outlier points (16 = filled circle)
# las = 1 tells R to keep the axis labels horizontal
We can see from this plot that there are 4 data points that sit outside of the expected range. The plot also shows that nest depths differ between the different areas of the woodland, so this visual inspection has several advantages. (A small sketch for listing the flagged outliers follows below.)
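As a note to myself, a minimal dplyr sketch (my own addition, not from the course material) that lists the rows flagged as outliers by the usual 1.5 x IQR boxplot rule, within each zone:
cyan %>%
  group_by(zone) %>%                                            # apply the rule within each woodland zone
  filter(depth < quantile(depth, 0.25) - 1.5 * IQR(depth) |
         depth > quantile(depth, 0.75) + 1.5 * IQR(depth)) %>%  # keep only points outside the whiskers
  ungroup()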
Normality
To test for normality we can use a frequency polygon.
p<-ggplot()
p<-p+xlab("Nest Depth")
p<-p+ylab("Frequency")
p<-theme(text=element_text(size=15))
p<-theme(panel.background = element_blank())
p<-theme(panel.border = element_rect(fill=NA, colour="black", size=1))
p<-p+theme(strip.background = element_rect(fill="white", colour="white", size=1))
p<-p+theme(text=element_text(15))
p<-p+theme(legend.position = 'none')
p<-p+geom_freqpoly(data=cyan,aes(depth),bins=7)
I could not get this to work with the book's code or from a variety of online searches! I wanted to plot a histogram or frequency polygon graphic.
This was the error: Error in +.gg: ! Can't add geom_freqpoly(data = cyan, aes(depth), bins = 7) to a theme object. Backtrace: 1. ggplot2:::+.gg(p, geom_freqpoly(data = cyan, aes(depth), bins = 7))
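Looking at this again, I think the problem is that the lines written as p<-theme(...) (without p +) replace p with a bare theme object, which is why R complains about adding a geom to a theme. A corrected sketch of what I was trying to do (my own reconstruction, not the book's exact code):
p <- ggplot(data = cyan, aes(depth)) +
  geom_freqpoly(bins = 7) +                      # frequency polygon of nest depth
  xlab("Nest Depth") +
  ylab("Frequency") +
  theme(text = element_text(size = 15),
        panel.background = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black", linewidth = 1),
        strip.background = element_rect(fill = "white", colour = "white", linewidth = 1),
        legend.position = 'none')                # every piece is added with +, so nothing overwrites p
p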
Another way to test for normality is the Shapiro-Wilk normality test. For this test, the null hypothesis is that the data is normally distributed - if the p-value is <0.05 we reject that null hypothesis and conclude the data are not normally distributed.
shapiro.test(cyan$depth)
Shapiro-Wilk normality test
data: cyan$depth
W = 0.89872, p-value < 2.2e-16
This test shows the data is not distributed normally.
The next step in data exploration is to check the homogeneity of variance. This is another assumption of many statistical tests. It can be checked visually (covariate boxplot) or with tests such as Levene's Test.
#| label: levene test
library(lawstat)
Warning: package 'lawstat' was built under R version 4.1.3
#I had to add the lawstat package to run this test
levene.test(cyan$depth, cyan$zone)
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: cyan$depth
Test Statistic = 0.6051, p-value = 0.7738
Levene’s test does not assume normality and so is suitable for this dataset. As the p-value is >0.05, we can say that the variances do not deviate from homogeneity across the zones.
An excess of zeros in a response variable is called zero inflation. If this is a problem there are lots of ways to deal with it; however, let's test first!
sum(cyan$depth==0, na.rm=TRUE)*100/nrow(cyan)
[1] 0
The result is 0, so we can move on to the next exploration step.
Multicollinearity among covariates
Coll <- c("zone", "year", "height", "day")
panel.cor <- function(x, y, digits=1, prefix="", cex.cor = 6)
{usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r1=cor(x,y,use="pairwise.complete.obs")
r <- abs(cor(x, y,use="pairwise.complete.obs"))
txt <- format(c(r1, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) { cex <- 0.9/strwidth(txt) } else {
cex = cex.cor}
text(0.5, 0.5, txt, cex = cex * r)}
pairs(cyan[, Coll], lower.panel = panel.cor)
This code is copied directly from the zip files supplied via the Statistics in R for Biodiversity Conservation dropbox, but it would not run and I am unsure why. I tried to Google some other ways of doing it but was unsuccessful. (A guess at a fix is sketched below.)
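My best guess (unverified) is that pairs() and cor() need numeric columns, while zone is a character variable, so converting it to numeric codes first might let the panel run. A sketch of that idea, reusing the panel.cor function defined above:
cyan$zone.num <- as.numeric(factor(cyan$zone))   # recode the character zone as numeric codes
Coll <- c("zone.num", "year", "height", "day")   # swap zone for the numeric version
pairs(cyan[, Coll], lower.panel = panel.cor)     # correlation values shown in the lower panels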
Relationships among dependent and independent variables
Visually inspecting plotted data is useful to check these relationships as well
#| label: multipanel scatter plot of depth and nestbox height across zones
ggplot(cyan, aes(x = height, y = depth)) +
geom_point() +
geom_smooth(method = 'lm', colour = 'red', se = FALSE) +
theme(panel.background = element_blank()) +
theme(panel.border = element_rect(fill = NA, size = 1)) +
theme(strip.background = element_rect(fill = "white",
color = "white", size = 1)) +
theme(text = element_text(size=16)) +
facet_wrap(~zone)
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
i Please use the `linewidth` argument instead.
`geom_smooth()` using formula = 'y ~ x'
The best-fit lines fitted to the above data show that zones CP and O might have different height-depth relationships (a possible interaction), which suggests that further investigation is needed.
Independence of response variable
In this dataset, there are relationships between responses - nests are in the same boxes in different years. For the purposes of this example, there is no need to worry.
Week 3 Pre work
Week 3 Session
No notes made during session
Week 3 Post Session
First let’s ensure we have called the necessary packages
library(tidyverse)
6.1.1.01
Task Mutations of the midwest dataset
midwest.new<- #defines new object name
midwest %>% #calls the existing dataset and pipes to the next commands
mutate(avg.pop.den=mean(popdensity), #this adds the column for task a
avg.area=mean(area),
# this is for b
totadult=sum(popadults), # the totals for all these are the same throughout the data as they are averages or totals from the entire dataset
tot.minus.white=(totadult-popwhite), # this is now different per line as the popwhite varies across the lines of the data. You could achieve the same result by adding together the other pop* datafields but this is more efficient
child.to.adult=(percchildbelowpovert/percadultpoverty), # as with the tot.minus.white data, these vary by row due to difference between the rows of data
ratio.adult=(popadults/poptotal), # dividing the 2 fields to get the ratio
perc.adult=(ratio.adult*100)) # multiplying by 100 to convert ratio to percentage
midwest.new # this then displays the newly amended table
# A tibble: 437 x 35
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN~ IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA~ IL 0.058 173025 2983. 146506 16559 331
# i 427 more rows
# i 26 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, ...
# note that by combining all the mutations in a single, multi-command chunk it is neater than having several chunks running one after the other.
Task Mutations of the presidential dataset
presidential.new<- #defines new object name
presidential %>% # utilizes the presidential dataset
mutate(duration=(end-start)) # duration is the end date minus the start date
presidential.new # to display the new data
# A tibble: 12 x 5
name start end party duration
<chr> <date> <date> <chr> <drtn>
1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
4 Nixon 1969-01-20 1974-08-09 Republican 2027 days
5 Ford 1974-08-09 1977-01-20 Republican 895 days
6 Carter 1977-01-20 1981-01-20 Democratic 1461 days
7 Reagan 1981-01-20 1989-01-20 Republican 2922 days
8 Bush 1989-01-20 1993-01-20 Republican 1461 days
9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
10 Bush 2001-01-20 2009-01-20 Republican 2922 days
11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
12 Trump 2017-01-20 2021-01-20 Republican 1461 days
Task Mutations with the economic dataset
economic.new<- #defines new object name
economics %>% # utilizes the economics dataset
mutate(perc.unemploy=((unemploy/pop)*100)) # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
economic.new # to display the new data
# A tibble: 574 x 7
date pce pop psavert uempmed unemploy perc.unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
7 1968-01-01 531. 199808 11.7 5.1 2878 1.44
8 1968-02-01 534. 199920 12.3 4.5 3001 1.50
9 1968-03-01 544. 200056 11.7 4.1 2877 1.44
10 1968-04-01 544 200208 12.3 4.6 2709 1.35
# i 564 more rows
Task Mutations with the txhousing dataset
txhousing.new<- #defines new object name
txhousing %>% # utilizes the txhousing dataset
mutate(successrate=((sales/listings)*100), # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
failrate=(100-successrate)) # fail rate is the difference between the whole (100) and the successrate previously calculated.
txhousing.new # to display the new data
# A tibble: 8,602 x 11
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5.38e6 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6.51e6 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9.28e6 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9.73e6 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 1.06e7 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 1.39e7 66900 780 6.6 2000. 20
7 Abilene 2000 7 152 1.26e7 73500 742 6.2 2000. 20.5
8 Abilene 2000 8 131 1.07e7 75000 765 6.4 2001. 17.1
9 Abilene 2000 9 104 7.62e6 64500 771 6.5 2001. 13.5
10 Abilene 2000 10 101 7.04e6 59300 764 6.6 2001. 13.2
# i 8,592 more rows
# i 1 more variable: failrate <dbl>
Tasks from 6.6.1
Recreating code from the guide
## Example illustrating how to answer these problems:
diamonds.new<- #added name of new dataset
diamonds %>% # utilizes the diamonds dataset
group_by(color, clarity) %>% # groups data by color and clarity variables
mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
ungroup() %>% # data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>% # new variable, original price + $10
select(cut, color, # retain only these listed columns
clarity, price,
price200, random10) %>%
arrange(color) %>% # visualize data ordered by color
group_by(cut) %>% # group data by cut
mutate(dis = n_distinct(price), # counts the number of unique price values per cut
rowID = row_number()) %>% # numbers each row consecutively for each cut
ungroup() # final ungrouping of data
diamonds.new # display the dataset
# A tibble: 53,940 x 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# i 53,930 more rows
The functions all make sense, but the ordering does not to me. The rowID column is confusing when the data is ordered as it is.
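A quick sketch (my own addition) to make the per-cut rowID numbering easier to follow by sorting on cut first:
diamonds.new %>%
  arrange(cut, rowID) %>%            # group the rows of each cut together so rowID runs 1, 2, 3, ...
  select(cut, color, price, rowID)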
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(poptotalmean = mean(poptotal), # create new field and calculate the mean of the population
poptotalmed = median(poptotal), # create new field and calculate the median of the population
popmax = max(poptotal), # create new field and return the maximum of the population
popmin = min(poptotal), # create new field and return the minimum of the population
popdistinct = n_distinct(poptotal), # create new field and count the number of different poptotals
popfirst = first(poptotal), # create new field and return the first value of total population
popany = any(poptotal < 5000), # are there any items where the total population is under 5000? (TRUE FALSE response)
popany2 = any(poptotal > 2000000)) %>% # are there any items where the total population is over 2000000(TRUE FALSE response)
ungroup()
# A tibble: 5 x 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# i 1 more variable: popany2 <lgl>
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(num5k = sum(poptotal < 5000), # return the number of items where the poptotal is less than 5000
num2mil = sum(poptotal > 2000000), # return the number of items where the poptotal is more than 2000000
numrows = n()) %>% # number of rows (counties) in each state
ungroup()
# A tibble: 5 x 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
midwest %>% # utilizes the midwest dataset
group_by(county) %>% # group by county
summarize(x = n_distinct(state)) %>% # set field x to the number of different states in which that county name appears
arrange(desc(x)) %>% # arrange data with the highest counts first
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# i 310 more rows
#| label: Problem C part 2
midwest %>%
group_by(county) %>%
summarize(x = n()) %>% # set field x to number of times that county appears total
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# i 310 more rows
Questions
How does n() differ from n_distinct()?
n() is the number of rows in the group (the number of times the county name appears), while n_distinct(state) is the number of different states in which that county name appears
When would they be the same? different?
If every county name appears in each state only once, they would be the same. However, if a county appears twice in one state it will only be counted once by n_distinct() but twice by n().
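A tiny sketch (my own addition) that puts both counts side by side and keeps only the county names where they differ, i.e. where a name repeats within a single state:
midwest %>%
  group_by(county) %>%
  summarize(n_rows = n(),                      # how many rows share this county name
            n_states = n_distinct(state)) %>%  # how many different states the name appears in
  filter(n_rows != n_states) %>%               # a difference means a repeat within one state
  ungroup()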
midwest %>%
group_by(county) %>%
summarize(x = n_distinct(county)) %>%
arrange(desc(x)) %>% # this line is added to solve part 2 of the questions below
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# i 310 more rows
Questions
How many distinctly different counties are there for each county?
Each county is individually named and there are 320 rows, representing 320 different county names
Can there be more than 1 (county) county in each county?
No. By sorting the data in 3i from largest to smallest we would have identified any counts higher than 1
What if we replace ‘county’ with ‘state’? See below
#| label: Problem C part 3ii
midwest %>%
group_by(state) %>%
summarize(x = n_distinct(county)) %>%
ungroup()
# A tibble: 5 x 2
state x
<chr> <int>
1 IL 102
2 IN 92
3 MI 83
4 OH 88
5 WI 72
# this changed code now returns the number of counties by state
#| label: Problem D
diamonds %>%
group_by(clarity) %>% # items are grouped in the category clarity
summarize(a = n_distinct(color), # no of unique colors by clarity
b = n_distinct(price),# no of unique prices by clarity
c = n()) %>% # total number of items
ungroup()
# A tibble: 8 x 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
#| label: Problem E part 1
diamonds %>%
group_by(color, cut) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 x 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# i 25 more rows
# this has grouped the items in the dataset by color and cut, then returned the mean price of items in those categories along with the standard deviation
#| label: Problem E part 2
diamonds %>%
group_by(cut, color) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 x 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# i 25 more rows
# this has grouped the items in the dataset by cut then colour, returning the mean price of items and standard deviation. It is the same summary as part 1, just in a different order
#| label: Problem E part 3
diamonds %>%
group_by(cut, color, clarity) %>% # groups by all three variables; summarize() then drops the last grouping level (clarity), so the result is left grouped by cut and color. See the message below
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80) %>% # multiples the mean price by 0.8 (20% off)
ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 x 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# i 266 more rows
# the msale data is the mean price with 20% off.
Questions
How good is the sale if the price of diamonds equaled msale?
Not bad! 20% off is nothing to be sniffed at!
#| label: Problem F
diamonds %>%
group_by(cut) %>%
summarize(potato = mean(depth), # data potato is the mean depth
pizza = mean(price), # data pizza is the mean price
popcorn = median(y), # data popcorn is the median of y
pineapple = potato - pizza, # gives the difference between the mean depth and the mean price
papaya = pineapple ^ 2, # the square of the difference in the means
peach = n()) %>% # the number of items
ungroup()
# A tibble: 5 x 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
# these labels are stupid
#| label: Problem G part I
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) %>% # creates new column "x2" with the value 5 for each row
ungroup()
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
#| label: Problem G part ii
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
ungroup() %>%
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) # creates new column "x2" with the value 5 for each row
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Questions
What does the first ungroup() do? Is it useful here? Why/why not?
In this example, it does nothing: after summarize() the data is no longer grouped by color (the last grouping level is dropped), so the ungroup() is redundant and the mutate() that follows gives the same result either way.
No further ungroup() is required as nothing is grouped at this point
#| label: Problem H part i
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>% # creates column x1 which gives the price divided by 2
summarize(m = mean(x1)) %>% # gives the mean of the values in x1, grouped by color per the initial group command
ungroup()
# A tibble: 7 x 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
#| label: part ii
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>%
summarize(m = mean(x1)) # returns the mean of all data
# A tibble: 1 x 1
m
<dbl>
1 1966.
Questions
What’s the difference between part I and II?
In part I, the mean is calculated for each group of items; in part II it is the mean of the whole dataset. That is why you only have a single value in part II compared to part I.
Questions
Why is grouping data necessary?
So that operations can be applied separately to each group of rows defined by specific variables, rather than to the whole dataset at once
Why is ungrouping data necessary?
To remove the grouping once it is no longer needed, so later operations are not accidentally applied group by group
When should you ungroup data?
In the same code chunk, whilst learning R, probably after each function or group of functions
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No, no grouping has been created therefore there is no need to undo it
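A quick check I find useful (my own sketch, not from the course): group_vars() reports whether a data frame is currently grouped, confirming that mutate() on its own never introduces grouping.
diamonds %>% mutate(newVar = 1 + 2) %>% group_vars()   # character(0): not grouped, so ungroup() is not needed
diamonds %>% group_by(cut) %>% group_vars()            # "cut": grouped until ungroup() is called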
Week 3 Post session - Research Methods
Bad Question on the Diamonds Dataset
How does diamond price vary?
Good Question on the Diamonds Dataset
What effect does diamond color have on price?