Week 3
Problem A
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
midwest %>%
  group_by(state) %>%                    # one group per state
  summarize(
    poptotalmean = mean(poptotal),       # mean county population in each state
    poptotalmed = median(poptotal),      # median county population
    popmax = max(poptotal),              # largest county population
    popmin = min(poptotal),              # smallest county population
    popdistinct = n_distinct(poptotal),  # number of unique poptotal values
    popfirst = first(poptotal),          # first poptotal value in each state's rows
    popany = any(poptotal < 5000),       # TRUE if any county is below 5000
    popany2 = any(poptotal > 2000000)    # TRUE if any county is over 2000000
  ) %>%
  ungroup()                              # removes any remaining grouping

midwest %>%
  group_by(state) %>%                    # segments the data by unique values of 'state'
  summarize(
    num5k = sum(poptotal < 5000),        # number of counties below 5000, stored in column 'num5k'
    num2mil = sum(poptotal > 2000000),   # number of counties over 2000000, stored in column 'num2mil'
    numrows = n()                        # number of rows (counties) in each state
  ) %>%
  ungroup()                              # removes any remaining grouping
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C Part 1
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(state)) %>%  # number of different states in which each county name appears
  arrange(desc(x)) %>%                  # sorts from largest to smallest
  ungroup()                             # removes any remaining grouping
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Part 2
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%  # counts the number of rows within each 'county' group
  ungroup()               # removes any remaining grouping
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
The difference between 'n()' and 'n_distinct()' is that n_distinct() counts the unique values of a given column within a group, whereas n() counts the total number of rows in the group. The two give the same result when no value is repeated within the group. That is the case in the output shown here: for example, ADAMS has 4 rows and 4 distinct states.
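The distinction can be seen on a small made-up table. This is only a sketch: the `toy` tibble and its values are hypothetical and were chosen so that one county name repeats within a state, which does not happen in the output above.

```r
library(dplyr)

# Hypothetical toy data: ADAMS appears twice in IN, so its row count
# and its distinct-state count differ.
toy <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALLEN", "ALLEN"),
  state  = c("IL", "IN", "IN", "IN", "OH")
)

counts <- toy %>%
  group_by(county) %>%
  summarize(
    rows   = n(),                # total rows in each county group
    states = n_distinct(state)   # unique state values in each county group
  )

print(counts)
```

For ADAMS the two columns differ because one state is repeated; for ALLEN they match because every row has a different state.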
Part 3
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(county)) %>%  # number of unique 'county' values within each county group
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
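As shown above, x is always 1. Because the data is grouped by county, each group contains only one county name, so n_distinct(county) cannot be more than 1. If the data were grouped by state instead, n_distinct(county) would return the number of counties in each state, which would be more than 1.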
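A minimal sketch of that contrast, using a hypothetical five-row table (the `toy` data is invented for illustration and is not taken from midwest):

```r
library(dplyr)

# Hypothetical toy data: two states with a few county names each
toy <- tibble(
  state  = c("IL", "IL", "IN", "IN", "IN"),
  county = c("ADAMS", "BOONE", "ADAMS", "ALLEN", "CLARK")
)

# Grouped by county: every group holds a single county name, so x is always 1
by_county <- toy %>%
  group_by(county) %>%
  summarize(x = n_distinct(county))

# Grouped by state: x is the number of different counties in each state
by_state <- toy %>%
  group_by(state) %>%
  summarize(x = n_distinct(county))

print(by_county)
print(by_state)
```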
Problem D
diamonds %>%
  group_by(clarity) %>%
  summarize(
    a = n_distinct(color),  # number of unique colors in each clarity group
    b = n_distinct(price),  # number of unique prices in each clarity group
    c = n()                 # total number of rows in each clarity group
  ) %>%
  ungroup()

diamonds %>%
  group_by(color, cut) %>%
  summarize(
    m = mean(price),  # mean price for each color and cut combination
    s = sd(price)     # standard deviation of price for each combination
  ) %>%
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Part 3
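diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(
    m = mean(price),   # mean price for each cut, color and clarity combination
    s = sd(price),     # standard deviation of price for each combination
    msale = m * 0.80   # 'msale' is the mean price multiplied by 0.8
  ) %>%
  ungroup()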
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
If the price of a diamond is equal to msale (80% of the mean price for its cut, color and clarity), then it is considered a 'fair' sale.
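The messages printed with the outputs above mention the `.groups` argument. As a sketch of what that refers to: passing `.groups = "drop"` to summarize() returns an ungrouped result directly, so the message is not printed and a trailing ungroup() is not needed. The object name `dropped` is only illustrative.

```r
library(dplyr)
library(ggplot2)  # loaded only for the diamonds dataset

# Same summary as the color/cut chunk above, but the grouping is dropped
# inside summarize() instead of by a later ungroup()
dropped <- diamonds %>%
  group_by(color, cut) %>%
  summarize(
    m = mean(price),
    s = sd(price),
    .groups = "drop"
  )

is_grouped_df(dropped)  # checks whether any grouping is left
```

Either approach gives an ungrouped table; `.groups = "drop"` just makes the choice explicit at the point where the summary is computed.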
Problem F
diamonds %>%
  group_by(cut) %>%
  summarize(
    potato = mean(depth),        # mean depth for each cut
    pizza = mean(price),         # mean price for each cut
    popcorn = median(y),         # median of y for each cut
    pineapple = potato - pizza,  # mean depth minus mean price, stored in 'pineapple'
    papya = pineapple^2,         # squares the value of pineapple and stores it in 'papya'
    peach = n()                  # number of rows in each cut group, stored in 'peach'
  ) %>%
  ungroup()

diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%          # mean price for each color group, stored in column 'm'
  mutate(
    x1 = str_c("diamond color", color),   # joins the string "diamond color" to each color to label the row
    x2 = 5                                # a constant value of 5 for every row
  ) %>%
  ungroup()
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. diamond colorD 5
2 E 3077. diamond colorE 5
3 F 3725. diamond colorF 5
4 G 3999. diamond colorG 5
5 H 4487. diamond colorH 5
6 I 5092. diamond colorI 5
7 J 5324. diamond colorJ 5
Part 2
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup() %>%
  mutate(
    x1 = str_c("diamond color", color),  # joins "diamond color" to the corresponding color value
    x2 = 5                               # a constant value of 5 for every row
  )
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. diamond colorD 5
2 E 3077. diamond colorE 5
3 F 3725. diamond colorF 5
4 G 3999. diamond colorG 5
5 H 4487. diamond colorH 5
6 I 5092. diamond colorI 5
7 J 5324. diamond colorJ 5
The function 'ungroup()' removes the grouping structure. This is important because later operations can then be performed on the entire data frame without any grouping context. If 'ungroup()' is not used, any subsequent operations still respect the original grouping.
If 'ungroup()' is placed after 'mutate()', as in Part 1, then mutate() runs while any grouping is still in place, rather than on the non-grouped dataset. Here both parts give the same output, because summarize() had already removed the only grouping level (color) before mutate() ran.
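The position of ungroup() does matter when a grouping is still active at the time mutate() runs. A small sketch with made-up numbers (the `toy` tibble and the column names `g`, `v` and `m` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows, group "b" has one
toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 8))

# mutate() runs while the data is still grouped: mean(v) is computed per group
within_groups <- toy %>%
  group_by(g) %>%
  mutate(m = mean(v)) %>%
  ungroup()

# ungroup() comes first: mean(v) is computed over the whole table
whole_table <- toy %>%
  group_by(g) %>%
  ungroup() %>%
  mutate(m = mean(v))

print(within_groups)
print(whole_table)
```

In the first result `m` changes from group to group; in the second it is the same for every row.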
Problem H Part 1
diamonds %>%
  group_by(color) %>%              # later steps operate within each color group
  mutate(x1 = price * 0.5) %>%     # new column 'x1': half of the corresponding price
  summarize(m = mean(x1)) %>%      # mean of x1 for each color, stored in the new column 'm'
  ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Part 2
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%                # removes the grouping created by group_by
  summarize(m = mean(x1))      # mean of x1 over all rows, stored in the new column 'm'
# A tibble: 1 × 1
m
<dbl>
1 1966.
The difference between the code in Part 1 and Part 2 is that Part 1 gives the mean of half the price for each color group (one row per color), whereas Part 2 gives the mean of half the price for all of the diamonds regardless of color (a single row).
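One consequence worth noting is that the single overall value is not simply the average of the per-group values, because groups with more rows count for more. The sketch below uses a hypothetical three-row table (`toy`, with invented prices) rather than the diamonds data:

```r
library(dplyr)

# Hypothetical toy data: color "D" has two rows, color "E" has one
toy <- tibble(color = c("D", "D", "E"), price = c(100, 300, 800))

# Same shape as Part 1: one mean per color
per_group <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1))

# Same shape as Part 2: one mean over every row
overall <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))

print(per_group)
print(overall)
mean(per_group$m)  # average of the group means, for comparison with overall$m
```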
Added Notes
Grouping data is important for several reasons: 1. It allows you to target a specific subset of the data, e.g. by color. 2. It prepares data for statistical analysis, because results are easier to make sense of when they are summarized in tables that can then be turned into charts. 3. When summary statistics such as the mean, median or standard deviation ('sd') are needed for each group, it lets you carry out those calculations in a simple manner.
Ungrouping data is also important for several reasons: 1. It allows extra operations, such as further calculations, to be applied to the entire dataset and not just within the groups. 2. It ensures that the following steps work on the intended data structure, which matters after a series of calculations or transformations. 3. It allows whole-dataset summaries, such as calculating the overall mean.
If group_by() has not been used then you don't need to use ungroup(), as ungroup() is specifically for removing grouping from a dataset that has previously been grouped with group_by().
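As an illustration of the last two points, the sketch below computes an overall mean after a grouped step, and then checks that ungroup() leaves a never-grouped table as it was. The `toy` tibble and the column name `group_mean` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(color = c("D", "D", "E"), price = c(100, 300, 800))

# A grouped step followed by ungroup(), so the final mean covers all rows
overall_mean <- toy %>%
  group_by(color) %>%
  mutate(group_mean = mean(price)) %>%  # per-color mean, computed while grouped
  ungroup() %>%
  summarize(overall = mean(price))      # one value for the whole table

print(overall_mean)

# ungroup() on data that was never grouped returns it unchanged
never_grouped <- toy %>% ungroup()
is_grouped_df(never_grouped)
```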
Good and bad Question
Good question - In the diamonds dataset, does the carat weight affect the price, and is there a correlation?
Bad question - Using the diamonds dataset how many diamonds can you identify?
To develop a good research hypothesis, it must first be clear and it must be testable through data analysis. Several things must be considered before developing the hypothesis, such as which specific relationship between variables is being examined, and the hypothesis question must state clearly what is being measured. For example, after using the cricket data above, a good hypothesis question could have been “Does temperature have an influence on the rate of chirping in crickets, and does it differ depending on the species?”.
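The good question above about carat and price is testable with a few lines of R. The following is only a first-look sketch, not a full analysis: it measures the linear correlation and fits a simple linear model, and neither step on its own shows that carat causes the price. The object names `r` and `fit` are illustrative.

```r
library(ggplot2)  # loaded only for the diamonds dataset

# Pearson correlation between carat weight and price
r <- cor(diamonds$carat, diamonds$price)
print(r)

# Simple linear model of price on carat; the slope is the estimated
# change in price for each additional carat
fit <- lm(price ~ carat, data = diamonds)
print(coef(fit))
```

The bad question, by contrast, cannot be turned into a comparison between variables like this, which is part of what makes it a weak research question.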
Week 5
Box plot
# Load the ggplot2 package
library(ggplot2)

# Create the box plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +  # maps Species to the x-axis and Sepal.Length to the y-axis; fill = Species colors each box by species
  geom_boxplot() +                                                  # adds the box plot layer
  theme_minimal() +                                                 # optional, but it adds a clean theme
  labs(x = "Species", y = "Sepal Length") +                         # labels the x and y axes
  scale_fill_manual(values = c("red", "green", "blue"))             # sets the fill color for each species
Density plot
library(ggplot2)

ggplot(iris, aes(x = Petal.Length, color = Species)) +  # maps Petal.Length to the x-axis and assigns a different color to each species
  geom_density()                                        # draws a density curve showing the distribution of petal length for each species
Scatter plot with line of regression
library(ggplot2)

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +          # maps Petal.Length to the x-axis and Petal.Width to the y-axis
  geom_point(mapping = aes(color = Species, shape = Species)) + # adds the points, with color and shape representing the species
  geom_smooth(method = "lm")                                    # adds a linear regression line (lm stands for linear model) showing the trend of petal width against petal length across all species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa small
2 4.9 3.0 1.4 0.2 setosa small
3 4.7 3.2 1.3 0.2 setosa small
4 4.6 3.1 1.5 0.2 setosa small
5 5.0 3.6 1.4 0.2 setosa small
6 5.4 3.9 1.7 0.4 setosa small
7 4.6 3.4 1.4 0.3 setosa small
8 5.0 3.4 1.5 0.2 setosa small
9 4.4 2.9 1.4 0.2 setosa small
10 4.9 3.1 1.5 0.1 setosa small
11 5.4 3.7 1.5 0.2 setosa small
12 4.8 3.4 1.6 0.2 setosa small
13 4.8 3.0 1.4 0.1 setosa small
14 4.3 3.0 1.1 0.1 setosa small
15 5.8 4.0 1.2 0.2 setosa big
16 5.7 4.4 1.5 0.4 setosa small
17 5.4 3.9 1.3 0.4 setosa small
18 5.1 3.5 1.4 0.3 setosa small
19 5.7 3.8 1.7 0.3 setosa small
20 5.1 3.8 1.5 0.3 setosa small
21 5.4 3.4 1.7 0.2 setosa small
22 5.1 3.7 1.5 0.4 setosa small
23 4.6 3.6 1.0 0.2 setosa small
24 5.1 3.3 1.7 0.5 setosa small
25 4.8 3.4 1.9 0.2 setosa small
26 5.0 3.0 1.6 0.2 setosa small
27 5.0 3.4 1.6 0.4 setosa small
28 5.2 3.5 1.5 0.2 setosa small
29 5.2 3.4 1.4 0.2 setosa small
30 4.7 3.2 1.6 0.2 setosa small
31 4.8 3.1 1.6 0.2 setosa small
32 5.4 3.4 1.5 0.4 setosa small
33 5.2 4.1 1.5 0.1 setosa small
34 5.5 4.2 1.4 0.2 setosa small
35 4.9 3.1 1.5 0.2 setosa small
36 5.0 3.2 1.2 0.2 setosa small
37 5.5 3.5 1.3 0.2 setosa small
38 4.9 3.6 1.4 0.1 setosa small
39 4.4 3.0 1.3 0.2 setosa small
40 5.1 3.4 1.5 0.2 setosa small
41 5.0 3.5 1.3 0.3 setosa small
42 4.5 2.3 1.3 0.3 setosa small
43 4.4 3.2 1.3 0.2 setosa small
44 5.0 3.5 1.6 0.6 setosa small
45 5.1 3.8 1.9 0.4 setosa small
46 4.8 3.0 1.4 0.3 setosa small
47 5.1 3.8 1.6 0.2 setosa small
48 4.6 3.2 1.4 0.2 setosa small
49 5.3 3.7 1.5 0.2 setosa small
50 5.0 3.3 1.4 0.2 setosa small
51 7.0 3.2 4.7 1.4 versicolor big
52 6.4 3.2 4.5 1.5 versicolor big
53 6.9 3.1 4.9 1.5 versicolor big
54 5.5 2.3 4.0 1.3 versicolor small
55 6.5 2.8 4.6 1.5 versicolor big
56 5.7 2.8 4.5 1.3 versicolor small
57 6.3 3.3 4.7 1.6 versicolor big
58 4.9 2.4 3.3 1.0 versicolor small
59 6.6 2.9 4.6 1.3 versicolor big
60 5.2 2.7 3.9 1.4 versicolor small
61 5.0 2.0 3.5 1.0 versicolor small
62 5.9 3.0 4.2 1.5 versicolor big
63 6.0 2.2 4.0 1.0 versicolor big
64 6.1 2.9 4.7 1.4 versicolor big
65 5.6 2.9 3.6 1.3 versicolor small
66 6.7 3.1 4.4 1.4 versicolor big
67 5.6 3.0 4.5 1.5 versicolor small
68 5.8 2.7 4.1 1.0 versicolor big
69 6.2 2.2 4.5 1.5 versicolor big
70 5.6 2.5 3.9 1.1 versicolor small
71 5.9 3.2 4.8 1.8 versicolor big
72 6.1 2.8 4.0 1.3 versicolor big
73 6.3 2.5 4.9 1.5 versicolor big
74 6.1 2.8 4.7 1.2 versicolor big
75 6.4 2.9 4.3 1.3 versicolor big
76 6.6 3.0 4.4 1.4 versicolor big
77 6.8 2.8 4.8 1.4 versicolor big
78 6.7 3.0 5.0 1.7 versicolor big
79 6.0 2.9 4.5 1.5 versicolor big
80 5.7 2.6 3.5 1.0 versicolor small
81 5.5 2.4 3.8 1.1 versicolor small
82 5.5 2.4 3.7 1.0 versicolor small
83 5.8 2.7 3.9 1.2 versicolor big
84 6.0 2.7 5.1 1.6 versicolor big
85 5.4 3.0 4.5 1.5 versicolor small
86 6.0 3.4 4.5 1.6 versicolor big
87 6.7 3.1 4.7 1.5 versicolor big
88 6.3 2.3 4.4 1.3 versicolor big
89 5.6 3.0 4.1 1.3 versicolor small
90 5.5 2.5 4.0 1.3 versicolor small
91 5.5 2.6 4.4 1.2 versicolor small
92 6.1 3.0 4.6 1.4 versicolor big
93 5.8 2.6 4.0 1.2 versicolor big
94 5.0 2.3 3.3 1.0 versicolor small
95 5.6 2.7 4.2 1.3 versicolor small
96 5.7 3.0 4.2 1.2 versicolor small
97 5.7 2.9 4.2 1.3 versicolor small
98 6.2 2.9 4.3 1.3 versicolor big
99 5.1 2.5 3.0 1.1 versicolor small
100 5.7 2.8 4.1 1.3 versicolor small
101 6.3 3.3 6.0 2.5 virginica big
102 5.8 2.7 5.1 1.9 virginica big
103 7.1 3.0 5.9 2.1 virginica big
104 6.3 2.9 5.6 1.8 virginica big
105 6.5 3.0 5.8 2.2 virginica big
106 7.6 3.0 6.6 2.1 virginica big
107 4.9 2.5 4.5 1.7 virginica small
108 7.3 2.9 6.3 1.8 virginica big
109 6.7 2.5 5.8 1.8 virginica big
110 7.2 3.6 6.1 2.5 virginica big
111 6.5 3.2 5.1 2.0 virginica big
112 6.4 2.7 5.3 1.9 virginica big
113 6.8 3.0 5.5 2.1 virginica big
114 5.7 2.5 5.0 2.0 virginica small
115 5.8 2.8 5.1 2.4 virginica big
116 6.4 3.2 5.3 2.3 virginica big
117 6.5 3.0 5.5 1.8 virginica big
118 7.7 3.8 6.7 2.2 virginica big
119 7.7 2.6 6.9 2.3 virginica big
120 6.0 2.2 5.0 1.5 virginica big
121 6.9 3.2 5.7 2.3 virginica big
122 5.6 2.8 4.9 2.0 virginica small
123 7.7 2.8 6.7 2.0 virginica big
124 6.3 2.7 4.9 1.8 virginica big
125 6.7 3.3 5.7 2.1 virginica big
126 7.2 3.2 6.0 1.8 virginica big
127 6.2 2.8 4.8 1.8 virginica big
128 6.1 3.0 4.9 1.8 virginica big
129 6.4 2.8 5.6 2.1 virginica big
130 7.2 3.0 5.8 1.6 virginica big
131 7.4 2.8 6.1 1.9 virginica big
132 7.9 3.8 6.4 2.0 virginica big
133 6.4 2.8 5.6 2.2 virginica big
134 6.3 2.8 5.1 1.5 virginica big
135 6.1 2.6 5.6 1.4 virginica big
136 7.7 3.0 6.1 2.3 virginica big
137 6.3 3.4 5.6 2.4 virginica big
138 6.4 3.1 5.5 1.8 virginica big
139 6.0 3.0 4.8 1.8 virginica big
140 6.9 3.1 5.4 2.1 virginica big
141 6.7 3.1 5.6 2.4 virginica big
142 6.9 3.1 5.1 2.3 virginica big
143 5.8 2.7 5.1 1.9 virginica big
144 6.8 3.2 5.9 2.3 virginica big
145 6.7 3.3 5.7 2.5 virginica big
146 6.7 3.0 5.2 2.3 virginica big
147 6.3 2.5 5.0 1.9 virginica big
148 6.5 3.0 5.2 2.0 virginica big
149 6.2 3.4 5.4 2.3 virginica big
150 5.9 3.0 5.1 1.8 virginica big
Bar chart comparing size in species
library(ggplot2)
library(dplyr)  # needed for %>% and mutate()

data("iris")

# Creates a new dataset with an added 'size' column: "small" if Sepal.Length
# is below the overall median, otherwise "big"
iris.new <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))

ggplot(iris.new, aes(x = Species, fill = size)) +  # maps Species to the x-axis and size to the fill color
  geom_bar(position = "dodge") +                   # places the bars for each size next to each other for comparison
  scale_fill_brewer(palette = "Dark2")             # applies the "Dark2" color palette to the fill aesthetic
The bar chart shows the distribution of iris flowers by species and sepal length, classified as either “small” or “big.” Setosa flowers are mostly small, versicolor has a balanced mix of small and big, and virginica flowers are mostly big. This highlights size differences in sepal length across species.
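The counts behind the bars can also be checked as a table. This sketch rebuilds the same `size` column and cross-tabulates it against species with base R's table(); the name `size_counts` is only illustrative.

```r
library(dplyr)

# Rebuild the size column exactly as in the bar chart code
iris.new <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))

# Rows are species, columns are size categories, cells are flower counts
size_counts <- table(iris.new$Species, iris.new$size)
print(size_counts)
```

Each row of the table sums to 50, so the cells can be read directly as the heights of the paired bars in the chart.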