My Workbook
Week 1: Meet the penguins
The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
The plot below shows the relationship between flipper and bill lengths of these penguins.
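The original plot code is not reproduced here; a minimal sketch of one way to draw it, assuming the palmerpenguins package is installed:
Code
library(palmerpenguins) # provides the penguins dataset
library(ggplot2)
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
geom_point(na.rm = TRUE) + # na.rm drops rows with missing measurements
labs(x = "Flipper length (mm)", y = "Bill length (mm)", color = "Species")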
Week 2: Computations
This dataset contains a subset of the fuel economy data from the EPA. Specifically, we use the mpg dataset from the ggplot2 package.
Code
library(ggplot2)
The plots in Figure 1 show the relationship between city and highway mileage for 38 popular models of cars. In Figure 1(a) the points are colored by the number of cylinders, while in Figure 1(b) they are colored by engine displacement.
Code
#| label: fig-mpg
#| fig-cap: "City and highway mileage for 38 popular models of cars."
#| fig-subcap:
#| - "Color by number of cylinders"
#| - "Color by engine displacement, in liters"
#| layout-ncol: 2
#| column: page
ggplot(mpg, aes(x = hwy, y = cty, color = cyl)) +
geom_point(alpha = 0.5, size = 2) +
scale_color_viridis_c() +
theme_minimal()
ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
geom_point(alpha = 0.5, size = 2) +
scale_color_viridis_c(option = "E") +
theme_minimal()
#| echo: false
mean_cty <- round(mean(mpg$cty), 2)
mean_hwy <- round(mean(mpg$hwy), 2)
There are 234 observations in our data.
The average city mileage of the cars in our data is 16.86 and the average highway mileage is 23.44.
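These values come from mean_cty and mean_hwy above. In the Quarto source, the same sentence can be produced with inline code rather than hard-coded numbers; a sketch of the idea, assuming the chunk above has already run:
Code
There are `r nrow(mpg)` observations in our data. The average city mileage of the cars in our data is `r mean_cty` and the average highway mileage is `r mean_hwy`.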
Week 3: Exercises (diamonds dataset)
Code
1 + 1
[1] 2
Code
library(dplyr) # provides the %>% pipe and mutate(); only ggplot2 was attached above
# View(diamonds) # to view the dataset
# str(diamonds) # to show the structure of the dataset
# names(diamonds) # to view the variable names
diamonds %>%
mutate(JustOne = 1,
Values = "something",
Simple = TRUE) # mutate() creates new variables; they can also be based on existing variables
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
Code
diamonds %>%
mutate(price200 = price - 200, # £200 off original price
price20perc = price * 0.20, # 20% of original price
price20percoff = price * 0.80, # 20% off original price
pricepercarat = price / carat, # ratio of price to carat
squaredepth = depth ^ 2) # original depth, squared
# A tibble: 53,940 × 15
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
# pricepercarat <dbl>, squaredepth <dbl>
Code
diamonds.new <- # used to save changes to dataset as a new object
diamonds %>%
mutate(price200 = price - 200, # £200 off original price
price20perc = price * 0.20, # 20% of original price
price20percoff = price * 0.80, # 20% off original price
pricepercarat = price / carat, # ratio of price to carat
squaredepth = depth ^ 2) # original depth, squared
## Nesting functions
diamonds %>%
mutate(m = mean(price), # calculates the mean price
sd = sd(price), # calculates the standard deviation
med = median(price)) # calculates the median price
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
Midwest dataset
To calculate the average population density
Code
library(tidyverse)
library(ggplot2) # This is where midwest is located
midwest <- ggplot2::midwest # Load the midwest dataset
avg_density <- mean(midwest$popdensity)
midwest$avg.pop.den <- avg_density
head(midwest)
# A tibble: 6 × 29
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>
To calculate the average area for the entire dataset
Code
avg.area <- mean(midwest$area)
head(midwest)
# A tibble: 6 × 29
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>
To calculate the total number of adults in the dataset
Code
total_adults <- sum(midwest$popadults)
midwest$totadult <- total_adults
head(midwest)
# A tibble: 6 × 30
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 21 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>
To calculate the difference between the total population and the white population.
Code
midwest$tot.minus.white <- midwest$poptotal - midwest$popwhite
## alternatively, without using poptotal and popwhite:
midwest$tot.minus.white <- midwest$popblack + midwest$popamerindian + midwest$popasian + midwest$popother
head(midwest)
# A tibble: 6 × 31
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 22 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>
To calculate the ratio between the child and adult poverty percentages
Code
midwest$child.to.adult <- midwest$percchildbelowpovert / midwest$percadultpoverty
head(midwest)
# A tibble: 6 × 32
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>
Code
## Why are the values in child.to.adult all different?
## Because each row's ratio compares that county's percentage of children below the poverty line to its percentage of adults in poverty, and these differ by county.
midwest$ratio.adult <- midwest$popadults / midwest$poptotal
head(midwest)
# A tibble: 6 × 33
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 24 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …
To calculate the percentage of each county's total population that are adults
Code
midwest$perc.adult <- (midwest$popadults / midwest$poptotal) * 100
head(midwest)
# A tibble: 6 × 34
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …
Presidential dataset
To load dataset
Code
library(ggplot2) # the presidential dataset ships with ggplot2, not datasets
data("presidential")
head(presidential) # to look at the data itself
# A tibble: 6 × 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy 1961-01-20 1963-11-22 Democratic
3 Johnson 1963-11-22 1969-01-20 Democratic
4 Nixon 1969-01-20 1974-08-09 Republican
5 Ford 1974-08-09 1977-01-20 Republican
6 Carter 1977-01-20 1981-01-20 Democratic
Code
str(presidential) # to understand the structure and types of the dataset
tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:12] "Eisenhower" "Kennedy" "Johnson" "Nixon" ...
$ start: Date[1:12], format: "1953-01-20" "1961-01-20" ...
$ end : Date[1:12], format: "1961-01-20" "1963-11-22" ...
$ party: chr [1:12] "Republican" "Democratic" "Democratic" "Republican" ...
To create a duration column by calculating the difference in days
Code
presidential$start <- as.Date(presidential$start)
presidential$end <- as.Date(presidential$end)
presidential$duration <- as.numeric(presidential$end - presidential$start)
head(presidential)
# A tibble: 6 × 5
name start end party duration
<chr> <date> <date> <chr> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 2922
2 Kennedy 1961-01-20 1963-11-22 Democratic 1036
3 Johnson 1963-11-22 1969-01-20 Democratic 1886
4 Nixon 1969-01-20 1974-08-09 Republican 2027
5 Ford 1974-08-09 1977-01-20 Republican 895
6 Carter 1977-01-20 1981-01-20 Democratic 1461
Exercises (economics dataset)
To load data
Code
library(ggplot2) # the economics dataset also ships with ggplot2
data("economics")
head(economics)
# A tibble: 6 × 6
date pce pop psavert uempmed unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944
2 1967-08-01 510. 198911 12.6 4.7 2945
3 1967-09-01 516. 199113 11.9 4.6 2958
4 1967-10-01 512. 199311 12.9 4.9 3143
5 1967-11-01 517. 199498 12.8 4.7 3066
6 1967-12-01 525. 199657 11.8 4.8 3018
To calculate the % of the population that is unemployed
Code
economics$perc.unemploy <- (economics$unemploy / economics$pop) * 100 ## multiplying by 100 turns the ratio into a percentage
head(economics)
# A tibble: 6 × 7
date pce pop psavert uempmed unemploy perc.unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
Exercises (TX housing dataset)
To load the dataset
Code
library(ggplot2)
data("txhousing")
head(txhousing)
# A tibble: 6 × 9
city year month sales volume median listings inventory date
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
To create a success rate (sales as a percentage of listings)
Code
txhousing$successrate <- (txhousing$sales / txhousing$listings) * 100
head(txhousing)
# A tibble: 6 × 10
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000. 20
To calculate the % of total available listings that do not sell.
Code
txhousing$failrate <- ((txhousing$listings - txhousing$sales) / txhousing$listings) * 100
head(txhousing)
# A tibble: 6 × 11
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000. 20
# ℹ 1 more variable: failrate <dbl>
Chapter 6.6.1: Exercises
Problem A
Code
midwest %>% # dataset and pipe tool used to take output of one function & use as input for next
group_by(state) %>% # groups the data by the state column
summarize(poptotalmean = mean(poptotal), # calculates summary stats for each group & calculates the mean of the pop total
poptotalmed = median(poptotal), # calculates median of population total
popmax = max(poptotal), # finds maximum value of the population total
popmin = min(poptotal), # finds minimum value of the population total
popdistinct = n_distinct(poptotal), # counts the number of distinct values in population total column
popfirst = first(poptotal), # retrieves the first value of the pop total column for each state
popany = any(poptotal < 5000), # checks if any values in the pop total column are less than 5000 for each state
popany2 = any(poptotal > 2000000)) %>% # checks for any values in the pop total column that are greater than 2mil
ungroup() # removes the grouping structure from the data frame; useful if you plan to perform operations that should not be done by group
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem B
Code
midwest %>%
group_by(state) %>%
summarize(num5k = sum(poptotal < 5000), # number of counties in each state with population below 5000
num2mil = sum(poptotal > 2000000), # number of counties with population above 2 million
numrows = n()) %>% # number of counties (rows) in each state group
ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C
Code
midwest %>%
group_by(county) %>% # groups data by the county column
summarize(x = n_distinct(state)) %>% # calculate number of distinct states associated with each county
arrange(desc(x)) %>% # sorts the summarised data in descending order based on values in column x
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Code
# part II
# How does n() differ from n_distinct()? n() counts the total number of rows in the current group, whereas n_distinct() counts the number of unique (distinct) values in a specified column within the current group.
# When would they be the same? Different? The same when every row in a group has a distinct value in the column being counted; different when there are multiple rows for the same value in that column.
midwest %>%
group_by(county) %>%
summarize(x = n()) %>% # calculates total number of rows or observations for each county and stores in x
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Code
# part III # hint:
# - How many distinctly different counties are there for each county? If you group by county, you get one row per county, showing the total count of records associated with that county name.
# - Can there be more than 1 county in each county? No, but there can be multiple records for the same county name because it appears in more than one state.
# - What if we replace 'county' with 'state'? The output will show the total number of records for each state.
midwest %>%
group_by(county) %>%
summarize(x = n_distinct(county)) %>% # calculates the number of distinct counties within each county group, which is always 1
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
Problem D
Code
diamonds %>%
group_by(clarity) %>% # groups dataset by clarity column
summarize(a = n_distinct(color), # number of distinct colours within each clarity group
b = n_distinct(price), # number of distinct prices within each clarity group
c = n()) %>% # total number of diamonds within each clarity group
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem E
Code
# part I
diamonds %>%
group_by(color, cut) %>% # groups dataset by color and cut
summarize(m = mean(price), # calculates mean price of diamonds for each combination of color and cut
s = sd(price), # calculates standard deviation of price for each combination of color and cut
.groups = 'drop') %>% # drops the grouping as part of summarize()
ungroup() # redundant here, since .groups = 'drop' already removed the grouping
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
Code
# part II
diamonds %>%
group_by(cut, color) %>% # operations are identical
summarize(m = mean(price),
s = sd(price),
.groups = 'drop') %>%
ungroup() # again redundant, given .groups = 'drop'
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Code
# part III
# hint:
# - How good is the sale if the price of diamonds equaled msale?
# - If msale represents a price that is 20% off the mean price of diamonds in each group, you can evaluate how the sale compares to the original mean price
# - e.g. "The diamonds are x% off the original price in msale."
diamonds %>%
group_by(cut, color, clarity) %>% # groups dataset by the columns
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80, # calculates sale price that is 20% off mean price
.groups = 'drop')
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
Problem F
Code
diamonds %>%
group_by(cut) %>%
summarize(potato = mean(depth), # calculates mean depth of diamonds for each cut category
pizza = mean(price), # calculates mean price of diamonds for each cut category
popcorn = median(y), # calculates the median of the y column (diamond width in mm)
pineapple = potato - pizza, # calculates the difference between mean depth and mean price
papaya = pineapple ^ 2, # squares the value of pineapple
peach = n()) %>% # counts the total number of diamonds in each cut category
ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
Problem G
Code
# part I
diamonds %>%
group_by(color) %>% # groups dataset by colour variable
summarize(m = mean(price)) %>% # calculates mean price of diamonds for each colour group
mutate(x1 = str_c("Diamond color ", color), x2 = 5) %>% # adds new columns to summarize new dataset. X1 creates a new column (x1) that links the string diamond colour with the actual colour value. X2 creates a new column (x2) and assigns the value 5 to every row.
ungroup() # A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Code
# part II
# What does the first ungroup() do? Is it useful here? Why/why not? It removes the grouping structure after summarisation; it's useful if you plan to perform additional operations on the resulting data frame that should not be affected by the previous grouping.
# Why isn't there a closing ungroup() after the mutate()? Because the first ungroup() already ensured that subsequent operations do not treat the data as grouped, so another ungroup() after mutate() is unnecessary.
diamonds %>%
group_by(color) %>%
summarize(m = mean(price)) %>%
ungroup() %>%
# ungroup before mutate: (1) summarise mean price by colour, (2) ungrouping summarised data, (3) adding new columns (x1 & x2)
mutate(x1 = str_c("Diamond color ", color), x2 = 5)# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Problem H
Code
# part I
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>% # creates new column (x1) that contains half the value of the price for each diamond
summarize(m = mean(x1)) %>% # calculates mean of new column for each colour group stored in a new column named 'm'
ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Code
# part II
# What's the difference between part I and part II? (1) the position of ungroup(), and (2) its effect on the results: ungrouping before summarize() yields a single overall mean rather than one mean per colour.
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>% # removed grouping structure before summarising
summarize(m = mean(x1))
# A tibble: 1 × 1
m
<dbl>
1 1966.
Why is grouping data necessary? - crucial for performing calculations and analyses within subsets of data
Why is ungrouping data necessary? - necessary to ensure that further operations do not unintentionally respect previous groupings
When should you ungroup data? - after summary operations, or before applying transformations that should not be group-specific
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? - no; if there is no grouping in the code, ungroup() is not required
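A quick sketch illustrating the last point with the diamonds data, where ungroup() is a harmless no-op:
Code
diamonds %>%
mutate(newVar = 1 + 2) %>% # no group_by(), so the result is already ungrouped
ungroup() # has no effect here; shown only for illustration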
Chapter 6.7: Extra Practice
Code
diamonds_sorted_low_to_high <- diamonds %>% # sort diamonds by price (low to high)
arrange(price)
diamonds_sorted_high_to_low <- diamonds %>% # sort diamonds by price (high to low)
arrange(desc(price))
diamonds_sorted_low_price_cut <- diamonds %>% # sort diamonds by price (low to high, then by cut)
arrange(price, cut)
diamonds_sorted_high_price_cut <- diamonds %>% # sort diamonds by price (high to low, then by cut)
arrange(desc(price), cut)
diamonds_sorted_price_clarity <- diamonds %>% # sort diamonds by price and then by clarity (high to low)
arrange(price, desc(clarity))
diamonds_with_sale_price <- diamonds %>% # create a new column for sale price
mutate(salePrice = price - 250)
diamonds_selected <- diamonds %>% # select specific columns, removing x, y, and z
select(-x, -y, -z)
diamonds_count_by_cut <- diamonds %>% # count diamonds by cut
group_by(cut) %>%
summarise(count = n())
diamonds_with_total_num <- diamonds %>% #add total number of diamonds as a new column
mutate(totalNum = n())
Exercise for research methods:
Good Questions - “How does the price of diamonds change by the quality of the cut?”
“What impact does clarity have on the price of diamonds?”
Bad Questions - “What is the average price of a diamond?”
“Which diamond is the cheapest?”
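One way to start answering the first good question, using the grouped summaries practised above (a sketch, not a full analysis):
Code
diamonds %>%
group_by(cut) %>% # one group per cut quality
summarize(mean_price = mean(price),
median_price = median(price)) %>%
ungroup()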
Data Exploration: Cricket Dataset
To load and run data:
Code
options(repos = c(CRAN = "https://cloud.r-project.org/"))
library(tidyverse)
install.packages("modeldata")
Code
library(modeldata)
Code
?ggplot
?crickets
View(crickets)
The basics:
Code
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")To add colour to visualise differences between species:
Code
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")Modyifying basic properties of the plot:
Code
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 2,
alpha = .3,
shape = "square") +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")Code
# Learn more about the options for geom_point()
# with ?geom_point
Adding another layer:
Code
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")`geom_smooth()` using formula = 'y ~ x'
Code
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2") `geom_smooth()` using formula = 'y ~ x'
Other plots:
Code
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) # one quantitative variable
Code
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
Code
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black",
fill = "lightblue")Code
ggplot(crickets, aes(x = species,
fill = species)) +
geom_bar(show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2")Code
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
Code
?theme_minimal()
Faceting (a technique used to create multiple subplots within a single plot based on a categorical variable).
Code
# not great:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2") ## to assign a colour (package containing this colour)Code
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) + ## creates a grid of plots; useful with a moderate number of levels
scale_fill_brewer(palette = "Dark2")
Code
?facet_wrap
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Research Methods: ‘What is a good research hypothesis?’
A good research hypothesis is a specific, concise and testable statement that establishes an assumption which can be supported or falsified through data collection and statistical analysis. Serving as a foundation for the research, it guides data collection, frames the interpretation of the empirical evidence, and directs further investigation of the subject.
Week 5: Formative
Code
data("iris")Graph 1:
Code
boxplot(Sepal.Length ~ Species, data = iris,
main = "Boxplot of Sepal Length by Species",
xlab = "Species",
ylab = "Sepal Length (cm)",
col= c("lightblue", "lightgreen", "lightpink"))- Boxplots are used for visualising distribution of data and can be associated with several statistical tests: t-tests, ANOVA (post hoc), Kruskal-Wallis test, Mann-Whitney U test and Chi-Square tests.
Graph 2:
Code
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_density(alpha = 0.5) + # plot with transparency
labs(title = "Density plot of petal length",
x = "Petal length (cm)",
y = "Density") +
theme_minimal()
- Frequency and density graphs are useful for showing the shape of a distribution and are associated with the following tests: the Shapiro-Wilk test, the Kolmogorov-Smirnov test, the Anderson-Darling test, chi-square tests, t-tests, ANOVA (with post hocs), the Kruskal-Wallis test and the Mann-Whitney U test.
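As an illustration, a normality check on the plotted variable (a sketch, run within a single species so that the groups are not mixed):
Code
# Shapiro-Wilk test of normality for petal length within one species
shapiro.test(iris$Petal.Length[iris$Species == "setosa"])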
Graph 3:
Code
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point(size = 3, alpha = 0.7) + ## adjust size and transparency
geom_smooth(method = "lm", se = FALSE, color = "black") + ## line of best fit
labs(title = "Scatterplot of Petal Length vs. Petal Width",
x = "Petal Length (cm)",
y = "Petal Width (cm)") +
theme_minimal() +
scale_colour_brewer(palette = "Set1")`geom_smooth()` using formula = 'y ~ x'
- Scatterplots show possible relationships between two variables, help validate assumptions and identify outliers. Associated tests: the correlation coefficient (Pearson's r), Spearman's rank correlation test, linear regression, ANOVA, the chi-square test, residual analysis and Levene's test (homogeneity of variance).
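A sketch of the first of these tests on the plotted variables:
Code
# Pearson's r between petal length and petal width
cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")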
Graph 4:
Code
library(dplyr)
library(ggplot2)
# Create the size variable
iris <- iris %>%
mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "Small", "Big"))
head(iris) # should show the new 'size' column
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Small
2 4.9 3.0 1.4 0.2 setosa Small
3 4.7 3.2 1.3 0.2 setosa Small
4 4.6 3.1 1.5 0.2 setosa Small
5 5.0 3.6 1.4 0.2 setosa Small
6 5.4 3.9 1.7 0.4 setosa Small
Code
iris_summary <- iris %>%
group_by(Species, size) %>%
summarise(count = n(), .groups = "drop")
head(iris_summary) # should include Species, size, and count
# A tibble: 6 × 3
Species size count
<fct> <chr> <int>
1 setosa Big 1
2 setosa Small 49
3 versicolor Big 29
4 versicolor Small 21
5 virginica Big 47
6 virginica Small 3
Code
ggplot(iris_summary, aes(x = Species, y = count, fill = size)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Species", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Small" = "lightblue", "Big" = "salmon"))
- Bar graphs display the frequency count of values for a categorical variable and are associated with the following tests: the chi-squared test, t-tests, ANOVA (with post hocs), the Kruskal-Wallis test, the proportion test and Fisher's exact test.
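A sketch of the first of these on the counts behind Graph 4, using the size column created above:
Code
# Chi-squared test of association between species and size category
chisq.test(table(iris$Species, iris$size))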
Week 6: Formative
Code
# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Code
install.packages("gplots")
Code
library(gplots)
Code
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Code
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
main = "housetasks")Code
## The argument shade is used to color the graph.
# The argument las = 2 produces vertical labels
# Blue color indicates that the observed value is higher than the expected value if the data were random
# Red color indicates that the observed value is lower than the expected value if the data were random
Code
#There is another package named vcd, which can be used to make a mosaic plot (function mosaic()) or an association plot (function assoc()).
install.packages("vcd")
Code
library("vcd")Loading required package: grid
Code
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)
Code
# Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
# This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with df=(r−1)(c−1) degrees of freedom and p = 0.05.
# If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
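To make the critical-value comparison from the comments concrete (a sketch; df = (13 − 1) × (4 − 1) = 36 for this table):
Code
# Critical value at p = 0.05 with 36 degrees of freedom
qchisq(p = 0.95, df = 36) # ≈ 51.0, far below the observed X-squared of 1944.5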
Code
# Observed counts
chisq$observed
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
Code
# Expected counts
round(chisq$expected, 2)
Wife Alternating Husband Jointly
Laundry 60.55 25.63 38.45 51.37
Main_meal 52.64 22.28 33.42 44.65
Dinner 37.16 15.73 23.59 31.52
Breakfeast 48.17 20.39 30.58 40.86
Tidying 41.97 17.77 26.65 35.61
Dishes 38.88 16.46 24.69 32.98
Shopping 41.28 17.48 26.22 35.02
Official 33.03 13.98 20.97 28.02
Driving 47.82 20.24 30.37 40.57
Finances 38.88 16.46 24.69 32.98
Insurance 47.82 20.24 30.37 40.57
Repairs 56.77 24.03 36.05 48.16
Holidays 55.05 23.30 34.95 46.70
To know which cells contribute most to the total chi-square score, we calculate the chi-square statistic for each cell. This gives the Pearson residual r for each cell: r = (observed - expected) / sqrt(expected).
Pearson residuals can be extracted from the output of the function chisq.test().
Code
round(chisq$residuals, 3)
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
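As a quick check (not part of the original workflow), the same residuals can be computed by hand from the observed and expected counts:
Code
# Pearson residuals by hand; matches chisq$residuals above
r <- (chisq$observed - chisq$expected) / sqrt(chisq$expected)
round(r, 3)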
Code
# To visualise
install.packages("corrplot")
Code
library(corrplot)
Code
corrplot(chisq$residuals, is.cor = FALSE)
- Positive residuals are in blue. Positive values in cells indicate an attraction (positive association) between the corresponding row and column variables.
In the image above, it’s evident that there is an association between the column Wife and the rows Laundry and Main_meal. There is a strong positive association between the column Husband and the row Repairs.
- Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example, the column Wife is negatively associated (~ “not associated”) with the row Repairs, and there is a repulsion between the column Husband and the rows Laundry and Main_meal.
Code
# Contribution in percentage (%)
contrib <- 100 * chisq$residuals^2 / chisq$statistic
round(contrib, 3)
Wife Alternating Husband Jointly
Laundry 7.738 0.272 1.777 2.246
Main_meal 4.976 0.012 1.243 1.903
Dinner 2.197 0.073 0.600 0.560
Breakfeast 1.222 0.615 0.408 1.443
Tidying 0.149 0.133 1.270 0.661
Dishes 0.063 0.178 0.891 0.625
Shopping 0.085 0.090 0.581 0.586
Official 0.688 3.771 0.010 0.311
Driving 1.538 2.403 3.374 1.789
Finances 0.886 0.037 0.028 1.700
Insurance 1.705 0.941 0.868 1.683
Repairs 2.919 0.947 21.921 2.275
Holidays 2.831 1.098 1.233 12.445
Code
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
Code
# printing the p-value
chisq$p.value
[1] 0
Code
# printing the estimate; chisq.test() returns no estimate component, so this is NULL
chisq$estimate
NULL