πŸ§ͺ R Data Management Lesson Script

Helpful Hints:

  1. You can β€œtab” to finish commands. This avoids typos.
  2. Use β€œcommand/control” enter to run the line of code you are on.
  3. To comment out multiple lines of code, hold β€œCommand/Control” + β€œshift” + β€œc”
  4. You must install a package once on a new device or version of R. You must load the library every time you open R.
  5. Remember that β€œcode” has to be in the code chunk surrounded by the {r} and.
  6. You can use the little arrows by your line numbers to β€œhide” chunks of code. This can make large documents easier to navigate.
  7. Use hashtags to make headers, which will create your outline.
  8. For printing your HTML, there are a few things you can do to make them look cleaner.
  1. You can choose R markdown styles by changing the theme in your header (lines 1-5).
# output:
#   html_document:
#     theme: journal
#     toc: yes
#     toc_float:
#       collapsed: true

πŸ› οΈ PART 1: Install and Load Packages ———————–

# Install the tidyverse if you don't have it yet
# (Uncomment the line below if needed)

#install.packages("tidyverse")  

# Load tidyverse which includes dplyr, ggplot2, readr, etc.
library(tidyverse)  
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## βœ” dplyr     1.2.0     βœ” readr     2.2.0
## βœ” forcats   1.0.1     βœ” stringr   1.6.0
## βœ” ggplot2   4.0.2     βœ” tibble    3.3.1
## βœ” lubridate 1.9.5     βœ” tidyr     1.3.2
## βœ” purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## βœ– dplyr::filter() masks stats::filter()
## βœ– dplyr::lag()    masks stats::lag()
## β„Ή Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
help("ggplot2")
citation("tidyverse")
## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, FranΓ§ois R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, MΓΌller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain FranΓ§ois and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill MΓΌller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }

πŸ“ PART 2: Read in a Dataset β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”

# Load a CSV file from the web
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
tips <- read_csv(url)
## Rows: 244 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sex, smoker, day, time
## dbl (3): total_bill, tip, size
## 
## β„Ή Use `spec()` to retrieve the full column specification for this data.
## β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.
# data=read.csv("testdata.csv")

# Look at the data
head(tips)              # View the first few rows
## # A tibble: 6 Γ— 7
##   total_bill   tip sex    smoker day   time    size
##        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
## 1       17.0  1.01 Female No     Sun   Dinner     2
## 2       10.3  1.66 Male   No     Sun   Dinner     3
## 3       21.0  3.5  Male   No     Sun   Dinner     3
## 4       23.7  3.31 Male   No     Sun   Dinner     2
## 5       24.6  3.61 Female No     Sun   Dinner     4
## 6       25.3  4.71 Male   No     Sun   Dinner     4
tail(tips)
## # A tibble: 6 Γ— 7
##   total_bill   tip sex    smoker day   time    size
##        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
## 1       35.8  4.67 Female No     Sat   Dinner     3
## 2       29.0  5.92 Male   No     Sat   Dinner     3
## 3       27.2  2    Female Yes    Sat   Dinner     2
## 4       22.7  2    Male   Yes    Sat   Dinner     2
## 5       17.8  1.75 Male   No     Sat   Dinner     2
## 6       18.8  3    Female No     Thur  Dinner     2
str(tips)               # Structure: column names and types
## spc_tbl_ [244 Γ— 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ total_bill: num [1:244] 17 10.3 21 23.7 24.6 ...
##  $ tip       : num [1:244] 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : chr [1:244] "Female" "Male" "Male" "Male" ...
##  $ smoker    : chr [1:244] "No" "No" "No" "No" ...
##  $ day       : chr [1:244] "Sun" "Sun" "Sun" "Sun" ...
##  $ time      : chr [1:244] "Dinner" "Dinner" "Dinner" "Dinner" ...
##  $ size      : num [1:244] 2 3 3 2 4 4 2 4 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   total_bill = col_double(),
##   ..   tip = col_double(),
##   ..   sex = col_character(),
##   ..   smoker = col_character(),
##   ..   day = col_character(),
##   ..   time = col_character(),
##   ..   size = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(tips$bill_total) #use a "$" to isolate a specific variable
## Warning: Unknown or uninitialised column: `bill_total`.
## NULL
#using tidyverse
summary(tips)           # Quick stats summary
##    total_bill         tip             sex               smoker         
##  Min.   : 3.07   Min.   : 1.000   Length:244         Length:244        
##  1st Qu.:13.35   1st Qu.: 2.000   Class :character   Class :character  
##  Median :17.80   Median : 2.900   Mode  :character   Mode  :character  
##  Mean   :19.79   Mean   : 2.998                                        
##  3rd Qu.:24.13   3rd Qu.: 3.562                                        
##  Max.   :50.81   Max.   :10.000                                        
##      day                time                size     
##  Length:244         Length:244         Min.   :1.00  
##  Class :character   Class :character   1st Qu.:2.00  
##  Mode  :character   Mode  :character   Median :2.00  
##                                        Mean   :2.57  
##                                        3rd Qu.:3.00  
##                                        Max.   :6.00
dim(tips)
## [1] 244   7
glimpse(tips)
## Rows: 244
## Columns: 7
## $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.0…
## $ tip        <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23,…
## $ sex        <chr> "Female", "Male", "Male", "Male", "Female", "Male", "Male",…
## $ smoker     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ day        <chr> "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Su…
## $ time       <chr> "Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner",…
## $ size       <dbl> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, 3, 3, 3,…
?cor.test

πŸ” PART 3: Inspect Variables & Data Types —————–

names(tips)             # See all column names
## [1] "total_bill" "tip"        "sex"        "smoker"     "day"       
## [6] "time"       "size"
sapply(tips, class)     # View each column's data type
##  total_bill         tip         sex      smoker         day        time 
##   "numeric"   "numeric" "character" "character" "character" "character" 
##        size 
##   "numeric"
glimpse(tips)           # dplyr alternative to str()
## Rows: 244
## Columns: 7
## $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.0…
## $ tip        <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23,…
## $ sex        <chr> "Female", "Male", "Male", "Male", "Female", "Male", "Male",…
## $ smoker     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ day        <chr> "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Sun", "Su…
## $ time       <chr> "Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner",…
## $ size       <dbl> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, 3, 3, 3,…
# Convert character to factor if needed
tips$sex_Cat <- as.factor(tips$sex)
levels(tips$sex_Cat)        # Show possible values for factor
## [1] "Female" "Male"

🎯 PART 4: Selecting and Filtering Data with dplyr β€”β€”β€”

# Select specific columns
select(tips, sex, total_bill)
## # A tibble: 244 Γ— 2
##    sex    total_bill
##    <chr>       <dbl>
##  1 Female      17.0 
##  2 Male        10.3 
##  3 Male        21.0 
##  4 Male        23.7 
##  5 Female      24.6 
##  6 Male        25.3 
##  7 Male         8.77
##  8 Male        26.9 
##  9 Male        15.0 
## 10 Male        14.8 
## # β„Ή 234 more rows
# Filter rows based on condition
filter(tips, total_bill > 20)
## # A tibble: 97 Γ— 8
##    total_bill   tip sex    smoker day   time    size sex_Cat
##         <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <fct>  
##  1       21.0  3.5  Male   No     Sun   Dinner     3 Male   
##  2       23.7  3.31 Male   No     Sun   Dinner     2 Male   
##  3       24.6  3.61 Female No     Sun   Dinner     4 Female 
##  4       25.3  4.71 Male   No     Sun   Dinner     4 Male   
##  5       26.9  3.12 Male   No     Sun   Dinner     4 Male   
##  6       35.3  5    Female No     Sun   Dinner     4 Female 
##  7       21.6  3.92 Male   No     Sun   Dinner     2 Male   
##  8       20.6  3.35 Male   No     Sat   Dinner     3 Male   
##  9       20.3  2.75 Female No     Sat   Dinner     2 Female 
## 10       39.4  7.58 Male   No     Sat   Dinner     4 Male   
## # β„Ή 87 more rows
# Combine select and filter using pipes
tips_filtered <- tips %>%
  select(sex, total_bill) %>%
  filter(total_bill > 20)

# Select columns by position or range
select(tips, 1:3)            # First 3 columns
## # A tibble: 244 Γ— 3
##    total_bill   tip sex   
##         <dbl> <dbl> <chr> 
##  1      17.0   1.01 Female
##  2      10.3   1.66 Male  
##  3      21.0   3.5  Male  
##  4      23.7   3.31 Male  
##  5      24.6   3.61 Female
##  6      25.3   4.71 Male  
##  7       8.77  2    Male  
##  8      26.9   3.12 Male  
##  9      15.0   1.96 Male  
## 10      14.8   3.23 Male  
## # β„Ή 234 more rows
select(tips, -tip)           # All except 'tip'
## # A tibble: 244 Γ— 7
##    total_bill sex    smoker day   time    size sex_Cat
##         <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <fct>  
##  1      17.0  Female No     Sun   Dinner     2 Female 
##  2      10.3  Male   No     Sun   Dinner     3 Male   
##  3      21.0  Male   No     Sun   Dinner     3 Male   
##  4      23.7  Male   No     Sun   Dinner     2 Male   
##  5      24.6  Female No     Sun   Dinner     4 Female 
##  6      25.3  Male   No     Sun   Dinner     4 Male   
##  7       8.77 Male   No     Sun   Dinner     2 Male   
##  8      26.9  Male   No     Sun   Dinner     4 Male   
##  9      15.0  Male   No     Sun   Dinner     2 Male   
## 10      14.8  Male   No     Sun   Dinner     2 Male   
## # β„Ή 234 more rows

πŸ” PART 5: Grouping and Summarizing β€”β€”β€”β€”β€”β€”β€”β€”

# Group by a category and summarize
tips %>%
  group_by(sex) %>%
  summarise(
    count = n(),
    avg_tip = mean(tip),
    max_tip = max(tip)
  )
## # A tibble: 2 Γ— 4
##   sex    count avg_tip max_tip
##   <chr>  <int>   <dbl>   <dbl>
## 1 Female    87    2.83     6.5
## 2 Male     157    3.09    10
# Group by two variables
tips %>%
  group_by(sex, smoker) %>%
  summarise(
    mean_total = mean(total_bill),
    .groups = "drop"   # Removes group structure
  )
## # A tibble: 4 Γ— 3
##   sex    smoker mean_total
##   <chr>  <chr>       <dbl>
## 1 Female No           18.1
## 2 Female Yes          18.0
## 3 Male   No           19.8
## 4 Male   Yes          22.3

🧹 PART 6: Handling Missing Data β€”β€”β€”β€”β€”β€”β€”β€”β€”-

# Check for missing values
colSums(is.na(tips))
## total_bill        tip        sex     smoker        day       time       size 
##          0          0          0          0          0          0          0 
##    sex_Cat 
##          0
# Remove rows with any missing values
tips_clean <- na.omit(tips)

# Replace NAs with 0 in tip column
tips$tip[is.na(tips$tip)] <- 0

✍️ PART 7: Rename and Create New Columns β€”β€”β€”β€”β€”β€”-

# Rename columns
tips <- tips %>%
  rename(
    bill_total = total_bill,
    waiter_tip = tip
  )

# Create a new column: tip percentage
tips <- tips %>%
  mutate(
    tip_percent = (waiter_tip / bill_total) * 100,
    tip_percent = round(tip_percent, 1)
  )

πŸ”½ PART 8: Sorting Data ———————————–

# Sort by tip percentage descending
tips %>%
  arrange(desc(tip_percent)) %>%
  head(10)
## # A tibble: 10 Γ— 9
##    bill_total waiter_tip sex    smoker day   time    size sex_Cat tip_percent
##         <dbl>      <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <fct>         <dbl>
##  1       7.25       5.15 Male   Yes    Sun   Dinner     2 Male           71  
##  2       9.6        4    Female Yes    Sun   Dinner     2 Female         41.7
##  3       3.07       1    Female Yes    Sat   Dinner     1 Female         32.6
##  4      11.6        3.39 Male   No     Sat   Dinner     2 Male           29.2
##  5      23.2        6.5  Male   Yes    Sun   Dinner     4 Male           28.1
##  6      14.3        4    Female Yes    Sat   Dinner     2 Female         28  
##  7       7.51       2    Male   No     Thur  Lunch      2 Male           26.6
##  8      16.3        4.3  Female Yes    Fri   Dinner     2 Female         26.3
##  9      13.4        3.48 Female Yes    Fri   Lunch      2 Female         25.9
## 10      10.3        2.6  Female No     Sun   Dinner     2 Female         25.3

🧠 PART 9: Summary Data

# Mean tip by sex
aggregate(waiter_tip ~ sex, data = tips, FUN = mean)
##      sex waiter_tip
## 1 Female   2.833448
## 2   Male   3.089618
# Mean and max tip by sex and smoker status
aggregate(waiter_tip ~ sex + smoker, data = tips, FUN = function(x) c(mean = mean(x), max = max(x)))
##      sex smoker waiter_tip.mean waiter_tip.max
## 1 Female     No        2.773519       5.200000
## 2   Male     No        3.113402       9.000000
## 3 Female    Yes        2.931515       6.500000
## 4   Male    Yes        3.051167      10.000000
# Load dplyr if not already
library(dplyr)

# Mean tip by sex
tips %>%
  group_by(sex) %>%
  summarise(mean_tip = mean(waiter_tip, na.rm = TRUE))
## # A tibble: 2 Γ— 2
##   sex    mean_tip
##   <chr>     <dbl>
## 1 Female     2.83
## 2 Male       3.09
# Mean and max tip by sex and smoker
tips %>%
  group_by(sex, smoker) %>%
  summarise(
    mean_tip = mean(waiter_tip, na.rm = TRUE),
    max_tip = max(waiter_tip, na.rm = TRUE),
    .groups = "drop"  # Optional: drops grouping after summarise
  )
## # A tibble: 4 Γ— 4
##   sex    smoker mean_tip max_tip
##   <chr>  <chr>     <dbl>   <dbl>
## 1 Female No         2.77     5.2
## 2 Female Yes        2.93     6.5
## 3 Male   No         3.11     9  
## 4 Male   Yes        3.05    10
# Count number of observations by group (base R)
table(tips$sex, tips$smoker)
##         
##          No Yes
##   Female 54  33
##   Male   97  60
# Count with dplyr
tips %>%
  count(sex, smoker)
## # A tibble: 4 Γ— 3
##   sex    smoker     n
##   <chr>  <chr>  <int>
## 1 Female No        54
## 2 Female Yes       33
## 3 Male   No        97
## 4 Male   Yes       60

Practice Exercise with mpg dataset ————–

# Load built-in mpg dataset from ggplot2
data(mpg)

# Preview it
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
# Assignment steps
mpg_new <- mpg %>%
  select(manufacturer, model, cty, hwy) %>%
  mutate(avg_mpg = (cty + hwy) / 2) %>%
  filter(avg_mpg > 25) %>%
  drop_na() %>%
  rename(brand = manufacturer)

# View result
head(mpg_new)
## # A tibble: 6 Γ— 5
##   brand     model    cty   hwy avg_mpg
##   <chr>     <chr>  <int> <int>   <dbl>
## 1 audi      a4        20    31    25.5
## 2 audi      a4        21    30    25.5
## 3 chevrolet malibu    22    30    26  
## 4 honda     civic     28    33    30.5
## 5 honda     civic     24    32    28  
## 6 honda     civic     25    32    28.5