  1. Load the tidyverse library
  2. Load the skimr library
  3. Read in data_to_explore

Are you getting an error? Make sure to run install.packages("") in the console, with the name of the missing package between the quotes, to fix that.
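If you are not sure which package is missing, a quick check like the one below can tell you. This is only a sketch: the variable names (needed, is_installed, missing_pkgs) are made up for illustration, and it assumes the two packages loaded in this walkthrough are the only ones you need.

```r
# Packages this walkthrough loads with library()
needed <- c("tidyverse", "skimr")

# requireNamespace() returns FALSE (instead of stopping with an error)
# when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  # Build the install command to paste into the console
  message("Run in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```

The check only prints a suggestion; it does not install anything on its own.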

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
data_to_explore <- read_csv("data/data_to_explore.csv")
## Rows: 943 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): student_id, subject, semester, section, gender, enrollment_reason...
## dbl  (23): total_points_possible, total_points_earned, proportion_earned, ti...
## dttm  (3): date_x, date_y, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  4. Use the skim() function to view data_to_explore

👉 Your Turn

#skim the data by adding the skim function in front of the data
skim(data_to_explore)
Data summary
Name data_to_explore
Number of rows 943
Number of columns 34
_______________________
Column type frequency:
character 8
numeric 23
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
student_id 0 1.00 2 6 0 879 0
subject 0 1.00 4 5 0 5 0
semester 0 1.00 4 4 0 4 0
section 0 1.00 2 2 0 4 0
gender 227 0.76 1 1 0 2 0
enrollment_reason 227 0.76 5 34 0 5 0
enrollment_status 227 0.76 7 17 0 3 0
course_id 281 0.70 12 13 0 36 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_points_possible 226 0.76 1619.55 387.12 1212.00 1217.00 1676.00 1791.00 2425.00 ▇▂▆▁▃
total_points_earned 226 0.76 1229.98 510.64 0.00 1002.50 1177.13 1572.45 2413.50 ▂▂▇▅▂
proportion_earned 226 0.76 76.23 25.20 0.00 72.36 85.59 92.29 100.74 ▁▁▁▃▇
time_spent 232 0.75 1828.80 1363.13 0.45 895.57 1559.97 2423.94 8870.88 ▇▅▁▁▁
time_spent_hours 232 0.75 30.48 22.72 0.01 14.93 26.00 40.40 147.85 ▇▅▁▁▁
int 293 0.69 4.30 0.60 1.80 4.00 4.40 4.80 5.00 ▁▁▂▆▇
val 287 0.70 3.75 0.75 1.00 3.33 3.67 4.33 5.00 ▁▁▆▇▆
percomp 288 0.69 3.64 0.69 1.50 3.00 3.50 4.00 5.00 ▁▁▇▃▃
tv 292 0.69 4.07 0.59 1.00 3.71 4.12 4.46 5.00 ▁▁▂▇▇
q1 285 0.70 4.34 0.66 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q2 285 0.70 3.66 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▃
q3 286 0.70 3.31 0.85 1.00 3.00 3.00 4.00 5.00 ▁▂▇▅▂
q4 289 0.69 4.35 0.80 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q5 286 0.70 4.28 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q6 285 0.70 4.05 0.80 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▅
q7 286 0.70 3.96 0.85 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q8 286 0.70 4.35 0.65 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q9 286 0.70 3.55 0.92 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q10 285 0.70 4.17 0.87 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
post_int 848 0.10 3.88 0.94 1.00 3.50 4.00 4.50 5.00 ▁▁▃▇▇
post_uv 848 0.10 3.48 0.99 1.00 3.00 3.67 4.00 5.00 ▂▂▅▇▅
post_tv 848 0.10 3.71 0.90 1.00 3.29 3.86 4.29 5.00 ▁▂▃▇▆
post_percomp 848 0.10 3.47 0.88 1.00 3.00 3.50 4.00 5.00 ▁▂▂▇▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_x 393 0.58 2015-09-02 15:40:00 2016-05-24 15:53:00 2015-10-01 15:57:30 536
date_y 848 0.10 2015-09-02 15:31:00 2016-01-22 15:43:00 2016-01-04 13:25:00 95
date 834 0.12 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-25 18:43:00 107
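The complete_rate column in the output above is the share of non-missing values in each variable. For example, gender has 227 missing values out of 943 rows, so its complete_rate is about 0.76. The sketch below reproduces that calculation by hand on a tiny made-up tibble (toy and its values are hypothetical, not taken from data_to_explore).

```r
library(dplyr)

# Hypothetical toy data: 4 students, one missing gender, two missing tv scores
toy <- tibble(
  student_id = c("a1", "a2", "a3", "a4"),
  gender     = c("F", NA, "M", "F"),
  tv         = c(4.0, NA, NA, 3.5)
)

# Share of non-missing values per column, which is what complete_rate reports
complete_rates <- toy %>%
  summarise(across(everything(), ~ mean(!is.na(.x))))

complete_rates
```

Here student_id is fully observed, gender is three-quarters complete, and tv is half complete.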

In the code chunk below:

  1. use data_to_explore, then
  2. group_by() the subject variable, then
  3. add the skim() function

👉 #3 Your Turn

group_df <- data_to_explore %>%
  group_by(subject) %>%
  skim()

group_df
Data summary
Name Piped data
Number of rows 943
Number of columns 34
_______________________
Column type frequency:
character 7
numeric 23
POSIXct 3
________________________
Group variables subject

Variable type: character

skim_variable subject n_missing complete_rate min max empty n_unique whitespace
student_id AnPhA 0 1.00 2 6 0 207 0
student_id BioA 0 1.00 3 6 0 47 0
student_id FrScA 0 1.00 2 6 0 414 0
student_id OcnA 0 1.00 2 6 0 171 0
student_id PhysA 0 1.00 3 6 0 74 0
semester AnPhA 0 1.00 4 4 0 4 0
semester BioA 0 1.00 4 4 0 4 0
semester FrScA 0 1.00 4 4 0 4 0
semester OcnA 0 1.00 4 4 0 4 0
semester PhysA 0 1.00 4 4 0 4 0
section AnPhA 0 1.00 2 2 0 2 0
section BioA 0 1.00 2 2 0 1 0
section FrScA 0 1.00 2 2 0 4 0
section OcnA 0 1.00 2 2 0 3 0
section PhysA 0 1.00 2 2 0 1 0
gender AnPhA 45 0.79 1 1 0 2 0
gender BioA 4 0.92 1 1 0 2 0
gender FrScA 130 0.70 1 1 0 2 0
gender OcnA 42 0.76 1 1 0 2 0
gender PhysA 6 0.92 1 1 0 2 0
enrollment_reason AnPhA 45 0.79 5 34 0 4 0
enrollment_reason BioA 4 0.92 5 34 0 5 0
enrollment_reason FrScA 130 0.70 5 34 0 5 0
enrollment_reason OcnA 42 0.76 5 34 0 5 0
enrollment_reason PhysA 6 0.92 5 34 0 4 0
enrollment_status AnPhA 45 0.79 7 17 0 2 0
enrollment_status BioA 4 0.92 7 17 0 3 0
enrollment_status FrScA 130 0.70 7 17 0 3 0
enrollment_status OcnA 42 0.76 7 17 0 3 0
enrollment_status PhysA 6 0.92 7 17 0 2 0
course_id AnPhA 58 0.72 13 13 0 7 0
course_id BioA 7 0.86 12 12 0 4 0
course_id FrScA 150 0.66 13 13 0 12 0
course_id OcnA 55 0.69 12 12 0 9 0
course_id PhysA 11 0.85 13 13 0 4 0

Variable type: numeric

skim_variable subject n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_points_possible AnPhA 45 0.79 1776.52 12.28 1655.00 1775.00 1775.00 1775.00 1805.00 ▁▁▁▇▁
total_points_possible BioA 4 0.92 2421.00 2.02 2420.00 2420.00 2420.00 2420.00 2425.00 ▇▁▁▁▂
total_points_possible FrScA 129 0.70 1230.81 38.26 1212.00 1212.00 1217.00 1232.00 1361.00 ▇▁▁▁▁
total_points_possible OcnA 42 0.76 1738.47 78.48 1480.00 1676.00 1676.00 1833.00 1833.00 ▁▁▇▁▇
total_points_possible PhysA 6 0.92 2225.00 0.00 2225.00 2225.00 2225.00 2225.00 2225.00 ▁▁▇▁▁
total_points_earned AnPhA 45 0.79 1340.16 423.45 0.00 1269.09 1511.14 1616.37 1732.52 ▁▁▁▂▇
total_points_earned BioA 4 0.92 1546.66 813.01 0.00 1035.16 1865.13 2198.50 2413.50 ▃▁▁▃▇
total_points_earned FrScA 129 0.70 952.30 305.60 0.00 914.92 1062.75 1130.00 1319.02 ▁▁▁▅▇
total_points_earned OcnA 42 0.76 1283.25 427.25 0.00 1216.68 1396.85 1572.50 1786.76 ▁▁▁▆▇
total_points_earned PhysA 6 0.92 1898.45 469.31 110.00 1891.75 2072.00 2149.12 2216.00 ▁▁▁▂▇
proportion_earned AnPhA 45 0.79 75.44 23.84 0.00 71.57 84.90 90.96 97.61 ▁▁▁▂▇
proportion_earned BioA 4 0.92 63.89 33.58 0.00 42.78 77.07 90.85 99.73 ▃▁▁▃▇
proportion_earned FrScA 129 0.70 77.42 24.82 0.00 74.85 86.43 92.19 100.74 ▁▁▁▃▇
proportion_earned OcnA 42 0.76 73.99 24.70 0.00 69.76 81.60 91.04 99.22 ▁▁▁▃▇
proportion_earned PhysA 6 0.92 85.32 21.09 4.94 85.02 93.12 96.59 99.60 ▁▁▁▂▇
time_spent AnPhA 45 0.79 2374.39 1669.58 0.45 1209.85 2164.90 3134.97 7084.70 ▆▇▃▂▁
time_spent BioA 5 0.90 1404.57 1528.14 1.22 297.02 827.30 1955.08 6664.45 ▇▂▁▁▁
time_spent FrScA 134 0.69 1591.90 1016.76 2.42 935.03 1404.90 2130.75 6537.02 ▇▇▂▁▁
time_spent OcnA 42 0.76 2031.44 1496.82 0.58 1133.47 1800.22 2573.45 8870.88 ▇▆▂▁▁
time_spent PhysA 6 0.92 1431.76 990.40 0.70 749.32 1282.81 2049.85 5373.35 ▇▆▃▁▁
time_spent_hours AnPhA 45 0.79 39.57 27.83 0.01 20.16 36.08 52.25 118.08 ▆▇▃▂▁
time_spent_hours BioA 5 0.90 23.41 25.47 0.02 4.95 13.79 32.58 111.07 ▇▂▁▁▁
time_spent_hours FrScA 134 0.69 26.53 16.95 0.04 15.58 23.42 35.51 108.95 ▇▇▂▁▁
time_spent_hours OcnA 42 0.76 33.86 24.95 0.01 18.89 30.00 42.89 147.85 ▇▆▂▁▁
time_spent_hours PhysA 6 0.92 23.86 16.51 0.01 12.49 21.38 34.16 89.56 ▇▆▃▁▁
int AnPhA 62 0.70 4.42 0.57 1.80 4.00 4.40 5.00 5.00 ▁▁▁▅▇
int BioA 9 0.82 3.69 0.63 2.40 3.35 3.80 4.00 5.00 ▂▆▇▆▂
int FrScA 154 0.65 4.42 0.52 2.60 4.00 4.40 5.00 5.00 ▁▁▃▃▇
int OcnA 56 0.68 4.24 0.58 2.20 4.00 4.20 4.60 5.00 ▁▁▂▇▆
int PhysA 12 0.84 4.00 0.65 2.20 3.60 4.00 4.40 5.00 ▁▂▆▇▅
val AnPhA 59 0.72 4.29 0.62 1.00 4.00 4.33 4.67 5.00 ▁▁▁▅▇
val BioA 7 0.86 3.50 0.58 2.67 3.00 3.33 3.67 5.00 ▆▆▇▁▂
val FrScA 155 0.64 3.53 0.72 1.67 3.00 3.67 4.00 5.00 ▂▅▇▅▂
val OcnA 55 0.69 3.62 0.77 1.00 3.00 3.67 4.00 5.00 ▁▁▅▇▃
val PhysA 11 0.85 3.89 0.56 2.00 3.67 4.00 4.33 5.00 ▁▁▇▇▃
percomp AnPhA 61 0.71 3.80 0.67 2.00 3.50 4.00 4.50 5.00 ▂▃▇▆▇
percomp BioA 8 0.84 3.34 0.75 2.00 3.00 3.00 4.00 5.00 ▅▇▃▇▂
percomp FrScA 152 0.65 3.64 0.63 1.50 3.00 3.50 4.00 5.00 ▁▁▇▅▃
percomp OcnA 56 0.68 3.57 0.67 2.00 3.00 3.50 4.00 5.00 ▂▇▆▅▅
percomp PhysA 11 0.85 3.56 0.84 2.00 3.00 3.50 4.00 5.00 ▅▅▇▅▇
tv AnPhA 60 0.71 4.35 0.57 1.00 4.00 4.43 4.83 5.00 ▁▁▁▅▇
tv BioA 9 0.82 3.61 0.56 2.29 3.14 3.57 3.86 5.00 ▁▃▇▂▁
tv FrScA 156 0.64 4.04 0.52 2.29 3.71 4.00 4.43 5.00 ▁▂▆▇▅
tv OcnA 55 0.69 3.97 0.62 1.71 3.71 4.00 4.38 5.00 ▁▁▂▇▅
tv PhysA 12 0.84 3.94 0.56 2.14 3.57 4.00 4.29 5.00 ▁▂▃▇▂
q1 AnPhA 59 0.72 4.43 0.64 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q1 BioA 7 0.86 3.76 0.66 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▁
q1 FrScA 153 0.65 4.50 0.57 2.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q1 OcnA 55 0.69 4.20 0.69 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q1 PhysA 11 0.85 4.03 0.72 2.00 4.00 4.00 4.50 5.00 ▁▃▁▇▃
q2 AnPhA 59 0.72 4.30 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q2 BioA 7 0.86 3.48 0.71 2.00 3.00 3.00 4.00 5.00 ▁▇▁▆▁
q2 FrScA 152 0.65 3.35 0.89 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q2 OcnA 56 0.68 3.46 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▂
q2 PhysA 11 0.85 4.03 0.76 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q3 AnPhA 60 0.71 3.53 0.87 1.00 3.00 3.00 4.00 5.00 ▁▁▇▅▃
q3 BioA 7 0.86 2.98 0.87 2.00 2.00 3.00 3.00 5.00 ▅▇▁▂▁
q3 FrScA 152 0.65 3.25 0.79 1.00 3.00 3.00 4.00 5.00 ▁▂▇▃▁
q3 OcnA 56 0.68 3.30 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▂
q3 PhysA 11 0.85 3.32 0.95 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q4 AnPhA 61 0.71 4.52 0.78 1.00 4.00 5.00 5.00 5.00 ▁▁▁▃▇
q4 BioA 7 0.86 3.69 0.81 2.00 3.00 4.00 4.00 5.00 ▂▃▁▇▂
q4 FrScA 154 0.65 4.44 0.74 1.00 4.00 5.00 5.00 5.00 ▁▁▁▅▇
q4 OcnA 56 0.68 4.29 0.75 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q4 PhysA 11 0.85 4.02 0.87 2.00 4.00 4.00 5.00 5.00 ▁▃▁▇▆
q5 AnPhA 59 0.72 4.36 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 BioA 8 0.84 3.88 0.68 2.00 4.00 4.00 4.00 5.00 ▁▃▁▇▂
q5 FrScA 153 0.65 4.38 0.62 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 OcnA 55 0.69 4.20 0.77 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▆
q5 PhysA 11 0.85 4.06 0.67 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q6 AnPhA 59 0.72 4.50 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q6 BioA 7 0.86 3.83 0.70 3.00 3.00 4.00 4.00 5.00 ▅▁▇▁▂
q6 FrScA 153 0.65 3.88 0.79 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▃
q6 OcnA 55 0.69 3.84 0.84 1.00 3.00 4.00 4.00 5.00 ▁▁▅▇▃
q6 PhysA 11 0.85 4.27 0.68 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q7 AnPhA 60 0.71 4.08 0.85 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q7 BioA 8 0.84 3.71 0.96 2.00 3.00 4.00 4.00 5.00 ▂▇▁▇▆
q7 FrScA 152 0.65 4.02 0.83 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q7 OcnA 55 0.69 3.83 0.82 2.00 3.00 4.00 4.00 5.00 ▁▆▁▇▅
q7 PhysA 11 0.85 3.81 0.90 2.00 3.00 4.00 4.00 5.00 ▂▅▁▇▅
q8 AnPhA 60 0.71 4.45 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▇▇
q8 BioA 7 0.86 3.79 0.72 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▂
q8 FrScA 152 0.65 4.45 0.58 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▇
q8 OcnA 55 0.69 4.33 0.60 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▆
q8 PhysA 12 0.84 4.05 0.73 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q9 AnPhA 59 0.72 4.07 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q9 BioA 7 0.86 3.19 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▁
q9 FrScA 154 0.65 3.37 0.91 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q9 OcnA 55 0.69 3.54 0.91 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q9 PhysA 11 0.85 3.38 0.83 2.00 3.00 3.00 4.00 5.00 ▃▇▁▇▂
q10 AnPhA 59 0.72 4.35 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q10 BioA 8 0.84 3.37 0.89 2.00 3.00 3.00 4.00 5.00 ▂▇▁▅▂
q10 FrScA 152 0.65 4.30 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▂▆▇
q10 OcnA 55 0.69 4.13 0.93 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
q10 PhysA 11 0.85 3.78 0.89 2.00 3.00 4.00 4.00 5.00 ▂▆▁▇▅
post_int AnPhA 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_int BioA 40 0.18 3.06 0.69 1.75 2.75 3.00 3.25 4.25 ▂▃▇▂▂
post_int FrScA 392 0.10 4.00 0.93 1.50 3.75 4.00 4.88 5.00 ▁▃▁▇▇
post_int OcnA 157 0.10 4.33 0.56 3.00 4.00 4.25 4.75 5.00 ▁▂▅▅▇
post_int PhysA 50 0.32 3.75 0.88 1.50 3.50 4.00 4.25 5.00 ▁▁▂▇▂
post_uv AnPhA 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_uv BioA 40 0.18 3.11 0.80 1.67 2.67 3.33 3.67 4.33 ▂▃▂▇▂
post_uv FrScA 392 0.10 3.38 1.11 1.00 2.67 3.67 4.00 5.00 ▃▃▆▇▆
post_uv OcnA 157 0.10 3.93 0.88 1.33 3.67 4.00 4.58 5.00 ▁▁▁▇▇
post_uv PhysA 50 0.32 3.57 0.66 1.67 3.33 3.67 4.00 4.67 ▁▁▃▇▂
post_tv AnPhA 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_tv BioA 40 0.18 3.08 0.70 1.71 2.86 3.00 3.29 4.29 ▂▂▇▃▂
post_tv FrScA 392 0.10 3.73 0.96 1.29 3.29 4.00 4.43 5.00 ▁▃▅▆▇
post_tv OcnA 157 0.10 4.16 0.60 3.00 3.86 4.14 4.71 4.86 ▂▁▅▅▇
post_tv PhysA 50 0.32 3.67 0.74 1.57 3.43 3.86 4.04 4.71 ▂▁▃▇▅
post_percomp AnPhA 209 0.00 3.00 NA 3.00 3.00 3.00 3.00 3.00 ▁▁▇▁▁
post_percomp BioA 40 0.18 3.06 0.58 2.00 2.50 3.50 3.50 3.50 ▂▃▁▂▇
post_percomp FrScA 392 0.10 3.51 0.96 1.00 3.00 3.50 4.00 5.00 ▁▂▆▇▅
post_percomp OcnA 157 0.10 3.69 0.75 2.00 3.50 4.00 4.00 5.00 ▃▁▆▇▃
post_percomp PhysA 50 0.32 3.40 0.91 1.50 3.00 3.50 4.00 4.50 ▂▂▂▆▇

Variable type: POSIXct

skim_variable subject n_missing complete_rate min max median n_unique
date_x AnPhA 80 0.62 2015-09-02 15:40:00 2016-03-23 16:11:00 2015-09-27 20:10:30 129
date_x BioA 9 0.82 2015-09-08 19:52:00 2016-03-09 14:07:00 2015-09-16 14:27:00 40
date_x FrScA 215 0.51 2015-09-08 13:10:00 2016-04-27 02:12:00 2015-10-08 19:19:30 218
date_x OcnA 75 0.57 2015-09-08 20:08:00 2016-03-03 15:57:00 2016-01-25 20:17:00 97
date_x PhysA 14 0.81 2015-09-09 12:24:00 2016-05-24 15:53:00 2015-10-08 21:17:00 60
date_y AnPhA 209 0.00 2015-09-02 15:31:00 2015-09-02 15:31:00 2015-09-02 15:31:00 1
date_y BioA 40 0.18 2015-11-17 03:04:00 2016-01-21 23:38:00 2016-01-16 23:48:00 9
date_y FrScA 392 0.10 2015-09-09 15:21:00 2016-01-22 15:43:00 2016-01-04 13:13:00 43
date_y OcnA 157 0.10 2015-09-12 15:56:00 2016-01-08 17:51:00 2015-09-18 04:08:30 18
date_y PhysA 50 0.32 2015-09-14 14:45:00 2016-01-22 05:36:00 2016-01-17 08:24:30 24
date AnPhA 189 0.10 2017-01-23 14:28:00 2017-02-10 15:25:00 2017-02-01 17:09:00 21
date BioA 47 0.04 2017-02-06 20:12:00 2017-02-09 19:15:00 2017-02-08 07:43:30 2
date FrScA 372 0.14 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-24 17:23:00 62
date OcnA 155 0.11 2017-01-23 14:07:00 2017-02-09 18:45:00 2017-02-01 21:53:30 20
date PhysA 71 0.04 2017-01-30 14:41:00 2017-02-03 15:23:00 2017-02-02 20:54:00 3
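Grouping before skim() gives every statistic once per subject. If you only need one or two of those numbers, the same group_by() idea works with summarise(). The sketch below uses a small made-up tibble (toy is hypothetical) rather than the real data.

```r
library(dplyr)

# Hypothetical toy data: two subjects, one missing value
toy <- tibble(
  subject          = c("BioA", "BioA", "OcnA", "OcnA", "OcnA"),
  time_spent_hours = c(10, NA, 30, 20, 40)
)

# One row per subject, with the number of missing values and the mean
by_subject <- toy %>%
  group_by(subject) %>%
  summarise(
    n_missing  = sum(is.na(time_spent_hours)),
    mean_hours = mean(time_spent_hours, na.rm = TRUE)
  )

by_subject
```

With the real data you would swap toy for data_to_explore and keep the rest of the pipeline the same.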

ggplot2 is designed to work iteratively. You start with a layer that shows the raw data, then add layers of annotations and statistical summaries.
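To make the layering idea concrete, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). It starts with the raw points, then adds a statistical summary and a title. The choice of a straight-line smoother is an assumption for illustration only.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  time_spent_hours  = c(5, 12, 20, 31, 44, 60),
  proportion_earned = c(40, 55, 70, 72, 88, 91)
)

p <- ggplot(toy, aes(x = time_spent_hours, y = proportion_earned)) +
  # layer 1: the raw data
  geom_point() +
  # layer 2: a statistical summary (a fitted straight line)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # annotation: a title
  labs(title = "Toy example: layers added one at a time")

p
```

You can run the code after each + to watch the plot build up one layer at a time.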

You can read more about ggplot2 in the book “ggplot2: Elegant Graphics for Data Analysis”. You can also find plenty of inspiration, with code, in the R Graph Gallery. Finally, the ggplot2 cheat sheet is a handy reference.

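“ggplot2: Elegant Graphics for Data Analysis” states that every ggplot2 plot has three key components: data, a set of aesthetic mappings between variables in the data and visual properties, and at least one layer that describes how to render each observation (layers are usually created with a geom function).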

One variable

Create a basic visualization that examines a single variable of interest.

Barplot

Which online course had the largest enrollment numbers?

Which variable should we be looking at?

👉 Your Turn

#inspect the data frame
data_to_explore
## # A tibble: 943 × 34
##    student_id subject semester section total_points_possible total_points_earned
##    <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1 43146      FrScA   S216     02                       1217               1150 
##  2 44638      OcnA    S116     01                       1676               1384.
##  3 47448      FrScA   S216     01                       1232               1116 
##  4 47979      OcnA    S216     01                       1833               1493.
##  5 48797      PhysA   S116     01                       2225               1995.
##  6 51943      FrScA   S216     03                       1222                 70 
##  7 52326      AnPhA   S216     01                       1775               1519.
##  8 52446      PhysA   S116     01                       2225               2198 
##  9 53447      FrScA   S116     01                       1212               1173 
## 10 53475      FrScA   S116     02                       1212                  0 
## # ℹ 933 more rows
## # ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
## #   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
## #   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
## #   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
## #   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Level a. The most basic level for a plot

Includes:

  • data: data_to_explore
  • aes() function: one categorical variable:
    • subject mapped to x position
  • Geom: geom_bar() function - bar graph
ggplot(data_to_explore, aes(x = subject)) +
  geom_bar()  
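geom_bar() counts the rows for each subject and draws one bar per count. If you want the same numbers as a table, count() does that counting directly. The sketch below uses a made-up tibble (toy is hypothetical).

```r
library(dplyr)

# Hypothetical toy data: one row per enrollment
toy <- tibble(
  subject = c("FrScA", "OcnA", "FrScA", "PhysA", "FrScA", "OcnA")
)

# Count rows per subject, largest first
enrollments <- toy %>%
  count(subject, sort = TRUE)

enrollments
```

The first row of the result is the subject with the tallest bar in the plot.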

Level b. Add another layer with labels

  • title: “Number of Student Enrollments per Subject”
  • caption: “Which online courses have had the largest enrollment numbers?”
ggplot(data_to_explore, aes(x = subject)) +
  geom_bar() +
  labs(title = "Number of Student Enrollments per Subject",
       caption = "Which online courses have had the largest enrollment numbers?")

Level c. Add a scale with a different color.

  • scale: fill = gender

What can we notice about gender?
ggplot(data_to_explore, aes(x = subject, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution of Students Across Subjects",
       caption = "Which subjects enroll more female students?")
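Because the subjects have very different enrollment sizes, stacked counts can make the gender mix hard to compare. One option, shown here as a sketch on made-up data (toy is hypothetical), is position = "fill", which rescales every bar to the same height so the bars show shares instead of counts.

```r
library(ggplot2)

# Hypothetical toy data, including one missing gender
toy <- data.frame(
  subject = c("BioA", "BioA", "BioA", "OcnA", "OcnA", "OcnA", "OcnA"),
  gender  = c("F", "F", "M", "F", "M", "M", NA)
)

p_share <- ggplot(toy, aes(x = subject, fill = gender)) +
  # position = "fill" stacks the bars and scales each one to a height of 1
  geom_bar(position = "fill") +
  labs(y = "Share of enrollments")

p_share
```

Missing genders get their own segment, which is a useful reminder of how much data is NA.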

Histogram - You try

  • data: data_to_explore
  • aes() function - one continuous variable:
    • tv variable mapped to x position
  • Geom: geom_histogram(). This code is already there; you just need to un-comment it.
  • Add a title: “Number of Hours Students Watch TV per Day”
  • Add a caption that poses the question “Approximately how many students watch 4+ hours of TV per day?”

NEED HELP? TRY STHDA

Yours could look like something below…

👉 Your Turn

ggplot(data_to_explore, aes(x = tv)) + 
  geom_histogram(bins = 5) +
  labs(title = "Number of Hours Students Watch TV per Day", 
       caption = "Approximately how many students watch 4+ hours of TV per day?")

Or maybe you added a theme, such as theme_classic():

data_to_explore %>%
  ggplot(aes(x = tv)) +
  geom_histogram(bins = 5, fill = "red", colour = "black") +
  labs(title = "Number of Hours Students Watch TV per Day", 
       caption = "Approximately how many students watch 4+ hours of TV per day?") +
  theme_classic()
## Warning: Removed 292 rows containing non-finite values (`stat_bin()`).
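The warning above appears because tv has 292 missing values (the same n_missing that skim() reported), and the histogram cannot place an NA in a bin. The sketch below, on a made-up tibble (toy is hypothetical), counts the missing values first and then filters them out before plotting, which is one way to avoid the warning.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with two missing tv values
toy <- tibble(tv = c(3.5, 4.0, NA, 4.5, NA, 5.0))

# How many rows would the histogram have to drop?
n_missing_tv <- sum(is.na(toy$tv))
n_missing_tv

# Remove the missing values before plotting
p_hist <- toy %>%
  filter(!is.na(tv)) %>%
  ggplot(aes(x = tv)) +
  geom_histogram(bins = 5)

p_hist
```

Either approach draws the same bars; filtering first just makes the dropped rows an explicit choice.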

Two categorical variables

Create a basic visualization that examines the relationship between two categorical variables.

RESEARCH QUESTION: What do you wonder about the reasons for enrollment in various courses?

Heatmap

  • data: data_to_explore
  • use count() function for subject and enrollment_reason, then
  • ggplot() function
  • aes() function - two categorical variables
    • subject variable mapped to x position
    • enrollment_reason variable mapped to y position
    • n (the count) mapped to fill
  • Geom: geom_tile() function
  • Add a title “Reasons for Enrollment by Subject”
  • Add a caption: “Which subjects were the least available at local schools?”

👉 Your Turn

data_to_explore %>% 
  count(subject, enrollment_reason) %>% 
  ggplot() + 
  geom_tile(mapping = aes(x = subject, 
                          y = enrollment_reason, 
                          fill = n)) + 
  labs(title = "Reasons for Enrollment by Subject", 
       caption = "Which subjects were the least available at local schools?")
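It helps to look at what count() hands to ggplot() before the tiles are drawn. With two variables, count() returns one row per combination that occurs in the data, plus a column n, and n is what the heatmap maps to fill. The sketch below uses made-up values (toy and its enrollment reasons are hypothetical).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  subject = c("BioA", "BioA", "OcnA", "OcnA", "OcnA"),
  enrollment_reason = c("Course Unavailable at Local School", "Other",
                        "Other", "Other", NA)
)

# One row per subject/reason combination, with the count in column n
tile_data <- toy %>%
  count(subject, enrollment_reason)

tile_data
```

Combinations that never occur get no row, so they show up as empty cells in the heatmap.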

Two continuous variables

Create a basic visualization that examines the relationship between two continuous variables.

Scatter plot

RESEARCH QUESTION: Can we predict the grade on a course from the time spent in the course LMS?

Which variables should we be looking at?

#look at the data frame
data_to_explore
## # A tibble: 943 × 34
##    student_id subject semester section total_points_possible total_points_earned
##    <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1 43146      FrScA   S216     02                       1217               1150 
##  2 44638      OcnA    S116     01                       1676               1384.
##  3 47448      FrScA   S216     01                       1232               1116 
##  4 47979      OcnA    S216     01                       1833               1493.
##  5 48797      PhysA   S116     01                       2225               1995.
##  6 51943      FrScA   S216     03                       1222                 70 
##  7 52326      AnPhA   S216     01                       1775               1519.
##  8 52446      PhysA   S116     01                       2225               2198 
##  9 53447      FrScA   S116     01                       1212               1173 
## 10 53475      FrScA   S116     02                       1212                  0 
## # ℹ 933 more rows
## # ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
## #   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
## #   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
## #   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
## #   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Level a. The most basic level for a plot

Includes:

  • data: data_to_explore
  • aes() function - two continuous variables
    • time spent in hours mapped to x position
    • proportion earned mapped to y position
  • Geom: geom_point() function - Scatter plot

👉 Your Turn

#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore,
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: +  geom function type
  geom_point() 
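A scatter plot shows the shape of the relationship. If you also want a single number for how strongly the two variables move together, a correlation is one rough option. The sketch below uses base R on made-up values (toy is hypothetical); use = "complete.obs" is needed because both variables contain NAs.

```r
# Hypothetical toy data with a missing value in each column
toy <- data.frame(
  time_spent_hours  = c(5, 12, NA, 31, 44, 60),
  proportion_earned = c(40, 55, 70, NA, 88, 91)
)

# Pearson correlation, using only rows where both values are present
r <- cor(toy$time_spent_hours, toy$proportion_earned, use = "complete.obs")
r
```

A correlation only describes a straight-line pattern, so it complements the plot rather than replacing it.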

Level b. Add another layer with labels

  • Add a title: “How Time Spent on Course LMS is Related to Points Earned in the course”
  • Add an x label: “Time Spent (Hours)”
  • Add a y label: “Proportion of Points Earned”

👉 Your Turn

#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: +  geom function type
  geom_point() +
#layer 3: add labels
  labs(title = "How Time Spent on Course LMS is Related to Points Earned in the course", 
       x = "Time Spent (Hours)", 
       y = "Proportion of Points Earned")

Level c. Add a scale with a different color.

RESEARCH QUESTION: Can we notice anything about enrollment status?
  • Add scale: color = enrollment_status

👉 Your Turn

#layer 1: add data and aesthetics mapping 
#layer 4: add color scale by type
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned,
           color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 3: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the course", 
       x="Time Spent (Hours)", 
       y = "Proportion of Points Earned")

Level d. Divide up graphs using facet to visualize by subject.

  • Add facet with facet_wrap() function: by subject

👉 Your Turn

#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 4: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned")+
#layer 5: add facet wrap
  facet_wrap(~ subject) 

Level e. How can we remove NAs from the plot? And what will the code look like without the comments?

You can pipe the data frame into the drop_na() function before plotting.

  • use the data, then
  • use the drop_na() function to remove NAs from enrollment_status, then
  • add the ggplot() function as above

👉 Your Turn

data_to_explore %>%
  drop_na(enrollment_status) %>%
  ggplot(aes(x = time_spent_hours, 
             y = proportion_earned, 
             color = enrollment_status)) +
  geom_point() +
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned")+
  facet_wrap(~ subject)
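Note that drop_na(enrollment_status) removes only the rows where enrollment_status is missing. Rows with NAs in other columns stay in the data. The sketch below shows the difference on a made-up tibble (toy and its values are hypothetical).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one missing status, one missing time
toy <- tibble(
  enrollment_status = c("Approved/Enrolled", NA, "Dropped", "Approved/Enrolled"),
  time_spent_hours  = c(10, 20, NA, 30)
)

# Drops only rows where enrollment_status is NA
only_status <- toy %>% drop_na(enrollment_status)
nrow(only_status)

# With no columns named, drops rows with an NA in any column
all_cols <- toy %>% drop_na()
nrow(all_cols)
```

Naming the column keeps as much data as possible, which is why the plot above can still warn about points with missing x or y values.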