ggplot_only KEY

Read in library tidyverse
Read in library skimr
Read in data_to_explore

Are you getting an error? Make sure to run install.packages("") in the console, with the package name inside the quotes, to fix that.
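If you are not sure which package is missing, a short check like the one below can tell you. This is only a sketch: the names needed, installed, and missing_pkgs are ours, and it assumes tidyverse and skimr are the only packages this walkthrough loads.

```r
# Packages loaded in this walkthrough (the vector name is illustrative)
needed <- c("tidyverse", "skimr")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these in the console: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Anything the message lists can then be passed to install.packages() in the console.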

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)
data_to_explore <- read_csv("data/data_to_explore.csv")

## Rows: 943 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): student_id, subject, semester, section, gender, enrollment_reason...
## dbl  (23): total_points_possible, total_points_earned, proportion_earned, ti...
## dttm  (3): date_x, date_y, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the skim() function to view data_to_explore.

👉 Your Turn ⤵

#skim the data by adding the skim function in front of the data
skim(data_to_explore)

Data summary
Name	data_to_explore
Number of rows	943
Number of columns	34
_______________________
Column type frequency:
character	8
numeric	23
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
student_id	0	1.00	2	6	879
subject	0	1.00	4	5	5
semester	0	1.00	4	4	4
section	0	1.00	2	2	4
gender	227	0.76	1	1	2
enrollment_reason	227	0.76	5	34	5
enrollment_status	227	0.76	7	17	3
course_id	281	0.70	12	13	36

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_points_possible	226	0.76	1619.55	387.12	1212.00	1217.00	1676.00	1791.00	2425.00	▇▂▆▁▃
total_points_earned	226	0.76	1229.98	510.64	0.00	1002.50	1177.13	1572.45	2413.50	▂▂▇▅▂
proportion_earned	226	0.76	76.23	25.20	0.00	72.36	85.59	92.29	100.74	▁▁▁▃▇
time_spent	232	0.75	1828.80	1363.13	0.45	895.57	1559.97	2423.94	8870.88	▇▅▁▁▁
time_spent_hours	232	0.75	30.48	22.72	0.01	14.93	26.00	40.40	147.85	▇▅▁▁▁
int	293	0.69	4.30	0.60	1.80	4.00	4.40	4.80	5.00	▁▁▂▆▇
val	287	0.70	3.75	0.75	1.00	3.33	3.67	4.33	5.00	▁▁▆▇▆
percomp	288	0.69	3.64	0.69	1.50	3.00	3.50	4.00	5.00	▁▁▇▃▃
tv	292	0.69	4.07	0.59	1.00	3.71	4.12	4.46	5.00	▁▁▂▇▇
q1	285	0.70	4.34	0.66	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q2	285	0.70	3.66	0.93	1.00	3.00	4.00	4.00	5.00	▁▂▆▇▃
q3	286	0.70	3.31	0.85	1.00	3.00	3.00	4.00	5.00	▁▂▇▅▂
q4	289	0.69	4.35	0.80	1.00	4.00	5.00	5.00	5.00	▁▁▁▆▇
q5	286	0.70	4.28	0.69	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▆
q6	285	0.70	4.05	0.80	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▅
q7	286	0.70	3.96	0.85	1.00	3.00	4.00	5.00	5.00	▁▁▅▇▆
q8	286	0.70	4.35	0.65	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q9	286	0.70	3.55	0.92	1.00	3.00	4.00	4.00	5.00	▁▂▇▇▃
q10	285	0.70	4.17	0.87	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▇
post_int	848	0.10	3.88	0.94	1.00	3.50	4.00	4.50	5.00	▁▁▃▇▇
post_uv	848	0.10	3.48	0.99	1.00	3.00	3.67	4.00	5.00	▂▂▅▇▅
post_tv	848	0.10	3.71	0.90	1.00	3.29	3.86	4.29	5.00	▁▂▃▇▆
post_percomp	848	0.10	3.47	0.88	1.00	3.00	3.50	4.00	5.00	▁▂▂▇▂

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date_x	393	0.58	2015-09-02 15:40:00	2016-05-24 15:53:00	2015-10-01 15:57:30	536
date_y	848	0.10	2015-09-02 15:31:00	2016-01-22 15:43:00	2016-01-04 13:25:00	95
date	834	0.12	2017-01-23 13:14:00	2017-02-13 13:00:00	2017-01-25 18:43:00	107

What do you notice about this library?

In the code chunk below: 1. use the data_to_explore then 2. group_by subject variable then 3. add skim() function

👉 #3 Your Turn ⤵

group_df <- data_to_explore %>%
  group_by(subject) %>%
  skim()

group_df

Data summary
Name	Piped data
Number of rows	943
Number of columns	34
_______________________
Column type frequency:
character	7
numeric	23
POSIXct	3
________________________
Group variables	subject

Variable type: character

skim_variable	subject	n_missing	complete_rate	min	max	n_unique
student_id	AnPhA	0	1.00	2	6	207
student_id	BioA	0	1.00	3	6	47
student_id	FrScA	0	1.00	2	6	414
student_id	OcnA	0	1.00	2	6	171
student_id	PhysA	0	1.00	3	6	74
semester	AnPhA	0	1.00	4	4	4
semester	BioA	0	1.00	4	4	4
semester	FrScA	0	1.00	4	4	4
semester	OcnA	0	1.00	4	4	4
semester	PhysA	0	1.00	4	4	4
section	AnPhA	0	1.00	2	2	2
section	BioA	0	1.00	2	2	1
section	FrScA	0	1.00	2	2	4
section	OcnA	0	1.00	2	2	3
section	PhysA	0	1.00	2	2	1
gender	AnPhA	45	0.79	1	1	2
gender	BioA	4	0.92	1	1	2
gender	FrScA	130	0.70	1	1	2
gender	OcnA	42	0.76	1	1	2
gender	PhysA	6	0.92	1	1	2
enrollment_reason	AnPhA	45	0.79	5	34	4
enrollment_reason	BioA	4	0.92	5	34	5
enrollment_reason	FrScA	130	0.70	5	34	5
enrollment_reason	OcnA	42	0.76	5	34	5
enrollment_reason	PhysA	6	0.92	5	34	4
enrollment_status	AnPhA	45	0.79	7	17	2
enrollment_status	BioA	4	0.92	7	17	3
enrollment_status	FrScA	130	0.70	7	17	3
enrollment_status	OcnA	42	0.76	7	17	3
enrollment_status	PhysA	6	0.92	7	17	2
course_id	AnPhA	58	0.72	13	13	7
course_id	BioA	7	0.86	12	12	4
course_id	FrScA	150	0.66	13	13	12
course_id	OcnA	55	0.69	12	12	9
course_id	PhysA	11	0.85	13	13	4

Variable type: numeric

skim_variable	subject	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_points_possible	AnPhA	45	0.79	1776.52	12.28	1655.00	1775.00	1775.00	1775.00	1805.00	▁▁▁▇▁
total_points_possible	BioA	4	0.92	2421.00	2.02	2420.00	2420.00	2420.00	2420.00	2425.00	▇▁▁▁▂
total_points_possible	FrScA	129	0.70	1230.81	38.26	1212.00	1212.00	1217.00	1232.00	1361.00	▇▁▁▁▁
total_points_possible	OcnA	42	0.76	1738.47	78.48	1480.00	1676.00	1676.00	1833.00	1833.00	▁▁▇▁▇
total_points_possible	PhysA	6	0.92	2225.00	0.00	2225.00	2225.00	2225.00	2225.00	2225.00	▁▁▇▁▁
total_points_earned	AnPhA	45	0.79	1340.16	423.45	0.00	1269.09	1511.14	1616.37	1732.52	▁▁▁▂▇
total_points_earned	BioA	4	0.92	1546.66	813.01	0.00	1035.16	1865.13	2198.50	2413.50	▃▁▁▃▇
total_points_earned	FrScA	129	0.70	952.30	305.60	0.00	914.92	1062.75	1130.00	1319.02	▁▁▁▅▇
total_points_earned	OcnA	42	0.76	1283.25	427.25	0.00	1216.68	1396.85	1572.50	1786.76	▁▁▁▆▇
total_points_earned	PhysA	6	0.92	1898.45	469.31	110.00	1891.75	2072.00	2149.12	2216.00	▁▁▁▂▇
proportion_earned	AnPhA	45	0.79	75.44	23.84	0.00	71.57	84.90	90.96	97.61	▁▁▁▂▇
proportion_earned	BioA	4	0.92	63.89	33.58	0.00	42.78	77.07	90.85	99.73	▃▁▁▃▇
proportion_earned	FrScA	129	0.70	77.42	24.82	0.00	74.85	86.43	92.19	100.74	▁▁▁▃▇
proportion_earned	OcnA	42	0.76	73.99	24.70	0.00	69.76	81.60	91.04	99.22	▁▁▁▃▇
proportion_earned	PhysA	6	0.92	85.32	21.09	4.94	85.02	93.12	96.59	99.60	▁▁▁▂▇
time_spent	AnPhA	45	0.79	2374.39	1669.58	0.45	1209.85	2164.90	3134.97	7084.70	▆▇▃▂▁
time_spent	BioA	5	0.90	1404.57	1528.14	1.22	297.02	827.30	1955.08	6664.45	▇▂▁▁▁
time_spent	FrScA	134	0.69	1591.90	1016.76	2.42	935.03	1404.90	2130.75	6537.02	▇▇▂▁▁
time_spent	OcnA	42	0.76	2031.44	1496.82	0.58	1133.47	1800.22	2573.45	8870.88	▇▆▂▁▁
time_spent	PhysA	6	0.92	1431.76	990.40	0.70	749.32	1282.81	2049.85	5373.35	▇▆▃▁▁
time_spent_hours	AnPhA	45	0.79	39.57	27.83	0.01	20.16	36.08	52.25	118.08	▆▇▃▂▁
time_spent_hours	BioA	5	0.90	23.41	25.47	0.02	4.95	13.79	32.58	111.07	▇▂▁▁▁
time_spent_hours	FrScA	134	0.69	26.53	16.95	0.04	15.58	23.42	35.51	108.95	▇▇▂▁▁
time_spent_hours	OcnA	42	0.76	33.86	24.95	0.01	18.89	30.00	42.89	147.85	▇▆▂▁▁
time_spent_hours	PhysA	6	0.92	23.86	16.51	0.01	12.49	21.38	34.16	89.56	▇▆▃▁▁
int	AnPhA	62	0.70	4.42	0.57	1.80	4.00	4.40	5.00	5.00	▁▁▁▅▇
int	BioA	9	0.82	3.69	0.63	2.40	3.35	3.80	4.00	5.00	▂▆▇▆▂
int	FrScA	154	0.65	4.42	0.52	2.60	4.00	4.40	5.00	5.00	▁▁▃▃▇
int	OcnA	56	0.68	4.24	0.58	2.20	4.00	4.20	4.60	5.00	▁▁▂▇▆
int	PhysA	12	0.84	4.00	0.65	2.20	3.60	4.00	4.40	5.00	▁▂▆▇▅
val	AnPhA	59	0.72	4.29	0.62	1.00	4.00	4.33	4.67	5.00	▁▁▁▅▇
val	BioA	7	0.86	3.50	0.58	2.67	3.00	3.33	3.67	5.00	▆▆▇▁▂
val	FrScA	155	0.64	3.53	0.72	1.67	3.00	3.67	4.00	5.00	▂▅▇▅▂
val	OcnA	55	0.69	3.62	0.77	1.00	3.00	3.67	4.00	5.00	▁▁▅▇▃
val	PhysA	11	0.85	3.89	0.56	2.00	3.67	4.00	4.33	5.00	▁▁▇▇▃
percomp	AnPhA	61	0.71	3.80	0.67	2.00	3.50	4.00	4.50	5.00	▂▃▇▆▇
percomp	BioA	8	0.84	3.34	0.75	2.00	3.00	3.00	4.00	5.00	▅▇▃▇▂
percomp	FrScA	152	0.65	3.64	0.63	1.50	3.00	3.50	4.00	5.00	▁▁▇▅▃
percomp	OcnA	56	0.68	3.57	0.67	2.00	3.00	3.50	4.00	5.00	▂▇▆▅▅
percomp	PhysA	11	0.85	3.56	0.84	2.00	3.00	3.50	4.00	5.00	▅▅▇▅▇
tv	AnPhA	60	0.71	4.35	0.57	1.00	4.00	4.43	4.83	5.00	▁▁▁▅▇
tv	BioA	9	0.82	3.61	0.56	2.29	3.14	3.57	3.86	5.00	▁▃▇▂▁
tv	FrScA	156	0.64	4.04	0.52	2.29	3.71	4.00	4.43	5.00	▁▂▆▇▅
tv	OcnA	55	0.69	3.97	0.62	1.71	3.71	4.00	4.38	5.00	▁▁▂▇▅
tv	PhysA	12	0.84	3.94	0.56	2.14	3.57	4.00	4.29	5.00	▁▂▃▇▂
q1	AnPhA	59	0.72	4.43	0.64	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q1	BioA	7	0.86	3.76	0.66	2.00	3.00	4.00	4.00	5.00	▁▃▁▇▁
q1	FrScA	153	0.65	4.50	0.57	2.00	4.00	5.00	5.00	5.00	▁▁▁▆▇
q1	OcnA	55	0.69	4.20	0.69	2.00	4.00	4.00	5.00	5.00	▁▂▁▇▅
q1	PhysA	11	0.85	4.03	0.72	2.00	4.00	4.00	4.50	5.00	▁▃▁▇▃
q2	AnPhA	59	0.72	4.30	0.74	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q2	BioA	7	0.86	3.48	0.71	2.00	3.00	3.00	4.00	5.00	▁▇▁▆▁
q2	FrScA	152	0.65	3.35	0.89	1.00	3.00	3.00	4.00	5.00	▁▃▇▆▂
q2	OcnA	56	0.68	3.46	0.93	1.00	3.00	4.00	4.00	5.00	▁▂▆▇▂
q2	PhysA	11	0.85	4.03	0.76	2.00	4.00	4.00	5.00	5.00	▁▂▁▇▅
q3	AnPhA	60	0.71	3.53	0.87	1.00	3.00	3.00	4.00	5.00	▁▁▇▅▃
q3	BioA	7	0.86	2.98	0.87	2.00	2.00	3.00	3.00	5.00	▅▇▁▂▁
q3	FrScA	152	0.65	3.25	0.79	1.00	3.00	3.00	4.00	5.00	▁▂▇▃▁
q3	OcnA	56	0.68	3.30	0.86	2.00	3.00	3.00	4.00	5.00	▃▇▁▅▂
q3	PhysA	11	0.85	3.32	0.95	1.00	3.00	3.00	4.00	5.00	▁▃▇▆▂
q4	AnPhA	61	0.71	4.52	0.78	1.00	4.00	5.00	5.00	5.00	▁▁▁▃▇
q4	BioA	7	0.86	3.69	0.81	2.00	3.00	4.00	4.00	5.00	▂▃▁▇▂
q4	FrScA	154	0.65	4.44	0.74	1.00	4.00	5.00	5.00	5.00	▁▁▁▅▇
q4	OcnA	56	0.68	4.29	0.75	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q4	PhysA	11	0.85	4.02	0.87	2.00	4.00	4.00	5.00	5.00	▁▃▁▇▆
q5	AnPhA	59	0.72	4.36	0.69	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q5	BioA	8	0.84	3.88	0.68	2.00	4.00	4.00	4.00	5.00	▁▃▁▇▂
q5	FrScA	153	0.65	4.38	0.62	2.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q5	OcnA	55	0.69	4.20	0.77	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▆
q5	PhysA	11	0.85	4.06	0.67	2.00	4.00	4.00	4.00	5.00	▁▁▁▇▃
q6	AnPhA	59	0.72	4.50	0.65	1.00	4.00	5.00	5.00	5.00	▁▁▁▆▇
q6	BioA	7	0.86	3.83	0.70	3.00	3.00	4.00	4.00	5.00	▅▁▇▁▂
q6	FrScA	153	0.65	3.88	0.79	2.00	3.00	4.00	4.00	5.00	▁▃▁▇▃
q6	OcnA	55	0.69	3.84	0.84	1.00	3.00	4.00	4.00	5.00	▁▁▅▇▃
q6	PhysA	11	0.85	4.27	0.68	2.00	4.00	4.00	5.00	5.00	▁▁▁▇▆
q7	AnPhA	60	0.71	4.08	0.85	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▆
q7	BioA	8	0.84	3.71	0.96	2.00	3.00	4.00	4.00	5.00	▂▇▁▇▆
q7	FrScA	152	0.65	4.02	0.83	1.00	3.00	4.00	5.00	5.00	▁▁▅▇▆
q7	OcnA	55	0.69	3.83	0.82	2.00	3.00	4.00	4.00	5.00	▁▆▁▇▅
q7	PhysA	11	0.85	3.81	0.90	2.00	3.00	4.00	4.00	5.00	▂▅▁▇▅
q8	AnPhA	60	0.71	4.45	0.65	1.00	4.00	5.00	5.00	5.00	▁▁▁▇▇
q8	BioA	7	0.86	3.79	0.72	2.00	3.00	4.00	4.00	5.00	▁▃▁▇▂
q8	FrScA	152	0.65	4.45	0.58	3.00	4.00	4.00	5.00	5.00	▁▁▇▁▇
q8	OcnA	55	0.69	4.33	0.60	3.00	4.00	4.00	5.00	5.00	▁▁▇▁▆
q8	PhysA	12	0.84	4.05	0.73	2.00	4.00	4.00	4.00	5.00	▁▁▁▇▃
q9	AnPhA	59	0.72	4.07	0.81	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▆
q9	BioA	7	0.86	3.19	0.86	2.00	3.00	3.00	4.00	5.00	▃▇▁▅▁
q9	FrScA	154	0.65	3.37	0.91	1.00	3.00	3.00	4.00	5.00	▁▃▇▆▂
q9	OcnA	55	0.69	3.54	0.91	1.00	3.00	4.00	4.00	5.00	▁▂▇▇▃
q9	PhysA	11	0.85	3.38	0.83	2.00	3.00	3.00	4.00	5.00	▃▇▁▇▂
q10	AnPhA	59	0.72	4.35	0.74	1.00	4.00	4.00	5.00	5.00	▁▁▁▇▇
q10	BioA	8	0.84	3.37	0.89	2.00	3.00	3.00	4.00	5.00	▂▇▁▅▂
q10	FrScA	152	0.65	4.30	0.81	1.00	4.00	4.00	5.00	5.00	▁▁▂▆▇
q10	OcnA	55	0.69	4.13	0.93	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▇
q10	PhysA	11	0.85	3.78	0.89	2.00	3.00	4.00	4.00	5.00	▂▆▁▇▅
post_int	AnPhA	209	0.00	1.00	NA	1.00	1.00	1.00	1.00	1.00	▁▁▇▁▁
post_int	BioA	40	0.18	3.06	0.69	1.75	2.75	3.00	3.25	4.25	▂▃▇▂▂
post_int	FrScA	392	0.10	4.00	0.93	1.50	3.75	4.00	4.88	5.00	▁▃▁▇▇
post_int	OcnA	157	0.10	4.33	0.56	3.00	4.00	4.25	4.75	5.00	▁▂▅▅▇
post_int	PhysA	50	0.32	3.75	0.88	1.50	3.50	4.00	4.25	5.00	▁▁▂▇▂
post_uv	AnPhA	209	0.00	1.00	NA	1.00	1.00	1.00	1.00	1.00	▁▁▇▁▁
post_uv	BioA	40	0.18	3.11	0.80	1.67	2.67	3.33	3.67	4.33	▂▃▂▇▂
post_uv	FrScA	392	0.10	3.38	1.11	1.00	2.67	3.67	4.00	5.00	▃▃▆▇▆
post_uv	OcnA	157	0.10	3.93	0.88	1.33	3.67	4.00	4.58	5.00	▁▁▁▇▇
post_uv	PhysA	50	0.32	3.57	0.66	1.67	3.33	3.67	4.00	4.67	▁▁▃▇▂
post_tv	AnPhA	209	0.00	1.00	NA	1.00	1.00	1.00	1.00	1.00	▁▁▇▁▁
post_tv	BioA	40	0.18	3.08	0.70	1.71	2.86	3.00	3.29	4.29	▂▂▇▃▂
post_tv	FrScA	392	0.10	3.73	0.96	1.29	3.29	4.00	4.43	5.00	▁▃▅▆▇
post_tv	OcnA	157	0.10	4.16	0.60	3.00	3.86	4.14	4.71	4.86	▂▁▅▅▇
post_tv	PhysA	50	0.32	3.67	0.74	1.57	3.43	3.86	4.04	4.71	▂▁▃▇▅
post_percomp	AnPhA	209	0.00	3.00	NA	3.00	3.00	3.00	3.00	3.00	▁▁▇▁▁
post_percomp	BioA	40	0.18	3.06	0.58	2.00	2.50	3.50	3.50	3.50	▂▃▁▂▇
post_percomp	FrScA	392	0.10	3.51	0.96	1.00	3.00	3.50	4.00	5.00	▁▂▆▇▅
post_percomp	OcnA	157	0.10	3.69	0.75	2.00	3.50	4.00	4.00	5.00	▃▁▆▇▃
post_percomp	PhysA	50	0.32	3.40	0.91	1.50	3.00	3.50	4.00	4.50	▂▂▂▆▇

Variable type: POSIXct

skim_variable	subject	n_missing	complete_rate	min	max	median	n_unique
date_x	AnPhA	80	0.62	2015-09-02 15:40:00	2016-03-23 16:11:00	2015-09-27 20:10:30	129
date_x	BioA	9	0.82	2015-09-08 19:52:00	2016-03-09 14:07:00	2015-09-16 14:27:00	40
date_x	FrScA	215	0.51	2015-09-08 13:10:00	2016-04-27 02:12:00	2015-10-08 19:19:30	218
date_x	OcnA	75	0.57	2015-09-08 20:08:00	2016-03-03 15:57:00	2016-01-25 20:17:00	97
date_x	PhysA	14	0.81	2015-09-09 12:24:00	2016-05-24 15:53:00	2015-10-08 21:17:00	60
date_y	AnPhA	209	0.00	2015-09-02 15:31:00	2015-09-02 15:31:00	2015-09-02 15:31:00	1
date_y	BioA	40	0.18	2015-11-17 03:04:00	2016-01-21 23:38:00	2016-01-16 23:48:00	9
date_y	FrScA	392	0.10	2015-09-09 15:21:00	2016-01-22 15:43:00	2016-01-04 13:13:00	43
date_y	OcnA	157	0.10	2015-09-12 15:56:00	2016-01-08 17:51:00	2015-09-18 04:08:30	18
date_y	PhysA	50	0.32	2015-09-14 14:45:00	2016-01-22 05:36:00	2016-01-17 08:24:30	24
date	AnPhA	189	0.10	2017-01-23 14:28:00	2017-02-10 15:25:00	2017-02-01 17:09:00	21
date	BioA	47	0.04	2017-02-06 20:12:00	2017-02-09 19:15:00	2017-02-08 07:43:30	2
date	FrScA	372	0.14	2017-01-23 13:14:00	2017-02-13 13:00:00	2017-01-24 17:23:00	62
date	OcnA	155	0.11	2017-01-23 14:07:00	2017-02-09 18:45:00	2017-02-01 21:53:30	20
date	PhysA	71	0.04	2017-01-30 14:41:00	2017-02-03 15:23:00	2017-02-02 20:54:00	3

ggplot2 is designed to work iteratively. You start with a layer that shows the raw data. Then you add layers of annotations and statistical summaries.

You can read more about ggplot2 in the book “ggplot2: Elegant Graphics for Data Analysis”. You can also find lots of inspiration, including code, in the R Graph Gallery. Finally, you can use the ggplot2 cheat sheet to help.

“ggplot2: Elegant Graphics for Data Analysis” states that “every ggplot2 plot has three key components:

Data,
A set of aesthetic mappings between variables in the data and visual properties, and
At least one layer which describes how to render each observation. Layers are usually created with a geom function.”
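To make the three components concrete, here is a minimal sketch. It uses the mpg data set bundled with ggplot2 rather than data_to_explore, so it runs without any file; the object names p and p_labeled and the title text are ours.

```r
library(ggplot2)

# 1. data: the mpg data set that ships with ggplot2
# 2. aesthetic mappings: displ -> x position, hwy -> y position
# 3. layer: geom_point() draws one point per observation
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Because plots are built iteratively, more can be added to p later
p_labeled <- p +
  labs(title = "Engine size and highway mileage")

p_labeled
```

The same pattern, data then mappings then a geom, is what each plot in the rest of this walkthrough follows.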

One Variable

Create a basic visualization that examines a single variable of interest.

Barplot

Which online course had the largest enrollment numbers?

Which variable should we be looking at?

👉 Your Turn ⤵

#inspect the data frame
data_to_explore

## # A tibble: 943 × 34
##    student_id subject semester section total_points_possible total_points_earned
##    <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1 43146      FrScA   S216     02                       1217               1150 
##  2 44638      OcnA    S116     01                       1676               1384.
##  3 47448      FrScA   S216     01                       1232               1116 
##  4 47979      OcnA    S216     01                       1833               1493.
##  5 48797      PhysA   S116     01                       2225               1995.
##  6 51943      FrScA   S216     03                       1222                 70 
##  7 52326      AnPhA   S216     01                       1775               1519.
##  8 52446      PhysA   S116     01                       2225               2198 
##  9 53447      FrScA   S116     01                       1212               1173 
## 10 53475      FrScA   S116     02                       1212                  0 
## # ℹ 933 more rows
## # ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
## #   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
## #   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
## #   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
## #   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Level a. The most basic level for a plot

Includes:

data: data_to_explore
aes() function - one categorical variable:
- subject mapped to x position
Geom: geom_bar() function - bar graph

ggplot(data_to_explore, aes(x = subject)) +
  geom_bar()

Level b. Add another layer with labels

title: “Number of Student Enrollments per Subject”
caption: “Which online courses have had the largest enrollment numbers?”

ggplot(data_to_explore, aes(x = subject)) +
  geom_bar() +
  labs(title = "Number of Student Enrollments per Subject",
       caption = "Which online courses have had the largest enrollment numbers?")

Level c: Add Scale with a different color.

scale: fill = gender

What can we notice about gender?

ggplot(data_to_explore, aes(x = subject, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution of Students Across Subjects",
       caption = "Which subjects enroll more female students?")

Histogram - You try

data: data_to_explore
aes() function - one continuous variable:
- tv variable mapped to x position
Geom: geom_histogram(). This code is already there; you just need to un-comment it.
Add a title: “Number of Hours Students Watch TV per Day”
Add a caption that poses the question: “Approximately how many students watch 4+ hours of TV per day?”

NEED HELP? TRY STHDA

Yours could look something like the example below…

👉 Your Turn ⤵

ggplot(data_to_explore, aes(x = tv)) + 
  geom_histogram(bins = 5) +
  labs(title = "Number of Hours Students Watch TV per Day", 
       caption = "Approximately how many students watch 4+ hours of TV per day?")

Or maybe you added a theme, such as theme_classic():

data_to_explore %>%
  ggplot(aes(x = tv)) +
  geom_histogram(bins = 5, fill = "red", colour = "black") +
  labs(title = "Number of Hours Students Watch TV per Day", 
       caption = "Approximately how many students watch 4+ hours of TV per day?") +
  theme_classic()

## Warning: Removed 292 rows containing non-finite values (`stat_bin()`).
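The warning appears because tv has 292 missing values (see the skim() output earlier), and the histogram silently drops them. The sketch below uses a made-up tibble to show how to count the missing rows and, if you prefer, remove them yourself before plotting. The object names and the values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for data_to_explore; the values are made up
toy <- tibble(tv = c(4.2, NA, 3.7, NA, 4.9))

# Count the rows a histogram of tv would drop
n_missing <- sum(is.na(toy$tv))
n_missing

# Remove them explicitly so the plot has nothing left to drop
toy_complete <- toy %>%
  drop_na(tv)

nrow(toy_complete)
```

Piping the result of drop_na(tv) into ggplot() should then draw the same histogram without the warning.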

Two categorical Variables

Create a basic visualization that examines the relationship between two categorical variables.

RESEARCH QUESTION: What do you wonder about the reasons for enrollment in various courses?

Heatmap

data: data_to_explore
use count() function for subject and enrollment_reason, then
ggplot() function
aes() function - two categorical variables
- subject variable mapped to x position
- enrollment reason variable mapped to y position
- n (the count) mapped to fill
Geom: geom_tile() function
Add a title “Reasons for Enrollment by Subject”
Add a caption: “Which subjects were the least available at local schools?”

👉 Your Turn ⤵

data_to_explore %>% 
  count(subject, enrollment_reason) %>% 
  ggplot() + 
  geom_tile(mapping = aes(x = subject, 
                          y = enrollment_reason, 
                          fill = n)) + 
  labs(title = "Reasons for Enrollment by Subject", 
       caption = "Which subjects were the least available at local schools?")
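To see what geom_tile() is given, it can help to run count() on its own. The sketch below does this on a made-up tibble; the reason labels are placeholders, not values from data_to_explore, and the object names are ours.

```r
library(dplyr)

# Toy stand-in for data_to_explore; the reason labels are placeholders
toy <- tibble(
  subject = c("BioA", "BioA", "OcnA", "OcnA", "OcnA"),
  enrollment_reason = c("reason_a", "reason_a", "reason_a", "reason_b", "reason_b")
)

# One row per subject/reason combination; the new column n holds the row count
counts <- toy %>%
  count(subject, enrollment_reason)

counts
```

Each row of counts becomes one tile in the heatmap, and n drives the fill color.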

Two continuous variables

Create a basic visualization that examines the relationship between two continuous variables.

Scatter plot

RESEARCH QUESTION: Can we predict the grade on a course from the time spent in the course LMS?

Which variables should we be looking at?

#look at the data frame
data_to_explore

## # A tibble: 943 × 34
##    student_id subject semester section total_points_possible total_points_earned
##    <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1 43146      FrScA   S216     02                       1217               1150 
##  2 44638      OcnA    S116     01                       1676               1384.
##  3 47448      FrScA   S216     01                       1232               1116 
##  4 47979      OcnA    S216     01                       1833               1493.
##  5 48797      PhysA   S116     01                       2225               1995.
##  6 51943      FrScA   S216     03                       1222                 70 
##  7 52326      AnPhA   S216     01                       1775               1519.
##  8 52446      PhysA   S116     01                       2225               2198 
##  9 53447      FrScA   S116     01                       1212               1173 
## 10 53475      FrScA   S116     02                       1212                  0 
## # ℹ 933 more rows
## # ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
## #   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
## #   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
## #   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
## #   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Level a. The most basic level for a plot

Includes:

data: data_to_explore
aes() function - two continuous variables
- time spent in hours mapped to x position
- proportion earned mapped to y position
Geom: geom_point() function - Scatter plot

👉 Your Turn ⤵

#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore,
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: +  geom function type
  geom_point()

Level b. Add another layer with labels

Add a title: “How Time Spent on Course LMS is Related to Points Earned in the course”
Add a x label: “Time Spent (Hours)”
Add a y label: “Proportion of Points Earned”

👉 Your Turn ⤵

#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: +  geom function type
  geom_point() +
#layer 3: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the course", 
       x="Time Spent (Hours)", 
       y = "Proportion of Points Earned")

Level c. Add Scale with a different color.

RESEARCH QUESTION: Can we notice anything about enrollment status?

Add scale: color = enrollment_status

👉 Your Turn ⤵

#layer 1: add data and aesthetics mapping 
#layer 4: add color scale by type
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned,
           color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 3: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the course", 
       x="Time Spent (Hours)", 
       y = "Proportion of Points Earned")

Level d. Divide up graphs using facet to visualize by subject.

👉 Your Turn ⤵

#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 4: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned")+
#layer 5: add facet wrap
  facet_wrap(~ subject)

Level e. How can we remove NAs from the plot? And what will the code look like without the comments?

You can pipe the data frame into the drop_na() function.

use the data, then
use the drop_na() function to remove NAs from enrollment status, then
add the ggplot() function as above

👉 Your Turn ⤵

data_to_explore %>%
  drop_na(enrollment_status) %>%
  ggplot(aes(x = time_spent_hours, 
             y = proportion_earned, 
             color = enrollment_status)) +
  geom_point() +
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned")+
  facet_wrap(~ subject)

2023-07-20
