The following is an exemplar of a Data-Analytic Memo (DAM1) for the Fall 2024 semester. This exemplar is a combination of everyone’s code.
Please don’t use this as the “right” answer. But you can find good ideas from everyone. We will continue to hone our skills – and our expectations will rise as we progress.
Purpose of this Data-Analytic Memo
Here we conduct analyses on behalf of a trio of 3rd-grade teachers at Lindquist Elementary School in Hometown, USA. The Principal of the school has requested analysis of some of their preliminary in-house assessments. Ms. Affolter, Mr. Miller-Lane, and Ms. Weston teach 3rd grade, and they plan their instruction together often. After one month of school, they compiled scores for a comprehensive 64-item spelling test for the words they’ve agreed to teach (16 words each week). And they administered the first unit test of 40 items from their mathematics textbook. Ms. Weston keyed in the scores into a Google sheet for all the students on the 3rd grade team and shared them with the Principal.
Boys were assigned as 1; girls were assigned as 2. However, she left three of the students’ gender blank, as Morgan, Paige, and River are to be categorized as nonbinary. The teachers ask that you work out how to make this happen in the data.
The three classrooms are labeled 1 for Ms. Affolter’s class, 2 for Mr. Miller-Lane’s class, and 3 for Ms. Weston’s class.
Data Analysis
Task 1: Data Import
Load the packages we’re going to use
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(googlesheets4)
Read the Google file from the teachers
gs4_deauth() # deauthorize Google Sheet so that anyone can access itdata1 <-read_sheet("https://docs.google.com/spreadsheets/d/1778B79za0VK7-OnSNGeybGRR0F_4AlSxd2n64Nh9LiE/edit?usp=sharing")
✔ Reading from "DAM1_Data_Fall2024".
✔ Range 'Sheet1'.
Print the data to see what it looks like
print(data1, n=75) # Print all first 75 rows of the data
Produce suitable labels for the assigned classrooms (Affolter, Miller-Lane, Weston).
Produce suitable labels for student gender.
Print the data set with labels for classroom and gender.
Mo and Ged submitted the following code, which works!
I noticed that they used “old” code that still uses what’s called the magrittr pipe.
From now on, I encourage everyone to substitute |> where you see %>%
data2 <- data1 |>mutate(Classroom =factor(Classroom, levels=c("1", "2", "3"),labels=c("Affolter", "Miller-Lane", "Weston")))# Class labelsdata2 <- data1 |>mutate(Classroom =factor(Classroom, levels=c("1", "2", "3"),labels=c("Affolter", "Miller-Lane", "Weston")))# Gender labelsdata3 <- data2 %>%mutate(across(everything(), ~replace_na(., 3)))# NOTE: Here you can skip this step and make it work as one pipedata3 <- data3 |>mutate(Gender =factor(Gender, levels=c("1", "2", "3"),labels=c("Male", "Female", "Non-Binary")))print(data3, n=75)
# A tibble: 75 × 5
Name Gender Classroom Spelling Math
<chr> <fct> <fct> <dbl> <dbl>
1 Avery Male Affolter 38 24
2 Austin Male Affolter 48 27
3 Beth Female Affolter 53 28
4 Charlotte Female Affolter 58 32
5 Dennis Male Affolter 59 26
6 Ezra Male Affolter 60 30
7 Gray Male Affolter 37 22
8 Heather Male Affolter 51 26
9 Iris Female Affolter 44 24
10 Jessica Female Affolter 63 25
11 Josephine Female Affolter 52 28
12 Julie Female Affolter 63 33
13 Kevin Male Affolter 43 18
14 Laverne Female Affolter 62 37
15 Leslie Female Affolter 60 24
16 Mary Female Affolter 64 27
17 Morgan Non-Binary Affolter 60 23
18 Nathan Male Affolter 41 17
19 Oprah Female Affolter 54 23
20 Patrick Male Affolter 57 35
21 Rahim Male Affolter 61 33
22 Sam Male Affolter 55 23
23 Thomas Male Affolter 61 28
24 Victor Male Affolter 64 30
25 Walter Male Affolter 54 18
26 Aiden Male Miller-Lane 45 28
27 Bailey Female Miller-Lane 39 26
28 Chloe Female Miller-Lane 64 38
29 Dakota Female Miller-Lane 42 25
30 Debra Female Miller-Lane 45 31
31 Dominik Male Miller-Lane 61 23
32 Eleanor Female Miller-Lane 56 26
33 Felicity Female Miller-Lane 50 28
34 Francis Female Miller-Lane 47 28
35 Grace Female Miller-Lane 55 35
36 Guido Male Miller-Lane 45 19
37 Hillary Female Miller-Lane 56 36
38 Jill Female Miller-Lane 62 34
39 Jordan Male Miller-Lane 57 38
40 Kareem Female Miller-Lane 60 39
41 Laura Female Miller-Lane 53 27
42 Molly Female Miller-Lane 50 25
43 Nancy Female Miller-Lane 50 33
44 Petra Female Miller-Lane 62 40
45 Rachel Female Miller-Lane 47 25
46 Rhonda Female Miller-Lane 43 20
47 Sally Female Miller-Lane 44 22
48 Tara Female Miller-Lane 59 36
49 Thelma Female Miller-Lane 58 34
50 Abigail Female Weston 64 40
51 Caden Female Weston 41 29
52 Dana Male Weston 48 21
53 Darcy Female Weston 63 31
54 Erik Male Weston 51 34
55 Erin Female Weston 56 19
56 Fernando Male Weston 63 31
57 Gail Female Weston 64 28
58 Gemma Female Weston 52 29
59 Hermione Female Weston 47 19
60 Iara Female Weston 53 27
61 Iman Female Weston 41 17
62 Jack Male Weston 46 24
63 James Male Weston 40 24
64 Jon Female Weston 53 29
65 Julia Female Weston 46 29
66 Lawrence Male Weston 59 32
67 Mo Male Weston 43 26
68 Otis Male Weston 42 21
69 Paige Non-Binary Weston 39 16
70 Patrick Male Weston 63 32
71 Quincy Male Weston 49 25
72 Quinn Female Weston 48 25
73 River Non-Binary Weston 59 30
74 Wesley Male Weston 55 26
75 William Male Weston 52 28
Task 3 - Describe the Spelling scores
We produced a histogram fora ll 75 scores on the Spelling test of 64 words.
Ella, Maeve, Reese, and Jane submitted the following code with colors!
ggplot(data3, aes(x = Spelling)) +geom_histogram(binwidth =2, fill ="blue", color ="green")
Anna and Felix submitted the following code to find the median, min and max of the Spelling scores.
median(data3$Spelling)
[1] 53
min(data3$Spelling)
[1] 37
max(data3$Spelling)
[1] 64
We calculated the median, minimum, and maximum Spelling Scores for all 75 students. The median is 53, the minimum is 37, and the maximum is 64.
Mo and Ged stated: The graph shows a negative skew, this means that the Spelling scores piled up on the high end of the graph which implies most of the students did relatively well. This graph can also be described as having a ceiling effect.
Islam, Mark, and Zamzama produced a density plot of Spelling scores using the following code:
# Creating a Density Plot for Spellng scoresggplot(data = data3,mapping =aes(x = Spelling)) +geom_density()
The density plot shows a broad upward curve of spelling scores. The density plot is an efficient method of viewing important trends in the data regarding spelling scores. While the histogram provides more exact information regarding individual student performance, the density plot is more suited to the purposes of our colleagues. For its clarity, we recommend that our third-grade teachers utilize a density plot for parsing the data we are working with.
Task 4 - Describe the Math scores
Ella, Jane, Maeve, and Reese produced a histogram for all 75 scores on the Math test of 40 items with this code:
ggplot(data3, aes(x = Math)) +geom_histogram(binwidth =2, fill ="red", color ="black")
Anna and Felix submitted the following code to find the median, min and max of the Math scores.
median(data3$Math)
[1] 27
min(data3$Math)
[1] 16
max(data3$Math)
[1] 40
We calculated the median, minimum, and maximum Math Scores for all 75 students. The median is 27, the minimum is 16, and the maximum is 40.
Islam, Mark, and Zamzama produced a density plot of Math scores using the following code:
# Creating a Density Plot for Math scoresggplot(data = data3,mapping =aes(x = Math)) +geom_density()
Task 5 - A scatterplot
We produced a scatterplot showing the relationship between Spelling (on the x-axis) and Math (on the y-axis).
# Make a scatterplot with Spelling on X and Math on Yggplot(data = data3,mapping =aes(x = Spelling, y = Math)) +geom_point() +labs(title ="Relationship between spelling and math",subtitle ="By Islam",x ="Spelling",y ="Math")
Task 6 - Analyze the scatterplot
It seems that students who performed well on this mathematics assessment performed similarly proficiently on the spelling assessment. We will have to make further visualizations to make accurate judgements about these trends.
Task 7 - Scatterplots by Classroom
We produced a scatterplot showing the relationship between Spelling (on the x-axis) and Math (on the y-axis), by Classroom.
ggplot(data = data3,mapping =aes(x = Spelling, y = Math)) +geom_point(aes(color = Classroom, shape = Classroom)) +labs(title ="Relationship between spelling and math",subtitle ="By Islam",x ="Spelling",y ="Math",color ="Classroom",shape ="Classroom")
Task 8 - Boxplots of Spelling scores by Classroom
Ella, Jane, Maeve, and Reese produced boxplots of Spelling scores by Classroom with something like the following code:
ggplot(data3, aes(x = Classroom, y = Spelling)) +geom_boxplot(color ='black') +labs(x ='Classroom', y ='Spelling Scores', title ='Boxplots of Spelling Scores by Classroom')
Task 9 - Describing the boxplots
I like what Anna and Felix did here: They calculated the median, 25th and 75th percentiles, and the interquartile range of each boxplot by subsetting the data and naming these smaller datasets. It’s not super efficient, but it’s clear and easy to understand.
Affolter<- data2 |>filter(Classroom =="Affolter")
median(Affolter$Spelling)
[1] 57
IQR(Affolter$Spelling)
[1] 10
summary(Affolter)
Name Gender Classroom Spelling
Length:25 Min. :1.000 Affolter :25 Min. :37.00
Class :character 1st Qu.:1.000 Miller-Lane: 0 1st Qu.:51.00
Mode :character Median :1.000 Weston : 0 Median :57.00
Mean :1.417 Mean :54.48
3rd Qu.:2.000 3rd Qu.:61.00
Max. :2.000 Max. :64.00
NA's :1
Math
Min. :17.00
1st Qu.:23.00
Median :26.00
Mean :26.44
3rd Qu.:30.00
Max. :37.00
The 25th percentile of data regarding Ms. Affolter’s kids’ spelling results is 51, the median is 57, and the 75th percentile is 61. The IQR is 10. No outliers to speak of.
NOTE: I think that they should have used IQR() function to calculate the IQR for all three classrooms. (Perhaps they can SEE the IQR by eyeballing and calculating 75th - 25th ???)
Name Gender Classroom Spelling
Length:24 Min. :1.000 Affolter : 0 Min. :39.00
Class :character 1st Qu.:2.000 Miller-Lane:24 1st Qu.:45.00
Mode :character Median :2.000 Weston : 0 Median :51.50
Mean :1.833 Mean :52.08
3rd Qu.:2.000 3rd Qu.:58.25
Max. :2.000 Max. :64.00
Math
Min. :19.00
1st Qu.:25.00
Median :28.00
Mean :29.83
3rd Qu.:35.25
Max. :40.00
The 25th percentile of data regarding Mr. Miller-Lane’s kids’ spelling results is 45, the median is 51.5, the 75th percentile is 58.25, meaning the IQR range is 13.25. Still no outliers to speak of.
Weston <- data2 |>filter(Classroom =="Weston")
summary(Weston)
Name Gender Classroom Spelling
Length:26 Min. :1.0 Affolter : 0 Min. :39.00
Class :character 1st Qu.:1.0 Miller-Lane: 0 1st Qu.:46.00
Mode :character Median :1.5 Weston :26 Median :51.50
Mean :1.5 Mean :51.42
3rd Qu.:2.0 3rd Qu.:58.25
Max. :2.0 Max. :64.00
NA's :2
Math
Min. :16.00
1st Qu.:24.00
Median :27.50
Mean :26.62
3rd Qu.:29.75
Max. :40.00
The 25th percentile of data regarding Ms. Weston’s kids’ spelling results is 46, the median is 51.5, the 75th percentile is 58.25, meaning the IQR range is 12.25. Again no outliers to speak of.
The median and 75th percentile are the exact same for Mr. Miller Lane and Ms. Weston’s classes, which is a little odd but not crazy. It is a lot easier to analyze the data this way with concrete numbers as opposed to trying to make out trends from on a graph.
Task 10 - Density plots of Spelling by Classroom
While many of you produced density plots of Spelling scores by Classroom, I think there’s something to be said for adding fill = Classroom to the aes() function. This will make the plot easier to read and understand. Here’s the code:
ggplot(data = data3,mapping =aes(x = Spelling, color = Classroom, fill = Classroom)) +geom_density(alpha =0.5)