Each of 5 schools (A, B, C, D and E) is implementing the same math course this semester, with 35 lessons. There are 30 sections total. The semester is about 3/4 of the way through. For each section, we record the number of students who are: ⢠very ahead (more than 5 lessons ahead) ⢠middling (5 lessons ahead to 0 lessons ahead) ⢠behind (1 to 5 lessons behind) ⢠more behind (6 to 10 lessons behind) ⢠very behind (more than 10 lessons behind) ⢠completed (finished with the course) Whatâs the story (or stories) in this data? Find it, and tell it visually and, above all, truthfully
The approach to analyse the dataset will be as follows 1. Data Loading 2. Data Cleaning & Wrangling 3. Data Analysis & Visualization * Pie Chart * Bar Chart * Box Plot * Scatter Plot
#install.packages("tidyverse")
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.5.3
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
#install.packages("gridExtra")
library("gridExtra")
schools <- read.csv("data-storyteller.csv",stringsAsFactors = FALSE)
str(schools)
## 'data.frame': 30 obs. of 8 variables:
## $ School : chr "A" "A" "A" "A" ...
## $ Section : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Very.Ahead..5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Middling..0 : int 5 8 9 14 9 7 19 3 6 13 ...
## $ Behind..1.5 : int 54 40 35 44 42 29 22 37 29 40 ...
## $ More.Behind..6.10: int 3 10 12 5 2 3 5 11 8 5 ...
## $ Very.Behind..11 : int 9 16 13 12 24 10 14 18 12 5 ...
## $ Completed : int 10 6 11 10 8 9 19 5 10 20 ...
On exploration of the dataset through str() command we found that all of the attributes except School is in integer form. School is of the datatype character. To help us with our analysis we would be changing the data type of School to a factor
summary(schools)
## School Section Very.Ahead..5 Middling..0
## Length:30 Min. : 1.00 Min. :0 Min. : 2.00
## Class :character 1st Qu.: 2.25 1st Qu.:0 1st Qu.: 4.25
## Mode :character Median : 5.50 Median :0 Median : 7.50
## Mean : 5.90 Mean :0 Mean : 7.40
## 3rd Qu.: 9.00 3rd Qu.:0 3rd Qu.: 9.75
## Max. :13.00 Max. :0 Max. :19.00
## Behind..1.5 More.Behind..6.10 Very.Behind..11 Completed
## Min. : 4.00 Min. : 0.000 Min. : 0.000 Min. : 1.00
## 1st Qu.:15.25 1st Qu.: 1.000 1st Qu.: 1.250 1st Qu.: 6.00
## Median :22.00 Median : 2.000 Median : 5.500 Median :10.00
## Mean :25.13 Mean : 3.333 Mean : 6.967 Mean :10.53
## 3rd Qu.:34.25 3rd Qu.: 4.750 3rd Qu.:11.500 3rd Qu.:14.00
## Max. :56.00 Max. :12.000 Max. :24.000 Max. :27.00
The summary function helps us understand a little more about our dataset
We will be making changes to our dataset so that it is more easier to analyze and derive insights from it. According to insight obtained so far the “very ahead” category has a value 0 across all schools. This could mean that since semester is almost coming to an end, all of the students in this category have completed the course. Therefore if we want, we can remove this attribute from the dataset.
#School is in character converting it to factor
schools$School <- as.factor(schools$School)
#Renaming Columns
colnames(schools) <- c("School","Section","VeryAhead","Middling",
"Behind","MoreBehind","VeryBehind","Completed")
#Re-arranging the rows and columns
col_order <- c("School","Section","Completed","VeryAhead","Middling","Behind","MoreBehind","VeryBehind")
schools <- schools[,col_order]
#Since there is no data in "very ahead"" it would be ok to remove it
schools$VeryAhead <- NULL
#Since the school should be concerned about all the students who are too behind, we create an attribute TooBehind that sums up all the students in more behind and very behind attributes
schools$TooBehind <- schools$MoreBehind + schools$VeryBehind
In the code that follows I will be using the pipe operator also known as %>%. Here is a video of R chief data scientist wikham explaining why pipe operators are so cool! https://www.youtube.com/watch?v=40tyOFMZUSM. During our analysis we will be answering some of the business questions through visualization such as 1. How many students are there in each school 2. What is the distribution of student progress categories 3. How are student progress distributed across each school 4. Which schools have higher percentage of students ahead or behind 5. Are their any outliers in our analysis? 6. What is the breakup of students & schools according to sections
#What is the grade distribution of students across all the schools
school_performances <- gather(schools[,3:7]) %>% group_by(key) %>% summarise(count=sum(value))
school_performances
## # A tibble: 5 x 2
## key count
## <chr> <int>
## 1 Behind 754
## 2 Completed 316
## 3 Middling 222
## 4 MoreBehind 100
## 5 VeryBehind 209
#Pie Chart using quick plot
colors <- c("blue","green","yellow","red","brown","grey")
pie(school_performances$count, labels = school_performances$count,
main="Pie Chart of Students Progress in Schools",col = colors,radius = 0.6)
legend("bottomright", c("Behind","Completed","Middling","MoreBehind","VeryBehind"),
cex=0.6, fill=colors, ncol = 1)
The number of students in behind is alarming. Only 25% of the semester is left and the schools have more than 70% of their students lagging behind. However their is still hope for behind and middling category students if we identify schools and sections that have high numbers of students lagging behind we can assign resources accordingly.
#How many section in each School?
table(schools$School) #This command returns the frequency of each variable
##
## A B C D E
## 13 12 3 1 1
School A has the highest number of sections
#How many Students in Each School?
students_by_school <- schools %>%
mutate(Tot_students = `Middling` + `Behind` + `MoreBehind` +
`VeryBehind` + `Completed`) %>%
group_by(School) %>%
summarise(tot_stu_sch = sum(Tot_students))
students_by_school
## # A tibble: 5 x 2
## School tot_stu_sch
## <fct> <int>
## 1 A 932
## 2 B 446
## 3 C 85
## 4 D 22
## 5 E 116
#Visually Plot Schools
#Bar Chart using ggplot
ggplot(students_by_school,aes(x= School,y = tot_stu_sch, fill = School, label = tot_stu_sch))+
geom_bar(stat ="identity") + labs(x="Schools",y="No of Students",
title = "Number of Students by School")
schoolsbyP <- gather(schools[,-c(2,8)],Status,Frequency,-School) %>% group_by(School,Status) %>%
summarise(count = sum(Frequency)) %>% mutate(pct = count/sum(count))
#School Performance breakdown in relative and absolute values
sch_performance_bd <- ggplot(schoolsbyP,aes(x = School,y = count,fill=Status)) +
geom_bar(stat='identity',position="stack") + labs(x="Schools",
y="Value",
title = "Status of Students")
sch_performance_bd
sch_performance_bd_ab <- ggplot(schoolsbyP,aes(x = School,y = pct,fill=Status)) +
geom_bar(stat='identity',position="fill") + labs(x="Schools",y="Proportion",
title = "Status of Student in Percentage")+
geom_text(aes(label=paste(round(pct*100,2),'%',sep ='')),position= position_stack(vjust=0.5),
size = 2)
sch_performance_bd_ab
According to the 1st bar chart, we see that Schools A and B have a very high number of students middling or behind.In the coming weeks if resources are concentrated around these schools we can expect good results.
The 2nd bar chart is based on percentage.We observe the School D is performing very poorly and 27.27% of the students are very behind.There is little hope for these students this semester but in the future the administration needs to analyze as to why students are performing so poorly in School D and make improvements.
#Box Plot: Distribution of performance across each school and section
schoolsbyS <- gather(schools,Status,Frequency,-School,-Section) %>% group_by(School,Status,Section) %>%
summarise(count = sum(Frequency))
outliers <- ggplot(schoolsbyS, aes(x=School,y=count)) +
geom_boxplot() +
facet_wrap(~Status)
outliers
From the boxplot we observe that there are some sections in School A and B that are outliers. Pinpointing these sections will help us allocate resources in a much better way. We also see that School E has only one section and that has the highest number of students who belong in the category of completed or all behind.
ScatterPlots are very insightful because they help us create multi-dimensional plots.Now we will create scatterplots that will help us pin point which sections need help now and in the future. We will be creating 3 scatterplots for middling, behind and too behind categories. These plots will help us make better decisions.
schools$SectionTotal <- rowSums(schools[,3:7])
#Middling
ScatterPMiddling <- ggplot(schools, aes(x=SectionTotal,y=Middling)) +
geom_point(aes(color=School,size=Middling)) + geom_text(aes(label = Section),size=2)+
geom_hline(yintercept = mean(schools$Middling),linetype ="dashed",color="red",size=1 )+
labs(title = "Section Total vs Middling Counts")
ScatterPMiddling
In this Scatter plot the dotted red line shows mean of students who are in middling category. According to this scatter plot we see that in School A - section 7, 4, 10 and School E have a high number of students in middling category. Concentrating our efforts on them and on sections that are above the mean line will help us get better results now.
ScatterPBehind <- ggplot(schools, aes(x=SectionTotal,y=Behind)) +
geom_point(aes(color=School,size=Behind)) + geom_text(aes(label = Section),size=2)+
geom_hline(yintercept = mean(schools$Behind),linetype ="dashed",color="red",size=1 )+ labs(title = "Section Total vs Behind Counts")
ScatterPBehind
In this scatterplot School A and School E have high numbers of student that are lagging behind. However there is still some hope to improve the performance if the administration concentrates on these 2 schools. #### Plot that maps Section Total vs TooBehind Counts
#Too Behind - This was calculated by adding the behind categories i:e More behind + Very Behind
ScatterPTooBehind <- ggplot(schools, aes(x=SectionTotal,y=TooBehind)) +
geom_point(aes(color=School,size=TooBehind)) + geom_text(aes(label = Section),size=2)+
geom_hline(yintercept = mean(schools$TooBehind),linetype ="dashed",color="red",size=1 )+ labs(title = "Section Total vs TooBehind Counts")
ScatterPTooBehind
Here too we see a same pattern. However we get A and E because these sections have a very high number of students.
Below are the observations and actions to be taken based on the above chart