HW2-School Dataset

Approach

The approach to analyse the dataset will be as follows
1. Data Loading
2. Data Cleaning & Wrangling
3. Data Analysis & Visualization
* Pie Chart
* Bar Chart
* Box Plot
* Scatter Plot

Data Loading

#install.packages("tidyverse")
library("tidyverse")

## Warning: package 'tidyverse' was built under R version 3.5.3

## Warning: package 'ggplot2' was built under R version 3.5.3

## Warning: package 'tibble' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.3

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'stringr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

#install.packages("gridExtra")
library("gridExtra")
schools <- read.csv("data-storyteller.csv",stringsAsFactors = FALSE)

Analyzing Structure and Summary of Data

str(schools)

## 'data.frame':    30 obs. of  8 variables:
##  $ School           : chr  "A" "A" "A" "A" ...
##  $ Section          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Very.Ahead..5    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Middling..0      : int  5 8 9 14 9 7 19 3 6 13 ...
##  $ Behind..1.5      : int  54 40 35 44 42 29 22 37 29 40 ...
##  $ More.Behind..6.10: int  3 10 12 5 2 3 5 11 8 5 ...
##  $ Very.Behind..11  : int  9 16 13 12 24 10 14 18 12 5 ...
##  $ Completed        : int  10 6 11 10 8 9 19 5 10 20 ...

On exploration of the dataset through str() command we found that all of the attributes except School is in integer form. School is of the datatype character. To help us with our analysis we would be changing the data type of School to a factor

summary(schools)

##     School             Section      Very.Ahead..5  Middling..0   
##  Length:30          Min.   : 1.00   Min.   :0     Min.   : 2.00  
##  Class :character   1st Qu.: 2.25   1st Qu.:0     1st Qu.: 4.25  
##  Mode  :character   Median : 5.50   Median :0     Median : 7.50  
##                     Mean   : 5.90   Mean   :0     Mean   : 7.40  
##                     3rd Qu.: 9.00   3rd Qu.:0     3rd Qu.: 9.75  
##                     Max.   :13.00   Max.   :0     Max.   :19.00  
##   Behind..1.5    More.Behind..6.10 Very.Behind..11    Completed    
##  Min.   : 4.00   Min.   : 0.000    Min.   : 0.000   Min.   : 1.00  
##  1st Qu.:15.25   1st Qu.: 1.000    1st Qu.: 1.250   1st Qu.: 6.00  
##  Median :22.00   Median : 2.000    Median : 5.500   Median :10.00  
##  Mean   :25.13   Mean   : 3.333    Mean   : 6.967   Mean   :10.53  
##  3rd Qu.:34.25   3rd Qu.: 4.750    3rd Qu.:11.500   3rd Qu.:14.00  
##  Max.   :56.00   Max.   :12.000    Max.   :24.000   Max.   :27.00

The summary function helps us understand a little more about our dataset

Data Cleaning & Wrangling

We will be making changes to our dataset so that it is more easier to analyze and derive insights from it. According to insight obtained so far the “very ahead” category has a value 0 across all schools. This could mean that since semester is almost coming to an end, all of the students in this category have completed the course. Therefore if we want, we can remove this attribute from the dataset.

#School is in character converting it to factor
schools$School <- as.factor(schools$School)

#Renaming Columns
colnames(schools) <- c("School","Section","VeryAhead","Middling",
                       "Behind","MoreBehind","VeryBehind","Completed")

#Re-arranging the rows and columns
col_order <- c("School","Section","Completed","VeryAhead","Middling","Behind","MoreBehind","VeryBehind")
schools <- schools[,col_order]

#Since there is no data in "very ahead"" it would be ok to remove it
schools$VeryAhead <- NULL

#Since the school should be concerned about all the students who are too behind, we create an attribute TooBehind that sums up all the students in more behind and very behind attributes
schools$TooBehind <- schools$MoreBehind + schools$VeryBehind

Data Analysis & Visualization

In the code that follows I will be using the pipe operator also known as %>%. Here is a video of R chief data scientist wikham explaining why pipe operators are so cool! https://www.youtube.com/watch?v=40tyOFMZUSM. During our analysis we will be answering some of the business questions through visualization such as
1. How many students are there in each school
2. What is the distribution of student progress categories
3. How are student progress distributed across each school
4. Which schools have higher percentage of students ahead or behind
5. Are their any outliers in our analysis?
6. What is the breakup of students & schools according to sections

#What is the grade distribution of students across all the schools
school_performances <- gather(schools[,3:7]) %>% group_by(key) %>% summarise(count=sum(value))
school_performances

## # A tibble: 5 x 2
##   key        count
##   <chr>      <int>
## 1 Behind       754
## 2 Completed    316
## 3 Middling     222
## 4 MoreBehind   100
## 5 VeryBehind   209

Pie Chart - High Level Summary

#Pie Chart using quick plot
colors <- c("blue","green","yellow","red","brown","grey")
pie(school_performances$count, labels = school_performances$count, 
    main="Pie Chart of Students Progress in Schools",col = colors,radius = 0.6)

legend("bottomright", c("Behind","Completed","Middling","MoreBehind","VeryBehind"), 
       cex=0.6, fill=colors, ncol = 1)

The number of students in behind is alarming. Only 25% of the semester is left and the schools have more than 70% of their students lagging behind. However their is still hope for behind and middling category students if we identify schools and sections that have high numbers of students lagging behind we can assign resources accordingly.

Creating Frequency Table

#How many section in each School?
table(schools$School) #This command returns the frequency of each variable

## 
##  A  B  C  D  E 
## 13 12  3  1  1

School A has the highest number of sections

Bar Chart - Number of Students in each School

#How many Students in Each School?
students_by_school <- schools %>%
  
  mutate(Tot_students = `Middling` + `Behind` + `MoreBehind` + 
           `VeryBehind` + `Completed`) %>%
  
  group_by(School) %>%
  
  summarise(tot_stu_sch = sum(Tot_students))

students_by_school

## # A tibble: 5 x 2
##   School tot_stu_sch
##   <fct>        <int>
## 1 A              932
## 2 B              446
## 3 C               85
## 4 D               22
## 5 E              116

#Visually Plot Schools

#Bar Chart using ggplot
ggplot(students_by_school,aes(x= School,y = tot_stu_sch, fill = School, label = tot_stu_sch))+
  geom_bar(stat ="identity") + labs(x="Schools",y="No of Students",
                                    title = "Number of Students by School")

Bar Chart - Status of Students Across each schools

schoolsbyP <- gather(schools[,-c(2,8)],Status,Frequency,-School) %>% group_by(School,Status) %>%
  summarise(count = sum(Frequency)) %>% mutate(pct = count/sum(count))

#School Performance breakdown in relative and absolute values
sch_performance_bd <- ggplot(schoolsbyP,aes(x = School,y = count,fill=Status)) + 
  geom_bar(stat='identity',position="stack") + labs(x="Schools",
                                                    y="Value",
                                                    title = "Status of Students")
sch_performance_bd

sch_performance_bd_ab <- ggplot(schoolsbyP,aes(x = School,y = pct,fill=Status)) + 
  geom_bar(stat='identity',position="fill") + labs(x="Schools",y="Proportion",
                                                   title = "Status of Student in Percentage")+
  geom_text(aes(label=paste(round(pct*100,2),'%',sep ='')),position= position_stack(vjust=0.5),
            size = 2)

sch_performance_bd_ab

According to the 1st bar chart, we see that Schools A and B have a very high number of students middling or behind.In the coming weeks if resources are concentrated around these schools we can expect good results.

The 2nd bar chart is based on percentage.We observe the School D is performing very poorly and 27.27% of the students are very behind.There is little hope for these students this semester but in the future the administration needs to analyze as to why students are performing so poorly in School D and make improvements.

Box Plot - Performance across each school

#Box Plot: Distribution of performance across each school and section
schoolsbyS <- gather(schools,Status,Frequency,-School,-Section) %>% group_by(School,Status,Section) %>%
  summarise(count = sum(Frequency))


outliers <- ggplot(schoolsbyS, aes(x=School,y=count)) +
  geom_boxplot() +
  facet_wrap(~Status)
outliers

From the boxplot we observe that there are some sections in School A and B that are outliers. Pinpointing these sections will help us allocate resources in a much better way. We also see that School E has only one section and that has the highest number of students who belong in the category of completed or all behind.

ScatterPlots - Details by School and Section

ScatterPlots are very insightful because they help us create multi-dimensional plots.Now we will create scatterplots that will help us pin point which sections need help now and in the future. We will be creating 3 scatterplots for middling, behind and too behind categories. These plots will help us make better decisions.

Plot that maps Section Total vs Middling Counts

schools$SectionTotal <- rowSums(schools[,3:7])

#Middling
ScatterPMiddling <- ggplot(schools, aes(x=SectionTotal,y=Middling)) +
  geom_point(aes(color=School,size=Middling)) + geom_text(aes(label = Section),size=2)+
  geom_hline(yintercept = mean(schools$Middling),linetype ="dashed",color="red",size=1 )+
  labs(title = "Section Total vs Middling Counts")
ScatterPMiddling

In this Scatter plot the dotted red line shows mean of students who are in middling category. According to this scatter plot we see that in School A - section 7, 4, 10 and School E have a high number of students in middling category. Concentrating our efforts on them and on sections that are above the mean line will help us get better results now.

Plot that maps Section Total vs Behind Counts

ScatterPBehind <- ggplot(schools, aes(x=SectionTotal,y=Behind)) +
  geom_point(aes(color=School,size=Behind)) + geom_text(aes(label = Section),size=2)+
  geom_hline(yintercept = mean(schools$Behind),linetype ="dashed",color="red",size=1 )+ labs(title = "Section Total vs Behind Counts")
ScatterPBehind

In this scatterplot School A and School E have high numbers of student that are lagging behind. However there is still some hope to improve the performance if the administration concentrates on these 2 schools. #### Plot that maps Section Total vs TooBehind Counts

#Too Behind - This was calculated by adding the behind categories i:e More behind + Very Behind
ScatterPTooBehind <- ggplot(schools, aes(x=SectionTotal,y=TooBehind)) +
  geom_point(aes(color=School,size=TooBehind)) + geom_text(aes(label = Section),size=2)+
  geom_hline(yintercept = mean(schools$TooBehind),linetype ="dashed",color="red",size=1 )+ labs(title = "Section Total vs TooBehind Counts")
ScatterPTooBehind

Here too we see a same pattern. However we get A and E because these sections have a very high number of students.

Conclusion

Below are the observations and actions to be taken based on the above chart

Data discovery showed us 0 student in “Very Ahead” bucket. Assumption is that these students have completed their course
Pie chart showed us that more than 70% students are lagging behind
Bar chart showed us School A has the highest number of students
Relative Bar Chart showed us that School D is performing very poorly with 27% of their students lagging very behind. In future we would need to focus on improving School D
Box plot showed us some outliers for School A and B. To look more closely we then generated scatterplots
Middling scatterplot helped us identify schools which are in need of immediate help
Behind scatterplot showed us Schools A and E have high number of students lagging behind
Too Behind scatterplot gave us similar results that point to schools A and E

HW2-School Dataset - Solution

January 29, 2019

Schools Data Challenge