We were given the task of exploring a dataset provided by RMIT University. We analysed the CSV files provided to see what information could be derived from them. This was a general exploration rather than a deep analysis. The real purpose was for our team to learn how to use RStudio and to find packages suited to deeper analysis of the CSV files. You can download the dataset from here.

Methodology:

Students per course:

For the first part of the exploration, we counted how many students took each course in each semester. The results are shown in the graph below.

  • The number of students taking the course AAA for each semester was the lowest when compared to the rest.
  • The course BBB had a high number of students in every semester compared to the rest of the courses.
  • Course CCC was only run in the second-year semesters i.e. 2014B, 2014J.
  • The courses DDD and FFF have almost the same numbers in all semesters, with only slight changes.
  • EEE and GGG follow the same pattern: their numbers are almost the same in all semesters, except 2013B, when no students took those courses.
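library(ggplot2)
# One bar per module, one panel per semester. Columns are referenced by name
# inside aes() rather than with studentinfo$..., so that faceting works correctly.
ggplot(studentinfo, aes(x = code_module, fill = code_module)) +
    geom_bar() +
    labs(title = "Students in each module per semester", fill = "Modules", x = "Modules", y = "Number of students") +
    facet_wrap(~ code_presentation)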
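The counts behind a chart like this can also be checked numerically. The following is a minimal sketch in base R, assuming one row per student registration with the `code_module` and `code_presentation` columns used above. The small data frame is made-up toy data, not the real dataset.

```r
# Toy stand-in for the student table (hypothetical rows, for illustration only).
studentinfo <- data.frame(
  code_module       = c("AAA", "BBB", "BBB", "CCC", "BBB", "AAA"),
  code_presentation = c("2013J", "2013J", "2013J", "2014B", "2014B", "2014J")
)

# Cross-tabulate: rows are modules, columns are semesters.
module_counts <- table(studentinfo$code_module, studentinfo$code_presentation)
print(module_counts)

# A zero cell means no student took that module in that semester.
no_students <- module_counts == 0
print(no_students)
```

A table like this makes it easy to spot modules with no students in a semester, which a bar chart shows only as a missing bar.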

Gender Distribution:

Next, we explored the recorded gender data. The distribution of gender is shown in the graph below, split by semester for both years. Interestingly, females outnumber males only in the first semester (2013B). The proportion of males then increases in every subsequent semester, from 49.0% in 2013B to 57.7% in 2014J. The table below gives the exact counts and percentages of females and males for each semester.

library(ggplot2)
# One bar per gender, one panel per semester
ggplot(studentinfo, aes(x = gender, fill = gender)) +
    geom_bar() +
    labs(title = "Gender distribution for each semester", fill = "Gender", x = "Gender", y = "Number of students") +
    facet_wrap(~ code_presentation)

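library(knitr)
library(kableExtra)
# full_width is an argument of kable_styling(), not one of the bootstrap options
gender %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)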
Semesters   Females         Males
2013B       2389 (51.0%)    2295 (49.0%)
2013J       4200 (47.5%)    4645 (52.5%)
2014B       3368 (43.2%)    4436 (56.8%)
2014J       4761 (42.3%)    6499 (57.7%)
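The post does not show how the `gender` table was assembled. One possible way is sketched below in base R, assuming the `gender` column holds the codes "F" and "M" (an assumption, as are the toy rows and the `Females_pct` and `Males_pct` column names).

```r
# Toy stand-in for the student table (hypothetical rows, for illustration only).
studentinfo <- data.frame(
  code_presentation = c("2013B", "2013B", "2013B", "2013J", "2013J", "2013J", "2013J"),
  gender            = c("F", "F", "M", "F", "M", "M", "M")
)

# Counts per semester (rows) and gender (columns).
counts <- table(studentinfo$code_presentation, studentinfo$gender)

# Row-wise percentages: each semester sums to 100.
shares <- round(100 * prop.table(counts, margin = 1), 1)

# Assemble a summary table with counts and percentages side by side.
gender <- data.frame(
  Semesters   = rownames(counts),
  Females     = as.vector(counts[, "F"]),
  Males       = as.vector(counts[, "M"]),
  Females_pct = as.vector(shares[, "F"]),
  Males_pct   = as.vector(shares[, "M"])
)
print(gender)
```

Using `margin = 1` in `prop.table()` is what makes the percentages comparable across semesters of different sizes.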

Geographical location:

Following the gender distribution analysis, we explored the geographical location of students. We produced a bar chart of the number of students per region (shown below).
From the graph, we see that 2014 had more students enrolled per semester than 2013 in all regions. Some regions even had double the number of students.

ggplot(studentInfo) + aes(x=region, fill = region) + geom_bar() + facet_wrap(~ code_presentation, scales = "free_y") + theme(axis.text.x = element_text(angle = 90)) 
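The year-on-year comparison can be made explicit with a ratio per region. This is a minimal sketch in base R, assuming the first four characters of `code_presentation` give the year (as in 2013B or 2014J). The rows below are made-up toy data, and `year`, `by_year`, and `growth` are names introduced only for this illustration.

```r
# Toy stand-in for the student table (hypothetical rows, for illustration only).
studentInfo <- data.frame(
  region            = c("Wales", "Wales", "Wales", "Wales", "Wales",
                        "Scotland", "Scotland", "Scotland"),
  code_presentation = c("2013B", "2013J", "2014B", "2014B", "2014J",
                        "2013J", "2014J", "2014B")
)

# Derive the year from the presentation code (assumed format: year + letter).
studentInfo$year <- substr(studentInfo$code_presentation, 1, 4)

# Students per region (rows) and year (columns).
by_year <- table(studentInfo$region, studentInfo$year)

# Ratio of 2014 to 2013 enrolment; a value of 2 means the number doubled.
growth <- by_year[, "2014"] / by_year[, "2013"]
print(growth)
```

A ratio above 1 for every region would support the reading of the chart, and values near 2 would identify the regions where enrolment doubled.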

Disability:

To understand all aspects of the data, we also explored the disability rate in each semester. We have provided a graph below to display this. The proportion of students with a registered disability is very low compared to those without one. The true number of students with a disability may be higher, as some students may not have disclosed such details to the institution.
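The disability rate per semester can be computed in the same way as the gender shares. The sketch below uses base R and assumes a `disability` column coded "Y" or "N". The column name, the coding, and the toy rows are all assumptions made for illustration.

```r
# Toy stand-in for the student table (hypothetical rows, for illustration only).
studentinfo <- data.frame(
  code_presentation = c("2013J", "2013J", "2013J", "2013J",
                        "2014J", "2014J", "2014J", "2014J", "2014J"),
  disability        = c("N", "N", "N", "Y",
                        "N", "N", "N", "N", "Y")
)

# Counts per semester (rows) and disability flag (columns).
disability_counts <- table(studentinfo$code_presentation, studentinfo$disability)

# Percentage of students in each semester with a registered disability.
disability_rate <- round(100 * prop.table(disability_counts, margin = 1)[, "Y"], 1)
print(disability_rate)
```

Because the rate only reflects registered disabilities, it is best read as a lower bound, in line with the caveat above.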

Conclusion

We analysed the large dataset and found some interesting patterns in it. The team developed an understanding of how to use the R programming language, particularly for generating different graphs for analysis.

With these skills, we were able to analyse different aspects of the dataset, and we plan to go deeper into understanding the data.