We were tasked with exploring a dataset provided by RMIT University. We analysed the CSV files provided to see what information could be derived from them. This was a general, exploratory analysis rather than an in-depth one. The real purpose was for our team to learn how to use RStudio and to find packages suited to deeper analysis of the CSV files. You can download the dataset from here.
For the first part of the exploration, we counted how many students took each module in each semester. The results are shown in the graph below.
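library(ggplot2)
# Count students per module, one panel per semester (code_presentation).
# Columns are referenced by name inside aes() rather than via studentinfo$...
ggplot(studentinfo) + geom_bar(aes(x = code_module, fill = code_module)) + labs(title = "Students in each module per semester", fill = "Modules", x = "Modules", y = "Number of students") + facet_wrap(~ code_presentation)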
We then moved on to the recorded gender data. The graph below shows the gender distribution in each semester of both years. Interestingly, the first semester (2013B) is the only one in which females outnumber males. After that, the share of males rises in every semester, from 49.0% in 2013B to 57.7% in 2014J. We have also included a table below giving the exact numbers of females and males for each semester.
# Count students per gender, one panel per semester (code_presentation).
ggplot(studentinfo) + geom_bar(aes(x = gender, fill = gender)) + labs(title = "Gender distribution for each semester", fill = "Gender", x = "Gender", y = "Number of students") + facet_wrap(~ code_presentation)
library(knitr)
library(kableExtra)
# full_width is an argument of kable_styling(), not one of the bootstrap options.
gender %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
| Semester | Females | Males |
|---|---|---|
| 2013B | 2389 (51.0%) | 2295 (49.0%) |
| 2013J | 4200 (47.5%) | 4645 (52.5%) |
| 2014B | 3368 (43.2%) | 4436 (56.8%) |
| 2014J | 4761 (42.3%) | 6499 (57.7%) |
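The percentages in the table can be reproduced from the counts. The following is a minimal sketch in base R, not necessarily how the original `gender` table was built; the name `gender_counts` is introduced here for illustration only. With the full data loaded, the same matrix could presumably be obtained with `table()` on the semester and gender columns.

```r
# Counts copied from the table above (rows: semesters, columns: gender).
gender_counts <- matrix(
  c(2389, 2295,
    4200, 4645,
    3368, 4436,
    4761, 6499),
  ncol = 2, byrow = TRUE,
  dimnames = list(Semester = c("2013B", "2013J", "2014B", "2014J"),
                  Gender = c("Females", "Males"))
)

# Row-wise shares: each count divided by its semester total, in percent.
gender_pct <- round(100 * prop.table(gender_counts, margin = 1), 1)
print(gender_pct)
```

Using `margin = 1` makes each row sum to 100%, which is what is needed to compare the gender split across semesters of different sizes.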
Following the gender distribution analysis, we explored the geographical location of students. We produced a bar chart of the number of students per region (shown below).
From the graph, we see that 2014 had more students enrolled than 2013 in every region. In some regions, the number of students even doubled.
# R is case-sensitive: use the same data frame name (studentinfo) as in the earlier plots.
ggplot(studentinfo) + aes(x = region, fill = region) + geom_bar() + facet_wrap(~ code_presentation, scales = "free_y") + theme(axis.text.x = element_text(angle = 90))
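Because the panels use free y-axes, bar heights are hard to compare across semesters by eye. A rough way to check how much a region grew is to tabulate counts per region and semester and divide one column by another. The sketch below uses a tiny made-up data frame (`students`, with placeholder region names), not the real data; only the column names `region` and `code_presentation` are taken from the plotting code above.

```r
# Hypothetical stand-in rows; region names and counts are invented for illustration.
students <- data.frame(
  region            = c("Region A", "Region A", "Region A",
                        "Region B", "Region B", "Region B", "Region B"),
  code_presentation = c("2013J", "2014J", "2014J",
                        "2013J", "2013J", "2014J", "2014J")
)

# Students per region (rows) and semester (columns).
region_counts <- table(students$region, students$code_presentation)

# Growth factor from 2013J to 2014J; a value of 2 means enrolment doubled.
growth <- region_counts[, "2014J"] / region_counts[, "2013J"]
print(growth)
```

Comparing like-for-like semesters (for example 2013J against 2014J) seems the fairer choice here, since the B and J semesters differ considerably in size.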
To understand all aspects of the data, we also explored the disability rate in each semester. We have provided a graph below to display this. The proportion of students with a registered disability is very low compared to those without one. The true number of students with a disability may be higher, as some students may not have disclosed such details to the university.
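A plot in the same style as the earlier ones could look like the sketch below. It runs on a small made-up data frame so that it is self-contained; the column name `disability` and its "Y"/"N" coding are assumptions about the data, not something confirmed above.

```r
library(ggplot2)

# Hypothetical stand-in for the student table; the disability column
# and its "Y"/"N" values are assumed, and the rows are invented.
students <- data.frame(
  code_presentation = c("2013B", "2013B", "2013B", "2013J", "2013J", "2013J"),
  disability        = c("N", "N", "Y", "N", "N", "N")
)

# Share of students per disability status within each semester.
disability_share <- prop.table(
  table(students$code_presentation, students$disability), margin = 1
)
print(round(100 * disability_share, 1))

# One bar per disability status, one panel per semester.
p <- ggplot(students, aes(x = disability, fill = disability)) +
  geom_bar() +
  labs(title = "Disability status for each semester",
       fill = "Disability", x = "Disability", y = "Number of students") +
  facet_wrap(~ code_presentation)
print(p)
```

Printing the row-wise shares alongside the plot helps because the bars for students with a disability are likely to be too small to read accurately.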
Our analysis of these large datasets turned up some interesting patterns. Along the way, the team developed an understanding of how to use the R programming language, particularly for generating different graphs for analysis.
With these skills, we were able to analyse different aspects of the dataset, and we plan to go further and deeper into understanding the data.