In this assignment, we are going to study test scores from the The Programme for International Student Assessment (PISA) scheme, which tests 15-year-old students across all states in Australia. You can find the data sets and a code book for the assignment in the Data folder. Broadly speaking, PISA measures scholastic ability across three categories: science, reading and math.
The goal of this analysis is to understand if differences exist between PISA testing scores across various dimensions, such as income, school type, extra-curricular activities and gender.
Please ensure that the report knits properly into html and all the R code and R outputs are visible in the final knitted report. You will need to save your rendered html document into a pdf file (you can use your internet browser to print your html file into a pdf file) and upload that pdf file into Moodle for submission.
This is an individual assignment and you must use R code to answer all the questions. Make sure that you have your messages and warnings turned off before you submit the assessment (see lines 15-17 of this Rmd file) and echo = FALSE set for the R code chunk where you load your libraries.
pisa <- read.csv("pisa_au.csv")
pisa %>%
select(last_col(offset = 2):last_col()) %>% #creating a range- from third last to last column
tail(5)
## math science read
## 14526 578.2601 578.6234 596.3567
## 14527 351.0943 315.8658 381.5746
## 14528 639.1595 664.5759 583.6448
## 14529 707.5041 767.9710 694.8842
## 14530 572.2217 583.7702 534.1746
math,
science and read score across states. (2pts)
Create a table where you display the results (1pt). Which state has the
highest score in read. (2pts)pisa %>% group_by(state) %>% summarise(across(c(math, science, read), ~quantile(., 0.75), .names = "P75_{.col}"))
## # A tibble: 8 × 4
## state P75_math P75_science P75_read
## <chr> <dbl> <dbl> <dbl>
## 1 ACT 563. 598. 581.
## 2 NSW 550. 576. 569.
## 3 NT 521. 549. 541.
## 4 QLD 533. 561. 553.
## 5 SA 540. 572. 562.
## 6 TAS 524. 549. 538.
## 7 VIC 553. 575. 567.
## 8 WA 556. 584. 569.
*all values will be rounded to two decimal places
The Australian Capital Territory (ACT) shows the highest value in 75th percentile in ‘read’ with a score of 580.80
pisa %>% group_by(state) %>% summarise(across(c(math, science, read), ~max(.), .names = "max_{.col}"))
## # A tibble: 8 × 4
## state max_math max_science max_read
## <chr> <dbl> <dbl> <dbl>
## 1 ACT 737. 795. 784.
## 2 NSW 780. 780. 779.
## 3 NT 708. 763. 728.
## 4 QLD 732. 762. 755.
## 5 SA 752. 768. 735.
## 6 TAS 709. 757. 750.
## 7 VIC 750. 802. 765.
## 8 WA 736. 767. 767.
The Australian Capital Territory (ACT) also shows the highest maximum read score at 783.85
read
(3pts).tab1 <- pisa %>% filter(gender == 'female' & birthyr >= 2000) %>% group_by(schtype) %>% summarise(high_avg_read = mean(read)) %>% arrange(desc( high_avg_read))
tab1
## # A tibble: 3 × 2
## schtype high_avg_read
## <chr> <dbl>
## 1 Ind 548.
## 2 Catholic 523.
## 3 Gov 485.
There are no students born after year 2000. The ‘Ind’ school type shows the highest average performance in ‘read’ score at 648.25.
pisa %>% group_by(schtype, gender) %>% summarise(across(c(math, science, read), ~IQR(.), .names = "IQR_{.col}"))
## # A tibble: 6 × 5
## # Groups: schtype [3]
## schtype gender IQR_math IQR_science IQR_read
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Catholic female 108. 128. 113.
## 2 Catholic male 124. 139. 124.
## 3 Gov female 120. 142. 135.
## 4 Gov male 132. 159. 149.
## 5 Ind female 103. 106. 104.
## 6 Ind male 116. 126. 119.
For all three subjects, males in ‘Gov’ school types shows the biggest interquartile range, indicating a bigger spread and higher variability in the scores.
pisa %>% filter(birthyr >= 2000) %>% pivot_longer(cols = c(math, science, read), names_to = "subject", values_to = "score") %>% ggplot(aes(x = subject, y = score, fill = gender)) + geom_boxplot() + facet_wrap(~schtype) + ggtitle('Boxplot showing the variability in scores across subjects')
This is further shown with a boxplot visualisation.
geom_histogram plot the distribution of marks in
read by gender using a faceted plot with shading to capture
school type (2pt). Which combination of states/schools do female
students peform worst in (1pt)? Are the results similar for
science (1pt)?pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_boxplot() + facet_grid(state~gender)+ scale_fill_brewer(palette = "Blues") + ggtitle('Histogram showing the distribution of read scores across states, school types by gender')
Based on the box plot above, the left hand side of the faceted plots represents the distribution of read scores in females across the states. Across all three states, ‘Gov’ school types has the lowest 50th percentile score. The spread is fairly symmetric for NSW and QLD, and is negatively skewed for VIC with more outliers. Overall, females perform worse with the combination of ‘Gov’ school types in VIC.
pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = science, fill = schtype)) + geom_boxplot() + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") + ggtitle("Boxplot showing the distribution of science scores across states, school types by gender")
Results are similar for science scores. ‘Gov’ school types continues to show the lowest 50th percentile scores. The main differences between read and science scores, will be in spread of data, notably in NSW. Interestingly, despite the lowest median score, QLD shows a slightly positively skewed distribution.
geom_density (1pt). Which set of results, those for
geom_density or geom_histogram allows one to
more accurately compare across results (3pt)? Why?pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_density(alpha = 0.5) + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") +labs(title = "Density plot showing the distribution of read scores across sch types, states by gender", x = "Read Scores", y = "Density ")
pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_histogram() + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") + labs(title = "Histogram showing the distribution of read scores across school types, states by gender", x = "Read Scores", y = "Count of Students")
The use of histograms or density plot to compare results will depend on
the goal. A density plot is able to provide the Probability Density
Function (PDF), and able to provide the likelihood of a student’s score,
in a particular school type of state. It is also useful for identifying
insights in trends and distributions. On the other hand, histogram plots
are more useful for frequency based analysis and tends to work better on
a smaller dataset.
In this case, the goal is to compare scores in females across states and school types, in the form of a frequency based analysis. Hence, a histogram will be the better choice for a more accurate comparison of results.
geom_point()
describe the relationship between tot_time (x-axis) and
tot_score (y-axis). (2pts)pisa_filtered <- na.omit(pisa) #removing the missing values
pisa_filtered <- pisa_filtered %>% mutate(`_tot_score_` = math + science + read) %>% mutate(`_tot_time_` = math_time + science_time + read_time) #new variable of total score, total time is created, and added back to original data frame.
ggplot(pisa_filtered, aes(x = `_tot_time_`, y = `_tot_score_`)) + geom_point() + labs(title = "Scatter plot showing the relationship between the total time and score in student scores", x = "Total Score", y = "Total Time")
The majority of data points are clustered roughly at 800, indicating
that most students have a similar total score. There is a greater
vertical than horizontal spread of data points, indicating that there is
a bigger variation in the total time taken, as opposed to the total
scores. There are a few outliers from the main cluster, with the data
showing students performing much better, despite taking the same amount
of time. Overall, there is no clear pattern or trend, to indicate a
correlation between the total time taken, and score achieved. This
suggests that there are other other contributing factors to explain the
variation in scores.
language records the
language most often spoken at students homes, with language code 313
referring to English. On average, do students who do not
speak English in their home perform worse than native english speakers
in terms of total scores? Does your answer remain the same when we look
at students who perform in the lowest 25th quantile?pisa_filtered <- pisa_filtered %>% mutate(english_native_speaker = ifelse(language == '313', 'Yes', 'No'))
ggplot(pisa_filtered) + geom_bar(aes(x = english_native_speaker, y = `_tot_score_`), stat = 'identity') + labs(title = "Box plot comparing test scores in Native and Non- Native English Speakers", x = "English Native Speaker", y = "Total Score")
Yes. On average, english native speakers perform significantly better in
their test scores as compared to non- english native speakers.
pisa_summary <- pisa_filtered %>% group_by(english_native_speaker) %>% summarise(p25_tot_score = quantile(`_tot_score_`, 0.25))
ggplot(pisa_summary, aes(x = english_native_speaker, y = p25_tot_score))+ geom_bar(stat = 'identity') + labs(title = "Box plot comparing 25 percentile test scores in Native and Non- Native English Speakers", x = "English Native Speaker", y = "Total Score")
No, the results are not the same. The differences in scores between the
two groups are very minimal in the bottom 25th percentile.
facet_wrap(), plot the total scores
using densities for both males and females. What do the results tell us
about the usefulness of looking at average scores?mean_scores <- pisa_filtered %>% group_by(gender) %>% summarise(avg_score = mean(`_tot_score_`))
pisa_filtered %>% ggplot(aes( x= `_tot_score_`)) + geom_density() + facet_wrap( ~gender) + geom_vline(data = mean_scores, aes(xintercept = avg_score, color = gender)) + labs(title = "Density plot comparing total scores by gender", x = "Total Score", y = "Density")
The density plot shows a higher peak (mode) for females, indicating a greater number of female students with the same score. Both male and female scores show a fairly normal distribution, with no skewness of data. There is a marginally higher average score in males than females. The similarity in distribution indicates that both groups have a similar range and variability in test scores. A V- line was added to visualise the average score. However, it was not very useful as both averages are very similar.
set.seed(6)
k_means_data <- pisa_filtered %>% select(anxtest, motivat, `_tot_time_`, `_tot_score_`)
kmeans_5 <- kmeans(k_means_data, centers = 5)
kmeans_table <- data.frame(kmeans_5$size, kmeans_5$centers)
kmeans_table
## kmeans_5.size anxtest motivat X_tot_time_ X_tot_score_
## 1 1424 0.34628687 0.1294504 638.9080 1198.483
## 2 453 0.08734481 0.6326269 1052.8918 1803.004
## 3 2567 0.24770296 0.3300835 658.6482 1527.032
## 4 1968 0.04046153 0.6192629 661.8471 1833.236
## 5 179 0.24078045 0.3077480 1411.0335 1365.989
With a 5 cluster solution, there are 1424, 453, 2567, 1968, 179 in Cluster 1,2,3,4,5 respectively, with set.seed(6).
set.seed(6)
kmeans_5_df1 <- data.frame(Clusters = kmeans_5$cluster, pisa_filtered) #appending cluster solution back to dataset
cluster_labels <- c("Anxious/ Low Motivation", "High Achiever", "Average", "High Achiever/ Efficient", "High Motivation/ Inefficient") #descriptive labels added for better clarity in the relationship between motivational factors and total scores for further visualisations
kmeans_5_df1 <- kmeans_5_df1 %>%
mutate(Labels = (cluster_labels[Clusters]))
#cluster assignments indexed to cluster labels
median_values <- kmeans_5_df1 %>% group_by(Clusters) %>% summarise(across(c(anxtest, motivat, `X_tot_time_`, `X_tot_score_`), median)) %>% arrange(desc(`X_tot_score_`))
median_values #table showing median values
## # A tibble: 5 × 5
## Clusters anxtest motivat X_tot_time_ X_tot_score_
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 4 -0.0345 0.503 660 1808.
## 2 2 0.0752 0.539 1000 1795.
## 3 3 0.209 0.158 660 1533.
## 4 5 0.156 0.230 1350 1375.
## 5 1 0.259 -0.145 660 1224.
The following features were identified for each cluster: Cluster 1
(1424)- Lowest score: Highest anxiety, lowest (negative) motivation and
spends a low amount of time. Their anxiety and low motivation is
negatively affecting their scores. Cluster 2 (453)- Good score. Highest
motivation, moderate anxiety levels. There is a considerable amount of
time spend on the task, suggesting their anxiety might be driving their
motivation. Cluster 3 (2567)- Average score. Moderate anxiety and
motivation. Spends a low amount of time.
Cluster 4 (1968)- Highest score. Lowest anxiety with a relatively high
motivation, while spending a low amount of time. Their calmness and
efficiency contributes to their high performance. Cluster 5 (179)- Low
Score. Moderate anxiety and motivation levels, but spends the longest
time on tasks. Their anxiety increases their time on tasks, but does not
translate to a higher score.
*efficiency was gauged on how much total time they spent, as compared to the score received.
According to the cluster features, the key insights are: 1) If there is a high level of anxiety, a greater number of time spent on the task does not translate to a higher score, 2) Low anxiety and high motivation produces the highest efficiency in performance and vice versa, 3) Majority of the students are average, with moderate anxiety and motivation.
tot_scores
and color the plot by cluster. (3pt). Is there a meaningful difference
between the scores across the different clusters? (1pt). What do you
think this finding says about the importance of motivational factors in
overall test scores? (2pt)p_anxtest <- kmeans_5_df1 %>% ggplot(aes(x = anxtest, y = `X_tot_score_`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by anxiety levels", x = "Scores", y = "Total Score")
p_motivat <- kmeans_5_df1 %>% ggplot(aes(x = motivat, y = `X_tot_score_`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by motivation levels", x = "Scores", y= "Total Score")
combined_plot <- plot_grid(p_anxtest, p_motivat, ncol = 2)
title <- ggdraw() + draw_label("Cluster Analysis of Total Scores")
final_plot <- plot_grid(title, combined_plot, ncol = 1, rel_heights = c(0.1, 1))
final_plot
From the graph above, there are three distinct clusters showing
meaningful differences in their scores: Cluster 4 (High Efficiency)-
with the highest score, Cluster 1 (Anxious/Low Motivation)- with the
lowest score, and Cluster 3 (Average)- with the average score. Cluster 2
(Motivated) is the smallest cluster, characterized by high scores, with
some outliers falling into the average range. In contrast, Cluster 5
(Diligent) shows an even distribution between average and low
scores.
Overall, the clusters exhibit similar trends in their distribution. Notably, despite similar total scores, anxiety scores show significant variability. High scorers, though often outliers, can have both extremely low and high anxiety levels. On the other hand, motivation appears to have a more direct impact on performance: high motivation is associated with high scores, while low motivation typically results in average scores at best.
set.seed(6)
k_means_read_data <- pisa_filtered %>% select(anxtest, motivat, `_tot_time_`, read)
kmeans_read <- kmeans(k_means_read_data, centers = 5)
kmeans_read_table <- data.frame(kmeans_read$size, kmeans_read$centers)
kmeans_read_df1 <- data.frame(Clusters = kmeans_read$cluster, pisa_filtered)
kmeans_read_table
## kmeans_read.size anxtest motivat X_tot_time_ read
## 1 846 0.2591387 0.2920376 468.3688 491.3349
## 2 836 0.1644999 0.4872213 915.4127 543.8466
## 3 1976 0.2421139 0.1725094 673.5324 436.2662
## 4 2681 0.1511062 0.5517251 666.4976 588.2725
## 5 252 0.2022829 0.4666810 1401.9841 531.7736
cluster_labels_read <- c("Average", "High Anxiety/ Low Motivation", "Low Anxiety/ High Motivation", "Good Scorers", "Low Scorers")
kmeans_read_df1 <- kmeans_read_df1 %>%
mutate(Labels = (cluster_labels_read[Clusters]))
median_read <- kmeans_read_df1 %>% group_by(Clusters) %>% summarise(across(c(anxtest, motivat, `X_tot_time_`, read), median)) %>% arrange(desc(read))
median_read
## # A tibble: 5 × 5
## Clusters anxtest motivat X_tot_time_ read
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 4 0.0752 0.463 660 582.
## 2 2 0.201 0.463 900 551.
## 3 5 0.0752 0.409 1320 537.
## 4 1 0.209 0.122 495 496.
## 5 3 0.248 -0.0413 675 449.
*Different labelling methods were used due to their differing key features.
Cluster 1(846): Moderate anxiety, motivation and score. Interestingly, the previous results showed this group as the majority of the population. However, this result shows this as a much smaller cluster. Cluster 2(836): High anxiety and low motivation, with low read score. Cluster 3(1976): High motivation, moderate anxiety and high read scores. There is an efficient use of time. Cluster 4(2671): This is the biggest cluster and has a similar profile to Cluster 1, with a slightly higher read score. Cluster 5(252): This is the smallest cluster. There is moderate anxiety with low scores. There is a significant inefficient use of time.
Key differentiating insights: 1) There is considerably less time total time spent overall. This might be due to the nature of the task.
read_anxtest <- kmeans_read_df1 %>% ggplot(aes(x = anxtest, y = `read`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by anxiety levels", x = "Scores", y = "Read Score")
read_motivat <- kmeans_5_df1 %>% ggplot(aes(x = motivat, y = `read`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by motivation levels", x = "Scores", y= "Read Score")
combined_read_plot <- plot_grid(read_anxtest, read_motivat, ncol = 2)
title_read <- ggdraw() + draw_label("Cluster Analysis of Read Scores")
final_read_plot <- plot_grid(combined_read_plot, ncol = 1, rel_heights = c(0.1, 1))
final_read_plot
As compared to the previous results, there are now 4 distinct, as
opposed to 3 clusters when looking at anxiety levels. The average
cluster shows a big variability in read scores. However, the clustering
is similar in motivation levels.
In all, there is consistency in the findings. Both plots show that high anxiety negatively impacts performance, and a high motivation positively impacts performance.