ETC1010-5510: Introduction to Data Analysis

In this assignment, we are going to study test scores from the The Programme for International Student Assessment (PISA) scheme, which tests 15-year-old students across all states in Australia. You can find the data sets and a code book for the assignment in the Data folder. Broadly speaking, PISA measures scholastic ability across three categories: science, reading and math.

The goal of this analysis is to understand if differences exist between PISA testing scores across various dimensions, such as income, school type, extra-curricular activities and gender.

Please ensure that the report knits properly into html and all the R code and R outputs are visible in the final knitted report. You will need to save your rendered html document into a pdf file (you can use your internet browser to print your html file into a pdf file) and upload that pdf file into Moodle for submission.

This is an individual assignment and you must use R code to answer all the questions. Make sure that you have your messages and warnings turned off before you submit the assessment (see lines 15-17 of this Rmd file) and echo = FALSE set for the R code chunk where you load your libraries.

Question 1: Read in the pisa data set (1pt) and show the last 5 rows and last three variables in the data frame (1pt).

pisa <- read.csv("pisa_au.csv")
pisa %>% 
  select(last_col(offset = 2):last_col()) %>% #creating a range- from third last to last column
  tail(5)

##           math  science     read
## 14526 578.2601 578.6234 596.3567
## 14527 351.0943 315.8658 381.5746
## 14528 639.1595 664.5759 583.6448
## 14529 707.5041 767.9710 694.8842
## 14530 572.2217 583.7702 534.1746

Question 2: Calculate the 75th quantile for `math`, `science` and `read` score across states. (2pts) Create a table where you display the results (1pt). Which state has the highest score in `read`. (2pts)

pisa %>% group_by(state) %>% summarise(across(c(math, science, read), ~quantile(., 0.75), .names = "P75_{.col}"))

## # A tibble: 8 × 4
##   state P75_math P75_science P75_read
##   <chr>    <dbl>       <dbl>    <dbl>
## 1 ACT       563.        598.     581.
## 2 NSW       550.        576.     569.
## 3 NT        521.        549.     541.
## 4 QLD       533.        561.     553.
## 5 SA        540.        572.     562.
## 6 TAS       524.        549.     538.
## 7 VIC       553.        575.     567.
## 8 WA        556.        584.     569.

*all values will be rounded to two decimal places

The Australian Capital Territory (ACT) shows the highest value in 75th percentile in ‘read’ with a score of 580.80

pisa %>% group_by(state) %>% summarise(across(c(math, science, read), ~max(.), .names = "max_{.col}"))

## # A tibble: 8 × 4
##   state max_math max_science max_read
##   <chr>    <dbl>       <dbl>    <dbl>
## 1 ACT       737.        795.     784.
## 2 NSW       780.        780.     779.
## 3 NT        708.        763.     728.
## 4 QLD       732.        762.     755.
## 5 SA        752.        768.     735.
## 6 TAS       709.        757.     750.
## 7 VIC       750.        802.     765.
## 8 WA        736.        767.     767.

The Australian Capital Territory (ACT) also shows the highest maximum read score at 783.85

Question 3: For female students born after on or 2000, which type of school had the highest average performce in `read` (3pts).

tab1 <- pisa %>% filter(gender == 'female' & birthyr >= 2000) %>% group_by(schtype) %>% summarise(high_avg_read = mean(read)) %>% arrange(desc( high_avg_read))
tab1

## # A tibble: 3 × 2
##   schtype  high_avg_read
##   <chr>            <dbl>
## 1 Ind               548.
## 2 Catholic          523.
## 3 Gov               485.

There are no students born after year 2000. The ‘Ind’ school type shows the highest average performance in ‘read’ score at 648.25.

Question 4: For male and female students born after on or 2000, across the different types of schools, whose scores were more variable in reading? (3pts). Math (1pt)?. Science (1pt)? Place all results in a single table to receive full marks.

pisa %>% group_by(schtype, gender) %>% summarise(across(c(math, science, read), ~IQR(.), .names = "IQR_{.col}"))

## # A tibble: 6 × 5
## # Groups:   schtype [3]
##   schtype  gender IQR_math IQR_science IQR_read
##   <chr>    <chr>     <dbl>       <dbl>    <dbl>
## 1 Catholic female     108.        128.     113.
## 2 Catholic male       124.        139.     124.
## 3 Gov      female     120.        142.     135.
## 4 Gov      male       132.        159.     149.
## 5 Ind      female     103.        106.     104.
## 6 Ind      male       116.        126.     119.

For all three subjects, males in ‘Gov’ school types shows the biggest interquartile range, indicating a bigger spread and higher variability in the scores.

pisa %>% filter(birthyr >= 2000) %>% pivot_longer(cols = c(math, science, read), names_to = "subject", values_to = "score") %>% ggplot(aes(x = subject, y = score, fill = gender)) + geom_boxplot() + facet_wrap(~schtype) + ggtitle('Boxplot showing the variability in scores across subjects')

This is further shown with a boxplot visualisation.

Question 5: For the states of VIC, NSW, and QLD, using `geom_histogram` plot the distribution of marks in `read` by gender using a faceted plot with shading to capture school type (2pt). Which combination of states/schools do female students peform worst in (1pt)? Are the results similar for `science` (1pt)?

pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_boxplot() + facet_grid(state~gender)+ scale_fill_brewer(palette = "Blues") + ggtitle('Histogram showing the distribution of read scores across states, school types by gender')

Based on the box plot above, the left hand side of the faceted plots represents the distribution of read scores in females across the states. Across all three states, ‘Gov’ school types has the lowest 50th percentile score. The spread is fairly symmetric for NSW and QLD, and is negatively skewed for VIC with more outliers. Overall, females perform worse with the combination of ‘Gov’ school types in VIC.

pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = science, fill = schtype)) + geom_boxplot() + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") + ggtitle("Boxplot showing the distribution of science scores across states, school types by gender")

Results are similar for science scores. ‘Gov’ school types continues to show the lowest 50th percentile scores. The main differences between read and science scores, will be in spread of data, notably in NSW. Interestingly, despite the lowest median score, QLD shows a slightly positively skewed distribution.

Question 6: Repeat the above exercise using `geom_density` (1pt). Which set of results, those for `geom_density` or `geom_histogram` allows one to more accurately compare across results (3pt)? Why?

pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_density(alpha = 0.5) + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") +labs(title = "Density plot showing the distribution of read scores across sch types, states by gender", x = "Read Scores", y = "Density ")

pisa %>% filter(state %in% c('VIC','NSW','QLD')) %>% ggplot(aes(x = read, fill = schtype)) + geom_histogram() + facet_grid(state~gender) + scale_fill_brewer(palette = "Blues") + labs(title = "Histogram showing the distribution of read scores across school types, states by gender", x = "Read Scores", y = "Count of Students")

The use of histograms or density plot to compare results will depend on the goal. A density plot is able to provide the Probability Density Function (PDF), and able to provide the likelihood of a student’s score, in a particular school type of state. It is also useful for identifying insights in trends and distributions. On the other hand, histogram plots are more useful for frequency based analysis and tends to work better on a smaller dataset.

In this case, the goal is to compare scores in females across states and school types, in the form of a frequency based analysis. Hence, a histogram will be the better choice for a more accurate comparison of results.

Question 7: First, create a data frame called pisa_filtered that excludes observations with missing values among any variable (2pt). Then, calculate a new variable called tot_score that is the sum of the math/science/reading scores and add this to the data frame pisa_filtered (1pt); in addition, calculate a new variable tot_time as the sum of the math_time/read_time/science_time and add this new variable to the data frame pisa_filtered (1pt). Using a `geom_point()` describe the relationship between tot_time (x-axis) and tot_score (y-axis). (2pts)

pisa_filtered <- na.omit(pisa) #removing the missing values 

pisa_filtered <- pisa_filtered %>% mutate(`_tot_score_` = math + science + read) %>% mutate(`_tot_time_` = math_time + science_time + read_time) #new variable of total score, total time is created, and added back to original data frame. 

ggplot(pisa_filtered, aes(x = `_tot_time_`, y = `_tot_score_`)) + geom_point() + labs(title = "Scatter plot showing the relationship between the total time and score in student scores", x = "Total Score", y = "Total Time")

The majority of data points are clustered roughly at 800, indicating that most students have a similar total score. There is a greater vertical than horizontal spread of data points, indicating that there is a bigger variation in the total time taken, as opposed to the total scores. There are a few outliers from the main cluster, with the data showing students performing much better, despite taking the same amount of time. Overall, there is no clear pattern or trend, to indicate a correlation between the total time taken, and score achieved. This suggests that there are other other contributing factors to explain the variation in scores.

Question 8. The Australian PISA test is administered in English. It is believed that, on average, students who speak languages other than english at their primary residence may be disadvantaged by having to take the test in English. The variable `language` records the language most often spoken at students homes, with language code 313 referring to `English`. On average, do students who do not speak English in their home perform worse than native english speakers in terms of total scores? Does your answer remain the same when we look at students who perform in the lowest 25th quantile?

pisa_filtered <- pisa_filtered %>% mutate(english_native_speaker = ifelse(language == '313', 'Yes', 'No')) 
ggplot(pisa_filtered) + geom_bar(aes(x = english_native_speaker, y = `_tot_score_`), stat = 'identity') + labs(title = "Box plot comparing test scores in Native and Non- Native English Speakers", x = "English Native Speaker", y = "Total Score")

Yes. On average, english native speakers perform significantly better in their test scores as compared to non- english native speakers.

pisa_summary <- pisa_filtered %>% group_by(english_native_speaker) %>% summarise(p25_tot_score = quantile(`_tot_score_`, 0.25))
ggplot(pisa_summary, aes(x = english_native_speaker, y = p25_tot_score))+ geom_bar(stat = 'identity') + labs(title = "Box plot comparing 25 percentile test scores in Native and Non- Native English Speakers", x = "English Native Speaker", y = "Total Score")

No, the results are not the same. The differences in scores between the two groups are very minimal in the bottom 25th percentile.

Question 9. Using `facet_wrap()`, plot the total scores using densities for both males and females. What do the results tell us about the usefulness of looking at average scores?

mean_scores <- pisa_filtered %>% group_by(gender) %>% summarise(avg_score = mean(`_tot_score_`))
pisa_filtered %>% ggplot(aes( x= `_tot_score_`)) + geom_density() + facet_wrap( ~gender) + geom_vline(data = mean_scores, aes(xintercept = avg_score, color = gender)) + labs(title = "Density plot comparing total scores by gender", x = "Total Score", y = "Density")

The density plot shows a higher peak (mode) for females, indicating a greater number of female students with the same score. Both male and female scores show a fairly normal distribution, with no skewness of data. There is a marginally higher average score in males than females. The similarity in distribution indicates that both groups have a similar range and variability in test scores. A V- line was added to visualise the average score. However, it was not very useful as both averages are very similar.

Question 10. Previous analysis has so yielded several categorical variables that may influence test scores, such as gender, music_instr, and what type of school the student is from, schtype. However, there are many other numerical variables that we may have missed. As a first step in understanding these variables, we can cluster the data to try and see if any patterns emerge beyond those already seen. Use k-means clustering on the following variables with \(k=5\) clusters: anxtest, motivat, tot_time, and tot_score.(3pt) Report the number of elements in each cluster (2pt).

set.seed(6)

k_means_data <- pisa_filtered %>% select(anxtest, motivat, `_tot_time_`, `_tot_score_`)

kmeans_5 <- kmeans(k_means_data, centers = 5)
kmeans_table <- data.frame(kmeans_5$size, kmeans_5$centers)

kmeans_table

##   kmeans_5.size    anxtest   motivat X_tot_time_ X_tot_score_
## 1          1424 0.34628687 0.1294504    638.9080     1198.483
## 2           453 0.08734481 0.6326269   1052.8918     1803.004
## 3          2567 0.24770296 0.3300835    658.6482     1527.032
## 4          1968 0.04046153 0.6192629    661.8471     1833.236
## 5           179 0.24078045 0.3077480   1411.0335     1365.989

With a 5 cluster solution, there are 1424, 453, 2567, 1968, 179 in Cluster 1,2,3,4,5 respectively, with set.seed(6).

Question 11. To understand the clusters, produce a table that displays the median value of each variable in the cluster, and arrange the values from largest to smallest by tot_score (2pt). Describe the relationship between the clusters and the variables tot_score, anxtest, motivat; i.e., what do you notice about the similarity between these variables and the assigned cluster? (3pt)

set.seed(6)

kmeans_5_df1 <- data.frame(Clusters = kmeans_5$cluster, pisa_filtered) #appending cluster solution back to dataset 
cluster_labels <- c("Anxious/ Low Motivation", "High Achiever", "Average", "High Achiever/ Efficient", "High Motivation/ Inefficient") #descriptive labels added for better clarity in the relationship between motivational factors and total scores for further visualisations
kmeans_5_df1 <- kmeans_5_df1 %>%
  mutate(Labels = (cluster_labels[Clusters]))
 #cluster assignments indexed to cluster labels 

median_values <- kmeans_5_df1 %>% group_by(Clusters) %>% summarise(across(c(anxtest, motivat, `X_tot_time_`, `X_tot_score_`), median)) %>% arrange(desc(`X_tot_score_`)) 
median_values #table showing median values

## # A tibble: 5 × 5
##   Clusters anxtest motivat X_tot_time_ X_tot_score_
##      <int>   <dbl>   <dbl>       <dbl>        <dbl>
## 1        4 -0.0345   0.503         660        1808.
## 2        2  0.0752   0.539        1000        1795.
## 3        3  0.209    0.158         660        1533.
## 4        5  0.156    0.230        1350        1375.
## 5        1  0.259   -0.145         660        1224.

The following features were identified for each cluster: Cluster 1 (1424)- Lowest score: Highest anxiety, lowest (negative) motivation and spends a low amount of time. Their anxiety and low motivation is negatively affecting their scores. Cluster 2 (453)- Good score. Highest motivation, moderate anxiety levels. There is a considerable amount of time spend on the task, suggesting their anxiety might be driving their motivation. Cluster 3 (2567)- Average score. Moderate anxiety and motivation. Spends a low amount of time.
Cluster 4 (1968)- Highest score. Lowest anxiety with a relatively high motivation, while spending a low amount of time. Their calmness and efficiency contributes to their high performance. Cluster 5 (179)- Low Score. Moderate anxiety and motivation levels, but spends the longest time on tasks. Their anxiety increases their time on tasks, but does not translate to a higher score.

*efficiency was gauged on how much total time they spent, as compared to the score received.

According to the cluster features, the key insights are: 1) If there is a high level of anxiety, a greater number of time spent on the task does not translate to a higher score, 2) Low anxiety and high motivation produces the highest efficiency in performance and vice versa, 3) Majority of the students are average, with moderate anxiety and motivation.

Question 12. Plot the relationship between `tot_scores` and color the plot by cluster. (3pt). Is there a meaningful difference between the scores across the different clusters? (1pt). What do you think this finding says about the importance of motivational factors in overall test scores? (2pt)

p_anxtest <- kmeans_5_df1 %>% ggplot(aes(x = anxtest, y = `X_tot_score_`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by anxiety levels", x = "Scores", y = "Total Score")
p_motivat <- kmeans_5_df1 %>% ggplot(aes(x = motivat, y = `X_tot_score_`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by motivation levels", x = "Scores", y= "Total Score")

combined_plot <- plot_grid(p_anxtest, p_motivat, ncol = 2)
title <- ggdraw() + draw_label("Cluster Analysis of Total Scores")

final_plot <- plot_grid(title, combined_plot, ncol = 1, rel_heights = c(0.1, 1))
final_plot

From the graph above, there are three distinct clusters showing meaningful differences in their scores: Cluster 4 (High Efficiency)- with the highest score, Cluster 1 (Anxious/Low Motivation)- with the lowest score, and Cluster 3 (Average)- with the average score. Cluster 2 (Motivated) is the smallest cluster, characterized by high scores, with some outliers falling into the average range. In contrast, Cluster 5 (Diligent) shows an even distribution between average and low scores.

Overall, the clusters exhibit similar trends in their distribution. Notably, despite similar total scores, anxiety scores show significant variability. High scorers, though often outliers, can have both extremely low and high anxiety levels. On the other hand, motivation appears to have a more direct impact on performance: high motivation is associated with high scores, while low motivation typically results in average scores at best.

Question 14 (ETC5510): Do the findings described in question 11 remain true if we instead analyse the relationship between read and cluster assignment? (5pt)

set.seed(6)

k_means_read_data <- pisa_filtered %>% select(anxtest, motivat, `_tot_time_`, read)
kmeans_read <- kmeans(k_means_read_data, centers = 5)
kmeans_read_table <- data.frame(kmeans_read$size, kmeans_read$centers)
kmeans_read_df1 <- data.frame(Clusters = kmeans_read$cluster, pisa_filtered)
kmeans_read_table

##   kmeans_read.size   anxtest   motivat X_tot_time_     read
## 1              846 0.2591387 0.2920376    468.3688 491.3349
## 2              836 0.1644999 0.4872213    915.4127 543.8466
## 3             1976 0.2421139 0.1725094    673.5324 436.2662
## 4             2681 0.1511062 0.5517251    666.4976 588.2725
## 5              252 0.2022829 0.4666810   1401.9841 531.7736

cluster_labels_read <- c("Average", "High Anxiety/ Low Motivation", "Low Anxiety/ High Motivation", "Good Scorers", "Low Scorers") 
kmeans_read_df1 <- kmeans_read_df1 %>%
  mutate(Labels = (cluster_labels_read[Clusters]))

median_read <- kmeans_read_df1 %>% group_by(Clusters) %>% summarise(across(c(anxtest, motivat, `X_tot_time_`, read), median)) %>% arrange(desc(read))
median_read

## # A tibble: 5 × 5
##   Clusters anxtest motivat X_tot_time_  read
##      <int>   <dbl>   <dbl>       <dbl> <dbl>
## 1        4  0.0752  0.463          660  582.
## 2        2  0.201   0.463          900  551.
## 3        5  0.0752  0.409         1320  537.
## 4        1  0.209   0.122          495  496.
## 5        3  0.248  -0.0413         675  449.

*Different labelling methods were used due to their differing key features.

Cluster 1(846): Moderate anxiety, motivation and score. Interestingly, the previous results showed this group as the majority of the population. However, this result shows this as a much smaller cluster. Cluster 2(836): High anxiety and low motivation, with low read score. Cluster 3(1976): High motivation, moderate anxiety and high read scores. There is an efficient use of time. Cluster 4(2671): This is the biggest cluster and has a similar profile to Cluster 1, with a slightly higher read score. Cluster 5(252): This is the smallest cluster. There is moderate anxiety with low scores. There is a significant inefficient use of time.

Key differentiating insights: 1) There is considerably less time total time spent overall. This might be due to the nature of the task.

read_anxtest <- kmeans_read_df1 %>% ggplot(aes(x = anxtest, y = `read`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by anxiety levels", x = "Scores", y = "Read Score")
read_motivat <- kmeans_5_df1 %>% ggplot(aes(x = motivat, y = `read`, color = factor(Labels))) + geom_point() + scale_color_brewer(palette = "Set1") + labs(title = "Total scores by motivation levels", x = "Scores", y= "Read Score")

combined_read_plot <- plot_grid(read_anxtest, read_motivat, ncol = 2)
title_read <- ggdraw() + draw_label("Cluster Analysis of Read Scores")

final_read_plot <- plot_grid(combined_read_plot, ncol = 1, rel_heights = c(0.1, 1))
final_read_plot

As compared to the previous results, there are now 4 distinct, as opposed to 3 clusters when looking at anxiety levels. The average cluster shows a big variability in read scores. However, the clustering is similar in motivation levels.

In all, there is consistency in the findings. Both plots show that high anxiety negatively impacts performance, and a high motivation positively impacts performance.

ETC1010-5510: Introduction to Data Analysis

Debbie Tan”

Assignment 2

Question 1: Read in the pisa data set (1pt) and show the last 5 rows and last three variables in the data frame (1pt).

Question 2: Calculate the 75th quantile for `math`, `science` and `read` score across states. (2pts) Create a table where you display the results (1pt). Which state has the highest score in `read`. (2pts)

Question 3: For female students born after on or 2000, which type of school had the highest average performce in `read` (3pts).

Question 4: For male and female students born after on or 2000, across the different types of schools, whose scores were more variable in reading? (3pts). Math (1pt)?. Science (1pt)? Place all results in a single table to receive full marks.

Question 6: Repeat the above exercise using `geom_density` (1pt). Which set of results, those for `geom_density` or `geom_histogram` allows one to more accurately compare across results (3pt)? Why?

Question 9. Using `facet_wrap()`, plot the total scores using densities for both males and females. What do the results tell us about the usefulness of looking at average scores?

Question 12. Plot the relationship between `tot_scores` and color the plot by cluster. (3pt). Is there a meaningful difference between the scores across the different clusters? (1pt). What do you think this finding says about the importance of motivational factors in overall test scores? (2pt)

Question 14 (ETC5510): Do the findings described in question 11 remain true if we instead analyse the relationship between read and cluster assignment? (5pt)

ETC1010-5510: Introduction to Data Analysis

Debbie Tan”

Assignment 2

Question 1: Read in the pisa data set (1pt) and show the last 5 rows and last three variables in the data frame (1pt).

Question 2: Calculate the 75th quantile for math, science and read score across states. (2pts) Create a table where you display the results (1pt). Which state has the highest score in read. (2pts)

Question 3: For female students born after on or 2000, which type of school had the highest average performce in read (3pts).

Question 4: For male and female students born after on or 2000, across the different types of schools, whose scores were more variable in reading? (3pts). Math (1pt)?. Science (1pt)? Place all results in a single table to receive full marks.

Question 6: Repeat the above exercise using geom_density (1pt). Which set of results, those for geom_density or geom_histogram allows one to more accurately compare across results (3pt)? Why?

Question 9. Using facet_wrap(), plot the total scores using densities for both males and females. What do the results tell us about the usefulness of looking at average scores?

Question 12. Plot the relationship between tot_scores and color the plot by cluster. (3pt). Is there a meaningful difference between the scores across the different clusters? (1pt). What do you think this finding says about the importance of motivational factors in overall test scores? (2pt)

Question 14 (ETC5510): Do the findings described in question 11 remain true if we instead analyse the relationship between read and cluster assignment? (5pt)

Question 2: Calculate the 75th quantile for `math`, `science` and `read` score across states. (2pts) Create a table where you display the results (1pt). Which state has the highest score in `read`. (2pts)

Question 3: For female students born after on or 2000, which type of school had the highest average performce in `read` (3pts).

Question 6: Repeat the above exercise using `geom_density` (1pt). Which set of results, those for `geom_density` or `geom_histogram` allows one to more accurately compare across results (3pt)? Why?

Question 9. Using `facet_wrap()`, plot the total scores using densities for both males and females. What do the results tell us about the usefulness of looking at average scores?

Question 12. Plot the relationship between `tot_scores` and color the plot by cluster. (3pt). Is there a meaningful difference between the scores across the different clusters? (1pt). What do you think this finding says about the importance of motivational factors in overall test scores? (2pt)