1 Introduction

This report replicates and extends analyses conducted in Lin, Chai, and Jong (2019), to introduce a number of packages that can be used with R (R Core Team 2020), for data analysis, reporting and visualization. In particular, it shows how R can be used to conduct and report the results of principal components analysis, simple linear regressions, and multilevel regression analysis.

Lin, Chai, and Jong (2019) used publicly available data from the Program for International Student Assessment (PISA) 2015 for students aged 15 from Singapore and Finland. Their analyses explored the relations between interest in broad science topics, perceived information and communications technology (ICT) competence, environmental awareness and environmental optimism.

The following R packages were drawn on in preparation of this report and the tables and figures included therein¹:

arsenal for creation of summary tables (Heinzen et al. 2020)
bookdown for structuring the document, generating tables of contents, table and figure numbering, footnotes, cross-referencing, and citations (Xie 2020a)
dplyr for filtering and selecting data, renaming and recoding variables (Wickham et al. 2020)
GPArotation for oblique rotation (Bernaards and Jennrich 2014)
haven for importing SPSS files (Wickham and Miller 2020)
here for referencing files and file locations (Müller 2017)
kableExtra for constructing and formatting complex tables (Zhu 2019)
knitr for various tasks involved in report generation (Xie 2020b)
likert for creation of likert scales and visualisation of likert-scale data (Bryer and Speerschneider 2016)
lme4 for multilevel regression analyses (Bates et al. 2020)
parameters for testing appropriateness of data for factor analysis (Lüdecke et al. 2020)
psych for scoring scales from items, reliability analyses and principal components analysis (Revelle 2020)
sjPlot for generating tables summarising the results of principal components and regression analyses (Lüdecke 2020)
summarytools for generating summary tables of raw data, for data screening (Comtois 2020)

2 Data and sample

2.1 Data

These analyses use data publicly available from the Program for International Student Assessment (PISA) 2015. Data was sourced and prepared for analyses as shown in Appendix A.

2.2 Participants

The Singapore PISA 2015 sample is composed of 6115 students and the Finland sample is comprised of 5882 students.

2.3 Variables

2.3.1 Interest in broad science topics

Students participating in PISA 2015 were asked to indicate their level of interest in five broad science topics (“how science can help us prevent disease”, “the universe and its history”, “energy and its transformation (e.g., conservation, chemical reactions)”, “motion and forces (e.g., velocity, friction, magnetic and gravitational forces)”; and “the biosphere (e.g. ecosystem services, sustainability)” on a four-point scale (from 1 = “not interested” to 4 = “highly interested”). A fifth possible response (5 = “I don’t know what this is”) was recoded as missing. Internal reliability (Cronbach’s \(\alpha\)) of this scale was 0.77 in Singapore and 0.83 in Finland (as shown in Table 3.1 and Table 3.2 below).

Figure 2.1 plots responses to the broad interest in science topics scale items for the whole sample, in descending order of students’ interest. Code for generating and plotting likert scale objects can be seen in Appendix B.

Figure 2.1: Interest in broad science topics

2.3.2 Environmental awareness

Students were asked to indicate their level of knowledge about a range of environmental issues (“air pollution”; “extinction of plants and animals”; “the consequences of clearing forests for other land use”; “the increase in greenhouse gases in the atmosphere”; “water shortage”; “nuclear waste”; and “the use of genetically modified organisms (GMO)”) on a four-point scale (1 = “I have never heard of this”; 2 = “I have heard about this but I would not be able to explain what it is really about”; 3 = “I know something about this and could explain the general issue”, 4 = “I am familiar with this and would be able to explain this well”).

Internal reliability (Cronbach’s \(\alpha\)) of this scale was 0.86 in Singapore and 0.85 in Finland.

Figure 2.2 plots responses to the environmental awareness scale items for the whole sample, in descending order of students’ ability to explain.

Figure 2.2: Environmental awareness

2.3.3 Environmental optimism

Students were asked to indicate their level of optimism about each of the above environmental issues, by responding to a question about whether they thought each issue would improve or get worse over the next 20 years on a three-point scale. Responses were reverse coded such that higher scores represented environmental optimism (i.e., 1 = “Get worse”; 2 = “Stay about the same”; 3 = “Improve”).

Internal reliability (Cronbach’s \(\alpha\)) of this scale was 0.85 in Singapore and 0.81 in Finland.

Figure 2.3 plots responses to the environmental optimism scale items for the whole sample, in descending order of students’ optimism.

Figure 2.3: Environmental optimism

2.3.4 Perceived ICT competence

Students were also asked to indicate their level of competence using information and communications technology (ICT) by responding to the following statements (“I feel comfortable using my digital devices at home”, “When I come across problems with digital devices, I think I can solve them”, “If my friends and relatives have a problem with digital devices, I can help them”; “If my friends and relatives want to buy new digital devices or applications, I can give them advice”; and “I feel comfortable using digital devices that I am less familiar with”) on a four-point scale (from 1 = “strongly disagree” to 4 = “strongly agree”).

Figure 2.4 plots responses to the perceived ICT competence scale items for the whole sample, in descending order of students’ agreement with each statement.

Lin, Chai, and Jong (2019) found that the item “I feel comfortable using my digital devices at home” had an extraction communality of 0.187 in the Singapore dataset and excluded this from their analyses. For consistency, this item is omitted from subsequent analyses (and this result is discussed in more detail in Appendix D.

Internal reliability (Cronbach’s \(\alpha\)) of this scale - with the first item removed - was 0.82 in Singapore and 0.85 in Finland.

Figure 2.4: Perceived ICT competence

2.4 Descriptive summary table

Table 2.1 shows the mean scores (with standard deviations) for each item (by scale) separately for Singapore and Finland. (Code for generating this table can be seen in Appendix C.)

Table 2.1: Mean item scores (with standard deviation) by country
	Singapore (N=6115)	Finland (N=5882)	Total (N=11997)
Interest in broad science topics
How science can help us prevent disease	3.11 (0.90)	2.78 (0.97)	2.95 (0.95)
The universe and its history	3.05 (1.01)	2.84 (1.03)	2.95 (1.02)
Energy and its transformation	2.68 (0.96)	2.34 (0.97)	2.52 (0.98)
Motion and forces	2.59 (0.97)	2.35 (0.96)	2.48 (0.97)
The biosphere	2.46 (0.96)	2.02 (0.86)	2.25 (0.94)
Environmental awareness
Air pollution	3.44 (0.69)	3.22 (0.67)	3.34 (0.69)
Extinction of plants and animals	3.19 (0.79)	3.14 (0.73)	3.17 (0.76)
Consequences of clearing forests	3.36 (0.78)	2.90 (0.80)	3.14 (0.82)
Increase of greenhouse gases in the atmosphere	3.17 (0.84)	2.95 (0.81)	3.07 (0.83)
Water shortage	3.12 (0.80)	2.86 (0.81)	2.99 (0.82)
Nuclear waste	2.45 (0.83)	2.78 (0.75)	2.61 (0.81)
Use of genetically modified organisms	2.32 (1.02)	2.05 (0.85)	2.19 (0.95)
Environmental optimism
Water shortages	1.79 (0.81)	1.77 (0.70)	1.78 (0.76)
Use of genetically modified organisms	1.88 (0.75)	1.77 (0.60)	1.83 (0.69)
Nuclear waste	1.73 (0.74)	1.63 (0.69)	1.68 (0.72)
Increase of greenhouse gases in the atmosphere	1.54 (0.77)	1.46 (0.67)	1.50 (0.72)
Air pollution	1.51 (0.75)	1.46 (0.69)	1.49 (0.72)
Extinction of plants and animals	1.50 (0.71)	1.71 (0.67)	1.60 (0.70)
Clearing of forests for other land use	1.46 (0.72)	1.53 (0.66)	1.49 (0.69)
Perceived ICT competence
When I come across problems with digital devices, I think I can solve them	2.95 (0.74)	2.88 (0.75)	2.92 (0.74)
If my friends and relatives have a problem with digital devices, I can help them	2.84 (0.79)	2.86 (0.77)	2.85 (0.78)
If my friends and relatives want to buy new digital devices or applications, I can give them advice	2.76 (0.80)	2.88 (0.74)	2.82 (0.77)
I feel comfortable using digital devices that I am less familiar with	2.68 (0.78)	2.44 (0.78)	2.57 (0.79)

3 Results

Lin, Chai, and Jong (2019) sought to explore:

whether a comparable factor structure of interest in broad science topics, perceived ICT competence, environmental awareness and environmental optimism could be established between Singapore and Finland
whether a significant difference existed in these variables between students from Singapore and Finland
what relationship existed between these variables in Singapore and Finland

Results are presented in three sections.

3.1 Factor structure

Lin, Chai, and Jong (2019) conducted a principal components analysis on the data from Singapore and Finland in order to examine construct validity separately in each context. Both principal components analysis and exploratory factor analysis are used to reduce the volume of data used by researchers, by looking at the way individual variables ‘hang together’. Exploratory factor analysis assumes observed variables relate to each other because of the presence of an unobserved latent variable impacting the observed variables, and thus seeks to explain why observed variables relate to each other. Principal components analysis, by contrast, simply seeks to describe observed variables which appear to measure the same thing.

Using principal components analysis on the data from each country separately shows whether there are similar relationships between items and scale scores across countries, and tests whether it is appropriate to combine items as structured in the PISA questionnaires in these specific country contexts for the following correlation and regression analyses. Appendix D provides a step by step guide to conducting these analyses in R, including data screening and preparation. Lin, Chai, and Jong (2019) reported using oblimin rotation, a form of oblique rotation, that assumes components may be correlated/ non-independent, and extracting factors based on loadings (technically ‘eigenvectors’) and eigenvalues greater than one. They also removed “I feel comfortable using my digital devices at home” because this item had an extraction communality of .187 in the Singapore sample. These decisions are further discussed in this appendix.

Table 3.1 shows components, loadings², and alpha reliabilities³ for the data from Singapore. Components across the top of the table are ordered by the amount of variance in the data each explains. Cumulatively, these four components explain 57% of the variance in the data⁴. Environmental awareness (here component 1) explains 22.29% of variance; environmental optimism (component 2) 15.88%; perceived ICT competence (component 3) 10.88% and interest in broad science topics (component 4) 8.02%. The factor loading of the items ranged from 0.49 to 0.89 in the Singapore data.

Table 3.1: Principal Component Analysis - Singapore
	Component 1	Component 2	Component 3	Component 4
The biosphere	0.12	-0.03	-0.05	0.67
Motion and forces	-0.09	0.05	0.07	0.77
Energy and its transformation	-0.07	0.04	0.02	0.83
The universe and its history	0.12	-0.09	-0.04	0.60
How science can help us prevent disease	0.10	-0.04	-0.06	0.63
I feel comfortable using digital devices that I am less familiar with	-0.01	0.00	0.65	0.01
If my friends and relatives want to buy new digital devices or applications, I can give them advice	0.00	0.01	0.81	-0.02
When I come across problems with digital devices, I think I can solve them	0.02	-0.02	0.86	0.01
If my friends and relatives have a problem with digital devices, I can help them	0.00	-0.00	0.89	0.00
Increase of greenhouse gases in the atmosphere	0.74	-0.08	-0.02	0.11
Use of genetically modified organisms	0.49	0.01	0.06	0.15
Nuclear waste	0.59	0.09	0.08	0.09
Consequences of clearing forests	0.84	-0.03	-0.05	0.02
Air pollution	0.85	-0.00	0.00	-0.03
Extinction of plants and animals	0.83	0.03	0.02	-0.07
Water shortage	0.81	0.03	0.04	-0.04
Air pollution	-0.04	0.76	0.01	-0.03
Extinction of plants and animals	-0.03	0.76	-0.01	-0.01
Clearing of forests for other land use	-0.03	0.77	-0.00	-0.02
Water shortages	0.02	0.72	-0.01	0.04
Nuclear waste	0.06	0.72	-0.03	0.04
Increase of greenhouse gases in the atmosphere	-0.02	0.76	0.03	-0.02
Use of genetically modified organisms	0.12	0.61	-0.03	0.06
Proportion of Variance	22.29 %	15.88 %	10.88 %	8.02 %
Cumulative Proportion	22.29 %	38.17 %	49.05 %	57.06 %
Cronbach’s α	0.86	0.85	0.82	0.77
oblimin-rotation

Table 3.2 shows components, loadings and alpha reliabilities for the data from Finland. Cumulatively, the same four components explain 55.6% of the variance in the data. Environmental awareness (again component 1) explains 22.18% of variance; environmental optimism (component 2) 14.09%; interest in broad science topics (here component 3) 11.39%, and perceived ICT competence (here component 4) 7.95%. The factor loading of the items ranged from 0.45 to 0.89 in the Finland data.

Table 3.2: Principal Component Analysis - Finland
	Component 1	Component 2	Component 3	Component 4
The biosphere	0.13	-0.02	0.72	-0.08
Motion and forces	-0.08	0.04	0.85	0.07
Energy and its transformation	-0.05	0.03	0.87	0.05
The universe and its history	0.12	-0.06	0.64	-0.05
How science can help us prevent disease	0.09	-0.07	0.68	-0.08
I feel comfortable using digital devices that I am less familiar with	-0.07	0.01	0.08	0.70
If my friends and relatives want to buy new digital devices or applications, I can give them advice	0.02	-0.02	-0.03	0.85
When I come across problems with digital devices, I think I can solve them	0.03	-0.01	0.01	0.86
If my friends and relatives have a problem with digital devices, I can help them	0.02	0.00	-0.02	0.89
Increase of greenhouse gases in the atmosphere	0.67	-0.04	0.14	0.00
Use of genetically modified organisms	0.45	-0.01	0.12	0.10
Nuclear waste	0.70	0.04	0.07	0.08
Consequences of clearing forests	0.77	0.02	0.03	0.01
Air pollution	0.84	-0.01	-0.02	-0.03
Extinction of plants and animals	0.83	-0.01	-0.06	-0.00
Water shortage	0.75	0.01	-0.07	-0.01
Air pollution	0.02	0.73	0.04	0.01
Extinction of plants and animals	-0.00	0.69	-0.03	-0.02
Clearing of forests for other land use	0.03	0.69	0.01	-0.05
Water shortages	-0.01	0.64	-0.02	-0.00
Nuclear waste	0.01	0.66	0.02	0.03
Increase of greenhouse gases in the atmosphere	-0.01	0.74	0.01	0.02
Use of genetically modified organisms	-0.01	0.61	-0.04	-0.02
Proportion of Variance	22.18 %	14.09 %	11.39 %	7.95 %
Cumulative Proportion	22.18 %	36.27 %	47.66 %	55.61 %
Cronbach’s α	0.85	0.81	0.83	0.85
oblimin-rotation

3.2 Differences between Singapore and Finland

Table 3.3 shows means and standard deviations for each scale in Singapore and Finland. Appendix E shows how R was used to calculate scale scores and test the statistical significance of differences by country.

As shown, students from Singapore generally had higher levels of interest in broad science topics, environmental awareness and perceived ICT competence than the Finnish students, but did not differ significantly in terms of their environmental optimism.

The statistical significance of differences by country is shown with an ANOVA (t-test with equal variances) p-value.

Table 3.3: Mean scale scores by country
	Singapore (N=6115)	Finland (N=5882)	Total (N=11997)	p value
Interest in broad science topics	2.77 (0.66)	2.48 (0.70)	2.63 (0.70)	< 0.001
Perceived ICT competence	2.81 (0.61)	2.77 (0.60)	2.79 (0.61)	< 0.001
Environmental awareness	3.01 (0.60)	2.85 (0.55)	2.93 (0.58)	< 0.001
Environmental optimism	1.63 (0.54)	1.62 (0.44)	1.63 (0.49)	0.230

3.3 Relationships among variables

3.3.1 Correlation matrices

Appendix F shows how R was used to calculate correlations between variables. Interest in broad science topics and environmental awareness were positively and significantly correlated in both countries (and of medium magnitude), while environmental awareness and environmental optimism were negatively and significantly correlated (but of small magnitude). Positive significant (but small) correlations were also found between other variables in both countries (see Figure 3.1 and Figure 3.2 for details). In Singapore but not in Finland, environmental optimism and perceived ICT competence were positively and significantly correlated; in Finland but not in Singapore environmental optimism and interest in broad science topics were negatively and significantly correlated. Lin, Chai, and Jong (2019) found the low (<0.42) and significant correlations to indicate the appropriateness of proceeding with the following regression analyses.

Figure 3.1: Correlation matrix - Singapore

Figure 3.2: Correlation matrix - Finland

3.3.2 Regression analyses

Environmental awareness

First, following Lin, Chai, and Jong (2019), we explore the extent to which interest in broad science topics and perceived ICT competence individually and in combination predict environmental awareness. Code for generating these analyses can be found in Appendix G.

Table 3.4 shows standardized coefficients of the multiple linear regression models used for predicting environmental awareness. In both Singapore and Finland, interest in broad science topics and perceived ICT competence were significantly positive predictors of environmental awareness. Both multiple regression models were significant, accounting for 15 to 18% of the variance. Standardized coefficients in a regression analysis indicate that a one standard deviation increase in the given variable will be associated with the specific number of standard deviations of change in the target variable, holding all other included variables constant. The standardized coefficient of 0.41 for interest in broad science topics in Finland for example, suggests that an increase of one standard deviation on the interest in broad science scale would be associated with an increase of 0.41 of a standard deviation on the environmental awareness scale.

Table 3.4: Environmental awareness
	Singapore		Finland
Predictors	std. Beta	p	std. Beta	p
Interest in broad science topics	0.37	<0.001	0.41	<0.001
Perceived ICT competence	0.08	<0.001	0.05	<0.001
Observations	6115		5882
R² / R² adjusted	0.148 / 0.147		0.177 / 0.176

Environmental optimism

Table 3.5 shows standardized coefficients of the multiple linear regression models used for predicting environmental optimism. In both Singapore and Finland, environmental awareness was a significant negative predictor of environmental optimism. In Singapore, interest in broad science topics and perceived ICT competence were significant positive predictors of environmental optimism. In Finland, interest in broad science topics was not significant, and perceived ICT competence was of marginal significance. Note however, that this model explained only a small proportion of variance (2 - 3%).

Table 3.5: Environmental optimism
	Singapore		Finland
Predictors	std. Beta	p	std. Beta	p
Interest in broad science topics	0.05	<0.001	0.02	0.184
Perceived ICT competence	0.06	<0.001	0.03	0.040
Environmental awareness	-0.18	<0.001	-0.16	<0.001
Observations	6115		5882
R² / R² adjusted	0.028 / 0.028		0.023 / 0.022

3.3.3 Multilevel regression analyses

It is important to note, however, that PISA employed a two-stage stratified sample design (OECD 2017), and observations from a single cluster (i.e., students from the same school) are usually more similar to each other than observations from different clusters. Statistical methods such as linear regression assume independence of observations. It is therefore more appropriate to perform a multilevel (or mixed effects) regression analysis on this data. Multilevel models not only take into account correlations that might exist in observations from the same cluster, they can also give an estimate of the magnitude of that correlation. Code for conducting the multilevel regression analyses that follow can be found in Appendix H.

Environmental awareness

Table 3.6 shows standardized coefficients of the linear and multilevel linear regression models used for predicting environmental awareness in Singapore and Finland. The standardized coefficients in the multilevel model are of slightly smaller magnitude, once variability associated with school attended is taken into account⁵

This table shows that the Singaporean students surveyed came from 176 schools, while the Finnish students came from 162 schools. The Intraclass Correlation Coefficients (ICC) reported here show the variability in students’ environmental awareness that could be explained simply by knowing what school they attended⁶. In this case, 9% of variability in environmental awareness in Singapore, and 3% of variability in Finland, can be accounted for solely by the clusters identified in the model (i.e., school). The revised coefficients in the multilevel model thus show the influence of interest in broad science topics and perceived ICT competence in predicting environmental awareness, after controlling for variability between schools.

The marginal and conditional R-squared statistics presented here for the multilevel model⁷ suggest that this model explains approximately 13% of the variance in environmental awareness in Singaporean students after controlling for school attended, but that around 20% of the variance is accounted for if we also account for school attended. These analyses suggest that it might in fact be worth exploring key dimensions of differences between schools in Singapore captured in the PISA schools questionnaire, and incorporating these into our model.

Table 3.6: Environmental awareness - model comparison
	Singapore linear		Singapore multilevel		Finland linear		Finland multilevel
Predictors	std. Beta	p	std. Beta	p	std. Beta	p	std. Beta	p
Interest in broad science topics	0.37	<0.001	0.34	<0.001	0.41	<0.001	0.40	<0.001
Perceived ICT competence	0.08	<0.001	0.07	<0.001	0.05	<0.001	0.05	<0.001
Random Effects
σ²			0.28				0.24
τ₀₀			0.03 _school				0.01 _school
ICC			0.09				0.03
N			177 _school				168 _school
Observations	6115		6115		5882		5882
R² / R² adjusted	0.148 / 0.147		0.127 / 0.203		0.177 / 0.176		0.171 / 0.195

Environmental optimism

Table 3.7 shows standardized coefficients of the linear and multilevel linear regression models used for predicting environmental optimism in Singapore and Finland. In this case, the intraclass correlation coefficients show random variance due to school explains only approximately 2-3% of variability, and the adjusted models still explain only around 2-5% of the total variance.

Table 3.7: Environmental optimism - model comparison
	Singapore linear		Singapore multilevel		Finland linear		Finland multilevel
Predictors	std. Beta	p	std. Beta	p	std. Beta	p	std. Beta	p
Interest in broad science topics	0.05	<0.001	0.05	<0.001	0.02	0.184	0.02	0.168
Perceived ICT competence	0.06	<0.001	0.05	<0.001	0.03	0.040	0.03	0.035
Environmental awareness	-0.18	<0.001	-0.17	<0.001	-0.16	<0.001	-0.16	<0.001
Random Effects
σ²			0.28				0.19
τ₀₀			0.01 _school				0.00 _school
ICC			0.03				0.01
N			177 _school				168 _school
Observations	6115		6115		5882		5882
R² / R² adjusted	0.028 / 0.028		0.025 / 0.051		0.023 / 0.022		0.022 / 0.028

4 Conclusions

Lin, Chai, and Jong (2019) concluded firstly that a comparable factor structure could be identified between Singapore and Finland in PISA 2015 data on students’ interest in broad science topics, perceived ICT competence, environmental awareness and environmental optimism.

Secondly, the authors concluded that Singaporean students had higher levels, compared to Finnish students, of interest in broad science topics, perceived ICT competence, and environmental awareness. They did not differ, however, in terms of environmental optimism. The authors argue this is consistent with the emphasis on science and ICT in the Singaporean education system, and the immediacy with which obvious environmental threats relating to rising sea levels and haze problems are perceived in Singapore (e.g., Tan, Lee, and Tan 2009).

Thirdly, correlation analyses showed positive associations between interest in broad science topics, perceived ICT competence and environmental awareness; and a negative association between environmental awareness and optimism. Lin, Chai, and Jong (2019) suggest that the correlation between perceived ICT competence and environmental optimism in Singapore suggested that for Singaporean students, familiarity with ICT might contribute to a more optimistic outlook towards the possibility of improvements in the environmental issues included in the survey. They further suggest that the negative correlation in Finland between interest in broad science topics and environmental optimism might indicate that Finnish students believed science insufficient to resolve environmental issues. Given the small magnitude of the observed correlations, however, care should be taken not to over-interpret these findings.

Fourth, regression analyses showed that in both country contexts interest in broad science topics emerged as a significant predictor of environmental awareness, as did - to a lesser extent - perceived ICT competence. In the Singaporean data, interest in broad science topics and perceived ICT competence were positive predictors of environmental optimism, whereas environmental awareness was a negative predictor. Lin, Chai, and Jong (2019) suggest that this is consistent with studies suggesting increased interest in science and ICT use could help resolve environmental problems (e.g., Gonel and Akinci 2018). In the Finland data, environmental awareness was similarly a negative predictor of environmental optimism. Interest in broad science topics and perceived ICT competence, however, were of marginal or no significance.

Fifth, this replication and extension shows that the above results are broadly corroborated when multilevel regression analyses are conducted to account for the clustering of observations within schools in the PISA 2015 study design. Including a random effect of schools in the model for predicting environmental awareness in Singapore in particular, however, suggested that there was variation in environmental awareness associated with the school students attended that could be further explored. Given evidence from PISA 2018 that high- and low-performing students tend to be clustered in schools to a greater extent in Singapore (OECD 2019b) than in Finland (OECD 2019a), it would be interesting to take a broader look at the variables available in the PISA dataset and assess a revised model incorporating academic ability (e.g., scientific literacy) and other school-related factors to further explore factors associated with environmental awareness. Results from Turkey suggest it might also be worth considering the extent to which environmental awareness might vary in association with student socioeconomic status (Ozturk 2018). The small amount of variability explained by the model seeking to explain environmental optimism suggests caution ought to be taken not to over-interpret these findings (and perhaps additional theorising of relevant predictors might help to generate a model with greater utility).

Overall, these results suggest, following Lin, Chai, and Jong (2019), that the PISA data may be a useful source for researchers seeking to further understand factors associated with environmental awareness in particular, and how this might be promoted in different country contexts. Potentially profitable directions for future research have been outlined above.

Appendices

A Reading in and preparing data

The Program for International Student Assessment (PISA) 2015 student data used for these analyses was retrieved from https://www.oecd.org/pisa/data/2015database/. The SPSS file was saved (as “PISA_2015.sav”) in a folder labelled ‘data’ in the current working directory. The following script uses the here package (Müller 2017) to navigate to and read in the full datafile, uses filter and select (Wickham et al. 2020) functions to retain only observations and variables necessary for these analyses, and then saves the file in its abbreviated form (as “PISA15_new”) in the same folder. This process was performed via a separate script in order to minimise the volume of data loaded for - and processing time taken by - generation of this report. Using caches and dependencies in the R code might be an alternate methodology to consider when generating future reports.

# Step 1: load packages 

library(tidyverse)
library(here)
library(haven)

# Step 2: Read in data

PISA15 <- haven::read_spss(here("data", "PISA_2015.sav"))

# Step 3: Prepare abbreviated data file for use in report generation

# 3a: Select variables

PISA15_new <- PISA15 %>%
  filter(CNT %in% c("SGP", "FIN")) %>%
  select(CNT, CNTSTUID, ST004D01T, 
   INTBRSCI, ST095Q04NA, ST095Q07NA, ST095Q08NA, ST095Q13NA, ST095Q15NA, #interest
   COMPICT, IC014Q03NA, IC014Q04NA, IC014Q06NA, IC014Q08NA, IC014Q09NA, #ICT 
   ENVAWARE, ST092Q01TA, ST092Q02TA, ST092Q04TA, ST092Q05TA, ST092Q06NA, ST092Q08NA, ST092Q09NA, # Environmental awareness
   ENVOPT, ST093Q01TA, ST093Q03TA, ST093Q04TA, ST093Q05TA, ST093Q06TA, ST093Q07NA, ST093Q08NA, # Environmental optimism
   CNTSCHID, W_FSTURWT80)

# 3b: Export complete file for use in report generation

saveRDS(PISA15_new, here("data", "PISA15_new.RDS"))

The code that follows is then run on the abbreviated data file to further manipulate the data and generate this report.

# Step 1: Load packages

library(tidyverse)
library(here)
library(haven)
library(summarytools)
library(sjPlot)
library(likert)
library(arsenal)
library(kableExtra)
library(bookdown)
library(psych) 
library(GPArotation)
library(parameters) 

# Step 2: Read in abbreviated data file

PISA15_new <- readRDS(here("data", "PISA15_new.RDS"))

# Step 3: Prepare data file for summary and analysis ---------

# 3a: Add labels

attr(PISA15_new$ST092Q01TA,'label') <- 'Increase of greenhouse gases in the atmosphere'
attr(PISA15_new$ST092Q02TA,'label') <- 'Use of genetically modified organisms'
attr(PISA15_new$ST092Q04TA,'label') <- 'Nuclear waste'
attr(PISA15_new$ST092Q05TA,'label') <- 'Consequences of clearing forests'
attr(PISA15_new$ST092Q06NA,'label') <- 'Air pollution'
attr(PISA15_new$ST092Q08NA,'label') <- 'Extinction of plants and animals'
attr(PISA15_new$ST092Q09NA,'label') <- 'Water shortage'
attr(PISA15_new$ST095Q04NA,'label') <- 'The biosphere'
attr(PISA15_new$ST095Q07NA,'label') <- 'Motion and forces'
attr(PISA15_new$ST095Q08NA,'label') <- 'Energy and its transformation'
attr(PISA15_new$ST095Q13NA,'label') <- 'The universe and its history'
attr(PISA15_new$ST095Q15NA,'label') <- 'How science can help us prevent disease'
attr(PISA15_new$IC014Q03NA,'label') <- 'Feel comfortable using less familiar digital devices'
attr(PISA15_new$IC014Q04NA,'label') <- 'Can advise friends/relatives wanting to buy devices or applications'
attr(PISA15_new$IC014Q06NA,'label') <- 'Feel comfortable using digital devices at home'
attr(PISA15_new$IC014Q08NA,'label') <- 'Think I can solve problems with devices when come across'
attr(PISA15_new$IC014Q09NA,'label') <- 'Can help friends/relatives with problems with digital devices'
attr(PISA15_new$ST093Q01TA,'label') <- 'Air pollution'
attr(PISA15_new$ST093Q03TA,'label') <- 'Extinction of plants and animals'
attr(PISA15_new$ST093Q04TA,'label') <- 'Clearing of forests for other land use'
attr(PISA15_new$ST093Q05TA,'label') <- 'Water shortages'
attr(PISA15_new$ST093Q06TA,'label') <- 'Nuclear waste'
attr(PISA15_new$ST093Q07NA,'label') <- 'Increase of greenhouse gases in the atmosphere'
attr(PISA15_new$ST093Q08NA,'label') <- 'Use of genetically modified organisms'

#3b: Recode '5' for broad science interests as NA

PISA15_new$ST095Q04NA[PISA15_new$ST095Q04NA == 5] <- NA 
PISA15_new$ST095Q07NA[PISA15_new$ST095Q07NA == 5] <- NA 
PISA15_new$ST095Q08NA[PISA15_new$ST095Q08NA == 5] <- NA 
PISA15_new$ST095Q13NA[PISA15_new$ST095Q13NA == 5] <- NA 
PISA15_new$ST095Q15NA[PISA15_new$ST095Q15NA == 5] <- NA 

#3c: Recode environmental optimism scale (reverse-score)

PISA15_new <- PISA15_new %>%
  mutate(ST093Q01TA_R = ifelse(ST093Q01TA %in% c("1"), "3",
               ifelse(ST093Q01TA %in% c("2"), "2",
                ifelse(ST093Q01TA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q01TA_R = as.numeric(ST093Q01TA_R)) %>%
  mutate(ST093Q03TA_R = ifelse(ST093Q03TA %in% c("1"), "3",
               ifelse(ST093Q03TA %in% c("2"), "2",
                ifelse(ST093Q03TA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q03TA_R = as.numeric(ST093Q03TA_R)) %>%
  mutate(ST093Q04TA_R = ifelse(ST093Q04TA %in% c("1"), "3",
               ifelse(ST093Q04TA %in% c("2"), "2",
                ifelse(ST093Q04TA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q04TA_R = as.numeric(ST093Q04TA_R)) %>%
  mutate(ST093Q05TA_R = ifelse(ST093Q05TA %in% c("1"), "3",
               ifelse(ST093Q05TA %in% c("2"), "2",
                ifelse(ST093Q05TA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q05TA_R = as.numeric(ST093Q05TA_R)) %>%
  mutate(ST093Q06TA_R = ifelse(ST093Q06TA %in% c("1"), "3",
               ifelse(ST093Q06TA %in% c("2"), "2",
                ifelse(ST093Q06TA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q06TA_R = as.numeric(ST093Q06TA_R)) %>%  
  mutate(ST093Q07NA_R = ifelse(ST093Q07NA %in% c("1"), "3",
               ifelse(ST093Q07NA %in% c("2"), "2",
                ifelse(ST093Q07NA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q07NA_R = as.numeric(ST093Q07NA_R)) %>%
  mutate(ST093Q08NA_R = ifelse(ST093Q08NA %in% c("1"), "3",
               ifelse(ST093Q08NA %in% c("2"), "2",
                ifelse(ST093Q08NA %in% c("3"), "1", NA)))) %>%
  mutate(ST093Q08NA_R = as.numeric(ST093Q08NA_R))
  
attr(PISA15_new$ST093Q01TA_R,'label') <- 'Air pollution'
attr(PISA15_new$ST093Q03TA_R,'label') <- 'Extinction of plants and animals'
attr(PISA15_new$ST093Q04TA_R,'label') <- 'Clearing of forests for other land use'
attr(PISA15_new$ST093Q05TA_R,'label') <- 'Water shortages'
attr(PISA15_new$ST093Q06TA_R,'label') <- 'Nuclear waste'
attr(PISA15_new$ST093Q07NA_R,'label') <- 'Increase of greenhouse gases in the atmosphere'
attr(PISA15_new$ST093Q08NA_R,'label') <- 'Use of genetically modified organisms'

#3d: make country factor, add labels, and set up order for reporting

PISA15_new$CNT <- factor(PISA15_new$CNT, 
      levels = c("SGP", "FIN"),
      labels = c("Singapore", "Finland"))

B Likert objects and plots

The code that follows shows how to use the likert function (Bryer and Speerschneider 2016) to create a likert scale object, based on selected items.

B.1 Creating the likert object

# create a likert scale object (l95) based on selected items 
# plotting grouped objects (l95g) compares responses by gp 

items95 <- PISA15_new %>%
  select(ST095Q04NA, ST095Q07NA, ST095Q08NA, ST095Q13NA, 
         ST095Q15NA) %>%
  as.data.frame()

convert_to_factor_95 <- c('ST095Q04NA', 'ST095Q07NA', 
              'ST095Q08NA', 'ST095Q13NA', 'ST095Q15NA')
                       
items95[,convert_to_factor_95] <- lapply(items95[,convert_to_factor_95], factor, 
        levels = c("1", "2", "3", "4"),
        labels = c("Not interested", "Hardly interested", 
                   "Interested", "Highly interested"),
                  ordered = TRUE)            

names(items95) <- c("The biosphere", 
                    "Motion and forces", 
                    "Energy and its transformation", 
                    "The universe and its history", 
                    "How science can help us prevent disease")

l95 <- likert(items95)
l95g <- likert(items95, grouping=PISA15_new$CNT)

B.2 Stacked centred bar graph

The stacked centred bar graph generated in Figure 2.1 above was plotted with the following code:

# set print options

scale_height = knitr::opts_chunk$get('fig.height')*0.5
scale_width = knitr::opts_chunk$get('fig.width')*1.25
knitr::opts_chunk$set(fig.height = scale_height, fig.width = scale_width)

theme_update(legend.text = element_text(size = rel(0.6)))

plot(l95)

# note auto-numberings requires chunk header to include:
  # 'fig.cap' = [Caption] 
# chunk headers for sections can be printed using bookdown

C Summary tables

The code below shows how to use arsenal::tableby (Heinzen et al. 2020) and options in kableExtra (Zhu 2019) to reproduce Table 2.1 above. This table summarized mean item scores (with standard deviations) by country, with items grouped by scale.

tab_summary <- arsenal::tableby(CNT ~ ST095Q15NA + ST095Q13NA + ST095Q08NA + ST095Q07NA + ST095Q04NA + 
    ST092Q06NA + ST092Q08NA + ST092Q05TA +ST092Q01TA + ST092Q09NA + ST092Q04TA + ST092Q02TA +
    ST093Q05TA_R + ST093Q08NA_R + ST093Q06TA_R + ST093Q07NA_R + ST093Q01TA_R + ST093Q03TA_R + ST093Q04TA_R +   
    IC014Q08NA + IC014Q09NA + IC014Q04NA + IC014Q03NA,
    data=PISA15_new, numeric.simplify=TRUE, 
    numeric.stats = c("meansd"), digits=2, test = FALSE)

tab_summary %>%
  summary(text = TRUE) %>%
  as.data.frame() %>%
  knitr::kable(caption = "Mean item scores by country") %>%
  kableExtra::pack_rows(index = c("Interest in broad science topics" = 5, 
                      "Environmental awareness" = 7, 
                      "Environmental optimism" = 7, 
                      "Perceived ICT competence" = 4)) %>%
  kableExtra::kable_styling(latex_options = c("striped"))

D Factor analysis

D.1 Testing the dataset for suitability for factor analysis

First, datafiles for each country were created using the code below. Next, the summarytools::descr function (Comtois 2020) was used to review summary statistics for each variable to be included in the model (i.e., to check for missing or out of range data, high levels of skewness or kurtosis, etc). Note here that there is a relatively small percentage of missing data (2~11%) but greater prevalence of missing data for some items - and scales - over others. Scales in this report were scored by imputing mean values for each item to avoid omitting observations with missing data from analysis, but more sophisticated imputation of missing not at random items would have been preferable.

# create datafiles for each country with only scale items included

# original datafiles with all items for all scales

FIN_original <- PISA15_new %>%
  filter(CNT == "Finland") %>%
  select(5:9, 11:15, 17:23, 34:40)

SGP_original <- PISA15_new %>%
  filter(CNT == "Singapore") %>%
  select(5:9, 11:15, 17:23, 34:40)

# revised datafiles excluding IC014Q06NA

FIN <- PISA15_new %>%
  filter(CNT == "Finland") %>%
  select(5:9, 11:12, 14:15, 17:23, 34:40) 

SGP <- PISA15_new %>%
  filter(CNT == "Singapore") %>%
  select(5:9, 11:12, 14:15, 17:23, 34:40)

# screen for out-of-range values, high levels of missing data, skewness, kurtosis etc ... 

summarytools::descr(SGP_original, transpose = TRUE, headings = FALSE, 
    stats = c("mean", "sd", "min", "max", "skewness", "kurtosis", "n.valid", "pct.valid"))

## 
##                      Mean   Std.Dev    Min    Max   Skewness   Kurtosis   N.Valid   Pct.Valid
## ------------------ ------ --------- ------ ------ ---------- ---------- --------- -----------
##         IC014Q03NA   2.68      0.78   1.00   4.00      -0.15      -0.37   5883.00       96.21
##         IC014Q04NA   2.76      0.80   1.00   4.00      -0.39      -0.18   5870.00       95.99
##         IC014Q06NA   3.35      0.61   1.00   4.00      -0.75       1.41   5861.00       95.85
##         IC014Q08NA   2.95      0.74   1.00   4.00      -0.53       0.34   5870.00       95.99
##         IC014Q09NA   2.84      0.79   1.00   4.00      -0.50       0.01   5871.00       96.01
##         ST092Q01TA   3.17      0.84   1.00   4.00      -0.80      -0.01   6012.00       98.32
##         ST092Q02TA   2.32      1.02   1.00   4.00       0.20      -1.08   6011.00       98.30
##         ST092Q04TA   2.45      0.83   1.00   4.00       0.18      -0.52   5958.00       97.43
##         ST092Q05TA   3.36      0.78   1.00   4.00      -1.12       0.78   6008.00       98.25
##         ST092Q06NA   3.44      0.69   1.00   4.00      -1.12       0.99   6008.00       98.25
##         ST092Q08NA   3.19      0.79   1.00   4.00      -0.68      -0.15   6018.00       98.41
##         ST092Q09NA   3.12      0.80   1.00   4.00      -0.60      -0.25   6011.00       98.30
##       ST093Q01TA_R   1.51      0.75   1.00   3.00       1.07      -0.41   6024.00       98.51
##       ST093Q03TA_R   1.50      0.71   1.00   3.00       1.06      -0.28   6011.00       98.30
##       ST093Q04TA_R   1.46      0.72   1.00   3.00       1.24       0.03   6018.00       98.41
##       ST093Q05TA_R   1.79      0.81   1.00   3.00       0.39      -1.36   6022.00       98.48
##       ST093Q06TA_R   1.73      0.74   1.00   3.00       0.47      -1.04   5997.00       98.07
##       ST093Q07NA_R   1.54      0.77   1.00   3.00       1.00      -0.58   5995.00       98.04
##       ST093Q08NA_R   1.88      0.75   1.00   3.00       0.20      -1.21   6003.00       98.17
##         ST095Q04NA   2.46      0.96   1.00   4.00      -0.09      -0.98   5382.00       88.01
##         ST095Q07NA   2.59      0.97   1.00   4.00      -0.21      -0.94   5875.00       96.08
##         ST095Q08NA   2.68      0.96   1.00   4.00      -0.31      -0.84   5846.00       95.60
##         ST095Q13NA   3.05      1.01   1.00   4.00      -0.75      -0.58   5655.00       92.48
##         ST095Q15NA   3.11      0.90   1.00   4.00      -0.83      -0.09   5795.00       94.77

summarytools::descr(FIN_original, transpose = TRUE, headings = FALSE, 
    stats = c("mean", "sd", "min", "max", "skewness", "kurtosis", "n.valid", "pct.valid"))

## 
##                      Mean   Std.Dev    Min    Max   Skewness   Kurtosis   N.Valid   Pct.Valid
## ------------------ ------ --------- ------ ------ ---------- ---------- --------- -----------
##         IC014Q03NA   2.44      0.78   1.00   4.00       0.01      -0.41   5386.00       91.57
##         IC014Q04NA   2.88      0.74   1.00   4.00      -0.63       0.56   5380.00       91.47
##         IC014Q06NA   3.20      0.62   1.00   4.00      -0.74       1.96   5367.00       91.24
##         IC014Q08NA   2.88      0.75   1.00   4.00      -0.57       0.38   5355.00       91.04
##         IC014Q09NA   2.86      0.77   1.00   4.00      -0.58       0.27   5354.00       91.02
##         ST092Q01TA   2.95      0.81   1.00   4.00      -0.46      -0.25   5574.00       94.76
##         ST092Q02TA   2.05      0.85   1.00   4.00       0.45      -0.46   5562.00       94.56
##         ST092Q04TA   2.78      0.75   1.00   4.00      -0.15      -0.35   5549.00       94.34
##         ST092Q05TA   2.90      0.80   1.00   4.00      -0.39      -0.26   5561.00       94.54
##         ST092Q06NA   3.22      0.67   1.00   4.00      -0.67       0.78   5559.00       94.51
##         ST092Q08NA   3.14      0.73   1.00   4.00      -0.60       0.24   5571.00       94.71
##         ST092Q09NA   2.86      0.81   1.00   4.00      -0.41      -0.26   5551.00       94.37
##       ST093Q01TA_R   1.46      0.69   1.00   3.00       1.18       0.05   5580.00       94.87
##       ST093Q03TA_R   1.71      0.67   1.00   3.00       0.42      -0.81   5560.00       94.53
##       ST093Q04TA_R   1.53      0.66   1.00   3.00       0.84      -0.40   5551.00       94.37
##       ST093Q05TA_R   1.77      0.70   1.00   3.00       0.35      -0.94   5557.00       94.47
##       ST093Q06TA_R   1.63      0.69   1.00   3.00       0.64      -0.74   5540.00       94.19
##       ST093Q07NA_R   1.46      0.67   1.00   3.00       1.14       0.02   5532.00       94.05
##       ST093Q08NA_R   1.77      0.60   1.00   3.00       0.16      -0.52   5439.00       92.47
##         ST095Q04NA   2.02      0.86   1.00   4.00       0.38      -0.70   5223.00       88.80
##         ST095Q07NA   2.35      0.96   1.00   4.00       0.07      -0.98   5484.00       93.23
##         ST095Q08NA   2.34      0.97   1.00   4.00       0.10      -1.00   5467.00       92.94
##         ST095Q13NA   2.84      1.03   1.00   4.00      -0.46      -0.95   5411.00       91.99
##         ST095Q15NA   2.78      0.97   1.00   4.00      -0.38      -0.83   5400.00       91.81

For data to be suitable for factor analysis, there should be some relationships between variables. Bartlett’s test of sphericity tests the statistical probability of the correlation matrix having significant correlations among at least some of the variables.

The Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) ranges from 0 to 1 and “indicates the degree to which each variable in a set is predicted without error by the other variables. A value of 0 indicates that the sum of partial correlations is large relative to the sum correlations, indicating factor analysis is likely to be inappropriate. A KMO value close to 1 indicates that the sum of partial correlations is not large relative to the sum of correlations and so factor analysis should yield distinct and reliable factors” (“Structural Models (EFA, CFA, SEM...)” n.d.).

The code that follows uses the parameters::check_factorstructure (Lüdecke et al. 2020) function to generate these test statistics for the data from Singapore and Finland, respectively.

parameters::check_factorstructure(SGP_original)

## # Is the data suitable for Factor Analysis?
## 
##   - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.86).
##   - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analaysis (Chisq(276) = 59719.55, p < .001).

parameters::check_factorstructure(FIN_original)

## # Is the data suitable for Factor Analysis?
## 
##   - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.86).
##   - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analaysis (Chisq(276) = 56085.17, p < .001).

D.2 Determining how many components to extract

The code that follows uses psych::fa.parallel (Revelle 2020) to generate a scree plot that can be used to inform choices of the appropriate number of components to extract from the data for Singapore and Finland respectively.

Generally, it is best to select a number of components/factors whose eigenvalues (i.e., the amount of variance explained by each principal component or factor) are above those of the simulated dataset with no factors (i.e., >1). Theoretical arguments (e.g., the proposed structure of the PISA data) can also support these decisions. Both figures suggest extracting four components is appropriate, as the slope of the plot drops noticeably after this point.

# change figure height setting for printing 
knitr::opts_chunk$set(fig.height = knitr::opts_chunk$get('fig.height')*2)

# how many factors to extract - number of factors whose eigenvalues above simulated dataset with no factors
psych::fa.parallel(SGP_original, 
              fm = "pa", # factor method 
              fa = "both", # - principle axis (fa), principle components (pc) or both (both)
              use = "pairwise") #pairwise deletion of missing data

## Parallel analysis suggests that the number of factors =  6  and the number of components =  4

# how many factors to extract - number of factors whose eigenvalues above simulated dataset with no factors
psych::fa.parallel(FIN_original, 
              fm = "pa", # factor method 
              fa = "both", # - principle axis (fa), principle components (pc) or both (both)
              use = "pairwise") #pairwise deletion of missing data

## Parallel analysis suggests that the number of factors =  6  and the number of components =  4

Let’s now perform a principal components analysis, extracting 4 factors, with oblimin rotation and view a summary of results. The h2 column in the summary that follows is a measure of communality (i.e., shared variance with other items), while the u2 column is a measure of uniqueness (i.e., variance not explained by other items). Together these values total to 1.

principalSGP <- psych::principal(r = SGP_original, nfactors = 4, rotate = "oblimin")
principalSGP

## Principal Components Analysis
## Call: psych::principal(r = SGP_original, nfactors = 4, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                TC1   TC2   TC3   TC4   h2   u2 com
## ST095Q04NA    0.13 -0.04 -0.05  0.67 0.51 0.49 1.1
## ST095Q07NA   -0.09  0.05  0.06  0.78 0.59 0.41 1.0
## ST095Q08NA   -0.06  0.04  0.02  0.83 0.67 0.33 1.0
## ST095Q13NA    0.13 -0.10 -0.02  0.59 0.42 0.58 1.2
## ST095Q15NA    0.10 -0.04 -0.05  0.62 0.43 0.57 1.1
## IC014Q03NA   -0.02  0.01  0.64  0.02 0.41 0.59 1.0
## IC014Q04NA   -0.01  0.01  0.79 -0.01 0.63 0.37 1.0
## IC014Q06NA    0.13 -0.05  0.54 -0.08 0.32 0.68 1.2
## IC014Q08NA    0.02 -0.01  0.86  0.01 0.75 0.25 1.0
## IC014Q09NA   -0.01  0.00  0.87  0.01 0.76 0.24 1.0
## ST092Q01TA    0.74 -0.08 -0.02  0.11 0.63 0.37 1.1
## ST092Q02TA    0.49  0.01  0.05  0.16 0.31 0.69 1.2
## ST092Q04TA    0.58  0.09  0.06  0.10 0.39 0.61 1.1
## ST092Q05TA    0.84 -0.03 -0.04  0.02 0.72 0.28 1.0
## ST092Q06NA    0.85  0.00  0.01 -0.03 0.72 0.28 1.0
## ST092Q08NA    0.83  0.03  0.02 -0.07 0.65 0.35 1.0
## ST092Q09NA    0.81  0.04  0.03 -0.03 0.63 0.37 1.0
## ST093Q01TA_R -0.04  0.76  0.01 -0.03 0.59 0.41 1.0
## ST093Q03TA_R -0.03  0.76 -0.01 -0.01 0.59 0.41 1.0
## ST093Q04TA_R -0.03  0.78 -0.01 -0.02 0.61 0.39 1.0
## ST093Q05TA_R  0.03  0.72  0.00  0.04 0.51 0.49 1.0
## ST093Q06TA_R  0.07  0.72 -0.02  0.03 0.50 0.50 1.0
## ST093Q07NA_R -0.02  0.76  0.03 -0.02 0.59 0.41 1.0
## ST093Q08NA_R  0.13  0.61 -0.02  0.05 0.36 0.64 1.1
## 
##                        TC1  TC2  TC3  TC4
## SS loadings           4.05 3.77 2.86 2.62
## Proportion Var        0.17 0.16 0.12 0.11
## Cumulative Var        0.17 0.33 0.44 0.55
## Proportion Explained  0.30 0.28 0.22 0.20
## Cumulative Proportion 0.30 0.59 0.80 1.00
## 
##  With component correlations of 
##       TC1   TC2  TC3  TC4
## TC1  1.00 -0.18 0.10 0.30
## TC2 -0.18  1.00 0.04 0.01
## TC3  0.10  0.04 1.00 0.13
## TC4  0.30  0.01 0.13 1.00
## 
## Mean item complexity =  1
## Test of the hypothesis that 4 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  9372.88  with prob <  0 
## 
## Fit based upon off diagonal values = 0.95

This shows an extraction commonality of .32 for IC014Q06NA (‘I feel comfortable using my digital devices at home’) and supports the decision of Lin, Chai, and Jong (2019) to exclude this variable from their full analyses. Note, however, that ST092Q02TA (the ‘genetically modified organisms’ item from the environmental awareness scale) has a similarly low extraction communality (.31) and might also be excluded on this basis⁸

These scree plots and examination of eigenvalues and communality statistics are illustrative of the sorts of iterative analyses that one might conduct in the course of performing a principal components analysis, and evaluating the best factor structure to use in subsequent analyses. Ultimately, for this replication, we adopt a methodology similar to Lin, Chai, and Jong (2019).

D.3 Principal components analysis

The code below replicates the principal components analysis with oblimin rotation on the data from Singapore reported in Table 3.1 above, using sjPlot::tab_pca (Lüdecke 2020) to generate a summary table including principal components, “loadings” and alpha reliabilities.

# package inputenc error if used to generate pdf - not set up for use with LaTex 

sjPlot::tab_pca(SGP, rotation = c("oblimin"),  
nmbr.fctr = 4, show.var = TRUE, alternate.rows = TRUE, title = "(\\#tab:PCASGP) Principal Component Analysis - Singapore")

D.4 Factor structure diagrams

Figure D.1 below plots the results of this analysis, using the psych::fa.diagram function.

Figure D.1: Components analysis - Singapore

E Scoring scales and reporting differences between Singapore and Finland

Now that we have finalized our scales, the code that follows scores each using psych::scoreItems (Revelle 2020), replacing missing values for an item with the mean response for that item. Differences in the processing of missing items may be responsible for the small variations found between results reported here and those reported by Lin, Chai, and Jong (2019).

This code further shows how to reproduce Table 3.3 above to summarise mean scale score separately for Singapore and Finland.

scale_keys = list(
  ICT = c("IC014Q03NA", "IC014Q04NA", "IC014Q08NA", "IC014Q09NA"),
  interest = c("ST095Q04NA", "ST095Q07NA", "ST095Q08NA", "ST095Q13NA", "ST095Q15NA"),
  awareness = c("ST092Q01TA", "ST092Q02TA", "ST092Q04TA", "ST092Q05TA", "ST092Q06NA", "ST092Q08NA", "ST092Q09NA"),
  optimism = c("ST093Q01TA_R", "ST093Q03TA_R", "ST093Q04TA_R", "ST093Q05TA_R", "ST093Q06TA_R", "ST093Q07NA_R", "ST093Q08NA_R"))

scales_scored = psych::scoreItems(
  keys = scale_keys,
  items = PISA15_new,
  totals = FALSE, #find average scores
  impute = "mean") #replace missing values with mean response

# alternative options considered for scoreItems function:
  # impute = "none": scores are based on average of keyed, but non missing, scores

PISA15_new[, c("ICT", "interest", "awareness", "optimism")] = scales_scored$scores
attr(PISA15_new$ICT,'label') <- 'Perceived ICT competence'
attr(PISA15_new$interest,'label') <- 'Interest in broad science topics'
attr(PISA15_new$awareness,'label') <- 'Environmental awareness'
attr(PISA15_new$optimism,'label') <- 'Environmental optimism'

tab_compare <- arsenal::tableby(CNT ~ interest + ICT + awareness + optimism, 
    PISA15_new, numeric.simplify=TRUE, numeric.stats = c("Nmiss", "meansd"), digits=2)

tab_compare %>%
  summary(text = TRUE) %>%
  as.data.frame() %>%
  knitr::kable(caption = "Mean scale scores by country") %>%
  kableExtra::kable_styling(latex_options = c("striped"))

F Correlation matrices

The following code uses sjPlot::sjp.corr (Lüdecke 2020) to generate Figure 3.1 above, showing correlations between variables for the sample for Singapore.

# set print options

scale_height = knitr::opts_chunk$get('fig.height')*0.5
scale_width = knitr::opts_chunk$get('fig.width')*1.25
knitr::opts_chunk$set(fig.height = scale_height, fig.width = scale_width)

theme_update(legend.text = element_text(size = rel(0.7)))

SGP_scales <- PISA15_new %>%
  filter(CNT=="Singapore") %>%
  select(interest, ICT, awareness, optimism) %>%
  as.data.frame()
  
sjp.corr(SGP_scales, decimals = 2, show.p = TRUE, sort.corr = FALSE, corr.method = "pearson", na.deletion = "pairwise", show.legend = TRUE)

G Regression analyses

sjplot::tab_model (Lüdecke 2020) is used in the following code to generate Table 3.4 above. This table summarizes linear regression models for environmental awareness for Singapore and Finland side by side. Standardized coefficients are selected by using the argument show.std=TRUE, and the terms argument is used to specify variables for which coefficients should be reported (i.e., excluding the intercept from this table).

# create linear regression analysis objects
SGPawareness <- lm(awareness ~ interest + ICT, data = SGP_scales)
FINawareness <- lm(awareness ~ interest + ICT, data = FIN_scales)

# summarize these side by side in a table
sjplot::tab_model(SGPawareness, FINawareness, show.std = TRUE, show.est = FALSE, show.ci = FALSE, dv.labels = c("Singapore", "Finland"), terms = c("interest", "ICT"), title = "(\\#tab:awareness) Environmental awareness")

H Multilevel regression analyses

The following code uses sjPlot::tab_model to replicate Table 3.6 above.

First, new summary datasets are created including the school ID variable (CNTSCHID), subseqeuntly renamed school.

# create summary dataset for Singapore
SGP_scaleswithschool <- PISA15_new %>%
  filter(CNT=="Singapore") %>%
  select(interest, ICT, awareness, optimism, CNTSCHID) %>%
  as.data.frame()

# create summary dataset for Finland
FIN_scaleswithschool <- PISA15_new %>%
  filter(CNT=="Finland") %>%
  select(interest, ICT, awareness, optimism, CNTSCHID) %>%
  as.data.frame()

# create multilevel rgression analysis objects
SGPmultilevelawareness <- lmer(awareness ~ interest + ICT + (1|CNTSCHID), data = SGP_scaleswithschool)
FINmultilevelawareness <- lmer(awareness ~ interest + ICT + (1|CNTSCHID), data = FIN_scaleswithschool)

# summarize these side by side in a table
tab_model(SGPawareness, SGPmultilevelawareness, FINawareness, FINmultilevelawareness, show.std = TRUE, show.est = FALSE, show.ci = FALSE, dv.labels = c("Singapore linear", "Singapore multilevel", "Finland linear", "Finland multilevel"), terms = c("interest", "ICT"), title = "(\\#tab:comparisonawareness) Environmental awareness - model comparison")

The code above generates a lme4::lmer object, a summary of which can be viewed by simply typing its name:

SGPmultilevelawareness

## Linear mixed model fit by REML ['lmerMod']
## Formula: awareness ~ interest + ICT + (1 | school)
##    Data: SGP_scaleswithschool
## REML criterion at convergence: 9933.393
## Random effects:
##  Groups   Name        Std.Dev.
##  school   (Intercept) 0.1643  
##  Residual             0.5330  
## Number of obs: 6115, groups:  school, 177
## Fixed Effects:
## (Intercept)     interest          ICT  
##     1.94539      0.30777      0.07359

References

Bates, Douglas, Martin Maechler, Ben Bolker, and Steven Walker. 2020. Lme4: Linear Mixed-Effects Models Using ’Eigen’ and S4. https://CRAN.R-project.org/package=lme4.

Bernaards, Coen, and Robert Jennrich. 2014. GPArotation: GPA Factor Rotation. https://CRAN.R-project.org/package=GPArotation.

Bryer, Jason, and Kimberly Speerschneider. 2016. Likert: Analysis and Visualization Likert Items. https://CRAN.R-project.org/package=likert.

Comtois, Dominic. 2020. Summarytools: Tools to Quickly and Neatly Summarize Data. https://CRAN.R-project.org/package=summarytools.

Estrellado, Ryan, Emily Freer, Jesse Mostipak, Joshua Rosenberg, and Isabella Velásquez. 2020. Data Science in Education Using R. datascienceineducation.com/.

Gonel, Feride, and Atakan Akinci. 2018. “How Does ICT-Use Improve the Environment? The Case of Turkey.” World Journal of Science, Technology and Sustainable Development 15 (1): 2–12. https://doi.org/10.1108/WJSTSD-03-2017-0007.

Heinzen, Ethan, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson, and Gregory Dougherty. 2020. Arsenal: An Arsenal of ’R’ Functions for Large-Scale Statistical Summaries. https://CRAN.R-project.org/package=arsenal.

Lin, Pei-Yi, Ching Sing Chai, and Morris Siu-Yung Jong. 2019. “A PISA-2015 Comparative Meta-Analysis Between Singapore and Finland: Relations of Students’ Interest in Science, Perceived ICT Competence, and Environmental Awareness and Optimism.” International Journal of Environmental Research and Public Health 16 (24): 5157. https://doi.org/10.3390/ijerph16245157.

Lüdecke, Daniel. 2020. SjPlot: Data Visualization for Statistics in Social Science. https://CRAN.R-project.org/package=sjPlot.

Lüdecke, Daniel, Dominique Makowski, Mattan S. Ben-Shachar, Indrajeet Patil, and Søren Højsgaard. 2020. Parameters: Processing of Model Parameters. https://CRAN.R-project.org/package=parameters.

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Nakagawa, Shinichi, Paul C. D. Johnson, and Holger Schielzeth. 2017. “The Coefficient of Determination R2 and Intra-Class Correlation Coefficient from Generalized Linear Mixed-Effects Models Revisited and Expanded.” Journal of the Royal Society Interface 14 (134): 20170213. https://doi.org/10.1098/rsif.2017.0213.

OECD. 2017. “PISA 2015 Technical Report.” Paris: OECD Publishing. https://www.oecd.org/pisa/sitedocument/PISA-2015-technical-report-final.pdf.

———. 2019a. “Country Note: Programme for International Student Assessment (PISA) 2018 (Finland).” Paris: OECD Publishing. https://www.oecd.org/pisa/publications/PISA2018_CN_FIN.pdf.

———. 2019b. “Country Note: Programme for International Student Assessment (PISA) 2018 (Singapore).” Paris: OECD Publishing. https://www.oecd.org/pisa/publications/PISA2018_CN_SGP.pdf.

Ozturk, Ozlem. 2018. “Using PISA2015 Data to Analyze How the Scientific Literacy of Students from Different Socioeconomic Levels Can Be Predicted by Environmental Awareness and by Environmental Optimism.” PhD thesis. http://repository.bilkent.edu.tr/bitstream/handle/11693/47581/10194084.pdf?sequence=1&isAllowed=y.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Revelle, William. 2020. Psych: Procedures for Psychological, Psychometric, and Personality Research. https://CRAN.R-project.org/package=psych.

“Structural Models (EFA, CFA, SEM...).” n.d. Accessed October 4, 2020. https://easystats.github.io/parameters/articles/efa_cfa.html.

Tan, Kim Chwee Daniel, Yew-Jin Lee, and Aik-Ling Tan. 2009. “Environmental Education in Singapore: Learning to Manage Urban Landscapes and Futures.” In Environmental Education in Context, edited by Neil Taylor, Michael Littledyke, Chris Eames, and Richard K Coll, 243–51. Rotterdam: Sense. https://brill.com/view/book/edcoll/9789087909635/BP000024.xml.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Evan Miller. 2020. Haven: Import and Export ’Spss’, ’Stata’ and ’Sas’ Files. https://CRAN.R-project.org/package=haven.

Xie, Yihui. 2020a. Bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.

———. 2020b. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

I used these R packages for the purposes listed here, but these descriptions do not fully represent the capabilities of each package↩︎
Lin et al (2019) reported as ‘loadings’ the ‘eigenvectors’ representing unit-normalized principal directions (here scaled by the square roots of the respective eigenvalues)↩︎
Cronbach’s \(\alpha\) for each scale is reported here, consistent with Lin et al (2019), although psych::alpha could instead be used to calculate statistics thought more appropriate↩︎
Note that the variance estimates returned by sjPlot::tab_pca match those generated by psych::principal when rotate = none↩︎
p-values for lmer() objects are not given by default, given debates about how degrees of freedom for coefficients should be calculated (Estrellado et al. 2020). sjPlot::tab_model by default shows p-values for standardized coefficients for multilevel models that are simple approximations, based on t-statistics and using the normal distribution function. Greater precision can be provided when reporting linear mixed models by using conditional F-tests with Kenward-Roger approximation for degrees of freedom (which can also be displayed), using the pbkrtest package and options p.val = "kr", show.df = TRUE. This takes time to run, and when testing on this model did not result in a change of reported significance levels.↩︎
That is, the Intraclass Correlation Coefficient (ICC) represents the ratio of the between-cluster variance (τ₀₀) to the total variance (σ² + τ₀₀)↩︎
Marginal and conditional R-squared statistics based on Nakagawa, Johnson, and Schielzeth (2017) are presented as defaults for mixed effects models by the sjPlot::tab_model function. The marginal R-squared takes into account only variance of the fixed effects, while the conditional considers both the fixed and random effects. There is controversy in the appropriateness and interpretation of R-squared statistics that means varying options for evaluating the appropriateness and significance of these models should be considered↩︎
If this analysis is run performed on the Singaporean datafile with all missing values omitted, however, IC14Q06NA is the single variable with the lowest extraction communality, consistent with Lin, Chai, and Jong (2019). This shows that decisions about how to handle missing data can have flow-on effects that have a non-trivial impact on subsequent analyses and their interpretation.↩︎

Principal components analysis, linear regression and multi-level regression using R: Replication and extension of Lin, Chai and Jong (2019) - Draft 1

Lynley Aldridge

5/10/2020

1 Introduction

2 Data and sample

2.1 Data

2.2 Participants

2.3 Variables

2.3.1 Interest in broad science topics

2.3.2 Environmental awareness

2.3.3 Environmental optimism

2.3.4 Perceived ICT competence

2.4 Descriptive summary table

3 Results

3.1 Factor structure

3.2 Differences between Singapore and Finland

3.3 Relationships among variables

3.3.1 Correlation matrices

3.3.2 Regression analyses

Environmental awareness

Environmental optimism

3.3.3 Multilevel regression analyses

Environmental awareness

Environmental optimism

4 Conclusions

Appendices

A Reading in and preparing data

B Likert objects and plots

B.1 Creating the likert object

B.2 Stacked centred bar graph

C Summary tables

D Factor analysis

D.1 Testing the dataset for suitability for factor analysis

D.2 Determining how many components to extract

D.3 Principal components analysis

D.4 Factor structure diagrams

E Scoring scales and reporting differences between Singapore and Finland

F Correlation matrices

G Regression analyses

H Multilevel regression analyses

References