Data Wrangling for PISA Study

1. Uploading the dataset and Getting to Know the Respective Variables

A. Student Interview Data

Student Interview data provides many essential variables that are included in the model. The table below serves as the codebook for all the data in the model.

Student Interview Data Codebook
SN	Variable Name	Label	Other Information
1	CNTRYID	Country Identifier; USA Country Code = 840	Type: Numeric; Width: 3; Decimals: 0; Location: 1-3
2	CNT	Country code 3-character; USA	Type: Character; Width: 3; Location: 4-6
3	CNTSCHID	International school ID; 84000001-84000175	Type: Numeric; Width: 8; Decimals: 0; Location: 7-14
4	CNTSTUID	International student ID; 84000001-84008626	Type:Numeric ; Width: 8; Decimals: 0; Location: 15-22
5	ST001C01TA	NAT/What grade are you in? 1: Grade_7, 2: Grade_8, 3: Grade_9, 4: Grade_10, 5: Grade_11, 6: Grade_12, 95/97/98/99/Blank: Missing	Type: Numeric; Width: 2; Decimals: 0; Location: 23-24
6	RACETHC	NAT/Collapsed derived student race/ethnicity 1: White, not Hispanic; 2: Black or African American; 3: Hispanic or Latino; 4: Asian 5: Two or More Races 6: Other	Type: Numeric; Width: 2; Decimals: 0; Location: 25-26 95/97/98/99/Blank: Missing
7	ST005C01TA	NAT/What is the highest level of schooling (not including college) completed by your father? 1: She completed grade 12 (GED) 2: She completed grade 9 3: She completed grade 6 4: She did not complete grade 6	Type: Numeric; Width: 2; Decimals: 0; Location: 27-28 95/97/98/99/Blank: Missing
8	ST007C01TA	NAT/What is the highest level of schooling (not including college) completed by your father? 1: She completed grade 12 (GED) 2: She completed grade 9 3: She completed grade 6 4: She did not complete grade 6	Type: Numeric; Width: 2; Decimals: 0; Location: 29-30 95/97/98/99/Blank: Missing
9	ST011C17TA	NAT/Which of the following are in your home? A guest room 1: Yes 2: No 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 31
10	ST011C18TA	NAT/Which of the following are your home? A high-speed Internet connection 1: Yes 2: No 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 32
11	ST011C19TA	NAT/Which of the following are in your home? A musical instrument 1: Yes 2: No 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 33
12	ST019AC01T	NAT/In what country were you and your parents born? You 1: United States 2: Other Country 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 34
13	ST019BC01T	NAT/In what country were you and your parents born? Mother 1: United States 2: Other Country 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 35
14	ST019CC01T	NAT/In what country were you and your parents born? Father 1: United States 2: Other Country 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 36
15	ST127A01TA	NAT/Have you ever repeated a grade? in Kindergarten 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 37
16	ST127C01TA	NAT/Have you ever repeated a grade? in grades 1-6 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 38
17	ST127C02TA	NAT/Have you ever repeated a grade? in grades 7-9 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 39
18	ST127C03TA	NAT/Have you ever repeated a grade? in grades 10-12 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 40
19	ST022C01TA	NAT/What language do you speak at home most of the time? 1: English 2: Spanish 3: Other Language 5/7/8/9/Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 41
20	ST225C01HA	NAT/Which of the following do you expect to complete? Less than high school 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 42
21	ST225C02HA	NAT/Which of the following do you expect to complete? High school (high school diploma or GED) 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 43
22	ST225C03HA	NAT/Which of the following do you expect to complete? Vocational/technical certificate (such as cosmetology or auto mechanics) 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 44
23	ST225C04HA	NAT/Which of the following do you expect to complete? Associate’s degree (2-year degree from a community college) 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 45
24	ST225C05HA	NAT/Which of the following do you expect to complete? Bachelor’s degree (4-year college degree) 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 46
25	ST225C06HA	NAT/Which of the following do you expect to complete? Master’s degree/doctoral degree/professional degree (e.g., J.D. or M.D.) 0: Not Checked 1: Checked Blank: Missing	Type: Numeric; Width: 1; Decimals: 0; Location: 47

Other than conducting some data cleaning and modeling for some data science courses, I have not used this kind of data in real analysis. The National Assessment of Educational Progress (NAEP) conducts the required assessment for PISA. Other than the big data files stored in the OECD online portal (https://www.oecd.org/pisa/data/2018database/), which contains data for all countries. It is painful to go through all the variable names and parse out the required ones. It took some time, but I found the US only Public Use Files (PUF) in IES NCES web portal (https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2021019).

The portal also have SPSS files but in the syntax form, however, to help me brush off my skills I liked to go with the data stored as a text files. The data are all jumbled and the data set does not have variables name. More information regarding the data is stored in a separate zip file named “Codebooks”. As we become more familiar with the big data, our chances of encountering these types of data are more common. Here’s how I am going to approach them. Going through the Codebook, I figured out that the variable Country Identifier (CNTRYID), three Character Country Code (CNT), International School ID (CNTSCHID), and International Student ID (CNTSTUID) are common among these data files. Further inspection confirms that their place values are exactly same, thus, I don’t have to repeat the same information in all tables below.

student_data <- readLines("pisa18_us_stud_pud.dat")
student_data <- student_data |>
  data.frame() # Changing the list to a dataframe
head(student_data) # Checking how the data are put together

                                     student_data
1 840USA8400011284000001 4 3 1 121122211112111111
2 840USA8400002484000002 4 4 1 421122211113010001
3 840USA8400006184000003 4 1 1 111211111111110110
4 840USA8400003484000004 4 1 1 122211111111011000
5 840USA8400006284000006 4 3 1 121111111111000010
6 840USA8400013384000007 4 3 1 111111111113000010

str(student_data) # checking the structure of the data

'data.frame':   4838 obs. of  1 variable:
 $ student_data: chr  "840USA8400011284000001 4 3 1 121122211112111111" "840USA8400002484000002 4 4 1 421122211113010001" "840USA8400006184000003 4 1 1 111211111111110110" "840USA8400003484000004 4 1 1 122211111111011000" ...

dim(student_data) # checking the dimension

[1] 4838    1

Based on what we see, there is a jumbled string that runs all the way through 47 characters. Each row contains both numbers and letters. They have to be split into 25 different columns, but the good news, is each row has same length and they have fixed number of characters for each column. For example ST225C05HA has one number value without any decimal and it is located in the 46th position. Some numbers seems to follow SPACES but further analysis of code book confirms that they are not. The spaces came to being when the input was 1 character long in 2 character long space.

B. Teacher Data

There were two variables, i.e., TC186C01HA & TC800C001HA other than the IDs in Teacher Data.

Variable Name	Label	Other Information
TC186C01HA	NAT/In what country were you born? 1: The United States 2: Other Country 5, 7, 8, 9, & Blank: Missing	Type: Numeric; Width: 1; Decimal: 0; Location: 23
TC800C001HA	NAT/What is the highest level of education you have completed? 1: High school and/or some college courses 2: Associate’s degree 3: Bachelor’s degree 4: Master’s degree 5: Doctoral or professional degree (e.g., PhD, M.D., J.D.) 95, 97, 98, 99, & Blank: Missing	Type: Numeric; Width: 2; Decimal: 0; Location: 24-25

TC186C01HA

NAT/In what country were you born?

1: The United States

2: Other Country

5, 7, 8, 9, & Blank: Missing

Type: Numeric;

Width: 1;

Decimal: 0;

Location: 23

TC800C001HA

NAT/What is the highest level of education you have completed?

1: High school and/or some college courses

2: Associate’s degree

3: Bachelor’s degree

4: Master’s degree

5: Doctoral or professional degree (e.g., PhD, M.D., J.D.)

95, 97, 98, 99, & Blank: Missing

Type: Numeric;

Width: 2;

Decimal: 0;

Location: 24-25

teacher_data <- readLines("pisa18_us_tchr_pud.dat")
teacher_data <- teacher_data |>
  data.frame()
head(teacher_data)

               teacher_data
1 840USA8400014484000001  4
2 840USA84000011840000021 4
3 840USA84000084840000031 3
4 840USA84000082840000041 4
5 840USA8400016184000005  3
6 840USA84000132840000061 3

# str(teacher_data)
dim(teacher_data)

[1] 3526    1

Based on the outcomes there is one string of values per row which is supposed to be broken down into 6-different variables. Last two values are for column named TC800C001HA which has the information about teachers’ highest level of education. One before them, i.e., 23rd value is for TC186C01HA variable which provides information regarding whether the teachers were born in the United States or Other country. All the values between 1-22 are for four ID variables mentioned above.

C. School Data

There was just one variable, i.e., FRPL other than the ID variables mentioned above in the school database.

Variable Name	Label	Other Information
FRPL	NAT/Percentage of students eligible for free-/reduced-price lunch (Derived) 1: Less than 10 percent 2: 10-24.9 percent 3: 25-49.9 percent 4: 50-74.9 percent 5: 75 percent or more	Type: Numeric Width: 2 Decimal: 0 Location: 15-16 95, 97, 98, 99, & Blank: Missing Data

FRPL

NAT/Percentage of students eligible for free-/reduced-price lunch (Derived)

1: Less than 10 percent

2: 10-24.9 percent

3: 25-49.9 percent

4: 50-74.9 percent

5: 75 percent or more

Type: Numeric

Width: 2

Decimal: 0

Location: 15-16

95, 97, 98, 99, & Blank: Missing Data

school_data <- readLines("pisa18_us_schl_pud.dat")
school_data <- school_data |>
  data.frame()
head(school_data)

       school_data
1 840USA84000001 5
2 840USA84000002 4
3 840USA84000004 5
4 840USA8400000599
5 840USA8400000695
6 840USA84000007 3

# str(school_data)
dim(school_data)

[1] 164   1

Likewise, looking at the strings of values in the school_data, we have to break them into 4-columns. Three of them are the ID variables and the last one the information about percentage of students receiving free- or reduced-price lunch (FRPL). Location 15-16 pertains to FRPL percentage.

So far, I have uploaded the files, figured out their location and identified number of variables for each data set. Now, I have to go do the following activities to be able to analyze the data.

Parse out variables and separate entries for variables for all Student, Teacher, and School Level Data
Merge the data files using common variables
Find student wise Reading achievement test scores and append them with the merged data, and
Save the data as an excel file on the local device for further analyses

As a demonstration, this report shows the action I take regarding the student_data. For the sake of avoiding monotony, all the actions on teacher_data or school_data will not be shown.

D. Creating Variables Based on Given Information

I will have to come up with 25 variables, as given and have to exactly locate the values with high level of precision. I am using the extract() function.

# library(gsubfn)
# student_pattern <- "(...)(...)(........)(........)(..)(..)(..)(..)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)"
# clmn_std_data <- strapplyc(student_data, student_pattern, simplify = rbind)
# head(clmn_std_data)

student_data_clmns <- student_data |>
  tidyr::separate(
    col = student_data,
    sep = c(3, 6, 14, 22, 24, 26, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47),
    into = c(
      "CNTRYID", "CNT", "CNTSCHID", "CNTSTUID", "ST001C01TA", "RACETHC", "ST005C01TA",
      "ST007C01TA", "ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T",
      "ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA",
      "ST225C01HA", "ST225C02HA", "ST225C03HA", "ST225C04HA", "ST225C05HA", "ST225C06HA"
    ),
    remove = TRUE
  )
str(student_data_clmns)

'data.frame':   4838 obs. of  25 variables:
 $ CNTRYID   : chr  "840" "840" "840" "840" ...
 $ CNT       : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID  : chr  "84000112" "84000024" "84000061" "84000034" ...
 $ CNTSTUID  : chr  "84000001" "84000002" "84000003" "84000004" ...
 $ ST001C01TA: chr  " 4" " 4" " 4" " 4" ...
 $ RACETHC   : chr  " 3" " 4" " 1" " 1" ...
 $ ST005C01TA: chr  " 1" " 1" " 1" " 1" ...
 $ ST007C01TA: chr  " 1" " 4" " 1" " 1" ...
 $ ST011C17TA: chr  "2" "2" "1" "2" ...
 $ ST011C18TA: chr  "1" "1" "1" "2" ...
 $ ST011C19TA: chr  "1" "1" "2" "2" ...
 $ ST019AC01T: chr  "2" "2" "1" "1" ...
 $ ST019BC01T: chr  "2" "2" "1" "1" ...
 $ ST019CC01T: chr  "2" "2" "1" "1" ...
 $ ST127A01TA: chr  "1" "1" "1" "1" ...
 $ ST127C01TA: chr  "1" "1" "1" "1" ...
 $ ST127C02TA: chr  "1" "1" "1" "1" ...
 $ ST127C03TA: chr  "1" "1" "1" "1" ...
 $ ST022C01TA: chr  "2" "3" "1" "1" ...
 $ ST225C01HA: chr  "1" "0" "1" "0" ...
 $ ST225C02HA: chr  "1" "1" "1" "1" ...
 $ ST225C03HA: chr  "1" "0" "0" "1" ...
 $ ST225C04HA: chr  "1" "0" "1" "0" ...
 $ ST225C05HA: chr  "1" "0" "1" "0" ...
 $ ST225C06HA: chr  "1" "1" "0" "0" ...

hist(as.numeric(student_data_clmns$ST001C01TA))

summary(as.numeric(student_data_clmns$ST225C05HA))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  1.0000  0.7133  1.0000  1.0000      95

schl_data_clmns <- school_data |>
  tidyr::separate(
    col = school_data,
    sep = c(3, 6, 14, 16),
    into = c("CNTRYID", "CNT", "CNTSCHID", "FRPL"),
    remove = TRUE
  ) |>
  data.frame()
str(schl_data_clmns)

'data.frame':   164 obs. of  4 variables:
 $ CNTRYID : chr  "840" "840" "840" "840" ...
 $ CNT     : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID: chr  "84000001" "84000002" "84000004" "84000005" ...
 $ FRPL    : chr  " 5" " 4" " 5" "99" ...

hist(as.numeric(schl_data_clmns$FRPL))

techr_data_clmns <- teacher_data |>
  tidyr::separate(
    col = teacher_data,
    sep = c(3, 6, 14, 22, 23, 25),
    into = c("CNTRYID", "CNT", "CNTSCHID", "CNTTCHID", "TC186C01HA", "TC800C001HA"),
    remove = TRUE
  ) |>
  data.frame()
str(techr_data_clmns)

'data.frame':   3526 obs. of  6 variables:
 $ CNTRYID    : chr  "840" "840" "840" "840" ...
 $ CNT        : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID   : chr  "84000144" "84000011" "84000084" "84000082" ...
 $ CNTTCHID   : chr  "84000001" "84000002" "84000003" "84000004" ...
 $ TC186C01HA : chr  " " "1" "1" "1" ...
 $ TC800C001HA: chr  " 4" " 4" " 3" " 4" ...

hist(as.numeric(techr_data_clmns$TC800C001HA))

E. Changing Various Values to `NA` Using `na_if` Function in {dplyr} Package

The NEAP used various values to refer to missingness in 2018 PISA dataset. There are at least two patterns:

in some cases, they noted 5,7,8,9 & Blank as missing values, where 5 represented valid skip, 7 as not applicable, 8 for invalid, 9 for no response, & Blank.
In other cases, 95 for valid skip, 97 for not applicable, 98 for invalid, 99 for no response, and Blank.

So, I need to create two different indexes that I will pass through the columns and change the noted values to NAs.

# Creating indexes to pass through columns
index_95 <- c(95, 97, 98, 99)
index_5 <- c(5, 7, 8, 9)

Note: Most of the columns are stored as character class, if I change them to numeric all blank rows in corresponding columns will be marked to be NA.

The school data has only one column with the missing values, i.e., FRPL. The missing values are denoted by 95 and so forth. Thus, I am going to pass the whole data table through the is_95_na function.

options(scipen = 999)
# Checking the class of FRPL Column in School Dataset
class(schl_data_clmns$FRPL)

[1] "character"

# class(schl_data_clmns)

# Changing the class type of selected columns in school dataset
schl_data_clmns$FRPL <- as.numeric(schl_data_clmns$FRPL)
schl_data_clmns$CNTRYID <- as.numeric(schl_data_clmns$CNTRYID)
schl_data_clmns$CNTSCHID <- as.numeric(schl_data_clmns$CNTSCHID)
# class(schl_data_clmns$FRPL)

# Passing `index_95` through FRPL column in `school dataset`
schl_data_clmns$FRPL <- lapply(schl_data_clmns$FRPL, function(t) replace(t, t %in% index_95, NA))

# Calculating NAs and confirming the values with given statistics
# sum(is.na(schl_data_clmns$FRPL))
schl_data_clmns$FRPL <- as.numeric(as.character(schl_data_clmns$FRPL))
summary(schl_data_clmns)

    CNTRYID        CNT               CNTSCHID             FRPL      
 Min.   :840   Length:164         Min.   :84000001   Min.   :1.000  
 1st Qu.:840   Class :character   1st Qu.:84000044   1st Qu.:3.000  
 Median :840   Mode  :character   Median :84000086   Median :3.000  
 Mean   :840                      Mean   :84000087   Mean   :3.478  
 3rd Qu.:840                      3rd Qu.:84000129   3rd Qu.:4.000  
 Max.   :840                      Max.   :84000175   Max.   :5.000  
                                                     NA's   :30

# Saving the data frame as `school_final`
school_final <- schl_data_clmns
# Requesting histogram of FRPL to compare againg the one I requested earlier
hist(school_final$FRPL)

F. Preparing Teacher Data

# Changing the class type of selected columns in school dataset
techr_data_clmns$CNTRYID <- as.numeric(techr_data_clmns$CNTRYID)
techr_data_clmns$CNTSCHID <- as.numeric(techr_data_clmns$CNTSCHID)
techr_data_clmns$CNTTCHID <- as.numeric(techr_data_clmns$CNTTCHID)
techr_data_clmns$TC186C01HA <- as.numeric(techr_data_clmns$TC186C01HA)
techr_data_clmns$TC800C001HA <- as.numeric(techr_data_clmns$TC800C001HA)

# Passing `index_5` & `index_95' through "TC186C01HA" & "TC800C001HA" columns in `teacher dataset`, respectively
techr_data_clmns$TC186C01HA <- lapply(techr_data_clmns$TC186C01HA, function(u) replace(u, u %in% index_5, NA))
techr_data_clmns$TC800C001HA <- lapply(techr_data_clmns$TC800C001HA, function(v) replace(v, v %in% index_95, NA))

# For some weired reasons the above columns are saved as list, thus, changing them to numeric
techr_data_clmns$TC186C01HA <- as.numeric(as.character(techr_data_clmns$TC186C01HA))
techr_data_clmns$TC800C001HA <- as.numeric(as.character(techr_data_clmns$TC800C001HA))

# Saving the data frame as `school_final`
teacher_final <- techr_data_clmns

# Requesting histogram of FRPL to compare againg the one I requested earlier
summary(teacher_final)

    CNTRYID        CNT               CNTSCHID           CNTTCHID       
 Min.   :840   Length:3526        Min.   :84000001   Min.   :84000001  
 1st Qu.:840   Class :character   1st Qu.:84000047   1st Qu.:84000942  
 Median :840   Mode  :character   Median :84000088   Median :84001888  
 Mean   :840                      Mean   :84000088   Mean   :84001889  
 3rd Qu.:840                      3rd Qu.:84000131   3rd Qu.:84002838  
 Max.   :840                      Max.   :84000175   Max.   :84003779  
                                                                       
   TC186C01HA     TC800C001HA   
 Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:3.000  
 Median :1.000   Median :4.000  
 Mean   :1.052   Mean   :3.655  
 3rd Qu.:1.000   3rd Qu.:4.000  
 Max.   :2.000   Max.   :5.000  
 NA's   :1742    NA's   :696

## Checking Histogram for Consistency
hist(teacher_final$TC186C01HA)

hist(teacher_final$TC800C001HA)

The final calculations match the given number of missing values. Eyeballing the summary and running it through the actual code book, the data information matches, as well.

G. Preparing Student Data

This is kept at last because it has many columns and the procedures repeat over and over again. There are 24-columns that are numeric but listed as character. I can either change them one by one using 24-repeated syntax or use only a few lines to do. I would definitely like to save my time. First I would create an index containing the columns that need to be changed and them pass a sapply() function to change all of them at once.

# Indexing the columns that need to be changed
change_needed_columns <- c("CNTRYID", "CNTSCHID", "CNTSTUID", "ST001C01TA", "RACETHC", "ST005C01TA", "ST007C01TA", "ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T", "ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA", "ST225C01HA", "ST225C02HA", "ST225C03HA", "ST225C04HA", "ST225C05HA", "ST225C06HA")

# Passing the index through the dataset
student_data_clmns[change_needed_columns] <- sapply(student_data_clmns[change_needed_columns], as.numeric)

# Checking if that worked
sapply(student_data_clmns, class)

    CNTRYID         CNT    CNTSCHID    CNTSTUID  ST001C01TA     RACETHC 
  "numeric" "character"   "numeric"   "numeric"   "numeric"   "numeric" 
 ST005C01TA  ST007C01TA  ST011C17TA  ST011C18TA  ST011C19TA  ST019AC01T 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
 ST019BC01T  ST019CC01T  ST127A01TA  ST127C01TA  ST127C02TA  ST127C03TA 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
 ST022C01TA  ST225C01HA  ST225C02HA  ST225C03HA  ST225C04HA  ST225C05HA 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
 ST225C06HA 
  "numeric"

Yes. That worked.We now, have 1 character coulumn and 24 numeric columns. Now, I have to calculate the amount of missing data in the data set. Changing the class to numeric might have helped us in marking blank rows to NAs but there are other NAs we have to account for. Let’s check if the blanks were changed into NAs in a couple of columns.

colSums(is.na(student_data_clmns[change_needed_columns]))

   CNTRYID   CNTSCHID   CNTSTUID ST001C01TA    RACETHC ST005C01TA ST007C01TA 
         0          0          0         34         34         34         34 
ST011C17TA ST011C18TA ST011C19TA ST019AC01T ST019BC01T ST019CC01T ST127A01TA 
        34         34         34         35         35         35         36 
ST127C01TA ST127C02TA ST127C03TA ST022C01TA ST225C01HA ST225C02HA ST225C03HA 
        36         36         36         36         95         95         95 
ST225C04HA ST225C05HA ST225C06HA 
        95         95         95

Tallying these missing numbers (came from BLANK spaces) to the original files, I made sure they are accurate. Now, I have to figure out how many of these columns have missing values listed as 5,7,8,9 or 95,97,98,99. Once I do so, I would index them out and pass the the index_95 or index_5 index to change those numbers into NAs, at once. Here’s the list of columns divided based on the missing values:

# options(tibble.width = Inf) shows all the columns in a table regardless of number of columns
tibble(tribble(
  ~N_5_7_8_9, ~N_95_97_98_99,
  "ST011C17TA, ST011C18TA, ST011C19TA", "ST001C01TA",
  "ST019AC01T, ST019BC01T, ST019CC01T", "RACETHC",
  "ST127A01TA, ST127C01TA, ST127C02TA", "ST005C01TA", "ST127C03TA, ST022C01TA", "ST007C01TA"
))

# A tibble: 4 × 2
  N_5_7_8_9                          N_95_97_98_99
  <chr>                              <chr>        
1 ST011C17TA, ST011C18TA, ST011C19TA ST001C01TA   
2 ST019AC01T, ST019BC01T, ST019CC01T RACETHC      
3 ST127A01TA, ST127C01TA, ST127C02TA ST005C01TA   
4 ST127C03TA, ST022C01TA             ST007C01TA

Further analysis of the data files gave some ideas that there are third types of columns which left only blank spaces to denote missing values. As the blank spaces are already turned into NAs there is nothing I need to do at the moment.

Next step: Now I am going to created two separate indexes of columns that based on the information provided above.

missing_5 <- c("ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T", "ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA")
missing_95 <- c("ST001C01TA", "RACETHC", "ST005C01TA", "ST007C01TA")

After they are indexed, I am going to pass them through the student_data and change the respective values to NAs.

# Columns that denoted 5,7,8, & 9 as missing values
student_data_clmns[missing_5] <- sapply(student_data_clmns[missing_5], function(w) replace(w, w %in% index_5, NA))

# Columns that denoted 95,97,98, & 99 as misssing values
student_data_clmns[missing_95] <- sapply(student_data_clmns[missing_95], function(x) replace(x, x %in% index_95, NA))

# Let's check if that worked
comp_mat <- cbind(summary(student_data_clmns$ST005C01TA), summary(student_data_clmns$ST011C19TA))
colnames(comp_mat) <- c("ST005C01TA/from_95_series", "ST011C19TA/from_5_series")
comp_mat

        ST005C01TA/from_95_series ST011C19TA/from_5_series
Min.                     1.000000                 1.000000
1st Qu.                  1.000000                 1.000000
Median                   1.000000                 1.000000
Mean                     1.205355                 1.371765
3rd Qu.                  1.000000                 2.000000
Max.                     4.000000                 2.000000
NA's                    95.000000                85.000000

# Changing the student data to `student_final`
student_final <- student_data_clmns

Before we made these changes, both of the variables had 34 NAs in each. However, they have been changed to 95 for ST005C01TA and 85 to ST011C19TA. If we compare the number of missing values, the action worked. Great!

Like before, some of the columns might have actually changed to list rather than numeric. Checking for their structure using str() function helps us fingure that out.

str(student_final)

'data.frame':   4838 obs. of  25 variables:
 $ CNTRYID   : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT       : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID  : num  84000112 84000024 84000061 84000034 84000062 ...
 $ CNTSTUID  : num  84000001 84000002 84000003 84000004 84000006 ...
 $ ST001C01TA: num  4 4 4 4 4 4 4 4 4 4 ...
 $ RACETHC   : num  3 4 1 1 3 3 3 1 1 3 ...
 $ ST005C01TA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST007C01TA: num  1 4 1 1 1 1 1 1 1 1 ...
 $ ST011C17TA: num  2 2 1 2 2 1 1 2 2 2 ...
 $ ST011C18TA: num  1 1 1 2 1 1 1 1 1 1 ...
 $ ST011C19TA: num  1 1 2 2 1 1 1 1 1 2 ...
 $ ST019AC01T: num  2 2 1 1 1 1 1 1 1 1 ...
 $ ST019BC01T: num  2 2 1 1 1 1 1 1 1 1 ...
 $ ST019CC01T: num  2 2 1 1 1 1 1 1 1 1 ...
 $ ST127A01TA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST127C01TA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST127C02TA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST127C03TA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST022C01TA: num  2 3 1 1 1 3 1 1 1 1 ...
 $ ST225C01HA: num  1 0 1 0 0 0 0 0 0 0 ...
 $ ST225C02HA: num  1 1 1 1 0 0 1 1 1 1 ...
 $ ST225C03HA: num  1 0 0 1 0 0 0 0 0 1 ...
 $ ST225C04HA: num  1 0 1 0 0 0 0 1 0 1 ...
 $ ST225C05HA: num  1 0 1 0 1 1 1 1 1 1 ...
 $ ST225C06HA: num  1 1 0 0 0 0 0 1 0 0 ...

Looks like everything is intact. It saves some of my time.

Merging Data

Next step in this process is to merge the school, teacher, and student data together. Merging data frames is relatively easy task, as long as we have common ID variables. I was mindful of this requirement the whole time I cleaned and modeled the data. I am going to use them. However, even the subtle differences in the common column names throws an error. So, I want to check the difference in names before I proceed to merge the data files.

Waldo{} package in r is very fancy, in this respect. I can quickly compare the names and figure out if they are same or different, or where they differ using the compare() function. The drawback of this function is, it can’t compare more than two at once. Now, lets compare school_final with teacher_final at first.

# Saving the column names as objects
teacher_name <- names(teacher_final)
school_name <- names(school_final)
student_name <- names(student_final)

# comparing school and teacher column names
waldo::compare(school_name, teacher_name)

`old`: "CNTRYID" "CNT" "CNTSCHID" "FRPL"                               
`new`: "CNTRYID" "CNT" "CNTSCHID" "CNTTCHID" "TC186C01HA" "TC800C001HA"

# Comparing teacher and student column names
waldo::compare(teacher_name, student_name)

    old           | new                         
[1] "CNTRYID"     | "CNTRYID"    [1]            
[2] "CNT"         | "CNT"        [2]            
[3] "CNTSCHID"    | "CNTSCHID"   [3]            
[4] "CNTTCHID"    - "CNTSTUID"   [4]            
[5] "TC186C01HA"  - "ST001C01TA" [5]            
[6] "TC800C001HA" - "RACETHC"    [6]            
                  - "ST005C01TA" [7]            
                  - "ST007C01TA" [8]            
                  - "ST011C17TA" [9]            
                  - "ST011C18TA" [10]           
... ...             ...          and 15 more ...

Based on the comparisons,CNTRYID, CNT, & CNTSCHID are common among all three datasets and they are indeed same in datasets. When CNTTCHID is unique to teacher data, and CNTSCHID is unique ID variable to student data. The remaining variables are supposed to be unique to student, teacher, and school variables.

Taking sometime to think how to proceed, I realized that there is a required hierarchy in these datasets. Teacher dataset has school ID, i.e., CNTSCHID variable which is also present in school and student data sets. This is the key variable to rely on when I merge my data.

I am first going to merge school data to teacher data and merge the combined dataset to student data.

school_teacher_data <- merge(school_final, teacher_final, by = c("CNTRYID", "CNT", "CNTSCHID"))
str(school_teacher_data)

'data.frame':   3526 obs. of  7 variables:
 $ CNTRYID    : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT        : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID   : num  84000001 84000001 84000001 84000001 84000001 ...
 $ FRPL       : num  5 5 5 5 5 5 5 5 5 5 ...
 $ CNTTCHID   : num  84003620 84000135 84001445 84001269 84003595 ...
 $ TC186C01HA : num  1 1 NA NA 1 1 1 1 NA 1 ...
 $ TC800C001HA: num  3 4 4 NA 4 4 4 4 3 4 ...

Looks like that worked. I now have a merged data set named school_teacher_data, and it has 7 columns and 3, 526 rows. Each row in this data set refer to the number of teachers who took part in the study.

school_teacher_student_data <- merge(student_final, school_teacher_data, by = c("CNTSCHID", "CNTRYID", "CNT"))
str(school_teacher_student_data)

'data.frame':   107873 obs. of  29 variables:
 $ CNTSCHID   : num  84000001 84000001 84000001 84000001 84000001 ...
 $ CNTRYID    : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT        : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSTUID   : num  84007979 84007979 84007979 84007979 84007979 ...
 $ ST001C01TA : num  3 3 3 3 3 3 3 3 3 3 ...
 $ RACETHC    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST005C01TA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST007C01TA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST011C17TA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST011C18TA : num  2 2 2 2 2 2 2 2 2 2 ...
 $ ST011C19TA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST019AC01T : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST019BC01T : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST019CC01T : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST127A01TA : num  2 2 2 2 2 2 2 2 2 2 ...
 $ ST127C01TA : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ST127C02TA : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ST127C03TA : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ST022C01TA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST225C01HA : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ST225C02HA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST225C03HA : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ST225C04HA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST225C05HA : num  1 1 1 1 1 1 1 1 1 1 ...
 $ ST225C06HA : num  0 0 0 0 0 0 0 0 0 0 ...
 $ FRPL       : num  5 5 5 5 5 5 5 5 5 5 ...
 $ CNTTCHID   : num  84001445 84001269 84000669 84000751 84003620 ...
 $ TC186C01HA : num  NA NA 1 1 1 1 1 1 1 1 ...
 $ TC800C001HA: num  4 NA 4 4 3 3 3 4 4 4 ...

Because the school and teacher data don’t have studentID in them, I am given with with 107873 X 31 dimensional data set, which is definitely wrong. Looks like I have to have two different data sets.

Furthermore, I have been able to figure out the student data with their test scores. I have trimmed the data in SPSS, and it’s ready to be loaded. Let’s see.

library(foreign)
added_student_data <- read.spss("us_student_only.sav", to.data.frame = TRUE)
str(added_student_data)

'data.frame':   4838 obs. of  21 variables:
 $ CNTRYID : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT     : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID: num  84000001 84000001 84000001 84000001 84000001 ...
 $ CNTSTUID: num  84000250 84000304 84000353 84000536 84001240 ...
 $ GRADE   : Factor w/ 5 levels "Grade 8","Grade 9",..: 4 4 2 3 3 3 3 4 4 3 ...
 $ GENDER  : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 1 1 2 ...
 $ COB_STD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COB_MOM : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COB_DAD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 2 1 ...
 $ COM_HOM : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
 $ LANGN   : Factor w/ 4 levels "Spanish","English",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ INTERNET: Factor w/ 3 levels "Yes, and I use it",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ MISCED  : Factor w/ 7 levels "None","ISCED 1",..: 6 5 5 3 6 5 3 7 6 7 ...
 $ FISCED  : Factor w/ 7 levels "None","ISCED 1",..: 5 5 7 5 7 5 5 6 5 7 ...
 $ ICTHOME : Factor w/ 12 levels "0","1","2","3",..: 10 5 9 6 12 11 9 10 6 9 ...
 $ LOC_INFO: num  537 428 511 432 508 ...
 $ UNDERSTD: num  516 407 501 429 536 ...
 $ EVAL_REF: num  518 422 506 436 562 ...
 $ SINGLE  : num  559 429 508 470 518 ...
 $ MULTIPLE: num  556 422 503 445 537 ...
 $ READ_SCR: num  544 432 504 438 536 ...
 - attr(*, "variable.labels")= Named chr [1:21] "Country Identifier" "Country code 3-character" "Intl. School ID" "Intl. Student ID" ...
  ..- attr(*, "names")= chr [1:21] "CNTRYID" "CNT" "CNTSCHID" "CNTSTUID" ...
 - attr(*, "codepage")= int 65001

Looks like the dataset has been successfully loaded in R, and the quick snapshot of the structures of the variables, I see that the variable class and information have been successfully saved. I have experienced that sometimes a single variable tend to have subcategories in different formats (e.g., some as number and other as list). Checking a summary will allow me to check for their consistency and triangulate with the original data set for fidelity.

summary(added_student_data)

    CNTRYID        CNT               CNTSCHID           CNTSTUID       
 Min.   :840   Length:4838        Min.   :84000001   Min.   :84000001  
 1st Qu.:840   Class :character   1st Qu.:84000047   1st Qu.:84002155  
 Median :840   Mode  :character   Median :84000086   Median :84004338  
 Mean   :840                      Mean   :84000087   Mean   :84004300  
 3rd Qu.:840                      3rd Qu.:84000129   3rd Qu.:84006418  
 Max.   :840                      Max.   :84000175   Max.   :84008626  
                                                                       
      GRADE         GENDER              COB_STD              COB_MOM    
 Grade 8 :   8   Female:2376   United States:4411   United States:3524  
 Grade 9 : 401   Male  :2462   Other country: 314   Other country:1214  
 Grade 10:3598                 NA's         : 113   NA's         : 100  
 Grade 11: 826                                                          
 Grade 12:   5                                                          
                                                                        
                                                                        
          COB_DAD     COM_HOM                        LANGN     
 United States:3449   Yes :4170   Spanish               : 517  
 Other country:1254   No  : 604   English               :4054  
 NA's         : 135   NA's:  64   Another language (USA): 219  
                                  Missing               :  48  
                                                               
                                                               
                                                               
                    INTERNET                  MISCED    
 Yes, and I use it      :4261   ISCED 5A, 6      :2045  
 Yes, but I don’t use it: 139   ISCED 3A, ISCED 4:1447  
 No                     : 182   ISCED 5B         : 688  
 NA's                   : 256   ISCED 2          : 387  
                                ISCED 1          : 104  
                                (Other)          :  74  
                                NA's             :  93  
               FISCED        ICTHOME       LOC_INFO        UNDERSTD    
 ISCED 3A, ISCED 4:1784   10     :912   Min.   :155.6   Min.   :191.7  
 ISCED 5A, 6      :1674   9      :899   1st Qu.:425.2   1st Qu.:417.4  
 ISCED 5B         : 535   11     :788   Median :503.5   Median :501.2  
 ISCED 2          : 448   8      :705   Mean   :496.9   Mean   :495.7  
 ISCED 1          : 118   7      :498   3rd Qu.:574.2   3rd Qu.:575.5  
 (Other)          :  93   (Other):844   Max.   :784.6   Max.   :814.5  
 NA's             : 186   NA's   :192                                  
    EVAL_REF         SINGLE         MULTIPLE        READ_SCR    
 Min.   :185.4   Min.   :185.4   Min.   :193.5   Min.   :157.3  
 1st Qu.:424.9   1st Qu.:420.7   1st Qu.:423.6   1st Qu.:425.9  
 Median :509.7   Median :502.6   Median :503.4   Median :504.8  
 Mean   :505.0   Mean   :497.3   Mean   :500.0   Mean   :500.6  
 3rd Qu.:587.2   3rd Qu.:578.0   3rd Qu.:579.3   3rd Qu.:578.4  
 Max.   :795.4   Max.   :783.4   Max.   :784.5   Max.   :810.5

summary(student_final)

    CNTRYID        CNT               CNTSCHID           CNTSTUID       
 Min.   :840   Length:4838        Min.   :84000001   Min.   :84000001  
 1st Qu.:840   Class :character   1st Qu.:84000047   1st Qu.:84002155  
 Median :840   Mode  :character   Median :84000086   Median :84004338  
 Mean   :840                      Mean   :84000087   Mean   :84004300  
 3rd Qu.:840                      3rd Qu.:84000129   3rd Qu.:84006418  
 Max.   :840                      Max.   :84000175   Max.   :84008626  
                                                                       
   ST001C01TA       RACETHC        ST005C01TA      ST007C01TA   
 Min.   :2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:4.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :4.000   Median :2.000   Median :1.000   Median :1.000  
 Mean   :4.088   Mean   :2.205   Mean   :1.205   Mean   :1.239  
 3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:1.000  
 Max.   :6.000   Max.   :6.000   Max.   :4.000   Max.   :4.000  
 NA's   :34      NA's   :53      NA's   :95      NA's   :193    
   ST011C17TA      ST011C18TA      ST011C19TA      ST019AC01T   
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :2.000   Median :1.000   Median :1.000   Median :1.000  
 Mean   :1.525   Mean   :1.122   Mean   :1.372   Mean   :1.066  
 3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
 NA's   :115     NA's   :72      NA's   :85      NA's   :113    
   ST019BC01T      ST019CC01T      ST127A01TA      ST127C01TA      ST127C02TA  
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00  
 Median :1.000   Median :1.000   Median :1.000   Median :1.000   Median :1.00  
 Mean   :1.256   Mean   :1.267   Mean   :1.049   Mean   :1.087   Mean   :1.03  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.00  
 Max.   :2.000   Max.   :2.000   Max.   :3.000   Max.   :3.000   Max.   :3.00  
 NA's   :100     NA's   :135     NA's   :230     NA's   :160     NA's   :240   
   ST127C03TA      ST022C01TA      ST225C01HA       ST225C02HA    
 Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.000   Median :1.000   Median :0.0000   Median :1.0000  
 Mean   :1.015   Mean   :1.199   Mean   :0.1056   Mean   :0.7455  
 3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :3.000   Max.   :3.000   Max.   :1.0000   Max.   :1.0000  
 NA's   :294     NA's   :48      NA's   :95       NA's   :95      
   ST225C03HA      ST225C04HA       ST225C05HA       ST225C06HA   
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.000   Median :0.0000   Median :1.0000   Median :0.000  
 Mean   :0.214   Mean   :0.4213   Mean   :0.7133   Mean   :0.444  
 3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
 NA's   :95      NA's   :95       NA's   :95       NA's   :95

Turns out that the SPSS data is better shaped than the student_fianl. I am going to trim the student final data and merge it with the added_student_data.

trimmed_student <- student_final |>
  select(CNTRYID, CNT, CNTSCHID, CNTSTUID, RACETHC) |>
  rename(ETHNICITY = RACETHC)
head(trimmed_student)

  CNTRYID CNT CNTSCHID CNTSTUID ETHNICITY
1     840 USA 84000112 84000001         3
2     840 USA 84000024 84000002         4
3     840 USA 84000061 84000003         1
4     840 USA 84000034 84000004         1
5     840 USA 84000062 84000006         3
6     840 USA 84000133 84000007         3

Looks like that worked. Now, merge trimmed_student and added_student_data together by all other 4 parameters.

student_final_updated <- merge(trimmed_student, added_student_data, by = c("CNTRYID", "CNT", "CNTSCHID", "CNTSTUID"))
summary(student_final_updated)

    CNTRYID        CNT               CNTSCHID           CNTSTUID       
 Min.   :840   Length:4838        Min.   :84000001   Min.   :84000001  
 1st Qu.:840   Class :character   1st Qu.:84000047   1st Qu.:84002155  
 Median :840   Mode  :character   Median :84000086   Median :84004338  
 Mean   :840                      Mean   :84000087   Mean   :84004300  
 3rd Qu.:840                      3rd Qu.:84000129   3rd Qu.:84006418  
 Max.   :840                      Max.   :84000175   Max.   :84008626  
                                                                       
   ETHNICITY          GRADE         GENDER              COB_STD    
 Min.   :1.000   Grade 8 :   8   Female:2376   United States:4411  
 1st Qu.:1.000   Grade 9 : 401   Male  :2462   Other country: 314  
 Median :2.000   Grade 10:3598                 NA's         : 113  
 Mean   :2.205   Grade 11: 826                                     
 3rd Qu.:3.000   Grade 12:   5                                     
 Max.   :6.000                                                     
 NA's   :53                                                        
          COB_MOM              COB_DAD     COM_HOM    
 United States:3524   United States:3449   Yes :4170  
 Other country:1214   Other country:1254   No  : 604  
 NA's         : 100   NA's         : 135   NA's:  64  
                                                      
                                                      
                                                      
                                                      
                    LANGN                         INTERNET   
 Spanish               : 517   Yes, and I use it      :4261  
 English               :4054   Yes, but I don’t use it: 139  
 Another language (USA): 219   No                     : 182  
 Missing               :  48   NA's                   : 256  
                                                             
                                                             
                                                             
               MISCED                   FISCED        ICTHOME   
 ISCED 5A, 6      :2045   ISCED 3A, ISCED 4:1784   10     :912  
 ISCED 3A, ISCED 4:1447   ISCED 5A, 6      :1674   9      :899  
 ISCED 5B         : 688   ISCED 5B         : 535   11     :788  
 ISCED 2          : 387   ISCED 2          : 448   8      :705  
 ISCED 1          : 104   ISCED 1          : 118   7      :498  
 (Other)          :  74   (Other)          :  93   (Other):844  
 NA's             :  93   NA's             : 186   NA's   :192  
    LOC_INFO        UNDERSTD        EVAL_REF         SINGLE     
 Min.   :155.6   Min.   :191.7   Min.   :185.4   Min.   :185.4  
 1st Qu.:425.2   1st Qu.:417.4   1st Qu.:424.9   1st Qu.:420.7  
 Median :503.5   Median :501.2   Median :509.7   Median :502.6  
 Mean   :496.9   Mean   :495.7   Mean   :505.0   Mean   :497.3  
 3rd Qu.:574.2   3rd Qu.:575.5   3rd Qu.:587.2   3rd Qu.:578.0  
 Max.   :784.6   Max.   :814.5   Max.   :795.4   Max.   :783.4  
                                                                
    MULTIPLE        READ_SCR    
 Min.   :193.5   Min.   :157.3  
 1st Qu.:423.6   1st Qu.:425.9  
 Median :503.4   Median :504.8  
 Mean   :500.0   Mean   :500.6  
 3rd Qu.:579.3   3rd Qu.:578.4  
 Max.   :784.5   Max.   :810.5

Now, all other variables have have labels, but not the ETHNITICY. I am going do label the categories now.

student_final_updated$ETHNICITY <- factor(student_final_updated$ETHNICITY,
  levels = c(1:6),
  labels = c("White, not Hispanic", "Black or African American", "Hispanic or Latino", "Asian", "Two or More Race", "Other")
)
summary(student_final_updated)

    CNTRYID        CNT               CNTSCHID           CNTSTUID       
 Min.   :840   Length:4838        Min.   :84000001   Min.   :84000001  
 1st Qu.:840   Class :character   1st Qu.:84000047   1st Qu.:84002155  
 Median :840   Mode  :character   Median :84000086   Median :84004338  
 Mean   :840                      Mean   :84000087   Mean   :84004300  
 3rd Qu.:840                      3rd Qu.:84000129   3rd Qu.:84006418  
 Max.   :840                      Max.   :84000175   Max.   :84008626  
                                                                       
                     ETHNICITY         GRADE         GENDER    
 White, not Hispanic      :2097   Grade 8 :   8   Female:2376  
 Black or African American: 769   Grade 9 : 401   Male  :2462  
 Hispanic or Latino       :1221   Grade 10:3598                
 Asian                    : 284   Grade 11: 826                
 Two or More Race         : 367   Grade 12:   5                
 Other                    :  47                                
 NA's                     :  53                                
          COB_STD              COB_MOM              COB_DAD     COM_HOM    
 United States:4411   United States:3524   United States:3449   Yes :4170  
 Other country: 314   Other country:1214   Other country:1254   No  : 604  
 NA's         : 113   NA's         : 100   NA's         : 135   NA's:  64  
                                                                           
                                                                           
                                                                           
                                                                           
                    LANGN                         INTERNET   
 Spanish               : 517   Yes, and I use it      :4261  
 English               :4054   Yes, but I don’t use it: 139  
 Another language (USA): 219   No                     : 182  
 Missing               :  48   NA's                   : 256  
                                                             
                                                             
                                                             
               MISCED                   FISCED        ICTHOME   
 ISCED 5A, 6      :2045   ISCED 3A, ISCED 4:1784   10     :912  
 ISCED 3A, ISCED 4:1447   ISCED 5A, 6      :1674   9      :899  
 ISCED 5B         : 688   ISCED 5B         : 535   11     :788  
 ISCED 2          : 387   ISCED 2          : 448   8      :705  
 ISCED 1          : 104   ISCED 1          : 118   7      :498  
 (Other)          :  74   (Other)          :  93   (Other):844  
 NA's             :  93   NA's             : 186   NA's   :192  
    LOC_INFO        UNDERSTD        EVAL_REF         SINGLE     
 Min.   :155.6   Min.   :191.7   Min.   :185.4   Min.   :185.4  
 1st Qu.:425.2   1st Qu.:417.4   1st Qu.:424.9   1st Qu.:420.7  
 Median :503.5   Median :501.2   Median :509.7   Median :502.6  
 Mean   :496.9   Mean   :495.7   Mean   :505.0   Mean   :497.3  
 3rd Qu.:574.2   3rd Qu.:575.5   3rd Qu.:587.2   3rd Qu.:578.0  
 Max.   :784.6   Max.   :814.5   Max.   :795.4   Max.   :783.4  
                                                                
    MULTIPLE        READ_SCR    
 Min.   :193.5   Min.   :157.3  
 1st Qu.:423.6   1st Qu.:425.9  
 Median :503.4   Median :504.8  
 Mean   :500.0   Mean   :500.6  
 3rd Qu.:579.3   3rd Qu.:578.4  
 Max.   :784.5   Max.   :810.5

That worked but looks like we got some uninvited columns (6 to be total). They are named VAR00001 - 00006. In addition, we don’t need the country ID and CNT variables because they all are from the United States.

student_final_updated <- select(student_final_updated, CNTSCHID, CNTSTUID, ETHNICITY, GRADE, GENDER, COB_STD, COB_MOM, COB_DAD, COM_HOM, LANGN, INTERNET, MISCED, FISCED, ICTHOME, LOC_INFO, UNDERSTD, EVAL_REF, SINGLE, MULTIPLE, READ_SCR)
student_final_updated$ICTHOME <- as.numeric(student_final_updated$ICTHOME)
str(student_final_updated)

'data.frame':   4838 obs. of  20 variables:
 $ CNTSCHID : num  84000001 84000001 84000001 84000001 84000001 ...
 $ CNTSTUID : num  84000250 84000304 84000353 84000536 84001240 ...
 $ ETHNICITY: Factor w/ 6 levels "White, not Hispanic",..: 5 1 1 1 1 1 5 1 3 1 ...
 $ GRADE    : Factor w/ 5 levels "Grade 8","Grade 9",..: 4 4 2 3 3 3 3 4 4 3 ...
 $ GENDER   : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 1 1 2 ...
 $ COB_STD  : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COB_MOM  : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COB_DAD  : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 2 1 ...
 $ COM_HOM  : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
 $ LANGN    : Factor w/ 4 levels "Spanish","English",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ INTERNET : Factor w/ 3 levels "Yes, and I use it",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ MISCED   : Factor w/ 7 levels "None","ISCED 1",..: 6 5 5 3 6 5 3 7 6 7 ...
 $ FISCED   : Factor w/ 7 levels "None","ISCED 1",..: 5 5 7 5 7 5 5 6 5 7 ...
 $ ICTHOME  : num  10 5 9 6 12 11 9 10 6 9 ...
 $ LOC_INFO : num  537 428 511 432 508 ...
 $ UNDERSTD : num  516 407 501 429 536 ...
 $ EVAL_REF : num  518 422 506 436 562 ...
 $ SINGLE   : num  559 429 508 470 518 ...
 $ MULTIPLE : num  556 422 503 445 537 ...
 $ READ_SCR : num  544 432 504 438 536 ...

At last I want to save the file as .csv for further use.

# write.csv(student_final_updated, "student_final.csv")

Success!

Once again. I also have to save the school and teacher data in the local disc. But before I do so, I want to take of a few things.

Checking Variables

str(school_teacher_data)

'data.frame':  3526 obs. of  7 variables:
 $ CNTRYID    : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT        : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID   : num  84000001 84000001 84000001 84000001 84000001 ...
 $ FRPL       : num  5 5 5 5 5 5 5 5 5 5 ...
 $ CNTTCHID   : num  84003620 84000135 84001445 84001269 84003595 ...
 $ TC186C01HA : num  1 1 NA NA 1 1 1 1 NA 1 ...
 $ TC800C001HA: num  3 4 4 NA 4 4 4 4 3 4 ...

Renaming Variables

First, I want to rename the TC186C01HA to country of birth for teacher aka COB_TECH and TC800C001HA to highest education aka HIGH_EDU.

school_teacher_data <- school_teacher_data |>
  rename(
    COB_TECH = TC186C01HA,
    HIGH_EDU = TC800C001HA
  )
names(school_teacher_data)

[1] "CNTRYID"  "CNT"      "CNTSCHID" "FRPL"     "CNTTCHID" "COB_TECH" "HIGH_EDU"

Then, I can get rid of country ID and County variables.

Removing Variables

school_teacher_data <- select(school_teacher_data, CNTSCHID, CNTTCHID, FRPL, COB_TECH, HIGH_EDU)
summary(school_teacher_data)

    CNTSCHID           CNTTCHID             FRPL          COB_TECH    
 Min.   :84000001   Min.   :84000001   Min.   :1.000   Min.   :1.000  
 1st Qu.:84000047   1st Qu.:84000942   1st Qu.:3.000   1st Qu.:1.000  
 Median :84000088   Median :84001888   Median :3.000   Median :1.000  
 Mean   :84000088   Mean   :84001889   Mean   :3.435   Mean   :1.052  
 3rd Qu.:84000131   3rd Qu.:84002838   3rd Qu.:4.000   3rd Qu.:1.000  
 Max.   :84000175   Max.   :84003779   Max.   :5.000   Max.   :2.000  
                                       NA's   :539     NA's   :1742   
    HIGH_EDU    
 Min.   :1.000  
 1st Qu.:3.000  
 Median :4.000  
 Mean   :3.655  
 3rd Qu.:4.000  
 Max.   :5.000  
 NA's   :696

Changing the label of the variables

school_teacher_data$FRPL <- factor(school_teacher_data$FRPL,
  levels = c(1:5),
  labels = c("Less than 10 percent", "10 to 24.9 percent", "25 to 49.9 percent", "50 to 74.9 percent", "75 percent or more")
)

school_teacher_data$COB_TECH <- factor(school_teacher_data$COB_TECH,
  levels = c(1:2),
  labels = c("United States", "Other Country")
)

school_teacher_data$HIGH_EDU <- factor(school_teacher_data$HIGH_EDU,
  levels = c(1:5),
  labels = c("High school and/or some college courses", "Associate's degree", "Bachelor's degree", "Master's degree", "Doctoral or professional degree")
)
summary(school_teacher_data)

    CNTSCHID           CNTTCHID                          FRPL     
 Min.   :84000001   Min.   :84000001   Less than 10 percent: 119  
 1st Qu.:84000047   1st Qu.:84000942   10 to 24.9 percent  : 415  
 Median :84000088   Median :84001888   25 to 49.9 percent  :1072  
 Mean   :84000088   Mean   :84001889   50 to 74.9 percent  : 810  
 3rd Qu.:84000131   3rd Qu.:84002838   75 percent or more  : 571  
 Max.   :84000175   Max.   :84003779   NA's                : 539  
          COB_TECH                                       HIGH_EDU   
 United States:1691   High school and/or some college courses:  11  
 Other Country:  93   Associate's degree                     :  12  
 NA's         :1742   Bachelor's degree                      :1013  
                      Master's degree                        :1700  
                      Doctoral or professional degree        :  94  
                      NA's                                   : 696

Everything worked!!

However, our proposal contains a lot of school and teacher variables. The data sets we have worked on so far provided us almost all student level variables, but not as many in terms of teachers and students. I have .SPSS files for both school and teacher data sets. Now, I am going to load student and teacher data sets and merge them with the school_teacher_data.

added_teacher_data <- read.spss("us_teacher_only.sav", to.data.frame = TRUE)
str(added_teacher_data)

'data.frame':   3526 obs. of  14 variables:
 $ CNTRYID         : num  840 840 840 840 840 840 840 840 840 840 ...
 $ CNT             : chr  "USA" "USA" "USA" "USA" ...
 $ CNTSCHID        : num  84000144 84000011 84000084 84000082 84000161 ...
 $ CNTTCHID        : num  84000001 84000002 84000003 84000004 84000005 ...
 $ TC_SEX          : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 NA 2 2 1 ...
 $ TC_AGE          : Factor w/ 51 levels "20","21","22",..: 27 39 13 15 3 19 NA 5 22 7 ...
 $ TC_STATUS       : Factor w/ 4 levels "Full-time (more than 90% of full-time hours)",..: 1 4 1 1 1 1 NA 1 1 1 ...
 $ TC_EXPERIENCE   : Factor w/ 50 levels "0","1","2","3",..: 23 37 5 14 3 12 NA 2 18 6 ...
 $ TC_TRAINING     : Factor w/ 3 levels "Yes, a programme of 1 year or less",..: 2 2 1 3 3 2 NA 1 2 2 ...
 $ TC_QUALIFICATION: Factor w/ 5 levels "I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .",..: 1 1 2 1 1 1 NA 3 1 1 ...
 $ TC_PREPREADLIT  : Factor w/ 32 levels "0","2","5","6",..: 8 NA NA NA 17 NA NA 24 8 NA ...
 $ TC_PREPREADPED  : Factor w/ 29 levels "0","4","5","7",..: 19 NA NA NA 11 NA NA 10 13 NA ...
 $ TC_PREPGENPED   : Factor w/ 32 levels "0","5","6","8",..: 18 NA NA NA 18 NA NA 9 21 NA ...
 $ EMPLTIM         : Factor w/ 2 levels "full time","part time": 1 2 1 1 1 1 NA 1 1 1 ...
 - attr(*, "variable.labels")= Named chr [1:14] "Country Identifier" "Country code 3-character" "Intl. School ID" "Intl. Teacher ID" ...
  ..- attr(*, "names")= chr [1:14] "CNTRYID" "CNT" "CNTSCHID" "CNTTCHID" ...
 - attr(*, "codepage")= int 65001

added_school_data <- read.spss("us_school_only.sav", to.data.frame = TRUE)
str(added_school_data)

'data.frame':   164 obs. of  18 variables:
 $ CNTRYID : Factor w/ 82 levels "Albania","Baku (Azerbaijan)",..: 77 77 77 77 77 77 77 77 77 77 ...
 $ CNT     : Factor w/ 82 levels "Albania","United Arab Emirates",..: 81 81 81 81 81 81 81 81 81 81 ...
 $ CNTSCHID: num  84000001 84000002 84000004 84000005 84000006 ...
 $ SCH_LOCA: Factor w/ 5 levels "A village, hamlet or rural area (fewer than 3 000 people)",..: 3 4 4 4 4 4 4 3 NA 3 ...
 $ SCH_TYPE: Factor w/ 2 levels "A public school (Managed by a public education authority, government agency, or governing board)",..: 1 1 1 1 2 1 1 1 NA 1 ...
 $ TOT_BOYS: Factor w/ 136 levels "0","20","23",..: 6 73 62 103 14 3 131 96 NA 133 ...
 $ TOT_GIRL: Factor w/ 137 levels "0","2","27","30",..: 9 71 64 98 13 3 133 100 NA 134 ...
 $ TOT_STDS: num  100 1415 1235 2009 180 ...
 $ SCH_ELS : Factor w/ 43 levels "0","1","2","3",..: 6 11 11 NA 14 5 2 4 NA 14 ...
 $ SCH_DISA: Factor w/ 30 levels "0","1","2","3",..: 28 23 11 NA NA 11 10 14 NA 11 ...
 $ SCH_FRPL: Factor w/ 70 levels "1","2","3","4",..: 59 52 56 NA NA 26 10 19 NA 22 ...
 $ EXT_CLAS: Factor w/ 2 levels "Yes","No": 1 2 2 NA 2 2 2 2 NA 2 ...
 $ STRATIO : Factor w/ 137 levels "1.6667","4","4.2326",..: 2 54 52 NA 8 35 NA 127 NA 104 ...
 $ PROATCE : Factor w/ 48 levels "0","0.007","0.0667",..: 48 38 43 NA NA 48 NA 48 NA 48 ...
 $ PROAT5AB: Factor w/ 59 levels "0.0116","0.0141",..: 17 37 59 NA NA 59 NA 59 NA 11 ...
 $ PROAT5AM: Factor w/ 116 levels "0.0167","0.0395",..: 72 23 82 NA 9 91 NA 112 NA 24 ...
 $ PROAT6  : Factor w/ 80 levels "0","0.0043","0.0065",..: 1 34 79 NA 1 1 NA 12 NA 71 ...
 $ CLSIZE  : Factor w/ 9 levels "15 students or fewer",..: 3 2 4 5 2 1 4 5 NA 1 ...
 - attr(*, "variable.labels")= Named chr [1:18] "Country Identifier" "Country code 3-character" "Intl. School ID" "Which of the following definitions best describes the community in which your school is located?" ...
  ..- attr(*, "names")= chr [1:18] "CNTRYID" "CNT" "CNTSCHID" "SCH_LOCA" ...
 - attr(*, "codepage")= int 65001

Once they are uploaded, I am going to combine them together before I combine them with the school_teacher_data. I have noted the structure of the common variables and they match.

added_school_teacher <- merge(added_school_data, added_teacher_data, by = c("CNTSCHID"))
tail((added_school_teacher), 15)

     CNTSCHID     CNTRYID.x         CNT.x
3512 84000175 United States United States
3513 84000175 United States United States
3514 84000175 United States United States
3515 84000175 United States United States
3516 84000175 United States United States
3517 84000175 United States United States
3518 84000175 United States United States
3519 84000175 United States United States
3520 84000175 United States United States
3521 84000175 United States United States
3522 84000175 United States United States
3523 84000175 United States United States
3524 84000175 United States United States
3525 84000175 United States United States
3526 84000175 United States United States
                                        SCH_LOCA
3512 A small town (3 000 to about 15 000 people)
3513 A small town (3 000 to about 15 000 people)
3514 A small town (3 000 to about 15 000 people)
3515 A small town (3 000 to about 15 000 people)
3516 A small town (3 000 to about 15 000 people)
3517 A small town (3 000 to about 15 000 people)
3518 A small town (3 000 to about 15 000 people)
3519 A small town (3 000 to about 15 000 people)
3520 A small town (3 000 to about 15 000 people)
3521 A small town (3 000 to about 15 000 people)
3522 A small town (3 000 to about 15 000 people)
3523 A small town (3 000 to about 15 000 people)
3524 A small town (3 000 to about 15 000 people)
3525 A small town (3 000 to about 15 000 people)
3526 A small town (3 000 to about 15 000 people)
                                                                                             SCH_TYPE
3512 A public school (Managed by a public education authority, government agency, or governing board)
3513 A public school (Managed by a public education authority, government agency, or governing board)
3514 A public school (Managed by a public education authority, government agency, or governing board)
3515 A public school (Managed by a public education authority, government agency, or governing board)
3516 A public school (Managed by a public education authority, government agency, or governing board)
3517 A public school (Managed by a public education authority, government agency, or governing board)
3518 A public school (Managed by a public education authority, government agency, or governing board)
3519 A public school (Managed by a public education authority, government agency, or governing board)
3520 A public school (Managed by a public education authority, government agency, or governing board)
3521 A public school (Managed by a public education authority, government agency, or governing board)
3522 A public school (Managed by a public education authority, government agency, or governing board)
3523 A public school (Managed by a public education authority, government agency, or governing board)
3524 A public school (Managed by a public education authority, government agency, or governing board)
3525 A public school (Managed by a public education authority, government agency, or governing board)
3526 A public school (Managed by a public education authority, government agency, or governing board)
     TOT_BOYS TOT_GIRL TOT_STDS SCH_ELS SCH_DISA SCH_FRPL EXT_CLAS STRATIO
3512      200      187      387       2       25       45       No   10.75
3513      200      187      387       2       25       45       No   10.75
3514      200      187      387       2       25       45       No   10.75
3515      200      187      387       2       25       45       No   10.75
3516      200      187      387       2       25       45       No   10.75
3517      200      187      387       2       25       45       No   10.75
3518      200      187      387       2       25       45       No   10.75
3519      200      187      387       2       25       45       No   10.75
3520      200      187      387       2       25       45       No   10.75
3521      200      187      387       2       25       45       No   10.75
3522      200      187      387       2       25       45       No   10.75
3523      200      187      387       2       25       45       No   10.75
3524      200      187      387       2       25       45       No   10.75
3525      200      187      387       2       25       45       No   10.75
3526      200      187      387       2       25       45       No   10.75
     PROATCE PROAT5AB PROAT5AM PROAT6         CLSIZE CNTRYID.y CNT.y CNTTCHID
3512       1        1   0.4167      0 16-20 students       840   USA 84003230
3513       1        1   0.4167      0 16-20 students       840   USA 84001024
3514       1        1   0.4167      0 16-20 students       840   USA 84000574
3515       1        1   0.4167      0 16-20 students       840   USA 84000758
3516       1        1   0.4167      0 16-20 students       840   USA 84001229
3517       1        1   0.4167      0 16-20 students       840   USA 84001582
3518       1        1   0.4167      0 16-20 students       840   USA 84003451
3519       1        1   0.4167      0 16-20 students       840   USA 84000161
3520       1        1   0.4167      0 16-20 students       840   USA 84001905
3521       1        1   0.4167      0 16-20 students       840   USA 84001123
3522       1        1   0.4167      0 16-20 students       840   USA 84000374
3523       1        1   0.4167      0 16-20 students       840   USA 84002142
3524       1        1   0.4167      0 16-20 students       840   USA 84000325
3525       1        1   0.4167      0 16-20 students       840   USA 84003520
3526       1        1   0.4167      0 16-20 students       840   USA 84003650
     TC_SEX TC_AGE                                    TC_STATUS TC_EXPERIENCE
3512 Female     26 Full-time (more than 90% of full-time hours)             1
3513   Male     37 Full-time (more than 90% of full-time hours)            15
3514   Male     46 Full-time (more than 90% of full-time hours)            18
3515   Male     40 Full-time (more than 90% of full-time hours)            16
3516   <NA>   <NA>                                         <NA>          <NA>
3517   <NA>   <NA>                                         <NA>          <NA>
3518 Female     31 Full-time (more than 90% of full-time hours)             3
3519 Female     47 Full-time (more than 90% of full-time hours)            20
3520   Male     28 Full-time (more than 90% of full-time hours)             5
3521   <NA>   <NA>                                         <NA>          <NA>
3522 Female     60 Full-time (more than 90% of full-time hours)            24
3523   <NA>   <NA>                                         <NA>          <NA>
3524 Female     33 Full-time (more than 90% of full-time hours)            11
3525 Female     43 Full-time (more than 90% of full-time hours)            12
3526   <NA>     61 Full-time (more than 90% of full-time hours)            39
                             TC_TRAINING
3512 Yes, a programme longer than 1 year
3513 Yes, a programme longer than 1 year
3514 Yes, a programme longer than 1 year
3515 Yes, a programme longer than 1 year
3516                                <NA>
3517                                <NA>
3518 Yes, a programme longer than 1 year
3519 Yes, a programme longer than 1 year
3520                                  No
3521                                <NA>
3522 Yes, a programme longer than 1 year
3523                                <NA>
3524  Yes, a programme of 1 year or less
3525 Yes, a programme longer than 1 year
3526                                <NA>
                                                                                                          TC_QUALIFICATION
3512 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3513 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3514                                                      I attended a work-based teacher education or training programme.
3515 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3516                                                                                                                  <NA>
3517                                                                                                                  <NA>
3518 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3519 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3520                                                      I attended a work-based teacher education or training programme.
3521                                                                                                                  <NA>
3522 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3523                                                                                                                  <NA>
3524 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3525 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3526 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
     TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED   EMPLTIM
3512           <NA>           <NA>          <NA> full time
3513             80             10            10 full time
3514           <NA>           <NA>          <NA> full time
3515             10             10            80 full time
3516           <NA>           <NA>          <NA>      <NA>
3517           <NA>           <NA>          <NA>      <NA>
3518           <NA>           <NA>          <NA> full time
3519           <NA>           <NA>          <NA> full time
3520              5              5            90 full time
3521           <NA>           <NA>          <NA>      <NA>
3522             40             40            20 full time
3523           <NA>           <NA>          <NA>      <NA>
3524             40             20            40 full time
3525           <NA>           <NA>          <NA> full time
3526           <NA>           <NA>          <NA> full time

dim(added_school_teacher)

[1] 3526   31

Looks like we got what we wanted. I have seen some repeated columns or columns that we don’t require in this analysis. I am going to get rid of these variables, now. I will arrange them in order and keep only the ones that I need.

added_school_teacher_updated <- select(added_school_teacher, CNTSCHID, CNTTCHID, TC_SEX, TC_AGE, TC_STATUS, TC_EXPERIENCE, TC_TRAINING, TC_QUALIFICATION, TC_PREPREADLIT, TC_PREPREADPED, TC_PREPGENPED, EMPLTIM, SCH_LOCA, SCH_TYPE, TOT_BOYS, TOT_GIRL, TOT_STDS, SCH_ELS, SCH_DISA, SCH_FRPL, EXT_CLAS, STRATIO, PROATCE, PROAT5AB, PROAT5AM, PROAT6, CLSIZE)
names(added_school_teacher_updated)

 [1] "CNTSCHID"         "CNTTCHID"         "TC_SEX"           "TC_AGE"          
 [5] "TC_STATUS"        "TC_EXPERIENCE"    "TC_TRAINING"      "TC_QUALIFICATION"
 [9] "TC_PREPREADLIT"   "TC_PREPREADPED"   "TC_PREPGENPED"    "EMPLTIM"         
[13] "SCH_LOCA"         "SCH_TYPE"         "TOT_BOYS"         "TOT_GIRL"        
[17] "TOT_STDS"         "SCH_ELS"          "SCH_DISA"         "SCH_FRPL"        
[21] "EXT_CLAS"         "STRATIO"          "PROATCE"          "PROAT5AB"        
[25] "PROAT5AM"         "PROAT6"           "CLSIZE"

dim(added_school_teacher_updated)

[1] 3526   27

We made the changes and our data retain total rows and all columns except the once we omitted. Now, I am going combine this data set to the school_teacher_data, the common merging variables will be CNTSCHID & CNTTCHID.

school_teacher_data_all_merged <- merge(school_teacher_data, added_school_teacher_updated, by = c("CNTSCHID", "CNTTCHID"))
school_teacher_final <- select(school_teacher_data_all_merged, CNTSCHID, CNTTCHID, TC_SEX, TC_AGE, COB_TECH, HIGH_EDU, TC_STATUS, TC_EXPERIENCE, TC_TRAINING, TC_QUALIFICATION, TC_PREPREADLIT, TC_PREPREADPED, TC_PREPGENPED, EMPLTIM, SCH_LOCA, SCH_TYPE, TOT_BOYS, TOT_GIRL, TOT_STDS, SCH_ELS, SCH_DISA, SCH_FRPL, FRPL, EXT_CLAS, STRATIO, PROATCE, PROAT5AB, PROAT5AM, PROAT6, CLSIZE) |>
  rename(
    TC_COB = COB_TECH,
    TC_EDU = HIGH_EDU
  )
head(school_teacher_final)

  CNTSCHID CNTTCHID TC_SEX TC_AGE        TC_COB            TC_EDU
1 84000001 84000135 Female     45 United States   Master's degree
2 84000001 84000471 Female     50 United States Bachelor's degree
3 84000001 84000593 Female     49 United States   Master's degree
4 84000001 84000669 Female     48 United States   Master's degree
5 84000001 84000751 Female     28 United States   Master's degree
6 84000001 84001269   <NA>   <NA>          <NA>              <NA>
                                     TC_STATUS TC_EXPERIENCE
1 Full-time (more than 90% of full-time hours)            16
2 Full-time (more than 90% of full-time hours)            26
3 Full-time (more than 90% of full-time hours)            20
4 Full-time (more than 90% of full-time hours)            25
5 Full-time (more than 90% of full-time hours)          <NA>
6                                         <NA>          <NA>
                          TC_TRAINING
1 Yes, a programme longer than 1 year
2 Yes, a programme longer than 1 year
3 Yes, a programme longer than 1 year
4  Yes, a programme of 1 year or less
5  Yes, a programme of 1 year or less
6                                <NA>
                                                                                                       TC_QUALIFICATION
1 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
2 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
3 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
4 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
5 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .
6                                                                                                                  <NA>
  TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED   EMPLTIM
1           <NA>           <NA>          <NA> full time
2           <NA>           <NA>          <NA> full time
3           <NA>           <NA>          <NA> full time
4           <NA>           <NA>          <NA> full time
5           <NA>           <NA>          <NA> full time
6           <NA>           <NA>          <NA>      <NA>
                                 SCH_LOCA
1 A town (15 000 to about 100 000 people)
2 A town (15 000 to about 100 000 people)
3 A town (15 000 to about 100 000 people)
4 A town (15 000 to about 100 000 people)
5 A town (15 000 to about 100 000 people)
6 A town (15 000 to about 100 000 people)
                                                                                          SCH_TYPE
1 A public school (Managed by a public education authority, government agency, or governing board)
2 A public school (Managed by a public education authority, government agency, or governing board)
3 A public school (Managed by a public education authority, government agency, or governing board)
4 A public school (Managed by a public education authority, government agency, or governing board)
5 A public school (Managed by a public education authority, government agency, or governing board)
6 A public school (Managed by a public education authority, government agency, or governing board)
  TOT_BOYS TOT_GIRL TOT_STDS SCH_ELS SCH_DISA SCH_FRPL               FRPL
1       51       49      100       5       37       79 75 percent or more
2       51       49      100       5       37       79 75 percent or more
3       51       49      100       5       37       79 75 percent or more
4       51       49      100       5       37       79 75 percent or more
5       51       49      100       5       37       79 75 percent or more
6       51       49      100       5       37       79 75 percent or more
  EXT_CLAS STRATIO PROATCE PROAT5AB PROAT5AM PROAT6         CLSIZE
1      Yes       4       1     0.44     0.56      0 21-25 students
2      Yes       4       1     0.44     0.56      0 21-25 students
3      Yes       4       1     0.44     0.56      0 21-25 students
4      Yes       4       1     0.44     0.56      0 21-25 students
5      Yes       4       1     0.44     0.56      0 21-25 students
6      Yes       4       1     0.44     0.56      0 21-25 students

dim(school_teacher_final)

[1] 3526   30

summary(school_teacher_final)

    CNTSCHID           CNTTCHID           TC_SEX         TC_AGE    
 Min.   :84000001   Min.   :84000001   Female:1784   40     : 108  
 1st Qu.:84000047   1st Qu.:84000942   Male  :1019   39     :  97  
 Median :84000088   Median :84001888   NA's  : 723   41     :  94  
 Mean   :84000088   Mean   :84001889                 48     :  91  
 3rd Qu.:84000131   3rd Qu.:84002838                 36     :  89  
 Max.   :84000175   Max.   :84003779                 (Other):2347  
                                                     NA's   : 700  
           TC_COB                                         TC_EDU    
 United States:1691   High school and/or some college courses:  11  
 Other Country:  93   Associate's degree                     :  12  
 NA's         :1742   Bachelor's degree                      :1013  
                      Master's degree                        :1700  
                      Doctoral or professional degree        :  94  
                      NA's                                   : 696  
                                                                    
                                        TC_STATUS    TC_EXPERIENCE 
 Full-time (more than 90% of full-time hours):2773   15     : 134  
 Part-time (71-90% of full-time hours)       :  22   5      : 123  
 Part-time (50-70% of full-time hours)       :  21   8      : 117  
 Part-time (less than 50% of full-time hours):  17   6      : 114  
 NA's                                        : 693   4      : 111  
                                                     (Other):2186  
                                                     NA's   : 741  
                              TC_TRAINING  
 Yes, a programme of 1 year or less : 577  
 Yes, a programme longer than 1 year:2022  
 No                                 : 229  
 NA's                               : 698  
                                           
                                           
                                           
                                                                                                              TC_QUALIFICATION
 I attended a standard teacher training programme at an  educational institute eligible to educate or train teachers .:2303   
 I attended an in-service teacher education or training programme.                                                    : 113   
 I attended a work-based teacher education or training programme.                                                     : 241   
 I attended training in another pedagogical profession.                                                               :  31   
 Other                                                                                                                : 142   
 NA's                                                                                                                 : 696   
                                                                                                                              
 TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED       EMPLTIM    
 50     : 143   20     : 203   20     : 168   full time:2773  
 20     : 134   25     : 144   50     : 108   part time:  60  
 40     : 100   30     : 144   25     : 103   NA's     : 693  
 30     :  97   40     : 114   30     :  98                   
 10     :  84   10     : 103   40     :  98                   
 (Other): 444   (Other): 295   (Other): 448                   
 NA's   :2524   NA's   :2523   NA's   :2503                   
                                                      SCH_LOCA   
 A village, hamlet or rural area (fewer than 3 000 people): 187  
 A small town (3 000 to about 15 000 people)              : 587  
 A town (15 000 to about 100 000 people)                  :1192  
 A city (100 000 to about 1 000 000 people)               : 845  
 A large city (with over 1 000 000 people)                : 445  
 NA's                                                     : 270  
                                                                 
                                                                                                                    SCH_TYPE   
 A public school (Managed by a public education authority, government agency, or governing board)                       :3093  
 A private school (Managed by a non-government org; e.g. a church, trade union, business, or other private institution.): 144  
 NA's                                                                                                                   : 289  
                                                                                                                               
                                                                                                                               
                                                                                                                               
                                                                                                                               
    TOT_BOYS       TOT_GIRL       TOT_STDS       SCH_ELS        SCH_DISA   
 950    :  75   1200   :  50   Min.   :  22   1      : 333   10     : 389  
 459    :  50   1300   :  50   1st Qu.: 729   5      : 294   15     : 292  
 724    :  50   1338   :  50   Median :1453   12     : 206   13     : 289  
 1000   :  50   0      :  46   Mean   :1557   2      : 184   20     : 256  
 1230   :  50   276    :  46   3rd Qu.:2100   6      : 168   12     : 190  
 (Other):2852   (Other):2864   Max.   :4507   (Other):1697   (Other):1696  
 NA's   : 399   NA's   : 420   NA's   :420    NA's   : 644   NA's   : 414  
    SCH_FRPL                      FRPL      EXT_CLAS       STRATIO    
 35     : 123   Less than 10 percent: 119   Yes : 886   9.381  :  25  
 30     : 121   10 to 24.9 percent  : 415   No  :2200   9.5176 :  25  
 40     : 114   25 to 49.9 percent  :1072   NA's: 440   9.7447 :  25  
 15     : 101   50 to 74.9 percent  : 810               10.3714:  25  
 12     :  99   75 percent or more  : 571               10.4314:  25  
 (Other):2583   NA's                : 539               (Other):2834  
 NA's   : 385                                           NA's   : 567  
    PROATCE        PROAT5AB       PROAT5AM        PROAT6    
 1      :1824   1      :1453   0.5    : 145   0      : 892  
 0.9778 :  71   0.7143 :  66   0.2    :  71   0.04   :  75  
 0.9524 :  50   0.8    :  46   1      :  66   0.0286 :  50  
 0.9833 :  46   0.1667 :  45   0.1765 :  50   0.0625 :  50  
 0.9231 :  45   0.75   :  41   0.2857 :  50   0.0227 :  42  
 (Other): 895   (Other):1226   (Other):2502   (Other):1761  
 NA's   : 595   NA's   : 649   NA's   : 642   NA's   : 656  
            CLSIZE    
 26-30 students:1230  
 21-25 students: 923  
 31-35 students: 454  
 16-20 students: 356  
 36-40 students:  95  
 (Other)       :  72  
 NA's          : 396

summary(student_final_updated)

    CNTSCHID           CNTSTUID                            ETHNICITY   
 Min.   :84000001   Min.   :84000001   White, not Hispanic      :2097  
 1st Qu.:84000047   1st Qu.:84002155   Black or African American: 769  
 Median :84000086   Median :84004338   Hispanic or Latino       :1221  
 Mean   :84000087   Mean   :84004300   Asian                    : 284  
 3rd Qu.:84000129   3rd Qu.:84006418   Two or More Race         : 367  
 Max.   :84000175   Max.   :84008626   Other                    :  47  
                                       NA's                     :  53  
      GRADE         GENDER              COB_STD              COB_MOM    
 Grade 8 :   8   Female:2376   United States:4411   United States:3524  
 Grade 9 : 401   Male  :2462   Other country: 314   Other country:1214  
 Grade 10:3598                 NA's         : 113   NA's         : 100  
 Grade 11: 826                                                          
 Grade 12:   5                                                          
                                                                        
                                                                        
          COB_DAD     COM_HOM                        LANGN     
 United States:3449   Yes :4170   Spanish               : 517  
 Other country:1254   No  : 604   English               :4054  
 NA's         : 135   NA's:  64   Another language (USA): 219  
                                  Missing               :  48  
                                                               
                                                               
                                                               
                    INTERNET                  MISCED    
 Yes, and I use it      :4261   ISCED 5A, 6      :2045  
 Yes, but I don’t use it: 139   ISCED 3A, ISCED 4:1447  
 No                     : 182   ISCED 5B         : 688  
 NA's                   : 256   ISCED 2          : 387  
                                ISCED 1          : 104  
                                (Other)          :  74  
                                NA's             :  93  
               FISCED        ICTHOME          LOC_INFO        UNDERSTD    
 ISCED 3A, ISCED 4:1784   Min.   : 1.000   Min.   :155.6   Min.   :191.7  
 ISCED 5A, 6      :1674   1st Qu.: 8.000   1st Qu.:425.2   1st Qu.:417.4  
 ISCED 5B         : 535   Median :10.000   Median :503.5   Median :501.2  
 ISCED 2          : 448   Mean   : 9.413   Mean   :496.9   Mean   :495.7  
 ISCED 1          : 118   3rd Qu.:11.000   3rd Qu.:574.2   3rd Qu.:575.5  
 (Other)          :  93   Max.   :12.000   Max.   :784.6   Max.   :814.5  
 NA's             : 186   NA's   :192                                     
    EVAL_REF         SINGLE         MULTIPLE        READ_SCR    
 Min.   :185.4   Min.   :185.4   Min.   :193.5   Min.   :157.3  
 1st Qu.:424.9   1st Qu.:420.7   1st Qu.:423.6   1st Qu.:425.9  
 Median :509.7   Median :502.6   Median :503.4   Median :504.8  
 Mean   :505.0   Mean   :497.3   Mean   :500.0   Mean   :500.6  
 3rd Qu.:587.2   3rd Qu.:578.0   3rd Qu.:579.3   3rd Qu.:578.4  
 Max.   :795.4   Max.   :783.4   Max.   :784.5   Max.   :810.5

The compiled data set has 30 columns and 3526 rows. Each row represent a distinct teacher data. Now, I am going to save the data as a .csv file.

# write.csv(school_teacher_final, "school_teacher_final.csv")

Success!!

Data Wrangling for PISA Study

Nirmal Ghimire, Ph.D.

2022-09-26

1. Uploading the dataset and Getting to Know the Respective Variables

A. Student Interview Data

B. Teacher Data

C. School Data

D. Creating Variables Based on Given Information

E. Changing Various Values to `NA` Using `na_if` Function in {dplyr} Package

F. Preparing Teacher Data

G. Preparing Student Data

Merging Data

Data Wrangling for PISA Study

Nirmal Ghimire, Ph.D.

2022-09-26

1. Uploading the dataset and Getting to Know the Respective Variables

A. Student Interview Data

B. Teacher Data

C. School Data

D. Creating Variables Based on Given Information

E. Changing Various Values to NA Using na_if Function in {dplyr} Package

F. Preparing Teacher Data

G. Preparing Student Data

Merging Data

E. Changing Various Values to `NA` Using `na_if` Function in {dplyr} Package