library(tidyverse)
library(ggplot2)
library(janitor)
library(stringr)
library(readr)
Student Interview data provides many essential variables that are included in the model. The table below serves as the codebook for all the data in the model.
| SN | Variable Name | Label | Other Information |
|---|---|---|---|
| 1 | CNTRYID | Country Identifier; USA Country Code = 840 |
Type: Numeric; Width: 3; Decimals: 0; Location: 1-3 |
| 2 | CNT | Country code 3-character; USA | Type: Character; Width: 3; Location: 4-6 |
| 3 | CNTSCHID | International school ID; 84000001-84000175 |
Type: Numeric; Width: 8; Decimals: 0; Location: 7-14 |
| 4 | CNTSTUID | International student ID; 84000001-84008626 |
Type:Numeric ; Width: 8; Decimals: 0; Location: 15-22 |
| 5 | ST001C01TA | NAT/What grade are you in? 1: Grade_7, 2: Grade_8, 3: Grade_9, 4: Grade_10, 5: Grade_11, 6: Grade_12, 95/97/98/99/Blank: Missing |
Type: Numeric; Width: 2; Decimals: 0; Location: 23-24 |
| 6 | RACETHC | NAT/Collapsed derived student race/ethnicity 1: White, not Hispanic; 2: Black or African American; 3: Hispanic or Latino; 4: Asian 5: Two or More Races 6: Other |
Type: Numeric; Width: 2; Decimals: 0; Location: 25-26 95/97/98/99/Blank: Missing |
| 7 | ST005C01TA | NAT/What is the highest level of schooling (not including college) completed by your father? 1: She completed grade 12 (GED) 2: She completed grade 9 3: She completed grade 6 4: She did not complete grade 6 |
Type: Numeric; Width: 2; Decimals: 0; Location: 27-28 95/97/98/99/Blank: Missing |
| 8 | ST007C01TA | NAT/What is the highest level of schooling (not including college) completed by your father? 1: She completed grade 12 (GED) 2: She completed grade 9 3: She completed grade 6 4: She did not complete grade 6 |
Type: Numeric; Width: 2; Decimals: 0; Location: 29-30 95/97/98/99/Blank: Missing |
| 9 | ST011C17TA | NAT/Which of the following are in your home? A guest room 1: Yes 2: No 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 31 |
| 10 | ST011C18TA | NAT/Which of the following are your home? A high-speed Internet connection 1: Yes 2: No 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 32 |
| 11 | ST011C19TA | NAT/Which of the following are in your home? A musical instrument 1: Yes 2: No 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 33 |
| 12 | ST019AC01T | NAT/In what country were you and your parents born? You 1: United States 2: Other Country 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 34 |
| 13 | ST019BC01T | NAT/In what country were you and your parents born? Mother 1: United States 2: Other Country 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 35 |
| 14 | ST019CC01T | NAT/In what country were you and your parents born? Father 1: United States 2: Other Country 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 36 |
| 15 | ST127A01TA | NAT/Have you ever repeated a grade? in Kindergarten 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 37 |
| 16 | ST127C01TA | NAT/Have you ever repeated a grade? in grades 1-6 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 38 |
| 17 | ST127C02TA | NAT/Have you ever repeated a grade? in grades 7-9 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 39 |
| 18 | ST127C03TA | NAT/Have you ever repeated a grade? in grades 10-12 1: No, never 2: Yes, once 3: Yes, twice or more 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 40 |
| 19 | ST022C01TA | NAT/What language do you speak at home most of the time? 1: English 2: Spanish 3: Other Language 5/7/8/9/Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 41 |
| 20 | ST225C01HA | NAT/Which of the following do you expect to complete? Less than high school 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 42 |
| 21 | ST225C02HA | NAT/Which of the following do you expect to complete? High school (high school diploma or GED) 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 43 |
| 22 | ST225C03HA | NAT/Which of the following do you expect to complete? Vocational/technical certificate (such as cosmetology or auto mechanics) 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 44 |
| 23 | ST225C04HA | NAT/Which of the following do you expect to complete? Associate’s degree (2-year degree from a community college) 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 45 |
| 24 | ST225C05HA | NAT/Which of the following do you expect to complete? Bachelor’s degree (4-year college degree) 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 46 |
| 25 | ST225C06HA | NAT/Which of the following do you expect to complete? Master’s degree/doctoral degree/professional degree (e.g., J.D. or M.D.) 0: Not Checked 1: Checked Blank: Missing |
Type: Numeric; Width: 1; Decimals: 0; Location: 47 |
Other than conducting some data cleaning and modeling for some data science courses, I have not used this kind of data in real analysis. The National Assessment of Educational Progress (NAEP) conducts the required assessment for PISA. Other than the big data files stored in the OECD online portal (https://www.oecd.org/pisa/data/2018database/), which contains data for all countries. It is painful to go through all the variable names and parse out the required ones. It took some time, but I found the US only Public Use Files (PUF) in IES NCES web portal (https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2021019).
The portal also have SPSS files but in the syntax form, however, to
help me brush off my skills I liked to go with the data stored as a text
files. The data are all jumbled and the data set does not have variables
name. More information regarding the data is stored in a separate zip
file named “Codebooks”. As we become more familiar with the big data,
our chances of encountering these types of data are more common. Here’s
how I am going to approach them. Going through the Codebook, I figured
out that the variable Country Identifier (CNTRYID),
three Character Country Code (CNT),
International School ID (CNTSCHID), and
International Student ID (CNTSTUID) are common among these
data files. Further inspection confirms that their place values are
exactly same, thus, I don’t have to repeat the same information in all
tables below.
student_data <- readLines("pisa18_us_stud_pud.dat")
student_data <- student_data |>
data.frame() # Changing the list to a dataframe
head(student_data) # Checking how the data are put together
student_data
1 840USA8400011284000001 4 3 1 121122211112111111
2 840USA8400002484000002 4 4 1 421122211113010001
3 840USA8400006184000003 4 1 1 111211111111110110
4 840USA8400003484000004 4 1 1 122211111111011000
5 840USA8400006284000006 4 3 1 121111111111000010
6 840USA8400013384000007 4 3 1 111111111113000010
str(student_data) # checking the structure of the data
'data.frame': 4838 obs. of 1 variable:
$ student_data: chr "840USA8400011284000001 4 3 1 121122211112111111" "840USA8400002484000002 4 4 1 421122211113010001" "840USA8400006184000003 4 1 1 111211111111110110" "840USA8400003484000004 4 1 1 122211111111011000" ...
dim(student_data) # checking the dimension
[1] 4838 1
Based on what we see, there is a jumbled string that runs all the way
through 47 characters. Each row contains both numbers and letters.
They have to be split into 25 different columns, but the good
news, is each row has same length and they have fixed number of
characters for each column. For example ST225C05HA
has one number value without any decimal and it is located in the 46th
position. Some numbers seems to follow SPACES but further analysis of
code book confirms that they are not. The spaces came to being when the
input was 1 character long in 2 character long space.
There were two variables, i.e., TC186C01HA &
TC800C001HA other than the IDs in Teacher Data.
| Variable Name | Label | Other Information |
|---|---|---|
| TC186C01HA | NAT/In what country were you born? 1: The United States 2: Other Country 5, 7, 8, 9, & Blank: Missing |
Type: Numeric; Width: 1; Decimal: 0; Location: 23 |
| TC800C001HA | NAT/What is the highest level of education you have completed? 1: High school and/or some college courses 2: Associate’s degree 3: Bachelor’s degree 4: Master’s degree 5: Doctoral or professional degree (e.g., PhD, M.D., J.D.) 95, 97, 98, 99, & Blank: Missing |
Type: Numeric; Width: 2; Decimal: 0; Location: 24-25 |
teacher_data <- readLines("pisa18_us_tchr_pud.dat")
teacher_data <- teacher_data |>
data.frame()
head(teacher_data)
teacher_data
1 840USA8400014484000001 4
2 840USA84000011840000021 4
3 840USA84000084840000031 3
4 840USA84000082840000041 4
5 840USA8400016184000005 3
6 840USA84000132840000061 3
# str(teacher_data)
dim(teacher_data)
[1] 3526 1
Based on the outcomes there is one string of values per row which is
supposed to be broken down into 6-different variables. Last two values
are for column named TC800C001HA which has the information
about teachers’ highest level of education. One before them, i.e., 23rd
value is for TC186C01HA variable which provides information
regarding whether the teachers were born in the United States or Other
country. All the values between 1-22 are for four ID variables mentioned
above.
There was just one variable, i.e., FRPL other than the ID variables mentioned above in the school database.
| Variable Name | Label | Other Information |
|---|---|---|
| FRPL | NAT/Percentage of students eligible for free-/reduced-price lunch (Derived) 1: Less than 10 percent 2: 10-24.9 percent 3: 25-49.9 percent 4: 50-74.9 percent 5: 75 percent or more |
Type: Numeric Width: 2 Decimal: 0 Location: 15-16 95, 97, 98, 99, & Blank: Missing Data |
school_data <- readLines("pisa18_us_schl_pud.dat")
school_data <- school_data |>
data.frame()
head(school_data)
school_data
1 840USA84000001 5
2 840USA84000002 4
3 840USA84000004 5
4 840USA8400000599
5 840USA8400000695
6 840USA84000007 3
# str(school_data)
dim(school_data)
[1] 164 1
Likewise, looking at the strings of values in the
school_data, we have to break them into 4-columns. Three of
them are the ID variables and the last one the information about
percentage of students receiving free- or reduced-price lunch (FRPL).
Location 15-16 pertains to FRPL percentage.
So far, I have uploaded the files, figured out their location and identified number of variables for each data set. Now, I have to go do the following activities to be able to analyze the data.
As a demonstration, this report shows the action I take
regarding the student_data. For the sake of avoiding
monotony, all the actions on teacher_data or
school_data will not be shown.
I will have to come up with 25 variables, as given and have to
exactly locate the values with high level of precision. I am using the
extract() function.
# library(gsubfn)
# student_pattern <- "(...)(...)(........)(........)(..)(..)(..)(..)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)"
# clmn_std_data <- strapplyc(student_data, student_pattern, simplify = rbind)
# head(clmn_std_data)
student_data_clmns <- student_data |>
tidyr::separate(
col = student_data,
sep = c(3, 6, 14, 22, 24, 26, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47),
into = c(
"CNTRYID", "CNT", "CNTSCHID", "CNTSTUID", "ST001C01TA", "RACETHC", "ST005C01TA",
"ST007C01TA", "ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T",
"ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA",
"ST225C01HA", "ST225C02HA", "ST225C03HA", "ST225C04HA", "ST225C05HA", "ST225C06HA"
),
remove = TRUE
)
str(student_data_clmns)
'data.frame': 4838 obs. of 25 variables:
$ CNTRYID : chr "840" "840" "840" "840" ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : chr "84000112" "84000024" "84000061" "84000034" ...
$ CNTSTUID : chr "84000001" "84000002" "84000003" "84000004" ...
$ ST001C01TA: chr " 4" " 4" " 4" " 4" ...
$ RACETHC : chr " 3" " 4" " 1" " 1" ...
$ ST005C01TA: chr " 1" " 1" " 1" " 1" ...
$ ST007C01TA: chr " 1" " 4" " 1" " 1" ...
$ ST011C17TA: chr "2" "2" "1" "2" ...
$ ST011C18TA: chr "1" "1" "1" "2" ...
$ ST011C19TA: chr "1" "1" "2" "2" ...
$ ST019AC01T: chr "2" "2" "1" "1" ...
$ ST019BC01T: chr "2" "2" "1" "1" ...
$ ST019CC01T: chr "2" "2" "1" "1" ...
$ ST127A01TA: chr "1" "1" "1" "1" ...
$ ST127C01TA: chr "1" "1" "1" "1" ...
$ ST127C02TA: chr "1" "1" "1" "1" ...
$ ST127C03TA: chr "1" "1" "1" "1" ...
$ ST022C01TA: chr "2" "3" "1" "1" ...
$ ST225C01HA: chr "1" "0" "1" "0" ...
$ ST225C02HA: chr "1" "1" "1" "1" ...
$ ST225C03HA: chr "1" "0" "0" "1" ...
$ ST225C04HA: chr "1" "0" "1" "0" ...
$ ST225C05HA: chr "1" "0" "1" "0" ...
$ ST225C06HA: chr "1" "1" "0" "0" ...
hist(as.numeric(student_data_clmns$ST001C01TA))
summary(as.numeric(student_data_clmns$ST225C05HA))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 1.0000 0.7133 1.0000 1.0000 95
schl_data_clmns <- school_data |>
tidyr::separate(
col = school_data,
sep = c(3, 6, 14, 16),
into = c("CNTRYID", "CNT", "CNTSCHID", "FRPL"),
remove = TRUE
) |>
data.frame()
str(schl_data_clmns)
'data.frame': 164 obs. of 4 variables:
$ CNTRYID : chr "840" "840" "840" "840" ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID: chr "84000001" "84000002" "84000004" "84000005" ...
$ FRPL : chr " 5" " 4" " 5" "99" ...
hist(as.numeric(schl_data_clmns$FRPL))
techr_data_clmns <- teacher_data |>
tidyr::separate(
col = teacher_data,
sep = c(3, 6, 14, 22, 23, 25),
into = c("CNTRYID", "CNT", "CNTSCHID", "CNTTCHID", "TC186C01HA", "TC800C001HA"),
remove = TRUE
) |>
data.frame()
str(techr_data_clmns)
'data.frame': 3526 obs. of 6 variables:
$ CNTRYID : chr "840" "840" "840" "840" ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : chr "84000144" "84000011" "84000084" "84000082" ...
$ CNTTCHID : chr "84000001" "84000002" "84000003" "84000004" ...
$ TC186C01HA : chr " " "1" "1" "1" ...
$ TC800C001HA: chr " 4" " 4" " 3" " 4" ...
hist(as.numeric(techr_data_clmns$TC800C001HA))
NA Using
na_if Function in {dplyr} PackageThe NEAP used various values to refer to missingness in 2018 PISA dataset. There are at least two patterns:
valid skip, 7 as
not applicable, 8 for invalid, 9 for
no response, & Blank.valid skip, 97 for
not applicable, 98 for invalid, 99 for
no response, and Blank.So, I need to create two different indexes that I will pass through the columns and change the noted values to NAs.
# Creating indexes to pass through columns
index_95 <- c(95, 97, 98, 99)
index_5 <- c(5, 7, 8, 9)
Note: Most of the columns are stored as
character class, if I change them to
numeric all blank rows in corresponding columns will be
marked to be NA.
The school data has only one column with the missing values, i.e.,
FRPL. The missing values are denoted by 95 and so forth. Thus, I am
going to pass the whole data table through the is_95_na
function.
options(scipen = 999)
# Checking the class of FRPL Column in School Dataset
class(schl_data_clmns$FRPL)
[1] "character"
# class(schl_data_clmns)
# Changing the class type of selected columns in school dataset
schl_data_clmns$FRPL <- as.numeric(schl_data_clmns$FRPL)
schl_data_clmns$CNTRYID <- as.numeric(schl_data_clmns$CNTRYID)
schl_data_clmns$CNTSCHID <- as.numeric(schl_data_clmns$CNTSCHID)
# class(schl_data_clmns$FRPL)
# Passing `index_95` through FRPL column in `school dataset`
schl_data_clmns$FRPL <- lapply(schl_data_clmns$FRPL, function(t) replace(t, t %in% index_95, NA))
# Calculating NAs and confirming the values with given statistics
# sum(is.na(schl_data_clmns$FRPL))
schl_data_clmns$FRPL <- as.numeric(as.character(schl_data_clmns$FRPL))
summary(schl_data_clmns)
CNTRYID CNT CNTSCHID FRPL
Min. :840 Length:164 Min. :84000001 Min. :1.000
1st Qu.:840 Class :character 1st Qu.:84000044 1st Qu.:3.000
Median :840 Mode :character Median :84000086 Median :3.000
Mean :840 Mean :84000087 Mean :3.478
3rd Qu.:840 3rd Qu.:84000129 3rd Qu.:4.000
Max. :840 Max. :84000175 Max. :5.000
NA's :30
# Saving the data frame as `school_final`
school_final <- schl_data_clmns
# Requesting histogram of FRPL to compare againg the one I requested earlier
hist(school_final$FRPL)
# Changing the class type of selected columns in school dataset
techr_data_clmns$CNTRYID <- as.numeric(techr_data_clmns$CNTRYID)
techr_data_clmns$CNTSCHID <- as.numeric(techr_data_clmns$CNTSCHID)
techr_data_clmns$CNTTCHID <- as.numeric(techr_data_clmns$CNTTCHID)
techr_data_clmns$TC186C01HA <- as.numeric(techr_data_clmns$TC186C01HA)
techr_data_clmns$TC800C001HA <- as.numeric(techr_data_clmns$TC800C001HA)
# Passing `index_5` & `index_95' through "TC186C01HA" & "TC800C001HA" columns in `teacher dataset`, respectively
techr_data_clmns$TC186C01HA <- lapply(techr_data_clmns$TC186C01HA, function(u) replace(u, u %in% index_5, NA))
techr_data_clmns$TC800C001HA <- lapply(techr_data_clmns$TC800C001HA, function(v) replace(v, v %in% index_95, NA))
# For some weired reasons the above columns are saved as list, thus, changing them to numeric
techr_data_clmns$TC186C01HA <- as.numeric(as.character(techr_data_clmns$TC186C01HA))
techr_data_clmns$TC800C001HA <- as.numeric(as.character(techr_data_clmns$TC800C001HA))
# Saving the data frame as `school_final`
teacher_final <- techr_data_clmns
# Requesting histogram of FRPL to compare againg the one I requested earlier
summary(teacher_final)
CNTRYID CNT CNTSCHID CNTTCHID
Min. :840 Length:3526 Min. :84000001 Min. :84000001
1st Qu.:840 Class :character 1st Qu.:84000047 1st Qu.:84000942
Median :840 Mode :character Median :84000088 Median :84001888
Mean :840 Mean :84000088 Mean :84001889
3rd Qu.:840 3rd Qu.:84000131 3rd Qu.:84002838
Max. :840 Max. :84000175 Max. :84003779
TC186C01HA TC800C001HA
Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:3.000
Median :1.000 Median :4.000
Mean :1.052 Mean :3.655
3rd Qu.:1.000 3rd Qu.:4.000
Max. :2.000 Max. :5.000
NA's :1742 NA's :696
## Checking Histogram for Consistency
hist(teacher_final$TC186C01HA)
hist(teacher_final$TC800C001HA)
The final calculations match the given number of missing values. Eyeballing the summary and running it through the actual code book, the data information matches, as well.
This is kept at last because it has many columns and the procedures
repeat over and over again. There are 24-columns that are
numeric but listed as character. I can either
change them one by one using 24-repeated syntax or use only a few lines
to do. I would definitely like to save my time. First I would create an
index containing the columns that need to be changed and
them pass a sapply() function to change all of them at
once.
# Indexing the columns that need to be changed
change_needed_columns <- c("CNTRYID", "CNTSCHID", "CNTSTUID", "ST001C01TA", "RACETHC", "ST005C01TA", "ST007C01TA", "ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T", "ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA", "ST225C01HA", "ST225C02HA", "ST225C03HA", "ST225C04HA", "ST225C05HA", "ST225C06HA")
# Passing the index through the dataset
student_data_clmns[change_needed_columns] <- sapply(student_data_clmns[change_needed_columns], as.numeric)
# Checking if that worked
sapply(student_data_clmns, class)
CNTRYID CNT CNTSCHID CNTSTUID ST001C01TA RACETHC
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
ST005C01TA ST007C01TA ST011C17TA ST011C18TA ST011C19TA ST019AC01T
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
ST019BC01T ST019CC01T ST127A01TA ST127C01TA ST127C02TA ST127C03TA
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
ST022C01TA ST225C01HA ST225C02HA ST225C03HA ST225C04HA ST225C05HA
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
ST225C06HA
"numeric"
Yes. That worked.We now, have 1 character coulumn and 24 numeric
columns. Now, I have to calculate the amount of missing data in the data
set. Changing the class to numeric might have helped us in marking blank
rows to NAs but there are other NAs we have to
account for. Let’s check if the blanks were changed into
NAs in a couple of columns.
colSums(is.na(student_data_clmns[change_needed_columns]))
CNTRYID CNTSCHID CNTSTUID ST001C01TA RACETHC ST005C01TA ST007C01TA
0 0 0 34 34 34 34
ST011C17TA ST011C18TA ST011C19TA ST019AC01T ST019BC01T ST019CC01T ST127A01TA
34 34 34 35 35 35 36
ST127C01TA ST127C02TA ST127C03TA ST022C01TA ST225C01HA ST225C02HA ST225C03HA
36 36 36 36 95 95 95
ST225C04HA ST225C05HA ST225C06HA
95 95 95
Tallying these missing numbers (came from BLANK spaces) to the
original files, I made sure they are accurate. Now, I have to figure out
how many of these columns have missing values listed as
5,7,8,9 or 95,97,98,99. Once I do so, I would
index them out and pass the the index_95 or
index_5 index to change those numbers into
NAs, at once. Here’s the list of columns divided based on
the missing values:
# options(tibble.width = Inf) shows all the columns in a table regardless of number of columns
tibble(tribble(
~N_5_7_8_9, ~N_95_97_98_99,
"ST011C17TA, ST011C18TA, ST011C19TA", "ST001C01TA",
"ST019AC01T, ST019BC01T, ST019CC01T", "RACETHC",
"ST127A01TA, ST127C01TA, ST127C02TA", "ST005C01TA", "ST127C03TA, ST022C01TA", "ST007C01TA"
))
# A tibble: 4 × 2
N_5_7_8_9 N_95_97_98_99
<chr> <chr>
1 ST011C17TA, ST011C18TA, ST011C19TA ST001C01TA
2 ST019AC01T, ST019BC01T, ST019CC01T RACETHC
3 ST127A01TA, ST127C01TA, ST127C02TA ST005C01TA
4 ST127C03TA, ST022C01TA ST007C01TA
Further analysis of the data files gave some ideas that there are
third types of columns which left only blank spaces to denote missing
values. As the blank spaces are already turned into NAs
there is nothing I need to do at the moment.
Next step: Now I am going to created two separate indexes of columns that based on the information provided above.
missing_5 <- c("ST011C17TA", "ST011C18TA", "ST011C19TA", "ST019AC01T", "ST019BC01T", "ST019CC01T", "ST127A01TA", "ST127C01TA", "ST127C02TA", "ST127C03TA", "ST022C01TA")
missing_95 <- c("ST001C01TA", "RACETHC", "ST005C01TA", "ST007C01TA")
After they are indexed, I am going to pass them through the
student_data and change the respective values to
NAs.
# Columns that denoted 5,7,8, & 9 as missing values
student_data_clmns[missing_5] <- sapply(student_data_clmns[missing_5], function(w) replace(w, w %in% index_5, NA))
# Columns that denoted 95,97,98, & 99 as misssing values
student_data_clmns[missing_95] <- sapply(student_data_clmns[missing_95], function(x) replace(x, x %in% index_95, NA))
# Let's check if that worked
comp_mat <- cbind(summary(student_data_clmns$ST005C01TA), summary(student_data_clmns$ST011C19TA))
colnames(comp_mat) <- c("ST005C01TA/from_95_series", "ST011C19TA/from_5_series")
comp_mat
ST005C01TA/from_95_series ST011C19TA/from_5_series
Min. 1.000000 1.000000
1st Qu. 1.000000 1.000000
Median 1.000000 1.000000
Mean 1.205355 1.371765
3rd Qu. 1.000000 2.000000
Max. 4.000000 2.000000
NA's 95.000000 85.000000
# Changing the student data to `student_final`
student_final <- student_data_clmns
Before we made these changes, both of the variables had 34 NAs in
each. However, they have been changed to 95 for
ST005C01TA and 85 to ST011C19TA.
If we compare the number of missing values, the action worked.
Great!
Like before, some of the columns might have actually changed to list
rather than numeric. Checking for their structure using
str() function helps us fingure that out.
str(student_final)
'data.frame': 4838 obs. of 25 variables:
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : num 84000112 84000024 84000061 84000034 84000062 ...
$ CNTSTUID : num 84000001 84000002 84000003 84000004 84000006 ...
$ ST001C01TA: num 4 4 4 4 4 4 4 4 4 4 ...
$ RACETHC : num 3 4 1 1 3 3 3 1 1 3 ...
$ ST005C01TA: num 1 1 1 1 1 1 1 1 1 1 ...
$ ST007C01TA: num 1 4 1 1 1 1 1 1 1 1 ...
$ ST011C17TA: num 2 2 1 2 2 1 1 2 2 2 ...
$ ST011C18TA: num 1 1 1 2 1 1 1 1 1 1 ...
$ ST011C19TA: num 1 1 2 2 1 1 1 1 1 2 ...
$ ST019AC01T: num 2 2 1 1 1 1 1 1 1 1 ...
$ ST019BC01T: num 2 2 1 1 1 1 1 1 1 1 ...
$ ST019CC01T: num 2 2 1 1 1 1 1 1 1 1 ...
$ ST127A01TA: num 1 1 1 1 1 1 1 1 1 1 ...
$ ST127C01TA: num 1 1 1 1 1 1 1 1 1 1 ...
$ ST127C02TA: num 1 1 1 1 1 1 1 1 1 1 ...
$ ST127C03TA: num 1 1 1 1 1 1 1 1 1 1 ...
$ ST022C01TA: num 2 3 1 1 1 3 1 1 1 1 ...
$ ST225C01HA: num 1 0 1 0 0 0 0 0 0 0 ...
$ ST225C02HA: num 1 1 1 1 0 0 1 1 1 1 ...
$ ST225C03HA: num 1 0 0 1 0 0 0 0 0 1 ...
$ ST225C04HA: num 1 0 1 0 0 0 0 1 0 1 ...
$ ST225C05HA: num 1 0 1 0 1 1 1 1 1 1 ...
$ ST225C06HA: num 1 1 0 0 0 0 0 1 0 0 ...
Looks like everything is intact. It saves some of my time.
Next step in this process is to merge the school, teacher, and student data together. Merging data frames is relatively easy task, as long as we have common ID variables. I was mindful of this requirement the whole time I cleaned and modeled the data. I am going to use them. However, even the subtle differences in the common column names throws an error. So, I want to check the difference in names before I proceed to merge the data files.
Waldo{} package in r is very fancy, in this respect. I
can quickly compare the names and figure out if they are same or
different, or where they differ using the compare()
function. The drawback of this function is, it can’t compare more than
two at once. Now, lets compare school_final with
teacher_final at first.
# Saving the column names as objects
teacher_name <- names(teacher_final)
school_name <- names(school_final)
student_name <- names(student_final)
# comparing school and teacher column names
waldo::compare(school_name, teacher_name)
`old`: "CNTRYID" "CNT" "CNTSCHID" "FRPL"
`new`: "CNTRYID" "CNT" "CNTSCHID" "CNTTCHID" "TC186C01HA" "TC800C001HA"
# Comparing teacher and student column names
waldo::compare(teacher_name, student_name)
old | new
[1] "CNTRYID" | "CNTRYID" [1]
[2] "CNT" | "CNT" [2]
[3] "CNTSCHID" | "CNTSCHID" [3]
[4] "CNTTCHID" - "CNTSTUID" [4]
[5] "TC186C01HA" - "ST001C01TA" [5]
[6] "TC800C001HA" - "RACETHC" [6]
- "ST005C01TA" [7]
- "ST007C01TA" [8]
- "ST011C17TA" [9]
- "ST011C18TA" [10]
... ... ... and 15 more ...
Based on the comparisons,CNTRYID, CNT, &
CNTSCHID are common among all three datasets and they are
indeed same in datasets. When CNTTCHID is unique to
teacher data, and CNTSCHID is unique ID variable to
student data. The remaining variables are supposed to be unique to
student, teacher, and school variables.
Taking sometime to think how to proceed, I realized that there is a required hierarchy in these datasets. Teacher dataset has school ID, i.e., CNTSCHID variable which is also present in school and student data sets. This is the key variable to rely on when I merge my data.
I am first going to merge school data to
teacher data and merge the combined dataset to
student data.
school_teacher_data <- merge(school_final, teacher_final, by = c("CNTRYID", "CNT", "CNTSCHID"))
str(school_teacher_data)
'data.frame': 3526 obs. of 7 variables:
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : num 84000001 84000001 84000001 84000001 84000001 ...
$ FRPL : num 5 5 5 5 5 5 5 5 5 5 ...
$ CNTTCHID : num 84003620 84000135 84001445 84001269 84003595 ...
$ TC186C01HA : num 1 1 NA NA 1 1 1 1 NA 1 ...
$ TC800C001HA: num 3 4 4 NA 4 4 4 4 3 4 ...
Looks like that worked. I now have a merged data set named
school_teacher_data, and it has 7 columns and 3, 526 rows.
Each row in this data set refer to the number of teachers who took part
in the study.
school_teacher_student_data <- merge(student_final, school_teacher_data, by = c("CNTSCHID", "CNTRYID", "CNT"))
str(school_teacher_student_data)
'data.frame': 107873 obs. of 29 variables:
$ CNTSCHID : num 84000001 84000001 84000001 84000001 84000001 ...
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSTUID : num 84007979 84007979 84007979 84007979 84007979 ...
$ ST001C01TA : num 3 3 3 3 3 3 3 3 3 3 ...
$ RACETHC : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST005C01TA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST007C01TA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST011C17TA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST011C18TA : num 2 2 2 2 2 2 2 2 2 2 ...
$ ST011C19TA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST019AC01T : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST019BC01T : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST019CC01T : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST127A01TA : num 2 2 2 2 2 2 2 2 2 2 ...
$ ST127C01TA : num NA NA NA NA NA NA NA NA NA NA ...
$ ST127C02TA : num NA NA NA NA NA NA NA NA NA NA ...
$ ST127C03TA : num NA NA NA NA NA NA NA NA NA NA ...
$ ST022C01TA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST225C01HA : num 0 0 0 0 0 0 0 0 0 0 ...
$ ST225C02HA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST225C03HA : num 0 0 0 0 0 0 0 0 0 0 ...
$ ST225C04HA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST225C05HA : num 1 1 1 1 1 1 1 1 1 1 ...
$ ST225C06HA : num 0 0 0 0 0 0 0 0 0 0 ...
$ FRPL : num 5 5 5 5 5 5 5 5 5 5 ...
$ CNTTCHID : num 84001445 84001269 84000669 84000751 84003620 ...
$ TC186C01HA : num NA NA 1 1 1 1 1 1 1 1 ...
$ TC800C001HA: num 4 NA 4 4 3 3 3 4 4 4 ...
Because the school and teacher data don’t have studentID in them, I
am given with with 107873 X 31 dimensional data set, which
is definitely wrong. Looks like I have to have two different data
sets.
Furthermore, I have been able to figure out the student data with their test scores. I have trimmed the data in SPSS, and it’s ready to be loaded. Let’s see.
library(foreign)
added_student_data <- read.spss("us_student_only.sav", to.data.frame = TRUE)
str(added_student_data)
'data.frame': 4838 obs. of 21 variables:
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID: num 84000001 84000001 84000001 84000001 84000001 ...
$ CNTSTUID: num 84000250 84000304 84000353 84000536 84001240 ...
$ GRADE : Factor w/ 5 levels "Grade 8","Grade 9",..: 4 4 2 3 3 3 3 4 4 3 ...
$ GENDER : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 1 1 2 ...
$ COB_STD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COB_MOM : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COB_DAD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 2 1 ...
$ COM_HOM : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
$ LANGN : Factor w/ 4 levels "Spanish","English",..: 2 2 2 2 2 2 2 2 2 2 ...
$ INTERNET: Factor w/ 3 levels "Yes, and I use it",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MISCED : Factor w/ 7 levels "None","ISCED 1",..: 6 5 5 3 6 5 3 7 6 7 ...
$ FISCED : Factor w/ 7 levels "None","ISCED 1",..: 5 5 7 5 7 5 5 6 5 7 ...
$ ICTHOME : Factor w/ 12 levels "0","1","2","3",..: 10 5 9 6 12 11 9 10 6 9 ...
$ LOC_INFO: num 537 428 511 432 508 ...
$ UNDERSTD: num 516 407 501 429 536 ...
$ EVAL_REF: num 518 422 506 436 562 ...
$ SINGLE : num 559 429 508 470 518 ...
$ MULTIPLE: num 556 422 503 445 537 ...
$ READ_SCR: num 544 432 504 438 536 ...
- attr(*, "variable.labels")= Named chr [1:21] "Country Identifier" "Country code 3-character" "Intl. School ID" "Intl. Student ID" ...
..- attr(*, "names")= chr [1:21] "CNTRYID" "CNT" "CNTSCHID" "CNTSTUID" ...
- attr(*, "codepage")= int 65001
Looks like the dataset has been successfully loaded in R, and the quick snapshot of the structures of the variables, I see that the variable class and information have been successfully saved. I have experienced that sometimes a single variable tend to have subcategories in different formats (e.g., some as number and other as list). Checking a summary will allow me to check for their consistency and triangulate with the original data set for fidelity.
summary(added_student_data)
CNTRYID CNT CNTSCHID CNTSTUID
Min. :840 Length:4838 Min. :84000001 Min. :84000001
1st Qu.:840 Class :character 1st Qu.:84000047 1st Qu.:84002155
Median :840 Mode :character Median :84000086 Median :84004338
Mean :840 Mean :84000087 Mean :84004300
3rd Qu.:840 3rd Qu.:84000129 3rd Qu.:84006418
Max. :840 Max. :84000175 Max. :84008626
GRADE GENDER COB_STD COB_MOM
Grade 8 : 8 Female:2376 United States:4411 United States:3524
Grade 9 : 401 Male :2462 Other country: 314 Other country:1214
Grade 10:3598 NA's : 113 NA's : 100
Grade 11: 826
Grade 12: 5
COB_DAD COM_HOM LANGN
United States:3449 Yes :4170 Spanish : 517
Other country:1254 No : 604 English :4054
NA's : 135 NA's: 64 Another language (USA): 219
Missing : 48
INTERNET MISCED
Yes, and I use it :4261 ISCED 5A, 6 :2045
Yes, but I don’t use it: 139 ISCED 3A, ISCED 4:1447
No : 182 ISCED 5B : 688
NA's : 256 ISCED 2 : 387
ISCED 1 : 104
(Other) : 74
NA's : 93
FISCED ICTHOME LOC_INFO UNDERSTD
ISCED 3A, ISCED 4:1784 10 :912 Min. :155.6 Min. :191.7
ISCED 5A, 6 :1674 9 :899 1st Qu.:425.2 1st Qu.:417.4
ISCED 5B : 535 11 :788 Median :503.5 Median :501.2
ISCED 2 : 448 8 :705 Mean :496.9 Mean :495.7
ISCED 1 : 118 7 :498 3rd Qu.:574.2 3rd Qu.:575.5
(Other) : 93 (Other):844 Max. :784.6 Max. :814.5
NA's : 186 NA's :192
EVAL_REF SINGLE MULTIPLE READ_SCR
Min. :185.4 Min. :185.4 Min. :193.5 Min. :157.3
1st Qu.:424.9 1st Qu.:420.7 1st Qu.:423.6 1st Qu.:425.9
Median :509.7 Median :502.6 Median :503.4 Median :504.8
Mean :505.0 Mean :497.3 Mean :500.0 Mean :500.6
3rd Qu.:587.2 3rd Qu.:578.0 3rd Qu.:579.3 3rd Qu.:578.4
Max. :795.4 Max. :783.4 Max. :784.5 Max. :810.5
summary(student_final)
CNTRYID CNT CNTSCHID CNTSTUID
Min. :840 Length:4838 Min. :84000001 Min. :84000001
1st Qu.:840 Class :character 1st Qu.:84000047 1st Qu.:84002155
Median :840 Mode :character Median :84000086 Median :84004338
Mean :840 Mean :84000087 Mean :84004300
3rd Qu.:840 3rd Qu.:84000129 3rd Qu.:84006418
Max. :840 Max. :84000175 Max. :84008626
ST001C01TA RACETHC ST005C01TA ST007C01TA
Min. :2.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:4.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median :4.000 Median :2.000 Median :1.000 Median :1.000
Mean :4.088 Mean :2.205 Mean :1.205 Mean :1.239
3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:1.000
Max. :6.000 Max. :6.000 Max. :4.000 Max. :4.000
NA's :34 NA's :53 NA's :95 NA's :193
ST011C17TA ST011C18TA ST011C19TA ST019AC01T
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median :2.000 Median :1.000 Median :1.000 Median :1.000
Mean :1.525 Mean :1.122 Mean :1.372 Mean :1.066
3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
NA's :115 NA's :72 NA's :85 NA's :113
ST019BC01T ST019CC01T ST127A01TA ST127C01TA ST127C02TA
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00
Median :1.000 Median :1.000 Median :1.000 Median :1.000 Median :1.00
Mean :1.256 Mean :1.267 Mean :1.049 Mean :1.087 Mean :1.03
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.00
Max. :2.000 Max. :2.000 Max. :3.000 Max. :3.000 Max. :3.00
NA's :100 NA's :135 NA's :230 NA's :160 NA's :240
ST127C03TA ST022C01TA ST225C01HA ST225C02HA
Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.000 Median :1.000 Median :0.0000 Median :1.0000
Mean :1.015 Mean :1.199 Mean :0.1056 Mean :0.7455
3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :3.000 Max. :3.000 Max. :1.0000 Max. :1.0000
NA's :294 NA's :48 NA's :95 NA's :95
ST225C03HA ST225C04HA ST225C05HA ST225C06HA
Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.000 Median :0.0000 Median :1.0000 Median :0.000
Mean :0.214 Mean :0.4213 Mean :0.7133 Mean :0.444
3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000
NA's :95 NA's :95 NA's :95 NA's :95
Turns out that the SPSS data is better shaped than the
student_fianl. I am going to trim the
student final data and merge it with the
added_student_data.
trimmed_student <- student_final |>
select(CNTRYID, CNT, CNTSCHID, CNTSTUID, RACETHC) |>
rename(ETHNICITY = RACETHC)
head(trimmed_student)
CNTRYID CNT CNTSCHID CNTSTUID ETHNICITY
1 840 USA 84000112 84000001 3
2 840 USA 84000024 84000002 4
3 840 USA 84000061 84000003 1
4 840 USA 84000034 84000004 1
5 840 USA 84000062 84000006 3
6 840 USA 84000133 84000007 3
Looks like that worked. Now, merge trimmed_student and
added_student_data together by all other 4 parameters.
student_final_updated <- merge(trimmed_student, added_student_data, by = c("CNTRYID", "CNT", "CNTSCHID", "CNTSTUID"))
summary(student_final_updated)
CNTRYID CNT CNTSCHID CNTSTUID
Min. :840 Length:4838 Min. :84000001 Min. :84000001
1st Qu.:840 Class :character 1st Qu.:84000047 1st Qu.:84002155
Median :840 Mode :character Median :84000086 Median :84004338
Mean :840 Mean :84000087 Mean :84004300
3rd Qu.:840 3rd Qu.:84000129 3rd Qu.:84006418
Max. :840 Max. :84000175 Max. :84008626
ETHNICITY GRADE GENDER COB_STD
Min. :1.000 Grade 8 : 8 Female:2376 United States:4411
1st Qu.:1.000 Grade 9 : 401 Male :2462 Other country: 314
Median :2.000 Grade 10:3598 NA's : 113
Mean :2.205 Grade 11: 826
3rd Qu.:3.000 Grade 12: 5
Max. :6.000
NA's :53
COB_MOM COB_DAD COM_HOM
United States:3524 United States:3449 Yes :4170
Other country:1214 Other country:1254 No : 604
NA's : 100 NA's : 135 NA's: 64
LANGN INTERNET
Spanish : 517 Yes, and I use it :4261
English :4054 Yes, but I don’t use it: 139
Another language (USA): 219 No : 182
Missing : 48 NA's : 256
MISCED FISCED ICTHOME
ISCED 5A, 6 :2045 ISCED 3A, ISCED 4:1784 10 :912
ISCED 3A, ISCED 4:1447 ISCED 5A, 6 :1674 9 :899
ISCED 5B : 688 ISCED 5B : 535 11 :788
ISCED 2 : 387 ISCED 2 : 448 8 :705
ISCED 1 : 104 ISCED 1 : 118 7 :498
(Other) : 74 (Other) : 93 (Other):844
NA's : 93 NA's : 186 NA's :192
LOC_INFO UNDERSTD EVAL_REF SINGLE
Min. :155.6 Min. :191.7 Min. :185.4 Min. :185.4
1st Qu.:425.2 1st Qu.:417.4 1st Qu.:424.9 1st Qu.:420.7
Median :503.5 Median :501.2 Median :509.7 Median :502.6
Mean :496.9 Mean :495.7 Mean :505.0 Mean :497.3
3rd Qu.:574.2 3rd Qu.:575.5 3rd Qu.:587.2 3rd Qu.:578.0
Max. :784.6 Max. :814.5 Max. :795.4 Max. :783.4
MULTIPLE READ_SCR
Min. :193.5 Min. :157.3
1st Qu.:423.6 1st Qu.:425.9
Median :503.4 Median :504.8
Mean :500.0 Mean :500.6
3rd Qu.:579.3 3rd Qu.:578.4
Max. :784.5 Max. :810.5
Now, all other variables have have labels, but not the ETHNITICY. I am going do label the categories now.
student_final_updated$ETHNICITY <- factor(student_final_updated$ETHNICITY,
levels = c(1:6),
labels = c("White, not Hispanic", "Black or African American", "Hispanic or Latino", "Asian", "Two or More Race", "Other")
)
summary(student_final_updated)
CNTRYID CNT CNTSCHID CNTSTUID
Min. :840 Length:4838 Min. :84000001 Min. :84000001
1st Qu.:840 Class :character 1st Qu.:84000047 1st Qu.:84002155
Median :840 Mode :character Median :84000086 Median :84004338
Mean :840 Mean :84000087 Mean :84004300
3rd Qu.:840 3rd Qu.:84000129 3rd Qu.:84006418
Max. :840 Max. :84000175 Max. :84008626
ETHNICITY GRADE GENDER
White, not Hispanic :2097 Grade 8 : 8 Female:2376
Black or African American: 769 Grade 9 : 401 Male :2462
Hispanic or Latino :1221 Grade 10:3598
Asian : 284 Grade 11: 826
Two or More Race : 367 Grade 12: 5
Other : 47
NA's : 53
COB_STD COB_MOM COB_DAD COM_HOM
United States:4411 United States:3524 United States:3449 Yes :4170
Other country: 314 Other country:1214 Other country:1254 No : 604
NA's : 113 NA's : 100 NA's : 135 NA's: 64
LANGN INTERNET
Spanish : 517 Yes, and I use it :4261
English :4054 Yes, but I don’t use it: 139
Another language (USA): 219 No : 182
Missing : 48 NA's : 256
MISCED FISCED ICTHOME
ISCED 5A, 6 :2045 ISCED 3A, ISCED 4:1784 10 :912
ISCED 3A, ISCED 4:1447 ISCED 5A, 6 :1674 9 :899
ISCED 5B : 688 ISCED 5B : 535 11 :788
ISCED 2 : 387 ISCED 2 : 448 8 :705
ISCED 1 : 104 ISCED 1 : 118 7 :498
(Other) : 74 (Other) : 93 (Other):844
NA's : 93 NA's : 186 NA's :192
LOC_INFO UNDERSTD EVAL_REF SINGLE
Min. :155.6 Min. :191.7 Min. :185.4 Min. :185.4
1st Qu.:425.2 1st Qu.:417.4 1st Qu.:424.9 1st Qu.:420.7
Median :503.5 Median :501.2 Median :509.7 Median :502.6
Mean :496.9 Mean :495.7 Mean :505.0 Mean :497.3
3rd Qu.:574.2 3rd Qu.:575.5 3rd Qu.:587.2 3rd Qu.:578.0
Max. :784.6 Max. :814.5 Max. :795.4 Max. :783.4
MULTIPLE READ_SCR
Min. :193.5 Min. :157.3
1st Qu.:423.6 1st Qu.:425.9
Median :503.4 Median :504.8
Mean :500.0 Mean :500.6
3rd Qu.:579.3 3rd Qu.:578.4
Max. :784.5 Max. :810.5
That worked but looks like we got some uninvited columns (6 to be
total). They are named VAR00001 - 00006. In addition, we
don’t need the country ID and CNT variables because they all are from
the United States.
student_final_updated <- select(student_final_updated, CNTSCHID, CNTSTUID, ETHNICITY, GRADE, GENDER, COB_STD, COB_MOM, COB_DAD, COM_HOM, LANGN, INTERNET, MISCED, FISCED, ICTHOME, LOC_INFO, UNDERSTD, EVAL_REF, SINGLE, MULTIPLE, READ_SCR)
student_final_updated$ICTHOME <- as.numeric(student_final_updated$ICTHOME)
str(student_final_updated)
'data.frame': 4838 obs. of 20 variables:
$ CNTSCHID : num 84000001 84000001 84000001 84000001 84000001 ...
$ CNTSTUID : num 84000250 84000304 84000353 84000536 84001240 ...
$ ETHNICITY: Factor w/ 6 levels "White, not Hispanic",..: 5 1 1 1 1 1 5 1 3 1 ...
$ GRADE : Factor w/ 5 levels "Grade 8","Grade 9",..: 4 4 2 3 3 3 3 4 4 3 ...
$ GENDER : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 2 1 1 1 2 ...
$ COB_STD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COB_MOM : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COB_DAD : Factor w/ 2 levels "United States",..: 1 1 1 1 1 1 1 1 2 1 ...
$ COM_HOM : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
$ LANGN : Factor w/ 4 levels "Spanish","English",..: 2 2 2 2 2 2 2 2 2 2 ...
$ INTERNET : Factor w/ 3 levels "Yes, and I use it",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MISCED : Factor w/ 7 levels "None","ISCED 1",..: 6 5 5 3 6 5 3 7 6 7 ...
$ FISCED : Factor w/ 7 levels "None","ISCED 1",..: 5 5 7 5 7 5 5 6 5 7 ...
$ ICTHOME : num 10 5 9 6 12 11 9 10 6 9 ...
$ LOC_INFO : num 537 428 511 432 508 ...
$ UNDERSTD : num 516 407 501 429 536 ...
$ EVAL_REF : num 518 422 506 436 562 ...
$ SINGLE : num 559 429 508 470 518 ...
$ MULTIPLE : num 556 422 503 445 537 ...
$ READ_SCR : num 544 432 504 438 536 ...
At last I want to save the file as .csv for further use.
# write.csv(student_final_updated, "student_final.csv")
Success!
Once again. I also have to save the school and teacher
data in the local disc. But before I do so, I want to take of a few
things.
Checking Variables
str(school_teacher_data)
'data.frame': 3526 obs. of 7 variables:
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : num 84000001 84000001 84000001 84000001 84000001 ...
$ FRPL : num 5 5 5 5 5 5 5 5 5 5 ...
$ CNTTCHID : num 84003620 84000135 84001445 84001269 84003595 ...
$ TC186C01HA : num 1 1 NA NA 1 1 1 1 NA 1 ...
$ TC800C001HA: num 3 4 4 NA 4 4 4 4 3 4 ...
Renaming Variables
First, I want to rename the TC186C01HA to country of
birth for teacher aka COB_TECH and TC800C001HA
to highest education aka HIGH_EDU.
school_teacher_data <- school_teacher_data |>
rename(
COB_TECH = TC186C01HA,
HIGH_EDU = TC800C001HA
)
names(school_teacher_data)
[1] "CNTRYID" "CNT" "CNTSCHID" "FRPL" "CNTTCHID" "COB_TECH" "HIGH_EDU"
Then, I can get rid of country ID and County variables.
Removing Variables
school_teacher_data <- select(school_teacher_data, CNTSCHID, CNTTCHID, FRPL, COB_TECH, HIGH_EDU)
summary(school_teacher_data)
CNTSCHID CNTTCHID FRPL COB_TECH
Min. :84000001 Min. :84000001 Min. :1.000 Min. :1.000
1st Qu.:84000047 1st Qu.:84000942 1st Qu.:3.000 1st Qu.:1.000
Median :84000088 Median :84001888 Median :3.000 Median :1.000
Mean :84000088 Mean :84001889 Mean :3.435 Mean :1.052
3rd Qu.:84000131 3rd Qu.:84002838 3rd Qu.:4.000 3rd Qu.:1.000
Max. :84000175 Max. :84003779 Max. :5.000 Max. :2.000
NA's :539 NA's :1742
HIGH_EDU
Min. :1.000
1st Qu.:3.000
Median :4.000
Mean :3.655
3rd Qu.:4.000
Max. :5.000
NA's :696
Changing the label of the variables
school_teacher_data$FRPL <- factor(school_teacher_data$FRPL,
levels = c(1:5),
labels = c("Less than 10 percent", "10 to 24.9 percent", "25 to 49.9 percent", "50 to 74.9 percent", "75 percent or more")
)
school_teacher_data$COB_TECH <- factor(school_teacher_data$COB_TECH,
levels = c(1:2),
labels = c("United States", "Other Country")
)
school_teacher_data$HIGH_EDU <- factor(school_teacher_data$HIGH_EDU,
levels = c(1:5),
labels = c("High school and/or some college courses", "Associate's degree", "Bachelor's degree", "Master's degree", "Doctoral or professional degree")
)
summary(school_teacher_data)
CNTSCHID CNTTCHID FRPL
Min. :84000001 Min. :84000001 Less than 10 percent: 119
1st Qu.:84000047 1st Qu.:84000942 10 to 24.9 percent : 415
Median :84000088 Median :84001888 25 to 49.9 percent :1072
Mean :84000088 Mean :84001889 50 to 74.9 percent : 810
3rd Qu.:84000131 3rd Qu.:84002838 75 percent or more : 571
Max. :84000175 Max. :84003779 NA's : 539
COB_TECH HIGH_EDU
United States:1691 High school and/or some college courses: 11
Other Country: 93 Associate's degree : 12
NA's :1742 Bachelor's degree :1013
Master's degree :1700
Doctoral or professional degree : 94
NA's : 696
Everything worked!!
However, our proposal contains a lot of school and teacher variables.
The data sets we have worked on so far provided us almost all student
level variables, but not as many in terms of teachers and students. I
have .SPSS files for both school and teacher data sets. Now, I am going
to load student and teacher data sets and merge them with the
school_teacher_data.
added_teacher_data <- read.spss("us_teacher_only.sav", to.data.frame = TRUE)
str(added_teacher_data)
'data.frame': 3526 obs. of 14 variables:
$ CNTRYID : num 840 840 840 840 840 840 840 840 840 840 ...
$ CNT : chr "USA" "USA" "USA" "USA" ...
$ CNTSCHID : num 84000144 84000011 84000084 84000082 84000161 ...
$ CNTTCHID : num 84000001 84000002 84000003 84000004 84000005 ...
$ TC_SEX : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 NA 2 2 1 ...
$ TC_AGE : Factor w/ 51 levels "20","21","22",..: 27 39 13 15 3 19 NA 5 22 7 ...
$ TC_STATUS : Factor w/ 4 levels "Full-time (more than 90% of full-time hours)",..: 1 4 1 1 1 1 NA 1 1 1 ...
$ TC_EXPERIENCE : Factor w/ 50 levels "0","1","2","3",..: 23 37 5 14 3 12 NA 2 18 6 ...
$ TC_TRAINING : Factor w/ 3 levels "Yes, a programme of 1 year or less",..: 2 2 1 3 3 2 NA 1 2 2 ...
$ TC_QUALIFICATION: Factor w/ 5 levels "I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .",..: 1 1 2 1 1 1 NA 3 1 1 ...
$ TC_PREPREADLIT : Factor w/ 32 levels "0","2","5","6",..: 8 NA NA NA 17 NA NA 24 8 NA ...
$ TC_PREPREADPED : Factor w/ 29 levels "0","4","5","7",..: 19 NA NA NA 11 NA NA 10 13 NA ...
$ TC_PREPGENPED : Factor w/ 32 levels "0","5","6","8",..: 18 NA NA NA 18 NA NA 9 21 NA ...
$ EMPLTIM : Factor w/ 2 levels "full time","part time": 1 2 1 1 1 1 NA 1 1 1 ...
- attr(*, "variable.labels")= Named chr [1:14] "Country Identifier" "Country code 3-character" "Intl. School ID" "Intl. Teacher ID" ...
..- attr(*, "names")= chr [1:14] "CNTRYID" "CNT" "CNTSCHID" "CNTTCHID" ...
- attr(*, "codepage")= int 65001
added_school_data <- read.spss("us_school_only.sav", to.data.frame = TRUE)
str(added_school_data)
'data.frame': 164 obs. of 18 variables:
$ CNTRYID : Factor w/ 82 levels "Albania","Baku (Azerbaijan)",..: 77 77 77 77 77 77 77 77 77 77 ...
$ CNT : Factor w/ 82 levels "Albania","United Arab Emirates",..: 81 81 81 81 81 81 81 81 81 81 ...
$ CNTSCHID: num 84000001 84000002 84000004 84000005 84000006 ...
$ SCH_LOCA: Factor w/ 5 levels "A village, hamlet or rural area (fewer than 3 000 people)",..: 3 4 4 4 4 4 4 3 NA 3 ...
$ SCH_TYPE: Factor w/ 2 levels "A public school (Managed by a public education authority, government agency, or governing board)",..: 1 1 1 1 2 1 1 1 NA 1 ...
$ TOT_BOYS: Factor w/ 136 levels "0","20","23",..: 6 73 62 103 14 3 131 96 NA 133 ...
$ TOT_GIRL: Factor w/ 137 levels "0","2","27","30",..: 9 71 64 98 13 3 133 100 NA 134 ...
$ TOT_STDS: num 100 1415 1235 2009 180 ...
$ SCH_ELS : Factor w/ 43 levels "0","1","2","3",..: 6 11 11 NA 14 5 2 4 NA 14 ...
$ SCH_DISA: Factor w/ 30 levels "0","1","2","3",..: 28 23 11 NA NA 11 10 14 NA 11 ...
$ SCH_FRPL: Factor w/ 70 levels "1","2","3","4",..: 59 52 56 NA NA 26 10 19 NA 22 ...
$ EXT_CLAS: Factor w/ 2 levels "Yes","No": 1 2 2 NA 2 2 2 2 NA 2 ...
$ STRATIO : Factor w/ 137 levels "1.6667","4","4.2326",..: 2 54 52 NA 8 35 NA 127 NA 104 ...
$ PROATCE : Factor w/ 48 levels "0","0.007","0.0667",..: 48 38 43 NA NA 48 NA 48 NA 48 ...
$ PROAT5AB: Factor w/ 59 levels "0.0116","0.0141",..: 17 37 59 NA NA 59 NA 59 NA 11 ...
$ PROAT5AM: Factor w/ 116 levels "0.0167","0.0395",..: 72 23 82 NA 9 91 NA 112 NA 24 ...
$ PROAT6 : Factor w/ 80 levels "0","0.0043","0.0065",..: 1 34 79 NA 1 1 NA 12 NA 71 ...
$ CLSIZE : Factor w/ 9 levels "15 students or fewer",..: 3 2 4 5 2 1 4 5 NA 1 ...
- attr(*, "variable.labels")= Named chr [1:18] "Country Identifier" "Country code 3-character" "Intl. School ID" "Which of the following definitions best describes the community in which your school is located?" ...
..- attr(*, "names")= chr [1:18] "CNTRYID" "CNT" "CNTSCHID" "SCH_LOCA" ...
- attr(*, "codepage")= int 65001
Once they are uploaded, I am going to combine them together before I
combine them with the school_teacher_data. I have noted the
structure of the common variables and they match.
added_school_teacher <- merge(added_school_data, added_teacher_data, by = c("CNTSCHID"))
tail((added_school_teacher), 15)
CNTSCHID CNTRYID.x CNT.x
3512 84000175 United States United States
3513 84000175 United States United States
3514 84000175 United States United States
3515 84000175 United States United States
3516 84000175 United States United States
3517 84000175 United States United States
3518 84000175 United States United States
3519 84000175 United States United States
3520 84000175 United States United States
3521 84000175 United States United States
3522 84000175 United States United States
3523 84000175 United States United States
3524 84000175 United States United States
3525 84000175 United States United States
3526 84000175 United States United States
SCH_LOCA
3512 A small town (3 000 to about 15 000 people)
3513 A small town (3 000 to about 15 000 people)
3514 A small town (3 000 to about 15 000 people)
3515 A small town (3 000 to about 15 000 people)
3516 A small town (3 000 to about 15 000 people)
3517 A small town (3 000 to about 15 000 people)
3518 A small town (3 000 to about 15 000 people)
3519 A small town (3 000 to about 15 000 people)
3520 A small town (3 000 to about 15 000 people)
3521 A small town (3 000 to about 15 000 people)
3522 A small town (3 000 to about 15 000 people)
3523 A small town (3 000 to about 15 000 people)
3524 A small town (3 000 to about 15 000 people)
3525 A small town (3 000 to about 15 000 people)
3526 A small town (3 000 to about 15 000 people)
SCH_TYPE
3512 A public school (Managed by a public education authority, government agency, or governing board)
3513 A public school (Managed by a public education authority, government agency, or governing board)
3514 A public school (Managed by a public education authority, government agency, or governing board)
3515 A public school (Managed by a public education authority, government agency, or governing board)
3516 A public school (Managed by a public education authority, government agency, or governing board)
3517 A public school (Managed by a public education authority, government agency, or governing board)
3518 A public school (Managed by a public education authority, government agency, or governing board)
3519 A public school (Managed by a public education authority, government agency, or governing board)
3520 A public school (Managed by a public education authority, government agency, or governing board)
3521 A public school (Managed by a public education authority, government agency, or governing board)
3522 A public school (Managed by a public education authority, government agency, or governing board)
3523 A public school (Managed by a public education authority, government agency, or governing board)
3524 A public school (Managed by a public education authority, government agency, or governing board)
3525 A public school (Managed by a public education authority, government agency, or governing board)
3526 A public school (Managed by a public education authority, government agency, or governing board)
TOT_BOYS TOT_GIRL TOT_STDS SCH_ELS SCH_DISA SCH_FRPL EXT_CLAS STRATIO
3512 200 187 387 2 25 45 No 10.75
3513 200 187 387 2 25 45 No 10.75
3514 200 187 387 2 25 45 No 10.75
3515 200 187 387 2 25 45 No 10.75
3516 200 187 387 2 25 45 No 10.75
3517 200 187 387 2 25 45 No 10.75
3518 200 187 387 2 25 45 No 10.75
3519 200 187 387 2 25 45 No 10.75
3520 200 187 387 2 25 45 No 10.75
3521 200 187 387 2 25 45 No 10.75
3522 200 187 387 2 25 45 No 10.75
3523 200 187 387 2 25 45 No 10.75
3524 200 187 387 2 25 45 No 10.75
3525 200 187 387 2 25 45 No 10.75
3526 200 187 387 2 25 45 No 10.75
PROATCE PROAT5AB PROAT5AM PROAT6 CLSIZE CNTRYID.y CNT.y CNTTCHID
3512 1 1 0.4167 0 16-20 students 840 USA 84003230
3513 1 1 0.4167 0 16-20 students 840 USA 84001024
3514 1 1 0.4167 0 16-20 students 840 USA 84000574
3515 1 1 0.4167 0 16-20 students 840 USA 84000758
3516 1 1 0.4167 0 16-20 students 840 USA 84001229
3517 1 1 0.4167 0 16-20 students 840 USA 84001582
3518 1 1 0.4167 0 16-20 students 840 USA 84003451
3519 1 1 0.4167 0 16-20 students 840 USA 84000161
3520 1 1 0.4167 0 16-20 students 840 USA 84001905
3521 1 1 0.4167 0 16-20 students 840 USA 84001123
3522 1 1 0.4167 0 16-20 students 840 USA 84000374
3523 1 1 0.4167 0 16-20 students 840 USA 84002142
3524 1 1 0.4167 0 16-20 students 840 USA 84000325
3525 1 1 0.4167 0 16-20 students 840 USA 84003520
3526 1 1 0.4167 0 16-20 students 840 USA 84003650
TC_SEX TC_AGE TC_STATUS TC_EXPERIENCE
3512 Female 26 Full-time (more than 90% of full-time hours) 1
3513 Male 37 Full-time (more than 90% of full-time hours) 15
3514 Male 46 Full-time (more than 90% of full-time hours) 18
3515 Male 40 Full-time (more than 90% of full-time hours) 16
3516 <NA> <NA> <NA> <NA>
3517 <NA> <NA> <NA> <NA>
3518 Female 31 Full-time (more than 90% of full-time hours) 3
3519 Female 47 Full-time (more than 90% of full-time hours) 20
3520 Male 28 Full-time (more than 90% of full-time hours) 5
3521 <NA> <NA> <NA> <NA>
3522 Female 60 Full-time (more than 90% of full-time hours) 24
3523 <NA> <NA> <NA> <NA>
3524 Female 33 Full-time (more than 90% of full-time hours) 11
3525 Female 43 Full-time (more than 90% of full-time hours) 12
3526 <NA> 61 Full-time (more than 90% of full-time hours) 39
TC_TRAINING
3512 Yes, a programme longer than 1 year
3513 Yes, a programme longer than 1 year
3514 Yes, a programme longer than 1 year
3515 Yes, a programme longer than 1 year
3516 <NA>
3517 <NA>
3518 Yes, a programme longer than 1 year
3519 Yes, a programme longer than 1 year
3520 No
3521 <NA>
3522 Yes, a programme longer than 1 year
3523 <NA>
3524 Yes, a programme of 1 year or less
3525 Yes, a programme longer than 1 year
3526 <NA>
TC_QUALIFICATION
3512 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3513 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3514 I attended a work-based teacher education or training programme.
3515 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3516 <NA>
3517 <NA>
3518 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3519 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3520 I attended a work-based teacher education or training programme.
3521 <NA>
3522 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3523 <NA>
3524 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3525 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3526 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED EMPLTIM
3512 <NA> <NA> <NA> full time
3513 80 10 10 full time
3514 <NA> <NA> <NA> full time
3515 10 10 80 full time
3516 <NA> <NA> <NA> <NA>
3517 <NA> <NA> <NA> <NA>
3518 <NA> <NA> <NA> full time
3519 <NA> <NA> <NA> full time
3520 5 5 90 full time
3521 <NA> <NA> <NA> <NA>
3522 40 40 20 full time
3523 <NA> <NA> <NA> <NA>
3524 40 20 40 full time
3525 <NA> <NA> <NA> full time
3526 <NA> <NA> <NA> full time
dim(added_school_teacher)
[1] 3526 31
Looks like we got what we wanted. I have seen some repeated columns or columns that we don’t require in this analysis. I am going to get rid of these variables, now. I will arrange them in order and keep only the ones that I need.
added_school_teacher_updated <- select(added_school_teacher, CNTSCHID, CNTTCHID, TC_SEX, TC_AGE, TC_STATUS, TC_EXPERIENCE, TC_TRAINING, TC_QUALIFICATION, TC_PREPREADLIT, TC_PREPREADPED, TC_PREPGENPED, EMPLTIM, SCH_LOCA, SCH_TYPE, TOT_BOYS, TOT_GIRL, TOT_STDS, SCH_ELS, SCH_DISA, SCH_FRPL, EXT_CLAS, STRATIO, PROATCE, PROAT5AB, PROAT5AM, PROAT6, CLSIZE)
names(added_school_teacher_updated)
[1] "CNTSCHID" "CNTTCHID" "TC_SEX" "TC_AGE"
[5] "TC_STATUS" "TC_EXPERIENCE" "TC_TRAINING" "TC_QUALIFICATION"
[9] "TC_PREPREADLIT" "TC_PREPREADPED" "TC_PREPGENPED" "EMPLTIM"
[13] "SCH_LOCA" "SCH_TYPE" "TOT_BOYS" "TOT_GIRL"
[17] "TOT_STDS" "SCH_ELS" "SCH_DISA" "SCH_FRPL"
[21] "EXT_CLAS" "STRATIO" "PROATCE" "PROAT5AB"
[25] "PROAT5AM" "PROAT6" "CLSIZE"
dim(added_school_teacher_updated)
[1] 3526 27
We made the changes and our data retain total rows and all columns
except the once we omitted. Now, I am going combine this data set to the
school_teacher_data, the common merging variables will be
CNTSCHID & CNTTCHID.
school_teacher_data_all_merged <- merge(school_teacher_data, added_school_teacher_updated, by = c("CNTSCHID", "CNTTCHID"))
school_teacher_final <- select(school_teacher_data_all_merged, CNTSCHID, CNTTCHID, TC_SEX, TC_AGE, COB_TECH, HIGH_EDU, TC_STATUS, TC_EXPERIENCE, TC_TRAINING, TC_QUALIFICATION, TC_PREPREADLIT, TC_PREPREADPED, TC_PREPGENPED, EMPLTIM, SCH_LOCA, SCH_TYPE, TOT_BOYS, TOT_GIRL, TOT_STDS, SCH_ELS, SCH_DISA, SCH_FRPL, FRPL, EXT_CLAS, STRATIO, PROATCE, PROAT5AB, PROAT5AM, PROAT6, CLSIZE) |>
rename(
TC_COB = COB_TECH,
TC_EDU = HIGH_EDU
)
head(school_teacher_final)
CNTSCHID CNTTCHID TC_SEX TC_AGE TC_COB TC_EDU
1 84000001 84000135 Female 45 United States Master's degree
2 84000001 84000471 Female 50 United States Bachelor's degree
3 84000001 84000593 Female 49 United States Master's degree
4 84000001 84000669 Female 48 United States Master's degree
5 84000001 84000751 Female 28 United States Master's degree
6 84000001 84001269 <NA> <NA> <NA> <NA>
TC_STATUS TC_EXPERIENCE
1 Full-time (more than 90% of full-time hours) 16
2 Full-time (more than 90% of full-time hours) 26
3 Full-time (more than 90% of full-time hours) 20
4 Full-time (more than 90% of full-time hours) 25
5 Full-time (more than 90% of full-time hours) <NA>
6 <NA> <NA>
TC_TRAINING
1 Yes, a programme longer than 1 year
2 Yes, a programme longer than 1 year
3 Yes, a programme longer than 1 year
4 Yes, a programme of 1 year or less
5 Yes, a programme of 1 year or less
6 <NA>
TC_QUALIFICATION
1 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
2 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
3 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
4 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
5 I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .
6 <NA>
TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED EMPLTIM
1 <NA> <NA> <NA> full time
2 <NA> <NA> <NA> full time
3 <NA> <NA> <NA> full time
4 <NA> <NA> <NA> full time
5 <NA> <NA> <NA> full time
6 <NA> <NA> <NA> <NA>
SCH_LOCA
1 A town (15 000 to about 100 000 people)
2 A town (15 000 to about 100 000 people)
3 A town (15 000 to about 100 000 people)
4 A town (15 000 to about 100 000 people)
5 A town (15 000 to about 100 000 people)
6 A town (15 000 to about 100 000 people)
SCH_TYPE
1 A public school (Managed by a public education authority, government agency, or governing board)
2 A public school (Managed by a public education authority, government agency, or governing board)
3 A public school (Managed by a public education authority, government agency, or governing board)
4 A public school (Managed by a public education authority, government agency, or governing board)
5 A public school (Managed by a public education authority, government agency, or governing board)
6 A public school (Managed by a public education authority, government agency, or governing board)
TOT_BOYS TOT_GIRL TOT_STDS SCH_ELS SCH_DISA SCH_FRPL FRPL
1 51 49 100 5 37 79 75 percent or more
2 51 49 100 5 37 79 75 percent or more
3 51 49 100 5 37 79 75 percent or more
4 51 49 100 5 37 79 75 percent or more
5 51 49 100 5 37 79 75 percent or more
6 51 49 100 5 37 79 75 percent or more
EXT_CLAS STRATIO PROATCE PROAT5AB PROAT5AM PROAT6 CLSIZE
1 Yes 4 1 0.44 0.56 0 21-25 students
2 Yes 4 1 0.44 0.56 0 21-25 students
3 Yes 4 1 0.44 0.56 0 21-25 students
4 Yes 4 1 0.44 0.56 0 21-25 students
5 Yes 4 1 0.44 0.56 0 21-25 students
6 Yes 4 1 0.44 0.56 0 21-25 students
dim(school_teacher_final)
[1] 3526 30
summary(school_teacher_final)
CNTSCHID CNTTCHID TC_SEX TC_AGE
Min. :84000001 Min. :84000001 Female:1784 40 : 108
1st Qu.:84000047 1st Qu.:84000942 Male :1019 39 : 97
Median :84000088 Median :84001888 NA's : 723 41 : 94
Mean :84000088 Mean :84001889 48 : 91
3rd Qu.:84000131 3rd Qu.:84002838 36 : 89
Max. :84000175 Max. :84003779 (Other):2347
NA's : 700
TC_COB TC_EDU
United States:1691 High school and/or some college courses: 11
Other Country: 93 Associate's degree : 12
NA's :1742 Bachelor's degree :1013
Master's degree :1700
Doctoral or professional degree : 94
NA's : 696
TC_STATUS TC_EXPERIENCE
Full-time (more than 90% of full-time hours):2773 15 : 134
Part-time (71-90% of full-time hours) : 22 5 : 123
Part-time (50-70% of full-time hours) : 21 8 : 117
Part-time (less than 50% of full-time hours): 17 6 : 114
NA's : 693 4 : 111
(Other):2186
NA's : 741
TC_TRAINING
Yes, a programme of 1 year or less : 577
Yes, a programme longer than 1 year:2022
No : 229
NA's : 698
TC_QUALIFICATION
I attended a standard teacher training programme at an educational institute eligible to educate or train teachers .:2303
I attended an in-service teacher education or training programme. : 113
I attended a work-based teacher education or training programme. : 241
I attended training in another pedagogical profession. : 31
Other : 142
NA's : 696
TC_PREPREADLIT TC_PREPREADPED TC_PREPGENPED EMPLTIM
50 : 143 20 : 203 20 : 168 full time:2773
20 : 134 25 : 144 50 : 108 part time: 60
40 : 100 30 : 144 25 : 103 NA's : 693
30 : 97 40 : 114 30 : 98
10 : 84 10 : 103 40 : 98
(Other): 444 (Other): 295 (Other): 448
NA's :2524 NA's :2523 NA's :2503
SCH_LOCA
A village, hamlet or rural area (fewer than 3 000 people): 187
A small town (3 000 to about 15 000 people) : 587
A town (15 000 to about 100 000 people) :1192
A city (100 000 to about 1 000 000 people) : 845
A large city (with over 1 000 000 people) : 445
NA's : 270
SCH_TYPE
A public school (Managed by a public education authority, government agency, or governing board) :3093
A private school (Managed by a non-government org; e.g. a church, trade union, business, or other private institution.): 144
NA's : 289
TOT_BOYS TOT_GIRL TOT_STDS SCH_ELS SCH_DISA
950 : 75 1200 : 50 Min. : 22 1 : 333 10 : 389
459 : 50 1300 : 50 1st Qu.: 729 5 : 294 15 : 292
724 : 50 1338 : 50 Median :1453 12 : 206 13 : 289
1000 : 50 0 : 46 Mean :1557 2 : 184 20 : 256
1230 : 50 276 : 46 3rd Qu.:2100 6 : 168 12 : 190
(Other):2852 (Other):2864 Max. :4507 (Other):1697 (Other):1696
NA's : 399 NA's : 420 NA's :420 NA's : 644 NA's : 414
SCH_FRPL FRPL EXT_CLAS STRATIO
35 : 123 Less than 10 percent: 119 Yes : 886 9.381 : 25
30 : 121 10 to 24.9 percent : 415 No :2200 9.5176 : 25
40 : 114 25 to 49.9 percent :1072 NA's: 440 9.7447 : 25
15 : 101 50 to 74.9 percent : 810 10.3714: 25
12 : 99 75 percent or more : 571 10.4314: 25
(Other):2583 NA's : 539 (Other):2834
NA's : 385 NA's : 567
PROATCE PROAT5AB PROAT5AM PROAT6
1 :1824 1 :1453 0.5 : 145 0 : 892
0.9778 : 71 0.7143 : 66 0.2 : 71 0.04 : 75
0.9524 : 50 0.8 : 46 1 : 66 0.0286 : 50
0.9833 : 46 0.1667 : 45 0.1765 : 50 0.0625 : 50
0.9231 : 45 0.75 : 41 0.2857 : 50 0.0227 : 42
(Other): 895 (Other):1226 (Other):2502 (Other):1761
NA's : 595 NA's : 649 NA's : 642 NA's : 656
CLSIZE
26-30 students:1230
21-25 students: 923
31-35 students: 454
16-20 students: 356
36-40 students: 95
(Other) : 72
NA's : 396
summary(student_final_updated)
CNTSCHID CNTSTUID ETHNICITY
Min. :84000001 Min. :84000001 White, not Hispanic :2097
1st Qu.:84000047 1st Qu.:84002155 Black or African American: 769
Median :84000086 Median :84004338 Hispanic or Latino :1221
Mean :84000087 Mean :84004300 Asian : 284
3rd Qu.:84000129 3rd Qu.:84006418 Two or More Race : 367
Max. :84000175 Max. :84008626 Other : 47
NA's : 53
GRADE GENDER COB_STD COB_MOM
Grade 8 : 8 Female:2376 United States:4411 United States:3524
Grade 9 : 401 Male :2462 Other country: 314 Other country:1214
Grade 10:3598 NA's : 113 NA's : 100
Grade 11: 826
Grade 12: 5
COB_DAD COM_HOM LANGN
United States:3449 Yes :4170 Spanish : 517
Other country:1254 No : 604 English :4054
NA's : 135 NA's: 64 Another language (USA): 219
Missing : 48
INTERNET MISCED
Yes, and I use it :4261 ISCED 5A, 6 :2045
Yes, but I don’t use it: 139 ISCED 3A, ISCED 4:1447
No : 182 ISCED 5B : 688
NA's : 256 ISCED 2 : 387
ISCED 1 : 104
(Other) : 74
NA's : 93
FISCED ICTHOME LOC_INFO UNDERSTD
ISCED 3A, ISCED 4:1784 Min. : 1.000 Min. :155.6 Min. :191.7
ISCED 5A, 6 :1674 1st Qu.: 8.000 1st Qu.:425.2 1st Qu.:417.4
ISCED 5B : 535 Median :10.000 Median :503.5 Median :501.2
ISCED 2 : 448 Mean : 9.413 Mean :496.9 Mean :495.7
ISCED 1 : 118 3rd Qu.:11.000 3rd Qu.:574.2 3rd Qu.:575.5
(Other) : 93 Max. :12.000 Max. :784.6 Max. :814.5
NA's : 186 NA's :192
EVAL_REF SINGLE MULTIPLE READ_SCR
Min. :185.4 Min. :185.4 Min. :193.5 Min. :157.3
1st Qu.:424.9 1st Qu.:420.7 1st Qu.:423.6 1st Qu.:425.9
Median :509.7 Median :502.6 Median :503.4 Median :504.8
Mean :505.0 Mean :497.3 Mean :500.0 Mean :500.6
3rd Qu.:587.2 3rd Qu.:578.0 3rd Qu.:579.3 3rd Qu.:578.4
Max. :795.4 Max. :783.4 Max. :784.5 Max. :810.5
The compiled data set has 30 columns and 3526 rows. Each row
represent a distinct teacher data. Now, I am going to save the data as a
.csv file.
# write.csv(school_teacher_final, "school_teacher_final.csv")
Success!!