EdSurvey PISA USA

Affiliation: K-16 Literacy Center at University of Texas at Tyler

Published: February 6, 2023

Loading Required Packages

library(tidyverse) # data wrangling; also attaches ggplot2
library(haven)     # read SPSS/SAS/Stata files
library(EdSurvey)  # NCES tools for large-scale assessment data
library(Dire)      # regression with latent (plausible-value) outcomes
library(WeMix)     # weighted mixed-effects models
library(ggplot2)   # already attached via tidyverse; loaded explicitly here

Downloading the PISA 2018 Data and Subsetting the US Data
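The raw files were already cached on my machine, so the download step itself is not shown. For completeness, here is a hedged one-time sketch using EdSurvey's downloader (the root folder is a hypothetical local path of your choice):

# One-time download of the PISA 2018 international database (several GB):
# downloadPISA(root = "C:/PISA", years = 2018, database = "INT")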

Reading PISA 2018 Data and Subsetting the US Data

eds_pisa <- EdSurvey::readPISA(
  path = "C:/Users/nghimire/OneDrive - The University of Texas at Tyler/Redirected Folders/Documents/edsurvey_PISA_USA/PISA/2018",
  database = "INT", countries = "usa", cognitive = "score"
)
Found cached data for country code "usa"
dim(eds_pisa)
[1] 4838 5045
# eds_pisa$w_fstuwt

It’s a massive data set; I am surprised by the sheer number of columns! Printing all 5,045 column names doesn’t make sense, so let’s look at just the first and last 50 and see what they are.

hd <- head(colnames(eds_pisa), 50)
ht <- tail(colnames(eds_pisa), 50)
cbind(hd, ht)
      [,1]           [,2]              
 [1,] "ROWID"        "sc003q01ta"      
 [2,] "cntryid"      "sc053q01ta"      
 [3,] "cnt"          "sc053q02ta"      
 [4,] "cntschid"     "sc053q03ta"      
 [5,] "cntstuid"     "sc053q04ta"      
 [6,] "cyc"          "sc053q12ia"      
 [7,] "natcen"       "sc053q13ia"      
 [8,] "stratum"      "sc053q09ta"      
 [9,] "subnatio"     "sc053q10ta"      
[10,] "oecd"         "sc053q14ia"      
[11,] "adminmode"    "sc053q15ia"      
[12,] "langtest_qqq" "sc053q16ia"      
[13,] "langtest_cog" "sc053d11ta"      
[14,] "langtest_paq" "sc150q01ia"      
[15,] "bookid"       "sc150q02ia"      
[16,] "st001d01t"    "sc150q03ia"      
[17,] "st003d02t"    "sc150q04ia"      
[18,] "st003d03t"    "sc150q05ia"      
[19,] "st004d01t"    "sc164q01ha"      
[20,] "st005q01ta"   "sc064q01ta"      
[21,] "st006q01ta"   "sc064q02ta"      
[22,] "st006q02ta"   "sc064q03ta"      
[23,] "st006q03ta"   "sc064q04na"      
[24,] "st006q04ta"   "sc152q01ha"      
[25,] "st007q01ta"   "sc160q01wa"      
[26,] "st008q01ta"   "sc052q01na"      
[27,] "st008q02ta"   "sc052q02na"      
[28,] "st008q03ta"   "sc052q03ha"      
[29,] "st008q04ta"   "privatesch"      
[30,] "st011q01ta"   "schltype"        
[31,] "st011q02ta"   "stratio"         
[32,] "st011q03ta"   "schsize"         
[33,] "st011q04ta"   "ratcmp1"         
[34,] "st011q05ta"   "ratcmp2"         
[35,] "st011q06ta"   "totat"           
[36,] "st011q07ta"   "proatce"         
[37,] "st011q08ta"   "proat5ab"        
[38,] "st011q09ta"   "proat5am"        
[39,] "st011q10ta"   "proat6"          
[40,] "st011q11ta"   "clsize"          
[41,] "st011q12ta"   "creactiv"        
[42,] "st011q16na"   "edushort"        
[43,] "st011d17ta"   "staffshort"      
[44,] "st011d18ta"   "stubeha"         
[45,] "st011d19ta"   "teachbeha"       
[46,] "st012q01ta"   "scmceg"          
[47,] "st012q02ta"   "w_schgrnrabwt"   
[48,] "st012q03ta"   "w_fstuwt_sch_sum"
[49,] "st012q05na"   "senwt.sch_qqq"   
[50,] "st012q06na"   "ver_dat.sch_qqq" 

Amazing! I recognize a few of these, but most of the variables are unfamiliar to me.

Understanding the Data

Let’s check the structure and labels of some of the variables. First of all, I want to check how many plausible values there are for each cognitive test area (e.g., math, science, and reading).

# showWeights(eds_pisa, verbose = TRUE)
showPlausibleValues(eds_pisa)
There are 9 subject scale(s) or subscale(s) in this
  edsurvey.data.frame:
'math' subject scale or subscale with 10 plausible values.

'read' subject scale or subscale with 10 plausible values (the
  default).

'scie' subject scale or subscale with 10 plausible values.

'glcm' subject scale or subscale with 10 plausible values.

'rcli' subject scale or subscale with 10 plausible values.

'rcun' subject scale or subscale with 10 plausible values.

'rcer' subject scale or subscale with 10 plausible values.

'rtsn' subject scale or subscale with 10 plausible values.

'rtml' subject scale or subscale with 10 plausible values.

Based on this information, the data set contains ten plausible values for reading, and I want to learn a little more about them by comparing their summaries (a quick sketch of how to pull them follows).
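A minimal sketch, assuming EdSurvey expands the subject scale "read" into the variables pv1read through pv10read (as the output later in this post confirms):

# Pull the ten reading plausible values and compare their unweighted
# summaries column by column.
pv_read <- EdSurvey::getData(eds_pisa, varnames = "read", omittedLevels = FALSE)
sapply(pv_read, summary)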

Looks like they are fairly homogeneous, though their values differ slightly. The OECD report gives the average reading score for US test takers as 505. The median (not the mean) of pv2read matches that value, but all of the means are smaller. Averaging the means and the medians gives:

\[
\text{Mean of means} = \frac{500.2 + 500.8 + 500.3 + 501.1 + 500.5 + 501.0 + 499.9 + 500.3 + 501.1 + 500.6}{10} = 500.58
\]

\[
\text{Mean of medians} = \frac{503.6 + 505.2 + 503.8 + 503.1 + 504.5 + 504.4 + 502.6 + 504.0 + 503.9 + 503.6}{10} = 503.87
\]

\[
\text{Median of medians: } 502.6, 503.1, 503.6, 503.6, 503.8, 503.9, 504.0, 504.4, 504.5, 505.2 \Rightarrow \frac{503.8 + 503.9}{2} = 503.85
\]

Based on these statistics, the median of medians and the mean of medians come much closer to the reported score, whereas the mean of means is nowhere close. Thus, any analysis that uses either a single plausible value or a simple average of all plausible values as the dependent variable could lead to wrong findings. Let’s come back to this point after I check some other variables.
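The arithmetic can be verified directly (the values below are copied from the unweighted summaries shown later in this post, rounded to one decimal):

pv_means   <- c(500.2, 500.8, 500.3, 501.1, 500.5, 501.0, 499.9, 500.3, 501.1, 500.6)
pv_medians <- c(503.6, 505.2, 503.8, 503.1, 504.5, 504.4, 502.6, 504.0, 503.9, 503.6)
mean(pv_means)     # 500.58
mean(pv_medians)   # 503.87
median(pv_medians) # 503.85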

n_distinct(eds_pisa$ratcmp1)
[1] 93
levelsSDF(varnames = "ratcmp1", data = eds_pisa)
Levels for Variable 'ratcmp1' (Lowest level first):
    995. VALID SKIP* (n = 0)
    997. NOT APPLICABLE* (n = 0)
    998. INVALID* (n = 0)
    999. NO RESPONSE* (n = 0)
    NOTE: * indicates an omitted level.

I was not sure what the ratcmp1 variable was or whether it had any levels. I requested the codebook using showCodebook(eds_pisa) and found that this variable, the index of availability of computers (RATCMP1), is the ratio of computers available to 15-year-olds for educational purposes to the total number of students in the modal grade for 15-year-olds.
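A quicker lookup than printing the whole codebook, as a hedged sketch: EdSurvey’s searchSDF() returns the matching variable names and labels.

searchSDF(string = "ratcmp1", data = eds_pisa)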

As seen above, there are 93 distinct values of the computer index among the US sample, and the variable has no value labels; it should be a fairly straightforward continuous variable with some missing values. However, I don’t need to dig further, because I am going to drop this variable; it has nothing to do with my current analysis, which focuses on teacher-related measures.

# getStratumVar(data = eds_pisa, weightVar = "origwt")
# summary2(eds_pisa, "composite")
# summary2(eds_pisa, "composite", weightVar = "NULL")
showCutPoints(data = eds_pisa)
Achievement Levels:
  Mathematics:  357.77, 420.07, 482.38, 544.68, 606.99, 669.3
  Reading:  189.33, 262.04, 334.75, 407.47, 480.18, 552.89, 625.61, 698.32
  Science:  260.54, 334.94, 409.54, 484.14, 558.73, 633.33, 707.93

During our hands-on training at the NAEP Winter Data Workshop 2023, I learned that NAEP uses “origwt” as the weight variable and “composite” as the composite of all plausible math scores. I tried both expressions, but neither appears to apply here; those are NAEP conventions, not PISA’s.

Coming back to the misleading mean scores I discussed earlier: I know that NAEP recommends using weighted rather than unweighted samples in analyses. Here is further discussion of why weighting matters:

Why Weight the Samples?

  • The weights account for the fraction of the population represented by each stratum and reflect the probability that an element of the stratum is selected into the sample. One can show that the weighted sample mean is a good estimator, in the statistical sense, of the population mean when the sampling follows a stratified design (p. 300).
  • The unweighted sample size is simply the size of the sample actually selected. The weighted sample size is the size of the population represented by the sample, which is already known or can be calculated from the weights (p. 301).
  • Stratification is often used when the population contains groups that differ on the variable of interest, such as students from different countries, states, and school districts in the PISA assessment. In such cases, we are usually interested in some inference (e.g., mean, proportion, total, ratio) about each stratum (e.g., students of different ethnic backgrounds). Weighting comes into play when combining the inferences from the strata into an inference about the entire population; for example, we want to know how the 4,838 sampled US 15-year-olds did on the 2018 PISA reading assessment, compare them by ethnicity, and then make an inference about all US 15-year-olds in 2018, close to ~12,506,174 (Ciol et al., 2006, p. 301). A toy numeric illustration follows this list.
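Here is that toy illustration, with hypothetical numbers (not PISA data), showing how stratum weights combine stratum means into one population estimate:

w <- c(0.7, 0.3) # hypothetical population share of each stratum
m <- c(480, 540) # hypothetical sample mean within each stratum
sum(w * m)       # weighted estimate of the population mean: 498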

Let’s come back to this point.

Subsetting the Reading Data

This file compiles all available variables across the assessments (cognitive: reading, math, science, and digital literacy; surveys: student, teacher, and school), so we can subset the data to suit our needs, for example to school, student, or teacher variables. For now, I am going to pull a student-level subset with the variables useful for this analysis.

read_data_full <- EdSurvey::getData(eds_pisa,
  varnames = c(
    "ROWID", "cntschid", "cntstuid",
    "privatesch", "schltype", "stratio", "schsize",
    "totat", "proatce", "proat5ab", "proat5am", "proat6",
    "clsize", "teachbeha", "w_schgrnrabwt", "w_fstuwt_sch_sum",
    "read", "w_fstuwt"
  ), addAttributes = TRUE, omittedLevels = FALSE
)
# names(read_data_full)
summary(read_data_full[, -(27:106)]) # drop columns 27-106 (the 80 JK replicate weights) from the printout
     ROWID         cntschid          cntstuid         privatesch       
 Min.   :   1   Min.   :8.4e+07   Min.   :84000001   Length:4838       
 1st Qu.:1210   1st Qu.:8.4e+07   1st Qu.:84002155   Class :character  
 Median :2420   Median :8.4e+07   Median :84004338   Mode  :character  
 Mean   :2420   Mean   :8.4e+07   Mean   :84004300                     
 3rd Qu.:3629   3rd Qu.:8.4e+07   3rd Qu.:84006418                     
 Max.   :4838   Max.   :8.4e+07   Max.   :84008626                     
                                                                       
                         schltype       stratio           schsize    
 PRIVATE INDEPENDENT         : 163   Min.   :  1.667   Min.   :  22  
 PRIVATE GOVERNMENT-DEPENDENT:  13   1st Qu.: 13.100   1st Qu.: 639  
 PUBLIC                      :4636   Median : 16.154   Median :1411  
 NO RESPONSE                 :  26   Mean   : 17.523   Mean   :1491  
                                     3rd Qu.: 19.000   3rd Qu.:2076  
                                     Max.   :100.000   Max.   :4507  
                                     NA's   :761       NA's   :559   
     totat           proatce          proat5ab         proat5am     
 Min.   :  1.00   Min.   :0.0000   Min.   :0.0116   Min.   :0.0167  
 1st Qu.: 46.00   1st Qu.:0.9765   1st Qu.:0.6095   1st Qu.:0.2692  
 Median : 80.50   Median :1.0000   Median :1.0000   Median :0.4727  
 Mean   : 85.83   Mean   :0.9407   Mean   :0.8006   Mean   :0.4843  
 3rd Qu.:114.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.6593  
 Max.   :280.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 NA's   :702      NA's   :802      NA's   :861      NA's   :853     
     proat6                  clsize       teachbeha       w_schgrnrabwt    
 Min.   :0.0000   26-30 STUDENTS:1660   Min.   :-2.0409   Min.   :  20.69  
 1st Qu.:0.0000   21-25 STUDENTS:1271   1st Qu.:-0.1274   1st Qu.:  42.70  
 Median :0.0154   31-35 STUDENTS: 645   Median : 0.2266   Median :  66.49  
 Mean   :0.0227   16-20 STUDENTS: 478   Mean   : 0.2720   Mean   : 120.02  
 3rd Qu.:0.0330   NO RESPONSE   : 277   3rd Qu.: 0.8952   3rd Qu.: 157.37  
 Max.   :0.2222   (Other)       : 221   Max.   : 1.9937   Max.   :1294.02  
 NA's   :886      NA's          : 286   NA's   :467                        
 w_fstuwt_sch_sum     pv1read         pv2read         pv3read     
 Min.   :  820.3   Min.   :161.3   Min.   :176.5   Min.   :132.4  
 1st Qu.:18102.3   1st Qu.:423.5   1st Qu.:424.6   1st Qu.:423.2  
 Median :21341.2   Median :503.6   Median :505.2   Median :503.8  
 Mean   :22421.2   Mean   :500.2   Mean   :500.8   Mean   :500.3  
 3rd Qu.:25895.6   3rd Qu.:578.7   3rd Qu.:578.4   3rd Qu.:577.9  
 Max.   :49343.6   Max.   :868.9   Max.   :898.5   Max.   :858.4  
                                                                  
    pv4read         pv5read         pv6read         pv7read     
 Min.   :140.3   Min.   :137.7   Min.   :128.1   Min.   :148.7  
 1st Qu.:426.1   1st Qu.:423.2   1st Qu.:424.4   1st Qu.:424.4  
 Median :503.1   Median :504.5   Median :504.4   Median :502.6  
 Mean   :501.1   Mean   :500.5   Mean   :501.0   Mean   :499.9  
 3rd Qu.:579.3   3rd Qu.:579.0   3rd Qu.:579.3   3rd Qu.:578.6  
 Max.   :834.1   Max.   :853.5   Max.   :844.8   Max.   :815.3  
                                                                
    pv8read         pv9read         pv10read        w_fstuwt     
 Min.   :170.9   Min.   :173.6   Min.   :167.8   Min.   : 262.8  
 1st Qu.:424.9   1st Qu.:426.0   1st Qu.:426.1   1st Qu.: 563.0  
 Median :504.0   Median :503.9   Median :503.6   Median : 661.7  
 Mean   :500.3   Mean   :501.1   Mean   :500.6   Mean   : 735.6  
 3rd Qu.:579.4   3rd Qu.:577.1   3rd Qu.:579.0   3rd Qu.: 854.5  
 Max.   :823.4   Max.   :818.1   Max.   :834.1   Max.   :2946.1  
                                                                 
# showCodebook(read_data_full)
# View(showCodebook(read_data_full))

The subset still contains more teacher-related variables than I need. For example, I am not going to use the teacher-behavior variable (teachbeha) or w_fstuwt_sch_sum, so I will drop them (a sketch follows).
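A hedged sketch of the drop using base-R subsetting, into a hypothetical object read_data (I keep read_data_full intact below so that the column positions 17:26 used for the plausible values stay valid):

read_data <- read_data_full[, setdiff(names(read_data_full),
                                      c("teachbeha", "w_fstuwt_sch_sum"))]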

Creating a Composite Variable Using Ten Plausible Values

psych::alpha(read_data_full[, 17:26]) # columns 17:26 hold pv1read-pv10read

Reliability analysis   
Call: psych::alpha(x = read_data_full[, 17:26])

  raw_alpha std.alpha G6(smc) average_r S/N     ase mean  sd median_r
      0.99      0.99    0.99      0.94 152 0.00014  501 105     0.94

    95% confidence boundaries 
         lower alpha upper
Feldt     0.99  0.99  0.99
Duhachek  0.99  0.99  0.99

 Reliability if an item is dropped:
         raw_alpha std.alpha G6(smc) average_r S/N alpha se   var.r med.r
pv1read       0.99      0.99    0.99      0.94 137  0.00016 1.6e-06  0.94
pv2read       0.99      0.99    0.99      0.94 136  0.00016 1.5e-06  0.94
pv3read       0.99      0.99    0.99      0.94 137  0.00016 1.6e-06  0.94
pv4read       0.99      0.99    0.99      0.94 137  0.00016 1.7e-06  0.94
pv5read       0.99      0.99    0.99      0.94 136  0.00016 1.7e-06  0.94
pv6read       0.99      0.99    0.99      0.94 136  0.00016 1.8e-06  0.94
pv7read       0.99      0.99    0.99      0.94 137  0.00016 1.4e-06  0.94
pv8read       0.99      0.99    0.99      0.94 136  0.00016 1.2e-06  0.94
pv9read       0.99      0.99    0.99      0.94 136  0.00016 1.4e-06  0.94
pv10read      0.99      0.99    0.99      0.94 136  0.00016 1.4e-06  0.94

 Item statistics 
            n raw.r std.r r.cor r.drop mean  sd
pv1read  4838  0.97  0.97  0.97   0.96  500 108
pv2read  4838  0.97  0.97  0.97   0.97  501 108
pv3read  4838  0.97  0.97  0.97   0.96  500 108
pv4read  4838  0.97  0.97  0.97   0.96  501 108
pv5read  4838  0.97  0.97  0.97   0.97  500 108
pv6read  4838  0.97  0.97  0.97   0.97  501 108
pv7read  4838  0.97  0.97  0.97   0.96  500 108
pv8read  4838  0.97  0.97  0.97   0.97  500 108
pv9read  4838  0.97  0.97  0.97   0.97  501 107
pv10read 4838  0.97  0.97  0.97   0.97  501 108

Qualitative Descriptors of Cronbach’s alpha

  • .95 - 1.00: Excellent
  • .90 - .94: Great
  • .80 - .89: Good
  • .70 - .79: Acceptable
  • .60 - .69: Questionable, and
  • .00 - .59: Unacceptable

Based on these statistics, the raw alpha of .99 indicates excellent internal consistency among the plausible values, and the reliability would barely change if any one plausible value were dropped. Given alpha = .99 and the fact that all ten plausible values appear to measure the same thing, I decided to retain all ten when building a composite variable.

# Average the ten reading plausible values (columns 17:26) row by row
read_data_full$composite <- rowMeans(read_data_full[, 17:26], na.rm = TRUE)
summary(read_data_full$composite)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  157.3   425.9   504.8   500.6   578.4   810.5 

Nope, this is not what I wanted! None of the item-dropped means in the Item Statistics are close to the reported mean either, and the outcome is worse than what I reported previously: the mean of the composite, 500.6, is still well below 505. Sadly, I cannot proceed with this so-called ‘composite’ as the dependent variable.

Let’s get back to square one. The summary above includes some weight variables that I need to check again. Here is a further summary in tabular form:

rbind(
  summary(read_data_full$w_fstuwt),
  summary(read_data_full$w_fstuwt_sch_sum),
  summary(read_data_full$w_schgrnrabwt)
)
          Min.     1st Qu.      Median       Mean    3rd Qu.      Max.
[1,] 262.75044   563.04221   661.71657   735.6438   854.5203  2946.134
[2,] 820.25602 18102.33043 21341.22828 22421.2347 25895.5895 49343.581
[3,]  20.68506    42.69652    66.48856   120.0234   157.3688  1294.020

Based on these statistics, w_fstuwt may be of use to my study, but w_fstuwt_sch_sum and w_schgrnrabwt are not.

Now I would like to go back to my original data set and check which variables are listed as weights.

showWeights(eds_pisa, verbose = TRUE)
There is 1 full sample weight in this edsurvey.data.frame:
  'w_fstuwt' with 80 JK replicate weights (the default).
    Jackknife replicate weight variables associated with the full
    sample weight 'w_fstuwt':
    'w_fsturwt1', 'w_fsturwt2', 'w_fsturwt3', 'w_fsturwt4',
    'w_fsturwt5', 'w_fsturwt6', 'w_fsturwt7', 'w_fsturwt8',
    'w_fsturwt9', 'w_fsturwt10', 'w_fsturwt11', 'w_fsturwt12',
    'w_fsturwt13', 'w_fsturwt14', 'w_fsturwt15', 'w_fsturwt16',
    'w_fsturwt17', 'w_fsturwt18', 'w_fsturwt19', 'w_fsturwt20',
    'w_fsturwt21', 'w_fsturwt22', 'w_fsturwt23', 'w_fsturwt24',
    'w_fsturwt25', 'w_fsturwt26', 'w_fsturwt27', 'w_fsturwt28',
    'w_fsturwt29', 'w_fsturwt30', 'w_fsturwt31', 'w_fsturwt32',
    'w_fsturwt33', 'w_fsturwt34', 'w_fsturwt35', 'w_fsturwt36',
    'w_fsturwt37', 'w_fsturwt38', 'w_fsturwt39', 'w_fsturwt40',
    'w_fsturwt41', 'w_fsturwt42', 'w_fsturwt43', 'w_fsturwt44',
    'w_fsturwt45', 'w_fsturwt46', 'w_fsturwt47', 'w_fsturwt48',
    'w_fsturwt49', 'w_fsturwt50', 'w_fsturwt51', 'w_fsturwt52',
    'w_fsturwt53', 'w_fsturwt54', 'w_fsturwt55', 'w_fsturwt56',
    'w_fsturwt57', 'w_fsturwt58', 'w_fsturwt59', 'w_fsturwt60',
    'w_fsturwt61', 'w_fsturwt62', 'w_fsturwt63', 'w_fsturwt64',
    'w_fsturwt65', 'w_fsturwt66', 'w_fsturwt67', 'w_fsturwt68',
    'w_fsturwt69', 'w_fsturwt70', 'w_fsturwt71', 'w_fsturwt72',
    'w_fsturwt73', 'w_fsturwt74', 'w_fsturwt75', 'w_fsturwt76',
    'w_fsturwt77', 'w_fsturwt78', 'w_fsturwt79', and 'w_fsturwt80'

Wow! It is indeed the w_fstuwt variable: the full sample weight, accompanied by 80 jackknife replicate weights (w_fsturwt1 through w_fsturwt80) for each of the 4,838 students. The replicate weights are not combined into w_fstuwt; they exist to estimate the sampling variance of statistics computed with the full weight (a sketch follows). The following description of the Jackknife Replication Method comes from the NCES website: “A replication method that estimates standard errors of percentages and other statistics. It is particularly suited to complex sample designs. In the jackknife, sample units are grouped into pairs (replicate groups). Portions of the sample (replicates) are formed by repeatedly omitting one half of the units in one of the replicate groups and calculating the desired statistic (replicate estimate). The number of replicate estimates is equal to the number of replicate groups. The variability among the replicate estimates is used to estimate the overall sampling variability” (https://nces.ed.gov/nationsreportcard/glossary.aspx#jackknife).
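To make the idea concrete, here is an illustrative sketch for the weighted mean of one plausible value, assuming the replicate weights came along in the getData() call (columns 27-106 above). This is not EdSurvey's internal computation: PISA scales its 80 replicates with Fay's factor k = 0.5, so the variance divisor below is G(1 - k)^2 = 20, and the result should be read as approximate.

wmean <- function(x, w) sum(w * x) / sum(w)
full <- wmean(read_data_full$pv1read, read_data_full$w_fstuwt)
reps <- sapply(1:80, function(r) {
  wmean(read_data_full$pv1read, read_data_full[[paste0("w_fsturwt", r)]])
})
se <- sqrt(sum((reps - full)^2) / 20) # Fay-adjusted replicate variance
c(estimate = full, SE = se)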

Now I can move forward with descriptive analyses. Before starting a series of descriptive statistics, I want to see whether the weighted and unweighted mean reading scores differ. Here are the unweighted statistics:

summary2(read_data_full, "read", weightVar = NULL)
Estimates are not weighted.
   Variable    N    Min.  1st Qu.   Median     Mean  3rd Qu.    Max.       SD
1   pv1read 4838 161.343 423.4801 503.6375 500.1502 578.6824 868.870 108.4549
2   pv2read 4838 176.458 424.5367 505.2045 500.7907 578.4439 898.478 107.9547
3   pv3read 4838 132.423 423.1459 503.8005 500.3032 577.9054 858.393 107.8982
4   pv4read 4838 140.293 426.1104 503.0610 501.0978 579.3880 834.076 108.4841
5   pv5read 4838 137.737 423.1717 504.4565 500.4783 579.0315 853.488 108.0790
6   pv6read 4838 128.111 424.4202 504.4380 500.9541 579.3228 844.836 108.1850
7   pv7read 4838 148.739 424.3686 502.5630 499.8935 578.6955 815.275 107.7119
8   pv8read 4838 170.907 424.9227 503.9935 500.3028 579.3685 823.427 108.1128
9   pv9read 4838 173.639 425.9822 503.8790 501.0805 577.1258 818.066 107.4320
10 pv10read 4838 167.822 426.0661 503.5685 500.6259 579.0812 834.091 107.9336
   NA's
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0

The means, as before, are not close to the reported 505; these are the unweighted means of all ten plausible reading scores. Let’s check the weighted mean reading score among US 15-year-olds who took part in the PISA 2018 assessment. Either of the calls below gives the same result, because w_fstuwt is the default weight:

# summary2(read_data_full, "read")
summary2(read_data_full, "read", weightVar = "w_fstuwt")
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N     Min.  1st Qu.   Median     Mean  3rd Qu.  Max.
1     read 4838    3559045 153.7472 429.7936 509.7353 505.3528 583.5768 844.9
        SD NA's Zero weights
1 107.9064    0            0

Amazing! That’s what I want. The weighted mean reading score for US students was 505.3528. Thus, w_fstuwt is the weight variable and read is the outcome/dependent variable. Let’s draw a histogram and look at the distribution (a sketch follows).
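A minimal sketch of that histogram, using the first plausible value weighted by w_fstuwt (plotting a single plausible value is fine for eyeballing the shape of the distribution, though not for inference):

ggplot(read_data_full, aes(x = pv1read, weight = w_fstuwt)) +
  geom_histogram(binwidth = 20, color = "white") +
  labs(x = "PISA 2018 reading score (pv1read)", y = "Weighted count")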

Descriptive Statistics

Average Scores Based on Class Size

clsize_reading <- edsurveyTable(formula = read ~ clsize, data = read_data_full)
clsize_reading

Formula: read ~ clsize 

Plausible values: 10
jrrIMax: 1
Weight variable: 'w_fstuwt'
Variance method: jackknife
JK replicates: 80
full data n: 4838
n used: 4275


Summary Table:
               clsize    N      WTD_N       PCT  SE(PCT)     MEAN  SE(MEAN)
 15 STUDENTS OR FEWER   98   78303.96  2.521100 1.020527 490.9990 10.635406
       16-20 STUDENTS  478  361655.05 11.643965 2.477297 505.8996 10.771112
       21-25 STUDENTS 1271  911449.99 29.345344 3.874926 507.4223  8.233802
       26-30 STUDENTS 1660 1134459.30 36.525425 3.979064 508.5477  6.052202
       31-35 STUDENTS  645  501987.16 16.162144 2.511393 511.2569  9.204440
       36-40 STUDENTS  123  118088.67  3.802022 1.777732 530.6957 16.336060
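The pattern (means creeping upward with class size) is easier to see in a plot. A hedged sketch, assuming the summary table sits in the result's $data element as in recent EdSurvey versions:

ggplot(clsize_reading$data, aes(x = clsize, y = MEAN)) +
  geom_pointrange(aes(ymin = MEAN - 2 * `SE(MEAN)`,
                      ymax = MEAN + 2 * `SE(MEAN)`)) +
  coord_flip() +
  labs(x = "Class size", y = "Weighted mean reading score (+/- 2 SE)")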

School Type and Reading Scores

schltype_reading <- edsurveyTable(
  formula = read ~ schltype,
  data = read_data_full
)
schltype_reading

Formula: read ~ schltype 

Plausible values: 10
jrrIMax: 1
Weight variable: 'w_fstuwt'
Variance method: jackknife
JK replicates: 80
full data n: 4838
n used: 4812


Summary Table:
                      schltype    N      WTD_N       PCT   SE(PCT)     MEAN
1          PRIVATE INDEPENDENT  163  203557.99  5.757412 1.2083048 525.3717
2 PRIVATE GOVERNMENT-DEPENDENT   13   22100.53  0.625089 0.3349908 499.0487
3                       PUBLIC 4636 3309922.77 93.617499 1.2244331 503.6134
   SE(MEAN)
1 21.037771
2 27.273720
3  3.421781

Student-Teacher Ratio Including Omitted Levels (e.g., NAs)

summary2(read_data_full, "stratio", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N   Min. 1st Qu. Median     Mean 3rd Qu. Max.       SD
1  stratio 4838    3559045 1.6667 12.5827     16 17.21917 18.9474  100 9.444379
  NA's Zero weights
1  761            0

Student-Teacher Ratio Excluding Omitted Levels

summary2(read_data_full, "stratio", omittedLevels = TRUE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N   Min. 1st Qu. Median     Mean 3rd Qu. Max.       SD
1  stratio 4077    2932840 1.6667 12.5827     16 17.21917 18.9474  100 9.444379
  NA's Zero weights
1    0            0

School Size (by Total Students)

summary2(read_data_full, "schsize", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N Min. 1st Qu. Median     Mean 3rd Qu. Max.       SD
1  schsize 4838    3559045   22     639   1411 1490.035    2061 4507 986.5392
  NA's Zero weights
1  559            0

Total Teachers by School

summary2(read_data_full, "totat", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N Min. 1st Qu. Median     Mean 3rd Qu. Max.       SD
1    totat 4838    3559045    1      48     80 86.36344     115  280 51.42637
  NA's Zero weights
1  702            0

Proportion of Teachers Fully Certified

summary2(read_data_full, "proatce", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N Min. 1st Qu. Median      Mean 3rd Qu. Max.        SD
1  proatce 4838    3559045    0  0.9685      1 0.9256084       1    1 0.2051878
  NA's Zero weights
1  802            0

Proportion of Teachers with a Bachelor's Degree or Above

summary2(read_data_full, "proat5ab", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N   Min. 1st Qu. Median      Mean 3rd Qu. Max.
1 proat5ab 4838    3559045 0.0116  0.6541      1 0.8113005       1    1
         SD NA's Zero weights
1 0.2747722  861            0

Proportion of Teachers with a Master's Degree or Above (Per School)

summary2(read_data_full, "proat5am", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N   Min. 1st Qu. Median    Mean 3rd Qu. Max.       SD
1 proat5am 4838    3559045 0.0167  0.2974    0.5 0.50363  0.6818    1 0.250698
  NA's Zero weights
1  853            0

Proportion of Teachers per School with a Doctoral or Other Higher Degree

summary2(read_data_full, "proat6", omittedLevels = FALSE)
Estimates are weighted using the weight variable 'w_fstuwt'
  Variable    N Weighted N Min. 1st Qu. Median       Mean 3rd Qu.   Max.
1   proat6 4838    3559045    0       0 0.0152 0.02227463   0.033 0.2222
          SD NA's Zero weights
1 0.02850491  886            0

References

  • Ciol, M. A., Hoffman, J. M., Dudgeon, B. J., Shumway-Cook, A., Yorkston, K. M., & Chan, L. (2006). Understanding the use of weights in the analysis of data from multistage surveys. Archives of Physical Medicine and Rehabilitation, 87(2), 299–303. https://doi.org/10.1016/j.apmr.2005.09.021