Ajustamos la data

StudentsPerformance <- read.csv("C:/Users/user/Desktop/StudentsPerformance.csv")
library(dplyr)
SP <- StudentsPerformance %>% 
  mutate(.,
         gender=as.factor(gender),
         race.ethnicity=as.factor(race.ethnicity),
         parental.level.of.education=as.factor(parental.level.of.education),
         lunch=as.factor(lunch),
         test.preparation.course=as.factor(test.preparation.course),
         math.score=as.numeric(math.score),
         reading.score=as.numeric(reading.score),
         writing.score=as.numeric(writing.score)
         )
str(SP)
## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : num  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : num  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : num  74 88 93 44 75 78 92 39 67 50 ...

Se evidencia que es una data de 1000 observaciones y 8 columnas. Con 5 variables de factor (cualitativas) y 3 de tipo numerico (cuantitativas)

Muestreo aleatorio simple (MAS)

Seleccionar el tamano de la muestra para la base de datos dada

# Margen de error o de precisión
MargenError <- seq(0, 0.1, by = 0.001)  # Nivel de precisión hasta 10% (0.1)
# Cuál es el nivel de confianza de la muestra???
NC <- 0.95  # Nivel de confianza
# El valor del Z para el nivel de confianza
z <- qnorm((1 - NC) / 2, mean = 0, sd = 1)  # Valor de Z
# Qué proporción de éxito es necesaria para la toma de la muestra??
P <- 0.5  # Probabilidad de éxito (es en la relacion p*q el valor mas alto posible, es decir la varianza mas alta)
# Cálculo de la muestra para población infinita
n0 <- ((z^2) * P * (1 - P)) / MargenError^2 # Cuando 
# Cálculo de la muestra ajustada para población finita
N <- nrow(SP)
n <- n0 / (1 + (n0 / N))
# Creación del dataframe para el gráfico
data <- data.frame(MargenError, n=round(n, digits = 0) )
head(data, n = 10)
##    MargenError   n
## 1        0.000 NaN
## 2        0.001 999
## 3        0.002 996
## 4        0.003 991
## 5        0.004 984
## 6        0.005 975
## 7        0.006 964
## 8        0.007 951
## 9        0.008 938
## 10       0.009 922

Relacion entre el tamano de la muestra definitiva y el error

# Gráfico que relaciona el tamaño de la muestra y el error obtenido 
library(ggplot2)
ggplot(data, aes(x = n, y = MargenError)) +
  geom_line() +
  geom_point() +
  labs(x = "Tamaño de muestra (n) | ajustado para población finita",
       y = "Margen de error") +
  ggtitle("Relación entre el tamaño de muestra y el margen de error") +
  theme_classic()+
  geom_vline(xintercept = 248, colour="red", size=0.5)+
  geom_hline(yintercept = 0.054, colour="blue", size=0.5)

A partir de los datos de n y error se selecciono una muestra de 248 elementos, pues se encuentra en el punto de la grafica denominado “codo” o donde las pendientes sufren el cambio mas drastico

Tomamos la muestra de la base de datos, a partir del tamano de muestra mediante MAS

# numero de muestras seleccionado
n <- 248 
# Extraemos los ID de las observaciones
sam.MAS <- sample(nrow(SP), size = n, replace = TRUE) 
# Extraemos los valores del dataframe a partir de los ID seleccionados
data.MAS <- SP[sam.MAS, ]
head(data.MAS, n = 10)
##     gender race.ethnicity parental.level.of.education        lunch
## 771   male        group B                 high school     standard
## 679   male        group D          associate's degree free/reduced
## 828 female        group C            some high school     standard
## 621 female        group C                 high school free/reduced
## 20  female        group C          associate's degree free/reduced
## 821 female        group A            some high school     standard
## 417   male        group C           bachelor's degree     standard
## 810   male        group B           bachelor's degree     standard
## 92    male        group C                 high school free/reduced
## 869   male        group E          associate's degree free/reduced
##     test.preparation.course math.score reading.score writing.score
## 771                    none         52            48            49
## 679                    none         81            75            78
## 828                    none         65            69            76
## 621                    none         35            61            54
## 20                     none         54            58            61
## 821               completed         85            90            92
## 417               completed         71            74            68
## 810                    none         59            54            51
## 92                     none         27            34            36
## 869               completed         78            74            72

Muestreo sistematico:

Seleccione una base de datos de su interés y realice un análisis de muestreo (Prueba piloto) para seleccionar la muestra adecuada para un diseño sistemático y realice los cálculos de interés. ##

library(TeachingSampling)
library(ggplot2)
library(gridExtra)

Prueba piloto

tamaño <- 0.05 # se tomo 5% (se recomienda entre 5 a 10%)
n <- round(tamaño*N,0) # Numero de muestras a tomar
k <- round(N/n,0) # Cada cuanto debo seleccionar la muestra (depende del investigador)
# Extraccion de la muestra
set.seed(1234)
sam <- S.SY(N,k) # Filas de la muestra
data <- SP[sam,] # Seleccionar de la base de datos original las filas seleccionadas como muestra
str(data) # Revisar la estructura de la data muestreada
## 'data.frame':    50 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 2 1 2 2 2 2 1 1 2 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 3 5 3 2 3 3 3 3 3 4 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 6 1 3 1 1 3 2 5 4 1 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 1 1 1 2 2 2 2 2 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 1 2 2 1 1 2 ...
##  $ math.score                 : num  69 81 33 44 78 84 58 70 81 61 ...
##  $ reading.score              : num  75 81 41 41 81 77 55 89 91 55 ...
##  $ writing.score              : num  78 79 43 38 82 74 48 88 87 52 ...

la muestra piloto cuenta con un total de 50, que equivale al 5% de la data original. Las observaciones seleccionadas de la muestra fueron: 16, 36, 56, 76, 96, 116, 136, 156, 176, 196, 216, 236, 256, 276, 296, 316, 336, 356, 376, 396, 416, 436, 456, 476, 496, 516, 536, 556, 576, 596, 616, 636, 656, 676, 696, 716, 736, 756, 776, 796, 816, 836, 856, 876, 896, 916, 936, 956, 976, 996y se evidencia que presentan un distanciamiento sistematico entre muestras.

Muestra definitiva (Variable math.score)

# Valores necesarios para calcular tamaño de la muestra
Error.1 <- 0.10*mean(data$math.score) # Para calcular mi muestra admito un 10% de error
Var <- var(data[,6]) # varianza estimada con la columna "Numero de empelados"
MargenError <- seq(0, Error.1, by = Error.1/100)
NC <- 0.95
z <- abs(qnorm((1 - NC)/2, mean = 0, sd = 1))
n.0 <- round(((z^2) * Var) / MargenError^2, digits=0)
# Cálculos previos para población finita
n <- round(n.0 / (1 + (n.0 / N)), digits=0)
# Crear un data frame con los valores de n.0 y MargenError
data_plot_infinita <- data.frame(n = n.0, MargenError = MargenError)
data_plot_infinita
##          n MargenError
## 1      Inf     0.00000
## 2   230710     0.06564
## 3    57677     0.13128
## 4    25634     0.19692
## 5    14419     0.26256
## 6     9228     0.32820
## 7     6409     0.39384
## 8     4708     0.45948
## 9     3605     0.52512
## 10    2848     0.59076
## 11    2307     0.65640
## 12    1907     0.72204
## 13    1602     0.78768
## 14    1365     0.85332
## 15    1177     0.91896
## 16    1025     0.98460
## 17     901     1.05024
## 18     798     1.11588
## 19     712     1.18152
## 20     639     1.24716
## 21     577     1.31280
## 22     523     1.37844
## 23     477     1.44408
## 24     436     1.50972
## 25     401     1.57536
## 26     369     1.64100
## 27     341     1.70664
## 28     316     1.77228
## 29     294     1.83792
## 30     274     1.90356
## 31     256     1.96920
## 32     240     2.03484
## 33     225     2.10048
## 34     212     2.16612
## 35     200     2.23176
## 36     188     2.29740
## 37     178     2.36304
## 38     169     2.42868
## 39     160     2.49432
## 40     152     2.55996
## 41     144     2.62560
## 42     137     2.69124
## 43     131     2.75688
## 44     125     2.82252
## 45     119     2.88816
## 46     114     2.95380
## 47     109     3.01944
## 48     104     3.08508
## 49     100     3.15072
## 50      96     3.21636
## 51      92     3.28200
## 52      89     3.34764
## 53      85     3.41328
## 54      82     3.47892
## 55      79     3.54456
## 56      76     3.61020
## 57      74     3.67584
## 58      71     3.74148
## 59      69     3.80712
## 60      66     3.87276
## 61      64     3.93840
## 62      62     4.00404
## 63      60     4.06968
## 64      58     4.13532
## 65      56     4.20096
## 66      55     4.26660
## 67      53     4.33224
## 68      51     4.39788
## 69      50     4.46352
## 70      48     4.52916
## 71      47     4.59480
## 72      46     4.66044
## 73      45     4.72608
## 74      43     4.79172
## 75      42     4.85736
## 76      41     4.92300
## 77      40     4.98864
## 78      39     5.05428
## 79      38     5.11992
## 80      37     5.18556
## 81      36     5.25120
## 82      35     5.31684
## 83      34     5.38248
## 84      33     5.44812
## 85      33     5.51376
## 86      32     5.57940
## 87      31     5.64504
## 88      30     5.71068
## 89      30     5.77632
## 90      29     5.84196
## 91      28     5.90760
## 92      28     5.97324
## 93      27     6.03888
## 94      27     6.10452
## 95      26     6.17016
## 96      26     6.23580
## 97      25     6.30144
## 98      25     6.36708
## 99      24     6.43272
## 100     24     6.49836
## 101     23     6.56400
# Crear un data frame con los valores de n y MargenError
data_plot_finita <- data.frame(n = n, MargenError = MargenError)
data_plot_finita
##       n MargenError
## 1   NaN     0.00000
## 2   996     0.06564
## 3   983     0.13128
## 4   962     0.19692
## 5   935     0.26256
## 6   902     0.32820
## 7   865     0.39384
## 8   825     0.45948
## 9   783     0.52512
## 10  740     0.59076
## 11  698     0.65640
## 12  656     0.72204
## 13  616     0.78768
## 14  577     0.85332
## 15  541     0.91896
## 16  506     0.98460
## 17  474     1.05024
## 18  444     1.11588
## 19  416     1.18152
## 20  390     1.24716
## 21  366     1.31280
## 22  343     1.37844
## 23  323     1.44408
## 24  304     1.50972
## 25  286     1.57536
## 26  270     1.64100
## 27  254     1.70664
## 28  240     1.77228
## 29  227     1.83792
## 30  215     1.90356
## 31  204     1.96920
## 32  194     2.03484
## 33  184     2.10048
## 34  175     2.16612
## 35  167     2.23176
## 36  158     2.29740
## 37  151     2.36304
## 38  145     2.42868
## 39  138     2.49432
## 40  132     2.55996
## 41  126     2.62560
## 42  120     2.69124
## 43  116     2.75688
## 44  111     2.82252
## 45  106     2.88816
## 46  102     2.95380
## 47   98     3.01944
## 48   94     3.08508
## 49   91     3.15072
## 50   88     3.21636
## 51   84     3.28200
## 52   82     3.34764
## 53   78     3.41328
## 54   76     3.47892
## 55   73     3.54456
## 56   71     3.61020
## 57   69     3.67584
## 58   66     3.74148
## 59   65     3.80712
## 60   62     3.87276
## 61   60     3.93840
## 62   58     4.00404
## 63   57     4.06968
## 64   55     4.13532
## 65   53     4.20096
## 66   52     4.26660
## 67   50     4.33224
## 68   49     4.39788
## 69   48     4.46352
## 70   46     4.52916
## 71   45     4.59480
## 72   44     4.66044
## 73   43     4.72608
## 74   41     4.79172
## 75   40     4.85736
## 76   39     4.92300
## 77   38     4.98864
## 78   38     5.05428
## 79   37     5.11992
## 80   36     5.18556
## 81   35     5.25120
## 82   34     5.31684
## 83   33     5.38248
## 84   32     5.44812
## 85   32     5.51376
## 86   31     5.57940
## 87   30     5.64504
## 88   29     5.71068
## 89   29     5.77632
## 90   28     5.84196
## 91   27     5.90760
## 92   27     5.97324
## 93   26     6.03888
## 94   26     6.10452
## 95   25     6.17016
## 96   25     6.23580
## 97   24     6.30144
## 98   24     6.36708
## 99   23     6.43272
## 100  23     6.49836
## 101  22     6.56400
# Graficar tamaño de muestra para población infinita
plot_infinita <- ggplot(data = data_plot_infinita, aes(x = n, y = MargenError)) +
  geom_line() +
  labs(title = "Tamaño de muestra para población infinita",
       x = "Tamaño de muestra (n)",
       y = "Margen de error") +
  theme_classic()
# Graficar tamaño de muestra para población finita
plot_finita <- ggplot(data = data_plot_finita, aes(x = n, y = MargenError)) +
  geom_line() +
  labs(title = "Tamaño de muestra para población finita",
       x = "Tamaño de muestra (n)",
       y = "Margen de error") +
  theme_classic()+
  geom_vline(xintercept = 240, colour="red", size=0.5)+
  geom_hline(yintercept = 1.77228, colour="blue", size=0.5)
# Combinar las dos gráficas en una sola imagen
combined_plot <- grid.arrange(plot_infinita, plot_finita, nrow = 1)

## Tomamos la muestra de la base de datos, a partir del tamano de muestra definitiva

# Seleccionar de la base de datos original las filas seleccionadas en el muestreo sistematico
set.seed(1234)
n.sis <- 240
# Calculo el tamaño del intervalo para cubrir el total de filas (nrow(SP))
# con el tamaño de muestra seleccionado (n.sis)
k.sis <- round( (n.sis/nrow(SP))*100, digits = 0)
# Determino aleatoriamente el valor desdee donde inicia el muestreo
arranque <- sample(1:k.sis, size = 1) # posición inicial
# Realizo el muestreo sistematico 
sam.sis <- seq(arranque, arranque+k*(n.sis-1), k)    
# Extraigo las observaciones a partir de la muestra sistematica seleccionada
data.sistematico <- SP[sam.sis,]
str(data.sistematico)
## 'data.frame':    240 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 2 1 2 2 2 2 1 1 2 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 3 5 3 2 3 3 3 3 3 4 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 6 1 3 1 1 3 2 5 4 1 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 1 1 1 2 2 2 2 2 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 1 2 2 1 1 2 ...
##  $ math.score                 : num  69 81 33 44 78 84 58 70 81 61 ...
##  $ reading.score              : num  75 81 41 41 81 77 55 89 91 55 ...
##  $ writing.score              : num  78 79 43 38 82 74 48 88 87 52 ...
# Referencia: https://rpubs.com/luis_bolanos/922284

Muestreo estratificado:

Seleccione una base de su interés donde contenga una variable cualitativa que se pueda estratificar y una variable cuantitativa que este asociada. Realice un análisis de muestreo (Prueba piloto) para seleccionar la muestra adecuada para un diseño estratificado y realice los cálculos de interés.

Definicion de los estratos

estratos <- levels(as.factor(SP$race.ethnicity) )
estrato1 <- subset(x=SP, race.ethnicity=="group A"); head(estrato1, n=10)
##    gender race.ethnicity parental.level.of.education        lunch
## 4    male        group A          associate's degree free/reduced
## 14   male        group A                some college     standard
## 15 female        group A             master's degree     standard
## 26   male        group A             master's degree free/reduced
## 47 female        group A          associate's degree     standard
## 62   male        group A            some high school free/reduced
## 63   male        group A          associate's degree free/reduced
## 73 female        group A          associate's degree free/reduced
## 78   male        group A           bachelor's degree     standard
## 83   male        group A                some college free/reduced
##    test.preparation.course math.score reading.score writing.score
## 4                     none         47            57            44
## 14               completed         78            72            70
## 15                    none         50            53            58
## 26                    none         73            74            72
## 47               completed         55            65            62
## 62                    none         39            39            34
## 63                    none         62            61            55
## 73                    none         41            51            48
## 78               completed         80            78            81
## 83               completed         50            47            54
estrato2 <- subset(x=SP, race.ethnicity=="group B"); head(estrato2, n=10)
##    gender race.ethnicity parental.level.of.education        lunch
## 1  female        group B           bachelor's degree     standard
## 3  female        group B             master's degree     standard
## 6  female        group B          associate's degree     standard
## 7  female        group B                some college     standard
## 8    male        group B                some college free/reduced
## 10 female        group B                 high school free/reduced
## 13 female        group B                 high school     standard
## 18 female        group B            some high school free/reduced
## 22 female        group B                some college free/reduced
## 27   male        group B                some college     standard
##    test.preparation.course math.score reading.score writing.score
## 1                     none         72            72            74
## 3                     none         90            95            93
## 6                     none         71            83            78
## 7                completed         88            95            92
## 8                     none         40            43            39
## 10                    none         38            60            50
## 13                    none         65            81            73
## 18                    none         18            32            28
## 22               completed         65            75            70
## 27                    none         69            54            55
estrato3 <- subset(x=SP, race.ethnicity=="group C"); head(estrato3, n=10)
##    gender race.ethnicity parental.level.of.education        lunch
## 2  female        group C                some college     standard
## 5    male        group C                some college     standard
## 11   male        group C          associate's degree     standard
## 16 female        group C            some high school     standard
## 17   male        group C                 high school     standard
## 19   male        group C             master's degree free/reduced
## 20 female        group C          associate's degree free/reduced
## 24 female        group C            some high school     standard
## 28 female        group C           bachelor's degree     standard
## 29   male        group C                 high school     standard
##    test.preparation.course math.score reading.score writing.score
## 2                completed         69            90            88
## 5                     none         76            78            75
## 11                    none         58            54            52
## 16                    none         69            75            78
## 17                    none         88            89            86
## 19               completed         46            42            46
## 20                    none         54            58            61
## 24                    none         69            73            73
## 28                    none         67            69            75
## 29                    none         70            70            65
estrato4 <- subset(x=SP, race.ethnicity=="group D"); head(estrato4, n=10)
##    gender race.ethnicity parental.level.of.education        lunch
## 9    male        group D                 high school free/reduced
## 12   male        group D          associate's degree     standard
## 21   male        group D                 high school     standard
## 23   male        group D                some college     standard
## 25   male        group D           bachelor's degree free/reduced
## 30 female        group D             master's degree     standard
## 31 female        group D                some college     standard
## 34   male        group D                some college     standard
## 37 female        group D          associate's degree     standard
## 38 female        group D            some high school free/reduced
##    test.preparation.course math.score reading.score writing.score
## 9                completed         64            64            67
## 12                    none         40            52            43
## 21                    none         66            69            63
## 23                    none         44            54            53
## 25               completed         74            71            80
## 30                    none         62            70            75
## 31                    none         69            74            74
## 34                    none         40            42            38
## 37                    none         74            81            83
## 38                    none         50            64            59
estrato5 <- subset(x=SP, race.ethnicity=="group E"); head(estrato5, n=10)
##    gender race.ethnicity parental.level.of.education        lunch
## 33 female        group E             master's degree free/reduced
## 35   male        group E                some college     standard
## 36   male        group E          associate's degree     standard
## 45 female        group E          associate's degree free/reduced
## 51   male        group E                some college     standard
## 52   male        group E          associate's degree free/reduced
## 57 female        group E          associate's degree     standard
## 61   male        group E           bachelor's degree free/reduced
## 77   male        group E            some high school     standard
## 80 female        group E             master's degree     standard
##    test.preparation.course math.score reading.score writing.score
## 33                    none         56            72            65
## 35                    none         97            87            82
## 36               completed         81            81            79
## 45                    none         50            56            54
## 51                    none         53            55            48
## 52               completed         77            69            68
## 57               completed         82            85            86
## 61               completed         79            74            72
## 77                    none         30            26            22
## 80                    none         62            68            68

Se definieron los estratos a partir de la variable cualitativarace.ethnicity porque es tiene mas de dos niveles y es una variable que puede agrupar de forma diferente las respuesta de la poblacion. sin embargo no se realizo una comparacion que determine que es la mejor para estratificar la población.

Estimacion del tamano muestra (Piloto)

n_muestra <- nrow(SP)*0.1  #Se toma eñ 10% del total de la poblacion
n1 <- round(n_muestra * nrow(estrato1)/ nrow(SP), digits = 0); n1
## [1] 9
n2 <- round(n_muestra * nrow(estrato2)/ nrow(SP), digits = 0); n2
## [1] 19
n3 <- round(n_muestra * nrow(estrato3)/ nrow(SP), digits = 0); n3
## [1] 32
n4 <- round(n_muestra * nrow(estrato4)/ nrow(SP), digits = 0); n4
## [1] 26
n5 <- round(n_muestra * nrow(estrato5)/ nrow(SP), digits = 0); n5
## [1] 14

Estimacion muestra (Piloto)

library(sampling)
# Tamaño total muestra piloto
sum(n1, n2, n3, n4, n5)
## [1] 100
estratos <- strata(SP,
                   stratanames = "race.ethnicity", # variable cualitativa que estratifica
                   size = c(n1, n2, n3, n4, n5), # vector con los tamanos estimados por estrato
                   method = "srswor") # simple random sampling without replacement
muestreado <- getdata(SP, estratos)
str(muestreado)
## 'data.frame':    100 obs. of  11 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 2 2 2 1 2 2 2 1 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 3 6 1 3 5 1 5 5 5 6 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 2 2 1 2 2 2 2 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 2 2 2 1 1 1 2 2 2 ...
##  $ math.score                 : num  54 82 48 62 62 76 69 66 58 69 ...
##  $ reading.score              : num  64 82 43 55 66 94 77 65 50 75 ...
##  $ writing.score              : num  68 80 45 54 68 87 77 60 45 78 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ ID_unit                    : int  495 506 566 683 685 716 760 775 835 16 ...
##  $ Prob                       : num  0.0474 0.0474 0.0474 0.0474 0.0474 ...
##  $ Stratum                    : int  1 1 1 1 1 1 1 1 1 2 ...

Datos a muestrear por estrato (Piloto)

table(muestreado$race.ethnicity)
## 
## group A group B group C group D group E 
##      32       9      19      26      14
# Para una muestra total de 100 datos, estas son las muestras a tomar por cada estrato.

Estimamos la media por estrato y la media total estratificada (Piloto)

#Se extrae la media por grupo para la variable math.score
me1 <- mean(muestreado[muestreado$race.ethnicity == "group A","math.score"])
me2 <- mean(muestreado[muestreado$race.ethnicity == "group B","math.score"])
me3 <- mean(muestreado[muestreado$race.ethnicity == "group C","math.score"])
me4 <- mean(muestreado[muestreado$race.ethnicity == "group D","math.score"])
me5 <- mean(muestreado[muestreado$race.ethnicity == "group E","math.score"])
# Calculo de la media estimada total
media.est <- (1/nrow(SP))*(sum(SP$race.ethnicity == "group A")*me1+
                            sum(SP$race.ethnicity == "group B")*me2+
                             sum(SP$race.ethnicity == "group C")*me3+
                             sum(SP$race.ethnicity == "group D")*me4+
                             sum(SP$race.ethnicity == "group E")*me5)
cbind(me1, me2, me3, me4, me5, media.est)
##           me1      me2      me3      me4  me5 media.est
## [1,] 60.09375 64.11111 60.57895 66.61538 74.5  64.73737

Definimos los parametros para calcular el tamano de la muestra definitiva

#A partir de la muestra estratificada piloto (media.est) calculamos el tamaño de la muestra
E <- 0.1*media.est # Error admitido, en este caso seleccionamos 10%
MargenError2 <- seq(0, E, by= E/20) # diferentes niveles de error que vamos a evaluar
NC2 <- 0.95 # Nivel de confianza
Z2 <- abs(qnorm((1-NC2)/2,mean = 0, sd=1)) # estimacion del valor en la distribucion Z
Nt <- nrow(SP) # Numero total de observaciones

Calculamos el tamano de muestra definitiva

#Calcula la varianza estimada de cada estratos
Var_st <-
  ((
    sum(SP$race.ethnicity == "group A") / Nt) * var(muestreado[muestreado$race.ethnicity == "group A", "math.score"])
   + (sum(SP$race.ethnicity == "group B") / Nt) * var(muestreado[muestreado$race.ethnicity == "group B", "math.score"])
   + (sum(SP$race.ethnicity == "group C") / Nt) * var(muestreado[muestreado$race.ethnicity == "group C", "math.score"])
   + (sum(SP$race.ethnicity == "group D") / Nt) * var(muestreado[muestreado$race.ethnicity == "group D", "math.score"])
   + (sum(SP$race.ethnicity == "group E") / Nt) * var(muestreado[muestreado$race.ethnicity == "group E", "math.score"]) )
n0 <- Var_st * (1 / (MargenError2 / Z2)^2)
n <- n0 / (1 + n0 / Nt)
data <- data.frame(
  MargenError2=round(MargenError2, digits = 3),
  num.datos=round(n, digits = 0)
)
data
##    MargenError2 num.datos
## 1         0.000       NaN
## 2         0.324       867
## 3         0.647       620
## 4         0.971       420
## 5         1.295       290
## 6         1.618       207
## 7         1.942       153
## 8         2.266       118
## 9         2.589        93
## 10        2.913        75
## 11        3.237        61
## 12        3.561        51
## 13        3.884        43
## 14        4.208        37
## 15        4.532        32
## 16        4.855        28
## 17        5.179        25
## 18        5.503        22
## 19        5.826        20
## 20        6.150        18
## 21        6.474        16

Relacion entre el tamano de la muestra definitiva y el error

# Grafico
library(ggplot2)
ggplot(data, aes(y = MargenError2, x = num.datos) ) +
  geom_line() +
  geom_point() +
  labs(y = "Margen de Error", x = "Tamaño de muestra") +
  ggtitle("Relación entre Margen de Error y Tamaño de muestra") +
  theme_classic()+
  geom_hline(yintercept = 2.331, colour="red", size=0.5)+
  geom_vline(xintercept = 130, colour="blue", size=0.5)

Tomamos la muestra de la base de datos, a partir del tamano de muestra definitiva

n.estra <- 130 # numero de muestras seleccionado
sam.estra <- sample(nrow(SP), size = n.estra, replace = TRUE)
data.estratificada <- SP[sam.estra, ]
str(data.estratificada)
## 'data.frame':    130 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 2 2 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 4 4 3 5 4 2 4 3 3 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 4 6 3 1 3 2 2 3 2 4 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 1 2 2 1 1 2 2 2 2 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 2 2 2 2 1 1 2 2 1 ...
##  $ math.score                 : num  49 59 57 68 57 68 65 76 86 46 ...
##  $ reading.score              : num  53 67 50 65 58 74 81 73 83 42 ...
##  $ writing.score              : num  52 61 54 61 57 74 81 68 86 46 ...

Analisis descriptivo empleando la muestra definitiva estratificada

# Resumen poblacion
summary(SP)
##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female:518   group A: 89    associate's degree:222          free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118          standard    :645  
##               group C:319    high school       :196                            
##               group D:262    master's degree   : 59                            
##               group E:140    some college      :226                            
##                              some high school  :179                            
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00
# Resumen muestra aleatoria simple
summary(data.MAS)
##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female:120   group A:27     associate's degree:50           free/reduced: 77  
##  male  :128   group B:48     bachelor's degree :24           standard    :171  
##               group C:77     high school       :47                             
##               group D:56     master's degree   :14                             
##               group E:40     some college      :52                             
##                              some high school  :61                             
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed: 93           Min.   : 23.00   Min.   : 24.00   Min.   : 15.00  
##  none     :155           1st Qu.: 59.00   1st Qu.: 58.00   1st Qu.: 56.75  
##                          Median : 67.00   Median : 70.00   Median : 69.00  
##                          Mean   : 67.69   Mean   : 69.88   Mean   : 68.81  
##                          3rd Qu.: 78.00   3rd Qu.: 81.00   3rd Qu.: 80.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00
# Resumen muestra sistematica
summary(data.sistematico)
##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female: 26   group A:  3    associate's degree: 12          free/reduced: 19  
##  male  : 24   group B: 11    bachelor's degree :  8          standard    : 31  
##  NA's  :190   group C: 20    high school       :  9          NA's        :190  
##               group D:  6    master's degree   :  3                            
##               group E: 10    some college      : 12                            
##               NA's   :190    some high school  :  6                            
##                              NA's              :190                            
##  test.preparation.course   math.score    reading.score   writing.score  
##  completed: 19           Min.   :32.00   Min.   :34.00   Min.   :33.00  
##  none     : 31           1st Qu.:57.25   1st Qu.:57.25   1st Qu.:58.25  
##  NA's     :190           Median :67.50   Median :69.00   Median :69.50  
##                          Mean   :65.64   Mean   :69.22   Mean   :67.82  
##                          3rd Qu.:78.75   3rd Qu.:81.00   3rd Qu.:78.75  
##                          Max.   :97.00   Max.   :99.00   Max.   :96.00  
##                          NA's   :190     NA's   :190     NA's   :190
# Resumen muestra estratificada
summary(data.estratificada)
##     gender   race.ethnicity     parental.level.of.education          lunch   
##  female:60   group A:17     associate's degree:37           free/reduced:39  
##  male  :70   group B:18     bachelor's degree :19           standard    :91  
##              group C:41     high school       :22                            
##              group D:38     master's degree   : 5                            
##              group E:16     some college      :30                            
##                             some high school  :17                            
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:48            Min.   : 28.00   Min.   : 23.00   Min.   : 19.00  
##  none     :82            1st Qu.: 57.00   1st Qu.: 60.00   1st Qu.: 57.00  
##                          Median : 66.50   Median : 72.00   Median : 70.00  
##                          Mean   : 66.48   Mean   : 69.76   Mean   : 68.27  
##                          3rd Qu.: 77.50   3rd Qu.: 81.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00
# Resumen muestra estratificada por estratos
by(data = data.estratificada$math.score,
   INDICES = data.estratificada$race.ethnicity,
   FUN = summary)
## data.estratificada$race.ethnicity: group A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   50.00   63.00   61.53   71.00   97.00 
## ------------------------------------------------------------ 
## data.estratificada$race.ethnicity: group B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38.00   57.25   65.00   64.83   71.50   88.00 
## ------------------------------------------------------------ 
## data.estratificada$race.ethnicity: group C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.00   57.00   65.00   64.37   70.00   94.00 
## ------------------------------------------------------------ 
## data.estratificada$race.ethnicity: group D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   57.25   68.50   66.42   78.75   93.00 
## ------------------------------------------------------------ 
## data.estratificada$race.ethnicity: group E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   76.50   79.00   79.12   86.00  100.00

Estratificado con base de datos externa

Seleccione una base de su interés donde contenga una variable cuantitativa y aplique los métodos para generar los estratificados. Seleccione la muestra adecuada y realice los cálculos de interés.