The overall goal of this assignment was to use Mockaroo, a mock data generator, to create mock (aka fake or synthetic) data that resembles a real dataset.
I selected a dataset of average SAT exam scores from New York City (NYC) schools in 2010, which is publicly available on NYC OpenData, in part because it is less than half the size of Mockaroo's 1000-row limit (ie, the entire dataset could be doubled).
I downloaded the CSV file, saved it to my GitHub repository, and then loaded it into a dataframe.
SAT_scores_real <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/NYC_schools_meanSAT_2010.csv', show_col_types = FALSE)
The raw data has 460 rows and 6 columns.
glimpse(SAT_scores_real)
## Rows: 460
## Columns: 6
## $ DBN <chr> "01M292", "01M448", "01M450", "01M458", "01M50…
## $ `School Name` <chr> "Henry Street School for International Studies…
## $ `Number of Test Takers` <dbl> 31, 60, 69, 26, NA, 154, 47, 35, 138, 11, 50, …
## $ `Critical Reading Mean` <dbl> 391, 394, 418, 385, NA, 314, 568, 411, 630, 40…
## $ `Mathematics Mean` <dbl> 425, 419, 431, 370, NA, 532, 583, 401, 608, 41…
## $ `Writing Mean` <dbl> 385, 387, 402, 378, NA, 314, 568, 401, 630, 38…
Some of the column names were difficult to work with, so I renamed them.
SAT_scores_real <- SAT_scores_real %>%
rename(school_ID = DBN,
school_name = `School Name`,
n_test_takers = `Number of Test Takers`,
mean_reading_score = `Critical Reading Mean`,
mean_math_score = `Mathematics Mean`,
mean_writing_score = `Writing Mean`)
There was also some missing data. I removed these rows because they wouldn’t be informative for subsequent statistical analyses. This left 386 rows.
map(SAT_scores_real, ~ sum(is.na(.)))
## $school_ID
## [1] 0
##
## $school_name
## [1] 0
##
## $n_test_takers
## [1] 74
##
## $mean_reading_score
## [1] 74
##
## $mean_math_score
## [1] 74
##
## $mean_writing_score
## [1] 74
SAT_scores_real <- SAT_scores_real %>%
drop_na()
nrow(SAT_scores_real)
## [1] 386
The number of test takers at each school ranged from 7 to 1047.
summary(SAT_scores_real$n_test_takers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 35.00 54.00 103.66 92.75 1047.00
I used Mockaroo to create mock SAT score datasets based on the attributes of the real dataset above. To define the minimum and maximum score for each test section (reading, math, writing), I looked up the minimum and maximum scores for SAT exams administered in 2010.
The Mockaroo input parameters are shown below. I defined a “Digit Sequence” for the school ID based on the format of the real identifiers. The closest field type that I could find for school_name was “University”.
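Since the parameter screenshot is not reproduced here, the tibble below sketches the schema as described; the field types are from Mockaroo's UI, but the parameter notes (particularly the Digit Sequence pattern) are approximations of what I described, not a verbatim export.
# Sketch of the Mockaroo schema described above (approximate, for reference)
tibble::tribble(
  ~field, ~mockaroo_type, ~parameters,
  "school_ID", "Digit Sequence", "pattern resembling real DBNs (eg, 01M292)",
  "school_name", "University", "",
  "n_test_takers", "Number", "min = 7, max = 1047",
  "mean_reading_score", "Number", "min = 200, max = 800",
  "mean_math_score", "Number", "min = 200, max = 800",
  "mean_writing_score", "Number", "min = 200, max = 800"
)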
I generated 2 mock datasets with 386 rows and 772 rows (\(2\times\) original). I saved the CSV files to my GitHub repository and loaded them into dataframes.
# mock1 is same size as the original dataset
SAT_scores_mock1 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA1.csv', show_col_types = FALSE)
# mock2 is twice as large as the original dataset
SAT_scores_mock2 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA2.csv', show_col_types = FALSE)
I compared the real vs mock data using descriptive statistics.
The boxplots below show that the distribution of the number of test takers was similar in the two mock datasets but substantially different from the real dataset.
# gdata::cbindX() is used rather than base cbind() because the real, mock1, and mock2 dataframes have different numbers of rows
test_takers_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$n_test_takers),
data.frame(SAT_scores_mock1$n_test_takers),
data.frame(SAT_scores_mock2$n_test_takers))) %>%
set_names("real", "mock1", "mock2") %>%
gather(., key = "dataset", value = "n_test_takers") %>%
drop_na()
ggplot(test_takers_all, aes(x = n_test_takers, fill = dataset)) +
geom_boxplot()
This difference can also be seen by comparing summary statistics. All 3 datasets span essentially the same range of values (as expected, since the minimum and maximum were input parameters); however, the median and mean number of test takers in the mock datasets are much greater than in the real dataset.
test_takers_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(n_test_takers),
max = max(n_test_takers),
median = median(n_test_takers),
IQR = round(IQR(n_test_takers), 2),
mean = round(mean(n_test_takers), 2),
SD = round(sd(n_test_takers), 2),
) %>%
kbl() %>%
kable_material()
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| mock1 | 386 | 7 | 1047 | 533.5 | 524.00 | 525.90 | 296.05 |
| mock2 | 772 | 8 | 1046 | 527.0 | 518.25 | 530.87 | 296.86 |
| real | 386 | 7 | 1047 | 54.0 | 57.75 | 103.66 | 145.26 |
The distributions of mean reading scores in the two mock datasets were nearly identical to each other but much wider than in the real dataset.
reading_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_reading_score),
data.frame(SAT_scores_mock1$mean_reading_score),
data.frame(SAT_scores_mock2$mean_reading_score))) %>%
set_names("real", "mock1", "mock2") %>%
gather(., key = "dataset", value = "mean_reading_score") %>%
drop_na()
ggplot(reading_score_all, aes(x = mean_reading_score, fill = dataset)) +
geom_boxplot()
The range of reading scores in the mock datasets almost perfectly matches the input parameters (200 to 800), whereas the range in the real dataset is narrower (291 to 674). This is not surprising, because the real dataset contains mean test scores for each school: even though the theoretical minimum and maximum scores are 200 and 800, a school's mean is unlikely to reach those extremes because not every test taker in the school would score the minimum or maximum possible.
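A quick simulation illustrates this averaging effect. In the sketch below, the individual-level mean (450) and SD (110) are arbitrary values chosen for illustration, not estimates from the data; the point is that per-school means span a much narrower range than the 200-800 scale.
set.seed(1)
# plausible school sizes, matching the observed range of test takers
school_sizes <- sample(7:1047, 386, replace = TRUE)
school_means <- sapply(school_sizes, function(n) {
  # hypothetical individual scores, clamped to the 200-800 SAT scale
  scores <- pmin(pmax(rnorm(n, mean = 450, sd = 110), 200), 800)
  mean(scores)
})
range(school_means)  # much narrower than c(200, 800)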
The median and mean reading score were higher in the mock datasets than in the real dataset.
reading_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_reading_score),
max = max(mean_reading_score),
median = median(mean_reading_score),
IQR = round(IQR(mean_reading_score), 2),
mean = round(mean(mean_reading_score), 2),
SD = round(sd(mean_reading_score), 2),
) %>%
kbl() %>%
kable_material()
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| mock1 | 386 | 204 | 800 | 493.5 | 325.75 | 499.38 | 178.20 |
| mock2 | 772 | 200 | 800 | 493.0 | 304.25 | 497.44 | 174.81 |
| real | 386 | 291 | 674 | 392.5 | 49.00 | 404.25 | 56.82 |
As with the mean reading scores, the distribution of mean math scores was similar in both mock datasets but differed from the real dataset.
math_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_math_score),
data.frame(SAT_scores_mock1$mean_math_score),
data.frame(SAT_scores_mock2$mean_math_score))) %>%
set_names("real", "mock1", "mock2") %>%
gather(., key = "dataset", value = "mean_math_score") %>%
drop_na()
ggplot(math_score_all, aes(x = mean_math_score, fill = dataset)) +
geom_boxplot()
The similarity and differences among summary statistics in the mock vs real datasets were similar to those in the reading score analysis (section 3.2).
math_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_math_score),
max = max(mean_math_score),
median = median(mean_math_score),
IQR = round(IQR(mean_math_score), 2),
mean = round(mean(mean_math_score), 2),
SD = round(sd(mean_math_score), 2),
) %>%
kbl() %>%
kable_material()
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| mock1 | 386 | 200 | 798 | 502.0 | 295.75 | 502.76 | 173.79 |
| mock2 | 772 | 200 | 800 | 508.0 | 303.25 | 501.19 | 174.80 |
| real | 386 | 281 | 735 | 394.5 | 57.75 | 412.94 | 64.99 |
As with the mean reading and math scores, the distribution of mean writing scores was similar in both mock datasets but differed from the real dataset.
writing_score_all <- as_tibble(
gdata::cbindX(data.frame(SAT_scores_real$mean_writing_score),
data.frame(SAT_scores_mock1$mean_writing_score),
data.frame(SAT_scores_mock2$mean_writing_score))) %>%
set_names("real", "mock1", "mock2") %>%
gather(., key = "dataset", value = "mean_writing_score") %>%
drop_na()
ggplot(writing_score_all, aes(x = mean_writing_score, fill = dataset)) +
geom_boxplot()
The similarity and differences among summary statistics in the mock vs real datasets were similar to those in the reading and math score analyses (sections 3.2 and 3.3).
writing_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_writing_score),
max = max(mean_writing_score),
median = median(mean_writing_score),
IQR = round(IQR(mean_writing_score), 2),
mean = round(mean(mean_writing_score), 2),
SD = round(sd(mean_writing_score), 2),
) %>%
kbl() %>%
kable_material()
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| mock1 | 386 | 201 | 799 | 504 | 289.50 | 504.10 | 168.92 |
| mock2 | 772 | 203 | 800 | 503 | 290.25 | 504.46 | 169.70 |
| real | 386 | 285 | 678 | 383 | 50.00 | 397.69 | 57.76 |
The mock datasets resembled the real dataset only in the range of the number of test takers, which was specified in the input parameters. With respect to other summary statistics (eg, min, max, and median scores), neither mock dataset closely resembled the real dataset. Increasing the size of the mock dataset didn't alter the differences from the real dataset.
My first idea to improve the mock data was to analyze the distribution of values in the real dataset and then customize Mockaroo to generate mock data that fit those distributions.
The histograms below suggest that the test section scores and the number of test takers in the real dataset are approximately normally distributed with a right skew.
scores_real_all <- gather(SAT_scores_real, key = "variable", value = "value",
c("n_test_takers", "mean_reading_score", "mean_math_score",
"mean_writing_score"))
ggplot(scores_real_all, aes(x = value, fill = variable)) +
geom_histogram(binwidth = 20) +
facet_wrap(~ factor(variable, c("mean_reading_score", "mean_math_score",
"mean_writing_score", "n_test_takers")), scales = "free") +
theme(legend.position = "none")
I confirmed this with quantile-quantile (Q-Q) plots, which were consistent with right-skewed, approximately normal distributions.
ggplot(scores_real_all, aes(sample = value)) +
stat_qq() +
stat_qq_line(color = "steelblue") +
xlab("Theoretical") + ylab("Sample") +
facet_wrap(vars(variable), scales='free_y')
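As an additional check (not part of the original analysis), a formal test such as Shapiro-Wilk could quantify these departures from normality; given the skew, small p-values are expected here.
# Shapiro-Wilk test on each numeric variable; small p-values indicate
# departure from a normal distribution
SAT_scores_real %>%
  select(where(is.numeric)) %>%
  map(~ shapiro.test(.x)$p.value)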
The mean and standard deviation (SD) of each variable are shown below.
scores_real_all %>%
group_by(variable) %>%
summarise(
mean = round(mean(value), 1),
SD = round(sd(value), 1),
) %>%
kbl() %>%
kable_material()
| variable | mean | SD |
|---|---|---|
| mean_math_score | 412.9 | 65.0 |
| mean_reading_score | 404.2 | 56.8 |
| mean_writing_score | 397.7 | 57.8 |
| n_test_takers | 103.7 | 145.3 |
Because the SD for n_test_takers is greater than its mean, random numbers sampled from a normal distribution with those parameters could be negative and therefore not meaningful. There didn't seem to be any good way to constrain Mockaroo's normal distribution to positive values (though see the aside below), so I resorted to using the Custom List field type in Mockaroo with the random selection option and pasted a random sample of 10% (39 rows) or 20% (78 rows) of the real data into the field. I chose these percentages to give sample sizes large enough for the selected values to approximate the underlying distribution (in the spirit of the Law of Large Numbers). The random selection should ensure that the mock dataset is not an exact duplicate of the real data.
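As an aside, a positive-only (truncated) normal sample is easy to generate in R, for example by rejection sampling (the truncnorm package also offers this directly); it was not used here because the data had to be generated inside Mockaroo, and note that truncation shifts the mean upward.
# Rejection sampling: redraw until enough positive values have accumulated
rnorm_positive <- function(n, mean, sd) {
  out <- numeric(0)
  while (length(out) < n) {
    draws <- rnorm(n, mean, sd)
    out <- c(out, draws[draws > 0])
  }
  out[1:n]
}
# eg, with the n_test_takers parameters from the table above
sim_test_takers <- rnorm_positive(386, mean = 103.7, sd = 145.3)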
10% sample
set.seed(128)
# randomly sample row numbers
row_numbers10 <- sample(nrow(SAT_scores_real), 39)
# get the corresponding values
# n_test_takers is column 3
n_test_takers10 <- SAT_scores_real[row_numbers10, 3]
# create input string for Mockaroo
paste(unlist(n_test_takers10), collapse = ", ")
## [1] "116, 86, 302, 627, 71, 32, 136, 90, 14, 10, 17, 22, 50, 13, 50, 48, 31, 1047, 32, 29, 103, 10, 125, 57, 54, 42, 16, 46, 41, 40, 60, 53, 130, 63, 87, 50, 60, 51, 54"
20% sample
set.seed(128)
row_numbers20 <- sample(nrow(SAT_scores_real), 78)
n_test_takers20 <- SAT_scores_real[row_numbers20, 3]
paste(unlist(n_test_takers20), collapse = ", ")
## [1] "116, 86, 302, 627, 71, 32, 136, 90, 14, 10, 17, 22, 50, 13, 50, 48, 31, 1047, 32, 29, 103, 10, 125, 57, 54, 42, 16, 46, 41, 40, 60, 53, 130, 63, 87, 50, 60, 51, 54, 66, 92, 121, 113, 81, 35, 15, 19, 67, 22, 22, 18, 75, 50, 397, 105, 287, 44, 50, 50, 48, 68, 641, 71, 75, 192, 43, 52, 49, 104, 212, 63, 24, 686, 343, 63, 71, 46, 14"
Using the insights above, I modified the Mockaroo input parameters as shown below.
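The screenshot of the modified parameters is not reproduced here; in outline (an approximate reconstruction, not a verbatim export), the changes were:
# Modified schema: score fields switched to Mockaroo's "Normal Distribution"
# type with the means/SDs computed above; n_test_takers switched to a
# "Custom List" seeded with the 10% or 20% samples of real values
tibble::tribble(
  ~field, ~mockaroo_type, ~parameters,
  "n_test_takers", "Custom List (random selection)", "10% or 20% sample of real values",
  "mean_reading_score", "Normal Distribution", "mean = 404.2, SD = 56.8",
  "mean_math_score", "Normal Distribution", "mean = 412.9, SD = 65.0",
  "mean_writing_score", "Normal Distribution", "mean = 397.7, SD = 57.8"
)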
As before, I generated 2 mock datasets with 386 rows and 772 rows. I saved the CSV files to my GitHub repository and loaded them into dataframes.
SAT_scores_mock3 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA3.csv', show_col_types = FALSE)
SAT_scores_mock4 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA4.csv', show_col_types = FALSE)
5.1.3.1. Number of test takers
The distribution of the number of test takers in the new mock datasets (“mock3” and “mock4”) more closely resembled the distribution in the real dataset than the initial mock datasets (“mock1” and “mock2”) did. The larger modified mock dataset (mock4) included more outliers than the smaller one (mock3); however, neither had as many as the real dataset.
test_takers_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$n_test_takers),
data.frame(SAT_scores_mock1$n_test_takers),
data.frame(SAT_scores_mock2$n_test_takers),
data.frame(SAT_scores_mock3$n_test_takers),
data.frame(SAT_scores_mock4$n_test_takers))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4") %>%
gather(., key = "dataset", value = "n_test_takers") %>%
drop_na()
test_takers_all$dataset <- factor(test_takers_all$dataset,
levels = c("mock3", "mock4", "real", "mock1", "mock2"))
ggplot(test_takers_all, aes(x = n_test_takers, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4"))
The similarity of the modified mock datasets to the real dataset can also be seen in the summary statistics below.
test_takers_all$dataset <- factor(test_takers_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4"))
test_takers_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(n_test_takers),
max = max(n_test_takers),
median = median(n_test_takers),
IQR = round(IQR(n_test_takers), 2),
mean = round(mean(n_test_takers), 2),
SD = round(sd(n_test_takers), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(4, background = "#00BFC4") %>%
row_spec(5, background = "#00BF7D")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 7 | 1047 | 54.0 | 57.75 | 103.66 | 145.26 |
| mock1 | 386 | 7 | 1047 | 533.5 | 524.00 | 525.90 | 296.05 |
| mock2 | 772 | 8 | 1046 | 527.0 | 518.25 | 530.87 | 296.86 |
| mock3 | 386 | 10 | 1047 | 51.0 | 54.00 | 92.54 | 165.18 |
| mock4 | 772 | 10 | 1047 | 54.0 | 64.25 | 111.83 | 172.91 |
5.1.3.2. Reading score
The distribution of mean reading scores in the modified mock datasets also more closely resembles the distribution in the real dataset than the initial mock datasets.
reading_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_reading_score),
data.frame(SAT_scores_mock1$mean_reading_score),
data.frame(SAT_scores_mock2$mean_reading_score),
data.frame(SAT_scores_mock3$mean_reading_score),
data.frame(SAT_scores_mock4$mean_reading_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4") %>%
gather(., key = "dataset", value = "mean_reading_score") %>%
drop_na()
reading_score_all$dataset <- factor(reading_score_all$dataset,
levels = c("mock3", "mock4", "real", "mock1", "mock2"))
ggplot(reading_score_all, aes(x = mean_reading_score, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4"))
As expected, the means and standard deviations of the modified mock datasets are nearly equal to those of the real dataset, and the medians are close, although the IQRs are somewhat larger. The range of the mock data is more similar as well, although there are not as many high outliers as in the real dataset.
reading_score_all$dataset <- factor(reading_score_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4"))
reading_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_reading_score),
max = max(mean_reading_score),
median = median(mean_reading_score),
IQR = round(IQR(mean_reading_score), 2),
mean = round(mean(mean_reading_score), 2),
SD = round(sd(mean_reading_score), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(4, background = "#00BFC4") %>%
row_spec(5, background = "#00BF7D")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 291 | 674 | 392.5 | 49.00 | 404.25 | 56.82 |
| mock1 | 386 | 204 | 800 | 493.5 | 325.75 | 499.38 | 178.20 |
| mock2 | 772 | 200 | 800 | 493.0 | 304.25 | 497.44 | 174.81 |
| mock3 | 386 | 238 | 560 | 404.0 | 79.00 | 405.07 | 58.12 |
| mock4 | 772 | 257 | 554 | 408.0 | 76.00 | 408.86 | 55.11 |
5.1.3.3. Math score
Similar to the reading score, the distribution of mean math scores in the new mock datasets more closely resembles the distribution in the real dataset than the initial mock datasets.
math_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_math_score),
data.frame(SAT_scores_mock1$mean_math_score),
data.frame(SAT_scores_mock2$mean_math_score),
data.frame(SAT_scores_mock3$mean_math_score),
data.frame(SAT_scores_mock4$mean_math_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4") %>%
gather(., key = "dataset", value = "mean_math_score") %>%
drop_na()
math_score_all$dataset <- factor(math_score_all$dataset,
levels = c("mock3", "mock4", "real", "mock1", "mock2"))
ggplot(math_score_all, aes(x = mean_math_score, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4"))
The similarity can also be seen in the summary statistics below.
math_score_all$dataset <- factor(math_score_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4"))
math_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_math_score),
max = max(mean_math_score),
median = median(mean_math_score),
IQR = round(IQR(mean_math_score), 2),
mean = round(mean(mean_math_score), 2),
SD = round(sd(mean_math_score), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(4, background = "#00BFC4") %>%
row_spec(5, background = "#00BF7D")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 281 | 735 | 394.5 | 57.75 | 412.94 | 64.99 |
| mock1 | 386 | 200 | 798 | 502.0 | 295.75 | 502.76 | 173.79 |
| mock2 | 772 | 200 | 800 | 508.0 | 303.25 | 501.19 | 174.80 |
| mock3 | 386 | 218 | 592 | 407.0 | 78.00 | 408.19 | 59.53 |
| mock4 | 772 | 217 | 649 | 408.5 | 86.00 | 409.73 | 62.89 |
5.1.3.4. Writing score
Similar to the reading and math scores, the distribution of mean writing scores in the new mock datasets more closely resembles the distribution in the real dataset than the initial mock datasets.
writing_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_writing_score),
data.frame(SAT_scores_mock1$mean_writing_score),
data.frame(SAT_scores_mock2$mean_writing_score),
data.frame(SAT_scores_mock3$mean_writing_score),
data.frame(SAT_scores_mock4$mean_writing_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4") %>%
gather(., key = "dataset", value = "mean_writing_score") %>%
drop_na()
writing_score_all$dataset <- factor(writing_score_all$dataset,
levels = c("mock3", "mock4", "real", "mock1", "mock2"))
ggplot(writing_score_all, aes(x = mean_writing_score, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4"))
The similarity can also be seen in the summary statistics below.
writing_score_all$dataset <- factor(writing_score_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4"))
writing_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_writing_score),
max = max(mean_writing_score),
median = median(mean_writing_score),
IQR = round(IQR(mean_writing_score), 2),
mean = round(mean(mean_writing_score), 2),
SD = round(sd(mean_writing_score), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(4, background = "#00BFC4") %>%
row_spec(5, background = "#00BF7D")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 285 | 678 | 383 | 50.00 | 397.69 | 57.76 |
| mock1 | 386 | 201 | 799 | 504 | 289.50 | 504.10 | 168.92 |
| mock2 | 772 | 203 | 800 | 503 | 290.25 | 504.46 | 169.70 |
| mock3 | 386 | 158 | 566 | 395 | 70.75 | 396.87 | 59.27 |
| mock4 | 772 | 241 | 563 | 397 | 80.00 | 397.84 | 57.51 |
My second idea to improve the mock data was to determine whether any of the variables in the real data were correlated and, if so, to leverage the relationship to increase the similarity between the real and mock data.
Mean reading and writing scores were strongly linearly correlated (\(R^2 = 0.9573\)). The regression model is \(mean\_reading\_score = 21.52677 + 0.96236 \times mean\_writing\_score\).
lm_reading_writing_real <- SAT_scores_real %>%
select(mean_reading_score, mean_writing_score) %>%
lm(mean_reading_score ~ mean_writing_score, data = .)
summary(lm_reading_writing_real)
##
## Call:
## lm(formula = mean_reading_score ~ mean_writing_score, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.526 -6.676 0.068 6.577 70.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.52677 4.16927 5.163 3.91e-07 ***
## mean_writing_score 0.96236 0.01038 92.757 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.76 on 384 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9572
## F-statistic: 8604 on 1 and 384 DF, p-value: < 2.2e-16
coeff <- round(cor(SAT_scores_real$mean_writing_score, SAT_scores_real$mean_reading_score,
method = c("pearson")), 2)
ggplot(SAT_scores_real, aes(x = mean_writing_score, y = mean_reading_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
geom_text(x = 550, y = 650,
label = paste0('r = ', coeff),
color = 'blue')
Given the strong correlation between mean reading and writing scores, I averaged the two scores and found a somewhat weaker correlation between this average and the mean math score (\(R^2 = 0.816\)). The regression model is \(mean\_math\_score = -0.21410 + 1.03038 \times average\_reading\_writing\_score\).
SAT_scores_real2 <- SAT_scores_real %>%
mutate(
avg_reading_writing_score = (mean_reading_score + mean_writing_score) / 2
)
lm_reading_writing_math_real <- SAT_scores_real2 %>%
select(avg_reading_writing_score, mean_math_score) %>%
lm(mean_math_score ~ avg_reading_writing_score, data = .)
summary(lm_reading_writing_math_real)
##
## Call:
## lm(formula = mean_math_score ~ avg_reading_writing_score, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.572 -15.276 -2.139 9.066 208.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.21410 10.11032 -0.021 0.983
## avg_reading_writing_score 1.03038 0.02496 41.274 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.91 on 384 degrees of freedom
## Multiple R-squared: 0.816, Adjusted R-squared: 0.8156
## F-statistic: 1704 on 1 and 384 DF, p-value: < 2.2e-16
coeff <- round(cor(SAT_scores_real2$mean_math_score,
SAT_scores_real2$avg_reading_writing_score,
method = c("pearson")), 2)
ggplot(SAT_scores_real2, aes(x = avg_reading_writing_score, y = mean_math_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
xlab("Average reading/writing score") +
geom_text(x = 550, y = 700,
label = paste0('r = ', coeff),
color = 'blue')
Using the regression formulas above, I modified the Mockaroo input parameters as shown below. To introduce variation, I also created 2 offset terms using a normal distribution with mean 0 and the residual standard error from the regression models.
I did not change the method for generating mock data for the n_test_takers or mean_writing_score fields.
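For reference, the generation logic encoded in the Mockaroo formula fields can be mirrored in R as follows. This is a sketch for local verification, not the code that produced mock5/mock6; drawing the writing scores from the previous normal-distribution parameters is an assumption based on the description above.
set.seed(42)
n <- 386
# writing scores: same normal-distribution parameters as in the previous schema
mean_writing_score <- rnorm(n, mean = 397.7, sd = 57.8)
# offset terms: mean 0, SD = residual standard error of each regression model
offset_reading <- rnorm(n, mean = 0, sd = 11.76)
offset_math <- rnorm(n, mean = 0, sd = 27.91)
# apply the regression formulas fit above
mean_reading_score <- 21.52677 + 0.96236 * mean_writing_score + offset_reading
mean_math_score <- -0.21410 +
  1.03038 * ((mean_reading_score + mean_writing_score) / 2) + offset_math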
As before, I generated 2 mock datasets with 386 rows and 772 rows. I saved the CSV files to my GitHub repository and loaded them into dataframes.
SAT_scores_mock5 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA5.csv', show_col_types = FALSE)
SAT_scores_mock6 <- read_csv('https://raw.githubusercontent.com/alexandersimon1/Data604/main/Assignment3/MOCK_DATA6.csv', show_col_types = FALSE)
5.2.4.1. Reading score
The distribution of mean reading scores in the new modified mock datasets (“mock5” and “mock6”) was very similar to the distribution in the previous modified datasets (“mock3” and “mock4”). All 4 modified datasets more closely resembled the distribution in the real dataset than the initial mock datasets (“mock1” and “mock2”) did.
reading_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_reading_score),
data.frame(SAT_scores_mock1$mean_reading_score),
data.frame(SAT_scores_mock2$mean_reading_score),
data.frame(SAT_scores_mock3$mean_reading_score),
data.frame(SAT_scores_mock4$mean_reading_score),
data.frame(SAT_scores_mock5$mean_reading_score),
data.frame(SAT_scores_mock6$mean_reading_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6") %>%
gather(., key = "dataset", value = "mean_reading_score") %>%
drop_na()
reading_score_all$dataset <- factor(reading_score_all$dataset,
levels = c("mock3", "mock4",
"mock5", "mock6", "real", "mock1", "mock2"))
ggplot(reading_score_all, aes(x = mean_reading_score, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D",
"#CCCC00", "#FF9966"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6"))
The summary statistics for the mock5 and mock6 datasets were similar to those for the mock3 and mock4 datasets.
reading_score_all$dataset <- factor(reading_score_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4",
"mock5", "mock6"))
reading_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_reading_score),
max = max(mean_reading_score),
median = median(mean_reading_score),
IQR = round(IQR(mean_reading_score), 2),
mean = round(mean(mean_reading_score), 2),
SD = round(sd(mean_reading_score), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(6, background = "#CCCC00") %>%
row_spec(7, background = "#FF9966")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 291 | 674 | 392.5 | 49.00 | 404.25 | 56.82 |
| mock1 | 386 | 204 | 800 | 493.5 | 325.75 | 499.38 | 178.20 |
| mock2 | 772 | 200 | 800 | 493.0 | 304.25 | 497.44 | 174.81 |
| mock3 | 386 | 238 | 560 | 404.0 | 79.00 | 405.07 | 58.12 |
| mock4 | 772 | 257 | 554 | 408.0 | 76.00 | 408.86 | 55.11 |
| mock5 | 386 | 232 | 617 | 400.0 | 73.75 | 403.66 | 57.43 |
| mock6 | 772 | 226 | 598 | 405.0 | 84.50 | 406.82 | 60.03 |
As expected, the mean reading and writing scores were strongly correlated in the new mock datasets (only mock5 is shown).
coeff <- round(cor(SAT_scores_mock5$mean_writing_score, SAT_scores_mock5$mean_reading_score,
method = c("pearson")), 2)
ggplot(SAT_scores_mock5, aes(x = mean_writing_score, y = mean_reading_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
geom_text(x = 550, y = 650,
label = paste0('r = ', coeff),
color = 'blue') +
ggtitle("Mock5")
In contrast, the mean reading and writing scores in the previous iteration of mock data were not correlated (only mock3 is shown).
coeff <- round(cor(SAT_scores_mock3$mean_writing_score, SAT_scores_mock3$mean_reading_score,
method = c("pearson")), 2)
ggplot(SAT_scores_mock3, aes(x = mean_writing_score, y = mean_reading_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
geom_text(x = 650, y = 425,
label = paste0('r = ', coeff),
color = 'blue') +
ggtitle("Mock3")
5.2.4.2. Math score
The distribution of mean math scores in the modified mock datasets (“mock5” and “mock6”) was very similar to the distribution in the mock3 and mock4 datasets. All 4 modified datasets more closely resembled the distribution in the real dataset than the initial mock datasets (“mock1” and “mock2”) did.
math_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_math_score),
data.frame(SAT_scores_mock1$mean_math_score),
data.frame(SAT_scores_mock2$mean_math_score),
data.frame(SAT_scores_mock3$mean_math_score),
data.frame(SAT_scores_mock4$mean_math_score),
data.frame(SAT_scores_mock5$mean_math_score),
data.frame(SAT_scores_mock6$mean_math_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6") %>%
gather(., key = "dataset", value = "mean_math_score") %>%
drop_na()
math_score_all$dataset <- factor(math_score_all$dataset,
levels = c("mock3", "mock4",
"mock5", "mock6", "real", "mock1", "mock2"))
ggplot(math_score_all, aes(x = mean_math_score, fill = dataset)) +
geom_boxplot() +
scale_fill_manual(values = c("#F8766D", "lightgray", "darkgray", "#00BFC4", "#00BF7D",
"#CCCC00", "#FF9966"),
breaks = c("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6"))
The summary statistics for the mock5 and mock6 datasets were similar to those for the mock3 and mock4 datasets. However, the range of math scores in the mock5 dataset more closely resembled the real dataset than the mock6 dataset did. In addition, the mean math score in the mock5 dataset was nearly identical to the mean in the real dataset.
math_score_all$dataset <- factor(math_score_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4",
"mock5", "mock6"))
math_score_all %>%
group_by(dataset) %>%
summarise(
n = n(),
min = min(mean_math_score),
max = max(mean_math_score),
median = median(mean_math_score),
IQR = round(IQR(mean_math_score), 2),
mean = round(mean(mean_math_score), 2),
SD = round(sd(mean_math_score), 2),
) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(6, background = "#CCCC00")
| dataset | n | min | max | median | IQR | mean | SD |
|---|---|---|---|---|---|---|---|
| real | 386 | 281 | 735 | 394.5 | 57.75 | 412.94 | 64.99 |
| mock1 | 386 | 200 | 798 | 502.0 | 295.75 | 502.76 | 173.79 |
| mock2 | 772 | 200 | 800 | 508.0 | 303.25 | 501.19 | 174.80 |
| mock3 | 386 | 218 | 592 | 407.0 | 78.00 | 408.19 | 59.53 |
| mock4 | 772 | 217 | 649 | 408.5 | 86.00 | 409.73 | 62.89 |
| mock5 | 386 | 216 | 669 | 410.0 | 83.00 | 412.91 | 68.18 |
| mock6 | 772 | 189 | 622 | 415.0 | 94.00 | 416.27 | 67.92 |
Both the mean reading and writing scores were strongly correlated with the mean math score in the new mock datasets (only mock5 reading vs math is shown).
coeff <- round(cor(SAT_scores_mock5$mean_reading_score, SAT_scores_mock5$mean_math_score,
method = c("pearson")), 2)
ggplot(SAT_scores_mock5, aes(x = mean_reading_score, y = mean_math_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
geom_text(x = 550, y = 650,
label = paste0('r = ', coeff),
color = 'blue') +
ggtitle("Mock5")
In contrast, the mean reading/writing and math scores in the previous iteration of mock data were not correlated (only mock3 reading vs math is shown).
coeff <- round(cor(SAT_scores_mock3$mean_reading_score, SAT_scores_mock3$mean_math_score,
method = c("pearson")), 2)
ggplot(SAT_scores_mock3, aes(x = mean_reading_score, y = mean_math_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
xlim(200, 800) + ylim(200, 800) +
geom_text(x = 650, y = 425,
label = paste0('r = ', coeff),
color = 'blue') +
ggtitle("Mock3")
In one of the readings for this week,¹ a formula is given for determining whether a simulated dataset has sufficient values to achieve accuracy \(\epsilon\).
\[z_\alpha\sqrt{\frac{1}{n(n-1)}\sum_{k=1}^{n}(x_k - \mu)^2} \le \epsilon\]
where \(\epsilon\) is the maximum allowed error, \(z_\alpha\) is the Z score at significance level \(\alpha\), \(n\) is the number of simulated values, \(x_k\) are the simulated data, and \(\mu\) is their mean.
Recognizing that the term inside the square root contains the formula for the sample variance (the square of the standard deviation, SD), and that \(z_\alpha \approx 1.96\) for \(\alpha = 0.05\), the above simplifies to
\[1.96 \times \frac{SD}{\sqrt{n}} \le \epsilon\]
The left-hand side is the formula for margin of error in random sampling. Here, I use it as a metric of accuracy to compare the real vs mock datasets.
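As a worked check, for the number of test takers in the real dataset (SD = 145.26, \(n = 386\)), the margin of error is \(1.96 \times 145.26 / \sqrt{386} \approx 14.49\), which matches the first row of the table below.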
The table below suggests that the mock3 and mock5 datasets have the overall accuracy most similar to the real dataset (ie, across all variables). Increasing the dataset size brought the margin of error closer to the real dataset's for the initial “naive” approach of generating mock data (mock2 vs mock1) but moved it further away when the data distribution and the relationships between variables were taken into account (mock4 vs mock3 and mock6 vs mock5, respectively).
test_takers_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$n_test_takers),
data.frame(SAT_scores_mock1$n_test_takers),
data.frame(SAT_scores_mock2$n_test_takers),
data.frame(SAT_scores_mock3$n_test_takers),
data.frame(SAT_scores_mock4$n_test_takers),
data.frame(SAT_scores_mock5$n_test_takers),
data.frame(SAT_scores_mock6$n_test_takers))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6") %>%
gather(., key = "dataset", value = "n_test_takers") %>%
drop_na()
writing_score_all <- as_tibble(gdata::cbindX(data.frame(SAT_scores_real$mean_writing_score),
data.frame(SAT_scores_mock1$mean_writing_score),
data.frame(SAT_scores_mock2$mean_writing_score),
data.frame(SAT_scores_mock3$mean_writing_score),
data.frame(SAT_scores_mock4$mean_writing_score),
data.frame(SAT_scores_mock5$mean_writing_score),
data.frame(SAT_scores_mock6$mean_writing_score))) %>%
set_names("real", "mock1", "mock2", "mock3", "mock4", "mock5", "mock6") %>%
gather(., key = "dataset", value = "mean_writing_score") %>%
drop_na()
calc_epsilon <- function(std_dev, n) {
return(round((1.96 * std_dev) / sqrt(n), 2))
}
# tibble() is used rather than cbind() so the numeric columns are not coerced to character
data_all <- tibble(dataset = as.character(test_takers_all$dataset),
                   n_test_takers = test_takers_all$n_test_takers,
                   mean_reading_score = reading_score_all$mean_reading_score,
                   mean_math_score = math_score_all$mean_math_score,
                   mean_writing_score = writing_score_all$mean_writing_score)
data_all$dataset <- factor(data_all$dataset,
levels = c("real", "mock1", "mock2", "mock3", "mock4",
"mock5", "mock6"))
data_all %>%
group_by(dataset) %>%
summarise(
n = n(),
test_takers_SD = round(sd(n_test_takers), 2),
test_takers_epsilon = calc_epsilon(test_takers_SD, n),
reading_SD = round(sd(mean_reading_score), 2),
reading_epsilon = calc_epsilon(reading_SD, n),
math_SD = round(sd(mean_math_score), 2),
math_epsilon = calc_epsilon(math_SD, n),
writing_SD = round(sd(mean_writing_score), 2),
writing_epsilon = calc_epsilon(writing_SD, n),
) %>%
select(dataset, n, test_takers_epsilon, reading_epsilon, math_epsilon, writing_epsilon) %>%
kbl() %>%
kable_material() %>%
row_spec(1, background = "#F8766D") %>%
row_spec(4, background = "#00BFC4") %>%
row_spec(6, background = "#CCCC00")
| dataset | n | test_takers_epsilon | reading_epsilon | math_epsilon | writing_epsilon |
|---|---|---|---|---|---|
| real | 386 | 14.49 | 5.67 | 6.48 | 5.76 |
| mock1 | 386 | 29.53 | 17.78 | 17.34 | 16.85 |
| mock2 | 772 | 20.94 | 12.33 | 12.33 | 11.97 |
| mock3 | 386 | 16.48 | 5.80 | 5.94 | 5.91 |
| mock4 | 772 | 12.20 | 3.89 | 4.44 | 4.06 |
| mock5 | 386 | 17.81 | 5.73 | 6.80 | 5.96 |
| mock6 | 772 | 11.68 | 4.23 | 4.79 | 4.29 |
The addition of the second set of improvements (relationships between variables) did not appear to affect the overall accuracy significantly. Because these improvements enhanced the resemblance of the mock data to the real data, I considered the mock5 dataset to be superior to the mock3 dataset.
These analyses show that mimicking the distribution of values for each variable in the real data, as well as the relationships between variables, improves the accuracy of mock data relative to real data. Doubling the amount of mock data generated did not seem to be very helpful. Mock SAT exam score data generated with these methods are probably reliable substitutes for real SAT scores. These conclusions are limited to this analysis of SAT exam scores, but I would expect similar principles to hold for other datasets.
It should be noted that these methods of simulating data do not replicate all characteristics of real data, such as outliers, which is a known challenge of generating synthetic data.² To further increase the fidelity to real data in datasets with outliers, additional data may need to be “spiked in” to the Mockaroo-generated data, as sketched below.
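For example, a minimal sketch of the spike-in idea (hypothetical; it assumes the mock CSV uses the same column names as the cleaned real data, and reusing real rows verbatim may be undesirable if the goal is to avoid disclosing real data):
# Append a random half of the real dataset's extreme rows to the mock data
outlier_threshold <- quantile(SAT_scores_real$n_test_takers, 0.95)
spiked_rows <- SAT_scores_real %>%
  filter(n_test_takers > outlier_threshold) %>%
  slice_sample(prop = 0.5)
SAT_scores_mock5_spiked <- bind_rows(SAT_scores_mock5, spiked_rows)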
1. Milić DC and Kvesić L. Role of random numbers in simulations of economic processes. Interdisciplinary Management Research 2008;4:562-570. The formula is on page 569.
2. Dilmegani C. (2024, January 18). Synthetic data vs real data: Benefits, challenges in 2024. AI Multiple Research. https://research.aimultiple.com/synthetic-data-vs-real-data/