DATA DIVE 3

A collection of 5-10 random samples of data (with replacement) from at least 6 columns of data.

- Each subsample should be roughly 50% as long as your data. We are simulating the act of collecting data from a population, where the “population” is represented by the data set you already have.
- Store each sample set in a separate data frame (e.g., df_i might contain m rows from columns 1-6).
- These subsamples should include both categorical and continuous (numeric) data.

Scrutinize these subsamples.

- How different are they?
- What would you have called an anomaly in one subsample that you wouldn’t in another?
- Are there aspects of the data that are consistent among all subsamples?

Consider how this investigation affects how you might draw conclusions about the data in the future.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions that might need to be investigated.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
#Loading the dataset
data <- read_delim("data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Subsample columns

Our subsamples include seven columns: “Marital status”, “Gender”, “Admission grade”, “Course”, “Scholarship holder”, “Age at enrollment”, and “Unemployment rate”. We also set a seed value so that the subsamples do not change each time the R Markdown file is knitted.
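The sampling step described above could look roughly like the sketch below. It assumes the `data` tibble loaded earlier and uses `dplyr::slice_sample()` with `replace = TRUE`. The helper name `draw_subsamples`, the seed value 123, and the choice of five subsamples are illustrative assumptions, since the report does not show the original chunk.

```r
library(dplyr)

# Columns named in the text above
cols <- c("Marital status", "Gender", "Admission grade", "Course",
          "Scholarship holder", "Age at enrollment", "Unemployment rate")

# Draw `n_samples` subsamples, each with `prop` * nrow(df) rows,
# sampling rows with replacement. `draw_subsamples` is a hypothetical helper.
draw_subsamples <- function(df, cols, n_samples = 5, prop = 0.5, seed = 123) {
  set.seed(seed)  # assumed seed value; fixes the draws across knits
  lapply(seq_len(n_samples), function(i) {
    df %>%
      select(all_of(cols)) %>%
      slice_sample(prop = prop, replace = TRUE)
  })
}

# Usage with the `data` tibble loaded earlier:
# subsamples <- draw_subsamples(data, cols)
# df_1 <- subsamples[[1]]
```

With `prop = 0.5` and 4,424 rows, each subsample has 2,212 rows, which is what the group counts in the Sub sample 1 table add up to.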

## `summarise()` has grouped output by 'Marital status'. You can override using
## the `.groups` argument.

Sub sample 1

| Marital status | Gender | Count_Marital_Status | Max_Admission_Grade | Min_Admission_Grade | Mean_Admission_Grade | Count_Scholarship_Holder | Max_Age_At_Enrollment | Min_Age_At_Enrollment | Mean_Age_At_Enrollment | Max_Unemployment_Rate | Min_Unemployment_Rate | Mean_Unemployment_Rate |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 0 | 1296 | 190.0 | 95.0 | 126.9151 | 433 | 60 | 17 | 20.83179 | 16.2 | 7.6 | 11.53056 |
| 1 | 1 | 686 | 170.0 | 95.0 | 126.4611 | 103 | 58 | 17 | 22.73469 | 16.2 | 7.6 | 11.45248 |
| 2 | 0 | 94 | 170.0 | 96.0 | 122.3043 | 7 | 50 | 18 | 34.62766 | 16.2 | 7.6 | 10.55319 |
| 2 | 1 | 77 | 168.2 | 100.0 | 136.5286 | 4 | 52 | 21 | 38.15584 | 16.2 | 7.6 | 11.48312 |
| 3 | 0 | 2 | 133.5 | 133.5 | 133.5000 | 2 | 21 | 21 | 21.00000 | 12.7 | 12.7 | 12.70000 |
| 4 | 0 | 37 | 153.0 | 95.0 | 119.6649 | 9 | 48 | 24 | 34.91892 | 16.2 | 7.6 | 11.07027 |
| 4 | 1 | 7 | 170.0 | 103.0 | 132.0571 | 0 | 70 | 29 | 45.14286 | 15.5 | 7.6 | 11.51429 |
| 5 | 0 | 9 | 140.5 | 110.0 | 119.7667 | 0 | 41 | 23 | 31.11111 | 16.2 | 8.9 | 12.65556 |
| 5 | 1 | 1 | 146.0 | 146.0 | 146.0000 | 0 | 34 | 34 | 34.00000 | 12.4 | 12.4 | 12.40000 |
| 6 | 0 | 3 | 119.0 | 110.0 | 114.6667 | 1 | 48 | 24 | 39.66667 | 10.8 | 9.4 | 10.33333 |
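A summary table with the columns shown above could be produced with a grouped `summarise()`, sketched below. The function name `summarise_subsample` is hypothetical, and the sketch assumes that “Scholarship holder” is coded 0/1 and that `Count_Scholarship_Holder` counts the rows coded 1; the report does not show how the original columns were computed.

```r
library(dplyr)

# Summarise one subsample by marital status and gender.
# Assumption: `Scholarship holder` is 0/1, with 1 meaning "holds a scholarship".
summarise_subsample <- function(df) {
  df %>%
    group_by(`Marital status`, Gender) %>%
    summarise(
      Count_Marital_Status     = n(),
      Max_Admission_Grade      = max(`Admission grade`),
      Min_Admission_Grade      = min(`Admission grade`),
      Mean_Admission_Grade     = mean(`Admission grade`),
      Count_Scholarship_Holder = sum(`Scholarship holder` == 1),
      Max_Age_At_Enrollment    = max(`Age at enrollment`),
      Min_Age_At_Enrollment    = min(`Age at enrollment`),
      Mean_Age_At_Enrollment   = mean(`Age at enrollment`),
      Max_Unemployment_Rate    = max(`Unemployment rate`),
      Min_Unemployment_Rate    = min(`Unemployment rate`),
      Mean_Unemployment_Rate   = mean(`Unemployment rate`),
      .groups = "drop"  # return an ungrouped tibble
    )
}

# Usage, given a subsample data frame such as df_1:
# summarise_subsample(df_1)
```

Setting `.groups` explicitly also silences the "`summarise()` has grouped output" message that appears in the knitted output.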

Sub sample 2

| Marital status | Gender | Count_Marital_Status | Max_Admission_Grade | Min_Admission_Grade | Mean_Admission_Grade | Count_Scholarship_Holder | Max_Age_At_Enrollment | Min_Age_At_Enrollment | Mean_Age_At_Enrollment | Max_Unemployment_Rate | Min_Unemployment_Rate | Mean_Unemployment_Rate |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 0 | 1256 | 190.0 | 95.0 | 127.3920 | 412 | 54 | 17 | 20.63774 | 16.2 | 7.6 | 11.60111 |
| 1 | 1 | 703 | 180.0 | 95.0 | 126.1442 | 108 | 61 | 18 | 22.52916 | 16.2 | 7.6 | 11.63642 |
| 2 | 0 | 97 | 160.5 | 98.5 | 123.2175 | 13 | 50 | 18 | 35.43299 | 16.2 | 7.6 | 11.40206 |
| 2 | 1 | 80 | 168.0 | 100.0 | 132.7725 | 3 | 57 | 24 | 38.22500 | 16.2 | 7.6 | 12.08750 |
| 3 | 0 | 1 | 170.0 | 170.0 | 170.0000 | 0 | 47 | 47 | 47.00000 | 8.9 | 8.9 | 8.90000 |
| 4 | 0 | 40 | 154.6 | 95.0 | 124.9500 | 15 | 53 | 24 | 37.37500 | 16.2 | 7.6 | 11.31500 |
| 4 | 1 | 10 | 149.8 | 103.0 | 129.5200 | 1 | 70 | 29 | 42.80000 | 15.5 | 10.8 | 12.03000 |
| 5 | 0 | 14 | 179.6 | 103.0 | 119.6071 | 0 | 62 | 23 | 39.64286 | 16.2 | 7.6 | 11.44286 |
| 5 | 1 | 7 | 146.0 | 116.0 | 133.8857 | 4 | 37 | 30 | 33.85714 | 16.2 | 8.9 | 13.44286 |
| 6 | 0 | 3 | 119.0 | 110.0 | 114.6000 | 2 | 48 | 36 | 41.66667 | 16.2 | 10.8 | 13.63333 |
| 6 | 1 | 1 | 119.0 | 119.0 | 119.0000 | 0 | 55 | 55 | 55.00000 | 7.6 | 7.6 | 7.60000 |

Sub sample 3

| Marital status | Gender | Count_Marital_Status | Max_Admission_Grade | Min_Admission_Grade | Mean_Admission_Grade | Count_Scholarship_Holder | Max_Age_At_Enrollment | Min_Age_At_Enrollment | Mean_Age_At_Enrollment | Max_Unemployment_Rate | Min_Unemployment_Rate | Mean_Unemployment_Rate |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 0 | 1291 | 190.0 | 95.0 | 127.0119 | 412 | 60 | 17 | 20.67622 | 16.2 | 7.6 | 11.66375 |
| 1 | 1 | 670 | 190.0 | 95.5 | 126.5158 | 116 | 54 | 18 | 22.34776 | 16.2 | 7.6 | 11.71955 |
| 2 | 0 | 110 | 170.0 | 100.0 | 125.7009 | 17 | 59 | 18 | 35.05455 | 16.2 | 7.6 | 11.02364 |
| 2 | 1 | 88 | 170.0 | 100.0 | 136.0057 | 3 | 57 | 21 | 37.02273 | 16.2 | 7.6 | 11.50682 |
| 3 | 0 | 1 | 133.5 | 133.5 | 133.5000 | 1 | 21 | 21 | 21.00000 | 12.7 | 12.7 | 12.70000 |
| 4 | 0 | 29 | 160.0 | 95.0 | 121.0793 | 11 | 53 | 24 | 38.75862 | 16.2 | 7.6 | 11.09310 |
| 4 | 1 | 9 | 150.0 | 100.9 | 119.0000 | 1 | 70 | 29 | 42.00000 | 16.2 | 7.6 | 12.85556 |
| 5 | 0 | 8 | 160.0 | 110.0 | 124.6625 | 1 | 37 | 18 | 28.12500 | 13.9 | 10.8 | 12.25000 |
| 5 | 1 | 2 | 119.8 | 116.0 | 117.9000 | 1 | 34 | 26 | 30.00000 | 11.1 | 9.4 | 10.25000 |
| 6 | 0 | 3 | 115.0 | 114.8 | 114.8667 | 2 | 41 | 24 | 35.33333 | 13.9 | 10.8 | 12.86667 |
| 6 | 1 | 1 | 119.0 | 119.0 | 119.0000 | 0 | 55 | 55 | 55.00000 | 7.6 | 7.6 | 7.60000 |

Sub sample 4

| Marital status | Gender | Count_Marital_Status | Max_Admission_Grade | Min_Admission_Grade | Mean_Admission_Grade | Count_Scholarship_Holder | Max_Age_At_Enrollment | Min_Age_At_Enrollment | Mean_Age_At_Enrollment | Max_Unemployment_Rate | Min_Unemployment_Rate | Mean_Unemployment_Rate |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 0 | 1271 | 184.4 | 95.0 | 127.6187 | 395 | 58 | 17 | 20.87333 | 16.2 | 7.6 | 11.60134 |
| 1 | 1 | 702 | 180.0 | 95.0 | 125.9623 | 128 | 61 | 18 | 22.64103 | 16.2 | 7.6 | 11.68262 |
| 2 | 0 | 107 | 170.0 | 95.5 | 122.3692 | 17 | 51 | 19 | 35.46729 | 16.2 | 7.6 | 10.73645 |
| 2 | 1 | 63 | 172.0 | 100.0 | 130.4556 | 5 | 54 | 21 | 39.55556 | 16.2 | 7.6 | 11.77460 |
| 3 | 0 | 1 | 170.0 | 170.0 | 170.0000 | 0 | 47 | 47 | 47.00000 | 8.9 | 8.9 | 8.90000 |
| 4 | 0 | 35 | 154.3 | 97.0 | 124.4629 | 17 | 48 | 24 | 37.11429 | 16.2 | 7.6 | 11.52571 |
| 4 | 1 | 12 | 151.0 | 107.2 | 129.5667 | 1 | 51 | 29 | 38.33333 | 15.5 | 7.6 | 11.93333 |
| 5 | 0 | 15 | 179.6 | 110.0 | 131.3000 | 0 | 46 | 21 | 31.13333 | 16.2 | 8.9 | 12.23333 |
| 5 | 1 | 4 | 140.0 | 114.0 | 128.4500 | 0 | 30 | 26 | 28.75000 | 15.5 | 8.9 | 12.75000 |
| 6 | 0 | 2 | 115.0 | 114.8 | 114.9000 | 1 | 41 | 24 | 32.50000 | 13.9 | 10.8 | 12.35000 |

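Across all the subsamples the gender split is the same: Gender 0 rows outnumber Gender 1 rows (for marital status 1, about 1,260 to 1,300 versus 670 to 700), so there are more females in the class. Scholarship holding is also consistent across subsamples; in the largest group (marital status 1, gender 0) the count of scholarship holders stays between 395 and 433. The unemployment rate is likewise consistent across subsamples and marital statuses: for marital statuses 1 and 2 the mean stays between about 10.6 and 12.1, and the minimum (7.6) and maximum (16.2) are identical in every subsample. The small groups are less stable. Marital status 3 has only one or two rows per subsample, so its mean admission grade is 133.5 in subsamples 1 and 3 but 170.0 in subsamples 2 and 4, and a value that looks like an anomaly in one subsample may simply be a single resampled row.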
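One way to put numbers on how much the subsamples differ is to stack the per-subsample summary tables and look at the spread of one statistic within each group. The sketch below is an illustration, not the method used in this report: `compare_subsamples` and its `summaries` argument (a list of summary tables with the columns shown above) are hypothetical names.

```r
library(dplyr)

# For each (marital status, gender) group, report how far one summary
# statistic moves across subsamples, alongside the smallest group size seen.
compare_subsamples <- function(summaries, stat = "Mean_Admission_Grade") {
  bind_rows(summaries, .id = "subsample") %>%
    group_by(`Marital status`, Gender) %>%
    summarise(
      n_subsamples   = n(),                        # subsamples containing the group
      min_group_size = min(Count_Marital_Status),  # smallest row count observed
      stat_min       = min(.data[[stat]]),
      stat_max       = max(.data[[stat]]),
      stat_spread    = stat_max - stat_min,
      .groups = "drop"
    ) %>%
    arrange(desc(stat_spread))  # least stable groups first
}

# Usage, given a list of summary tables (one per subsample):
# compare_subsamples(summaries, stat = "Mean_Unemployment_Rate")
```

Applied to the tables above, marital status 3 (gender 0) would show a spread of 36.5 in mean admission grade (133.5 to 170.0) with a minimum group size of 1, while marital status 1 (gender 0) would show a spread of about 0.7. Sorting by spread puts the groups where sampling noise dominates at the top.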