Data Dive 3

A collection of 5-10 random samples of data (with replacement) from at least 6 columns of data.
- Each subsample should be roughly 50% as long as your data.
- We are simulating the act of collecting data from a population, where the “population” is represented by the data set you already have.
- Store each sample set in a separate data frame (e.g., df_i might contain m rows from columns 1-6).
- These subsamples should include both categorical and continuous (numeric) data.

Scrutinize these subsamples.
- How different are they?
- What would you have called an anomaly in one subsample that you wouldn’t in another?
- Are there aspects of the data that are consistent among all subsamples?

Consider how this investigation affects how you might draw conclusions about the data in the future.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be investigated.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

#Loading the dataset
data <- read_delim("data.csv", delim = ";")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Subsample columns

The subsamples include seven columns: “Marital status”, “Gender”, “Admission grade”, “Course”, “Scholarship holder”, “Age at enrollment”, and “Unemployment rate”. We also set a seed so that the subsamples do not change each time the R Markdown file is knitted.
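The sampling code is not echoed in this report, so the following is only a minimal sketch of how such subsamples might be drawn. The helper name `draw_subsamples`, the seed value, the choice of five subsamples, and the toy data frame are all assumptions made for illustration; only the column names come from the text above.

```r
library(dplyr)
library(purrr)
library(tibble)

# The seven columns named in the text
cols <- c("Marital status", "Gender", "Admission grade", "Course",
          "Scholarship holder", "Age at enrollment", "Unemployment rate")

# Hypothetical helper: draw k subsamples with replacement,
# each containing `prop` of the rows of `df`.
draw_subsamples <- function(df, cols, k = 5, prop = 0.5, seed = 123) {
  set.seed(seed)  # fixed seed so every knit reproduces the same subsamples
  map(seq_len(k), function(i) {
    df %>%
      select(all_of(cols)) %>%
      slice_sample(prop = prop, replace = TRUE)
  })
}

# Toy stand-in for the real data set (values invented, not from data.csv)
set.seed(1)
toy <- tibble(
  `Marital status`     = sample(1:6, 200, replace = TRUE),
  Gender               = sample(0:1, 200, replace = TRUE),
  `Admission grade`    = runif(200, 95, 190),
  Course               = sample(1:5, 200, replace = TRUE),
  `Scholarship holder` = sample(0:1, 200, replace = TRUE),
  `Age at enrollment`  = sample(17:70, 200, replace = TRUE),
  `Unemployment rate`  = runif(200, 7.6, 16.2)
)

subsamples <- draw_subsamples(toy, cols)  # list of 5 data frames
```

With the real data, `draw_subsamples(data, cols)` would return subsamples of 2,212 rows each (50% of 4,424), which agrees with the group counts in each table below summing to 2,212.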

## `summarise()` has grouped output by 'Marital status'. You can override using
## the `.groups` argument.

Subsample 1

Marital status	Gender	Count_Marital_Status	Max_Admission_Grade	Min_Admission_Grade	Mean_Admission_Grade	Count_Scholarship_Holder	Max_Age_At_Enrollment	Min_Age_At_Enrollment	Mean_Age_At_Enrollment	Max_Unemployment_Rate	Min_Unemployment_Rate	Mean_Unemployment_Rate
1	0	1296	190.0	95.0	126.9151	433	60	17	20.83179	16.2	7.6	11.53056
1	1	686	170.0	95.0	126.4611	103	58	17	22.73469	16.2	7.6	11.45248
2	0	94	170.0	96.0	122.3043	7	50	18	34.62766	16.2	7.6	10.55319
2	1	77	168.2	100.0	136.5286	4	52	21	38.15584	16.2	7.6	11.48312
3	0	2	133.5	133.5	133.5000	2	21	21	21.00000	12.7	12.7	12.70000
4	0	37	153.0	95.0	119.6649	9	48	24	34.91892	16.2	7.6	11.07027
4	1	7	170.0	103.0	132.0571	0	70	29	45.14286	15.5	7.6	11.51429
5	0	9	140.5	110.0	119.7667	0	41	23	31.11111	16.2	8.9	12.65556
5	1	1	146.0	146.0	146.0000	0	34	34	34.00000	12.4	12.4	12.40000
6	0	3	119.0	110.0	114.6667	1	48	24	39.66667	10.8	9.4	10.33333
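A summary table of this shape could plausibly be produced with `group_by()` and `summarise()`. The sketch below is an assumption about how the columns were computed, not the report's actual code: the helper name `summarise_subsample` is hypothetical, only a few of the summary columns are shown, and `Count_Scholarship_Holder` is assumed to be the sum of a 0/1 indicator. Passing `.groups = "drop"` returns an ungrouped result and silences the `summarise()` grouping message.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: one summary row per (Marital status, Gender) group.
summarise_subsample <- function(df) {
  df %>%
    group_by(`Marital status`, Gender) %>%
    summarise(
      Count_Marital_Status     = n(),
      Max_Admission_Grade      = max(`Admission grade`),
      Min_Admission_Grade      = min(`Admission grade`),
      Mean_Admission_Grade     = mean(`Admission grade`),
      # Assumes `Scholarship holder` is coded 0/1
      Count_Scholarship_Holder = sum(`Scholarship holder`),
      .groups = "drop"  # ungrouped output, no grouping message
    )
}

# Toy rows invented purely to exercise the helper (not from data.csv)
toy <- tibble(
  `Marital status`     = c(1, 1, 1, 2),
  Gender               = c(0, 0, 1, 0),
  `Admission grade`    = c(120, 130, 110, 140),
  `Scholarship holder` = c(1, 0, 0, 1)
)

result <- summarise_subsample(toy)
```

The age and unemployment columns would follow the same pattern with their own `max()`, `min()`, and `mean()` calls.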

Subsample 2

Marital status	Gender	Count_Marital_Status	Max_Admission_Grade	Min_Admission_Grade	Mean_Admission_Grade	Count_Scholarship_Holder	Max_Age_At_Enrollment	Min_Age_At_Enrollment	Mean_Age_At_Enrollment	Max_Unemployment_Rate	Min_Unemployment_Rate	Mean_Unemployment_Rate
1	0	1256	190.0	95.0	127.3920	412	54	17	20.63774	16.2	7.6	11.60111
1	1	703	180.0	95.0	126.1442	108	61	18	22.52916	16.2	7.6	11.63642
2	0	97	160.5	98.5	123.2175	13	50	18	35.43299	16.2	7.6	11.40206
2	1	80	168.0	100.0	132.7725	3	57	24	38.22500	16.2	7.6	12.08750
3	0	1	170.0	170.0	170.0000	0	47	47	47.00000	8.9	8.9	8.90000
4	0	40	154.6	95.0	124.9500	15	53	24	37.37500	16.2	7.6	11.31500
4	1	10	149.8	103.0	129.5200	1	70	29	42.80000	15.5	10.8	12.03000
5	0	14	179.6	103.0	119.6071	0	62	23	39.64286	16.2	7.6	11.44286
5	1	7	146.0	116.0	133.8857	4	37	30	33.85714	16.2	8.9	13.44286
6	0	3	119.0	110.0	114.6000	2	48	36	41.66667	16.2	10.8	13.63333
6	1	1	119.0	119.0	119.0000	0	55	55	55.00000	7.6	7.6	7.60000

Subsample 3

Marital status	Gender	Count_Marital_Status	Max_Admission_Grade	Min_Admission_Grade	Mean_Admission_Grade	Count_Scholarship_Holder	Max_Age_At_Enrollment	Min_Age_At_Enrollment	Mean_Age_At_Enrollment	Max_Unemployment_Rate	Min_Unemployment_Rate	Mean_Unemployment_Rate
1	0	1291	190.0	95.0	127.0119	412	60	17	20.67622	16.2	7.6	11.66375
1	1	670	190.0	95.5	126.5158	116	54	18	22.34776	16.2	7.6	11.71955
2	0	110	170.0	100.0	125.7009	17	59	18	35.05455	16.2	7.6	11.02364
2	1	88	170.0	100.0	136.0057	3	57	21	37.02273	16.2	7.6	11.50682
3	0	1	133.5	133.5	133.5000	1	21	21	21.00000	12.7	12.7	12.70000
4	0	29	160.0	95.0	121.0793	11	53	24	38.75862	16.2	7.6	11.09310
4	1	9	150.0	100.9	119.0000	1	70	29	42.00000	16.2	7.6	12.85556
5	0	8	160.0	110.0	124.6625	1	37	18	28.12500	13.9	10.8	12.25000
5	1	2	119.8	116.0	117.9000	1	34	26	30.00000	11.1	9.4	10.25000
6	0	3	115.0	114.8	114.8667	2	41	24	35.33333	13.9	10.8	12.86667
6	1	1	119.0	119.0	119.0000	0	55	55	55.00000	7.6	7.6	7.60000

Subsample 4

Marital status	Gender	Count_Marital_Status	Max_Admission_Grade	Min_Admission_Grade	Mean_Admission_Grade	Count_Scholarship_Holder	Max_Age_At_Enrollment	Min_Age_At_Enrollment	Mean_Age_At_Enrollment	Max_Unemployment_Rate	Min_Unemployment_Rate	Mean_Unemployment_Rate
1	0	1271	184.4	95.0	127.6187	395	58	17	20.87333	16.2	7.6	11.60134
1	1	702	180.0	95.0	125.9623	128	61	18	22.64103	16.2	7.6	11.68262
2	0	107	170.0	95.5	122.3692	17	51	19	35.46729	16.2	7.6	10.73645
2	1	63	172.0	100.0	130.4556	5	54	21	39.55556	16.2	7.6	11.77460
3	0	1	170.0	170.0	170.0000	0	47	47	47.00000	8.9	8.9	8.90000
4	0	35	154.3	97.0	124.4629	17	48	24	37.11429	16.2	7.6	11.52571
4	1	12	151.0	107.2	129.5667	1	51	29	38.33333	15.5	7.6	11.93333
5	0	15	179.6	110.0	131.3000	0	46	21	31.13333	16.2	8.9	12.23333
5	1	4	140.0	114.0	128.4500	0	30	26	28.75000	15.5	8.9	12.75000
6	0	2	115.0	114.8	114.9000	1	41	24	32.50000	13.9	10.8	12.35000

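Across all the subsamples the gender split is the same: there are more female than male students in every subsample (for example, roughly 1,250 to 1,300 versus 670 to 700 among marital status 1). Scholarship holding is also consistent across subsamples, with about 395 to 433 and 103 to 128 holders in those same two groups. The unemployment rate is consistent as well: in the two largest marital-status groups it spans the same range (7.6 to 16.2) in every subsample, with means between roughly 10.5 and 12.1. The smallest groups (marital status 3, 5, and 6) contain only a handful of rows, so their statistics change noticeably from one subsample to the next.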
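One way to make "how different are they?" concrete is to compare the same statistic for the same group across subsamples. The sketch below copies `Mean_Admission_Grade` and the group counts from the four tables above for two groups (marital status 1 and 3, both with Gender 0) and computes how far the mean moves between subsamples. The variable names are illustrative, and using the range of the means as the measure of spread is just one simple choice.

```r
library(dplyr)
library(tibble)

# Values copied from the four summary tables above (Gender = 0 only)
means <- tribble(
  ~subsample, ~marital_status, ~n,   ~mean_admission_grade,
  1,          1,               1296, 126.9151,
  2,          1,               1256, 127.3920,
  3,          1,               1291, 127.0119,
  4,          1,               1271, 127.6187,
  1,          3,               2,    133.5,
  2,          3,               1,    170.0,
  3,          3,               1,    133.5,
  4,          3,               1,    170.0
)

# Spread of the subsample means within each group
spread <- means %>%
  group_by(marital_status) %>%
  summarise(
    smallest_n     = min(n),
    range_of_means = max(mean_admission_grade) - min(mean_admission_grade),
    .groups = "drop"
  )

print(spread)
```

For the large group the mean admission grade moves by less than one point between subsamples, while for the group with one or two rows it swings by 36.5 points. A value that looks like an anomaly in one subsample may therefore only reflect a very small group size.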