DATA DIVE 3
A collection of 5-10 random samples of data (with replacement) from
at least 6 columns of data. Each subsample should be roughly 50% as
long as your data. We are simulating the act of collecting data from a
population, where the “population” is represented by the data set you
already have. Store each sample set in a separate data frame (e.g.,
df_i might contain m rows from columns 1-6). These subsamples should
include both categorical and continuous (numeric) data.

Scrutinize these subsamples. How different are they? What would you
have called an anomaly in one subsample that you wouldn’t in another?
Are there aspects of the data that are consistent among all subsamples?
Consider how this investigation affects how you might draw conclusions
about the data in the future.

For each of the above tasks, you must explain to the reader what insight
was gathered, its significance, and any further questions you have which
might need to be further investigated.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
#Loading the dataset
data <- read_delim("data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Subsample columns
We have included “Marital status”, “Gender”, “Admission grade”,
“Course”, “Scholarship holder”, “Age at enrollment”, and “Unemployment
rate” in our subsamples. We have also set a seed value so that the
subsamples do not change each time the R Markdown file is knitted.
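
The sampling chunk itself is not shown in the knitted output, so the following is only a minimal sketch of how such subsamples could be drawn. The helper name `draw_subsamples`, the seed value, the number of subsamples, and the `toy` data frame are assumptions made for illustration; only the column names come from the text above.

```r
library(dplyr)
library(purrr)

# Columns named in the text above.
sample_cols <- c("Marital status", "Gender", "Admission grade", "Course",
                 "Scholarship holder", "Age at enrollment", "Unemployment rate")

# Hypothetical helper: draw `n_samples` subsamples, each holding `prop` of the
# rows, sampling rows with replacement. Returns a list of data frames.
draw_subsamples <- function(data, cols, n_samples = 5, prop = 0.5, seed = 1) {
  set.seed(seed)  # fixed seed so every knit reproduces the same draws
  map(seq_len(n_samples), function(i) {
    data %>%
      select(all_of(cols)) %>%
      slice_sample(prop = prop, replace = TRUE)
  })
}

# Toy stand-in for the real data set (all values are made up for illustration).
toy <- tibble(
  `Marital status`     = rep(c(1, 2), times = 10),
  Gender               = rep(c(0, 0, 1, 0), times = 5),
  `Admission grade`    = seq(100, 157, by = 3),
  Course               = rep(c(1, 2), each = 10),
  `Scholarship holder` = rep(c(0, 1, 0, 0, 0), times = 4),
  `Age at enrollment`  = rep(18:22, times = 4),
  `Unemployment rate`  = rep(c(7.6, 10.8, 13.9, 16.2), times = 5)
)

subsamples <- draw_subsamples(toy, sample_cols)
```

With the real data, the call would presumably be `draw_subsamples(data, sample_cols)`. Keeping the subsamples in a list rather than in separately named objects makes it easier to apply the same summary to each one.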
## `summarise()` has grouped output by 'Marital status'. You can override using
## the `.groups` argument.
Sub sample 4
| Marital status | Gender | Count_Marital_Status | Max_Admission_Grade | Min_Admission_Grade | Mean_Admission_Grade | Count_Scholarship_Holder | Max_Age_At_Enrollment | Min_Age_At_Enrollment | Mean_Age_At_Enrollment | Max_Unemployment_Rate | Min_Unemployment_Rate | Mean_Unemployment_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1271 | 184.4 | 95.0 | 127.6187 | 395 | 58 | 17 | 20.87333 | 16.2 | 7.6 | 11.60134 |
| 1 | 1 | 702 | 180.0 | 95.0 | 125.9623 | 128 | 61 | 18 | 22.64103 | 16.2 | 7.6 | 11.68262 |
| 2 | 0 | 107 | 170.0 | 95.5 | 122.3692 | 17 | 51 | 19 | 35.46729 | 16.2 | 7.6 | 10.73645 |
| 2 | 1 | 63 | 172.0 | 100.0 | 130.4556 | 5 | 54 | 21 | 39.55556 | 16.2 | 7.6 | 11.77460 |
| 3 | 0 | 1 | 170.0 | 170.0 | 170.0000 | 0 | 47 | 47 | 47.00000 | 8.9 | 8.9 | 8.90000 |
| 4 | 0 | 35 | 154.3 | 97.0 | 124.4629 | 17 | 48 | 24 | 37.11429 | 16.2 | 7.6 | 11.52571 |
| 4 | 1 | 12 | 151.0 | 107.2 | 129.5667 | 1 | 51 | 29 | 38.33333 | 15.5 | 7.6 | 11.93333 |
| 5 | 0 | 15 | 179.6 | 110.0 | 131.3000 | 0 | 46 | 21 | 31.13333 | 16.2 | 8.9 | 12.23333 |
| 5 | 1 | 4 | 140.0 | 114.0 | 128.4500 | 0 | 30 | 26 | 28.75000 | 15.5 | 8.9 | 12.75000 |
| 6 | 0 | 2 | 115.0 | 114.8 | 114.9000 | 1 | 41 | 24 | 32.50000 | 13.9 | 10.8 | 12.35000 |
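
A summary with the columns of the table above could be computed along the following lines. This is a sketch, not the original chunk: the helper name `summarise_subsample` and the `toy` data are hypothetical, and it assumes that `Count_Marital_Status` is the group size and that `Count_Scholarship_Holder` sums a 0/1 indicator. Passing `.groups = "drop"` also silences the `summarise()` grouping message seen in the output.

```r
library(dplyr)

# Hypothetical helper: per-group summary of one subsample.
summarise_subsample <- function(df) {
  df %>%
    group_by(`Marital status`, Gender) %>%
    summarise(
      Count_Marital_Status     = n(),                        # rows in the group (assumed)
      Max_Admission_Grade      = max(`Admission grade`),
      Min_Admission_Grade      = min(`Admission grade`),
      Mean_Admission_Grade     = mean(`Admission grade`),
      Count_Scholarship_Holder = sum(`Scholarship holder`),  # assumes 1 = holder
      Max_Age_At_Enrollment    = max(`Age at enrollment`),
      Min_Age_At_Enrollment    = min(`Age at enrollment`),
      Mean_Age_At_Enrollment   = mean(`Age at enrollment`),
      Max_Unemployment_Rate    = max(`Unemployment rate`),
      Min_Unemployment_Rate    = min(`Unemployment rate`),
      Mean_Unemployment_Rate   = mean(`Unemployment rate`),
      .groups = "drop"  # return an ungrouped result and suppress the message
    )
}

# Toy subsample (values are made up for illustration).
toy <- tibble(
  `Marital status`     = c(1, 1, 1, 2, 2, 2),
  Gender               = c(0, 0, 1, 0, 1, 1),
  `Admission grade`    = c(120, 140, 130, 110, 150, 100),
  `Scholarship holder` = c(1, 0, 0, 1, 1, 0),
  `Age at enrollment`  = c(18, 20, 19, 30, 25, 41),
  `Unemployment rate`  = c(7.6, 16.2, 10.8, 13.9, 7.6, 16.2)
)

summary_tbl <- summarise_subsample(toy)
```

Groups with very few rows, such as the single-row group for marital status 3 in the table, have identical maximum, minimum, and mean values, so their statistics say little about the population.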


Across all the subsamples the gender split is the same: female students
outnumber male students. Scholarship holding is also consistent across
the subsamples. The unemployment rate is likewise consistent across the
subsamples and across marital status groups: in subsample 4, for
example, the mean stays between roughly 10.7 and 12.8 for every group
with more than one observation.
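
One way to check claims like these is to put a few headline statistics for each subsample side by side. The sketch below is only illustrative: the helper name `compare_subsamples`, the chosen statistics, and the two toy subsamples are assumptions, and `Gender == 0` is used for the larger group in the table, which the text reads as female.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: one row of headline statistics per subsample.
compare_subsamples <- function(subsamples) {
  subsamples %>%
    map(function(df) {
      summarise(
        df,
        n_rows            = n(),
        share_gender_0    = mean(Gender == 0),
        share_scholarship = mean(`Scholarship holder` == 1),
        mean_unemployment = mean(`Unemployment rate`)
      )
    }) %>%
    bind_rows(.id = "subsample")  # adds an identifier column per list element
}

# Two toy subsamples (values are made up for illustration).
toy_subsamples <- list(
  tibble(Gender = c(0, 0, 1, 0),
         `Scholarship holder` = c(1, 0, 0, 0),
         `Unemployment rate` = c(8, 10, 12, 14)),
  tibble(Gender = c(0, 1, 1, 0),
         `Scholarship holder` = c(0, 0, 1, 1),
         `Unemployment rate` = c(10, 10, 12, 12))
)

comparison <- compare_subsamples(toy_subsamples)
```

If the rows of `comparison` are close to one another, the corresponding aspect of the data is stable under resampling; large differences between rows point to statistics that depend heavily on which rows happened to be drawn.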