EDA
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
summary(rock)
## area peri shape perm
## Min. : 1016 Min. : 308.6 Min. :0.09033 Min. : 6.30
## 1st Qu.: 5305 1st Qu.:1414.9 1st Qu.:0.16226 1st Qu.: 76.45
## Median : 7487 Median :2536.2 Median :0.19886 Median : 130.50
## Mean : 7188 Mean :2682.2 Mean :0.21811 Mean : 415.45
## 3rd Qu.: 8870 3rd Qu.:3989.5 3rd Qu.:0.26267 3rd Qu.: 777.50
## Max. :12212 Max. :4864.2 Max. :0.46413 Max. :1300.00
glimpse(rock)
## Rows: 48
## Columns: 4
## $ area <int> 4990, 7002, 7558, 7352, 7943, 7979, 9333, 8209, 8393, 6425, 9364~
## $ peri <dbl> 2791.90, 3892.60, 3930.66, 3869.32, 3948.54, 4010.15, 4345.75, 4~
## $ shape <dbl> 0.0903296, 0.1486220, 0.1833120, 0.1170630, 0.1224170, 0.1670450~
## $ perm <dbl> 6.3, 6.3, 6.3, 6.3, 17.1, 17.1, 17.1, 17.1, 119.0, 119.0, 119.0,~
## [1] 1016
## [1] 12212
## [1] 7487
## [1] 7187.729
## [1] 2683.849
## [1] 1016 12212
## 0% 25% 50% 75% 100%
## 1016.00 5305.25 7487.00 8869.50 12212.00
## [1] 3564.25
## [1] 7203045
## [1] 308.642
## [1] 4864.22
## [1] 2536.195
## [1] 2682.212
## [1] 1431.661
## [1] 308.642 4864.220
## 0% 25% 50% 75% 100%
## 308.642 1414.907 2536.195 3989.523 4864.220
## [1] 2574.615
## [1] 2049654
## [1] 6.3
## [1] 1300
## [1] 130.5
## [1] 415.45
## [1] 437.8182
## [1] 6.3 1300.0
## 0% 25% 50% 75% 100%
## 6.30 76.45 130.50 777.50 1300.00
## [1] 701.05
## [1] 191684.8
## [1] 0.0903296
## [1] 0.464125
## [1] 0.198862
## [1] 0.2181104
## [1] 0.08349645
## [1] 0.0903296 0.4641250
## 0% 25% 50% 75% 100%
## 0.0903296 0.1622618 0.1988620 0.2626700 0.4641250
## [1] 0.1004083
## [1] 0.006971657
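The code that produced the unlabeled values above is not shown. Judging from the numbers, each block of nine outputs looks like the minimum, maximum, median, mean, standard deviation, range, quantiles, interquartile range, and variance of one variable, in the order area, peri, perm, shape. The following is a minimal sketch that reproduces them under that assumption; the helper name `describe` is hypothetical and not part of the original.

```r
# Assumed reconstruction: the original chunk was hidden, so this is a guess
# at the calls behind the nine values printed for each variable.
# `rock` is the built-in data set from the datasets package.
describe <- function(v) {
  list(
    min      = min(v),
    max      = max(v),
    median   = median(v),
    mean     = mean(v),
    sd       = sd(v),        # sample standard deviation (divides by n - 1)
    range    = range(v),
    quantile = quantile(v),  # 0%, 25%, 50%, 75%, 100%
    IQR      = IQR(v),       # 75% quantile minus 25% quantile
    var      = var(v)        # sample variance (divides by n - 1)
  )
}

# Same variable order as the printed output: area, peri, perm, shape
stats <- lapply(rock[c("area", "peri", "perm", "shape")], describe)
str(stats$area)
```

Collecting the results in a named list keeps each number attached to the statistic that produced it, which the bare `## [1]` lines do not.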
Codebook
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
##
## collect, recode, rename, syms
## The following object is masked from 'package:purrr':
##
## %@%
## The following object is masked from 'package:tibble':
##
## view
## The following object is masked from 'package:ggplot2':
##
## syms
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
typeof(x)
## [1] "list"
x
##
## Data set with 48 observations and 4 variables
##
## rock.area rock.peri rock.shape rock.perm
## 1 4990 2791.90 0.0903296 6.3
## 2 7002 3892.60 0.1486220 6.3
## 3 7558 3930.66 0.1833120 6.3
## 4 7352 3869.32 0.1170630 6.3
## 5 7943 3948.54 0.1224170 17.1
## 6 7979 4010.15 0.1670450 17.1
## 7 9333 4345.75 0.1896510 17.1
## 8 8209 4344.75 0.1641270 17.1
## 9 8393 3682.04 0.2036540 119.0
## 10 6425 3098.65 0.1623940 119.0
## 11 9364 4480.05 0.1509440 119.0
## 12 8624 3986.24 0.1481410 119.0
## 13 10651 4036.54 0.2285950 82.4
## 14 8868 3518.04 0.2316230 82.4
## 15 9417 3999.37 0.1725670 82.4
## 16 8874 3629.07 0.1534810 82.4
## 17 10962 4608.66 0.2043140 58.6
## 18 10743 4787.62 0.2627270 58.6
## 19 11878 4864.22 0.2000710 58.6
## 20 9867 4479.41 0.1448100 58.6
## 21 7838 3428.74 0.1138520 142.0
## 22 11876 4353.14 0.2910290 142.0
## 23 12212 4697.65 0.2400770 142.0
## 24 8233 3518.44 0.1618650 142.0
## 25 6360 1977.39 0.2808870 740.0
## .. ......... ......... .......... .........
## (25 of 48 observations shown)
summary(x)
## rock.area rock.peri rock.shape rock.perm
## Min. : 1016 Min. : 308.6 Min. :0.09033 Min. : 6.30
## 1st Qu.: 5305 1st Qu.:1414.9 1st Qu.:0.16226 1st Qu.: 76.45
## Median : 7487 Median :2536.2 Median :0.19886 Median : 130.50
## Mean : 7188 Mean :2682.2 Mean :0.21811 Mean : 415.45
## 3rd Qu.: 8870 3rd Qu.:3989.5 3rd Qu.:0.26267 3rd Qu.: 777.50
## Max. :12212 Max. :4864.2 Max. :0.46413 Max. :1300.00
sapply(rock, class)
## area peri shape perm
## "integer" "numeric" "numeric" "numeric"
sapply(rock, range)
## area peri shape perm
## [1,] 1016 308.642 0.0903296 6.3
## [2,] 12212 4864.220 0.4641250 1300.0
codebook(x)
## ================================================================================
##
## rock.area
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
## Measurement: interval
##
## Min: 1016.000
## Max: 12212.000
## Mean: 7187.729
## Std.Dev.: 2655.745
##
## ================================================================================
##
## rock.peri
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
## Measurement: interval
##
## Min: 308.642
## Max: 4864.220
## Mean: 2682.212
## Std.Dev.: 1416.670
##
## ================================================================================
##
## rock.shape
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
## Measurement: interval
##
## Min: 0.090
## Max: 0.464
## Mean: 0.218
## Std.Dev.: 0.083
##
## ================================================================================
##
## rock.perm
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
## Measurement: interval
##
## Min: 6.300
## Max: 1300.000
## Mean: 415.450
## Std.Dev.: 433.234
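The Std.Dev. values in the codebook are slightly smaller than the sd( ) results in the EDA section (for area, 2655.745 against 2683.849). The gap is consistent with the codebook dividing by n rather than n - 1. That is an inference from the numbers, not something the output states. A short base R sketch that checks the arithmetic:

```r
# Compare the sample SD (n - 1 denominator, what sd() returns) with the
# population-style SD (n denominator), which appears to match codebook().
n <- nrow(rock)                      # 48 observations

sample_sd <- sapply(rock, sd)        # divides by n - 1
pop_sd    <- sample_sd * sqrt((n - 1) / n)  # rescaled to divide by n

round(cbind(sample_sd, pop_sd), 3)
```

With only 48 observations the two versions differ by about 1%, so it matters which one is reported.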
Original data frame:
## Label Age Height Weight Money
## 1 A 13 156 43 124
## 2 B 14 158 65 176
## 3 C 15 160 45 157
## 4 D 16 176 56 197
## 5 E 14 145 55 156
## 6 F 19 187 58 147
## 7 G 18 156 49 198
## 8 H 19 158 50 167
## 9 I 16 162 62 184
## 10 J 17 172 63 159
## 11 K 19 159 59 137
## 12 L 15 165 57 180
## 13 M 20 162 55 99
## 14 N 20 169 69 109
## 15 O 18 159 68 144
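The code that created `ds` is not shown. To follow along, the data frame can be rebuilt from the printed table. This is a sketch: the values are copied from the output above, but how the original was actually constructed is an assumption.

```r
# Rebuild the example data frame from the printed table.
# The name `ds` matches the calls used in the sections that follow.
ds <- data.frame(
  Label  = LETTERS[1:15],  # "A" through "O"
  Age    = c(13, 14, 15, 16, 14, 19, 18, 19, 16, 17, 19, 15, 20, 20, 18),
  Height = c(156, 158, 160, 176, 145, 187, 156, 158, 162, 172, 159, 165, 162, 169, 159),
  Weight = c(43, 65, 45, 56, 55, 58, 49, 50, 62, 63, 59, 57, 55, 69, 68),
  Money  = c(124, 176, 157, 197, 156, 147, 198, 167, 184, 159, 137, 180, 99, 109, 144)
)

ds
```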
i. filter( )
filter(ds, Age >= 18)
## Label Age Height Weight Money
## 1 F 19 187 58 147
## 2 G 18 156 49 198
## 3 H 19 158 50 167
## 4 K 19 159 59 137
## 5 M 20 162 55 99
## 6 N 20 169 69 109
## 7 O 18 159 68 144
The filter( ) function returns the subset of rows of the data frame that satisfy a given condition. In this case, it keeps only the rows where Age is 18 or greater.
ii. arrange( )
arrange(ds, Age, Height)
## Label Age Height Weight Money
## 1 A 13 156 43 124
## 2 E 14 145 55 156
## 3 B 14 158 65 176
## 4 C 15 160 45 157
## 5 L 15 165 57 180
## 6 I 16 162 62 184
## 7 D 16 176 56 197
## 8 J 17 172 63 159
## 9 G 18 156 49 198
## 10 O 18 159 68 144
## 11 H 19 158 50 167
## 12 K 19 159 59 137
## 13 F 19 187 58 147
## 14 M 20 162 55 99
## 15 N 20 169 69 109
The arrange( ) function sorts the rows of the data frame by the values of the selected features, in ascending order by default. In this case, it sorts the rows by Age and then, within equal ages, by Height.
iii. mutate( )
mutate(ds, Height_meter = Height/100, Weight_Newton = Weight*10)
## Label Age Height Weight Money Height_meter Weight_Newton
## 1 A 13 156 43 124 1.56 430
## 2 B 14 158 65 176 1.58 650
## 3 C 15 160 45 157 1.60 450
## 4 D 16 176 56 197 1.76 560
## 5 E 14 145 55 156 1.45 550
## 6 F 19 187 58 147 1.87 580
## 7 G 18 156 49 198 1.56 490
## 8 H 19 158 50 167 1.58 500
## 9 I 16 162 62 184 1.62 620
## 10 J 17 172 63 159 1.72 630
## 11 K 19 159 59 137 1.59 590
## 12 L 15 165 57 180 1.65 570
## 13 M 20 162 55 99 1.62 550
## 14 N 20 169 69 109 1.69 690
## 15 O 18 159 68 144 1.59 680
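The mutate( ) function creates new variables (features) and keeps the existing ones. In this case, it creates two new features: Height_meter, obtained by dividing Height by 100, and Weight_Newton, obtained by multiplying Weight by 10. The second conversion treats Weight as a mass in kilograms and approximates gravitational acceleration as 10 m/s^2.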
iv. select( )
dplyr::select(ds, Age)
## Age
## 1 13
## 2 14
## 3 15
## 4 16
## 5 14
## 6 19
## 7 18
## 8 19
## 9 16
## 10 17
## 11 19
## 12 15
## 13 20
## 14 20
## 15 18
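The select( ) function keeps only the specified features (columns) of the data frame. In this case, it returns only the Age feature. The dplyr:: prefix is needed here because MASS, which was loaded along with memisc, masks dplyr's select( ).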
v. summarise( )
summarise(ds, mean = mean(Height), sd = sd(Height))
## mean sd
## 1 162.9333 9.931671
The summarise( ) function creates a new data frame that collapses the rows into the requested summary values. In this case, it returns a one-row data frame containing the mean and standard deviation of the Height feature.
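The five verbs above are usually combined rather than called one at a time. The sketch below chains them with the pipe operator `%>%`, which dplyr provides. It rebuilds `ds` from the printed table so that it runs on its own; the result names `adults` and `adult_summary` are hypothetical.

```r
library(dplyr)

# Rebuilt from the printed table; the original construction is not shown.
ds <- data.frame(
  Label  = LETTERS[1:15],
  Age    = c(13, 14, 15, 16, 14, 19, 18, 19, 16, 17, 19, 15, 20, 20, 18),
  Height = c(156, 158, 160, 176, 145, 187, 156, 158, 162, 172, 159, 165, 162, 169, 159),
  Weight = c(43, 65, 45, 56, 55, 58, 49, 50, 62, 63, 59, 57, 55, 69, 68),
  Money  = c(124, 176, 157, 197, 156, 147, 198, 167, 184, 159, 137, 180, 99, 109, 144)
)

# Keep people aged 18 or older, add height in meters, sort, then keep
# three columns. dplyr::select is spelled out in case MASS is attached.
adults <- ds %>%
  filter(Age >= 18) %>%
  mutate(Height_meter = Height / 100) %>%
  arrange(Age, Height) %>%
  dplyr::select(Label, Age, Height_meter)

# Summarise the same subset in one step.
adult_summary <- ds %>%
  filter(Age >= 18) %>%
  summarise(mean_height = mean(Height), sd_height = sd(Height))

adults
adult_summary
```

Each step receives the data frame produced by the previous one, so the pipeline reads top to bottom in the order the operations are applied.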