Katarzyna Smoter
field of study: Geoinformation
faculty: Mining Surveying and Environmental Engineering
To work with the data in R, I loaded the required libraries.
library(ggplot2)
library(tidyverse)
library(Hmisc)
library(pastecs)
library(psych)
library(doBy)
library(sm)
library(ggpubr)
For the analyses that follow I chose the “diamonds” dataset from the ggplot2 package. First I checked which columns the “diamonds” table contains. Then I selected the categorical columns (stored as ordered factors) and checked which values they contain and how many times each occurs. I also set the “cut” column as the main column.
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
## [1] Ideal Premium Good Very Good Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
## # A tibble: 2 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
## [1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
##
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 741 9194 13065 12258 8171 5066 3655 1790
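The report shows only the output of this inspection step, not the calls behind it. A minimal sketch of commands that would print a listing like the one above, assuming base R functions (`str`, `names`, `unique`, `table`) were used; the `head(diamonds, 2)` call for the two-row preview is a guess:

```r
library(ggplot2)  # provides the diamonds dataset

# Structure of the table: column names, types and first values
str(diamonds)
names(diamonds)

# Levels and counts of each categorical (ordered factor) column
unique(diamonds$cut)
cut_counts <- table(diamonds$cut)
cut_counts

# Assumed call for the two-row preview
head(diamonds, 2)

unique(diamonds$color)
table(diamonds$color)
unique(diamonds$clarity)
table(diamonds$clarity)
```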
Next I added a new column, “kod”, containing the values “kod1” and “kod2” in alternation, and converted it to a factor.
I also computed basic descriptive statistics.
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z kod
## Min. : 0.000 kod1:26970
## 1st Qu.: 2.910 kod2:26970
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
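The code for the “kod” column is not shown. One way to build an alternating two-level factor, as a sketch (the use of `rep` with `length.out` is an assumption; only the column name and its values come from the report):

```r
library(ggplot2)

diamonds <- ggplot2::diamonds  # local copy, so the package data stays untouched

# Alternate "kod1", "kod2", "kod1", ... down the rows and store as a factor
diamonds$kod <- factor(rep(c("kod1", "kod2"), length.out = nrow(diamonds)))

kod_counts <- table(diamonds$kod)
kod_counts

# Basic descriptive statistics for every column
summary(diamonds)
```

With 53,940 rows, each code appears 26,970 times, which matches the summary above.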
I also plotted every column.
## [1] 11
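nazwa <- names(diamonds)
# 4 x 3 grid: enough panels for the 11 columns
par(mfrow = c(4, 3))
for (i in seq_along(diamonds)) {
  # [[i]] extracts the column as a vector (diamonds[, i] would stay a tibble)
  plot(diamonds[[i]], ylab = nazwa[i])
}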
par(mfrow=c(1,1))
I computed statistics using sapply and describe.
## carat cut color clarity depth table
## 0.7979397 NA NA NA 61.7494049 57.4571839
## price x y z kod
## 3932.7997219 5.7311572 5.7345260 3.5387338 NA
## vars n mean sd median trimmed mad min max
## carat 1 53940 0.80 0.47 0.70 0.73 0.47 0.2 5.01
## cut* 2 53940 3.90 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 53940 3.59 1.70 4.00 3.55 1.48 1.0 7.00
## clarity* 4 53940 4.05 1.65 4.00 3.91 1.48 1.0 8.00
## depth 5 53940 61.75 1.43 61.80 61.78 1.04 43.0 79.00
## table 6 53940 57.46 2.23 57.00 57.32 1.48 43.0 95.00
## price 7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x 8 53940 5.73 1.12 5.70 5.66 1.38 0.0 10.74
## y 9 53940 5.73 1.14 5.71 5.66 1.36 0.0 58.90
## z 10 53940 3.54 0.71 3.53 3.49 0.85 0.0 31.80
## kod* 11 53940 1.50 0.50 1.50 1.50 0.74 1.0 2.00
## range skew kurtosis se
## carat 4.81 1.12 1.26 0.00
## cut* 4.00 -0.72 -0.40 0.00
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.39 0.01
## depth 36.00 -0.08 5.74 0.01
## table 52.00 0.80 2.80 0.01
## price 18497.00 1.62 2.18 17.18
## x 10.74 0.38 -0.62 0.00
## y 58.90 2.43 91.20 0.00
## z 31.80 1.52 47.08 0.00
## kod* 1.00 0.00 -2.00 0.00
## carat cut color clarity depth table
## nbr.val 5.394000e+04 NA NA NA 5.394000e+04 5.394000e+04
## nbr.null 0.000000e+00 NA NA NA 0.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 NA NA NA 0.000000e+00 0.000000e+00
## min 2.000000e-01 NA NA NA 4.300000e+01 4.300000e+01
## max 5.010000e+00 NA NA NA 7.900000e+01 9.500000e+01
## range 4.810000e+00 NA NA NA 3.600000e+01 5.200000e+01
## sum 4.304087e+04 NA NA NA 3.330763e+06 3.099241e+06
## median 7.000000e-01 NA NA NA 6.180000e+01 5.700000e+01
## mean 7.979397e-01 NA NA NA 6.174940e+01 5.745718e+01
## SE.mean 2.040954e-03 NA NA NA 6.168448e-03 9.621063e-03
## CI.mean.0.95 4.000286e-03 NA NA NA 1.209021e-02 1.885736e-02
## var 2.246867e-01 NA NA NA 2.052404e+00 4.992948e+00
## std.dev 4.740112e-01 NA NA NA 1.432621e+00 2.234491e+00
## coef.var 5.940439e-01 NA NA NA 2.320057e-02 3.888966e-02
## price x y z kod
## nbr.val 5.394000e+04 5.394000e+04 5.394000e+04 5.394000e+04 NA
## nbr.null 0.000000e+00 8.000000e+00 7.000000e+00 2.000000e+01 NA
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NA
## min 3.260000e+02 0.000000e+00 0.000000e+00 0.000000e+00 NA
## max 1.882300e+04 1.074000e+01 5.890000e+01 3.180000e+01 NA
## range 1.849700e+04 1.074000e+01 5.890000e+01 3.180000e+01 NA
## sum 2.121352e+08 3.091386e+05 3.093203e+05 1.908793e+05 NA
## median 2.401000e+03 5.700000e+00 5.710000e+00 3.530000e+00 NA
## mean 3.932800e+03 5.731157e+00 5.734526e+00 3.538734e+00 NA
## SE.mean 1.717736e+01 4.829974e-03 4.917698e-03 3.038533e-03 NA
## CI.mean.0.95 3.366776e+01 9.466787e-03 9.638727e-03 5.955549e-03 NA
## var 1.591563e+07 1.258347e+00 1.304472e+00 4.980109e-01 NA
## std.dev 3.989440e+03 1.121761e+00 1.142135e+00 7.056988e-01 NA
## coef.var 1.014402e+00 1.957302e-01 1.991681e-01 1.994213e-01 NA
## vars n mean sd median trimmed mad min max
## carat 1 53940 0.80 0.47 0.70 0.73 0.47 0.2 5.01
## cut* 2 53940 3.90 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 53940 3.59 1.70 4.00 3.55 1.48 1.0 7.00
## clarity* 4 53940 4.05 1.65 4.00 3.91 1.48 1.0 8.00
## depth 5 53940 61.75 1.43 61.80 61.78 1.04 43.0 79.00
## table 6 53940 57.46 2.23 57.00 57.32 1.48 43.0 95.00
## price 7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x 8 53940 5.73 1.12 5.70 5.66 1.38 0.0 10.74
## y 9 53940 5.73 1.14 5.71 5.66 1.36 0.0 58.90
## z 10 53940 3.54 0.71 3.53 3.49 0.85 0.0 31.80
## kod* 11 53940 1.50 0.50 1.50 1.50 0.74 1.0 2.00
## range skew kurtosis se
## carat 4.81 1.12 1.26 0.00
## cut* 4.00 -0.72 -0.40 0.00
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.39 0.01
## depth 36.00 -0.08 5.74 0.01
## table 52.00 0.80 2.80 0.01
## price 18497.00 1.62 2.18 17.18
## x 10.74 0.38 -0.62 0.00
## y 58.90 2.43 91.20 0.00
## z 31.80 1.52 47.08 0.00
## kod* 1.00 0.00 -2.00 0.00
##
## Descriptive statistics by group
## group: kod1
## vars n mean sd median trimmed mad min max
## carat 1 26970 0.80 0.48 0.70 0.74 0.47 0.2 4.50
## cut* 2 26970 3.90 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 26970 3.58 1.70 4.00 3.54 1.48 1.0 7.00
## clarity* 4 26970 4.04 1.65 4.00 3.91 1.48 1.0 8.00
## depth 5 26970 61.75 1.43 61.80 61.79 1.04 43.0 79.00
## table 6 26970 57.45 2.23 57.00 57.31 1.48 43.0 95.00
## price 7 26970 3932.63 3989.23 2401.00 3158.89 2475.94 326.0 18818.00
## x 8 26970 5.73 1.12 5.70 5.66 1.36 0.0 10.23
## y 9 26970 5.73 1.11 5.71 5.66 1.36 0.0 10.16
## z 10 26970 3.54 0.72 3.53 3.50 0.85 0.0 31.80
## kod* 11 26970 1.00 0.00 1.00 1.00 0.00 1.0 1.00
## range skew kurtosis se
## carat 4.30 1.13 1.34 0.00
## cut* 4.00 -0.71 -0.40 0.01
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.39 0.01
## depth 36.00 0.05 5.54 0.01
## table 52.00 0.86 4.12 0.01
## price 18492.00 1.62 2.18 24.29
## x 10.23 0.39 -0.62 0.01
## y 10.16 0.39 -0.65 0.01
## z 31.80 2.61 89.38 0.00
## kod* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: kod2
## vars n mean sd median trimmed mad min max
## carat 1 26970 0.80 0.47 0.70 0.73 0.47 0.2 5.01
## cut* 2 26970 3.91 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 26970 3.60 1.70 4.00 3.56 1.48 1.0 7.00
## clarity* 4 26970 4.06 1.65 4.00 3.92 1.48 1.0 8.00
## depth 5 26970 61.74 1.43 61.80 61.78 1.04 43.0 79.00
## table 6 26970 57.46 2.23 57.00 57.32 1.48 44.0 79.00
## price 7 26970 3932.97 3989.72 2401.00 3159.10 2475.94 326.0 18823.00
## x 8 26970 5.73 1.12 5.69 5.66 1.36 0.0 10.74
## y 9 26970 5.73 1.17 5.71 5.66 1.36 0.0 58.90
## z 10 26970 3.54 0.70 3.52 3.49 0.85 0.0 8.06
## kod* 11 26970 2.00 0.00 2.00 2.00 0.00 2.0 2.00
## range skew kurtosis se
## carat 4.81 1.10 1.17 0.00
## cut* 4.00 -0.72 -0.39 0.01
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.40 0.01
## depth 36.00 -0.21 5.93 0.01
## table 35.00 0.73 1.49 0.01
## price 18497.00 1.62 2.18 24.29
## x 10.74 0.37 -0.62 0.01
## y 58.90 4.20 166.19 0.01
## z 8.06 0.33 -0.38 0.00
## kod* 0.00 NaN NaN 0.00
## item group1 vars n mean sd median trimmed
## carat1 1 kod1 1 26970 0.7984854 0.4751665 0.70 0.7351863
## carat2 2 kod2 1 26970 0.7973941 0.4728613 0.70 0.7347446
## cut*1 3 kod1 2 26970 3.9030033 1.1165241 4.00 4.0410178
## cut*2 4 kod2 2 26970 3.9051910 1.1166954 4.00 4.0438450
## color*1 5 kod1 3 26970 3.5845384 1.7024558 4.00 3.5409251
## color*2 6 kod2 3 26970 3.6038561 1.6997294 4.00 3.5644234
## clarity*1 7 kod1 4 26970 4.0446793 1.6461309 4.00 3.9078142
## clarity*2 8 kod2 4 26970 4.0573600 1.6481468 4.00 3.9217649
## depth1 9 kod1 5 26970 61.7541861 1.4316223 61.80 61.7865406
## depth2 10 kod2 5 26970 61.7446237 1.4336302 61.80 61.7826752
## table1 11 kod1 6 26970 57.4534557 2.2340739 57.00 57.3145532
## table2 12 kod2 6 26970 57.4609121 2.2349424 57.00 57.3221496
## price1 13 kod1 7 26970 3932.6286244 3989.2349561 2401.00 3158.8863089
## price2 14 kod2 7 26970 3932.9708194 3989.7184613 2401.00 3159.0983964
## x1 15 kod1 8 26970 5.7324138 1.1224196 5.70 5.6608282
## x2 16 kod2 8 26970 5.7299006 1.1211209 5.69 5.6593252
## y1 17 kod1 9 26970 5.7343367 1.1136608 5.71 5.6633060
## y2 18 kod2 9 26970 5.7347152 1.1699364 5.71 5.6621825
## z1 19 kod1 10 26970 3.5403089 0.7155786 3.53 3.4955223
## z2 20 kod2 10 26970 3.5371587 0.6956885 3.52 3.4942445
## kod*1 21 kod1 11 26970 1.0000000 0.0000000 1.00 1.0000000
## kod*2 22 kod2 11 26970 2.0000000 0.0000000 2.00 2.0000000
## mad min max range skew kurtosis
## carat1 0.474432 0.2 4.50 4.30 1.13331065 1.3448172
## carat2 0.474432 0.2 5.01 4.81 1.09949928 1.1651945
## cut*1 1.482600 1.0 5.00 4.00 -0.71355322 -0.4033855
## cut*2 1.482600 1.0 5.00 4.00 -0.72068935 -0.3930713
## color*1 1.482600 1.0 7.00 6.00 0.19356737 -0.8673174
## color*2 1.482600 1.0 7.00 6.00 0.18518525 -0.8664942
## clarity*1 1.482600 1.0 8.00 7.00 0.54995798 -0.3887358
## clarity*2 1.482600 1.0 8.00 7.00 0.55281547 -0.4014077
## depth1 1.037820 43.0 79.00 36.00 0.04580134 5.5376313
## depth2 1.037820 43.0 79.00 36.00 -0.20981285 5.9343078
## table1 1.482600 43.0 95.00 52.00 0.86153647 4.1178222
## table2 1.482600 44.0 79.00 35.00 0.73220034 1.4872204
## price1 2475.942000 326.0 18818.00 18492.00 1.61820156 2.1767756
## price2 2475.942000 326.0 18823.00 18497.00 1.61831892 2.1772216
## x1 1.363992 0.0 10.23 10.23 0.38772844 -0.6217964
## x2 1.363992 0.0 10.74 10.74 0.36952271 -0.6150674
## y1 1.363992 0.0 10.16 10.16 0.38702787 -0.6525155
## y2 1.363992 0.0 58.90 58.90 4.19533118 166.1948031
## z1 0.845082 0.0 31.80 31.80 2.61212158 89.3848321
## z2 0.845082 0.0 8.06 8.06 0.33497846 -0.3836541
## kod*1 0.000000 1.0 1.00 0.00 NaN NaN
## kod*2 0.000000 2.0 2.00 0.00 NaN NaN
## se
## carat1 0.002893379
## carat2 0.002879342
## cut*1 0.006798727
## cut*2 0.006799770
## color*1 0.010366577
## color*2 0.010349975
## clarity*1 0.010023604
## clarity*2 0.010035879
## depth1 0.008717420
## depth2 0.008729647
## table1 0.013603700
## table2 0.013608989
## price1 24.291209674
## price2 24.294153830
## x1 0.006834626
## x2 0.006826718
## y1 0.006781292
## y2 0.007123965
## z1 0.004357294
## z2 0.004236179
## kod*1 0.000000000
## kod*2 0.000000000
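The listing above appears to combine four kinds of output: column means from `sapply`, `describe` tables, a `stat.desc` table from pastecs, and grouped tables in the `describeBy` format from psych. The exact calls are not in the report, so the following is a sketch under that assumption, with the “kod” column rebuilt as an alternating factor:

```r
library(ggplot2)
library(pastecs)
library(psych)

diamonds <- ggplot2::diamonds
diamonds$kod <- factor(rep(c("kod1", "kod2"), length.out = nrow(diamonds)))

# Mean of every column; factor columns give NA (with a warning, silenced here)
col_means <- suppressWarnings(sapply(diamonds, mean))
col_means

# psych::describe: n, mean, sd, median, skew, kurtosis, ... per column
# (factor columns are converted to integer codes and marked with *)
describe(diamonds)

# pastecs::stat.desc: counts, range, sum, mean, variance, ... per column
stat.desc(diamonds)

# The same describe table split by the levels of kod, as a list and as one matrix
describeBy(diamonds, group = diamonds$kod)
describeBy(diamonds, group = diamonds$kod, mat = TRUE)
```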
I computed statistics by group using summarise.
## # A tibble: 5 x 3
## cut depth.m depth.s
## <ord> <dbl> <dbl>
## 1 Fair 64.0 3.64
## 2 Good 62.4 2.17
## 3 Very Good 61.8 1.38
## 4 Premium 61.3 1.16
## 5 Ideal 61.7 0.719
## # A tibble: 10 x 4
## cut kod depth.m depth.s
## <ord> <fct> <dbl> <dbl>
## 1 Fair kod1 64.2 3.54
## 2 Fair kod2 63.9 3.74
## 3 Good kod1 62.4 2.16
## 4 Good kod2 62.4 2.18
## 5 Very Good kod1 61.8 1.39
## 6 Very Good kod2 61.8 1.37
## 7 Premium kod1 61.3 1.16
## 8 Premium kod2 61.3 1.16
## 9 Ideal kod1 61.7 0.708
## 10 Ideal kod2 61.7 0.729
## # A tibble: 10 x 6
## cut kod depth.m depth.s table.m table.s
## <ord> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Fair kod1 64.2 3.54 58.9 4.04
## 2 Fair kod2 63.9 3.74 59.2 3.85
## 3 Good kod1 62.4 2.16 58.6 2.85
## 4 Good kod2 62.4 2.18 58.8 2.85
## 5 Very Good kod1 61.8 1.39 58.0 2.14
## 6 Very Good kod2 61.8 1.37 57.9 2.10
## 7 Premium kod1 61.3 1.16 58.7 1.48
## 8 Premium kod2 61.3 1.16 58.7 1.48
## 9 Ideal kod1 61.7 0.708 56.0 1.25
## 10 Ideal kod2 61.7 0.729 55.9 1.24
## # A tibble: 10 x 16
## cut kod carat.m carat.s depth.m depth.s table.m table.s price.m price.s
## <ord> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fair kod1 1.06 0.519 64.2 3.54 58.9 4.04 4450. 3626.
## 2 Fair kod2 1.04 0.514 63.9 3.74 59.2 3.85 4268. 3494.
## 3 Good kod1 0.845 0.454 62.4 2.16 58.6 2.85 3927. 3717.
## 4 Good kod2 0.853 0.454 62.4 2.18 58.8 2.85 3931. 3646.
## 5 Very~ kod1 0.802 0.458 61.8 1.39 58.0 2.14 3946. 3903.
## 6 Very~ kod2 0.810 0.461 61.8 1.37 57.9 2.10 4018. 3969.
## 7 Prem~ kod1 0.893 0.515 61.3 1.16 58.7 1.48 4570. 4323.
## 8 Prem~ kod2 0.891 0.516 61.3 1.16 58.7 1.48 4599. 4375.
## 9 Ideal kod1 0.706 0.437 61.7 0.708 56.0 1.25 3481. 3838.
## 10 Ideal kod2 0.699 0.428 61.7 0.729 55.9 1.24 3434. 3778.
## # ... with 6 more variables: x.m <dbl>, x.s <dbl>, y.m <dbl>, y.s <dbl>,
## # z.m <dbl>, z.s <dbl>
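The grouped tables above have the shape of dplyr `group_by` plus `summarise` results. The original calls are not shown; a sketch that yields tables with the same column names, assuming dplyr 1.0 or later for `across` and `.groups`:

```r
library(ggplot2)
library(dplyr)

diamonds <- ggplot2::diamonds
diamonds$kod <- factor(rep(c("kod1", "kod2"), length.out = nrow(diamonds)))

# Mean and standard deviation of depth for each cut
by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(depth.m = mean(depth), depth.s = sd(depth))
by_cut

# The same statistics for each cut / kod combination
by_cut_kod <- diamonds %>%
  group_by(cut, kod) %>%
  summarise(depth.m = mean(depth), depth.s = sd(depth), .groups = "drop")
by_cut_kod

# Mean and sd of every numeric column at once, named <column>.m and <column>.s
all_stats <- diamonds %>%
  group_by(cut, kod) %>%
  summarise(across(where(is.numeric), list(m = mean, s = sd),
                   .names = "{.col}.{.fn}"),
            .groups = "drop")
all_stats
```

Seven numeric columns with two statistics each, plus the two grouping columns, give the 16 columns reported for the widest table.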
I also created histograms.
x=diamonds$carat
h=hist(x,breaks=seq(0,5.5,0.1))
xfit <- seq(min(x), max(x), length=150)
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
I created a kernel density estimate.
library(sm)
sm.density.compare(diamonds$carat, diamonds$cut, xlab="carat")
cyl.f <- diamonds$cut
title(main="carat by cut type")
# one legend colour per level of cut
colfill <- 2:(1 + nlevels(cyl.f))
legend(4,1.5, levels(cyl.f), fill=colfill)
I created boxplots.
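No boxplot code is shown in the report. A minimal sketch, assuming the plot compared carat across the levels of cut as the density plot does (the choice of variables and colours is hypothetical):

```r
library(ggplot2)  # provides the diamonds dataset

# One box per cut level; boxplot() also returns the plotted statistics invisibly
bp <- boxplot(carat ~ cut, data = diamonds,
              xlab = "cut", ylab = "carat",
              col = 2:6, main = "carat by cut")
bp$names
```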
I created histograms.
gghistogram(diamonds, x = "carat", add = "mean", rug = TRUE, color = "cut",
            palette = c("#00AFBB", "#E7B800", "red", "pink", "purple"))
gghistogram(diamonds, x = "carat", add = "mean", rug = TRUE, color = "cut",
            fill = "cut",
            palette = c("#00AFBB", "#E7B800", "red", "pink", "purple"))
gghistogram(diamonds, x = "carat", add = "mean", rug = TRUE, color = "cut",
            fill = "cut",
            palette = c("#00AFBB", "#E7B800", "red", "pink", "purple"),
            add_density = TRUE)