Statystyki opisowe

Dane

W celu zrobienia statystyk opisowych wykorzystano zestaw danych diamonds z pakietu ggplot2.

Poniżej zostały wypisane wszystkie pakiety potrzebne do wykonania tego projektu:

library(tidyverse)
library(ggplot2)
library(Hmisc)
library(pastecs)
library(doBy)
library(sm)
library(ggpubr)

Przygotowanie danych

Za pomocą poniższych funkcji sprawdzono strukturę zestawu danych, unikalne wartości, jakie przechowywała kolumna cut, oraz nazwy wszystkich kolumn. Na koniec wyświetlono 5 pierwszych wierszy tabeli.

str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

unique(diamonds$cut)

## [1] Ideal     Premium   Good      Very Good Fair     
## Levels: Fair < Good < Very Good < Premium < Ideal

table(diamonds$cut)

## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551

names(diamonds)

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

head(diamonds,5)

## # A tibble: 5 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35  2.75

Usunięcie i dodanie kolumn

Usunięto niektóre kolumny, które miały zbyt dużo braków danych.

dane <- diamonds[,c(-8, -9, -10)]

Dodano nową kolumnę z danymi klasyfikującymi wartości z kolumny carat na 3 rodzaje. Typ danych powstałych w nowo utworzonej kolumnie zamieniono na factor.

dane <- dane %>% 
  mutate(kod = case_when(
    carat < 0.5 ~ "kod1",
    carat >= 0.5 & carat < 2 ~ "kod2",
    carat >= 2 ~ "kod3"))

dane$kod=as.factor(dane$kod)

Podstawowe statystyki opisowe

Poniżej zostały przedstawione podstawowe statystyki dla całego zbioru danych.

summary(dane)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price         kod       
##  Min.   :43.00   Min.   :  326   kod1:17674  
##  1st Qu.:56.00   1st Qu.:  950   kod2:34112  
##  Median :57.00   Median : 2401   kod3: 2154  
##  Mean   :57.46   Mean   : 3933               
##  3rd Qu.:59.00   3rd Qu.: 5324               
##  Max.   :95.00   Max.   :18823               
##

Za pomocą funkcji sapply sprawdzono średnią wartość oraz minimalne i maksymalne wartości w każdej kolumnie. Najpierw usunięto z zestawu danych te kolumny, które nie przechowywały wartości liczbowych.

dane_sap <- dane[, c(-2, -3, -4, -8)]

sapply(dane_sap, mean, na.rm=TRUE)

##        carat        depth        table        price 
##    0.7979397   61.7494049   57.4571839 3932.7997219

sapply(dane_sap, min, na.rm=TRUE)

## carat depth table price 
##   0.2  43.0  43.0 326.0

sapply(dane_sap, max, na.rm=TRUE)

##    carat    depth    table    price 
##     5.01    79.00    95.00 18823.00

Sprawdzono także szczegółowe informacje na temat każdej z kolumn za pomocą funkcji describe.

describe(dane)

## dane 
## 
##  8  Variables      53940  Observations
## --------------------------------------------------------------------------------
## carat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    53940        0      273    0.999   0.7979   0.5122     0.30     0.31 
##      .25      .50      .75      .90      .95 
##     0.40     0.70     1.04     1.51     1.70 
## 
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## --------------------------------------------------------------------------------
## cut 
##        n  missing distinct 
##    53940        0        5 
## 
## lowest : Fair      Good      Very Good Premium   Ideal    
## highest: Fair      Good      Very Good Premium   Ideal    
##                                                             
## Value           Fair      Good Very Good   Premium     Ideal
## Frequency       1610      4906     12082     13791     21551
## Proportion     0.030     0.091     0.224     0.256     0.400
## --------------------------------------------------------------------------------
## color 
##        n  missing distinct 
##    53940        0        7 
## 
## lowest : D E F G H, highest: F G H I J
##                                                     
## Value          D     E     F     G     H     I     J
## Frequency   6775  9797  9542 11292  8304  5422  2808
## Proportion 0.126 0.182 0.177 0.209 0.154 0.101 0.052
## --------------------------------------------------------------------------------
## clarity 
##        n  missing distinct 
##    53940        0        8 
## 
## lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
##                                                           
## Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
## Frequency    741  9194 13065 12258  8171  5066  3655  1790
## Proportion 0.014 0.170 0.242 0.227 0.151 0.094 0.068 0.033
## --------------------------------------------------------------------------------
## depth 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    53940        0      184    0.999    61.75    1.515     59.3     60.0 
##      .25      .50      .75      .90      .95 
##     61.0     61.8     62.5     63.3     63.8 
## 
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## --------------------------------------------------------------------------------
## table 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    53940        0      127     0.98    57.46    2.448       54       55 
##      .25      .50      .75      .90      .95 
##       56       57       59       60       61 
## 
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## --------------------------------------------------------------------------------
## price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    53940        0    11602        1     3933     4012      544      646 
##      .25      .50      .75      .90      .95 
##      950     2401     5324     9821    13107 
## 
## lowest :   326   327   334   335   336, highest: 18803 18804 18806 18818 18823
## --------------------------------------------------------------------------------
## kod 
##        n  missing distinct 
##    53940        0        3 
##                             
## Value       kod1  kod2  kod3
## Frequency  17674 34112  2154
## Proportion 0.328 0.632 0.040
## --------------------------------------------------------------------------------

Następnie zbadano jaka jest średnia oraz odchylenie standardowe, kiedy uzależni się od siebie różne kolumny. Użyto w tym celu funkcji summaryBy.

summaryBy(price ~ cut, data = dane,
          FUN = function(x) { c(m = mean(x), s = sd(x)) } )

## # A tibble: 5 x 3
##   cut       price.m price.s
##   <ord>       <dbl>   <dbl>
## 1 Fair        4359.   3560.
## 2 Good        3929.   3682.
## 3 Very Good   3982.   3936.
## 4 Premium     4584.   4349.
## 5 Ideal       3458.   3808.

summaryBy(price~ cut+kod, data = dane,
          FUN = function(x) { c(m = mean(x), s = sd(x)) } )

## # A tibble: 15 x 4
##    cut       kod   price.m price.s
##    <ord>     <fct>   <dbl>   <dbl>
##  1 Fair      kod1     895.    322.
##  2 Fair      kod2    3770.   2367.
##  3 Fair      kod3   11875.   3751.
##  4 Good      kod1     694.    238.
##  5 Good      kod2    4387.   2777.
##  6 Good      kod3   14598.   2836.
##  7 Very Good kod1     699.    227.
##  8 Very Good kod2    4808.   3239.
##  9 Very Good kod3   15104.   2478.
## 10 Premium   kod1     817.    221.
## 11 Premium   kod2    5339.   3437.
## 12 Premium   kod3   14908.   2551.
## 13 Ideal     kod1     831.    249.
## 14 Ideal     kod2    4829.   3510.
## 15 Ideal     kod3   15524.   2290.

summaryBy(price+carat ~ cut+kod, data = dane,
          FUN = function(x) { c(m = mean(x), s = sd(x)) } )

## # A tibble: 15 x 6
##    cut       kod   price.m price.s carat.m carat.s
##    <ord>     <fct>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Fair      kod1     895.    322.   0.376  0.0727
##  2 Fair      kod2    3770.   2367.   0.965  0.300 
##  3 Fair      kod3   11875.   3751.   2.22   0.446 
##  4 Good      kod1     694.    238.   0.349  0.0593
##  5 Good      kod2    4387.   2777.   0.944  0.305 
##  6 Good      kod3   14598.   2836.   2.11   0.216 
##  7 Very Good kod1     699.    227.   0.333  0.0617
##  8 Very Good kod2    4808.   3239.   0.943  0.318 
##  9 Very Good kod3   15104.   2478.   2.10   0.179 
## 10 Premium   kod1     817.    221.   0.351  0.0511
## 11 Premium   kod2    5339.   3437.   1.02   0.335 
## 12 Premium   kod3   14908.   2551.   2.14   0.216 
## 13 Ideal     kod1     831.    249.   0.345  0.0497
## 14 Ideal     kod2    4829.   3510.   0.898  0.333 
## 15 Ideal     kod3   15524.   2290.   2.14   0.177

summaryBy(. ~ cut+kod, data = dane,
          FUN = function(x) { c(m = mean(x), s = sd(x)) } )

## # A tibble: 15 x 10
##    cut     kod   carat.m carat.s depth.m depth.s table.m table.s price.m price.s
##    <ord>   <fct>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Fair    kod1    0.376  0.0727    62.4   4.38     59.1    3.82    895.    322.
##  2 Fair    kod2    0.965  0.300     64.1   3.59     59.1    3.89   3770.   2367.
##  3 Fair    kod3    2.22   0.446     65.0   3.11     58.3    4.40  11875.   3751.
##  4 Good    kod1    0.349  0.0593    62.7   1.88     57.5    2.69    694.    238.
##  5 Good    kod2    0.944  0.305     62.3   2.23     59.1    2.78   4387.   2777.
##  6 Good    kod3    2.11   0.216     62.0   2.46     59.7    2.83  14598.   2836.
##  7 Very G~ kod1    0.333  0.0617    61.8   1.34     57.6    2.15    699.    227.
##  8 Very G~ kod2    0.943  0.318     61.9   1.39     58.1    2.08   4808.   3239.
##  9 Very G~ kod3    2.10   0.179     61.8   1.52     58.6    2.22  15104.   2478.
## 10 Premium kod1    0.351  0.0511    61.3   1.04     58.8    1.27    817.    221.
## 11 Premium kod2    1.02   0.335     61.3   1.20     58.7    1.56   5339.   3437.
## 12 Premium kod3    2.14   0.216     61.2   1.22     59.0    1.49  14908.   2551.
## 13 Ideal   kod1    0.345  0.0497    61.7   0.646    55.7    1.18    831.    249.
## 14 Ideal   kod2    0.898  0.333     61.7   0.755    56.1    1.26   4829.   3510.
## 15 Ideal   kod3    2.14   0.177     61.9   0.939    56.5    1.47  15524.   2290.

Wykresy

Najpierw utworzono ogólne wykresy każdej kolumny.

names=names(dane)

par(mfrow=c(3,3))

for (i in 1:length(dane)) {
  
  plot(dane[,i],ylab=names[i])
}

Poniższy histogram przedstawiał ilość kamieni szlachentych w zależności od ceny.

x=dane$price
h=hist(x,breaks=seq(300,20000,600))

xfit<-seq(min(x),max(x),length=20000)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)

Estymator jądrowy gęstości (ang. Kernel Density Plot)

d <- density(dane$price) 
plot(d)

Poniżej przedstawione zostały wartości ceny po rodzaju cięcia.

sm.density.compare(dane$price, dane$cut, xlab="cena", ylab="cięcie")
cyl.f <- dane$cut
title(main="Cena w zależności od cięcia")
colfill <- c(2:(2+length(levels(cyl.f))))
legend(5,2, levels(cyl.f), fill=colfill)

Boxploty

Poniżej zostały utworzone boxploty na podstawie danych z kolumn: price oraz cut.

ggboxplot(dane, y = "price", width = 0.5)

ggboxplot(dane, y = "price", x="cut", color="cut", width = 0.5)

Histogramy

Poniższe histogramy zostały utworzone na podstawie cen kamieni szlachetnych.

gghistogram(dane, x = "price",bins = 15, add = "mean",fill = "lightgrey")

gghistogram(dane, x = "price", bins = 10, add = "mean",color = "kod",fill="kod", 
            palette = c("#FF0000", "#006600" , "#0000FF"), add_density = FALSE)

Wykresy skumulowane

ggecdf(dane, x = "carat")

Wykres kwantyl-kwantyl

ggqqplot(dane, x = "carat")

Inne wykresy

Poniższe wykresy zostały utworzone za pomocą pakietu ggplot2.

ggplot(dane, mapping = aes(cut)) +
  geom_bar(aes(fill = cut)) +
  scale_fill_manual(values = c("#33FFFF","#330099", "#CC00FF","#660099", "#0099FF")) +
  labs(title = "Podział kamieni szlachetnych ze względu na cięcie",
       x = "cięcie",
       y = "Ilość kamieni o danym rodzaju cięcia") +
  theme(legend.position = c(0.1,0.8),
        legend.background = element_rect(fill = "grey90", 
                                         colour = "black"))

dane %>% ggplot(aes(depth, fill = cut)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Ilość kamieni szlachetnych ze względu na głębię",
       x = "Głębia",
       y = "Ilośc") + 
  theme(legend.position = c(0.9, 0.8), 
        legend.direction = "vertical",
        legend.background = element_rect(fill = "grey90", 
                                         colour = "white"))

dane %>%
  group_by(cut) %>%
  summarise(ilosc = n()) -> dane_summary

ggplot(dane_summary, aes(x=cut, y=ilosc, color=ilosc)) +
  geom_point(size=3, show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Ilość kamieni szlachetnych na podstawie rodzaju cięcia", 
       x = "cięcie", 
       y = "ilość")