Data Preparation and Validation, Part 1

Majkowska Agata

Summer semester, full-time studies

INTRODUCTION

QUICK REFERENCE:

running a line of code - ctrl + enter
commenting out a line - ctrl + shift + c

Using help (HELP)

?summary

## starting httpd help server ... done

CALCULATOR

a<-3  # a=3
b<-15  # b=15
a+b

## [1] 18

a-b

## [1] -12

a*b

## [1] 45

a/b

## [1] 0.2

a<-c(1,5,3)
prod(a) # product of the vector's elements

## [1] 15
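For comparison, a small sketch of related base R functions on vectors. The vectors `x` and `y` are illustrative and not part of the course data.

```r
# Illustrative vectors (not from the course data)
x <- c(1, 5, 3)
y <- c(2, 2, 2)

total   <- sum(x)    # 1 + 5 + 3 = 9
product <- prod(x)   # 1 * 5 * 3 = 15
ratio   <- x / y     # element-wise division: 0.5 2.5 1.5

print(total)
print(product)
print(ratio)
```

Arithmetic operators in R work element by element, while sum() and prod() collapse a vector to a single number.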

BASICS

GENERATING SEQUENCES

A sequence of numbers from 1 to 5

1:5

## [1] 1 2 3 4 5

Assigning the sequence of numbers from 1 to 30 to the variable “a”; “a1” receives the same sequence (the c() wrapper is redundant here)

a<-1:30
a1<-c(1:30)

Assigning to the variable “a1” a sequence of 4 values of a qualitative (categorical) variable, where “M” stands for a man and “K” for a woman. This overwrites the previous contents of “a1”

a1<-c("M", "K","M","K")

Sequence length

length(a)

## [1] 30

Generating a sequence made of the elements 1 to 5, repeated 5 times

rep(1:5,5) ### the sequence 1 to 5 is repeated 5 times

##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Generating a sequence made of the elements 1 to 5, where each element is repeated 2 times

rep(1:5,each=2)

##  [1] 1 1 2 2 3 3 4 4 5 5
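The two arguments of rep() can be combined, and seq() can take a fixed step instead of a fixed length. A minimal sketch; the variable names `s`, `r` and `g` are illustrative.

```r
# seq() with a fixed step instead of a fixed length
s <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00

# rep() with both arguments: each element twice, then the whole pattern 2 times
r <- rep(1:3, each = 2, times = 2)  # 1 1 2 2 3 3 1 1 2 2 3 3

# rep() also works on character vectors, e.g. the "M"/"K" codes used above
g <- rep(c("M", "K"), times = 3)    # "M" "K" "M" "K" "M" "K"

print(s)
print(r)
print(g)
```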

INDEXING SEQUENCE ELEMENTS

Displaying a sequence of 10 evenly spaced numbers from -1 to 1

a=seq(-1,1,length=10)
a

##  [1] -1.0000000 -0.7777778 -0.5555556 -0.3333333 -0.1111111  0.1111111
##  [7]  0.3333333  0.5555556  0.7777778  1.0000000

Displaying the fifth element of the sequence “a”

a[5]

## [1] -0.1111111

Displaying the first and sixth elements of the sequence “a”

a[c(1,6)]

## [1] -1.0000000  0.1111111

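Dropping the first element of the sequence “a”. The full sequence is printed first for comparison; a[-1] returns a shortened copy and leaves “a” itself unchanged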

##  [1] -1.0000000 -0.7777778 -0.5555556 -0.3333333 -0.1111111  0.1111111
##  [7]  0.3333333  0.5555556  0.7777778  1.0000000

a[-1]

## [1] -0.7777778 -0.5555556 -0.3333333 -0.1111111  0.1111111  0.3333333  0.5555556
## [8]  0.7777778  1.0000000

Dropping the second and sixth elements of the sequence “a” (the full sequence is printed first for comparison)

##  [1] -1.0000000 -0.7777778 -0.5555556 -0.3333333 -0.1111111  0.1111111
##  [7]  0.3333333  0.5555556  0.7777778  1.0000000

a[-c(2,6)]

## [1] -1.0000000 -0.5555556 -0.3333333 -0.1111111  0.3333333  0.5555556  0.7777778
## [8]  1.0000000
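Negative indexing does not modify the original vector, so the result has to be assigned if it should be kept. A small sketch of this, plus logical indexing; the names `a_short` and `a_pos` are illustrative.

```r
a <- seq(-1, 1, length = 10)

# Negative indexing returns a new vector; `a` keeps all 10 elements
length(a[-1])   # 9
length(a)       # 10

# To keep the shortened sequence, assign the result (here to a new name)
a_short <- a[-1]

# Logical indexing: keep only the positive values
a_pos <- a[a > 0]
print(a_pos)
```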

INSTALLING PACKAGES

Installing the “readxl” package, which contains, among other things, a function for opening files with the xlsx extension. Installation only needs to be run once!

#install.packages("readxl")
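One way to avoid reinstalling a package on every run is to check first whether it is already available. The helper `is_installed` below is a hypothetical name introduced for this sketch; it only wraps the base function requireNamespace().

```r
# Hypothetical helper: TRUE when a package can be loaded, FALSE otherwise
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")   # a package shipped with R, so this is TRUE

# Install only when missing (uncomment the install line to use it)
if (!is_installed("readxl")) {
  message("readxl is not installed yet")
  # install.packages("readxl")
}
```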

Loading the package. The library() call has to be run again after every restart of RStudio

library(readxl)

## Warning: package 'readxl' was built under R version 4.1.3

Loading the data set with the “read_excel” function from the “readxl” library. NOTE: please provide your own path to the data.

dane <- read_excel("C:/Users/majko/OneDrive/Dokumenty/Zajecia_WZR/PRZYGOTOWANIE_DANYCH_WALIDACJA_DATA_MANAGMENT/Dane_AW_zanieczyszczenie.xlsx", 
                                       sheet = "Dane")

Converting the data to a data frame

dane<-as.data.frame(dane)

Checking the types of the variables in the “dane” data set

str(dane)

## 'data.frame':    16 obs. of  8 variables:
##  $ Województwo: chr  "DOLNOŚLĄSKIE" "KUJAWSKO-POMORSKIE" "LUBELSKIE" "LUBUSKIE" ...
##  $ NO         : num  3.81 5.94 2.9 2.65 14.56 ...
##  $ CO         : num  2.55 7.46 2.78 2.41 12.72 ...
##  $ CO2        : num  4.4 4.45 2.38 2.14 16.14 ...
##  $ PYŁ %      : num  6.31 5.13 4.48 2.4 5.85 5.36 7.24 3.26 3.41 2.11 ...
##  $ PALIWA     : num  4.49 4.87 3.75 7.2 6.74 ...
##  $ ŚCIEKI     : num  7.99 7 2.74 2.62 5.27 ...
##  $ ZIELEŃ     : num  4.18 3.3 4.16 3.75 3.8 ...

INDEXING - SELECTING ROWS AND COLUMNS

Displaying the first row of the “dane” data set:

dane [ 1, ]

##    Województwo    NO    CO   CO2 PYŁ % PALIWA ŚCIEKI ZIELEŃ
## 1 DOLNOŚLĄSKIE 3.813 2.546 4.399  6.31  4.494   7.99  4.177

Displaying the second column of the “dane” data set

dane [ , 2]

##  [1]  3.813  5.936  2.903  2.655 14.561  4.432  5.463 13.284  2.254  1.992
## [11]  3.247  9.376 17.349  1.676  6.368  5.730

Displaying the value in the second row and third column of the “dane” data set

dane [ 2,3]

## [1] 7.464

Extracting the first column of the “dane” data set, which contains the voivodeship names, and assigning those names to the variable “woje”

woje<-dane[ ,1]

Assigning the voivodeship names as the row names of the data frame

row.names(dane)<-woje

Removing the column with the voivodeship names, which is no longer needed now that the names serve as the row names of the data frame

dane<-dane[,-1]
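Once the names are stored as row names, rows can also be selected by name. A minimal sketch on a made-up miniature data frame; `df`, its values, and the region labels are illustrative and not taken from the Excel file.

```r
# Made-up miniature data set; the values are illustrative only
df <- data.frame(
  region = c("A", "B", "C"),
  NO     = c(3.8, 5.9, 2.9),
  CO     = c(2.5, 7.5, 2.8)
)

# Use the first column as row names, then drop it.
# drop = FALSE keeps a data frame even if a single column remained.
row.names(df) <- df[, 1]
df <- df[, -1, drop = FALSE]

df["B", "CO"]       # select by row name and column name
df[c("A", "C"), ]   # several rows by name
df$NO               # one column as a plain vector
```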

THE DPLYR LIBRARY

Cheat sheet with the functions - https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf

Loading the library

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.1.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

DATA

https://www.kaggle.com/zynicide/wine-reviews

LIBRARIES

# install.packages('dplyr')
# install.packages('readr')
# install.packages('tidyverse')
# install.packages('ggplot2')
# install.packages('rpivotTable')


library(dplyr)
library(readr)

## Warning: package 'readr' was built under R version 4.1.3

# library(tidyverse)
library(ggplot2)
library(rpivotTable)

## Warning: package 'rpivotTable' was built under R version 4.1.3

LOADING THE DATA

wines <- read_csv("C:/Users/majko/OneDrive/Dokumenty/DOKTORAT/5_semestr/Warsztaty_R/Warsztat_dokto_7.12/winemag-data-130k-v2.csv")
wines

## # A tibble: 129,971 x 14
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1     0 Italy   Aromas inc~ Vulka Bian~     87    NA Sicily ~ Etna     <NA>    
##  2     1 Portug~ This is ri~ Avidagos        87    15 Douro    <NA>     <NA>    
##  3     2 US      Tart and s~ <NA>            87    14 Oregon   Willame~ Willame~
##  4     3 US      Pineapple ~ Reserve La~     87    13 Michigan Lake Mi~ <NA>    
##  5     4 US      Much like ~ Vintner's ~     87    65 Oregon   Willame~ Willame~
##  6     5 Spain   Blackberry~ Ars In Vit~     87    15 Norther~ Navarra  <NA>    
##  7     6 Italy   Here's a b~ Belsito         87    16 Sicily ~ Vittoria <NA>    
##  8     7 France  This dry a~ <NA>            87    24 Alsace   Alsace   <NA>    
##  9     8 Germany Savory dri~ Shine           87    12 Rheinhe~ <NA>     <NA>    
## 10     9 France  This has g~ Les Natures     87    27 Alsace   Alsace   <NA>    
## # i 129,961 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>

DATA OVERVIEW

glimpse(wines)

## Rows: 129,971
## Columns: 14
## $ ...1                  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
## $ country               <chr> "Italy", "Portugal", "US", "US", "US", "Spain", ~
## $ description           <chr> "Aromas include tropical fruit, broom, brimstone~
## $ designation           <chr> "Vulka Bianco", "Avidagos", NA, "Reserve Late Ha~
## $ points                <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, ~
## $ price                 <dbl> NA, 15, 14, 13, 65, 15, 16, 24, 12, 27, 19, 30, ~
## $ province              <chr> "Sicily & Sardinia", "Douro", "Oregon", "Michiga~
## $ region_1              <chr> "Etna", NA, "Willamette Valley", "Lake Michigan ~
## $ region_2              <chr> NA, NA, "Willamette Valley", NA, "Willamette Val~
## $ taster_name           <chr> "Kerin O’Keefe", "Roger Voss", "Paul Gregutt", "~
## $ taster_twitter_handle <chr> "@kerinokeefe", "@vossroger", "@paulgwine ", NA,~
## $ title                 <chr> "Nicosia 2013 Vulka Bianco  (Etna)", "Quinta dos~
## $ variety               <chr> "White Blend", "Portuguese Red", "Pinot Gris", "~
## $ winery                <chr> "Nicosia", "Quinta dos Avidagos", "Rainstorm", "~

FREQUENCY TABLE

table(wines$country)

## 
##              Argentina                Armenia              Australia 
##                   3800                      2                   2329 
##                Austria Bosnia and Herzegovina                 Brazil 
##                   3345                      2                     52 
##               Bulgaria                 Canada                  Chile 
##                    141                    257                   4472 
##                  China                Croatia                 Cyprus 
##                      1                     73                     11 
##         Czech Republic                  Egypt                England 
##                     12                      1                     74 
##                 France                Georgia                Germany 
##                  22093                     86                   2165 
##                 Greece                Hungary                  India 
##                    466                    146                      9 
##                 Israel                  Italy                Lebanon 
##                    505                  19540                     35 
##             Luxembourg              Macedonia                 Mexico 
##                      6                     12                     70 
##                Moldova                Morocco            New Zealand 
##                     59                     28                   1419 
##                   Peru               Portugal                Romania 
##                     16                   5691                    120 
##                 Serbia               Slovakia               Slovenia 
##                     12                      1                     87 
##           South Africa                  Spain            Switzerland 
##                   1401                   6645                      7 
##                 Turkey                Ukraine                Uruguay 
##                     90                     14                    109 
##                     US 
##                  54504
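A frequency table is easier to read when sorted or expressed as shares. A small sketch with made-up country labels (the vector `country` below is illustrative, not the wines column):

```r
# Made-up country labels, including one missing value
country <- c("US", "France", "US", "Italy", "US", "France", NA)

counts  <- table(country)                    # NA is left out by default
sorted  <- sort(counts, decreasing = TRUE)   # most frequent value first
shares  <- prop.table(counts)                # proportions summing to 1
with_na <- table(country, useNA = "ifany")   # count missing values as well

print(sorted)
print(round(shares, 2))
print(with_na)
```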

FILTERING

Keyboard shortcut: ctrl+shift+m -> %>%

wines%>%
  filter( points >= 94, price < 25)

## # A tibble: 66 x 14
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1  5011 US      Truly stun~ Lewis Esta~     95    20 Washing~ Columbi~ Columbi~
##  2  6267 US      This taste~ Lucille La~     94    18 Washing~ Yakima ~ Columbi~
##  3 10763 Portug~ His skills~ Rapariga d~     94    23 Alentej~ <NA>     <NA>    
##  4 12944 France  The Côte d~ Côte du Py~     94    24 Beaujol~ Morgon   <NA>    
##  5 12945 France  Be gratefu~ Vieilles V~     94    24 Beaujol~ Moulin-~ <NA>    
##  6 12967 France  A firm and~ <NA>            94    24 Beaujol~ Moulin-~ <NA>    
##  7 15196 France  The home v~ Château Bo~     95    20 Southwe~ Madiran  <NA>    
##  8 15211 US      The deep g~ <NA>            94    22 Oregon   Willame~ Willame~
##  9 17294 US      Opulento i~ Opulento D~     94    20 Washing~ Yakima ~ Columbi~
## 10 17983 France  This is on~ <NA>            94    20 Provence Coteaux~ <NA>    
## # i 56 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>
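Conditions passed to filter() separated by commas are combined with AND. The sketch below, on a made-up tibble `reviews` whose column names follow the wines data, also shows OR and how missing values behave.

```r
library(dplyr)

# Made-up reviews; column names follow the wines data set
reviews <- tibble(
  country = c("US", "France", "Italy", "US"),
  points  = c(95, 94, 88, 90),
  price   = c(20, 24, 15, NA)
)

# Conditions separated by commas are combined with AND
both <- reviews %>% filter(points >= 94, price < 25)

# OR has to be written explicitly with |
either <- reviews %>% filter(points >= 94 | price < 25)

# Rows where the condition evaluates to NA (here: missing price) are dropped
cheap <- reviews %>% filter(price < 25)
```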

SAMPLING

Drawing a random sample of 15% of the observations from the data set.

wines%>%
sample_frac( 0.15)

## # A tibble: 19,496 x 14
##      ...1 country description         designation points price province region_1
##     <dbl> <chr>   <chr>               <chr>        <dbl> <dbl> <chr>    <chr>   
##  1 118535 US      From two vineyards~ <NA>            87    12 Califor~ Santa B~
##  2 113658 US      No questioning the~ <NA>            89    60 Califor~ Napa Va~
##  3  25400 Austria This offers broodi~ Frauenfeld      94    40 Thermen~ <NA>    
##  4  72671 Chile   Asphalt and spice ~ Reserva         86    13 Maule V~ <NA>    
##  5 103732 France  A strong effort fr~ Les Forots      90    19 Rhône V~ Côtes d~
##  6  75937 Chile   An herbal cherry a~ <NA>            80    10 Central~ <NA>    
##  7  44055 US      This is a young wi~ <NA>            88    11 Washing~ Columbi~
##  8  53725 US      Starts off with a ~ Reserve         94   125 Califor~ Napa Va~
##  9  77118 Italy   This is an austere~ Agostino P~     87    NA Tuscany  Chianti~
## 10 126711 US      Rich in the flavor~ Old Vine        86    18 Califor~ Lodi    
## # i 19,486 more rows
## # i 6 more variables: region_2 <chr>, taster_name <chr>,
## #   taster_twitter_handle <chr>, title <chr>, variety <chr>, winery <chr>
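Because the sample is random, every run returns different rows. A minimal sketch of making the draw repeatable with set.seed(); the tibble `df` and the seed value 123 are arbitrary choices for illustration.

```r
library(dplyr)

df <- tibble(id = 1:100)   # made-up data: 100 numbered rows

# Fixing the seed makes the random draw repeatable
set.seed(123)
s1 <- df %>% sample_frac(0.15)
set.seed(123)
s2 <- df %>% sample_frac(0.15)

identical(s1$id, s2$id)   # the same rows both times

# slice_sample() is the newer dplyr spelling of the same operation
s3 <- df %>% slice_sample(prop = 0.15)
```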

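DISPLAYING THE TOP OBSERVATIONS BY A VARIABLE

top_n() keeps ties, so asking for the top 3 by points returns every wine tied at the highest score (19 rows here) rather than exactly 3 rows.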

wines%>%
top_n( 3, points)

## # A tibble: 19 x 14
##      ...1 country   description       designation points price province region_1
##     <dbl> <chr>     <chr>             <chr>        <dbl> <dbl> <chr>    <chr>   
##  1    345 Australia This wine contai~ Rare           100   350 Victoria Rutherg~
##  2   7335 Italy     Thick as molasse~ Occhio di ~    100   210 Tuscany  Vin San~
##  3  36528 France    This is a fabulo~ Brut           100   259 Champag~ Champag~
##  4  39286 Italy     A perfect wine f~ Masseto        100   460 Tuscany  Toscana 
##  5  42197 Portugal  This is the late~ Barca-Velha    100   450 Douro    <NA>    
##  6  45781 Italy     This gorgeous, f~ Riserva        100   550 Tuscany  Brunell~
##  7  45798 US        Tasted in a flig~ <NA>           100   200 Califor~ Napa Va~
##  8  58352 France    This is a magnif~ <NA>           100   150 Bordeaux Saint-J~
##  9  89728 France    This latest inca~ Cristal Vi~    100   250 Champag~ Champag~
## 10  89729 France    This new release~ Le Mesnil ~    100   617 Champag~ Champag~
## 11 111753 France    Almost black in ~ <NA>           100  1500 Bordeaux Pauillac
## 12 111754 Italy     It takes only a ~ Cerretalto     100   270 Tuscany  Brunell~
## 13 111755 France    This is the fine~ <NA>           100  1500 Bordeaux Saint-É~
## 14 111756 France    A hugely powerfu~ <NA>           100   359 Bordeaux Saint-J~
## 15 113929 US        In 2005 Charles ~ Royal City     100    80 Washing~ Columbi~
## 16 114972 Portugal  A powerful and r~ Nacional V~    100   650 Port     <NA>    
## 17 118058 US        This wine dazzle~ La Muse        100   450 Califor~ Sonoma ~
## 18 122935 France    Full of ripe fru~ <NA>           100   848 Bordeaux Pessac-~
## 19 123545 US        Initially a rath~ Bionic Frog    100    80 Washing~ Walla W~
## # i 6 more variables: region_2 <chr>, taster_name <chr>,
## #   taster_twitter_handle <chr>, title <chr>, variety <chr>, winery <chr>
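The tie behaviour can be seen on a small made-up example, together with slice_max(), the newer dplyr function that can cut the result off at exactly n rows. The tibble `scores` is illustrative.

```r
library(dplyr)

# Made-up scores with a tie at the top
scores <- tibble(
  wine   = c("A", "B", "C", "D", "E"),
  points = c(100, 100, 100, 100, 98)
)

# top_n() keeps every row tied with the n-th best value
tied <- scores %>% top_n(3, points)

# slice_max() behaves the same by default ...
tied2 <- scores %>% slice_max(points, n = 3)

# ... but can be told to return exactly n rows
exact <- scores %>% slice_max(points, n = 3, with_ties = FALSE)

nrow(tied)    # 4
nrow(tied2)   # 4
nrow(exact)   # 3
```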

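THE CHEAPEST WINES

The minus sign reverses the ranking, so the 100 lowest prices are selected; because ties are kept, the result has 177 rows.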

wines%>%
top_n( 100, -price)

## # A tibble: 177 x 14
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1  1620 Portug~ The very l~ Brado Bran~     85     6 Alentej~ <NA>     <NA>    
##  2  1987 Spain   Berry and ~ Flirty Bird     85     4 Central~ Vino de~ <NA>    
##  3  2335 US      Reserved a~ <NA>            85     6 Washing~ Washing~ Washing~
##  4  2618 Argent~ Lightly br~ <NA>            83     6 Mendoza~ Mendoza  <NA>    
##  5  2780 Portug~ This feels~ Morgado da~     84     5 Alentej~ <NA>     <NA>    
##  6  3167 Italy   Packaged i~ Mini            86     5 Veneto   Prosecco <NA>    
##  7  3948 Portug~ Soft, swee~ Coreto          83     6 Lisboa   <NA>     <NA>    
##  8  3950 Portug~ On the dry~ Escolha         83     5 Vinho V~ <NA>     <NA>    
##  9  5152 Spain   A steal fo~ Vina Borgia     87     6 Norther~ Campo d~ <NA>    
## 10  5789 France  This is a ~ <NA>            83     5 France ~ Vin de ~ <NA>    
## # i 167 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>

SORTING

wines%>%
  arrange( desc(points))

## # A tibble: 129,971 x 14
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1   345 Austra~ This wine ~ Rare           100   350 Victoria Rutherg~ <NA>    
##  2  7335 Italy   Thick as m~ Occhio di ~    100   210 Tuscany  Vin San~ <NA>    
##  3 36528 France  This is a ~ Brut           100   259 Champag~ Champag~ <NA>    
##  4 39286 Italy   A perfect ~ Masseto        100   460 Tuscany  Toscana  <NA>    
##  5 42197 Portug~ This is th~ Barca-Velha    100   450 Douro    <NA>     <NA>    
##  6 45781 Italy   This gorge~ Riserva        100   550 Tuscany  Brunell~ <NA>    
##  7 45798 US      Tasted in ~ <NA>           100   200 Califor~ Napa Va~ Napa    
##  8 58352 France  This is a ~ <NA>           100   150 Bordeaux Saint-J~ <NA>    
##  9 89728 France  This lates~ Cristal Vi~    100   250 Champag~ Champag~ <NA>    
## 10 89729 France  This new r~ Le Mesnil ~    100   617 Champag~ Champag~ <NA>    
## # i 129,961 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>

wines%>%
  arrange( -points)

## # A tibble: 129,971 x 14
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1   345 Austra~ This wine ~ Rare           100   350 Victoria Rutherg~ <NA>    
##  2  7335 Italy   Thick as m~ Occhio di ~    100   210 Tuscany  Vin San~ <NA>    
##  3 36528 France  This is a ~ Brut           100   259 Champag~ Champag~ <NA>    
##  4 39286 Italy   A perfect ~ Masseto        100   460 Tuscany  Toscana  <NA>    
##  5 42197 Portug~ This is th~ Barca-Velha    100   450 Douro    <NA>     <NA>    
##  6 45781 Italy   This gorge~ Riserva        100   550 Tuscany  Brunell~ <NA>    
##  7 45798 US      Tasted in ~ <NA>           100   200 Califor~ Napa Va~ Napa    
##  8 58352 France  This is a ~ <NA>           100   150 Bordeaux Saint-J~ <NA>    
##  9 89728 France  This lates~ Cristal Vi~    100   250 Champag~ Champag~ <NA>    
## 10 89729 France  This new r~ Le Mesnil ~    100   617 Champag~ Champag~ <NA>    
## # i 129,961 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>
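Both calls above sort by points in descending order. A short sketch on made-up data showing sorting by several keys, and why desc() is the more general choice; `df` and its values are illustrative.

```r
library(dplyr)

# Made-up data with ties in points
df <- tibble(
  wine   = c("A", "B", "C", "D"),
  points = c(90, 95, 95, 90),
  price  = c(30, 50, 20, 10)
)

# Highest points first; within equal points, cheapest first
ranked <- df %>% arrange(desc(points), price)
ranked$wine    # "C" "B" "D" "A"

# The unary minus only works for numeric columns; desc() also handles text
by_name <- df %>% arrange(desc(wine))
by_name$wine   # "D" "C" "B" "A"
```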

SELECTING VARIABLES

wines%>%
  select( country, province:region_2)

## # A tibble: 129,971 x 4
##    country  province          region_1            region_2         
##    <chr>    <chr>             <chr>               <chr>            
##  1 Italy    Sicily & Sardinia Etna                <NA>             
##  2 Portugal Douro             <NA>                <NA>             
##  3 US       Oregon            Willamette Valley   Willamette Valley
##  4 US       Michigan          Lake Michigan Shore <NA>             
##  5 US       Oregon            Willamette Valley   Willamette Valley
##  6 Spain    Northern Spain    Navarra             <NA>             
##  7 Italy    Sicily & Sardinia Vittoria            <NA>             
##  8 France   Alsace            Alsace              <NA>             
##  9 Germany  Rheinhessen       <NA>                <NA>             
## 10 France   Alsace            Alsace              <NA>             
## # i 129,961 more rows
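Besides names and ranges such as province:region_2, select() accepts helper functions and negative selections. A minimal sketch on a one-row made-up tibble whose column names follow the wines data:

```r
library(dplyr)

# One made-up row; column names follow the wines data set
df <- tibble(country = "Italy", province = "Sicily & Sardinia",
             region_1 = "Etna", region_2 = NA_character_, points = 87)

by_prefix <- df %>% select(starts_with("region"))   # region_1, region_2
without   <- df %>% select(-points)                 # everything except points
reordered <- df %>% select(points, everything())    # points moved to the front

names(by_prefix)
names(without)
names(reordered)
```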

RENAMING VARIABLES

wines%>%
  rename( punkty = points)

## # A tibble: 129,971 x 14
##     ...1 country description designation punkty price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1     0 Italy   Aromas inc~ Vulka Bian~     87    NA Sicily ~ Etna     <NA>    
##  2     1 Portug~ This is ri~ Avidagos        87    15 Douro    <NA>     <NA>    
##  3     2 US      Tart and s~ <NA>            87    14 Oregon   Willame~ Willame~
##  4     3 US      Pineapple ~ Reserve La~     87    13 Michigan Lake Mi~ <NA>    
##  5     4 US      Much like ~ Vintner's ~     87    65 Oregon   Willame~ Willame~
##  6     5 Spain   Blackberry~ Ars In Vit~     87    15 Norther~ Navarra  <NA>    
##  7     6 Italy   Here's a b~ Belsito         87    16 Sicily ~ Vittoria <NA>    
##  8     7 France  This dry a~ <NA>            87    24 Alsace   Alsace   <NA>    
##  9     8 Germany Savory dri~ Shine           87    12 Rheinhe~ <NA>     <NA>    
## 10     9 France  This has g~ Les Natures     87    27 Alsace   Alsace   <NA>    
## # i 129,961 more rows
## # i 5 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>

ADDING A COLUMN WITH THE WINE PRICE IN ZLOTYS (PLN)

usd_to_pln = 3.95
wines<-wines%>%
  mutate( price_pln = price * usd_to_pln)
wines

## # A tibble: 129,971 x 15
##     ...1 country description designation points price province region_1 region_2
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
##  1     0 Italy   Aromas inc~ Vulka Bian~     87    NA Sicily ~ Etna     <NA>    
##  2     1 Portug~ This is ri~ Avidagos        87    15 Douro    <NA>     <NA>    
##  3     2 US      Tart and s~ <NA>            87    14 Oregon   Willame~ Willame~
##  4     3 US      Pineapple ~ Reserve La~     87    13 Michigan Lake Mi~ <NA>    
##  5     4 US      Much like ~ Vintner's ~     87    65 Oregon   Willame~ Willame~
##  6     5 Spain   Blackberry~ Ars In Vit~     87    15 Norther~ Navarra  <NA>    
##  7     6 Italy   Here's a b~ Belsito         87    16 Sicily ~ Vittoria <NA>    
##  8     7 France  This dry a~ <NA>            87    24 Alsace   Alsace   <NA>    
##  9     8 Germany Savory dri~ Shine           87    12 Rheinhe~ <NA>     <NA>    
## 10     9 France  This has g~ Les Natures     87    27 Alsace   Alsace   <NA>    
## # i 129,961 more rows
## # i 6 more variables: taster_name <chr>, taster_twitter_handle <chr>,
## #   title <chr>, variety <chr>, winery <chr>, price_pln <dbl>

STATISTICS

wines%>%
  summarise(mean_price = mean(price, na.rm = TRUE),
            std_price = sd(price, na.rm = TRUE))

## # A tibble: 1 x 2
##   mean_price std_price
##        <dbl>     <dbl>
## 1       35.4      41.0
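The na.rm argument matters because the price column contains missing values. A small sketch on a made-up vector `prices`:

```r
prices <- c(15, 14, NA, 65)   # made-up prices with one missing value

mean(prices)                  # NA: missing values propagate by default
mean(prices, na.rm = TRUE)    # mean of the three known prices
sd(prices, na.rm = TRUE)      # standard deviation of the known prices
sum(is.na(prices))            # number of missing values
```

Checking the number of missing values first shows how many observations the statistics are actually based on.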

QUANTILES

quantile(wines$price, na.rm = TRUE, probs = c(0, 0.1, 0.25, 0.50, 0.75, 0.9, 1))

##   0%  10%  25%  50%  75%  90% 100% 
##    4   12   17   25   42   65 3300

MEDIAN

wines%>%
  summarise(median_price = median(price, na.rm = TRUE))

## # A tibble: 1 x 1
##   median_price
##          <dbl>
## 1           25

CHECKING THE PRICE-TO-QUALITY RATIO

Does an expensive wine mean a good wine?

wines %>% 
  mutate(price_score_ratio = price_pln/points) %>% 
  select(title, price_pln, points, price_score_ratio) %>% 
  arrange(price_score_ratio)

## # A tibble: 129,971 x 4
##    title                                      price_pln points price_score_ratio
##    <chr>                                          <dbl>  <dbl>             <dbl>
##  1 Bandit NV Merlot (California)                   15.8     86             0.184
##  2 Cramele Recas 2011 UnWineD Pinot Grigio (~      15.8     86             0.184
##  3 Felix Solis 2013 Flirty Bird Syrah (Vino ~      15.8     85             0.186
##  4 Dancing Coyote 2015 White (Clarksburg)          15.8     85             0.186
##  5 Broke Ass 2009 Red Malbec-Syrah (Mendoza)       15.8     84             0.188
##  6 Bandit NV Chardonnay (California)               15.8     84             0.188
##  7 Terrenal 2010 Cabernet Sauvignon (Yecla)        15.8     84             0.188
##  8 Bandit NV Merlot (California)                   15.8     84             0.188
##  9 Terrenal 2010 Estate Bottled Tempranillo ~      15.8     84             0.188
## 10 Pam's Cuties NV Unoaked Chardonnay (Calif~      15.8     83             0.190
## # i 129,961 more rows

CHECKING THE OBSERVATIONS THAT SCORED AT LEAST 90 POINTS

wines %>% 
  mutate(price_score_ratio = price_pln/points) %>% 
  select(title, price_pln, points, price_score_ratio) %>% 
  filter(points >= 90) %>% 
  arrange(price_score_ratio)

## # A tibble: 49,045 x 4
##    title                                      price_pln points price_score_ratio
##    <chr>                                          <dbl>  <dbl>             <dbl>
##  1 Herdade dos Machados 2012 Toutalga Red (A~      27.6     91             0.304
##  2 Snoqualmie 2006 Winemaker's Select Riesli~      31.6     91             0.347
##  3 Esser Cellars 2001 Chardonnay (California)      31.6     90             0.351
##  4 Aveleda 2013 Quinta da Aveleda Estate Bot~      31.6     90             0.351
##  5 Rothbury Estate 2001 Chardonnay (South Ea~      31.6     90             0.351
##  6 Chateau Ste. Michelle 2011 Riesling (Colu~      35.6     91             0.391
##  7 Chateau Ste. Michelle 2010 Dry Riesling (~      35.6     91             0.391
##  8 Barnard Griffin 2012 Fumé Blanc Sauvignon~      35.6     91             0.391
##  9 Mano A Mano 2011 Tempranillo (Vino de la ~      35.6     90             0.395
## 10 Aveleda 2014 Quinta da Aveleda Estate Bot~      35.6     90             0.395
## # i 49,035 more rows

MEDIAN - GROUPING

Median price by the value of the variable country.

wines %>% 
  group_by(country) %>% 
  summarise(median_price_pln = median(price_pln, na.rm = TRUE))

## # A tibble: 44 x 2
##    country                median_price_pln
##    <chr>                             <dbl>
##  1 Argentina                          67.2
##  2 Armenia                            57.3
##  3 Australia                          83.0
##  4 Austria                            98.8
##  5 Bosnia and Herzegovina             49.4
##  6 Brazil                             79  
##  7 Bulgaria                           51.4
##  8 Canada                            118. 
##  9 Chile                              59.2
## 10 China                              71.1
## # i 34 more rows

wines %>% 
  group_by(country) %>% 
  summarise(median_price_pln = median(price_pln, na.rm = TRUE),
            sred_punkty = mean(points, na.rm = TRUE),
            liczba_of_wines = n()) %>% 
  arrange(median_price_pln) %>% 
  filter(liczba_of_wines >= 20)

## # A tibble: 30 x 4
##    country   median_price_pln sred_punkty liczba_of_wines
##    <chr>                <dbl>       <dbl>           <int>
##  1 Romania               35.6        86.4             120
##  2 Bulgaria              51.4        87.9             141
##  3 Moldova               51.4        87.2              59
##  4 Chile                 59.2        86.5            4472
##  5 Portugal              63.2        88.3            5691
##  6 Argentina             67.2        86.7            3800
##  7 Georgia               69.1        87.7              86
##  8 Morocco               71.1        88.6              28
##  9 Spain                 71.1        87.3            6645
## 10 Greece                75.0        87.3             466
## # i 20 more rows
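The same group, summarise, filter pattern can be traced by hand on a tiny made-up tibble. The names `df`, `by_country` and `n_wines` are illustrative.

```r
library(dplyr)

# Made-up data: three countries, one of them with a single review
df <- tibble(
  country = c("A", "A", "A", "B", "B", "C"),
  price   = c(10, 20, NA, 30, 50, 100)
)

by_country <- df %>%
  group_by(country) %>%
  summarise(median_price = median(price, na.rm = TRUE),
            n_wines      = n()) %>%   # n() counts rows, including NA prices
  filter(n_wines >= 2) %>%            # drop groups that are too small
  arrange(median_price)

by_country   # A: median 15, 3 wines; B: median 40, 2 wines
```

Filtering on the group size keeps countries with only a handful of reviews from dominating the ranking.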


CREATING GROUPED FREQUENCY DISTRIBUTIONS

n = length(wines$price)
y1=cut(wines$price, sqrt(n))
# y1
head(table(y1),30)

## y1
## (0.704,13.16] (13.16,22.31] (22.31,31.47] (31.47,40.62] (40.62,49.78] 
##         15821         35109         22801         15944          8192 
## (49.78,58.93] (58.93,68.09] (68.09,77.24]  (77.24,86.4]  (86.4,95.56] 
##          7330          5179          3105          1997          1350 
## (95.56,104.7] (104.7,113.9]   (113.9,123]   (123,132.2] (132.2,141.3] 
##           858           464           446           532           262 
## (141.3,150.5] (150.5,159.6] (159.6,168.8]   (168.8,178]   (178,187.1] 
##           354            59           122           135            66 
## (187.1,196.3] (196.3,205.4] (205.4,214.6] (214.6,223.7] (223.7,232.9] 
##            62           110            29            25            64 
##   (232.9,242]   (242,251.2] (251.2,260.4] (260.4,269.5] (269.5,278.7] 
##            40            65            30             7            23

y2=cut(wines$price,breaks=c(1,20,100,300,500))
head(y2, 10)

##  [1] <NA>     (1,20]   (1,20]   (1,20]   (20,100] (1,20]   (1,20]   (20,100]
##  [9] (1,20]   (20,100]
## Levels: (1,20] (20,100] (100,300] (300,500]

levels(y2)=c("bardzo tanie", "tanie", "drogie", "bardzo drogie")  # very cheap, cheap, expensive, very expensive
table(y2)

## y2
##  bardzo tanie         tanie        drogie bardzo drogie 
##         46341         71268          3050           225

PIVOT TABLE

rpivotTable(diamonds, subtotals=TRUE)