Podstawy analizy danych

Pierwsze kroki w R


Wprowadzenie


  1. Zapoznaj się z rodzajami danych w R. Pamiętaj, że nie są to rodzaje zmiennych w statystyce!
  2. Jakie są etapy analizy danych?
  3. Jak wczytywać i zapisywać dane różnych formatów?

Zadanie

TEST
Col1 Col2 Col3
AAA Test
BBB Test
CCC ???
DDD
EEE

Dane dotyczą prawie 985 transakcji sprzedaży mieszkań z Sacramento (Kalifornia, USA).

Utwórz nowy factor: dla cen >50 kUSD “0” o etykiecie „niska”, dla >100 kUSD “1” o etykiecie “srednia”, a dla pozostałych „2” o etykiecie „wysoka”. W jakim celu można go wykorzystać w analizie danych?

dane <- dane%>%
  mutate(cena=cut(price, breaks =c(0,50000, 100000, Inf), labels=c("niska", "srednia" , "wysoka" )))
attach(dane)
## The following objects are masked from dane (pos = 12):
## 
##     baths, beds, city, latitude, longitude, price, sale_date, sq__ft,
##     state, street, type, zip
dane2 <- filter(dane,type!="Unkown") 

boxplot(sq__ft~cena)

plot(price,sq__ft)

ggplot(dane2,aes(beds,type)) +
  geom_boxplot() +
  theme_light()

Siatka grafiki

Za pomocą “Facetingu’u” lub siatki (“grid”) możemy utworzyć na jednej stronie raportu kilka wykresów o różniych wymiarach

wykres1 <- ggplot(dane2,aes(price)) + geom_histogram(binswidth=50000)
## Warning: Ignoring unknown parameters: binswidth
wykres2 <- ggplot(dane2,aes(sq__ft)) + geom_histogram(bins=10)
wykres3 <- ggplot(dane2, aes(price,sq__ft)) + geom_point(pch=17, color="blue", size=2) + geom_smooth(method="lm", color="red",linetype=2)
wykres4 <- ggplot(dane2, aes(x=price, y=sq__ft, color=type)) + geom_point()
grid.arrange(wykres1,wykres2, wykres3, wykres4, ncol=2, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula 'y ~ x'

Fajki

Poniżej pokazano przykład wykorzystania fajki bazowej R “|>”: wykres logarytmu gęstości cen.

price %>%
  log() %>%
  density() %>%
  plot(xlab="log cen", ylab="gęstość", main="Logarytm cen")

Przykład - dplyr

Wykreśl histogram cen sprzedaży nieruchomości typu “Condo” z dwiema sypialniami. Filtry z dplyr działają na całej ramce danych, stąd by wykreślić ceny musimy użyć pakietu ggplot2:

Bez użycia filtra i fajek nie ma takiej potrzeby:

Zadania z dplyr

Zapisz do ramki “dane3” informacje nt. cen i rodzaju mieszkań (ale tylko te, które mają co najmniej 2 łazienki). Ich ceny uporządkuj rosnąco. Utwórz także nową zmienną price2, która jest wyrażona w PLN z kursu dzisiejszego (średniego).

dane4 <- dane2 %>%
  mutate(cena_PLN =price*3.95) %>%
  filter(baths>=2) %>%
  select(cena_PLN, type) %>%
  arrange(cena_PLN)

Podsumuj ramkę dane3 - pokazując średnie ceny w PLN i grupując je wg typu mieszkania.

dane4 %>%
  group_by(type) %>%
  summarise(srednia=mean(cena_PLN),median(cena_PLN), odchylenie=sd(cena_PLN), zmiennosc=(odchylenie/srednia)*100)
## # A tibble: 3 × 5
##   type          srednia `median(cena_PLN)` odchylenie zmiennosc
##   <chr>           <dbl>              <dbl>      <dbl>     <dbl>
## 1 Condo         713584.            600400     263305.      36.9
## 2 Multi-Family  886912.            873938.    335994.      37.9
## 3 Residential  1042823.            924300     520132.      49.9

Na wykresie wyświetl ceny w PLN wg typu nieruchomości. Podpisz osie, nanieś szablon.

ggplot(dane4, aes(type,cena_PLN)) +
  geom_violin() +
  theme_light() + 
  labs(title="Ceny wg typu",x="Typ", y="PLN")

Opis statystyczny

Dotyczy głównych czterech grup: 1. Miara położenia - w poziomie przebieg, tendencja, dotyczą statycznego po osi X - miary klasyczne, lub piary pozycyjne.

Miary klasyczne - charakteryzują cały rozkład danej zmiennej, natomiast pozycyjne - wyłącznie daną lokalizację, pozycję rozkładu zmiennej.

Klasyczne - średnia ważona, średnia arytmetyczne Pozycyjne - dominanta, kwartyle, decyle, percentyle, mediana (średnia środkowa)

  1. Miary zmienności, zróżnicowania, dyspersji klasyczne - wariancja, odchylenie standardowe, odchylenie przeciętne, odchylenie medianowe pozycyjne -

  2. Miary skosności

  3. Miary kurtozy - umożliwia opis kształtu rozkładu jako całości dotyczy czy mamy grube ogony i czy rozkład na środku jest bardziej spłaszczony, czy wysmukły, czy mamy wiele obserwacji na grubych ogonach - rozkład normalny = kurtoza równa 3

Kurtoza mniejsza czy większa bardziej rozchyla

Rozkład normalny ma zerową asymetrią

Na histogramie ryzujemy przebieg rozkładu normalnego, łatwiej będzie linią w tle nakreślić rozkład normalny

Miary pozycyjne często dotyczą obszaru między kwartylem górnym i dolnym - wówczas mówimy o np. skosności międzykwartlowej, kurtozie międzykwartylowej.

  1. Pakiet staragazer -
attach(dane4)
## The following object is masked from dane (pos = 3):
## 
##     type
## The following object is masked from dane (pos = 13):
## 
##     type
stargazer(dane4$cena_PLN, type = "text", min.max = TRUE, mean.sd = TRUE,
          nobs = TRUE, median = TRUE, iqr = TRUE, digits=2, align=T,
          title= "statystyki opisowe")
## 
## statystyki opisowe
## =============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
## 6,126.45 7,900 19,343.20 19,343.20 19,343.20 118,500 224,952.00 262,675 276,500 316,000 337,725 351,550 359,035.00 359,450 369,720 374,875.00 377,719.00 386,112.00 390,801.00 395,000 396,221.00 405,862.00 410,800 414,750 419,688.00 429,562.00 430,550 437,265 444,375 450,300 453,460 453,460 454,250 454,250 456,225 456,699 458,200 471,038.00 474,000 474,000 474,427.00 479,925 479,925 480,438.00 481,900 481,900 481,900 481,900 483,875 484,594.00 485,850 485,850 485,850 488,516.00 488,812.00 489,800 489,800 490,195 491,431.00 493,750 493,750 493,750 496,013.00 497,700 500,228 500,520.00 501,073.00 501,883.00 503,625 509,550 511,525 513,500 513,500 513,500 518,240 520,412.00 520,412.00 521,400 525,350 529,300 531,492.00 533,250 535,225 539,175 539,175 543,998.00 544,152 545,100 548,062.00 551,025 553,000 553,000 556,160 556,950 560,900 562,875 566,825 572,750 572,750 572,750 572,750 576,092.00 576,700 577,688.00 580,650 581,867.00 587,562.00 588,550 590,892.00 590,920 592,500 592,500 592,500 592,500 592,500 592,500 592,500 592,500 596,450 596,794.00 600,400 600,400 608,300 612,250 612,250 613,968.00 614,225 615,410 616,200 616,200 616,761.00 619,739.00 621,319.00 622,125 624,100 624,100 624,100 631,605 632,000 632,000 632,000 632,000 636,938.00 636,938.00 637,925 637,925 638,320 638,529.00 639,900 647,800 647,800 647,800 651,750 651,750 651,750 651,750 651,750 654,712.00 657,110.00 659,650 660,807.00 660,807.00 660,807.00 663,600 663,600 663,600 663,600 666,562.00 667,550 667,550 671,500 671,500 671,500 671,500 671,500 671,500 671,500 671,500 672,488.00 678,412.00 679,400 683,350 683,350 683,571.00 687,300 687,300 688,288.00 688,288.00 688,536.00 691,250 695,575.00 696,188.00 696,188.00 698,558.00 703,100 703,100 704,996 706,102 707,050 707,050 709,025 709,341 711,000 711,000 711,000 711,000 711,000 711,000 711,000 712,580 714,950 714,950 718,900 718,900 718,900 721,219.00 721,219.00 721,728.00 721,862.00 723,640 728,775 730,750 731,042.00 734,040.00 737,801.00 738,650 743,923.00 745,365 746,550 746,550 749,852.00 750,500 750,500 750,500 755,438.00 755,438.00 756,425 757,116.00 758,400 758,665.00 761,165 762,350 764,325 766,300 766,300 769,531.00 770,250 770,250 770,250 770,250 772,225 780,733.00 782,100 788,025 789,605 790,000 790,000 790,000 790,000 790,000 791,363.00 796,036.00 799,875 808,762.00 809,426.00 809,750 809,750 809,750 809,750 809,750 809,750 809,750 813,218.00 813,305 817,650 817,650 817,650 820,589.00 821,600 821,600 822,588.00 825,550 826,921.00 829,500 829,500 829,500 835,425 837,400 839,375 840,813.00 841,350 844,016.00 844,103.00 844,312.00 849,250 849,250 849,250 849,250 849,250 849,250 853,200 853,330.00 859,125 861,100 865,050 868,186.00 869,000 869,000 869,000 869,000 869,000 869,000 869,000 871,773.00 872,950 872,950 872,950 873,938.00 876,900 879,862.00 880,455 880,850 881,079.00 886,775 888,750 888,750 888,750 888,750 888,750 890,725 900,154.00 900,600 900,600 901,892.00 903,562.00 904,550 904,657.00 906,525 907,177.00 908,500 908,500 908,500 908,500 908,500 908,500 908,875.00 913,240 914,334.00 916,400 918,079.00 918,375 918,375 920,350 922,325 924,300 924,300 924,300 926,275 927,053.00 928,250 928,250 928,250 928,250 929,439.00 931,165.00 932,200 932,200 932,488.00 934,906.00 939,310 940,100 943,501.00 944,050 946,815 948,000 948,000 948,000 948,000 948,482.00 951,835.00 951,950 955,900 958,420.00 961,628.00 961,825 963,800 967,750 967,750 967,750 971,376.00 971,700 973,849.00 974,662.00 975,650 976,574.00 977,546 983,550 983,550 987,500 987,500 987,500 987,500 987,500 987,500 987,500 988,029.00 991,450 995,400 996,012.00 1,003,979.00 1,004,090 1,004,090 1,007,250 1,007,250 1,007,250 1,011,413.00 1,015,940 1,018,030.00 1,019,100 1,019,100 1,027,000 1,027,000 1,027,000 1,027,000 1,027,000 1,030,950 1,030,950 1,030,950 1,034,110 1,036,875 1,040,825 1,044,653.00 1,046,750 1,046,750 1,046,750 1,050,700 1,052,714.00 1,057,612.00 1,066,500 1,066,500 1,066,500 1,077,165 1,081,312.00 1,083,979.00 1,086,250 1,086,250 1,086,250 1,086,590.00 1,087,577.00 1,090,200 1,092,175 1,098,021 1,098,100 1,102,050 1,106,000 1,106,000 1,106,000 1,106,000 1,109,587.00 1,109,899.00 1,115,480 1,124,510.00 1,125,327.00 1,125,750 1,125,750 1,125,750 1,125,750 1,125,750 1,129,751.00 1,135,297.00 1,137,600 1,141,550 1,141,550 1,145,500 1,145,500 1,149,450 1,153,400 1,153,400 1,161,272.00 1,161,284.00 1,161,300 1,161,300 1,161,983.00 1,165,250 1,165,250 1,169,200 1,169,421.00 1,172,238.00 1,173,150 1,173,150 1,174,568.00 1,175,125 1,177,100 1,177,100 1,181,050 1,184,763 1,185,000 1,185,000 1,185,000 1,185,000 1,185,000 1,186,975 1,187,240.00 1,196,850 1,200,800 1,200,946.00 1,204,750 1,204,750 1,204,750 1,208,700 1,224,500 1,224,500 1,224,500 1,228,450 1,229,746.00 1,230,496.00 1,232,400 1,236,895.00 1,244,250 1,244,250 1,244,250 1,244,250 1,244,250 1,244,250 1,246,371.00 1,250,688.00 1,263,167.00 1,264,000 1,264,000 1,264,000 1,264,000 1,271,900 1,283,750 1,283,750 1,291,456.00 1,297,022 1,297,062.00 1,297,883.00 1,303,500 1,303,500 1,303,500 1,303,500 1,307,450 1,307,450 1,308,240 1,311,400 1,319,300 1,323,250 1,326,212.00 1,326,212.00 1,327,200 1,339,050 1,343,000 1,346,950 1,359,788.00 1,362,750 1,365,697.00 1,367,530.00 1,368,181.00 1,370,765.00 1,371,539.00 1,373,218.00 1,378,550 1,382,500 1,382,500 1,382,500 1,382,500 1,382,500 1,386,450 1,386,450 1,387,635 1,390,400 1,397,380.00 1,406,200 1,406,338.00 1,406,990 1,422,000 1,424,180.00 1,428,893.00 1,429,695.00 1,431,105.00 1,451,479.00 1,451,838.00 1,461,500 1,461,500 1,463,475 1,465,790.00 1,469,400 1,481,250 1,493,100 1,501,000 1,503,283.00 1,506,135 1,508,671.00 1,520,750 1,525,577.00 1,531,537.00 1,544,450 1,558,156.00 1,560,250 1,560,645 1,562,225 1,568,150 1,576,050 1,580,000 1,580,000 1,580,000 1,580,735.00 1,587,900 1,603,803.00 1,613,302.00 1,629,375 1,639,250 1,639,250 1,646,230.00 1,659,000 1,659,000 1,660,793.00 1,670,850 1,670,850 1,678,750 1,678,750 1,678,750 1,688,625 1,698,500 1,702,142.00 1,725,147.00 1,730,100 1,732,865 1,741,950 1,757,750 1,757,750 1,761,700 1,777,500 1,777,500 1,777,500 1,785,400 1,797,250 1,817,000 1,817,000 1,817,000 1,817,000 1,820,950 1,836,750 1,856,500 1,860,450 1,863,412.00 1,876,250 1,876,250 1,896,000 1,911,800 1,913,775 1,915,750 1,925,625 1,930,562.00 1,932,861.00 1,935,500 1,947,350 1,955,250 1,975,000 1,990,800 2,001,418.00 2,006,600 2,014,500 2,022,400 2,054,000 2,073,750 2,085,600 2,105,350 2,129,050 2,212,000 2,261,375 2,271,250 2,287,417.00 2,298,900 2,364,845.00 2,370,000 2,370,000 2,370,000 2,370,000 2,394,640.00 2,409,500 2,422,934.00 2,425,300 2,512,200 2,607,000 2,640,042.00 2,670,990 2,674,340.00 2,686,000 2,732,053.00 2,761,050 3,002,000 3,278,500 3,314,050 3,472,050 3,494,920.00
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1. Pakiet reporttools - pod pdfa
vars1<-dane3[,c(7,10)]
tableContinuous(vars=vars1, group=dane3$type, print.pval = "kruskal")
## % latex table generated in R 4.1.2 by xtable 1.8-4 package
## % Mon Feb 14 20:47:18 2022
## \begingroup\footnotesize
## \begin{longtable}{lrrrrrrrrrr}
##  \textbf{Variable} & $\mathbf{n}$ & \textbf{Min} & $\mathbf{q_1}$ & $\mathbf{\widetilde{x}}$ & $\mathbf{\bar{x}}$ & $\mathbf{q_3}$ & \textbf{Max} & $\mathbf{s}$ & \textbf{IQR} & \textbf{\#NA} \\ 
##   \hline
## sq\_\_ft & 30 &     0 &   795 &    924 &    893.7 &   1029.2 &   1326 &   292.8 &   234.2 & 0 \\ 
##   price & 30 & 40000 & 90500 & 118125 & 129496.9 & 150340.5 & 260000 & 56224.6 & 59840.5 & 0 \\ 
##   \hline
## \caption{} 
## \label{}
## \end{longtable}
## \endgroup
  1. Pakiet skimr
dane4 %>%
  group_by(type) %>%
  skim()
Data summary
Name Piped data
Number of rows 697
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables type

Variable type: numeric

skim_variable type n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cena_PLN Condo 0 1 713584 263305 374874.75 525350 600400 918375 1382500 ▇▃▃▂▁
cena_PLN Multi-Family 0 1 886912 335994 395000.00 671500 873938 1115480 1646230 ▆▇▇▂▂
cena_PLN Residential 0 1 1042823 520132 6126.45 683350 924300 1264000 3494920 ▃▇▂▁▁
  1. Pakiet DescTools
Desc(dane4$cena_PLN)
## ------------------------------------------------------------------------------ 
## dane4$cena_PLN (numeric)
## 
##         length           n         NAs      unique            0s          mean'
##            697         697           0         483             0  1'029'994.94
##                     100.0%        0.0%                      0.0%              
##                                                                               
##            .05         .10         .25      median           .75           .90
##     456'604.20  510'735.00  671'500.00  916'400.00  1'244'250.00  1'727'128.02
##                                                                               
##          range          sd       vcoef         mad           IQR          skew
##   3'488'794.05  514'606.47        0.50  398'226.36    572'750.00          1.43
##                                                                               
##         meanCI
##     991'724.56
##   1'068'265.33
##               
##            .95
##   2'002'454.08
##               
##           kurt
##           2.94
##               
## lowest : 6'126.45, 7'900.0, 19'343.15 (3), 118'500.0, 224'952.50
## highest: 3'002'000.0, 3'278'500.0, 3'314'050.0, 3'472'050.0, 3'494'920.50
## 
## ' 95%-CI (classic)

  1. Dodanie - znajdź pakiet, za pomocą którego wyprodukujesz elegancki statystyk opisowych dla html, następnie zinterpretuj te statystyki także wg typu nieruchomości, bo może się okazać, że wobec całego rozkładu statystyki

ANALIZA DANYCH JAKOŚCIOWYCH

Dokonaj opisu i prezentacjii graficznej zmiennej typ nieruchomiści wg miast.

dane2<-dane[!(dane$type=="Unkown") & (dane$city=="SACRAMENTO") | (dane$city=="ANTELOPE"),]
attach(dane2)
## The following object is masked from dane4:
## 
##     type
## The following objects are masked from dane (pos = 4):
## 
##     baths, beds, cena, city, latitude, longitude, price, sale_date,
##     sq__ft, state, street, type, zip
## The following objects are masked from dane (pos = 14):
## 
##     baths, beds, city, latitude, longitude, price, sale_date, sq__ft,
##     state, street, type, zip
bp <- ggplot(dane3, aes(x="", y=freq, fill= city)) + geom_bar(width=1, stat = "identity") + theme (axis.text.x = element_text(angle=70, vjust=0.5))

dane3<-dane2 %>%
  count(city, type) %>%
  mutate(freq = n/ sum(n))

ggplot(dane3, aes(x="", y=freq, fill=type)) + coord_polar(("y"))

WNIOSKOWANIE

Czy ceny mieszkań różnych typów istotnie różnią się od Ciebie?

qqnorm(log(price))
qqline(log(price)) #dodajemy linię prostą

dane2$lprice<-log(price)

ggbetweenstats(data =dane2, 
               x = type,
               y = price,
               type = "nonparametric") 

#jeżeli mam zmienną lprice w ramce danych, to wtedy zmieniam y na lprice i ogę usunąć type

Przykład drugi

Czy ceny domów wielordzinnych są istotnie różne od cen rezydencji?

dane4<-dane2 %>% filter(type=="Multi-Family" | type=="Residential")
  ggbetweenstats(data=dane4, x=type, y=lprice)

Przykład 3

Czy liczba łazienka istotie różni się dla nieruchomości różnych typów?

ggpiestats(data=dane2, x=type, y= baths)

ggbarstats(data=dane2, x=type, y= baths)