เรียนสถิตเบื้องต้นด้วยภาษา R

R Library

ในภาษา R จะมี library ที่มากับ package ให้ใช้เยอะมาก

library(readr) #สำหรับดึงข้อมูลจากภายนอก
library(dplyr) #สำหรับการจัดการข้อมูล

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forcats) # Working with Categorical Variables (Factors)
library(lubridate)#สำหรับการจัดการในเรื่องของวัน เวลา

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)  #สำหรับสร้างกราฟ
library(lattice)  #สำหรับสร้างจัดการกราฟ
library(extrafont) #สำหรับใช้ภาษาไทย

## Registering fonts with R

Import Data

data("diamonds")
#หรือเราสามารถ import จากเครื่องคอมพิวเตอร์เราก็ได้
# diamonds <- read_csv("C:/Projects/Data Science/Data/Diamonds.csv")
head(diamonds)# คำสั่งนี้จะแสดงข้อมูลตัวอย่าง 6 อันดับแรกของข็อมูล

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Level of measures

ต่อไปจะเป็นการ จัดลำดับของขัอมูล จากข้อมูลจะเป็นการสุ่ม Ideal Premium Good Premium Good Very Good

ดังนั้นเราจะต้องจัดการกับข้อมูล และเรียงลำดับว่า เราจะกำหนด การวัดระดับอย่างไร โดยในคำสั่งจะ บอกว่า Fair < Good < Very Good < Premium < Ideal

## [1] Ideal     Premium   Good      Premium   Good      Very Good
## Levels: Fair < Good < Very Good < Premium < Ideal

diamonds$color<- factor(diamonds$color, levels=c('J','I','H','G','F','E','D'), 
                        ordered=TRUE)
head(diamonds$color)

## [1] E E E I J J
## Levels: J < I < H < G < F < E < D

diamonds$clarity<-factor(diamonds$clarity,
                         levels=c('I1','SI1','SI2','VS1','VS2','VVS1','VVS2','IF'), 
                         ordered=TRUE)
head(diamonds$clarity)

## [1] SI2  SI1  VS1  VS2  SI2  VVS2
## Levels: I1 < SI1 < SI2 < VS1 < VS2 < VVS1 < VVS2 < IF

การจัดเก็บข้อมูลในรูปแบบของตารางแจกแจงความถี่

โดยการนับความถี่ของแต่ละ cut และบรรจุในตาราง

diamonds$cut %>% table()

## .
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551

การจัดเก็บข้อมูลในรูปแบบของตารางความน่าจะเป็น

การคำนวณค่าสัดส่วน (Proportion) P = x/N

diamonds$cut %>% table() %>% prop.table()

## .
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652

ค่าสัดส่วนมีค่าอยู่ระหว่าง 0 ถึง 1 เพื่อให้ง่ายต่อการแปลผลและอธิบาย นิยมนำค่าคงที่ 100 มาคูณ เรียกว่าค่า ร้อยละ หรือ %

diamonds$cut %>% table() %>% prop.table()*100

## .
##      Fair      Good Very Good   Premium     Ideal 
##  2.984798  9.095291 22.398962 25.567297 39.953652

freq <- diamonds$cut %>% table() %>% prop.table()*100 #ตั้งชื่อให้กับตาราง

Bar Chart แผนภูมิแท่ง

การสร้างกราฟแท่ง แสดงผลอัตราส่วนของ cut ในแต่ละแบบ ว่ามีสัดสวนเท่าไหร่ของเพรชทั้งหมด

barplot(freq, main = "Diamond Cut Quality - Percentage",ylab="Percent", xlab="Colour")

ใส่สีสวยๆ ให้สักหน่อย แล้วก็เรียงจากจำนวนมากไปหาน้อยด้วย เพื่อการสื่อความหมายได้ดีขึ้น

barplot(freq[order(freq, decreasing = TRUE)], main = "Diamond Cut Quality - Percentage", 
        ylab="Percent", xlab="Colour", col="deepskyblue")

ตัวอย่างการแจกแจงความถี่ (ตารางสองทาง)

table(diamonds$cut,diamonds$clarity) %>% prop.table(margin = 2) %>% round(3)

##            
##                I1   SI1   SI2   VS1   VS2  VVS1  VVS2    IF
##   Fair      0.283 0.031 0.051 0.021 0.021 0.005 0.014 0.005
##   Good      0.130 0.119 0.118 0.079 0.080 0.051 0.056 0.040
##   Very Good 0.113 0.248 0.228 0.217 0.211 0.216 0.244 0.150
##   Premium   0.277 0.274 0.321 0.243 0.274 0.169 0.172 0.128
##   Ideal     0.197 0.328 0.283 0.439 0.414 0.560 0.514 0.677

การสร้างกราฟจากการแจกแจงความถี่ (ตารางสองทาง)

color_cut <- table(diamonds$color,diamonds$cut) %>% prop.table(margin = 2)
barplot(color_cut, main = "Diamond Colour by Cut Quality",
        ylab="Proportion within Cut", xlab="Cut")

จะเห็นได้ว่า กราฟที่แสดงออกมา มันไม่ได้สื่อความหมายอะไรเลย ดังนั้น เราต้องจัดการให้กราฟของเราสามารถที่จะ ให้ความหมาย หรือสามารถแปรผลออกมาได้

barplot(color_cut, main = "Diamond Colour by Cut Quality",
        ylab="Proportion within Cut", 
        xlab="Cut", beside=TRUE)

barplot(color_cut, main = "Diamond Colour by Cut Quality",ylab="Proportion within Cut",
        xlab="Cut", beside=TRUE, legend=rownames(color_cut), 
        args.legend=c(x = "top",horiz=TRUE,title="Color"))

ว้าว!! เริ่มดูดีแล้วเฮะ แต่...มันยังไม่สวย แล้วอันที่เป็น แท่งๆมันยังไม่มีชื่อเลยอ่ะ เราใส่ชื่อให้มันสักหน่อยดีกว่า จะได้เข้าใจมากขึ้น

เยี่ยมไปเลย ตอนนี้กราฟเราก็สามารถสื่อความหมายได้แล้ว ดูแล้วเข้าใจเลย แต่มันก็ยังไม่สวยเลย สีมันจืดไปหน่อยนะ

library(RColorBrewer)# เรียกใช้บริการแพคเกจเพิ่มสักหน่อย ไหนๆก็มีมาให้ใช้ฟรีๆแล้ว :)

มาเริ่มต้นใส่สีกันค่ะ

barplot(color_cut, main = "Diamond Colour by Cut Quality",ylab="Proportion within Cut",
        xlab="Cut", beside=TRUE, 
        legend=rownames(color_cut), 
        args.legend=c(x = "top",horiz=TRUE,title="Color"),
        ylim = c(0,.30),
        col=brewer.pal(7, name = "RdBu"))

เริด!!!

Dot plot

สุ่มตัวอย่างมาสัก 30 ตัวอย่าง N = 30

diamonds_sample <- diamonds %>% sample_n(30)
qplot(data = diamonds_sample, x = carat, geom = "dotplot")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

We can use the binwidth option to change the number stacks or breaks in the dot plot. ไม่รู้จะอธิบายเป็นภาษาไทยยังไง

diamonds_sample %>% qplot(data = ., x = carat, geom = "dotplot", binwidth =.1)

Histogram

Let’s look at a histogram of diamond depths, z .

diamonds$z %>% hist(xlab="Diamond Depth (mm)", main="Histogram of Diamond Depths (mm)", 
                    col = "dodgerblue3")

เอ..กราฟนี้ก็แปลกๆ เหมือนข้อมูลมากองอยู่ที่ ช่วงระหว่าง 2 ถึง 6 ต้องจัดการสักหน่อย

diamonds$z %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.910   3.530   3.539   4.040  31.800

diamonds_clean <- diamonds %>% filter(z < 6 & z > 2)
hist(diamonds_clean$z,xlab="Diamond Depth (mm)",main="Histogram of Diamond Depths (mm)", col = "dodgerblue3")

library(lattice)
diamonds_clean %>% histogram(~ z|cut,col="dodgerblue3",layout=c(1,5), data=.,xlab="Depth (mm)")

การวัดแนวโน้มเข้าสู่ส่วนกลาง (Measures of Central Tendency)

ค่าเฉลี่ยเลขคณิตหรือมัชฌิมเลขคณิต (Arithmetic Mean or Mean or Average; X-bar )

sum(diamonds_sample$depth)/length(diamonds_sample$depth) # sum(xi)/N

## [1] 61.82333

diamonds_sample$depth %>% mean()

## [1] 61.82333

เป็นค่ากลางที่ได้จากการนำเอาข้อมูลแต่ละตัวของหน่วยสังเกตมารวมกัน แล้ว นำผลรวมที่ได้มาหารด้วยจำนวนข้อมูลทั้งหมด

diamonds$carat %>% mean()

## [1] 0.7979397

พิสัย (Rage) คือ ผลต่างระหว่างข้อมูลที่มีค่าสูงสุดและค่าต่ าสุด ถ้าพิสัยมีค่ามาก แสดงว่าข้อมูลมีการกระจายมาก ถ้ามีค่าน้อยแสดงว่าข้อมูลมีการกระจายน้อย แต่พิสัยเป็นการวัดการ กระจายอย่างหยาบๆ เท่านั้น เพราะค านวณจากค่าสูงสุดและค่าต่ าสุดเท่านั้น หากข้อมูลมีค่าที่มากหรือ น้อยผิดปกติ จะท าให้ค่าพิสัยมีค่ามาก สูตร พิสัย (Range)= Maximun – Minimun

diamonds$carat %>% range() #Min and Max

## [1] 0.20 5.01

มัธยฐาน (Median) มัธยฐาน คือ ค่าที่มีตำแหน่งอยู่ตรงกลางของข้อมูลที่เรียงลำดับจากค่าน้อยไป มาก หรือ จากค่ามากไปน้อย จะทำให้มีจำนวนข้อมูลครึ่งหนึ่ง มีค่าสูงกว่ามัธยฐานและข้อมูลอีกครึ่งหนึ่งมี ค่าต่ำกว่ามัธยฐานธยฐาน

diamonds$carat %>% median() # Median

## [1] 0.7

ความแปรปรวน (Variance; S.D.2 ) เป็นค่าเฉลี่ยของความแตกต่างกำลังสองของ ข้อมูลแต่ละตัวกับค่าเฉลี่ย (Mean Square Deviation: ส่วนเบี่ยงเบนกำลังสองเฉลี่ย) คือ ส่วนเบี่ยงเบน มาตรฐานยกกำลังสองนั่นเอง

diamonds$carat %>% var()

## [1] 0.2246867

ส่วนเบี่ยงเบนมาตรฐาน (Standard Deviation; S.D. หรือ S.) เป็นค่าเฉลี่ย ของความแตกต่างของข้อมูลแต่ละตัวกับค่าเฉลี่ย(Mean) หรือ หมายถึง โดยเฉลี่ยๆ แล้ว ข้อมูลแต่ละตัวมี ความแตกต่าง(ห่าง)จากค่าเฉลี่ย มากน้อยเพียงใด ถ้าห่างมากแสดงว่าข้อมูลมีการกระจายมาก ถ้าห่าง น้อยแสดงว่าข้อมูลมีการกระจายน้อย ส่วนเบี่ยงเบนมาตรฐานของข้อมูล

diamonds$carat %>% sd()

## [1] 0.4740112

พิสัยควอไทล์ (Interquartile Rang; IQR) = Q3 - Q1

diamonds$carat %>% quantile() #Quartiles

##   0%  25%  50%  75% 100% 
## 0.20 0.40 0.70 1.04 5.01

diamonds$carat %>% IQR() #Interquartile range

## [1] 0.64

diamonds %>% group_by(cut) %>% summarise(Min = min(carat,na.rm = TRUE),
                                         Q1 = quantile(carat,probs = .25,na.rm = TRUE),
                                         Median = median(carat, na.rm = TRUE),
                                         Q3 = quantile(carat,probs = .75,na.rm = TRUE),
                                         Max = max(carat,na.rm = TRUE),
                                         Mean = mean(carat, na.rm = TRUE),
                                         SD = sd(carat, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(carat)))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 10
##   cut         Min    Q1 Median    Q3   Max  Mean    SD     n Missing
##   <ord>     <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <int>
## 1 Fair       0.22  0.7    1     1.2   5.01 1.05  0.516  1610       0
## 2 Good       0.23  0.5    0.82  1.01  3.01 0.849 0.454  4906       0
## 3 Very Good  0.2   0.41   0.71  1.02  4    0.806 0.459 12082       0
## 4 Premium    0.2   0.41   0.86  1.2   4.01 0.892 0.515 13791       0
## 5 Ideal      0.2   0.35   0.54  1.01  3.5  0.703 0.433 21551       0

diamonds %>% group_by(cut, color) %>% summarise(Min = min(carat,na.rm = TRUE),
                                                Q1 = quantile(carat,probs = .25,na.rm =TRUE),
                                                Median = median(carat, na.rm = TRUE),
                                                Q3 = quantile(carat,probs = .75,na.rm =TRUE),
                                                Max = max(carat,na.rm = TRUE),
                                                Mean = mean(carat, na.rm = TRUE),
                                                SD = sd(carat, na.rm = TRUE),
                                                n = n(),
                                                Missing = sum(is.na(carat)))

## `summarise()` regrouping output by 'cut' (override with `.groups` argument)

## # A tibble: 35 x 11
## # Groups:   cut [5]
##    cut   color   Min    Q1 Median    Q3   Max  Mean    SD     n Missing
##    <ord> <ord> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <int>
##  1 Fair  J      0.3  0.905   1.03  1.68  5.01 1.34  0.734   119       0
##  2 Fair  I      0.41 0.885   1.01  1.50  3.02 1.20  0.522   175       0
##  3 Fair  H      0.33 0.9     1.01  1.51  4.13 1.22  0.548   303       0
##  4 Fair  G      0.23 0.7     0.98  1.07  2.6  1.02  0.493   314       0
##  5 Fair  F      0.25 0.6     0.9   1.01  2.58 0.905 0.419   312       0
##  6 Fair  E      0.22 0.552   0.9   1.01  2.04 0.857 0.365   224       0
##  7 Fair  D      0.25 0.7     0.9   1.01  3.4  0.920 0.405   163       0
##  8 Good  J      0.28 0.71    1.02  1.5   3    1.10  0.537   307       0
##  9 Good  I      0.3  0.7     1     1.5   3.01 1.06  0.576   522       0
## 10 Good  H      0.25 0.51    0.9   1.09  3.01 0.915 0.498   702       0
## # ... with 25 more rows

diamonds$carat %>%  boxplot(main="Box Plot of Diamond Carat", ylab="Carat", col = "grey")

diamonds %>% boxplot(carat ~ cut, data = ., main="Box Plot of Diamond Carat by Cut", 
                     ylab = "Carat", xlab = "cut", col="grey")

Scatter plots

เพื่อดูการกระจายและความสัมพันธ์ระหว่างตัวแปรสองตัว

plot(y ~ x, data = diamonds, ylab="Width (mm)", xlab="Length (mm)", col="orangered",
     main="Length by Width")

สังเกตจากกราฟ ข้อมูลจะมีความหนาแน่นอยู่ในช่วง 0..20 ดังนั้น เราจึงต้องแก้ไขกราฟ ให้สนใจเฉพาะข้อมูลที่อยู่ในช่วงนั้น

diamonds_clean <- diamonds %>% filter(y < 20 & x > 0)

plot(y ~ x, data = diamonds_clean, ylab="Width (mm)", xlab="Length (mm)",
     col="orangered",main="Length by Width")

Facebook Page: Jack of all trades, master of none