Data manipulation

tmt

2024-01-22

Concepts

Data manipulation involves modifying data to make it easier to read and to be more organized. We manipulate data for analysis and visualization.

The term often appears alongside ‘data exploration’, which involves organizing data using the available sets of variables.

Machine-driven data collection often introduces reading errors and inaccuracies. Data manipulation is also used to remove them and make the data more accurate and precise.

References

https://www.datacamp.com/cheat-sheet/data-manipulation-with-dplyr-in-r-cheat-sheet

Reading data from a CSV file

d <- read.csv(file.choose(), header = TRUE)   # pick the file interactively
read.csv(file = './data/abc.csv')             # or pass the path directly
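
The calls above depend on a file that exists on disk. As a minimal, self-contained sketch (the temporary file and the use of the built-in trees data are assumptions made only for illustration), the round trip below writes a small CSV and reads it back:

```r
# Write a few rows of a built-in dataset to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(head(trees, 5), path, row.names = FALSE)

# Read the file back; header = TRUE treats the first line as column names
d <- read.csv(path, header = TRUE)
str(d)

# Remove the temporary file
unlink(path)
```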

Reading data from an Excel file

Install and load the xlsx package

install.packages('xlsx')
library(xlsx)

Read data from the file

tmp <- read.xlsx(file.choose(), sheetIndex = 1, header = TRUE)
tmp <- read.xlsx(file = './data/abc.xlsx', sheetIndex = 1)   # read.xlsx needs sheetIndex or sheetName

Using datasets built into R

Load the packages and list their datasets

library(datasets)
data(package = 'datasets')

library(ggplot2)
data(package = 'ggplot2')

Assign a dataset to an object to work with

tmp <- swiss
tmp <- diamonds
tmp <- read.csv("https://data.nber.org/cps-basic2/csv/cpsb202302.csv")

Getting data from the World Bank

The World Bank (WB) stores a wealth of macroeconomic information for many countries around the world.

library(WDI)
library(dplyr)   # provides %>% and select()
ind <- WDIsearch('Total reserves')
d <- WDI(indicator = 'FI.RES.TOTL.MO', country = c('VNM'), extra = TRUE)
tmp <- d %>% select(year, FI.RES.TOTL.MO)
tmp <- na.omit(tmp)
names(tmp) <- c('year', 'DuTru')   # DuTru: reserves
tmp

The trees dataset

This dataset ships with the datasets package.

tmp <- trees

Overview

is.data.frame(tmp)
length(tmp)
names(tmp)
dim(tmp)
library(skimr)
skim(tmp)

Additional information about the dataset

head(tmp, 10)
tail(tmp,10)
str(tmp)

The trees dataset (cont.)

Checking the data for completeness

is.na(tmp)
sum(is.na(tmp))
which(is.na(tmp))

Handling missing data

x <- c(1, 2, 0 / 0, 3, NA, 4, 0 / 0)    # 0 / 0 gives NaN, which is.na() also flags
x[is.na(x)] <- mean(x, na.rm = TRUE)    # replace NA and NaN with the mean of the rest

library(WDI)
library(dplyr)
d <- WDI(indicator = 'FI.RES.TOTL.MO', country = c('VNM'))
tmp <- d %>% select(year, FI.RES.TOTL.MO)
tmp <- tmp %>% rename(dutru = FI.RES.TOTL.MO)
tmp$dutru[is.na(tmp$dutru)] <- mean(tmp$dutru, na.rm = TRUE)   # mean imputation

tmp <- d
tmp <- tmp[complete.cases(tmp),]   # keep only the rows with no missing values
tmp <- na.omit(tmp)                # same effect using na.omit()
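
The two strategies above (dropping incomplete rows versus filling gaps with the mean) can be compared offline. The sketch below uses a small made-up data frame, so the values are illustrative only and are not World Bank figures:

```r
# Hypothetical data: two of the six values are missing
df <- data.frame(year  = 2001:2006,
                 dutru = c(10, NA, 14, 16, NA, 20))

# Count the missing values in each column
colSums(is.na(df))

# Option 1: drop every row that contains a missing value
dropped <- df[complete.cases(df), ]

# Option 2: replace missing values with the mean of the observed ones
imputed <- df
imputed$dutru[is.na(imputed$dutru)] <- mean(imputed$dutru, na.rm = TRUE)
```

Dropping rows shrinks the sample, while mean imputation keeps every row but reduces the variance of the imputed column; which trade-off is acceptable depends on the analysis.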

Handling duplicate data

The iris dataset

library(tidyverse)
tmp <- iris
duplicated(tmp)
table(duplicated(tmp))
tmp1 <- unique(tmp) # Keep the unique rows (base R)
tmp2 <- distinct(tmp) # Remove duplicate rows (dplyr)
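
To make the behaviour of these functions easier to see, here is a small sketch on a made-up data frame (the columns id and grp are hypothetical). It also shows deduplicating on a single column, which the iris example does not cover:

```r
library(dplyr)

# Hypothetical data: rows 3 and 5 repeat earlier rows exactly
df <- data.frame(id  = c(1, 2, 2, 3, 3, 3),
                 grp = c("a", "b", "b", "c", "c", "d"))

# TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(df)

# Remove fully duplicated rows
no_dups <- distinct(df)

# Keep the first row for each id, retaining the other columns
by_id <- distinct(df, id, .keep_all = TRUE)
```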

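Extracting data

Rename the variables to make the data easier to work with.

tmp <- trees                  # the examples below use trees (Girth, Height, Volume)
names(tmp) <- c('G','H','V')
rename(tmp,GG = G, HH = H) #dplyr in tidyverse

Perform the extraction

a <- tmp[5,3]                           # one cell: row 5, column 3
H <- tmp$H                              # one column, by name
b <- tmp[,2]                            # one column, by position
c <- tmp[4,]                            # one row
tmp2 <- tmp[,c(1,3)]                    # columns 1 and 3
tmp2 <- tmp[3:9,]                       # rows 3 to 9
tmp3 <- tmp[c(3,5,7,21),]               # selected rows
tmp4 <- tmp[c(2,3,6,18),c(2,3)]         # selected rows and columns
tmp5 <- tmp[tmp$H >=80,]                # rows meeting a condition
tmp5 <- tmp[tmp$H >=80 & tmp$H <=86,]   # both conditions (AND)
tmp6 <- tmp[tmp$H == 76 | tmp$H == 80,] # either condition (OR)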
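
The bracket-based filters above can also be written with base R's subset() and the %in% operator. This is a sketch of equivalent forms, not a replacement for the originals; it rebuilds the renamed trees data so that it runs on its own:

```r
tmp <- trees
names(tmp) <- c("G", "H", "V")

# Rows with H between 80 and 86, keeping only columns H and V
s1 <- subset(tmp, H >= 80 & H <= 86, select = c(H, V))

# Rows where H equals 76 or 80, written with %in% instead of |
s2 <- tmp[tmp$H %in% c(76, 80), ]
```

subset() evaluates column names inside the data frame, which avoids repeating tmp$ in each condition.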

Extracting data (cont.)

The iris dataset also ships with the datasets package; it contains both qualitative and quantitative variables.

Some basic information

d <- iris
str(d)
head(d,4)
tail(d,5)

Extracting data

d1 <- d[d$Species=='setosa',]
str(d1)
d1 <- d[d$Species=='setosa'|d$Species=='versicolor',]
d2 <- d1[d1$Sepal.Length < 5,]
d3 <- d[d$Species != 'setosa',]

Extracting data with the tidyverse package

tidyverse is an important R package that bundles many component packages. This section covers only data extraction.

library(tidyverse)
d <- diamonds
d1 <- filter(d,color=='D'|carat > 1)
d2 <- select(d1,color,carat,x,y,z)

d1 <- d %>% filter(color=='D'|carat > 1)
d2 <- d1 %>% select(color,carat,x,y,z)

d22 <- d %>% filter(color=='D'|carat > 1) %>% select(color,carat,x,y,z)

Creating new data from existing data

d <- trees
d$tich <- d$Girth*d$Height*d$Volume
d <- d %>% mutate(l = log(tich))
d <- d %>% mutate(sq = sqrt(tich))

Mathematical functions in R: abs, sqrt, …

Deleting columns

select(dataframe, -column_name)
select(dataframe, -c(column1, column2, ..., column_n))
select(dataframe, -c(index1, index2, ..., index_n))
select(dataframe, -contains('sub_string'))
dataframe %>% select(-matches('sub_string'))
dataframe %>% select(-starts_with('c'))
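
The templates above use placeholder names. A runnable sketch on the first rows of iris (chosen here only as a convenient example) shows what each form leaves behind:

```r
library(dplyr)

d <- head(iris)

# Drop one column by name
no_species <- select(d, -Species)

# Drop several columns by name, then by position
no_sepal_by_name  <- select(d, -c(Sepal.Length, Sepal.Width))
no_sepal_by_index <- select(d, -c(1, 2))

# Drop columns whose names contain or start with a string
no_petal <- select(d, -contains("Petal"))
no_sepal <- d %>% select(-starts_with("Sepal"))

names(no_petal)
```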

Deleting rows:
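
As a minimal sketch of what row deletion can look like (the choice of the trees data and of these particular rows and conditions is an assumption for illustration), rows can be dropped by position or by condition:

```r
library(dplyr)

d <- trees

# Drop rows 1 and 3 by position (base R)
d1 <- d[-c(1, 3), ]

# Drop the first five rows (dplyr)
d2 <- d %>% slice(-(1:5))

# Drop the rows that match a condition by keeping its negation
d3 <- d %>% filter(!(Height < 70))
```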

Recoding data

tmp <- iris
names(tmp) <- c('SL','SW','PL','PW','S')
tmp$S.Coded <- ifelse(tmp$S == 'setosa','setosa','Not setosa')
tmp$S.Coded1 <- recode(tmp$S,setosa = 'Type 1',  versicolor = 'Type 2')
tmp$SL.Coded <- ifelse(tmp$SL >= 5,'Pass', 'Fail')
tmp$SL.Coded1 <- ifelse(tmp$SL >= 5 & tmp$SL <= 6, 'Accept', 'Reject')
tmp$SL.Coded2 <- case_when(tmp$SL < 5 ~ 'Too small', tmp$SL >= 5 & tmp$SL <= 6.5 ~ 'OK', tmp$SL >6.5 ~ 'Too large')
tmp$SL.Coded3 <- cut(tmp$SL,3,labels = c('Type 1','Type 2','Type 3'))   # three equal-width bins
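
Calling cut() with a number of bins splits the range into equal-width intervals. When the cut points matter, explicit breaks can be supplied instead. The sketch below uses thresholds of 5 and 6.5 as an illustrative choice; note that with right = FALSE each interval is closed on the left, so a value of exactly 6.5 falls in the last group:

```r
sl <- iris$Sepal.Length

# Explicit breaks: [-Inf, 5), [5, 6.5), [6.5, Inf)
grp <- cut(sl,
           breaks = c(-Inf, 5, 6.5, Inf),
           labels = c("Too small", "OK", "Too large"),
           right  = FALSE)

# Frequency of each group
table(grp)
```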

Building frequency tables

Frequency table for one variable

d <- iris
table(d$Species)
cut(d$Sepal.Length,3)
table(cut(d$Sepal.Length,3))

Frequency table for two variables

d <- iris
d$sl.c <- cut(d$Sepal.Length,3, labels = c('short','medium','long'))
tmp <- table(d$Species,d$sl.c)

Alternatively, the same counts can be obtained as follows:

tmp <- d %>% group_by(Species,sl.c) %>% summarise(n= n(), .groups = 'drop')
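
Counts are often easier to compare as proportions. The sketch below rebuilds the two-way table and converts it to row proportions with prop.table(); it also uses dplyr's count() as a shorter form of the group_by/summarise pattern. The bin labels are illustrative:

```r
library(dplyr)

d <- iris
d$sl.c <- cut(d$Sepal.Length, 3, labels = c("short", "medium", "long"))

# Two-way frequency table and its row proportions (each row sums to 1)
tab <- table(d$Species, d$sl.c)
row_prop <- prop.table(tab, margin = 1)

# count() groups, counts and ungroups in one step
cnt <- d %>% count(Species, sl.c)
```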

Frequency tables (cont.)

Stem-and-leaf plots

stem(d$Petal.Length)
stem(d$Sepal.Length,scale = .5)

Computing summary statistics

tmp <- diamonds
summary(tmp$carat)
sum(tmp$carat)
mean(tmp$carat,na.rm = T)
length(tmp$carat)
var(tmp$carat)
sd(tmp$carat)
median(tmp$carat)
quantile(tmp$carat, probs = c(.25,.5,.75))

Computing summary statistics by group

tmp <- diamonds
moc <- tmp %>% group_by(color) %>% summarise(mean_of_carat = mean(carat))
moc <- tmp %>% group_by(color) %>% summarise(n = n(),mean_of_carat = mean(carat))
medoc <- tmp %>% group_by(color) %>% summarise(med_of_carat = median(carat))
moc1 <- tmp %>% group_by(cut) %>% summarise(mean_of_carat = mean(carat))
moc2 <- tmp %>% group_by(color,cut) %>% summarise(n = n(),mean_of_carat = mean(carat),.groups = 'drop')
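
Several statistics can be computed in a single summarise() call. The sketch below adds the standard deviation and quartiles per cut, and cross-checks the group means against base R's tapply(); the output column names are arbitrary choices:

```r
library(dplyr)
library(ggplot2)   # provides the diamonds dataset

stats <- diamonds %>%
  group_by(cut) %>%
  summarise(n          = n(),
            mean_carat = mean(carat),
            sd_carat   = sd(carat),
            q25        = quantile(carat, 0.25),
            q75        = quantile(carat, 0.75),
            .groups    = "drop")

# Base R equivalent for the group means
base_means <- tapply(diamonds$carat, diamonds$cut, mean)
```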

Long and wide tables

A dataset can be written in two different formats: wide and long.
A wide format contains values that do not repeat in the first column.
A long format contains values that do repeat in the first column.

Long and wide tables (cont.)

This is a wide table

  Comp                      Type Equity  Tax
1    A       Sole proprietorship   10.1 0.54
2    B               Partnership    8.0 0.22
3    C               Corporation    5.2 0.08
4    D Limited liability company    1.1 0.02
5    E               Cooperative  115.0 5.80

This is a long table

   Comp Variable                     Value
1     A     Type       Sole proprietorship
2     B     Type               Partnership
3     C     Type               Corporation
4     D     Type Limited liability company
5     E     Type               Cooperative
6     A   Equity                      10.1
7     B   Equity                         8
8     C   Equity                       5.2
9     D   Equity                       1.1
10    E   Equity                       115
11    A      Tax                      0.54
12    B      Tax                      0.22
13    C      Tax                      0.08
14    D      Tax                      0.02
15    E      Tax                       5.8
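
The long table can be derived from the wide one in code. The sketch below rebuilds the wide table by hand and reshapes it with tidyr's pivot_longer(); the output column names Variable and Value are choices made here. Because a single value column must hold both text and numbers, every column is converted to character first:

```r
library(tidyr)

wide <- data.frame(
  Comp   = c("A", "B", "C", "D", "E"),
  Type   = c("Sole proprietorship", "Partnership", "Corporation",
             "Limited liability company", "Cooperative"),
  Equity = c(10.1, 8.0, 5.2, 1.1, 115.0),
  Tax    = c(0.54, 0.22, 0.08, 0.02, 5.80)
)

# Convert every column to character so the values can share one column
wide[] <- lapply(wide, as.character)

# One row per (Comp, Variable) pair
long <- pivot_longer(wide, cols = -Comp,
                     names_to = "Variable", values_to = "Value")

# Order the rows by variable, as in the table above
long <- long[order(match(long$Variable, c("Type", "Equity", "Tax"))), ]
```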

Long and wide tables (cont.)

Preparing the data:

library(tidyverse)
library(googledrive)
pa <- 'https://drive.google.com/file/d/1Jc14iMyezkp0IARQdEPe03HbHh6NCiZA/view?usp=sharing'
id <- as_id(pa)
pa <- paste0('https://docs.google.com/uc?id=',id,'&export=download')
d <- read.csv(pa, header = T)
tmp <- d %>% filter(Country.name %in% c('Vietnam','Thailand','Malaysia', 'cambodia', 'Laos'))
tmp <- tmp %>% select(Country.name,Population,Year)
tmp <- tmp %>% rename(country = Country.name ,pop = Population, year = Year)
tmp

Converting data between long and wide

l2w <- tmp %>% spread(key = country, value = pop)     # long to wide: one column per country
l2w_year <- tmp %>% spread(key = year, value = pop)   # alternative: one column per year
w2l <- l2w %>% gather(key = 'country', value = 'pop', c(Laos,Malaysia,Thailand,Vietnam))   # back to long
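
In current versions of tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(). The sketch below shows the same round trip with the newer functions on a small made-up data frame; the population figures are invented for illustration and are not real data:

```r
library(tidyr)

# Hypothetical long table: one row per country and year
pop <- data.frame(country = rep(c("Laos", "Vietnam"), each = 2),
                  year    = rep(c(2000, 2010), times = 2),
                  pop     = c(5.4, 6.3, 79.0, 87.4))

# Long to wide: one column per country
wide <- pivot_wider(pop, names_from = country, values_from = pop)

# Wide back to long
back <- pivot_longer(wide, cols = c(Laos, Vietnam),
                     names_to = "country", values_to = "pop")
```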

Practice exercises

Find data on the population of Southeast Asian countries or other regions and analyze it.
Find data on agriculture in Southeast Asian countries or other regions and analyze it.
Find data on the GDP of Southeast Asian countries or other regions and analyze it.
Find data on education in Southeast Asian countries or other regions and analyze it.
Find data on real estate in Southeast Asian countries or other regions and analyze it.
Find data on FDI in Southeast Asian countries or other regions and analyze it.
Find data on beer and alcohol in Southeast Asian countries or other regions and analyze it.

Some additional functions

bind_cols(df_1, df_2) # Append a table to the right side (horizontally) of another
bind_rows(df_1, df_2) # Append a table to the bottom (vertically) of another
union(df_1, df_2) # Combine the rows of both tables and drop duplicates
union_all(df_1, df_2) # Combine all rows of both tables, keeping duplicates
intersect(df_1, df_2) # Find the rows that appear in both tables
setdiff(df_1, df_2) # Find the rows of df_1 that do not appear in df_2
merge(df_1, df_2, by = join_field) # Base R: merge two data frames on a given field; all, all.x and all.y select the join type
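
The set operations compare whole rows and expect both tables to have the same columns. A small sketch with two made-up data frames (df_1 and df_2 here are hypothetical) makes the differences concrete:

```r
library(dplyr)

df_1 <- data.frame(id = 1:3, v = c("a", "b", "c"))
df_2 <- data.frame(id = 2:4, v = c("b", "c", "d"))

u  <- union(df_1, df_2)       # rows in either table, duplicates dropped
ua <- union_all(df_1, df_2)   # all rows from both tables
i  <- intersect(df_1, df_2)   # rows present in both tables
s  <- setdiff(df_1, df_2)     # rows of df_1 that are not in df_2
br <- bind_rows(df_1, df_2)   # stack the tables vertically
```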


df %>% mutate(new_col = operation(other_cols)) # Add new columns on top of old ones
df %>% transmute(new_col = operation(other_cols)) # Add new columns and discard old ones
df %>% mutate_at(vars, funs) # Modify several columns in-place (superseded by across())
df %>% mutate_all(funs) # Modify all columns in-place (superseded by across())
df %>% mutate_if(condition, funs) # Modify columns fitting a specific condition (superseded by across())
df %>% unite(new_merged_col, old_cols_list) # Unite columns
df %>% separate(col_to_separate, new_cols_list) # Separate columns
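
In dplyr 1.0 and later, across() covers the multi-column cases handled by the scoped mutate variants. The sketch below, on the first rows of iris, shows one way to transform all numeric columns or a chosen subset; the specific transformations are arbitrary examples:

```r
library(dplyr)

d <- head(iris)

# Apply a function to every numeric column
d_all <- d %>% mutate(across(where(is.numeric), ~ .x / 10))

# Apply a function to selected columns only
d_some <- d %>% mutate(across(c(Sepal.Length, Sepal.Width), round))
```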

inner_join(x, y) # keeps observations appearing in both tables.
left_join(x, y) # keeps all observations in x and only adds matches from y.
right_join(x, y) # keeps all observations in y and only adds matches from x.
full_join(x, y) # keeps all observations in x and y; if there’s no match, it returns NAs.
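
The four joins differ only in which unmatched rows they keep. A sketch with two small made-up tables sharing an id key (x, y and their columns are hypothetical) illustrates this:

```r
library(dplyr)

x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

ij <- inner_join(x, y, by = "id")   # ids 2 and 3 only
lj <- left_join(x, y, by = "id")    # ids 1, 2, 3; b is NA for id 1
rj <- right_join(x, y, by = "id")   # ids 2, 3, 4; a is NA for id 4
fj <- full_join(x, y, by = "id")    # ids 1 to 4, NA where there is no match
```

Passing by explicitly avoids relying on the automatic detection of shared column names.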