Fall 2017 MIS 572 – Introduction to Big Data Analytics

Group Exercise 1

1

  1. Load the given dataset cars.csv. Only keep those records with fuelType = “gas” and clean the data by removing those incomplete cases (record with any missing values).

在這裡推薦使用data.table這個package,他有強大的fread()讀取資料函數。可以點這裡參考fread的使用。

因為cars的資料有NULL與.的無用data,我們可以使用fread直接在讀取的時候把他們刪去 。這裡使用fread的參數na.string來刪除。

順便介紹這裡有使用到的參數:

其餘參數就不在這解釋了,上面有連結可參考fread函數。

#設定R當前的工作檔案路徑
setwd("/Users/yuzhe/Desktop")
cars <- data.table::fread("cars.csv", sep = ',', na.strings = c("NULL", "."), header = T, stringsAsFactors = F, data.table = F)

題目有說只要fuelTpye = “gas”的data,所以在這使用subset函數。 也在這提供五種方篩選要的資料並存成一個新的子集。

5 Ways to Subset a Data Frame in R

cars_gas <- subset(cars, fuelType == "gas")

剩下NA值還沒刪除,以下範例四種方式去除有NA值的列。個人推薦使用complete.caseapply的使用概念也必須釐清。

# use is.na() with rowSums(). A bit slow.
#rowSums是算出每一列的總和,裡面包is.na()會去檢查NA。總和為0就表示那一列沒有NA
cars_gas_comp = cars_gas[rowSums(is.na(cars_gas)) == 0,]

# Use apply(). Computation is easier to be parallelized later
cars_gas_comp = cars_gas[apply(cars_gas, MARGIN = 1, FUN = function(x) all(! is.na(x))), ]

# Use R buil-in function stat::complete.cases(). A bit faster, if you just work on a single machine.
cars_gas_comp = cars_gas[complete.cases(cars_gas), ]

# Use R buil-in function na.omit(). It works too, but a bit slow.
cars_gas_comp = na.omit(cars_gas)
  1. Consider the following SQL code. Replace the code with R data aggregation functions or your own split-apply-combine statements if you’d like.
Select bodyStyle, avg(highwayMpg) as avgHwMPG from cars 
group by bodyStyle order by avgHwMPG

其實題目就是分群彙整的概念,可使用1. aggregation函數或 2. split-apply-combine 解答只給aggregation,我自己付上split-apply-combine的寫法。

若真的不懂可以回去參閱unit2 slide

# Check the output of the SQL statments if you'd like
sqldf("Select bodyStyle, avg(highwayMpg) as avgHwMPG from cars_gas_comp group by bodyStyle order by avgHwMPG")
##     bodyStyle avgHwMPG
## 1 convertible 24.00000
## 2       wagon 30.25000
## 3       sedan 30.73134
## 4     hardtop 31.75000
## 5   hatchback 33.76364
#method -aggregate
car_style_avgMPG = aggregate(cars_gas_comp$highwayMpg, by = list("bodyStyle" = cars_gas_comp$bodyStyle),FUN = mean )
colnames(car_style_avgMPG)[2] = "avgHwMPG"
car_style_avgMPG = car_style_avgMPG[order(car_style_avgMPG$avgHwMPG), ]
car_style_avgMPG
##     bodyStyle avgHwMPG
## 1 convertible 24.00000
## 5       wagon 30.25000
## 4       sedan 30.73134
## 2     hardtop 31.75000
## 3   hatchback 33.76364
#method -split-apply-combind
car_style_split <- split(cars_gas_comp, as.factor(cars_gas_comp$bodyStyle))
car_style_avg <- sapply(car_style_split, function(x) mean(x$highwayMpg))
car_style_avg_df <- data.frame("avgHwMPG" = car_style_avg[order(car_style_avg)])
car_style_avg_df
##             avgHwMPG
## convertible 24.00000
## wagon       30.25000
## sedan       30.73134
## hardtop     31.75000
## hatchback   33.76364

2 Consider the following key-value pairs. Create two tables/R data frames and merge them by cID and year. Please keep all the information in both tables.

df_x: cID = c(1,1,1,2,2,3,3,4,4,4,5,5,5) year = c(2002,2003,2004,2005,2006,2002,2003,2004,2005,2006,2002,2003,2004,2005,2006) x = c(7.467, 7.984, NA, 8.356, 10.345, 9.67, 8.422, 6.356, NA, 8.577, 6.897, NA, 4.562,7.562,8.463) df_y: cID = 1:6 year = c(2002, 2005, 2002, 2002, 2004, 2006) y = c(8.456, 6.326, 6.467, 7.366, 4.567, 6.335)

看到merge them by cID and year.就可以很直覺的想到使用merge()函數,去join兩個table。然後題目有說要keep all the information in both tables.表示要保留兩個table的所有資訊,表示要使用outer join!

join的概念可以回去複習unit2的slide

df_x = data.frame(cID = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                  year = c(2002,2003,2004,2005,2006,2002,2003,2004,2005,2006,2002,2003,2004,2005,2006),
                  x = c(7.467, 7.984, NA, 8.356, 10.345, 9.67, 8.422, 6.356, NA, 8.577, 6.897, NA, 4.562,7.562,8.463)
)

df_y = data.frame(cID = 1:6,
                  year = c(2002, 2005, 2002, 2002, 2004, 2006),
                  y = c(8.456, 6.326, 6.467, 7.366, 4.567, 6.335)
)

# Just a simple outer join operation
merge(df_x, df_y, by = c("cID", "year"), all = T)
##    cID year      x     y
## 1    1 2002  7.467 8.456
## 2    1 2003  7.984    NA
## 3    1 2004     NA    NA
## 4    2 2002  9.670    NA
## 5    2 2005  8.356 6.326
## 6    2 2006 10.345    NA
## 7    3 2002     NA 6.467
## 8    3 2003  8.422    NA
## 9    3 2004  6.356    NA
## 10   3 2005     NA    NA
## 11   4 2002  6.897 7.366
## 12   4 2003     NA    NA
## 13   4 2006  8.577    NA
## 14   5 2004  4.562 4.567
## 15   5 2005  7.562    NA
## 16   5 2006  8.463    NA
## 17   6 2006     NA 6.335

3 : Load the built-in dataset “ChickWeight”

  1. Visualize your data to help understand the relationship between the “diet received” and “weight of the chick”
#先載入ggplot2
require(ggplot2)
## Loading required package: ggplot2
#load data ChickWeight
data("ChickWeight")
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

可以看出diet跟weight的關係,一離散變數相對連續變數,可使用ANOVA,去做Diet群組比較,ANOVA使用box plot 去看出關係。

補充教材:

ggplot2中文教學

常用的繪圖程式

#x: Diet, y: weight, data: ChickWeight, geomtric: boxplot
qplot(Diet, weight, data = ChickWeight, geom = "boxplot", xlab = "Treatment")

  1. Use any statistical methods to check whether the “diet received” might have impact on the “weight of the chick”. 解答提供三種方式:

範例都會提供,但這邊只對ANOVA做講解,其他兩待理解後補充。

# ANOVA. Seems that the "diet" does have impact on the "weight of the chick"
chick_aov = aov(weight ~ Diet, data = ChickWeight)
summary(chick_aov)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## Diet          3  155863   51954   10.81 6.43e-07 ***
## Residuals   574 2758693    4806                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

沒錯,ANOVA就是這麼簡單的兩行。

p-value遠小於0.001,明顯拒絕虛無假說,Diet對weight有顯著影響。

接下來為線性回歸的寫法。因為ANOVA其實也是回歸的一種。

chick_lm = lm(weight ~ Diet, data = ChickWeight)
summary(chick_lm)
## 
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.95  -53.65  -13.64   40.38  230.05 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  102.645      4.674  21.961  < 2e-16 ***
## Diet2         19.971      7.867   2.538   0.0114 *  
## Diet3         40.305      7.867   5.123 4.11e-07 ***
## Diet4         32.617      7.910   4.123 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.33 on 574 degrees of freedom
## Multiple R-squared:  0.05348,    Adjusted R-squared:  0.04853 
## F-statistic: 10.81 on 3 and 574 DF,  p-value: 6.433e-07
#ANOVA table 要有car這個套件
car::Anova(chick_lm, type = 3)
## Anova Table (Type III tests)
## 
## Response: weight
##              Sum Sq  Df F value    Pr(>F)    
## (Intercept) 2317940   1  482.29 < 2.2e-16 ***
## Diet         155863   3   10.81 6.433e-07 ***
## Residuals   2758693 574                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

補充

# Kruskal-Wallis 
kruskal.test(weight ~ Diet, data = ChickWeight)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  weight by Diet
## Kruskal-Wallis chi-squared = 24.45, df = 3, p-value = 2.012e-05
# Tukey's HSD(honest significant difference) test all pairwise differences between group means,
# which give you some idea about which treatments may differ from one another.
TukeyHSD(chick_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ Diet, data = ChickWeight)
## 
## $Diet
##          diff         lwr      upr     p adj
## 2-1 19.971212  -0.2998092 40.24223 0.0552271
## 3-1 40.304545  20.0335241 60.57557 0.0000025
## 4-1 32.617257  12.2353820 52.99913 0.0002501
## 3-2 20.333333  -2.7268370 43.39350 0.1058474
## 4-2 12.646045 -10.5116315 35.80372 0.4954239
## 4-3 -7.687288 -30.8449649 15.47039 0.8277810

4 Write a function, cltsim(rand_data, sampleSize, numOfSamples), that demonstrates Central Limit Theorem. Your function must have 3 arguments: rand_data is a numeric vector that contains large amount of randomly-generated numbers (you may use any R random generation functions). Also, sampleSize and numOfSamples are literally the size of the sample and the number of samples respectively. The only output of your function is a histogram of the simulated distribution along with the mean and variance of the simulated sampling distribution

先解釋題意,這要寫一個function,叫cltsim(rand_data, sampleSize, numOfSamples),這函數展現「中央極限定理」。這function要有3個參數:

  1. rand_data: 是一個數字向量,他有大量隨機抽樣的數字(或許你會用到R隨機產生亂數的函數,其實就是sample()拉!)

  2. sampleSize: 樣本大小。

  3. numOfSamples: 抽樣次數。

最終這函數輸出的結果只能是一張直方圖(Histogram),還要有這隨機抽樣分佈的平均值和變異數。

cltsim = function(rand_data, sampleSize, numOfSamples){
  xbarVec = double(length = numOfSamples)
  
  for(i in 1:numOfSamples){
    #設定每次亂數的基數
    set.seed(i);
    #講每次亂數的平均存進這個xbarVec這個向量
    xbarVec[i] = mean(sample(rand_data, sampleSize));
  }
  
  hist(xbarVec, xlab = bquote(bar(x)), main = '')
  
  #在圖片上加上平均
  #Mean of the"sampling distribution of the sample mean"
  #is equal to the population mean (mu xbar = mu)
  mtext( bquote(E(bar(x)) == .(round(mean(xbarVec), 4))), side = 3, adj = 0.3)
  #在圖片上加上變異數
  #The standard deviation of the sampling distribution of x bar is equal to "std/sqrt(n)"
  #I.e. std xbar = std/sqrt(n)  =>  "variance of xbar" = "varance/n"
  mtext( bquote(V(bar(x)) == .(round(var(xbarVec), 4))),side = 3, adj = 0.7)
}

相信大家到這邊都有看沒有懂,我來實作裡面的function在幹嘛吧!

首先xbarVec = double(length = numOfSamples)這段程式碼是先宣告一個儲存為double資料型態的一個長度為10的向量。

接下來我們直接示範這個for迴圈

#先另外設一個等同於xbarVec的向量
testVec <- double( length = 10 )
#先假設numOfSamples為10
#這個for迴圈會讓i = 1~10跑10次
for(i in 1:10){
  #設定每次亂數的基數
  set.seed(i);
  #講每次亂數的平均存進這個testVec這個向量
  #從testVec[1]存到testVec[10]
  #rand_data = rbinom(10^6, 10, 0.3)
  #numOfSamples = 10
  testVec[i] = mean(sample(rbinom(10^6, 10, 0.3), 10));
}
#來看結果吧!
testVec
##  [1] 3.4 3.2 3.0 3.0 2.8 2.7 2.7 3.5 2.6 2.5

不知道這樣有沒有比較了解這個for迴圈在幹嘛呢?

接下來hist(xbarVec, xlab = bquote(bar(x)), main = '')這段code就只是把結果化成直方圖呈現拉~

xlab = bquote(bar(x))xlab就是命名x座標的名稱而已。到這邊大家可能會疑惑bquote這函數在幹嘛吧!

bquote在幹嘛,點我

main = ''就只是把圖案的標題名稱弄掉而已!

繪圖成果

cltsim(rbinom(10^6, 10, 0.3), sampleSize = 10, numOfSamples = 10)

cltsim(rbinom(10^6,10,0.3), sampleSize = 10, numOfSamples = 10^3)

cltsim(rbinom(10^6,10,0.3), sampleSize = 10^2, numOfSamples = 10^3)