1
- Load the given dataset cars.csv. Only keep those records with fuelType = “gas” and clean the data by removing those incomplete cases (record with any missing values).
在這裡推薦使用data.table這個package,他有強大的fread()讀取資料函數。可以點這裡參考fread的使用。
因為cars的資料有NULL與.的無用data,我們可以使用fread直接在讀取的時候把他們刪去 。這裡使用fread的參數na.string來刪除。
順便介紹這裡有使用到的參數:
sep 檔案是用什麼分隔的。csv全名為 Comma-Separated Values。顧名思義就是用逗號來分格資料作為表格儲存。fread其實可以自動判斷檔案的類型,所以這個參數是可以不用打的。
header boolean。用來判斷第一列資料有沒有每一欄的名稱,若沒有會自動給予預設名稱。
stringAsFactors 要不要把所有是字元的那一欄轉換為factor格式。
data.table 決定讀進來的資料為table or data.frame。True的話回傳 data.table格式,False的話回傳data.frame。
其餘參數就不在這解釋了,上面有連結可參考fread函數。
#設定R當前的工作檔案路徑
setwd("/Users/yuzhe/Desktop")
cars <- data.table::fread("cars.csv", sep = ',', na.strings = c("NULL", "."), header = T, stringsAsFactors = F, data.table = F)
題目有說只要fuelTpye = “gas”的data,所以在這使用subset函數。 也在這提供五種方篩選要的資料並存成一個新的子集。
5 Ways to Subset a Data Frame in R
cars_gas <- subset(cars, fuelType == "gas")
剩下NA值還沒刪除,以下範例四種方式去除有NA值的列。個人推薦使用complete.case,apply的使用概念也必須釐清。
# use is.na() with rowSums(). A bit slow.
#rowSums是算出每一列的總和,裡面包is.na()會去檢查NA。總和為0就表示那一列沒有NA
cars_gas_comp = cars_gas[rowSums(is.na(cars_gas)) == 0,]
# Use apply(). Computation is easier to be parallelized later
cars_gas_comp = cars_gas[apply(cars_gas, MARGIN = 1, FUN = function(x) all(! is.na(x))), ]
# Use R buil-in function stat::complete.cases(). A bit faster, if you just work on a single machine.
cars_gas_comp = cars_gas[complete.cases(cars_gas), ]
# Use R buil-in function na.omit(). It works too, but a bit slow.
cars_gas_comp = na.omit(cars_gas)
- Consider the following SQL code. Replace the code with R data aggregation functions or your own split-apply-combine statements if you’d like.
Select bodyStyle, avg(highwayMpg) as avgHwMPG from cars
group by bodyStyle order by avgHwMPG
其實題目就是分群彙整的概念,可使用1. aggregation函數或 2. split-apply-combine 解答只給aggregation,我自己付上split-apply-combine的寫法。
若真的不懂可以回去參閱unit2 slide
# Check the output of the SQL statments if you'd like
sqldf("Select bodyStyle, avg(highwayMpg) as avgHwMPG from cars_gas_comp group by bodyStyle order by avgHwMPG")
## bodyStyle avgHwMPG
## 1 convertible 24.00000
## 2 wagon 30.25000
## 3 sedan 30.73134
## 4 hardtop 31.75000
## 5 hatchback 33.76364
#method -aggregate
car_style_avgMPG = aggregate(cars_gas_comp$highwayMpg, by = list("bodyStyle" = cars_gas_comp$bodyStyle),FUN = mean )
colnames(car_style_avgMPG)[2] = "avgHwMPG"
car_style_avgMPG = car_style_avgMPG[order(car_style_avgMPG$avgHwMPG), ]
car_style_avgMPG
## bodyStyle avgHwMPG
## 1 convertible 24.00000
## 5 wagon 30.25000
## 4 sedan 30.73134
## 2 hardtop 31.75000
## 3 hatchback 33.76364
#method -split-apply-combind
car_style_split <- split(cars_gas_comp, as.factor(cars_gas_comp$bodyStyle))
car_style_avg <- sapply(car_style_split, function(x) mean(x$highwayMpg))
car_style_avg_df <- data.frame("avgHwMPG" = car_style_avg[order(car_style_avg)])
car_style_avg_df
## avgHwMPG
## convertible 24.00000
## wagon 30.25000
## sedan 30.73134
## hardtop 31.75000
## hatchback 33.76364
2 Consider the following key-value pairs. Create two tables/R data frames and merge them by cID and year. Please keep all the information in both tables.
df_x: cID = c(1,1,1,2,2,3,3,4,4,4,5,5,5) year = c(2002,2003,2004,2005,2006,2002,2003,2004,2005,2006,2002,2003,2004,2005,2006) x = c(7.467, 7.984, NA, 8.356, 10.345, 9.67, 8.422, 6.356, NA, 8.577, 6.897, NA, 4.562,7.562,8.463) df_y: cID = 1:6 year = c(2002, 2005, 2002, 2002, 2004, 2006) y = c(8.456, 6.326, 6.467, 7.366, 4.567, 6.335)
看到merge them by cID and year.就可以很直覺的想到使用merge()函數,去join兩個table。然後題目有說要keep all the information in both tables.表示要保留兩個table的所有資訊,表示要使用outer join!
join的概念可以回去複習unit2的slide
df_x = data.frame(cID = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
year = c(2002,2003,2004,2005,2006,2002,2003,2004,2005,2006,2002,2003,2004,2005,2006),
x = c(7.467, 7.984, NA, 8.356, 10.345, 9.67, 8.422, 6.356, NA, 8.577, 6.897, NA, 4.562,7.562,8.463)
)
df_y = data.frame(cID = 1:6,
year = c(2002, 2005, 2002, 2002, 2004, 2006),
y = c(8.456, 6.326, 6.467, 7.366, 4.567, 6.335)
)
# Just a simple outer join operation
merge(df_x, df_y, by = c("cID", "year"), all = T)
## cID year x y
## 1 1 2002 7.467 8.456
## 2 1 2003 7.984 NA
## 3 1 2004 NA NA
## 4 2 2002 9.670 NA
## 5 2 2005 8.356 6.326
## 6 2 2006 10.345 NA
## 7 3 2002 NA 6.467
## 8 3 2003 8.422 NA
## 9 3 2004 6.356 NA
## 10 3 2005 NA NA
## 11 4 2002 6.897 7.366
## 12 4 2003 NA NA
## 13 4 2006 8.577 NA
## 14 5 2004 4.562 4.567
## 15 5 2005 7.562 NA
## 16 5 2006 8.463 NA
## 17 6 2006 NA 6.335
3 : Load the built-in dataset “ChickWeight”
- Visualize your data to help understand the relationship between the “diet received” and “weight of the chick”
#先載入ggplot2
require(ggplot2)
## Loading required package: ggplot2
#load data ChickWeight
data("ChickWeight")
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
可以看出diet跟weight的關係,一離散變數相對連續變數,可使用ANOVA,去做Diet群組比較,ANOVA使用box plot 去看出關係。
補充教材:
#x: Diet, y: weight, data: ChickWeight, geomtric: boxplot
qplot(Diet, weight, data = ChickWeight, geom = "boxplot", xlab = "Treatment")
- Use any statistical methods to check whether the “diet received” might have impact on the “weight of the chick”. 解答提供三種方式:
ANOVA
Kruskal-Wallis
Tukey’s HSD
範例都會提供,但這邊只對ANOVA做講解,其他兩待理解後補充。
# ANOVA. Seems that the "diet" does have impact on the "weight of the chick"
chick_aov = aov(weight ~ Diet, data = ChickWeight)
summary(chick_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 3 155863 51954 10.81 6.43e-07 ***
## Residuals 574 2758693 4806
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
沒錯,ANOVA就是這麼簡單的兩行。
p-value遠小於0.001,明顯拒絕虛無假說,Diet對weight有顯著影響。
接下來為線性回歸的寫法。因為ANOVA其實也是回歸的一種。
chick_lm = lm(weight ~ Diet, data = ChickWeight)
summary(chick_lm)
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.95 -53.65 -13.64 40.38 230.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.645 4.674 21.961 < 2e-16 ***
## Diet2 19.971 7.867 2.538 0.0114 *
## Diet3 40.305 7.867 5.123 4.11e-07 ***
## Diet4 32.617 7.910 4.123 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.33 on 574 degrees of freedom
## Multiple R-squared: 0.05348, Adjusted R-squared: 0.04853
## F-statistic: 10.81 on 3 and 574 DF, p-value: 6.433e-07
#ANOVA table 要有car這個套件
car::Anova(chick_lm, type = 3)
## Anova Table (Type III tests)
##
## Response: weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 2317940 1 482.29 < 2.2e-16 ***
## Diet 155863 3 10.81 6.433e-07 ***
## Residuals 2758693 574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Kruskal-Wallis
kruskal.test(weight ~ Diet, data = ChickWeight)
##
## Kruskal-Wallis rank sum test
##
## data: weight by Diet
## Kruskal-Wallis chi-squared = 24.45, df = 3, p-value = 2.012e-05
# Tukey's HSD(honest significant difference) test all pairwise differences between group means,
# which give you some idea about which treatments may differ from one another.
TukeyHSD(chick_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ Diet, data = ChickWeight)
##
## $Diet
## diff lwr upr p adj
## 2-1 19.971212 -0.2998092 40.24223 0.0552271
## 3-1 40.304545 20.0335241 60.57557 0.0000025
## 4-1 32.617257 12.2353820 52.99913 0.0002501
## 3-2 20.333333 -2.7268370 43.39350 0.1058474
## 4-2 12.646045 -10.5116315 35.80372 0.4954239
## 4-3 -7.687288 -30.8449649 15.47039 0.8277810
4 Write a function, cltsim(rand_data, sampleSize, numOfSamples), that demonstrates Central Limit Theorem. Your function must have 3 arguments: rand_data is a numeric vector that contains large amount of randomly-generated numbers (you may use any R random generation functions). Also, sampleSize and numOfSamples are literally the size of the sample and the number of samples respectively. The only output of your function is a histogram of the simulated distribution along with the mean and variance of the simulated sampling distribution
先解釋題意,這要寫一個function,叫cltsim(rand_data, sampleSize, numOfSamples),這函數展現「中央極限定理」。這function要有3個參數:
rand_data: 是一個數字向量,他有大量隨機抽樣的數字(或許你會用到R隨機產生亂數的函數,其實就是sample()拉!)
sampleSize: 樣本大小。
numOfSamples: 抽樣次數。
最終這函數輸出的結果只能是一張直方圖(Histogram),還要有這隨機抽樣分佈的平均值和變異數。
cltsim = function(rand_data, sampleSize, numOfSamples){
xbarVec = double(length = numOfSamples)
for(i in 1:numOfSamples){
#設定每次亂數的基數
set.seed(i);
#講每次亂數的平均存進這個xbarVec這個向量
xbarVec[i] = mean(sample(rand_data, sampleSize));
}
hist(xbarVec, xlab = bquote(bar(x)), main = '')
#在圖片上加上平均
#Mean of the"sampling distribution of the sample mean"
#is equal to the population mean (mu xbar = mu)
mtext( bquote(E(bar(x)) == .(round(mean(xbarVec), 4))), side = 3, adj = 0.3)
#在圖片上加上變異數
#The standard deviation of the sampling distribution of x bar is equal to "std/sqrt(n)"
#I.e. std xbar = std/sqrt(n) => "variance of xbar" = "varance/n"
mtext( bquote(V(bar(x)) == .(round(var(xbarVec), 4))),side = 3, adj = 0.7)
}
相信大家到這邊都有看沒有懂,我來實作裡面的function在幹嘛吧!
首先xbarVec = double(length = numOfSamples)這段程式碼是先宣告一個儲存為double資料型態的一個長度為10的向量。
接下來我們直接示範這個for迴圈
#先另外設一個等同於xbarVec的向量
testVec <- double( length = 10 )
#先假設numOfSamples為10
#這個for迴圈會讓i = 1~10跑10次
for(i in 1:10){
#設定每次亂數的基數
set.seed(i);
#講每次亂數的平均存進這個testVec這個向量
#從testVec[1]存到testVec[10]
#rand_data = rbinom(10^6, 10, 0.3)
#numOfSamples = 10
testVec[i] = mean(sample(rbinom(10^6, 10, 0.3), 10));
}
#來看結果吧!
testVec
## [1] 3.4 3.2 3.0 3.0 2.8 2.7 2.7 3.5 2.6 2.5
不知道這樣有沒有比較了解這個for迴圈在幹嘛呢?
接下來hist(xbarVec, xlab = bquote(bar(x)), main = '')這段code就只是把結果化成直方圖呈現拉~
xlab = bquote(bar(x))xlab就是命名x座標的名稱而已。到這邊大家可能會疑惑bquote這函數在幹嘛吧!
main = ''就只是把圖案的標題名稱弄掉而已!
繪圖成果
cltsim(rbinom(10^6, 10, 0.3), sampleSize = 10, numOfSamples = 10)
cltsim(rbinom(10^6,10,0.3), sampleSize = 10, numOfSamples = 10^3)
cltsim(rbinom(10^6,10,0.3), sampleSize = 10^2, numOfSamples = 10^3)