# install.packages(c("amssymb", "amsmath"))
dataset1 = read.csv("dataset1.csv")
head(dataset1)
## length
## 1 77
## 2 289
## 3 128
## 4 59
## 5 19
## 6 148
hist(dataset1$length, breaks = 20)
n_dataset1 = 80
The distribution is significantly skewed to the right and does not appear to be approximately Normal at all. ’
library(boot)
theta <- function(dataset1, i) {
d <- dataset1[i]
mean(d)
}
call_length_boot = boot(dataset1$length, theta, R = 1000)
plot(call_length_boot)
The bootstrap distribution of the call length is closer to Normal than the original sampling distribution, however, it is still slightly skewed to the right.
call_length_boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = dataset1$length, statistic = theta, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 196.575 -1.418962 36.01564
The bootstrap estimate of the bias is -0.341. The observed mean is 196.575 and the boostrap mean is 195.350, meaning the bias is quite small relative to the observed mean. Since the bias is small, the bootstrap t confidence interval is justified.
boot.ci(boot.out = call_length_boot, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = call_length_boot, type = "norm")
##
## Intervals :
## Level Normal
## 95% (127.4, 268.6 )
## Calculations and Intervals on Original Scale
the 95% bootstrap confidence interval for the population mean is (123.0, 269.5)
The values of the two standard errors are:
Bootstrap error: \(37.735\)
Formula-based standard error:
\(\frac{s}{\sqrt{n}}=\frac{\sqrt{\frac{\Sigma(x-\bar{x})^2}{80-1}}}{\sqrt{80}}=\frac{342.02}{\sqrt{80}}= 38.34\)
sd(dataset1$length)
## [1] 342.0215
The usual t 95% interval is
t.test(dataset1$length, conf.level = 0.95)
##
## One Sample t-test
##
## data: dataset1$length
## t = 5.1407, df = 79, p-value = 1.937e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 120.4618 272.6882
## sample estimates:
## mean of x
## 196.575
The bootstrap interval is 195.35 ± (2.000)SE_boot = (123.0, 269.5). The usual t interval is (120.46, 272.68). The bootstrap t interval is slightly narrower than the usual t 95% confidence interval. However, the two interval are very similar and may yield similar downstream interpretations.
dataset2 = read.csv("dataset2.csv")
head(dataset2)
## Name ArchBill13 ArchBill12 N_Arch N_Eng N_Staff
## 1 BSA 18.2 19.1 31 31 62
## 2 Ratio 18.1 17.8 37 0 92
## 3 CSO 16.2 16.0 22 0 72
## 4 Schmidt 12.3 13.3 21 8 82
## 5 American 9.1 8.2 14 109 282
## 6 Browning 6.2 5.2 13 0 40
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggpairs(select(dataset2, -Name), cardinality_threshold = 30)
The distributions of each of the variables all seem to be skewed to the right with most companies having less than 100 staff, less than 30 engineers,a nd less than 20 architects. The total past and current billings seem to mostly be under 10 million dollars, with some right-tailed increased kurtosis representing billings over 15 million dollars.
There seems to be a noticeable positive linear relationship between the number of architects and the current and past year billings. A regression line may also show a very weak linear relationship between the number of staff with the number of engineers and architects, however, that may be susceptible to outliers. There does not appear to be any qualitative linear relationships between the other variables.
The proposed regression equation is \(ArchBill13 = \beta_0 + \beta_1(ArchBill12) + \beta_2(NArch) + \beta_3(NEng) + \beta_4(NStaff)\)
\(H_0: \beta_j = 0\) \(H_1: \beta_j \neq 0\)
arch_fit = lm(ArchBill13 ~ ArchBill12 + N_Arch + N_Eng + N_Staff, data = dataset2)
summary(arch_fit)
##
## Call:
## lm(formula = ArchBill13 ~ ArchBill12 + N_Arch + N_Eng + N_Staff,
## data = dataset2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0792 -0.4611 -0.1639 0.1911 2.9448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2798096 0.2834615 0.987 0.3354
## ArchBill12 0.7519881 0.1145755 6.563 2.15e-06 ***
## N_Arch 0.1219622 0.0693080 1.760 0.0937 .
## N_Eng 0.0045089 0.0176494 0.255 0.8010
## N_Staff 0.0009354 0.0063769 0.147 0.8848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.877 on 20 degrees of freedom
## Multiple R-squared: 0.9775, Adjusted R-squared: 0.973
## F-statistic: 217.5 on 4 and 20 DF, p-value: 3.542e-16
Fitted equation:
\(ArchBill13 = 0.2798 + 0.7519(ArchBill12) + 0.1219(NArch) + 0.0045(NEng) + 0.0009(NStaff)\)
Regression standard error: \(0.877\)
arch_resid = residuals(arch_fit)
par(mfrow = c(2,2))
plot(dataset2$ArchBill12, resid(arch_fit)); abline(0,0)
plot(dataset2$N_Arch, resid(arch_fit)); abline(0,0)
plot(dataset2$N_Eng, resid(arch_fit)); abline(0,0)
plot(dataset2$N_Staff, resid(arch_fit)); abline(0,0)
There is qualitative evidence of heteroscedasticity in each of the plots given the enrichment of residual data points at the lower end of X values. This would be a concern for downstream interpretation of the regression. This is because of the the assumptions for regression inference is that the standard deviation of the responses is the same for all values of X.
\(ArchBill13 = 0.2798 + 0.7519(1) + 0.1219(3) + 0.0045(1) + 0.0009(17) = \textbf{1.4172 million dollars}\)
When you use inference, you’re hoping to gain estimates for a population mean. This experimental study may suffer from nonresponse from those who did not respond to the survey. It should also be noted that the samples all come from one area, and that infrences about broader population behavior may be biased because of this. This could be fixed by possible gatthering information from all 50 states in the US, controlling for variability in population size, and analyzing it as a stratified random sample.
country = c("Canada", "US", "France")
female = c(7.70, 7.36, 6.38)
male = c(6.39, 6.43, 5.69)
att_score = data.frame(country, female, male)
# library(ggplot2)
# ggplot(att_score, aes(x = country, y = )) + geom_bar()
plot(att_score$female, pch = 18, col = "red", type = "b", lty = 2, ylim = c(5,8), xlab = "Canada, US, France", ylab = "", main = "Attitude Scores between genders and countries")
lines(att_score$male, pch = 19, col = "blue", type = "b", lty = 2)
legend("topleft", legend=c("Female", "Male"),
col=c("red", "blue"), lty = 2:2, cex=0.8)
The above plot does not show qualitative evidence of aninteraction between culture and gender. The largest difference in genders is observed in Canada and the smallest difference in France. for the most part, the lines seam relatively parallel to one another.
The pooled estimate of standard deviation is \(s_p^2 = \frac{\Sigma(n_{ij}-1)s_{ij}^2}{\Sigma(n_{ij}-1)}\)
\(=\frac{(238-1)(1.668)^2 +(125-1)(1.909)^2+(178-1)(1.736)^2 + (101-1)(1.601)^2+(82-1)(2.024)^2+(87-1)(1.875)^2}{(238-1)+(125-1)+(178-1)+(101-1)+(82-1)+(87-1)} = 2535.18/805 = 3.1493\)
\(s_p = \sqrt{3.1493} = \textbf{1.7746}\)
It is reasonable to use a pooled standard deviation because the largest standard deviation is less than double the smallest standard deviation; 2.024 < 2(1.601)=3.202. ## d) DFE The denominator degrees of freedom si the same as the DFE, which is (238-1)+(125-1)+(178-1)+(101-1)+(82-1)+(87-1) = 805
Since there are 6 treatment outcomes, there are \({6 \choose 2} = \frac{6!}{2!4!} = \frac{30}{2} = \textbf{15 pairwise combinations}\)
\(t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_p\sqrt{\frac{1}{n_i}+\frac{i}{n_j}}}\)
\(t^{**}= 2.94\) and \(s_p = 1.774\)
\(t_{ij}\) is significant when it’s > \(t^{**} = 2.94\)
\(t = \frac{7.70 - 6.39}{1.774\sqrt{\frac{1}{238}+\frac{1}{125}}} = 1.31/0.1959 = 6.68\)
Significant
\(t = \frac{7.70 - 7.36}{1.774\sqrt{\frac{1}{238}+\frac{1}{178}}} = 0.34/0.1757 = 1.934\)
\(t = \frac{7.70 - 6.38}{1.774\sqrt{\frac{1}{238}+\frac{1}{82}}} = \frac{1.32}{.2271} = 5.81\)
Significant
\(t = \frac{7.70 - 6.43}{1.774\sqrt{\frac{1}{238}+\frac{1}{101}}} = \frac{1.27}{.2106} = 6.028\)
Significant
\(t = \frac{7.70 - 5.69}{1.774\sqrt{\frac{1}{238}+\frac{1}{87}}} = \frac{2.01}{.227} = 9.04\)
Significant
\(t = \frac{7.36 - 6.43}{1.774\sqrt{\frac{1}{178}+\frac{1}{101}}} = \frac{.93}{.2209} = 4.208\)
Significant
\(t = \frac{7.36 - 6.39}{1.774\sqrt{\frac{1}{178}+\frac{1}{125}}} = \frac{.97}{.2070} = 4.685\)
Significant
\(t = \frac{7.36 - 6.38}{1.774\sqrt{\frac{1}{178}+\frac{1}{82}}} = \frac{.98}{.2367} = 4.139\)
Significant
\(t = \frac{7.36 - 5.69}{1.774\sqrt{\frac{1}{178}+\frac{1}{87}}} = \frac{1.67}{.232} = 7.196\)
Significant
\(t = \frac{6.38 - 5.69}{1.774\sqrt{\frac{1}{82}+\frac{1}{87}}} = \frac{.69}{.273} = 2.52\)
\(t = \frac{6.38 - 6.39}{1.774\sqrt{\frac{1}{82}+\frac{1}{125}}} = \frac{-0.01}{.2521} = -0.0396\)
\(t = \frac{6.38 - 5.69}{1.774\sqrt{\frac{1}{82}+\frac{1}{87}}} = \frac{.69}{.273} = 2.52\)
\(t = \frac{6.39 - 6.43}{1.774\sqrt{\frac{1}{125}+\frac{1}{101}}} = \frac{-0.04}{.2373} = -0.1685\)
\(t = \frac{6.39 - 5.69}{1.774\sqrt{\frac{1}{125}+\frac{1}{87}}} = \frac{.7}{.237} = 2.949\)
Barely crosses threshold of significance
\(t = \frac{6.43 - 5.69}{1.774\sqrt{\frac{1}{101}+\frac{1}{87}}} = \frac{.74}{.259} = 2.851\)
In summary, 8/15 pairwise comparisons are very significant:
The comparison of Candaian and French Males came close to significance, but for a conservative analysis, one may not interpret it as such.
\(z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}= \frac{500-485}{100/\sqrt{400}} = 15/5 = 3\)
p = 0.0013 from Table A, which is significanct at the 10% level.
The probability of cmmitting a Type II error is 1-Power, which is 1-0.9987 = 0.0013. A type II error is when you fail to reject the \(H_0\) and \(H_0\) is false.
x = rnorm(400)
y = rnorm(400, mean = 500, sd = 100) ## generate randome samples with specifed mean and sd
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -101.23, df = 399.08, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -509.7801 -490.3566
## sample estimates:
## mean of x mean of y
## -0.005589055 500.062767008
\(P(z \leq 500) = p (\frac{500-485}{100/\sqrt{400}} = P(\frac{15}{5}) = 3)\)
z @ 3.0 = 0.9987 \(\therefore\) power = \(\textbf{99.8%}\)