
1) 20 Points

dataset1 = read.csv("dataset1.csv")
head(dataset1)
##   length
## 1     77
## 2    289
## 3    128
## 4     59
## 5     19
## 6    148

a) Histogram of the call lengths and description of the distribution

hist(dataset1$length, breaks = 20)

n_dataset1 = 80  # sample size

The distribution is strongly skewed to the right and does not appear to be approximately Normal.

b) Bootstrap

library(boot)

# Statistic for boot(): the mean of a bootstrap resample of the call lengths
theta <- function(data, i) {
  d <- data[i]  # i indexes the resampled observations
  mean(d)
}

call_length_boot = boot(dataset1$length, theta, R = 1000)
plot(call_length_boot)

The bootstrap distribution of the mean call length is closer to Normal than the distribution of the original sample, but it is still slightly skewed to the right.

c) Bootstrap bias

call_length_boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = dataset1$length, statistic = theta, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*  196.575 -1.418962    36.01564

The bootstrap estimate of the bias is -1.419. The observed mean is 196.575, so the bootstrap mean is about 195.16; the bias is quite small relative to the observed mean. Since the bias is small, a bootstrap t confidence interval is justified.
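These quantities can also be read directly off the boot object: call_length_boot$t0 is the original estimate and call_length_boot$t holds the 1000 replicate means (a quick check, shown without its output).

mean(call_length_boot$t[, 1]) - call_length_boot$t0  # bootstrap estimate of the bias
sd(call_length_boot$t[, 1])                          # bootstrap standard error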

d) Bootstrap interval

boot.ci(boot.out = call_length_boot, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = call_length_boot, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (127.4, 268.6 )  
## Calculations and Intervals on Original Scale

The 95% bootstrap confidence interval for the population mean call length is (127.4, 268.6).

e) Bootstrap vs t interval

The values of the two standard errors are:

Bootstrap standard error: \(36.016\)

Formula-based standard error:

\(\frac{s}{\sqrt{n}}=\frac{\sqrt{\frac{\Sigma(x-\bar{x})^2}{80-1}}}{\sqrt{80}}=\frac{342.02}{\sqrt{80}}= 38.24\)

sd(dataset1$length)
## [1] 342.0215

The usual t 95% interval is

t.test(dataset1$length, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  dataset1$length
## t = 5.1407, df = 79, p-value = 1.937e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  120.4618 272.6882
## sample estimates:
## mean of x 
##   196.575

The bootstrap normal interval reported above, \(197.99 \pm 1.96 \times 36.016\), is (127.4, 268.6), while the usual t interval is (120.46, 272.69). The bootstrap interval is slightly narrower than the usual t 95% confidence interval, but the two intervals are very similar and would lead to similar conclusions.
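For reference, a minimal sketch of how the type = "norm" interval above is assembled (bias-corrected estimate plus or minus z* times the bootstrap standard error); the names est, bias, and se are just labels for this check.

est  = call_length_boot$t0                  # observed mean, 196.575
bias = mean(call_length_boot$t[, 1]) - est  # bootstrap estimate of the bias
se   = sd(call_length_boot$t[, 1])          # bootstrap standard error
(est - bias) + c(-1, 1) * qnorm(0.975) * se # reproduces boot.ci(type = "norm")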

2) 30 Points

dataset2 = read.csv("dataset2.csv")
head(dataset2)
##       Name ArchBill13 ArchBill12 N_Arch N_Eng N_Staff
## 1      BSA       18.2       19.1     31    31      62
## 2    Ratio       18.1       17.8     37     0      92
## 3      CSO       16.2       16.0     22     0      72
## 4  Schmidt       12.3       13.3     21     8      82
## 5 American        9.1        8.2     14   109     282
## 6 Browning        6.2        5.2     13     0      40

a) Numerical and graphical summaries

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggpairs(select(dataset2, -Name), cardinality_threshold = 30)

The distributions of all the variables appear skewed to the right, with most firms having fewer than 100 staff, fewer than 30 engineers, and fewer than 20 architects. Past and current architectural billings are mostly under 10 million dollars, with a heavy right tail representing a few firms billing over 15 million dollars.

b) Relationship description

There is a noticeable positive linear relationship between the number of architects and both the current- and past-year billings. There may also be a weak positive relationship between the number of staff and the numbers of engineers and architects, although it appears susceptible to a few outlying firms. No clear linear relationships are apparent among the other variables.
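For reference, the pairwise correlations behind these impressions can be printed directly; this reuses the same numeric columns already passed to ggpairs().

round(cor(select(dataset2, -Name)), 2)  # correlation matrix of the numeric variables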

c) Multiple regression

The proposed regression equation is \(ArchBill13 = \beta_0 + \beta_1(ArchBill12) + \beta_2(NArch) + \beta_3(NEng) + \beta_4(NStaff)\)

\(H_0: \beta_j = 0\) versus \(H_1: \beta_j \neq 0\) for each coefficient \(j\)

arch_fit = lm(ArchBill13 ~ ArchBill12 + N_Arch + N_Eng + N_Staff, data = dataset2)
summary(arch_fit)
## 
## Call:
## lm(formula = ArchBill13 ~ ArchBill12 + N_Arch + N_Eng + N_Staff, 
##     data = dataset2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0792 -0.4611 -0.1639  0.1911  2.9448 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2798096  0.2834615   0.987   0.3354    
## ArchBill12  0.7519881  0.1145755   6.563 2.15e-06 ***
## N_Arch      0.1219622  0.0693080   1.760   0.0937 .  
## N_Eng       0.0045089  0.0176494   0.255   0.8010    
## N_Staff     0.0009354  0.0063769   0.147   0.8848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.877 on 20 degrees of freedom
## Multiple R-squared:  0.9775, Adjusted R-squared:  0.973 
## F-statistic: 217.5 on 4 and 20 DF,  p-value: 3.542e-16

Fitted equation:

\(ArchBill13 = 0.2798 + 0.7519(ArchBill12) + 0.1219(NArch) + 0.0045(NEng) + 0.0009(NStaff)\)

Regression standard error: \(0.877\)

d) Residual analysis

arch_resid = residuals(arch_fit)

par(mfrow = c(2, 2))  # 2 x 2 grid: residuals against each explanatory variable

plot(dataset2$ArchBill12, arch_resid); abline(h = 0)
plot(dataset2$N_Arch, arch_resid); abline(h = 0)
plot(dataset2$N_Eng, arch_resid); abline(h = 0)
plot(dataset2$N_Staff, arch_resid); abline(h = 0)

Each plot shows qualitative evidence of heteroscedasticity: the residuals are concentrated at the lower end of each explanatory variable, and their spread does not look constant across the range of X. This is a concern for interpreting the regression, because one of the assumptions for regression inference is that the standard deviation of the response is the same for all values of X.
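R's built-in lm diagnostic gives a complementary single-panel view (residuals against the fitted values rather than against each explanatory variable); this is an optional extra check.

par(mfrow = c(1, 1))       # reset the plotting grid
plot(arch_fit, which = 1)  # residuals vs fitted values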

e) Predicted billing

\(ArchBill13 = 0.2798 + 0.7519(1) + 0.1219(3) + 0.0045(1) + 0.0009(17) = \textbf{1.4172 million dollars}\)
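The same prediction can be obtained from the fitted model with predict(); new_firm is simply a label for the hypothetical firm with the predictor values plugged into the equation above.

new_firm = data.frame(ArchBill12 = 1, N_Arch = 3, N_Eng = 1, N_Staff = 17)
predict(arch_fit, newdata = new_firm)                           # point prediction, in millions of dollars
predict(arch_fit, newdata = new_firm, interval = "prediction")  # with a 95% prediction interval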

f) Sampling concerns

Inference is used to estimate population parameters, such as a population mean, from a sample. This survey (an observational study, not an experiment) may suffer from nonresponse bias from firms that did not reply. It should also be noted that the sampled firms all come from one area, so inferences about the broader population may be biased. This could be addressed by gathering information from firms across all 50 US states, controlling for differences in population size, and analyzing the data as a stratified random sample.

3) 30 Points

a) Plot of the means

country = c("Canada", "US", "France")
female = c(7.70, 7.36, 6.38)
male = c(6.39, 6.43, 5.69)

att_score = data.frame(country, female, male)
# Interaction-style plot: mean attitude score by country, one line per gender
plot(att_score$female, pch = 18, col = "red", type = "b", lty = 2, ylim = c(5, 8),
     xlab = "Canada, US, France", ylab = "Mean attitude score",
     main = "Attitude scores by gender and country")
lines(att_score$male, pch = 19, col = "blue", type = "b", lty = 2)
legend("topleft", legend = c("Female", "Male"),
       col = c("red", "blue"), lty = 2, cex = 0.8)

b) interaction description

The plot above does not show strong qualitative evidence of an interaction between culture and gender. The largest gender difference is observed in Canada and the smallest in France, but for the most part the two lines appear roughly parallel.

c) Pooled estimate of the standard deviation

The pooled estimate of the variance is \(s_p^2 = \frac{\Sigma(n_{ij}-1)s_{ij}^2}{\Sigma(n_{ij}-1)}\):

\(=\frac{(238-1)(1.668)^2 +(125-1)(1.909)^2+(178-1)(1.736)^2 + (101-1)(1.601)^2+(82-1)(2.024)^2+(87-1)(1.875)^2}{(238-1)+(125-1)+(178-1)+(101-1)+(82-1)+(87-1)} = 2535.18/805 = 3.1493\)

\(s_p = \sqrt{3.1493} = \textbf{1.7746}\)

It is reasonable to use a pooled standard deviation because the largest standard deviation is less than twice the smallest: 2.024 < 2(1.601) = 3.202.

d) DFE

The denominator degrees of freedom is the same as the DFE, which is (238-1)+(125-1)+(178-1)+(101-1)+(82-1)+(87-1) = 805.
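A quick R check of the pooled variance, pooled standard deviation, and DFE, using the six group sizes and standard deviations from the calculation above.

n_ij = c(238, 125, 178, 101, 82, 87)                # group sample sizes
s_ij = c(1.668, 1.909, 1.736, 1.601, 2.024, 1.875)  # group standard deviations
sp2 = sum((n_ij - 1) * s_ij^2) / sum(n_ij - 1)      # pooled variance
c(pooled_var = sp2, pooled_sd = sqrt(sp2), DFE = sum(n_ij - 1))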

e) Pairwise comparisons

Since there are 6 treatment groups, there are \({6 \choose 2} = \frac{6!}{2!\,4!} = \frac{6 \cdot 5}{2} = \textbf{15 pairwise comparisons}\)
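The same count in R:

choose(6, 2)  # 15 pairwise comparisons among the 6 gender-by-country groups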

f) Pairwise calculations

\(t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_p\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}}\)

\(t^{**}= 2.94\) and \(s_p = 1.774\)

A comparison is significant when \(|t_{ij}| > t^{**} = 2.94\).

1. Female Canada + Male Canada

\(t = \frac{7.70 - 6.39}{1.774\sqrt{\frac{1}{238}+\frac{1}{125}}} = 1.31/0.1959 = 6.68\)

Significant

2. Female Canada + Female US

\(t = \frac{7.70 - 7.36}{1.774\sqrt{\frac{1}{238}+\frac{1}{178}}} = 0.34/0.1757 = 1.934\)

3. Female Canada + Female France

\(t = \frac{7.70 - 6.38}{1.774\sqrt{\frac{1}{238}+\frac{1}{82}}} = \frac{1.32}{.2271} = 5.81\)

Significant

4. Female Canada + Male US

\(t = \frac{7.70 - 6.43}{1.774\sqrt{\frac{1}{238}+\frac{1}{101}}} = \frac{1.27}{.2106} = 6.028\)

Significant

5. Female Canada + Male France

\(t = \frac{7.70 - 5.69}{1.774\sqrt{\frac{1}{238}+\frac{1}{87}}} = \frac{2.01}{.227} = 9.04\)

Significant

6. Female US + Male US

\(t = \frac{7.36 - 6.43}{1.774\sqrt{\frac{1}{178}+\frac{1}{101}}} = \frac{.93}{.2209} = 4.208\)

Significant

7. Female US + Male Canada

\(t = \frac{7.36 - 6.39}{1.774\sqrt{\frac{1}{178}+\frac{1}{125}}} = \frac{.97}{.2070} = 4.685\)

Significant

8. Female US + Female France

\(t = \frac{7.36 - 6.38}{1.774\sqrt{\frac{1}{178}+\frac{1}{82}}} = \frac{.98}{.2367} = 4.139\)

Significant

9. Female US + Male France

\(t = \frac{7.36 - 5.69}{1.774\sqrt{\frac{1}{178}+\frac{1}{87}}} = \frac{1.67}{.232} = 7.196\)

Significant

10. Female France + Male France

\(t = \frac{6.38 - 5.69}{1.774\sqrt{\frac{1}{82}+\frac{1}{87}}} = \frac{.69}{.273} = 2.52\)

11. Female France + Male Canada

\(t = \frac{6.38 - 6.39}{1.774\sqrt{\frac{1}{82}+\frac{1}{125}}} = \frac{-0.01}{.2521} = -0.0396\)

12. Female France + Male US

\(t = \frac{6.38 - 6.43}{1.774\sqrt{\frac{1}{82}+\frac{1}{101}}} = \frac{-0.05}{.2637} = -0.19\)

13. Male Canada + Male US

\(t = \frac{6.39 - 6.43}{1.774\sqrt{\frac{1}{125}+\frac{1}{101}}} = \frac{-0.04}{.2373} = -0.1685\)

14. Male Canada + Male France

\(t = \frac{6.39 - 5.69}{1.774\sqrt{\frac{1}{125}+\frac{1}{87}}} = \frac{.7}{.2477} = 2.83\)

Just below the threshold of significance

15. Male US + Male France

\(t = \frac{6.43 - 5.69}{1.774\sqrt{\frac{1}{101}+\frac{1}{87}}} = \frac{.74}{.259} = 2.851\)

In summary, 8 of the 15 pairwise comparisons are clearly significant.

The comparison of Canadian and French males comes close to the threshold (\(t \approx 2.83\) versus \(t^{**} = 2.94\)), but taking a conservative view it is not interpreted as significant.
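As a cross-check, a short sketch that recomputes all 15 pairwise t statistics from the group means, sample sizes, \(s_p\), and \(t^{**}\) used above (the data frame groups and its labels are only for this check).

groups = data.frame(
  label = c("F-Canada", "F-US", "F-France", "M-Canada", "M-US", "M-France"),
  mean  = c(7.70, 7.36, 6.38, 6.39, 6.43, 5.69),
  n     = c(238, 178, 82, 125, 101, 87)
)
sp = 1.7746
pair_idx = combn(nrow(groups), 2)  # the 15 pairs of groups
t_stat = apply(pair_idx, 2, function(idx) {
  (groups$mean[idx[1]] - groups$mean[idx[2]]) /
    (sp * sqrt(1 / groups$n[idx[1]] + 1 / groups$n[idx[2]]))
})
data.frame(
  comparison  = apply(pair_idx, 2, function(idx) paste(groups$label[idx], collapse = " vs ")),
  t           = round(t_stat, 2),
  significant = abs(t_stat) > 2.94  # t** = 2.94
)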

4) 20 Points

a) Significance test

\(z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}= \frac{500-485}{100/\sqrt{400}} = 15/5 = 3\)

p = 0.0013 from Table A, which is significant at the 10% level (and indeed at the 1% level).
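The same z statistic and one-sided P-value in R (the one-sided tail is what the Table A value above corresponds to):

z = (500 - 485) / (100 / sqrt(400))  # z = 3
pnorm(z, lower.tail = FALSE)         # one-sided P-value, about 0.0013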

b) Type II error

The probability of committing a Type II error is 1 - Power, which is 1 - 0.9987 = 0.0013. A Type II error occurs when we fail to reject \(H_0\) even though \(H_0\) is false.

c) Possible hypothesis test

x = rnorm(400)                       # 400 draws from a standard Normal
y = rnorm(400, mean = 500, sd = 100) # generate random samples with the specified mean and sd
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -101.23, df = 399.08, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -509.7801 -490.3566
## sample estimates:
##     mean of x     mean of y 
##  -0.005589055 500.062767008

d) Power calculation

\(P(\bar{x} \leq 500) = P\left(Z \leq \frac{500-485}{100/\sqrt{400}}\right) = P(Z \leq 3)\)

From Table A, \(P(Z \leq 3.0) = 0.9987\), \(\therefore\) the power is \(\textbf{0.9987}\), or about 99.9%.
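The same value from R, mirroring the hand calculation above:

pnorm((500 - 485) / (100 / sqrt(400)))  # P(Z <= 3) = 0.9987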