Directions

The objective of this assignment is to introduce you to R and R markdown and to complete some basic data simulation exercises.

Please include all code needed to perform the tasks. This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

To submit this homework, create the document in RStudio, knit it with the knitr package (the Knit button in RStudio), and publish it to your RPubs account. Once it is uploaded, submit the link to that document on Moodle. Please make sure that the link is hyperlinked and that I can see both the visualization and the code required to create it.

Questions

  1. Simulate data for 30 draws from a normal distribution where the means and standard deviations vary among three distributions.
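# set the seed so the draws are reproducible
set.seed(1)
# mean and sd are recycled across the 30 draws: draws 1, 4, 7, ... come from N(0, 0.1),
# draws 2, 5, 8, ... from N(10, 1), and draws 3, 6, 9, ... from N(100, 10)
data = rnorm(30, mean = c(0,10,100), sd = c(.1,1,10))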
data
##  [1]  -0.062645381  10.183643324  91.643713876   0.159528080  10.329507772
##  [6]  91.795316159   0.048742905  10.738324705 105.757813517  -0.030538839
## [11]  11.511781168 103.898432364  -0.062124058   7.785300113 111.249309181
## [16]  -0.004493361   9.983809737 109.438362107   0.082122120  10.593901321
## [21] 109.189773716   0.078213630  10.074564983  80.106483041   0.061982575
## [26]   9.943871260  98.442044933  -0.147075238   9.521849945 104.179415602
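Because `rnorm()` recycles the `mean` and `sd` vectors, the 30 draws above are interleaved across the three distributions rather than grouped. The following is a minimal sketch of one way to label each draw with its source distribution and check the per-distribution means; the names `draws` and `dist_id` are introduced here for illustration only.

```r
set.seed(1)
means <- c(0, 10, 100)
sds <- c(0.1, 1, 10)

# Same call as above: mean and sd are recycled, so draw i uses
# element ((i - 1) %% 3) + 1 of each parameter vector
draws <- rnorm(30, mean = means, sd = sds)

# Label each draw with the index of the distribution it came from
dist_id <- rep(1:3, length.out = 30)

# Each distribution contributes 10 draws
table(dist_id)

# Sample mean and standard deviation within each distribution
tapply(draws, dist_id, mean)
tapply(draws, dist_id, sd)
```

The group means should land near 0, 10, and 100, which is a quick way to confirm that the recycling behaved as intended.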
  1. Simulate 2 continuous variables (normal distribution) (n=20) and plot the relationship between them
set.seed(2)
var1 = rnorm(20, mean = 0, sd = 1)
var1
##  [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176
##  [6]  0.13242028  0.70795473 -0.23969802  1.98447394 -0.13878701
## [11]  0.41765075  0.98175278 -0.39269536 -1.03966898  1.78222896
## [16] -2.31106908  0.87860458  0.03580672  1.01282869  0.43226515
set.seed(3)
var2 = rnorm(20, mean = 0, sd = 1)
var2
##  [1] -0.96193342 -0.29252572  0.25878822 -1.15213189  0.19578283
##  [6]  0.03012394  0.08541773  1.11661021 -1.21885742  1.26736872
## [11] -0.74478160 -1.13121857 -0.71635849  0.25265237  0.15204571
## [16] -0.30765643 -0.95301733 -0.64824281  1.22431362  0.19981161
# plot variables
plot(var1, var2)

# As expected, the plot shows no clear relationship: var1 and var2 were drawn independently, so the value of one carries no information about the other
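The visual impression from the scatterplot can be backed up with a number. This sketch, which goes beyond what the question asks for, recomputes the two variables with the same seeds and then estimates the Pearson correlation with `cor()` and `cor.test()`; the names `r` and `ct` are introduced here for illustration.

```r
set.seed(2)
var1 <- rnorm(20, mean = 0, sd = 1)
set.seed(3)
var2 <- rnorm(20, mean = 0, sd = 1)

# Pearson correlation coefficient between the two samples
r <- cor(var1, var2)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(var1, var2)

r
ct$p.value
```

With only 20 independent draws the sample correlation will rarely be exactly zero, so the p-value is the more useful guide to whether the observed value is distinguishable from chance.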
  1. Simulate 3 variables (x1, x2 and y). x1 and x2 should be drawn from a uniform distribution and y should be drawn from a normal distribution. Fit a multiple linear regression.
# simulate the data; a separate seed before each draw keeps each variable reproducible on its own
set.seed(4)
x1 = runif(50, min = 0, max = 100)  # uniform dist
set.seed(5)
x2 = runif(50, min = 1, max = 10)  # uniform dist
set.seed(6)
y = rnorm(50, mean = 10, sd = 2)  # normal dist
# Fit linear model to data
model = lm(y ~ x1 + x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3390 -1.3065 -0.2151  1.2315  4.4841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.83193    0.83321  11.800 1.18e-15 ***
## x1           0.01138    0.01081   1.053    0.298    
## x2          -0.05831    0.12012  -0.485    0.630    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.118 on 47 degrees of freedom
## Multiple R-squared:  0.02399,    Adjusted R-squared:  -0.01755 
## F-statistic: 0.5776 on 2 and 47 DF,  p-value: 0.5652
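# Neither x1 nor x2 is a significant predictor of y (p = 0.298 and p = 0.630), and the model explains only about 2% of the variance (R-squared = 0.024); this makes sense because y was simulated independently of x1 and x2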
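For contrast, here is a sketch of what the regression looks like when y is built from the predictors. It is not part of the original answer: the coefficients 0.5 and -2 and the names `y_dep` and `fit_dep` are assumptions chosen only for illustration.

```r
set.seed(4)
x1 <- runif(50, min = 0, max = 100)
set.seed(5)
x2 <- runif(50, min = 1, max = 10)

set.seed(6)
# Hypothetical data-generating model: intercept 10, slope 0.5 on x1,
# slope -2 on x2, plus normal noise with sd = 2
y_dep <- 10 + 0.5 * x1 - 2 * x2 + rnorm(50, mean = 0, sd = 2)

# Fit the same two-predictor model as above
fit_dep <- lm(y_dep ~ x1 + x2)
coef(fit_dep)
summary(fit_dep)$r.squared
```

Here the estimated slopes should fall close to the values used in the simulation, and R-squared should be far higher than in the independent case.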
  1. Simulate 3 letters repeating each letter twice, 2 times.
rep(letters[c(19,23,9)], each = 2, times = 2)
##  [1] "s" "s" "w" "w" "i" "i" "s" "s" "w" "w" "i" "i"
# letters[c(19, 23, 9)] selects my initials: s, w, i
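The `each` and `times` arguments of `rep()` do different things, and when both are supplied `each` is applied first. A small sketch with placeholder letters shows the difference; the vector `x` is introduced here for illustration.

```r
x <- c("a", "b", "c")

# each: repeat every element in place
rep(x, each = 2)
# "a" "a" "b" "b" "c" "c"

# times: repeat the whole vector
rep(x, times = 2)
# "a" "b" "c" "a" "b" "c"

# both: apply each first, then repeat the result
both <- rep(x, each = 2, times = 2)
both
length(both)  # 3 letters * 2 (each) * 2 (times) = 12
```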
  1. Create a dataframe with 3 groups, 2 factors and two quantitative response variables. Use the replicate function (n = 25).
set.seed(7)
data.frame(Group = rep(LETTERS[1:2], length.out = 25),
           Response1 = rnorm(25, mean = 0, sd = 1),
           Response2 = rnorm(25, mean = 50, sd = 5))
##    Group    Response1 Response2
## 1      A  2.287247161  50.92096
## 2      B -1.196771682  53.76140
## 3      A -0.694292510  52.95873
## 4      B -0.412292951  45.08474
## 5      A -0.970673341  48.61968
## 6      B -0.947279945  45.64574
## 7      A  0.748139340  53.59355
## 8      B -0.116955226  50.55326
## 9      A  0.152657626  49.60767
## 10     B  2.189978107  47.89755
## 11     A  0.356986230  47.18937
## 12     B  2.716751783  54.98757
## 13     A  2.281451926  44.47435
## 14     B  0.324020540  49.28856
## 15     A  1.896067067  51.57497
## 16     B  0.467680511  56.09275
## 17     A -0.893800723  46.50341
## 18     B -0.307328300  48.57284
## 19     A -0.004822422  43.44224
## 20     B  0.988164149  48.04494
## 21     A  0.839750360  47.99237
## 22     B  0.705341831  56.75259
## 23     A  1.305964721  52.95595
## 24     B -1.387996217  50.50263
## 25     A  1.272916864  54.65536
# Because n = 25 is odd, the two groups are uneven: 13 A's and 12 B's; each row carries two response values
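The data frame above uses a single grouping variable with two levels and does not call `replicate()`. The sketch below shows one possible reading of the question, with a three-level `Group` factor, a second two-level factor, and `replicate()` generating the two response columns. The factor name `Treatment`, its level labels, and the object names `responses` and `sim_df` are assumptions made for illustration, and other readings of "3 groups, 2 factors" are equally plausible.

```r
set.seed(7)
n <- 25

# replicate() evaluates the expression twice and binds the results
# into an n x 2 matrix, one column per response variable
responses <- replicate(2, rnorm(n, mean = 0, sd = 1))

sim_df <- data.frame(
  # Factor 1: three groups, recycled to length n
  Group = factor(rep(LETTERS[1:3], length.out = n)),
  # Factor 2 (hypothetical): two treatment levels, recycled to length n
  Treatment = factor(rep(c("control", "treated"), length.out = n)),
  Response1 = responses[, 1],
  # Rescale the second column to mean 50 and sd 5
  Response2 = 50 + 5 * responses[, 2]
)

str(sim_df)
table(sim_df$Group, sim_df$Treatment)
```

Because 25 is not divisible by 3 or by 2, neither factor is balanced; the cross-tabulation on the last line shows how the rows fall across the six group-by-treatment cells.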