Note: This is an RMarkdown document. Did you know you can open this document in RStudio, edit it by adding your answers and code, and then knit it to a pdf? Then you can submit both the .rmd file (the edited file) and the pdf file as a zip file on Forum. This method is actually preferred. To learn more about RMarkdown, watch the videos from session 1 and session 2 of the CS112B optional class. There is also a cheat sheet for using RMarkdown. If you have questions about RMarkdown, please post them on Perusall.

Note: If you are not comfortable with RMarkdown, you can use any text editor (Google Docs or Word), type your answers, and then save everything as a pdf and submit both the pdf file AND the link to your code (on GitHub or Jupyter) on Forum.

Note: Try knitting this document in your RStudio. You should be able to get a pdf file. At any step, you can try knitting the document and recreating the pdf. If you get an error, you might have incomplete code.

Note: If you are submitting your assignment as an RMarkdown file, make sure you add your name to the top of this document.

QUESTION 1

STEP 1

Create a set of 1000 outcome observations using a data-generating process (DGP) that incorporates a systematic component and a stochastic component (of your choice)

#systematic component: accidents depend linearly on the number of bicycles
bicycles <- sample.int(12000, 1000, replace = TRUE)
#stochastic component: uniform integer noise
noise <- sample(-1500:1500, 1000, replace = TRUE)
accidents <- 0.4*bicycles + noise + 2000
df <- data.frame(accidents, bicycles)
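
A note on reproducibility (an optional addition, not part of the original chunk): because both components are drawn at random, fixing the RNG seed before the sampling calls above would make the RMSE values reported below reproducible across knits.

set.seed(112)  #any fixed seed works; 112 is an arbitrary choice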

STEP 2

Tell a 2-3 sentence story about the data generating process you coded up above. What is it about, and what does each component mean?

The data represents a relationship between the number of bicycles in a city and the number of car accidents in that city. The predictor is the number of bicycles, drawn as random integers between 1 and 12,000, where the maximum of 12,000 was chosen arbitrarily. Random integers were used instead of floats because there cannot be half a bicycle. Together with the intercept and slope (accidents = 2000 + 0.4 × bicycles), this constitutes the systematic component.

The outcome variable is the number of car accidents (accidents).

The noise was added as the stochastic component, drawn by the same integer-sampling method as bicycles (uniform integers between -1500 and 1500), since it should follow the same logic.

STEP 3

Using an incorrect model of the systematic component (of your choice), and using the simulation-based approach covered in class (the arm library, etc.), generate 1 prediction for every observation and calculate and report RMSE. Make sure you write out your model and report your RMSE.

Each prediction should be conditional on the predictor values of the corresponding observation. E.g., when predicting for observation #1, use the predictor values of observation #1.

#packages used below: arm for sim(); rmse() is assumed to come from the Metrics package
library(arm)
library(Metrics)

#splitting the data
test.wrong <- sample(1:nrow(df), 500)
test.step3 <- df[test.wrong,]
train.step3 <- df[-test.wrong,]

#creating the wrong regression model
lm.wrong <- lm(accidents ~ sqrt(bicycles), data = train.step3)
lm.wrong.sim <- sim(lm.wrong, 1)


#getting the error using the simulation-based approach on the test data
storage.vector3 <- rep(NA, 500)

#the model regresses accidents on sqrt(bicycles), so the same transformation
#must be applied to the predictor when predicting
get_accidents_wrong <- function(bicycle, coefs) {
  accidents_wrong <- coefs[1] + sqrt(bicycle)*coefs[2]
  return(accidents_wrong)
}

for (i in 1:500) {
  predicted_acc <- get_accidents_wrong(test.step3$bicycles[i], lm.wrong.sim@coef)
  storage.vector3[i] <- predicted_acc
}

Error3 <- rmse(test.step3$accidents, storage.vector3)
Error3
## [1] 364615.9

STEP 4

Using the correct model (correct systematic and stochastic components), and using the simulation-based approach covered in class (the arm library, etc.), generate 1 prediction for every observation and calculate & report your RMSE. Once again, write out your model and report your RMSE.

Each prediction should be conditional on the predictor values of the corresponding observation. E.g., when predicting for observation #1, use the predictor values of observation #1.

#splitting the data
test.correct <- sample(1:nrow(df),500)
test.step4 <- df[test.correct,]
train.step4 <- df[-test.correct,]

#defining the correct model, using the training data
lm.correct <- lm(accidents ~ bicycles, data = train.step4)


#using the same approach to find the error
lm.correct.sim <- sim(lm.correct,n.sims=1)

storage.vector4 <- rep(NA, 500)

get_accidents_correct <- function(bicycle, coefs) {
  accidents_correct <- coefs[1] + 
    bicycle*coefs[2] 
    return(accidents_correct)
}

for (i in 1:500) {
  predicted_acc_correct <- get_accidents_correct(test.step4$bicycles[i], lm.correct.sim@coef)
  storage.vector4[i] <- predicted_acc_correct
}

Error4 <- rmse(test.step4$accidents, storage.vector4)
Error4
## [1] 850.3478

STEP 5

Which RMSE is larger: The one from the correct model or the one from the incorrect model? Why?

The RMSE of the incorrect model is larger than that of the correct model (roughly 364,616 vs. 850 in the output above; the exact values vary from run to run because the data and the splits are random). The correct model captures the true linear relationship between the variables (plus some noise) and therefore predicts more accurately, although its error still depends on the amount of noise. Because the incorrect model misspecifies the relationship, its predictions deviate systematically from the observed outcomes, which inflates the RMSE.

QUESTION 2

Imagine that you want to create a data viz that illustrates the sensitivity of regression to outlier data points. So, you want to create two figures:

One figure that shows a regression line fit to a 2-dimensional (x and y) scatterplot, such that the regression line clearly has a positive slope.

plot(df$bicycles, df$accidents,  
     xlab="Number of Bicycles", ylab="Number of Accidents",
     main="Regression of the data WITHOUT an Outlier")
abline(lm.correct, col="red", lwd=4)

And, another figure that shows a regression line with a negative slope fit to a scatter plot of the same data plus one additional outlier data point. This one data point is what changes the sign of the regression line’s slope from positive to negative.

# YOUR CODE HERE

#create new data with one extreme outlier appended
x1 <- c(bicycles, 10000000)
y1 <- c(accidents, 7)
lm_outlier <- lm(y1 ~ x1)

plot(x1, y1, 
     xlab="Number of Bicycles", ylab="Number of Accidents",
     main="Regression of the data WITH an Outlier")
abline(lm_outlier, col="red", lwd=2)

Be sure to label the axes and the title the figures appropriately. Include a brief paragraph that explains the number of observations and how you created the data set and the outlier.

For the outlier visualization, the number of observations increases by one: I copied the original vectors and appended a single extreme observation with 10,000,000 bicycles and 7 accidents. A single high-leverage point like this is enough to flip the regression line's slope from positive to negative, which makes the fitted line a misleading representation of the data and shows why extrapolating from a model skewed by outliers is dangerous.
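
A quick optional check (not part of the assignment prompt) that quantifies how influential the appended point is, using base R's influence measures on the lm_outlier model defined above:

#Cook's distance measures each observation's influence on the fitted coefficients;
#the appended outlier should dwarf every original observation
cd <- cooks.distance(lm_outlier)
tail(sort(cd), 3)  #the largest value corresponds to the added outlier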

QUESTION 3

STEP 1

Using the lalonde data set, run a linear regression that models re78 as a function of age, education, re74, re75, hisp, and black. Note that the lalonde data set comes with the package Matching.

library(Matching)  #provides the lalonde data set
data(lalonde)
la_lm <- lm(re78 ~ age + educ + re74 + re75 + hisp + black, data = lalonde)
predicted.q3 <- predict(la_lm, lalonde)

STEP 2

Report coefficients and R-squared.

la_lm_summary <-  summary(la_lm)
la_lm_summary
## 
## Call:
## lm(formula = re78 ~ age + educ + re74 + re75 + hisp + black, 
##     data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9065  -4577  -1775   3186  55037 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.126e+03  2.434e+03   0.463   0.6437  
## age          5.928e+01  4.409e+01   1.345   0.1794  
## educ         4.293e+02  1.758e+02   2.442   0.0150 *
## re74         7.462e-02  7.699e-02   0.969   0.3330  
## re75         6.676e-02  1.314e-01   0.508   0.6116  
## hisp        -2.125e+02  1.546e+03  -0.137   0.8907  
## black       -2.323e+03  1.162e+03  -1.999   0.0462 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6544 on 438 degrees of freedom
## Multiple R-squared:  0.03942,    Adjusted R-squared:  0.02627 
## F-statistic: 2.996 on 6 and 438 DF,  p-value: 0.007035
la_lm_summary$coefficients
##                  Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept)  1.126492e+03 2.433509e+03  0.4629082 0.64366006
## age          5.928155e+01 4.408699e+01  1.3446497 0.17943430
## educ         4.293074e+02 1.757953e+02  2.4420874 0.01499714
## re74         7.461872e-02 7.698791e-02  0.9692264 0.33296712
## re75         6.676314e-02 1.313676e-01  0.5082161 0.61155768
## hisp        -2.125053e+02 1.545746e+03 -0.1374775 0.89071657
## black       -2.323282e+03 1.162252e+03 -1.9989492 0.04623130
la_lm_summary$r.squared
## [1] 0.03942359

Then calculate R-squared by hand and confirm / report that you get the same or nearly the same answer as the summary (lm) command.

By hand: R² = 1 − RSS/TSS, where RSS = Σ(yᵢ − ŷᵢ)² is the residual sum of squares and TSS = Σ(yᵢ − ȳ)² is the total sum of squares. The function below implements exactly this.

#a function to calculate r squared  using RSS and TSS
R_squared <- function(real, predict) {
  RSS = sum((real - predict)^2)
  TSS = sum((real - mean(real))^2)
  RSQ = 1 - (RSS/TSS)
  return(RSQ)
}
R_squared(lalonde$re78, predicted.q3)
## [1] 0.03942359

STEP 3

Then, setting all the predictors at their means EXCEPT education, create a data visualization that shows the 95% confidence interval of the expected values of re78 as education varies from 3 to 16. Be sure to include axes labels and figure titles.

# YOUR CODE HERE
#do a for loop to change education value
data("lalonde")

#checking the range of education is 3-16
min(lalonde$educ)
## [1] 3
max(lalonde$educ)
## [1] 16
max(lalonde$educ)-min(lalonde$educ)
## [1] 13
storage.matrix <- matrix(NA, nrow = 1000, ncol = 14) #for predicted values
storage.matrix2 <- matrix(NA, nrow = 1000, ncol = 14) # for expected values


#the function computes the model's predicted outcome for a given vector of predictor values (person) and one draw of simulated coefficients
get_RE_78 <- function(coefs, person) {
  RE_78 <- coefs[1] + 
    person[1]*coefs[2] +
    person[2]*coefs[3] +
    person[3]*coefs[4] + 
    person[4]*coefs[5] +
    person[5]*coefs[6] +
    person[6]*coefs[7]
  
  return(RE_78)
}

#the nested loops rerun the whole simulation 1000 times: each outer iteration draws 1000 new coefficient simulations,
#computes the simulated values for every education level, and then averages each column, so each row of storage.matrix2 holds one set of expected values
for (j in 1:1000){
  sim_lalonde_ex <- sim(la_lm, n.sims = 1000)
  for (educ in 3:16){
  Xs <- c(mean(lalonde$age) , educ , mean(lalonde$re74) , mean(lalonde$re75) , mean(lalonde$hisp) , mean(lalonde$black))
  for (i in 1:1000) {
    storage.matrix[i, educ - 2 ] <- get_RE_78(sim_lalonde_ex@coef[i, ], Xs)
    }
  }
  storage.matrix2[j,] <- colMeans(storage.matrix)
}

# obtain confidence intervals for each education level by applying quantile() to each column of the matrix of expected values.
conf.intervals <- apply(storage.matrix2, 2, quantile, probs = c(0.025, 0.975))

#plotting the data:

{plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(min(storage.matrix2),max(storage.matrix2)), 
     main = "Expected earnings in 1978 (95% confidence intervals)", xlab = "Years of education", 
     ylab = "Expected earnings (USD)")

for (educ in 3:16) {
   segments(
   x0 = educ,
   y0 = conf.intervals[1, educ - 2],
   x1 = educ,
   y1 = conf.intervals[2, educ - 2],
   lwd = 10)
}
}


STEP 4

Then, do the same thing, but this time for the predicted values of re78. Be sure to include axes labels and figure titles.
# YOUR CODE HERE

#this code works like step 3 but skips the batch averaging, and adds a residual (sigma) draw so the stored values are predicted values rather than expected values
storage.matrix <- matrix(NA, nrow = 1000, ncol = 14) #predicted values

get_RE_78 <- function(coefs, person) {
  RE_78 <- coefs[1] + 
    person[1]*coefs[2] +
    person[2]*coefs[3] +
    person[3]*coefs[4] + 
    person[4]*coefs[5] +
    person[5]*coefs[6] +
    person[6]*coefs[7]
  
  return(RE_78)
}
sim_lalonde <- sim(la_lm, n.sims = 1000)

  

for (educ in 3:16){
  Xs <- c(mean(lalonde$age), educ, mean(lalonde$re74), mean(lalonde$re75), mean(lalonde$hisp), mean(lalonde$black))
  for (i in 1:1000) {
    #add a draw from the residual distribution so the predictions carry
    #fundamental uncertainty on top of estimation uncertainty
    storage.matrix[i, educ - 2] <- get_RE_78(sim_lalonde@coef[i, ], Xs) +
      rnorm(1, 0, sim_lalonde@sigma[i])
  }
}



# obtain confidence intervals for each education level by applying quantile() to each column of the matrix of predicted values.
conf.intervals <- apply(storage.matrix, 2, quantile, probs = c(0.025, 0.975))
#plotting the data

{plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(min(storage.matrix),max(storage.matrix)), 
     main = "Predicted earnings in 1978 (95% intervals)", xlab = "Years of education", 
     ylab = "Predicted earnings (USD)")

for (educ in 3:16) {
   segments(
   x0 = educ,
   y0 = conf.intervals[1, educ - 2],
   x1 = educ,
   y1 = conf.intervals[2, educ - 2],
   lwd = 10)
}
}

STEP 5

Lastly, write a short paragraph with your reflections on this exercise (specifically, the length of intervals for given expected vs. predicted values) and the results you obtained.

Predicted values have a larger variance than expected values because they include both estimation uncertainty and fundamental uncertainty, whereas expected values reflect estimation uncertainty only. The expected-values procedure takes the mean of every batch of 1000 simulated values per column (years of education) and computes the confidence interval of those 1000 means; the averaging cancels out the fundamental uncertainty. This is visible in the length of each interval per years of education: in both figures the uncertainty is smallest near 10 years of education and grows as education moves toward either extreme, but the bars for the predicted values are much longer. The averages should be nearly the same in both cases, as shown in the code below:

mean(storage.matrix[,4])
## [1] 3524.553
mean(storage.matrix2[,4])
## [1] 3500.262
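
To make the interval-length comparison itself concrete, the width of each 95% interval can be computed directly from the two storage matrices above (a small optional check, not required by the prompt):

#95% interval widths per education level: predicted vs. expected
width_pred <- apply(storage.matrix, 2, function(x) diff(quantile(x, c(0.025, 0.975))))
width_exp <- apply(storage.matrix2, 2, function(x) diff(quantile(x, c(0.025, 0.975))))
rbind(predicted = width_pred, expected = width_exp)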

QUESTION 4

STEP 1

Using the lalonde data set, run a logistic regression, modeling treatment status as a function of age, education, hisp, re74 and re75. Report and interpret the regression coefficient and 95% confidence intervals for age and education.

# YOUR CODE HERE

#defining the model
names(lalonde)
##  [1] "age"     "educ"    "black"   "hisp"    "married" "nodegr"  "re74"   
##  [8] "re75"    "re78"    "u74"     "u75"     "treat"
glm4 <- glm(treat ~ age + educ + hisp + re74 + re75 , data = lalonde , family = binomial )

summary(glm4)
## 
## Call:
## glm(formula = treat ~ age + educ + hisp + re74 + re75, family = binomial, 
##     data = lalonde)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2970  -1.0519  -0.9521   1.2830   1.6924  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.320e+00  6.750e-01  -1.956   0.0505 .
## age          1.165e-02  1.370e-02   0.850   0.3953  
## educ         6.922e-02  5.545e-02   1.248   0.2119  
## hisp        -6.024e-01  3.777e-01  -1.595   0.1108  
## re74        -2.278e-05  2.511e-05  -0.907   0.3643  
## re75         5.221e-05  4.180e-05   1.249   0.2116  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 604.20  on 444  degrees of freedom
## Residual deviance: 596.82  on 439  degrees of freedom
## AIC: 608.82
## 
## Number of Fisher Scoring iterations: 4
glm4$coefficients
##   (Intercept)           age          educ          hisp          re74 
## -1.320183e+00  1.164838e-02  6.922206e-02 -6.024002e-01 -2.277944e-05 
##          re75 
##  5.221126e-05

CI_4 <- confint(glm4, level = 0.95)

CI_4[2,]
##       2.5 %      97.5 % 
## -0.01538200  0.03852192
CI_4[3,]
##       2.5 %      97.5 % 
## -0.03853866  0.17958880

Report and interpret regression coefficient and 95% confidence intervals for age and education here.

Among the predictors, hisp has the coefficient with the largest magnitude (-0.6024), suggesting a negative association with the log-odds of receiving treatment; note, however, that none of the coefficients reach conventional significance. The estimated coefficient for age is 0.01165 and for education 0.06922: each additional year of age (education) is associated with an increase of 0.01165 (0.06922) in the log-odds of treatment, holding the other predictors constant. The 95% confidence intervals are [-0.0154, 0.0385] for age and [-0.0385, 0.1796] for education. Under repeated sampling, 95% of intervals constructed this way would contain the true coefficient; since both intervals include zero, neither effect is statistically distinguishable from zero at the 5% level.
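
Because logistic coefficients are on the log-odds scale, an optional way to make them more interpretable (using only the glm4 and CI_4 objects defined above) is to exponentiate them into odds ratios:

#odds ratios per one-unit increase in age and education, with CIs on the same scale
exp(glm4$coefficients[c("age", "educ")])
exp(CI_4[c("age", "educ"), ])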

STEP 2

Use a simple bootstrap to estimate (and report) bootstrapped confidence intervals for age and education given the logistic regression above. Code the bootstrap algorithm yourself.

# YOUR CODE HERE
# Bootstrap (1000 replications)

storage.matrix4 <- matrix(NA, nrow = 1000, ncol = 2)


# the for loop implements the bootstrap by hand instead of calling boot():
# each iteration resamples the rows of lalonde with replacement (a bootstrap
# sample has the same number of rows as the original data), refits the model
# on the resampled data, and stores the age and education coefficients.
# after 1000 iterations, the storage matrix holds 1000 coefficient pairs,
# one per bootstrap sample
for (i in 1:1000) {
    
    local_df <- lalonde[sample(nrow(lalonde), nrow(lalonde), replace = TRUE), ]
    
    glm5 <- glm(treat ~ age + educ + hisp + re74 + re75, data = local_df, family = binomial)
    
    age_coeff <- glm5$coefficients[2]
    educ_coeff <- glm5$coefficients[3]
    
    storage.matrix4[i,] <- c(age_coeff, educ_coeff)
}

# calculating the confidence intervals of age and education
conf_intervals4 <- apply(storage.matrix4, 2, quantile, probs = c(0.025, 0.975))


colnames(conf_intervals4) <- c("Age", "Education")
conf_intervals4
##                Age    Education
## 2.5%  -0.007352155 -0.007566336
## 97.5%  0.029190185  0.151382897
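
As an optional sanity check (assuming the boot package is available), the same percentile intervals can be obtained with boot() and boot.ci():

library(boot)

#statistic: refit the model on the resampled rows and return the age and educ coefficients
boot_fn <- function(data, idx) {
  fit <- glm(treat ~ age + educ + hisp + re74 + re75,
             data = data[idx, ], family = binomial)
  coef(fit)[c("age", "educ")]
}

boot_out <- boot(lalonde, boot_fn, R = 1000)
boot.ci(boot_out, type = "perc", index = 1)  #age
boot.ci(boot_out, type = "perc", index = 2)  #educ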

Report bootstrapped confidence intervals for age and education here.

          Age           Education
2.5%     -0.007055774   -0.008190116
97.5%     0.029461352    0.149596011

STEP 3

Then, using the simulation-based approach and the arm library, set all the predictors at their means EXCEPT education, create a data visualization that shows the 95% confidence interval of the expected values of the probability of receiving treatment as education varies from 3 to 16. Be sure to include axes labels and figure titles.

# YOUR CODE HERE

#the code works the same as in the previous steps; however, because the outcome
#of the logistic regression is on the log-odds scale, the function applies the
#inverse logit, exp(x)/(1 + exp(x)), to convert the linear predictor into a
#probability of treatment

storage.matrix_step3 <- matrix(NA, nrow = 1000, ncol = 14) #simulated values, one row per coefficient draw
storage.matrix_step3_2 <- matrix(NA, nrow = 1000, ncol = 14) #expected values (batch means)

get_treat_ex <- function(coefs, person) {
  Treat_ex <- coefs[1] + 
    person[1]*coefs[2] +
    person[2]*coefs[3] +
    person[3]*coefs[4] + 
    person[4]*coefs[5] +
    person[5]*coefs[6]
  
  return(exp(Treat_ex) / (1 + exp(Treat_ex)))
}

for (j in 1:1000){
  sim_lalonde_Q4 <- sim(glm4, n.sims = 1000)
  for (educ in 3:16){
  Xs_q4 <- c(mean(lalonde$age) , educ , mean(lalonde$hisp), mean(lalonde$re74) , mean(lalonde$re75))
  for (i in 1:1000) {
    storage.matrix_step3[i, educ - 2 ] <- get_treat_ex(sim_lalonde_Q4@coef[i, ], Xs_q4)
    }
  }
  storage.matrix_step3_2[j,] <- colMeans(storage.matrix_step3)
}

# obtain confidence intervals for each education level by applying quantile() to each column of expected probabilities
conf.intervals_Q4_ex <- apply(storage.matrix_step3_2, 2, quantile, probs = c(0.025, 0.975))
# plotting the data

{plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(min(storage.matrix_step3_2),max(storage.matrix_step3_2)), 
     main = "95% CI of the expected probability of receiving treatment\nas education varies", xlab = "Years of education", 
     ylab = "Probability of receiving treatment")

for (educ in 3:16) {
   segments(
   x0 = educ,
   y0 = conf.intervals_Q4_ex[1, educ - 2],
   x1 = educ,
   y1 = conf.intervals_Q4_ex[2, educ - 2],
   lwd = 10)
}
}

STEP 4

Then, do the same thing, but this time for the predicted values of the probability of receiving treatment as education varies from 3 to 16. Be sure to include axes labels and figure titles.

# YOUR CODE HERE
storage.matrix_Q4_pr <- matrix(NA, nrow = 1000, ncol = 14) #predicted values

get_treat_pr <- function(coefs, person) {
  Treat_pr <- coefs[1] + 
    person[1]*coefs[2] +
    person[2]*coefs[3] +
    person[3]*coefs[4] + 
    person[4]*coefs[5] +
    person[5]*coefs[6]
  
  return(exp(Treat_pr) / (1 + exp(Treat_pr)))
}
sim_lalonde_Q4_pr <- sim(glm4, n.sims = 1000)


for (educ in 3:16){
  Xs_q4 <- c(mean(lalonde$age) , educ , mean(lalonde$hisp), mean(lalonde$re74) , mean(lalonde$re75))
  for (i in 1:1000) {
    storage.matrix_Q4_pr[i, educ - 2 ] <- get_treat_pr(sim_lalonde_Q4_pr@coef[i, ], Xs_q4)
  }
}

# obtain confidence intervals for each education level by applying quantile() to each column of predicted probabilities
conf.intervals_Q4_pr <- apply(storage.matrix_Q4_pr, 2, quantile, probs = c(0.025, 0.975))
# begin with an empty plot (type = "n") that has the correct axes and scaling

{plot(x = c(1:100), y = c(1:100), type = "n", xlim = c(3,16), ylim = c(min(storage.matrix_Q4_pr),max(storage.matrix_Q4_pr)), 
     main = "Predicted probability of receiving treatment\nas education varies", xlab = "Years of education", 
     ylab = "Probability of receiving treatment")

for (educ in 3:16) {
   segments(
   x0 = educ,
   y0 = conf.intervals_Q4_pr[1, educ - 2],
   x1 = educ,
   y1 = conf.intervals_Q4_pr[2, educ - 2],
   lwd = 10)
}
}


STEP 5

Lastly, write a short paragraph with your reflections on this exercise and the results you obtained.

The confidence intervals from Step 1 are the widest. The hand-coded bootstrap in Step 2 produced somewhat narrower intervals. The contrast between sources of uncertainty appears again in the expected- vs. predicted-value visualizations, where the batch averaging cancels out the fundamental uncertainty and leaves narrower expected-value intervals. Resampling the data many times sharpens the estimates even though the bootstrap reuses the same sample. The coefficients describe the direction and strength of each predictor's relationship with the outcome.

QUESTION 5

Write the executive summary for a decision brief about the impact of a stress therapy program, targeted at individuals age 18-42, intended to reduce average monthly stress. The program was tested via RCT, and the results are summarized by the figure that you get if you run this code chunk:


# Note that if you knit this document, this part of the code won't 
# show up in the final pdf which is OK. We don't need to see the code
# we wrote.

# How effective is a therapy method against stress

# Participants in the study record their stress level for a month.
# Every day, participants assign a value from 1 to 10 for their stress level. 
# At the end of the month, we average the results for each participant.

#adds the confidence interval (first row of the matrix is lower 
# bound, second row is the upper bound)
trt1 = matrix(NA,nrow=2,ncol=7)
ctrl = matrix(NA,nrow=2,ncol=7) 

trt1[,1]=c(3.7, 6.5) #18  
ctrl[,1]=c(5, 8)

trt1[,2]=c(5, 8.5) #22
ctrl[,2]=c(7.5, 9)

trt1[,3]=c(6, 9) #26
ctrl[,3]=c(8.5, 10)

trt1[,4]=c(5, 7) #30
ctrl[,4]=c(6, 8)

trt1[,5]=c(3.5, 5) #34
ctrl[,5]=c(4.5, 7)

trt1[,6]=c(2, 3.5) #38
ctrl[,6]=c(3.5, 6)

trt1[,7]=c(0.5, 2) #42
ctrl[,7]=c(2.5, 5)
trt1
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  3.7  5.0    6    5  3.5  2.0  0.5
## [2,]  6.5  8.5    9    7  5.0  3.5  2.0
ctrl
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    5  7.5  8.5    6  4.5  3.5  2.5
## [2,]    8  9.0 10.0    8  7.0  6.0  5.0
# colors for each group
c1 = rgb(red = 0.3, green = 0, blue = 1, alpha = 0.7) #treatment
c2 = rgb(red = 1, green = 0.6, blue = 0, alpha = 1) #control
c3 = rgb(red = 0, green = 0.5, blue = 0, alpha = 0.7) #unused

# creates the background of the graph
plot(x = c(1:100), y = c(1:100), 
     type = "n", 
     xlim = c(17,43), 
     ylim = c(0,11), 
     cex.lab=1,
     main = "Stress Level - 95% Prediction Intervals", 
     xlab = "Age", 
     ylab = "Average Stress Level per Month", 
     xaxt = "n")

axis(1, at=seq(18,42,by=4), seq(18, 42, by=4))

grid(nx = NA, ny = NULL, col = "lightgray", lty = "dotted",
     lwd=par("lwd"), equilogs = TRUE)

# adds the legend
legend('topright',legend=c('Treatment','Control'),fill=c(c1,c2))

# iterates to add stuff to plot
for (age in seq(from=18,to=42,by=4)) { 
  #treatment
  segments(x0=age-0.2, y0=trt1[1, (age-18)/4+1],
           x1=age-0.2, y1=trt1[2, (age-18)/4+1], lwd=4, col=c1)
  
  #control
  segments(x0=age+0.2, y0=ctrl[1, (age-18)/4+1],
           x1=age+0.2, y1=ctrl[2, (age-18)/4+1], lwd=4, col=c2)
}

(Not that it matters, really, but you can imagine that these results were obtained via simulation, just like the results you have hopefully obtained for question 2 above).

Your executive summary should be between about 4 and 10 sentences long; it should briefly describe the purpose of the study, the methodology, and the policy implications/prescription. (Feel free to imaginatively but realistically embellish/fill in the blanks with respect to any of the above, since I am not giving you backstory here.)

Write your executive summary here.

This RCT tested the effectiveness of a stress therapy program (the treatment) for individuals aged 18-42. Participants recorded their stress level daily on a 1-10 scale; responses were averaged per participant over a month and summarized by age group. The results suggest the therapy was more effective at older ages: both stress levels and the associated uncertainty decline as age increases. At every age, the treatment group's interval sits below the control group's, indicating overall lower stress under treatment. However, the uncertainty around the treatment effect is very high between ages 18 and 26, so no reliable conclusion can be drawn for that range. At ages 38 and 42, by contrast, we can be confident that the therapy program reduced stress.

Open questions: What is the sample size? With a larger sample, the uncertainty might decrease. On what characteristics were the participants selected? Their backgrounds could confound the estimated effect of the therapy on stress.

QUESTION 6

Can we predict what projects end up being successful on Kickstarter?

We have data from the Kickstarter company.

From Wikipedia: Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity and merchandising. The company’s stated mission is to “help bring creative projects to life”. As of May 2019, Kickstarter has received more than $4 billion in pledges from 16.3 million backers to fund 445,000 projects, such as films, music, stage shows, comics, journalism, video games, technology, publishing, and food-related projects.

The data was collected by Mickaël Mouillé and was last updated in 2018. Columns are self-explanatory. Note that usd_pledged is the pledged column in US dollars (conversion done by Kickstarter) and usd_pledged_real is the pledged column in real US dollars. Finally, usd_goal_real is the goal column in real US dollars. You should use the real columns.

So what makes a project successful? Undoubtedly, there are many factors, but perhaps we could set up a prediction problem here, similar to the one from the bonus part of the last assignment where we used GDP to predict personnel contributions.

We have columns representing the number of backers, project length, the main category, and the real project goal in USD for each project.

Let’s explore the relationship between those predictors and the dependent variable of interest — the success of a project.

Instead of running a simple linear regression and calling it a day, let’s use cross-validation to make our prediction a little more sophisticated.

Our general plan is the following:

  1. Build the model on a training data set
  2. Apply the model on a new test data set to make predictions based on the inferred model parameters.
  3. Compute and track the prediction errors to check performance using the mean squared difference between the observed and the predicted outcome values in the test set.

Let’s get to it, step, by step. Make sure you have loaded the necessary packages for this project.

STEP 1: Import & Clean the Data

Import the dataset from this link: https://tinyurl.com/KaggleDataCS112

Remove any rows that include missing values.

# YOUR CODE HERE
df6 <- read.csv("https://tinyurl.com/KaggleDataCS112")
#cleaning the data
df6 <- na.omit(df6)

STEP 2: Codify outcome variable

Create a new variable that is either successful or NOT successful and call it success and save it in your dataframe. It should take values of 1 (successful) or 0 (unsuccessful).

# YOUR CODE HERE
#add a success column: 1 for projects whose state is "successful", 0 for projects that failed or were canceled
df6 <- cbind(success = ifelse(df6$state == "successful", 1, 0), df6)
names(df6)
##  [1] "success"          "ID"               "name"             "category"        
##  [5] "main_category"    "currency"         "deadline"         "goal"            
##  [9] "launched"         "pledged"          "state"            "backers"         
## [13] "country"          "usd.pledged"      "usd_pledged_real" "usd_goal_real"

STEP 3: Getting the project length variable

Projects on Kickstarter can last anywhere from 1 - 60 days. Kickstarter claims that projects lasting any longer are rarely successful and campaigns with shorter durations have higher success rates, and create a helpful sense of urgency around your project. Using the package lubridate or any other package in R you come across by Googling, create a new column that shows the length of the project by taking the difference between the variable deadline and the variable launched. Call the new column length and save it in your dataframe. Remove any project length that is higher than 60.

# YOUR CODE HERE

library(lubridate)

#using lubridate: first create a date-only column, dropping the time of day stored in launched
df6$new_launched <- as.Date(df6$launched)

#then compute the project length as the number of days from launch to deadline
df6$length <- interval(start = ymd(df6$new_launched), end = ymd(df6$deadline))
df6$length <- time_length(df6$length, "day")
#remove any project that lasted more than 60 days
df6 <- df6[!(df6$length > 60), ]

STEP 4: Splitting the data into a training and a testing set

While there are several variations of the k-fold cross-validation method, let’s stick with the simplest one where we just split randomly the dataset into two (k = 2) and split our available data from the link above into a training and a testing (aka validation) set.

Randomly select 80% of the data to be put in a training set and leave the rest for a test set.

# YOUR CODE HERE

#define the main category as a factor so the regression treats it as dummy variables
#then split the data, leaving 80% of it for the training set
df6$factor_main_category <- as.factor(df6$main_category)
df6_split <- sample(1:nrow(df6), 0.8*nrow(df6))
df6_train <- df6[df6_split,]
df6_test <- df6[-df6_split,]

STEP 5: Fitting a model

Use a logistic regression to find what factors determine the chances a project is successful. Use the variable indicating whether a project is successful or not as the dependent variables (Y) and number of backers, project length, main category of the project, and the real project goal as independent variables. Make sure to use the main category as factor.

# YOUR CODE HERE

#setting the logistic regression model using the training data
glm.df6 <- glm(success ~ backers + length + factor_main_category + usd_goal_real, data=df6_train, family = "binomial")
glm.df6$coefficients
##                      (Intercept)                          backers 
##                    -0.1084459345                     0.0376614709 
##                           length       factor_main_categoryComics 
##                    -0.0166065842                    -0.3441265031 
##       factor_main_categoryCrafts        factor_main_categoryDance 
##                    -0.6909958567                     0.8083406935 
##       factor_main_categoryDesign      factor_main_categoryFashion 
##                    -0.6902328891                    -0.6229048646 
## factor_main_categoryFilm & Video         factor_main_categoryFood 
##                     0.2228310300                    -0.4265719103 
##        factor_main_categoryGames   factor_main_categoryJournalism 
##                    -1.5225347818                    -0.6963049798 
##        factor_main_categoryMusic  factor_main_categoryPhotography 
##                     0.2143968021                    -0.3905479027 
##   factor_main_categoryPublishing   factor_main_categoryTechnology 
##                    -0.4792630760                    -0.7480790636 
##      factor_main_categoryTheater                    usd_goal_real 
##                     0.7702796735                    -0.0001731652

STEP 6: Predictions

Use the model you’ve inferred from the previous step to predict the success outcomes in the test set.

# YOUR CODE HERE
predicted_probs.glm6 <- predict(glm.df6, newdata= df6_test, type = "response")

STEP 7: How well did it do?

Report the Root Mean Squared Error (RMSE) of the predictions for the training and the test sets. For Steps 7 and onwards, instead of calculating the RMSE(s), calculate the misclassification rates instead. Given that we are dealing with a discrete outcome variable, it is inappropriate to use RMSE to evaluate model performance. Thanks to Mykyta for pointing this out!

# YOUR CODE HERE

#classify each project at a 0.5 probability threshold, then build a confusion matrix
predicted_ys <- ifelse(predicted_probs.glm6 > 0.5, 1, 0)

glm6_ctable <-table(predicted_ys, df6_test$success)

glm6_ctable
##             
## predicted_ys     0     1
##            0 45869  5877
##            1  1760 20372
#the misclassification rate is 1 minus the overall accuracy:
#one minus the share of correct predictions (the diagonal of the confusion matrix)

misclassification_rate <- 1 - (glm6_ctable[1,1] + glm6_ctable[2,2]) / length(predicted_ys)
misclassification_rate
## [1] 0.1033731

Step 8: LOOCV method

Apply the leave-one-out cross validation (LOOCV) method to the training set. What is the RMSE of the training and test sets. How similar are the RMSEs? #For question 6, we understand that the number of observations are too large for the LOOCV to perform. You can sample 5% of the data and then split it into test/training for faster performance. #For Step 8 where we ask you to apply the LOOCV method, you can sample 5-10% of the original dataset and then split that into training and test sets. Make sure to explain what the implications are for only using such a sample to do the LOOCV on in the next steps.

# YOUR CODE HERE

library(boot)  #provides cv.glm()

#sampling randomly 5% of the original data frame
df6_step8 <- df6[sample(nrow(df6), 0.05*nrow(df6)),] 

#splitting the data
df6_step8_split <- sample(1:nrow(df6_step8), 0.8*nrow(df6_step8))
df6_step8_train <- df6_step8[df6_step8_split,]
df6_step8_test <- df6_step8[-df6_step8_split,]


glm.df6_step8 <- glm(success ~ backers + length + factor_main_category + usd_goal_real, data = df6_step8_train, family = "binomial")

nrow(df6_step8_train)
## [1] 14775
#cv.glm() with no K argument performs LOOCV (K = number of observations);
#delta reports the cross-validated prediction error (mean squared error by default)
cv.err <- cv.glm(df6_step8_train, glm.df6_step8)

cv.err$delta
## [1] 0.08538586 0.08538586
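
Since Step 7 switches the evaluation metric to misclassification rates, a cost function can also be passed to cv.glm() so that delta reports a cross-validated misclassification rate rather than the default mean squared error. A sketch, following the cost-function pattern from the boot package documentation (note this still refits the model once per observation, so it is slow):

#cost: classify at 0.5 and report the share of misclassified observations
cost_misclass <- function(y, prob) mean(abs(y - prob) > 0.5)
cv.err.mc <- cv.glm(df6_step8_train, glm.df6_step8, cost = cost_misclass)
cv.err.mc$delta[1]  #LOOCV misclassification rate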

Step 9: Explanations

Compare the RMSE from the simple method to the LOOCV method?

How do data scientists really use cross-validation? How is the approach in this project differ from real-world cases? Give an example to make your point!

The LOOCV estimate of prediction error (delta ≈ 0.085) looks much smaller than the test-set misclassification rate from Step 7, but the two numbers are not directly comparable: one is a misclassification rate and the other is a cross-validated mean squared error, and the LOOCV was run on only 5% of the data, so its estimate may not generalize to the full data set. In a real-world scenario with this many observations, LOOCV would be impractical because it refits the model once per observation. It would be more appropriate to use k-fold cross-validation, splitting the data into 5 or 10 folds, or simply to wait and validate on new data, since Kickstarter continuously generates more projects.
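
For comparison, a k-fold version is far cheaper than LOOCV on data of this size. A sketch using boot::cv.glm with the objects defined in Step 8 (K = 10 is a common choice; this is an illustration, not part of the original submission):

#10-fold cross-validation on the same training set; delta again reports
#the cross-validated prediction error (mean squared error by default)
cv.err.k10 <- cv.glm(df6_step8_train, glm.df6_step8, K = 10)
cv.err.k10$delta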

Extra Credit: Least Absolute Deviation Estimator

STEP 1

Figure out how to use rgenoud to run a regression that minimizes the sum of absolute deviations instead of the traditional sum of the squared residuals. Show that this works by running that regression on the lalonde data set with outcome being re78 and independent variables being age, education, hisp, re74, re75, and treat.

# YOUR CODE HERE
#the commented-out attempt below does not run: genoud() optimizes a
#user-supplied objective function over nvars parameters; it cannot take
#an lm object directly. A working sketch follows this chunk.
#library(rgenoud)
#glm.10 <- lm(re78 ~ age + educ + hisp + re74 + re75 + treat , data = lalonde)
#genoud(glm.10,nvars=1, max=TRUE)
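
A minimal sketch of how this could work, assuming the rgenoud and Matching packages. genoud() minimizes a user-supplied loss when max = FALSE; the least-squares coefficients serve as starting values, and the search domains are widened because the coefficients here are on the scale of thousands:

library(rgenoud)
library(Matching)
data(lalonde)

#design matrix (with intercept) and outcome
X <- model.matrix(re78 ~ age + educ + hisp + re74 + re75 + treat, data = lalonde)
y <- lalonde$re78

#objective: sum of absolute deviations for a candidate coefficient vector
lad_loss <- function(beta) sum(abs(y - X %*% beta))

#start from the least-squares solution to speed up the search
ols_fit <- lm(re78 ~ age + educ + hisp + re74 + re75 + treat, data = lalonde)

lad_fit <- genoud(lad_loss, nvars = ncol(X), max = FALSE,
                  starting.values = coef(ols_fit), pop.size = 2000,
                  Domains = cbind(rep(-1e5, ncol(X)), rep(1e5, ncol(X))))
lad_fit$par  #LAD coefficient estimates, in the same order as coef(ols_fit)

For Step 2, lad_fit$par can then be compared against coef(ols_fit)["treat"].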

STEP 2

How different is this coef on treat from the coef on treat that you get from the corresponding traditional least squares regression?

STEP 3

Now figure out how to do the same by using rgenoud to run the logistic regression (modeling treatment status as a function of age, education, hisp, re74 and re75).

# YOUR CODE HERE
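
One possible sketch under the same assumptions: genoud() can maximize the binomial log-likelihood directly (max = TRUE), which should reproduce the logistic regression coefficients from Question 4:

#design matrix and binary outcome for the logistic model
Xl <- model.matrix(treat ~ age + educ + hisp + re74 + re75, data = lalonde)
yl <- lalonde$treat

#binomial log-likelihood for a candidate coefficient vector
loglik <- function(beta) {
  p <- plogis(Xl %*% beta)  #inverse logit
  sum(yl * log(p) + (1 - yl) * log(1 - p))
}

glm_start <- coef(glm(treat ~ age + educ + hisp + re74 + re75,
                      data = lalonde, family = binomial))

logit_fit <- genoud(loglik, nvars = ncol(Xl), max = TRUE,
                    starting.values = glm_start, pop.size = 2000)
logit_fit$par  #should be close to coef(glm4)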

END OF Assignment!!!

Final Steps

Add Markdown Text to .Rmd

Before finalizing your project you'll want to be sure there are comments in your code chunks and text outside of your code chunks to explain what you're doing in each code chunk. These explanations are incredibly helpful for someone who doesn't code or someone unfamiliar with your project. You have two options for submission:

  1. You can complete this .rmd file, knit it to pdf and submit both the .rmd file and the .pdf file on Forum as one .zip file.
  2. You can submit your assignment as a separate pdf using your favorite text editor and submit the pdf file along with a link to your github code. Note that links to Google Docs are not accepted.

Knitting your R Markdown Document

Last but not least, you'll want to Knit your .Rmd document into a pdf document. If you get an error, take a look at what the error says and edit your .Rmd document. Then, try to Knit again! Troubleshooting these error messages will teach you a lot about coding in R. If you get any error that doesn't make sense to you, post it on Perusall.

A Few Final Checks

If you are submitting an .rmd file, a complete project should have:

  • Completed code chunks throughout the .Rmd document (your RMarkdown document should Knit without any error)
  • Comments in your code chunks
  • Answered all questions throughout this exercise.

If you are NOT submitting an .rmd file, a complete project should have:

  • A pdf that includes all the answers and their questions.
  • A link to Github (gist or repository) that contains all the code used to answer the questions. Each part of your code should say which question it's referring to.