One problem that could be addressed with predictive modeling, inference, and clustering is UCSD waitlist prediction. Whether one will get off the waitlist is a constant source of uncertainty for students at UCSD and other universities. From previous data, it would be possible to infer the relationship between the chance of getting in and predictors such as waitlist position, the class waitlisted for, the current day of the enrollment period, and the speed with which the class filled up. Some of these predictors, such as the class or the department it belongs to, would likely benefit from clustering. The resulting model could then be used to predict a student's chance of getting in, and take some weight off of students’ minds!
A flexible method would be better, as it is unlikely to underfit or suffer from high bias.
An inflexible method would be better, as it usually extrapolates more reliably from limited information and is less likely to overfit.
A flexible method would be better, as an inflexible model is unlikely to capture the nuances of the relationship between the predictors and the response.
An inflexible method would be better, as flexible models have high variance and would tend to fit the noise in the data.
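The trade-off behind these four answers can be illustrated with a small simulation. This is only a sketch under assumed settings: the sine-shaped true relationship, the noise level, the sample sizes, and the names `train`, `test`, `mse`, `fit.inflexible`, and `fit.flexible` are all illustrative choices, not part of the assignment.

```r
# Sketch: compare an inflexible fit (straight line) with a flexible fit
# (degree-8 polynomial) on simulated data. All settings are assumptions.
set.seed(1)

simulate <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}

train <- simulate(30)   # small training set
test  <- simulate(200)  # separate test set from the same process

# Mean squared error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit.inflexible <- lm(y ~ x, data = train)
fit.flexible   <- lm(y ~ poly(x, 8), data = train)

results <- c(
  train.inflexible = mse(fit.inflexible, train),
  train.flexible   = mse(fit.flexible, train),
  test.inflexible  = mse(fit.inflexible, test),
  test.flexible    = mse(fit.flexible, test)
)
print(round(results, 3))
```

The flexible fit can never have a higher training error than the linear fit, since the straight line is a special case of the polynomial. Which one has the lower test error depends on the sample size, the noise level, and how non-linear the true relationship is, which is the point of the four cases above.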
obs.1 = c(0,3,0)
obs.2 = c(2,0,0)
obs.3 = c(0,1,3)
obs.4 = c(0,1,2)
obs.5 = c(-1,0,1)
obs.6 = c(1,1,1)
obs.colors = c("Red", "Red", "Red", "Green", "Green", "Red")
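# Euclidean distance from each observation to the test point, taken here to be the origin (0, 0, 0)
distances = c(0,0,0,0,0,0)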
names(distances) = c("Obs 1", "Obs 2", "Obs 3", "Obs 4", "Obs 5", "Obs 6")
distances[1] = sqrt(obs.1[1]^2 + obs.1[2]^2 + obs.1[3]^2)
distances[2] = sqrt(obs.2[1]^2 + obs.2[2]^2 + obs.2[3]^2)
distances[3] = sqrt(obs.3[1]^2 + obs.3[2]^2 + obs.3[3]^2)
distances[4] = sqrt(obs.4[1]^2 + obs.4[2]^2 + obs.4[3]^2)
distances[5] = sqrt(obs.5[1]^2 + obs.5[2]^2 + obs.5[3]^2)
distances[6] = sqrt(obs.6[1]^2 + obs.6[2]^2 + obs.6[3]^2)
print(distances)
##    Obs 1    Obs 2    Obs 3    Obs 4    Obs 5    Obs 6 
## 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
The nearest observation is 5, which is green. Thus, we expect the test point to be green.
The nearest 3 observations are 2, 5, and 6, which are red, green, and red respectively. Thus we expect the test point to be red since most of the nearest points are red.
We would expect the best value for K to be small, as a highly non-linear relationship requires a flexible fit, and KNN becomes more flexible as K decreases.
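The K = 1 and K = 3 predictions above can be reproduced with a short vectorized sketch. The helper `knn.predict` and the matrix `obs` are names introduced here for illustration, and the test point is assumed to be the origin, as in the distance calculation above.

```r
# Sketch: K-nearest-neighbors majority vote for a single test point.
# knn.predict is a hypothetical helper, not a function from any package.
obs <- rbind(c(0, 3, 0), c(2, 0, 0), c(0, 1, 3),
             c(0, 1, 2), c(-1, 0, 1), c(1, 1, 1))
obs.colors <- c("Red", "Red", "Red", "Green", "Green", "Red")
test.point <- c(0, 0, 0)  # assumed test point

knn.predict <- function(train, labels, point, k) {
  # Euclidean distance from every training row to the test point
  d <- sqrt(rowSums(sweep(train, 2, point)^2))
  # Labels of the k closest observations
  nearest <- order(d)[seq_len(k)]
  votes <- table(labels[nearest])
  # Majority class; ties go to the alphabetically first label
  names(votes)[which.max(votes)]
}

print(knn.predict(obs, obs.colors, test.point, k = 1))  # "Green"
print(knn.predict(obs, obs.colors, test.point, k = 3))  # "Red"
```

Using a matrix and `rowSums` avoids writing one distance line per observation, so the same code works for any number of observations or predictors.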
Income2.csv from the textbook’s website
readtable or csvread. NOTE: Please include your code.
xlabel and ylabel).
HigherEd. This variable is defined to be 1 if the subject has ≥16 years of education, and 0 otherwise. Make a box plot comparing the income level of subjects with HigherEd=0 vs. HigherEd=1.
require(matrixStats)
## Loading required package: matrixStats
income.data <- read.csv("/Users/rdoctor/Desktop/Cogs 109 hw1/Income2.csv") # or whatever the Income2 file's path is
plot(income.data[["Education"]], income.data[["Income"]], xlab = "Education", ylab = "Income")
income.mean <- colMeans(income.data)[["Income"]]
print(income.mean)
## [1] 62.74473
income.sd <- colSds(data.matrix(income.data))[4]
print(income.sd)
## [1] 27.01328
income.sem <- income.sd / sqrt(length(income.data[["X"]]))
print(income.sem)
## [1] 4.931929
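The same three statistics can also be computed with base R alone, which avoids the extra package and the positional column index. This is a minimal sketch on a made-up vector: `toy.income` and its values are hypothetical and are not taken from Income2.csv.

```r
# Sketch: mean, standard deviation, and standard error of the mean (SEM)
# using only base R. toy.income is hypothetical illustrative data.
toy.income <- c(20, 40, 60, 80)

toy.mean <- mean(toy.income)
toy.sd   <- sd(toy.income)                      # sample SD (n - 1 denominator)
toy.sem  <- toy.sd / sqrt(length(toy.income))   # SEM = SD / sqrt(n)

print(c(mean = toy.mean, sd = toy.sd, sem = toy.sem))
```

Applied to the real data, the vector would be `income.data[["Income"]]` in place of `toy.income`.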
HigherEd = income.data[["Education"]] >= 16
print(HigherEd)
## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [12] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [23] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
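A box plot comparing the two groups could then be drawn along the following lines. This is a sketch on a small made-up data frame: `toy` and its values are hypothetical, and `as.integer` is used so that HigherEd is coded as 1/0 rather than TRUE/FALSE.

```r
# Sketch: code HigherEd as 1/0 and compare income across the two groups.
# The data frame toy is hypothetical illustrative data.
toy <- data.frame(
  Education = c(10, 12, 16, 18, 20, 14),
  Income    = c(25, 30, 60, 75, 90, 40)
)

# 1 if the subject has at least 16 years of education, 0 otherwise
toy$HigherEd <- as.integer(toy$Education >= 16)

# One box per HigherEd group
boxplot(Income ~ HigherEd, data = toy, xlab = "HigherEd", ylab = "Income")
```

With the real data, the same formula interface would apply after adding the HigherEd column to `income.data`.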