Our goal in this assignment is to predict the interest rate of government debt as accurately as possible.

Outcomes

Instructions

Data

The California DebtWatch contains the following information:

The principal amounts, sale dates, interest rates, terms, purposes, ratings, costs of issuance, financing team participants, issuance documents, and annual reporting (if applicable), among 67 other data points required under California Government Code section 8855, of the various types of debt issued by all state and local government agencies in California.

Download the data in CSV format and load it into R. Randomly select 20% of the rows and save them into a test set to use later to evaluate the performance of your model. Which column(s) represent interest rate? > TIC.Interest.Rate and NIC.Interest.Rate

set.seed(280)
library(rpart)
d = read.csv("~/Desktop/Stat 128/CDA_ALL_Raw.csv")
s20 = sample(nrow(d), size = .20*nrow(d))
test = d[s20, ]
train = d[-s20, ]
grep("Interest", colnames(d), ignore.case = TRUE, value = TRUE)
## [1] "TIC.Interest.Rate"   "NIC.Interest.Rate"   "Interest.Type"      
## [4] "Other.Interest.Type"

Calculated Columns

Define one or more new columns from existing text columns in the data set. For example, you could add a logical column indicating whether the term “lease” appears in some column. Why do you think this new column will help you improve the accuracy of your model?

l = grepl("lease", d$Debt.Type, ignore.case = TRUE)
d2 = cbind(d, l)

l_test = grepl("lease", test$Debt.Type, ignore.case = TRUE)
test2 = cbind(test, l_test)

Models

Use the remaining 80% of the data (the training set) to come up with two different models to predict interest rate. You’re welcome to use any external machine learning libraries you like, or you can stick with the lm and rpart from class. Note that you can come up with different models by using different subsets of columns. For example, a model with 3 input columns differs from a model with 60 input columns. Briefly describe the two models you ended up with.

train$interest = train[ , "TIC.Interest.Rate"]
train$year = substring(train$CDIAC.Number, 1, 4)
train_na = is.na(train$interest)
train$lease = grepl("lease", train$Debt.Type, ignore.case = TRUE)
train2 = train[!(train_na), c("interest", "lease", "year")]
fit_lm = lm(interest ~ year, train2)
plot(interest ~ year, train2)

Evaluating Performance

Evaluate both of your models on the 20% of the data you reserved for the test set by looking at the average absolute difference between the interest rate predicted by the model and the actual interest rate. Do the models do a reasonable job of predicting interest rate? Find the rows where the predicted interest rate is farthest from the true interest rate. Why might the model have done a poor job on these rows?

A smaller data set may result a poor job due to the larger variance and outliers it may have.

test$interest = test[ , "TIC.Interest.Rate"]
test$year = substring(test$CDIAC.Number, 1, 4)
test_na = is.na(test$interest)
test$lease = grepl("lease", test$Debt.Type, ignore.case = TRUE)
test2 = test[!(test_na), c("interest", "lease", "year")]

fit_lm = lm(interest ~ year, test2)
plot(interest ~ year, test2)