gwang323

Data

We will use the same dataset as Project 1: movies_merged.

Objective

Your goal in this project is to build a linear regression model that can predict the Gross revenue earned by a movie based on other variables. You may use R packages to fit and evaluate a regression model (no need to implement regression yourself). Please stick to linear regression, however.

Instructions

You should be familiar with using an RMarkdown Notebook by now. Remember that you have to open it in RStudio, and you can run code chunks by pressing Cmd+Shift+Enter.

Please complete the tasks below and submit this R Markdown file (as pr2.Rmd) containing all completed code chunks and written responses, and a PDF export of it (as pr2.pdf) which should include the outputs and plots as well.

Note that the Setup and Data Preprocessing steps do not carry any points; however, they need to be completed as instructed in order to get meaningful results.

Setup

Same as Project 1, load the dataset into memory:

load('movies_merged')

This creates an object of the same name (movies_merged). For convenience, you can copy it to df and start using it:

df = movies_merged
cat("Dataset has", dim(df)[1], "rows and", dim(df)[2], "columns", end="\n", file="")

## Dataset has 40789 rows and 39 columns

colnames(df)

##  [1] "Title"             "Year"              "Rated"            
##  [4] "Released"          "Runtime"           "Genre"            
##  [7] "Director"          "Writer"            "Actors"           
## [10] "Plot"              "Language"          "Country"          
## [13] "Awards"            "Poster"            "Metascore"        
## [16] "imdbRating"        "imdbVotes"         "imdbID"           
## [19] "Type"              "tomatoMeter"       "tomatoImage"      
## [22] "tomatoRating"      "tomatoReviews"     "tomatoFresh"      
## [25] "tomatoRotten"      "tomatoConsensus"   "tomatoUserMeter"  
## [28] "tomatoUserRating"  "tomatoUserReviews" "tomatoURL"        
## [31] "DVD"               "BoxOffice"         "Production"       
## [34] "Website"           "Response"          "Budget"           
## [37] "Domestic_Gross"    "Gross"             "Date"

Load R packages

Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.

library(ggplot2)
library(tm)  # used later to turn Genre into indicator columns

If you are using any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular install.packages('<pkg name>') command.

Non-standard packages used: None

Data Preprocessing

Before we start building models, we should clean up the dataset and perform any preprocessing steps that may be necessary. Some of these steps can be copied in from your Project 1 solution. It may be helpful to print the dimensions of the resulting dataframe at each step.

1. Remove non-movie rows

# TODO: Remove all rows from df that do not correspond to movies
df_movie <- df[df$Type == "movie",]
cat("df_movie has", dim(df_movie)[1], "rows and", dim(df_movie)[2], "columns", end="\n", file="")

## df_movie has 40000 rows and 39 columns

2. Drop rows with missing `Gross` value

Since our goal is to model Gross revenue against other variables, rows that have missing Gross values are not useful to us.

# TODO: Remove rows with missing Gross value
df_movie_gross = df_movie[!is.na(df_movie$Gross), ]
cat("df_movie_gross has", dim(df_movie_gross)[1], "rows and", dim(df_movie_gross)[2], "columns", end="\n", file="")

## df_movie_gross has 4558 rows and 39 columns

3. Exclude movies released prior to 2000

Inflation and other global financial factors may affect the revenue earned by movies during certain periods of time. Taking that into account is out of scope for this project, so let’s exclude all movies that were released prior to the year 2000 (you may use Released, Date or Year for this purpose).

# TODO: Exclude movies released prior to 2000
df_movie_gross_2000 = df_movie_gross[df_movie_gross$Year >=2000 ,]
cat("df_movie_gross_2000 has", dim(df_movie_gross_2000)[1], "rows and", dim(df_movie_gross_2000)[2], "columns", end="\n", file="")

## df_movie_gross_2000 has 3332 rows and 39 columns

4. Eliminate mismatched rows

Note: You may compare the Released column (string representation of release date) with either Year or Date (numeric representation of the year) to find mismatches. The goal is to avoid removing more than 10% of the rows.

# TODO: Remove mismatched rows
# Parse the "Released" column to get the release year
X= data.frame(do.call('rbind', strsplit(as.character(df_movie_gross_2000$Released),'-',fixed=TRUE)))
df_movie_gross_2000$releasedYear = X$X1
cat("df_movie_gross_2000", dim(df_movie_gross_2000)[1], "rows and", dim(df_movie_gross_2000)[2], "columns", end="\n", file="")

## df_movie_gross_2000 3332 rows and 40 columns

# Keep the ones that are correct
df_movie_gross_2000_match2 = df_movie_gross_2000[abs(df_movie_gross_2000$Year - df_movie_gross_2000$Date) <= 1,]
df_movie_gross_2000_match2$releasedYear = as.numeric(as.character(df_movie_gross_2000_match2$releasedYear))
df_movie_gross_2000_match3 = df_movie_gross_2000_match2[abs(df_movie_gross_2000_match2$Year - df_movie_gross_2000_match2$releasedYear) <= 1,]
cat("df_movie_gross_2000_match3 has", dim(df_movie_gross_2000_match3)[1], "rows and", dim(df_movie_gross_2000_match3)[2], "columns", end="\n", file="")

## df_movie_gross_2000_match3 has 3179 rows and 40 columns
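
As an alternative to splitting the string, the release year can be pulled out with `as.Date()` and `format()`. The following is a minimal sketch on made-up rows, assuming `Released` is stored as "YYYY-MM-DD" text (or already as a Date); the `toy` data frame and the choice to keep rows with a missing release date are illustrative assumptions, not part of the assignment.

```r
# Made-up rows for illustration; not taken from movies_merged.
toy <- data.frame(
  Title = c("A", "B", "C"),
  Year = c(2005, 2010, 2012),
  Released = c("2005-06-17", "2008-01-04", NA),
  stringsAsFactors = FALSE
)

# Extract the four-digit year; a missing release date stays NA.
released_year <- as.numeric(format(as.Date(toy$Released), "%Y"))

# Keep rows whose years agree within one year.
# Rows with a missing release date are kept here (a judgment call).
keep <- is.na(released_year) | abs(toy$Year - released_year) <= 1
toy[keep, ]
```

Building the filter as a complete logical vector (no `NA` entries) avoids the all-`NA` rows that R returns when a logical index contains `NA`.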

5. Drop `Domestic_Gross` column

Domestic_Gross is basically the amount of revenue a movie earned within the US. Understandably, it is very highly correlated with Gross and is in fact equal to it for movies that were not released globally. Hence, it should be removed for modeling purposes.

# TODO: Exclude the `Domestic_Gross` column
df_movie_gross_2000_match3$Domestic_Gross <- NULL
cat("df_movie_gross_2000_match3 has", dim(df_movie_gross_2000_match3)[1], "rows and", dim(df_movie_gross_2000_match3)[2], "columns", end="\n", file="")

## df_movie_gross_2000_match3 has 3179 rows and 39 columns

6. Process `Runtime` column

# TODO: Replace df$Runtime with a numeric column containing the runtime in minutes
rows_num = dim(df_movie_gross_2000_match3)[1]
for (i in seq_len(rows_num)){
  # read from the filtered data frame (not the original df) so that row i is the same movie
  content = unlist(strsplit(df_movie_gross_2000_match3[i, "Runtime"], split=" "))
  # a missing runtime stays missing instead of being recorded as 0 minutes
  if (is.na(content[1]) || content[1] == "N/A"){
    df_movie_gross_2000_match3[i,"Runtime"] = NA
    next
  }

  # add up "<n> h" and "<n> min" parts
  cur = 0
  for (j in seq_len(length(content) - 1)){
    if (substr(content[j + 1],1, 1) == "h"){
      cur = cur + as.numeric(content[j]) * 60
    }

    if (substr(content[j + 1],1, 3) == "min"){
      cur = cur + as.numeric(content[j])
    }
  }
  df_movie_gross_2000_match3[i,"Runtime"]= cur
}
df_movie_gross_2000_match3$Runtime = as.numeric(df_movie_gross_2000_match3$Runtime)
cat("Convert Runtime to minutes, df_movie_gross_2000_match3 has", dim(df_movie_gross_2000_match3)[1], "rows and", dim(df_movie_gross_2000_match3)[2], "columns", end="\n", file="")

## Convert Runtime to minutes, df_movie_gross_2000_match3 has 3179 rows and 39 columns
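
The same conversion can also be written without an explicit loop. The sketch below is a vectorised variant, assuming runtimes look like "1 h 30 min", "95 min" or "N/A"; `parse_runtime` is a hypothetical helper name, not something provided by the assignment.

```r
# Hypothetical helper: convert runtime strings to minutes.
parse_runtime <- function(x) {
  x <- as.character(x)
  missing <- is.na(x) | x == "N/A"
  x[missing] <- ""  # nothing to match; restored to NA below

  # Return the number captured by `pattern` for each element, or 0 if absent.
  extract <- function(pattern) {
    m <- regmatches(x, regexec(pattern, x))
    vapply(m, function(g) if (length(g) >= 2) as.numeric(g[2]) else 0, numeric(1))
  }

  minutes <- extract("([0-9]+) *h") * 60 + extract("([0-9]+) *min")
  minutes[missing] <- NA  # a missing runtime stays missing
  minutes
}

parse_runtime(c("1 h 30 min", "95 min", "2 h", "N/A"))
```

Returning `NA` for missing runtimes lets `na.omit()` drop those rows later instead of treating them as zero-minute movies.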

df = df_movie_gross_2000_match3

Perform any additional preprocessing steps that you find necessary, such as dealing with missing values or highly correlated columns (feel free to add more code chunks, markdown blocks and plots here as necessary).

# TODO(optional): Additional preprocessing

Note: Do NOT convert categorical variables (like Genre) into binary columns yet. You will do that later as part of a model improvement task.

Final preprocessed dataset

Report the dimensions of the preprocessed dataset you will be using for modeling and evaluation, and print all the final column names. (Again, Domestic_Gross should not be in this list!)

# TODO: Print the dimensions of the final preprocessed dataset and column names
cat("After data preprocessing, the dataset has", dim(df)[1], "rows and", dim(df)[2], "columns", end="\n", file="")

## After data preprocessing, the dataset has 3179 rows and 39 columns

colnames(df)

##  [1] "Title"             "Year"              "Rated"            
##  [4] "Released"          "Runtime"           "Genre"            
##  [7] "Director"          "Writer"            "Actors"           
## [10] "Plot"              "Language"          "Country"          
## [13] "Awards"            "Poster"            "Metascore"        
## [16] "imdbRating"        "imdbVotes"         "imdbID"           
## [19] "Type"              "tomatoMeter"       "tomatoImage"      
## [22] "tomatoRating"      "tomatoReviews"     "tomatoFresh"      
## [25] "tomatoRotten"      "tomatoConsensus"   "tomatoUserMeter"  
## [28] "tomatoUserRating"  "tomatoUserReviews" "tomatoURL"        
## [31] "DVD"               "BoxOffice"         "Production"       
## [34] "Website"           "Response"          "Budget"           
## [37] "Gross"             "Date"              "releasedYear"

Evaluation Strategy

In each of the tasks described in the next section, you will build a regression model. In order to compare their performance, you will compute the training and test Root Mean Squared Error (RMSE) at different training set sizes.
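
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. A minimal sketch of it as a reusable function follows; the name `rmse` is an assumption for illustration, not part of the assignment.

```r
# Hypothetical helper: root mean squared error between two numeric vectors.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((predicted - actual)^2))
}

# Toy check: errors of 3 and 4 give sqrt((9 + 16) / 2).
rmse(c(0, 0), c(3, 4))
```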

First, randomly sample 10-20% of the preprocessed dataset and keep that aside as the test set. Do not use these rows for training! The remainder of the preprocessed dataset is your training data.

Now use the following evaluation procedure for each model:

Choose a suitable sequence of training set sizes, e.g. 10%, 20%, 30%, …, 100% (10-20 different sizes should suffice). For each size, sample that many inputs from the training data, train your model, and compute the resulting training and test RMSE.
Repeat your training and evaluation at least 10 times at each training set size, and average the RMSE results for stability.
Generate a graph of the averaged train and test RMSE values as a function of the train set size (%), with optional error bars.

You can define a helper function that applies this procedure to a given set of features and reuse it.
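
One possible shape for such a helper is sketched below. It is only an illustration under stated assumptions: `learning_curve`, `fractions` and `repeats` are made-up names, and the call at the end uses synthetic data rather than the movies dataset.

```r
# Hypothetical helper: average train/test RMSE over several training-set sizes.
learning_curve <- function(train, test, formula,
                           fractions = seq(0.1, 1, by = 0.1), repeats = 10) {
  response <- all.vars(formula)[1]
  rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

  rows <- lapply(fractions, function(frac) {
    n <- max(2, floor(frac * nrow(train)))
    # One column per repetition, rows named "train" and "test".
    errs <- vapply(seq_len(repeats), function(r) {
      cur <- train[sample(nrow(train), n), , drop = FALSE]
      fit <- lm(formula, data = cur)
      c(train = rmse(cur[[response]], predict(fit, cur)),
        test  = rmse(test[[response]], predict(fit, test)))
    }, c(train = 0, test = 0))
    data.frame(fraction = frac,
               train = mean(errs["train", ]),
               test  = mean(errs["test", ]))
  })
  do.call(rbind, rows)
}

# Synthetic data only, to show the call.
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200, sd = 0.1)
test_idx <- sample(nrow(toy), 40)  # 20% held out, never used for fitting
curve <- learning_curve(toy[-test_idx, ], toy[test_idx, ], y ~ x)
print(curve)
```

Calling `set.seed()` before the split makes the reported RMSE values reproducible from one knit to the next.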

Tasks

Each of the following tasks is worth 20 points, for a total of 100 points for this project. Remember to build each model as specified, evaluate it using the strategy outlined above, and plot the training and test errors by training set size (%).

1. Numeric variables

Use Linear Regression to predict Gross based on available numeric variables. You can choose to include all or a subset of them.

# Shuffle data
df2 <- df[sample(nrow(df)),]
# Subsetting 20% for test data
n_test = floor(nrow(df2) * 0.2)
test = df2[1:n_test,]
# Subsetting the remaining 80% for training data
# (start after the last test row so the two sets do not overlap)
train = df2[(n_test + 1):nrow(df2),]
cat("After shuffling, the dataset has", dim(df2)[1], "rows and", dim(df2)[2], "columns", end="\n", file="")

## After shuffling, the dataset has 3179 rows and 39 columns

cat("After splitting, the train dataset has", dim(train)[1], "rows and", dim(train)[2], "columns", end="\n", file="")

## After splitting, the train dataset has 2544 rows and 39 columns

cat("After splitting, the test has", dim(test)[1], "rows and", dim(test)[2], "columns", end="\n", file="")

## After splitting, the test has 635 rows and 39 columns

# TODO: Build & evaluate model 1 (numeric variables only)
selected_features = c("Gross", "Runtime", "imdbVotes", "tomatoRating", "imdbRating","tomatoFresh","Budget" )
train1 = subset(train, select = selected_features)
train1 = na.omit(train1)

test1 = subset(test, select = selected_features)
test1 = na.omit(test1)
print (dim(test1))

## [1] 576   7

size = as.integer(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) * dim(train1)[1])
train1_RMSEs = matrix(c(0), nrow = 10, ncol = 10)
test1_RMSEs = matrix(c(0), nrow = 10, ncol = 10)

for (j in 1:10){
  train1 = train1[sample(nrow(train1)),]
  for (i in 1:10){
    cur = train1[1:size[i],]
    # fit model
    fit = lm(Gross~., data = cur, na.action = na.omit)
    # calculate train RMSE
    cur$pred_train1_gross = predict(fit, cur)
    cur$resid = (cur$pred_train1_gross - cur$Gross) * (cur$pred_train1_gross - cur$Gross)
    cur_RMSE = sqrt(sum(cur$resid) / dim(cur)[1])
    train1_RMSEs[i,j] = (cur_RMSE)
    # calculate test RMSE
    test1$pred_test1_gross = predict(fit, test1)
    test1$resid = (test1$pred_test1_gross - test1$Gross) * (test1$pred_test1_gross - test1$Gross)
    test1_RMSE = sqrt(sum(test1$resid) / dim(test1)[1])
    test1_RMSEs[i, j] = test1_RMSE
  }
}

# learning curve data build
trainingset_percentage = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
learning_curve_1 = matrix(c(0), nrow = 10, ncol = 3)
colnames(learning_curve_1) = c("sample_percentage", "train", "test")
for (row in 1: 10){
learning_curve_1[row,1] = row * 0.1
learning_curve_1[row, 2] = mean(train1_RMSEs[row,])
learning_curve_1[row, 3] = mean(test1_RMSEs[row,])
}
learning_curve_1 = data.frame(learning_curve_1)
#View(learning_curve_1)
print (min(learning_curve_1[,3]))

## [1] 100055610

print (which.min(learning_curve_1[,3]))

## [1] 10

# plot
g <- ggplot(learning_curve_1, aes(trainingset_percentage))
g <- g + geom_line(aes(y=learning_curve_1[, 2]), colour="red")   # train RMSE
g <- g + geom_line(aes(y=learning_curve_1[, 3]), colour="green") # test RMSE
g = g + labs(title = "Task 1: Train RMSE vs Test RMSE")
g

Q: List the numeric variables you used.

A: “Gross”, “Runtime”, “imdbVotes”, “tomatoRating”, “imdbRating”,“tomatoFresh”,“Budget”

Q: What is the best mean test RMSE value you observed, and at what training set size?

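A: The best mean test RMSE in the run shown above is 100055610, at 100% of the training set. The split is random and unseeded, so the exact value changes from run to run.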

2. Feature transformations

Try to improve the prediction quality from Task 1 as much as possible by adding feature transformations of the numeric variables. Explore both numeric transformations such as power transforms and non-numeric transformations of the numeric variables like binning (e.g. is_budget_greater_than_3M).

# TODO: Build & evaluate model 2 (transformed numeric variables only)
selected = c("Gross", "Runtime", "imdbVotes", "tomatoRating", "imdbRating","tomatoFresh","Budget" )
train2 = subset(train, select = selected)
train2 = na.omit(train2)
train2$logBudget = log(train2$Budget)
train2$bintomatoRating = cut(train2$tomatoRating, 5, include.lowest=TRUE, labels=c(1, 2,3,4,5))

test2 = subset(test, select = selected)
test2 = na.omit(test2)
test2$logBudget = log(test2$Budget)
test2$bintomatoRating = cut(test2$tomatoRating, 5, include.lowest=TRUE, labels=c(1, 2,3,4,5))


size = as.integer(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) * dim(train2)[1])
train2_RMSEs = matrix(c(0), nrow = 10, ncol = 10)
test2_RMSEs = matrix(c(0), nrow = 10, ncol = 10)

for (j in 1:10){
  train2 = train2[sample(nrow(train2)),]
  for (i in 1:10){
    cur = train2[1:size[i],]
    # fit model
    fit = lm(Gross~., data = cur, na.action = na.omit)
    # calculate train RMSE
    cur$pred_train2_gross = predict(fit, cur)
    cur$resid = (cur$pred_train2_gross - cur$Gross) * (cur$pred_train2_gross - cur$Gross)
    cur_RMSE = sqrt(sum(cur$resid) / dim(cur)[1])
    train2_RMSEs[i,j] = (cur_RMSE)
    # calculate test RMSE
    test2$pred_test2_gross = predict(fit, test2)
    test2$resid = (test2$pred_test2_gross - test2$Gross) * (test2$pred_test2_gross - test2$Gross)
    test2_RMSE = sqrt(sum(test2$resid) / dim(test2)[1])
    test2_RMSEs[i, j] = test2_RMSE
  }
}

# learning curve data build
trainingset_percentage = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
learning_curve_2 = matrix(c(0), nrow = 10, ncol = 3)
colnames(learning_curve_2) = c("sample_percentage", "train", "test")
for (row in 1: 10){
learning_curve_2[row,1] = row * 0.1
learning_curve_2[row, 2] = mean(train2_RMSEs[row,])
learning_curve_2[row, 3] = mean(test2_RMSEs[row,])
}
learning_curve_2 = data.frame(learning_curve_2)
print (min(learning_curve_2[,3]))

## [1] 98407356

print (which.min(learning_curve_2[,3]))

## [1] 10

# plot
g <- ggplot(learning_curve_2, aes(trainingset_percentage))
g <- g + geom_line(aes(y=learning_curve_2[, 2]), colour="red")   # train RMSE
g <- g + geom_line(aes(y=learning_curve_2[, 3]), colour="green") # test RMSE
g = g + labs(title = "Task 2: Train RMSE vs Test RMSE")
g

Q: Explain which transformations you used and why you chose them.

A: I used a log-transformed budget and a binned tomatoRating. In Project 1, the log-transformed budget showed a close relationship with Gross, so I included it. I binned the tomato ratings to make the differences between rating levels more pronounced, hoping to improve the model.
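
One caveat with binning: `cut(x, 5)` derives five equal-width bins from the range of `x`, so calling it separately on the training and test sets can place the same rating in different bins. A minimal sketch of binning with shared, fixed edges follows; the ratings are made-up values and the 0-10 scale for `tomatoRating` is an assumption.

```r
# Toy ratings on a 0-10 scale (made-up values, not from movies_merged).
train_rating <- c(2.1, 5.4, 6.8, 8.9)
test_rating  <- c(3.3, 7.2, 9.5)

# Fixed edges, so a given rating always lands in the same bin.
rating_breaks <- seq(0, 10, by = 2)
train_bin <- cut(train_rating, breaks = rating_breaks,
                 include.lowest = TRUE, labels = 1:5)
test_bin  <- cut(test_rating, breaks = rating_breaks,
                 include.lowest = TRUE, labels = 1:5)

table(train_bin)
table(test_bin)
```

Because both factors carry the same five levels, a model fitted on the training bins can be applied to the test bins without level mismatches.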

Q: How did the RMSE change compared to Task 1?

A: In the runs shown above, the best mean test RMSE dropped from 100055610 (Task 1) to 98407356.

3. Non-numeric variables

Write code that converts genre, actors, directors, and other categorical variables to columns that can be used for regression (e.g. binary columns as you did in Project 1). Also process variables such as awards into more useful columns (again, like you did in Project 1). Now use these converted columns only to build your next model.

# TODO: Build & evaluate model 3 (converted non-numeric variables only)
selected = c("Gross", "Runtime", "imdbVotes", "tomatoRating","imdbRating","tomatoFresh","Budget", "Genre")
train3 = subset(train, select = selected)
train3 = na.omit(train3)
test3 = subset(test, select = selected)
test3 = na.omit(test3)

# convert genre to numeric values
total = rbind(train3, test3)
library(tm)
myCorpus <- Corpus(VectorSource(total$Genre))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
x = as.matrix(myTDM)
total = cbind(total, x)
total$Genre = NULL

# split train and test (test starts after the last train row so the sets do not overlap)
n_train = dim(train3)[1]
train3 = total[1:n_train,]
test3 = total[(n_train + 1):(dim(total)[1]),]

size = as.integer(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) * dim(train3)[1])
train3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)
test3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)

for (j in 1:10){
  train3 = train3[sample(nrow(train3)),]
  for (i in 1:10){
    cur = train3[1:size[i],]
    # fit model
    fit = lm(Gross~., data = cur, na.action = na.omit)
    # calculate train RMSE
    cur$pred_train3_gross = predict(fit, cur)
    cur$resid = (cur$pred_train3_gross - cur$Gross) * (cur$pred_train3_gross - cur$Gross)
    cur_RMSE = sqrt(sum(cur$resid) / dim(cur)[1])
    train3_RMSEs[i,j] = (cur_RMSE)
    # calculate test RMSE
    test3$pred_test3_gross = predict(fit, test3)
    test3$resid = (test3$pred_test3_gross - test3$Gross) * (test3$pred_test3_gross - test3$Gross)
    test3_RMSE = sqrt(sum(test3$resid) / dim(test3)[1])
    test3_RMSEs[i, j] = test3_RMSE
  }
}

# learning curve data build
trainingset_percentage = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
learning_curve_3 = matrix(c(0), nrow = 10, ncol = 3)
colnames(learning_curve_3) = c("sample_percentage", "train", "test")
for (row in 1: 10){
learning_curve_3[row,1] = row * 0.1
learning_curve_3[row, 2] = mean(train3_RMSEs[row,])
learning_curve_3[row, 3] = mean(test3_RMSEs[row,])
}
learning_curve_3 = data.frame(learning_curve_3)
print (min(learning_curve_3[,3]))

## [1] 98242300

print (which.min(learning_curve_3[,3]))

## [1] 10

# plot
g <- ggplot(learning_curve_3, aes(trainingset_percentage))
g <- g + geom_line(aes(y=learning_curve_3[, 2]), colour="red")   # train RMSE
g <- g + geom_line(aes(y=learning_curve_3[, 3]), colour="green") # test RMSE
g = g + labs(title = "Task 3: Train RMSE vs Test RMSE")
g

Q: Explain which categorical variables you used, and how you encoded them into features.

A: I used Genre and converted it to binary columns as described in Project 1. For each genre, a movie that belongs to that genre gets the value 1; otherwise it gets 0.
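
The same kind of encoding can be sketched in base R without `tm`. The rows below are made up, `Genre` is assumed to be a comma-separated string, and the `is_genre_` prefix is an illustrative naming choice.

```r
# Made-up rows for illustration; not taken from movies_merged.
toy <- data.frame(Genre = c("Action, Comedy", "Drama", "Comedy, Drama"),
                  stringsAsFactors = FALSE)

# Split each string into its genres and collect the distinct labels.
genre_list <- strsplit(toy$Genre, ", *")
genres <- sort(unique(unlist(genre_list)))

# One 0/1 column per genre: 1 if the movie lists that genre, else 0.
indicators <- sapply(genres, function(g) {
  as.integer(sapply(genre_list, function(gs) g %in% gs))
})
colnames(indicators) <- paste0("is_genre_", tolower(genres))

cbind(toy, indicators)
```

Encoding the training and test rows together, as done above with `rbind`, keeps the set of indicator columns identical in both.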

Q: What is the best mean test RMSE value you observed, and at what training set size? How does this compare with Task 2?

A: In the run shown above, the best mean test RMSE is 98242300, at 100% of the training set. This is a small improvement over Task 2 (98407356).

4. Numeric and categorical variables

Try to improve the prediction quality as much as possible by using both numeric and non-numeric variables from Tasks 2 & 3.

# TODO: Build & evaluate model 4 (numeric & converted non-numeric variables)
selected = c("Gross", "Runtime", "imdbVotes", "tomatoRating","imdbRating","tomatoFresh","Budget", "Genre")
train3 = subset(train, select = selected)
train3 = na.omit(train3)
test3 = subset(test, select = selected)
test3 = na.omit(test3)

# convert genre to numeric values
total = rbind(train3, test3)
library(tm)
myCorpus <- Corpus(VectorSource(total$Genre))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
x = as.matrix(myTDM)
total = cbind(total, x)
total$Genre = NULL

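# split train and test (test starts after the last train row so the sets do not overlap)
n_train = dim(train3)[1]
train3 = total[1:n_train,]
test3 = total[(n_train + 1):(dim(total)[1]),]

# add logBudget and bintomatoRating features to train and test data
# fixed bin edges (assuming tomatoRating is on a 0-10 scale) so train and test share the same bins
rating_breaks = seq(0, 10, by = 2)
train3$logBudget = log(train3$Budget)
train3$bintomatoRating = cut(train3$tomatoRating, breaks = rating_breaks, include.lowest=TRUE, labels=c(1,2,3,4,5))
test3$logBudget = log(test3$Budget)
test3$bintomatoRating = cut(test3$tomatoRating, breaks = rating_breaks, include.lowest=TRUE, labels=c(1,2,3,4,5))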

size = as.integer(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) * dim(train3)[1])
train3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)
test3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)

for (j in 1:10){
  train3 = train3[sample(nrow(train3)),]
  for (i in 1:10){
    cur = train3[1:size[i],]
    # fit model
    fit = lm(Gross~., data = cur, na.action = na.omit)
    # calculate train RMSE
    cur$pred_train3_gross = predict(fit, cur)
    cur$resid = (cur$pred_train3_gross - cur$Gross) * (cur$pred_train3_gross - cur$Gross)
    cur_RMSE = sqrt(sum(cur$resid) / dim(cur)[1])
    train3_RMSEs[i,j] = (cur_RMSE)
    # calculate test RMSE
    test3$pred_test3_gross = predict(fit, test3)
    test3$resid = (test3$pred_test3_gross - test3$Gross) * (test3$pred_test3_gross - test3$Gross)
    test3_RMSE = sqrt(sum(test3$resid) / dim(test3)[1])
    test3_RMSEs[i, j] = test3_RMSE
  }
}

# learning curve data build
trainingset_percentage = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
learning_curve_3 = matrix(c(0), nrow = 10, ncol = 3)
colnames(learning_curve_3) = c("sample_percentage", "train", "test")
for (row in 1: 10){
learning_curve_3[row,1] = row * 0.1
learning_curve_3[row, 2] = mean(train3_RMSEs[row,])
learning_curve_3[row, 3] = mean(test3_RMSEs[row,])
}
learning_curve_3 = data.frame(learning_curve_3)
print (min(learning_curve_3[,3]))

## [1] 96640133

print (which.min(learning_curve_3[,3]))

## [1] 10

# plot
g <- ggplot(learning_curve_3, aes(trainingset_percentage))
g <- g + geom_line(aes(y=learning_curve_3[, 2]), colour="red")   # train RMSE
g <- g + geom_line(aes(y=learning_curve_3[, 3]), colour="green") # test RMSE
g = g + labs(title = "Task 4: Train RMSE vs Test RMSE")
g

Q: Compare the observed RMSE with Tasks 2 & 3.

A: In the runs shown above, the best mean test RMSE improved from 98407356 (Task 2) and 98242300 (Task 3) to 96640133.

5. Additional features

Now try creating additional features such as interactions (e.g. is_genre_comedy x is_budget_greater_than_3M) or deeper analysis of complex variables (e.g. text analysis of full-text columns like Plot).

# TODO: Build & evaluate model 5 (numeric, non-numeric and additional features)
selected = c("Gross", "Runtime", "imdbVotes", "tomatoRating","imdbRating","tomatoFresh","Budget", "Genre")
train3 = subset(train, select = selected)
train3 = na.omit(train3)
test3 = subset(test, select = selected)
test3 = na.omit(test3)

# convert genre to numeric values
total = rbind(train3, test3)
library(tm)
myCorpus <- Corpus(VectorSource(total$Genre))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
x = as.matrix(myTDM)
total = cbind(total, x)
total$Genre = NULL

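# split train and test (test starts after the last train row so the sets do not overlap)
n_train = dim(train3)[1]
train3 = total[1:n_train,]
test3 = total[(n_train + 1):(dim(total)[1]),]

# add logBudget and bintomatoRating features to train and test data
# fixed bin edges (assuming tomatoRating is on a 0-10 scale) so train and test share the same bins
rating_breaks = seq(0, 10, by = 2)
train3$logBudget = log(train3$Budget)
train3$bintomatoRating = cut(train3$tomatoRating, breaks = rating_breaks, include.lowest=TRUE, labels=c(1,2,3,4,5))
test3$logBudget = log(test3$Budget)
test3$bintomatoRating = cut(test3$tomatoRating, breaks = rating_breaks, include.lowest=TRUE, labels=c(1,2,3,4,5))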

# add additional features
train3$test <- ifelse(train3$imdbRating > 6 & train3$Budget > 4000000, 1, 0)
test3$test <- ifelse(test3$imdbRating > 6 & test3$Budget > 4000000, 1, 0)

size = as.integer(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) * dim(train3)[1])
train3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)
test3_RMSEs = matrix(c(0), nrow = 10, ncol = 10)

for (j in 1:10){
  train3 = train3[sample(nrow(train3)),]
  for (i in 1:10){
    cur = train3[1:size[i],]
    # fit model
    fit = lm(Gross~., data = cur, na.action = na.omit)
    # calculate train RMSE
    cur$pred_train3_gross = predict(fit, cur)
    cur$resid = (cur$pred_train3_gross - cur$Gross) * (cur$pred_train3_gross - cur$Gross)
    cur_RMSE = sqrt(sum(cur$resid) / dim(cur)[1])
    train3_RMSEs[i,j] = (cur_RMSE)
    # calculate test RMSE
    test3$pred_test3_gross = predict(fit, test3)
    test3$resid = (test3$pred_test3_gross - test3$Gross) * (test3$pred_test3_gross - test3$Gross)
    test3_RMSE = sqrt(sum(test3$resid) / dim(test3)[1])
    test3_RMSEs[i, j] = test3_RMSE
  }
}

# learning curve data build
trainingset_percentage = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
learning_curve_3 = matrix(c(0), nrow = 10, ncol = 3)
colnames(learning_curve_3) = c("sample_percentage", "train", "test")
for (row in 1: 10){
learning_curve_3[row,1] = row * 0.1
learning_curve_3[row, 2] = mean(train3_RMSEs[row,])
learning_curve_3[row, 3] = mean(test3_RMSEs[row,])
}
learning_curve_3 = data.frame(learning_curve_3)
print (min(learning_curve_3[,3]))

## [1] 96725342

print (which.min(learning_curve_3[,3]))

## [1] 10

# plot
g <- ggplot(learning_curve_3, aes(trainingset_percentage))
g <- g + geom_line(aes(y=learning_curve_3[, 2]), colour="red")   # train RMSE
g <- g + geom_line(aes(y=learning_curve_3[, 3]), colour="green") # test RMSE
g = g + labs(title = "Task 5: Train RMSE vs Test RMSE")
g

Q: Explain what new features you designed and why you chose them.

A: The new feature is a boolean value. If a movie has an imdbRating higher than 6 and a budget above 4M, the new feature is 1; otherwise it is 0. This gives a cutoff between good and bad movies, which should help the modeling.
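
An interaction of the kind suggested in the task description (e.g. is_genre_comedy x is_budget_greater_than_3M) can also be expressed directly in the model formula. The sketch below uses synthetic data; `is_comedy`, `is_big_budget` and the coefficients used to generate `Gross` are assumptions chosen only to make the example run.

```r
# Synthetic data; the columns stand in for the real ones.
set.seed(42)
toy <- data.frame(is_comedy = rbinom(100, 1, 0.5),
                  Budget = runif(100, 1e6, 6e6))
toy$is_big_budget <- as.integer(toy$Budget > 3e6)
toy$Gross <- 2 * toy$Budget +
  5e6 * toy$is_comedy * toy$is_big_budget +
  rnorm(100, sd = 5e5)

# `a:b` adds only the product term; `a*b` would add both main effects as well.
fit <- lm(Gross ~ Budget + is_comedy:is_big_budget, data = toy)
coef(fit)
```

Writing the interaction in the formula avoids having to create the product column by hand in both the training and test sets.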

Q: Comment on the final RMSE values you obtained, and what you learned through the course of this project.

A: The final RMSE obtained in the run shown above is 96725342. I learned from this project that feature engineering is very important: choosing good features improves the prediction. Sample size also matters; in general, a larger training set leads to better predictions.