---
title: 'Project 2: Modeling and Evaluation'
subtitle: "CSE6242 - Data and Visual Analytics - Fall 2017\n\nDue: Sunday, November 26, 2017 at 11:59 PM UTC-12:00 on T-Square"
output:
  html_notebook:
    code_folding: none
    theme: default
  html_document:
    code_folding: none
    theme: default
  pdf_document: default
---

# Data

We will use the same dataset as Project 1: [`movies_merged`](https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged).

# Objective

Your goal in this project is to build a linear regression model that can predict the `Gross` revenue earned by a movie based on other variables. You may use R packages to fit and evaluate a regression model (no need to implement regression yourself). Please stick to linear regression, however.

# Instructions

You should be familiar with using an [RMarkdown](http://rmarkdown.rstudio.com) Notebook by now. Remember to open it in RStudio; you can run a code chunk by pressing *Cmd+Shift+Enter* (*Ctrl+Shift+Enter* on Windows and Linux).

Please complete the tasks below and submit this R Markdown file (as **pr2.Rmd**) containing all completed code chunks and written responses, and a PDF export of it (as **pr2.pdf**) which should include the outputs and plots as well.

_Note that the **Setup** and **Data Preprocessing** steps do not carry any points; however, they must be completed as instructed in order to get meaningful results._

# Setup

As in Project 1, load the dataset into memory:

```{r}
load('movies_merged')
```

This creates an object of the same name (`movies_merged`). For convenience, you can copy it to `df` and start using it:

```{r}
df <- movies_merged
# Report the size of the raw dataset
cat("Dataset has", nrow(df), "rows and", ncol(df), "columns\n")
colnames(df)
```

## Load R packages

Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.

```{r}
library(ggplot2)
```

If you are using any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular `install.packages('<pkg name>')` command.

**Non-standard packages used**: None

# Data Preprocessing

Before we start building models, we should clean up the dataset and perform any preprocessing steps that may be necessary. Some of these steps can be copied in from your Project 1 solution. It may be helpful to print the dimensions of the resulting dataframe at each step.

## 1. Remove non-movie rows

```{r}
# TODO: Remove all rows from df that do not correspond to movies
```

## 2. Drop rows with missing `Gross` value

Since our goal is to model `Gross` revenue against other variables, rows that have missing `Gross` values are not useful to us.

```{r}
# TODO: Remove rows with missing Gross value
```

## 3. Exclude movies released prior to 2000

Inflation and other global financial factors may affect the revenue earned by movies during certain periods of time. Taking that into account is out of scope for this project, so let's exclude all movies that were released prior to the year 2000 (you may use `Released`, `Date` or `Year` for this purpose).
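
Steps 1 to 3 are all row filters, so they can follow the same pattern. The sketch below runs on a small made-up data frame (`toy`) rather than on `df`; the column name `Type` and the value `"movie"` are assumptions carried over from Project 1, so check them against `colnames(df)` before reusing the idea.

```r
# Made-up rows standing in for the real dataset (values are illustrative only).
toy <- data.frame(
  Type  = c("movie", "series", "movie", "movie"),
  Gross = c(1e6, NA, NA, 5e7),
  Year  = c(1999, 2005, 2010, 2015),
  stringsAsFactors = FALSE
)

# 1. Keep only rows that describe movies (assumes a `Type` column).
toy <- toy[toy$Type == "movie", ]

# 2. Drop rows whose Gross value is missing.
toy <- toy[!is.na(toy$Gross), ]

# 3. Keep movies from the year 2000 onwards.
toy <- toy[toy$Year >= 2000, ]

# Printing the size after each step makes it easy to spot an over-eager filter.
cat("Rows remaining:", nrow(toy), "\n")
```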

```{r}
# TODO: Exclude movies released prior to 2000
```

## 4. Eliminate mismatched rows

_Note: You may compare the `Released` column (string representation of the release date) with either `Year` or `Date` (numeric representation of the year) to find mismatches. Choose a matching rule that removes no more than 10% of the rows._
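
One possible matching rule, sketched on made-up data: extract the year from `Released` and keep a row when that year is within a small tolerance of `Year`, or when `Released` is missing. The `toy` data frame, the `YYYY-MM-DD` date format, and the one-year tolerance are all assumptions for illustration, not requirements of the assignment.

```r
# Made-up rows; Released is stored as a string in YYYY-MM-DD form (assumed).
toy <- data.frame(
  Title    = c("A", "B", "C", "D"),
  Year     = c(2001, 2005, 2010, 2012),
  Released = c("2001-06-15", "2006-01-10", "2003-03-01", NA),
  stringsAsFactors = FALSE
)

# Year component of the release date (NA when Released is missing).
released_year <- as.numeric(format(as.Date(toy$Released), "%Y"))

# Assumed tolerance in years; loosen or tighten it to control how many rows are removed.
tolerance <- 1

# Keep rows without a release date, and rows whose two years roughly agree.
keep <- is.na(released_year) | abs(released_year - toy$Year) <= tolerance

toy_clean <- toy[keep, ]
removed_fraction <- 1 - nrow(toy_clean) / nrow(toy)
cat("Fraction of rows removed:", removed_fraction, "\n")
```

Checking `removed_fraction` on the real data shows directly whether the chosen rule stays under the 10% limit.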

```{r}
# TODO: Remove mismatched rows
```

## 5. Drop `Domestic_Gross` column

`Domestic_Gross` is the revenue a movie earned within the US. Understandably, it is very highly correlated with `Gross`, and is in fact equal to it for movies that were not released globally. Hence, it should be removed for modeling purposes.

```{r}
# TODO: Exclude the `Domestic_Gross` column
```

## 6. Process `Runtime` column

```{r}
# TODO: Replace df$Runtime with a numeric column containing the runtime in minutes
```
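
A minimal sketch of one way to do the conversion, assuming the raw values look like `"1 h 30 min"`, `"95 min"` or `"N/A"` (verify this against `unique(df$Runtime)` first). The helper name `runtime_to_minutes` is hypothetical.

```r
# Hypothetical helper: convert runtime strings to minutes.
# Values with neither an hour nor a minute component (e.g. "N/A") become NA.
runtime_to_minutes <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""

  # Pull out the number that precedes a given unit, or NA if the unit is absent.
  extract <- function(pattern) {
    m <- regmatches(x, regexec(pattern, x))
    vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else NA_real_,
           numeric(1))
  }

  hours   <- extract("([0-9]+) *h")
  minutes <- extract("([0-9]+) *min")

  total <- ifelse(is.na(hours), 0, hours) * 60 +
    ifelse(is.na(minutes), 0, minutes)
  total[is.na(hours) & is.na(minutes)] <- NA_real_
  total
}

example_minutes <- runtime_to_minutes(c("1 h 30 min", "95 min", "2 h", "N/A"))
print(example_minutes)
```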

Perform any additional preprocessing steps that you find necessary, such as dealing with missing values or highly correlated columns (feel free to add more code chunks, markdown blocks and plots here as necessary).

```{r}
# TODO(optional): Additional preprocessing
```

_**Note**: Do NOT convert categorical variables (like `Genre`) into binary columns yet. You will do that later as part of a model improvement task._

## Final preprocessed dataset

Report the dimensions of the preprocessed dataset you will be using for modeling and evaluation, and print all the final column names. (Again, `Domestic_Gross` should not be in this list!)

```{r}
# TODO: Print the dimensions of the final preprocessed dataset and column names
```

# Evaluation Strategy

In each of the tasks described in the next section, you will build a regression model. In order to compare their performance, you will compute the training and test Root Mean Squared Error (RMSE) at different training set sizes.

First, randomly sample 10-20% of the preprocessed dataset and keep that aside as the **test set**. Do not use these rows for training! The remainder of the preprocessed dataset is your **training data**.

Now use the following evaluation procedure for each model:

- Choose a suitable sequence of training set sizes, e.g. 10%, 20%, 30%, ..., 100% (10-20 different sizes should suffice). For each size, sample that many rows from the training data, train your model on the sample, and compute the resulting training and test RMSE.
- Repeat your training and evaluation at least 10 times at each training set size, and average the RMSE results for stability.
- Generate a graph of the averaged train and test RMSE values as a function of the train set size (%), with optional error bars.

You can define a helper function that applies this procedure to a given set of features and reuse it.
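
One way such a helper might look is sketched below. The names `rmse` and `learning_curve`, the synthetic `toy` data, and the 20% test split are illustrative assumptions; adapt them to your own features and preprocessing.

```r
# Root Mean Squared Error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical helper: fit lm() on random subsets of `train` at each fraction,
# repeat `repeats` times (use at least 2), and average train and test RMSE.
learning_curve <- function(formula, train, test,
                           fractions = seq(0.1, 1, by = 0.1), repeats = 10) {
  response <- all.vars(formula)[1]
  rows <- lapply(fractions, function(f) {
    n <- max(2, round(f * nrow(train)))
    errs <- replicate(repeats, {
      sampled <- train[sample(nrow(train), n), , drop = FALSE]
      fit <- lm(formula, data = sampled)
      c(train = rmse(sampled[[response]], predict(fit, sampled)),
        test  = rmse(test[[response]], predict(fit, test)))
    })
    data.frame(train_pct  = 100 * f,
               train_rmse = mean(errs["train", ]),
               test_rmse  = mean(errs["test", ]))
  })
  do.call(rbind, rows)
}

# Synthetic data with a known linear relationship and noise of sd 1.
set.seed(42)
n <- 500
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n, sd = 1)

# Hold out 20% as the test set; the rest is training data.
test_idx <- sample(n, size = round(0.2 * n))
test  <- toy[test_idx, ]
train <- toy[-test_idx, ]

results <- learning_curve(y ~ x, train, test)
print(results)
```

The result has one row per training set size, so it can be passed straight to `ggplot()` with one `geom_line()` per RMSE column. Keeping the per-repeat values instead of only their mean would also allow error bars.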

# Tasks

Each of the following tasks is worth 20 points, for a total of 100 points for this project. Remember to build each model as specified, evaluate it using the strategy outlined above, and plot the training and test errors by training set size (%).

## 1. Numeric variables

Use Linear Regression to predict `Gross` based on available _numeric_ variables. You can choose to include all or a subset of them.

```{r}
# TODO: Build & evaluate model 1 (numeric variables only)
```

**Q**: List the numeric variables you used.

**A**: 


**Q**: What is the best mean test RMSE value you observed, and at what training set size?

**A**: 


## 2. Feature transformations

Try to improve the prediction quality from **Task 1** as much as possible by adding feature transformations of the numeric variables. Explore both numeric transformations, such as power transforms, and non-numeric transformations of the numeric variables, such as binning (e.g. `is_budget_greater_than_3M`).
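
As a rough illustration of both kinds of transformation, the sketch below builds a square-root column, a log column, and a binary bin from a synthetic `Budget` variable, then compares training RMSE with and without them. The data-generating formula is invented purely so the example runs; it says nothing about the real dataset.

```r
# Synthetic budgets spread log-uniformly between 100K and 100M (made up).
set.seed(1)
toy <- data.frame(Budget = round(exp(runif(200, log(1e5), log(1e8)))))
toy$Gross <- 3 * toy$Budget^0.9 * exp(rnorm(200, sd = 0.5))

# Numeric transformations: a power transform and a log transform.
toy$sqrt_budget <- sqrt(toy$Budget)
toy$log_budget  <- log1p(toy$Budget)

# Non-numeric transformation: a binary bin of the numeric variable.
toy$is_budget_greater_than_3M <- as.integer(toy$Budget > 3e6)

fit_raw <- lm(Gross ~ Budget, data = toy)
fit_transformed <- lm(
  Gross ~ Budget + sqrt_budget + log_budget + is_budget_greater_than_3M,
  data = toy
)

# Training RMSE of a fitted model.
train_rmse <- function(fit) sqrt(mean(residuals(fit)^2))
rmse_raw         <- train_rmse(fit_raw)
rmse_transformed <- train_rmse(fit_transformed)
cat("Raw:", rmse_raw, " Transformed:", rmse_transformed, "\n")
```

Because the second model contains every term of the first, its training RMSE cannot be higher; whether the extra columns actually help is only visible in the test RMSE from the evaluation procedure.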

```{r}
# TODO: Build & evaluate model 2 (transformed numeric variables only)
```

**Q**: Explain which transformations you used and why you chose them.

**A**: 


**Q**: How did the RMSE change compared to Task 1?

**A**: 


## 3. Non-numeric variables

Write code that converts genre, actors, directors, and other categorical variables to columns that can be used for regression (e.g. binary columns as you did in Project 1). Also process variables such as awards into more useful columns (again, like you did in Project 1). Now use these converted columns only to build your next model.
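
The sketch below shows one possible encoding on made-up rows: comma-separated `Genre` strings become one binary column per genre, and `Awards` text is reduced to two counts. The `Awards` phrasing (`"... 10 wins & 5 nominations."`) and the helper name `sum_numbers_before` are assumptions; inspect the real column before relying on the pattern.

```r
# Made-up rows; the string formats are assumed, not taken from the dataset.
toy <- data.frame(
  Genre  = c("Comedy, Drama", "Action", "Drama, Action, Thriller", "N/A"),
  Awards = c("Won 2 Oscars. Another 10 wins & 5 nominations.",
             "3 nominations.", "N/A", "1 win."),
  stringsAsFactors = FALSE
)

# Split each Genre string into its parts and collect the distinct genres.
genre_lists <- strsplit(toy$Genre, ",\\s*")
genres <- setdiff(sort(unique(unlist(genre_lists))), "N/A")

# One 0/1 column per genre: 1 if the movie lists that genre.
genre_cols <- sapply(genres, function(g) {
  as.integer(vapply(genre_lists, function(gs) g %in% gs, logical(1)))
})
colnames(genre_cols) <- paste0("is_genre_", tolower(genres))

# Hypothetical helper: sum every number that directly precedes `word`.
sum_numbers_before <- function(text, word) {
  pattern <- paste0("[0-9]+(?= ", word, ")")
  m <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  vapply(m, function(x) sum(as.numeric(x)), numeric(1))
}

# Note: this ignores phrases such as "Won 2 Oscars"; handle those separately if needed.
toy$wins        <- sum_numbers_before(toy$Awards, "win")
toy$nominations <- sum_numbers_before(toy$Awards, "nomination")

encoded <- cbind(toy[, c("wins", "nominations")], genre_cols)
print(encoded)
```

For high-cardinality columns such as actors and directors, the same approach produces a very large number of columns, so it may be worth keeping only the most frequent values.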

```{r}
# TODO: Build & evaluate model 3 (converted non-numeric variables only)
```

**Q**: Explain which categorical variables you used, and how you encoded them into features.

**A**: 


**Q**: What is the best mean test RMSE value you observed, and at what training set size? How does this compare with Task 2?

**A**: 


## 4. Numeric and categorical variables

Try to improve the prediction quality as much as possible by using both numeric and non-numeric variables from **Tasks 2 & 3**.

```{r}
# TODO: Build & evaluate model 4 (numeric & converted non-numeric variables)
```

**Q**: Compare the observed RMSE with Tasks 2 & 3.

**A**: 


## 5. Additional features

Now try creating additional features such as interactions (e.g. `is_genre_comedy` x `is_budget_greater_than_3M`) or deeper analysis of complex variables (e.g. text analysis of full-text columns like `Plot`).
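
An interaction between two binary features is simply their product. The sketch below, on synthetic data, builds the example interaction by hand and checks it against the column that R's formula syntax generates; the column name `comedy_x_big_budget` is made up for illustration.

```r
# Synthetic binary genre flag and budget (values are made up).
set.seed(7)
n <- 300
toy <- data.frame(
  is_genre_comedy = rbinom(n, 1, 0.4),
  Budget = round(exp(runif(n, log(1e5), log(1e8))))
)
toy$is_budget_greater_than_3M <- as.integer(toy$Budget > 3e6)

# Manual interaction: 1 only for comedies with a budget above 3M.
toy$comedy_x_big_budget <-
  toy$is_genre_comedy * toy$is_budget_greater_than_3M

# The formula operator `*` expands to both main effects plus their interaction,
# so lm() can build the same column without it being stored in the data frame.
X <- model.matrix(~ is_genre_comedy * is_budget_greater_than_3M, data = toy)
interaction_col <- X[, "is_genre_comedy:is_budget_greater_than_3M"]

cat("Manual and formula interactions agree:",
    all(interaction_col == toy$comedy_x_big_budget), "\n")
```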

```{r}
# TODO: Build & evaluate model 5 (numeric, non-numeric and additional features)
```

**Q**: Explain what new features you designed and why you chose them.

**A**: 


**Q**: Comment on the final RMSE values you obtained, and what you learned through the course of this project.

**A**:

