Note: This project involves getting data ready for analysis and doing some preliminary investigations. Project 2 will involve modeling and predictions on the same dataset, and will be released at a later date. Both projects will have equal weightage towards your grade. You may reuse some of the preprocessing/analysis steps from Project 1 in Project 2.

Data

In this project, you will explore a dataset that contains information about movies, including ratings, budget, gross revenue and other attributes. It was prepared by Dr. Guy Lebanon, and here is his description of the dataset:

The file movies_merged contains a dataframe with the same name that has 40K rows and 39 columns. Each row represents a movie title and each column represents a descriptor such as Title, Actors, and Budget. I collected the data by querying IMDb’s API (see www.omdbapi.com) and joining it with a separate dataset of movie budgets and gross earnings (unknown to you). The join key was the movie title. This data is available for personal use, but IMDb’s terms of service do not allow it to be used for commercial purposes or for creating a competing repository.

Objective

Your goal is to investigate the relationship between the movie descriptors and the box office success of movies, as represented by the variable Gross. This task is extremely important as it can help a studio decide which titles to fund for production, how much to bid on produced movies, when to release a title, how much to invest in marketing and PR, etc. This information is most useful before a title is released, but it is still very valuable after the movie is already released to the public (for example it can affect additional marketing spend or how much a studio should negotiate with on-demand streaming companies for “second window” streaming rights).

Instructions

This is an R Markdown Notebook. Open this file in RStudio to get started.

When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

x = 1:10
print(x^2)
 [1]   1   4   9  16  25  36  49  64  81 100

Plots appear inline too:

plot(x, x^2, 'o')

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I. Enter some R code and run it.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

Please complete all the tasks below by implementing code chunks that have a TODO comment in them, running all code chunks so that output and plots are displayed, and typing in answers to each question (Q: …) next to/below the corresponding answer prompt (A:). Feel free to add code chunks/show additional output to support any of the answers.

When you are done, you will need to submit the final R markdown file (as pr1.Rmd) with all code chunks implemented and executed, and all text responses written in. You also need to submit a PDF export of the markdown file (as pr1.pdf), which should show your code, output, plots and written responses–this will be your project report. Compress these two files into a single .zip archive and upload it on T-Square.

Setup

Load data

Make sure you’ve downloaded the movies_merged file and it is in the current working directory. Now load it into memory:

load('movies_merged')
cat("Dataset has", dim(movies_merged)[1], "rows and", dim(movies_merged)[2], "columns", end="\n", file="")
Dataset has 40789 rows and 39 columns 

This creates an object of the same name (movies_merged). For convenience, you can copy it to df and start using it:

df = movies_merged
cat("Column names:", end="\n", file="")
Column names: 

Load R packages

Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.

library(ggplot2)
library(GGally)

If you are loading any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular install.packages('<pkg name>') command.

Non-standard packages used: None

Tasks

Each task below is worth 10 points, and is meant to be performed sequentially, i.e. do step 2 after you have processed the data as described in step 1. Total points: 100

Complete each task by implementing code chunks as described by TODO comments, and by responding to questions (“Q:”) with written answers (“A:”). If you are unable to find a meaningful or strong relationship in any of the cases when requested, explain why not by referring to appropriate plots/statistics.

It is okay to handle missing values below by omission, but please omit as little as possible. It is worthwhile to invest in reusable and clear code as you may need to use it or modify it in project 2.

1. Remove non-movie rows

The variable Type captures whether the row is a movie, a TV series, or a game. Remove all rows from df that do not correspond to movies.

# TODO: Remove all rows from df that do not correspond to movies
df2 <- df[df$Type == "movie",]
df = df2

Q: How many rows are left after removal? Enter your response below.

A: 40000 rows are left

2. Process Runtime column

The variable Runtime represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace df$Runtime with the new numeric column.

# TODO: Replace df$Runtime with a numeric column containing the runtime in minutes
rows_num = dim(df)[1]
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df[i, "Runtime"], split=" "))
  if (content == "N/A"){
    df[i,"Runtime"] = 0 
    next()
  }
    
  cur = 0
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1],1, 1) == "h"){
      cur = cur + as.numeric(content[j]) * 60
    }
    
    if (substr(content[j + 1],1, 3) == "min"){
      cur = cur + as.numeric(content[j]) 
    }
  }
  df[i,"Runtime"]= cur
}
df$Runtime =as.numeric((df$Runtime))

Now investigate the distribution of Runtime values and how it changes over years (variable Year, which you can bucket into decades) and in relation to the budget (variable Budget). Include any plots that illustrate.

# TODO: Investigate the distribution of Runtime values and how it varies by Year and Budget
# The distribution of Runtime values
qplot(x = Runtime, data = df, binwidth = 10) 

# Runtime vs. Budget 
qplot(df$Runtime, log(df$Budget))

# Runtime vs. Year
df$Decades = (df$Year %/% 10) * 10
ggplot(df, aes(reorder(Decades, Decades, median), Runtime) ) + geom_boxplot() + scale_x_discrete("") 

Feel free to insert additional code chunks as necessary.

Q: Comment on the distribution as well as relationships. Are there any patterns or trends that you can observe?

A: 1. The Runtime distribution is “Bi model normal distribution”, there are outliers which is much longer than average. 2. Median of Runtime increaded in later years. 3. As Runtime increases, Budget maybe increases, but very scattered. The positive relationship is not very obvious.

3. Encode Genre column

The column Genre represents a list of genres associated with the movie in a string format. Write code to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence, and add it to the dataframe as additional columns. Then remove the original Genre column.

For example, if there are a total of 3 genres: Drama, Comedy, and Action, a movie that is both Action and Comedy should be represented by a binary vector <0, 1, 1>. Note that you need to first compile a dictionary of all possible genres and then figure out which movie has which genres (you can use the R tm package to create the dictionary).

# TODO: Replace Genre with a collection of binary columns
library(tm)
myCorpus <- Corpus(VectorSource(df$Genre))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
x = as.matrix(myTDM)
x
# Add processed Genre data to origional df
df = cbind(df, x)
# drop Genre from df
df$Genre <- NULL

Plot the relative proportions of movies having the top 10 most common genres.

# TODO: Select movies from top 10 most common genres and plot their relative proportions
# create a df to record frequency
df.count <- data.frame(matrix(ncol = 2, nrow = 30))
colnames(df.count) = c("genre", "count")
# fill in frequency
totalgenre = colnames(x)
i = 1
for (g in totalgenre){
  cur = sum(df[g])
  df.count[i,] = c(g, cur) 
  i = i + 1
}
df.count$frequency <- as.numeric(df.count$count) / dim(df)[1]
df.count[rev(order(df.count$frequency)),]
# plot relative proportions
df.count = df.count[rev(order(df.count$frequency)),]
qplot(x =df.count$genre[1:10], y =df.count$frequency[1:10])

Examine how the distribution of Runtime changes across genres for the top 10 most common genres.

# TODO: Plot Runtime distribution for top 10 most common genres
totalgenre = df.count$genre[1:10]
print(totalgenre)
 [1] "drama"       "comedy"      "short"       "romance"     "action"      "crime"      
 [7] "thriller"    "documentary" "adventure"   "animation"  
for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Runtime)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}
[1] "drama"
[1] 96
[1] "comedy"
[1] 90
[1] "short"
[1] 10
[1] "romance"
[1] 96
[1] "action"
[1] 95
[1] "crime"
[1] 95
[1] "thriller"
[1] 95
[1] "documentary"
[1] 72
[1] "adventure"
[1] 95
[1] "animation"
[1] 7

dim(df)
[1] 40000    69
for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Runtime)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}
[1] "drama"
[1] 96
[1] "comedy"
[1] 90
[1] "short"
[1] 10
[1] "romance"
[1] 96
[1] "action"
[1] 95
[1] "crime"
[1] 95
[1] "thriller"
[1] 95
[1] "documentary"
[1] 72
[1] "adventure"
[1] 95
[1] "animation"
[1] 7

dim(df)
[1] 40000    69

Q: Describe the interesting relationship(s) you observe. Are there any expected or unexpected trends that are evident?

A: As expected, for each genre, the movies run time are generally normally distributed, even though some are skewed, some are biomodeal with smaller bell shape in shorter runtime. The “short” has the lowest runtime, the drama has the longest, which is expected.

Some outliers (600min) is not exprected. They are pretty long movies. Biomodel normal distribution are not expected. I have not think out why there are two peaks.

4. Eliminate mismatched rows

The dataframe was put together by merging two different sources of data and it is possible that the merging process was inaccurate in some cases (the merge was done based on movie title, but there are cases of different movies with the same title). There are 3 columns that contain date information: Year (numeric year), Date (numeric year), and Released (string representation of the release date).

Find and remove all rows where you suspect a merge error occurred based on a mismatch between these variables. To make sure subsequent analysis and modeling work well, avoid removing more than 10% of the rows that have a Gross value present.

Note: Do not remove the rows with Gross == NA at this point, just use this a guideline.

# TODO: Remove rows with Year/Date/Released mismatch
dim(df)
[1] 40000    69
X= data.frame(do.call('rbind', strsplit(as.character(df$Released),'-',fixed=TRUE)))
df$releasedYear = X$X1
# Remove all whose Gross is NA
df_validGross = df[!is.na(df$Gross), ]
dim(df_validGross)
[1] 4558   70
# Keep the ones that are correct
df_validYearDate = df_validGross[abs(df_validGross$Year - df_validGross$Date) <= 1,]
#View(df_validYearDate[, c("Gross", "Year", "Date", "releasedYear")])
df_validYearDate$releasedYear = as.numeric(as.character(df_validYearDate$releasedYear))
df_validYearReleasedYear = df_validYearDate[abs(df_validYearDate$Year - df_validYearDate$releasedYear) <= 1,]
dim(df_validYearDate)
[1] 4431   70
dim(df_validYearReleasedYear)
[1] 4385   70

Q: What is your precise removal logic, and how many rows remain in the resulting dataset?

A: 1.For the same movies, the “Released”, “Date” and “Year” should show less than one year difference. If in any row that their year information in the above three columns are not the same, that indicates a mismatch, and should be removed. 2. There are 3518 rows left after removal, which has gross value.

5. Explore Gross revenue

For the commercial success of a movie, production houses want to maximize Gross revenue. Investigate if Gross revenue is related to Budget, Runtime or Genre in any way.

Note: To get a meaningful relationship, you may have to partition the movies into subsets such as short vs. long duration, or by genre, etc.

# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre
# Gross Revenue vs Budget
qplot(log(df_validYearReleasedYear$Budget), log(df_validYearReleasedYear$Gross))

# Gross Revenue vs Runtime
qplot(df_validYearReleasedYear$Runtime, df_validYearReleasedYear$Gross)

# Gross Revenue vs Genre
for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Gross)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  #print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}
[1] "drama"
[1] 96
[1] "comedy"
[1] 90
[1] "short"
[1] 10
[1] "romance"
[1] 96
[1] "action"
[1] 95
[1] "crime"
[1] 95
[1] "thriller"
[1] 95
[1] "documentary"
[1] 72
[1] "adventure"
[1] 95
[1] "animation"
[1] 7

Q: Did you find any observable relationships or combinations of Budget/Runtime/Genre that result in high Gross revenue? If you divided the movies into different subsets, you may get different answers for them - point out interesting ones.

A: Budget increase as runtime increase The runtime for each Genere are different.Such as short is much shorter than others.

# TODO: Investigate if Gross Revenue is related to Release Month
X= data.frame(do.call('rbind', strsplit(as.character(df_validYearReleasedYear$Released),'-',fixed=TRUE)))
month = X$X2
transformedGross = log(df_validYearReleasedYear$Gross)
df_validYearReleasedYear = cbind(df_validYearReleasedYear, month, transformedGross)
names(df_validYearReleasedYear)
 [1] "Title"             "Year"              "Rated"             "Released"         
 [5] "Runtime"           "Director"          "Writer"            "Actors"           
 [9] "Plot"              "Language"          "Country"           "Awards"           
[13] "Poster"            "Metascore"         "imdbRating"        "imdbVotes"        
[17] "imdbID"            "Type"              "tomatoMeter"       "tomatoImage"      
[21] "tomatoRating"      "tomatoReviews"     "tomatoFresh"       "tomatoRotten"     
[25] "tomatoConsensus"   "tomatoUserMeter"   "tomatoUserRating"  "tomatoUserReviews"
[29] "tomatoURL"         "DVD"               "BoxOffice"         "Production"       
[33] "Website"           "Response"          "Budget"            "Domestic_Gross"   
[37] "Gross"             "Date"              "Decades"           "biography"        
[41] "documentary"       "romance"           "short"             "thriller"         
[45] "drama"             "war"               "comedy"            "horror"           
[49] "sci"               "adventure"         "family"            "history"          
[53] "crime"             "action"            "music"             "mystery"          
[57] "fantasy"           "sport"             "animation"         "musical"          
[61] "show"              "talk"              "adult"             "western"          
[65] "film"              "noir"              "reality"           "news"             
[69] "game"              "releasedYear"      "month"             "transformedGross" 
ggplot(df_validYearReleasedYear, aes(reorder(month, -transformedGross, median), transformedGross) ) + geom_boxplot() + coord_flip() + scale_x_discrete("transformedGross") 

6. Process Awards column

The variable Awards describes nominations and awards in text format. Convert it to 2 numeric columns, the first capturing the number of wins, and the second capturing nominations. Replace the Awards column with these new columns, and then study the relationship of Gross revenue with respect to them.

Note: The format of the Awards column is not standard; you may have to use regular expressions to find the relevant values. Try your best to process them, and you may leave the ones that don’t have enough information as NAs or set them to 0s.

# TODO: Convert Awards to 2 numeric columns: wins and nominations
rows_num = dim(df_validYearReleasedYear)[1]
df_validYearReleasedYear.awards = data.frame(matrix(ncol = 2, nrow = rows_num ))
colnames(df_validYearReleasedYear.awards) = c("wins", "nominations")
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df_validYearReleasedYear[i, "Awards"], split=" "))
  if (length(content) < 2) next
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1], 1, 3) == "win") {
      win_num = as.numeric(content[j])
      df_validYearReleasedYear.awards[i, "wins"] = win_num
    }
    if (substr(content[j + 1], 1, 10) == "nomination") {
      nomination_num = as.numeric(content[j])
      df_validYearReleasedYear.awards[i, "nominations"] = nomination_num
    }
  }
}
df_validYearReleasedYear = cbind(df_validYearReleasedYear, df_validYearReleasedYear.awards)
# number of non N/A in nomination
colSums(!is.na(df_validYearReleasedYear.awards))[2]
nominations 
       3497 

Q: How did you construct your conversion mechanism? How many rows had valid/non-zero wins or nominations?

A: I wrote loops to access each cell in the Award column. Extract the number before " win“, and similarly extract the number before” nomination“. There are 11547 non NA rows in nonmination

# TODO: Plot Gross revenue against wins and nominations
plot(log(df_validYearReleasedYear$nominations),log(df_validYearReleasedYear$Gross))

plot(log(df_validYearReleasedYear$wins),log(df_validYearReleasedYear$Gross))

Q: How does the gross revenue vary by number of awards won and nominations received?

A: As the number of awards won and nominations increases, the gross revenue increase.

7. Movie ratings from IMDb and Rotten Tomatoes

There are several variables that describe ratings, including IMDb ratings (imdbRating represents average user ratings and imdbVotes represents the number of user ratings), and multiple Rotten Tomatoes ratings (represented by several variables pre-fixed by tomato). Read up on such ratings on the web (for example rottentomatoes.com/about and www.imdb.com/help/show_leaf?votestopfaq).

Investigate the pairwise relationships between these different descriptors using graphs.

# TODO: Illustrate how ratings from IMDb and Rotten Tomatoes are related
install.packages('GGally')
Error in install.packages : Updating loaded packages
library(GGally)
names(df)
 [1] "Title"             "Year"              "Rated"             "Released"         
 [5] "Runtime"           "Director"          "Writer"            "Actors"           
 [9] "Plot"              "Language"          "Country"           "Awards"           
[13] "Poster"            "Metascore"         "imdbRating"        "imdbVotes"        
[17] "imdbID"            "Type"              "tomatoMeter"       "tomatoImage"      
[21] "tomatoRating"      "tomatoReviews"     "tomatoFresh"       "tomatoRotten"     
[25] "tomatoConsensus"   "tomatoUserMeter"   "tomatoUserRating"  "tomatoUserReviews"
[29] "tomatoURL"         "DVD"               "BoxOffice"         "Production"       
[33] "Website"           "Response"          "Budget"            "Domestic_Gross"   
[37] "Gross"             "Date"              "Decades"           "biography"        
[41] "documentary"       "romance"           "short"             "thriller"         
[45] "drama"             "war"               "comedy"            "horror"           
[49] "sci"               "adventure"         "family"            "history"          
[53] "crime"             "action"            "music"             "mystery"          
[57] "fantasy"           "sport"             "animation"         "musical"          
[61] "show"              "talk"              "adult"             "western"          
[65] "film"              "noir"              "reality"           "news"             
[69] "game"              "releasedYear"     
rating_parameters = c("tomatoMeter", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoUserMeter", "tomatoUserRating", "imdbRating" ,"imdbVotes"  )
ggpairs(df, columns = rating_parameters )

 plot: [1,1] [=------------------------------------------------------------------]  1% est: 0s 
 plot: [1,2] [==-----------------------------------------------------------------]  2% est: 8s 
 plot: [1,3] [==-----------------------------------------------------------------]  4% est: 9s 
 plot: [1,4] [===----------------------------------------------------------------]  5% est:11s 
 plot: [1,5] [====---------------------------------------------------------------]  6% est:10s 
 plot: [1,6] [=====--------------------------------------------------------------]  7% est: 9s 
 plot: [1,7] [======-------------------------------------------------------------]  9% est: 9s 
 plot: [1,8] [=======------------------------------------------------------------] 10% est: 9s 
 plot: [1,9] [=======------------------------------------------------------------] 11% est: 8s 
 plot: [2,1] [========-----------------------------------------------------------] 12% est: 8s 
 plot: [2,2] [=========----------------------------------------------------------] 14% est: 8s 
 plot: [2,3] [==========---------------------------------------------------------] 15% est: 9s 
 plot: [2,4] [===========--------------------------------------------------------] 16% est: 8s 
 plot: [2,5] [============-------------------------------------------------------] 17% est: 8s 
 plot: [2,6] [============-------------------------------------------------------] 19% est: 8s 
 plot: [2,7] [=============------------------------------------------------------] 20% est: 8s 
 plot: [2,8] [==============-----------------------------------------------------] 21% est: 7s 
 plot: [2,9] [===============----------------------------------------------------] 22% est: 7s 
 plot: [3,1] [================---------------------------------------------------] 23% est: 7s 
 plot: [3,2] [=================--------------------------------------------------] 25% est: 7s 
 plot: [3,3] [=================--------------------------------------------------] 26% est: 7s 
 plot: [3,4] [==================-------------------------------------------------] 27% est: 7s 
 plot: [3,5] [===================------------------------------------------------] 28% est: 7s 
 plot: [3,6] [====================-----------------------------------------------] 30% est: 7s 
 plot: [3,7] [=====================----------------------------------------------] 31% est: 6s 
 plot: [3,8] [======================---------------------------------------------] 32% est: 6s 
 plot: [3,9] [======================---------------------------------------------] 33% est: 6s 
 plot: [4,1] [=======================--------------------------------------------] 35% est: 6s 
 plot: [4,2] [========================-------------------------------------------] 36% est: 6s 
 plot: [4,3] [=========================------------------------------------------] 37% est: 6s 
 plot: [4,4] [==========================-----------------------------------------] 38% est: 6s 
 plot: [4,5] [==========================-----------------------------------------] 40% est: 6s 
 plot: [4,6] [===========================----------------------------------------] 41% est: 5s 
 plot: [4,7] [============================---------------------------------------] 42% est: 5s 
 plot: [4,8] [=============================--------------------------------------] 43% est: 5s 
 plot: [4,9] [==============================-------------------------------------] 44% est: 5s 
 plot: [5,1] [===============================------------------------------------] 46% est: 5s 
 plot: [5,2] [===============================------------------------------------] 47% est: 5s 
 plot: [5,3] [================================-----------------------------------] 48% est: 5s 
 plot: [5,4] [=================================----------------------------------] 49% est: 5s 
 plot: [5,5] [==================================---------------------------------] 51% est: 5s 
 plot: [5,6] [===================================--------------------------------] 52% est: 5s 
 plot: [5,7] [====================================-------------------------------] 53% est: 4s 
 plot: [5,8] [====================================-------------------------------] 54% est: 4s 
 plot: [5,9] [=====================================------------------------------] 56% est: 4s 
 plot: [6,1] [======================================-----------------------------] 57% est: 4s 
 plot: [6,2] [=======================================----------------------------] 58% est: 4s 
 plot: [6,3] [========================================---------------------------] 59% est: 4s 
 plot: [6,4] [=========================================--------------------------] 60% est: 4s 
 plot: [6,5] [=========================================--------------------------] 62% est: 4s 
 plot: [6,6] [==========================================-------------------------] 63% est: 4s 
 plot: [6,7] [===========================================------------------------] 64% est: 3s 
 plot: [6,8] [============================================-----------------------] 65% est: 3s 
 plot: [6,9] [=============================================----------------------] 67% est: 3s 
 plot: [7,1] [=============================================----------------------] 68% est: 3s 
 plot: [7,2] [==============================================---------------------] 69% est: 3s 
 plot: [7,3] [===============================================--------------------] 70% est: 3s 
 plot: [7,4] [================================================-------------------] 72% est: 3s 
 plot: [7,5] [=================================================------------------] 73% est: 3s 
 plot: [7,6] [==================================================-----------------] 74% est: 3s 
 plot: [7,7] [==================================================-----------------] 75% est: 2s 
 plot: [7,8] [===================================================----------------] 77% est: 2s 
 plot: [7,9] [====================================================---------------] 78% est: 2s 
 plot: [8,1] [=====================================================--------------] 79% est: 2s 
 plot: [8,2] [======================================================-------------] 80% est: 2s 
 plot: [8,3] [=======================================================------------] 81% est: 2s 
 plot: [8,4] [=======================================================------------] 83% est: 2s 
 plot: [8,5] [========================================================-----------] 84% est: 2s 
 plot: [8,6] [=========================================================----------] 85% est: 2s 
 plot: [8,7] [==========================================================---------] 86% est: 1s 
 plot: [8,8] [===========================================================--------] 88% est: 1s 
 plot: [8,9] [============================================================-------] 89% est: 1s 
 plot: [9,1] [============================================================-------] 90% est: 1s 
 plot: [9,2] [=============================================================------] 91% est: 1s 
 plot: [9,3] [==============================================================-----] 93% est: 1s 
 plot: [9,4] [===============================================================----] 94% est: 1s 
 plot: [9,5] [================================================================---] 95% est: 1s 
 plot: [9,6] [=================================================================--] 96% est: 0s 
 plot: [9,7] [=================================================================--] 98% est: 0s 
 plot: [9,8] [==================================================================-] 99% est: 0s 
 plot: [9,9] [===================================================================]100% est: 0s 
                                                                                               

Q: Comment on the similarities and differences between the user ratings of IMDb and the critics ratings of Rotten Tomatoes.

A: (1). ImbdRatings is high similarities with “tomatoMeter”, “tomatoRating”, “tomatoUserMeter”, “tomatoUserRating”; (2). ImbdRatings has slight similarities with “tomatoReviews”, “tomatoFresh”; (3).ImbdRatings is different with “tomatoRotten”;

8. Ratings and awards

These ratings typically reflect the general appeal of the movie to the public or gather opinions from a larger body of critics. Whereas awards are given by professional societies that may evaluate a movie on specific attributes, such as artistic performance, screenplay, sound design, etc.

Study the relationship between ratings and awards using graphs (awards here refers to wins and/or nominations).

# TODO: Show how ratings and awards are related
df = df2
names(df)
 [1] "Title"             "Year"              "Rated"             "Released"         
 [5] "Runtime"           "Genre"             "Director"          "Writer"           
 [9] "Actors"            "Plot"              "Language"          "Country"          
[13] "Awards"            "Poster"            "Metascore"         "imdbRating"       
[17] "imdbVotes"         "imdbID"            "Type"              "tomatoMeter"      
[21] "tomatoImage"       "tomatoRating"      "tomatoReviews"     "tomatoFresh"      
[25] "tomatoRotten"      "tomatoConsensus"   "tomatoUserMeter"   "tomatoUserRating" 
[29] "tomatoUserReviews" "tomatoURL"         "DVD"               "BoxOffice"        
[33] "Production"        "Website"           "Response"          "Budget"           
[37] "Domestic_Gross"    "Gross"             "Date"             
# add wins and nominations to orgional df
rows_num = dim(df)[1]
df.awards = data.frame(matrix(ncol = 2, nrow = rows_num ))
colnames(df.awards) = c("wins", "nominations")
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df[i, "Awards"], split=" "))
  if (length(content) < 2) next
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1], 1, 3) == "win") {
      win_num = as.numeric(content[j])
      df.awards[i, "wins"] = win_num
    }
    if (substr(content[j + 1], 1, 10) == "nomination") {
      nomination_num = as.numeric(content[j])
      df.awards[i, "nominations"] = nomination_num
    }
  }
}
df = cbind(df, df.awards)
names(df)
 [1] "Title"             "Year"              "Rated"             "Released"         
 [5] "Runtime"           "Genre"             "Director"          "Writer"           
 [9] "Actors"            "Plot"              "Language"          "Country"          
[13] "Awards"            "Poster"            "Metascore"         "imdbRating"       
[17] "imdbVotes"         "imdbID"            "Type"              "tomatoMeter"      
[21] "tomatoImage"       "tomatoRating"      "tomatoReviews"     "tomatoFresh"      
[25] "tomatoRotten"      "tomatoConsensus"   "tomatoUserMeter"   "tomatoUserRating" 
[29] "tomatoUserReviews" "tomatoURL"         "DVD"               "BoxOffice"        
[33] "Production"        "Website"           "Response"          "Budget"           
[37] "Domestic_Gross"    "Gross"             "Date"              "wins"             
[41] "nominations"      
# find relationship
parameters = c("tomatoMeter", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoUserMeter", "tomatoUserRating", "imdbRating" ,"imdbVotes", "wins", "nominations" )
ggpairs(df, columns = parameters, title = "fig1" )

 plot: [1,1] [=------------------------------------------------------------------]  1% est: 0s 
 plot: [1,2] [=------------------------------------------------------------------]  2% est: 6s 
 plot: [1,3] [==-----------------------------------------------------------------]  2% est: 8s 
 plot: [1,4] [==-----------------------------------------------------------------]  3% est: 8s 
 plot: [1,5] [===----------------------------------------------------------------]  4% est: 9s 
 plot: [1,6] [===----------------------------------------------------------------]  5% est: 9s 
 plot: [1,7] [====---------------------------------------------------------------]  6% est: 9s 
 plot: [1,8] [====---------------------------------------------------------------]  7% est: 9s 
 plot: [1,9] [=====--------------------------------------------------------------]  7% est: 9s 
 plot: [1,10] [=====-------------------------------------------------------------]  8% est: 9s 
 plot: [1,11] [======------------------------------------------------------------]  9% est: 9s 
 plot: [2,1] [=======------------------------------------------------------------] 10% est: 9s 
 plot: [2,2] [=======------------------------------------------------------------] 11% est: 9s 
 plot: [2,3] [========-----------------------------------------------------------] 12% est: 9s 
 plot: [2,4] [========-----------------------------------------------------------] 12% est: 9s 
 plot: [2,5] [=========----------------------------------------------------------] 13% est: 9s 
 plot: [2,6] [=========----------------------------------------------------------] 14% est: 9s 
 plot: [2,7] [==========---------------------------------------------------------] 15% est: 9s 
 plot: [2,8] [===========--------------------------------------------------------] 16% est: 9s 
 plot: [2,9] [===========--------------------------------------------------------] 17% est: 9s 
 plot: [2,10] [===========-------------------------------------------------------] 17% est: 9s 
 plot: [2,11] [============------------------------------------------------------] 18% est: 9s 
 plot: [3,1] [=============------------------------------------------------------] 19% est: 8s 
 plot: [3,2] [=============------------------------------------------------------] 20% est: 9s 
 plot: [3,3] [==============-----------------------------------------------------] 21% est: 9s 
 plot: [3,4] [==============-----------------------------------------------------] 21% est: 9s 
 plot: [3,5] [===============----------------------------------------------------] 22% est: 9s 
 plot: [3,6] [================---------------------------------------------------] 23% est: 9s 
 plot: [3,7] [================---------------------------------------------------] 24% est: 8s 
 plot: [3,8] [=================--------------------------------------------------] 25% est: 8s 
 plot: [3,9] [=================--------------------------------------------------] 26% est: 8s 
 plot: [3,10] [=================-------------------------------------------------] 26% est: 8s 
 plot: [3,11] [==================------------------------------------------------] 27% est: 8s 
 plot: [4,1] [===================------------------------------------------------] 28% est: 8s 
 plot: [4,2] [===================------------------------------------------------] 29% est: 8s 
 plot: [4,3] [====================-----------------------------------------------] 30% est: 8s 
 plot: [4,4] [====================-----------------------------------------------] 31% est: 8s 
 plot: [4,5] [=====================----------------------------------------------] 31% est: 8s 
 plot: [4,6] [======================---------------------------------------------] 32% est: 8s 
 plot: [4,7] [======================---------------------------------------------] 33% est: 8s 
 plot: [4,8] [=======================--------------------------------------------] 34% est: 8s 
 plot: [4,9] [=======================--------------------------------------------] 35% est: 8s 
 plot: [4,10] [=======================-------------------------------------------] 36% est: 7s 
 plot: [4,11] [========================------------------------------------------] 36% est: 7s 
 plot: [5,1] [=========================------------------------------------------] 37% est: 7s 
 plot: [5,2] [=========================------------------------------------------] 38% est: 7s 
 plot: [5,3] [==========================-----------------------------------------] 39% est: 7s 
 plot: [5,4] [===========================----------------------------------------] 40% est: 7s 
 plot: [5,5] [===========================----------------------------------------] 40% est: 7s 
 plot: [5,6] [============================---------------------------------------] 41% est: 7s 
 plot: [5,7] [============================---------------------------------------] 42% est: 7s 
 plot: [5,8] [=============================--------------------------------------] 43% est: 7s 
 plot: [5,9] [=============================--------------------------------------] 44% est: 7s 
 plot: [5,10] [=============================-------------------------------------] 45% est: 7s 
 plot: [5,11] [==============================------------------------------------] 45% est: 6s 
 plot: [6,1] [===============================------------------------------------] 46% est: 6s 
 plot: [6,2] [================================-----------------------------------] 47% est: 6s 
 plot: [6,3] [================================-----------------------------------] 48% est: 6s 
 plot: [6,4] [=================================----------------------------------] 49% est: 6s 
 plot: [6,5] [=================================----------------------------------] 50% est: 6s 
 plot: [6,6] [==================================---------------------------------] 50% est: 6s 
 plot: [6,7] [==================================---------------------------------] 51% est: 6s 
 plot: [6,8] [===================================--------------------------------] 52% est: 6s 
 plot: [6,9] [===================================--------------------------------] 53% est: 6s 
 plot: [6,10] [===================================-------------------------------] 54% est: 6s 
 plot: [6,11] [====================================------------------------------] 55% est: 6s 
 plot: [7,1] [=====================================------------------------------] 55% est: 6s 
 plot: [7,2] [======================================-----------------------------] 56% est: 6s 
 plot: [7,3] [======================================-----------------------------] 57% est: 5s 
 plot: [7,4] [=======================================----------------------------] 58% est: 5s 
 plot: [7,5] [=======================================----------------------------] 59% est: 5s 
 plot: [7,6] [========================================---------------------------] 60% est: 5s 
 plot: [7,7] [========================================---------------------------] 60% est: 5s 
 plot: [7,8] [=========================================--------------------------] 61% est: 5s 
 plot: [7,9] [==========================================-------------------------] 62% est: 5s 
 plot: [7,10] [=========================================-------------------------] 63% est: 5s 
 plot: [7,11] [==========================================------------------------] 64% est: 5s 
 plot: [8,1] [===========================================------------------------] 64% est: 5s 
 plot: [8,2] [============================================-----------------------] 65% est: 5s 
 plot: [8,3] [============================================-----------------------] 66% est: 4s 
 plot: [8,4] [=============================================----------------------] 67% est: 4s 
 plot: [8,5] [=============================================----------------------] 68% est: 4s 
 plot: [8,6] [==============================================---------------------] 69% est: 4s 
 plot: [8,7] [===============================================--------------------] 69% est: 4s 
 plot: [8,8] [===============================================--------------------] 70% est: 4s 
 plot: [8,9] [================================================-------------------] 71% est: 4s 
 plot: [8,10] [===============================================-------------------] 72% est: 4s 
 plot: [8,11] [================================================------------------] 73% est: 4s 
 plot: [9,1] [=================================================------------------] 74% est: 4s 
 plot: [9,2] [==================================================-----------------] 74% est: 3s 
 plot: [9,3] [==================================================-----------------] 75% est: 3s 
 plot: [9,4] [===================================================----------------] 76% est: 3s 
 plot: [9,5] [===================================================----------------] 77% est: 3s 
 plot: [9,6] [====================================================---------------] 78% est: 3s 
 plot: [9,7] [=====================================================--------------] 79% est: 3s 
 plot: [9,8] [=====================================================--------------] 79% est: 3s 
 plot: [9,9] [======================================================-------------] 80% est: 3s 
 plot: [9,10] [=====================================================-------------] 81% est: 3s 
 plot: [9,11] [======================================================------------] 82% est: 2s 
 plot: [10,1] [=======================================================-----------] 83% est: 2s 
 plot: [10,2] [=======================================================-----------] 83% est: 2s 
 plot: [10,3] [========================================================----------] 84% est: 2s 
 plot: [10,4] [========================================================----------] 85% est: 2s 
 plot: [10,5] [=========================================================---------] 86% est: 2s 
 plot: [10,6] [=========================================================---------] 87% est: 2s 
 plot: [10,7] [==========================================================--------] 88% est: 2s 
 plot: [10,8] [==========================================================--------] 88% est: 2s 
 plot: [10,9] [===========================================================-------] 89% est: 1s 
 plot: [10,10] [===========================================================------] 90% est: 1s 
 plot: [10,11] [===========================================================------] 91% est: 1s 
 plot: [11,1] [=============================================================-----] 92% est: 1s 
 plot: [11,2] [=============================================================-----] 93% est: 1s 
 plot: [11,3] [==============================================================----] 93% est: 1s 
 plot: [11,4] [==============================================================----] 94% est: 1s 
 plot: [11,5] [===============================================================---] 95% est: 1s 
 plot: [11,6] [===============================================================---] 96% est: 1s 
 plot: [11,7] [================================================================--] 97% est: 0s 
 plot: [11,8] [================================================================--] 98% est: 0s 
 plot: [11,9] [=================================================================-] 98% est: 0s 
 plot: [11,10] [================================================================-] 99% est: 0s 
 plot: [11,11] [=================================================================]100% est: 0s 
                                                                                               

Q: How good are these ratings in terms of predicting the success of a movie in winning awards or nominations? Is there a high correlation between two variables?

A: 1. Some of those ratings are highly correlated with the wins and nominations, such as tomatoReviews, tomatoFresh, imbdVotes and imbdVoting. tomatoRotten are not positively related to the awards which is expected. Some of those ratings are not very good predicators, such as tomatoMeters, tomatoRatins, since they have low correlations to awards. 2. Norminations and wins are highly related. Correlation is 0.819 ## 9. Expected insights

Come up with two new insights (backed up by data and graphs) that is expected. Here “new” means insights that are not an immediate consequence of one of the above tasks. You may use any of the columns already explored above or a different one in the dataset, such as Title, Actors, etc.

# TODO: Find and illustrate two expected insights
# Expeact Ingisht # 1
# deliminite
library(tm)
myCorpus <- Corpus(VectorSource(df$Country))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
countries = as.matrix(myTDM)
df = cbind(df, countries)
df.countries <- data.frame(matrix(ncol = 2, nrow = 173))
colnames(df.countries) = c("countryname", "count")
# fill in frequency
totalcountries = colnames(countries)
i = 1
for (c in totalcountries){
  cur = sum(df[c])
  df.countries [i,] = c(c, cur) 
  i = i + 1
}
df.countries $freq <- as.numeric(df.countries $count) / dim(df)[1]
df.countries = df.countries[rev(order(df.countries$freq)),]
df.topcountries = df.countries[0:10 ,]
View(df.topcountries)
# Expected insight # 2
View(df_validYearReleasedYear[, c("Gross", "Decades")])
qplot(df_validYearReleasedYear$Decades, df_validYearReleasedYear$Gross)

ggplot(df_validYearReleasedYear, aes(reorder(Decades, -transformedGross, median), transformedGross) ) + geom_boxplot() + coord_flip() + scale_x_discrete("transformedGross") 

Q: Expected insight #1.

A: USA produced much larger amount of movies than other countries. The following countries are: France, Germany, Canada …

Q: Expected insight #2.

A: Recent years has larger Gross median than those old years.

10. Unexpected insight

Come up with one new insight (backed up by data and graphs) that is unexpected at first glance and do your best to motivate it. Same instructions apply as the previous task.

# TODO: Find and illustrate one unexpected insight
qplot( df_validYearReleasedYear$imdbRating, df_validYearReleasedYear$Gross)

Q: Unexpected insight.

A: The the first thought, I was not expect that the ones with high imbd ratings could have very low gross. But after thinking again, I found it is reasonable to some movies are good, but are not well commercialzied and on the business side, they were not successful.

---
title: 'Project 1: Explore and Prepare Data'
subtitle: |-
  CSE6242 - Data and Visual Analytics - Fall 2017
  Due: Sunday, October 15, 2017 at 11:59 PM UTC-12:00 on T-Square
output:
  html_notebook: default
  pdf_document: default
---

_Note: This project involves getting data ready for analysis and doing some preliminary investigations. Project 2 will involve modeling and predictions on the same dataset, and will be released at a later date. Both projects will have equal weightage towards your grade. You may reuse some of the preprocessing/analysis steps from Project 1 in Project 2._

# Data

In this project, you will explore a dataset that contains information about movies, including ratings, budget, gross revenue and other attributes. It was prepared by Dr. Guy Lebanon, and here is his description of the dataset:

> The file [`movies_merged`](https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged) contains a dataframe with the same name that has 40K rows and 39 columns. Each row represents a movie title and each column represents a descriptor such as `Title`, `Actors`, and `Budget`. I collected the data by querying IMDb’s API (see [www.omdbapi.com](http://www.omdbapi.com/)) and joining it with a separate dataset of movie budgets and gross earnings (unknown to you). The join key was the movie title. This data is available for personal use, but IMDb’s terms of service do not allow it to be used for commercial purposes or for creating a competing repository.

# Objective

Your goal is to investigate the relationship between the movie descriptors and the box office success of movies, as represented by the variable `Gross`. This task is extremely important as it can help a studio decide which titles to fund for production, how much to bid on produced movies, when to release a title, how much to invest in marketing and PR, etc. This information is most useful before a title is released, but it is still very valuable after the movie is already released to the public (for example it can affect additional marketing spend or how much a studio should negotiate with on-demand streaming companies for “second window” streaming rights).

# Instructions

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. Open this file in RStudio to get started.

When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r}
x = 1:10
print(x^2)
```

Plots appear inline too:
```{r}
plot(x, x^2, 'o')
```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*. Enter some R code and run it.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file).

Please complete all the tasks below by implementing code chunks that have a `TODO` comment in them, running all code chunks so that output and plots are displayed, and typing in answers to each question (**Q:** ...) next to/below the corresponding answer prompt (**A:**). Feel free to add code chunks/show additional output to support any of the answers.

When you are done, you will need to submit the final R markdown file (as **pr1.Rmd**) with all code chunks implemented and executed, and all text responses written in. You also need to submit a PDF export of the markdown file (as **pr1.pdf**), which should show your code, output, plots and written responses--this will be your project report. Compress these two files into a single .zip archive and upload it on T-Square.

# Setup

## Load data

Make sure you've downloaded the [`movies_merged`](https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged) file and it is in the current working directory. Now load it into memory:

```{r}
load('movies_merged')
cat("Dataset has", dim(movies_merged)[1], "rows and", dim(movies_merged)[2], "columns", end="\n", file="")
```

This creates an object of the same name (`movies_merged`). For convenience, you can copy it to `df` and start using it:

```{r}
df = movies_merged
cat("Column names:", end="\n", file="")
```

## Load R packages

Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.

```{r}
library(ggplot2)
library(GGally)
```

If you are loading any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular `install.packages('<pkg name>')` command.

**Non-standard packages used**: None

# Tasks

Each task below is worth **10** points, and is meant to be performed sequentially, i.e. do step 2 after you have processed the data as described in step 1. Total points: **100**

Complete each task by implementing code chunks as described by `TODO` comments, and by responding to questions ("**Q**:") with written answers ("**A**:"). If you are unable to find a meaningful or strong relationship in any of the cases when requested, explain why not by referring to appropriate plots/statistics.

It is okay to handle missing values below by omission, but please omit as little as possible. It is worthwhile to invest in reusable and clear code as you may need to use it or modify it in project 2.

## 1. Remove non-movie rows

The variable `Type` captures whether the row is a movie, a TV series, or a game. Remove all rows from `df` that do not correspond to movies.

```{r}
# TODO: Remove all rows from df that do not correspond to movies
df2 <- df[df$Type == "movie",]
df = df2
```

**Q**: How many rows are left after removal? _Enter your response below._

**A**: 40000 rows are left

## 2. Process `Runtime` column

The variable `Runtime` represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace `df$Runtime` with the new numeric column.

```{r}
# TODO: Replace df$Runtime with a numeric column containing the runtime in minutes
rows_num = dim(df)[1]
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df[i, "Runtime"], split=" "))
  if (content == "N/A"){
    df[i,"Runtime"] = 0 
    next()
  }
    
  cur = 0
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1],1, 1) == "h"){
      cur = cur + as.numeric(content[j]) * 60
    }
    
    if (substr(content[j + 1],1, 3) == "min"){
      cur = cur + as.numeric(content[j]) 
    }
  }
  df[i,"Runtime"]= cur
}
df$Runtime =as.numeric((df$Runtime))
```

Now investigate the distribution of `Runtime` values and how it changes over years (variable `Year`, which you can bucket into decades) and in relation to the budget (variable `Budget`). Include any plots that illustrate.

```{r}
# TODO: Investigate the distribution of Runtime values and how it varies by Year and Budget

# The distribution of Runtime values
qplot(x = Runtime, data = df, binwidth = 10) 

# Runtime vs. Budget 
qplot(df$Runtime, log(df$Budget))

# Runtime vs. Year
df$Decades = (df$Year %/% 10) * 10
ggplot(df, aes(reorder(Decades, Decades, median), Runtime) ) + geom_boxplot() + scale_x_discrete("") 
```

_Feel free to insert additional code chunks as necessary._

**Q**: Comment on the distribution as well as relationships. Are there any patterns or trends that you can observe?

**A**: 
1. The Runtime distribution is "Bi model normal distribution", there are outliers which is much longer than average. 
2. Median of Runtime increaded in later years.
3. As Runtime increases, Budget maybe increases, but very scattered. The positive relationship is not very obvious.

## 3. Encode `Genre` column

The column `Genre` represents a list of genres associated with the movie in a string format. Write code to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence, and add it to the dataframe as additional columns. Then remove the original `Genre` column.

For example, if there are a total of 3 genres: Drama, Comedy, and Action, a movie that is both Action and Comedy should be represented by a binary vector <0, 1, 1>. Note that you need to first compile a dictionary of all possible genres and then figure out which movie has which genres (you can use the R `tm` package to create the dictionary).

```{r}
# TODO: Replace Genre with a collection of binary columns
library(tm)
myCorpus <- Corpus(VectorSource(df$Genre))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
x = as.matrix(myTDM)
x
# Add processed Genre data to origional df
df = cbind(df, x)
# drop Genre from df
df$Genre <- NULL
```

Plot the relative proportions of movies having the top 10 most common genres.

```{r}
# TODO: Select movies from top 10 most common genres and plot their relative proportions
# create a df to record frequency
df.count <- data.frame(matrix(ncol = 2, nrow = 30))
colnames(df.count) = c("genre", "count")
# fill in frequency
totalgenre = colnames(x)
i = 1
for (g in totalgenre){
  cur = sum(df[g])
  df.count[i,] = c(g, cur) 
  i = i + 1
}
df.count$frequency <- as.numeric(df.count$count) / dim(df)[1]
df.count[rev(order(df.count$frequency)),]
# plot relative proportions
df.count = df.count[rev(order(df.count$frequency)),]
qplot(x =df.count$genre[1:10], y =df.count$frequency[1:10])

```

Examine how the distribution of `Runtime` changes across genres for the top 10 most common genres.

```{r}
# TODO: Plot Runtime distribution for top 10 most common genres
totalgenre = df.count$genre[1:10]
print(totalgenre)
for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Runtime)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}
dim(df)


for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Runtime)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}
dim(df)

```

**Q**: Describe the interesting relationship(s) you observe. Are there any expected or unexpected trends that are evident?

**A**:
As expected, for each genre, the movies run time are generally normally distributed, even though some are skewed, some are biomodeal with smaller bell shape in shorter runtime. The "short" has the lowest runtime, the drama has the longest, which is expected.

Some outliers (600min) is not exprected. They are pretty long movies. Biomodel normal distribution are not expected. I have not think out why there are two peaks.

## 4. Eliminate mismatched rows

The dataframe was put together by merging two different sources of data and it is possible that the merging process was inaccurate in some cases (the merge was done based on movie title, but there are cases of different movies with the same title). There are 3 columns that contain date information: `Year` (numeric year), `Date` (numeric year), and `Released` (string representation of the release date).

Find and remove all rows where you suspect a merge error occurred based on a mismatch between these variables. To make sure subsequent analysis and modeling work well, avoid removing more than 10% of the rows that have a `Gross` value present.

_Note: Do not remove the rows with `Gross == NA` at this point, just use this a guideline._

```{r}
# TODO: Remove rows with Year/Date/Released mismatch
dim(df)
X= data.frame(do.call('rbind', strsplit(as.character(df$Released),'-',fixed=TRUE)))
df$releasedYear = X$X1

# Remove all whose Gross is NA
df_validGross = df[!is.na(df$Gross), ]
dim(df_validGross)

# Keep the ones that are correct
df_validYearDate = df_validGross[abs(df_validGross$Year - df_validGross$Date) <= 1,]

#View(df_validYearDate[, c("Gross", "Year", "Date", "releasedYear")])
df_validYearDate$releasedYear = as.numeric(as.character(df_validYearDate$releasedYear))
df_validYearReleasedYear = df_validYearDate[abs(df_validYearDate$Year - df_validYearDate$releasedYear) <= 1,]
dim(df_validYearDate)
dim(df_validYearReleasedYear)
```

**Q**: What is your precise removal logic, and how many rows remain in the resulting dataset?

**A**: 
1.For the same movies, the "Released", "Date" and "Year" should show less than one year difference. If in any row that their year information in the above three columns are not the same, that indicates a mismatch, and should be removed.
2. There are 3518 rows left after removal, which has gross value.

## 5. Explore `Gross` revenue

For the commercial success of a movie, production houses want to maximize Gross revenue. Investigate if Gross revenue is related to Budget, Runtime or Genre in any way.

_Note: To get a meaningful relationship, you may have to partition the movies into subsets such as short vs. long duration, or by genre, etc._

```{r}
# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre

# Gross Revenue vs Budget
qplot(log(df_validYearReleasedYear$Budget), log(df_validYearReleasedYear$Gross))

# Gross Revenue vs Runtime
qplot(df_validYearReleasedYear$Runtime, df_validYearReleasedYear$Gross)

# Gross Revenue vs Genre

for (i in seq(1, 10)){
  print (ggplot(df[which(df[totalgenre[i]] == 1),], aes("", Gross)) + geom_boxplot() + coord_flip() + scale_x_discrete("") + ggtitle(totalgenre[i]))
  z = df[which(df[totalgenre[i]] == 1),]
  print (totalgenre[i])
  print (median(z$Runtime))
  #print(qplot(x = Runtime, data = z, binwidth = 4, main = totalgenre[i]))
}

```

**Q**: Did you find any observable relationships or combinations of Budget/Runtime/Genre that result in high Gross revenue? If you divided the movies into different subsets, you may get different answers for them - point out interesting ones.

**A**: Budget increase as runtime increase
       The runtime for each Genere are different.Such as short is much shorter than others.


```{r}
# TODO: Investigate if Gross Revenue is related to Release Month
X= data.frame(do.call('rbind', strsplit(as.character(df_validYearReleasedYear$Released),'-',fixed=TRUE)))
month = X$X2
transformedGross = log(df_validYearReleasedYear$Gross)
df_validYearReleasedYear = cbind(df_validYearReleasedYear, month, transformedGross)
names(df_validYearReleasedYear)
ggplot(df_validYearReleasedYear, aes(reorder(month, -transformedGross, median), transformedGross) ) + geom_boxplot() + coord_flip() + scale_x_discrete("transformedGross") 

```

## 6. Process `Awards` column

The variable `Awards` describes nominations and awards in text format. Convert it to 2 numeric columns, the first capturing the number of wins, and the second capturing nominations. Replace the `Awards` column with these new columns, and then study the relationship of `Gross` revenue with respect to them.

_Note: The format of the `Awards` column is not standard; you may have to use regular expressions to find the relevant values. Try your best to process them, and you may leave the ones that don't have enough information as NAs or set them to 0s._

```{r}
# TODO: Convert Awards to 2 numeric columns: wins and nominations
rows_num = dim(df_validYearReleasedYear)[1]
df_validYearReleasedYear.awards = data.frame(matrix(ncol = 2, nrow = rows_num ))
colnames(df_validYearReleasedYear.awards) = c("wins", "nominations")
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df_validYearReleasedYear[i, "Awards"], split=" "))
  if (length(content) < 2) next
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1], 1, 3) == "win") {
      win_num = as.numeric(content[j])
      df_validYearReleasedYear.awards[i, "wins"] = win_num
    }
    if (substr(content[j + 1], 1, 10) == "nomination") {
      nomination_num = as.numeric(content[j])
      df_validYearReleasedYear.awards[i, "nominations"] = nomination_num
    }
  }
}
df_validYearReleasedYear = cbind(df_validYearReleasedYear, df_validYearReleasedYear.awards)
# number of non N/A in nomination
colSums(!is.na(df_validYearReleasedYear.awards))[2]
```

**Q**: How did you construct your conversion mechanism? How many rows had valid/non-zero wins or nominations?

**A**: I wrote loops to access each cell in the Award column. Extract the number before " win", and similarly extract the number before " nomination". There are 11547 non NA rows in nonmination

```{r}
# TODO: Plot Gross revenue against wins and nominations
plot(log(df_validYearReleasedYear$nominations),log(df_validYearReleasedYear$Gross))
plot(log(df_validYearReleasedYear$wins),log(df_validYearReleasedYear$Gross))

```

**Q**: How does the gross revenue vary by number of awards won and nominations received?

**A**: As the number of awards won and nominations increases, the gross revenue increase.

## 7. Movie ratings from IMDb and Rotten Tomatoes

There are several variables that describe ratings, including IMDb ratings (`imdbRating` represents average user ratings and `imdbVotes` represents the number of user ratings), and multiple Rotten Tomatoes ratings (represented by several variables pre-fixed by `tomato`). Read up on such ratings on the web (for example [rottentomatoes.com/about](https://www.rottentomatoes.com/about) and [ www.imdb.com/help/show_leaf?votestopfaq](http:// www.imdb.com/help/show_leaf?votestopfaq)).

Investigate the pairwise relationships between these different descriptors using graphs.

```{r}
# TODO: Illustrate how ratings from IMDb and Rotten Tomatoes are related
install.packages('GGally')
library(GGally)
names(df)
rating_parameters = c("tomatoMeter", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoUserMeter", "tomatoUserRating", "imdbRating" ,"imdbVotes"  )
ggpairs(df, columns = rating_parameters )
```

**Q**: Comment on the similarities and differences between the user ratings of IMDb and the critics ratings of Rotten Tomatoes.

**A**: 
(1). ImbdRatings is high similarities with "tomatoMeter", "tomatoRating",  "tomatoUserMeter", "tomatoUserRating";
(2). ImbdRatings has slight similarities with "tomatoReviews", "tomatoFresh"; 
(3).ImbdRatings is different with "tomatoRotten";

## 8. Ratings and awards
These ratings typically reflect the general appeal of the movie to the public or gather opinions from a larger body of critics. Whereas awards are given by professional societies that may evaluate a movie on specific attributes, such as artistic performance, screenplay, sound design, etc.

Study the relationship between ratings and awards using graphs (awards here refers to wins and/or nominations). 

```{r}
# TODO: Show how ratings and awards are related
df = df2
names(df)
# add wins and nominations to orgional df
rows_num = dim(df)[1]
df.awards = data.frame(matrix(ncol = 2, nrow = rows_num ))
colnames(df.awards) = c("wins", "nominations")
for (i in seq(1, rows_num)){
  content = unlist(strsplit(df[i, "Awards"], split=" "))
  if (length(content) < 2) next
  for (j in seq(1, length(content) - 1)){
    if (substr(content[j + 1], 1, 3) == "win") {
      win_num = as.numeric(content[j])
      df.awards[i, "wins"] = win_num
    }
    if (substr(content[j + 1], 1, 10) == "nomination") {
      nomination_num = as.numeric(content[j])
      df.awards[i, "nominations"] = nomination_num
    }
  }
}
df = cbind(df, df.awards)
names(df)

# find relationship
parameters = c("tomatoMeter", "tomatoRating", "tomatoReviews", "tomatoFresh", "tomatoRotten", "tomatoUserMeter", "tomatoUserRating", "imdbRating" ,"imdbVotes", "wins", "nominations" )
ggpairs(df, columns = parameters, title = "fig1" )
```

**Q**: How good are these ratings in terms of predicting the success of a movie in winning awards or nominations? Is there a high correlation between two variables?

**A**: 
1. Some of those ratings are highly correlated with the wins and nominations, such as tomatoReviews, tomatoFresh, imbdVotes and imbdVoting.
    tomatoRotten are not positively related to the awards which is expected.
    Some of those ratings are not very good predicators, such as tomatoMeters, tomatoRatins, since they have low correlations to awards.
2. Norminations and wins are highly related. Correlation is 0.819
## 9. Expected insights

Come up with two new insights (backed up by data and graphs) that is expected. Here “new” means insights that are not an immediate consequence of one of the above tasks. You may use any of the columns already explored above or a different one in the dataset, such as `Title`, `Actors`, etc.

```{r}
# TODO: Find and illustrate two expected insights
# Expeact Ingisht # 1
# deliminite
library(tm)
myCorpus <- Corpus(VectorSource(df$Country))
myTDM <- DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
countries = as.matrix(myTDM)


df = cbind(df, countries)

df.countries <- data.frame(matrix(ncol = 2, nrow = 173))
colnames(df.countries) = c("countryname", "count")
# fill in frequency
totalcountries = colnames(countries)
i = 1
for (c in totalcountries){
  cur = sum(df[c])
  df.countries [i,] = c(c, cur) 
  i = i + 1
}

df.countries $freq <- as.numeric(df.countries $count) / dim(df)[1]
df.countries = df.countries[rev(order(df.countries$freq)),]
df.topcountries = df.countries[0:10 ,]
View(df.topcountries)

# Expected insight # 2
View(df_validYearReleasedYear[, c("Gross", "Decades")])
qplot(df_validYearReleasedYear$Decades, df_validYearReleasedYear$Gross)

ggplot(df_validYearReleasedYear, aes(reorder(Decades, -transformedGross, median), transformedGross) ) + geom_boxplot() + coord_flip() + scale_x_discrete("transformedGross") 





```

**Q**: Expected insight #1.

**A**: USA produced much larger amount of movies than other countries. The following countries are: France, Germany, Canada ... 


**Q**: Expected insight #2.

**A**: Recent years has larger Gross median than those old years.



## 10. Unexpected insight

Come up with one new insight (backed up by data and graphs) that is unexpected at first glance and do your best to motivate it. Same instructions apply as the previous task.

```{r}
# TODO: Find and illustrate one unexpected insight
qplot( df_validYearReleasedYear$imdbRating, df_validYearReleasedYear$Gross)
```

**Q**: Unexpected insight.

**A**: The the first thought, I was not expect that the ones with high imbd ratings could have very low gross. But after thinking again, I found it is reasonable to some movies are good, but are not well commercialzied and on the business side, they were not successful.


