R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

plot(cars)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Load the dataset in R and store the dataframe in a new variable called “netflix”.

# Read the CSV once and store it as `netflix`
netflix <- read.csv("~/Downloads/NetflixData.csv", header = TRUE)

head(netflix)
##     X    type                                        title       country
## 1 656 TV Show                 Russell Peters vs. the World United States
## 2 657 TV Show                 Nature's Great Events (2009) United States
## 3 658 TV Show                                   Sammy & Co        France
## 4 659 TV Show             Ninja Turtles: The Next Mutation        Canada
## 5 660 TV Show Strawberry Shortcake: Berry Bitty Adventures        France
## 6 661 TV Show                                V.R. Troopers United States
##   date_added release_year rating duration_min_season    sales
## 1  14-Oct-13         2013     NR                   1 11.91041
## 2  10-Oct-15         2009  TV-PG                   1 10.60035
## 3  25-Dec-15         2014   TV-Y                   1 14.43723
## 4   1-Jan-16         1997   TV-G                   1 14.26843
## 5   1-Jan-16         2015   TV-Y                   1 13.48277
## 6   1-Jan-16         1995   TV-G                   2 14.68985
dim(netflix)
## [1] 931   9
str(netflix)
## 'data.frame':    931 obs. of  9 variables:
##  $ X                  : int  656 657 658 659 660 661 662 663 664 665 ...
##  $ type               : chr  "TV Show" "TV Show" "TV Show" "TV Show" ...
##  $ title              : chr  "Russell Peters vs. the World" "Nature's Great Events (2009)" "Sammy & Co" "Ninja Turtles: The Next Mutation" ...
##  $ country            : chr  "United States" "United States" "France" "Canada" ...
##  $ date_added         : chr  "14-Oct-13" "10-Oct-15" "25-Dec-15" "1-Jan-16" ...
##  $ release_year       : int  2013 2009 2014 1997 2015 1995 2003 2007 2016 2013 ...
##  $ rating             : chr  "NR" "TV-PG" "TV-Y" "TV-G" ...
##  $ duration_min_season: int  1 1 1 1 1 2 1 1 2 1 ...
##  $ sales              : num  11.9 10.6 14.4 14.3 13.5 ...

Extract the value of year from Sys.time() and save it in a new variable named current_year. Inspect the class of current_year. Convert it to a numeric data class.

current_year <- as.numeric(format(Sys.time(), "%Y"))

class(current_year)
## [1] "numeric"
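The one-liner above converts before inspecting, so the intermediate class is never shown. A minimal sketch of the two steps taken separately (the names `year_chr` and `year_num` are illustrative and not part of the assignment):

```r
# format() on a date-time returns a character string, not a number
year_chr <- format(Sys.time(), "%Y")
class(year_chr)   # "character"

# Convert to numeric so the year can be used in arithmetic
year_num <- as.numeric(year_chr)
class(year_num)   # "numeric"
```

Skipping the conversion would make `current_year - netflix$release_year` fail, because subtraction is not defined for character values.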

Create a new column in the netflix dataframe you created in question (1). Name this column time_since_release and assign it the value of the current year minus release_year. This variable gives the number of years since the release of the TV show or movie.

netflix$time_since_release <- current_year - netflix$release_year

Create a new column named title_length. Assign it the number of characters in the title of each TV show or movie.

netflix$title_length <- nchar(as.character(netflix$title))

Inspect the class of each column in the dataframe. Instead of doing this one by one for each column, can you write a “for” loop over the columns of the dataframe that prints the class of each column?

for (col in names(netflix)) {
  cat(col, ": ", class(netflix[[col]]), "\n", sep = "")
}
## X: integer
## type: character
## title: character
## country: character
## date_added: character
## release_year: integer
## rating: character
## duration_min_season: integer
## sales: numeric
## time_since_release: numeric
## title_length: integer
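The same inspection can be done without an explicit loop. The sketch below uses `sapply()` on a small made-up data frame (`toy` and its values are assumptions for illustration, standing in for netflix):

```r
# Toy data frame standing in for netflix (values are invented)
toy <- data.frame(
  title = c("A", "Bb", "Ccc"),
  release_year = c(2001L, 2010L, 2020L),
  sales = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_classes <- sapply(toy, class)
print(col_classes)
```

One caveat: if any column has more than one class (for example a POSIXct date-time), `sapply()` returns a list rather than a vector, so the for loop may be the more predictable choice in that case.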

Compute the descriptive statistics (mean, median, min, max) for the variable you created, time_since_release.

mean(netflix$time_since_release)
## [1] 9.137487
median(netflix$time_since_release)
## [1] 6
min(netflix$time_since_release)
## [1] 2
max(netflix$time_since_release)
## [1] 81

Can you compute the descriptive statistics for all the numerical variables in the data without repeating your code for each variable?

lapply(netflix, function(x) {
  # is.numeric() is TRUE for both integer and double columns
  if (is.numeric(x)) {
    c(mean = mean(x), median = median(x), min = min(x), max = max(x))
  }
})
## $X
##   mean median    min    max 
##    466    466      1    931 
## 
## $type
## NULL
## 
## $title
## NULL
## 
## $country
## NULL
## 
## $date_added
## NULL
## 
## $release_year
##     mean   median      min      max 
## 2013.863 2017.000 1942.000 2021.000 
## 
## $rating
## NULL
## 
## $duration_min_season
##      mean    median       min       max 
##  71.06767  90.00000   1.00000 312.00000 
## 
## $sales
##      mean    median       min       max 
##  50.00629  60.72718   6.75764 173.76761 
## 
## $time_since_release
##      mean    median       min       max 
##  9.137487  6.000000  2.000000 81.000000 
## 
## $title_length
##    mean  median     min     max 
## 17.2449 15.0000  1.0000 73.0000

Optional Bonus Question (+2 points): Can you create a function that takes two inputs, input_data (a dataframe) and id (a column id)?

desc_stats <- function(input_data, id) {
  x <- input_data[[id]]  # extract the column as a plain vector
  # is.numeric() covers both "numeric" and "integer" columns
  if (is.numeric(x)) {
    stats <- data.frame(variable = id,
                        mean = mean(x),
                        median = median(x),
                        min = min(x),
                        max = max(x))
  } else {
    # Non-numeric columns get NA for every statistic
    stats <- data.frame(variable = id,
                        mean = NA,
                        median = NA,
                        min = NA,
                        max = NA)
  }
  return(stats)
}

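# Statistics for a single column
output_data <- desc_stats(netflix, "time_since_release")

# Statistics for every column, returned as a list of one-row data frames
output_data <- lapply(names(netflix), function(x) desc_stats(netflix, x))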
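Calling a per-column function through `lapply()` yields a list of one-row data frames. If a single table is preferred, the rows can be stacked with `do.call(rbind, ...)`. The sketch below is self-contained: `col_stats` and `toy` are hypothetical stand-ins for the function and data used above.

```r
# Toy data frame standing in for netflix (values are invented)
toy <- data.frame(
  title = c("A", "Bb", "Ccc"),
  release_year = c(2001L, 2010L, 2020L),
  sales = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of statistics per column
col_stats <- function(input_data, id) {
  x <- input_data[[id]]
  if (is.numeric(x)) {
    data.frame(variable = id, mean = mean(x), median = median(x),
               min = min(x), max = max(x))
  } else {
    # NA_real_ keeps the column types numeric when rows are combined
    data.frame(variable = id, mean = NA_real_, median = NA_real_,
               min = NA_real_, max = NA_real_)
  }
}

# Build the list of one-row data frames, then stack them into one table
stats_list <- lapply(names(toy), function(x) col_stats(toy, x))
stats_table <- do.call(rbind, stats_list)
print(stats_table)
```

The result has one row per column of the input, with NA in every statistic for non-numeric columns.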

Split the data into two dataframes named tv_shows and movies.

tv_shows <- netflix[netflix$type == "TV Show",]
movies <- netflix[netflix$type == "Movie",]
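As an alternative sketch, `split()` produces both subsets in one call. The data frame `toy` and the `_toy` names below are made up for illustration:

```r
# Toy data frame with the same `type` labels as netflix (rows are invented)
toy <- data.frame(
  type = c("TV Show", "Movie", "Movie"),
  title = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per distinct type
by_type <- split(toy, toy$type)
tv_shows_toy <- by_type[["TV Show"]]
movies_toy <- by_type[["Movie"]]

print(nrow(tv_shows_toy))
print(nrow(movies_toy))
```

One difference worth knowing: if `type` contained missing values, logical indexing such as `netflix[netflix$type == "Movie", ]` would return all-NA rows for them, whereas `split()` drops those rows by default.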

Run two different regression models for tv_shows only, to examine the relationship between sales and other variables. The dependent variable is sales. Model 1: The independent variables are time_since_release, title_length, and duration_min_season. Interpret the regression coefficients. Model 2: In addition to the Model 1 variables, include country and rating in your regression.

model1 <- lm(sales ~ time_since_release + title_length + duration_min_season, data = tv_shows)
summary(model1)
## 
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season, 
##     data = tv_shows)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1851 -0.8260  0.0220  0.8634  4.4232 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         10.582291   0.218014  48.540  < 2e-16 ***
## time_since_release   0.110952   0.013725   8.084 2.07e-14 ***
## title_length         0.003825   0.009590   0.399    0.690    
## duration_min_season -0.041260   0.049880  -0.827    0.409    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.327 on 272 degrees of freedom
## Multiple R-squared:  0.2023, Adjusted R-squared:  0.1935 
## F-statistic: 22.99 on 3 and 272 DF,  p-value: 2.705e-13
model2 <- lm(sales ~ time_since_release + title_length + duration_min_season + country + rating, data = tv_shows)
summary(model2)
## 
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season + 
##     country + rating, data = tv_shows)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9568 -0.7011  0.0000  0.7414  4.6182 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.9972180  1.4997731   7.333 3.74e-12 ***
## time_since_release     0.1126505  0.0150598   7.480 1.52e-12 ***
## title_length           0.0006994  0.0105762   0.066   0.9473    
## duration_min_season   -0.0335500  0.0545726  -0.615   0.5393    
## countryArgentina      -0.5652755  1.1369133  -0.497   0.6195    
## countryAustralia      -1.0935898  0.9126535  -1.198   0.2320    
## countryBelgium        -0.6486392  1.4808798  -0.438   0.6618    
## countryBrazil         -0.6559955  0.8645632  -0.759   0.4488    
## countryCanada         -0.6366348  0.7412025  -0.859   0.3913    
## countryChina          -1.4990808  0.9012412  -1.663   0.0976 .  
## countryColombia       -2.1540295  1.4711609  -1.464   0.1445    
## countryDenmark         0.4839906  1.4804721   0.327   0.7440    
## countryFinland        -2.4408737  1.4818100  -1.647   0.1009    
## countryFrance          0.4741722  0.7611345   0.623   0.5339    
## countryGermany         0.5308010  0.9968998   0.532   0.5949    
## countryIndia          -0.1617089  0.7393902  -0.219   0.8271    
## countryIreland        -0.7294564  1.4835414  -0.492   0.6234    
## countryIsrael         -0.1477560  1.4797765  -0.100   0.9205    
## countryItaly          -0.0567180  1.1480545  -0.049   0.9606    
## countryJapan          -0.0142749  0.7160378  -0.020   0.9841    
## countryLebanon        -0.8891609  1.4716753  -0.604   0.5463    
## countryMalaysia       -0.2137256  0.9966154  -0.214   0.8304    
## countryMexico         -0.6511711  0.7643416  -0.852   0.3951    
## countryNetherlands    -1.7962104  1.4964340  -1.200   0.2312    
## countryNorway         -0.0868322  1.4910773  -0.058   0.9536    
## countryPoland          1.2450583  1.1376601   1.094   0.2749    
## countryRussia          1.1784119  1.5015316   0.785   0.4334    
## countrySingapore      -0.6119740  1.1258819  -0.544   0.5873    
## countrySouth Africa   -0.0741508  1.4795547  -0.050   0.9601    
## countrySouth Korea     0.1383327  0.6960442   0.199   0.8426    
## countrySpain          -1.4167888  0.8088916  -1.752   0.0812 .  
## countrySweden         -1.8650753  1.1390221  -1.637   0.1029    
## countryTaiwan         -0.2973155  0.7397119  -0.402   0.6881    
## countryThailand       -0.6986747  0.9952929  -0.702   0.4834    
## countryTurkey         -0.1748323  1.1262088  -0.155   0.8768    
## countryUnited Kingdom -0.2670385  0.6551171  -0.408   0.6839    
## countryUnited States  -0.1993425  0.6248096  -0.319   0.7500    
## ratingTV-14           -0.2484371  1.3597296  -0.183   0.8552    
## ratingTV-G            -0.2017103  1.4090380  -0.143   0.8863    
## ratingTV-MA            0.0514162  1.3580360   0.038   0.9698    
## ratingTV-PG           -0.0865257  1.3627093  -0.063   0.9494    
## ratingTV-Y            -0.2595788  1.3790049  -0.188   0.8509    
## ratingTV-Y7           -0.5532761  1.4036684  -0.394   0.6938    
## ratingTV-Y7-FV         1.0794285  1.9419844   0.556   0.5789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.339 on 232 degrees of freedom
## Multiple R-squared:  0.3073, Adjusted R-squared:  0.1789 
## F-statistic: 2.393 on 43 and 232 DF,  p-value: 1.835e-05

Run two different regression models for movies only, to examine the relationship between sales and other variables. The dependent variable is sales. Model 1: The independent variables are time_since_release, title_length, and duration_min_season. Interpret the regression coefficients. Model 2: In addition to the Model 1 variables, include country and rating in your regression.

model1_m <- lm(sales ~ time_since_release + title_length + duration_min_season, data = movies)
summary(model1_m)
## 
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1131 -1.0228  0.0509  0.9879  3.7867 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         17.545601   0.248095  70.721   <2e-16 ***
## time_since_release  -0.095526   0.005733 -16.664   <2e-16 ***
## title_length        -0.009509   0.005580  -1.704   0.0889 .  
## duration_min_season  0.497798   0.001964 253.509   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.417 on 651 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.991 
## F-statistic: 2.388e+04 on 3 and 651 DF,  p-value: < 2.2e-16
model2_m <- lm(sales ~ time_since_release + title_length + duration_min_season + country + rating, data = movies)
summary(model2_m)
## 
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season + 
##     country + rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0528 -0.8745  0.0077  0.9113  3.9765 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           21.172842   2.081806  10.170   <2e-16 ***
## time_since_release    -0.096582   0.006152 -15.698   <2e-16 ***
## title_length          -0.013723   0.006014  -2.282   0.0229 *  
## duration_min_season    0.500260   0.002405 208.006   <2e-16 ***
## countryArgentina      -0.036944   1.722311  -0.021   0.9829    
## countryAustralia      -0.031592   1.584831  -0.020   0.9841    
## countryAustria        -0.480890   1.813013  -0.265   0.7909    
## countryBelgium        -3.921930   2.112446  -1.857   0.0639 .  
## countryBrazil         -0.379702   1.567392  -0.242   0.8087    
## countryBulgaria       -0.790722   2.067051  -0.383   0.7022    
## countryCambodia       -1.433127   2.070775  -0.692   0.4892    
## countryCanada         -0.349329   1.521657  -0.230   0.8185    
## countryChile           0.404646   2.073227   0.195   0.8453    
## countryChina          -1.529484   1.676535  -0.912   0.3620    
## countryColombia       -1.931321   2.073636  -0.931   0.3520    
## countryDenmark        -3.704739   2.496485  -1.484   0.1383    
## countryEgypt          -0.213866   1.570051  -0.136   0.8917    
## countryFinland        -3.179871   2.070654  -1.536   0.1252    
## countryFrance         -0.462372   1.540853  -0.300   0.7642    
## countryGermany        -0.794812   1.616394  -0.492   0.6231    
## countryGhana          -1.836005   1.815533  -1.011   0.3123    
## countryGreece          0.969841   2.064729   0.470   0.6387    
## countryHong Kong      -0.260197   1.576185  -0.165   0.8689    
## countryIndia          -0.761111   1.528607  -0.498   0.6187    
## countryIndonesia      -1.148243   1.576239  -0.728   0.4666    
## countryIreland        -0.916516   1.717097  -0.534   0.5937    
## countryIsrael         -1.837470   1.716832  -1.070   0.2849    
## countryItaly          -0.590844   1.601708  -0.369   0.7123    
## countryJapan          -0.677441   1.558546  -0.435   0.6640    
## countryKuwait         -2.054212   2.077266  -0.989   0.3231    
## countryLebanon        -0.149511   2.065620  -0.072   0.9423    
## countryMalaysia       -1.607421   1.818178  -0.884   0.3770    
## countryMexico         -0.759188   1.573322  -0.483   0.6296    
## countryNamibia         0.435290   2.073009   0.210   0.8338    
## countryNetherlands    -0.859335   1.678664  -0.512   0.6089    
## countryNew Zealand    -2.045396   1.809385  -1.130   0.2588    
## countryNigeria        -1.126068   1.572059  -0.716   0.4741    
## countryNorway         -0.507200   1.721121  -0.295   0.7683    
## countryPakistan        0.179190   1.625634   0.110   0.9123    
## countryPhilippines    -0.762897   1.589362  -0.480   0.6314    
## countrySaudi Arabia   -1.387413   1.619457  -0.857   0.3919    
## countrySerbia          0.555125   2.067971   0.268   0.7885    
## countrySingapore      -1.251333   2.070877  -0.604   0.5459    
## countrySlovenia       -0.119018   2.067313  -0.058   0.9541    
## countrySouth Africa   -0.745710   1.669704  -0.447   0.6553    
## countrySouth Korea    -0.648960   1.610003  -0.403   0.6870    
## countrySpain           0.245309   1.562820   0.157   0.8753    
## countrySweden         -2.162021   2.070486  -1.044   0.2968    
## countryTaiwan         -1.881120   1.816024  -1.036   0.3007    
## countryThailand       -2.017988   1.812355  -1.113   0.2660    
## countryTurkey         -1.453623   1.573467  -0.924   0.3559    
## countryUnited Kingdom -0.412159   1.523608  -0.271   0.7869    
## countryUnited States  -0.407710   1.505265  -0.271   0.7866    
## countryUruguay        -0.709834   2.073396  -0.342   0.7322    
## ratingNR              -2.860948   1.485287  -1.926   0.0546 .  
## ratingPG              -3.656433   1.434041  -2.550   0.0110 *  
## ratingPG-13           -3.310065   1.430281  -2.314   0.0210 *  
## ratingR               -3.548990   1.422994  -2.494   0.0129 *  
## ratingTV-14           -3.234192   1.422158  -2.274   0.0233 *  
## ratingTV-G            -3.041819   1.455434  -2.090   0.0370 *  
## ratingTV-MA           -3.068279   1.418863  -2.162   0.0310 *  
## ratingTV-PG           -3.084497   1.427958  -2.160   0.0312 *  
## ratingTV-Y            -3.127097   1.500857  -2.084   0.0376 *  
## ratingTV-Y7           -3.363143   1.473114  -2.283   0.0228 *  
## ratingTV-Y7-FV               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.41 on 591 degrees of freedom
## Multiple R-squared:  0.9919, Adjusted R-squared:  0.991 
## F-statistic:  1148 on 63 and 591 DF,  p-value: < 2.2e-16

What are the differences you observe in your results for the regression outputs for tv_shows and movies? Which variables are significant?

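Using a 0.05 significance level, the two groups behave differently. For tv_shows, only time_since_release is significant in both models (p < 0.001), and its coefficient is positive: each additional year since release is associated with about 0.11 higher sales, holding the other variables constant. title_length and duration_min_season are not significant (p = 0.690 and p = 0.409 in Model 1), and none of the country or rating terms is significant at 0.05. Model 1 explains about 20% of the variance in sales (R-squared 0.2023). Adding country and rating raises R-squared to 0.3073 but lowers adjusted R-squared from 0.1935 to 0.1789, so the extra terms do not improve the fit.

For movies, time_since_release and duration_min_season are both highly significant (p < 2e-16) in both models. Sales decrease by about 0.096 per additional year since release and increase by about 0.50 per additional unit of duration_min_season. title_length has a small negative coefficient that is only marginally significant in Model 1 (p = 0.0889) and significant in Model 2 (p = 0.0229). In Model 2 none of the country terms is significant, but most rating terms are (p < 0.05, with ratingNR marginal at p = 0.0546), and ratingTV-Y7-FV could not be estimated because of a singularity. Both movie models have an R-squared of about 0.99, far higher than for tv_shows.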
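With dozens of country and rating terms, reading p-values off the printed summary is error-prone. A minimal sketch of filtering the coefficient table programmatically, using simulated data (the variables `x1`, `x2`, `y` and the 0.05 cutoff are assumptions for illustration, not the netflix models):

```r
# Simulated data: y depends on x1 but not on x2
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# coef(summary(fit)) is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coef_table <- coef(summary(fit))

# Keep only the terms whose p-value is below 0.05
significant <- coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(significant)
```

The same two lines could be applied to any fitted `lm` object, for example `coef(summary(model2_m))`, to list the terms that pass the chosen threshold.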

rmarkdown::render("yourfile.Rmd", output_file = "path/to/yourfile.html", output_format = "html_document")