Load the dataset in R and store the dataframe in a new variable called “netflix”.
# Read the CSV once and store the result as netflix
netflix <- read.csv("~/Downloads/NetflixData.csv", header = TRUE)
head(netflix)
## X type title country
## 1 656 TV Show Russell Peters vs. the World United States
## 2 657 TV Show Nature's Great Events (2009) United States
## 3 658 TV Show Sammy & Co France
## 4 659 TV Show Ninja Turtles: The Next Mutation Canada
## 5 660 TV Show Strawberry Shortcake: Berry Bitty Adventures France
## 6 661 TV Show V.R. Troopers United States
## date_added release_year rating duration_min_season sales
## 1 14-Oct-13 2013 NR 1 11.91041
## 2 10-Oct-15 2009 TV-PG 1 10.60035
## 3 25-Dec-15 2014 TV-Y 1 14.43723
## 4 1-Jan-16 1997 TV-G 1 14.26843
## 5 1-Jan-16 2015 TV-Y 1 13.48277
## 6 1-Jan-16 1995 TV-G 2 14.68985
dim(netflix)
## [1] 931 9
str(netflix)
## 'data.frame': 931 obs. of 9 variables:
## $ X : int 656 657 658 659 660 661 662 663 664 665 ...
## $ type : chr "TV Show" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Russell Peters vs. the World" "Nature's Great Events (2009)" "Sammy & Co" "Ninja Turtles: The Next Mutation" ...
## $ country : chr "United States" "United States" "France" "Canada" ...
## $ date_added : chr "14-Oct-13" "10-Oct-15" "25-Dec-15" "1-Jan-16" ...
## $ release_year : int 2013 2009 2014 1997 2015 1995 2003 2007 2016 2013 ...
## $ rating : chr "NR" "TV-PG" "TV-Y" "TV-G" ...
## $ duration_min_season: int 1 1 1 1 1 2 1 1 2 1 ...
## $ sales : num 11.9 10.6 14.4 14.3 13.5 ...
Extract the value of year from Sys.time() and save it in a new variable named current_year. Inspect the class of current_year. Convert it to a numeric data class.
current_year <- as.numeric(format(Sys.time(), "%Y"))
class(current_year)
## [1] "numeric"
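The question asks for the class to be inspected before the conversion, a step the one-line version above skips. A minimal sketch of the two steps separately, where `year_chr` is a hypothetical intermediate name introduced only for illustration:

```r
# format() returns text, so the year starts out as a character string
year_chr <- format(Sys.time(), "%Y")
class(year_chr)

# Convert to numeric so the year can be used in arithmetic
current_year <- as.numeric(year_chr)
class(current_year)
```

Keeping the intermediate value makes it visible that `format()` produces a character vector, which is why the `as.numeric()` call is needed before subtracting `release_year`.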
Create a new column in the netflix data frame you created in question (1). Name this column time_since_release and assign it the value of the current year minus release_year. This variable gives the number of years since the release of the TV show or movie.
netflix$time_since_release <- current_year - netflix$release_year
Create a new column named title_length. Assign it the number of characters in the title of each TV show or movie.
netflix$title_length <- nchar(as.character(netflix$title))
Inspect the class of each column in the dataframe. Instead of doing this one by one for each column, can you write a “for” loop over the columns of the dataframe that prints the class of each column?
for (col in names(netflix)) {
  cat(col, ": ", class(netflix[[col]]), "\n", sep = "")
}
## X: integer
## type: character
## title: character
## country: character
## date_added: character
## release_year: integer
## rating: character
## duration_min_season: integer
## sales: numeric
## time_since_release: numeric
## title_length: integer
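As a possible alternative to the explicit loop, the same information can be collected with `sapply()`. This is a sketch on a small stand-in data frame (`toy`, with made-up values) rather than the Netflix data:

```r
# Small stand-in data frame (hypothetical values, not the Netflix data)
toy <- data.frame(
  title = c("A", "BB"),
  release_year = c(2013L, 2009L),
  sales = c(11.9, 10.6),
  stringsAsFactors = FALSE
)

# sapply() applies class() to each column and returns a named character vector
col_classes <- sapply(toy, class)
print(col_classes)
```

One caveat: if a column has more than one class (for example a date-time column), `sapply()` returns a list instead of a character vector, so the loop version can be easier to read in that case.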
Compute the descriptive statistics (mean, median, min, max) for the time_since_release variable you created.
mean(netflix$time_since_release)
## [1] 9.137487
median(netflix$time_since_release)
## [1] 6
min(netflix$time_since_release)
## [1] 2
max(netflix$time_since_release)
## [1] 81
Can you compute the descriptive statistics for all the numerical variables in the data without repeating your code for each variable?
# is.numeric() is TRUE for both integer and double columns; other columns return NULL
lapply(netflix, function(x) {
  if (is.numeric(x)) {
    c(mean = mean(x), median = median(x), min = min(x), max = max(x))
  }
})
## $X
## mean median min max
## 466 466 1 931
##
## $type
## NULL
##
## $title
## NULL
##
## $country
## NULL
##
## $date_added
## NULL
##
## $release_year
## mean median min max
## 2013.863 2017.000 1942.000 2021.000
##
## $rating
## NULL
##
## $duration_min_season
## mean median min max
## 71.06767 90.00000 1.00000 312.00000
##
## $sales
## mean median min max
## 50.00629 60.72718 6.75764 173.76761
##
## $time_since_release
## mean median min max
## 9.137487 6.000000 2.000000 81.000000
##
## $title_length
## mean median min max
## 17.2449 15.0000 1.0000 73.0000
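The output above includes a `NULL` entry for every non-numeric column. One way to avoid that, sketched here on a made-up data frame (`toy`), is to drop the non-numeric columns first and let `sapply()` arrange the results as a table:

```r
# Small stand-in data frame (hypothetical values, not the Netflix data)
toy <- data.frame(
  title = c("A", "BB", "CCC"),
  release_year = c(2013L, 2009L, 2014L),
  sales = c(11.9, 10.6, 14.4),
  stringsAsFactors = FALSE
)

# Keep only numeric columns (is.numeric() is TRUE for integer and double)
numeric_cols <- Filter(is.numeric, toy)

# One column of results per variable, one row per statistic
stats_table <- sapply(numeric_cols, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
})
print(stats_table)
```

The result is a matrix with the statistics as rows and the variables as columns, which is easier to scan than a list with `NULL` placeholders.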
Optional Bonus Question (+2 points) Can you create a function that takes two inputs: input_data (a dataframe) and id (a column id).
desc_stats <- function(input_data, id) {
  x <- input_data[[id]]
  # is.numeric() covers integer and double columns and always returns one value
  if (is.numeric(x)) {
    stats <- data.frame(variable = id,
                        mean = mean(x),
                        median = median(x),
                        min = min(x),
                        max = max(x))
  } else {
    # Non-numeric columns get NA for every statistic
    stats <- data.frame(variable = id,
                        mean = NA,
                        median = NA,
                        min = NA,
                        max = NA)
  }
  return(stats)
}
# Statistics for a single column
output_data <- desc_stats(netflix, "time_since_release")
# Statistics for every column (replaces the single-column result with a list of one-row data frames)
output_data <- lapply(names(netflix), function(x) desc_stats(netflix, x))
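The `lapply()` call returns a list of one-row data frames. If a single summary table is preferred, the rows can be stacked with `do.call(rbind, ...)`. The sketch below uses a condensed stand-in for `desc_stats()` and a made-up data frame (`toy`), so that it runs on its own; `stats_list` and `stats_df` are hypothetical names:

```r
# Condensed stand-in for desc_stats(): returns a one-row data frame per column
desc_stats <- function(input_data, id) {
  x <- input_data[[id]]
  if (!is.numeric(x)) {
    return(data.frame(variable = id, mean = NA, median = NA, min = NA, max = NA))
  }
  data.frame(variable = id, mean = mean(x), median = median(x),
             min = min(x), max = max(x))
}

# Small stand-in data frame (hypothetical values, not the Netflix data)
toy <- data.frame(
  title = c("A", "BB", "CCC"),
  sales = c(11.9, 10.6, 14.4),
  stringsAsFactors = FALSE
)

# Stack the one-row results into a single summary table
stats_list <- lapply(names(toy), function(col) desc_stats(toy, col))
stats_df <- do.call(rbind, stats_list)
print(stats_df)
```

Because every call returns the same five columns, `rbind()` can combine the rows without further alignment.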
Split the data into two dataframes named tv_shows and movies.
tv_shows <- netflix[netflix$type == "TV Show",]
movies <- netflix[netflix$type == "Movie",]
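As an alternative sketch, `split()` can produce both subsets in one call and makes it easy to check that no rows were lost. The data frame `toy` below is made up for illustration, and `by_type` is a hypothetical name:

```r
# Small stand-in data frame (hypothetical values, not the Netflix data)
toy <- data.frame(
  type = c("TV Show", "Movie", "Movie", "TV Show"),
  sales = c(11.9, 60.7, 55.2, 10.6),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per value of type
by_type <- split(toy, toy$type)
tv_shows <- by_type[["TV Show"]]
movies <- by_type[["Movie"]]

# Check that the two subsets together account for every row
nrow(tv_shows) + nrow(movies) == nrow(toy)
```

The same row-count check is worth running on the real data, since any `type` value other than "TV Show" or "Movie" would silently fall out of both subsets.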
Run two different regression models for tv_shows only, to examine the relationship between sales and other variables. The dependent variable is sales. Model 1: The independent variables are time_since_release, title_length, duration_min_season. Interpret the regression coefficients. Model 2: In addition to Model 1 variables, include country and rating in your regression.
model1 <- lm(sales ~ time_since_release + title_length + duration_min_season, data = tv_shows)
summary(model1)
##
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season,
## data = tv_shows)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1851 -0.8260 0.0220 0.8634 4.4232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.582291 0.218014 48.540 < 2e-16 ***
## time_since_release 0.110952 0.013725 8.084 2.07e-14 ***
## title_length 0.003825 0.009590 0.399 0.690
## duration_min_season -0.041260 0.049880 -0.827 0.409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.327 on 272 degrees of freedom
## Multiple R-squared: 0.2023, Adjusted R-squared: 0.1935
## F-statistic: 22.99 on 3 and 272 DF, p-value: 2.705e-13
model2 <- lm(sales ~ time_since_release + title_length + duration_min_season + country + rating, data = tv_shows)
summary(model2)
##
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season +
## country + rating, data = tv_shows)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9568 -0.7011 0.0000 0.7414 4.6182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9972180 1.4997731 7.333 3.74e-12 ***
## time_since_release 0.1126505 0.0150598 7.480 1.52e-12 ***
## title_length 0.0006994 0.0105762 0.066 0.9473
## duration_min_season -0.0335500 0.0545726 -0.615 0.5393
## countryArgentina -0.5652755 1.1369133 -0.497 0.6195
## countryAustralia -1.0935898 0.9126535 -1.198 0.2320
## countryBelgium -0.6486392 1.4808798 -0.438 0.6618
## countryBrazil -0.6559955 0.8645632 -0.759 0.4488
## countryCanada -0.6366348 0.7412025 -0.859 0.3913
## countryChina -1.4990808 0.9012412 -1.663 0.0976 .
## countryColombia -2.1540295 1.4711609 -1.464 0.1445
## countryDenmark 0.4839906 1.4804721 0.327 0.7440
## countryFinland -2.4408737 1.4818100 -1.647 0.1009
## countryFrance 0.4741722 0.7611345 0.623 0.5339
## countryGermany 0.5308010 0.9968998 0.532 0.5949
## countryIndia -0.1617089 0.7393902 -0.219 0.8271
## countryIreland -0.7294564 1.4835414 -0.492 0.6234
## countryIsrael -0.1477560 1.4797765 -0.100 0.9205
## countryItaly -0.0567180 1.1480545 -0.049 0.9606
## countryJapan -0.0142749 0.7160378 -0.020 0.9841
## countryLebanon -0.8891609 1.4716753 -0.604 0.5463
## countryMalaysia -0.2137256 0.9966154 -0.214 0.8304
## countryMexico -0.6511711 0.7643416 -0.852 0.3951
## countryNetherlands -1.7962104 1.4964340 -1.200 0.2312
## countryNorway -0.0868322 1.4910773 -0.058 0.9536
## countryPoland 1.2450583 1.1376601 1.094 0.2749
## countryRussia 1.1784119 1.5015316 0.785 0.4334
## countrySingapore -0.6119740 1.1258819 -0.544 0.5873
## countrySouth Africa -0.0741508 1.4795547 -0.050 0.9601
## countrySouth Korea 0.1383327 0.6960442 0.199 0.8426
## countrySpain -1.4167888 0.8088916 -1.752 0.0812 .
## countrySweden -1.8650753 1.1390221 -1.637 0.1029
## countryTaiwan -0.2973155 0.7397119 -0.402 0.6881
## countryThailand -0.6986747 0.9952929 -0.702 0.4834
## countryTurkey -0.1748323 1.1262088 -0.155 0.8768
## countryUnited Kingdom -0.2670385 0.6551171 -0.408 0.6839
## countryUnited States -0.1993425 0.6248096 -0.319 0.7500
## ratingTV-14 -0.2484371 1.3597296 -0.183 0.8552
## ratingTV-G -0.2017103 1.4090380 -0.143 0.8863
## ratingTV-MA 0.0514162 1.3580360 0.038 0.9698
## ratingTV-PG -0.0865257 1.3627093 -0.063 0.9494
## ratingTV-Y -0.2595788 1.3790049 -0.188 0.8509
## ratingTV-Y7 -0.5532761 1.4036684 -0.394 0.6938
## ratingTV-Y7-FV 1.0794285 1.9419844 0.556 0.5789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.339 on 232 degrees of freedom
## Multiple R-squared: 0.3073, Adjusted R-squared: 0.1789
## F-statistic: 2.393 on 43 and 232 DF, p-value: 1.835e-05
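In the two TV show models, multiple R-squared rises from 0.2023 to 0.3073 when country and rating are added, but adjusted R-squared falls from 0.1935 to 0.1789, which suggests the 40 extra terms add little. Because Model 1 is nested in Model 2, one way to test this formally is an F-test with `anova()`. The sketch below uses simulated data (all names and values are hypothetical), not the Netflix data:

```r
set.seed(1)
n <- 120

# Simulated predictors: one numeric, one categorical with three levels
sim <- data.frame(
  age = runif(n, 1, 20),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# Simulated outcome depends on age only; group is pure noise by construction
sim$y <- 10 + 0.1 * sim$age + rnorm(n)

small <- lm(y ~ age, data = sim)
large <- lm(y ~ age + group, data = sim)

# F-test of whether the extra group terms improve the fit
comparison <- anova(small, large)
print(comparison)

# Adjusted R-squared penalizes the extra parameters
c(small = summary(small)$adj.r.squared, large = summary(large)$adj.r.squared)
```

Applied to the document's models, the equivalent call would be `anova(model1, model2)`, provided both were fitted on the same rows.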
Run two different regression models for movies only, to examine the relationship between sales and other variables. The dependent variable is sales. Model 1: The independent variables are time_since_release, title_length, duration_min_season. Interpret the regression coefficients. Model 2: In addition to Model 1 variables, include country and rating in your regression.
model1_m <- lm(sales ~ time_since_release + title_length + duration_min_season, data = movies)
summary(model1_m)
##
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1131 -1.0228 0.0509 0.9879 3.7867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.545601 0.248095 70.721 <2e-16 ***
## time_since_release -0.095526 0.005733 -16.664 <2e-16 ***
## title_length -0.009509 0.005580 -1.704 0.0889 .
## duration_min_season 0.497798 0.001964 253.509 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.417 on 651 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.991
## F-statistic: 2.388e+04 on 3 and 651 DF, p-value: < 2.2e-16
model2_m <- lm(sales ~ time_since_release + title_length + duration_min_season + country + rating, data = movies)
summary(model2_m)
##
## Call:
## lm(formula = sales ~ time_since_release + title_length + duration_min_season +
## country + rating, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0528 -0.8745 0.0077 0.9113 3.9765
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.172842 2.081806 10.170 <2e-16 ***
## time_since_release -0.096582 0.006152 -15.698 <2e-16 ***
## title_length -0.013723 0.006014 -2.282 0.0229 *
## duration_min_season 0.500260 0.002405 208.006 <2e-16 ***
## countryArgentina -0.036944 1.722311 -0.021 0.9829
## countryAustralia -0.031592 1.584831 -0.020 0.9841
## countryAustria -0.480890 1.813013 -0.265 0.7909
## countryBelgium -3.921930 2.112446 -1.857 0.0639 .
## countryBrazil -0.379702 1.567392 -0.242 0.8087
## countryBulgaria -0.790722 2.067051 -0.383 0.7022
## countryCambodia -1.433127 2.070775 -0.692 0.4892
## countryCanada -0.349329 1.521657 -0.230 0.8185
## countryChile 0.404646 2.073227 0.195 0.8453
## countryChina -1.529484 1.676535 -0.912 0.3620
## countryColombia -1.931321 2.073636 -0.931 0.3520
## countryDenmark -3.704739 2.496485 -1.484 0.1383
## countryEgypt -0.213866 1.570051 -0.136 0.8917
## countryFinland -3.179871 2.070654 -1.536 0.1252
## countryFrance -0.462372 1.540853 -0.300 0.7642
## countryGermany -0.794812 1.616394 -0.492 0.6231
## countryGhana -1.836005 1.815533 -1.011 0.3123
## countryGreece 0.969841 2.064729 0.470 0.6387
## countryHong Kong -0.260197 1.576185 -0.165 0.8689
## countryIndia -0.761111 1.528607 -0.498 0.6187
## countryIndonesia -1.148243 1.576239 -0.728 0.4666
## countryIreland -0.916516 1.717097 -0.534 0.5937
## countryIsrael -1.837470 1.716832 -1.070 0.2849
## countryItaly -0.590844 1.601708 -0.369 0.7123
## countryJapan -0.677441 1.558546 -0.435 0.6640
## countryKuwait -2.054212 2.077266 -0.989 0.3231
## countryLebanon -0.149511 2.065620 -0.072 0.9423
## countryMalaysia -1.607421 1.818178 -0.884 0.3770
## countryMexico -0.759188 1.573322 -0.483 0.6296
## countryNamibia 0.435290 2.073009 0.210 0.8338
## countryNetherlands -0.859335 1.678664 -0.512 0.6089
## countryNew Zealand -2.045396 1.809385 -1.130 0.2588
## countryNigeria -1.126068 1.572059 -0.716 0.4741
## countryNorway -0.507200 1.721121 -0.295 0.7683
## countryPakistan 0.179190 1.625634 0.110 0.9123
## countryPhilippines -0.762897 1.589362 -0.480 0.6314
## countrySaudi Arabia -1.387413 1.619457 -0.857 0.3919
## countrySerbia 0.555125 2.067971 0.268 0.7885
## countrySingapore -1.251333 2.070877 -0.604 0.5459
## countrySlovenia -0.119018 2.067313 -0.058 0.9541
## countrySouth Africa -0.745710 1.669704 -0.447 0.6553
## countrySouth Korea -0.648960 1.610003 -0.403 0.6870
## countrySpain 0.245309 1.562820 0.157 0.8753
## countrySweden -2.162021 2.070486 -1.044 0.2968
## countryTaiwan -1.881120 1.816024 -1.036 0.3007
## countryThailand -2.017988 1.812355 -1.113 0.2660
## countryTurkey -1.453623 1.573467 -0.924 0.3559
## countryUnited Kingdom -0.412159 1.523608 -0.271 0.7869
## countryUnited States -0.407710 1.505265 -0.271 0.7866
## countryUruguay -0.709834 2.073396 -0.342 0.7322
## ratingNR -2.860948 1.485287 -1.926 0.0546 .
## ratingPG -3.656433 1.434041 -2.550 0.0110 *
## ratingPG-13 -3.310065 1.430281 -2.314 0.0210 *
## ratingR -3.548990 1.422994 -2.494 0.0129 *
## ratingTV-14 -3.234192 1.422158 -2.274 0.0233 *
## ratingTV-G -3.041819 1.455434 -2.090 0.0370 *
## ratingTV-MA -3.068279 1.418863 -2.162 0.0310 *
## ratingTV-PG -3.084497 1.427958 -2.160 0.0312 *
## ratingTV-Y -3.127097 1.500857 -2.084 0.0376 *
## ratingTV-Y7 -3.363143 1.473114 -2.283 0.0228 *
## ratingTV-Y7-FV NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.41 on 591 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.991
## F-statistic: 1148 on 63 and 591 DF, p-value: < 2.2e-16
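The output above reports "1 not defined because of singularities" and an `NA` row for ratingTV-Y7-FV. This happens when a predictor column is an exact linear combination of others. One plausible cause, which is an assumption here and not verified against the Netflix data, is that the rows with that rating coincide exactly with the rows of some country level. The sketch below reproduces the effect on made-up data:

```r
# Made-up data in which every "C" row is also an "R" row and vice versa
toy <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.2, 6.9, 7.4, 8.0),
  country = factor(c("A", "A", "A", "B", "B", "B", "C", "C")),
  rating = factor(c("G", "PG", "G", "PG", "G", "PG", "R", "R"))
)

# The cross-tabulation shows that rating R occurs only with country C
table(toy$country, toy$rating)

# The dummy for ratingR duplicates the dummy for countryC,
# so lm() cannot estimate both and reports NA for the later one
fit <- lm(y ~ country + rating, data = toy)
coef(fit)
```

A cross-tabulation of `movies$country` against `movies$rating` would show whether the same pattern explains the `NA` in the movie model.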
What are the differences you observe in your results for the regression outputs for tv_shows and movies? Which variables are significant?
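Using a 5% significance level (p-value < 0.05), the results differ between the two groups. For tv_shows, only time_since_release is significant in both models (coefficient about 0.11, p < 0.001), so each additional year since release is associated with sales that are about 0.11 units higher. title_length and duration_min_season are not significant (p = 0.690 and p = 0.409 in Model 1), and none of the country or rating levels is significant at the 5% level. For movies, time_since_release (about -0.096) and duration_min_season (about 0.50) are highly significant in both models, so sales decrease as time_since_release increases and increase with duration. title_length has a negative coefficient that is only marginally significant in Model 1 (p = 0.0889) and significant in Model 2 (p = 0.0229). In Model 2 for movies, no country level is significant at the 5% level, while every estimated rating level except NR (p = 0.0546) is significant; the coefficient for ratingTV-Y7-FV is NA because of a singularity. The movie models also fit far better (R-squared of about 0.99) than the TV show models (R-squared of 0.20 and 0.31).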
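With dozens of country and rating terms, reading p-values off the printed table is error-prone. A minimal sketch of filtering the coefficient table at the 5% level, using simulated data (the names `sim`, `fit`, `coef_table`, and `alpha` are hypothetical):

```r
set.seed(42)
n <- 100

# Simulated predictors
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Simulated outcome: x1 has a real effect, x2 has none by construction
sim$y <- 2 + 3 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)

# summary()$coefficients is a matrix with one row per term
coef_table <- summary(fit)$coefficients

# Keep rows whose p-value is below the 5% threshold
alpha <- 0.05
significant <- coef_table[coef_table[, "Pr(>|t|)"] < alpha, , drop = FALSE]
print(significant)
```

The same filter could be applied to `summary(model2_m)$coefficients`. Note that `summary()` omits terms that could not be estimated, so an `NA` coefficient would not appear in that matrix.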
rmarkdown::render("yourfile.Rmd", output_file = "path/to/yourfile.html", output_format = "html_document")