BONUS: Place the original .csv file in a GitHub repository and have R read it from the link.
urlfile <- 'https://raw.githubusercontent.com/D-hartog/csv_file/main/mcu_films.csv'
mcu <- read.csv(urlfile)
head(mcu)
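# Drop the X column (presumably a row index left over from exporting the csv)
mcu$X <- NULL
# Convert release_date from text to Date, allowing either "-" or "/" as the separator
mcu$release_date <- as.Date(mcu$release_date, tryFormats = c("%m-%d-%Y", "%m/%d/%Y"))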
head(mcu)
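One detail of as.Date is worth keeping in mind here: when tryFormats is used, R tests the formats against the first non-missing value only and then applies the first format that fits to the whole vector. The sketch below illustrates this with two made-up date strings (they are not values from the mcu file), followed by one possible workaround for a column that mixes both separators.

```r
# Hypothetical date strings that mix the two separators (not taken from the mcu data)
raw_dates <- c("05-02-2008", "06/13/2008")

# tryFormats is matched against the first non-NA element only,
# so "%m-%d-%Y" is chosen and the second string fails to parse (NA)
one_pass <- as.Date(raw_dates, tryFormats = c("%m-%d-%Y", "%m/%d/%Y"))
print(one_pass)

# Assumed workaround for mixed columns: parse with each format, then fill the gaps
dash  <- as.Date(raw_dates, format = "%m-%d-%Y")
slash <- as.Date(raw_dates, format = "%m/%d/%Y")
both  <- dash
both[is.na(both)] <- slash[is.na(both)]
print(both)
```

If every value in release_date uses the same separator, the single as.Date call above is enough.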
a. Use the summary function to gain an overview of the data set.
summary(mcu)
##       movie           length_hrs      length_min     release_date
##  Length:23          Min.   :1.000   Min.   : 1.0   Min.   :2008-05-02
##  Class :character   1st Qu.:1.500   1st Qu.: 7.5   1st Qu.:2012-11-02
##  Mode  :character   Median :2.000   Median :16.0   Median :2015-07-17
##                     Mean   :1.783   Mean   :23.3   Mean   :2015-02-18
##                     3rd Qu.:2.000   3rd Qu.:40.5   3rd Qu.:2017-12-25
##                     Max.   :3.000   Max.   :58.0   Max.   :2019-07-02
##  opening_weekend_us       gross_us           gross_world
##  Min.   : 55414050    Min.   :134806913   Min.   :2.648e+08
##  1st Qu.: 85398076    1st Qu.:224645330   1st Qu.:6.233e+08
##  Median :117027503    Median :333718600   Median :8.540e+08
##  Mean   :135096585    Mean   :371600489   Mean   :9.821e+08
##  3rd Qu.:176641864    3rd Qu.:417921916   3rd Qu.:1.184e+09
##  Max.   :357115007    Max.   :858373000   Max.   :2.798e+09
b. Display the mean and median of the last two attributes: gross_us and gross_world
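# Collect the two revenue columns under readable labels
revenue <- list("US Gross" = mcu$gross_us, "World Gross" = mcu$gross_world)
# Apply each statistic to both columns; sapply returns a named numeric vector
mean_revenue <- sapply(revenue, mean)
median_revenue <- sapply(revenue, median)
revenuedf <- data.frame(mean_revenue, median_revenue)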
revenuedf
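As an alternative, both statistics can be computed in a single pass over the columns. This is only a sketch: the toy data frame and its values are made up for illustration and only reuse the column names gross_us and gross_world.

```r
# Toy data frame with invented values (not the mcu data)
toy <- data.frame(gross_us = c(100, 200, 600),
                  gross_world = c(300, 500, 1000))

# For each column, return a named vector holding both statistics;
# sapply binds the results into a matrix with one column per variable
stats <- sapply(toy, function(col) c(mean = mean(col), median = median(col)))

# Transpose so that each revenue column becomes a row, then convert to a data frame
stats_df <- as.data.frame(t(stats))
print(stats_df)
```

The result has one row per revenue column and one column per statistic, the same layout as revenuedf.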
Create a new data frame with a subset of the COLUMNS AND ROWS. Make sure to rename it.
# Subset the first five MCU movies, keeping the title, release date, and gross revenue columns
firstfive <- mcu[1:5, c("movie", "release_date", "gross_us", "gross_world")]
firstfive
Create new column names for the new data frame.
# colnames<- assigns by position, so the new names are listed in the current column order
# (movie, release_date, gross_us, gross_world)
colnames(firstfive) <- c("Movie_Title", "Date", "US_Gross", "World_Gross")
firstfive
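Because colnames<- works by position, renaming depends on the column order. A minimal sketch of renaming by old name instead is shown below; the toy data frame and the lookup vector are hypothetical and exist only for this illustration.

```r
# Toy data frame whose columns are deliberately in a different order
toy <- data.frame(gross_us = 1, movie = "A")

# Hypothetical lookup: names are the old column names, values are the new ones
lookup <- c(movie = "Movie_Title", gross_us = "US_Gross")

# match() finds where each old name sits, so the order of the columns does not matter
names(toy)[match(names(lookup), names(toy))] <- lookup
print(names(toy))
```

Here gross_us becomes US_Gross and movie becomes Movie_Title even though the lookup lists them in the opposite order.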
a. Use the summary function to create an overview of your new data frame.
summary(firstfive)
##    Movie_Title             Date              US_Gross
##  Length:5           Min.   :2008-05-02   Min.   :134806913
##  Class :character   1st Qu.:2008-06-12   1st Qu.:176654505
##  Mode  :character   Median :2010-05-07   Median :181030624
##                     Mean   :2010-01-02   Mean   :224791900
##                     3rd Qu.:2011-05-06   3rd Qu.:312433331
##                     Max.   :2011-07-22   Max.   :319034126
##   World_Gross
##  Min.   :264770996
##  1st Qu.:370569774
##  Median :449326618
##  Mean   :458879393
##  3rd Qu.:585796247
##  Max.   :623933331
b. Print the mean and median for the same two attributes. Please compare.
# Same approach as for the full data set, applied to the renamed subset
firstfive_rev <- list("US Gross" = firstfive$US_Gross,
                      "World Gross" = firstfive$World_Gross)
firstfive_mean <- sapply(firstfive_rev, mean)
firstfive_med <- sapply(firstfive_rev, median)
firstfive_revdf <- data.frame(firstfive_mean, firstfive_med)
head(firstfive_revdf)
The average US gross for the first five movies was about 147 million dollars lower than the average for all 23 films released through 2019 (roughly 225 million versus 372 million).
Among the first five MCU films, even the highest US gross (about 319 million) was below the median US gross of the full data set (about 334 million).
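The comparison can be checked with a few lines of arithmetic. The numbers below are copied from the two summary() outputs above; the variable names are introduced only for this sketch.

```r
# Values taken from the summary() output of mcu and firstfive
all_mean_us    <- 371600489  # mean US gross, all 23 films
first5_mean_us <- 224791900  # mean US gross, first five films
all_median_us  <- 333718600  # median US gross, all 23 films
first5_max_us  <- 319034126  # highest US gross among the first five

# Gap between the two means, in millions of dollars
mean_gap_millions <- round((all_mean_us - first5_mean_us) / 1e6)
print(mean_gap_millions)

# Is the best of the first five still below the overall median?
below_median <- first5_max_us < all_median_us
print(below_median)
```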
Please rename at least 3 values in a column.
Since the MCU films data did not have many categorical columns with three or more distinct values, I decided to use the running time of each movie to create a new column with a category label.
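# Total running time in minutes
mcu$Total_min <- (mcu$length_hrs * 60) + mcu$length_min
# Label each film by running time; >= 120 ensures a film of exactly two hours is not left as NA
mcu$Duration[mcu$Total_min < 120] <- "Short"
mcu$Duration[mcu$Total_min >= 120 & mcu$Total_min < 150] <- "Moderate"
mcu$Duration[mcu$Total_min >= 150] <- "Very Long"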
tail(mcu)
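The same three-way labeling can also be written with cut(), which assigns every value to exactly one interval. This is a sketch on made-up running times, not the mcu data; the break points 120 and 150 and the labels follow the categories above, and right = FALSE is my choice so that a film of exactly 120 minutes counts as "Moderate".

```r
# Hypothetical running times in minutes, including both boundary values
total_min <- c(111, 120, 149, 150, 181)

# Intervals are [-Inf, 120), [120, 150), [150, Inf) because right = FALSE
duration <- cut(total_min,
                breaks = c(-Inf, 120, 150, Inf),
                labels = c("Short", "Moderate", "Very Long"),
                right = FALSE)

print(data.frame(total_min, duration))
```

cut() returns a factor, so the labels keep their order (Short, Moderate, Very Long) in later tables or plots.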