R-programming: Homework wk2

Read in a .csv file from GitHub using the raw data link.

BONUS – place the original .csv in a GitHub repository and have R read it from the link.

urlfile <- 'https://raw.githubusercontent.com/D-hartog/csv_file/main/mcu_films.csv'
mcu <- read.csv(urlfile)
head(mcu)

I cleaned up the data frame by:

Dropping the column labeled “X”, which carried over the row index labels from the original file
Converting the “release_date” column to the Date type so that date functions can be applied to it later

mcu$X <- NULL
mcu$release_date <- as.Date(mcu$release_date, tryFormats = c("%m-%d-%Y", "%m/%d/%Y"))
head(mcu)
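One thing to watch with tryFormats: as.Date() picks the first format that parses the first non-missing element and then applies that single format to the whole vector, so a column with mixed separators can end up with NA values. A minimal sketch of a check, using a made-up sample vector rather than the actual mcu data:

```r
# Hypothetical sample strings; the real release_date values may look different
sample_dates <- c("05-02-2008", "06-13-2008", "05/07/2010")

parsed <- as.Date(sample_dates, tryFormats = c("%m-%d-%Y", "%m/%d/%Y"))

# The first element matches "%m-%d-%Y", so that format is used for every element;
# the slash-separated string does not match it and becomes NA
n_failed <- sum(is.na(parsed))
n_failed
```

Running the same sum(is.na(...)) check on mcu$release_date after the conversion would confirm that every date was parsed.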

Problem 1

a. Use the summary function to gain an overview of the data set.

summary(mcu)

##     movie             length_hrs      length_min    release_date       
##  Length:23          Min.   :1.000   Min.   : 1.0   Min.   :2008-05-02  
##  Class :character   1st Qu.:1.500   1st Qu.: 7.5   1st Qu.:2012-11-02  
##  Mode  :character   Median :2.000   Median :16.0   Median :2015-07-17  
##                     Mean   :1.783   Mean   :23.3   Mean   :2015-02-18  
##                     3rd Qu.:2.000   3rd Qu.:40.5   3rd Qu.:2017-12-25  
##                     Max.   :3.000   Max.   :58.0   Max.   :2019-07-02  
##  opening_weekend_us     gross_us          gross_world       
##  Min.   : 55414050   Min.   :134806913   Min.   :2.648e+08  
##  1st Qu.: 85398076   1st Qu.:224645330   1st Qu.:6.233e+08  
##  Median :117027503   Median :333718600   Median :8.540e+08  
##  Mean   :135096585   Mean   :371600489   Mean   :9.821e+08  
##  3rd Qu.:176641864   3rd Qu.:417921916   3rd Qu.:1.184e+09  
##  Max.   :357115007   Max.   :858373000   Max.   :2.798e+09

b. Display the mean and median of the last two attributes: gross_us and gross_world

revenue <- list("US Gross" = mcu$gross_us, "World Gross" = mcu$gross_world)
mean_revenue <- mapply(mean, revenue)
median_revenue <- mapply(median, revenue)
revenuedf <- data.frame(mean_revenue, median_revenue)
revenuedf
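Since mapply() is called with a single vector argument here, sapply() over the two columns should give the same numbers in one pass. A sketch on a small made-up data frame (toy_mcu and its values are illustrative, not taken from the data set):

```r
# Toy stand-in for the two revenue columns of mcu
toy_mcu <- data.frame(
  gross_us    = c(100, 200, 600),
  gross_world = c(300, 500, 1000)
)

# One function returns both statistics; sapply() applies it to each column
revenue_stats <- sapply(
  toy_mcu[c("gross_us", "gross_world")],
  function(col) c(mean = mean(col), median = median(col))
)
revenue_stats
```

The result is a matrix with one row per statistic and one column per attribute; t(revenue_stats) would give the same layout as revenuedf.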

Problem 2

Create a new data frame with a subset of the COLUMNS AND ROWS. Make sure to rename it.

# Subset the first five MCU movies, keeping the title, release date, and gross revenue columns

firstfive <- mcu[1:5, c("movie", "release_date", "gross_us", "gross_world")]
firstfive

Problem 3

Create new column names for the new data frame.

# colnames<- assigns by position, so the new names are listed in column order
colnames(firstfive) <- c("Movie_Title", "Date", "US_Gross", "World_Gross")
firstfive
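Because colnames<- assigns by position, a rename that matches on the old names may be safer if the column order ever changes. A minimal sketch, where the toy data frame and the new_names lookup vector are made up for illustration:

```r
# Toy data frame whose columns are deliberately in a different order
toy <- data.frame(
  gross_us     = 1,
  movie        = "Iron Man",
  release_date = as.Date("2008-05-02")
)

# Lookup vector: names are the old column names, values are the new ones
new_names <- c(movie = "Movie_Title", release_date = "Date", gross_us = "US_Gross")

# Find where each old name sits, then overwrite only those positions
idx <- match(names(new_names), names(toy))
names(toy)[idx] <- unname(new_names)
names(toy)
```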

Problem 4

a. Use the summary function to create an overview of your new data frame.

summary(firstfive)

##  Movie_Title             Date               US_Gross        
##  Length:5           Min.   :2008-05-02   Min.   :134806913  
##  Class :character   1st Qu.:2008-06-12   1st Qu.:176654505  
##  Mode  :character   Median :2010-05-07   Median :181030624  
##                     Mean   :2010-01-02   Mean   :224791900  
##                     3rd Qu.:2011-05-06   3rd Qu.:312433331  
##                     Max.   :2011-07-22   Max.   :319034126  
##   World_Gross       
##  Min.   :264770996  
##  1st Qu.:370569774  
##  Median :449326618  
##  Mean   :458879393  
##  3rd Qu.:585796247  
##  Max.   :623933331

b. Print the mean and median for the same two attributes. Please compare.

firstfive_rev <- list("US Gross" = firstfive$US_Gross, 
                          "World Gross" = firstfive$World_Gross)
firstfive_mean <- mapply(mean, firstfive_rev)
firstfive_med <- mapply(median, firstfive_rev)

firstfive_revdf <- data.frame(firstfive_mean, firstfive_med)

head(firstfive_revdf)

The mean US gross for the first five movies (about $225 million) was roughly $147 million lower than the mean for all 23 films released through 2019 (about $372 million).

Among the first five MCU films, even the highest US gross (about $319 million) was below the median US gross of the full data set (about $334 million).
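The comparison can be checked with quick arithmetic on the figures printed by the two summary() calls. The variable names below are made up for this sketch, and the values are copied from the printed output, so they may be rounded:

```r
# Values copied from the summary() output of mcu and firstfive
mean_us_all    <- 371600489
median_us_all  <- 333718600
mean_us_first5 <- 224791900
max_us_first5  <- 319034126

# Gap between the two means, in dollars
mean_gap <- mean_us_all - mean_us_first5
mean_gap

# Is the best US gross among the first five below the overall median?
max_below_median <- max_us_first5 < median_us_all
max_below_median
```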

Problem 5

Please rename at least 3 values in a column.

Since the mcu_films data did not have any factor columns with three or more values, I used the total length of each movie to create a new column with a category label.

Movies under 120 minutes were labeled “Short”
Movies from 120 up to (but not including) 150 minutes were labeled “Moderate”
Movies of 150 minutes or more were labeled “Very Long”

mcu$Total_min <- (mcu$length_hrs * 60) + mcu$length_min

mcu$Duration[mcu$Total_min < 120] <- "Short"
# Use >= so a movie of exactly 120 minutes is labeled instead of left as NA
mcu$Duration[mcu$Total_min >= 120 & mcu$Total_min < 150] <- "Moderate"
mcu$Duration[mcu$Total_min >= 150] <- "Very Long"

tail(mcu)
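The same binning can also be done in one step with cut(), which avoids gaps at the boundaries. A minimal sketch on made-up runtimes (total_min here is a standalone example vector, not the mcu column):

```r
# Made-up runtimes in minutes, including both boundary values
total_min <- c(112, 120, 149, 150, 181)

# right = FALSE makes each interval closed on the left: [120, 150), [150, Inf)
duration <- cut(
  total_min,
  breaks = c(-Inf, 120, 150, Inf),
  labels = c("Short", "Moderate", "Very Long"),
  right  = FALSE
)
as.character(duration)
```

cut() returns a factor, which also happens to fit the assignment's focus on columns with three or more distinct values.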