When working with data, you should expect to spend a good amount of time in the clean-up process, but it is not always ‘messy’ or unreadable. For example, data can still be organized in a data-frame in a way that is readable, but at the same time it may not be useful in such presented formats. In these cases, we may have to transpose the data-frame and re-organize to fit our needs.
The data is loaded into a data-frame called ‘movies_df’.
rm(list = ls())
movies_df <- data.frame(
c('Young',NA,NA,'Old',NA),
c('Liked','Disliked',NA,'Liked','Disliked'),
c(23,77,NA,58,42),
c(91,9,NA,14,86),
c(76,24,NA,37,63)
)
colnames(movies_df) <- c('','','Iron Man','Titanic','Legally Blonde')
#view current state of data-frame
movies_df
## Iron Man Titanic Legally Blonde
## 1 Young Liked 23 91 76
## 2 <NA> Disliked 77 9 24
## 3 <NA> <NA> NA NA NA
## 4 Old Liked 58 14 37
## 5 <NA> Disliked 42 86 63
We can now write the ‘movies_df’ data-frame to a CSV file. Within the ‘write.csv()’ function, I specified the ‘row.names’ parameter to ‘FALSE’ to avoid creating non-blank row names. I also set the ‘na’ parameter to “” to replace all ‘NA’ values with blanks. The ‘stringsAsFactors’ parameter is set to ‘FALSE’ to avoid changing the character entries. At first glance it looks correct, but the data-frame still shows up with ‘NA’ values in some of the cells.
#write to csv file
write.csv(movies_df, "movies.csv", row.names = FALSE, na = "")
#read from csv file into R
movies_csv <- read.csv('movies.csv', stringsAsFactors = FALSE, check.names = FALSE)
movies_csv
## Iron Man Titanic Legally Blonde
## 1 Young Liked 23 91 76
## 2 Disliked 77 9 24
## 3 NA NA NA
## 4 Old Liked 58 14 37
## 5 Disliked 42 86 63
To remove the remaining ‘NA’ values I subset the data-frame and set any value that ‘is.na’ to ““. When calling the updated data, it seemed to have removed the blank row entirely. To confirm that this was not the case, I presented the data nicely using the ‘datatable()’ function.
#change NA values to blank
movies_csv[is.na(movies_csv)] <- ""
#check updated data-frame
movies_csv
## Iron Man Titanic Legally Blonde
## 1 Young Liked 23 91 76
## 2 Disliked 77 9 24
## 3
## 4 Old Liked 58 14 37
## 5 Disliked 42 86 63
# Display the data frame nicely
datatable(movies_csv)
The previous output looks almost perfect, but the first two columns were given default names, namely “Var.2” and “Var.3”. To avoid this, I set the ‘options’ parameter accordingly. We can now see that our data-frame is formatted exactly as we want it to be.
#specify additional options
datatable(movies_csv, options = list(columnDefs = list(list(targets = c(1, 2), title = ""))))
Before we can pivot, we need to change the column/variable names to a valid format. This means no white-spaces or special characters.
#check data frame again
movies_csv
## Iron Man Titanic Legally Blonde
## 1 Young Liked 23 91 76
## 2 Disliked 77 9 24
## 3
## 4 Old Liked 58 14 37
## 5 Disliked 42 86 63
Here, I replace white-spaces with a period and labeled the first two columns that originally had blank titles as ‘Age_Group’ and ‘Preference’. For the blank row entries in the ‘Age_Group’ column, I labeled them according to the relevant age group.
colnames(movies_csv) <- c('Age_Group','Preference','Iron.Man','Titanic','Legally.Blonde')
movies_csv$Age_Group[2] <- 'Young'
movies_csv$Age_Group[5] <- 'Old'
movies_csv
## Age_Group Preference Iron.Man Titanic Legally.Blonde
## 1 Young Liked 23 91 76
## 2 Young Disliked 77 9 24
## 3
## 4 Old Liked 58 14 37
## 5 Old Disliked 42 86 63
Now that all of the required data is loaded, we need to transform what we have so that it is readable to our liking. In this case, this means pivoting the matrix and aggregating values and columns. To avoid gaps in our pivoted data, I remove the blank row before applying the ‘pivot_longer()’ function.
movies_csv <- movies_csv[-3, ]
p_movies <- pivot_longer(movies_csv,
cols = c('Iron.Man','Titanic','Legally.Blonde'),
names_to = "Movie",
values_to = "Ratings.Count")
p_movies
## # A tibble: 12 × 4
## Age_Group Preference Movie Ratings.Count
## <chr> <chr> <chr> <chr>
## 1 Young Liked Iron.Man 23
## 2 Young Liked Titanic 91
## 3 Young Liked Legally.Blonde 76
## 4 Young Disliked Iron.Man 77
## 5 Young Disliked Titanic 9
## 6 Young Disliked Legally.Blonde 24
## 7 Old Liked Iron.Man 58
## 8 Old Liked Titanic 14
## 9 Old Liked Legally.Blonde 37
## 10 Old Disliked Iron.Man 42
## 11 Old Disliked Titanic 86
## 12 Old Disliked Legally.Blonde 63
We need to change the ‘Ratings.Count’ column to numeric values to perform mathematical operations. After doing so, we can group the data according to Age_Group, Movie, and Preference, in that order.
p_movies$Ratings.Count <- as.numeric(p_movies$Ratings.Count)
#Aggregate data by city and airline
grouped_movies <- p_movies %>%
group_by(Age_Group, Movie, Preference) %>%
summarize(Ratings.Count, .groups = 'drop')
grouped_movies
## # A tibble: 12 × 4
## Age_Group Movie Preference Ratings.Count
## <chr> <chr> <chr> <dbl>
## 1 Old Iron.Man Disliked 42
## 2 Old Iron.Man Liked 58
## 3 Old Legally.Blonde Disliked 63
## 4 Old Legally.Blonde Liked 37
## 5 Old Titanic Disliked 86
## 6 Old Titanic Liked 14
## 7 Young Iron.Man Disliked 77
## 8 Young Iron.Man Liked 23
## 9 Young Legally.Blonde Disliked 24
## 10 Young Legally.Blonde Liked 76
## 11 Young Titanic Disliked 9
## 12 Young Titanic Liked 91
As shown above, for each age group there are duplicate entries for each movie in order to match to either the ‘Liked’ or ‘Disliked’ preference. This gives us a data-frame with 12 rows. We can condense this down to just 6 rows by including the preference for each movie on the same row by creating a ‘Liked’ column and a ‘Disliked’ column.
#present corresponding values for (on time) and (delayed) side by side
wide_movies <- grouped_movies %>%
pivot_wider(names_from = Preference,
values_from = Ratings.Count)
wide_movies
## # A tibble: 6 × 4
## Age_Group Movie Disliked Liked
## <chr> <chr> <dbl> <dbl>
## 1 Old Iron.Man 42 58
## 2 Old Legally.Blonde 63 37
## 3 Old Titanic 86 14
## 4 Young Iron.Man 77 23
## 5 Young Legally.Blonde 24 76
## 6 Young Titanic 9 91
Now that we have data that is readable to our liking, we can create new columns to use for our analysis. Here, I created a column for the percentage of liked films relative to all ratings of each movie and corresponding to a specific age group.
#create new column to show percentage of delayed flights for each row
wide_movies$pct_liked <- wide_movies$Liked / (wide_movies$Disliked + wide_movies$Liked)
wide_movies
## # A tibble: 6 × 5
## Age_Group Movie Disliked Liked pct_liked
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Old Iron.Man 42 58 0.58
## 2 Old Legally.Blonde 63 37 0.37
## 3 Old Titanic 86 14 0.14
## 4 Young Iron.Man 77 23 0.23
## 5 Young Legally.Blonde 24 76 0.76
## 6 Young Titanic 9 91 0.91
I decided to create two more columns for the mean of liked movies as well as the total number of ratings for each age group, regardless of movie. This is to compute a percentage for the mean in the same way that I computed a percentage for the individual movies.
mAge_liked <- wide_movies %>%
group_by(Age_Group) %>%
summarize(mean_liked = mean(Liked))
#mAge_liked
mLiked_Old <- mAge_liked$mean_liked[1]
mLiked_Young <- mAge_liked$mean_liked[2]
wide_movies$mean_liked <- NA
wide_movies$mean_liked <- ifelse(wide_movies$Age_Group == "Old", mLiked_Old, wide_movies$mean_liked)
wide_movies$mean_liked <- ifelse(wide_movies$Age_Group == "Young", mLiked_Young, wide_movies$mean_liked)
wide_movies
## # A tibble: 6 × 6
## Age_Group Movie Disliked Liked pct_liked mean_liked
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Old Iron.Man 42 58 0.58 36.3
## 2 Old Legally.Blonde 63 37 0.37 36.3
## 3 Old Titanic 86 14 0.14 36.3
## 4 Young Iron.Man 77 23 0.23 63.3
## 5 Young Legally.Blonde 24 76 0.76 63.3
## 6 Young Titanic 9 91 0.91 63.3
Here I include columns for the total number of liked films, total number of ratings, and liked movie percentages for the overall age groups. Then added a final column that calculates the mean across all movies of the percentages of liked movies for each age group.
mAge_total1 <- wide_movies %>%
group_by(Age_Group) %>%
summarize(liked_total = sum(Liked))
mAge_total <- wide_movies %>%
group_by(Age_Group) %>%
summarize(total = sum(Liked, Disliked))
mAge_total1
## # A tibble: 2 × 2
## Age_Group liked_total
## <chr> <dbl>
## 1 Old 109
## 2 Young 190
mAge_total
## # A tibble: 2 × 2
## Age_Group total
## <chr> <dbl>
## 1 Old 300
## 2 Young 300
mlike_old <- mAge_total1$liked_total[1]
mlike_young <- mAge_total1$liked_total[2]
mTot_old <- mAge_total$total[1]
mTot_young <- mAge_total$total[2]
wide_movies$liked_total <- NA
wide_movies$total <- NA
wide_movies$liked_total <- ifelse(wide_movies$Age_Group == "Old", mlike_old, wide_movies$liked_total)
wide_movies$liked_total <- ifelse(wide_movies$Age_Group == "Young", mlike_young, wide_movies$liked_total)
wide_movies$total <- ifelse(wide_movies$Age_Group == "Old", mTot_old, wide_movies$total)
wide_movies$total <- ifelse(wide_movies$Age_Group == "Young", mTot_young, wide_movies$total)
wide_movies$age_liked_pct <- wide_movies$liked_total / wide_movies$total
wide_movies$pct_mean_liked <- NA
wide_movies$pct_mean_liked <- wide_movies$mean_liked / wide_movies$total
wide_movies
## # A tibble: 6 × 10
## Age_Group Movie Disliked Liked pct_liked mean_liked liked_total total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Old Iron.Man 42 58 0.58 36.3 109 300
## 2 Old Legally.Blonde 63 37 0.37 36.3 109 300
## 3 Old Titanic 86 14 0.14 36.3 109 300
## 4 Young Iron.Man 77 23 0.23 63.3 190 300
## 5 Young Legally.Blonde 24 76 0.76 63.3 190 300
## 6 Young Titanic 9 91 0.91 63.3 190 300
## # ℹ 2 more variables: age_liked_pct <dbl>, pct_mean_liked <dbl>
In my analysis, I decided to find the sample standard deviation based on the percentage points that I computed for the individual movies and the percentages of the means. I did this to compare the liked ratings according to variation in the data set.
sample_pct_sd <- wide_movies %>%
group_by(Age_Group) %>%
reframe(pct_liked, pct_mean_liked)
sample_pct_sd$sq_diff <- (sample_pct_sd$pct_liked - sample_pct_sd$pct_mean_liked)^2
sample_pct_sd1 <- sample_pct_sd %>%
group_by(Age_Group) %>%
summarize(pct_sd = sqrt(sum(sq_diff)/(5 - 1)))
#sample_pct_sd1
#standard deviation of delay percentages for old people
old_pct_sd <- sample_pct_sd1$pct_sd[1]
#standard deviation of delay percentages for young people
young_pct_sd <- sample_pct_sd1$pct_sd[2]
liked_df <- data.frame(Age_Group = wide_movies$Age_Group,
Movie = wide_movies$Movie,
Liked = wide_movies$Liked,
Pct_Liked = wide_movies$pct_liked,
#Mean_Pct_Delay = wide_flights$pct_mean_delayed,
#Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
#Avg_Pct_SD = NA,
Pct_SD = NA,
Age_Liked_Pct = wide_movies$age_liked_pct
)
liked_df$Pct_SD <- ifelse(wide_movies$Age_Group == "Old", old_pct_sd, liked_df$Pct_SD)
liked_df$Pct_SD <- ifelse(wide_movies$Age_Group == "Young", young_pct_sd, liked_df$Pct_SD)
#write to csv
write.csv(liked_df, 'movie_analysis.csv')
movies_analysis <- read.csv('movie_analysis.csv')
movies_analysis
## X Age_Group Movie Liked Pct_Liked Pct_SD Age_Liked_Pct
## 1 1 Old Iron.Man 58 0.58 0.2611903 0.3633333
## 2 2 Old Legally.Blonde 37 0.37 0.2611903 0.3633333
## 3 3 Old Titanic 14 0.14 0.2611903 0.3633333
## 4 4 Young Iron.Man 23 0.23 0.4444326 0.6333333
## 5 5 Young Legally.Blonde 76 0.76 0.4444326 0.6333333
## 6 6 Young Titanic 91 0.91 0.4444326 0.6333333
The graphs below show the percentages of liked movies in two different ways. Both visuals include error bars that show the variability of the data.
Here we see the age groups bunched in two graphs, each with the liked percentages of the individual movies.
ggplot(movies_analysis, aes(x = Pct_Liked, y = Movie, fill = Age_Group)) +
facet_wrap(~ Age_Group) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Liked - Pct_SD, xmax = Pct_Liked + Pct_SD), width = 0.2) +
labs(
title = "Liked Percentage by Movie",
x = "Percentage",
y = "Movie"
)
Here we see the movies grouped in five graphs. Each shows the liked percentages of the individual age groups.
ggplot(movies_analysis, aes(x = Pct_Liked, y = Age_Group, fill = Movie)) +
facet_wrap(~ Movie) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Liked - Pct_SD, xmax = Pct_Liked + Pct_SD), width = 0.2) +
labs(
title = "Liked Percentage by Age Group",
x = "Percentage",
y = "Age Group"
)
Below I compared the overall liked ratings for each age group and used a bar graph to compare. It is clear that significantly less older people liked movies relative to the total number of ratings across all movies.
liked_group <- movies_analysis %>%
group_by(Age_Group) %>%
summarize(age_liked_p = mean(Age_Liked_Pct))
ggplot(liked_group, aes(x = Age_Group, y = age_liked_p, fill = Age_Group)) +
geom_bar(stat = "identity") +
labs(
title = "Overall Liked Percentage by Age Group",
x = "Age Group",
y = "Percentage"
)
When breaking down the percentage comparison by movie, I visualized it as (1) a single stacked graph with the information for all movie and (2) as separate graphs for each movie. In either case, we see a significant difference in preferences across age groups.
Here the graph is visually appealing, but may be harder to compare as the bars representing the preference for each movie are not properly aligned side by side.
ggplot(movies_analysis, aes(x = Age_Group, y = Pct_Liked, fill = Movie)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Individual Like Percentage vs Age Group (Stacked)",
x = "Age Group",
y = "Percentage"
)
Here the separate graphs represent each and within each graph is a comparison of their preferences by . The individual graphs allow easier comparison.
ggplot(movies_analysis, aes(x = Age_Group, y = Pct_Liked, fill = Movie)) +
facet_wrap(~ Movie) +
geom_bar(stat = "identity") +
labs(
title = "Individual Liked Percentage vs Age Group (Separated)",
x = "Age Group",
y = "Percentage"
)
I likewise visualized it by comparing the like percentages of each movie and filled by age group. I also swapped the axes for better readability.
ggplot(movies_analysis, aes(x = Pct_Liked, y = Movie, fill = Age_Group)) +
geom_bar(stat = 'identity') +
labs(
title = "Liked Percentage vs Movie (Stacked)",
x = "Percentage",
y = "Movie"
)
Here the separate graphs represent each and within each graph is a comparison of their preference by .
ggplot(movies_analysis, aes(x = Pct_Liked, y = Movie, fill = Age_Group)) +
facet_wrap(~ Age_Group) +
geom_bar(stat = 'identity') +
labs(
title = "Liked Percentage vs Movie (Separated)",
x = "Percentage",
y = "Movie"
)
My analysis focused on comparing the spread of the data according to age group as well as comparing percentages of preference. In comparing the standard deviations of overall preferred movie percentages for each age group, we observe that older raters had a standard deviation of 0.2611903, while younger raters had a standard deviation of 0.4444326. This indicates that older raters were more consistent in their movie preferences. When comparing the actual percentages of liked movies, whether by age group overall or individual movies, we see that older people liked movies less than younger people did. Based on this analysis, older people seemed to like movies less and were more assertive in their preference.