movie_data <- read.csv("C:\\Users\\varsh\\OneDrive\\Desktop\\Gitstuff\\age_gaps.CSV")

Creating three new variables:

  1. mean_ages: This column stores the row-wise mean of actor_1_age and actor_2_age.

  2. std_dev_ages: This column stores the row-wise sample standard deviation of actor_1_age and actor_2_age.

  3. diff_mean_age_difference: This column stores age_difference minus the overall mean of age_difference, that is, age_difference - mean(age_difference).

movie_data$mean_ages <- rowMeans(movie_data[, c('actor_1_age', 'actor_2_age')])
movie_data$std_dev_ages <- apply(movie_data[, c('actor_1_age', 'actor_2_age')], 1, sd)

mean_age_difference <- mean(movie_data$age_difference)
movie_data$diff_mean_age_difference <-  movie_data$age_difference - mean_age_difference

head(movie_data)
##           movie_name release_year      director age_difference couple_number
## 1   Harold and Maude         1971     Hal Ashby             52             1
## 2              Venus         2006 Roger Michell             50             1
## 3 The Quiet American         2002 Phillip Noyce             49             1
## 4   The Big Lebowski         1998     Joel Coen             45             1
## 5          Beginners         2010    Mike Mills             43             1
## 6         Poison Ivy         1992     Katt Shea             42             1
##          actor_1_name    actor_2_name character_1_gender character_2_gender
## 1         Ruth Gordon        Bud Cort              woman                man
## 2       Peter O'Toole Jodie Whittaker                man              woman
## 3       Michael Caine  Do Thi Hai Yen                man              woman
## 4    David Huddleston       Tara Reid                man              woman
## 5 Christopher Plummer   Goran Visnjic                man                man
## 6        Tom Skerritt  Drew Barrymore                man              woman
##   actor_1_birthdate actor_2_birthdate actor_1_age actor_2_age mean_ages
## 1        1896-10-30        29-03-1948          75          23      49.0
## 2        02-08-1932        03-06-1982          74          24      49.0
## 3        14-03-1933        01-10-1982          69          20      44.5
## 4        17-09-1930        08-11-1975          68          23      45.5
## 5        13-12-1929        09-09-1972          81          38      59.5
## 6        25-08-1933        22-02-1975          59          17      38.0
##   std_dev_ages diff_mean_age_difference
## 1     36.76955                 41.57576
## 2     35.35534                 39.57576
## 3     34.64823                 38.57576
## 4     31.81981                 34.57576
## 5     30.40559                 32.57576
## 6     29.69848                 31.57576
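
Because std_dev_ages is computed from only two values per row, it is a rescaled version of the age gap. The sketch below is a minimal check of that, using the six couples from the head() output; the names `ages`, `sd_by_row` and `sd_closed_form` are introduced here for illustration and are not part of the analysis above.

```r
# First six couples, copied from the head() output above.
ages <- data.frame(
  actor_1_age = c(75, 74, 69, 68, 81, 59),
  actor_2_age = c(23, 24, 20, 23, 38, 17)
)

# Row-wise sample standard deviation, computed the same way as std_dev_ages.
sd_by_row <- apply(ages, 1, sd)

# For exactly two values a and b, the sample standard deviation is |a - b| / sqrt(2).
sd_closed_form <- abs(ages$actor_1_age - ages$actor_2_age) / sqrt(2)

print(round(unname(sd_by_row), 5))
print(all.equal(unname(sd_by_row), sd_closed_form))
```

In other words, for two actors std_dev_ages carries the same information as the absolute age gap, only divided by sqrt(2).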

Visualizations:

library(ggplot2)

plot1 <- ggplot(movie_data, aes(x = age_difference, y = diff_mean_age_difference)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(title = "age_difference vs diff_mean_age_difference",
       x = "age_difference",
       y = "diff_mean_age_difference")

plot2 <- ggplot(movie_data, aes(x = actor_1_age, y = mean_ages)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(title = "actor_1_age vs mean_ages",
       x = "actor_1_age",
       y = "mean_ages")

plot3 <- ggplot(movie_data, aes(x = actor_2_age, y = std_dev_ages)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(title = "actor_2_age vs std_dev_ages",
       x = "actor_2_age",
       y = "std_dev_ages")


print(plot1)
## `geom_smooth()` using formula = 'y ~ x'

print(plot2)
## `geom_smooth()` using formula = 'y ~ x'

print(plot3)
## `geom_smooth()` using formula = 'y ~ x'

1. Plot for age_difference vs diff_mean_age_difference:

Insights-

This plot displays the relationship between age_difference and diff_mean_age_difference, along with a linear regression line fitted to the data points.

Significance-

The plot shows a perfectly linear positive relationship between age_difference and diff_mean_age_difference: every point lies on the regression line. This follows from the definition, since diff_mean_age_difference is age_difference minus a constant (the overall mean) and is not an absolute deviation. The plot therefore confirms the calculation rather than revealing a pattern in the data.

2. Plot for actor_1_age vs mean_ages:

Insight-

This plot visualizes the relationship between actor_1_age and mean_ages, along with a linear regression line.

Significance-

The plot shows a strong positive relationship between actor_1_age and mean_ages, as shown by the regression line’s clear upward slope. As the first actor’s age (actor_1_age) rises, so does the mean age of the couple (mean_ages). Part of this relationship is built in, because actor_1_age is one of the two values averaged to produce mean_ages.

3. Plot for actor_2_age vs std_dev_ages:

Insight-

This plot illustrates the relationship between actor_2_age and std_dev_ages, along with a linear regression line.

Significance-

The plot shows a weak negative linear relationship between actor_2_age and std_dev_ages, as shown by the regression line’s slightly downward slope. The points are widely scattered around the line, so actor_2_age alone explains little of the variation in std_dev_ages.

Further Questions-

1. Analysing outliers or clusters of data points in the age_difference vs diff_mean_age_difference plot may show specific patterns or aspects related to casting decisions in films.

2. Investigating the strong relationship between actor_1_age and mean_ages could shed light on whether older actors are typically cast in lead roles, potentially influencing the couple’s mean age. Looking at age distributions across different genres or movie types may provide additional insights.

3. Investigating possible factors contributing to the variability in std_dev_ages, such as movie qualities or casting decisions, as well as outliers or clusters of data points, may reveal underlying trends influencing the ages of actors in films.

Correlation Coefficient:

cor_1 <- cor(movie_data$age_difference, movie_data$diff_mean_age_difference)

cor_2 <- cor(movie_data$actor_1_age, movie_data$mean_ages)

cor_3 <- cor(movie_data$actor_2_age, movie_data$std_dev_ages)

print(paste("Correlation coefficient for age_difference and diff_mean_age_difference:", cor_1))
## [1] "Correlation coefficient for age_difference and diff_mean_age_difference: 1"
print(paste("Correlation coefficient for actor_1_age and mean_ages:", cor_2))
## [1] "Correlation coefficient for actor_1_age and mean_ages: 0.926264555032481"
print(paste("Correlation coefficient for actor_2_age and std_dev_ages:", cor_3))
## [1] "Correlation coefficient for actor_2_age and std_dev_ages: -0.156464786475823"

Insights Gathered-

  1. cor_1 shows a perfect positive linear relationship between age_difference and diff_mean_age_difference. This is expected by construction: diff_mean_age_difference is age_difference shifted by a constant, and a constant shift leaves the correlation at exactly 1.
  2. cor_2 is approximately 0.93, which indicates a strong positive linear relationship between actor_1_age and mean_ages. As the first actor’s age increases, so does the mean age of the couple.
  3. cor_3 is approximately -0.16, which indicates a weak negative linear relationship between actor_2_age and std_dev_ages. The spread between the two actors’ ages (std_dev_ages) is only weakly related to the age of the second actor (actor_2_age).

Significance-

  1. cor_1: The perfect correlation is a direct consequence of how diff_mean_age_difference is defined (age_difference minus its mean, with the sign kept, not the absolute deviation). It confirms the calculation but carries no additional information about the films.

  2. cor_2: The strong positive correlation is partly built in, because actor_1_age is one of the two ages averaged into mean_ages. On its own it does not show that casting favours older performers in lead roles; testing that would need a comparison in which actor_1_age does not appear on both sides.

  3. cor_3: The weak negative correlation indicates that the spread between the two actors’ ages is only slightly associated with the second actor’s age. Variables other than actor_2_age account for most of the variation in std_dev_ages.
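
As a hedged illustration of why cor_1 comes out as exactly 1, and of how much correlation a part-whole relationship like the one behind cor_2 can produce on its own, the sketch below uses the six age gaps from the head() output plus simulated ages. The simulated ages (`age_a`, `age_b`) are an assumption made purely for illustration; they are independent uniform draws, not the film data.

```r
# Age gaps of the six couples shown in the head() output (a small subset, for illustration).
age_difference <- c(52, 50, 49, 45, 43, 42)

# Subtracting a constant shifts the values but does not change the correlation.
centered <- age_difference - mean(age_difference)
cor_shift <- cor(age_difference, centered)
print(isTRUE(all.equal(cor_shift, 1)))

# Simulated, independent ages (an assumption for illustration, not the film data).
set.seed(42)
age_a <- runif(10000, min = 20, max = 60)
age_b <- runif(10000, min = 20, max = 60)
pair_mean <- (age_a + age_b) / 2

# age_a is one half of pair_mean, so the two correlate even though
# age_a and age_b were generated independently of each other.
cor_part_whole <- cor(age_a, pair_mean)
print(round(cor_part_whole, 2))
```

For two independent variables with equal variance, the correlation between one of them and their mean is 1/sqrt(2), about 0.71. A value such as 0.93 is higher than that baseline, but a large share of it can still come from the shared component.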

Confidence interval for each of the response variables:

mean_age_difference <- mean(movie_data$age_difference)
sd_age_difference <- sd(movie_data$age_difference)

confidence_level <- 0.95

# Use the upper-tail t quantile so the margin of error is positive and lower_bound < upper_bound
margin_of_error <- qt(1 - (1 - confidence_level) / 2, df = length(movie_data$age_difference) - 1) * (sd_age_difference / sqrt(length(movie_data$age_difference)))

lower_bound <- mean_age_difference - margin_of_error
upper_bound <- mean_age_difference + margin_of_error

print(paste("The", confidence_level * 100, "% confidence interval for age_difference is [", lower_bound, ",", upper_bound, "]"))
## [1] "The 95 % confidence interval for age_difference is [ 9.93288468570528 , 10.9156001627796 ]"
mean_mean_ages <- mean(movie_data$mean_ages)
sd_mean_ages <- sd(movie_data$mean_ages)

confidence_level <- 0.95

# Use the upper-tail t quantile so the margin of error is positive and lower_bound < upper_bound
margin_of_error <- qt(1 - (1 - confidence_level) / 2, df = length(movie_data$mean_ages) - 1) * (sd_mean_ages / sqrt(length(movie_data$mean_ages)))

lower_bound <- mean_mean_ages - margin_of_error
upper_bound <- mean_mean_ages + margin_of_error

print(paste("The", confidence_level * 100, "% confidence interval for mean_ages is [", lower_bound, ",", upper_bound, "]"))
## [1] "The 95 % confidence interval for mean_ages is [ 34.96038332554 , 35.8863699212133 ]"
mean_std_dev_ages <- mean(movie_data$std_dev_ages)
sd_std_dev_ages <- sd(movie_data$std_dev_ages)

confidence_level <- 0.95

# Use the upper-tail t quantile so the margin of error is positive and lower_bound < upper_bound
margin_of_error <- qt(1 - (1 - confidence_level) / 2, df = length(movie_data$std_dev_ages) - 1) * (sd_std_dev_ages / sqrt(length(movie_data$std_dev_ages)))

lower_bound <- mean_std_dev_ages - margin_of_error
upper_bound <- mean_std_dev_ages + margin_of_error

print(paste("The", confidence_level * 100, "% confidence interval for std_dev_ages is [", lower_bound, ",", upper_bound, "]"))
## [1] "The 95 % confidence interval for std_dev_ages is [ 7.02361011800621 , 7.71849489582241 ]"

Insights Gathered-

  1. Confidence interval for age_difference: The confidence interval [9.93, 10.92] indicates 95% confidence that the population mean age difference lies within this range. In other words, we estimate that the average age gap between the two actors in a film couple is between 9.93 and 10.92 years.
  2. Confidence interval for mean_ages: The confidence interval [34.96, 35.89] indicates 95% confidence that the population mean of mean_ages lies within this range. In other words, we estimate that the average of the couples’ mean ages is between 34.96 and 35.89 years.
  3. Confidence interval for std_dev_ages: The confidence interval [7.02, 7.72] indicates 95% confidence that the population mean of std_dev_ages lies within this range. This is an interval for the average per-couple standard deviation, not for the standard deviation of all actors’ ages, and it is not an age range.

Significance-

  1. Confidence interval for age_difference: The confidence interval provides a range of plausible values for the population parameter (mean age difference). It quantifies the uncertainty in our estimate and indicates the precision of the sample mean. Here the interval is narrow (about one year wide), which implies a fairly precise estimate of the average age difference.
  2. Confidence interval for mean_ages: The confidence interval provides a range of plausible values for the population parameter (the mean of the couples’ mean ages). It quantifies the uncertainty in our estimate and indicates the precision of the sample mean. Here the interval is also narrow (under one year wide), which implies a fairly precise estimate.
  3. Confidence interval for std_dev_ages: The confidence interval provides a range of plausible values for the population parameter (the mean of the per-couple standard deviations). It quantifies the uncertainty in our estimate and indicates the precision of the sample mean of std_dev_ages. Here the narrow interval implies a fairly precise estimate of the average within-couple age spread.
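
The same t-based interval is computed three times above, so one way to avoid repeating the formula is a small helper. The sketch below is a minimal version under that assumption; `t_confidence_interval` is a hypothetical name introduced here, and the input is only the six age gaps from the head() output, so the numbers it prints are not the full-data intervals reported above.

```r
# Hypothetical helper: two-sided t-based confidence interval for a mean.
t_confidence_interval <- function(x, confidence_level = 0.95) {
  n <- length(x)
  standard_error <- sd(x) / sqrt(n)
  # Upper-tail quantile, so the critical value (and the margin) is positive.
  t_critical <- qt(1 - (1 - confidence_level) / 2, df = n - 1)
  margin_of_error <- t_critical * standard_error
  c(lower = mean(x) - margin_of_error, upper = mean(x) + margin_of_error)
}

# Age gaps of the six couples shown in the head() output (illustrative subset only).
age_difference <- c(52, 50, 49, 45, 43, 42)

ci <- t_confidence_interval(age_difference)
print(ci)

# Cross-check against base R's t.test(), which reports the same interval.
ci_reference <- as.numeric(t.test(age_difference, conf.level = 0.95)$conf.int)
print(all.equal(unname(ci), ci_reference))
```

Checking against t.test() is a quick guard against sign mistakes in the quantile, since a negative critical value would swap the two bounds.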

Further Questions-

1. Analyzing the practical consequences of the observed mean age difference may reveal genre-specific patterns in age differences among actors, as well as potential shifts in casting preferences over time.

2. Exploring the factors that affect variation in the couples’ mean age may show genre-specific age preferences as well as temporal changes in casting methods across different film genres and time periods.

3. Evaluating factors related to variations in the standard deviation of ages among actors could reveal genre-specific diversity in casting processes as well as potential trends in age differences across groups and time periods in the film industry.
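
The time-period part of these questions could be approached by grouping on release_year. The sketch below is a minimal example under that assumption, using only the six rows from the head() output; the `decade` column and the `movies` and `by_decade` names are introduced here for illustration. The columns shown in the head() output do not include a genre, so the genre questions would need additional data.

```r
# Six rows copied from the head() output above (illustrative subset only).
movies <- data.frame(
  release_year   = c(1971, 2006, 2002, 1998, 2010, 1992),
  age_difference = c(52, 50, 49, 45, 43, 42)
)

# Hypothetical helper column: the decade in which each film was released.
movies$decade <- floor(movies$release_year / 10) * 10

# Mean age gap per decade.
by_decade <- aggregate(age_difference ~ decade, data = movies, FUN = mean)
print(by_decade)
```

With the full dataset, the same grouping could be applied to mean_ages or std_dev_ages to look for shifts over time.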