data <- read.csv("C:\\Users\\91630\\OneDrive\\Desktop\\statistics\\age_gaps.CSV")
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## 
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(xts)
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last

Using “release_year” column

data$release_date <- as.Date(paste0(data$release_year, "-01-01"))
data <- subset(data, select = -release_year)
head(data$release_date, 20)
##  [1] "1971-01-01" "2006-01-01" "2002-01-01" "1998-01-01" "2010-01-01"
##  [6] "1992-01-01" "2009-01-01" "1999-01-01" "1992-01-01" "1999-01-01"
## [11] "1989-01-01" "1948-01-01" "1995-01-01" "2003-01-01" "2004-01-01"
## [16] "2003-01-01" "2005-01-01" "2010-01-01" "1981-01-01" "2002-01-01"

Sorting the release dates

sorted_df <- data %>% arrange(release_date)
head(sorted_df)
##         movie_name           director age_difference couple_number
## 1 Star of Midnight    Stephen Roberts             19             1
## 2    Captain Blood     Michael Curtiz              7             1
## 3     Modern Times    Charlie Chaplin             21             1
## 4    Stella Dallas         King Vidor             12             1
## 5   A Star Is Born William A. Wellman              9             3
## 6    Stella Dallas         King Vidor              1             2
##      actor_1_name        actor_2_name character_1_gender character_2_gender
## 1  William Powell       Ginger Rogers                man              woman
## 2     Errol Flynn Olivia de Havilland                man              woman
## 3 Charlie Chaplin    Paulette Goddard                man              woman
## 4      John Boles    Barbara Stanwyck                man              woman
## 5   Fredric March        Janet Gaynor                man              woman
## 6    Anne Shirley            Tim Holt              woman                man
##   actor_1_birthdate actor_2_birthdate actor_1_age actor_2_age release_date
## 1        1892-06-29        1911-07-16          43          24   1935-01-01
## 2        1909-06-20        1916-07-01          26          19   1935-01-01
## 3        1889-04-16        1910-06-03          47          26   1936-01-01
## 4        1895-10-28        1907-07-16          42          30   1937-01-01
## 5        1897-08-31        1906-10-06          40          31   1937-01-01
## 6        1918-04-17        1919-02-05          19          18   1937-01-01

Mean age difference by release date

mean_age_difference <- sorted_df %>%
  group_by(release_date) %>%
  summarize(mean_age_difference = mean(age_difference, na.rm = TRUE))
mean_age_difference
## # A tibble: 82 × 2
##    release_date mean_age_difference
##    <date>                     <dbl>
##  1 1935-01-01                 13   
##  2 1936-01-01                 21   
##  3 1937-01-01                  7.33
##  4 1939-01-01                 12   
##  5 1940-01-01                 11.3 
##  6 1942-01-01                 20.5 
##  7 1944-01-01                 25   
##  8 1946-01-01                 25   
##  9 1947-01-01                 25   
## 10 1948-01-01                 23.2 
## # ℹ 72 more rows
ggplot(mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")

1. Across the full dataset, the yearly mean age difference stays within a broadly similar range, with no single sustained spike dominating the graph. This relative stability makes it reasonable to narrow the investigation to a shorter window.
2. Restricting the data to 1960 to 1980 gives a shorter, more manageable period and allows a more focused examination.
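A yearly mean is easier to judge when the number of rows behind it is known, because a mean built from one or two couples can swing sharply. The sketch below recomputes the first three yearly means from the six rows printed by head(sorted_df) and adds a count per year. It uses base R only so that it runs without the data file; the names couples, yearly, and n_couples are illustrative and do not appear in the original analysis.

```r
# First six rows of the sorted data, copied from the head(sorted_df) output.
couples <- data.frame(
  release_year   = c(1935, 1935, 1936, 1937, 1937, 1937),
  age_difference = c(19, 7, 21, 12, 9, 1)
)

# Mean age difference and number of couples for each release year.
mean_by_year <- tapply(couples$age_difference, couples$release_year, mean)
n_by_year    <- tapply(couples$age_difference, couples$release_year, length)

yearly <- data.frame(
  release_year        = as.integer(names(mean_by_year)),
  mean_age_difference = as.numeric(mean_by_year),
  n_couples           = as.integer(n_by_year)
)
print(yearly)
```

The means agree with the first three rows of the tibble above (13, 21, and 7.33), and the count column shows that the 1936 value rests on a single couple.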

filtered_mean_age_difference <- mean_age_difference %>%
  filter(release_date >= as.Date("1960-01-01") & release_date <= as.Date("1980-12-31"))

print(filtered_mean_age_difference)
## # A tibble: 20 × 2
##    release_date mean_age_difference
##    <date>                     <dbl>
##  1 1960-01-01                  5   
##  2 1961-01-01                  9   
##  3 1962-01-01                  3.67
##  4 1963-01-01                 15.2 
##  5 1964-01-01                 10.8 
##  6 1965-01-01                 11   
##  7 1966-01-01                 16   
##  8 1967-01-01                  9.43
##  9 1968-01-01                 11   
## 10 1969-01-01                  4   
## 11 1970-01-01                  9.5 
## 12 1971-01-01                 22.3 
## 13 1972-01-01                 13.3 
## 14 1973-01-01                 16.2 
## 15 1974-01-01                 12   
## 16 1975-01-01                  6   
## 17 1976-01-01                 18   
## 18 1977-01-01                 13.8 
## 19 1979-01-01                 20.2 
## 20 1980-01-01                 17
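
The filtered table has 20 rows for a 21-year window, so one year has no data. A quick check, sketched here with the years copied from the printed tibble (the variable names are illustrative), identifies which year is absent. This matters for the line plots, since geom_line() connects the neighbouring years straight across the gap.

```r
# Years present in the filtered table (copied from the printed tibble).
years_present <- c(1960:1977, 1979, 1980)

# Every year the 1960-1980 window is expected to contain.
years_expected <- 1960:1980

# Years with no rows in the data. base::setdiff is called explicitly
# because tsibble and dplyr mask setdiff in the original session.
missing_years <- base::setdiff(years_expected, years_present)
print(missing_years)
```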
ggplot(filtered_mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")

1. The graph shows that the mean age gap between the paired actors fluctuated from year to year between 1960 and 1980, with several peaks and dips. Despite these fluctuations, an overall upward tendency is visible across the period.
2. The years 1960–1980 were a dynamic time for the film industry, marked by changes in storytelling techniques, cultural preferences, and technology. Against that backdrop, the general pattern points to a gradual rise in the age gaps between actors.
3. Because individual years vary widely, the rise is best read as a general tendency rather than a steady increase. Placing the data in its historical and industry context helps when interpreting it.

Linear Regression Model

ggplot(filtered_mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")
## `geom_smooth()` using formula = 'y ~ x'

1. A rising trend from 1960 to 1980 is supported by the linear trendline fitted to the filtered mean age difference data. The fitted line gives the visual impression of the trend a quantitative basis.
2. The trendline agrees with the reading of the graph: across the chosen period, the mean age difference between the paired actors follows an upward trajectory. Note that geom_smooth(method = "lm") only draws the fitted line; it does not report the slope or its uncertainty.
3. The rising trend offers a starting point for understanding casting practices and age representation in films from the 1960s to 1980s. It suggests a widening of the age gaps between actors, which may reflect changes in industry norms, audience demographics, or storytelling preferences.
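
To put a number on the trend, the same straight line can be fitted explicitly with lm(). The sketch below is an illustration rather than part of the original analysis. It assumes the 20 yearly means as printed in the filtered tibble (rounded to three significant digits, so results may differ slightly from a fit on the unrounded values) and uses the numeric year as the predictor, so the slope is in years of age difference per calendar year. The names yearly, fit, and slope are illustrative.

```r
# Yearly means for 1960-1980, copied from the printed tibble (1978 is absent).
yearly <- data.frame(
  year = c(1960:1977, 1979, 1980),
  mean_age_difference = c(5, 9, 3.67, 15.2, 10.8, 11, 16, 9.43, 11, 4,
                          9.5, 22.3, 13.3, 16.2, 12, 6, 18, 13.8, 20.2, 17)
)

# Ordinary least squares fit of the yearly mean on the year.
fit <- lm(mean_age_difference ~ year, data = yearly)

# Estimated change in mean age difference per year.
slope <- unname(coef(fit)["year"])
print(slope)

# Coefficient table (estimate, standard error, t value, p value)
# and a 95% confidence interval for the slope.
print(summary(fit)$coefficients)
print(confint(fit, "year"))
```

A positive slope whose confidence interval excludes zero would support the upward trend. With only 20 yearly values, some of them based on very few couples, the result should be read with caution.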
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
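# Convert the Date column to numeric years; ts() drops the Date class and would otherwise store days since 1970-01-01.
ts_release_years <- ts(as.numeric(format(data$release_date, "%Y")))
# The series is indexed by row order in the data, not by calendar time.
plot(ts_release_years, type = "l", xlab = "Observation (row order)", ylab = "Release Year", main = "Time Series of Movie Release Years")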

1. The time series plot shows the release year of every record in the dataset, one value per row, which gives a quick view of the range of years the data covers.
2. The values are plotted in the order the rows appear in the data, and that order is not chronological (the first rows are 1971, 2006, and 2002). The horizontal axis is therefore a row index rather than a timeline, so the rises and falls of the line do not reflect activity in the film industry.
3. To see how releases are distributed over time, the number of records per year is more informative than this plot. The histogram that follows serves that purpose.

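# Histogram of numeric release years, so that the x-axis matches its "Release Year" label
hist(as.numeric(format(data$release_date, "%Y")), breaks = 20, col = "skyblue", xlab = "Release Year", ylab = "Frequency", main = "Histogram of Movie Release Years")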

1. The histogram of release years shows how the records in the dataset are distributed over time. Each bar counts the records whose release year falls in that bin; with breaks = 20, a bin spans several years rather than a single year.
2. The height of the bars shows where the data is dense. Tall bars mark periods that contribute many records, and short bars mark periods that contribute few.
3. These counts describe the dataset rather than the film industry as a whole. Peaks could coincide with industry turning points, cultural events, or shifts in audience tastes, and valleys with economic downturns or changing production priorities, but they may equally reflect which films were included in the data.
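
The counting behind the histogram can be sketched in a few lines. The example below uses the 20 release dates printed by head(data$release_date, 20) as a small sample, extracts the numeric year, and tabulates records per decade. It also shows why the year has to be extracted first: the numeric value of a Date is the number of days since 1970-01-01, not the year. The decade grouping and the variable names are illustrative choices, not part of the original analysis.

```r
# First 20 release dates, copied from the head(data$release_date, 20) output.
release_date <- as.Date(c(
  "1971-01-01", "2006-01-01", "2002-01-01", "1998-01-01", "2010-01-01",
  "1992-01-01", "2009-01-01", "1999-01-01", "1992-01-01", "1999-01-01",
  "1989-01-01", "1948-01-01", "1995-01-01", "2003-01-01", "2004-01-01",
  "2003-01-01", "2005-01-01", "2010-01-01", "1981-01-01", "2002-01-01"
))

# A Date is stored as days since 1970-01-01, so 1971-01-01 becomes 365.
days_since_epoch <- as.numeric(release_date)
print(days_since_epoch[1])

# Extract the calendar year as an integer instead.
release_year <- as.integer(format(release_date, "%Y"))

# Group the years into decades and count the records in each.
decade <- (release_year %/% 10L) * 10L
records_per_decade <- table(decade)
print(records_per_decade)
```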

smoothed_data <- lowess(time(ts_release_years), ts_release_years)

plot(ts_release_years, type = "n", xlab = "Observation (row order)", ylab = "Release Date", main = "Scatter Plot with Lowess Smoothing")
points(time(ts_release_years), ts_release_years, pch = 19, col = "blue")  # Scatter plot
lines(smoothed_data, col = "red")  # Lowess smoothed line

1. The scatter plot shows the release date of each record as a blue point, with a red Lowess-smoothed line drawn on top. The smoother summarizes the general level of the points and makes broad patterns easier to see.
2. Lowess fits a locally weighted regression around each point, so it damps isolated outliers and short-term oscillations and brings out the underlying pattern.
3. The horizontal axis is the row order of the data, so the smoothed line describes how release dates vary across the rows of the file. Its peaks and troughs should not be read as rises or falls in film output over time.
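
Lowess is easier to interpret when the horizontal axis is a real time variable. As an illustration only, and not something the original analysis does, the sketch below applies lowess() to the 1960 to 1980 yearly means copied from the filtered tibble, with the year on the x-axis. The smoother span f = 2/3 is the function's default, written out here for clarity; the variable names are illustrative.

```r
# Yearly means for 1960-1980, copied from the printed tibble (1978 is absent).
year <- c(1960:1977, 1979, 1980)
mean_age_difference <- c(5, 9, 3.67, 15.2, 10.8, 11, 16, 9.43, 11, 4,
                         9.5, 22.3, 13.3, 16.2, 12, 6, 18, 13.8, 20.2, 17)

# lowess() returns a list with components x (sorted) and y (smoothed values).
smoothed <- lowess(year, mean_age_difference, f = 2/3)

# Raw and smoothed values side by side.
print(data.frame(year = smoothed$x,
                 raw = mean_age_difference,
                 smoothed = round(smoothed$y, 2)))
```

The result can be drawn over a scatter plot with lines(smoothed, col = "red"), in the same way as the plot above.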