data <- read.csv ("C:\\Users\\91630\\OneDrive\\Desktop\\statistics\\age_gaps.CSV")
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(xts)
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
data$release_date <- as.Date(paste0(data$release_year, "-01-01"))
data <- subset(data, select = -release_year)
head(data$release_date, 20)
## [1] "1971-01-01" "2006-01-01" "2002-01-01" "1998-01-01" "2010-01-01"
## [6] "1992-01-01" "2009-01-01" "1999-01-01" "1992-01-01" "1999-01-01"
## [11] "1989-01-01" "1948-01-01" "1995-01-01" "2003-01-01" "2004-01-01"
## [16] "2003-01-01" "2005-01-01" "2010-01-01" "1981-01-01" "2002-01-01"
sorted_df <- data %>% arrange(release_date)
head(sorted_df)
## movie_name director age_difference couple_number
## 1 Star of Midnight Stephen Roberts 19 1
## 2 Captain Blood Michael Curtiz 7 1
## 3 Modern Times Charlie Chaplin 21 1
## 4 Stella Dallas King Vidor 12 1
## 5 A Star Is Born William A. Wellman 9 3
## 6 Stella Dallas King Vidor 1 2
## actor_1_name actor_2_name character_1_gender character_2_gender
## 1 William Powell Ginger Rogers man woman
## 2 Errol Flynn Olivia de Havilland man woman
## 3 Charlie Chaplin Paulette Goddard man woman
## 4 John Boles Barbara Stanwyck man woman
## 5 Fredric March Janet Gaynor man woman
## 6 Anne Shirley Tim Holt woman man
## actor_1_birthdate actor_2_birthdate actor_1_age actor_2_age release_date
## 1 1892-06-29 1911-07-16 43 24 1935-01-01
## 2 1909-06-20 1916-07-01 26 19 1935-01-01
## 3 1889-04-16 1910-06-03 47 26 1936-01-01
## 4 1895-10-28 1907-07-16 42 30 1937-01-01
## 5 1897-08-31 1906-10-06 40 31 1937-01-01
## 6 1918-04-17 1919-02-05 19 18 1937-01-01
mean_age_difference <- sorted_df %>%
  group_by(release_date) %>%
  summarize(mean_age_difference = mean(age_difference, na.rm = TRUE))
mean_age_difference
## # A tibble: 82 × 2
## release_date mean_age_difference
## <date> <dbl>
## 1 1935-01-01 13
## 2 1936-01-01 21
## 3 1937-01-01 7.33
## 4 1939-01-01 12
## 5 1940-01-01 11.3
## 6 1942-01-01 20.5
## 7 1944-01-01 25
## 8 1946-01-01 25
## 9 1947-01-01 25
## 10 1948-01-01 23.2
## # ℹ 72 more rows
ggplot(mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")
1. Across the full dataset, the mean age difference between the two actors in each on-screen couple shows no sharp, sustained peaks in the graph, so the series looks fairly stable overall. That stability makes a more targeted investigation of a shorter period reasonable.
2. Restricting the analysis to 1960 to 1980 gives a shorter, less complex window in which the pattern can be examined more closely.
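One way to check how stable the annual means are is to keep the number of rows behind each mean and to summarise the spread of the means. The sketch below is illustrative only: the `films` data frame and its values are hypothetical stand-ins for the real dataset, and base R is used so that it runs without extra packages.

```r
# Hypothetical toy data standing in for the real age-gap dataset
films <- data.frame(
  release_year   = c(1960, 1960, 1961, 1962, 1962, 1962),
  age_difference = c(4, 6, 9, 2, 3, 6)
)

# Mean gap and number of couples per release year (tapply orders groups by year)
yearly <- data.frame(
  release_year = sort(unique(films$release_year)),
  mean_gap     = as.numeric(tapply(films$age_difference, films$release_year, mean)),
  n_couples    = as.numeric(tapply(films$age_difference, films$release_year, length))
)

# Spread of the annual means: small relative to their level suggests stability
spread <- sd(yearly$mean_gap)

print(yearly)
print(spread)
```

A mean based on one or two couples can swing widely from year to year, so the `n_couples` column helps separate genuine peaks from small-sample noise.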
filtered_mean_age_difference <- mean_age_difference %>%
  filter(release_date >= as.Date("1960-01-01") & release_date <= as.Date("1980-12-31"))
print(filtered_mean_age_difference)
## # A tibble: 20 × 2
## release_date mean_age_difference
## <date> <dbl>
## 1 1960-01-01 5
## 2 1961-01-01 9
## 3 1962-01-01 3.67
## 4 1963-01-01 15.2
## 5 1964-01-01 10.8
## 6 1965-01-01 11
## 7 1966-01-01 16
## 8 1967-01-01 9.43
## 9 1968-01-01 11
## 10 1969-01-01 4
## 11 1970-01-01 9.5
## 12 1971-01-01 22.3
## 13 1972-01-01 13.3
## 14 1973-01-01 16.2
## 15 1974-01-01 12
## 16 1975-01-01 6
## 17 1976-01-01 18
## 18 1977-01-01 13.8
## 19 1979-01-01 20.2
## 20 1980-01-01 17
ggplot(filtered_mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")
1. The graph shows that the mean age gap between paired actors fluctuated from year to year between 1960 and 1980, with several rises and falls. Despite these fluctuations, an overall increasing tendency is visible across the period.
2. The years 1960 to 1980 were a dynamic time for the film industry, marked by changes in storytelling techniques, cultural preferences, and technology. Against that background, the general pattern points to a gradual rise in the age gaps between actors during this era.
3. Individual years can deviate considerably from the overall pattern, so the trend should be read as a general tendency towards larger age gaps rather than a steady year-on-year increase. Placing the data in its historical and industry context is important for interpreting it.
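A simple numerical check of this upward tendency is to compare the average of the annual means in the first and second halves of the window. The sketch below re-enters by hand the 20 annual means printed above (rounded as displayed, with 1978 absent from the data); the variable names are illustrative.

```r
# Annual means copied from the printed 1960-1980 table (1978 has no entry)
years <- c(1960:1977, 1979, 1980)
mean_gap <- c(5, 9, 3.67, 15.2, 10.8, 11, 16, 9.43, 11, 4,
              9.5, 22.3, 13.3, 16.2, 12, 6, 18, 13.8, 20.2, 17)

# Unweighted average of the annual means in each half of the window
early <- mean(mean_gap[years <= 1969])
late  <- mean(mean_gap[years >= 1970])

print(c(early = early, late = late, difference = late - early))
```

These are averages of yearly means, not of individual couples, so years with few films count as much as years with many.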
ggplot(filtered_mean_age_difference, aes(x = release_date, y = mean_age_difference)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_line() +
  labs(x = "Release Date", y = "Mean Age Difference", title = "Mean Age Difference Over Release Year")
## `geom_smooth()` using formula = 'y ~ x'
1. A rising trend from 1960 to 1980 is supported by the linear
regression line fitted to the filtered mean age difference data. The
fitted trendline slopes upward, giving the visual assessment of the
trend a quantitative counterpart.
2. The regression line agrees with the interpretation of the graph:
across the chosen period, the mean age difference between paired actors
follows an upward trajectory. Because geom_smooth() only draws the
fitted line here, the slope's size and statistical significance would
need to be read from the fitted model itself.
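To put a number on the trendline, the same straight-line model can be fitted directly with `lm()`. This is a sketch rather than the report's own computation: it uses the rounded annual means from the printed 1960 to 1980 table, with the year as a plain number, and the object names are assumptions.

```r
# Annual means copied from the printed 1960-1980 table (1978 has no entry)
trend <- data.frame(
  year     = c(1960:1977, 1979, 1980),
  mean_gap = c(5, 9, 3.67, 15.2, 10.8, 11, 16, 9.43, 11, 4,
               9.5, 22.3, 13.3, 16.2, 12, 6, 18, 13.8, 20.2, 17)
)

# Ordinary least squares fit of mean age gap on calendar year
fit <- lm(mean_gap ~ year, data = trend)

# Slope = estimated change in mean age gap per year
slope <- unname(coef(fit)["year"])

# Coefficient table with estimates, standard errors, t values and p-values
slope_table <- summary(fit)$coefficients
print(slope_table)
```

The `year` row of the coefficient table gives the slope together with its standard error and p-value, which is what indicates whether the upward trend is distinguishable from noise over these 20 points.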
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
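# Store the numeric release year; ts() on a Date column would keep only the underlying day counts
ts_release_years <- ts(as.numeric(format(data$release_date, "%Y")))
# The series is indexed by row position in the file, not by calendar time
plot(ts_release_years, type = "l", xlab = "Row index", ylab = "Release Year", main = "Time Series of Movie Release Years")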
1. The time series plot of movie release years shows which release years appear in the dataset. Each observation in the series is the release year of one row, and the observations are joined by a line.
2. Because ts() is applied to the release dates in their original row order, the horizontal axis is the row position in the file rather than calendar time. The plot therefore shows the range and spread of release years covered by the data, not a chronological timeline of industry activity.
3. Periods in which the film business was growing, stagnating, or declining cannot be read directly from this plot. That kind of analysis of cultural trends and industry dynamics needs the number of releases counted per year.
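A series that is indexed by calendar year can be built by counting rows per release year. The sketch below is a minimal illustration that uses only the 20 release years shown by `head(data$release_date, 20)` earlier in the report; the names `all_years` and `rows_per_year` are assumptions, and the full dataset would give different counts.

```r
# Release years of the first 20 rows shown by head(data$release_date, 20)
release_year <- c(1971, 2006, 2002, 1998, 2010, 1992, 2009, 1999, 1992, 1999,
                  1989, 1948, 1995, 2003, 2004, 2003, 2005, 2010, 1981, 2002)

# Count rows per calendar year, keeping years that have zero rows
all_years <- seq(min(release_year), max(release_year))
counts <- table(factor(release_year, levels = all_years))

# Annual series whose time index is the calendar year
rows_per_year <- ts(as.numeric(counts), start = min(release_year), frequency = 1)

print(window(rows_per_year, start = 1998, end = 2010))
```

Passing `rows_per_year` to `plot()` then puts calendar years on the horizontal axis, so rises and falls in the line correspond to more or fewer entries per year.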
hist(ts_release_years, breaks = 20, col = "skyblue", xlab = "Release Year", ylab = "Frequency", main = "Histogram of Movie Release Years")
1. The histogram of movie release years shows the frequency distribution of releases over time. Each bar counts the entries whose release year falls within one interval, so with breaks = 20 a bar typically spans several years rather than a single year.
2. Patterns in the density of releases can be identified from the heights of the bars. Groups of tall bars mark periods in which many of the dataset's films were released, while low bars mark periods that are sparsely represented.
3. The histogram also gives a view of historical variation in the dataset's coverage. Peaks could correspond to important industry turning points, cultural events, or changes in audience tastes, whereas valleys might reflect economic recession or changes in production priorities, although they may equally reflect which films were included in the dataset.
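If each bar should represent exactly one release year, the bin edges can be set explicitly at half-year offsets. The sketch below again uses only the 20 release years shown by `head(data$release_date, 20)` as sample input; `breaks` and `h` are illustrative names, and `plot = FALSE` is used so that the counts can be inspected without drawing.

```r
# Release years of the first 20 rows shown by head(data$release_date, 20)
release_year <- c(1971, 2006, 2002, 1998, 2010, 1992, 2009, 1999, 1992, 1999,
                  1989, 1948, 1995, 2003, 2004, 2003, 2005, 2010, 1981, 2002)

# One bin per year: edges at year - 0.5 and year + 0.5
breaks <- seq(min(release_year) - 0.5, max(release_year) + 0.5, by = 1)

# Compute the histogram without plotting; h$mids are the years, h$counts the tallies
h <- hist(release_year, breaks = breaks, plot = FALSE)

# Years that occur more than once in this sample
print(h$mids[h$counts > 1])
```

Calling `plot(h)` draws the bars. Wider bins, as produced by `breaks = 20`, trade this per-year detail for a smoother overall shape.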
smoothed_data <- lowess(time(ts_release_years), ts_release_years)
plot(ts_release_years, type = "n", xlab = "Row index", ylab = "Release Date", main = "Scatter Plot with Lowess Smoothing")
points(time(ts_release_years), ts_release_years, pch = 19, col = "blue") # Scatter plot
lines(smoothed_data, col = "red") # Lowess smoothed line
1. The scatter plot shows the release year of each row as a blue point, with a red Lowess smoothed line drawn through the points. The smoothed line makes the general pattern in the release years easier to see.
2. Lowess fits a local regression around each point, which damps short-range fluctuations and outliers. Reducing this noise improves interpretability and makes longer-range patterns easier to identify.
3. The trajectory of the smoothed line shows how the typical release year changes across the dataset. Since the horizontal axis is the row index, its peaks and troughs indicate where in the file later or earlier films are concentrated, and reading them as rises and falls in film output would additionally require the rows to be ordered in time.
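Lowess is easier to interpret when the horizontal axis is calendar year. As an illustration on a different series from the one plotted above, the sketch below smooths the 1960 to 1980 annual mean age differences printed earlier in this report (rounded as displayed); the span `f = 2/3` is the function's default and is written out only to make the choice visible.

```r
# Annual means copied from the printed 1960-1980 table (1978 has no entry)
years <- c(1960:1977, 1979, 1980)
mean_gap <- c(5, 9, 3.67, 15.2, 10.8, 11, 16, 9.43, 11, 4,
              9.5, 22.3, 13.3, 16.2, 12, 6, 18, 13.8, 20.2, 17)

# Locally weighted regression; f is the fraction of points used for each local fit
smoothed <- lowess(years, mean_gap, f = 2/3)

# smoothed$x holds the years, smoothed$y the smoothed mean age gap
print(data.frame(year = smoothed$x, smoothed_gap = round(smoothed$y, 2)))
```

A smaller `f` follows the year-to-year swings more closely, while a larger `f` gives a flatter curve. `lines(smoothed, col = "red")` adds the result to an existing plot in the same way as in the code above.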