Datadive11.Rmd

```
data <- read.csv("C:\\Users\\varsh\\OneDrive\\Desktop\\Gitstuff\\age_gaps.CSV")

library(ggrepel)
```
```
## Loading required package: ggplot2
```
```
library(boot)
library(broom)
library(ggthemes)
library(lindia)
library(car)
```
```
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
```

```
model <- lm(actor_1_age ~ age_difference + release_year + couple_number, data = data)
summary(model)
```
```
## 
## Call:
## lm(formula = actor_1_age ~ age_difference + release_year + couple_number, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.723  -4.943  -0.566   3.712  36.430 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -141.64135   26.70573  -5.304 1.36e-07 ***
## age_difference    0.92008    0.02640  34.857  < 2e-16 ***
## release_year      0.08553    0.01331   6.425 1.93e-10 ***
## couple_number     1.11478    0.29163   3.823 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.243 on 1151 degrees of freedom
## Multiple R-squared:  0.5185, Adjusted R-squared:  0.5172 
## F-statistic: 413.1 on 3 and 1151 DF,  p-value: < 2.2e-16
```

The intercept of -141.64135 is the expected age of actor 1 when age_difference, release_year, and couple_number are all zero. It has no practical interpretation, because a release year of 0 lies far outside the range of the data.
age_difference: The coefficient of 0.92008 indicates that each one-unit increase in age difference is associated with an increase of about 0.92 years in actor 1’s expected age, holding the other predictors constant.
release_year: The coefficient of 0.08553 indicates that each additional release year is associated with an increase of about 0.086 years in actor 1’s expected age, holding the other predictors constant.
couple_number: The coefficient of 1.11478 indicates that each one-unit increase in couple_number is associated with an increase of about 1.11 years in actor 1’s expected age, holding the other predictors constant.
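
One common way to get a usable intercept is to centre the predictors before fitting. The sketch below uses simulated data, not the age_gaps file. The column names are borrowed only for readability, and `sim`, `raw_fit`, and `centred_fit` are hypothetical names. It shows that centring leaves the slopes unchanged and turns the intercept into the expected response at the average predictor values.

```r
# Simulated stand-in data (NOT the real age_gaps values).
set.seed(1)
n <- 200
sim <- data.frame(
  age_difference = rpois(n, 10),
  release_year   = sample(1950:2020, n, replace = TRUE)
)
sim$actor_1_age <- 25 + 0.9 * sim$age_difference +
  0.05 * (sim$release_year - 1985) + rnorm(n, sd = 5)

# Fit on the raw predictors: the intercept refers to release_year = 0.
raw_fit <- lm(actor_1_age ~ age_difference + release_year, data = sim)

# Centre each predictor at its sample mean and refit.
sim$age_difference_c <- sim$age_difference - mean(sim$age_difference)
sim$release_year_c   <- sim$release_year - mean(sim$release_year)
centred_fit <- lm(actor_1_age ~ age_difference_c + release_year_c, data = sim)

coef(raw_fit)      # intercept is an extrapolation to year 0
coef(centred_fit)  # same slopes; intercept equals the mean response
```

Applied to the model above, the same idea would mean subtracting the mean from each predictor (for example, from release_year) before calling `lm()`.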

Diagnosing the Model

Residuals vs. Fitted Values
```
gg_resfitted(model) +
  geom_smooth(se=FALSE)
```
```
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
```
A distinct pattern or trend in the residuals suggests that an assumption is violated: a curved trend points to nonlinearity, and a funnel shape means that the residual variance is not constant across the fitted values (heteroscedasticity).
Ideally, the plot shows residuals scattered randomly around the horizontal line at 0. This indicates that the residuals have constant variance and are unrelated to the fitted values, which supports the assumptions of homoscedasticity and independence.
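
A numerical check can complement the visual reading of this plot. The sketch below computes Koenker's studentized version of the Breusch-Pagan test by hand in base R: it regresses the squared residuals on the model's predictors and compares n times the auxiliary R-squared to a chi-squared distribution. The helper `bp_test()` and the simulated data are hypothetical and are not part of the original analysis.

```r
# Studentized Breusch-Pagan test (Koenker's n * R^2 form), base R only.
bp_test <- function(fit) {
  X   <- model.matrix(fit)          # design matrix, including the intercept
  e2  <- residuals(fit)^2           # squared residuals
  aux <- lm.fit(X, e2)              # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1               # number of predictors, excluding intercept
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example in which the error spread grows with x.
set.seed(42)
x <- runif(300, 1, 10)
y <- 2 + 3 * x + rnorm(300, sd = x)
hetero_fit <- lm(y ~ x)

bp_result <- bp_test(hetero_fit)
bp_result   # a small p-value is evidence against constant variance
```

Calling `bp_test(model)` would apply the same check to the fitted model in this report.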

Residuals vs. X Values
```
residual_plots <- gg_resX(model)
```
Extreme outliers, or points with both high leverage and large residuals, may be influential observations that have a substantial effect on the regression coefficients.
If the residuals are scattered randomly around the horizontal line at 0, with no visible pattern or trend as the X values vary, the relationship between that predictor and the response is plausibly linear, which supports the linearity assumption.
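
If one of these plots does show curvature, a simple follow-up is to add a squared term for that predictor and compare the two nested fits with an F-test. The sketch below uses simulated data with a deliberately curved relationship. The names `linear_fit`, `curved_fit`, and `comparison` are hypothetical.

```r
# Simulated data where the true relationship is quadratic in x.
set.seed(7)
x <- runif(250, 0, 10)
y <- 1 + 0.5 * x + 0.3 * x^2 + rnorm(250)

linear_fit <- lm(y ~ x)
curved_fit <- update(linear_fit, . ~ . + I(x^2))  # add a squared term

# F-test for the nested models: a small p-value favours the curved fit,
# i.e. the straight-line model misses structure in the data.
comparison <- anova(linear_fit, curved_fit)
comparison
```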

Residuals Histogram
```
gg_reshist(model)
```
```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```
If the histogram deviates markedly from a bell-shaped curve, the residuals are probably not normally distributed. Skewness, heavy tails, or several peaks are typical signs of such a departure.
If the histogram resembles a bell-shaped curve centered on zero, it supports the assumption of normal residuals. A symmetric distribution with no noticeable skewness indicates that the residuals are approximately normally distributed.
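
The shape of the histogram can also be summarised with two numbers. The sketch below computes the approximate sample skewness and excess kurtosis of the standardized residuals. Both are close to 0 for normal residuals. The helper `shape_summary()` and the simulated, right-skewed data are hypothetical.

```r
# Approximate moment-based skewness and excess kurtosis of the residuals.
shape_summary <- function(fit) {
  z <- as.numeric(scale(residuals(fit)))  # centre and scale to unit sd
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
}

# Simulated example with right-skewed (exponential) errors.
set.seed(5)
x <- runif(300)
y <- 1 + 2 * x + rexp(300)
skewed_fit <- lm(y ~ x)

shape <- shape_summary(skewed_fit)
shape   # clearly positive skewness signals a long right tail
```

For the model in this report, `shape_summary(model)` would give the corresponding values.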

QQ-Plots
```
gg_qqplot(model)
```

If the points in the Q-Q plot depart substantially from the diagonal line, the residuals are probably not normally distributed.
Ideally, the points lie close to the diagonal line, which suggests that the residuals are approximately normally distributed. Systematic deviations from the line, especially in the tails, indicate departures from normality.
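
A formal test can back up the visual impression from the Q-Q plot. The sketch below uses only base R: it extracts the Q-Q coordinates with `qqnorm(..., plot.it = FALSE)`, computes their correlation, and runs `shapiro.test()` on the standardized residuals. The data are simulated with heavy-tailed errors, and the object names are hypothetical.

```r
# Simulated example with heavy-tailed (t-distributed) errors.
set.seed(11)
x <- runif(300)
y <- 1 + 2 * x + rt(300, df = 2)
heavy_fit <- lm(y ~ x)

r  <- rstandard(heavy_fit)          # standardized residuals
qq <- qqnorm(r, plot.it = FALSE)    # theoretical (x) vs sample (y) quantiles

qq_cor <- cor(qq$x, qq$y)           # near 1 when points hug the line
sw     <- shapiro.test(r)           # Shapiro-Wilk normality test

qq_cor
sw$p.value   # a small p-value is evidence against normal residuals
```

With a large sample, the Shapiro-Wilk test can reject for small departures that barely matter in practice, so it is best read alongside the plot rather than instead of it.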

Cook’s Distance Plot
```
gg_cooksd(model, threshold = 'matlab')
```

Points with Cook’s distance values much higher than the rest are influential observations that may have a disproportionate effect on the regression coefficients. A high Cook’s distance means that removing the observation would noticeably change the estimated coefficients.
The Cook’s distance plot flags influential observations by marking those whose Cook’s distance exceeds a chosen threshold (here, the rule selected by threshold = 'matlab').
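
The flagged points can also be listed directly. The sketch below uses base R's `cooks.distance()` with two common rules of thumb: 4/n and three times the mean Cook's distance. The second is, as far as I understand the lindia documentation, the rule behind `threshold = 'matlab'`, but that is an assumption worth verifying. The helper `flag_influential()` and the simulated data with one planted outlier are hypothetical.

```r
# List Cook's distances and flag points under two rule-of-thumb cutoffs.
flag_influential <- function(fit) {
  d <- unname(cooks.distance(fit))
  n <- length(d)
  data.frame(
    row           = seq_len(n),
    cooks_d       = d,
    over_4_over_n = d > 4 / n,        # conventional 4/n cutoff
    over_3x_mean  = d > 3 * mean(d)   # assumed equivalent of the 'matlab' rule
  )
}

# Simulated data with one high-leverage outlier planted in row 101.
set.seed(3)
x <- c(rnorm(100), 8)
y <- c(1 + 2 * x[1:100] + rnorm(100), -20)
full_fit <- lm(y ~ x)

flags <- flag_influential(full_fit)
flags[flags$over_4_over_n, ]

# Refit without the planted point to see how far the slope moves.
trimmed_fit <- lm(y ~ x, subset = -101)
c(full = coef(full_fit)[["x"]], trimmed = coef(trimmed_fit)[["x"]])
```

Comparing the coefficients with and without the flagged rows, as in the last lines, is usually more informative than the threshold alone.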

Let us now interpret one of the regression model’s coefficients. We will select the coefficient associated with the predictor variable “age_difference”.

```
summary(model)
```
```
## 
## Call:
## lm(formula = actor_1_age ~ age_difference + release_year + couple_number, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.723  -4.943  -0.566   3.712  36.430 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -141.64135   26.70573  -5.304 1.36e-07 ***
## age_difference    0.92008    0.02640  34.857  < 2e-16 ***
## release_year      0.08553    0.01331   6.425 1.93e-10 ***
## couple_number     1.11478    0.29163   3.823 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.243 on 1151 degrees of freedom
## Multiple R-squared:  0.5185, Adjusted R-squared:  0.5172 
## F-statistic: 413.1 on 3 and 1151 DF,  p-value: < 2.2e-16
```

Coefficient: age_difference = 0.92008

Interpretation:

The coefficient of 0.92008 means that, on average, for each unit increase in the “age_difference” variable, which represents the age difference between characters in the film, the expected age of actor 1 increases by about 0.92 years, holding release_year and couple_number constant.
This shows that as the age difference between characters in a film increases, so does the expected age of the first actor, which may reflect casting decisions or character-related factors in the dataset.
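
The uncertainty around this estimate can be read off the summary table. The sketch below rebuilds an approximate 95% confidence interval from the reported estimate (0.92008), standard error (0.02640), and residual degrees of freedom (1151). It uses only numbers printed above, so the result is subject to their rounding.

```r
# Values copied from the summary(model) output above.
estimate  <- 0.92008
std_error <- 0.02640
resid_df  <- 1151

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 3)
```

The interval runs from roughly 0.87 to 0.97. With the fitted object available, `confint(model, "age_difference")` should agree with this up to rounding.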

2024-04-06
