Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable

I selected “score” as my response variable. In the context of my IMDb dataset, “score” represents the IMDb rating given to movies, which is highly valuable to both viewers and creators.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable

I selected the “orig_lang” column as the categorical explanatory variable for Task 2. Null hypothesis: there is no significant difference in mean IMDb scores among movies of different original languages.

categorical_column <- "orig_lang"


# H0: There is no significant difference in IMDb scores among movies of different original languages.

anova_result_orig_lang <- aov(score ~ factor(data[[categorical_column]]), data = data)

summary(anova_result_orig_lang)
##                                       Df  Sum Sq Mean Sq F value Pr(>F)    
## factor(data[[categorical_column]])    53  134051    2529   14.79 <2e-16 ***
## Residuals                          10124 1730891     171                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results:

The degrees of freedom (Df) for the factor (original language) is 53, and the degrees of freedom for residuals are 10124.

The sum of squares (Sum Sq) for the factor (original language) is 134051, while for residuals, it’s 1730891.

The mean square (Mean Sq) for the factor is 2529.

The F-value is 14.79, and the associated p-value is less than 2e-16 (essentially zero).

This result indicates that mean IMDb scores differ significantly across original languages. The very low p-value is strong evidence against the null hypothesis that all language groups share the same mean score. Note that ANOVA establishes an association, not that language itself causes the difference, and it does not identify which languages differ from which.

In practical terms, anyone interested in IMDb scores should consider original language as a relevant factor: average scores vary by language, and this variation is statistically significant. With more than 10,000 movies, however, even modest differences reach statistical significance, so the result says the differences are real, not that they are large. This information can be valuable for filmmakers, distributors, and anyone in the movie industry who wants to understand the factors associated with IMDb score.
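To get a sense of how large the language differences are, a rough effect-size calculation (eta squared) can be made from the sums of squares printed in the ANOVA table above. This is only a sketch: the variable names are illustrative, and the numbers are copied by hand from the output, not recomputed from the data.

```r
# Values copied from the ANOVA summary above
ss_language <- 134051    # Sum Sq for the original-language factor
ss_residual <- 1730891   # Sum Sq for the residuals
df_language <- 53
df_residual <- 10124

# Eta squared: share of the total variation in score associated with language
eta_squared <- ss_language / (ss_language + ss_residual)

# Recompute F from the two mean squares as a consistency check on the table
f_value <- (ss_language / df_language) / (ss_residual / df_residual)

print(round(c(eta_squared = eta_squared, f_value = f_value), 3))
```

Eta squared comes out at roughly 0.07, so original language is associated with about 7% of the variation in scores, and the recomputed F agrees with the 14.79 reported above.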

Find at least one other continuous (or ordered integer) column of data that might influence the response variable.

library(ggplot2)

ggplot(data, aes(x = budget_x, y = score)) +
  geom_point() +
  labs(title = "IMDb Scores vs. Budget", x = "Budget (in dollars)", y = "IMDb Score")

lm_result_budget <- lm(score ~ budget_x, data = data)


summary(lm_result_budget)
## 
## Call:
## lm(formula = score ~ budget_x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.121  -5.115   1.537   7.717  44.105 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.712e+01  1.975e-01  339.93   <2e-16 ***
## budget_x    -5.585e-08  2.285e-09  -24.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 10176 degrees of freedom
## Multiple R-squared:  0.05545,    Adjusted R-squared:  0.05535 
## F-statistic: 597.3 on 1 and 10176 DF,  p-value: < 2.2e-16

The intercept, 67.12, is the predicted IMDb score for a movie with a budget of zero. This is an extrapolation with little practical meaning, since movies typically have non-zero budgets. The coefficient for budget (budget_x) is approximately -5.585e-08 per dollar, so a one-dollar change is associated with an exceedingly small change in score; on a more useful scale, each additional $10 million of budget is associated with a score about 0.56 points lower on average. The residual standard error is 13.16 on 10176 degrees of freedom, meaning predictions typically miss the actual score by roughly 13 points, which is not a close fit. The R-squared value is 0.05545, so budget accounts for only about 5.5% of the variability in IMDb scores and is a weak predictor on its own. The F-statistic is 597.3 with a p-value below 2.2e-16, so the relationship is statistically significant despite being weak.

The negative coefficient indicates a slight negative association between budget and IMDb score, roughly 5.6 points lower per additional $100 million. The low R-squared value shows that budget alone is an insufficient predictor: about 94% of the variability in scores is left unexplained by this model.

Budget is related to IMDb score, but it is far from the sole determinant.
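A per-dollar slope is hard to read, so it can help to re-express budget in larger units. The sketch below uses simulated stand-in data (not the IMDb dataset; the column `budget_musd` and all generated values are hypothetical) to show that dividing the predictor by one million multiplies the slope by one million and leaves R-squared unchanged. The last lines rescale the slope reported in the summary above.

```r
# Simulated stand-in data; NOT the IMDb dataset analysed above
set.seed(1)
n <- 500
sim <- data.frame(budget_x = runif(n, min = 1e6, max = 2e8))
sim$score <- 67 - 5.6e-8 * sim$budget_x + rnorm(n, sd = 13)

# Same model twice: budget in dollars, then in millions of dollars
fit_dollars <- lm(score ~ budget_x, data = sim)
sim$budget_musd <- sim$budget_x / 1e6   # hypothetical helper column
fit_millions <- lm(score ~ budget_musd, data = sim)

slope_dollars  <- unname(coef(fit_dollars)["budget_x"])
slope_millions <- unname(coef(fit_millions)["budget_musd"])
r2_dollars  <- summary(fit_dollars)$r.squared
r2_millions <- summary(fit_millions)$r.squared

print(c(per_dollar = slope_dollars, per_million = slope_millions))
print(c(r2_dollars = r2_dollars, r2_millions = r2_millions))

# Slope reported in the summary above, re-expressed per $10 million
reported_slope <- -5.585e-08
per_10_million <- reported_slope * 1e7
print(per_10_million)
```

Rescaling changes only the units of the coefficient, not the fit, which is why the small per-dollar estimate and the low R-squared have to be judged separately.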

Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t)

# Building a linear regression model with 'budget_x' and 'country'
lm_result_country <- lm(score ~ budget_x + country, data = data)

# Evaluating the model fit
summary(lm_result_country)
## 
## Call:
## lm(formula = score ~ budget_x + country, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.078  -5.328   1.116   7.318  55.261 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.019e+01  1.969e+00  35.649  < 2e-16 ***
## budget_x    -4.455e-08  2.259e-09 -19.721  < 2e-16 ***
## countryAT   -7.169e-01  9.078e+00  -0.079 0.937063    
## countryAU   -1.111e+00  1.968e+00  -0.565 0.572272    
## countryBE   -1.205e+01  5.127e+00  -2.350 0.018802 *  
## countryBO   -1.320e+01  1.269e+01  -1.041 0.298121    
## countryBR   -9.828e+00  2.823e+00  -3.482 0.000501 ***
## countryBY   -6.273e+01  1.269e+01  -4.943 7.81e-07 ***
## countryCA   -8.641e+00  2.487e+00  -3.475 0.000513 ***
## countryCH   -2.915e+01  5.938e+00  -4.908 9.35e-07 ***
## countryCL   -6.515e+00  4.615e+00  -1.412 0.158058    
## countryCN   -6.993e+00  2.351e+00  -2.975 0.002938 ** 
## countryCO   -4.812e+00  3.881e+00  -1.240 0.215016    
## countryCZ    4.566e+00  9.078e+00   0.503 0.614978    
## countryDE   -6.830e+00  2.371e+00  -2.881 0.003973 ** 
## countryDK   -6.132e-01  3.223e+00  -0.190 0.849123    
## countryDO    5.482e+00  1.269e+01   0.432 0.665728    
## countryES   -5.205e+00  2.205e+00  -2.361 0.018247 *  
## countryFI   -6.463e+00  5.483e+00  -1.179 0.238560    
## countryFR   -3.318e+00  2.132e+00  -1.557 0.119593    
## countryGB   -3.693e+00  2.177e+00  -1.697 0.089781 .  
## countryGR   -6.417e+00  5.480e+00  -1.171 0.241629    
## countryGT    1.606e+00  9.078e+00   0.177 0.859549    
## countryHK   -1.043e+01  2.256e+00  -4.621 3.87e-06 ***
## countryHU   -2.982e+01  7.498e+00  -3.977 7.04e-05 ***
## countryID   -1.718e+01  4.115e+00  -4.174 3.01e-05 ***
## countryIE   -7.592e+00  4.257e+00  -1.784 0.074530 .  
## countryIL   -6.082e+00  1.269e+01  -0.479 0.631714    
## countryIN   -2.030e+01  2.737e+00  -7.418 1.29e-13 ***
## countryIR   -4.038e+01  9.079e+00  -4.448 8.76e-06 ***
## countryIS   -5.276e+00  9.078e+00  -0.581 0.561160    
## countryIT   -7.786e+00  2.261e+00  -3.444 0.000576 ***
## countryJP   -9.597e-01  2.031e+00  -0.473 0.636557    
## countryKH   -6.150e+01  1.269e+01  -4.846 1.28e-06 ***
## countryKR   -1.218e+01  2.066e+00  -5.895 3.87e-09 ***
## countryLV   -3.470e+00  1.269e+01  -0.273 0.784506    
## countryMU    4.309e+00  1.269e+01   0.340 0.734153    
## countryMX    3.358e+00  2.309e+00   1.455 0.145783    
## countryMY   -1.629e+01  9.080e+00  -1.794 0.072773 .  
## countryNL   -9.273e+00  3.696e+00  -2.509 0.012119 *  
## countryNO   -4.694e+00  3.696e+00  -1.270 0.204109    
## countryPE    1.561e-01  5.127e+00   0.030 0.975711    
## countryPH   -2.443e+01  2.737e+00  -8.925  < 2e-16 ***
## countryPL   -4.512e+00  3.313e+00  -1.362 0.173297    
## countryPR    9.303e+00  9.078e+00   1.025 0.305468    
## countryPT   -7.019e+01  1.269e+01  -5.531 3.26e-08 ***
## countryPY    7.627e+00  1.269e+01   0.601 0.547793    
## countryRU   -1.020e+01  2.619e+00  -3.896 9.85e-05 ***
## countrySE   -7.930e+00  4.615e+00  -1.718 0.085759 .  
## countrySG   -1.580e+01  6.567e+00  -2.406 0.016125 *  
## countrySK   -6.241e+01  9.080e+00  -6.874 6.63e-12 ***
## countrySU    1.070e+01  5.940e+00   1.801 0.071745 .  
## countryTH   -9.397e+00  3.012e+00  -3.120 0.001814 ** 
## countryTR   -1.653e+01  3.419e+00  -4.834 1.36e-06 ***
## countryTW   -6.147e+00  3.990e+00  -1.541 0.123469    
## countryUA   -4.319e+00  6.568e+00  -0.658 0.510869    
## countryUS   -6.192e+00  1.973e+00  -3.138 0.001706 ** 
## countryUY   -3.391e+01  9.079e+00  -3.735 0.000189 ***
## countryVN   -4.652e+01  7.498e+00  -6.204 5.71e-10 ***
## countryXC    5.836e+00  1.269e+01   0.460 0.645582    
## countryZA   -7.801e+00  7.498e+00  -1.040 0.298150    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.54 on 10117 degrees of freedom
## Multiple R-squared:  0.1475, Adjusted R-squared:  0.1424 
## F-statistic: 29.17 on 60 and 10117 DF,  p-value: < 2.2e-16
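One way to judge whether the country terms add anything beyond budget is a partial F-test for nested models. As a sketch, the statistic can be reconstructed from the R-squared values and degrees of freedom printed in the two summaries above; the variable names are illustrative and the inputs are rounded values copied from the output, so the result is approximate.

```r
# Values copied from the two model summaries above
r2_reduced <- 0.05545   # score ~ budget_x
r2_full    <- 0.1475    # score ~ budget_x + country
df_resid_reduced <- 10176
df_resid_full    <- 10117

# Number of extra parameters in the full model (the country dummies)
df_added <- df_resid_reduced - df_resid_full

# Partial F statistic for the added terms
f_partial <- ((r2_full - r2_reduced) / df_added) /
  ((1 - r2_full) / df_resid_full)

# Upper-tail p-value from the F distribution
p_value <- pf(f_partial, df_added, df_resid_full, lower.tail = FALSE)

print(c(df_added = df_added, f_partial = f_partial, p_value = p_value))

# With the data loaded, the same comparison is available directly:
# anova(lm_result_budget, lm_result_country)
```

The statistic comes out at roughly 18.5 on 59 and 10117 degrees of freedom, so the country terms jointly improve the fit by far more than chance would explain, even though many individual coefficients are not significant.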

Adding “country” to the regression improves the fit, though modestly. R-squared rises from 0.05545 to 0.1475 (adjusted R-squared from 0.05535 to 0.1424) and the residual standard error falls from 13.16 to 12.54, so the model still leaves about 85% of the variability in scores unexplained. Each country coefficient is a difference from the reference category, which under R’s default treatment coding is the first country level alphabetically and is the one level omitted from the coefficient table. For example, movies from Belgium (“countryBE”) score approximately 12.05 points lower on average than movies from the reference country at the same budget. Every country coefficient that is significant at the 0.05 level is negative, and the positive estimates are not significant. Several of the largest estimates (for example “countryBY” and “countryPT”) carry standard errors near 12.7, which is consistent with those countries being represented by very few movies, so they should be read with caution. The “budget_x” coefficient remains negative and significant at -4.455e-08, or about 0.45 points lower per additional $10 million. Overall, the model gives a somewhat better picture of how a movie’s country of origin relates to its IMDb score, but most of the variation in ratings remains unexplained.
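Because every country coefficient is measured against the reference level, it is worth confirming which level that is and, if useful, changing it. The sketch below uses a tiny made-up data frame (the country codes and scores are hypothetical, not taken from the IMDb data) to show how the baseline can be inspected with `levels()` and changed with `relevel()`.

```r
# Hypothetical toy data; codes and scores are invented for illustration
toy <- data.frame(
  country = c("US", "AU", "FR", "AU", "US", "FR"),
  score   = c(60, 70, 65, 72, 58, 66)
)
toy$country <- factor(toy$country)

# Under default treatment contrasts, the first level is the baseline
baseline_default <- levels(toy$country)[1]
print(baseline_default)

# The baseline level has no coefficient of its own; it is absorbed
# into the intercept
fit_default <- lm(score ~ country, data = toy)
print(names(coef(fit_default)))

# Choose a different baseline, here "US", and refit
toy$country_us <- relevel(toy$country, ref = "US")
fit_us <- lm(score ~ country_us, data = toy)
print(coef(fit_us))
```

Changing the baseline alters the individual coefficients and their p-values, since each is a comparison with a different group, but it does not change the fitted values or R-squared.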