STAT2170/6180 Assignment, 2023 S1

The goal of this repo is to provide a template and folder structure for submitting your assignment via GitHub.

You have to create a Markdown file in the directory and push it to your repo.

How well you utilise GitHub as part of your workflow will form part of the assessment.

Question 1 [45 Marks]

a. [7 marks] Produce a plot and a correlation matrix of the data. Comment on possible relationships between the response and predictors and relationships between the predictors themselves.

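pm25 <- read.table("data/pm25.csv", sep = ",", header = TRUE)

# Scatterplot matrix of the response and all predictors
plot(pm25)

# Scatterplot matrix of the predictors only
variables <- c("temperature", "humidity", "wind", "precipitation")
pairs(pm25[variables])

# Correlation matrix of all variables
cormatrix <- cor(pm25)
cormatrix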
#>               temperature    humidity        wind precipitation        pm25
#> temperature    1.00000000 -0.07264891  0.02861166   -0.05050014  0.57191961
#> humidity      -0.07264891  1.00000000  0.12406351   -0.13550607 -0.71965591
#> wind           0.02861166  0.12406351  1.00000000   -0.01525977 -0.21866823
#> precipitation -0.05050014 -0.13550607 -0.01525977    1.00000000  0.03759033
#> pm25           0.57191961 -0.71965591 -0.21866823    0.03759033  1.00000000

The scatterplot matrix gives only a rough visual impression of how each predictor relates to the response, so the correlation matrix is used to quantify the strength of the relationships between the independent variables and the response variable.

The response variable, pm25 (the pm25 column of the correlation matrix):

* humidity: strong negative correlation (-0.72)
* temperature: moderate positive correlation (0.57)
* wind: weak negative correlation (-0.22)
* precipitation: negligible correlation (0.04)

Predictor variables:

* All pairwise correlations between the predictors are very weak; the largest in magnitude are humidity with precipitation (-0.14) and humidity with wind (0.12).

These observations provide initial insights into the relationships between the response variable pm25 and the predictor variables temperature, humidity, wind and precipitation, as well as the relationships between the predictor variables themselves. In decreasing order of strength of association with pm25, the predictors rank as humidity, temperature, wind and precipitation, while the relationships among the predictors themselves are very weak.
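
The ranking of the predictors by strength of association with pm25 can be read off the correlation matrix programmatically. The sketch below is a minimal illustration that copies the pm25 column from the printed output above instead of reading the data file; the names `cor_with_pm25` and `ranked` are introduced here for illustration only.

```r
# Correlations with pm25, copied from the printed correlation matrix
cor_with_pm25 <- c(
  temperature   =  0.57191961,
  humidity      = -0.71965591,
  wind          = -0.21866823,
  precipitation =  0.03759033
)

# Rank predictors by absolute correlation, strongest first
ranked <- cor_with_pm25[order(abs(cor_with_pm25), decreasing = TRUE)]
round(ranked, 2)
```

With the data frame loaded, the same values should be available from `cor(pm25)[, "pm25"]` after dropping the pm25 entry itself.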

b. [6 marks]

model <- lm(pm25 ~ temperature + humidity + wind + precipitation, data = pm25)
summary(model)
#> 
#> Call:
#> lm(formula = pm25 ~ temperature + humidity + wind + precipitation, 
#>     data = pm25)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -23.759  -6.804  -1.649   6.857  20.975 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   102.72259   14.71953   6.979 5.88e-09 ***
#> temperature     1.62142    0.18762   8.642 1.46e-11 ***
#> humidity       -1.27742    0.11854 -10.776 9.49e-15 ***
#> wind           -0.58016    0.23405  -2.479   0.0165 *  
#> precipitation  -0.01091    0.02350  -0.464   0.6444    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.06 on 51 degrees of freedom
#> Multiple R-squared:  0.8127, Adjusted R-squared:  0.7981 
#> F-statistic: 55.34 on 4 and 51 DF,  p-value: < 2.2e-16

The model’s goodness of fit can be assessed using the R-squared value, which indicates the proportion of variance in the response variable explained by the predictors. In this case, the multiple R-squared is 0.8127, meaning that approximately 81.27% of the variance in pm25 can be explained by the predictors. The adjusted R-squared takes into account the number of predictors and sample size, providing a more conservative estimate. The adjusted R-squared for this model is 0.7981.

The F-statistic (55.34) with its associated p-value (< 2.2e-16) tests the overall significance of the model. In this case, the low p-value indicates that the model, as a whole, is statistically significant in explaining the variation in pm25.

The coefficient for humidity is -1.27742 with a standard error of 0.11854. The coefficient is negative, and this, coupled with the strong negative correlation of -0.72 in the correlation matrix, leads to the conclusion that there is a negative relationship between humidity and PM25.

For each one-unit increase in humidity, we expect a decrease of about 1.27742 units in PM25, holding all other predictors constant.

Since the t-value for humidity is -10.776 and the corresponding p-value is 9.49e-15, the coefficient for humidity is statistically significant at the 5% level. This leads to the conclusion that humidity has a negative association with PM25 levels: as humidity increases, PM25 levels tend to decrease.
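
A 95% confidence interval makes the size of the humidity effect more concrete. The sketch below rebuilds it from the estimate, standard error and residual degrees of freedom printed in the summary output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(model) output
est      <- -1.27742   # humidity coefficient
se       <-  0.11854   # its standard error
df_resid <- 51         # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 3)
```

The interval lies entirely below zero, which agrees with the significant negative coefficient. With the fitted object available, `confint(model, "humidity")` should give the same interval directly.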

c. [14 marks] Conduct an F-test for the overall regression i.e. is there any relationship between the response and the predictors. In your answer:

pm25 = β0 + β1*temperature + β2*humidity + β3*wind + β4*precipitation + \(\epsilon\)

According to the linear regression model,

pm25 = 102.72259 + 1.62142*temperature - 1.27742*humidity - 0.58016*wind - 0.01091*precipitation + \(\epsilon\)

pm25 is the response variable (the concentration of PM2.5). temperature, humidity, wind, and precipitation are the predictor variables (the factors that may affect the PM2.5 concentration). β0 is the intercept term (the expected PM2.5 concentration when all predictor variables are equal to zero). β1, β2, β3, and β4 are the regression coefficients (the expected change in PM2.5 concentration associated with a one-unit increase in each predictor variable, holding all other predictor variables constant). ε is the error term (the random variability in the PM2.5 concentration that is not explained by the predictor variables).

Null Hypothesis: H_0 : β1 = β2 = β3 = β4 = 0

The model has no significant explanatory power, and the coefficients of the independent variables are all 0. The intercept β0 is not part of this test.

Alt Hypothesis: H_1 : at least one of β1, β2, β3, β4 ≠ 0

The regression model has significant explanatory power, and at least one of the independent variables has a non-zero coefficient.

Overall, this ANOVA test evaluates whether the regression model provides a statistically significant improvement over the model with no independent variables in explaining the response/dependent variable.

anova_table <- anova(model)
anova_table
#> Analysis of Variance Table
#> 
#> Response: pm25
#>               Df  Sum Sq Mean Sq  F value    Pr(>F)    
#> temperature    1  9014.4  9014.4  89.0853 8.908e-13 ***
#> humidity       1 12739.7 12739.7 125.9013 2.200e-15 ***
#> wind           1   622.6   622.6   6.1533   0.01646 *  
#> precipitation  1    21.8    21.8   0.2156   0.64440    
#> Residuals     51  5160.6   101.2                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The overall F-test pools the four predictor rows into a single regression term
n_pred <- nrow(anova_table) - 1                      # the last row holds the residuals
df1 <- sum(anova_table$Df[1:n_pred])                 # regression degrees of freedom
df2 <- anova_table$Df[n_pred + 1]                    # residual degrees of freedom
df1
#> [1] 4
df2
#> [1] 51

MSR <- sum(anova_table[["Sum Sq"]][1:n_pred]) / df1  # regression mean square
MSE <- anova_table[["Mean Sq"]][n_pred + 1]          # residual mean square
F_stat <- MSR / MSE
F_stat
# From the table above, MSR = 22398.5 / 4 and MSE = 101.2, so F_stat agrees with
# the F-statistic of 55.34 reported by summary(model)

Under the null hypothesis, the test statistic F = MSR / MSE follows an F-distribution with 4 and 51 degrees of freedom (DF regression = 4, DF residuals = 51).
* Compute the p-value

p_val <- pf(F_stat, df1, df2, lower.tail = FALSE)
p_val
# A value below 2.2e-16, as reported by summary(model)

The p-value (below 2.2e-16) is extremely small and far below 0.05, so the null hypothesis is rejected.

Statistical Conclusion: The overall multiple regression model is statistically significant, as evidenced by the F statistic of 55.34 (with 4 and 51 degrees of freedom) and a very small p-value (below 2.2e-16). This indicates that at least one of the predictor variables (temperature, humidity, wind, and precipitation) has a significant linear relationship with the response variable (pm25).

Contextual Conclusion: The regression model, which includes temperature, humidity, wind, and precipitation as predictors, significantly explains the variability in the pm25 measurements. The predictors collectively have a significant impact on pm25 levels, and they can be considered important factors in predicting and understanding pm25 pollution.
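
The overall F statistic can also be cross-checked from R-squared, since F = (R2 / k) / ((1 - R2) / (n - k - 1)) for a model with k predictors and n observations. The sketch below plugs in the values printed by summary(model); the variable names are illustrative, and small differences from the reported 55.34 come from the rounding of R-squared.

```r
r2       <- 0.8127   # multiple R-squared from summary(model)
k        <- 4        # number of predictors
df_resid <- 51       # residual degrees of freedom, n - k - 1

# Overall F statistic expressed through R-squared
f_from_r2 <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability under F(k, n - k - 1)
p_from_r2 <- pf(f_from_r2, k, df_resid, lower.tail = FALSE)

round(f_from_r2, 1)
p_from_r2 < 2.2e-16
```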

d. [10 marks] Validate the full model and comment on whether the full regression model is appropriate to explain the PM2.5 concentration at various test locations.

pm25 = 102.72259 + 1.62142*temperature - 1.27742*humidity - 0.58016*wind - 0.01091*precipitation + \(\epsilon\)

Apply the fitted model to the test set and assess its predictive performance. Calculate relevant metrics such as mean squared error (MSE), root mean squared error (RMSE), or R-squared to evaluate how well the model predicts the PM2.5 concentration at the test locations.

Consider alternative models or subsets of predictors to compare their performance against the full model. This could involve removing or adding predictors based on domain knowledge or statistical significance to assess whether a simpler model provides comparable predictive accuracy.

Based on the evaluation metrics and a comparison of alternative models, draw conclusions about the appropriateness of the full regression model for explaining the PM2.5 concentration at various test locations. Consider factors such as the model’s predictive accuracy, interpretability, and practical usefulness. It is important to note that without the specific data and the results of the validation process, it is not possible to provide a definitive comment on the appropriateness of the full regression model.
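
The hold-out evaluation described above could look roughly like the following. This is only a sketch on simulated data: the data frame `sim`, its value ranges and the coefficients used to generate it are made up for illustration and are not the assignment data, and the 14-row test split is an arbitrary choice.

```r
# Simulated stand-in for the pm25 data (hypothetical ranges and coefficients)
set.seed(1)
n <- 56
sim <- data.frame(
  temperature   = runif(n, 10, 35),
  humidity      = runif(n, 30, 90),
  wind          = runif(n, 0, 20),
  precipitation = runif(n, 0, 100)
)
sim$pm25 <- 100 + 1.6 * sim$temperature - 1.3 * sim$humidity -
  0.6 * sim$wind + rnorm(n, sd = 10)

# Hold out 14 rows as a test set and fit the full model on the rest
test_idx <- sample(n, size = 14)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]
fit   <- lm(pm25 ~ temperature + humidity + wind + precipitation, data = train)

# Predictive metrics on the held-out rows
pred    <- predict(fit, newdata = test)
mse     <- mean((test$pm25 - pred)^2)
rmse    <- sqrt(mse)
r2_test <- 1 - sum((test$pm25 - pred)^2) / sum((test$pm25 - mean(test$pm25))^2)
c(MSE = mse, RMSE = rmse, R2 = r2_test)
```

Residual diagnostics such as `plot(fit, which = 1:2)` (residuals against fitted values and a normal Q-Q plot) complement these metrics by checking the linearity, constant variance and normality assumptions.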

e. [2 marks] Find the R2 error and comment on what it means in the context of this dataset.

The R2 value of 0.8127 indicates that 81.27% of the variability in the dependent variable can be explained by the independent variables in the linear regression model.

In context, the R2 value suggests that temperature, humidity, wind and precipitation collectively have a strong influence on the variation observed in pm25, explaining a large portion of the fluctuations in the response variable.

However, R2 does not by itself establish the validity or reliability of the model; it only measures the proportion of total variation in the response variable explained by the independent variables.
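
R2 and adjusted R2 can be reproduced from the sums of squares in the ANOVA table of part c, which shows where the 81.27% comes from. The sketch below copies those printed values by hand; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table in part c
ss_pred  <- c(temperature = 9014.4, humidity = 12739.7,
              wind = 622.6, precipitation = 21.8)
ss_resid <- 5160.6
df_resid <- 51

ss_total <- sum(ss_pred) + ss_resid
n <- df_resid + length(ss_pred) + 1   # 56 observations

# R2 is the share of total variation not left in the residuals;
# adjusted R2 compares mean squares, penalising extra predictors
r2     <- 1 - ss_resid / ss_total
adj_r2 <- 1 - (ss_resid / df_resid) / (ss_total / (n - 1))
round(c(R2 = r2, adjusted_R2 = adj_r2), 4)
```

Up to rounding, these reproduce the 0.8127 and 0.7981 reported by summary(model).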

f. [3 marks] Using model selection procedures discussed in the course, find the best multiple regression model that explains the data. State the final fitted regression model.

g. [3 marks] Comment on the R2 and adjusted R2 in the full and final model you chose in part f. In particular explain why those goodness of fitness measures change but not in the same way.

Question 2. [25 Marks]

A business wants to advertise their product in Film media by using product placement in a movie. To maximise the brand recognition from the placement, the business conducted a study recording the correct number of brands identified by individuals in an experiment that watched different types of movies. Each movie in this experiment featured six different brands.

library(knitr)


df <- data.frame(
  Variable = c("Gender", "Genre", "Score"),
  Description = c("Gender of the individual watching the movie",
                  "Genre of the movie being watched",
                  "The number of correct brands recalled by the individuals after the movie")
)

kable(df)
| Variable | Description |
|:---------|:------------|
| Gender | Gender of the individual watching the movie |
| Genre | Genre of the movie being watched |
| Score | The number of correct brands recalled by the individuals after the movie |

a. [2 marks] For this study, is the design balanced or unbalanced? Explain why.

movie <- read.table("data/movie.csv",header=TRUE,sep=',')
table(movie[,1:2])
#>       Genre
#> Gender Action Comedy Drama
#>      F     39     33    22
#>      M     14     10    19

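Unbalanced. The table shows a different number of individuals in each Gender and Genre combination (from 10 males watching comedy to 39 females watching action), and there are far more females (94) than males (43). A balanced design would require the same number of observations in every cell.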
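
The balance check can be made explicit in code. The sketch below copies the cell counts from the table above into a matrix (the names `counts` and `is_balanced` are introduced here for illustration) and tests whether every cell has the same count.

```r
# Cell counts copied from table(movie[, 1:2])
counts <- matrix(
  c(39, 33, 22,
    14, 10, 19),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("F", "M"),
                  Genre  = c("Action", "Comedy", "Drama"))
)

# A design is balanced only if every Gender x Genre cell has the same count
is_balanced <- length(unique(as.vector(counts))) == 1
is_balanced
range(counts)      # smallest and largest cell counts
rowSums(counts)    # total individuals per gender
```
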
b. [8 marks] Construct two different preliminary graphs that investigate different features of the data and comment.

boxplot(Score ~ Gender + Genre, data = movie, main = "Box Plot of Score by Gender and Genre", xlab = "Gender and Genre", ylab = "Score")

par(mfrow=c(1,2))
boxplot(Score ~ Gender, data = movie)
boxplot(Score ~ Genre, data = movie)

The first boxplot, Score by Gender and Genre, shows the number of advertised brands successfully recalled by each gender in action, comedy and drama movies. It suggests that males recalled more brands than females in every genre. The boxplot pair immediately above then suggests that drama was the best genre for brand recall, followed by comedy and finally action.

The preliminary boxplot pair above represents the distribution of brand recall scores based on gender and genre separately. The boxplot on the left, “Score by Gender,” compares the brand recall scores between males and females. It shows that the median brand recall score for males is higher than that for females, as indicated by the position of the median line within the boxes. Additionally, the box for males appears to be slightly larger, indicating a larger interquartile range and potentially more variability in brand recall scores compared to females.

The boxplot on the right, “Score by Genre,” compares the brand recall scores across different genres. It demonstrates that dramas have the highest median brand recall score, followed by comedies, and then action movies. The boxes for dramas and comedies appear to be similar in size, while the box for action movies is comparatively smaller, suggesting a narrower range of brand recall scores.

Overall, these preliminary boxplots provide an initial visual understanding of the distribution and variation of brand recall scores within each gender and genre category. They suggest that gender and genre might have an influence on brand recall, with males generally having higher scores than females and dramas having the highest scores among the three genres. However, further statistical analysis is needed to confirm these observations and draw more precise conclusions.

c. [4 marks] Write down the full mathematical model for this situation, defining all appropriate parameters.

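Score_ijk = μ + α_i + β_j + γ_ij + ε_ijk

where

* Score_ijk is the number of brands correctly recalled by the k-th individual of gender i watching genre j,
* μ is the overall mean score,
* α_i is the main effect of Gender (i = F, M),
* β_j is the main effect of Genre (j = Action, Comedy, Drama),
* γ_ij is the interaction effect between Gender i and Genre j, and
* ε_ijk is the random error, assumed independent and normally distributed with mean 0 and constant variance σ².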
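
A model with both factors and their interaction can be fitted in R with `lm(Score ~ Gender * Genre)`. The sketch below only illustrates the mechanics on simulated data: the cell sizes follow the table in part a, but the Score values are randomly generated, so the resulting ANOVA table says nothing about the real study. The data frame name `sim` is hypothetical.

```r
# Simulated stand-in for data/movie.csv: real cell sizes, random scores
set.seed(42)
cells <- expand.grid(Gender = c("F", "M"),
                     Genre  = c("Action", "Comedy", "Drama"))
cells$n <- c(39, 14, 33, 10, 22, 19)   # F/M counts for Action, Comedy, Drama

# One row per individual, then a random score between 0 and 6 brands
sim <- cells[rep(seq_len(nrow(cells)), cells$n), c("Gender", "Genre")]
sim$Score <- rbinom(nrow(sim), size = 6, prob = 0.5)

# Two-way ANOVA with interaction: Gender, Genre and Gender:Genre terms
fit <- lm(Score ~ Gender * Genre, data = sim)
anova(fit)
```

Because the design is unbalanced, the sequential sums of squares reported by `anova()` depend on the order in which Gender and Genre enter the formula, so the interaction row is usually examined first.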

d. [9 marks] Analyse the data to study the effect of Gender and Genre on the brand recall Score. These conclusions are only required to be at the qualitative level and can be based off the outcomes of the hypothesis tests you conducted in this part and the preliminary plots in part b. You do not need to statistically examine the multiple comparisons between contrasts and interactions. Remember to
* state the null and alternative hypothesis for each test, and
* check assumptions.

Gender
Null Hypothesis: H0: There is no difference in mean brand recall Score between genders.
Alternative Hypothesis: H1: There is a difference in mean brand recall Score between genders.
Assumptions: Observations are independent, i.e. individuals’ brand recall Scores do not depend on each other.
Sufficient sample size: The sample size should be adequate for reliable analysis; a data set balanced across gender within each genre would give the most reliable comparison.

Conclusion: Referring to the boxplots, males appear to have a notably higher brand recall Score than females.

Genre
Null Hypothesis: H0: There is no difference in mean brand recall Score between genres.
Alternative Hypothesis: H1: There is a difference in mean brand recall Score between genres.
Assumptions: Observations are independent, i.e. individuals’ brand recall Scores do not depend on each other.
Sufficient sample size: The data set is not balanced across the genres, and a larger sample would give a more reliable comparison.

Conclusion: Referring to the boxplots, brand recall appears to be best in dramas, followed by comedies and, finally, action movies.

Analysis: As a check on the design, a chi-square test of independence can be conducted to determine whether there is an association between the two categorical variables, Gender and Genre. The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population; they are independent.

contingency_tab <- table(movie$Gender, movie$Genre)
chi_square_result <- chisq.test(contingency_tab)
print(chi_square_result)
# With the cell counts tabulated in part a, this gives X-squared of about 6.19
# on 2 degrees of freedom (p-value about 0.045)

The chi-square test gives only weak evidence (a p-value just under 0.05) against the null hypothesis of independence, so the mix of genders is not quite the same across genres, in line with the unbalanced design found in part a. This test concerns only how individuals are distributed across Gender and Genre; it does not test the effect of either factor on Score. Testing the hypotheses above formally would require a two-way ANOVA of Score on Gender, Genre and their interaction, together with residual checks for normality and constant variance.
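
As a self-contained check, a chi-square test of independence between Gender and Genre can be run directly on the cell counts from part a. The counts below are copied from that table, and the names `counts` and `gender_genre_test` are introduced here for illustration. This only addresses whether the gender mix differs across genres in the sample; it says nothing about Score.

```r
# Gender x Genre cell counts copied from the table in part a
counts <- matrix(
  c(39, 33, 22,
    14, 10, 19),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("F", "M"),
                  Genre  = c("Action", "Comedy", "Drama"))
)

# Pearson chi-square test of independence on the 2 x 3 table
# (the continuity correction is only applied to 2 x 2 tables)
gender_genre_test <- chisq.test(counts)
gender_genre_test$statistic   # X-squared
gender_genre_test$parameter   # degrees of freedom
gender_genre_test$p.value
```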

In conclusion, based on the preliminary analysis, it can be inferred that gender and genre may have an effect on brand recall scores. However, further analysis with a larger and more balanced dataset would be necessary to validate these conclusions and provide more robust insights into the relationship between gender, genre, and brand recall.

e. [2 marks] Based on your results from part d), discuss the practical implications of your findings for the business that aims to maximise the brand recognition from the placement. What advice/interpretation would you provide on the effect of the drama genre on the brand recall Score.

Drama was the most effective genre for brand recall in the preliminary boxplots. Therefore, incorporating product placements within dramas appears to be an effective strategy for enhancing brand recall. Businesses should consider investing more in dramas as a platform for showcasing their products or services. By strategically placing their brands in dramas, they can increase the likelihood of viewers remembering and recognising their brand.