Question 1

a. Let’s explore to see if any sticker bundles seem intuitively similar:

ii. Find a single sticker bundle that is both in our limited data set and also in the app’s Sticker Store (e.g., “sweetmothersday”). Then, use your intuition to recommend (guess!) five other bundles in our dataset that might have similar usage patterns as this bundle.

Ans: I choose “sweetmothersday”
Recommend: “Mother’s Day Flowers”, “Mother’s Day Message”, “Happy Mother’s Day”, “To All Mothers”, and “Mom’s Special Day”

b. Let’s find similar bundles using geometric models of similarity:

i. Let’s create cosine similarity based recommendations for all bundles:

1. Create a matrix or data.frame of the top 5 recommendations for all bundles

library(data.table)
## Warning: package 'data.table' was built under R version 4.2.2
ac_bundles_dt <- fread("piccollage_accounts_bundles.csv")
ac_bundles_matrix <- as.matrix(ac_bundles_dt[, -1, with=FALSE])

library(lsa)
## Warning: package 'lsa' was built under R version 4.2.3
## Loading required package: SnowballC
ac_bundles_matrix_cosmatrix <- as.data.frame(cosine(ac_bundles_matrix))
cosmatrix_rownames <- row.names(ac_bundles_matrix_cosmatrix)
top5_ac_bundles_cosmatrix <- as.data.frame(
  sapply(ac_bundles_matrix_cosmatrix,
         FUN = function(x)
           cosmatrix_rownames[order(x, decreasing = TRUE)[2:6]]))

2. Create a new function that automates the above functionality: it should take an accounts-bundles matrix as a parameter, and return a data object with the top 5 recommendations for each bundle in our data set, using cosine similarity.

bundle_recommendation <- function(accounts_bundles_matrix) {
  ac_bundles_matrix <- accounts_bundles_matrix
  ac_bundles_matrix_cosmatrix <- as.data.frame(cosine(ac_bundles_matrix)) 
  
  cosmatrix_rownames <- row.names(ac_bundles_matrix_cosmatrix)  

  top5_bundle <- as.data.frame(
    sapply(ac_bundles_matrix_cosmatrix,
           FUN = function(x)
             cosmatrix_rownames[order(x, decreasing = TRUE)[2:6]])
  ) 
  return(top5_bundle)
}
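To check the logic of the function on something small enough to inspect by hand, here is a minimal sketch in base R. The toy matrix and the helper names cosine_cols() and top_n_similar() are made up for illustration and are not part of the assignment data or the lsa package. Unlike the function above, which drops the first-ranked entry, this sketch sets the diagonal to -Inf so a bundle can never be recommended to itself, even when ties occur.

```r
# Toy accounts-by-bundles matrix (hypothetical data, not from the PicCollage file).
toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 1,
    0, 0, 1, 1,
    0, 1, 1, 1,
    1, 0, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c", "d"))
)

# Cosine similarity between every pair of columns, using base R only.
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Top-n most similar bundles for each bundle, excluding the bundle itself.
top_n_similar <- function(m, n = 2) {
  sims <- cosine_cols(m)
  diag(sims) <- -Inf   # a bundle is never its own recommendation
  apply(sims, 2, function(s) names(sort(s, decreasing = TRUE))[seq_len(n)])
}

toy_sims <- cosine_cols(toy)
toy_top2 <- top_n_similar(toy, n = 2)
```

Each column of toy_top2 holds the recommendations for the bundle named in the column header, which mirrors the layout returned by bundle_recommendation().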

3. What are the top 5 recommendations for the bundle you chose to explore earlier?

recommendation <- bundle_recommendation(ac_bundles_matrix)
recommendation$sweetmothersday
## [1] "mmlm"             "julyfourth"       "tropicalparadise" "bestdaddy"       
## [5] "justmytype"

Ans: mmlm, julyfourth, tropicalparadise, bestdaddy, justmytype

ii. Let’s create correlation based recommendations.

1. Reuse the function you created above (don’t change it; don’t use the cor() function)

bundle_recommendation <- function(accounts_bundles_matrix) {
  ac_bundles_matrix <- accounts_bundles_matrix
  ac_bundles_matrix_cosmatrix <- as.data.frame(cosine(ac_bundles_matrix)) 
  
  cosmatrix_rownames <- row.names(ac_bundles_matrix_cosmatrix)  

  top5_bundle <- as.data.frame(
    sapply(ac_bundles_matrix_cosmatrix,
           FUN = function(x)
             cosmatrix_rownames[order(x, decreasing = TRUE)[2:6]])
  ) 
  return(top5_bundle)
}

2. But this time give the function an accounts-bundles matrix where each bundle (column) has already been mean-centered in advance.

bundle_means <- apply(ac_bundles_matrix, 2, mean)
bundle_means_matrix <- t(replicate(nrow(ac_bundles_matrix), bundle_means))
ac_bundles_mc_b <- ac_bundles_matrix - bundle_means_matrix
recommendation_2 <- bundle_recommendation(ac_bundles_mc_b)

3. Now what are the top 5 recommendations for the bundle you chose to explore earlier?

recommendation_2$sweetmothersday
## [1] "mmlm"       "julyfourth" "bestdaddy"  "justmytype" "gudetama"

Ans: mmlm, julyfourth, bestdaddy, justmytype, gudetama
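Why this counts as correlation-based: the cosine of two mean-centered columns is, term by term, the Pearson correlation of the original columns. The sketch below checks this on a small made-up matrix. The data and the helper cosine_cols() are hypothetical, and cor() is used here only as a check, not to build the recommendations.

```r
# Small made-up accounts-by-bundles matrix (hypothetical values).
toy <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(6, 5, 4, 3, 2, 1),
  d = c(1, 0, 1, 0, 1, 3)
)

# Cosine similarity between every pair of columns, using base R only.
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Subtract each bundle's (column's) mean from its own column.
col_centered <- sweep(toy, 2, colMeans(toy))

cos_centered <- cosine_cols(col_centered)
pearson <- cor(toy)

# Largest absolute difference between the two similarity matrices.
max_diff <- max(abs(cos_centered - pearson))
```

Here sweep(toy, 2, colMeans(toy)) does the same job as building a matrix of means with replicate() and subtracting it.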

iii. Let’s create adjusted-cosine based recommendations

1. Reuse the function you created above (you should not have to change it)

bundle_recommendation <- function(accounts_bundles_matrix) {
  ac_bundles_matrix <- accounts_bundles_matrix
  ac_bundles_matrix_cosmatrix <- as.data.frame(cosine(ac_bundles_matrix)) 
  
  cosmatrix_rownames <- row.names(ac_bundles_matrix_cosmatrix)  

  top5_bundle <- as.data.frame(
    sapply(ac_bundles_matrix_cosmatrix,
           FUN = function(x)
             cosmatrix_rownames[order(x, decreasing = TRUE)[2:6]])
  ) 
  return(top5_bundle)
}

2. But this time give the function an accounts-bundles matrix where each account (row) has already been mean-centered in advance.

account_means <- apply(ac_bundles_matrix, 1, mean)
account_means_matrix <- replicate(ncol(ac_bundles_matrix), account_means)
ac_account_mc_b <- ac_bundles_matrix - account_means_matrix
recommendation_3 <- bundle_recommendation(ac_account_mc_b)

3. What are the top 5 recommendations for the bundle you chose to explore earlier?

recommendation_3$sweetmothersday
## [1] "justmytype" "julyfourth" "gudetama"   "mmlm"       "bestdaddy"

Ans: justmytype, julyfourth, gudetama, mmlm, bestdaddy
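The only difference from the correlation case is the direction of centering: adjusted cosine subtracts each account's (row's) mean before the column-wise cosine is taken. A minimal sketch on made-up data follows. The matrix and the helper cosine_cols() are assumptions for illustration, and sweep() is shown as an alternative to the replicate() approach used above.

```r
# Small made-up accounts-by-bundles matrix (hypothetical values).
toy <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(6, 5, 4, 3, 2, 1),
  d = c(1, 0, 1, 0, 1, 3)
)

# Cosine similarity between every pair of columns, using base R only.
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Row-centering with sweep(): subtract each account's mean from its row.
row_centered <- sweep(toy, 1, rowMeans(toy))

# The same centering written with replicate(), as in the answer above.
row_means_matrix <- replicate(ncol(toy), rowMeans(toy))
row_centered_alt <- toy - row_means_matrix

# Adjusted cosine similarity between bundles.
adjusted_cosine <- cosine_cols(row_centered)
```

After row-centering, every account's values average to zero, so heavy and light users contribute on a comparable scale.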

Question 2

a. Scenario A: Create a horizontal set of random points, with a relatively narrow but flat distribution.

i. What raw slope of x and y would you generally expect?

Ans: The slope would be 0, as the overall trend of x and y appears to be horizontal.

ii. What is the correlation of x and y that you would generally expect?

Ans: The correlation would be close to 0, because y stays within the same narrow range regardless of the value of x.

b. Scenario B: Create a random set of points to fill the entire plotting area, along both x-axis and y-axis

i. What raw slope of the x and y would you generally expect?

Ans: The slope would be approximately 0. The points are spread evenly across the plotting area around a central mean, so the best-fit line is roughly horizontal and y shows no systematic trend as x changes.

ii. What is the correlation of x and y that you would generally expect?

Ans: The correlation would be close to 0. Because the points are scattered evenly across the whole plotting area, knowing x tells us nothing about y.
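These expectations for Scenarios A and B can be checked with simulated points instead of the interactive plot. The sketch below is only an approximation of the scenarios: the seed, the sample size, and the axis ranges are arbitrary assumptions.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible
n <- 5000

# Scenario A: wide in x, narrow flat band in y.
x_a <- runif(n, min = 0, max = 50)
y_a <- runif(n, min = 24, max = 26)

# Scenario B: points fill the whole plotting area.
x_b <- runif(n, min = 0, max = 50)
y_b <- runif(n, min = 0, max = 50)

# Raw regression slopes and correlations for both scenarios.
slope_a <- coef(lm(y_a ~ x_a))[["x_a"]]
slope_b <- coef(lm(y_b ~ x_b))[["x_b"]]
cor_a <- cor(x_a, y_a)
cor_b <- cor(x_b, y_b)
```

With this many points, both slopes and both correlations should land very close to 0. With only a handful of hand-placed points they can wander further from 0 by chance.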

c. Scenario C: Create a diagonal set of random points trending upward at 45 degrees

i. What raw slope of the x and y would you generally expect? (note that x, y have the same scale)

Ans: When x and y have the same scale and exhibit a positive 45-degree relationship, the raw slope of x and y is generally expected to be 1.

ii. What is the correlation of x and y that you would generally expect?

Ans: If x and y exhibit a strong positive correlation, where y increases as x increases, the expected correlation coefficient between them would fall between 0.8 and 1.

d. Scenario D: Create a diagonal set of random points trending downward at 45 degrees

i. What raw slope of the x and y would you generally expect? (note that x, y have the same scale)

Ans: When x and y exhibit a negative 45-degree relationship, the expected raw slope of x and y would be -1.

ii. What is the correlation of x and y that you would generally expect?

Ans: If x and y exhibit a strong negative correlation, the expected correlation coefficient between them would fall between -0.8 and -1.
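The two diagonal scenarios can be simulated the same way. In this sketch the seed, sample size, and noise level (standard deviation of 3) are arbitrary assumptions. The slope is fixed by the 45-degree trend, while the size of the correlation depends on how tightly the points hug the line.

```r
set.seed(2)   # arbitrary seed so the sketch is reproducible
n <- 5000
x <- runif(n, min = 0, max = 50)

# Upward 45-degree trend with some vertical scatter.
y_up <- x + rnorm(n, mean = 0, sd = 3)

# Downward 45-degree trend with the same amount of scatter.
y_down <- 50 - x + rnorm(n, mean = 0, sd = 3)

slope_up <- coef(lm(y_up ~ x))[["x"]]
slope_down <- coef(lm(y_down ~ x))[["x"]]
cor_up <- cor(x, y_up)
cor_down <- cor(x, y_down)
```

Increasing sd in the rnorm() calls leaves the slopes near 1 and -1 but pulls both correlations toward 0.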

e. Apart from any of the above scenarios, find another pattern of data points with no correlation (r ≈ 0). (Can you create a pattern that visually suggests a strong relationship but produces r ≈ 0?)

knitr::include_graphics("e.png")
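One pattern of this kind, not necessarily the one drawn in e.png, is a symmetric parabola: y is completely determined by x, yet the linear correlation is 0 because the rising and falling halves cancel out. A minimal sketch with made-up points:

```r
# Symmetric U-shape centered on x = 0 (hypothetical points).
x_u <- seq(-10, 10, by = 0.5)
y_u <- x_u^2

# Pearson correlation only measures linear association.
cor_u <- cor(x_u, y_u)
```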

f. Apart from any of the above scenarios, find another pattern of data points with perfect correlation (r ≈ 1). (Can you find a scenario where the pattern visually suggests a different relationship?)

knitr::include_graphics("f.png")
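One possibility, not necessarily the one drawn in f.png, is a line that rises so slowly that it looks horizontal on a plot with equal axis scales. It can easily be mistaken for the "no relationship" case of Scenario A, but because the points fall exactly on a line, the correlation is 1. A minimal sketch with made-up points:

```r
# Nearly flat line: y changes by only 0.001 per unit of x (hypothetical points).
x_flat <- 1:20
y_flat <- 5 + 0.001 * x_flat

# Correlation ignores how steep the line is; only the fit to a line matters.
cor_flat <- cor(x_flat, y_flat)
slope_flat <- coef(lm(y_flat ~ x_flat))[["x_flat"]]
```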

g. Let’s see how correlation relates to simple regression, by simulating any linear relationship you wish:

i. Run the simulation and record the points you create: pts <- interactive_regression() (simulate either a positive or negative relationship)

library(compstatslib)

ii. Use the lm() function to estimate the regression intercept and slope of pts to ensure they are the same as the values reported in the simulation plot: summary(lm(pts$y ~ pts$x))

# Points recorded from the interactive_regression() simulation
x <- c(2.1280524, -0.9675402, -4.0631328, 17.0163788, 8.4666468, 15.6896962,
       25.7135200, 30.5780226, 35.8847528, 39.4225730, 41.7811197, 28.3668851,
       20.7016081, 10.6777844, 43.9922573, 5.3710542, 19.2275164)
y <- c(11.816031, 2.315186, -3.779696, 12.533076, 7.513761, 21.496137,
       30.279937, 38.525954, 45.875664, 46.771970, 47.489015, 33.148117,
       25.081362, 14.146427, 48.923105, 7.155239, 20.779092)
pts <- data.frame(x = x, y = y)
summary(lm( pts$y ~ pts$x ))
## 
## Call:
## lm(formula = pts$y ~ pts$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2924 -1.3104 -0.1566  1.3636  7.4435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.02083    1.43208   1.411    0.179    
## pts$x        1.10509    0.05764  19.173 5.81e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.504 on 15 degrees of freedom
## Multiple R-squared:  0.9608, Adjusted R-squared:  0.9582 
## F-statistic: 367.6 on 1 and 15 DF,  p-value: 5.815e-12

iii. Estimate the correlation of x and y to see it is the same as reported in the plot: cor(pts)

cor(pts)
##           x         y
## x 1.0000000 0.9802006
## y 0.9802006 1.0000000

Ans: Yes, the correlation value of x and y is the same as shown in the graph.

iv. Now, standardize the values of both x and y from pts and re-estimate the regression slope

std <- apply(pts, 2, function(a)(a - mean(a)) / sd(a))
std_dataf <- as.data.frame(std)
summary(lm( std_dataf$y ~ std_dataf$x))
## 
## Call:
## lm(formula = std_dataf$y ~ std_dataf$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48402 -0.07649 -0.00914  0.07959  0.43447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.078e-17  4.960e-02    0.00        1    
## std_dataf$x  9.802e-01  5.113e-02   19.17 5.81e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2045 on 15 degrees of freedom
## Multiple R-squared:  0.9608, Adjusted R-squared:  0.9582 
## F-statistic: 367.6 on 1 and 15 DF,  p-value: 5.815e-12

v. What is the relationship between correlation and the standardized simple-regression estimates?

cor(std_dataf)
##           x         y
## x 1.0000000 0.9802006
## y 0.9802006 1.0000000

Ans: When running regression on standardized data, the slope of the regression line is equal to the correlation coefficient between the two variables.
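The reason is that the raw slope of a simple regression equals r * sd(y) / sd(x). Standardizing makes both standard deviations 1, so the slope reduces to r. The sketch below checks both identities on the points recorded above. The variable names zx and zy are introduced here only for illustration.

```r
# Points recorded from the simulation (copied from part ii).
x <- c(2.1280524, -0.9675402, -4.0631328, 17.0163788, 8.4666468, 15.6896962,
       25.7135200, 30.5780226, 35.8847528, 39.4225730, 41.7811197, 28.3668851,
       20.7016081, 10.6777844, 43.9922573, 5.3710542, 19.2275164)
y <- c(11.816031, 2.315186, -3.779696, 12.533076, 7.513761, 21.496137,
       30.279937, 38.525954, 45.875664, 46.771970, 47.489015, 33.148117,
       25.081362, 14.146427, 48.923105, 7.155239, 20.779092)

r <- cor(x, y)

# Raw slope, and the same slope rebuilt from the correlation.
raw_slope <- coef(lm(y ~ x))[["x"]]
slope_from_r <- r * sd(y) / sd(x)

# Standardized variables: mean 0 and standard deviation 1.
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
std_slope <- coef(lm(zy ~ zx))[["zx"]]
```

Both raw_slope - slope_from_r and std_slope - r should be zero up to floating-point error.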