Description

This recommender system collects a rating of 10 colors by a sample of 10 people and recommend the highest rated colors. The dataset was created as a matrix of 10 colors columns and 10 people in rows. The values were selected at random ranging from 1-5 , with 1 as the lowest rating and 5 as the highest ratings. The cells with 0 values where replaced with NA. A set.seed(123) was used to retain reproducible values.

set.seed(123)
data.matrix <- matrix(nrow=10, ncol=10)
colnames(data.matrix)<-c(paste("color", 1:10, sep=""))
rownames(data.matrix)<-c(paste("person", 1:10, sep=""))

for (i in 1:10)
    {
    
    color.values<-rpois(5, lambda=sample(x=2.5,size=1))
    data.matrix[i,] <- color.values
}

data.matrix[data.matrix == 0] <- NA
data.matrix

##          color1 color2 color3 color4 color5 color6 color7 color8 color9 color10
## person1       2      1      2      3     NA      2      1      2      3      NA
## person2       4      2      2      5      2      4      2      2      5       2
## person3       2     NA      4      1     NA      2     NA      4      1      NA
## person4       3      2      1      1      4      3      2      1      1       4
## person5       1      1      1     NA     NA      1      1      1     NA      NA
## person6       2      1      2     NA      1      2      1      2     NA       1
## person7      NA     NA     NA     NA      1     NA     NA     NA     NA       1
## person8       1      1      1      1      2      1      1      1      1       2
## person9       2     NA      1      2     NA      2     NA      1      2      NA
## person10     NA     NA      2      2      1     NA     NA      2      2       1

Partition Data

The 100 cell matrix was split into a trainset with 70% of the data and a testing dataset with 30% of the data.

#Partition Data
set.seed(123)
trainidx<-sample(nrow(data.matrix),round(0.7*nrow(data.matrix)),replace=F)
traindata<-data.matrix[trainidx,]
testdata<-data.matrix[-trainidx,]

Raw Average(Mean)

The raw average (mean) value for this dataset is: 1.9259259

RMSE for raw averages

We calculate the RMSE by taking the square root of the of mean of datasets minus the mean squared. Na.rm is used to remove na values.

RMSE for the training dataset is: 1.0156525

RMSE for the testing dataset is: 1.0855075

Calculate Bias

We calculate bias by subtracting dataset mean from the mean of each item. We use rowMeans and colMeans functions to derive the mean of each item.

rdata<-rowMeans(traindata, na.rm=TRUE)-mdata
cdata<-colMeans(testdata, na.rm=TRUE)-mdata

The bias of each person is:

##     person3    person10     person2     person8     person6     person9 
##  0.40740741 -0.25925926  1.07407407 -0.72592593 -0.42592593 -0.25925926 
##     person1 
##  0.07407407

The bias of each color is:

##      color1      color2      color3      color4      color5      color6 
##  0.07407407 -0.42592593 -0.92592593 -0.92592593  0.57407407  0.07407407 
##      color7      color8      color9     color10 
## -0.42592593 -0.92592593 -0.92592593  0.57407407

Baseline Predictor

For the baseline predictor, we add the bias of each person plus bias of each color plus the original mean. We rounded each baseline value to 2 decimal places. The results is a matrix of all predicted baseline values. We then caculate the mean of all values in the matrix.

rbdata = NULL

for (j in 1:length(cdata))
  {
    rbdata<-cbind(rbdata, rdata[1:length(rdata)]%+%cdata[j]+mdata)
  }
dfrbdata <- as.matrix(round(rbdata,2))

colnames(dfrbdata)<-colnames(traindata)
rownames(dfrbdata)<-rownames(traindata)


rbmean<-mean(rbdata)

The baseline predictions are:

##          color1 color2 color3 color4 color5 color6 color7 color8 color9 color10
## person3    2.41   1.91   1.41   1.41   2.91   2.41   1.91   1.41   1.41    2.91
## person10   1.74   1.24   0.74   0.74   2.24   1.74   1.24   0.74   0.74    2.24
## person2    3.07   2.57   2.07   2.07   3.57   3.07   2.57   2.07   2.07    3.57
## person8    1.27   0.77   0.27   0.27   1.77   1.27   0.77   0.27   0.27    1.77
## person6    1.57   1.07   0.57   0.57   2.07   1.57   1.07   0.57   0.57    2.07
## person9    1.74   1.24   0.74   0.74   2.24   1.74   1.24   0.74   0.74    2.24
## person1    2.07   1.57   1.07   1.07   2.57   2.07   1.57   1.07   1.07    2.57

The mean of all baseline predictions is: 1.5835979

RMSE baseline predictors

Once again, for RMSE we take the square root of the of mean of baseline predictors minus the mean squared. Na.rm is used to remove na values.

RMSE for the training baseline predictors is: 1.0717922

RMSE for the testing baseline predictors is: 1.0573606

Summarize your results

trresults<-round((1-rbtrmse/trrmse)*100,2)
teresults<-round((1-rbtermse/termse)*100,2)

We divide the RMSE of the orginal training and testing data set by the RMSE of the baseline predictor training and test dataset. The end result is a slight improvement and a slight degradation of predictions. The percent improvement on the training dataset is a negative value: -5.53%. However, there is a slight percent improvement for the testing dataset: 2.59%

In conclusion, this recommender system is not a useful system to predict or recommend values. The training and testing results are both low and only the test value has a postive percent value of improvement.

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(reshape)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(pROC)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)


set.seed(123)
data.matrix <- matrix(nrow=10, ncol=10)
colnames(data.matrix)<-c(paste("color", 1:10, sep=""))
rownames(data.matrix)<-c(paste("person", 1:10, sep=""))

for (i in 1:10)
    {
    
    color.values<-rpois(5, lambda=sample(x=2.5,size=1))
    data.matrix[i,] <- color.values
}

data.matrix[data.matrix == 0] <- NA
data.matrix
#Partition Data
set.seed(123)
trainidx<-sample(nrow(data.matrix),round(0.7*nrow(data.matrix)),replace=F)
traindata<-data.matrix[trainidx,]
testdata<-data.matrix[-trainidx,]
mdata<-mean(traindata, na.rm=TRUE)

trrmse<-sqrt(mean((traindata-mdata)^2, na.rm=TRUE))
termse<-sqrt(mean((testdata-mdata)^2, na.rm=TRUE))
     

rdata<-rowMeans(traindata, na.rm=TRUE)-mdata
cdata<-colMeans(testdata, na.rm=TRUE)-mdata

rdata
cdata
rbdata = NULL

for (j in 1:length(cdata))
  {
    rbdata<-cbind(rbdata, rdata[1:length(rdata)]%+%cdata[j]+mdata)
  }
dfrbdata <- as.matrix(round(rbdata,2))

colnames(dfrbdata)<-colnames(traindata)
rownames(dfrbdata)<-rownames(traindata)


rbmean<-mean(rbdata)

dfrbdata
rbtrmse<-sqrt(mean((traindata-rbmean)^2, na.rm=TRUE))
rbtermse<-sqrt(mean((testdata-rbmean)^2, na.rm=TRUE))

trresults<-round((1-rbtrmse/trrmse)*100,2)
teresults<-round((1-rbtermse/termse)*100,2)

Data 612 Project 1

Anthony Pagan

6/10/2020