The goal of this project is to build a model to predict camera prices. Tasks include:
EDA
Identifying outlier, high-leverage, and influential points
Verifying regression assumptions
Data transformation (log, standardization)
Creating test/train sets
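The data structure shown below comes from str(); here is a minimal setup sketch, assuming the data lives in a local CSV (the file name is hypothetical, and ggplot2 and psych are loaded for the plots and describe() summaries used later):

library(ggplot2)
library(psych)  # describe() is used later for summary statistics
# hypothetical file name; the original load code is not shown
cameras <- read.csv("camera_dataset.csv", stringsAsFactors = FALSE)
str(cameras)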
## 'data.frame': 1038 obs. of 13 variables:
## $ Model : chr "Agfa ePhoto 1280" "Agfa ePhoto 1680" "Agfa ePhoto CL18" "Agfa ePhoto CL30" ...
## $ Release.date : int 1997 1998 2000 1999 1999 2001 1999 1997 1996 2001 ...
## $ Max.resolution : num 1024 1280 640 1152 1152 ...
## $ Low.resolution : num 640 640 0 640 640 ...
## $ Effective.pixels : num 0 1 0 0 0 1 1 0 0 1 ...
## $ Zoom.wide..W. : num 38 38 45 35 43 51 34 42 50 35 ...
## $ Zoom.tele..T. : num 114 114 45 35 43 51 102 42 50 105 ...
## $ Normal.focus.range : num 70 50 0 0 50 50 0 70 40 76 ...
## $ Macro.focus.range : num 40 0 0 0 0 20 0 3 10 16 ...
## $ Storage.included : num 4 4 2 4 40 8 8 2 1 8 ...
## $ Weight..inc..batteries.: num 420 420 0 0 300 270 0 320 460 375 ...
## $ Dimensions : num 95 158 0 0 128 119 0 93 160 110 ...
## $ Price : num 179 179 179 269 1299 ...
This data set contains 1038 observations and 13 variables.
# Visualize the distribution of 'Release.date'
ggplot(cameras, aes(x = Release.date)) +
geom_histogram(binwidth = 1, fill = "#66c2a5", color = "#7570b3") +
labs(title = "Distribution of Camera Release Dates",
x = "Release Date",
y = "Frequency")# Explore the distribution of 'Max.resolution' and 'Effective.pixels'
ggplot(cameras, aes(x = Max.resolution, y = Effective.pixels)) +
geom_point(alpha = 0.6, color = "#fc8d62") +
labs(title = "Relationship between Max Resolution and Effective Pixels",
x = "Max Resolution",
y = "Effective Pixels")# Analyze the 'Price' variable
ggplot(cameras, aes(x = Price)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Price), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Prices",
x = "Price",
y = "Frequency") +
theme_minimal()
From the histogram of Price, we see potential outliers in the long right tail. The red dashed line marks the mean price.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1038 457.38 760.45 199 288.25 103.78 14 7999 7985 5.17 36.82
## se
## X1 23.6
## Model Release.date Max.resolution
## 0 0 0
## Low.resolution Effective.pixels Zoom.wide..W.
## 0 0 0
## Zoom.tele..T. Normal.focus.range Macro.focus.range
## 0 0 1
## Storage.included Weight..inc..batteries. Dimensions
## 2 2 2
## Price
## 0
## Model Release.date Max.resolution Low.resolution
## 346 HP Photosmart R927 2006 3296 2592
## 347 HP Photosmart R937 2007 3298 0
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 346 8 35 105 50
## 347 8 39 118 50
## Macro.focus.range Storage.included Weight..inc..batteries. Dimensions Price
## 346 10 NA NA NA 179
## 347 NA NA NA NA 179
## Model Release.date Max.resolution
## 0.00000000 0.00000000 0.00000000
## Low.resolution Effective.pixels Zoom.wide..W.
## 0.00000000 0.00000000 0.00000000
## Zoom.tele..T. Normal.focus.range Macro.focus.range
## 0.00000000 0.00000000 0.09633911
## Storage.included Weight..inc..batteries. Dimensions
## 0.19267823 0.19267823 0.19267823
## Price
## 0.00000000
Data is missing from 4 columns across 2 rows. The missing data amounts to less than 1% of the data set, so deleting these rows should not present any issues for the analysis.
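A minimal sketch of the row removal, assuming base R (the original cleaning code is not shown):

# drop the 2 rows that contain missing values
cameras <- na.omit(cameras)
nrow(cameras)  # 1036 observations remain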
Price is negatively correlated with Zoom.wide..W. and positively correlated with Weight..inc..batteries.
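The correlation claims can be checked against a correlation matrix over the numeric columns; a sketch (the original correlation code is not shown):

# correlations of each numeric variable with Price
num_vars <- sapply(cameras, is.numeric)
round(cor(cameras[, num_vars], use = "complete.obs")["Price", ], 2)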
Distribution of Weight..inc..batteries.
# Visualize the distribution of 'Weight.inc.batteries'
ggplot(cameras, aes(x = Weight..inc..batteries.)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Weight..inc..batteries.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Weight..inc..batteries.",
x = "Weight (incl. batteries)",
y = "Frequency") +
theme_minimal()

## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1036 319.27 260.41 226 266.73 92.66 0 1860 1860 2.82 9.75
## se
## X1 8.09
Weight..inc..batteries. is right-skewed with a long tail. There appear to be outliers in the 1000-2000 range.
# Visualize the distribution of 'Zoom.wide..W.'
ggplot(cameras, aes(x = Zoom.wide..W.)) +
geom_histogram(bins = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Zoom.wide..W.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Zoom.wide..W.",
x = "Zoom wide (W)",
y = "Frequency") +
theme_minimal()

## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1036 32.96 10.34 36 35.55 2.97 0 52 52 -2.57 5.56 0.32
Zoom.wide..W. is somewhat normally distributed, but appears to have outliers at the 0 mark.
# Fit regression model with Zoom.wide..W. and Weight..inc..batteries
reg_model <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = cameras )
# Plot the data points with regression line for one predictor variable
plot(cameras$Zoom.wide..W., cameras$Price, main = "Price by Zoom Wide (W)",
xlab = "Zoom wide (W)",
ylab = "Price")
abline(reg_model, col = "red")

## Warning in abline(reg_model, col = "red"): only using the first two of 3
## regression coefficients
## # A tibble: 1,036 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 179 38 420 443. -264. 0.00220
## 2 2 179 38 420 443. -264. 0.00220
## 3 3 179 45 0 -37.3 216. 0.00260
## 4 4 269 35 0 156. 113. 0.00324
## 5 5 1299 43 300 248. 1051. 0.00257
## 6 6 179 51 270 68.8 110. 0.00591
## 7 7 179 34 0 175. 3.75 0.00350
## 8 8 149 42 320 284. -135. 0.00242
## 9 9 139 50 460 244. -105. 0.00900
## 10 10 139 35 375 464. -325. 0.00124
## # ℹ 1,026 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
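The diagnostics table above collects the fitted values, residuals, leverage, and Cook's distance per observation; a sketch of how it can be built, assuming broom's augment() was used (the original code is not shown):

library(broom)
# adds .fitted, .resid, .hat (leverage), .cooksd, and .std.resid per observation
out_lev_inf <- augment(reg_model)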
# Identify Outliers
# observations with std res absolute value > 2
which(abs(out_lev_inf$.std.resid) > 2)

## [1] 49 50 51 52 53 54 55 202 389 390 391 392 393 394 395 396 397 398 399
## [20] 400 401 402 456 457 458 459 460 461 462 552 555 556 557 558 566 567 568 569
## [39] 570 575 585 606 607 608 610 611 623 824 833 901 902
With a standardized residual whose absolute value exceeds 2, 51 cameras are flagged as outliers.
# Identify High-leverage points
# determine threshold for high leverage points
# predictors
m <- 1
# sample size
n <- length(cameras$Price)
# create thresholds
upper_threshold <- (3 * (m+1)) / n
lower_threshold <- (2 * (m+1)) / n
print(upper_threshold)

## [1] 0.005791506

print(lower_threshold)

## [1] 0.003861004
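The index vectors below were presumably produced with which() over the hat values; a sketch:

which(out_lev_inf$.hat >= upper_threshold)
which(out_lev_inf$.hat >= lower_threshold)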
## [1] 6 9 48 49 50 51 52 53 54 55 56 57 58 59 60
## [16] 61 62 63 64 65 75 186 202 205 206 287 298 299 300 301
## [31] 389 390 391 392 393 394 395 396 397 398 399 400 401 402 469
## [46] 474 475 549 550 551 552 553 554 555 556 557 558 559 560 561
## [61] 562 563 564 565 615 626 627 629 630 631 632 633 634 635 636
## [76] 711 714 717 718 746 747 748 749 750 751 752 753 754 852 853
## [91] 854 874 900 901 902 903 904 905 906 911 912 917 918 919 957
## [106] 1003 1004 1013 1016 1017 1018 1024
## [1] 6 9 48 49 50 51 52 53 54 55 56 57 58 59 60
## [16] 61 62 63 64 65 74 75 168 186 202 205 206 266 280 281
## [31] 287 298 299 300 301 319 380 381 383 389 390 391 392 393 394
## [46] 395 396 397 398 399 400 401 402 437 469 474 475 476 507 549
## [61] 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564
## [76] 565 576 601 615 626 627 628 629 630 631 632 633 634 635 636
## [91] 711 714 717 718 745 746 747 748 749 750 751 752 753 754 852
## [106] 853 854 874 900 901 902 903 904 905 906 907 908 909 911 912
## [121] 917 918 919 922 957 1003 1004 1007 1008 1009 1010 1011 1013 1016 1017
## [136] 1018 1024
# Filter rows within the upper threshold
high_lev_rows_upper <- out_lev_inf$.hat >= upper_threshold
# Display the rows
out_lev_inf[high_lev_rows_upper, ]

## # A tibble: 112 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 179 51 270 68.8 110. 0.00591
## 2 9 139 50 460 244. -105. 0.00900
## 3 48 1299 0 875 1552. -253. 0.0108
## 4 49 4499 0 1585 2135. 2364. 0.0239
## 5 50 4499 0 1565 2119. 2380. 0.0231
## 6 51 4499 0 1335 1930. 2569. 0.0161
## 7 52 4499 0 1565 2119. 2380. 0.0231
## 8 53 7999 0 1585 2135. 5864. 0.0239
## 9 54 7999 0 1565 2119. 5880. 0.0231
## 10 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 102 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# Filter rows within the lower threshold
high_lev_rows_lower <- out_lev_inf$.hat >= lower_threshold
# Display the rows
out_lev_inf[high_lev_rows_lower, ]

## # A tibble: 136 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 179 51 270 68.8 110. 0.00591
## 2 9 139 50 460 244. -105. 0.00900
## 3 48 1299 0 875 1552. -253. 0.0108
## 4 49 4499 0 1585 2135. 2364. 0.0239
## 5 50 4499 0 1565 2119. 2380. 0.0231
## 6 51 4499 0 1335 1930. 2569. 0.0161
## 7 52 4499 0 1565 2119. 2380. 0.0231
## 8 53 7999 0 1585 2135. 5864. 0.0239
## 9 54 7999 0 1565 2119. 5880. 0.0231
## 10 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 126 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Following the upper threshold guideline, 112 rows are considered high-leverage. Following the lower threshold guideline, 136 rows are considered high-leverage.
# Identify influential Observations using Cooks Distance
# df2 = n-m-1 = 1036-13-1 = 1022
# to find median of F distribution with 1 and 1022 DF
cooksd50<- qf(0.5,1,1022)
#cooksd50
# for the 25th percentile; points between the 25th and 50th percentile cutoffs are "tending toward" influential
cooksd25<- qf(0.25,1,26)
#cooksd25
# Identify influential observations
inf_obs <- which(cooks.distance(reg_model) > cooksd50)
inf_obs

## 53 54 55
## 53 54 55
# Identify tending influential observations
inf_obs_25 <- which(cooks.distance(reg_model) > cooksd25)
inf_obs_25

## 49 50 52 53 54 55 392 402 560
## 49 50 52 53 54 55 390 400 558
# Filter rows within the upper threshold
inf_points <- which(cooks.distance(reg_model) > cooksd50)
# Display the rows
out_lev_inf[inf_points, ]

## # A tibble: 3 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53 7999 0 1585 2135. 5864. 0.0239
## 2 54 7999 0 1565 2119. 5880. 0.0231
## 3 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Cook's distance was calculated to identify influential observations. The Cook's distance cutoff at the 25th percentile of the F distribution is 0.1037, and at the 50th percentile (the median) it is 0.455. Any Cook's distance above 0.455 is considered influential.
Observations 53, 54, and 55 are influential. These observations are extreme in both the x and y space, with large values for Weight..inc..batteries. (1585, 1565, and 1385) and Price (7999).
Observation 53 has a leverage value of 0.0239, a residual of 5863.68, a standardized residual well beyond 2 (9.00), and a Cook's distance of 0.66, which exceeds the F-median cutoff of 0.455.
Observation 54 has a leverage value of 0.0231, a residual of 5880.12, a standardized residual of 9.03, and a Cook's distance of 0.642, which exceeds the 0.455 cutoff.
Observation 55 has a leverage value of 0.0173, a residual of 6028.11, a standardized residual of 9.23, and a Cook's distance of 0.501, which exceeds the 0.455 cutoff.
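A quick visual confirmation (not part of the original output) could use base R's Cook's distance plot for lm objects, with the F-median cutoff overlaid:

# Cook's distance per observation; points above the red line are influential
plot(reg_model, which = 4, col = '#e78ac3')
abline(h = cooksd50, col = "red", lty = 2)  # the 0.455 cutoff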
The independence, constant variance, zero-mean, and normality assumptions are assessed below through the Residuals vs Fitted plot, the Normal Q-Q plot, and formal hypothesis tests.
# data shape is right-tailed/skews to the right, slight bell-shaped curve
# Q-Q plot
plot(reg_model, col = '#e78ac3', which = c(2))

The Q-Q plot of standardized residuals should follow the line if the residuals are drawn from a normally distributed process or population. In this Q-Q plot, the bulk of the points lie on the straight line; however, the points begin to diverge from the line as the theoretical quantiles increase. From this plot, one might conclude the residuals come from a right-skewed distribution such as a chi-square.
# Hypothesis test
ks.test(reg_model$residuals, pnorm, mean(reg_model$residuals), sd(reg_model$residuals))

## Warning in ks.test.default(reg_model$residuals, pnorm,
## mean(reg_model$residuals), : ties should not be present for the
## Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: reg_model$residuals
## D = 0.26335, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Under the null hypothesis, the observed residuals are assumed to be drawn from a normally distributed population. The p-value is < 2.2e-16, so we reject the null hypothesis: the residuals do not appear to come from a normal process or distribution.
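As a cross-check (not in the original analysis), a Shapiro-Wilk test could be run on the same residuals; it is applicable here since n is below the test's 5000-observation limit:

shapiro.test(reg_model$residuals)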
# Normality assumption does not hold.

# Assessing the zero-mean assumption
# red line is flat but negative, and then peaks up at the end
# Residuals are not evenly distributed about the 0 dotted line and data show a funnel shape
# residuals vs fitted
plot(reg_model, col = '#e78ac3', which = c(1))

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2232.42 -209.01 -106.62 0.00 34.45 6028.11
# The residuals vs fitted plot shows a slight funnel shape, which suggests the variance is not constant. Ideally the points would form a rectangular band; the funnel shape suggests that variance increases as the fitted values increase.
# The scale-location plot shows the same funnel shape, and the red line is not flat: it has a slight bump and then rises at the end, with an overall upward trend.
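The scale-location plot described above is the third diagnostic plot for lm objects; a sketch in the same style as the other diagnostic calls:

plot(reg_model, col = '#e78ac3', which = c(3))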
# Hypothesis Test: Breusch-Pagan Test
ols_test_breusch_pagan(reg_model)

##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2511.4364
## Prob > Chi2 = 0.0000
# Breusch-Pagan test: Ho: the variance is constant; Ha: the variance is not constant.
# The small p-value (0.0000, below the 0.05 cutoff) leads us to reject Ho, accept Ha, and conclude the variance is not constant.
# Constant variance assumption violated.

# A visual inspection of the residuals vs fitted plot agrees: the red trend line is not flat, but dips negative and then peaks up at the end.
# Hypothesis test: Durbin Watson
# Ho = independent residuals
# Ha = dependent residuals (auto-correlated)
set.seed(1234)
durbinWatsonTest(reg_model, reps = 5000)

## lag Autocorrelation D-W Statistic p-value
## 1 0.6324554 0.7349031 0
## Alternative hypothesis: rho != 0
# reps increased to 5000 for a more stable bootstrap p-value
# The p-value is 0, below the 0.05 cutoff, so we reject the null hypothesis: the residuals are not independent.
# Independence assumption is violated.

Because this data does not satisfy all of the regression assumptions, we could normalize or log-transform the data, but let's first continue with linear regression after removing the influential points.
# remove influential rows from the data
cameras_sub <- cameras[-c(53:55),]
reg_model_sub <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = cameras_sub )
summary(reg_model_sub)

##
## Call:
## lm(formula = Price ~ Zoom.wide..W. + Weight..inc..batteries.,
## data = cameras_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1717.2 -200.9 -126.0 12.9 4598.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 918.36894 104.40543 8.796 < 2e-16 ***
## Zoom.wide..W. -19.36014 2.40191 -8.060 2.10e-15 ***
## Weight..inc..batteries. 0.49883 0.09698 5.143 3.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.5 on 1030 degrees of freedom
## Multiple R-squared: 0.2147, Adjusted R-squared: 0.2132
## F-statistic: 140.8 on 2 and 1030 DF, p-value: < 2.2e-16
The linear regression model is represented by the equation:
Price = 918.37 − 19.36 × Zoom.wide..W. + 0.499 × Weight..inc..batteries.
The residuals (differences between actual and predicted values) range from a minimum of -1717.2 to a maximum of 4598.6, with a 1st quartile (25th percentile) of -200.9, a median of -126.0, and a 3rd quartile (75th percentile) of 12.9.
The coefficients indicate that, holding other variables constant, each one-unit increase in Zoom.wide..W. decreases Price by 19.36 units. Similarly, each one-unit increase in Weight..inc..batteries. is expected to increase Price by about 0.50 units.
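As a worked example (the input values are made up for illustration), the fitted equation can be evaluated with predict():

# a hypothetical camera: 35 mm wide zoom, 300 g including batteries
new_cam <- data.frame(Zoom.wide..W. = 35, Weight..inc..batteries. = 300)
predict(reg_model_sub, newdata = new_cam)  # 918.37 - 19.36*35 + 0.499*300, about 390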
set.seed(12345) # Set seed for reproducibility
n <- nrow(cameras_sub)
# Create an index for the training set (80% of the data)
train_index <- sample(1:n, 0.8 * n)
# Create the training set
train_set <- cameras_sub[train_index, ]
# Create the test set (remaining 20% of the data)
test_set <- cameras_sub[-train_index, ]
# Check the dimensions of the sets
dim(train_set)

## [1] 826 13

dim(test_set)

## [1] 207 13
# create trained model
reg_model_trained <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = train_set)
summary(reg_model_trained)

##
## Call:
## lm(formula = Price ~ Zoom.wide..W. + Weight..inc..batteries.,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1847.3 -195.1 -116.1 17.9 4519.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 952.591 118.455 8.042 3.07e-15 ***
## Zoom.wide..W. -20.949 2.718 -7.707 3.71e-14 ***
## Weight..inc..batteries. 0.567 0.108 5.250 1.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 573 on 823 degrees of freedom
## Multiple R-squared: 0.2535, Adjusted R-squared: 0.2517
## F-statistic: 139.7 on 2 and 823 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions <- predict(reg_model_trained, newdata = test_set)
# Evaluate the model
mse <- mean((test_set$Price - predictions)^2)
rmse <- sqrt(mse)
rsquared <- 1 - (sum((test_set$Price - predictions)^2) / sum((test_set$Price - mean(test_set$Price))^2))
# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse, "\n")## Mean Squared Error (MSE): 324489.9
## Root Mean Squared Error (RMSE): 569.6401
## R-squared (R²): -0.01273235
With a Multiple R-squared of 0.2535 on the training data, this model accounts for only 25.35% of the variability in Price using the variables Zoom.wide..W. and Weight..inc..batteries. On the test set, the RMSE is high at 569.6 and the R-squared is slightly negative (-0.013), meaning the model predicts no better than the test-set mean. Perhaps incorporating other variables might improve performance. Let's assess the variables Low.resolution and Zoom.tele..T. to see whether they would be appropriate additions to the model.
# Visualize the distribution of 'zoom.tele..T.'
ggplot(cameras, aes(x = Zoom.tele..T.)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Zoom.tele..T.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Zoom.tele..T.",
x = "Zoom tele (T)",
y = "Frequency") +
theme_minimal()

Somewhat normal, but a bit uneven.
ggplot(cameras, aes(x = Low.resolution)) +
geom_histogram(binwidth = 100, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Low.resolution), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Low.resolution",
x = "Low Resolution",
y = "Frequency") +
theme_minimal()

Lots of variability here, but an overall normal shape.
Let's add Low.resolution and Zoom.tele..T. to our model and call it Model 2.
# Build Model 2
model_2 <- lm(Price ~ Low.resolution + Zoom.tele..T. + Zoom.wide..W. + Weight..inc..batteries., data = train_set)
summary(model_2)

##
## Call:
## lm(formula = Price ~ Low.resolution + Zoom.tele..T. + Zoom.wide..W. +
## Weight..inc..batteries., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1838.4 -206.8 -111.8 23.6 4418.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 423.57143 160.63936 2.637 0.008528 **
## Low.resolution 0.12434 0.02777 4.477 8.65e-06 ***
## Zoom.tele..T. -0.89916 0.25423 -3.537 0.000428 ***
## Zoom.wide..W. -11.10968 3.39382 -3.274 0.001107 **
## Weight..inc..batteries. 0.85603 0.12204 7.014 4.84e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.5 on 821 degrees of freedom
## Multiple R-squared: 0.2746, Adjusted R-squared: 0.2711
## F-statistic: 77.71 on 4 and 821 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions <- predict(model_2, newdata = test_set)
# Evaluate the model
mse <- mean((test_set$Price - predictions)^2)
rmse <- sqrt(mse)
rsquared <- 1 - (sum((test_set$Price - predictions)^2) / sum((test_set$Price - mean(test_set$Price))^2))
# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse, "\n")## Mean Squared Error (MSE): 332374.7
## Root Mean Squared Error (RMSE): 576.5194
## R-squared (R²): -0.03734071
Model 2 still has a poor R-squared value and a high RMSE. Let's try a log transformation.
# there was an error due to infinite values, so add 1 to avoid taking the log of zero
train_set$log_price <- log(train_set$Price + 1)
train_set$log_low.res <- log(train_set$Low.resolution + 1)
train_set$log_zoom.w <- log(train_set$Zoom.wide..W. + 1)
train_set$log_zoom.t <- log(train_set$Zoom.tele..T. + 1)
train_set$log_weight.batt <- log(train_set$Weight..inc..batteries. + 1)
log_model <- lm(log_price ~ log_low.res + log_zoom.w + log_zoom.t + log_weight.batt, data = train_set)
summary(log_model)

##
## Call:
## lm(formula = log_price ~ log_low.res + log_zoom.w + log_zoom.t +
## log_weight.batt, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8175 -0.5190 -0.1187 0.4116 2.8372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.83707 0.26860 21.732 < 2e-16 ***
## log_low.res 0.09672 0.01620 5.970 3.53e-09 ***
## log_zoom.w -0.29635 0.07217 -4.107 4.42e-05 ***
## log_zoom.t -0.03612 0.04941 -0.731 0.465
## log_weight.batt 0.03813 0.03232 1.180 0.238
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7749 on 821 degrees of freedom
## Multiple R-squared: 0.1908, Adjusted R-squared: 0.1868
## F-statistic: 48.38 on 4 and 821 DF, p-value: < 2.2e-16
A log transformation did not raise R-squared. Let's try standardizing the data.
std_camera <- cameras # Create a copy to avoid modifying the original dataset
# Variables to standardize
std_vars <- c("Low.resolution", "Zoom.wide..W.", "Zoom.tele..T.", "Weight..inc..batteries.")
# Standardize selected variables
std_camera[, std_vars] <- scale(std_camera[, std_vars])
# Check the first few rows of the standardized dataset
head(std_camera)

## Model Release.date Max.resolution Low.resolution
## 1 Agfa ePhoto 1280 1997 1024 -1.368157
## 2 Agfa ePhoto 1680 1998 1280 -1.368157
## 3 Agfa ePhoto CL18 2000 640 -2.139724
## 4 Agfa ePhoto CL30 1999 1152 -1.368157
## 5 Agfa ePhoto CL30 Clik! 1999 1152 -1.368157
## 6 Agfa ePhoto CL45 2001 1600 -1.368157
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1 0 0.4877952 -0.08065066 70
## 2 1 0.4877952 -0.08065066 50
## 3 0 1.1646973 -0.81826991 0
## 4 0 0.1976942 -0.92517124 0
## 5 0 0.9712967 -0.83965017 50
## 6 1 1.7448991 -0.75412910 50
## Macro.focus.range Storage.included Weight..inc..batteries. Dimensions Price
## 1 40 4 0.38683039 95 179
## 2 0 4 0.38683039 158 179
## 3 0 2 -1.22601004 0 179
## 4 0 4 -1.22601004 0 269
## 5 0 40 -0.07398116 128 1299
## 6 20 8 -0.18918405 119 179
# build std model
std_model <- lm(Price ~ Low.resolution + Zoom.wide..W. + Zoom.tele..T. + Weight..inc..batteries., data = std_camera)
summary(std_model)

##
## Call:
## lm(formula = Price ~ Low.resolution + Zoom.wide..W. + Zoom.tele..T. +
## Weight..inc..batteries., data = std_camera)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2219.7 -222.0 -113.0 41.7 5639.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 457.92 20.05 22.842 < 2e-16 ***
## Low.resolution 138.47 22.79 6.076 1.73e-09 ***
## Zoom.wide..W. -58.30 35.15 -1.659 0.0975 .
## Zoom.tele..T. -123.89 24.04 -5.153 3.07e-07 ***
## Weight..inc..batteries. 311.04 31.45 9.889 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.3 on 1031 degrees of freedom
## Multiple R-squared: 0.284, Adjusted R-squared: 0.2812
## F-statistic: 102.2 on 4 and 1031 DF, p-value: < 2.2e-16
Standardizing the predictor variables gives an R-squared of 0.284. Note, though, that this model was fit on the full data set rather than the training set, so the comparison with the earlier models is not exact. In any case, it is still not a strong model, and results regarding the significance of the variables should be interpreted with caution.
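The random forest whose summary is printed below was presumably fit along these lines; a sketch, with `rf_model` as an assumed object name and an assumed seed:

library(randomForest)
set.seed(1234)  # seed value assumed
rf_model <- randomForest(Price ~ Low.resolution + Zoom.tele..T., data = train_set)
rf_model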
##
## Call:
## randomForest(formula = Price ~ Low.resolution + Zoom.tele..T., data = train_set)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 268249.9
## % Var explained: 38.79
## [1] 1
## [1] 442.7539
The random forest explains 38.79% of the variance in Price. This does not suffice for model deployment, but it beats the standardized model's R-squared of 0.284 by about 10 percentage points and is the best model yet. Other approaches should be explored to model camera price.
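As a follow-up (not part of the original output), the forest could also be scored on the held-out test set, mirroring the linear-model evaluation; a sketch assuming the `rf_model` object from above:

rf_pred <- predict(rf_model, newdata = test_set)
rf_rmse <- sqrt(mean((test_set$Price - rf_pred)^2))
rf_rsq <- 1 - sum((test_set$Price - rf_pred)^2) / sum((test_set$Price - mean(test_set$Price))^2)
cat("Random forest test RMSE:", rf_rmse, "\n")
cat("Random forest test R-squared:", rf_rsq, "\n")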