Disclaimer

This analysis of the cameras dataset was done in fulfillment of an introduction to a data mining class assignment, and should not be used to make any inferences for practical applications.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
camera_data <- read.csv("./data/camera_dataset.csv")

1. What is the statistical summary of the cameras dataset after the cleansing of missing data (if any)?

glimpse(camera_data)
## Rows: 1,038
## Columns: 13
## $ Model              <fct> Agfa ePhoto 1280, Agfa ePhoto 1680, Agfa ePhoto CL…
## $ Release.date       <int> 1997, 1998, 2000, 1999, 1999, 2001, 1999, 1997, 19…
## $ Max.resolution     <dbl> 1024, 1280, 640, 1152, 1152, 1600, 1280, 640, 832,…
## $ Low.resolution     <dbl> 640, 640, 0, 640, 640, 640, 640, 0, 640, 1024, 102…
## $ Effective.pixels   <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 3, 3, 1,…
## $ Zoom.wide..W.      <dbl> 38, 38, 45, 35, 43, 51, 34, 42, 50, 35, 39, 35, 39…
## $ Zoom.tele..T.      <dbl> 114, 114, 45, 35, 43, 51, 102, 42, 50, 105, 39, 10…
## $ Normal.focus.range <dbl> 70, 50, 0, 0, 50, 50, 0, 70, 40, 76, 20, 76, 20, 7…
## $ Macro.focus.range  <dbl> 40, 0, 0, 0, 0, 20, 0, 3, 10, 16, 5, 16, 5, 16, 5,…
## $ Storage.included   <dbl> 4, 4, 2, 4, 40, 8, 8, 2, 1, 8, 8, 8, 8, 8, 32, 0, …
## $ Weight             <dbl> 420, 420, 0, 0, 300, 270, 0, 320, 460, 375, 225, 3…
## $ Dimensions         <dbl> 95, 158, 0, 0, 128, 119, 0, 93, 160, 110, 110, 110…
## $ Price              <dbl> 179, 179, 179, 269, 1299, 179, 179, 149, 139, 139,…
#looking for columns with missing data
for(i in 1:dim(camera_data)[2]){
    temp <- camera_data %>% summarise(count = sum(is.na(camera_data[i])))
    print(paste("Missing value count for", colnames(camera_data[i]), "is", temp))
}
## [1] "Missing value count for Model is 0"
## [1] "Missing value count for Release.date is 0"
## [1] "Missing value count for Max.resolution is 0"
## [1] "Missing value count for Low.resolution is 0"
## [1] "Missing value count for Effective.pixels is 0"
## [1] "Missing value count for Zoom.wide..W. is 0"
## [1] "Missing value count for Zoom.tele..T. is 0"
## [1] "Missing value count for Normal.focus.range is 0"
## [1] "Missing value count for Macro.focus.range is 1"
## [1] "Missing value count for Storage.included is 2"
## [1] "Missing value count for Weight is 2"
## [1] "Missing value count for Dimensions is 2"
## [1] "Missing value count for Price is 0"
#store rows with missing data
camera_missing <- camera_data[rowSums(is.na(camera_data)) > 0,]
#drop rows with missing data
camera_data <- camera_data %>% drop_na()
summary(camera_data)
##                     Model       Release.date  Max.resolution Low.resolution
##  Agfa ePhoto 1280      :   1   Min.   :1994   Min.   :   0   Min.   :   0  
##  Agfa ePhoto 1680      :   1   1st Qu.:2002   1st Qu.:2048   1st Qu.:1120  
##  Agfa ePhoto CL18      :   1   Median :2004   Median :2560   Median :2048  
##  Agfa ePhoto CL30      :   1   Mean   :2004   Mean   :2473   Mean   :1775  
##  Agfa ePhoto CL30 Clik!:   1   3rd Qu.:2006   3rd Qu.:3072   3rd Qu.:2560  
##  Agfa ePhoto CL45      :   1   Max.   :2007   Max.   :5616   Max.   :4992  
##  (Other)               :1030                                               
##  Effective.pixels Zoom.wide..W.   Zoom.tele..T.   Normal.focus.range
##  Min.   : 0.00    Min.   : 0.00   Min.   :  0.0   Min.   :  0.00    
##  1st Qu.: 3.00    1st Qu.:35.00   1st Qu.: 96.0   1st Qu.: 30.00    
##  Median : 4.00    Median :36.00   Median :108.0   Median : 50.00    
##  Mean   : 4.59    Mean   :32.96   Mean   :121.5   Mean   : 44.13    
##  3rd Qu.: 7.00    3rd Qu.:38.00   3rd Qu.:117.0   3rd Qu.: 60.00    
##  Max.   :21.00    Max.   :52.00   Max.   :518.0   Max.   :120.00    
##                                                                     
##  Macro.focus.range Storage.included     Weight         Dimensions   
##  Min.   : 0.000    Min.   :  0.00   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 3.000    1st Qu.:  8.00   1st Qu.: 180.0   1st Qu.: 92.0  
##  Median : 6.000    Median : 16.00   Median : 226.0   Median :101.0  
##  Mean   : 7.786    Mean   : 17.45   Mean   : 319.3   Mean   :105.4  
##  3rd Qu.:10.000    3rd Qu.: 20.00   3rd Qu.: 350.0   3rd Qu.:115.0  
##  Max.   :85.000    Max.   :450.00   Max.   :1860.0   Max.   :240.0  
##                                                                     
##      Price       
##  Min.   :  14.0  
##  1st Qu.: 149.0  
##  Median : 199.0  
##  Mean   : 457.9  
##  3rd Qu.: 399.0  
##  Max.   :7999.0  
## 

Results

2. Between the dates (1994 and 2007), was there an increase in the number of camera models produced? Also, what is the rate of growth in the number camera models (if any) in subsequent years beginning from 1994?

#store the dates
dates <- camera_data$Release.date
# Barplot
camera_data %>% ggplot(aes(x=as.factor(dates), fill=as.factor(dates))) + 
    geom_bar(width=0.7) +
    ggtitle("Number of Models Produced Thoughout the Years")

    #ggsave("question 2 bar plot.jpg")
#extract distinct dates from the dataset
distinct_dates <- camera_data %>% select(Release.date) %>% add_count(Release.date) %>% unique() %>% arrange(Release.date)
# calculate the growth
rates <- c(0.0000)

for(i in 1:(dim(distinct_dates)[1] - 1)){
    diff <- distinct_dates[i+1, ][2] - distinct_dates[i, ][2]
    percent <- ( diff / distinct_dates[i, ][2] ) * 100
    rates <- c(rates, as.numeric(percent))
    
    fmt <- sprintf("The difference and growth rate from year %d to %d is (%d, %.2f)", as.integer(distinct_dates[i, ][1]), as.integer(distinct_dates[i+1, ][1]), as.integer(diff), as.double(percent))
    print(fmt)
}
## [1] "The difference and growth rate from year 1994 to 1995 is (0, 0.00)"
## [1] "The difference and growth rate from year 1995 to 1996 is (3, 300.00)"
## [1] "The difference and growth rate from year 1996 to 1997 is (7, 175.00)"
## [1] "The difference and growth rate from year 1997 to 1998 is (21, 190.91)"
## [1] "The difference and growth rate from year 1998 to 1999 is (21, 65.62)"
## [1] "The difference and growth rate from year 1999 to 2000 is (8, 15.09)"
## [1] "The difference and growth rate from year 2000 to 2001 is (24, 39.34)"
## [1] "The difference and growth rate from year 2001 to 2002 is (4, 4.71)"
## [1] "The difference and growth rate from year 2002 to 2003 is (12, 13.48)"
## [1] "The difference and growth rate from year 2003 to 2004 is (40, 39.60)"
## [1] "The difference and growth rate from year 2004 to 2005 is (2, 1.42)"
## [1] "The difference and growth rate from year 2005 to 2006 is (9, 6.29)"
## [1] "The difference and growth rate from year 2006 to 2007 is (10, 6.58)"
final <- data.frame(year = distinct_dates[1], growth = rates)
final %>% ggplot(aes(x=as.factor(Release.date), y=growth, group=1) ) +
    geom_line(color="grey") +
    geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
    ggtitle("The Rate of Growth Between Every Year from 1994 to 2007")

    #ggsave("question 2 line plot.jpg")

Results

3. Are there any outliers from the prices variable over the year 2000 in the dataset? How many camera models are priced over $2000?

camera_data %>% filter(Release.date > 2000) %>% ggplot(aes(x=as.factor(Release.date), y=Price, group=Release.date, fill=Release.date)) +
    geom_boxplot() +
    ggtitle("Box Plot of Prices over the Year 2000")

    #ggsave("question 3 box plot.jpg")
camera_data %>% filter(Release.date > 2000 & Price > 2000)
##                     Model Release.date Max.resolution Low.resolution
## 1            Canon EOS-1D         2001           2464           1232
## 2    Canon EOS-1D Mark II         2004           3504           3104
## 3   Canon EOS-1D Mark III         2007           3888           3456
## 4  Canon EOS-1D Mark II N         2005           3504           3104
## 5           Canon EOS-1Ds         2002           4064           2032
## 6   Canon EOS-1Ds Mark II         2004           4992           3600
## 7  Canon EOS-1Ds Mark III         2007           5616           4992
## 8            Canon EOS 5D         2005           4368           3168
## 9              Epson R-D1         2004           3008           2240
## 10              Nikon D1X         2001           3008           2000
## 11             Nikon D2Hs         2005           2464           1840
## 12              Nikon D2X         2004           4288           3216
## 13             Nikon D2Xs         2006           4288           3216
## 14               Nikon D3         2007           4256           3184
## 15       Ricoh GR Digital         2005           3264           2592
## 16         Ricoh RDC-i500         2001           2048           1024
##    Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                 4             0             0                  0
## 2                 8             0             0                  0
## 3                10             0             0                  0
## 4                 8             0             0                  0
## 5                11             0             0                  0
## 6                16             0             0                  0
## 7                21             0             0                  0
## 8                12             0             0                  0
## 9                 6             0             0                  0
## 10                5             0             0                  0
## 11                4             0             0                  0
## 12               12             0             0                  0
## 13               12             0             0                  0
## 14               12             0             0                  0
## 15                8            28            28                 30
## 16                3            35           105                 24
##    Macro.focus.range Storage.included Weight Dimensions Price
## 1                  0                0   1585        156  4499
## 2                  0                0   1565        156  4499
## 3                  0                0   1335        156  4499
## 4                  0                0   1565        156  4499
## 5                  0                0   1585        156  7999
## 6                  0                0   1565        156  7999
## 7                  0                0   1385        150  7999
## 8                  0                0    895        152  2499
## 9                  0                0    620        142  4499
## 10                 0                0   1200        157  4699
## 11                 0                0   1200        158  4699
## 12                 0                0   1200        158  4699
## 13                 0                0   1200        158  4699
## 14                 0                0   1300        160  4999
## 15                 2               26    200        107  4999
## 16                 1                8    320        142  4999

Results

4. Which camera model is more cost efficient, costing less than $200, has a Max resolution (greater than 2160) and a storage size greater than 24gb in the year 2007?

camera_data %>% filter(Price < 200 & Max.resolution > 2160 & Release.date == 2007 & Storage.included > 24) %>% apply(2,min)
##                    Model             Release.date           Max.resolution 
## "Canon PowerShot SD1000"                   "2007"                   "2816" 
##           Low.resolution         Effective.pixels            Zoom.wide..W. 
##                   "   0"                     " 6"                     "28" 
##            Zoom.tele..T.       Normal.focus.range        Macro.focus.range 
##                    "101"                     " 0"                     " 0" 
##         Storage.included                   Weight               Dimensions 
##                    " 25"                    "  0"                    " 86" 
##                    Price 
##                    " 99"

Results

5. Given the different manufacturers in the dataset, plot the total price of all camera models per manufacturer, in a year for every year until 2007? with this insight, deduce:

  1. Which manufacturer resulted in the lowest and highest after accumulating all prices per manufacturer throughout the years.
  2. How many manufacturers have stopped producing cameras by the year 2007.
  3. The year when most manufacturers produced.
manufacturers <- c("Agfa ePhoto", "Canon", "Casio", "Contax", "Epson", "Fujifilm", "HP","Kodak", "Kyocera", "Finecam", "Leica", "Nikon", "Olympus", "Panasonic", "Pentax", "Ricoh", "Samsung", "Sanyo", "Sigma", "Sony", "Toshiba")
accumulated_total <- c(rep(0, times = length(manufacturers)))
for(i in 1:dim(distinct_dates[1])[1]){

    price_per_manu <- data.frame(manufacturers = character(), price = numeric())
    
    for(j in 1:length(manufacturers)){
        
        result <- camera_data %>% filter(Release.date == as.numeric(distinct_dates[1][i,]) & str_detect(Model, manufacturers[j]))
        total <- sum(result$Price)
        
        price_per_manu <- rbind(price_per_manu, data.frame(manufacturers = manufacturers[j], price = total))
        
        accumulated_total[j] <- accumulated_total[j] + total
    }
    
    plot <- price_per_manu %>% ggplot(aes(x=reorder(manufacturers, price), y=price)) +
        geom_bar(stat="identity", fill="#69b3a2", width=.4) +
        coord_flip() + 
        xlab("") +
        theme_bw() +
        ggtitle("Total Price Per Manufacturers in year: ", distinct_dates[1][i,])
    
    print(plot)
    #ggsave(paste("Year", distinct_dates[1][i,],".png", sep = ""))
}

accumulated_total_manu <- data.frame(manufacturers = manufacturers, price = accumulated_total)
accumulated_total_manu %>%
    ggplot(aes(x = reorder(manufacturers, price), y=price)) +
    geom_bar(stat="identity", fill="#69b3a2", width=.4) +
    coord_flip() + 
    xlab("") +
    theme_bw() +
    ggtitle("Accumulated Price of all Manufacturers Throughout the Years")

#ggsave("total_summary.png")

Results

6. How many camera models, with the following variables: [zoom wide, zoom tele, normal focus range, macro focus range] have a value less than 1? Also is there a trend that camera models with less than 1 in the corresponding variables are manufacturer specific or not?

less_than_ones <- camera_data %>% filter(Zoom.wide..W. < 1 & Zoom.tele..T. < 1 & Normal.focus.range < 1 & Macro.focus.range < 1)
dim(less_than_ones)[1]
## [1] 85
head(less_than_ones)
##                    Model Release.date Max.resolution Low.resolution
## 1          Canon EOS 10D         2003           3072           2048
## 2           Canon EOS-1D         2001           2464           1232
## 3   Canon EOS-1D Mark II         2004           3504           3104
## 4  Canon EOS-1D Mark III         2007           3888           3456
## 5 Canon EOS-1D Mark II N         2005           3504           3104
## 6          Canon EOS-1Ds         2002           4064           2032
##   Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                6             0             0                  0
## 2                4             0             0                  0
## 3                8             0             0                  0
## 4               10             0             0                  0
## 5                8             0             0                  0
## 6               11             0             0                  0
##   Macro.focus.range Storage.included Weight Dimensions Price
## 1                 0                0    875        150  1299
## 2                 0                0   1585        156  4499
## 3                 0                0   1565        156  4499
## 4                 0                0   1335        156  4499
## 5                 0                0   1565        156  4499
## 6                 0                0   1585        156  7999

Results

7. What kind of data distributions does the following variables have: [dimensions, weight, zoom tele and prices]?

hist1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Dimensions Distribution")

dense1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Dimensions Distribution")

grid.arrange(hist1, dense1, ncol=2)

#ggsave("question 7 hist_density plot1.jpg", grid.arrange(hist1, dense1, ncol=2))

hist2 <- camera_data %>% ggplot(aes(x=Weight)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Weight Distribution")

dense2 <- camera_data %>% ggplot(aes(x=Weight)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Weight Distribution")

grid.arrange(hist2, dense2, ncol=2)

#ggsave("question 7 hist_density plot2.jpg", grid.arrange(hist2, dense2, ncol=2))

hist3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Zoom Tele Distribution")

dense3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Zoom Tele Distribution")

grid.arrange(hist3, dense3, ncol=2)

#ggsave("question 7 hist_density plot3.jpg", grid.arrange(hist3, dense3, ncol=2))

hist4 <- camera_data %>% ggplot(aes(x=Price)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Price Distribution")

dense4 <- camera_data %>% ggplot(aes(x=Price)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Price Distribution")

grid.arrange(hist4, dense4, ncol=2)

#ggsave("question 7 hist_density plot4.jpg", grid.arrange(hist4, dense4, ncol=2))

Results

8. What kind of relationship exists between the 3 variable pairs: [storage and max resolution], [weight and dimensions], [price and zoom tele], under the condition that [zoom wide, zoom tele, normal focus range, macro focus range] is greater than 1?

filtered_set <- camera_data %>% filter(Zoom.wide..W. > 1 & Zoom.tele..T. > 1 & Normal.focus.range > 1 & Macro.focus.range > 1)
filtered_set %>% filter(Storage.included < 100) %>% ggplot(aes(x=Max.resolution, y=Storage.included)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Storage vs Max Resolution")

#ggsave("question 8 scatter plot1.jpg")

filtered_set %>% ggplot(aes(x=Weight, y=Dimensions)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Dimensions vs Weight")

#ggsave("question 8 scatter plot2.jpg")

filtered_set %>% filter(Price < 2000) %>%ggplot(aes(x=Zoom.tele..T., y=Price)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Price vs Zoom tele")

#ggsave("question 8 scatter plot3.jpg")

facet_wrap( ~ align)
## <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>

Results

9. As a follow up question to question 8, what are the correlation coefficients of 3 relationships identified?

relation_1 <- data.frame(cbind(max_res = filtered_set$Max.resolution, storage = filtered_set$Storage.included))
relation_2 <- data.frame(cbind(weight = filtered_set$Weight, dims = filtered_set$Dimensions))
relation_3 <- data.frame(cbind(zoom_t = filtered_set$Zoom.tele..T., price = filtered_set$Price))
cor(relation_1)
##           max_res   storage
## max_res 1.0000000 0.2538378
## storage 0.2538378 1.0000000
cor(relation_2)
##          weight     dims
## weight 1.000000 0.524276
## dims   0.524276 1.000000
cor(relation_3)
##             zoom_t       price
## zoom_t  1.00000000 -0.04432165
## price  -0.04432165  1.00000000

Results

10. Find all camera models with a value greater than 1 in the variables [zoom wide and zoom tele]? What is the zoom ratio of those camera models using the variables [zoom wide and zoom tele]? Does the zoom ratio get larger from the years 1994 to 2007?

camera_ratio <- camera_data %>% filter(Zoom.tele..T. > 0 & Zoom.wide..W. > 0) %>% mutate(zoom_ratio = Zoom.tele..T. / Zoom.wide..W.)
head(camera_ratio)
##                    Model Release.date Max.resolution Low.resolution
## 1       Agfa ePhoto 1280         1997           1024            640
## 2       Agfa ePhoto 1680         1998           1280            640
## 3       Agfa ePhoto CL18         2000            640              0
## 4       Agfa ePhoto CL30         1999           1152            640
## 5 Agfa ePhoto CL30 Clik!         1999           1152            640
## 6       Agfa ePhoto CL45         2001           1600            640
##   Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                0            38           114                 70
## 2                1            38           114                 50
## 3                0            45            45                  0
## 4                0            35            35                  0
## 5                0            43            43                 50
## 6                1            51            51                 50
##   Macro.focus.range Storage.included Weight Dimensions Price zoom_ratio
## 1                40                4    420         95   179          3
## 2                 0                4    420        158   179          3
## 3                 0                2      0          0   179          1
## 4                 0                4      0          0   269          1
## 5                 0               40    300        128  1299          1
## 6                20                8    270        119   179          1
camera_ratio %>% ggplot(aes(x=as.factor(Release.date), y=zoom_ratio)) +
    geom_bar(stat="identity", fill="#69b3a2", width=.5)

#ggsave("question 10 bar plot.jpg")

Results

b. Regression Modeling

#splitting and filtering dataset

camera_data_unfiltered <- camera_data
camera_data_filtered <- camera_data %>% filter(Price < 1000)
set.seed(123)

#unfiltered dataset [80, 20] split
split = sort(sample(nrow(camera_data_unfiltered), nrow(camera_data_unfiltered)*.8))
train_unfiltered <- camera_data_unfiltered[split,]
test_unfiltered <- camera_data_unfiltered[-split,]

#filtered dataset [80, 20] split
split = sort(sample(nrow(camera_data_filtered), nrow(camera_data_filtered)*.8))
train_filtered <- camera_data_filtered[split,]
test_filtered <- camera_data_filtered[-split,]

Unfiltered

Model 1

#Model 1, first trial
model1_unfiltered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_unfiltered)

print(summary(model1_unfiltered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2306.6  -226.7  -112.1    26.4  5693.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.86387  139.32734  -0.071   0.9436    
## Max.resolution   0.13631    0.03024   4.507 7.51e-06 ***
## Weight           1.42113    0.11931  11.912  < 2e-16 ***
## Dimensions      -3.15011    1.28959  -2.443   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 658.2 on 824 degrees of freedom
## Multiple R-squared:  0.2288, Adjusted R-squared:  0.226 
## F-statistic:  81.5 on 3 and 824 DF,  p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model1_unfiltered, test_unfiltered)))  
head(actual_preds)
##    actuals predicted
## 26     139  309.4887
## 30     149  420.6426
## 37     139  539.3297
## 38     139  591.6709
## 45     139  472.7551
## 50    4499 2200.3976
print(paste("Akaike’s information criterion:", AIC(model1_unfiltered)))
## [1] "Akaike’s information criterion: 13102.461549226"
print(paste("Bayesian information criterion:", BIC(model1_unfiltered)))
## [1] "Bayesian information criterion: 13126.0566149979"
#get accuracy
print(cor(actual_preds))
##             actuals predicted
## actuals   1.0000000 0.5295035
## predicted 0.5295035 1.0000000

Model 2

#Model 2, second trial
model2_unfiltered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_unfiltered)

print(summary(model2_unfiltered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight + 
##     Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2284.1  -226.3  -117.8    31.5  5686.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.37501  138.91292   0.010  0.99210    
## Max.resolution     0.15340    0.03085   4.973 8.03e-07 ***
## Storage.included  -2.42132    0.93330  -2.594  0.00964 ** 
## Weight             1.37082    0.12046  11.380  < 2e-16 ***
## Dimensions        -3.10866    1.28523  -2.419  0.01579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 656 on 823 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2314 
## F-statistic: 63.23 on 4 and 823 DF,  p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model2_unfiltered, test_unfiltered)))  
head(actual_preds)
##    actuals predicted
## 26     139  309.1985
## 30     149  438.8639
## 37     139  556.2971
## 38     139  576.4616
## 45     139  432.0135
## 50    4499 2199.2727
print(paste("Akaike’s information criterion:", AIC(model2_unfiltered)))
## [1] "Akaike’s information criterion: 13097.7174056187"
print(paste("Bayesian information criterion:", BIC(model2_unfiltered)))
## [1] "Bayesian information criterion: 13126.031484545"
#get accuracy
print(cor(actual_preds) )
##             actuals predicted
## actuals   1.0000000 0.5138635
## predicted 0.5138635 1.0000000

Model 3

#Model 3, third trial
model3_unfiltered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_unfiltered)

print(summary(model3_unfiltered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels + 
##     Zoom.tele..T. + Weight + Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2327.6  -249.2  -106.5    60.0  5596.0 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      576.0420   181.2031   3.179  0.00153 ** 
## Max.resolution    -0.3127     0.1058  -2.954  0.00322 ** 
## Low.resolution     0.2706     0.0507   5.338 1.22e-07 ***
## Effective.pixels  61.1598    26.4088   2.316  0.02081 *  
## Zoom.tele..T.     -1.5871     0.2398  -6.618 6.56e-11 ***
## Weight             1.5807     0.1176  13.438  < 2e-16 ***
## Dimensions        -4.0400     1.2502  -3.231  0.00128 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 633.8 on 821 degrees of freedom
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2824 
## F-statistic: 55.24 on 6 and 821 DF,  p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model3_unfiltered, test_unfiltered)))  
head(actual_preds)
##    actuals predicted
## 26     139  325.8619
## 30     149  487.1587
## 37     139  581.5300
## 38     139  644.3967
## 45     139  438.2805
## 50    4499 2653.1778
print(paste("Akaike’s information criterion:", AIC(model3_unfiltered)))
## [1] "Akaike’s information criterion: 13042.8381874199"
print(paste("Bayesian information criterion:", BIC(model3_unfiltered)))
## [1] "Bayesian information criterion: 13080.5902926549"
#get accuracy
print(cor(actual_preds))
##            actuals predicted
## actuals   1.000000  0.571435
## predicted 0.571435  1.000000

Filtered

Model 1

#Model 1, first trial
model1_filtered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_filtered)

print(summary(model1_filtered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -232.82  -81.19  -34.16   50.76  544.60 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.112e+02  3.082e+01   3.608  0.00033 ***
## Max.resolution 3.211e-02  6.202e-03   5.178 2.92e-07 ***
## Weight         2.178e-02  2.713e-02   0.803  0.42225    
## Dimensions     2.744e-01  2.837e-01   0.967  0.33368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.6 on 714 degrees of freedom
## Multiple R-squared:  0.04051,    Adjusted R-squared:  0.03648 
## F-statistic: 10.05 on 3 and 714 DF,  p-value: 1.716e-06
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model1_filtered, test_filtered)))  
head(actual_preds)
##    actuals predicted
## 8      139  191.8423
## 11     139  200.9304
## 13     139  190.1105
## 16     139  200.3858
## 23     149  178.1218
## 24     149  186.8825
print(paste("Akaike’s information criterion:", AIC(model1_filtered)))
## [1] "Akaike’s information criterion: 8876.99678932854"
print(paste("Bayesian information criterion:", BIC(model1_filtered)))
## [1] "Bayesian information criterion: 8899.87913717378"
#get accuracy
print(cor(actual_preds))
##             actuals predicted
## actuals   1.0000000 0.1830946
## predicted 0.1830946 1.0000000

Model 2

#Model 2, second trial
model2_filtered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_filtered)

print(summary(model2_filtered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight + 
##     Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236.28  -81.25  -34.25   50.25  539.70 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      111.063557  30.785041   3.608  0.00033 ***
## Max.resolution     0.034186   0.006326   5.404 8.89e-08 ***
## Storage.included  -0.263697   0.163013  -1.618  0.10618    
## Weight             0.016826   0.027272   0.617  0.53746    
## Dimensions         0.287151   0.283460   1.013  0.31139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.5 on 713 degrees of freedom
## Multiple R-squared:  0.04402,    Adjusted R-squared:  0.03866 
## F-statistic: 8.208 on 4 and 713 DF,  p-value: 1.787e-06
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model2_filtered, test_filtered)))  
head(actual_preds)
##    actuals predicted
## 8      139  192.9268
## 11     139  201.5482
## 13     139  190.1880
## 16     139  201.1275
## 23     149  178.1497
## 24     149  187.1683
print(paste("Akaike’s information criterion:", AIC(model2_filtered)))
## [1] "Akaike’s information criterion: 8876.36648723066"
print(paste("Bayesian information criterion:", BIC(model2_filtered)))
## [1] "Bayesian information criterion: 8903.82530464495"
#get accuracy
print(cor(actual_preds) )
##             actuals predicted
## actuals   1.0000000 0.1778359
## predicted 0.1778359 1.0000000

Model 3

#Model 3, third trial
model3_filtered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_filtered)

print(summary(model3_filtered))
## 
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels + 
##     Zoom.tele..T. + Weight + Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -239.66  -81.54  -24.38   48.70  527.62 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      139.732165  39.778649   3.513 0.000472 ***
## Max.resolution    -0.029171   0.022403  -1.302 0.193317    
## Low.resolution     0.058122   0.010280   5.654 2.27e-08 ***
## Effective.pixels   1.793944   5.443311   0.330 0.741823    
## Zoom.tele..T.     -0.005215   0.049059  -0.106 0.915374    
## Weight             0.059376   0.027515   2.158 0.031265 *  
## Dimensions         0.284733   0.278790   1.021 0.307451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.2 on 711 degrees of freedom
## Multiple R-squared:  0.08308,    Adjusted R-squared:  0.07534 
## F-statistic: 10.74 on 6 and 711 DF,  p-value: 1.904e-11
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model3_filtered, test_filtered)))  
head(actual_preds)
##    actuals predicted
## 8      139  225.2699
## 11     139  207.4091
## 13     139  215.2593
## 16     139  205.9247
## 23     149  183.5846
## 24     149  187.5673
print(paste("Akaike’s information criterion:", AIC(model3_filtered)))
## [1] "Akaike’s information criterion: 8850.41924475122"
print(paste("Bayesian information criterion:", BIC(model3_filtered)))
## [1] "Bayesian information criterion: 8887.03100130361"
#get accuracy
print(cor(actual_preds))
##             actuals predicted
## actuals   1.0000000 0.1632764
## predicted 0.1632764 1.0000000

Results

Using the Unfiltered Dataset

  1. First model uses 3 variables (Max resolution, Weight and Dimensions), Akaike’s and Bayesian information criterion were (13102.461549226 and 13126.0566149979) respectively, the correlation between predicted and actual values was 0.5295035 and the p values for all variables were less than 5% indicating they are statistically significant.

  2. Second Model uses 4 variables, the same from the first model in addition to (Storage), Akaike’s and Bayesian information criterion were (13097.7174056187 and 13126.031484545) respectively and the correlation between predicted and actual values was 0.5138635 and the p values for all variables were less than 5% indicating they are statistically significant.

  3. Third model uses 6 variables (Max resolution, Min resolution, effective pixels, Zoom tele, weight and Dimensions). Akaike’s and Bayesian information criterion were (13042.8381874199 and 13080.5902926549) respectively, the correlation between predicted and actual values was 0.571435 and the p values for all variables were less than 5% indicating they are statistically significant.

Using the Filtered Dataset

  1. The first model Akaike’s and Bayesian information criterion were (8876.99678932854 and 8899.87913717378) respectively and the correlation between predicted and actual values was 0.1830946.

  2. The second model Akaike’s and Bayesian information criterion were (8876.36648723066 and 8903.82530464495) respectively and the correlation between predicted and actual values was 0.1778359.

  3. The third model Akaike’s and Bayesian information criterion were (8850.41924475122 and 8887.03100130361) respectively and the correlation between predicted and actual values was 0.1632764.

References