CSE 4202 Project: Camera Dataset Analysis

Disclaimer

This analysis of the cameras dataset was done in fulfillment of an introduction to a data mining class assignment, and should not be used to make any inferences for practical applications.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

camera_data <- read.csv("./data/camera_dataset.csv")

1. What is the statistical summary of the cameras dataset after the cleansing of missing data (if any)?

glimpse(camera_data)

## Rows: 1,038
## Columns: 13
## $ Model              <fct> Agfa ePhoto 1280, Agfa ePhoto 1680, Agfa ePhoto CL…
## $ Release.date       <int> 1997, 1998, 2000, 1999, 1999, 2001, 1999, 1997, 19…
## $ Max.resolution     <dbl> 1024, 1280, 640, 1152, 1152, 1600, 1280, 640, 832,…
## $ Low.resolution     <dbl> 640, 640, 0, 640, 640, 640, 640, 0, 640, 1024, 102…
## $ Effective.pixels   <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 3, 3, 1,…
## $ Zoom.wide..W.      <dbl> 38, 38, 45, 35, 43, 51, 34, 42, 50, 35, 39, 35, 39…
## $ Zoom.tele..T.      <dbl> 114, 114, 45, 35, 43, 51, 102, 42, 50, 105, 39, 10…
## $ Normal.focus.range <dbl> 70, 50, 0, 0, 50, 50, 0, 70, 40, 76, 20, 76, 20, 7…
## $ Macro.focus.range  <dbl> 40, 0, 0, 0, 0, 20, 0, 3, 10, 16, 5, 16, 5, 16, 5,…
## $ Storage.included   <dbl> 4, 4, 2, 4, 40, 8, 8, 2, 1, 8, 8, 8, 8, 8, 32, 0, …
## $ Weight             <dbl> 420, 420, 0, 0, 300, 270, 0, 320, 460, 375, 225, 3…
## $ Dimensions         <dbl> 95, 158, 0, 0, 128, 119, 0, 93, 160, 110, 110, 110…
## $ Price              <dbl> 179, 179, 179, 269, 1299, 179, 179, 149, 139, 139,…

#looking for columns with missing data
for(i in 1:dim(camera_data)[2]){
    temp <- camera_data %>% summarise(count = sum(is.na(camera_data[i])))
    print(paste("Missing value count for", colnames(camera_data[i]), "is", temp))
}

## [1] "Missing value count for Model is 0"
## [1] "Missing value count for Release.date is 0"
## [1] "Missing value count for Max.resolution is 0"
## [1] "Missing value count for Low.resolution is 0"
## [1] "Missing value count for Effective.pixels is 0"
## [1] "Missing value count for Zoom.wide..W. is 0"
## [1] "Missing value count for Zoom.tele..T. is 0"
## [1] "Missing value count for Normal.focus.range is 0"
## [1] "Missing value count for Macro.focus.range is 1"
## [1] "Missing value count for Storage.included is 2"
## [1] "Missing value count for Weight is 2"
## [1] "Missing value count for Dimensions is 2"
## [1] "Missing value count for Price is 0"

#store rows with missing data
camera_missing <- camera_data[rowSums(is.na(camera_data)) > 0,]
#drop rows with missing data
camera_data <- camera_data %>% drop_na()

summary(camera_data)

##                     Model       Release.date  Max.resolution Low.resolution
##  Agfa ePhoto 1280      :   1   Min.   :1994   Min.   :   0   Min.   :   0  
##  Agfa ePhoto 1680      :   1   1st Qu.:2002   1st Qu.:2048   1st Qu.:1120  
##  Agfa ePhoto CL18      :   1   Median :2004   Median :2560   Median :2048  
##  Agfa ePhoto CL30      :   1   Mean   :2004   Mean   :2473   Mean   :1775  
##  Agfa ePhoto CL30 Clik!:   1   3rd Qu.:2006   3rd Qu.:3072   3rd Qu.:2560  
##  Agfa ePhoto CL45      :   1   Max.   :2007   Max.   :5616   Max.   :4992  
##  (Other)               :1030                                               
##  Effective.pixels Zoom.wide..W.   Zoom.tele..T.   Normal.focus.range
##  Min.   : 0.00    Min.   : 0.00   Min.   :  0.0   Min.   :  0.00    
##  1st Qu.: 3.00    1st Qu.:35.00   1st Qu.: 96.0   1st Qu.: 30.00    
##  Median : 4.00    Median :36.00   Median :108.0   Median : 50.00    
##  Mean   : 4.59    Mean   :32.96   Mean   :121.5   Mean   : 44.13    
##  3rd Qu.: 7.00    3rd Qu.:38.00   3rd Qu.:117.0   3rd Qu.: 60.00    
##  Max.   :21.00    Max.   :52.00   Max.   :518.0   Max.   :120.00    
##                                                                     
##  Macro.focus.range Storage.included     Weight         Dimensions   
##  Min.   : 0.000    Min.   :  0.00   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 3.000    1st Qu.:  8.00   1st Qu.: 180.0   1st Qu.: 92.0  
##  Median : 6.000    Median : 16.00   Median : 226.0   Median :101.0  
##  Mean   : 7.786    Mean   : 17.45   Mean   : 319.3   Mean   :105.4  
##  3rd Qu.:10.000    3rd Qu.: 20.00   3rd Qu.: 350.0   3rd Qu.:115.0  
##  Max.   :85.000    Max.   :450.00   Max.   :1860.0   Max.   :240.0  
##                                                                     
##      Price       
##  Min.   :  14.0  
##  1st Qu.: 149.0  
##  Median : 199.0  
##  Mean   : 457.9  
##  3rd Qu.: 399.0  
##  Max.   :7999.0  
##

Results

Checking for missing data turned out to be true for 2 rows, and was distributed amongst the variables (Macro Focus Range, Storage, Weight and Dimensions).
These missing values were dropped from the dataset.
The dataset was then summarised which displayed a general overview of every variable, this was beneficial for the construction of other EDA questions.

2. Between the dates (1994 and 2007), was there an increase in the number of camera models produced? Also, what is the rate of growth in the number camera models (if any) in subsequent years beginning from 1994?

#store the dates
dates <- camera_data$Release.date
# Barplot
camera_data %>% ggplot(aes(x=as.factor(dates), fill=as.factor(dates))) + 
    geom_bar(width=0.7) +
    ggtitle("Number of Models Produced Thoughout the Years")

    #ggsave("question 2 bar plot.jpg")

#extract distinct dates from the dataset
distinct_dates <- camera_data %>% select(Release.date) %>% add_count(Release.date) %>% unique() %>% arrange(Release.date)

# calculate the growth
rates <- c(0.0000)

for(i in 1:(dim(distinct_dates)[1] - 1)){
    diff <- distinct_dates[i+1, ][2] - distinct_dates[i, ][2]
    percent <- ( diff / distinct_dates[i, ][2] ) * 100
    rates <- c(rates, as.numeric(percent))
    
    fmt <- sprintf("The difference and growth rate from year %d to %d is (%d, %.2f)", as.integer(distinct_dates[i, ][1]), as.integer(distinct_dates[i+1, ][1]), as.integer(diff), as.double(percent))
    print(fmt)
}

## [1] "The difference and growth rate from year 1994 to 1995 is (0, 0.00)"
## [1] "The difference and growth rate from year 1995 to 1996 is (3, 300.00)"
## [1] "The difference and growth rate from year 1996 to 1997 is (7, 175.00)"
## [1] "The difference and growth rate from year 1997 to 1998 is (21, 190.91)"
## [1] "The difference and growth rate from year 1998 to 1999 is (21, 65.62)"
## [1] "The difference and growth rate from year 1999 to 2000 is (8, 15.09)"
## [1] "The difference and growth rate from year 2000 to 2001 is (24, 39.34)"
## [1] "The difference and growth rate from year 2001 to 2002 is (4, 4.71)"
## [1] "The difference and growth rate from year 2002 to 2003 is (12, 13.48)"
## [1] "The difference and growth rate from year 2003 to 2004 is (40, 39.60)"
## [1] "The difference and growth rate from year 2004 to 2005 is (2, 1.42)"
## [1] "The difference and growth rate from year 2005 to 2006 is (9, 6.29)"
## [1] "The difference and growth rate from year 2006 to 2007 is (10, 6.58)"

final <- data.frame(year = distinct_dates[1], growth = rates)

final %>% ggplot(aes(x=as.factor(Release.date), y=growth, group=1) ) +
    geom_line(color="grey") +
    geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
    ggtitle("The Rate of Growth Between Every Year from 1994 to 2007")

    #ggsave("question 2 line plot.jpg")

Results

Plotting the total number of camera models from the year 1994 to 2007 does show an increase in production with every year having more than the last.
The rate of growth in the line graph shows from the year 1995 to 1996, for that time, there was a massive growth of 300%, which then diminished to a lower rate by the year 2000. From the years (2000 to 2001) and (2003 to 2004) the growth rate was the highest between the years (2000 to 2007), however, after 2004 the growth rate was very negligible.

3. Are there any outliers from the prices variable over the year 2000 in the dataset? How many camera models are priced over $2000?

camera_data %>% filter(Release.date > 2000) %>% ggplot(aes(x=as.factor(Release.date), y=Price, group=Release.date, fill=Release.date)) +
    geom_boxplot() +
    ggtitle("Box Plot of Prices over the Year 2000")

    #ggsave("question 3 box plot.jpg")

camera_data %>% filter(Release.date > 2000 & Price > 2000)

##                     Model Release.date Max.resolution Low.resolution
## 1            Canon EOS-1D         2001           2464           1232
## 2    Canon EOS-1D Mark II         2004           3504           3104
## 3   Canon EOS-1D Mark III         2007           3888           3456
## 4  Canon EOS-1D Mark II N         2005           3504           3104
## 5           Canon EOS-1Ds         2002           4064           2032
## 6   Canon EOS-1Ds Mark II         2004           4992           3600
## 7  Canon EOS-1Ds Mark III         2007           5616           4992
## 8            Canon EOS 5D         2005           4368           3168
## 9              Epson R-D1         2004           3008           2240
## 10              Nikon D1X         2001           3008           2000
## 11             Nikon D2Hs         2005           2464           1840
## 12              Nikon D2X         2004           4288           3216
## 13             Nikon D2Xs         2006           4288           3216
## 14               Nikon D3         2007           4256           3184
## 15       Ricoh GR Digital         2005           3264           2592
## 16         Ricoh RDC-i500         2001           2048           1024
##    Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                 4             0             0                  0
## 2                 8             0             0                  0
## 3                10             0             0                  0
## 4                 8             0             0                  0
## 5                11             0             0                  0
## 6                16             0             0                  0
## 7                21             0             0                  0
## 8                12             0             0                  0
## 9                 6             0             0                  0
## 10                5             0             0                  0
## 11                4             0             0                  0
## 12               12             0             0                  0
## 13               12             0             0                  0
## 14               12             0             0                  0
## 15                8            28            28                 30
## 16                3            35           105                 24
##    Macro.focus.range Storage.included Weight Dimensions Price
## 1                  0                0   1585        156  4499
## 2                  0                0   1565        156  4499
## 3                  0                0   1335        156  4499
## 4                  0                0   1565        156  4499
## 5                  0                0   1585        156  7999
## 6                  0                0   1565        156  7999
## 7                  0                0   1385        150  7999
## 8                  0                0    895        152  2499
## 9                  0                0    620        142  4499
## 10                 0                0   1200        157  4699
## 11                 0                0   1200        158  4699
## 12                 0                0   1200        158  4699
## 13                 0                0   1200        158  4699
## 14                 0                0   1300        160  4999
## 15                 2               26    200        107  4999
## 16                 1                8    320        142  4999

Results

From the years 2001 to 2007 there are outliers. However between the price ranges of: ($1500 to $2000), ($4000 to $5000) and ($7500 to $8000) there is a price trend which indicates the higher end camera models. This further indicates the data distribution of the prices variable is skewed, due to majority of prices less than $1500.
There is a total of 16 camera models priced over $2000.

4. Which camera model is more cost efficient, costing less than $200, has a Max resolution (greater than 2160) and a storage size greater than 24gb in the year 2007?

camera_data %>% filter(Price < 200 & Max.resolution > 2160 & Release.date == 2007 & Storage.included > 24) %>% apply(2,min)

##                    Model             Release.date           Max.resolution 
## "Canon PowerShot SD1000"                   "2007"                   "2816" 
##           Low.resolution         Effective.pixels            Zoom.wide..W. 
##                   "   0"                     " 6"                     "28" 
##            Zoom.tele..T.       Normal.focus.range        Macro.focus.range 
##                    "101"                     " 0"                     " 0" 
##         Storage.included                   Weight               Dimensions 
##                    " 25"                    "  0"                    " 86" 
##                    Price 
##                    " 99"

Results

In the year 2007, the ‘Canon PowerShot SD1000’ was found to be the most cost efficient, costing only $99, having a storage size of 25gb and a maximum resolution of 2816p

5. Given the different manufacturers in the dataset, plot the total price of all camera models per manufacturer, in a year for every year until 2007? with this insight, deduce:

Which manufacturer resulted in the lowest and highest after accumulating all prices per manufacturer throughout the years.
How many manufacturers have stopped producing cameras by the year 2007.
The year when most manufacturers produced.

manufacturers <- c("Agfa ePhoto", "Canon", "Casio", "Contax", "Epson", "Fujifilm", "HP","Kodak", "Kyocera", "Finecam", "Leica", "Nikon", "Olympus", "Panasonic", "Pentax", "Ricoh", "Samsung", "Sanyo", "Sigma", "Sony", "Toshiba")
accumulated_total <- c(rep(0, times = length(manufacturers)))

for(i in 1:dim(distinct_dates[1])[1]){

    price_per_manu <- data.frame(manufacturers = character(), price = numeric())
    
    for(j in 1:length(manufacturers)){
        
        result <- camera_data %>% filter(Release.date == as.numeric(distinct_dates[1][i,]) & str_detect(Model, manufacturers[j]))
        total <- sum(result$Price)
        
        price_per_manu <- rbind(price_per_manu, data.frame(manufacturers = manufacturers[j], price = total))
        
        accumulated_total[j] <- accumulated_total[j] + total
    }
    
    plot <- price_per_manu %>% ggplot(aes(x=reorder(manufacturers, price), y=price)) +
        geom_bar(stat="identity", fill="#69b3a2", width=.4) +
        coord_flip() + 
        xlab("") +
        theme_bw() +
        ggtitle("Total Price Per Manufacturers in year: ", distinct_dates[1][i,])
    
    print(plot)
    #ggsave(paste("Year", distinct_dates[1][i,],".png", sep = ""))
}

accumulated_total_manu <- data.frame(manufacturers = manufacturers, price = accumulated_total)

accumulated_total_manu %>%
    ggplot(aes(x = reorder(manufacturers, price), y=price)) +
    geom_bar(stat="identity", fill="#69b3a2", width=.4) +
    coord_flip() + 
    xlab("") +
    theme_bw() +
    ggtitle("Accumulated Price of all Manufacturers Throughout the Years")

#ggsave("total_summary.png")

Results

15 different plots were generated which shows the total price for each manufacturer for that year and the accumulated price per manufacturer.
The manufacturer with the lowest accumulated price was Contax and the manufacturer with the highest accumulated price was Canon.
By the year 2007, 9 manufacturers stopped producing any cameras
In the year 2003, 19 of 21 manufacturers produced cameras while the 2 that did not produced cameras were Agfa ePhoto and Contax

6. How many camera models, with the following variables: [zoom wide, zoom tele, normal focus range, macro focus range] have a value less than 1? Also is there a trend that camera models with less than 1 in the corresponding variables are manufacturer specific or not?

less_than_ones <- camera_data %>% filter(Zoom.wide..W. < 1 & Zoom.tele..T. < 1 & Normal.focus.range < 1 & Macro.focus.range < 1)
dim(less_than_ones)[1]

## [1] 85

head(less_than_ones)

##                    Model Release.date Max.resolution Low.resolution
## 1          Canon EOS 10D         2003           3072           2048
## 2           Canon EOS-1D         2001           2464           1232
## 3   Canon EOS-1D Mark II         2004           3504           3104
## 4  Canon EOS-1D Mark III         2007           3888           3456
## 5 Canon EOS-1D Mark II N         2005           3504           3104
## 6          Canon EOS-1Ds         2002           4064           2032
##   Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                6             0             0                  0
## 2                4             0             0                  0
## 3                8             0             0                  0
## 4               10             0             0                  0
## 5                8             0             0                  0
## 6               11             0             0                  0
##   Macro.focus.range Storage.included Weight Dimensions Price
## 1                 0                0    875        150  1299
## 2                 0                0   1585        156  4499
## 3                 0                0   1565        156  4499
## 4                 0                0   1335        156  4499
## 5                 0                0   1565        156  4499
## 6                 0                0   1585        156  7999

Results

There are 85 camera models with a value of 0 in [zoom wide, zoom tele, normal focus range, macro focus range]
A select few of the manufacturers: Canon, Fujifilm and Kodak have this value which means that its not manufacturer specific but the data for these specific camera models were unavailable.

7. What kind of data distributions does the following variables have: [dimensions, weight, zoom tele and prices]?

hist1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Dimensions Distribution")

dense1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Dimensions Distribution")

grid.arrange(hist1, dense1, ncol=2)

#ggsave("question 7 hist_density plot1.jpg", grid.arrange(hist1, dense1, ncol=2))

hist2 <- camera_data %>% ggplot(aes(x=Weight)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Weight Distribution")

dense2 <- camera_data %>% ggplot(aes(x=Weight)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Weight Distribution")

grid.arrange(hist2, dense2, ncol=2)

#ggsave("question 7 hist_density plot2.jpg", grid.arrange(hist2, dense2, ncol=2))

hist3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Zoom Tele Distribution")

dense3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Zoom Tele Distribution")

grid.arrange(hist3, dense3, ncol=2)

#ggsave("question 7 hist_density plot3.jpg", grid.arrange(hist3, dense3, ncol=2))

hist4 <- camera_data %>% ggplot(aes(x=Price)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
    ggtitle("Price Distribution")

dense4 <- camera_data %>% ggplot(aes(x=Price)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
    ggtitle("Price Distribution")

grid.arrange(hist4, dense4, ncol=2)

#ggsave("question 7 hist_density plot4.jpg", grid.arrange(hist4, dense4, ncol=2))

Results

The dimensions distribution is multimodal.
The weight distribution is right skewed and multimodal.
Zoom tele distribution is right skewed and multimodal.
Price distribution is right skewed and multimodal.

8. What kind of relationship exists between the 3 variable pairs: [storage and max resolution], [weight and dimensions], [price and zoom tele], under the condition that [zoom wide, zoom tele, normal focus range, macro focus range] is greater than 1?

filtered_set <- camera_data %>% filter(Zoom.wide..W. > 1 & Zoom.tele..T. > 1 & Normal.focus.range > 1 & Macro.focus.range > 1)

filtered_set %>% filter(Storage.included < 100) %>% ggplot(aes(x=Max.resolution, y=Storage.included)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Storage vs Max Resolution")

#ggsave("question 8 scatter plot1.jpg")

filtered_set %>% ggplot(aes(x=Weight, y=Dimensions)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Dimensions vs Weight")

#ggsave("question 8 scatter plot2.jpg")

filtered_set %>% filter(Price < 2000) %>%ggplot(aes(x=Zoom.tele..T., y=Price)) +
  geom_point(
        color="blue",
        fill="#69b3a2",
        shape=18,
        alpha=0.5,
        size=3,
        stroke = 1) +
  geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
    ggtitle("Price vs Zoom tele")

#ggsave("question 8 scatter plot3.jpg")

facet_wrap( ~ align)

## <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>

Results

The relationship between storage and max resolution is positive.
The relationship between dimensions and weight is also positive.
The relationship between price and zoom tele seems to be a little negative to no relationship.

9. As a follow up question to question 8, what are the correlation coefficients of 3 relationships identified?

relation_1 <- data.frame(cbind(max_res = filtered_set$Max.resolution, storage = filtered_set$Storage.included))
relation_2 <- data.frame(cbind(weight = filtered_set$Weight, dims = filtered_set$Dimensions))
relation_3 <- data.frame(cbind(zoom_t = filtered_set$Zoom.tele..T., price = filtered_set$Price))

cor(relation_1)

##           max_res   storage
## max_res 1.0000000 0.2538378
## storage 0.2538378 1.0000000

cor(relation_2)

##          weight     dims
## weight 1.000000 0.524276
## dims   0.524276 1.000000

cor(relation_3)

##             zoom_t       price
## zoom_t  1.00000000 -0.04432165
## price  -0.04432165  1.00000000

Results

The correlation coefficient for the relationship between storage and max resolution is 0.2538378.
The correlation coefficient for the relationship between dimensions and weight is 0.524276.
The correlation coefficient for the relationship between price and zoom tele is -0.04432165.

10. Find all camera models with a value greater than 1 in the variables [zoom wide and zoom tele]? What is the zoom ratio of those camera models using the variables [zoom wide and zoom tele]? Does the zoom ratio get larger from the years 1994 to 2007?

camera_ratio <- camera_data %>% filter(Zoom.tele..T. > 0 & Zoom.wide..W. > 0) %>% mutate(zoom_ratio = Zoom.tele..T. / Zoom.wide..W.)
head(camera_ratio)

##                    Model Release.date Max.resolution Low.resolution
## 1       Agfa ePhoto 1280         1997           1024            640
## 2       Agfa ePhoto 1680         1998           1280            640
## 3       Agfa ePhoto CL18         2000            640              0
## 4       Agfa ePhoto CL30         1999           1152            640
## 5 Agfa ePhoto CL30 Clik!         1999           1152            640
## 6       Agfa ePhoto CL45         2001           1600            640
##   Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1                0            38           114                 70
## 2                1            38           114                 50
## 3                0            45            45                  0
## 4                0            35            35                  0
## 5                0            43            43                 50
## 6                1            51            51                 50
##   Macro.focus.range Storage.included Weight Dimensions Price zoom_ratio
## 1                40                4    420         95   179          3
## 2                 0                4    420        158   179          3
## 3                 0                2      0          0   179          1
## 4                 0                4      0          0   269          1
## 5                 0               40    300        128  1299          1
## 6                20                8    270        119   179          1

camera_ratio %>% ggplot(aes(x=as.factor(Release.date), y=zoom_ratio)) +
    geom_bar(stat="identity", fill="#69b3a2", width=.5)

#ggsave("question 10 bar plot.jpg")

Results

Using the variables zoom wide and zoom tele, a mutate function was used to calculate the zoom ratio for every camera models when zoom tele and zoom ratio is greater than 0.
Throughout the years the zoom ratio gets larger with every subsequent year.

b. Regression Modeling

#splitting and filtering dataset

camera_data_unfiltered <- camera_data
camera_data_filtered <- camera_data %>% filter(Price < 1000)
set.seed(123)

#unfiltered dataset [80, 20] split
split = sort(sample(nrow(camera_data_unfiltered), nrow(camera_data_unfiltered)*.8))
train_unfiltered <- camera_data_unfiltered[split,]
test_unfiltered <- camera_data_unfiltered[-split,]

#filtered dataset [80, 20] split
split = sort(sample(nrow(camera_data_filtered), nrow(camera_data_filtered)*.8))
train_filtered <- camera_data_filtered[split,]
test_filtered <- camera_data_filtered[-split,]

Unfiltered

Model 1

#Model 1, first trial
model1_unfiltered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_unfiltered)

print(summary(model1_unfiltered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2306.6  -226.7  -112.1    26.4  5693.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.86387  139.32734  -0.071   0.9436    
## Max.resolution   0.13631    0.03024   4.507 7.51e-06 ***
## Weight           1.42113    0.11931  11.912  < 2e-16 ***
## Dimensions      -3.15011    1.28959  -2.443   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 658.2 on 824 degrees of freedom
## Multiple R-squared:  0.2288, Adjusted R-squared:  0.226 
## F-statistic:  81.5 on 3 and 824 DF,  p-value: < 2.2e-16

actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model1_unfiltered, test_unfiltered)))  
head(actual_preds)

##    actuals predicted
## 26     139  309.4887
## 30     149  420.6426
## 37     139  539.3297
## 38     139  591.6709
## 45     139  472.7551
## 50    4499 2200.3976

print(paste("Akaike’s information criterion:", AIC(model1_unfiltered)))

## [1] "Akaike’s information criterion: 13102.461549226"

print(paste("Bayesian information criterion:", BIC(model1_unfiltered)))

## [1] "Bayesian information criterion: 13126.0566149979"

#get accuracy
print(cor(actual_preds))

##             actuals predicted
## actuals   1.0000000 0.5295035
## predicted 0.5295035 1.0000000

Model 2

#Model 2, second trial
model2_unfiltered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_unfiltered)

print(summary(model2_unfiltered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight + 
##     Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2284.1  -226.3  -117.8    31.5  5686.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.37501  138.91292   0.010  0.99210    
## Max.resolution     0.15340    0.03085   4.973 8.03e-07 ***
## Storage.included  -2.42132    0.93330  -2.594  0.00964 ** 
## Weight             1.37082    0.12046  11.380  < 2e-16 ***
## Dimensions        -3.10866    1.28523  -2.419  0.01579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 656 on 823 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2314 
## F-statistic: 63.23 on 4 and 823 DF,  p-value: < 2.2e-16

actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model2_unfiltered, test_unfiltered)))  
head(actual_preds)

##    actuals predicted
## 26     139  309.1985
## 30     149  438.8639
## 37     139  556.2971
## 38     139  576.4616
## 45     139  432.0135
## 50    4499 2199.2727

print(paste("Akaike’s information criterion:", AIC(model2_unfiltered)))

## [1] "Akaike’s information criterion: 13097.7174056187"

print(paste("Bayesian information criterion:", BIC(model2_unfiltered)))

## [1] "Bayesian information criterion: 13126.031484545"

#get accuracy
print(cor(actual_preds) )

##             actuals predicted
## actuals   1.0000000 0.5138635
## predicted 0.5138635 1.0000000

Model 3

#Model 3, third trial
model3_unfiltered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_unfiltered)

print(summary(model3_unfiltered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels + 
##     Zoom.tele..T. + Weight + Dimensions, data = train_unfiltered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2327.6  -249.2  -106.5    60.0  5596.0 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      576.0420   181.2031   3.179  0.00153 ** 
## Max.resolution    -0.3127     0.1058  -2.954  0.00322 ** 
## Low.resolution     0.2706     0.0507   5.338 1.22e-07 ***
## Effective.pixels  61.1598    26.4088   2.316  0.02081 *  
## Zoom.tele..T.     -1.5871     0.2398  -6.618 6.56e-11 ***
## Weight             1.5807     0.1176  13.438  < 2e-16 ***
## Dimensions        -4.0400     1.2502  -3.231  0.00128 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 633.8 on 821 degrees of freedom
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2824 
## F-statistic: 55.24 on 6 and 821 DF,  p-value: < 2.2e-16

actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model3_unfiltered, test_unfiltered)))  
head(actual_preds)

##    actuals predicted
## 26     139  325.8619
## 30     149  487.1587
## 37     139  581.5300
## 38     139  644.3967
## 45     139  438.2805
## 50    4499 2653.1778

print(paste("Akaike’s information criterion:", AIC(model3_unfiltered)))

## [1] "Akaike’s information criterion: 13042.8381874199"

print(paste("Bayesian information criterion:", BIC(model3_unfiltered)))

## [1] "Bayesian information criterion: 13080.5902926549"

#get accuracy
print(cor(actual_preds))

##            actuals predicted
## actuals   1.000000  0.571435
## predicted 0.571435  1.000000

Filtered

Model 1

#Model 1, first trial
model1_filtered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_filtered)

print(summary(model1_filtered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -232.82  -81.19  -34.16   50.76  544.60 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.112e+02  3.082e+01   3.608  0.00033 ***
## Max.resolution 3.211e-02  6.202e-03   5.178 2.92e-07 ***
## Weight         2.178e-02  2.713e-02   0.803  0.42225    
## Dimensions     2.744e-01  2.837e-01   0.967  0.33368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.6 on 714 degrees of freedom
## Multiple R-squared:  0.04051,    Adjusted R-squared:  0.03648 
## F-statistic: 10.05 on 3 and 714 DF,  p-value: 1.716e-06

actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model1_filtered, test_filtered)))  
head(actual_preds)

##    actuals predicted
## 8      139  191.8423
## 11     139  200.9304
## 13     139  190.1105
## 16     139  200.3858
## 23     149  178.1218
## 24     149  186.8825

print(paste("Akaike’s information criterion:", AIC(model1_filtered)))

## [1] "Akaike’s information criterion: 8876.99678932854"

print(paste("Bayesian information criterion:", BIC(model1_filtered)))

## [1] "Bayesian information criterion: 8899.87913717378"

#get accuracy
print(cor(actual_preds))

##             actuals predicted
## actuals   1.0000000 0.1830946
## predicted 0.1830946 1.0000000

Model 2

#Model 2, second trial
model2_filtered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_filtered)

print(summary(model2_filtered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight + 
##     Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236.28  -81.25  -34.25   50.25  539.70 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      111.063557  30.785041   3.608  0.00033 ***
## Max.resolution     0.034186   0.006326   5.404 8.89e-08 ***
## Storage.included  -0.263697   0.163013  -1.618  0.10618    
## Weight             0.016826   0.027272   0.617  0.53746    
## Dimensions         0.287151   0.283460   1.013  0.31139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.5 on 713 degrees of freedom
## Multiple R-squared:  0.04402,    Adjusted R-squared:  0.03866 
## F-statistic: 8.208 on 4 and 713 DF,  p-value: 1.787e-06

actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model2_filtered, test_filtered)))  
head(actual_preds)

##    actuals predicted
## 8      139  192.9268
## 11     139  201.5482
## 13     139  190.1880
## 16     139  201.1275
## 23     149  178.1497
## 24     149  187.1683

print(paste("Akaike’s information criterion:", AIC(model2_filtered)))

## [1] "Akaike’s information criterion: 8876.36648723066"

print(paste("Bayesian information criterion:", BIC(model2_filtered)))

## [1] "Bayesian information criterion: 8903.82530464495"

#get accuracy
print(cor(actual_preds) )

##             actuals predicted
## actuals   1.0000000 0.1778359
## predicted 0.1778359 1.0000000

Model 3

#Model 3, third trial
model3_filtered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_filtered)

print(summary(model3_filtered))

## 
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels + 
##     Zoom.tele..T. + Weight + Dimensions, data = train_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -239.66  -81.54  -24.38   48.70  527.62 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      139.732165  39.778649   3.513 0.000472 ***
## Max.resolution    -0.029171   0.022403  -1.302 0.193317    
## Low.resolution     0.058122   0.010280   5.654 2.27e-08 ***
## Effective.pixels   1.793944   5.443311   0.330 0.741823    
## Zoom.tele..T.     -0.005215   0.049059  -0.106 0.915374    
## Weight             0.059376   0.027515   2.158 0.031265 *  
## Dimensions         0.284733   0.278790   1.021 0.307451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.2 on 711 degrees of freedom
## Multiple R-squared:  0.08308,    Adjusted R-squared:  0.07534 
## F-statistic: 10.74 on 6 and 711 DF,  p-value: 1.904e-11

actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model3_filtered, test_filtered)))  
head(actual_preds)

##    actuals predicted
## 8      139  225.2699
## 11     139  207.4091
## 13     139  215.2593
## 16     139  205.9247
## 23     149  183.5846
## 24     149  187.5673

print(paste("Akaike’s information criterion:", AIC(model3_filtered)))

## [1] "Akaike’s information criterion: 8850.41924475122"

print(paste("Bayesian information criterion:", BIC(model3_filtered)))

## [1] "Bayesian information criterion: 8887.03100130361"

#get accuracy
print(cor(actual_preds))

##             actuals predicted
## actuals   1.0000000 0.1632764
## predicted 0.1632764 1.0000000

Results

Two datasets were created to compare 2 sets of 3 models each, the unfiltered dataset is the regular camera dataset while the unfiltered dataset removes the skew in the prices distribution where the prices are less than $1000.
All 6 models utilize multiple linear regression instead of simple linear regression to predict the price of a camera model.

Using the Unfiltered Dataset

First model uses 3 variables (Max resolution, Weight and Dimensions), Akaike’s and Bayesian information criterion were (13102.461549226 and 13126.0566149979) respectively, the correlation between predicted and actual values was 0.5295035 and the p values for all variables were less than 5% indicating they are statistically significant.
Second Model uses 4 variables, the same from the first model in addition to (Storage), Akaike’s and Bayesian information criterion were (13097.7174056187 and 13126.031484545) respectively and the correlation between predicted and actual values was 0.5138635 and the p values for all variables were less than 5% indicating they are statistically significant.
Third model uses 6 variables (Max resolution, Min resolution, effective pixels, Zoom tele, weight and Dimensions). Akaike’s and Bayesian information criterion were (13042.8381874199 and 13080.5902926549) respectively, the correlation between predicted and actual values was 0.571435 and the p values for all variables were less than 5% indicating they are statistically significant.

Finilizing the results for the models above, the third model performed the best with a lower AIC and BIC score, a higher correlation, as well as the smallest residual standard error. The first and second model were tied closely with a very small difference in correlation and AIC however model 2 is slightly better with a smaller residual standard error. For the unfiltered dataset, the third model was chosen to predict prices.

Using the Filtered Dataset

The first 2 following models are the same as the first 2 unfiltered models but were applied to a dataset without a price skew. The p values for the variables of all 3 models are different indicating not all variables are significant. Even though not significant variables are that effective they were kept to keep the models consistent so only the data changes.

The first model Akaike’s and Bayesian information criterion were (8876.99678932854 and 8899.87913717378) respectively and the correlation between predicted and actual values was 0.1830946.
The second model Akaike’s and Bayesian information criterion were (8876.36648723066 and 8903.82530464495) respectively and the correlation between predicted and actual values was 0.1778359.
The third model Akaike’s and Bayesian information criterion were (8850.41924475122 and 8887.03100130361) respectively and the correlation between predicted and actual values was 0.1632764.

Finilizing these 3 models, model one performed the best while model 3 performed the worst. However, comparing unfiltered vs filtered models, filtered models definately performed better than the unfiltered. Model 1 filtered is the best model out of all 6 however it does not take into consideration the skewed data for prices; model 3 unfiltered does but the predicted prices are larger than usual. This leads to the conclusion that the same model configuration can not work given the same variables and a better configuration would inevitably lead to more accurate predictions with a filtered dataset.

CSE 4202 Project: Camera Dataset Analysis

Abdur-Rahman Baksh

25/05/2020

Disclaimer

1. What is the statistical summary of the cameras dataset after the cleansing of missing data (if any)?

Results

2. Between the dates (1994 and 2007), was there an increase in the number of camera models produced? Also, what is the rate of growth in the number camera models (if any) in subsequent years beginning from 1994?

Results

3. Are there any outliers from the prices variable over the year 2000 in the dataset? How many camera models are priced over $2000?

Results

4. Which camera model is more cost efficient, costing less than $200, has a Max resolution (greater than 2160) and a storage size greater than 24gb in the year 2007?

Results

5. Given the different manufacturers in the dataset, plot the total price of all camera models per manufacturer, in a year for every year until 2007? with this insight, deduce:

Results

6. How many camera models, with the following variables: [zoom wide, zoom tele, normal focus range, macro focus range] have a value less than 1? Also is there a trend that camera models with less than 1 in the corresponding variables are manufacturer specific or not?

Results

7. What kind of data distributions does the following variables have: [dimensions, weight, zoom tele and prices]?

Results

8. What kind of relationship exists between the 3 variable pairs: [storage and max resolution], [weight and dimensions], [price and zoom tele], under the condition that [zoom wide, zoom tele, normal focus range, macro focus range] is greater than 1?

Results

9. As a follow up question to question 8, what are the correlation coefficients of 3 relationships identified?

Results

10. Find all camera models with a value greater than 1 in the variables [zoom wide and zoom tele]? What is the zoom ratio of those camera models using the variables [zoom wide and zoom tele]? Does the zoom ratio get larger from the years 1994 to 2007?

Results

b. Regression Modeling

Unfiltered

Model 1

Model 2

Model 3

Filtered

Model 1

Model 2

Model 3

Results

Using the Unfiltered Dataset

Using the Filtered Dataset

References