This document provides a brief overview of data analysis and visualization techniques using R. The dataset used in this overview was taken from: https://www.kaggle.com/crawford/1000-cameras-dataset/#
Disclaimer: This report was done in fulfillment of an Introduction to Data Mining class assignment ~ Database System Design and Information Management Systems II (CSE4202).

 

Dimension of the Dataset

## The dimension of the dataset is: 258 14
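
The line above is the kind of output that `dim()` gives for a data frame. The following is a minimal sketch, not the report's actual code: `cameras` is a hypothetical stand-in built from a few rows printed later in this report, so the snippet runs without the Kaggle CSV file.

```r
# Hypothetical stand-in for the full dataset: three rows copied from the
# head(test) output later in this report (the real data has 258 rows).
cameras <- data.frame(
  Model          = c("Canon PowerShot 350", "Canon PowerShot 600",
                     "Canon PowerShot A30"),
  Release.date   = c(1997, 1996, 2002),
  Max.resolution = c(640, 832, 1280),
  Price          = c(149, 139, 139)
)

# dim() returns c(number of rows, number of columns)
dims <- dim(cameras)
cat("The dimension of the dataset is:", dims, "\n")
```

With the full file loaded, the same call would report the 258 rows and 14 columns shown above.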

 

Summary of the Dataset | General Statistical Operation of the Dataset

##      Brand                        Model      Release.date  Max.resolution
##  Min.   :1.000   Canon PowerShot 350 :  1   Min.   :1996   Min.   : 512  
##  1st Qu.:1.000   Canon PowerShot 600 :  1   1st Qu.:2002   1st Qu.:2048  
##  Median :2.000   Canon PowerShot A10 :  1   Median :2004   Median :2592  
##  Mean   :1.907   Canon PowerShot A100:  1   Mean   :2004   Mean   :2547  
##  3rd Qu.:3.000   Canon PowerShot A20 :  1   3rd Qu.:2006   3rd Qu.:3072  
##  Max.   :3.000   Canon PowerShot A200:  1   Max.   :2007   Max.   :4256  
##                  (Other)             :252                                
##  Low.resolution Effective.pixels Zoom.wide..W.  Zoom.tele..T.  
##  Min.   :   0   Min.   : 0.000   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.:1600   1st Qu.: 3.000   1st Qu.:35.0   1st Qu.:105.0  
##  Median :2048   Median : 5.000   Median :36.0   Median :111.5  
##  Mean   :1894   Mean   : 4.597   Mean   :35.1   Mean   :131.9  
##  3rd Qu.:2304   3rd Qu.: 6.750   3rd Qu.:38.0   3rd Qu.:138.2  
##  Max.   :3264   Max.   :12.000   Max.   :52.0   Max.   :486.0  
##                                                                
##  Normal.focus.range Macro.focus.range Storage.included Weight..inc..batteries.
##  Min.   : 0.00      Min.   : 0.000    Min.   : 0.00    Min.   :120.0          
##  1st Qu.:30.00      1st Qu.: 4.000    1st Qu.:12.00    1st Qu.:185.0          
##  Median :50.00      Median : 5.000    Median :16.00    Median :225.0          
##  Mean   :42.84      Mean   : 6.671    Mean   :17.65    Mean   :284.2          
##  3rd Qu.:60.00      3rd Qu.:10.000    3rd Qu.:19.00    3rd Qu.:320.0          
##  Max.   :90.00      Max.   :20.000    Max.   :64.00    Max.   :930.0          
##                                                                               
##    Dimensions        Price       
##  Min.   : 60.0   Min.   :  99.0  
##  1st Qu.: 90.0   1st Qu.: 159.0  
##  Median : 98.0   Median : 199.0  
##  Mean   :100.6   Mean   : 251.8  
##  3rd Qu.:110.0   3rd Qu.: 229.0  
##  Max.   :160.0   Max.   :1699.0  
## 

 

Question 1: Which year had the most camera releases?

Discussion for Question 1:

Based on our dataset, the number of camera releases increased steadily over the years. Most of the cameras went on sale in 2004, 2006 and 2007. From 2000 onwards, the number of releases appears to have doubled, which is a strong indication of rapid technological growth.
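
One way to answer this question numerically is to count the rows per release year. The sketch below assumes a vector of release years; `release_year` is a hypothetical name and holds only the twelve years visible in the head(train) and head(test) outputs of this report, so its result does not represent the full 258-row dataset.

```r
# Release years copied from the twelve rows printed in head(train) and
# head(test); a tiny, non-representative sample used only for illustration.
release_year <- c(2002, 2005, 2003, 2002, 2006, 2004,
                  1997, 1996, 2002, 2005, 2006, 1998)

# table() counts how many cameras fall in each year
releases_per_year <- table(release_year)

# which.max() gives the position of the largest count
busiest_year <- names(releases_per_year)[which.max(releases_per_year)]

print(releases_per_year)
cat("Year with the most releases in this sample:", busiest_year, "\n")
```

On the full dataset the same two calls would be applied to the Release.date column.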

 

Question 2: Has there been an increase in Max Resolution over the years?

Discussion for Question 2:

The plot shows a steady increase in the cameras' max resolution over the period 1996-2007. From 2000, the max resolution reached the 3000-pixel range, and from 2002 it increased by roughly another 1000 pixels.
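
The trend described above can also be summarised as the highest Max.resolution seen in each release year. This is a sketch under the assumption that a per-year maximum is a fair summary; the vectors are hypothetical stand-ins holding the twelve rows printed in head(train) and head(test).

```r
# Twelve (year, max resolution) pairs copied from head(train) and head(test)
release_year   <- c(2002, 2005, 2003, 2002, 2006, 2004,
                    1997, 1996, 2002, 2005, 2006, 1998)
max_resolution <- c(2832, 2592, 2272, 1600, 2816, 2048,
                    640, 832, 1280, 2048, 2272, 1024)

# Highest max resolution observed in each release year
top_by_year <- tapply(max_resolution, release_year, max)
print(top_by_year)
```

Replacing `max` with `min` and using the Low.resolution column gives the matching summary for Question 3.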

 

Question 3: Has there been a decrease in Low Resolution over the years?

Discussion for Question 3:

Contrary to what the question assumes, there has been an increase rather than a decrease in low resolution over the years. Prior to 2001, the lowest resolution was approximately 500 pixels, whereas in 2007 the lowest was around 1000 pixels.

 

Question 4: Is there any relationship between Dimension and Year?

Discussion for Question 4:

It is evident from our scatter plot that most cameras have a dimension between 80 and 120 from 2000 onwards. Based on our dataset, we can conclude that Canon PowerShot, Fujifilm FinePix and Nikon Coolpix models most commonly fall within this range.

 

Question 6: What is the average price for a Canon Powershot camera?

Discussion for Question 6:

Based on the box plot, the average price for a Canon PowerShot camera falls below $500. The FALSE group shows the prices for the other brands (i.e. Fujifilm FinePix and Nikon Coolpix), which include some outliers.
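
The FALSE label suggests the box plot was grouped by a logical test for Canon models. The sketch below assumes that grouping was done on the model name, which the report does not confirm; `model`, `price` and `is_canon` are hypothetical names, and the six rows are copied from the head(train) output.

```r
# Six rows copied from head(train); illustration only
model <- c("Fujifilm FinePix S602Z Pro", "Canon PowerShot SD430 Wireless",
           "Canon PowerShot A80", "Fujifilm FinePix A203",
           "Nikon Coolpix L6", "Canon PowerShot A400")
price <- c(1699, 199, 139, 169, 99, 139)

# TRUE for Canon PowerShot models, FALSE for every other brand
is_canon <- grepl("^Canon PowerShot", model)

# Median and mean price per group; a box plot draws the median as its centre
# line, so the two can differ noticeably when outliers are present
median_by_group <- tapply(price, is_canon, median)
mean_by_group   <- tapply(price, is_canon, mean)

print(median_by_group)
print(mean_by_group)
# boxplot(price ~ is_canon) would draw the two groups side by side
```

Reporting the mean next to the median makes it clear which "average" a statement about the box plot refers to.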

 

Question 7: What is the weight distribution of the cameras over the period of time?

## `geom_smooth()` using formula 'y ~ x'

Discussion for Question 7:

The diagram shows a decline in camera weight from 1996 to 2007. Our chosen camera brands have been producing more portable cameras as the technology advances and becomes more mainstream. However, the weight appears to level off at around 250. The fitted line in the diagram slopes downward.
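
The `geom_smooth()` message shows the smoother used the formula y ~ x. Assuming a straight-line fit was intended (the report does not state the smoothing method), the slope can be estimated with base R's `lm()`. The vectors are hypothetical stand-ins holding the twelve rows printed in head(train) and head(test).

```r
# Twelve (year, weight) pairs copied from head(train) and head(test)
release_year <- c(2002, 2005, 2003, 2002, 2006, 2004,
                  1997, 1996, 2002, 2005, 2006, 1998)
weight       <- c(590, 185, 350, 180, 175, 255,
                  320, 460, 350, 195, 200, 240)

# Straight-line fit of weight against release year
trend <- lm(weight ~ release_year)

# The slope is the estimated change in weight per year; a negative value
# corresponds to a downward-sloping line
slope <- unname(coef(trend)["release_year"])
cat("Estimated change in weight per year:", round(slope, 2), "\n")
```

A twelve-row sample is too small to confirm the trend, so the sign of the slope should be checked on the full data.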

 

Question 8: Which brand has the widest variety of affordable camera models? [Based on our Dataset]

Discussion for Question 8:

Based on the visualization, it can be seen that most Canon PowerShot models fall in the price range of 100-150, whereas the prices of the Nikon Coolpix models fall between 200-250. For the Fujifilm FinePix, however, there seem to be many outliers, so it is hard to determine a narrow price range, although most Fujifilm FinePix prices fall below 250. In conclusion, Canon PowerShot has the widest variety of affordable camera models.

 

Question 9: What are the prices of Cameras with Max Resolution more than 4000 pixels?

##   Brand                      Model Max.resolution Price
## 1     2 Fujifilm FinePix E550 Zoom           4048   169
## 2     2      Fujifilm FinePix F610           4048   169
## 3     2 Fujifilm FinePix F810 Zoom           4048   169
## 4     2   Fujifilm FinePix S7000 Z           4048   229
## 5     2    Fujifilm FinePix IS Pro           4256  1699
## 6     2    Fujifilm FinePix S2 Pro           4256  1699
## 7     2    Fujifilm FinePix S3 Pro           4256  1699
## 8     2    Fujifilm FinePix S5 Pro           4256  1699

Discussion for Question 9:

The barplot shows that four cameras are available in the price range of 1500-1800, and four others in the price range of 150-250. This is ideal for persons who want large file formats so that they can manipulate their images in post edits. It was also found that, of our 3 brands, Fujifilm FinePix has the largest array of high Max Resolution cameras.
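
A table like the one above can be produced by filtering rows on Max.resolution and then grouping the prices into bands. This is a sketch, not the report's own code: `cams` is a hypothetical data frame holding four of the listed Fujifilm rows plus two lower-resolution rows from head(train), and the band limits are assumptions chosen to match the ranges in the discussion.

```r
# Hypothetical stand-in: four high-resolution rows from the table above and
# two lower-resolution rows from head(train)
cams <- data.frame(
  Model = c("Fujifilm FinePix E550 Zoom", "Fujifilm FinePix S7000 Z",
            "Fujifilm FinePix IS Pro", "Fujifilm FinePix S5 Pro",
            "Canon PowerShot A80", "Nikon Coolpix L6"),
  Max.resolution = c(4048, 4048, 4256, 4256, 2272, 2816),
  Price          = c(169, 229, 1699, 1699, 139, 99)
)

# Keep only cameras whose max resolution exceeds 4000 pixels
high_res <- subset(cams, Max.resolution > 4000,
                   select = c(Model, Max.resolution, Price))

# Count cameras per price band: (0,250], (250,1500], (1500,1800]
price_bands <- table(cut(high_res$Price, breaks = c(0, 250, 1500, 1800)))

print(high_res)
print(price_bands)
```

The empty middle band reflects the gap between the low-priced and the Pro models that the discussion describes.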

 

Question 10: Is there Correlation between Normal and Macro Focus Range?

Discussion for Question 10:

Most cameras in our dataset have a normal focus distance that ranges between 25-75 and a Macro Focus range that is centred around 1-15. This range signifies how far away or how close someone can be with the camera to take clear and in-focus shots. Knowing the focus range on a camera is vital for photographers when picking a camera model to purchase.
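
The question asks about correlation, which can be measured directly with `cor()`. The sketch below uses hypothetical vectors holding the twelve rows printed in head(train) and head(test). Treating a normal focus range of 0 as "not recorded" is an assumption, not something the report states, so both versions are computed.

```r
# Twelve (normal, macro) focus range pairs copied from head(train) and
# head(test)
normal_focus <- c(50, 0, 46, 60, 30, 0, 70, 40, 76, 0, 47, 50)
macro_focus  <- c(1, 3, 5, 10, 10, 5, 3, 10, 16, 1, 1, 9)

# Pearson correlation on all rows
r_all <- cor(normal_focus, macro_focus)

# Assumption: a normal focus range of 0 means the value was not recorded,
# so those rows are set to NA and dropped from the calculation
normal_na <- ifelse(normal_focus == 0, NA, normal_focus)
r_nonzero <- cor(normal_na, macro_focus, use = "complete.obs")

cat("Correlation, all rows:     ", round(r_all, 3), "\n")
cat("Correlation, non-zero rows:", round(r_nonzero, 3), "\n")
```

A value near 0 would indicate little linear relationship, while values near 1 or -1 would indicate a strong one.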

 

Summary of the Sub-Dataset of Variables (Properties) we used in this Analysis

## The dimension of the dataset is: 258 11 
## 
##      Brand                        Model      Release.date  Max.resolution
##  Min.   :1.000   Canon PowerShot 350 :  1   Min.   :1996   Min.   : 512  
##  1st Qu.:1.000   Canon PowerShot 600 :  1   1st Qu.:2002   1st Qu.:2048  
##  Median :2.000   Canon PowerShot A10 :  1   Median :2004   Median :2592  
##  Mean   :1.907   Canon PowerShot A100:  1   Mean   :2004   Mean   :2547  
##  3rd Qu.:3.000   Canon PowerShot A20 :  1   3rd Qu.:2006   3rd Qu.:3072  
##  Max.   :3.000   Canon PowerShot A200:  1   Max.   :2007   Max.   :4256  
##                  (Other)             :252                                
##  Low.resolution Normal.focus.range Macro.focus.range Storage.included
##  Min.   :   0   Min.   : 0.00      Min.   : 0.000    Min.   : 0.00   
##  1st Qu.:1600   1st Qu.:30.00      1st Qu.: 4.000    1st Qu.:12.00   
##  Median :2048   Median :50.00      Median : 5.000    Median :16.00   
##  Mean   :1894   Mean   :42.84      Mean   : 6.671    Mean   :17.65   
##  3rd Qu.:2304   3rd Qu.:60.00      3rd Qu.:10.000    3rd Qu.:19.00   
##  Max.   :3264   Max.   :90.00      Max.   :20.000    Max.   :64.00   
##                                                                      
##  Weight..inc..batteries.   Dimensions        Price       
##  Min.   :120.0           Min.   : 60.0   Min.   :  99.0  
##  1st Qu.:185.0           1st Qu.: 90.0   1st Qu.: 159.0  
##  Median :225.0           Median : 98.0   Median : 199.0  
##  Mean   :284.2           Mean   :100.6   Mean   : 251.8  
##  3rd Qu.:320.0           3rd Qu.:110.0   3rd Qu.: 229.0  
##  Max.   :930.0           Max.   :160.0   Max.   :1699.0  
## 

Prep for modeling

Splitting the Dataset into Train & Test

The dataset was divided into a ratio of 75:25 [train:test]
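
A 75:25 split is commonly done by sampling row indices at random. The sketch below shows one way to do it; the seed value and the use of `floor()` are assumptions, since the report does not show its splitting code.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

n <- 258  # number of rows in the dataset, as reported above

# Draw 75% of the row indices at random for training
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

# The remaining rows form the test set
test_idx <- setdiff(seq_len(n), train_idx)

cat("Training rows:", length(train_idx), "\n")
cat("Test rows:    ", length(test_idx), "\n")
```

With 258 rows this gives 193 training rows and 65 test rows. A training set of 193 rows is consistent with the 191 residual degrees of freedom reported for the models fitted on `train`, because each of those models estimates two coefficients.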

Dataset used to Train

head(train)
##     Brand                          Model Release.date Max.resolution
## 170     2     Fujifilm FinePix S602Z Pro         2002           2832
## 83      1 Canon PowerShot SD430 Wireless         2005           2592
## 38      1            Canon PowerShot A80         2003           2272
## 116     2          Fujifilm FinePix A203         2002           1600
## 234     3               Nikon Coolpix L6         2006           2816
## 11      1           Canon PowerShot A400         2004           2048
##     Low.resolution Normal.focus.range Macro.focus.range Storage.included
## 170           2048                 50                 1                0
## 83            2048                  0                 3               16
## 38            1600                 46                 5               32
## 116           1280                 60                10               16
## 234           2048                 30                10               23
## 11            1600                  0                 5               16
##     Weight..inc..batteries. Dimensions Price
## 170                     590        121  1699
## 83                      185         99   199
## 38                      350        103   139
## 116                     180         97   169
## 234                     175         91    99
## 11                      255        107   139

Dataset used for Test

head(test)
##    Brand                Model Release.date Max.resolution Low.resolution
## 1      1  Canon PowerShot 350         1997            640              0
## 2      1  Canon PowerShot 600         1996            832            640
## 7      1  Canon PowerShot A30         2002           1280           1024
## 12     1 Canon PowerShot A410         2005           2048           1600
## 13     1 Canon PowerShot A420         2006           2272           1600
## 17     1   Canon PowerShot A5         1998           1024            512
##    Normal.focus.range Macro.focus.range Storage.included
## 1                  70                 3                2
## 2                  40                10                1
## 7                  76                16                8
## 12                  0                 1               16
## 13                 47                 1               16
## 17                 50                 9                8
##    Weight..inc..batteries. Dimensions Price
## 1                      320         93   149
## 2                      460        160   139
## 7                      350        110   139
## 12                     195        103   139
## 13                     200        103   139
## 17                     240        105   149

 

Simple Linear Regression

A simple linear regression model is used to study the relationship between the variables Weight & Price. However, due to outliers (as seen in the graph), it is believed that the model may not fit the data as well as expected.
Note: This Weight variable includes the weight of the batteries, not the camera body alone.

## `geom_smooth()` using formula 'y ~ x'

Modeling a Simple Linear Regression

Model 1

X_Predictor: Weight [independent variable]
Y_Response: Price [dependent variable]

## 
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = sub_data)
## 
## Coefficients:
##             (Intercept)  Weight..inc..batteries.  
##                 -38.360                    1.021
## 
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = sub_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -579.35  -78.49  -10.40   60.13 1135.03 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -38.36049   26.48033  -1.449    0.149    
## Weight..inc..batteries.   1.02089    0.08178  12.483   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.8 on 256 degrees of freedom
## Multiple R-squared:  0.3784, Adjusted R-squared:  0.376 
## F-statistic: 155.8 on 1 and 256 DF,  p-value: < 2.2e-16
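
The coefficients above define the fitted line Price = -38.36049 + 1.02089 * Weight. The sketch below applies that line to the median weight from the summary (225) and reproduces the slope's t value from its estimate and standard error. `predict_price` is a hypothetical helper written for this illustration.

```r
# Coefficients copied from the Model 1 summary above
intercept <- -38.36049
slope     <- 1.02089
slope_se  <- 0.08178

# Hypothetical helper: fitted price for a given weight (incl. batteries)
predict_price <- function(weight) {
  intercept + slope * weight
}

# Fitted price at the median weight of 225
price_at_median <- predict_price(225)

# The t value in the summary is the estimate divided by its standard error
slope_t <- slope / slope_se

cat("Fitted price at weight 225:", round(price_at_median, 2), "\n")
cat("t value for the slope:     ", round(slope_t, 3), "\n")
```

Each extra unit of weight is associated with a price increase of about 1.02, and the Multiple R-squared of 0.3784 means the model accounts for roughly 38% of the variation in price.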

 

Model 2

Training X_Predictor: Weight
Y_Response: Price

## 
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = train)
## 
## Coefficients:
##             (Intercept)  Weight..inc..batteries.  
##                -13.0012                   0.9162
## 
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -520.94  -80.40  -17.04   51.67 1171.46 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -13.00125   31.37592  -0.414    0.679    
## Weight..inc..batteries.   0.91617    0.09639   9.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.3 on 191 degrees of freedom
## Multiple R-squared:  0.3211, Adjusted R-squared:  0.3176 
## F-statistic: 90.34 on 1 and 191 DF,  p-value: < 2.2e-16

 

Model 3

Training X_Predictor: Dimensions
Y_Response: Price

## 
## Call:
## lm(formula = Price ~ Dimensions, data = train)
## 
## Coefficients:
## (Intercept)   Dimensions  
##    -463.222        7.036
## 
## Call:
## lm(formula = Price ~ Dimensions, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -412.36 -108.37  -34.30   57.16 1310.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -463.222    110.463  -4.193  4.2e-05 ***
## Dimensions     7.036      1.079   6.518  6.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229.7 on 191 degrees of freedom
## Multiple R-squared:  0.1819, Adjusted R-squared:  0.1777 
## F-statistic: 42.48 on 1 and 191 DF,  p-value: 6.202e-10

 

Model Selection

The models are compared with each other to select the best-fitting model among the set. The comparison is based on AIC & BIC, where a lower value indicates a better model.

AIC(linear_model2)
## [1] 2614.316
AIC(linear_model3)
## [1] 2650.308
BIC(linear_model2)
## [1] 2624.104
BIC(linear_model3)
## [1] 2660.096

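Model 2 has the lower AIC (2614.316 vs 2650.308) and the lower BIC (2624.104 vs 2660.096), so it is preferred over Model 3. Within each model, AIC is lower than the BIC by a margin of about 10.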
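
The margin of about 10 between AIC and BIC follows from their definitions: AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, so BIC - AIC = k log(n) - 2k. The sketch below checks this against the reported values, assuming n = 193 training rows (191 residual degrees of freedom plus 2 coefficients) and k = 3 parameters (intercept, slope and residual variance, which is how R counts them for `lm` fits).

```r
# Values copied from the output above
aic_model2 <- 2614.316
bic_model2 <- 2624.104
aic_model3 <- 2650.308
bic_model3 <- 2660.096

n <- 193  # training rows: 191 residual df + 2 estimated coefficients
k <- 3    # intercept, slope and residual variance

# Gap between BIC and AIC implied by the definitions
gap_expected <- k * log(n) - 2 * k
gap_reported <- bic_model2 - aic_model2

# Difference between the two models under each criterion
delta_aic <- aic_model3 - aic_model2
delta_bic <- bic_model3 - bic_model2

cat("Expected BIC - AIC:", round(gap_expected, 3), "\n")
cat("Reported BIC - AIC:", round(gap_reported, 3), "\n")
cat("Model 3 - Model 2 (AIC, BIC):", delta_aic, delta_bic, "\n")
```

Because both models have the same n and k, the gap carries no information about model quality. The comparison that matters is between the models, and both criteria favour Model 2 by about 36.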

 

Prediction

pred_Weight..inc..batteries.<-predict(linear_model3, test)
actuals_preds <-data.frame(cbind(actuals=test$Weight..inc..batteries.,
                                 predicteds=pred_Weight..inc..batteries.))
head(actuals_preds)
##    actuals predicteds
## 1      320   191.0873
## 2      460   662.4712
## 7      350   310.6922
## 12     195   261.4431
## 13     200   261.4431
## 17     240   275.5143

 

Correlation Accuracy of the Prediction

correlation_accuracy<-cor(actuals_preds)
correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.7641795
## predicteds 0.7641795  1.0000000

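The correlation between the actual values and the predicted values is positive (about 0.76). Note that linear_model3 was fitted as Price ~ Dimensions, so its predictions are prices, while the actuals column holds test$Weight..inc..batteries.; the figure therefore relates predicted price to actual weight. A like-for-like accuracy check would compare the predictions with test$Price.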
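
Since linear_model3 was fitted as Price ~ Dimensions, its predictions can be compared with the actual prices. The sketch below does this for the six rows shown in head(test), using the rounded Model 3 coefficients, so predictions differ slightly from the printed ones. The vector names are hypothetical, and six rows are far too few to judge the model.

```r
# Dimensions and Price for the six rows printed in head(test)
dimensions   <- c(93, 160, 110, 103, 103, 105)
actual_price <- c(149, 139, 139, 139, 139, 149)

# Rounded coefficients copied from the Model 3 summary
intercept <- -463.222
slope     <- 7.036

# Predicted price from the fitted line
predicted_price <- intercept + slope * dimensions

# Like-for-like comparison: predicted price against actual price
comparison <- data.frame(actuals = actual_price, predicteds = predicted_price)

# Root mean squared error: typical size of the prediction error
rmse <- sqrt(mean((actual_price - predicted_price)^2))

print(comparison)
cat("RMSE on these six rows:", round(rmse, 2), "\n")
```

With the model object available, `predict(linear_model3, test)` compared against `test$Price` would give the same check on the whole test set.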

 

References