This document provides a brief overview of data analysis and visualization techniques using R. The dataset used in this overview was taken from: https://www.kaggle.com/crawford/1000-cameras-dataset/#
Disclaimer: This report was done in fulfillment of an Introduction to Data Mining class assignment ~ Database System Design and Information Management Systems II (CSE4202).
## The dimension of the dataset is: 258 14
## Brand Model Release.date Max.resolution
## Min. :1.000 Canon PowerShot 350 : 1 Min. :1996 Min. : 512
## 1st Qu.:1.000 Canon PowerShot 600 : 1 1st Qu.:2002 1st Qu.:2048
## Median :2.000 Canon PowerShot A10 : 1 Median :2004 Median :2592
## Mean :1.907 Canon PowerShot A100: 1 Mean :2004 Mean :2547
## 3rd Qu.:3.000 Canon PowerShot A20 : 1 3rd Qu.:2006 3rd Qu.:3072
## Max. :3.000 Canon PowerShot A200: 1 Max. :2007 Max. :4256
## (Other) :252
## Low.resolution Effective.pixels Zoom.wide..W. Zoom.tele..T.
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:1600 1st Qu.: 3.000 1st Qu.:35.0 1st Qu.:105.0
## Median :2048 Median : 5.000 Median :36.0 Median :111.5
## Mean :1894 Mean : 4.597 Mean :35.1 Mean :131.9
## 3rd Qu.:2304 3rd Qu.: 6.750 3rd Qu.:38.0 3rd Qu.:138.2
## Max. :3264 Max. :12.000 Max. :52.0 Max. :486.0
##
## Normal.focus.range Macro.focus.range Storage.included Weight..inc..batteries.
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :120.0
## 1st Qu.:30.00 1st Qu.: 4.000 1st Qu.:12.00 1st Qu.:185.0
## Median :50.00 Median : 5.000 Median :16.00 Median :225.0
## Mean :42.84 Mean : 6.671 Mean :17.65 Mean :284.2
## 3rd Qu.:60.00 3rd Qu.:10.000 3rd Qu.:19.00 3rd Qu.:320.0
## Max. :90.00 Max. :20.000 Max. :64.00 Max. :930.0
##
## Dimensions Price
## Min. : 60.0 Min. : 99.0
## 1st Qu.: 90.0 1st Qu.: 159.0
## Median : 98.0 Median : 199.0
## Mean :100.6 Mean : 251.8
## 3rd Qu.:110.0 3rd Qu.: 229.0
## Max. :160.0 Max. :1699.0
##
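The dimension line and summary table above are the kind of output that `dim()` and `summary()` produce. As a minimal sketch, the snippet below uses a small hand-made data frame (`cameras_demo`, a hypothetical name) holding four rows copied from the `head(test)` output in this report, rather than the full Kaggle file, and shows only a few of the columns.

```r
# Hypothetical stand-in for the full dataset: four rows and five columns
# copied from the head(test) output in this report.
cameras_demo <- data.frame(
  Brand = c(1, 1, 1, 1),
  Model = factor(c("Canon PowerShot 350", "Canon PowerShot 600",
                   "Canon PowerShot A30", "Canon PowerShot A410")),
  Release.date = c(1997, 1996, 2002, 2005),
  Max.resolution = c(640, 832, 1280, 2048),
  Price = c(149, 139, 139, 139)
)

# Number of rows and columns, printed in the same style as the report
cat("The dimension of the dataset is:", dim(cameras_demo), "\n")

# Per-column summary: quartiles for numeric columns, counts for factors
print(summary(cameras_demo))
```

Because `Brand` is stored as a number, `summary()` reports quartiles for it, which is why the report's summary shows a mean brand of 1.907 rather than a count per brand.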
Based on our dataset, we can see a steady increase in the number of cameras released over the years. Most of the cameras went on sale in 2004, 2006 and 2007. From the year 2000 onwards, the number of cameras released has doubled. This is a strong indication of rapid technological growth.
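A count of cameras per release year is one way to back up this kind of statement. The sketch below uses a made-up vector of years (`release_years`, hypothetical values) in place of the `Release.date` column.

```r
# Hypothetical release years; in the report these would come from
# the Release.date column of the dataset.
release_years <- c(1996, 1997, 2002, 2004, 2004, 2006, 2006, 2006, 2007, 2007)

# Frequency table: number of cameras per release year
counts <- table(release_years)
print(counts)

# Year(s) with the most releases in this toy sample
busiest <- names(counts)[counts == max(counts)]
print(busiest)
```

Passing `counts` to `barplot()` would give a bar chart of releases per year.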
The plot shows a steady increase in the cameras' maximum resolution over the period 1996-2007. From the year 2000, the maximum resolution reaches the 3000-pixel range. From 2002, it increases by another 1000 pixels.
Perhaps counterintuitively, the low resolution has also increased rather than decreased over the years. Prior to 2001, the lowest resolution was approximately 500 pixels, whereas in 2007 the lowest was around 1000 pixels.
It is evident from our scatter plot that most cameras have a dimension between 80 and 120 from 2000 onwards. Based on our dataset, we can conclude that Canon PowerShot, Fujifilm FinePix and Nikon Coolpix models most commonly fall within this range.
From 1996 to 2007, the most common included storage size in our cameras dataset is 16, followed by 32 and 8. The column carries no unit label; for cameras of this period the values are most likely megabytes (MB) rather than gigabytes (GB).
Based on the box plot, the typical price for a Canon PowerShot camera falls below $500. The FALSE group shows the prices for the other brands (i.e. Fujifilm FinePix and Nikon Coolpix), which include some outliers.
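The TRUE/FALSE grouping in the box plot suggests the prices were split by a logical test on the brand. The sketch below assumes that test is `Brand == 1` (Canon PowerShot is coded 1 in the tables of this report) and uses made-up prices in a hypothetical data frame `demo`. It computes the median of each group and the points a box plot would flag as outliers.

```r
# Hypothetical prices. Brand follows the coding seen in this report's tables:
# 1 = Canon PowerShot, 2 = Fujifilm FinePix, 3 = Nikon Coolpix.
demo <- data.frame(
  Brand = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Price = c(139, 149, 199, 139, 169, 229, 1699, 99, 199, 249)
)

# Assumed grouping: TRUE for Canon, FALSE for the other two brands
demo$is_canon <- demo$Brand == 1

# Median price per group (the centre line of each box)
group_medians <- tapply(demo$Price, demo$is_canon, median)
print(group_medians)

# Values beyond 1.5 times the box height, which a box plot draws as outliers
other_outliers <- boxplot.stats(demo$Price[!demo$is_canon])$out
print(other_outliers)
```

Calling `boxplot(Price ~ is_canon, data = demo)` would draw the two boxes side by side.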
## `geom_smooth()` using formula 'y ~ x'
The diagram shows a decline in camera weight from 1996 to 2007, reflected in the downward slope of the fitted line. Our chosen camera brands have been producing cameras that are more easily portable as the technology advances and becomes more mainstream. However, the weight appears to level off at around 250.
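The downward trend can also be expressed as a number by fitting a straight line of weight against release year. The report does not state which smoothing method its plot used, so the straight-line fit here is an assumption, and the data frame `demo` holds made-up values.

```r
# Hypothetical weights by release year (same units as the dataset's weight column)
demo <- data.frame(
  Release.date = c(1996, 1998, 2000, 2002, 2004, 2006, 2007),
  Weight = c(460, 420, 380, 330, 280, 250, 240)
)

# Straight-line fit: Weight as a function of release year
trend <- lm(Weight ~ Release.date, data = demo)

# A negative slope means weight falls as the release year increases
slope <- coef(trend)[["Release.date"]]
print(slope)
```

The slope reads as the average change in weight per year in the toy data.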
Based on the visualization, most Canon PowerShot models fall in the price range of 100-150, whereas Nikon Coolpix prices fall between 200 and 250. For the Fujifilm FinePix, however, there seem to be many outliers, so it is hard to determine a narrow price range, although most Fujifilm FinePix prices fall below 250. In conclusion, Canon PowerShot offers the widest variety of cameras at the lowest prices.
## Brand Model Max.resolution Price
## 1 2 Fujifilm FinePix E550 Zoom 4048 169
## 2 2 Fujifilm FinePix F610 4048 169
## 3 2 Fujifilm FinePix F810 Zoom 4048 169
## 4 2 Fujifilm FinePix S7000 Z 4048 229
## 5 2 Fujifilm FinePix IS Pro 4256 1699
## 6 2 Fujifilm FinePix S2 Pro 4256 1699
## 7 2 Fujifilm FinePix S3 Pro 4256 1699
## 8 2 Fujifilm FinePix S5 Pro 4256 1699
The barplot shows that four cameras are available in the price range of 1500-1800, and four others in the price range of 150-250. These cameras are ideal for persons who want large image files that they can manipulate in post-editing. It was also found that, of our three brands, Fujifilm FinePix has the largest selection of cameras with the highest maximum resolution.
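A table like the one above can be obtained by filtering on `Max.resolution` and sorting the result. The sketch below reuses a few rows printed in this report. The cut-off of 4000 and the name `cameras_demo` are assumptions, since the report does not show the filter it applied.

```r
# Rows copied from tables in this report (three high-resolution Fujifilm
# models plus one Canon and one Nikon model for contrast).
cameras_demo <- data.frame(
  Brand = c(2, 2, 2, 1, 3),
  Model = c("Fujifilm FinePix E550 Zoom", "Fujifilm FinePix IS Pro",
            "Fujifilm FinePix S7000 Z", "Canon PowerShot A80",
            "Nikon Coolpix L6"),
  Max.resolution = c(4048, 4256, 4048, 2272, 2816),
  Price = c(169, 1699, 229, 139, 99)
)

# Assumed threshold: keep cameras whose maximum resolution is at least 4000
high_res <- subset(cameras_demo, Max.resolution >= 4000,
                   select = c(Brand, Model, Max.resolution, Price))

# Sort by resolution, then by price within the same resolution
high_res <- high_res[order(high_res$Max.resolution, high_res$Price), ]
print(high_res)
```

In this small sample every camera that passes the filter has brand code 2, matching the all-Fujifilm table in the report.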
Most cameras in our dataset have a normal focus range between 25 and 75 and a macro focus range between 1 and 15. These ranges indicate how far from or how close to the subject the camera can be while still taking clear, in-focus shots. Knowing the focus range of a camera is vital for photographers when picking a model to purchase.
## The dimension of the dataset is: 258 11
##
## Brand Model Release.date Max.resolution
## Min. :1.000 Canon PowerShot 350 : 1 Min. :1996 Min. : 512
## 1st Qu.:1.000 Canon PowerShot 600 : 1 1st Qu.:2002 1st Qu.:2048
## Median :2.000 Canon PowerShot A10 : 1 Median :2004 Median :2592
## Mean :1.907 Canon PowerShot A100: 1 Mean :2004 Mean :2547
## 3rd Qu.:3.000 Canon PowerShot A20 : 1 3rd Qu.:2006 3rd Qu.:3072
## Max. :3.000 Canon PowerShot A200: 1 Max. :2007 Max. :4256
## (Other) :252
## Low.resolution Normal.focus.range Macro.focus.range Storage.included
## Min. : 0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:1600 1st Qu.:30.00 1st Qu.: 4.000 1st Qu.:12.00
## Median :2048 Median :50.00 Median : 5.000 Median :16.00
## Mean :1894 Mean :42.84 Mean : 6.671 Mean :17.65
## 3rd Qu.:2304 3rd Qu.:60.00 3rd Qu.:10.000 3rd Qu.:19.00
## Max. :3264 Max. :90.00 Max. :20.000 Max. :64.00
##
## Weight..inc..batteries. Dimensions Price
## Min. :120.0 Min. : 60.0 Min. : 99.0
## 1st Qu.:185.0 1st Qu.: 90.0 1st Qu.: 159.0
## Median :225.0 Median : 98.0 Median : 199.0
## Mean :284.2 Mean :100.6 Mean : 251.8
## 3rd Qu.:320.0 3rd Qu.:110.0 3rd Qu.: 229.0
## Max. :930.0 Max. :160.0 Max. :1699.0
##
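Comparing the two summaries, the dataset goes from 14 to 11 columns, and the three columns missing from the second summary are `Effective.pixels`, `Zoom.wide..W.` and `Zoom.tele..T.`. The sketch below shows one way such a reduction can be done. The one-row data frame `full` is a hypothetical placeholder that carries only the column names.

```r
# The 14 column names from the first summary in this report
cols <- c("Brand", "Model", "Release.date", "Max.resolution",
          "Low.resolution", "Effective.pixels", "Zoom.wide..W.",
          "Zoom.tele..T.", "Normal.focus.range", "Macro.focus.range",
          "Storage.included", "Weight..inc..batteries.", "Dimensions", "Price")

# Hypothetical one-row placeholder with those column names
full <- as.data.frame(matrix(0, nrow = 1, ncol = length(cols),
                             dimnames = list(NULL, cols)))

# Columns absent from the second summary
drop_cols <- c("Effective.pixels", "Zoom.wide..W.", "Zoom.tele..T.")

# Keep every column whose name is not in drop_cols
reduced <- full[, !(names(full) %in% drop_cols)]
print(dim(reduced))
```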
The dataset was divided into a ratio of 75:25 [train:test]
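A 75:25 split of 258 rows can be sketched with `sample()`. The seed and the placeholder data frame `demo` are assumptions, since the report does not show how the split was drawn. Rounding 0.75 × 258 down gives 193 training rows, which agrees with the 191 residual degrees of freedom reported for the training models further on.

```r
set.seed(123)  # hypothetical seed; the report's seed is not shown

n <- 258                          # rows in the dataset
demo <- data.frame(id = seq_len(n))  # placeholder for the camera data

# Randomly pick 75% of the row indices for training (rounded down)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train <- demo[train_idx, , drop = FALSE]
test <- demo[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

Using negative indexing for the test set guarantees that no row appears in both parts.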
Dataset used to Train
head(train)
## Brand Model Release.date Max.resolution
## 170 2 Fujifilm FinePix S602Z Pro 2002 2832
## 83 1 Canon PowerShot SD430 Wireless 2005 2592
## 38 1 Canon PowerShot A80 2003 2272
## 116 2 Fujifilm FinePix A203 2002 1600
## 234 3 Nikon Coolpix L6 2006 2816
## 11 1 Canon PowerShot A400 2004 2048
## Low.resolution Normal.focus.range Macro.focus.range Storage.included
## 170 2048 50 1 0
## 83 2048 0 3 16
## 38 1600 46 5 32
## 116 1280 60 10 16
## 234 2048 30 10 23
## 11 1600 0 5 16
## Weight..inc..batteries. Dimensions Price
## 170 590 121 1699
## 83 185 99 199
## 38 350 103 139
## 116 180 97 169
## 234 175 91 99
## 11 255 107 139
Dataset used for Test
head(test)
## Brand Model Release.date Max.resolution Low.resolution
## 1 1 Canon PowerShot 350 1997 640 0
## 2 1 Canon PowerShot 600 1996 832 640
## 7 1 Canon PowerShot A30 2002 1280 1024
## 12 1 Canon PowerShot A410 2005 2048 1600
## 13 1 Canon PowerShot A420 2006 2272 1600
## 17 1 Canon PowerShot A5 1998 1024 512
## Normal.focus.range Macro.focus.range Storage.included
## 1 70 3 2
## 2 40 10 1
## 7 76 16 8
## 12 0 1 16
## 13 47 1 16
## 17 50 9 8
## Weight..inc..batteries. Dimensions Price
## 1 320 93 149
## 2 460 160 139
## 7 350 110 139
## 12 195 103 139
## 13 200 103 139
## 17 240 105 149
A simple linear regression model is used to study the relationship between the variables Weight and Price. However, because of the outliers visible in the graph, it is believed that the fit may not be as good as expected.
Note: This Weight variable includes the weight of the batteries, not the camera body alone.
## `geom_smooth()` using formula 'y ~ x'
X_Predictor: Weight [independent variable]
Y_Response: Price [dependent variable]
##
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = sub_data)
##
## Coefficients:
## (Intercept) Weight..inc..batteries.
## -38.360 1.021
##
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = sub_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -579.35 -78.49 -10.40 60.13 1135.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -38.36049 26.48033 -1.449 0.149
## Weight..inc..batteries. 1.02089 0.08178 12.483 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.8 on 256 degrees of freedom
## Multiple R-squared: 0.3784, Adjusted R-squared: 0.376
## F-statistic: 155.8 on 1 and 256 DF, p-value: < 2.2e-16
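The fitted line from the summary above reads Price = -38.36049 + 1.02089 × Weight, with an R-squared of 0.3784, so weight accounts for roughly 38% of the variation in price. As a small worked example, the coefficients can be plugged in by hand. The helper `predict_price` is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the model summary above
intercept <- -38.36049
slope <- 1.02089

# Hypothetical helper: predicted price for a given weight (incl. batteries)
predict_price <- function(weight) {
  intercept + slope * weight
}

# Predicted prices for two example weights
print(predict_price(c(200, 300)))
```

Each additional unit of weight adds about 1.02 to the predicted price. The negative intercept has no practical meaning, since no camera weighs zero.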
Training X_Predictor: Weight
Y_Response: Price
##
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = train)
##
## Coefficients:
## (Intercept) Weight..inc..batteries.
## -13.0012 0.9162
##
## Call:
## lm(formula = Price ~ Weight..inc..batteries., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -520.94 -80.40 -17.04 51.67 1171.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.00125 31.37592 -0.414 0.679
## Weight..inc..batteries. 0.91617 0.09639 9.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.3 on 191 degrees of freedom
## Multiple R-squared: 0.3211, Adjusted R-squared: 0.3176
## F-statistic: 90.34 on 1 and 191 DF, p-value: < 2.2e-16
Training X_Predictor: Dimensions
Y_Response: Price
##
## Call:
## lm(formula = Price ~ Dimensions, data = train)
##
## Coefficients:
## (Intercept) Dimensions
## -463.222 7.036
##
## Call:
## lm(formula = Price ~ Dimensions, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -412.36 -108.37 -34.30 57.16 1310.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -463.222 110.463 -4.193 4.2e-05 ***
## Dimensions 7.036 1.079 6.518 6.2e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229.7 on 191 degrees of freedom
## Multiple R-squared: 0.1819, Adjusted R-squared: 0.1777
## F-statistic: 42.48 on 1 and 191 DF, p-value: 6.202e-10
The models are compared with each other to help select the best-fitting model among the set. The comparison is based on AIC and BIC, where lower values indicate a better model.
AIC(linear_model2)
## [1] 2614.316
AIC(linear_model3)
## [1] 2650.308
BIC(linear_model2)
## [1] 2624.104
BIC(linear_model3)
## [1] 2660.096
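linear_model2 (Weight) has a lower AIC and a lower BIC than linear_model3 (Dimensions), by about 36 on both criteria, so the Weight model is the better fit of the two. Within each model, the AIC is lower than the BIC by a margin of about 10.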
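The gap of roughly 10 between AIC and BIC for a single model follows from their definitions. Both share the same log-likelihood term and differ only in the penalty, which is 2k for AIC and k·log(n) for BIC. The short check below assumes n = 193 training rows (191 residual degrees of freedom plus 2 coefficients) and k = 3 estimated parameters (intercept, slope and residual variance), which is how R's `AIC()` counts parameters for an `lm` fit.

```r
n <- 193  # training rows: 191 residual degrees of freedom + 2 coefficients
k <- 3    # estimated parameters: intercept, slope, residual variance

# BIC - AIC = k * log(n) - 2 * k
gap <- k * (log(n) - 2)
print(gap)

# Differences taken from the values printed in this report
bic_minus_aic <- 2624.104 - 2614.316   # linear_model2: BIC minus AIC
aic_model_diff <- 2650.308 - 2614.316  # linear_model3 AIC minus linear_model2 AIC
print(c(bic_minus_aic = bic_minus_aic, aic_model_diff = aic_model_diff))
```

The first difference only reflects the sample size and says nothing about model quality. The second is the one that matters for choosing between the two models.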
pred_Weight..inc..batteries.<-predict(linear_model3, test)
actuals_preds <-data.frame(cbind(actuals=test$Weight..inc..batteries.,
predicteds=pred_Weight..inc..batteries.))
head(actuals_preds)
## actuals predicteds
## 1 320 191.0873
## 2 460 662.4712
## 7 350 310.6922
## 12 195 261.4431
## 13 200 261.4431
## 17 240 275.5143
correlation_accuracy<-cor(actuals_preds)
correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.7641795
## predicteds 0.7641795 1.0000000
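The correlation between the actual values and the predicted values is positive (about 0.76). Note, however, that `actuals` is taken from the Weight column of the test set, while `predicteds` comes from linear_model3, which predicts Price from Dimensions, so this figure compares predicted prices with actual weights rather than with actual prices.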
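To judge how well a price model predicts, its predictions would normally be compared with the actual prices of the test set. The sketch below shows that comparison on simulated data. The data frame `demo`, the seed and the simulated coefficients are all hypothetical, so the numbers it prints say nothing about the camera dataset.

```r
set.seed(1)  # hypothetical seed for reproducible toy data

# Simulated data: Price depends linearly on Dimensions plus noise
demo <- data.frame(Dimensions = runif(80, min = 60, max = 160))
demo$Price <- -460 + 7 * demo$Dimensions + rnorm(80, sd = 50)

# Simple split: first 60 rows for training, last 20 for testing
train <- demo[1:60, ]
test <- demo[61:80, ]

# Fit on the training rows, predict Price for the test rows
fit <- lm(Price ~ Dimensions, data = train)
pred_price <- predict(fit, newdata = test)

# Actual and predicted values of the SAME variable (Price)
actuals_preds <- data.frame(actuals = test$Price, predicteds = pred_price)

# Correlation between actual and predicted prices
accuracy_cor <- cor(actuals_preds$actuals, actuals_preds$predicteds)
print(accuracy_cor)

# Root mean squared error, in the units of Price
rmse <- sqrt(mean((actuals_preds$actuals - actuals_preds$predicteds)^2))
print(rmse)
```

Correlation alone only shows that predictions and actual values move together. An error measure such as the RMSE also shows how far off the predictions are.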