This analysis of the cameras dataset was done in fulfillment of an introduction to a data mining class assignment, and should not be used to make any inferences for practical applications.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
camera_data <- read.csv("./data/camera_dataset.csv")
glimpse(camera_data)
## Rows: 1,038
## Columns: 13
## $ Model <fct> Agfa ePhoto 1280, Agfa ePhoto 1680, Agfa ePhoto CL…
## $ Release.date <int> 1997, 1998, 2000, 1999, 1999, 2001, 1999, 1997, 19…
## $ Max.resolution <dbl> 1024, 1280, 640, 1152, 1152, 1600, 1280, 640, 832,…
## $ Low.resolution <dbl> 640, 640, 0, 640, 640, 640, 640, 0, 640, 1024, 102…
## $ Effective.pixels <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 3, 3, 1,…
## $ Zoom.wide..W. <dbl> 38, 38, 45, 35, 43, 51, 34, 42, 50, 35, 39, 35, 39…
## $ Zoom.tele..T. <dbl> 114, 114, 45, 35, 43, 51, 102, 42, 50, 105, 39, 10…
## $ Normal.focus.range <dbl> 70, 50, 0, 0, 50, 50, 0, 70, 40, 76, 20, 76, 20, 7…
## $ Macro.focus.range <dbl> 40, 0, 0, 0, 0, 20, 0, 3, 10, 16, 5, 16, 5, 16, 5,…
## $ Storage.included <dbl> 4, 4, 2, 4, 40, 8, 8, 2, 1, 8, 8, 8, 8, 8, 32, 0, …
## $ Weight <dbl> 420, 420, 0, 0, 300, 270, 0, 320, 460, 375, 225, 3…
## $ Dimensions <dbl> 95, 158, 0, 0, 128, 119, 0, 93, 160, 110, 110, 110…
## $ Price <dbl> 179, 179, 179, 269, 1299, 179, 179, 149, 139, 139,…
#looking for columns with missing data
for(i in 1:dim(camera_data)[2]){
temp <- camera_data %>% summarise(count = sum(is.na(camera_data[i])))
print(paste("Missing value count for", colnames(camera_data[i]), "is", temp))
}
## [1] "Missing value count for Model is 0"
## [1] "Missing value count for Release.date is 0"
## [1] "Missing value count for Max.resolution is 0"
## [1] "Missing value count for Low.resolution is 0"
## [1] "Missing value count for Effective.pixels is 0"
## [1] "Missing value count for Zoom.wide..W. is 0"
## [1] "Missing value count for Zoom.tele..T. is 0"
## [1] "Missing value count for Normal.focus.range is 0"
## [1] "Missing value count for Macro.focus.range is 1"
## [1] "Missing value count for Storage.included is 2"
## [1] "Missing value count for Weight is 2"
## [1] "Missing value count for Dimensions is 2"
## [1] "Missing value count for Price is 0"
#store rows with missing data
camera_missing <- camera_data[rowSums(is.na(camera_data)) > 0,]
#drop rows with missing data
camera_data <- camera_data %>% drop_na()
summary(camera_data)
## Model Release.date Max.resolution Low.resolution
## Agfa ePhoto 1280 : 1 Min. :1994 Min. : 0 Min. : 0
## Agfa ePhoto 1680 : 1 1st Qu.:2002 1st Qu.:2048 1st Qu.:1120
## Agfa ePhoto CL18 : 1 Median :2004 Median :2560 Median :2048
## Agfa ePhoto CL30 : 1 Mean :2004 Mean :2473 Mean :1775
## Agfa ePhoto CL30 Clik!: 1 3rd Qu.:2006 3rd Qu.:3072 3rd Qu.:2560
## Agfa ePhoto CL45 : 1 Max. :2007 Max. :5616 Max. :4992
## (Other) :1030
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.:35.00 1st Qu.: 96.0 1st Qu.: 30.00
## Median : 4.00 Median :36.00 Median :108.0 Median : 50.00
## Mean : 4.59 Mean :32.96 Mean :121.5 Mean : 44.13
## 3rd Qu.: 7.00 3rd Qu.:38.00 3rd Qu.:117.0 3rd Qu.: 60.00
## Max. :21.00 Max. :52.00 Max. :518.0 Max. :120.00
##
## Macro.focus.range Storage.included Weight Dimensions
## Min. : 0.000 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.: 180.0 1st Qu.: 92.0
## Median : 6.000 Median : 16.00 Median : 226.0 Median :101.0
## Mean : 7.786 Mean : 17.45 Mean : 319.3 Mean :105.4
## 3rd Qu.:10.000 3rd Qu.: 20.00 3rd Qu.: 350.0 3rd Qu.:115.0
## Max. :85.000 Max. :450.00 Max. :1860.0 Max. :240.0
##
## Price
## Min. : 14.0
## 1st Qu.: 149.0
## Median : 199.0
## Mean : 457.9
## 3rd Qu.: 399.0
## Max. :7999.0
##
Checking for missing data turned out to be true for 2 rows, and was distributed amongst the variables (Macro Focus Range, Storage, Weight and Dimensions).
These missing values were dropped from the dataset.
The dataset was then summarised which displayed a general overview of every variable, this was beneficial for the construction of other EDA questions.
#store the dates
dates <- camera_data$Release.date
# Barplot
camera_data %>% ggplot(aes(x=as.factor(dates), fill=as.factor(dates))) +
geom_bar(width=0.7) +
ggtitle("Number of Models Produced Thoughout the Years")
#ggsave("question 2 bar plot.jpg")
#extract distinct dates from the dataset
distinct_dates <- camera_data %>% select(Release.date) %>% add_count(Release.date) %>% unique() %>% arrange(Release.date)
# calculate the growth
rates <- c(0.0000)
for(i in 1:(dim(distinct_dates)[1] - 1)){
diff <- distinct_dates[i+1, ][2] - distinct_dates[i, ][2]
percent <- ( diff / distinct_dates[i, ][2] ) * 100
rates <- c(rates, as.numeric(percent))
fmt <- sprintf("The difference and growth rate from year %d to %d is (%d, %.2f)", as.integer(distinct_dates[i, ][1]), as.integer(distinct_dates[i+1, ][1]), as.integer(diff), as.double(percent))
print(fmt)
}
## [1] "The difference and growth rate from year 1994 to 1995 is (0, 0.00)"
## [1] "The difference and growth rate from year 1995 to 1996 is (3, 300.00)"
## [1] "The difference and growth rate from year 1996 to 1997 is (7, 175.00)"
## [1] "The difference and growth rate from year 1997 to 1998 is (21, 190.91)"
## [1] "The difference and growth rate from year 1998 to 1999 is (21, 65.62)"
## [1] "The difference and growth rate from year 1999 to 2000 is (8, 15.09)"
## [1] "The difference and growth rate from year 2000 to 2001 is (24, 39.34)"
## [1] "The difference and growth rate from year 2001 to 2002 is (4, 4.71)"
## [1] "The difference and growth rate from year 2002 to 2003 is (12, 13.48)"
## [1] "The difference and growth rate from year 2003 to 2004 is (40, 39.60)"
## [1] "The difference and growth rate from year 2004 to 2005 is (2, 1.42)"
## [1] "The difference and growth rate from year 2005 to 2006 is (9, 6.29)"
## [1] "The difference and growth rate from year 2006 to 2007 is (10, 6.58)"
final <- data.frame(year = distinct_dates[1], growth = rates)
final %>% ggplot(aes(x=as.factor(Release.date), y=growth, group=1) ) +
geom_line(color="grey") +
geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
ggtitle("The Rate of Growth Between Every Year from 1994 to 2007")
#ggsave("question 2 line plot.jpg")
Plotting the total number of camera models from the year 1994 to 2007 does show an increase in production with every year having more than the last.
The rate of growth in the line graph shows from the year 1995 to 1996, for that time, there was a massive growth of 300%, which then diminished to a lower rate by the year 2000. From the years (2000 to 2001) and (2003 to 2004) the growth rate was the highest between the years (2000 to 2007), however, after 2004 the growth rate was very negligible.
camera_data %>% filter(Release.date > 2000) %>% ggplot(aes(x=as.factor(Release.date), y=Price, group=Release.date, fill=Release.date)) +
geom_boxplot() +
ggtitle("Box Plot of Prices over the Year 2000")
#ggsave("question 3 box plot.jpg")
camera_data %>% filter(Release.date > 2000 & Price > 2000)
## Model Release.date Max.resolution Low.resolution
## 1 Canon EOS-1D 2001 2464 1232
## 2 Canon EOS-1D Mark II 2004 3504 3104
## 3 Canon EOS-1D Mark III 2007 3888 3456
## 4 Canon EOS-1D Mark II N 2005 3504 3104
## 5 Canon EOS-1Ds 2002 4064 2032
## 6 Canon EOS-1Ds Mark II 2004 4992 3600
## 7 Canon EOS-1Ds Mark III 2007 5616 4992
## 8 Canon EOS 5D 2005 4368 3168
## 9 Epson R-D1 2004 3008 2240
## 10 Nikon D1X 2001 3008 2000
## 11 Nikon D2Hs 2005 2464 1840
## 12 Nikon D2X 2004 4288 3216
## 13 Nikon D2Xs 2006 4288 3216
## 14 Nikon D3 2007 4256 3184
## 15 Ricoh GR Digital 2005 3264 2592
## 16 Ricoh RDC-i500 2001 2048 1024
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1 4 0 0 0
## 2 8 0 0 0
## 3 10 0 0 0
## 4 8 0 0 0
## 5 11 0 0 0
## 6 16 0 0 0
## 7 21 0 0 0
## 8 12 0 0 0
## 9 6 0 0 0
## 10 5 0 0 0
## 11 4 0 0 0
## 12 12 0 0 0
## 13 12 0 0 0
## 14 12 0 0 0
## 15 8 28 28 30
## 16 3 35 105 24
## Macro.focus.range Storage.included Weight Dimensions Price
## 1 0 0 1585 156 4499
## 2 0 0 1565 156 4499
## 3 0 0 1335 156 4499
## 4 0 0 1565 156 4499
## 5 0 0 1585 156 7999
## 6 0 0 1565 156 7999
## 7 0 0 1385 150 7999
## 8 0 0 895 152 2499
## 9 0 0 620 142 4499
## 10 0 0 1200 157 4699
## 11 0 0 1200 158 4699
## 12 0 0 1200 158 4699
## 13 0 0 1200 158 4699
## 14 0 0 1300 160 4999
## 15 2 26 200 107 4999
## 16 1 8 320 142 4999
From the years 2001 to 2007 there are outliers. However between the price ranges of: ($1500 to $2000), ($4000 to $5000) and ($7500 to $8000) there is a price trend which indicates the higher end camera models. This further indicates the data distribution of the prices variable is skewed, due to majority of prices less than $1500.
There is a total of 16 camera models priced over $2000.
camera_data %>% filter(Price < 200 & Max.resolution > 2160 & Release.date == 2007 & Storage.included > 24) %>% apply(2,min)
## Model Release.date Max.resolution
## "Canon PowerShot SD1000" "2007" "2816"
## Low.resolution Effective.pixels Zoom.wide..W.
## " 0" " 6" "28"
## Zoom.tele..T. Normal.focus.range Macro.focus.range
## "101" " 0" " 0"
## Storage.included Weight Dimensions
## " 25" " 0" " 86"
## Price
## " 99"
manufacturers <- c("Agfa ePhoto", "Canon", "Casio", "Contax", "Epson", "Fujifilm", "HP","Kodak", "Kyocera", "Finecam", "Leica", "Nikon", "Olympus", "Panasonic", "Pentax", "Ricoh", "Samsung", "Sanyo", "Sigma", "Sony", "Toshiba")
accumulated_total <- c(rep(0, times = length(manufacturers)))
for(i in 1:dim(distinct_dates[1])[1]){
price_per_manu <- data.frame(manufacturers = character(), price = numeric())
for(j in 1:length(manufacturers)){
result <- camera_data %>% filter(Release.date == as.numeric(distinct_dates[1][i,]) & str_detect(Model, manufacturers[j]))
total <- sum(result$Price)
price_per_manu <- rbind(price_per_manu, data.frame(manufacturers = manufacturers[j], price = total))
accumulated_total[j] <- accumulated_total[j] + total
}
plot <- price_per_manu %>% ggplot(aes(x=reorder(manufacturers, price), y=price)) +
geom_bar(stat="identity", fill="#69b3a2", width=.4) +
coord_flip() +
xlab("") +
theme_bw() +
ggtitle("Total Price Per Manufacturers in year: ", distinct_dates[1][i,])
print(plot)
#ggsave(paste("Year", distinct_dates[1][i,],".png", sep = ""))
}
accumulated_total_manu <- data.frame(manufacturers = manufacturers, price = accumulated_total)
accumulated_total_manu %>%
ggplot(aes(x = reorder(manufacturers, price), y=price)) +
geom_bar(stat="identity", fill="#69b3a2", width=.4) +
coord_flip() +
xlab("") +
theme_bw() +
ggtitle("Accumulated Price of all Manufacturers Throughout the Years")
#ggsave("total_summary.png")
15 different plots were generated which shows the total price for each manufacturer for that year and the accumulated price per manufacturer.
The manufacturer with the lowest accumulated price was Contax and the manufacturer with the highest accumulated price was Canon.
By the year 2007, 9 manufacturers stopped producing any cameras
In the year 2003, 19 of 21 manufacturers produced cameras while the 2 that did not produced cameras were Agfa ePhoto and Contax
less_than_ones <- camera_data %>% filter(Zoom.wide..W. < 1 & Zoom.tele..T. < 1 & Normal.focus.range < 1 & Macro.focus.range < 1)
dim(less_than_ones)[1]
## [1] 85
head(less_than_ones)
## Model Release.date Max.resolution Low.resolution
## 1 Canon EOS 10D 2003 3072 2048
## 2 Canon EOS-1D 2001 2464 1232
## 3 Canon EOS-1D Mark II 2004 3504 3104
## 4 Canon EOS-1D Mark III 2007 3888 3456
## 5 Canon EOS-1D Mark II N 2005 3504 3104
## 6 Canon EOS-1Ds 2002 4064 2032
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1 6 0 0 0
## 2 4 0 0 0
## 3 8 0 0 0
## 4 10 0 0 0
## 5 8 0 0 0
## 6 11 0 0 0
## Macro.focus.range Storage.included Weight Dimensions Price
## 1 0 0 875 150 1299
## 2 0 0 1585 156 4499
## 3 0 0 1565 156 4499
## 4 0 0 1335 156 4499
## 5 0 0 1565 156 4499
## 6 0 0 1585 156 7999
There are 85 camera models with a value of 0 in [zoom wide, zoom tele, normal focus range, macro focus range]
A select few of the manufacturers: Canon, Fujifilm and Kodak have this value which means that its not manufacturer specific but the data for these specific camera models were unavailable.
hist1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
ggtitle("Dimensions Distribution")
dense1 <- camera_data %>% ggplot(aes(x=Dimensions)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Dimensions Distribution")
grid.arrange(hist1, dense1, ncol=2)
#ggsave("question 7 hist_density plot1.jpg", grid.arrange(hist1, dense1, ncol=2))
hist2 <- camera_data %>% ggplot(aes(x=Weight)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
ggtitle("Weight Distribution")
dense2 <- camera_data %>% ggplot(aes(x=Weight)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Weight Distribution")
grid.arrange(hist2, dense2, ncol=2)
#ggsave("question 7 hist_density plot2.jpg", grid.arrange(hist2, dense2, ncol=2))
hist3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
ggtitle("Zoom Tele Distribution")
dense3 <- camera_data %>% ggplot(aes(x=Zoom.tele..T.)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Zoom Tele Distribution")
grid.arrange(hist3, dense3, ncol=2)
#ggsave("question 7 hist_density plot3.jpg", grid.arrange(hist3, dense3, ncol=2))
hist4 <- camera_data %>% ggplot(aes(x=Price)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, bins=30) +
ggtitle("Price Distribution")
dense4 <- camera_data %>% ggplot(aes(x=Price)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Price Distribution")
grid.arrange(hist4, dense4, ncol=2)
#ggsave("question 7 hist_density plot4.jpg", grid.arrange(hist4, dense4, ncol=2))
The dimensions distribution is multimodal.
The weight distribution is right skewed and multimodal.
Zoom tele distribution is right skewed and multimodal.
Price distribution is right skewed and multimodal.
filtered_set <- camera_data %>% filter(Zoom.wide..W. > 1 & Zoom.tele..T. > 1 & Normal.focus.range > 1 & Macro.focus.range > 1)
filtered_set %>% filter(Storage.included < 100) %>% ggplot(aes(x=Max.resolution, y=Storage.included)) +
geom_point(
color="blue",
fill="#69b3a2",
shape=18,
alpha=0.5,
size=3,
stroke = 1) +
geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
ggtitle("Storage vs Max Resolution")
#ggsave("question 8 scatter plot1.jpg")
filtered_set %>% ggplot(aes(x=Weight, y=Dimensions)) +
geom_point(
color="blue",
fill="#69b3a2",
shape=18,
alpha=0.5,
size=3,
stroke = 1) +
geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
ggtitle("Dimensions vs Weight")
#ggsave("question 8 scatter plot2.jpg")
filtered_set %>% filter(Price < 2000) %>%ggplot(aes(x=Zoom.tele..T., y=Price)) +
geom_point(
color="blue",
fill="#69b3a2",
shape=18,
alpha=0.5,
size=3,
stroke = 1) +
geom_smooth(method=lm , color="red", se=FALSE, formula="y ~ x") +
ggtitle("Price vs Zoom tele")
#ggsave("question 8 scatter plot3.jpg")
facet_wrap( ~ align)
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
The relationship between storage and max resolution is positive.
The relationship between dimensions and weight is also positive.
The relationship between price and zoom tele seems to be a little negative to no relationship.
relation_1 <- data.frame(cbind(max_res = filtered_set$Max.resolution, storage = filtered_set$Storage.included))
relation_2 <- data.frame(cbind(weight = filtered_set$Weight, dims = filtered_set$Dimensions))
relation_3 <- data.frame(cbind(zoom_t = filtered_set$Zoom.tele..T., price = filtered_set$Price))
cor(relation_1)
## max_res storage
## max_res 1.0000000 0.2538378
## storage 0.2538378 1.0000000
cor(relation_2)
## weight dims
## weight 1.000000 0.524276
## dims 0.524276 1.000000
cor(relation_3)
## zoom_t price
## zoom_t 1.00000000 -0.04432165
## price -0.04432165 1.00000000
The correlation coefficient for the relationship between storage and max resolution is 0.2538378.
The correlation coefficient for the relationship between dimensions and weight is 0.524276.
The correlation coefficient for the relationship between price and zoom tele is -0.04432165.
camera_ratio <- camera_data %>% filter(Zoom.tele..T. > 0 & Zoom.wide..W. > 0) %>% mutate(zoom_ratio = Zoom.tele..T. / Zoom.wide..W.)
head(camera_ratio)
## Model Release.date Max.resolution Low.resolution
## 1 Agfa ePhoto 1280 1997 1024 640
## 2 Agfa ePhoto 1680 1998 1280 640
## 3 Agfa ePhoto CL18 2000 640 0
## 4 Agfa ePhoto CL30 1999 1152 640
## 5 Agfa ePhoto CL30 Clik! 1999 1152 640
## 6 Agfa ePhoto CL45 2001 1600 640
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1 0 38 114 70
## 2 1 38 114 50
## 3 0 45 45 0
## 4 0 35 35 0
## 5 0 43 43 50
## 6 1 51 51 50
## Macro.focus.range Storage.included Weight Dimensions Price zoom_ratio
## 1 40 4 420 95 179 3
## 2 0 4 420 158 179 3
## 3 0 2 0 0 179 1
## 4 0 4 0 0 269 1
## 5 0 40 300 128 1299 1
## 6 20 8 270 119 179 1
camera_ratio %>% ggplot(aes(x=as.factor(Release.date), y=zoom_ratio)) +
geom_bar(stat="identity", fill="#69b3a2", width=.5)
#ggsave("question 10 bar plot.jpg")
Using the variables zoom wide and zoom tele, a mutate function was used to calculate the zoom ratio for every camera models when zoom tele and zoom ratio is greater than 0.
Throughout the years the zoom ratio gets larger with every subsequent year.
#splitting and filtering dataset
camera_data_unfiltered <- camera_data
camera_data_filtered <- camera_data %>% filter(Price < 1000)
set.seed(123)
#unfiltered dataset [80, 20] split
split = sort(sample(nrow(camera_data_unfiltered), nrow(camera_data_unfiltered)*.8))
train_unfiltered <- camera_data_unfiltered[split,]
test_unfiltered <- camera_data_unfiltered[-split,]
#filtered dataset [80, 20] split
split = sort(sample(nrow(camera_data_filtered), nrow(camera_data_filtered)*.8))
train_filtered <- camera_data_filtered[split,]
test_filtered <- camera_data_filtered[-split,]
#Model 1, first trial
model1_unfiltered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_unfiltered)
print(summary(model1_unfiltered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_unfiltered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2306.6 -226.7 -112.1 26.4 5693.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.86387 139.32734 -0.071 0.9436
## Max.resolution 0.13631 0.03024 4.507 7.51e-06 ***
## Weight 1.42113 0.11931 11.912 < 2e-16 ***
## Dimensions -3.15011 1.28959 -2.443 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 658.2 on 824 degrees of freedom
## Multiple R-squared: 0.2288, Adjusted R-squared: 0.226
## F-statistic: 81.5 on 3 and 824 DF, p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model1_unfiltered, test_unfiltered)))
head(actual_preds)
## actuals predicted
## 26 139 309.4887
## 30 149 420.6426
## 37 139 539.3297
## 38 139 591.6709
## 45 139 472.7551
## 50 4499 2200.3976
print(paste("Akaike’s information criterion:", AIC(model1_unfiltered)))
## [1] "Akaike’s information criterion: 13102.461549226"
print(paste("Bayesian information criterion:", BIC(model1_unfiltered)))
## [1] "Bayesian information criterion: 13126.0566149979"
#get accuracy
print(cor(actual_preds))
## actuals predicted
## actuals 1.0000000 0.5295035
## predicted 0.5295035 1.0000000
#Model 2, second trial
model2_unfiltered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_unfiltered)
print(summary(model2_unfiltered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight +
## Dimensions, data = train_unfiltered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2284.1 -226.3 -117.8 31.5 5686.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.37501 138.91292 0.010 0.99210
## Max.resolution 0.15340 0.03085 4.973 8.03e-07 ***
## Storage.included -2.42132 0.93330 -2.594 0.00964 **
## Weight 1.37082 0.12046 11.380 < 2e-16 ***
## Dimensions -3.10866 1.28523 -2.419 0.01579 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 656 on 823 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2314
## F-statistic: 63.23 on 4 and 823 DF, p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model2_unfiltered, test_unfiltered)))
head(actual_preds)
## actuals predicted
## 26 139 309.1985
## 30 149 438.8639
## 37 139 556.2971
## 38 139 576.4616
## 45 139 432.0135
## 50 4499 2199.2727
print(paste("Akaike’s information criterion:", AIC(model2_unfiltered)))
## [1] "Akaike’s information criterion: 13097.7174056187"
print(paste("Bayesian information criterion:", BIC(model2_unfiltered)))
## [1] "Bayesian information criterion: 13126.031484545"
#get accuracy
print(cor(actual_preds) )
## actuals predicted
## actuals 1.0000000 0.5138635
## predicted 0.5138635 1.0000000
#Model 3, third trial
model3_unfiltered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_unfiltered)
print(summary(model3_unfiltered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels +
## Zoom.tele..T. + Weight + Dimensions, data = train_unfiltered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2327.6 -249.2 -106.5 60.0 5596.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 576.0420 181.2031 3.179 0.00153 **
## Max.resolution -0.3127 0.1058 -2.954 0.00322 **
## Low.resolution 0.2706 0.0507 5.338 1.22e-07 ***
## Effective.pixels 61.1598 26.4088 2.316 0.02081 *
## Zoom.tele..T. -1.5871 0.2398 -6.618 6.56e-11 ***
## Weight 1.5807 0.1176 13.438 < 2e-16 ***
## Dimensions -4.0400 1.2502 -3.231 0.00128 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 633.8 on 821 degrees of freedom
## Multiple R-squared: 0.2876, Adjusted R-squared: 0.2824
## F-statistic: 55.24 on 6 and 821 DF, p-value: < 2.2e-16
actual_preds <- data.frame(cbind(actuals=test_unfiltered$Price, predicted=predict(model3_unfiltered, test_unfiltered)))
head(actual_preds)
## actuals predicted
## 26 139 325.8619
## 30 149 487.1587
## 37 139 581.5300
## 38 139 644.3967
## 45 139 438.2805
## 50 4499 2653.1778
print(paste("Akaike’s information criterion:", AIC(model3_unfiltered)))
## [1] "Akaike’s information criterion: 13042.8381874199"
print(paste("Bayesian information criterion:", BIC(model3_unfiltered)))
## [1] "Bayesian information criterion: 13080.5902926549"
#get accuracy
print(cor(actual_preds))
## actuals predicted
## actuals 1.000000 0.571435
## predicted 0.571435 1.000000
#Model 1, first trial
model1_filtered <- lm(Price ~ Max.resolution+Weight+Dimensions, data = train_filtered)
print(summary(model1_filtered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Weight + Dimensions, data = train_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -232.82 -81.19 -34.16 50.76 544.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112e+02 3.082e+01 3.608 0.00033 ***
## Max.resolution 3.211e-02 6.202e-03 5.178 2.92e-07 ***
## Weight 2.178e-02 2.713e-02 0.803 0.42225
## Dimensions 2.744e-01 2.837e-01 0.967 0.33368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116.6 on 714 degrees of freedom
## Multiple R-squared: 0.04051, Adjusted R-squared: 0.03648
## F-statistic: 10.05 on 3 and 714 DF, p-value: 1.716e-06
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model1_filtered, test_filtered)))
head(actual_preds)
## actuals predicted
## 8 139 191.8423
## 11 139 200.9304
## 13 139 190.1105
## 16 139 200.3858
## 23 149 178.1218
## 24 149 186.8825
print(paste("Akaike’s information criterion:", AIC(model1_filtered)))
## [1] "Akaike’s information criterion: 8876.99678932854"
print(paste("Bayesian information criterion:", BIC(model1_filtered)))
## [1] "Bayesian information criterion: 8899.87913717378"
#get accuracy
print(cor(actual_preds))
## actuals predicted
## actuals 1.0000000 0.1830946
## predicted 0.1830946 1.0000000
#Model 2, second trial
model2_filtered <- lm(Price ~ Max.resolution+Storage.included+Weight+Dimensions, data = train_filtered)
print(summary(model2_filtered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Storage.included + Weight +
## Dimensions, data = train_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236.28 -81.25 -34.25 50.25 539.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.063557 30.785041 3.608 0.00033 ***
## Max.resolution 0.034186 0.006326 5.404 8.89e-08 ***
## Storage.included -0.263697 0.163013 -1.618 0.10618
## Weight 0.016826 0.027272 0.617 0.53746
## Dimensions 0.287151 0.283460 1.013 0.31139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116.5 on 713 degrees of freedom
## Multiple R-squared: 0.04402, Adjusted R-squared: 0.03866
## F-statistic: 8.208 on 4 and 713 DF, p-value: 1.787e-06
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model2_filtered, test_filtered)))
head(actual_preds)
## actuals predicted
## 8 139 192.9268
## 11 139 201.5482
## 13 139 190.1880
## 16 139 201.1275
## 23 149 178.1497
## 24 149 187.1683
print(paste("Akaike’s information criterion:", AIC(model2_filtered)))
## [1] "Akaike’s information criterion: 8876.36648723066"
print(paste("Bayesian information criterion:", BIC(model2_filtered)))
## [1] "Bayesian information criterion: 8903.82530464495"
#get accuracy
print(cor(actual_preds) )
## actuals predicted
## actuals 1.0000000 0.1778359
## predicted 0.1778359 1.0000000
#Model 3, third trial
model3_filtered <- lm(Price ~ Max.resolution+Low.resolution+Effective.pixels+Zoom.tele..T.+Weight+Dimensions, data = train_filtered)
print(summary(model3_filtered))
##
## Call:
## lm(formula = Price ~ Max.resolution + Low.resolution + Effective.pixels +
## Zoom.tele..T. + Weight + Dimensions, data = train_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -239.66 -81.54 -24.38 48.70 527.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.732165 39.778649 3.513 0.000472 ***
## Max.resolution -0.029171 0.022403 -1.302 0.193317
## Low.resolution 0.058122 0.010280 5.654 2.27e-08 ***
## Effective.pixels 1.793944 5.443311 0.330 0.741823
## Zoom.tele..T. -0.005215 0.049059 -0.106 0.915374
## Weight 0.059376 0.027515 2.158 0.031265 *
## Dimensions 0.284733 0.278790 1.021 0.307451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.2 on 711 degrees of freedom
## Multiple R-squared: 0.08308, Adjusted R-squared: 0.07534
## F-statistic: 10.74 on 6 and 711 DF, p-value: 1.904e-11
actual_preds <- data.frame(cbind(actuals=test_filtered$Price, predicted=predict(model3_filtered, test_filtered)))
head(actual_preds)
## actuals predicted
## 8 139 225.2699
## 11 139 207.4091
## 13 139 215.2593
## 16 139 205.9247
## 23 149 183.5846
## 24 149 187.5673
print(paste("Akaike’s information criterion:", AIC(model3_filtered)))
## [1] "Akaike’s information criterion: 8850.41924475122"
print(paste("Bayesian information criterion:", BIC(model3_filtered)))
## [1] "Bayesian information criterion: 8887.03100130361"
#get accuracy
print(cor(actual_preds))
## actuals predicted
## actuals 1.0000000 0.1632764
## predicted 0.1632764 1.0000000
Two datasets were created to compare 2 sets of 3 models each, the unfiltered dataset is the regular camera dataset while the unfiltered dataset removes the skew in the prices distribution where the prices are less than $1000.
All 6 models utilize multiple linear regression instead of simple linear regression to predict the price of a camera model.
First model uses 3 variables (Max resolution, Weight and Dimensions), Akaike’s and Bayesian information criterion were (13102.461549226 and 13126.0566149979) respectively, the correlation between predicted and actual values was 0.5295035 and the p values for all variables were less than 5% indicating they are statistically significant.
Second Model uses 4 variables, the same from the first model in addition to (Storage), Akaike’s and Bayesian information criterion were (13097.7174056187 and 13126.031484545) respectively and the correlation between predicted and actual values was 0.5138635 and the p values for all variables were less than 5% indicating they are statistically significant.
Third model uses 6 variables (Max resolution, Min resolution, effective pixels, Zoom tele, weight and Dimensions). Akaike’s and Bayesian information criterion were (13042.8381874199 and 13080.5902926549) respectively, the correlation between predicted and actual values was 0.571435 and the p values for all variables were less than 5% indicating they are statistically significant.
The first model Akaike’s and Bayesian information criterion were (8876.99678932854 and 8899.87913717378) respectively and the correlation between predicted and actual values was 0.1830946.
The second model Akaike’s and Bayesian information criterion were (8876.36648723066 and 8903.82530464495) respectively and the correlation between predicted and actual values was 0.1778359.
The third model Akaike’s and Bayesian information criterion were (8850.41924475122 and 8887.03100130361) respectively and the correlation between predicted and actual values was 0.1632764.