(Part b) Load the data file in R

data <- read.csv("C:/Users/dell/Desktop/Jawwad/dataSet_19122098.csv")

This line reads the CSV file ‘dataSet_19122098.csv’ from the specified path (‘C:/Users/dell/Desktop/Jawwad/’) into an R data frame named “data”.

(Part c): Describe the data, type and measurement scale of all variables

summary(data)

##   Observation       Brand              Price_.        Megapixels   
##  Min.   : 1.00   Length:28          Min.   : 64.0   Min.   :10.00  
##  1st Qu.: 7.75   Class :character   1st Qu.: 88.0   1st Qu.:12.00  
##  Median :14.50   Mode  :character   Median :128.0   Median :12.00  
##  Mean   :14.50                      Mean   :140.3   Mean   :12.86  
##  3rd Qu.:21.25                      3rd Qu.:160.0   3rd Qu.:14.00  
##  Max.   :28.00                      Max.   :320.0   Max.   :16.00  
##    Weight_oz         Score          Brand.1      
##  Min.   :4.000   Min.   :50.00   Min.   :0.0000  
##  1st Qu.:5.000   1st Qu.:60.00   1st Qu.:0.0000  
##  Median :6.000   Median :64.50   Median :0.0000  
##  Mean   :5.821   Mean   :64.36   Mean   :0.4643  
##  3rd Qu.:7.000   3rd Qu.:69.25   3rd Qu.:1.0000  
##  Max.   :7.000   Max.   :74.00   Max.   :1.0000

This command provides a summary of the statistical properties of the variables in the “data” dataset, including measures such as mean, median, quartiles, and other descriptive statistics.

str(data)

## 'data.frame':    28 obs. of  7 variables:
##  $ Observation: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Brand      : chr  "Canon" "Canon" "Canon" "Canon" ...
##  $ Price_.    : int  264 160 240 160 144 160 160 104 104 88 ...
##  $ Megapixels : int  10 12 12 10 12 12 14 10 12 16 ...
##  $ Weight_oz  : int  7 5 7 6 5 7 5 7 5 5 ...
##  $ Score      : int  74 74 73 70 70 69 68 68 67 63 ...
##  $ Brand.1    : int  1 1 1 1 1 1 1 1 1 1 ...

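This command displays the structure of the “data” dataset: 28 observations of 7 variables, with each variable’s data type and first few values. Brand is a character (nominal) variable, and Brand.1 is its 0/1 integer coding (1 for Canon, 0 for Nikon, as the data listing under Part D9 shows). Price_., Megapixels and Weight_oz are integers measured on a ratio scale, Score is a numeric rating, and Observation is only a row identifier.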
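The column names Price_. and Brand.1 in the output above are not typos. By default, read.csv() adjusts the header with make.names(), which replaces characters that are invalid in R names with a dot and makes duplicated names unique. A small sketch, assuming the raw header contains Price_$ (as the task wording suggests) and a second column also called Brand; the vector raw_header is hypothetical, since the CSV header itself is not shown in this report:

```r
# Hypothetical raw header; the actual CSV header is not shown in this report
raw_header <- c("Observation", "Brand", "Price_$", "Megapixels",
                "Weight_oz", "Score", "Brand")

# make.names() is what read.csv() applies when check.names = TRUE (the default)
clean_header <- make.names(raw_header, unique = TRUE)
print(clean_header)
```

Passing check.names = FALSE to read.csv() would keep the raw names, at the cost of needing backticks to refer to them in formulas.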

(Part D1) Make the histogram of all scaled variables and check the normality

hist(data$Price_., col = rainbow(length(data$Price_.)))

The distribution of Price_. is positively (right) skewed: the summary above shows the mean (140.3) above the median (128.0) and a maximum of 320, far from the bulk of the data. Most cameras sit at the lower prices, with a few expensive models stretching the right tail, so Price_. does not look normally distributed.

hist(data$Megapixels, col = heat.colors(length(data$Megapixels)))

The histogram of Megapixels appears to be right-skewed, with lower Megapixel values occurring more often (mean 12.86 against a median of 12). There are no apparent outliers, but the distribution is not symmetric, suggesting some deviation from a normal distribution.

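colors <- rainbow(length(data$Weight_oz))
hist(data$Weight_oz, col = colors, main = "Histogram of Weight", xlab = "Weight")

Weight_oz takes only the values 4 to 7 oz, and its mean (5.821) lies slightly below its median (6.000). This points to a roughly symmetric or mildly left-skewed distribution rather than a positively skewed one. With so few distinct values, it can only be a coarse approximation to a normal distribution.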

hist(data$Score, col = rainbow(10))  # Adjust the number in rainbow() as needed

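The mean (64.36) and median (64.50) of Score nearly coincide, and the quartiles are spread fairly evenly between the minimum of 50 and the maximum of 74. This suggests an approximately symmetric distribution with no strong departure from normality.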
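A numerical check can complement the visual impression from the histograms. The sketch below computes a moment-based skewness and runs a Shapiro-Wilk test for Price_. and Score. The values are copied from the data listing printed under Part D9, and skewness() is a hypothetical helper written here, not a base R function:

```r
# Values copied from the data listing printed under Part D9
price <- c(264, 160, 240, 160, 144, 160, 160, 104, 104, 88, 72, 80, 72,
           216, 240, 160, 320, 96, 136, 120, 184, 144, 104, 64, 64, 80, 88, 104)
score <- c(74, 74, 73, 70, 70, 69, 68, 68, 67, 63, 60, 59, 54,
           73, 71, 69, 67, 65, 64, 64, 63, 61, 61, 60, 58, 54, 53, 50)

# Hypothetical helper: moment-based sample skewness (> 0 means a longer right tail)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_price <- skewness(price)
skew_score <- skewness(score)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_price <- shapiro.test(price)
sw_score <- shapiro.test(score)

cat("Price_. skewness:", round(skew_price, 3),
    " Shapiro-Wilk p-value:", signif(sw_price$p.value, 3), "\n")
cat("Score   skewness:", round(skew_score, 3),
    " Shapiro-Wilk p-value:", signif(sw_score$p.value, 3), "\n")
```

With only 28 observations the test has limited power, so it is best read together with the histograms rather than instead of them.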

(Part D2) Make the matrix scatter plot using Price_$, Megapixels, Weight_oz and Score

# Note: the four colours are recycled across the 28 points (by row), not assigned per variable
pairs(~Price_. + Megapixels + Weight_oz + Score, data = data, 
      col = c("red", "green", "blue", "purple"))

(Part D3) Compute the correlation matrix using Price_$, Megapixels, Weight_oz and Score

cor_matrix <- cor(data[, c("Price_.", "Megapixels", "Weight_oz", "Score")])
print(cor_matrix)

##              Price_.   Megapixels  Weight_oz        Score
## Price_.    1.0000000  0.138906307  0.3488151  0.683211844
## Megapixels 0.1389063  1.000000000 -0.1988338 -0.007729723
## Weight_oz  0.3488151 -0.198833809  1.0000000  0.285688204
## Score      0.6832118 -0.007729723  0.2856882  1.000000000

Price_. and Megapixels (0.1389063): There is a weak positive correlation between Price_. and Megapixels. Price_. tends to rise only slightly as Megapixels increase.

Price_. and Weight_oz (0.3488151): There is a weak-to-moderate positive correlation between Price_. and Weight_oz. Heavier cameras tend to be somewhat more expensive.

Price_. and Score (0.6832118): This is the strongest correlation in the matrix, a moderately strong positive one. Cameras with higher scores tend to have higher prices.

Megapixels and Weight_oz (-0.1988338): There is a weak negative correlation between Megapixels and Weight_oz. Megapixels tend to decrease slightly as Weight_oz increases.

Megapixels and Score (-0.007729723): The correlation is essentially zero, suggesting no linear relationship between these two variables.

Weight_oz and Score (0.2856882): There is a weak positive correlation between Weight_oz and Score. Heavier cameras tend to score slightly higher.
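The correlation matrix reports only point estimates. As a sketch of how one of them could be tested, cor.test() gives a p-value and a confidence interval for the Pearson correlation. The example uses Price_. and Score, with values copied from the data listing printed under Part D9:

```r
# Values copied from the data listing printed under Part D9
price <- c(264, 160, 240, 160, 144, 160, 160, 104, 104, 88, 72, 80, 72,
           216, 240, 160, 320, 96, 136, 120, 184, 144, 104, 64, 64, 80, 88, 104)
score <- c(74, 74, 73, 70, 70, 69, 68, 68, 67, 63, 60, 59, 54,
           73, 71, 69, 67, 65, 64, 64, 63, 61, 61, 60, 58, 54, 53, 50)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(price, score)

print(ct$estimate)   # sample correlation, matching the matrix entry for Price_. and Score
print(ct$p.value)    # p-value for H0: true correlation is zero
print(ct$conf.int)   # 95% confidence interval for the correlation
```

The same call can be repeated for any other pair of variables in the matrix.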

(PART D4) m1 -> Fit a regression model to predict Price_$ using Megapixels

m1 <- lm(Price_.~Megapixels, data=data)
summary(m1)

## 
## Call:
## lm(formula = Price_. ~ Megapixels, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -82.0  -48.5  -18.0   26.5  174.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   76.000     90.766   0.837    0.410
## Megapixels     5.000      6.991   0.715    0.481
## 
## Residual standard error: 66.85 on 26 degrees of freedom
## Multiple R-squared:  0.01929,    Adjusted R-squared:  -0.01842 
## F-statistic: 0.5115 on 1 and 26 DF,  p-value: 0.4808

The fitted model equation is: $\widehat{\text{Price}} = 76.000 + 5.000 \times \text{Megapixels}$

The residuals represent the differences between the observed and predicted values.

The minimum residual is -82.0, and the maximum residual is 174.0.

The intercept is 76.000, the estimated mean Price_. when Megapixels is zero. Since the observed Megapixels range from 10 to 16, this is an extrapolation with no practical meaning.

The coefficient for Megapixels is 5.000, suggesting that, on average, the predicted Price_. increases by 5.000 units for each one-unit increase in Megapixels.

The t-test for the coefficient of Megapixels checks whether it is significantly different from zero.

The p-value associated with Megapixels is 0.481, which is greater than the conventional significance level of 0.05. This suggests that we fail to reject the null hypothesis, indicating that the coefficient for Megapixels is not statistically different from zero.

The F-statistic tests the overall significance of the model. Its p-value is 0.4808, so the model as a whole is not statistically significant. It explains only about 1.9% of the variation in Price_. (Multiple R-squared: 0.01929).
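To make the fitted equation concrete, the sketch below refits the same simple regression on the Price_. and Megapixels values copied from the data listing under Part D9, then predicts the price of a 12-megapixel camera. The names camera and m1_check are illustrative only:

```r
# Values copied from the data listing printed under Part D9
camera <- data.frame(
  Megapixels = c(10, 12, 12, 10, 12, 12, 14, 10, 12, 16, 14, 10, 12,
                 16, 16, 14, 14, 14, 16, 12, 14, 12, 12, 12, 14, 12, 12, 14),
  Price_.    = c(264, 160, 240, 160, 144, 160, 160, 104, 104, 88, 72, 80, 72,
                 216, 240, 160, 320, 96, 136, 120, 184, 144, 104, 64, 64, 80, 88, 104)
)

# Same model as m1, refitted on the copied values
m1_check <- lm(Price_. ~ Megapixels, data = camera)
print(coef(m1_check))

# Predicted price for a 12-megapixel camera: 76 + 5 * 12 = 136
pred_12mp <- predict(m1_check, newdata = data.frame(Megapixels = 12))
print(unname(pred_12mp))
```

Because the slope is not significant, this prediction is barely better than using the overall mean price of 140.3.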

(PART D5) m2 -> Fit a regression model to predict Price_$ using Megapixels and Weight_oz

m2 <- lm(Price_. ~ Megapixels + Weight_oz, data=data)
summary(m2)

## 
## Call:
## lm(formula = Price_. ~ Megapixels + Weight_oz, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.321  -36.408   -0.915   33.685  139.679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -113.756    124.148  -0.916   0.3683  
## Megapixels     7.805      6.705   1.164   0.2554  
## Weight_oz     26.401     12.548   2.104   0.0456 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.83 on 25 degrees of freedom
## Multiple R-squared:  0.1668, Adjusted R-squared:  0.1002 
## F-statistic: 2.503 on 2 and 25 DF,  p-value: 0.1021

The coefficient for Intercept is -113.756, and its p-value is 0.3683. This suggests that the intercept is not significantly different from zero.

The coefficient for Megapixels is 7.805 with a p-value of 0.2554. This coefficient is not statistically significant at conventional significance levels (e.g., 0.05).

The coefficient for Weight_oz is 26.401 with a p-value of 0.0456, indicating that Weight_oz is statistically significant at a significance level of 0.05.

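The F-statistic is 2.503 with a p-value of 0.1021. This tests the overall significance of the model. Since the p-value exceeds 0.05, the model as a whole is not statistically significant at the 5% level, even though Weight_oz is individually significant. The model explains about 16.7% of the variation in Price_. (Adjusted R-squared: 0.1002).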

(PART D6) m3 -> Fit a regression model to predict Price_$ using Megapixels, Weight_oz and Score

m3 <- lm(Price_. ~ Megapixels + Weight_oz + Score, data=data)
summary(m3)

## 
## Call:
## lm(formula = Price_. ~ Megapixels + Weight_oz + Score, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.97 -27.98 -11.45  29.50 139.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -424.658    120.352  -3.528 0.001717 ** 
## Megapixels     6.655      5.173   1.286 0.210573    
## Weight_oz     13.935     10.101   1.379 0.180467    
## Score          6.188      1.454   4.256 0.000275 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.41 on 24 degrees of freedom
## Multiple R-squared:  0.5252, Adjusted R-squared:  0.4659 
## F-statistic:  8.85 on 3 and 24 DF,  p-value: 0.0003961

The overall model is statistically significant (F-statistic: 8.85, p-value: 0.0003961), indicating that at least one of the predictors has a significant effect on the dependent variable (Price_.).

Intercept: The intercept is -424.658, representing the estimated Price_. when all predictor variables are zero. This point lies far outside the observed data and has no practical interpretation.

Megapixels: The coefficient is 6.655, but it is not statistically significant (p-value: 0.210573). After accounting for Weight_oz and Score, there is no significant evidence of a relationship between Megapixels and Price_..

Weight_oz: The coefficient is 13.935, but it is not statistically significant (p-value: 0.180467). After accounting for Megapixels and Score, there is no significant evidence of a relationship between Weight_oz and Price_..

Score: The coefficient is 6.188, and it is statistically significant (p-value: 0.000275). Holding Megapixels and Weight_oz fixed, each additional Score point is associated with an estimated increase of 6.188 in Price_..

The residual standard error is 48.41 on 24 degrees of freedom. It estimates the standard deviation of the residuals, that is, the typical size of a prediction error in the units of Price_..

The distribution of residuals shows that they range from -60.97 to 139.33.

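Adjusted R-squared is 0.4659 (Multiple R-squared: 0.5252), so the model explains roughly half of the variation in Price_. after adjusting for the number of predictors.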
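The adjustment penalises each extra predictor. As a sketch, the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) can be applied to the Multiple R-squared values printed by summary() for m1, m2 and m3, with n = 28 observations and p predictors. The helper adj_r2() is hypothetical and written only for this illustration:

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n, and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 28  # number of observations, from str(data)

# Multiple R-squared values as printed by summary() for each model
adj <- c(
  m1 = adj_r2(0.01929, n, 1),
  m2 = adj_r2(0.1668,  n, 2),
  m3 = adj_r2(0.5252,  n, 3)
)
print(round(adj, 4))
```

The results agree with the Adjusted R-squared values in the summaries up to rounding of the inputs, and they show why m1 can end up below zero: its R-squared is too small to offset the penalty.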

(PART D7) Compare m1, m3 using anova, Adj R^2

anova(m1, m3)

## Analysis of Variance Table
## 
## Model 1: Price_. ~ Megapixels
## Model 2: Price_. ~ Megapixels + Weight_oz + Score
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     26 116176                                  
## 2     24  56244  2     59932 12.787 0.0001658 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F (F-statistic): The F-statistic tests the null hypothesis that the coefficients of all the additional predictors (Weight_oz and Score) are zero, i.e., that these predictors do not help explain the variation in the response variable. In this case, the F-statistic is 12.787.

Pr(>F) (p-value for F-statistic): The p-value associated with the F-statistic is 0.0001658, which is below the usual significance levels (0.05, 0.01, 0.001). This indicates that at least one of the added predictors (Weight_oz or Score) has a nonzero coefficient.

Signif. codes: The stars indicate the level of significance. In this case, ‘***’ means highly significant (p-value < 0.001), so the additional predictors in Model 2 significantly improve the model fit compared to Model 1.

The ANOVA table suggests that Model 2 (m3), which includes Megapixels, Weight_oz, and Score, is a significantly better fit than Model 1 (m1), which only includes Megapixels. The Adjusted R-squared values agree: -0.01842 for m1 against 0.4659 for m3.

The predictors Weight_oz and Score together contribute significantly to explaining the variation in the response variable Price_.
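The F-statistic in the ANOVA table can be reproduced by hand from the residual sums of squares. The sketch below uses the rounded RSS and degrees-of-freedom values printed in the table, so small rounding differences are possible; the variable names are illustrative:

```r
# Values as printed in the anova(m1, m3) table
rss_m1 <- 116176; df_m1 <- 26   # Model 1: Price_. ~ Megapixels
rss_m3 <- 56244;  df_m3 <- 24   # Model 2: Price_. ~ Megapixels + Weight_oz + Score

# Partial F: drop in RSS per added predictor, relative to the residual variance of the larger model
df_diff <- df_m1 - df_m3
f_stat  <- ((rss_m1 - rss_m3) / df_diff) / (rss_m3 / df_m3)

# Upper-tail probability of the F distribution with (df_diff, df_m3) degrees of freedom
p_val <- pf(f_stat, df_diff, df_m3, lower.tail = FALSE)

cat("F =", round(f_stat, 3), " p-value =", signif(p_val, 4), "\n")
```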

(PART D8) Compare m2, m3 using anova, Adj R^2

anova(m2, m3)

## Analysis of Variance Table
## 
## Model 1: Price_. ~ Megapixels + Weight_oz
## Model 2: Price_. ~ Megapixels + Weight_oz + Score
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     25 98699                                  
## 2     24 56244  1     42455 18.116 0.0002752 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic tests the null hypothesis that adding the variable Score to the model does not reduce the amount of variability left unexplained, i.e., that the coefficient of Score is zero.

p-value = 0.0002752: This is the p-value associated with the F-statistic. It is the probability of observing an F-statistic at least this large if the null hypothesis were true.

Significance codes: (***) highly significant (p < 0.001); (**) significant (0.001 < p < 0.01); (*) marginally significant (0.01 < p < 0.05); (.) weak evidence (0.05 < p < 0.1); no symbol means not significant (p > 0.1).

The low p-value (0.0002752) associated with the F-statistic in the ANOVA table suggests that including the variable Score in the model significantly improves the model fit.

We can reject the null hypothesis that the coefficient of Score is zero; Score contributes significantly to explaining the variability in Price_..
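Because only one predictor is added, this F-test is equivalent to the t-test for Score in summary(m3): the F-statistic equals the squared t value, and the two p-values coincide. A quick check using the rounded values printed in the two outputs (so agreement is only up to rounding):

```r
t_score <- 4.256    # t value of Score in summary(m3)
f_anova <- 18.116   # F-statistic from anova(m2, m3)
df_res  <- 24       # residual degrees of freedom of m3

# Squared t value versus the F-statistic
cat("t^2 =", round(t_score^2, 3), " F =", f_anova, "\n")

# p-value from the F distribution and from the two-sided t-test
p_from_f <- pf(f_anova, 1, df_res, lower.tail = FALSE)
p_from_t <- 2 * pt(t_score, df_res, lower.tail = FALSE)
cat("p (F) =", signif(p_from_f, 4), " p (t) =", signif(p_from_t, 4), "\n")
```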

Score should therefore be kept in the model. The output indicates that Model 2 (m3) explains the variability in Price_. better than Model 1 (m2), and the Adjusted R-squared values agree: 0.1002 for m2 against 0.4659 for m3.

(PART D9) Add an indicator variable named Nikon which is 1 if the Brand = 0 otherwise 0

data$Nikon <- ifelse(data$Brand.1 == 0, 1, 0)
data

##    Observation Brand Price_. Megapixels Weight_oz Score Brand.1 Nikon
## 1            1 Canon     264         10         7    74       1     0
## 2            2 Canon     160         12         5    74       1     0
## 3            3 Canon     240         12         7    73       1     0
## 4            4 Canon     160         10         6    70       1     0
## 5            5 Canon     144         12         5    70       1     0
## 6            6 Canon     160         12         7    69       1     0
## 7            7 Canon     160         14         5    68       1     0
## 8            8 Canon     104         10         7    68       1     0
## 9            9 Canon     104         12         5    67       1     0
## 10          10 Canon      88         16         5    63       1     0
## 11          11 Canon      72         14         5    60       1     0
## 12          12 Canon      80         10         6    59       1     0
## 13          13 Canon      72         12         7    54       1     0
## 14          14 Nikon     216         16         5    73       0     1
## 15          15 Nikon     240         16         7    71       0     1
## 16          16 Nikon     160         14         6    69       0     1
## 17          17 Nikon     320         14         7    67       0     1
## 18          18 Nikon      96         14         5    65       0     1
## 19          19 Nikon     136         16         6    64       0     1
## 20          20 Nikon     120         12         5    64       0     1
## 21          21 Nikon     184         14         6    63       0     1
## 22          22 Nikon     144         12         6    61       0     1
## 23          23 Nikon     104         12         6    61       0     1
## 24          24 Nikon      64         12         7    60       0     1
## 25          25 Nikon      64         14         7    58       0     1
## 26          26 Nikon      80         12         4    54       0     1
## 27          27 Nikon      88         12         5    53       0     1
## 28          28 Nikon     104         14         4    50       0     1
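The listing shows that Nikon is 1 exactly where Brand is "Nikon". As a sketch of how this could be checked without reading all 28 rows, the snippet below rebuilds the two brand columns (13 Canon rows followed by 15 Nikon rows, as in the listing) and compares the indicator built from the 0/1 code with one built directly from the brand name. The vector names are illustrative:

```r
# Brand columns rebuilt from the listing: 13 Canon rows, then 15 Nikon rows
brand      <- c(rep("Canon", 13), rep("Nikon", 15))
brand_code <- ifelse(brand == "Canon", 1, 0)   # plays the role of Brand.1

# Indicator as built in the report, from the 0/1 brand code
nikon_from_code <- ifelse(brand_code == 0, 1, 0)

# Alternative: build it directly from the brand name
nikon_from_name <- as.integer(brand == "Nikon")

# Cross-tabulation: every Nikon row should have indicator 1, every Canon row 0
print(table(brand, nikon_from_code))
```

On the real data frame, table(data$Brand, data$Nikon) would give the same cross-tabulation.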

(PART D10) m4 -> Fit a regression model using best of the m1, m2, and m3 and include the indicator Nikon

m4 <- lm(Price_. ~ Megapixels + Weight_oz + Score + Nikon, data=data)
summary(m4)

## 
## Call:
## lm(formula = Price_. ~ Megapixels + Weight_oz + Score + Nikon, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.226 -25.490  -2.939  19.960 127.943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -439.938    117.215  -3.753 0.001036 ** 
## Megapixels     2.334      5.723   0.408 0.687152    
## Weight_oz     12.152      9.870   1.231 0.230666    
## Score          7.166      1.542   4.648 0.000112 ***
## Nikon         34.123     21.684   1.574 0.129228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.99 on 23 degrees of freedom
## Multiple R-squared:  0.5714, Adjusted R-squared:  0.4968 
## F-statistic: 7.665 on 4 and 23 DF,  p-value: 0.0004448

The linear regression model (m4) predicts Price_. from Megapixels, Weight_oz, Score and Nikon; it extends m3, the best of the first three models. Score is the only significant predictor (p < 0.001), with a positive coefficient; Megapixels, Weight_oz and Nikon are not significant at the 0.05 level. The Nikon coefficient (34.123, p = 0.129) suggests that Nikon cameras are priced about 34 units higher than comparable Canon cameras, but this difference is not statistically significant. The model explains 57.14% of the variance (Adjusted R-squared: 0.4968), suggesting moderate predictive power. The residual standard error is 46.99. The F-statistic (p = 0.0004448) shows that the model as a whole is significant. The residuals should still be examined before relying on the model.

(PART D11) Write down the model m4 for Nikon and Canon separately and plot each model on the data graph

# Fit separate linear models for each brand.
# Note: these per-brand fits let every coefficient differ by brand, unlike m4,
# where the Nikon indicator only shifts the intercept.
m4_nikon <- lm(Price_. ~ Megapixels + Weight_oz + Score, data = subset(data, Nikon == 1))
m4_canon <- lm(Price_. ~ Megapixels + Weight_oz + Score, data = subset(data, Nikon == 0))

# Scatter plot of two columns (given by name), coloured by brand
create_scatter_plot <- function(x_col, y_col, title) {
  plot(data[[x_col]], data[[y_col]],
       col = ifelse(data$Nikon == 1, "red", "blue"),
       main = title, xlab = x_col, ylab = y_col)
  # Legend labels match the colour coding: Nikon = red, Canon = blue
  legend("topright", legend = c("Nikon", "Canon"), fill = c("red", "blue"))
}

# Create scatter plots
create_scatter_plot("Megapixels", "Price_.", "Scatter Plot: Price_. vs. Megapixels")

create_scatter_plot("Weight_oz", "Price_.", "Scatter Plot: Price_. vs. Weight_oz")

create_scatter_plot("Score", "Price_.", "Scatter Plot: Price_. vs. Score")


# Plotting
plot(data$Price_. ~ data$Megapixels, col = ifelse(data$Nikon == 1, "red", "blue"))

plot(data$Price_. ~ data$Weight_oz, col = ifelse(data$Nikon == 1, "red", "blue"))

plot(data$Price_. ~ data$Score, col = ifelse(data$Nikon == 1, "red", "blue"))
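The task also asks for m4 written out for each brand. Under m4 the Nikon indicator only shifts the intercept, so substituting Nikon = 1 and Nikon = 0 into the coefficients printed by summary(m4) gives two parallel equations (rounded as printed):

```latex
\begin{align*}
\text{Nikon } (\text{Nikon} = 1):\quad
\widehat{\text{Price}} &= (-439.938 + 34.123) + 2.334\,\text{Megapixels} + 12.152\,\text{Weight\_oz} + 7.166\,\text{Score} \\
                       &= -405.815 + 2.334\,\text{Megapixels} + 12.152\,\text{Weight\_oz} + 7.166\,\text{Score} \\[4pt]
\text{Canon } (\text{Nikon} = 0):\quad
\widehat{\text{Price}} &= -439.938 + 2.334\,\text{Megapixels} + 12.152\,\text{Weight\_oz} + 7.166\,\text{Score}
\end{align*}
```

These equations differ from separately fitted per-brand models such as m4_nikon and m4_canon, which allow every slope, not just the intercept, to vary by brand.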

Jawwad_AhmedD09

Jawwad Ahmed

2023-11-22