HW1: Data Analysis

install.packages("rmarkdown")
library(rmarkdown)

Part A. Variables

In the field of psychology, much research is done using self-report surveys with Likert scales (look it up!).

A1 What type of variable is a Likert response? (1 pt)

A Likert response is an ordinal variable: the answer choices have a natural order, but the gaps between adjacent choices are not necessarily equal.
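As a small illustration (not part of the assignment data), an ordinal Likert item can be represented in R as an ordered factor. The responses below are made up, and the names `levels_5` and `responses` are hypothetical; the sketch only shows that order comparisons and medians are meaningful for such data.

```r
# Hypothetical responses to a 5-point Likert item (made-up data)
levels_5 <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
responses <- factor(c("Agree", "Neutral", "Strongly agree", "Disagree", "Agree"),
                    levels = levels_5, ordered = TRUE)

# Order comparisons are defined for an ordered factor
agree_above_neutral <- responses[1] > responses[2]

# The median category is meaningful for ordinal data;
# a mean of the integer codes would assume equal gaps between choices
median_code <- median(as.integer(responses))
median_label <- levels_5[median_code]
print(agree_above_neutral)
print(median_label)
```

Treating the integer codes as interval data (for example, averaging them) is a modeling choice that goes beyond what the ordinal scale itself guarantees.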

A2 What are some (at least 2) benefits of using Likert scales? (2 pts)

One benefit is that Likert scales are easy for respondents to understand, which allows data to be gathered quickly. Another benefit is that they let us quantify subjective opinions so that they can be analyzed by software.

A3 What are some drawbacks, and dangers, of using them? Make sure you mention at least one ‘drawback’ and one ‘danger’ (a ‘drawback’ is a shortcoming, while a ‘danger’ implies potential harm). (2 pts)

A drawback of Likert scales is that, even though they let us quantify opinions, the results still lack complexity because responses are reduced to a few fixed categories. A danger is central tendency bias: people are less inclined to choose an extreme response, which can distort the results and lead to misleading conclusions.

Part B. Simple Linear Regression

Perform linear regressions on a dataset from a European Toyota car dealer on the sales records of used cars (Toyota Corolla). We would like to construct a reasonable linear regression model for the relationship between the sales prices of used cars and various explanatory variables (such as age, mileage, horsepower). We are interested to see what factors affect the sales price of a used car and by how much.

Data Description

• Id - ID number of each used car
• Model - Model name of each used car
• Price - The price (in Euros) at which each used car was sold
• Age - Age (in months) of each used car as of August 2004
• KM - Accumulated kilometers on odometer
• HP - Horsepower
• Metallic - Metallic color? (Yes = 1, No = 0)
• Automatic - Automatic transmission? (Yes = 1, No = 0)
• CC - Cylinder volume (in cubic centimeters)
• Doors - Number of doors
• Gears - Number of gears
• Weight - Weight (in kilograms)

The data is in the file “UsedCars.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

Read the data and show the first few rows

# Read in the data
data = read.csv("UsedCars.csv",sep = ",",header = TRUE)
# Show the first few rows of data
head(data, 3)

##   Id                                                  Model Price Age    KM  HP
## 1  1            TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500  27 19700 192
## 2  2      TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950  25 31461 192
## 3  3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950  22 43610 192
##   Metallic Automatic   CC Doors Gears Weight
## 1        0         0 1800     3     5   1185
## 2        0         0 1800     3     6   1185
## 3        0         0 1800     3     6   1185

Question B1: Exploratory Data Analysis

3 pts Use a scatter plot to describe the relationship between Price and the Accumulated kilometers on odometer. Describe the general trend (direction and form). Include plots and R-code used.

setwd("~/Regression Analysis/Homework")
getwd()  # parentheses are needed to call the function and print the working directory

data = read.csv("UsedCars.csv", header = TRUE, sep = ",")
head(data)

##   Id                                                  Model Price Age    KM  HP
## 1  1            TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500  27 19700 192
## 2  2      TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950  25 31461 192
## 3  3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950  22 43610 192
## 4  4     TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 19600  25 32189 192
## 5  5     TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 21500  31 23000 192
## 6  6     TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 22500  32 34131 192
##   Metallic Automatic   CC Doors Gears Weight
## 1        0         0 1800     3     5   1185
## 2        0         0 1800     3     6   1185
## 3        0         0 1800     3     6   1185
## 4        0         0 1800     3     6   1185
## 5        1         0 1800     3     6   1185
## 6        1         0 1800     3     6   1185

str(data)

## 'data.frame':    1264 obs. of  12 variables:
##  $ Id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Model    : chr  "TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors" "TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors" "TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors" "TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors" ...
##  $ Price    : int  21500 20950 19950 19600 21500 22500 22000 22750 17950 16750 ...
##  $ Age      : int  27 25 22 25 31 32 28 30 24 24 ...
##  $ KM       : int  19700 31461 43610 32189 23000 34131 18739 34000 21716 25563 ...
##  $ HP       : int  192 192 192 192 192 192 192 192 110 110 ...
##  $ Metallic : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ Automatic: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CC       : int  1800 1800 1800 1800 1800 1800 1800 1800 1600 1600 ...
##  $ Doors    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Gears    : int  5 6 6 6 6 6 6 5 5 5 ...
##  $ Weight   : int  1185 1185 1185 1185 1185 1185 1185 1185 1105 1065 ...

plot(data$KM, data$Price, main = "Scatterplot of Price vs Accumulated KM",
     ylab = "Price", xlab = "Accumulated KM")

There appears to be a negative (inverse) relationship between Price and accumulated KM: on average, as accumulated KM increases, the price decreases. The trend looks roughly linear.

3 pts What is the value of the correlation coefficient between Price and KM? Please interpret the strength of the correlation based on the correlation coefficient.

model_lr = lm(KM~ Price, data = data)
summary(model_lr)

## 
## Call:
## lm(formula = KM ~ Price, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83459 -16217  -2149  13685 105476 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.224e+05  2.244e+03   54.56   <2e-16 ***
## Price       -5.609e+00  2.007e-01  -27.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23720 on 1262 degrees of freedom
## Multiple R-squared:  0.3824, Adjusted R-squared:  0.3819 
## F-statistic: 781.4 on 1 and 1262 DF,  p-value: < 2.2e-16

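The R-squared is 0.3824. In simple linear regression the correlation coefficient is the square root of R-squared with the sign of the slope, so r = -sqrt(0.3824) ≈ -0.618. This indicates a moderate negative linear correlation between KM and the response variable Price.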
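The correlation coefficient can also be computed directly with cor(). The sketch below uses only the six rows printed by head(data) earlier, so its numbers will not match the full dataset; with the full file one would call cor(data$KM, data$Price) instead. It checks the simple-regression identity that r squared equals R-squared and that r takes the sign of the slope.

```r
# First six rows of Price and KM, copied from the head(data) output above.
# With the full file, use data$Price and data$KM instead.
price <- c(21500, 20950, 19950, 19600, 21500, 22500)
km    <- c(19700, 31461, 43610, 32189, 23000, 34131)

r   <- cor(km, price)            # Pearson correlation coefficient
fit <- lm(price ~ km)            # simple linear regression on the same rows
r2  <- summary(fit)$r.squared    # R-squared from the regression

# r^2 equals R-squared, and r has the same sign as the slope
slope_sign <- unname(sign(coef(fit)["km"]))
print(all.equal(r^2, r2))
print(all.equal(r, slope_sign * sqrt(r2)))
```

The same identity explains why regressing KM on Price or Price on KM gives the same R-squared: both equal the squared correlation.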

2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

Yes. Even though the correlation is not strong, the regression is statistically significant: the F statistic is large and the p-values are very small.

1 pt Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

Yes. Because of the low R-squared value and the potential outliers in the residuals, a transformation of the data could possibly improve the fit of the model.

Question B2: Fitting the Simple Linear Regression Model

Fit a linear regression model, named model_1, to evaluate the relationship between UsedCars Price and the accumulated KM. Do not transform the data. The function you should use in R is:

# Create the model
model_1 = lm(Price ~ KM, data)
summary(model_1)

## 
## Call:
## lm(formula = Price ~ KM, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7993  -1774   -457   1394  11437 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.494e+04  1.693e+02   88.24   <2e-16 ***
## KM          -6.817e-02  2.439e-03  -27.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2615 on 1262 degrees of freedom
## Multiple R-squared:  0.3824, Adjusted R-squared:  0.3819 
## F-statistic: 781.4 on 1 and 1262 DF,  p-value: < 2.2e-16

3 pts What are the model parameters and what are their estimates?

Model parameters:

β0 (intercept) = 1.494e4
β1 (slope for KM) = -6.817e-2
σ² (error variance) = 2615² ≈ 6.838e6, from the residual standard error of 2615

2 pts Write down the estimated simple linear regression equation.

Estimated Price = 1.494e4 - 6.817e-2 × KM

2 pts Interpret the estimated value of the $\beta_1$ parameter in the context of the problem.

For each additional kilometer on the odometer, the expected sales price decreases by about €0.068 on average.

2 pts Find a 95% confidence interval for the $\beta_1$ parameter. Is $\beta_1$ statistically significant at this level?

# confidence interval
confint(model_1, level= 0.95)

##                     2.5 %        97.5 %
## (Intercept)  1.461094e+04  1.527539e+04
## KM          -7.296019e-02 -6.339078e-02

The 95% confidence interval for 𝛽1 is (-0.07296, -0.06339). Because the interval does not contain zero, 𝛽1 is statistically significant at the 5% level.
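The interval reported by confint() can be reproduced by hand as estimate ± t critical value × standard error. A minimal sketch using the rounded values printed in the summary(model_1) output, so the endpoints agree with confint() only up to rounding:

```r
# Rounded values copied from the summary(model_1) output above
b1 <- -6.817e-2   # slope estimate for KM
se <- 2.439e-3    # standard error of the slope
df <- 1262        # residual degrees of freedom

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- b1 + c(-1, 1) * t_crit * se    # lower and upper endpoints
print(ci)

# The slope is significant at the 5% level if the interval excludes zero
excludes_zero <- ci[1] > 0 || ci[2] < 0
print(excludes_zero)
```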

2 pts Is $\beta_1$ statistically significantly negative at an $\alpha$-level of 0.01? What is the approximate p-value of this test?

# one-sided (lower-tail) test: is the slope significantly negative?
tvalue= -27.95
pt(tvalue,1262)

## [1] 1.661374e-134

Yes, 𝛽1 is statistically significantly negative at an alpha level of 0.01: the approximate one-sided p-value is 1.66e-134, which is far below 0.01.
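The t value used above can be recovered from the summary(model_1) output as the estimate divided by its standard error. A minimal sketch with the rounded printed values, assuming a lower-tailed test of H0: β1 ≥ 0 against Ha: β1 < 0:

```r
# Rounded values copied from the summary(model_1) output above
b1 <- -6.817e-2   # slope estimate for KM
se <- 2.439e-3    # standard error of the slope
df <- 1262        # residual degrees of freedom

t_stat  <- b1 / se              # test statistic for H0: beta1 = 0
p_lower <- pt(t_stat, df)       # lower-tail probability, P(T <= t_stat)

alpha <- 0.01
reject_h0 <- p_lower < alpha    # TRUE means significantly negative at alpha
print(t_stat)
print(reject_h0)
```

The p-value in the summary() table is two-sided; for a negative t statistic, the lower-tailed p-value is half of it.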

Question B3: Checking the Assumptions of the Model

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

3 pts Scatterplot of the data with KM on the x-axis and Price on the y-axis. Make sure you include a line showing the overall trend of the scatterplot

# scatter plot with KM on the x-axis and Price on the y-axis
plot(data$KM, data$Price, main = "Scatterplot of Price vs Accumulated KM",
     ylab = "Price", xlab = "Accumulated KM")
abline(model_1, col = "red", lwd = 2)

Model Assumption(s) it checks: Linearity

Interpretation: The data show a roughly linear, negative relationship, and there are no obvious outliers or high-leverage points visible. Therefore, the linearity assumption appears to be satisfied.

4 pts Residual plot - a plot of the residuals, $\hat\epsilon_i$, versus the fitted values, $\hat{y}_i$. Make sure you include a line showing the ideal baseline (hint: residual = 0) that serves as the comparison

# Residuals versus fitted values
plot(fitted(model_1), resid(model_1), main = "Residuals Plot",
     xlab = "Fitted Price", ylab = "Residuals", lwd = 2)
abline(h = 0, col = "red", lwd = 2)

Model Assumption(s) it checks: Constant Variance

Interpretation: The residuals are centered around zero, which is consistent with constant variance. However, they are grouped together in a way that could indicate a pattern within the residuals, outliers, or a skew to the right. Because this pattern is mild, the assumption is considered approximately satisfied, although the model has room for improvement.

4 pts Histogram and q-q plot of the residuals. Make sure you include a line in the q-q showing the ideal baseline that serves as the comparison in a q-q plot

#QQplot and histogram
library(car)

## Loading required package: carData

qqPlot(model_1$residuals, ylab = "Residuals")

## [1]  36 115

hist(model_1$residuals, xlab="Residuals")

Model Assumption(s) it checks: Normality

Interpretation: For the most part the residuals look normally distributed. The Q-Q plot shows a slight lift in the upper tail, which suggests a mild right skew, and the histogram agrees, also showing a skew to the right. Because the departure is small, the normality assumption is approximately satisfied.

Question B4: Prediction

Use the results from model_1 to discuss the effect of KM on the dependent variable: holding everything else equal, by how much would the sales price decrease if a car accumulated 10,000 more kilometers? What observations can you make about the result in the context of the problem? (3 pts)

#predictions
model_1pred = predict(model_1)
pred_10k = data.frame(KM = c(10000))
predictions = predict(model_1, pred_10k)
print(predictions-1.494e+04)

##         1 
## -678.5901

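The price would drop by about €682 (10,000 × 0.06817) if the car were to accumulate another 10,000 km, holding all other things constant. The printed value of -678.59 is slightly off because it subtracts the rounded intercept 1.494e+04 from the prediction rather than the exact fitted intercept.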
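Because the model is linear, the predicted change for an extra 10,000 km is 10,000 times the slope, whatever the starting mileage. A minimal sketch using the rounded coefficients from summary(model_1); the helper `pred_price` and the two mileages are hypothetical. With the fitted model available, the same quantity would be 10000 * coef(model_1)["KM"].

```r
# Rounded coefficients copied from the summary(model_1) output above
b0 <- 1.494e4     # intercept
b1 <- -6.817e-2   # slope for KM

# Hypothetical helper: predicted price at a given mileage
pred_price <- function(km) b0 + b1 * km

# Difference between two predictions 10,000 km apart; the intercept cancels
change_10k <- pred_price(30000) - pred_price(20000)
print(change_10k)

# Same result directly from the slope
print(all.equal(change_10k, 10000 * b1))
```

Taking the difference of two predictions avoids any dependence on how the intercept was rounded.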

Part C. Experiment!

You work for the National Park Service (NPS), and you absolutely love bears. Describe an imaginary (it can be realistic) scenario in which you get to run a one-way ANOVA on a few (3+) species of bears.

Part C1 What are you comparing (name the variable!)? What do you hope to learn from ANOVA? (2 pts)

The variable of interest is claw length, compared across different species of bears. We hope to see whether there is a significant difference in the average claw length among the species.

Part C2 Imagine that the results are “mixed”, meaning you can draw some conclusions and not others. Describe your conclusions and make sure you detail, with reference to your ANOVA, why the results were “mixed.” (3 pts)

• The ANOVA test reported a p-value lower than our alpha of 0.01, indicating a significant difference in claw length among the species.

• From there the species were grouped into pairs and their means compared using the Tukey method. This revealed that some pairs of bear species are significantly different, whereas others showed no significant difference.

• The results are mixed because the ANOVA only tells us that at least one species mean differs; the pairwise comparisons show that some pairs of species had a significant difference in claw length and others did not.
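A minimal sketch of this workflow in R, using simulated data. The species labels, sample sizes, means, and the column name `claw_cm` are all hypothetical and chosen only so that one species differs more than the others.

```r
# Simulated claw lengths (cm) for three hypothetical species; made-up values
set.seed(1)
bears <- data.frame(
  species = factor(rep(c("black", "brown", "polar"), each = 20)),
  claw_cm = c(rnorm(20, mean = 5.0, sd = 0.8),
              rnorm(20, mean = 5.2, sd = 0.8),
              rnorm(20, mean = 7.0, sd = 0.8))
)

# One-way ANOVA: does mean claw length differ among species?
fit_aov <- aov(claw_cm ~ species, data = bears)
print(summary(fit_aov))

# Tukey's HSD: which pairs of species differ? (99% family-wise confidence)
tukey <- TukeyHSD(fit_aov, conf.level = 0.99)
print(tukey$species)
```

Each row of the Tukey table is one pair of species; a "mixed" outcome corresponds to some rows having a small adjusted p-value and others not.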

Part C3 Now imagine that you have just been granted 3 months and $50,000 to continue this study (you’re a great grant writer and a very likable member of the NPS!). Describe some next steps you would take to clarify, reinforce and/or further explore your nascent investigation. You MUST reference using a ‘controlling’ variable somehow in your response. (5 pts)

After obtaining mixed results in the study, the next step would be to consider other factors (diet, environment, other animals in the area, etc.) that might influence the length of a bear’s claws. In addition, we would include controlling variables for claw length, such as sex and age. A two-way ANOVA would then incorporate these control variables, giving a narrower view of which mean claw lengths are being compared. A small sample size could also be an issue, so more observations are needed in each species sample to investigate the mean claw lengths accurately.
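A minimal sketch of how controlling variables could enter the analysis, again with simulated data. The column names (`sex`, `age_yr`, `claw_cm`), species labels, and effect sizes are hypothetical. Because aov() uses sequential sums of squares, listing age and sex first means the species term is tested after adjusting for them.

```r
# Simulated follow-up data with two hypothetical control variables
set.seed(2)
n <- 90
bears2 <- data.frame(
  species = factor(rep(c("black", "brown", "polar"), each = 30)),
  sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  age_yr  = round(runif(n, min = 2, max = 20))
)

# Made-up data-generating process: age, sex, and species all affect claw length
bears2$claw_cm <- 5 + 0.05 * bears2$age_yr +
  ifelse(bears2$sex == "M", 0.4, 0) +
  ifelse(bears2$species == "polar", 1.5, 0) +
  rnorm(n, mean = 0, sd = 0.7)

# Controls entered first, so species is assessed after accounting for them
fit_ctrl <- aov(claw_cm ~ age_yr + sex + species, data = bears2)
print(summary(fit_ctrl))
```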

Part D. Explain the meaning of a p-value!

Explain in detail what it means specifically, in a statistical sense, for any result to be “statistically significant” at a particular $\alpha$-level. In other words, explain the meaning and use of p-values. You should research this question, and you should expect your answer to be at least a paragraph long. (6 pts)

A result is statistically significant at a given alpha level when the observed data would be unlikely if the null hypothesis were true. When establishing statistical significance, a null and an alternative hypothesis are written. The null hypothesis states that there is no true effect, so any apparent effect is due to random chance, whereas the alternative hypothesis states that there is a true effect. Before running the test we choose the alpha value, which is the largest probability we are willing to accept of rejecting the null hypothesis when it is actually true (a Type I error). From there we run the test and compute the p-value: the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed in our sample. If the p-value is lower than the alpha value, the null hypothesis is rejected and the result is called statistically significant. The p-value is not the probability that the null hypothesis is true, and a significant result does not prove that the effect is real or large; it only says that the data would be surprising under the null hypothesis.
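One way to see what alpha controls is a small simulation in which the null hypothesis is true by construction. The sketch below (sample size, number of repetitions, and seed are arbitrary choices) runs many one-sample t-tests on data with a true mean of zero and records how often the p-value falls below alpha.

```r
# Simulate many experiments where H0 (true mean = 0) actually holds
set.seed(42)
alpha  <- 0.05
n_sims <- 2000

# Each replicate draws 30 standard-normal values and tests H0: mean = 0
p_vals <- replicate(n_sims, t.test(rnorm(30), mu = 0)$p.value)

# Share of experiments that would be declared "significant" by chance alone
false_positive_rate <- mean(p_vals < alpha)
print(false_positive_rate)
```

The printed proportion should land close to alpha, which is the sense in which alpha is the Type I error rate we agree to tolerate.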