Part A. Variables

In the field of psychology, much research is done using self-report surveys using Likert scales (look it up!).

A1

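What type of variable is a Likert response? (1 pt)

A Likert response is an ordinal variable: its categories have a natural order, but the spacing between adjacent categories is not guaranteed to be equal.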
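As a minimal sketch of what "ordinal" means in practice, a Likert item can be stored in R as an ordered factor. The responses and the five-point scale below are made up for illustration and are not part of the assignment data.

```r
# Hypothetical responses to one five-point Likert item (illustrative only)
responses <- c("Agree", "Neutral", "Strongly Agree", "Disagree", "Agree")
scale_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                  "Agree", "Strongly Agree")

# An ordered factor records the ranking of the categories
likert <- factor(responses, levels = scale_levels, ordered = TRUE)

# Order comparisons are meaningful for ordinal data
print(likert < "Agree")

# The median is a sensible summary; it is computed on the rank codes
median_code <- median(as.integer(likert))
print(levels(likert)[median_code])
```

The median is used here rather than the mean because a mean of the rank codes would assume equal spacing between categories.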

A2

What are some (at least 2) benefits of using Likert scales? (2 pts)

Some benefits of using Likert scales are:
1. They are easy to understand, since respondents simply rate each statement on the given scale.
2. They are flexible and can be customized to fit the needs of the researcher.
3. They provide coded responses that can be analyzed quantitatively.

A3

What are some drawbacks — and dangers — of using them? Make sure you mention at least one ‘drawback’ and one ‘danger’ (a ‘drawback’ is a shortcoming, while a ‘danger’ implies potential harm). (2 pts)

A drawback of a Likert scale is that the limited set of responses can force participants to commit to an answer they do not fully agree with. For example, if a scale offered only Strongly Agree, Agree, Neutral, and Disagree, a participant who strongly disagrees could not answer the way they truly feel.
A danger of using a Likert scale is misinterpretation by the participant or the researcher, which can lead to incorrect conclusions.
Another danger is that participants may agree with statements regardless of their true opinion, producing inaccurate data and therefore inaccurate conclusions.

Part B. Simple Linear Regression

Perform linear regressions on a dataset from a European Toyota car dealer on the sales records of used cars (Toyota Corolla). We would like to construct a reasonable linear regression model for the relationship between the sales prices of used cars and various explanatory variables (such as age, mileage, horsepower). We are interested to see what factors affect the sales price of a used car and by how much.

Data Description

Id - ID number of each used car
Model - Model name of each used car
Price - The price (in Euros) at which each used car was sold
Age - Age (in months) of each used car as of August 2004
KM - Accumulated kilometers on odometer
HP - Horsepower
Metallic - Metallic color? (Yes = 1, No = 0)
Automatic - Automatic transmission? (Yes = 1, No = 0)
CC - Cylinder volume (in cubic centimeters)
Doors - Number of doors
Gears - Number of gears
Weight - Weight (in kilograms)

The data is in the file “UsedCars.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

Read the data and show the first few rows.

# Read in the data
data = read.csv("UsedCars.csv",sep = ",",header = TRUE)
# Show the first few rows of data
head(data, 3)

##   Id                                                  Model Price Age    KM  HP
## 1  1            TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500  27 19700 192
## 2  2      TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950  25 31461 192
## 3  3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950  22 43610 192
##   Metallic Automatic   CC Doors Gears Weight
## 1        0         0 1800     3     5   1185
## 2        0         0 1800     3     6   1185
## 3        0         0 1800     3     6   1185

Question B1: Exploratory Data Analysis

3 pts Use a scatter plot to describe the relationship between Price and the Accumulated kilometers on odometer. Describe the general trend (direction and form). Include plots and R-code used.

# Your code here...
# Extract the columns by name rather than by position
price = data$Price
km = data$KM
plot(km, price, main="Price vs. KM", xlab="Accumulated KM", ylab="Price (Euros)", pch=19)

As accumulated kilometers increase, the price of the used car tends to decrease. The association between Price and KM appears negative and slightly nonlinear.

3 pts What is the value of the correlation coefficient between Price and KM? Please interpret the strength of the correlation based on the correlation coefficient.

# Your code here...
cor(price, km)

## [1] -0.6183873

The correlation coefficient between Price and KM is -0.6183873. The negative sign indicates that price tends to fall as kilometers increase, and the magnitude suggests a strong linear correlation (an absolute value between 0.5 and 1 is considered strong).
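As a quick check on what this coefficient implies, squaring it gives the share of the variation in Price that a straight-line relationship with KM can account for. This is a standard identity for simple linear regression, and the rounded value agrees with the Multiple R-squared reported by summary(model_1) in Question B2:

```latex
r^2 = (-0.6183873)^2 \approx 0.3824
```

So roughly 38% of the variation in Price is explained linearly by KM.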

2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

Despite the strong correlation coefficient, the scatter plot suggests that a nonlinear model would describe this relationship better than a simple linear regression.

1 pts Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

Based on the analysis above, I would pursue a transformation of the data.

Question B2: Fitting the Simple Linear Regression Model

Fit a linear regression model, named model_1, to evaluate the relationship between UsedCars Price and the accumulated KM. Do not transform the data. The function you should use in R is:

# Your code here...
model_1 = lm(price~km)
summary(model_1)

## 
## Call:
## lm(formula = price ~ km)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7993  -1774   -457   1394  11437 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.494e+04  1.693e+02   88.24   <2e-16 ***
## km          -6.817e-02  2.439e-03  -27.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2615 on 1262 degrees of freedom
## Multiple R-squared:  0.3824, Adjusted R-squared:  0.3819 
## F-statistic: 781.4 on 1 and 1262 DF,  p-value: < 2.2e-16

3 pts What are the model parameters and what are their estimates?

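intercept ($\beta_0$): 1.494e+04 (14,940)
km slope ($\beta_1$): -6.817e-02 (-0.06817)
error variance ($\sigma^2$): estimated by the squared residual standard error, 2615^2, approximately 6.84e+06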

2 pts Write down the estimated simple linear regression equation.

price = 14,940 - 0.06817*km
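To make the fitted equation concrete, here is a small sketch that evaluates it at a few odometer readings. The coefficients are the rounded estimates from the summary output, and predict_price is a hypothetical helper name, not something defined in the assignment.

```r
# Rounded estimates copied from summary(model_1)
b0 <- 14940      # intercept, in euros
b1 <- -0.06817   # slope, in euros per kilometer

# Hypothetical helper: estimated price for a given odometer reading
predict_price <- function(km) b0 + b1 * km

# Estimated prices at 0 km, 50,000 km, and 100,000 km
print(predict_price(c(0, 50000, 100000)))
```

Because the coefficients are rounded, these values will differ slightly from what predict(model_1, ...) returns on the real data.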

2 pts Interpret the estimated value of the $\beta_1$ parameter in the context of the problem.

For each additional kilometer on the odometer, the expected sales price decreases by about 0.068 euros (roughly 68 euros per 1,000 km).

2 pts Find a 95% confidence interval for the $\beta_1$ parameter. Is $\beta_1$ statistically significant at this level?

# Your code here...
confint(model_1, level=0.95)

##                     2.5 %        97.5 %
## (Intercept)  1.461094e+04  1.527539e+04
## km          -7.296019e-02 -6.339078e-02

The 95% confidence interval for the $\beta_1$ parameter is (-0.07296, -0.06339). Since the interval does not contain zero, $\beta_1$ is statistically significant at the 5% level, which agrees with the very small p-value (<2e-16) in the model summary.
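The interval reported by confint() can be reproduced approximately by hand as estimate plus or minus a t critical value times the standard error. The sketch below uses the rounded estimate, standard error, and residual degrees of freedom from the summary output, so it matches confint() only to a few digits.

```r
# Values copied from the summary(model_1) output (rounded)
estimate  <- -6.817e-02
std_error <- 2.439e-03
df_resid  <- 1262

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 5))
```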

2 pts Is $\beta_1$ statistically significantly negative at an $\alpha$-level of 0.01? What is the approximate p-value of this test?

# Your code here...
tvalue = summary(model_1)$coefficients['km','t value']
df = model_1$df.residual
# Lower-tail probability for the one-sided alternative beta_1 < 0
pvalue = pt(tvalue, df, lower.tail=TRUE)
pvalue

## [1] 1.561401e-134

Yes, $\beta_1$ is statistically significantly negative at an $\alpha$-level of 0.01, since the one-sided p-value (1.561401e-134) is far below 0.01.
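For reference, the one-sided test behind this p-value can be written out as follows. The estimate, standard error, and degrees of freedom are taken from the summary(model_1) output, and the statement of the hypotheses is the usual formulation for a test of a negative slope:

```latex
H_0: \beta_1 \ge 0 \quad \text{vs.} \quad H_a: \beta_1 < 0, \qquad
t = \frac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)}
  = \frac{-0.06817}{0.002439} \approx -27.95, \qquad
p = P(T_{1262} \le -27.95) \approx 1.56 \times 10^{-134}
```

Because the estimate is negative, this one-sided p-value is half of the two-sided p-value from the coefficient table.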

Question B3: Checking the Assumptions of the Model

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

3 pts Scatterplot of the data with KM on the x-axis and Price on the y-axis. Make sure you include a line showing the overall trend of the scatterplot

# Your code here...
library(ggplot2)
# Map the data frame's own columns so the axis labels match the variable names
ggplot(data, aes(x=KM, y=Price)) + geom_point() + geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

4 pts Residual plot - a plot of the residuals, $\hat\epsilon_i$, versus the fitted values, $\hat{y}_i$. Make sure you include a line showing the ideal baseline (hint: residual = 0) that serves as the comparison

# Your code here...
plot(fitted(model_1), residuals(model_1), main="Residuals vs Fitted", xlab="Fitted values", ylab="Residuals")
# Horizontal reference line at residual = 0
abline(h=0)

Model assumptions assessed: constant variance and independence (uncorrelated errors).
The residuals are concentrated around fitted values of 9,000 to 12,000, and their spread grows as the fitted values increase. Thus, the variance does not appear to be constant.

4 pts Histogram and q-q plot of the residuals. Make sure you include a line in the q-q showing the ideal baseline that serves as the comparison in a q-q plot

# Your code here...
hist(residuals(model_1),main="Histogram of residuals",xlab="Residuals")

qqnorm(residuals(model_1))
qqline(residuals(model_1))

Model assumption: normality of the residuals.
The histogram is slightly right-skewed, and in the q-q plot the tails depart from the reference line. Overall, the residuals do not appear to be normally distributed.

Question B4: Prediction

Use the results from model_1 to discuss the effect of KM on the dependent variable: holding everything else equal, by how much would the sales price decrease if a car accumulated 10,000 more kilometers? What observations can you make about the result in the context of the problem? (3 pts)

# Your code here...
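# Predicted price at 0 km and at 10,000 km
original = predict(model_1, data.frame(km=0), interval="prediction")
new = predict(model_1, data.frame(km=10000), interval="prediction")
# Only the 'fit' column of this difference is meaningful; subtracting interval bounds does not give an interval
new - original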

##         fit       lwr       upr
## 1 -681.7549 -679.1921 -684.3176

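Holding everything else equal, the sales price would decrease by about 681.75 euros if a car accumulated 10,000 more kilometers (the fit column above). This matches the slope estimate: $\beta_1$ = -0.06817 euros per kilometer multiplied by 10,000 kilometers. The lwr and upr differences should be ignored, since subtracting the bounds of two prediction intervals does not produce a valid interval for the change.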
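The same quantity, with a rough measure of its uncertainty, can be worked out directly from the slope and its 95% confidence interval in Question B2. Scaling the interval for $\beta_1$ by 10,000 km is an assumption-light way to get an interval for the change in expected price, using the rounded values from the output:

```latex
\begin{aligned}
\Delta\widehat{\text{price}} &= \hat\beta_1 \times 10{,}000
  \approx -0.06817 \times 10{,}000 \approx -682 \text{ euros} \\
\text{95\% CI} &\approx (-0.07296 \times 10{,}000,\; -0.06339 \times 10{,}000)
  = (-729.6,\; -633.9) \text{ euros}
\end{aligned}
```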

Part C. Experiment!

You work for the National Park Service (NPS), and you absolutely love bears. Describe an imaginary (it can be realistic) scenario in which you get to run a one-way ANOVA on a few (3+) species of bears.

Part C1

What are you comparing (name the variable!)? What do you hope to learn from ANOVA? (2 pts)

The variable being compared is body weight, measured for bears of each species in the parks. From the ANOVA we hope to learn whether mean weight differs among the species.

Part C2

Imagine that the results are “mixed”, meaning you can draw some conclusions and not others. Describe your conclusions and make sure you detail, with reference to your ANOVA, why the results were “mixed.” (3 pts)

If the results were “mixed”, the overall ANOVA F-test may indicate that not all of the species means are equal, without telling us which pairs of means differ. To find out, we perform Tukey's HSD test on the pairwise comparisons. The result might show that some pairwise p-values are low and others are high. For example, the study may yield a low p-value for brown bears compared to pandas, but large p-values for brown bears compared to cave bears and for brown bears compared to black bears. This means that the average weight of a brown bear is significantly different from that of a panda, but we cannot conclude that it differs from that of a cave bear or a black bear.
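The workflow described above can be sketched in R with simulated data. Everything here is invented for illustration: the species means, the spread, and the sample sizes are assumptions, chosen so that brown and black bears are similar while pandas are lighter.

```r
set.seed(1)

# Simulated weights (kg) for three species; all numbers are made up
bears <- data.frame(
  species = factor(rep(c("black", "brown", "panda"), each = 20)),
  weight  = c(rnorm(20, mean = 195, sd = 30),
              rnorm(20, mean = 200, sd = 30),
              rnorm(20, mean = 110, sd = 30))
)

# One-way ANOVA: is there any difference among the species means?
fit <- aov(weight ~ species, data = bears)
overall_p <- summary(fit)[[1]][["Pr(>F)"]][1]
print(overall_p)

# Tukey's HSD: which pairs of species differ?
tukey <- TukeyHSD(fit)
print(tukey$species)
```

A "mixed" result would show up as a small overall p-value together with a Tukey table in which only some rows have a small adjusted p-value.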

Part C3

Now imagine that you have just been granted 3 months and $50,000 to continue this study (you’re a great grant writer and a very likable member of the NPS!). Describe some next steps you would take to clarify, reinforce and/or further explore your nascent investigation. You MUST reference using a ‘controlling’ variable somehow in your response. (5 pts)

We can control for sex in the study, since differences between male and female bears may have skewed the earlier results. One option is to hold sex constant by sampling only male bears and repeating the one-way ANOVA on their weights. Alternatively, we can record sex for every bear and include it as a control variable in the model alongside species, an ANCOVA-style analysis. Either way, the next step is to collect the new sample and test whether the average weights of the bear species are still significantly different once sex is accounted for.
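Two ways of controlling for sex can be sketched on simulated data: adjusting for it in the model, or restricting the sample to males. The data frame, effect sizes, and sample sizes below are all hypothetical.

```r
set.seed(42)
n <- 30  # bears per species (assumed)

# Simulated design: three species, alternating female/male
bears <- data.frame(
  species = factor(rep(c("black", "brown", "panda"), each = n)),
  sex     = factor(rep(c("female", "male"), times = 3 * n / 2))
)

# Made-up species baselines (kg) plus a made-up 40 kg male effect
base <- c(black = 150, brown = 200, panda = 100)
bears$weight <- base[as.character(bears$species)] +
  ifelse(bears$sex == "male", 40, 0) +
  rnorm(nrow(bears), sd = 20)

# Option 1: adjust for sex, then test species
fit <- lm(weight ~ sex + species, data = bears)
adjusted <- anova(fit)
print(adjusted)

# Option 2: hold sex constant by analysing males only
males <- subset(bears, sex == "male")
print(summary(aov(weight ~ species, data = males)))
```

Option 1 keeps the full sample, while Option 2 is simpler but halves the sample size and only speaks to male bears.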

Part D. Explain the meaning of a p-value!

Explain in detail what it means specifically, in a statistical sense, for any result to be “statistically significant” at a particular $\alpha$-level. In other words, explain the meaning and use of p-values. You should research this question, and you should expect your answer to be at least a paragraph long. (6 pts)

A result is statistically significant at a particular $\alpha$-level when the data would be unlikely to arise if the null hypothesis were true, so we reject the null hypothesis. If an independent variable is statistically significant, the observed association between it and the dependent variable is larger than chance variation alone would plausibly produce. This matters because an experiment works with a sample rather than the whole population (it is usually impossible or very difficult to survey the entire population of interest). With a sample, an apparent effect may arise simply from which units happened to be selected, and a significance test guards against mistaking that sampling variation for a real effect.

The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It is not the probability that the null hypothesis is true or that the result is due to chance. A smaller p-value indicates stronger evidence against the null hypothesis. The significance level $\alpha$ is a threshold chosen before the analysis, and it equals the probability of rejecting a true null hypothesis (a Type I error). If the p-value is lower than $\alpha$, the null hypothesis is rejected at that significance level; otherwise we fail to reject it, which is not the same as accepting it.
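The role of $\alpha$ as a false-positive rate can be illustrated with a short simulation. This is only a sketch: the group size of 20, the 5,000 repetitions, and the use of a two-sample t-test are arbitrary choices made for the example.

```r
set.seed(123)

# Draw both groups from the same distribution, so the null hypothesis is true,
# and record the p-value of a two-sample t-test each time
p_values <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)

# Share of experiments that would be declared significant at alpha = 0.05
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

When the null hypothesis is true, about 5% of such experiments still produce a p-value below 0.05, which is what testing at $\alpha$ = 0.05 means.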

HW1 Peer Assessment