In the field of psychology, much research is done using self-report surveys built on Likert scales (look it up!).
What type of variable is a Likert response? (1 pt) A Likert response variable is an ordinal variable.
What are some (at least 2) benefits of using Likert scales? (2 pts) Some benefits of using Likert scales are:

1. They are easy to understand, since respondents simply rank their preferences/choices on the given scale.
2. Likert scales are flexible and can be customized to fit the needs of the researcher.
3. Likert scales provide numerical data for quantitative analysis.
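For illustration only (a minimal sketch with made-up response labels, not part of the assignment), Likert responses can be stored in R as an ordered factor, which preserves the ordinal structure while still allowing numeric coding for quantitative summaries:

# Hypothetical 5-point Likert responses (made-up data for illustration)
responses = c("Agree", "Neutral", "Strongly agree", "Disagree", "Agree")
likert = factor(responses,
                levels = c("Strongly disagree", "Disagree", "Neutral",
                           "Agree", "Strongly agree"),
                ordered = TRUE)
likert             # ordinal variable: the levels have a natural order
as.numeric(likert) # integer codes 1-5, often used for quantitative analysis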
What are some drawbacks — and dangers — of using them? Make sure you mention at least one ‘drawback’ and one ‘danger’ (a ‘drawback’ is a shortcoming, while a ‘danger’ implies potential harm). (2 pts)
Perform linear regressions on a dataset of used-car (Toyota Corolla) sales records from a European Toyota dealer. We would like to construct a reasonable linear regression model for the relationship between the sales prices of used cars and various explanatory variables (such as age, mileage, and horsepower). We are interested in seeing which factors affect the sales price of a used car and by how much.
Data Description
Id - ID number of each used car
Model - Model name of each used car
Price - The price (in Euros) at which each used car was sold
Age - Age (in months) of each used car as of August 2004
KM - Accumulated kilometers on the odometer
HP - Horsepower
Metallic - Metallic color? (Yes = 1, No = 0)
Automatic - Automatic transmission? (Yes = 1, No = 0)
CC - Cylinder volume (in cubic centimeters)
Doors - Number of doors
Gears - Number of gears
Weight - Weight (in kilograms)
The data is in the file "UsedCars.csv". To read the data in R, save the file in your working directory (make sure you have changed the directory if it differs from the R working directory) and read the data using the R function read.csv().
Read the data and show the first few rows.
# Read in the data
data = read.csv("UsedCars.csv", sep = ",", header = TRUE)
# Show the first few rows of data
head(data, 3)
## Id Model Price Age KM HP
## 1 1 TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500 27 19700 192
## 2 2 TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950 25 31461 192
## 3 3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950 22 43610 192
## Metallic Automatic CC Doors Gears Weight
## 1 0 0 1800 3 5 1185
## 2 0 0 1800 3 6 1185
## 3 0 0 1800 3 6 1185
# Your code here...
price = data[,3]
km = data[,5]
plot(km, price, main="Price and KM", xlab="Accumulated KM", ylab="Price", pch=19)
The general trend is that as accumulated KM increases, the price of the used vehicle decreases. There appears to be a negative and slightly nonlinear association between price and accumulated KM.
# Your code here...
cor(price, km)
## [1] -0.6183873
The value of the correlation coefficient between Price and KM is -0.6184. This suggests there is a strong correlation between the two variables (values between ±0.5 and ±1 are commonly considered strong).
Despite the strong correlation coefficient, the scatterplot suggests that a nonlinear model might better describe this relationship.
Based on the analysis above, I would pursue a transformation of the data.
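As a sketch of what such a transformation could look like (model_1 below is still fit on the untransformed data, as required; model_log is a hypothetical name and the log transform is only one option):

# Sketch only: does a log transform of KM straighten the relationship?
# (assumes all KM values are positive so log(km) is defined)
cor(price, km)        # untransformed correlation
cor(price, log(km))   # correlation after log-transforming KM
model_log = lm(price ~ log(km))   # hypothetical alternative, not model_1
summary(model_log)$r.squared      # compare with model_1's R-squared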
Fit a linear regression model, named model_1, to evaluate the relationship between UsedCars Price and the accumulated KM. Do not transform the data. The function you should use in R is lm().
# Your code here...
model_1 = lm(price~km)
summary(model_1)
##
## Call:
## lm(formula = price ~ km)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7993 -1774 -457 1394 11437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.494e+04 1.693e+02 88.24 <2e-16 ***
## km -6.817e-02 2.439e-03 -27.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2615 on 1262 degrees of freedom
## Multiple R-squared: 0.3824, Adjusted R-squared: 0.3819
## F-statistic: 781.4 on 1 and 1262 DF, p-value: < 2.2e-16
Intercept (\(\beta_0\)): 1.494e+04 (14,940)
KM (\(\beta_1\)): -6.817e-02 (-0.06817)
The fitted regression equation is \(\widehat{\text{price}} = 14{,}940 - 0.06817 \times \text{km}\).
For every additional kilometer accumulated, the predicted price falls by about 0.068 Euros.
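As a quick sanity check of the fitted equation, the sketch below predicts the price for an arbitrary example value of 50,000 accumulated km (the value is made up for illustration):

# Predicted price for a hypothetical car with 50,000 accumulated km
coef(model_1)["(Intercept)"] + coef(model_1)["km"] * 50000
# Equivalent, using predict()
predict(model_1, data.frame(km = 50000))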
# Your code here...
confint(model_1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 1.461094e+04 1.527539e+04
## km -7.296019e-02 -6.339078e-02
The 95% confidence interval for the \(\beta_1\) parameter is (-0.07296, -0.06339). Since the interval excludes zero and the p-value is very small (< 2e-16), well below 5%, the estimate for \(\beta_1\) is statistically significant.
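The same interval can be reconstructed by hand from the coefficient table (estimate \(\pm\) t critical value \(\times\) standard error); a short sketch:

# Reconstruct the 95% CI for beta_1 from the summary output
est   = summary(model_1)$coefficients["km", "Estimate"]
se    = summary(model_1)$coefficients["km", "Std. Error"]
tcrit = qt(0.975, df = model_1$df.residual)
c(est - tcrit * se, est + tcrit * se)   # matches confint(model_1)["km", ]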
# Your code here...
tvalue = summary(model_1)$coefficients['km','t value']
df = model_1$df.residual
pvalue = pt(tvalue, df, lower.tail = TRUE)
pvalue
## [1] 1.561401e-134
Yes, \(\beta_1\) is statistically significantly negative at an \(\alpha\)-level of 0.01, since the one-sided p-value (1.56e-134) is less than 0.01.
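Equivalently, because the estimated slope is negative, the one-sided p-value is half of the two-sided p-value reported by summary(); a quick consistency check:

# Two-sided p-value from the coefficient table, halved for the one-sided test
summary(model_1)$coefficients["km", "Pr(>|t|)"] / 2
pvalue < 0.01   # TRUE: significant at alpha = 0.01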
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
# Your code here...
library(ggplot2)
ggplot(data, aes(x=km, y=price)) + geom_point() + geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
# Your code here...
plot(model_1$fitted, model_1$residuals, main="Fitted vs Residual", xlab="Fitted values", ylab="Residual")
abline(0,0)
Model assumptions: independence and constant variance. The residuals are more concentrated in the 9,000-12,000 range of fitted values and show larger variance as the fitted values increase. Thus, the constant variance assumption appears to be violated.
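As a supplement to the visual check, a formal test of constant variance could be run; the sketch below assumes the lmtest package is installed (it is not part of the original analysis):

# Breusch-Pagan test for non-constant variance (requires the lmtest package)
# install.packages("lmtest")   # uncomment if the package is not installed
library(lmtest)
bptest(model_1)   # a small p-value suggests heteroscedasticity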
# Your code here...
hist(residuals(model_1),main="Histogram of residuals",xlab="Residuals")
qqnorm(residuals(model_1))
qqline(residuals(model_1))
Model assumption: normality of the residuals. The histogram is slightly skewed to the right, and the QQ plot shows that the tails do not align with the expected normal line. Overall, the residuals do not appear to be normally distributed.
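A formal normality test can back up the graphical assessment; base R's shapiro.test() applies here since the sample size (1,264) is below its limit of 5,000. This is a supplementary sketch, not part of the original analysis:

# Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(model_1))   # a small p-value suggests non-normal residuals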
Use the results from model_1 to discuss the effect of KM on the dependent variable: holding everything else equal, by how much would the sales price decrease if a car accumulated 10,000 more kilometers? What observations can you make about the result in the context of the problem? (3 pts)
# Your code here...
original = predict(model_1, data.frame(km=0), interval="predict")
new = predict(model_1, data.frame(km=10000), interval="predict")
new - original
## fit lwr upr
## 1 -681.7549 -679.1921 -684.3176
Holding everything else equal, the sales price would decrease by about 681.75 Euros if a car accumulated 10,000 more kilometers. This value makes sense: \(\beta_1\) is -0.06817 Euros per kilometer, and scaling it up to 10,000 kilometers gives -681.75 Euros.
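The same figure follows directly from the slope; a one-line check:

# Effect of 10,000 additional kilometers, computed directly from the slope
10000 * coef(model_1)["km"]   # approximately -681.75 Euros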
You work for the National Park Service (NPS), and you absolutely love bears. Describe an imaginary (it can be realistic) scenario in which you get to run a one-way ANOVA on a few (3+) species of bears.
What are you comparing (name the variable!)? What do you hope to learn from ANOVA? (2 pts)
We can compare the average weight of each species of bear in the parks. We hope to learn whether bear species has a significant effect on average weight.
Imagine that the results are “mixed”, meaning you can draw some conclusions and not others. Describe your conclusions and make sure you detail, with reference to your ANOVA, why the results were “mixed.” (3 pts)
If the results were "mixed", it may imply that the overall F-test confirms that not all the species means are equal; however, we also want to determine which pairwise differences in means are significant. To do this, we perform Tukey's HSD test. The result might show that some of the p-values for the multiple pairwise comparisons were low and some were high. For example, the study may have had a low p-value for brown bears compared to pandas, but large p-values for brown bears compared to cave bears and brown bears compared to black bears. This would mean that the average weight of a brown bear is significantly different from that of pandas but not from that of cave bears or black bears.
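A minimal sketch of how this would look in R, using simulated data; the data frame bear_data, its columns, and all values are made up for illustration (the sex and age columns are included only for the follow-up sketch below):

# Hypothetical simulated bear data (all values made up for illustration)
set.seed(1)
bear_data = data.frame(
  species = factor(rep(c("Brown", "Black", "Panda"), each = 30)),
  sex     = factor(rep(c("Male", "Female"), length.out = 90)),
  age     = runif(90, 2, 25),
  weight  = c(rnorm(30, 250, 40), rnorm(30, 240, 40), rnorm(30, 100, 20))
)
bear_aov = aov(weight ~ species, data = bear_data)
summary(bear_aov)    # overall F-test: is at least one species mean different?
TukeyHSD(bear_aov)   # pairwise comparisons; "mixed" results would show up here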
Now imagine that you have just been granted 3 months and $50,000 to continue this study (you’re a great grant writer and a very likable member of the NPS!). Describe some next steps you would take to clarify, reinforce and/or further explore your nascent investigation. You MUST reference using a ‘controlling’ variable somehow in your response. (5 pts)
We can control for sex within our study, as sex may have skewed the earlier results. By controlling for sex (say, only analyzing the average weight of the male bears from the parks), we restrict the comparison by design and may get cleaner results. Note that if we instead keep all bears and add a continuous control variable such as age as a covariate, we should use an ANCOVA model rather than a plain ANOVA. The next steps would be to sample male bear weights (recording any covariates along the way) and then fit the appropriate model to determine whether the average weights of the bear species are significantly different; a sketch of both approaches follows.
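Building on the hypothetical bear_data sketch above (the sex and age columns are assumptions for illustration), the two approaches could look like this:

# Option 1: control by design -- fit the ANOVA on male bears only
male_aov = aov(weight ~ species, data = subset(bear_data, sex == "Male"))
summary(male_aov)
# Option 2: ANCOVA -- keep all bears and adjust for age as a covariate
bear_ancova = aov(weight ~ age + species, data = bear_data)
summary(bear_ancova)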
Explain in detail what it means specifically, in a statistical sense, for any result to be "statistically significant" at a particular \(\alpha\)-level. In other words, explain the meaning and use of p-values. You should research this question, and you should expect your answer to be at least a paragraph long. (6 pts)
A result that is statistically significant at a particular level gives you confidence that it is unlikely to be due to chance alone rather than to the independent variable in question. If an independent variable is statistically significant, it means the observed effect of that variable on the dependent variable would be unlikely to occur if there were no real relationship, i.e., it is probably not due to chance. This matters because when you run an experiment you take a sample of the population (it is impossible or very difficult to survey the entire population of interest, so a sample is studied instead). With a sample, you need some assurance that you did not simply happen to draw a sample that agrees with your hypothesis (i.e., the apparent effect arose from sampling variability), or vice versa.
The p-value measures how likely it is to obtain results at least as extreme as those observed, assuming the null hypothesis is true; thus, the lower the p-value, the stronger the evidence against the null hypothesis. The significance level \(\alpha\) provides a benchmark for deciding whether to reject the null hypothesis: if the p-value is lower than \(\alpha\), the null hypothesis is rejected at the \(\alpha\) significance level; otherwise, we fail to reject it.
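A small simulation (a sketch) makes this concrete: when the null hypothesis is true, p-values are roughly uniformly distributed, so about 5% of tests come out "significant" at \(\alpha = 0.05\) purely by chance:

# Simulate many two-sample t-tests in which the null hypothesis is true
set.seed(1)
pvals = replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
mean(pvals < 0.05)   # close to 0.05: the false-positive rate equals alpha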