Instructions to students

You should only use the file Exam_template.Rmd provided on blackboard and you should load this file from your scripts folder / directory.

Save this template as your studentID.Rmd; you will upload this file as your submission. Change the information on line 3 of this file – changing the author information to your student ID. Do not change the authorship to your name.

Ensure that you save your data into your data folder (as discussed in class). You may use the files mypackages.R and helperFunctions.R from blackboard. If you use these files, do not alter them. If you wish to create additional files for custom functions that you have prepared in advance, make sure that you upload these in addition to your .Rmd file and your compiled output file.

You should knit this file to a Word document.

Any changes that you make to the data (e.g. variable name changes) should be made entirely within R.

The subsubsections labelled Answer: indicate where you should put in your written Answers. The template also provides blank code chunks for you to complete your Answers; you may choose to add additional chunks if required.

load required libraries / additional files

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("boot")
#install.packages("ggplot2")
#install.packages("psych")
#install.packages("performance")
#install.packages("corrplot")
#install.packages("summarytools")

Having installed all these packages, we shall call the following libraries:

# Load the dataset as follows:
library(readxl)
#data_1=read_excel(file.choose())
# Attach the dataset as follows:
attach(data_1)

Error in attach(data_1): object 'data_1' not found

# Name the dataset to know how many variables are therein:
names(data_1)

Error in eval(expr, envir, enclos): object 'data_1' not found

#View(data_1)
dim(data_1)

Error in eval(expr, envir, enclos): object 'data_1' not found

Data description

This dataset is part of a larger dataset that has been collected to help to estimate the price of used cars.

It contains the following variables:

brand (manufacturer)
model (of car)
year (of registration of the car)
price (in GB pounds)
transmission (type of gearbox)
mileage (total distance covered by the car)
fuelType (type of fuel used by the car)
tax (annual cost of vehicle tax)
mpg (miles per gallon - a measure of fuel efficiency)
engineSize (size of the engine in litres)

Question 1: Data Preparation (11 marks)

You are interested in modelling the price of vehicles that have all of the following properties:

mileage less than 90000
Manual transmission
Diesel engine (fuelType)
Costing less than £300 in annual Vehicle Tax.

Once you have selected the rows of data with these properties, then you must use the last 4 digits of your studentID to select a random sample of 2000 rows of the data to perform the rest of your analysis with.

You should remove any redundant variables (where only one value remains in that variable).

This subset of the data is what you should use for the rest of this assessment.

Explain what data preparation is required in order for the data in Jan_2022_Exam_Data.csv to be suitable for this analysis.

(4 marks)

Answer:

Load the required libraries
Read the CSV file into the RStudio environment
View it to inspect the dataset
Filter it according to the required properties
Save the filtered dataset under another name of your choice
Select a random sample of 2000 rows from the filtered dataset after calling set.seed(7889)
Remove the variables with redundant values by running appropriate code
Convert the variables to factors as appropriate
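
The steps above can be sketched as follows. This is a minimal illustration in base R on a made-up data frame (the object names cars, filtered, sampled and prepared, and all values, are hypothetical), not the exam data itself; the sample size is reduced to 50 rows to suit the toy data.

```r
# Toy stand-in for the exam data (all values are invented for illustration).
set.seed(1)
n <- 2000
cars <- data.frame(
  brand        = sample(c("skoda", "audi", "bmw"), n, replace = TRUE),
  price        = round(runif(n, 2000, 35000)),
  transmission = sample(c("Manual", "Automatic"), n, replace = TRUE),
  mileage      = round(runif(n, 0, 150000)),
  fuelType     = sample(c("Diesel", "Petrol"), n, replace = TRUE),
  tax          = sample(c(0, 20, 30, 145, 325), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# 1. Keep only the rows with the required properties.
filtered <- subset(cars,
                   mileage < 90000 & transmission == "Manual" &
                     fuelType == "Diesel" & tax < 300)

# 2. Draw a reproducible random sample (50 rows here; 2000 in the exam).
set.seed(7889)
sampled <- filtered[sample(nrow(filtered), 50), ]

# 3. Drop redundant variables, i.e. columns with a single distinct value.
keep <- vapply(sampled, function(x) length(unique(x)) > 1, logical(1))
prepared <- sampled[, keep, drop = FALSE]

# 4. Convert the remaining character columns to factors.
is_chr <- vapply(prepared, is.character, logical(1))
prepared[is_chr] <- lapply(prepared[is_chr], factor)

str(prepared)
```

Detecting the constant columns programmatically avoids hard-coding which variables became redundant after filtering.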

Implement the required data preparation in the code chunk below:

(7 marks)

Answer:

data_2 = subset(data_1, transmission=="Manual" & mileage < 90000 & fuelType=="Diesel" & tax < 300, select = -c(fuelType, transmission))

Error in subset(data_1, transmission == "Manual" & mileage < 90000 & fuelType == : object 'data_1' not found

#View(data_2)
dim(data_2)

Error in eval(expr, envir, enclos): object 'data_2' not found

# Setting our seed to be:
set.seed(7889)

data_3 = data_2 %>% sample_n(2000, replace = FALSE)

Error in sample_n(., 2000, replace = FALSE): object 'data_2' not found

#View(data_3)
dim(data_3)

Error in eval(expr, envir, enclos): object 'data_3' not found

Question 2: Exploratory Data Analysis (22 marks)

Descriptive Statistics

What descriptive statistics would be appropriate for this dataset? Explain why these are useful in this context.

(2 marks)

Answer:

Measures of location: mean, mode, and median. They are useful because they show where the data are concentrated.
Measures of spread: range, interquartile range, variance and standard deviation. They are useful because they supplement the information provided by the measures of location.
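
As a rough sketch of how these measures can be computed in base R, the example below uses a short invented vector of prices. The helper stat_mode is a hypothetical name introduced here, since base R has no built-in function for the statistical mode.

```r
# Hypothetical prices in GB pounds, invented for illustration.
price <- c(8995, 11000, 11000, 12500, 13486, 14250, 16000, 16000, 21995)

# Helper: most frequent value(s); returns several values for multimodal data.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Measures of location
mean(price)
median(price)
stat_mode(price)

# Measures of spread
range(price)
IQR(price)
var(price)
sd(price)
```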

Produce those descriptive statistics in the code chunk below:

(4 marks)

Answer:

summary(data_3)

Error in summary(data_3): object 'data_3' not found

descr(data_3)

Error in descr(data_3): object 'data_3' not found

What have those descriptive statistics told you – and how does this inform the analysis that you would undertake on this data or any additional data cleaning requirements?

(4 marks)

Answer:

Additional cleaning is required: year is summarised in our results, but descriptive statistics for it are not needed.
The minimum and maximum values for price, mileage, tax, mpg, and engineSize are (2295 & 37995), (5 & 89462), (0 & 260), (30.10 & 88.30) and (1.00 & 3.00) respectively.
Of all the variables, mileage shows the highest degree of variability in our analysis.

For the measures of central tendency, the means of price (in GB pounds), mileage (total distance covered by the car), annual cost of vehicle tax, miles per gallon, and size of the engine (in litres) are 13933.86, 32538.84, 84.97, 64.16 and 1.79 respectively, while their median values are 13486, 29666.5, 125, 64.2, and 2. Price is bimodal, with modes at 16000 and 11000; the modes of mileage, tax, mpg, and engineSize are 10, 145, 74.3, and 2.

For the measures of variability, the ranges for price (in GB pounds), mileage (total distance covered by the car), annual cost of vehicle tax, miles per gallon, and size of the engine (in litres) are (2395 - 32995), (5 - 89936), (0 - 260), (30.1 - 88.3), and (0.0 - 2.2) respectively. The corresponding interquartile ranges are 6423.75, 29308, 125, 13.275, and 0.5. The variance and standard deviation of each of these variables are (22931232, 4788.657), (445497989, 21106.82), (4263.894, 65.2985), (101.5388, 10.07665), and (0.06297, 0.25094).

Exploratory Graphs

What exploratory graphs would be appropriate for this dataset? Explain why these are useful in this context.

(2 marks)

Answer:

Box plot: This plot is designed to check two things in a dataset: the symmetry of the distribution (a rough guide to normality) and the presence of outliers.
Histogram: A histogram shows the shape of the distribution of a single variable at a glance.
Scatter plot: This shows the kind of linear relationship between two variables at a time.
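
A minimal sketch of the three plot types in base R, on simulated data (the values and the relationship between price and mileage are assumptions made for illustration). The call to pdf(NULL) only lets the sketch run without a display; remove it and the dev.off() line to see the plots interactively.

```r
# Simulated stand-in data (hypothetical values, not the exam data).
set.seed(1)
mileage <- round(runif(200, 0, 90000))
price   <- 20000 - 0.14 * mileage + rnorm(200, sd = 3000)

# boxplot.stats() returns the points a box plot draws beyond its whiskers.
outliers <- boxplot.stats(price)$out

pdf(NULL)  # null graphics device: nothing is written to disk

# Box plot: symmetry of the distribution and outliers.
boxplot(price, main = "Price", las = 1)

# Histogram: shape of the distribution of a single variable.
hist(price, main = "Price", xlab = "Price (GBP)")

# Scatter plot: relationship between two variables at a time.
plot(mileage, price, xlab = "Mileage", ylab = "Price (GBP)", pch = 19)

invisible(dev.off())
```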

(4 marks)

Answer:

# To obtain the box plot, run the command below:
# Form the dataset one by one:
price=data_3[,4]; mileage=data_3[,5]; tax=data_3[,6];mpg=data_3[,7];engineSize=data_3[,8]

Error in eval(expr, envir, enclos): object 'data_3' not found

Error in eval(expr, envir, enclos): object 'data_3' not found

Error in eval(expr, envir, enclos): object 'data_3' not found

Error in eval(expr, envir, enclos): object 'data_3' not found

Error in eval(expr, envir, enclos): object 'data_3' not found

#Produce the data frame:
data_4 = data.frame(price, mileage, tax, mpg, engineSize)

Error in data.frame(price, mileage, tax, mpg, engineSize): object 'price' not found

#Draw the plot:
boxplot(data_4, main="Box-Whisker Plot", col=c(1:8), col.main=2, pch=22, las=1, sub="Fig. I", col.sub="chocolate")

Error in boxplot(data_4, main = "Box-Whisker Plot", col = c(1:8), col.main = 2, : object 'data_4' not found

# To obtain the histogram, run the commands below:
h=hist(price, main="Histogram", sub="Figure 2", col.sub="blue2",col=c(1:8), xlab="Price", col.main="yellow4")

Error in hist(price, main = "Histogram", sub = "Figure 2", col.sub = "blue2", : object 'price' not found

h=hist(mileage, main="Histogram", sub="Figure 3", col.sub="blue2",col=c(1:8), xlab="Mileage", col.main="yellow4")

Error in hist(mileage, main = "Histogram", sub = "Figure 3", col.sub = "blue2", : object 'mileage' not found

h=hist(tax, main="Histogram", sub="Figure 4", col.sub="blue2",col=c(1:8), xlab="Tax", col.main="yellow4")

Error in hist(tax, main = "Histogram", sub = "Figure 4", col.sub = "blue2", : object 'tax' not found

h=hist(mpg, main="Histogram", sub="Figure 5", col.sub="blue2",col=c(1:8), xlab="mpg", col.main="yellow4")

Error in hist.default(mpg, main = "Histogram", sub = "Figure 5", col.sub = "blue2", : 'x' must be numeric

h=hist(engineSize, main="Histogram", sub="Figure 6", col.sub="blue2",col=c(1:8), xlab="engineSize", col.main="yellow4")

Error in hist(engineSize, main = "Histogram", sub = "Figure 6", col.sub = "blue2", : object 'engineSize' not found

# To obtain the SCATTER PLOT, run the commands below:

plot(mileage,price, main="Scatter Plot", col.main=2, sub="Figure 7", col.sub="green", xlab="Mileage", ylab="Price", col.lab="blue", pch=19, col=c(1:22), las=1)

Error in plot(mileage, price, main = "Scatter Plot", col.main = 2, sub = "Figure 7", : object 'mileage' not found

plot(tax,price, main="Scatter Plot", col.main=2, sub="Figure 8", col.sub="green", xlab="tax", ylab="Price", col.lab="blue", pch=19, col=c(1:22), las=1)

Error in plot(tax, price, main = "Scatter Plot", col.main = 2, sub = "Figure 8", : object 'tax' not found

plot(mpg,price, main="Scatter Plot", col.main=2, sub="Figure 9", col.sub="green", xlab="mpg", ylab="Price", col.lab="blue", pch=19, col=c(1:22), las=1)

Error in pairs.default(data.matrix(x), ...): object 'price' not found

plot(engineSize,price, main="Scatter Plot", col.main=2, sub="Figure 10", col.sub="green", xlab="EngineSize", ylab="Price", col.lab="blue", pch=19, col=c(1:22), las=1)

Error in plot(engineSize, price, main = "Scatter Plot", col.main = 2, : object 'engineSize' not found

Interpret these exploratory graphs. How do these graphs inform your subsequent analysis? (4 marks)

Answer:

In the box plot, mileage appears to be roughly normal, while the other variables depart from normality.
The histograms and scatter plots show a similar picture.

Correlations

What linear correlations are present within this data? (2 marks)

Answer:

cor(data_4)

Error in is.data.frame(x): object 'data_4' not found

The results of the command "cor(data_4)" show the matrix of linear correlations.
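
As an illustration of reading a correlation matrix, the sketch below builds a small simulated data frame (the name toy, the values and the relationships are assumptions, not the exam data), computes the Pearson correlations, and ranks the explanatory variables by the strength of their linear correlation with price.

```r
# Simulated stand-in data: price depends on mileage but not on mpg.
set.seed(1)
n <- 200
mileage <- runif(n, 0, 90000)
mpg     <- rnorm(n, 64, 10)
toy <- data.frame(
  price   = 20000 - 0.14 * mileage + rnorm(n, sd = 3000),
  mileage = mileage,
  mpg     = mpg
)

# Pearson correlation matrix of all numeric variables.
corr <- cor(toy)
round(corr, 2)

# Correlation of each explanatory variable with price, strongest first.
with_price <- corr["price", colnames(corr) != "price"]
with_price[order(abs(with_price), decreasing = TRUE)]
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.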

Question 3: Bivariate relationship (14 marks)

Which of the potential explanatory variables has the strongest linear relationship with the dependent variable? (1 mark)

Answer:

It is mileage, whose correlation with price is approximately -0.64.

Create a linear model to model this relationship. (2 marks)

Answer:

model = lm(price ~ mileage, data = data_4)

Error in is.data.frame(data): object 'data_4' not found

summary(model)

Error in summary(model): object 'model' not found

plot(mileage,price, main="Scatter Plot", col.main=2, sub="", col.sub="green", xlab="Mileage", ylab="Price", col.lab="blue", pch=19, col="chocolate", las=1)

Error in plot(mileage, price, main = "Scatter Plot", col.main = 2, sub = "", : object 'mileage' not found

abline(coef(model), lwd=3, col=1)

Error in coef(model): object 'model' not found

Explain and interpret the model: (3 marks)

Answer:

The estimated coefficient of mileage, -0.1445, indicates a negative relationship between price and mileage: each additional mile is associated with a fall of about £0.14 in the expected price, which agrees with the negative correlation found in the correlation matrix. The p-value for mileage is effectively zero (reported as 0.0000000), so the null hypothesis of a zero slope is rejected at the 0.05 level of significance and the coefficient is statistically significant.
Because the model has a single explanatory variable, the overall p-value of the model leads to the same conclusion.
The value of Adjusted R-squared (0.4117) shows that about 41% of the variation in price can be accounted for by mileage, while the rest remains unexplained.
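
The quantities used in an interpretation of this kind can be extracted from the model summary directly rather than read off the printed output. A minimal sketch on simulated data (the values and the object names fit, s and coef_table are hypothetical):

```r
# Simulated stand-in data with a negative price-mileage relationship.
set.seed(1)
mileage <- runif(200, 0, 90000)
price   <- 20000 - 0.14 * mileage + rnorm(200, sd = 3000)

fit <- lm(price ~ mileage)
s   <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value.
coef_table <- s$coefficients
slope   <- coef_table["mileage", "Estimate"]
p_value <- coef_table["mileage", "Pr(>|t|)"]
adj_r2  <- s$adj.r.squared

slope * 1000   # expected change in price per additional 1000 miles
p_value < 0.05 # is the slope significant at the 5% level?
adj_r2         # share of variation in price explained, adjusted
```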

Comment on the performance of this model, including comments on overall model fit and the validity of model assumptions. Include any additional code required for you to make these comments in the code chunk below. (4 marks)

Answer:

library(performance)
model_performance(model)

Error in model_performance(model): object 'model' not found

anova(model)

Error in anova(model): object 'model' not found

The results show an adjusted R2 of 0.412, meaning that only about 41% of the variation in price is accounted for by mileage. The AIC and BIC (38463.206 and 38480.009) are measures of relative fit: they do not establish on their own that the model is adequate, but they allow it to be compared with alternative models, where lower values are preferred.
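
The validity of the model assumptions can be examined through the residuals. The sketch below, on simulated data (hypothetical values), runs a Shapiro-Wilk test for normality of the residuals and draws two of the standard diagnostic plots for a linear model; pdf(NULL) is used only so that the sketch runs without a display.

```r
# Simulated stand-in data (not the exam data).
set.seed(1)
mileage <- runif(200, 0, 90000)
price   <- 20000 - 0.14 * mileage + rnorm(200, sd = 3000)
fit     <- lm(price ~ mileage)

# Normality of residuals: H0 is that the residuals are normally distributed.
res     <- resid(fit)
shapiro <- shapiro.test(res)
shapiro$p.value

pdf(NULL)  # null graphics device: nothing is written to disk
par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs fitted: linearity and constant variance
plot(fit, which = 2)  # normal Q-Q plot of the standardised residuals
invisible(dev.off())
```

A small Shapiro-Wilk p-value, or a funnel shape in the residuals-versus-fitted plot, would suggest that the normality or constant-variance assumption is in doubt.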

Bootstrap

Use bootstrapping on this model to obtain a 95% confidence interval of the estimate of the slope parameter. (4 marks)

Answer:

model2 = step(model)

Error in terms(object): object 'model' not found

#Set up the bootstrap as follows:
# function to obtain R-Squared from the data:
set.seed(7889)
library(boot)
#Define function to calculate R-squared value
rsq_function = function(formula, data, indices) {
  w = data[indices,] 
  fit = lm(formula, data=w)
  return(summary(fit)$r.squared)
}
#perform bootstrapping with 2000 replications
rw = boot(data=data_4, statistic=rsq_function, R=2000, formula=price~mileage)

Error in NROW(data): object 'data_4' not found

#view results of bootstrapping
rw

Error in eval(expr, envir, enclos): object 'rw' not found

#plot(rw)

#To calculate the 95% confidence interval for the estimated R-squared of the model:
#To calculate adjusted bootstrap percentile (bca) interval:
boot.ci(rw, type="bca")

Error in boot.ci(rw, type = "bca"): object 'rw' not found
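
The code above bootstraps the R-squared of the model. For the slope parameter that the question asks about, the statistic function would return the mileage coefficient instead. The following is a minimal sketch under stated assumptions: the data are simulated, the names toy, slope_function and slope_boot are hypothetical, and a percentile interval is used in place of the BCa interval to keep the example light.

```r
library(boot)

# Simulated stand-in data (not the exam data).
set.seed(7889)
n <- 200
toy <- data.frame(mileage = runif(n, 0, 90000))
toy$price <- 20000 - 0.14 * toy$mileage + rnorm(n, sd = 3000)

# Statistic: refit the model on the resampled rows and return the slope.
slope_function <- function(data, indices, formula) {
  fit <- lm(formula, data = data[indices, ])
  coef(fit)[["mileage"]]
}

# 1000 bootstrap replications of the slope estimate.
slope_boot <- boot(data = toy, statistic = slope_function, R = 1000,
                   formula = price ~ mileage)

# 95% percentile confidence interval for the slope.
slope_ci <- boot.ci(slope_boot, conf = 0.95, type = "perc")
slope_ci$percent[4:5]  # lower and upper limits
```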

Question 4: Multivariable relationship (10 marks)

The remaining explanatory variables are tax, mpg, and engineSize:

model3 = lm(price ~ tax + mpg + engineSize - 1, data = data_4)

Error in is.data.frame(data): object 'data_4' not found

summary(model3)

Error in summary(model3): object 'model3' not found

anova(model3)

Error in anova(model3): object 'model3' not found

Explain and interpret the model: (4 marks)

Answer:

The results show that all three remaining explanatory variables are statistically significant at the 0.05 level of significance. Also, the multiple R-squared and adjusted R-squared (0.9317 and 0.9316) indicate that about 93% of the variation in price is explained by the three explanatory variables.

Comment on the performance of this model, including comments on overall model fit and the validity of model assumptions. Include any additional code required for you to make these comments in the code chunk below.

(4 marks)

Answer:

library(performance)
model_performance(model3)

Error in model_performance(model3): object 'model3' not found

# For normality assumption:
residual=resid(model3)

Error in resid(model3): object 'model3' not found

## Breusch-Pagan Test for Heteroscedasticity Assumption
library(lmtest)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

bptest(model3, studentize=FALSE)

Error in bptest(model3, studentize = FALSE): object 'model3' not found

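The result shows that multiple R-squared is approximately 93%, which tells us that about 93% of the variation in price is explained by the explanatory variables. Because the model was fitted without an intercept (the - 1 term), this R-squared is measured relative to zero rather than to the mean price, so it is not directly comparable with the R-squared of the mileage model.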
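
When a model is fitted without an intercept (the - 1 term in the formula), R computes R-squared about zero rather than about the mean of the response, which can make it look very high. A small sketch on simulated data, in which price is unrelated to mpg by construction (all values are invented):

```r
# Simulated data: price does not depend on mpg at all.
set.seed(1)
mpg   <- rnorm(200, 64, 10)
price <- 14000 + rnorm(200, sd = 4500)

with_intercept <- lm(price ~ mpg)
no_intercept   <- lm(price ~ mpg - 1)

r2_with    <- summary(with_intercept)$r.squared
r2_without <- summary(no_intercept)$r.squared

# The no-intercept R-squared is far larger although mpg explains nothing.
c(with_intercept = r2_with, no_intercept = r2_without)
```

For this reason the R-squared of a no-intercept model should not be compared directly with that of a model that includes an intercept.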

What general concerns do you have regarding this model? (2 marks)

Answer:

The model is likely to be unreliable because it is not stable: its coefficient estimates may change substantially with small changes in the data.

Question 5: Model simplification (8 marks)

What approaches for model simplification would you consider implementing and why? (4 marks)

Answer:

The approaches are: the maximal model, the minimum adequate model, and the null model.

The best for this scenario is the maximal model because it is more accurate and adequate.
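
One common way of moving from a maximal model towards a minimum adequate model is backward stepwise selection by AIC with step(). The sketch below uses simulated data in which one variable (noise) is unrelated to price; the data frame toy and all values are assumptions made for illustration.

```r
# Simulated data: price depends on mileage and mpg, but not on noise.
set.seed(1)
n <- 300
toy <- data.frame(
  mileage = runif(n, 0, 90000),
  mpg     = rnorm(n, 64, 10),
  noise   = rnorm(n)
)
toy$price <- 25000 - 0.14 * toy$mileage - 80 * toy$mpg + rnorm(n, sd = 2500)

# Maximal model (all candidate variables) and null model (intercept only).
maximal    <- lm(price ~ mileage + mpg + noise, data = toy)
null_model <- lm(price ~ 1, data = toy)

# Backward elimination: drop terms while doing so lowers the AIC.
simplified <- step(maximal, direction = "backward", trace = 0)

formula(simplified)
AIC(null_model, simplified, maximal)
```

Comparing the AIC of the three models shows whether the simplified model gives up any meaningful fit relative to the maximal one.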

What are the potential advantages of simplifying a model? (2 marks)

Answer:

Efficient in modelling
Reliable for prediction and forecasting
It is dependable

What are the potential disadvantages of simplifying a model? (2 marks)

Answer:

Very high cost of building a model; it is time-consuming and takes a lot of effort

Question 6: Reporting (35 marks)

A client is looking to purchase a used Skoda Superb (registration year either 2018 or 2019, manual transmission, diesel engine) and wants to understand what factors influence the expected price of a used car, (and how they influence the price).

Write a short report of 300-500 words for the client.

Furthermore, include an explanation as to which statistical model you would recommend, and why you have selected that statistical model.

Comment on any suggestions for alterations to the statistical model that would be appropriate to consider.

Highlight what may or may not be directly transferable from the scenario analysed in Questions 1 to 5.

Answer:

A client who wants to buy a used Skoda Superb (registration year either 2018 or 2019, manual transmission, diesel engine) would need to understand that some basic factors such as brand, model, fuel type, and tax influence the expected price of a used car. First, the price of one brand, say Audi, will differ from (be either lower or higher than) the price of another brand, say BMW. The model of the car also determines its price: an Audi A4 is expected to differ in price from a BMW X3. The type of fuel a car uses affects its price as well, so a diesel car cannot be expected to cost the same as a petrol car. Finally, the annual cost of vehicle tax influences the price.

I would like to recommend that a multiple linear regression model of price connecting with brand, model, fuel type, and tax be used because of the roles played by all these explanatory variables in explaining the price of this used car.

However, some modifications to our previous models would be needed so that these additional variables can be incorporated into the analysis.

From the scenario analyzed in Questions 1 - 5, the restriction to mileage less than 90000 may not be transferable, while manual transmission, diesel engine, and annual vehicle tax of less than £300 are transferable.

Session Information

Do not edit this part. Make sure that you compile your document so that the information about your session (including software / package versions) is included in your submission.

sessionInfo()

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lmtest_0.9-40      zoo_1.8-10         readxl_1.4.0       summarytools_1.0.1
 [5] corrplot_0.92      performance_0.9.1  psych_2.2.5        boot_1.3-28       
 [9] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4       
[13] readr_2.1.2        tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6     
[17] tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] httr_1.4.3         sass_0.4.1         jsonlite_1.8.0     modelr_0.1.8      
 [5] bslib_0.3.1        assertthat_0.2.1   pander_0.6.5       cellranger_1.1.0  
 [9] yaml_2.3.5         pillar_1.7.0       backports_1.4.1    lattice_0.20-45   
[13] glue_1.6.2         digest_0.6.29      pryr_0.1.5         checkmate_2.1.0   
[17] rvest_1.0.2        colorspace_2.0-3   plyr_1.8.7         htmltools_0.5.2   
[21] pkgconfig_2.0.3    broom_1.0.0        haven_2.5.0        magick_2.7.3      
[25] scales_1.2.0       tzdb_0.3.0         generics_0.1.3     ellipsis_0.3.2    
[29] withr_2.5.0        cli_3.3.0          mnormt_2.1.0       magrittr_2.0.3    
[33] crayon_1.5.1       evaluate_0.15      fs_1.5.2           fansi_1.0.3       
[37] nlme_3.1-158       MASS_7.3-57        xml2_1.3.3         rapportools_1.1   
[41] tools_4.2.0        hms_1.1.1          lifecycle_1.0.1    matrixStats_0.62.0
[45] munsell_0.5.0      reprex_2.0.1       compiler_4.2.0     jquerylib_0.1.4   
[49] rlang_1.0.4        grid_4.2.0         rstudioapi_0.13    tcltk_4.2.0       
[53] base64enc_0.1-3    rmarkdown_2.14     gtable_0.3.0       codetools_0.2-18  
[57] DBI_1.1.3          reshape2_1.4.4     R6_2.5.1           lubridate_1.8.0   
[61] knitr_1.39         fastmap_1.1.0      utf8_1.2.2         insight_0.18.0    
[65] stringi_1.7.8      parallel_4.2.0     Rcpp_1.0.9         vctrs_0.4.1       
[69] dbplyr_2.2.1       tidyselect_1.1.2   xfun_0.31

STATISTICAL INFERENCE

21061026

2022-07-22
