Introduction

In the following report, we’ve prepared the data contained in the Corolla data set which englobes different variables including prices, manufacturing year, mileage, and others. In this report, we follow the steps of cleaning the data to remove unnecessary outliers and execute an EDA analysis process to understand the given data and relate the variables, and in future steps of our challenge project, have a prepared dataset with the correct information to carry on further analyses. Overall, this report aims to provide a detailed panorama on the procedures followed to clean the data, as well as explain the reasons as to why these steps were taken. We aim to demonstrate concise and clear explanations, with the purpose of achieving a proper understanding of the procedure.

STEPS 1 AND 2: Load dataset and create subset

As part of our two first steps, we initiated by loading our dataset into the program, and storing it into a variable that we decided to call “data”. As well, we applied a “gsub” function to our dataset that proceeds to replace any leading question mark at the beginning of stings in the “Model” column with an empty string (““), which basically means that the question marks found are being removed throughout this function. Once this step is made, a summarized demonstration of the data is displayed by applying the skim function. Lastly but not least, by using the”“colnames” function to “data”, this line returns the names of all columns in the data frame as a character vector, listing each variable name foun in our dataset after any data manipulation.

In shorter words, these steps are taken not only to initiate with our data preparation process, but most importantly, it helps to clean, summarize, and verify the structure of “data”.

data = read.csv("corolla2.csv")
data$Model <- gsub("^\\?", "", data$Model)
skim(data)

Data summary
Name	data
Number of rows	1436
Number of columns	37
_______________________
Column type frequency:
character	2
numeric	35
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Model	0	1	15	75	0	319	0
Fuel_Type	0	1	3	6	0	3	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Id	1	721.56	416.48	1	361.75	721.5	1081.25	1442	▇▇▇▇▇
Price	1	10730.82	3626.96	4350	8450.00	9900.0	11950.00	32500	▇▅▁▁▁
Age_08_04	1	55.95	18.60	1	44.00	61.0	70.00	80	▁▂▃▆▇
Mfg_Month	1	5.55	3.35	1	3.00	5.0	8.00	12	▇▅▅▃▅
Mfg_Year	1	1999.63	1.54	1998	1998.00	1999.0	2001.00	2004	▇▂▂▁▁
KM	1	68533.26	37506.45	1	43000.00	63389.5	87020.75	243000	▅▇▂▁▁
HP	1	101.50	14.98	69	90.00	110.0	110.00	192	▃▇▁▁▁
Met_Color	1	0.67	0.47	0	0.00	1.0	1.00	1	▃▁▁▁▇
Automatic	1	0.06	0.23	0	0.00	0.0	0.00	1	▇▁▁▁▁
cc	1	1576.86	424.39	1300	1400.00	1600.0	1600.00	16000	▇▁▁▁▁
Doors	1	4.03	0.95	2	3.00	4.0	5.00	5	▁▇▁▂▇
Cylinders	1	4.00	0.00	4	4.00	4.0	4.00	4	▁▁▇▁▁
Gears	1	5.03	0.19	3	5.00	5.0	5.00	6	▁▁▁▇▁
Quarterly_Tax	1	87.12	41.13	19	69.00	85.0	85.00	283	▇▇▁▂▁
Weight	1	1072.46	52.64	1000	1040.00	1070.0	1085.00	1615	▇▁▁▁▁
Mfr_Guarantee	1	0.41	0.49	0	0.00	0.0	1.00	1	▇▁▁▁▆
BOVAG_Guarantee	1	0.90	0.31	0	1.00	1.0	1.00	1	▁▁▁▁▇
Guarantee_Period	1	3.82	3.01	3	3.00	3.0	3.00	36	▇▁▁▁▁
ABS	1	0.81	0.39	0	1.00	1.0	1.00	1	▂▁▁▁▇
Airbag_1	1	0.97	0.17	0	1.00	1.0	1.00	1	▁▁▁▁▇
Airbag_2	1	0.72	0.45	0	0.00	1.0	1.00	1	▃▁▁▁▇
Airco	1	0.51	0.50	0	0.00	1.0	1.00	1	▇▁▁▁▇
Automatic_airco	1	0.06	0.23	0	0.00	0.0	0.00	1	▇▁▁▁▁
Boardcomputer	1	0.29	0.46	0	0.00	0.0	1.00	1	▇▁▁▁▃
CD_Player	1	0.22	0.41	0	0.00	0.0	0.00	1	▇▁▁▁▂
Central_Lock	1	0.58	0.49	0	0.00	1.0	1.00	1	▆▁▁▁▇
Powered_Windows	1	0.56	0.50	0	0.00	1.0	1.00	1	▆▁▁▁▇
Power_Steering	1	0.98	0.15	0	1.00	1.0	1.00	1	▁▁▁▁▇
Radio	1	0.15	0.35	0	0.00	0.0	0.00	1	▇▁▁▁▂
Mistlamps	1	0.26	0.44	0	0.00	0.0	1.00	1	▇▁▁▁▃
Sport_Model	1	0.30	0.46	0	0.00	0.0	1.00	1	▇▁▁▁▃
Backseat_Divider	1	0.77	0.42	0	1.00	1.0	1.00	1	▂▁▁▁▇
Metallic_Rim	1	0.20	0.40	0	0.00	0.0	0.00	1	▇▁▁▁▂
Radio_cassette	1	0.15	0.35	0	0.00	0.0	0.00	1	▇▁▁▁▂
Tow_Bar	1	0.28	0.45	0	0.00	0.0	1.00	1	▇▁▁▁▃

colnames(data)

##  [1] "Id"               "Model"            "Price"            "Age_08_04"       
##  [5] "Mfg_Month"        "Mfg_Year"         "KM"               "Fuel_Type"       
##  [9] "HP"               "Met_Color"        "Automatic"        "cc"              
## [13] "Doors"            "Cylinders"        "Gears"            "Quarterly_Tax"   
## [17] "Weight"           "Mfr_Guarantee"    "BOVAG_Guarantee"  "Guarantee_Period"
## [21] "ABS"              "Airbag_1"         "Airbag_2"         "Airco"           
## [25] "Automatic_airco"  "Boardcomputer"    "CD_Player"        "Central_Lock"    
## [29] "Powered_Windows"  "Power_Steering"   "Radio"            "Mistlamps"       
## [33] "Sport_Model"      "Backseat_Divider" "Metallic_Rim"     "Radio_cassette"  
## [37] "Tow_Bar"

STEP 3: Data Exploration

Afterwards, we apply the command of “str” to data in order to provide a structured summary of the data frame, and give out a quick overview of its internal structure. This line of code is helpful to quickly check data types, detect (if any) unexpected types, and overall ensure the data looks as expected before proceeding to further steps and analysis.

str(data)

## 'data.frame':    1436 obs. of  37 variables:
##  $ Id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Model           : chr  "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" ...
##  $ Price           : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
##  $ Age_08_04       : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ Mfg_Month       : int  10 10 9 7 3 1 6 3 6 10 ...
##  $ Mfg_Year        : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ KM              : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type       : chr  "Diesel" "Diesel" "Diesel" "Diesel" ...
##  $ HP              : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Met_Color       : int  1 1 1 0 0 0 1 1 0 0 ...
##  $ Automatic       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cc              : int  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
##  $ Doors           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Cylinders       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Gears           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Quarterly_Tax   : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Weight          : int  1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
##  $ Mfr_Guarantee   : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ BOVAG_Guarantee : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Guarantee_Period: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ ABS             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_1        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_2        : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Airco           : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Boardcomputer   : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ CD_Player       : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Central_Lock    : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ Powered_Windows : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Power_Steering  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Radio           : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Mistlamps       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ Sport_Model     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Backseat_Divider: int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Metallic_Rim    : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Radio_cassette  : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Tow_Bar         : int  0 0 0 0 0 0 0 0 0 0 ...

STEP 4: Visualization (Histograms)

As part of our fourth step, we know that histograms are essential for understanding the distribution of continuous variables in a dataset. Therefore, we proceeded to start visualizing our data through means of histograms. We decided to use histograms within our cleaning process, since it can allow us to identify factors such as skewness, multimodalities, or irregularities in the data distribution. In return, this results helpful for our data cleaning process. From the results obtained, we are able to state the following:

Interpretation of the Histograms

Price Distribution Histogram

This histogram depicts a right-skewed distribution of car prices, showing that most cars are priced towards the lower end, but there are a few very expensive cars. This is important to take in consideration as it can suggest outliers that may need attention during the data cleaning

Age_08_04 Distribution Histogram

The histogram indicates the age of the cars in months corresponding to August of 2004. The distribution depicts a relatively uniform spread, with no significant concetration in any specific age range. There is a slight peak around 50.70 month range, which might suggest that most cars are in this middle age category. This is helpful for us to understand the age distribution and can help to identify whether certain age groups need special attention for data cleaning.

Mfg_Year (Manufacturing Year) Distribution Histogram

This histogramshows a left skewed distribution, with most vehicles being relatively recent, but a few older vehicles that might need to be checked for anomalies

KM (Mileage) Distribution Histogram

The distribution of mileage also shows to be right-skewed, indicating that there are many carshaving lower mileage, but few with exceptionally high mileage. High mileage cars could be candidates for further scrunity or removal, as these also represent outliers.

HP (Horsepower) Distribution Histogram

The horsepower distribution appears to be more uniform but also slightly right skewed. A uniform distribution in this context might indicate no strong concentration og vehicles around specific horsepower values.

CC (Cubic Capacity) Distribution Histogram

We can obsserve that the majority of the vehicles have engine sizes around 1400 to 1600 cc. There are a few cars with significantly larger engine sizes, showing thhe presence of outliers.

Quarterly tax Distribution Histogram

In this histograms most cars have lower quarterly taxes and a few have much higher taxes. This indicates that the histogram is right-skewed. This skewness can suggest potential outliers at the high end, which may need attention during data cleaning.

Weight Distribution Histogram

The distribution of weight is likely centered around a specific range, with most cars having moderate weights and fewer vehicles at the extremes. As seen in other variables, cars with significantly higher weights can be outliers that can be and should be taken in considerarion for further data cleaning. Understanding the distribution of car weights can help ensure consistency within the dataset and can also identify any data points that don’t fit the overall pattern.

ggplot(data = data, aes(x = Price)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  labs(title = "Price Distribution of Toyota Corolla")

ggplot(data = data, aes(x = Age_08_04)) +
  geom_histogram(bins = 50, fill = "green", color = "black") +
  labs(title = "Age of Seller Distribution of Toyota Corolla")

ggplot(data = data, aes(x = Mfg_Year)) +
  geom_histogram(bins = 50, fill = "red", color = "black") +
  labs(title = "Year of Model Distribution of Toyota Corolla")

ggplot(data = data, aes(x = KM)) +
  geom_histogram(bins = 50, fill = "purple", color = "black") +
  labs(title = "KM Distribution of Toyota Corolla")

ggplot(data = data, aes(x = HP)) +
  geom_histogram(bins = 50, fill = "pink", color = "black") +
  labs(title = "Horsepower Distribution of Toyota Corolla")

ggplot(data = data, aes(x = cc)) +
  geom_histogram(bins = 50, fill = "grey", color = "black") +
  labs(title = "Cubic Capacity Distribution of Toyota Corolla")

ggplot(data = data, aes(x = Quarterly_Tax)) +
  geom_histogram(bins = 50, fill = "magenta", color = "black") +
  labs(title = "Quaterly Tax Distribution of Toyota Corolla")

ggplot(data = data, aes(x = Weight)) +
  geom_histogram(bins = 50, fill = "orange", color = "black") +
  labs(title = "Weight Distribution of Toyota Corolla")

STEP 5: Correlation Matrix

Next, we proceeded to create a correaltion matrix showing the relaionships between numerical variables in the dataset. This way, we can more easily identify which variables are strongly or weakly related. This step is useful during data cleaning because highly correlated variables may contain redundant information that can guide future selection or transformation.

From the correlation matrix, the strongest relationships observed are:

Price and HP: Higher horsepower vehicles tend to incline towards more expensive prices

Price and KM: There is likely a negative correlation here, since higher mileage cars tend to have lower prices

Price and Mfg_Year: Newer cars correlate to having a higher price

num_vars = data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, type = "full")

STEP 6: Boxplots

Afterwards we created to identify outliers, and therefore remove these from “data” to clean our datatset and prevent them from further affect any future analyses and achieve appropriate prediction models, and overall improve data quality.

ggplot(data = data, mapping = aes(x = "", y = data[, 3])) +
  geom_boxplot() +
  xlab(names(data)[3]) + ylab(names(data)[3])

ggplot(data = data, mapping = aes(x = "", y = data[, 7])) +
  geom_boxplot() +
  xlab(names(data)[7]) + ylab(names(data)[7])

ggplot(data = data, mapping = aes(x = "", y = data[, 16])) +
  geom_boxplot() +
  xlab(names(data)[16]) + ylab(names(data)[16])

STEP 7: Outliers

When analyzing our dataset, we must remove outliers for several reasons. First, outliers can negatively affect the accuracy of models like linear regression by distorting parameters, leading to poor model fit and reduced predictive capabilities. They can also distort the data structure, resulting in misleading coefficients and inaccurate insights. Additionally, linear regression relies on key assumptions like normality and linearity, and outliers often violate these assumptions, producing unreliable estimates. During exploratory data analysis (EDA), we identified several variables that required revision: KM, Price, Weight, and Gears. To remove outliers, we used the boxplot.stats function with the $out parameter, which helped us identify outliers for each variable. We then created a vector with the row numbers of these outliers and deleted those rows for Price, KM, and Weight. Since Gears had a limited set of possible values (as identified by the unique function), we used a different approach for this variable. We manually identified and deleted the rows containing outlier values. Finally, we checked for missing data to ensure the model could be executed correctly.

Delete outliers for price

out_Price = boxplot.stats(data$Price)$out
outlier_Price = which(data$Price %in% out_Price)
data <- data[-outlier_Price, ]

Delete outliers for KM

out_KM = boxplot.stats(data$KM)$out
outlier_KM = which(data$KM %in% out_KM)
data <- data[-outlier_KM, ]

Delete outliers for Weight

out_Weight = boxplot.stats(data$Weight)$out
outlier_Weight = which(data$Weight %in% out_Weight)
data <- data[-outlier_Weight, ]

Delete outliers for Gears

row_index_Gears1 <- which(data$Gears == 3)
row_index_Gears1

## [1] 789 823

row_index_Gears2 <- which(data$Gears == 4)
row_index_Gears2

## [1] 1229

data = data[-c(957, 992, 1427), ]

Check for missing data

rows_with_na <- which(!complete.cases(data))
data[rows_with_na, ]

##  [1] Id               Model            Price            Age_08_04       
##  [5] Mfg_Month        Mfg_Year         KM               Fuel_Type       
##  [9] HP               Met_Color        Automatic        cc              
## [13] Doors            Cylinders        Gears            Quarterly_Tax   
## [17] Weight           Mfr_Guarantee    BOVAG_Guarantee  Guarantee_Period
## [21] ABS              Airbag_1         Airbag_2         Airco           
## [25] Automatic_airco  Boardcomputer    CD_Player        Central_Lock    
## [29] Powered_Windows  Power_Steering   Radio            Mistlamps       
## [33] Sport_Model      Backseat_Divider Metallic_Rim     Radio_cassette  
## [37] Tow_Bar         
## <0 rows> (or 0-length row.names)

# No missing data in the df :)

STEP 8: Missing data pattern (NON NUMERICAL CAN BE ANALIZED)

The md.pattern() function is used to analyze the pattern of missing data in both, numerical and non-numerical variables (binary or categorical). The code generates a visual pattern for missing data for the selected numerical varibales found within the dataset. The output shows which variables have missing values and the combinations of missing data across records. This is an important step, as it helps to understand if any variables have significant missing values. As well, a new subset is created by excluding the numerical values and keeping only the categorical or binary variables. In this case, we found that no data is missing, which was a positive finding to continue the preparation of our dataset.

md.pattern(num_vars)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##      Price Mfg_Year KM HP cc Quarterly_Tax Weight  
## 1436     1        1  1  1  1             1      1 0
##          0        0  0  0  0             0      0 0

#Binary and categorical data
other_vars <- data %>%
  select(-c("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight"))

md.pattern(other_vars)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##      Id Model Age_08_04 Mfg_Month Fuel_Type Met_Color Automatic Doors Cylinders
## 1236  1     1         1         1         1         1         1     1         1
##       0     0         0         0         0         0         0     0         0
##      Gears Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2
## 1236     1             1               1                1   1        1        1
##          0             0               0                0   0        0        0
##      Airco Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
## 1236     1               1             1         1            1               1
##          0               0             0         0            0               0
##      Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
## 1236              1     1         1           1                1            1
##                   0     0         0           0                0            0
##      Radio_cassette Tow_Bar  
## 1236              1       1 0
##                   0       0 0

STEP 9: Scatter plot matrix

Step 9 consisted on making scatter plots for 4 variables at a time, we used the ggpairs() function to plot them and compare them, however as we’re comparing price, it was the only variable that repeated in both occasions.

For the first scatter plot, we used the variables price, manufacturing year, kilometers driven (or mileages) and horsepower. The main results obtained in the first scatter plot show the strong positive correlation between price and manufacturing year of about 0.885, meaning newer cars are more expensive. In contrast, price and km have a strong negative correlation -0.570, meaning higher mileage leads to lower prices

On the other hand, the second scatter plot included variables such as price, cc, quarterly tax and weight. The main results obtained in this second case showed that the strongest positive correlation is between weight and quarterly tax, its coefficient is of 0.626 and it indicates that as the weight increases, the quarterly tax rises. For price and weight there was also a strong positive relationship of 0.581, meaning that higher weights usually come to higher prices, nonetheless the correlation for price and cc is on the weaker side with a correlation of 0.126

ggpairs(num_vars, columns = c("Price", "Mfg_Year", "KM", "HP"),
        title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP")

# Scatter plot matrix for the next group of 4 variables (replace with appropriate names)
ggpairs(num_vars, columns = c("Price", "cc", "Quarterly_Tax", "Weight"),
        title = "Scatter Plot Matrix of Price vs cc, Quarterly_Tax, Weight")

STEP 10: QQ-Plots

In step 10, we focused on creating QQ-plots using the function qqnorm() that would help us visualize whether our data follows a normal distribution and we also added a reference line with the function qqline(). In this step we used 5 different variables, price, manufacturing year, kilometers driven, horsepower and weight. Each result will now be explained:

For the QQ-Plot for Price, the graph shows that majority of the data fall along the reference line in the middle part, indicating a good fit in this area, but there are slight deviations at both tails, the lower left side deviate below the line, suggesting that actual prices are lower than expected. On the other hand, the upper right side has points deviating above the line, which indicates that the actual prices are higher than expected.

For the QQ-Plot for Mfg_Year, the result indicates that there are some sort of steps throughout the reference line, typically this means that the data is limited and/or discrete. In this case, each “step” corresponds to a specific year where a significant number of cars were manufactured. This also indicates the the “steeper” portions represent the transitions between the years. In this plot its observable the non-continuity nature of the manufacturing years.

For the QQ-Plot for KM, mileage follows a normal distributions with some deviations. The central part of the plot seems to be a good fit to the reference line, however there’s a clear deviation at the upper right side of the graph, indicating that the actual mileage for some cars is higher than expected.

For the QQ-Plot for HP, similarly to the manufacturing years, horsepower doesn’t follow a normal distribution, but instead a kind of steps, indicating the concentrated values at certain levels. The flat sections in each step represent the high number of cars with the same HP and the steeper sections show the jumps between these values. The plot also has minor deviations specifically on the higher end, which could mean there are few cars with higher HP.

For the QQ-Plot for Weight, this plot has minor deviations, a step-like pattern at some points and normal distribution all at once, first from the lower left side end, the point deviate below the line, meaning some car’s weight is lighter than expected, then there’s a subtle stepped pattern, this suggests that car weights tend to cluster at certain values, mainly because of common classes. Lastly, on the upper right end, the points deviate above the line, maybe suggesting there are a few heavier cars than expected

# QQ-Plot for Price
qqnorm(data$Price, main = "QQ Plot of Price")
qqline(data$Price, col = "blue", lwd = 2)

#QQ-Plot for Mfg_Year
qqnorm(data$Mfg_Year, main = "QQ Plot of Mfg_Year")
qqline(data$Mfg_Year, col = "red", lwd = 2)

#QQ-Plot for KM
qqnorm(data$KM, main = "QQ Plot of KM")
qqline(data$KM, col = "purple", lwd = 2)

#QQ-Plot for HP
qqnorm(data$HP, main = "QQ Plot of HP")
qqline(data$HP, col = "pink", lwd = 2)

#QQ-Plot for Weight
qqnorm(data$Weight, main = "QQ Plot of Weight")
qqline(data$Weight, col = "orange", lwd = 2)

STEP 11: Identify Classes in Categorical Attributes

In this step, the goal is to identify unique categories within some variables. For practicity purposes, the function “head” is applied to the functions of “unique” to identify the number of distinct categories or classes found in categorical attributes, which results useful when analyzing categorical data for models.

head(unique(data$Model))

## [1] "TOYOTA Corolla 1.9 D HATCHB TERRA 2/3-Doors"             
## [2] "TOYOTA Corolla 1.6 16v L.SOL 2/3-Doors"                  
## [3] "TOYOTA Corolla 1.6 16V VVT I 3DR TERRA 2/3-Doors"        
## [4] "TOYOTA Corolla 1.6 16V VVT I 3DR SOL 2/3-Doors"          
## [5] "TOYOTA Corolla 1.4 VVTI Linea Terra 2/3-Doors"           
## [6] "TOYOTA Corolla 1.4 16V VVT I 3DR TERRA COMFORT 2/3-Doors"

head(unique(data$Model))

## [1] "TOYOTA Corolla 1.9 D HATCHB TERRA 2/3-Doors"             
## [2] "TOYOTA Corolla 1.6 16v L.SOL 2/3-Doors"                  
## [3] "TOYOTA Corolla 1.6 16V VVT I 3DR TERRA 2/3-Doors"        
## [4] "TOYOTA Corolla 1.6 16V VVT I 3DR SOL 2/3-Doors"          
## [5] "TOYOTA Corolla 1.4 VVTI Linea Terra 2/3-Doors"           
## [6] "TOYOTA Corolla 1.4 16V VVT I 3DR TERRA COMFORT 2/3-Doors"

head(unique(data$Price))

## [1] 12950 16750 16950 15950 16250 15750

head(unique(data$Mfg_Year))

## [1] 2002 2003 2001 2000 1999 1998

head(unique(data$HP))

## [1]  69 110  97 116  86  72

head(unique(data$Weight))

## [1] 1105 1065 1120 1100 1110 1075

STEP 12: Summary Statistics

This step produces a summary of the dataset, demonstrating a comprehensive overview of the dataset to help spot any potential issues in both categorical and numerical values. It helps to overall understand the data distribution and spot, if any, anomalies.

summary(data)

##        Id            Model               Price         Age_08_04    
##  Min.   :  10.0   Length:1236        Min.   : 4450   Min.   :13.00  
##  1st Qu.: 459.8   Class :character   1st Qu.: 8450   1st Qu.:50.00  
##  Median : 789.5   Mode  :character   Median : 9750   Median :62.00  
##  Mean   : 784.3                      Mean   : 9988   Mean   :59.34  
##  3rd Qu.:1130.2                      3rd Qu.:10995   3rd Qu.:71.00  
##  Max.   :1442.0                      Max.   :17200   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM          Fuel_Type        
##  Min.   : 1.000   Min.   :1998   Min.   :     1   Length:1236       
##  1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 45975   Class :character  
##  Median : 5.000   Median :1999   Median : 64371   Mode  :character  
##  Mean   : 5.584   Mean   :1999   Mean   : 67554                     
##  3rd Qu.: 8.000   3rd Qu.:2000   3rd Qu.: 85000                     
##  Max.   :12.000   Max.   :2003   Max.   :151000                     
##        HP          Met_Color        Automatic             cc      
##  Min.   : 69.0   Min.   :0.0000   Min.   :0.00000   Min.   :1300  
##  1st Qu.: 97.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1400  
##  Median :110.0   Median :1.0000   Median :0.00000   Median :1600  
##  Mean   :101.8   Mean   :0.6715   Mean   :0.05583   Mean   :1536  
##  3rd Qu.:110.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1600  
##  Max.   :116.0   Max.   :1.0000   Max.   :1.00000   Max.   :2000  
##      Doors         Cylinders     Gears       Quarterly_Tax        Weight    
##  Min.   :2.000   Min.   :4   Min.   :3.000   Min.   : 19.00   Min.   :1000  
##  1st Qu.:3.000   1st Qu.:4   1st Qu.:5.000   1st Qu.: 69.00   1st Qu.:1035  
##  Median :4.000   Median :4   Median :5.000   Median : 69.00   Median :1060  
##  Mean   :3.999   Mean   :4   Mean   :5.025   Mean   : 79.47   Mean   :1060  
##  3rd Qu.:5.000   3rd Qu.:4   3rd Qu.:5.000   3rd Qu.: 85.00   3rd Qu.:1075  
##  Max.   :5.000   Max.   :4   Max.   :6.000   Max.   :197.00   Max.   :1135  
##  Mfr_Guarantee    BOVAG_Guarantee  Guarantee_Period      ABS        
##  Min.   :0.0000   Min.   :0.0000   Min.   : 3.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.: 3.000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.0000   Median : 3.000   Median :1.0000  
##  Mean   :0.4215   Mean   :0.9013   Mean   : 3.699   Mean   :0.7994  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :24.000   Max.   :1.0000  
##     Airbag_1         Airbag_2          Airco        Automatic_airco  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.9693   Mean   :0.7087   Mean   :0.4644   Mean   :0.01375  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  Boardcomputer      CD_Player       Central_Lock    Powered_Windows 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.2451   Mean   :0.1845   Mean   :0.5469   Mean   :0.5332  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Power_Steering       Radio          Mistlamps       Sport_Model    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.9757   Mean   :0.1481   Mean   :0.2387   Mean   :0.2727  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Backseat_Divider  Metallic_Rim    Radio_cassette      Tow_Bar      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.7702   Mean   :0.1934   Mean   :0.1472   Mean   :0.2969  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Separate DF in top car models

After analyzing the situation as a team, we determined that a good approach to identify the most relevant variables affecting vehicle prices is to analyze different car models individually. Therefore, we decided to create separate dataframes for the most popular types of cars to identify the factors influencing the price ranges of each vehicle. We used the table function to determine how many times each model appeared in the Model column of the dataset. Then, we identified the top three most repeated models and created dataframes that included only these models. This allowed us to analyze each car model individually, providing better insight into the factors that affect vehicle prices.

model_counts <- table(data$Model)
top_model <- names(which.max(model_counts))
top_model_data <- data[data$Model == top_model, ]
head(top_model_data)

##      Id                                               Model Price Age_08_04
## 247 248 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 11950        42
## 293 294 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10500        35
## 479 481 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10500        54
## 484 486 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors  9900        56
## 526 528 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10250        52
## 579 582 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 11500        55
##     Mfg_Month Mfg_Year    KM Fuel_Type  HP Met_Color Automatic   cc Doors
## 247         3     2001 55968    Petrol 110         0         0 1600     3
## 293        10     2001 43000    Petrol 110         0         0 1600     3
## 479         3     2000 63135    Petrol 110         1         0 1600     3
## 484         1     2000 62751    Petrol 110         0         0 1600     3
## 526         5     2000 49432    Petrol 110         1         0 1600     3
## 579         2     2000 33230    Petrol 110         0         0 1600     3
##     Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 247         4     5            69   1050             1               1
## 293         4     5            69   1050             0               1
## 479         4     5            69   1050             1               1
## 484         4     5            69   1050             0               1
## 526         4     5            69   1050             1               1
## 579         4     5            69   1050             0               1
##     Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 247                3   1        1        1     1               0             0
## 293                3   1        1        1     0               0             0
## 479                3   1        1        1     1               0             0
## 484                3   1        1        1     1               0             0
## 526                3   1        1        1     1               0             0
## 579                3   1        1        1     1               0             0
##     CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 247         0            0               0              1     0         0
## 293         1            1               0              1     0         0
## 479         0            1               1              1     0         0
## 484         0            0               0              1     0         0
## 526         0            1               1              1     1         0
## 579         0            1               1              1     0         0
##     Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 247           1                1            0              0       0
## 293           1                1            1              0       0
## 479           0                1            0              0       1
## 484           1                1            0              0       1
## 526           0                1            0              1       1
## 579           1                1            0              0       0

second_model <- names(sort(model_counts, decreasing = TRUE))[2]
second_model_data <- data[data$Model == second_model, ]
head(second_model_data)

##      Id                                               Model Price Age_08_04
## 420 422 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  8950        55
## 647 650 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  7250        65
## 694 697 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  9900        65
## 705 708 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  8450        65
## 713 716 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  8750        65
## 724 727 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors  7950        61
##     Mfg_Month Mfg_Year     KM Fuel_Type HP Met_Color Automatic   cc Doors
## 420         2     2000  94122    Petrol 86         1         0 1300     3
## 647         4     1999 117609    Petrol 86         0         0 1300     3
## 694         4     1999  97173    Petrol 86         0         0 1300     3
## 705         4     1999  93408    Petrol 86         0         0 1300     3
## 713         4     1999  91246    Petrol 86         1         0 1300     3
## 724         8     1999  88127    Petrol 86         0         0 1300     3
##     Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 420         4     5            69   1015             0               1
## 647         4     5            69   1015             0               1
## 694         4     5            69   1015             1               1
## 705         4     5            69   1015             0               1
## 713         4     5            69   1015             0               1
## 724         4     5            69   1015             1               1
##     Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 420                3   1        1        1     0               0             0
## 647                3   1        1        1     0               0             0
## 694                3   1        1        1     0               0             0
## 705                3   1        1        1     0               0             0
## 713                3   1        1        1     0               0             0
## 724                3   1        1        1     0               0             0
##     CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 420         0            1               1              1     0         0
## 647         0            0               0              1     0         0
## 694         0            0               0              1     0         0
## 705         0            0               0              1     0         0
## 713         0            0               0              1     0         0
## 724         0            0               0              1     1         0
##     Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 420           1                1            0              0       1
## 647           1                1            0              0       0
## 694           1                1            0              0       0
## 705           1                1            0              0       0
## 713           1                1            0              0       0
## 724           1                1            0              1       0

third_model <- names(sort(model_counts, decreasing = TRUE))[3]
third_model_data <- data[data$Model == third_model, ]
head(third_model_data)

##      Id                                             Model Price Age_08_04
## 406 408 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors  9950        54
## 497 499 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 11750        54
## 625 628 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors  7950        67
## 654 657 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors  9950        64
## 665 668 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors  9950        62
## 667 670 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors  9500        68
##     Mfg_Month Mfg_Year     KM Fuel_Type  HP Met_Color Automatic   cc Doors
## 406         3     2000 103454    Petrol 110         1         0 1600     5
## 497         3     2000  58530    Petrol 110         0         0 1600     5
## 625         2     1999 135337       CNG 110         1         0 1600     5
## 654         5     1999 114846    Petrol 110         1         0 1600     5
## 665         7     1999 109547    Petrol 110         0         0 1600     5
## 667         1     1999 109150    Petrol 110         0         0 1600     5
##     Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 406         4     5            85   1075             1               1
## 497         4     5            85   1075             0               1
## 625         4     5           197   1120             0               1
## 654         4     5            85   1075             1               1
## 665         4     5            85   1075             1               1
## 667         4     5            85   1075             0               1
##     Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 406                3   1        1        1     1               0             0
## 497                3   1        1        1     1               0             0
## 625                3   1        1        1     1               0             0
## 654                3   1        1        1     1               0             0
## 665                3   1        1        1     1               0             0
## 667                3   1        1        1     1               0             0
##     CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 406         0            1               1              1     0         1
## 497         0            1               1              1     0         1
## 625         1            1               1              1     0         1
## 654         0            1               1              1     0         1
## 665         1            1               1              1     0         1
## 667         0            1               1              1     0         1
##     Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 406           0                1            1              0       0
## 497           0                1            1              0       0
## 625           0                1            1              0       0
## 654           0                1            1              0       0
## 665           0                1            1              0       0
## 667           0                1            0              0       0

Exploration of top model

Onto the exploration of top model, first we used the function ggpairs() as a visualization tool, creating a scatter plot matrix, that relates certain car characteristics such as price, manufacturing year, mileage, horsepower and weight. The result of this matrix showed the distribution of each variable along the diagonal, with the variables relationship on the lower triangle and their corresponding correlation coefficients in the upper triangle. Based on this we found out that newer cars tend to be pricier and cars with very high mileage have lower prices. It’s also observable that there’s a weak almost negligible positive relationship between price and weight, and not a real correlation between horsepower and mileage.

The next thing we did was to use a correlation matrix plot, first we selected the chosen columns, in this case “Price”, “Mfg_Year”, “KM”,“HP”, “cc”,“Quarterly_Tax” and “Weight”. By using the function cor() to calculate the correlation matrix and then plotted the results. The resulting plot showed different-colored circles to represent the correlation coefficients between each variable pair. This tool helped us identify patters and relationship between variables, so we quickly found a strong and positive relationship between the price and the manufacturing year which are represented in the dark blue circles, as well as a strong negative correlation between the quaterly tax and weight.

Finally for this step, we created a QQ-plot using qqnorm() to compare the quantiles of the price variable to the quantiles of a standard distribution, we also added the function qqline() to be able to add a reference line and analyze the closeness of the data points with the expected pattern. Based on our results, the QQ-plot of price deviates a bit when the prices get higher, meaning there are less cars with very high prices in contrast with the rest of the line indicating there are more cars with lower prices. All of this is supported in our summary statistics that shows there’s a skewed distribution.

ggpairs(top_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
        title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight")

## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

top_model_data_num = top_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(top_model_data_num, use = "complete.obs")

## Warning in cor(top_model_data_num, use = "complete.obs"): the standard
## deviation is zero

corrplot(cor_matrix, type = "full")

qqnorm(top_model_data$Price, main = "QQ Plot of Price of Top Model")
qqline(top_model_data$Price, col = "blue", lwd = 2)

head(summary(top_model_data))

##        Id          Model               Price         Age_08_04    
##  Min.   : 248   Length:107         Min.   : 5950   Min.   :35.00  
##  1st Qu.: 852   Class :character   1st Qu.: 7750   1st Qu.:62.50  
##  Median :1017   Mode  :character   Median : 8750   Median :68.00  
##  Mean   :1027                      Mean   : 8624   Mean   :67.62  
##  3rd Qu.:1264                      3rd Qu.: 9250   3rd Qu.:74.00  
##  Max.   :1429                      Max.   :11950   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM          Fuel_Type        
##  Min.   : 1.000   Min.   :1998   Min.   : 22835   Length:107        
##  1st Qu.: 4.000   1st Qu.:1998   1st Qu.: 54328   Class :character  
##  Median : 6.000   Median :1999   Median : 65000   Mode  :character  
##  Mean   : 5.981   Mean   :1999   Mean   : 70753                     
##  3rd Qu.: 8.000   3rd Qu.:1999   3rd Qu.: 87919                     
##  Max.   :12.000   Max.   :2001   Max.   :137050                     
##        HP        Met_Color        Automatic       cc           Doors  
##  Min.   :110   Min.   :0.0000   Min.   :0   Min.   :1600   Min.   :3  
##  1st Qu.:110   1st Qu.:0.0000   1st Qu.:0   1st Qu.:1600   1st Qu.:3  
##  Median :110   Median :1.0000   Median :0   Median :1600   Median :3  
##  Mean   :110   Mean   :0.6449   Mean   :0   Mean   :1600   Mean   :3  
##  3rd Qu.:110   3rd Qu.:1.0000   3rd Qu.:0   3rd Qu.:1600   3rd Qu.:3  
##  Max.   :110   Max.   :1.0000   Max.   :0   Max.   :1600   Max.   :3  
##    Cylinders     Gears   Quarterly_Tax       Weight     Mfr_Guarantee   
##  Min.   :4   Min.   :5   Min.   : 69.0   Min.   :1050   Min.   :0.0000  
##  1st Qu.:4   1st Qu.:5   1st Qu.: 69.0   1st Qu.:1050   1st Qu.:0.0000  
##  Median :4   Median :5   Median : 69.0   Median :1050   Median :0.0000  
##  Mean   :4   Mean   :5   Mean   : 70.2   Mean   :1050   Mean   :0.4486  
##  3rd Qu.:4   3rd Qu.:5   3rd Qu.: 69.0   3rd Qu.:1050   3rd Qu.:1.0000  
##  Max.   :4   Max.   :5   Max.   :197.0   Max.   :1078   Max.   :1.0000  
##  BOVAG_Guarantee  Guarantee_Period      ABS            Airbag_1
##  Min.   :0.0000   Min.   :3        Min.   :0.0000   Min.   :1  
##  1st Qu.:1.0000   1st Qu.:3        1st Qu.:1.0000   1st Qu.:1  
##  Median :1.0000   Median :3        Median :1.0000   Median :1  
##  Mean   :0.9907   Mean   :3        Mean   :0.8598   Mean   :1  
##  3rd Qu.:1.0000   3rd Qu.:3        3rd Qu.:1.0000   3rd Qu.:1  
##  Max.   :1.0000   Max.   :3        Max.   :1.0000   Max.   :1  
##     Airbag_2          Airco        Automatic_airco Boardcomputer
##  Min.   :0.0000   Min.   :0.0000   Min.   :0       Min.   :0    
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0       1st Qu.:0    
##  Median :1.0000   Median :0.0000   Median :0       Median :0    
##  Mean   :0.8037   Mean   :0.3364   Mean   :0       Mean   :0    
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0       3rd Qu.:0    
##  Max.   :1.0000   Max.   :1.0000   Max.   :0       Max.   :0    
##    CD_Player        Central_Lock    Powered_Windows  Power_Steering
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :1     
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1     
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :1     
##  Mean   :0.07477   Mean   :0.3925   Mean   :0.3458   Mean   :1     
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1     
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1     
##      Radio          Mistlamps        Sport_Model     Backseat_Divider
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :1       
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:1       
##  Median :0.0000   Median :0.00000   Median :1.0000   Median :1       
##  Mean   :0.1402   Mean   :0.02804   Mean   :0.6729   Mean   :1       
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1       
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1       
##   Metallic_Rim    Radio_cassette      Tow_Bar      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1589   Mean   :0.1402   Mean   :0.3178  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Exploration of second car model

Then, we did an exploration of a second car model using the same conditions as for the top model, including the same variables and functions such as ggpairs() for a scatter plot matrix, cor() for the correlation matrix and for plotting the result, and qqnorm() to compare the QQ-plots and the qqline() to add a reference line to the plot.

For the scatter plot matrix, it was shown that price is heavily influenced by manufacturing year and mileage in case of used cars. For example, newer cars with lower mileage tend to have higher prices, this is noticeable in the strong positive correlation, having a coefficient of about 0.529, on the other hand for used cars, there’s a moderate negative correlation of around (-0.501). There was also a weak correlation between manufacturing year and mileage and a negligible relationship between horsepower and weight. This result is very similar to the top model.

Next, the correlation matrix showed that there’s a pattern and a high positive correlation between weight and engine displacement indicated by the dark blue circles (HP has also a strong positive correlation), however there was also a negative correlation between price and the kilometers driven, indicating that as the kilometers driven (mileage) increases, price usually decreases.

Lastly, this second model also displays a QQ-plot to compare the distribution of the observed price data to a theoretical normal distribution (obtained by the qqline() function). So in the results we can see that most of the data are along the blue line, specially in the middle of the point, nonetheless this suggests that in both ends the points fall either above or below the line, indicating that the tails of the data are heavier or data is skewed. In other words, there are more extreme prices (high or low) in the results.

ggpairs(second_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
        title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight of the 2nd car model")

## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

second_model_data_num = second_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(second_model_data_num, use = "complete.obs")

## Warning in cor(second_model_data_num, use = "complete.obs"): the standard
## deviation is zero

corrplot(cor_matrix, type = "full")

qqnorm(second_model_data$Price, main = "QQ Plot of Price of Second Model")
qqline(second_model_data$Price, col = "blue", lwd = 2)

head(summary(second_model_data))

##        Id            Model               Price         Age_08_04    
##  Min.   : 422.0   Length:80          Min.   : 5600   Min.   :55.00  
##  1st Qu.: 812.5   Class :character   1st Qu.: 7500   1st Qu.:65.00  
##  Median :1015.5   Mode  :character   Median : 7972   Median :68.00  
##  Mean   :1041.0                      Mean   : 8158   Mean   :69.14  
##  3rd Qu.:1248.0                      3rd Qu.: 8950   3rd Qu.:75.00  
##  Max.   :1441.0                      Max.   :10845   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM          Fuel_Type        
##  Min.   : 1.000   Min.   :1998   Min.   : 16916   Length:80         
##  1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 55804   Class :character  
##  Median : 4.000   Median :1999   Median : 73440   Mode  :character  
##  Mean   : 5.112   Mean   :1999   Mean   : 69592                     
##  3rd Qu.: 7.000   3rd Qu.:1999   3rd Qu.: 84420                     
##  Max.   :12.000   Max.   :2000   Max.   :146197                     
##        HP       Met_Color        Automatic       cc           Doors  
##  Min.   :86   Min.   :0.0000   Min.   :0   Min.   :1300   Min.   :3  
##  1st Qu.:86   1st Qu.:0.0000   1st Qu.:0   1st Qu.:1300   1st Qu.:3  
##  Median :86   Median :1.0000   Median :0   Median :1300   Median :3  
##  Mean   :86   Mean   :0.5625   Mean   :0   Mean   :1300   Mean   :3  
##  3rd Qu.:86   3rd Qu.:1.0000   3rd Qu.:0   3rd Qu.:1300   3rd Qu.:3  
##  Max.   :86   Max.   :1.0000   Max.   :0   Max.   :1332   Max.   :3  
##    Cylinders     Gears   Quarterly_Tax     Weight     Mfr_Guarantee   
##  Min.   :4   Min.   :5   Min.   :69    Min.   :1015   Min.   :0.0000  
##  1st Qu.:4   1st Qu.:5   1st Qu.:69    1st Qu.:1015   1st Qu.:0.0000  
##  Median :4   Median :5   Median :69    Median :1015   Median :0.0000  
##  Mean   :4   Mean   :5   Mean   :69    Mean   :1015   Mean   :0.3875  
##  3rd Qu.:4   3rd Qu.:5   3rd Qu.:69    3rd Qu.:1015   3rd Qu.:1.0000  
##  Max.   :4   Max.   :5   Max.   :69    Max.   :1015   Max.   :1.0000  
##  BOVAG_Guarantee Guarantee_Period      ABS           Airbag_1   
##  Min.   :0.00    Min.   :3        Min.   :0.000   Min.   :0.00  
##  1st Qu.:1.00    1st Qu.:3        1st Qu.:0.000   1st Qu.:1.00  
##  Median :1.00    Median :3        Median :1.000   Median :1.00  
##  Mean   :0.95    Mean   :3        Mean   :0.725   Mean   :0.95  
##  3rd Qu.:1.00    3rd Qu.:3        3rd Qu.:1.000   3rd Qu.:1.00  
##  Max.   :1.00    Max.   :3        Max.   :1.000   Max.   :1.00  
##     Airbag_2         Airco   Automatic_airco Boardcomputer   CD_Player     
##  Min.   :0.000   Min.   :0   Min.   :0       Min.   :0     Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0   1st Qu.:0       1st Qu.:0     1st Qu.:0.0000  
##  Median :1.000   Median :0   Median :0       Median :0     Median :0.0000  
##  Mean   :0.775   Mean   :0   Mean   :0       Mean   :0     Mean   :0.0125  
##  3rd Qu.:1.000   3rd Qu.:0   3rd Qu.:0       3rd Qu.:0     3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :0   Max.   :0       Max.   :0     Max.   :1.0000  
##   Central_Lock Powered_Windows Power_Steering     Radio         Mistlamps     
##  Min.   :0.0   Min.   :0.00    Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0   1st Qu.:0.00    1st Qu.:1.00   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0   Median :0.00    Median :1.00   Median :0.000   Median :0.0000  
##  Mean   :0.1   Mean   :0.15    Mean   :0.95   Mean   :0.125   Mean   :0.0125  
##  3rd Qu.:0.0   3rd Qu.:0.00    3rd Qu.:1.00   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :1.0   Max.   :1.00    Max.   :1.00   Max.   :1.000   Max.   :1.0000  
##   Sport_Model     Backseat_Divider  Metallic_Rim    Radio_cassette 
##  Min.   :0.0000   Min.   :0.00     Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1.00     1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.0000   Median :1.00     Median :0.0000   Median :0.000  
##  Mean   :0.5375   Mean   :0.95     Mean   :0.1125   Mean   :0.125  
##  3rd Qu.:1.0000   3rd Qu.:1.00     3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.00     Max.   :1.0000   Max.   :1.000  
##     Tow_Bar     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.175  
##  3rd Qu.:0.000  
##  Max.   :1.000

Exploration of third car model

For a third car model, we used the same conditions as the top and second model, including the same variables and functions such as ggpairs() for a scatter plot matrix, cor() for the correlation matrix and for plotting the result, and qqnorm() to compare the QQ-plots and the qqline() to add a reference line to the plot.

First for the scatter plot, the results showed that the correlation between price and manufacturing year is strong and positive, with around (0.662), also suggesting that newer cars tend to higher prices. Additionally, price and kilometers driven showed a negative correlation of -0.403, so higher mileage lower prices. Overall, the general overview is very similar to the top and second models

Next, the correlation matrix for the third model show a strong positive correlation between price and manufacturing year, represented in dark blue circles, in contrast with the strong negative correlation between price and kilometers driven, there are also some correlations that are either weak or negligible, but can be observed in the graph.

Finally, for this third model we obtained a QQ-plot representing the comparison of the actual price of top model to the theoretical normal distribution. The results showed that the prices in the middle range are almost normally distributed, however at the left lower end, the point deviate below the line, meaning that the prices are lower than expected, contrary to the upper right side, where point deviate above the line, indicating prices are higher than expected

ggpairs(third_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
        title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight of the 3rd car model")

## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

third_model_data_num = third_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(third_model_data_num, use = "complete.obs")

## Warning in cor(third_model_data_num, use = "complete.obs"): the standard
## deviation is zero

corrplot(cor_matrix, type = "full")

qqnorm(third_model_data$Price, main = "QQ Plot of Price of Top Model")
qqline(third_model_data$Price, col = "blue", lwd = 2)

head(summary(third_model_data))

##        Id            Model               Price         Age_08_04    
##  Min.   : 408.0   Length:79          Min.   : 6950   Min.   :54.00  
##  1st Qu.: 781.5   Class :character   1st Qu.: 8875   1st Qu.:62.00  
##  Median : 922.0   Mode  :character   Median : 9500   Median :65.00  
##  Mean   : 967.1                      Mean   : 9499   Mean   :67.03  
##  3rd Qu.:1213.5                      3rd Qu.: 9950   3rd Qu.:72.00  
##  Max.   :1405.0                      Max.   :11950   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM          Fuel_Type        
##  Min.   : 1.000   Min.   :1998   Min.   : 22705   Length:79         
##  1st Qu.: 2.000   1st Qu.:1998   1st Qu.: 59506   Class :character  
##  Median : 5.000   Median :1999   Median : 71825   Mode  :character  
##  Mean   : 5.468   Mean   :1999   Mean   : 72995                     
##  3rd Qu.: 8.000   3rd Qu.:1999   3rd Qu.: 85704                     
##  Max.   :12.000   Max.   :2000   Max.   :135337                     
##        HP        Met_Color        Automatic       cc           Doors  
##  Min.   :110   Min.   :0.0000   Min.   :0   Min.   :1600   Min.   :5  
##  1st Qu.:110   1st Qu.:0.0000   1st Qu.:0   1st Qu.:1600   1st Qu.:5  
##  Median :110   Median :1.0000   Median :0   Median :1600   Median :5  
##  Mean   :110   Mean   :0.7089   Mean   :0   Mean   :1600   Mean   :5  
##  3rd Qu.:110   3rd Qu.:1.0000   3rd Qu.:0   3rd Qu.:1600   3rd Qu.:5  
##  Max.   :110   Max.   :1.0000   Max.   :0   Max.   :1600   Max.   :5  
##    Cylinders     Gears   Quarterly_Tax        Weight     Mfr_Guarantee   
##  Min.   :4   Min.   :5   Min.   : 85.00   Min.   :1075   Min.   :0.0000  
##  1st Qu.:4   1st Qu.:5   1st Qu.: 85.00   1st Qu.:1075   1st Qu.:0.0000  
##  Median :4   Median :5   Median : 85.00   Median :1075   Median :0.0000  
##  Mean   :4   Mean   :5   Mean   : 86.42   Mean   :1076   Mean   :0.3291  
##  3rd Qu.:4   3rd Qu.:5   3rd Qu.: 85.00   3rd Qu.:1075   3rd Qu.:1.0000  
##  Max.   :4   Max.   :5   Max.   :197.00   Max.   :1120   Max.   :1.0000  
##  BOVAG_Guarantee  Guarantee_Period      ABS            Airbag_1
##  Min.   :0.0000   Min.   :3        Min.   :0.0000   Min.   :1  
##  1st Qu.:1.0000   1st Qu.:3        1st Qu.:1.0000   1st Qu.:1  
##  Median :1.0000   Median :3        Median :1.0000   Median :1  
##  Mean   :0.9494   Mean   :3        Mean   :0.8608   Mean   :1  
##  3rd Qu.:1.0000   3rd Qu.:3        3rd Qu.:1.0000   3rd Qu.:1  
##  Max.   :1.0000   Max.   :3        Max.   :1.0000   Max.   :1  
##     Airbag_2          Airco        Automatic_airco Boardcomputer
##  Min.   :0.0000   Min.   :0.0000   Min.   :0       Min.   :0    
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0       1st Qu.:0    
##  Median :1.0000   Median :1.0000   Median :0       Median :0    
##  Mean   :0.8481   Mean   :0.8734   Mean   :0       Mean   :0    
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0       3rd Qu.:0    
##  Max.   :1.0000   Max.   :1.0000   Max.   :0       Max.   :0    
##    CD_Player        Central_Lock Powered_Windows Power_Steering
##  Min.   :0.00000   Min.   :1     Min.   :1       Min.   :1     
##  1st Qu.:0.00000   1st Qu.:1     1st Qu.:1       1st Qu.:1     
##  Median :0.00000   Median :1     Median :1       Median :1     
##  Mean   :0.07595   Mean   :1     Mean   :1       Mean   :1     
##  3rd Qu.:0.00000   3rd Qu.:1     3rd Qu.:1       3rd Qu.:1     
##  Max.   :1.00000   Max.   :1     Max.   :1       Max.   :1     
##      Radio           Mistlamps       Sport_Model Backseat_Divider
##  Min.   :0.00000   Min.   :0.0000   Min.   :0    Min.   :1       
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0    1st Qu.:1       
##  Median :0.00000   Median :1.0000   Median :0    Median :1       
##  Mean   :0.06329   Mean   :0.8987   Mean   :0    Mean   :1       
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0    3rd Qu.:1       
##  Max.   :1.00000   Max.   :1.0000   Max.   :0    Max.   :1       
##   Metallic_Rim    Radio_cassette       Tow_Bar      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.7215   Mean   :0.06329   Mean   :0.3165  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000

Deleting unecessary data

Since all the values found in the “Cylinders” variable are the same in the dataset, these are removed to prevent them from affecting any further analyses.

data = data %>% select(-Cylinders)
# All values in Cylinders are the same, not relevant.

Conclusions

In conclusion, this report meticulously details the processes used to clean and analyze the Corolla dataset, providing an overview of each step taken to enhance the quality of the data. From the initial loading and cleaning of the dataset, including the removal of leading question marks and outliers, to the application of exploratory data analysis techniques such as histograms, correlation matrices, scatter plots, and QQ-plots, we have approached the challenge of understanding the relationships between various variables. The insights obtained from these analyses have shown us key patterns, particularly the strong positive correlation between vehicle price and manufacturing year, as well as the negative correlation with mileage.

Furthermore, we addressed the implications of outliers and missing data, ensuring a good dataset that can be used for future predictive modeling. Our exploration of multiple car models highlights consistent trends across the dataset, reinforcing our findings and identifying the importance of data cleaning and analysis in deriving conclusions. Finally, this report serves not only to show our methodical approach to data preparation but also as a resource for subsequent analyses, allowing us to build predictive models with great accuracy and reliability. The rigorous examination of the data ensures that we are well-equipped to address the complexities in automotive pricing, ultimately making it easier to have more informed decision-making in future reports.

Delivery 4.2-Preparation of the dataset

Team 4

2024-10-09