Load Data:

Cars.df = read.csv("/Users/aoibhinlaverty/DataMining/Lecture3/Assignment1/Cars.csv")

Reviewing the Data:

dim(Cars.df) ## find the dimension of data frame number of Rows & Columns
## [1] 1000    7
head(Cars.df) ## View first 6 rows of the dataset
##       Model    Price EngineSize Horsepower  Mileage Kilometers Year
## 1 Hatchback 26990.54   2.089507   152.8477 32.39167   52.12921 2017
## 2 Hatchback 25031.51   2.346371   202.9725 17.96607   28.91351 2012
## 3 Hatchback 35133.93   2.048951   151.8194 15.58014   25.07374 2019
## 4       SUV 33755.31   2.813534   158.6406 23.61317   38.00162 2021
## 5 Hatchback 22454.17   3.060178   210.9754 27.15214   43.69702 2014
## 6       SUV 29524.26   3.563607   160.9751 24.35607   39.19719 2013
tail(Cars.df) ## View last 6 rows of the dataset
##            Model    Price EngineSize Horsepower  Mileage Kilometers Year
## 995    Hatchback 24533.50   1.950207   135.4070 23.86184   38.40182 2018
## 996    Hatchback 31072.40   2.863282   170.8552 24.27257   39.06282 2018
## 997          SUV 34462.86   3.220435   176.7364 24.64290   39.65880 2014
## 998          SUV 35093.79   2.394915   150.6973 28.09765   45.21867 2020
## 999  Convertible 35445.56   3.225640   124.6947 25.63828   41.26071 2017
## 1000 Convertible 29184.36   2.820776   253.5519 21.86312   35.18519 2017
str(Cars.df) ## find summary statistics for each column (number of observations, number or variables and the variable types)
## 'data.frame':    1000 obs. of  7 variables:
##  $ Model     : chr  "Hatchback" "Hatchback" "Hatchback" "SUV" ...
##  $ Price     : num  26991 25032 35134 33755 22454 ...
##  $ EngineSize: num  2.09 2.35 2.05 2.81 3.06 ...
##  $ Horsepower: num  153 203 152 159 211 ...
##  $ Mileage   : num  32.4 18 15.6 23.6 27.2 ...
##  $ Kilometers: num  52.1 28.9 25.1 38 43.7 ...
##  $ Year      : int  2017 2012 2019 2021 2014 2013 2012 2011 2011 2011 ...

Assessing the Character Column “Model”:

library(dbplyr) #call the dbplyr library to use the unique function
unique(Cars.df$Model) #Show the unique values within the Model column
## [1] "Hatchback"   "SUV"         "Sedan"       "Convertible" "Joogooloo"  
## [6] ""

Assessing the Remaining Columns:

summary(Cars.df[,-1]) #Use Summary funtion on the numerical variables to show their values for minimum, maximum, Mean, Median, and 1st and 3rd Quartiles.
##      Price           EngineSize       Horsepower          Mileage     
##  Min.   :    2.7   Min.   :0.9761   Min.   :   54.84   Min.   :10.56  
##  1st Qu.:26605.2   1st Qu.:2.1744   1st Qu.:  154.26   1st Qu.:21.61  
##  Median :30010.9   Median :2.4837   Median :  178.33   Median :25.16  
##  Mean   :30023.2   Mean   :2.4978   Mean   :  274.71   Mean   :24.87  
##  3rd Qu.:33252.7   3rd Qu.:2.8326   3rd Qu.:  207.23   3rd Qu.:28.19  
##  Max.   :46951.8   Max.   :4.1453   Max.   :90000.00   Max.   :42.23  
##  NA's   :50        NA's   :51       NA's   :51         NA's   :51     
##    Kilometers         Year      
##  Min.   : 0.00   Min.   : 2010  
##  1st Qu.:34.77   1st Qu.: 2013  
##  Median :40.47   Median : 2016  
##  Mean   :39.98   Mean   : 2035  
##  3rd Qu.:45.36   3rd Qu.: 2019  
##  Max.   :67.96   Max.   :20190  
##  NA's   :50      NA's   :50

Observations from Data:

1. The Model Column has empty values & a redundant value “Joogooloo”.
2. The Price Column has 50 NA’s & a minimum value of 2.7, which I am assuming is not an accurate value.
3. The EngineSize Column has 51 NA’s & a minimum value of 0.9, which on researching is realistic.
4. The Horsepower Column has 51 NA’s & a maximum value of 90000.00, which I am assuming is not an accurate value.
5. The Mileage Column has 51 NA’s.
6. The Kilometers Column has 50 NA’s.
7. The Year Column has 50 NA’s and a maximum value of 20190, which I assume is a typo for 2019.
8. Mileage and Kilometers are essential the same value, making one of these columns redundant.

Identified Cleanup Actions:

Step 1. Remove rows with NA and “Joogooloo” values in the Model column.

Step 2. Remove redundant column “Kilometers”.

Step 3. Once step 1 & 2 are complete, review the data and determine how many NA’s are left for the remaining columns.
Step 4. Review year column to check if the incorrect Year value still exists, if so replace with correct value.
Step 5-A. Replace remaining NA’s with the Median value of that column
Step 5-B. Replace remaining inaccurate values with the Median value of that column.
Step 6. Round columns with decimal values to 2 decimal places for cleaner data.

Inital Data Clean-Up:

Step 1.Remove Rows with NA and Incorrect Values in the Model Column:

Cars.df = Cars.df[!(Cars.df$Model == "" | Cars.df$Model == "Joogooloo" ), ]

unique(Cars.df$Model) #Check of the Unique Values in Model
## [1] "Hatchback"   "SUV"         "Sedan"       "Convertible"

Step 2. Remove Redundant Column “Kilometers”:

Cars.df = Cars.df[,-6] 
#View(Cars.df) - View the updates to the data in a separate tab

Step 3. Review Data & Identify the Remaining NA Values:

summary(Cars.df)
##     Model               Price           EngineSize       Horsepower      
##  Length:949         Min.   :    2.7   Min.   :0.9761   Min.   :   54.84  
##  Class :character   1st Qu.:26604.4   1st Qu.:2.1744   1st Qu.:  154.24  
##  Mode  :character   Median :30008.4   Median :2.4835   Median :  178.11  
##                     Mean   :30010.7   Mean   :2.4971   Mean   :  274.74  
##                     3rd Qu.:33249.0   3rd Qu.:2.8324   3rd Qu.:  207.12  
##                     Max.   :46951.8   Max.   :4.1453   Max.   :90000.00  
##                                       NA's   :1        NA's   :1         
##     Mileage           Year      
##  Min.   :10.56   Min.   : 2010  
##  1st Qu.:21.61   1st Qu.: 2013  
##  Median :25.15   Median : 2016  
##  Mean   :24.86   Mean   : 2035  
##  3rd Qu.:28.18   3rd Qu.: 2019  
##  Max.   :42.23   Max.   :20190  
##  NA's   :1

Outcome of Initial Data Clean Up:

1. The Price Column now has 0 NA’s & a minimum value of 2.7, which I am assuming is not an accurate value.
2. The EngineSize Column has 1 NA.
3. The Horsepower Column has 1 NA & a maximum value of 90000.00, which I am assuming is not an accurate value.
4. The Mileage Column has 1 NA.
5. The Year Column has 0 NA’s and a maximum value of 20190, which I assume is a typo for 2019.

Data Clean Up Cont..

Step 4. Replace Incorrect Year Value with 2019:

unique(Cars.df$Year) # Checking unique values in Year to ensure 20190 is the only outlier
##  [1]  2017  2012  2019  2021  2014  2013  2011  2020  2022  2016  2015  2018
## [13]  2010 20190
Cars.df$Year[Cars.df$Year== 20190] = 2019 #Replace the value 20190 with 2019
unique(Cars.df$Year) #Check Unique Values for Year to ensure incorrect value was replaced
##  [1] 2017 2012 2019 2021 2014 2013 2011 2020 2022 2016 2015 2018 2010

Step 5-A. Replace Remaining NA Values with the Median Value:

The columns that are left with NA Values are EngineSize, Horsepower, and Mileage.

Engine Size:
Cars.df[is.na(Cars.df$EngineSize), ] # View the rows in the data where the Engine Size value is NA 
##        Model   Price EngineSize Horsepower  Mileage Year
## 99 Hatchback 35905.9         NA   122.3728 26.26043 2010
EngineS_Median = median(Cars.df$EngineSize, na.rm = T) #Get the Median of the column without the NA value
Cars.df$EngineSize[is.na(Cars.df$EngineSize)] = EngineS_Median #Assign the Median to the NA Value
summary(Cars.df$EngineSize) #Use summary function to verify Engine Size no longer has any NA values
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9761  2.1744  2.4835  2.4971  2.8323  4.1453
Horsepower:
Cars.df[is.na(Cars.df$Horsepower), ]  # View the rows in the data where the Horsepower value is NA 
##           Model    Price EngineSize Horsepower  Mileage Year
## 149 Convertible 33704.07   2.976278         NA 25.55111 2015
HP_Median = median(Cars.df$Horsepower, na.rm = T) #Get the Median of the column without the NA value
Cars.df$Horsepower[is.na(Cars.df$Horsepower)] = HP_Median #Assign the Median to the NA Value
summary(Cars.df$Horsepower) #Use summary function to verify Horsepower no longer has any NA values
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    54.84   154.26   178.11   274.64   207.09 90000.00
Mileage:
Cars.df[is.na(Cars.df$Mileage), ]  # View the rows in the data where the Mileage value is NA
##     Model    Price EngineSize Horsepower Mileage Year
## 199 Sedan 32102.87   2.543622   163.7784      NA 2019
Mileage_Median = median(Cars.df$Mileage, na.rm = T) #Get the Median of the column without the NA value
Cars.df$Mileage[is.na(Cars.df$Mileage)] = Mileage_Median #Assign the Median to the NA Value
summary(Cars.df$Mileage) #Use summary function to verify Mileage no longer has any NA values
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.56   21.61   25.15   24.86   28.18   42.23

Step 5- B. Replace Remaining Inaccurate Values with the Median Value:

The remaining Inaccurate Values are:

  • The Maximum Value of Horsepower
  • The Minimum Value of Price.

I used View(Cars.df) to view the data in a separate tab and sorted the Horsepower and Price columns. I did this to determine the last correct maximum value for Horsepower and the last correct minimum value for Price. I used these values to help identify the outliers and replace the incorrect values with the Median of the column.

Horsepower (Max Value):
Cars.df[Cars.df$Horsepower > 500, ]  # View the rows in the data where the Horsepower value greater than 500
##     Model    Price EngineSize Horsepower  Mileage Year
## 749 Sedan 37023.42    2.77458      90000 25.56134 2022
Cars.df$Horsepower[Cars.df$Horsepower > 500] = HP_Median #Assign the Median to the incorrect value
summary(Cars.df$Horsepower) #Use summary function to verify Horsepower no longer has an inaccurate Max Value
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   54.84  154.26  178.11  179.99  207.01  316.84
Price (Min Value):
Cars.df[Cars.df$Price < 15000, ]  # View rows in the data where the Price value is less than 15,000
##         Model Price EngineSize Horsepower  Mileage Year
## 499 Hatchback   2.7   2.768428   227.5605 21.62572 2011
Price_Median = median(Cars.df$Price) #Get the Median of the column
Cars.df$Price[Cars.df$Price < 1500] = Price_Median #Assign the Median to the incorrect Value
summary(Cars.df$Price)  #Use summary function to verify Price no longer has an inaccurate Min Value
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15951   26607   30008   30042   33249   46952

Step 6. Round the Decimal Values to 2 Decimal Places For Cleaner Data:

Cars.df$Price = round(Cars.df$Price, 2)
Cars.df$EngineSize = round(Cars.df$EngineSize, 2)
Cars.df$Horsepower = round(Cars.df$Horsepower, 2)
Cars.df$Mileage = round(Cars.df$Mileage, 2)

# View(Cars.df) - View Data after all updates
summary(Cars.df) #Use summary function to verify updates 
##     Model               Price         EngineSize      Horsepower    
##  Length:949         Min.   :15951   Min.   :0.980   Min.   : 54.84  
##  Class :character   1st Qu.:26607   1st Qu.:2.170   1st Qu.:154.26  
##  Mode  :character   Median :30008   Median :2.480   Median :178.11  
##                     Mean   :30042   Mean   :2.497   Mean   :179.99  
##                     3rd Qu.:33249   3rd Qu.:2.830   3rd Qu.:207.01  
##                     Max.   :46952   Max.   :4.150   Max.   :316.84  
##     Mileage           Year     
##  Min.   :10.56   Min.   :2010  
##  1st Qu.:21.61   1st Qu.:2013  
##  Median :25.15   Median :2016  
##  Mean   :24.86   Mean   :2016  
##  3rd Qu.:28.18   3rd Qu.:2019  
##  Max.   :42.23   Max.   :2022

Correlation of Numeric Variables

Cars_Correlation = round(cor(Cars.df[,-1]),2) #show the correlation value of each numeric variable rounded to two decimal places
Cars_Correlation #Print the correlation table
##            Price EngineSize Horsepower Mileage  Year
## Price       1.00       0.06      -0.02    0.03  0.00
## EngineSize  0.06       1.00       0.03    0.00  0.01
## Horsepower -0.02       0.03       1.00    0.06 -0.02
## Mileage     0.03       0.00       0.06    1.00  0.02
## Year        0.00       0.01      -0.02    0.02  1.00

Scatter

plot(Cars.df[,-1]) #Generate a scatter plot showing correlation with all numeric variables

I initially used the scatter plot function to show the correlation between the numeric variables in the dataset, however this is difficult to decipher and as a result I created a Heatmap to better display the correlations.

Heatmap

library(plotly)
Cars_Correlation[upper.tri(Cars_Correlation, diag = TRUE)] = NA #Remove variables direct correlation  (the diagonal 1) and duplicate correlations.

ColorScale = list(
  c(0, "blue"),   # Color for High Negative Correlation
  c(0.5, "white"), # Color for little to no Correlation
  c(1, "yellow")     # Color for High Positive Correlation
)

Cars_Cor_Graph = plot_ly(
  x=colnames(Cars_Correlation), 
  y=rownames(Cars_Correlation),
  z = Cars_Correlation,
  type = "heatmap",
  zmin = -1,
  zmax = 1,
  colorscale = ColorScale,
  colorbar = list(title = "Correlation")
)

Cars_Cor_Graph = Cars_Cor_Graph %>% layout(title = "Correlation of Numeric Variables in the Cars Dataset")

Cars_Cor_Graph

Observations from Heatmap:

In general the color correlation for each variable is very light in the Heatmap, this shows that each variable has little to no correlation with each other. This verifies what we saw above when the correlation table was initially called Cars_Correlation. The biggest takeaway from this Heatmap is that Horsepower has a negative correlation with both Year and Price, although this is a very low negative correlation (-0.2) it still gives us insight that as Horsepower increase the Price and Year of the cars are likely to decrease slightly and vice versa.

Identification of Outliers in the Dataset using Boxplot:

par(mfcol = c(2, 3))
boxplot(Cars.df$Price, main= "Price")
boxplot(Cars.df$EngineSize, main= "Engine Size")
boxplot(Cars.df$Horsepower, main= "Horsepower")
boxplot(Cars.df$Mileage, main= "Mileage")
boxplot(Cars.df$Year, main= "Year")

Observations from the Boxplots:

Price and Horsepower show the most outliers in this data set.

1.Price:

  • Median of around $30,000, interquartile range of around $26,000 (1st Quartile) to $34,000 (3rd Quartile).
  • There are a few outliers in the dataset,below $15,000 and above $45,000.
    • This indicates there are some unique cases of very low and very high priced cars. Most of the outliers are above $45,000.

2. Engine Size:

  • Median of around 2.5 liters, interquartile range of around 2.3 (1st Quartile) to 2.8 (3rd Quartile).
  • There are a few outliers in the dataset, below 1 liter and above 3.7 liters.
    • This indicates there are some unique cases of cars with Engine Sizes that are very small and very large. Most of the outliers are above 3.7 liters.

3. Horsepower:

  • Median of around 175hp,interquartile range of around 150hp (1st Quartile) to 200hp (3rd Quartile).
  • There are a few outliers in the dataset, below 60hp and above 260hp.
    • This indicates there are some unique cases of cars where the Horsepower is very low and very high.Most of the outliers are below 60hp.

4. Mileage:

  • Median of around 25mpg, an interquartile range of around 22mpg (1st Quartile) to 29mpg (3rd Quartile).
  • There are a few outliers in the dataset,below 11mpg and above 36mpg.
    • This indicates there are some unique cases of cars with very low and very high Mileage.There is an even spread above and below the min and max values.

5. Year:

  • The Boxplot shows a Median of around 2016 , an interquartile range of around 2013 (1st Quartile) to 2019 (3rd Quartile).
  • There are no outliers in the data set.

Descriptive Statistics of Preprocessed Data:

Calculating Central Tendency Measures (Mean & Median):

summary(Cars.df) # Mean & Median can be derived from the output of the summary function
##     Model               Price         EngineSize      Horsepower    
##  Length:949         Min.   :15951   Min.   :0.980   Min.   : 54.84  
##  Class :character   1st Qu.:26607   1st Qu.:2.170   1st Qu.:154.26  
##  Mode  :character   Median :30008   Median :2.480   Median :178.11  
##                     Mean   :30042   Mean   :2.497   Mean   :179.99  
##                     3rd Qu.:33249   3rd Qu.:2.830   3rd Qu.:207.01  
##                     Max.   :46952   Max.   :4.150   Max.   :316.84  
##     Mileage           Year     
##  Min.   :10.56   Min.   :2010  
##  1st Qu.:21.61   1st Qu.:2013  
##  Median :25.15   Median :2016  
##  Mean   :24.86   Mean   :2016  
##  3rd Qu.:28.18   3rd Qu.:2019  
##  Max.   :42.23   Max.   :2022
Interpretation:

1. Price:

  • Mean is the average price of a car in this dataset which is $30,042.
  • Median is the middle value of the dataset which is $30,008.
  • As the Mean and Median are very close in range, this is an most cars in this dataset are priced around $30,000.

2. Engine Size:

  • Mean is the average Engine Size in a car in this dataset which is 2.50 liters.
  • Median is the middle value of the dataset which is 2.48 liters.
  • As the Mean and Median are very close in range, this is an indication that most cars have engine sizes close to 2.5 liters.

3. Horsepower:

  • Mean is the average Horsepower in a car in this dataset which is 180hp.
  • Median is the middle value of the dataset which is 178 hp.
  • As the Mean and Median are very close in range, this is an indication that most cars horsepower is around 179 hp.

4. Mileage:

  • Mean is the average Mileage in a car in this dataset which is 24.9 mpg.
  • Median is the middle value of the dataset which is 25.1 mpg.
  • As the Mean and Median are very close in range, this is an indication that most cars have a mileage of around 25mpg.

5. Year:

  • Mean is the average Year the cars in this dataset were built which is 2016.
  • Median is the middle value of the dataset which is 2016.
  • As the Mean and Median are the same this indicates that the majority of the cars in this dataset are from 2016.


Overall, for each variable the Mean and Median were extremely similar to each other. This gives the impression that the distribution of each variable is normally distributed with few extreme outliers. This is validated by the Boxplots of the variables above.

Calculating Central Tendency Measures (Mode)

The Mode of a specific variable can be determined using the table() function, which represents the most frequently occurring value. However, determining the Mode is not ideal for this dataset as each variable contains numerous unique values.
Note: I have commented out the table() function for Price, EngineSize, Horsepower & Mileage as a result of this.

table(Cars.df$Year)
## 
## 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 
##   75   68   81   81   69   68   72   79   77   84   60   66   69
table(Cars.df$Model)
## 
## Convertible   Hatchback       Sedan         SUV 
##         233         242         234         240
# table(Cars.df$Price)
# table(Cars.df$EngineSize)
# table(Cars.df$Horsepower)
# table(Cars.df$Mileage)

Compute Measures of Dispersion (Standard Deviation, Variance, Range,& Interquartile Range):

#I created data frames for each dispersion measure
Price_Dispersion = data.frame(
  Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
  Value = c(round(sd(Cars.df$Price), 2),
            round(var(Cars.df$Price), 2),
            min(Cars.df$Price),
            max(Cars.df$Price),
            IQR(Cars.df$Price))
)

Mileage_Dispersion = data.frame(
  Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
  Value = c(round(sd(Cars.df$Mileage), 2),
            round(var(Cars.df$Mileage), 2),
            min(Cars.df$Mileage),
            max(Cars.df$Mileage),
            IQR(Cars.df$Mileage))
)

EngineSize_Dispersion = data.frame(
  Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
  Value = c(round(sd(Cars.df$EngineSize), 2),
            round(var(Cars.df$EngineSize), 2),
            min(Cars.df$EngineSize),
            max(Cars.df$EngineSize),
            IQR(Cars.df$EngineSize))
)

Horsepower_Dispersion = data.frame(
  Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
  Value = c(round(sd(Cars.df$Horsepower), 2),
            round(var(Cars.df$Horsepower), 2),
            min(Cars.df$Horsepower),
            max(Cars.df$Horsepower),
            IQR(Cars.df$Horsepower))
)

Year_Dispersion = data.frame(
  Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
  Value = c(round(sd(Cars.df$Year), 2),
            round(var(Cars.df$Year), 2),
            min(Cars.df$Year),
            max(Cars.df$Year),
            IQR(Cars.df$Year))
)

# I then combined them all into one data frame
Dispersion_table = data.frame(
  Metric = Price_Dispersion$Metric,
  Price = Price_Dispersion$Value,
  Mileage = Mileage_Dispersion$Value,
  EngineSize = EngineSize_Dispersion$Value,
  Horsepower = Horsepower_Dispersion$Value,
  Year = Year_Dispersion$Value
)

# Print the combined table
print(Dispersion_table)
##               Metric       Price Mileage EngineSize Horsepower    Year
## 1 Standard Deviation     5001.70    4.95       0.49      40.36    3.70
## 2           Variance 25017001.44   24.54       0.24    1628.64   13.71
## 3                Min    15951.13   10.56       0.98      54.84 2010.00
## 4                Max    46951.85   42.23       4.15     316.84 2022.00
## 5                IQR     6641.48    6.57       0.66      52.75    6.00

Alteratively you can derive the Range and Interquartile Range using the function summary() :

  • Range: The Minimum and Maximum Values.
  • Interquartile Range: The 3rd Quartile Value minus the 1st Quartile Value.
Interpretation:

1. Price:

  • Standard Deviation (SD): Indicates that car prices deviate from the Mean (average) by about $5,002.
  • Variance: 25,017,001, indicates high variance in the price of cars.
  • Range: The difference between the most expensive (Max Value) and the least expensive (Min Value) car in the dataset.
  • IQR: Indicates the middle 50% of car prices are within the range of $6,641 (3rd Quartile – 1st Quartile).

2. Mileage:

  • Standard Deviation (SD): Indicates that Mileage deviates from the Mean (average) by about 4.95 mpg.
  • Variance: 24.54, indicates moderate variance in the car mileage.
  • Range: The difference between car with the most mileage (Max Value) and the least mileage (Min Value) in the dataset.
  • IQR: Indicates the middle 50% of car mileage is within a 6.57 mpg range (3rd Quartile – 1st Quartile).

3. Engine Size:

  • Standard Deviation (SD): Indicates that Engine Size deviates from the Mean (average) by about 0.49 liters.
  • Variance: 0.24, indicates very low variance in the Engine Size of the cars.
  • Range: The difference between car with the largest Engine Size (Max Value) and the smallest (Min Value) in the dataset.
  • IQR: Indicates the middle 50% of Engine Sizes are within a 0.66-liter range (3rd Quartile – 1st Quartile).

4. Horsepower:

  • Standard Deviation (SD): Indicates that Horsepower deviates from the Mean (average) by about 40.4 hp.
  • Variance: 1,628.6, indicates high variance in horsepower.
  • Range: The difference between the car with the highest horsepower (Max Value) and lowest (Min Value) in the dataset.
  • IQR: Indicates the middle 50% of the horsepower values are within a 52.8 hp range (3rd Quartile – 1st Quartile).

5. Year:

  • Standard Deviation (SD): Indicates that year a car was built deviate from the Mean (average) by about 3.7 years.
  • Variance: 13.7, indicates moderate variance in the year of the car.
  • Range: The difference between the earliest year a car was built (Max Value) and latest (Min Value) in the dataset.
  • IQR: Indicates the middle 50% of cars were built within a 6-year range.

Summary:

  • Price and Horsepower show the highest variation.
  • Mileage and Engine Size have moderate levels of variation.
  • Again, this is also validated by the Boxplots of the variables above.

Visualization of Preprocessed Data:

Histogram of the Variable Price:

library(plotly) #call the plotly library to use the plotly function

plot_ly(
  data = Cars.df,
  x = ~ Price,
  type = 'histogram',
  nbinsx = 10, #Set bin size
  marker = list(color = '#73c1ac', # Set color of the bars
                line = list(
                  color = 'black',      # Set color of the outline
                  width = 1.5           # Set width of the outline
                )
  ) 
)%>%
  layout(
    title = "Car Prices 2010 - 2022 <br> SUVs, Hatchbacks, Sedans & Convertibles",
    xaxis = list(title = 'Price (USD)'),
    yaxis = list(title = 'Count of Cars')
  ) 

Observations:

  • The distribution of the Price Variable is bell-shaped, indicating it is normally distributed.
  • Car prices range from $15,000 to $50,000, with the majority falling between $20,000 and $40,000.
  • Overall th data is mostly concentrated between $25,000 and $35,000, suggesting that this is the most common price range for a car in this datset.

Advanced Data Visualization

#Create a Bin for Horsepower:

# I reviewed the Min and Max Values for Horsepower to help determine my bin sizes
min(Cars.df$Horsepower) # 54.8
## [1] 54.84
max(Cars.df$Horsepower) # 317
## [1] 316.84
Cars.df = Cars.df %>%
  mutate(Horsepower_Bin = cut (Cars.df$Horsepower,
                               breaks = c(0,120,240,360),
                               labels = c("0-120", "121-240", "241-360")
                               ))

# Created a Scatter plot with Plotly to show the Car Price based on Year, Model and Horsepower
plot_ly(
  data = Cars.df,
  x = ~Year,
  y = ~ Price,
  symbol = ~Model,
  color = ~Horsepower_Bin,
  type = 'scatter',
  nbinsx = 10,
  marker = list(line = list(
    color = ~ Model, 
    width = 1.5           
  )
  ) 
)%>%
  layout(
    title = "Car Price based on Year, Model & Horsepower",
    xaxis = list(title = 'Year'),
    yaxis = list(title = 'Price(USD)')
  )

Scatter Plot Format:

  • The vertical/y axis represents the Price in USD, which ranges from around $15k to $45k.
  • The horizontal/x axis represents the Year the car was made and ranges from 2010-2022.
  • The shapes represent the car Model.
  • The colors represent Horsepower range of the car (0-120, 121-240, 241-360).

Observations:

1.Price: Variation is high across all years (2010-2022). Overall, the most common price range seems to be between $20,000 and $35,000.

2.Horsepower: On average cars where the horsepower is between 121-240 (orange) seem to be higher priced, however there is high price variation based on the horsepower groups and is difficult to say that there is a strong correlation between horsepower and price.

3.Models:

  • Convertibles (Circles) on average have a higher price, especially when the horsepower is within 121-240 (orange).
  • Hatchbacks (Triangles) on average seem to be cheaper, cars with higher horsepower do not seem to beas expensive in comparison to other models.
  • SUVs (Cross) have significant price variation based on horsepower, with cars that have a horsepower between 121-240 (orange) showing on both the high and low ranges of price.
  • Sedans (Square) in general no matter the horsepower range are mostly in the price range of $25,000 and $40,000.

Conclusion: Although there is some price variance across the years is is difficult to say that cars from 2010 are cheaper than newer models (2022).

Summary and Conclusions

This report covers Reviewing & Assessing Data, Data Cleaning, Identification of Outliers, Descriptive Statistics, and Data Visualization Methods. These topics were covered as part of BANL_6625_03 - Data Mining for Business Intelligence (Week 2 & 3).

In this report I looked into the Cars.csv dataset which had 1,000 observations and 7 variables. The dataset contained several data quality issues, such as missing values, incorrect values, and redundancy variables (kilometers), which were addressed through a series of cleaning steps. I removed rows with empty or inaccurate values, removed the redundant kilometers column, and replaced the remaining NA and inaccurate values with the Median of the associated column to maintain the integrity of the data.

After I cleaned the data I used Boxplots to visualize the data and identify any outliers within the numeric variables. During this analysis I was able to identify that Price and Horsepower seemed to have the most outliers in the data set. The Price range for majority of the cars was between $15,000 to $45,000, however there are some cars in the data set that are below or above this range, with majority of the outliers above $45,000. Horsepower had majority of its outliers below the minimum value of 60hp. From the boxplots I was also able to derive the Median and Interquartile Ranges for each of the numeric variables.

In order to get a better indication of the distribution of the data I calculated the central tendency measures. I knew the Median from the Boxplot but the summary function provided me with the Mean. On reviewing the range between the Median and the Mean I was able to get a better indication on the distribution of each of the variables. As each variable had little to no difference between the Median and the Mean values, this indicated the data was normally distributed across each variable.

In addition to central tendency measures I also calculated the dispersion of the data using Standard Deviation, Variance, Range and the Interquartile Range. This information allowed me to better understand how the data was dispersed. Price and Horsepower had high variance which indicates the data is widely scattered making it less consistent. Whereas EngineSize and Mileage had mush less variation Meaning the data is more clustered together, making it more consistent.

Finally I used data visualization techniques to illustrate the relationship among certain variables. I created my plots using plotly as I am learning it as part of my Data Visulaization class and find it to be a more interactive graph for the end user. I used a Histogram to visually determine the distribution of the Price variable, this validated the results from the central tendency measures by illustrating a normally distributed histogram (bell-shape). I also used a Scatter Plot to show the relationship between the Price, Model, Year, and Horsepower with the dataset. This demonstrated the wide variance in Price and Horsepower identified when calculating the dispersion of the variables. The scatter plot also interesting showed that there was little difference in Price based on the Year the car was built, my initial assumption would have been that an older car would be cheaper than a newer model.

In conclusion this report highlights the steps taken to clean and visualize the Cars.csv dataset in order to identify and handle incorrect values, detect outliers and clearly understand the relationship between the variables.