Cars.df = read.csv("/Users/aoibhinlaverty/DataMining/Lecture3/Assignment1/Cars.csv")
dim(Cars.df) ## find the dimension of data frame number of Rows & Columns
## [1] 1000 7
head(Cars.df) ## View first 6 rows of the dataset
## Model Price EngineSize Horsepower Mileage Kilometers Year
## 1 Hatchback 26990.54 2.089507 152.8477 32.39167 52.12921 2017
## 2 Hatchback 25031.51 2.346371 202.9725 17.96607 28.91351 2012
## 3 Hatchback 35133.93 2.048951 151.8194 15.58014 25.07374 2019
## 4 SUV 33755.31 2.813534 158.6406 23.61317 38.00162 2021
## 5 Hatchback 22454.17 3.060178 210.9754 27.15214 43.69702 2014
## 6 SUV 29524.26 3.563607 160.9751 24.35607 39.19719 2013
tail(Cars.df) ## View last 6 rows of the dataset
## Model Price EngineSize Horsepower Mileage Kilometers Year
## 995 Hatchback 24533.50 1.950207 135.4070 23.86184 38.40182 2018
## 996 Hatchback 31072.40 2.863282 170.8552 24.27257 39.06282 2018
## 997 SUV 34462.86 3.220435 176.7364 24.64290 39.65880 2014
## 998 SUV 35093.79 2.394915 150.6973 28.09765 45.21867 2020
## 999 Convertible 35445.56 3.225640 124.6947 25.63828 41.26071 2017
## 1000 Convertible 29184.36 2.820776 253.5519 21.86312 35.18519 2017
str(Cars.df) ## find summary statistics for each column (number of observations, number or variables and the variable types)
## 'data.frame': 1000 obs. of 7 variables:
## $ Model : chr "Hatchback" "Hatchback" "Hatchback" "SUV" ...
## $ Price : num 26991 25032 35134 33755 22454 ...
## $ EngineSize: num 2.09 2.35 2.05 2.81 3.06 ...
## $ Horsepower: num 153 203 152 159 211 ...
## $ Mileage : num 32.4 18 15.6 23.6 27.2 ...
## $ Kilometers: num 52.1 28.9 25.1 38 43.7 ...
## $ Year : int 2017 2012 2019 2021 2014 2013 2012 2011 2011 2011 ...
library(dbplyr) #call the dbplyr library to use the unique function
unique(Cars.df$Model) #Show the unique values within the Model column
## [1] "Hatchback" "SUV" "Sedan" "Convertible" "Joogooloo"
## [6] ""
summary(Cars.df[,-1]) #Use Summary funtion on the numerical variables to show their values for minimum, maximum, Mean, Median, and 1st and 3rd Quartiles.
## Price EngineSize Horsepower Mileage
## Min. : 2.7 Min. :0.9761 Min. : 54.84 Min. :10.56
## 1st Qu.:26605.2 1st Qu.:2.1744 1st Qu.: 154.26 1st Qu.:21.61
## Median :30010.9 Median :2.4837 Median : 178.33 Median :25.16
## Mean :30023.2 Mean :2.4978 Mean : 274.71 Mean :24.87
## 3rd Qu.:33252.7 3rd Qu.:2.8326 3rd Qu.: 207.23 3rd Qu.:28.19
## Max. :46951.8 Max. :4.1453 Max. :90000.00 Max. :42.23
## NA's :50 NA's :51 NA's :51 NA's :51
## Kilometers Year
## Min. : 0.00 Min. : 2010
## 1st Qu.:34.77 1st Qu.: 2013
## Median :40.47 Median : 2016
## Mean :39.98 Mean : 2035
## 3rd Qu.:45.36 3rd Qu.: 2019
## Max. :67.96 Max. :20190
## NA's :50 NA's :50
1. The Model Column has empty values & a
redundant value “Joogooloo”.
2. The Price Column
has 50 NA’s & a minimum value of 2.7, which I am assuming is not an
accurate value.
3. The EngineSize Column has 51
NA’s & a minimum value of 0.9, which on researching is
realistic.
4. The Horsepower Column has 51 NA’s
& a maximum value of 90000.00, which I am assuming is not an
accurate value.
5. The Mileage Column has 51
NA’s.
6. The Kilometers Column has 50 NA’s.
7. The Year Column has 50 NA’s and a maximum value of
20190, which I assume is a typo for 2019.
8.
Mileage and Kilometers are essential the same value, making one of these
columns redundant.
Step 1. Remove rows with NA and “Joogooloo” values in the Model column.
Step 2. Remove redundant column “Kilometers”.
Step 3. Once step 1 & 2 are complete, review the
data and determine how many NA’s are left for the remaining columns.
Step 4. Review year column to check if the
incorrect Year value still exists, if so replace with correct value.
Step 5-A. Replace remaining NA’s with the Median
value of that column
Step 5-B. Replace remaining
inaccurate values with the Median value of that column.
Step
6. Round columns with decimal values to 2 decimal places for
cleaner data.
Cars.df = Cars.df[!(Cars.df$Model == "" | Cars.df$Model == "Joogooloo" ), ]
unique(Cars.df$Model) #Check of the Unique Values in Model
## [1] "Hatchback" "SUV" "Sedan" "Convertible"
Cars.df = Cars.df[,-6]
#View(Cars.df) - View the updates to the data in a separate tab
summary(Cars.df)
## Model Price EngineSize Horsepower
## Length:949 Min. : 2.7 Min. :0.9761 Min. : 54.84
## Class :character 1st Qu.:26604.4 1st Qu.:2.1744 1st Qu.: 154.24
## Mode :character Median :30008.4 Median :2.4835 Median : 178.11
## Mean :30010.7 Mean :2.4971 Mean : 274.74
## 3rd Qu.:33249.0 3rd Qu.:2.8324 3rd Qu.: 207.12
## Max. :46951.8 Max. :4.1453 Max. :90000.00
## NA's :1 NA's :1
## Mileage Year
## Min. :10.56 Min. : 2010
## 1st Qu.:21.61 1st Qu.: 2013
## Median :25.15 Median : 2016
## Mean :24.86 Mean : 2035
## 3rd Qu.:28.18 3rd Qu.: 2019
## Max. :42.23 Max. :20190
## NA's :1
1. The Price Column now has 0 NA’s & a minimum
value of 2.7, which I am assuming is not an accurate value.
2. The EngineSize Column has 1 NA.
3. The Horsepower Column has 1 NA & a maximum value
of 90000.00, which I am assuming is not an accurate value.
4. The Mileage Column has 1 NA.
5. The Year Column has 0 NA’s and a maximum value of
20190, which I assume is a typo for 2019.
unique(Cars.df$Year) # Checking unique values in Year to ensure 20190 is the only outlier
## [1] 2017 2012 2019 2021 2014 2013 2011 2020 2022 2016 2015 2018
## [13] 2010 20190
Cars.df$Year[Cars.df$Year== 20190] = 2019 #Replace the value 20190 with 2019
unique(Cars.df$Year) #Check Unique Values for Year to ensure incorrect value was replaced
## [1] 2017 2012 2019 2021 2014 2013 2011 2020 2022 2016 2015 2018 2010
The columns that are left with NA Values are EngineSize, Horsepower, and Mileage.
Cars.df[is.na(Cars.df$EngineSize), ] # View the rows in the data where the Engine Size value is NA
## Model Price EngineSize Horsepower Mileage Year
## 99 Hatchback 35905.9 NA 122.3728 26.26043 2010
EngineS_Median = median(Cars.df$EngineSize, na.rm = T) #Get the Median of the column without the NA value
Cars.df$EngineSize[is.na(Cars.df$EngineSize)] = EngineS_Median #Assign the Median to the NA Value
summary(Cars.df$EngineSize) #Use summary function to verify Engine Size no longer has any NA values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9761 2.1744 2.4835 2.4971 2.8323 4.1453
Cars.df[is.na(Cars.df$Horsepower), ] # View the rows in the data where the Horsepower value is NA
## Model Price EngineSize Horsepower Mileage Year
## 149 Convertible 33704.07 2.976278 NA 25.55111 2015
HP_Median = median(Cars.df$Horsepower, na.rm = T) #Get the Median of the column without the NA value
Cars.df$Horsepower[is.na(Cars.df$Horsepower)] = HP_Median #Assign the Median to the NA Value
summary(Cars.df$Horsepower) #Use summary function to verify Horsepower no longer has any NA values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.84 154.26 178.11 274.64 207.09 90000.00
Cars.df[is.na(Cars.df$Mileage), ] # View the rows in the data where the Mileage value is NA
## Model Price EngineSize Horsepower Mileage Year
## 199 Sedan 32102.87 2.543622 163.7784 NA 2019
Mileage_Median = median(Cars.df$Mileage, na.rm = T) #Get the Median of the column without the NA value
Cars.df$Mileage[is.na(Cars.df$Mileage)] = Mileage_Median #Assign the Median to the NA Value
summary(Cars.df$Mileage) #Use summary function to verify Mileage no longer has any NA values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.56 21.61 25.15 24.86 28.18 42.23
The remaining Inaccurate Values are:
I used View(Cars.df) to view the data in a separate tab and sorted the Horsepower and Price columns. I did this to determine the last correct maximum value for Horsepower and the last correct minimum value for Price. I used these values to help identify the outliers and replace the incorrect values with the Median of the column.
Cars.df[Cars.df$Horsepower > 500, ] # View the rows in the data where the Horsepower value greater than 500
## Model Price EngineSize Horsepower Mileage Year
## 749 Sedan 37023.42 2.77458 90000 25.56134 2022
Cars.df$Horsepower[Cars.df$Horsepower > 500] = HP_Median #Assign the Median to the incorrect value
summary(Cars.df$Horsepower) #Use summary function to verify Horsepower no longer has an inaccurate Max Value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.84 154.26 178.11 179.99 207.01 316.84
Cars.df[Cars.df$Price < 15000, ] # View rows in the data where the Price value is less than 15,000
## Model Price EngineSize Horsepower Mileage Year
## 499 Hatchback 2.7 2.768428 227.5605 21.62572 2011
Price_Median = median(Cars.df$Price) #Get the Median of the column
Cars.df$Price[Cars.df$Price < 1500] = Price_Median #Assign the Median to the incorrect Value
summary(Cars.df$Price) #Use summary function to verify Price no longer has an inaccurate Min Value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15951 26607 30008 30042 33249 46952
Cars.df$Price = round(Cars.df$Price, 2)
Cars.df$EngineSize = round(Cars.df$EngineSize, 2)
Cars.df$Horsepower = round(Cars.df$Horsepower, 2)
Cars.df$Mileage = round(Cars.df$Mileage, 2)
# View(Cars.df) - View Data after all updates
summary(Cars.df) #Use summary function to verify updates
## Model Price EngineSize Horsepower
## Length:949 Min. :15951 Min. :0.980 Min. : 54.84
## Class :character 1st Qu.:26607 1st Qu.:2.170 1st Qu.:154.26
## Mode :character Median :30008 Median :2.480 Median :178.11
## Mean :30042 Mean :2.497 Mean :179.99
## 3rd Qu.:33249 3rd Qu.:2.830 3rd Qu.:207.01
## Max. :46952 Max. :4.150 Max. :316.84
## Mileage Year
## Min. :10.56 Min. :2010
## 1st Qu.:21.61 1st Qu.:2013
## Median :25.15 Median :2016
## Mean :24.86 Mean :2016
## 3rd Qu.:28.18 3rd Qu.:2019
## Max. :42.23 Max. :2022
Cars_Correlation = round(cor(Cars.df[,-1]),2) #show the correlation value of each numeric variable rounded to two decimal places
Cars_Correlation #Print the correlation table
## Price EngineSize Horsepower Mileage Year
## Price 1.00 0.06 -0.02 0.03 0.00
## EngineSize 0.06 1.00 0.03 0.00 0.01
## Horsepower -0.02 0.03 1.00 0.06 -0.02
## Mileage 0.03 0.00 0.06 1.00 0.02
## Year 0.00 0.01 -0.02 0.02 1.00
plot(Cars.df[,-1]) #Generate a scatter plot showing correlation with all numeric variables
I initially used the scatter plot function to show the correlation
between the numeric variables in the dataset, however this is difficult
to decipher and as a result I created a Heatmap to better display the
correlations.
library(plotly)
Cars_Correlation[upper.tri(Cars_Correlation, diag = TRUE)] = NA #Remove variables direct correlation (the diagonal 1) and duplicate correlations.
ColorScale = list(
c(0, "blue"), # Color for High Negative Correlation
c(0.5, "white"), # Color for little to no Correlation
c(1, "yellow") # Color for High Positive Correlation
)
Cars_Cor_Graph = plot_ly(
x=colnames(Cars_Correlation),
y=rownames(Cars_Correlation),
z = Cars_Correlation,
type = "heatmap",
zmin = -1,
zmax = 1,
colorscale = ColorScale,
colorbar = list(title = "Correlation")
)
Cars_Cor_Graph = Cars_Cor_Graph %>% layout(title = "Correlation of Numeric Variables in the Cars Dataset")
Cars_Cor_Graph
In general the color correlation for each variable is very light in the Heatmap, this shows that each variable has little to no correlation with each other. This verifies what we saw above when the correlation table was initially called Cars_Correlation. The biggest takeaway from this Heatmap is that Horsepower has a negative correlation with both Year and Price, although this is a very low negative correlation (-0.2) it still gives us insight that as Horsepower increase the Price and Year of the cars are likely to decrease slightly and vice versa.
par(mfcol = c(2, 3))
boxplot(Cars.df$Price, main= "Price")
boxplot(Cars.df$EngineSize, main= "Engine Size")
boxplot(Cars.df$Horsepower, main= "Horsepower")
boxplot(Cars.df$Mileage, main= "Mileage")
boxplot(Cars.df$Year, main= "Year")
Price and Horsepower show the most outliers in this data set.
1.Price:
2. Engine Size:
3. Horsepower:
4. Mileage:
5. Year:
summary(Cars.df) # Mean & Median can be derived from the output of the summary function
## Model Price EngineSize Horsepower
## Length:949 Min. :15951 Min. :0.980 Min. : 54.84
## Class :character 1st Qu.:26607 1st Qu.:2.170 1st Qu.:154.26
## Mode :character Median :30008 Median :2.480 Median :178.11
## Mean :30042 Mean :2.497 Mean :179.99
## 3rd Qu.:33249 3rd Qu.:2.830 3rd Qu.:207.01
## Max. :46952 Max. :4.150 Max. :316.84
## Mileage Year
## Min. :10.56 Min. :2010
## 1st Qu.:21.61 1st Qu.:2013
## Median :25.15 Median :2016
## Mean :24.86 Mean :2016
## 3rd Qu.:28.18 3rd Qu.:2019
## Max. :42.23 Max. :2022
1. Price:
2. Engine Size:
3. Horsepower:
4. Mileage:
5. Year:
Overall, for each variable the Mean and Median were extremely
similar to each other. This gives the impression that the distribution
of each variable is normally distributed with few extreme outliers. This
is validated by the Boxplots of the variables above.
The Mode of a specific variable can be determined using the table()
function, which represents the most frequently occurring value. However,
determining the Mode is not ideal for this dataset as each variable
contains numerous unique values.
Note: I
have commented out the table() function for Price, EngineSize,
Horsepower & Mileage as a result of this.
table(Cars.df$Year)
##
## 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 75 68 81 81 69 68 72 79 77 84 60 66 69
table(Cars.df$Model)
##
## Convertible Hatchback Sedan SUV
## 233 242 234 240
# table(Cars.df$Price)
# table(Cars.df$EngineSize)
# table(Cars.df$Horsepower)
# table(Cars.df$Mileage)
#I created data frames for each dispersion measure
Price_Dispersion = data.frame(
Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
Value = c(round(sd(Cars.df$Price), 2),
round(var(Cars.df$Price), 2),
min(Cars.df$Price),
max(Cars.df$Price),
IQR(Cars.df$Price))
)
Mileage_Dispersion = data.frame(
Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
Value = c(round(sd(Cars.df$Mileage), 2),
round(var(Cars.df$Mileage), 2),
min(Cars.df$Mileage),
max(Cars.df$Mileage),
IQR(Cars.df$Mileage))
)
EngineSize_Dispersion = data.frame(
Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
Value = c(round(sd(Cars.df$EngineSize), 2),
round(var(Cars.df$EngineSize), 2),
min(Cars.df$EngineSize),
max(Cars.df$EngineSize),
IQR(Cars.df$EngineSize))
)
Horsepower_Dispersion = data.frame(
Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
Value = c(round(sd(Cars.df$Horsepower), 2),
round(var(Cars.df$Horsepower), 2),
min(Cars.df$Horsepower),
max(Cars.df$Horsepower),
IQR(Cars.df$Horsepower))
)
Year_Dispersion = data.frame(
Metric = c("Standard Deviation", "Variance", "Min", "Max", "IQR"),
Value = c(round(sd(Cars.df$Year), 2),
round(var(Cars.df$Year), 2),
min(Cars.df$Year),
max(Cars.df$Year),
IQR(Cars.df$Year))
)
# I then combined them all into one data frame
Dispersion_table = data.frame(
Metric = Price_Dispersion$Metric,
Price = Price_Dispersion$Value,
Mileage = Mileage_Dispersion$Value,
EngineSize = EngineSize_Dispersion$Value,
Horsepower = Horsepower_Dispersion$Value,
Year = Year_Dispersion$Value
)
# Print the combined table
print(Dispersion_table)
## Metric Price Mileage EngineSize Horsepower Year
## 1 Standard Deviation 5001.70 4.95 0.49 40.36 3.70
## 2 Variance 25017001.44 24.54 0.24 1628.64 13.71
## 3 Min 15951.13 10.56 0.98 54.84 2010.00
## 4 Max 46951.85 42.23 4.15 316.84 2022.00
## 5 IQR 6641.48 6.57 0.66 52.75 6.00
Alteratively you can derive the Range and Interquartile Range using the function summary() :
1. Price:
2. Mileage:
3. Engine Size:
4. Horsepower:
5. Year:
Summary:
library(plotly) #call the plotly library to use the plotly function
plot_ly(
data = Cars.df,
x = ~ Price,
type = 'histogram',
nbinsx = 10, #Set bin size
marker = list(color = '#73c1ac', # Set color of the bars
line = list(
color = 'black', # Set color of the outline
width = 1.5 # Set width of the outline
)
)
)%>%
layout(
title = "Car Prices 2010 - 2022 <br> SUVs, Hatchbacks, Sedans & Convertibles",
xaxis = list(title = 'Price (USD)'),
yaxis = list(title = 'Count of Cars')
)
#Create a Bin for Horsepower:
# I reviewed the Min and Max Values for Horsepower to help determine my bin sizes
min(Cars.df$Horsepower) # 54.8
## [1] 54.84
max(Cars.df$Horsepower) # 317
## [1] 316.84
Cars.df = Cars.df %>%
mutate(Horsepower_Bin = cut (Cars.df$Horsepower,
breaks = c(0,120,240,360),
labels = c("0-120", "121-240", "241-360")
))
# Created a Scatter plot with Plotly to show the Car Price based on Year, Model and Horsepower
plot_ly(
data = Cars.df,
x = ~Year,
y = ~ Price,
symbol = ~Model,
color = ~Horsepower_Bin,
type = 'scatter',
nbinsx = 10,
marker = list(line = list(
color = ~ Model,
width = 1.5
)
)
)%>%
layout(
title = "Car Price based on Year, Model & Horsepower",
xaxis = list(title = 'Year'),
yaxis = list(title = 'Price(USD)')
)
1.Price: Variation is high across all years (2010-2022). Overall, the most common price range seems to be between $20,000 and $35,000.
2.Horsepower: On average cars where the horsepower is between 121-240 (orange) seem to be higher priced, however there is high price variation based on the horsepower groups and is difficult to say that there is a strong correlation between horsepower and price.
3.Models:
Conclusion: Although there is some price variance across the years is is difficult to say that cars from 2010 are cheaper than newer models (2022).
This report covers Reviewing & Assessing Data, Data Cleaning, Identification of Outliers, Descriptive Statistics, and Data Visualization Methods. These topics were covered as part of BANL_6625_03 - Data Mining for Business Intelligence (Week 2 & 3).
In this report I looked into the Cars.csv dataset which had 1,000 observations and 7 variables. The dataset contained several data quality issues, such as missing values, incorrect values, and redundancy variables (kilometers), which were addressed through a series of cleaning steps. I removed rows with empty or inaccurate values, removed the redundant kilometers column, and replaced the remaining NA and inaccurate values with the Median of the associated column to maintain the integrity of the data.
After I cleaned the data I used Boxplots to visualize the data and identify any outliers within the numeric variables. During this analysis I was able to identify that Price and Horsepower seemed to have the most outliers in the data set. The Price range for majority of the cars was between $15,000 to $45,000, however there are some cars in the data set that are below or above this range, with majority of the outliers above $45,000. Horsepower had majority of its outliers below the minimum value of 60hp. From the boxplots I was also able to derive the Median and Interquartile Ranges for each of the numeric variables.
In order to get a better indication of the distribution of the data I calculated the central tendency measures. I knew the Median from the Boxplot but the summary function provided me with the Mean. On reviewing the range between the Median and the Mean I was able to get a better indication on the distribution of each of the variables. As each variable had little to no difference between the Median and the Mean values, this indicated the data was normally distributed across each variable.
In addition to central tendency measures I also calculated the dispersion of the data using Standard Deviation, Variance, Range and the Interquartile Range. This information allowed me to better understand how the data was dispersed. Price and Horsepower had high variance which indicates the data is widely scattered making it less consistent. Whereas EngineSize and Mileage had mush less variation Meaning the data is more clustered together, making it more consistent.
Finally I used data visualization techniques to illustrate the relationship among certain variables. I created my plots using plotly as I am learning it as part of my Data Visulaization class and find it to be a more interactive graph for the end user. I used a Histogram to visually determine the distribution of the Price variable, this validated the results from the central tendency measures by illustrating a normally distributed histogram (bell-shape). I also used a Scatter Plot to show the relationship between the Price, Model, Year, and Horsepower with the dataset. This demonstrated the wide variance in Price and Horsepower identified when calculating the dispersion of the variables. The scatter plot also interesting showed that there was little difference in Price based on the Year the car was built, my initial assumption would have been that an older car would be cheaper than a newer model.
In conclusion this report highlights the steps taken to clean and visualize the Cars.csv dataset in order to identify and handle incorrect values, detect outliers and clearly understand the relationship between the variables.