INTRO
Analysis
# LIBRARIES
library(readxl)
library(tidyverse)
library(dplyr)
library(DT)
library(RColorBrewer)
library(rio)
library(dbplyr)
library(psych)
library(FSA)
library(magrittr)
#PATH TO DATA SOURCE (DATASET)
CarSet <- read_excel("C://Users//User//OneDrive//Documents//ALY_6010//Datasets//M1data_carsales.xlsx")
FIRST TASK
# 1. In this first task I use describe() from the psych package, which was developed at Northwestern University to support psychological research, to summarise the numeric variables.
CarSet%>%
dplyr::select("Efficiency", "Power_bhp", "Seats", "Km","Price")%>%
psych::describe()%>%
t()%>%
round(2)%>%
knitr::kable()%>%
kableExtra::kable_classic_2()
| | Efficiency | Power_bhp | Seats | Km | Price |
|---|---|---|---|---|---|
| vars | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 |
| n | 4949.00 | 4949.00 | 4949.00 | 4949.00 | 4949.00 |
| mean | 18.47 | 123.73 | 5.16 | 55809.14 | 8383.22 |
| sd | 4.17 | 41.46 | 0.56 | 28764.20 | 5158.19 |
| median | 18.00 | 100.00 | 5.00 | 54000.00 | 7146.00 |
| trimmed | 18.32 | 117.48 | 5.03 | 54060.90 | 7668.28 |
| mad | 2.97 | 0.00 | 0.00 | 28169.40 | 4121.63 |
| min | 8.00 | 50.00 | 4.00 | 171.00 | 618.00 |
| max | 30.00 | 600.00 | 10.00 | 149000.00 | 25270.00 |
| range | 22.00 | 550.00 | 6.00 | 148829.00 | 24652.00 |
| skew | 0.41 | 1.96 | 3.85 | 0.58 | 1.23 |
| kurtosis | -0.17 | 8.61 | 18.71 | 0.23 | 1.16 |
| se | 0.06 | 0.59 | 0.01 | 408.88 | 73.32 |
On its own, psych::describe() returns a wide table that is hard to
read. Transposing it with t() and rounding with round() makes the table
more readable, although rounding to two decimals discards some
precision in Price, which can matter for financial figures. Styling the
table with kableExtra makes it more presentable.
SECOND TASK
# In this second task I will use different variables to generate a bar plot and a pie chart.
# Creating 1 row with 2 columns of plots using the mfcol parameter
par(mfcol = c(1,2))
par(mar = c(7,4,2,2))
# Creating a bar plot using Location as the variable
CarGraph = barplot(table(CarSet$Location),
  main = "Categories-Location-Frequency",
  ylab = "",
  xlab = "",
  ylim = c(0, 700),
  las = 2,
  col = brewer.pal(5, "Dark2")
)
# Using the title function to add styled axis labels
title(ylab = "Frequencies", line = 2.5, cex.lab = 1.0, col.lab = "#AE7804")
title(xlab = "Categories", line = 5.8, cex.lab = 1.0, col.lab = "#AE1604")
#Using Box function to set a boundary around and inside the figure
box(which = "figure", lty = "dashed", col = "darkred")
box(which = "plot", lty = "dotdash", col = "brown2")
#Pie Chart
pie (table (CarSet$FuelType),
labels = table(CarSet$FuelType),
radius = 0.8,
main = "Pie Chart of Variable Fuel Type",
col = terrain.colors(4),
cex = 0.7
)
legend ("bottomright",
legend=paste(unique(sort(CarSet$FuelType)), "Fuel Type"),
fill = terrain.colors(4),
cex = 0.6)
box(which = "figure", lty = "solid", col = "darkred")
The bar plot shows Indian cities on the x-axis and the number of
pre-owned cars sold on the y-axis. Most cars are sold in Mumbai (677),
followed by Hyderabad (580), Kochi (531) and Pune (529). The plot
suggests the company may increase sales by supplying more cars in
these cities.
The pie chart shows the fuel types of the cars, as listed in the
legend. Petrol (gasoline) is the most common fuel type, followed by
diesel. One possible reason is that petrol cars tend to be cheaper than
diesel cars, although this dataset alone does not confirm it.
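The pie chart above labels each slice with a raw count only. A minimal sketch of how percentage labels could be built, using a made-up vector (`fuel` and its category names are hypothetical, not taken from the dataset):

```r
# Hypothetical fuel types; the real values would come from CarSet$FuelType
fuel <- c("Petrol", "Diesel", "Petrol", "CNG", "Diesel", "Petrol", "LPG", "Petrol")

counts <- table(fuel)                         # categories sorted alphabetically
shares <- round(prop.table(counts) * 100, 1)  # percentage per category

# One label per slice, in the same order as the slices and the legend
slice_labels <- paste0(names(counts), ": ", counts, " (", shares, "%)")
slice_labels
```

The resulting vector could be passed to the `labels` argument of pie() in place of the plain counts.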
THIRD TASK
# Creating a frequency table of Location along with some calculations
CarTable = table(CarSet$Location)
CarTable2 = as.data.frame(CarTable)
# Renaming variables
names(CarTable2)[names(CarTable2) == "Var1"] = "Location"
names(CarTable2)[names(CarTable2) == "Freq"] = "Frequency"
# Creating 3 new columns using the mutate function
CarTable2 = mutate(CarTable2,
  CumFreq = cumsum(Frequency),
  Percentage = round((Frequency / sum(Frequency)) * 100, 2),
  CumPercentage = cumsum(Percentage))
# Using the kable function to generate a table and kableExtra to style it
knitr::kable(CarTable2) %>%
  kableExtra::kable_styling(bootstrap_options = c("bordered", "hover"))
| Location | Frequency | CumFreq | Percentage | CumPercentage |
|---|---|---|---|---|
| Ahmedabad | 196 | 196 | 3.96 | 3.96 |
| Bangalore | 248 | 444 | 5.01 | 8.97 |
| Chennai | 396 | 840 | 8.00 | 16.97 |
| Coimbatore | 468 | 1308 | 9.46 | 26.43 |
| Delhi | 464 | 1772 | 9.38 | 35.81 |
| Hyderabad | 580 | 2352 | 11.72 | 47.53 |
| Jaipur | 366 | 2718 | 7.40 | 54.93 |
| Kochi | 531 | 3249 | 10.73 | 65.66 |
| Kolkata | 494 | 3743 | 9.98 | 75.64 |
| Mumbai | 677 | 4420 | 13.68 | 89.32 |
| Pune | 529 | 4949 | 10.69 | 100.01 |
The highest sales frequency is in Mumbai (677 cars, 13.68%), followed
by Hyderabad (580) and Kochi (531). The cumulative frequency and
cumulative percentage columns increase in every row by construction, so
they cannot be used to judge skewness; Location is also a nominal
variable, for which skewness is not defined. The final cumulative
percentage is 100.01 rather than 100 because the column sums
percentages that were already rounded.
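The last cumulative percentage in the table is 100.01 because already rounded percentages are accumulated. A minimal sketch of one alternative, assuming the frequencies from the table above, accumulates the counts first and rounds only once at the end (the names `freq`, `naive` and `exact` are illustrative and not part of the original script):

```r
# Frequencies copied from the Location table above (alphabetical city order)
freq <- c(Ahmedabad = 196, Bangalore = 248, Chennai = 396, Coimbatore = 468,
          Delhi = 464, Hyderabad = 580, Jaipur = 366, Kochi = 531,
          Kolkata = 494, Mumbai = 677, Pune = 529)

# Approach used in the table: round each percentage, then accumulate
naive <- cumsum(round(freq / sum(freq) * 100, 2))

# Alternative: accumulate the counts first, round once at the end
exact <- round(cumsum(freq) / sum(freq) * 100, 2)

tail(naive, 1)  # rounding errors add up, so this ends slightly above 100
tail(exact, 1)  # ends at exactly 100

# Cities ranked by frequency, highest first
sort(freq, decreasing = TRUE)[1:3]
```

With this approach some intermediate cumulative percentages can differ from the table in the second decimal, since they are no longer affected by the accumulated rounding error.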
FOURTH TASK
# Using the Km variable to create a density plot
# Making an object to store kilometers (in thousands)
CarKm1 = CarSet$Km / 1000
# Calculating mean and standard deviation
CarMean = mean(CarKm1)
CarSD = sd(CarKm1)
# Converting z-scores of 2.4 and -3.1 into kilometer values
ZScore24 = (2.4 * CarSD) + CarMean
ZScore31 = (-3.1 * CarSD) + CarMean
# Developing a density plot using the density function;
# xlim is widened so that the line at -3.1 standard deviations is visible
DensePlot = density(CarKm1, adjust = 1)
plot(DensePlot, xlim = range(ZScore31, ZScore24, CarKm1))
# Plotting lines on the graph using the abline function
abline(v = CarMean, col = "green", lwd = 3)
abline(v = ZScore24, col = "yellow", lwd = 3)
abline(v = ZScore31, col = "red", lwd = 3)
# Adding text to the density plot to make the values easier to read
text(x = ZScore24,
  y = 0.008,
  round(ZScore24, 2)
)
text(x = ZScore31,
  y = 0.010,
  round(ZScore31, 2)
)
text(x = CarMean,
  y = 0.008,
  round(CarMean, 2)
)
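This density plot marks the mean of Km (55.81 thousand km, green
line), the value 2.4 standard deviations above the mean (124.84
thousand km, yellow line) and the value 3.1 standard deviations below
the mean (-33.36 thousand km, red line). Each value is computed as
mean + z * SD. A negative mileage is not possible, so the lower value
lies outside the observed data.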
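The conversion from a z-score back to the original scale is x = mean + z * SD. A small sketch of that arithmetic, assuming the rounded mean and standard deviation of Km from the describe() table; the helper name `z_to_value` is hypothetical:

```r
# Mean and SD of Km from the describe() table, converted to thousands of km
km_mean <- 55809.14 / 1000
km_sd   <- 28764.20 / 1000

# Hypothetical helper: value located z standard deviations from the mean
z_to_value <- function(z, mean, sd) mean + z * sd

upper <- z_to_value(2.4, km_mean, km_sd)   # about 124.84
lower <- z_to_value(-3.1, km_mean, km_sd)  # about -33.36

# Inverse direction: z-score of a value on the original scale
z_of_upper <- (upper - km_mean) / km_sd

round(c(lower = lower, mean = km_mean, upper = upper), 2)
```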
FIFTH TASK
# In this task I will create a horizontal box plot using kilometers as my variable.
# Using par function to stack two graphs in one column (2 rows, 1 column)
par(mfrow=c(2,1))
par(mai=c(0.6,1,0.2,0.4))
# Creating Box plot
CarMedian = median(CarKm1)
boxplot(CarKm1,
  horizontal = TRUE,
  main = "",
  col = brewer.pal(3, "Set1")
)
#Creating a box around figure to differentiate between two
box(which = "figure", col = "red")
#Creating Points to locate mean value
points(
x = CarMean,
y = c(1),
col = "green",
pch = 17
)
#creating histogram
hist(CarKm1,
main = "",
xlab = "",
col = brewer.pal(8, "Dark2"),
breaks = 25)
box(which = "figure", col = "green")
title(xlab = "Kilometers", line = 2, cex.lab = 0.8, col.lab = "#751F14")
In the box plot the mean (55.81 thousand km, green triangle) lies just
above the median line (54 thousand km). The box covers the middle 50%
of the cars, and each quartile holds a quarter of the observations.
Outliers appear only above the upper whisker; there are none below the
lower whisker.
The histogram has no empty bins across its range, so mileage is covered
continuously. Missing values, however, cannot be read off a histogram
and have to be checked separately. The tallest bars lie around 50,000
km. Cars in the range of 100,000 to 150,000 km are few, possibly
because high-mileage cars are harder to sell.
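A histogram drops missing values silently, so an explicit count is a more reliable check than looking for gaps. A minimal sketch on a made-up mileage vector (`km_example` is hypothetical and not taken from the dataset):

```r
# Hypothetical mileage values in thousands of km, with two missing entries
km_example <- c(12.5, 48.0, NA, 54.0, 61.2, NA, 130.0)

# Count and share of missing values
n_missing <- sum(is.na(km_example))
share_missing <- mean(is.na(km_example))

# hist() and boxplot() ignore NA, so the plots look complete either way
km_complete <- km_example[!is.na(km_example)]

n_missing            # 2
length(km_complete)  # 5
```

The same check could be applied to the real column with `sum(is.na(CarSet$Km))`.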
SIXTH TASK
# In this task I will create a horizontal box plot using Price as my variable.
CarPrice = c(CarSet$Price)
PriceMean = mean(CarPrice)
CarMedian = median(CarPrice)
# Using par function to stack two graphs in one column (2 rows, 1 column)
par(mfrow=c(2,1))
par(mai=c(0.6,1,0.2,0.4))
# Creating Box plot
boxplot(CarPrice,
  horizontal = TRUE,
  main = "Box Plot of Cars Price",
  col = brewer.pal(3, "Set1")
)
#Creating a box around figure to differentiate between two
box(which = "figure", col = "darkred")
#Creating Points to locate mean value
points(
x = PriceMean,
y = c(1),
col = "green",
pch = 17
)
#creating histogram
hist(CarPrice,
main = "Histogram of Cars Price",
xlab = "",
col = brewer.pal(8, "Set3"),
breaks = 25)
box(which = "figure", col = "purple")
title(xlab = "Price", line = 2, cex.lab = 0.8, col.lab = "#751F14")
The box plot shows the quartiles, the median and any outliers. The
median price is 7146 and the mean price is 8383.22, so the two do not
lie close to each other. A mean well above the median indicates a
right-skewed distribution, which agrees with the skew of 1.23 in the
describe() table. The standard deviation (5158.19) is large relative
to the mean, so prices are widely spread.
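A mean above the median is a quick sign of right skew. The sketch below illustrates this on a small made-up price vector. The moment-based formula is one common definition of sample skewness and is an assumption here, so its result may differ slightly from the estimator that psych::describe() uses:

```r
# Hypothetical prices with a long right tail
price_example <- c(3000, 4500, 5000, 5500, 6000, 7000, 9000, 25000)

# Moment-based skewness: third central moment / (second central moment)^1.5
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

mean(price_example)         # 8125, pulled upward by the single expensive car
median(price_example)       # 5750, unaffected by the size of the largest value
moment_skew(price_example)  # positive for a right-skewed sample
```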
SEVENTH TASK
# Computing the mean price for each owner level using tapply
CarDist = tapply(CarSet$Price, CarSet$Owner, mean)
# Creating boxplot
boxplot(
  CarSet$Price ~ CarSet$Owner,
  xlab = "",
  ylab = "",
  col = brewer.pal(4, "Dark2"),
  main = "Distribution of Price by Owner Level"
)
# Marking the group means as points
points(
  y = CarDist,
  x = seq_along(CarDist),
  col = "#CED70F",
  pch = 17
)
title(xlab = "Owner Level", line = 2.5, cex.lab = 1, col.lab = "#751F14")
title(ylab = "Price", line = 2.5, cex.lab = 1, col.lab = "#751F14")
This figure contains four box plots, one per owner level, with the
owner level on the x-axis and price on the y-axis. It shows that price
tends to fall as the number of previous owners rises. For first owners
the mean price (triangle) lies close to the median line, whereas for
third and fourth owners the mean lies further from the median, which
indicates more skewed prices in those groups.
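tapply() splits one vector by the groups of another and applies a function to each group, returning the results in factor-level order, which is also the order in which boxplot() draws the groups. A minimal sketch with made-up values (`price` and `owner` are hypothetical):

```r
# Hypothetical prices and owner levels
price <- c(9000, 8500, 7000, 6500, 4000, 4200, 3000)
owner <- c("First", "First", "Second", "Second", "Third", "Third", "Fourth")

# Mean price per owner level; groups are ordered by factor level (alphabetical here)
group_means <- tapply(price, owner, mean)
group_means

# x positions 1..k match the boxes drawn by boxplot(price ~ owner)
x_pos <- seq_along(group_means)
x_pos
```

If the owner level is stored as text, the groups appear alphabetically (First, Fourth, Second, Third). Converting it with `factor(owner, levels = c("First", "Second", "Third", "Fourth"))` would give the natural order; whether this applies to CarSet$Owner depends on how that column is stored.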
EIGHTH TASK
# Using par function to set margins
par(mai=c(1.4,1,0.4,1))
# Computing the mean kilometers for each location using tapply
CarDist2 = tapply(CarSet$Km, CarSet$Location, mean)
# Creating boxplot of kilometers by location
boxplot(
  CarSet$Km ~ CarSet$Location,
  xlab = "",
  ylab = "",
  col = brewer.pal(4, "Dark2"),
  las = 2,
  main = "Distribution of Kilometers by Location"
)
# Marking the group means as points
points(
  y = CarDist2,
  x = seq_along(CarDist2),
  col = "#CED70F",
  pch = 17
)
title(xlab = "Locations", line = 5.3, cex.lab = 1, col.lab = "#751F14")
title(ylab = "Kilometers", line = 3.8, cex.lab = 1, col.lab = "#751F14")
The plot above relates two variables: location on the x-axis and
kilometers driven on the y-axis. In every city the mean (triangle)
lies close to, and in some cases on, the median line. Cars with the
highest mileage are found mostly in Chennai and Pune, which may
indicate that owners there keep their cars longer before selling.
Every city except Pune shows outliers above the upper whisker, and no
city shows outliers below the lower whisker. Means close to the
medians suggest roughly symmetric distributions within each city, but
this alone does not establish that mileage is normally distributed.
NINTH TASK
# Using boxplot.stats() to get the values from which a box plot is drawn
boxplot.stats(CarSet$Km)
## $stats
## [1] 171 34994 54000 72618 129000
##
## $n
## [1] 4949
##
## $conf
## [1] 53154.99 54845.01
##
## $out
## [1] 137000 140000 137008 130000 145000 147898 135670 130000 145277 143143
## [11] 140000 129986 135000 138000 140000 138000 135000 130000 143000 148000
## [21] 140000 136642 143275 137148 135000 146000 131000 144400 143354 130790
## [31] 140000 132000 129750 130000 146000 130000 146824 144000 130002 148000
## [41] 130000 130000 130000 131000 130000 140000 140000 130000 135000 145000
## [51] 141537 134000 138000 130923 131000 132000 142000 149000 135000 130000
## [61] 130000 143017 132000 136000 136997 137000 133000 133000 133944 137800
## [71] 138000 148000 135000 132000 140000 146300 148009 144113 136000 135000
## [81] 138205 131765 131000 145000 147350 142000 147848 147000 140000 136490
## [91] 144471 135000 144000 130000
boxplot.stats() returns the statistics needed to draw a box plot.
n is the number of non-missing observations, that is, the sample size
(4949).
stats holds five values: the lower whisker (171), the lower hinge
(34994, approximately the 25th percentile), the median (54000), the
upper hinge (72618, approximately the 75th percentile) and the upper
whisker (129000). The whiskers end at the most extreme observations
that lie within 1.5 times the interquartile range of the hinges, so the
upper whisker is not the maximum of the data; the maximum is 149000.
conf gives the limits of the notch around the median. out lists the
outliers, the observations that lie beyond the whiskers. Here there are
94 of them, all above the upper whisker and none below the lower
whisker.
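The whisker limits and the notch interval can be reproduced by hand from the hinges. A minimal sketch, assuming the values printed in the output above and the default settings of boxplot.stats() (`coef = 1.5`); the variable names are illustrative:

```r
# Hinges, median and n taken from the boxplot.stats() output above
lower_hinge <- 34994
med         <- 54000
upper_hinge <- 72618
n           <- 4949

# Spread between the hinges (approximately the interquartile range)
h_spread <- upper_hinge - lower_hinge

# Whiskers may extend at most 1.5 * spread beyond the hinges
upper_fence <- upper_hinge + 1.5 * h_spread
lower_fence <- lower_hinge - 1.5 * h_spread

# Notch interval around the median, as reported in $conf
conf <- med + c(-1.58, 1.58) * h_spread / sqrt(n)

upper_fence     # 129054: the upper whisker (129000) is the largest value below it
lower_fence     # negative, so the lower whisker is simply the minimum (171)
round(conf, 2)
```

Every value listed in `$out` exceeds the upper fence, which is why those observations are drawn as individual points instead of being covered by the whisker.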
TENTH TASK
#Using same variable as above and creating stats report for variable Km
CarStats = c(boxplot.stats(CarSet$Km)$stats)
#Using dotchart to visualize the data
dotchart(CarStats,
main = "Dot Chart of Kilometers",
col = brewer.pal(5,"Dark2"))
title(xlab = "Kilometers", line = 2.5, cex.lab = 1, col.lab = "#751F14")
The figure above is a dot chart of the five box plot statistics of Km
returned by boxplot.stats(): the lower whisker, lower hinge, median,
upper hinge and upper whisker. Each dot is one statistic, not one car,
so the chart does not show how many cars fall into a given range. The
median of 54000 means that half of the cars in this dataset have run
less than 54,000 km, and the upper hinge of 72618 means that about a
quarter have run more than roughly 72,600 km.
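The dot chart is easier to read when each dot is labelled with the statistic it represents. A minimal sketch, assuming the five values from the boxplot.stats() output above; the label texts and the name `km_stats` are illustrative:

```r
# Five box plot statistics for Km, copied from the boxplot.stats() output above
km_stats <- c(171, 34994, 54000, 72618, 129000)

# Names make each dot identifiable; dotchart() uses names(x) as labels by default
names(km_stats) <- c("Lower whisker", "Lower hinge", "Median",
                     "Upper hinge", "Upper whisker")

# Draw to a null PDF device so the sketch also runs in a non-interactive session
pdf(NULL)
dotchart(km_stats, main = "Boxplot statistics of Km", xlab = "Kilometers")
dev.off()
```

In an R Markdown chunk the `pdf(NULL)` and `dev.off()` lines can be dropped so that the chart is rendered into the report.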
CONCLUSION
BIBLIOGRAPHY
APPENDIX
An R Markdown file has been attached to this report. The name of the
file is Project1_Tiwari.rmd