INTRODUCTION
A. Overview of Global Car Sales
Market:
The global automotive industry is undergoing and
foreseeing a revolutionary transformation ever before. This is due to
some driving trends, such as technological advancement, sustainability,
electrification, autonomous driving, connectivity, and diverse mobility
(Paul Gao et al., 2016). These driving forces can leverage one another
with their influence and eventually significantly impact the automation
industry’s growth. According to the McKinsey & Company report by
Paul Gao et al., 2016, overall global car sales continue to grow;
however, the global annual growth rate drop from 3.6 percent to approx.
2 percent by 2030.
B. Importance of discrete and
continuous probability distributions
A probability
distribution is to evaluate the probability of a random variable. It can
be discrete or continuous. The discrete probability distribution
computes countable distinct possible values, whereas the continuous
probability distribution computes infinite possible values in a
specified range. In discrete, probabilities can be designated to
specific values; however, in continuous, probability designated to a
particular value is null since its values are infinite (Bluman,
2018).
With respect to applications, the discrete probability
distribution is straightforward, easy to compute, and best suitable for
countable data. A continuous probability distribution can analyze and
predict complex data such as measurements, mathematic and scientific
research, and performance evaluations.
C. About dataset used in this report:
Dataset used to build this report is “M1data_carsales”.
The dataset contains the history of carsales in India between 1998 and
2019. It contains 12 variables and details of 4,949 cars from various
automotive companies.
TASK 1
Descriptive statistics summary of dataset “M1data_carsales” for
variables Efficiency, Power_bhp, Seats, Km, Price)
#using dplyr to select specific variables from dataset, using psych with describe() to create summary statistics and descriptions of dataframes with both numerical and categorical variables
M1data_carsales %>%
dplyr::select(Efficiency, Power_bhp, Seats, Km, Price) %>%
psych::describe()%>%
t()%>%
knitr::kable(align = "c", digits = 2, caption = "Descriptive Statistics Summary", format = "html")%>%
kable_styling(bootstrap_options = "basic", stripe_color = "black", table.envir = "table", full_width = TRUE, protect_latex = FALSE, font_size = 10)
| Efficiency | Power_bhp | Seats | Km | Price | |
|---|---|---|---|---|---|
| vars | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 |
| n | 4949.00 | 4949.00 | 4949.00 | 4949.00 | 4949.00 |
| mean | 18.47 | 123.73 | 5.16 | 55809.14 | 8383.22 |
| sd | 4.17 | 41.46 | 0.56 | 28764.20 | 5158.19 |
| median | 18.00 | 100.00 | 5.00 | 54000.00 | 7146.00 |
| trimmed | 18.32 | 117.48 | 5.03 | 54060.90 | 7668.28 |
| mad | 2.97 | 0.00 | 0.00 | 28169.40 | 4121.63 |
| min | 8.00 | 50.00 | 4.00 | 171.00 | 618.00 |
| max | 30.00 | 600.00 | 10.00 | 149000.00 | 25270.00 |
| range | 22.00 | 550.00 | 6.00 | 148829.00 | 24652.00 |
| skew | 0.41 | 1.96 | 3.85 | 0.58 | 1.23 |
| kurtosis | -0.17 | 8.61 | 18.71 | 0.23 | 1.16 |
| se | 0.06 | 0.59 | 0.01 | 408.88 | 73.32 |
Observation:
The above summary table outlines the descriptive statistical
measures of the variables efficiency, seats, Km, Power_bhp, and price.
It describes the measure of central tendency and dispersion, skewness,
kurtosis, etc. The output is so comprehensive interms of understanding
the density distribution shape, central tendency, spread, and
distribution of each variable.
TASK 2
Presenting barplot and piechart of each category of variables location
and fueltyope, respectively.
#creating an object to pick variable from dataset using table()
Location_freq <- table(M1data_carsales$Location)
#creating a dataframe w.r.t variable location and its frequency
Location_freq_dataframe <- data.frame(Location = names(Location_freq), Frequency = as.numeric(Location_freq))
#using par() to organize chart presentation with 1x2 matrix
par(mfcol=c(1,2))
#creating barplot w.r.t location and frequency
barplot(Location_freq_dataframe$Frequency, names.arg = Location_freq_dataframe$Location, col=brewer.pal(n=3,name = "Set2"),las = 1, horiz = TRUE, xlab = "Frequency", xlim = c(0,max(Location_freq_dataframe$Frequency)*1.2), cex.main = 0.6, cex = 0.7, cex.lab = 0.8, cex.axis = 0.65)
#assigning an object by picking a variable from dataset using table()
fueltype_count <- table(M1data_carsales$FuelType)
fueltype <- M1data_carsales$FuelType
#determining percentage of car sales on each category of variable
Fueltype_percentage <- round(((fueltype_count / sum(fueltype_count))*100), 1)
#creating dataframe of variable fuel type and its percentage
Fueltype_percentage_dataframe <- data.frame(Fueltype = names(Fueltype_percentage), Percentage = as.numeric(Fueltype_percentage))
#using par() to organize graph presentation
par(mai=c(0.4,0.1,0.4,0.1))
#creating pie chart
pie(Fueltype_percentage, labels = paste0(Fueltype_percentage, "%"), edges = 200, angle = 180, las =1, radius = 0.8, cex.axis = 1,cex = 0.5, col = brewer.pal(8,"Set3"), cex.main = 0.6, cex.axis = 0.65)
legend("bottomright", c("CNG","Diesel","Gasoline","LPG"), cex= 0.5, fill=brewer.pal(9,"Set3"))
Observation:
Barplot: The barplot presents the frequencies of each category of
variable “Location” in the dataset carsales in India. As listed in the
barplot, 11 Indian cities carsales summary captured in the dataset.
Based on the illustration, it is understandable that the city ‘Mumbai’
top the list with highest frequency of sales compared to other cities.
Consecutively, the city ‘Hyderabad’ stands at the second spot, and
followed by Pune and Kolkata as both sharing the third highest carsales
market in the country. Interestingly, the city ‘Ahmedabad’ is at the
lowest spot with very minimal sales compared to other cities in India.
Piechart: The piechart presented in the above chart displays
the percentages of each category of variable ‘fuel type’. The fuel types
captured in the dataset are, CNG, Diesel, Gasoline, and LPG. As
illustrated in the piechart, the total CNG cars contributed to the
indian market is 1.1%, LPG at 0.2%, Dieset at 47.2%, and Diesel at
51.5%. The highest moving category is diesel cars and followed by
Gasoline. The CNG and LPG fuel type cars are not having major market
size in India, based on the given dataset.
TASK 3
For
variable ‘Owner’, presenting frequencies, cumulative frequencies,
percentages, and cumulative percentages.
#creating table for variable 'Owner' and its freq
Owner_freq <- table(M1data_carsales$Owner)
#tansposing table using dataframe
Owner_freq_dataframe <- data.frame(Owner = names(Owner_freq), Frequency = as.numeric(Owner_freq))
#using mutate() to add columns CumFreq, Percentage, CumPercentage
Owner_freq_dataframe <- Owner_freq_dataframe %>%
mutate(
CumFrequency = cumsum(Frequency),
Percentage = round((Owner_freq / sum(Owner_freq)*100), digits = 2),
CumPercentage = cumsum(Percentage)
)
#using knitr::kable() to present the table
kable(Owner_freq_dataframe, align = "c", caption = "Carsales w.r.t Owner and their frequencies") %>%
kable_styling(full_width = NULL, bootstrap_options = "basic", table.envir = "table", protect_latex = TRUE, position = "center")
| Owner | Frequency | CumFrequency | Percentage | CumPercentage |
|---|---|---|---|---|
| First | 2921 | 2921 | 59.02 | 59.02 |
| Fourth | 782 | 3703 | 15.80 | 74.82 |
| Second | 916 | 4619 | 18.51 | 93.33 |
| Third | 330 | 4949 | 6.67 | 100.00 |
Observation:
Based on my observation from the above summary, I can understand
that the majority of owners in the Indian car sales market is ‘First’
hand owners contributing to 59.02%. Consecutively, second and fourth
hand owners with the contribution of 18.51% and 15.80%, respectively.
Interestingly, ‘Third’ hand owners contributes to 6.67% at the lowest.
By and large, the above summary indicates the lifecycle of carsales
market and ownership dynamics.
TASK 4
Density distribution of variable ‘Kilometers’ interms of mean
and standard deviation.
#creating objects to store mean and standard deviation
Km <- M1data_carsales$Km
km_mean <- mean(M1data_carsales$Km)
km_sd <- sd(M1data_carsales$Km)
#creating object to store the value 2.4 SD above the mean
sd2.4A = km_mean + (2.4*km_sd)
#creating object to store the value 3.1 SD below the mean
sd3.1B = km_mean - (3.1*km_sd)
#using par() to organize the chart
par(mar=c(5,5,5,5)+0.1)
#using density() and plot() to present a density curve for Km
Km %>%
density(adjust = 2) %>%
plot(main = "", xlab = "Kilometers", xlim = c(-50000,max(Km)*1.1), ylab = "", las =1, ylim = c(0,0.0000195), cex.axis = 0.6)
#using abline() to add vertical for the mean
abline(v=c(sd2.4A, km_mean, sd3.1B), col=c("red", "blue", "red"))
text(x=km_mean-2, y=0.0000155, round(km_mean,2), cex=0.8, srt=90, adj = c(1,0))
text(x=sd2.4A-2, y=0.0000155, round(sd2.4A,2), cex=0.8, srt=90, adj = c(1,0))
text(x=sd3.1B-2, y=0.0000155, round(sd3.1B,2), cex=0.8, srt=90, adj = c(1,0))
Observation:
The density distribution of variable ‘Kilometers’ is a perfect
resemblance of the descriptive summary we have observed in the task 1.
As mean stands at 55809.14, 2.4 SD above the mean stands at 124843.21
and 3.1 SD below the mean stands at -33359.86. The skewness is almost at
the center and distribution is wide above the mean.
TASK 5
Horizontal boxplot and Histogram to display the data distribution of
continuous variable ‘Kilometers’
#using par() to organize graph presentation
par(mfrow=c(2,1), mar=c(2,3,2,3))
#presenting histogram to display the data distribution of variable 'Km'
hist(Km, main = "Histogram of variable 'Km' in carsales", col = brewer.pal(8,"Set1"), las = 1, xlim = c(0,max(Km)*1.2), ylim = c(0,900), cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6)
#presenting box plot to display the data distribution of variable 'Km'
boxplot(Km, main = "Boxplot of variable 'Km' in carsales", col = brewer.pal(7,"Set2"),horizontal = T, cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6, ylim = c(0,max(Km)*1.05),las = 1)
Observation:
The histogram illustrates the distribution of continuous variable ‘Km’
by clearing highlighting the major contributing range. The boxplot
represents the quartiles of variable ‘Km’ distribution. The lowest and
highest distrubution range is 0 and 125k; 1st and 3rd quartile is much
closer to the median. There are outliers above 125k and last towards
150k.
TASK 6
Presenting histogram and boxplot of variable ‘Price’ carsales
#using par() to organize graph presentation
par(mfrow=c(2,1), mar=c(2,3,2,3))
#presenting histogram to display the data distribution of variable 'Price'
hist(M1data_carsales$Price, main = "Histogram of variable 'Price' in carsales", col = brewer.pal(8,"Set2"), las = 1, xlim = c(0,max(M1data_carsales$Price)*1.14), ylim = c(0,1200), cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6)
#presenting boxplot to display the data distribution of variable 'Price'
boxplot(M1data_carsales$Price, main = "Boxplot of variable 'Price' in carsales", col = brewer.pal(7,"Set1"),horizontal = T, cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6, ylim = c(0,max(M1data_carsales$Price)*1.05),las = 1)
Observation:
The histogram of variable ‘Price’ from the dataset illustrates
that highest price value ranges merely more or less 5000, whereas lowest
price value ranges merely more or less 25,000. Most importantly,
majority of car price is between the range of 2000 and 12000. On the
other hand, while observing boxplot, we can notice that the 0th quartile
or lowest range is merely 1000 and highest range or 100th quartile is
merely lesser than 20k. There are heavy density of population between
4000 and 12000. And the outliers stands above highest range, between
19-20k and 25-26k.
TASK 7
Presenting boxplot to display price distribution per owner.
#using par() to organize graph
par(mai=c(0.8,0.8,0.8,0.8))
#presenting boxplot to illustrate price distribution per owner
boxplot(M1data_carsales$Price ~ M1data_carsales$Owner, main = "Boxplot - Price distribution based on Owner", col = brewer.pal(7,"Set1"), horizontal = T, cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6, ylim = c(0,max(M1data_carsales$Price)*1.05),las = 1, ylab = "Owner", xlab = "Price")
Observation:
The presented boxplot illustrates the distribution of variable
‘Price’ as per each category of variable ‘Owner’. Based on the
observation, ‘Third’ owner holds maximum range between 1st and 3rd
quartile, compared to any other categories. The ‘Fourth’ Owner has the
maximum outliers, whereas no other categories have any outliers at all.
It describes that ‘Fourth’ owner category is so uneven and it
drastically varies and hard to predict.
TASK 8
Presenting boxplot for the distribution of variable ‘Km’ with
respect to each location
#using par() to adjust chart dimensions
par(mai=c(0.8,0.8,0.8,0.8))
#presenting boxplot to illustrate 'Km' distribution in each location
boxplot(M1data_carsales$Km ~ M1data_carsales$Location, main = "Boxplot - 'Km' distribution in each location in India", col = brewer.pal(7,"Set2"),horizontal = T, cex = 0.6, cex.main = 0.6, cex.lab = 0.6, cex.axis = 0.6, ylim = c(0,max(M1data_carsales$Km)*1.05),las = 1, ylab = "", xlab = "Km")
Observation:
Based on the observation, we can understand that almost all the cities,
except Pune and Chennai, have outliers beyond the maximum range. Density
distribution of variable ‘Km’ is more or less similar almost all the
cities, except Pune and Chennai since these two cities have covered
maximum range. Most of the cities’ range of 1st and 3rd quartile, stands
at the range between 25k and 80k.
TASK 9
Presenting outcomes of code boxplot.stats() for variable
‘Km’
#presenting boxplot statistics for variable Kilometers
boxplot.stats(Km)
## $stats
## [1] 171 34994 54000 72618 129000
##
## $n
## [1] 4949
##
## $conf
## [1] 53154.99 54845.01
##
## $out
## [1] 137000 140000 137008 130000 145000 147898 135670 130000 145277 143143
## [11] 140000 129986 135000 138000 140000 138000 135000 130000 143000 148000
## [21] 140000 136642 143275 137148 135000 146000 131000 144400 143354 130790
## [31] 140000 132000 129750 130000 146000 130000 146824 144000 130002 148000
## [41] 130000 130000 130000 131000 130000 140000 140000 130000 135000 145000
## [51] 141537 134000 138000 130923 131000 132000 142000 149000 135000 130000
## [61] 130000 143017 132000 136000 136997 137000 133000 133000 133944 137800
## [71] 138000 148000 135000 132000 140000 146300 148009 144113 136000 135000
## [81] 138205 131765 131000 145000 147350 142000 147848 147000 140000 136490
## [91] 144471 135000 144000 130000
Observation:
The presented statistics displays statistical summary of
variable ‘Km’. The Oth quartile is 171, 1st quartile is 34994, Median is
54000, 3rd quartile is 72618, Maximum quartile value is 129000. The
total population or number of observations is 4949. The confidence
intervels is 53154.99 and 54845.01. The outliers of variable ‘Km’
presented in a series.
TASK 10
Presenting dotchart to display the quartile values for variable
‘Km’
#applying par() to adjust dimensions of the chart
par(mai=c(0.9,0.9,0.9,0.9))
#creating object to assign boxplot.stats for variable 'Km'
Km_stats <- boxplot.stats(Km)$stats
#presenting dot chart to display quartiles of variable 'Km' using code strategy boxplot.stats()$stats
dotchart(Km_stats, main = "Dot chart - quartiles of variable 'Km'", xlim = c(0,max(Km)), xlab = "Km", pch = 18, cex = 0.8)
Observation:
Here is the illustration of quartile values of variable ‘Km’ in the
dotchart. The dots in the chart is the indication of the Oth quartile
171, 1st quartile 34994, Median 54000, 3rd quartile 72618, Maximum
quartile value 129000.
CONCLUSION
The comprehensive statistical analysis of discrete and continuous
variables from the dataset ‘Carsales’ of one of the largest global
automotive markets in the world, India. Based on the illustrations and
its observations presented in the report, it is understandable the
carsales market between 1998 and 2019 have seen considerable growth and
majority of the car ownership is with ‘First’ and ‘Second’ hand. Also,
the price distribution is mostly in the range of 4k and 12k (1st and 3rd
quartile); on the other hand, distribution of variable ‘km’ between the
range of 30 and 60k (1st and 3rd quartile). Based on the fueltype
analysis, the major market share is with diesel and gasoline fuel type
cars. While observing the top cities with huge market size, Mumbai and
Hyderabad stands at 1st and 2nd postiion, respectively.
BIBLIOGRAPHY
1. Paul Gao. V, Kaas. H, Mohr. D, Wee. D, 2016, Automotive
revolution – perspective towards 2030, McKinsey & Company, URL: https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/disruptive-trends-that-will-transform-the-auto-industry/de-DE
2. Bluman, A (2018), Elementary Statistics: a step by step approach. In
Bluman, A., Descriptive and Inferential Statistics, (pp. 3-4)
APPENDIX
An
R Markdown file has been attached to this report. The name of the file
is Project1_Jayakumar.rmd