Indian Car Sales Report
ALY6010: Probability Theory and Introductory Statistics
John Wood-Downey
Dr. Dee Chiluiza, PhD
03 March, 2023


Introduction


The global automotive industry is a substantial market. According to Carlier (2023), the auto manufacturing industry generates 2.86 trillion USD annually and produces nearly 57 million cars yearly. In 2022, global car sales were 66.1 million; in 2021, China and the United States were two of the biggest markets for car sales (Carlier, 2023). In 2021, Toyota sold the most cars worldwide, but the Volkswagen Group had the highest revenue, with $295.73 billion (Carlier, 2023). Some of the most significant challenges in the automotive industry are supply-chain disruptions caused by the chip shortage that began during the pandemic, increased demand for electric vehicles (EVs), and companies transitioning their manufacturing to EVs (Carlier, 2023). In addition, in 2020, passenger vehicles accounted for 41% of global transportation carbon dioxide emissions, and many countries want to reduce their carbon footprint with new regulations (Carlier, 2023). Growing electric vehicle companies are disrupting the automotive industry, and sales are shifting toward the EV market. In a recent article, Elon Musk, the CEO of Tesla, said that Tesla plans to cut production costs by 50% and offer more affordable cars to increase the output of electric vehicles (Reed, 2023).


More specifically, the Indian car market is significant on a global scale. In 2021, the Indian automotive market value was $32.7 billion and should grow at a compound annual growth rate of 9% from 2022 to 2027 to reach $54.84 billion in 2027 (IBEF, 2022). In addition, the EV market should reach $7.09 billion by 2025 (IBEF, 2022). Some of the automobile clusters in India are Mumbai, Chennai, Kolkata, and Delhi (IBEF, 2022). The data set used in this report includes these locations. Some key trends in the Indian automobile industry are rising demand due to a growing young and wealthy middle-class population, increased incorporation of electric standards, substantial foreign investment in the automotive sector, and strong policies supporting the automotive industry (IBEF, 2022). In 2022, Suzuki was the most popular car brand in India, with a market share of 46%, and the other big brands in the Indian car market are Kia Motors, Toyota, Honda, and Volkswagen (Statista Research Department, 2023). In another article, Reuters (2023) claims India is the fourth largest car market, and vehicle sales should grow nearly 10% in 2024, which would put them more than 20% above pre-pandemic sales. According to Reuters (2023), the SUV category is in the highest demand in India, and it will help the total number of vehicles sold reach nearly 5 million units in 2024. Overall, the Indian car market is in solid health.


Discrete and continuous probability distributions are important in statistics and data visualization. According to Bluman (2018), discrete variables take values that can be counted. For example, there are nine cars in the parking lot. Discrete probability distributions are usually presented in graphs like bar plots because the data takes separate, countable values (Young, 2023). The main types of discrete probability distributions are the binomial, which gives the probability of a number of successes in repeated trials with two possible outcomes; the Bernoulli, a single trial with two outcomes where one represents success and zero represents failure; the multinomial, for trials with more than two outcomes; and the Poisson, which gives the probability that a number of events will take place over a fixed period (Young, 2023). A continuous probability distribution describes values that can include decimals and fractions (Bluman, 2018), for example, the temperature in classrooms at Northeastern. A temperature reading can contain fractions and is not restricted to whole numbers. According to Albright (2023), in a continuous probability distribution, X can assume infinitely many values, and probabilities are displayed as areas under a curve. There are numerous types of continuous distributions, like the uniform, in which every value in an interval is equally likely; the normal distribution, whose standard form has a mean of zero and a standard deviation of one; the log-normal distribution, in which the logarithm of x is normally distributed; the t distribution, which has thicker tails than a normal distribution; the chi-square, whose shape depends on its degrees of freedom; and the exponential distribution (Madiraju, 2021).
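The distributions described above can be explored with base R's built-in probability functions. The following is a minimal sketch, not code from the report: the success probability of 0.59 borrows the share of first owners reported later, while the sample of 10 cars and the Poisson rate of 2 are hypothetical values chosen only for illustration.

```r
# Discrete: binomial probabilities for the number of first-owner cars
# in a hypothetical sample of 10 cars (p = 0.59 is taken from Task 3).
p_first <- 0.59
binom_probs <- dbinom(0:10, size = 10, prob = p_first)
sum(binom_probs)  # probabilities over all possible counts add up to 1

# Discrete: Poisson probability of exactly 3 events in a fixed period,
# assuming a hypothetical average rate of 2 events per period.
pois_prob <- dpois(3, lambda = 2)

# Continuous: for the standard normal (mean 0, sd 1), probabilities are
# areas under the curve, here the area within 2 standard deviations.
within_two_sd <- pnorm(2) - pnorm(-2)

print(round(c(poisson = pois_prob, normal_2sd = within_two_sd), 4))
```

A bar plot of binom_probs would give the discrete view mentioned above, while the normal probability is read as an area rather than a bar height.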


The data used in this report is about car sales in India. The data set includes car model, location, year, fuel type, transmission, owner type, efficiency, engine, power, seats, kilometers, and price. The data is presented in tables, bar charts, pie charts, density graphs, box plots, histograms, and scatterplots to give an overview of car sales in India and how variables affect each other. The goal of the report is to examine the price of cars in India, their kilometers, and the location of sales to present an overview of the market and explain how the Indian market is one of the biggest car sales markets in the world.


Analysis


Task 1


This task presents a table with descriptive statistics of the car sales data set

Descriptive Stats for Car Sales Data Set
Statistic  Efficiency  Power_bhp    Seats         Km      Price
vars             1.00       2.00     3.00       4.00       5.00
n             4949.00    4949.00  4949.00    4949.00    4949.00
mean            18.47     123.73     5.16   55809.14    8383.22
sd               4.17      41.46     0.56   28764.20    5158.19
median          18.00     100.00     5.00   54000.00    7146.00
trimmed         18.32     117.48     5.03   54060.90    7668.28
mad              2.97       0.00     0.00   28169.40    4121.63
min              8.00      50.00     4.00     171.00     618.00
max             30.00     600.00    10.00  149000.00   25270.00
range           22.00     550.00     6.00  148829.00   24652.00
skew             0.41       1.96     3.85       0.58       1.23
kurtosis        -0.17       8.61    18.71       0.23       1.16
se               0.06       0.59     0.01     408.88      73.32


The table above describes the car sales data set with descriptive statistics like the mean, median, range, skew, the number of observations (n) for each variable, and other information. According to Bluman (2018), descriptive statistics help describe a situation. From the table, we understand that some of the main variables in the car sales data set are car efficiency, power output (bhp), seats per vehicle, kilometers, and the price of the car. These variables are important to highlight because they influence car sales significantly. Two of the most impactful factors when buying used cars are mileage and condition: the higher the mileage, the worse the vehicle’s condition usually is and the less a customer wants to pay for it, since there is a big difference between 10,000 km and 200,000 km (D’Allegro, 2021). From the table, we learn there are 4,949 observations in the car sales data set. The mean for efficiency is 18.47, the mean for power is 123.73 bhp, the mean for seats is 5.16, the mean for kilometers is 55,809.14 km, and the mean price per car in the data set is $8,383.22. The vehicle with the lowest price is $618, and the highest is $25,270. The function dplyr::select() kept only the variables seen in the table, which is useful in data analysis for presenting only the variables of interest. Next, psych::describe() produced the descriptive statistics. The function t() transposed the table to have the variables in columns and the statistics in rows. Finally, round() rounded the values to two decimals for readability. The table was rendered with knitr::kable to create a cleaner HTML table (Zhu, 2021).
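The select, describe, transpose, and round steps can be sketched in base R as follows. This is an illustration under stated assumptions and not the report's code: the five-row data frame is hypothetical stand-in data, and sapply() is swapped in for psych::describe() so that the sketch runs without extra packages.

```r
# Hypothetical stand-in data; the report's data set has 4,949 rows.
cars <- data.frame(
  Model      = c("A", "B", "C", "D", "E"),
  Efficiency = c(18.0, 22.5, 15.2, 19.8, 17.1),
  Power_bhp  = c(100, 85, 150, 120, 98),
  Seats      = c(5, 5, 7, 5, 4),
  Km         = c(54000, 32000, 88000, 41000, 67000),
  Price      = c(7146, 9200, 5600, 8100, 6400)
)

# Keep only the numeric variables of interest
# (the role dplyr::select() plays in the report).
numeric_vars <- cars[, c("Efficiency", "Power_bhp", "Seats", "Km", "Price")]

# One row per variable, one column per statistic
# (a reduced stand-in for psych::describe()).
stats <- data.frame(
  n      = sapply(numeric_vars, length),
  mean   = sapply(numeric_vars, mean),
  sd     = sapply(numeric_vars, sd),
  median = sapply(numeric_vars, median),
  min    = sapply(numeric_vars, min),
  max    = sapply(numeric_vars, max)
)

# Transpose so variables become columns, then round to two decimals.
summary_table <- round(t(stats), 2)
print(summary_table)
```

Passing summary_table to knitr::kable() would then produce the formatted HTML table; that step is left out here to keep the sketch dependency-free.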


Task 2


This task presents a bar chart for each location of car sales and a pie chart for each type of fuel in the car sales data set


Here we have two graphs, displayed side by side using par(mfcol). The first graph is a bar chart showing the frequency of car sales in different locations in India, and the second is a pie chart showing the types of fuel for the cars in the data set “car sales.” From the bar chart, the area with the most car sales was Mumbai, with 677 cars, and the lowest was Ahmedabad, with 196 cars sold. The data is essential because we can see which areas sell the most cars and which sell the fewest. Based on this historical sales data, consumers can look for a reasonable sales price, and sellers can see where they are likely to sell their cars the fastest. Bluman (2018) says bar graphs accurately show the frequencies of categories, so a bar graph is appropriate in this example to show the sales location. The second graph is a pie chart for the type of fuel Indian cars use. From the pie chart, we see that most vehicles are diesel or gasoline vehicles. Very few cars are CNG or LPG-powered vehicles. The chart is interesting because India has many diesel vehicles available compared to North America. According to the IEA (2021), diesel engines accounted for 47% of vehicle sales in India in 2017 and 40% in 2019. The pie chart is consistent with these figures and shows that many diesel-powered vehicles are available in these locations in India.
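A side-by-side layout of this kind can be sketched with base R graphics as shown below. The location and fuel vectors are hypothetical placeholders rather than the report's data, and the titles and axis labels are assumptions.

```r
# Hypothetical placeholder values for two categorical variables.
location <- c("Mumbai", "Mumbai", "Mumbai", "Pune", "Pune", "Delhi")
fuel     <- c("Diesel", "Petrol", "Diesel", "Petrol", "CNG", "Diesel")

# Count how often each category occurs.
location_counts <- table(location)
fuel_counts     <- table(fuel)

# mfcol = c(1, 2) splits the plotting area into 1 row and 2 columns;
# par() returns the previous settings so they can be restored afterwards.
old_par <- par(mfcol = c(1, 2))
barplot(location_counts, main = "Cars per location", ylab = "Frequency", las = 2)
pie(fuel_counts, main = "Fuel type")
par(old_par)
```

With a single row, mfcol and mfrow give the same layout; the two settings differ only in the order panels are filled when there are several rows and columns.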


Task 3


This task shows a table for the frequency of owner type (first, second, third, or fourth) with the percentage of each

Owner Type Frequency & Percentage
Owner Type  Frequency  Cum Frequency  Percentage  Cum Percentage
First            2921           2921        0.59            0.59
Fourth            782           3703        0.16            0.75
Second            916           4619        0.19            0.93
Third             330           4949        0.07            1.00


The table above shows the data set’s owner type frequency and percentage. The table divides vehicle ownership among first, second, third, and fourth owners. Most of the data is for first owners, with 2,921 cars, followed by second owners at 916, fourth owners at 782, and finally third owners at 330. First owners make up 59% of the data for car sales in India, while third owners are only 7%. The function knitr::kable, extended with the kableExtra library, allowed for a custom table (Zhu, 2021). The function mutate() added new columns to the table to display the cumulative frequency, percentage, and cumulative percentage. According to Bluman (2018), the frequencies and percentages in this table can also be displayed in a pie chart because pie charts divide sections according to the percentage for each categorical value. Car sales agents can use the information in this table to target audiences and sell to the appropriate owner type to increase sales.
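The cumulative and percentage columns can be rebuilt from the four frequencies in the table. The sketch below uses base R's transform() in place of mutate() to avoid package dependencies; the column names are illustrative assumptions.

```r
# Frequencies copied from the owner type table above.
owner_counts <- c(First = 2921, Fourth = 782, Second = 916, Third = 330)

owner_table <- data.frame(
  Owner     = names(owner_counts),
  Frequency = as.vector(owner_counts)
)

# Add the running total and each category's share of all cars.
owner_table <- transform(
  owner_table,
  Cum_Frequency = cumsum(Frequency),
  Percentage    = round(Frequency / sum(Frequency), 2)
)

# The cumulative share is computed from the running total, not by summing
# the rounded percentages, so rounding errors do not accumulate.
owner_table$Cum_Percentage <-
  round(owner_table$Cum_Frequency / sum(owner_table$Frequency), 2)

print(owner_table)
```

Computing the cumulative share from the running total explains why the rounded percentages add up to 1.01 while the last cumulative percentage is exactly 1.00.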


Task 4


This task presents a density plot for the kilometers of cars in the car sales data set


Here we created a density plot to show the distribution of the variable kilometers in the car sales data set. The density plot covers all kilometer values, so we can interpret which values are most common in car mileage for car sales in India. From the plot, the mean for kilometers is 55,809 km, and the plotted range runs from zero to nearly 200,000 km, although the largest observed value is 149,000 km (Task 1). In the plot, there are also two lines that mark z-scores. According to Bluman (2018), the standard score is a value that tells us how many standard deviations a data value falls above or below the mean. We can find the standard score by subtracting the mean from a value and then dividing the result by the standard deviation of the data set (Bluman, 2018). In this example, we used z-scores of 2.4 and -3.1 to show where the values fall on the curve. A z-score of 2.4 corresponds to 124,843 km in the car sales data set, while a z-score of -3.1 corresponds to -33,360 km. The negative value has no practical meaning for customers because car kilometers cannot be below zero. Using the graph, we can take any kilometer value and find how many standard deviations it lies above or below the mean. The density plot is essential in car sales because customers want to know where a vehicle stands compared to other vehicles in terms of kilometers before purchasing. The function text() added numerical values beside the lines on the graph (Chiluiza, 2021).
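The conversion between z-scores and kilometers can be checked with a few lines of R. The mean and standard deviation come from the Task 1 table; the helper names z_to_km and km_to_z, and the 100,000 km example value, are hypothetical and not part of the report's code.

```r
# Mean and standard deviation of kilometers, from the Task 1 table.
km_mean <- 55809.14
km_sd   <- 28764.20

# z = (x - mean) / sd, and its inverse x = mean + z * sd.
km_to_z <- function(km) (km - km_mean) / km_sd
z_to_km <- function(z) km_mean + z * km_sd

# Kilometer values for the two z-scores marked on the density plot.
z_lines <- round(z_to_km(c(2.4, -3.1)))
print(z_lines)  # 124843 and -33360, matching the values in the text

# Any mileage can be converted the other way, e.g. a 100,000 km car.
print(round(km_to_z(100000), 2))
```

The negative result for z = -3.1 follows from the arithmetic alone: 3.1 standard deviations is about 89,169 km, which is more than the mean of 55,809 km.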


Task 5


This task presents a box plot and a histogram for the kilometers of the cars from the car sales data set


Here we have two graphs showing kilometers in cars for the Indian car sales data set. These two graphs accompany the density plot in the last task to show more visuals about the data distribution. From the box plot, the data shown in blue tells us that the mean is greater than the median. According to Bluman (2018), when the mean is to the right of the median, most of the data is to the left of the mean, resulting in a positive skew. The dots on the right side of the box plot show potential outliers in the kilometers data. The second graph is a histogram that shows the same data as the box plot and the density curve in another visual format. As kilometers are continuous data, the histogram divides the values into bins and paints a picture of the distribution. Continuous data can take infinitely many values between two specific points, including fractions or decimals (Bluman, 2018). The red line shows the median, and the pink line shows the mean. The data is right-skewed, as explained for the box plot. The code par(mfrow) presents the two graphs in a row to optimize data visualization. In these graphs, the argument main=NA hides the titles because the graphs continue the explanation of the density curve.
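A box plot and histogram pair with mean and median markers can be sketched as follows. The kilometers vector here is synthetic: it takes evenly spaced quantiles of a gamma distribution (shape 2, scale 25,000, both arbitrary assumptions) simply to obtain a reproducible right-skewed sample, and the colors mirror the ones described above.

```r
# Synthetic right-skewed "kilometers": evenly spaced quantiles of a gamma
# distribution, so the values are the same on every run.
km <- qgamma(ppoints(500), shape = 2, scale = 25000)

km_mean   <- mean(km)
km_median <- median(km)

# Two panels in one row; main = NA suppresses the titles.
old_par <- par(mfrow = c(1, 2))

boxplot(km, horizontal = TRUE, main = NA, xlab = "Kilometers")
points(km_mean, 1, col = "blue", pch = 19)   # mark the mean on the box plot

hist(km, breaks = 20, main = NA, xlab = "Kilometers")
abline(v = km_median, col = "red",  lwd = 2) # median
abline(v = km_mean,   col = "pink", lwd = 2) # mean

par(old_par)

# In a right-skewed sample the mean sits to the right of the median.
print(km_mean > km_median)
```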


Task 6


This task presents a box plot and a histogram for the price of cars in the car sales data set


Here we have two graphs showing the price of cars in the car sales data set. The box plot and the histogram show the data distribution. The box plot shows that the mean is larger than the median, making the data positively skewed and confirming that most sales prices are less than $8,000. The histogram gives another view of the sales price distribution because it groups the data in bins and helps us see the positive skew. The tallest bin in the histogram is at $5,000, showing us that most data points in the car sales data set fall in this price range. Finally, the box plot shows some outliers in the data because some cars have higher prices than others based on factors like mileage and condition (D’Allegro, 2021). The red line on the histogram is the median, and the orange line is the mean.


Task 7


This task presents box plots for the price of cars per different owner


Here is a box plot for the distribution of price by owner type in the car sales data set. As previously explained, ownership is divided among first, second, third, and fourth owners. Prices vary within each owner type based on the buyer’s desired specifications. From the box plots, the owner type with the lowest median price is the third owner, followed by the fourth, first, and second. Based on the data, second owners are willing to pay higher prices for their vehicles. The fourth owners have more separation in the data because there are numerous outliers, meaning that while some owners pay low prices, others are willing to pay higher prices for their vehicle. The third owners have the widest spread because the box plot shows an elongated interquartile range. The box plots show that first owners pay between a few hundred dollars and $16,000 for their car. According to Bluman (2018), box plots are an excellent tool to evaluate data distribution, especially to compare the distribution between categories.
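Grouped box plots like this one are typically drawn with R's formula interface, and the medians can be checked numerically alongside the plot. The sketch below uses a small hypothetical data frame; its prices are invented so that the ordering of medians resembles the one described above and do not come from the report's data.

```r
# Hypothetical prices, five cars per owner type.
sales <- data.frame(
  Owner = rep(c("First", "Second", "Third", "Fourth"), each = 5),
  Price = c(4000, 6500, 7500, 9000, 16000,
            5000, 7000, 8200, 9500, 12000,
            2500, 4000, 5200, 9000, 13000,
            3000, 4500, 6000, 7000, 21000)
)

# Order the categories logically instead of alphabetically.
sales$Owner <- factor(sales$Owner,
                      levels = c("First", "Second", "Third", "Fourth"))

# Median price per owner type, to read alongside the plot.
medians <- tapply(sales$Price, sales$Owner, median)
print(medians)

# One box per owner type: Price on the y-axis, split by Owner.
boxplot(Price ~ Owner, data = sales, xlab = "Owner type", ylab = "Price")
```

The same pattern, with kilometers on the left of the formula and location on the right, would produce a box plot of kilometers per location.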


Task 8


This task presents box plots of kilometers of cars in different locations


Task 8 presents a box plot showing the kilometers of cars in different locations in India in the Indian car sales data set. For example, the box plot shows that Hyderabad has the highest median kilometers, and Kolkata has the lowest. According to Yi (2021), box plots are good visualization tools because they emphasize the middle 50% of the data with the box or interquartile range. Having a clear view of the data allows for a better understanding of how the distribution compares across categories. In this chart, for example, we can easily see which location in India has the most outliers in kilometers. In Kolkata, there are many outliers. We can also see that kilometers have a more extended interquartile range in Chennai. Box plots offer information about the skewness of the data, variance, symmetry, and possible outliers (Yi, 2021). This box plot combines variables we previously explained, kilometers and location, to give the audience a better understanding of the overall Indian car sales data set.


Task 9


This task shows the box plot statistics of the kilometers of cars in the car sales data set

Box Plot Stats Kilometers
Descriptive_Statistics Values
Min 171
Q1 34994
Median 54000
Q3 72618
Max 129000


The table above shows the box plot statistics for the variable kilometers in the Indian car sales data set. The function boxplot.stats() returns five values for the whole data set: the lower whisker end, the lower hinge (approximately Q1), the median, the upper hinge (approximately Q3), and the upper whisker end. The variable kilometers is important for car sales because, as D’Allegro (2021) stated, kilometers affect the car’s condition and the price customers are willing to pay. The information from the code was inserted in a vector to create a table using knitr::kable (Zhu, 2021). From the information, the car with the lowest kilometers had 171 km, which is a nearly new car. The Q1 was 34,994 km, the median for the data set was 54,000 km, and the Q3 was 72,618 km. The value labeled Max, 129,000 km, is the end of the upper whisker, that is, the largest value within 1.5 times the interquartile range above Q3; it is not the highest value in the data set. Cars above it, up to the true maximum of 149,000 km reported in Task 1, appear as outliers in the box plot, so the full range is 148,829 km rather than 128,829 km. According to Bluman (2018), box plot stats are a five-number summary that helps understand data, and these points create visualizations, as seen in the next chart.
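The difference between the whisker ends returned by boxplot.stats() and the raw minimum and maximum can be seen on a small example. The ten kilometer values below are hypothetical, chosen so that one car lies beyond the upper whisker; this would also account for the Max of 129,000 km in the table above differing from the 149,000 km maximum in Task 1.

```r
# Hypothetical kilometer values with one unusually high reading.
km <- c(171, 30000, 35000, 45000, 54000,
        60000, 70000, 72000, 100000, 149000)

bp <- boxplot.stats(km)

# $stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
five_values <- data.frame(
  Descriptive_Statistics = c("Min", "Q1", "Median", "Q3", "Max"),
  Values = bp$stats
)
print(five_values)

# $out: points beyond 1.5 * IQR from the hinges, drawn as outliers.
print(bp$out)

# The raw maximum can therefore be larger than the "Max" whisker value.
print(max(km))
```

When the true extremes are wanted instead, range(km) or fivenum(km) report them directly.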


Task 10


This task presents a dot chart with the quartiles of kilometers in cars from the previous table


The final analysis of the Indian car sales data set is a dot chart presenting the box plot statistics from Task 9. The dot chart shows five red dots that summarize the variable kilometers in car sales. From the chart, we see the kilometers increasing from the minimum through Q1, the median, and Q3 to the maximum. Each dot shows one box plot stat. The visual helps gain a deeper understanding of what these values mean for the overall data set.


Conclusion


From the data, we found that factors like location, kilometers, ownership, and condition affect the price of cars. The analysis section of the report gave basic descriptive statistics of efficiency, power, seats, kilometers, and price to paint a picture of the data set. Moreover, we found the location frequency and fuel types of cars in India. Some key findings are that Mumbai has the highest car sales and that gasoline cars are the most popular, but, surprisingly, diesel is also a popular fuel source in Indian cars. Furthermore, we found that most cars in the data set belong to first owners (59%), and very few belong to third owners, with only 7% in this category. Finally, through a density plot, we found that the mean kilometers for the entire data set were 55,809 km per car.


Additionally, the mean price of cars is $8,383.22, and most sales are near the median of $7,146. Finally, we found that the median kilometers for cars in India were 54,000 km, and the mean was 55,809.14 km. The car with the highest kilometers in the sales data set had 149,000 km, and the lowest had 171 km.


From the data, we can conclude that the cars that sell the most are either gasoline or diesel. Although there is a push for EVs in India, the EV market still needs to produce more sales, according to the data set. Some locations with more car sales are Mumbai, Hyderabad, Pune, Kochi, Coimbatore, and Delhi. Car sales companies can use the data to evaluate where they should sell cars and how much consumers will pay. The data can also be used by car manufacturers when establishing what product line of cars to sell in India. The descriptive statistics give information about sales in the country and what factors to consider. The report can also help new car owners compare how many kilometers are typical in car sales and what price first, second, third, or fourth owners are paying for cars in India.


Through the report, new skills were acquired, like adding columns to data tables, creating various graphs like density plots and scatterplots, and using new libraries. Overall, the visualizations point to the need for foreign direct investment and show how India is still an emerging market. The manufacturing sector is growing, and the opportunity to leverage car sales to make profitable investments is apparent through the data.


Bibliography


Reference


Albright, E. (2023). ENV710 Statistics Review Website. Duke Nicholas School of the Environment. Retrieved March 2, 2023, from https://sites.nicholas.duke.edu/statsreview/continuous-probability-distributions/#:~:text=Continuous%20probability%20distribution%3A%20A%20probability,50).


Bluman, A. G. (2018). Elementary statistics: A step by step approach (10th ed.). McGraw-Hill Education.


Carlier, M. (2023). Automotive Industry Worldwide- statistics & facts. Statista. Retrieved March 2, 2023, from https://www.statista.com/topics/1487/automotive-industry/#topicOverview


Chiluiza, D. (2021). Introduction to data analysis using R, R Studio and R Markdown- Short Manual Series: Histograms. RPubs. Retrieved March 1, 2023, from https://rpubs.com/Dee_Chiluiza/816756


D’Allegro, J. (2021). Just what factors into the value of your used car. Investopedia. Retrieved February 28, 2023, from https://www.investopedia.com/articles/investing/090314/just-what-factors-value-your-used-car.asp


IBEF. (2022). Automobile Industry in India. India Brand Equity Foundation. Retrieved March 2, 2023, from https://www.ibef.org/industry/india-automobiles


IEA. (2021). Fuel economy in India- part of the global fuel economy initiative 2021. International Energy Agency. Retrieved March 1, 2023, from https://www.iea.org/articles/fuel-economy-in-india


Madiraju, P. (2021). Statistics 101: Beginners Guide to Continuous Probability Distribution. Analytics Vidhya. Retrieved March 2, 2023, from https://www.analyticsvidhya.com/blog/2021/02/statistics-101-beginners-guide-to-continuous-probability-distributions/


Reed, B. (2023). Musk unveils plans for low production cost while skirting affordable car option. Guardian. Retrieved March 2, 2023, from https://www.theguardian.com/technology/2023/mar/01/elon-musk-tesla-fossil-fuels-economy


Reuters. (2023). India’s passenger vehicle sales to grow 9%-10% in 2024- Crisil Ratings. Reuters. Retrieved March 2, 2023, from https://www.reuters.com/world/india/indias-passenger-vehicle-sales-grow-9-10-2024-crisil-ratings-2023-02-28/


Statista Research Department. (2023). Passenger car market share across India in 2022, by vendor. Statista. Retrieved March 2, 2023, from https://www.statista.com/statistics/316850/indian-passenger-car-market-share/#statisticContainer


Yi, M. (2021). A complete guide to box plots. Chartio. Retrieved March 2, 2023, from https://chartio.com/learn/charts/box-plot-complete-guide/#:~:text=Box%20plots%20are%20used%20to,skew%2C%20variance%2C%20and%20outliers.


Young, J. (2023). Discrete Probability Distribution: Overview and Examples. Investopedia. Retrieved March 2, 2023, from https://www.investopedia.com/terms/d/discrete-distribution.asp#:~:text=A%20discrete%20probability%20distribution%20counts,%2C%20Poisson%2C%20and%20Bernoulli%20distributions.


Zhu, H. (2021). Create awesome HTML table with knitr::kable and kableExtra. CRAN.R-Project. Retrieved March 1, 2023, from https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html


Appendix


Due to the submission of this report in an HTML file, an R Markdown file has been attached to the report. The file name is Project1_JohnWoodDowney.Rmd