1. Introduction to R Markdown

An R Markdown file (.Rmd) lets you:

Save and execute code
Generate high quality reports that can be shared with an audience, by knitting together plots, tables, and results with narrative text, and rendering to a variety of formats like HTML, PDF, MS Word, or MS PowerPoint.

2. How to make a report using R Markdown?

1- Create new file: File > New File > R Markdown

2- Create code chunks: Run code by line, by chunk, or all at once.

3- Write text and add tables, figures, images

4- Customize header: set output format

5- Save and knit your document

6- Publish (rpubs.com, RStudio Connect, etc.)
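The steps above can also be sketched from the R console. This is only an illustration, not the file behind this report: the file name, chunk label and narrative text are hypothetical placeholders, and the knitting step assumes that the rmarkdown package and pandoc are installed.

```r
# Minimal sketch: write a small .Rmd file and, if possible, knit it.
# The file name, chunk label and text below are hypothetical placeholders.
fence <- strrep("`", 3)  # three backticks, built here to keep this listing readable

rmd_lines <- c(
  "---",
  "title: \"Car Prices Analysis\"",
  "output: html_document",              # step 4: the header sets the output format
  "---",
  "",
  "Some narrative text.",               # step 3: text
  "",
  paste0(fence, "{r example-chunk}"),   # step 2: a code chunk
  "summary(cars)",                      # 'cars' is a dataset shipped with R
  fence
)

rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_file)         # steps 1 and 5: create and save the file

# Step 5 (knit): only attempted when rmarkdown and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)
}
```

In RStudio the Knit button does the same job as the `rmarkdown::render()` call.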

3. Data analysis

3.1 Dataset description

We want to understand the factors on which the pricing of cars depends. Specifically, we want to know how well those variables describe the price of a car.

To do so, we use a dataset obtained from Kaggle.

The dataset has 205 observations and 26 columns.

# libraries
library(tidyverse)
library(knitr)

# import data from csv file
cars.price <- read.csv("archive/CarPrice.csv")

# examine data structure
str(cars.price)

## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

# first 5 rows of the data
head(cars.price, 5)

# last 5 rows of the data
tail(cars.price, 5)

We select some variables to use in our analysis.

# selecting columns
# method 1: keep columns 3 to 26 by position
cars.price_2 <- cars.price[,3:26]

# method 2: eliminate columns 1 and 2
cars.price_3 <- cars.price[,-c(1,2)]

# method 3: select variables by name
cars.price_4 <- cars.price[,c("stroke", "price","carbody")]
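To see how the three selection methods relate, here is a small sketch on a toy data frame. The data frame `toy` and its column names are made up for illustration and stand in for the car data.

```r
# Toy data frame standing in for the car data (illustrative values only).
toy <- data.frame(
  id     = 1:3,
  code   = c(3L, 1L, 2L),
  stroke = c(2.68, 3.47, 3.40),
  price  = c(13495, 16500, 13950),
  body   = c("convertible", "hatchback", "sedan")
)

by_position  <- toy[, 3:ncol(toy)]                  # method 1: keep columns 3 to the last
by_exclusion <- toy[, -c(1, 2)]                     # method 2: drop columns 1 and 2
by_name      <- toy[, c("stroke", "price", "body")] # method 3: keep columns by name

# All three give the same data frame here.
identical(by_position, by_exclusion)
identical(by_position, by_name)

# Selecting a single column returns a vector unless drop = FALSE is given.
is.data.frame(toy[, "price"])
is.data.frame(toy[, "price", drop = FALSE])
```

Selecting by name is usually the safest choice, because it keeps working if the column order of the source file changes.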

# check fuel type values
table(cars.price$fueltype) %>% 
  kable(col.names = c("Value","Frequency"), caption="Fuel type values")

Fuel type values
Value	Frequency
diesel	20
gas	185

# method 4: select variables & filter rows (dplyr)
cars.price_5 <- cars.price %>%
  select(stroke, price, carbody, peakrpm, CarName, fueltype, drivewheel, curbweight, enginetype, horsepower) %>%
  filter(fueltype == "gas")
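For comparison, the same select-and-filter step can be written in base R with `subset()`. This sketch swaps `subset()` in for the dplyr pipeline and runs on a made-up toy data frame. It also shows a quick sanity check: the number of rows kept should equal the frequency reported by `table()` for that value (185 for gas in the real data).

```r
# Toy data frame standing in for the car data (illustrative values only).
toy <- data.frame(
  stroke   = c(2.68, 3.47, 3.40, 3.40),
  price    = c(13495, 16500, 13950, 17450),
  fueltype = c("gas", "diesel", "gas", "gas"),
  doors    = c("two", "two", "four", "four")
)

# Keep only gas cars and only three of the columns.
gas_cars <- subset(toy, fueltype == "gas", select = c(stroke, price, fueltype))

# Sanity check: rows kept should match the frequency of "gas" in the original data.
n_gas_before <- table(toy$fueltype)[["gas"]]
n_gas_after  <- nrow(gas_cars)
n_gas_before == n_gas_after
```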

# summary statistics of numeric variables
summary(cars.price_5[,c(1,2,4,8,10)]) %>% kable(caption = "Summary statistics of numeric variables")

Summary statistics of numeric variables
stroke	price	peakrpm	curbweight	horsepower
Min. :2.070	Min. : 5118	Min. :4200	Min. :1488	Min. : 48.0
1st Qu.:3.100	1st Qu.: 7689	1st Qu.:4800	1st Qu.:2128	1st Qu.: 70.0
Median :3.255	Median : 9989	Median :5200	Median :2405	Median : 97.0
Mean :3.231	Mean :13000	Mean :5200	Mean :2518	Mean :106.2
3rd Qu.:3.400	3rd Qu.:15998	3rd Qu.:5500	3rd Qu.:2847	3rd Qu.:116.0
Max. :4.170	Max. :45400	Max. :6600	Max. :4066	Max. :288.0

# convert a single character variable to a factor
cars.price_5$carbody <- as.factor(cars.price_5$carbody)

# convert 5 variables from character to factor
cars.price_5[,-c(1,2,4,8,10)] <- lapply(cars.price_5[,-c(1,2,4,8,10)] , as.factor)
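Hard-coded column positions such as `c(1,2,4,8,10)` break if the column order changes. One alternative, sketched here on a made-up toy data frame, is to pick the columns by type instead of by position.

```r
# Toy data frame standing in for the car data (illustrative values only).
toy <- data.frame(
  price      = c(13495, 16500, 13950),
  carbody    = c("convertible", "hatchback", "sedan"),
  drivewheel = c("rwd", "rwd", "fwd"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every character column.
is_chr <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched.
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```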

# summary of factor variables
options(knitr.kable.NA = '')
summary(cars.price_5[,-c(1,2,4,8,10)]) %>% kable(caption="Summary of factor variables")

Summary of factor variables
carbody	CarName	fueltype	drivewheel	enginetype
convertible: 6	toyota corolla: 5	gas:185	4wd: 9	dohc : 12
hardtop : 7	toyota corona : 5		fwd:111	dohcv: 1
hatchback :69	peugeot 504 : 4		rwd: 65	l : 7
sedan :81	subaru dl : 4			ohc :133
wagon :22	honda civic : 3			ohcf : 15
	mazda 626 : 3			ohcv : 13
	(Other) :161			rotor: 4

3.2. Data visualization

3.2.1. Univariate plots

# use ggplot to make a histogram
ggplot(data = cars.price_5, aes(x=price)) + # select data and x variable
  geom_histogram(binwidth = 2000, color="white", fill="orange") +
  # change binwidth, bars border and fill color
  xlab("Price ($)") + # add label to the x axis
  ylab("Count") + # add label to the y axis
  ggtitle("Histogram of car prices") + # add title to the plot
  theme_minimal() # add a theme

The histogram of car prices shows that the distribution is not symmetrical: it is right-skewed, with a few extreme high values (outliers) that lengthen the right tail. This agrees with the summary table, where the mean (13000) exceeds the median (9989). Most prices fall between $5,000 and $15,000.

# use ggplot to make a boxplot
ggplot(data = cars.price_5, aes(y=stroke)) + # select data and y variable
  geom_boxplot(fill="orange") + # fill color
  ylab("Stroke") + # add label to the y axis
  ggtitle("Boxplot of stroke") + # add title to the plot
  theme_minimal() # add a theme

The boxplot shows the distribution of stroke, the distance the piston travels inside the cylinder. The middle 50% of the values lie between about 3.1 and 3.4 (the first and third quartiles in the summary table). The variable has outliers at both the low and the high end.
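The box and the outlier points can be reproduced numerically. The sketch below uses a made-up vector `x` rather than the real stroke values, and applies the usual boxplot rule: points more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers.

```r
# Made-up values standing in for the stroke variable.
x <- c(2.1, 3.0, 3.1, 3.2, 3.3, 3.4, 3.4, 3.5, 4.2)

# First and third quartiles: the edges of the box (middle 50% of the data).
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences used by the 1.5 * IQR rule.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences are drawn as individual points in a boxplot.
outliers <- x[x < lower | x > upper]
outliers
```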

3.2.2. Bivariate plots

# scatterplot

ggplot(cars.price_5, aes(y=price, x= horsepower))+
  geom_point(color="orange")+
  geom_smooth(method="lm")+ # add linear regression line
  ylab("Price ($)") + # add label to the y axis
  xlab("Horsepower") + # add label to the x axis
  ggtitle("Scatterplot of car prices by horsepower") + # add title to the plot
  theme_minimal() # add a theme

The scatterplot of car prices by horsepower shows a positive relationship between the two variables: as horsepower increases, car prices tend to increase as well.
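The direction and strength of such a relationship can be checked numerically with `cor()`, and the line drawn by `geom_smooth(method="lm")` corresponds to a simple linear regression fitted with `lm()`. The sketch below uses made-up numbers, not the real data.

```r
# Made-up values standing in for horsepower and price.
toy <- data.frame(
  horsepower = c(60, 90, 110, 150, 200),
  price      = c(6000, 9000, 12000, 18000, 30000)
)

# Pearson correlation: a value above 0 indicates a positive relationship.
r <- cor(toy$horsepower, toy$price)
r

# Simple linear regression of price on horsepower.
fit <- lm(price ~ horsepower, data = toy)

# The slope is the estimated change in price per additional unit of horsepower.
slope <- coef(fit)[["horsepower"]]
slope
```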

# barplot
ggplot(cars.price_5, aes(x=drivewheel, fill=enginetype))+
  geom_bar(position = "dodge", color="white")+ # side-by-side (dodged) bars
  xlab("Drive wheel") + # add label to the x axis
  ggtitle("Barplot of drive wheel and engine type") + # add title to the plot
  theme_minimal()+ # add a theme
  theme(legend.position = "bottom") # change legend position

The barplot shows that most front-wheel-drive (fwd) cars have the ohc engine type, which is also the most common engine type in the data.
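The counts behind a barplot like this can be inspected directly with a two-way frequency table. The sketch below uses a made-up toy data frame with the same two column names.

```r
# Made-up values standing in for the drivewheel and enginetype columns.
toy <- data.frame(
  drivewheel = c("fwd", "fwd", "fwd", "rwd", "rwd", "4wd"),
  enginetype = c("ohc", "ohc", "ohcf", "ohc", "dohc", "ohcf")
)

# Two-way table: rows are drive wheels, columns are engine types.
counts <- table(toy$drivewheel, toy$enginetype)
counts

# Row proportions: the share of each engine type within a drive wheel category.
shares <- prop.table(counts, margin = 1)
shares

# Most frequent engine type among fwd cars.
top_fwd <- names(which.max(counts["fwd", ]))
top_fwd
```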

Car Prices Analysis

Project report

Abdullah Alshalaan

2022-08-30