1. Introduction to R Markdown

An R Markdown file (.Rmd) allows you to:

  • Save and execute code

  • Generate high-quality reports to share with an audience by knitting together plots, tables, and results with narrative text, then rendering them to formats such as HTML, PDF, MS Word, or MS PowerPoint.

2. How to make a report using R Markdown

1- Create new file: File > New File > R Markdown

2- Create code chunks: run code line by line, by chunk, or all at once

3- Write text and add tables, figures, images

4- Customize header: set output format

5- Save and knit your document

6- Publish (rpubs.com, RStudio Connect, etc.)
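
The steps above can be sketched in code. This is a minimal illustration rather than the only workflow: it writes a tiny .Rmd file to a temporary path, with a header that sets the output format, one line of narrative text, and one code chunk. The title, the text, and the use of the built-in mtcars dataset are assumptions made for the example.

```r
# Build a minimal .Rmd file in a temporary location (all content is illustrative).
fence <- strrep("`", 3)            # three backticks open and close a code chunk

rmd_lines <- c(
  "---",
  'title: "Car price report"',     # hypothetical title
  "output: html_document",         # step 4: the header sets the output format
  "---",
  "",
  "Some narrative text.",          # step 3: text between chunks
  "",
  paste0(fence, "{r}"),            # step 2: start of a code chunk
  "summary(mtcars$mpg)",           # code run when the document is knitted
  fence                            # end of the code chunk
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)    # step 5: save the file

file.exists(rmd_path)
```

Assuming the rmarkdown package and pandoc are installed, calling rmarkdown::render(rmd_path) would then knit this file to HTML, which is the programmatic counterpart of the Knit button in RStudio.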

3. Data analysis

3.1 Dataset description

We want to understand the factors on which the pricing of cars depends. Specifically, we want to know how well those variables describe the price of a car.

To do so, we use a car price dataset obtained from Kaggle.

The dataset has 205 observations and 26 columns.

# libraries
library(tidyverse)
library(knitr)
# import data from csv file
cars.price <- read.csv("archive/CarPrice.csv")

# examine data structure
str(cars.price)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...
# first and last five rows of the data
head(cars.price, 5)
tail(cars.price, 5)

We select some variables to use in our analysis.

# selecting columns
# method 1: keep columns 3 to 26 by position
cars.price_2 <- cars.price[,3:26]

# method 2: eliminate columns 1 and 2
cars.price_3 <- cars.price[,-c(1,2)]

# method 3: select variables by name
cars.price_4 <- cars.price[,c("stroke", "price","carbody")]
# check fuel type values
table(cars.price$fueltype) %>% 
  kable(col.names = c("Value","Frequency"), caption="Fuel type values")
Fuel type values

Value    Frequency
diesel          20
gas            185
# method 4: select variables & filter rows (dplyr)
cars.price_5 <- cars.price %>%
  select(stroke, price, carbody, peakrpm, CarName, fueltype, drivewheel, curbweight, enginetype, horsepower) %>%
  filter(fueltype == "gas")
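
The selection methods above can be compared on a small example. The sketch below uses a made-up toy data frame (the values and the name `toy` are assumptions, not the Kaggle data) to show that keeping columns by position and dropping the remaining ones give the same result, and that the select() and filter() step has a base R counterpart.

```r
# Toy data frame standing in for the car data (values are made up).
toy <- data.frame(
  car_ID   = 1:4,
  fueltype = c("gas", "diesel", "gas", "gas"),
  stroke   = c(2.68, 3.40, 3.47, 3.15),
  price    = c(13495, 7099, 16500, 9989),
  stringsAsFactors = FALSE
)

# Keeping columns 2 to 4 and dropping column 1 produce identical data frames.
by_position  <- toy[, 2:4]
by_exclusion <- toy[, -1]
same_result  <- identical(by_position, by_exclusion)
same_result

# Base R counterpart of select() + filter():
# pick rows with a logical condition and columns by name.
gas_only <- toy[toy$fueltype == "gas", c("stroke", "price", "fueltype")]
nrow(gas_only)
```

Selecting by name is usually the safer choice, because positional indexes silently point at different variables if the column order of the input file changes.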
# summary statistics of numeric variables
summary(cars.price_5[,c(1,2,4,8,10)]) %>% kable(caption = "Summary statistics of numeric variables")

Summary statistics of numeric variables

stroke          price           peakrpm         curbweight      horsepower
Min.   :2.070   Min.   : 5118   Min.   :4200    Min.   :1488    Min.   : 48.0
1st Qu.:3.100   1st Qu.: 7689   1st Qu.:4800    1st Qu.:2128    1st Qu.: 70.0
Median :3.255   Median : 9989   Median :5200    Median :2405    Median : 97.0
Mean   :3.231   Mean   :13000   Mean   :5200    Mean   :2518    Mean   :106.2
3rd Qu.:3.400   3rd Qu.:15998   3rd Qu.:5500    3rd Qu.:2847    3rd Qu.:116.0
Max.   :4.170   Max.   :45400   Max.   :6600    Max.   :4066    Max.   :288.0

# convert a single character variable to a factor
cars.price_5$carbody <- as.factor(cars.price_5$carbody)

# convert 5 variables from character to factor
cars.price_5[,-c(1,2,4,8,10)] <- lapply(cars.price_5[,-c(1,2,4,8,10)] , as.factor)

# summary of factor variables
options(knitr.kable.NA = '')
summary(cars.price_5[,-c(1,2,4,8,10)]) %>% kable(caption="Summary of factor variables")
Summary of factor variables

carbody          CarName             fueltype   drivewheel   enginetype
convertible: 6   toyota corolla:  5  gas:185    4wd:  9      dohc : 12
hardtop    : 7   toyota corona :  5             fwd:111      dohcv:  1
hatchback  :69   peugeot 504   :  4             rwd: 65      l    :  7
sedan      :81   subaru dl     :  4                          ohc  :133
wagon      :22   honda civic   :  3                          ohcf : 15
                 mazda 626     :  3                          ohcv : 13
                 (Other)       :161                          rotor:  4
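
The conversion above picks the character columns by position. As an alternative sketch, the columns can be detected by type, so the code keeps working if the column order changes. The toy data frame and its values are assumptions made for the example.

```r
# Toy data frame with a mix of character and numeric columns (made-up values).
toy <- data.frame(
  carbody  = c("sedan", "wagon", "sedan"),
  price    = c(13495, 16500, 9989),
  fueltype = c("gas", "gas", "diesel"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

# Character columns are now factors; the numeric column is untouched.
str(toy)
```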

3.2. Data visualization

3.2.1. Univariate plots

# use ggplot to make a histogram
ggplot(data = cars.price_5, aes(x=price)) + # select data and x variable
  geom_histogram(binwidth = 2000, color="white", fill="orange") +
  # change binwidth, bars border and fill color
  xlab("Price ($)") + # add label to the x axis
  ylab("Count") + # add label to the y axis
  ggtitle("Histogram of car prices") + # add title to the plot
  theme_minimal() # add a theme

The histogram of car prices shows that the distribution is not symmetrical. A few extreme values at the high end (outliers) stretch the right tail, so the distribution is right-skewed. Most prices fall between $5,000 and $15,000.
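
A quick numeric check of this kind of skew is to compare the mean with the median: a long right tail pulls the mean above the median, as in the summary table earlier in this section (mean price 13000 against a median of 9989). The sketch below illustrates the idea on a made-up vector of prices, not on the actual dataset.

```r
# Made-up prices with a long right tail (illustrative values only).
prices <- c(5500, 6200, 7100, 7900, 8800, 9900, 11500, 14000, 21000, 41000)

mean_price   <- mean(prices)     # pulled upward by the two expensive cars
median_price <- median(prices)   # robust to the extreme values

# TRUE suggests a right-skewed distribution.
mean_price > median_price
```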

# use ggplot to make a boxplot
ggplot(data = cars.price_5, aes(y=stroke)) + # select data and y variable
  geom_boxplot(fill="orange") + # fill color
  ylab("Stroke") + # add label to the y axis
  ggtitle("Boxplot of stroke") + # add title to the plot
  theme_minimal() # add a theme

The boxplot of stroke shows the distribution of the piston stroke length across cars. The middle 50% of the values lie between about 3.1 and 3.4 (the first and third quartiles in the summary table above). The variable has outliers at both the low and the high end.
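
Points drawn individually on a boxplot are conventionally those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to a made-up vector; the values and variable names are assumptions for illustration, not the stroke column itself.

```r
# Made-up stroke-like values with one low and one high extreme.
stroke <- c(2.1, 3.0, 3.1, 3.15, 3.2, 3.25, 3.3, 3.35, 3.4, 3.4, 3.5, 4.2)

# First and third quartiles: the box spans the middle 50% of the data.
q   <- unname(quantile(stroke, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the 1.5 * IQR rule.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers <- stroke[stroke < lower | stroke > upper]
outliers
```

With these toy values, only the smallest and the largest observation fall outside the fences.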

3.2.2. Bivariate plots

# scatterplot
ggplot(cars.price_5, aes(y=price, x= horsepower))+
  geom_point(color="orange")+
  geom_smooth(method="lm")+ # add linear regression line
  ylab("Price ($)") + # add label to the y axis
  xlab("Horsepower") + # add label to the x axis
  ggtitle("Scatterplot of car prices by horsepower") + # add title to the plot
  theme_minimal() # add a theme

The scatterplot of car prices by horsepower shows a positive relationship between the two variables: as horsepower increases, car prices tend to increase as well.
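
The direction and strength of such a relationship can also be quantified. As a minimal sketch on made-up numbers (the two vectors below are assumptions, not values from the dataset), cor() gives the correlation coefficient and lm() fits the same kind of straight line that geom_smooth(method="lm") draws.

```r
# Made-up horsepower and price values (illustrative only).
horsepower <- c(60, 70, 90, 100, 110, 150, 180, 200)
price      <- c(6000, 7000, 9500, 10000, 12500, 18000, 24000, 30000)

# Pearson correlation: values close to 1 indicate a strong positive association.
r <- cor(horsepower, price)

# Simple linear regression of price on horsepower.
fit   <- lm(price ~ horsepower)
slope <- coef(fit)[["horsepower"]]  # estimated price change per unit of horsepower

r
slope
```

A positive slope matches the upward trend of the regression line in the plot.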

# barplot
ggplot(cars.price_5, aes(x=drivewheel, fill=enginetype))+
  geom_bar(position = "dodge", color="white")+ # side-by-side (dodged) bars
  xlab("Drive wheel") + # add label to the x axis
  ggtitle("Barplot of drive wheel and engine type") + # add title to the plot
  theme_minimal()+ # add a theme
  theme(legend.position = "bottom") # change legend position

The barplot shows that most front-wheel-drive (fwd) cars have the ohc engine type, which is also the most common engine type in the data.
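
The counts behind a barplot like this one can be read from a two-way frequency table. The sketch below builds one with table() on a made-up data frame; the rows are assumptions for illustration and do not reproduce the real counts.

```r
# Made-up drive wheel and engine type combinations (illustrative only).
toy <- data.frame(
  drivewheel = c("fwd", "fwd", "fwd", "rwd", "rwd", "4wd"),
  enginetype = c("ohc", "ohc", "ohcf", "ohc", "dohc", "ohcf")
)

# Cross-tabulation: rows are drive wheels, columns are engine types.
counts <- table(toy$drivewheel, toy$enginetype)
counts

# Number of fwd cars with an ohc engine.
fwd_ohc <- counts["fwd", "ohc"]

# Most common engine type overall (column with the largest total).
most_common <- names(which.max(colSums(counts)))
most_common
```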