An R Markdown file (.Rmd) lets you:
Save and execute code
Generate high quality reports that can be shared with an audience, by knitting together plots, tables, and results with narrative text, and rendering to a variety of formats like HTML, PDF, MS Word, or MS PowerPoint.
1- Create new file: File > New File > R Markdown
2- Create code chunks: Run code by line, by chunk, or all at once.
3- Write text and add tables, figures, images
4- Customize header: set output format
5- Save and knit your document
6- Publish (rpubs.com, RStudio Connect, etc.)
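As a rough illustration of steps 1, 4 and 5, the sketch below writes a minimal .Rmd file with a header that sets the output format, then knits it from the console. The file name `example.Rmd`, the title, and the use of the built-in `cars` data are assumptions made for this example only; in RStudio the Knit button does the same job.

```r
# Build a minimal R Markdown document as a character vector.
# The code-chunk fence is assembled with strrep() so this example is easy to embed.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  'title: "Car price analysis"',   # hypothetical title
  "output: html_document",         # step 4: the header sets the output format
  "---",
  "",
  "Summary of the built-in cars data:",
  "",
  paste0(fence, "{r}"),            # step 2: a code chunk
  "summary(cars)",
  fence
)

# Step 5: save the document (here in a temporary folder; the name is hypothetical)
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Knit only if the rmarkdown package and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_file)  # path of the rendered HTML file
}
```

Changing `output: html_document` to `pdf_document` or `word_document` in the header should switch the rendered format, provided the required tools (for example a LaTeX installation for PDF) are installed.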
We want to understand which factors determine the price of a car. Specifically, we want to know how well those variables describe the price.
To do so, we use a dataset from Kaggle.
The dataset has 205 observations and 26 columns.
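# load the packages used below
library(dplyr)    # %>%, select(), filter()
library(knitr)    # kable()
library(ggplot2)  # plots
# import data from csv file
cars.price <- read.csv("archive/CarPrice.csv")
# examine data structure
str(cars.price)
## 'data.frame': 205 obs. of 26 variables: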
## $ car_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ CarName : chr "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
## $ fueltype : chr "gas" "gas" "gas" "gas" ...
## $ aspiration : chr "std" "std" "std" "std" ...
## $ doornumber : chr "two" "two" "two" "four" ...
## $ carbody : chr "convertible" "convertible" "hatchback" "sedan" ...
## $ drivewheel : chr "rwd" "rwd" "rwd" "fwd" ...
## $ enginelocation : chr "front" "front" "front" "front" ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginetype : chr "dohc" "dohc" "ohcv" "ohc" ...
## $ cylindernumber : chr "four" "four" "six" "four" ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuelsystem : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
We select some variables to use in our analysis.
# selecting columns
# method 1: keep columns 3 to 26 by position
cars.price_2 <- cars.price[, 3:26]
# method 2: eliminate columns 1 and 2
cars.price_3 <- cars.price[, -c(1, 2)]
# method 3: select variables by name
cars.price_4 <- cars.price[, c("stroke", "price", "carbody")]
# check fuel type values
table(cars.price$fueltype) %>%
  kable(col.names = c("Value", "Frequency"), caption = "Fuel type values")

| Value | Frequency |
|---|---|
| diesel | 20 |
| gas | 185 |
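To see how column-selection methods 1 to 3 relate to each other, here is a small sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the car dataset. Keeping columns by position, dropping the other columns, and selecting by name all return the same result when they target the same columns.

```r
# Toy data frame (made-up values, for illustration only)
toy <- data.frame(
  id    = 1:3,
  code  = c(3L, 1L, 2L),
  name  = c("a", "b", "c"),
  price = c(100, 200, 300)
)

by_position  <- toy[, 3:4]                  # method 1: keep columns 3 and 4
by_exclusion <- toy[, -c(1, 2)]             # method 2: drop columns 1 and 2
by_name      <- toy[, c("name", "price")]   # method 3: select by name

identical(by_position, by_exclusion)  # TRUE
identical(by_position, by_name)       # TRUE
```

Selecting by name is usually the most robust choice, because it keeps working if the column order of the source file changes.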
# method 4: select variables & filter rows (dplyr)
cars.price_5 <- cars.price %>%
  select(stroke, price, carbody, peakrpm, CarName, fueltype, drivewheel,
         curbweight, enginetype, horsepower) %>%
  filter(fueltype == "gas")
# summary statistics of numeric variables
summary(cars.price_5[, c(1, 2, 4, 8, 10)]) %>%
  kable(caption = "Summary statistics of numeric variables")

| stroke | price | peakrpm | curbweight | horsepower |
|---|---|---|---|---|
| Min. :2.070 | Min. : 5118 | Min. :4200 | Min. :1488 | Min. : 48.0 |
| 1st Qu.:3.100 | 1st Qu.: 7689 | 1st Qu.:4800 | 1st Qu.:2128 | 1st Qu.: 70.0 |
| Median :3.255 | Median : 9989 | Median :5200 | Median :2405 | Median : 97.0 |
| Mean :3.231 | Mean :13000 | Mean :5200 | Mean :2518 | Mean :106.2 |
| 3rd Qu.:3.400 | 3rd Qu.:15998 | 3rd Qu.:5500 | 3rd Qu.:2847 | 3rd Qu.:116.0 |
| Max. :4.170 | Max. :45400 | Max. :6600 | Max. :4066 | Max. :288.0 |
# convert a single character variable to a factor
cars.price_5$carbody <- as.factor(cars.price_5$carbody)
# convert the 5 character variables to factors in one step
cars.price_5[, -c(1, 2, 4, 8, 10)] <- lapply(cars.price_5[, -c(1, 2, 4, 8, 10)], as.factor)
# summary of factor variables
options(knitr.kable.NA = '')
summary(cars.price_5[, -c(1, 2, 4, 8, 10)]) %>%
  kable(caption = "Summary of factor variables")

| carbody | CarName | fueltype | drivewheel | enginetype |
|---|---|---|---|---|
| convertible: 6 | toyota corolla: 5 | gas:185 | 4wd: 9 | dohc : 12 |
| hardtop : 7 | toyota corona : 5 | | fwd:111 | dohcv: 1 |
| hatchback :69 | peugeot 504 : 4 | | rwd: 65 | l : 7 |
| sedan :81 | subaru dl : 4 | | | ohc :133 |
| wagon :22 | honda civic : 3 | | | ohcf : 15 |
| | mazda 626 : 3 | | | ohcv : 13 |
| | (Other) :161 | | | rotor: 4 |
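The conversion above identifies the character columns by position, which depends on the column order. One possible alternative, sketched here on a toy data frame with made-up values, is to detect the character columns programmatically and convert only those.

```r
# Toy data frame (made-up values, for illustration only)
toy <- data.frame(
  stroke   = c(2.68, 3.47, 3.40),
  carbody  = c("sedan", "wagon", "sedan"),
  fueltype = c("gas", "gas", "gas"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every character column
is_chr <- vapply(toy, is.character, logical(1))

# Convert only the character columns to factors; numeric columns are untouched
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Applied to the tutorial's data, the same two lines would work on `cars.price_5` in place of `toy`, without needing the index vector `-c(1, 2, 4, 8, 10)`.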
# use ggplot to make a histogram
ggplot(data = cars.price_5, aes(x = price)) + # select data and x variable
  geom_histogram(binwidth = 2000, color = "white", fill = "orange") +
  # change binwidth, bars border and fill color
  xlab("Price ($)") + # add label to the x axis
  ylab("Count") + # add label to the y axis
  ggtitle("Histogram of car prices") + # add title to the plot
  theme_minimal() # add a theme

The histogram of car prices shows that the distribution is not symmetrical: it is right-skewed. A few very high prices (outliers) stretch the right tail of the histogram. Most of the values range between $5,000 and $15,000.
# use ggplot to make a boxplot
ggplot(data = cars.price_5, aes(y = stroke)) + # select data and y variable
  geom_boxplot(fill = "orange") + # fill color
  ylab("Stroke") + # add label to the y axis
  ggtitle("Boxplot of stroke") + # add title to the plot
  theme_minimal() # add a theme

The boxplot of the variable stroke illustrates the distribution of the engine stroke, that is, the distance the piston travels inside the cylinder. The middle 50% of the values lie between 3.10 and 3.40 (the first and third quartiles in the summary table), with a median of about 3.26. The variable has outliers at both the low and the high end.
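The outliers in the boxplot can be checked by hand. By default, `geom_boxplot()` draws as separate points any values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to the quartiles, minimum and maximum of stroke reported in the summary table; the variable names are chosen for this example.

```r
# Values of stroke taken from the summary statistics table
q1 <- 3.10          # first quartile
q3 <- 3.40          # third quartile
stroke_min <- 2.07  # minimum
stroke_max <- 4.17  # maximum

iqr <- q3 - q1                   # interquartile range, about 0.30
lower_fence <- q1 - 1.5 * iqr    # about 2.65
upper_fence <- q3 + 1.5 * iqr    # about 3.85

# TRUE means the extreme value falls outside the whiskers
c(min_is_outlier = stroke_min < lower_fence,
  max_is_outlier = stroke_max > upper_fence)
```

Both checks return TRUE, which agrees with the boxplot showing outliers on both sides. With the data loaded, `quantile(cars.price_5$stroke, c(0.25, 0.5, 0.75))` gives the quartiles directly.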
# scatterplot
ggplot(cars.price_5, aes(y = price, x = horsepower)) +
  geom_point(color = "orange") +
  geom_smooth(method = "lm") + # add linear regression line
  ylab("Price ($)") + # add label to the y axis
  xlab("Horsepower") + # add label to the x axis
  ggtitle("Scatterplot of car prices by horsepower") + # add title to the plot
  theme_minimal() # add a theme

The scatterplot of car prices by horsepower shows a positive relationship between the two variables: as horsepower increases, car prices tend to increase as well.
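A positive relationship like this one can also be summarised numerically with a correlation coefficient and the slope of the fitted line. The sketch below uses a toy data frame whose values are made up for illustration; they are not taken from the car dataset.

```r
# Toy data frame (made-up values, for illustration only)
toy <- data.frame(
  horsepower = c(60, 80, 100, 120, 160),
  price      = c(6000, 8000, 11000, 15000, 24000)
)

# Pearson correlation: values close to 1 indicate a strong positive linear relationship
r <- cor(toy$horsepower, toy$price)

# Simple linear regression, the same kind of model geom_smooth(method = "lm") draws
fit <- lm(price ~ horsepower, data = toy)
slope <- unname(coef(fit)["horsepower"])  # estimated price change per extra unit of horsepower

r
slope
```

On the tutorial's data, the equivalent calls would be `cor(cars.price_5$horsepower, cars.price_5$price)` and `lm(price ~ horsepower, data = cars.price_5)`.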
# barplot
ggplot(cars.price_5, aes(x = drivewheel, fill = enginetype)) +
  geom_bar(position = "dodge", color = "white") + # bars side by side (dodged), not stacked
  xlab("Drive wheel") + # add label to the x axis
  ggtitle("Barplot of drive wheel and engine type") + # add title to the plot
  theme_minimal() + # add a theme
  theme(legend.position = "bottom") # change legend position

The barplot shows that most front-wheel-drive (fwd) cars have an ohc engine, which is also the most common engine type in the data.