Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following:
- histogram
- boxplot
- scatterplot
Description:
The Aircraft Data set, from the Office of Naval Research, covers 23 single-engine aircraft built over the years 1947-1979. The dependent variable is cost (in units of $100,000) and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of plane (in pounds) and maximal thrust.
Format:
A data frame with 23 observations on the following 5 variables.
X1: Aspect Ratio
X2: Lift-to-Drag Ratio
X3: Weight
X4: Thrust
Y: Cost
Source:
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 154, table 22.
Links:
Aircraft from Aircraft Data, robustbase data base package, Rdatasets
The CSV data file was downloaded from the following link, where it is listed under the title Aircraft Data: http://vincentarelbundock.github.io/Rdatasets/datasets.html
Data description link: http://vincentarelbundock.github.io/Rdatasets/doc/robustbase/aircraft.html
The objective of this report is to analyze the effect of the independent variables on the dependent variable, the cost of the aircraft.
library(ggplot2)
library(grid)
library(gridExtra)
Import the downloaded aircraft dataset from the R working directory into the RStudio environment and store it in aircraft_data.
aircraft_data <- read.csv("aircraft.csv")
head(aircraft_data)
##   X  X1  X2    X3   X4    Y
## 1 1 6.3 1.7  8176 4500 2.76
## 2 2 6.0 1.9  6699 3120 4.76
## 3 3 5.9 1.5  9663 6300 8.75
## 4 4 3.0 1.2 12837 9800 7.78
## 5 5 5.0 1.8 10205 4900 6.18
## 6 6 6.3 2.0 14890 6500 9.50
Rename the columns so that the headers describe the variables they hold.
colnames(aircraft_data) <- c("Id", "Aspect_Ratio", "Lift_to_Drag_Ratio", "Weight", "Thrust", "Cost")
head(aircraft_data)
##   Id Aspect_Ratio Lift_to_Drag_Ratio Weight Thrust Cost
## 1  1          6.3                1.7   8176   4500 2.76
## 2  2          6.0                1.9   6699   3120 4.76
## 3  3          5.9                1.5   9663   6300 8.75
## 4  4          3.0                1.2  12837   9800 7.78
## 5  5          5.0                1.8  10205   4900 6.18
## 6  6          6.3                2.0  14890   6500 9.50
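The renaming above depends on the columns arriving in a fixed order. As a hedged alternative, the sketch below renames by the original header instead of by position. It uses only the six rows printed by head(); the names raw and new_names are introduced here for illustration and are not part of the report.

```r
# Six rows copied from the head() output above (not the full dataset)
raw <- data.frame(
  X  = 1:6,
  X1 = c(6.3, 6.0, 5.9, 3.0, 5.0, 6.3),
  X2 = c(1.7, 1.9, 1.5, 1.2, 1.8, 2.0),
  X3 = c(8176, 6699, 9663, 12837, 10205, 14890),
  X4 = c(4500, 3120, 6300, 9800, 4900, 6500),
  Y  = c(2.76, 4.76, 8.75, 7.78, 6.18, 9.50)
)

# Map each original header to its descriptive name
new_names <- c(
  X  = "Id",
  X1 = "Aspect_Ratio",
  X2 = "Lift_to_Drag_Ratio",
  X3 = "Weight",
  X4 = "Thrust",
  Y  = "Cost"
)

# Stop early if the file contains a header the mapping does not know
stopifnot(all(names(raw) %in% names(new_names)))

# Look up each existing header, so column order no longer matters
names(raw) <- unname(new_names[names(raw)])
print(names(raw))
```

If the CSV were ever exported with its columns in a different order, the lookup would still attach each name to the right variable, whereas a positional assignment would silently mislabel them.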
Let's take a look at the dataset summary
summary(aircraft_data)
##        Id        Aspect_Ratio   Lift_to_Drag_Ratio     Weight
##  Min.   : 1.0   Min.   :1.600   Min.   :1.20       Min.   : 6699
##  1st Qu.: 6.5   1st Qu.:2.850   1st Qu.:1.75       1st Qu.:12232
##  Median :12.0   Median :3.900   Median :2.30       Median :14890
##  Mean   :12.0   Mean   :3.909   Mean   :2.80       Mean   :18092
##  3rd Qu.:17.5   3rd Qu.:4.750   3rd Qu.:3.50       3rd Qu.:22645
##  Max.   :23.0   Max.   :6.300   Max.   :9.70       Max.   :46172
##      Thrust           Cost
##  Min.   : 3120   Min.   :  2.760
##  1st Qu.: 7980   1st Qu.:  8.265
##  Median :14500   Median : 13.590
##  Mean   :15913   Mean   : 20.327
##  3rd Qu.:22050   3rd Qu.: 27.160
##  Max.   :37000   Max.   :107.100
From the summary, the aspect ratio ranges from 1.6 to 6.3, with a mean of 3.91 and a nearly identical median of 3.90, so the typical aircraft in this set sits near the middle of that range.
Let's plot a histogram to inspect the frequency distribution of the dependent variable.
hist(aircraft_data$Cost,
     main = "Cost Distribution (in units of $100,000)",
     xlab = "Cost", ylab = "Frequency",
     breaks = 20,
     col = "orange")
The right-skewed histogram indicates that most of the aircraft cost under 40 (in units of $100,000), with two exceptions at higher costs of about 60 and 100.
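As a rough arithmetic check on those high values, the sketch below applies the common 1.5 × IQR convention to the Cost quartiles printed by summary() earlier. The choice of that convention is an assumption made here, not a rule stated in the report, and the variable names are illustrative.

```r
# Cost statistics as printed by summary() above (units of $100,000)
q1       <- 8.265    # 1st quartile
q3       <- 27.160   # 3rd quartile
max_cost <- 107.100  # maximum

# Interquartile range and the usual 1.5 * IQR fences
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

cat("IQR:", iqr, "\n")
cat("Upper fence:", upper_fence, "\n")
cat("Maximum cost lies beyond the upper fence:", max_cost > upper_fence, "\n")
```

Under this convention any cost above the upper fence would be flagged as a potential outlier. The lower fence is negative, so no low-cost aircraft can be flagged.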
Let's look at a boxplot of the cost variable to check for outliers and to see the mean value.
# A single boxplot of Cost; x = "" gives one unlabeled category
ggplot(aircraft_data, aes(x = "", y = Cost)) +
  geom_boxplot(outlier.colour = "green", outlier.size = 3) +
  # fun replaces the fun.y argument, deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4)
To determine which of the independent variables have an effect on cost, we can use multiple linear regression, in which cost is modeled as a function of several explanatory variables.
Before fitting our regression model, we want to investigate how the variables are related to one another
plot(aircraft_data)
From the scatterplot matrix we can observe that thrust increases when weight increases and when the lift-to-drag ratio decreases, and that aspect ratio appears to have an effect on the lift-to-drag ratio.
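The scatterplot matrix also includes the Id column, which is only a row index. A minimal sketch of a numeric companion to the plot is to drop Id and compute pairwise Pearson correlations. It uses only the six rows printed by head() earlier, so the coefficients are illustrative and are not the values for the full 23-aircraft dataset; aircraft_head, num_vars and cor_matrix are names introduced here.

```r
# Six rows copied from the head() output earlier in the report
aircraft_head <- data.frame(
  Id                 = 1:6,
  Aspect_Ratio       = c(6.3, 6.0, 5.9, 3.0, 5.0, 6.3),
  Lift_to_Drag_Ratio = c(1.7, 1.9, 1.5, 1.2, 1.8, 2.0),
  Weight             = c(8176, 6699, 9663, 12837, 10205, 14890),
  Thrust             = c(4500, 3120, 6300, 9800, 4900, 6500),
  Cost               = c(2.76, 4.76, 8.75, 7.78, 6.18, 9.50)
)

# Drop the row index so it does not appear as a spurious variable
num_vars <- aircraft_head[, setdiff(names(aircraft_head), "Id")]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(num_vars), 2)
print(cor_matrix)
```

On the full data frame the same two steps would be applied to aircraft_data, and plot(num_vars) would draw the scatterplot matrix without the Id panels.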
Based on these observations, let's take a closer look at the relationships between these variables.
p1 <- ggplot(aircraft_data, aes(x = Aspect_Ratio, y = Lift_to_Drag_Ratio)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
p2 <- ggplot(aircraft_data, aes(x = Weight, y = Thrust)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
p3 <- ggplot(aircraft_data, aes(x = Weight, y = Cost)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
p4 <- ggplot(aircraft_data, aes(x = Aspect_Ratio, y = Cost)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
grid.arrange(p1, p2, p3, p4)
The plots show a very weak relationship between aspect ratio and lift-to-drag ratio, and a strong relationship between weight and thrust. Comparing cost against weight with cost against aspect ratio, a higher aspect ratio is associated with lower cost, while greater weight is associated with higher cost. The big question is which of the two variables has more influence on the output. To answer it, we fit multiple regression models and look at the significance level.
multi_reg <- lm(Cost ~ Aspect_Ratio + Weight, data = aircraft_data)
All <- lm(Cost ~ Aspect_Ratio + Weight + Thrust + Lift_to_Drag_Ratio, data = aircraft_data)
anova(multi_reg, All)
## Analysis of Variance Table
##
## Model 1: Cost ~ Aspect_Ratio + Weight
## Model 2: Cost ~ Aspect_Ratio + Weight + Thrust + Lift_to_Drag_Ratio
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1     20 2365.5
## 2     18 1271.8  2    1093.6 7.7388 0.003755 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
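The p-value of 0.0038 is below the 5% level, so the full model fits significantly better than the two-predictor model. This means that at least one of the added variables (Thrust, Lift_to_Drag_Ratio) helps explain Cost beyond Aspect_Ratio and Weight. The test by itself does not show that every variable influences Cost.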
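To make the comparison concrete, the F statistic in the table can be reproduced by hand from the printed residual sums of squares and degrees of freedom. The sketch below uses only the rounded numbers shown in the ANOVA output, so its results should agree with the printed F and p-value only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table printed above
rss_full   <- 1271.8  # residual sum of squares, Model 2 (four predictors)
df_reduced <- 20      # residual degrees of freedom, Model 1
df_full    <- 18      # residual degrees of freedom, Model 2
sum_sq     <- 1093.6  # "Sum of Sq": drop in RSS from Model 1 to Model 2

# Number of extra parameters in the full model
df_diff <- df_reduced - df_full

# F = (drop in RSS per added parameter) / (residual mean square of full model)
f_stat <- (sum_sq / df_diff) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (df_diff, df_full) df
p_value <- pf(f_stat, df1 = df_diff, df2 = df_full, lower.tail = FALSE)

cat("F statistic:", round(f_stat, 4), "\n")
cat("p-value:", signif(p_value, 4), "\n")
```

The numerator measures how much residual variation the two added predictors remove, and the denominator scales it by the noise left in the full model, which is why a large F favors keeping at least one of the added variables.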
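Now we can draw a normal Q-Q plot of the residuals from the two-predictor model and check how closely the points follow the reference line.
# residuals() avoids relying on partial matching of multi_reg$res
qqnorm(residuals(multi_reg))
qqline(residuals(multi_reg))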