Assignment 4

Assignment Description

Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following:

- histogram
- boxplot
- scatterplot

Aircraft Dataset

Description:

The Aircraft Data set covers 23 single-engine aircraft built between 1947 and 1979 and comes from the Office of Naval Research. The dependent variable is cost (in units of $100,000), and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of the plane (in pounds), and maximal thrust.

Format:

A data frame with 23 observations on the following 5 variables.

- X1: Aspect Ratio
- X2: Lift-to-Drag Ratio
- X3: Weight
- X4: Thrust
- Y: Cost

Source:

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 154, table 22.

Links:

The aircraft dataset from the robustbase package, as distributed through Rdatasets.

The CSV data file was downloaded from the following link, where it is listed under the title Aircraft Data: http://vincentarelbundock.github.io/Rdatasets/datasets.html

Data description link: http://vincentarelbundock.github.io/Rdatasets/doc/robustbase/aircraft.html

Objective

The objective of the report is to analyze the effect of the independent variables on the dependent variable, which is the cost of the aircraft.

Load the required libraries

library(ggplot2)
library(grid)
library(gridExtra)

Import Dataset

Import the downloaded aircraft dataset from the R working directory into the RStudio environment and store it in aircraft_data.

aircraft_data <- read.csv("aircraft.csv")
head(aircraft_data)
##   X  X1  X2    X3   X4    Y
## 1 1 6.3 1.7  8176 4500 2.76
## 2 2 6.0 1.9  6699 3120 4.76
## 3 3 5.9 1.5  9663 6300 8.75
## 4 4 3.0 1.2 12837 9800 7.78
## 5 5 5.0 1.8 10205 4900 6.18
## 6 6 6.3 2.0 14890 6500 9.50

Rename the columns so that the headers describe the variables.

colnames(aircraft_data) <- c("Id", "Aspect_Ratio", "Lift_to_Drag_Ratio", "Weight", "Thrust", "Cost")
head(aircraft_data)
##   Id Aspect_Ratio Lift_to_Drag_Ratio Weight Thrust Cost
## 1  1          6.3                1.7   8176   4500 2.76
## 2  2          6.0                1.9   6699   3120 4.76
## 3  3          5.9                1.5   9663   6300 8.75
## 4  4          3.0                1.2  12837   9800 7.78
## 5  5          5.0                1.8  10205   4900 6.18
## 6  6          6.3                2.0  14890   6500 9.50

Primary analysis

Let's take a look at the dataset summary

summary(aircraft_data)
##        Id        Aspect_Ratio   Lift_to_Drag_Ratio     Weight     
##  Min.   : 1.0   Min.   :1.600   Min.   :1.20       Min.   : 6699  
##  1st Qu.: 6.5   1st Qu.:2.850   1st Qu.:1.75       1st Qu.:12232  
##  Median :12.0   Median :3.900   Median :2.30       Median :14890  
##  Mean   :12.0   Mean   :3.909   Mean   :2.80       Mean   :18092  
##  3rd Qu.:17.5   3rd Qu.:4.750   3rd Qu.:3.50       3rd Qu.:22645  
##  Max.   :23.0   Max.   :6.300   Max.   :9.70       Max.   :46172  
##      Thrust           Cost        
##  Min.   : 3120   Min.   :  2.760  
##  1st Qu.: 7980   1st Qu.:  8.265  
##  Median :14500   Median : 13.590  
##  Mean   :15913   Mean   : 20.327  
##  3rd Qu.:22050   3rd Qu.: 27.160  
##  Max.   :37000   Max.   :107.100

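From the summary, the aspect ratio ranges from 1.6 to 6.3, with a mean of 3.91 and a median of 3.90, so its distribution is roughly symmetric. Cost behaves differently: its mean (20.33) is well above its median (13.59) and its maximum is 107.1, which points to a right-skewed distribution with a few expensive aircraft.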

Histogram Analysis

Let's plot a histogram to inspect the frequency distribution of the dependent variable.

hist(aircraft_data$Cost, main = "Cost Distribution (in units of $100,000)"
     ,xlab = "Cost",ylab = "Frequency"
     ,breaks = 20
     ,col = "Orange")

The right-skewed histogram indicates that most of the aircraft cost less than 40 (in units of $100,000), with two exceptions at roughly 60 and 100.

Boxplot

Let's look at a boxplot of the cost variable to check for outliers and to see where the mean falls.

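# One box for Cost: outliers are drawn in green, the diamond marks the mean.
# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
ggplot(aircraft_data, aes(1, Cost)) +
  geom_boxplot(outlier.colour = "green", outlier.size = 3) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4)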
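
The points a boxplot flags as outliers follow the usual 1.5 * IQR rule. As a worked check, the sketch below computes the fences from the Cost quartiles printed by summary() above. It assumes those quartiles coincide with the hinges the boxplot uses, and the variable names are illustrative only.

```r
# Cost quartiles and extremes copied from the summary() output above.
q1       <- 8.265
q3       <- 27.160
min_cost <- 2.760
max_cost <- 107.100

# Interquartile range and the 1.5 * IQR fences.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))

# TRUE means the most expensive aircraft lies beyond the upper fence.
print(max_cost > upper_fence)

# FALSE means there is no outlier on the low side.
print(min_cost < lower_fence)
```

Under that assumption, the two high-cost aircraft seen in the histogram (around 60 and 100) both lie above the upper fence, which matches the points highlighted in the boxplot.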

Multiple Linear Regression

To determine which of the independent variables have an effect on cost, we use multiple linear regression, which models the response as a function of several explanatory variables.

Before fitting our regression model, we want to investigate how the variables are related to one another

plot(aircraft_data)

From the plot we can observe that thrust increases as weight increases and as the lift-to-drag ratio decreases, and that aspect ratio appears to have an effect on the lift-to-drag ratio.
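
A correlation matrix is one way to put numbers on what the scatterplot matrix shows. The sketch below is illustrative only: it uses just the six rows printed by head() earlier (the full file has 23 rows), so the coefficients it prints are not the report's results. The name aircraft_head is hypothetical.

```r
# First six rows as printed by head() above, without the Id column.
aircraft_head <- data.frame(
  Aspect_Ratio       = c(6.3, 6.0, 5.9, 3.0, 5.0, 6.3),
  Lift_to_Drag_Ratio = c(1.7, 1.9, 1.5, 1.2, 1.8, 2.0),
  Weight             = c(8176, 6699, 9663, 12837, 10205, 14890),
  Thrust             = c(4500, 3120, 6300, 9800, 4900, 6500),
  Cost               = c(2.76, 4.76, 8.75, 7.78, 6.18, 9.50)
)

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(aircraft_head)
print(round(cor_mat, 2))
```

On the full data, the same call would be cor(aircraft_data[, -1]), which drops the Id column before computing the correlations.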

Based on these observations, let's take a closer look at the relationships between these variables.

p1<-ggplot(aircraft_data, aes(x =Aspect_Ratio, y =Lift_to_Drag_Ratio)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

p2<-ggplot(aircraft_data, aes(x =Weight, y =Thrust)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

p3<-ggplot(aircraft_data, aes(x =Weight, y =Cost)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

p4<-ggplot(aircraft_data, aes(x =Aspect_Ratio, y =Cost)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

grid.arrange(p1, p2, p3, p4)

The plots show a very weak relationship between aspect ratio and lift-to-drag ratio, and a strong relationship between weight and thrust. Comparing cost against weight with cost against aspect ratio, a higher aspect ratio is associated with a lower cost, while a greater weight is associated with a higher cost. The big question is which of the two variables has more influence on the output. To answer it, we fit multiple regression models and look at the significance level.

multi_reg <- lm(Cost ~ Aspect_Ratio + Weight, data = aircraft_data)
All <- lm(Cost ~ Aspect_Ratio + Weight + Thrust + Lift_to_Drag_Ratio, data = aircraft_data)
anova(multi_reg, All)
## Analysis of Variance Table
## 
## Model 1: Cost ~ Aspect_Ratio + Weight
## Model 2: Cost ~ Aspect_Ratio + Weight + Thrust + Lift_to_Drag_Ratio
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     20 2365.5                                
## 2     18 1271.8  2    1093.6 7.7388 0.003755 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

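The p-value of 0.003755 is below the 5% significance level, so the reduced model is rejected in favour of the full one: adding Thrust and Lift_to_Drag_Ratio significantly improves the fit. This means that at least one of these two variables influences Cost beyond what Aspect_Ratio and Weight already explain. The test does not, by itself, show that every variable has an effect.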
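
The F statistic in the table can be reproduced by hand from the two residual sums of squares. The sketch below uses only the rounded values printed in the ANOVA output above, so it agrees with the table up to rounding. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the ANOVA table above (rounded as printed).
rss_reduced <- 2365.5  # Model 1: Cost ~ Aspect_Ratio + Weight
rss_full    <- 1271.8  # Model 2: all four predictors
df_reduced  <- 20      # residual degrees of freedom, Model 1
df_full     <- 18      # residual degrees of freedom, Model 2

# Partial F-test: drop in RSS per added predictor, relative to the
# residual variance of the full model.
df_added <- df_reduced - df_full
f_stat   <- ((rss_reduced - rss_full) / df_added) / (rss_full / df_full)

# Upper-tail probability under an F distribution with (2, 18) df.
p_value <- pf(f_stat, df_added, df_full, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

A small p-value here says only that the two added predictors jointly reduce the residual sum of squares by more than chance would explain. Looking at summary(All) would be the natural next step for judging each coefficient separately.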

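Now we can draw a normal Q-Q plot of the residuals of the reduced model (multi_reg) to check whether they are approximately normally distributed. Points that lie close to the reference line support the normality assumption.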

# residuals() is used instead of relying on partial matching of `$res`.
qqnorm(residuals(multi_reg))
qqline(residuals(multi_reg))
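
A Q-Q plot is a visual check, and a Shapiro-Wilk test is one possible numerical companion to it. The sketch below is illustrative only: it refits the reduced model on the six rows printed by head() earlier (the full file has 23 rows), so its output is not the report's result. The names aircraft_head, fit_head, and res_head are hypothetical.

```r
# First six rows as printed by head() above; only the columns the
# reduced model needs.
aircraft_head <- data.frame(
  Aspect_Ratio = c(6.3, 6.0, 5.9, 3.0, 5.0, 6.3),
  Weight       = c(8176, 6699, 9663, 12837, 10205, 14890),
  Cost         = c(2.76, 4.76, 8.75, 7.78, 6.18, 9.50)
)

# Same formula as multi_reg, fitted on the six-row subset.
fit_head <- lm(Cost ~ Aspect_Ratio + Weight, data = aircraft_head)
res_head <- residuals(fit_head)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
sw <- shapiro.test(res_head)
print(sw$p.value)
```

On the full data, the equivalent call would be shapiro.test(residuals(multi_reg)).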