R Final Project for the Bridge Course

# For the Bonus Question!
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/jcp9010/R-Week-2-HW-Assignment/master/cmc.data.csv")
birthcontrol <- read.csv(text = x)

Credit for this dataset belongs to:

Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (1999). A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms. Machine Learning. Forthcoming. (ftp://ftp.stat.wisc.edu/pub/loh/treeprogs/quest1.7/mach1317.pdf or (http://www.stat.wisc.edu/~limt/mach1317.pdf)

This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or do not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.

  1. Wife’s age (numerical)
  2. Wife’s education (categorical) 1=low, 2, 3, 4=high
  3. Husband’s education (categorical) 1=low, 2, 3, 4=high
  4. Number of children ever born (numerical)
  5. Wife’s religion (binary) 0=Non-Islam, 1=Islam
  6. Wife’s now working? (binary) 0=Yes, 1=No
  7. Husband’s occupation (categorical) 1, 2, 3, 4
  8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
  9. Media exposure (binary) 0=Good, 1=Not good
  10. Contraceptive method used (class attribute) 1=No-use 2=Long-term 3=Short-term

This database examines a population of married women in Indonesia and regarding their contraceptive use. With this data, we can take subsets of different populations, such as religion, socio-economics, media exposure, age, etc. and determine some correlation with whether or not this particular married woman is using contraceptives (either long-term, short-term, or none at all). Below is the data that has been loaded into “birthcontrol” and a summary of the results has been produced.

  1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
# birthcontrol <- read.csv("cmc.data.csv") - can ignore this line at the moment since csv file has been retrieved from github
# print(birthcontrol) - This will produce a very long data frame. Will # it out in the interest of space.
print(summary(birthcontrol))
##     wife_age        wife_edu        hus_edu       num_child     
##  Min.   :16.00   Min.   :1.000   Min.   :1.00   Min.   : 0.000  
##  1st Qu.:26.00   1st Qu.:2.000   1st Qu.:3.00   1st Qu.: 1.000  
##  Median :32.00   Median :3.000   Median :4.00   Median : 3.000  
##  Mean   :32.54   Mean   :2.959   Mean   :3.43   Mean   : 3.261  
##  3rd Qu.:39.00   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.: 4.000  
##  Max.   :49.00   Max.   :4.000   Max.   :4.00   Max.   :16.000  
##    wife_relig       wife_work         hus_job       stand_living  
##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :1.0000   Median :1.0000   Median :2.000   Median :3.000  
##  Mean   :0.8506   Mean   :0.7495   Mean   :2.138   Mean   :3.134  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :4.000   Max.   :4.000  
##    media_exp      contra_used  
##  Min.   :0.000   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:1.00  
##  Median :0.000   Median :2.00  
##  Mean   :0.074   Mean   :1.92  
##  3rd Qu.:0.000   3rd Qu.:3.00  
##  Max.   :1.000   Max.   :3.00

The summary function is great since it provides a very quick and easy method of rapidly assessing some data points within the dataframe.

The average (or mean) age of this study for a married woman is 32.54 years old. The range of the wife’s age is from 16 to 49. The study probably did not include women older than the age of 49 as they are most likely in menopause, and women in the age group > 49 may potentially skew the results of the study.

The overall conclusion on the average women in this study is that she is quite educated (with education mean of 2.959) and is married to an educated man (mean 3.43). The size of the typical family is 3.261 children, and the women are mostly muslim (religion mean of .8506 with 1 being muslim). The standard of living is approximately 3.134. The majority of women do NOT work (mean of 0.7495). Interestingly, this study already makes assumption that the husband is working and does not include a survey question of whether or not the husband works. Again, I suspect that is a reflection of the Indonesian culture where the husband typically is the income generator of the home.

  1. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

Several subsets will be created to look at different factors that could be contributing to contraceptive use. Age, Education level of the wife, if wife is currently working, media exposure, and standard of living index.

Before we go onto creating the subsets, it was worth acknolwedging that the authors for this paper had made three categorical values for the contraceptive used. 1: No Use 2: Long-term Use 3. Short-Term use.

The issue here is that when we cannot perform straight aggregate functions when it is in this fashion. For example, taking a mean of the birthcontrol$contra_used would be meaningless. Therefore, I had wrangled the data so that it is binary; either yes (takes birth control) or no (does NOT take birth control).

i = 1
while (i <= length(birthcontrol$contra_used)){
  if (birthcontrol[i,'contra_used'] == 3){
    birthcontrol[[i, 'contra_used']] <- 10
  }
  i <- i + 1
}

Initially, the values 1 (for NO) and 2 (for YES) were used. However, when I ran the initial above code, the values I was receiving in the summary() function contained values that differed by several hundredths of the point, making it difficult to truly appreciate if there were any notable differences in contraceptive use. So instead, I had changed the values to 1 (for NO) and 10 (for YES) for a more dramatic effect.

Age.Comparison.Older <- birthcontrol[ birthcontrol$wife_age > 35, c('wife_age', 'contra_used')]
colnames(Age.Comparison.Older) <- c('Wife Age','Contraceptive Used')

Age.Comparison.Younger <- birthcontrol[ birthcontrol$wife_age <= 35, c('wife_age', 'contra_used')]
colnames(Age.Comparison.Younger) <- c('Wife Age','Contraceptive Used')

Education.Comparison.Schooled <- birthcontrol[ birthcontrol$wife_edu >=3, c('wife_edu', 'contra_used')]
colnames(Education.Comparison.Schooled) <- c('Wife Education Status','Contraceptive Used')

Education.Comparison.NotSchooled <- birthcontrol[birthcontrol$wife_edu < 3, c('wife_edu', 'contra_used')]
colnames(Education.Comparison.NotSchooled) <- c('Wife Education Status','Contraceptive Used')

Working.Comparison.Yes <- birthcontrol[ birthcontrol$wife_work == 0, c('wife_work', 'contra_used')]
colnames(Working.Comparison.Yes) <- c('Wife Working','Contraceptive Used')

Working.Comparison.No <- birthcontrol[ birthcontrol$wife_work == 1, c('wife_work', 'contra_used')]
colnames(Working.Comparison.No) <- c('Wife Working','Contraceptive Used')

Media.Yes <- birthcontrol[ birthcontrol$media_exp == 0, c('media_exp', 'contra_used')]
colnames(Media.Yes) <- c('Media Exposure','Contraceptive Used')

Media.No <- birthcontrol[ birthcontrol$media_exp == 1, c('media_exp', 'contra_used')]
colnames(Media.No) <- c('Media Exposure','Contraceptive Used')

HighStandardofLiving <- birthcontrol[ birthcontrol$stand_living >= 3, c('stand_living', 'contra_used')]
colnames(HighStandardofLiving) <- c('Standard of Living','Contraceptive Used')

LowStandardofLiving <- birthcontrol[ birthcontrol$stand_living < 3, c('stand_living', 'contra_used')]
colnames(LowStandardofLiving) <- c('Standard of Living','Contraceptive Used')

Comparing women who are older than 35 vs. women are younger than or equal to 35.

print(summary(Age.Comparison.Older))
##     Wife Age     Contraceptive Used
##  Min.   :36.00   Min.   : 1.000    
##  1st Qu.:38.00   1st Qu.: 1.000    
##  Median :42.00   Median : 2.000    
##  Mean   :41.81   Mean   : 3.268    
##  3rd Qu.:45.00   3rd Qu.: 2.000    
##  Max.   :49.00   Max.   :10.000
print(summary(Age.Comparison.Younger))
##     Wife Age     Contraceptive Used
##  Min.   :16.00   Min.   : 1.000    
##  1st Qu.:24.00   1st Qu.: 1.000    
##  Median :27.00   Median : 2.000    
##  Mean   :27.39   Mean   : 4.948    
##  3rd Qu.:31.00   3rd Qu.:10.000    
##  Max.   :35.00   Max.   :10.000

For the older subset of women, the mean age was 41.81 years old and the mean contraceptive use was 3.268. While for the younger subset of women, the mean age was 27.39 and the mean was 4.948. Interestingly, these results seem to imply that younger women are more likely to use contraceptives.

Comparing women who are educated vs. women who are not educated. (Higher the number, the more educated the woman is.)

print(summary(Education.Comparison.Schooled))
##  Wife Education Status Contraceptive Used
##  Min.   :3.000         Min.   : 1.000    
##  1st Qu.:3.000         1st Qu.: 1.000    
##  Median :4.000         Median : 2.000    
##  Mean   :3.585         Mean   : 4.482    
##  3rd Qu.:4.000         3rd Qu.:10.000    
##  Max.   :4.000         Max.   :10.000
print(summary(Education.Comparison.NotSchooled))
##  Wife Education Status Contraceptive Used
##  Min.   :1.000         Min.   : 1.000    
##  1st Qu.:1.000         1st Qu.: 1.000    
##  Median :2.000         Median : 1.000    
##  Mean   :1.687         Mean   : 4.076    
##  3rd Qu.:2.000         3rd Qu.:10.000    
##  Max.   :2.000         Max.   :10.000

In this category, it appears that educated women tend be slightly more likely to take contraceptives vs. their uneducated counterparts.

Comparing women who are working vs. women who are not working. (In this study, 0 means yes)

print(summary(Working.Comparison.Yes))
##   Wife Working Contraceptive Used
##  Min.   :0     Min.   : 1.000    
##  1st Qu.:0     1st Qu.: 1.000    
##  Median :0     Median : 2.000    
##  Mean   :0     Mean   : 3.924    
##  3rd Qu.:0     3rd Qu.:10.000    
##  Max.   :0     Max.   :10.000
print(summary(Working.Comparison.No))
##   Wife Working Contraceptive Used
##  Min.   :1     Min.   : 1.00     
##  1st Qu.:1     1st Qu.: 1.00     
##  Median :1     Median : 2.00     
##  Mean   :1     Mean   : 4.49     
##  3rd Qu.:1     3rd Qu.:10.00     
##  Max.   :1     Max.   :10.00

Interestingly, nonworking women appear to be more likely to use contraceptives vs. working women.

Does media influence women’s decision to take contraceptives? (In this study, 0 means yes)

print(summary(Media.Yes))
##  Media Exposure Contraceptive Used
##  Min.   :0      Min.   : 1.000    
##  1st Qu.:0      1st Qu.: 1.000    
##  Median :0      Median : 2.000    
##  Mean   :0      Mean   : 4.444    
##  3rd Qu.:0      3rd Qu.:10.000    
##  Max.   :0      Max.   :10.000
print(summary(Media.No))
##  Media Exposure Contraceptive Used
##  Min.   :1      Min.   : 1.000    
##  1st Qu.:1      1st Qu.: 1.000    
##  Median :1      Median : 1.000    
##  Mean   :1      Mean   : 3.156    
##  3rd Qu.:1      3rd Qu.: 2.000    
##  Max.   :1      Max.   :10.000

It appears that the media exposure likely impacts women’s decision to take contraceptives (more likely to take them).

And lastly, does a higher standard of living make a difference in a woman’s decision to take contraceptives?

print(summary(HighStandardofLiving))
##  Standard of Living Contraceptive Used
##  Min.   :3.000      Min.   : 1.000    
##  1st Qu.:3.000      1st Qu.: 1.000    
##  Median :4.000      Median : 2.000    
##  Mean   :3.613      Mean   : 4.404    
##  3rd Qu.:4.000      3rd Qu.:10.000    
##  Max.   :4.000      Max.   :10.000
print(summary(LowStandardofLiving))
##  Standard of Living Contraceptive Used
##  Min.   :1.00       Min.   : 1.000    
##  1st Qu.:1.00       1st Qu.: 1.000    
##  Median :2.00       Median : 1.000    
##  Mean   :1.64       Mean   : 4.176    
##  3rd Qu.:2.00       3rd Qu.:10.000    
##  Max.   :2.00       Max.   :10.000

Despite the fact that the mean of High Standard of Living is 3.613 and the mean of Low Standard of Living is 1.64, there does not appear to be a significant difference in contraceptive use.

  1. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

I will demonstrate statistical graphics via 2 methods. One method is via from the “Readings for Week 3 Assignemnts” and the second method is via ggplot2.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2

A histogram is a great way to show representation of the distribution of numerical data. In this case, we will use age to demonstrate how frequent each age group was present

Frequency Plot:

birthcontrol.age <- birthcontrol$wife_age
birthcontrol.contraceptiveuse <- birthcontrol$contra_used

# Given ages 16 to 49, will make 33 bins
histogram.birthcontrol.age <- hist(birthcontrol.age, breaks = 33, main = "Age of Women in Study")

Now to make a Density Plot:

hist(birthcontrol.age, freq = FALSE, main = "Density Plot of the Age Group of Women in this Study")

And to add the aesthetics: (Including what would would be a normal distribution curve)

hist(birthcontrol.age, freq = FALSE, xlab = "Years in Age", main = "Density Plot of the Age Group of Women in this Study", col="lightgreen")
curve(dnorm(x, mean=mean(birthcontrol.age), sd=sd(birthcontrol.age)), add=TRUE, col="darkblue", lwd=2)

Now, let’s make a historgram via ggplot2.

pl <- ggplot(birthcontrol, aes(x=wife_age))
pl2 <- pl + geom_histogram(binwidth = 1, color='red',fill='pink', alpha = 0.4)
pl3 <- pl2 + xlab('Age in Years') + ylab('Count')
print(pl3 + ggtitle("Age of Women in the Study"))

For the sake of my assignment, I will need to use a different dataset to adequately demonstrate the boxplot as well as the scatterplot. The Indonesian Birth Control Data set is all categorical and is more conducive to histograms than to the other two. As a result, I will use the nycflights13 dataset to complete the assignment.

library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.3.2

Below is the first six rows of dataframe nycflights13 to demonstrate what data is stored.

print(head(flights))
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

And below is the summary of the flights database.

print(summary(flights))
##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##  sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124  
##  Median :1359   Median :  -2.00   Median :1535   Median :1556  
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536  
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945  
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##                 NA's   :8255      NA's   :8713                 
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776     
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character  
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character  
##  Mean   :   6.895                      Mean   :1972                     
##  3rd Qu.:  14.000                      3rd Qu.:3465                     
##  Max.   :1272.000                      Max.   :8500                     
##  NA's   :9430                                                           
##     origin              dest              air_time        distance   
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502  
##  Mode  :character   Mode  :character   Median :129.0   Median : 872  
##                                        Mean   :150.7   Mean   :1040  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##                                        NA's   :9430                  
##       hour           minute        time_hour                  
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:02:36  
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

In the nycflights13 dataset, I will use a scatterplot to plot distance vs. air time. The presumption is that the longer the distance, the longer the air time. By using the built in plot function, a simple scatterplot can be displayed. Below uses a random sample of 100 rows in the flights dataframe.

random.flights <- flights[sample(nrow(flights), 100), ]
attach(random.flights)
print(plot(distance,air_time, main="Scatterplot: Distance vs. Air Time", xlab = "Flight Distance", ylab = "Air Time"))

## NULL

Below are fit lines for this scatterplot.

attach(random.flights)
## The following objects are masked from random.flights (pos = 3):
## 
##     air_time, arr_delay, arr_time, carrier, day, dep_delay,
##     dep_time, dest, distance, flight, hour, minute, month, origin,
##     sched_arr_time, sched_dep_time, tailnum, time_hour, year
print(plot(distance,air_time, main="Scatterplot: Distance vs. Air Time", xlab = "Flight Distance", ylab = "Air Time"))
## NULL
abline(lm(air_time~distance), col="red") # regression line (y~x)
lines(lowess(distance,air_time), col="blue") # lowess line (x,y)

For more complex scatterplots, ggplot2 will be utilized.

pl.scatter <- ggplot(random.flights, aes(x=distance, y=air_time))
pl.scatter.2 <- pl.scatter + geom_point()
print(pl.scatter.2)
## Warning: Removed 2 rows containing missing values (geom_point).

The beauty of ggplot2 is the the additional information that can be made to the scatterplot. This addition to the geom_point will provide a third set of information; that is, different carriers by using different shapes.

pl.scatter <- ggplot(random.flights, aes(x=distance, y=air_time))
pl.scatter.2 <- pl.scatter + geom_point(aes(shape = carrier))
print(pl.scatter.2)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 10. Consider specifying shapes manually if you must have them.
## Warning: Removed 35 rows containing missing values (geom_point).

Or we can even do them by colors.

pl.scatter.2 <- pl.scatter + geom_point(aes(color = carrier))
print(pl.scatter.2)
## Warning: Removed 2 rows containing missing values (geom_point).

A boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. What I will be doing with this portion is adding another column to the flights dataframe with average speed of each flight, by taking distance/air_time. I will create a box plot for the average speed (miles/hr) for each carrier.

speedDF <- data.frame(speed = flights$distance/(flights$air_time/60)) 
flights2 <- cbind(flights, speedDF)
print(head(flights2)) #Using head to demonstrate that speed was added on as a column
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute           time_hour    speed
## 1     15 2013-01-01 05:00:00 370.0441
## 2     29 2013-01-01 05:00:00 374.2731
## 3     40 2013-01-01 05:00:00 408.3750
## 4     45 2013-01-01 05:00:00 516.7213
## 5      0 2013-01-01 06:00:00 394.1379
## 6     58 2013-01-01 05:00:00 287.6000
Carriers <- as.vector(flights2$carrier)
Carriers <- unique(Carriers)
print(factor(Carriers))
##  [1] UA AA B6 DL EV MQ US WN VX FL AS 9E F9 HA YV OO
## Levels: 9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV
print(paste0("Number of Carriers: ", length(Carriers)))
## [1] "Number of Carriers: 16"
# There are 16 different unique carriers.
# Will create 16 different vectors holding each flight speed time for each specific carrier.

flights3 <- data.frame(flights2$carrier, flights2$speed)

NineE <- flights3[ (flights3$flights2.carrier == '9E'), ]
AA <- flights3[ (flights3$flights2.carrier == 'AA'), ]
AS <- flights3[ (flights3$flights2.carrier == 'AS'), ]
B6 <- flights3[ (flights3$flights2.carrier == 'B6'), ]
DL <- flights3[ (flights3$flights2.carrier == 'DL'), ]
EV <- flights3[ (flights3$flights2.carrier == 'EV'), ]
F9 <- flights3[ (flights3$flights2.carrier == 'F9'), ]
FL <- flights3[ (flights3$flights2.carrier == 'FL'), ]
HA <- flights3[ (flights3$flights2.carrier == 'HA'), ]
MQ <- flights3[ (flights3$flights2.carrier == 'MQ'), ]
OO <- flights3[ (flights3$flights2.carrier == 'OO'), ]
UA <- flights3[ (flights3$flights2.carrier == 'UA'), ]
US <- flights3[ (flights3$flights2.carrier == 'US'), ]
VX <- flights3[ (flights3$flights2.carrier == 'VX'), ]
WN <- flights3[ (flights3$flights2.carrier == 'WN'), ]
YV <- flights3[ (flights3$flights2.carrier == 'YV'), ]

NineE <- as.vector(NineE$flights2.speed)
AA <- as.vector(AA$flights2.speed)
AS <- as.vector(AS$flights2.speed)
B6 <- as.vector(B6$flights2.speed)
DL <- as.vector(DL$flights2.speed)
EV <- as.vector(EV$flights2.speed)
F9 <- as.vector(F9$flights2.speed)
FL <- as.vector(FL$flights2.speed)
HA <- as.vector(HA$flights2.speed)
MQ <- as.vector(MQ$flights2.speed)
OO <- as.vector(OO$flights2.speed)
UA <- as.vector(UA$flights2.speed)
US <- as.vector(US$flights2.speed)
VX <- as.vector(VX$flights2.speed)
WN <- as.vector(WN$flights2.speed)
YV <- as.vector(YV$flights2.speed)

boxplot(NineE, AA, AS, B6, DL, EV, F9, FL, HA, MQ, OO, UA, US, VX, WN, YV, ylab = "Speed MPH", xlab = "Airline Carriers", las = 2, names = c("9E","AA","AS","B6","DL","EV","F9","FL","HA","MQ","OO","UA","UA","VX","WN","YV"))

We can also express boxplots via ggplot2.

temp <- ggplot(flights3, aes(x=factor(flights2.carrier), y=flights2.speed))
temp2 <- temp + geom_boxplot(aes(fill=factor(flights2.carrier)))
print(temp2)
## Warning: Removed 9430 rows containing non-finite values (stat_boxplot).

  1. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Throughout this final project, I had the fortune of going over two datasets.

Data set 1: What factors influence married women to use birth control in Indonesia? Some of the findings in the Indonesian Women study to be interesting. It appears that younger, non-working women, who were exposed to the media were more likely to use contraception. However, the standard of living, whether or not if the woman was working, or education did not seem to influence whether or not women would use contraception.

Data set 2: Which airline appeared to be the fastest on average? When examining the boxplot, it appears that carrier HA may be the quickest on average. However, interestingly, they also appear to have the last amount of outliers as well. But overall, the mean speed for all carriers fall between the ranges of 300 and 500, which demonstrate a significant amount of variability between carriers.