# For the Bonus Question!
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/jcp9010/R-Week-2-HW-Assignment/master/cmc.data.csv")
birthcontrol <- read.csv(text = x)
Credit for this dataset belongs to:
Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (1999). A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms. Machine Learning. Forthcoming. (ftp://ftp.stat.wisc.edu/pub/loh/treeprogs/quest1.7/mach1317.pdf or (http://www.stat.wisc.edu/~limt/mach1317.pdf)
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or do not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
This database examines a population of married women in Indonesia and regarding their contraceptive use. With this data, we can take subsets of different populations, such as religion, socio-economics, media exposure, age, etc. and determine some correlation with whether or not this particular married woman is using contraceptives (either long-term, short-term, or none at all). Below is the data that has been loaded into “birthcontrol” and a summary of the results has been produced.
# birthcontrol <- read.csv("cmc.data.csv") - can ignore this line at the moment since csv file has been retrieved from github
# print(birthcontrol) - This will produce a very long data frame. Will # it out in the interest of space.
print(summary(birthcontrol))
## wife_age wife_edu hus_edu num_child
## Min. :16.00 Min. :1.000 Min. :1.00 Min. : 0.000
## 1st Qu.:26.00 1st Qu.:2.000 1st Qu.:3.00 1st Qu.: 1.000
## Median :32.00 Median :3.000 Median :4.00 Median : 3.000
## Mean :32.54 Mean :2.959 Mean :3.43 Mean : 3.261
## 3rd Qu.:39.00 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.: 4.000
## Max. :49.00 Max. :4.000 Max. :4.00 Max. :16.000
## wife_relig wife_work hus_job stand_living
## Min. :0.0000 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:3.000
## Median :1.0000 Median :1.0000 Median :2.000 Median :3.000
## Mean :0.8506 Mean :0.7495 Mean :2.138 Mean :3.134
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :4.000 Max. :4.000
## media_exp contra_used
## Min. :0.000 Min. :1.00
## 1st Qu.:0.000 1st Qu.:1.00
## Median :0.000 Median :2.00
## Mean :0.074 Mean :1.92
## 3rd Qu.:0.000 3rd Qu.:3.00
## Max. :1.000 Max. :3.00
The summary function is great since it provides a very quick and easy method of rapidly assessing some data points within the dataframe.
The average (or mean) age of this study for a married woman is 32.54 years old. The range of the wife’s age is from 16 to 49. The study probably did not include women older than the age of 49 as they are most likely in menopause, and women in the age group > 49 may potentially skew the results of the study.
The overall conclusion on the average women in this study is that she is quite educated (with education mean of 2.959) and is married to an educated man (mean 3.43). The size of the typical family is 3.261 children, and the women are mostly muslim (religion mean of .8506 with 1 being muslim). The standard of living is approximately 3.134. The majority of women do NOT work (mean of 0.7495). Interestingly, this study already makes assumption that the husband is working and does not include a survey question of whether or not the husband works. Again, I suspect that is a reflection of the Indonesian culture where the husband typically is the income generator of the home.
Several subsets will be created to look at different factors that could be contributing to contraceptive use. Age, Education level of the wife, if wife is currently working, media exposure, and standard of living index.
Before we go onto creating the subsets, it was worth acknolwedging that the authors for this paper had made three categorical values for the contraceptive used. 1: No Use 2: Long-term Use 3. Short-Term use.
The issue here is that when we cannot perform straight aggregate functions when it is in this fashion. For example, taking a mean of the birthcontrol$contra_used would be meaningless. Therefore, I had wrangled the data so that it is binary; either yes (takes birth control) or no (does NOT take birth control).
i = 1
while (i <= length(birthcontrol$contra_used)){
if (birthcontrol[i,'contra_used'] == 3){
birthcontrol[[i, 'contra_used']] <- 10
}
i <- i + 1
}
Initially, the values 1 (for NO) and 2 (for YES) were used. However, when I ran the initial above code, the values I was receiving in the summary() function contained values that differed by several hundredths of the point, making it difficult to truly appreciate if there were any notable differences in contraceptive use. So instead, I had changed the values to 1 (for NO) and 10 (for YES) for a more dramatic effect.
Age.Comparison.Older <- birthcontrol[ birthcontrol$wife_age > 35, c('wife_age', 'contra_used')]
colnames(Age.Comparison.Older) <- c('Wife Age','Contraceptive Used')
Age.Comparison.Younger <- birthcontrol[ birthcontrol$wife_age <= 35, c('wife_age', 'contra_used')]
colnames(Age.Comparison.Younger) <- c('Wife Age','Contraceptive Used')
Education.Comparison.Schooled <- birthcontrol[ birthcontrol$wife_edu >=3, c('wife_edu', 'contra_used')]
colnames(Education.Comparison.Schooled) <- c('Wife Education Status','Contraceptive Used')
Education.Comparison.NotSchooled <- birthcontrol[birthcontrol$wife_edu < 3, c('wife_edu', 'contra_used')]
colnames(Education.Comparison.NotSchooled) <- c('Wife Education Status','Contraceptive Used')
Working.Comparison.Yes <- birthcontrol[ birthcontrol$wife_work == 0, c('wife_work', 'contra_used')]
colnames(Working.Comparison.Yes) <- c('Wife Working','Contraceptive Used')
Working.Comparison.No <- birthcontrol[ birthcontrol$wife_work == 1, c('wife_work', 'contra_used')]
colnames(Working.Comparison.No) <- c('Wife Working','Contraceptive Used')
Media.Yes <- birthcontrol[ birthcontrol$media_exp == 0, c('media_exp', 'contra_used')]
colnames(Media.Yes) <- c('Media Exposure','Contraceptive Used')
Media.No <- birthcontrol[ birthcontrol$media_exp == 1, c('media_exp', 'contra_used')]
colnames(Media.No) <- c('Media Exposure','Contraceptive Used')
HighStandardofLiving <- birthcontrol[ birthcontrol$stand_living >= 3, c('stand_living', 'contra_used')]
colnames(HighStandardofLiving) <- c('Standard of Living','Contraceptive Used')
LowStandardofLiving <- birthcontrol[ birthcontrol$stand_living < 3, c('stand_living', 'contra_used')]
colnames(LowStandardofLiving) <- c('Standard of Living','Contraceptive Used')
Comparing women who are older than 35 vs. women are younger than or equal to 35.
print(summary(Age.Comparison.Older))
## Wife Age Contraceptive Used
## Min. :36.00 Min. : 1.000
## 1st Qu.:38.00 1st Qu.: 1.000
## Median :42.00 Median : 2.000
## Mean :41.81 Mean : 3.268
## 3rd Qu.:45.00 3rd Qu.: 2.000
## Max. :49.00 Max. :10.000
print(summary(Age.Comparison.Younger))
## Wife Age Contraceptive Used
## Min. :16.00 Min. : 1.000
## 1st Qu.:24.00 1st Qu.: 1.000
## Median :27.00 Median : 2.000
## Mean :27.39 Mean : 4.948
## 3rd Qu.:31.00 3rd Qu.:10.000
## Max. :35.00 Max. :10.000
For the older subset of women, the mean age was 41.81 years old and the mean contraceptive use was 3.268. While for the younger subset of women, the mean age was 27.39 and the mean was 4.948. Interestingly, these results seem to imply that younger women are more likely to use contraceptives.
Comparing women who are educated vs. women who are not educated. (Higher the number, the more educated the woman is.)
print(summary(Education.Comparison.Schooled))
## Wife Education Status Contraceptive Used
## Min. :3.000 Min. : 1.000
## 1st Qu.:3.000 1st Qu.: 1.000
## Median :4.000 Median : 2.000
## Mean :3.585 Mean : 4.482
## 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :4.000 Max. :10.000
print(summary(Education.Comparison.NotSchooled))
## Wife Education Status Contraceptive Used
## Min. :1.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.: 1.000
## Median :2.000 Median : 1.000
## Mean :1.687 Mean : 4.076
## 3rd Qu.:2.000 3rd Qu.:10.000
## Max. :2.000 Max. :10.000
In this category, it appears that educated women tend be slightly more likely to take contraceptives vs. their uneducated counterparts.
Comparing women who are working vs. women who are not working. (In this study, 0 means yes)
print(summary(Working.Comparison.Yes))
## Wife Working Contraceptive Used
## Min. :0 Min. : 1.000
## 1st Qu.:0 1st Qu.: 1.000
## Median :0 Median : 2.000
## Mean :0 Mean : 3.924
## 3rd Qu.:0 3rd Qu.:10.000
## Max. :0 Max. :10.000
print(summary(Working.Comparison.No))
## Wife Working Contraceptive Used
## Min. :1 Min. : 1.00
## 1st Qu.:1 1st Qu.: 1.00
## Median :1 Median : 2.00
## Mean :1 Mean : 4.49
## 3rd Qu.:1 3rd Qu.:10.00
## Max. :1 Max. :10.00
Interestingly, nonworking women appear to be more likely to use contraceptives vs. working women.
Does media influence women’s decision to take contraceptives? (In this study, 0 means yes)
print(summary(Media.Yes))
## Media Exposure Contraceptive Used
## Min. :0 Min. : 1.000
## 1st Qu.:0 1st Qu.: 1.000
## Median :0 Median : 2.000
## Mean :0 Mean : 4.444
## 3rd Qu.:0 3rd Qu.:10.000
## Max. :0 Max. :10.000
print(summary(Media.No))
## Media Exposure Contraceptive Used
## Min. :1 Min. : 1.000
## 1st Qu.:1 1st Qu.: 1.000
## Median :1 Median : 1.000
## Mean :1 Mean : 3.156
## 3rd Qu.:1 3rd Qu.: 2.000
## Max. :1 Max. :10.000
It appears that the media exposure likely impacts women’s decision to take contraceptives (more likely to take them).
And lastly, does a higher standard of living make a difference in a woman’s decision to take contraceptives?
print(summary(HighStandardofLiving))
## Standard of Living Contraceptive Used
## Min. :3.000 Min. : 1.000
## 1st Qu.:3.000 1st Qu.: 1.000
## Median :4.000 Median : 2.000
## Mean :3.613 Mean : 4.404
## 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :4.000 Max. :10.000
print(summary(LowStandardofLiving))
## Standard of Living Contraceptive Used
## Min. :1.00 Min. : 1.000
## 1st Qu.:1.00 1st Qu.: 1.000
## Median :2.00 Median : 1.000
## Mean :1.64 Mean : 4.176
## 3rd Qu.:2.00 3rd Qu.:10.000
## Max. :2.00 Max. :10.000
Despite the fact that the mean of High Standard of Living is 3.613 and the mean of Low Standard of Living is 1.64, there does not appear to be a significant difference in contraceptive use.
I will demonstrate statistical graphics via 2 methods. One method is via from the “Readings for Week 3 Assignemnts” and the second method is via ggplot2.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
A histogram is a great way to show representation of the distribution of numerical data. In this case, we will use age to demonstrate how frequent each age group was present
Frequency Plot:
birthcontrol.age <- birthcontrol$wife_age
birthcontrol.contraceptiveuse <- birthcontrol$contra_used
# Given ages 16 to 49, will make 33 bins
histogram.birthcontrol.age <- hist(birthcontrol.age, breaks = 33, main = "Age of Women in Study")
Now to make a Density Plot:
hist(birthcontrol.age, freq = FALSE, main = "Density Plot of the Age Group of Women in this Study")
And to add the aesthetics: (Including what would would be a normal distribution curve)
hist(birthcontrol.age, freq = FALSE, xlab = "Years in Age", main = "Density Plot of the Age Group of Women in this Study", col="lightgreen")
curve(dnorm(x, mean=mean(birthcontrol.age), sd=sd(birthcontrol.age)), add=TRUE, col="darkblue", lwd=2)
Now, let’s make a historgram via ggplot2.
pl <- ggplot(birthcontrol, aes(x=wife_age))
pl2 <- pl + geom_histogram(binwidth = 1, color='red',fill='pink', alpha = 0.4)
pl3 <- pl2 + xlab('Age in Years') + ylab('Count')
print(pl3 + ggtitle("Age of Women in the Study"))
For the sake of my assignment, I will need to use a different dataset to adequately demonstrate the boxplot as well as the scatterplot. The Indonesian Birth Control Data set is all categorical and is more conducive to histograms than to the other two. As a result, I will use the nycflights13 dataset to complete the assignment.
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.3.2
Below is the first six rows of dataframe nycflights13 to demonstrate what data is stored.
print(head(flights))
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
And below is the summary of the flights database.
print(summary(flights))
## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907
## Median :2013 Median : 7.000 Median :16.00 Median :1401
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400
## NA's :8255
## sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 106 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 906 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124
## Median :1359 Median : -2.00 Median :1535 Median :1556
## Mean :1344 Mean : 12.64 Mean :1502 Mean :1536
## 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945
## Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## NA's :8255 NA's :8713
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:336776 Min. : 1 Length:336776
## 1st Qu.: -17.000 Class :character 1st Qu.: 553 Class :character
## Median : -5.000 Mode :character Median :1496 Mode :character
## Mean : 6.895 Mean :1972
## 3rd Qu.: 14.000 3rd Qu.:3465
## Max. :1272.000 Max. :8500
## NA's :9430
## origin dest air_time distance
## Length:336776 Length:336776 Min. : 20.0 Min. : 17
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 502
## Mode :character Mode :character Median :129.0 Median : 872
## Mean :150.7 Mean :1040
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## NA's :9430
## hour minute time_hour
## Min. : 1.00 Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :13.00 Median :29.00 Median :2013-07-03 10:00:00
## Mean :13.18 Mean :26.23 Mean :2013-07-03 05:02:36
## 3rd Qu.:17.00 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :23.00 Max. :59.00 Max. :2013-12-31 23:00:00
##
In the nycflights13 dataset, I will use a scatterplot to plot distance vs. air time. The presumption is that the longer the distance, the longer the air time. By using the built in plot function, a simple scatterplot can be displayed. Below uses a random sample of 100 rows in the flights dataframe.
random.flights <- flights[sample(nrow(flights), 100), ]
attach(random.flights)
print(plot(distance,air_time, main="Scatterplot: Distance vs. Air Time", xlab = "Flight Distance", ylab = "Air Time"))
## NULL
Below are fit lines for this scatterplot.
attach(random.flights)
## The following objects are masked from random.flights (pos = 3):
##
## air_time, arr_delay, arr_time, carrier, day, dep_delay,
## dep_time, dest, distance, flight, hour, minute, month, origin,
## sched_arr_time, sched_dep_time, tailnum, time_hour, year
print(plot(distance,air_time, main="Scatterplot: Distance vs. Air Time", xlab = "Flight Distance", ylab = "Air Time"))
## NULL
abline(lm(air_time~distance), col="red") # regression line (y~x)
lines(lowess(distance,air_time), col="blue") # lowess line (x,y)
For more complex scatterplots, ggplot2 will be utilized.
pl.scatter <- ggplot(random.flights, aes(x=distance, y=air_time))
pl.scatter.2 <- pl.scatter + geom_point()
print(pl.scatter.2)
## Warning: Removed 2 rows containing missing values (geom_point).
The beauty of ggplot2 is the the additional information that can be made to the scatterplot. This addition to the geom_point will provide a third set of information; that is, different carriers by using different shapes.
pl.scatter <- ggplot(random.flights, aes(x=distance, y=air_time))
pl.scatter.2 <- pl.scatter + geom_point(aes(shape = carrier))
print(pl.scatter.2)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 10. Consider specifying shapes manually if you must have them.
## Warning: Removed 35 rows containing missing values (geom_point).
Or we can even do them by colors.
pl.scatter.2 <- pl.scatter + geom_point(aes(color = carrier))
print(pl.scatter.2)
## Warning: Removed 2 rows containing missing values (geom_point).
A boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. What I will be doing with this portion is adding another column to the flights dataframe with average speed of each flight, by taking distance/air_time. I will create a box plot for the average speed (miles/hr) for each carrier.
speedDF <- data.frame(speed = flights$distance/(flights$air_time/60))
flights2 <- cbind(flights, speedDF)
print(head(flights2)) #Using head to demonstrate that speed was added on as a column
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5
## 5 -25 DL 461 N668DN LGA ATL 116 762 6
## 6 12 UA 1696 N39463 EWR ORD 150 719 5
## minute time_hour speed
## 1 15 2013-01-01 05:00:00 370.0441
## 2 29 2013-01-01 05:00:00 374.2731
## 3 40 2013-01-01 05:00:00 408.3750
## 4 45 2013-01-01 05:00:00 516.7213
## 5 0 2013-01-01 06:00:00 394.1379
## 6 58 2013-01-01 05:00:00 287.6000
Carriers <- as.vector(flights2$carrier)
Carriers <- unique(Carriers)
print(factor(Carriers))
## [1] UA AA B6 DL EV MQ US WN VX FL AS 9E F9 HA YV OO
## Levels: 9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV
print(paste0("Number of Carriers: ", length(Carriers)))
## [1] "Number of Carriers: 16"
# There are 16 different unique carriers.
# Will create 16 different vectors holding each flight speed time for each specific carrier.
flights3 <- data.frame(flights2$carrier, flights2$speed)
NineE <- flights3[ (flights3$flights2.carrier == '9E'), ]
AA <- flights3[ (flights3$flights2.carrier == 'AA'), ]
AS <- flights3[ (flights3$flights2.carrier == 'AS'), ]
B6 <- flights3[ (flights3$flights2.carrier == 'B6'), ]
DL <- flights3[ (flights3$flights2.carrier == 'DL'), ]
EV <- flights3[ (flights3$flights2.carrier == 'EV'), ]
F9 <- flights3[ (flights3$flights2.carrier == 'F9'), ]
FL <- flights3[ (flights3$flights2.carrier == 'FL'), ]
HA <- flights3[ (flights3$flights2.carrier == 'HA'), ]
MQ <- flights3[ (flights3$flights2.carrier == 'MQ'), ]
OO <- flights3[ (flights3$flights2.carrier == 'OO'), ]
UA <- flights3[ (flights3$flights2.carrier == 'UA'), ]
US <- flights3[ (flights3$flights2.carrier == 'US'), ]
VX <- flights3[ (flights3$flights2.carrier == 'VX'), ]
WN <- flights3[ (flights3$flights2.carrier == 'WN'), ]
YV <- flights3[ (flights3$flights2.carrier == 'YV'), ]
NineE <- as.vector(NineE$flights2.speed)
AA <- as.vector(AA$flights2.speed)
AS <- as.vector(AS$flights2.speed)
B6 <- as.vector(B6$flights2.speed)
DL <- as.vector(DL$flights2.speed)
EV <- as.vector(EV$flights2.speed)
F9 <- as.vector(F9$flights2.speed)
FL <- as.vector(FL$flights2.speed)
HA <- as.vector(HA$flights2.speed)
MQ <- as.vector(MQ$flights2.speed)
OO <- as.vector(OO$flights2.speed)
UA <- as.vector(UA$flights2.speed)
US <- as.vector(US$flights2.speed)
VX <- as.vector(VX$flights2.speed)
WN <- as.vector(WN$flights2.speed)
YV <- as.vector(YV$flights2.speed)
boxplot(NineE, AA, AS, B6, DL, EV, F9, FL, HA, MQ, OO, UA, US, VX, WN, YV, ylab = "Speed MPH", xlab = "Airline Carriers", las = 2, names = c("9E","AA","AS","B6","DL","EV","F9","FL","HA","MQ","OO","UA","UA","VX","WN","YV"))
We can also express boxplots via ggplot2.
temp <- ggplot(flights3, aes(x=factor(flights2.carrier), y=flights2.speed))
temp2 <- temp + geom_boxplot(aes(fill=factor(flights2.carrier)))
print(temp2)
## Warning: Removed 9430 rows containing non-finite values (stat_boxplot).
Throughout this final project, I had the fortune of going over two datasets.
Data set 1: What factors influence married women to use birth control in Indonesia? Some of the findings in the Indonesian Women study to be interesting. It appears that younger, non-working women, who were exposed to the media were more likely to use contraception. However, the standard of living, whether or not if the woman was working, or education did not seem to influence whether or not women would use contraception.
Data set 2: Which airline appeared to be the fastest on average? When examining the boxplot, it appears that carrier HA may be the quickest on average. However, interestingly, they also appear to have the last amount of outliers as well. But overall, the mean speed for all carriers fall between the ranges of 300 and 500, which demonstrate a significant amount of variability between carriers.