#source('create_datasets.R')
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv", stringsAsFactors = TRUE)
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv", stringsAsFactors = TRUE)
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv", stringsAsFactors = TRUE)
Explanation: The code above loads the libraries used in this analysis and reads the three datasets on which we will perform exploratory data analysis (EDA). The datasets are CSV files read directly from URLs.
head(comics)
## name id align eye hair
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes Brown Hair
## 2 Captain America (Steven Rogers) Public Good Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes Black Hair
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes Black Hair
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes Blond Hair
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes No Hair
## gender gsm alive appearances first_appear publisher
## 1 Male <NA> Living Characters 4043 Aug-62 marvel
## 2 Male <NA> Living Characters 3360 Mar-41 marvel
## 3 Male <NA> Living Characters 3061 Oct-74 marvel
## 4 Male <NA> Living Characters 2961 Mar-63 marvel
## 5 Male <NA> Living Characters 2258 Nov-50 marvel
## 6 Male <NA> Living Characters 2255 Nov-61 marvel
Explanation: The “head” function prints the first six rows of the “comics” dataset. As I can see, there are 11 variables: 10 are factors (the text columns, converted because of stringsAsFactors) and 1 is an integer variable (“appearances”). Also, the column entitled “g…” contains only missing values (NA) in these six rows.
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
Explanation: The levels function lists the levels of a factor variable. At first I could not show the levels of “align” because it was read as a char variable, and levels does not work on plain strings. It works once the data is read with stringsAsFactors = TRUE, which converts strings to factors. After adding stringsAsFactors, I can see that the levels of align are Bad, Good, Neutral, and Reformed Criminals.
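The difference between a character vector and a factor can be checked on a small example. This is only a sketch: the vector `align_chr` below is a hypothetical toy vector, not a column from the comics data.

```r
# Hypothetical toy vector (not taken from the comics dataset)
align_chr <- c("Good", "Bad", "Neutral", "Good")

# A character vector has no levels, so levels() returns NULL
levels(align_chr)

# Converting to a factor creates the levels (sorted alphabetically by default)
align_fct <- factor(align_chr)
levels(align_fct)
```

Reading a CSV with stringsAsFactors = TRUE performs this conversion for every text column at once.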
levels(comics$gender)
## [1] "Female" "Male" "Other"
Explanation: As with the explanation above, gender would be a char (string) variable by default, so its levels are only shown when the data is read with stringsAsFactors = TRUE. After adding stringsAsFactors, I can see that the levels of gender are Female, Male, and Other.
tab <- table(comics$align, comics$gender)
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
Explanation: The “table” function builds a contingency table of the variables chosen by the user, and the result is stored in “tab”. In the case above, the chosen variables are align and gender from the comics dataset. The table cross-tabulates align and gender, so we can see how many characters fall into each combination.
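A small sketch of how table() counts combinations is given here. The two vectors are hypothetical toy data, not the comics columns. It also illustrates that table() leaves out missing values by default, which is likely why an “N/A” gender shows up in the charts later on but not in the tables; the useNA argument asks table() to include them.

```r
# Hypothetical toy data (not taken from the comics dataset)
align_toy  <- c("Good", "Bad", "Bad", "Good", "Neutral")
gender_toy <- c("Female", "Male", "Male", NA, "Female")

# Cross-tabulate the two vectors; rows with NA are dropped by default
tab_toy <- table(align_toy, gender_toy)
tab_toy

# Keep missing values as their own category
tab_all <- table(align_toy, gender_toy, useNA = "ifany")
tab_all
```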
comics <- comics %>%
filter(align != 'Reformed Criminals') %>%
droplevels()
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
Explanation: The filter step removes the rows whose align is “Reformed Criminals” (only 3 characters). Filtering alone keeps “Reformed Criminals” as an empty level, so the function “droplevels” is used to remove the unused level. After the code runs, “Reformed Criminals” no longer appears when we run the “levels” function again.
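Why droplevels is needed can be seen on a toy factor. This is a minimal sketch using base R subsetting instead of dplyr's filter, and `align_toy` is a hypothetical vector rather than the real column.

```r
# Hypothetical toy factor (not taken from the comics dataset)
align_toy <- factor(c("Bad", "Good", "Reformed Criminals", "Good"))

# Removing the rows does not remove the level itself
kept <- align_toy[align_toy != "Reformed Criminals"]
levels(kept)   # still lists "Reformed Criminals"
table(kept)    # the unused level appears with a count of 0

# droplevels() discards levels that no longer occur in the data
cleaned <- droplevels(kept)
levels(cleaned)
```

Without this step, the empty level would still appear as an all-zero row in tables.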
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge")
Explanation: The bar chart shows the count of each gender within each
align. After analyzing the chart, we can see that Male is the largest
group in every align and is highest on Bad, while Female has its
highest count on the Good align. “Other” appears the least, but it is
still slightly visible on Bad. The “N/A” (unidentified) gender has the
second lowest count on Bad.
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
Explanation: The chart shows that Female characters are mostly “Good”,
with fewer Bad and Neutral characters. Male characters, however, are
mostly “Bad”, with fewer Good and Neutral characters. The “Other”
gender has the lowest counts in the chart, and “Bad” is still the most
common align within it. For the “N/A” or unidentified gender, “Bad” is
the most common align and Neutral is the least common.
options(scipen = 999, digits = 3)
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt
##
## Bad Good Neutral
## No Dual 474 647 390
## Public 2172 2930 965
## Secret 4493 2475 959
## Unknown 7 0 2
Explanation: This function makes a table of counts for the “id” and “align” variables. From the table we can see that Secret is the most common “id”, and that 7 of the 9 characters with an Unknown “id” are “Bad” guys. Apart from Unknown, No Dual has the lowest counts, and most of its characters are “Good” guys.
prop.table(tbl_cnt)
##
## Bad Good Neutral
## No Dual 0.030553 0.041704 0.025139
## Public 0.140003 0.188862 0.062202
## Secret 0.289609 0.159533 0.061815
## Unknown 0.000451 0.000000 0.000129
Explanation: The table above shows the joint proportions of “id” and “align”: each cell of tbl_cnt is divided by the total number of characters in the table.
sum(prop.table(tbl_cnt))
## [1] 1
Explanation: The sum is used as a check. Because every cell is a share of the same total, all of the joint proportions in the table add up to 1.
prop.table(tbl_cnt, 1)
##
## Bad Good Neutral
## No Dual 0.314 0.428 0.258
## Public 0.358 0.483 0.159
## Secret 0.567 0.312 0.121
## Unknown 0.778 0.000 0.222
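The calls to prop.table() above differ only in the margin argument. As a sketch, the counts printed for tbl_cnt can be typed into a matrix by hand (the matrix `cnt` is an illustration built from that output, not the original object) to compare joint, row-wise, and column-wise proportions.

```r
# Counts copied from the tbl_cnt output above into a plain matrix
cnt <- matrix(c( 474,  647, 390,
                2172, 2930, 965,
                4493, 2475, 959,
                   7,    0,   2),
              nrow = 4, byrow = TRUE,
              dimnames = list(id = c("No Dual", "Public", "Secret", "Unknown"),
                              align = c("Bad", "Good", "Neutral")))

joint  <- prop.table(cnt)     # every cell divided by the grand total
by_row <- prop.table(cnt, 1)  # each row (id) sums to 1
by_col <- prop.table(cnt, 2)  # each column (align) sums to 1

sum(joint)
rowSums(by_row)
colSums(by_col)
round(by_row["Secret", "Bad"], 3)
```

Row proportions answer "given this id, how are the aligns split?", while column proportions answer "given this align, how are the ids split?".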
ggplot(comics, aes(x = id, fill = align)) +
geom_bar(position = "fill") +
ylab("proportion")
Explanation: The chart above shows the proportion of each align within
each “id”. As we can see, Bad is dominant for the Unknown ID. The chart
also shows that the Secret ID and N/A have roughly the same proportion
of “Bad”. Neutral has the smallest proportion in every “id” except
Unknown, where there are no Good characters at all.
ggplot(comics, aes(x = align, fill = id)) +
geom_bar(position = "fill") +
ylab("proportion")
Explanation: The chart above is the reverse of the previous chart. It
shows the proportion of each “id” within each align, which gives a
different picture. The Unknown “id” has so few characters (9) that it
is not visible on the chart. The Secret and Public “id”s have almost
the same share among Neutral characters and similar shares among Good
characters, while among Bad characters Secret is about twice as common
as Public.
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Good 0.130135 0.251333 0.000888
## Neutral 0.043692 0.094021 0.000888
Explanation: The code above shows the joint proportions of align and gender: each cell is the share of all characters in the table that fall into that combination. The options call makes R print fewer digits, as the comment says.
prop.table(tab, 2)
##
## Female Male Other
## Bad 0.321 0.534 0.485
## Good 0.508 0.339 0.258
## Neutral 0.171 0.127 0.258
Explanation: The table above shows the proportions within each gender (each column sums to 1). About 51% of Female characters are Good, while about 53% of Male characters are Bad. For the Other gender, Bad is also the most common align, at about 48%.
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
Explanation: The chart shows the count of each align, stacked by
gender. After analyzing, I can confirm that “Bad” is the tallest bar
and that Male is the largest segment in every align.
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill")
Explanation: The chart shows that the Male gender has the highest
proportion in every align, and its share is highest on “Bad”. The
Female share is largest on “Good”. The NA gender has roughly the same
proportion on Bad and Neutral.
table(comics$id)
##
## No Dual Public Secret Unknown
## 1511 6067 7927 9
Explanation: The function above counts how many characters fall into each level of the “id” variable. As we can see, Secret is the most common “id” (7927 characters), while Unknown is the least common (9 characters).
ggplot(comics, aes(x = id)) +
geom_bar()
Explanation: This is a simple bar chart to visualize the “id”
variable. By analyzing the chart, I can make these statements:
- Most characters have a Secret “id”.
- No Dual has the second lowest count on the chart.
- The Unknown “id” has the lowest count on the bar chart.
ggplot(comics, aes(x = id)) +
geom_bar() +
facet_wrap(~align)
Explanation: From the bar charts generated above, I can make these
statements:
- Bad characters with a Secret “id” have the highest count on the chart.
- Good characters have 0 Unknown “id”s.
- Good characters have their highest count on Public “id”s.
- Public and Secret “id”s have roughly the same number of Neutral
characters.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
Explanation: The code above reorders the levels of the align variable
to Bad, Neutral, Good, and the bar chart follows that order. The
statements I can make from this chart are:
- Bad has the highest count, which almost reaches 10000.
- Neutral has the lowest count.
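The reordering step can be verified without a plot. In this sketch `x` is a hypothetical toy factor; the point is that passing levels to factor() changes only the order of the levels, not the data values.

```r
# Hypothetical toy factor (not taken from the comics dataset)
x <- factor(c("Good", "Bad", "Neutral", "Bad"))
levels(x)    # alphabetical by default: Bad, Good, Neutral

# Re-create the factor with an explicit level order
x2 <- factor(x, levels = c("Bad", "Neutral", "Good"))
levels(x2)   # Bad, Neutral, Good

# The values themselves are unchanged; only the ordering differs
table(x2)
```

ggplot2 draws the bars of a factor in level order, which is why the chart changes after this step.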
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
Explanation: From the chart above, I can make these statements:
- Male has by far the highest count of Bad characters.
- Female has lower counts on every align, but a higher percentage of
Female characters are Good than of Male characters.
- The “Other” gender has the lowest counts of all the genders.
pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))
# Put levels of flavor in descending order of count
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
Explanation: From the bar chart generated above, I can make these
statements:
- Apple is the flavor with the highest count.
- Strawberry is the flavor with the lowest count.
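The descending order typed by hand in “lev” could also be derived from the data. The following is one possible sketch, not the method used above; it rebuilds the same pie counts and sorts the frequency table.

```r
# Same flavors and counts as in the pies example above
flavor <- rep(c("apple", "blueberry", "boston creme", "cherry",
                "key lime", "pumpkin", "strawberry"),
              times = c(17, 14, 15, 13, 16, 12, 11))

# Sort the frequency table from most to least common and keep the names
lev_auto <- names(sort(table(flavor), decreasing = TRUE))
lev_auto

# Use the computed order as the factor levels
flavor_fct <- factor(flavor, levels = lev_auto)
levels(flavor_fct)
```

Deriving the order this way avoids retyping the levels if the counts change.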
# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) +
geom_dotplot(dotsize = 0.4)
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).
Explanation: From the dot plot generated above, I can see that most of
the cars weigh between about 2000 and 4000.
# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) +
geom_histogram(binwidth = 500)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
ggplot(cars, aes(x = weight)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).
Explanation: The chart above shows a smoothed version of the weight
distribution. A higher curve means more cars have a weight near that
value. The highest point of the density is at a weight of roughly 3500.
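How to read a density curve can be sketched numerically with base R's density() function. The weights below are simulated and purely hypothetical (they are not the cars data); the mean of 3500 is an assumption chosen only to resemble the chart.

```r
# Simulated, hypothetical weights (NOT the cars dataset)
set.seed(42)
weight_sim <- rnorm(500, mean = 3500, sd = 700)

# Estimate the density on a grid of equally spaced x values
d <- density(weight_sim)

# The x value where the curve is highest is the most typical weight
peak <- d$x[which.max(d$y)]
peak

# The area under the curve is approximately 1, so the y axis is not a count
area <- sum(d$y) * diff(d$x)[1]
area
```

The y axis of a density plot is therefore a relative scale: it shows where values concentrate, not how many cars there are.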
ggplot(cars, aes(x = 1, y = weight)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
library(ggplot2)
ggplot2 was already loaded above; it is the library used to generate the charts that follow.
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
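Explanation: From my analysis, I can see that there are 428 observations of 19 different variables. The points I want to make are:
- There are 10 integer variables.
- There is 1 num variable (eng_size).
- There are 7 logical variables, all FALSE in the first rows shown.
- The remaining variable, name, is a factor.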
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).
unique(cars$ncyl)
## [1] 4 6 3 8 5 12 10 -1
table(cars$ncyl)
##
## -1 3 4 5 6 8 10 12
## 2 1 136 7 190 87 2 3
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).
Explanation: The chart above overlays the density of city_mpg for cars
with 4, 6, and 8 cylinders. It suggests that cars with more cylinders
tend to have lower city_mpg values.
# Create hist of horsepwr
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
ggtitle("Horsepower distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Explanation: From the chart above, my statement is that most of the
cars have between 200 and 300 horsepower. The fewest cars are found
between 400 and 500 HP.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("binwidth = 3")
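The effect of the binwidth argument can be estimated without drawing the chart. This is a rough sketch: `hp_sim` is simulated, hypothetical data (not the horsepwr column), and `n_bins` is a helper written for this illustration that only approximates how many bars a histogram needs.

```r
# Simulated, hypothetical horsepower values (NOT the cars dataset)
set.seed(1)
hp_sim <- round(runif(400, min = 70, max = 500))

# Hypothetical helper: approximate number of bins needed to cover the data
n_bins <- function(x, binwidth) {
  ceiling(diff(range(x)) / binwidth)
}

n_bins(hp_sim, 3)    # narrow bins: many bars, noisy shape
n_bins(hp_sim, 30)   # wide bins: few bars, smoother shape
```

A very small binwidth such as 3 splits the data into many narrow bars, so the histogram looks jagged; a larger binwidth gives a smoother but coarser picture.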