#source('create_datasets.R')

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv", stringsAsFactors=T)
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv", stringsAsFactors=T)
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv", stringsAsFactors=T)

Explanation: The code above loads the libraries we use and reads in the datasets on which we’ll be doing EDA. The files to be analyzed are read directly from URLs.

head(comics)
##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

Explanation: The “head” function shows the first six rows of the “comics” dataset. As I can see, there are 11 variables: 10 factor variables (the strings were read in with stringsAsFactors) and 1 int variable (“appearances”). Also, the variable shown as “g…” has missing values: in the output above, the “gsm” column is <NA> in all six rows.

levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

Explanation: The “levels” function shows the levels of a factor variable. At first I could not show the levels of “align” because it was read in as a char variable. The function only works on strings after they are converted to factors, which stringsAsFactors = T does in read.csv. After adding stringsAsFactors, I can see that “align” has four levels: Bad, Good, Neutral, and Reformed Criminals.
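To illustrate why levels() depends on the column type, here is a minimal sketch on a small made-up vector. The names align_chr and align_fct are hypothetical and are not part of the comics data.

```r
# Toy character vector; the values are illustrative only
align_chr <- c("Good", "Bad", "Neutral", "Bad")

# A character vector has no levels attribute, so this returns NULL
levels(align_chr)

# Converting to a factor creates levels, sorted alphabetically by default
align_fct <- factor(align_chr)
levels(align_fct)
```

Reading a file with read.csv(..., stringsAsFactors = TRUE) applies the same conversion to every string column.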

levels(comics$gender)
## [1] "Female" "Male"   "Other"

Explanation: As with the explanation above, “gender” is a char (string) variable, so its levels are not shown unless I use stringsAsFactors. After adding stringsAsFactors, I can see that “gender” has three levels: Female, Male, and Other.

tab <- table(comics$align, comics$gender)
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

Explanation: The “table” function builds a contingency table from the variables chosen by the user, and the result is stored in “tab”. Here the variables are “align” and “gender” from the comics dataset. The table cross-tabulates them, so we can see how many characters fall into each combination of alignment and gender.

comics <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

levels(comics$align)
## [1] "Bad"     "Good"    "Neutral"

Explanation: The “filter” call removes the rows whose “align” is “Reformed Criminals” (3 characters in the table above). The level itself would still exist after filtering, so “droplevels” removes the now-unused level. Running “levels” again confirms that only Bad, Good, and Neutral remain.
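The difference between removing rows and removing a level can be sketched on a toy factor. The vector below is made up for illustration, and base subsetting stands in for the dplyr filter used above.

```r
# Toy factor; the values are illustrative only
align <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Good"))

# Subsetting removes the observation but keeps the now-empty level
kept <- align[align != "Reformed Criminals"]
levels(kept)
table(kept)   # "Reformed Criminals" is still listed, with a count of 0

# droplevels() removes levels that no longer occur in the data
kept_clean <- droplevels(kept)
levels(kept_clean)
```

Without droplevels(), the empty level would still appear in tables and on plot axes.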

ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

Explanation: The bar chart shows the count of each gender within each alignment. Male characters have their highest count in Bad, while female characters have their highest count in Good. “Other” appears least often, but is still slightly visible in Bad. Characters with “N/A” (unidentified) gender have the second-lowest count in Bad.

ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Explanation: The chart shows that among female characters “Good” is dominant, with fewer Bad and Neutral characters. Among male characters “Bad” is dominant, with fewer Good and Neutral characters. “Other” is the smallest group in the chart, and “Bad” is still the most common alignment within it. For “N/A” (unidentified) gender, “Bad” is also the most common alignment and Neutral the least.

options(scipen = 999, digits = 3) 
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt
##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

Explanation: This code builds a table of counts for the “id” and “align” variables. From the table we can see that most characters have a Secret “id”, and that 7 of the Bad characters have an Unknown “id”. Apart from Unknown, No Dual is the smallest group, and Good is its most common alignment.

prop.table(tbl_cnt)
##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

Explanation: The table above shows joint proportions: each count in tbl_cnt divided by the total number of characters in the table. For example, about 29% of all characters are Bad with a Secret “id”.

sum(prop.table(tbl_cnt))
## [1] 1

Explanation: Summing the proportions gives 1, which confirms that the joint proportions in the table add up to 100%.
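How prop.table() normalizes counts can be sketched on a toy 2 x 2 table. The counts below are made up, and the object names (counts, joint, by_row, by_col) are hypothetical.

```r
# Toy 2 x 2 table of counts; the numbers are made up
counts <- matrix(c(10, 30, 20, 40), nrow = 2,
                 dimnames = list(id = c("Public", "Secret"),
                                 align = c("Bad", "Good")))

joint  <- prop.table(counts)      # each cell divided by the grand total
by_row <- prop.table(counts, 1)   # each cell divided by its row total
by_col <- prop.table(counts, 2)   # each cell divided by its column total

sum(joint)        # joint proportions sum to 1
rowSums(by_row)   # each row sums to 1
colSums(by_col)   # each column sums to 1
```

The second argument chooses the margin to condition on: 1 for rows, 2 for columns.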

prop.table(tbl_cnt, 1)
##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222
ggplot(comics, aes(x = id, fill = align)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

Explanation: The table gives the proportion of each alignment within each “id” (every row sums to 1), and the chart shows the same proportions. Bad is most dominant for the Unknown “id”. The chart also shows that Secret and N/A “id”s have roughly the same proportion of Bad. Neutral is the smallest share for No Dual, Public, and Secret; for Unknown, Good is the smallest at 0.

ggplot(comics, aes(x = align, fill = id)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

Explanation: This chart reverses the previous one: it shows the proportion of each “id” within each alignment, which gives a very different picture. The Unknown “id” accounts for only 9 characters, so its share is too small to be visible on the chart. Secret and Public “id”s hold roughly the same proportion among Good and Neutral characters, while Secret is about twice as common as Public among Bad characters.

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

Explanation: The code above shows the joint proportions of “align” and “gender”: each cell is the share of all characters in the table with that combination. As the comment says, options(digits = 3) makes R print fewer digits.

prop.table(tab, 2)
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

Explanation: The table above gives the proportion of each alignment within each gender, so every column sums to 1. Roughly 51% of female characters are Good, while roughly 53% of male characters are Bad. For “Other”, Bad is also the most common alignment at roughly 49%.

ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

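Explanation: The chart shows the counts of each gender stacked within each alignment. After analyzing it, I can confirm that Bad is the largest alignment overall. Based on the counts tabulated earlier, Bad is the most common alignment for Male and Other characters, while Female characters are most often Good.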

ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill")

Explanation: The chart shows that Male characters make up the largest share of every alignment, and their share is highest in “Bad”. Female characters have their largest share in “Good”. NA has roughly the same proportion in Bad and Neutral.

table(comics$id)
## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

Explanation: The function above counts how many characters fall into each level of the “id” variable. Most of them have a Secret “id”, and Unknown is the least common.

ggplot(comics, aes(x = id)) + 
  geom_bar()

Explanation: This is a simple bar chart to visualize the “id” variable. By analyzing the chart, I can state that:
- Most superheroes have a Secret “id”.
- No Dual has the second-lowest count on the chart.
- Unknown has the lowest count on the bar chart.

ggplot(comics, aes(x = id)) + 
  geom_bar() + 
  facet_wrap(~align)

Explanation: From the bar charts generated above, I can state that:
- Bad superheroes with a Secret “id” have the highest count on the chart.
- Good superheroes have 0 Unknown “id”s.
- Good superheroes also have the highest count of Public “id”s.
- Public and Secret “id”s have roughly the same Neutral counts.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

Explanation: The code above reorders the levels of “align” to Bad, Neutral, Good, so the bars are drawn in that order. From this chart I can state that:
- Bad has the highest count, almost reaching 10000.
- Neutral has the lowest count.
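Reordering levels changes only the order in which categories are listed and plotted, not the data. A minimal sketch on a made-up vector (align_default and align_ordered are hypothetical names):

```r
# Toy factor; the values are illustrative only
align_default <- factor(c("Good", "Bad", "Neutral", "Bad"))
levels(align_default)   # alphabetical by default

# Re-create the factor with an explicit level order
align_ordered <- factor(align_default, levels = c("Bad", "Neutral", "Good"))
levels(align_ordered)

# Counts follow the new level order; the observations are unchanged
table(align_ordered)
```

ggplot2 draws the bars of a discrete axis in level order, which is why the plot changes after this step.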

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

Explanation: From the chart above, after analyzing it I can state that:
- Male has by far the highest count of Bad superheroes.
- Female has lower counts in every alignment, but females still have a higher percentage of Good superheroes than males.
- The “Other” gender has the lowest counts of all.

pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))

# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Explanation: From the bar chart generated above, I can state that:
- Apple pies have the highest count.
- Strawberry has the lowest count of all the flavors.

# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) + 
  geom_dotplot(dotsize = 0.4)
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).

Explanation: From the dot plot generated above, I can see that most cars weigh between about 2000 and 4000.

# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) + 
  geom_histogram(binwidth = 500)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

ggplot(cars, aes(x = weight)) + 
  geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).

Explanation: The density plot above shows how the car weights are distributed: the higher the curve, the more cars have a weight near that value. The highest point of the density is at a weight of roughly 3500, so weights around 3500 are the most common.
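Reading a density curve can be sketched with base R's density() on simulated data. The values below are randomly generated, not the cars data, and the mean of 3500 is an assumption chosen only to mirror the peak described above.

```r
# Simulated weights (NOT the cars data); mean and sd are assumptions
set.seed(1)
weight_sim <- rnorm(1000, mean = 3500, sd = 700)

# Kernel density estimate evaluated on an evenly spaced grid
dens <- density(weight_sim)

# The x value where the curve is highest is the most common weight
peak_weight <- dens$x[which.max(dens$y)]
peak_weight

# The area under the curve is approximately 1, so height is relative frequency
area <- sum(dens$y) * diff(dens$x[1:2])
area
```

The height of the curve is therefore a statement about how common a weight is, not a property of any individual car.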

ggplot(cars, aes(x = 1, y = weight)) + 
  geom_boxplot() +
  coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

library(ggplot2)

This is the library used for the next chart generation.

str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

Explanation: From my analysis, I can see that there are 428 observations of 19 different variables. The points I’ll make are:
- There are 10 integer variables.
- There is 1 num variable.
- There are 7 logical (TRUE/FALSE) variables.
- There is 1 factor variable (“name”).
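Counting variables by type does not have to be done by eye. A minimal sketch on a made-up data frame (toy and col_classes are hypothetical names; the same two calls could be applied to any data frame):

```r
# Toy data frame with one column of each type; the values are illustrative only
toy <- data.frame(
  name   = factor(c("a", "b", "c")),
  suv    = c(FALSE, TRUE, FALSE),
  weight = c(2370L, 2348L, 2617L),
  eng    = c(1.6, 1.6, 2.2)
)

# Class of every column, then a count of columns per class
col_classes <- sapply(toy, class)
col_classes
table(col_classes)
```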

ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

unique(cars$ncyl)
## [1]  4  6  3  8  5 12 10 -1
table(cars$ncyl)
## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).

Explanation: From the chart above, I can see that cars with more cylinders tend to have lower city_mpg: the density curves for 6 and 8 cylinders peak at lower city_mpg values than the curve for 4 cylinders.

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Horsepower distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Explanation: From the chart above, most of the cars have between 200 and 300 horsepower. The fewest cars fall between 400 and 500 HP.

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")