Introduction

The objective of this project is to practice data cleaning by tidying and transforming data using tidyr package. In this project three different data must be utilized and analyzed.

Load Required Libraries

To achieve the task, many libraries are needed to load data from html, manipulate strings and convert them into different data types, clean data, and load and convert data to csv files.

library(tidyr)
library(stats)
library(base)
library(dplyr)
library(stringr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(bitops)
library(htmltab)
library(xml2)
library(rvest)
library(date)
library(lubridate)

First Part

An Arduino Nano amazon consumer review will be chosen for the first part. The data will be imported from html page into r to perform cleaning and tidying then exported into an csv file. Data will be brought again into r for data analysis.

Link: Amazon Page

Get data:

nano.html <- read_html("https://www.amazon.com/OSOYOO-ATMEGA328P-Module-Micro-controller-Arduino/product-reviews/B00UACD13Q/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&showViewpoints=1&sortBy=helpful&reviewerType=all_reviews&pageNumber=1", encoding = "UTF-8")

Get titles:

titles <- html_nodes(nano.html, ".a-color-base.a-text-bold")
titles <- html_text(titles)
titles
## character(0)

Get authors:

authors <- html_nodes(nano.html, ".author")
authors <- html_text(authors)
authors
## character(0)

Get date:

date <- html_nodes(nano.html, "#cm_cr-review_list .review-date")
date <- html_text(date)
date <- as.Date(mdy(gsub("on"," ", date)))
## Warning: All formats failed to parse. No formats found.
date
## character(0)

Get rate:

rank <- html_nodes(nano.html, "#cm_cr-review_list .review-rating")
rank <- html_text(rank)
rank <- as.numeric(gsub(" out of 5 stars", "", rank))
rank
## numeric(0)

Get reviews messages:

review_scrript <- html_nodes(nano.html, ".review-data+ .review-data")
review_scrript <- html_text(review_scrript)
str_trunc(strsplit(review_scrript,"   "),50, "right", ellipsis = " ....")
## list()

Put all extracted data into fata frame

options(width=100)
reviews_df<-data.frame(date, authors, titles, rank, stringsAsFactors = FALSE)
reviews_df
## [1] date    authors titles  rank   
## <0 rows> (or 0-length row.names)

Export data to csv file format

options(width=100)
write.table(reviews_df, file = "arduino_nano.csv", sep = ",", col.names=T)
Read_csv <- read.csv( "arduino_nano.csv")
Read_csv
## [1] date    authors titles  rank   
## <0 rows> (or 0-length row.names)

Get data from getHub

nanocsv <- data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/1b26e012c98c801629d9944614feb842eb5cb216/arduino_nano.csv"))
str(nanocsv)
## 'data.frame':    10 obs. of  5 variables:
##  $ date   : int  1 2 3 4 5 6 7 8 9 10
##  $ authors: Factor w/ 10 levels "11/17/2015","12/17/2015",..: 10 2 3 6 4 8 5 1 7 9
##  $ titles : Factor w/ 10 levels "Anthony Coghlan",..: 2 6 8 1 5 4 3 7 10 9
##  $ rank   : Factor w/ 10 levels "A very cool, very tiny Arduino!",..: 7 10 1 6 5 4 8 9 3 2
##  $ X      : int  3 5 5 5 5 5 5 1 3 5

As shown, the header of the data is shifted and a column is being added. Thus they need to be corrected.

Subset the columns into new data frame:

nano_tb<-select(nanocsv,authors,titles,rank,X)

Rename the Variables:

nano_tb <- plyr::rename(nano_tb, c("authors" = "date", "titles" = "authors", "rank" = "titles", "X" = "rank"))

Convert data to table:

tbl_df(nano_tb)
## # A tibble: 10 × 4
##          date         authors                                                 titles  rank
##        <fctr>          <fctr>                                                 <fctr> <int>
## 1   6/22/2016       centenary                              Like some other reviewers     3
## 2  12/17/2015      Mike Moore                                        Would buy again     5
## 3  12/20/2015   Richard DeMar                        A very cool, very tiny Arduino!     5
## 4   2/22/2016 Anthony Coghlan    Great little cost-effective board for experimenting     5
## 5  12/29/2015              JA             Great little alternative to an Arduino Uno     5
## 6   4/24/2016      Hamsterman                              Complete with Bootloader!     5
## 7   12/3/2015    Edward Smith       This Nano is a nice version of the Arduino board     5
## 8  11/17/2015            Opaz very bad.. this board came without boot loader and ...     1
## 9   3/28/2016 Steve Reinhardt                      Caution; it's not exactly a Nano.     3
## 10   5/2/2016        ryanvito                      Awesome little piece of hardware.     5
glimpse(nano_tb)
## Observations: 10
## Variables: 4
## $ date    <fctr> 6/22/2016, 12/17/2015, 12/20/2015, 2/22/2016, 12/29/2015, 4/24/2016, 12/3/2015...
## $ authors <fctr> centenary, Mike Moore, Richard DeMar, Anthony Coghlan, JA, Hamsterman, Edward ...
## $ titles  <fctr> Like some other reviewers, Would buy again, A very cool, very tiny Arduino!, G...
## $ rank    <int> 3, 5, 5, 5, 5, 5, 5, 1, 3, 5

First Part Analysis

Plot the data using bar ggplot

ggplot(nano_tb, aes(rank)) + 
  geom_bar(fill="blue", colour="black")

nanomean <- mean(nano_tb$rank)
nanomean
## [1] 4.2
nanosd <- sd(nano_tb$rank)
nanosd
## [1] 1.398412
nanovar <- var(nano_tb$rank)
nanovar
## [1] 1.955556

The bar graph shows 7 of the reviews were 5 stars and 2 were, 3 stars, and 1 is one star.

Based on the mean of 4.2, the result reflects the chart distribution were most data fall into 5 stars. But the result of standard deviation and the variance indicates that the distribution varies greatly between reviews.

Second Part

For part two the babyname dataset will be used to perform tidying and transformation.

link:babynames data ; DOC

Get data:

babynames <- data.frame(read.csv("https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv"))

Look into the imported data

str(babynames)
## 'data.frame':    258000 obs. of  4 variables:
##  $ year   : int  1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ name   : Factor w/ 6782 levels "Aaden","Aaliyah",..: 3380 6632 3125 1174 2554 2449 3428 6232 2834 5517 ...
##  $ percent: num  0.0815 0.0805 0.0501 0.0452 0.0433 ...
##  $ sex    : Factor w/ 2 levels "boy","girl": 1 1 1 1 1 1 1 1 1 1 ...
head(babynames)
##   year    name  percent sex
## 1 1880    John 0.081541 boy
## 2 1880 William 0.080511 boy
## 3 1880   James 0.050057 boy
## 4 1880 Charles 0.045167 boy
## 5 1880  George 0.043292 boy
## 6 1880   Frank 0.027380 boy
tail(babynames)
##        year     name  percent  sex
## 257995 2008     Diya 0.000128 girl
## 257996 2008 Carleigh 0.000128 girl
## 257997 2008    Iyana 0.000128 girl
## 257998 2008   Kenley 0.000127 girl
## 257999 2008   Sloane 0.000127 girl
## 258000 2008  Elianna 0.000127 girl

Second Part Analysis

Find the most popular names

most_popular_names <- babynames %>%
  group_by(sex, name) %>%
  summarize(total = sum(percent)) %>%
  arrange(desc(total)) %$%
  split(., sex) 
  most_popular_names
## $boy
## Source: local data frame [3,437 x 3]
## Groups: sex [1]
## 
##       sex    name    total
##    <fctr>  <fctr>    <dbl>
## 1     boy    John 5.299585
## 2     boy   James 4.574991
## 3     boy William 4.409453
## 4     boy  Robert 3.821662
## 5     boy Charles 2.518147
## 6     boy Michael 2.366102
## 7     boy  Joseph 2.292487
## 8     boy   David 2.159018
## 9     boy  George 2.096747
## 10    boy  Thomas 1.901267
## # ... with 3,427 more rows
## 
## $girl
## Source: local data frame [4,018 x 3]
## Groups: sex [1]
## 
##       sex      name    total
##    <fctr>    <fctr>    <dbl>
## 1    girl      Mary 4.511860
## 2    girl Elizabeth 1.392100
## 3    girl  Margaret 1.360965
## 4    girl     Helen 1.234222
## 5    girl      Anna 1.195867
## 6    girl   Dorothy 1.065111
## 7    girl   Barbara 1.001579
## 8    girl  Patricia 0.999798
## 9    girl      Ruth 0.942272
## 10   girl     Linda 0.837364
## # ... with 4,008 more rows

Visualize the distribution of some popular names

g1 <- babynames %>%
  filter(name=="Barbara") %$% 
  ggplot(., aes(year,percent)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Barbara")

g2 <- babynames %>%
  filter(name=="John") %$% 
  ggplot(., aes(year,percent)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("John")

g3 <- babynames %>%
  filter(name=="Charles") %$% 
  ggplot(., aes(year,percent)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Charles")

g4 <- babynames %>%
  filter(name=="Jennifer") %$% 
  ggplot(., aes(year,percent)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Jennifer")

grid.arrange(g1,g2,g3,g4,ncol=2)

Even though some of these names are the most popular, the graph shows that these names are declining as years get close to present time, such as John and Charles. It also shows that some of these name become popular at the forties, such as Barbara, and seventies, such as Jennifer then decline afterword.

Third part

The mtcar dataset will be utilized to perform data analysis and transformation

Links: mtcar data ; doc

Get Data:

carsSpecs <- data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/6a7424e677c38c58adc598e5b53812edb2a808a5/mtcars.csv"))

look into the data set

str(carsSpecs)
## 'data.frame':    32 obs. of  12 variables:
##  $ X   : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : int  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : int  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : int  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: int  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: int  4 4 1 1 2 1 4 2 2 4 ...
head(carsSpecs)
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(carsSpecs)
##                 X  mpg cyl  disp  hp drat    wt qsec vs am gear carb
## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## 28   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## 29 Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## 30   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## 31  Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Third Part Analysis

Analyze the effect of automatic and manual on gas mileage

subset data:

manual<- carsSpecs %>%
  filter(am==1)
head(manual)
##                X  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3     Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 6 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
auto <- carsSpecs%>%
  filter(am==0)
head(auto)
##                   X  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1    Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 2 Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 3           Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 4        Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 5         Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 6          Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

Find the means and Standard deviation of:

Manual:

manualmean<-summarise(manual,mean(manual$mpg))
manualmean
##   mean(manual$mpg)
## 1         24.39231
manualsd<-summarise(manual,sd(manual$mpg))
manualsd
##   sd(manual$mpg)
## 1       6.166504

Automatic:

automean<-summarise(auto,mean(auto$mpg))
automean
##   mean(auto$mpg)
## 1       17.14737
autosd<-summarise(auto, sd(auto$mpg))
autosd
##   sd(auto$mpg)
## 1     3.833966

Get the percentage increase of the mileage based on the means

(manualmean-automean)/automean
##   mean(manual$mpg)
## 1        0.4225103

plot the distribution

ggplot(manual, aes(manual$mpg)) + 
  geom_histogram(aes(y=..density..),binwidth=4,colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")+
  geom_vline(xintercept = as.numeric(manualmean))+
  geom_vline(xintercept = as.numeric(manualmean), linetype="dotted", color = "blue", size=1.5)

ggplot(auto, aes(auto$mpg)) + 
  geom_histogram(aes(y=..density..),binwidth=3,colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")+
  geom_vline(xintercept = as.numeric(automean))+
  geom_vline(xintercept = as.numeric(automean), linetype="dotted", color = "blue", size=1.5)

Based on the results, the mean of the manual transmission cars is greater than the automated. It also has the highest variance.