The objective of this project is to practice data cleaning by tidying and transforming data using tidyr package. In this project three different data must be utilized and analyzed.
To achieve the task, many libraries are needed to load data from html, manipulate strings and convert them into different data types, clean data, and load and convert data to csv files.
library(tidyr)
library(stats)
library(base)
library(dplyr)
library(stringr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(bitops)
library(htmltab)
library(xml2)
library(rvest)
library(date)
library(lubridate)
An Arduino Nano amazon consumer review will be chosen for the first part. The data will be imported from html page into r to perform cleaning and tidying then exported into an csv file. Data will be brought again into r for data analysis.
Link: Amazon Page
Get data:
nano.html <- read_html("https://www.amazon.com/OSOYOO-ATMEGA328P-Module-Micro-controller-Arduino/product-reviews/B00UACD13Q/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&showViewpoints=1&sortBy=helpful&reviewerType=all_reviews&pageNumber=1", encoding = "UTF-8")
Get titles:
titles <- html_nodes(nano.html, ".a-color-base.a-text-bold")
titles <- html_text(titles)
titles
## character(0)
Get authors:
authors <- html_nodes(nano.html, ".author")
authors <- html_text(authors)
authors
## character(0)
Get date:
date <- html_nodes(nano.html, "#cm_cr-review_list .review-date")
date <- html_text(date)
date <- as.Date(mdy(gsub("on"," ", date)))
## Warning: All formats failed to parse. No formats found.
date
## character(0)
Get rate:
rank <- html_nodes(nano.html, "#cm_cr-review_list .review-rating")
rank <- html_text(rank)
rank <- as.numeric(gsub(" out of 5 stars", "", rank))
rank
## numeric(0)
Get reviews messages:
review_scrript <- html_nodes(nano.html, ".review-data+ .review-data")
review_scrript <- html_text(review_scrript)
str_trunc(strsplit(review_scrript," "),50, "right", ellipsis = " ....")
## list()
Put all extracted data into fata frame
options(width=100)
reviews_df<-data.frame(date, authors, titles, rank, stringsAsFactors = FALSE)
reviews_df
## [1] date authors titles rank
## <0 rows> (or 0-length row.names)
Export data to csv file format
options(width=100)
write.table(reviews_df, file = "arduino_nano.csv", sep = ",", col.names=T)
Read_csv <- read.csv( "arduino_nano.csv")
Read_csv
## [1] date authors titles rank
## <0 rows> (or 0-length row.names)
Get data from getHub
nanocsv <- data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/1b26e012c98c801629d9944614feb842eb5cb216/arduino_nano.csv"))
str(nanocsv)
## 'data.frame': 10 obs. of 5 variables:
## $ date : int 1 2 3 4 5 6 7 8 9 10
## $ authors: Factor w/ 10 levels "11/17/2015","12/17/2015",..: 10 2 3 6 4 8 5 1 7 9
## $ titles : Factor w/ 10 levels "Anthony Coghlan",..: 2 6 8 1 5 4 3 7 10 9
## $ rank : Factor w/ 10 levels "A very cool, very tiny Arduino!",..: 7 10 1 6 5 4 8 9 3 2
## $ X : int 3 5 5 5 5 5 5 1 3 5
As shown, the header of the data is shifted and a column is being added. Thus they need to be corrected.
Subset the columns into new data frame:
nano_tb<-select(nanocsv,authors,titles,rank,X)
Rename the Variables:
nano_tb <- plyr::rename(nano_tb, c("authors" = "date", "titles" = "authors", "rank" = "titles", "X" = "rank"))
Convert data to table:
tbl_df(nano_tb)
## # A tibble: 10 × 4
## date authors titles rank
## <fctr> <fctr> <fctr> <int>
## 1 6/22/2016 centenary Like some other reviewers 3
## 2 12/17/2015 Mike Moore Would buy again 5
## 3 12/20/2015 Richard DeMar A very cool, very tiny Arduino! 5
## 4 2/22/2016 Anthony Coghlan Great little cost-effective board for experimenting 5
## 5 12/29/2015 JA Great little alternative to an Arduino Uno 5
## 6 4/24/2016 Hamsterman Complete with Bootloader! 5
## 7 12/3/2015 Edward Smith This Nano is a nice version of the Arduino board 5
## 8 11/17/2015 Opaz very bad.. this board came without boot loader and ... 1
## 9 3/28/2016 Steve Reinhardt Caution; it's not exactly a Nano. 3
## 10 5/2/2016 ryanvito Awesome little piece of hardware. 5
glimpse(nano_tb)
## Observations: 10
## Variables: 4
## $ date <fctr> 6/22/2016, 12/17/2015, 12/20/2015, 2/22/2016, 12/29/2015, 4/24/2016, 12/3/2015...
## $ authors <fctr> centenary, Mike Moore, Richard DeMar, Anthony Coghlan, JA, Hamsterman, Edward ...
## $ titles <fctr> Like some other reviewers, Would buy again, A very cool, very tiny Arduino!, G...
## $ rank <int> 3, 5, 5, 5, 5, 5, 5, 1, 3, 5
Plot the data using bar ggplot
ggplot(nano_tb, aes(rank)) +
geom_bar(fill="blue", colour="black")
nanomean <- mean(nano_tb$rank)
nanomean
## [1] 4.2
nanosd <- sd(nano_tb$rank)
nanosd
## [1] 1.398412
nanovar <- var(nano_tb$rank)
nanovar
## [1] 1.955556
The bar graph shows 7 of the reviews were 5 stars and 2 were, 3 stars, and 1 is one star.
Based on the mean of 4.2, the result reflects the chart distribution were most data fall into 5 stars. But the result of standard deviation and the variance indicates that the distribution varies greatly between reviews.
For part two the babyname dataset will be used to perform tidying and transformation.
link:babynames data ; DOC
Get data:
babynames <- data.frame(read.csv("https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv"))
Look into the imported data
str(babynames)
## 'data.frame': 258000 obs. of 4 variables:
## $ year : int 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
## $ name : Factor w/ 6782 levels "Aaden","Aaliyah",..: 3380 6632 3125 1174 2554 2449 3428 6232 2834 5517 ...
## $ percent: num 0.0815 0.0805 0.0501 0.0452 0.0433 ...
## $ sex : Factor w/ 2 levels "boy","girl": 1 1 1 1 1 1 1 1 1 1 ...
head(babynames)
## year name percent sex
## 1 1880 John 0.081541 boy
## 2 1880 William 0.080511 boy
## 3 1880 James 0.050057 boy
## 4 1880 Charles 0.045167 boy
## 5 1880 George 0.043292 boy
## 6 1880 Frank 0.027380 boy
tail(babynames)
## year name percent sex
## 257995 2008 Diya 0.000128 girl
## 257996 2008 Carleigh 0.000128 girl
## 257997 2008 Iyana 0.000128 girl
## 257998 2008 Kenley 0.000127 girl
## 257999 2008 Sloane 0.000127 girl
## 258000 2008 Elianna 0.000127 girl
Find the most popular names
most_popular_names <- babynames %>%
group_by(sex, name) %>%
summarize(total = sum(percent)) %>%
arrange(desc(total)) %$%
split(., sex)
most_popular_names
## $boy
## Source: local data frame [3,437 x 3]
## Groups: sex [1]
##
## sex name total
## <fctr> <fctr> <dbl>
## 1 boy John 5.299585
## 2 boy James 4.574991
## 3 boy William 4.409453
## 4 boy Robert 3.821662
## 5 boy Charles 2.518147
## 6 boy Michael 2.366102
## 7 boy Joseph 2.292487
## 8 boy David 2.159018
## 9 boy George 2.096747
## 10 boy Thomas 1.901267
## # ... with 3,427 more rows
##
## $girl
## Source: local data frame [4,018 x 3]
## Groups: sex [1]
##
## sex name total
## <fctr> <fctr> <dbl>
## 1 girl Mary 4.511860
## 2 girl Elizabeth 1.392100
## 3 girl Margaret 1.360965
## 4 girl Helen 1.234222
## 5 girl Anna 1.195867
## 6 girl Dorothy 1.065111
## 7 girl Barbara 1.001579
## 8 girl Patricia 0.999798
## 9 girl Ruth 0.942272
## 10 girl Linda 0.837364
## # ... with 4,008 more rows
Visualize the distribution of some popular names
g1 <- babynames %>%
filter(name=="Barbara") %$%
ggplot(., aes(year,percent)) +
geom_line(aes(color=sex), lwd=1) +
scale_color_manual(values = c("firebrick1", "dodgerblue")) +
theme_bw() +
ggtitle("Barbara")
g2 <- babynames %>%
filter(name=="John") %$%
ggplot(., aes(year,percent)) +
geom_line(aes(color=sex), lwd=1) +
scale_color_manual(values = c("firebrick1", "dodgerblue")) +
theme_bw() +
ggtitle("John")
g3 <- babynames %>%
filter(name=="Charles") %$%
ggplot(., aes(year,percent)) +
geom_line(aes(color=sex), lwd=1) +
scale_color_manual(values = c("firebrick1", "dodgerblue")) +
theme_bw() +
ggtitle("Charles")
g4 <- babynames %>%
filter(name=="Jennifer") %$%
ggplot(., aes(year,percent)) +
geom_line(aes(color=sex), lwd=1) +
scale_color_manual(values = c("firebrick1", "dodgerblue")) +
theme_bw() +
ggtitle("Jennifer")
grid.arrange(g1,g2,g3,g4,ncol=2)
Even though some of these names are the most popular, the graph shows that these names are declining as years get close to present time, such as John and Charles. It also shows that some of these name become popular at the forties, such as Barbara, and seventies, such as Jennifer then decline afterword.
The mtcar dataset will be utilized to perform data analysis and transformation
Links: mtcar data ; doc
Get Data:
carsSpecs <- data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/6a7424e677c38c58adc598e5b53812edb2a808a5/mtcars.csv"))
look into the data set
str(carsSpecs)
## 'data.frame': 32 obs. of 12 variables:
## $ X : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: int 4 4 1 1 2 1 4 2 2 4 ...
head(carsSpecs)
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(carsSpecs)
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
Analyze the effect of automatic and manual on gas mileage
subset data:
manual<- carsSpecs %>%
filter(am==1)
head(manual)
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
auto <- carsSpecs%>%
filter(am==0)
head(auto)
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 2 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 3 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 4 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 5 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 6 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Find the means and Standard deviation of:
Manual:
manualmean<-summarise(manual,mean(manual$mpg))
manualmean
## mean(manual$mpg)
## 1 24.39231
manualsd<-summarise(manual,sd(manual$mpg))
manualsd
## sd(manual$mpg)
## 1 6.166504
Automatic:
automean<-summarise(auto,mean(auto$mpg))
automean
## mean(auto$mpg)
## 1 17.14737
autosd<-summarise(auto, sd(auto$mpg))
autosd
## sd(auto$mpg)
## 1 3.833966
Get the percentage increase of the mileage based on the means
(manualmean-automean)/automean
## mean(manual$mpg)
## 1 0.4225103
plot the distribution
ggplot(manual, aes(manual$mpg)) +
geom_histogram(aes(y=..density..),binwidth=4,colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")+
geom_vline(xintercept = as.numeric(manualmean))+
geom_vline(xintercept = as.numeric(manualmean), linetype="dotted", color = "blue", size=1.5)
ggplot(auto, aes(auto$mpg)) +
geom_histogram(aes(y=..density..),binwidth=3,colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")+
geom_vline(xintercept = as.numeric(automean))+
geom_vline(xintercept = as.numeric(automean), linetype="dotted", color = "blue", size=1.5)
Based on the results, the mean of the manual transmission cars is greater than the automated. It also has the highest variance.