Choose any three of the “wide” datasets identified in the Week 6/7 Discussion item. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
===================================================================================================
I chose the NOAA “Climate at a Glance” dataset to analyze how temperatures in the US have varied since 1895, and to ask whether it is valid to claim that we face more extreme temperatures now than 50 or 70 years ago. I loaded the table directly from the website URL, extracted it with an XPath expression, transformed the HTML table into a data frame, and renamed the columns.
library(xml2)
library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(sqldf)
library(RColorBrewer)
library(ggthemes)
library(IS606)
theurl <- "http://www.ncdc.noaa.gov/cag/time-series/us"
# Scrape the NOAA page and pull out the data table by its id attribute
tempTable <- theurl %>%
  read_html() %>%
  html_node(xpath = '//*[@id="values"]') %>%
  html_table(fill = TRUE)
tempTab <- tbl_df(tempTable)
colnames(tempTab) <- c("Date", "Value", "Rank", "Anomaly")
Next I cleaned the data to make it consumable and ready for visualization: I used the stringr package to strip the special characters, then converted the dates from integers to Date format. Once the dataset is clean, it can be exported to .CSV.
# Strip the degree symbol and convert the readings to numeric
tempTab$Value <- as.numeric(str_replace_all(tempTab$Value, pattern = "°F", ""))
tempTab$Anomaly <- as.numeric(str_replace_all(tempTab$Anomaly, pattern = "°F", ""))
# Dates arrive as YYYYMM integers; rewrite them as YYYY-MM-01 and parse
tempTab$Date <- as.Date(gsub('^([0-9]{4})([0-9]{2})$', '\\1-\\2-01', tempTab$Date))
write.csv(tempTab, file = "tempTab.csv")
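Since the assignment calls for reading the data back from the .CSV file, here is a minimal round-trip sketch (assuming the file was written by the write.csv() call above, which stores row names in an unnamed first column):
# Read the exported file back in; row.names = 1 consumes the row-name column
tempTab_check <- read.csv("tempTab.csv", row.names = 1, stringsAsFactors = FALSE)
# Dates come back as character, so re-parse them
tempTab_check$Date <- as.Date(tempTab_check$Date)
str(tempTab_check)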
Since the data span 120 years, they are easier to analyze when grouped into decades, taking the mean temperature for each decade.
# Assign each year to a decade bucket starting at 1895
year1 <- 1895:2015
group <- seq(1895, 2015, by = 10)
indx <- findInterval(year1, group)
ind <- seq_along(group)
# Build "start-end" labels for each bucket; the last label ends in NA for now
labl1 <- paste(group[ind], group[ind + 1], sep = "-")
tempTab1 <- data.frame(year = year1, decade = labl1[indx],
                       Value = tempTab$Value, Anomaly = tempTab$Anomaly,
                       stringsAsFactors = FALSE)
# Close the open-ended final bucket: "2005-NA" becomes "2005-2015"
tempTab1$decade <- str_replace(tempTab1$decade, pattern = "NA", "2015")
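A quick spot-check (my own addition, not part of the original workflow) confirms that the first and last years landed in the intended decade buckets:
# Inspect the year-to-decade mapping at both ends of the range
head(tempTab1[, c("year", "decade")], 3)
tail(tempTab1[, c("year", "decade")], 3)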
Once the years were sorted into decades, I moved them into a new data frame with Year, Decade, Value, and Anomaly columns for further analysis, and used the sqldf package to group the data and compute the decade-wise means of temperature and anomaly.
# Decade-level means of temperature and anomaly
newTab <- sqldf("select decade, AVG(Value) AveTemp, AVG(Anomaly) AveAno from tempTab1 GROUP BY decade")
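Since the assignment emphasizes dplyr, the same aggregation can also be written as a dplyr pipeline (an equivalent sketch, shown as a cross-check against the sqldf result):
# Decade-level means via group_by/summarise; should match newTab
tempTab1 %>%
  group_by(decade) %>%
  summarise(AveTemp = mean(Value), AveAno = mean(Anomaly))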
I then plotted a bar chart with ggplot to see the variation in temperature across the century. The result is not very conclusive: the data are quite consistent, with no pronounced peaks in the graph below.
ggplot(data = newTab, aes(x = decade, y = AveTemp, fill = "Average Temperature Per Decade")) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Reds") +
  theme_hc(bgcolor = "darkunica") +
  scale_colour_hc("darkunica") +
  ggtitle("Temperature variation per Decade") +
  ylab("Average Temperature (Degree F)")
summary(newTab)
## decade AveTemp AveAno
## Length:13 Min. :63.99 Min. :-0.8720
## Class :character 1st Qu.:64.66 1st Qu.:-0.1960
## Mode :character Median :64.94 Median : 0.0810
## Mean :65.27 Mean : 0.4122
## 3rd Qu.:65.40 3rd Qu.: 0.5390
## Max. :68.54 Max. : 3.6800
Conclusion: From this analysis alone it is hard to conclude that there has been global warming or a drastic temperature increase in the last few decades, as the decade-level plot shows little variation. We would need to look at other factors to gauge the environmental impact of recent decades.
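To put a rough number on “not much variation”, the spread of the decade means can be checked directly (a supplementary computation of my own, not part of the original analysis):
# Range and standard deviation of the decade-level means
max(newTab$AveTemp) - min(newTab$AveTemp)
sd(newTab$AveTemp)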
So I plotted a normal distribution chart to assess the normality of the data; it comes out close to a normal distribution, with two outlying data points and most of the data falling near the mean.
s <- sd(newTab$AveTemp)
m <- mean(newTab$AveTemp)
par(mfrow = c(1, 2))
# Histogram of decade means with a normal density curve overlaid
hist(newTab$AveTemp, probability = TRUE, ylim = c(0, 0.5))
x <- 50:100
y <- dnorm(x = x, mean = m, sd = s)
lines(x = x, y = y, col = "blue")
# Simulated Q-Q plots (from IS606) plus a reference line for the observed data
qqnormsim(newTab$AveTemp)
qqline(newTab$AveTemp)
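As a supplementary check (not part of the original analysis), a Shapiro-Wilk test from base R can corroborate the visual assessment, though with only 13 decade means its power is limited:
# Shapiro-Wilk test: a non-significant p-value is consistent with normality
shapiro.test(newTab$AveTemp)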
==================================================================================================================
This dataset was a smaller one, but the data were quite interesting: the overall revenue of the Big 5 IT companies, as shown below.
Since it was not an HTML table that could be scraped, I experimented with creating a database table using the sqldf package, and then reshaped it into long format for analysis with tidyr and dplyr.
library(sqldf)
library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
# Create the revenue table in an in-memory SQLite database and read it back
big5_IT_rev <- sqldf(c(
  "DROP TABLE IF EXISTS IT_revenue",
  "CREATE TABLE IT_revenue (ID serial PRIMARY KEY, Year int(10) NOT NULL, Google int NULL, Facebook int NULL, Yahoo int NULL, Microsoft int NULL, AOL int NULL)",
  "INSERT INTO IT_revenue (ID, Year, Google, Facebook, Yahoo, Microsoft, AOL) VALUES ('1','2009','0.36','0.56','1.26','0.37','0.51'), ('2','2010','0.86','1.21','1.43','0.51','0.47'), ('3','2011','1.67','1.73','1.36','0.6','0.53'), ('4','2012','2.26','2.18','1.35','0.9','0.7'), ('5','2013','2.99','3.17','1.27','0.79','0.73')",
  "SELECT * FROM IT_revenue"))
# Reshape from wide (one column per company) to long (Company/Revenue pairs)
big5_IT_rev <- gather(big5_IT_rev, "Company", "Revenue", 3:7)
big5_IT_rev
## ID Year Company Revenue
## 1 1 2009 Google 0.36
## 2 2 2010 Google 0.86
## 3 3 2011 Google 1.67
## 4 4 2012 Google 2.26
## 5 5 2013 Google 2.99
## 6 1 2009 Facebook 0.56
## 7 2 2010 Facebook 1.21
## 8 3 2011 Facebook 1.73
## 9 4 2012 Facebook 2.18
## 10 5 2013 Facebook 3.17
## 11 1 2009 Yahoo 1.26
## 12 2 2010 Yahoo 1.43
## 13 3 2011 Yahoo 1.36
## 14 4 2012 Yahoo 1.35
## 15 5 2013 Yahoo 1.27
## 16 1 2009 Microsoft 0.37
## 17 2 2010 Microsoft 0.51
## 18 3 2011 Microsoft 0.60
## 19 4 2012 Microsoft 0.90
## 20 5 2013 Microsoft 0.79
## 21 1 2009 AOL 0.51
## 22 2 2010 AOL 0.47
## 23 3 2011 AOL 0.53
## 24 4 2012 AOL 0.70
## 25 5 2013 AOL 0.73
Now I have a long-format table of the Big 5 IT revenues, which I need to summarize as the total revenue generated over the five fiscal years to identify the company with the highest revenue.
So I started with interval creation (reusing the binning approach from the first dataset) and then used sqldf to group the data per company.
# Reuse the interval-binning approach to label the 2009-2013 span
year1 <- 2009:2013
group <- seq(2009, 2013, by = 4)
indx <- findInterval(year1, group)
ind <- seq_along(group)
labl1 <- paste(group[ind], group[ind + 1], sep = "-")
tempTab1 <- data.frame(year = year1, decade = labl1[indx],
                       Revenue = big5_IT_rev$Revenue,
                       Company = big5_IT_rev$Company,
                       stringsAsFactors = FALSE)
# Close the open-ended final label: "2013-NA" becomes "2009-2013"
tempTab1$decade <- str_replace_all(tempTab1$decade, pattern = "2013-NA", "2009-2013")
# Total revenue per company over the full span
com_Yr_Rev <- sqldf("select decade Fiscal_Years, Company, sum(Revenue) Total_Revenue from tempTab1 group by Company")
com_Yr_Rev
## Fiscal_Years Company Total_Revenue
## 1 2009-2013 AOL 2.94
## 2 2009-2013 Facebook 8.85
## 3 2009-2013 Google 8.14
## 4 2009-2013 Microsoft 3.17
## 5 2009-2013 Yahoo 6.67
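For comparison, the same totals can be reproduced with a dplyr pipeline (an equivalent sketch, not the method used above):
# Total revenue per company via group_by/summarise; should match com_Yr_Rev
big5_IT_rev %>%
  group_by(Company) %>%
  summarise(Total_Revenue = sum(Revenue)) %>%
  arrange(desc(Total_Revenue))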
Finally, I used ggplot to draw a bar chart of the company-wise revenue, and it surprised me to see Facebook as the winner with $8.85 billion over the last five years.
ggplot(data = com_Yr_Rev, aes(x = Fiscal_Years, y = Total_Revenue, fill = Company)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Total Revenue in last 5 Fiscal Years") +
  ylab("Revenue (2009-2013) in Billions")