Project 2

# Here I use the 3 datasets that were most commonly discussed in the group. Based on the insights and comments, I chose datasets related to education, global warming and oil consumption.

# 1) Education: graduation rate is treated as a predictor of jobs. I chose this dataset because it was the topic I used in the Week 2 discussion forum. The content comes from this page, whose table I used as the dataset: https://www.theanalysisfactor.com/wide-and-long-data/
# 2) Global warming: temperatures are rising and have a large impact on global warming. Even though many precautionary steps are being taken, temperatures keep rising. I collected the data from the NOAA sites http://w2.weather.gov/climate/ and https://data.noaa.gov/dataset/
# 3) Oil consumption: crude oil yields engine oil and gear oil. The plan is to analyse which oil type is consumed the most.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(RCurl)

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

## Dataset 1 :
# Education dataset: I created a CSV file from the site mentioned above and placed it on GitHub. Here we look at graduation rate as a predictor of jobs.
# GitHub link : 

# Load the CSV data, then change the format from wide to long.

dataset1_url <- getURL("https://raw.githubusercontent.com/vijay564/R-Maincode/master/Education.csv")

# Parse the downloaded CSV text into a data frame.
table1 <- read.csv(text = dataset1_url)
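The `text` argument of `read.csv()` parses CSV content that is already held in a string, which is why the result of `getURL()` can be passed to it. A minimal offline sketch, assuming a hypothetical inline string `csv_text` that stands in for the downloaded file (three columns and two rows copied from the Education table):

```r
# csv_text is a hypothetical stand-in for the string returned by getURL()
csv_text <- "County,LandArea,NatAmenity\nBibb,625,3\nBlount,639,4"

# read.csv(text = ...) parses the string as if it were the contents of a file
demo_table <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(demo_table)
```

`read.csv()` also accepts a URL directly, which is how Dataset 2 is loaded.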

# Show the data as a tibble. as_tibble() is re-exported by dplyr and replaces the deprecated tbl_df()
as_tibble(table1)

## # A tibble: 5 x 11
##   County LandArea NatAmenity College1970 College1980 College1990
##   <fct>     <int>      <int>       <dbl>       <dbl>       <dbl>
## 1 Autau~      599          4       0.064       0.121       0.145
## 2 Baldw~     1578          4       0.065       0.121       0.168
## 3 Barbo~      891          4       0.073       0.092       0.118
## 4 Bibb        625          3       0.042       0.049       0.047
## 5 Blount      639          4       0.027       0.053       0.07 
## # ... with 5 more variables: College2000 <dbl>, Jobs1970 <int>,
## #   Jobs1980 <int>, Jobs1990 <int>, Jobs2000 <int>

# View the structure
str(table1)

## 'data.frame':    5 obs. of  11 variables:
##  $ County     : Factor w/ 5 levels "Autauga","Baldwin",..: 1 2 3 4 5
##  $ LandArea   : int  599 1578 891 625 639
##  $ NatAmenity : int  4 4 4 3 4
##  $ College1970: num  0.064 0.065 0.073 0.042 0.027
##  $ College1980: num  0.121 0.121 0.092 0.049 0.053
##  $ College1990: num  0.145 0.168 0.118 0.047 0.07
##  $ College2000: num  0.18 0.231 0.109 0.071 0.096
##  $ Jobs1970   : int  6853 19749 9448 3965 7587
##  $ Jobs1980   : int  11278 27861 9755 4276 9490
##  $ Jobs1990   : int  11471 40809 12163 5564 11811
##  $ Jobs2000   : int  16289 70247 15197 6098 16503

# Reshape the data to long format. We split the data into 2 tables, one for colleges and one for jobs, and later join them by county and year.
# Split into 2 tables
table1_colleges <- select(table1, 1:7)
table1_jobs <- select(table1, 1, 8:11)

# Rename the year columns so that each name is just the year
names(table1_colleges) <- c("County", "LandArea", "NatAmenity", "1970", "1980", "1990", "2000")
names(table1_jobs) <- c("County", "1970", "1980", "1990", "2000")

# Transform each table from wide to long format using the gather() function from tidyr, and store the results so that we can join them
table1_colleges_long <- gather(table1_colleges, years, college_graduation, 4:7)
table1_jobs_long <- gather(table1_jobs, years, jobs, 2:5)

# Use inner_join from dplyr package and join on county, year
table1_result <- inner_join(table1_colleges_long, table1_jobs_long, by = c("County", "years"))

# Plot college graduation rate against the number of jobs to see how well it works as a predictor
ggplot(table1_result, aes(x=college_graduation, y=jobs, colour = County)) + geom_line() +geom_point()
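The split, gather and join steps for Dataset 1 can also be collapsed into a single reshape. This is a sketch rather than the method used in this project: it assumes tidyr 1.0.0 or later, where `pivot_longer()` is available, and uses a small hypothetical data frame `edu_wide` holding two counties and two years copied from the Education table.

```r
library(tidyr)

# Hypothetical subset: two counties and two years from the Education table
edu_wide <- data.frame(
  County      = c("Bibb", "Blount"),
  College1970 = c(0.042, 0.027),
  College1980 = c(0.049, 0.053),
  Jobs1970    = c(3965, 7587),
  Jobs1980    = c(4276, 9490),
  stringsAsFactors = FALSE
)

# ".value" tells pivot_longer() that the text part of each column name
# (College, Jobs) becomes an output column, while the four digits become the year
edu_long <- pivot_longer(
  edu_wide,
  cols = -County,
  names_to = c(".value", "year"),
  names_pattern = "([A-Za-z]+)(\\d{4})"
)
edu_long
```

The result has one row per county and year with `College` and `Jobs` side by side, which is the same shape the inner join produces.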

## Dataset 2 :
# Global warming temperature data: here we try to find out the differences in temperature. Read the CSV file from GitHub.

temperaturedata <- read.csv("https://raw.githubusercontent.com/vijay564/R-Maincode/master/noaa_temperature.csv", header = TRUE, stringsAsFactors = FALSE)
as_tibble(temperaturedata)     # as_tibble() replaces the deprecated tbl_df()

## # A tibble: 17 x 12
##    Period Value Twentieth.Centu~ Departure Low.Rank High.Rank Record.Low
##     <int> <dbl>            <dbl>     <dbl>    <int>     <int>      <int>
##  1      1  73.0             72.1     0.85        91        31       1927
##  2      2  73.4             72.9     0.580       87        35       1992
##  3      3  72.8             71.4     1.35       110        12       1915
##  4      4  69.8             68.6     1.17       108        14       1907
##  5      5  66.5             65.1     1.38       115         7       1907
##  6      6  63.0             61.2     1.8        115         7       1917
##  7      7  58.7             57.2     1.42       109        13       1917
##  8      8  55.5             53.9     1.62       113         9       1912
##  9      9  53.4             51.5     1.88       116         5       1917
## 10     10  52.0             50.5     1.45       111        10       1912
## 11     11  52.4             50.9     1.58       113         8       1912
## 12     12  53.6             52.0     1.55       113         8       1917
## 13     18  56.2             55.1     1.13       107        14       1917
## 14     24  52.8             52.0     0.79        97        23       1918
## 15     36  52.9             52       0.88       100        19       1917
## 16     48  53.5             52.0     1.49       112         6       1920
## 17     60  53.4             52.0     1.39       111         6       1920
## # ... with 5 more variables: Record.High <int>, Lowest.Since <int>,
## #   Highest.Since <int>, Percentile <chr>, Ties <chr>

# Tidying: each period is a number of months ending in August 2015. Period 60 is September 2010 to August 2015; period 48 is September 2011 to August 2015; period 36 is September 2012 to August 2015; period 24 is September 2013 to August 2015; and period 12 is September 2014 to August 2015.
# I keep Value (temperature in Fahrenheit), Twentieth.Century.Mean, Departure (difference from the mean) and High.Rank

tidy.temperature <- subset(temperaturedata, Period >= 12, select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy.temperature) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy.temperature <- tidy.temperature[-2, ]     # drop the second row, which is the 18-month period
tidy.temperature

##    Period  Temp  Mean Diff Rank
## 12     12 53.58 52.03 1.55    8
## 14     24 52.81 52.02 0.79   23
## 15     36 52.88 52.00 0.88   19
## 16     48 53.48 51.99 1.49    6
## 17     60 53.37 51.98 1.39    6
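Dropping the 18-month period by position works only while it stays in the second row. A hedged alternative is to filter on the value of `Period` itself. The sketch below uses a hypothetical data frame `temps` with the `Period` and `Departure` values copied from the NOAA table, and assumes that only whole-year periods are wanted.

```r
# Hypothetical subset: Period and Departure values from the NOAA table
temps <- data.frame(
  Period    = c(12, 18, 24, 36, 48, 60),
  Departure = c(1.55, 1.13, 0.79, 0.88, 1.49, 1.39)
)

# Keep whole-year periods by value, so the row order no longer matters
whole_years <- subset(temps, Period >= 12 & Period %% 12 == 0)
whole_years$Period
```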

# Analyse temperature: arrange() sorts in ascending order, so the hottest period is in the last row. The 1-year period was the hottest, followed by the 4-year and 5-year periods. The temperature ordering matches the ordering of the differences from the mean, which is a useful sanity check that the data is consistent and that we are reading it correctly.
arrange(tidy.temperature, Temp)

##   Period  Temp  Mean Diff Rank
## 1     24 52.81 52.02 0.79   23
## 2     36 52.88 52.00 0.88   19
## 3     60 53.37 51.98 1.39    6
## 4     48 53.48 51.99 1.49    6
## 5     12 53.58 52.03 1.55    8
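The sanity check described for this dataset can be written down explicitly. A small sketch, assuming a hypothetical data frame `check` that holds the five rows printed by `tidy.temperature`: it tests that `Diff` equals `Temp - Mean` and that sorting by `Temp` gives the same row order as sorting by `Diff`.

```r
# Hypothetical copy of the five rows printed for tidy.temperature
check <- data.frame(
  Period = c(12, 24, 36, 48, 60),
  Temp   = c(53.58, 52.81, 52.88, 53.48, 53.37),
  Mean   = c(52.03, 52.02, 52.00, 51.99, 51.98),
  Diff   = c(1.55, 0.79, 0.88, 1.49, 1.39)
)

# Diff should equal Temp - Mean, up to floating-point rounding
diff_matches <- isTRUE(all.equal(check$Temp - check$Mean, check$Diff))

# Sorting by Temp and sorting by Diff should give the same row order
same_order <- identical(order(check$Temp), order(check$Diff))

c(diff_matches = diff_matches, same_order = same_order)
```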

## Dataset 3 :
# Oil consumption: here we try to find the most consumed brand across the 2 categories of oil.
# Read the CSV file from GitHub
oil_url <- getURL("https://raw.githubusercontent.com/vijay564/R-Maincode/master/OilConsumption.csv")
table3 <- read.csv(text = oil_url)

# Show the data as a tibble. as_tibble() is re-exported by dplyr and replaces the deprecated tbl_df()
as_tibble(table3)

## # A tibble: 15 x 8
##    X     X.1         Caltex    X.2      Gulf      X.3     Mobil    X.4    
##    <fct> <fct>       <fct>     <fct>    <fct>     <fct>   <fct>    <fct>  
##  1 Month Category    Purchased Consumed Purchased Consum~ Purchas~ Consum~
##  2 Open  Engine Oil  140       0        199       0       141      0      
##  3 ""    GearBox Oil 198       0        132       0       121      0      
##  4 Jan   Engine Oil  170       103      194       132     109      127    
##  5 ""    GearBox Oil 132       106      125       105     191      100    
##  6 Feb   Engine Oil  112       133      138       113     171      101    
##  7 ""    GearBox Oil 193       148      199       119     134      127    
##  8 Mar   Engine Oil  184       100      141       141     114      108    
##  9 ""    GearBox Oil 138       121      172       133     193      115    
## 10 Apr   Engine Oil  149       150      117       118     117      118    
## 11 ""    GearBox Oil 185       125      191       133     119      121    
## 12 May   Engine Oil  170       139      104       119     200      117    
## 13 ""    GearBox Oil 168       117      138       102     121      146    
## 14 Jun   Engine Oil  159       129      170       138     169      105    
## 15 ""    GearBox Oil 107       129      195       141     141      112

# First we split the table into purchased and consumed values, repeating the common Month and Category columns in both. Then we rename the columns and remove the first row of each table. Then we fill in the Month column using the fill() function from the tidyr package.
# Split the table into 2
table3_purchased <- select(table3, 1:3, 5, 7)
table3_consumed <- select(table3, 1:2, 4, 6, 8)

# Rename the columns so that each value column carries its brand name
names(table3_purchased) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
names(table3_consumed) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")

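# Remove the first row, which holds the Purchased/Consumed sub-header
table3_purchased <- table3_purchased[-c(1), ]
table3_consumed <- table3_consumed[-c(1), ]

# Fill in the Month column of the consumed table: blank months become NA so that fill() can carry the month down
table3_consumed$Month <- na_if(as.character(table3_consumed$Month), "")
table3_consumed <- fill(table3_consumed, Month)
# The values were read as factors because of the sub-header row, so convert Mobil to numbers before plotting
table3_consumed$Mobil <- as.numeric(as.character(table3_consumed$Mobil))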

# Finally, we look at how the 2 categories of oil are consumed for Mobil by plotting a bar chart of consumption per month
ggplot(table3_consumed, aes(x = Month, y = Mobil, fill = factor(Category))) + geom_bar(stat = "identity", position = "dodge")
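The stated goal for this dataset is to find the most consumed brand in each oil category, which needs the brands in long format and a total per category. The following is a sketch under assumptions, not the analysis carried out here: it uses a hypothetical data frame `consumed` with only the Jan and Feb consumed values copied from the oil table, writes blank months as `NA`, and assumes tidyr 1.0.0 or later for `pivot_longer()`.

```r
library(tidyr)
library(dplyr)

# Hypothetical subset: Jan and Feb consumed values from the oil table
consumed <- data.frame(
  Month    = c("Jan", NA, "Feb", NA),
  Category = c("Engine Oil", "GearBox Oil", "Engine Oil", "GearBox Oil"),
  Caltex   = c(103, 106, 133, 148),
  Gulf     = c(132, 105, 113, 119),
  Mobil    = c(127, 100, 101, 127),
  stringsAsFactors = FALSE
)

# fill() carries each month down to the row below it,
# then pivot_longer() stacks the three brand columns into Brand/Consumed pairs
consumed_long <- consumed %>%
  fill(Month) %>%
  pivot_longer(cols = c(Caltex, Gulf, Mobil),
               names_to = "Brand", values_to = "Consumed")

# Total consumption per category and brand, largest first within each category
totals <- consumed_long %>%
  count(Category, Brand, wt = Consumed, name = "Total") %>%
  arrange(Category, desc(Total))
totals
```

The first row of each category in `totals` is the most consumed brand for that subset. Running the same steps on the full table would answer the question for all months.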

Vijaya Cherukuri

October 7, 2018
