# Here I use 3 datasets on topics that are commonly discussed in the group. Based on the insights and comments, I chose datasets related to education, global warming and oil consumption.
# 1) Education: graduation rate as a predictor of jobs. I chose this dataset because it was the topic I used in the Week 2 discussion forum. The table I used as the dataset comes from this page: https://www.theanalysisfactor.com/wide-and-long-data/
# 2) Global warming: temperatures keep rising even though many precautionary steps are being taken. I collected the data from NOAA sites: http://w2.weather.gov/climate/, https://data.noaa.gov/dataset/
# 3) Oil consumption: from crude oil we get engine oil and gearbox oil. The plan is to analyse which oil type is consumed most.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
## Dataset 1 :
# Education dataset: I created a CSV file from the site mentioned above and placed it on GitHub. Here we examine graduation rate as a predictor of jobs.
# GitHub link:
# Load the CSV data, then change the format from wide to long
dataset1_url <- getURL("https://raw.githubusercontent.com/vijay564/R-Maincode/master/Education.csv")
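# Parse the downloaded CSV text into a data frame
table1 <- read.csv(text = dataset1_url)
# Use tbl_df from dplyr to print the table as a tibble (tbl_df is deprecated since dplyr 1.0.0; as_tibble() is the current equivalent)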
tbl_df(table1)
## # A tibble: 5 x 11
## County LandArea NatAmenity College1970 College1980 College1990
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 Autau~ 599 4 0.064 0.121 0.145
## 2 Baldw~ 1578 4 0.065 0.121 0.168
## 3 Barbo~ 891 4 0.073 0.092 0.118
## 4 Bibb 625 3 0.042 0.049 0.047
## 5 Blount 639 4 0.027 0.053 0.07
## # ... with 5 more variables: College2000 <dbl>, Jobs1970 <int>,
## # Jobs1980 <int>, Jobs1990 <int>, Jobs2000 <int>
# View the structure
str(table1)
## 'data.frame': 5 obs. of 11 variables:
## $ County : Factor w/ 5 levels "Autauga","Baldwin",..: 1 2 3 4 5
## $ LandArea : int 599 1578 891 625 639
## $ NatAmenity : int 4 4 4 3 4
## $ College1970: num 0.064 0.065 0.073 0.042 0.027
## $ College1980: num 0.121 0.121 0.092 0.049 0.053
## $ College1990: num 0.145 0.168 0.118 0.047 0.07
## $ College2000: num 0.18 0.231 0.109 0.071 0.096
## $ Jobs1970 : int 6853 19749 9448 3965 7587
## $ Jobs1980 : int 11278 27861 9755 4276 9490
## $ Jobs1990 : int 11471 40809 12163 5564 11811
## $ Jobs2000 : int 16289 70247 15197 6098 16503
# Reshape the data to long format. We split the data into 2 tables, one for colleges and one for jobs, and then join them by county and year.
# Split into 2 tables
table1_colleges <- select(table1, 1:7)
table1_jobs <- select(table1, 1, 8:11)
# Rename the year columns so that each name is just the year
names(table1_colleges) <- c("County", "LandArea", "NatAmenity", "1970", "1980", "1990", "2000")
names(table1_jobs) <- c("County", "1970", "1980", "1990", "2000")
# Transform each table from wide to long format using the gather function from tidyr, and store the results so that we can join them
table1_colleges_long <- gather(table1_colleges, years, college_graduation, 4:7)
table1_jobs_long <- gather(table1_jobs, years, jobs, 2:5)
# Use inner_join from dplyr package and join on county, year
table1_result <- inner_join(table1_colleges_long, table1_jobs_long, by = c("County", "years"))
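The split, gather and join steps above can also be sketched as a single reshape. This is only an illustration, not the method used in this report: it assumes tidyr 1.0.0 or later, where `pivot_longer()` supersedes `gather()`, and it uses a small inline data frame (`edu`, a hypothetical name) holding two counties and two years copied from the table above.

```r
library(tidyr)

# Two counties and two years from the education table, entered inline
# so the sketch runs without downloading the CSV
edu <- data.frame(
  County = c("Autauga", "Baldwin"),
  College1970 = c(0.064, 0.065),
  College1980 = c(0.121, 0.121),
  Jobs1970 = c(6853, 19749),
  Jobs1980 = c(11278, 27861)
)

# Split each column name into a measure ("College"/"Jobs") and a year.
# ".value" tells pivot_longer to keep one output column per measure.
edu_long <- pivot_longer(
  edu,
  cols = -County,
  names_to = c(".value", "years"),
  names_pattern = "([A-Za-z]+)([0-9]+)"
)

print(edu_long)
```

The result has one row per county and year, with `College` and `Jobs` side by side, which is the same shape that the join produces.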
# Plot college graduation rate against number of jobs for each county, to see whether graduation rate tracks the number of jobs
ggplot(table1_result, aes(x = college_graduation, y = jobs, colour = County)) + geom_line() + geom_point()
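To put a number on the relationship shown in the plot, one option is a correlation and a simple linear fit. The sketch below is an assumption on my part rather than part of the original analysis. It uses only the year 2000 values from the `str(table1)` output above, and with 5 counties the result is indicative at best.

```r
# Year 2000 values copied from the str(table1) output, in county order
college_2000 <- c(0.18, 0.231, 0.109, 0.071, 0.096)
jobs_2000 <- c(16289, 70247, 15197, 6098, 16503)

# Pearson correlation between graduation rate and number of jobs
r_2000 <- cor(college_2000, jobs_2000)

# Simple linear model: jobs as a function of graduation rate
fit_2000 <- lm(jobs_2000 ~ college_2000)

print(r_2000)
print(coef(fit_2000))
```

A positive slope here only says that counties with a higher graduation rate also had more jobs in 2000. It does not account for county size, so it should not be read as a causal effect.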
## Dataset 2 :
# Global warming temperature data: here we try to find out the differences in temperature. Read the CSV file from GitHub.
temperaturedata <- read.csv("https://raw.githubusercontent.com/vijay564/R-Maincode/master/noaa_temperature.csv", header = TRUE, stringsAsFactors = FALSE)
tbl_df(temperaturedata)
## # A tibble: 17 x 12
## Period Value Twentieth.Centu~ Departure Low.Rank High.Rank Record.Low
## <int> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 1 73.0 72.1 0.85 91 31 1927
## 2 2 73.4 72.9 0.580 87 35 1992
## 3 3 72.8 71.4 1.35 110 12 1915
## 4 4 69.8 68.6 1.17 108 14 1907
## 5 5 66.5 65.1 1.38 115 7 1907
## 6 6 63.0 61.2 1.8 115 7 1917
## 7 7 58.7 57.2 1.42 109 13 1917
## 8 8 55.5 53.9 1.62 113 9 1912
## 9 9 53.4 51.5 1.88 116 5 1917
## 10 10 52.0 50.5 1.45 111 10 1912
## 11 11 52.4 50.9 1.58 113 8 1912
## 12 12 53.6 52.0 1.55 113 8 1917
## 13 18 56.2 55.1 1.13 107 14 1917
## 14 24 52.8 52.0 0.79 97 23 1918
## 15 36 52.9 52 0.88 100 19 1917
## 16 48 53.5 52.0 1.49 112 6 1920
## 17 60 53.4 52.0 1.39 111 6 1920
## # ... with 5 more variables: Record.High <int>, Lowest.Since <int>,
## # Highest.Since <int>, Percentile <chr>, Ties <chr>
# Tidying: period 60 covers September 2010 to August 2015, period 48 covers September 2011 to August 2015, period 36 covers September 2012 to August 2015, period 24 covers September 2013 to August 2015, and period 12 covers September 2014 to August 2015.
# I keep Value (temperature in F), Twentieth.Century.Mean, Departure (difference from the mean) and High.Rank
# Keep only the multi-year periods (12, 24, 36, 48, 60) and drop the 18-month period
tidy.temperature <- subset(temperaturedata, Period >= 12 & Period != 18, select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy.temperature) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy.temperature
## Period Temp Mean Diff Rank
## 12 12 53.58 52.03 1.55 8
## 14 24 52.81 52.02 0.79 23
## 15 36 52.88 52.00 0.88 19
## 16 48 53.48 51.99 1.49 6
## 17 60 53.37 51.98 1.39 6
# Analyse temperature: sorting by Temp in ascending order shows that the 1-year period (12) was the hottest, followed by the 4-year (48) and 5-year (60) periods. The temperature ordering matches the ordering of the difference from the mean, which is a useful sanity check that the data makes sense and that we are reading it correctly.
arrange(tidy.temperature, Temp)
## Period Temp Mean Diff Rank
## 1 24 52.81 52.02 0.79 23
## 2 36 52.88 52.00 0.88 19
## 3 60 53.37 51.98 1.39 6
## 4 48 53.48 51.99 1.49 6
## 5 12 53.58 52.03 1.55 8
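The sanity check described above can be made explicit. The following is a minimal sketch that re-enters the five rows of the printed `tidy.temperature` table inline (the name `temps` is hypothetical). It checks that `Diff` equals `Temp - Mean` and that sorting by `Temp` gives the same order as sorting by `Diff`.

```r
# Values copied from the tidy.temperature output above
temps <- data.frame(
  Period = c(12, 24, 36, 48, 60),
  Temp = c(53.58, 52.81, 52.88, 53.48, 53.37),
  Mean = c(52.03, 52.02, 52.00, 51.99, 51.98),
  Diff = c(1.55, 0.79, 0.88, 1.49, 1.39)
)

# Diff should be Temp minus Mean; all.equal allows for floating-point error
diff_matches <- isTRUE(all.equal(temps$Temp - temps$Mean, temps$Diff))

# Ranking the periods by Temp should give the same order as ranking by Diff
same_order <- identical(order(temps$Temp), order(temps$Diff))

print(c(diff_matches = diff_matches, same_order = same_order))
```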
## Dataset 3 :
# Oil consumption: here we try to find out the most consumed brand across the 2 categories of oil.
# Read CSV file from Github
oil_url <- getURL("https://raw.githubusercontent.com/vijay564/R-Maincode/master/OilConsumption.csv")
table3 <- read.csv(text = oil_url)
# Use tbl_df function in dplyr package
tbl_df(table3)
## # A tibble: 15 x 8
## X X.1 Caltex X.2 Gulf X.3 Mobil X.4
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Month Category Purchased Consumed Purchased Consum~ Purchas~ Consum~
## 2 Open Engine Oil 140 0 199 0 141 0
## 3 "" GearBox Oil 198 0 132 0 121 0
## 4 Jan Engine Oil 170 103 194 132 109 127
## 5 "" GearBox Oil 132 106 125 105 191 100
## 6 Feb Engine Oil 112 133 138 113 171 101
## 7 "" GearBox Oil 193 148 199 119 134 127
## 8 Mar Engine Oil 184 100 141 141 114 108
## 9 "" GearBox Oil 138 121 172 133 193 115
## 10 Apr Engine Oil 149 150 117 118 117 118
## 11 "" GearBox Oil 185 125 191 133 119 121
## 12 May Engine Oil 170 139 104 119 200 117
## 13 "" GearBox Oil 168 117 138 102 121 146
## 14 Jun Engine Oil 159 129 170 138 169 105
## 15 "" GearBox Oil 107 129 195 141 141 112
# First we split the table into purchased and consumed values, repeating the common Month and Category columns. Then we rename the columns and remove the first row of each table. Finally, we fill in the Month column using the fill() function from the tidyr package.
# split the table into 2
table3_purchased <- select(table3, 1:3, 5, 7)
table3_consumed <- select(table3, 1:2, 4, 6, 8)
# Rename the columns to Month, Category and the brand names
names(table3_purchased) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
names(table3_consumed) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
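# Remove the first row, which holds the second header line (Purchased/Consumed)
table3_purchased <- table3_purchased[-1, ]
table3_consumed <- table3_consumed[-1, ]
# Fill blank Month cells from the row above and convert the brand columns to numeric
clean_oil <- function(df) {
  df$Month <- as.character(df$Month)
  df$Month[df$Month == ""] <- NA  # fill() only fills NA, not empty strings
  df <- fill(df, Month)
  df$Month <- factor(df$Month, levels = unique(df$Month))  # keep the months in file order when plotting
  df[3:5] <- lapply(df[3:5], function(x) as.numeric(as.character(x)))
  df
}
table3_purchased <- clean_oil(table3_purchased)
table3_consumed <- clean_oil(table3_consumed)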
# Finally, we look at how the 2 categories of oil are consumed for Mobil by plotting consumption per month for each category
ggplot(table3_consumed, aes(x = Month, y = Mobil, fill = factor(Category))) + geom_bar(stat = "identity", position = "dodge")
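The plot above covers Mobil only. To compare all three brands within each oil category, one possible approach is to reshape the consumed table to long format and total it per category and brand. The sketch below is an assumption about how this could be done, not the original analysis. It uses only the Jan and Feb consumed values from the table above, entered inline under the hypothetical name `consumed`, and it assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Jan and Feb consumed values copied from the oil table above
consumed <- data.frame(
  Month = c("Jan", "Jan", "Feb", "Feb"),
  Category = c("Engine Oil", "GearBox Oil", "Engine Oil", "GearBox Oil"),
  Caltex = c(103, 106, 133, 148),
  Gulf = c(132, 105, 113, 119),
  Mobil = c(127, 100, 101, 127)
)

# One row per month, category and brand, then total consumption
# per category and brand, largest first within each category
totals <- consumed %>%
  pivot_longer(c(Caltex, Gulf, Mobil), names_to = "Brand", values_to = "Consumed") %>%
  group_by(Category, Brand) %>%
  summarise(Total = sum(Consumed), .groups = "drop") %>%
  arrange(Category, desc(Total))

print(totals)
```

Run on the full cleaned table instead of this two-month sample, the first row within each category would name the most consumed brand for that oil type.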