price vs. x

Notes:

# In this problem set, you'll continue
# to explore the diamonds data set.

# Your first task is to create a
# scatterplot of price vs x
# using the ggplot syntax.

# Import Library
library(ggplot2)

# Load the dataset
data("diamonds")

# Check the Docs on the dataset
# ?diamonds

# Get the first few records
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
# Inspiration

# Create scatterplot - using QPlot
# qplot(x = age, y = friend_count, data = pf)
#
# Create scatterplot - using GGPlot
# ggplot(aes(x = age, y = friend_count), data = pf)  + geom_point()

# Create scatterplot of price (y-axis) vs x (x-axis) - using GGPlot
ggplot(aes(x = x, y = price), data = diamonds)  + geom_point()

# Correlation between Price and X
cor.test(diamonds$price, diamonds$x, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  diamonds$price and diamonds$x
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8825835 0.8862594
## sample estimates:
##       cor 
## 0.8844352
# Correlation between Price and Y
cor.test(diamonds$price, diamonds$y, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  diamonds$price and diamonds$y
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8632867 0.8675241
## sample estimates:
##       cor 
## 0.8654209
# Correlation between Price and Z
cor.test(diamonds$price, diamonds$z, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  diamonds$price and diamonds$z
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8590541 0.8634131
## sample estimates:
##       cor 
## 0.8612494

NOTE:

Did you notice some outliers and an exponential relationship between price and x?

Question Template

(Round Your Answers to 2 Decimals)

Q: What is the correlation between price and X?

A: 0.88

Q: What is the correlation between price and Y?

A: 0.87

Q: What is the correlation between price and Z?

A: 0.86
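
The three coefficients can also be computed in one pass. The following is a minimal sketch, assuming ggplot2 is installed; the names dims and price_cor are illustrative and not part of the original exercise. It uses cor(), which returns only the Pearson coefficient, rather than cor.test().

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Columns holding the three dimensions of each diamond
dims <- c("x", "y", "z")

# Pearson correlation of price with each dimension
price_cor <- sapply(dims, function(v) cor(diamonds$price, diamonds[[v]]))

# Round to 2 decimals, as the question template asks
round(price_cor, 2)
```

The rounded values should match the answers above.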


price vs. depth

Notes:

# Create a simple scatter plot of price vs depth.

# Create scatterplot - using GGPlot
ggplot(aes(x = depth, y = price), data = diamonds)  + geom_point()

Visual Adjustments - price vs. depth

Notes:

# Change the code to make the transparency of the
# points to be 1/100 of what they are now and mark
# the x-axis every 2 units. See the instructor notes
# for two hints.

# Inspiration:
#  
# Jitter
# ggplot(aes(x = age, y = friend_count), data = pf)  + geom_jitter(alpha = 1/20) + xlim(13, 90)

# Create scatterplot - using GGPlot, set alpha to 1/100 and mark the x-axis every 2 units
ggplot(aes(x = depth, y = price), data = diamonds)  + geom_jitter(alpha = 1/100) +
  scale_x_continuous(breaks = seq(0, 80, 2))

Typical Depth Range

Based on the scatterplot of depth vs price, most of the diamonds are between what values of depth?

Lower Limit: 58

Upper Limit: 64
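
The limits above were read off the plot. A quick numeric check, sketched here under the assumption that 58 and 64 are the limits of interest, is to compute the share of diamonds whose depth falls in that range. The names lower, upper, and share_in_range are illustrative.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

lower <- 58
upper <- 64

# Fraction of diamonds whose depth lies within [lower, upper]
share_in_range <- mean(diamonds$depth >= lower & diamonds$depth <= upper)
share_in_range
```

A share close to 1 would support the eyeballed limits.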


Correlation - price and depth

Notes:

# Correlation between depth and price
cor.test(diamonds$depth, diamonds$price, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  diamonds$depth and diamonds$price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019084756 -0.002208537
## sample estimates:
##        cor 
## -0.0106474

Correlation - price and depth

Q: What is the correlation of depth vs price?

A: -0.01

Q: Based upon the correlation coefficient, would you use depth to predict the price of a diamond?

A: No.

Why?

Because the correlation is essentially zero (about -0.01), depth has almost no linear relationship with price. The negative sign is too weak to be useful for prediction.
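
To put the coefficient in perspective, squaring a Pearson correlation gives the share of variance in one variable that a straight-line fit on the other accounts for. The sketch below applies that to the value reported by cor.test() above; the variable names are illustrative.

```r
r <- -0.0106474        # correlation of depth and price reported above
r_squared <- r^2

# Percent of price variance linked linearly to depth
100 * r_squared
```

The result is roughly 0.01 percent, which supports the answer above.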


price vs. carat

Notes:

# Create a scatterplot of price vs carat
# and omit the top 1% of price and carat
# values.

# Inspiration
# p1 <- ggplot(aes(x = age, y = friend_count_mean), 
#       data = subset(pf.fc_by_age, age < 71)) 

# Inspiration
# ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point() +
#  xlim(0, quantile(pf$www_likes_received, 0.95)) +
#  ylim(0, quantile(pf$likes_received, 0.95))


# Create scatterplot - using GGPlot, change the transparency and cap both axes at the 99th percentile
ggplot(aes(x = carat, y = price), data = diamonds )  + geom_jitter(alpha = 1/30) +
  xlim(0, quantile(diamonds$carat, 0.99)) +
  ylim(0, quantile(diamonds$price, 0.99))
## Warning: Removed 940 rows containing missing values (geom_point).

# What is the max carat size?
max(diamonds$carat)
## [1] 5.01
# What is the max price?
max(diamonds$price)
## [1] 18823

Question Template

Q: Does the x-axis for carat extend past 2? It should!

A: Yes it does.
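
The warning about removed rows in the output above appears because xlim() and ylim() drop points that fall outside the limits. An alternative, sketched here with illustrative names (carat_cap, price_cap, diamonds_trimmed), is to subset the data first and then plot the result without axis limits.

```r
library(ggplot2)
data(diamonds)

# 99th percentile cut-offs for carat and price
carat_cap <- quantile(diamonds$carat, 0.99)
price_cap <- quantile(diamonds$price, 0.99)

# Keep only diamonds at or below both cut-offs
diamonds_trimmed <- subset(diamonds, carat <= carat_cap & price <= price_cap)

# Number of rows dropped by the filter
nrow(diamonds) - nrow(diamonds_trimmed)

# Plot the trimmed data; no xlim()/ylim() needed
p <- ggplot(diamonds_trimmed, aes(x = carat, y = price)) +
  geom_point(alpha = 1/30)
p
```

Filtering up front makes the number of excluded diamonds explicit instead of leaving it to a plotting warning.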


price vs. volume

Notes:

# Create a scatterplot of price vs. volume (x * y * z).
# This is a very rough approximation for a diamond's volume.

# Create a new variable for volume in the diamonds data frame.
# This will be useful in a later exercise.

# Don't make any adjustments to the plot just yet.


# Create scatterplot - using GGPlot
ggplot(aes(x = (x * y * z), y = price), data = diamonds)  + geom_point()

Findings - price vs. volume

What were your observations from the price vs. volume scatterplot?

Response:

The points form a very dense, tightly packed cluster at low volumes, and a couple of extreme outliers stretch the x-axis so far that the main trend is hard to read.

** Grader Notes: **

Did you notice some outliers? Some volumes are 0!

There’s an expensive diamond with a volume near 4000, and a cheaper diamond with a volume near 900.

You can find out how many diamonds have 0 volume by using count(diamonds$volume == 0). The count() function comes with the plyr package.

Note: If you ran the count function from plyr, you need to run this command in R to unload the plyr package: detach("package:plyr", unload=TRUE)

The plyr package will conflict with the dplyr package in later exercises.

Depending on your investigation, it may or may not be important for you to understand how outliers, like these, came to be in your data.


Grader Hints Experiments - how many diamonds have 0 volume?

Notes:

# Instead of using PLYR, one can get the job done with simple tables

#Create the table "volume"
volume <- table(diamonds$x * diamonds$y * diamonds$z)

# Then subset it (which gives a count)
volume[names(volume)==0]
##  0 
## 20

How many diamonds have 0 volume?

Answer: 20
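
The same count can be obtained without building a table, because summing a logical vector counts its TRUE values. A minimal sketch follows; the name volume here is an illustrative local vector, not a column of the data frame.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Rough volume as the product of the three dimensions
volume <- with(diamonds, x * y * z)

# TRUE counts as 1, so the sum is the number of zero-volume diamonds
sum(volume == 0)
```

This should agree with the answer of 20 above.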


Correlations on Subsets

Notes:

# Spot Checks
#
# What is the max volume
# max(diamonds[,11])

# What is the min volume
# min(diamonds[,11])


# Create a new column in the DataFrame called Volume :  (X * Y * Z)
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z

# Subset Experiments
# subset(diamonds, volume > 0 &  volume < 800)

# CHANGE: Create a new data frame excluding diamonds of volume zero and of volume >= 800
# Note: I had to create a new DF because of a mismatch in existing vs new rows in the original DF
#       when I ran this cmd in place. Using a new DF cleared it up.
diamonds.no_outliers_volume <- subset(diamonds, volume > 0 &  volume < 800)


# Correlation between Price and volume on new DF
cor.test(diamonds.no_outliers_volume$price, (diamonds.no_outliers_volume$volume), method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  diamonds.no_outliers_volume$price and (diamonds.no_outliers_volume$volume)
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9222944 0.9247772
## sample estimates:
##       cor 
## 0.9235455

Correlations on Subsets Question

What is the correlation of price and volume? Exclude diamonds of volume zero or of volume >= 800.

Response: 0.92
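
To see how much the excluded diamonds matter, the correlation can be computed with and without them. This is a sketch using the cut-offs stated in the question (volume of 0 and volume >= 800 excluded); the names trimmed, cor_all, and cor_trimmed are illustrative.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Rough volume as the product of the three dimensions
diamonds$volume <- with(diamonds, x * y * z)

# Drop zero-volume diamonds and those with volume >= 800
trimmed <- subset(diamonds, volume > 0 & volume < 800)

cor_all <- cor(diamonds$price, diamonds$volume)
cor_trimmed <- cor(trimmed$price, trimmed$volume)

round(c(all = cor_all, trimmed = cor_trimmed), 2)
```

The "trimmed" entry should match the response above.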


Adjustments - price vs. volume

Notes:

# Subset the data to exclude diamonds with a volume
# greater than or equal to 800. Also, exclude diamonds
# with a volume of 0. 

# Adjust the transparency of the points and add a 
# linear model to the plot. (See the Instructor Notes 
# or look up the documentation of geom_smooth() for more 
# details about smoothers.)

# We encourage you to think about this next question.

# Do you think this would be a useful model to estimate
# the price of diamonds? Why or why not?

library(gridExtra)

# CHANGE: Create a new data frame excluding diamonds of volume zero and of volume >= 800
# Note: I had to create a new DF because of a mismatch in existing vs new rows in the original DF
#       when I ran this cmd in place. Using a new DF cleared it up.
diamonds.no_outliers_volume <- subset(diamonds, volume > 0 &  volume < 800)


# Create scatterplot - using GGPlot, with a linear model overlaid
ggplot(aes(x = volume, y = price), data = diamonds.no_outliers_volume)  + 
  geom_point(alpha = .05, position = position_jitter(h = 0), color = 'orange') +
  geom_smooth(method = "lm")

# Experiments:
#

p1 <- ggplot(data = diamonds.no_outliers_volume,
       aes(x = volume, y = price)) +
  geom_point() 

# Default smoother
p2 <- p1 + geom_smooth()

# looking at a linear fit,
p3 <- p1 + stat_smooth(method = "lm", formula = y ~ x, size = 1) + coord_cartesian(ylim = c(0,20000))

# Looking at polynomial functions of order 2
p4 <- p1 + stat_smooth(method = "lm", formula = y ~ poly(x, 2), size = 1) + coord_cartesian(ylim = c(0,20000))

# Looking at polynomial functions of order 3
p5 <- p1 + stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1) + coord_cartesian(ylim = c(0,20000))

library(gridExtra)

grid.arrange(p2,p3,p4,p5,ncol =2)

Question Template

Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

Does the linear model seem to be a good fit to the data? Share your thoughts.

Response:

This would be a useful model because the line appears to be a pretty good fit to the data.
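
One way to back up a visual judgment is to fit the linear model directly and look at R-squared. The sketch below assumes the same subset as the plot; the names trimmed and fit are illustrative. For a regression with a single predictor, R-squared equals the squared Pearson correlation, so it should come out near 0.92^2, or about 0.85.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Rough volume, then the same exclusions as in the plot
diamonds$volume <- with(diamonds, x * y * z)
trimmed <- subset(diamonds, volume > 0 & volume < 800)

# Ordinary least squares fit of price on volume
fit <- lm(price ~ volume, data = trimmed)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # share of price variance explained
```

A high R-squared alone does not show that a straight line is the right shape, which is why the polynomial fits in the experiments above are worth comparing.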


Mean Price by Clarity

Notes:

# Use functions from the dplyr package
# to create a new data frame containing
# info on diamonds by clarity.

# Name the data frame diamondsByClarity

# The data frame should contain the following
# variables in this order.

#       (1) mean_price
#       (2) median_price
#       (3) min_price
#       (4) max_price
#       (5) n

# where n is the number of diamonds in each
# level of clarity.

# This assignment WILL BE automatically
# graded!

# DO NOT ALTER THE NEXT THREE LINES OF CODE.
# ======================================================
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
data(diamonds)
# head(diamonds)

# ENTER YOUR CODE BELOW THIS LINE
# ======================================================


# Inspiration
#
# pf.fc_by_age_months <-pf %>%
#  group_by(age_with_months) %>%
#  summarize( friend_count_mean = mean(friend_count),
#             friend_count_median = median(friend_count),
#             n = n()) %>%
#  arrange(age_with_months)
#
# head(pf.fc_by_age_months)


diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarize(mean_price = mean(price), 
            median_price =  median(price),
            min_price = min(price),
            max_price = max(price),
            n = n())
    
  
# Experiments:  
#
# colnames(diamonds)
# diamonds['cut']
# diamonds[,2]
  
# Spot Check:
head(diamondsByClarity, n = 6)
## Source: local data frame [6 x 6]
## 
##   clarity mean_price median_price min_price max_price     n
##    (fctr)      (dbl)        (dbl)     (int)     (int) (int)
## 1      I1   3924.169         3344       345     18531   741
## 2     SI2   5063.029         4072       326     18804  9194
## 3     SI1   3996.001         2822       326     18818 13065
## 4     VS2   3924.989         2054       334     18823 12258
## 5     VS1   3839.455         2005       327     18795  8171
## 6    VVS2   3283.737         1311       336     18768  5066
tail(diamondsByClarity, n = 6)
## Source: local data frame [6 x 6]
## 
##   clarity mean_price median_price min_price max_price     n
##    (fctr)      (dbl)        (dbl)     (int)     (int) (int)
## 1     SI1   3996.001         2822       326     18818 13065
## 2     VS2   3924.989         2054       334     18823 12258
## 3     VS1   3839.455         2005       327     18795  8171
## 4    VVS2   3283.737         1311       336     18768  5066
## 5    VVS1   2523.115         1093       336     18777  3655
## 6      IF   2864.839         1080       369     18806  1790
# Spot Check: Min / Max Price
# min(diamonds$price[diamonds$clarity == 'I1'])
# max(diamonds['price'])
# max(subset(diamonds['price'], diamonds['clarity'] == 'I1'))

# Spot Check
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Spot Check
# diamondsByClarity
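
Because the assignment is graded automatically and the column order matters, a quick structural check can catch mistakes before submitting. The sketch below rebuilds the summary and asserts the expected column names; the vector expected_names is illustrative and not part of the assignment.

```r
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
data(diamonds)

diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarize(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n())

# Grouping column first, then the five variables in the requested order
expected_names <- c("clarity", "mean_price", "median_price",
                    "min_price", "max_price", "n")

# Fail loudly if the columns are missing or out of order
stopifnot(identical(names(diamondsByClarity), expected_names))

# The group sizes should add up to the full data set
stopifnot(sum(diamondsByClarity$n) == nrow(diamonds))
```

Both checks pass silently when the data frame has the required shape.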

Bar Charts of Mean Price

Notes:

# We've created summary data frames with the mean price
# by clarity and color below. You can run the code in R to
# verify what data is in the variables diamonds_mp_by_clarity
# and diamonds_mp_by_color.

# Your task is to write additional code to create two bar plots
# on one output image using the grid.arrange() function from the package
# gridExtra.

# This assignment is not graded and
# will be marked as correct when you submit.

# See the Instructor Notes for more info on bar charts
# and for a hint on this task.

# DO NOT DELETE THE LINES OF CODE BELOW
# ===================================================================
data(diamonds)
library(dplyr)

diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))

diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))

# ENTER YOUR CODE BELOW THIS LINE
# ===================================================================


# Mean Price across Clarity and Color

p1  <- ggplot(diamonds_mp_by_clarity, aes(x = clarity, y = mean_price, fill= clarity)) +
  geom_bar(stat = "identity")


p2 <- ggplot(diamonds_mp_by_color, aes(x = color, y = mean_price, fill= color)) +
  geom_bar(stat = "identity")


grid.arrange(p1,p2, ncol =2)

# Mean Price Adding in Cut

p3  <- ggplot(diamonds_mp_by_cut, aes(x = cut, y = mean_price, fill= cut)) +
  geom_bar(stat = "identity")



grid.arrange(p1,p2, p3, ncol = 3)

Question Template

What do you notice in each of the bar charts for mean price by clarity and mean price by color?

Response:

SI2 has the highest mean price in the clarity group, as does J in the color group. Both plots seem to be slightly skewed.

Grader's Comments:

We think something odd is going on here. These trends seem to go against our intuition.

Mean price tends to decrease as clarity improves. The same can be said for color.

We encourage you to look into the mean price across cut.

UPDATE

The cut bar chart seems to show some oddities as well: none of the trends match common-sense intuition.
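
One avenue worth exploring, offered as a guess rather than as the graders' explanation, is that diamond size varies across the groups. The sketch below puts mean carat next to mean price for each clarity level; the name by_clarity is illustrative.

```r
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
data(diamonds)

# Mean price and mean carat side by side for each clarity level
by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            mean_carat = mean(carat),
            n = n())

by_clarity
```

If mean carat falls as clarity improves, size could account for part of the pattern in the bar charts.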


Gap Minder Data EXPERIMENTS

Notes:

library(dplyr)

# Load the Plot Library
library(ggplot2)

# Read the  CSV file, create the dataframe
cpi_df <- read.csv("corruption_perception.csv", header = TRUE, row.names = 1, check.names = TRUE)

# Drop the 3 columns we do not want.
cpi_df <- subset(cpi_df, select = -c(X.1, X.2, X.3))


# Convert Dataset Rows names to explicit variable, name the column "Countries"
cpi_df <- cpi_df %>% add_rownames(var = "COUNTRIES")


# Spot Check
str(cpi_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    180 obs. of  3 variables:
##  $ COUNTRIES: chr  "New Zealand" "Denmark" "Sweden" "Singapore" ...
##  $ X2008    : num  9.4 9.3 9.2 9.2 9 8.9 8.9 8.7 8.7 8.7 ...
##  $ X2009    : num  9.3 9.3 9.2 9.3 8.7 9.2 8.8 8.5 8.9 8.7 ...
glimpse(cpi_df)
## Observations: 180
## Variables: 3
## $ COUNTRIES (chr) "New Zealand", "Denmark", "Sweden", "Singapore", "Sw...
## $ X2008     (dbl) 9.4, 9.3, 9.2, 9.2, 9.0, 8.9, 8.9, 8.7, 8.7, 8.7, 8....
## $ X2009     (dbl) 9.3, 9.3, 9.2, 9.3, 8.7, 9.2, 8.8, 8.5, 8.9, 8.7, 8....
# Get Quick Summary
summary(cpi_df)
##   COUNTRIES             X2008           X2009      
##  Length:180         Min.   :1.100   Min.   :1.100  
##  Class :character   1st Qu.:2.500   1st Qu.:2.400  
##  Mode  :character   Median :3.300   Median :3.300  
##                     Mean   :4.031   Mean   :4.007  
##                     3rd Qu.:5.125   3rd Qu.:5.025  
##                     Max.   :9.400   Max.   :9.300  
##                                     NA's   :4
# Transpose the DataFrame - Probably not necessary.
cpi_df.T <- t(cpi_df)


### PLOTTING


# Histogram of 2008 CPI scores
qplot(x = X2008, data = cpi_df, binwidth = .1, color = I('blue'), fill = I('#F79420')) +
  # scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Corruption Perception Index, 2008') +
  ylab('Number of countries')

# Line Plot
ggplot(data=cpi_df, aes(x=X2008, y=X2009, group=1)) +
  geom_line()+
  geom_point()
## Warning: Removed 4 rows containing missing values (geom_point).

# Box Plot
qplot(x=X2008, y=X2009,
      data = cpi_df, 
      geom = 'boxplot')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

# Convert to Local DataFrame
cpi_df.ldf <- tbl_df(cpi_df)


# Rename Columns
# Inspiration: colnames(df)[colnames(df) == 'oldName'] <- 'newName'
colnames(cpi_df.ldf)[colnames(cpi_df.ldf) == 'X2008'] <- 'CPI_2008'
colnames(cpi_df.ldf)[colnames(cpi_df.ldf) == 'X2009'] <- 'CPI_2009'


### Create Aggregated Variables

# Inspiration
# dplyr approach (prints the new variable but does not store it)
# Mutate - articulate new variable name "speed", and when referring to existing column 
# names, no $ needed.
# Note: the select is not explicitly needed, just for readibility
# 
# flights %>%
#    select(Distance, AirTime) %>%
#    mutate(Speed = Distance/AirTime*60)
#
# store the new variable
#flights <- flights %>% mutate(Speed = Distance/AirTime*60)


# Create a Column called "HIGHLY_TRUSTED", with >= 9.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(COUNTRIES, CPI_2008, CPI_2009) %>%
   mutate(HIGHLY_TRUSTED = as.integer(CPI_2008 >= 9.0))


# Create a Column called "TRUSTED", with  >= 8.0 & < 9.0  CPIfactor
cpi_df.ldf <- cpi_df.ldf %>%
  select(COUNTRIES, CPI_2008, CPI_2009, HIGHLY_TRUSTED) %>%
   mutate(TRUSTED = as.integer(CPI_2008 >= 8.0 &  CPI_2008 < 9.0))


# Create a Column called "SOMEWHAT_TRUSTED", with >= 7.0 & < 8.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(SOMEWHAT_TRUSTED = as.integer(CPI_2008 >= 7.0 & CPI_2008 < 8.0))


# Create a Column called "SUSPICIOUS", with >= 6.0 & < 7.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(SUSPICIOUS = as.integer(CPI_2008 >= 6.0 & CPI_2008 < 7.0))


# Create a Column called "QUESTIONABLE", with >= 5.0 & < 6.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(QUESTIONABLE = as.integer(CPI_2008 >= 5.0 & CPI_2008 < 6.0))


# Create a Column called "UNTRUSTED", with >= 4.0 & < 5.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(UNTRUSTED = as.integer(CPI_2008 >= 4.0 & CPI_2008 < 5.0))



# Create a Column called "REALLY_UNTRUSTED", with >= 3.0 & < 4.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(REALLY_UNTRUSTED = as.integer(CPI_2008 >= 3.0 & CPI_2008 < 4.0))


# Create a Column called "HIGHLY_UNTRUSTED", with < 3.0 CPI factor
cpi_df.ldf <- cpi_df.ldf %>%
  select(everything()) %>%
   mutate(HIGHLY_UNTRUSTED =  as.integer(CPI_2008 < 3.0))
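
The indicator columns above can also be created in a single mutate() call, since mutate() accepts several new columns at once. The sketch below uses a small made-up data frame (scores, with placeholder country names) rather than the CPI file, and shows only three of the eight columns.

```r
suppressMessages(library(dplyr))

# Hypothetical sample data, not taken from the CPI file
scores <- data.frame(COUNTRIES = c("A", "B", "C"),
                     CPI_2008 = c(9.4, 8.7, 2.5))

# Several 0/1 indicator columns created in one mutate() call
scores <- scores %>%
  mutate(HIGHLY_TRUSTED   = as.integer(CPI_2008 >= 9.0),
         TRUSTED          = as.integer(CPI_2008 >= 8.0 & CPI_2008 < 9.0),
         HIGHLY_UNTRUSTED = as.integer(CPI_2008 < 3.0))

scores
```

The select(everything()) step used above keeps every column, so it can be dropped without changing the result.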


# Spot Check
glimpse(cpi_df.ldf)
## Observations: 180
## Variables: 11
## $ COUNTRIES        (chr) "New Zealand", "Denmark", "Sweden", "Singapor...
## $ CPI_2008         (dbl) 9.4, 9.3, 9.2, 9.2, 9.0, 8.9, 8.9, 8.7, 8.7, ...
## $ CPI_2009         (dbl) 9.3, 9.3, 9.2, 9.3, 8.7, 9.2, 8.8, 8.5, 8.9, ...
## $ HIGHLY_TRUSTED   (int) 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ TRUSTED          (int) 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ SOMEWHAT_TRUSTED (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ SUSPICIOUS       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ QUESTIONABLE     (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ UNTRUSTED        (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ REALLY_UNTRUSTED (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ HIGHLY_UNTRUSTED (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
str(cpi_df.ldf)
## Classes 'tbl_df', 'tbl' and 'data.frame':    180 obs. of  11 variables:
##  $ COUNTRIES       : chr  "New Zealand" "Denmark" "Sweden" "Singapore" ...
##  $ CPI_2008        : num  9.4 9.3 9.2 9.2 9 8.9 8.9 8.7 8.7 8.7 ...
##  $ CPI_2009        : num  9.3 9.3 9.2 9.3 8.7 9.2 8.8 8.5 8.9 8.7 ...
##  $ HIGHLY_TRUSTED  : int  1 1 1 1 1 0 0 0 0 0 ...
##  $ TRUSTED         : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ SOMEWHAT_TRUSTED: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SUSPICIOUS      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUESTIONABLE    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UNTRUSTED       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REALLY_UNTRUSTED: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HIGHLY_UNTRUSTED: int  0 0 0 0 0 0 0 0 0 0 ...
# Inspiration
# as.integer(as.logical(TRUE))


# Histogram of 2008 CPI scores
qplot(x = X2008, data = cpi_df, binwidth = .1, color = I('blue'), fill = I('#F79420')) +
  xlab('Corruption Perception Index, 2008') +
  ylab('Number of countries')

# dplyr Analysis Experiment
# filter(cpi_df.ldf, HIGHLY_TRUSTED == 1)



# Map a single CPI score to a trust designation ("NA" for missing scores)
trustLevel <- function(cpi) {
  if (is.na(cpi)) {
    # Guard first: comparing NA inside if() would raise an error
    designation <- "NA"
  } else if (cpi >= 9.0) {
    designation <- "HT"
  } else if (cpi >= 8.0) {
    designation <- "T"
  } else if (cpi >= 7.0) {
    designation <- "ST"
  } else if (cpi >= 6.0) {
    designation <- "Q"
  } else if (cpi >= 5.0) {
    designation <- "S"
  } else if (cpi >= 4.0) {
    designation <- "UT"
  } else if (cpi >= 3.0) {
    designation <- "HUT"
  } else if (cpi >= 2.0) {
    designation <- "CUT"
  } else {
    designation <- "CPT"
  }
  return(designation)
}


# trustLevel(8.1)


# Map a function across every value of a column to populate another
cpi_df.ldf$TRUST <- sapply(cpi_df.ldf$CPI_2008,trustLevel)
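
sapply() calls trustLevel() once per score. A vectorized alternative is base R's cut(), sketched below with made-up scores; the names cpi, breaks, labels, and trust are illustrative. Setting right = FALSE makes each interval closed on the left, which matches the >= comparisons in trustLevel().

```r
# Hypothetical sample scores, one per trust band
cpi <- c(9.4, 8.7, 7.2, 6.5, 5.1, 4.0, 3.3, 2.9, 1.1)

# Band edges and labels, listed from lowest band to highest
breaks <- c(-Inf, 2, 3, 4, 5, 6, 7, 8, 9, Inf)
labels <- c("CPT", "CUT", "HUT", "UT", "S", "Q", "ST", "T", "HT")

# right = FALSE gives intervals of the form [lower, upper)
trust <- as.character(cut(cpi, breaks = breaks, labels = labels, right = FALSE))
trust
```

Missing scores come back as NA from cut(), so no separate fallback branch is needed.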


#### PLOTTING *****

# Create scatterplot - Countries vs Trust
ggplot(aes(x = TRUST, y = COUNTRIES), data = cpi_df.ldf)  + geom_point() + 
  xlab('TRUST') +
  ylab('Countries')

# Create a BarChart - Trust vs 2008 CPI
ggplot(aes(x = TRUST, y = CPI_2008), data = cpi_df.ldf)  + geom_bar(stat="identity") +
  xlab('TRUST') +
  ylab('CPI 2008')

# Create a BarChart - Countries vs 2008 CPI
ggplot(aes(x = COUNTRIES, y = CPI_2008), data = cpi_df.ldf)  + geom_bar(stat="identity") +
  xlab('Countries') +
  ylab('CPI 2008')

# Create a Line Chart Trust vs CPI 2008
ggplot(data=cpi_df.ldf, aes(x=TRUST, y=CPI_2008, group=1)) +
  geom_line()+
  geom_point() +
  xlab('TRUST Index') +
  ylab('CPI 2008')

# Create a Line Chart Countries vs TRUST
ggplot(data=cpi_df.ldf, aes(x=COUNTRIES, y=TRUST, group=1)) +
  geom_line()+
  geom_point() +
  xlab('Countries') +
  ylab('TRUST')

GAP Minder Data - Corruption Index

Notes:

# Load dplyr
library(dplyr)

# Load the Plot Library
library(ggplot2)

# Read the  CSV file, create the dataframe
cpi_df <- read.csv("corruption_perception.csv", header = TRUE, row.names = 1, check.names = TRUE)

# Drop the 3 columns we do not want.
cpi_df <- subset(cpi_df, select = -c(X.1, X.2, X.3))


# Convert Dataset Rows names to explicit variable, name the column "Countries"
cpi_df <- cpi_df %>% add_rownames(var = "COUNTRIES")

# Convert to Local DataFrame
cpi_df.ldf <- tbl_df(cpi_df)

# Rename Columns
# Inspiration: colnames(df)[colnames(df) == 'oldName'] <- 'newName'
colnames(cpi_df.ldf)[colnames(cpi_df.ldf) == 'X2008'] <- 'CPI_2008'
colnames(cpi_df.ldf)[colnames(cpi_df.ldf) == 'X2009'] <- 'CPI_2009'

# Map a single CPI score to a trust designation ("NA" for missing scores)
trustLevel <- function(cpi) {
  if (is.na(cpi)) {
    # Guard first: comparing NA inside if() would raise an error
    designation <- "NA"
  } else if (cpi >= 9.0) {
    designation <- "HT"
  } else if (cpi >= 8.0) {
    designation <- "T"
  } else if (cpi >= 7.0) {
    designation <- "ST"
  } else if (cpi >= 6.0) {
    designation <- "Q"
  } else if (cpi >= 5.0) {
    designation <- "S"
  } else if (cpi >= 4.0) {
    designation <- "UT"
  } else if (cpi >= 3.0) {
    designation <- "HUT"
  } else if (cpi >= 2.0) {
    designation <- "CUT"
  } else {
    designation <- "CPT"
  }
  return(designation)
}

# Map a function across every value of a column to populate another
cpi_df.ldf$TRUST <- sapply(cpi_df.ldf$CPI_2008,trustLevel)

#### PLOTTING *****

# Create scatterplot - Countries vs Trust
ggplot(aes(x = TRUST, y = COUNTRIES), data = cpi_df.ldf)  + geom_point() + 
  xlab('TRUST') +
  ylab('Countries')

# Create a BarChart - Trust vs 2008 CPI
ggplot(aes(x = TRUST, y = CPI_2008), data = cpi_df.ldf)  + geom_bar(stat="identity") +
  xlab('TRUST') +
  ylab('CPI 2008')

# Create a BarChart - Countries vs 2008 CPI
ggplot(aes(x = COUNTRIES, y = CPI_2008), data = cpi_df.ldf)  + geom_bar(stat="identity") +
  xlab('Countries') +
  ylab('CPI 2008')

# Create a Line Chart Trust vs CPI 2008
ggplot(data=cpi_df.ldf, aes(x=TRUST, y=CPI_2008, group=1)) +
  geom_line()+
  geom_point() +
  xlab('TRUST Index') +
  ylab('CPI 2008')

# Create a Line Chart Countries vs TRUST
ggplot(data=cpi_df.ldf, aes(x=COUNTRIES, y=TRUST, group=1)) +
  geom_line()+
  geom_point() +
  xlab('Countries') +
  ylab('TRUST')