Analysis02

For one-dimensional summaries, there are a number of options in R:

Five-number summary: This gives the minimum, 25th percentile, median, 75th percentile, and maximum of the data and is a quick check on the distribution of the data (see the fivenum() function).
Boxplots: Boxplots are a visual representation of the five-number summary plus a bit more information. In particular, boxplots commonly plot outliers that go beyond the bulk of the data. This is implemented via the boxplot() function.
Barplot: Barplots are useful for visualizing categorical data, with the height of each bar proportional to the number of entries in that category. Think “piechart” but actually useful. The barplot can be made with the barplot() function.
Histograms: Histograms show the complete empirical distribution of the data, beyond the five data points shown by the boxplots. Here, you can easily check skewness, symmetry, multi-modality, and other features of the data. The hist() function makes a histogram, and a handy companion to it is the rug() function.
Density plot: The density() function computes a non-parametric estimate of the distribution of a variable.
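
The fivenum() and density() functions listed above can be tried directly on a numeric vector. The following is a minimal sketch using the built-in mtcars data; the variable names x, five and d are illustrative only. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Sketch: five-number summary and density estimate for one numeric variable
x <- mtcars$mpg

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# kernel density estimate (Gaussian kernel by default)
d <- density(x)
plot(d, main="Density of mpg")
rug(x)  # mark the individual observations along the x-axis
```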

df <- mtcars
str(df)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

names(df)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

# boxplot of miles per gallon; mpg is already numeric, so no conversion is needed
boxplot(df$mpg, col="blue")
# mark the mean with a red horizontal line
abline(h=mean(df$mpg, na.rm=TRUE), col="red")

# a histogram gives more detail about the shape of the distribution
hist(df$disp, col="green", breaks=10)

# rug() marks every data point beneath the histogram, showing where the bulk of the data and the outliers lie
rug(df$disp)

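# breaks sets the approximate number of bars: more breaks give narrower bars and a rougher histogram
# too few breaks hide the shape of the distribution; too many make the histogram noisy
# if a numeric column were stored as text with commas (as the Mileage column was in another data set), clean it first:
#df$Mileage <- gsub(",", "", df$Mileage)   # remove commas
#df$Mileage <- as.numeric(df$Mileage)      # convert to numbers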
 
hist(df$mpg, col="green", breaks=10)
rug(df$mpg)
abline(v=median(df$mpg), col ="lightgray", lwd=4)
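
To see the effect of breaks, the same variable can be drawn with several settings side by side. This is a sketch only; the three values of breaks are arbitrary choices for illustration, and a single number passed to breaks is treated by hist() as a suggestion rather than an exact bin count.

```r
# Sketch: the same data with few, moderate, and many bins
x <- mtcars$mpg

op <- par(mfrow=c(1,3), mar=c(4,4,2,1))   # remember the old settings
h_few  <- hist(x, breaks=3,  col="green", main="breaks = 3")
h_mid  <- hist(x, breaks=10, col="green", main="breaks = 10")
h_many <- hist(x, breaks=30, col="green", main="breaks = 30")
par(op)                                   # restore the old settings

# number of bins actually used in each case
c(length(h_few$counts), length(h_mid$counts), length(h_many$counts))
```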

# vs is the engine shape (0 = V-shaped, 1 = straight)
barplot(table(df$vs), col="wheat", main="Number of Cars by Engine Shape (vs)")

# multiple boxplots
# boxplot() can take a formula: the left-hand side is the continuous variable to plot,
# and the right-hand side is the categorical variable that splits it into groups
boxplot(mpg~vs, data=df, col="red")
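
The groups in a formula boxplot are easier to read when the grouping variable is a labelled factor. A minimal sketch, assuming the coding documented in ?mtcars (vs: 0 = V-shaped, 1 = straight); the names cars and engine are introduced here only for illustration.

```r
# Sketch: label the grouping variable before plotting
cars <- mtcars
cars$engine <- factor(cars$vs, levels=c(0, 1), labels=c("V-shaped", "straight"))

# boxplot() invisibly returns the statistics it plotted
b <- boxplot(mpg ~ engine, data=cars, col="red",
             xlab="Engine shape", ylab="Miles per gallon")
b$names  # group labels shown on the x-axis
b$n      # number of observations in each group
```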

# multiple histograms
par(mfrow = c(2,1), mar=c(4, 4, 2, 1))
# mfrow = c(nrows, ncols) sets the layout of the plot grid
# mar = c(bottom, left, top, right) gives the margin, in lines of text, on each side of the plot; the default is c(5, 4, 4, 2) + 0.1

hist(subset(df, cyl > 4)$gear, col="green")
hist(subset(df, cyl <= 4)$gear, col="blue")
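
Settings made with par() stay in effect for every later plot on the same device. One common pattern, sketched here with the same two histograms, is to save the value par() returns and restore it afterwards; the name op and the plot titles are illustrative choices.

```r
# Sketch: change the layout temporarily, then restore it
op <- par(mfrow=c(2,1), mar=c(4,4,2,1))   # par() returns the previous values
hist(subset(mtcars, cyl > 4)$gear,  col="green", main="cyl > 4",  xlab="gear")
hist(subset(mtcars, cyl <= 4)$gear, col="blue",  main="cyl <= 4", xlab="gear")
par(op)                                   # back to the previous layout and margins
```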

# scatterplot
with(df, plot(cyl, carb, col="blue"))
abline(h=mean(df$carb), lwd=2, lty=2, col="red")
# color the points by a third variable to add another dimension
with(df, plot(cyl, carb, col=gear))
# carb is on the y-axis, so the horizontal reference line is the mean of carb
abline(h=mean(df$carb), lwd=2, lty=2, col="red")
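
When color encodes a third variable, a legend makes the mapping explicit. The following is a sketch under the assumption that gear is treated as a category; the color vector cols and the plotting symbol are arbitrary illustrative choices.

```r
# Sketch: color points by gear and add a matching legend
gear_f <- factor(mtcars$gear)
cols <- c("black", "red", "blue")   # one color per level of gear_f

with(mtcars, plot(cyl, carb, col=cols[as.integer(gear_f)], pch=19))
abline(h=mean(mtcars$carb), lwd=2, lty=2, col="gray")
legend("topleft", legend=levels(gear_f), col=cols, pch=19, title="gear")
```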

# multiple scatterplots (panel of plots)
par(mfrow=c(1,2), mar=c(5,4,2,1))
with(subset(df, hp>100), plot(mpg, cyl, main="hp > 100"))
with(subset(df, hp<=100), plot(mpg, cyl, main="hp <= 100"))

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.3

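# qplot() is deprecated in recent ggplot2 releases; the equivalent ggplot() call is:
ggplot(df, aes(mpg, cyl)) + geom_point() + facet_grid(. ~ vs)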

Country <- c("Egypt","UK","France")
gdp.1960 <- c(1,2,3)
gdp.1970 <- c(2,4,6)
mydat <- data.frame(Country,gdp.1960,gdp.1970)
mydat # wide format

##   Country gdp.1960 gdp.1970
## 1   Egypt        1        2
## 2      UK        2        4
## 3  France        3        6

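# long format
dff <- reshape(data = mydat, varying = list(2:3), v.names = "gdp", direction = "long")
# Note:
# varying gives the columns (here by index) that hold the time-varying measurements
# v.names gives the name of the column that holds those measurements in the long format
# direction gives the target format, either "long" or "wide"
# reshape() also adds a time column (1 = 1960, 2 = 1970) and an id column (one value per country)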
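
With the call above, the time column holds 1 and 2 rather than the years. A possible variant, sketched here, passes times, timevar and idvar so that the long table carries the actual year and uses Country as the identifier; the names mydat2 and long are illustrative.

```r
# Sketch: long format with a real year column
mydat2 <- data.frame(Country  = c("Egypt", "UK", "France"),
                     gdp.1960 = c(1, 2, 3),
                     gdp.1970 = c(2, 4, 6))

long <- reshape(mydat2,
                varying   = c("gdp.1960", "gdp.1970"),  # columns to stack
                v.names   = "gdp",                      # name of the stacked column
                timevar   = "year",                     # name of the time column
                times     = c(1960, 1970),              # value of year for each stacked column
                idvar     = "Country",                  # identifies each unit
                direction = "long")
long
```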

library(rworldmap)

## Warning: package 'rworldmap' was built under R version 3.2.3

## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')

# create a map-shaped window
mapDevice('x11')
# join the data to a coarse-resolution world map
# nameJoinColumn is the column in dff that holds the country names; joinCode="NAME" matches them against the map by country name
spdf <- joinCountryData2Map(dff, joinCode="NAME", nameJoinColumn="Country")

## 6 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 240 codes from the map weren't represented in your data

# the color of each country on the map indicates its value of dff$id
mapCountryData(spdf, nameColumnToPlot="id", catMethod="fixedWidth")

Noha Elprince

February 15, 2016
