Param Singh IS607 - week6 assignment

  1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
#install.packages("ggplot2")
#install.packages("alr4")
require(ggplot2)
## Loading required package: ggplot2
require(alr4)
## Loading required package: alr4
## Loading required package: car
## Warning: package 'car' was built under R version 3.1.3
## Loading required package: effects
## 
## Attaching package: 'effects'
## 
## The following object is masked from 'package:car':
## 
##     Prestige
head(salary)
##    degree rank    sex year ysdeg salary
## 1 Masters Prof   Male   25    35  36350
## 2 Masters Prof   Male   13    22  35350
## 3 Masters Prof   Male   10    23  28200
## 4 Masters Prof Female    7    27  26775
## 5     PhD Prof   Male   19    30  33696
## 6 Masters Prof   Male   16    21  28516
# Identify which variables in the salary dataset are numeric and which
# are categorical (factors)
str(salary)
## 'data.frame':    52 obs. of  6 variables:
##  $ degree: Factor w/ 2 levels "Masters","PhD": 1 1 1 1 2 1 2 1 2 2 ...
##  $ rank  : Factor w/ 3 levels "Asst","Assoc",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
##  $ year  : int  25 13 10 7 19 16 0 16 13 13 ...
##  $ ysdeg : int  35 22 23 27 30 21 32 18 30 31 ...
##  $ salary: int  36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
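The classification read off `str()` above can also be done programmatically. The following is a minimal sketch, assuming only the six rows printed by `head(salary)` so that it runs without `alr4`; `salary_head`, `numeric_vars`, and `categorical_vars` are illustrative names, not objects from the assignment.

```r
# Rebuild the six rows shown by head(salary); `salary_head` is a hypothetical
# stand-in for the full alr4 dataset.
salary_head <- data.frame(
  degree = factor(c("Masters", "Masters", "Masters", "Masters", "PhD", "Masters"),
                  levels = c("Masters", "PhD")),
  rank   = factor(rep("Prof", 6), levels = c("Asst", "Assoc", "Prof")),
  sex    = factor(c("Male", "Male", "Male", "Female", "Male", "Male"),
                  levels = c("Male", "Female")),
  year   = c(25L, 13L, 10L, 7L, 19L, 16L),
  ysdeg  = c(35L, 22L, 23L, 27L, 30L, 21L),
  salary = c(36350L, 35350L, 28200L, 26775L, 33696L, 28516L)
)

# Flag each column by type, then keep the matching column names.
is_num <- vapply(salary_head, is.numeric, logical(1))
is_fac <- vapply(salary_head, is.factor, logical(1))

numeric_vars     <- names(salary_head)[is_num]
categorical_vars <- names(salary_head)[is_fac]

print(numeric_vars)
print(categorical_vars)
```

With the full dataset loaded, the same two `vapply()` calls should work on `salary` directly.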
  2. Generate summary-level descriptive statistics: show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in the data set.
summary(salary)
##      degree      rank        sex          year            ysdeg      
##  Masters:34   Asst :18   Male  :38   Min.   : 0.000   Min.   : 1.00  
##  PhD    :18   Assoc:14   Female:14   1st Qu.: 3.000   1st Qu.: 6.75  
##               Prof :20               Median : 7.000   Median :15.50  
##                                      Mean   : 7.481   Mean   :16.12  
##                                      3rd Qu.:11.000   3rd Qu.:23.25  
##                                      Max.   :25.000   Max.   :35.00  
##      salary     
##  Min.   :15000  
##  1st Qu.:18247  
##  Median :23719  
##  Mean   :23798  
##  3rd Qu.:27258  
##  Max.   :38045
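`summary()` reports these statistics for every column at once. When only the numeric columns are wanted in a single matrix, one possible approach is sketched below. It assumes the six rows printed by `head(salary)` rather than the full dataset, so the numbers differ from the output above; `salary_head` and `describe()` are hypothetical names.

```r
# Numeric columns from the six rows shown by head(salary).
salary_head <- data.frame(
  year   = c(25L, 13L, 10L, 7L, 19L, 16L),
  ysdeg  = c(35L, 22L, 23L, 27L, 30L, 21L),
  salary = c(36350L, 35350L, 28200L, 26775L, 33696L, 28516L)
)

# Hypothetical helper returning the six statistics asked for in the question.
describe <- function(x) {
  c(min    = min(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q75    = unname(quantile(x, 0.75)),
    max    = max(x))
}

# One column of statistics per variable (rows: statistics, columns: variables).
stats <- sapply(salary_head, describe)
print(stats)
```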
  3. Determine the frequency for one of the categorical variables.
table(salary$degree)
## 
## Masters     PhD 
##      34      18
  4. Determine the frequency for one of the categorical variables, by a different categorical variable.
table(salary$degree, salary$sex)
##          
##           Male Female
##   Masters   24     10
##   PhD       14      4
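Raw counts are hard to compare when the groups differ in size, so it can help to add totals and row-wise proportions. The sketch below rebuilds the table from the counts printed above instead of reloading the data; `counts`, `with_totals`, and `row_props` are illustrative names.

```r
# Counts copied from the table(salary$degree, salary$sex) output above.
counts <- as.table(matrix(
  c(24, 10,
    14,  4),
  nrow = 2, byrow = TRUE,
  dimnames = list(degree = c("Masters", "PhD"), sex = c("Male", "Female"))
))

# Append row and column totals (labelled "Sum").
with_totals <- addmargins(counts)

# Share of each sex within each degree; every row sums to 1.
row_props <- prop.table(counts, margin = 1)

print(with_totals)
print(round(row_props, 3))
```

With the dataset loaded, `prop.table(table(salary$degree, salary$sex), margin = 1)` should give the same proportions.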
  5. Create a graph for a single numeric variable
boxplot(salary$salary)

hist(salary$salary)

# Same histogram using ggplot2's qplot()
qplot(salary, data=salary)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
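The `stat_bin` message appears because no bin width was supplied. A possible way to set it explicitly is the full `ggplot()` form sketched below, which also avoids `qplot()` (deprecated since ggplot2 3.4.0). The sketch assumes only the six salaries printed by `head(salary)`, and the bin width of 2000 is an arbitrary illustrative choice, not a value from the assignment.

```r
library(ggplot2)

# Six salaries from the head(salary) output; `salary_head` is a hypothetical name.
salary_head <- data.frame(
  salary = c(36350L, 35350L, 28200L, 26775L, 33696L, 28516L)
)

# Histogram with an explicit bin width, which silences the stat_bin message.
p <- ggplot(salary_head, aes(x = salary)) +
  geom_histogram(binwidth = 2000, colour = "white") +
  labs(x = "Salary", y = "Count")

print(p)
```

For the full dataset, replace `salary_head` with `salary` and adjust `binwidth` to the spread of the data.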

  6. Create a scatterplot of two numeric variables
qplot(ysdeg, salary, data=salary)
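
A scatterplot can also carry one of the categorical variables as point colour, and `cor()` gives a single-number summary of the linear association. The base R sketch below assumes only the six rows printed by `head(salary)`, so the correlation it computes is not the one for the full dataset; `salary_head`, `point_cols`, and `r` are illustrative names, and the colours are arbitrary.

```r
# Three columns from the six rows shown by head(salary).
salary_head <- data.frame(
  sex    = factor(c("Male", "Male", "Male", "Female", "Male", "Male"),
                  levels = c("Male", "Female")),
  ysdeg  = c(35L, 22L, 23L, 27L, 30L, 21L),
  salary = c(36350L, 35350L, 28200L, 26775L, 33696L, 28516L)
)

# One colour per level of sex (arbitrary choices).
point_cols <- c(Male = "steelblue", Female = "tomato")

# Scatterplot of salary against years since highest degree, coloured by sex.
plot(salary ~ ysdeg, data = salary_head,
     col = point_cols[as.character(salary_head$sex)], pch = 19,
     xlab = "ysdeg (years since highest degree)", ylab = "salary")
legend("topleft", legend = names(point_cols), col = point_cols, pch = 19)

# Pearson correlation between the two numeric variables.
r <- cor(salary_head$ysdeg, salary_head$salary)
print(r)
```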