Here are the notes I took in R from the PowerPoint and exercise slides:
#Install knitr package
install.packages('knitr', dependencies = TRUE)
#Code to Enter Data
data_name <-read.csv(file.choose(), header=TRUE)
facdat <- read.table("/Users/kace999/Documents/...", sep=",", header=TRUE)
names(data_name)
#To select a column use:
data_name$column_name
#Other data call:
data_name[row selection criteria, column names]
#Subsetting data
#Returns rows of data where the condition is TRUE
#A blank before the comma in [ ] selects all rows; a blank after it selects all columns
data_name[data_name$column_name=="value",column_names]
#Create new dataset
data_name.subset <- data_name[data_name$column_name=="value",column_names]
#Create new variables
data_name$new_variable[data_name$old_variable >= minimum_in_range & data_name$old_variable <= maximum_in_range] <- ordinal_value
#Histogram
#First argument is a numeric vector, e.g. a standalone variable or data_name$column_name
hist(data_name$column_name, col="color_name", breaks=seq(min,max,by=increment), xlab="label", main="figure_title")
#Boxplots
boxplot(data_name$column_name, main="figure_title", ylab="label",col="color_name")
boxplot(data_name$column_name1 ~ data_name$column_name2 , main="figure_title", ylab="label",col=rainbow(2))
#Scatter plots
plot(x,y)
plot(y~x) #formula form is response ~ predictor, so this matches plot(x,y)
help(plot)
plot(x,y, xlab="label", ylab="label", pch=2, col="color_name", main="figure_title", sub="figure_subtitle")
#Line Graphs
data_name <-read.csv(file.choose(), header=TRUE)
time1<-1:nrow(data_name)
plot(time1,data_name$column_name, xlab="label",ylab="label", main="figure_title", type="l", ylim=c(min,max))
#Correlation Coefficient
cor(y, x)
#Two lines on one graph
lines(time1,data_name$column_name,lty=2, col="color_name")
#Legends
legend(x, y, c("column_name1", "column_name2"), lty=c(1,2), col=c("color_name1","color_name2"))
#Panel Figure
par(mfrow=c(number_rows, number_columns))
#hist(CO2$uptake, col="blue",
#breaks=seq(0,50,by=5),xlab="Rate (umol/m^2 sec)", main =
#"CO2 Uptake Rates")
#plot(CO2$conc,CO2$uptake, xlab="CO2 Concentration",
#ylab="CO2 Uptake Rate", pch=2, col="red")
#boxplot(CO2$uptake~CO2$Type, main="CO2 Uptake Rates",
#ylab="Rate (umol/m^2 sec)",col=rainbow(2))
#plot(t1,thames2$Feildes,xlab="Time",ylab="Flow Rate",
#main="Flow Rate for Thames Tributaries", type="l")
#lines(t1,thames2$Redbridge,lty=2,col="red")
#legend(5,8,c("Feildes","Redbridge"),lty=c(1,2),
#col=c("black","red"))
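The subsetting, new-variable, and correlation templates above can be tried without any external file. The following is a minimal sketch using the CO2 data set that ships with base R. The names co2, quebec, and conc_level and the cutoffs 300 and 600 are assumptions chosen only for illustration; they do not come from the slides.

```r
# CO2 ships with base R; convert to a plain data frame first.
co2 <- as.data.frame(CO2)

# Subset: rows where Type is "Quebec", keeping two named columns.
quebec <- co2[co2$Type == "Quebec", c("conc", "uptake")]

# New ordinal variable built from numeric ranges.
# The cutoffs 300 and 600 are arbitrary, for illustration only.
co2$conc_level <- NA
co2$conc_level[co2$conc < 300] <- 1
co2$conc_level[co2$conc >= 300 & co2$conc < 600] <- 2
co2$conc_level[co2$conc >= 600] <- 3

# Correlation coefficient between concentration and uptake.
r <- cor(co2$conc, co2$uptake)

table(co2$conc_level)
```

Starting the new column as NA makes it easy to spot any rows that none of the range conditions matched.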
The exercise asks five questions about the faculty.csv file provided on Dropbox. My answers are below.
1. What are the column names? How many observations are there? How many variables?
# Input data
facdat <- read.table("/Users/kace999/Documents/Grad School/PhD/Classes/S13/Spatial Stats/Data/faculty.csv",
sep = ",", header = T)
names(facdat)
## [1] "AYSALARY" "R1" "R2" "R7" "PRIOREXP" "YRBG"
## [7] "YRRANK" "TERMDEG" "YRDG" "EMINENT" "FEMALE"
ncol(facdat)
## [1] 11
nrow(facdat)
## [1] 725
There are 11 variables and 725 total observations.
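As a possible shortcut, dim() returns both counts in one call and str() lists each column with its type. The sketch below runs on the built-in CO2 data set rather than faculty.csv, so its numbers are not the exercise answer; the names co2, n_obs, and n_vars are placeholders.

```r
# Built-in data set used as a stand-in for facdat.
co2 <- as.data.frame(CO2)

# dim() returns c(number of rows, number of columns).
dims <- dim(co2)
n_obs <- dims[1]   # observations (rows)
n_vars <- dims[2]  # variables (columns)

# str() prints every column name together with its type.
str(co2)
```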
2. Is annual salary normally distributed?
# Check for range
summary(facdat$AYSALARY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   23800   36800   46700   47800   57600  103000
mean(facdat$AYSALARY)
## [1] 47801
# Plot histogram
hist(facdat$AYSALARY, col = "blue", breaks = seq(10000, 120000, by = 2500),
xlab = "Annual Salary", main = "Faculty Salary Distribution")
The histogram above indicates that salaries are not normally distributed. The distribution is cut off sharply at the minimum and skewed to the right, with a long tail toward the maximum and most of its weight toward the lower end of the range. Because salary is a continuous quantity rather than a count, a right-skewed continuous distribution such as the log-normal may be a better statistical model.
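A histogram can be backed up with a Q-Q plot and a Shapiro-Wilk test. The sketch below is only an illustration on simulated data: salary_sim is a hypothetical log-normal sample whose parameters were picked arbitrarily, not the faculty.csv salaries, so the p-values it yields say nothing about the real data.

```r
set.seed(1)
# Simulated right-skewed values standing in for salaries (NOT faculty.csv).
salary_sim <- rlnorm(725, meanlog = log(45000), sdlog = 0.3)

# Q-Q plot: points bending away from the line suggest non-normality.
qqnorm(salary_sim)
qqline(salary_sim, col = "blue")

# Shapiro-Wilk test: a small p-value is evidence against normality.
p_raw <- shapiro.test(salary_sim)$p.value

# Same test after a log transform, which removes the skew in this sample.
p_log <- shapiro.test(log(salary_sim))$p.value
```

With the real data, the same calls would take facdat$AYSALARY in place of salary_sim.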
3. Does it appear that male and female faculty members make the same annual salary?
# Boxplot with both variables
boxplot(facdat$AYSALARY ~ facdat$FEMALE, main = "Gender and Salary", ylab = "Salary",
xlab = "Men=0, Women=1", col = rainbow(2))
The boxplots suggest that women tend to earn noticeably less than men, although this comparison does not account for rank or experience.
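A visual comparison of two groups can be supplemented with group medians and a two-sample test. Since faculty.csv is not available here, this sketch uses the built-in CO2 data set, comparing uptake between the two levels of Type; the names co2, medians, and welch are placeholders.

```r
# Built-in data set with a two-level grouping column (Type).
co2 <- as.data.frame(CO2)

# Median of the response within each group.
medians <- tapply(co2$uptake, co2$Type, median)

# Welch two-sample t-test (the t.test default) comparing the group means.
welch <- t.test(uptake ~ Type, data = co2)

medians
welch$p.value
```

For the exercise, the analogous calls would use AYSALARY as the response and FEMALE as the grouping column.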
4. Does there appear to be a relationship between salary and the number of years of employment?
# Scatterplot with trendline: first define the model, then plot
model1 <- lm(facdat$AYSALARY ~ facdat$YRBG)
plot(facdat$YRBG, facdat$AYSALARY, xlab = "Years of Service", ylab = "Annual Salary",
pch = 8, col = "black", main = "Annual Salary by Years of Service with Trendline")
abline(reg = model1, col = "blue")
There is definitely a positive correlation between years of service and annual salary.
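The strength of a linear relationship can be put into numbers with the fitted slope and the correlation coefficient. The sketch below again relies on the built-in CO2 data set as a stand-in, so the values are not those of the faculty data; fit, slope, r, and r_squared are illustrative names.

```r
# Built-in data set used as a stand-in for facdat.
co2 <- as.data.frame(CO2)

# Simple linear regression of uptake on concentration.
fit <- lm(uptake ~ conc, data = co2)

# Slope: estimated change in uptake per unit of concentration.
slope <- coef(fit)[["conc"]]

# Pearson correlation; for a one-predictor model, r^2 equals R-squared.
r <- cor(co2$conc, co2$uptake)
r_squared <- summary(fit)$r.squared
```

Passing data = to lm() keeps the formula short and lets abline(fit) draw the trendline in the same way as in the exercise code.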
5. BONUS: Create a new variable combining R1, R2, and R7 into one categorical variable of rank. Does one category appear to have higher salaries?
# Create new variable, put ordinal values into it
facdat$rank[facdat$R1 == 1] <- 1
facdat$rank[facdat$R2 == 1] <- 2
facdat$rank[facdat$R7 == 1] <- 3
# Three boxplots to compare across variables
boxplot(facdat$AYSALARY ~ facdat$rank, main = "Rank and Salary", ylab = "Salary",
xlab = "1=Tenured Professor, 2=Associate Prof, 3=Lecturer/Instructor", col = rainbow(3))
It is clear that rank is highly correlated with salary.
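One optional refinement is to store the combined rank as a labelled factor, so plots and tables show names instead of codes. The sketch below works on a hypothetical four-row data frame named toy, not on faculty.csv; the labels are copied from the axis label used above, and the last row is deliberately given no rank flag to show what happens to unmatched rows.

```r
# Hypothetical indicator columns; the last row has no rank flag set.
toy <- data.frame(R1 = c(1, 0, 0, 0),
                  R2 = c(0, 1, 0, 0),
                  R7 = c(0, 0, 1, 0))

# Start with NA so rows matching no indicator are easy to find.
toy$rank <- NA
toy$rank[toy$R1 == 1] <- 1
toy$rank[toy$R2 == 1] <- 2
toy$rank[toy$R7 == 1] <- 3

# Convert the codes to a factor with readable labels.
toy$rank_f <- factor(toy$rank, levels = 1:3,
                     labels = c("Tenured Professor", "Associate Prof",
                                "Lecturer/Instructor"))

# Count rows per category, including any left as NA.
table(toy$rank_f, useNA = "ifany")
```

Rows left as NA are dropped from a boxplot drawn with the formula interface, so checking the table first shows how many observations the plot omits.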