Use the built-in swiss data set for this problem.
Find the average fertility rate for the 47 French-speaking provinces, and find the number of provinces with fertility rate above 60%.
mean(swiss$Fertility)
## [1] 70.14255
nrow(subset(swiss, Fertility > 60)) # "above 60" is a strict inequality
## [1] 39
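As a small cross-check, the same count can be obtained without building a subset first. This is only a sketch; the names `n_subset` and `n_sum` are introduced here for illustration.

```r
# Two equivalent ways to count rows that meet a condition (built-in swiss data)
n_subset <- nrow(subset(swiss, Fertility > 60)) # subset first, then count rows
n_sum    <- sum(swiss$Fertility > 60)           # TRUE counts as 1 when summed
n_subset == n_sum                               # both approaches agree
```

Summing a logical vector avoids creating an intermediate data frame, which is handy when only the count is needed.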
Use abbreviate to shorten the names of the 47 provinces in the ‘swiss’ data set. Compare the results to using substr to extract 4 characters from the names of provinces. Which of the two functions gives better shortened names?
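ANS: abbreviate gives the better shortened names. Its results are all distinct, whereas substr produces duplicates ("Veve" for both Veveyse and Vevey, "Rive" for both Rive Droite and Rive Gauche) and fragments containing spaces such as "La V" and "Val ".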
abbreviate(rownames(swiss))
## Courtelary Delemont Franches-Mnt Moutier Neuveville
## "Crtl" "Dlmn" "Fr-M" "Motr" "Nvvl"
## Porrentruy Broye Glane Gruyere Sarine
## "Prrn" "Broy" "Glan" "Gryr" "Sarn"
## Veveyse Aigle Aubonne Avenches Cossonay
## "Vvys" "Aigl" "Abnn" "Avnc" "Cssn"
## Echallens Grandson Lausanne La Vallee Lavaux
## "Echl" "Grnd" "Lsnn" "LVll" "Lavx"
## Morges Moudon Nyone Orbe Oron
## "Mrgs" "Modn" "Nyon" "Orbe" "Oron"
## Payerne Paysd'enhaut Rolle Vevey Yverdon
## "Pyrn" "Pys'" "Roll" "Vevy" "Yvrd"
## Conthey Entremont Herens Martigwy Monthey
## "Cnth" "Entr" "Hrns" "Mrtg" "Mnth"
## St Maurice Sierre Sion Boudry La Chauxdfnd
## "StMr" "Sirr" "Sion" "Bdry" "LChx"
## Le Locle Neuchatel Val de Ruz ValdeTravers V. De Geneve
## "LLcl" "Ncht" "VldR" "VldT" "V.DG"
## Rive Droite Rive Gauche
## "RvDr" "RvGc"
substr(rownames(swiss),1,4)
## [1] "Cour" "Dele" "Fran" "Mout" "Neuv" "Porr" "Broy" "Glan" "Gruy" "Sari"
## [11] "Veve" "Aigl" "Aubo" "Aven" "Coss" "Echa" "Gran" "Laus" "La V" "Lava"
## [21] "Morg" "Moud" "Nyon" "Orbe" "Oron" "Paye" "Pays" "Roll" "Veve" "Yver"
## [31] "Cont" "Entr" "Here" "Mart" "Mont" "St M" "Sier" "Sion" "Boud" "La C"
## [41] "Le L" "Neuc" "Val " "Vald" "V. D" "Rive" "Rive"
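One way to compare the two sets of shortened names objectively is to check them for duplicates. The sketch below assumes nothing beyond base R; the names `nm`, `ab` and `sb` are introduced here for illustration.

```r
nm <- rownames(swiss)
ab <- abbreviate(nm)     # default minlength = 4
sb <- substr(nm, 1, 4)   # first four characters of each name

anyDuplicated(ab)        # 0 means every abbreviation is distinct
sb[duplicated(sb)]       # truncated names that repeat an earlier one
```

A shortened name is only useful as a label if it still identifies one province, so uniqueness is a reasonable criterion for "better" here.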
How many provinces have over 50% Catholic? Define these provinces as Catholic and the other provinces as Protestant. Which kind of provinces has a higher average fertility rate? Which kind of provinces has a higher average education rate beyond primary school for draftees?
dim(subset(swiss,Catholic>50))[1]
## [1] 18
swiss$CorP <- ifelse(swiss$Catholic>50,"Catholic","Protestant")
t.test(Fertility~CorP,data=swiss)
##
## Welch Two Sample t-test
##
## data: Fertility by CorP
## t = 2.7004, df = 26.742, p-value = 0.01186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.455904 18.024939
## sample estimates:
## mean in group Catholic mean in group Protestant
## 76.46111 66.22069
t.test(Education~CorP,data=swiss)
##
## Welch Two Sample t-test
##
## data: Education by CorP
## t = -1.1229, df = 43.236, p-value = 0.2677
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.462232 2.408592
## sample estimates:
## mean in group Catholic mean in group Protestant
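## 9.111111 12.137931
ANS: 18 provinces are Catholic. Catholic provinces have the higher average fertility (76.46 vs. 66.22, p = 0.012). Protestant provinces have the higher average education (12.14 vs. 9.11), although that difference is not statistically significant (p = 0.27).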
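If only the group means are wanted, without a test, a grouped summary is enough. The following is a minimal sketch; `grp`, `fert_means` and `educ_means` are names made up for this example, and `grp` is built locally so that `swiss` itself is not modified.

```r
# Classify each province, then average within each class
grp <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")

table(grp)                                       # number of provinces per class
fert_means <- tapply(swiss$Fertility, grp, mean) # mean fertility by class
educ_means <- tapply(swiss$Education, grp, mean) # mean education by class
fert_means
educ_means
```

The t.test calls above report the same means in their "sample estimates" lines, so the two approaches can be checked against each other.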
Comment on the line by line output of the following R script.
x <- c(5, 7, 9, 13, -4, 8) # create a numeric vector called x
is.vector(x) # check that x is a vector
## [1] TRUE
is.factor(x) # test whether x is a factor
## [1] FALSE
xl <- list(w1="Hello", w2="World!") # create a list called xl
is.list(xl) # check that xl is a list
## [1] TRUE
is.factor(xl) # test whether xl is a factor
## [1] FALSE
M1 <- matrix(1:6, 3, 2) # create a matrix called M1
M2 <- matrix(1:6, 2, 3) # create a matrix called M2
M3 <- matrix(1:6, 2, 3, byrow=TRUE) # create a matrix called M3, filled row by row
is.matrix(M1) # check that M1 is a matrix
## [1] TRUE
is.vector(M1) # test whether M1 is a vector (it has a dim attribute, so it is not)
## [1] FALSE
df <- data.frame(ltr=letters[1:6], num=11:16) # create a data frame called df
is.data.frame(df) # check that df is a data frame
## [1] TRUE
is.matrix(df) # test whether df is a matrix
## [1] FALSE
df$ltr[3] # show the third element of the ltr column (a factor here, so its levels are printed too)
## [1] c
## Levels: a b c d e f
df$num[3] # show the third element of the num column
## [1] 13
lt <- list(ltr=letters[1:6], num=11:16) # create a list called lt
is.list(lt) # check that lt is a list
## [1] TRUE
is.data.frame(lt) # test whether lt is a data frame
## [1] FALSE
lt$ltr[3] # show the third element of the ltr component
## [1] "c"
lt$num[3] # show the third element of the num component
## [1] 13
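The difference between the factor output of df$ltr[3] and the character output of lt$ltr[3] comes from how data.frame treats strings. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the factor output above reflects an older R version. The sketch below makes the choice explicit; `df_fac` and `df_chr` are names introduced for illustration.

```r
# Same data, with string handling stated explicitly
df_fac <- data.frame(ltr = letters[1:6], num = 11:16, stringsAsFactors = TRUE)
df_chr <- data.frame(ltr = letters[1:6], num = 11:16, stringsAsFactors = FALSE)

class(df_fac$ltr) # "factor"
class(df_chr$ltr) # "character"
is.list(df_fac)   # TRUE: a data frame is a list of equal-length columns
```

Stating the argument explicitly keeps a script's output the same across R versions.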
Download the data file from the junior school project and read it into your current R session. Assign the data set to a data frame object called jsp.
Display school information in jsp.
jsp <- read.table("data/jsp.txt", header=TRUE) # the first line of the file holds the column names
head(jsp)$school
## [1] S1 S1 S1 S1 S1 S1
## 49 Levels: S1 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S2 S20 S21 ... S9
Display class information in jsp.
head(jsp)$class
## [1] C1 C1 C1 C1 C1 C1
## Levels: C1 C2 C3 C4
Display student information in jsp.
head(jsp)[,3:9]
## sex soc ravens pupil english math year
## 1 G 9 23 P1 72 23 0
## 2 G 9 23 P1 80 24 1
## 3 G 9 23 P1 39 23 2
## 4 B 2 15 P2 7 14 0
## 5 B 2 15 P2 17 11 1
## 6 B 2 22 P3 88 36 0
Display student information in class 2 of school 1.
head(subset(jsp, school=="S1" & class=="C2"))[,3:9] # restrict to school 1 as well as class 2
## sex soc ravens pupil english math year
## 46 G 9 17 P19 52 14 0
## 47 G 9 17 P19 84 10 1
## 48 G 9 17 P19 46 25 2
## 49 B 4 22 P20 11 14 0
## 50 B 4 22 P20 23 20 1
## 51 B 4 22 P20 9 17 2
Re-label the values of the variable ‘junior school year’: One = 1, Two = 2, Three = 3.
jsp$year<- ifelse (jsp$year==0,1,ifelse(jsp$year==1,2,3))
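Two alternative ways to re-label such a variable are sketched below on made-up values, since the jsp file itself is not included here. `year_raw` is a hypothetical stand-in for the original 0/1/2 coding of jsp$year.

```r
year_raw <- c(0, 1, 2, 0, 1, 0) # hypothetical values coded like the original jsp$year

# Numeric re-coding: shifting by one gives the same result as the nested ifelse
year_num <- year_raw + 1

# Labelled re-coding: a factor keeps the order and carries readable labels
year_fac <- factor(year_raw, levels = 0:2, labels = c("One", "Two", "Three"))

year_num
year_fac
```

Which form is preferable depends on whether later analyses treat year as a number or as a category.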
Re-name the variable ‘sex’ as ‘gender’.
#library(dplyr)
#jsp <- rename(jsp,gender=sex)
colnames(jsp)[3] <- "gender"
head(jsp)
## school class gender soc ravens pupil english math year
## 1 S1 C1 G 9 23 P1 72 23 1
## 2 S1 C1 G 9 23 P1 80 24 2
## 3 S1 C1 G 9 23 P1 39 23 3
## 4 S1 C1 B 2 15 P2 7 14 1
## 5 S1 C1 B 2 15 P2 17 11 2
## 6 S1 C1 B 2 22 P3 88 36 1
Move the variable ‘student ID’ from the 6th column to the third column and shift the rest down one column.
jsp <- jsp[,c(1,2,6,3,4,5,7:9)]
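Reordering by column name rather than by position is less fragile if columns are later added or removed. The sketch below uses a one-row toy data frame with the same column names as jsp after the renaming step; `toy`, `front` and `toy2` are names made up for this example.

```r
# Toy data frame standing in for jsp (one made-up row)
toy <- data.frame(school = "S1", class = "C1", gender = "G", soc = 9,
                  ravens = 23, pupil = "P1", english = 72, math = 23, year = 1)

front <- c("school", "class", "pupil")               # columns to place first
toy2  <- toy[, c(front, setdiff(names(toy), front))] # the rest keep their order
names(toy2)
```

setdiff keeps the remaining names in their original order, so only the named columns move.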
Write jsp out as a csv file.
write.csv(jsp, file="finaljspcsv.csv")
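By default write.csv also writes the row names as an extra, unnamed first column, which read.csv then reads back as a column called X. The sketch below shows a round trip that avoids this; the toy data frame and the temporary file path are assumptions for illustration only.

```r
toy  <- data.frame(pupil = c("P1", "P2"), math = c(23, 14)) # made-up data
path <- tempfile(fileext = ".csv")                          # throwaway file

write.csv(toy, file = path, row.names = FALSE) # omit the row-name column
back <- read.csv(path)
names(back)                                    # same columns as toy
```

Whether to keep row names depends on whether they carry information; for jsp they are just row numbers.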
Solve the problem of data type conversion in the following R script.
# load MASS library
library(MASS)
# make a copy of minn38 data set
y <- minn38
str(y)
## 'data.frame': 168 obs. of 5 variables:
## $ hs : Factor w/ 3 levels "L","M","U": 1 1 1 1 1 1 1 1 1 1 ...
## $ phs: Factor w/ 4 levels "C","E","N","O": 1 1 1 1 1 1 1 3 3 3 ...
## $ fol: Factor w/ 7 levels "F1","F2","F3",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ sex: Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ f : int 87 72 52 88 32 14 20 3 6 17 ...
head(y)
## hs phs fol sex f
## 1 L C F1 M 87
## 2 L C F2 M 72
## 3 L C F3 M 52
## 4 L C F4 M 88
## 5 L C F5 M 32
## 6 L C F6 M 14
# coerce it to become numeric
y$sex <- as.numeric(y$sex)
# check for numeric type
y$sex
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# change it back to characters
y$sex <- ifelse(y$sex == 1, "F", "M")
# factor type? why not
is.factor(y$sex)
## [1] FALSE
# make it so
y$sex <- as.factor(y$sex)
# coerce phs to numeric; this keeps only the integer level codes
y$phs <- as.numeric(y$phs)
# show it
y$phs
## [1] 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1
## [36] 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3
## [71] 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [106] 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4
## [141] 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4
# how do I change y$phs back to the original type?
ANS:
# the codes 1:4 index the original levels, so map them back as labels
y$phs <- factor(y$phs, levels = 1:4, labels = levels(minn38$phs))
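The underlying point is that as.numeric on a factor returns the internal level codes, not the labels. A minimal sketch with made-up values (`f`, `codes` and `restored` are illustrative names):

```r
f <- factor(c("10", "20", "10", "30"))
as.numeric(f)               # level codes: 1 2 1 3
as.numeric(as.character(f)) # the labels as numbers: 10 20 10 30

# Going back from codes to a factor needs the original level labels
codes    <- as.numeric(factor(c("C", "N", "E", "O"))) # 1 3 2 4
restored <- factor(codes, levels = 1:4, labels = c("C", "E", "N", "O"))
restored
```

This is why the level labels have to be supplied again once a factor has been reduced to its codes.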
Chatterjee and Hadi (Regression Analysis by Example, 2006) provided a link to the right-to-work data set on their web page. Read the data into an R session. ANS: The link to the data set was down at the time of writing.
dta <- read.table(file="http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P005.txt", header=TRUE, sep="\t")
The AAUP2 data set is a comma-delimited, fixed-column-format text file that uses an asterisk (‘*’) for missing values. Import the file into R and indicate missing values by ‘NA’.
# If new ideas come up, I will update the file.
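One possible starting point, offered only as a sketch: the na.strings argument of the read.table family turns a chosen marker into NA. The AAUP2 file and its column layout are not shown here, so the inline text and the column names `school`, `salary` and `rank` below are made up for illustration.

```r
# Made-up comma-delimited text using "*" for missing values
txt <- "school,salary,rank
A,45.2,*
B,*,2
C,51.0,1"

d <- read.csv(text = txt, na.strings = "*", strip.white = TRUE)
colSums(is.na(d)) # number of missing values per column
str(d)            # salary and rank are read as numbers, not strings
```

For a genuinely fixed-width file, read.fwf with the appropriate column widths would be the analogous call, but those widths are not given here.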
The titanic data set is the survival of Titanic passengers in an R data file format. Import the file into an R session and examine the file contents.
load("data/titanic.raw.rdata")
str(titanic.raw)
## 'data.frame': 2201 obs. of 4 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
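When the name of the object stored in an R data file is not known in advance, the value returned by load can help, because it is the character vector of restored object names. The sketch below uses a made-up data frame and a temporary file rather than the titanic file itself.

```r
toy  <- data.frame(Class = factor(c("1st", "3rd")),
                   Survived = factor(c("Yes", "No"))) # made-up data
path <- tempfile(fileext = ".rdata")
save(toy, file = path)
rm(toy)                # remove it so that load has something to restore

loaded <- load(path)   # names of the objects that were restored
loaded
str(get(loaded[1]))    # examine the first restored object
```

After that, str, summary and table are the usual tools for examining a data frame of factors.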