Use the built-in swiss data set for this problem.

Find the average fertility rate for the 47 French-speaking provinces, and find the number of provinces with fertility rate above 60%.

mean(swiss$Fertility)
## [1] 70.14255
nrow(subset(swiss, Fertility > 60))  # "above 60" is a strict inequality
## [1] 39
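
The same count can be taken from a logical vector, because each TRUE sums as 1. This is a minimal sketch using only the built-in swiss data; the object name n_above is an illustrative choice, not part of the original answer.

```r
# Count provinces whose Fertility value is strictly above 60.
# n_above is an illustrative name.
n_above <- sum(swiss$Fertility > 60)
n_above

# The subset() approach gives the same count.
nrow(subset(swiss, Fertility > 60))
```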

Use abbreviate to shorten the names of the 47 provinces in the ‘swiss’ data set. Compare the results to using substr to extract 4 characters from the names of provinces. Which of the two functions gives better shortened names?

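ANS: abbreviate gives the better shortened names. Its results are all distinct and contain no spaces, whereas substr produces duplicates ("Veve" for both Veveyse and Vevey, "Rive" for both Rive Droite and Rive Gauche) and keeps embedded spaces (for example "La V" and "St M").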

abbreviate(rownames(swiss))
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville 
##       "Crtl"       "Dlmn"       "Fr-M"       "Motr"       "Nvvl" 
##   Porrentruy        Broye        Glane      Gruyere       Sarine 
##       "Prrn"       "Broy"       "Glan"       "Gryr"       "Sarn" 
##      Veveyse        Aigle      Aubonne     Avenches     Cossonay 
##       "Vvys"       "Aigl"       "Abnn"       "Avnc"       "Cssn" 
##    Echallens     Grandson     Lausanne    La Vallee       Lavaux 
##       "Echl"       "Grnd"       "Lsnn"       "LVll"       "Lavx" 
##       Morges       Moudon        Nyone         Orbe         Oron 
##       "Mrgs"       "Modn"       "Nyon"       "Orbe"       "Oron" 
##      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
##       "Pyrn"       "Pys'"       "Roll"       "Vevy"       "Yvrd" 
##      Conthey    Entremont       Herens     Martigwy      Monthey 
##       "Cnth"       "Entr"       "Hrns"       "Mrtg"       "Mnth" 
##   St Maurice       Sierre         Sion       Boudry La Chauxdfnd 
##       "StMr"       "Sirr"       "Sion"       "Bdry"       "LChx" 
##     Le Locle    Neuchatel   Val de Ruz ValdeTravers V. De Geneve 
##       "LLcl"       "Ncht"       "VldR"       "VldT"       "V.DG" 
##  Rive Droite  Rive Gauche 
##       "RvDr"       "RvGc"
substr(rownames(swiss),1,4)
##  [1] "Cour" "Dele" "Fran" "Mout" "Neuv" "Porr" "Broy" "Glan" "Gruy" "Sari"
## [11] "Veve" "Aigl" "Aubo" "Aven" "Coss" "Echa" "Gran" "Laus" "La V" "Lava"
## [21] "Morg" "Moud" "Nyon" "Orbe" "Oron" "Paye" "Pays" "Roll" "Veve" "Yver"
## [31] "Cont" "Entr" "Here" "Mart" "Mont" "St M" "Sier" "Sion" "Boud" "La C"
## [41] "Le L" "Neuc" "Val " "Vald" "V. D" "Rive" "Rive"
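
One way to compare the two sets of short names is to check them for duplicates. The sketch below assumes that uniqueness is the criterion of interest; the object names prov, abbr, sub4 and dups are illustrative.

```r
prov <- rownames(swiss)

abbr <- abbreviate(prov)      # default minlength = 4
sub4 <- substr(prov, 1, 4)    # first four characters

# anyDuplicated() returns 0 when every element is distinct.
anyDuplicated(abbr)

# Short names from substr() that repeat an earlier one.
dups <- sub4[duplicated(sub4)]
dups
```

In the substr output above, "Veve" and "Rive" each appear twice, so those short names no longer identify a single province.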

How many provinces are more than 50% Catholic? Define these provinces as Catholic and the others as Protestant. Which kind of province has the higher average fertility rate? Which kind has the higher average rate of education beyond primary school among draftees?

dim(subset(swiss,Catholic>50))[1]
## [1] 18
swiss$CorP <- ifelse(swiss$Catholic>50,"Catholic","Protestant")
t.test(Fertility~CorP,data=swiss)
## 
##  Welch Two Sample t-test
## 
## data:  Fertility by CorP
## t = 2.7004, df = 26.742, p-value = 0.01186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.455904 18.024939
## sample estimates:
##   mean in group Catholic mean in group Protestant 
##                 76.46111                 66.22069
t.test(Education~CorP,data=swiss)
## 
##  Welch Two Sample t-test
## 
## data:  Education by CorP
## t = -1.1229, df = 43.236, p-value = 0.2677
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.462232  2.408592
## sample estimates:
##   mean in group Catholic mean in group Protestant 
##                 9.111111                12.137931
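
In the t-test output above, the Catholic group has the higher mean fertility (76.46 vs. 66.22) and the Protestant group the higher mean education (12.14 vs. 9.11). The group sizes and means can also be tabulated without running a test. This is a sketch on a copy of swiss; the names d and grp_means are illustrative.

```r
d <- swiss
d$CorP <- ifelse(d$Catholic > 50, "Catholic", "Protestant")

# Number of provinces in each group
table(d$CorP)

# Mean Fertility and Education by group
grp_means <- aggregate(cbind(Fertility, Education) ~ CorP, data = d, FUN = mean)
grp_means
```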

Comment on the line by line output of the following R script.

x <- c(5, 7, 9, 13, -4, 8) #create a variable called x
is.vector(x) # ensure x is a vector
## [1] TRUE
is.factor(x) # test whether x is a factor
## [1] FALSE
xl <- list(w1="Hello", w2="World!") # create a list called xl
is.list(xl) # ensure xl is a list
## [1] TRUE
is.factor(xl) # test if xl is a factor
## [1] FALSE
M1 <- matrix(1:6, 3, 2) # create a 3 x 2 matrix called M1, filled by column
M2 <- matrix(1:6, 2, 3) # create a 2 x 3 matrix called M2, filled by column
M3 <- matrix(1:6, 2, 3, byrow=TRUE) # create a 2 x 3 matrix called M3, filled by row
is.matrix(M1) # ensure M1 is a matrix
## [1] TRUE
is.vector(M1) # test if M1 is a vector 
## [1] FALSE
df <- data.frame(ltr=letters[1:6], num=11:16) # create a data frame called df
is.data.frame(df) # ensure df is a dataframe
## [1] TRUE
is.matrix(df) # test if df is a matrix
## [1] FALSE
df$ltr[3] # show the third element of the ltr column
## [1] c
## Levels: a b c d e f
df$num[3] # show the third element of the num column
## [1] 13
lt <- list(ltr=letters[1:6], num=11:16) # create a list called lt
is.list(lt) # ensure lt is a list
## [1] TRUE
is.data.frame(lt) # test if lt is a dataframe
## [1] FALSE
lt$ltr[3] # show the third element of the ltr component
## [1] "c"
lt$num[3] # show the third element of the num component
## [1] 13
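
The script prints df$ltr[3] with a Levels line but lt$ltr[3] as a plain string. This is because data.frame() converted character columns to factors by default before R 4.0.0, while list() never converts. The sketch below sets stringsAsFactors explicitly so the behaviour does not depend on the R version; df_fac and df_chr are illustrative names.

```r
df_fac <- data.frame(ltr = letters[1:6], num = 11:16, stringsAsFactors = TRUE)
df_chr <- data.frame(ltr = letters[1:6], num = 11:16, stringsAsFactors = FALSE)

class(df_fac$ltr)   # factor column
class(df_chr$ltr)   # character column

# A data frame is itself a list of equal-length columns.
is.list(df_fac)
```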

Download the junior school project data file and read it into your current R session. Assign the data set to a data frame object called jsp.

Display school information in jsp.

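# stringsAsFactors=TRUE keeps the factor columns shown below in R >= 4.0.0 as well
jsp <- read.table("data/jsp.txt", header=TRUE, stringsAsFactors=TRUE)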
head(jsp)$school
## [1] S1 S1 S1 S1 S1 S1
## 49 Levels: S1 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S2 S20 S21 ... S9

Display class information in jsp.

head(jsp)$class
## [1] C1 C1 C1 C1 C1 C1
## Levels: C1 C2 C3 C4

Display student information in jsp.

head(jsp)[,3:9]
##   sex soc ravens pupil english math year
## 1   G   9     23    P1      72   23    0
## 2   G   9     23    P1      80   24    1
## 3   G   9     23    P1      39   23    2
## 4   B   2     15    P2       7   14    0
## 5   B   2     15    P2      17   11    1
## 6   B   2     22    P3      88   36    0

Display student information in class 2 of school 1.

head(subset(jsp, school=="S1" & class=="C2"))[,3:9]  # restrict to school 1 as well as class 2
##    sex soc ravens pupil english math year
## 46   G   9     17   P19      52   14    0
## 47   G   9     17   P19      84   10    1
## 48   G   9     17   P19      46   25    2
## 49   B   4     22   P20      11   14    0
## 50   B   4     22   P20      23   20    1
## 51   B   4     22   P20       9   17    2

Re-label the values of the variable ‘junior school year’: One = 1, Two = 2, Three = 3.

jsp$year <- ifelse(jsp$year==0, 1, ifelse(jsp$year==1, 2, 3))  # recode 0/1/2 as 1/2/3
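
If the year codes are exactly 0, 1 and 2, the same recoding can be written without nested ifelse calls, and text labels can be attached with factor(). This is a sketch on a toy vector standing in for jsp$year; the names year, year_num and year_lab are illustrative.

```r
# Toy stand-in for jsp$year (assumed to hold only 0, 1, 2)
year <- c(0, 1, 2, 0, 1, 0)

# Numeric relabelling: 0/1/2 becomes 1/2/3
year_num <- year + 1
year_num

# Labelled alternative: a factor with levels One, Two, Three
year_lab <- factor(year, levels = 0:2, labels = c("One", "Two", "Three"))
year_lab
```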

Re-name the variable ‘sex’ as ‘gender’.

# dplyr alternative:
# library(dplyr)
# jsp <- rename(jsp, gender=sex)
names(jsp)[names(jsp)=="sex"] <- "gender"  # rename by name rather than by position
head(jsp)
##   school class gender soc ravens pupil english math year
## 1     S1    C1      G   9     23    P1      72   23    1
## 2     S1    C1      G   9     23    P1      80   24    2
## 3     S1    C1      G   9     23    P1      39   23    3
## 4     S1    C1      B   2     15    P2       7   14    1
## 5     S1    C1      B   2     15    P2      17   11    2
## 6     S1    C1      B   2     22    P3      88   36    1

Move the variable ‘student ID’ from the 6th column to the third column and shift the rest down one column.

jsp <- jsp[,c(1,2,6,3,4,5,7:9)]
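
The reordering can also be done by column name, which still works if the column positions change. This is a sketch on a one-row toy data frame that borrows the jsp column names; toy and front are illustrative names.

```r
# Toy data frame with the same column names as jsp after renaming sex to gender
toy <- data.frame(school = "S1", class = "C1", gender = "G", soc = 9,
                  ravens = 23, pupil = "P1", english = 72, math = 23, year = 1)

# Columns to place first; every other column keeps its relative order.
front <- c("school", "class", "pupil")
toy <- toy[, c(front, setdiff(names(toy), front))]
names(toy)
```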

Write jsp out as a csv file.

write.csv(jsp, file="finaljspcsv.csv", row.names=FALSE)  # omit the row-number column
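
By default write.csv() also writes the row names as an unnamed first column, which returns as an extra column when the file is read back. The sketch below shows a round trip with row.names = FALSE on a toy data frame and a temporary file; toy, out and back are illustrative names.

```r
toy <- data.frame(pupil = c("P1", "P2"), math = c(23, 14))

out <- tempfile(fileext = ".csv")
write.csv(toy, file = out, row.names = FALSE)

# Reading the file back recovers the same columns, with no extra index column.
back <- read.csv(out)
names(back)
```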

Solve the problem of data type conversion in the following R script.

# load MASS library
library(MASS)
# make a copy of minn38 data set
y <- minn38
str(y)
## 'data.frame':    168 obs. of  5 variables:
##  $ hs : Factor w/ 3 levels "L","M","U": 1 1 1 1 1 1 1 1 1 1 ...
##  $ phs: Factor w/ 4 levels "C","E","N","O": 1 1 1 1 1 1 1 3 3 3 ...
##  $ fol: Factor w/ 7 levels "F1","F2","F3",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ sex: Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ f  : int  87 72 52 88 32 14 20 3 6 17 ...
head(y)
##   hs phs fol sex  f
## 1  L   C  F1   M 87
## 2  L   C  F2   M 72
## 3  L   C  F3   M 52
## 4  L   C  F4   M 88
## 5  L   C  F5   M 32
## 6  L   C  F6   M 14
# coerce it to become numeric
y$sex <- as.numeric(y$sex)
# check for numeric type
y$sex
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# change it back to characters
y$sex <- ifelse(y$sex == 1, "F", "M")
# factor type? why not
is.factor(y$sex)
## [1] FALSE
# make it so
y$sex <- as.factor(y$sex)
#
y$phs <- as.numeric(y$phs)
# show it
y$phs
##   [1] 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1
##  [36] 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3
##  [71] 2 2 2 2 2 2 2 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [106] 4 4 4 4 4 4 4 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4
## [141] 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 2 2 4 4 4 4 4 4 4
# how do I change y$phs back to the original type?

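ANS: as.numeric() kept only the integer codes of the factor, so map the codes 1 to 4 back to the original labels (C, E, N, O) and then convert the result to a factor again.

y$phs <- ifelse(y$phs==1,"C",ifelse(y$phs==2,"E",ifelse(y$phs==3,"N","O")))
y$phs <- as.factor(y$phs)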
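
A shorter route is factor() with explicit levels and labels, which takes the labels from the untouched minn38 data rather than typing them by hand. This is a sketch, not the original answer; codes, labs and restored are illustrative names.

```r
library(MASS)

codes <- as.numeric(minn38$phs)   # integer codes 1..4, as in y$phs above
labs  <- levels(minn38$phs)       # original labels, in code order

# Map code i back to the i-th label.
restored <- factor(codes, levels = seq_along(labs), labels = labs)

# Check that every value matches the original column.
all(restored == minn38$phs)
```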

Chatterjee and Hadi (Regression Analysis by Example, 2006) provided a link to the right-to-work data set on their web page. Read the data into an R session.

ANS: The link to the data set was down at the time of writing.

dta <- read.table(file="http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P005.txt", header=TRUE, sep="\t")
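
Because the source may be unreachable, the read can be wrapped in tryCatch() so that a failure returns NULL instead of stopping the script. This is a sketch: read_table_or_null() is a hypothetical helper, and a nonexistent local path stands in for the dead link so the example does not need network access.

```r
# read_table_or_null() is a hypothetical helper, not part of the original answer.
# It returns the data frame on success and NULL if read.table() raises an error.
read_table_or_null <- function(path, ...) {
  tryCatch(
    read.table(path, header = TRUE, sep = "\t", ...),
    error = function(e) NULL
  )
}

# A path that does not exist, standing in for an unreachable URL.
# R may still print a "cannot open file" warning here.
missing_path <- file.path(tempdir(), "no_such_file.txt")
dta_try <- read_table_or_null(missing_path)

if (is.null(dta_try)) message("Data set could not be read; check the path or link.")
```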

The AAUP2 data set is a comma-delimited, fixed-column-format text file that uses an asterisk (‘*’) for missing values. Import the file into R and represent the missing values as ‘NA’.

# If new ideas come up, I will update the file.  
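
One possible approach is the na.strings argument of read.csv(), which turns a chosen string into NA on import. The sketch below uses a small inline text sample, not the real AAUP2 file; the column names and values are invented for illustration, and the actual file may need further options (for example, for padded fields).

```r
# Inline toy sample with '*' marking missing values (not the AAUP2 data)
txt <- "id,name,salary
1,Alpha,500
2,Beta,*
3,Gamma,*"

toy <- read.csv(text = txt, na.strings = "*")
toy

# The '*' entries are now NA and the column is numeric.
sum(is.na(toy$salary))
```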

The titanic data set records the survival of Titanic passengers and is stored in R data file format. Import the file into an R session and examine the file contents.

load("data/titanic.raw.rdata")
str(titanic.raw)
## 'data.frame':    2201 obs. of  4 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age     : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
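
When the name of the object stored in an R data file is not known in advance, the return value of load() helps, because it is a character vector naming the objects that were restored. The sketch below saves and reloads a toy data frame through a temporary file; toy, f and loaded are illustrative names, and the toy data are not the Titanic records.

```r
toy <- data.frame(Class = factor(c("1st", "3rd")),
                  Survived = factor(c("Yes", "No")))

f <- tempfile(fileext = ".rdata")
save(toy, file = f)
rm(toy)

# load() restores the objects and returns their names.
loaded <- load(f)
loaded

# Inspect the restored object by name.
str(get(loaded))
```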