1

Read in the CSV file, summarize it, and compute the mean and median of the two numeric columns of interest, age and futime.

#filename<-"hw2/transplant.csv"
filename<-"https://raw.githubusercontent.com/djunga/msdsbridgehw2/main/transplant.csv"
df<-read.csv(file=filename)

head(df, n=5L)
##   X age sex abo year futime    event
## 1 1  47   m   B 1994   1197    death
## 2 2  55   m   A 1991     28      ltx
## 3 3  52   m   B 1996     85      ltx
## 4 4  40   f   O 1995    231      ltx
## 5 5  70   m   O 1996   1271 censored
summary(df)
##        X              age            sex                abo           
##  Min.   :  1.0   Min.   :17.00   Length:815         Length:815        
##  1st Qu.:204.5   1st Qu.:44.00   Class :character   Class :character  
##  Median :408.0   Median :52.00   Mode  :character   Mode  :character  
##  Mean   :408.0   Mean   :50.52                                        
##  3rd Qu.:611.5   3rd Qu.:58.00                                        
##  Max.   :815.0   Max.   :72.00                                        
##                  NA's   :18                                           
##       year          futime          event          
##  Min.   :1990   Min.   :   0.0   Length:815        
##  1st Qu.:1993   1st Qu.:  50.0   Class :character  
##  Median :1995   Median : 128.0   Mode  :character  
##  Mean   :1995   Mean   : 213.6                     
##  3rd Qu.:1997   3rd Qu.: 276.5                     
##  Max.   :1999   Max.   :2055.0                     
## 
mean(df[,"age"], na.rm=TRUE)
## [1] 50.5207
median(df[,"age"], na.rm=TRUE)
## [1] 52
mean(df[,"futime"], na.rm=TRUE)
## [1] 213.5706
median(df[,"futime"], na.rm=TRUE)
## [1] 128
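
The four separate mean() and median() calls above could also be collapsed into one sapply() call. This is only a sketch: it uses a small toy data frame (named toy here, a hypothetical name) built from the first five rows printed by head(df), so it runs without downloading the CSV and its results differ from the full-dataset values.

```r
# Toy data frame copied from the first five rows shown by head(df) above
toy <- data.frame(
  age    = c(47, 55, 52, 40, 70),
  sex    = c("m", "m", "m", "f", "m"),
  futime = c(1197, 28, 85, 231, 1271)
)

# Columns to summarize; both are numeric
num_cols <- c("age", "futime")

# For each column, return a named vector with its mean and median.
# na.rm = TRUE drops missing values, as in the calls above.
stats <- sapply(toy[num_cols], function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
})

# stats is a matrix: one row per statistic, one column per variable
print(stats)
```

Applied to the real df, the same call with num_cols unchanged should reproduce the four values reported above.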

2

Take a subset of the original dataset: keep only the age, sex, and futime columns, then keep the first 200 rows.

df1<-subset(df, select=c(age,sex,futime))
df1<-df1[1:200,]
head(df1, n=5L)
##   age sex futime
## 1  47   m   1197
## 2  55   m     28
## 3  52   m     85
## 4  40   f    231
## 5  70   m   1271
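
The column selection and row selection above can also be done in a single bracket-indexing step. The sketch below assumes a toy data frame (toy, a hypothetical name) built from the five rows printed earlier, and keeps 3 rows instead of 200 so that the result is easy to inspect.

```r
# Toy data frame copied from the first five rows shown by head(df)
toy <- data.frame(
  X      = 1:5,
  age    = c(47, 55, 52, 40, 70),
  sex    = c("m", "m", "m", "f", "m"),
  abo    = c("B", "A", "B", "O", "O"),
  year   = c(1994, 1991, 1996, 1995, 1996),
  futime = c(1197, 28, 85, 231, 1271),
  event  = c("death", "ltx", "ltx", "ltx", "censored")
)

keep_cols <- c("age", "sex", "futime")

# Rows first, columns second, in one indexing expression
sub_a <- toy[1:3, keep_cols]

# Same idea with head(), which also copes when fewer than 3 rows exist
sub_b <- head(toy[keep_cols], 3)

print(sub_a)
print(sub_b)
```

Selecting columns by name in a character vector avoids the non-standard evaluation that subset() relies on, which can matter when the code is later moved inside a function.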

3

Rename the columns.

names(df1)<-c("AGE", "SEX", "TIME")
head(df1, n=5L)
##   AGE SEX TIME
## 1  47   m 1197
## 2  55   m   28
## 3  52   m   85
## 4  40   f  231
## 5  70   m 1271
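
Assigning a full vector to names(df1) depends on the columns being in the expected order. A possible alternative, sketched here on a hypothetical two-row toy data frame, is to rename by matching the old name, so that a reordered data frame cannot end up with the wrong labels.

```r
# Toy data frame with the same three columns as df1 before renaming
toy <- data.frame(
  age    = c(47, 55),
  sex    = c("m", "m"),
  futime = c(1197, 28)
)

# Rename one column by looking up its current name
names(toy)[names(toy) == "futime"] <- "TIME"

# Upper-case every column name; "TIME" is unchanged by this step
names(toy) <- toupper(names(toy))

print(names(toy))
```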

4

Summary, mean, and median of the new dataset, compared with the original.

summary(df1)
##       AGE            SEX                 TIME       
##  Min.   :27.00   Length:200         Min.   :   0.0  
##  1st Qu.:44.00   Class :character   1st Qu.:  40.0  
##  Median :53.00   Mode  :character   Median : 104.5  
##  Mean   :51.67                      Mean   : 163.9  
##  3rd Qu.:59.00                      3rd Qu.: 231.0  
##  Max.   :72.00                      Max.   :1271.0  
##  NA's   :5
mean(df1[,"AGE"], na.rm=TRUE)
## [1] 51.66667
median(df1[,"AGE"], na.rm=TRUE)
## [1] 53
mean(df1[,"TIME"], na.rm=TRUE)
## [1] 163.87
median(df1[,"TIME"], na.rm=TRUE)
## [1] 104.5
  • The mean age increased by about 1 year (50.52 to 51.67).
  • The median age also increased by 1 year (52 to 53).
  • The mean time decreased by about 50 (213.57 to 163.87).
  • The median time decreased by 23.5 (128 to 104.5).
  • The dataset description did not specify the units of the time.

The mean and median ages changed only slightly. The mean and median times both dropped noticeably, which leads me to believe that the subset I took has lower time values than the original dataset. The maximum time points the same way: 1271 in the subset versus 2055 in the full dataset.
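
The comparison between the full dataset and the 200-row subset can be checked with a little arithmetic. This sketch hard-codes the values printed in the output above rather than recomputing them, and the vector names (full, first200) are chosen here for illustration only.

```r
# Statistics copied from the console output above
full <- c(mean_age = 50.5207, median_age = 52,
          mean_time = 213.5706, median_time = 128)
first200 <- c(mean_age = 51.66667, median_age = 53,
              mean_time = 163.87, median_time = 104.5)

# Absolute change: positive means the subset value is higher
change <- first200 - full

# Relative change as a percentage of the full-dataset value
pct <- 100 * change / full

print(round(cbind(change, pct), 2))
```

Looking at the relative change alongside the absolute change makes it easier to see that the shift in the time statistics is proportionally much larger than the shift in the age statistics.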

5

I chose the values 47, 55, and 52 in the AGE column to change, using the following mapping:

47->51, 55->65, 52->40

Count of each before changing their values:

length(which(df1["AGE"]==47))
## [1] 7
length(which(df1["AGE"]==55))
## [1] 5
length(which(df1["AGE"]==52))
## [1] 11
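
An equivalent way to count matches is to sum a logical vector, since TRUE counts as 1. The sketch below uses a short made-up age vector (not the real AGE column) that includes one NA, to show why na.rm = TRUE is needed with this approach.

```r
# Illustrative ages only; the NA stands in for a missing AGE value
age <- c(47, 55, 52, 40, 70, NA, 47)

# Values whose frequency we want
targets <- c(47, 55, 52)

# For each target, count the elements equal to it.
# age == v yields NA for the missing entry, so na.rm = TRUE is required.
counts <- sapply(targets, function(v) sum(age == v, na.rm = TRUE))
names(counts) <- targets

print(counts)
```

length(which(...)) as used above handles NA implicitly because which() ignores it; sum() needs na.rm = TRUE to do the same.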

Change their values:

df1$AGE[df1$AGE==47] <- 51
df1$AGE[df1$AGE==55] <- 65
df1$AGE[df1$AGE==52] <- 40

Count of each after changing their values:

length(which(df1["AGE"]==47))
## [1] 0
length(which(df1["AGE"]==55))
## [1] 0
length(which(df1["AGE"]==52))
## [1] 0
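
The three sequential assignments work here because none of the new values (51, 65, 40) is also one of the old values being replaced. If a mapping did overlap, for example 47->55 followed by 55->65, the second step would also change the rows just rewritten by the first. One way to avoid that, sketched below on a made-up vector with a hypothetical lookup named recode_map, is to apply the whole mapping in a single pass.

```r
# Illustrative ages only, including one NA
age <- c(47, 55, 52, 40, 70, NA, 47)

# Lookup vector: names are the old values, elements are the new values
recode_map <- c("47" = 51, "55" = 65, "52" = 40)

# Position of each age in the lookup; NA when the age is not listed
idx <- match(as.character(age), names(recode_map))

# Keep the original value where there is no match, otherwise substitute
recoded <- ifelse(is.na(idx), age, recode_map[idx])

print(recoded)
```

Because every element is looked up against the original values, the order of entries in the mapping does not affect the result.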

First 5 rows of the altered dataset:

head(df1, n=5L)
##   AGE SEX TIME
## 1  51   m 1197
## 2  65   m   28
## 3  40   m   85
## 4  40   f  231
## 5  70   m 1271