1

Read in the CSV file, summarize it, and compute the mean and median of the age and futime columns.

#filename<-"hw2/transplant.csv"
filename<-"https://raw.githubusercontent.com/djunga/msdsbridgehw2/main/transplant.csv"
df<-read.csv(file=filename)

head(df, n=5L)

##   X age sex abo year futime    event
## 1 1  47   m   B 1994   1197    death
## 2 2  55   m   A 1991     28      ltx
## 3 3  52   m   B 1996     85      ltx
## 4 4  40   f   O 1995    231      ltx
## 5 5  70   m   O 1996   1271 censored

summary(df)

##        X              age            sex                abo           
##  Min.   :  1.0   Min.   :17.00   Length:815         Length:815        
##  1st Qu.:204.5   1st Qu.:44.00   Class :character   Class :character  
##  Median :408.0   Median :52.00   Mode  :character   Mode  :character  
##  Mean   :408.0   Mean   :50.52                                        
##  3rd Qu.:611.5   3rd Qu.:58.00                                        
##  Max.   :815.0   Max.   :72.00                                        
##                  NA's   :18                                           
##       year          futime          event          
##  Min.   :1990   Min.   :   0.0   Length:815        
##  1st Qu.:1993   1st Qu.:  50.0   Class :character  
##  Median :1995   Median : 128.0   Mode  :character  
##  Mean   :1995   Mean   : 213.6                     
##  3rd Qu.:1997   3rd Qu.: 276.5                     
##  Max.   :1999   Max.   :2055.0                     
##

mean(df[,"age"], na.rm=TRUE)

## [1] 50.5207

median(df[,"age"], na.rm=TRUE)

## [1] 52

mean(df[,"futime"], na.rm=TRUE)

## [1] 213.5706

median(df[,"futime"], na.rm=TRUE)

## [1] 128
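
The age column has 18 missing values, which is why na.rm=TRUE is needed above. The following is a minimal sketch of that behavior on a made-up data frame (`toy` and its values are illustrative, not taken from the transplant data). It also shows one possible way to get both statistics for every numeric column in a single call.

```r
# Toy data standing in for the transplant data (values are made up).
toy <- data.frame(
  age    = c(47, 55, NA, 40, 70),
  futime = c(1197, 28, 85, 231, 1271)
)

# Without na.rm = TRUE, a single NA makes the result NA.
mean(toy$age)                # NA
mean(toy$age, na.rm = TRUE)  # 53

# Compute mean and median of each column, ignoring NAs.
stats <- sapply(toy, function(col) {
  c(mean = mean(col, na.rm = TRUE), median = median(col, na.rm = TRUE))
})
print(stats)
```

The result is a small matrix with one column per variable and one row per statistic, which avoids repeating the same call for each column.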

2

Take a subset of the original dataset: the columns age, sex, and futime, and the first 200 rows.

df1<-subset(df, select=c(age,sex,futime))
df1<-df1[1:200,]
head(df1, n=5L)

##   age sex futime
## 1  47   m   1197
## 2  55   m     28
## 3  52   m     85
## 4  40   f    231
## 5  70   m   1271
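
The two-step subset above can also be written as a single bracket call. The sketch below uses a made-up five-row data frame (`toy` is illustrative) and also shows a caveat: indexing rows past the end of a data frame pads the result with NA rows, whereas head() stops at the last real row.

```r
# Toy data standing in for the transplant data (values are made up).
toy <- data.frame(
  age    = c(47, 55, 52, 40, 70),
  sex    = c("m", "m", "m", "f", "m"),
  abo    = c("B", "A", "B", "O", "O"),
  futime = c(1197, 28, 85, 231, 1271)
)

# One step: rows 1-3 and three named columns.
sub1 <- toy[1:3, c("age", "sex", "futime")]

# Two steps, mirroring the approach in the text.
sub2 <- subset(toy, select = c(age, sex, futime))
sub2 <- sub2[1:3, ]

# Asking for more rows than exist behaves differently:
padded <- toy[1:10, ]    # 10 rows, the last 5 filled with NA
safe   <- head(toy, 10)  # only the 5 rows that exist
```

With 815 rows in the real data, 1:200 is safely in range, so the difference only matters for smaller inputs.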

3

Rename the columns.

names(df1)<-c("AGE", "SEX", "TIME")
head(df1, n=5L)

##   AGE SEX TIME
## 1  47   m 1197
## 2  55   m   28
## 3  52   m   85
## 4  40   f  231
## 5  70   m 1271
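
Assigning names by position works here because the column order is known. As a hedged alternative, the sketch below renames by old name through a lookup vector, so the result does not depend on column order. `toy` and `new_names` are illustrative names, not part of the original analysis.

```r
# Toy data with the same column names as the subset (values are made up).
toy <- data.frame(sex = c("m", "f"), age = c(47, 40), futime = c(1197, 231))

# Lookup from old name to new name.
new_names <- c(age = "AGE", sex = "SEX", futime = "TIME")

# Rename each column according to its current name.
names(toy) <- unname(new_names[names(toy)])
print(names(toy))
```

Even though the toy columns are in a different order than in the text, each one receives the intended new name.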

4

Summarize the new dataset and compute the mean and median of AGE and TIME.

summary(df1)

##       AGE            SEX                 TIME       
##  Min.   :27.00   Length:200         Min.   :   0.0  
##  1st Qu.:44.00   Class :character   1st Qu.:  40.0  
##  Median :53.00   Mode  :character   Median : 104.5  
##  Mean   :51.67                      Mean   : 163.9  
##  3rd Qu.:59.00                      3rd Qu.: 231.0  
##  Max.   :72.00                      Max.   :1271.0  
##  NA's   :5

mean(df1[,"AGE"], na.rm=TRUE)

## [1] 51.66667

median(df1[,"AGE"], na.rm=TRUE)

## [1] 53

mean(df1[,"TIME"], na.rm=TRUE)

## [1] 163.87

median(df1[,"TIME"], na.rm=TRUE)

## [1] 104.5

The mean age increased by about 1 year (50.52 to 51.67).
The median age increased by 1 year as well (52 to 53).
The mean time decreased substantially, by about 50 (213.57 to 163.87).
The dataset description did not specify the units of the time.
The median time decreased by 23.5 (128 to 104.5).

The mean and median ages changed very little. The mean and median times both dropped considerably. This leads me to believe that the subset that I took has lower time values than the original dataset.
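
To make the comparison concrete, the sketch below recomputes the differences from the values printed in the output above. The vector names (`full`, `first200`) are illustrative, and the numbers are copied from the results rather than recomputed from the data.

```r
# Statistics reported above for the full dataset and the 200-row subset.
full     <- c(mean_age = 50.5207,  median_age = 52,
              mean_time = 213.5706, median_time = 128)
first200 <- c(mean_age = 51.66667, median_age = 53,
              mean_time = 163.87,   median_time = 104.5)

# Absolute and relative change from the full dataset to the subset.
change <- first200 - full
pct    <- 100 * change / full

print(round(cbind(full, first200, change, pct), 2))
```

The ages move by about 2%, while the mean and median times fall by roughly 23% and 18% respectively.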

5

I chose to change the values 47, 55, and 52 in the AGE column, mapping them as follows:

47->51, 55->65, 52->40

Count of each before changing their values:

length(which(df1["AGE"]==47))

## [1] 7

length(which(df1["AGE"]==55))

## [1] 5

length(which(df1["AGE"]==52))

## [1] 11
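
The three counts can also be obtained in one call. This is a sketch on a made-up vector (`age` here is illustrative, not the AGE column), showing an equivalent way to count a single value and a table() call that counts all three at once.

```r
# Toy vector with a missing value (values are made up).
age <- c(47, 55, 52, 47, NA, 52, 52)

# Equivalent to length(which(age == 47)): NA comparisons are dropped.
n47 <- sum(age == 47, na.rm = TRUE)

# Count all three values at once; values outside the levels are ignored,
# and a level that never occurs is reported as 0.
counts <- table(factor(age, levels = c(47, 55, 52)))
print(counts)
```

Fixing the levels up front means the same call can be reused after the replacement, where all three counts should come out as 0.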

Change their values:

df1$AGE[df1$AGE==47] <- 51
df1$AGE[df1$AGE==55] <- 65
df1$AGE[df1$AGE==52] <- 40

Count of each after changing their values:

length(which(df1["AGE"]==47))

## [1] 0

length(which(df1["AGE"]==55))

## [1] 0

length(which(df1["AGE"]==52))

## [1] 0

First 5 rows of the altered dataset:

head(df1, n=5L)

##   AGE SEX TIME
## 1  51   m 1197
## 2  65   m   28
## 3  40   m   85
## 4  40   f  231
## 5  70   m 1271
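
The three sequential assignments work here because none of the new values (51, 65, 40) is itself replaced by a later step. As a hedged alternative, the sketch below applies the same mapping in a single pass with a lookup vector, so every element is matched against its original value only. The vector `age` is made up, and `recode`, `idx`, and `hit` are illustrative names.

```r
# Toy vector with a missing value (values are made up).
age <- c(47, 55, 52, 40, NA, 70, 51)

# Mapping from the text: names are old values, elements are new values.
recode <- c("47" = 51, "55" = 65, "52" = 40)

# Position of each element in the mapping, or NA if it is not mapped.
idx <- match(as.character(age), names(recode))
hit <- !is.na(idx)

# Replace only the mapped elements; everything else, including NA, is kept.
new_age <- age
new_age[hit] <- unname(recode[idx[hit]])
print(new_age)
```

With this approach a mapping such as 47->52 together with 52->40 would not turn 47 into 40, which is what chained assignments would do.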

HW2

Tora Mullings

1/12/2022
