Place the original .csv in a GitHub file and have R read from the link
I started with the bonus because the rest of the assignment is easier once the data is assigned to a data frame. I used the College Distance data set from the Rdatasets GitHub page. The summary is shown first, followed by the first few rows of the CSV.
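library(RCurl)
# download the raw CSV text from the Rdatasets GitHub page
x <- getURL("https://vincentarelbundock.github.io/Rdatasets/csv/AER/CollegeDistance.csv")
# parse the downloaded text into a data frame
JR_CollegeDistance <- read.csv(text = x)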
summary(JR_CollegeDistance)
## X gender ethnicity score
## Min. : 1 Length:4739 Length:4739 Min. :28.95
## 1st Qu.: 1186 Class :character Class :character 1st Qu.:43.92
## Median : 2370 Mode :character Mode :character Median :51.19
## Mean : 3955 Mean :50.89
## 3rd Qu.: 3554 3rd Qu.:57.77
## Max. :37810 Max. :72.81
## fcollege mcollege home urban
## Length:4739 Length:4739 Length:4739 Length:4739
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## unemp wage distance tuition
## Min. : 1.400 Min. : 6.590 Min. : 0.000 Min. :0.2575
## 1st Qu.: 5.900 1st Qu.: 8.850 1st Qu.: 0.400 1st Qu.:0.4850
## Median : 7.100 Median : 9.680 Median : 1.000 Median :0.8245
## Mean : 7.597 Mean : 9.501 Mean : 1.803 Mean :0.8146
## 3rd Qu.: 8.900 3rd Qu.:10.150 3rd Qu.: 2.500 3rd Qu.:1.1270
## Max. :24.900 Max. :12.960 Max. :20.000 Max. :1.4042
## education income region
## Min. :12.00 Length:4739 Length:4739
## 1st Qu.:12.00 Class :character Class :character
## Median :13.00 Mode :character Mode :character
## Mean :13.81
## 3rd Qu.:16.00
## Max. :18.00
head(JR_CollegeDistance)
## X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other
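To illustrate what `read.csv(text = x)` is doing, here is a minimal sketch that parses a CSV held in a character string instead of a file. The three-row sample and the names `csv_text` and `toy` are made up for illustration; they are not part of the real data set.

```r
# Hypothetical three-row CSV held in a string (illustrative values only)
csv_text <- "gender,score,distance
male,40.5,0.2
female,50.0,0.4
male,60.5,1.0"

# text = tells read.csv to parse the string instead of opening a file
toy <- read.csv(text = csv_text)

str(toy)   # score and distance are parsed as numeric columns
head(toy)
```

`read.csv()` can also be given a URL directly as its first argument, which would likely make the RCurl step unnecessary here; I have not rerun the report that way.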
Mean and Median for the Score and Distance Attributes
# mean & median for score column
mean(JR_CollegeDistance$score)
## [1] 50.88903
median(JR_CollegeDistance$score)
## [1] 51.19
# mean & median for distance column
mean(JR_CollegeDistance$distance)
## [1] 1.80287
median(JR_CollegeDistance$distance)
## [1] 1
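The four calls above could probably be collapsed into one. The sketch below applies `mean()` and `median()` to several columns at once with `sapply()`. The data frame `toy` and its values are illustrative stand-ins, not rows from JR_CollegeDistance.

```r
# Toy data frame standing in for the real one (values are illustrative)
toy <- data.frame(score = c(40, 50, 54), distance = c(0.2, 0.4, 3.0))

# Apply mean and median to each selected column;
# the result is a matrix with one row per statistic and one column per variable
stats <- sapply(toy[, c("score", "distance")],
                function(col) c(mean = mean(col), median = median(col)))
stats
```

On the real data the same pattern would be `sapply(JR_CollegeDistance[, c("score", "distance")], ...)`.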
JR_NewDF is a subset containing the gender, ethnicity, score, distance and tuition columns. str() shows the column names and data types of the new data frame, and head() displays its first few rows.
JR_NewDF <- JR_CollegeDistance[, c("gender", "ethnicity", "score", "distance", "tuition")]
str(JR_NewDF)
## 'data.frame': 4739 obs. of 5 variables:
## $ gender : chr "male" "female" "male" "male" ...
## $ ethnicity: chr "other" "other" "other" "afam" ...
## $ score : num 39.2 48.9 48.7 40.4 40.5 ...
## $ distance : num 0.2 0.2 0.2 0.2 0.4 ...
## $ tuition : num 0.889 0.889 0.889 0.889 0.889 ...
head(JR_NewDF)
## gender ethnicity score distance tuition
## 1 male other 39.15 0.2 0.88915
## 2 female other 48.87 0.2 0.88915
## 3 male other 48.74 0.2 0.88915
## 4 male afam 40.40 0.2 0.88915
## 5 female other 40.48 0.4 0.88915
## 6 male other 54.71 0.4 0.88915
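As a side note on column subsetting, the sketch below shows two base R ways to pick columns by name, plus the `drop = FALSE` detail that matters when only one column is selected. The data frame `toy` is a made-up example.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(gender = c("male", "female"),
                  score  = c(39.15, 48.87),
                  wage   = c(8.09, 8.09))

# Select columns by name with bracket indexing, as in the text
by_bracket <- toy[, c("gender", "score")]

# The same selection with base R's subset()
by_subset <- subset(toy, select = c(gender, score))

# Selecting a single column with brackets returns a plain vector...
as_vector <- toy[, "score"]
# ...unless drop = FALSE is given, which keeps it a data frame
as_df <- toy[, "score", drop = FALSE]

names(by_bracket)
names(by_subset)
```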
str() shows the data types and the new column names, and head() displays the first few rows with the renamed columns.
colnames(JR_NewDF) <- c("gender_new", "ethnicity_new", "score_new", "distance_new", "tuition_new")
str(JR_NewDF)
## 'data.frame': 4739 obs. of 5 variables:
## $ gender_new : chr "male" "female" "male" "male" ...
## $ ethnicity_new: chr "other" "other" "other" "afam" ...
## $ score_new : num 39.2 48.9 48.7 40.4 40.5 ...
## $ distance_new : num 0.2 0.2 0.2 0.2 0.4 ...
## $ tuition_new : num 0.889 0.889 0.889 0.889 0.889 ...
head(JR_NewDF)
## gender_new ethnicity_new score_new distance_new tuition_new
## 1 male other 39.15 0.2 0.88915
## 2 female other 48.87 0.2 0.88915
## 3 male other 48.74 0.2 0.88915
## 4 male afam 40.40 0.2 0.88915
## 5 female other 40.48 0.4 0.88915
## 6 male other 54.71 0.4 0.88915
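Because every new name is just the old name plus a suffix, the names could also be generated rather than typed out. This is a minimal sketch on a made-up data frame `toy`, not the code used above.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(gender = c("male", "female"), score = c(39.15, 48.87))

# Build the new names from the old ones by appending a suffix
colnames(toy) <- paste0(colnames(toy), "_new")

colnames(toy)
```

This avoids typos when there are many columns, at the cost of being less explicit about the final names.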
Then print the mean and median for the same two attributes. Please compare.
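The mean and median of both attributes are the same as before (score: mean 50.88903, median 51.19; distance: mean 1.80287, median 1). This is expected, because the subset keeps all 4739 rows and renaming columns does not change their values.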
summary(JR_NewDF)
## gender_new ethnicity_new score_new distance_new
## Length:4739 Length:4739 Min. :28.95 Min. : 0.000
## Class :character Class :character 1st Qu.:43.92 1st Qu.: 0.400
## Mode :character Mode :character Median :51.19 Median : 1.000
## Mean :50.89 Mean : 1.803
## 3rd Qu.:57.77 3rd Qu.: 2.500
## Max. :72.81 Max. :20.000
## tuition_new
## Min. :0.2575
## 1st Qu.:0.4850
## Median :0.8245
## Mean :0.8146
## 3rd Qu.:1.1270
## Max. :1.4042
# mean & median for score column
mean(JR_NewDF$score_new)
## [1] 50.88903
median(JR_NewDF$score_new)
## [1] 51.19
# mean & median for distance column
mean(JR_NewDF$distance_new)
## [1] 1.80287
median(JR_NewDF$distance_new)
## [1] 1
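The comparison can also be checked in code instead of by eye. The sketch below repeats the subset-and-rename steps on a small made-up data frame and then tests that the statistics match; `before` and `after` are hypothetical names used only for this example.

```r
# Toy "original" data frame (illustrative values only)
before <- data.frame(score    = c(40, 50, 54),
                     distance = c(0.2, 0.4, 3.0),
                     wage     = c(8, 9, 10))

# Subset two columns and rename them, mirroring the steps in the text
after <- before[, c("score", "distance")]
colnames(after) <- c("score_new", "distance_new")

# all.equal() compares numbers with a small tolerance; identical() is exact
same_mean   <- isTRUE(all.equal(mean(before$score), mean(after$score_new)))
same_median <- identical(median(before$distance), median(after$distance_new))

c(same_mean, same_median)
```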
Although this is also possible with bracket subsetting ([ ]), I found the replace() function easier to read.
JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "other", "Unknown")
JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "hispanic", "Latino")
JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "afam", "African American")
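For comparison, here is a sketch of two alternatives to `replace()`: the bracket-subsetting form and a named lookup vector. The vector `ethnicity` is a made-up four-element example, and `by_index`, `lookup` and `by_lookup` are names introduced only for illustration.

```r
# Toy vector using the same three labels as the data set
ethnicity <- c("other", "afam", "hispanic", "other")

# Option 1: logical indexing, one assignment per value
by_index <- ethnicity
by_index[by_index == "other"]    <- "Unknown"
by_index[by_index == "hispanic"] <- "Latino"
by_index[by_index == "afam"]     <- "African American"

# Option 2: a named lookup vector recodes every value in one step
lookup <- c(other = "Unknown", hispanic = "Latino", afam = "African American")
by_lookup <- unname(lookup[ethnicity])

# Both approaches give the same result
identical(by_index, by_lookup)

# table() counts each label, a quick way to confirm the old labels are gone
table(by_lookup)
```

One caveat with the lookup approach: any value that is missing from `lookup` becomes `NA`, whereas the indexing and `replace()` forms leave unmatched values unchanged.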
For JR_NewDF, tail() with n = 10 shows the last 10 rows, which happen to include all three of the changed values in the ethnicity_new column.
tail(JR_NewDF, n=10)
## gender_new ethnicity_new score_new distance_new tuition_new
## 4730 female Latino 58.16 0.3 0.25751
## 4731 female Unknown 58.44 0.3 0.25751
## 4732 male African American 49.25 0.3 0.25751
## 4733 male African American 50.83 0.3 0.25751
## 4734 female Unknown 59.29 0.3 0.25751
## 4735 male African American 56.53 0.8 0.25751
## 4736 male African American 59.77 0.8 0.25751
## 4737 male Unknown 43.17 0.8 0.25751
## 4738 male African American 49.97 0.8 0.25751
## 4739 male African American 53.41 0.8 0.25751
```