7. BONUS - Reading my data set from GitHub

Place the original .csv in a GitHub file and have R read from the link

I decided to start with the bonus, since the rest of the assignment is easier once the data is assigned to a data frame. I used the CollegeDistance data set (from the AER package, hosted on the Rdatasets GitHub page). A summary of the data frame is shown first, followed by the first few rows of the CSV.

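library(RCurl)
# Download the raw CSV text from the Rdatasets GitHub page
x <- getURL("https://vincentarelbundock.github.io/Rdatasets/csv/AER/CollegeDistance.csv")
# Parse the downloaded text into a data frame
JR_CollegeDistance <- read.csv(text = x)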
summary(JR_CollegeDistance)
##        X            gender           ethnicity             score      
##  Min.   :    1   Length:4739        Length:4739        Min.   :28.95  
##  1st Qu.: 1186   Class :character   Class :character   1st Qu.:43.92  
##  Median : 2370   Mode  :character   Mode  :character   Median :51.19  
##  Mean   : 3955                                         Mean   :50.89  
##  3rd Qu.: 3554                                         3rd Qu.:57.77  
##  Max.   :37810                                         Max.   :72.81  
##    fcollege           mcollege             home              urban          
##  Length:4739        Length:4739        Length:4739        Length:4739       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      unemp             wage           distance         tuition      
##  Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
##  1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
##  Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
##  Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
##  3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
##  Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
##    education        income             region         
##  Min.   :12.00   Length:4739        Length:4739       
##  1st Qu.:12.00   Class :character   Class :character  
##  Median :13.00   Mode  :character   Mode  :character  
##  Mean   :13.81                                        
##  3rd Qu.:16.00                                        
##  Max.   :18.00
head(JR_CollegeDistance)
##   X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other
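To make the role of the text argument clearer, here is a minimal sketch that parses a CSV held in a character string, which is what getURL() hands back. The two-row string and the name toy_df are made up for illustration and are not part of the CollegeDistance data.

```r
# Hypothetical two-row CSV held in a character string, standing in for
# the text that getURL() returns.
csv_text <- "id,score,distance\n1,39.15,0.2\n2,48.87,0.2"

# read.csv(text = ...) parses the string as if it were the contents of a file.
toy_df <- read.csv(text = csv_text)
str(toy_df)

# read.csv() also accepts a URL as its first argument, so the download and
# the parse can be a single step (left commented out because it needs a
# network connection):
# JR_CollegeDistance <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/CollegeDistance.csv")
```

Reading straight from the URL would likely remove the need for RCurl here, although the two-step version makes it easier to inspect the raw text if the download goes wrong.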

1. Display the mean and median for at least two attributes

Mean and Median for the Score and Distance Attributes

# mean & median for score column
mean(JR_CollegeDistance$score)
## [1] 50.88903
median(JR_CollegeDistance$score)
## [1] 51.19
# mean & median for distance column
mean(JR_CollegeDistance$distance)
## [1] 1.80287
median(JR_CollegeDistance$distance)
## [1] 1
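The four calls above can be collapsed into one with sapply(). The sketch below uses a small made-up data frame (toy_df and its values are assumptions, not CollegeDistance rows) so that it runs without a download.

```r
# Toy data frame with hypothetical values
toy_df <- data.frame(
  score    = c(40, 50, 60, 70),
  distance = c(0.2, 0.4, 1.0, 20.0)
)

# Apply one function to each selected column. The result is a matrix with
# one row per statistic and one column per attribute.
stats <- sapply(
  toy_df[, c("score", "distance")],
  function(col) c(mean = mean(col), median = median(col))
)
print(stats)
```

In the toy data a single large distance pulls the mean well above the median. The real distance column shows the same pattern on a smaller scale (mean 1.80 against median 1, with a maximum of 20), which suggests it is right-skewed.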

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

The new data frame is a subset containing the gender, ethnicity, score, distance, and tuition columns. str() shows the data type of each column in the new data frame, and head() displays its first few rows.

# Selecting columns with [ ] already returns a data frame
JR_NewDF <- JR_CollegeDistance[, c("gender", "ethnicity", "score", "distance", "tuition")]
str(JR_NewDF)
## 'data.frame':    4739 obs. of  5 variables:
##  $ gender   : chr  "male" "female" "male" "male" ...
##  $ ethnicity: chr  "other" "other" "other" "afam" ...
##  $ score    : num  39.2 48.9 48.7 40.4 40.5 ...
##  $ distance : num  0.2 0.2 0.2 0.2 0.4 ...
##  $ tuition  : num  0.889 0.889 0.889 0.889 0.889 ...
head(JR_NewDF)
##   gender ethnicity score distance tuition
## 1   male     other 39.15      0.2 0.88915
## 2 female     other 48.87      0.2 0.88915
## 3   male     other 48.74      0.2 0.88915
## 4   male      afam 40.40      0.2 0.88915
## 5 female     other 40.48      0.4 0.88915
## 6   male     other 54.71      0.4 0.88915
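The subset above selects columns only. For completeness, here is a sketch of filtering rows and columns in the same step. The toy data frame and the score > 40 cutoff are arbitrary assumptions chosen for illustration.

```r
# Toy data frame with hypothetical values
toy_df <- data.frame(
  gender   = c("male", "female", "male", "female"),
  score    = c(39.15, 48.87, 54.71, 40.48),
  distance = c(0.2, 0.2, 0.4, 0.4),
  unemp    = c(6.2, 6.2, 5.6, 5.6)
)

# Row condition goes before the comma, column names after it:
# keep rows with score above 40 and drop the unemp column.
toy_subset <- toy_df[toy_df$score > 40, c("gender", "score", "distance")]
str(toy_subset)

# subset() expresses the same filter without repeating the data frame name.
toy_subset2 <- subset(toy_df, score > 40, select = c(gender, score, distance))
```

The bracket form is the safer choice inside functions, since subset() evaluates its condition with non-standard evaluation.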

3. Create new column names for the new data frame

str() shows the data types and the new column names. head() then displays the first few rows with the renamed columns.

colnames(JR_NewDF) <- c("gender_new", "ethnicity_new", "score_new", "distance_new", "tuition_new")
str(JR_NewDF)
## 'data.frame':    4739 obs. of  5 variables:
##  $ gender_new   : chr  "male" "female" "male" "male" ...
##  $ ethnicity_new: chr  "other" "other" "other" "afam" ...
##  $ score_new    : num  39.2 48.9 48.7 40.4 40.5 ...
##  $ distance_new : num  0.2 0.2 0.2 0.2 0.4 ...
##  $ tuition_new  : num  0.889 0.889 0.889 0.889 0.889 ...
head(JR_NewDF)
##   gender_new ethnicity_new score_new distance_new tuition_new
## 1       male         other     39.15          0.2     0.88915
## 2     female         other     48.87          0.2     0.88915
## 3       male         other     48.74          0.2     0.88915
## 4       male          afam     40.40          0.2     0.88915
## 5     female         other     40.48          0.4     0.88915
## 6       male         other     54.71          0.4     0.88915
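Since every new name here is the old name plus a suffix, the names could also be built rather than typed out. This is only a sketch on a made-up two-column data frame; the name test_score is a hypothetical example of renaming one column on its own.

```r
# Toy data frame with hypothetical values
toy_df <- data.frame(gender = c("male", "female"), score = c(39.15, 48.87))

# Append the suffix to every existing column name in one step.
colnames(toy_df) <- paste0(colnames(toy_df), "_new")

# Rename a single column by matching its current name.
colnames(toy_df)[colnames(toy_df) == "score_new"] <- "test_score"
colnames(toy_df)
```

Matching on the current name avoids depending on column position, which helps if columns are later reordered.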

4. Use the summary function to create an overview of your new data frame.

Then print the mean and median for the same two attributes. Please compare.

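The mean and median of both attributes match the earlier values. This is expected: the new data frame keeps all 4739 rows and only drops and renames columns, so the score and distance values themselves are unchanged.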

summary(JR_NewDF)
##   gender_new        ethnicity_new        score_new      distance_new   
##  Length:4739        Length:4739        Min.   :28.95   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:43.92   1st Qu.: 0.400  
##  Mode  :character   Mode  :character   Median :51.19   Median : 1.000  
##                                        Mean   :50.89   Mean   : 1.803  
##                                        3rd Qu.:57.77   3rd Qu.: 2.500  
##                                        Max.   :72.81   Max.   :20.000  
##   tuition_new    
##  Min.   :0.2575  
##  1st Qu.:0.4850  
##  Median :0.8245  
##  Mean   :0.8146  
##  3rd Qu.:1.1270  
##  Max.   :1.4042
# mean & median for score column
mean(JR_NewDF$score_new)
## [1] 50.88903
median(JR_NewDF$score_new)
## [1] 51.19
# mean & median for distance column
mean(JR_NewDF$distance_new)
## [1] 1.80287
median(JR_NewDF$distance_new)
## [1] 1
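The comparison can also be made by R rather than by eye. Below is a minimal sketch with made-up data frames named original and renamed (both hypothetical), using all.equal() for the mean, since it tolerates tiny floating-point differences, and identical() for the median.

```r
# Hypothetical "before" data frame
original <- data.frame(
  score    = c(39.15, 48.87, 48.74),
  distance = c(0.2, 0.2, 0.4),
  unemp    = c(6.2, 6.2, 5.6)
)

# "After" data frame: same rows, fewer columns, new names
renamed <- original[, c("score", "distance")]
colnames(renamed) <- c("score_new", "distance_new")

# all.equal() returns TRUE or a description of the difference,
# so wrap it in isTRUE() to get a plain logical.
same_mean   <- isTRUE(all.equal(mean(original$score), mean(renamed$score_new)))
same_median <- identical(median(original$distance), median(renamed$distance_new))
c(same_mean = same_mean, same_median = same_median)
```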

5. For at least 3 values in a column please rename so that every value in that column is renamed.

Although I know this is also possible with bracket subsetting ([ ]), I found the replace() function easier to read.

JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "other", "Unknown")
JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "hispanic", "Latino")
JR_NewDF$ethnicity_new <- replace(JR_NewDF$ethnicity_new, JR_NewDF$ethnicity_new == "afam", "African American")
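For comparison, here is a sketch of the bracket-subsetting approach mentioned above, plus a named lookup vector as a third option. It runs on a short made-up vector rather than the real column, and the names by_bracket, lookup, and by_lookup are assumptions for illustration.

```r
# Hypothetical stand-in for the ethnicity column
ethnicity <- c("other", "afam", "hispanic", "other")

# Bracket subsetting: assign to the elements where the condition is TRUE.
by_bracket <- ethnicity
by_bracket[by_bracket == "other"]    <- "Unknown"
by_bracket[by_bracket == "hispanic"] <- "Latino"
by_bracket[by_bracket == "afam"]     <- "African American"

# Named lookup vector: old values are the names, new values are the elements.
lookup <- c(other = "Unknown", hispanic = "Latino", afam = "African American")
by_lookup <- unname(lookup[ethnicity])

identical(by_bracket, by_lookup)
```

The lookup vector keeps all the mappings in one place, but any value missing from it becomes NA, so it only suits cases like this one where every value in the column is being renamed.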

6. Display enough rows to see example.

In this case, tail() with n = 10 displays the last 10 rows, which include all three of the renamed values in the ethnicity_new column.

tail(JR_NewDF, n=10)
##      gender_new    ethnicity_new score_new distance_new tuition_new
## 4730     female           Latino     58.16          0.3     0.25751
## 4731     female          Unknown     58.44          0.3     0.25751
## 4732       male African American     49.25          0.3     0.25751
## 4733       male African American     50.83          0.3     0.25751
## 4734     female          Unknown     59.29          0.3     0.25751
## 4735       male African American     56.53          0.8     0.25751
## 4736       male African American     59.77          0.8     0.25751
## 4737       male          Unknown     43.17          0.8     0.25751
## 4738       male African American     49.97          0.8     0.25751
## 4739       male African American     53.41          0.8     0.25751
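Displaying rows confirms the change for the rows shown, but not for the whole column. A sketch of a fuller check, using a short made-up vector in place of JR_NewDF$ethnicity_new, could look like this:

```r
# Hypothetical stand-in for the renamed column
ethnicity_new <- c("Unknown", "African American", "Latino",
                   "Unknown", "African American")

# Distinct values that remain after renaming
unique(ethnicity_new)

# Number of rows carrying each value
counts <- table(ethnicity_new)
print(counts)

# Any of the old labels still present? An empty result means none are left.
leftover <- intersect(ethnicity_new, c("other", "hispanic", "afam"))
length(leftover) == 0
```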
