List of Data Sets

The list of data sets is provided by http://vincentarelbundock.github.io/Rdatasets/. To review and select the data set we want to explore, we can clone the Rdatasets repository from GitHub to our local machine.

We can also download and review datasets.csv to get an idea of what we can explore in each dataset provided.

I have selected the “Vocabulary and Education” dataset.

We can start by reading the dataset's csv file into our R markdown document.

# First, I want to make sure the csv file and the R markdown file are in the same folder.
# I don't necessarily need to do this, but it makes it easier to point R at the file.

getwd()
## [1] "C:/Users/Anil Akyildirim/Desktop/Data Science/MS Bridge/Week 2"
# I can read the "Vocab.csv" file (which is also in Data Science/MS Bridge/Week 2 folder) and create my dataframe.

vocab <- read.csv('Vocab.csv', header=TRUE, stringsAsFactors = FALSE)
head(vocab)
##          X year    sex education vocabulary
## 1 19740001 1974   Male        14          9
## 2 19740002 1974   Male        16          9
## 3 19740003 1974 Female        10          9
## 4 19740004 1974 Female        10          5
## 5 19740005 1974 Female        12          8
## 6 19740006 1974   Male        16          8

Initial View

When we take a brief look at the data set, we see the following columns: X, year, sex, education and vocabulary.

The descriptions of these columns are outlined at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/carData/Vocab.html.

Question 1:

Use the summary function to gain overview of the data set. Then display the mean and median for at least two attributes.

Answer 1:

We can use the summary() function for an overview, and the mean() and median() functions to get the mean and median of the education and vocabulary attributes.

summary(vocab)
##        X                 year          sex              education    
##  Min.   :19740001   Min.   :1974   Length:30351       Min.   : 0.00  
##  1st Qu.:19870112   1st Qu.:1987   Class :character   1st Qu.:12.00  
##  Median :19942104   Median :1994   Mode  :character   Median :12.00  
##  Mean   :19954597   Mean   :1995                      Mean   :13.03  
##  3rd Qu.:20063676   3rd Qu.:2006                      3rd Qu.:15.00  
##  Max.   :20162866   Max.   :2016                      Max.   :20.00  
##    vocabulary    
##  Min.   : 0.000  
##  1st Qu.: 5.000  
##  Median : 6.000  
##  Mean   : 6.004  
##  3rd Qu.: 7.000  
##  Max.   :10.000
mean(vocab$education)
## [1] 13.03423
mean(vocab$vocabulary)
## [1] 6.003657
median(vocab$education)
## [1] 12
median(vocab$vocabulary)
## [1] 6

When we look at our findings:

  • Summary of the dataframe shows us the Min, 1st Qu, Median, Mean, 3rd Qu and Max of each numeric column. For the character column sex, it shows only the length, class and mode.

  • Mean of the education column is 13.03. We can see this in both the summary() and mean() function output.

  • Mean of the vocabulary column is 6.00. We can see this in both the summary() and mean() function output.

  • Median of the education column is 12. We can see this in both the summary() and median() function output.

  • Median of the vocabulary column is 6. We can see this in both the summary() and median() function output.
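
The same two statistics can also be collected for both columns in one step. Below is a minimal sketch, assuming only the six rows printed by head(vocab) above, so its numbers differ from the full-dataset values. The name vocab_head is introduced here for illustration.

```r
# First six rows of Vocab.csv, copied from the head(vocab) output above
vocab_head <- data.frame(
  education  = c(14, 16, 10, 10, 12, 16),
  vocabulary = c(9, 9, 9, 5, 8, 8)
)

# Apply mean() and median() to each column; the result is a matrix
# with one row per statistic and one column per attribute
stats <- sapply(vocab_head, function(col) c(mean = mean(col), median = median(col)))
print(stats)
```

Running the same sapply() call on vocab[, c("education", "vocabulary")] should reproduce the values reported in the list above.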

Question 2:

Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Answer 2:

In my new data set, I would like to keep only surveys taken after 1990 (1991 onward). Assuming that most individuals have at least 8 years of education, I would also like to exclude anyone with fewer than 8 years of education (8 itself is kept).

The purpose of subsetting the data set this way might be to answer the question: “Would I get a better vocabulary score with more years of education, given that my minimum is 8 years?”

vocab_new <- subset(vocab, year>1990 & education > 7)
head(vocab_new)
##              X year    sex education vocabulary
## 11879 19910001 1991 Female        12          5
## 11880 19910003 1991   Male        20         10
## 11881 19910005 1991 Female        12          5
## 11882 19910006 1991   Male        10          3
## 11883 19910008 1991 Female        16          6
## 11884 19910010 1991 Female        14          7
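
To make the boundaries of the filter explicit, here is a small sketch that applies the same condition to rows sitting exactly on each cutoff. The rows in boundary are hypothetical values picked for illustration, not survey data, and bracket indexing is used as an equivalent to subset().

```r
# Hypothetical rows chosen to sit on both filter boundaries (not real survey data)
boundary <- data.frame(
  year      = c(1990, 1990, 1991, 1991),
  education = c(7, 8, 7, 8)
)

# Same condition as the subset() call, written with bracket indexing
kept <- boundary[boundary$year > 1990 & boundary$education > 7, ]
print(kept)
```

Only the row with year 1991 and education 8 survives, which shows that surveys from 1990 itself and respondents with exactly 7 years of education are both dropped.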

Question 3:

Create new column names for the new data frame.

Answer 3:

# rename column name "X" to "survey_id".
# rename column name "year" to "survey_year".
# rename column name "sex" to "gender".
# rename column name "education" to "education_years".
# rename column name "vocabulary" to "score".

colnames(vocab_new) <- c("survey_id", "survey_year", "gender", "education_years", "score")
head(vocab_new)
##       survey_id survey_year gender education_years score
## 11879  19910001        1991 Female              12     5
## 11880  19910003        1991   Male              20    10
## 11881  19910005        1991 Female              12     5
## 11882  19910006        1991   Male              10     3
## 11883  19910008        1991 Female              16     6
## 11884  19910010        1991 Female              14     7
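
Assigning a full vector to colnames() depends on the columns being in a known order. As an alternative sketch, the columns can be renamed by looking up each old name in a named vector. The names df and new_names are introduced here for illustration, and the two rows are copied from the head(vocab_new) output shown before renaming.

```r
# Two rows copied from the head(vocab_new) output shown before renaming
df <- data.frame(
  X = c(19910001, 19910003),
  year = c(1991, 1991),
  sex = c("Female", "Male"),
  education = c(12, 20),
  vocabulary = c(5, 10),
  stringsAsFactors = FALSE
)

# Map old names to new names; the names of this vector are the old column names
new_names <- c(X = "survey_id", year = "survey_year", sex = "gender",
               education = "education_years", vocabulary = "score")

# Look up each existing column name, so column order does not matter
names(df) <- unname(new_names[names(df)])
print(names(df))
```

A column that is missing from the map would come back as NA, so the map needs an entry for every column in the data frame.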

Question 4:

Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

Answer 4:

The summaries of the new and the original data frames are shown below.

summary(vocab_new) # summary of the new data frame
##    survey_id         survey_year      gender          education_years
##  Min.   :19910001   Min.   :1991   Length:18002       Min.   : 8.00  
##  1st Qu.:19961230   1st Qu.:1996   Class :character   1st Qu.:12.00  
##  Median :20041772   Median :2004   Mode  :character   Median :13.00  
##  Mean   :20038417   Mean   :2004                      Mean   :13.69  
##  3rd Qu.:20120329   3rd Qu.:2012                      3rd Qu.:16.00  
##  Max.   :20162866   Max.   :2016                      Max.   :20.00  
##      score       
##  Min.   : 0.000  
##  1st Qu.: 5.000  
##  Median : 6.000  
##  Mean   : 6.112  
##  3rd Qu.: 7.000  
##  Max.   :10.000
summary(vocab) # summary of the old data frame
##        X                 year          sex              education    
##  Min.   :19740001   Min.   :1974   Length:30351       Min.   : 0.00  
##  1st Qu.:19870112   1st Qu.:1987   Class :character   1st Qu.:12.00  
##  Median :19942104   Median :1994   Mode  :character   Median :12.00  
##  Mean   :19954597   Mean   :1995                      Mean   :13.03  
##  3rd Qu.:20063676   3rd Qu.:2006                      3rd Qu.:15.00  
##  Max.   :20162866   Max.   :2016                      Max.   :20.00  
##    vocabulary    
##  Min.   : 0.000  
##  1st Qu.: 5.000  
##  Median : 6.000  
##  Mean   : 6.004  
##  3rd Qu.: 7.000  
##  Max.   :10.000

Below are my findings from comparing vocab (old data frame) and vocab_new (new data frame):

  1. As expected, the minimum education in years for the new data frame is 8, since we set it up that way in the subset step. The minimum education in years for the old data frame is 0.

  2. Median of score (vocabulary in the original data frame) is the same (6.00) for both the new and the old data frame, which is surprising. Since we removed individuals with fewer than 8 years of education, I would have expected the median score in the new data frame to be higher. The mean does rise slightly, from 6.004 to 6.112.

  3. Median education in years is higher in the new data frame (13) than in the old data frame (12). This is expected, as removing individuals with fewer than 8 years of education raises the median.

  4. Max of survey year is identical (2016) for both the new and the old data frame. This is again expected, as the subset only removed the earliest surveys and nothing that would change the max value.

Considering findings 2 and 3, we can further explore the dataset to see if there is a correlation between education in years and vocabulary score. Based on my findings, my assumption is that there is a correlation, but not as high as I would expect.
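
One way to start that exploration is the cor() function, which returns the Pearson correlation coefficient. Below is a minimal sketch, assuming only the six rows printed by head(vocab_new) above. Six rows are far too few to say anything about the full data set, so this only shows the mechanics. The name sample_rows is introduced here for illustration.

```r
# First six rows of vocab_new, copied from the head(vocab_new) output above
sample_rows <- data.frame(
  education_years = c(12, 20, 12, 10, 16, 14),
  score           = c(5, 10, 5, 3, 6, 7)
)

# Pearson correlation coefficient between years of education and score
r <- cor(sample_rows$education_years, sample_rows$score)
print(r)
```

On the real data the equivalent call would be cor(vocab_new$education_years, vocab_new$score), run before any score values are changed.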

Question 5:

For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “Excellent”

Answer 5:

I can rename the “Female” and “Male” values in the “gender” column of our new data set. Since the question requires renaming at least 3 values, I will also replace 5 with 6 in the “score” column. Note that this last change alters the scores themselves, so it is done only to satisfy the exercise.

vocab_new$gender <- with(vocab_new, replace(gender, gender=="Male", "M"))
vocab_new$gender <- with(vocab_new, replace(gender, gender=="Female", "F"))
vocab_new$score <- with(vocab_new, replace(score, score==5,6))
head(vocab_new)
##       survey_id survey_year gender education_years score
## 11879  19910001        1991      F              12     6
## 11880  19910003        1991      M              20    10
## 11881  19910005        1991      F              12     6
## 11882  19910006        1991      M              10     3
## 11883  19910008        1991      F              16     6
## 11884  19910010        1991      F              14     7
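
When there are several values to rename, one replace() call per value gets repetitive. As an alternative sketch, a named lookup vector can recode them all in one step. The names gender, gender_codes and gender_short are introduced here for illustration, and the values are copied from the head(vocab_new) output before recoding.

```r
# Gender values copied from the head(vocab_new) output before recoding
gender <- c("Female", "Male", "Female", "Male", "Female", "Female")

# Named lookup vector: names are the old values, elements are the new ones
gender_codes <- c(Male = "M", Female = "F")

# Index the lookup by the old values; unname() drops the leftover names
gender_short <- unname(gender_codes[gender])
print(gender_short)
```

Any value without an entry in gender_codes would become NA, which makes unexpected categories easy to spot.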

Question 6:

Display enough rows to see examples of all of the steps 1-5 above.

Answer 6:

I will display the first 10 and the last 10 rows.

head(vocab_new, 10)
##       survey_id survey_year gender education_years score
## 11879  19910001        1991      F              12     6
## 11880  19910003        1991      M              20    10
## 11881  19910005        1991      F              12     6
## 11882  19910006        1991      M              10     3
## 11883  19910008        1991      F              16     6
## 11884  19910010        1991      F              14     7
## 11885  19910011        1991      M               9     7
## 11886  19910012        1991      F              12     8
## 11887  19910013        1991      M              15     9
## 11888  19910014        1991      M              14     2
tail(vocab_new, 10)
##       survey_id survey_year gender education_years score
## 30342  20162856        2016      M              15     7
## 30343  20162857        2016      F              15     6
## 30344  20162858        2016      F              12     6
## 30345  20162859        2016      F              15     6
## 30346  20162860        2016      M              12     3
## 30347  20162861        2016      F              13     6
## 30348  20162863        2016      F              20     8
## 30349  20162864        2016      M              15     7
## 30350  20162865        2016      F              14     9
## 30351  20162866        2016      F              14     6

Question 7:

BONUS - place the original.csv file in a github file and have R read from the link. This will be very useful skill as you progress in your data science education and career.

Answer 7:

First, I uploaded the csv file to the GitHub repo I created. It is located here: https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv

Now I can read it from the GitHub link.

vocab_github <- read.csv('https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv', header=TRUE)
head(vocab_github)
##          X year    sex education vocabulary
## 1 19740001 1974   Male        14          9
## 2 19740002 1974   Male        16          9
## 3 19740003 1974 Female        10          9
## 4 19740004 1974 Female        10          5
## 5 19740005 1974 Female        12          8
## 6 19740006 1974   Male        16          8
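
Reading from a link can fail when the network or the repository is unavailable. Below is a minimal sketch of a wrapper that returns NULL instead of stopping the document. read_vocab is a hypothetical helper, not part of the assignment, and it is demonstrated on a temporary local file so that it runs offline. Since read.csv() accepts a URL in the same argument, the GitHub link above could be passed in the same way.

```r
# Hypothetical helper: returns a data frame, or NULL if the source cannot be read
read_vocab <- function(source) {
  tryCatch(
    read.csv(source, header = TRUE, stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", source, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Demonstration on a temporary local file so the sketch runs offline;
# the single data row is copied from the head() output above
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,year,sex,education,vocabulary",
             "19740001,1974,Male,14,9"), tmp)
demo <- read_vocab(tmp)
print(demo)

# A source that does not exist yields NULL rather than an error
absent <- read_vocab(file.path(tempdir(), "no_such_file.csv"))
```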