The list of data sets is provided by http://vincentarelbundock.github.io/Rdatasets/. To review and select the data set we want to explore, we can clone the Rdatasets repository from GitHub to our local machine.
We can also download and review datasets.csv to get an idea of what we can explore in each dataset provided.
I have selected the “Vocabulary and Education” dataset.
We can start by reading the csv file of the dataset into our R Markdown document.
# First I want to make sure the csv file and the R Markdown file are in the same folder.
# I don't necessarily need to do this, but it makes it easier to point R at the file.
getwd()
## [1] "C:/Users/Anil Akyildirim/Desktop/Data Science/MS Bridge/Week 2"
# I can read the "Vocab.csv" file (which is also in the Data Science/MS Bridge/Week 2 folder) and create my data frame.
vocab <- read.csv('Vocab.csv', header=TRUE, stringsAsFactors = FALSE)
head(vocab)
## X year sex education vocabulary
## 1 19740001 1974 Male 14 9
## 2 19740002 1974 Male 16 9
## 3 19740003 1974 Female 10 9
## 4 19740004 1974 Female 10 5
## 5 19740005 1974 Female 12 8
## 6 19740006 1974 Male 16 8
A brief look at the data set shows five columns: X, year, sex, education and vocabulary.
These columns are described at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/carData/Vocab.html.
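The column names and types can also be checked from R itself. The sketch below rebuilds only the six rows printed by head(vocab) above into a small data frame (the name `vocab_head` is made up for this illustration) and inspects it; on the real data the same calls would be applied to `vocab`.

```r
# Small data frame holding only the six rows shown by head(vocab) above.
# vocab_head is an illustrative name, not part of the original analysis.
vocab_head <- data.frame(
  X = 19740001:19740006,
  year = rep(1974L, 6),
  sex = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14L, 16L, 10L, 10L, 12L, 16L),
  vocabulary = c(9L, 9L, 9L, 5L, 8L, 8L),
  stringsAsFactors = FALSE
)

names(vocab_head)                           # the column names
str(vocab_head)                             # type and first values of each column
column_classes <- sapply(vocab_head, class) # class of each column as a named vector
column_classes
```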
Use the summary function to gain overview of the data set. Then display the mean and median for at least two attributes.
We can use the summary() function for the overview, and then get the mean and median of the education and vocabulary attributes.
summary(vocab)
## X year sex education
## Min. :19740001 Min. :1974 Length:30351 Min. : 0.00
## 1st Qu.:19870112 1st Qu.:1987 Class :character 1st Qu.:12.00
## Median :19942104 Median :1994 Mode :character Median :12.00
## Mean :19954597 Mean :1995 Mean :13.03
## 3rd Qu.:20063676 3rd Qu.:2006 3rd Qu.:15.00
## Max. :20162866 Max. :2016 Max. :20.00
## vocabulary
## Min. : 0.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 6.004
## 3rd Qu.: 7.000
## Max. :10.000
mean(vocab$education)
## [1] 13.03423
mean(vocab$vocabulary)
## [1] 6.003657
median(vocab$education)
## [1] 12
median(vocab$vocabulary)
## [1] 6
Looking at our findings:
The summary of the data frame shows the Min, 1st Qu., Median, Mean, 3rd Qu. and Max of each numeric column. For the character column sex it reports only the length, class and mode.
Mean of the education column is 13.03. We can see this in both the summary() and mean() output.
Mean of the vocabulary column is 6.00. We can see this in both the summary() and mean() output.
Median of the education column is 12. We can see this in both the summary() and median() output.
Median of the vocabulary column is 6. We can see this in both the summary() and median() output.
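The four separate mean() and median() calls can be collapsed into one. The following is a minimal sketch, not the method used above: it applies both functions to both columns with sapply(), using only the six rows printed by head(vocab) (so the numbers differ from the full-data results). The name `center_stats` is made up for this example.

```r
# The education and vocabulary values from the six rows shown by head(vocab).
six_rows <- data.frame(
  education = c(14, 16, 10, 10, 12, 16),
  vocabulary = c(9, 9, 9, 5, 8, 8)
)

# For each column, return a named vector holding its mean and median.
# sapply() binds these vectors into a matrix: one row per statistic, one column per attribute.
center_stats <- sapply(six_rows, function(x) c(mean = mean(x), median = median(x)))
center_stats
```

On the full data the same call would be written against `vocab[, c("education", "vocabulary")]`.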
Create a new data frame with a subset of the columns and rows. Make sure to rename it.
In my new data set, I would like to exclude surveys from 1990 or earlier. Assuming that most individuals have at least 8 years of education, I would also like to exclude respondents with fewer than 8 years of education (7 or less), so that 8 is the smallest value kept.
The purpose of subsetting the data set this way might be to answer the question: “Would I get a better vocabulary score if I had more years of education, given that my minimum education would be 8 years?”
vocab_new <- subset(vocab, year>1990 & education > 7)
head(vocab_new)
## X year sex education vocabulary
## 11879 19910001 1991 Female 12 5
## 11880 19910003 1991 Male 20 10
## 11881 19910005 1991 Female 12 5
## 11882 19910006 1991 Male 10 3
## 11883 19910008 1991 Female 16 6
## 11884 19910010 1991 Female 14 7
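Because the boundary values matter here (1990 and 7 are excluded, 1991 and 8 are kept), it can help to check the filter on a few hand-picked rows. The sketch below uses a made-up data frame (`toy`, with hypothetical values chosen to sit on both sides of each cutoff) and compares subset() with the equivalent logical indexing.

```r
# Hypothetical rows chosen to sit on both sides of each cutoff; not real survey data.
toy <- data.frame(
  year = c(1989, 1990, 1991, 1991, 2000, 2016),
  education = c(12, 12, 7, 8, 12, 6),
  vocabulary = c(6, 5, 4, 7, 8, 3)
)

# Same condition as in the analysis: survey after 1990 and more than 7 years of education.
by_subset <- subset(toy, year > 1990 & education > 7)

# Equivalent base-R logical indexing.
by_index <- toy[toy$year > 1990 & toy$education > 7, ]

by_subset
range(by_subset$year)       # smallest and largest year kept
min(by_subset$education)    # smallest education value kept
```

One difference worth knowing: subset() silently drops rows where the condition is NA, while plain logical indexing returns a row of NAs for them. The toy data has no missing values, so both give the same rows here.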
Create new column names for the new data frame.
# rename column name "X" to "survey_id".
# rename column name "year" to "survey_year".
# rename column name "sex" to "gender".
# rename column name "education" to "education_years".
# rename column name "vocabulary" to "score".
colnames(vocab_new) <- c("survey_id", "survey_year", "gender", "education_years", "score")
head(vocab_new)
## survey_id survey_year gender education_years score
## 11879 19910001 1991 Female 12 5
## 11880 19910003 1991 Male 20 10
## 11881 19910005 1991 Female 12 5
## 11882 19910006 1991 Male 10 3
## 11883 19910008 1991 Female 16 6
## 11884 19910010 1991 Female 14 7
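Assigning a full vector to colnames() relies on the columns being in a known order. As an alternative sketch, the renaming can be driven by a lookup from old name to new name, which keeps working if the column order changes. The data frame `shuffled` and the vector `new_names` are made up for this illustration; the two rows reuse values from the output above with the columns deliberately reordered.

```r
# Two rows from the output above, with the columns deliberately out of order.
shuffled <- data.frame(
  year = c(1991L, 1991L),
  X = c(19910001L, 19910003L),
  sex = c("Female", "Male"),
  vocabulary = c(5L, 10L),
  education = c(12L, 20L),
  stringsAsFactors = FALSE
)

# Lookup from old column name to new column name.
new_names <- c(
  X = "survey_id",
  year = "survey_year",
  sex = "gender",
  education = "education_years",
  vocabulary = "score"
)

# Stop early if the data frame has a column the lookup does not cover.
stopifnot(all(names(shuffled) %in% names(new_names)))

# Index the lookup by the current names, so each column gets its own new name.
names(shuffled) <- unname(new_names[names(shuffled)])
names(shuffled)
```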
Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
Below are the summaries of both the new and the original data frame.
summary(vocab_new) # summary of the new data frame
## survey_id survey_year gender education_years
## Min. :19910001 Min. :1991 Length:18002 Min. : 8.00
## 1st Qu.:19961230 1st Qu.:1996 Class :character 1st Qu.:12.00
## Median :20041772 Median :2004 Mode :character Median :13.00
## Mean :20038417 Mean :2004 Mean :13.69
## 3rd Qu.:20120329 3rd Qu.:2012 3rd Qu.:16.00
## Max. :20162866 Max. :2016 Max. :20.00
## score
## Min. : 0.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 6.112
## 3rd Qu.: 7.000
## Max. :10.000
summary(vocab) # summary of the old data frame
## X year sex education
## Min. :19740001 Min. :1974 Length:30351 Min. : 0.00
## 1st Qu.:19870112 1st Qu.:1987 Class :character 1st Qu.:12.00
## Median :19942104 Median :1994 Mode :character Median :12.00
## Mean :19954597 Mean :1995 Mean :13.03
## 3rd Qu.:20063676 3rd Qu.:2006 3rd Qu.:15.00
## Max. :20162866 Max. :2016 Max. :20.00
## vocabulary
## Min. : 0.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 6.004
## 3rd Qu.: 7.000
## Max. :10.000
Below are my findings from comparing vocab (the old data frame) and vocab_new (the new data frame):
As expected, the minimum education in years for the new data frame is 8, since we set it up that way in the subset step. The minimum education in years for the old data frame is 0.
The median of score (vocabulary in the original data frame) is the same (6.00) for both the new and old data frames, which is surprising. Since we removed the individuals with fewer than 8 years of education, I would have expected the median score in the new data frame to be higher than in the old one.
The median education in years is higher in the new data frame (13) than in the old one (12). This is expected, as removing the individuals with fewer than 8 years of education raises the median.
The max of year is identical in the new and old data frames. This is again expected, as the subset only dropped the earlier years and the lower education values, nothing that would change the max value.
Considering findings 2 and 3, we can further explore the dataset to see whether there is a correlation between education in years and vocabulary score. Based on my findings, my assumption is that there is a correlation, but not as strong as I would expect.
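One way to follow up on that assumption is cor(). The sketch below only shows the mechanics: it uses the six education and vocabulary values printed by head(vocab) earlier, which are far too few to say anything about the full data set. Computing the Spearman rank correlation alongside the default Pearson coefficient is my own suggestion, on the assumption that a 0 to 10 test score may be better treated as ordinal.

```r
# Education and vocabulary values from the six rows shown by head(vocab).
education <- c(14, 16, 10, 10, 12, 16)
vocabulary <- c(9, 9, 9, 5, 8, 8)

# Pearson correlation (the default) measures linear association.
pearson <- cor(education, vocabulary)

# Spearman correlation works on ranks, so it only assumes a monotonic relationship.
spearman <- cor(education, vocabulary, method = "spearman")

c(pearson = pearson, spearman = spearman)

# On the full subset the call would be (column names as renamed above):
# cor(vocab_new$education_years, vocab_new$score)
```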
For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “Excellent”
I can rename the “Female” and “Male” values in the “gender” column of our new data set. Since the requirement is to rename at least 3 values, I will also replace 5 with 6 in the “score” column.
vocab_new$gender <- with(vocab_new, replace(gender, gender=="Male", "M"))
vocab_new$gender <- with(vocab_new, replace(gender, gender=="Female", "F"))
vocab_new$score <- with(vocab_new, replace(score, score==5,6))
head(vocab_new)
## survey_id survey_year gender education_years score
## 11879 19910001 1991 F 12 6
## 11880 19910003 1991 M 20 10
## 11881 19910005 1991 F 12 6
## 11882 19910006 1991 M 10 3
## 11883 19910008 1991 F 16 6
## 11884 19910010 1991 F 14 7
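The two replace() calls for gender can also be written as a single lookup, and table() gives a quick check that every old value was mapped to the intended new one. This is a minimal sketch using the six gender values from the head() output before renaming; `codes` and `recoded` are made-up names.

```r
# Gender values of the first six rows of vocab_new, before renaming.
gender <- c("Female", "Male", "Female", "Male", "Female", "Female")

# Lookup from old value to new value.
codes <- c(Male = "M", Female = "F")

# Index the lookup by the old values; unname() drops the old values as labels.
recoded <- unname(codes[gender])

# Cross-tabulate old against new to confirm each value went where intended.
table(old = gender, new = recoded)
```

A value missing from the lookup would come back as NA, which makes an unexpected category easy to spot. Note also that replacing 5 with 6 in score, as done above, changes the data itself, so any statistics on score should be computed before that step.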
Display enough rows to see examples of all of the steps 1-5 above.
I will display the first 10 and last 10 rows.
head(vocab_new, 10)
## survey_id survey_year gender education_years score
## 11879 19910001 1991 F 12 6
## 11880 19910003 1991 M 20 10
## 11881 19910005 1991 F 12 6
## 11882 19910006 1991 M 10 3
## 11883 19910008 1991 F 16 6
## 11884 19910010 1991 F 14 7
## 11885 19910011 1991 M 9 7
## 11886 19910012 1991 F 12 8
## 11887 19910013 1991 M 15 9
## 11888 19910014 1991 M 14 2
tail(vocab_new, 10)
## survey_id survey_year gender education_years score
## 30342 20162856 2016 M 15 7
## 30343 20162857 2016 F 15 6
## 30344 20162858 2016 F 12 6
## 30345 20162859 2016 F 15 6
## 30346 20162860 2016 M 12 3
## 30347 20162861 2016 F 13 6
## 30348 20162863 2016 F 20 8
## 30349 20162864 2016 M 15 7
## 30350 20162865 2016 F 14 9
## 30351 20162866 2016 F 14 6
BONUS - place the original.csv file in a github file and have R read from the link. This will be very useful skill as you progress in your data science education and career.
First I uploaded the csv file to a GitHub repo I created. It is located here: https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv
Now I can read it from the GitHub link.
vocab_github <- read.csv('https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv', header=TRUE)
head(vocab_github)
## X year sex education vocabulary
## 1 19740001 1974 Male 14 9
## 2 19740002 1974 Male 16 9
## 3 19740003 1974 Female 10 9
## 4 19740004 1974 Female 10 5
## 5 19740005 1974 Female 12 8
## 6 19740006 1974 Male 16 8
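Reading from a URL can fail when the network or the repo is unavailable, so it may be worth wrapping the call. The sketch below defines a hypothetical helper, `read_csv_or_null()` (not part of the original analysis), that returns NULL instead of stopping the document. It is demonstrated on a temporary local file so that it runs offline; the raw GitHub URL above can be passed in the same way. It also sets stringsAsFactors = FALSE explicitly, because the default for read.csv() changed to FALSE in R 4.0.0 and older versions would read sex as a factor.

```r
# Hypothetical helper: returns a data frame, or NULL if the source cannot be read.
read_csv_or_null <- function(path) {
  tryCatch(
    read.csv(path, header = TRUE, stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Write a tiny csv to a temporary file so the example does not need a network.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(year = c(1974, 1991), vocabulary = c(9, 5)),
  tmp,
  row.names = FALSE
)

ok <- read_csv_or_null(tmp)  # succeeds
missing_file <- read_csv_or_null(file.path(tempdir(), "does_not_exist.csv"))  # NULL

ok
is.null(missing_file)
```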