Wilson Chau Summer 2022
I wanted to analyze a data set that focuses on vocabulary and education. This dataset has sources recorded from 1972-2016. I wanted to see whether vocabulary test results differ between male and female participants, and which group had the stronger educational background.
source: https://vincentarelbundock.github.io/Rdatasets/doc/carData/Vocab.html
Questions to answer:
1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
#5. BONUS – place the original .csv in a github file and have R read from the link.
voc_url <- "https://raw.githubusercontent.com/Wilchau/Rprogrammingweek3Project/main/Vocab.csv"
Voc <- read.csv(voc_url, header = TRUE, sep = ",")
Vocdf <- data.frame(Voc) #read.csv already returns a data frame; this keeps a working copy
#This dataset has 30,351 observations of 5 variables: X (ID), year (survey year), sex, education (in years), vocabulary (score out of 10)
head(Vocdf)
## X year sex education vocabulary
## 1 19740001 1974 Male 14 9
## 2 19740002 1974 Male 16 9
## 3 19740003 1974 Female 10 9
## 4 19740004 1974 Female 10 5
## 5 19740005 1974 Female 12 8
## 6 19740006 1974 Male 16 8
summary(Vocdf)
## X year sex education
## Min. :19740001 Min. :1974 Length:30351 Min. : 0.00
## 1st Qu.:19870112 1st Qu.:1987 Class :character 1st Qu.:12.00
## Median :19942104 Median :1994 Mode :character Median :12.00
## Mean :19954597 Mean :1995 Mean :13.03
## 3rd Qu.:20063676 3rd Qu.:2006 3rd Qu.:15.00
## Max. :20162866 Max. :2016 Max. :20.00
## vocabulary
## Min. : 0.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 6.004
## 3rd Qu.: 7.000
## Max. :10.000
#This answers Question 1 by reporting means, medians, and quartiles.
I am renaming year and vocabulary to test_year and voc_score to give the columns clearer identifiers.
#rename year to test_year and vocabulary to voc_score
colnames(Vocdf)[colnames(Vocdf) == "year"] <- "test_year"
colnames(Vocdf)[colnames(Vocdf) == "vocabulary"] <- "voc_score"
colnames(Vocdf, do.NULL = TRUE, prefix ="col")
## [1] "X" "test_year" "sex" "education" "voc_score"
These three summaries give the mean, median, and quartiles for test_year, education, and voc_score.
summary(Vocdf$test_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1974 1987 1994 1995 2006 2016
summary(Vocdf$education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.00 12.00 13.03 15.00 20.00
summary(Vocdf$voc_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 6.000 6.004 7.000 10.000
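summary() reports location but not spread or missing values, which also count as relevant information for data exploration. The following is a minimal sketch, assuming a toy data frame (`toy`, a hypothetical name) built from the six rows that head() printed; on the full data, `Vocdf` would take its place.

```r
# Toy data: the six rows printed by head(), using the renamed columns.
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Standard deviation and interquartile range of the numeric columns.
num_cols <- c("education", "voc_score")
spread <- data.frame(
  sd  = sapply(toy[num_cols], sd),
  iqr = sapply(toy[num_cols], IQR)
)

print(na_counts)
print(spread)
```

The numbers from six rows are only illustrative; the same calls on `Vocdf` would describe the whole data set.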
#2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
This code makes a subset containing test_year, education, voc_score, and sex, and converts sex from a character variable to a factor.
Voc_subset <- Vocdf[,c ("test_year", "education", "voc_score","sex")]
head(Voc_subset)
## test_year education voc_score sex
## 1 1974 14 9 Male
## 2 1974 16 9 Male
## 3 1974 10 9 Female
## 4 1974 10 5 Female
## 5 1974 12 8 Female
## 6 1974 16 8 Male
Voc_subset$sex <- as.factor(Voc_subset$sex) #convert sex from character to a factor
summary(Voc_subset)
## test_year education voc_score sex
## Min. :1974 Min. : 0.00 Min. : 0.000 Female:17148
## 1st Qu.:1987 1st Qu.:12.00 1st Qu.: 5.000 Male :13203
## Median :1994 Median :12.00 Median : 6.000
## Mean :1995 Mean :13.03 Mean : 6.004
## 3rd Qu.:2006 3rd Qu.:15.00 3rd Qu.: 7.000
## Max. :2016 Max. :20.00 Max. :10.000
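The sex counts in the summary can also be read as shares of the sample. This small sketch takes the two counts printed above as given and converts them to percentages; the vector name `counts` is only illustrative.

```r
# Counts copied from the summary() output for the sex factor.
counts <- c(Female = 17148, Male = 13203)

# Total number of participants.
total <- sum(counts)

# Share of each group, in percent, rounded to one decimal place.
pct <- round(100 * prop.table(counts), 1)

print(total)
print(pct)
```

On the data frame itself, `prop.table(table(Voc_subset$sex))` should give the same shares without retyping the counts.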
Based on this summary, 17,148 females and 13,203 males participated in this exam. The median test year was 1994, and the median education level is 12 years.
#3 Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
Graphs: Based on the summary of the subset, education and vocabulary score may be related. Comparing test_year with voc_score could show whether some years had an easier or harder exam, which would result in higher or lower scores.
#This histogram shows that scores cluster around the median of 6, and the mean is 6.004. A median of 6 means that at least half of the 30,351 participants scored 6 or lower. I also show a boxplot of education level, whose median is 12 years according to the summary.
boxplot(Voc_subset$education, main = "education level", ylab = "education years")
hist(Voc_subset$voc_score, main = "Vocabulary score", xlab = "Vocabulary Score")
summary(Voc_subset$education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.00 12.00 13.03 15.00 20.00
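Since the graphics question asks for a scatter plot and ggplot2 is already loaded, education against vocabulary score is a natural pair to plot. This is a sketch rather than the report's own figure: it uses a toy data frame (`toy`, a hypothetical name) built from the six rows that head() printed, and it jitters the points because both variables are whole numbers and would otherwise overlap.

```r
library(ggplot2)

# Toy data: the six rows printed by head(), using the renamed columns.
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Jittered scatter plot with a straight-line trend fitted by least squares.
p <- ggplot(toy, aes(x = education, y = voc_score)) +
  geom_jitter(aes(color = sex), width = 0.2, height = 0.2, alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Vocabulary score versus education",
    x = "Education (years)",
    y = "Vocabulary score"
  )

print(p)
```

With the full data, passing `Voc_subset` instead of `toy` and lowering `alpha` would help with the roughly 30,000 overlapping points.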
The boxplot of sex versus education indicates that males had a higher education level than females. Males and females took the test in the same years, and they scored similarly.
plot(Voc_subset$education ~ Voc_subset$sex)
plot(Voc_subset$test_year ~ Voc_subset$sex)
plot(Voc_subset$voc_score ~ Voc_subset$sex)
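The boxplots compare the groups visually; per-group medians put numbers on the same comparison. The sketch below assumes a toy data frame (`toy`, a hypothetical name) built from the six rows that head() printed, so its values illustrate the call and are not results for the full data.

```r
# Toy data: the six rows printed by head(), using the renamed columns.
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Median education and vocabulary score for each sex.
by_sex <- aggregate(cbind(education, voc_score) ~ sex, data = toy, FUN = median)

print(by_sex)
```

Replacing `toy` with `Voc_subset` would give the medians that the three boxplots display as their center lines.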
#4. Meaningful question for analysis: Please state at the beginning a
meaningful question for analysis. Use the first three steps and anything
else that would be helpful to answer the question you are posing from
the data set you chose. Please write a brief conclusion paragraph in R
markdown at the end.
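A question about whether education relates to vocabulary score can be checked directly with a correlation coefficient and a simple linear fit. The following is a hedged sketch, not part of the original analysis: it runs on a toy data frame (`toy`, a hypothetical name) built from the six rows that head() printed, and it includes Spearman's rank correlation as an assumption-light alternative because the score is a whole number from 0 to 10.

```r
# Toy data: the six rows printed by head(), using the renamed columns.
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Pearson correlation: strength of the linear relationship.
r_pearson <- cor(toy$education, toy$voc_score)

# Spearman correlation: based on ranks, so it suits ordinal scores.
r_spearman <- cor(toy$education, toy$voc_score, method = "spearman")

# Least-squares line: expected change in score per extra year of education.
fit <- lm(voc_score ~ education, data = toy)
slope <- coef(fit)[["education"]]

print(c(pearson = r_pearson, spearman = r_spearman, slope = slope))
```

Six rows are far too few to support a conclusion; running the same calls on `Voc_subset`, and `cor.test()` for a significance test, would be the version that speaks to the full data set.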
At the beginning, I wanted to see whether education level has any correlation with a higher voc_score. In the boxplot of sex versus education, there were many more outliers among female participants with less education than the first quartile (25%). This may skew some of the data, while the males have fewer outliers. Based on observation, males have a higher educational background. In the analysis of sex versus voc_score, both males and females score around a median of 6 at the same educational level. This can indicate that