This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html

The presentation approach is up to you but it should contain the following:

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).
3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R Markdown at the end.
5. BONUS: Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
I decided to analyze which population of people surveyed in 1985 smoked the most, based on gender, marital status, employment, etc.
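# Load the packages used below: dplyr provides %>%, ggplot2 draws the later plots
library(dplyr)
library(ggplot2)
# Load the data from my GitHub repository (bonus); same file as my local copy
file <- "https://raw.githubusercontent.com/Eperez54/R-Final-Project/main/CPS1985.csv"
smokerdf <- read.csv(file, header = TRUE, sep = ",")
smokerdf <- smokerdf %>% na.omit() # drop rows that contain NA values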
head(smokerdf)
## X wage education experience age ethnicity region gender occupation
## 1 1 5.10 8 21 35 hispanic other female worker
## 2 1100 4.95 9 42 57 cauc other female worker
## 3 2 6.67 12 1 19 cauc other male worker
## 4 3 4.00 12 4 22 cauc other male worker
## 5 4 7.50 12 17 35 cauc other male worker
## 6 5 13.07 13 9 28 cauc other male worker
## sector union married
## 1 manufacturing no yes
## 2 manufacturing no yes
## 3 manufacturing no no
## 4 other no no
## 5 other no yes
## 6 other yes no
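Before dropping rows with na.omit(), it can help to see how many values are actually missing. The sketch below shows the kind of check I have in mind. It uses a small made-up data frame (toy_na is hypothetical, not the CPS1985 file), so the counts say nothing about the real data.

```r
# Hypothetical toy data, NOT the CPS1985 file: two of the four rows contain an NA
toy_na <- data.frame(
  wage   = c(5.10, NA, 6.67, 4.00),
  gender = c("female", "female", NA, "male")
)

# Count the missing values in each column
na_per_column <- colSums(is.na(toy_na))
print(na_per_column)

# Compare the row counts before and after dropping incomplete rows
rows_before <- nrow(toy_na)
rows_after  <- nrow(na.omit(toy_na))
cat("Rows dropped by na.omit():", rows_before - rows_after, "\n")
```

Running the same two checks on smokerdf right after read.csv() would show whether na.omit() removes anything at all.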
# Print summary
summary(smokerdf)
## X wage education experience
## Min. : 1.0 Min. : 1.000 Min. : 2.00 Min. : 0.00
## 1st Qu.: 134.2 1st Qu.: 5.250 1st Qu.:12.00 1st Qu.: 8.00
## Median : 267.5 Median : 7.780 Median :12.00 Median :15.00
## Mean : 268.6 Mean : 9.024 Mean :13.02 Mean :17.82
## 3rd Qu.: 400.8 3rd Qu.:11.250 3rd Qu.:15.00 3rd Qu.:26.00
## Max. :1100.0 Max. :44.500 Max. :18.00 Max. :55.00
## age ethnicity region gender
## Min. :18.00 Length:534 Length:534 Length:534
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :35.00 Mode :character Mode :character Mode :character
## Mean :36.83
## 3rd Qu.:44.00
## Max. :64.00
## occupation sector union married
## Length:534 Length:534 Length:534 Length:534
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
colnames(smokerdf)
## [1] "X" "wage" "education" "experience" "age"
## [6] "ethnicity" "region" "gender" "occupation" "sector"
## [11] "union" "married"
# Print means and medians of the numeric columns only
# (mean() and median() return NA with a warning for character columns)
num_cols <- c("wage", "education", "experience", "age")
means <- sapply(smokerdf[, num_cols], mean)
medians <- sapply(smokerdf[, num_cols], median)
mean_mediandf <- data.frame(means, medians)
mean_mediandf
## means medians
## wage 9.024064 7.78
## education 13.018727 12.00
## experience 17.822097 15.00
## age 36.833333 35.00
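Means and medians only make sense for the numeric columns. For the character columns (ethnicity, region, gender and so on), frequency tables are the closest counterpart. Here is a minimal sketch with made-up values; toy_cat is a hypothetical data frame and only its column names come from the data set.

```r
# Hypothetical values for illustration; only the column names come from the data set
toy_cat <- data.frame(
  gender  = c("female", "female", "male", "male", "male"),
  married = c("yes", "yes", "no", "no", "yes")
)

# Counts per category: the categorical counterpart of a mean or median
gender_counts <- table(toy_cat$gender)
print(gender_counts)

# Shares per category, rounded to two decimals
gender_share <- round(prop.table(gender_counts), 2)
print(gender_share)

# Two-way table of gender against marital status
print(table(toy_cat$gender, toy_cat$married))
```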
smoker_subset <- smokerdf[1:40,c("wage", "education", "experience", "age", "ethnicity", "region", "gender",
"occupation", "sector", "union", "married")]
head(smoker_subset)
## wage education experience age ethnicity region gender occupation
## 1 5.10 8 21 35 hispanic other female worker
## 2 4.95 9 42 57 cauc other female worker
## 3 6.67 12 1 19 cauc other male worker
## 4 4.00 12 4 22 cauc other male worker
## 5 7.50 12 17 35 cauc other male worker
## 6 13.07 13 9 28 cauc other male worker
## sector union married
## 1 manufacturing no yes
## 2 manufacturing no yes
## 3 manufacturing no no
## 4 other no no
## 5 other no yes
## 6 other yes no
tail(smoker_subset)
## wage education experience age ethnicity region gender occupation
## 35 9.25 12 19 37 cauc other male worker
## 36 10.67 12 36 54 other other male worker
## 37 7.61 12 20 38 other south male worker
## 38 10.00 12 35 53 other other male worker
## 39 7.50 12 3 21 cauc other male worker
## 40 12.20 14 10 30 cauc south male worker
## sector union married
## 35 manufacturing no no
## 36 other no no
## 37 construction no yes
## 38 construction yes yes
## 39 other no no
## 40 manufacturing no yes
# Rename some columns in my subset data
colnames(smoker_subset) <- c("Salary", "Education", "Experience", "Age", "Race", "Region", "Sex",
"Occupation", "Sector", "union", "Married")
colnames(smoker_subset)
## [1] "Salary" "Education" "Experience" "Age" "Race"
## [6] "Region" "Sex" "Occupation" "Sector" "union"
## [11] "Married"
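Assigning a whole vector to colnames() only works while the columns stay in the same order. A minimal sketch of renaming by name instead, in base R, with a hypothetical two-row data frame (toy_rename and rename_map are names I made up for the example):

```r
# Hypothetical mini data frame with three of the original column names
toy_rename <- data.frame(
  wage   = c(5.10, 4.95),
  gender = c("female", "female"),
  union  = c("no", "no")
)

# Lookup vector: names are the old column names, values are the new ones
rename_map <- c(wage = "Salary", gender = "Sex")

# Replace only the columns listed in rename_map; the others keep their names
matched <- names(toy_rename) %in% names(rename_map)
names(toy_rename)[matched] <- unname(rename_map[names(toy_rename)[matched]])
print(names(toy_rename))
```

Because the lookup is by name, reordering or dropping columns earlier in the script would not silently mislabel anything.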
head(smoker_subset)
## Salary Education Experience Age Race Region Sex Occupation
## 1 5.10 8 21 35 hispanic other female worker
## 2 4.95 9 42 57 cauc other female worker
## 3 6.67 12 1 19 cauc other male worker
## 4 4.00 12 4 22 cauc other male worker
## 5 7.50 12 17 35 cauc other male worker
## 6 13.07 13 9 28 cauc other male worker
## Sector union Married
## 1 manufacturing no yes
## 2 manufacturing no yes
## 3 manufacturing no no
## 4 other no no
## 5 other no yes
## 6 other yes no
tail(smoker_subset)
## Salary Education Experience Age Race Region Sex Occupation Sector
## 35 9.25 12 19 37 cauc other male worker manufacturing
## 36 10.67 12 36 54 other other male worker other
## 37 7.61 12 20 38 other south male worker construction
## 38 10.00 12 35 53 other other male worker construction
## 39 7.50 12 3 21 cauc other male worker other
## 40 12.20 14 10 30 cauc south male worker manufacturing
## union Married
## 35 no no
## 36 no no
## 37 no yes
## 38 yes yes
## 39 no no
## 40 no yes
summary(smoker_subset)
## Salary Education Experience Age
## Min. : 3.350 Min. : 6.00 Min. : 1.00 Min. :19.00
## 1st Qu.: 4.987 1st Qu.:10.00 1st Qu.: 9.00 1st Qu.:27.75
## Median : 7.400 Median :12.00 Median :19.00 Median :35.00
## Mean : 8.614 Mean :11.28 Mean :20.82 Mean :38.10
## 3rd Qu.:10.168 3rd Qu.:12.00 3rd Qu.:30.00 3rd Qu.:45.25
## Max. :22.200 Max. :17.00 Max. :46.00 Max. :64.00
## Race Region Sex Occupation
## Length:40 Length:40 Length:40 Length:40
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Sector union Married
## Length:40 Length:40 Length:40
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# Print means and medians of the numeric columns in the subset
# (mean() and median() return NA with a warning for character columns)
num_cols2 <- c("Salary", "Education", "Experience", "Age")
means2 <- sapply(smoker_subset[, num_cols2], mean)
medians2 <- sapply(smoker_subset[, num_cols2], median)
mean_mediandf2 <- data.frame(means2, medians2)
mean_mediandf2
## means2 medians2
## Salary 8.6145 7.4
## Education 11.2750 12.0
## Experience 20.8250 19.0
## Age 38.1000 35.0
summary(smoker_subset$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.75 35.00 38.10 45.25 64.00
hist(smoker_subset$Age,main = "Age Histogram", xlab = "Age")
## Observation: Most people who smoke are in their late 20s and early 30s. Next, let's look at their education level, shown in a bar graph.
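To back up a reading of the histogram with numbers, the ages can be grouped into ten-year bands. This is a rough sketch that uses only the 12 ages visible in the head() and tail() output above, not all 40 rows, so the counts are illustrative only. The names visible_ages, age_band and band_counts are my own.

```r
# Ages of the 12 rows printed by head() and tail() above (rows 1-6 and 35-40 only)
visible_ages <- c(35, 57, 19, 22, 35, 28, 37, 54, 38, 53, 21, 30)

# Group into 10-year bands; right = FALSE makes each band include its lower bound
age_band <- cut(visible_ages, breaks = seq(10, 70, by = 10), right = FALSE)

# Count how many of the visible rows fall into each band
band_counts <- table(age_band)
print(band_counts)
```

Swapping visible_ages for smoker_subset$Age would give the same breakdown for the full subset.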
summary(smoker_subset$Education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 10.00 12.00 11.28 12.00 17.00
summary(smoker_subset$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.350 4.987 7.400 8.614 10.168 22.200
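# Count the people at each education level and plot the counts
tn <- table(smoker_subset$Education)
barplot(tn, main = "Smokers by Education Level", xlab = "Years of Education Completed",
col = c("yellow", "orange", "blue", "green", "red"), # the 5 colors recycle across the 11 bars
legend = rownames(tn))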
tn
##
## 6 7 8 9 10 11 12 13 14 16 17
## 1 2 3 3 2 2 22 1 2 1 1
Most smokers have no college education: 22 of the 40 stopped at 12 years of schooling (high school), and another 13 have fewer than 12 years.
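The shares behind that statement can be worked out directly from the table. A small sketch that re-enters the counts printed above by hand (edu_counts and the share_ variables are names I chose; in the real script the table tn could be used instead):

```r
# Counts copied from the education table printed above (years of education -> people)
edu_counts <- c("6" = 1, "7" = 2, "8" = 3, "9" = 3, "10" = 2, "11" = 2,
                "12" = 22, "13" = 1, "14" = 2, "16" = 1, "17" = 1)
years <- as.numeric(names(edu_counts))

# Share with fewer than 12, exactly 12, and more than 12 years of education
total <- sum(edu_counts)
share_below_hs <- sum(edu_counts[years < 12]) / total
share_hs_only  <- sum(edu_counts[years == 12]) / total
share_above_hs <- sum(edu_counts[years > 12]) / total
print(round(c(below = share_below_hs, hs_only = share_hs_only, above = share_above_hs), 3))
```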
Hist <- ggplot(smoker_subset,aes(x=Education)) +
geom_histogram(fill = "gray",bins=10)
Hist
I prefer how this looks as a bar plot.
Hist <- ggplot(smoker_subset,aes(x=Age)) +
geom_histogram(fill = "gray",bins=10)
Hist
Hist <- ggplot(smoker_subset,aes(x=Experience)) +
geom_histogram(fill = "gray",bins=10)
Hist
Most smokers have over 10 years of job experience.
The scatter plots below show how years of experience relate to age and to education. Many of the smokers graduated from high school.
#Scatter Plot
ScatterPlot <- ggplot(smoker_subset,aes(x=Experience, y=Age, color = factor(Age))) + geom_point(size=2.5)
ScatterPlot
Does having more experience make someone more likely to be a smoker?
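One way to put a number on the relationship between experience and age is a correlation coefficient. The sketch below uses only the 12 rows visible in the head() and tail() output above, so it is illustrative rather than a result for the whole subset. It also checks a pattern I noticed in those rows: experience equals age minus education minus 6. Whether that holds for the rest of the data is an assumption I have not confirmed.

```r
# Values copied from the 12 rows printed by head() and tail() above
visible <- data.frame(
  Experience = c(21, 42, 1, 4, 17, 9, 19, 36, 20, 35, 3, 10),
  Age        = c(35, 57, 19, 22, 35, 28, 37, 54, 38, 53, 21, 30),
  Education  = c(8, 9, 12, 12, 12, 13, 12, 12, 12, 12, 12, 14)
)

# Pearson correlation between experience and age for these rows
exp_age_cor <- cor(visible$Experience, visible$Age)
print(exp_age_cor)

# Check whether experience can be derived from age and education in these rows
derived_experience <- visible$Age - visible$Education - 6
identity_holds <- all(visible$Experience == derived_experience)
print(identity_holds)
```

If experience is computed from age in this way, a near-straight line in the scatter plot may reflect how the column was built rather than a separate finding.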
ScatterPlot <- ggplot(smoker_subset,aes(x=Experience, y=Education, color = factor(Education))) + geom_point(size=2.5)
ScatterPlot
## It is fairly consistent that most high school graduates in the mid-1980s were smokers
ScatterPlot <- ggplot(smoker_subset,aes(x=Race, y=Salary, color = factor(Race))) + geom_point(size=2.5)
ScatterPlot
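The assignment also asks for a box plot. Here is a minimal sketch with base R's boxplot(), using only the 12 rows visible in the head() and tail() output above, so the picture is illustrative and not the full subset. The data frame name visible_rows is my own.

```r
# Salary and marital status copied from the 12 rows printed by head() and tail()
visible_rows <- data.frame(
  Salary  = c(5.10, 4.95, 6.67, 4.00, 7.50, 13.07, 9.25, 10.67, 7.61, 10.00, 7.50, 12.20),
  Married = c("yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "yes", "no", "yes")
)

# One box per marital status; boxplot() also returns the plotted statistics
bp <- boxplot(Salary ~ Married, data = visible_rows,
              main = "Salary by Marital Status",
              xlab = "Married", ylab = "Salary")

# Group labels and the number of observations behind each box
print(bp$names)
print(bp$n)
```

With ggplot2, geom_boxplot() would be the counterpart, and passing smoker_subset as the data would cover all 40 rows.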