Final Project

R Markdown

This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html

The presentation approach is up to you, but it should contain the following:

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).
3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R Markdown at the end.
5. BONUS: Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Analysis

I decided to analyze which group of people surveyed in 1985 smoked the most, based on gender, marital status, employment, and so on.

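# Load the packages used below: dplyr provides %>%, ggplot2 provides the plots
library(dplyr)
library(ggplot2)

# Bonus: load the data from my GitHub repository (same file as my local copy)
file <- "https://raw.githubusercontent.com/Eperez54/R-Final-Project/main/CPS1985.csv"

smokerdf <- read.csv(file, header = TRUE, sep = ",")
smokerdf <- smokerdf %>% na.omit() # drop rows that contain NA values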

head(smokerdf)

##      X  wage education experience age ethnicity region gender occupation
## 1    1  5.10         8         21  35  hispanic  other female     worker
## 2 1100  4.95         9         42  57      cauc  other female     worker
## 3    2  6.67        12          1  19      cauc  other   male     worker
## 4    3  4.00        12          4  22      cauc  other   male     worker
## 5    4  7.50        12         17  35      cauc  other   male     worker
## 6    5 13.07        13          9  28      cauc  other   male     worker
##          sector union married
## 1 manufacturing    no     yes
## 2 manufacturing    no     yes
## 3 manufacturing    no      no
## 4         other    no      no
## 5         other    no     yes
## 6         other   yes      no
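Before relying on na.omit(), it can help to see how many values are actually missing and how many rows get dropped. The following is a minimal sketch on a made-up toy data frame (the name `toy` and its values are illustrative assumptions, not part of the real data set):

```r
# Toy data frame standing in for the real one (values are made up for illustration)
toy <- data.frame(
  wage   = c(5.10, NA, 6.67),
  gender = c("female", "female", NA)
)

# Count the missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Compare the row counts before and after na.omit()
rows_before <- nrow(toy)
rows_after  <- nrow(na.omit(toy))
cat("Rows dropped by na.omit():", rows_before - rows_after, "\n")
```

Running the same two checks on smokerdf right after read.csv() would show whether na.omit() removes anything at all.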

Data Exploration

# Print summary
summary(smokerdf)

##        X               wage          education       experience   
##  Min.   :   1.0   Min.   : 1.000   Min.   : 2.00   Min.   : 0.00  
##  1st Qu.: 134.2   1st Qu.: 5.250   1st Qu.:12.00   1st Qu.: 8.00  
##  Median : 267.5   Median : 7.780   Median :12.00   Median :15.00  
##  Mean   : 268.6   Mean   : 9.024   Mean   :13.02   Mean   :17.82  
##  3rd Qu.: 400.8   3rd Qu.:11.250   3rd Qu.:15.00   3rd Qu.:26.00  
##  Max.   :1100.0   Max.   :44.500   Max.   :18.00   Max.   :55.00  
##       age         ethnicity            region             gender         
##  Min.   :18.00   Length:534         Length:534         Length:534        
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :35.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :36.83                                                           
##  3rd Qu.:44.00                                                           
##  Max.   :64.00                                                           
##   occupation           sector             union             married         
##  Length:534         Length:534         Length:534         Length:534        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##

Let’s see the column names

colnames(smokerdf)

##  [1] "X"          "wage"       "education"  "experience" "age"       
##  [6] "ethnicity"  "region"     "gender"     "occupation" "sector"    
## [11] "union"      "married"

# Print means and medians of the numeric columns
# (mean() and median() return NA with a warning on character columns)
numeric_cols <- c("wage", "education", "experience", "age")
means <- sapply(smokerdf[, numeric_cols], mean)
medians <- sapply(smokerdf[, numeric_cols], median)

mean_mediandf <- data.frame(means, medians)
mean_mediandf

##                means medians
## wage        9.024064    7.78
## education  13.018727   12.00
## experience 17.822097   15.00
## age        36.833333   35.00
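The character columns (ethnicity, region, gender, and so on) have no mean or median, so counts are a more natural summary for them. Here is a small sketch using table() and prop.table() on the first six rows of two of those columns, copied from the head() output above; the data frame name `cats` is just an illustration:

```r
# First six rows of two categorical columns, copied from the head() output above
cats <- data.frame(
  gender  = c("female", "female", "male", "male", "male", "male"),
  married = c("yes", "yes", "no", "no", "yes", "no")
)

# Frequency counts for one column
gender_counts <- table(cats$gender)
print(gender_counts)

# Proportions instead of raw counts
print(prop.table(gender_counts))

# Cross-tabulation of two categorical columns
print(table(cats$gender, cats$married))
```

The same calls should work on the full data frame, for example table(smokerdf$gender).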

Create a new data frame with a subset of the main data frame

smoker_subset <- smokerdf[1:40,c("wage", "education", "experience", "age", "ethnicity", "region", "gender",
                             "occupation", "sector", "union", "married")]
head(smoker_subset)

##    wage education experience age ethnicity region gender occupation
## 1  5.10         8         21  35  hispanic  other female     worker
## 2  4.95         9         42  57      cauc  other female     worker
## 3  6.67        12          1  19      cauc  other   male     worker
## 4  4.00        12          4  22      cauc  other   male     worker
## 5  7.50        12         17  35      cauc  other   male     worker
## 6 13.07        13          9  28      cauc  other   male     worker
##          sector union married
## 1 manufacturing    no     yes
## 2 manufacturing    no     yes
## 3 manufacturing    no      no
## 4         other    no      no
## 5         other    no     yes
## 6         other   yes      no

tail(smoker_subset)

##     wage education experience age ethnicity region gender occupation
## 35  9.25        12         19  37      cauc  other   male     worker
## 36 10.67        12         36  54     other  other   male     worker
## 37  7.61        12         20  38     other  south   male     worker
## 38 10.00        12         35  53     other  other   male     worker
## 39  7.50        12          3  21      cauc  other   male     worker
## 40 12.20        14         10  30      cauc  south   male     worker
##           sector union married
## 35 manufacturing    no      no
## 36         other    no      no
## 37  construction    no     yes
## 38  construction   yes     yes
## 39         other    no      no
## 40 manufacturing    no     yes

# Rename some columns in my subset data
colnames(smoker_subset) <- c("Salary", "Education", "Experience", "Age", "Race", "Region", "Sex",
                             "Occupation", "Sector", "union", "Married")
colnames(smoker_subset)

##  [1] "Salary"     "Education"  "Experience" "Age"        "Race"      
##  [6] "Region"     "Sex"        "Occupation" "Sector"     "union"     
## [11] "Married"

head(smoker_subset)

##   Salary Education Experience Age     Race Region    Sex Occupation
## 1   5.10         8         21  35 hispanic  other female     worker
## 2   4.95         9         42  57     cauc  other female     worker
## 3   6.67        12          1  19     cauc  other   male     worker
## 4   4.00        12          4  22     cauc  other   male     worker
## 5   7.50        12         17  35     cauc  other   male     worker
## 6  13.07        13          9  28     cauc  other   male     worker
##          Sector union Married
## 1 manufacturing    no     yes
## 2 manufacturing    no     yes
## 3 manufacturing    no      no
## 4         other    no      no
## 5         other    no     yes
## 6         other   yes      no

tail(smoker_subset)

##    Salary Education Experience Age  Race Region  Sex Occupation        Sector
## 35   9.25        12         19  37  cauc  other male     worker manufacturing
## 36  10.67        12         36  54 other  other male     worker         other
## 37   7.61        12         20  38 other  south male     worker  construction
## 38  10.00        12         35  53 other  other male     worker  construction
## 39   7.50        12          3  21  cauc  other male     worker         other
## 40  12.20        14         10  30  cauc  south male     worker manufacturing
##    union Married
## 35    no      no
## 36    no      no
## 37    no     yes
## 38   yes     yes
## 39    no      no
## 40    no     yes
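Besides renaming columns, the wrangling step could also create derived columns and replace values. The sketch below does both on six rows copied from the renamed subset above. The age break points and the column names `AgeGroup` and `AboveMedian` are arbitrary choices made for illustration:

```r
# Six rows copied from the renamed subset printed above
demo <- data.frame(
  Salary = c(5.10, 4.95, 6.67, 4.00, 7.50, 13.07),
  Age    = c(35, 57, 19, 22, 35, 28)
)

# Derived column: bin Age into groups (the break points are an arbitrary choice)
demo$AgeGroup <- cut(
  demo$Age,
  breaks = c(17, 29, 44, 64),
  labels = c("18-29", "30-44", "45-64")
)

# Derived column: flag rows whose Salary is above the median of these rows
demo$AboveMedian <- ifelse(demo$Salary > median(demo$Salary), "yes", "no")

print(demo)
```

A grouped column such as AgeGroup can then be passed to table() or used as the x variable in a plot.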

I want to see a summary of my new data set

summary(smoker_subset)

##      Salary         Education       Experience         Age       
##  Min.   : 3.350   Min.   : 6.00   Min.   : 1.00   Min.   :19.00  
##  1st Qu.: 4.987   1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:27.75  
##  Median : 7.400   Median :12.00   Median :19.00   Median :35.00  
##  Mean   : 8.614   Mean   :11.28   Mean   :20.82   Mean   :38.10  
##  3rd Qu.:10.168   3rd Qu.:12.00   3rd Qu.:30.00   3rd Qu.:45.25  
##  Max.   :22.200   Max.   :17.00   Max.   :46.00   Max.   :64.00  
##      Race              Region              Sex             Occupation       
##  Length:40          Length:40          Length:40          Length:40         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Sector             union             Married         
##  Length:40          Length:40          Length:40         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##

# Print means and medians of the numeric columns in the subset
numeric_cols2 <- c("Salary", "Education", "Experience", "Age")
means2 <- sapply(smoker_subset[, numeric_cols2], mean)
medians2 <- sapply(smoker_subset[, numeric_cols2], median)

mean_mediandf2 <- data.frame(means2, medians2)
mean_mediandf2

##             means2 medians2
## Salary      8.6145      7.4
## Education  11.2750     12.0
## Experience 20.8250     19.0
## Age        38.1000     35.0

Let’s see some graphs

summary(smoker_subset$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   27.75   35.00   38.10   45.25   64.00

hist(smoker_subset$Age,main = "Age Histogram", xlab = "Age")

Observation

Most people who smoke are in their late 20s and early 30s. Let’s look at their education level in a bar graph.

summary(smoker_subset$Education)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   10.00   12.00   11.28   12.00   17.00

summary(smoker_subset$Salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.350   4.987   7.400   8.614  10.168  22.200

tn <- table(smoker_subset$Education)

barplot(tn, main = "Smokers by Education", xlab = "Highest Education Finished",
        col = c("yellow", "orange", "blue", "green", "red"),
        legend.text = rownames(tn), beside = TRUE)

tn

## 
##  6  7  8  9 10 11 12 13 14 16 17 
##  1  2  3  3  2  2 22  1  2  1  1

Observation

Most smokers have no college education; they finished only high school (22 of the 40 people in the subset have exactly 12 years of education).
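To put numbers on this observation, the counts from the education table above can be turned into shares. This is a rough sketch; treating 12 years of education as "finished high school" is an assumption about how the Education column is coded:

```r
# Education counts copied from the table printed above
years  <- c(6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17)
counts <- c(1, 2, 3, 3, 2, 2, 22, 1, 2, 1, 1)

total <- sum(counts)

# Share of people below, at, and above 12 years of education
share_below_12   <- sum(counts[years < 12]) / total
share_exactly_12 <- sum(counts[years == 12]) / total
share_above_12   <- sum(counts[years > 12]) / total

cat(sprintf("Below 12 years:   %.1f%%\n", 100 * share_below_12))
cat(sprintf("Exactly 12 years: %.1f%%\n", 100 * share_exactly_12))
cat(sprintf("Above 12 years:   %.1f%%\n", 100 * share_above_12))
```

On the real object, prop.table(tn) should give the per-value shares directly.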

Hist <- ggplot(smoker_subset,aes(x=Education)) + 
  geom_histogram(fill = "gray",bins=10)

Hist

I like how it looks in a bar plot.

Hist <- ggplot(smoker_subset,aes(x=Age)) + 
  geom_histogram(fill = "gray",bins=10)

Hist

Hist <- ggplot(smoker_subset,aes(x=Experience)) + 
  geom_histogram(fill = "gray",bins=10)

Hist

Interesting

Most smokers have over 10 years of job experience.

The scatter plots below show how years of experience relate to age and to education. Many of the smokers graduated from high school.

#Scatter Plot 
ScatterPlot <- ggplot(smoker_subset,aes(x=Experience, y=Age, color = factor(Age))) + geom_point(size=2.5)
ScatterPlot

The more experience you have, the more likely you are to be a smoker?
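One thing worth checking before reading too much into the Experience and Age plot is whether Experience is measured separately or derived from the other columns. In the twelve rows printed by head() and tail() above, Experience happens to equal Age minus Education minus 6. The sketch below tests that guess on those rows only; whether it holds for the whole data set is an assumption that would need the same check on smoker_subset or smokerdf:

```r
# Twelve rows copied from the head() and tail() output of the subset above
rows <- data.frame(
  Education  = c(8, 9, 12, 12, 12, 13, 12, 12, 12, 12, 12, 14),
  Experience = c(21, 42, 1, 4, 17, 9, 19, 36, 20, 35, 3, 10),
  Age        = c(35, 57, 19, 22, 35, 28, 37, 54, 38, 53, 21, 30)
)

# Candidate relationship (a guess from reading the rows, not a documented rule)
implied_experience <- rows$Age - rows$Education - 6
matches <- implied_experience == rows$Experience
print(all(matches))

# Linear correlation between Experience and Age on these rows
print(cor(rows$Experience, rows$Age))
```

If the relationship holds throughout, a strong pattern in the Experience and Age scatter plot would say more about how the column was built than about the people surveyed.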

ScatterPlot <- ggplot(smoker_subset,aes(x=Experience, y=Education, color = factor(Education))) + geom_point(size=2.5)
ScatterPlot

It is pretty consistent that most high school graduates in 1985 were smokers.

ScatterPlot <- ggplot(smoker_subset,aes(x=Race, y=Salary, color = factor(Race))) + geom_point(size=2.5)
ScatterPlot
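With a categorical x variable such as Race, a box plot may summarize Salary more clearly than a scatter plot. Here is a minimal sketch using base R's boxplot() on the twelve rows printed by head() and tail() above; the data frame name `demo` is illustrative, and with ggplot2 loaded a geom_boxplot() layer would be the equivalent:

```r
# Twelve rows copied from the head() and tail() output of the subset above
demo <- data.frame(
  Race   = c("hispanic", "cauc", "cauc", "cauc", "cauc", "cauc",
             "cauc", "other", "other", "other", "cauc", "cauc"),
  Salary = c(5.10, 4.95, 6.67, 4.00, 7.50, 13.07,
             9.25, 10.67, 7.61, 10.00, 7.50, 12.20)
)

# One box per Race group; boxplot() also returns the plotted statistics
bp <- boxplot(Salary ~ Race, data = demo,
              main = "Salary by Race", xlab = "Race", ylab = "Salary")

# Group sizes and medians, to read alongside the plot
print(table(demo$Race))
group_medians <- tapply(demo$Salary, demo$Race, median)
print(group_medians)
```

Printing the group sizes matters here because a group with only one or two rows produces a box that says very little.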

Conclusion

While wrangling and exploring the data, I came to the realization that the more experience and years people had in a given job, the more likely they were to be smokers. Maybe the job involved too much pressure, or maybe there was no campaign to inform people of the harm of smoking. It was very alarming that most smokers were high school graduates with no college experience. I would love to verify this information with more current data, to see whether the FDA and the Surgeon General have had any impact on lowering these numbers. Furthermore, most smokers are Caucasian.