Student Details NICKY KIPKORIR BOIT s3646703 OLIVES JERUTO TEKEIWA s3643389
Required packages #library(knitr) #library(plyr) #library(dplyr) #library(mvoutlier) #library(MVN) #library(reshape) #library(ggplot2)
Data & Sources
https://catalog.data.gov/dataset/behavioral-risk-factor-data-tobacco-use-2011-to-present-e0ad1 https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013/resource/0e603f1d-31bf-4809-8f10-a994b305b379
The tobacco dataset is sourced from the U.S. government open data catalog (data.gov). It comes from a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. Tobacco topics included are cigarette smoking status, cigarette smoking prevalence by demographics, cigarette smoking frequency, and quit attempts.
The death rate dataset presents the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia using demographic and medical characteristics. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population. Populations used for computing death rates after 2010 are postcensal estimates based on the 2010 census, estimated as of July 1, 2010. Rates for census years are based on populations enumerated in the corresponding censuses. Rates for non-census years before 2010 are revised using updated intercensal population estimates and may differ from rates previously published. Causes of death classified by the International Classification of Diseases, Tenth Revision (ICD-10) are ranked according to the number of deaths assigned to rankable causes. Cause of death statistics are based on the underlying cause of death.
Reading data into R
library(readr)
death <- read.csv("C:/Users/Nicky Boit/Desktop/Data reprocessing/NCHS_-_Leading_Causes_of_Death__United_States.csv",stringsAsFactors = FALSE)
head(death)
## Year X113.Cause.Name
## 1 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## Cause.Name State Deaths Age.adjusted.Death.Rate
## 1 Unintentional injuries Alabama 2755 55.5
## 2 Unintentional injuries Alaska 439 63.1
## 3 Unintentional injuries Arizona 4010 54.2
## 4 Unintentional injuries Arkansas 1604 51.8
## 5 Unintentional injuries California 13213 32.0
## 6 Unintentional injuries Colorado 2880 51.2
library(readr)
tobacco <- read.csv("C:/Users/Nicky Boit/Desktop/Data reprocessing/Behavioral_Risk_Factor_Data__Tobacco_Use__2011_to_present_.csv",stringsAsFactors = FALSE)
head(tobacco)
## YEAR LocationAbbr State TopicType
## 1 2015-2016 AL Alabama Tobacco Use – Survey Data
## 2 2015-2016 AL Alabama Tobacco Use – Survey Data
## 3 2015-2016 AL Alabama Tobacco Use – Survey Data
## 4 2015-2016 AL Alabama Tobacco Use – Survey Data
## 5 2015-2016 AL Alabama Tobacco Use – Survey Data
## 6 2015-2016 AL Alabama Tobacco Use – Survey Data
## TopicDesc
## 1 Cigarette Use (Adults)
## 2 Cigarette Use (Adults)
## 3 Cigarette Use (Adults)
## 4 Cigarette Use (Adults)
## 5 Cigarette Use (Adults)
## 6 Smokeless Tobacco Use (Adults)
## MeasureDesc DataSource Response
## 1 Current Smoking – (2 yrs – Race/Ethnicity) BRFSS
## 2 Current Smoking – (2 yrs – Race/Ethnicity) BRFSS
## 3 Current Smoking – (2 yrs – Race/Ethnicity) BRFSS
## 4 Current Smoking – (2 yrs – Race/Ethnicity) BRFSS
## 5 Current Smoking – (2 yrs – Race/Ethnicity) BRFSS
## 6 Current Use – (2 yrs – Race/Ethnicity) BRFSS
## Data_Value_Unit Data_Value_Type Data_Value Data_Value_Footnote_Symbol
## 1 % Percentage 19.1
## 2 % Percentage 35.1
## 3 % Percentage 20.9
## 4 % Percentage 17.0
## 5 % Percentage 22.4
## 6 % Percentage 3.1
## Data_Value_Footnote Data_Value_Std_Err Low_Confidence_Limit
## 1 0.9 17.3
## 2 8.1 19.3
## 3 4.4 12.3
## 4 3.6 10.0
## 5 0.6 21.2
## 6 0.4 2.3
## High_Confidence_Limit Sample_Size Gender Race
## 1 20.9 3725 Overall African American
## 2 50.9 80 Overall American Indian/Alaska Native
## 3 29.5 178 Overall Asian/Pacific Islander
## 4 24.0 176 Overall Hispanic
## 5 23.6 9824 Overall White
## 6 3.9 3734 Overall African American
## Age Education GeoLocation TopicTypeId
## 1 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## 2 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## 3 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## 4 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## 5 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## 6 All Ages All Grades (32.84057112200048, -86.63186076199969) BEH
## TopicId MeasureId StratificationID1 StratificationID2 StratificationID3
## 1 100BEH 112CS2 1GEN 8AGE 1RAC
## 2 100BEH 112CS2 1GEN 8AGE 2RAC
## 3 100BEH 112CS2 1GEN 8AGE 3RAC
## 4 100BEH 112CS2 1GEN 8AGE 4RAC
## 5 100BEH 112CS2 1GEN 8AGE 5RAC
## 6 150BEH 177SCR 1GEN 8AGE 1RAC
## StratificationID4 SubMeasureID DisplayOrder
## 1 6EDU BRF30 30
## 2 6EDU BRF31 31
## 3 6EDU BRF32 32
## 4 6EDU BRF33 33
## 5 6EDU BRF34 34
## 6 6EDU BRF73 73
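The en dashes in fields such as TopicType and MeasureDesc can print as stray bytes when the CSV is decoded with the wrong encoding. The following is a minimal sketch, assuming the file is UTF-8 encoded (the source does not confirm this), that passes `fileEncoding` to `read.csv`. The temporary file and its two columns are made up for illustration.

```r
# Hypothetical two-column extract, written as raw UTF-8 bytes.
demo_path <- tempfile(fileext = ".csv")
demo_lines <- enc2utf8(c("TopicType,Data_Value",
                         "Tobacco Use \u2013 Survey Data,19.1"))
writeLines(demo_lines, demo_path, useBytes = TRUE)

# Declare the file's encoding so the en dash is decoded correctly.
demo <- read.csv(demo_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)

# TRUE when the en dash survived the round trip.
has_dash <- grepl("\u2013", demo$TopicType, fixed = TRUE)
print(demo$TopicType)
```

The same argument could be added to the two `read.csv` calls above if the downloaded files turn out to be UTF-8.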
Summary of the datasets
head(summary(tobacco))
## YEAR LocationAbbr State
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## TopicType TopicDesc MeasureDesc
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## DataSource Response Data_Value_Unit
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Data_Value_Type Data_Value Data_Value_Footnote_Symbol
## Length:25161 Min. : 0.00 Length:25161
## Class :character 1st Qu.: 5.50 Class :character
## Mode :character Median :18.70 Mode :character
## Mean :24.15
## 3rd Qu.:36.05
## Max. :97.70
## Data_Value_Footnote Data_Value_Std_Err Low_Confidence_Limit
## Length:25161 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 0.7 1st Qu.: 3.60
## Mode :character Median : 1.2 Median :15.70
## Mean : 1.8 Mean :20.64
## 3rd Qu.: 2.3 3rd Qu.:28.70
## Max. :15.0 Max. :97.00
## High_Confidence_Limit Sample_Size Gender
## Min. : 0.00 Min. : 50 Length:25161
## 1st Qu.: 7.60 1st Qu.: 460 Class :character
## Median : 21.40 Median : 1634 Mode :character
## Mean : 27.66 Mean : 2931
## 3rd Qu.: 44.20 3rd Qu.: 3947
## Max. :100.00 Max. :39952
## Race Age Education
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## GeoLocation TopicTypeId TopicId
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## MeasureId StratificationID1 StratificationID2
## Length:25161 Length:25161 Length:25161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## StratificationID3 StratificationID4 SubMeasureID DisplayOrder
## Length:25161 Length:25161 Length:25161 Min. : 5.00
## Class :character Class :character Class :character 1st Qu.:24.00
## Mode :character Mode :character Mode :character Median :30.00
## Mean :43.91
## 3rd Qu.:70.00
## Max. :79.00
head(summary(death))
## Year X113.Cause.Name Cause.Name State
## Min. :1999 Length:10296 Length:10296 Length:10296
## 1st Qu.:2003 Class :character Class :character Class :character
## Median :2008 Mode :character Mode :character Mode :character
## Mean :2008
## 3rd Qu.:2012
## Max. :2016
## Deaths Age.adjusted.Death.Rate
## Min. : 21 Min. : 2.6
## 1st Qu.: 606 1st Qu.: 19.2
## Median : 1704 Median : 35.8
## Mean : 15327 Mean : 128.0
## 3rd Qu.: 5678 3rd Qu.: 153.0
## Max. :2712630 Max. :1087.3
Structure of datasets
str(death)
## 'data.frame': 10296 obs. of 6 variables:
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ X113.Cause.Name : chr "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" ...
## $ Cause.Name : chr "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" ...
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Deaths : int 2755 439 4010 1604 13213 2880 1978 516 401 12561 ...
## $ Age.adjusted.Death.Rate: num 55.5 63.1 54.2 51.8 32 51.2 50.3 52.4 58.3 54.9 ...
str(tobacco)
## 'data.frame': 25161 obs. of 31 variables:
## $ YEAR : chr "2015-2016" "2015-2016" "2015-2016" "2015-2016" ...
## $ LocationAbbr : chr "AL" "AL" "AL" "AL" ...
## $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ TopicType : chr "Tobacco Use – Survey Data" "Tobacco Use – Survey Data" "Tobacco Use – Survey Data" "Tobacco Use – Survey Data" ...
## $ TopicDesc : chr "Cigarette Use (Adults)" "Cigarette Use (Adults)" "Cigarette Use (Adults)" "Cigarette Use (Adults)" ...
## $ MeasureDesc : chr "Current Smoking – (2 yrs – Race/Ethnicity)" "Current Smoking – (2 yrs – Race/Ethnicity)" "Current Smoking – (2 yrs – Race/Ethnicity)" "Current Smoking – (2 yrs – Race/Ethnicity)" ...
## $ DataSource : chr "BRFSS" "BRFSS" "BRFSS" "BRFSS" ...
## $ Response : chr "" "" "" "" ...
## $ Data_Value_Unit : chr "%" "%" "%" "%" ...
## $ Data_Value_Type : chr "Percentage" "Percentage" "Percentage" "Percentage" ...
## $ Data_Value : num 19.1 35.1 20.9 17 22.4 3.1 10.3 5.7 3.3 7 ...
## $ Data_Value_Footnote_Symbol: chr "" "" "" "" ...
## $ Data_Value_Footnote : chr "" "" "" "" ...
## $ Data_Value_Std_Err : num 0.9 8.1 4.4 3.6 0.6 0.4 5.8 1.8 1.5 0.4 ...
## $ Low_Confidence_Limit : num 17.3 19.3 12.3 10 21.2 2.3 0 2.1 0.4 6.3 ...
## $ High_Confidence_Limit : num 20.9 50.9 29.5 24 23.6 3.9 21.7 9.3 6.2 7.7 ...
## $ Sample_Size : int 3725 80 178 176 9824 3734 81 178 177 9868 ...
## $ Gender : chr "Overall" "Overall" "Overall" "Overall" ...
## $ Race : chr "African American" "American Indian/Alaska Native" "Asian/Pacific Islander" "Hispanic" ...
## $ Age : chr "All Ages" "All Ages" "All Ages" "All Ages" ...
## $ Education : chr "All Grades" "All Grades" "All Grades" "All Grades" ...
## $ GeoLocation : chr "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
## $ TopicTypeId : chr "BEH" "BEH" "BEH" "BEH" ...
## $ TopicId : chr "100BEH" "100BEH" "100BEH" "100BEH" ...
## $ MeasureId : chr "112CS2" "112CS2" "112CS2" "112CS2" ...
## $ StratificationID1 : chr "1GEN" "1GEN" "1GEN" "1GEN" ...
## $ StratificationID2 : chr "8AGE" "8AGE" "8AGE" "8AGE" ...
## $ StratificationID3 : chr "1RAC" "2RAC" "3RAC" "4RAC" ...
## $ StratificationID4 : chr "6EDU" "6EDU" "6EDU" "6EDU" ...
## $ SubMeasureID : chr "BRF30" "BRF31" "BRF32" "BRF33" ...
## $ DisplayOrder : int 30 31 32 33 34 73 74 75 76 77 ...
Executive summary The data preprocessing steps are presented in order from top to bottom. We started by reading the data into R with read.csv and checked the structure of each dataset. We then converted variables from character to factor or from character to numeric, removed unwanted variables, and restructured the factor variables by renaming and reordering their levels. Next, we filtered the data to focus on specific years, renamed variables, and removed NAs. We subset both datasets and joined them. We detected outliers and removed them by exclusion, then plotted a histogram of the death rate; because it needed a transformation, we applied a log transformation. This is a brief description of our data preprocessing.
Converting Cause.Name to a factor variable with 11 levels
death$Cause.Name<-as.factor(death$Cause.Name)
Levels of the factor variable
levels(death$Cause.Name)
## [1] "All causes" "Alzheimer's disease"
## [3] "Cancer" "CLRD"
## [5] "Diabetes" "Heart disease"
## [7] "Influenza and pneumonia" "Kidney disease"
## [9] "Stroke" "Suicide"
## [11] "Unintentional injuries"
Removing unwanted columns from the tobacco dataset
tobacco <- tobacco[ -c(2, 7:10) ]
tobacco$Data_Value_Footnote_Symbol<-NULL
tobacco$Data_Value_Footnote<-NULL
tobacco$Gender<-as.factor(tobacco$Gender)
tobacco$Race<-as.factor(tobacco$Race)
tobacco$Age<-as.factor(tobacco$Age)
tobacco$Education<-as.factor(tobacco$Education)
levels(tobacco$Gender)
## [1] "Female" "Male" "Overall"
levels(tobacco$Race)
## [1] "African American" "All Races"
## [3] "American Indian/Alaska Native" "Asian/Pacific Islander"
## [5] "Hispanic" "White"
levels(tobacco$Age)
## [1] "18 to 24 Years" "18 to 44 Years" "25 to 44 Years"
## [4] "45 to 64 Years" "65 Years and Older" "Age 20 and Older"
## [7] "Age 25 and Older" "All Ages"
levels(tobacco$Education)
## [1] "< 12th Grade" "> 12th Grade" "12th Grade" "All Grades"
Restructuring age factor
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
tobacco<-mutate(tobacco, Age = revalue(Age, c("18 to 24 Years" = "18-24","18 to 44 Years"="25-44","25 to 44 Years"="25-44","45 to 64 Years"="45-64","65 Years and Older"=">65","Age 20 and Older"="18-24","Age 25 and Older"="25-44","All Ages"="<18")))
head(tobacco)
## YEAR State TopicType
## 1 2015-2016 Alabama Tobacco Use – Survey Data
## 2 2015-2016 Alabama Tobacco Use – Survey Data
## 3 2015-2016 Alabama Tobacco Use – Survey Data
## 4 2015-2016 Alabama Tobacco Use – Survey Data
## 5 2015-2016 Alabama Tobacco Use – Survey Data
## 6 2015-2016 Alabama Tobacco Use – Survey Data
## TopicDesc
## 1 Cigarette Use (Adults)
## 2 Cigarette Use (Adults)
## 3 Cigarette Use (Adults)
## 4 Cigarette Use (Adults)
## 5 Cigarette Use (Adults)
## 6 Smokeless Tobacco Use (Adults)
## MeasureDesc Data_Value
## 1 Current Smoking – (2 yrs – Race/Ethnicity) 19.1
## 2 Current Smoking – (2 yrs – Race/Ethnicity) 35.1
## 3 Current Smoking – (2 yrs – Race/Ethnicity) 20.9
## 4 Current Smoking – (2 yrs – Race/Ethnicity) 17.0
## 5 Current Smoking – (2 yrs – Race/Ethnicity) 22.4
## 6 Current Use – (2 yrs – Race/Ethnicity) 3.1
## Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit
## 1 0.9 17.3 20.9
## 2 8.1 19.3 50.9
## 3 4.4 12.3 29.5
## 4 3.6 10.0 24.0
## 5 0.6 21.2 23.6
## 6 0.4 2.3 3.9
## Sample_Size Gender Race Age Education
## 1 3725 Overall African American <18 All Grades
## 2 80 Overall American Indian/Alaska Native <18 All Grades
## 3 178 Overall Asian/Pacific Islander <18 All Grades
## 4 176 Overall Hispanic <18 All Grades
## 5 9824 Overall White <18 All Grades
## 6 3734 Overall African American <18 All Grades
## GeoLocation TopicTypeId TopicId MeasureId
## 1 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 2 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 3 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 4 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 5 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 6 (32.84057112200048, -86.63186076199969) BEH 150BEH 177SCR
## StratificationID1 StratificationID2 StratificationID3 StratificationID4
## 1 1GEN 8AGE 1RAC 6EDU
## 2 1GEN 8AGE 2RAC 6EDU
## 3 1GEN 8AGE 3RAC 6EDU
## 4 1GEN 8AGE 4RAC 6EDU
## 5 1GEN 8AGE 5RAC 6EDU
## 6 1GEN 8AGE 1RAC 6EDU
## SubMeasureID DisplayOrder
## 1 BRF30 30
## 2 BRF31 31
## 3 BRF32 32
## 4 BRF33 33
## 5 BRF34 34
## 6 BRF73 73
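The recoding above merges overlapping age bands and relabels "All Ages" as "<18", so the all-ages rows now read as if they described minors. The following is a sketch of an alternative, assuming the goal is only to shorten the labels while keeping each category distinct. The helper `recode_levels` and the short labels are hypothetical, and base R is used here instead of `plyr::revalue` so the snippet stands alone.

```r
# Hypothetical helper: rename factor levels via a named character vector,
# leaving any level that is not in the map unchanged.
recode_levels <- function(f, map) {
  old <- levels(f)
  new <- ifelse(old %in% names(map), map[old], old)
  levels(f) <- unname(new)
  f
}

# Illustrative short labels (assumed, not from the original report).
age_map <- c("18 to 24 Years"     = "18-24",
             "18 to 44 Years"     = "18-44",
             "25 to 44 Years"     = "25-44",
             "45 to 64 Years"     = "45-64",
             "65 Years and Older" = "65+",
             "Age 20 and Older"   = "20+",
             "Age 25 and Older"   = "25+",
             "All Ages"           = "All ages")

age_demo <- factor(c("18 to 24 Years", "All Ages",
                     "Age 20 and Older", "65 Years and Older"))
age_short <- recode_levels(age_demo, age_map)
print(table(age_short))
```

Keeping "All ages" as its own level makes it possible to filter those aggregate rows out later rather than mixing them with a specific age band.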
table(tobacco$YEAR)
##
## 2011 2011-2012 2012 2012-2013 2013 2013-2014 2014
## 3451 530 3451 530 3451 530 3451
## 2014-2015 2015 2015-2016 2016
## 530 3451 530 5256
Filtering #Recoded the two-year survey periods (e.g. 2011-2012) to their first year so that YEAR holds a single year
tobacco[tobacco$YEAR=="2011-2012","YEAR"] <- 2011
tobacco[tobacco$YEAR=="2012-2013","YEAR"] <- 2012
tobacco[tobacco$YEAR=="2013-2014","YEAR"] <- 2013
tobacco[tobacco$YEAR=="2014-2015","YEAR"] <- 2014
tobacco[tobacco$YEAR=="2015-2016","YEAR"] <- 2015
table(tobacco$YEAR)
##
## 2011 2012 2013 2014 2015 2016
## 3981 3981 3981 3981 3981 5256
tobacco$YEAR<-as.numeric(tobacco$YEAR)
table(tobacco$YEAR)
##
## 2011 2012 2013 2014 2015 2016
## 3981 3981 3981 3981 3981 5256
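# Note: the six assignments below match no rows. YEAR was read as character, so as.numeric() already returns 2011-2016 rather than factor codes, and the table that follows is unchanged.
tobacco[tobacco$YEAR==1,"YEAR"] <- 2011
tobacco[tobacco$YEAR==3,"YEAR"] <- 2012
tobacco[tobacco$YEAR==5,"YEAR"] <- 2013
tobacco[tobacco$YEAR==7,"YEAR"] <- 2014
tobacco[tobacco$YEAR==9,"YEAR"] <- 2015
tobacco[tobacco$YEAR==11,"YEAR"] <- 2016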
table(tobacco$YEAR)
##
## 2011 2012 2013 2014 2015 2016
## 3981 3981 3981 3981 3981 5256
summary(tobacco$YEAR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2011 2012 2014 2014 2015 2016
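The year recoding can also be written as a single vectorised step. This sketch assumes, as in the code above, that a two-year period should be assigned to its first year. The vector `year_raw` is a made-up sample of the values seen in `table(tobacco$YEAR)`.

```r
# Sample of the YEAR strings that occur in the tobacco data.
year_raw <- c("2011", "2011-2012", "2015-2016", "2016")

# Drop everything from the hyphen onward, then convert to numeric.
year_start <- as.numeric(sub("-.*$", "", year_raw))
print(year_start)
```

Applied to the full column, this would replace the five separate assignments and the `as.numeric` call.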
Tidy format by renaming variables
library(reshape)
## Warning: package 'reshape' was built under R version 3.5.1
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
##
## rename, round_any
## The following object is masked from 'package:dplyr':
##
## rename
death <- rename(death, c(Year="YEAR"))
summary(death$YEAR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1999 2003 2008 2008 2012 2016
death <- rename(death, c(Age.adjusted.Death.Rate="AgeAdjDeathRate"))
death <- rename(death, c(Cause.Name="Cause"))
head(death)
## YEAR X113.Cause.Name
## 1 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## Cause State Deaths AgeAdjDeathRate
## 1 Unintentional injuries Alabama 2755 55.5
## 2 Unintentional injuries Alaska 439 63.1
## 3 Unintentional injuries Arizona 4010 54.2
## 4 Unintentional injuries Arkansas 1604 51.8
## 5 Unintentional injuries California 13213 32.0
## 6 Unintentional injuries Colorado 2880 51.2
Removal of missing values
tobacco1 <- na.omit(tobacco)
head(tobacco1)
## YEAR State TopicType TopicDesc
## 1 2015 Alabama Tobacco Use – Survey Data Cigarette Use (Adults)
## 2 2015 Alabama Tobacco Use – Survey Data Cigarette Use (Adults)
## 3 2015 Alabama Tobacco Use – Survey Data Cigarette Use (Adults)
## 4 2015 Alabama Tobacco Use – Survey Data Cigarette Use (Adults)
## 5 2015 Alabama Tobacco Use – Survey Data Cigarette Use (Adults)
## 6 2015 Alabama Tobacco Use – Survey Data Smokeless Tobacco Use (Adults)
## MeasureDesc Data_Value
## 1 Current Smoking – (2 yrs – Race/Ethnicity) 19.1
## 2 Current Smoking – (2 yrs – Race/Ethnicity) 35.1
## 3 Current Smoking – (2 yrs – Race/Ethnicity) 20.9
## 4 Current Smoking – (2 yrs – Race/Ethnicity) 17.0
## 5 Current Smoking – (2 yrs – Race/Ethnicity) 22.4
## 6 Current Use – (2 yrs – Race/Ethnicity) 3.1
## Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit
## 1 0.9 17.3 20.9
## 2 8.1 19.3 50.9
## 3 4.4 12.3 29.5
## 4 3.6 10.0 24.0
## 5 0.6 21.2 23.6
## 6 0.4 2.3 3.9
## Sample_Size Gender Race Age Education
## 1 3725 Overall African American <18 All Grades
## 2 80 Overall American Indian/Alaska Native <18 All Grades
## 3 178 Overall Asian/Pacific Islander <18 All Grades
## 4 176 Overall Hispanic <18 All Grades
## 5 9824 Overall White <18 All Grades
## 6 3734 Overall African American <18 All Grades
## GeoLocation TopicTypeId TopicId MeasureId
## 1 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 2 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 3 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 4 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 5 (32.84057112200048, -86.63186076199969) BEH 100BEH 112CS2
## 6 (32.84057112200048, -86.63186076199969) BEH 150BEH 177SCR
## StratificationID1 StratificationID2 StratificationID3 StratificationID4
## 1 1GEN 8AGE 1RAC 6EDU
## 2 1GEN 8AGE 2RAC 6EDU
## 3 1GEN 8AGE 3RAC 6EDU
## 4 1GEN 8AGE 4RAC 6EDU
## 5 1GEN 8AGE 5RAC 6EDU
## 6 1GEN 8AGE 1RAC 6EDU
## SubMeasureID DisplayOrder
## 1 BRF30 30
## 2 BRF31 31
## 3 BRF32 32
## 4 BRF33 33
## 5 BRF34 34
## 6 BRF73 73
death1 <- na.omit(death)
Subsetting
table(tobacco1$YEAR)
##
## 2011 2012 2013 2014 2015 2016
## 3801 3765 3762 3765 3753 5002
table(death1$YEAR)
##
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 572 572 572 572 572 572 572 572 572 572 572 572 572 572 572
## 2014 2015 2016
## 572 572 572
tobacco2 <- subset(tobacco1, YEAR >= 2015, select=c(YEAR, State, Race, Education, Data_Value, Sample_Size))
death2 <- subset(death1, YEAR >= 2015, select=c(YEAR, State, Cause, Deaths, AgeAdjDeathRate))
head(tobacco2)
## YEAR State Race Education Data_Value
## 1 2015 Alabama African American All Grades 19.1
## 2 2015 Alabama American Indian/Alaska Native All Grades 35.1
## 3 2015 Alabama Asian/Pacific Islander All Grades 20.9
## 4 2015 Alabama Hispanic All Grades 17.0
## 5 2015 Alabama White All Grades 22.4
## 6 2015 Alabama African American All Grades 3.1
## Sample_Size
## 1 3725
## 2 80
## 3 178
## 4 176
## 5 9824
## 6 3734
head(death2)
## YEAR State Cause Deaths AgeAdjDeathRate
## 1 2016 Alabama Unintentional injuries 2755 55.5
## 2 2016 Alaska Unintentional injuries 439 63.1
## 3 2016 Arizona Unintentional injuries 4010 54.2
## 4 2016 Arkansas Unintentional injuries 1604 51.8
## 5 2016 California Unintentional injuries 13213 32.0
## 6 2016 Colorado Unintentional injuries 2880 51.2
Joining the two datasets
total<-tobacco2 %>% left_join(death2, by = "State")
Preview of the joined dataset
head(total)
## YEAR.x State Race Education Data_Value Sample_Size YEAR.y
## 1 2015 Alabama African American All Grades 19.1 3725 2016
## 2 2015 Alabama African American All Grades 19.1 3725 2016
## 3 2015 Alabama African American All Grades 19.1 3725 2015
## 4 2015 Alabama African American All Grades 19.1 3725 2016
## 5 2015 Alabama African American All Grades 19.1 3725 2015
## 6 2015 Alabama African American All Grades 19.1 3725 2016
## Cause Deaths AgeAdjDeathRate
## 1 Unintentional injuries 2755 55.5
## 2 All causes 52466 920.4
## 3 All causes 51909 924.5
## 4 Alzheimer's disease 2507 45.0
## 5 Alzheimer's disease 2282 41.8
## 6 Cancer 10419 174.0
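Because the join key is State alone, every tobacco row is paired with every death row for that state. That is why the preview shows YEAR.x and YEAR.y side by side with mismatched years. The sketch below uses small made-up data frames and base R `merge` (swapped in for `left_join` so the snippet needs no packages) to compare joining on State with joining on State and YEAR. The second tobacco value is illustrative only.

```r
# Toy extracts; names mirror the report, values are partly illustrative.
tobacco_demo <- data.frame(YEAR = c(2015, 2016),
                           State = c("Alabama", "Alabama"),
                           Data_Value = c(19.1, 20.0))
death_demo <- data.frame(YEAR = c(2015, 2016),
                         State = c("Alabama", "Alabama"),
                         Cause = c("All causes", "All causes"),
                         AgeAdjDeathRate = c(924.5, 920.4))

# Key on State only: each tobacco row matches both death rows (2 x 2 = 4).
by_state <- merge(tobacco_demo, death_demo, by = "State", all.x = TRUE)

# Key on State and YEAR: each tobacco row matches only its own year.
by_state_year <- merge(tobacco_demo, death_demo,
                       by = c("State", "YEAR"), all.x = TRUE)

print(nrow(by_state))
print(nrow(by_state_year))
```

With dplyr the equivalent would be `left_join(death2, by = c("State", "YEAR"))`. Whether same-year matching is the intended design is an assumption here.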
Checking whether the variables are factors
is.factor(total$Race)
## [1] TRUE
is.factor(total$Education)
## [1] TRUE
Levels of the Education factor
levels(total$Education)
## [1] "< 12th Grade" "> 12th Grade" "12th Grade" "All Grades"
Mapping values and reordering the levels
library(plyr)
total_1<-mapvalues(total$Education,
from = c("< 12th Grade", "12th Grade" ,"> 12th Grade", "All Grades"),
to = c("1","2","3" ,"4"))
head(total_1)
## [1] 4 4 4 4 4 4
## Levels: 1 3 2 4
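The output `Levels: 1 3 2 4` shows that `mapvalues` renamed the levels but kept their original order, and the result was stored in a separate vector rather than in `total`. Here is a minimal sketch of reordering, assuming the intended order runs from the lowest education level to "All Grades". The vector `education` is a made-up sample.

```r
# Sample values covering the four education levels.
education <- factor(c("All Grades", "< 12th Grade", "> 12th Grade", "12th Grade"))

# Re-create the factor with an explicit level order.
education_ordered <- factor(education,
                            levels = c("< 12th Grade", "12th Grade",
                                       "> 12th Grade", "All Grades"))

# Integer codes now follow the chosen order (1 = "< 12th Grade", ...).
education_code <- as.integer(education_ordered)
print(levels(education_ordered))
print(education_code)
```

Assigning the result back to the Education column of the data frame would make the new order affect later plots and tables.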
Tidy format by renaming variables
total <- rename(total, c(Data_Value="TobacoUserPrct"))
total <- rename(total, c(Sample_Size="Population"))
head(total)
## YEAR.x State Race Education TobacoUserPrct Population
## 1 2015 Alabama African American All Grades 19.1 3725
## 2 2015 Alabama African American All Grades 19.1 3725
## 3 2015 Alabama African American All Grades 19.1 3725
## 4 2015 Alabama African American All Grades 19.1 3725
## 5 2015 Alabama African American All Grades 19.1 3725
## 6 2015 Alabama African American All Grades 19.1 3725
## YEAR.y Cause Deaths AgeAdjDeathRate
## 1 2016 Unintentional injuries 2755 55.5
## 2 2016 All causes 52466 920.4
## 3 2015 All causes 51909 924.5
## 4 2016 Alzheimer's disease 2507 45.0
## 5 2015 Alzheimer's disease 2282 41.8
## 6 2016 Cancer 10419 174.0
Creating a new variable
total<-mutate(total,
population_tobacco = TobacoUserPrct/100 * Population
)
head(total)
## YEAR.x State Race Education TobacoUserPrct Population
## 1 2015 Alabama African American All Grades 19.1 3725
## 2 2015 Alabama African American All Grades 19.1 3725
## 3 2015 Alabama African American All Grades 19.1 3725
## 4 2015 Alabama African American All Grades 19.1 3725
## 5 2015 Alabama African American All Grades 19.1 3725
## 6 2015 Alabama African American All Grades 19.1 3725
## YEAR.y Cause Deaths AgeAdjDeathRate population_tobacco
## 1 2016 Unintentional injuries 2755 55.5 711.475
## 2 2016 All causes 52466 920.4 711.475
## 3 2015 All causes 51909 924.5 711.475
## 4 2016 Alzheimer's disease 2507 45.0 711.475
## 5 2015 Alzheimer's disease 2282 41.8 711.475
## 6 2016 Cancer 10419 174.0 711.475
Removing NA values
head( na.omit(total))
## YEAR.x State Race Education TobacoUserPrct Population
## 1 2015 Alabama African American All Grades 19.1 3725
## 2 2015 Alabama African American All Grades 19.1 3725
## 3 2015 Alabama African American All Grades 19.1 3725
## 4 2015 Alabama African American All Grades 19.1 3725
## 5 2015 Alabama African American All Grades 19.1 3725
## 6 2015 Alabama African American All Grades 19.1 3725
## YEAR.y Cause Deaths AgeAdjDeathRate population_tobacco
## 1 2016 Unintentional injuries 2755 55.5 711.475
## 2 2016 All causes 52466 920.4 711.475
## 3 2015 All causes 51909 924.5 711.475
## 4 2016 Alzheimer's disease 2507 45.0 711.475
## 5 2015 Alzheimer's disease 2282 41.8 711.475
## 6 2016 Cancer 10419 174.0 711.475
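Note that `head(na.omit(total))` only prints a cleaned copy. `total` itself keeps its NA rows, which is consistent with the histogram warning further down about removed non-finite values. This is a small sketch of the difference, using a made-up data frame.

```r
# Toy data with one missing death rate.
demo <- data.frame(State = c("Alabama", "Alaska", "Arizona"),
                   AgeAdjDeathRate = c(55.5, NA, 54.2))

# Printing a cleaned copy does not change the original object.
invisible(head(na.omit(demo)))
rows_before <- nrow(demo)

# Assigning the result keeps the cleaned data for later steps.
demo_clean <- na.omit(demo)
print(c(before = rows_before, after = nrow(demo_clean)))
```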
Detecting Outliers
boxplot(total$AgeAdjDeathRate ~ total$Education, main="death rate by education", ylab = "rate", xlab = "education")
Removing outliers by excluding them
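No exclusion code is shown for this step. One possible sketch, assuming the same 1.5 x IQR rule that `boxplot` uses for its whiskers, relies on `boxplot.stats` from base R. The data frame `rates_demo` is made up from a few of the death rates printed earlier, and this is not necessarily the method the authors applied.

```r
# Toy data: six injury death rates plus one all-causes rate.
rates_demo <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas",
            "California", "Colorado", "Alabama"),
  AgeAdjDeathRate = c(55.5, 63.1, 54.2, 51.8, 32.0, 51.2, 920.4)
)

# Values beyond 1.5 x IQR from the hinges are flagged as outliers.
outlier_values <- boxplot.stats(rates_demo$AgeAdjDeathRate)$out

# Exclude the flagged rows.
rates_kept <- rates_demo[!rates_demo$AgeAdjDeathRate %in% outlier_values, ]
print(outlier_values)
print(nrow(rates_kept))
```

To mirror the grouped boxplot above, the same rule could be applied separately within each education level.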
Transformation #Histogram before transformation
library(ggplot2)
ggplot(data=total, aes(x = AgeAdjDeathRate)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 288 rows containing non-finite values (stat_bin).
Histogram after transformation using logarithm
library(ggplot2)
ln_total <- log(total$AgeAdjDeathRate)
hist(ln_total)
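To check numerically that the log transformation reduces right skew, a simple moment-based skewness can be compared before and after. The sketch below defines a hypothetical helper `skewness` and, as a stand-in for the full column, uses the five summary values of Age.adjusted.Death.Rate reported by `summary(death)` (minimum, quartiles, maximum).

```r
# Hypothetical helper: moment-based skewness, ignoring non-finite values.
skewness <- function(x) {
  x <- x[is.finite(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Min, 1st quartile, median, 3rd quartile and max from summary(death).
rates <- c(2.6, 19.2, 35.8, 153.0, 1087.3)

skew_raw <- skewness(rates)
skew_log <- skewness(log(rates))
print(c(raw = skew_raw, log = skew_log))
```

A value closer to zero after the transformation indicates a more symmetric distribution. Running the helper on `total$AgeAdjDeathRate` itself would give the figure for the real data.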