Student Details

NICKY KIPKORIR BOIT s3646703
OLIVES JERUTO TEKEIWA s3643389

Required packages

#library(knitr)
#library(plyr)
#library(dplyr)
#library(mvoutlier)
#library(MVN)
#library(reshape)
#library(ggplot2)

Data & Sources

https://catalog.data.gov/dataset/behavioral-risk-factor-data-tobacco-use-2011-to-present-e0ad1
https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013/resource/0e603f1d-31bf-4809-8f10-a994b305b379

The tobacco dataset above is sourced from the U.S. government open data catalog (catalog.data.gov). It comes from the Behavioral Risk Factor Surveillance System (BRFSS), a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. The tobacco topics included are cigarette smoking status, cigarette smoking prevalence by demographics, cigarette smoking frequency, and quit attempts.

The death rate dataset presents the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia using demographic and medical characteristics. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population. Populations used for computing death rates after 2010 are postcensal estimates based on the 2010 census, estimated as of July 1, 2010. Rates for census years are based on populations enumerated in the corresponding censuses. Rates for non-census years before 2010 are revised using updated intercensal population estimates and may differ from rates previously published. Causes of death classified by the International Classification of Diseases, Tenth Revision (ICD-10) are ranked according to the number of deaths assigned to rankable causes. Cause of death statistics are based on the underlying cause of death.
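To make the idea of an age-adjusted rate concrete, here is a minimal sketch of direct standardization: each age-specific rate is weighted by that age group's share of a standard population, and the weighted rates are summed. The three age groups, their rates, and the weights below are invented for illustration only; the real 2000 U.S. standard population uses more age groups and different weights.

```r
# Hypothetical age-specific death rates per 100,000 (illustrative values only).
age_rates <- c(young = 50, middle = 400, old = 4000)

# Hypothetical standard-population shares; they must sum to 1.
std_weights <- c(young = 0.5, middle = 0.3, old = 0.2)

# Direct standardization: weighted sum of the age-specific rates.
age_adjusted <- sum(age_rates * std_weights)
print(age_adjusted)
```

Because every state is weighted with the same standard shares, two states with different age structures can be compared on the same footing.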

Reading Data into R

library(readr)
death <- read.csv("C:/Users/Nicky Boit/Desktop/Data reprocessing/NCHS_-_Leading_Causes_of_Death__United_States.csv",stringsAsFactors = FALSE)
head(death)
##   Year                                      X113.Cause.Name
## 1 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
##               Cause.Name      State Deaths Age.adjusted.Death.Rate
## 1 Unintentional injuries    Alabama   2755                    55.5
## 2 Unintentional injuries     Alaska    439                    63.1
## 3 Unintentional injuries    Arizona   4010                    54.2
## 4 Unintentional injuries   Arkansas   1604                    51.8
## 5 Unintentional injuries California  13213                    32.0
## 6 Unintentional injuries   Colorado   2880                    51.2
library(readr)
tobacco <- read.csv("C:/Users/Nicky Boit/Desktop/Data reprocessing/Behavioral_Risk_Factor_Data__Tobacco_Use__2011_to_present_.csv",stringsAsFactors = FALSE)
head(tobacco)
##        YEAR LocationAbbr   State                   TopicType
## 1 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
## 2 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
## 3 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
## 4 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
## 5 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
## 6 2015-2016           AL Alabama Tobacco Use â\200“ Survey Data
##                        TopicDesc
## 1         Cigarette Use (Adults)
## 2         Cigarette Use (Adults)
## 3         Cigarette Use (Adults)
## 4         Cigarette Use (Adults)
## 5         Cigarette Use (Adults)
## 6 Smokeless Tobacco Use (Adults)
##                                      MeasureDesc DataSource Response
## 1 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
## 2 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
## 3 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
## 4 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
## 5 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
## 6     Current Use â\200“ (2 yrs â\200“ Race/Ethnicity)      BRFSS         
##   Data_Value_Unit Data_Value_Type Data_Value Data_Value_Footnote_Symbol
## 1               %      Percentage       19.1                           
## 2               %      Percentage       35.1                           
## 3               %      Percentage       20.9                           
## 4               %      Percentage       17.0                           
## 5               %      Percentage       22.4                           
## 6               %      Percentage        3.1                           
##   Data_Value_Footnote Data_Value_Std_Err Low_Confidence_Limit
## 1                                    0.9                 17.3
## 2                                    8.1                 19.3
## 3                                    4.4                 12.3
## 4                                    3.6                 10.0
## 5                                    0.6                 21.2
## 6                                    0.4                  2.3
##   High_Confidence_Limit Sample_Size  Gender                          Race
## 1                  20.9        3725 Overall              African American
## 2                  50.9          80 Overall American Indian/Alaska Native
## 3                  29.5         178 Overall        Asian/Pacific Islander
## 4                  24.0         176 Overall                      Hispanic
## 5                  23.6        9824 Overall                         White
## 6                   3.9        3734 Overall              African American
##        Age  Education                             GeoLocation TopicTypeId
## 1 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
## 2 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
## 3 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
## 4 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
## 5 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
## 6 All Ages All Grades (32.84057112200048, -86.63186076199969)         BEH
##   TopicId MeasureId StratificationID1 StratificationID2 StratificationID3
## 1  100BEH    112CS2              1GEN              8AGE              1RAC
## 2  100BEH    112CS2              1GEN              8AGE              2RAC
## 3  100BEH    112CS2              1GEN              8AGE              3RAC
## 4  100BEH    112CS2              1GEN              8AGE              4RAC
## 5  100BEH    112CS2              1GEN              8AGE              5RAC
## 6  150BEH    177SCR              1GEN              8AGE              1RAC
##   StratificationID4 SubMeasureID DisplayOrder
## 1              6EDU        BRF30           30
## 2              6EDU        BRF31           31
## 3              6EDU        BRF32           32
## 4              6EDU        BRF33           33
## 5              6EDU        BRF34           34
## 6              6EDU        BRF73           73
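The sequence `â\200“` in the TopicType and MeasureDesc columns above looks like a UTF-8 en dash being read as a single-byte encoding. Assuming the CSV file is UTF-8 encoded (this has not been verified for the actual download), declaring the encoding when reading is one way to avoid it. The sketch below uses a small temporary file rather than the real dataset.

```r
# Build a tiny CSV whose only value contains a UTF-8 en dash (bytes E2 80 93).
tmp <- tempfile(fileext = ".csv")
csv_bytes <- c(
  charToRaw("TopicType\nTobacco Use "),
  as.raw(c(0xE2, 0x80, 0x93)),
  charToRaw(" Survey Data\n")
)
writeBin(csv_bytes, tmp)

# encoding = "UTF-8" marks the strings as UTF-8 instead of the native encoding.
demo <- read.csv(tmp, stringsAsFactors = FALSE, encoding = "UTF-8")
print(demo$TopicType)

unlink(tmp)
```

`read.csv` also accepts `fileEncoding = "UTF-8"`, which re-encodes the input while reading; either argument may be enough depending on the platform.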

Summary of datasets

head(summary(tobacco))
##      YEAR           LocationAbbr          State          
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   TopicType          TopicDesc         MeasureDesc       
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   DataSource          Response         Data_Value_Unit   
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  Data_Value_Type      Data_Value    Data_Value_Footnote_Symbol
##  Length:25161       Min.   : 0.00   Length:25161              
##  Class :character   1st Qu.: 5.50   Class :character          
##  Mode  :character   Median :18.70   Mode  :character          
##                     Mean   :24.15                             
##                     3rd Qu.:36.05                             
##                     Max.   :97.70                             
##  Data_Value_Footnote Data_Value_Std_Err Low_Confidence_Limit
##  Length:25161        Min.   : 0.0       Min.   : 0.00       
##  Class :character    1st Qu.: 0.7       1st Qu.: 3.60       
##  Mode  :character    Median : 1.2       Median :15.70       
##                      Mean   : 1.8       Mean   :20.64       
##                      3rd Qu.: 2.3       3rd Qu.:28.70       
##                      Max.   :15.0       Max.   :97.00       
##  High_Confidence_Limit  Sample_Size       Gender         
##  Min.   :  0.00        Min.   :   50   Length:25161      
##  1st Qu.:  7.60        1st Qu.:  460   Class :character  
##  Median : 21.40        Median : 1634   Mode  :character  
##  Mean   : 27.66        Mean   : 2931                     
##  3rd Qu.: 44.20        3rd Qu.: 3947                     
##  Max.   :100.00        Max.   :39952                     
##      Race               Age             Education        
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  GeoLocation        TopicTypeId          TopicId         
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   MeasureId         StratificationID1  StratificationID2 
##  Length:25161       Length:25161       Length:25161      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  StratificationID3  StratificationID4  SubMeasureID        DisplayOrder  
##  Length:25161       Length:25161       Length:25161       Min.   : 5.00  
##  Class :character   Class :character   Class :character   1st Qu.:24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :30.00  
##                                                           Mean   :43.91  
##                                                           3rd Qu.:70.00  
##                                                           Max.   :79.00
head(summary(death))
##       Year      X113.Cause.Name     Cause.Name           State          
##  Min.   :1999   Length:10296       Length:10296       Length:10296      
##  1st Qu.:2003   Class :character   Class :character   Class :character  
##  Median :2008   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2008                                                           
##  3rd Qu.:2012                                                           
##  Max.   :2016                                                           
##      Deaths        Age.adjusted.Death.Rate
##  Min.   :     21   Min.   :   2.6         
##  1st Qu.:    606   1st Qu.:  19.2         
##  Median :   1704   Median :  35.8         
##  Mean   :  15327   Mean   : 128.0         
##  3rd Qu.:   5678   3rd Qu.: 153.0         
##  Max.   :2712630   Max.   :1087.3

Structure of datasets

str(death)
## 'data.frame':    10296 obs. of  6 variables:
##  $ Year                   : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ X113.Cause.Name        : chr  "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" ...
##  $ Cause.Name             : chr  "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" ...
##  $ State                  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Deaths                 : int  2755 439 4010 1604 13213 2880 1978 516 401 12561 ...
##  $ Age.adjusted.Death.Rate: num  55.5 63.1 54.2 51.8 32 51.2 50.3 52.4 58.3 54.9 ...
str(tobacco)
## 'data.frame':    25161 obs. of  31 variables:
##  $ YEAR                      : chr  "2015-2016" "2015-2016" "2015-2016" "2015-2016" ...
##  $ LocationAbbr              : chr  "AL" "AL" "AL" "AL" ...
##  $ State                     : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ TopicType                 : chr  "Tobacco Use â\200“ Survey Data" "Tobacco Use â\200“ Survey Data" "Tobacco Use â\200“ Survey Data" "Tobacco Use â\200“ Survey Data" ...
##  $ TopicDesc                 : chr  "Cigarette Use (Adults)" "Cigarette Use (Adults)" "Cigarette Use (Adults)" "Cigarette Use (Adults)" ...
##  $ MeasureDesc               : chr  "Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)" "Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)" "Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)" "Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)" ...
##  $ DataSource                : chr  "BRFSS" "BRFSS" "BRFSS" "BRFSS" ...
##  $ Response                  : chr  "" "" "" "" ...
##  $ Data_Value_Unit           : chr  "%" "%" "%" "%" ...
##  $ Data_Value_Type           : chr  "Percentage" "Percentage" "Percentage" "Percentage" ...
##  $ Data_Value                : num  19.1 35.1 20.9 17 22.4 3.1 10.3 5.7 3.3 7 ...
##  $ Data_Value_Footnote_Symbol: chr  "" "" "" "" ...
##  $ Data_Value_Footnote       : chr  "" "" "" "" ...
##  $ Data_Value_Std_Err        : num  0.9 8.1 4.4 3.6 0.6 0.4 5.8 1.8 1.5 0.4 ...
##  $ Low_Confidence_Limit      : num  17.3 19.3 12.3 10 21.2 2.3 0 2.1 0.4 6.3 ...
##  $ High_Confidence_Limit     : num  20.9 50.9 29.5 24 23.6 3.9 21.7 9.3 6.2 7.7 ...
##  $ Sample_Size               : int  3725 80 178 176 9824 3734 81 178 177 9868 ...
##  $ Gender                    : chr  "Overall" "Overall" "Overall" "Overall" ...
##  $ Race                      : chr  "African American" "American Indian/Alaska Native" "Asian/Pacific Islander" "Hispanic" ...
##  $ Age                       : chr  "All Ages" "All Ages" "All Ages" "All Ages" ...
##  $ Education                 : chr  "All Grades" "All Grades" "All Grades" "All Grades" ...
##  $ GeoLocation               : chr  "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
##  $ TopicTypeId               : chr  "BEH" "BEH" "BEH" "BEH" ...
##  $ TopicId                   : chr  "100BEH" "100BEH" "100BEH" "100BEH" ...
##  $ MeasureId                 : chr  "112CS2" "112CS2" "112CS2" "112CS2" ...
##  $ StratificationID1         : chr  "1GEN" "1GEN" "1GEN" "1GEN" ...
##  $ StratificationID2         : chr  "8AGE" "8AGE" "8AGE" "8AGE" ...
##  $ StratificationID3         : chr  "1RAC" "2RAC" "3RAC" "4RAC" ...
##  $ StratificationID4         : chr  "6EDU" "6EDU" "6EDU" "6EDU" ...
##  $ SubMeasureID              : chr  "BRF30" "BRF31" "BRF32" "BRF33" ...
##  $ DisplayOrder              : int  30 31 32 33 34 73 74 75 76 77 ...

Executive summary

The data preprocessing steps are presented in order from top to bottom. We started by reading the data into R using read.csv and then checked the structure of each dataset. Next, we converted variables from character to factor or from character to numeric. We then removed unwanted variables and restructured the factor variables by renaming and reordering their levels. We filtered the data to focus on specific years, renamed variables, and removed NAs. Subsetting came next, followed by joining the two datasets. We detected outliers and removed them by exclusion, then plotted a histogram, which showed that a transformation was needed, so we applied a log transformation. This is a brief description of our data preprocessing.

Converting Cause.Name to a factor variable with 11 levels

death$Cause.Name<-as.factor(death$Cause.Name)

Levels of the factor variable

levels(death$Cause.Name)
##  [1] "All causes"              "Alzheimer's disease"    
##  [3] "Cancer"                  "CLRD"                   
##  [5] "Diabetes"                "Heart disease"          
##  [7] "Influenza and pneumonia" "Kidney disease"         
##  [9] "Stroke"                  "Suicide"                
## [11] "Unintentional injuries"

Remove unwanted columns from the tobacco dataset

tobacco <- tobacco[ -c(2, 7:10) ]
tobacco$Data_Value_Footnote_Symbol<-NULL
tobacco$Data_Value_Footnote<-NULL
tobacco$Gender<-as.factor(tobacco$Gender)
tobacco$Race<-as.factor(tobacco$Race)
tobacco$Age<-as.factor(tobacco$Age)
tobacco$Education<-as.factor(tobacco$Education)
levels(tobacco$Gender)
## [1] "Female"  "Male"    "Overall"
levels(tobacco$Race)
## [1] "African American"              "All Races"                    
## [3] "American Indian/Alaska Native" "Asian/Pacific Islander"       
## [5] "Hispanic"                      "White"
levels(tobacco$Age)
## [1] "18 to 24 Years"     "18 to 44 Years"     "25 to 44 Years"    
## [4] "45 to 64 Years"     "65 Years and Older" "Age 20 and Older"  
## [7] "Age 25 and Older"   "All Ages"
levels(tobacco$Education)
## [1] "< 12th Grade" "> 12th Grade" "12th Grade"   "All Grades"
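The column removal above relies on positions (2 and 7 to 10), which silently breaks if the column order of the source file changes. As a hedged alternative, columns can be dropped by name. The sketch below uses a small made-up data frame; `drop_cols` is a hypothetical helper vector, not part of the original script.

```r
# Toy data frame with a few of the tobacco column names (values are illustrative).
demo <- data.frame(
  YEAR = "2016", LocationAbbr = "AL", State = "Alabama",
  DataSource = "BRFSS", Data_Value = 19.1,
  stringsAsFactors = FALSE
)

# Names of columns to drop; names that are absent from the data are simply ignored.
drop_cols <- c("LocationAbbr", "DataSource", "Response")

# Keep every column whose name is not in drop_cols.
demo_trimmed <- demo[, !(names(demo) %in% drop_cols), drop = FALSE]
print(names(demo_trimmed))
```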

Restructuring age factor

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
tobacco<-mutate(tobacco, Age = revalue(Age, c("18 to 24 Years" = "18-24","18 to 44 Years"="25-44","25 to 44 Years"="25-44","45 to 64 Years"="45-64","65 Years and Older"=">65","Age 20 and Older"="18-24","Age 25 and Older"="25-44","All Ages"="<18")))
head(tobacco)
##        YEAR   State                   TopicType
## 1 2015-2016 Alabama Tobacco Use â\200“ Survey Data
## 2 2015-2016 Alabama Tobacco Use â\200“ Survey Data
## 3 2015-2016 Alabama Tobacco Use â\200“ Survey Data
## 4 2015-2016 Alabama Tobacco Use â\200“ Survey Data
## 5 2015-2016 Alabama Tobacco Use â\200“ Survey Data
## 6 2015-2016 Alabama Tobacco Use â\200“ Survey Data
##                        TopicDesc
## 1         Cigarette Use (Adults)
## 2         Cigarette Use (Adults)
## 3         Cigarette Use (Adults)
## 4         Cigarette Use (Adults)
## 5         Cigarette Use (Adults)
## 6 Smokeless Tobacco Use (Adults)
##                                      MeasureDesc Data_Value
## 1 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       19.1
## 2 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       35.1
## 3 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       20.9
## 4 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       17.0
## 5 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       22.4
## 6     Current Use â\200“ (2 yrs â\200“ Race/Ethnicity)        3.1
##   Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit
## 1                0.9                 17.3                  20.9
## 2                8.1                 19.3                  50.9
## 3                4.4                 12.3                  29.5
## 4                3.6                 10.0                  24.0
## 5                0.6                 21.2                  23.6
## 6                0.4                  2.3                   3.9
##   Sample_Size  Gender                          Race Age  Education
## 1        3725 Overall              African American <18 All Grades
## 2          80 Overall American Indian/Alaska Native <18 All Grades
## 3         178 Overall        Asian/Pacific Islander <18 All Grades
## 4         176 Overall                      Hispanic <18 All Grades
## 5        9824 Overall                         White <18 All Grades
## 6        3734 Overall              African American <18 All Grades
##                               GeoLocation TopicTypeId TopicId MeasureId
## 1 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 2 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 3 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 4 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 5 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 6 (32.84057112200048, -86.63186076199969)         BEH  150BEH    177SCR
##   StratificationID1 StratificationID2 StratificationID3 StratificationID4
## 1              1GEN              8AGE              1RAC              6EDU
## 2              1GEN              8AGE              2RAC              6EDU
## 3              1GEN              8AGE              3RAC              6EDU
## 4              1GEN              8AGE              4RAC              6EDU
## 5              1GEN              8AGE              5RAC              6EDU
## 6              1GEN              8AGE              1RAC              6EDU
##   SubMeasureID DisplayOrder
## 1        BRF30           30
## 2        BRF31           31
## 3        BRF32           32
## 4        BRF33           33
## 5        BRF34           34
## 6        BRF73           73
table(tobacco$YEAR)
## 
##      2011 2011-2012      2012 2012-2013      2013 2013-2014      2014 
##      3451       530      3451       530      3451       530      3451 
## 2014-2015      2015 2015-2016      2016 
##       530      3451       530      5256
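The recoding above merges overlapping brackets and relabels "All Ages" as "<18", so the aggregate rows become indistinguishable from an age group. If the aggregate rows should stay identifiable, one option is a named lookup vector in base R. This is only a sketch of an alternative: the labels "65+" and "All" are assumptions chosen for illustration, not the labels used in this report.

```r
# A few of the original age labels, as they appear in levels(tobacco$Age).
age <- factor(c("18 to 24 Years", "25 to 44 Years", "All Ages", "65 Years and Older"))

# Hypothetical lookup: old label = new label. "All Ages" keeps its own category.
lookup <- c(
  "18 to 24 Years"     = "18-24",
  "25 to 44 Years"     = "25-44",
  "45 to 64 Years"     = "45-64",
  "65 Years and Older" = "65+",
  "All Ages"           = "All"
)

# Translate each value through the lookup and fix the level order explicitly.
recoded <- factor(unname(lookup[as.character(age)]), levels = unique(lookup))
print(table(recoded))
```

A lookup of this kind needs no plyr, which also sidesteps the warning about loading plyr after dplyr.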

Filtering

# Recode survey periods that span two years (e.g. 2011-2012) to their starting year

tobacco[tobacco$YEAR=="2011-2012","YEAR"] <- 2011
tobacco[tobacco$YEAR=="2012-2013","YEAR"] <- 2012
tobacco[tobacco$YEAR=="2013-2014","YEAR"] <- 2013
tobacco[tobacco$YEAR=="2014-2015","YEAR"] <- 2014
tobacco[tobacco$YEAR=="2015-2016","YEAR"] <- 2015
table(tobacco$YEAR)
## 
## 2011 2012 2013 2014 2015 2016 
## 3981 3981 3981 3981 3981 5256
tobacco$YEAR<-as.numeric(tobacco$YEAR)
table(tobacco$YEAR)
## 
## 2011 2012 2013 2014 2015 2016 
## 3981 3981 3981 3981 3981 5256
# No-ops: YEAR is already numeric with values 2011-2016, so none of these conditions match
tobacco[tobacco$YEAR==1,"YEAR"] <- 2011
tobacco[tobacco$YEAR==3,"YEAR"] <- 2012
tobacco[tobacco$YEAR==5,"YEAR"] <- 2013
tobacco[tobacco$YEAR==7,"YEAR"] <- 2014
tobacco[tobacco$YEAR==9,"YEAR"] <- 2015
tobacco[tobacco$YEAR==11,"YEAR"] <- 2016
table(tobacco$YEAR)
## 
## 2011 2012 2013 2014 2015 2016 
## 3981 3981 3981 3981 3981 5256
summary(tobacco$YEAR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2011    2012    2014    2014    2015    2016
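Since every two-year label starts with the year it is mapped to, the recoding can also be written as a single vectorised step that keeps the first four characters. A minimal sketch on a toy vector (`year_raw` is a stand-in for `tobacco$YEAR` before conversion):

```r
# Toy stand-in for the YEAR column while it is still a character vector.
year_raw <- c("2011", "2011-2012", "2015-2016", "2016")

# Keep the first four characters (the starting year) and convert to numeric.
year_num <- as.numeric(substr(year_raw, 1, 4))
print(year_num)
```

This avoids one assignment per label and keeps working if further two-year periods are added to the data.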

Tidy format by renaming variables

library(reshape)
## Warning: package 'reshape' was built under R version 3.5.1
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any
## The following object is masked from 'package:dplyr':
## 
##     rename
death <- rename(death, c(Year="YEAR"))
summary(death$YEAR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1999    2003    2008    2008    2012    2016
death <- rename(death, c(Age.adjusted.Death.Rate="AgeAdjDeathRate"))
death <- rename(death, c(Cause.Name="Cause"))
head(death)
##   YEAR                                      X113.Cause.Name
## 1 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2016 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
##                    Cause      State Deaths AgeAdjDeathRate
## 1 Unintentional injuries    Alabama   2755            55.5
## 2 Unintentional injuries     Alaska    439            63.1
## 3 Unintentional injuries    Arizona   4010            54.2
## 4 Unintentional injuries   Arkansas   1604            51.8
## 5 Unintentional injuries California  13213            32.0
## 6 Unintentional injuries   Colorado   2880            51.2
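Three packages loaded in this report (plyr, dplyr and reshape) each export a `rename` function with a different calling convention, as the masking messages above show. A base R sketch that does not depend on which one is attached last is given below. The data frame is a toy example and `new_names` is a hypothetical lookup vector.

```r
# Toy data frame using the original death column names (values are illustrative).
demo <- data.frame(
  Year = 2016L, State = "Alabama", Cause.Name = "Stroke",
  Age.adjusted.Death.Rate = 1.0,
  stringsAsFactors = FALSE
)

# Lookup of old name = new name, matching the renames performed above.
new_names <- c(
  Year = "YEAR",
  Cause.Name = "Cause",
  Age.adjusted.Death.Rate = "AgeAdjDeathRate"
)

# Rename only the columns that appear in the lookup; others are left untouched.
hit <- names(demo) %in% names(new_names)
names(demo)[hit] <- unname(new_names[names(demo)[hit]])
print(names(demo))
```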

Removal of missing values

tobacco1 <- na.omit(tobacco)
head(tobacco1)
##   YEAR   State                   TopicType                      TopicDesc
## 1 2015 Alabama Tobacco Use â\200“ Survey Data         Cigarette Use (Adults)
## 2 2015 Alabama Tobacco Use â\200“ Survey Data         Cigarette Use (Adults)
## 3 2015 Alabama Tobacco Use â\200“ Survey Data         Cigarette Use (Adults)
## 4 2015 Alabama Tobacco Use â\200“ Survey Data         Cigarette Use (Adults)
## 5 2015 Alabama Tobacco Use â\200“ Survey Data         Cigarette Use (Adults)
## 6 2015 Alabama Tobacco Use â\200“ Survey Data Smokeless Tobacco Use (Adults)
##                                      MeasureDesc Data_Value
## 1 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       19.1
## 2 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       35.1
## 3 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       20.9
## 4 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       17.0
## 5 Current Smoking â\200“ (2 yrs â\200“ Race/Ethnicity)       22.4
## 6     Current Use â\200“ (2 yrs â\200“ Race/Ethnicity)        3.1
##   Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit
## 1                0.9                 17.3                  20.9
## 2                8.1                 19.3                  50.9
## 3                4.4                 12.3                  29.5
## 4                3.6                 10.0                  24.0
## 5                0.6                 21.2                  23.6
## 6                0.4                  2.3                   3.9
##   Sample_Size  Gender                          Race Age  Education
## 1        3725 Overall              African American <18 All Grades
## 2          80 Overall American Indian/Alaska Native <18 All Grades
## 3         178 Overall        Asian/Pacific Islander <18 All Grades
## 4         176 Overall                      Hispanic <18 All Grades
## 5        9824 Overall                         White <18 All Grades
## 6        3734 Overall              African American <18 All Grades
##                               GeoLocation TopicTypeId TopicId MeasureId
## 1 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 2 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 3 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 4 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 5 (32.84057112200048, -86.63186076199969)         BEH  100BEH    112CS2
## 6 (32.84057112200048, -86.63186076199969)         BEH  150BEH    177SCR
##   StratificationID1 StratificationID2 StratificationID3 StratificationID4
## 1              1GEN              8AGE              1RAC              6EDU
## 2              1GEN              8AGE              2RAC              6EDU
## 3              1GEN              8AGE              3RAC              6EDU
## 4              1GEN              8AGE              4RAC              6EDU
## 5              1GEN              8AGE              5RAC              6EDU
## 6              1GEN              8AGE              1RAC              6EDU
##   SubMeasureID DisplayOrder
## 1        BRF30           30
## 2        BRF31           31
## 3        BRF32           32
## 4        BRF33           33
## 5        BRF34           34
## 6        BRF73           73
death1 <- na.omit(death)
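`na.omit` drops every row that has an NA in any column, so it is worth checking which columns hold the missing values and how many rows are lost. A minimal sketch on a made-up data frame (the values are illustrative, not taken from the datasets):

```r
# Toy data frame with missing values in two different columns.
demo <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  Data_Value = c(19.1, NA, 20.9, 17.0),
  Sample_Size = c(3725L, 80L, NA, 176L),
  stringsAsFactors = FALSE
)

# Count missing values per column before removing anything.
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Remove incomplete rows and report how many were dropped.
cleaned <- na.omit(demo)
rows_dropped <- nrow(demo) - nrow(cleaned)
print(rows_dropped)
```

Applied to the real data, the same check would show which columns account for the difference between the yearly counts before and after `na.omit`.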

Subsetting the data by year and selecting variables

table(tobacco1$YEAR)
## 
## 2011 2012 2013 2014 2015 2016 
## 3801 3765 3762 3765 3753 5002
table(death1$YEAR)
## 
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
##  572  572  572  572  572  572  572  572  572  572  572  572  572  572  572 
## 2014 2015 2016 
##  572  572  572
tobacco2 <- subset(tobacco1, YEAR >= 2015, select=c(YEAR, State, Race, Education, Data_Value, Sample_Size))
death2 <- subset(death1, YEAR >= 2015, select=c(YEAR, State, Cause, Deaths, AgeAdjDeathRate))
head(tobacco2)
##   YEAR   State                          Race  Education Data_Value
## 1 2015 Alabama              African American All Grades       19.1
## 2 2015 Alabama American Indian/Alaska Native All Grades       35.1
## 3 2015 Alabama        Asian/Pacific Islander All Grades       20.9
## 4 2015 Alabama                      Hispanic All Grades       17.0
## 5 2015 Alabama                         White All Grades       22.4
## 6 2015 Alabama              African American All Grades        3.1
##   Sample_Size
## 1        3725
## 2          80
## 3         178
## 4         176
## 5        9824
## 6        3734
head(death2)
##   YEAR      State                  Cause Deaths AgeAdjDeathRate
## 1 2016    Alabama Unintentional injuries   2755            55.5
## 2 2016     Alaska Unintentional injuries    439            63.1
## 3 2016    Arizona Unintentional injuries   4010            54.2
## 4 2016   Arkansas Unintentional injuries   1604            51.8
## 5 2016 California Unintentional injuries  13213            32.0
## 6 2016   Colorado Unintentional injuries   2880            51.2

Joining the two datasets

total<-tobacco2 %>% left_join(death2, by = "State")

First rows of the joined dataset

head(total)
##   YEAR.x   State             Race  Education Data_Value Sample_Size YEAR.y
## 1   2015 Alabama African American All Grades       19.1        3725   2016
## 2   2015 Alabama African American All Grades       19.1        3725   2016
## 3   2015 Alabama African American All Grades       19.1        3725   2015
## 4   2015 Alabama African American All Grades       19.1        3725   2016
## 5   2015 Alabama African American All Grades       19.1        3725   2015
## 6   2015 Alabama African American All Grades       19.1        3725   2016
##                    Cause Deaths AgeAdjDeathRate
## 1 Unintentional injuries   2755            55.5
## 2             All causes  52466           920.4
## 3             All causes  51909           924.5
## 4    Alzheimer's disease   2507            45.0
## 5    Alzheimer's disease   2282            41.8
## 6                 Cancer  10419           174.0
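In the output above, a single tobacco row is repeated once for every cause and year in death2, because the join key is State alone; this is also why YEAR.x and YEAR.y can disagree. A minimal sketch of the difference, assuming the intent is to pair each survey row with death rates from the same state and year. The data frames tob_demo and death_demo are hypothetical stand-ins, and some of their values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy stand-ins for tobacco2 and death2 (values are illustrative only)
tob_demo <- data.frame(
  YEAR = c(2015, 2016),
  State = c("Alabama", "Alabama"),
  Data_Value = c(19.1, 18.5),
  stringsAsFactors = FALSE
)
death_demo <- data.frame(
  YEAR = c(2015, 2016, 2015, 2016),
  State = rep("Alabama", 4),
  Cause = c("Cancer", "Cancer", "All causes", "All causes"),
  AgeAdjDeathRate = c(175.0, 174.0, 924.5, 920.4),
  stringsAsFactors = FALSE
)

# Key on State only: every tobacco row matches every death row for that state
by_state <- left_join(tob_demo, death_demo, by = "State")

# Key on State and YEAR: each tobacco row matches only same-year death rows
by_state_year <- left_join(tob_demo, death_demo, by = c("State", "YEAR"))

nrow(by_state)
nrow(by_state_year)
names(by_state_year)
```

With both keys there is a single YEAR column, so no YEAR.x and YEAR.y suffixes appear. Recent dplyr versions may also warn about a many-to-many relationship for the State-only join.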

Checking whether the variables are factors

is.factor(total$Race)
## [1] TRUE
is.factor(total$Education)
## [1] TRUE

Levels of a factor variable

levels(total$Education)
## [1] "< 12th Grade" "> 12th Grade" "12th Grade"   "All Grades"

Mapping values and reordering the levels

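library(plyr)
# Recode the education labels as "1" to "4"; mapvalues() relabels the levels
# but keeps their existing order
total_1 <- mapvalues(total$Education,
          from = c("< 12th Grade", "12th Grade", "> 12th Grade", "All Grades"),
          to = c("1", "2", "3", "4"))
head(total_1)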
## [1] 4 4 4 4 4 4
## Levels: 1 3 2 4
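The levels printed above are still in their original order (1 3 2 4), so the values are mapped but not reordered. A minimal sketch of one way to set the order explicitly with base R's factor(), assuming the intended order runs from the lowest education level to the highest. The vector edu is a hypothetical stand-in for total$Education, and treating "All Grades" as the last level is an assumption, since it is an aggregate rather than a true rank.

```r
# Hypothetical toy vector standing in for total$Education
edu <- factor(c("All Grades", "< 12th Grade", "> 12th Grade", "12th Grade"))

# Pass the levels in the desired order; ordered = TRUE makes the factor ordinal
edu_ordered <- factor(
  edu,
  levels = c("< 12th Grade", "12th Grade", "> 12th Grade", "All Grades"),
  ordered = TRUE
)

levels(edu_ordered)
as.integer(edu_ordered)  # integer codes now follow the stated order
```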

Renaming variables for a tidy format

# plyr::rename() takes a named vector of the form c(old_name = "new_name")
total <- plyr::rename(total, c(Data_Value = "TobacoUserPrct", Sample_Size = "Population"))
head(total)
##   YEAR.x   State             Race  Education TobacoUserPrct Population
## 1   2015 Alabama African American All Grades           19.1       3725
## 2   2015 Alabama African American All Grades           19.1       3725
## 3   2015 Alabama African American All Grades           19.1       3725
## 4   2015 Alabama African American All Grades           19.1       3725
## 5   2015 Alabama African American All Grades           19.1       3725
## 6   2015 Alabama African American All Grades           19.1       3725
##   YEAR.y                  Cause Deaths AgeAdjDeathRate
## 1   2016 Unintentional injuries   2755            55.5
## 2   2016             All causes  52466           920.4
## 3   2015             All causes  51909           924.5
## 4   2016    Alzheimer's disease   2507            45.0
## 5   2015    Alzheimer's disease   2282            41.8
## 6   2016                 Cancer  10419           174.0

Creating a new variable

# Estimated number of tobacco users in each survey sample
total <- mutate(total, population_tobacco = TobacoUserPrct / 100 * Population)
head(total)
##   YEAR.x   State             Race  Education TobacoUserPrct Population
## 1   2015 Alabama African American All Grades           19.1       3725
## 2   2015 Alabama African American All Grades           19.1       3725
## 3   2015 Alabama African American All Grades           19.1       3725
## 4   2015 Alabama African American All Grades           19.1       3725
## 5   2015 Alabama African American All Grades           19.1       3725
## 6   2015 Alabama African American All Grades           19.1       3725
##   YEAR.y                  Cause Deaths AgeAdjDeathRate population_tobacco
## 1   2016 Unintentional injuries   2755            55.5            711.475
## 2   2016             All causes  52466           920.4            711.475
## 3   2015             All causes  51909           924.5            711.475
## 4   2016    Alzheimer's disease   2507            45.0            711.475
## 5   2015    Alzheimer's disease   2282            41.8            711.475
## 6   2016                 Cancer  10419           174.0            711.475

Removing NA values

head(na.omit(total))
##   YEAR.x   State             Race  Education TobacoUserPrct Population
## 1   2015 Alabama African American All Grades           19.1       3725
## 2   2015 Alabama African American All Grades           19.1       3725
## 3   2015 Alabama African American All Grades           19.1       3725
## 4   2015 Alabama African American All Grades           19.1       3725
## 5   2015 Alabama African American All Grades           19.1       3725
## 6   2015 Alabama African American All Grades           19.1       3725
##   YEAR.y                  Cause Deaths AgeAdjDeathRate population_tobacco
## 1   2016 Unintentional injuries   2755            55.5            711.475
## 2   2016             All causes  52466           920.4            711.475
## 3   2015             All causes  51909           924.5            711.475
## 4   2016    Alzheimer's disease   2507            45.0            711.475
## 5   2015    Alzheimer's disease   2282            41.8            711.475
## 6   2016                 Cancer  10419           174.0            711.475
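The call above prints the first complete rows but does not assign the result, so total itself still contains its NA values. A minimal sketch of counting the missing values and keeping the filtered result, using a hypothetical toy data frame named demo in place of total:

```r
# Hypothetical toy data frame standing in for total
demo <- data.frame(
  State = c("Alabama", "Alaska", "Arizona"),
  AgeAdjDeathRate = c(55.5, NA, 54.2)
)

# Count missing values per column before dropping anything
colSums(is.na(demo))

# Assign the result; na.omit() returns a copy and leaves demo untouched
demo_complete <- na.omit(demo)
nrow(demo_complete)
```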

Detecting Outliers

boxplot(total$AgeAdjDeathRate ~ total$Education, main="death rate by education", ylab = "rate", xlab = "education")

Removing outliers by excluding them

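# Keep rows with a rate of at most 300 (rows with NA rates are kept here),
# subsetting rows so that total remains a data frame
total <- total[is.na(total$AgeAdjDeathRate) | total$AgeAdjDeathRate <= 300, ]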
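The cutoff of 300 is a fixed value. Note that the "All causes" rows shown earlier have rates above 900, so a fixed cutoff removes that whole category rather than isolated anomalies. As an alternative, here is a minimal sketch of the 1.5 × IQR rule that the boxplot whiskers are based on. The vector rates is a hypothetical toy sample, not taken from the dataset.

```r
# Hypothetical toy rates for illustration
rates <- c(41.8, 45.0, 51.2, 54.2, 55.5, 920.4)

# Quartiles and the 1.5 * IQR fence distance
q <- quantile(rates, c(0.25, 0.75))
fence <- 1.5 * IQR(rates)

# Flag values outside [Q1 - fence, Q3 + fence]
is_outlier <- rates < q[1] - fence | rates > q[2] + fence

rates[is_outlier]   # flagged values
rates[!is_outlier]  # values kept
```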

Transformation

Histogram before transformation

library(ggplot2)
ggplot(data = total, aes(x = AgeAdjDeathRate)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 288 rows containing non-finite values (stat_bin).

Histogram after transformation using logarithm

library(ggplot2)
ln_total <- log(total$AgeAdjDeathRate)
hist(ln_total)
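The warning above reports rows with non-finite values being dropped silently. A minimal sketch of filtering those rows explicitly before taking the logarithm, then plotting with ggplot2 so both histograms use the same library. The data frame demo and the column ln_rate are hypothetical names, and the bin count is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data frame standing in for total
demo <- data.frame(AgeAdjDeathRate = c(41.8, 45.0, 55.5, 174.0, NA))

# Keep finite, strictly positive rates so log() is defined for every row
demo_finite <- subset(demo, is.finite(AgeAdjDeathRate) & AgeAdjDeathRate > 0)
demo_finite$ln_rate <- log(demo_finite$AgeAdjDeathRate)

# Histogram of the log-transformed rates with an explicit bin count
p <- ggplot(demo_finite, aes(x = ln_rate)) +
  geom_histogram(bins = 10)
print(p)
```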