# import neccessary libraries

library("tidyr")
library("stringr")
library("dplyr")
library("ggplot2")

Introduction

The goal of this assignment is to practice in preparing different datasets for downstream analysis work by using three “Wide” datasets. The data transformation and analysis tasks that are required for this assignment are;

Creation of a .CSV file which includes all the information of the wide dataset.
Read the information from the .CSV file into R, use tidyr and dplyr as needed to tidy and transform the data.
Perform the analysis.

Child Mortality

The first wide dataset we will transform and analyze is “Child Mortality, Estimates for under-five, infant and neonatral mortality” for each country based on the research of United Nations Inter-agency Group for Child Mortality Estimation. The dataset can be found at https://childmortality.org/. The insight that we want to get from this data analysis is to see the death count on children that are under 5, infant and neonatal in the United States between 1950 and 2015.

Data Collection

# read .csv into R

child_mortality <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/RatesDeaths_AllIndicators.csv", 
    skip = 6)
head(child_mortality, n = 5)

     ISO.Code CountryName Uncertainty.bounds. U5MR.1950 U5MR.1951
     U5MR.1952 U5MR.1953 U5MR.1954 U5MR.1955 U5MR.1956 U5MR.1957 U5MR.1958
     U5MR.1959 U5MR.1960 U5MR.1961 U5MR.1962 U5MR.1963 U5MR.1964 U5MR.1965
     U5MR.1966 U5MR.1967 U5MR.1968 U5MR.1969 U5MR.1970 U5MR.1971 U5MR.1972
     U5MR.1973 U5MR.1974 U5MR.1975 U5MR.1976 U5MR.1977 U5MR.1978 U5MR.1979
     U5MR.1980 U5MR.1981 U5MR.1982 U5MR.1983 U5MR.1984 U5MR.1985 U5MR.1986
     U5MR.1987 U5MR.1988 U5MR.1989 U5MR.1990 U5MR.1991 U5MR.1992 U5MR.1993
     U5MR.1994 U5MR.1995 U5MR.1996 U5MR.1997 U5MR.1998 U5MR.1999 U5MR.2000
     U5MR.2001 U5MR.2002 U5MR.2003 U5MR.2004 U5MR.2005 U5MR.2006 U5MR.2007
     U5MR.2008 U5MR.2009 U5MR.2010 U5MR.2011 U5MR.2012 U5MR.2013 U5MR.2014
     U5MR.2015 IMR.1950 IMR.1951 IMR.1952 IMR.1953 IMR.1954 IMR.1955
     IMR.1956 IMR.1957 IMR.1958 IMR.1959 IMR.1960 IMR.1961 IMR.1962
     IMR.1963 IMR.1964 IMR.1965 IMR.1966 IMR.1967 IMR.1968 IMR.1969
     IMR.1970 IMR.1971 IMR.1972 IMR.1973 IMR.1974 IMR.1975 IMR.1976
     IMR.1977 IMR.1978 IMR.1979 IMR.1980 IMR.1981 IMR.1982 IMR.1983
     IMR.1984 IMR.1985 IMR.1986 IMR.1987 IMR.1988 IMR.1989 IMR.1990
     IMR.1991 IMR.1992 IMR.1993 IMR.1994 IMR.1995 IMR.1996 IMR.1997
     IMR.1998 IMR.1999 IMR.2000 IMR.2001 IMR.2002 IMR.2003 IMR.2004
     IMR.2005 IMR.2006 IMR.2007 IMR.2008 IMR.2009 IMR.2010 IMR.2011
     IMR.2012 IMR.2013 IMR.2014 IMR.2015 NMR.1950 NMR.1951 NMR.1952
     NMR.1953 NMR.1954 NMR.1955 NMR.1956 NMR.1957 NMR.1958 NMR.1959
     NMR.1960 NMR.1961 NMR.1962 NMR.1963 NMR.1964 NMR.1965 NMR.1966
     NMR.1967 NMR.1968 NMR.1969 NMR.1970 NMR.1971 NMR.1972 NMR.1973
     NMR.1974 NMR.1975 NMR.1976 NMR.1977 NMR.1978 NMR.1979 NMR.1980
     NMR.1981 NMR.1982 NMR.1983 NMR.1984 NMR.1985 NMR.1986 NMR.1987
     NMR.1988 NMR.1989 NMR.1990 NMR.1991 NMR.1992 NMR.1993 NMR.1994
     NMR.1995 NMR.1996 NMR.1997 NMR.1998 NMR.1999 NMR.2000 NMR.2001
     NMR.2002 NMR.2003 NMR.2004 NMR.2005 NMR.2006 NMR.2007 NMR.2008
     NMR.2009 NMR.2010 NMR.2011 NMR.2012 NMR.2013 NMR.2014 NMR.2015
     Under.five.Deaths.1950 Under.five.Deaths.1951 Under.five.Deaths.1952
     Under.five.Deaths.1953 Under.five.Deaths.1954 Under.five.Deaths.1955
     Under.five.Deaths.1956 Under.five.Deaths.1957 Under.five.Deaths.1958
     Under.five.Deaths.1959 Under.five.Deaths.1960 Under.five.Deaths.1961
     Under.five.Deaths.1962 Under.five.Deaths.1963 Under.five.Deaths.1964
     Under.five.Deaths.1965 Under.five.Deaths.1966 Under.five.Deaths.1967
     Under.five.Deaths.1968 Under.five.Deaths.1969 Under.five.Deaths.1970
     Under.five.Deaths.1971 Under.five.Deaths.1972 Under.five.Deaths.1973
     Under.five.Deaths.1974 Under.five.Deaths.1975 Under.five.Deaths.1976
     Under.five.Deaths.1977 Under.five.Deaths.1978 Under.five.Deaths.1979
     Under.five.Deaths.1980 Under.five.Deaths.1981 Under.five.Deaths.1982
     Under.five.Deaths.1983 Under.five.Deaths.1984 Under.five.Deaths.1985
     Under.five.Deaths.1986 Under.five.Deaths.1987 Under.five.Deaths.1988
     Under.five.Deaths.1989 Under.five.Deaths.1990 Under.five.Deaths.1991
     Under.five.Deaths.1992 Under.five.Deaths.1993 Under.five.Deaths.1994
     Under.five.Deaths.1995 Under.five.Deaths.1996 Under.five.Deaths.1997
     Under.five.Deaths.1998 Under.five.Deaths.1999 Under.five.Deaths.2000
     Under.five.Deaths.2001 Under.five.Deaths.2002 Under.five.Deaths.2003
     Under.five.Deaths.2004 Under.five.Deaths.2005 Under.five.Deaths.2006
     Under.five.Deaths.2007 Under.five.Deaths.2008 Under.five.Deaths.2009
     Under.five.Deaths.2010 Under.five.Deaths.2011 Under.five.Deaths.2012
     Under.five.Deaths.2013 Under.five.Deaths.2014 Under.five.Deaths.2015
     Infant.Deaths.1950 Infant.Deaths.1951 Infant.Deaths.1952
     Infant.Deaths.1953 Infant.Deaths.1954 Infant.Deaths.1955
     Infant.Deaths.1956 Infant.Deaths.1957 Infant.Deaths.1958
     Infant.Deaths.1959 Infant.Deaths.1960 Infant.Deaths.1961
     Infant.Deaths.1962 Infant.Deaths.1963 Infant.Deaths.1964
     Infant.Deaths.1965 Infant.Deaths.1966 Infant.Deaths.1967
     Infant.Deaths.1968 Infant.Deaths.1969 Infant.Deaths.1970
     Infant.Deaths.1971 Infant.Deaths.1972 Infant.Deaths.1973
     Infant.Deaths.1974 Infant.Deaths.1975 Infant.Deaths.1976
     Infant.Deaths.1977 Infant.Deaths.1978 Infant.Deaths.1979
     Infant.Deaths.1980 Infant.Deaths.1981 Infant.Deaths.1982
     Infant.Deaths.1983 Infant.Deaths.1984 Infant.Deaths.1985
     Infant.Deaths.1986 Infant.Deaths.1987 Infant.Deaths.1988
     Infant.Deaths.1989 Infant.Deaths.1990 Infant.Deaths.1991
     Infant.Deaths.1992 Infant.Deaths.1993 Infant.Deaths.1994
     Infant.Deaths.1995 Infant.Deaths.1996 Infant.Deaths.1997
     Infant.Deaths.1998 Infant.Deaths.1999 Infant.Deaths.2000
     Infant.Deaths.2001 Infant.Deaths.2002 Infant.Deaths.2003
     Infant.Deaths.2004 Infant.Deaths.2005 Infant.Deaths.2006
     Infant.Deaths.2007 Infant.Deaths.2008 Infant.Deaths.2009
     Infant.Deaths.2010 Infant.Deaths.2011 Infant.Deaths.2012
     Infant.Deaths.2013 Infant.Deaths.2014 Infant.Deaths.2015
     Neonatal.Deaths.1950 Neonatal.Deaths.1951 Neonatal.Deaths.1952
     Neonatal.Deaths.1953 Neonatal.Deaths.1954 Neonatal.Deaths.1955
     Neonatal.Deaths.1956 Neonatal.Deaths.1957 Neonatal.Deaths.1958
     Neonatal.Deaths.1959 Neonatal.Deaths.1960 Neonatal.Deaths.1961
     Neonatal.Deaths.1962 Neonatal.Deaths.1963 Neonatal.Deaths.1964
     Neonatal.Deaths.1965 Neonatal.Deaths.1966 Neonatal.Deaths.1967
     Neonatal.Deaths.1968 Neonatal.Deaths.1969 Neonatal.Deaths.1970
     Neonatal.Deaths.1971 Neonatal.Deaths.1972 Neonatal.Deaths.1973
     Neonatal.Deaths.1974 Neonatal.Deaths.1975 Neonatal.Deaths.1976
     Neonatal.Deaths.1977 Neonatal.Deaths.1978 Neonatal.Deaths.1979
     Neonatal.Deaths.1980 Neonatal.Deaths.1981 Neonatal.Deaths.1982
     Neonatal.Deaths.1983 Neonatal.Deaths.1984 Neonatal.Deaths.1985
     Neonatal.Deaths.1986 Neonatal.Deaths.1987 Neonatal.Deaths.1988
     Neonatal.Deaths.1989 Neonatal.Deaths.1990 Neonatal.Deaths.1991
     Neonatal.Deaths.1992 Neonatal.Deaths.1993 Neonatal.Deaths.1994
     Neonatal.Deaths.1995 Neonatal.Deaths.1996 Neonatal.Deaths.1997
     Neonatal.Deaths.1998 Neonatal.Deaths.1999 Neonatal.Deaths.2000
     Neonatal.Deaths.2001 Neonatal.Deaths.2002 Neonatal.Deaths.2003
     Neonatal.Deaths.2004 Neonatal.Deaths.2005 Neonatal.Deaths.2006
     Neonatal.Deaths.2007 Neonatal.Deaths.2008 Neonatal.Deaths.2009
     Neonatal.Deaths.2010 Neonatal.Deaths.2011 Neonatal.Deaths.2012
     Neonatal.Deaths.2013 Neonatal.Deaths.2014 Neonatal.Deaths.2015
 [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

Data Transformation and Cleaning

We can see that the data set is wide and messy, column headers are values and not variable names, column names have multiple values stored in them (example: U5MR.1950). We can use gather make the dataset long.

# use gather to create a column for year variable and use columns from
# U5MR.1950 to Neonatal.Deaths.2015 to be the row values of the new column
# year.
child_mortality <- child_mortality %>% gather(year, count, U5MR.1950:Neonatal.Deaths.2015)

head(child_mortality, n = 5)

  ISO.Code CountryName Uncertainty.bounds.      year count
1      AFG Afghanistan               Lower U5MR.1950    NA
2      AFG Afghanistan              Median U5MR.1950    NA
3      AFG Afghanistan               Upper U5MR.1950    NA
4      AGO      Angola               Lower U5MR.1950    NA
5      AGO      Angola              Median U5MR.1950    NA

tail(child_mortality, n = 5)

       ISO.Code CountryName Uncertainty.bounds.                 year count
231656      ZMB      Zambia              Median Neonatal.Deaths.2015 13188
231657      ZMB      Zambia               Upper Neonatal.Deaths.2015 17197
231658      ZWE    Zimbabwe               Lower Neonatal.Deaths.2015  9125
231659      ZWE    Zimbabwe              Median Neonatal.Deaths.2015 12836
231660      ZWE    Zimbabwe               Upper Neonatal.Deaths.2015 18372

We can see that multiple values are stored in year column. We can separate the year column values and assign them to two columns.

  ISO.Code CountryName Uncertainty.bounds. type year count
1      AFG Afghanistan               Lower U5MR 1950    NA
2      AFG Afghanistan              Median U5MR 1950    NA
3      AFG Afghanistan               Upper U5MR 1950    NA
4      AGO      Angola               Lower U5MR 1950    NA
5      AGO      Angola              Median U5MR 1950    NA
6      AGO      Angola               Upper U5MR 1950    NA

       ISO.Code CountryName Uncertainty.bounds.            type year count
231655      ZMB      Zambia               Lower Neonatal Deaths 2015  9796
231656      ZMB      Zambia              Median Neonatal Deaths 2015 13188
231657      ZMB      Zambia               Upper Neonatal Deaths 2015 17197
231658      ZWE    Zimbabwe               Lower Neonatal Deaths 2015  9125
231659      ZWE    Zimbabwe              Median Neonatal Deaths 2015 12836
231660      ZWE    Zimbabwe               Upper Neonatal Deaths 2015 18372

We can change the name of the columns , look at missing values and change the data type of each column as part of the data transformation and cleaning.

# rename the columns as part of cleaning

colnames(child_mortality) <- c("Code", "Country_Name", "Uncertainity", "Type", 
    "Year", "Count")
head(child_mortality)

  Code Country_Name Uncertainity Type Year Count
1  AFG  Afghanistan        Lower U5MR 1950    NA
2  AFG  Afghanistan       Median U5MR 1950    NA
3  AFG  Afghanistan        Upper U5MR 1950    NA
4  AGO       Angola        Lower U5MR 1950    NA
5  AGO       Angola       Median U5MR 1950    NA
6  AGO       Angola        Upper U5MR 1950    NA

# find missing values in each column
sum(is.na(child_mortality$Code))

[1] 0

sum(is.na(child_mortality$Country_Name))

[1] 0

sum(is.na(child_mortality$Type))

[1] 0

sum(is.na(child_mortality$Year))

[1] 0

sum(is.na(child_mortality$Count))

[1] 57980

'data.frame':   231660 obs. of  6 variables:
 $ Code        : Factor w/ 195 levels "AFG","AGO","ALB",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ Country_Name: Factor w/ 195 levels "Afghanistan",..: 1 1 1 5 5 5 2 2 2 4 ...
 $ Uncertainity: Factor w/ 3 levels "Lower","Median",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ Type        : chr  "U5MR" "U5MR" "U5MR" "U5MR" ...
 $ Year        : chr  "1950" "1950" "1950" "1950" ...
 $ Count       : num  NA NA NA NA NA NA NA NA NA NA ...

# we need to change the type of certain variables

child_mortality$Code <- as.character(child_mortality$Code)
child_mortality$Country_Name <- as.character(child_mortality$Country_Name)
child_mortality$Year <- as.numeric(child_mortality$Year)
child_mortality$Count <- as.numeric(child_mortality$Count)
child_mortality$Type <- as.factor(child_mortality$Type)
str(child_mortality)

'data.frame':   231660 obs. of  6 variables:
 $ Code        : chr  "AFG" "AFG" "AFG" "AGO" ...
 $ Country_Name: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Angola" ...
 $ Uncertainity: Factor w/ 3 levels "Lower","Median",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ Type        : Factor w/ 6 levels "IMR","Infant Deaths",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Year        : num  1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
 $ Count       : num  NA NA NA NA NA NA NA NA NA NA ...

Data Exploration and Analysis.

Main goal of the analysis is to look at the deaths occured in the United States for under five, infant and neonatal death types. We can filter and create a new data frame to help with our analysis.

# first get the names of each category Type
levels(child_mortality$Type)

[1] "IMR"               "Infant Deaths"     "Neonatal Deaths"  
[4] "NMR"               "U5MR"              "Under five.Deaths"

# filter through based on the conditions for our main goal of analysis.

child_mortality_us <- child_mortality %>% filter(Country_Name == "United States of America", 
    Type == "Neonatal Deaths" | Type == "Infant Deaths" | Type == "Under five.Deaths")

head(child_mortality_us, n = 10)

   Code             Country_Name Uncertainity              Type Year
1   USA United States of America        Lower Under five.Deaths 1950
2   USA United States of America       Median Under five.Deaths 1950
3   USA United States of America        Upper Under five.Deaths 1950
4   USA United States of America        Lower Under five.Deaths 1951
5   USA United States of America       Median Under five.Deaths 1951
6   USA United States of America        Upper Under five.Deaths 1951
7   USA United States of America        Lower Under five.Deaths 1952
8   USA United States of America       Median Under five.Deaths 1952
9   USA United States of America        Upper Under five.Deaths 1952
10  USA United States of America        Lower Under five.Deaths 1953
    Count
1  139898
2  144132
3  148474
4  136000
5  138850
6  142215
7  132867
8  135167
9  137880
10 130702

tail(child_mortality_us, n = 10)

    Code             Country_Name Uncertainity            Type Year Count
585  USA United States of America        Upper Neonatal Deaths 2012 17265
586  USA United States of America        Lower Neonatal Deaths 2013 13129
587  USA United States of America       Median Neonatal Deaths 2013 14522
588  USA United States of America        Upper Neonatal Deaths 2013 17080
589  USA United States of America        Lower Neonatal Deaths 2014 12178
590  USA United States of America       Median Neonatal Deaths 2014 14023
591  USA United States of America        Upper Neonatal Deaths 2014 17093
592  USA United States of America        Lower Neonatal Deaths 2015 11479
593  USA United States of America       Median Neonatal Deaths 2015 13699
594  USA United States of America        Upper Neonatal Deaths 2015 17370

We can look at the distribution of Type within our filtered data set.

library("ggplot2")
g <- ggplot(child_mortality_us, aes(Type))
g + geom_bar(aes(fill = Uncertainity), width = 0.5) + theme(axis.text.x = element_text(angle = 65, 
    vjust = 0.6)) + labs(title = "Type Distribution for Children Mortality")

The overall the Type and Uncertainity variable are evenly distributed. We can also look at the change of death counts based on the year variable.

ggplot(child_mortality_us, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Children Mortality Time Series")

We can see that the death counts for children under 5, infant and neonatal deaths significantly went down compare to the 1950’s.

Conclusion

Based on our analysis, we can confirm that the death counts for children under 5, infant and neonatal deaths went down significantly during the timeframe between 1950 and 2015. We can further look into research and insights, make further analysis to determine to possible causes and define if there are any correlation with different possible data sets and variables.

Hospital Separation and Patient Days

In this data set , we have a semi wide and messy data around patients and hospitals and counts between 1993 and 1998. The dataset can be downloaded from here. The main purpose of the anlaysis is to see the trend of patient days and separations for Mental Disorders between the years of 1993 and 1998.

Data Collection

patients <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/hospital_patients.csv")
head(patients)

                            IcdChapter       Field    FY1993    FY1994
1                      0. Not Reported PatientDays   257,965    55,582
2                      0. Not Reported Separations    37,178     6,146
3 1. Infectious and Parasitic Diseases PatientDays   311,221   313,386
4 1. Infectious and Parasitic Diseases Separations    75,857    78,323
5                         2. Neoplasms PatientDays 1,686,919 1,707,437
6                         2. Neoplasms Separations   301,928   336,447
     FY1995    FY1996    FY1997    FY1998
1   128,507   182,226    61,599   685,879
2     3,832     4,861     1,558    53,575
3   324,693   311,560   306,688 1,567,548
4    84,631    80,864    79,148   398,823
5 1,795,751 1,770,559 1,777,452 8,738,118
6   348,905   360,578   378,070 1,725,928

tail(patients)

                                                               IcdChapter
33                         16. Symptoms, Signs and Ill?Defined Conditions
34                         16. Symptoms, Signs and Ill?Defined Conditions
35                                               17. Injury and Poisoning
36                                               17. Injury and Poisoning
37 18. Factors Influencing Health Status and Contact with Health Services
38 18. Factors Influencing Health Status and Contact with Health Services
         Field    FY1993    FY1994    FY1995    FY1996    FY1997
33 PatientDays   747,943   759,598   777,964   767,115   788,946
34 Separations   246,752   261,302   277,289   297,481   318,088
35 PatientDays 1,895,224 1,838,321 1,831,864 1,770,099 1,765,479
36 Separations   365,908   379,247   393,941   389,829   403,387
37 PatientDays 2,237,646 2,638,779 2,999,856 3,055,220 3,169,730
38 Separations   658,363   768,737   875,268   942,537   996,787
       FY1998
33  3,841,566
34  1,400,912
35  9,100,987
36  1,932,312
37 14,101,231
38  4,241,692

Data Transformation and Cleaning

We can see that the column headers are values and not variable names. So we can use gather for tidying the dataset. We can also remove the FY from the year column and remove “,” from count column and make those columns numeric to further help with our analyis.

patients <- patients %>% gather(year, count, FY1993:FY1998) %>% mutate(year = as.numeric(gsub("FY", 
    "", year))) %>% mutate(count = as.numeric(gsub(",", "", count)))

head(patients)

                            IcdChapter       Field year   count
1                      0. Not Reported PatientDays 1993  257965
2                      0. Not Reported Separations 1993   37178
3 1. Infectious and Parasitic Diseases PatientDays 1993  311221
4 1. Infectious and Parasitic Diseases Separations 1993   75857
5                         2. Neoplasms PatientDays 1993 1686919
6                         2. Neoplasms Separations 1993  301928

We can rename the column names , update the Chapter column values by removing the unneccessary numbering.

patients$IcdChapter <- str_replace(patients$IcdChapter, "[:digit:]\\.", "")
head(patients)

                          IcdChapter       Field year   count
1                       Not Reported PatientDays 1993  257965
2                       Not Reported Separations 1993   37178
3  Infectious and Parasitic Diseases PatientDays 1993  311221
4  Infectious and Parasitic Diseases Separations 1993   75857
5                          Neoplasms PatientDays 1993 1686919
6                          Neoplasms Separations 1993  301928

colnames(patients) <- c("Reason", "Field", "Year", "Count")
patients$Count <- as.numeric(patients$Count)
head(patients)

                              Reason       Field Year   Count
1                       Not Reported PatientDays 1993  257965
2                       Not Reported Separations 1993   37178
3  Infectious and Parasitic Diseases PatientDays 1993  311221
4  Infectious and Parasitic Diseases Separations 1993   75857
5                          Neoplasms PatientDays 1993 1686919
6                          Neoplasms Separations 1993  301928

Let’s look to see if there are any missing values.

sum(is.na(patients$Reason))

[1] 0

sum(is.na(patients$Field))

[1] 0

sum(is.na(patients$Year))

[1] 0

sum(is.na(patients$Count))

[1] 0

There are no missing values, the data set is ready for data analysis and exploration.

Data Exploration and Analysis

Main goal of the analysis is to extract insights for patients that have mental disorders with Patient Days. We can further filter our analysis based on this condition.

mental_disorders <- filter(patients, Reason == " Mental Disorders", Field == 
    "PatientDays")
head(mental_disorders)

             Reason       Field Year    Count
1  Mental Disorders PatientDays 1993  1577877
2  Mental Disorders PatientDays 1994  1504333
3  Mental Disorders PatientDays 1995  2592320
4  Mental Disorders PatientDays 1996  2985958
5  Mental Disorders PatientDays 1997  3188096
6  Mental Disorders PatientDays 1998 11848584

We can further look to see the trend based on the year.

ggplot(mental_disorders, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Mental Disorders Time Series for Patient Days")

mental_disorders_2 <- filter(patients, Reason == " Mental Disorders", Field == 
    "Separations")
head(mental_disorders_2)

             Reason       Field Year  Count
1  Mental Disorders Separations 1993 134763
2  Mental Disorders Separations 1994 137649
3  Mental Disorders Separations 1995 191183
4  Mental Disorders Separations 1996 204431
5  Mental Disorders Separations 1997 234956
6  Mental Disorders Separations 1998 902982

ggplot(mental_disorders_2, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Mental Disorders Time Series for Separations")

Conclusion

Based on our analysis, we can state that the patients that were in the Mental Disorder Category, Patient Days in the hospital and Separations increased significantly starting between 1997 and 1998. We can further extend our analysis on the reasoning of this insight.

Great British Bakeoff

At our last analysis, we will take a look at a data set

Data Collection

baker <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/baker_results.csv")
head(baker)

  series          baker_full   baker age                        occupation
1      1       Annetha Mills Annetha  30                     Single mother
2      1      David Chambers   David  31                      Entrepreneur
3      1 Edward "Edd" Kimber     Edd  24 Debt collector for Yorkshire Bank
       hometown baker_last baker_first star_baker technical_winner
1         Essex      Mills     Annetha          0                0
2 Milton Keynes   Chambers       David          0                0
3      Bradford     Kimber      Edward          0                2
  technical_top3 technical_bottom technical_highest technical_lowest
1              1                1                 2                7
2              1                3                 3                8
3              4                1                 1                6
  technical_median series_winner series_runner_up total_episodes_appeared
1              4.5             0                0                       2
2              4.5             0                0                       4
3              2.0             1                0                       6
  first_date_appeared last_date_appeared first_date_us last_date_us
1          2010-08-17         2010-08-24          <NA>         <NA>
2          2010-08-17         2010-09-07          <NA>         <NA>
3          2010-08-17         2010-09-21          <NA>         <NA>
  percent_episodes_appeared percent_technical_top3
1                  33.33333               50.00000
2                  66.66667               25.00000
3                 100.00000               66.66667
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

Data Transformation and Cleaning

When we look at our data set, we can see that it is wide and messy. There are way too many columns that we will not need for the purpose of our analysis. We can start by subsetting the raw data set to get the columns we need.

baker_results <- select(baker, baker_full, age, occupation, hometown, technical_median, 
    first_date_appeared, percent_episodes_appeared)
head(baker_results)

           baker_full age                        occupation
1       Annetha Mills  30                     Single mother
2      David Chambers  31                      Entrepreneur
3 Edward "Edd" Kimber  24 Debt collector for Yorkshire Bank
4  Jasminder Randhawa  45  Assistant Credit Control Manager
5   Jonathan Shepherd  25                  Research Analyst
6          Lea Harris  51                           Retired
              hometown technical_median first_date_appeared
1                Essex              4.5          2010-08-17
2        Milton Keynes              4.5          2010-08-17
3             Bradford              2.0          2010-08-17
4           Birmingham              3.0          2010-08-17
5            St Albans              6.0          2010-08-17
6 Midlothian, Scotland             10.0          2010-08-17
  percent_episodes_appeared
1                  33.33333
2                  66.66667
3                 100.00000
4                  83.33333
5                  50.00000
6                  16.66667

Let’s also update the column names.

colnames(baker_results) <- c("Contestant", "Age", "Occupation", "HomeTown", 
    "Rate", "Date", "Rating")
head(baker_results)

           Contestant Age                        Occupation
1       Annetha Mills  30                     Single mother
2      David Chambers  31                      Entrepreneur
3 Edward "Edd" Kimber  24 Debt collector for Yorkshire Bank
4  Jasminder Randhawa  45  Assistant Credit Control Manager
5   Jonathan Shepherd  25                  Research Analyst
6          Lea Harris  51                           Retired
              HomeTown Rate       Date    Rating
1                Essex  4.5 2010-08-17  33.33333
2        Milton Keynes  4.5 2010-08-17  66.66667
3             Bradford  2.0 2010-08-17 100.00000
4           Birmingham  3.0 2010-08-17  83.33333
5            St Albans  6.0 2010-08-17  50.00000
6 Midlothian, Scotland 10.0 2010-08-17  16.66667

We can split the Date variable into year, month and day which will allow us to look at more specific time series analysis.

baker_results <- separate(baker_results, Date, c("Year", "Month", "Day"), sep = "-")
head(baker_results)

           Contestant Age                        Occupation
1       Annetha Mills  30                     Single mother
2      David Chambers  31                      Entrepreneur
3 Edward "Edd" Kimber  24 Debt collector for Yorkshire Bank
4  Jasminder Randhawa  45  Assistant Credit Control Manager
5   Jonathan Shepherd  25                  Research Analyst
6          Lea Harris  51                           Retired
              HomeTown Rate Year Month Day    Rating
1                Essex  4.5 2010    08  17  33.33333
2        Milton Keynes  4.5 2010    08  17  66.66667
3             Bradford  2.0 2010    08  17 100.00000
4           Birmingham  3.0 2010    08  17  83.33333
5            St Albans  6.0 2010    08  17  50.00000
6 Midlothian, Scotland 10.0 2010    08  17  16.66667

We can filter to a particular month to see if the rating of the show depends based on the month.

baker_results$Month <- as.integer(baker_results$Month)
baker_results$Year <- as.integer(baker_results$Year)
baker_results$Day <- as.integer(baker_results$Day)
str(baker_results)

'data.frame':   95 obs. of  9 variables:
 $ Contestant: Factor w/ 95 levels "Ali Imdad","Alvin Magallanes",..: 4 17 21 36 40 48 51 58 63 76 ...
 $ Age       : int  30 31 24 45 25 51 44 48 37 31 ...
 $ Occupation: Factor w/ 87 levels "Accountant","Advertising executive",..: 75 26 22 6 66 68 60 10 29 67 ...
 $ HomeTown  : Factor w/ 83 levels "Aldershot, Hampshire",..: 31 50 14 11 71 48 43 69 47 60 ...
 $ Rate      : num  4.5 4.5 2 3 6 10 4 NA 3 3.5 ...
 $ Year      : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ Month     : int  8 8 8 8 8 8 8 8 8 8 ...
 $ Day       : int  17 17 17 17 17 17 17 17 17 17 ...
 $ Rating    : num  33.3 66.7 100 83.3 50 ...

sum(is.na(baker_results$Age))

[1] 0

sum(is.na(baker_results$Contestant))

[1] 0

sum(is.na(baker_results$Occupation))

[1] 0

sum(is.na(baker_results$HomeTown))

[1] 0

sum(is.na(baker_results$Rate))

[1] 1

sum(is.na(baker_results$Year))

[1] 0

sum(is.na(baker_results$Month))

[1] 0

sum(is.na(baker_results$Day))

[1] 0

sum(is.na(baker_results$Rating))

[1] 0

There is one missing value in Rate. We can remove that data set from our analysis.

baker_results <- na.omit(baker_results)
sum(is.na(baker_results$Rate))

[1] 0

ggplot(baker_results) + geom_point(aes(Rate, Rating))

Conclusion

There is no linear correlation between Rate and Rating. This means, the ratings of the show does not depend on how good the Contestant is.

Data Transformation

2019-10-04

Introduction

Child Mortality

Data Collection

Data Transformation and Cleaning

Data Exploration and Analysis.

Conclusion

Hospital Separation and Patient Days

Data Collection

Data Transformation and Cleaning

Data Exploration and Analysis

Conclusion

Great British Bakeoff

Data Collection

Data Transformation and Cleaning

Conclusion