# import neccessary libraries
library("tidyr")
library("stringr")
library("dplyr")
library("ggplot2")Introduction
The goal of this assignment is to practice in preparing different datasets for downstream analysis work by using three “Wide” datasets. The data transformation and analysis tasks that are required for this assignment are;
Creation of a .CSV file which includes all the information of the wide dataset.
Read the information from the .CSV file into R, use tidyr and dplyr as needed to tidy and transform the data.
Perform the analysis.
Child Mortality
The first wide dataset we will transform and analyze is “Child Mortality, Estimates for under-five, infant and neonatral mortality” for each country based on the research of United Nations Inter-agency Group for Child Mortality Estimation. The dataset can be found at https://childmortality.org/. The insight that we want to get from this data analysis is to see the death count on children that are under 5, infant and neonatal in the United States between 1950 and 2015.
Data Collection
# read .csv into R
child_mortality <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/RatesDeaths_AllIndicators.csv",
skip = 6)
head(child_mortality, n = 5) ISO.Code CountryName Uncertainty.bounds. U5MR.1950 U5MR.1951
U5MR.1952 U5MR.1953 U5MR.1954 U5MR.1955 U5MR.1956 U5MR.1957 U5MR.1958
U5MR.1959 U5MR.1960 U5MR.1961 U5MR.1962 U5MR.1963 U5MR.1964 U5MR.1965
U5MR.1966 U5MR.1967 U5MR.1968 U5MR.1969 U5MR.1970 U5MR.1971 U5MR.1972
U5MR.1973 U5MR.1974 U5MR.1975 U5MR.1976 U5MR.1977 U5MR.1978 U5MR.1979
U5MR.1980 U5MR.1981 U5MR.1982 U5MR.1983 U5MR.1984 U5MR.1985 U5MR.1986
U5MR.1987 U5MR.1988 U5MR.1989 U5MR.1990 U5MR.1991 U5MR.1992 U5MR.1993
U5MR.1994 U5MR.1995 U5MR.1996 U5MR.1997 U5MR.1998 U5MR.1999 U5MR.2000
U5MR.2001 U5MR.2002 U5MR.2003 U5MR.2004 U5MR.2005 U5MR.2006 U5MR.2007
U5MR.2008 U5MR.2009 U5MR.2010 U5MR.2011 U5MR.2012 U5MR.2013 U5MR.2014
U5MR.2015 IMR.1950 IMR.1951 IMR.1952 IMR.1953 IMR.1954 IMR.1955
IMR.1956 IMR.1957 IMR.1958 IMR.1959 IMR.1960 IMR.1961 IMR.1962
IMR.1963 IMR.1964 IMR.1965 IMR.1966 IMR.1967 IMR.1968 IMR.1969
IMR.1970 IMR.1971 IMR.1972 IMR.1973 IMR.1974 IMR.1975 IMR.1976
IMR.1977 IMR.1978 IMR.1979 IMR.1980 IMR.1981 IMR.1982 IMR.1983
IMR.1984 IMR.1985 IMR.1986 IMR.1987 IMR.1988 IMR.1989 IMR.1990
IMR.1991 IMR.1992 IMR.1993 IMR.1994 IMR.1995 IMR.1996 IMR.1997
IMR.1998 IMR.1999 IMR.2000 IMR.2001 IMR.2002 IMR.2003 IMR.2004
IMR.2005 IMR.2006 IMR.2007 IMR.2008 IMR.2009 IMR.2010 IMR.2011
IMR.2012 IMR.2013 IMR.2014 IMR.2015 NMR.1950 NMR.1951 NMR.1952
NMR.1953 NMR.1954 NMR.1955 NMR.1956 NMR.1957 NMR.1958 NMR.1959
NMR.1960 NMR.1961 NMR.1962 NMR.1963 NMR.1964 NMR.1965 NMR.1966
NMR.1967 NMR.1968 NMR.1969 NMR.1970 NMR.1971 NMR.1972 NMR.1973
NMR.1974 NMR.1975 NMR.1976 NMR.1977 NMR.1978 NMR.1979 NMR.1980
NMR.1981 NMR.1982 NMR.1983 NMR.1984 NMR.1985 NMR.1986 NMR.1987
NMR.1988 NMR.1989 NMR.1990 NMR.1991 NMR.1992 NMR.1993 NMR.1994
NMR.1995 NMR.1996 NMR.1997 NMR.1998 NMR.1999 NMR.2000 NMR.2001
NMR.2002 NMR.2003 NMR.2004 NMR.2005 NMR.2006 NMR.2007 NMR.2008
NMR.2009 NMR.2010 NMR.2011 NMR.2012 NMR.2013 NMR.2014 NMR.2015
Under.five.Deaths.1950 Under.five.Deaths.1951 Under.five.Deaths.1952
Under.five.Deaths.1953 Under.five.Deaths.1954 Under.five.Deaths.1955
Under.five.Deaths.1956 Under.five.Deaths.1957 Under.five.Deaths.1958
Under.five.Deaths.1959 Under.five.Deaths.1960 Under.five.Deaths.1961
Under.five.Deaths.1962 Under.five.Deaths.1963 Under.five.Deaths.1964
Under.five.Deaths.1965 Under.five.Deaths.1966 Under.five.Deaths.1967
Under.five.Deaths.1968 Under.five.Deaths.1969 Under.five.Deaths.1970
Under.five.Deaths.1971 Under.five.Deaths.1972 Under.five.Deaths.1973
Under.five.Deaths.1974 Under.five.Deaths.1975 Under.five.Deaths.1976
Under.five.Deaths.1977 Under.five.Deaths.1978 Under.five.Deaths.1979
Under.five.Deaths.1980 Under.five.Deaths.1981 Under.five.Deaths.1982
Under.five.Deaths.1983 Under.five.Deaths.1984 Under.five.Deaths.1985
Under.five.Deaths.1986 Under.five.Deaths.1987 Under.five.Deaths.1988
Under.five.Deaths.1989 Under.five.Deaths.1990 Under.five.Deaths.1991
Under.five.Deaths.1992 Under.five.Deaths.1993 Under.five.Deaths.1994
Under.five.Deaths.1995 Under.five.Deaths.1996 Under.five.Deaths.1997
Under.five.Deaths.1998 Under.five.Deaths.1999 Under.five.Deaths.2000
Under.five.Deaths.2001 Under.five.Deaths.2002 Under.five.Deaths.2003
Under.five.Deaths.2004 Under.five.Deaths.2005 Under.five.Deaths.2006
Under.five.Deaths.2007 Under.five.Deaths.2008 Under.five.Deaths.2009
Under.five.Deaths.2010 Under.five.Deaths.2011 Under.five.Deaths.2012
Under.five.Deaths.2013 Under.five.Deaths.2014 Under.five.Deaths.2015
Infant.Deaths.1950 Infant.Deaths.1951 Infant.Deaths.1952
Infant.Deaths.1953 Infant.Deaths.1954 Infant.Deaths.1955
Infant.Deaths.1956 Infant.Deaths.1957 Infant.Deaths.1958
Infant.Deaths.1959 Infant.Deaths.1960 Infant.Deaths.1961
Infant.Deaths.1962 Infant.Deaths.1963 Infant.Deaths.1964
Infant.Deaths.1965 Infant.Deaths.1966 Infant.Deaths.1967
Infant.Deaths.1968 Infant.Deaths.1969 Infant.Deaths.1970
Infant.Deaths.1971 Infant.Deaths.1972 Infant.Deaths.1973
Infant.Deaths.1974 Infant.Deaths.1975 Infant.Deaths.1976
Infant.Deaths.1977 Infant.Deaths.1978 Infant.Deaths.1979
Infant.Deaths.1980 Infant.Deaths.1981 Infant.Deaths.1982
Infant.Deaths.1983 Infant.Deaths.1984 Infant.Deaths.1985
Infant.Deaths.1986 Infant.Deaths.1987 Infant.Deaths.1988
Infant.Deaths.1989 Infant.Deaths.1990 Infant.Deaths.1991
Infant.Deaths.1992 Infant.Deaths.1993 Infant.Deaths.1994
Infant.Deaths.1995 Infant.Deaths.1996 Infant.Deaths.1997
Infant.Deaths.1998 Infant.Deaths.1999 Infant.Deaths.2000
Infant.Deaths.2001 Infant.Deaths.2002 Infant.Deaths.2003
Infant.Deaths.2004 Infant.Deaths.2005 Infant.Deaths.2006
Infant.Deaths.2007 Infant.Deaths.2008 Infant.Deaths.2009
Infant.Deaths.2010 Infant.Deaths.2011 Infant.Deaths.2012
Infant.Deaths.2013 Infant.Deaths.2014 Infant.Deaths.2015
Neonatal.Deaths.1950 Neonatal.Deaths.1951 Neonatal.Deaths.1952
Neonatal.Deaths.1953 Neonatal.Deaths.1954 Neonatal.Deaths.1955
Neonatal.Deaths.1956 Neonatal.Deaths.1957 Neonatal.Deaths.1958
Neonatal.Deaths.1959 Neonatal.Deaths.1960 Neonatal.Deaths.1961
Neonatal.Deaths.1962 Neonatal.Deaths.1963 Neonatal.Deaths.1964
Neonatal.Deaths.1965 Neonatal.Deaths.1966 Neonatal.Deaths.1967
Neonatal.Deaths.1968 Neonatal.Deaths.1969 Neonatal.Deaths.1970
Neonatal.Deaths.1971 Neonatal.Deaths.1972 Neonatal.Deaths.1973
Neonatal.Deaths.1974 Neonatal.Deaths.1975 Neonatal.Deaths.1976
Neonatal.Deaths.1977 Neonatal.Deaths.1978 Neonatal.Deaths.1979
Neonatal.Deaths.1980 Neonatal.Deaths.1981 Neonatal.Deaths.1982
Neonatal.Deaths.1983 Neonatal.Deaths.1984 Neonatal.Deaths.1985
Neonatal.Deaths.1986 Neonatal.Deaths.1987 Neonatal.Deaths.1988
Neonatal.Deaths.1989 Neonatal.Deaths.1990 Neonatal.Deaths.1991
Neonatal.Deaths.1992 Neonatal.Deaths.1993 Neonatal.Deaths.1994
Neonatal.Deaths.1995 Neonatal.Deaths.1996 Neonatal.Deaths.1997
Neonatal.Deaths.1998 Neonatal.Deaths.1999 Neonatal.Deaths.2000
Neonatal.Deaths.2001 Neonatal.Deaths.2002 Neonatal.Deaths.2003
Neonatal.Deaths.2004 Neonatal.Deaths.2005 Neonatal.Deaths.2006
Neonatal.Deaths.2007 Neonatal.Deaths.2008 Neonatal.Deaths.2009
Neonatal.Deaths.2010 Neonatal.Deaths.2011 Neonatal.Deaths.2012
Neonatal.Deaths.2013 Neonatal.Deaths.2014 Neonatal.Deaths.2015
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
Data Transformation and Cleaning
We can see that the data set is wide and messy, column headers are values and not variable names, column names have multiple values stored in them (example: U5MR.1950). We can use gather make the dataset long.
# use gather to create a column for year variable and use columns from
# U5MR.1950 to Neonatal.Deaths.2015 to be the row values of the new column
# year.
child_mortality <- child_mortality %>% gather(year, count, U5MR.1950:Neonatal.Deaths.2015)
head(child_mortality, n = 5) ISO.Code CountryName Uncertainty.bounds. year count
1 AFG Afghanistan Lower U5MR.1950 NA
2 AFG Afghanistan Median U5MR.1950 NA
3 AFG Afghanistan Upper U5MR.1950 NA
4 AGO Angola Lower U5MR.1950 NA
5 AGO Angola Median U5MR.1950 NA
ISO.Code CountryName Uncertainty.bounds. year count
231656 ZMB Zambia Median Neonatal.Deaths.2015 13188
231657 ZMB Zambia Upper Neonatal.Deaths.2015 17197
231658 ZWE Zimbabwe Lower Neonatal.Deaths.2015 9125
231659 ZWE Zimbabwe Median Neonatal.Deaths.2015 12836
231660 ZWE Zimbabwe Upper Neonatal.Deaths.2015 18372
We can see that multiple values are stored in year column. We can separate the year column values and assign them to two columns.
ISO.Code CountryName Uncertainty.bounds. type year count
1 AFG Afghanistan Lower U5MR 1950 NA
2 AFG Afghanistan Median U5MR 1950 NA
3 AFG Afghanistan Upper U5MR 1950 NA
4 AGO Angola Lower U5MR 1950 NA
5 AGO Angola Median U5MR 1950 NA
6 AGO Angola Upper U5MR 1950 NA
ISO.Code CountryName Uncertainty.bounds. type year count
231655 ZMB Zambia Lower Neonatal Deaths 2015 9796
231656 ZMB Zambia Median Neonatal Deaths 2015 13188
231657 ZMB Zambia Upper Neonatal Deaths 2015 17197
231658 ZWE Zimbabwe Lower Neonatal Deaths 2015 9125
231659 ZWE Zimbabwe Median Neonatal Deaths 2015 12836
231660 ZWE Zimbabwe Upper Neonatal Deaths 2015 18372
We can change the name of the columns , look at missing values and change the data type of each column as part of the data transformation and cleaning.
# rename the columns as part of cleaning
colnames(child_mortality) <- c("Code", "Country_Name", "Uncertainity", "Type",
"Year", "Count")
head(child_mortality) Code Country_Name Uncertainity Type Year Count
1 AFG Afghanistan Lower U5MR 1950 NA
2 AFG Afghanistan Median U5MR 1950 NA
3 AFG Afghanistan Upper U5MR 1950 NA
4 AGO Angola Lower U5MR 1950 NA
5 AGO Angola Median U5MR 1950 NA
6 AGO Angola Upper U5MR 1950 NA
[1] 0
[1] 0
[1] 0
[1] 0
[1] 57980
'data.frame': 231660 obs. of 6 variables:
$ Code : Factor w/ 195 levels "AFG","AGO","ALB",..: 1 1 1 2 2 2 3 3 3 4 ...
$ Country_Name: Factor w/ 195 levels "Afghanistan",..: 1 1 1 5 5 5 2 2 2 4 ...
$ Uncertainity: Factor w/ 3 levels "Lower","Median",..: 1 2 3 1 2 3 1 2 3 1 ...
$ Type : chr "U5MR" "U5MR" "U5MR" "U5MR" ...
$ Year : chr "1950" "1950" "1950" "1950" ...
$ Count : num NA NA NA NA NA NA NA NA NA NA ...
# we need to change the type of certain variables
child_mortality$Code <- as.character(child_mortality$Code)
child_mortality$Country_Name <- as.character(child_mortality$Country_Name)
child_mortality$Year <- as.numeric(child_mortality$Year)
child_mortality$Count <- as.numeric(child_mortality$Count)
child_mortality$Type <- as.factor(child_mortality$Type)
str(child_mortality)'data.frame': 231660 obs. of 6 variables:
$ Code : chr "AFG" "AFG" "AFG" "AGO" ...
$ Country_Name: chr "Afghanistan" "Afghanistan" "Afghanistan" "Angola" ...
$ Uncertainity: Factor w/ 3 levels "Lower","Median",..: 1 2 3 1 2 3 1 2 3 1 ...
$ Type : Factor w/ 6 levels "IMR","Infant Deaths",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Year : num 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
$ Count : num NA NA NA NA NA NA NA NA NA NA ...
Data Exploration and Analysis.
Main goal of the analysis is to look at the deaths occured in the United States for under five, infant and neonatal death types. We can filter and create a new data frame to help with our analysis.
[1] "IMR" "Infant Deaths" "Neonatal Deaths"
[4] "NMR" "U5MR" "Under five.Deaths"
# filter through based on the conditions for our main goal of analysis.
child_mortality_us <- child_mortality %>% filter(Country_Name == "United States of America",
Type == "Neonatal Deaths" | Type == "Infant Deaths" | Type == "Under five.Deaths")
head(child_mortality_us, n = 10) Code Country_Name Uncertainity Type Year
1 USA United States of America Lower Under five.Deaths 1950
2 USA United States of America Median Under five.Deaths 1950
3 USA United States of America Upper Under five.Deaths 1950
4 USA United States of America Lower Under five.Deaths 1951
5 USA United States of America Median Under five.Deaths 1951
6 USA United States of America Upper Under five.Deaths 1951
7 USA United States of America Lower Under five.Deaths 1952
8 USA United States of America Median Under five.Deaths 1952
9 USA United States of America Upper Under five.Deaths 1952
10 USA United States of America Lower Under five.Deaths 1953
Count
1 139898
2 144132
3 148474
4 136000
5 138850
6 142215
7 132867
8 135167
9 137880
10 130702
Code Country_Name Uncertainity Type Year Count
585 USA United States of America Upper Neonatal Deaths 2012 17265
586 USA United States of America Lower Neonatal Deaths 2013 13129
587 USA United States of America Median Neonatal Deaths 2013 14522
588 USA United States of America Upper Neonatal Deaths 2013 17080
589 USA United States of America Lower Neonatal Deaths 2014 12178
590 USA United States of America Median Neonatal Deaths 2014 14023
591 USA United States of America Upper Neonatal Deaths 2014 17093
592 USA United States of America Lower Neonatal Deaths 2015 11479
593 USA United States of America Median Neonatal Deaths 2015 13699
594 USA United States of America Upper Neonatal Deaths 2015 17370
We can look at the distribution of Type within our filtered data set.
library("ggplot2")
g <- ggplot(child_mortality_us, aes(Type))
g + geom_bar(aes(fill = Uncertainity), width = 0.5) + theme(axis.text.x = element_text(angle = 65,
vjust = 0.6)) + labs(title = "Type Distribution for Children Mortality")The overall the Type and Uncertainity variable are evenly distributed. We can also look at the change of death counts based on the year variable.
ggplot(child_mortality_us, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Children Mortality Time Series")We can see that the death counts for children under 5, infant and neonatal deaths significantly went down compare to the 1950’s.
Conclusion
Based on our analysis, we can confirm that the death counts for children under 5, infant and neonatal deaths went down significantly during the timeframe between 1950 and 2015. We can further look into research and insights, make further analysis to determine to possible causes and define if there are any correlation with different possible data sets and variables.
Hospital Separation and Patient Days
In this data set , we have a semi wide and messy data around patients and hospitals and counts between 1993 and 1998. The dataset can be downloaded from here. The main purpose of the anlaysis is to see the trend of patient days and separations for Mental Disorders between the years of 1993 and 1998.
Data Collection
patients <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/hospital_patients.csv")
head(patients) IcdChapter Field FY1993 FY1994
1 0. Not Reported PatientDays 257,965 55,582
2 0. Not Reported Separations 37,178 6,146
3 1. Infectious and Parasitic Diseases PatientDays 311,221 313,386
4 1. Infectious and Parasitic Diseases Separations 75,857 78,323
5 2. Neoplasms PatientDays 1,686,919 1,707,437
6 2. Neoplasms Separations 301,928 336,447
FY1995 FY1996 FY1997 FY1998
1 128,507 182,226 61,599 685,879
2 3,832 4,861 1,558 53,575
3 324,693 311,560 306,688 1,567,548
4 84,631 80,864 79,148 398,823
5 1,795,751 1,770,559 1,777,452 8,738,118
6 348,905 360,578 378,070 1,725,928
IcdChapter
33 16. Symptoms, Signs and Ill?Defined Conditions
34 16. Symptoms, Signs and Ill?Defined Conditions
35 17. Injury and Poisoning
36 17. Injury and Poisoning
37 18. Factors Influencing Health Status and Contact with Health Services
38 18. Factors Influencing Health Status and Contact with Health Services
Field FY1993 FY1994 FY1995 FY1996 FY1997
33 PatientDays 747,943 759,598 777,964 767,115 788,946
34 Separations 246,752 261,302 277,289 297,481 318,088
35 PatientDays 1,895,224 1,838,321 1,831,864 1,770,099 1,765,479
36 Separations 365,908 379,247 393,941 389,829 403,387
37 PatientDays 2,237,646 2,638,779 2,999,856 3,055,220 3,169,730
38 Separations 658,363 768,737 875,268 942,537 996,787
FY1998
33 3,841,566
34 1,400,912
35 9,100,987
36 1,932,312
37 14,101,231
38 4,241,692
Data Transformation and Cleaning
We can see that the column headers are values and not variable names. So we can use gather for tidying the dataset. We can also remove the FY from the year column and remove “,” from count column and make those columns numeric to further help with our analyis.
patients <- patients %>% gather(year, count, FY1993:FY1998) %>% mutate(year = as.numeric(gsub("FY",
"", year))) %>% mutate(count = as.numeric(gsub(",", "", count)))
head(patients) IcdChapter Field year count
1 0. Not Reported PatientDays 1993 257965
2 0. Not Reported Separations 1993 37178
3 1. Infectious and Parasitic Diseases PatientDays 1993 311221
4 1. Infectious and Parasitic Diseases Separations 1993 75857
5 2. Neoplasms PatientDays 1993 1686919
6 2. Neoplasms Separations 1993 301928
We can rename the column names , update the Chapter column values by removing the unneccessary numbering.
IcdChapter Field year count
1 Not Reported PatientDays 1993 257965
2 Not Reported Separations 1993 37178
3 Infectious and Parasitic Diseases PatientDays 1993 311221
4 Infectious and Parasitic Diseases Separations 1993 75857
5 Neoplasms PatientDays 1993 1686919
6 Neoplasms Separations 1993 301928
colnames(patients) <- c("Reason", "Field", "Year", "Count")
patients$Count <- as.numeric(patients$Count)
head(patients) Reason Field Year Count
1 Not Reported PatientDays 1993 257965
2 Not Reported Separations 1993 37178
3 Infectious and Parasitic Diseases PatientDays 1993 311221
4 Infectious and Parasitic Diseases Separations 1993 75857
5 Neoplasms PatientDays 1993 1686919
6 Neoplasms Separations 1993 301928
Let’s look to see if there are any missing values.
[1] 0
[1] 0
[1] 0
[1] 0
There are no missing values, the data set is ready for data analysis and exploration.
Data Exploration and Analysis
Main goal of the analysis is to extract insights for patients that have mental disorders with Patient Days. We can further filter our analysis based on this condition.
mental_disorders <- filter(patients, Reason == " Mental Disorders", Field ==
"PatientDays")
head(mental_disorders) Reason Field Year Count
1 Mental Disorders PatientDays 1993 1577877
2 Mental Disorders PatientDays 1994 1504333
3 Mental Disorders PatientDays 1995 2592320
4 Mental Disorders PatientDays 1996 2985958
5 Mental Disorders PatientDays 1997 3188096
6 Mental Disorders PatientDays 1998 11848584
We can further look to see the trend based on the year.
ggplot(mental_disorders, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Mental Disorders Time Series for Patient Days")mental_disorders_2 <- filter(patients, Reason == " Mental Disorders", Field ==
"Separations")
head(mental_disorders_2) Reason Field Year Count
1 Mental Disorders Separations 1993 134763
2 Mental Disorders Separations 1994 137649
3 Mental Disorders Separations 1995 191183
4 Mental Disorders Separations 1996 204431
5 Mental Disorders Separations 1997 234956
6 Mental Disorders Separations 1998 902982
ggplot(mental_disorders_2, aes(x = Year)) + geom_line(aes(y = Count)) + labs(title = "Mental Disorders Time Series for Separations")Conclusion
Based on our analysis, we can state that the patients that were in the Mental Disorder Category, Patient Days in the hospital and Separations increased significantly starting between 1997 and 1998. We can further extend our analysis on the reasoning of this insight.
Great British Bakeoff
At our last analysis, we will take a look at a data set
Data Collection
baker <- read.csv("https://raw.githubusercontent.com/anilak1978/medium-article-data/master/baker_results.csv")
head(baker) series baker_full baker age occupation
1 1 Annetha Mills Annetha 30 Single mother
2 1 David Chambers David 31 Entrepreneur
3 1 Edward "Edd" Kimber Edd 24 Debt collector for Yorkshire Bank
hometown baker_last baker_first star_baker technical_winner
1 Essex Mills Annetha 0 0
2 Milton Keynes Chambers David 0 0
3 Bradford Kimber Edward 0 2
technical_top3 technical_bottom technical_highest technical_lowest
1 1 1 2 7
2 1 3 3 8
3 4 1 1 6
technical_median series_winner series_runner_up total_episodes_appeared
1 4.5 0 0 2
2 4.5 0 0 4
3 2.0 1 0 6
first_date_appeared last_date_appeared first_date_us last_date_us
1 2010-08-17 2010-08-24 <NA> <NA>
2 2010-08-17 2010-09-07 <NA> <NA>
3 2010-08-17 2010-09-21 <NA> <NA>
percent_episodes_appeared percent_technical_top3
1 33.33333 50.00000
2 66.66667 25.00000
3 100.00000 66.66667
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
Data Transformation and Cleaning
When we look at our data set, we can see that it is wide and messy. There are way too many columns that we will not need for the purpose of our analysis. We can start by subsetting the raw data set to get the columns we need.
baker_results <- select(baker, baker_full, age, occupation, hometown, technical_median,
first_date_appeared, percent_episodes_appeared)
head(baker_results) baker_full age occupation
1 Annetha Mills 30 Single mother
2 David Chambers 31 Entrepreneur
3 Edward "Edd" Kimber 24 Debt collector for Yorkshire Bank
4 Jasminder Randhawa 45 Assistant Credit Control Manager
5 Jonathan Shepherd 25 Research Analyst
6 Lea Harris 51 Retired
hometown technical_median first_date_appeared
1 Essex 4.5 2010-08-17
2 Milton Keynes 4.5 2010-08-17
3 Bradford 2.0 2010-08-17
4 Birmingham 3.0 2010-08-17
5 St Albans 6.0 2010-08-17
6 Midlothian, Scotland 10.0 2010-08-17
percent_episodes_appeared
1 33.33333
2 66.66667
3 100.00000
4 83.33333
5 50.00000
6 16.66667
Let’s also update the column names.
colnames(baker_results) <- c("Contestant", "Age", "Occupation", "HomeTown",
"Rate", "Date", "Rating")
head(baker_results) Contestant Age Occupation
1 Annetha Mills 30 Single mother
2 David Chambers 31 Entrepreneur
3 Edward "Edd" Kimber 24 Debt collector for Yorkshire Bank
4 Jasminder Randhawa 45 Assistant Credit Control Manager
5 Jonathan Shepherd 25 Research Analyst
6 Lea Harris 51 Retired
HomeTown Rate Date Rating
1 Essex 4.5 2010-08-17 33.33333
2 Milton Keynes 4.5 2010-08-17 66.66667
3 Bradford 2.0 2010-08-17 100.00000
4 Birmingham 3.0 2010-08-17 83.33333
5 St Albans 6.0 2010-08-17 50.00000
6 Midlothian, Scotland 10.0 2010-08-17 16.66667
We can split the Date variable into year, month and day which will allow us to look at more specific time series analysis.
baker_results <- separate(baker_results, Date, c("Year", "Month", "Day"), sep = "-")
head(baker_results) Contestant Age Occupation
1 Annetha Mills 30 Single mother
2 David Chambers 31 Entrepreneur
3 Edward "Edd" Kimber 24 Debt collector for Yorkshire Bank
4 Jasminder Randhawa 45 Assistant Credit Control Manager
5 Jonathan Shepherd 25 Research Analyst
6 Lea Harris 51 Retired
HomeTown Rate Year Month Day Rating
1 Essex 4.5 2010 08 17 33.33333
2 Milton Keynes 4.5 2010 08 17 66.66667
3 Bradford 2.0 2010 08 17 100.00000
4 Birmingham 3.0 2010 08 17 83.33333
5 St Albans 6.0 2010 08 17 50.00000
6 Midlothian, Scotland 10.0 2010 08 17 16.66667
We can filter to a particular month to see if the rating of the show depends based on the month.
baker_results$Month <- as.integer(baker_results$Month)
baker_results$Year <- as.integer(baker_results$Year)
baker_results$Day <- as.integer(baker_results$Day)
str(baker_results)'data.frame': 95 obs. of 9 variables:
$ Contestant: Factor w/ 95 levels "Ali Imdad","Alvin Magallanes",..: 4 17 21 36 40 48 51 58 63 76 ...
$ Age : int 30 31 24 45 25 51 44 48 37 31 ...
$ Occupation: Factor w/ 87 levels "Accountant","Advertising executive",..: 75 26 22 6 66 68 60 10 29 67 ...
$ HomeTown : Factor w/ 83 levels "Aldershot, Hampshire",..: 31 50 14 11 71 48 43 69 47 60 ...
$ Rate : num 4.5 4.5 2 3 6 10 4 NA 3 3.5 ...
$ Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ Month : int 8 8 8 8 8 8 8 8 8 8 ...
$ Day : int 17 17 17 17 17 17 17 17 17 17 ...
$ Rating : num 33.3 66.7 100 83.3 50 ...
[1] 0
[1] 0
[1] 0
[1] 0
[1] 1
[1] 0
[1] 0
[1] 0
[1] 0
There is one missing value in Rate. We can remove that data set from our analysis.
[1] 0
Conclusion
There is no linear correlation between Rate and Rating. This means, the ratings of the show does not depend on how good the Contestant is.