The main purpose of ‘Assignment: Lab1’ is to analyse a data set named ‘present’ and extract the information it contains. The data set holds present-day birth records for the United States, i.e. yearly counts of boys and girls born. Several data exploration and data visualization techniques are used for this purpose.
install.packages("tidyverse")
install.packages("openintro")
knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(openintro)
data('present', package='openintro')
head(present)
## # A tibble: 6 × 3
## year boys girls
## <dbl> <dbl> <dbl>
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869
Data will be explored and visualized based on the following questions:
1. What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
2. How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
3. What are the minimum and maximum number of boy births in a year in the U.S.?
4. Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.?
5. In what year did we see the most total number of births in the U.S.?
** Ans-1: Question 1 has three parts. The first part is to find the years included in the data set. The following code can be used:
data('present', package='openintro')
years<-present$year
years
## [1] 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954
## [16] 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
## [31] 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
## [46] 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## [61] 2000 2001 2002
Running the code above shows that the data set covers 63 years, from 1940 to 2002.
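Instead of reading the range off the printed vector, the coverage can be summarised directly. The following is a minimal base R sketch; the vector `years` is a stand-in for `present$year` built from the values printed above, and the names `year_span`, `n_years` and `no_gaps` are introduced here only for illustration.

```r
# Stand-in for present$year: the 63 consecutive years printed above (assumption).
years <- 1940:2002

year_span <- range(years)           # first and last year covered
n_years   <- length(years)          # number of yearly records
no_gaps   <- all(diff(years) == 1)  # TRUE when no year is skipped

print(year_span)
print(n_years)
print(no_gaps)
```

With the real data, `range(present$year)` and `length(present$year)` would be used in the same way.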
The second part is to find the dimensions of the data frame, and the third part is to find the variable (column) names. The following code answers both:
data('present', package='openintro')
glimpse(present)
## Rows: 63
## Columns: 3
## $ year <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950…
## $ boys <dbl> 1211684, 1289734, 1444365, 1508959, 1435301, 1404587, 1691220, 1…
## $ girls <dbl> 1148715, 1223693, 1364631, 1427901, 1359499, 1330869, 1597452, 1…
The data frame has dimensions (63, 3), i.e. 63 rows and 3 columns; in other words, 63 observations of 3 variables. The variables (columns) are named year, boys and girls. The dimensions can also be obtained with the following code:
data('present', package='openintro')
dim(present)
## [1] 63 3
This confirms the (63, 3) dimensions, where 63 is the number of rows and 3 is the number of columns.
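The row count, column count and column names can also be retrieved one at a time, which is convenient when the values are needed in later code. This is a small base R sketch; `present_head` is a hypothetical stand-in that holds only the six rows printed by `head(present)`, so it reports 6 rows rather than 63.

```r
# Six rows copied from the head(present) output in this report (sample only).
present_head <- data.frame(
  year  = 1940:1945,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869)
)

n_rows    <- nrow(present_head)   # number of observations
n_cols    <- ncol(present_head)   # number of variables
col_names <- names(present_head)  # variable (column) names

print(n_rows)
print(n_cols)
print(col_names)
```

Applied to the full data set, `nrow(present)`, `ncol(present)` and `names(present)` return the same information as `dim()` and `glimpse()`.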
** Ans-2: To compare the counts in the ‘present’ data set with those in the ‘arbuthnot’ data set, the ‘arbuthnot’ data set is loaded and inspected with the following code:
data('arbuthnot', package='openintro')
glimpse(arbuthnot)
## Rows: 82
## Columns: 3
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639…
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366…
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784…
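The ‘arbuthnot’ data set has dimensions (82, 3), whereas the ‘present’ data set has dimensions (63, 3). Both data frames have the same number of variables (columns) with the same names: year, boys and girls. The counts, however, are not of a similar magnitude: the yearly values shown for ‘arbuthnot’ are in the thousands (roughly 4,100 to 5,400), while those shown for ‘present’ are in the millions (roughly 1.1 to 1.7 million), so the U.S. counts are hundreds of times larger.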
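To address the magnitude part of the question numerically, the yearly totals of the two data sets can be compared. This is a minimal sketch in base R that uses only the first six rows of each data set as printed in this report; the names `present_head`, `arbuthnot_head`, `scale_ratio` and `magnitude_gap` are introduced here for illustration, and the result describes these samples rather than the full data.

```r
# First six rows of each data set, copied from the printed output in this report.
present_head <- data.frame(
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869)
)
arbuthnot_head <- data.frame(
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820)
)

# Mean yearly total (boys + girls) in each sample.
present_mean   <- mean(present_head$boys + present_head$girls)
arbuthnot_mean <- mean(arbuthnot_head$boys + arbuthnot_head$girls)

# How many times larger the U.S. totals are, and the gap in orders of magnitude.
scale_ratio   <- present_mean / arbuthnot_mean
magnitude_gap <- log10(scale_ratio)

print(scale_ratio)
print(magnitude_gap)
```

A `magnitude_gap` between 2 and 3 means the U.S. totals are hundreds of times larger than Arbuthnot's, i.e. the two sets of counts are not of a similar magnitude.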
** Ans-3: To find the minimum and maximum number of boy births in a year, the following code can be used:
present %>%
  summarize(min = min(boys), max = max(boys))
## # A tibble: 1 × 2
## min max
## <dbl> <dbl>
## 1 1211684 2186274
The minimum number of boy births in a year is 1211684 and the maximum is 2186274.
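The summary above gives the extreme values but not the years in which they occurred. One way to recover the years is `which.min()` and `which.max()`, sketched here in base R on a hypothetical subset, `present_subset`, built from rows printed elsewhere in this report (1940 to 1945, plus 1960 and 1961). The subset contains the rows whose boy counts match the reported minimum and maximum.

```r
# Subset of rows copied from the printed output: 1940-1945 plus 1960 and 1961.
present_subset <- data.frame(
  year = c(1940, 1941, 1942, 1943, 1944, 1945, 1960, 1961),
  boys = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587,
           2179708, 2186274)
)

# which.min()/which.max() return the row position of the extreme value.
min_row <- present_subset[which.min(present_subset$boys), ]
max_row <- present_subset[which.max(present_subset$boys), ]

print(min_row)
print(max_row)
```

On the full data set the same two lines, with `present` in place of `present_subset`, would return the complete rows for the minimum and maximum years.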
** Ans-4: To plot the proportion of boys born over time, the following steps can be used:
1st step: Add two new variables, total and boy_ratio, to the data frame:
present <- present %>%
  mutate(total = boys + girls, boy_ratio = boys / total)
head(present)
## # A tibble: 6 × 5
## year boys girls total boy_ratio
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1940 1211684 1148715 2360399 0.513
## 2 1941 1289734 1223693 2513427 0.513
## 3 1942 1444365 1364631 2808996 0.514
## 4 1943 1508959 1427901 2936860 0.514
## 5 1944 1435301 1359499 2794800 0.514
## 6 1945 1404587 1330869 2735456 0.513
2nd step: Plotting the ‘Proportion of boys born over time’ graph:
ggplot(data = present, aes(x = year, y = boy_ratio)) +
  geom_line(colour = "blue") +
  ggtitle("Proportion of boys born over time")
The “Proportion of boys born over time” plot shows that the proportion of boy births declines overall between 1940 and 2002, although it fluctuates from year to year within this period.
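A visual trend can be backed up with a number, for example the slope of a least-squares line through the proportions. The sketch below uses base R and only eight rows printed in this report (1940 to 1945, 1990 and 1991), held in a hypothetical data frame `sample_rows`, so the slope is illustrative and not the estimate for all 63 years.

```r
# Rows copied from the printed output in this report (sample only).
sample_rows <- data.frame(
  year  = c(1940, 1941, 1942, 1943, 1944, 1945, 1990, 1991),
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301, 1404587,
            2129495, 2101518),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499, 1330869,
            2028717, 2009389)
)
sample_rows$boy_ratio <- sample_rows$boys / (sample_rows$boys + sample_rows$girls)

# Slope of a least-squares line: change in boy_ratio per year.
trend <- lm(boy_ratio ~ year, data = sample_rows)
slope_per_year <- coef(trend)[["year"]]

# The proportion can fall over time and still stay above one half.
always_above_half <- all(sample_rows$boy_ratio > 0.5)

print(slope_per_year)
print(always_above_half)
```

A negative `slope_per_year` is consistent with the declining trend in the plot, while `always_above_half` separates that trend from the question of whether boys still outnumber girls.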
To check whether boys are born in greater proportion than girls in the U.S., the following code can be used:
present <- present %>%
  mutate(girl_ratio = girls / total, more_boy_ratio = boy_ratio > girl_ratio)
present
## # A tibble: 63 × 7
## year boys girls total boy_ratio girl_ratio more_boy_ratio
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1940 1211684 1148715 2360399 0.513 0.487 TRUE
## 2 1941 1289734 1223693 2513427 0.513 0.487 TRUE
## 3 1942 1444365 1364631 2808996 0.514 0.486 TRUE
## 4 1943 1508959 1427901 2936860 0.514 0.486 TRUE
## 5 1944 1435301 1359499 2794800 0.514 0.486 TRUE
## 6 1945 1404587 1330869 2735456 0.513 0.487 TRUE
## 7 1946 1691220 1597452 3288672 0.514 0.486 TRUE
## 8 1947 1899876 1800064 3699940 0.513 0.487 TRUE
## 9 1948 1813852 1721216 3535068 0.513 0.487 TRUE
## 10 1949 1826352 1733177 3559529 0.513 0.487 TRUE
## # … with 53 more rows
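In every row shown above, more_boy_ratio is TRUE, i.e. the proportion of boys (about 0.51) exceeds that of girls (about 0.49). This is consistent with Arbuthnot’s observation that boys are born in greater proportion than girls.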
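Because a printed tibble shows only its first ten rows, a single logical summary is a more reliable check than reading the output by eye. The following base R sketch uses the ten rows printed above (1940 to 1949) in a hypothetical data frame `first_ten`; comparing `boys > girls` is equivalent to comparing the two ratios, since both share the same denominator.

```r
# The ten rows printed above, copied from this report (sample only).
first_ten <- data.frame(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# TRUE only if boys outnumber girls in every year of the sample.
boys_always_more <- all(first_ten$boys > first_ten$girls)

# Number of years (if any) in which the pattern fails.
n_exceptions <- sum(first_ten$boys <= first_ten$girls)

print(boys_always_more)
print(n_exceptions)
```

Running `all(present$more_boy_ratio)` on the full data frame would extend the same check to all 63 years.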
** Ans-5: To find the year with the highest total number of births in the U.S., the following code can be used:
# Display the rows sorted by total births, largest first
present %>%
  arrange(desc(total))
## # A tibble: 63 × 7
## year boys girls total boy_ratio girl_ratio more_boy_ratio
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1961 2186274 2082052 4268326 0.512 0.488 TRUE
## 2 1960 2179708 2078142 4257850 0.512 0.488 TRUE
## 3 1957 2179960 2074824 4254784 0.512 0.488 TRUE
## 4 1959 2173638 2071158 4244796 0.512 0.488 TRUE
## 5 1958 2152546 2051266 4203812 0.512 0.488 TRUE
## 6 1962 2132466 2034896 4167362 0.512 0.488 TRUE
## 7 1956 2133588 2029502 4163090 0.513 0.487 TRUE
## 8 1990 2129495 2028717 4158212 0.512 0.488 TRUE
## 9 1991 2101518 2009389 4110907 0.511 0.489 TRUE
## 10 1963 2101632 1996388 4098020 0.513 0.487 TRUE
## # … with 53 more rows
The output above shows that the highest total number of births in the U.S. occurred in 1961. The same result can also be obtained as follows:
index<-which.max(present$total)
year<-present$year[index]
year
## [1] 1961
The code above gives the same year, i.e. 1961, for the highest total number of births in the U.S.
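When more than the single top year is of interest, the rows can be ranked without changing the order of the original data frame. This is a minimal base R sketch on a hypothetical data frame `births`, which holds six rows (1957 to 1962) copied from the sorted output above and put back in chronological order.

```r
# Six rows copied from the sorted output in this report, in chronological order.
births <- data.frame(
  year  = 1957:1962,
  boys  = c(2179960, 2152546, 2173638, 2179708, 2186274, 2132466),
  girls = c(2074824, 2051266, 2071158, 2078142, 2082052, 2034896)
)
births$total <- births$boys + births$girls

# order() returns row positions sorted by total; births itself is not modified.
ranked     <- births[order(births$total, decreasing = TRUE), ]
top3_years <- head(ranked$year, 3)

print(top3_years)
```

Keeping the ranking in a separate object (`ranked`) leaves `births` in year order, so later steps that assume chronological rows are unaffected.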
In this assignment, the ‘present’ data set was analysed with several data exploration and data visualization techniques to answer the questions above. It was a great opportunity to learn about exploratory data analysis techniques.