Question 1(a): Codebook

Data report overview

The dataset examined has the following dimensions:

Feature Result
Number of observations 578
Number of variables 4

Codebook summary table

Label  Variable  Class    # unique values  Missing  Description
       weight    numeric  212              0.00 %
       Time      numeric  12               0.00 %
       Chick     ordered  50               0.00 %
       Diet      factor   4                0.00 %

Variable list

weight

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 212
Median 103
1st and 3rd quartiles 63; 163.75
Min. and max. 35; 373


Time

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 12
Median 10
1st and 3rd quartiles 4; 16
Min. and max. 0; 21


Chick

Feature Result
Variable type ordered
Number of missing obs. 0 (0 %)
Number of unique values 50
Mode “13”
Reference category 18

  • Observed factor levels: "1", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "2", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "3", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "4", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "5", "50", "6", "7", "8", "9".

Diet

Feature Result
Variable type factor
Number of missing obs. 0 (0 %)
Number of unique values 4
Mode “1”
Reference category 1

  • Observed factor levels: "1", "2", "3", "4".
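
The dimensions, classes and summary statistics reported above are consistent with R's built-in ChickWeight dataset, although the codebook itself does not name its source. Assuming that is the data described, a minimal sketch for recomputing the main figures with base R might look like this:

```r
# Assumption: the codebook describes the built-in ChickWeight data
# (the report does not name the dataset explicitly).
data(ChickWeight, package = "datasets")

# Number of observations and variables
dim(ChickWeight)

# Leading class, unique-value count and missing count for each variable
sapply(ChickWeight, function(x) class(x)[1])
sapply(ChickWeight, function(x) length(unique(x)))
sapply(ChickWeight, function(x) sum(is.na(x)))

# Minimum, quartiles, median and maximum of the two numeric variables
quantile(ChickWeight$weight)
quantile(ChickWeight$Time)
```

If the assumption holds, these calls should agree with the Feature/Result tables in the variable list.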



Question 1(b)

This question uses the dplyr library and a dataset of income by state.

dplyr is an R package for manipulating, cleaning and summarising tabular data held in data frames. It makes data exploration and data manipulation in R easy and fast.

Below are five randomly selected rows of the dataset.

##   Index       State   Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008
## 1     C Connecticut 1610512 1232844 1181949 1518933 1841266 1976976 1764457
## 2     O    Oklahoma 1173918 1334639 1663622 1798714 1312574 1708245 1256746
## 3     A     Alabama 1296530 1317711 1118631 1492583 1107408 1440134 1945229
## 4     I        Iowa 1499269 1444576 1576367 1388924 1554813 1452911 1317983
## 5     T   Tennessee 1811867 1485909 1974179 1157059 1786132 1399191 1826406
##     Y2009   Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
## 1 1972730 1968730 1945524 1228529 1582249 1503156 1718072
## 2 1853142 1673831 1822933 1674707 1900523 1956742 1307678
## 3 1944173 1237582 1440756 1186741 1852841 1558906 1916661
## 4 1150783 1751389 1992996 1501879 1173694 1431705 1641866
## 5 1326460 1231739 1469785 1849041 1560887 1349173 1162164
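
The report does not show how the random rows were drawn. One possible way, sketched here on a small hypothetical data frame (incomeSample, built from a few rows and columns of the tables in this report), is dplyr's sample_n(); newer dplyr versions also offer slice_sample() for the same purpose.

```r
library(dplyr)

# Hypothetical stand-in for q2data: five states, two year columns,
# with values copied from the tables in this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

set.seed(1)                            # makes the draw reproducible
picked <- sample_n(incomeSample, 3)    # 3 rows, sampled without replacement
picked
```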


i. filter()

The filter() function keeps the rows of a data frame that satisfy one or more logical conditions.
Its syntax is
filter( [Data Frame] , [logical conditions] )

For example, filter() can be used to keep only the rows whose Index column equals "M".

filter(q2data, Index == "M")
##   Index         State   Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008
## 1     M         Maine 1582720 1678622 1208496 1912040 1438549 1330014 1295877
## 2     M      Maryland 1579713 1404700 1849798 1397738 1310270 1789128 1112765
## 3     M Massachusetts 1647582 1686259 1620601 1777250 1531641 1380529 1978904
## 4     M      Michigan 1295635 1149931 1601027 1340716 1729449 1567494 1990431
## 5     M     Minnesota 1729921 1675204 1903907 1561839 1985692 1148621 1328133
## 6     M   Mississippi 1983285 1292558 1631325 1943311 1354579 1731643 1428291
## 7     M      Missouri 1221316 1858368 1773451 1573967 1374863 1486197 1735099
## 8     M       Montana 1877154 1540099 1332722 1273327 1625721 1983568 1251742
##     Y2009   Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
## 1 1969163 1627262 1706080 1437088 1318546 1116792 1529233
## 2 1967225 1486246 1872327 1175819 1314343 1979529 1569566
## 3 1567651 1761048 1658538 1482203 1731917 1669749 1963337
## 4 1575185 1267626 1274673 1709853 1815596 1965196 1646634
## 5 1890633 1995304 1575533 1910216 1972021 1515366 1864553
## 6 1568049 1383227 1629132 1988270 1907777 1649668 1991232
## 7 1800620 1164202 1425363 1800052 1698105 1767835 1996005
## 8 1592690 1350619 1520064 1185225 1465705 1110394 1125903
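
The syntax allows more than one logical condition. As a minimal sketch on a small hypothetical data frame (incomeSample, with values copied from the tables in this report), conditions separated by commas must all hold, while | and %in% express alternatives:

```r
library(dplyr)

# Hypothetical stand-in for q2data, using values from this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

# Comma-separated conditions are combined with AND
bothConditions <- filter(incomeSample, Index == "M", Y2002 > 1580000)

# %in% keeps rows whose Index matches any of the listed values
eitherIndex <- filter(incomeSample, Index %in% c("A", "C"))

bothConditions
eitherIndex
```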


ii. arrange()

The arrange() function sorts the rows of a data frame by the values of one or more variables.
By default, it sorts in ascending order. To sort in descending order, wrap the variable in desc().
Its syntax is
arrange( [Data Frame] , [variable(s) to sort by] )

For example, arrange() can be used to sort the rows by the Index variable. (Only the first 8 rows are shown.)

arrangedData1 <- arrange(q2data, Index)
head(arrangedData1, 8)
##   Index       State   Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008
## 1     A     Alabama 1296530 1317711 1118631 1492583 1107408 1440134 1945229
## 2     A      Alaska 1170302 1960378 1818085 1447852 1861639 1465841 1551826
## 3     A     Arizona 1742027 1968140 1377583 1782199 1102568 1109382 1752886
## 4     A    Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104
## 5     C  California 1685349 1675807 1889570 1480280 1735069 1812546 1487315
## 6     C    Colorado 1343824 1878473 1886149 1236697 1871471 1814218 1875146
## 7     C Connecticut 1610512 1232844 1181949 1518933 1841266 1976976 1764457
## 8     D    Delaware 1330403 1268673 1706751 1403759 1441351 1300836 1762096
##     Y2009   Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
## 1 1944173 1237582 1440756 1186741 1852841 1558906 1916661
## 2 1436541 1629616 1230866 1512804 1985302 1580394 1979143
## 3 1554330 1300521 1130709 1907284 1363279 1525866 1647724
## 4 1628980 1669295 1928238 1216675 1591896 1360959 1329341
## 5 1663809 1624509 1639670 1921845 1156536 1388461 1644607
## 6 1752387 1913275 1665877 1491604 1178355 1383978 1330736
## 7 1972730 1968730 1945524 1228529 1582249 1503156 1718072
## 8 1553585 1370984 1318669 1984027 1671279 1803169 1627508


As another example, arrange() is used to sort the rows by Index in ascending order and, within each Index value, by Y2002 in descending order. (Only the first 8 rows are shown.)

arrangedData2 <- arrange(q2data, Index, desc(Y2002))
head(arrangedData2, 8)
##   Index       State   Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008
## 1     A     Arizona 1742027 1968140 1377583 1782199 1102568 1109382 1752886
## 2     A    Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104
## 3     A     Alabama 1296530 1317711 1118631 1492583 1107408 1440134 1945229
## 4     A      Alaska 1170302 1960378 1818085 1447852 1861639 1465841 1551826
## 5     C  California 1685349 1675807 1889570 1480280 1735069 1812546 1487315
## 6     C Connecticut 1610512 1232844 1181949 1518933 1841266 1976976 1764457
## 7     C    Colorado 1343824 1878473 1886149 1236697 1871471 1814218 1875146
## 8     D    Delaware 1330403 1268673 1706751 1403759 1441351 1300836 1762096
##     Y2009   Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
## 1 1554330 1300521 1130709 1907284 1363279 1525866 1647724
## 2 1628980 1669295 1928238 1216675 1591896 1360959 1329341
## 3 1944173 1237582 1440756 1186741 1852841 1558906 1916661
## 4 1436541 1629616 1230866 1512804 1985302 1580394 1979143
## 5 1663809 1624509 1639670 1921845 1156536 1388461 1644607
## 6 1972730 1968730 1945524 1228529 1582249 1503156 1718072
## 7 1752387 1913275 1665877 1491604 1178355 1383978 1330736
## 8 1553585 1370984 1318669 1984027 1671279 1803169 1627508
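
desc() can be applied to any of the sort variables, including character columns. The sketch below uses a small hypothetical data frame (incomeSample, with values copied from the tables in this report) to sort Index in descending order and break ties by ascending Y2002:

```r
library(dplyr)

# Hypothetical stand-in for q2data, using values from this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

# Index from Z to A; within the same Index, smallest Y2002 first
sortedSample <- arrange(incomeSample, desc(Index), Y2002)
sortedSample
```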


iii. mutate()

The mutate() function adds new variables to a data frame, computed from existing ones.
Its syntax is
mutate( [Data Frame] , [Expression(s)] )
For example, mutate() is used to add a variable sumOf2years, the sum of Y2002 and Y2003, to the dataset. (Only the first 8 rows are shown.)

mutatedData <- mutate(q2data, sumOf2years=Y2002+Y2003)
head(mutatedData, 8)
##   Index       State   Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008
## 1     A     Alabama 1296530 1317711 1118631 1492583 1107408 1440134 1945229
## 2     A      Alaska 1170302 1960378 1818085 1447852 1861639 1465841 1551826
## 3     A     Arizona 1742027 1968140 1377583 1782199 1102568 1109382 1752886
## 4     A    Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104
## 5     C  California 1685349 1675807 1889570 1480280 1735069 1812546 1487315
## 6     C    Colorado 1343824 1878473 1886149 1236697 1871471 1814218 1875146
## 7     C Connecticut 1610512 1232844 1181949 1518933 1841266 1976976 1764457
## 8     D    Delaware 1330403 1268673 1706751 1403759 1441351 1300836 1762096
##     Y2009   Y2010   Y2011   Y2012   Y2013   Y2014   Y2015 sumOf2years
## 1 1944173 1237582 1440756 1186741 1852841 1558906 1916661     2614241
## 2 1436541 1629616 1230866 1512804 1985302 1580394 1979143     3130680
## 3 1554330 1300521 1130709 1907284 1363279 1525866 1647724     3710167
## 4 1628980 1669295 1928238 1216675 1591896 1360959 1329341     3480458
## 5 1663809 1624509 1639670 1921845 1156536 1388461 1644607     3361156
## 6 1752387 1913275 1665877 1491604 1178355 1383978 1330736     3222297
## 7 1972730 1968730 1945524 1228529 1582249 1503156 1718072     2843356
## 8 1553585 1370984 1318669 1984027 1671279 1803169 1627508     2599076
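
mutate() accepts several expressions at once, and a later expression can use a variable created earlier in the same call. A minimal sketch on a small hypothetical data frame (incomeSample, with values copied from the tables in this report; meanOf2years is an illustrative name):

```r
library(dplyr)

# Hypothetical stand-in for q2data, using values from this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

# meanOf2years is computed from sumOf2years, defined just before it
mutatedSample <- mutate(
  incomeSample,
  sumOf2years  = Y2002 + Y2003,
  meanOf2years = sumOf2years / 2
)
mutatedSample
```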


iv. select()

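The select() function keeps only the desired variables (columns) of a data frame.
Its syntax is
select( [Data Frame] , [Variables by name or function] )
For example, select() is used to keep the columns from Index through Y2006. The dplyr:: prefix makes sure dplyr's select() is called even if another loaded package defines a function of the same name. (Only the first 8 rows are shown.)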

selectedData <- dplyr::select(q2data,Index:Y2006)
head(selectedData, 8)
##   Index       State   Y2002   Y2003   Y2004   Y2005   Y2006
## 1     A     Alabama 1296530 1317711 1118631 1492583 1107408
## 2     A      Alaska 1170302 1960378 1818085 1447852 1861639
## 3     A     Arizona 1742027 1968140 1377583 1782199 1102568
## 4     A    Arkansas 1485531 1994927 1119299 1947979 1669191
## 5     C  California 1685349 1675807 1889570 1480280 1735069
## 6     C    Colorado 1343824 1878473 1886149 1236697 1871471
## 7     C Connecticut 1610512 1232844 1181949 1518933 1841266
## 8     D    Delaware 1330403 1268673 1706751 1403759 1441351
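
The syntax above also allows variables to be chosen by a helper function rather than by name. As a sketch on a small hypothetical data frame (incomeSample, with values copied from the tables in this report), starts_with() picks columns by prefix and a minus sign drops a column:

```r
library(dplyr)

# Hypothetical stand-in for q2data, using values from this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

# Keep State plus every column whose name starts with "Y"
yearColumns <- dplyr::select(incomeSample, State, starts_with("Y"))

# Keep everything except Index
withoutIndex <- dplyr::select(incomeSample, -Index)

names(yearColumns)
names(withoutIndex)
```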


v. summarise()

The summarise() function reduces a data frame to a single row of summary values computed by functions such as mean() and median().
Its syntax is
summarise( [Data Frame] , [Summary Functions] )
For example, summarise() is used to calculate the mean and median of the variable Y2015.

summarise(q2data, mean2015 = mean(Y2015), median2015 = median(Y2015))
##   mean2015 median2015
## 1  1588297    1627508
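
Each named argument to summarise() becomes one column of the one-row result, and the values equal what the corresponding base R functions return on the column. A minimal sketch on a small hypothetical data frame (incomeSample, with values copied from the tables in this report):

```r
library(dplyr)

# Hypothetical stand-in for q2data, using values from this report.
incomeSample <- data.frame(
  Index = c("A", "A", "C", "M", "M"),
  State = c("Alabama", "Alaska", "Connecticut", "Maine", "Maryland"),
  Y2002 = c(1296530, 1170302, 1610512, 1582720, 1579713),
  Y2003 = c(1317711, 1960378, 1232844, 1678622, 1404700)
)

# One row; n() counts the rows that were summarised
summarySample <- summarise(
  incomeSample,
  mean2002   = mean(Y2002),
  median2002 = median(Y2002),
  rows       = n()
)
summarySample

# Cross-check against base R on the same column
mean(incomeSample$Y2002)
median(incomeSample$Y2002)
```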