HarvardX PH125.1x: Chapter 2 Exercises

Section 2.3

1. What is the sum of the first 100 positive integers? The formula for the sum of integers 1 through n is $n(n+1)/2$. Define n=100 and then use R to compute the sum of 1 through 100 using the formula. What is the sum?

n<-100
num100<-n*(n+1)/2
num100

## [1] 5050

2. Now use the same formula to compute the sum of the integers from 1 through 1,000.

n<-1000
num1000<-n*(n+1)/2
num1000

## [1] 500500

3. Look at the result of typing the following code into R:

n <- 1000
x <- seq(1, n)
sum(x)

## [1] 500500

Based on the result, what do you think the functions seq and sum do? You can use help.

seq creates a vector of consecutive integers (here 1 through n), and sum adds up its elements.
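As a quick cross-check, the closed-form formula and the brute-force sum can be compared side by side. This is only a sketch; sum_formula is a hypothetical helper name that does not appear in the exercises.

```r
# Hypothetical helper: closed-form sum of the integers 1 through n.
sum_formula <- function(n) n * (n + 1) / 2

# Compare the formula with adding up the sequence directly.
for (n in c(100, 1000)) {
  brute <- sum(seq(1, n))
  cat("n =", n, "formula:", sum_formula(n), "sum(seq):", brute, "\n")
}
```

Both approaches should agree for any positive integer n, which makes this a handy way to catch a typo in the formula.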

4. In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.

log(sqrt(100), base=10)

## [1] 1
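To see the inside-out evaluation explicitly, the nested call can be split into two steps. The names inner, outer, and nested below are illustrative only.

```r
# Step 1: the inner function is evaluated first.
inner <- sqrt(100)

# Step 2: its result becomes the argument of the outer function.
outer <- log(inner, base = 10)

# The nested one-liner gives the same value; log10 is shorthand for base 10.
nested <- log10(sqrt(100))

inner
outer
nested
```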

5. Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.

x<-2
log(exp(x))

## [1] 2
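Trying a single value does not show that an expression always returns x, so it may help to test several values at once, including zero and a negative number. The sketch below checks log(exp(x)) and, as one hypothetical alternative for contrast, the reversed composition exp(log(x)).

```r
# Test values, including a negative number and zero.
x <- c(-3, 0, 0.5, 2, 10)

# log(exp(x)) recovers x (up to floating-point error) for every test value.
roundtrip <- log(exp(x))
all.equal(roundtrip, x)

# Reversing the order fails for negative inputs,
# because log() of a negative number is NaN (R also issues a warning).
reversed <- suppressWarnings(exp(log(x)))
is.nan(reversed[1])
```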

Section 2.5

1. Load the US murders dataset.

Use the function str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?

library(dslabs)
data(murders)
str(murders)

## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.

2. What are the column names used by the data frame for these five variables?

names(murders)

## [1] "state"      "abb"        "region"     "population" "total"

3. Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?

a<-murders$abb
class(a)

## [1] "character"

4. Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.

b<-murders[,2]
b

##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"

identical(a, b)

## [1] TRUE
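The different ways of extracting a column can be compared on a small stand-in data frame. This is a sketch with made-up rows (df is not the murders dataset), so it runs without dslabs.

```r
# A tiny stand-in data frame with the same first two column names.
df <- data.frame(state = c("Alabama", "Alaska", "Arizona"),
                 abb = c("AL", "AK", "AZ"),
                 stringsAsFactors = FALSE)

by_dollar <- df$abb        # accessor
by_matrix <- df[, 2]       # row/column brackets
by_double <- df[["abb"]]   # double brackets with the column name
by_single <- df["abb"]     # single brackets keep the data frame structure

identical(by_dollar, by_matrix)
identical(by_dollar, by_double)
class(by_single)
```

Note that single brackets with just a column name return a one-column data frame rather than a vector, so identical would report FALSE for that case.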

5. We saw that the region column stores a factor. You can corroborate this by typing:

class(murders$region)

## [1] "factor"

With one line of code, use the function levels and length to determine the number of regions defined by this dataset.

length(levels(murders$region))

## [1] 4

6. The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.

table(murders$region)

## 
##     Northeast         South North Central          West 
##             9            17            12            13
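The relationship between levels, length, and table can be sketched on a small made-up factor. The object names region_demo and counts are illustrative.

```r
# A made-up factor with three levels.
region_demo <- factor(c("South", "West", "West", "South", "Northeast"),
                      levels = c("Northeast", "South", "West"))

levels(region_demo)           # the distinct categories, in level order
length(levels(region_demo))   # number of categories
nlevels(region_demo)          # base R shortcut for the same count

# table counts how many elements fall in each level.
counts <- table(region_demo)
counts
```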

Section 2.8

1. Use the function c to create a vector with the average high temperatures in January for Beijing, Lagos, Paris, Rio de Janeiro, San Juan, and Toronto, which are 35, 88, 42, 84, 81, and 30 degrees Fahrenheit. Call the object temp.

temp<-c(35, 88, 42, 84, 81, 30)

2. Now create a vector with the city names and call the object city.

city<-c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

3. Use the names function and the objects defined in the previous exercises to associate the temperature data with its corresponding city.

names(temp)<-city
temp

##        Beijing          Lagos          Paris Rio de Janeiro       San Juan 
##             35             88             42             84             81 
##        Toronto 
##             30

4. Use the [ and : operators to access the temperature of the first three cities on the list.

temp[1:3]

## Beijing   Lagos   Paris 
##      35      88      42

5. Use the [ operator to access the temperature of Paris and San Juan.

temp[c(3,5)]

##    Paris San Juan 
##       42       81
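Because temp has names, the same entries can also be selected by name instead of by position, which does not depend on remembering the order of the cities. A minimal sketch, redefining the objects so it runs on its own:

```r
temp <- c(35, 88, 42, 84, 81, 30)
names(temp) <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

by_position <- temp[c(3, 5)]
by_name <- temp[c("Paris", "San Juan")]

# Both approaches select the same named elements.
identical(by_position, by_name)
```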

6. Use the : operator to create a sequence of numbers 12,13,14,…, 73

12:73

##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [26] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [51] 62 63 64 65 66 67 68 69 70 71 72 73

7. Create a vector containing all the positive odd numbers smaller than 100.

seq(1, 100, by=2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

8. Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the list have? Hint: use seq and length.

length(seq(6, 55, 4/7))

## [1] 86
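The count can also be checked by hand: the number of steps that fit between 6 and 55 is the whole part of (55 - 6)/(4/7), plus one for the starting value. The sketch below compares that arithmetic with seq; the variable names are illustrative.

```r
from <- 6
to <- 55
by <- 4 / 7

# Whole number of steps that fit, plus one for the starting value.
n_formula <- floor((to - from) / by) + 1

s <- seq(from, to, by)
n_seq <- length(s)
last_value <- tail(s, 1)   # the last element does not pass 55

n_formula
n_seq
last_value
```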

9. What is the class of the following object a <- seq(1, 10, 0.5)?

a<-seq(1, 10, .5)
class(a)

## [1] "numeric"

10. What is the class of the following object a <- seq(1, 10)?

a<-seq(1, 10)
class(a)

## [1] "integer"

11. The class of the object a <- 1 is numeric, not integer. R defaults to numeric; to force an integer, you need to add the letter L. Confirm that the class of 1L is integer.

class(a<-1)

## [1] "numeric"

class(a<-1L)

## [1] "integer"
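The last three exercises can be summarized in one short sketch. The comparisons at the end show that an integer and a numeric with the same value compare as equal, even though they are not identical objects.

```r
class(1)                  # "numeric"
class(1L)                 # "integer"
class(seq(1, 10))         # "integer": whole-number from and to, no by
class(seq(1, 10, 0.5))    # "numeric": fractional step

typeof(1)                 # "double", the storage type behind "numeric"

same_value <- 1L == 1              # TRUE: values are compared
same_object <- identical(1L, 1)    # FALSE: types differ
same_value
same_object
```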

12. Define the following vector:

x <- c("1", "3", "5")

and coerce it to get integers.

x<-as.integer(x)
class(x)

## [1] "integer"
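Coercion converts the existing character values rather than retyping them. A minimal sketch using as.integer, including what happens when an entry cannot be converted (x_int and bad are illustrative names):

```r
x <- c("1", "3", "5")
x_int <- as.integer(x)
class(x_int)
sum(x_int)   # arithmetic now works; it would fail on the character vector

# Entries that do not look like numbers become NA (R also issues a warning).
bad <- suppressWarnings(as.integer(c("1", "b", "3")))
bad
```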

Section 2.10

For these exercises we will use the US murders dataset. Make sure you load it prior to starting.

library(dslabs) 
data("murders")

1. Use the $ operator to access the population size data and store it as the object pop. Then use the sort function to redefine pop so that it is sorted. Finally, use the [ operator to report the smallest population size.

library(dslabs)
data("murders")
pop<-sort(murders$population)
pop[1]

## [1] 563626

2. Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use order instead of sort.

popo<-order(murders$population)
popo[1]

## [1] 51

3. We can actually perform the same operation as in the previous exercise using the function which.min. Write one line of code that does this.

which.min(murders$population)

## [1] 51

4. Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable states to be the state names from the murders data frame. Report the name of the state with the smallest population.

states<-murders$state
states[which.min(murders$population)]

## [1] "Wyoming"
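The connection between sort, order, and which.min is easier to see on a handful of values. This sketch uses made-up numbers and labels rather than the murders data.

```r
values <- c(31, 4, 15, 92, 65)
labels <- c("A", "B", "C", "D", "E")

sorted <- sort(values)     # the values themselves, smallest first
idx <- order(values)       # the positions that would sort the values

# Indexing by the order reproduces the sorted vector.
identical(values[idx], sorted)

# The first entry of the order is the position of the minimum.
idx[1] == which.min(values)

# That position can be used to look up the matching label.
labels[which.min(values)]
```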

5. You can create a data frame using the data.frame function. Here is a quick example:

temp <- c(35, 88, 42, 84, 81, 30) 
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto") 
city_temps <- data.frame(name = city, temperature = temp)

Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.

ranks<-rank(murders$population)
my_df<-data.frame(states, ranks)
my_df

##                  states ranks
## 1               Alabama    29
## 2                Alaska     5
## 3               Arizona    36
## 4              Arkansas    20
## 5            California    51
## 6              Colorado    30
## 7           Connecticut    23
## 8              Delaware     7
## 9  District of Columbia     2
## 10              Florida    49
## 11              Georgia    44
## 12               Hawaii    12
## 13                Idaho    13
## 14             Illinois    47
## 15              Indiana    37
## 16                 Iowa    22
## 17               Kansas    19
## 18             Kentucky    26
## 19            Louisiana    27
## 20                Maine    11
## 21             Maryland    33
## 22        Massachusetts    38
## 23             Michigan    43
## 24            Minnesota    31
## 25          Mississippi    21
## 26             Missouri    34
## 27              Montana     8
## 28             Nebraska    14
## 29               Nevada    17
## 30        New Hampshire    10
## 31           New Jersey    41
## 32           New Mexico    16
## 33             New York    48
## 34       North Carolina    42
## 35         North Dakota     4
## 36                 Ohio    45
## 37             Oklahoma    24
## 38               Oregon    25
## 39         Pennsylvania    46
## 40         Rhode Island     9
## 41       South Carolina    28
## 42         South Dakota     6
## 43            Tennessee    35
## 44                Texas    50
## 45                 Utah    18
## 46              Vermont     3
## 47             Virginia    40
## 48           Washington    39
## 49        West Virginia    15
## 50            Wisconsin    32
## 51              Wyoming     1

6. Repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most populous. Hint: create an object ind that stores the indexes needed to order the population values. Then use the bracket operator [ to re-order each column in the data frame.

ind<-order(murders$population)
my_df2<-data.frame(states[ind], ranks[ind])
my_df2

##             states.ind. ranks.ind.
## 1               Wyoming          1
## 2  District of Columbia          2
## 3               Vermont          3
## 4          North Dakota          4
## 5                Alaska          5
## 6          South Dakota          6
## 7              Delaware          7
## 8               Montana          8
## 9          Rhode Island          9
## 10        New Hampshire         10
## 11                Maine         11
## 12               Hawaii         12
## 13                Idaho         13
## 14             Nebraska         14
## 15        West Virginia         15
## 16           New Mexico         16
## 17               Nevada         17
## 18                 Utah         18
## 19               Kansas         19
## 20             Arkansas         20
## 21          Mississippi         21
## 22                 Iowa         22
## 23          Connecticut         23
## 24             Oklahoma         24
## 25               Oregon         25
## 26             Kentucky         26
## 27            Louisiana         27
## 28       South Carolina         28
## 29              Alabama         29
## 30             Colorado         30
## 31            Minnesota         31
## 32            Wisconsin         32
## 33             Maryland         33
## 34             Missouri         34
## 35            Tennessee         35
## 36              Arizona         36
## 37              Indiana         37
## 38        Massachusetts         38
## 39           Washington         39
## 40             Virginia         40
## 41           New Jersey         41
## 42       North Carolina         42
## 43             Michigan         43
## 44              Georgia         44
## 45                 Ohio         45
## 46         Pennsylvania         46
## 47             Illinois         47
## 48             New York         48
## 49              Florida         49
## 50                Texas         50
## 51           California         51
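rank and order answer different questions: rank gives the position each value would take after sorting, while order gives the indexes that perform the sort. The sketch below uses made-up values, and also shows that a whole data frame can be reordered by indexing its rows, as an alternative to reordering each column separately.

```r
values <- c(31, 4, 15, 92, 65)
labels <- c("A", "B", "C", "D", "E")

ranks_demo <- rank(values)   # 3 1 2 5 4
ind_demo <- order(values)    # 2 3 1 5 4

df_demo <- data.frame(label = labels, rank = ranks_demo)

# Reorder all columns at once by indexing the rows.
df_sorted <- df_demo[ind_demo, ]
df_sorted
```

With no ties, the ranks in the reordered data frame run from 1 up to the number of rows.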

7. The na_example vector represents a series of counts. You can quickly examine the object using:

data("na_example")   
str(na_example)

However, when we compute the average with the function mean, we obtain an NA:

mean(na_example)

The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has.

ind<-is.na(na_example)
sum(ind)

## [1] 145

8. Now compute the average again, but only for the entries that are not NA. Hint: remember the ! operator.

mean(na_example[!is.na(na_example)])

## [1] 2.301754
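The same steps can be traced on a short made-up vector. The sketch also shows the na.rm argument of mean, which gives the same result as filtering by hand.

```r
x <- c(2, NA, 4, 1, NA, 3)

ind_na <- is.na(x)
n_missing <- sum(ind_na)          # TRUE counts as 1, so this counts the NAs

mean_all <- mean(x)               # NA, because of the missing entries
mean_manual <- mean(x[!ind_na])   # keep only the non-missing entries
mean_narm <- mean(x, na.rm = TRUE)

n_missing
mean_manual
mean_narm
```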

Section 2.12

1. Previously we created this data frame:

temp <- c(35, 88, 42, 84, 81, 30) 
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto") 
city_temps <- data.frame(name = city, temperature = temp)

Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius. The conversion is $C=5/9*(F-32)$

temp <- c(35, 88, 42, 84, 81, 30) 
tempc<-(temp-32)*5/9
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto") 
city_temps <- data.frame(name=city, temperature=tempc)
city_temps

##             name temperature
## 1        Beijing    1.666667
## 2          Lagos   31.111111
## 3          Paris    5.555556
## 4 Rio de Janeiro   28.888889
## 5       San Juan   27.222222
## 6        Toronto   -1.111111
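Since arithmetic in R is applied element by element, the conversion can be wrapped in a small function and checked against known reference points. The function names f_to_c and c_to_f are hypothetical and not part of the exercise.

```r
# Hypothetical helpers for the conversion in both directions.
f_to_c <- function(f) 5 / 9 * (f - 32)
c_to_f <- function(celsius) 9 / 5 * celsius + 32

# Reference points: water freezes at 32 F and boils at 212 F.
f_to_c(c(32, 212))

# Converting there and back should recover the original temperatures.
temp <- c(35, 88, 42, 84, 81, 30)
all.equal(c_to_f(f_to_c(temp)), temp)
```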

2. What is the following sum $1/1^2 + 1/2^2 + 1/3^2 + \dots + 1/100^2$? Hint: thanks to Euler, we know it should be close to $\pi^2/6$.

j<-seq(1,100)
jsum<-sum(1/j^2)
eul<-pi^2/6
jdiff<-jsum-eul
jsum

## [1] 1.634984
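To see how close the partial sum gets to Euler's value, the gap can be computed for a few choices of n. This is a sketch; basel_partial is a hypothetical helper name. The remaining gap after n terms lies between 1/(n+1) and 1/n, so it shrinks roughly in proportion to 1/n.

```r
# Hypothetical helper: sum of 1/k^2 for k = 1, ..., n.
basel_partial <- function(n) {
  k <- seq_len(n)
  sum(1 / k^2)
}

target <- pi^2 / 6

for (n in c(10, 100, 1000)) {
  gap <- target - basel_partial(n)
  cat("n =", n, "partial sum:", basel_partial(n), "gap:", gap, "\n")
}
```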

3. Compute the per 100,000 murder rate for each state and store it in the object murder_rate. Then compute the average murder rate for the US using the function mean. What is the average?

murder_rate<-murders$total/murders$population*100000
mean(murder_rate)

## [1] 2.779125

Section 2.14

Start by loading the library and data.

library(dslabs)
data(murders)

1. Compute the per 100,000 murder rate for each state and store it in an object called murder_rate. Then use logical operators to create a logical vector named low that tells us which entries of murder_rate are lower than 1.

murder_rate<-murders$total/murders$population*100000
low<-murder_rate<1
sum(low)

## [1] 12

2. Now use the results from the previous exercise and the function which to determine the indices of murder_rate associated with values lower than 1.

which(murder_rate<1)

##  [1] 12 13 16 20 24 30 35 38 42 45 46 51

3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.

lowname<-murders$state[which(murder_rate<1)]
lowname

##  [1] "Hawaii"        "Idaho"         "Iowa"          "Maine"        
##  [5] "Minnesota"     "New Hampshire" "North Dakota"  "Oregon"       
##  [9] "South Dakota"  "Utah"          "Vermont"       "Wyoming"

4. Now extend the code from exercises 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector low and the logical operator &.

levels(murders$region)

## [1] "Northeast"     "South"         "North Central" "West"

lowne<-murders$state[which(murder_rate<1 & murders$region=="Northeast")]
lowne

## [1] "Maine"         "New Hampshire" "Vermont"
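Logical vectors can be combined with & and used directly inside the brackets; wrapping them in which is optional when only the selected values are needed. A sketch on made-up data (the state names S1 to S4 and the rates are invented):

```r
rate_demo <- c(0.5, 3.2, 0.8, 1.5)
region_demo <- c("Northeast", "South", "West", "Northeast")
state_demo <- c("S1", "S2", "S3", "S4")

low_demo <- rate_demo < 1                       # TRUE FALSE TRUE FALSE
in_ne <- region_demo == "Northeast"             # TRUE FALSE FALSE TRUE

# & compares element by element: TRUE only where both conditions hold.
both <- low_demo & in_ne

via_logical <- state_demo[both]
via_which <- state_demo[which(both)]

via_logical
identical(via_logical, via_which)
```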

5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?

murder_rate<-murders$total/murders$population*100000
mean(murder_rate)

## [1] 2.779125

belowavg<-murders$state[which(murder_rate<mean(murder_rate))]
nobelowavg<-sum(murder_rate<mean(murder_rate))
length(belowavg)

## [1] 27

6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: start by defining an index of the entries of murders$abb that match the three abbreviations, then use the [ operator to extract the states.

ind<-match(c("AK", "MI", "IA"), murders$abb)
murders$state[ind]

## [1] "Alaska"   "Michigan" "Iowa"

7. Use the %in% operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?

y<-c("MA", "ME", "MI", "MO", "MU") %in% murders$abb
y

## [1]  TRUE  TRUE  TRUE  TRUE FALSE

8. Extend the code you used in exercise 7 to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.

which(!c("MA", "ME", "MI", "MO", "MU")%in%murders$abb)

## [1] 5
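The difference between match and %in% shows up when one of the queried values is missing: match returns positions (with NA for a miss), while %in% returns TRUE or FALSE. A sketch on a made-up vector of abbreviations, where "XX" is deliberately not present:

```r
abbs <- c("AL", "AK", "AZ", "IA", "MI")
query <- c("AK", "MI", "XX")

positions <- match(query, abbs)   # 2 5 NA
found <- query %in% abbs          # TRUE TRUE FALSE

# Negate the logical vector to pick out the entries that were not found.
not_found <- query[!(query %in% abbs)]

positions
found
not_found
```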

Section 2.16

1. We made a plot of total murders versus population and noted a strong relationship. Not surprisingly, states with larger populations had more murders.

library(dslabs)
data(murders) 
population_in_millions <- murders$population/10^6 
total_gun_murders <- murders$total 
plot(population_in_millions, total_gun_murders)

Keep in mind that many states have populations below 5 million and are bunched up. We may gain further insights from making this plot in the log scale. Transform the variables using the log10 transformation and then plot them.

library(dslabs)
data(murders) 
population_in_millions <- murders$population/10^6 
total_gun_murders <- murders$total 
plot(log10(population_in_millions), log10(total_gun_murders))

2. Create a histogram of the state populations.

hist(murders$population)

3. Generate boxplots of the state populations by region.

boxplot(population~region, data=murders)
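The numbers behind a boxplot can be inspected without drawing it by passing plot = FALSE. The sketch below uses a made-up data frame (demo, with invented populations in millions) and checks that the middle line of each box is the group median.

```r
# Made-up data: two regions with a few populations each (in millions).
demo <- data.frame(region = c("Northeast", "Northeast", "Northeast",
                              "South", "South", "South", "South"),
                   population = c(1, 2, 3, 10, 20, 30, 40))

# plot = FALSE returns the summary statistics instead of drawing.
b <- boxplot(population ~ region, data = demo, plot = FALSE)

b$names        # one box per region
b$stats[3, ]   # third row holds the median of each group

# The same medians computed directly.
medians <- tapply(demo$population, demo$region, median)
medians
```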

Dimple K. Patel

2023-12-31