Intro Logical Statements/Relational Operators

Is 4 greater than 3?

4>3
## [1] TRUE

Is 3 or 8 greater than or equal to 3?

c(3,8) >= 3
## [1] TRUE TRUE

Is 3 or 8 less than or equal to 3?

c(3,8) <= 3
## [1]  TRUE FALSE

Is 1, 4, or 9 exactly equal to 9?

c(1,4,9) == 9
## [1] FALSE FALSE  TRUE

Is 1, 4, or 9 not (exactly) equal to 9?

c(1,4,9) != 9
## [1]  TRUE  TRUE FALSE
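
Each comparison above is applied element by element, so the result has one TRUE or FALSE per input value. As a minimal sketch with made-up values (the vector `x` is hypothetical and not part of the lab data), `any()` and `all()` reduce that vector to a single answer:

```r
# Hypothetical values, chosen only for illustration
x <- c(1, 4, 9)

# Each element is compared with 4 separately
x > 4

# any() asks whether at least one comparison is TRUE,
# all() asks whether every comparison is TRUE
any(x > 4)
all(x > 4)
```

Here `x > 4` is FALSE FALSE TRUE, so `any()` gives TRUE and `all()` gives FALSE.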

Applications of logical statements

NCbirths <- read.csv("~/Desktop/spring 2020/stats 10/Lab_data_sets/births.csv")

What is the number of babies that weighed more than 100 ounces?

sum(NCbirths$weight > 100)
## [1] 1643

What is the proportion of babies that weighed more than 100 ounces?

mean(NCbirths$weight > 100)
## [1] 0.8247992
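
The reason sum() counts and mean() gives a proportion is that R treats TRUE as 1 and FALSE as 0 in arithmetic. A small sketch, assuming a made-up weight vector `w` rather than the values in births.csv:

```r
# Hypothetical weights in ounces, not taken from births.csv
w <- c(90, 105, 120, 98, 130)

over_100 <- w > 100   # FALSE TRUE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(over_100)    # number of TRUEs: 3
mean(over_100)   # share of TRUEs: 3/5 = 0.6
```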

What is the proportion of female babies?

mean(NCbirths$Gender == "Female")
## [1] 0.4804217

What is the proportion of babies NOT assigned male?

mean(NCbirths$Gender != "Male")
## [1] 0.4804217

Applications of logical statements: calculations

fem_weights <- NCbirths$weight[NCbirths$Gender == "Female"]
fem_weights
##   [1] 177 144  98 104 123 153 106 125 115  83 130  84 147 106 117 112 115 107
##  [19] 105 119 143 119  33 118 134 106 118 130 102 134 116 119  57 118 123 135
##  [37]  77 122 117 112  89 122  83 151 125 114 109  93  96  86 137  86 142 136
##  [55] 105  98 119 106 109 139  97  85 121 115  91 120 116 122  99 128 114  77
##  [73] 115  74 122 117 136 143  92 121 109 130 125 108 132 130 100 114 108 131
##  [91]  95 116 118 117 102 108  95 108 120  91 135  95 145  34  84 111 110 131
## [109] 103 113 104 107  36 136 109 135 148 111 104 139 113 145 133 117 142  99
## [127] 105 105 120 125 120 131 124 117 126 126 104 126 121 106 111 128 101 128
## [145] 135 104 116 103  84 136 108 119 106 113 113 131 118  98 114 117 113 126
## [163] 121 120 119 130 111 121 122  99 113 120  77 113 144 128 104  50  69 117
## [181]  80 116  85 111 112 123 124  80 106 143 122 109  73 109  57 102 107 124
## [199] 118 113 104 131 123 101 120 132 122 118 108 123 110 122 112 137 133 111
## [217] 120 120 128  47 107 134  96 127 132 125  87  99 104 137 108  76 103 153
## [235] 125 112 130 104  70 120 108 123 112  72 115 136 110 141 121  93 108 100
## [253] 108 131 161 111 105 107 109 123 112  95 121 114 123 111 124 139 106 109
## [271] 117 104 124 113 117 122  83 124 121 106 123 138 128 116 119 111  96 115
## [289] 117 120 115 106 121 144 124 145 115 145 126 143 140 109  96 120 105  93
## [307]  92 111 134 110 128 128 147  83 123 146 130 112 109 102 102 107  96  81
## [325] 102 108 137 120 117  99  51 146 114 106 110  96 135 122 107 115 104 120
## [343] 107 118  99 115 100 157 126 126 102 136 125  96  77 111 102  98 122 117
## [361] 124 114 111 140  99 120 129  87 106 114 111 147 135 110 131 146 117 116
## [379] 113 104 114 111 115 112  91 103 119 119 132 106 106 124 119  85 115 105
## [397]  17 126 121 116 139 128 104 112 104 116  93 142 118 118 118 123 126 114
## [415] 129 120 105  43 103 116  99 119 131 118 137 104 108 109  89 105 130 119
## [433] 113 114  76 119 133 107 124 136 102 130  94 105 139 109 114  91 121 103
## [451] 120 100 151 132  96 121 142 112 115 120 117  72 130 122 115  91 117 131
## [469] 100 121 118 141 100 115  97  93  97 117 146 112  94 117 120  71 131  89
## [487]  14 118 117 133 114 120 120 133  87 111  91 118 102 110 108  98 108 114
## [505] 100 100 130 157 140 113 121 118 128  82 110 139 137 103 126 114 117 118
## [523] 138 107 106 137 116 107 143 112 133 127 100 108  98  95 110 108 129 120
## [541] 102 109 133  69  96 136 121  59 115 135 105 112 107 123 127 115 113 115
## [559] 117 104 114 101 115  92 134 130 110 104 116  89 118 131 113 110 118  73
## [577] 106 111  91 128 131 142 109 100 105  44 126 140 139 129 119  98 124 109
## [595] 117 112 116 102 114 100  78 134 135 155 134 143  97 143 122 145  76 119
## [613] 107 120 128 131 161 127  28 125 118 146 122  85 111 127 135 118 121 102
## [631] 101 105 133  78  90 102  23 145 143 119 130 118 116 105 141 129 125  52
## [649] 124 105 116  95 107 123  98 139 130 116 119 129 114 110 107  98 124 125
## [667] 146 105  81 105  98 124 132 115 132 123 113 133 131 120 140 110 106 129
## [685] 151 112 110 126 118 100 115  95 109 101 109 128  58 148 120 126 117 129
## [703] 137 105  90 102 119 143 144 119 119 108 120 115 117 112 116 128 127 110
## [721]  57 111 127 106 150 126 111 106 115 116 135 147  25 127 115  78 110 124
## [739] 120 126 113  94 112 104  96 108 108 134 147 121 122 112 122 110 128 100
## [757] 135  94 130 107 111 146  98 134  96 118 108 115 119 139 155 129 128 104
## [775] 105 125 113 147 113  84 106  96 116 108 108 115 105 114 108  93 126 133
## [793] 122 122 118 119  96  99 105 113 113 126 104 144  99  99 127 113  72 141
## [811]  98 135 139 116 112  91 135 118 106 116 121 115 107 111  95 128 109 100
## [829] 106  94  77 125 100  97 129 112 133 103 131 125 124  87 129  88 135 117
## [847]  86 145 114 122 132 103 130 105 125 108 125 119 161  97 143 135 149  68
## [865] 143  87  86 108 111 122 123 133 103  88  94  98 123 109 119 122 102 117
## [883] 118  96 105 114 109  76 120 114 145 105 125  75 120 104 129 117  97 106
## [901] 128  76 119  83 117 107 115 120  95  99  60 140 117 100 138 123 134 124
## [919] 125 144 128  90 113 125  20 118  94  89 121 123 119 109 120 105  88 122
## [937]  92 141 131 107 130 105 104 118 144 110 120 114 105 133 112 139  91 112
## [955] 104 115 129

Good coding practices

Create an object with the baby weights from NCbirths

baby_weight <- NCbirths$weight

Create an object with the baby genders from NCbirths

baby_gender <- NCbirths$Gender

Create a logical vector to describe if the gender is female

is_female <- baby_gender == "Female"

Create the vector of weights containing only females

fem_weights <- baby_weight[is_female]
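
The step-by-step approach can be checked on a tiny example. This is only a sketch: the data frame `toy` and the `toy_` object names are hypothetical stand-ins that reuse the NCbirths column names.

```r
# Hypothetical stand-in for NCbirths with the same column names
toy <- data.frame(
  weight = c(110, 95, 123, 101),
  Gender = c("Female", "Male", "Female", "Male")
)

toy_weight    <- toy$weight
toy_is_female <- toy$Gender == "Female"   # TRUE FALSE TRUE FALSE

# Keep only the positions where toy_is_female is TRUE
toy_fem_weights <- toy_weight[toy_is_female]
toy_fem_weights         # 110 123
mean(toy_fem_weights)   # 116.5
```

Naming the intermediate objects makes each step easy to inspect on its own before they are combined.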

Exercise 1

part a

Download the data from CCLE and read it into R. When you read in the data, name your object “flint”.

flint <- read.csv("~/Desktop/spring 2020/stats 10/Lab_data_sets/flint.csv")
head(flint)

part b

The EPA states a water source is especially dangerous if the lead level is 15 PPB or greater. What proportion of the locations tested were found to have dangerous lead levels?

mean(flint$Pb >= 15)
## [1] 0.04436229

The proportion of the locations tested found to have dangerous lead levels is 0.04436229.

part c

Report the mean copper level for only test sites in the North region.

mean(flint$Cu[flint$Region == "North"])
## [1] 44.6424

part d

Report the mean copper level for only test sites with dangerous lead levels (at least 15 PPB).

mean(flint$Cu[flint$Pb >= 15])
## [1] 305.8333

part e

Report the mean lead and copper levels.

mean(flint$Cu)
## [1] 54.58102
mean(flint$Pb)
## [1] 3.383272

part f

Create a box plot with a good title for the lead levels.

boxplot(flint$Pb, xlab = "Lead levels", main = "Lead level boxplot")

part g

Based on what you see in part (f), does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.

The mean is not a good measure of center for the data, since the data is right-skewed. The median is a better measure of center.

median(flint$Pb)
## [1] 0
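
To see why skew matters, here is a small sketch with invented readings (the vector `pb` is hypothetical, not the Flint data). A single large value moves the mean a lot but leaves the median almost untouched:

```r
# Hypothetical lead readings: mostly zeros plus one large value
pb <- c(0, 0, 0, 0, 1, 2, 3, 40)

mean(pb)     # 5.75, pulled up by the single large reading
median(pb)   # 0.5, the average of the two middle values (0 and 1)
```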

Exercise 2

part a

Construct a scatterplot of Life against Income. Note: Income should be on the horizontal axis. How does income appear to affect life expectancy?

life <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/countries_life.txt", header = TRUE)
plot(y = life$Life, x = life$Income, xlab = "Income", ylab = "Life Expectancy")

Life expectancy increases as income increases.

part b

Construct the boxplot and histogram of Income. Are there any outliers?

hist(life$Income, xlab = "Income", main = "Histogram of per Capita Income")

boxplot(life$Income, xlab = "Income", main = "Boxplot of per Capita Income")

Yes, there are several outliers.
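
Outliers can also be listed rather than read off the plot. The sketch below assumes made-up incomes (the vector `income` is hypothetical) and uses boxplot.stats(), which returns the points a default boxplot would draw beyond the whiskers:

```r
# Hypothetical incomes, not the countries_life data
income <- c(200, 350, 400, 500, 650, 800, 900, 15000)

# $out holds the values lying more than 1.5 times the box length
# beyond the box
out <- boxplot.stats(income)$out
out   # 15000
```

Running the same call on life$Income would list the specific outlying incomes.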

part c

Split the data set into two parts: One for which the Income is strictly below $1000, and one for which the Income is at least $1000. Come up with your own names for these two objects.

below1000 <- life[life$Income < 1000,]
above1000 <- life[life$Income >= 1000,]
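
"Strictly below" and "at least" correspond to `<` and `>=`, so a row with Income exactly 1000 belongs in the second group. A quick sanity check, sketched on a hypothetical data frame `toy` with the same column names as `life`:

```r
# Hypothetical data frame with the same column names as `life`
toy <- data.frame(Income = c(500, 1000, 2500, 999),
                  Life   = c(52, 61, 70, 58))

low  <- toy[toy$Income <  1000, ]   # strictly below 1000
high <- toy[toy$Income >= 1000, ]   # at least 1000

# Every row should land in exactly one group
nrow(low) + nrow(high) == nrow(toy)
```

If the two row counts do not add up to the total, some rows were dropped by the split.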

part d

Use the data for which the Income is below $1000. Plot Life against Income and compute the correlation coefficient. Hint: use the function cor()

plot(below1000$Life~below1000$Income, xlab = "Income", ylab = "Life Expectancy")

cor(x = below1000$Life, y = below1000$Income)
## [1] 0.752886
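
For reference, cor() computes the Pearson correlation by default. The sketch below uses invented values (`income` and `life_exp` are hypothetical) to show that it matches the textbook formula and that the order of the two arguments does not matter:

```r
# Hypothetical values, not the countries_life data
income   <- c(100, 300, 500, 700, 900)
life_exp <- c(45, 50, 58, 60, 66)

r <- cor(income, life_exp)

# Pearson formula written out by hand
dx <- income - mean(income)
dy <- life_exp - mean(life_exp)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r, r_manual)                  # TRUE
all.equal(r, cor(life_exp, income))     # TRUE
```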

Exercise 3

part a

Compute the summary statistics for lead and zinc using the summary() function.

maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header = TRUE)
summary(maas$lead)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    37.0    72.5   123.0   153.4   207.0   654.0
summary(maas$zinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   113.0   198.0   326.0   469.7   674.5  1839.0

part b

Plot two histograms: one of lead and one of log(lead).

hist(maas$lead)

hist(log(maas$lead))

part c

Plot log(lead) against log(zinc). What do you observe?

plot(log(lead) ~ log(zinc), data = maas, xlab = "Logarithm of zinc concentration", ylab = "Logarithm of lead concentration")

The scatterplot is linear. The correlation coefficient is positive. However, we cannot infer a causal relationship from this alone.

part d

The level of risk for surface soil based on lead concentration in ppm is given in the table below:

Mean concentration (ppm)    Level of risk
Below 150                   Lead-free
Between 150-400             Lead-safe
Above 400                   Signif. environmental lead hazard

Use techniques similar to last lab to give different colors and sizes to the lead concentration at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.

lead_colors <- c("green", "yellow", "red")
lead_levels <- cut(maas$lead, c(0, 150, 400, 1000))

plot(maas$x, maas$y, cex = maas$lead/mean(maas$lead), col = lead_colors[as.numeric(lead_levels)], pch = 19)
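
The color lookup works because cut() returns a factor whose integer codes (1, 2, 3) index into the color vector. A minimal sketch with invented concentrations (the vector `lead` is hypothetical, not the soil data):

```r
# Hypothetical lead concentrations in ppm
lead <- c(80, 150, 220, 400, 520)

lead_colors <- c("green", "yellow", "red")
# Default intervals are right-closed: (0,150], (150,400], (400,1000]
lead_levels <- cut(lead, c(0, 150, 400, 1000))

as.numeric(lead_levels)                 # 1 1 2 2 3
lead_colors[as.numeric(lead_levels)]    # "green" "green" "yellow" "yellow" "red"
```

With the default settings a value of exactly 150 falls in the first group. Passing right = FALSE to cut() would instead make the intervals left-closed and place 150 in the middle group, which may match "Below 150" more literally.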

Exercise 4

part a

Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing: map("county", "california", add = TRUE)

LA <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt", header = TRUE)
find.package("maps")
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/maps"
library(maps)
plot(x = LA$Longitude, y = LA$Latitude, xlim = c(-120,-117), ylim = c(33,35), ylab = "Latitude", xlab = "Longitude", main = "Schools in LA")
map("county", "California", add = TRUE)

part b

Do you see any relationship between income and school performance? Hint: Plot the variable Schools against the variable Income and describe what you see. Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot.

LA.subset <- LA[LA$Schools != 0,]
plot(Schools ~ Income, data = LA.subset, xlab = "Income", ylab = "School Performance")

The variables are moderately associated and the scatterplot is roughly linear. School performance generally increases as income increases, but it is not possible to conclude that there is causation.