Lab 2

Exercise 1

flint = read.csv("C:\\Users\\liaev\\Documents\\STATS10\\flint-2.csv")

I read the file into R and named the object “flint”.

mean(flint$Pb >= 15)
## [1] 0.04436229

Taking the mean of the logical vector flint$Pb >= 15 gives the proportion of tested locations with a lead level of at least 15, which is about 0.044.
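Why the mean of a comparison gives a proportion can be seen in a small sketch. The vector pb below is made-up data (not the Flint readings), and the names is_high, prop_high and prop_check are only for illustration.

```r
# Hypothetical lead readings (NOT the Flint data), used only for illustration.
pb <- c(0, 3, 15, 27, 1, 0, 8, 104)

# The comparison produces a logical vector: TRUE where the reading is >= 15.
is_high <- pb >= 15

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs.
prop_high <- mean(is_high)

# The same proportion computed explicitly as count / total.
prop_check <- sum(is_high) / length(pb)

print(prop_high)
print(prop_check)
```

Both values agree because mean() on a logical vector is the number of TRUE entries divided by the length.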

Copper = flint$Cu
mean(Copper[flint$Region == "North"])
## [1] 44.6424

First I saved the copper column as Copper to make my code neater. Then I found the average copper level in the North region by keeping only the rows where Region is "North".
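The filtering step can be sketched on a tiny made-up data frame. The data frame toy and its values are assumptions for illustration, and tapply() is shown only as one possible way to get every region's mean at once.

```r
# Hypothetical data frame standing in for flint (values are made up).
toy <- data.frame(
  Region = c("North", "South", "North", "South", "North"),
  Cu     = c(40, 10, 60, 30, 50)
)

# Logical indexing keeps only the copper values whose Region is "North".
north_mean <- mean(toy$Cu[toy$Region == "North"])

# tapply() applies mean() to the copper values within each region.
region_means <- tapply(toy$Cu, toy$Region, mean)

print(north_mean)
print(region_means)
```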

mean(Copper >= 15)
## [1] 0.3844732

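I determined the proportion of copper levels that were greater than or equal to 15, which is about 0.38. Note that 15 is the action level for lead; the EPA action level for copper is 1300 ppb, so if the data is in ppb the threshold in this code should be 1300 rather than 15.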

AvgPb = mean(flint$Pb)
AvgCu = mean(Copper)

I saved the mean lead level as AvgPb and the mean copper level as AvgCu.

boxplot(flint$Pb,
        xlab="Elements",
        ylab="Levels", main="Lead Levels in the Water")

The mean is not a good measure of center here because the lead data is skewed. The median is a better measure because it is not pulled as far by outliers.

median(flint$Pb)
## [1] 0
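How a few large values pull the mean but not the median can be sketched with a made-up skewed vector. The values in x are hypothetical and are not taken from the Flint data.

```r
# Hypothetical right-skewed values: mostly small, with one large outlier.
x <- c(0, 0, 0, 1, 2, 3, 100)

# The mean is dragged upward by the single value of 100.
x_mean <- mean(x)

# The median is the middle value of the sorted data, so the outlier barely matters.
x_median <- median(x)

print(x_mean)
print(x_median)
```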

Exercise 2

life <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/countries_life.txt",  
header = TRUE)  
plot(y = life$Life, x = life$Income,
     xlab = "Income", ylab = "Life Expectancy")

boxplot(life$Income)

hist(life$Income)

Yes, there are many outliers, and the income data is strongly right-skewed.

below1000 = life[life$Income < 1000,]
above1000 = life[life$Income >= 1000,]
plot(below1000$Life~below1000$Income, 
     xlab = "Income", ylab = "Life Expectancy")

I plotted life expectancy versus income for the countries with an income lower than $1000.

cor(x = below1000$Income, y = below1000$Life)
## [1] 0.752886
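By default, cor() returns the Pearson correlation coefficient. As a rough sketch of what that number measures, the formula can be written out by hand on made-up vectors; x and y below are hypothetical values, not the country data.

```r
# Hypothetical paired values (NOT the countries data).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations of each value from its own mean.
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation: sum of cross-products over the product of the
# square roots of the sums of squared deviations.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in version for comparison.
r_builtin <- cor(x, y)

print(r_manual)
print(r_builtin)
```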

Exercise 3

maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header = TRUE)  
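summary(maas$lead)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    37.0    72.5   123.0   153.4   207.0   654.0

I used the summary function to get the five-number summary (plus the mean) of lead. summary() only summarizes its first argument, so the output above covers lead only; zinc needs its own call, summary(maas$zinc).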
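One possible way to summarize two columns in a single step is sketched below. The data frame toy and its values are made up for illustration, and using sapply() here is just one option among several.

```r
# Hypothetical data frame standing in for maas (values are made up).
toy <- data.frame(
  lead = c(10, 20, 30, 40, 50),
  zinc = c(100, 200, 300, 400, 500)
)

# Apply summary() to each selected column; the result has one column
# per variable and one row per statistic (Min., 1st Qu., Median, ...).
both <- sapply(toy[, c("lead", "zinc")], summary)

print(both)
```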

hist(maas$lead)

hist(log(maas$lead))

plot(log(maas$lead)~log(maas$zinc),
     xlab = "Log of zinc concentration", ylab = "Log of lead concentration")

The relationship between log lead and log zinc is positive and approximately linear.

maasColors <- c("green", "orange", "red")  # one color per interval made by cut(): low, medium, high
maasLevels <- cut(maas$lead, c(0,150,400,1000))

plot(maas$x, maas$y, cex = maas$lead/mean(maas$lead), col = maasColors[as.numeric(maasLevels)], pch = 5)

Here I assigned colors to different levels of lead in the soil to signify how dangerous the levels are. I then plotted the sampling locations, with each point colored by its lead level and sized in proportion to its lead concentration.
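How cut() turns lead values into color choices can be sketched on a few made-up numbers. The vector lead and the three-color palette below are assumptions for illustration, not the lab data.

```r
# Hypothetical lead concentrations (NOT the soil data).
lead <- c(50, 150, 151, 400, 650)

# cut() bins each value into (0,150], (150,400] or (400,1000];
# the right endpoint is included in each interval by default.
levels_factor <- cut(lead, c(0, 150, 400, 1000))

# as.numeric() gives the interval number (1, 2 or 3) for each value.
level_codes <- as.numeric(levels_factor)

# Indexing the palette by interval number picks one color per point.
palette3 <- c("green", "orange", "red")  # assumed low / medium / high colors
point_cols <- palette3[level_codes]

print(level_codes)
print(point_cols)
```

With three breaks intervals there are three codes, so a palette with one color per interval is enough.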

Exercise 4

LA <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt", header = TRUE)  
find.package("maps")
## [1] "C:/Users/liaev/AppData/Local/R/win-library/4.2/maps"
library(maps)
plot(LA$Longitude, LA$Latitude,
     xlim = c(-120, -117), ylim = c(33.5, 34.5),
     xlab = "Longitude", ylab = "Latitude", 
     main = "Schools in LA")
map("county", "california", add = TRUE) 

I plotted the longitude and latitude of each location and added axis labels, axis limits, and a title. Then I added the California county boundaries on top of the plot.

abovezero = LA$Schools != 0
# Use bare column names so the data argument (the filtered rows) is actually used
plot(Income ~ Schools,
     data = LA[abovezero, ])

The plot is not linear, but the two variables are moderately correlated, and the Schools value seems to generally increase as income increases. However, correlation by itself, no matter how strong, is not enough to infer causation.
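A note on formulas with a data argument: when the formula names columns as df$col, R takes them from the full data frame and the filtered data is not used. The sketch below checks this with model.frame(), which builds the data a formula refers to; toy and its values are made up for illustration.

```r
# Hypothetical data frame standing in for LA (values are made up).
toy <- data.frame(
  Schools = c(0, 700, 800, 0, 900),
  Income  = c(30, 40, 55, 35, 70)
)
keep <- toy$Schools != 0

# Columns written as toy$... come from the full data frame,
# so the filtered data argument has no effect: all 5 rows are used.
full_rows <- nrow(model.frame(toy$Income ~ toy$Schools, data = toy[keep, ]))

# Bare column names are looked up in the data argument,
# so only the 3 rows with Schools != 0 are used.
kept_rows <- nrow(model.frame(Income ~ Schools, data = toy[keep, ]))

print(full_rows)
print(kept_rows)
```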