library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
airquality <- airquality
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
mean(airquality$Temp)
## [1] 77.88235
mean(airquality[,4])
## [1] 77.88235
median(airquality$Ozone)
## [1] NA
median(airquality$Solar.R)
## [1] NA
median(airquality$Wind)
## [1] 9.7
sd(airquality$Ozone)
## [1] NA
sd(airquality$Wind)
## [1] 3.523001
var(airquality$Temp)
## [1] 89.59133
airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
Look at the structure and the summary statistics of the dataset to see how Month has changed from numbers to character strings.
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : chr "May" "May" "May" "May" ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
head(airquality,10)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 May 1
## 2 36 118 8.0 72 May 2
## 3 12 149 12.6 74 May 3
## 4 18 313 11.5 62 May 4
## 5 NA NA 14.3 56 May 5
## 6 28 NA 14.9 66 May 6
## 7 23 299 8.6 65 May 7
## 8 19 99 13.8 59 May 8
## 9 8 19 20.1 61 May 9
## 10 NA 194 8.6 69 May 10
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Length:153 Min. : 1.0
## Class :character 1st Qu.: 8.0
## Mode :character Median :16.0
## Mean :15.8
## 3rd Qu.:23.0
## Max. :31.0
##
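When I applied the function median to Ozone and Solar.R, the results were NA, but the summary shows values for both. Why? Both columns contain missing values (37 in Ozone and 7 in Solar.R, as the NA's row of the summary shows). median returns NA when any value is missing unless na.rm = TRUE is set, whereas summary drops the missing values and reports how many there were.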
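A minimal sketch of how the NA results can be avoided, assuming the built-in dataset is copied into an object named aq (a name chosen only for this example). median() and sd() both accept na.rm = TRUE, which drops the missing values before the statistic is computed.

```r
# "aq" is a name chosen for this sketch; it is a copy of the built-in data.
aq <- datasets::airquality

# Count the missing values in each column.
na_counts <- colSums(is.na(aq))

# na.rm = TRUE removes missing values before computing the statistic.
ozone_median <- median(aq$Ozone, na.rm = TRUE)
solar_median <- median(aq$Solar.R, na.rm = TRUE)
ozone_sd <- sd(aq$Ozone, na.rm = TRUE)

print(na_counts)
print(c(Ozone = ozone_median, Solar.R = solar_median))
print(ozone_sd)
```

The two medians printed here should agree with the Median row of the summary output above.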
mean(airquality$ozone)
## Warning in mean.default(airquality$ozone): argument is not numeric or logical:
## returning NA
## [1] NA
airquality$Month <- factor(airquality$Month,levels = c("May","June","July","August","September"))
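As an alternative sketch, the recoding and the factor conversion can be done in one step on a fresh copy of the data. The object name aq is an assumption made for this example; month.name is a constant built into R that holds the English month names.

```r
# Fresh copy of the built-in data, where Month is still numeric (5 to 9).
aq <- datasets::airquality

# levels gives the numeric codes, labels gives the names shown for them.
# The order of the levels becomes the order used in plot legends.
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

print(levels(aq$Month))
print(table(aq$Month))
```

Because the levels are given in calendar order, plots that use Month follow that order instead of alphabetical order.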
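qplot stands for “quick plot” and is part of the ggplot2 package. R is case sensitive, so be careful: the call above used airquality$ozone instead of airquality$Ozone, which is why it returned NA with a warning. The difference between the maximum and minimum temperature is 41, so 20 bins (each about 2 degrees wide) is a good choice for the histogram.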
# I() passes alpha and color as fixed values, so qplot does not map them to a legend
p1 <- qplot(data = airquality, Temp, fill = Month, geom = "histogram", alpha = I(0.5), bins = 20, color = I("gray"))
p1
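The arithmetic behind the choice of 20 bins can be checked directly. This is a small sketch on a copy of the data named aq (a name chosen here).

```r
aq <- datasets::airquality

# Difference between the highest and lowest temperature.
temp_range <- diff(range(aq$Temp))

# Approximate width of each bin when the range is split into 20 bins.
bin_width <- temp_range / 20

print(temp_range)
print(bin_width)
```

A bin width close to 2 degrees is also why binwidth = 2 is a natural setting when the histogram is redrawn with ggplot.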
ggplot is more flexible than qplot, and both belong to the ggplot2 package.
Reorder the legend so that the months appear in calendar order rather than the default alphabetical order.
Outline the bars in gray using color = "gray".
I set binwidth = 2 so that the ggplot histogram can be compared with the qplot histogram.
The two histograms differ in places even though they use the same data. For example, the bar for temperatures from 80 to 85 has a count of 46 in the qplot histogram but 19 in the ggplot histogram. This is because qplot stacks the months by default, so the height of a bar is the sum of the frequencies of all months, whereas the ggplot histogram below uses position = "identity", so each month's bar starts from zero and shows the frequency of that month alone.
p2 <- airquality %>%
ggplot(aes(x=Temp, fill=Month)) +
geom_histogram(position="identity", alpha=0.5, binwidth = 2, color = "gray")+
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p2
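The difference between stacked and overlaid bars can be sketched with plain counts. The bin edges used here (5-degree bins from 55 to 100) are an assumption for illustration and are not the edges that ggplot2 chooses, so the numbers will not match the plots exactly. The name aq is also chosen only for this example.

```r
aq <- datasets::airquality

# Assumed bins: [55,60), [60,65), ..., [95,100).
bins <- cut(aq$Temp, breaks = seq(55, 100, by = 5), right = FALSE)

# Rows are temperature bins, columns are months (5 to 9).
counts <- table(bins, aq$Month)

# Stacked bars: the height is the sum over all months.
stacked_height <- rowSums(counts)

# Overlaid bars (position = "identity"): the tallest visible bar
# is the largest single-month count in that bin.
overlaid_height <- apply(counts, 1, max)

print(cbind(stacked = stacked_height, overlaid = overlaid_height))
```

In every bin the stacked height is at least as large as the overlaid height, which is the pattern seen when the two histograms are compared.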
fill = Month in the aesthetics fills each boxplot with a different color.
scale_fill_discrete sets the title and the labels of the legend for the discrete fill colors.
p3 <- airquality %>%
ggplot(aes(Month, Temp, fill = Month)) +
ggtitle("Temperatures") +
xlab("Month") +
ylab("Temperature (degrees F)") +
geom_boxplot() +
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
Use scale_fill_grey for grey-scale fill colors and legend, and again use fill = Month in the aesthetics.
p4 <- airquality %>%
ggplot(aes(Month, Temp, fill = Month)) +
ggtitle("Monthly Temperature Variations") +
xlab("Month") +
ylab("Temperature (degrees F)") +
geom_boxplot()+
scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4
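The numbers behind each box can be approximated with fivenum(), which returns the minimum, lower hinge, median, upper hinge and maximum. This is a sketch on a copy named aq (a name chosen here). Note that geom_boxplot computes its hinges from quantiles and limits the whiskers to 1.5 times the interquartile range, so its values can differ slightly from these.

```r
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One column per month, one row per statistic.
box_stats <- sapply(split(aq$Temp, aq$Month), fivenum)
rownames(box_stats) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

print(box_stats)
```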
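First, make a simple bar chart of the mean temperature for each month (it is a bar chart rather than a histogram, because it plots one summary value per month). The boxplot seems more useful than the histogram for looking at temperature by month.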
# Stored as p5 so that the histogram saved in p2 is not overwritten
p5 <- airquality %>%
ggplot(aes(x = Month, y = Temp, fill = Month)) +
stat_summary(fun = mean, geom = "col", alpha = 0.5)
p5
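The bar heights in this chart are the monthly means, which can also be computed directly. A minimal sketch on a copy named aq (a name chosen for this example):

```r
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Mean temperature and number of days for each month.
monthly_mean <- tapply(aq$Temp, aq$Month, mean)
monthly_n <- tapply(aq$Temp, aq$Month, length)

print(round(monthly_mean, 2))
print(monthly_n)
```

Weighting the monthly means by the number of days in each month gives back the overall mean temperature computed at the start.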
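I want to see the correlations between the variables in order to find what has a bad influence on air quality (that is, on the ozone level). I first used code that showed only the correlation of Ozone and Solar.R and gave an error when rendering. The error most likely came from indexing with row and col, names that did not match the variables I had created (rows, columns and Col); R is case sensitive, so the names must match exactly. The code below uses matching names.
# Correlations of the four numeric variables, using only rows with no missing values
columns <- c('Ozone', 'Solar.R', 'Wind', 'Temp')
rows <- rowSums(is.na(airquality)) == 0
round(cor(airquality[rows, columns]), 3)
# Correlation of Wind and Temp only
wind_temp <- c('Wind', 'Temp')
round(cor(airquality[rows, wind_temp]), 1)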
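For a correlation table like the one attempted here, cor() can also drop incomplete rows itself through its use argument. The sketch below works on a fresh copy named aq, and the name num_cols is likewise chosen only for this example. It checks that use = "complete.obs" gives the same result as filtering the rows by hand.

```r
aq <- datasets::airquality
num_cols <- c("Ozone", "Solar.R", "Wind", "Temp")

# Let cor() discard rows that have a missing value in any of the columns.
cor_table <- cor(aq[, num_cols], use = "complete.obs")

# The same table, with the complete rows selected explicitly.
complete <- complete.cases(aq[, num_cols])
cor_manual <- cor(aq[complete, num_cols])

print(round(cor_table, 3))
print(all.equal(cor_table, cor_manual))
```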
Apart from the correlation table, it is generally known that ground-level ozone increases as temperature increases. The scatter plot shows that temperature and ozone are strongly related. First, make a scatter plot using qplot.
qplot(Temp,Ozone,data=airquality,color=Month,geom='point')
## Warning: Removed 37 rows containing missing values (geom_point).
To add a regression line, use the function geom_smooth(). The default confidence level is 0.95; I changed it to 0.90 for practice.
ggplot(airquality,aes(x=Temp, y=Ozone)) + geom_point(aes(shape = Month, color = Month)) + geom_smooth(method="lm" ,level=0.90)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 37 rows containing non-finite values (stat_smooth).
## Warning: Removed 37 rows containing missing values (geom_point).
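The line and band drawn by geom_smooth(method = "lm", level = 0.90) correspond to an ordinary linear model, which can be fitted directly to inspect the numbers. This is a sketch under that assumption; the names aq, fit and new_temps, and the three example temperatures, are chosen only for illustration.

```r
aq <- datasets::airquality

# lm() drops the rows where Ozone is missing, as the plot does.
fit <- lm(Ozone ~ Temp, data = aq)

# 90% confidence intervals for the intercept and the slope.
slope_ci <- confint(fit, level = 0.90)

# 90% confidence band for the fitted line at a few example temperatures.
new_temps <- data.frame(Temp = c(60, 75, 90))
band <- predict(fit, newdata = new_temps, interval = "confidence", level = 0.90)

print(coef(fit))
print(slope_ci)
print(cbind(new_temps, band))
```

The number of observations used by the model equals the 153 rows minus the 37 rows reported as removed in the warnings above.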