What is a data frame?
A data frame is a two-dimensional data structure in R used to store tabular data. It is made up of three principal components: the data, the rows, and the columns.
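As a minimal sketch of those three components, the toy data frame below (the name `readings` and its values are made up for illustration) shows how the data, rows, and columns can be inspected:

```r
# Hypothetical toy data frame: each column is a vector of equal length
readings <- data.frame(
  site = c("A", "B", "C"),
  temp = c(67, 72, 74),
  wind = c(7.4, 8.0, 12.6)
)

dim(readings)        # number of rows and columns
rownames(readings)   # row labels (default: "1", "2", "3")
colnames(readings)   # column labels
readings$temp        # one column, extracted as a vector
```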
For this assignment, I am using the 'airquality' dataset. It contains daily air quality measurements taken in New York from May to September 1973. There are 153 observations of 6 variables: Ozone, Solar.R, Wind, Temp, Month and Day.
#First, I am printing out the first 6 rows of the airquality dataset.
data(airquality)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
#Column names
colnames(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
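A few other base R helpers give a quick overview of a data frame. The sketch below is an optional extra rather than part of the original assignment:

```r
data(airquality)

dim(airquality)              # 153 rows, 6 columns
tail(airquality, 3)          # last 3 rows
colSums(is.na(airquality))   # number of missing values per column
summary(airquality$Temp)     # min, quartiles, median, mean and max
```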
Here I load the 'dplyr' package into R.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
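The messages above mean that, once 'dplyr' is attached, a bare call to filter() or lag() resolves to the 'dplyr' version. As a small sketch, the `::` operator can be used to pick a specific version explicitly. The 'dplyr' call is wrapped in a check so the example still runs if the package is not installed:

```r
data(airquality)

# stats::filter() is a linear filter for time series;
# here it computes a 3-point moving average
moving_avg <- stats::filter(1:5, rep(1/3, 3))
moving_avg

# dplyr::filter() keeps the rows of a data frame that match a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  hot_days <- dplyr::filter(airquality, Temp > 90)
  nrow(hot_days)
}
```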
As the str() output below shows, the 'Day' and 'Month' variables are stored as integers. They should be treated as factors instead, since they represent the day of the month (1,...,31) and the month (January = 1, February = 2,..., December = 12) respectively.
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
#Changing them into factors
airquality$Day <- as.factor(airquality$Day)
airquality$Month <- as.factor(airquality$Month)
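One thing to keep in mind after this conversion: a factor stores integer codes plus labels, so converting it straight back with as.integer() returns the codes rather than the original values. A short sketch using a made-up vector `month`:

```r
month <- c(5, 5, 7, 9)               # hypothetical month numbers
month_f <- as.factor(month)

levels(month_f)                      # "5" "7" "9"
as.integer(month_f)                  # internal codes: 1 1 2 3
as.integer(as.character(month_f))    # original values: 5 5 7 9
```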
For simplicity, I will use a random subset of 20 observations from the original airquality dataset.
#Setting the seed to get the same set of random numbers
set.seed(1234)
airquality_subset <- sample_n(airquality, 20)
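To illustrate why the seed matters, the sketch below draws the sample twice with base R's sample() (used here instead of sample_n() so the example needs no extra package). Resetting the same seed gives the same rows, although the exact rows drawn can differ between R versions.

```r
data(airquality)

set.seed(1234)
rows_a <- sample(nrow(airquality), 20)

set.seed(1234)                 # same seed again
rows_b <- sample(nrow(airquality), 20)

identical(rows_a, rows_b)      # TRUE: the two draws match
subset_a <- airquality[rows_a, ]
dim(subset_a)                  # 20 rows, 6 columns
```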
#Renaming Solar.R to Solar
a <- airquality_subset %>%
  rename(Solar = Solar.R)
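For comparison, the same renaming can be sketched in base R by assigning to names(). The toy data frame `df` is made up for illustration:

```r
df <- data.frame(Ozone = c(41, 36), Solar.R = c(190, 118))

# Replace the one column name that matches "Solar.R"
names(df)[names(df) == "Solar.R"] <- "Solar"
names(df)                       # "Ozone" "Solar"
```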
Since the temperature in the airquality dataset is listed in Fahrenheit in the 'Temp' column, let's add a new column named Temp_C with the temperature converted to Celsius, rounded to the nearest whole number.
I have renamed the original 'Temp' column to Temp_F for clarity.
I have also removed the rows containing 'NA' values from the dataset.
airquality_subset2 <- a %>%
  rename(Temp_F = Temp) %>%
  mutate(Temp_C = round((Temp_F - 32) * (5/9), 0)) %>%
  na.omit()
dim(airquality_subset2)
## [1] 17 7
#17 rows left after removing NA
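As a quick check of the conversion formula, C = (F - 32) * 5/9, the sketch below applies it to a few Fahrenheit values that appear in the subset and shows how na.omit() drops incomplete rows. The helper name `f_to_c` and the toy data frame are assumptions for illustration.

```r
# Hypothetical helper: Fahrenheit to Celsius, rounded to a whole number
f_to_c <- function(temp_f) round((temp_f - 32) * (5 / 9), 0)

f_to_c(c(62, 84, 93))           # 17 29 34

# na.omit() removes every row that has at least one NA
toy <- data.frame(Ozone = c(41, NA, 12), Temp_F = c(67, 56, 74))
nrow(na.omit(toy))              # 2
sum(!complete.cases(toy))       # 1 row was incomplete
```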
First, I obtain the maximum Ozone reading. Then I count how many observations in the subset have that maximum reading.
maxOzone <- max(airquality_subset2$Ozone)
maxOzone
## [1] 135
#Keep only the rows with the highest Ozone reading.
maxOzone_subset <- airquality_subset2 %>%
  filter(Ozone == maxOzone)
maxOzone_subset
## Ozone Solar Wind Temp_F Month Day Temp_C
## 1 135 269 4.1 84 7 1 29
#Number of observations with the max Ozone reading in this subset
nrow(maxOzone_subset)
## [1] 1
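Removing the NA values beforehand matters here: max() returns NA if any value is missing, unless na.rm = TRUE is set. A small sketch with a made-up vector, which also uses which() to locate the maximum:

```r
ozone <- c(41, NA, 135, 18, 135)   # hypothetical readings

max(ozone)                         # NA, because one value is missing
top <- max(ozone, na.rm = TRUE)    # 135
which(ozone == top)                # positions 3 and 5
sum(ozone == top, na.rm = TRUE)    # 2 readings equal the maximum
```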
#Playing around more with the data
#Arranging the dataset in ascending order of 'Ozone' value
arrange(airquality_subset2, Ozone)
## Ozone Solar Wind Temp_F Month Day Temp_C
## 1 9 24 10.9 71 9 14 22
## 2 13 238 12.6 64 9 21 18
## 3 14 274 10.9 68 5 14 20
## 4 18 313 11.5 62 5 4 17
## 5 21 230 10.9 75 9 9 24
## 6 23 13 12.0 67 5 28 19
## 7 23 14 9.2 71 9 22 22
## 8 24 259 9.7 73 9 10 23
## 9 31 244 10.9 78 8 19 26
## 10 45 212 9.7 79 8 24 26
## 11 50 275 7.4 86 7 29 30
## 12 61 285 6.3 84 7 18 29
## 13 73 183 2.8 93 9 3 34
## 14 79 187 5.1 87 7 19 31
## 15 97 272 5.7 92 7 9 33
## 16 110 207 8.0 90 8 9 32
## 17 135 269 4.1 84 7 1 29
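Sorting in descending order, or breaking ties with a second column, follows the same pattern. The sketch below uses base R's order() on a made-up data frame so that it runs without any package; the 'dplyr' counterpart of the descending sort would be arrange() combined with desc().

```r
toy <- data.frame(Ozone = c(23, 9, 135, 23), Solar = c(14, 24, 269, 13))

# Ascending by Ozone, ties broken by Solar
asc <- toy[order(toy$Ozone, toy$Solar), ]
asc

# Descending by Ozone
dsc <- toy[order(toy$Ozone, decreasing = TRUE), ]
dsc
```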
#Remove the first row
nofirstrow <- airquality_subset2[-1,]
dim(nofirstrow)
## [1] 16 7
#16 rows
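#Remove the last row
#Negative indices are positions, not row names, so use nrow() to target the last row
nolastrow <- airquality_subset2[-nrow(airquality_subset2),]
dim(nolastrow)
## [1] 16 7
#16 rows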
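Negative row indices inside `[` refer to positions, not row names, and a position beyond the last row is silently ignored. A small sketch on a made-up data frame:

```r
toy <- data.frame(x = c(10, 20, 30, 40))
toy <- toy[-2, , drop = FALSE]     # row names are now "1", "3", "4"

# Position 4 does not exist in a 3-row data frame, so nothing is removed
nrow(toy[-4, , drop = FALSE])      # 3

# Removing the last row by position
nrow(toy[-nrow(toy), , drop = FALSE])   # 2

# head() with a negative n drops rows from the end
nrow(head(toy, -1))                # 2
```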
#Adding a new row containing the mean of each numeric variable, with Month = 5 and Day = 6
rbind(airquality_subset2,
      list(mean(airquality_subset2$Ozone),
           mean(airquality_subset2$Solar),
           mean(airquality_subset2$Wind),
           mean(airquality_subset2$Temp_F),
           5, 6,
           mean(airquality_subset2$Temp_C)))
## Ozone Solar Wind Temp_F Month Day Temp_C
## 1 23.00000 13.0000 12.000000 67.00000 5 28 19.00000
## 2 79.00000 187.0000 5.100000 87.00000 7 19 31.00000
## 4 110.00000 207.0000 8.000000 90.00000 8 9 32.00000
## 5 31.00000 244.0000 10.900000 78.00000 8 19 26.00000
## 6 9.00000 24.0000 10.900000 71.00000 9 14 22.00000
## 7 24.00000 259.0000 9.700000 73.00000 9 10 23.00000
## 8 13.00000 238.0000 12.600000 64.00000 9 21 18.00000
## 9 21.00000 230.0000 10.900000 75.00000 9 9 24.00000
## 12 50.00000 275.0000 7.400000 86.00000 7 29 30.00000
## 13 97.00000 272.0000 5.700000 92.00000 7 9 33.00000
## 14 61.00000 285.0000 6.300000 84.00000 7 18 29.00000
## 15 45.00000 212.0000 9.700000 79.00000 8 24 26.00000
## 16 14.00000 274.0000 10.900000 68.00000 5 14 20.00000
## 17 73.00000 183.0000 2.800000 93.00000 9 3 34.00000
## 18 135.00000 269.0000 4.100000 84.00000 7 1 29.00000
## 19 18.00000 313.0000 11.500000 62.00000 5 4 17.00000
## 20 23.00000 14.0000 9.200000 71.00000 9 22 22.00000
## 181 48.58824 205.8235 8.688235 77.88235 5 6 25.58824
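Passing a bare list to rbind() relies on the values being in the same order as the columns. A slightly more defensive sketch builds the new row as a one-row data frame with named columns. The toy data frame and its values are made up for illustration:

```r
toy <- data.frame(Ozone = c(20, 40, 60), Wind = c(8, 10, 12))

# One-row data frame of column means; rbind() matches columns by name
mean_row <- data.frame(Ozone = mean(toy$Ozone), Wind = mean(toy$Wind))
toy_plus <- rbind(toy, mean_row)

toy_plus[nrow(toy_plus), ]      # Ozone 40, Wind 10
colMeans(toy)                   # the same means, computed in one call
```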
#Round to nearest whole number again
airquality_subset2 %>%
  mutate_if(is.numeric, round, 0)
## Ozone Solar Wind Temp_F Month Day Temp_C
## 1 23 13 12 67 5 28 19
## 2 79 187 5 87 7 19 31
## 3 110 207 8 90 8 9 32
## 4 31 244 11 78 8 19 26
## 5 9 24 11 71 9 14 22
## 6 24 259 10 73 9 10 23
## 7 13 238 13 64 9 21 18
## 8 21 230 11 75 9 9 24
## 9 50 275 7 86 7 29 30
## 10 97 272 6 92 7 9 33
## 11 61 285 6 84 7 18 29
## 12 45 212 10 79 8 24 26
## 13 14 274 11 68 5 14 20
## 14 73 183 3 93 9 3 34
## 15 135 269 4 84 7 1 29
## 16 18 313 12 62 5 4 17
## 17 23 14 9 71 9 22 22
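mutate_if() still works, but recent 'dplyr' versions mark it as superseded in favour of across(). As a package-free sketch of the same idea, the base R code below rounds only the numeric columns of a made-up data frame and leaves the factor column untouched:

```r
toy <- data.frame(
  Wind  = c(7.4, 12.6, 14.9),
  Month = factor(c(5, 5, 6))
)

# Round numeric columns; return other columns unchanged
toy[] <- lapply(toy, function(col) if (is.numeric(col)) round(col, 0) else col)

toy$Wind                        # 7 13 15
is.factor(toy$Month)            # TRUE
```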