Introduction

What is a data frame?

A data frame is a two-dimensional data structure in R used to store data tables. It is made up of three principal components: the data, the rows, and the columns.
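
To make the three components concrete, here is a minimal sketch using a small, made-up data frame. The object name `df` and its values are assumptions for illustration only and are not part of the airquality dataset.

```r
# A small, made-up data frame: 3 rows (observations) and 2 columns (variables)
df <- data.frame(
  city = c("A", "B", "C"),   # hypothetical labels
  temp = c(67, 72, 74)       # hypothetical readings
)

dim(df)       # number of rows, then number of columns: 3 2
rownames(df)  # default row names: "1" "2" "3"
colnames(df)  # column names: "city" "temp"
```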

For this assignment, I am using the 'airquality' dataset. It contains daily air quality measurements taken in New York from May to September 1973: 153 observations of 6 variables, namely Ozone, Solar.R, Wind, Temp, Month and Day.

#First, print the first 6 rows of the airquality dataset.
data(airquality)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
#Column names
colnames(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Packages Info

Here I load the 'dplyr' package into R.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Preparation

As the output of str() below shows, the 'Day' and 'Month' variables are stored as integers. They should be treated as factors instead, as they represent the day of the month (1, ..., 31) and the month (January = 1, February = 2, ..., December = 12) respectively.

str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
#Changing them into factors
airquality$Day <- as.factor(airquality$Day)
airquality$Month <- as.factor(airquality$Month)
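
As a minimal sketch of what the conversion does, the example below applies as.factor() to a short, made-up integer vector (the names `m` and `f` are hypothetical). It also shows a common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original values.

```r
# Hypothetical month values, stored as integers like the 'Month' column
m <- c(5L, 5L, 6L, 9L)
f <- as.factor(m)

levels(f)      # "5" "6" "9": one level per distinct value
is.numeric(f)  # FALSE: a factor is categorical, not numeric

# as.numeric(f) gives the level codes (1 1 2 3), not the months.
as.numeric(f)

# Converting through character recovers the original values: 5 5 6 9
as.numeric(as.character(f))
```

This matters later on: once Month and Day are factors, functions that select numeric columns will skip them.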

For simplicity, I will use a random subset of 20 observations from the original airquality dataset.

#Setting the seed to get the same set of random numbers 
set.seed(1234)
airquality_subset <- sample_n(airquality, 20)
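
The effect of the seed can be checked with a short sketch in base R. It uses sample() on row positions rather than sample_n(), and the names `first_draw` and `second_draw` are made up for illustration. Note that the exact rows drawn for a given seed can differ between R versions, since the default sampling method changed in R 3.6.0.

```r
# Draw 20 row positions out of 153, twice, resetting the seed each time
set.seed(1234)
first_draw <- sample(1:153, 20)

set.seed(1234)
second_draw <- sample(1:153, 20)

# Same seed, same draw
identical(first_draw, second_draw)  # TRUE
```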


#Renaming Solar.R to Solar
a <- airquality_subset %>%
  rename(Solar = Solar.R)

Since the temperature in the airquality dataset is listed in Fahrenheit in the 'Temp' column, let's add a new column named Temp_C with the temperature converted to Celsius and rounded to the nearest whole number.

I have renamed the original 'Temp' column to Temp_F for clarity.

I have also removed the rows that contain 'NA' values.

airquality_subset2 <- a %>% 
  rename(Temp_F = Temp) %>%
  mutate(Temp_C = round((Temp_F - 32) * (5/9),0)) %>%
  na.omit()

dim(airquality_subset2)
## [1] 17  7
#17 rows left after removing NA
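
As a quick check of the conversion formula, here is a minimal sketch that applies (F - 32) * 5/9 to three Temp values that appear in the subset, followed by na.omit() on a small made-up data frame. The names `temp_f`, `temp_c` and `df` are assumptions for illustration.

```r
# Fahrenheit values that appear in the Temp column of the subset
temp_f <- c(67, 84, 93)

# Celsius = (Fahrenheit - 32) * 5/9
temp_c <- (temp_f - 32) * 5 / 9   # about 19.44, 28.89, 33.89
round(temp_c, 0)                  # 19 29 34

# na.omit() drops every row that has an NA in any column
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
nrow(na.omit(df))                 # 1: only the first row is complete
```

The rounded values agree with the Temp_C entries shown for those rows in the output further down.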

Data Analysis

First, I find the maximum Ozone reading. Then I count how many observations in the subset have that maximum reading.

maxOzone <- max(airquality_subset2$Ozone)
maxOzone
## [1] 135
#Keep only the rows with the highest Ozone reading.
maxOzone_subset <- airquality_subset2 %>%
  filter(Ozone == maxOzone)
maxOzone_subset
##   Ozone Solar Wind Temp_F Month Day Temp_C
## 1   135   269  4.1     84     7   1     29
#Number of observations with the max Ozone reading in this subset
nrow(maxOzone_subset)
## [1] 1
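
#The maximum above works directly because the NA rows were already removed.
#A minimal base R sketch, on a made-up vector named 'ozone', of how the same
#steps behave when missing values are still present:

```r
# Hypothetical readings with one missing value
ozone <- c(41, 36, NA, 135, 18)

max(ozone)                  # NA, because one reading is missing
max(ozone, na.rm = TRUE)    # 135

# Count the readings equal to the maximum, ignoring the NA
sum(ozone == max(ozone, na.rm = TRUE), na.rm = TRUE)  # 1

# Position of the first maximum (NA values are skipped)
which.max(ozone)            # 4
```
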
#Playing around more with the data

#Sorting the dataset in ascending order of 'Ozone'
arrange(airquality_subset2, Ozone)
##    Ozone Solar Wind Temp_F Month Day Temp_C
## 1      9    24 10.9     71     9  14     22
## 2     13   238 12.6     64     9  21     18
## 3     14   274 10.9     68     5  14     20
## 4     18   313 11.5     62     5   4     17
## 5     21   230 10.9     75     9   9     24
## 6     23    13 12.0     67     5  28     19
## 7     23    14  9.2     71     9  22     22
## 8     24   259  9.7     73     9  10     23
## 9     31   244 10.9     78     8  19     26
## 10    45   212  9.7     79     8  24     26
## 11    50   275  7.4     86     7  29     30
## 12    61   285  6.3     84     7  18     29
## 13    73   183  2.8     93     9   3     34
## 14    79   187  5.1     87     7  19     31
## 15    97   272  5.7     92     7   9     33
## 16   110   207  8.0     90     8   9     32
## 17   135   269  4.1     84     7   1     29
#Remove the first row
nofirstrow <- airquality_subset2[-1,]
dim(nofirstrow)
## [1] 16  7
#16 rows

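#Remove the last row
#The subset has 17 rows after na.omit(), so index -20 would drop nothing;
#use nrow() to refer to the last row by position.
nolastrow <- airquality_subset2[-nrow(airquality_subset2),]
dim(nolastrow)
## [1] 16  7
#16 rows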
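
#Negative row indices drop rows by position, not by row name.
#A minimal sketch on a made-up 5-row data frame named 'df':

```r
# Hypothetical 5-row data frame
df <- data.frame(id = 1:5)

nrow(df[-1, , drop = FALSE])         # 4: first row dropped
nrow(df[-20, , drop = FALSE])        # 5: there is no row 20, so nothing is dropped
nrow(df[-nrow(df), , drop = FALSE])  # 4: last row dropped
nrow(head(df, -1))                   # 4: same result with head()
```
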

#Adding a new row to the subset: the mean of each numeric variable, with Month = 5 and Day = 6
rbind(airquality_subset2,
      list(mean(airquality_subset2$Ozone),
           mean(airquality_subset2$Solar),
           mean(airquality_subset2$Wind),
           mean(airquality_subset2$Temp_F),
           5, 6,
           mean(airquality_subset2$Temp_C)))
##         Ozone    Solar      Wind   Temp_F Month Day   Temp_C
## 1    23.00000  13.0000 12.000000 67.00000     5  28 19.00000
## 2    79.00000 187.0000  5.100000 87.00000     7  19 31.00000
## 4   110.00000 207.0000  8.000000 90.00000     8   9 32.00000
## 5    31.00000 244.0000 10.900000 78.00000     8  19 26.00000
## 6     9.00000  24.0000 10.900000 71.00000     9  14 22.00000
## 7    24.00000 259.0000  9.700000 73.00000     9  10 23.00000
## 8    13.00000 238.0000 12.600000 64.00000     9  21 18.00000
## 9    21.00000 230.0000 10.900000 75.00000     9   9 24.00000
## 12   50.00000 275.0000  7.400000 86.00000     7  29 30.00000
## 13   97.00000 272.0000  5.700000 92.00000     7   9 33.00000
## 14   61.00000 285.0000  6.300000 84.00000     7  18 29.00000
## 15   45.00000 212.0000  9.700000 79.00000     8  24 26.00000
## 16   14.00000 274.0000 10.900000 68.00000     5  14 20.00000
## 17   73.00000 183.0000  2.800000 93.00000     9   3 34.00000
## 18  135.00000 269.0000  4.100000 84.00000     7   1 29.00000
## 19   18.00000 313.0000 11.500000 62.00000     5   4 17.00000
## 20   23.00000  14.0000  9.200000 71.00000     9  22 22.00000
## 181  48.58824 205.8235  8.688235 77.88235     5   6 25.58824
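
#The same idea can be written with colMeans(), which avoids repeating mean()
#for every column. This is only a sketch on a made-up data frame; the names
#'df', 'num_means', 'new_row' and 'combined' are hypothetical.

```r
# Hypothetical data frame with two numeric columns and one factor column
df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 60),
  g = factor(c("x", "y", "x"))
)

# Means of the numeric columns only (the factor column is skipped)
num_means <- colMeans(df[sapply(df, is.numeric)])

# Build the new row with matching column names, choosing the factor value by hand
new_row <- data.frame(
  a = num_means[["a"]],
  b = num_means[["b"]],
  g = factor("x", levels = levels(df$g))
)

combined <- rbind(df, new_row)
nrow(combined)   # 4
```
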
#Round all numeric columns to the nearest whole number
airquality_subset2 %>%
  mutate_if(is.numeric, round, 0)
##    Ozone Solar Wind Temp_F Month Day Temp_C
## 1     23    13   12     67     5  28     19
## 2     79   187    5     87     7  19     31
## 3    110   207    8     90     8   9     32
## 4     31   244   11     78     8  19     26
## 5      9    24   11     71     9  14     22
## 6     24   259   10     73     9  10     23
## 7     13   238   13     64     9  21     18
## 8     21   230   11     75     9   9     24
## 9     50   275    7     86     7  29     30
## 10    97   272    6     92     7   9     33
## 11    61   285    6     84     7  18     29
## 12    45   212   10     79     8  24     26
## 13    14   274   11     68     5  14     20
## 14    73   183    3     93     9   3     34
## 15   135   269    4     84     7   1     29
## 16    18   313   12     62     5   4     17
## 17    23    14    9     71     9  22     22
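
#In recent dplyr releases, the scoped verbs such as mutate_if() are superseded
#by across(). A minimal sketch of the equivalent call, assuming dplyr 1.0.0 or
#later and a made-up data frame named 'df':

```r
library(dplyr)

# Hypothetical data frame with one numeric column and one factor column
df <- data.frame(
  wind  = c(7.4, 12.6, 14.3),
  month = factor(c(5, 5, 6))
)

# Round every numeric column; factor columns are left untouched
rounded <- df %>%
  mutate(across(where(is.numeric), ~ round(.x, 0)))

rounded$wind              # 7 13 14
is.factor(rounded$month)  # TRUE
```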