R Bridge Assignment 2

By Md Forhad Akbar
July 28, 2019

Data preparation: Read the data into R and display the data table

theUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/airquality.csv"
nyairquality <- read.table(file = theUrl, header = TRUE, sep = ",")
head(nyairquality)
##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3
## 4 4    18     313 11.5   62     5   4
## 5 5    NA      NA 14.3   56     5   5
## 6 6    28      NA 14.9   66     5   6
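The CSV path (csv/datasets/airquality.csv) suggests it is an export of the `airquality` data set that ships with R's `datasets` package, with the row names written out as the extra `X` column. Assuming that is the case, the following sketch rebuilds an equivalent table without a network connection and checks its shape; the name `aq_local` is only illustrative.

```r
# Assumption: the CSV mirrors R's built-in `airquality` data set, with the
# row names exported as an extra first column named X.
aq_local <- cbind(X = seq_len(nrow(airquality)), airquality)

# Check the shape and column types before doing any analysis.
str(aq_local)
dim(aq_local)
```

Running `str()` right after loading is a quick way to confirm that every column was read as a number rather than as text.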

1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

# Summary statistics for the main data table, nyairquality
summary(nyairquality)
##        X           Ozone           Solar.R           Wind       
##  Min.   :  1   Min.   :  1.00   Min.   :  7.0   Min.   : 1.700  
##  1st Qu.: 39   1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400  
##  Median : 77   Median : 31.50   Median :205.0   Median : 9.700  
##  Mean   : 77   Mean   : 42.13   Mean   :185.9   Mean   : 9.958  
##  3rd Qu.:115   3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500  
##  Max.   :153   Max.   :168.00   Max.   :334.0   Max.   :20.700  
##                NA's   :37       NA's   :7                       
##       Temp           Month            Day      
##  Min.   :56.00   Min.   :5.000   Min.   : 1.0  
##  1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
##  Median :79.00   Median :7.000   Median :16.0  
##  Mean   :77.88   Mean   :6.993   Mean   :15.8  
##  3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :97.00   Max.   :9.000   Max.   :31.0  
## 
# Mean and median of column (attribute) "Wind" (from the main table)
mean(nyairquality$Wind)
## [1] 9.957516
median(nyairquality$Wind)
## [1] 9.7
# Mean and median of column (attribute) "Temp" (from the main table)
mean(nyairquality$Temp)
## [1] 77.88235
median(nyairquality$Temp)
## [1] 79
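Wind and Temp have no missing values, so `mean()` and `median()` work on them directly. The summary above shows that Ozone has 37 NA's, and for such a column both functions return NA unless `na.rm = TRUE` is passed. A minimal sketch, using R's built-in `airquality` data on the assumption that it matches the CSV:

```r
# Uses R's built-in `airquality` data, assumed to match the CSV read above.
ozone <- airquality$Ozone

sum(is.na(ozone))             # number of missing Ozone readings
mean(ozone)                   # NA, because missing values propagate
mean(ozone, na.rm = TRUE)     # mean of the observed readings only
median(ozone, na.rm = TRUE)   # median of the observed readings only
```

The last two results should agree with the Ozone mean and median reported by `summary()`, which drops NA's automatically.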

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

# Keep the first 20 rows and columns 1 (X), 4 (Wind) and 5 (Temp)
nyairqualityDframe <- nyairquality[1:20, c(1, 4:5)]
nyairqualityDframe
##     X Wind Temp
## 1   1  7.4   67
## 2   2  8.0   72
## 3   3 12.6   74
## 4   4 11.5   62
## 5   5 14.3   56
## 6   6 14.9   66
## 7   7  8.6   65
## 8   8 13.8   59
## 9   9 20.1   61
## 10 10  8.6   69
## 11 11  6.9   74
## 12 12  9.7   69
## 13 13  9.2   66
## 14 14 10.9   68
## 15 15 13.2   58
## 16 16 11.5   64
## 17 17 12.0   66
## 18 18 18.4   57
## 19 19 11.5   68
## 20 20  9.7   62
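Selecting columns by position depends on the column order in the file. As a sketch of an alternative, the same subset can be taken by column name. The table `aq` below is rebuilt from R's built-in `airquality` data and is assumed to match the CSV; the names `by_position` and `by_name` are only illustrative.

```r
# Assumption: `aq` reproduces the table read from the CSV (X = row index).
aq <- cbind(X = seq_len(nrow(airquality)), airquality)

by_position <- aq[1:20, c(1, 4:5)]             # columns picked by index
by_name     <- aq[1:20, c("X", "Wind", "Temp")] # columns picked by name

# Both approaches should give the same data frame.
identical(by_position, by_name)
```

Name-based selection keeps working if a column is later added or moved, at the cost of failing with an error if a column is renamed.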

3. Create new column names for the new data frame.

names(nyairqualityDframe) <- c("Sequence", "DFrame.Wind", "DFrame.Temp")

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

# Summary statistics for the subset data frame, nyairqualityDframe
summary(nyairqualityDframe)
##     Sequence      DFrame.Wind     DFrame.Temp   
##  Min.   : 1.00   Min.   : 6.90   Min.   :56.00  
##  1st Qu.: 5.75   1st Qu.: 9.05   1st Qu.:61.75  
##  Median :10.50   Median :11.50   Median :66.00  
##  Mean   :10.50   Mean   :11.64   Mean   :65.15  
##  3rd Qu.:15.25   3rd Qu.:13.35   3rd Qu.:68.25  
##  Max.   :20.00   Max.   :20.10   Max.   :74.00
# Mean and median of column (attribute) "DFrame.Wind" (from the subset, nyairqualityDframe)
mean(nyairqualityDframe$DFrame.Wind)
## [1] 11.64
median(nyairqualityDframe$DFrame.Wind)
## [1] 11.5
# Mean and median of column (attribute) "DFrame.Temp" (from the subset, nyairqualityDframe)
mean(nyairqualityDframe$DFrame.Temp)
## [1] 65.15
median(nyairqualityDframe$DFrame.Temp)
## [1] 66
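To compare the two tables side by side rather than reading the numbers off separate outputs, the four statistics can be collected into one small matrix. This is only a sketch: it uses R's built-in `airquality` data on the assumption that it matches the CSV, and `stats` and `comparison` are illustrative names.

```r
# Assumption: built-in `airquality` matches the CSV used in this report.
first20 <- airquality[1:20, ]

# Helper returning the mean and median of Wind and Temp for a data frame.
stats <- function(df) {
  c(Wind.mean = mean(df$Wind), Wind.median = median(df$Wind),
    Temp.mean = mean(df$Temp), Temp.median = median(df$Temp))
}

# One row per table, one column per statistic.
comparison <- rbind(full = stats(airquality), first20 = stats(first20))
round(comparison, 2)
```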

Comparison Comment 1: I chose two attributes (Wind and Temp), together with the row index X, and the first 20 observations from the main data table, nyairquality, to create a subset called nyairqualityDframe.

Comparison Comment 2: The mean and median of Wind in the subset (DFrame.Wind) are 11.64 and 11.50, respectively, both higher than those of the main data table (Wind): 9.96 and 9.70, respectively.

Comparison Comment 3: The mean and median of Temp in the subset (DFrame.Temp) are 65.15 and 66.00, respectively, both lower than those of the main data table (Temp): 77.88 and 79.00, respectively.

Comparison Comment 4: The subset means and medians are clearly biased because I selected only the first 20 rows, with no randomization. Those rows all fall in May (days 1 to 20), so they are not representative of the full May to September period.
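One way to reduce this selection bias would be to draw the 20 rows at random instead of taking the first 20. The sketch below is not part of the original assignment answer: it uses R's built-in `airquality` data (assumed to match the CSV), and the seed value and the names `idx` and `random20` are arbitrary choices for illustration.

```r
# Assumption: built-in `airquality` matches the CSV used in this report.
set.seed(2019)  # arbitrary seed, only so the draw can be repeated

# Draw 20 distinct row numbers from the whole table.
idx <- sample(nrow(airquality), size = 20)
random20 <- airquality[idx, c("Wind", "Temp")]

colMeans(random20)        # means of the random subset
sapply(random20, median)  # medians of the random subset
```

With only 20 rows the results will still vary from draw to draw, but they should no longer be tied to a single month.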

5. For at least 3 values in a column, please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

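# Recode three month numbers as names. The first assignment coerces Month to
# character, so months 7 and 8 remain as the strings "7" and "8".
nyairquality$Month[nyairquality$Month == 5] <- "May"
nyairquality$Month[nyairquality$Month == 6] <- "June"
nyairquality$Month[nyairquality$Month == 9] <- "September"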
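The three assignments above rename months 5, 6 and 9 one value at a time. As a possible alternative, R's built-in `month.name` vector can recode every month in a single step by using the month number as an index. This sketch works on a copy of the built-in `airquality` data (assumed to match the CSV); `aq` is an illustrative name.

```r
# Assumption: built-in `airquality` matches the CSV used in this report.
aq <- airquality

# month.name[5] is "May", month.name[6] is "June", and so on; storing the
# result as a factor keeps the months in calendar order when tabulated.
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

table(aq$Month)  # number of rows per month
```

This version also covers July and August, which the step-by-step replacement leaves as numbers.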

6. Display enough rows to see examples of all of steps 1-5 above.

head(nyairquality)
##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67   May   1
## 2 2    36     118  8.0   72   May   2
## 3 3    12     149 12.6   74   May   3
## 4 4    18     313 11.5   62   May   4
## 5 5    NA      NA 14.3   56   May   5
## 6 6    28      NA 14.9   66   May   6
nyairquality[50:55, 1:7]
##     X Ozone Solar.R Wind Temp Month Day
## 50 50    12     120 11.5   73  June  19
## 51 51    13     137 10.3   76  June  20
## 52 52    NA     150  6.3   77  June  21
## 53 53    NA      59  1.7   76  June  22
## 54 54    NA      91  4.6   76  June  23
## 55 55    NA     250  6.3   76  June  24
tail(nyairquality)
##       X Ozone Solar.R Wind Temp     Month Day
## 148 148    14      20 16.6   63 September  25
## 149 149    30     193  6.9   70 September  26
## 150 150    NA     145 13.2   77 September  27
## 151 151    14     191 14.3   75 September  28
## 152 152    18     131  8.0   76 September  29
## 153 153    20     223 11.5   68 September  30

7. BONUS – place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

theUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/airquality.csv"
nyairquality <- read.table(file = theUrl, header = TRUE, sep = ",")
head(nyairquality)
##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3
## 4 4    18     313 11.5   62     5   4
## 5 5    NA      NA 14.3   56     5   5
## 6 6    28      NA 14.9   66     5   6