Setup

Load the locally saved CSV data file that is used in the following tasks. The file contains environmental conditions in New York City. Data is courtesy of vincentarelbundock.

nyc.environment <- read.table(file = "environmental.csv", header = TRUE, sep = ",")

Task 1

Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

Summary

summary(nyc.environment)
##        X             ozone         radiation      temperature   
##  Min.   :  1.0   Min.   :  1.0   Min.   :  7.0   Min.   :57.00  
##  1st Qu.: 28.5   1st Qu.: 18.0   1st Qu.:113.5   1st Qu.:71.00  
##  Median : 56.0   Median : 31.0   Median :207.0   Median :79.00  
##  Mean   : 56.0   Mean   : 42.1   Mean   :184.8   Mean   :77.79  
##  3rd Qu.: 83.5   3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:84.50  
##  Max.   :111.0   Max.   :168.0   Max.   :334.0   Max.   :97.00  
##       wind       
##  Min.   : 2.300  
##  1st Qu.: 7.400  
##  Median : 9.700  
##  Mean   : 9.939  
##  3rd Qu.:11.500  
##  Max.   :20.700

Mean and median of temperature

mean(nyc.environment$temperature)
## [1] 77.79279
median(nyc.environment$temperature)
## [1] 79

Mean and median of radiation

mean(nyc.environment$radiation)
## [1] 184.8018
median(nyc.environment$radiation)
## [1] 207

Save values for future use

# Save temperature values for future use
df.stats <- data.frame(Mean = mean(nyc.environment$temperature), Median = median(nyc.environment$temperature), row.names = "FullSet-Temp")

# Add radiation values for future use 
df.stats <- rbind(df.stats, data.frame(Mean = mean(nyc.environment$radiation), Median = median(nyc.environment$radiation), row.names = "FullSet-Rad"))
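
The per-column calls above can also be collected in one pass. Below is a minimal sketch, assuming a small stand-in data frame built from the first six rows of the data set rather than the full CSV. The names env and stats are illustrative only and are not used elsewhere in this report.

```r
# Stand-in data: first six rows of the data set (assumption: used here only
# so the example runs without environmental.csv)
env <- data.frame(
  radiation   = c(190, 118, 149, 313, 299, 99),
  temperature = c(67, 72, 74, 62, 65, 59)
)

# Apply mean and median to each selected column;
# sapply returns one column per attribute, so transpose to get one row each
stats <- t(sapply(env[c("temperature", "radiation")],
                  function(x) c(Mean = mean(x), Median = median(x))))

stats
```

The result is a matrix with one row per attribute and the columns Mean and Median, the same layout as df.stats.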

Task 2

Create a new data frame with a subset of the columns and rows. Make sure to rename it.

# Select original row number (column 1), radiation and temperature for 10 random rows,
# rename columns, sort the selected subset, rename rows
df.nyc <- nyc.environment[sample(nrow(nyc.environment), 10), c(1, 3:4)]
names(df.nyc) <- c("orig.row","rad","temp")
df.nyc <- df.nyc[order(df.nyc$orig.row), ]
row.names(df.nyc) <- 1:10
df.nyc
##    orig.row rad temp
## 1        10 274   68
## 2        11  65   58
## 3        35 248   85
## 4        51   7   74
## 5        53 223   85
## 6        57 275   86
## 7        61  24   81
## 8        64 229   90
## 9        80 225   94
## 10       98 237   78
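
Because sample() draws different rows on every run, the subset above changes each time the report is rebuilt. Below is a minimal sketch of a repeatable draw, assuming a stand-in data frame (env) and an arbitrary seed value. Neither is part of the original solution, and the helper name draw.subset is hypothetical.

```r
# Stand-in data: first six rows of the data set (assumption for illustration)
env <- data.frame(
  X           = 1:6,
  radiation   = c(190, 118, 149, 313, 299, 99),
  temperature = c(67, 72, 74, 62, 65, 59)
)

# Hypothetical helper: fix the random seed, draw n row indices,
# and sort them so the subset keeps the original row order
draw.subset <- function(data, n, seed) {
  set.seed(seed)
  idx <- sort(sample(nrow(data), n))
  data[idx, c("X", "radiation", "temperature")]
}

first.draw  <- draw.subset(env, 3, seed = 42)  # 42 is an arbitrary choice
second.draw <- draw.subset(env, 3, seed = 42)

# Same seed, same rows
identical(first.draw, second.draw)
```

Sorting the indices before subsetting replaces the separate order() step.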

Task 3

Create new column names for the new data frame.

Note: I had already renamed the columns in task 2 before I got to this task, so I decided to try another way of doing it here. Besides, I prefer more descriptive full names anyway.

df.nyc <- setNames(df.nyc, c("OriginalRow","Radiation","Temperature"))
df.nyc
##    OriginalRow Radiation Temperature
## 1           10       274          68
## 2           11        65          58
## 3           35       248          85
## 4           51         7          74
## 5           53       223          85
## 6           57       275          86
## 7           61        24          81
## 8           64       229          90
## 9           80       225          94
## 10          98       237          78

Task 4

Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

Summary

summary(df.nyc)
##   OriginalRow      Radiation      Temperature   
##  Min.   :10.00   Min.   :  7.0   Min.   :58.00  
##  1st Qu.:39.00   1st Qu.:104.5   1st Qu.:75.00  
##  Median :55.00   Median :227.0   Median :83.00  
##  Mean   :52.00   Mean   :180.7   Mean   :79.90  
##  3rd Qu.:63.25   3rd Qu.:245.2   3rd Qu.:85.75  
##  Max.   :98.00   Max.   :275.0   Max.   :94.00

Mean and median of temperature

mean(df.nyc$Temperature)
## [1] 79.9
median(df.nyc$Temperature)
## [1] 83

Mean and median of radiation

mean(df.nyc$Radiation)
## [1] 180.7
median(df.nyc$Radiation)
## [1] 227

Save new values for comparison

# Add subset temperature values
df.stats <- rbind(df.stats, data.frame(Mean = mean(df.nyc$Temperature), Median = median(df.nyc$Temperature), row.names = "SubSet-Temp"))

# Add subset radiation values 
df.stats <- rbind(df.stats, data.frame(Mean = mean(df.nyc$Radiation), Median = median(df.nyc$Radiation), row.names = "SubSet-Rad"))

df.stats
##                   Mean Median
## FullSet-Temp  77.79279     79
## FullSet-Rad  184.80180    207
## SubSet-Temp   79.90000     83
## SubSet-Rad   180.70000    227

Comparison: The mean and median of the subset differ from those of the full data set, but because the subset is a random sample they should be reasonably close. In the run shown here the means are close (temperature 79.9 versus 77.79, radiation 180.7 versus 184.8), while the medians differ more (temperature 83 versus 79, radiation 227 versus 207). With a sample of only 10 rows, larger deviations are possible. I was also slightly unsure whether you wanted a descriptive comparison or an actual if…then comparison with a printout. As you can see, I went with a descriptive comparison/analysis.
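
One way to put numbers on this comparison is to express each subset statistic as a percentage difference from the full-set value. The sketch below re-enters the values from the df.stats printout so that it runs on its own. The names full, subset.stats and pct.diff are illustrative, and the figures apply only to the random sample shown in this run.

```r
# Values copied from the df.stats printout above
df.stats <- data.frame(
  Mean   = c(77.79279, 184.80180, 79.9, 180.7),
  Median = c(79, 207, 83, 227),
  row.names = c("FullSet-Temp", "FullSet-Rad", "SubSet-Temp", "SubSet-Rad")
)

# Split into full-set and subset rows, in matching order (Temp, Rad)
full         <- as.matrix(df.stats[c("FullSet-Temp", "FullSet-Rad"), ])
subset.stats <- as.matrix(df.stats[c("SubSet-Temp", "SubSet-Rad"), ])

# Percentage difference of the subset relative to the full set
pct.diff <- 100 * (subset.stats - full) / full
rownames(pct.diff) <- c("Temp", "Rad")

round(pct.diff, 1)
```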

Task 5

For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent.”

Please note that my data set has only numeric values, which are not well suited to this task. I decided to add a character column derived from my data and to do the renaming on that column. The solution may include more than was requested, but it was great practice for me.

# Add a column to hold Beaufort wind force scale with possible values applicable to the data set
nyc.environment <- cbind(nyc.environment, beaufort = factor(NA, levels = c("la", "lb", "gb", "mb", "fb")))

# Populate Beaufort scale based on wind speed
nyc.environment[nyc.environment$wind <= 3, "beaufort"] <- "la"
nyc.environment[nyc.environment$wind > 3 & nyc.environment$wind <= 7, "beaufort"] <- "lb"
nyc.environment[nyc.environment$wind > 7 & nyc.environment$wind <= 12, "beaufort"] <- "gb"
nyc.environment[nyc.environment$wind > 12 & nyc.environment$wind <= 20, "beaufort"] <- "mb"
nyc.environment[nyc.environment$wind > 20, "beaufort"] <- "fb"

# Quick data check
head(nyc.environment)
##   X ozone radiation temperature wind beaufort
## 1 1    41       190          67  7.4       gb
## 2 2    36       118          72  8.0       gb
## 3 3    12       149          74 12.6       mb
## 4 4    18       313          62 11.5       gb
## 5 5    23       299          65  8.6       gb
## 6 6    19        99          59 13.8       mb
# Convert Beaufort scale column from factor to character in order to add new values previously not included
v.beaufort <- as.character(nyc.environment[,"beaufort"])

# Replace old values with more descriptive new values
v.beaufort[v.beaufort == "la"] <- "light air"
v.beaufort[v.beaufort == "lb"] <- "light breeze"
v.beaufort[v.beaufort == "gb"] <- "gentle breeze"
v.beaufort[v.beaufort == "mb"] <- "moderate breeze"
v.beaufort[v.beaufort == "fb"] <- "fresh breeze"

# Convert back to a factor and replace old factor with new one in the data set
nyc.environment[,"beaufort"] <- as.factor(v.beaufort)
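
The same binning and renaming can be written more compactly with cut() and levels(). This is a sketch of an alternative, not the approach used above. It assumes the same break points as the code in this task and uses the first six wind values from the data set as stand-in input.

```r
# Stand-in data: wind speeds from the first six rows of the data set
wind <- c(7.4, 8.0, 12.6, 11.5, 8.6, 13.8)

# cut() uses right-closed intervals by default, i.e. (3, 7], (7, 12], ...,
# which matches the "> lower & <= upper" conditions used above
beaufort <- cut(wind,
                breaks = c(-Inf, 3, 7, 12, 20, Inf),
                labels = c("la", "lb", "gb", "mb", "fb"))

# Rename every level in one assignment; each value carrying the old level
# is renamed with it, so no conversion to character is needed
levels(beaufort) <- c("light air", "light breeze", "gentle breeze",
                      "moderate breeze", "fresh breeze")

as.character(beaufort)
```

Assigning to levels() renames the categories in place, so the factor never has to be converted to character and back.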

Task 6

Display enough rows to see examples of all of steps 1-5 above.

Please note that the results for tasks 1 through 4 were displayed within each task, so the sample of the data below mostly serves to demonstrate the solution to task 5.

nyc.environment[1:15, ]
##     X ozone radiation temperature wind        beaufort
## 1   1    41       190          67  7.4   gentle breeze
## 2   2    36       118          72  8.0   gentle breeze
## 3   3    12       149          74 12.6 moderate breeze
## 4   4    18       313          62 11.5   gentle breeze
## 5   5    23       299          65  8.6   gentle breeze
## 6   6    19        99          59 13.8 moderate breeze
## 7   7     8        19          61 20.1    fresh breeze
## 8   8    16       256          69  9.7   gentle breeze
## 9   9    11       290          66  9.2   gentle breeze
## 10 10    14       274          68 10.9   gentle breeze
## 11 11    18        65          58 13.2 moderate breeze
## 12 12    14       334          64 11.5   gentle breeze
## 13 13    34       307          66 12.0   gentle breeze
## 14 14     6        78          57 18.4 moderate breeze
## 15 15    30       322          68 11.5   gentle breeze

Task 7 (Bonus)

Place the original .csv in a GitHub file and have R read from the link.

require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
# Load data file from GitHub
git.nyc.environment <- read.csv(text=getURL("https://raw.githubusercontent.com/ilyakats/CUNY-R-Bridge-Workshop/master/git-environmental.csv"), header = TRUE, sep = ",")

# Quick data check
head(git.nyc.environment)
##   X ozone radiation temperature wind
## 1 1    41       190          67  7.4
## 2 2    36       118          72  8.0
## 3 3    12       149          74 12.6
## 4 4    18       313          62 11.5
## 5 5    23       299          65  8.6
## 6 6    19        99          59 13.8
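
The call above works because getURL() returns the file contents as a single character string, and the text argument of read.csv() parses a string instead of a file. The sketch below shows that parsing step offline. The string csv.text is a hand-typed stand-in for the downloaded contents, and its exact header line is an assumption. Recent versions of R may also accept an https URL directly in read.csv(), which would make RCurl unnecessary, but check that on your own installation.

```r
# Hand-typed stand-in for the string that getURL() would return
# (assumption: the header line matches the column names shown above)
csv.text <- paste(
  "X,ozone,radiation,temperature,wind",
  "1,41,190,67,7.4",
  "2,36,118,72,8.0",
  "3,12,149,74,12.6",
  sep = "\n"
)

# read.csv() defaults to header = TRUE and sep = ",",
# so those arguments do not need to be repeated
parsed <- read.csv(text = csv.text)

str(parsed)
```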

Cleanup

rm(nyc.environment, git.nyc.environment, df.nyc, df.stats)