WQD7004 - Assignment 1

Question

Write a simple R Markdown document to explain some of the R code you have learned regarding data frames. You are free to write your own program and add new code, such as how to show all the rows except the last one or how to get the last 6 rows of a data frame. E.g. you can explain what a data frame is, then write code to show how R handles data frames, and so on. One topic is mandatory: you must explain the different ways R returns a vector or a data frame when you access values from a data frame. Your program doesn’t have to be long. Play around with the different font sizes in markdown to make your markdown readable. Explore. Publish your markdown on RPubs and submit the link only. Adding new code that has not been discussed will earn you high marks.

Note: It is best to create your own data frame, or use a small one, so that it is easy to see the results after each line or every few lines of code.

Answer

A data frame is a two-dimensional table.
Each column holds one variable, and all values in a column share the same class.
Each row holds one observation, that is, one value for each column.

A data frame has the following characteristics.
1. Column names should be non-empty.
2. Row names should be unique.
3. The data stored in a data frame can be of numeric, factor, or character type (other types, such as logical, are also allowed).
4. Every column should contain the same number of items.
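
The characteristics above can be checked directly in R. The sketch below is only an illustration: the `demo` data frame and its column names are made up for this example and are not the fruit table used in the rest of this document.

```r
# Hypothetical example data (names invented for illustration)
demo <- data.frame(
  id    = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE  # keep text as character on any R version
)

# Each column has a single class
col_classes <- sapply(demo, class)
print(col_classes)

# Row names are unique; by default they are "1", "2", "3"
print(rownames(demo))

# Columns whose lengths cannot be recycled to a common length are rejected
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error"
)
print(result)
```

Note that `data.frame()` recycles a shorter column when its length divides the longer one evenly, so the error only appears for lengths such as 3 and 2.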

Create data frame

data = list(  'Fruit'=c('Apple','Watermelon','Kiwi')
              ,'Price'=c(5,1,10)
              ,'Weight_g'=c(200,1300,50) )

# Create data frame from list
df = data.frame(data)

# Check data frame structure
str(df)

## 'data.frame':    3 obs. of  3 variables:
##  $ Fruit   : chr  "Apple" "Watermelon" "Kiwi"
##  $ Price   : num  5 1 10
##  $ Weight_g: num  200 1300 50

Handling data frame columns

There are multiple ways to select columns in a data frame.

# A column can be selected by name with $
# The result is a vector.
str(df$Fruit)

##  chr [1:3] "Apple" "Watermelon" "Kiwi"

# A column can also be selected by name or number with single brackets
# The result is a data frame.
str(df['Fruit'])

## 'data.frame':    3 obs. of  1 variable:
##  $ Fruit: chr  "Apple" "Watermelon" "Kiwi"

str(df[1])

## 'data.frame':    3 obs. of  1 variable:
##  $ Fruit: chr  "Apple" "Watermelon" "Kiwi"
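
Double brackets are a third way to pick one column, and like `$` they return a vector. The sketch below rebuilds the same fruit table so it can run on its own; the variable `col_name` is a hypothetical helper added for illustration.

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

v1 <- df$Fruit        # vector
v2 <- df[["Fruit"]]   # vector, column name given as a string
v3 <- df[[1]]         # vector, column given by position
d1 <- df["Fruit"]     # one-column data frame

print(is.data.frame(v2))  # FALSE
print(is.data.frame(d1))  # TRUE

# [[ ]] also works when the column name is stored in a variable,
# whereas $ looks for a column literally called "col_name"
col_name <- "Price"
print(df[[col_name]])
print(is.null(df$col_name))  # TRUE
```

In short: `$` and `[[ ]]` give the column's contents as a vector, while single brackets without a comma keep the data frame wrapper.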

# Several columns can be selected at once with a vector
# The vector can hold column numbers or column names (a column can even be repeated)
str(df[c('Fruit','Price')])

## 'data.frame':    3 obs. of  2 variables:
##  $ Fruit: chr  "Apple" "Watermelon" "Kiwi"
##  $ Price: num  5 1 10

str(df[c(1,2)])

## 'data.frame':    3 obs. of  2 variables:
##  $ Fruit: chr  "Apple" "Watermelon" "Kiwi"
##  $ Price: num  5 1 10

str(df[c(1,1)])

## 'data.frame':    3 obs. of  2 variables:
##  $ Fruit  : chr  "Apple" "Watermelon" "Kiwi"
##  $ Fruit.1: chr  "Apple" "Watermelon" "Kiwi"

# The selecting vector can also be logical (TRUE keeps the column)
str(df[c(TRUE,TRUE,FALSE)])

## 'data.frame':    3 obs. of  2 variables:
##  $ Fruit: chr  "Apple" "Watermelon" "Kiwi"
##  $ Price: num  5 1 10
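
A logical vector does not have to be typed by hand. As a small sketch, it can be computed from the data frame itself, for example to keep only the numeric columns. The names `is_num` and `numeric_part` are my own for this example.

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column: is the column numeric?
is_num <- sapply(df, is.numeric)
print(is_num)

# Use the logical vector to keep only the numeric columns
numeric_part <- df[is_num]
print(names(numeric_part))  # "Price" "Weight_g"

# The result is still a data frame, so column-wise functions apply
print(colMeans(numeric_part))
```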

# Add a column to the data frame by assigning to a new name
df1 = df
df1$Origin = c('Oversea','Local','Oversea')
str(df1)

## 'data.frame':    3 obs. of  4 variables:
##  $ Fruit   : chr  "Apple" "Watermelon" "Kiwi"
##  $ Price   : num  5 1 10
##  $ Weight_g: num  200 1300 50
##  $ Origin  : chr  "Oversea" "Local" "Oversea"

# Alternative: bind the new column on with cbind()
df1 = cbind(df,list('Origin'=c('Oversea','Local','Oversea')))
str(df1)

## 'data.frame':    3 obs. of  4 variables:
##  $ Fruit   : chr  "Apple" "Watermelon" "Kiwi"
##  $ Price   : num  5 1 10
##  $ Weight_g: num  200 1300 50
##  $ Origin  : chr  "Oversea" "Local" "Oversea"
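
Columns can also be derived from existing ones, or removed. The following is a minimal sketch; the column names `Price_per_kg` and `Heavy` and the 500 g threshold are assumptions made up for this example.

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# Derived column computed from two existing columns
df2 <- df
df2$Price_per_kg <- df2$Price / (df2$Weight_g / 1000)

# transform() returns a modified copy and can refer to columns by bare name
df3 <- transform(df, Heavy = Weight_g > 500)

# Assigning NULL removes a column
df2$Weight_g <- NULL

print(names(df2))  # "Fruit" "Price" "Price_per_kg"
print(df3$Heavy)   # FALSE TRUE FALSE
```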

Handling data frame rows

There are multiple ways to select rows in a data frame.

# A row can be selected by row number (note the comma after the number)
# The result is a data frame.
str(df[1,])

## 'data.frame':    1 obs. of  3 variables:
##  $ Fruit   : chr "Apple"
##  $ Price   : num 5
##  $ Weight_g: num 200

# Several rows can be selected at once with a vector
# The vector can hold row numbers (a row can even be repeated)
str(df[c(1,2),])

## 'data.frame':    2 obs. of  3 variables:
##  $ Fruit   : chr  "Apple" "Watermelon"
##  $ Price   : num  5 1
##  $ Weight_g: num  200 1300

str(df[c(1,1),])

## 'data.frame':    2 obs. of  3 variables:
##  $ Fruit   : chr  "Apple" "Apple"
##  $ Price   : num  5 5
##  $ Weight_g: num  200 200

# The row index can also be a logical vector (TRUE keeps the row)
str(df[c(TRUE,TRUE,FALSE),])

## 'data.frame':    2 obs. of  3 variables:
##  $ Fruit   : chr  "Apple" "Watermelon"
##  $ Price   : num  5 1
##  $ Weight_g: num  200 1300

# Logical indexing is especially useful with a condition.
# Example: extract the row for 'Apple'
str(df[(df$Fruit == 'Apple'), ])

## 'data.frame':    1 obs. of  3 variables:
##  $ Fruit   : chr "Apple"
##  $ Price   : num 5
##  $ Weight_g: num 200
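
Conditions can be combined, and base R offers `subset()` and `which()` as alternatives to a logical index. A small sketch, with thresholds chosen only for illustration:

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# Combine two conditions with & (both must hold)
cheap_light <- df[df$Price < 8 & df$Weight_g < 500, ]
print(cheap_light)

# subset() filters rows and picks columns in one call
sub <- subset(df, Price > 2, select = c(Fruit, Price))
print(sub)

# which() turns a condition into row numbers
idx <- which(df$Price > 2)
print(idx)  # 1 3
```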

# Add a row to the data frame
# Pass numbers, not strings: quoted values such as '4' would turn
# Price and Weight_g into character columns
df1 = rbind(df,list('Orange',4,300))
str(df1)

## 'data.frame':    4 obs. of  3 variables:
##  $ Fruit   : chr  "Apple" "Watermelon" "Kiwi" "Orange"
##  $ Price   : num  5 1 10 4
##  $ Weight_g: num  200 1300 50 300
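
The assignment question also asks how to show all rows except the last one and how to get the last 6 rows. One possible way, sketched on the same small fruit table, uses `head()`, `tail()`, and negative indices; sorting with `order()` is included as an extra.

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# All rows except the last one: a negative n drops rows from the end
all_but_last <- head(df, -1)

# Same idea with a negative row index
same_result <- df[-nrow(df), ]

# Last rows: tail(df, 6) returns the last 6 rows,
# or every row when the table has fewer than 6 (as here)
last_six <- tail(df, 6)
last_two <- tail(df, 2)

# Reorder rows by a column, cheapest first
by_price <- df[order(df$Price), ]

print(all_but_last)
print(last_two)
print(by_price)
```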

Handling data frame rows and columns together

There are multiple ways to select rows and columns together in a data frame.

# A single cell can be selected by row number and column number
# The result is a vector.
str(df[1,1])

##  chr "Apple"

# Several rows and columns can be selected with vectors
# Selecting more than one column returns a data frame; a single column returns a vector
str(df[c(1,2),c(1,2)])

## 'data.frame':    2 obs. of  2 variables:
##  $ Fruit: chr  "Apple" "Watermelon"
##  $ Price: num  5 1

str(df[c(1,1),1])

##  chr [1:2] "Apple" "Apple"

# The row index can also be a logical vector
str(df[c(TRUE,TRUE,FALSE),c(2,3)])

## 'data.frame':    2 obs. of  2 variables:
##  $ Price   : num  5 1
##  $ Weight_g: num  200 1300

# Logical indexing is especially useful with a condition.
# Example: get the Price of 'Apple'
str(df[(df$Fruit == 'Apple'), ]$Price)

##  num 5
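
The results above show that `df[rows, cols]` simplifies to a vector when only one column is selected. If a data frame is wanted in every case, the `drop = FALSE` argument switches that simplification off. A short sketch on the same fruit table:

```r
# Same fruit table as in the text, rebuilt so this chunk is self-contained
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

a <- df[, 1]                               # vector: one column is simplified
b <- df[, 1, drop = FALSE]                 # data frame: simplification off
r <- df[1, ]                               # data frame: one row of several columns
d <- df[c(1, 2), "Price"]                  # vector
e <- df[c(1, 2), "Price", drop = FALSE]    # data frame with 2 rows, 1 column

print(class(a))  # "character"
print(class(b))  # "data.frame"
print(class(r))  # "data.frame"
print(class(d))  # "numeric"
print(dim(e))    # 2 1
```

Using `drop = FALSE` is handy inside functions, where the number of selected columns is not known in advance and the following code expects a data frame.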

Example: using a data frame on real-world data

I have chosen a dataset from the Amazon Best Sellers book webpage. It lists Amazon's top 50 best-selling books for each year from 2009 to 2019. The dataset provides the following features (variables):
1. Name - Name of the book
2. Author - The author of the book
3. User Rating - Amazon user rating
4. Reviews - Number of written reviews on Amazon
5. Price - Price of the book
6. Year - The year(s) it ranked as a bestseller
7. Genre - Whether fiction or non-fiction

This dataset is suitable because its size is manageable (550 observations of 7 variables) while still offering several features to study.

1. Load the data into a data frame and inspect it.

# read data into dataframe
data <- read.csv("bestsellers with categories.csv")

# check a sample of the data
head(data)

##                                                                 Name
## 1                                      10-Day Green Smoothie Cleanse
## 2                                                  11/22/63: A Novel
## 3                            12 Rules for Life: An Antidote to Chaos
## 4                                             1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6                      A Dance with Dragons (A Song of Ice and Fire)
##                     Author User.Rating Reviews Price Year       Genre
## 1                 JJ Smith         4.7   17350     8 2016 Non Fiction
## 2             Stephen King         4.6    2052    22 2011     Fiction
## 3       Jordan B. Peterson         4.7   18979    15 2018 Non Fiction
## 4            George Orwell         4.7   21424     6 2017     Fiction
## 5 National Geographic Kids         4.8    7665    12 2019 Non Fiction
## 6      George R. R. Martin         4.4   12643    11 2011     Fiction

# Check variables
summary(data)

##      Name              Author           User.Rating       Reviews     
##  Length:550         Length:550         Min.   :3.300   Min.   :   37  
##  Class :character   Class :character   1st Qu.:4.500   1st Qu.: 4058  
##  Mode  :character   Mode  :character   Median :4.700   Median : 8580  
##                                        Mean   :4.618   Mean   :11953  
##                                        3rd Qu.:4.800   3rd Qu.:17253  
##                                        Max.   :4.900   Max.   :87841  
##      Price            Year         Genre          
##  Min.   :  0.0   Min.   :2009   Length:550        
##  1st Qu.:  7.0   1st Qu.:2011   Class :character  
##  Median : 11.0   Median :2014   Mode  :character  
##  Mean   : 13.1   Mean   :2014                     
##  3rd Qu.: 16.0   3rd Qu.:2017                     
##  Max.   :105.0   Max.   :2019

2. Which genre is the best seller, and is that consistent every year?

# Check total count of both Genre
table(data$Genre)

## 
##     Fiction Non Fiction 
##         240         310

# Check the count of both Genre by Year
table(data[c("Year","Genre")])

##       Genre
## Year   Fiction Non Fiction
##   2009      24          26
##   2010      20          30
##   2011      21          29
##   2012      21          29
##   2013      24          26
##   2014      29          21
##   2015      17          33
##   2016      19          31
##   2017      24          26
##   2018      21          29
##   2019      20          30

# Show the proportion (percentage) of each Genre within each Year
prop.table(table(data[c("Year","Genre")]),1) * 100

##       Genre
## Year   Fiction Non Fiction
##   2009      48          52
##   2010      40          60
##   2011      42          58
##   2012      42          58
##   2013      48          52
##   2014      58          42
##   2015      34          66
##   2016      38          62
##   2017      48          52
##   2018      42          58
##   2019      40          60

Based on the data:
1. Non Fiction is the overall best seller (310 of the 550 entries).
2. Non Fiction leads in every year except 2014, when Fiction took 29 of the 50 places.
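
The second conclusion can also be read off programmatically instead of by eye. The sketch below does not load the CSV file; it copies the counts for 2013 to 2015 from the table above into a small matrix (the names `counts`, `leader`, and `pct` are my own) and picks the leading genre per year.

```r
# Counts for three years, copied from the Year x Genre table above
counts <- matrix(
  c(24, 26,
    29, 21,
    17, 33),
  ncol = 2, byrow = TRUE,
  dimnames = list(Year  = c("2013", "2014", "2015"),
                  Genre = c("Fiction", "Non Fiction"))
)

# For each row (year), find the column (genre) with the largest count
leader <- colnames(counts)[apply(counts, 1, which.max)]
names(leader) <- rownames(counts)
print(leader)

# Row percentages, as in the prop.table() call above
pct <- prop.table(counts, 1) * 100
print(pct)
```

With the full dataset, the same two steps could be applied to `table(data[c("Year","Genre")])` directly.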

# Summarize the data into table
tbl <- as.data.frame(table(data[c("Year","Genre")]))

# Format the table column properly
tbl$Year <- as.numeric(as.character(tbl$Year))
tbl$Genre <- as.character(tbl$Genre)

# Plot the data for visualization
# Plot the line for Fiction
plot(tbl$Year[tbl$Genre == "Fiction"], tbl$Freq[tbl$Genre == "Fiction"],
     type = "b", col = "blue", xlab = "Year", ylab = "Count",
     ylim = c(10, 40),
     main = "Amazon Best Selling Book \n(Genre Count by Year)")

# Add the line for Non Fiction
lines(tbl$Year[tbl$Genre == "Non Fiction"], tbl$Freq[tbl$Genre == "Non Fiction"],
      type = "b", col = "magenta")
# Add the legend
legend("topleft", c("Fiction", "Non Fiction"), col = c("blue", "magenta"),
       lty = 1, cex = 0.8)

3. Which author has the most best-selling books, and what are the details (name, rating, price) of those books?

# Some books were best sellers in more than one year,
# so filter down to unique book names first.

booklist <- unique(data[c("Name","Author")] )

# Count book name per author
bookcount <- list()
for (i in booklist$Author) { bookcount[i] <- sum( booklist$Author == i ) }

# Display the name of Author with the highest bookcount
highestbookcount <- max(unlist(bookcount))
highestauthor <- names(bookcount)[bookcount == highestbookcount]
print(paste("Highest Book count is ",highestbookcount," by ",highestauthor,"."))

## [1] "Highest Book count is  12  by  Jeff Kinney ."

# Display the book details (name, rating, price)
# %in% also works if several authors tie for the highest count
data[data$Author %in% highestauthor, c("Name","User.Rating","Price")]

##                                                   Name User.Rating Price
## 43          Cabin Fever (Diary of a Wimpy Kid, Book 6)         4.8     0
## 72             Diary of a Wimpy Kid: Hard Luck, Book 8         4.8     0
## 73       Diary of a Wimpy Kid: The Last Straw (Book 3)         4.8    15
## 74                 Diary of a Wimpy Kid: The Long Haul         4.8    22
## 81  Dog Days (Diary of a Wimpy Kid, Book 4) (Volume 4)         4.8    12
## 89              Double Down (Diary of a Wimpy Kid #11)         4.8    20
## 254              Old School (Diary of a Wimpy Kid #10)         4.8     7
## 382                                        The Getaway         4.8     0
## 436        The Meltdown (Diary of a Wimpy Kid Book 13)         4.8     8
## 469     The Third Wheel (Diary of a Wimpy Kid, Book 7)         4.7     7
## 475      The Ugly Truth (Diary of a Wimpy Kid, Book 5)         4.8    12
## 546       Wrecking Ball (Diary of a Wimpy Kid Book 14)         4.9     8
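
The counting loop used above can also be written with `table()`. The sketch below is an alternative under stated assumptions, not the method used for the results shown: it runs on a tiny made-up `books` data frame (titles, authors, and years are invented) so that a tie between two authors can be demonstrated.

```r
# Hypothetical mini dataset; "Book A" is a best seller in two years
books <- data.frame(
  Name   = c("Book A", "Book A", "Book B", "Book C", "Book D"),
  Author = c("Author X", "Author X", "Author X", "Author Y", "Author Y"),
  Year   = c(2018, 2019, 2019, 2018, 2019),
  stringsAsFactors = FALSE
)

# Keep each book once, regardless of how many years it appears
unique_books <- unique(books[c("Name", "Author")])

# table() counts the books per author in one call
book_count <- table(unique_books$Author)
print(book_count)

# Authors with the highest count; here two authors tie
top_authors <- names(book_count)[book_count == max(book_count)]
print(top_authors)

# %in% handles one or many authors when filtering rows
top_rows <- books[books$Author %in% top_authors, ]
print(top_rows)
```

When there is a tie, comparing with `==` against a vector of several authors would recycle the vector and give wrong matches, which is why `%in%` is used for the final filter.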