Write a simple R Markdown document to explain some of the R code you have learned about data frames. You are free to write your own program and add new code, such as how to show all rows except the last one or how to get the last 6 rows of a data frame. For example, you can explain what a data frame is, then write code to show how R handles data frames, and so on. One topic is mandatory: you must explain the different ways R returns a vector or a data frame when you access values from a data frame. Your program doesn't have to be long. Play around with the different font sizes in markdown to make your document readable. Explore. Publish your markdown on RPubs and submit the link only. Adding new code that has not been discussed will earn you high marks.
Note: It is best to create your own data frame, or to use a small one, so that the results are easy to see after each statement or after a few statements.
A data frame is a two-dimensional table.
Each column (called a variable) holds one variable, and all values in a column have the same class.
Each row contains one set of values, one for each column.
A data frame has the following characteristics.
1. The column names should be non-empty.
2. The row names should be unique.
3. The data stored in a data frame can be of numeric, factor or character type.
4. Each column should contain the same number of data items.
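The characteristics listed above can be checked directly in R. The following is a minimal sketch, not part of the original program: it builds a small fruit data frame, inspects the class of each column, and shows that `data.frame()` refuses columns of unequal length. The object names `col_classes`, `dims` and `unequal` are chosen only for this illustration.

```r
# Small example data frame (stringsAsFactors = FALSE keeps text as character)
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# Each column has a single class
col_classes <- sapply(df, class)
print(col_classes)

# Number of rows and columns
dims <- dim(df)
print(dims)

# Columns of unequal length (3 values vs 2 values) cannot form a data frame;
# tryCatch() captures the error message instead of stopping the script
unequal <- tryCatch(
  data.frame(Fruit = c("Apple", "Kiwi", "Pear"), Price = c(5, 10)),
  error = function(e) conditionMessage(e)
)
print(unequal)
```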
data = list( 'Fruit'=c('Apple','Watermelon','Kiwi')
,'Price'=c(5,1,10)
,'Weight_g'=c(200,1300,50) )
# Create data frame from list
df = data.frame(data)
# Check data frame structure
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ Fruit : chr "Apple" "Watermelon" "Kiwi"
## $ Price : num 5 1 10
## $ Weight_g: num 200 1300 50
There are multiple ways to handle columns in a data frame.
# A data frame can be subset with $ and a column name;
# the result is returned as a vector.
str(df$Fruit)
## chr [1:3] "Apple" "Watermelon" "Kiwi"
# A data frame can also be subset with single brackets and a column name or number;
# the result is returned as a data frame.
str(df['Fruit'])
## 'data.frame': 3 obs. of 1 variable:
## $ Fruit: chr "Apple" "Watermelon" "Kiwi"
str(df[1])
## 'data.frame': 3 obs. of 1 variable:
## $ Fruit: chr "Apple" "Watermelon" "Kiwi"
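The difference between a vector result and a data frame result can also be shown with `[[ ]]` and with the `drop` argument of `[ , ]`. This is a minimal sketch that rebuilds the same small fruit data frame so the block runs on its own; the names `v1`, `v2`, `d1` and `d2` are only for illustration.

```r
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# Double brackets extract the column itself: a vector
v1 <- df[["Fruit"]]

# Row/column indexing with a single column drops to a vector by default
v2 <- df[, "Fruit"]

# drop = FALSE keeps the result as a one-column data frame
d1 <- df[, "Fruit", drop = FALSE]

# Single brackets without a comma always return a data frame
d2 <- df["Fruit"]

print(is.data.frame(v1))  # FALSE
print(is.data.frame(v2))  # FALSE
print(is.data.frame(d1))  # TRUE
print(is.data.frame(d2))  # TRUE
```

In short, `$`, `[[ ]]` and `[ , j]` with one column give a vector, while `[j]` and `[ , j, drop = FALSE]` give a data frame.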
# A data frame can also be subset by a vector to select multiple columns;
# the vector can hold column numbers or column names (the same column can even be repeated).
str(df[c('Fruit','Price')])
## 'data.frame': 3 obs. of 2 variables:
## $ Fruit: chr "Apple" "Watermelon" "Kiwi"
## $ Price: num 5 1 10
str(df[c(1,2)])
## 'data.frame': 3 obs. of 2 variables:
## $ Fruit: chr "Apple" "Watermelon" "Kiwi"
## $ Price: num 5 1 10
str(df[c(1,1)])
## 'data.frame': 3 obs. of 2 variables:
## $ Fruit : chr "Apple" "Watermelon" "Kiwi"
## $ Fruit.1: chr "Apple" "Watermelon" "Kiwi"
# The index vector can also be logical (TRUE keeps a column, FALSE drops it)
str(df[c(TRUE,TRUE,FALSE)])
## 'data.frame': 3 obs. of 2 variables:
## $ Fruit: chr "Apple" "Watermelon" "Kiwi"
## $ Price: num 5 1 10
# Add a column to the data frame (with $ assignment, then with cbind)
df1 = df
df1$Origin = c('Oversea','Local','Oversea')
str(df1)
## 'data.frame': 3 obs. of 4 variables:
## $ Fruit : chr "Apple" "Watermelon" "Kiwi"
## $ Price : num 5 1 10
## $ Weight_g: num 200 1300 50
## $ Origin : chr "Oversea" "Local" "Oversea"
df1 = cbind(df,list('Origin'=c('Oversea','Local','Oversea')))
str(df1)
## 'data.frame': 3 obs. of 4 variables:
## $ Fruit : chr "Apple" "Watermelon" "Kiwi"
## $ Price : num 5 1 10
## $ Weight_g: num 200 1300 50
## $ Origin : chr "Oversea" "Local" "Oversea"
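A new column can also be computed from existing columns, and a column can be removed by assigning `NULL` to it. The sketch below is an illustration only: the column name `Price_per_kg` is hypothetical and assumes that `Weight_g` is the weight in grams of the item the price refers to.

```r
df1 <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  Origin   = c("Oversea", "Local", "Oversea"),
  stringsAsFactors = FALSE
)

# Derive a new column from two existing columns (hypothetical column name)
df1$Price_per_kg <- df1$Price / (df1$Weight_g / 1000)

# Remove a column by assigning NULL to it
df1$Origin <- NULL

str(df1)
```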
There are multiple ways to handle rows in a data frame.
# A data frame can be subset using a row number (note the trailing comma);
# the result is returned as a data frame.
str(df[1,])
## 'data.frame': 1 obs. of 3 variables:
## $ Fruit : chr "Apple"
## $ Price : num 5
## $ Weight_g: num 200
# A data frame can also be subset by a vector to select multiple rows;
# the vector can be numeric (the same row can even be repeated).
str(df[c(1,2),])
## 'data.frame': 2 obs. of 3 variables:
## $ Fruit : chr "Apple" "Watermelon"
## $ Price : num 5 1
## $ Weight_g: num 200 1300
str(df[c(1,1),])
## 'data.frame': 2 obs. of 3 variables:
## $ Fruit : chr "Apple" "Apple"
## $ Price : num 5 5
## $ Weight_g: num 200 200
# The index vector can also be logical (TRUE keeps a row, FALSE drops it)
str(df[c(TRUE,TRUE,FALSE),])
## 'data.frame': 2 obs. of 3 variables:
## $ Fruit : chr "Apple" "Watermelon"
## $ Price : num 5 1
## $ Weight_g: num 200 1300
# Logical indexing is especially useful with a logical condition.
# Example: extract the row for 'Apple'
str(df[(df$Fruit == 'Apple'), ])
## 'data.frame': 1 obs. of 3 variables:
## $ Fruit : chr "Apple"
## $ Price : num 5
## $ Weight_g: num 200
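Row selection also covers the cases of dropping the last row and taking the last few rows. The following is a minimal sketch using `head()`, `tail()` and negative indexing; the data frame `nums` is a hypothetical 8-row example made up for this illustration so that "the last 6 rows" is a proper subset.

```r
# Hypothetical 8-row data frame used only for this illustration
nums <- data.frame(id = 1:8, square = (1:8)^2)

# All rows except the last one: a negative n in head() drops rows from the end
all_but_last_a <- head(nums, -1)

# Same result with a negative row index
all_but_last_b <- nums[-nrow(nums), ]

# The last 6 rows
last_six <- tail(nums, 6)

print(all_but_last_a)
print(last_six)
```

All three results are data frames, because whole rows are selected.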
# Add a row to the data frame
# (numeric values are passed unquoted so that Price and Weight_g stay numeric)
df1 = rbind(df,list('Orange',4,300))
str(df1)
## 'data.frame': 4 obs. of 3 variables:
## $ Fruit : chr "Apple" "Watermelon" "Kiwi" "Orange"
## $ Price : num 5 1 10 4
## $ Weight_g: num 200 1300 50 300
There are multiple ways to handle rows and columns together in a data frame.
# A data frame can be subset using a row number and a column number;
# a single cell is returned as a vector.
str(df[1,1])
## chr "Apple"
# A data frame can also be subset by vectors of rows and columns;
# several columns return a data frame, a single column returns a vector (indices can be repeated).
str(df[c(1,2),c(1,2)])
## 'data.frame': 2 obs. of 2 variables:
## $ Fruit: chr "Apple" "Watermelon"
## $ Price: num 5 1
str(df[c(1,1),1])
## chr [1:2] "Apple" "Apple"
# vector can also be logical
str(df[c(TRUE,TRUE,FALSE),c(2,3)])
## 'data.frame': 2 obs. of 2 variables:
## $ Price : num 5 1
## $ Weight_g: num 200 1300
# Logical indexing is especially useful with a logical condition.
# Example: extract the Price of 'Apple'
str(df[(df$Fruit == 'Apple'), ]$Price)
## num 5
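The same lookup can be written in several ways, and the choice determines whether a vector or a data frame comes back. A minimal sketch on the small fruit data frame, rebuilt here so the block is self-contained; the result names are only for illustration.

```r
df <- data.frame(
  Fruit    = c("Apple", "Watermelon", "Kiwi"),
  Price    = c(5, 1, 10),
  Weight_g = c(200, 1300, 50),
  stringsAsFactors = FALSE
)

# Filter the column vector directly: returns a numeric vector
price_vec1 <- df$Price[df$Fruit == "Apple"]

# Row condition plus a single column name: also a numeric vector
price_vec2 <- df[df$Fruit == "Apple", "Price"]

# subset() with select keeps the result as a data frame
price_df <- subset(df, Fruit == "Apple", select = Price)

str(price_vec1)
str(price_vec2)
str(price_df)
```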
I have chosen a dataset from the Amazon Best Seller Book webpage. It represents Amazon's top 50 best-selling books for each year between 2009 and 2019. The dataset provides the following features (variables):
1. Name - Name of the book
2. Author - Author of the book
3. User Rating - Amazon user rating
4. Reviews - Number of written reviews on Amazon
5. Price - Price of the book
6. Year - The year(s) it ranked on the bestseller list
7. Genre - Whether fiction or non-fiction
This dataset is suitable because its size is reasonable (550 observations of 7 variables) while still allowing several features to be studied.
# read data into dataframe
data <- read.csv("bestsellers with categories.csv")
# check a sample of the data
head(data)
## Name
## 1 10-Day Green Smoothie Cleanse
## 2 11/22/63: A Novel
## 3 12 Rules for Life: An Antidote to Chaos
## 4 1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6 A Dance with Dragons (A Song of Ice and Fire)
## Author User.Rating Reviews Price Year Genre
## 1 JJ Smith 4.7 17350 8 2016 Non Fiction
## 2 Stephen King 4.6 2052 22 2011 Fiction
## 3 Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
## 4 George Orwell 4.7 21424 6 2017 Fiction
## 5 National Geographic Kids 4.8 7665 12 2019 Non Fiction
## 6 George R. R. Martin 4.4 12643 11 2011 Fiction
# Check variables
summary(data)
## Name Author User.Rating Reviews
## Length:550 Length:550 Min. :3.300 Min. : 37
## Class :character Class :character 1st Qu.:4.500 1st Qu.: 4058
## Mode :character Mode :character Median :4.700 Median : 8580
## Mean :4.618 Mean :11953
## 3rd Qu.:4.800 3rd Qu.:17253
## Max. :4.900 Max. :87841
## Price Year Genre
## Min. : 0.0 Min. :2009 Length:550
## 1st Qu.: 7.0 1st Qu.:2011 Class :character
## Median : 11.0 Median :2014 Mode :character
## Mean : 13.1 Mean :2014
## 3rd Qu.: 16.0 3rd Qu.:2017
## Max. :105.0 Max. :2019
# Check total count of both Genre
table(data$Genre)
##
## Fiction Non Fiction
## 240 310
# Check the count of both Genre by Year
table(data[c("Year","Genre")])
## Genre
## Year Fiction Non Fiction
## 2009 24 26
## 2010 20 30
## 2011 21 29
## 2012 21 29
## 2013 24 26
## 2014 29 21
## 2015 17 33
## 2016 19 31
## 2017 24 26
## 2018 21 29
## 2019 20 30
# Show the proportion (percentage) of both Genre by Year
prop.table(table(data[c("Year","Genre")]),1) * 100
## Genre
## Year Fiction Non Fiction
## 2009 48 52
## 2010 40 60
## 2011 42 58
## 2012 42 58
## 2013 48 52
## 2014 58 42
## 2015 34 66
## 2016 38 62
## 2017 48 52
## 2018 42 58
## 2019 40 60
Based on the data,
1. Non Fiction has been the best seller overall (310 of the 550 entries).
2. Non Fiction has been the best seller in every year except 2014.
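Both observations can be verified with a few lines of code. The sketch below does not read the CSV file; it re-enters the yearly counts from the Year by Genre table shown earlier into a small data frame (named `counts` here, with the column `NonFiction` standing in for "Non Fiction") so that it runs on its own.

```r
# Yearly counts copied from the Year x Genre table above
counts <- data.frame(
  Year       = 2009:2019,
  Fiction    = c(24, 20, 21, 21, 24, 29, 17, 19, 24, 21, 20),
  NonFiction = c(26, 30, 29, 29, 26, 21, 33, 31, 26, 29, 30)
)

# Overall totals per genre
totals <- colSums(counts[c("Fiction", "NonFiction")])
print(totals)

# Years in which Fiction had more entries than Non Fiction
fiction_lead_years <- counts$Year[counts$Fiction > counts$NonFiction]
print(fiction_lead_years)
```

Note that `counts[c("Fiction", "NonFiction")]` is a data frame (needed by `colSums()`), while `counts$Year[...]` is a plain vector.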
# Summarize the data into table
tbl <- as.data.frame(table(data[c("Year","Genre")]))
# Format the table column properly
tbl$Year <- as.numeric(as.character(tbl$Year))
tbl$Genre <- as.character(tbl$Genre)
# Plot the data for visualization
# Plot line for Fiction
plot(tbl$Year[tbl$Genre == "Fiction"], tbl$Freq[tbl$Genre == "Fiction"],
     type = "b", col = "blue", xlab = "Year", ylab = "Count", ylim = c(10, 40),
     main = "Amazon Best Selling Book \n(Genre Count by Year)")
# Add plot for Non Fiction
lines(tbl$Year[tbl$Genre == "Non Fiction"],tbl$Freq[tbl$Genre == "Non Fiction"],type="b",col="magenta")
# Add legend
legend("topleft",c("Fiction","Non Fiction"),col = c("blue","magenta"), lty=1:1, cex=0.8 )