Code should be clearly commented. Plots should be presentable and properly labeled. Mitigate overplotting whenever possible.
We’ll use the data file dataset_hw4.csv, which should be in the same directory as this markdown file (which should also be your working directory). It is a data frame of expenditures by household from the consumer expenditure survey
Problem 1:
Group the households into two categories:
home.type is “apartment or flat” or “high-rise”home.type is “single family detached”Compare the housing expenditures for the two groups using a quantile plot. Do they appear different?
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.2.4
library(ggplot2movies)
expenditures = read.csv(file = 'dataset_hw4.csv', header = TRUE)
# extract subset for apartment or flat and high-rise
expenditures.sm <- subset(expenditures, subset = home.type %in% c('apartment or flat', 'high-rise'))
#drop unused factor levels
expenditures.sm$home.type <- factor(expenditures.sm$home.type)
#2 plots on 1 graph quantile plot housing expenditures
ggplot(data = expenditures.sm, mapping = aes(sample = housing, color = home.type)) +
stat_qq(geom = 'point', distribution = 'qunif') +
labs(x = 'Quantiles', y = 'Housing', title = 'Quantile Plot, Housing Expenditure')
ANS: At the unmodified quantile level, it looks like the high rise has more expensive housing above the 75th percentile, and slightly lower (approximately) below the 25th percentile.
Problem 2:
Compare the housing expenditures but first take a log transform. (Exclude those who spent zero or negative amounts on housing from the analysis)
Use quantile plots and/or QQ plots to compare the two groups after the log transform. If helpful, also compare the residuals. What are the similarities and differences between the two groups?
#2 plots on 1 graph quantile plot housing expenditures
#ensure all values positive and greater than 0
exp.sm.log <- subset(expenditures.sm, subset = housing > 0)
exp.sm.log <- mutate(exp.sm.log, log.housing = log(housing))
ggplot(data = exp.sm.log, mapping = aes(sample = log.housing, color = home.type)) +
stat_qq(geom = 'point', distribution = 'qunif') +
labs(x = 'Quantiles', y = 'log(Housing)', title = 'Quantile Plot, Housing Expenditure')
#QQ plot to confirm comparing two curves
#plot residual, log (housing, residual log housing) ddply
n.pts <- with(exp.sm.log,
min(length (log.housing [home.type == 'apartment or flat']),
length (log.housing [home.type == 'high-rise'])))
#set up quantiles to plot
probs = seq(from = 0, to = 1, length.out = n.pts)
#extract quantiles
q1 <- with(exp.sm.log, quantile(log.housing [home.type == 'apartment or flat'], probs = probs))
q2 <- with(exp.sm.log, quantile(log.housing [home.type == 'high-rise'], probs = probs))
#make QQ plot
ggplot(mapping = aes(x = q2, y = q1)) + geom_point() +
labs(x = 'High-Rise', y = 'Apartment/Flat', title = 'QQ Plot, Log of Housing Costs \n (Grouped by Home Type)') + geom_abline(intercept = 0, slope = 1)
ANS: The two plots between high-rise and apartment or flat after the log transform seem closer, but still vary across the extreme end points. In housing, it mostly seems like a multiplicative effect, except for highest or lowest quantiles.
** Problem 3:**
Using the movies data set that comes with ggplot2, visualize the distribution of ratings for movies before and after 1930.
How do the distributions differ? Should the distributions be compared with or without a log transform?
You may want to adjust the size of the points to reduce overplotting.
#task: create quantile plot and qq plot with residuals
#add frame that splits the time period to movies > 1930 and movies <=1930
movies$timeperiod <- ifelse(movies$year > 1930, 'post.1930', 'pre.1930')
#2 plots on 1 graph quantile plot movie ratings
ggplot(data = movies, mapping = aes(sample = rating, color = timeperiod)) +
stat_qq(geom = 'point', distribution = 'qunif') +
labs(x = 'Quantiles', y = 'Rating', title = 'Quantile Plot, Movie Rating')
#plot residual, log (housing, residual log housing) ddply
n.pts.m <- with(movies,
min(length (rating [timeperiod == 'pre.1930']),
length (rating [timeperiod == 'post.1930'])))
#set up quantiles to plot
probs.m = seq(from = 0, to = 1, length.out = n.pts.m)
#extract quantiles
q1m <- with(movies, quantile(rating [timeperiod == 'pre.1930'], probs = probs.m))
q2m <- with(movies, quantile(rating [timeperiod == 'post.1930'], probs = probs.m))
#make QQ plot
ggplot(mapping = aes(x = q2m, y = q1m)) + geom_point(size = 1) +
labs(x = 'Post 1930', y = 'Pre 1930', title = 'QQ Plot, Movie Rating \n (Grouped by Time Period before and after 1930)') + geom_abline(intercept = 0, slope = 1)
ANS: Ratings for movies are between 0 - 10 makes less sense for a log transform since it already is part a controlled scale.
The ratings seem to have a slightly higher ratings across the overall distribution for Pre 1930 vs. Post 1930.
** Problem 4:**
Compare the distribution of ratings for each of the genres: action, animation, comedy, drama, documentary, romance, and short. If a movie belongs to more than one genre, include it in both distributions. Use both quantile plots and Q-Q plots.
# melt movies to create a genre column
movies.melt <- melt(movies, c('Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short'), id = c('title','rating','year'))
#rename variable to genre to make less confusing
colnames(movies.melt)[2] <- 'rating'
colnames(movies.melt)[3] <- 'year'
colnames(movies.melt)[4] <- 'genre'
movies.melt.nonzero <- subset(movies.melt, subset = movies.melt$value != 0)
#melted number is higher than original movie count since some movies are classified in multiple categories
#extract movie stats as aggregate
movies.rating.pool <- subset(movies, select = 'rating')
#quantile plot of distributions of ratings by genre vs. overall pooled ratings
ggplot(data = movies.melt.nonzero, mapping = aes(sample = rating)) +
stat_qq(distribution = qunif) + facet_wrap('genre', nrow = 2) +
stat_qq(data = movies.rating.pool, mapping = aes(sample = rating),
distribution = qunif, geom = 'line') +
labs(x = 'Quantiles', y = 'Rating', title = 'Quantile Plot')
#function to find quantiles to create QQ plot comparing pooled movie ratings to movies@rating
Find.QQ = function(data, movies.rating.pool) {
n.pts = min( length(data$rating), length(movies.rating.pool))
probs = seq(from = 0, to = 1, length.out = n.pts)
q1 = quantile(data$rating, probs= probs)
q2 = quantile(movies.rating.pool, probs=probs )
return( data.frame(rating = q1, movies.rating.pool = q2, quantile = probs) )
}
many.QQ.plots <- ddply(movies.melt.nonzero, 'genre', Find.QQ, movies.rating.pool = movies$rating)
ggplot(data = many.QQ.plots, mapping= aes(x = movies.rating.pool, y = rating)) +
geom_point() + facet_wrap('genre') +
labs(title = 'Movie Ratings QQ Plots, groups vs. pooled data') +
geom_abline(slope = 1)
True or False: * The ratings distribution for action movies is worse than those of the pooled movies: > True
Animation movies have better ratings than the overall distribution at the lower quantiles, but worse than the overall distribution at the highest quantiles. > True, note that it looks very close to the same at the top
Documentaries and Shorts have worse ratings than the overall distribution at the lower quantiles, but better ratings than the overall distribution otherwise. > True
Also, which worked better for answering the T/F questions: quantile plots or QQ plots? > No comparison, QQ plots much easier to see, as should be expected since curves are harder to compare.
** Problem 5:**
Compare the distribution of ratings for each of the genres, and also for the following time periods: 1900-1920, 1921-1940, 1941-1960, 1961-1980, 1981-2000 (i.e., there should be 35 groups total). Use both quantile plots and Q-Q plots.
# group by genre and time periods
#extract movies from 1900-2000
movies.1900s <- subset(movies.melt.nonzero, subset = year >= 1900 & year <= 2000)
#drop unused factor levels
movies.1900s$year <- factor(movies.1900s$year)
#cut movies by time period, and assign labels
movies.1900s.cut <- cut(as.numeric(movies.1900s$year), breaks = 5,labels = c('1900-1920', '1921-1940','1941-1960','1961-1980','1981-2000'))
#add the cut column timeperiod to movies data frame
movies.1900.mod <- cbind(movies.1900s,time.period = movies.1900s.cut)
#quantile plot of distributions of ratings by genre vs. overall pooled ratings
ggplot(data = movies.1900.mod, mapping = aes(sample = rating)) +
stat_qq(distribution = qunif) +
stat_qq(data = movies.rating.pool, mapping = aes(sample = rating),
distribution = qunif, geom = 'line') +
labs(x = 'Quantiles', y = 'Rating', title = 'Quantile Plot by Genre and Time Period') +
facet_grid(time.period ~ genre)
#QQ plots pulling from previous QQ function
many.QQ.plots.grid <- ddply(movies.1900.mod, c('genre','time.period'), Find.QQ, movies.rating.pool = movies$rating)
#QQ plot
ggplot(data = many.QQ.plots.grid, mapping= aes(x = movies.rating.pool, y = rating)) +
geom_point() + facet_grid(genre ~ time.period) +
labs(title = 'Movie Ratings QQ Plots, Grouped by Time Period and Genre') +
geom_abline(slope = 1)
Action movies were average performers early on, then became slightly higher than average in rating, this trend faltering in the upper quantiles in 1941 - 1960, then dropped significantly in the last 40 years.
Comedies had better lower ratings, lower high ratings early on, then gradually trended to be overall better except in high quantiles, then in 1961-1980 became very close to average, and now have a slightly downward trend in the most recent 20 years.