Working on univariate statistics with cereal and debt in trillions.
For this homework assignment I plan to work with three different types of data sets. Below I will read in the data sets. This time instead of putting layout=“l-body-outset” I put that in the setup as a opts_chunk$set function.
I will first clean data that is needed to be cleaned. I had previously cleaned the debt in trillions so below you will find my work and how I did it.
dt <- read_excel(here("_data", "debt_in_trillions.xlsx"))
dt <- rename(dt, "year_and_quarter" = "Year and Quarter",
"herevolving" = "HE Revolving",
"autoloan" = "Auto Loan",
"creditcard" = "Credit Card",
"studentloan" = "Student Loan",) %>%
select(!starts_with("total"))
# I changed the rest of the names to lowercase but it doesn't have to be done.
# I just prefer to not have to press the shift key when typing if I can avoid it.
colnames(dt)<-tolower(colnames(dt))
dt <- dt %>%
separate(year_and_quarter, into = c("year", "quarter"), sep=":")
dt <- transform(dt, year = as.numeric(year))
# This is adding 2000 so that the years will be in the 2000s-2010s
dt$year <- dt$year+2000
# Needed to change mortgage to a numeric form, it was originally a character form.
dt$mortgage <- as.numeric(dt$mortgage)
datatable(dt, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}"),
caption = 'Debt in Trillions',
width = '100%', options = list(scrollX = TRUE)))
All variables were picked from the data sets we were provided. There are no missing values.
For variables I’m going to focus on the sugar and sodium columns and compare them based on the different cereals.
For debt in trillions I want to focus on the type of debt and how it fluctuates by year and quarter.
First lets look at descriptive statics of sodium and and sugar.
summarize_all(cereal, list(mean=mean, median=median, mean=mean, max=max, sd=sd, IQR=IQR ), na.rm=TRUE)
# A tibble: 1 × 24
Cereal_mean..1 Sodium_mean..1 Sugar_mean..1 Type_mean..1
<dbl> <dbl> <dbl> <dbl>
1 NA 167 8.75 NA
# … with 20 more variables: Cereal_median <dbl>, Sodium_median <dbl>,
# Sugar_median <dbl>, Type_median <dbl>, Cereal_mean..3 <dbl>,
# Sodium_mean..3 <dbl>, Sugar_mean..3 <dbl>, Type_mean..3 <dbl>,
# Cereal_max <chr>, Sodium_max <dbl>, Sugar_max <dbl>,
# Type_max <chr>, Cereal_sd <dbl>, Sodium_sd <dbl>, Sugar_sd <dbl>,
# Type_sd <dbl>, Cereal_IQR <dbl>, Sodium_IQR <dbl>,
# Sugar_IQR <dbl>, Type_IQR <dbl>
# A tibble: 1 × 2
quantile.sodium quantile.sugar
<dbl> <dbl>
1 138. 12.5
summarise
function (.data, ..., .groups = NULL)
{
UseMethod("summarise")
}
<bytecode: 0x7fb9c2c14ed8>
<environment: namespace:dplyr>
For the sugar quartile I found that for the third quartile that 25% of the cereal had 12.5 of sugar or more.
For the sodium quartile I found that for the first quartile that 25% of the cereal had 137.5 of sodium or less. For the standard deviation, the mean was 167 while the standard deviation was about 77.3. Only one data item, 340 was more than two standard deviations above the mean, and only one data item, 0 was more than two standard deviations below the mean.
The fact that sodium is so much higher in sodium than sugar surprised me because of people mostly talking about how “sugary” cereals are. Sodium ranged from 0-340 grams compared to sugar’s 0-18 grams.
Now lets look at the descriptive statics for debt in trillions.
summarize_all(dt, list(mean=mean, median=median, mean=mean, max=max, sd=sd, IQR=IQR), na.rm=TRUE)
year_mean..1 quarter_mean..1 mortgage_mean..1 herevolving_mean..1
1 2011.757 NA 8.273986 0.5160662
autoloan_mean..1 creditcard_mean..1 studentloan_mean..1
1 0.9308703 0.7565176 0.9188575
other_mean..1 year_median quarter_median mortgage_median
1 0.383077 2012 NA 8.412
herevolving_median autoloan_median creditcard_median
1 0.5165 0.8145 0.7375
studentloan_median other_median year_mean..3 quarter_mean..3
1 0.9088281 0.39205 2011.757 NA
mortgage_mean..3 herevolving_mean..3 autoloan_mean..3
1 8.273986 0.5160662 0.9308703
creditcard_mean..3 studentloan_mean..3 other_mean..3 year_max
1 0.7565176 0.9188575 0.383077 2021
quarter_max mortgage_max herevolving_max autoloan_max
1 Q4 10.442 0.714 1.415
creditcard_max studentloan_max other_max year_sd quarter_sd
1 0.927 1.584 0.486 5.380866 NA
mortgage_sd herevolving_sd autoloan_sd creditcard_sd studentloan_sd
1 1.184876 0.1237017 0.2346748 0.06879579 0.4276748
other_sd year_IQR quarter_IQR mortgage_IQR herevolving_IQR
1 0.04582675 9 NA 1.011 0.1897
autoloan_IQR creditcard_IQR studentloan_IQR other_IQR
1 0.4085 0.11995 0.7689997 0.074025
We can notice that because quarters is a character column it is returning NA.
Here we can see some interesting statistics. If we look at all the means we have three that are relatively close in amount (from about .73-.91 ) two that are lower (.39-.52) other and he_revolving (I believe to be home equity revolving), and one that is way higher than any of them, mortgage at 8.41
For other loans we don’t know what loans are in the category, so it makes sense for it to be smaller than the major loans. he_revolving may be lower due to a much lower range of debt than the other loan types.
However for mortgage the median is a much larger number. I believe this may because of how many people purchase homes and need to borrow money from the bank, unlike for the other types of loans. Usually these loans are much larger.
Looking at the means for each of these categories the means seem to be on a similar path to the medians with all the categories being close to their median amount.
Now that I’ve done some stastics I would like to see it in a visualized form. Below you will find both data sets broken out into different graphs.
For cereal I’m interested in seeing the sodium compared to sugar for each cereal brand.
p1 <- ggplot(cereal, aes(Cereal, Sodium)) +
geom_col() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
theme(text=element_text(size=5))
p2 <- ggplot(cereal, aes(Cereal, Sugar)) +
geom_col() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
theme(text=element_text(size=5))
p1 / p2
This gets me two separate graphs, however I’m more interested in having them on one graph. Additionally I noticed that the font had to be extremely small to fit the whole bar chart in with the names. This time I’ll try using a legend.
p3<- ggplot(cereal, aes(x=Sodium, y=Sugar, group=Cereal)) +
geom_col(aes(color=Cereal, position='dodge')) +
theme(legend.key.size = unit(.5, 'cm'))
p4 <- ggplot(cereal, aes(fill=Cereal, y=Sugar, x=Sodium)) +
geom_bar(stat='identity', position='dodge') +
theme(legend.key.size = unit(.5, 'cm'))
p3 | p4
This is closer to what I want but not exactly. This time I’m going to try to force sodium and sugar as the key with the cereal being on the chart.
cereal1 <- cereal
cereal1 <- cereal %>%
select(!starts_with("Type")) %>%
gather(key = Type, value = Amount, Sugar,Sodium)
cereal1 %>%
ggplot(aes(x=Cereal, y=Amount, group=Type, color=Type)) +
geom_line() +
geom_point() +
ggtitle("Cereal by amount of sodium vs sugar") +
scale_x_discrete(guide = guide_axis(n.dodge = 2), labels = function(x)
stringr::str_wrap(x, width = 10)) +
theme(text=element_text(size=9))
This is what I wanted! This allows you to see how sugar and sodium’s amount compared to the cereal. As you can see sodium is much more drastically different compared to the sugar.
With what I learned about plotting in Cereal, I’m going to focus on trying to make tables here.
dt1 <- dt %>%
gather(key = debt, value = totaldebt, mortgage:other)
p5 <- ggplot(dt1, aes(year, totaldebt, fill = debt)) +
geom_col(position = "dodge")
p6 <- ggplot(dt1, aes(quarter, totaldebt, fill = debt)) +
geom_col(position = "dodge")
p5 / p6 + plot_layout(guides = "collect") + plot_annotation(
title = "Total debt by year and quarter")
p7 <- ggplot(dt, aes(year, creditcard)) +
geom_col(position = "dodge")
p8 <- ggplot(dt, aes(year, mortgage)) +
geom_col(position = "dodge")
p9 <- ggplot(dt, aes(year, herevolving)) +
geom_col(position = "dodge")
p10 <- ggplot(dt, aes(year, autoloan)) +
geom_col(position = "dodge")
p11 <- ggplot(dt, aes(year, studentloan)) +
geom_col(position = "dodge")
p12 <- ggplot(dt, aes(year, other)) +
geom_col(position = "dodge")
p7+p8+p9+p10+p11+p12+plot_layout(ncol = 3) + plot_annotation(
title = "Different debt in trillions",
caption = "by credit card, mortgage,
he revolving, auto loan, student loan, other"
)
From the different types of debt you can see that each debt has it’s own graph, while student loan seems to always go up compared to he revolving going down after 2010, the rest seemed to have a dip between 2010-2015.
For overall mortgage was by far the largest amount.
For this I’m going to attempt to turn the quarter and years into time frames.
dt2 <- dt
dt2 <- dt2 %>%
# Mutating all quarters to turn into the first of the month for that quarter
mutate(quarter = case_when(
startsWith(quarter, "Q1") ~ "-1",
startsWith(quarter, "Q2") ~ "-4",
startsWith(quarter, "Q3") ~ "-7",
startsWith(quarter, "Q4") ~ "-10"
)) %>%
# Uniting the year and quarter columns back into one column
unite('yq', year:quarter, remove = FALSE) %>%
select(-c(year, quarter)) %>%
# Removing the _ that was in the year quarter that made it look like this 2003_-1
mutate_at("yq", str_replace, "_", "")
# I used zoo to first change this into a date type column
dt2$yq <- as.Date(as.yearmon(dt2$yq))
# and then change it into a year quarter one like before!
dt2$yq <- as.yearqtr(dt2$yq, format = "%Y-%m-%d")
# Using the paged table you can see that the column type has changed to S3: yearqtr :)
paged_table(dt2)
Now that I’ve been able to change the year and quarter column to a date I’m going to try to plot it this way. I attempted to use the lubridate package but I was having trouble getting the dates back into Q1 form. So I just used the zoo package.
First I’ll try to visualize debt in trillions without using it as a time function.
dt2 <- dt2 %>%
gather(key = debt, value = totaldebt, mortgage:other)
p13 <- ggplot(dt2, aes(x=yq, y=totaldebt, fill=debt)) +
geom_area(alpha=0.6) +
scale_fill_brewer(palette="Dark2") +
scale_x_continuous(name="year") +
scale_y_continuous(name="total debt")
p14 <- ggplot(dt2, aes(y = totaldebt, x = yq, colour = debt)) +
geom_line() +
scale_x_continuous(name="year") +
scale_y_continuous(name="total debt")
p13/p14
Now that I know that it works I will try to do it using a time frame.
p15 <- ggplot(dt2, aes(y = totaldebt, x = yq, colour = debt)) +
geom_line() +
scale_x_date(date_labels = "%YQ%q") +
scale_x_continuous(name="year") +
scale_y_continuous(name="total debt")
p15
I noticed there was not much difference because quarter is not being read in. Instead I decided to try this with the zoo package.
# Here is reading in ggplot and using the zoo package
p16 <- ggplot(dt2, aes(yq, totaldebt)) +
geom_line() +
scale_x_yearqtr(format = "%YQ%q", name="year and quarter - ggplot") +
scale_y_continuous(name="total debt")
# Here is only using the zoo package
p17 <- zoo(dt2$totaldebt, as.yearqtr(dt2$yq))
p17 <- autoplot(p17) +
scale_x_yearqtr(format = "%YQ%q", name="year and quarter - zoo package") +
scale_y_continuous(name="total debt")
p16/p17
Both look the same, however using the only zoo package I get “Warning in zoo(dt2\(totaldebt, as.yearqtr(dt2\)yq)): some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique”
Lets see if I can combine graphs together.
p18 <- ggplot(dt2, aes(yq, totaldebt, colour = debt)) +
geom_line() +
ggtitle("Total Debt in Trillions") +
scale_x_yearqtr(format = "%YQ%q", name="year and quarter") +
scale_y_continuous(name="total debt")
p18
I did it!