STA112MaggieProblemSet1

publishers <- read.csv("~/Downloads/publishers.csv")

Section 1

I have chosen a data set about Amazon e-book sales by genre and publisher (http://authorearnings.com/report/september-2015-author-earnings-report/). The data is stored as a table in which each row is one ebook and each column is a variable. The data has 27028 rows and 13 columns. The columns record each ebook’s genre, the publisher by which it was sold, daily average Amazon revenue, daily average author revenue, daily average gross sales, daily average publisher revenue, daily average units sold, publisher name, publisher type, average rating, sale price, sales rank, and total reviews. I believe that the data is reliable because it is formatted and stored in a standardized way, was collected in an unbiased way from the Amazon website, and comes from a reliable source, the CORGIS Dataset Project.
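The dimensions described above can be checked directly in R. The sketch below is only an illustration: it uses a tiny made-up data frame called `toy` (hypothetical, not the real data) so that it runs without the CSV file. With the real data, the same functions would be called on `publishers`.

```r
# Hypothetical stand-in for the publishers data (all values are invented)
toy <- data.frame(
  genre = c("fiction", "nonfiction", "fiction"),
  publisher.type = c("indie", "big five", "amazon"),
  daily.average.units.sold = c(12, 40, 7)
)

n_rows <- nrow(toy)  # number of observations (one per ebook)
n_cols <- ncol(toy)  # number of variables
dims   <- dim(toy)   # both at once: c(rows, columns)

str(toy)             # variable names and types at a glance
```

Running `dim(publishers)` right after `read.csv()` is a quick way to confirm that rows and columns have not been mixed up.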

Section 2

We are going to explore the relationship between the daily average units sold and the total reviews of the ebooks, as well as the relationship between the daily average units sold and the different publisher types.

I chose these research questions because, as a reader, I am personally interested in whether the number of reviews and the popularity of ebooks, measured by daily average units sold, are associated, and whether this relationship is more salient for specific publisher types. I expect a positive relationship between the daily average units sold and the total reviews of the ebooks because I believe that both are aspects of an ebook’s popularity, and I will use statistical graphs and summaries to examine this hypothesis. In practice, this information might be valuable to publishers who want to identify factors associated with an ebook’s daily average units sold, and to individual authors who are deciding where to have their books published.

Section 3

I chose to analyze the categorical variable “publisher type”. We start by creating a bar graph in which each bar represents a publisher type, such as big five or small/medium publishers, to display the count of ebooks for each publisher type and to compare their relative sizes.

library(ggplot2)
ggplot(publishers, aes(x=publisher.type)) + 
    geom_bar(fill='gold', color= 'black')+
  labs(title = "3.1")

Next, we create a frequency table of the five publisher types.

library(ggplot2)
knitr::kable(table(publishers$publisher.type), caption = "3.2")

3.2
Var1	Freq
amazon	423
big five	7309
indie	5946
single author	3608
small/medium	9741
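
The counts in table 3.2 can also be turned into relative proportions. The sketch below copies the counts from the table into a named vector (the names `counts`, `total`, and `props` are my own) so that it runs without the CSV file.

```r
# Counts copied from table 3.2
counts <- c("amazon" = 423, "big five" = 7309, "indie" = 5946,
            "single author" = 3608, "small/medium" = 9741)

total <- sum(counts)   # total number of ebooks in the table
props <- counts / total  # relative proportion for each publisher type

round(100 * props, 1)  # percentages, rounded to one decimal place
```

With the full data loaded, `prop.table(table(publishers$publisher.type))` should give the same proportions in one step.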

Section 4

ggplot(publishers, aes(x=log(daily.average.units.sold), y = log(statistics.total.reviews))) + geom_point(color="turquoise") +labs(title = "4.1")

ggplot(publishers, aes(x=log(daily.average.units.sold), y = log(statistics.total.reviews), color = publisher.type)) + geom_point() + labs(title = "4.2")

I chose scatterplots because they show the relationship between two numerical variables, and I took the log of both variables because their ranges are very large. In figure 4.2, I used different colors to represent the different publisher types when graphing the relationship between the log of daily average units sold and the log of total reviews. According to the two plots, there seems to be a weak, positive relationship between the two variables. Most points cluster between x=2.5 and x=5.0, and no points lie far from the bulk of the data, apart from a few with unusually high x values on the right. The strength of the relationship is very low, and the shape is not obvious (I speculate that it is linear) if we look only at the first graph, because the data points are piled on top of each other. If we look at the second graph, however, we notice that the yellow points, representing the “big five” publisher type, sit above the rest. Overall, as the log of daily average units sold increases, we predict that the log of total reviews also tends to increase slightly on average. Hence, my response to the first research question is that there is a weak, positive, and apparently linear relationship between the daily average units sold and the total reviews of the ebooks, although the shape is hard to determine with certainty for now.
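
One way to put a number on the strength of this relationship is the correlation coefficient of the two log-transformed variables. The sketch below is only an illustration: it uses simulated stand-in data (not the real ebook data), so the value it prints says nothing about the actual data set. It assumes that zero values are dropped before taking logs, because `log(0)` is `-Inf`.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in data (NOT the real ebook data)
n <- 200
units   <- rlnorm(n, meanlog = 3, sdlog = 1)                # daily units sold
reviews <- units^0.3 * rlnorm(n, meanlog = 3, sdlog = 1.2)  # total reviews
toy <- data.frame(daily.average.units.sold = units,
                  statistics.total.reviews = reviews)

# Keep only strictly positive values, since log(0) is -Inf
ok <- toy$daily.average.units.sold > 0 & toy$statistics.total.reviews > 0

# Pearson correlation on the log scale, matching the axes of the scatterplots
r_log <- cor(log(toy$daily.average.units.sold[ok]),
             log(toy$statistics.total.reviews[ok]))
r_log
```

A value close to 0 would support the description "weak", and a positive sign would support "positive". Replacing `toy` with `publishers` would apply the same calculation to the real data.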

Section 5

ggplot(publishers, aes(x=log(daily.average.units.sold), fill=publisher.type)) + facet_wrap( ~ publisher.type, ncol=5) + geom_histogram(binwidth = 0.5) +labs(title = "5.1")

ggplot(publishers, aes(x=log(daily.average.units.sold), y=publisher.type, fill = publisher.type)) + 
       geom_boxplot() +labs(title = "5.2")

Since we are exploring the relationship between a categorical variable (data that can be divided into groups), publisher type, and a numerical variable, daily average units sold, I used a set of histograms, labelled figure 5.1, and a boxplot, labelled figure 5.2. Figure 5.2 gives us direct visual information about the median, where the top and bottom 25% of the data lie, and how many data points deviate unusually from the rest. Figure 5.1 lets us see the frequency distribution of daily average units sold and judge its symmetry and skewness more easily. Both graphs tell us that all five publisher types have a positively skewed distribution, suggesting that the log of daily average units sold can be unusually high in some cases. Amazon has the highest median log of daily average units sold, followed by big five and indie, which have similar medians, and then by the single author and small/medium publisher types. This seems reasonable because Amazon is a large, well-known company compared to small/medium and individual publishers.
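
The medians read off the boxplot can also be computed directly. The sketch below uses a small made-up data frame (`toy`, with invented values and only two publisher types) to show the calculation; it is not the real data.

```r
# Invented values for illustration only
toy <- data.frame(
  publisher.type = c("amazon", "amazon", "amazon",
                     "indie", "indie", "indie"),
  daily.average.units.sold = c(20, 50, 400, 5, 10, 90)
)

# Median of the log-transformed units sold within each publisher type
med_log <- tapply(log(toy$daily.average.units.sold),
                  toy$publisher.type, median)

# Order the publisher types from highest to lowest median
sort(med_log, decreasing = TRUE)
```

Applying the same `tapply()` call to `publishers` would give one median per publisher type, which can be compared with the center lines of the boxes in figure 5.2.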

Section 6

There are three variables involved in this graph: the percentage of U.S. weather stations that broke all-time temperature records, the year, and the type of record broken (cold or heat). The percentage of U.S. weather stations is a numerical variable, while the year and the cold/heat record type are treated as categorical variables. The graph plots the percentage of weather stations that broke all-time temperature records in each year from 1991 to 2021 as a bar graph. Each vertical bar shows the percentage for that year, with blue denoting cold records and red denoting heat records. We can see the overall trend in the percentage of weather stations that broke all-time temperature records, the change in the percentage from one year to the next, and the relative proportions of cold and heat records in each year. In 2021, approximately 8 percent of the stations recorded record high temperatures and 2 percent recorded record low temperatures. The number of high temperature records broken has been greater than the number of low temperature records in 26 of the past 31 years, and is greatest in 2021. This might be an indication of increased global warming. The graph is relatively easy for the public to understand, but the audience cannot tell what specific criteria classify a temperature as record-breaking, whether hot or cold. Hence, if possible, I would suggest that the authors state below the graph the specific temperatures, in Fahrenheit, that are classified as hot or cold, so that the audience can better understand the red and blue groups that make up the bars.

Section 7

Krishna Karra and Tim Wallace, Climate, accessed 13/9/2022, https://www.nytimes.com/interactive/2022/01/11/climate/record-temperatures-map-2021.html.

Austin Cory Bart. publishers. 2015. CORGIS Dataset Project. csv, accessed 13/9/2022 from https://corgis-edu.github.io/corgis/csv/publishers.
