Data Analysis: An Introduction to R Tools and Techniques
Author
Compiled By: Joshua Lizardi
1 Introduction
This text is an introduction to the R programming language for basic statistics and data analysis, intended as a reference for students attending one of Trine University’s Online Graduate Programs. It covers modern tools and methods used to summarize, analyze, and interpret data. It is not meant to replace a statistics textbook, nor a textbook on R programming. Where possible, it avoids the complex mathematical theory underpinning many of the methods displayed and focuses on correctly applying each method and correctly interpreting the results. The reader should approach this text the way one approaches learning to drive: when learning to drive a car, only the rules of the road and how to operate the vehicle are necessary. There is no need to understand how the engine runs, or how to change the oil or a tire (that knowledge comes later). In this text the rules of the road are statistical concepts, and the car is R.
Hence, this text focuses on showing HOW to use basic data analysis techniques. Rarely will it go into WHY techniques are used, though, when necessary, I will try to explain WHAT is going on and point to resources for further understanding.
The text discusses descriptive analytics, basic data visualization, graphing, and clustering (including cluster plots), followed by a section on reading and understanding null hypothesis testing and p-values. It then covers some commonly used statistical tests and closes with a section on predictive modeling. Each section uses the same case study dataset. The text also briefly covers a few topics concerning data cleaning/transforming/munging/wrangling.
The text’s main focus is using R to create and present things like summary statistics, contingency tables, pie charts, bar graphs, histograms, box plots, dendrograms and hierarchical clustering, K-means clustering and clustering plots, t-tests, Chi-squared tests, ANOVA, and regression analysis.
The case study is a contrived example and presentation of a practical application of modern data science tools and techniques on a small dataset of 13 variables and 50 observations. This case study is for educational purposes and may be used, remixed, and shared freely without limitation.
When I was a student, statistics was done in spiral notebooks with large textbooks that had never-ending statistical tables printed in the back. Now there is R. R has simple, easy-to-use functions and packages for everything from densities and distributions to linear regression and neural networks; if it is statistics, it can be done in R (Navarro, n.d.).
R is freely distributed online and can be downloaded from: http://cran.r-project.org/ At the top of the page, under the heading “Download and Install R”, there are download links for Windows, Mac, and Linux users. After you have installed R, download and install RStudio from: http://rstudio.org. RStudio provides a convenient way to work with R on all platforms and is also freely available (Muller).
The R console, the lower left window in RStudio, is where you give R commands. It is the same way you would interact with R on the command line or in a terminal. In other words, the “Console” tab in the lower left window is the only part of RStudio that is actually R itself; everything else is extra/optional (Analysis of Microbiome Data in R, n.d.).
You can enter text directly into the R console next to the prompt, >, and hit Enter to run the code. For the purposes of this text we will make use of a simple text file called an R script. To start, create an R script by choosing the New File icon at the top left, then choose R Script. The R script will open in the upper left window and is basically a plain text editor (think Notepad). You can have multiple R scripts open at once, and they appear in tabs (Analysis of Microbiome Data in R, n.d.).
To follow along with the analysis presented in this text, copy and paste the R code presented in each section into a single R script. To run a script or a selection of code from the R script, put the cursor on the line of code or highlight the code selection and then click the Run button at the top of the file window, or just press Ctrl+Enter.
The “Environment” tab in the top right window lists the variables and functions present in the current R session. You can view your dataset in a tabbed window after importing it by clicking its entry in the Environment tab (Analysis of Microbiome Data in R, n.d.).
2.0.0.1 R: Installing packages and libraries
R’s flexibility, functionality, and ease of use come in the form of packages and libraries. R packages are free libraries of code written by R’s open source developer community. There are on the order of tens of thousands of R packages (Quick List of Useful R Packages, 2023). This text will make use of at least 10.
To install the R packages and libraries necessary to follow along with this text, open an RStudio session and paste the following into the console.
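The package list below is inferred from the functions used later in this text, so treat it as a starting point rather than a definitive list; install once per machine, then load each package with library() as needed.
Code
install.packages(c(
  "dplyr",        # data wrangling: group_by(), summarise(), the %>% pipe
  "knitr",        # kable() tables
  "summarytools", # dfSummary()
  "lessR",        # PieChart(), Histogram(), Plot()
  "psych",        # pairs.panels() correlation panels
  "corrplot",     # corrplot() correlation matrices
  "cluster",      # daisy(), pam() for clustering
  "factoextra",   # fviz_cluster() cluster plots
  "mosaic",       # teaching-oriented helpers used alongside ggformula
  "ggformula",    # gf_histogram() and other formula-based graphics
  "randtests",    # runs.test()
  "outliers",     # grubbs.test()
  "car",          # leveneTest(), vif()
  "rstatix",      # tidy statistical tests
  "gvlma",        # global validation of linear model assumptions
  "forecast"      # time series models
))

# Load a package into the current session, for example:
library(dplyr)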
Amazon Web Services defines structured data as “…data that has a standardized format for efficient access by software and humans alike” (What Is Structured Data?, n.d.). Structured data is typically presented in a table with rows and columns.
This text will deal with structured data only. That is, data presented in rows and columns. Columns will correspond to variables, and rows to observations.
The basic data type taxonomy for this text is Numerical and Categorical:
Numerical: Continuous, Discrete
Categorical: Nominal, Ordinal, Binary
Data type determines what kinds of graphs, tables, and tests are appropriate (Types of Data in Statistics, n.d.).
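For instance, in R numerical variables are stored as numeric or integer columns and categorical variables as factors. A minimal sketch, assuming the case study dataset has already been read into a data frame named data, as it is later in the text:
Code
# Numerical (continuous) variable: stored as numeric
class(data$Unit.Price)

# Categorical (nominal) variable: best stored as a factor
data$Platform = factor(data$Platform)
levels(data$Platform)

# Categorical (binary) variable: a two-level factor or character column
table(data$IGP)

# str() reports the type of every column at once
str(data)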
2.2 Video Game Cost & Sales Case Study
Dataset Description: The data is composed of games created by a video game publisher over 15 years. The dataset, aggregated sales data, includes 50 observations (video games) and 13 features, including marketing spend, research and development spend, etc. The data contains no missing values. The Cost, Sales, and Profit features are aggregates of other variables in the data:
Sales = Unit.Price*Units.Sold
Cost = R.D.Spend+Administration+Marketing.Spend
Profit = Sales - Cost
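A minimal sketch of these aggregates in R, assuming the dataset has already been read into a data frame named data with the column names used throughout this text:
Code
# Recompute the aggregate columns from their component variables
data$Sales  = data$Unit.Price * data$units.sold
data$cost   = data$R.D.Spend + data$Administration + data$Marketing.Spend
data$Profit = data$Sales - data$cost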
2.2.0.1 Data collection methods:
All games were divided into groups based on game type. Then games that had many expansions or parts/chapters, or many related releases, such as re-themed/re-skinned games, were either removed or averaged out to create a representative.
For games that released on multiple platforms, the best-performing platform version was kept and the rest removed. The games in the dataset were then selected randomly from the groups using a cluster sampling algorithm.
The Game ID column contains the unique identifier for each game. The ID number contains coded information about each game; for example, multi-part games of the same series will have similar IDs, the same game on different platforms will differ by only one digit in its ID, etc.
The data is presented in the .csv format, and can be downloaded here
Before you re-code a categorical variable, you first need to figure out which numerical values correspond to each of its categories in the data set (Recoding and Labeling Variables, n.d.). An example is sketched below.
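A minimal sketch of one common way to re-code a categorical variable with numeric labels; the Platform.coded column name and the level ordering are illustrative choices, not part of the case study:
Code
# Inspect the categories and their counts first
table(data$Platform)

# Re-code the categories to numeric labels via a factor
# (Platform.coded is an illustrative new column name)
data$Platform.coded = as.numeric(factor(data$Platform,
                                        levels = c("Nintendo", "PC",
                                                   "PlayStation", "XBOX")))

# Check which number ended up matched to which category
table(data$Platform, data$Platform.coded)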
2.3.2 Scaling numerical variables:
For data that has a mean and standard deviation, standardization is done by subtracting the mean from each observation and then dividing each observation by the standard deviation. The standardized variable always has mean 0 and standard deviation 1 (Standardized Linear Regression, n.d.).
For example, if profits have a mean and standard deviation of $480,000 and $160,000, respectively, then a game making $560,000 in profit has a standardized profit of (560,000 - 480,000)/160,000 = 1/2, because the profit is one-half of a standard deviation above the mean profit. The advantage of standardizing is that it facilitates the comparison of values that have different units of measurement. For example, compare the summary statistics for profits and units sold before and after standardizing.
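A minimal sketch of the same calculation in R, using the built-in scale() function, which performs exactly this subtract-the-mean, divide-by-the-standard-deviation operation; the Profit.z column name is an illustrative addition:
Code
# Standardize by hand: (value - mean) / sd
(560000 - 480000) / 160000        # 0.5, i.e. half a standard deviation above the mean

# Standardize a whole column with scale()
data$Profit.z = scale(data$Profit)

# Compare summaries before and after: the standardized version has mean 0, sd 1
summary(data$Profit);   sd(data$Profit)
summary(data$Profit.z); sd(data$Profit.z)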
2.3.3 Binning Variables:
Binning takes a variable and breaks it into a small number of ranges; this is also called grouping or discretizing. Note that a version of this is done to create the histograms for continuous variables. The basic idea is to split the data into groups based on some criteria (Dividing a Continuous Variable into Categories, n.d.; Comprehensive Guide to Grouping and Aggregating with Pandas - Practical Business Python, n.d.).
2.3.4 Aggregation of Binned Data:
An aggregation function takes multiple individual values and returns a summary. In the majority of cases this summary is a single value (Integrate.io, n.d.). The most common aggregation functions are the average or the sum of the values. Aggregation is typically used in conjunction with grouping (Zaitsev, 2017). Once the data is binned into groups, we can apply an aggregation to each group independently, after which we can combine the results into a variable to use for both descriptive visualizations and inferential analysis (Comprehensive Guide to Grouping and Aggregating with Pandas - Practical Business Python, n.d.).
Code
# numerical data only subset, for box plots
keeps = c("R.D.Spend", "Year.Created", "Marketing.Spend", "Administration",
          "cost", "Profit", "Sales", "units.sold", "IGN.Rating", "Unit.Price")
numericdata = data[, (names(data) %in% keeps)]
index = 1:ncol(numericdata)
numericdata[, index] = lapply(numericdata[, index], as.numeric)

# coding categorical variables with numeric labels and transposing (rows to columns)
# for clustering
index = 1:ncol(x)
x[, index] = lapply(x[, index], as.numeric)
str(x)

dfx = x
x = as.data.frame(t(x))

# Bin continuous variables into high / mid / low groups
data$Unit.Pricebin = cut(data$Unit.Price, breaks = c(7, 50, 70, 200),
                         labels = c("low price", "mid price", "high price"))
data$units.soldbin = cut(data$units.sold, breaks = c(0, 9000, 10000, 15000),
                         labels = c("low selling", "mid selling", "high selling"))

# grouping by and aggregating variables
GrpBysumdata = data %>% group_by(Year.Created) %>% summarise(units.sold = sum(units.sold))
colnames(GrpBysumdata)[2] = "TotalSold"
GrpBydatamean3 = data %>% group_by(Platform) %>% summarise(Unit.Price = mean(Unit.Price))
colnames(GrpBydatamean3)[2] = "AVERAGEPRICE"
GrpBydatamean3 = as.data.frame(GrpBydatamean3)
GrpBydata1 = data %>% group_by(IGP) %>% summarise(units.sold = sum(units.sold))
colnames(GrpBydata1)[2] = "TotalSold"
GrpBydata2 = data %>% group_by(IGP) %>% summarise(Sales = sum(Sales))
We start the study with a basic overview and application of descriptive analysis.
3 Descriptive Analysis
According to Harvard Business School, descriptive analytics comprises the processes and methods used on data to identify trends and find relationships. It is the simplest form of data analysis (What Is Descriptive Analytics?, 2021).
Descriptive statistics, for the purpose of this text, are numbers used to summarize and describe data; they are also referred to as summary statistics. In practice we present several descriptive statistics at once to help give as full a picture of the data as possible. Keep in mind that descriptive statistics are just that: descriptive. They do not involve “generalizing beyond the data at hand”, and they cannot help you come to conclusions or make predictions based on your data. Generalizing is the business of inferential methods, which we will see in a later section (“What Are Descriptive Statistics?,” 2021).
Descriptive statistics are presented in graphs and tables. Apart from the above-described univariate descriptive statistics (one variable only), there are also bivariate and multivariate descriptive statistics, which describe a relation between two or more variables. These could include, among others, scatter plots, cross-tabulations, clustering analysis, and other multi-dimensional graphical presentations. These are not plainly descriptive statistics anymore, since the true aim of such analysis is to provide inductive and exploratory insights (“What Are Descriptive Statistics?,” 2021).
Descriptive statistics summarize the data at hand and can present data using graphs. Exploratory analysis helps you discover correlations and relationships among variables in the dataset using graphs and tables. So what really is the difference? (Monica, 2020).
For the purposes of this text we will use the umbrella term Descriptive Analytics to refer to techniques (and attitudes) from both descriptive statistics and exploratory analysis, and we will use these terms interchangeably, in a colloquial fashion.
Descriptive analytics are reported for the entire dataset (each variable separately and all of them at once), and for subgroups predefined by domain experts or hinted at during the exploratory/descriptive phase; for example, by grouping games according to their Platform and/or IGN rating, because a manager thinks it could be important, or because an exploratory table or graph hints at a meaningful association. Below, a sample of the dataset with R code is presented.
3.1 Full Dataset Univariate Descriptive Analytics
Code
summary(data)
R.D.Spend Administration Marketing.Spend Profit
Min. : 0 Min. : 51283 Min. : 0 Min. : 6431
1st Qu.: 39936 1st Qu.:103731 1st Qu.:129300 1st Qu.: 53259
Median : 73051 Median :122700 Median :212716 Median :123546
Mean : 73722 Mean :121345 Mean :211025 Mean :107584
3rd Qu.:101603 3rd Qu.:144842 3rd Qu.:299469 3rd Qu.:149041
Max. :165349 Max. :182646 Max. :471784 Max. :226888
Platform Game.Type IGN.Rating Year.Created
Length:50 Length:50 Min. : 3.00 Min. :2007
Class :character Class :character 1st Qu.: 4.00 1st Qu.:2009
Mode :character Mode :character Median : 5.00 Median :2014
Mean : 5.52 Mean :2014
3rd Qu.: 6.00 3rd Qu.:2020
Max. :10.00 Max. :2022
cost Sales Unit.Price units.sold
Min. : 52285 Min. : 66335 Min. : 7.14 Min. : 6000
1st Qu.:293422 1st Qu.:387044 1st Qu.: 38.08 1st Qu.: 7826
Median :411889 Median :520625 Median : 52.98 Median : 8986
Mean :406091 Mean :513675 Mean : 58.44 Mean : 9273
3rd Qu.:516943 3rd Qu.:617329 3rd Qu.: 76.20 3rd Qu.:10578
Max. :774031 Max. :956208 Max. :157.22 Max. :12788
IGP gameids Unit.Pricebin units.soldbin
Length:50 284201892897.14 : 1 low price :22 low selling :25
Class :character 1142020891743.26: 1 mid price :12 mid selling : 7
Mode :character 1352018779751.58: 1 high price:16 high selling:18
1572012980480.36: 1
1662020876984.54: 1
1832020941634.5 : 1
(Other) :44
Code
knitr::kable(data)
R.D.Spend | Administration | Marketing.Spend | Profit | Platform | Game.Type | IGN.Rating | Year.Created | cost | Sales | Unit.Price | units.sold | IGP | gameids | Unit.Pricebin | units.soldbin
66051.52 | 182645.6 | 118148.20 | 226888.39 | XBOX | Action-adventure. | 10 | 2022 | 366845.3 | 593733.7 | 76.18 | 7794 | No | 41102022779476.1 | high price | low selling
100671.96 | 91790.6 | 249744.55 | 182937.09 | XBOX | Multiplayer online battle arena (MOBA) | 10 | 2021 | 442207.1 | 625144.2 | 71.65 | 8725 | Yes | 42102021872571.6 | high price | low selling
165349.20 | 136897.8 | 471784.10 | 182177.09 | PC | Role-playing (RPG, ARPG, and More) | 9 | 2020 | 774031.1 | 956208.2 | 157.22 | 6082 | No | 26920206082157.2 | high price | low selling
91992.39 | 135495.1 | 252664.93 | 166198.07 | PC | Racing. | 9 | 2019 | 480152.4 | 646350.5 | 103.91 | 6220 | Yes | 24920196220103.9 | high price | low selling
142107.34 | 91391.8 | 366168.42 | 124287.50 | PC | Real-time strategy (RTS) | 9 | 2018 | 599667.5 | 723955.0 | 87.97 | 8230 | No | 2592018823087.97 | high price | low selling
123334.88 | 108679.2 | 304981.62 | 191026.85 | PC | Multiplayer online battle arena (MOBA) | 8 | 2015 | 536995.7 | 728022.5 | 73.32 | 9930 | Yes | 2282015993073.32 | high price | mid selling
144372.41 | 118671.9 | 383199.62 | 193686.68 | PlayStation | Role-playing (RPG, ARPG, and More) | 8 | 2017 | 646243.9 | 839930.6 | 66.71 | 12590 | No | 36820171259066.7 | mid price | high selling
78389.47 | 153773.4 | 299737.29 | 112538.52 | PC | Shooters (FPS and TPS) | 8 | 2013 | 531900.2 | 644438.7 | 50.39 | 12788 | No | 28820131278850.3 | mid price | high selling
134615.46 | 147198.9 | 127716.82 | 191808.13 | Nintendo | Racing. | 8 | 2016 | 409531.2 | 601339.3 | 48.24 | 12465 | Yes | 14820161246548.2 | low price | high selling
15505.73 | 127382.3 | 35534.17 | 113608.47 | PC | Sports | 8 | 2014 | 178422.2 | 292030.7 | 27.49 | 10623 | No | 29820141062327.4 | low price | high selling
153441.51 | 101145.6 | 407934.54 | 125349.35 | Nintendo | Real-time strategy (RTS) | 7 | 2012 | 662521.6 | 787871.0 | 80.36 | 9804 | No | 1572012980480.36 | high price | mid selling
75328.87 | 144136.0 | 134050.07 | 95973.11 | Nintendo | Sports | 7 | 2011 | 353514.9 | 449488.0 | 47.68 | 9428 | Yes | 1972011942847.68 | low price | mid selling
130298.13 | 145530.1 | 323876.68 | 122804.55 | PC | Real-time strategy (RTS) | 6 | 2020 | 599704.9 | 722509.4 | 100.21 | 7210 | Yes | 25620207210100.2 | high price | low selling
78013.11 | 121597.6 | 264346.06 | 157015.76 | PC | Multiplayer online battle arena (MOBA) | 6 | 2010 | 463956.7 | 620972.5 | 88.06 | 7052 | Yes | 2262010705288.06 | high price | low selling
131876.90 | 99814.7 | 362861.36 | 146790.23 | Nintendo | Role-playing (RPG, ARPG, and More) | 6 | 2020 | 594553.0 | 741343.2 | 84.54 | 8769 | No | 1662020876984.54 | high price | low selling
76253.86 | 113867.3 | 298664.47 | 156084.42 | PC | Multiplayer online battle arena (MOBA) | 6 | 2009 | 488785.6 | 644870.0 | 76.21 | 8462 | Yes | 2262009846276.21 | high price | low selling
46426.07 | 157693.9 | 210797.67 | 149791.50 | PC | Multiplayer online battle arena (MOBA) | 6 | 2020 | 414917.7 | 564709.2 | 65.59 | 8610 | Yes | 2262020861065.59 | mid price | low selling
64664.71 | 139553.2 | 137962.62 | 155748.03 | Nintendo | Racing. | 6 | 2008 | 342180.5 | 497928.5 | 46.87 | 10624 | Yes | 14620081062446.8 | low price | high selling
63408.86 | 129219.6 | 46085.25 | 152201.97 | PC | Racing. | 6 | 2007 | 238713.7 | 390915.7 | 34.69 | 11269 | Yes | 24620071126934.6 | low price | high selling
94657.16 | 145077.6 | 282574.31 | 53310.08 | PC | Sports | 5 | 2009 | 522309.0 | 575619.1 | 90.83 | 6337 | No | 2952009633790.83 | high price | low selling
73994.56 | 122782.8 | 303319.26 | 28937.44 | PlayStation | Sports | 5 | 2007 | 500096.6 | 529034.0 | 88.17 | 6000 | Yes | 3952007600088.17 | high price | low selling
44069.95 | 51283.1 | 197029.42 | 146098.18 | XBOX | Multiplayer online battle arena (MOBA) | 5 | 2007 | 292382.5 | 438480.7 | 61.82 | 7093 | No | 4252007709361.82 | mid price | low selling
28754.33 | 118546.1 | 172795.67 | 141560.25 | PC | Multiplayer online battle arena (MOBA) | 5 | 2018 | 320096.0 | 461656.3 | 59.89 | 7708 | Yes | 2252018770859.89 | mid price | low selling
72107.60 | 127864.6 | 353183.81 | 53241.58 | PC | Sports | 5 | 2008 | 553156.0 | 606397.5 | 58.06 | 10445 | No | 29520081044558 | mid price | high selling
61136.38 | 152701.9 | 88218.23 | 138141.05 | PC | Shooters (FPS and TPS) | 5 | 2020 | 302056.5 | 440197.6 | 57.42 | 7666 | No | 2852020766657.42 | mid price | low selling
65605.48 | 153032.1 | 107138.38 | 132168.98 | PC | Shooters (FPS and TPS) | 5 | 2009 | 325775.9 | 457944.9 | 56.97 | 8038 | Yes | 2852009803856.97 | mid price | low selling
23640.93 | 96189.6 | 148001.11 | 134327.19 | Nintendo | Puzzlers and party games. | 5 | 2018 | 267831.7 | 402158.9 | 51.58 | 7797 | Yes | 1352018779751.58 | mid price | low selling
38558.51 | 82982.1 | 174999.30 | 144269.99 | XBOX | Multiplayer online battle arena (MOBA) | 5 | 2012 | 296539.9 | 440809.9 | 42.54 | 10363 | No | 42520121036342.5 | low price | high selling
91749.16 | 114175.8 | 294919.57 | 24081.67 | PlayStation | Action-adventure. | 5 | 2009 | 500844.5 | 524926.2 | 41.37 | 12690 | Yes | 31520091269041.3 | low price | high selling
61994.48 | 115641.3 | 91131.24 | 33864.79 | PlayStation | Sports | 5 | 2007 | 268767.0 | 302631.8 | 38.25 | 7912 | No | 3952007791238.25 | low price | low selling
77044.01 | 99281.3 | 140574.81 | 19261.80 | PC | Shooters (FPS and TPS) | 5 | 2011 | 316900.2 | 336162.0 | 37.13 | 9054 | Yes | 2852011905437.13 | low price | mid selling
46014.02 | 85047.4 | 205517.64 | 20769.08 | PC | Shooters (FPS and TPS) | 5 | 2020 | 336579.1 | 357348.2 | 35.89 | 9958 | Yes | 2852020995835.89 | low price | mid selling
20229.59 | 65947.9 | 185265.10 | 22495.25 | PC | Shooters (FPS and TPS) | 5 | 2016 | 271442.6 | 293937.9 | 33.42 | 8795 | No | 2852016879533.42 | low price | low selling
22177.74 | 154806.1 | 28334.72 | 132615.72 | PlayStation | Puzzlers and party games. | 5 | 2016 | 205318.6 | 337934.3 | 32.83 | 10293 | Yes | 33520161029332.8 | low price | high selling
0.00 | 135426.9 | 0.00 | 129904.93 | XBOX | Puzzlers and party games. | 5 | 2020 | 135426.9 | 265331.8 | 26.19 | 10131 | Yes | 43520201013126.1 | low price | high selling
0.00 | 116983.8 | 45173.06 | 126989.88 | PC | Sandbox. | 5 | 2009 | 162156.9 | 289146.7 | 23.46 | 12324 | Yes | 27520091232423.4 | low price | high selling
120542.52 | 148719.0 | 311613.29 | 10500.38 | PC | Racing. | 4 | 2007 | 580874.8 | 591375.1 | 76.26 | 7755 | No | 2442007775576.26 | high price | low selling
114523.61 | 122616.8 | 261776.23 | 18457.77 | PC | Shooters (FPS and TPS) | 4 | 2020 | 498916.7 | 517374.4 | 46.14 | 11213 | No | 28420201121346.1 | low price | high selling
101913.08 | 110594.1 | 229160.95 | 118468.25 | PC | Action-adventure. | 4 | 2009 | 441668.1 | 560136.4 | 45.95 | 12189 | Yes | 21420091218945.9 | low price | high selling
55493.95 | 103057.5 | 214634.81 | 12566.79 | Nintendo | Action-adventure. | 4 | 2020 | 373186.2 | 385753.0 | 43.26 | 8917 | Yes | 1142020891743.26 | low price | low selling
27892.92 | 84710.8 | 164470.71 | 11248.54 | PC | Action-adventure. | 4 | 2011 | 277074.4 | 288322.9 | 34.77 | 8292 | No | 2142011829234.77 | low price | low selling
542.05 | 51743.2 | 0.00 | 14049.35 | PC | Shooters (FPS and TPS) | 4 | 2018 | 52285.2 | 66334.5 | 7.14 | 9289 | No | 284201892897.14 | low price | mid selling
162597.70 | 151377.6 | 443898.53 | 192276.44 | PC | Sports | 3 | 2020 | 757873.8 | 950150.3 | 108.79 | 8734 | Yes | 29320208734108.7 | high price | low selling
67532.53 | 105751.0 | 304768.73 | 6430.77 | PC | Action-adventure. | 3 | 2007 | 478052.3 | 484483.1 | 77.16 | 6279 | Yes | 2132007627977.16 | high price | low selling
93863.75 | 127320.4 | 249839.44 | 111327.32 | XBOX | Sports | 3 | 2009 | 471023.6 | 582350.9 | 53.59 | 10866 | No | 49320091086653.5 | mid price | high selling
119943.24 | 156547.4 | 256512.92 | 69234.42 | Nintendo | Action-adventure. | 3 | 2007 | 533003.6 | 602238.0 | 53.59 | 11238 | No | 11320071123853.5 | mid price | high selling
1315.46 | 115816.2 | 297114.46 | 109630.39 | PlayStation | Sports | 3 | 2016 | 414246.1 | 523876.5 | 52.36 | 10006 | No | 39320161000652.3 | mid price | high selling
28663.76 | 127056.2 | 201126.82 | 79153.00 | XBOX | Action-adventure. | 3 | 2014 | 356846.8 | 435999.8 | 38.02 | 11467 | Yes | 41320141146738 | low price | high selling
86419.70 | 153514.1 | 0.00 | 84919.88 | Nintendo | Shooters (FPS and TPS) | 3 | 2020 | 239933.8 | 324853.7 | 34.50 | 9416 | No | 1832020941634.5 | low price | mid selling
1000.23 | 124153.0 | 1903.93 | 111977.20 | PC | Shooters (FPS and TPS) | 3 | 2009 | 127057.2 | 239034.4 | 27.42 | 8719 | Yes | 2832009871927.42 | low price | low selling
Code
print(dfSummary(data), method ='render')
Data Frame Summary: data
Dimensions: 50 x 16
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | R.D.Spend [numeric] | Mean (sd): 73721.6 (45902.3); min ≤ med ≤ max: 0 ≤ 73051.1 ≤ 165349; IQR (CV): 61666.4 (0.6) | 49 distinct values | 50 (100.0%) | 0 (0.0%)
2 | Administration [numeric] | Mean (sd): 121345 (28017.8); min ≤ med ≤ max: 51283.1 ≤ 122700 ≤ 182646; IQR (CV): 41111.3 (0.2) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
3 | Marketing.Spend [numeric] | Mean (sd): 211025 (122290); min ≤ med ≤ max: 0 ≤ 212716 ≤ 471784; IQR (CV): 170169 (0.6) | 48 distinct values | 50 (100.0%) | 0 (0.0%)
4 | Profit [numeric] | Mean (sd): 107584 (61181.4); min ≤ med ≤ max: 6430.8 ≤ 123546 ≤ 226888; IQR (CV): 95782.5 (0.6) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
5 | Platform [character] | 1. Nintendo, 2. PC, 3. PlayStation, 4. XBOX | 9 (18.0%), 28 (56.0%), 6 (12.0%), 7 (14.0%) | 50 (100.0%) | 0 (0.0%)
6 | Game.Type [character] | 1. Action-adventure., 2. Multiplayer online battle arena (MOBA), 3. Puzzlers and party games., 4. Racing., 5. Real-time strategy (RTS), 6. Role-playing (RPG, ARPG, and More), 7. Sandbox., 8. Shooters (FPS and TPS), 9. Sports | 8 (16.0%), 8 (16.0%), 3 (6.0%), 5 (10.0%), 3 (6.0%), 3 (6.0%), 1 (2.0%), 10 (20.0%), 9 (18.0%) | 50 (100.0%) | 0 (0.0%)
7 | IGN.Rating [integer] | Mean (sd): 5.5 (1.9); min ≤ med ≤ max: 3 ≤ 5 ≤ 10; IQR (CV): 2 (0.3) | 3: 8 (16.0%), 4: 6 (12.0%), 5: 17 (34.0%), 6: 7 (14.0%), 7: 2 (4.0%), 8: 5 (10.0%), 9: 3 (6.0%), 10: 2 (4.0%) | 50 (100.0%) | 0 (0.0%)
8 | Year.Created [integer] | Mean (sd): 2013.9 (5.1); min ≤ med ≤ max: 2007 ≤ 2014 ≤ 2022; IQR (CV): 10.8 (0) | 16 distinct values | 50 (100.0%) | 0 (0.0%)
9 | cost [numeric] | Mean (sd): 406091 (162419); min ≤ med ≤ max: 52285.2 ≤ 411889 ≤ 774031; IQR (CV): 223521 (0.4) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
10 | Sales [numeric] | Mean (sd): 513675 (184056); min ≤ med ≤ max: 66334.5 ≤ 520626 ≤ 956208; IQR (CV): 230285 (0.4) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
11 | Unit.Price [numeric] | Mean (sd): 58.4 (27.1); min ≤ med ≤ max: 7.1 ≤ 53 ≤ 157.2; IQR (CV): 38.1 (0.5) | 49 distinct values | 50 (100.0%) | 0 (0.0%)
12 | units.sold [integer] | Mean (sd): 9273.2 (1880.4); min ≤ med ≤ max: 6000 ≤ 8985.5 ≤ 12788; IQR (CV): 2752.8 (0.2) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
13 | IGP [character] | 1. No, 2. Yes | 23 (46.0%), 27 (54.0%) | 50 (100.0%) | 0 (0.0%)
14 | gameids [factor] | 1. 284201892897.14, 2. 1142020891743.26, 3. 1352018779751.58, 4. 1572012980480.36, 5. 1662020876984.54, 6. 1832020941634.5, 7. 1972011942847.68, 8. 2132007627977.16, 9. 2142011829234.77, 10. 2252018770859.89, [40 others] | 1 (2.0%) each for the ten listed IDs, 40 (80.0%) for the others | 50 (100.0%) | 0 (0.0%)
15 | Unit.Pricebin [factor] | 1. low price, 2. mid price, 3. high price | 22 (44.0%), 12 (24.0%), 16 (32.0%) | 50 (100.0%) | 0 (0.0%)
16 | units.soldbin [factor] | 1. low selling, 2. mid selling, 3. high selling | 25 (50.0%), 7 (14.0%), 18 (36.0%) | 50 (100.0%) | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1) 2023-05-13
The output of the preceding summary provides descriptive statistics for all variables in the data set; this includes summary statistics, basic plots, and contingency tables for the categorical variables (and the time variable). Next we will look at pie charts, histograms, and box plots.
Code
PieChart(Platform)
>>> Note: Platform is not in a data frame (table)
>>> Note: Platform is not in a data frame (table)
>>> suggestions
PieChart(Platform, hole=0) # traditional pie chart
PieChart(Platform, values="%") # display %'s on the chart
PieChart(Platform) # bar chart
Plot(Platform) # bubble plot
Plot(Platform, values="count") # lollipop plot
--- Platform ---
Nintendo PC PlayStation XBOX Total
Frequencies: 9 28 6 7 50
Proportions: 0.180 0.560 0.120 0.140 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 26.000, df = 3, p-value = 0.000
Code
PieChart(GameType)
>>> Note: GameType is not in a data frame (table)
>>> Note: GameType is not in a data frame (table)
>>> suggestions
PieChart(GameType, hole=0) # traditional pie chart
PieChart(GameType, values="%") # display %'s on the chart
PieChart(GameType) # bar chart
Plot(GameType) # bubble plot
Plot(GameType, values="count") # lollipop plot
--- GameType ---
GameType Count Prop
---------------------------------
Action-adventure. 8 0.160
Mltplyronlnbta(MOBA) 8 0.160
Puzzlersandpartygms. 3 0.060
Racing. 5 0.100
Real-timstratgy(RTS) 3 0.060
Rl-ply(RPG,ARPG,aMr) 3 0.060
Sandbox. 1 0.020
Shooters(FPSandTPS) 10 0.200
Sports 9 0.180
---------------------------------
Total 50 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 15.160, df = 8, p-value = 0.056
Code
PieChart(IGP)
>>> Note: IGP is not in a data frame (table)
>>> Note: IGP is not in a data frame (table)
>>> suggestions
PieChart(IGP, hole=0) # traditional pie chart
PieChart(IGP, values="%") # display %'s on the chart
PieChart(IGP) # bar chart
Plot(IGP) # bubble plot
Plot(IGP, values="count") # lollipop plot
--- IGP ---
No Yes Total
Frequencies: 23 27 50
Proportions: 0.460 0.540 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 0.320, df = 1, p-value = 0.572
Code
PieChart(Rating)
>>> Note: Rating is not in a data frame (table)
>>> Note: Rating is not in a data frame (table)
>>> suggestions
PieChart(Rating, hole=0) # traditional pie chart
PieChart(Rating, values="%") # display %'s on the chart
PieChart(Rating) # bar chart
Plot(Rating) # bubble plot
Plot(Rating, values="count") # lollipop plot
--- Rating ---
3 4 5 6 7 8 9 10 Total
Frequencies: 8 6 17 7 2 5 3 2 50
Proportions: 0.160 0.120 0.340 0.140 0.040 0.100 0.060 0.040 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 26.800, df = 7, p-value = 0.000
Code
PieChart(Year)
>>> Note: Year is not in a data frame (table)
>>> Note: Year is not in a data frame (table)
>>> suggestions
PieChart(Year, hole=0) # traditional pie chart
PieChart(Year, values="%") # display %'s on the chart
PieChart(Year) # bar chart
Plot(Year) # bubble plot
Plot(Year, values="count") # lollipop plot
--- Year ---
Year Count Prop
------------------
2007 7 0.140
2008 2 0.040
2009 8 0.160
2010 1 0.020
2011 3 0.060
2012 2 0.040
2013 1 0.020
2014 2 0.040
2015 1 0.020
2016 4 0.080
2017 1 0.020
2018 4 0.080
2019 1 0.020
2020 11 0.220
2021 1 0.020
2022 1 0.020
------------------
Total 50 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 44.080, df = 15, p-value = 0.000
>>> Low cell expected frequencies, so chi-squared approximation may not be accurate
Code
# HISTOGRAMS
Histogram(R.D.Spend)
>>> Note: R.D.Spend is not in a data frame (table)
>>> Note: R.D.Spend is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(R.D.Spend, density=TRUE) # smoothed curve + histogram
Plot(R.D.Spend) # Violin/Box/Scatterplot (VBS) plot
--- R.D.Spend ---
n miss mean sd min mdn max
50 0 73721.616 45902.256 0.000 73051.080 165349.200
No (Box plot) outliers
Bin Width: 20000
Number of Bins: 9
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 20000 10000 6 0.12 6 0.12
20000 > 40000 30000 7 0.14 13 0.26
40000 > 60000 50000 4 0.08 17 0.34
60000 > 80000 70000 14 0.28 31 0.62
80000 > 100000 90000 5 0.10 36 0.72
100000 > 120000 110000 4 0.08 40 0.80
120000 > 140000 130000 5 0.10 45 0.90
140000 > 160000 150000 3 0.06 48 0.96
160000 > 180000 170000 2 0.04 50 1.00
Code
Histogram(Administration)
>>> Note: Administration is not in a data frame (table)
>>> Note: Administration is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Administration, density=TRUE) # smoothed curve + histogram
Plot(Administration) # Violin/Box/Scatterplot (VBS) plot
--- Administration ---
n miss mean sd min mdn max
50 0 121344.64 28017.80 51283.14 122699.79 182645.56
No (Box plot) outliers
Bin Width: 20000
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
40000 > 60000 50000 2 0.04 2 0.04
60000 > 80000 70000 1 0.02 3 0.06
80000 > 100000 90000 8 0.16 11 0.22
100000 > 120000 110000 12 0.24 23 0.46
120000 > 140000 130000 13 0.26 36 0.72
140000 > 160000 150000 13 0.26 49 0.98
160000 > 180000 170000 0 0.00 49 0.98
180000 > 200000 190000 1 0.02 50 1.00
Code
Histogram(Profit)
>>> Note: Profit is not in a data frame (table)
>>> Note: Profit is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Profit, density=TRUE) # smoothed curve + histogram
Plot(Profit) # Violin/Box/Scatterplot (VBS) plot
--- Profit ---
n miss mean sd min mdn max
50 0 107583.881 61181.438 6430.770 123546.025 226888.390
No (Box plot) outliers
Bin Width: 50000
Number of Bins: 5
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 50000 25000 12 0.24 12 0.24
50000 > 100000 75000 6 0.12 18 0.36
100000 > 150000 125000 20 0.40 38 0.76
150000 > 200000 175000 11 0.22 49 0.98
200000 > 250000 225000 1 0.02 50 1.00
Code
Histogram(Unit.Price)
>>> Note: Unit.Price is not in a data frame (table)
>>> Note: Unit.Price is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Unit.Price, density=TRUE) # smoothed curve + histogram
Plot(Unit.Price) # Violin/Box/Scatterplot (VBS) plot
--- Unit.Price ---
n miss mean sd min mdn max
50 0 58.441 27.103 7.140 52.975 157.220
--- Outliers --- from the box plot: 1
Small Large
----- -----
157.2
Bin Width: 20
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------
0 > 20 10 1 0.02 1 0.02
20 > 40 30 13 0.26 14 0.28
40 > 60 50 17 0.34 31 0.62
60 > 80 70 9 0.18 40 0.80
80 > 100 90 6 0.12 46 0.92
100 > 120 110 3 0.06 49 0.98
120 > 140 130 0 0.00 49 0.98
140 > 160 150 1 0.02 50 1.00
Code
Histogram(units.sold)
>>> Note: units.sold is not in a data frame (table)
>>> Note: units.sold is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(units.sold, density=TRUE) # smoothed curve + histogram
Plot(units.sold) # Violin/Box/Scatterplot (VBS) plot
--- units.sold ---
n miss mean sd min mdn max
50 0 9273.18 1880.45 6000.00 8985.50 12788.00
No (Box plot) outliers
Bin Width: 1000
Number of Bins: 7
Bin Midpnt Count Prop Cumul.c Cumul.p
-------------------------------------------------------
6000 > 7000 6500 5 0.10 5 0.10
7000 > 8000 7500 9 0.18 14 0.28
8000 > 9000 8500 11 0.22 25 0.50
9000 > 10000 9500 7 0.14 32 0.64
10000 > 11000 10500 8 0.16 40 0.80
11000 > 12000 11500 4 0.08 44 0.88
12000 > 13000 12500 6 0.12 50 1.00
Code
Histogram(cost)
>>> Note: cost is not in a data frame (table)
>>> Note: cost is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(cost, density=TRUE) # smoothed curve + histogram
Plot(cost) # Violin/Box/Scatterplot (VBS) plot
--- cost ---
n miss mean sd min mdn max
50 0 406091.35 162419.01 52285.20 411888.64 774031.10
No (Box plot) outliers
Bin Width: 100000
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 100000 50000 1 0.02 1 0.02
100000 > 200000 150000 4 0.08 5 0.10
200000 > 300000 250000 9 0.18 14 0.28
300000 > 400000 350000 10 0.20 24 0.48
400000 > 500000 450000 11 0.22 35 0.70
500000 > 600000 550000 11 0.22 46 0.92
600000 > 700000 650000 2 0.04 48 0.96
700000 > 800000 750000 2 0.04 50 1.00
Code
Histogram(Marketing.Spend)
>>> Note: Marketing.Spend is not in a data frame (table)
>>> Note: Marketing.Spend is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Marketing.Spend, density=TRUE) # smoothed curve + histogram
Plot(Marketing.Spend) # Violin/Box/Scatterplot (VBS) plot
--- Marketing.Spend ---
n miss mean sd min mdn max
50 0 211025.098 122290.311 0.000 212716.240 471784.100
No (Box plot) outliers
Bin Width: 50000
Number of Bins: 10
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 50000 25000 8 0.16 8 0.16
50000 > 100000 75000 2 0.04 10 0.20
100000 > 150000 125000 7 0.14 17 0.34
150000 > 200000 175000 5 0.10 22 0.44
200000 > 250000 225000 7 0.14 29 0.58
250000 > 300000 275000 9 0.18 38 0.76
300000 > 350000 325000 5 0.10 43 0.86
350000 > 400000 375000 4 0.08 47 0.94
400000 > 450000 425000 2 0.04 49 0.98
450000 > 500000 475000 1 0.02 50 1.00
Code
Histogram(Sales)
>>> Note: Sales is not in a data frame (table)
>>> Note: Sales is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Sales, density=TRUE) # smoothed curve + histogram
Plot(Sales) # Violin/Box/Scatterplot (VBS) plot
--- Sales ---
n miss mean sd min mdn max
50 0 513675.23 184055.64 66334.55 520625.48 956208.19
No (Box plot) outliers
Bin Width: 100000
Number of Bins: 10
Bin Midpnt Count Prop Cumul.c Cumul.p
-----------------------------------------------------------
0 > 100000 50000 1 0.02 1 0.02
100000 > 200000 150000 0 0.00 1 0.02
200000 > 300000 250000 6 0.12 7 0.14
300000 > 400000 350000 7 0.14 14 0.28
400000 > 500000 450000 10 0.20 24 0.48
500000 > 600000 550000 10 0.20 34 0.68
600000 > 700000 650000 8 0.16 42 0.84
700000 > 800000 750000 5 0.10 47 0.94
800000 > 900000 850000 1 0.02 48 0.96
900000 > 1000000 950000 2 0.04 50 1.00
Code
# box plots, un-scaled and scaled
boxplot(numericdata)
Code
boxplot(scale(numericdata))
Now we can delve into exploring bivariate & multivariate relationships.
3.3 Bivariate Descriptive Tables for Categorical Variables.
We will see these tables again soon in the inferential section.
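No table-building code survives at this point in the text, so here is a minimal sketch of how bivariate tables for categorical variables are typically produced in base R; the pairing of Platform with the binned units-sold variable is an illustrative choice.
Code
# Cross-tabulate two categorical variables: counts for each combination
table(data$Platform, data$units.soldbin)

# The same table as proportions of the overall total
prop.table(table(data$Platform, data$units.soldbin))

# Add row and column totals for presentation
addmargins(table(data$Platform, data$units.soldbin))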
Code
corrplot(corMatrix, method = "number", order = "hclust", addrect = 4)
Code
# correlation plots
pairs.panels(numericdata)
The bivariate correlation analysis above does not take into account the categorical variables of the dataset (CorrMatrix Function - RDocumentation, n.d.).
3.5 Clustering
Next we will explore clustering variables into groups that share similar characteristics... or is it grouping variables into clusters that share similar characteristics? Whatever the terminology, variables that are similar to each other in some measurable way are put together. Calculations and definitions of similarity and distance for the clusters vary by method (Kaushik, 2016).
3.6 Hierarchical Clustering
Below, a hierarchical clustering algorithm that can handle mixed data is used. This will help us visualize potentially significant relationships between our numerical and non-numerical data in a multivariate manner (Kodali, 2016). Note that a hierarchical clustering algorithm was also used in the correlation analysis to group variables based on the values of their correlation coefficients.
To start most clustering analyses, the number of clusters we want to model needs to be specified beforehand. A graph of the calculated silhouette width versus the number of clusters (groups) is usually presented, and we pick a number of clusters with a high average silhouette width, typically just before the width falls suddenly (Mahendru, 2019). Note that as the number of clusters increases, the width generally decreases.
Code
# Hierarchical Clustering Analysis
# Plot silhouette width (higher is better)
d_dist = daisy(x, metric = "gower", type = list(logratio = 4))
sil_width = c(NA)
for (i in 2:10) {
  pam_fit = pam(d_dist, diss = TRUE, k = i)
  sil_width[i] = pam_fit$silinfo$avg.width
}
# plot number of clusters vs average silhouette width
plot(1:10, sil_width, xlab = "Number of clusters", ylim = c(0, 1),
     ylab = "Silhouette Width")
lines(1:10, sil_width)
A K-means clustering algorithm was also run on the dataset. This algorithm does not take non-numeric variables into account (Harris, 2021). Even so, a similar association among variables was found using both methods.
Code
kmeans = kmeans(scale(t(numericdata)), 4, nstart = 30000)
# plot the clusters
fviz_cluster(kmeans, data = scale(t(numericdata)), ellipse.type = "norm")
Too few points to calculate an ellipse
Too few points to calculate an ellipse
Too few points to calculate an ellipse
It’s always good practice when clustering variables to run different types of clustering algorithms on the same data. This helps give as full a picture as possible of probable correlations and associations among all variables (Kaushik, 2016). Notice any differences? Similarities? Where might these come from? Why do they exist? *“Calculations and definitions of similarity and distance for the clusters vary by method.”*
Different kinds of graphs, including the histograms, bar graphs, box plots, and time series plots below, also help in investigating the associations hinted at in the visualizations above.
Code
# HISTOGRAMS W/ FACETS: IGN rating vs categorical variables
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~IGP) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Platform) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Year.Created) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Game.Type) %>%
  gf_labs(title = "")
mplot is an enhanced plotting engine for R. The mplot() function consolidates daily plotting and formatting tasks into a single, easy-to-use application. Code and examples for one and two variables are below.
Hint: look over any graphs and tables made so far (or make your own!) and come up with questions based on them.
That concludes our exploratory study of the video game dataset. We continue the study with an overview and application of inferential analysis.
4 Inferential Analysis
Inferential analysis is concerned with making decisions and predictions, and with calculating estimates and intervals, based on the information contained in a set of data (Scott, 2009).
Hypothesis Tests: This section focuses on understanding the basics of hypothesis testing.
Below is the basic formal structure of hypothesis testing:
An observation has been made/ A question has been asked about the data.
The null (H0 ) and alternative (H1 or HA) hypotheses are specified.
With given data, a value of a statistic is calculated.
Under a set of general assumptions about the data, as well as assuming the null hypothesis is true, the distribution of the test statistic is known.
Given the distribution and value of the test statistic, as well as the form of the alternative hypothesis, we can calculate a p-value of the test.
Based on the p-value and a prespecified level of significance, we make one of two decisions: fail to reject the null hypothesis, or reject the null hypothesis.
Inferential analysis (statistical testing) is done for subgroups of variables from our dataset as a way to answer questions (confirm or deny hypotheses) about these subgroups that may have arisen during the exploratory phase.
All statistical tests include a set of assumptions, a null hypothesis, an alternative hypothesis, a p-value, and a significance level. This allows us to create statistically valid estimates and intervals (given that the set of general assumptions holds).
Keep in mind the point: the smaller the p-value, the stronger the evidence against the null hypothesis and in favor of the alternative hypothesis. Reject the null hypothesis when the p-value is small.
4.1 Summary of Inferential Analysis Steps
Step 1. Create a Question about the data, most likely based on the Exploratory phase.
Step 2. Turn the question into a statement that a relationship among variables does not exist, or that a difference between groups does not exist, and you have a null hypothesis.
Step 3. Choose an appropriate statistical test, verify any required test assumptions.
Step 4. Pick a significance level, which is 0.05, usually. (why?)
Step 5. Reject the null hypothesis if the calculated p-value is less than the significance level. (why).
For example, we may have initially grouped games according to their Platform and IGN rating and made graphs and tables. There could be many questions asked (hypotheses made) about this grouping; for example: Are the differences we see among game platform types significant? Does a significant relationship between platform type and IGN rating exist? Can we use a variable or combination of variables to predict units sold?
4.2 Assumption Checking for Statistical Testing
Assumption checking allows you to determine whether conclusions drawn from the results of your analysis are valid. Assumptions are the requirements you must fulfill (Moran, 2017). To stick with our car example: just as you should not drive a car until you can demonstrate working knowledge of the rules of the road, you should not conduct statistical analyses without demonstrating that your data follows the rules (and can receive a permit for testing, if you will).
So now that we have talked about how to come up with questions, create hypotheses, and check our test assumptions, let’s practice reading p-value results for a statistical test. We will start with an example of assumption checking, testing two assumptions about our data. As you will notice in the statistical test assumption lists that follow, a common assumption is that observations are independent and random.
4.2.1 Ljung-Box Test for independence
The Ljung-Box test (sometimes called the portmanteau test) is used to test whether or not observations over time are independent (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
Ljung-Box test assumptions: this procedure requires certain assumptions on the data which we will not discuss; see (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
H0: The observations are independent in time.
HA: The observations are not independent in time.
4.2.2 Run Test of Randomness
The runs test of randomness, sometimes called the Geary test, is a nonparametric test commonly used to test data for randomness (Lani, 2009).
Runs Test for randomness assumptions:
Assumption #1: Independence of observations.
H0: the sequence of observations is random
Ha: the sequence of observations is not random (there exists some pattern)
Code
x$gameids = data$gameids
index = 1:ncol(x)
x[, index] = lapply(x[, index], as.numeric)

## H0: The symbols occur in random order; reject if p-value < .05
randtests::runs.test(x$gameids)
Runs Test
data: x$gameids
statistic = -1.143, runs = 22, n1 = 25, n2 = 25, n = 50, p-value =
0.253
alternative hypothesis: nonrandomness
Code
# Null Hypothesis (H0): autocorrelation is not present
# If the p-value < .05, reject
Box.test(x$gameids)
What should we conclude from the testing? What are the possible issues our data could face in testing? More on these issues at the end of the analysis.
4.3 Chi Squared Testing
R’s built-in chi-squared test, chisq.test, compares the proportion of counts in each category with the expected proportions. By default, the expected frequencies in each category are assumed to be equal (Team, 2018). The goodness-of-fit test is the test you would use to check whether the differences you observe in visuals like pie charts or bar graphs are statistically significant. The test of association is used to test for relationships between two categorical variables.
4.3.0.1 Chi2 Goodness of Fit Test assumptions:
Assumption #1: One categorical variable.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
Test Hypothesis:
H0: All category groups have Equal Probabilities.
H1: At least one of the categories has a Probability unequal to other categories.
Code
# PieChart() makes a graph along with chisq.test for counts.
# Note the violations of assumption #4 for 2 of the tests.
# Remedial Measures:
chisq.test(counts(data$Platform), simulate.p.value = TRUE)
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$Platform)
X-squared = 26, df = NA, p-value = 0.0005
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$IGP)
X-squared = 0.32, df = NA, p-value = 0.679
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$IGN.Rating)
X-squared = 26.8, df = NA, p-value = 0.0015
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$Game.Type)
X-squared = 15.16, df = NA, p-value = 0.059
4.4 Remedial Measures: Chi2 Testing
We take our first look at how to make corrections to the data, or to the test, when the test assumptions are not valid for the data. In this case we use a different method to obtain a valid p-value.
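The construction of the cross-tabulations tested below is not shown in the text; here is a minimal sketch of how such tables are typically built with table() before calling chisq.test() with a simulated p-value. The object names match the output that follows, but the construction itself is an assumption.
Code
# Assumed construction of the cross-tabulations tested below
Platform.IGNRating = table(data$Platform, data$IGN.Rating)
Platform.IGP       = table(data$Platform, data$IGP)

# Test of association with a simulated (Monte Carlo) p-value instead of the
# chi-squared approximation, which is unreliable when expected cell counts are small
chisq.test(Platform.IGNRating, simulate.p.value = TRUE)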
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: Platform.IGNRating
X-squared = 34.52, df = NA, p-value = 0.0295
Code
chisq.test(Platform.IGP,simulate.p.value =TRUE)
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: Platform.IGP
X-squared = 0.5087, df = NA, p-value = 0.977
4.5 t-Test & ANOVA
We continue our study of inference with a comparison of the average amounts spent on developing games by different departments, which are unrelated/independent groups.
4.5.0.1 Unpaired t-Testing
Assumption #1: Two continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: Both variables are approximately normally distributed.
Assumption #4: Both variables have approximately the same variance.
Assumption #5: No significant outliers.
Hypothesis
H0: The difference between variable means is zero
H1: The difference between variable means is not zero
The following object is masked from 'package:psych':
outlier
Code
library(rstatix)
Attaching package: 'rstatix'
The following object is masked from 'package:MASS':
select
The following objects are masked from 'package:mosaic':
cor_test, prop_test, t_test
The following object is masked from 'package:stats':
filter
Code
library(car)
grubbs.test(data$Administration, type = 10, opposite = FALSE, two.sided = FALSE)
Grubbs test for one outlier
data: data$Administration
G = 2.5006, U = 0.8698, p-value = 0.251
alternative hypothesis: lowest value 51283.14 is an outlier
Code
grubbs.test(data$Marketing.Spend, type =10, opposite =FALSE, two.sided =FALSE)
Grubbs test for one outlier
data: data$Marketing.Spend
G = 2.1323, U = 0.9053, p-value = 0.743
alternative hypothesis: highest value 471784.1 is an outlier
Code
grubbs.test(data$R.D.Spend, type =10, opposite =FALSE, two.sided =FALSE)
Grubbs test for one outlier
data: data$R.D.Spend
G = 1.996, U = 0.917, p-value = 1
alternative hypothesis: highest value 165349.2 is an outlier
Code
shapiro.test(data$Administration)
Shapiro-Wilk normality test
data: data$Administration
W = 0.9702, p-value = 0.237
Code
shapiro.test(data$Marketing.Spend)
Shapiro-Wilk normality test
data: data$Marketing.Spend
W = 0.9744, p-value = 0.345
Code
shapiro.test(data$R.D.Spend)
Shapiro-Wilk normality test
data: data$R.D.Spend
W = 0.9673, p-value = 0.18
Code
leveneTest(data$Administration,data$Marketing.Spend, center = mean)
Levene's Test for Homogeneity of Variance (center = mean)
Df F value Pr(>F)
group 47 0.255 0.974
2
Code
leveneTest(data$Marketing.Spend,data$R.D.Spend)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 48 171315312150900043184486424644 0.00000000000000192 ***
1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
leveneTest(data$Administration,data$R.D.Spend)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 48 33984670753675280972846260086 0.00000000000000431 ***
1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# reject if p-value < .05
t.test(data$Administration, data$Marketing.Spend)
Welch Two Sample t-test
data: data$Administration and data$Marketing.Spend
t = -5.055, df = 54.13, p-value = 0.00000526
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-125250.2 -54110.7
sample estimates:
mean of x mean of y
121345 211025
Code
t.test(data$R.D.Spend,data$Marketing.Spend)
Welch Two Sample t-test
data: data$R.D.Spend and data$Marketing.Spend
t = -7.433, df = 62.54, p-value = 0.000000000365
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-174223 -100384
sample estimates:
mean of x mean of y
73721.6 211025.1
Code
t.test(data$Administration,data$R.D.Spend)
Welch Two Sample t-test
data: data$Administration and data$R.D.Spend
t = 6.262, df = 81.06, p-value = 0.0000000172
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
32491.1 62755.0
sample estimates:
mean of x mean of y
121344.6 73721.6
4.6 Remedial Measures: t-Testing
Again, in this case we use a different method to obtain a valid p-value. How? The standard (Student’s) t-test assumes that the variances of the two groups are equal. We instead use Welch’s t-test, a test for means in which equal variances are not assumed; in other words, we find an alternative test for which the violated assumption does not matter. This is a typical way around a violated assumption. Note that Welch’s t-test is not a nonparametric test; it is a modified t-test that simply drops the equal-variance assumption, and it is what R’s t.test() runs by default. (See: Welch’s t-test in R; Some notes about Welch’s t-test; The power of Welch’s test.)
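A minimal sketch of the distinction, reusing two of the spend columns compared above; since var.equal = FALSE is the default, the earlier t.test() calls already ran Welch’s version.
Code
# Student's t-test: assumes the two groups have equal variances
t.test(data$Administration, data$Marketing.Spend, var.equal = TRUE)

# Welch's t-test: equal variances not assumed (this is R's default)
t.test(data$Administration, data$Marketing.Spend, var.equal = FALSE)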
Kruskal-Wallis rank sum test
data: spend by Department
Kruskal-Wallis chi-squared = 48.32, df = 2, p-value = 0.0000000000322
4.8 Remedial Measures: ANOVA
“Roughly speaking, a test or estimator is called ‘robust’ if it still works reasonably well, even if some assumptions required for its theoretical development are not met in practice (BruceET, 2021).” The ANOVA test for means is considered to be robust to violations of the homogeneity of variances assumption when the groups’ sizes are similar.
In our case the group sizes are equal, so the results from our ANOVA test are valid even though our data failed assumption #4. Note that we still used an alternative test, the Kruskal-Wallis test, which is the nonparametric equivalent of the one-way ANOVA. Notice that the results from all tests agree.
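The code that produced the Kruskal-Wallis output above is not shown in this section, so here is a minimal sketch of how the spend-by-department comparison could be set up; the long-format data frame and the names spend and Department are assumptions inferred from the output.
Code
# Reshape the three spend columns into long format: one spend value per row,
# labeled by the department it came from (assumed reconstruction)
long = data.frame(
  spend      = c(data$R.D.Spend, data$Administration, data$Marketing.Spend),
  Department = rep(c("R.D.Spend", "Administration", "Marketing.Spend"), each = nrow(data))
)

# One-way ANOVA: do the mean spends differ across departments?
summary(aov(spend ~ Department, data = long))

# Kruskal-Wallis: the nonparametric alternative reported above
kruskal.test(spend ~ Department, data = long)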
5 Predictive Analysis
According to investopedia.com, predictive analysis is the use of a mix of statistical and machine learning modeling techniques for making predictions about the future. “Predictive analytics looks at current and historical data patterns to determine if those patterns are likely to emerge again.” (Predictive Analytics, n.d.)
This section of the text will cover: regression analysis for predicting numerical variables; logistic regression for predicting categorical variables (also referred to as classification); time series analysis for the prediction of time-dependent variables; and finally a brief introduction to utilizing neural networks for prediction and classification.
One notable difference between this section and the last is the switch to focusing on performance metrics and the use of hold-out datasets for model validation.
While the paradigm of p-value calculation and hypothesis testing does play a role here (mostly in the model and variable selection phases), significance testing of variables is not the goal; the goal is good predictions on unseen data. To this end, methods from both previous sections will be utilized, so a good grasp of the topics previously covered is imperative before moving forward.
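Before the regression examples, here is a minimal sketch of the hold-out idea mentioned above; the 80/20 split fraction, the single-predictor model, and the use of RMSE are illustrative choices, not the text’s own split (which appears later in this section).
Code
# Split the data into a training set and a hold-out (test) set
set.seed(1)
train_idx = sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train = data[train_idx, ]
test  = data[-train_idx, ]

# Fit on the training set only, then predict on the unseen hold-out rows
fit   = lm(units.sold ~ Unit.Price, data = train)
preds = predict(fit, newdata = test)

# A performance metric: root mean squared error on the hold-out set
sqrt(mean((test$units.sold - preds)^2))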
5.1 Regression Analysis
Regression analysis is used to predict the value of a numerical dependent variable based on the value of at least one independent variable. Methods of inference aid in explaining the impact of changes in the independent variables on the dependent variable, and these techniques can help in producing models that make “better” predictions.
Assumption #1: Multiple feature variables of any type, One continuous target variable.
Assumption #2: The relationship between the features and some transformation of target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: All variables approximately normally distributed.
Assumption #5: All variables have approximately the same variance.
Assumption #6: There is No Multicollinearity Among feature Variables.
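Assumption #6 is commonly checked with variance inflation factors (VIFs); below is a minimal sketch using car::vif() (the car package is loaded elsewhere in this text). The model formula is illustrative.
Code
library(car)

# Fit a model on a set of candidate feature variables, then compute variance
# inflation factors; values well above 5-10 suggest problematic multicollinearity
vif_model = lm(units.sold ~ R.D.Spend + Administration + Marketing.Spend + Unit.Price,
               data = data)
vif(vif_model)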
Code
rm(list = ls())
library(gvlma)
library(MASS)
library(forecast)

data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
drops <- c("gameids")
data = data[, !(names(data) %in% drops)]

# regression model predicting units sold
# model fitting: marketing
model1 = lm(data$units.sold ~ data$Marketing.Spend, data = data)
mod1 = gvlma(model1)
summary(model1)
Call:
lm(formula = data$units.sold ~ data$Marketing.Spend, data = data)
Residuals:
Min 1Q Median 3Q Max
-3051 -1592 -356 1132 3731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9780.21420 533.32979 18.3 <0.0000000000000002 ***
data$Marketing.Spend -0.00240 0.00219 -1.1 0.28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1880 on 48 degrees of freedom
Multiple R-squared: 0.0244, Adjusted R-squared: 0.00409
F-statistic: 1.2 on 1 and 48 DF, p-value: 0.279
Code
plot.gvlma(mod1)
Code
mod1
Call:
lm(formula = data$units.sold ~ data$Marketing.Spend, data = data)
Coefficients:
(Intercept) data$Marketing.Spend
9780.2142 -0.0024
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model1)
Value p-value Decision
Global Stat 3.24845 0.517 Assumptions acceptable.
Skewness 0.79037 0.374 Assumptions acceptable.
Kurtosis 1.20017 0.273 Assumptions acceptable.
Link Function 0.00811 0.928 Assumptions acceptable.
Heteroscedasticity 1.24981 0.264 Assumptions acceptable.
Code
# model fitting: R&D spend
model2 = lm(data$units.sold ~ data$R.D.Spend, data = data)
mod2 = gvlma(model2)
summary(model2)
Call:
lm(formula = data$units.sold ~ data$R.D.Spend, data = data)
Residuals:
Min 1Q Median 3Q Max
-3273 -1432 -299 1241 3522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9380.28099 511.74898 18.33 <0.0000000000000002 ***
data$R.D.Spend -0.00145 0.00591 -0.25 0.81
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1900 on 48 degrees of freedom
Multiple R-squared: 0.00126, Adjusted R-squared: -0.0195
F-statistic: 0.0604 on 1 and 48 DF, p-value: 0.807
Code
plot.gvlma(mod2)
Code
mod2
Call:
lm(formula = data$units.sold ~ data$R.D.Spend, data = data)
Coefficients:
(Intercept) data$R.D.Spend
9380.28099 -0.00145
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model2)
Value p-value Decision
Global Stat 3.463 0.484 Assumptions acceptable.
Skewness 0.282 0.595 Assumptions acceptable.
Kurtosis 1.288 0.256 Assumptions acceptable.
Link Function 0.391 0.532 Assumptions acceptable.
Heteroscedasticity 1.502 0.220 Assumptions acceptable.
Code
# model fitting: Unit Price
model3 = lm(data$units.sold ~ data$Unit.Price, data = data)
mod3 = gvlma(model3)
summary(model3)
Call:
lm(formula = data$units.sold ~ data$Unit.Price, data = data)
Residuals:
Min 1Q Median 3Q Max
-2259 -1287 -157 1220 3641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11567.35 531.00 21.78 < 0.0000000000000002 ***
data$Unit.Price -39.26 8.26 -4.75 0.000019 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1570 on 48 degrees of freedom
Multiple R-squared: 0.32, Adjusted R-squared: 0.306
F-statistic: 22.6 on 1 and 48 DF, p-value: 0.0000185
Code
plot.gvlma(mod3)
Code
mod3
Call:
lm(formula = data$units.sold ~ data$Unit.Price, data = data)
Coefficients:
(Intercept) data$Unit.Price
11567.4 -39.3
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model3)
Value p-value Decision
Global Stat 2.5846 0.630 Assumptions acceptable.
Skewness 1.5604 0.212 Assumptions acceptable.
Kurtosis 0.9497 0.330 Assumptions acceptable.
Link Function 0.0643 0.800 Assumptions acceptable.
Heteroscedasticity 0.0101 0.920 Assumptions acceptable.
Code
# all non-correlated variables
model = lm(data$units.sold ~ . -units.sold -Sales -Profit -cost -Year.Created -IGN.Rating, data = data)
mod = gvlma(model)
summary(model)
Call:
lm(formula = ((data$units.sold^lambda)/lambda) ~ . - units.sold -
Sales - Profit - cost - Year.Created - IGN.Rating, data = data)
Coefficients:
(Intercept)
-0.0825937832
R.D.Spend
0.0000000550
Administration
0.0000000936
Marketing.Spend
0.0000000410
PlatformPC
0.0008674707
PlatformPlayStation
-0.0007373308
PlatformXBOX
0.0033384235
Game.TypeMultiplayer online battle arena (MOBA)
0.0033065419
Game.TypePuzzlers and party games.
0.0024447066
Game.TypeRacing.
0.0029368748
Game.TypeReal-time strategy (RTS)
0.0036570825
Game.TypeRole-playing (RPG, ARPG, and More)
0.0086166708
Game.TypeSandbox.
0.0074161604
Game.TypeShooters (FPS and TPS)
0.0001845619
Game.TypeSports
0.0011497796
Unit.Price
-0.0004116328
IGPYes
0.0010734659
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = new_model)
Value p-value Decision
Global Stat 7.005 0.1356 Assumptions acceptable.
Skewness 0.841 0.3592 Assumptions acceptable.
Kurtosis 0.874 0.3497 Assumptions acceptable.
Link Function 4.562 0.0327 Assumptions NOT satisfied!
Heteroscedasticity 0.729 0.3933 Assumptions acceptable.
Code
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168988)
sample = sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train = data[sample, ]
test = data[!sample, ]
# Refined model fit, fix for each model
model = lm(units.sold ~ . -units.sold -Sales -Profit -cost -Year.Created -IGN.Rating, data = train)
# model evaluation
summary(model)
Call:
lm(formula = units.sold ~ . - units.sold - Sales - Profit - cost -
Year.Created - IGN.Rating, data = train)
Coefficients:
(Intercept)
8983.97070
R.D.Spend
0.01917
Administration
0.01818
Marketing.Spend
0.00783
PlatformPC
326.85229
PlatformPlayStation
354.10692
PlatformXBOX
1265.87947
Game.TypeMultiplayer online battle arena (MOBA)
536.54472
Game.TypePuzzlers and party games.
437.30033
Game.TypeRacing.
867.70128
Game.TypeReal-time strategy (RTS)
400.81265
Game.TypeRole-playing (RPG, ARPG, and More)
3371.99236
Game.TypeSandbox.
2488.26148
Game.TypeShooters (FPS and TPS)
-262.78032
Game.TypeSports
776.93612
Unit.Price
-107.15825
IGPYes
558.16095
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 4.05441 0.399 Assumptions acceptable.
Skewness 1.70355 0.192 Assumptions acceptable.
Kurtosis 0.04579 0.831 Assumptions acceptable.
Link Function 2.29652 0.130 Assumptions acceptable.
Heteroscedasticity 0.00855 0.926 Assumptions acceptable.
Code
# test model on unseen data
pdata = predict(model, newdata = test)
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
accuracy(pdata, test$units.sold)
ME RMSE MAE MPE MAPE
Test set 41.622 1131.26 948.576 -0.870989 9.9329
Code
res = as.data.frame(round(test$units.sold - pdata, 0))
table = cbind(test$units.sold, round(pdata, 0), res)
table
Call:
lm(formula = ((units.sold^lambda)/lambda) ~ . - units.sold -
Sales - Profit - cost - Year.Created - IGN.Rating, data = train)
Coefficients:
(Intercept)
28.58733713
R.D.Spend
0.00001115
Administration
0.00001147
Marketing.Spend
0.00000481
PlatformPC
0.23168335
PlatformPlayStation
0.11422110
PlatformXBOX
0.81411645
Game.TypeMultiplayer online battle arena (MOBA)
0.47406392
Game.TypePuzzlers and party games.
0.46950196
Game.TypeRacing.
0.50587260
Game.TypeReal-time strategy (RTS)
0.33752255
Game.TypeRole-playing (RPG, ARPG, and More)
1.89507334
Game.TypeSandbox.
1.35910178
Game.TypeShooters (FPS and TPS)
-0.03027904
Game.TypeSports
0.55574426
Unit.Price
-0.06424986
IGPYes
0.25880967
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = new_model)
Value p-value Decision
Global Stat 3.5517 0.470 Assumptions acceptable.
Skewness 1.5891 0.207 Assumptions acceptable.
Kurtosis 0.0832 0.773 Assumptions acceptable.
Link Function 1.7883 0.181 Assumptions acceptable.
Heteroscedasticity 0.0911 0.763 Assumptions acceptable.
Code
# test model on unseen data
pdata2 = predict(new_model, newdata = test)
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
accuracy(pdata2 * lambda, (test$units.sold ^ lambda))
ME RMSE MAE MPE MAPE
Test set -0.00266521 0.107353 0.0936984 -0.0930724 1.77156
Note* “The one way ANOVA model is identical to the linear regression model with one categorical variable - the group. When using the linear regression the results will be the same ANOVA table and the same p-value.” - https://www.statskingdom.com/doc_anova.html. Try it out on the anovadata data set and see if you get the same results, as in the sketch below.
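As a quick illustration of that note, the sketch below fits the same comparison two ways on R's built-in PlantGrowth data (a stand-in here; swap in the anovadata data set to check it yourself): once with aov() and once with lm(). The F statistic and p-value from anova() on the lm fit should match the ANOVA table.
Code
# One-way ANOVA and linear regression with one categorical predictor are the same model
aov_fit <- aov(weight ~ group, data = PlantGrowth)
lm_fit  <- lm(weight ~ group, data = PlantGrowth)

summary(aov_fit)   # ANOVA table: F and p-value for group
anova(lm_fit)      # same F and p-value from the regression fit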
5.2 Logistic Regression Analysis
Logistic regression analysis concerns statistical models known as logit models, as opposed to the linear models used in regression, though this is a bit of a misnomer: both kinds of model are subsets of a larger class of models called generalized linear models, which also includes ANOVA. (Beyond Logistic Regression, n.d.)
Logit models are used in predictive analytics for categorical dependent variables, based on at least one independent variable (What Is Logistic Regression?, n.d.). Predictive analysis for a categorical dependent variable is often referred to as classification.
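A minimal sketch of the idea, using R's built-in mtcars data rather than the case study (the variables here are illustrative only): glm() with family = "binomial" fits the model on the log-odds (logit) scale, and predict(..., type = "response") converts those log-odds back to probabilities between 0 and 1, which can then be cut at a threshold to produce class predictions.
Code
# Predict a binary outcome (automatic vs. manual transmission) from weight
logit_fit <- glm(am ~ wt, data = mtcars, family = "binomial")

log_odds <- predict(logit_fit)                     # link (log-odds) scale
probs    <- predict(logit_fit, type = "response")  # probability scale
classes  <- ifelse(probs > 0.5, 1, 0)              # classify at a 0.5 threshold

head(cbind(log_odds, probs, classes))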
Assumption #1: Multiple feature variables of any type; one categorical (typically binary) target variable.
Assumption #2: The relationship between the features and the log-odds of the target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: There is no multicollinearity among the feature variables.
Code
library(forecast)
library(caret)
Attaching package: 'caret'
The following object is masked from 'package:mosaic':
dotPlot
Code
library(fastDummies)
rm(list = ls())
data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168989) #7 and 0
sample <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train <- data[sample, ]
test <- data[!sample, ]
logmodel <- glm(IGP ~ ., data = train, family = "binomial")
summary(logmodel)
Call:
glm(formula = IGP ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.00002834 -0.00001433 -0.00000002 0.00000421 0.00007239
Coefficients: (1 not defined because of singularities)
Estimate
(Intercept) 29240.4281452081013
R.D.Spend 73414.1703861917777
Administration 73414.1613511074102
Marketing.Spend 73414.1672007999732
Profit 73414.1714038907376
PlatformPC -373.1464739538064
PlatformPlayStation -433.0080715562260
PlatformXBOX -1070.2425904622673
Game.TypeMultiplayer online battle arena (MOBA) -108.4417547538506
Game.TypePuzzlers and party games. -467.8504076024070
Game.TypeRacing. -764.3972176317810
Game.TypeReal-time strategy (RTS) -618.0205071556582
Game.TypeRole-playing (RPG, ARPG, and More) -2328.3049753069054
Game.TypeSandbox. -1670.0744821309625
Game.TypeShooters (FPS and TPS) -9.5374177143739
Game.TypeSports -48.6225275119522
IGN.Rating 47.8319401061441
Year.Created -15.7986001839345
cost NA
Sales -73414.1783372929203
Unit.Price 77.0297076037164
units.sold 0.4083716522604
gameids 0.0000000000193
Std. Error z value
(Intercept) 15849567.5515316426754 0
R.D.Spend 35521249.4668596461415 0
Administration 35521243.1408409625292 0
Marketing.Spend 35521247.9900171309710 0
Profit 35521251.2150845378637 0
PlatformPC 289563.4477628378663 0
PlatformPlayStation 255740.8106062586885 0
PlatformXBOX 673425.5798976761289 0
Game.TypeMultiplayer online battle arena (MOBA) 394560.5836117870640 0
Game.TypePuzzlers and party games. 317232.7138482637238 0
Game.TypeRacing. 38751068.2598710581660 0
Game.TypeReal-time strategy (RTS) 621468.2836358817294 0
Game.TypeRole-playing (RPG, ARPG, and More) 1336643.3898657865357 0
Game.TypeSandbox. 995403.4291097223759 0
Game.TypeShooters (FPS and TPS) 170702.5483604636102 0
Game.TypeSports 162427.0008985286404 0
IGN.Rating 45123.0311261153620 0
Year.Created 8273.9072929590766 0
cost NA NA
Sales 35521252.6594500169158 0
Unit.Price 36031.9709425770270 0
units.sold 198.9865542775573 0
gameids 0.0000000127286 0
Pr(>|z|)
(Intercept) 1
R.D.Spend 1
Administration 1
Marketing.Spend 1
Profit 1
PlatformPC 1
PlatformPlayStation 1
PlatformXBOX 1
Game.TypeMultiplayer online battle arena (MOBA) 1
Game.TypePuzzlers and party games. 1
Game.TypeRacing. 1
Game.TypeReal-time strategy (RTS) 1
Game.TypeRole-playing (RPG, ARPG, and More) 1
Game.TypeSandbox. 1
Game.TypeShooters (FPS and TPS) 1
Game.TypeSports 1
IGN.Rating 1
Year.Created 1
cost NA
Sales 1
Unit.Price 1
units.sold 1
gameids 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.942861533340 on 30 degrees of freedom
Residual deviance: 0.000000013759 on 9 degrees of freedom
AIC: 44
Number of Fisher Scoring iterations: 25
Code
# test model on unseen data
pdata <- predict(logmodel, newdata = test)
pdata = as.data.frame(ifelse(pdata > 0, 0, 1))
test$IGP <- ifelse(as.numeric(test$IGP) > 1, 1, 0)
table = as.data.frame(cbind(test$IGP, pdata))
confusionMatrix(factor(table$`test$IGP`), factor(table$`ifelse(pdata > 0, 0, 1)`))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2 5
1 1 11
Accuracy : 0.684
95% CI : (0.434, 0.874)
No Information Rate : 0.842
P-Value [Acc > NIR] : 0.979
Kappa : 0.23
Mcnemar's Test P-Value : 0.221
Sensitivity : 0.667
Specificity : 0.688
Pos Pred Value : 0.286
Neg Pred Value : 0.917
Prevalence : 0.158
Detection Rate : 0.105
Detection Prevalence : 0.368
Balanced Accuracy : 0.677
'Positive' Class : 0
Code
#####################################################################
rm(list = ls())
data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168989) #7 and 0
sample <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train <- data[sample, ]
test <- data[!sample, ]
logmodel <- glm(IGP ~ . -IGP -gameids -cost -Platform, data = train, family = "binomial")
summary(logmodel)
Call:
glm(formula = IGP ~ . - IGP - gameids - cost - Platform, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.476 -0.379 0.000 0.183 1.553
Coefficients:
Estimate Std. Error z value
(Intercept) 399.60880 421.41572 0.95
R.D.Spend 696.87320 713.16243 0.98
Administration 696.87321 713.16237 0.98
Marketing.Spend 696.87312 713.16235 0.98
Profit 696.87311 713.16236 0.98
Game.TypeMultiplayer online battle arena (MOBA) 8.97974 5.92772 1.51
Game.TypePuzzlers and party games. 13.50005 6926.30230 0.00
Game.TypeRacing. 13.28762 5542.64153 0.00
Game.TypeReal-time strategy (RTS) 20.43086 10754.01114 0.00
Game.TypeRole-playing (RPG, ARPG, and More) -37.71766 5487.58612 -0.01
Game.TypeSandbox. 0.72570 10754.02223 0.00
Game.TypeShooters (FPS and TPS) -0.24538 2.05925 -0.12
Game.TypeSports -2.52468 3.61598 -0.70
IGN.Rating 1.65462 1.25927 1.31
Year.Created -0.23688 0.22792 -1.04
Sales -696.87331 713.16249 -0.98
Unit.Price 1.24655 1.03298 1.21
units.sold 0.00821 0.00685 1.20
Pr(>|z|)
(Intercept) 0.34
R.D.Spend 0.33
Administration 0.33
Marketing.Spend 0.33
Profit 0.33
Game.TypeMultiplayer online battle arena (MOBA) 0.13
Game.TypePuzzlers and party games. 1.00
Game.TypeRacing. 1.00
Game.TypeReal-time strategy (RTS) 1.00
Game.TypeRole-playing (RPG, ARPG, and More) 0.99
Game.TypeSandbox. 1.00
Game.TypeShooters (FPS and TPS) 0.91
Game.TypeSports 0.49
IGN.Rating 0.19
Year.Created 0.30
Sales 0.33
Unit.Price 0.23
units.sold 0.23
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.943 on 30 degrees of freedom
Residual deviance: 18.824 on 13 degrees of freedom
AIC: 54.82
Number of Fisher Scoring iterations: 18
Code
# test model on unseen data
pdata <- predict(logmodel, newdata = test)
pdata = as.data.frame(ifelse(pdata > 0, 0, 1))
test$IGP <- ifelse(as.numeric(test$IGP) > 1, 1, 0)
table = as.data.frame(cbind(test$IGP, pdata))
confusionMatrix(factor(table$`test$IGP`), factor(table$`ifelse(pdata > 0, 0, 1)`))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5 2
1 1 11
Accuracy : 0.842
95% CI : (0.604, 0.966)
No Information Rate : 0.684
P-Value [Acc > NIR] : 0.105
Kappa : 0.65
Mcnemar's Test P-Value : 1.000
Sensitivity : 0.833
Specificity : 0.846
Pos Pred Value : 0.714
Neg Pred Value : 0.917
Prevalence : 0.316
Detection Rate : 0.263
Detection Prevalence : 0.368
Balanced Accuracy : 0.840
'Positive' Class : 0
5.3 Time Series Analysis
Time series analysis is the analysis of data collected over time. In time series analysis, time is a significant variable; this dependence on time is usually something we want to avoid, especially in the regression methods covered so far. Predictive analysis using time-dependent data is usually referred to as forecasting.
The key difference between modeling data via time series methods and using the methods discussed so far is that “Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for.”
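The sketch below, using R's built-in AirPassengers monthly series (an illustration only, not the case-study data), shows two quick ways to look for that internal structure: decompose() separates trend and seasonal variation, and acf() plots the autocorrelation of the series with its own past values.
Code
# AirPassengers is a built-in monthly time series (1949-1960)
plot(decompose(AirPassengers))   # trend, seasonal, and remainder components
acf(AirPassengers)               # autocorrelation with lagged copies of itself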
Code
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 13 forecasts with cutoffs between 2021-11-29 02:00:00 and 2022-05-30 14:00:00
Code
###############################################
rm(list = ls())
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = sum(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Code
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 1 forecasts with cutoffs between 2021-12-03 14:00:00 and 2021-12-03 14:00:00
Code
###############################################
rm(list = ls())
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Code
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 1 forecasts with cutoffs between 2021-12-03 14:00:00 and 2021-12-03 14:00:00
5.4 Neural Networks
Neural networks are models inspired by the structure of the human brain and are used to detect patterns in data sets. These models can detect the most subtle and complex relationships between variables using sheer mathematical power. Neural networks can be used to make predictions for dependent variables of any type, including numerical, categorical, and time series.
The structure of a neural-network algorithm has three kinds of layers. The input layer is where each variable enters the network, so the size of the input layer is the number of feature variables in your dataset. The output layer is where the results are displayed. The hidden layers sit in the middle. One (very) simple way for a new analyst to think about a neural network is as a net of logit models, as the sketch below illustrates.
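A minimal sketch of the "net of logit models" idea: a single neuron with a logistic activation takes a weighted sum of its inputs plus a bias and squashes it to a value between 0 and 1, exactly like the right-hand side of a logistic regression. The inputs, weights, and bias below are made-up numbers for illustration.
Code
logistic <- function(z) 1 / (1 + exp(-z))   # the "logistic" activation used by neuralnet

x <- c(0.2, 0.7, 1.5)      # inputs arriving at one neuron (hypothetical values)
w <- c(0.4, -0.3, 0.8)     # weights on those inputs (hypothetical values)
b <- 0.1                   # bias term

neuron_output <- logistic(sum(w * x) + b)   # weighted sum, then squash to (0, 1)
neuron_output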
Though, if we want, we can use other activation functions, and we can even mix and match... it gets complicated.
The important conceptual point to keep in mind is that we input variables and the network outputs predictions. We can check those predictions using the techniques and metrics we have utilized for predictive analysis so far.
Attaching package: 'neuralnet'
The following object is masked from 'package:dplyr':
compute
Code
library(caret)library(generics)
Attaching package: 'generics'
The following object is masked from 'package:keras':
evaluate
The following object is masked from 'package:lubridate':
as.difftime
The following object is masked from 'package:caret':
train
The following object is masked from 'package:dplyr':
explain
The following objects are masked from 'package:base':
as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
setequal, union
Code
library(forecast)library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
The following objects are masked from 'package:psych':
alpha, rescale
The following object is masked from 'package:mosaic':
rescale
The following object is masked from 'package:lessR':
rescale
Code
data <- dummy_cols(data, select_columns = c('Game.Type', 'Platform', 'IGP'), remove_selected_columns = TRUE)
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
data = scale(data, center = mins, scale = maxs - mins)
# Adjust for overfitting
# Use 80% of dataset as training set and remaining 20% as testing set
set.seed(168988)
sample = sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train = as.data.frame(data[sample, ])
test = as.data.frame(data[!sample, ])
model = neuralnet(factor(IGP_Yes) ~ IGN.Rating + cost + Unit.Price + Game.Type_Sports,
                  data = train, hidden = c(4, 2), rep = 1,
                  act.fct = "logistic", linear.output = FALSE)
plot(model, rep = "best")
Code
# test model on unseen data
pdata = as.data.frame(predict(model, newdata = test))
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
table = as.data.frame(cbind(round((pdata$V2)), (test$IGP_Yes)))
table
ME RMSE MAE MPE MAPE
Test set 125.753 798.024 654.017 0.610453 6.71192
Code
sum(res)
[1] -2139
Code
library(tidyverse)
library(keras)
library(neuralnet)
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
data = df
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
data = df
fit = nnetar(ts(df$y), lambda = 0.5)
fit
Series: ts(df$y)
Model: NNAR(1,1)
Call: nnetar(y = ts(df$y), lambda = 0.5)
Average of 20 networks, each of which is
a 1-1-1 network with 4 weights
options were - linear output units
sigma^2 estimated as 278
ME RMSE MAE MPE MAPE MASE ACF1
Training set 69.6006 340.94 296.709 -305.662 347.691 0.772336 0.00712707
Code
############################################### avg
rm(list = ls())
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
data = df
fit = nnetar(ts(df$y), lambda = 0.5, xreg = df$Total.Profit)
fit
Series: ts(df$y)
Model: NNAR(1,1)
Call: nnetar(y = ts(df$y), xreg = df$Total.Profit, lambda = 0.5)
Average of 20 networks, each of which is
a 1-1-1 network with 4 weights
options were - linear output units
sigma^2 estimated as 30.5
ME RMSE MAE MPE MAPE MASE ACF1
Training set 7.62245 125.19 101.2 -4.86941 20.8985 0.679469 0.0206835
Code
###############################################
rm(list = ls())
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
data = df
fit = nnetar(ts(df$y), lambda = 0.5, xreg = df$Total.Profit)
fit
Series: ts(df$y)
Model: NNAR(5,3)
Call: nnetar(y = ts(df$y), xreg = df$Total.Profit, lambda = 0.5)
Average of 20 networks, each of which is
a 5-3-1 network with 22 weights
options were - linear output units
sigma^2 estimated as 49.4
ME RMSE MAE MPE MAPE MASE ACF1
Training set 17.522 294.914 216.112 -1.04745 6.82279 0.269934 0.00272143
6 Next Steps: Further Research and Analysis
Look back at the descriptive analytics section: what other relationships do you think are worth testing, and why? What methods would you use to test those relationships?
This concludes our analysis of the video game dataset.
7 R Programming Resources:
This text is in no way meant to be a complete reference for the R programming language, but rather an introduction to many of the concepts utilized in modern statistical approaches to problem solving. The following resources will prove to be useful if you would like a deeper understanding of R:
Alternative Hypothesis: In hypothesis testing, the null hypothesis and an alternative hypothesis are put forward. If the data are sufficiently strong to reject the null hypothesis, then the null hypothesis is rejected in favor of an alternative hypothesis. For instance, if the null hypothesis were that mu 1 = mu 2 then the alternative hypothesis (for a two-tailed test) would be mu 1 != mu 2 .
Analysis of Variance: Analysis of variance is a method for testing hypotheses about means. It is the most widely-used method of statistical inference for the analysis of experimental data.
Average: The (arithmetic) mean; Any measure of central tendency.
Bar Chart: A graphical method of presenting data. A bar is drawn for each level of a variable. The height of each bar represents the value of the variable. Bar charts are useful for displaying things such as frequency counts and percent increases. They are not recommended for displaying means (despite the widespread practice) since box plots present more information in the same amount of space.
Beta weight: A standardized regression coefficient.
Bias: 1. A sampling method is biased if each element does not have an equal chance of being selected. A sample of internet users found reading an online statistics book would be a biased sample of all internet users. A random sample is unbiased. Note that possible bias refers to the sampling method, not the result. An unbiased method could, by chance, lead to a very non-representative sample.
2. An estimator is biased if it systematically overestimates or underestimates the parameter it is estimating. In other words, it is biased if the mean of the sampling distribution of the statistic is not the parameter it is estimating. The sample mean is an unbiased estimate of the population mean. The mean squared deviation of sample scores from their mean is a biased estimate of the variance since it tends to underestimate the population variance.
Binomial Distribution: A probability distribution for independent events for which there are only two possible outcomes, such as a coin flip. If one of the two outcomes is defined as a success, then the probability of exactly x successes out of N trials (events) is given by: P(x) = [N! / (x!(N - x)!)] * p^x * (1 - p)^(N - x), where p is the probability of success on a single trial.
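In R these probabilities come from dbinom(); for example (an illustration), the probability of exactly 7 heads in 10 flips of a fair coin:
Code
dbinom(7, size = 10, prob = 0.5)          # P(exactly 7 successes in 10 trials)
sum(dbinom(0:10, size = 10, prob = 0.5))  # all possible outcomes together sum to 1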
Bin Width: Also known as the class interval, the bin width is a division of data for use in a histogram. For instance, it is possible to partition scores on a 100 point test into class intervals of 1-25, 26-50, 51-75, and 76-100.
Bivariate: Bivariate data is data for which there are two variables for each observation. That is, two scores per subject.
Bonferroni Correction: In general, to keep the familywise error rate (FER) at or below .05, the per-comparison error rate (PCER) should be: PCER = .05/c where c is the number of comparisons. More generally, to insure that the FER is less than or equal to alpha, use PCER = alpha/c.
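A small R illustration of the correction (the p-values below are made up): either compare each p-value to alpha/c, or equivalently use p.adjust() with method = "bonferroni".
Code
pvals <- c(0.010, 0.030, 0.045)   # hypothetical p-values from c = 3 comparisons
alpha <- 0.05

pvals < alpha / length(pvals)                   # compare to the per-comparison rate .05/3
p.adjust(pvals, method = "bonferroni") < alpha  # equivalent: adjust the p-values instead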
Box Plot: One of the more effective graphical summaries of a data set, the box plot generally shows mean, median, 25th and 75th percentiles, and outliers. A standard box plot is composed of the median, upper hinge, lower hinge, higher adjacent value, lower adjacent value, outside values, and far out values. An example is shown below. Parallel box plots are very useful for comparing distributions.
Central Tendency: There are many measures of the center of a distribution. These are called measures of central tendency. The most common are the mean, median, and mode. Others include the trimean, trimmed mean, and geometric mean.
Class Frequency: One of the components of a histogram, the class frequency is the number of observations in each class interval. See also: relative frequency.
Class Interval: Also known as bin width, the class interval is a division of data for use in a histogram. For instance, it is possible to partition scores on a 100 point test into class intervals of 1-25, 26-50, 51-75, and 76-100.
Conditional Probability: The probability that event A occurs given that event B has already occurred is called the conditional probability of A given B. Symbolically, this is written as P(A|B). The probability it rains on Monday given that it rained on Sunday would be written as P(Rain on Monday | Rain on Sunday).
Confidence Interval: A confidence interval is a range of scores likely to contain the parameter being estimated. Intervals can be constructed to be more or less likely to contain the parameter: 95% of 95% confidence intervals contain the estimated parameter whereas 99% of 99% confidence intervals contain the estimated parameter. The wider the confidence interval, the more uncertainty there is about the value of the parameter.
Confounding: Two or more variables are confounded if their effects cannot be separated because they vary together. For example, if a study on the effect of light inadvertently manipulated heat along with light, then light and heat would be confounded.
Cook’s D: Cook’s D is a measure of the influence of an observation in regression and is proportional to the sum of the squared differences between predictions made with all observations in the analysis and predictions made leaving out the observation in question.
Constant: A value that does not change. Values such as pi or the mass of the Earth are constants.
Continuous Variables: Variables that can take on any value in a certain range. Time and distance are continuous; gender, SAT score and “time rounded to the nearest second” are not. Variables that are not continuous are known as discrete variables. No measured variable is truly continuous; however, discrete variables measured with enough precision can often be considered continuous for practical purposes.
Dependent Variable: A variable that measures the experimental outcome. In most experiments, the effects of the independent variable on the dependent variables are observed. For example, if a study investigated the effectiveness of an experimental treatment for depression, then the measure of depression would be the dependent variable.
Descriptive Statistics: 1. The branch of statistics concerned with describing and summarizing data. 2. A set of statistics such as the mean, standard deviation, and skew that describe a distribution.
Degrees of Freedom: The degrees of freedom of an estimate is the number of independent pieces of information that go into the estimate. In general, the degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated en route to the estimate in question. For example, to estimate the population variance, one must first estimate the population mean. Therefore, if the estimate of variance is based on N observations, there are N-1 degrees of freedom.
Discrete Variables: Variables that can only take on a finite number of values are called “discrete variables.” All qualitative variables are discrete. Some quantitative variables are discrete, such as performance rated as 1,2,3,4, or 5, or temperature rounded to the nearest degree. Sometimes, a variable that takes on enough discrete values can be considered to be continuous for practical purposes. One example is time to the nearest millisecond.
Distribution: The distribution of empirical data is called a frequency distribution and consists of a count of the number of occurrences of each value. If the data are continuous, then a grouped frequency distribution is used. Typically, a distribution is portrayed using a frequency polygon or a histogram. Mathematical equations are often used to define distributions. The normal distribution is, perhaps, the best known example. Many empirical distributions are approximated well by mathematical distributions such as the normal distribution.
Expected Value: The expected value of a statistic is the mean of the sampling distribution of the statistic. It can be loosely thought of as the long-run average value of the statistic.
Factor (Independent Variable): Variables that are manipulated by the experimenter, as opposed to dependent variables. Most experiments consist of observing the effect of the independent variable(s) on the dependent variable(s).
False Positive: A false positive occurs when a diagnostic procedure returns a positive result while the true state of the subject is negative. For example, if a test for strep says the patient has strep when in fact he or she does not, then the error in diagnosis would be called a false positive. In some contexts, a false positive is called a false alarm. The concept is similar to a Type I error in significance testing.
Familywise Error Rate: When a series of significance tests is conducted, the familywise error rate (FER) is the probability that one or more of the significance tests results in a Type I error.
Far Out Value: One of the components of a box plot, far out values are those that are more than 2 steps beyond the nearest hinge. They are beyond an outer fence.
Favorable Outcome: A favorable outcome is the outcome of interest. For example, one could define a favorable outcome in the flip of a coin as a head. The term “favorable outcome” does not necessarily mean that the outcome is desirable; in some experiments, the favorable outcome could be the failure of a test, or the occurrence of an undesirable event.
Frequency Distribution: For a discrete variable, a frequency distribution consists of the distribution of the number of occurrences for each value of the variable. For a continuous variable, it is the number of occurrences for a variety of ranges of variables.
Frequency Table: A table containing the number of occurrences in each class of data; for example, the number of each color of M&Ms in a bag. Frequency tables are often used to create histograms and frequency polygons. When a frequency table is created for a quantitative variable, a grouped frequency table is generally used.
Histogram: A histogram is a graphical representation of a distribution . It partitions the variable on the x-axis into various contiguous class intervals of (usually) equal widths. The heights of the bars represent the class frequencies.
History Effect: A problem of confounding where the passage of time, and not the variable of interest, is responsible for observed effects. See also: third variable problem.
Homogeneity of Variance: The assumption that the variances of all the populations are equal.
Homoscedasticity: In linear regression, the assumption that the variance around the regression line is the same for all values of the predictor variable.
Independent Events: Events A and B are independent events if the probability of Event B occurring is the same whether or not Event A occurs. For example, if you throw two dice, the probability that the second die comes up 1 is independent of whether the first die came up 1. Formally, this can be stated in terms of conditional probabilities: P(A|B) = P(A) and P(B|A) = P(B).
Inferential Statistics: The branch of statistics concerned with drawing conclusions about a population from a sample. This is generally done through random sampling, followed by inferences made about central tendency, or any of a number of other aspects of a distribution.
Influence: Influence refers to the degree to which a single observation in regression influences the estimation of the regression parameters. It is often measured in terms how much the predicted scores for other observations would differ if the observation in question were not included.
Interquartile Range: The Interquartile Range (IQR) is the 75th percentile minus the 25th percentile. It is a robust measure of variability.
Interval Estimate: An interval estimate is a range of scores likely to contain the estimated parameter. see “confidence interval.”
Interval Scale: One of four commonly used levels of measurement, an interval scale is a numerical scale in which intervals have the same meaning throughout. As an example, consider the Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10 degree interval has the same physical meaning (in terms of kinetic energy). Unlike ratio scales, interval scales do not have a true zero point.
Levels of Measurement: Measurement scales differ in their level of measurement. There are four common levels of measurement: 1. Nominal scales are only labels. 2. Ordinal scales are ordered but are not truly quantitative; equal intervals on the ordinal scale do not imply equal intervals on the underlying trait. 3. Interval scales are ordered, and equal intervals on the scale imply equal intervals on the underlying trait; however, interval scales do not have a true zero point. 4. Ratio scales are interval scales that do have a true zero point. With ratio scales, it is sensible to talk about one value being twice as large as another, for example.
Leverage: Leverage is a factor affecting the influence of an observation in regression. Leverage is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The greater an observation’s leverage, the more potential it has to be an influential observation.
Lies: There are three types of lies: 1. regular lies 2. damned lies 3. statistics This is according to Benjamin Disraeli as quoted by Mark Twain.
Line Graph: Essentially a bar graph in which the height of each bar is represented by a single point, with each of these points connected by a line. Line graphs are best used to show change over time, and should not be used if your X-axis is not an ordered variable. An example is shown below.
Linear Combination: A linear combination of variables is a way of creating a new variable by combining other variables. A linear combination is one in which each variable is multiplied by a coefficient and the products are summed. For example, if Y = 3X1 + 2X2 + .5X3 then Y is a linear combination of the variables X1, X2, and X3.
Linear Regression: Linear regression is a method for predicting a criterion variable from one or more predictor variable. In simple regression, the criterion is predicted from a single predictor variable and the best-fitting straight line is of the form Y’ = bX + A where Y’ is the predicted score, X is the predictor variable, b is the slope, and A is the Y intercept. Typically, the criterion for the “best fitting” line is the line for which the sum of the squared errors of prediction is minimized. In multiple regression, the criterion is predicted from two or more predictor variables.
Linear Relationship: There is a perfect linear relationship between two variables if a scatter-plot of the points falls on a straight line. The relationship is linear even if the points diverge from the line as long as the divergence is random rather than being systematic.
Linear Transformation: A linear transformation is any transformation of a variable that can be achieved by multiplying it by a constant, and then adding a second constant. If Y is the transformed value of X, then Y = aX + b. The transformation from degrees Fahrenheit to degrees Centigrade is linear and is done using the formula: C = 0.55556F - 17.7778.
Logarithm: The logarithm of a number is the power the base of the logarithm has to be raised to in order to equal the number. If the base of the logarithm is 10 and the number is 1,000, then the log is 3 since 10 has to be raised to the 3rd power to equal 1,000.
Margin of Error: When a statistic is used to estimate a parameter, it is common to compute a confidence interval. The margin of error is the difference between the statistic and the endpoints of the interval. For example, if the statistic were 0.6 and the confidence interval ranged from 0.4 to 0.8, then the margin of error would be 0.20. Unless otherwise specified, the 95% confidence interval is used.
Mean: Also known as the arithmetic mean, the mean is typically what is meant by the word “average.” The mean is perhaps the most common measure of central tendency. The mean of a variable is given by (the sum of all its values)/(the number of values). For example, the mean of 4, 8, and 9 is 7. The sample mean is written as M, and the population mean as the Greek letter mu ( mu ). Despite its popularity, the mean may not be an appropriate measure of central tendency for skewed distributions, or in situations with outliers. Other than the arithmetic mean, there is the geometric mean and the harmonic mean.
Median:The median is a popular measure of central tendency. It is the 50th percentile of a distribution. To find the median of a number of values, first order them, then find the observation in the middle: the median of 5, 2, 7, 9, and 4 is 5. (Note that if there is an even number of values, one takes the average of the middle two: the median of 4, 6, 8, and 10 is 7.) The median is often more appropriate than the mean in skewed distributions and in situations with outliers.
Mode: The mode is a measure of central tendency. It is the most frequent value in a distribution: the mode of 3, 4, 4, 5, 5, 5, 8 is 5. Note that the mode may be very different from the mean and the median.
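In R, mean() and median() are built in; there is no built-in function for the mode of a data vector (R's mode() returns the storage type), so a small helper is sketched below using the values from the definitions above.
Code
x <- c(3, 4, 4, 5, 5, 5, 8)

mean(x)    # arithmetic mean
median(x)  # 50th percentile

# mode: the most frequent value (simple helper; returns the first value in case of ties)
stat_mode <- function(v) as.numeric(names(which.max(table(v))))
stat_mode(x)   # 5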
Multiple Regression: Multiple regression is linear regression in which two or more predictor variables are used to predict the criterion.
Negative Association: There is a negative association between variables X and Y if smaller values of X are associated with larger values of Y and larger values of X are associated with smaller values of Y.
Nominal Scales: A nominal scale is one of four commonly-used levels of measurement. No ordering is implied, and addition/subtraction and multiplication/division would be inappropriate for a variable on a nominal scale. {Female, Male} and {Buddhist, Christian, Hindu, Muslim} have no natural ordering (except alphabetic). Occasionally, numeric values are nominal: for instance, if a variable were coded as Female = 1, Male =2, the set {1,2} is still nominal.
Non-representative: A non-representative sample is a sample that does not accurately reflect the population.
Normal Distribution: One of the most common continuous distributions, a normal distribution is sometimes referred to as a “bell-shaped distribution.” If mu is the distribution mean and sigma the standard deviation, the probability density is f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * sigma^2)). If the mean is 0 and the standard deviation is 1, the distribution is referred to as the “standard normal distribution.”
Null Hypothesis: A null hypothesis is a hypothesis tested in significance testing. It is typically the hypothesis that a parameter is zero or that a difference between parameters is zero. For example, the null hypothesis might be that the difference between population means is zero. Experimenters typically design experiments to allow the null hypothesis to be rejected.
Omnibus Null Hypothesis: The null hypothesis that all population means are equal.
One Tailed: The last step in significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in the null hypothesis as does the statistic obtained in the experiment. A probability computed considering differences in only one direction, such as the statistic being larger than the parameter, is called a one-tailed probability. For example, if a parameter is 0 and the statistic is 12, a one-tailed probability (the positive tail) would be the probability of a statistic being >= 12. Compare with the two-tailed probability, which would be the probability of being either <= -12 or >= 12.
Ordinal Scales: One of four commonly-used levels of measurement, an ordinal scale is a set of ordered values. However, there is no set distance between scale values. For instance, the scale (Very Poor, Poor, Average, Good, Very Good) is an ordinal scale.
Outer Fence: In a box plot, the lower outer fence is two steps below the lower hinge whereas the upper outer fence is two steps above the upper hinge.
Outlier: Outliers are atypical, infrequent observations; values that have an extreme deviation from the center of the distribution. There is no universally-agreed on criterion for defining an outlier, and outliers should only be discarded with extreme caution. However, one should always assess the effects of outliers on the statistical conclusions.
Outside Values: A component of a box plot, outside values are more than one step beyond the nearest hinge but not more than two steps. They are beyond an inner fence but not beyond an outer fence.
Pairwise Comparisons: Comparisons between pairs of means, such as comparing each pair of group means following an ANOVA.
Parallel Box Plots: Two or more box plots drawn on the same Y-axis. These are often useful in comparing features of distributions. An example portraying the times it took samples of women and men to do a task is shown below.
Parameter: A value calculated in a population. For example, the mean of the numbers in a population is a parameter. Compare with a statistic, which is a value computed in a sample to estimate a parameter.
Partial slope: The partial slope in multiple regression is the slope of the relationship between the part of the predictor variable that is independent of the other predictor variables and criterion. It is also the regression coefficient for the predictor variable in question.
Pearson’s r: Pearson’s correlation is a measure of the strength of the linear relationship between two variables. It ranges from -1 for a perfect negative relationship to +1 for a perfect positive relationship. A correlation of 0 means that there is no linear relationship.
Percentile: The Pth percentile of a set of N ordered scores can be computed by finding the rank R = (P/100) x (N + 1) and then interpolating: 1. Define IR as the integer portion of R (the number to the left of the decimal point). 2. Define FR as the fractional portion of R. 3. Find the scores with Rank IR and with Rank IR + 1. 4. Interpolate by multiplying the difference between the scores by FR and add the result to the lower score.
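A small R sketch of those interpolation steps for a hypothetical set of scores, computing the 30th percentile; in practice the built-in quantile() function performs this calculation for you (type = 6 uses the same (N + 1) rank definition).
Code
scores <- c(3, 5, 7, 8, 9, 11, 13, 15)   # hypothetical scores, already ordered
P <- 30
N <- length(scores)

R  <- (P / 100) * (N + 1)   # rank of the 30th percentile: 2.7
IR <- floor(R)              # integer portion: 2
FR <- R - IR                # fractional portion: 0.7

scores[IR] + FR * (scores[IR + 1] - scores[IR])   # interpolated value: 5 + 0.7 * (7 - 5) = 6.4

quantile(scores, 0.30, type = 6)   # same result from the built-in function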
Per-Comparison Error Rate: The per-comparison error rate refers to the Type I error rate of any one significance test conducted as part of a series of significance tests. Thus, if 10 significance tests were each conducted at 0.05 significance level, then the per-comparison error rate would be 0.05. Compare with the familywise error rate.
Pie Chart: A graphical representation of data, the pie chart shows relative frequencies of classes of data. It is a circle cut into a number of wedges, one for each class, with the area of each wedge proportional to its relative frequency. Pie charts are only effective for a small number of classes, and are one of the less effective graphical representations.
Point Estimate: When a parameter is being estimated, the estimate can be either a single number or it can be a range of numbers such as in a confidence interval. When the estimate is a single number, the estimate is called a “point estimate.”
Polynomial Regression: Polynomial regression is a form of multiple regression in which powers of a predictor variable are used instead of other predictor variables. In the following example, the criterion (Y) is predicted by X, X^2, and X^3: Y = b1*X + b2*X^2 + b3*X^3 + A
Population: A population is the complete set of observations a researcher is interested in. Contrast this with a sample which is a subset of a population. A population can be defined in a manner convenient for a researcher. For example, one could define a population as all girls in fourth grade in Houston, Texas. Or, a different population is the set of all girls in fourth grade in the United States. Inferential statistics are computed from sample data in order to make inferences about the population.
Positive Association: There is a positive association between variables X and Y if smaller values of X are associated with smaller values of Y and larger values of X are associated with larger values of Y.
Power: In significance testing, power is the probability of rejecting a false null hypothesis.
Precision: A statistic’s precision concerns how close it is expected to be to the parameter it is estimating. Precise statistics vary less from sample to sample. The precision of a statistic is usually defined in terms of its standard error.
Predictor: A predictor variable is a variable used in regression to predict another variable. It is sometimes referred to as an independent variable if it is manipulated rather than just measured.
Probability Density: For a discrete random variable, a probability distribution contains the probability of each possible outcome. However, for a continuous random variable, the probability of any one outcome is zero (if you specify it to enough decimal places). A probability density function is a formula that can be used to compute probabilities of a range of outcomes for a continuous random variable. The total area under the density function is always 1.0, and the value of the function is always greater than or equal to zero.
Probability Distribution: For a discrete random variable, a probability distribution contains the probability of each possible outcome. The sum of all probabilities is always 1.0. See binomial distribution for an example.
Probability Value: In significance testing, the probability value (sometimes called the p value) is the probability of obtaining a statistic as different or more different from the parameter specified in the null hypothesis as the statistic obtained in the experiment. The probability value is computed assuming the null hypothesis is true. The lower the probability value, the stronger the evidence that the null hypothesis is false. Traditionally, the null hypothesis is rejected if the probability value is below 0.05.
Qualitative Variable: Also known as categorical variables, qualitative variables are variables with no natural sense of ordering. They are therefore measured on a nominal scale. For instance, hair color (Black, Brown, Gray, Red, Yellow) is a qualitative variable, as is name (Adam, Becky, Christina, Dave . . .). Qualitative variables can be coded to appear numeric but their numbers are meaningless, as in male=1, female=2. Variables that are not qualitative are known as quantitative variables.
Quantitative Variable: Variables that are measured on a numeric or quantitative scale. Ordinal, interval and ratio scales are quantitative. A country’s population, a person’s shoe size, or a car’s speed are all quantitative variables. Variables that are not quantitative are known as qualitative variables.
Quantile-Quantile Plot: A quantile-quantile or q-q plot is an exploratory graphical device used to check the validity of a distributional assumption for a data set. In general, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question. If the data indeed follow the assumed distribution, then the points on the q-q plot will fall approximately on a straight line.
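In R, a normal q-q plot takes two lines; the data below are randomly generated purely for illustration.
Code
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)   # simulated data that really are normal

qqnorm(x)   # theoretical normal quantiles vs. observed quantiles
qqline(x)   # reference line; points near the line support the normality assumption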
Random Sampling: The process of selecting a subset of a population for the purposes of statistical inference. Random sampling means that every member of the population is equally likely to be chosen.
Range: The difference between the maximum and minimum values of a variable or distribution. The range is the simplest measure of variability.
Ratio Scale: One of the four basic levels of measurement, a ratio scale is a numerical scale with a true zero point and in which a given size interval has the same interpretation for the entire scale. Weight is a ratio scale; therefore, it is meaningful to say that a 200 pound person weighs twice as much as a 100 pound person.
Regression: Regression means “prediction.” The regression of Y on X means the prediction of Y by X.
Regression Coefficient: A regression coefficient is the slope of the regression line in simple regression or the partial slope in multiple regression.
Regression Line: In linear regression, the line of best fit is called the regression line.
Relative Frequency: The proportion of observations falling into a given class. For example, if a bag of 55 M & M’s has 11 green M&M’s, then the frequency of green M&M’s is 11 and the relative frequency is 11/55 = 0.20. Relative frequencies are often used in histograms, pie charts, and bar graphs.
Relative Frequency Distribution: A relative frequency distribution is just like a frequency distribution except that it consists of the proportions of occurrences instead of the numbers of occurrences for each value (or range of values) of a variable.
Reliability: Although there are many ways to conceive of the reliability of a test, the classical way is to define the reliability as the correlation between two parallel forms of the test. When defined this way, the reliability is the ratio of true score variance to test score variance. Cronbach’s alpha is a common measure of reliability.
Representative Sample: A representative sample is a sample chosen to match the qualities of the population from which it is drawn. With a large sample size, random sampling will approximate a representative sample; stratified random sampling can be used to make a small sample more representative.
Robust: Something is robust if it holds up well in the face of adversity. A measure of central tendency or variability is considered robust if it is not greatly affected by a few extreme scores. A statistical test is considered robust if it works well in spite of moderate violations of the assumptions on which it is based.
Sample: A sample is a subset of a population, often taken for the purpose of statistical inference. Generally, one uses a random sample.
Sampling Distribution: A sampling distribution can be thought of as a relative frequency distribution with a very large number of samples. More precisely, a relative frequency distribution approaches the sampling distribution as the number of samples approaches infinity. When a variable is discrete, the heights of the distribution are probabilities. When a variable is continuous, the class intervals have no width and the heights of the distribution are probability densities.
Scatter Plot: A scatter plot of two variables shows the values of one variable on the Y axis and the values of the other variable on the X axis. Scatter plots are well suited for revealing the relationship between two variables. The scatter plot shown below illustrates the relationship between grip strength and arm strength in a sample of workers.
Significance Level: In significance testing, the significance level is the highest value of a probability value for which the null hypothesis is rejected. Common significance levels are 0.05 and 0.01. If the 0.05 level is used, then the null hypothesis is rejected if the probability value is less than or equal to 0.05.
Significance Testing: A statistical procedure that tests the viability of the null hypothesis. If data (or more extreme data) are very unlikely given that the null hypothesis is true, then the null hypothesis is rejected. If the data or more extreme data are not unlikely, then the null hypothesis is not rejected. If the null hypothesis is rejected, then the result of the test is said to be significant. A statistically significant effect does not mean the effect is important.
Simple Regression: Simple regression is linear regression in which only one predictor variable is used to predict the criterion.
Skew: A distribution is skewed if one tail extends out further than the other. A distribution has a positive skew (is skewed to the right) if the tail to the right is longer. It has a negative skew (skewed to the left) if the tail to the left is longer.
Slope: The slope of a line is the change in Y for each change of one unit of X. It is sometimes defined as “rise over run” which is the same thing. The slope of the black line in the graph is 0.675 because the line increases by 0.675 each time X increases by 1.0.
Standard Deviation: The standard deviation is a widely used measure of variability. It is computed by taking the square root of the variance. An important attribute of the standard deviation as a measure of variability is that, if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score.
Standard Error of the Estimate:
Standard Error: The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. For example, the standard error of the mean is the standard deviation of the sampling distribution of the mean.
Standard Error of Measurement: The standard error of measurement is the standard deviation of the errors of measurement associated with test scores. It equals the standard deviation of the test scores multiplied by the square root of one minus the reliability of the test.
Standard Error of the Mean: The standard error of the mean is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean in a population is: sigma_M = sigma / sqrt(N), where sigma is the population standard deviation and N is the sample size.
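In R, the standard error of the mean is usually estimated from a sample; a minimal sketch, where x is a placeholder numeric vector:

```r
x <- c(12, 15, 9, 14, 11, 13)       # placeholder sample data
sem <- sd(x) / sqrt(length(x))      # sample estimate of the standard error of the mean
sem
```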
Standard Normal Distribution: The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The transformation from a raw score X to a z score can be done using the following formula: z = (X - mu)/sigma. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
Standardize: A variable is standardized if it has a mean of 0 and a standard deviation of 1. The transformation from a raw score X to a standard score can be done using the following formula: X standardized = (X - mu)/sigma, where mu is the mean and sigma is the standard deviation. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
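Base R provides scale() for this transformation; a minimal sketch with placeholder data:

```r
x <- c(10, 12, 15, 9, 14)           # placeholder data
z <- scale(x)                       # subtracts mean(x) and divides by sd(x)
z
# equivalently: (x - mean(x)) / sd(x)
```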
Statistics: 1. What you are studying right now, also known as statistical analysis, or statistical inference. It is a field of study concerned with summarizing data, interpreting data, and making decisions based on data. 2. A quantity calculated in a sample to estimate a value in a population is called a “statistic.”
Stratified Random Sampling: In stratified random sampling, the population is divided into a number of subgroups (or strata). Random samples are then taken from each subgroup with sample sizes proportional to the size of the subgroup in the population. For instance, if a population contained equal numbers of men and women, and the variable of interest is suspected to vary by gender, one might conduct stratified random sampling to ensure a representative sample.
Sturges’ Rule: One method of determining the number of classes for a histogram, Sturges’ rule is to take 1 + log2(N) classes, rounded to the nearest integer.
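Base R implements this rule as nclass.Sturges(); a minimal sketch with simulated placeholder data:

```r
set.seed(1)
x <- rnorm(50)                       # placeholder data: 50 simulated values
nclass.Sturges(x)                    # suggested number of classes: ceiling(log2(50) + 1) = 7
hist(x, breaks = nclass.Sturges(x))  # histogram using the suggested number of classes
```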
Sum of Squares Error: In linear regression, the sum of squares error is the sum of squared errors of prediction. In analysis of variance, it is the sum of squared deviations from cell means for between-subjects factors and the Subjects x Treatment interaction for within-subject factors.
Symmetric Distribution: In a symmetric distribution, the upper and lower halves of the distribution are mirror images of each other. In a symmetric distribution, the mean is equal to the median.
t distribution: The t distribution is the distribution of a value sampled from a normal distribution divided by an estimate of the distribution’s standard deviation. In practice, the value is typically a statistic such as the mean or the difference between means and the standard deviation is an estimate of the standard error of the statistic.
t test: Most commonly, a significance test of the difference between means based on the t distribution. Other applications include (a) testing the significance of the difference between a sample mean and a hypothesized value of the mean and (b) testing a specific contrast among means.
Third Variable Problem: A type of confounding in which a third variable leads to a mistaken causal relationship between two others. For instance, cities with a greater number of churches have a higher crime rate. However, more churches do not lead to more crime, but instead the third variable, population, leads to both more churches and more crime.
Tukey HSD Test: The “Honestly Significantly Different” (HSD) test developed by the statistician John Tukey to test all pairwise comparisons among means. The test is based on the “studentized range distribution.”
Two Tailed: The last step in significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in the null hypothesis as does the statistic obtained in the experiment. A probability computed considering differences in both directions (statistic either larger or smaller than the parameter) is called a two-tailed probability. For example, if a parameter is 0 and the statistic is 12, a two-tailed probability would be the probability of the statistic being either <= -12 or >= 12. Compare with the one-tailed probability, which would be the probability of the statistic being >= 12 if that were the direction specified in advance.
Type I Error: In significance testing, the error of rejecting a true null hypothesis.
Type II Error: In significance testing, the failure to reject a false null hypothesis.
Unbiased: A sample is said to be unbiased when every individual has an equal chance of being chosen from the population. An estimator is unbiased if it does not systematically overestimate or underestimate the parameter it is estimating. In other words, it is unbiased if the mean of the sampling distribution of the statistic is the parameter it is estimating. The sample mean is an unbiased estimate of the population mean.
Variability: Variability refers to the extent to which values differ from one another. That is, how much they vary. Variability can also be thought of as how spread out a distribution is. The standard deviation and the semi-interquartile range are measures of variability.
Variable: Something that can take on different values. For example, different subjects in an experiment weigh different amounts. Therefore “weight” is a variable in the experiment. Or, subjects may be given different doses of a drug. This would make “dosage” a variable. Variables can be dependent or independent, qualitative or quantitative, and continuous or discrete.
Variance: The variance is a widely used measure of variability. It is defined as the mean squared deviation of scores from the mean.
Y Intercept: The Y-intercept of a line is the value of Y at the point where the line intercepts the Y axis. It is the value of Y when X equals 0. For example, a line with a Y intercept of 0.785 crosses the Y axis at the point (0, 0.785).
z score: The number of standard deviations a score is from the mean of its population. The term “standard score” is usually used for normal populations; the terms “z score” and “normal deviate” should only be used in reference to normal distributions. The transformation from a raw score X to a z score can be done using the following formula: z = (X - mu)/sigma. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
9 Practical Problems:
A finance manager claims that the average profit of the games we create is $6000, with a standard deviation of $1000. Find the probability that a random sample of 36 games averages less than $5700 in profit.
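One way to work this problem in R is with pnorm() and the sampling distribution of the mean; a minimal sketch using only the values given in the problem:

```r
# Probability that the mean profit of 36 games is below $5700,
# given mu = 6000 and sigma = 1000 from the claim.
mu    <- 6000
sigma <- 1000
n     <- 36
se    <- sigma / sqrt(n)          # standard error of the mean
pnorm(5700, mean = mu, sd = se)   # P(sample mean < 5700)
```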
Suppose a senior analyst from the company claims that the average profit from the games made by the company is at least $600,000, and that any game that makes less is a fluke, an outlier. Suppose that you suspect the claim may be exaggerated. Using our sample of 50 games, find the average profit. Test the analyst’s claim against your suspicion at the 5% level of significance.
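One way to approach this in R is a one-sample t-test; a minimal sketch, where profit is a placeholder vector standing in for the Profit column of our 50-game sample:

```r
# Placeholder profits; replace with the actual Profit column from the case-study data.
profit <- c(520000, 610000, 480000, 555000, 700000, 450000, 630000, 590000)
mean(profit)                                        # sample average profit
t.test(profit, mu = 600000, alternative = "less")   # H0: mean >= 600000 vs H1: mean < 600000
```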
The CEO of the business claims that because the average profit from the past 3 Nintendo race car games was less than $200,000, the next Nintendo race car game we make will also have an average profit of less than $200,000. Given this information, find the probability that the profit from a random sample of 3 of these games averages less than $200,000. What does this suggest about the claim made by the CEO?
“In Game Purchases” (IGP) are a growing revenue stream for many video-game companies. One survey showed that up to 20% of players take part in IGP. Based on this information, what is the probability that, in a random sample of 10 gamers, 6 will take part in IGP? Do we need our dataset to answer this question?
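One way to work this in R is with the binomial probability function dbinom(); a minimal sketch using only the values given in the problem:

```r
# P(exactly 6 of 10 randomly sampled gamers take part in IGP), with p = 0.20
# taken from the survey; no dataset is needed for this calculation.
dbinom(6, size = 10, prob = 0.20)
```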
A freshly hired product analyst alleges that “in game purchases” are favored by the rating agency IGN Entertainment Inc. His claim is that this would bias any metrics based on these ratings that we use to measure the performance of our games: “…essentially this would/could have us making games for high ratings and not for high sales or profit (for the customers), for example putting IGP in every game because it will raise our IGN rating.” Is there a significant association between IGP and IGN Rating?
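One way to check for an association between two categorical variables is a Chi2 test of independence; a minimal sketch, where the counts are placeholders standing in for a table of IGP against IGN rating built from our dataset:

```r
# Placeholder contingency table; in practice build it with something like
# table(games$IGP, games$IGN_Rating) using the dataset's actual column names.
tbl <- matrix(c(8, 5, 4,
                6, 12, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(IGP = c("yes", "no"),
                              Rating = c("low", "medium", "high")))
chisq.test(tbl)   # small expected counts may trigger a warning; see the assumptions section
```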
The owner wants to understand the relationship between sales and the amount spent to develop, market, and distribute a game. She suspects the sales of new games can be predicted from the amount of time, money, and effort spent on the game, regardless of game type or console.
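One way to explore this relationship is a regression model; a minimal sketch with placeholder data and placeholder column names (Sales, RnD, Marketing; the case-study dataset may use different names):

```r
# Placeholder data standing in for the case-study games.
games_demo <- data.frame(Sales     = c(120, 95, 150, 80, 200, 60),
                         RnD       = c(30, 20, 45, 15, 60, 10),
                         Marketing = c(25, 18, 30, 12, 55, 8))
fit <- lm(Sales ~ RnD + Marketing, data = games_demo)
summary(fit)   # coefficients, R-squared, and p-values
```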
Suppose the Operations Manager claimed that Research & Development, Marketing, and Administration have roughly the same budget on every project. How can we check this claim?
On average, the company spends the same amount on marketing as it does on Research & Development per game.
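Because each game has both a marketing and an R&D budget, one way to check this claim is a paired t-test; a minimal sketch with placeholder spend values standing in for the dataset’s Marketing and R&D columns:

```r
# Placeholder per-game spend; replace with the actual Marketing and R&D columns.
marketing <- c(25, 18, 30, 12, 55, 8, 22, 40)
rnd       <- c(30, 20, 45, 15, 60, 10, 25, 35)
t.test(marketing, rnd, paired = TRUE)   # H0: the mean per-game difference is 0
```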
Suppose a manager claimed that Marketing and Administration have roughly the same average budget on every project. How can we check this claim?
How could we check the claim that the average sales of MOBA games before 2012 were significantly higher than the sales of MOBA games after 2012?
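One way to check this claim is a two-sample t-test; a minimal sketch, where the sales figures are placeholders standing in for the MOBA games in our dataset split at 2012:

```r
# Placeholder MOBA sales; replace with the case-study Sales values for MOBA games
# released before and after 2012.
sales_before_2012 <- c(210, 180, 250, 190, 230)
sales_after_2012  <- c(160, 175, 150, 140, 170)
t.test(sales_before_2012, sales_after_2012, alternative = "greater")  # H1: before > after
```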
The CFO of the business claims that IGP is a main driver of sales among all games. Given this assumption, how could we check this?
The marketing manager wants to know the average sales of MOBA games after 2012.
The marketing manager wants to estimate the average sales of MOBA games before 2012.
How could we check the claim that average sales of MOBA games before 2012 were significantly lower than sales of RTS games after 2012?
The marketing manager wants to estimate the minimum cost of making a high rated MMO game.
Dataset 2 contains the play time in minutes from a high-earning MOBA game. A new analyst is convinced that time of day affects playtime regardless of country. Test whether country and time of day have an effect on playtime. Is the analyst’s claim justified? Explain.
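One way to test both factors at once is a two-way ANOVA; a minimal sketch, where d2_demo is a placeholder standing in for Dataset 2 and the column names (Playtime, TimeOfDay, Country) are assumptions:

```r
# Placeholder version of Dataset 2.
d2_demo <- data.frame(
  Playtime  = c(620, 700, 540, 810, 760, 590, 480, 650, 720, 830, 510, 600),
  TimeOfDay = rep(c("morning", "evening"), times = 6),
  Country   = rep(c("US", "UK", "JP"), each = 4)
)
fit <- aov(Playtime ~ TimeOfDay * Country, data = d2_demo)
summary(fit)   # main effects of time of day and country, plus their interaction
```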
Suppose the Operations Manager claimed that the median number of minutes played in the US can’t be more than 750 minutes. A sales manager doubts the accuracy of this claim. Can you reject the claim given the data?
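Because the claim is about a median, one way to check it is a one-sample Wilcoxon signed-rank test; a minimal sketch with placeholder US playtimes standing in for the US rows of Dataset 2:

```r
# Placeholder US playtimes in minutes; replace with the actual US observations.
us_playtime <- c(620, 810, 760, 930, 700, 680, 890, 540)
# H0: median <= 750 vs H1: median > 750
wilcox.test(us_playtime, mu = 750, alternative = "greater")
```

A sign test is another option if the symmetry assumption of the Wilcoxon test is doubtful.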
9.1 Statistical Test Assumptions:
Ljung-Box test assumptions: This procedure requires certain assumptions about the data which we will not discuss; see (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
Runs Test for randomness assumptions:
Assumption #1: Independence of observations.
Chi2 Goodness of Fit Test assumptions:
Assumption #1: One categorical variable.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
Chi2 Test of Independence assumptions:
Assumption #1: Two categorical variables.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
t-Test assumptions:
Assumption #1: Two continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: Both variables are approximately normally distributed.
Assumption #4: Both variables have approximately the same variance.
ANOVA assumptions:
Assumption #1: Multiple continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: All variables approximately normally distributed.
Assumption #4: All variables have approximately the same variance.
Assumption #5: There is no multicollinearity among feature variables.
Linear Regression assumptions:
Assumption #1: Multiple feature variables of any type, One continuous target variable.
Assumption #2: The relationship between the features and the target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: All variables approximately normally distributed.
Assumption #5: All variables have approximately the same variance.
Assumption #6: There is no multicollinearity among feature variables.
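A minimal sketch of how some of the assumptions listed above might be checked in R, using placeholder data and placeholder column names (Sales, RnD, Marketing):

```r
# Placeholder data; in practice fit the model to the case-study dataset.
games_demo <- data.frame(Sales     = c(120, 95, 150, 80, 200, 60, 130, 110),
                         RnD       = c(30, 20, 45, 15, 60, 10, 35, 25),
                         Marketing = c(25, 18, 30, 12, 55, 8, 28, 20))
fit <- lm(Sales ~ RnD + Marketing, data = games_demo)

shapiro.test(residuals(fit))              # normality of the residuals
plot(fit, which = 1)                      # residuals vs fitted: roughly constant spread?
cor(games_demo[, c("RnD", "Marketing")])  # large correlations hint at multicollinearity
```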
10 References and Resources
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Comprehensive Guide to Grouping and Aggregating with Pandas—Practical Business Python. (n.d.). Retrieved April 15, 2023, from https://pbpython.com/groupby-agg.html
Hassani, H., & Yeganegi, M. R. (2020). Selecting optimal lag order in Ljung–Box test. Physica A: Statistical Mechanics and Its Applications, 541, 123700. https://doi.org/10.1016/j.physa.2019.123700
Scott, D. M. (2009). Statistics, Inferential. In R. Kitchin & N. Thrift (Eds.), International Encyclopedia of Human Geography (pp. 429–435). Elsevier. https://doi.org/10.1016/B978-008044910-4.00535-6
R. Pruim, D. T. Kaplan and N. J. Horton. The mosaic Package: Helping Students to ‘Think with Data’ Using R (2017). The R Journal, 9(1):77-102.
‘corrplot’:
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). https://github.com/taiyun/corrplot
‘psych’:
William Revelle (2023). _psych: Procedures for Psychological, Psychometric, and Personality Research_. Northwestern University, Evanston, Illinois. R package version 2.3.3, https://CRAN.R-project.org/package=psych.
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
‘forecast’:
Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, Petropoulos F, Razbash S, Wang E, Yasmeen F (2022). _forecast: Forecasting functions for time series and linear models_. R package version 8.18,<https://pkg.robjhyndman.com/forecast/>.
Hyndman RJ, Khandakar Y (2008). “Automatic time series forecasting: the forecast package for R.” _Journal of Statistical Software_, *27*(3), 1-22. https://doi.org/10.18637/jss.v027.i03
My name is Joshua Lizardi. For the past 7 years, I have worked for various institutions teaching a wide range of courses in Math, Statistics, and Technology. These included Quantitative Reasoning, Calculus, Applied Technical Mathematics, Remedial Mathematics, Statistics, Computers & Office Automation, Introductory College Algebra, Intermediate College Algebra, and Business Statistics.
I hold a bachelor’s in mathematics (Mercy College), a master’s in applied mathematics (Purdue University), and a master’s in data analytics (Western Governors University). I also hold a few certifications including “SAS Certified Statistical Business Analyst SAS 9”, “SAS Certified Base Programmer SAS 9”, “Oracle Database SQL Certified Associate”.
Subjects like mathematics, statistics, and computer science should not be taught as if they were spectator sports; the best way to learn these subjects is to perform them. Although understanding textbooks and lecture notes is valuable, the learning that comes from one’s own attempts at solving problems is the key to becoming competent in the subject overall. I have always been passionate about mathematics, statistics, and computer science, and I enjoy encouraging students to see the utility of these subjects.
SPECIALTIES
Applied Mathematics, Applied Statistics, Data Analytics, Data Science, Machine Learning, Artificial Intelligence
SKILLS
R, Python, SQL, SAS, MiniTab, Tableau, Power BI, Microsoft Office