There are numerous real-life examples that exemplify the practical application of data analysis. What are the most common platforms or systems implementing statistical data analysis?
More well-known examples can be listed below:
Let’s have a look at data analysis stages!
Data summary and visualization are essential in the initial steps of data analysis. They offer a quick overview, helping analysts understand data patterns and detect outliers. Instead of looking at numbers and words, we can see charts, graphs, and pictures that show us what the information means. This is helpful in many areas. For example, when we analyze data, visualizations let us see patterns and trends. In business, charts help people quickly understand important numbers. Even in science, pictures help researchers explain their discoveries. So, no matter what we are working on, visualization makes information clearer and helps us make better decisions.
Let’s continue by understanding and visualizing the dataset.
At the beginning, we need to load “gpa” data set to R using the following command:
## Zorunlu paket yükleniyor: airports
## Zorunlu paket yükleniyor: cherryblossom
## Zorunlu paket yükleniyor: usdata
Data Info: A survey of 55 Duke University students asked about their GPA, number of hours they study at night, number of nights they go out, and their gender.
A data frame with 55 observations on the following 5 variables.
To see the first 10 observations of the data, “head” command is run.
## # A tibble: 10 × 5
## gpa studyweek sleepnight out gender
## <dbl> <int> <dbl> <dbl> <fct>
## 1 3.89 50 6 3 female
## 2 3.9 15 6 1 female
## 3 3.75 15 7 1 female
## 4 3.6 10 6 4 male
## 5 4 25 7 3 female
## 6 3.15 20 7 3 male
## 7 3.25 15 6 1 female
## 8 3.92 10 8 3 female
## 9 3.43 12 8 2 female
## 10 3.8 2 8 4 male
REMINDER: Different statistical methods are appropriate for different types of data. Moreover, different statistical analyses require different representations of the same data.
R programming language has several data objects:
Determine the class of the object gpa:
## [1] "tbl_df" "tbl" "data.frame"
The data “gpa” has a structure of tibble. We may prefer to convert it to the most common object, which is a data frame.
## [1] "data.frame"
Moreover, we can use “dim()” function to check the dimension of the data set. The data has 55 rows and 5 columns. In order words, the data includes 55 observations and 5 variables.
## [1] 55 5
We prefer to extend the data set to show more visualization techniques. For example, student’s state with four levels and taught having elective course during the current semester with three levels are added. By doing this, we set the random number generator (“set.seed()”) and use “sample()” function.
For the variable called “state”, assume that there are 4 states that students may come from, which are State A, State B, State C and State D.
Additionally, for the variable called “elective”, assume that,
1: Yes (Certainly, the student will take the elective course)
2: No (Certainly, the student will not take elective course)
3: Maybe( The student may take the elective course but it hasn’t decided
yet)
set.seed(1)
gpa[,c("state","elective")]=c(sample(c(1,2,3,4), 55, replace=TRUE),sample(c(1,2,3), 55, replace=TRUE))
head(gpa,10)## gpa studyweek sleepnight out gender state elective
## 1 3.890 50 6 3 female 1 3
## 2 3.900 15 6 1 female 4 2
## 3 3.750 15 7 1 female 3 2
## 4 3.600 10 6 4 male 1 3
## 5 4.000 25 7 3 female 2 3
## 6 3.150 20 7 3 male 1 2
## 7 3.250 15 6 1 female 3 2
## 8 3.925 10 8 3 female 3 2
## 9 3.428 12 8 2 female 2 2
## 10 3.800 2 8 4 male 2 1
Let’s get more insights about the data. To provide the type of any variables, you can prefer to use “is.(factor/character/numeric/integer etc.)()” and “class()”.
## [1] TRUE
## [1] TRUE
## [1] "numeric"
## [1] "numeric"
If any variable is not matched with your expectations, you can change the data type by using the command “as.(factor/character/numeric/integer etc.)()”.
The command “str()” quickly provides information about the data types of all variables, offering a more efficient and straightforward method to understand variable types.
## 'data.frame': 55 obs. of 7 variables:
## $ gpa : num 3.89 3.9 3.75 3.6 4 ...
## $ studyweek : int 50 15 15 10 25 20 15 10 12 2 ...
## $ sleepnight: num 6 6 7 6 7 7 6 8 8 8 ...
## $ out : num 3 1 1 4 3 3 1 3 2 4 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 1 2 1 1 1 2 ...
## $ state : Factor w/ 4 levels "1","2","3","4": 1 4 3 1 2 1 3 3 2 2 ...
## $ elective : Factor w/ 3 levels "1","2","3": 3 2 2 3 3 2 2 2 2 1 ...
The easiest way to provide summary statistics is to use “summary()” function. The function “summary()” gives us descriptive stats for all the variables. If you would like to see descriptive stats only for specific variables, you can use “sapply()” command.
## gpa studyweek sleepnight out gender
## Min. :2.900 Min. : 2.00 Min. :5.000 Min. :0.000 female:43
## 1st Qu.:3.400 1st Qu.:10.00 1st Qu.:6.000 1st Qu.:1.250 male :12
## Median :3.650 Median :15.00 Median :7.000 Median :2.000
## Mean :3.600 Mean :19.15 Mean :7.064 Mean :2.109
## 3rd Qu.:3.825 3rd Qu.:26.50 3rd Qu.:8.000 3rd Qu.:3.000
## Max. :4.670 Max. :50.00 Max. :9.000 Max. :4.000
## state elective
## 1:18 1:15
## 2:17 2:21
## 3:11 3:19
## 4: 9
##
##
## gpa studyweek sleepnight
## 3.600073 19.145455 7.063636
## gpa studyweek sleepnight out
## 0.1126396 153.4228956 1.0653199 1.0063973
## gpa studyweek
## 4.67 50.00
## gpa studyweek out
## 0% 2.900 2.0 0.00
## 25% 3.400 10.0 1.25
## 50% 3.650 15.0 2.00
## 75% 3.825 26.5 3.00
## 100% 4.670 50.0 4.00
It seems that there is a problematic observation in gpa, since the scale of gpa is assumed to be between 0 and 4. The max point of gpa in our data set is 4.67. Firstly, let’s check whether it is the only one or there are additional such observations.
## gpa studyweek sleepnight out gender state elective
## 16 4.67 14 6.5 3 male 2 3
4.67 is the only outlier for gpa. For future steps, it is good to remind it.
Let’s start with the simple visualization techniques. Caution: Do not underestimate the power of the basic visualization!
Research Question (RQ): How do gender frequencies vary across different states?
To have more insights about the frequencies of the categorical levels, we can use the command “table()”.
##
## 1 2 3 4
## female 15 12 10 6
## male 3 5 1 3
par(bg = "#F8F8FF")
barplot(gen_st_tab,main="Frequencies of Gender by States", xlab="States",
ylab="Frequency", beside=TRUE,col=c("darkblue","darkorange"),legend = c("Female","Male"))According to the barplot, the number of female students in each state is generally significantly higher than the number of male students. This indicates a diversity in gender distribution among different states. For instance, in the state 1, the number of female students is notably higher compared to other states. However, as we move to the state 4, this difference seems to decrease.
RQ: How does the frequency distribution of values across different states change across the four categories (1, 2, 3, 4)?
##
## 1 2 3 4
## 18 17 11 9
library(RColorBrewer)
mypalette<-brewer.pal(4,"Set3")
par(bg = "#F8F8FF")
barplot(st_tab,main="Frequency Distribution of The States", xlab="States",
ylab="Frequency", col=mypalette, names.arg=c("A", "B", "C","D"),cex.names=0.8
,legend = c("State A","State B","State C","State D"))The bar plot indicates how frequently each category occurs in various states. It’s clear that state A is the most common, whereas State D is the least common. This visual representation allows for a quick comparison of category occurrences in different states.
RQ: How do elective choices vary across different states? RQ1: Which elective options (yes, no, maybe) are more common in specific states? RQ2: Does the distribution differ significantly among states?
##
## 1 2 3
## 1 5 8 5
## 2 6 4 7
## 3 1 6 4
## 4 3 3 3
mypalette3<-brewer.pal(4,"Set3")
par(bg = "#F8F8FF")
barplot(st_el_tab,main="States and Elective Course", xlab="Taughts",names.arg=c("Yes", "No", "Maybe"), ylab="Frequency",col=mypalette3)
par(bg = "#F8F8FF")
barplot(st_el_tab,main="States and Elective Course", names.arg=c("Yes", "No", "Maybe"), xlab="Taughts", ylab="Frequency",col=mypalette3, beside=TRUE,legend.text = c("State A","State B","State C","State D"),args.legend = list(x = "topright",inset = c(- 0.05, -0.08),cex=0.50))The barplot represents the frequencies of four states across different elective categories. Looking at each bar, we observe how the distribution varies for each states across the elective categories. For instance, in State A, elective category “no” has the highest frequency, followed by State B and elective category “maybe”. In addition to that, in State D, the number of students considering an elective course is the same for each category.
RQ: What is the distribution of sleeping hours in the observed population, and how can we interpret the patterns or variations revealed by the histogram and density plot?
par(mfrow=c(2,2))
par(bg = "#F8F8FF")
sleep_hist1=hist(gpa$sleepnight,main="Histogram of Sleeping Hours",xlab="Sleeping Hours",col="darkmagenta") #without breaks
sleep_hist1$breaks## [1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0
sleep_hist2=hist(gpa$sleepnight,main="Histogram of Sleeping Hours",xlab="Sleeping Hours",col="darkmagenta",breaks=1)
sleep_hist2$breaks## [1] 5 10
sleep_hist3=hist(gpa$sleepnight,main="Histogram of Sleeping Hours",xlab="Sleeping Hours",col="darkmagenta",breaks=2) #breaks=Scott or breaks="FD": suggest the optimum breaks
sleep_hist3$breaks## [1] 4 6 8 10
sleep_hist4=hist(gpa$sleepnight,main="Histogram of Sleeping Hours",xlab="Sleeping Hours",col="darkmagenta",breaks="Scott") #breaks=Scott or breaks="FD": suggest the optimum breaks## [1] 16 18 16 5
## [1] 5 6 7 8 9
## [1] 5.5 6.5 7.5 8.5
## [1] 0.29090909 0.32727273 0.29090909 0.09090909
The histogram visually represents the distribution of sleeping habits, divided into specific intervals. The highest frequency is observed in the 6-7 hours interval, which means that a significant portion of the students favors a moderate sleep duration. The lower frequency in the 8-9 hours interval suggests that this duration is less commonly preferred within the students. This infrequency implies that a smaller portion of the students leans towards a longer sleep duration, perhaps due to personal preferences or specific lifestyle factors.
den_sleep=density(gpa$sleepnight) #bandwidth: a smoothing parameter: large bandwidths produce very smooth estimates, small values produce wiggly estimates (like perfect fit).
den_sleep2=density(gpa$sleepnight,bw=0.1)
den_sleep3=density(gpa$sleepnight,bw=0.5)
den_sleep4=density(gpa$sleepnight,bw=0.7)
par(bg = "#F8F8FF")
par(mfrow=c(2,2))
plot(den_sleep,main="Density Plot of Sleeping Hours",type="l",col="red",lwd=3)
plot(den_sleep2,main="Density Plot of Sleeping Hours",type="l",col="red",lwd=3)
plot(den_sleep3,main="Density Plot of Sleeping Hours",type="l",col="red",lwd=3)
plot(den_sleep4,main="Density Plot of Sleeping Hours",type="l",col="red",lwd=3)par(bg = "#F8F8FF")
hist(gpa$sleepnight,main="Histogram of the Number of Hours Sleeping",xlab="Sleeping Hours",col="darkmagenta",prob=TRUE,breaks="Scott")
lines(den_sleep,col="red",type="l",lwd=3)## Package 'sm', version 2.2-6.0: type help(sm) for summary information
par(bg = "#F8F8FF")
sm.density.compare(gpa$sleepnight, gpa$gender,lwd=3,xlab="Sleeping hours")
title(main="Density Plot of Sleeping Hours by Gender")
legend("topleft", levels(gpa$gender), col = 2:4, lty = 1:2,lwd=2) RQ1: How does the distribution of GPA vary among the observed population? RQ2: What insights can a box-plot provide regarding the central tendency, spread, and potential outliers in GPA scores?
par(bg = "#F8F8FF")
box_p=boxplot(gpa$gpa, main="Box plot of GPAs", xlab="GPAs",col="burlywood1",horizontal=TRUE)## $stats
## [,1]
## [1,] 2.900
## [2,] 3.400
## [3,] 3.650
## [4,] 3.825
## [5,] 4.000
##
## $n
## [1] 55
##
## $conf
## [,1]
## [1,] 3.559455
## [2,] 3.740545
##
## $out
## [1] 4.67
##
## $group
## [1] 1
##
## $names
## [1] ""
It seems that the median GPA score within the students is approximately 3.7 by looking at the plot. The box’s width shows how GPA scores are spread out. A wider box means there’s a mix of different GPA scores among students, with some noticeable differences. The points outside the ‘whiskers’ potentially represent outliers. A GPA score like 4.67 appears significantly deviated from the overall distribution, indicating that this particular student has a notably higher or lower GPA compared to the majority.
par(bg = "#F8F8FF")
box_p=boxplot(gpa[-16,]$gpa, main="Box plot of GPAs", xlab="GPAs",col="burlywood1",horizontal=TRUE)## $stats
## [,1]
## [1,] 2.900
## [2,] 3.400
## [3,] 3.625
## [4,] 3.825
## [5,] 4.000
##
## $n
## [1] 54
##
## $conf
## [,1]
## [1,] 3.53362
## [2,] 3.71638
##
## $out
## numeric(0)
##
## $group
## numeric(0)
##
## $names
## [1] ""
par(mfrow=c(1,3))
par(bg = "#F8F8FF")
boxplot(gpa$gpa, main="Box plot of GPAs", ylab="GPAs",col="burlywood1",horizontal=FALSE)
boxplot(gpa$studyweek, main="Box plot of Study Hours", ylab="Study Hours",col="lightcyan",horizontal=FALSE)
boxplot(gpa$sleepnight, main="Box plot of Sleep Night", ylab="Sleep Night",col="thistle1",horizontal=FALSE)mypalette1=brewer.pal(4,"Spectral")
par(bg = "#F8F8FF")
boxplot(gpa$gpa~gpa$state, main="Box plot of GPAs by States", xlab="GPAs",col=mypalette1,horizontal=TRUE,ylab=" ", names=c("State A","State B","State C","State D"))
legend("topright", legend = c("State A","State B","State C","State D") ,
col=mypalette1 , pch=20 , pt.cex = 2, cex = 1, horiz = FALSE)State A: The median is approximately around 3.65. GPA scores seem to have a wide range as indicated by the broad box, and there is an outlier. The outlier suggests a specific student in this state has significantly different GPA scores compared to the majority.
State B: The median is approximately around 3.5. An outlier which is 4.67 is observed.
State C: The median is approximately around 3.7. No outliers are observed, indicating a more consistent distribution of student scores. GPA scores are generally concentrated in a higher range, as indicated by the higher box.
State D: The median is approximately around 3.7 like State C. No outliers are observed, suggesting overall consistent performance among students.
RQ1: What relationship does GPA (Grade Point Average) have with study hours? RQ2: How does the distribution of study hours impact academic performance?
par(bg = "#F8F8FF")
plot(studyweek~gpa , data=gpa,
xlab="GPAs", ylab="Number of Hours Studying",
main="Scatter Plot of Studying Hours vs GPAs",pch=5) #pch: typeIt seems that there is no obvious correlation between these two variables gpa and study hours.
REMINDER:
The scatter-plot without the outlier 4.67 for gpa:
par(bg = "#F8F8FF")
plot( studyweek~gpa , data=gpa[-16,],
xlab="GPAs", ylab="Number of Hours Studying",
main="Scatter Plot of Studying Hours vs GPAs", pch=4, cex=1.5,lwd=4) #cex=size;lwd= thicknesspar(bg = "#F8F8FF")
spineplot(gpa$state,gpa$gender,col = c("#07798D", "#7BCB9F"),xlab="States", ylab="Gender",main="Spine-plot for states and gender")## Zorunlu paket yükleniyor: grid
par(bg = "#F8F8FF")
mosaic( ~ state+gender+elective, data = gpa,highlighting = "gender", highlighting_fill = c("#07798D", "#7BCB9F"),direction = c("v","h","v"), main="Mosaic-plot for state, gender and elective") Comment examples:
REMINDER:
Difference between spine-chart and mosaic-plot: Mosaic-plot is the multidimensional extension of spine-chart. Spine-plots are used for only two variables.
RQ: Is there a statistically significant correlation among GPA, study hours, sleeping hours, and time spent going out?
## corrplot 0.92 loaded
corr_num = cor(gpa[,c(1,2,3,4)])
corrplot(corr_num)
corrplot(corr_num, method = 'number',col=brewer.pal(n=10, name="Set3"),tl.srt=45) #tl.srt: rotate 45'degree
corrplot(corr_num, method = 'color', type="lower") #type: lower triangleComment on the shape of variable X.
library(FamilyRank)
set.seed(1)
x=rbinorm(100, 2, 20, 2, 2, prop=0.5)
par(bg = "#F8F8FF")
par(mfrow=c(1,2))
plot(density(x),main="Density plot of the variable X",type="l",col="darkred",lwd=3)
hist(x,main="Histogram of variable X", col="darkred",breaks="FD")Then, draw the box-plot of variable X and comment on it.
What insight do you get from the histogram and the density-plot, but not from the box-plot?
Comment on the graph below. Is this a nice one or not? Is it enough for the visualization? What inferences can be drawn from the graph? What insights can be gained regarding it?
What are the main/effective elements of graphs?
More (if appropriate):
But before that, make sure that you select the right visualization techniques!
Compare the graphs given below. Both represents the same data. Which one is more effective and feasible?
par(mfrow=c(1,2))
par(bg = "#F8F8FF")
barplot(gen_st_tab)
barplot(gen_st_tab,main="Frequencies of Gender by States", xlab="States",
ylab="Frequency", beside=TRUE,col=c("darkblue","darkorange"),names.arg=c("A", "B", "C","D"),cex.names=0.8,legend = c("Female","Male"))Info about color in R:
We may draw more effective and useful plots by using various colors. In addition to default colors, R has several color palettes. One of the most common ones is “RColorBrewer”. The default color palette and RColorBrewer palette are shown below for effective and nice graphs:
References:
https://www.latestquality.com/interpreting-a-scatter-plot/
Cohen, Y., & Cohen, J. Y. (2008). Statistics and Data with R: An applied approach through examples. John Wiley & Sons.
Mount, J., & Zumel, N. (2019). Practical data science with R. Simon and Schuster.
Yozgatlıgil,C. (2024). Stat 412-Lecture Notes.