MATH-1324-Applied data project - Assignment 2

Introduction to Statistics Assignment 2

GANESH PATHAN, STUDENT ID : S3847177

Last updated: 16 October, 2022

Introduction

Introduction Continued

Problem Statement

Data

The data set used in this project is open data published on ScienceDirect under a Creative Commons license.

The dataset is sourced from https://www.kaggle.com/datasets/mandysia/obesity-dataset-cleaned-and-data-sinthetic

Sampling Method Used:

Data Continued :

Data Characteristics:

Data Pre-processing :

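# Load the packages used in this report (assumed to be installed)
library(readr)   # read_csv()
library(dplyr)   # %>%, mutate_at(), group_by(), summarise(), filter()
library(car)     # qqPlot(), leveneTest()

# Read the data file from the working directory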

ObesityData <- read_csv('ObesityData.csv')
dim(ObesityData)
## [1] 2111   18
# Check class of the Variable Obesity_Type

class(ObesityData$Obesity_Type)
## [1] "character"
# Convert Obesity_Type from character to factor, then check its class again

ObesityData <- mutate_at(ObesityData, vars(Obesity_Type), as.factor)
class(ObesityData$Obesity_Type)
## [1] "factor"
# Check the class of the variable Height

class(ObesityData$Height)
## [1] "numeric"

Data Pre-processing Continued :

# Check the class of Variable Weight

class(ObesityData$Weight)
## [1] "numeric"
# Check the class of Variable BMI

class(ObesityData$BMI)
## [1] "numeric"
# Convert Obesity_Type to an ordered factor (levels taken in order of first appearance)

ObesityData$Obesity_Type <- factor(ObesityData$Obesity_Type, levels = unique(ObesityData$Obesity_Type), ordered = TRUE)
class(ObesityData$Obesity_Type)
## [1] "ordered" "factor"
# Check the distinct values of the variable Obesity_Type

levels(ObesityData$Obesity_Type)
## [1] "NormalWeight" "OBESE"
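Because `levels = unique(...)` takes the levels in order of first appearance, the resulting order depends on how the rows happen to be sorted. The sketch below uses a hypothetical toy vector (not values from the dataset) to show the difference between that approach and naming the levels explicitly.

```r
# Toy vector (hypothetical values, not taken from the dataset).
x <- c("OBESE", "NormalWeight", "OBESE", "NormalWeight")

# Levels follow first appearance, so "OBESE" comes first here.
by_appearance <- factor(x, levels = unique(x), ordered = TRUE)

# Levels are stated explicitly, so the order does not depend on row order.
explicit <- factor(x, levels = c("NormalWeight", "OBESE"), ordered = TRUE)

levels(by_appearance)
levels(explicit)
```

Naming the levels explicitly is one way to keep "NormalWeight" below "OBESE" regardless of how the file is sorted.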

Descriptive Statistics and Visualisation :

# Count the missing (NA) values across the whole data frame

sum(is.na(ObesityData))
## [1] 0
# The result shows that there are no missing values anywhere in the dataset.
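`sum(is.na(...))` returns a single total for the whole data frame. If a per-variable breakdown is wanted, `colSums(is.na(...))` is one option. The sketch below uses a hypothetical toy data frame rather than the obesity data.

```r
# Toy data frame (hypothetical values) with two missing entries.
toy <- data.frame(
  Weight = c(60, NA, 85),
  BMI    = c(21.5, 30.1, NA)
)

sum(is.na(toy))      # total number of missing cells
colSums(is.na(toy))  # missing cells per column
```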

Descriptive Statistics and Visualisation - Continued :

boxplot(BMI ~ Obesity_Type, data = ObesityData, main = "Box plot of BMI by Obesity_Type", ylab = "BMI", xlab = "Obesity_Type")

## The boxplot shows no major visual outliers in the BMI values of either Obesity_Type, so the analysis proceeds without further data cleaning.

Descriptive Statistics and Visualisation - Continued :

ObesityData %>% group_by(`Obesity_Type`) %>% summarise(Min = min(Weight,na.rm = TRUE),
                                         Q1 = quantile(Weight,probs = .25,na.rm = TRUE),
                                         Median = median(Weight, na.rm = TRUE),
                                         Q3 = quantile(Weight,probs = .75,na.rm = TRUE),
                                         Max = max(Weight,na.rm = TRUE),
                                         Mean = mean(Weight, na.rm = TRUE),
                                         SD = sd(Weight, na.rm = TRUE),
                                         n = n(),
                                         missing = sum(is.na(Weight))) -> table1  # count NAs within each group
knitr::kable(table1)
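|Obesity_Type | Min| Q1| Median|  Q3| Max|     Mean|        SD|    n| missing|
|:------------|---:|--:|------:|---:|---:|--------:|---------:|----:|-------:|
|NormalWeight |  39| 50|     55|  62|  87| 56.20930|  9.968814|  559|       0|
|OBESE        |  53| 80|     96| 112| 173| 97.53093| 21.091164| 1552|       0|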

Descriptive Statistics and Visualisation - Continued :

ObesityData %>% group_by(`Obesity_Type`) %>% summarise(Min = min(BMI,na.rm = TRUE),
                                         Q1 = quantile(BMI,probs = .25,na.rm = TRUE),
                                         Median = median(BMI, na.rm = TRUE),
                                         Q3 = quantile(BMI,probs = .75,na.rm = TRUE),
                                         Max = max(BMI,na.rm = TRUE),
                                         Mean = mean(BMI, na.rm = TRUE),
                                         SD = sd(BMI, na.rm = TRUE),
                                         n = n(),
                                         missing = sum(is.na(BMI))) -> table1  # count NAs within each group
knitr::kable(table1)
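|Obesity_Type |      Min|       Q1|   Median|       Q3|      Max|     Mean|       SD|    n| missing|
|:------------|--------:|--------:|--------:|--------:|--------:|--------:|--------:|----:|-------:|
|NormalWeight | 12.99868| 17.57812| 18.73049| 22.22222| 24.91349| 19.77105| 2.712523|  559|       0|
|OBESE        | 22.82674| 27.81278| 32.33759| 37.80136| 50.81175| 33.27643| 6.027950| 1552|       0|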
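The boxplot reading can be cross-checked against the BMI quartiles reported above using the usual 1.5 × IQR rule. This is only a sketch: the numbers are copied from the summary table, and `boxplot()` computes its hinges slightly differently from `quantile()`, so the fences may not match the plot exactly.

```r
# BMI quartiles and extremes copied from the summary table above.
bmi <- data.frame(
  group = c("NormalWeight", "OBESE"),
  min   = c(12.99868, 22.82674),
  q1    = c(17.57812, 27.81278),
  q3    = c(22.22222, 37.80136),
  max   = c(24.91349, 50.81175)
)

iqr <- bmi$q3 - bmi$q1
lower_fence <- bmi$q1 - 1.5 * iqr
upper_fence <- bmi$q3 + 1.5 * iqr

# TRUE when at least one observation falls outside the fences.
has_outliers <- bmi$min < lower_fence | bmi$max > upper_fence
data.frame(group = bmi$group, lower_fence, upper_fence, has_outliers)
```

If both groups come back with `has_outliers` equal to `FALSE`, the minimum and maximum of each group lie inside the fences, which agrees with the visual impression from the boxplot.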

Hypothesis Testing :

The two-sample t-test compares the hypotheses H0: μ1 = μ2 and HA: μ1 ≠ μ2, where μ1 and μ2 are the population mean BMI of normal-weight people and obese people respectively. H0 is the null hypothesis and HA is the alternative hypothesis.

Hypothesis Testing - Continued - QQ Plot (BMI of Normal Weight People) :

# Keep only the rows for normal-weight people
BMI_NormalWeight <- ObesityData %>% filter(Obesity_Type == "NormalWeight")

BMI_NormalWeight$BMI %>% qqPlot(dist="norm")

## [1] 442 186
## In the graph, the data points do not lie close to the diagonal reference line, so the BMI of normal-weight people does not appear to be normally distributed.

Hypothesis Testing - Continued - QQ Plot (BMI of OBESE People) :

# Keep only the rows for obese people
BMI_OBESE <- ObesityData %>% filter(Obesity_Type == "OBESE")

BMI_OBESE$BMI %>% qqPlot(dist="norm")

## [1] 1256 1340
## In the graph, the data points do not lie close to the diagonal reference line, so the BMI of OBESE people does not appear to be normally distributed.
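Both QQ plots suggest non-normal BMI values, but the groups are large (n = 559 and n = 1552). With samples of this size, a t-test is commonly justified by the central limit theorem, because the sample means are approximately normal even when the raw data are not. The simulation below is a minimal sketch of that idea. It uses a right-skewed exponential distribution as a stand-in, not the BMI data, and the seed and number of repetitions are arbitrary choices.

```r
set.seed(1324)  # arbitrary seed, chosen only for reproducibility

n    <- 559     # size of the smaller group in this report
reps <- 2000    # number of simulated samples (an arbitrary choice)

# Draw from a right-skewed exponential distribution (a stand-in, not BMI data)
# and keep the mean of each sample.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Theory: the means are centred on 1 with standard error 1 / sqrt(n).
c(mean = mean(sample_means), sd = sd(sample_means), theory_se = 1 / sqrt(n))

# A QQ plot of the means should lie close to the reference line.
qqnorm(sample_means)
qqline(sample_means)
```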

Hypothesis Testing - Continued - Homogeneity of Variance:

# Levene's test for equal variances of BMI across the two groups
knitr::kable(round(leveneTest(BMI ~ Obesity_Type, data = ObesityData), 3))
|      |   Df| F value| Pr(>F)|
|:-----|----:|-------:|------:|
|group |    1|  335.99|      0|
|      | 2109|      NA|     NA|
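Levene's test reports F = 335.99 with a p-value that rounds to 0, which points to unequal variances between the two groups. In that situation Welch's t-test, which is what `t.test()` runs by default with `var.equal = FALSE`, is usually preferred over the pooled-variance version. The sketch below computes the Welch statistic and its Welch-Satterthwaite degrees of freedom from the group summaries in the BMI table. It is an illustration based on the rounded values reported there, not output from the original analysis.

```r
# Group summaries copied from the BMI table in this report.
m1 <- 19.77105; s1 <- 2.712523; n1 <- 559    # NormalWeight
m2 <- 33.27643; s2 <- 6.027950; n2 <- 1552   # OBESE

# Per-group variance of the sample mean.
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_welch  <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value.
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)
c(t = t_welch, df = df_welch, p = p_welch)
```

On the full dataset the equivalent call would be `t.test(BMI ~ Obesity_Type, data = ObesityData, var.equal = FALSE)`.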

Hypothesis Testing - Continued :

t.test(BMI ~ Obesity_Type,
       data = ObesityData,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  BMI by Obesity_Type
## t = -51.134, df = 2109, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group NormalWeight and group OBESE is not equal to 0
## 95 percent confidence interval:
##  -14.02334 -12.98742
## sample estimates:
## mean in group NormalWeight        mean in group OBESE 
##                   19.77105                   33.27643
# Critical t value (lower tail) for a two-sided test at the 5% significance level
qt(p = .025, df = 559 + 1552 - 2)
## [1] -1.961089
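The t statistic and confidence interval printed by `t.test()` can be reproduced by hand from the group summaries in the BMI table, which makes the comparison with the critical value explicit. The sketch below assumes the pooled-variance (equal variance) form used in the call above and works from the rounded values in the table, so small differences in the last decimal places are possible.

```r
# Group summaries copied from the BMI table in this report.
m1 <- 19.77105; s1 <- 2.712523; n1 <- 559    # NormalWeight
m2 <- 33.27643; s2 <- 6.027950; n2 <- 1552   # OBESE

df <- n1 + n2 - 2

# Pooled standard deviation and standard error of the difference in means.
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)
se_diff  <- s_pooled * sqrt(1 / n1 + 1 / n2)

t_stat <- (m1 - m2) / se_diff
t_crit <- qt(p = 0.975, df = df)

# 95% confidence interval for the difference in means.
ci <- (m1 - m2) + c(-1, 1) * t_crit * se_diff

c(t = t_stat, t_crit = t_crit, ci_lower = ci[1], ci_upper = ci[2])
abs(t_stat) > t_crit   # TRUE means H0 is rejected at the 5% level
```

Since the absolute value of the t statistic is far larger than the critical value of about 1.96, and the confidence interval excludes 0, the test rejects H0 at the 5% significance level.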

Discussion:

From the descriptive statistics,

References :