ID: 20228034

This report analyses a hypothetical dataset using a Box Plot, a Scatter Diagram, the Chi-square Test, Correlation, and Regression, presenting the findings and interpretation of each analysis.

Introduction:

This report presents an analysis of various statistical techniques using hypothetical data. The Box Plot, Scatter Diagram, Chi-square Test, Correlation, and Regression analyses were performed to explore patterns, associations, and predictive relationships within the data. The findings and interpretations of each analysis are discussed, providing valuable insights into the hypothetical dataset.

Data Description:

The hypothetical data used for the analyses consists of numerical and categorical variables. The numerical data represents measurements or quantities, while the categorical data classifies observations into distinct categories. These variables were chosen to demonstrate the application of statistical techniques and explore relationships within the dataset.

Box Plot:

# Hypothetical data
data <- c(25, 28, 30, 32, 35, 35, 36, 38, 40, 42, 45)

# Create a box plot
boxplot(data, main = "Box Plot", xlab = "Data", ylab = "Values")

The box plot summarizes the data in a compact form: a box with a line through the middle, with the data roughly balanced on either side of that line. The line marks the median, which is 35. The box itself spans the middle 50% of the data, running from about 31 to about 39, so half of the observations fall between these values. No points are plotted beyond the whiskers, meaning there are no unusual or extreme values (outliers) in this dataset.
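The values read off the plot can be verified with the five-number summary (minimum, lower hinge, median, upper hinge, maximum) that the box plot is drawn from; a minimal check using the same data:

# Five-number summary underlying the box plot
fivenum(data)
## [1] 25 31 35 39 45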

Complexity can arise in the data, for example from multiple outliers or a skewed distribution:
# Modified hypothetical data with skewed distribution
data <- c(25, 28, 30, 32, 35, 35, 36, 38, 40, 42, 45, 55, 80, 85, 90)

# Create a box plot
boxplot(data, main = "Box Plot", xlab = "Data", ylab = "Values")

In this situation, one side of the box plot is stretched out more than the other: the data is not evenly balanced and leans towards the side with the longer tail. In addition, some values are very different from the rest of the data. These values are outliers, and they appear as individual points beyond the end of the whisker. Together, these features indicate a skewed distribution containing extreme values that differ markedly from the bulk of the observations.
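The points flagged as outliers can also be listed programmatically rather than read off the plot; a short sketch, assuming the skewed data above is still stored in data:

# Values beyond the whiskers (the 1.5 * IQR rule used by boxplot)
boxplot.stats(data)$out
## [1] 80 85 90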

Scatter Diagram:

# Hypothetical data
hours_studied <- c(2, 4, 3, 5, 6, 4, 7, 6, 8)
test_scores <- c(60, 70, 65, 75, 80, 70, 85, 80, 90)

# Create a scatter plot
plot(hours_studied, test_scores, main = "Scatter Diagram", xlab = "Hours Studied", ylab = "Test Scores")

# Add a trendline (optional)
abline(lm(test_scores ~ hours_studied))

The scatter diagram shows a positive relationship between hours studied and test scores. As the number of hours studied increases, the corresponding test scores tend to be higher. The trendline further confirms this positive correlation. This suggests that more study time is associated with better academic performance.
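The strength of this relationship can be quantified to accompany the plot; a brief check using the same vectors:

# Pearson correlation between study time and test scores
cor(hours_studied, test_scores)
## [1] 1

The coefficient is exactly 1 here only because the hypothetical scores happen to be an exact linear function of hours studied; real data would give a value below 1.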

# Hypothetical data with complex patterns
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 4, 3, 8, 5, 10, 9, 6)

# Create a scatter plot
plot(x, y, main = "Scatter Diagram with Complex Patterns", xlab = "X", ylab = "Y")

# Add a trendline (optional)
abline(lm(y ~ x))

The scatter diagram shows how the two variables relate to each other, but here the relationship is not simple. The points are scattered in different ways, so there is no single clear pattern: in some places, as one variable increases the other also increases (a positive association), while in others an increase in one variable coincides with a decrease in the other (a negative association). Some points sit close together in small clusters, while others deviate from the main trend and follow no specific pattern. All of this suggests that several different factors influence how the variables are related; the relationship is not straightforward. Complexity in a scatter diagram typically arises from non-linear relationships, clusters or groups, outliers, or relationships that change across the range of the data. In such cases, a straight least-squares trendline can obscure the structure, as shown in the sketch below.
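When the pattern is this irregular, one option is to overlay a locally weighted (LOWESS) smooth, which follows the data without assuming a linear form. A minimal sketch with the same x and y:

# Re-draw the scatter plot and add a LOWESS smooth
plot(x, y, main = "Scatter Diagram with Complex Patterns", xlab = "X", ylab = "Y")
lines(lowess(x, y), lty = 2)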

Chi-square Test:

# Hypothetical data
observed <- matrix(c(30, 10, 20, 25), nrow = 2, byrow = TRUE)

# Perform chi-square test
result <- chisq.test(observed)

# Print the test statistics and p-value
print(result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  observed
## X-squared = 6.9499, df = 1, p-value = 0.008382

The chi-square test was conducted to determine whether there is a significant association between the two categorical variables, and it returns a chi-square statistic and a p-value. Here the p-value is 0.008, well below the conventional significance level of 0.05, so we conclude that there is evidence of a significant association between the categorical variables.
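The expected cell counts behind the statistic can also be inspected; the chi-square approximation is generally considered reliable when all expected counts are at least 5, which holds here. A short sketch reusing the result object from above:

# Expected counts under the null hypothesis of independence
result$expected
##          [,1]     [,2]
## [1,] 23.52941 16.47059
## [2,] 26.47059 18.52941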

Correlation:

# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 2, 1)

# Calculate the correlation coefficient
correlation_coefficient <- cor(x, y)

# Print the correlation coefficient
print(correlation_coefficient)
## [1] -1

The correlation coefficient between variables X and Y is calculated to be -1. This indicates a perfect negative linear relationship: as the values of X increase, the values of Y decrease in exact proportion, and vice versa.
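The same value can be reproduced from the definition of Pearson's r as the covariance scaled by the two standard deviations; a minimal check with the same x and y:

# r = cov(x, y) / (sd(x) * sd(y))
cov(x, y) / (sd(x) * sd(y))
## [1] -1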

# Hypothetical data with complex patterns
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 3, 8, 5)

# Calculate the correlation coefficient
correlation_coefficient <- cor(x, y)

# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.6868028

A correlation coefficient of 0.6868028 indicates a moderate positive correlation between the two variables: they tend to move in the same direction, but not in a perfectly linear manner. As one variable increases, the other tends to increase as well, although the relationship is not especially strong.
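With only five observations, it is worth asking whether this coefficient is statistically distinguishable from zero; cor.test reports that test alongside the estimate. A brief sketch with the same data:

# Test H0: the true correlation is zero
cor.test(x, y)

For these five points the resulting p-value is roughly 0.2, so despite the sizeable coefficient, the apparent correlation could plausibly be due to chance in such a small sample.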

Regression:

# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Create a linear regression model
model <- lm(y ~ x)

# Print the regression summary
summary(model)
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##          1          2          3          4          5 
## -9.656e-16  1.670e-15 -4.196e-16 -3.089e-16  2.391e-17 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 2.383e-15  1.210e-15 1.969e+00    0.144    
## x           2.000e+00  3.649e-16 5.482e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.154e-15 on 3 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.005e+31 on 1 and 3 DF,  p-value: < 2.2e-16

The linear regression analysis indicates a strong relationship between the variables X and Y. The intercept is estimated to be essentially zero, so when X is zero the predicted value of Y is effectively zero. The coefficient for X is 2, meaning that each unit increase in X increases the predicted value of Y by 2. The p-value for this coefficient (< 2.2e-16) is far below 0.05, so the relationship is statistically significant. The R-squared value of 1 indicates that 100% of the variation in Y is explained by the variation in X, and the residuals are vanishingly small: the model fits the data perfectly. This is why R prints the warning about an "essentially perfect fit"; the hypothetical data was constructed as an exact linear relationship, which real data would rarely exhibit.
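Consistent with this, the fitted equation can be read directly off the model object; a minimal sketch:

# Extract the estimated intercept and slope
coef(model)
# The intercept is ~2.4e-15 (zero up to floating-point error) and the
# slope is 2, i.e. the fitted equation is y_hat = 2 * x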

# Testing 
# Predict the y-values for new x-values
new_x <- c(6, 7, 8, 2)
predicted_y <- predict(model, data.frame(x = new_x))

# Print the predicted y-values
print(predicted_y)
##  1  2  3  4 
## 12 14 16  4
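These predictions can be reproduced by hand from the estimated coefficients, which is a useful sanity check on predict(); a short sketch:

# y_hat = b0 + b1 * x, using the coefficients estimated above
b <- coef(model)
unname(b[1] + b[2] * new_x)
# gives 12 14 16 4 (up to floating-point error), matching predict()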

Conclusion:

The analysis of the hypothetical data revealed several insights. For the box plot, the first example used simple data with no outliers or other complexity, while the second identified outliers that could affect data interpretation. The scatter diagram indicated a positive correlation between study time and test scores, and a second example showed a more complex relationship between the x and y variables. The chi-square test demonstrated a significant association between categorical variables. The correlation analysis showed a perfect negative correlation between two variables in the first example and a moderate positive correlation in the second. The regression analysis indicated a significant, positive relationship between the variables. These findings emphasize the importance of study time for academic success and suggest potential dependencies among categorical variables. Further research could explore additional variables and conduct multivariate analysis. Overall, these analyses provide valuable insights for understanding the data and guiding future investigations.