Statistical Data Analysis Project

By: Aubrey Borgesi & Trevor Borchardt

These data come from a sieve analysis experiment in a geotechnical engineering lab. The purpose of the experiment was to collect data on a soil in order to create a grain size distribution curve. This curve is then used to classify the soil that was used in the lab. A few questions that need to be answered in this lab are:

Which sieve retained the greatest amount of soil?

What does the grain size distribution curve look like?

What does the soil classify as according to the look of the curve?

Does size increase with increasing Percent Finer?

What are the predicted values of D60, D30 and D10?

What are the values of Cu and Cd?

What is the classification of the soil based on the values of Cu and Cd?

What are the two-sided 95% confidence intervals for the regression coefficients?

library(knitr)
library(tidyverse) # loads ggplot2, dplyr, tidyr, readr, purrr, tibble

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(broom)  # because I find it useful
options(scipen = 4)

These commands load knitr, the tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble) and broom, and set options(scipen = 4) so that R prefers fixed notation over scientific notation when printing numbers.

library(readxl)
Statsproj <- read_excel("Statsproj.xlsx")
View(Statsproj)

This chunk of code loads the data used for this statistical analysis from an Excel file.

names(Statsproj)=c("Sieve_Number","Size","Mass_Retained","Percent_Retained","Cumulative_Percent_Retained", "Percent_Finer") 
ggplot(Statsproj,aes(x=Sieve_Number,y=Mass_Retained))+
         geom_point() +
   theme_minimal()

From this graph, it can be seen that sieve number 60 and sieve number 10 retained the greatest mass of soil, meaning that most of the soil was left on these two sieves after shaking. This suggests that the soil is mostly made up of particles larger than 0.85 mm and smaller than 4.75 mm.
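The same question can also be answered numerically instead of by reading the plot. The sketch below is only an illustration: the masses are made up (they are not the lab data), the data frame `sieve` and the name `top_two` are hypothetical, and the percent columns are derived with the usual sieve-analysis definitions under the simplifying assumption that no soil reached the pan.

```r
# Hypothetical sieve data for illustration only; these are NOT the lab measurements.
sieve <- data.frame(
  Sieve_Number  = c(4, 10, 20, 40, 60, 100, 200),
  Mass_Retained = c(20, 110, 60, 70, 120, 40, 15)  # grams, made up
)

# Usual sieve-analysis bookkeeping.
# ASSUMPTION: the total ignores any soil collected in the pan.
total_mass <- sum(sieve$Mass_Retained)
sieve$Percent_Retained            <- 100 * sieve$Mass_Retained / total_mass
sieve$Cumulative_Percent_Retained <- cumsum(sieve$Percent_Retained)
sieve$Percent_Finer               <- 100 - sieve$Cumulative_Percent_Retained

# Rank the sieves by mass retained, largest first, and keep the top two.
ranked  <- sieve[order(-sieve$Mass_Retained), ]
top_two <- head(ranked$Sieve_Number, 2)
print(top_two)
```

Sorting the table gives the answer directly and avoids misreading points that sit close together on the scatter plot.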

ggplot(Statsproj,aes(x=Size,y=Percent_Finer))+
         geom_point() +
   theme_minimal()

To classify the soil, a graph is plotted for size vs. percent finer. From this graph, we will create a linear regression line to fit the data:

ggplot(Statsproj,aes(x=Size,y=Percent_Finer))+
   geom_point() +
    geom_smooth(method=lm)+
   theme_minimal()

Soil can sometimes be classified by the appearance of the linear regression curve. If the data is linear, then the soil is well-graded. The data in this graph appears somewhat linear, so at this point the soil could be prematurely classified as well-graded, which means that the soil particles are relatively evenly distributed in size. However, the D60, D30 and D10 values will be used to classify the soil more accurately. Note that the line drawn by geom_smooth regresses Percent_Finer on Size, whereas the model fitted next regresses Size on Percent_Finer so that a size can be predicted from a given percent finer.

mdl5=lm(formula=Size~Percent_Finer,data=Statsproj)
summary(mdl5)

## 
## Call:
## lm(formula = Size ~ Percent_Finer, data = Statsproj)
## 
## Residuals:
##       1       2       3       4       5       6       7 
##  0.8356 -0.5871 -0.8854 -0.1365  0.1838  0.2895  0.3001 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.362313   0.342262  -1.059  0.33823   
## Percent_Finer  0.043234   0.006913   6.254  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6408 on 5 degrees of freedom
## Multiple R-squared:  0.8866, Adjusted R-squared:  0.864 
## F-statistic: 39.11 on 1 and 5 DF,  p-value: 0.001533

According to the coefficients table, the slope for Percent_Finer is statistically significant because its p-value (0.00153) is less than the chosen alpha value of 0.05. The slope is also positive. Together, these two pieces of information mean that size increases with increasing percent finer. From this table, the regression line can be written as Size = -0.362 + 0.043 × Percent_Finer. The summary also reports a residual standard error of 0.6408 on 5 degrees of freedom.

predict(mdl5,data.frame(Percent_Finer=60))

##        1 
## 2.231706

predict(mdl5,data.frame(Percent_Finer=30))

##         1 
## 0.9346964

predict(mdl5,data.frame(Percent_Finer=10))

##          1 
## 0.07002341

These numbers are found with the predict command, which calculates the expected size at a given percent finer value. When percent finer equals 60, the predicted size is 2.23 mm. When percent finer equals 30, the predicted size is 0.935 mm, and when percent finer equals 10, the predicted size is 0.070 mm. These are the D60, D30 and D10 values, which will be used to calculate Cu and Cd from the formulas given in the lab.
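As a check on what predict is doing, the three values can be reproduced by hand from the fitted line. This is a minimal sketch that uses the intercept and slope as printed in the summary output (so they are rounded); the names `b0`, `b1` and `d_values` are chosen here for illustration.

```r
# Coefficients copied from the summary(mdl5) output above (rounded as printed).
b0 <- -0.362313   # intercept
b1 <-  0.043234   # slope for Percent_Finer

percent_finer <- c(D60 = 60, D30 = 30, D10 = 10)

# Size = b0 + b1 * Percent_Finer, evaluated for all three values at once.
d_values <- b0 + b1 * percent_finer
print(round(d_values, 3))
```

Because the printed coefficients are rounded, the hand-computed sizes agree with the predict output only to about four decimal places.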

predict(mdl5,data.frame(Percent_Finer=60),level=.9,interval="confidence")

##        fit      lwr      upr
## 1 2.231706 1.631978 2.831434

This is a 90% confidence interval for the expected size when the percent finer is 60. The expected size is 2.23 mm, with a 90% confidence interval of [1.63, 2.83] mm.
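The interval argument controls what kind of interval predict returns. The sketch below contrasts interval="confidence" (uncertainty in the mean size at a given percent finer) with interval="prediction" (uncertainty for a single new observation). The data frame `demo` is made up for illustration and is not the lab data; `fit`, `ci_mean` and `pi_new` are likewise hypothetical names.

```r
# Hypothetical data for illustration only; NOT the lab measurements.
demo <- data.frame(
  Percent_Finer = c(95, 75, 55, 40, 20, 10, 3),
  Size          = c(4.75, 2.00, 0.85, 0.425, 0.25, 0.15, 0.075)
)
fit <- lm(Size ~ Percent_Finer, data = demo)

# Evaluate the fitted line at the three percent finer values of interest.
new_points <- data.frame(Percent_Finer = c(60, 30, 10))

# 90% interval for the mean size at each point.
ci_mean <- predict(fit, new_points, level = 0.90, interval = "confidence")
# 90% interval for one new measurement at each point.
pi_new  <- predict(fit, new_points, level = 0.90, interval = "prediction")

print(ci_mean)
print(pi_new)
```

Both calls return the same fit column, but the prediction interval is wider because it also includes the scatter of individual observations around the line.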

confint(mdl5)

##                     2.5 %     97.5 %
## (Intercept)   -1.24212483 0.51749866
## Percent_Finer  0.02546206 0.06100524

This shows that the intercept coefficient, B0, has a two-sided 95% confidence interval of [-1.24, 0.517], and that the slope, B1, has a two-sided 95% confidence interval of [0.0255, 0.0610]. In other words, we can be 95% confident that the true intercept and the true slope are contained in their respective intervals. The slope interval does not include zero, which agrees with the significant p-value found earlier.
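These intervals follow the usual form estimate ± t × standard error, and they can be rebuilt from the numbers in the summary output. The sketch below assumes the estimates and standard errors as printed (rounded) and the 5 residual degrees of freedom reported there; the variable names are chosen here for illustration.

```r
# Estimates and standard errors copied from the summary(mdl5) output (as printed).
est <- c(Intercept = -0.362313, Percent_Finer = 0.043234)
se  <- c(Intercept =  0.342262, Percent_Finer = 0.006913)
df_resid <- 5  # residual degrees of freedom reported by summary()

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

Up to rounding of the printed inputs, the result matches the confint output.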

# Rounded predictions from mdl5 (mm)
D60=2.23
D30=0.935
D10=.07

Cu=D60/D10
Cd=(D30^2)/(D60*D10)

From formulas given in the lab, Cu and Cd are calculated from D60, D30 and D10. The Cu value (about 31.9) is greater than 6, but the Cd value is greater than 3 and so is not between 1 and 3; therefore the soil is classified as poorly graded. Poorly graded soil means that particle sizes are not evenly represented across the range found in the sample. This overturns the premature well-graded classification made from the appearance of the curve, and it is the same result that was found in the lab. The Cu and Cd values were slightly different from the calculated lab results. This difference is most likely due to using predictions from a linear regression model rather than calculating the values directly from the lab data.
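The classification rule described above can be written out as a short sketch. It uses the unrounded predictions from the predict calls, so the results may differ slightly from values computed with rounded inputs. The helper `classify_gradation` is hypothetical and simply encodes the rule as stated in this report: well graded only if Cu is greater than 6 and Cd is between 1 and 3.

```r
# Unrounded predictions from the predict() calls (mm).
D60 <- 2.231706
D30 <- 0.9346964
D10 <- 0.07002341

Cu <- D60 / D10             # coefficient of uniformity
Cd <- D30^2 / (D60 * D10)   # coefficient of curvature (written Cd in this report)

# Hypothetical helper encoding the rule stated in the text:
# well graded only if Cu > 6 AND Cd lies between 1 and 3.
classify_gradation <- function(Cu, Cd) {
  if (Cu > 6 && Cd >= 1 && Cd <= 3) "well graded" else "poorly graded"
}

grade <- classify_gradation(Cu, Cd)
print(c(Cu = round(Cu, 2), Cd = round(Cd, 2)))
print(grade)
```

Keeping the rule in one function makes it easy to re-check the classification if the D values are later read from the measured curve instead of the regression.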

The objective of this experiment was to determine, for engineering purposes, the class of the soil used in the sieve analysis test, and to create a grain size distribution chart. The soil was determined to be poorly graded sand. For engineering purposes, the soil was determined to be poor for structural load unless compacted. These results were found by conducting the typical soil identification procedure, sieve analysis, followed by a statistical analysis of the collected data. The experimental results were reasonably accurate and reliable, and aligned with initial observations of the soil and its expected characteristics.