data <- read.csv('Customerprofit.csv')

Bootstrap

I used a simple assignment to load the Customer Profit dataset from the file Customerprofit.csv.

summary(data)
##   Ship_Mode             Profit         Unit_Price     Shipping_Cost  
##  Length:264         Min.   : -1766   Min.   :  2.88   Min.   : 0.50  
##  Class :character   1st Qu.: 48154   1st Qu.:  5.28   1st Qu.:74.35  
##  Mode  :character   Median :123915   Median : 40.42   Median :74.35  
##                     Mean   :125237   Mean   :101.48   Mean   :70.51  
##                     3rd Qu.:199676   3rd Qu.:120.98   3rd Qu.:74.35  
##                     Max.   :275438   Max.   :500.98   Max.   :74.35  
##  Customer_Name     
##  Length:264        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
  1. These summary statistics help identify the top customers along with their profit, shipping cost, and shipping mode. Using bootstrapping, test the hypothesis that the mean profit is 125237. Explain in words how to create a bootstrap sample and how to create a bootstrap distribution for the mean.

Make sure to state the hypothesis, report a confidence interval and a \(p\)-value, and state the conclusion for the mean in proper statistical terms.

Answer:

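Bootstrapping is a statistical technique that generates many simulated samples from a single dataset. A bootstrap sample is created by drawing \(n\) observations with replacement from the original \(n\) observations. Repeating this many times and recording the mean of each resample gives the bootstrap distribution of the mean. This distribution can be used to estimate standard errors, construct confidence intervals, and perform hypothesis tests for a variety of sample statistics. Bootstrap methods are an alternative to traditional hypothesis testing that is often simpler to understand and apply.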
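The resampling steps described above can be sketched in base R as follows. This is a minimal illustration, not the analysis itself: the vector `profit` is a simulated stand-in (an assumption) so that the snippet runs without the CSV file, and it would be replaced by `data$Profit`. The hypotheses are \(H_0: \mu = 125237\) against \(H_1: \mu \neq 125237\). The test shifts the data so that the null hypothesis holds before resampling, which is one common way to obtain a bootstrap \(p\)-value; the 95% interval uses the percentile method.

```r
set.seed(1)

# Hypothetical stand-in for data$Profit (simulated, NOT the real data)
profit <- runif(264, min = -1766, max = 275438)

mu0 <- 125237   # hypothesized mean under H0
B   <- 5000     # number of bootstrap resamples

# Bootstrap distribution of the mean: resample with replacement, record the mean
boot_means <- replicate(B, mean(sample(profit, replace = TRUE)))

# 95% percentile confidence interval for the mean
ci <- quantile(boot_means, c(0.025, 0.975))

# Two-sided bootstrap test: shift the data so that H0 is true, then resample
shifted    <- profit - mean(profit) + mu0
null_means <- replicate(B, mean(sample(shifted, replace = TRUE)))
p_value    <- mean(abs(null_means - mu0) >= abs(mean(profit) - mu0))

ci
p_value
```

If `mu0` falls inside the interval and the \(p\)-value exceeds the chosen significance level (for example 0.05), we fail to reject \(H_0\). Otherwise we reject it in favor of \(H_1\).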

Cross Validation

A multiple linear regression of profit on customer name, shipping mode, and shipping cost.

customer <- lm(data$Profit ~ data$Customer_Name + data$Ship_Mode + data$Shipping_Cost, data = data)
summary(customer)
## 
## Call:
## lm(formula = data$Profit ~ data$Customer_Name + data$Ship_Mode + 
##     data$Shipping_Cost, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -134973  -70155     415   66743  140054 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          -16311.4    37756.7  -0.432    0.666    
## data$Customer_NameBarry French        -9472.5    30684.6  -0.309    0.758    
## data$Customer_NameCarl Ludwig          6636.6    29246.3   0.227    0.821    
## data$Customer_NameCarlos Soltero       6443.0    33769.8   0.191    0.849    
## data$Customer_NameClaudia Miner       -2384.6    33769.9  -0.071    0.944    
## data$Customer_NameClay Rozendal       -3505.4    33769.9  -0.104    0.917    
## data$Customer_NameDon Miller           9612.6    33769.8   0.285    0.776    
## data$Customer_NameEdward Hooks         5745.6    38546.4   0.149    0.882    
## data$Customer_NameEugene Barchas       2241.7    28320.4   0.079    0.937    
## data$Customer_NameJack Garza          10476.7    33769.8   0.310    0.757    
## data$Customer_NameJim Radford          2755.7    29246.0   0.094    0.925    
## data$Customer_NameJulia West           1088.1    38546.4   0.028    0.978    
## data$Customer_NameMuhammed MacIntyre -11735.9    33782.4  -0.347    0.729    
## data$Customer_NameNeola Schneider     -1270.9    33769.9  -0.038    0.970    
## data$Customer_NameSylvia Foulston     -5219.3    30695.5  -0.170    0.865    
## data$Ship_ModeExpress Air               637.1    38504.0   0.017    0.987    
## data$Ship_ModeRegular Air              1657.6    18497.4   0.090    0.929    
## data$Shipping_Cost                     1979.1      329.2   6.013 6.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82720 on 246 degrees of freedom
## Multiple R-squared:  0.1334, Adjusted R-squared:  0.07352 
## F-statistic: 2.228 on 17 and 246 DF,  p-value: 0.00421
  1. Repeat this linear model using 10-fold cross-validation. Explain in words what you are doing. Examine one of the folds carefully, explaining the steps involved. Examine the \(R^2\) value and the residual standard error. Compare the values you get to the original model.

Answer:

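Cross-validation is a statistical method for estimating how well a model predicts data it was not fitted on. In 10-fold cross-validation the rows are split at random into 10 folds of roughly equal size. Each fold in turn is held out as a test set while the model is fitted on the other nine, and the prediction errors from the 10 held-out folds are then averaged.

It is easy to understand and simple to implement, and it generally gives a less optimistic estimate of predictive performance than evaluating the model on the data used to fit it. For these reasons it is commonly used in machine learning to compare and select models for a given predictive modelling problem.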
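The 10-fold procedure can be sketched in base R as follows. This is an illustration under stated assumptions rather than the fitted analysis: the data frame `sim` is simulated with the same column names as the CSV (its values and the customer and shipping-mode labels are hypothetical) so that the snippet runs on its own, and it would be replaced by `data`. The formula uses bare column names instead of `data$...` so that `predict()` can be applied to the held-out rows. The held-out \(R^2\) is computed as the squared correlation between observed and predicted profit, which is one of several possible definitions.

```r
set.seed(42)
n <- 264

# Hypothetical stand-in with the same column names as Customerprofit.csv
sim <- data.frame(
  Customer_Name = factor(rep(paste("Customer", 1:15), length.out = n)),
  Ship_Mode     = factor(sample(c("Delivery Truck", "Express Air", "Regular Air"),
                                n, replace = TRUE)),
  Shipping_Cost = runif(n, 0.5, 74.35)
)
sim$Profit <- 2000 * sim$Shipping_Cost + rnorm(n, sd = 80000)

k    <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold label for every row

cv <- t(sapply(1:k, function(i) {
  train <- sim[fold != i, ]                # nine folds used for fitting
  test  <- sim[fold == i, ]                # one fold held out
  fit   <- lm(Profit ~ Customer_Name + Ship_Mode + Shipping_Cost, data = train)
  pred  <- predict(fit, newdata = test)
  c(rmse = sqrt(mean((test$Profit - pred)^2)),   # held-out root mean squared error
    r2   = cor(test$Profit, pred)^2)             # held-out squared correlation
}))

cv            # one row per fold
colMeans(cv)  # averages across the 10 folds
```

To examine a single fold, inspect one row of `cv`, for example `cv[1, ]`. The averaged RMSE can be compared with the residual standard error of the original model (82720), and the averaged \(R^2\) with its multiple \(R^2\) (0.1334). Cross-validated values are usually somewhat worse because each fold is predicted by a model that never saw it.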