library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n (number of observations) and p (number of predictors).
You will now think of some real-life applications for statistical learning.
This exercise relates to the College data set, which can be found in the file college.csv in canvas. It contains a number of variables for 777 different universities and colleges in the US. The variables are:
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college <- read.csv("College.csv")
college
Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
View(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:
college <- college[, -1]
View(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
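As a possible shortcut for the two steps above, read.csv() accepts a row.names argument that names the column to use as row names and drops it from the data in one call. The sketch below is only an illustration: csv_text and college_demo are hypothetical stand-ins with made-up values, not the real College.csv.

```r
# Hypothetical three-row stand-in for College.csv (made-up values)
csv_text <- "Name,Private,Apps
College A,Yes,1000
College B,No,2000
College C,Yes,1500"

# row.names = 1 uses the first column as row names and removes it from the data
college_demo <- read.csv(text = csv_text, row.names = 1)

rownames(college_demo)  # the college names
names(college_demo)     # the remaining data columns
```

With the real file, the equivalent call would presumably be read.csv("College.csv", row.names = 1), assuming the names sit in the first column as described above.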
Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
# Columns 2:11 are used instead of 1:10 because pairs() would not accept the non-numeric Private column
tencolcollege <- college[, 2:11]
pairs(tencolcollege)
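The workaround above skips Private because read.csv() loaded it as character. Two alternatives are sketched here on a hypothetical miniature data frame (college_demo, with made-up values): keep only the numeric columns, or convert Private to a factor, which pairs() can plot. The pdf(NULL) line is an assumption about the environment so the sketch runs without a display; it can be dropped in an interactive session.

```r
# Hypothetical miniature of the college data (made-up values)
college_demo <- data.frame(
  Private = c("Yes", "No", "Yes", "No"),
  Apps    = c(1000, 2000, 1500, 400),
  Accept  = c(800, 1700, 1100, 350),
  Enroll  = c(300, 500, 350, 120)
)

# Option 1: keep only the numeric columns
numeric_cols <- sapply(college_demo, is.numeric)
numeric_only <- college_demo[, numeric_cols]

# Option 2: convert the character column to a factor so pairs() accepts it
college_demo$Private <- as.factor(college_demo$Private)

pdf(NULL)  # null graphics device, so no plot window or file is needed
pairs(college_demo[, 1:4])
dev.off()
```

Converting Private to a factor would also make summary() report counts of Yes and No instead of "Length / Class / Mode".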
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
boxplot(Outstate ~ Private, data = college, col = c("lightyellow", "lightblue"), main = "Outstate VS Private", xlab = "Private", ylab = "Outstate Tuition")
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college , Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate Elite
## Min. : 3186 Min. : 10.00 No :699
## 1st Qu.: 6751 1st Qu.: 53.00 Yes: 78
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
boxplot(Outstate ~ Elite, data = college, col = c("lightyellow", "lightblue"), main = "Outstate VS Elite", xlab = "Elite", ylab = "Outstate Tuition")
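To read off only the number of elite universities, without the full summary table, summary() or table() can be applied to the factor itself. This is a minimal sketch on a hypothetical vector of Top10perc values (top10perc and elite are illustrative names, and the values are made up).

```r
# Hypothetical Top10perc values (made up for illustration)
top10perc <- c(23, 16, 60, 89, 38, 55)

# Same binning rule as above: more than 50% from the top 10% counts as elite
elite <- factor(ifelse(top10perc > 50, "Yes", "No"))

table(elite)    # counts per level
summary(elite)  # same counts for a factor
```

On the real data, summary(college$Elite) should report the same 699 / 78 split shown in the Elite column of the output above.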
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2, 2))
hist(college$Outstate, main = "Outstate", xlab = "Tuition", col = "coral", breaks = 15)
hist(college$Enroll, main = "Enroll", xlab = "Number of Students", col = "coral1", breaks = 20)
hist(college$Accept, main = "Accept", xlab = "Applicants Accepted", col = "coral2", breaks = 25)
hist(college$Room.Board, main = "Room and Board", xlab = "Room and Board Cost", col = "coral3", breaks = 30)
hist(college$F.Undergrad, col = rgb(1,0,0, alpha = 0.5), breaks = 30, xlim = c(0,10000), main = "Full time VS Part time Students", xlab = "Number of Students", ylab = "Number of Colleges")
hist(college$P.Undergrad, col = rgb(0,0,1, alpha = 0.5), breaks = 30, add = TRUE)
axis(side = 1, at = seq(0, 10000, by = 1000))
legend("topright", legend = c("Part-Time", "Full-Time"), fill = c("blue", "red"))
lm_10percvgrad <- lm(college$Grad.Rate ~ college$Top10perc)
plot(college$Top10perc, college$Grad.Rate)
abline(lm_10percvgrad, col = "red", lwd = 5)
summary(lm_10percvgrad)
##
## Call:
## lm(formula = college$Grad.Rate ~ college$Top10perc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.410 -9.834 0.288 9.080 61.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.17990 0.99431 52.48 <2e-16 ***
## college$Top10perc 0.48201 0.03039 15.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.94 on 775 degrees of freedom
## Multiple R-squared: 0.245, Adjusted R-squared: 0.244
## F-statistic: 251.5 on 1 and 775 DF, p-value: < 2.2e-16
plot(lm_10percvgrad)
This exercise involves the Auto data set in canvas. Make sure that the missing values have been removed from the data. This can be done using the na.omit() function which removes missing values from data.
auto <- read.csv('Auto.csv')
auto
# I noticed that na.omit(auto) alone does nothing because there are no NA values; the missing values are stored as question marks ("?") in the horsepower variable, so I recode those to NA so that na.omit() works.
# I also had to convert horsepower to integer because it was read in as character even though its values are numbers.
auto$horsepower[auto$horsepower == "?"] <- NA
newauto<- na.omit(auto)
newauto$horsepower <- as.integer(newauto$horsepower)
newauto
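An alternative to recoding "?" by hand is to tell read.csv() which strings mean "missing" through its na.strings argument. The sketch below uses a hypothetical three-row stand-in (csv_text and auto_demo, with made-up values) rather than the real Auto.csv.

```r
# Hypothetical stand-in for Auto.csv with one "?" in horsepower (made-up values)
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
21,95,car c"

# na.strings = "?" turns the question marks into NA while reading,
# so horsepower is parsed as a number instead of character
auto_demo <- read.csv(text = csv_text, na.strings = "?")
auto_demo <- na.omit(auto_demo)

str(auto_demo$horsepower)
nrow(auto_demo)
```

With this approach the separate as.integer() conversion is not needed, because the column never becomes character in the first place.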
Use the range() function to find the range of each quantitative predictor.
print('The ranges of the quantitative predictors are:')
## [1] "The ranges of the quantitative predictors are:"
cat(
"MPG:", range(newauto$mpg),
"\nCylinders:", range(newauto$cylinders),
"\nDisplacement:", range(newauto$displacement),
"\nHorsepower:", range(newauto$horsepower),
"\nWeight:", range(newauto$weight),
"\nAcceleration:", range(newauto$acceleration),
"\nYear:", range(newauto$year),
"\nOrigin:", range(newauto$origin)
)
## MPG: 9 46.6
## Cylinders: 3 8
## Displacement: 68 455
## Horsepower: 46 230
## Weight: 1613 5140
## Acceleration: 8 24.8
## Year: 70 82
## Origin: 1 3
print("here are the means: ")
## [1] "here are the means: "
cat(
"MPG:", mean(newauto$mpg),
"\nCylinders:", mean(newauto$cylinders),
"\nDisplacement:", mean(newauto$displacement),
"\nHorsepower:", mean(newauto$horsepower),
"\nWeight:", mean(newauto$weight),
"\nAcceleration:", mean(newauto$acceleration),
"\nYear:", mean(newauto$year),
"\nOrigin:", mean(newauto$origin)
)
## MPG: 23.44592
## Cylinders: 5.471939
## Displacement: 194.412
## Horsepower: 104.4694
## Weight: 2977.584
## Acceleration: 15.54133
## Year: 75.97959
## Origin: 1.576531
print("here are the SD's: ")
## [1] "here are the SD's: "
cat(
"MPG:", sd(newauto$mpg),
"\nCylinders:", sd(newauto$cylinders),
"\nDisplacement:", sd(newauto$displacement),
"\nHorsepower:", sd(newauto$horsepower),
"\nWeight:", sd(newauto$weight),
"\nAcceleration:", sd(newauto$acceleration),
"\nYear:", sd(newauto$year),
"\nOrigin:", sd(newauto$origin)
)
## MPG: 7.805007
## Cylinders: 1.705783
## Displacement: 104.644
## Horsepower: 38.49116
## Weight: 849.4026
## Acceleration: 2.758864
## Year: 3.683737
## Origin: 0.8055182
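The three cat() blocks above repeat the same call for every column. A more compact way to get the same kind of summary is to apply each function over the columns with sapply(). The sketch uses the built-in mtcars data as a stand-in so it runs without Auto.csv; quant_vars is an illustrative choice of columns, not the list used above.

```r
# Built-in mtcars used as a stand-in for the Auto data
quant_vars <- c("mpg", "disp", "hp", "wt", "qsec")

# Each call applies one function to every selected column
ranges <- sapply(mtcars[, quant_vars], range)  # 2 x 5 matrix: min and max
means  <- sapply(mtcars[, quant_vars], mean)
sds    <- sapply(mtcars[, quant_vars], sd)

ranges
round(rbind(mean = means, sd = sds), 3)
```

For the Auto data, the same pattern should work with newauto and its numeric column names in place of mtcars and quant_vars.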
print('The ranges of the quantitative predictors are:')
## [1] "The ranges of the quantitative predictors are:"
cat(
"MPG:", range(newauto[-c(10:85),]$mpg),
"\nCylinders:", range(newauto[-c(10:85),]$cylinders),
"\nDisplacement:", range(newauto[-c(10:85),]$displacement),
"\nHorsepower:", range(newauto[-c(10:85),]$horsepower),
"\nWeight:", range(newauto[-c(10:85),]$weight),
"\nAcceleration:", range(newauto[-c(10:85),]$acceleration),
"\nYear:", range(newauto[-c(10:85),]$year),
"\nOrigin:", range(newauto[-c(10:85),]$origin)
)
## MPG: 11 46.6
## Cylinders: 3 8
## Displacement: 68 455
## Horsepower: 46 230
## Weight: 1649 4997
## Acceleration: 8.5 24.8
## Year: 70 82
## Origin: 1 3
print("here are the means: ")
## [1] "here are the means: "
cat(
"MPG:", mean(newauto[-c(10:85),]$mpg),
"\nCylinders:", mean(newauto[-c(10:85),]$cylinders),
"\nDisplacement:", mean(newauto[-c(10:85),]$displacement),
"\nHorsepower:", mean(newauto[-c(10:85),]$horsepower),
"\nWeight:", mean(newauto[-c(10:85),]$weight),
"\nAcceleration:", mean(newauto[-c(10:85),]$acceleration),
"\nYear:", mean(newauto[-c(10:85),]$year),
"\nOrigin:", mean(newauto[-c(10:85),]$origin)
)
## MPG: 24.40443
## Cylinders: 5.373418
## Displacement: 187.2405
## Horsepower: 100.7215
## Weight: 2935.972
## Acceleration: 15.7269
## Year: 77.14557
## Origin: 1.601266
print("here are the SD's: ")
## [1] "here are the SD's: "
cat(
"MPG:", sd(newauto[-c(10:85),]$mpg),
"\nCylinders:", sd(newauto[-c(10:85),]$cylinders),
"\nDisplacement:", sd(newauto[-c(10:85),]$displacement),
"\nHorsepower:", sd(newauto[-c(10:85),]$horsepower),
"\nWeight:", sd(newauto[-c(10:85),]$weight),
"\nAcceleration:", sd(newauto[-c(10:85),]$acceleration),
"\nYear:", sd(newauto[-c(10:85),]$year),
"\nOrigin:", sd(newauto[-c(10:85),]$origin)
)
## MPG: 7.867283
## Cylinders: 1.654179
## Displacement: 99.67837
## Horsepower: 35.70885
## Weight: 811.3002
## Acceleration: 2.693721
## Year: 3.106217
## Origin: 0.81991
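The calls above rebuild newauto[-c(10:85),] for every statistic. One option is to store the reduced data once and summarise that object. The sketch below shows the idea on the built-in mtcars data, dropping rows 10 through 20 because mtcars has only 32 rows; subset_cars and the chosen columns are illustrative assumptions.

```r
# Negative indices drop rows; store the reduced data once
subset_cars <- mtcars[-(10:20), ]
nrow(subset_cars)

# One pass per column returns min, max, mean, and sd together
stats <- sapply(
  subset_cars[, c("mpg", "hp", "wt")],
  function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
)
round(stats, 3)
```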
plot(newauto$weight, newauto$mpg, main = "weight vs mpg")
plot(newauto$weight, newauto$acceleration, main = "weight vs acceleration")
ggplot(newauto, aes(x = factor(origin), y = mpg)) +
geom_boxplot() +
stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "purple") +
labs(title = "mpg vs origin", x = "origin", y = "mpg")
ggplot(newauto, aes(x = factor(origin), y = horsepower)) +
geom_boxplot() +
stat_summary(fun = mean, geom = "point", shape = 22, size = 3, color = "red") +
labs(title = "horsepower vs origin", x = "origin", y = "horsepower")
This question involves the use of simple and multiple linear regression on the Auto data set.
Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
lm_slrauto <- lm(mpg ~ horsepower, data = newauto)
summary(lm_slrauto)
##
## Call:
## lm(formula = mpg ~ horsepower, data = newauto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
For example: i. Is there a relationship between the predictor and the response? ii. How strong is the relationship between the predictor and the response? iii. Is the relationship between the predictor and the response positive or negative? iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
new_data <- data.frame(horsepower = 98)
predictions <- predict(lm_slrauto, newdata = new_data, interval = "confidence", level = 0.95)
print(predictions)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
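The output above covers only the confidence interval; the question also asks for a prediction interval. predict() returns one when interval = "prediction" is passed. The sketch below uses the built-in mtcars data (mpg on hp) as a stand-in so it runs without Auto.csv; fit, conf_int, and pred_int are illustrative names.

```r
# Stand-in model on built-in data: mpg regressed on hp
fit <- lm(mpg ~ hp, data = mtcars)
new_data <- data.frame(hp = 98)

# Confidence interval: uncertainty about the mean response at hp = 98
conf_int <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation at hp = 98
pred_int <- predict(fit, newdata = new_data, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both intervals share the same fitted value, but the prediction interval is wider because it also accounts for the residual error of an individual car. For the Auto model, the analogous call would be predict(lm_slrauto, newdata = new_data, interval = "prediction", level = 0.95).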
Use the abline() function to display the least squares regression line.
plot(newauto$horsepower, newauto$mpg, main = "mpg vs horsepower")
abline(lm_slrauto, col = "lightslateblue", lwd = 5)
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
plot(lm_slrauto)
pairs(newauto[,1:8])
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(newauto[, 1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
mlrauto <- lm(mpg ~ ., data = newauto[, 1:8])
summary(mlrauto)
##
## Call:
## lm(formula = mpg ~ ., data = newauto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
interactionwa <- newauto$weight*newauto$acceleration
mlrauto1 <- lm(mpg ~ year + origin + displacement + interactionwa, data = newauto[,1:8])
summary(mlrauto1)
##
## Call:
## lm(formula = mpg ~ year + origin + displacement + interactionwa,
## data = newauto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3686 -1.8012 -0.1028 1.8168 14.2685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.322e+01 4.318e+00 -5.378 1.31e-07 ***
## year 7.787e-01 5.411e-02 14.390 < 2e-16 ***
## origin 8.808e-01 2.932e-01 3.004 0.00284 **
## displacement -3.566e-02 2.587e-03 -13.783 < 2e-16 ***
## interactionwa -1.535e-04 1.899e-05 -8.085 8.06e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.609 on 387 degrees of freedom
## Multiple R-squared: 0.7884, Adjusted R-squared: 0.7862
## F-statistic: 360.4 on 4 and 387 DF, p-value: < 2.2e-16
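The model above enters only the product of weight and acceleration, built as a separate vector outside the data frame. R's formula interface can express interactions directly: a * b expands to both main effects plus the interaction, a:b is the interaction alone, and I(a * b) is the plain arithmetic product. The sketch below illustrates the difference on the built-in mtcars data (wt and qsec standing in for weight and acceleration); the model names are hypothetical.

```r
# Product term only, comparable in spirit to the interactionwa approach
fit_product <- lm(mpg ~ disp + I(wt * qsec), data = mtcars)

# Main effects plus interaction: wt + qsec + wt:qsec
fit_full <- lm(mpg ~ disp + wt * qsec, data = mtcars)

names(coef(fit_product))
names(coef(fit_full))
```

Keeping the main effects alongside the interaction is the usual convention (the hierarchy principle), so the second form is generally the easier one to interpret.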
loggedweight <- log(newauto$weight)
sqacceleration <- sqrt(newauto$acceleration)
mlrauto2 <- lm(mpg ~ year + origin + displacement + loggedweight + sqacceleration, data = newauto[,1:8])
summary(mlrauto2)
##
## Call:
## lm(formula = mpg ~ year + origin + displacement + loggedweight +
## sqacceleration, data = newauto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2112 -1.9875 0.0934 1.7245 12.7750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 133.277481 11.074192 12.035 < 2e-16 ***
## year 0.797422 0.046457 17.165 < 2e-16 ***
## origin 0.918441 0.252387 3.639 0.000311 ***
## displacement 0.012590 0.004648 2.709 0.007051 **
## loggedweight -22.505206 1.509505 -14.909 < 2e-16 ***
## sqacceleration 1.224461 0.579046 2.115 0.035103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099 on 386 degrees of freedom
## Multiple R-squared: 0.8444, Adjusted R-squared: 0.8423
## F-statistic: 418.8 on 5 and 386 DF, p-value: < 2.2e-16
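Transformations such as log() and sqrt() can also be written inside the formula instead of being stored as separate vectors. One practical benefit is that predict() then applies the transformation to new data automatically. This is a minimal sketch on the built-in mtcars data, with wt and qsec standing in for weight and acceleration; fit_trans and the new-data values are illustrative.

```r
# Transformations written directly in the formula
fit_trans <- lm(mpg ~ log(wt) + sqrt(qsec), data = mtcars)

names(coef(fit_trans))

# New data is given on the original scale; the formula handles the transforms
pred <- predict(fit_trans, newdata = data.frame(wt = 3, qsec = 18))
pred
```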
Explore the mtcars dataset in R.
Load the dataset into the environment, use the functions you’ve learned so far to look into the data and run a linear model of your choice.
Then share details about your model including: the parameters, the variable chosen, the coefficients for each variable, and the \(R^2\).
mtcars
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
pairs(mtcars)
lmcars <- lm(mpg ~ wt, data = mtcars)
summary(lmcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
plot(mtcars$wt, mtcars$mpg, main = "mpg vs wt")
abline(lmcars, col = "mistyrose", lwd = 5)
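To report the coefficients and the \(R^2\) without copying them from the printed summary, they can be pulled out of the fitted model programmatically. The sketch refits the same mpg ~ wt model so that it is self-contained; coefs, r2, and adj_r2 are illustrative names.

```r
# Same model as above, refit so the sketch is self-contained
lmcars <- lm(mpg ~ wt, data = mtcars)

coefs  <- coef(lmcars)                   # intercept and slope
r2     <- summary(lmcars)$r.squared      # multiple R-squared
adj_r2 <- summary(lmcars)$adj.r.squared  # adjusted R-squared

round(coefs, 4)
round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

These values should agree with the Estimate column and the R-squared line of the summary(lmcars) output shown earlier.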
Comment: This is a classification problem because we want to classify whether the product will be a success or a failure. We are most interested in prediction because we want to know whether this product would succeed if launched, based on data from other products. The number of observations is n = 20 and the number of predictors is p = 13.