library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Problem One:

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n (number of observations) and p (number of predictors).

  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Comment: This is a regression problem because the response variable Y (CEO salary) is quantitative. The goal is inference: we want to understand which factors affect CEO salary, not to predict what the salary will be. There are n = 500 observations and p = 3 predictors (profit, number of employees, and industry).

  1. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Comment: This is a classification problem because the response is categorical: the product is either a success or a failure. The goal is prediction, because we want to know whether the new product will succeed if launched, based on data from similar products. There are n = 20 observations and p = 13 predictors (price, marketing budget, competition price, and ten other variables).

  1. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Comment: This is a regression problem because the response (the % change in the USD/Euro exchange rate) is quantitative. The goal is prediction, since the problem states that we want to predict the % change from the changes in the world stock markets. There are n = 52 observations, one for each week of 2012, and p = 3 predictors (the % changes in the US, British, and German markets).

Problem Two:

You will now think of some real-life applications for statistical learning.

  1. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Comment:

1) A classic classification example is detecting whether an email is spam. The response is whether the email is flagged as spam or not spam. The goal is prediction, because we want to know which category new emails fall into. Predictors could include the formatting of the email and the presence of certain keywords.

2) Another classification example is determining which variables are associated with customer reviews being negative, positive, or neutral. The response is the review category: negative, positive, or neutral. This is an inference problem, because we want to know which predictors help distinguish reviews among the categories.

3) One more classification example is weather: a model could tell whether each day of the coming week will be sunny or rainy. The response is sunny or rainy for each day. The goal is prediction, because we want to know what the weather will be on future days. Predictors could include humidity, temperature, and wind speed.

  1. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Comment:

1) A regression problem could be determining which characteristics influence insurance costs for customers applying for insurance. The response is how much the person pays. The goal is inference, because we want to know which factors influence how high customers' rates are. Predictors could be sex, weight, and age.

2) Another regression problem is predicting next month's sales for a grocery store. The response is sales in dollars next month. This is a prediction problem, because we want to know future sales given information about the past. Predictors could be past sales data, whether there are discounts, and money spent on ads.

3) A last regression example is predicting the number of people expected to attend a concert venue next week with rock concerts booked throughout. The response is the number of attendees. The goal is prediction, because we want to forecast attendance. Predictors could include past attendance, a genre popularity index, the level of promotional activity, and ticket pricing.

Problem Three:

This exercise relates to the College data set, which can be found in the file college.csv in canvas. It contains a number of variables for 777 different universities and colleges in the US. The variables are:

Note: <br> just indicates a line break and makes the preview look better.

• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

  1. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college <- read.csv("College.csv")
college
  1. Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:

college <- college[, -1]
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
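
As a possible shortcut, and only a sketch under stated assumptions: read.csv() accepts a row.names argument, so the name column can become the row names at load time. The three-row CSV written to a temporary file here is made-up stand-in data (not the real College.csv), and stringsAsFactors = TRUE is an optional extra that stores Private as a factor.

```r
# Hypothetical three-row stand-in for College.csv, written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Private,Apps",
  "College A,Yes,1660",
  "College B,No,2186",
  "College C,Yes,1428"
), tmp)

# row.names = 1 uses the first column as row names and drops it from the data
demo <- read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)

rownames(demo)       # the college names
names(demo)          # only the data columns remain
class(demo$Private)  # factor, because of stringsAsFactors = TRUE
```

With Private stored as a factor, summary() reports counts per level instead of "Length / Class / Mode".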

    1. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00
  1. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
tencolcollege <- college[, 2:11] # columns 2 to 11 are used because pairs() would not accept the non-numeric Private column
pairs(tencolcollege)

  1. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
boxplot(Outstate ~ Private, data = college, col = c("lightyellow", "lightblue"), main = "Outstate VS Private", xlab = "Private", ylab = "Outstate Tuition")
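
The exercise asks for plot() rather than boxplot(). A minimal sketch of how that can work, assuming the grouping column is first converted to a factor (plot() draws boxplots when the x variable in the formula is a factor). The data frame is made up for illustration.

```r
# Made-up stand-in data: tuition for a few private and public colleges
demo <- data.frame(
  Private  = c("Yes", "Yes", "Yes", "No", "No", "No"),
  Outstate = c(12000, 15500, 9800, 6500, 7200, 8100)
)

# plot() needs a factor on the x side to produce side-by-side boxplots
demo$Private <- as.factor(demo$Private)

plot(Outstate ~ Private, data = demo,
     xlab = "Private", ylab = "Outstate Tuition")
```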

  1. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50 %.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college , Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(college) 
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate      Elite    
##  Min.   : 3186   Min.   : 10.00   No :699  
##  1st Qu.: 6751   1st Qu.: 53.00   Yes: 78  
##  Median : 8377   Median : 65.00            
##  Mean   : 9660   Mean   : 65.46            
##  3rd Qu.:10830   3rd Qu.: 78.00            
##  Max.   :56233   Max.   :118.00

There are 78 elite colleges.
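
A compact alternative for building and counting such an indicator, sketched on made-up percentages rather than the real Top10perc column: ifelse() creates the labels in one step and table() returns the counts directly.

```r
# Hypothetical Top10perc values for six colleges
top10 <- c(12, 55, 50, 78, 23, 51)

# Strictly greater than 50 counts as elite, matching the rule in the exercise
elite <- factor(ifelse(top10 > 50, "Yes", "No"))

table(elite)  # counts per level
```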

boxplot(Outstate ~ Elite, data = college, col = c("lightyellow", "lightblue"), main = "Outstate VS Elite", xlab = "Elite", ylab = "Outstate Tuition")

  1. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2, 2))

hist(college$Outstate, main = "Outstate", xlab = "Tuition", col = "coral", breaks = 15)
hist(college$Enroll, main = "Enroll", xlab = "Number of Students", col = "coral1", breaks = 20)
hist(college$Accept, main = "Accept", xlab = "Applicants Accepted", col = "coral2", breaks = 25)
hist(college$Room.Board, main = "Room and Board", xlab = "Room and Board Cost", col = "coral3", breaks = 30)

  1. Continue exploring the data, and provide a brief summary of what you discover.
hist(college$F.Undergrad, col = rgb(1,0,0, alpha = 0.5), breaks = 30, xlim = c(0,10000), main = "Full time VS Part time Students", xlab = "Number of Students", ylab = "Number of Colleges")

hist(college$P.Undergrad, col = rgb(0,0,1, alpha = 0.5), breaks = 30, add = TRUE)

axis(side = 1, at = seq(0, 10000, by = 1000))

legend("topright", legend = c("Part-Time", "Full-Time"), fill = c("blue", "red"))

As the overlaid histograms show, most colleges have relatively few part-time students, while full-time enrollment tends to be higher.

lm_10percvgrad <- lm(college$Grad.Rate ~ college$Top10perc)


plot(college$Top10perc, college$Grad.Rate)
abline(lm_10percvgrad, col = "red", lwd = 5)

summary(lm_10percvgrad)
## 
## Call:
## lm(formula = college$Grad.Rate ~ college$Top10perc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.410  -9.834   0.288   9.080  61.482 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       52.17990    0.99431   52.48   <2e-16 ***
## college$Top10perc  0.48201    0.03039   15.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.94 on 775 degrees of freedom
## Multiple R-squared:  0.245,  Adjusted R-squared:  0.244 
## F-statistic: 251.5 on 1 and 775 DF,  p-value: < 2.2e-16
plot(lm_10percvgrad)

Despite the possible appearance of heteroskedasticity (an inward funnel shape) in the residuals vs. fitted plot, the red trend line stays very close to zero. The relationship is positive, but the R^2 is fairly low (about 0.245). The scatterplot with the fitted line shows that colleges with a higher percentage of top-10% high school students have more consistently high graduation rates. Colleges with lower percentages of top-10% students have widely varying graduation rates: some are as high as those of universities with many top-10% students, and others are very low. Running the regression with robust standard errors might give a different picture. Also, strangely, there is an instance where the graduation rate is higher than 100 (the maximum is 118). This may be an interesting outlier to investigate.
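
To make the robust standard error idea concrete, here is a sketch that computes White (HC0) heteroskedasticity-robust standard errors with base R only. It uses the built-in mtcars data as a stand-in for the college data, and the variable names are illustrative; in practice a package such as sandwich is often used instead.

```r
# Stand-in model on built-in data (not the college regression)
fit <- lm(mpg ~ wt, data = mtcars)

X <- model.matrix(fit)   # design matrix, including the intercept column
e <- residuals(fit)      # ordinary residuals

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # each row of X is scaled by its own residual
vcov_hc0 <- bread %*% meat %*% bread

robust_se  <- sqrt(diag(vcov_hc0))
classic_se <- sqrt(diag(vcov(fit)))

# Side-by-side comparison of the two sets of standard errors
cbind(classic_se, robust_se)
```

The coefficient estimates do not change; only the standard errors (and therefore the t values and p-values) differ.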

Problem Four:

This exercise involves the Auto data set in canvas. Make sure that the missing values have been removed from the data. This can be done using the na.omit() function which removes missing values from data.

auto <- read.csv('Auto.csv')
auto
# Using na.omit(auto) alone does nothing because there are no NA values. The missing values appear as question marks ("?") in the horsepower variable, so they must be recoded to NA before na.omit() works.

# horsepower also has to be converted to integer because it was read in as chr even though it holds numeric values.

auto$horsepower[auto$horsepower == "?"] <- NA

newauto<- na.omit(auto)

newauto$horsepower <- as.integer(newauto$horsepower)
[Image: horsepower screenshot]

As the screenshot shows, the horsepower variable has "?" at the start of its range because the column contains the "?" character, which explains why it is a chr variable even though it holds mostly numbers.
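
One way to avoid the manual recoding, sketched on a made-up file rather than the real Auto.csv: read.csv() has an na.strings argument, so "?" can be treated as missing at load time and the column is then read as numeric directly.

```r
# Hypothetical three-row stand-in for Auto.csv, written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "mpg,horsepower,name",
  "18,130,car a",
  "25,?,car b",
  "22,95,car c"
), tmp)

# Treat "?" as NA while reading, so horsepower is not read as character
demo <- read.csv(tmp, na.strings = "?")
class(demo$horsepower)

# na.omit() now drops the row with the missing horsepower
demo_clean <- na.omit(demo)
nrow(demo_clean)
```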

newauto
  1. Which of the predictors are quantitative, and which are qualitative?

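As seen in the table above, the quantitative variables are mpg, cylinders, displacement, horsepower (read in as chr, then converted to integer above), weight, acceleration, year, and origin. Note that origin is coded numerically (1 to 3) but represents a category, so it could reasonably be treated as qualitative; it is kept as numeric in the calculations that follow.

The qualitative variable is name.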

  1. What is the range of each quantitative predictor? You can answer this using the range() function.
print('The range of the quantitative predictors are:')
## [1] "The range of the quantitative predictors are:"
cat(
  "MPG:", range(newauto$mpg),
  "\nCylinders:", range(newauto$cylinders),
  "\nDisplacement:", range(newauto$displacement),
  "\nHorsepower:", range(newauto$horsepower),
  "\nWeight:", range(newauto$weight),
  "\nAcceleration:", range(newauto$acceleration),
  "\nYear:", range(newauto$year),
  "\nOrigin:", range(newauto$origin)
)
## MPG: 9 46.6 
## Cylinders: 3 8 
## Displacement: 68 455 
## Horsepower: 46 230 
## Weight: 1613 5140 
## Acceleration: 8 24.8 
## Year: 70 82 
## Origin: 1 3
  1. What is the mean and standard deviation of each quantitative predictor?
print("here are the means: ")
## [1] "here are the means: "
cat(
  "MPG:", mean(newauto$mpg),
  "\nCylinders:", mean(newauto$cylinders),
  "\nDisplacement:", mean(newauto$displacement),
  "\nHorsepower:", mean(newauto$horsepower),
  "\nWeight:", mean(newauto$weight),
  "\nAcceleration:", mean(newauto$acceleration),
  "\nYear:", mean(newauto$year),
  "\nOrigin:", mean(newauto$origin)
)
## MPG: 23.44592 
## Cylinders: 5.471939 
## Displacement: 194.412 
## Horsepower: 104.4694 
## Weight: 2977.584 
## Acceleration: 15.54133 
## Year: 75.97959 
## Origin: 1.576531
print("here are the SD's: ")
## [1] "here are the SD's: "
cat(
  "MPG:", sd(newauto$mpg),
  "\nCylinders:", sd(newauto$cylinders),
  "\nDisplacement:", sd(newauto$displacement),
  "\nHorsepower:", sd(newauto$horsepower),
  "\nWeight:", sd(newauto$weight),
  "\nAcceleration:", sd(newauto$acceleration),
  "\nYear:", sd(newauto$year),
  "\nOrigin:", sd(newauto$origin)
)
## MPG: 7.805007 
## Cylinders: 1.705783 
## Displacement: 104.644 
## Horsepower: 38.49116 
## Weight: 849.4026 
## Acceleration: 2.758864 
## Year: 3.683737 
## Origin: 0.8055182
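
The same statistics can be collected with less repetition. This is a sketch using sapply() on the built-in mtcars data as a stand-in (the column names below belong to mtcars, not to the Auto data).

```r
# Columns to summarise; these names are from mtcars, used here as a stand-in
quant_cols <- c("mpg", "hp", "wt")

# Apply one function to every column and bind the results into a matrix
stats <- sapply(mtcars[, quant_cols], function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})

round(stats, 2)
```

Each column of the result is one variable and each row is one statistic, which makes the values easier to compare than separate cat() calls.
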
  1. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
print('The range of the quantitative predictors are:')
## [1] "The range of the quantitative predictors are:"
cat(
  "MPG:", range(newauto[-c(10:85),]$mpg),
  "\nCylinders:", range(newauto[-c(10:85),]$cylinders),
  "\nDisplacement:", range(newauto[-c(10:85),]$displacement),
  "\nHorsepower:", range(newauto[-c(10:85),]$horsepower),
  "\nWeight:", range(newauto[-c(10:85),]$weight),
  "\nAcceleration:", range(newauto[-c(10:85),]$acceleration),
  "\nYear:", range(newauto[-c(10:85),]$year),
  "\nOrigin:", range(newauto[-c(10:85),]$origin)
)
## MPG: 11 46.6 
## Cylinders: 3 8 
## Displacement: 68 455 
## Horsepower: 46 230 
## Weight: 1649 4997 
## Acceleration: 8.5 24.8 
## Year: 70 82 
## Origin: 1 3
print("here are the means: ")
## [1] "here are the means: "
cat(
  "MPG:", mean(newauto[-c(10:85),]$mpg),
  "\nCylinders:", mean(newauto[-c(10:85),]$cylinders),
  "\nDisplacement:", mean(newauto[-c(10:85),]$displacement),
  "\nHorsepower:", mean(newauto[-c(10:85),]$horsepower),
  "\nWeight:", mean(newauto[-c(10:85),]$weight),
  "\nAcceleration:", mean(newauto[-c(10:85),]$acceleration),
  "\nYear:", mean(newauto[-c(10:85),]$year),
  "\nOrigin:", mean(newauto[-c(10:85),]$origin)
)
## MPG: 24.40443 
## Cylinders: 5.373418 
## Displacement: 187.2405 
## Horsepower: 100.7215 
## Weight: 2935.972 
## Acceleration: 15.7269 
## Year: 77.14557 
## Origin: 1.601266
print("here are the SD's: ")
## [1] "here are the SD's: "
cat(
  "MPG:", sd(newauto[-c(10:85),]$mpg),
  "\nCylinders:", sd(newauto[-c(10:85),]$cylinders),
  "\nDisplacement:", sd(newauto[-c(10:85),]$displacement),
  "\nHorsepower:", sd(newauto[-c(10:85),]$horsepower),
  "\nWeight:", sd(newauto[-c(10:85),]$weight),
  "\nAcceleration:", sd(newauto[-c(10:85),]$acceleration),
  "\nYear:", sd(newauto[-c(10:85),]$year),
  "\nOrigin:", sd(newauto[-c(10:85),]$origin)
)
## MPG: 7.867283 
## Cylinders: 1.654179 
## Displacement: 99.67837 
## Horsepower: 35.70885 
## Weight: 811.3002 
## Acceleration: 2.693721 
## Year: 3.106217 
## Origin: 0.81991
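
When the same rows are dropped many times, it can be simpler to store the subset once and reuse it. A sketch on the built-in mtcars data as a stand-in (mtcars has only 32 rows, so rows 10 through 15 are dropped here instead of 10 through 85).

```r
# Drop a block of rows once and keep the result
keep <- mtcars[-(10:15), ]
nrow(keep)   # 32 - 6 rows remain

# Reuse the stored subset for any statistic
sapply(keep[, c("mpg", "hp", "wt")], range)
```
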
  1. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings:

Here are a couple of scatterplots and boxplots that I thought would show a relationship:

plot(newauto$weight, newauto$mpg, main = "weight vs mpg")

plot(newauto$weight, newauto$acceleration, main = "weight vs acceleration")

ggplot(newauto, aes(x = factor(origin), y = mpg)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "purple") +
  labs(title = "mpg vs origin", x = "origin", y = "mpg")

ggplot(newauto, aes(x = factor(origin), y = horsepower)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 22, size = 3, color = "red") +
  labs(title = "horsepower vs origin", x = "origin", y = "horsepower")

Comment: There is a negative linear relationship between weight and mpg. There also seems to be a negative linear relationship between weight and acceleration, but it appears weaker because the points are more spread out (potentially a lower R^2 than in the other plot). In the box plot comparing origin and mpg, origin 3 appears to have the highest mpg on average as well as the highest-mpg vehicles. Origin 1 seems to have a lower average mpg but a wide interquartile range (a high Q3 and a low Q1). In the box plot comparing horsepower and origin, origin 1 has the highest horsepower on average but a very wide range. The other two origins look very similar to each other and lower than origin 1.

  1. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Comment: I think weight could be a strong predictor. In my scatterplot it shows a tight negative relationship: as weight increases, mpg decreases. I believe it would most likely have a high R^2.

Problem Five:

This question involves the use of simple and multiple linear regression on the Auto data set.

  1. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
lm_slrauto <- lm(mpg ~ horsepower, data = newauto)
summary(lm_slrauto)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = newauto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

As a predictor, horsepower is very good. It has a very low p-value (very close to 0), denoted by the three stars ***. The R^2 is also fairly high at 0.61.

For example: i. Is there a relationship between the predictor and the response? ii. How strong is the relationship between the predictor and the response? iii. Is the relationship between the predictor and the response positive or negative? iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

new_data <- data.frame(horsepower = 98)

predictions <- predict(lm_slrauto, newdata = new_data, interval = "confidence", level = 0.95)

print(predictions)
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108

Comment: i. Yes. ii. Fairly strong: the p-value is very close to 0 and the R^2 is about 0.61. iii. Negative: the coefficient is -0.158. iv. The predicted mpg is 24.46708, with a 95% confidence interval from 23.97308 (lwr) to 24.96108 (upr). Math check: 39.935861 + 98 × (-0.157845) ≈ 24.467.
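
The question also asks for a prediction interval, which predict() returns with interval = "prediction". A sketch on the built-in mtcars data as a stand-in (hp plays the role of horsepower here, so the numbers will not match the Auto results).

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ hp, data = mtcars)
new_data <- data.frame(hp = 98)

# Interval for the mean response at hp = 98
conf_int <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)

# Interval for a single new observation at hp = 98
pred_int <- predict(fit, newdata = new_data, interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals share the same fitted value, but the prediction interval is wider because it also accounts for the noise in an individual observation.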

  1. Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(newauto$horsepower, newauto$mpg, main = "mpg vs horse power")
abline(lm_slrauto, col = "lightslateblue", lwd = 5)

  1. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
plot(lm_slrauto)

Comment: The problem I see here is that simple linear regression assumes homoskedasticity (constant variance of the residuals), and the residuals vs. fitted plot suggests that the data violate that assumption. The red trend line should also stay close to zero across the whole range, but it is near zero only in the middle of the fitted values, which may indicate that the relationship is not purely linear.

  1. Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(newauto[,1:8])

  1. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(newauto[,1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
  1. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
mlrauto <- lm(mpg~., data = newauto[,1:8])
summary(mlrauto)
## 
## Call:
## lm(formula = mpg ~ ., data = newauto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?

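Comment: i. Yes: several predictors have significant p-values, the overall F-statistic has a p-value below 2.2e-16, and the R^2 is high at 0.82. ii. Displacement, weight, year, and origin. iii. Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg on average.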

Bonus parts for Problem Five

  1. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
interactionwa <- newauto$weight*newauto$acceleration

mlrauto1 <- lm(mpg ~ year + origin + displacement + interactionwa, data = newauto[,1:8])
summary(mlrauto1)
## 
## Call:
## lm(formula = mpg ~ year + origin + displacement + interactionwa, 
##     data = newauto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3686  -1.8012  -0.1028   1.8168  14.2685 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.322e+01  4.318e+00  -5.378 1.31e-07 ***
## year           7.787e-01  5.411e-02  14.390  < 2e-16 ***
## origin         8.808e-01  2.932e-01   3.004  0.00284 ** 
## displacement  -3.566e-02  2.587e-03 -13.783  < 2e-16 ***
## interactionwa -1.535e-04  1.899e-05  -8.085 8.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.609 on 387 degrees of freedom
## Multiple R-squared:  0.7884, Adjusted R-squared:  0.7862 
## F-statistic: 360.4 on 4 and 387 DF,  p-value: < 2.2e-16
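
For reference, a sketch of the formula syntax the exercise mentions, using the built-in mtcars data as a stand-in (wt and qsec are mtcars columns). In an R formula, a:b adds only the interaction term, while a*b expands to a + b + a:b.

```r
# a * b expands to both main effects plus the interaction
fit_star  <- lm(mpg ~ wt * qsec, data = mtcars)

# The same model written out with the : operator
fit_colon <- lm(mpg ~ wt + qsec + wt:qsec, data = mtcars)

names(coef(fit_star))
all.equal(coef(fit_star), coef(fit_colon))
```

Unlike a precomputed product column, this form keeps the main effects in the model alongside the interaction, which usually makes the interaction coefficient easier to interpret.
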
  1. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
loggedweight <- log(newauto$weight)
sqacceleration <- sqrt(newauto$acceleration)


mlrauto2 <- lm(mpg ~ year + origin + displacement + loggedweight + sqacceleration, data = newauto[,1:8])
summary(mlrauto2)
## 
## Call:
## lm(formula = mpg ~ year + origin + displacement + loggedweight + 
##     sqacceleration, data = newauto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2112 -1.9875  0.0934  1.7245 12.7750 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    133.277481  11.074192  12.035  < 2e-16 ***
## year             0.797422   0.046457  17.165  < 2e-16 ***
## origin           0.918441   0.252387   3.639 0.000311 ***
## displacement     0.012590   0.004648   2.709 0.007051 ** 
## loggedweight   -22.505206   1.509505 -14.909  < 2e-16 ***
## sqacceleration   1.224461   0.579046   2.115 0.035103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.099 on 386 degrees of freedom
## Multiple R-squared:  0.8444, Adjusted R-squared:  0.8423 
## F-statistic: 418.8 on 5 and 386 DF,  p-value: < 2.2e-16

Comment: I included an interaction between weight and acceleration, and it was significant at the 0.001 level (p = 8.06e-15), although the overall R^2 was slightly lower (0.79 versus 0.82 for the full model). For the transformations I tried the log of weight and the square root of acceleration; both were statistically significant in the model, log of weight at the 0.001 level (p < 2e-16) and sqrt of acceleration at the 0.05 level (p = 0.035). Acceleration was not significant before the sqrt transformation (p = 0.415 in the full model). The overall R^2 also increased to 0.84.
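
Transformations can also be written directly inside the formula instead of creating separate vectors. A sketch on the built-in mtcars data as a stand-in; the choice of columns is illustrative only. Powers need to be wrapped in I(), because ^ has a special meaning in formulas.

```r
# log(), sqrt() and I(x^2) applied inside the model formula
fit_tr <- lm(mpg ~ log(wt) + sqrt(qsec) + I(hp^2), data = mtcars)

names(coef(fit_tr))          # terms are labelled by their transformation
summary(fit_tr)$r.squared    # R^2 of the transformed model
```

Keeping the transformation in the formula also means predict() applies it automatically to new data.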

Problem Six:

Explore the mtcars dataset in R.

Load the dataset into the environment, use the functions you’ve learned so far to look into the data and run a linear model of your choice.

Then share details about your model including: the parameters, the variable chosen, the coefficients for each variable, and the \(R^2\).

mtcars
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
pairs(mtcars)

lmcars <- lm(mpg ~ wt, data = mtcars) 
summary(lmcars)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
plot(mtcars$wt, mtcars$mpg, main = "mpg vs wt")
abline(lmcars, col = "mistyrose", lwd = 5)

Comment: Here I ran a simple linear regression with mpg as the response and weight (wt) as the predictor, without any interactions or transformations. The intercept is 37.29 and the coefficient for weight is -5.34, so each one-unit increase in wt (1,000 lbs in mtcars) is associated with a decrease of about 5.34 mpg. The predictor wt is highly significant, with a p-value very close to 0 (1.29e-10). The R^2 is 0.75, which is fairly close to 1.
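
The reported values can also be pulled out of the fitted model programmatically rather than read off the printed summary. A short sketch that refits the same mpg ~ wt model on mtcars (the object name fit is illustrative):

```r
fit <- lm(mpg ~ wt, data = mtcars)

round(coef(fit), 4)                # intercept 37.2851, wt -5.3445
round(summary(fit)$r.squared, 4)   # 0.7528
```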