Pick any two different datasets from base R packages like “?datasets()”
Describe the data to somebody who has never seen it (in less than 3 sentences). This may include elaborating upon the key variables so that the reader can follow along / guess what information the data contains.
Finally, tell/show us what type of data it is and why.
You can and should use appropriate charts/summary statistics to drive your point. See some different types of charts in base R, and some simple code in base R to create these charts. You will find lots of code online, as well as some resources in R Programming Resources or Tools for Success. Try the commands in the help file to see if you have any quick visualizations to support your answer. Read up on descriptive statistics blogs online that can give you some ideas: STHDA, Stats and R. Draw a scatterplot if cross-sectional data. Show us the timeplots if time series data. Use the head/tail/indexing or table command on the ID variable if pooled cross-sectional. Use the two-way table command on the ID variables if panel data.
?datasets()
## starting httpd help server ... done
library(help = "datasets")
I chose two datasets: 1) mtcars (cross-sectional data) and 2) AirPassengers (time-series data).
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
This dataset contains information on 32 car models from the 1970s. It is cross-sectional data because each car model is observed once, at a single point in time, and the rows have no time ordering.
The mtcars dataset gives the specifications of these car models, one per row, including fuel efficiency (mpg), engine characteristics such as the number of cylinders (cyl), displacement (disp), and horsepower (hp), and weight (wt). It is suitable for exploring relationships between these features.
I created a scatterplot to see the relationship between miles per gallon and horsepower.
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(x = "Horsepower (hp)", y = "Miles per Gallon (mpg)") +
  ggtitle("Scatterplot of Horsepower vs. Miles per Gallon")
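The scatterplot can be backed up with a few summary statistics. The following is a minimal sketch in base R; the choice of the three columns (mpg, hp, wt) and the variable names key_vars and hp_mpg_cor are illustrative assumptions, not part of the assignment.

```r
# Illustrative sketch: the column selection below is an assumption.
data("mtcars")

# One row per car model and no time index: the layout of cross-sectional data
str(mtcars)
dim(mtcars)

# Min, quartiles, median, mean, and max for a few key variables
key_vars <- mtcars[, c("mpg", "hp", "wt")]
summary(key_vars)

# Sign and strength of the linear association shown in the scatterplot
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
hp_mpg_cor

# Base R version of the same scatterplot (no ggplot2 needed)
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)",
     main = "Scatterplot of Horsepower vs. Miles per Gallon")
```

A negative correlation here would agree with the downward pattern in the plot: cars with more horsepower tend to get fewer miles per gallon.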
data("AirPassengers")
# Create a timeplot
plot(AirPassengers, main = "Monthly Airline Passenger Numbers (1949-1960)",
     xlab = "Year", ylab = "Passenger Count (in Thousands)", col = "blue")
This dataset records the monthly totals of international airline passengers, in thousands, from 1949 to 1960. It is time series data because it follows a single variable (the passenger count) at regular, ordered time points (months), and the focus is on how that variable changes over time, including its trend and seasonality.
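The time-series structure can also be checked directly from the object. This is a hedged sketch in base R; the multiplicative decomposition and the variable name parts are assumptions made for illustration, chosen because the seasonal swings in the plot appear to grow with the level of the series.

```r
data("AirPassengers")

class(AirPassengers)      # a "ts" object stores its own time index
start(AirPassengers)      # year and period of the first observation
end(AirPassengers)        # year and period of the last observation
frequency(AirPassengers)  # number of observations per year

# Split the series into trend, seasonal, and random components.
# type = "multiplicative" is an assumption; "additive" is the alternative.
parts <- decompose(AirPassengers, type = "multiplicative")
plot(parts)
```

The decomposition plot shows the trend and the repeating within-year pattern as separate panels, which supports describing the data as a time series with both trend and seasonality.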
Question - II
In your own words, what is covariance? What is variance? Online stats blogs may be more helpful. Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)? Please show this in R. You just need to match the two outputs (one from regression, one from the formula) like we did in class, but for a dataset of your choice.
Covariance: Covariance is a measure of how two variables change together. If the covariance is positive, it indicates that when one variable increases, the other tends to increase as well. If it’s negative, it means that when one variable increases, the other tends to decrease. If there is no linear relationship between the variables, the covariance will be zero (or close to zero in a sample).
Variance: Variance is a measure of the spread of a single random variable. It tells us how much a random variable varies from its mean. A high variance means that the values are more spread out, and a low variance means that the values are closer to the mean. It is calculated as the average squared deviation from the mean (the sample version divides by n - 1), and it equals the square of the standard deviation.
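Both definitions can be checked by hand against R's built-in functions. The sketch below uses wt and mpg from mtcars as an illustrative choice; the names cov_by_hand and var_by_hand are hypothetical helpers, and the n - 1 divisor is the sample version that cov() and var() use.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Sample covariance: sum of products of deviations from the means, over n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample variance: sum of squared deviations from the mean, over n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Each comparison should return TRUE
all.equal(cov_by_hand, cov(x, y))
all.equal(var_by_hand, var(x))
all.equal(var(x), sd(x)^2)     # variance is the squared standard deviation
all.equal(var(x), cov(x, x))   # variance is the covariance of x with itself
```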
The slope coefficient from a simple linear regression measures how much the dependent variable y changes, on average, for a one-unit change in the independent variable x. Ordinary least squares chooses the slope that minimizes the sum of squared residuals, and solving that minimization gives the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x from its mean. Dividing both the numerator and the denominator by n - 1 turns these sums into the sample covariance of y and x and the sample variance of x, so their ratio is the slope. Intuitively, the covariance captures how much y moves together with x, and dividing by the variance of x rescales that co-movement to a per-unit change in x.
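The reasoning can be written out as a short derivation. The notation here is an assumption introduced for illustration: b_0 and b_1 are the fitted intercept and slope, and \bar{x}, \bar{y} are the sample means.

```latex
\begin{align*}
&\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\[4pt]
&\text{First-order condition for } b_0: \quad
  \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\[4pt]
&\text{First-order condition for } b_1: \quad
  \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0 \\[4pt]
&\text{Substituting } b_0: \quad
  \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  - b_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0 \\[4pt]
&b_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
\end{align*}
```

The substitution step uses the fact that deviations from a mean sum to zero, which lets x_i be replaced by (x_i - \bar{x}) without changing the sum.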
data <- mtcars
covariance_mpg_wt <- cov(data$mpg, data$wt)
variance_wt <- var(data$wt)
slope_mpg_wt <- covariance_mpg_wt / variance_wt
slope_mpg_wt
## [1] -5.344472
regression <- lm(data$mpg ~ data$wt)
regression
##
## Call:
## lm(formula = data$mpg ~ data$wt)
##
## Coefficients:
## (Intercept) data$wt
## 37.285 -5.344
I calculated the slope coefficient both using the formula (covariance of mpg and wt divided by variance of wt) and by performing a simple linear regression. I found that the two values match (-5.344472 from the formula and -5.344 in the rounded regression output), which shows how covariance and variance determine the slope coefficient in a simple linear regression.
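The match can also be confirmed programmatically rather than by comparing printed output. This is a minimal sketch; the names fit, slope_from_lm, slope_from_formula, and intercept_from_formula are hypothetical, and the model is refit with the data = argument so that the coefficient is simply named wt.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Slope as estimated by the regression (unname drops the "wt" label)
slope_from_lm <- unname(coef(fit)["wt"])

# Slope from the covariance / variance formula
slope_from_formula <- cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)

# Compares the two values up to floating-point tolerance
all.equal(slope_from_lm, slope_from_formula)

# The intercept follows from the means: mean(y) - slope * mean(x)
intercept_from_formula <- mean(mtcars$mpg) - slope_from_formula * mean(mtcars$wt)
all.equal(unname(coef(fit)["(Intercept)"]), intercept_from_formula)
```

Using all.equal() avoids relying on how many digits print() happens to show for each value.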