Functions are reusable, referenceable blocks of code that accomplish a specific task.
Functions take in one or more parameters, process them, and return a result.
Here is an example of a basic function that you could write in R:
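A minimal sketch of such a function, assuming it simply adds its two parameters. The name add_numbers is illustrative and not taken from the original example:

```r
# Define a function with two parameters, x and y
add_numbers <- function(x, y) {
  # The value of the last expression is what the function returns
  x + y
}

# Call the function by name, with the parameters in parentheses
add_numbers(2, 3)

# The same call with the parameters passed by name
add_numbers(x = 2, y = 3)
```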
As you can see in the last line, you can call a function by referencing its name and entering your parameters in parentheses.
Parameters are matched by the order in which they appear inside the parentheses. In this example, x is equal to 2 and y is equal to 3.
You can also pass a parameter by name, using the form parameter = ...
Here are some examples of functions we have used in R:
       vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
weight    1 578 121.82 71.07    103  113.18 69.68  35 373   338  0.96     0.34 2.96
Time      2 578  10.72  6.76     10   10.77  8.90   0  21    21 -0.02    -1.26 0.28
Chick*    3 578  26.26 14.00     26   26.27 17.79   1  50    49  0.00    -1.19 0.58
Diet*     4 578   2.24  1.16      2    2.17  1.48   1   4     3  0.31    -1.39 0.05
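The table above has the layout that describe() from the psych library produces, and its row names match R's built-in ChickWeight data set. Treating that data set as an assumption, here is a short sketch of a few base R functions applied to one column, with describe() called only if psych is installed:

```r
# ChickWeight ships with R in the datasets package (assumed source of the table)
data(ChickWeight)

# Base R functions applied to a single column
mean(ChickWeight$weight)    # average weight
sd(ChickWeight$weight)      # standard deviation of weight
median(ChickWeight$weight)  # median weight

# describe() comes from psych; run it only if the library is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(ChickWeight))
}
```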
These functions have already been defined in Base R or in the “psych” library you previously installed. Your use of functions in this class will mostly be limited to pre-defined functions.
If you want to learn more about a function, you can use ?funcname to do so.
As a review:
Vectors are one-dimensional arrays of data
Data frames are two-dimensional tables of data, with each column stored as a vector
You can specify the name of each column by using = as follows.
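A small sketch of building a data frame with named columns. The data frame name, column names, and values are made up for illustration:

```r
# Each column is a vector; the name on the left of = becomes the column name
scores <- data.frame(
  student = c("A", "B", "C"),
  score = c(90, 85, 77)
)

# Print the data frame
scores
```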
There are two ways to access elements inside a data frame: by column name or by index.
$ indicates that a specific column is being accessed.
[row, col] indicates that both the row and the column are specified, whether by number or by name
(When specifying by name, wrap the name in quotes, '' or "")
You can call functions to manipulate data in vectors:
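For instance, a few base R functions applied to a numeric vector. The vector name and values here are invented for illustration:

```r
# A numeric vector
heights <- c(150, 162, 171, 180)

length(heights)  # number of elements
mean(heights)    # average of the elements
sum(heights)     # total of the elements
heights * 2      # arithmetic is applied to every element
```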
And dataframes:
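A sketch of the same idea on a data frame, using both $ and [row, col] to pick out what a function works on. The data frame is a made-up example:

```r
scores <- data.frame(
  student = c("A", "B", "C"),
  score = c(90, 85, 77)
)

mean(scores$score)   # $ selects the score column, mean() averages it
scores[2, "score"]   # row 2, column selected by name (in quotes)
scores[2, 2]         # the same element, column selected by number
nrow(scores)         # number of rows in the data frame
```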
colnames() and head()
Here are two functions which you’ll be asked to use on your assignment:
colnames(): Can be used to rename the columns in your data frame.
head(): Outputs only the first few lines in your data frame.
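A brief sketch of both functions on a made-up data frame. The original column names a and b and the new names are illustrative only:

```r
scores <- data.frame(a = c(90, 85, 77), b = c(1, 2, 3))

colnames(scores)                         # show the current column names
colnames(scores) <- c("score", "rank")   # rename both columns

head(scores)      # first rows of the data frame (up to 6 by default)
head(scores, 2)   # only the first 2 rows
```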
cor() and lm()
Download the lab2.csv file from Canvas.
In a new R file, follow the same steps to read the .csv file as last week, including:
Save the .R file to the same folder as your .csv.
Set the current working directory to the source file location.
Call read.csv() on the file name.
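The last step might look like the sketch below. It assumes lab2.csv sits in the current working directory and stores the result in a data frame named df, the name used in the outputs later in this lab. The file check is an addition so the code does not fail when the file is missing:

```r
csv_path <- "lab2.csv"  # file name from the lab instructions
df <- NULL

if (file.exists(csv_path)) {
  df <- read.csv(csv_path)  # read the file into a data frame
  print(head(df))           # check the first few rows
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```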
Correlation measures the strength and direction of the linear relationship between two variables. Let’s say that we’re interested in the correlation between x and y.
cor() is a function that takes in x and y as its parameters.
We’re often interested in \(r^2\) in this class (why?), which we can obtain simply by squaring the above value.
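A sketch of both steps. Because lab2.csv is not reproduced here, the code generates stand-in data with columns x and y, so the numbers it prints will differ from the values in this handout:

```r
# Stand-in data for illustration only; lab2.csv will give different values
set.seed(1)
x <- 1:25
y <- 1 + 0.9 * x + rnorm(25, sd = 3)
df <- data.frame(x = x, y = y)

r <- cor(df$x, df$y)  # correlation between x and y
r
r^2                   # squared correlation
```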
You can also do correlations on the entire data frame to produce a correlation matrix.
           x           y           z
x 1.00000000  0.90444939  0.05378997
y 0.90444939  1.00000000 -0.05182359
z 0.05378997 -0.05182359  1.00000000
This produces a matrix that has a correlation value for every pair of variables.
What do you notice about this correlation matrix? How is it similar to our previous value?
You can square this table as well.
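A sketch of the correlation matrix and its square, again on generated stand-in columns x, y, and z rather than the real lab2.csv values:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)
df$z <- rnorm(25, mean = 20, sd = 25)

r_matrix <- cor(df)  # correlation for every pair of columns
r_matrix
r_matrix^2           # squares each entry of the matrix
```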
This class is called Applied Linear Regression, so you’ll be working a lot with the lm function in R.
lm() takes two parameters that you will supply every time in this class:
formula: Always in the form of y ~ x, with y as the dependent and x as the independent variable.
data: The data frame that the above variables come from.
Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x
      1.042        0.918
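Output of this shape comes from fitting a model and printing it. A sketch on stand-in data, so the coefficients will not match the ones shown above. The object name fit is an arbitrary choice:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)

fit <- lm(y ~ x, data = df)  # y is the dependent, x the independent variable
fit                          # prints the call and the coefficients
coef(fit)                    # intercept and slope as a named vector
```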
summary()
Until now you have been using describe() to obtain descriptive statistics from data frames. summary() is a base R function that serves a similar purpose.
       x            y                z
 Min.   : 1   Min.   :-3.167   Min.   :-32.981
 1st Qu.: 7   1st Qu.: 8.266   1st Qu.: -5.198
 Median :13   Median :14.669   Median : 18.671
 Mean   :13   Mean   :12.976   Mean   : 19.922
 3rd Qu.:19   3rd Qu.:17.418   3rd Qu.: 38.469
 Max.   :25   Max.   :27.096   Max.   : 67.687
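A table like this is what summary() returns for a data frame. A sketch on a stand-in data frame with two columns, so the values printed will differ from those above:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)

col_summary <- summary(df)  # min, quartiles, median, mean, max per column
col_summary
```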
However, unlike describe(), summary() can also be used on the output of lm(), giving you much more information.
Call:
lm(formula = y ~ x, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-5.148 -2.863  1.618  2.611  4.941

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.04237    1.34205   0.777    0.445
x            0.91795    0.09028  10.168 5.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.255 on 23 degrees of freedom
Multiple R-squared:  0.818,  Adjusted R-squared:  0.8101
F-statistic: 103.4 on 1 and 23 DF,  p-value: 5.575e-10
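A sketch of the call that yields this kind of report, on stand-in data, so every number will differ from the output above. The object names fit and fit_summary are arbitrary:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)

fit <- lm(y ~ x, data = df)   # fit the linear model
fit_summary <- summary(fit)   # build the detailed report
fit_summary                   # print residuals, coefficients, R-squared, F-statistic
```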
Call: repeats the line of code used to generate the linear model.
Residuals: descriptive statistics for the residuals, the differences between the observed values and the fitted line.
Coefficients: the \(\beta\)'s obtained from your linear model. In this case, \(\beta_0\) (y-intercept) and \(\beta_1\) (slope).
Estimate: estimated value of the coefficient.
Std. Error: standard error of the coefficient, a measure of the variability of the estimate.
t value: the t statistic for the coefficient, computed as the estimate divided by its standard error.
Pr(>|t|): the p-value from the t-test, that is, the probability of seeing a t value at least this extreme if the null hypothesis (that the coefficient is 0) were true.
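The columns described above can also be pulled out of the summary as a matrix. A sketch on stand-in data, showing that the t value is the estimate divided by its standard error. The object names are arbitrary:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)
fit <- lm(y ~ x, data = df)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
coefs

slope_est <- coefs["x", "Estimate"]
slope_se <- coefs["x", "Std. Error"]

slope_est / slope_se      # estimate divided by standard error
coefs["x", "t value"]     # the t value reported by summary()
coefs["x", "Pr(>|t|)"]    # the p-value for the slope
```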
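Some other important information for simple linear regression:
degrees of freedom: the number of observations minus the number of estimated coefficients (two for a simple linear regression).
R-squared: the proportion of the variance in y that the model explains.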
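Both quantities can be read off a fitted model directly. A sketch on stand-in data, which also prints the squared correlation so the two values can be compared:

```r
# Stand-in data for illustration only
set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 1 + 0.9 * df$x + rnorm(25, sd = 3)
fit <- lm(y ~ x, data = df)

df.residual(fit)         # residual degrees of freedom
nrow(df) - 2             # observations minus two estimated coefficients
summary(fit)$r.squared   # Multiple R-squared from the model
cor(df$x, df$y)^2        # squared correlation, for comparison
```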
If instead y is the independent variable and x is the dependent variable, should the linear model change?
lm
Find the correlation for x and z, then obtain a summary of the linear model for z on x.
How do the results for x and y differ from those for x and z?
Create a new column in df that is the z column with 2 added to every value. Then repeat the above steps for x and this new column.
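Adding a column can be done with $ and ordinary arithmetic. A sketch on a small stand-in data frame. The column name z_plus_2 is a hypothetical choice, and the analysis steps are left as the exercise:

```r
# Stand-in data for illustration only
df <- data.frame(
  x = 1:5,
  z = c(3.2, -1.0, 4.5, 0.7, 2.1)
)

# Assigning to a new name with $ creates the column;
# adding 2 to a vector adds 2 to every element
df$z_plus_2 <- df$z + 2

head(df)
```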