Lab 2 Markdown

College Scorecard

Intro

For this project, you will be working with data that are available through the Department of Education (https://collegescorecard.ed.gov/data/). The data file contains information on various characteristics of 7175 degree-granting U.S. institutions of higher education.

The variables you will be working with are defined in the included pdf (https://www2.stetson.edu/~jrasp/data/CollegeScorecard_variables.pdf). You should use the data contained in this file and your knowledge of R to answer the following questions. You must submit a document that includes your answers and also the R code you used to produce the answers. Your material should be submitted through the Google Classroom site.

Objectives

Use the data contained in the files.
Answer 16 questions based on your R experimentation.

Outline

Covariance and Correlation
Simple Regression
Normal Distributions
Probability

Questions

Covariance and Correlation

Present 5 relations through scatterplots. In addition to the scatterplots, provide an interpretation of the direction and magnitude of each of the relations you report.
Report the full covariance and correlation matrix for all variables and provide an interpretation of what variables are most strongly related (either positive or negative) and what variables are relatively unrelated to one another.
What is the strongest relation in the data? What are the variables involved and what is the correlation? What is the weakest relation in the data? What are the variables involved and what is the correlation?

Simple Regression

Predict the average cost of attendance from average SAT score
Predict the average cost of attendance from admission rate
Predict the number of students from average SAT score
Predict the number of students from admission rate
Predict completion rate from average SAT score
Predict completion rate from admission rate
Predict the percentage of students with federal loans from average SAT score
Predict the percentage of students with federal loans from admission rate
Run two additional regression equations using any variables you like from the dataset. Provide an interpretation of the results of these additional analyses.

Normal Distributions

Are the distributions of average SAT score, admission rate, and total number of undergraduate students normal or non-normal? What information did you use to answer this question?
For any of the distributions that were not normal in your previous answer, how could you transform them to be normal? Perform the appropriate transformation and report the mean, standard deviation, and variance of the new transformed variable.

Probability

Based on the distribution of scores, what is the probability that the average SAT score for a school is greater than 1400? What is the probability that the average SAT score for a school is less than 800?
Imagine that the distribution of average SAT scores was perfectly normal. Answer both parts of Question 15 again using the observed mean and standard deviation as your parameters.

Answers

Covariance and Correlation

Compute basic descriptive statistics for all quantitative variables

Descriptive statistics

# Code
# Prints descriptive statistics of all 24 quantitative variables
psych::describe(alotofdata[7:30])

##             vars    n     mean       sd   median  trimmed      mad   min
## ADM_RATE       1 2198     0.69     0.21     0.71     0.71     0.21     0
## SAT_AVG        2 1304  1059.07   133.36  1039.50  1048.34   104.52   720
## UGDS           3 6990  2332.16  5438.85   406.00  1052.09   526.32     0
## UGDS_WHITE     4 6990     0.51     0.29     0.56     0.52     0.34     0
## UGDS_BLACK     5 6990     0.19     0.22     0.10     0.14     0.12     0
## UGDS_HISP      6 6990     0.16     0.22     0.07     0.11     0.08     0
## UGDS_ASIAN     7 6990     0.03     0.08     0.01     0.02     0.02     0
## UGDS_AIAN      8 6990     0.01     0.07     0.00     0.00     0.00     0
## UGDS_NHPI      9 6990     0.00     0.03     0.00     0.00     0.00     0
## UGDS_2MOR     10 6990     0.02     0.03     0.02     0.02     0.03     0
## UGDS_NRA      11 6990     0.02     0.05     0.00     0.01     0.00     0
## UGDS_UNKN     12 6990     0.05     0.09     0.01     0.02     0.02     0
## PPTUG_EF      13 6969     0.23     0.25     0.15     0.19     0.22     0
## NPT4_PUB      14 1911  9624.66  4669.67  8751.00  9341.96  4293.61 -2434
## NPT4_PRIV     15 4688 18230.18  7272.13 18254.50 18021.04  7034.94  -581
## COSTT4_A      16 4030 24853.36 12762.63 22881.50 23321.23 12766.67  4610
## TUITFTE       17 7270 10401.08 17375.87  9015.00  9227.39  6676.15     0
## INEXPFTE      18 7270  7360.21 12726.34  5490.00  5875.23  3399.60     0
## PFTFAC        19 4045     0.57     0.31     0.54     0.57     0.40     0
## PCTPELL       20 6966     0.53     0.23     0.52     0.53     0.26     0
## C150_4        21 2481     0.48     0.21     0.47     0.47     0.22     0
## PFTFTUG1_EF   22 3664     0.53     0.26     0.53     0.53     0.31     0
## RET_FT4       23 2293     0.71     0.20     0.74     0.73     0.15     0
## PCTFLOAN      24 6966     0.52     0.28     0.58     0.54     0.28     0
##                    max      range  skew kurtosis     se
## ADM_RATE          1.00       1.00 -0.58    -0.04   0.00
## SAT_AVG        1545.00     825.00  0.82     1.11   3.69
## UGDS         151558.00  151558.00  6.84   104.77  65.05
## UGDS_WHITE        1.00       1.00 -0.26    -1.08   0.00
## UGDS_BLACK        1.00       1.00  1.73     2.49   0.00
## UGDS_HISP         1.00       1.00  2.24     4.83   0.00
## UGDS_ASIAN        0.97       0.97  6.38    55.98   0.00
## UGDS_AIAN         1.00       1.00 11.16   136.65   0.00
## UGDS_NHPI         1.00       1.00 22.92   608.03   0.00
## UGDS_2MOR         0.53       0.53  4.11    36.08   0.00
## UGDS_NRA          0.93       0.93  8.29   101.58   0.00
## UGDS_UNKN         0.90       0.90  4.37    23.94   0.00
## PPTUG_EF          1.00       1.00  1.02     0.22   0.00
## NPT4_PUB      28201.00   30635.00  0.61     0.10 106.82
## NPT4_PRIV     89406.00   89987.00  0.73     3.89 106.21
## COSTT4_A      79212.00   74602.00  0.97     0.47 201.04
## TUITFTE     1292154.00 1292154.00 55.94  4076.80 203.79
## INEXPFTE     735077.00  735077.00 31.03  1570.11 149.26
## PFTFAC            1.00       1.00  0.06    -1.30   0.00
## PCTPELL           1.00       1.00  0.00    -0.79   0.00
## C150_4            1.00       1.00  0.13    -0.38   0.00
## PFTFTUG1_EF       1.00       1.00 -0.06    -0.99   0.00
## RET_FT4           1.00       1.00 -1.26     2.31   0.00
## PCTFLOAN          1.00       1.00 -0.53    -0.81   0.00
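
The se column can be checked by hand, which also shows what it measures. The following is a minimal sketch, assuming the se that psych::describe reports is the standard error of the mean, sd / sqrt(n), and using the rounded SAT_AVG values printed above (the variable names are illustrative):

```r
# Values copied from the SAT_AVG row of the table above (rounded to 2 decimals)
sat_n  <- 1304
sat_sd <- 133.36

# Standard error of the mean: sd / sqrt(n)
sat_se <- sat_sd / sqrt(sat_n)
round(sat_se, 2)
```

The result should agree with the 3.69 shown in the se column for SAT_AVG, up to the rounding of the inputs.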

1․ Present 5 relations through scatterplots. In addition to the scatterplots, provide an interpretation of the direction and magnitude of each of the relations you report.

Relation 1

# Code
ggplot2::qplot(COSTT4_A, NPT4_PUB, data=alotofdata, geom = "point", xlim = c(0, 35000), main = 'Scatterplot of Avg. Net Price for Public Title IV and Avg. Cost of Attendance', xlab = 'Average cost of attendance', ylab = 'Average net price for Title IV public institutions')

Relation 2

# Code
ggplot2::qplot(UGDS_NHPI, UGDS_UNKN, data=alotofdata, geom = "point", main = 'Scatterplot of Native Hawaiian/Pacific Isl. Enrollment and Unknown Enrollment', xlab = 'Native Hawaiian/Pacific Islander Enrollment Race', ylab = 'Unknown Enrollment Race')

Relation 3

# Code
ggplot2::qplot(C150_4, SAT_AVG, data=alotofdata, geom = "point", main = 'Scatterplot of Completion Rate and Average SAT score', xlab = 'Completion Rate', ylab = 'Average SAT score')

Relation 4

# Code
ggplot2::qplot(PFTFTUG1_EF, UGDS_2MOR, data=alotofdata, geom = "point", ylim = c(0, .4), main = 'Scatterplot of Full-Time Undergrads and Enrollments Who Are Two or More Races', xlab = 'Full-Time Undergrads', ylab = 'Enrollments Who Are Two or More Races')

Relation 5

# Code
ggplot2::qplot(PFTFTUG1_EF, PPTUG_EF, data=alotofdata, geom = "point", main = 'Scatterplot of Full-Time Undergrads and Part-Time Undergrads', xlab = 'Full-Time Undergrads', ylab = 'Part-Time Undergrads')

2․ Report the full covariance and correlation matrix for all variables and provide an interpretation of what variables are most strongly related (either positive or negative) and what variables are relatively unrelated to one another.

Covariance Matrix

# Code
# Prints the covariance matrix of 22 quantitative variables
# (columns 8:20 and 22:30, which leaves out ADM_RATE and NPT4_PRIV)
cov(alotofdata[, c(8:20,22:30)], use = "complete.obs", method = "pearson")

##                   SAT_AVG          UGDS    UGDS_WHITE    UGDS_BLACK
## SAT_AVG      1.218022e+04  5.329262e+05  7.602445e+00 -1.139528e+01
## UGDS         5.329262e+05  8.285945e+07 -1.742406e+02 -4.046013e+02
## UGDS_WHITE   7.602445e+00 -1.742406e+02  5.742202e-02 -3.180472e-02
## UGDS_BLACK  -1.139528e+01 -4.046013e+02 -3.180472e-02  4.293978e-02
## UGDS_HISP   -2.289345e-01  2.508627e+02 -1.776047e-02 -5.888088e-03
## UGDS_ASIAN   2.940904e+00  2.582883e+02 -6.168199e-03 -2.583640e-03
## UGDS_AIAN   -2.015583e-01 -2.680593e+01  1.468718e-04 -4.383466e-04
## UGDS_NHPI   -2.699450e-02  5.489100e-02 -1.776705e-04 -1.011808e-04
## UGDS_2MOR    3.445257e-01  3.163754e+01 -3.089527e-04 -1.002266e-03
## UGDS_NRA     1.229237e+00  9.322055e+01 -6.599578e-04 -1.127995e-03
## UGDS_UNKN   -2.638249e-01 -2.840537e+01 -6.884413e-04  5.944754e-06
## PPTUG_EF    -3.816668e+00 -7.697761e+01 -3.402001e-03  1.452766e-03
## NPT4_PUB     1.385288e+05  4.890578e+06  3.771110e+02 -1.341249e+02
## COSTT4_A     2.120705e+05  1.128208e+07  1.485043e+02 -1.158920e+02
## TUITFTE      2.200865e+05  1.350737e+07  1.635458e+02 -1.790211e+02
## INEXPFTE     2.413508e+05  1.380195e+07 -4.826708e+01 -1.068839e+02
## PFTFAC       2.097600e+00  6.739357e+01  2.259232e-03  3.873909e-03
## PCTPELL     -9.354897e+00 -3.201298e+02 -2.136342e-02  1.743799e-02
## C150_4       1.406813e+01  7.645367e+02  9.136309e-03 -1.343661e-02
## PFTFTUG1_EF  5.079774e+00  1.943247e+02  4.346010e-03 -1.095800e-03
## RET_FT4      7.771277e+00  4.830751e+02  1.129095e-03 -6.430685e-03
## PCTFLOAN    -7.287525e+00 -4.466688e+02  1.244968e-03  1.368123e-02
##                 UGDS_HISP    UGDS_ASIAN     UGDS_AIAN     UGDS_NHPI
## SAT_AVG     -2.289345e-01  2.940904e+00 -2.015583e-01 -2.699450e-02
## UGDS         2.508627e+02  2.582883e+02 -2.680593e+01  5.489100e-02
## UGDS_WHITE  -1.776047e-02 -6.168199e-03  1.468718e-04 -1.776705e-04
## UGDS_BLACK  -5.888088e-03 -2.583640e-03 -4.383466e-04 -1.011808e-04
## UGDS_HISP    2.105206e-02  2.728131e-03 -1.707483e-04  5.362607e-05
## UGDS_ASIAN   2.728131e-03  4.753151e-03 -1.775280e-04  1.039281e-04
## UGDS_AIAN   -1.707483e-04 -1.775280e-04  4.861502e-04  1.285416e-06
## UGDS_NHPI    5.362607e-05  1.039281e-04  1.285416e-06  3.194102e-05
## UGDS_2MOR   -5.569517e-05  4.881726e-04  1.701521e-04  8.110200e-05
## UGDS_NRA     2.172151e-05  8.658289e-04  3.279742e-06  1.072810e-05
## UGDS_UNKN    1.946616e-05 -1.000288e-05 -2.112756e-05 -3.773000e-06
## PPTUG_EF     2.453818e-03 -4.711632e-04  1.764228e-04  3.888940e-05
## NPT4_PUB    -2.162551e+02 -1.431622e+01 -1.342254e+01 -1.909956e+00
## COSTT4_A    -1.310979e+02  8.443090e+01 -1.609801e+01 -1.047081e+00
## TUITFTE     -8.683237e+01  6.604482e+01 -1.077970e+01 -2.437688e-01
## INEXPFTE    -1.724218e+01  1.141844e+02 -8.156141e+00  1.962394e+00
## PFTFAC      -3.763366e-03 -2.208007e-03  2.999566e-04 -3.803687e-05
## PCTPELL      5.267153e-03 -3.991713e-04  1.242456e-04  2.696185e-05
## C150_4      -1.008910e-03  4.340904e-03 -6.325488e-04 -2.611656e-05
## PFTFTUG1_EF -1.126385e-03 -8.905536e-04 -3.586837e-04 -1.357610e-04
## RET_FT4      1.765587e-03  3.173605e-03 -4.895431e-04 -1.789240e-05
## PCTFLOAN    -9.145455e-03 -3.752316e-03 -1.656458e-04 -1.034155e-04
##                 UGDS_2MOR      UGDS_NRA     UGDS_UNKN      PPTUG_EF
## SAT_AVG      3.445257e-01  1.229237e+00 -2.638249e-01 -3.816668e+00
## UGDS         3.163754e+01  9.322055e+01 -2.840537e+01 -7.697761e+01
## UGDS_WHITE  -3.089527e-04 -6.599578e-04 -6.884413e-04 -3.402001e-03
## UGDS_BLACK  -1.002266e-03 -1.127995e-03  5.944754e-06  1.452766e-03
## UGDS_HISP   -5.569517e-05  2.172151e-05  1.946616e-05  2.453818e-03
## UGDS_ASIAN   4.881726e-04  8.658289e-04 -1.000288e-05 -4.711632e-04
## UGDS_AIAN    1.701521e-04  3.279742e-06 -2.112756e-05  1.764228e-04
## UGDS_NHPI    8.110200e-05  1.072810e-05 -3.773000e-06  3.888940e-05
## UGDS_2MOR    6.307534e-04  6.640463e-05 -6.965429e-05  1.926695e-05
## UGDS_NRA     6.640463e-05  8.691456e-04 -4.911387e-05 -2.911461e-04
## UGDS_UNKN   -6.965429e-05 -4.911387e-05  8.168477e-04  2.251906e-05
## PPTUG_EF     1.926695e-05 -2.911461e-04  2.251906e-05  9.677881e-03
## NPT4_PUB     9.694901e-01  4.783556e-02  1.923607e+00 -1.351751e+02
## COSTT4_A     1.042207e+01  2.229337e+01 -1.486209e+00 -1.749173e+02
## TUITFTE      8.204159e+00  3.388701e+01  5.209673e+00 -1.029201e+02
## INEXPFTE     1.575697e+01  4.359567e+01  5.067068e+00 -1.139963e+02
## PFTFAC       9.234901e-05 -2.319271e-05 -4.925984e-04 -1.664099e-03
## PCTPELL     -3.905091e-04 -7.951955e-04  9.181373e-05  2.672191e-03
## C150_4       3.905836e-04  1.420905e-03 -1.838141e-04 -9.527538e-03
## PFTFTUG1_EF -3.235708e-04  7.834947e-05 -4.936012e-04 -1.135068e-02
## RET_FT4      1.561521e-04  8.495269e-04 -1.356578e-04 -4.265497e-03
## PCTFLOAN    -6.137635e-04 -1.289573e-03  1.443932e-04 -2.242967e-03
##                  NPT4_PUB      COSTT4_A       TUITFTE      INEXPFTE
## SAT_AVG      1.385288e+05  2.120705e+05  2.200865e+05  2.413508e+05
## UGDS         4.890578e+06  1.128208e+07  1.350737e+07  1.380195e+07
## UGDS_WHITE   3.771110e+02  1.485043e+02  1.635458e+02 -4.826708e+01
## UGDS_BLACK  -1.341249e+02 -1.158920e+02 -1.790211e+02 -1.068839e+02
## UGDS_HISP   -2.162551e+02 -1.310979e+02 -8.683237e+01 -1.724218e+01
## UGDS_ASIAN  -1.431622e+01  8.443090e+01  6.604482e+01  1.141844e+02
## UGDS_AIAN   -1.342254e+01 -1.609801e+01 -1.077970e+01 -8.156141e+00
## UGDS_NHPI   -1.909956e+00 -1.047081e+00 -2.437688e-01  1.962394e+00
## UGDS_2MOR    9.694901e-01  1.042207e+01  8.204159e+00  1.575697e+01
## UGDS_NRA     4.783556e-02  2.229337e+01  3.388701e+01  4.359567e+01
## UGDS_UNKN    1.923607e+00 -1.486209e+00  5.209673e+00  5.067068e+00
## PPTUG_EF    -1.351751e+02 -1.749173e+02 -1.029201e+02 -1.139963e+02
## NPT4_PUB     1.560800e+07  1.287933e+07  7.263421e+06  3.494579e+06
## COSTT4_A     1.287933e+07  1.677172e+07  9.799257e+06  7.238153e+06
## TUITFTE      7.263421e+06  9.799257e+06  1.170850e+07  8.223882e+06
## INEXPFTE     3.494579e+06  7.238153e+06  8.223882e+06  1.556465e+07
## PFTFAC      -1.259444e+01  3.481788e+00  4.339883e+01  3.441474e+01
## PCTPELL     -2.510637e+02 -1.975503e+02 -2.263728e+02 -1.478998e+02
## C150_4       3.181454e+02  4.348695e+02  3.466991e+02  3.382111e+02
## PFTFTUG1_EF  2.138848e+02  2.365901e+02  1.606063e+02  1.159236e+02
## RET_FT4      1.173998e+02  1.933343e+02  1.704002e+02  1.861454e+02
## PCTFLOAN     1.707053e+02  9.555363e+01 -3.347646e+01 -1.182968e+02
##                    PFTFAC       PCTPELL        C150_4   PFTFTUG1_EF
## SAT_AVG      2.097600e+00 -9.354897e+00  1.406813e+01  5.079774e+00
## UGDS         6.739357e+01 -3.201298e+02  7.645367e+02  1.943247e+02
## UGDS_WHITE   2.259232e-03 -2.136342e-02  9.136309e-03  4.346010e-03
## UGDS_BLACK   3.873909e-03  1.743799e-02 -1.343661e-02 -1.095800e-03
## UGDS_HISP   -3.763366e-03  5.267153e-03 -1.008910e-03 -1.126385e-03
## UGDS_ASIAN  -2.208007e-03 -3.991713e-04  4.340904e-03 -8.905536e-04
## UGDS_AIAN    2.999566e-04  1.242456e-04 -6.325488e-04 -3.586837e-04
## UGDS_NHPI   -3.803687e-05  2.696185e-05 -2.611656e-05 -1.357610e-04
## UGDS_2MOR    9.234901e-05 -3.905091e-04  3.905836e-04 -3.235708e-04
## UGDS_NRA    -2.319271e-05 -7.951955e-04  1.420905e-03  7.834947e-05
## UGDS_UNKN   -4.925984e-04  9.181373e-05 -1.838141e-04 -4.936012e-04
## PPTUG_EF    -1.664099e-03  2.672191e-03 -9.527538e-03 -1.135068e-02
## NPT4_PUB    -1.259444e+01 -2.510637e+02  3.181454e+02  2.138848e+02
## COSTT4_A     3.481788e+00 -1.975503e+02  4.348695e+02  2.365901e+02
## TUITFTE      4.339883e+01 -2.263728e+02  3.466991e+02  1.606063e+02
## INEXPFTE     3.441474e+01 -1.478998e+02  3.382111e+02  1.159236e+02
## PFTFAC       3.008119e-02 -9.424151e-04  1.623653e-03  5.401583e-03
## PCTPELL     -9.424151e-04  1.651561e-02 -1.255180e-02 -2.966363e-03
## C150_4       1.623653e-03 -1.255180e-02  2.660994e-02  1.157375e-02
## PFTFTUG1_EF  5.401583e-03 -2.966363e-03  1.157375e-02  2.568155e-02
## RET_FT4     -4.473285e-04 -5.685523e-03  1.335546e-02  5.700239e-03
## PCTFLOAN     1.564988e-03  6.671230e-03 -4.494687e-03  3.454553e-03
##                   RET_FT4      PCTFLOAN
## SAT_AVG      7.771277e+00 -7.287525e+00
## UGDS         4.830751e+02 -4.466688e+02
## UGDS_WHITE   1.129095e-03  1.244968e-03
## UGDS_BLACK  -6.430685e-03  1.368123e-02
## UGDS_HISP    1.765587e-03 -9.145455e-03
## UGDS_ASIAN   3.173605e-03 -3.752316e-03
## UGDS_AIAN   -4.895431e-04 -1.656458e-04
## UGDS_NHPI   -1.789240e-05 -1.034155e-04
## UGDS_2MOR    1.561521e-04 -6.137635e-04
## UGDS_NRA     8.495269e-04 -1.289573e-03
## UGDS_UNKN   -1.356578e-04  1.443932e-04
## PPTUG_EF    -4.265497e-03 -2.242967e-03
## NPT4_PUB     1.173998e+02  1.707053e+02
## COSTT4_A     1.933343e+02  9.555363e+01
## TUITFTE      1.704002e+02 -3.347646e+01
## INEXPFTE     1.861454e+02 -1.182968e+02
## PFTFAC      -4.473285e-04  1.564988e-03
## PCTPELL     -5.685523e-03  6.671230e-03
## C150_4       1.335546e-02 -4.494687e-03
## PFTFTUG1_EF  5.700239e-03  3.454553e-03
## RET_FT4      9.335425e-03 -4.348277e-03
## PCTFLOAN    -4.348277e-03  2.056833e-02
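
Covariances depend on the units of each variable, so they are hard to compare directly; dividing by both standard deviations gives the unit-free correlation. As a rough sketch, the SAT_AVG and UGDS entries copied from the matrix above can be rescaled by hand, and `cov2cor()` does the same for a whole matrix (the `toy` data frame is made up for illustration and is not the Scorecard file):

```r
# Entries copied from the covariance matrix printed above
cov_sat_ugds <- 5.329262e+05   # cov(SAT_AVG, UGDS)
var_sat      <- 1.218022e+04   # cov(SAT_AVG, SAT_AVG), i.e. var(SAT_AVG)
var_ugds     <- 8.285945e+07   # cov(UGDS, UGDS), i.e. var(UGDS)

# Pearson correlation = covariance / (sd_x * sd_y)
r_sat_ugds <- cov_sat_ugds / (sqrt(var_sat) * sqrt(var_ugds))
round(r_sat_ugds, 4)

# The same rescaling applied to a whole matrix (toy data, hypothetical)
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 1, 4, 3, 6),
                  c = c(9, 7, 8, 5, 6))
all.equal(cov2cor(cov(toy)), cor(toy))
```

The hand-computed value should land near 0.53, which is the SAT_AVG/UGDS entry of the correlation matrix.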

Correlation Matrix

# Code
# Prints the correlation matrix of 22 quantitative variables
# (columns 8:20 and 22:30, which leaves out ADM_RATE and NPT4_PRIV)
cor(alotofdata[, c(8:20,22:30)], use = "complete.obs", method = "pearson")

##                 SAT_AVG         UGDS  UGDS_WHITE   UGDS_BLACK    UGDS_HISP
## SAT_AVG      1.00000000  0.530479300  0.28746595 -0.498273335 -0.014296713
## UGDS         0.53047930  1.000000000 -0.07988021 -0.214499541  0.189940545
## UGDS_WHITE   0.28746595 -0.079880211  1.00000000 -0.640504825 -0.510819933
## UGDS_BLACK  -0.49827333 -0.214499541 -0.64050482  1.000000000 -0.195838031
## UGDS_HISP   -0.01429671  0.189940545 -0.51081993 -0.195838031  1.000000000
## UGDS_ASIAN   0.38651147  0.411569154 -0.37336043 -0.180846962  0.272726304
## UGDS_AIAN   -0.08283002 -0.133559424  0.02779802 -0.095940606 -0.053373262
## UGDS_NHPI   -0.04327861  0.001066979 -0.13119027 -0.086396045  0.065396461
## UGDS_2MOR    0.12429801  0.138389106 -0.05133611 -0.192585413 -0.015284117
## UGDS_NRA     0.37779987  0.347371663 -0.09341802 -0.184642217  0.005078044
## UGDS_UNKN   -0.08364071 -0.109183967 -0.10052108  0.001003768  0.004694209
## PPTUG_EF    -0.35153340 -0.085961363 -0.14431289  0.071264897  0.171911489
## NPT4_PUB     0.31771591  0.135992698  0.39834227 -0.163834673 -0.377264147
## COSTT4_A     0.46920625  0.302641965  0.15132507 -0.136563560 -0.220627412
## TUITFTE      0.58279415  0.433659734  0.19945712 -0.252478054 -0.174897510
## INEXPFTE     0.55430827  0.384325956 -0.05105544 -0.130741289 -0.030121396
## PFTFAC       0.10958412  0.042687425  0.05435934  0.107788393 -0.149548348
## PCTPELL     -0.65957486 -0.273657783 -0.69371993  0.654815936  0.282475796
## C150_4       0.78142427  0.514879202  0.23372738 -0.397500767 -0.042626820
## PFTFTUG1_EF  0.28721436  0.133212902  0.11317248 -0.032998226 -0.048442764
## RET_FT4      0.72878214  0.549258224  0.04876680 -0.321188726  0.125943239
## PCTFLOAN    -0.46041860 -0.342148794  0.03622592  0.460358093 -0.439499529
##              UGDS_ASIAN    UGDS_AIAN    UGDS_NHPI    UGDS_2MOR
## SAT_AVG      0.38651147 -0.082830022 -0.043278613  0.124298008
## UGDS         0.41156915 -0.133559424  0.001066979  0.138389106
## UGDS_WHITE  -0.37336043  0.027798025 -0.131190270 -0.051336112
## UGDS_BLACK  -0.18084696 -0.095940606 -0.086396045 -0.192585413
## UGDS_HISP    0.27272630 -0.053373262  0.065396461 -0.015284117
## UGDS_ASIAN   1.00000000 -0.116785970  0.266727380  0.281937487
## UGDS_AIAN   -0.11678597  1.000000000  0.010315352  0.307271522
## UGDS_NHPI    0.26672738  0.010315352  1.000000000  0.571383114
## UGDS_2MOR    0.28193749  0.307271522  0.571383114  1.000000000
## UGDS_NRA     0.42598567  0.005045546  0.064387585  0.089685519
## UGDS_UNKN   -0.00507649 -0.033526916 -0.023358338 -0.097039279
## PPTUG_EF    -0.06946890  0.081335340  0.069946628  0.007798176
## NPT4_PUB    -0.05256104 -0.154090495 -0.085541207  0.009771022
## COSTT4_A     0.29903479 -0.178278159 -0.045239423  0.101329394
## TUITFTE      0.27996082 -0.142879791 -0.012605298  0.095467069
## INEXPFTE     0.41980354 -0.093762630  0.088012031  0.159027870
## PFTFAC      -0.18465547  0.078437848 -0.038804571  0.021200933
## PCTPELL     -0.04505273  0.043847855  0.037121709 -0.120991264
## C150_4       0.38598229 -0.175868004 -0.028328245  0.095337175
## PFTFTUG1_EF -0.08060438 -0.101511613 -0.149895984 -0.080394974
## RET_FT4      0.47642548 -0.229793990 -0.032766291  0.064350361
## PCTFLOAN    -0.37949754 -0.052383633 -0.127588575 -0.170400829
##                 UGDS_NRA    UGDS_UNKN     PPTUG_EF     NPT4_PUB
## SAT_AVG      0.377799869 -0.083640707 -0.351533401  0.317715914
## UGDS         0.347371663 -0.109183967 -0.085961363  0.135992698
## UGDS_WHITE  -0.093418015 -0.100521076 -0.144312888  0.398342269
## UGDS_BLACK  -0.184642217  0.001003768  0.071264897 -0.163834673
## UGDS_HISP    0.005078044  0.004694209  0.171911489 -0.377264147
## UGDS_ASIAN   0.425985670 -0.005076490 -0.069468896 -0.052561037
## UGDS_AIAN    0.005045546 -0.033526916  0.081335340 -0.154090495
## UGDS_NHPI    0.064387585 -0.023358338  0.069946628 -0.085541207
## UGDS_2MOR    0.089685519 -0.097039279  0.007798176  0.009771022
## UGDS_NRA     1.000000000 -0.058289096 -0.100386320  0.000410706
## UGDS_UNKN   -0.058289096  1.000000000  0.008009207  0.017036177
## PPTUG_EF    -0.100386320  0.008009207  1.000000000 -0.347802872
## NPT4_PUB     0.000410706  0.017036177 -0.347802872  1.000000000
## COSTT4_A     0.184646253 -0.012697565 -0.434164032  0.796032414
## TUITFTE      0.335920337  0.053270804 -0.305745059  0.537300149
## INEXPFTE     0.374823962  0.044938282 -0.293718218  0.224208270
## PFTFAC      -0.004535840 -0.099374508 -0.097530724 -0.018380517
## PCTPELL     -0.209884488  0.024997107  0.211363497 -0.494496352
## C150_4       0.295458637 -0.039426317 -0.593702072  0.493662415
## PFTFTUG1_EF  0.016583615 -0.107769296 -0.719981174  0.337828069
## RET_FT4      0.298238582 -0.049125507 -0.448758245  0.307557898
## PCTFLOAN    -0.305000206  0.035227068 -0.158976596  0.301282528
##                 COSTT4_A     TUITFTE    INEXPFTE       PFTFAC     PCTPELL
## SAT_AVG      0.469206247  0.58279415  0.55430827  0.109584117 -0.65957486
## UGDS         0.302641965  0.43365973  0.38432596  0.042687425 -0.27365778
## UGDS_WHITE   0.151325068  0.19945712 -0.05105544  0.054359338 -0.69371993
## UGDS_BLACK  -0.136563560 -0.25247805 -0.13074129  0.107788393  0.65481594
## UGDS_HISP   -0.220627412 -0.17489751 -0.03012140 -0.149548348  0.28247580
## UGDS_ASIAN   0.299034792  0.27996082  0.41980354 -0.184655466 -0.04505273
## UGDS_AIAN   -0.178278159 -0.14287979 -0.09376263  0.078437848  0.04384786
## UGDS_NHPI   -0.045239423 -0.01260530  0.08801203 -0.038804571  0.03712171
## UGDS_2MOR    0.101329394  0.09546707  0.15902787  0.021200933 -0.12099126
## UGDS_NRA     0.184646253  0.33592034  0.37482396 -0.004535840 -0.20988449
## UGDS_UNKN   -0.012697565  0.05327080  0.04493828 -0.099374508  0.02499711
## PPTUG_EF    -0.434164032 -0.30574506 -0.29371822 -0.097530724  0.21136350
## NPT4_PUB     0.796032414  0.53730015  0.22420827 -0.018380517 -0.49449635
## COSTT4_A     1.000000000  0.69928402  0.44799088  0.004901918 -0.37535448
## TUITFTE      0.699284022  1.00000000  0.60919524  0.073127354 -0.51478553
## INEXPFTE     0.447990875  0.60919524  1.00000000  0.050295295 -0.29170949
## PFTFAC       0.004901918  0.07312735  0.05029529  1.000000000 -0.04228121
## PCTPELL     -0.375354477 -0.51478553 -0.29170949 -0.042281212  1.00000000
## C150_4       0.650950779  0.62112656  0.52552835  0.057388350 -0.59873803
## PFTFTUG1_EF  0.360493250  0.29288779  0.18335455  0.194340342 -0.14403443
## RET_FT4      0.488599391  0.51540941  0.48833245 -0.026693903 -0.45788466
## PCTFLOAN     0.162689205 -0.06821649 -0.20907570  0.062916419  0.36195869
##                  C150_4 PFTFTUG1_EF     RET_FT4    PCTFLOAN
## SAT_AVG      0.78142427  0.28721436  0.72878214 -0.46041860
## UGDS         0.51487920  0.13321290  0.54925822 -0.34214879
## UGDS_WHITE   0.23372738  0.11317248  0.04876680  0.03622592
## UGDS_BLACK  -0.39750077 -0.03299823 -0.32118873  0.46035809
## UGDS_HISP   -0.04262682 -0.04844276  0.12594324 -0.43949953
## UGDS_ASIAN   0.38598229 -0.08060438  0.47642548 -0.37949754
## UGDS_AIAN   -0.17586800 -0.10151161 -0.22979399 -0.05238363
## UGDS_NHPI   -0.02832825 -0.14989598 -0.03276629 -0.12758857
## UGDS_2MOR    0.09533718 -0.08039497  0.06435036 -0.17040083
## UGDS_NRA     0.29545864  0.01658361  0.29823858 -0.30500021
## UGDS_UNKN   -0.03942632 -0.10776930 -0.04912551  0.03522707
## PPTUG_EF    -0.59370207 -0.71998117 -0.44875825 -0.15897660
## NPT4_PUB     0.49366241  0.33782807  0.30755790  0.30128253
## COSTT4_A     0.65095078  0.36049325  0.48859939  0.16268921
## TUITFTE      0.62112656  0.29288779  0.51540941 -0.06821649
## INEXPFTE     0.52552835  0.18335455  0.48833245 -0.20907570
## PFTFAC       0.05738835  0.19434034 -0.02669390  0.06291642
## PCTPELL     -0.59873803 -0.14403443 -0.45788466  0.36195869
## C150_4       1.00000000  0.44273254  0.84736375 -0.19212234
## PFTFTUG1_EF  0.44273254  1.00000000  0.36814206  0.15030786
## RET_FT4      0.84736375  0.36814206  1.00000000 -0.31379828
## PCTFLOAN    -0.19212234  0.15030786 -0.31379828  1.00000000
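
The matrices above use use = "complete.obs", while the ranking code in the Proof sections uses use = "pairwise.complete.obs". The two rules can give different coefficients for the same pair of variables because they keep different rows. A small sketch on made-up data (the `toy` data frame and its column names are hypothetical) illustrates the difference:

```r
# Toy data with scattered missing values (hypothetical, not the Scorecard file)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, 5, 4, 5, NA),
  w = c(NA, 1, 0, 2, 1, 3)
)

# complete.obs: drop every row with an NA in ANY column, then correlate
r_complete <- cor(toy, use = "complete.obs")

# pairwise.complete.obs: for each pair, drop only rows missing in THAT pair
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Number of rows that contribute to the x-y correlation under each rule
n_complete    <- sum(complete.cases(toy))
n_pairwise_xy <- sum(complete.cases(toy[, c("x", "y")]))

r_complete["x", "y"]
r_pairwise["x", "y"]
```

Here the x-y coefficient is computed from fewer rows under complete.obs than under pairwise.complete.obs, so the two values differ even though the data are the same.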

Top 5 strongest correlations

Results

The top 5 strongest correlations in order are:

1. Average cost of attendance vs. average net price for Title IV public institutions

2. Completion rate vs. average SAT score

3. Full-time undergraduates vs. part-time undergraduates

4. Retention rate vs. average SAT score

5. Retention rate vs. Completion rate

Proof

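# Code
# Ranks variable pairs by absolute value, largest first, and prints the top
# 10 rows (each pair appears twice, so these are the top 5 pairs).
# Caution: z is already a correlation matrix, so cor(z) correlates the columns
# of z with each other; the values printed below are therefore not the entries
# of z itself. melt(z) would rank the pairwise correlations directly.
library(reshape)
z <- cor(alotofdata[, c(8:20,22:30)], use = "pairwise.complete.obs", method = "pearson")
x <- subset(melt(cor(z)), value != 1)
xl <- x[order(-abs(x$value)), ]
xl[1:10, ]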

##              X1          X2      value
## 278    COSTT4_A    NPT4_PUB  0.9446102
## 299    NPT4_PUB    COSTT4_A  0.9446102
## 19       C150_4     SAT_AVG  0.9364503
## 397     SAT_AVG      C150_4  0.9364503
## 262 PFTFTUG1_EF    PPTUG_EF -0.9287725
## 430    PPTUG_EF PFTFTUG1_EF -0.9287725
## 21      RET_FT4     SAT_AVG  0.9285524
## 441     SAT_AVG     RET_FT4  0.9285524
## 417     RET_FT4      C150_4  0.9148492
## 459      C150_4     RET_FT4  0.9148492
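
Each pair shows up twice in this listing because a correlation matrix is symmetric. One way to list every pair once is to read only the upper triangle of the matrix. The sketch below ranks the entries of a correlation matrix directly; it uses R's built-in `mtcars` data rather than the Scorecard file, and the names `pairs` and `ranked` are illustrative:

```r
# Correlation matrix of a few numeric columns (built-in mtcars, for illustration)
r <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")],
         use = "pairwise.complete.obs")

# Keep each pair once: positions in the upper triangle, diagonal excluded
idx <- which(upper.tri(r), arr.ind = TRUE)
pairs <- data.frame(
  var1  = rownames(r)[idx[, "row"]],
  var2  = colnames(r)[idx[, "col"]],
  value = r[idx],
  stringsAsFactors = FALSE
)

# Strongest first, by absolute value; tail(ranked, 5) gives the weakest pairs
ranked <- pairs[order(-abs(pairs$value)), ]
head(ranked, 5)
```

With five variables this yields ten distinct pairs, and no filter on the diagonal is needed because `upper.tri()` already excludes it.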

Top 5 weakest correlations

Results

The top 5 weakest correlations in order are:

1. Enrollments of unknown race vs. Enrollments of Native Hawaiian/Pacific Islander

2. Full-time undergraduates vs. Enrollments of students who have two or more races 

3. Net tuition revenue per full-time student vs. Enrollments of unknown race 

4. Part-time undergraduates vs. Enrollments of students who have two or more races

5. Average cost of attendance vs. Enrollments of unknown race

Proof

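# Code
# Ranks variable pairs by absolute value, smallest first, and prints the first
# 10 rows (each pair appears twice, so these are the 5 weakest pairs).
# Caution: z is already a correlation matrix, so cor(z) correlates the columns
# of z with each other; melt(z) would rank the pairwise correlations themselves.
library(reshape)
z <- cor(alotofdata[, c(8:20,22:30)], use = "pairwise.complete.obs", method = "pearson")
x <- subset(melt(cor(z)), value != 1)
xs <- x[order(abs(x$value)), ]
xs[1:10, ]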

##              X1          X2         value
## 165   UGDS_UNKN   UGDS_NHPI  0.0002322099
## 228   UGDS_NHPI   UGDS_UNKN  0.0002322099
## 196 PFTFTUG1_EF   UGDS_2MOR -0.0114305185
## 427   UGDS_2MOR PFTFTUG1_EF -0.0114305185
## 235     TUITFTE   UGDS_UNKN -0.0127101627
## 319   UGDS_UNKN     TUITFTE -0.0127101627
## 188    PPTUG_EF   UGDS_2MOR  0.0137433720
## 251   UGDS_2MOR    PPTUG_EF  0.0137433720
## 234    COSTT4_A   UGDS_UNKN  0.0153987678
## 297   UGDS_UNKN    COSTT4_A  0.0153987678

3․ What is the strongest relation in the data? What are the variables involved and what is the correlation? What is the weakest relation in the data? What are the variables involved and what is the correlation?

Strongest Correlation

Results

The strongest correlation is between average cost of attendance and average net price for Title IV public institutions. The reported correlation coefficient is r = 0.9446102.

Proof

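# Code
# Finds the pair with the largest absolute value.
# Caution: z is already a correlation matrix, so cor(z) correlates the columns
# of z with each other; melt(z) would rank the pairwise correlations themselves.
library(reshape)
z <- cor(alotofdata[, c(8:20,22:30)], use = "pairwise.complete.obs", method = "pearson")
x <- subset(melt(cor(z)), value != 1)
xl <- x[order(-abs(x$value)), ]
xl[1, ]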

##           X1       X2     value
## 278 COSTT4_A NPT4_PUB 0.9446102

Weakest Correlation

Results

The weakest correlation is between enrollments of unknown race and enrollments of Native Hawaiian/Pacific Islander students. The reported correlation coefficient is r = 0.0002322099.

Proof

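# Code
# Finds the pair with the smallest absolute value.
# Caution: z is already a correlation matrix, so cor(z) correlates the columns
# of z with each other; melt(z) would rank the pairwise correlations themselves.
library(reshape)
z <- cor(alotofdata[, c(8:20,22:30)], use = "pairwise.complete.obs", method = "pearson")
x <- subset(melt(cor(z)), value != 1)
xs <- x[order(abs(x$value)), ]
xs[1, ]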

##            X1        X2        value
## 165 UGDS_UNKN UGDS_NHPI 0.0002322099

Simple Regression

Run simple regression analyses as outlined below. For each one, report the results of your analyses, including a visual representation, a numerical summary, and a written interpretation.

Predict the average cost of attendance from average SAT score.

Visual representation

# Code
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$SAT_AVG, alotofdata$COSTT4_A, ylim = c(0, 70000), main = 'Scatterplot of Average SAT Score and Average Cost of Attendance', xlab = 'Average SAT score', ylab = 'Average cost of attendance')
abline(lm(alotofdata$COSTT4_A ~ alotofdata$SAT_AVG))

Numerical representation

# Code
# Run a simple regression analysis in which we predict COSTT4_A from SAT_AVG
library(lm.beta)
reg <- lm(COSTT4_A ~ SAT_AVG, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = COSTT4_A ~ SAT_AVG, data = alotofdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32053 -10000    942   9354  25867 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22454.499   2520.612  -8.908   <2e-16 ***
## SAT_AVG         51.901      2.361  21.984   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11350 on 1296 degrees of freedom
##   (6405 observations deleted due to missingness)
## Multiple R-squared:  0.2716, Adjusted R-squared:  0.2711 
## F-statistic: 483.3 on 1 and 1296 DF,  p-value: < 2.2e-16

coefficients(reg)

##  (Intercept)      SAT_AVG 
## -22454.49939     51.90104

lm.beta(reg)

## 
## Call:
## lm(formula = COSTT4_A ~ SAT_AVG, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     SAT_AVG 
##   0.0000000   0.5211713

Written interpretation

We are running a Simple Linear Regression analysis to predict average cost of attendance from average SAT score.

The visual representation is a scatterplot of all the data points for average cost of attendance and average SAT score.

The line through the points is the regression line, that is, the line of best fit for predicting average cost of attendance from average SAT score.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our model for this problem is: Average cost of attendance = B0 + B1 × Average SAT score + e

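B0 = -22454.49939. This means that if the average SAT score were 0, the predicted average cost of attendance would be -22454.49939. Because the observed SAT averages range from 720 to 1545, this is an extrapolation, and the intercept has no practical meaning on its own.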

B1 = 51.90104. This is the slope, the numerical relationship between Y and X. It means that for each one-point increase in average SAT score, the expected average cost of attendance increases by 51.90104.
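
To see what the two coefficients imply in practice, they can be plugged into the model equation. The sketch below uses the coefficients printed by coefficients(reg); the three SAT values are hypothetical inputs chosen inside the observed range, not schools from the data:

```r
# Coefficients copied from coefficients(reg) above
b0 <- -22454.49939
b1 <- 51.90104

# Hypothetical SAT averages inside the observed range (720 to 1545)
sat <- c(900, 1100, 1300)

# Predicted average cost of attendance: b0 + b1 * SAT
predicted_cost <- b0 + b1 * sat
round(predicted_cost)

# Each step above is 200 SAT points, so half the difference is 100 * b1
diff(predicted_cost) / 2
```

With the fitted object available, predict(reg, newdata = data.frame(SAT_AVG = sat)) should give the same predictions without copying the coefficients by hand.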

The standard error of the slope is 2.361. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.
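The standard error of the slope connects directly to the t value in the output: the t value is the estimate divided by its standard error. The sketch below reproduces that from the rounded values printed by `summary(reg)` and adds an approximate 95% confidence interval; the 95% level is an assumption chosen for illustration.

```r
# Rounded values taken from the summary(reg) output above
est <- 51.901   # slope estimate for SAT_AVG
se  <- 2.361    # standard error of the slope
df  <- 1296     # residual degrees of freedom

# t value = estimate / standard error (matches the reported 21.984 up to rounding)
t_value <- est / se

# Approximate 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se

t_value
ci
```

On the fitted model itself, `confint(reg)` would give the same interval without rounding error.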

The Residual standard error is 11350. This is the typical size of a residual, in the units of average cost of attendance: on average, the model's predictions from the average SAT score are off by about 11350. The closer to 0, the better the model fits.
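To make the definition concrete, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below checks this on simulated data; the variables `x` and `y` and their parameters are hypothetical and are not the College Scorecard data.

```r
# Simulated data (hypothetical, not the College Scorecard file)
set.seed(42)
x <- rnorm(200, mean = 1050, sd = 120)
y <- -20000 + 50 * x + rnorm(200, sd = 11000)

fit <- lm(y ~ x)

# Residual standard error: sqrt(residual sum of squares / residual df)
rss        <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

rse_manual
sigma(fit)   # the same quantity, as reported by summary(fit)
```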

The adjusted R^2 is 0.2711 (the multiple R^2 is 0.2716). R^2 is the coefficient of determination, the proportion of variance in average cost of attendance that is explained by the average SAT score, i.e., the ratio of explained variance to total variance. For our model, 27.11% of the variance in average cost of attendance is explained by the average SAT score.
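In a simple regression with one predictor, the multiple R^2 equals the square of the standardized coefficient reported by `lm.beta()`, and the adjusted R^2 follows from it with a small-sample correction. The sketch below checks both against the output above; the sample size of 1298 is inferred from the 1296 residual degrees of freedom plus the two estimated coefficients.

```r
# Standardized slope reported by lm.beta(reg) above
beta_std <- 0.5211713

# In simple regression, R^2 is the squared standardized slope
r2 <- beta_std^2

# Adjusted R^2 with n observations and p = 1 predictor
n <- 1296 + 2
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(r2, 4)      # compare with "Multiple R-squared" in the summary
round(adj_r2, 4)  # compare with "Adjusted R-squared" in the summary
```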

The p-value for the slope (< 2e-16) is far below 0.05, so the slope of 51.901 is statistically significant.
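The p-value in the coefficient table is a two-sided tail probability of the t distribution. A minimal sketch, using the t value and residual degrees of freedom printed in the summary above, shows where the `<2e-16` comes from:

```r
# Values taken from the summary(reg) output above
t_value <- 21.984
df      <- 1296

# Two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

p_value < 2e-16   # R prints such tiny values as "<2e-16"
```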

Predict the average cost of attendance from admission rate.

Visual representation

# Code
# Predict COSTT4_A from ADM_RATE
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$ADM_RATE, alotofdata$COSTT4_A, ylim = c(0, 70000), main = 'Scatterplot of Admission Rate and Average Cost of Attendance', xlab = 'Admission Rate', ylab = 'Average Cost of Attendance')
abline(lm(alotofdata$COSTT4_A ~ alotofdata$ADM_RATE))

Numerical representation

# Code
# Run a simple regression analysis in which we predict COSTT4_A from ADM_RATE
library(lm.beta)
reg <- lm(COSTT4_A ~ ADM_RATE, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = COSTT4_A ~ ADM_RATE, data = alotofdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31765  -9986  -1323   9705  36241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  43930.1      981.5   44.76   <2e-16 ***
## ADM_RATE    -17799.4     1380.0  -12.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12620 on 1994 degrees of freedom
##   (5707 observations deleted due to missingness)
## Multiple R-squared:  0.07701,    Adjusted R-squared:  0.07654 
## F-statistic: 166.4 on 1 and 1994 DF,  p-value: < 2.2e-16

coefficients(reg)

## (Intercept)    ADM_RATE 
##    43930.13   -17799.40

lm.beta(reg)

## 
## Call:
## lm(formula = COSTT4_A ~ ADM_RATE, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)    ADM_RATE 
##   0.0000000  -0.2775023

Written interpretation

We are running a Simple Linear Regression analysis to predict average cost of attendance from admission rate.

The visual representation is a scatterplot of all the data points for average cost of attendance and admission rate.

The line in the middle represents the regression line. This line represents the line of best fit for predicting average cost of attendance from admission rate.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Average cost of attendance = B0 + B1 X Admission rate + e

The B0 = 43930.1. This means if the admission rate is 0, then the average cost of attendance on average is 43930.1.

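The B1 = -17799.4. This is the numerical relationship between Y and X. This means that given a one-unit increase in admission rate, the expected change in average cost of attendance is -17799.4. Assuming the admission rate is stored as a proportion between 0 and 1, a one-unit increase means going from an admission rate of 0 to an admission rate of 1.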
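To make this slope easier to read, it can be re-expressed per 10 percentage points of admission rate. The sketch below assumes `ADM_RATE` is stored as a proportion between 0 and 1 (the data dictionary should confirm this); the example admission rates of 0.20 and 0.80 are illustrative.

```r
# Coefficients as reported by coefficients(reg) above
b0 <- 43930.13
b1 <- -17799.40

# Change in predicted cost for a 0.10 (10 percentage point) increase in ADM_RATE
slope_per_10_points <- b1 * 0.10

# Predicted cost at a selective (0.20) and a less selective (0.80) admission rate
pred_20 <- b0 + b1 * 0.20
pred_80 <- b0 + b1 * 0.80

slope_per_10_points
c(pred_20, pred_80)
```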

The standard error of the slope is 1380.0. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 12620. This is the typical size of a residual, in the units of average cost of attendance: on average, the model's predictions from the admission rate are off by about 12620. The closer to 0, the better the model fits.

The adjusted R^2 is 0.07654 (the multiple R^2 is 0.07701). R^2 is the coefficient of determination, the proportion of variance in average cost of attendance that is explained by the admission rate, i.e., the ratio of explained variance to total variance. For our model, 7.654% of the variance in average cost of attendance is explained by the admission rate.

The p-value for the slope (< 2e-16) is far below 0.05, so the slope of -17799.4 is statistically significant.

Predict the number of students from average SAT score

Visual representation

# Code
# Predict UGDS from SAT_AVG
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$SAT_AVG, alotofdata$UGDS, ylim = c(0, 55000),  main = 'Scatterplot of Average SAT Score and Number of Students', xlab = 'Average SAT Score', ylab = 'Number of Students')
abline(lm(alotofdata$UGDS ~ alotofdata$SAT_AVG))

Numerical representation

# Code
# Run a simple regression analysis in which we predict UGDS from SAT_AVG
library(lm.beta)
reg <- lm(UGDS ~ SAT_AVG, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = UGDS ~ SAT_AVG, data = alotofdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11197  -3981  -2485   1077  45035 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8999.262   1573.214  -5.720 1.32e-08 ***
## SAT_AVG        13.708      1.474   9.301  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7095 on 1302 degrees of freedom
##   (6399 observations deleted due to missingness)
## Multiple R-squared:  0.06231,    Adjusted R-squared:  0.06159 
## F-statistic: 86.51 on 1 and 1302 DF,  p-value: < 2.2e-16

coefficients(reg)

## (Intercept)     SAT_AVG 
## -8999.26201    13.70842

lm.beta(reg)

## 
## Call:
## lm(formula = UGDS ~ SAT_AVG, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     SAT_AVG 
##   0.0000000   0.2496111

Written interpretation

We are running a Simple Linear Regression analysis to predict number of students from average SAT score.

The visual representation is a scatterplot of all the data points for number of students and average SAT score.

The line in the middle represents the regression line. This line represents the line of best fit for predicting number of students from average SAT score.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Number of students = B0 + B1 X Average SAT score + e

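The B0 = -8999.262. This means that if the average SAT score is 0, the predicted number of students is -8999.262. An average SAT score of 0 lies far outside the observed data, so this value is an extrapolation and a negative number of students has no practical meaning.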
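One common way to get an interpretable intercept is to center the predictor at its mean before fitting. This is not what the analysis above does; the sketch below only illustrates the idea on simulated data, and the variables `sat` and `ugds` are hypothetical stand-ins for `SAT_AVG` and `UGDS`.

```r
# Simulated data (hypothetical, not the College Scorecard file)
set.seed(1)
sat  <- rnorm(300, mean = 1050, sd = 130)
ugds <- -9000 + 13.7 * sat + rnorm(300, sd = 3000)

fit_raw <- lm(ugds ~ sat)

# Center the predictor: the intercept becomes the predicted outcome at the mean SAT
sat_c   <- sat - mean(sat)
fit_ctr <- lm(ugds ~ sat_c)

coef(fit_raw)
coef(fit_ctr)   # same slope; intercept now equals mean(ugds)
```

Centering changes only the intercept; the slope, residual standard error, and R^2 are unchanged.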

The B1 = 13.708. This is the numerical relationship between Y and X. This means that given one unit increase in average SAT score, the expected change in number of students on average is 13.708.

The standard error of the slope is 1.474. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 7095. This is the typical size of a residual, in the units of number of students: on average, the model's predictions from the average SAT score are off by about 7095. The closer to 0, the better the model fits.

The adjusted R^2 is 0.06159 (the multiple R^2 is 0.06231). R^2 is the coefficient of determination, the proportion of variance in number of students that is explained by the average SAT score, i.e., the ratio of explained variance to total variance. For our model, 6.159% of the variance in number of students is explained by the average SAT score.

Predict the number of students from admission rate

Visual representation

# Code
# Predict UGDS from ADM_RATE
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$ADM_RATE, alotofdata$UGDS, ylim = c(0, 55000),  main = 'Scatterplot of Admission Rate and Number of Students', xlab = 'Admission Rate', ylab = 'Number of Students')
abline(lm(alotofdata$UGDS ~ alotofdata$ADM_RATE))

Numerical representation

# Code
# Run a simple regression analysis in which we predict UGDS from ADM_RATE
library(lm.beta)
reg <- lm(UGDS ~ ADM_RATE, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = UGDS ~ ADM_RATE, data = alotofdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6377  -3081  -2259     81  47716 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6474.7      462.7  13.994  < 2e-16 ***
## ADM_RATE     -3852.2      640.6  -6.013 2.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6273 on 2196 degrees of freedom
##   (5505 observations deleted due to missingness)
## Multiple R-squared:  0.0162, Adjusted R-squared:  0.01575 
## F-statistic: 36.16 on 1 and 2196 DF,  p-value: 2.122e-09

coefficients(reg)

## (Intercept)    ADM_RATE 
##    6474.726   -3852.210

lm.beta(reg)

## 
## Call:
## lm(formula = UGDS ~ ADM_RATE, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)    ADM_RATE 
##   0.0000000  -0.1272796

Written interpretation

We are running a Simple Linear Regression analysis to predict number of students from admission rate.

The visual representation is a scatterplot of all the data points for number of students and admission rate.

The line in the middle represents the regression line. This line represents the line of best fit for predicting number of students from admission rate.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Number of students = B0 + B1 X admission rate + e

The B0 = 6474.7. This means if the admission rate is 0, then the number of students on average is 6474.7.

The B1 = -3852.2. This is the numerical relationship between Y and X. This means that given one unit increase in admission rate, the expected change in number of students on average is -3852.2.

The standard error of the slope is 640.6. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 6273. This is the typical size of a residual, in the units of number of students: on average, the model's predictions from the admission rate are off by about 6273. The closer to 0, the better the model fits.

The adjusted R^2 is 0.01575 (the multiple R^2 is 0.0162). R^2 is the coefficient of determination, the proportion of variance in number of students that is explained by the admission rate, i.e., the ratio of explained variance to total variance. For our model, 1.575% of the variance in number of students is explained by the admission rate.

Predict completion rate from average SAT score

Visual representation

# Code
# Predict C150_4 from SAT_AVG
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$SAT_AVG, alotofdata$C150_4, ylim = c(0, 1),  main = 'Scatterplot of Average SAT Score and Completion Rate', xlab = 'Average SAT Score', ylab = 'Completion Rate')
abline(lm(alotofdata$C150_4 ~ alotofdata$SAT_AVG))

Numerical representation

# Code
# Run a simple regression analysis in which we predict C150_4 from SAT_AVG
library(lm.beta)
reg <- lm(C150_4 ~ SAT_AVG, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = C150_4 ~ SAT_AVG, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52185 -0.06537  0.00721  0.06898  0.46134 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5741736  0.0237624  -24.16   <2e-16 ***
## SAT_AVG      0.0010598  0.0000222   47.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1049 on 1269 degrees of freedom
##   (6432 observations deleted due to missingness)
## Multiple R-squared:  0.6424, Adjusted R-squared:  0.6421 
## F-statistic:  2279 on 1 and 1269 DF,  p-value: < 2.2e-16

coefficients(reg)

##  (Intercept)      SAT_AVG 
## -0.574173650  0.001059838

lm.beta(reg)

## 
## Call:
## lm(formula = C150_4 ~ SAT_AVG, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     SAT_AVG 
##   0.0000000   0.8014783

Written interpretation

We are running a Simple Linear Regression analysis to predict completion rate from average SAT score.

The visual representation is a scatterplot of all the data points for completion rate and average SAT score.

The line in the middle represents the regression line. This line represents the line of best fit for predicting completion rate from average SAT score.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Completion rate = B0 + B1 X Average SAT score + e

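The B0 = -0.5741736. This means that if the average SAT score is 0, the predicted completion rate is -0.5741736. An average SAT score of 0 lies far outside the observed data, so this value is an extrapolation and a negative completion rate has no practical meaning.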

The B1 = 0.0010598. This is the numerical relationship between Y and X. This means that given one unit increase in average SAT score, the expected change in the completion rate on average is 0.0010598.

The standard error of the slope is 0.0000222. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.1049. This is the typical size of a residual, in the units of completion rate: on average, the model's predictions from the average SAT score are off by about 0.1049. The closer to 0, the better the model fits.

The adjusted R^2 is 0.6421 (the multiple R^2 is 0.6424). R^2 is the coefficient of determination, the proportion of variance in completion rate that is explained by the average SAT score, i.e., the ratio of explained variance to total variance. For our model, 64.21% of the variance in completion rate is explained by the average SAT score.

Predict completion rate from admission rate

Visual representation

# Code
# Predict C150_4 from ADM_RATE
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$ADM_RATE, alotofdata$C150_4, ylim = c(0, 1),  main = 'Scatterplot of Admission Rate and Completion Rate', xlab = 'Admission Rate', ylab = 'Completion Rate')
abline(lm(alotofdata$C150_4 ~ alotofdata$ADM_RATE))

Numerical representation

# Code
# Run a simple regression analysis in which we predict C150_4 from ADM_RATE
library(lm.beta)
reg <- lm(C150_4 ~ ADM_RATE, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = C150_4 ~ ADM_RATE, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58476 -0.12368  0.00175  0.13908  0.57278 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.73479    0.01509   48.68   <2e-16 ***
## ADM_RATE    -0.30757    0.02143  -14.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.182 on 1804 degrees of freedom
##   (5897 observations deleted due to missingness)
## Multiple R-squared:  0.1025, Adjusted R-squared:  0.102 
## F-statistic: 206.1 on 1 and 1804 DF,  p-value: < 2.2e-16

coefficients(reg)

## (Intercept)    ADM_RATE 
##   0.7347908  -0.3075745

lm.beta(reg)

## 
## Call:
## lm(formula = C150_4 ~ ADM_RATE, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)    ADM_RATE 
##   0.0000000  -0.3201865

Written interpretation

We are running a Simple Linear Regression analysis to predict completion rate from admission rate.

The visual representation is a scatterplot of all the data points for completion rate and admission rate.

The line in the middle represents the regression line. This line represents the line of best fit for predicting completion rate from admission rate.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Completion rate = B0 + B1 X Admission rate + e

The B0 = 0.73479. This means if the admission rate is 0, then the completion rate on average is 0.73479.

The B1 = -0.30757. This is the numerical relationship between Y and X. This means that given one unit increase in the admission rate, the expected change in the completion rate on average is -0.30757.

The standard error of the slope is 0.02143. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.182. This is the typical size of a residual, in the units of completion rate: on average, the model's predictions from the admission rate are off by about 0.182. The closer to 0, the better the model fits.

The adjusted R^2 is 0.102 (the multiple R^2 is 0.1025). R^2 is the coefficient of determination, the proportion of variance in completion rate that is explained by the admission rate, i.e., the ratio of explained variance to total variance. For our model, 10.2% of the variance in completion rate is explained by the admission rate.

Predict the percentage of students with federal loans from average SAT score

Visual representation

# Code
# Predict PCTFLOAN from SAT_AVG
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$SAT_AVG, alotofdata$PCTFLOAN, ylim = c(0, 1),  main = 'Scatterplot of Average SAT Score and % of Students with Fed. Loans', xlab = 'Average SAT Score', ylab = '% of Students with Fed. Loans')
abline(lm(alotofdata$PCTFLOAN ~ alotofdata$SAT_AVG))

Numerical representation

# Code
# Run a simple regression analysis in which we predict PCTFLOAN from SAT_AVG
library(lm.beta)
reg <- lm(PCTFLOAN ~ SAT_AVG, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = PCTFLOAN ~ SAT_AVG, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72338 -0.07918  0.01720  0.10087  0.34595 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.276e+00  3.239e-02    39.4   <2e-16 ***
## SAT_AVG     -6.493e-04  3.035e-05   -21.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.146 on 1300 degrees of freedom
##   (6401 observations deleted due to missingness)
## Multiple R-squared:  0.2604, Adjusted R-squared:  0.2599 
## F-statistic: 457.8 on 1 and 1300 DF,  p-value: < 2.2e-16

coefficients(reg)

##   (Intercept)       SAT_AVG 
##  1.2759380530 -0.0006493094

lm.beta(reg)

## 
## Call:
## lm(formula = PCTFLOAN ~ SAT_AVG, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     SAT_AVG 
##   0.0000000  -0.5103375

Written interpretation

We are running a Simple Linear Regression analysis to predict percentage of students with federal loans from average SAT score.

The visual representation is a scatterplot of all the data points for percentage of students with federal loans and average SAT score.

The line in the middle represents the regression line. This line represents the line of best fit for predicting percentage of students with federal loans from average SAT score.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: % of students with federal loans = B0 + B1 X Average SAT score + e

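The B0 = 1.27593805. This means that if the average SAT score is 0, the predicted percentage of students with federal loans is 1.27593805. An average SAT score of 0 lies far outside the observed data, so this value is an extrapolation; on the 0 to 1 scale used in the plot, a value above 1 is not a possible share of students.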

The B1 = -0.00064931. This is the numerical relationship between Y and X. This means that given one unit increase in the average SAT score, the expected change in the percentage of students with federal loans on average is -0.00064931.

The standard error of the slope is 0.00003035. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.146. This is the typical size of a residual, in the units of the percentage of students with federal loans: on average, the model's predictions from the average SAT score are off by about 0.146. The closer to 0, the better the model fits.

The adjusted R^2 is 0.2599 (the multiple R^2 is 0.2604). R^2 is the coefficient of determination, the proportion of variance in the percentage of students with federal loans that is explained by the average SAT score, i.e., the ratio of explained variance to total variance. For our model, 25.99% of the variance in the percentage of students with federal loans is explained by the average SAT score.

Predict the percentage of students with federal loans from admission rate

Visual representation

# Code
# Predict PCTFLOAN from ADM_RATE
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$ADM_RATE, alotofdata$PCTFLOAN, ylim = c(0, 1),  main = 'Scatterplot of Admission Rate and % of Students with Fed. Loans', xlab = 'Admission Rate', ylab = '% of Students with Fed. Loans')
abline(lm(alotofdata$PCTFLOAN ~ alotofdata$ADM_RATE))

Numerical representation

# Code
# Run a simple regression analysis in which we predict PCTFLOAN from ADM_RATE
library(lm.beta)
reg <- lm(PCTFLOAN ~ ADM_RATE, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = PCTFLOAN ~ ADM_RATE, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62435 -0.11274  0.03085  0.15665  0.44315 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.51186    0.01603  31.923  < 2e-16 ***
## ADM_RATE     0.11249    0.02219   5.069 4.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2168 on 2193 degrees of freedom
##   (5508 observations deleted due to missingness)
## Multiple R-squared:  0.01158,    Adjusted R-squared:  0.01113 
## F-statistic:  25.7 on 1 and 2193 DF,  p-value: 4.324e-07

coefficients(reg)

## (Intercept)    ADM_RATE 
##   0.5118554   0.1124929

lm.beta(reg)

## 
## Call:
## lm(formula = PCTFLOAN ~ ADM_RATE, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)    ADM_RATE 
##   0.0000000   0.1076242

Written interpretation

We are running a Simple Linear Regression analysis to predict percentage of students with federal loans from admission rate.

The visual representation is a scatterplot of all the data points for percentage of students with federal loans and admission rate.

The line in the middle represents the regression line. This line represents the line of best fit for predicting percentage of students with federal loans from admission rate.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: % of students with federal loans = B0 + B1 X Admission rate + e

The B0 = 0.51186. This means if the admission rate is 0, then the percentage of students with federal loans on average is 0.51186.

The B1 = 0.11249. This is the numerical relationship between Y and X. This means that given one unit increase in the admission rate, the expected change in the percentage of students with federal loans on average is 0.11249.

The standard error of the slope is 0.02219. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.2168. This is the typical size of a residual, in the units of the percentage of students with federal loans: on average, the model's predictions from the admission rate are off by about 0.2168. The closer to 0, the better the model fits.

The adjusted R^2 is 0.01113 (the multiple R^2 is 0.01158). R^2 is the coefficient of determination, the proportion of variance in the percentage of students with federal loans that is explained by the admission rate, i.e., the ratio of explained variance to total variance. For our model, 1.113% of the variance in the percentage of students with federal loans is explained by the admission rate.

Run two additional regression equations using any variables you like from the dataset. Provide an interpretation of the results of these additional analyses.

12.1 Predict the completion rate from the retention rate

Visual representation

# Code
# Predict C150_4 from RET_FT4
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$RET_FT4, alotofdata$C150_4, ylim = c(0, 1),  main = 'Scatterplot of Retention Rate and Completion Rate', xlab = 'Retention Rate', ylab = 'Completion Rate')
abline(lm(alotofdata$C150_4 ~ alotofdata$RET_FT4))

Numerical representation

# Code
# Run a simple regression analysis in which we predict C150_4 from RET_FT4
library(lm.beta)
reg <- lm(C150_4 ~ RET_FT4, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = C150_4 ~ RET_FT4, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67726 -0.10017  0.00425  0.10588  0.97935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02065    0.01443   1.431    0.153    
## RET_FT4      0.65661    0.01963  33.455   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1729 on 2196 degrees of freedom
##   (5505 observations deleted due to missingness)
## Multiple R-squared:  0.3376, Adjusted R-squared:  0.3373 
## F-statistic:  1119 on 1 and 2196 DF,  p-value: < 2.2e-16

coefficients(reg)

## (Intercept)     RET_FT4 
##  0.02065477  0.65660875

lm.beta(reg)

## 
## Call:
## lm(formula = C150_4 ~ RET_FT4, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     RET_FT4 
##   0.0000000   0.5810345

Written interpretation

We are running a Simple Linear Regression analysis to predict completion rate from retention rate.

The visual representation is a scatterplot of all the data points for completion rate and retention rate.

The line in the middle represents the regression line. This line represents the line of best fit for predicting completion rate from retention rate.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Completion rate = B0 + B1 X Retention rate + e

The B0 = 0.02065. This means that if the retention rate is 0, the predicted completion rate is 0.02065. With a p-value of 0.153, this intercept is not significantly different from 0.

The B1 = 0.65661. This is the numerical relationship between Y and X. This means that given one unit increase in the retention rate, the expected change in the completion rate on average is 0.65661.

The standard error of the slope is 0.01963. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.1729. This is the typical size of a residual, in the units of completion rate: on average, the model's predictions from the retention rate are off by about 0.1729. The closer to 0, the better the model fits.

The adjusted R^2 is 0.3373 (the multiple R^2 is 0.3376). R^2 is the coefficient of determination, the proportion of variance in completion rate that is explained by the retention rate, i.e., the ratio of explained variance to total variance. For our model, 33.73% of the variance in completion rate is explained by the retention rate.

12.2 Predict the retention rate from average SAT scores

Visual representation

# Code
# Predict RET_FT4 from SAT_AVG
# Visualize the line in the scatterplot of points
library(lm.beta)
plot(alotofdata$SAT_AVG, alotofdata$RET_FT4, ylim = c(0, 1),  main = 'Scatterplot of Average SAT Score and Retention Rate', xlab = 'Average SAT Score', ylab = 'Retention Rate')
abline(lm(alotofdata$RET_FT4 ~ alotofdata$SAT_AVG))

Numerical representation

# Code
# Run a simple regression analysis in which we predict RET_FT4 from SAT_AVG
library(lm.beta)
reg <- lm(RET_FT4 ~ SAT_AVG, data=alotofdata)
summary(reg)

## 
## Call:
## lm(formula = RET_FT4 ~ SAT_AVG, data = alotofdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51482 -0.04275  0.00746  0.05047  0.22636 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.803e-02  1.727e-02   5.097 3.98e-07 ***
## SAT_AVG     6.355e-04  1.613e-05  39.396  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07611 on 1267 degrees of freedom
##   (6434 observations deleted due to missingness)
## Multiple R-squared:  0.5506, Adjusted R-squared:  0.5502 
## F-statistic:  1552 on 1 and 1267 DF,  p-value: < 2.2e-16

coefficients(reg)

##  (Intercept)      SAT_AVG 
## 0.0880345367 0.0006355223

lm.beta(reg)

## 
## Call:
## lm(formula = RET_FT4 ~ SAT_AVG, data = alotofdata)
## 
## Standardized Coefficients::
## (Intercept)     SAT_AVG 
##   0.0000000   0.7419929

Written interpretation

We are running a Simple Linear Regression analysis to predict retention rate from average SAT scores.

The visual representation is a scatterplot of all the data points for retention rate and average SAT score.

The line in the middle represents the regression line. This line represents the line of best fit for predicting retention rate from average SAT scores.

The numerical representation is a summary of all of the values that describe our linear regression model.

Our Model for this problem is: Retention rate = B0 + B1 X Average SAT score + e

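The B0 = 0.08803454. This means that if the average SAT score is 0, the predicted retention rate is 0.08803454. An average SAT score of 0 lies far outside the observed data, so this value is an extrapolation rather than a meaningful prediction.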

The B1 = 0.00063552. This is the numerical relationship between Y and X. This means that given one unit increase in average SAT score, the expected change in the retention rate on average is 0.00063552.

The standard error of the slope is 0.00001613. This measures the precision of the slope estimate, not the accuracy of predictions: it estimates how much B1 would vary from sample to sample if we repeatedly drew samples from the population.

The Residual standard error is 0.07611. This is the typical size of a residual, in the units of retention rate: on average, the model's predictions from the average SAT score are off by about 0.07611. The closer to 0, the better the model fits.

The adjusted R^2 is 0.5502 (the multiple R^2 is 0.5506). R^2 is the coefficient of determination, the proportion of variance in retention rate that is explained by the average SAT score, i.e., the ratio of explained variance to total variance. For our model, 55.02% of the variance in retention rate is explained by the average SAT score.

Normal Distributions

Are the distributions of average SAT score, admission rate, and total number of undergraduate students normal or non-normal? What information did you use to answer this question?

Average SAT score

Visual representation

# Code
library(ggplot2)
hist.SAT <- ggplot(alotofdata, aes(SAT_AVG)) + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Average SAT Score", y = "Density") + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$SAT_AVG, na.rm = TRUE), sd = sd(alotofdata$SAT_AVG, na.rm = TRUE)), colour = "red", size =1)
hist.SAT

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Numerical representation

# Code
#Test for Normality
shapiro.test(alotofdata$SAT_AVG)

## 
##  Shapiro-Wilk normality test
## 
## data:  alotofdata$SAT_AVG
## W = 0.95536, p-value < 2.2e-16

Written interpretation

No, the distribution of average SAT scores is not a normal distribution because the p-value is less than 0.05.

We are testing whether the distribution of average SAT scores is normal using the Shapiro-Wilk normality test. The null hypothesis of this test is that the data are normally distributed. If the p-value is above our significance level of 0.05, we fail to reject normality; here the p-value is below 0.05, so we reject it.

Visually, we can see this because the histogram does not follow the red normal curve and the data are not perfectly centered on the mean.
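To see how the Shapiro-Wilk p-value behaves, it can help to run the test on data whose shape is known in advance. The sketch below uses two simulated samples (both hypothetical, not the College Scorecard data) and adds a normal Q-Q plot, which is a common visual complement to the histogram.

```r
# Simulated samples (hypothetical): one bell-shaped, one strongly right-skewed
set.seed(123)
normal_sample <- rnorm(1000, mean = 1050, sd = 130)
skewed_sample <- rexp(1000, rate = 1 / 1050)

# Shapiro-Wilk: a small p-value is evidence against normality
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

c(normal = p_normal, skewed = p_skewed)

# Q-Q plot: points far from the reference line indicate departures from normality
qqnorm(skewed_sample)
qqline(skewed_sample, col = "red")
```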

Admission rate

Visual representation

# Code
library(ggplot2)
hist.ADM <- ggplot(alotofdata, aes(ADM_RATE)) + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Admission Rate", y = "Density") + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$ADM_RATE, na.rm = TRUE), sd = sd(alotofdata$ADM_RATE, na.rm = TRUE)), colour = "red", size =1)
hist.ADM

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Numerical representation

# Code
#Test for Normality
shapiro.test(alotofdata$ADM_RATE)

## 
##  Shapiro-Wilk normality test
## 
## data:  alotofdata$ADM_RATE
## W = 0.9651, p-value < 2.2e-16

Written interpretation

No, the distribution of admission rates is not normal: the Shapiro-Wilk p-value (< 2.2e-16) is far below 0.05.

As with average SAT score, we are testing normality with the Shapiro-Wilk test. Normality is the null hypothesis, so at a significance level of 0.05 a p-value this small leads us to reject it; a p-value above 0.05 would mean only that we fail to reject normality.

Total number of undergraduate students

Visual representation

# Code
library(ggplot2)
hist.SAT <- ggplot(alotofdata, aes(UGDS)) + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Total number of undergrads", y = "Density") + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$UGDS, na.rm = TRUE), sd = sd(alotofdata$UGDS, na.rm = TRUE)), colour = "red", size =1)
hist.SAT

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Numerical representation

# Code
#Anderson-Darling normality test
library(nortest)
ad.test(alotofdata$UGDS)

## 
##  Anderson-Darling normality test
## 
## data:  alotofdata$UGDS
## A = 1243.9, p-value < 2.2e-16

Written interpretation

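No, the distribution of the total number of undergraduate students is not normal: the Anderson-Darling p-value (< 2.2e-16) is far below 0.05.

We are testing whether the distribution of the total number of undergraduate students is normal using the Anderson-Darling normality test. R's shapiro.test() accepts at most 5000 non-missing values and UGDS has 6990 (see the summary further below), so ad.test() from the nortest package is used here instead. The logic is the same as before: normality is the null hypothesis, and we reject it when the p-value is below our significance level of 0.05.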
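
A minimal sketch of the sample-size limit on shapiro.test(), using a simulated variable of the same length as UGDS (the lognormal parameters are arbitrary assumptions, not estimates from the data). It only uses base R; testing a random subsample is shown as one possible workaround, not as the approach taken in this report.

```r
set.seed(3)
big <- rlnorm(6990, meanlog = 7, sdlog = 1.5)   # simulated stand-in for UGDS

# shapiro.test() stops with an error when given more than 5000 values;
# tryCatch() captures the error message instead of halting the script
full <- tryCatch(shapiro.test(big), error = function(e) conditionMessage(e))
full

# One workaround: test a random subsample of 5000 values
sub_test <- shapiro.test(sample(big, 5000))
sub_test$p.value
```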

For any of the distributions that were not normal in your previous answer, how could you transform them to be normal? Perform the appropriate transformation and report the mean, standard deviation, and variance of the new transformed variable.

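For each of the distributions, we apply one of the commonly used transformations (square, square root, cube root, logarithm, or reciprocal root) and then standardize the result. Standardizing (converting to z-scores) sets the mean to 0 and the standard deviation to 1 but does not change the shape of the distribution, so any improvement in normality comes from the transformation itself. After standardizing, the mean is 0 and the standard deviation and variance are both 1.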
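
The point that standardizing rescales a variable without changing its shape can be checked on simulated data. In the sketch below the skew() helper and the lognormal sample are assumptions made for illustration; scale() is the same function used in the code that follows.

```r
# Hypothetical helper: moment-based sample skewness
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(4)
x <- rlnorm(2000, meanlog = 7, sdlog = 1)   # simulated right-skewed variable
z <- as.numeric(scale(x))                   # standardized only
z_log <- as.numeric(scale(log(x)))          # log-transformed, then standardized

# Skew is unchanged by standardizing alone, but shrinks after the log
c(raw = skew(x), standardized = skew(z), log_then_standardized = skew(z_log))

# Any standardized variable has mean 0, sd 1, and variance 1
c(mean = mean(z_log), sd = sd(z_log), variance = var(z_log))
```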

Average SAT score

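I applied the cube root transformation and then standardized the result. The histogram looks closer to normal and W rises from 0.95536 to 0.9747, but the Shapiro-Wilk test still rejects normality (p = 2.388e-14).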

# Cube Root
alotofdata$CUBE_SAT_AVG  <- alotofdata$SAT_AVG^(1/3)
alotofdata$zCUBE_SAT_AVG <- scale(alotofdata$CUBE_SAT_AVG, center = TRUE, scale = TRUE)


hist.SAT <- ggplot(alotofdata, aes(zCUBE_SAT_AVG)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "z-score", y = "Density", title = 'Distribution of Average SAT Score') + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$zCUBE_SAT_AVG , na.rm = TRUE), sd = sd(alotofdata$zCUBE_SAT_AVG , na.rm = TRUE)), colour = "red", size =1) 
hist.SAT

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# TEST
shapiro.test(alotofdata$zCUBE_SAT_AVG)

## 
##  Shapiro-Wilk normality test
## 
## data:  alotofdata$zCUBE_SAT_AVG
## W = 0.9747, p-value = 2.388e-14

# Describe
psych::describe(alotofdata$zCUBE_SAT_AVG)

##    vars    n mean sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1304    0  1  -0.11   -0.06 0.82 -2.9 3.31  6.21 0.53     0.72
##      se
## X1 0.03

Admission rate

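Raising the admission rate (plus 1) to the power of 3 should decrease the skew. After standardizing, the skew is close to zero (-0.02), but the histogram is still flatter than the normal curve (kurtosis -0.76) and the Shapiro-Wilk test still rejects normality (W = 0.97855, p < 2.2e-16).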

# Power (cube) transformation; despite the LOG_ prefix in the variable names, no logarithm is taken
alotofdata$LOG_ADM_RATE  <- (alotofdata$ADM_RATE+1)^3
alotofdata$z_LOG_ADM_RATE  <- scale(alotofdata$LOG_ADM_RATE, center = TRUE, scale = TRUE)

# Visualize
hist.SAT <- ggplot(alotofdata, aes(z_LOG_ADM_RATE)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "z-score", y = "Density", title = 'Admission Rate') + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$z_LOG_ADM_RATE , na.rm = TRUE), sd = sd(alotofdata$z_LOG_ADM_RATE , na.rm = TRUE)), colour = "red", size =1) 
hist.SAT

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# TEST
shapiro.test(alotofdata$z_LOG_ADM_RATE)

## 
##  Shapiro-Wilk normality test
## 
## data:  alotofdata$z_LOG_ADM_RATE
## W = 0.97855, p-value < 2.2e-16

# Describe
psych::describe(alotofdata$z_LOG_ADM_RATE)

##    vars    n mean sd median trimmed  mad   min  max range  skew kurtosis
## X1    1 2198    0  1  -0.02    0.01 1.06 -2.37 1.72  4.09 -0.02    -0.76
##      se
## X1 0.02

Total number of undergraduate students

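I took the logarithm of the data, using log(UGDS + 1) so that the result stays defined when UGDS is 0, and then standardized it. The histogram looks much better and the Anderson-Darling statistic falls from 1243.9 to 29.854, but the test still rejects normality (p < 2.2e-16). The distribution is still not normal, but it is closer to normal.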

# Log-transform, then standardize
alotofdata$t_UGDS <- log(alotofdata$UGDS+1)
alotofdata$zUGDS <- scale(alotofdata$t_UGDS , center = TRUE, scale = TRUE)

# Plot the standardized variable
hist.SAT <- ggplot(alotofdata, aes(zUGDS)) + 
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "z-score", y = "Density", title = 'Distribution of Number of Students') + 
  stat_function(fun = dnorm, args = list(mean = mean(alotofdata$zUGDS, na.rm = TRUE), sd = sd(alotofdata$zUGDS, na.rm = TRUE)), colour = "red", size =1)
hist.SAT

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# TEST
library(nortest)
ad.test(alotofdata$zUGDS)

## 
##  Anderson-Darling normality test
## 
## data:  alotofdata$zUGDS
## A = 29.854, p-value < 2.2e-16

# Describe
psych::describe(alotofdata$zUGDS)

##    vars    n mean sd median trimmed  mad   min  max range skew kurtosis
## X1    1 6990    0  1  -0.09   -0.03 1.12 -3.35 3.12  6.47 0.18    -0.57
##      se
## X1 0.01

Probability

Based on the distribution of scores, what is the probability that the average SAT score for a school is greater than 1400? What is the probability that the average SAT score for a school is less than 800?

Answer

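The probability that the average SAT score for a school is greater than 1400 is 2.60736% (1 minus the cumulative proportion at 1400, which is 0.9739264).

The probability that the average SAT score for a school is less than 800 is 1.150307% (the cumulative proportion at 798, the largest observed value below 800). Including the two schools whose average is exactly 800 gives 1.303681%, the probability of a score at or below 800.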

##           level  freq   perc  cumfreq  cumperc
## 1     [700,750]     4   0.3%        4     0.3%
## 2     (750,800]    13   1.0%       17     1.3%
## 3     (800,850]    26   2.0%       43     3.3%
## 4     (850,900]    63   4.8%      106     8.1%
## 5     (900,950]   131  10.0%      237    18.2%
## 6    (950,1000]   201  15.4%      438    33.6%
## 7   (1000,1050]   284  21.8%      722    55.4%
## 8   (1050,1100]   183  14.0%      905    69.4%
## 9   (1100,1150]   146  11.2%     1051    80.6%
## 10  (1150,1200]    85   6.5%     1136    87.1%
## 11  (1200,1250]    55   4.2%     1191    91.3%
## 12  (1250,1300]    31   2.4%     1222    93.7%
## 13  (1300,1350]    27   2.1%     1249    95.8%
## 14  (1350,1400]    21   1.6%     1270    97.4%
## 15  (1400,1450]    16   1.2%     1286    98.6%
## 16  (1450,1500]    15   1.2%     1301    99.8%
## 17  (1500,1550]     3   0.2%     1304   100.0%

##        X Freq         Prop      CumProp
## 1    720    1 0.0007668712 0.0007668712
## 2    735    2 0.0015337423 0.0023006135
## 3    740    1 0.0007668712 0.0030674847
## 4    758    1 0.0007668712 0.0038343558
## 5    760    2 0.0015337423 0.0053680982
## 6    762    1 0.0007668712 0.0061349693
## 7    773    1 0.0007668712 0.0069018405
## 8    774    1 0.0007668712 0.0076687117
## 9    775    1 0.0007668712 0.0084355828
## 10   776    1 0.0007668712 0.0092024540
## 11   780    1 0.0007668712 0.0099693252
## 12   796    1 0.0007668712 0.0107361963
## 13   798    1 0.0007668712 0.0115030675
## 14   800    2 0.0015337423 0.0130368098
## 15   802    1 0.0007668712 0.0138036810
## 16   803    1 0.0007668712 0.0145705521
## 17   804    1 0.0007668712 0.0153374233
## 18   806    2 0.0015337423 0.0168711656
## 19   808    1 0.0007668712 0.0176380368
## 20   810    2 0.0015337423 0.0191717791
## 21   811    1 0.0007668712 0.0199386503
## 22   812    1 0.0007668712 0.0207055215
## 23   814    1 0.0007668712 0.0214723926
## 24   820    1 0.0007668712 0.0222392638
## 25   825    2 0.0015337423 0.0237730061
## 26   827    1 0.0007668712 0.0245398773
## 27   837    1 0.0007668712 0.0253067485
## 28   840    1 0.0007668712 0.0260736196
## 29   842    1 0.0007668712 0.0268404908
## 30   843    2 0.0015337423 0.0283742331
## 31   845    1 0.0007668712 0.0291411043
## 32   847    1 0.0007668712 0.0299079755
## 33   849    2 0.0015337423 0.0314417178
## 34   850    2 0.0015337423 0.0329754601
## 35   851    2 0.0015337423 0.0345092025
## 36   852    1 0.0007668712 0.0352760736
## 37   853    2 0.0015337423 0.0368098160
## 38   854    1 0.0007668712 0.0375766871
## 39   855    1 0.0007668712 0.0383435583
## 40   857    1 0.0007668712 0.0391104294
## 41   859    2 0.0015337423 0.0406441718
## 42   860    1 0.0007668712 0.0414110429
## 43   862    1 0.0007668712 0.0421779141
## 44   863    1 0.0007668712 0.0429447853
## 45   864    1 0.0007668712 0.0437116564
## 46   866    2 0.0015337423 0.0452453988
## 47   867    1 0.0007668712 0.0460122699
## 48   868    1 0.0007668712 0.0467791411
## 49   870    3 0.0023006135 0.0490797546
## 50   871    1 0.0007668712 0.0498466258
## 51   872    3 0.0023006135 0.0521472393
## 52   873    1 0.0007668712 0.0529141104
## 53   874    1 0.0007668712 0.0536809816
## 54   875    3 0.0023006135 0.0559815951
## 55   876    1 0.0007668712 0.0567484663
## 56   877    1 0.0007668712 0.0575153374
## 57   880    1 0.0007668712 0.0582822086
## 58   881    1 0.0007668712 0.0590490798
## 59   883    1 0.0007668712 0.0598159509
## 60   884    1 0.0007668712 0.0605828221
## 61   885    2 0.0015337423 0.0621165644
## 62   886    4 0.0030674847 0.0651840491
## 63   887    2 0.0015337423 0.0667177914
## 64   890    4 0.0030674847 0.0697852761
## 65   892    1 0.0007668712 0.0705521472
## 66   894    2 0.0015337423 0.0720858896
## 67   895    1 0.0007668712 0.0728527607
## 68   896    1 0.0007668712 0.0736196319
## 69   897    2 0.0015337423 0.0751533742
## 70   899    5 0.0038343558 0.0789877301
## 71   900    3 0.0023006135 0.0812883436
## 72   901    1 0.0007668712 0.0820552147
## 73   902    2 0.0015337423 0.0835889571
## 74   903    1 0.0007668712 0.0843558282
## 75   905    1 0.0007668712 0.0851226994
## 76   906    2 0.0015337423 0.0866564417
## 77   907    2 0.0015337423 0.0881901840
## 78   909    3 0.0023006135 0.0904907975
## 79   910    8 0.0061349693 0.0966257669
## 80   911    1 0.0007668712 0.0973926380
## 81   913    3 0.0023006135 0.0996932515
## 82   914    3 0.0023006135 0.1019938650
## 83   916    1 0.0007668712 0.1027607362
## 84   917    4 0.0030674847 0.1058282209
## 85   918    2 0.0015337423 0.1073619632
## 86   919    2 0.0015337423 0.1088957055
## 87   920    2 0.0015337423 0.1104294479
## 88   921    1 0.0007668712 0.1111963190
## 89   922    1 0.0007668712 0.1119631902
## 90   923    3 0.0023006135 0.1142638037
## 91   924    1 0.0007668712 0.1150306748
## 92   925    2 0.0015337423 0.1165644172
## 93   926    2 0.0015337423 0.1180981595
## 94   927    2 0.0015337423 0.1196319018
## 95   928    4 0.0030674847 0.1226993865
## 96   930   12 0.0092024540 0.1319018405
## 97   931    1 0.0007668712 0.1326687117
## 98   932    2 0.0015337423 0.1342024540
## 99   933    2 0.0015337423 0.1357361963
## 100  934    3 0.0023006135 0.1380368098
## 101  935    2 0.0015337423 0.1395705521
## 102  936    2 0.0015337423 0.1411042945
## 103  937    4 0.0030674847 0.1441717791
## 104  938    2 0.0015337423 0.1457055215
## 105  939    1 0.0007668712 0.1464723926
## 106  940    4 0.0030674847 0.1495398773
## 107  941    5 0.0038343558 0.1533742331
## 108  942    3 0.0023006135 0.1556748466
## 109  943    3 0.0023006135 0.1579754601
## 110  944    2 0.0015337423 0.1595092025
## 111  945    3 0.0023006135 0.1618098160
## 112  946    6 0.0046012270 0.1664110429
## 113  947    3 0.0023006135 0.1687116564
## 114  948    6 0.0046012270 0.1733128834
## 115  949    3 0.0023006135 0.1756134969
## 116  950    8 0.0061349693 0.1817484663
## 117  951    5 0.0038343558 0.1855828221
## 118  952    1 0.0007668712 0.1863496933
## 119  953    2 0.0015337423 0.1878834356
## 120  954    3 0.0023006135 0.1901840491
## 121  955    1 0.0007668712 0.1909509202
## 122  957    1 0.0007668712 0.1917177914
## 123  958    5 0.0038343558 0.1955521472
## 124  959    3 0.0023006135 0.1978527607
## 125  960    3 0.0023006135 0.2001533742
## 126  961    3 0.0023006135 0.2024539877
## 127  962    6 0.0046012270 0.2070552147
## 128  963    4 0.0030674847 0.2101226994
## 129  964    3 0.0023006135 0.2124233129
## 130  965    7 0.0053680982 0.2177914110
## 131  966    4 0.0030674847 0.2208588957
## 132  967    5 0.0038343558 0.2246932515
## 133  968    5 0.0038343558 0.2285276074
## 134  969    4 0.0030674847 0.2315950920
## 135  970   20 0.0153374233 0.2469325153
## 136  971    2 0.0015337423 0.2484662577
## 137  972    1 0.0007668712 0.2492331288
## 138  973    2 0.0015337423 0.2507668712
## 139  974    3 0.0023006135 0.2530674847
## 140  975    3 0.0023006135 0.2553680982
## 141  976    5 0.0038343558 0.2592024540
## 142  977    1 0.0007668712 0.2599693252
## 143  978    3 0.0023006135 0.2622699387
## 144  979    2 0.0015337423 0.2638036810
## 145  980    3 0.0023006135 0.2661042945
## 146  981    4 0.0030674847 0.2691717791
## 147  982    3 0.0023006135 0.2714723926
## 148  983    3 0.0023006135 0.2737730061
## 149  984    5 0.0038343558 0.2776073620
## 150  985    5 0.0038343558 0.2814417178
## 151  986    3 0.0023006135 0.2837423313
## 152  987    2 0.0015337423 0.2852760736
## 153  988    3 0.0023006135 0.2875766871
## 154  989    3 0.0023006135 0.2898773006
## 155  990   19 0.0145705521 0.3044478528
## 156  991    2 0.0015337423 0.3059815951
## 157  992    2 0.0015337423 0.3075153374
## 158  993    3 0.0023006135 0.3098159509
## 159  994    2 0.0015337423 0.3113496933
## 160  995    6 0.0046012270 0.3159509202
## 161  996    6 0.0046012270 0.3205521472
## 162  997    4 0.0030674847 0.3236196319
## 163  998    3 0.0023006135 0.3259202454
## 164  999    5 0.0038343558 0.3297546012
## 165 1000    8 0.0061349693 0.3358895706
## 166 1001    6 0.0046012270 0.3404907975
## 167 1002    3 0.0023006135 0.3427914110
## 168 1003    3 0.0023006135 0.3450920245
## 169 1004    7 0.0053680982 0.3504601227
## 170 1005   10 0.0076687117 0.3581288344
## 171 1006    4 0.0030674847 0.3611963190
## 172 1007    3 0.0023006135 0.3634969325
## 173 1008    5 0.0038343558 0.3673312883
## 174 1009    8 0.0061349693 0.3734662577
## 175 1010   26 0.0199386503 0.3934049080
## 176 1011    3 0.0023006135 0.3957055215
## 177 1012    2 0.0015337423 0.3972392638
## 178 1013    2 0.0015337423 0.3987730061
## 179 1014    6 0.0046012270 0.4033742331
## 180 1015    2 0.0015337423 0.4049079755
## 181 1016    6 0.0046012270 0.4095092025
## 182 1017    1 0.0007668712 0.4102760736
## 183 1018    4 0.0030674847 0.4133435583
## 184 1019    1 0.0007668712 0.4141104294
## 185 1020    5 0.0038343558 0.4179447853
## 186 1021    5 0.0038343558 0.4217791411
## 187 1022    2 0.0015337423 0.4233128834
## 188 1023    1 0.0007668712 0.4240797546
## 189 1024    4 0.0030674847 0.4271472393
## 190 1025    8 0.0061349693 0.4332822086
## 191 1026    3 0.0023006135 0.4355828221
## 192 1027    2 0.0015337423 0.4371165644
## 193 1028    3 0.0023006135 0.4394171779
## 194 1029   10 0.0076687117 0.4470858896
## 195 1030   20 0.0153374233 0.4624233129
## 196 1031   10 0.0076687117 0.4700920245
## 197 1032    4 0.0030674847 0.4731595092
## 198 1033    6 0.0046012270 0.4777607362
## 199 1034    3 0.0023006135 0.4800613497
## 200 1035    8 0.0061349693 0.4861963190
## 201 1036    2 0.0015337423 0.4877300613
## 202 1037    6 0.0046012270 0.4923312883
## 203 1038    7 0.0053680982 0.4976993865
## 204 1039    3 0.0023006135 0.5000000000
## 205 1040    4 0.0030674847 0.5030674847
## 206 1041    2 0.0015337423 0.5046012270
## 207 1042    3 0.0023006135 0.5069018405
## 208 1043    4 0.0030674847 0.5099693252
## 209 1044    6 0.0046012270 0.5145705521
## 210 1045    3 0.0023006135 0.5168711656
## 211 1046    3 0.0023006135 0.5191717791
## 212 1047    4 0.0030674847 0.5222392638
## 213 1048    9 0.0069018405 0.5291411043
## 214 1049    6 0.0046012270 0.5337423313
## 215 1050   26 0.0199386503 0.5536809816
## 216 1051    5 0.0038343558 0.5575153374
## 217 1052    3 0.0023006135 0.5598159509
## 218 1053    5 0.0038343558 0.5636503067
## 219 1054    6 0.0046012270 0.5682515337
## 220 1055    8 0.0061349693 0.5743865031
## 221 1056    6 0.0046012270 0.5789877301
## 222 1057    3 0.0023006135 0.5812883436
## 223 1058    2 0.0015337423 0.5828220859
## 224 1059    4 0.0030674847 0.5858895706
## 225 1060    2 0.0015337423 0.5874233129
## 226 1061    2 0.0015337423 0.5889570552
## 227 1062    2 0.0015337423 0.5904907975
## 228 1063    3 0.0023006135 0.5927914110
## 229 1064    6 0.0046012270 0.5973926380
## 230 1065    5 0.0038343558 0.6012269939
## 231 1066    4 0.0030674847 0.6042944785
## 232 1067    3 0.0023006135 0.6065950920
## 233 1068    1 0.0007668712 0.6073619632
## 234 1069    2 0.0015337423 0.6088957055
## 235 1070   10 0.0076687117 0.6165644172
## 236 1071    2 0.0015337423 0.6180981595
## 237 1072    2 0.0015337423 0.6196319018
## 238 1073    4 0.0030674847 0.6226993865
## 239 1074    6 0.0046012270 0.6273006135
## 240 1075    4 0.0030674847 0.6303680982
## 241 1076    4 0.0030674847 0.6334355828
## 242 1077    3 0.0023006135 0.6357361963
## 243 1078    3 0.0023006135 0.6380368098
## 244 1079    4 0.0030674847 0.6411042945
## 245 1080    3 0.0023006135 0.6434049080
## 246 1081    6 0.0046012270 0.6480061350
## 247 1082    4 0.0030674847 0.6510736196
## 248 1083    3 0.0023006135 0.6533742331
## 249 1085    5 0.0038343558 0.6572085890
## 250 1086    6 0.0046012270 0.6618098160
## 251 1087    1 0.0007668712 0.6625766871
## 252 1088    4 0.0030674847 0.6656441718
## 253 1089    6 0.0046012270 0.6702453988
## 254 1090    9 0.0069018405 0.6771472393
## 255 1091    2 0.0015337423 0.6786809816
## 256 1092    2 0.0015337423 0.6802147239
## 257 1093    1 0.0007668712 0.6809815951
## 258 1094    3 0.0023006135 0.6832822086
## 259 1095    2 0.0015337423 0.6848159509
## 260 1096    2 0.0015337423 0.6863496933
## 261 1097    2 0.0015337423 0.6878834356
## 262 1098    3 0.0023006135 0.6901840491
## 263 1099    2 0.0015337423 0.6917177914
## 264 1100    3 0.0023006135 0.6940184049
## 265 1101    4 0.0030674847 0.6970858896
## 266 1102    2 0.0015337423 0.6986196319
## 267 1103    3 0.0023006135 0.7009202454
## 268 1104    4 0.0030674847 0.7039877301
## 269 1105   13 0.0099693252 0.7139570552
## 270 1106    4 0.0030674847 0.7170245399
## 271 1107    2 0.0015337423 0.7185582822
## 272 1108    2 0.0015337423 0.7200920245
## 273 1109    6 0.0046012270 0.7246932515
## 274 1110   11 0.0084355828 0.7331288344
## 275 1111    1 0.0007668712 0.7338957055
## 276 1112    3 0.0023006135 0.7361963190
## 277 1113    1 0.0007668712 0.7369631902
## 278 1114    2 0.0015337423 0.7384969325
## 279 1115    2 0.0015337423 0.7400306748
## 280 1116    6 0.0046012270 0.7446319018
## 281 1117    2 0.0015337423 0.7461656442
## 282 1118    2 0.0015337423 0.7476993865
## 283 1120    3 0.0023006135 0.7500000000
## 284 1121    1 0.0007668712 0.7507668712
## 285 1122    3 0.0023006135 0.7530674847
## 286 1123    1 0.0007668712 0.7538343558
## 287 1124    1 0.0007668712 0.7546012270
## 288 1125   12 0.0092024540 0.7638036810
## 289 1126    1 0.0007668712 0.7645705521
## 290 1127    4 0.0030674847 0.7676380368
## 291 1128    4 0.0030674847 0.7707055215
## 292 1129    1 0.0007668712 0.7714723926
## 293 1130    2 0.0015337423 0.7730061350
## 294 1131    2 0.0015337423 0.7745398773
## 295 1132    1 0.0007668712 0.7753067485
## 296 1133    4 0.0030674847 0.7783742331
## 297 1134    2 0.0015337423 0.7799079755
## 298 1135    3 0.0023006135 0.7822085890
## 299 1137    3 0.0023006135 0.7845092025
## 300 1138    1 0.0007668712 0.7852760736
## 301 1139    2 0.0015337423 0.7868098160
## 302 1140    1 0.0007668712 0.7875766871
## 303 1141    2 0.0015337423 0.7891104294
## 304 1142    1 0.0007668712 0.7898773006
## 305 1143    1 0.0007668712 0.7906441718
## 306 1144    4 0.0030674847 0.7937116564
## 307 1145    6 0.0046012270 0.7983128834
## 308 1146    3 0.0023006135 0.8006134969
## 309 1147    3 0.0023006135 0.8029141104
## 310 1148    2 0.0015337423 0.8044478528
## 311 1149    1 0.0007668712 0.8052147239
## 312 1150    1 0.0007668712 0.8059815951
## 313 1151    1 0.0007668712 0.8067484663
## 314 1152    5 0.0038343558 0.8105828221
## 315 1153    2 0.0015337423 0.8121165644
## 316 1155    3 0.0023006135 0.8144171779
## 317 1156    1 0.0007668712 0.8151840491
## 318 1157    1 0.0007668712 0.8159509202
## 319 1158    1 0.0007668712 0.8167177914
## 320 1159    2 0.0015337423 0.8182515337
## 321 1160    1 0.0007668712 0.8190184049
## 322 1161    2 0.0015337423 0.8205521472
## 323 1162    5 0.0038343558 0.8243865031
## 324 1164    2 0.0015337423 0.8259202454
## 325 1165    6 0.0046012270 0.8305214724
## 326 1166    2 0.0015337423 0.8320552147
## 327 1168    2 0.0015337423 0.8335889571
## 328 1169    1 0.0007668712 0.8343558282
## 329 1170    3 0.0023006135 0.8366564417
## 330 1171    1 0.0007668712 0.8374233129
## 331 1175    3 0.0023006135 0.8397239264
## 332 1176    2 0.0015337423 0.8412576687
## 333 1177    2 0.0015337423 0.8427914110
## 334 1178    2 0.0015337423 0.8443251534
## 335 1179    1 0.0007668712 0.8450920245
## 336 1180    1 0.0007668712 0.8458588957
## 337 1181    2 0.0015337423 0.8473926380
## 338 1182    1 0.0007668712 0.8481595092
## 339 1183    3 0.0023006135 0.8504601227
## 340 1184    2 0.0015337423 0.8519938650
## 341 1185    5 0.0038343558 0.8558282209
## 342 1186    1 0.0007668712 0.8565950920
## 343 1187    1 0.0007668712 0.8573619632
## 344 1188    3 0.0023006135 0.8596625767
## 345 1189    1 0.0007668712 0.8604294479
## 346 1191    2 0.0015337423 0.8619631902
## 347 1193    2 0.0015337423 0.8634969325
## 348 1194    3 0.0023006135 0.8657975460
## 349 1195    2 0.0015337423 0.8673312883
## 350 1196    1 0.0007668712 0.8680981595
## 351 1198    2 0.0015337423 0.8696319018
## 352 1200    2 0.0015337423 0.8711656442
## 353 1205    2 0.0015337423 0.8726993865
## 354 1206    1 0.0007668712 0.8734662577
## 355 1207    3 0.0023006135 0.8757668712
## 356 1209    1 0.0007668712 0.8765337423
## 357 1210    1 0.0007668712 0.8773006135
## 358 1211    2 0.0015337423 0.8788343558
## 359 1212    1 0.0007668712 0.8796012270
## 360 1213    4 0.0030674847 0.8826687117
## 361 1215    3 0.0023006135 0.8849693252
## 362 1217    2 0.0015337423 0.8865030675
## 363 1218    1 0.0007668712 0.8872699387
## 364 1219    1 0.0007668712 0.8880368098
## 365 1220    1 0.0007668712 0.8888036810
## 366 1221    2 0.0015337423 0.8903374233
## 367 1223    2 0.0015337423 0.8918711656
## 368 1225    1 0.0007668712 0.8926380368
## 369 1226    2 0.0015337423 0.8941717791
## 370 1227    1 0.0007668712 0.8949386503
## 371 1228    1 0.0007668712 0.8957055215
## 372 1229    1 0.0007668712 0.8964723926
## 373 1230    2 0.0015337423 0.8980061350
## 374 1231    1 0.0007668712 0.8987730061
## 375 1234    2 0.0015337423 0.9003067485
## 376 1235    1 0.0007668712 0.9010736196
## 377 1239    3 0.0023006135 0.9033742331
## 378 1240    3 0.0023006135 0.9056748466
## 379 1241    1 0.0007668712 0.9064417178
## 380 1243    2 0.0015337423 0.9079754601
## 381 1244    3 0.0023006135 0.9102760736
## 382 1246    1 0.0007668712 0.9110429448
## 383 1247    1 0.0007668712 0.9118098160
## 384 1248    1 0.0007668712 0.9125766871
## 385 1249    1 0.0007668712 0.9133435583
## 386 1252    1 0.0007668712 0.9141104294
## 387 1253    2 0.0015337423 0.9156441718
## 388 1254    3 0.0023006135 0.9179447853
## 389 1259    1 0.0007668712 0.9187116564
## 390 1261    2 0.0015337423 0.9202453988
## 391 1263    1 0.0007668712 0.9210122699
## 392 1266    1 0.0007668712 0.9217791411
## 393 1271    1 0.0007668712 0.9225460123
## 394 1272    1 0.0007668712 0.9233128834
## 395 1273    1 0.0007668712 0.9240797546
## 396 1274    1 0.0007668712 0.9248466258
## 397 1275    1 0.0007668712 0.9256134969
## 398 1280    1 0.0007668712 0.9263803681
## 399 1281    1 0.0007668712 0.9271472393
## 400 1283    1 0.0007668712 0.9279141104
## 401 1286    1 0.0007668712 0.9286809816
## 402 1287    1 0.0007668712 0.9294478528
## 403 1290    1 0.0007668712 0.9302147239
## 404 1292    2 0.0015337423 0.9317484663
## 405 1296    3 0.0023006135 0.9340490798
## 406 1297    1 0.0007668712 0.9348159509
## 407 1300    3 0.0023006135 0.9371165644
## 408 1308    1 0.0007668712 0.9378834356
## 409 1309    1 0.0007668712 0.9386503067
## 410 1311    1 0.0007668712 0.9394171779
## 411 1313    3 0.0023006135 0.9417177914
## 412 1315    1 0.0007668712 0.9424846626
## 413 1316    1 0.0007668712 0.9432515337
## 414 1317    1 0.0007668712 0.9440184049
## 415 1323    2 0.0015337423 0.9455521472
## 416 1326    2 0.0015337423 0.9470858896
## 417 1328    1 0.0007668712 0.9478527607
## 418 1330    3 0.0023006135 0.9501533742
## 419 1332    1 0.0007668712 0.9509202454
## 420 1333    2 0.0015337423 0.9524539877
## 421 1337    2 0.0015337423 0.9539877301
## 422 1342    1 0.0007668712 0.9547546012
## 423 1343    1 0.0007668712 0.9555214724
## 424 1344    1 0.0007668712 0.9562883436
## 425 1346    1 0.0007668712 0.9570552147
## 426 1349    1 0.0007668712 0.9578220859
## 427 1354    1 0.0007668712 0.9585889571
## 428 1357    1 0.0007668712 0.9593558282
## 429 1360    1 0.0007668712 0.9601226994
## 430 1366    1 0.0007668712 0.9608895706
## 431 1369    2 0.0015337423 0.9624233129
## 432 1372    1 0.0007668712 0.9631901840
## 433 1373    1 0.0007668712 0.9639570552
## 434 1375    1 0.0007668712 0.9647239264
## 435 1379    1 0.0007668712 0.9654907975
## 436 1380    3 0.0023006135 0.9677914110
## 437 1382    1 0.0007668712 0.9685582822
## 438 1383    1 0.0007668712 0.9693251534
## 439 1388    1 0.0007668712 0.9700920245
## 440 1390    1 0.0007668712 0.9708588957
## 441 1393    1 0.0007668712 0.9716257669
## 442 1395    1 0.0007668712 0.9723926380
## 443 1398    1 0.0007668712 0.9731595092
## 444 1400    1 0.0007668712 0.9739263804
## 445 1403    1 0.0007668712 0.9746932515
## 446 1408    1 0.0007668712 0.9754601227
## 447 1414    1 0.0007668712 0.9762269939
## 448 1419    1 0.0007668712 0.9769938650
## 449 1420    2 0.0015337423 0.9785276074
## 450 1422    2 0.0015337423 0.9800613497
## 451 1423    1 0.0007668712 0.9808282209
## 452 1433    1 0.0007668712 0.9815950920
## 453 1435    1 0.0007668712 0.9823619632
## 454 1436    1 0.0007668712 0.9831288344
## 455 1439    2 0.0015337423 0.9846625767
## 456 1444    1 0.0007668712 0.9854294479
## 457 1450    1 0.0007668712 0.9861963190
## 458 1452    2 0.0015337423 0.9877300613
## 459 1454    2 0.0015337423 0.9892638037
## 460 1460    1 0.0007668712 0.9900306748
## 461 1461    1 0.0007668712 0.9907975460
## 462 1465    1 0.0007668712 0.9915644172
## 463 1470    1 0.0007668712 0.9923312883
## 464 1475    1 0.0007668712 0.9930981595
## 465 1478    1 0.0007668712 0.9938650307
## 466 1481    1 0.0007668712 0.9946319018
## 467 1491    1 0.0007668712 0.9953987730
## 468 1493    1 0.0007668712 0.9961656442
## 469 1500    2 0.0015337423 0.9976993865
## 470 1501    1 0.0007668712 0.9984662577
## 471 1505    1 0.0007668712 0.9992331288
## 472 1545    1 0.0007668712 1.0000000000

Proof

# Code
# myTable is the frequency table printed above (the code that builds it is not shown)
# To select a specific row of this table (e.g., a value of 1400), we can use the following
myTable[myTable$X == '1400', c('X', 'Freq', 'Prop', 'CumProp')]

##        X Freq         Prop   CumProp
## 444 1400    1 0.0007668712 0.9739264

# To select the row for a value of 800, we can use the following
myTable[myTable$X == '800', c('X', 'Freq', 'Prop', 'CumProp')]

##      X Freq        Prop    CumProp
## 14 800    2 0.001533742 0.01303681
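
The same empirical probabilities can be computed directly from the raw scores instead of being read off the table. The sketch below uses a small made-up vector (an assumption for illustration, not the real SAT_AVG column) to show the difference between "less than" and "at or below" a cutoff, which matters when some schools sit exactly on it.

```r
# Toy vector of school-level averages (hypothetical values)
sat <- c(720, 800, 800, 950, 1040, 1100, 1400, 1420, 1500, 1545)

p_gt_1400 <- mean(sat > 1400)    # share of schools strictly above 1400
p_lt_800  <- mean(sat < 800)     # share strictly below 800
p_le_800  <- mean(sat <= 800)    # share at or below 800 (the CumProp column)

c(p_gt_1400 = p_gt_1400, p_lt_800 = p_lt_800, p_le_800 = p_le_800)

# ecdf() returns the cumulative proportion, i.e. P(X <= value)
F_hat <- ecdf(sat)
1 - F_hat(1400)                  # same as p_gt_1400
F_hat(800)                       # same as p_le_800
```

With the real data, mean(alotofdata$SAT_AVG > 1400, na.rm = TRUE) would play the role of p_gt_1400.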

Imagine that the distribution of average SAT scores was perfectly normal. Answer both parts of Question 15 again using the observed mean and standard deviation as your parameters.

Answer

Assuming the distribution is perfectly normal…

The probability that the average SAT score for a school is greater than 1400 is 0.52871% (1 - pnorm(z) = 1 - 0.9947129).

The probability that the average SAT score for a school is less than 800 is 2.603005% (pnorm(z) = 0.02603005).

Proof

# Code
# SUMMARY
psych::describe(alotofdata$SAT_AVG)

##    vars    n    mean     sd median trimmed    mad min  max range skew
## X1    1 1304 1059.07 133.36 1039.5 1048.34 104.52 720 1545   825 0.82
##    kurtosis   se
## X1     1.11 3.69

# Percentile for a score of 1400 if the data are normally distributed; P(X > 1400) is 1 - pnorm(z)
a <- 1400
s <- 133.36
xbar <- 1059.07
z <- (a-xbar)/s
z

## [1] 2.556464

pnorm(z)

## [1] 0.9947129

# Percentile for a score of 800 if the data are normally distributed; P(X < 800) is pnorm(z)
a <- 800
s <- 133.36
xbar <- 1059.07
z <- (a-xbar)/s
z

## [1] -1.942636

pnorm(z)

## [1] 0.02603005
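
As a cross-check on the z-score arithmetic above, pnorm() can take the mean and standard deviation directly, and lower.tail = FALSE returns the upper-tail probability without subtracting from 1. This is a sketch of an equivalent calculation using the observed mean and standard deviation reported by psych::describe(); the variable names are chosen here for illustration.

```r
xbar <- 1059.07   # observed mean of SAT_AVG
s    <- 133.36    # observed standard deviation of SAT_AVG

# P(X > 1400): upper tail of a normal distribution with these parameters
p_gt_1400 <- pnorm(1400, mean = xbar, sd = s, lower.tail = FALSE)

# P(X < 800): lower tail
p_lt_800 <- pnorm(800, mean = xbar, sd = s)

# Expressed as percentages
round(100 * c(p_gt_1400 = p_gt_1400, p_lt_800 = p_lt_800), 3)
```

These should reproduce the roughly 0.529% and 2.603% reported in the answer.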

Lab 2 Markdown

Jacob Huebner

10/26/2019
