fun R code of the week!

  1. ways to look at data (hist, range, summary, str)

  2. subset

  3. using the data= function in lm()

usnews<- read.csv("/Users/clairemorrison/Desktop/gradstats2022/usnews.csv", header=T, na.strings = ".") ### remember, this dataset uses "."s to denote missing data so I'm going to convert them to NAs with na.strings="."

head(usnews) # head is a great way to first look at your data, it gives you all the variable names and first 6 rows of data
##    FICE                            College State     Type satMath satVerbal
## 1  1061          Alaska Pacific University    AK  Private     490       482
## 2  1063  University of Alaska at Fairbanks    AK   Public     499       462
## 3  1065     University of Alaska Southeast    AK   Public      NA        NA
## 4 11462  University of Alaska at Anchorage    AK   Public     459       422
## 5  1002        Alabama Agri. & Mech. Univ.    AL   Public      NA        NA
## 6  1003                Faulkner University    AL  Private      NA        NA
##   satTotal act math1q math3q verbal1q verbal3q act1q act3q numApps numAccept
## 1      972  20    440    530      430      550    18    22     193       146
## 2      961  22     NA     NA       NA       NA    NA    NA    1852      1427
## 3       NA  NA     NA     NA       NA       NA    NA    NA     146       117
## 4      881  20     NA     NA       NA       NA    NA    NA    2065      1598
## 5       NA  17     NA     NA       NA       NA    14    17    2817      1920
## 6       NA  20     NA     NA       NA       NA    NA    NA     345       320
##   numMatrix top10HS top25HS numFulltime numParttime inTuition outTuition
## 1        55      16      44         249         869      7560       7560
## 2       928      NA      NA        3885        4519      1742       5226
## 3        89       4      24         492        1849      1742       5226
## 4      1162      NA      NA        6209       10537      1742       5226
## 5       984      NA      NA        3958         305      1700       3400
## 6       179      NA      27        1367         578      5600       5600
##   roomBoard room board fees books personal PhD termDeg sfRatio alumDonate
## 1      4120 1620  2500  130   800     1500  76      72    11.9          2
## 2      3590 1800  1790  155   650     2304  67      NA    10.0          8
## 3      4764 2514  2250   34   500     1162  39      51     9.5         NA
## 4      5120 2600  2520  114   580     1260  48      NA    13.7          6
## 5      2550 1108  1442  155   500      850  53      53    14.3         NA
## 6      3250 1550  1700  300   350       NA  52      56    32.8         NA
##   spendPerStudent gradRate accptRate typeCode
## 1           10922       15     0.756        1
## 2           11935       NA     0.771        0
## 3            9584       39     0.801        0
## 4            8046       NA     0.774        0
## 5            7043       40     0.682        0
## 6            3971       55     0.928        1
tail(usnews) # you can also do "tail()" to see the bottom of the datafile
##      FICE                                College State     Type satMath
## 1297 3825  West Virginia Institute of Technology    WV   Public      NA
## 1298 3826            West Virginia State College    WV   Public      NA
## 1299 3827               West Virginia University    WV   Public     507
## 1300 3830         West Virginia Wesleyan College    WV  Private     489
## 1301 3831                Wheeling Jesuit College    WV  Private     479
## 1302 3932                  University of Wyoming    WY   Public      NA
##      satVerbal satTotal act math1q math3q verbal1q verbal3q act1q act3q numApps
## 1297        NA       NA  20     NA     NA       NA       NA    16    28    1594
## 1298        NA       NA  18     NA     NA       NA       NA    NA    NA    1869
## 1299       439      946  22    450    560      380      490    20    24    9630
## 1300       439      928  23    420    560      370      500    20    27    1566
## 1301       433      912  22    410    520      380      490    19    24     903
## 1302        NA       NA  23     NA     NA       NA       NA    20    26    2029
##      numAccept numMatrix top10HS top25HS numFulltime numParttime inTuition
## 1297      1572       675      NA      NA        2432         616      3954
## 1298        NA       957      NA      NA        2817        1939      1988
## 1299      7801      2881      23      49       14524        1053      2128
## 1300      1400       483      28      55        1509         170     14200
## 1301       755       213      15      49         971         305     10500
## 1302      1516      1073      23      46        7535        1488      1908
##      outTuition roomBoard room board fees books personal PhD termDeg sfRatio
## 1297         NA      3752 1800  1952   NA   500       NA  NA      31    15.3
## 1298       4616      3200 1500  1700   50   750      750  38      38    19.2
## 1299       6370      4310 2284  2026   NA    NA       NA  83      86    13.4
## 1300      14200      3775 1750  2025   NA   450     1100  58      81    16.4
## 1301      10500      4545 2100  2445   NA   600      600  66      71    14.1
## 1302       5988      3422 1462  1960  300   600     1500  91      94    15.1
##      alumDonate spendPerStudent gradRate accptRate typeCode
## 1297         NA            4325       56     0.986        0
## 1298          4            3839       NA        NA        0
## 1299         NA            8318       57     0.810        0
## 1300         42            8080       67     0.894        1
## 1301         27            7494       72     0.836        1
## 1302         13            8745       45     0.747        0
# let's say we want to ask a question about math SAT scores, but first want to know if the data are normally distributed
# an easy way to do this is the built in R function "hist()"

hist(usnews$satMath) ### this is pretty no frills, so not always the best plot to publish but definitely adequate for checking out data

range(usnews$satMath, na.rm = T) # if there are any outliers in the data they will show in a histogram, but we could also look at the range of data. remember! we have to do na.rm=T becuase we have NAs in the data we want to ignore
## [1] 320 750
summary(usnews$satMath) # summary gives us more values we might want (the count of NAs is helpful)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   320.0   460.0   500.0   506.8   544.0   750.0     525
table(usnews$Type) # table gives us a count of how many observations exist in the dataset per variable values
## 
##  Private   Public 
##      832      470
str(usnews) # str is a useful command to see if your variables are characters (words), integers (whole numbers) or numeric (numbers with decimals). this is helpful because sometimes you might read in a dataset and R might think a number is a character (if a variable is a mix of numbers and characters R will classify it as a character) and then you wouldn't be able to perform operations on it. 
## 'data.frame':    1302 obs. of  37 variables:
##  $ FICE           : int  1061 1063 1065 11462 1002 1003 1004 1005 1009 1012 ...
##  $ College        : chr  " Alaska Pacific University" " University of Alaska at Fairbanks" " University of Alaska Southeast" " University of Alaska at Anchorage" ...
##  $ State          : chr  " AK" " AK" " AK" " AK" ...
##  $ Type           : chr  " Private" " Public" " Public" " Public" ...
##  $ satMath        : int  490 499 NA 459 NA NA NA NA 575 575 ...
##  $ satVerbal      : int  482 462 NA 422 NA NA NA NA 501 525 ...
##  $ satTotal       : int  972 961 NA 881 NA NA NA NA 1076 1100 ...
##  $ act            : int  20 22 NA 20 17 20 21 NA 24 26 ...
##  $ math1q         : int  440 NA NA NA NA NA NA NA 520 470 ...
##  $ math3q         : int  530 NA NA NA NA NA NA NA 638 680 ...
##  $ verbal1q       : int  430 NA NA NA NA NA NA NA 453 460 ...
##  $ verbal3q       : int  550 NA NA NA NA NA NA NA 559 650 ...
##  $ act1q          : int  18 NA NA NA 14 NA 18 NA 21 23 ...
##  $ act3q          : int  22 NA NA NA 17 NA 23 NA 27 29 ...
##  $ numApps        : int  193 1852 146 2065 2817 345 1351 4639 7548 805 ...
##  $ numAccept      : int  146 1427 117 1598 1920 320 892 3272 6791 588 ...
##  $ numMatrix      : int  55 928 89 1162 984 179 570 1278 3070 287 ...
##  $ top10HS        : int  16 NA 4 NA NA NA 18 NA 25 67 ...
##  $ top25HS        : int  44 NA 24 NA NA 27 78 NA 57 88 ...
##  $ numFulltime    : int  249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
##  $ numParttime    : int  869 4519 1849 10537 305 578 331 405 1716 207 ...
##  $ inTuition      : int  7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
##  $ outTuition     : int  7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
##  $ roomBoard      : int  4120 3590 4764 5120 2550 3250 3030 1960 3933 4325 ...
##  $ room           : int  1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
##  $ board          : int  2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
##  $ fees           : int  130 155 34 114 155 300 124 84 NA 120 ...
##  $ books          : int  800 650 500 580 500 350 300 500 600 400 ...
##  $ personal       : int  1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
##  $ PhD            : int  76 67 39 48 53 52 72 48 85 74 ...
##  $ termDeg        : int  72 NA 51 NA 53 56 72 53 91 79 ...
##  $ sfRatio        : num  11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
##  $ alumDonate     : int  2 8 NA 6 NA NA 8 NA 18 34 ...
##  $ spendPerStudent: int  10922 11935 9584 8046 7043 3971 5883 NA 6642 8649 ...
##  $ gradRate       : int  15 NA 39 NA 40 55 51 15 69 72 ...
##  $ accptRate      : num  0.756 0.771 0.801 0.774 0.682 0.928 0.66 0.705 0.9 0.73 ...
##  $ typeCode       : int  1 0 0 0 0 1 0 0 0 1 ...
# now, let's say we want to look at SAT math scores specifically for private schools. we learned how to do tapply to look at means or other functions

tapply(X = usnews$satMath, INDEX = usnews$Type, FUN = mean, na.rm=T) # if you want to start a function (e.g. tapply) sometimes it's nice to write the function then the first open parentheses then hit tab. hitting tab shows you the options of input arguments for that function. for example, in tapply, we feed it 'X'= the dataset and variable we want to get the mean of (in this case mean but could do other functions), 'INDEX'= the variable we are indexing by, and 'FUN'= operation to perform
##  Private   Public 
##  509.981  500.212
# it might also for other reasons be good to just subset the data by private schools, then we could just take the mean of sat MATH and it should match 509.981

private<- subset(usnews, usnews$Type==" Private") # the subset command takes a first argument that says what you want to subset, and the second argument says what you want to subset by
nrow(private) 
## [1] 832
mean(private$satMath, na.rm = T)
## [1] 509.981
# this is actually a good annoying example because I tried to subset using subset(usnews, usnews$Type=="Private"), but if you look at str(usnews$Type)
str(usnews$Type) # you can see that there's a leading space before "Private"...!!!!
##  chr [1:1302] " Private" " Public" " Public" " Public" " Public" " Private" ...
# finally, when we start performing more complex operations with lm(), it can be helpful to no have to type the name of the dataset before each variable in the function

mcSummary(lm(usnews$satMath~0)) # also sometimes helpful, we can feed mcSummary our lm command instead of saving the model as object (i.e. mod.C)<- lm()) then feeding that to mcSummary (i.e. mcSummary(mod.C))
## lm(formula = usnews$satMath ~ 0)
## 
## Omnibus ANOVA
##                   SS  df       MS EtaSq F p
## Model                  0                   
## Error      203168839 777 261478.6          
## Corr Total 203168839 777 261478.6          
## 
##    RMSE AdjEtaSq
##  511.35       NA
## 
## Coefficients: none
# we're used to doing 'usnews$variable' but we don't have to if we use the 'data=' argument

mcSummary(lm(satMath~0, data = usnews)) # the only thing about this is you have to know how to type the variable name, as opposed to the handy feature of typing the dataset name, $, then 'tab' to see your options.
## lm(formula = satMath ~ 0, data = usnews)
## 
## Omnibus ANOVA
##                   SS  df       MS EtaSq F p
## Model                  0                   
## Error      203168839 777 261478.6          
## Corr Total 203168839 777 261478.6          
## 
##    RMSE AdjEtaSq
##  511.35       NA
## 
## Coefficients: none

ok back to grad stats for the week woooo

confidence interval review:

from the example in class of SKQ, our null hypothesis was that \(\beta_0=10\), but we were able to reject that value.

the CI= in this sample, we can reject the null of 10. our scores are significantly higher than 10. BUT, if our CI was [10.25,15.27], we only barely rejected the null of 10.

an example with the cars dataset:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mean(mtcars$mpg)
## [1] 20.09062

see if our mean mpg is different from 15

mtcars$mpg_15<- mtcars$mpg-15
mod.c<- lm(mpg_15~0, data=mtcars)
mod.a<- lm(mpg_15~1, data=mtcars)


modelCompare(mod.c, mod.a)
## SSE (Compact) =  1955.31 
## SSE (Augmented) =  1126.047 
## Delta R-Squared =  0 
## Partial Eta-Squared (PRE) =  0.4241081 
## F(1,31) = 22.82955, p = 4.054136e-05
mcSummary(mod.a)
## lm(formula = mpg_15 ~ 1, data = mtcars)
## 
## Omnibus ANOVA
##                  SS df     MS EtaSq F p
## Model         0.000  0    Inf     0    
## Error      1126.047 31 36.324          
## Corr Total 1126.047 31 36.324          
## 
##   RMSE AdjEtaSq
##  6.027        0
## 
## Coefficients
##               Est StErr     t  SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 5.091 1.065 4.778 829.263 0.424  NA  2.918   7.264 0
mod.a_nodiff<- lm(mpg~1, data=mtcars)
mcSummary(mod.a_nodiff)
## lm(formula = mpg ~ 1, data = mtcars)
## 
## Omnibus ANOVA
##                  SS df     MS EtaSq F p
## Model         0.000  0    Inf     0    
## Error      1126.047 31 36.324          
## Corr Total 1126.047 31 36.324          
## 
##   RMSE AdjEtaSq
##  6.027        0
## 
## Coefficients
##                Est StErr      t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 20.091 1.065 18.857 12916.26  0.92  NA 17.918  22.264 0

our mean mpg is different from 15, with a mean mpg of 15+5.091 (this is confusing, the beta is 5.091, but remember we subtracted 15 from all values!) and CI of [2.92,7.26]. similarly, we could add 15 to the CIs to maybe understand the values in relation to the real mpgs more.

if we don’t use the difference test method and instead just regress our mpg data on the mean, our \(b=20.091\) (the mean) with a CI of [17.918, 22.264].

in repeated analyses, that CI should contain the true beta 95% of the time. AKA that range contains the null values we would not reject.

let’s see:

mtcars$mpg_18<- mtcars$mpg-18
mod.c.18<- lm(mpg_18~0, data=mtcars)
mod.a.18<- lm(mpg_18~1, data=mtcars)


modelCompare(mod.c.18, mod.a.18)
## SSE (Compact) =  1265.91 
## SSE (Augmented) =  1126.047 
## Delta R-Squared =  0 
## Partial Eta-Squared (PRE) =  0.110484 
## F(1,31) = 3.850413, p = 0.05876422
mcSummary(mod.a.18)
## lm(formula = mpg_18 ~ 1, data = mtcars)
## 
## Omnibus ANOVA
##                  SS df     MS EtaSq F p
## Model         0.000  0    Inf     0    
## Error      1126.047 31 36.324          
## Corr Total 1126.047 31 36.324          
## 
##   RMSE AdjEtaSq
##  6.027        0
## 
## Coefficients
##               Est StErr     t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 2.091 1.065 1.962 139.863  0.11  NA -0.082   4.264 0.059

comparing our dataset to a null mpg of 18, we see that we just barely can’t reject the null (F(1,31)=3.85, p=.059). it gives us a beta of 2.091 (+18) and a CI of [-.082, 4.264]. because this CI contains zero, that’s another big predictor here (because we are analyzing a difference test) that we can’t reject the null!