ways to look at data (hist, range, summary, str)
subset
using the data= function in lm()
usnews<- read.csv("/Users/clairemorrison/Desktop/gradstats2022/usnews.csv", header=T, na.strings = ".") ### remember, this dataset uses "."s to denote missing data so I'm going to convert them to NAs with na.strings="."
head(usnews) # head is a great way to first look at your data, it gives you all the variable names and first 6 rows of data
## FICE College State Type satMath satVerbal
## 1 1061 Alaska Pacific University AK Private 490 482
## 2 1063 University of Alaska at Fairbanks AK Public 499 462
## 3 1065 University of Alaska Southeast AK Public NA NA
## 4 11462 University of Alaska at Anchorage AK Public 459 422
## 5 1002 Alabama Agri. & Mech. Univ. AL Public NA NA
## 6 1003 Faulkner University AL Private NA NA
## satTotal act math1q math3q verbal1q verbal3q act1q act3q numApps numAccept
## 1 972 20 440 530 430 550 18 22 193 146
## 2 961 22 NA NA NA NA NA NA 1852 1427
## 3 NA NA NA NA NA NA NA NA 146 117
## 4 881 20 NA NA NA NA NA NA 2065 1598
## 5 NA 17 NA NA NA NA 14 17 2817 1920
## 6 NA 20 NA NA NA NA NA NA 345 320
## numMatrix top10HS top25HS numFulltime numParttime inTuition outTuition
## 1 55 16 44 249 869 7560 7560
## 2 928 NA NA 3885 4519 1742 5226
## 3 89 4 24 492 1849 1742 5226
## 4 1162 NA NA 6209 10537 1742 5226
## 5 984 NA NA 3958 305 1700 3400
## 6 179 NA 27 1367 578 5600 5600
## roomBoard room board fees books personal PhD termDeg sfRatio alumDonate
## 1 4120 1620 2500 130 800 1500 76 72 11.9 2
## 2 3590 1800 1790 155 650 2304 67 NA 10.0 8
## 3 4764 2514 2250 34 500 1162 39 51 9.5 NA
## 4 5120 2600 2520 114 580 1260 48 NA 13.7 6
## 5 2550 1108 1442 155 500 850 53 53 14.3 NA
## 6 3250 1550 1700 300 350 NA 52 56 32.8 NA
## spendPerStudent gradRate accptRate typeCode
## 1 10922 15 0.756 1
## 2 11935 NA 0.771 0
## 3 9584 39 0.801 0
## 4 8046 NA 0.774 0
## 5 7043 40 0.682 0
## 6 3971 55 0.928 1
tail(usnews) # you can also do "tail()" to see the bottom of the datafile
## FICE College State Type satMath
## 1297 3825 West Virginia Institute of Technology WV Public NA
## 1298 3826 West Virginia State College WV Public NA
## 1299 3827 West Virginia University WV Public 507
## 1300 3830 West Virginia Wesleyan College WV Private 489
## 1301 3831 Wheeling Jesuit College WV Private 479
## 1302 3932 University of Wyoming WY Public NA
## satVerbal satTotal act math1q math3q verbal1q verbal3q act1q act3q numApps
## 1297 NA NA 20 NA NA NA NA 16 28 1594
## 1298 NA NA 18 NA NA NA NA NA NA 1869
## 1299 439 946 22 450 560 380 490 20 24 9630
## 1300 439 928 23 420 560 370 500 20 27 1566
## 1301 433 912 22 410 520 380 490 19 24 903
## 1302 NA NA 23 NA NA NA NA 20 26 2029
## numAccept numMatrix top10HS top25HS numFulltime numParttime inTuition
## 1297 1572 675 NA NA 2432 616 3954
## 1298 NA 957 NA NA 2817 1939 1988
## 1299 7801 2881 23 49 14524 1053 2128
## 1300 1400 483 28 55 1509 170 14200
## 1301 755 213 15 49 971 305 10500
## 1302 1516 1073 23 46 7535 1488 1908
## outTuition roomBoard room board fees books personal PhD termDeg sfRatio
## 1297 NA 3752 1800 1952 NA 500 NA NA 31 15.3
## 1298 4616 3200 1500 1700 50 750 750 38 38 19.2
## 1299 6370 4310 2284 2026 NA NA NA 83 86 13.4
## 1300 14200 3775 1750 2025 NA 450 1100 58 81 16.4
## 1301 10500 4545 2100 2445 NA 600 600 66 71 14.1
## 1302 5988 3422 1462 1960 300 600 1500 91 94 15.1
## alumDonate spendPerStudent gradRate accptRate typeCode
## 1297 NA 4325 56 0.986 0
## 1298 4 3839 NA NA 0
## 1299 NA 8318 57 0.810 0
## 1300 42 8080 67 0.894 1
## 1301 27 7494 72 0.836 1
## 1302 13 8745 45 0.747 0
# let's say we want to ask a question about math SAT scores, but first want to know if the data are normally distributed
# an easy way to do this is the built in R function "hist()"
hist(usnews$satMath) ### this is pretty no frills, so not always the best plot to publish but definitely adequate for checking out data
range(usnews$satMath, na.rm = T) # if there are any outliers in the data they will show in a histogram, but we could also look at the range of data. remember! we have to do na.rm=T becuase we have NAs in the data we want to ignore
## [1] 320 750
summary(usnews$satMath) # summary gives us more values we might want (the count of NAs is helpful)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 320.0 460.0 500.0 506.8 544.0 750.0 525
table(usnews$Type) # table gives us a count of how many observations exist in the dataset per variable values
##
## Private Public
## 832 470
str(usnews) # str is a useful command to see if your variables are characters (words), integers (whole numbers) or numeric (numbers with decimals). this is helpful because sometimes you might read in a dataset and R might think a number is a character (if a variable is a mix of numbers and characters R will classify it as a character) and then you wouldn't be able to perform operations on it.
## 'data.frame': 1302 obs. of 37 variables:
## $ FICE : int 1061 1063 1065 11462 1002 1003 1004 1005 1009 1012 ...
## $ College : chr " Alaska Pacific University" " University of Alaska at Fairbanks" " University of Alaska Southeast" " University of Alaska at Anchorage" ...
## $ State : chr " AK" " AK" " AK" " AK" ...
## $ Type : chr " Private" " Public" " Public" " Public" ...
## $ satMath : int 490 499 NA 459 NA NA NA NA 575 575 ...
## $ satVerbal : int 482 462 NA 422 NA NA NA NA 501 525 ...
## $ satTotal : int 972 961 NA 881 NA NA NA NA 1076 1100 ...
## $ act : int 20 22 NA 20 17 20 21 NA 24 26 ...
## $ math1q : int 440 NA NA NA NA NA NA NA 520 470 ...
## $ math3q : int 530 NA NA NA NA NA NA NA 638 680 ...
## $ verbal1q : int 430 NA NA NA NA NA NA NA 453 460 ...
## $ verbal3q : int 550 NA NA NA NA NA NA NA 559 650 ...
## $ act1q : int 18 NA NA NA 14 NA 18 NA 21 23 ...
## $ act3q : int 22 NA NA NA 17 NA 23 NA 27 29 ...
## $ numApps : int 193 1852 146 2065 2817 345 1351 4639 7548 805 ...
## $ numAccept : int 146 1427 117 1598 1920 320 892 3272 6791 588 ...
## $ numMatrix : int 55 928 89 1162 984 179 570 1278 3070 287 ...
## $ top10HS : int 16 NA 4 NA NA NA 18 NA 25 67 ...
## $ top25HS : int 44 NA 24 NA NA 27 78 NA 57 88 ...
## $ numFulltime : int 249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
## $ numParttime : int 869 4519 1849 10537 305 578 331 405 1716 207 ...
## $ inTuition : int 7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
## $ outTuition : int 7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
## $ roomBoard : int 4120 3590 4764 5120 2550 3250 3030 1960 3933 4325 ...
## $ room : int 1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
## $ board : int 2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
## $ fees : int 130 155 34 114 155 300 124 84 NA 120 ...
## $ books : int 800 650 500 580 500 350 300 500 600 400 ...
## $ personal : int 1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
## $ PhD : int 76 67 39 48 53 52 72 48 85 74 ...
## $ termDeg : int 72 NA 51 NA 53 56 72 53 91 79 ...
## $ sfRatio : num 11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
## $ alumDonate : int 2 8 NA 6 NA NA 8 NA 18 34 ...
## $ spendPerStudent: int 10922 11935 9584 8046 7043 3971 5883 NA 6642 8649 ...
## $ gradRate : int 15 NA 39 NA 40 55 51 15 69 72 ...
## $ accptRate : num 0.756 0.771 0.801 0.774 0.682 0.928 0.66 0.705 0.9 0.73 ...
## $ typeCode : int 1 0 0 0 0 1 0 0 0 1 ...
# now, let's say we want to look at SAT math scores specifically for private schools. we learned how to do tapply to look at means or other functions
tapply(X = usnews$satMath, INDEX = usnews$Type, FUN = mean, na.rm=T) # if you want to start a function (e.g. tapply) sometimes it's nice to write the function then the first open parentheses then hit tab. hitting tab shows you the options of input arguments for that function. for example, in tapply, we feed it 'X'= the dataset and variable we want to get the mean of (in this case mean but could do other functions), 'INDEX'= the variable we are indexing by, and 'FUN'= operation to perform
## Private Public
## 509.981 500.212
# it might also for other reasons be good to just subset the data by private schools, then we could just take the mean of sat MATH and it should match 509.981
private<- subset(usnews, usnews$Type==" Private") # the subset command takes a first argument that says what you want to subset, and the second argument says what you want to subset by
nrow(private)
## [1] 832
mean(private$satMath, na.rm = T)
## [1] 509.981
# this is actually a good annoying example because I tried to subset using subset(usnews, usnews$Type=="Private"), but if you look at str(usnews$Type)
str(usnews$Type) # you can see that there's a leading space before "Private"...!!!!
## chr [1:1302] " Private" " Public" " Public" " Public" " Public" " Private" ...
# finally, when we start performing more complex operations with lm(), it can be helpful to no have to type the name of the dataset before each variable in the function
mcSummary(lm(usnews$satMath~0)) # also sometimes helpful, we can feed mcSummary our lm command instead of saving the model as object (i.e. mod.C)<- lm()) then feeding that to mcSummary (i.e. mcSummary(mod.C))
## lm(formula = usnews$satMath ~ 0)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0
## Error 203168839 777 261478.6
## Corr Total 203168839 777 261478.6
##
## RMSE AdjEtaSq
## 511.35 NA
##
## Coefficients: none
# we're used to doing 'usnews$variable' but we don't have to if we use the 'data=' argument
mcSummary(lm(satMath~0, data = usnews)) # the only thing about this is you have to know how to type the variable name, as opposed to the handy feature of typing the dataset name, $, then 'tab' to see your options.
## lm(formula = satMath ~ 0, data = usnews)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0
## Error 203168839 777 261478.6
## Corr Total 203168839 777 261478.6
##
## RMSE AdjEtaSq
## 511.35 NA
##
## Coefficients: none
confidence interval review:
from the example in class of SKQ, our null hypothesis was that \(\beta_0=10\), but we were able to reject that value.
the CI= in this sample, we can reject the null of 10. our scores are significantly higher than 10. BUT, if our CI was [10.25,15.27], we only barely rejected the null of 10.
an example with the cars dataset:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mean(mtcars$mpg)
## [1] 20.09062
see if our mean mpg is different from 15
mtcars$mpg_15<- mtcars$mpg-15
mod.c<- lm(mpg_15~0, data=mtcars)
mod.a<- lm(mpg_15~1, data=mtcars)
modelCompare(mod.c, mod.a)
## SSE (Compact) = 1955.31
## SSE (Augmented) = 1126.047
## Delta R-Squared = 0
## Partial Eta-Squared (PRE) = 0.4241081
## F(1,31) = 22.82955, p = 4.054136e-05
mcSummary(mod.a)
## lm(formula = mpg_15 ~ 1, data = mtcars)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.000 0 Inf 0
## Error 1126.047 31 36.324
## Corr Total 1126.047 31 36.324
##
## RMSE AdjEtaSq
## 6.027 0
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 5.091 1.065 4.778 829.263 0.424 NA 2.918 7.264 0
mod.a_nodiff<- lm(mpg~1, data=mtcars)
mcSummary(mod.a_nodiff)
## lm(formula = mpg ~ 1, data = mtcars)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.000 0 Inf 0
## Error 1126.047 31 36.324
## Corr Total 1126.047 31 36.324
##
## RMSE AdjEtaSq
## 6.027 0
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 20.091 1.065 18.857 12916.26 0.92 NA 17.918 22.264 0
our mean mpg is different from 15, with a mean mpg of 15+5.091 (this is confusing, the beta is 5.091, but remember we subtracted 15 from all values!) and CI of [2.92,7.26]. similarly, we could add 15 to the CIs to maybe understand the values in relation to the real mpgs more.
if we don’t use the difference test method and instead just regress our mpg data on the mean, our \(b=20.091\) (the mean) with a CI of [17.918, 22.264].
in repeated analyses, that CI should contain the true beta 95% of the time. AKA that range contains the null values we would not reject.
let’s see:
mtcars$mpg_18<- mtcars$mpg-18
mod.c.18<- lm(mpg_18~0, data=mtcars)
mod.a.18<- lm(mpg_18~1, data=mtcars)
modelCompare(mod.c.18, mod.a.18)
## SSE (Compact) = 1265.91
## SSE (Augmented) = 1126.047
## Delta R-Squared = 0
## Partial Eta-Squared (PRE) = 0.110484
## F(1,31) = 3.850413, p = 0.05876422
mcSummary(mod.a.18)
## lm(formula = mpg_18 ~ 1, data = mtcars)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 0.000 0 Inf 0
## Error 1126.047 31 36.324
## Corr Total 1126.047 31 36.324
##
## RMSE AdjEtaSq
## 6.027 0
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 2.091 1.065 1.962 139.863 0.11 NA -0.082 4.264 0.059
comparing our dataset to a null mpg of 18, we see that we just barely can’t reject the null (F(1,31)=3.85, p=.059). it gives us a beta of 2.091 (+18) and a CI of [-.082, 4.264]. because this CI contains zero, that’s another big predictor here (because we are analyzing a difference test) that we can’t reject the null!