Welcome to your first R Notebook (of the class!). We’ll start off with a brief guided tour of R, and then jump right into some code.
To start us off, you will need to install a few packages. The base version of R comes with many built-in functions to help us accomplish tasks, but it is somewhat limited in what it can do. To get more functionality out of R, we will install packages (also called libraries), which are collections of new functions that help us do more within RStudio. You can press play on the code block below to install the packages.
#The # symbol creates a note or comment! R won't run anything that comes after it on the same line
#ONLY INSTALL PACKAGES ONCE
#Do not install these again if you already have them. You can check if they need to be updated by going to Tools>Check for Package Updates...
#I don't recommend updating everything at once if you have a lot of packages, it's quite time consuming
#Remove the # from the three install lines below to run them, then delete or comment out this block after running
#install.packages('dplyr')
#install.packages('ggplot2')
#install.packages('tidyr')
#next, we'll load in the packages. We need to do this every time we start a new RStudio session, even though we only install a package once
library(dplyr)
library(ggplot2)
library(tidyr)
You may get a warning about some functions being ‘masked’ by the packages we’ve loaded in. This is fine, you can generally ignore this message.
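The 'masked' message just means that two packages define a function with the same name, and R will use the one from the package loaded most recently. Here is a minimal sketch of how the `package::function` form picks a specific version. The `toy` data frame is made up for illustration and is not part of the course data.

```r
library(dplyr)

# a tiny made-up data frame, just for illustration
toy <- data.frame(state = c("NY", "PA", "NY"), value = c(1, 2, 3))

# after library(dplyr), filter() refers to dplyr's version
ny_rows <- filter(toy, state == "NY")

# writing dplyr::filter() does the same thing, but spells out the package
ny_rows_explicit <- dplyr::filter(toy, state == "NY")

# the base R function that got masked is still available as stats::filter()
nrow(ny_rows)
```

This is why you will sometimes see me write `dplyr::select` instead of just `select`: it removes any doubt about which package the function comes from.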
The next thing we’ll do is load in data. We will use raw data in this class generally, rather than data that has been cleaned already - my goal is for you to learn how to work with real world data as you learn how to analyze it. This will involve working with data that is often messy at first, but we’ll clean it up and put it in a neat format. The dplyr package will help us to do this, and I’ll cover functions from this package as we go.
Let’s load in our first data set:
#first, figure out where you saved the file. Never save to your desktop or downloads folders exclusively, or you'll lose data. We'll create a new folder just for this course, and navigate there every time. I'll walk you through the steps to do that.
setwd("~/Binghamton/dida380b/data")
#then, use read.csv to load in the data. We will mostly use csv files in this course - if you have the option to download a .csv or .xlsx file, always choose the .csv. Don't forget that I renamed the file to make it neat like this. Press play to run the full block, not just the line below.
data <- read.csv("raw_county_data.csv")
#This will display the first few rows for inspection. You can also just click on the data set in the Environment pane.
head(data)
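If read.csv complains that it cannot find the file, the working directory is usually the culprit. Here is a minimal sketch of how you might check. It uses a temporary folder and a made-up file, so `example.csv` and its contents are hypothetical stand-ins for the course data.

```r
# create a small made-up csv in a temporary folder so this example runs anywhere
folder <- tempdir()
path <- file.path(folder, "example.csv")
write.csv(data.frame(county = c("A", "B"), pop = c(100, 200)), path, row.names = FALSE)

# getwd() prints the folder R is currently looking in
getwd()

# file.exists() returns TRUE if R can see a file at that location
file.exists(path)

# read.csv also accepts a full path, so setwd() is optional
example <- read.csv(path)
head(example)
```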
We’ll primarily be working with the socioeconomic variables in this dataset, although there are some interesting health variables as well. You can find a description of the variables here: https://opportunityinsights.org/wp-content/uploads/2018/04/health_ineq_online_table_12_readme.pdf. The full data set is from: https://opportunityinsights.org/data/. Some of the variables are from 2000 and are a bit outdated, but this will still work for our purposes. We’ll use more recent data later on.
| Using Packages |
In this next section, we’ll use our first package, dplyr. This package is one of the most important you will learn, and it is essential for cleaning and managing your data. I’ll show you a few functions from this package, but we will actually only need it quite briefly.
#before I do anything, let's see what variables we have in the data
#this is just a simple base R function
colnames(data)
[1] "cty" "county_name"
[3] "cty_pop2000" "cz"
[5] "cz_name" "cz_pop2000"
[7] "statename" "state_id"
[9] "stateabbrv" "csa"
[11] "csa_name" "cbsa"
[13] "cbsa_name" "intersects_msa"
[15] "cur_smoke_q1" "cur_smoke_q2"
[17] "cur_smoke_q3" "cur_smoke_q4"
[19] "bmi_obese_q1" "bmi_obese_q2"
[21] "bmi_obese_q3" "bmi_obese_q4"
[23] "exercise_any_q1" "exercise_any_q2"
[25] "exercise_any_q3" "exercise_any_q4"
[27] "puninsured2010" "reimb_penroll_adj10"
[29] "mort_30day_hosp_z" "adjmortmeas_amiall30day"
[31] "adjmortmeas_chfall30day" "adjmortmeas_pnall30day"
[33] "med_prev_qual_z" "primcarevis_10"
[35] "diab_hemotest_10" "diab_eyeexam_10"
[37] "diab_lipids_10" "mammogram_10"
[39] "amb_disch_per1000_10" "cs00_seg_inc"
[41] "cs00_seg_inc_pov25" "cs00_seg_inc_aff75"
[43] "cs_race_theil_2000" "gini99"
[45] "poor_share" "inc_share_1perc"
[47] "frac_middleclass" "scap_ski90pcm"
[49] "rel_tot" "cs_frac_black"
[51] "cs_frac_hisp" "unemp_rate"
[53] "pop_d_2000_1980" "lf_d_2000_1980"
[55] "cs_labforce" "cs_elf_ind_man"
[57] "cs_born_foreign" "mig_inflow"
[59] "mig_outflow" "pop_density"
[61] "frac_traveltime_lt15" "hhinc00"
[63] "median_house_value" "ccd_exp_tot"
[65] "ccd_pup_tch_ratio" "score_r"
[67] "dropout_r" "cs_educ_ba"
[69] "tuition" "gradrate_r"
[71] "e_rank_b" "cs_fam_wkidsinglemom"
[73] "crime_total" "subcty_exp_pc"
[75] "taxrate" "tax_st_diff_top20"
#I'll use this printout to pick the variables I want
#if you're not sure what some of these are, check out the data dictionary that I provided a link to above
#let's grab county_name, stateabbrv, median_house_value, frac_middleclass, dropout_r, crime_total, unemp_rate, taxrate, mig_inflow
#the select function from dplyr will let us do this, but first we need to understand the syntax
data_small <- data %>% dplyr::select(county_name, stateabbrv, frac_middleclass,
                                     mig_inflow,
                                     dropout_r, crime_total, unemp_rate,
                                     taxrate, median_house_value)
What did I do there?
So, our final statement is:
Take data, then select the following columns and assign them to the data_small object.
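The two pieces of new syntax in that code are the assignment arrow `<-`, which stores a result in an object, and the pipe `%>%`, which you can read as 'then'. The pipe passes whatever is on its left into the function on its right as the first argument. Here is a small sketch, using a made-up data frame (`toy` is hypothetical), showing that the piped and unpiped versions give the same result.

```r
library(dplyr)

# a tiny made-up data frame, just for illustration
toy <- data.frame(name = c("a", "b"), x = c(1, 2), y = c(3, 4))

# with the pipe: take toy, then select two columns
piped <- toy %>% dplyr::select(name, x)

# without the pipe: toy is passed in as the first argument
unpiped <- dplyr::select(toy, name, x)

# both objects hold the same two columns
identical(piped, unpiped)
```

The pipe becomes more helpful as we chain several steps together, because the code reads top to bottom in the order the steps happen.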
#let's take a look at the data with the head() function, which prints out the first six rows:
head(data_small)
#These are all measured in fractions - what if I wanted to convert them to percentages? The mutate function will do this for us
#Mutate will add a new column to a dataframe
data_small <- data_small %>% mutate(percent_middleclass = frac_middleclass*100)
#print out the first six values of the new variable, to check that it's measured in percentages
head(data_small$percent_middleclass)
[1] 51.950 49.911 40.833 46.136 59.722 34.630
Alright, now we have a data set to work with. We’re going to be looking at some factors that might impact median house values by county. Here’s why I picked each variable:
There are many other variables I could have included here as well! This is certainly not comprehensive.
Now, we’ll start to summarise the data. This is how we get insights from our data.
#making a summary table is easy with dplyr
#Typically you'll summarise by a group
#in our case, this will be state to make things simple. Here's what this looks like with dplyr, using the group_by and summarise functions:
summary <- data_small %>%
  group_by(stateabbrv) %>%
  summarise(total_observations = n(),
            avg_middleclass = mean(frac_middleclass, na.rm = TRUE),
            avg_mig = mean(mig_inflow, na.rm = TRUE),
            avg_dropout = mean(dropout_r, na.rm = TRUE),
            avg_crime = mean(crime_total, na.rm = TRUE),
            avg_tax = mean(taxrate, na.rm = TRUE),
            avg_unemp = mean(unemp_rate, na.rm = TRUE),
            med_hv = median(median_house_value, na.rm = TRUE))
summary
What did I actually do here? I calculated means and medians for the variables in the data by state. The statement for this code is something like this:
Take data_small, group it by state, then calculate the following summary variables.
You can then sort through the dataset to see, for example, which states had the highest median home values, or which states have the highest tax rates. Often, we’ll have a lot of data, and it will be challenging to show it all at once. Summary tables can help us to make sense of the data in a more manageable way. We won’t use this summary table for too much right now, but we’ll come back to this code in future tutorials.
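To see what group_by, summarise, and the na.rm argument are each doing, it can help to try them on data small enough to check by hand. This is a sketch with a made-up data frame (`toy` and its column names are hypothetical), not the course data.

```r
library(dplyr)

# made-up data: two states, with one missing value in PA
toy <- data.frame(state = c("NY", "NY", "PA", "PA"),
                  rate = c(2, 4, 10, NA))

toy_summary <- toy %>%
  group_by(state) %>%
  summarise(n_rows = n(),                            # number of rows in each group
            avg_with_na = mean(rate),                # NA for PA because of the missing value
            avg_no_na = mean(rate, na.rm = TRUE))    # missing value is dropped first

toy_summary
```

Without na.rm = TRUE, a single missing value turns the whole group's mean into NA, which is why it appears on every line of the summary code.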
| Two more dplyr functions |
One additional dplyr function that is typically quite useful is the filter function. Let’s say I only want to look at NYS data. I can’t select a column to do this; instead, I would need to filter the data by the rows, and only grab the rows where stateabbrv is equal to NY. We won’t need to do this right now, but I’ll show you how it works so you can use it in the future.
ny_data <- data_small %>%
  #a single = assigns a value, so a double == is used to test whether two things are equal. This is called a logical operation
  filter(stateabbrv == "NY")
head(ny_data)
How do logical operations work? Essentially, any time I use an ==, >=, <=, or != operation, R checks every value in the variable I give it and returns TRUE where the condition is met and FALSE where it is not. Inside the filter function, R will only keep the rows where the result is TRUE. Let me show you what I mean:
#this is not a dplyr function, so we have to tell R to use the stateabbrv variable from the data_small data frame by using the $
#this is needed anytime we're not using a dplyr function
#if there's no %>% operator, we probably need to select a variable in this way
#head() limits the printout to the first six results
head(data_small$stateabbrv == 'NY')
[1] FALSE FALSE FALSE FALSE FALSE FALSE
NY is quite far down the list (the first 1000 entries are all FALSE), so here we only see FALSE. Inside filter, R will only keep the rows where stateabbrv == "NY" is TRUE.
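Rather than scrolling through a long printout of TRUE and FALSE, a few base R functions can summarise a logical vector for us. Here is a minimal sketch with a made-up vector (`states` is hypothetical):

```r
# made-up vector of state abbreviations
states <- c("AL", "NY", "PA", "NY")

# the comparison returns one TRUE or FALSE per element
is_ny <- states == "NY"
is_ny

# TRUE counts as 1, so sum() tells us how many matches there are
sum(is_ny)

# which() gives the positions of the TRUE values
which(is_ny)

# square brackets keep only the elements where the condition is TRUE
states[is_ny]
```

The last line is essentially what filter does for us with the rows of a data frame.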
What if I want to sort the data by one of the variables? Maybe I want to know which counties have the highest crime rates. I can do this easily with the arrange() function:
ny_data %>%
arrange(crime_total) %>%
head()
#But wait! It's arranged from smallest to largest. How can I fix this? Let's check the help documentation:
?arrange
#now we'll fix it:
ny_data %>%
arrange(desc(crime_total)) %>%
head()
To recap, we’ve now learned the following dplyr functions:
select: choose columns from a dataframe
filter: keep the rows of a dataframe that meet a condition
group_by: group the observations by a categorical variable
summarise: produce summary statistics, either for the full dataset or for groups (using group_by)
mutate: create a new column
arrange: sort the data (ascending order by default)
We’ve also learned some summary statistics (a few of these are new, but they are useful to know):
mean: take the mean, or average, of a variable
median: find the middle value of a variable
sum: add the values in a variable
n: count the number of rows in the data, or in each group (if group_by is used)
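To see how these functions fit together, here is a sketch that chains all of them in one pipeline. The `toy` data frame and its column names are made up for illustration, so treat this as a template rather than part of our analysis.

```r
library(dplyr)

# made-up data: four counties in two states
toy <- data.frame(county = c("A", "B", "C", "D"),
                  state = c("NY", "NY", "PA", "PA"),
                  frac = c(0.5, 0.4, 0.6, 0.7),
                  extra = c(1, 2, 3, 4))

recap <- toy %>%
  dplyr::select(county, state, frac) %>%   # keep three columns
  filter(frac > 0.45) %>%                  # keep rows where frac is above 0.45
  mutate(percent = frac * 100) %>%         # add a new column
  group_by(state) %>%                      # group the rows by state
  summarise(n_counties = n(),
            avg_percent = mean(percent)) %>%
  arrange(desc(avg_percent))               # sort from largest to smallest

recap
```

Each line takes the result of the line above it, so you can read the whole thing as one sentence with 'then' between the steps.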
| Graphing Our Data |
Let’s take a look at the relationships between median home values and our other variables! Scatter plots are good for this purpose. We’ll use another package, ggplot2, to make graphs. First we’ll take a look at the distributions of our variables using histograms.
#a histogram needs a continuous (numeric) variable on the x axis, so we'll use crime_total
ggplot(data_small, aes(x = crime_total))+
  #The geom layer tells R which type of plot to make
  geom_histogram(fill = "blue")
#we can make this prettier by adding more layers! I'll add a title and a theme
ggplot(data_small, aes(x = crime_total))+
#The geom layer tells R which type of plot to make
geom_histogram(fill = "skyblue", color = "gray")+
#the labs layer adds titles and axis labels - we can even add subtitles and captions
labs(title = "Total Crime in US Counties", x = "Crime Total", subtitle = "Crime Rate in the Year 2000")+
#This adds a theme to your plot! Themes will change the look of a plot, including the background and axis line colors, fonts, formatting, and other features
theme_minimal()
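A histogram needs a numeric (continuous) variable on the x axis, so a text column such as county_name will not work with geom_histogram. For a categorical variable, geom_bar counts the rows in each category instead. Here is a small sketch with made-up data (`toy` is hypothetical):

```r
library(ggplot2)

# made-up data: one categorical column and one numeric column
toy <- data.frame(state = c("NY", "NY", "PA", "NJ", "PA", "NY"),
                  rate = c(1.2, 3.4, 2.2, 5.1, 2.8, 3.9))

# continuous variable: geom_histogram groups the values into bins
hist_plot <- ggplot(toy, aes(x = rate)) +
  geom_histogram(bins = 5, fill = "skyblue", color = "gray")

# categorical variable: geom_bar counts how many rows fall in each category
bar_plot <- ggplot(toy, aes(x = state)) +
  geom_bar(fill = "skyblue")

hist_plot
bar_plot
```

The bins argument controls how many bars the histogram uses. It is worth trying a few values, since the shape of the distribution can look quite different.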
Let’s look at a few more plot types! The next plot type is a point plot:
#create a point plot for median_house_values versus unemployment
ggplot(data_small, aes(x = unemp_rate, y = log(median_house_value)))+
#I've just made the points bigger and slightly transparent
geom_point(color = "aquamarine", alpha = 0.8, size = 2) +
#this code adds a line of best fit to the point plot
geom_smooth(method = "lm")+
theme_dark()+
labs(title = "Median House Value vs Unemployment")
The next plot type is a bar plot. Let’s plot crime in the five NY counties with the highest crime totals:
#you can create a ggplot within a dplyr pipeline! It's pretty easy to do:
ny_data %>%
  arrange(desc(crime_total)) %>%
  head(n = 5) %>%
  ggplot(aes(x = county_name, y = crime_total))+
  #we need to include stat = "identity" here so that R accepts an x and y value for the bar plot - otherwise, we get an error
  geom_bar(stat = "identity", fill = "tomato")+
  theme_dark()+
  labs(title = "Highest Crime Counties in NYS")
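As a side note, ggplot2 also has a geom_col layer, which uses the y values as the bar heights by default, so it needs no stat argument. This sketch compares the two on a made-up data frame (`toy` is hypothetical). Either form is fine to use.

```r
library(ggplot2)

# made-up data: one value per county
toy <- data.frame(county = c("A", "B", "C"), crime = c(10, 25, 5))

# geom_bar needs stat = "identity" to use the y values as bar heights
p1 <- ggplot(toy, aes(x = county, y = crime)) +
  geom_bar(stat = "identity", fill = "tomato")

# geom_col uses the y values as bar heights by default
p2 <- ggplot(toy, aes(x = county, y = crime)) +
  geom_col(fill = "tomato")

p1
p2
```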
#If we have observations within groups, we can make a boxplot to visualize the distribution. We'll put it next to a bar plot of the same data to compare:
plot1 <- data_small %>%
  #select a few states to compare
  filter(stateabbrv %in% c("NY", "PA", "NJ", "MA", "VT")) %>%
  ggplot(aes(x = stateabbrv, y = crime_total))+
  #we need to include stat = "identity" here so that R accepts an x and y value for the bar plot - otherwise, we get an error
  geom_bar(stat = "identity", fill = "lavenderblush")+
  theme_dark()+
  labs(title = "Crime by State", y = "Crime Rate", x = "State")
plot2 <- data_small %>%
  #select a few states to compare
  filter(stateabbrv %in% c("NY", "PA", "NJ", "MA", "VT")) %>%
  ggplot(aes(x = stateabbrv, y = crime_total))+
  #geom_boxplot calculates the median and quartiles for us, so no stat argument is needed
  geom_boxplot(fill = "lavenderblush")+
  theme_dark()+
  labs(title = "Crime by State", y = "Crime Rate", x = "State")
#cowplot is a new package - run install.packages('cowplot') once if you don't have it yet
library(cowplot)
#plot_grid places the two plots side by side
plot_grid(plot1, plot2)
What if I want a different color for each state? I’ll show you two ways to do this.
#we could use the default color palettes, which are a bit boring
data_small %>%
#select a few states to compare
filter(stateabbrv %in% c("NY", "PA", "NJ", "MA", "VT")) %>%
#if we want to color by a variable, put it in the aes mapping!
ggplot(aes(x = stateabbrv, y = crime_total, fill = stateabbrv))+
#geom_boxplot calculates the median and quartiles for us, so no stat argument is needed
geom_boxplot()+
theme_dark()+
labs(title = "Crime by State", y = "Crime Rate", x = "State", fill = "State")
#ggthemes is a new package - run install.packages('ggthemes') once if you don't have it yet
library(ggthemes)
#I'll grab a palette from here: https://github.com/EmilHvitfeldt/r-color-palettes/blob/main/canva.md
#note that this palette technically only has four colors, so the fifth is always gray
#there are other palette options out there that will have more colors, see ggplot2 cheat sheet for more ideas
pal <- canva_pal(palette = "Fresh and bright")(5)
data_small %>%
#select a few states to compare
filter(stateabbrv %in% c("NY", "PA", "NJ", "MA", "VT")) %>%
#if we want to color by a variable, put it in the aes mapping!
ggplot(aes(x = stateabbrv, y = crime_total, fill = stateabbrv))+
#the boxplot layer does not need a stat argument
geom_boxplot()+
theme_dark()+
scale_fill_manual(values = pal)+
labs(title = "Crime by State", y = "Crime Rate", x = "State", fill = "State")
#ggthemes also comes with some nice additional theme options:
data_small %>%
#select a few states to compare
filter(stateabbrv %in% c("NY", "PA", "NJ", "MA", "VT")) %>%
#if we want to color by a variable, put it in the aes mapping!
ggplot(aes(x = stateabbrv, y = crime_total, fill = stateabbrv))+
#the boxplot layer does not need a stat argument
geom_boxplot()+
theme_economist()+
scale_fill_manual(values = pal)+
#center the title
theme(plot.title = element_text(hjust = 0.5))+
labs(title = "Crime Rate by State, 2000", y = "Crime Rate", x = "State", fill = "State")
I won’t require you to know how to do this next part, but I like to show it anyways because it’s quite useful. Sometimes you’ll want to make plots for multiple variables at once - this is easy to do with the facet_wrap layer in ggplot.
#What if I want to make a histogram for all of my variables? This is also easy to do.
#first, keep only the numeric variables (columns 3 through 9 of data_small)
hist_data <- data_small %>%
  dplyr::select(3:9)
#the gather function is from the tidyr package - it reshapes the data so we can plot multiple histograms at once
#don't worry about this one too much - you can use it as a template for other histograms later on
ggplot(gather(hist_data), aes(value)) +
  geom_histogram(bins = 30, fill = "green") +
  facet_wrap(~key, scales = 'free_x')+
  theme_dark()
#source: https://stackoverflow.com/questions/35372365/how-do-i-generate-a-histogram-for-each-column-of-my-table
#what did we actually do? Try this code:
gather(hist_data)
With gather, our column names are now in the ‘key’ variable, and the row values are now all in the ‘value’ column. We’ve essentially reshaped our data from wide format (each variable has its own column) to long format (the variable names are stored as groups in one column). I’ll show some examples of this on the board. Once we reshape the data, we can then use the facet_wrap function from ggplot2 to tell R to create a separate graph for each group in the data. This function only works with long format data, which is why we had to change the format of our data. We’ll go over the basics of ggplot2 in a moment - I wanted to show you a few examples before we dive a bit deeper.
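Reshaping is easier to follow on a table small enough to read at a glance. This sketch uses a made-up two-column data frame (`wide` is hypothetical). It also shows pivot_longer, the newer tidyr function for the same kind of reshaping, in case you come across it in other tutorials.

```r
library(tidyr)

# wide format: each variable has its own column
wide <- data.frame(rate_a = c(1, 2), rate_b = c(3, 4))

# gather() stacks the columns: the names go in 'key' and the values go in 'value'
long <- gather(wide)
long

# pivot_longer() is the newer tidyr function for the same reshaping
# it holds the same values, though the rows come out in a different order
long_new <- pivot_longer(wide, cols = everything(),
                         names_to = "key", values_to = "value")
long_new
```

Two columns of two rows become one long table of four rows, with the 'key' column recording which original column each value came from.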
What can we learn from these histograms? Most of our variables are roughly normally distributed, with some outliers. Median House Value and Taxrate are a bit more skewed. We’ll talk about transformations later, but I might go ahead and take the log of these variables to give the distributions a more normal shape.
hist_data <- data_small %>%
  dplyr::select(3:9) %>%
  #mutate can also overwrite an existing variable with a transformed version
  #take the log of the skewed variables
  mutate(median_house_value = log(median_house_value),
         taxrate = log(taxrate))
#replot
ggplot(gather(hist_data), aes(value)) +
geom_histogram(bins = 30, fill = "green") +
facet_wrap(~key, scales = 'free_x')+
theme_dark()
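To get a feel for why the log helps with skewed variables, here is a small sketch with made-up numbers (`values` is hypothetical). A few very large values pull the mean far above the median, and taking the log brings them back in line.

```r
# made-up, right-skewed values (a few very large ones)
values <- c(50, 60, 70, 80, 90, 100, 500, 1000)

# in skewed data the mean is pulled well above the median
mean(values)
median(values)

# after taking the natural log, the mean and median are much closer together
logged <- log(values)
mean(logged)
median(logged)

# exp() undoes log(), so we can always get back to the original scale
exp(logged)
```

Keep in mind that once a variable is logged, its axis is on the log scale, so the numbers on the plot are no longer in the original units.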
Ok, now we have this nice pyramid shape for all of our variables. That is generally what we like to see. Let’s create some more point plots to look at the relationships between variables.
We’ll use code similar to what we used before, but don’t worry about the details just yet - it’s a bit more complicated than we’re ready for. You just need to be able to run it.
hist_data %>%
mutate(id = row_number()) %>%
gather(variable, value, -id, -median_house_value) %>%
ggplot(aes(y = median_house_value, x = value))+
geom_point(color = "tomato")+
geom_smooth(method = "lm", color = "black")+
facet_wrap(~variable, scales = "free_x")+
theme_dark()+
labs(y = "Median House Value", x = "Variable")
#source: https://stackoverflow.com/questions/60727366/ggplot-for-one-dependent-variable-and-multiple-indipendent
This is very useful! Now we can see the relationships between variables more clearly, and it saves us time if we want to make multiple plots like this.