First Fun with R

This document was written using R Markdown, RStudio’s handy interface for working with R in a way that lets you see both your code and its output. An R Markdown document works a lot like a Word document, except that you can put in R commands AND see what they do. When you type normally, without making a code block or comment block, R Markdown just prints what you type.

Our first R Code Box

Before using R Markdown, we will run some commands in the console today to get an idea of how R works.

First, we have to make sure that all the files we need are in the directory that R is using, which is called the ‘working directory’. We will do this with the following steps:

1. Using whatever method you usually use to make a new folder, make a directory where you will do all of your R work this semester. I will use “/Users/robincunningham/Documents/STOR 151 R Projects”.
2. To see your current ‘working directory’, type getwd() in the Console below. Whatever directory you see, you can save a lot of trouble for the rest of the course by going to your file explorer (on your PC or Mac) and making a new folder inside the one you see as your working directory. Call it “R Projects”.
3. Having completed step 2, go to the Console again and type setwd("R Projects"). This will be your working directory, and you will have to move to it every time you open R.
4. Put county.csv in “R Projects” in 2 steps:
   a. Download count.xls, and then
   b. save the file as type .csv, under the name county.csv, into “R Projects”.
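The folder steps above can also be done from the Console. This is only a sketch: the variable names start_dir and project_dir are made up for illustration, and the code assumes R is allowed to create folders inside your current working directory.

```r
# Step 2: see where R is currently working
start_dir <- getwd()
print(start_dir)

# Make the "R Projects" folder inside it, unless it already exists
# (project_dir is a hypothetical variable name)
project_dir <- file.path(start_dir, "R Projects")
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

# Step 3: move into the new folder and confirm the change
setwd(project_dir)
print(getwd())

# Step 4 check: prints TRUE once county.csv has been saved into the folder
print(file.exists("county.csv"))
```

Run this only once per session. Running it again from inside “R Projects” would make a second, nested “R Projects” folder.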

Now everything is all synched up and the hardest part of the day is over.

Enter the line of code you see below in the console to load and name the dataset. You can skip anything with a ‘#’ symbol in front; those are just comment lines, and R ignores them.

# Load dataset
county_data = read.csv("county.csv")

The command read.csv( ) will read a dataset into R from your computer or from online. “csv” stands for “comma-separated values”, a common file type where the data is listed in a text file, with the values for each variable separated by commas. For now, you don’t need to worry about the details of read.csv( ).
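To see what “comma separated” means in practice, here is a small sketch that writes a three-line csv file and reads it back. The rows are copied from the first two counties in the data set; the temporary file and the names csv_lines, tmp and small are made up for illustration.

```r
# A csv file is plain text: one header line, then one line per row,
# with the values separated by commas
csv_lines <- c(
  "name,state,pop2000,pop2010",
  "Autauga County,Alabama,43671,54571",
  "Baldwin County,Alabama,140415,182265"
)

# Write the lines to a temporary file (a stand-in for county.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# read.csv() uses the first line as the column names
small <- read.csv(tmp)
print(small)
```

The result is a table with one column per header name and one row per remaining line of the file.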

Now that we have the data set loaded, we will look at it in 3 different ways:

1. Using the command head(), which shows us the first 6 lines of the data set.

# Look at the dataset
head(county_data)

##             name   state pop2000 pop2010 fed_spend poverty homeownership
## 1 Autauga County Alabama   43671   54571  6.068095    10.6          77.5
## 2 Baldwin County Alabama  140415  182265  6.139862    12.2          76.7
## 3 Barbour County Alabama   29038   27457  8.752158    25.0          68.0
## 4    Bibb County Alabama   20826   22915  7.122016    12.6          82.9
## 5  Blount County Alabama   51024   57322  5.130910    13.4          82.0
## 6 Bullock County Alabama   11714   10914  9.973062    25.3          76.9
##   multiunit income med_income
## 1       7.2  24568      53255
## 2      22.6  26469      50147
## 3      11.1  15875      33219
## 4       6.6  19918      41770
## 5       3.7  21070      45549
## 6       9.9  20289      31602

How many variables does the data set have?
Which variables are categorical and which are numerical?
Guess what the command tail() shows you.
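Once you have written down your answers, a few other commands can help you check them. The sketch below uses a small stand-in table, county_head (a made-up name), built from some of the columns and the six rows printed above, so that it runs without county.csv.

```r
# Stand-in data: five of the columns and the six rows printed by head() above
county_head <- data.frame(
  name    = c("Autauga County", "Baldwin County", "Barbour County",
              "Bibb County", "Blount County", "Bullock County"),
  state   = rep("Alabama", 6),
  pop2000 = c(43671, 140415, 29038, 20826, 51024, 11714),
  pop2010 = c(54571, 182265, 27457, 22915, 57322, 10914),
  poverty = c(10.6, 12.2, 25.0, 12.6, 13.4, 25.3)
)

# head() takes an optional second argument: how many rows to show
print(head(county_head, 3))

# dim() gives the number of rows, then the number of columns
print(dim(county_head))

# str() lists every variable along with the kind of data it holds
str(county_head)
```

The same commands work on county_data itself once it is loaded.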

2. Another way to look at a data set is using the ‘summary()’ function.

summary(county_data)

##                 name           state         pop2000       
##  Washington County:  30   Texas   : 254   Min.   :     67  
##  Jefferson County :  25   Georgia : 159   1st Qu.:  11211  
##  Franklin County  :  24   Virginia: 134   Median :  24621  
##  Jackson County   :  23   Kentucky: 120   Mean   :  89627  
##  Lincoln County   :  23   Missouri: 115   3rd Qu.:  61792  
##  Madison County   :  19   Kansas  : 105   Max.   :9519338  
##  (Other)          :2997   (Other) :2254                    
##     pop2010          fed_spend          poverty     homeownership  
##  Min.   :     82   Min.   :  2.109   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.:  11119   1st Qu.:  6.970   1st Qu.:11.0   1st Qu.:69.50  
##  Median :  25887   Median :  8.673   Median :14.7   Median :74.60  
##  Mean   :  98294   Mean   : 10.003   Mean   :15.5   Mean   :73.27  
##  3rd Qu.:  66861   3rd Qu.: 10.877   3rd Qu.:19.0   3rd Qu.:78.40  
##  Max.   :9818605   Max.   :204.616   Max.   :53.5   Max.   :91.30  
##                                                                    
##    multiunit         income        med_income    
##  Min.   : 0.00   Min.   : 7772   Min.   : 19351  
##  1st Qu.: 6.10   1st Qu.:19030   1st Qu.: 36948  
##  Median : 9.70   Median :21765   Median : 42444  
##  Mean   :12.32   Mean   :22499   Mean   : 44259  
##  3rd Qu.:15.90   3rd Qu.:24801   3rd Qu.: 49120  
##  Max.   :98.50   Max.   :64381   Max.   :115574  
##

List 3 facts you can tell are true from the summary of the data set.
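The numbers that summary() reports for a numerical variable can also be computed one at a time. A minimal sketch, using only the six poverty values printed by head() earlier, so the results will not match the full-data summary above:

```r
# The six poverty rates shown in the head() output
poverty <- c(10.6, 12.2, 25.0, 12.6, 13.4, 25.3)

# summary() on a single numerical variable gives the same six statistics
s <- summary(poverty)
print(s)

# Each statistic has its own function
print(min(poverty))
print(median(poverty))
print(mean(poverty))
print(max(poverty))

# The 1st and 3rd quartiles come from quantile()
print(quantile(poverty, c(0.25, 0.75)))
```

For a categorical variable such as state, the summary printed above shows counts of each value instead.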

3. OK, our last task in R today will be to make the plots from the slides in class. The first is the plot of Federal Spending per Capita versus Poverty Rate. We need to tell R which columns to choose from the county_data dataset. Here is the command; please take note of how we specify the columns!

plot(county_data$poverty,county_data$fed_spend)

As you can see, a couple of the counties have really high federal spending, which squishes the rest of the data toward the bottom of the picture. We can see that it is the same picture as in the slides by setting limits on the y-axis:

plot(county_data$poverty,county_data$fed_spend, ylim = c(0,30), col="blue")

Note that I put in the blue just to be fancy.
Revise the command above to produce the second plot from the slides.

plot()
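As a recap of how the columns were specified in the plot commands, here is a small sketch. It uses a made-up stand-in table, county_head, holding the six poverty and fed_spend values printed by head() earlier. The xlab and ylab arguments are optional extras that were not used in the commands above.

```r
# Stand-in data: the poverty and fed_spend values printed by head() earlier
county_head <- data.frame(
  poverty   = c(10.6, 12.2, 25.0, 12.6, 13.4, 25.3),
  fed_spend = c(6.068095, 6.139862, 8.752158, 7.122016, 5.130910, 9.973062)
)

# The $ operator pulls one column out of the table as a plain list of numbers
print(county_head$poverty)

# plot(x, y): the first column goes on the horizontal axis,
# the second on the vertical axis; xlab and ylab set the axis labels
plot(county_head$poverty, county_head$fed_spend,
     ylim = c(0, 30), col = "blue",
     xlab = "Poverty rate", ylab = "Federal spending per capita")

# Double brackets with the column name select the same column
print(identical(county_head$poverty, county_head[["poverty"]]))
```

So table$column always means “the column with this name from this table”, and swapping in a different column name changes what gets plotted.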

prepared by Robin Cunningham