The midterm is next week! For the midterm, you’ll be given a dataset and asked some questions about it. These questions will not be markedly different from the ones we have answered in previous weeks, so reviewing our past lab material should help you get ready. Today we’ll add to our toolbox of commands and techniques for parsing and analyzing data.
Here is a list of things we have learned so far
Now let’s learn about a few more useful commands.
When we use table to get a frequency table of categorical data, we end up with the number of cases in each group. This is useful, but sometimes we want to compare these numbers. To compare things, we always need to make sure they are on the same scale. We can’t compare kilograms and pounds because they are not on the same scale. For the same reason, we can’t compare dollars and pesos, or meters and yards. So to compare things, we have to convert them to a common scale. One simple method of conversion is using percentages. By converting values into percentages, we put them on a scale of 0 to 100, which allows us to make meaningful comparisons. Let’s look at an example.
The dataset for this lab is called AB_NYC_2019.csv. It’s a dataset of rooms listed on Airbnb in NYC and contains the following variables:
| variable name | Description |
|---|---|
| id | listing ID |
| name | name of the listing |
| host_id | host ID |
| host_name | name of the host |
| neighbourhood_group | location |
| neighbourhood | area |
| latitude | latitude coordinates |
| longitude | longitude coordinates |
| room_type | listing space type |
| price | price in dollars |
| minimum_nights | minimum number of nights |
| number_of_reviews | number of reviews |
| last_review | latest review |
| reviews_per_month | number of reviews per month |
| calculated_host_listings_count | number of listings per host |
| availability_365 | number of days when listing is available for booking |
airbnb <- read.csv("./AB_NYC_2019.csv")
And run our initial commands to learn what the dataset looks like:
dim(airbnb)
[1] 48895 16
We have 48895 observations, in this case listings, and 16 variables. What kind of variables?
str(airbnb)
'data.frame': 48895 obs. of 16 variables:
$ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
$ name : Factor w/ 47906 levels ""," 1 Bed Apt in Utopic Williamsburg ",..: 12573 38016 45018 15591 19219 24849 8257 24896 15486 17573 ...
$ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
$ host_name : Factor w/ 11453 levels ""," Valéria",..: 4997 4791 2913 6210 5929 1938 3549 9649 6880 1235 ...
$ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
$ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
$ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
$ longitude : num -74 -74 -73.9 -74 -73.9 ...
$ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
$ price : int 149 225 150 89 80 200 60 79 79 150 ...
$ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
$ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
$ last_review : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
$ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
$ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
$ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
So we have a bunch of factors (i.e., categorical variables) and continuous variables that come in either numeric or integer form. Notice that when R reads a dataset with read.csv, text columns may be stored as factors (this was the default behavior before R 4.0). So, for example, the variable name is considered a factor. But is it really categorical? It’s the name of the listing, which the host chose, so it could have been anything. In this case it is technically a factor, but most levels appear only once. It is important to keep in mind that just because R says something is a factor doesn’t mean it is a categorical variable.
Using the summary command, we can get a sense of the distribution of each variable:
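One way to check whether a factor really behaves like a categorical variable is to compare its number of levels with the number of rows. The sketch below uses a small made-up data frame (`toy`, with illustrative values), not the Airbnb data, and sets `stringsAsFactors = TRUE` only to mimic the factor columns shown above.

```r
# Hypothetical toy data; stringsAsFactors = TRUE mimics the factor columns above
toy <- data.frame(
  name = c("Cozy loft", "Sunny room", "Quiet studio", "Garden flat"),
  room_type = c("Private room", "Entire home/apt", "Private room", "Shared room"),
  stringsAsFactors = TRUE
)

# Number of distinct levels in each factor, compared with the number of rows
n_rows <- nrow(toy)
name_levels <- nlevels(toy$name)       # as many levels as rows: free text
type_levels <- nlevels(toy$room_type)  # few levels: a real categorical variable
c(rows = n_rows, name = name_levels, room_type = type_levels)

# A free-text column can be converted back to character
toy$name <- as.character(toy$name)
class(toy$name)
```

If a factor has nearly as many levels as the dataset has rows, it is probably free text rather than a grouping variable.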
summary(airbnb)
id name host_id host_name
Min. : 2539 Hillside Hotel : 18 Min. : 2438 Michael : 417
1st Qu.: 9471945 Home away from home : 17 1st Qu.: 7822033 David : 403
Median :19677284 : 16 Median : 30793816 Sonder (NYC): 327
Mean :19017143 New york Multi-unit building : 16 Mean : 67620011 John : 294
3rd Qu.:29152178 Brooklyn Apartment : 12 3rd Qu.:107434423 Alex : 279
Max. :36487245 Loft Suite @ The Box House Hotel: 11 Max. :274321313 Blueground : 232
(Other) :48805 (Other) :46943
neighbourhood_group neighbourhood latitude longitude room_type
Bronx : 1091 Williamsburg : 3920 Min. :40.50 Min. :-74.24 Entire home/apt:25409
Brooklyn :20104 Bedford-Stuyvesant: 3714 1st Qu.:40.69 1st Qu.:-73.98 Private room :22326
Manhattan :21661 Harlem : 2658 Median :40.72 Median :-73.96 Shared room : 1160
Queens : 5666 Bushwick : 2465 Mean :40.73 Mean :-73.95
Staten Island: 373 Upper West Side : 1971 3rd Qu.:40.76 3rd Qu.:-73.94
Hell's Kitchen : 1958 Max. :40.91 Max. :-73.71
(Other) :32209
price minimum_nights number_of_reviews last_review reviews_per_month
Min. : 0.0 Min. : 1.00 Min. : 0.00 :10052 Min. : 0.010
1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00 2019-06-23: 1413 1st Qu.: 0.190
Median : 106.0 Median : 3.00 Median : 5.00 2019-07-01: 1359 Median : 0.720
Mean : 152.7 Mean : 7.03 Mean : 23.27 2019-06-30: 1341 Mean : 1.373
3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00 2019-06-24: 875 3rd Qu.: 2.020
Max. :10000.0 Max. :1250.00 Max. :629.00 2019-07-07: 718 Max. :58.500
(Other) :33137 NA's :10052
calculated_host_listings_count availability_365
Min. : 1.000 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 0.0
Median : 1.000 Median : 45.0
Mean : 7.144 Mean :112.8
3rd Qu.: 2.000 3rd Qu.:227.0
Max. :327.000 Max. :365.0
We have some NAs in the reviews_per_month variable. This makes sense because not every listing has been reviewed. We get a sense of the prices too, although a room listed at $0 can’t be right. We’ll look at this later.
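As a rough sketch of how missing values and suspicious zeros might be counted, the example below uses two short made-up vectors standing in for reviews_per_month and price. The values are hypothetical and are not taken from the dataset.

```r
# Hypothetical values standing in for reviews_per_month and price
reviews_per_month <- c(0.21, 0.38, NA, 4.64, NA)
price <- c(149, 0, 150, 89, 80)

# is.na() returns TRUE/FALSE for each element; sum() counts the TRUEs
n_missing <- sum(is.na(reviews_per_month))
n_missing

# mean() returns NA when values are missing, unless we drop them with na.rm
mean(reviews_per_month)
mean(reviews_per_month, na.rm = TRUE)

# Count how many prices are exactly zero
n_zero_price <- sum(price == 0)
n_zero_price
```

The same pattern should work on a data frame column, for example `sum(is.na(airbnb$reviews_per_month))`.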
For now, let’s see how many listings are available in each borough.
table(airbnb$neighbourhood_group)
Bronx Brooklyn Manhattan Queens Staten Island
1091 20104 21661 5666 373
If we wanted to see the types of listings available in different boroughs, we’d create a table with a second variable:
table(airbnb$neighbourhood_group, airbnb$room_type)
Entire home/apt Private room Shared room
Bronx 379 652 60
Brooklyn 9559 10132 413
Manhattan 13199 7982 480
Queens 2096 3372 198
Staten Island 176 188 9
This is a nice overview of a lot of data, but it doesn’t allow us to compare boroughs. Why? Because the boroughs differ markedly in the number of listings, so we need to put the counts on a common scale. Let’s use percentages. To do this, we’ll use the prop.table command:
prop.table(table(airbnb$neighbourhood_group, airbnb$room_type))
Entire home/apt Private room Shared room
Bronx 0.0077513038 0.0133346968 0.0012271193
Brooklyn 0.1955005624 0.2072195521 0.0084466714
Manhattan 0.2699458022 0.1632477758 0.0098169547
Queens 0.0428673689 0.0689641068 0.0040494938
Staten Island 0.0035995501 0.0038449739 0.0001840679
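prop.table takes a table and converts its values into proportions, which we can read as percentages by multiplying by 100. By default, each cell is divided by the total number of listings. In the above table, we can see that 26.99% of all listings are Entire home/apt listings in Manhattan.
We can also tell prop.table() to give us percentages either by row or by column: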
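Since prop.table returns values between 0 and 1, it can help to multiply by 100 and round before reading the result. Here is a minimal sketch on a made-up table of counts; the borough labels and numbers are hypothetical.

```r
# Hypothetical counts for two boroughs and two room types (filled column by column)
counts <- matrix(c(30, 10, 20, 40), nrow = 2,
                 dimnames = list(c("Borough A", "Borough B"),
                                 c("Entire home/apt", "Private room")))
tab <- as.table(counts)

# Each cell divided by the grand total, then expressed as a percentage
pct <- round(100 * prop.table(tab), 1)
pct

# With no margin given, all cells together add up to 100
sum(pct)
```

The same wrapping, `round(100 * prop.table(...), 1)`, can be applied to the Airbnb table.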
prop.table(table(airbnb$neighbourhood_group, airbnb$room_type), margin = 1)
Entire home/apt Private room Shared room
Bronx 0.34738772 0.59761687 0.05499542
Brooklyn 0.47547752 0.50397931 0.02054318
Manhattan 0.60934398 0.36849638 0.02215964
Queens 0.36992587 0.59512884 0.03494529
Staten Island 0.47184987 0.50402145 0.02412869
Note that the percentages are now based on the first variable (airbnb$neighbourhood_group). Each row tells us what percentage of listings within a borough are Entire home/apt, Private room, or Shared room. This allows us to compare the boroughs. For example, we can see that the majority of listings in Manhattan are Entire home/apt, unlike in every other borough.
We can also get the percentages based on room type, which is the second variable in the table command:
prop.table(table(airbnb$neighbourhood_group, airbnb$room_type), margin = 2)
Entire home/apt Private room Shared room
Bronx 0.014915975 0.029203619 0.051724138
Brooklyn 0.376205282 0.453820658 0.356034483
Manhattan 0.519461608 0.357520380 0.413793103
Queens 0.082490456 0.151034668 0.170689655
Staten Island 0.006926680 0.008420675 0.007758621
This table tells us that the largest share of Private rooms (45.4%) is listed in Brooklyn. So to recap, by default, prop.table gives us the overall percentage for each cell in a table. By specifying the margin, we can get percentages within the rows (margin = 1) or within the columns (margin = 2).
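A quick way to confirm which direction a margin works in is to check the sums: with margin = 1 every row should add up to 1, and with margin = 2 every column should. The sketch below uses a tiny made-up table; the labels are hypothetical.

```r
# Hypothetical two-way table built from two short vectors
tab <- table(
  borough = c("A", "A", "A", "B", "B", "C"),
  room    = c("Entire", "Private", "Private", "Entire", "Entire", "Private")
)

by_row <- prop.table(tab, margin = 1)  # shares within each borough
by_col <- prop.table(tab, margin = 2)  # shares within each room type

rowSums(by_row)  # every row adds up to 1
colSums(by_col)  # every column adds up to 1
```

If the sums come out in the other direction, the margin argument is probably the opposite of what was intended.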
Logical operators in R
We have already learned about some of these operators, which can be applied to numeric variables:
| sign | function |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ or ** | exponentiation |
| x %% y | modulus (x mod y) 5%%2 is 1 |
| x %/% y | integer division 5%/%2 is 2 |
R also has the operators below, which return TRUE or FALSE.
| sign | function |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | Not x |
| x \| y | x OR y |
| x & y | x AND y |
| isTRUE(x) | test if x is TRUE |
Try some of these commands:
1 > 0
[1] TRUE
1 < 0
[1] FALSE
1 == 1
[1] TRUE
1 != 1
[1] FALSE
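These operators also work element by element on whole vectors, and the results can be combined with & and |. The following sketch uses a made-up vector of prices to illustrate this.

```r
price <- c(60, 89, 150, 225)   # hypothetical nightly prices

price > 100                  # FALSE FALSE  TRUE  TRUE
price >= 80 & price <= 150   # FALSE  TRUE  TRUE FALSE
price < 70 | price > 200     #  TRUE FALSE FALSE  TRUE
!(price > 100)               #  TRUE  TRUE FALSE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching elements
n_over_100 <- sum(price > 100)
n_over_100                   # 2
```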
These operators are particularly important when we want to subset our dataset. Let’s say we want to focus only on listings in Brooklyn. How do we do that? We find the column that tells us whether the listing is in Brooklyn or not. Then we tell R to keep the rows whose value in that column is ‘Brooklyn’. The command for this looks like this: airbnb[airbnb$neighbourhood_group == "Brooklyn",]. Remember we used [row, column] to access specific rows, columns, or values in a dataset? We are doing the exact same thing here. We are specifying a condition that must be met by a row in order to be included in our subset. Let’s call this subset brooklyn.only using the command below:
brooklyn.only <- airbnb[airbnb$neighbourhood_group == "Brooklyn",]
Of course, this code has no output, but if you look at your environment window on the top right, you’ll see a new element called brooklyn.only that has 20104 observations and 16 variables. This is the subset. Now you can use this dataset to work with listings that are only in Brooklyn.
Let’s answer some questions:

- How many rooms listed in Brooklyn are priced above $100 per night?
- What is the average number of reviews for shared rooms in Brooklyn?
- What is the minimum number of nights you can stay at a place in Brooklyn?

Now here’s where it gets interesting. You don’t have to actually save the subset. You can run your commands using the dataset name with the condition inside the brackets. So every command that we have run before to get a sense of our dataset can be run on the subset:
head(airbnb[airbnb$neighbourhood_group == "Brooklyn",])
Or
dim(airbnb[airbnb$neighbourhood_group == "Brooklyn",])
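Conditions can also be combined inside the brackets, and the filtered rows can be passed straight to other commands. The sketch below shows the pattern on a small made-up data frame (`listings`, with hypothetical values) that borrows a few column names from the Airbnb data.

```r
# Hypothetical toy data with a few of the same column names
listings <- data.frame(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Manhattan", "Brooklyn", "Queens"),
  room_type = c("Shared room", "Private room", "Entire home/apt", "Shared room", "Private room"),
  price = c(40, 120, 225, 60, 79),
  number_of_reviews = c(10, 3, 45, 20, 8)
)

# Keep rows that satisfy both conditions, then count them
bk_over_100 <- listings[listings$neighbourhood_group == "Brooklyn" &
                          listings$price > 100, ]
nrow(bk_over_100)

# Filter on two conditions, then summarize one column of the result
bk_shared <- listings[listings$neighbourhood_group == "Brooklyn" &
                        listings$room_type == "Shared room", ]
mean(bk_shared$number_of_reviews)
```

Swapping `listings` for `airbnb` should apply the same pattern to the real dataset.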
Let’s answer some questions: