Introductory Statistics (CRN: 6896)



Objective

The midterm is next week! For the midterm, you’ll be given a dataset and asked some questions about it. These questions will not be markedly different from the ones we have answered in previous weeks, so reviewing our past lab material should help you get ready. Today we’ll add to our toolbox of commands and techniques for parsing and analyzing data.


Here is a list of things we have learned so far:

  • How to read a csv file
  • How many rows and columns the dataset has (dataset dimensions)
  • How to see the structure of the dataset
  • How to see the first few rows
  • How to see a summary of the dataset
  • How to get the mean of a numeric variable
  • How to calculate the median of a numeric variable
  • How to get the standard deviation of a variable
  • How to get the range of a numeric variable
  • How to give a value to a variable
  • How to access a column (variable) in our dataset
  • How to access a row in our dataset
  • How to create tables of values of a variable
  • How to deal with missing values (NA)
  • How to apply a function to a variable separately for different groups
  • How to plot the distribution of numeric variables
  • How to plot boxplots of numeric variables separated by groups (another variable)
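
As a quick refresher, here is a minimal sketch that runs several of these commands on a tiny made-up data frame. The data frame rooms and all of its values are hypothetical and only serve as an illustration; with a real dataset you would start from read.csv() instead. The plotting commands are left out to keep the sketch short.

```r
# Hypothetical toy data (NOT the lab dataset)
rooms <- data.frame(
  borough = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  price   = c(50, 70, 80, 100, 120),
  reviews = c(3, NA, 10, 0, 7),
  stringsAsFactors = TRUE
)

dim(rooms)                 # number of rows and columns
str(rooms)                 # structure of the dataset
head(rooms, 3)             # first three rows
summary(rooms)             # summary of every variable

mean(rooms$price)          # mean of a numeric variable: 84
median(rooms$price)        # median: 80
sd(rooms$price)            # standard deviation
range(rooms$price)         # smallest and largest value: 50 120

rooms$price                # access a column (variable)
rooms[2, ]                 # access a row
table(rooms$borough)       # table of the values of a variable

mean(rooms$reviews, na.rm = TRUE)   # ignore missing values (NA): 5

# apply a function (here: mean) separately for each group
tapply(rooms$price, rooms$borough, mean)
```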


Now let’s learn about a few more useful commands.

  • Percentages: When we use the table command to get a frequency table of categorical data, we end up with the number of cases in each group. This is useful, but sometimes we want to compare these numbers. To compare things, we always need to make sure they are on the same scale. We can’t compare kilograms and pounds because they are not on the same scale. For the same reason we can’t compare dollars and pesos, or meters and yards, and so forth. So to compare things, we have to convert them to a common scale. One simple method of conversion is using percentages. By converting values into percentages, we put them on a scale of 0 to 100, which then allows us to make meaningful comparisons. Let’s look at an example.


The dataset for this lab is called AB_NYC_2019.csv. It’s a dataset of rooms listed on Airbnb in NYC and contains the following variables:

Variable name                     Description
id                                listing ID
name                              name of the listing
host_id                           host ID
host_name                         name of the host
neighbourhood_group               location (borough)
neighbourhood                     area
latitude                          latitude coordinates
longitude                         longitude coordinates
room_type                         listing space type
price                             price in dollars
minimum_nights                    minimum number of nights
number_of_reviews                 number of reviews
last_review                       date of the latest review
reviews_per_month                 number of reviews per month
calculated_host_listings_count    number of listings per host
availability_365                  number of days when listing is available for booking

Open a new file and read in the dataset:

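# stringsAsFactors = TRUE stores text columns as factors (R 4.0 and later read them as character by default)
airbnb <- read.csv("./AB_NYC_2019.csv", stringsAsFactors = TRUE)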

And run our initial commands to learn what the dataset looks like:

dim(airbnb)
[1] 48895    16

We have 48895 observations, in this case listings, and 16 variables. What kind of variables?

str(airbnb)
'data.frame':   48895 obs. of  16 variables:
 $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
 $ name                          : Factor w/ 47906 levels ""," 1 Bed Apt in Utopic Williamsburg ",..: 12573 38016 45018 15591 19219 24849 8257 24896 15486 17573 ...
 $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
 $ host_name                     : Factor w/ 11453 levels "","​ Valéria",..: 4997 4791 2913 6210 5929 1938 3549 9649 6880 1235 ...
 $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
 $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
 $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
 $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
 $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
 $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
 $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
 $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
 $ last_review                   : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
 $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
 $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
 $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...


So we have a bunch of factors (i.e., categorical variables) and numerical variables that come in either numeric or integer form. Notice that R has stored every text column as a factor. So, for example, the variable name is considered a factor. But is it really a categorical variable? It’s the name of the listing, which the host chose, so it could have been anything. In this case it is technically a factor, but with 47906 levels across 48895 listings, almost every level appears only once. It is important to keep in mind that just because R says something is a factor, it doesn’t mean it is a categorical variable.
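
One way to check whether a factor is really categorical is to compare its number of levels with the number of rows. The sketch below uses a small made-up data frame (listings and its values are hypothetical, not taken from the Airbnb file) and also shows how a free-text column could be turned back into plain text.

```r
# Hypothetical toy data (NOT the lab dataset)
listings <- data.frame(
  name      = c("Cozy room", "Sunny loft", "Quiet studio", "Big flat"),
  room_type = c("Private room", "Entire home/apt", "Private room", "Entire home/apt"),
  stringsAsFactors = TRUE   # store text columns as factors
)

nrow(listings)                 # 4 listings
nlevels(listings$name)         # 4 levels: every name is unique, so not really categorical
nlevels(listings$room_type)    # 2 levels: a genuine categorical variable

# Treat the free-text column as plain text instead of a factor
listings$name <- as.character(listings$name)
class(listings$name)           # "character"
```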

Using the summary command, we can get a sense of the distribution of each variable:

summary(airbnb)
       id                                         name          host_id                 host_name    
 Min.   :    2539   Hillside Hotel                  :   18   Min.   :     2438   Michael     :  417  
 1st Qu.: 9471945   Home away from home             :   17   1st Qu.:  7822033   David       :  403  
 Median :19677284                                   :   16   Median : 30793816   Sonder (NYC):  327  
 Mean   :19017143   New york Multi-unit building    :   16   Mean   : 67620011   John        :  294  
 3rd Qu.:29152178   Brooklyn Apartment              :   12   3rd Qu.:107434423   Alex        :  279  
 Max.   :36487245   Loft Suite @ The Box House Hotel:   11   Max.   :274321313   Blueground  :  232  
                    (Other)                         :48805                       (Other)     :46943  
    neighbourhood_group            neighbourhood      latitude       longitude                room_type    
 Bronx        : 1091    Williamsburg      : 3920   Min.   :40.50   Min.   :-74.24   Entire home/apt:25409  
 Brooklyn     :20104    Bedford-Stuyvesant: 3714   1st Qu.:40.69   1st Qu.:-73.98   Private room   :22326  
 Manhattan    :21661    Harlem            : 2658   Median :40.72   Median :-73.96   Shared room    : 1160  
 Queens       : 5666    Bushwick          : 2465   Mean   :40.73   Mean   :-73.95                          
 Staten Island:  373    Upper West Side   : 1971   3rd Qu.:40.76   3rd Qu.:-73.94                          
                        Hell's Kitchen    : 1958   Max.   :40.91   Max.   :-73.71                          
                        (Other)           :32209                                                           
     price         minimum_nights    number_of_reviews     last_review    reviews_per_month
 Min.   :    0.0   Min.   :   1.00   Min.   :  0.00              :10052   Min.   : 0.010   
 1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00    2019-06-23: 1413   1st Qu.: 0.190   
 Median :  106.0   Median :   3.00   Median :  5.00    2019-07-01: 1359   Median : 0.720   
 Mean   :  152.7   Mean   :   7.03   Mean   : 23.27    2019-06-30: 1341   Mean   : 1.373   
 3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00    2019-06-24:  875   3rd Qu.: 2.020   
 Max.   :10000.0   Max.   :1250.00   Max.   :629.00    2019-07-07:  718   Max.   :58.500   
                                                       (Other)   :33137   NA's   :10052    
 calculated_host_listings_count availability_365
 Min.   :  1.000                Min.   :  0.0   
 1st Qu.:  1.000                1st Qu.:  0.0   
 Median :  1.000                Median : 45.0   
 Mean   :  7.144                Mean   :112.8   
 3rd Qu.:  2.000                3rd Qu.:227.0   
 Max.   :327.000                Max.   :365.0   
                                                


We have some NAs in the reviews_per_month variable. This makes sense because not every listing has been reviewed: the 10052 NAs match the 10052 listings with an empty last_review. We get a sense of the prices too, although a room listed at $0 can’t be right. We’ll look at this later.
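
As a reminder of how missing and suspicious values can be counted, here is a minimal sketch on two short made-up vectors. The vectors and their values are hypothetical; in the lab you would use airbnb$reviews_per_month and airbnb$price instead.

```r
# Hypothetical toy vectors (NOT the lab dataset)
reviews_per_month <- c(0.21, 0.38, NA, 4.64, NA)
price             <- c(149, 0, 150, 89, 80)

sum(is.na(reviews_per_month))          # how many values are missing: 2
mean(reviews_per_month)                # NA, because of the missing values
mean(reviews_per_month, na.rm = TRUE)  # mean of the values that are present

sum(price == 0)                        # how many listings have a price of 0: 1
```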

For now, let’s see how many listings are available in each borough.

table(airbnb$neighbourhood_group)

        Bronx      Brooklyn     Manhattan        Queens Staten Island 
         1091         20104         21661          5666           373 

If we wanted to see the types of listings available in different boroughs, we’d create a table with a second variable:

table(airbnb$neighbourhood_group, airbnb$room_type)
               
                Entire home/apt Private room Shared room
  Bronx                     379          652          60
  Brooklyn                 9559        10132         413
  Manhattan               13199         7982         480
  Queens                   2096         3372         198
  Staten Island             176          188           9

This is a nice overview of a lot of data, but it doesn’t allow us to compare boroughs. Why? Because the boroughs differ markedly in their number of listings, so we need to put the counts on a common scale. Let’s use percentages. To do this, we’ll use the prop.table command:

prop.table(table(airbnb$neighbourhood_group, airbnb$room_type))
               
                Entire home/apt Private room  Shared room
  Bronx            0.0077513038 0.0133346968 0.0012271193
  Brooklyn         0.1955005624 0.2072195521 0.0084466714
  Manhattan        0.2699458022 0.1632477758 0.0098169547
  Queens           0.0428673689 0.0689641068 0.0040494938
  Staten Island    0.0035995501 0.0038449739 0.0001840679

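prop.table takes a table and converts its values into proportions, that is, each count divided by the total number of listings; multiplying a proportion by 100 gives the percentage. In the above table, we can see that 26.99% of all listings are Entire home/apt listings in Manhattan.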
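
To read the result directly as percentages, the proportions can be multiplied by 100 and rounded. The sketch below rebuilds the counts from the table() output above as a matrix so that it runs without the csv file; the names counts and pct are only illustrative.

```r
# Counts copied from the table(neighbourhood_group, room_type) output above
counts <- matrix(
  c(  379,   652,  60,
     9559, 10132, 413,
    13199,  7982, 480,
     2096,  3372, 198,
      176,   188,   9),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    c("Entire home/apt", "Private room", "Shared room")
  )
)

sum(counts)                                # 48895, the total number of listings
pct <- round(100 * prop.table(counts), 2)  # proportions -> percentages, 2 decimals
pct                                        # all cells together add up to about 100
pct["Manhattan", "Entire home/apt"]        # 26.99
```

With the real data, the same idea would be round(100 * prop.table(table(airbnb$neighbourhood_group, airbnb$room_type)), 2).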

We can also tell prop.table() to give us percentages either by row or by column:

prop.table(table(airbnb$neighbourhood_group, airbnb$room_type), margin = 1)
               
                Entire home/apt Private room Shared room
  Bronx              0.34738772   0.59761687  0.05499542
  Brooklyn           0.47547752   0.50397931  0.02054318
  Manhattan          0.60934398   0.36849638  0.02215964
  Queens             0.36992587   0.59512884  0.03494529
  Staten Island      0.47184987   0.50402145  0.02412869

Note that the percentages are now based on the first variable (airbnb$neighbourhood_group). So each row tells us what percentage of listings within a borough are Entire home/apt, Private room, or Shared room. This allows us to compare the boroughs. For example, we can see that the majority of listings in Manhattan are Entire home/apt, unlike in every other borough.

We can also get the percentages based on room type, which is the second variable in the table command:

prop.table(table(airbnb$neighbourhood_group, airbnb$room_type), margin = 2)
               
                Entire home/apt Private room Shared room
  Bronx             0.014915975  0.029203619 0.051724138
  Brooklyn          0.376205282  0.453820658 0.356034483
  Manhattan         0.519461608  0.357520380 0.413793103
  Queens            0.082490456  0.151034668 0.170689655
  Staten Island     0.006926680  0.008420675 0.007758621

This table tells us that the largest share of Private rooms (45.4%) is listed in Brooklyn. So to recap, by default, prop.table gives us the overall percentage of each cell in a table. By specifying the margin, we can get percentages within each row (margin = 1) or within each column (margin = 2).
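
A simple way to remember which margin is which: with margin = 1 every row adds up to 1, and with margin = 2 every column adds up to 1. The sketch below checks this using the counts for just two boroughs, copied from the table() output earlier; because only two boroughs are included, the column percentages here will differ from the full table above.

```r
# Counts for two boroughs only, copied from the table() output earlier in the lab
counts <- matrix(
  c(  379,  652,  60,
    13199, 7982, 480),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Bronx", "Manhattan"),
                  c("Entire home/apt", "Private room", "Shared room"))
)

by_row <- prop.table(counts, margin = 1)   # proportions within each borough
by_col <- prop.table(counts, margin = 2)   # proportions within each room type

rowSums(by_row)   # every row adds up to 1
colSums(by_col)   # every column adds up to 1

# Manhattan has more shared rooms (480 vs 60), but they make up a
# smaller share of its listings than in the Bronx
round(100 * by_row[, "Shared room"], 2)    # Bronx 5.50, Manhattan 2.22
```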


Logical operators in R

We have already learned about some of these operators, which can be applied to numeric variables:

Operator     Description
+            addition
-            subtraction
*            multiplication
/            division
^ or **      exponentiation
x %% y       modulus (x mod y); 5 %% 2 is 1
x %/% y      integer division; 5 %/% 2 is 2
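
A few of these operators in action, as a small sketch you can paste into the console (the numbers are arbitrary):

```r
7 + 2      # 9
7 - 2      # 5
7 * 2      # 14
7 / 2      # 3.5
7 ^ 2      # 49
7 %% 2     # 1, the remainder after dividing 7 by 2
7 %/% 2    # 3, how many whole times 2 fits into 7
```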

R also has the operators below, which return TRUE or FALSE:

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           NOT x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE


Try some of these commands:

1 > 0
[1] TRUE
1 < 0 
[1] FALSE
1 == 1
[1] TRUE
1 != 1 
[1] FALSE
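
These operators also work on a whole variable at once, returning one TRUE or FALSE per value. Here is a minimal sketch on a made-up vector of prices (the values are hypothetical), which also shows how the AND, OR, and NOT operators combine conditions.

```r
# Hypothetical prices for five listings
price <- c(149, 225, 150, 89, 80)

price > 100                  # one TRUE/FALSE per listing
sum(price > 100)             # TRUE counts as 1, so this counts listings above $100: 3
mean(price > 100)            # proportion of listings above $100: 0.6

price > 100 & price < 200    # AND: both conditions must hold
price < 90 | price > 200     # OR: at least one condition must hold
!(price > 100)               # NOT: flips TRUE and FALSE
```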


These operators are particularly important when we want to subset our dataset. Let’s say we want to focus only on listings in Brooklyn. How do we do that? We find the column that tells us whether the listing is in Brooklyn or not. Then we tell R to keep the rows whose value in that column is ‘Brooklyn’. The command for this looks like this: airbnb[airbnb$neighbourhood_group == "Brooklyn",]. Remember how we used [row, column] to access specific rows, columns, or values in a dataset? We are doing the exact same thing here. We are specifying a condition that must be met by a row in order to be included in our subset. Let’s call this subset brooklyn.only using the command below:

brooklyn.only <- airbnb[airbnb$neighbourhood_group == "Brooklyn",]

Of course, this code has no output, but if you look at your environment window on the top right, you’ll see a new element called brooklyn.only that has 20104 observations and 16 variables. This is the subset. Now you can use this dataset to work with listings that are only in Brooklyn.
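
Conditions can also be combined with & to build narrower subsets. The sketch below shows the idea on a tiny made-up data frame (toy and all of its values are hypothetical, not the Airbnb data), so the numbers in the comments apply to the toy data only.

```r
# Hypothetical toy data (NOT the lab dataset)
toy <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Brooklyn", "Queens", "Brooklyn"),
  room_type           = c("Shared room", "Private room", "Private room", "Shared room", "Shared room"),
  price               = c(60, 225, 150, 89, 40),
  stringsAsFactors = TRUE
)

# Keep the rows where the condition is TRUE; the empty slot after the comma keeps all columns
toy.brooklyn <- toy[toy$neighbourhood_group == "Brooklyn", ]
nrow(toy.brooklyn)                       # 3 rows

# Two conditions joined with & (AND)
toy.brooklyn.shared <- toy[toy$neighbourhood_group == "Brooklyn" & toy$room_type == "Shared room", ]
nrow(toy.brooklyn.shared)                # 2 rows
mean(toy.brooklyn.shared$price)          # 50
```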

Let’s answer some questions:

  • How many rooms listed in Brooklyn are priced above $100 per night?
  • What is the average number of reviews for shared rooms in Brooklyn?
  • What is the minimum number of nights you have to stay at a place in Brooklyn?


Now here’s where it gets interesting. You don’t have to actually save the subset. You can run your commands using the dataset name with the condition statement in it. So every command that we have run before to get a sense of our dataset can be run on the subset:

head(airbnb[airbnb$neighbourhood_group == "Brooklyn",])


Or

dim(airbnb[airbnb$neighbourhood_group == "Brooklyn",])
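
The same trick works for a single column: put the condition inside the square brackets after the column. Below is a minimal sketch on made-up data (toy and its values are hypothetical). It also shows one thing to watch for: a factor keeps all of its original levels after subsetting, so unused levels may need to be dropped with droplevels() before counting or grouping.

```r
# Hypothetical toy data (NOT the lab dataset)
toy <- data.frame(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Brooklyn", "Manhattan", "Manhattan"),
  neighbourhood       = c("Bushwick", "Bushwick", "Williamsburg", "Harlem", "Chelsea"),
  price               = c(60, 80, 150, 100, 300),
  stringsAsFactors = TRUE
)

in.brooklyn <- toy$neighbourhood_group == "Brooklyn"   # TRUE for Brooklyn rows

# Mean price of Brooklyn listings, without saving a subset
mean(toy$price[in.brooklyn])

# Without droplevels(), Harlem and Chelsea would still be counted as levels
nlevels(toy$neighbourhood[in.brooklyn])                # 4
brooklyn.hoods <- droplevels(toy$neighbourhood[in.brooklyn])
nlevels(brooklyn.hoods)                                # 2

# Mean price per neighbourhood within Brooklyn
tapply(toy$price[in.brooklyn], brooklyn.hoods, mean)   # Bushwick 70, Williamsburg 150
```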

Let’s answer some questions:

  • What is the average price of the different room types in Manhattan?
  • How many neighborhoods are there in Brooklyn?
  • What is the average price of listings for different neighborhoods in Brooklyn?
  • What is the average number of reviews for shared rooms in Brooklyn?
