Airbnb has taken a non-asset-based approach to housing and hospitality that enables individuals to earn a profit by commercializing their own private properties. This is a product of the sharing economy, which has become a disruptive force that caters to evolving consumer tastes.
For Airbnb in particular, the Superhost program creates incentives and opportunities for ambitious individuals. When we explored the dataset selected for this analysis, we found that hosts in the Superhost program earned up to 22% more in profit than their counterparts and attracted more customers.
Our goal for this project is to use a data-driven approach to propose whether any changes should be made to the Airbnb Superhost Program. We will apply several machine learning algorithms and base our proposal on the best-performing one.
HYPOTHESIS: We expect the following features to stand out in our analysis, given what we know about the program:
Inside Airbnb is an independent, non-commercial set of tools and data that allows users to explore how Airbnb is being used in cities around the world (http://insideairbnb.com/get-the-data.html).
The ‘Los Angeles Listings’ dataset contains 96 features and 43,047 records.
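A quick way to confirm the record and feature counts is to read the file and inspect its dimensions. The sketch below is a minimal illustration, not the code used in the original analysis: the helper name `load_listings` and the assumption that the downloaded file is a UTF-8 encoded CSV are ours. It runs on a tiny temporary file so that it works without the real download.

```r
# Hypothetical helper (not from the original analysis): read a listings CSV
# and report how many records (rows) and features (columns) it contains.
# Assumes the file is UTF-8 encoded.
load_listings <- function(path) {
  listings <- read.csv(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
  message(sprintf("%d records, %d features", nrow(listings), ncol(listings)))
  listings
}

# Demo on a two-row temporary file; the ids and flags mirror the first two
# listings shown in the preview of the real data.
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c(109, 344), host_is_superhost = c("f", "f")),
  demo_path,
  row.names = FALSE
)
demo <- load_listings(demo_path)
dim(demo)
```

Calling `load_listings()` on a local copy of the Los Angeles listings file should report the record and feature counts stated above, provided the same snapshot is used.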
The following were used to clean and prepare the data:
# Libraries for machine learning
library(tidyverse)
library(class)
library(gmodels)
library(caret)
library(ipred)
library(adabag)
library(vcd)
library(randomForest)
library(e1071)
library(C50)
library(klaR)
library(rJava)
library(RWeka)
library(magrittr)
library(ROCR)
library(pROC)
library(neuralnet)
library(kernlab)
library(VIM)
library(mice)
# Libraries for data cleaning and preprocessing
library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(corrplot)
library(Boruta)
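Several columns in this dataset arrive as text and need converting before modeling: prices such as `$2,851.00`, rates such as `100%`, and flags stored as `t`/`f`. The following is a minimal base-R sketch of such conversions. The helper names `parse_money`, `parse_percent`, and `parse_flag` are hypothetical and are not taken from the original analysis; the sample values are copied from the first three rows of the listing preview.

```r
# Hypothetical helpers (not from the original analysis), base R only.

# "$2,851.00" -> 2851; blank or "N/A" -> NA
parse_money <- function(x) {
  x <- trimws(as.character(x))
  x[x %in% c("", "N/A")] <- NA_character_
  as.numeric(gsub("[$,]", "", x))
}

# "100%" -> 100; blank or "N/A" -> NA
parse_percent <- function(x) {
  x <- trimws(as.character(x))
  x[x %in% c("", "N/A")] <- NA_character_
  as.numeric(sub("%", "", x, fixed = TRUE))
}

# "t" -> TRUE, "f" -> FALSE, anything else -> NA
parse_flag <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA, length(x))
  out[x %in% "t"] <- TRUE
  out[x %in% "f"] <- FALSE
  out
}

# Values copied from the first three listings in the preview.
sample_rows <- data.frame(
  price = c("$122.00", "$168.00", "$79.00"),
  weekly_price = c("$904.00", "", "$399.00"),
  host_response_rate = c("N/A", "100%", "100%"),
  host_is_superhost = c("f", "f", "t"),
  stringsAsFactors = FALSE
)

cleaned <- transform(
  sample_rows,
  price = parse_money(price),
  weekly_price = parse_money(weekly_price),
  host_response_rate = parse_percent(host_response_rate),
  host_is_superhost = parse_flag(host_is_superhost)
)
str(cleaned)
```

Mapping blanks and `N/A` to `NA` before calling `as.numeric()` avoids coercion warnings and keeps missing values explicit for any later imputation step.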
## id listing_url scrape_id last_scraped
## 1 109 https://www.airbnb.com/rooms/109 2.018121e+13 2018-12-07
## 2 344 https://www.airbnb.com/rooms/344 2.018121e+13 2018-12-07
## 3 2708 https://www.airbnb.com/rooms/2708 2.018121e+13 2018-12-06
## 4 2732 https://www.airbnb.com/rooms/2732 2.018121e+13 2018-12-06
## 5 2864 https://www.airbnb.com/rooms/2864 2.018121e+13 2018-12-06
## 6 3021 https://www.airbnb.com/rooms/3021 2.018121e+13 2018-12-07
## name
## 1 Amazing bright elegant condo park front *UPGRADED*
## 2 Family perfect;Pool;Near Studios!
## 3 Gold Memory Foam Bed & Breakfast in West Hollywood
## 4 Zen Life at the Beach
## 5 *Upscale Professional Home with Beautiful Studio*
## 6 Hollywood Hills Zen Modern style Apt/Guesthouse
## summary
## 1 *** Unit upgraded with new bamboo flooring, brand new Ultra HD 50" Sony TV, new paint, new lighting, new mattresses, ultra fast cable Internet connection, Apple TV, (Hidden by Airbnb) Chromecast. *** Gorgeous and Elegant Furnished Condo in front of Culver City Fox Hills Park. Upper corner unit, total silence protected by trees. Short walk to the new Westfield Mall. Tennis courts, heated pool and jacuzzi hot tub.
## 2 This home is perfect for families; aspiring child actors w/parents; and friends vacationing for the summer or holidays. The pool is large, back patio terrific for evening dinners/parties around the firepit while folks nighttime swim during the summer. Chilly firepit fun during the winter. Quiet neighborhood minutes from Burbank Airport and all freeways. GREAT CENTRAL LOCATION!
## 3 Our best memory foam pillows you'll ever sleep on. First Morning: Starbuck's & Peet's coffee, latte-style coffee also protein bars, granola bars, and a fresh baked Swedish cinnamon roll, continental breakfast as well as breakfast requests. A welcome bottle of Voss artesian water from Norway. Terry robe & slippers. Handmade Amish wildflower soap. Candy bowl & trail mix jar. SoCal: beaches, Walk of Fame, clubs. Then back here for R&R. Pamper yourself in West Hollywood, California.
## 4
## 5 Centrally located.... Furnished with 42 inch Sony Plasma tv/hbo/showtime. Fiber optic WIFI. 4.6 cu ft. Refridgerator, Microwave, Convection Oven. Large bathroom with Jaccuzi and large shower, brazilian cherry hardwood floors, thomasville bedroom furniture, ikea office, fireplace, ceiling fan, etc. Many restaurants and Shopping. Cheesecake Factory, Fronks (best burgers and ribs), Marino's, Royal Taste (Thai), California Pizza Kitchen, The Nest (breakfast!), Bike Trail, Beach in 15 minutes.
## 6 A very Modern Hollywood Hills Zen style gallery-esque abode , Dark Brazilian Hardwood Floors, Sleek modern Concrete decor infused with Asian feng shui sensibilities, Artisinal in all aspects with all the modern conveniences. Located in beautiful and Musically Historic Laurel Canyon. Approximate size is 460 sq feet.
## space
## 1 *** Unit upgraded with new bamboo flooring, brand new Ultra HD 50" Sony TV, new paint, new lighting, new mattresses, ultra fast cable Internet connection. *** Gorgeous and Elegant Furnished Apartment in front of Culver City Fox Hills Park. Upper corner unit, total silence protected by trees. Short walk to the new Westfield Mall. Tennis courts, heated pool and jacuzzi hot tub. *** Upgraded with bamboo flooring and new paint the whole apartment *** Just installed gorgeous high quality bamboo hardwood floor in the whole apartment! (pictures here shows bamboo only in the living room, now also the bedrooms have bamboo flooring) Gorgeous and Elegant Furnished Apartment in front of Culver City Fox Hills Park. Upper corner unit, totally silent protected by trees. Short walk to the new Westfield Mall. MANY MORE PICTURES AVAILABLE HERE: (URL HIDDEN) Listing Type: Short or Long Term Rental Listing Description: Gorgeous and Elegant Furnished Apartment in front of the Park Bedrooms: 2 bedrooms B
## 2 Cheerful & comfortable; near studios, amusement parks, downtown, beaches! Central and modern; private, convenient, family-friendly, pool. 3 bedrooms. Terrific host - feel welcome and relaxed while you travel.
## 3 Flickering fireplace display heater. Decorated with fresh flowers for the Holidays. Blendtec® Designer 625 Blender Bundle with Twister Jar MORE THAN JUST SMOOTHIES: From hot soup to ice cream and everything in between, your imagination is the only limit to the creative power you can unleash with Blendtec blender. This space is completely upgraded and updated. Luxury Gold Queen size memory foam bed in fully private 10 ft. x 13 ft. 6 in. walled off completely enclosed space, screened off, of huge living room. 17' high exposed beam ceiling. Brand new plush carpeting and plank flooring throughout. • Wireless Router WIFI • Great looking new building • Great looking lobby • Right off Sunset Blvd. (west of La Brea) • Perfect for a model or actor. • Walk to Formosa Cafe • 5 Minutes: Beverly Hills, Paramount, Pantages Theater, Television City. • 8 Minutes: Burbank, Universal, Disney, NBC, Fox, SAG. • Laundry Facilities • Jacuzzi and sundeck • Public transportation close by • Newl
## 4 This is a three story townhouse with the following layout. The first floor, where my guest stays, is an open space filled with light. Bright, airy, cheerful. You will be sleeping on a sleeper couch facing a beautiful garden overflowing with flowers to greet you every morning that offers a lovely patio area to sit and have a meal. Very peaceful and serene. There is no door but Japanese screens are provided if you want to nest in. While this area is considered a shared space, you will have it completely to yourself and have any privacy you seek. On the same floor, is the shared kitchen fully stocked. The 2nd floor is my office and private bath for my guest. The third floor is my quarters. We could possibly not run into one another. Located in Santa Monica, a few blocks from the very hip Main Street area that has cafes, both fine and casual dining that will appeal to those of you who are foodies. Walking distance to yoga studios, farmers markets and wonderful unique shops. The pristine c
## 5 The space is furnished with Thomasville furniture, brazillian cherrywood flooring, jacuzzi, large bathroom, large closet, large office desk area/furniture. 2 minutes to the freeway, 12 minutes to closest beach, 30 minutes to Hollywood, 5 minutes to restaurants, shopping and grocery stores. There are many parks close by and very close to the San Gabriel Pass where the cyclists ride down to Seal Beach. It's a safe, quiet place with tenants with a very busy lifestyle and there are no children. No smoking in the house and no alcoholics/drugs. I'm a very friendly home owner and I respect everyone's privacy! :)
## 6 Stay amongst the Stars when you visit the Hollywood Hills! One of the safest areas !!! Sleep just minutes away from Jim Morrison's house, .....The Mamas and Papas home, and many more present and past celebrities!! ...the Hollywood Hills / Laurel Canyon Welcome to Paradise , quiet, lush ,song birds greenery, , and refreshing breezes; yet in the heart of Hollywood, quick access to Sunset Strip, West Hollywood, Downtown, Beverly Hills, and Universal City Walk and other film studios. This gem is nestled up the infamous Laurel Canyon and has all of the amenities and comforts of a custom designer home , completely separate entrance, plenty of easy parking, and fully loaded! After your active touring, relax in this fully-furnished Soho gallery-esque modern guest house. Master bedroom includes- queen size bed, fresh linens and comforter, White leather Corbusier chair, custom corner desk, and a walk-in closet; media/living room features a 42" plasma screen with slime Warner cable , Italian le
## description
## 1 *** Unit upgraded with new bamboo flooring, brand new Ultra HD 50" Sony TV, new paint, new lighting, new mattresses, ultra fast cable Internet connection, Apple TV, (Hidden by Airbnb) Chromecast. *** Gorgeous and Elegant Furnished Condo in front of Culver City Fox Hills Park. Upper corner unit, total silence protected by trees. Short walk to the new Westfield Mall. Tennis courts, heated pool and jacuzzi hot tub. *** Unit upgraded with new bamboo flooring, brand new Ultra HD 50" Sony TV, new paint, new lighting, new mattresses, ultra fast cable Internet connection. *** Gorgeous and Elegant Furnished Apartment in front of Culver City Fox Hills Park. Upper corner unit, total silence protected by trees. Short walk to the new Westfield Mall. Tennis courts, heated pool and jacuzzi hot tub. *** Upgraded with bamboo flooring and new paint the whole apartment *** Just installed gorgeous high quality bamboo hardwood floor in the whole apartment! (pictures here shows bamboo only in the living r
## 2 This home is perfect for families; aspiring child actors w/parents; and friends vacationing for the summer or holidays. The pool is large, back patio terrific for evening dinners/parties around the firepit while folks nighttime swim during the summer. Chilly firepit fun during the winter. Quiet neighborhood minutes from Burbank Airport and all freeways. GREAT CENTRAL LOCATION! Cheerful & comfortable; near studios, amusement parks, downtown, beaches! Central and modern; private, convenient, family-friendly, pool. 3 bedrooms. Terrific host - feel welcome and relaxed while you travel. Pool, patio and self-contained main house all accessible freely by guests. Garage, pool house and back caretaker unit not accessible. Host and caretaker may be available throughout your stay to assist in troubleshooting with local information/amenities. During holiday time, you may have the place to yourself. Host available for support by phone. Quiet-yet-close to all the fun in LA! Hollywood, Univers
## 3 Our best memory foam pillows you'll ever sleep on. First Morning: Starbuck's & Peet's coffee, latte-style coffee also protein bars, granola bars, and a fresh baked Swedish cinnamon roll, continental breakfast as well as breakfast requests. A welcome bottle of Voss artesian water from Norway. Terry robe & slippers. Handmade Amish wildflower soap. Candy bowl & trail mix jar. SoCal: beaches, Walk of Fame, clubs. Then back here for R&R. Pamper yourself in West Hollywood, California. Flickering fireplace display heater. Decorated with fresh flowers for the Holidays. Blendtec® Designer 625 Blender Bundle with Twister Jar MORE THAN JUST SMOOTHIES: From hot soup to ice cream and everything in between, your imagination is the only limit to the creative power you can unleash with Blendtec blender. This space is completely upgraded and updated. Luxury Gold Queen size memory foam bed in fully private 10 ft. x 13 ft. 6 in. walled off completely enclosed space, screened off, of huge living roo
## 4 This is a three story townhouse with the following layout. The first floor, where my guest stays, is an open space filled with light. Bright, airy, cheerful. You will be sleeping on a sleeper couch facing a beautiful garden overflowing with flowers to greet you every morning that offers a lovely patio area to sit and have a meal. Very peaceful and serene. There is no door but Japanese screens are provided if you want to nest in. While this area is considered a shared space, you will have it completely to yourself and have any privacy you seek. On the same floor, is the shared kitchen fully stocked. The 2nd floor is my office and private bath for my guest. The third floor is my quarters. We could possibly not run into one another. Located in Santa Monica, a few blocks from the very hip Main Street area that has cafes, both fine and casual dining that will appeal to those of you who are foodies. Walking distance to yoga studios, farmers markets and wonderful unique shops. The pristine c
## 5 Centrally located.... Furnished with 42 inch Sony Plasma tv/hbo/showtime. Fiber optic WIFI. 4.6 cu ft. Refridgerator, Microwave, Convection Oven. Large bathroom with Jaccuzi and large shower, brazilian cherry hardwood floors, thomasville bedroom furniture, ikea office, fireplace, ceiling fan, etc. Many restaurants and Shopping. Cheesecake Factory, Fronks (best burgers and ribs), Marino's, Royal Taste (Thai), California Pizza Kitchen, The Nest (breakfast!), Bike Trail, Beach in 15 minutes. The space is furnished with Thomasville furniture, brazillian cherrywood flooring, jacuzzi, large bathroom, large closet, large office desk area/furniture. 2 minutes to the freeway, 12 minutes to closest beach, 30 minutes to Hollywood, 5 minutes to restaurants, shopping and grocery stores. There are many parks close by and very close to the San Gabriel Pass where the cyclists ride down to Seal Beach. It's a safe, quiet place with tenants with a very busy lifestyle and there are no children. N
## 6 A very Modern Hollywood Hills Zen style gallery-esque abode , Dark Brazilian Hardwood Floors, Sleek modern Concrete decor infused with Asian feng shui sensibilities, Artisinal in all aspects with all the modern conveniences. Located in beautiful and Musically Historic Laurel Canyon. Approximate size is 460 sq feet. Stay amongst the Stars when you visit the Hollywood Hills! One of the safest areas !!! Sleep just minutes away from Jim Morrison's house, .....The Mamas and Papas home, and many more present and past celebrities!! ...the Hollywood Hills / Laurel Canyon Welcome to Paradise , quiet, lush ,song birds greenery, , and refreshing breezes; yet in the heart of Hollywood, quick access to Sunset Strip, West Hollywood, Downtown, Beverly Hills, and Universal City Walk and other film studios. This gem is nestled up the infamous Laurel Canyon and has all of the amenities and comforts of a custom designer home , completely separate entrance, plenty of easy parking, and fully loaded! Af
## experiences_offered
## 1 none
## 2 none
## 3 none
## 4 none
## 5 none
## 6 none
## neighborhood_overview
## 1
## 2 Quiet-yet-close to all the fun in LA! Hollywood, Universal Studios, beaches, great hikes and more are all minutes away.
## 3 We are minutes away from the Mentor Language Institute, Kings College, Musicians Institute, and many film schools including AFI, and the American Academy of Dramatic Arts. Halfway between UCLA and USC. We are minutes away from the Hollywood Boulevard Walk of Fame and all the clubs on Sunset Strip. All the comedy clubs are here, as well. Minutes from the Grove and Rodeo Drive. I'll give you maps and directions to everything. Universal City is just up the road. Magic Mountain is a short drive out of town. Disneyland , as well.
## 4
## 5 What makes the neighborhood unique is that there are 5 grocery stores within 5 minutes and 2 Malls within 7 minutes. There are also many parks and with the San Gabriel Pass being a few minutes away, you can actually ride a bike to Seal Beach. The 91 freeway is 2 minutes away and the 605 3 minutes. The 105 freeway about 6 minutes and the 5 freeway about 6 minutes. The closest beach is about 12 minutes away. Downtown LA is about 20 minutes. Disneyland is about 12 minutes.
## 6 This is the famous Hollywood hills.. Historical for Music , many nighbor are well known celebrities
## notes
## 1
## 2 One dog may be on premises, friendly and cared for by caretaker. A great addition to stabilize kids-away-from-home and bring a family feel to your vacation.
## 3 Decorated for the Holidays. Blendtec® Designer 625 Blender Bundle with Twister Jar MORE THAN JUST SMOOTHIES: From hot soup to ice cream and everything in between, your imagination is the only limit to the creative power you can unleash with Blendtec blender. Our memory foam pillows are the best you'll ever sleep on. They are customizable utilizing exclusive Variable Fill Technology ensuring a pillow that is tailored just for you. This is the only memory foam pillow in the world that is adjustable. You can sculpt it much like a down pillow - it will shift and change into whatever shape you desire. We offer a continental breakfast and/or light breakfast fare. Wake up coffee or you can make your own. The first night and morning for all guests. There is a candy bowl with and without, sugar-free. There is a white terrycloth robe and slippers as well as fluffy thick bath and hand towels and a facecloth. Handmade Amish Wildflower Soap. A luffa mitt and other arrival bath amenities. A wel
## 4
## 5 If you are doing business travel, this studio is excellent because it offers one large desk and also a built in desk that would give you lots of room. Fiber optic wifi is very stable and fast 100 mbps.
## 6
## transit
## 1
## 2 Short drive to subway and elevated trains running to major tourist spots in LA; freeways minutes away as well. Car is advised for maximum accessibility to greater Los Angeles. Uber-friendly suburb, close to Hollywood and more.
## 3 There are many buses; bus stops going in every direction are just around the corner. The subway is five minutes away. We are in the heart of Los Angeles, West Hollywood, Hollywood, California, USA. Convenient to all the major studios. Beverly Hills is minutes away, as well.
## 4
## 5 Public transportation is a 3 minutes walk to the main street.
## 6 Car, Bike and Hike !! Uber access , Bus stop walking distance
## access
## 1
## 2 Pool, patio and self-contained main house all accessible freely by guests. Garage, pool house and back caretaker unit not accessible.
## 3 Kitchen with new refrigerator, dishwasher, stove and oven with new plank floors. Jacuzzi and sundeck New gym with new treadmill and elliptical Sauna Your own secure parking space Washer Dryer in building Shared brand new updated Bath with new glass enclosure and new plank floors.
## 4
## 5 Good access to all things in Los Angeles and Orange County.
## 6
## interaction
## 1
## 2 Host and caretaker may be available throughout your stay to assist in troubleshooting with local information/amenities. During holiday time, you may have the place to yourself. Host available for support by phone.
## 3 I am friendly and available to help you with your needs even before you arrive. I am seldom seen as I am in and out with my daily tasks. I always greet you with a smile if we do run into each other. I am happy to help you find things to do especially if it is about the entertainment industry.
## 4
## 5 I am always available for questions throughout the travellers stay.
## 6
## house_rules
## 1 Camelot NEW RESIDENTS’ GENERAL INFORMATION File: New Residents Info 1 Created on 12/13/05 Hello, and welcome to the Camelot Condominium Complex. Below is some information to help you become oriented to your building and the complex. 1. The Camelot complex consists of five buildings. Your new unit is in bldg._______.(URL HIDDEN)You need to always use your building number plus your unit number when contacting either our property management company, Real Support Property Management Co. at ((PHONE NUMBER HIDDEN) or the Camelot office at (PHONE NUMBER HIDDEN). The Camelot office hours are Monday through Friday 8:30 am to 3:30 pm. 2. Parking in our Structures: Parking for residents is in assigned, numbered spaces that legally belong to each unit. (Some units only have one parking space.) You, any guest, or temporary worker you might have may only park in one of your assigned spaces. NOTE: Any vehicle parked illegally in another resident’s slot or in a common area can
## 2 Host asks that guests refrain from partying loudly into the evening on back patio/pool area. Guest swim at their own risk; guest booking indicates agreement by Guest that Host is not responsible for any injuries related to the use of the pool or from being in or around the pool area. Finally, plumbing in the house is a bit sensitive. No feminine items down the toilet and nothing at all allowed in the garbage disposal in the kitchen. Thank you!
## 3 I just have one rule. The Golden Rule Do unto others as you would have them do unto you. This is a no smoking drug free place. No pets.
## 4 ABOUT YOU. Friendly travelers or people coming to LA for work are welcome to stay .I am open to interns who visit Santa Monica. This isn’t a party house, but if you’re looking for a party there are plenty of great bars, music and comedyvenues within walking distance. Please tell me about yourself, and we can decide if it’s going to be a good fit. A few requests… -No smoking -No guests apart from those registered, and no parties -Please remove shoes in the house. -I keep my home clean, and would ask you to do the same.
## 5
## 6 No Drugs, No partying, No unreasonable loudness of anykind after 11pm, no smoking, please keep voices low when entering property after 11pm as to not disturb neighbors
## thumbnail_url medium_url
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## picture_url
## 1 https://a0.muscache.com/im/pictures/4321499/1da9892a_original.jpg?aki_policy=large
## 2 https://a0.muscache.com/im/pictures/cc4b724d-db8b-4dd8-9c01-25841c4ba6ca.jpg?aki_policy=large
## 3 https://a0.muscache.com/im/pictures/40618141/2ac0b446_original.jpg?aki_policy=large
## 4 https://a0.muscache.com/im/pictures/1082974/0f74c9d1_original.jpg?aki_policy=large
## 5 https://a0.muscache.com/im/pictures/23817858/de20cdd9_original.jpg?aki_policy=large
## 6 https://a0.muscache.com/im/pictures/5147dcd2-efad-495c-8c31-d781cc626878.jpg?aki_policy=large
## xl_picture_url host_id host_url host_name
## 1 NA 521 https://www.airbnb.com/users/show/521 Paolo
## 2 NA 767 https://www.airbnb.com/users/show/767 Melissa
## 3 NA 3008 https://www.airbnb.com/users/show/3008 Chas.
## 4 NA 3041 https://www.airbnb.com/users/show/3041 Yoga Priestess
## 5 NA 3207 https://www.airbnb.com/users/show/3207 Bernadine
## 6 NA 3415 https://www.airbnb.com/users/show/3415 Nataraj
## host_since host_location
## 1 2008-06-27 San Francisco, California, United States
## 2 2008-07-11 Burbank, California, United States
## 3 2008-09-16 Los Angeles, California, United States
## 4 2008-09-17 Santa Monica, California, United States
## 5 2008-09-25 Long Beach, California, United States
## 6 2008-10-02 Los Angeles, California, United States
## host_about
## 1 Search for me on the Internet with the keyword pppaolo\n\nPolyhedric Lateral Thinker Entrepreneur, a Human Network Router and a Serendipity Innovator\n\n"Ahead of Number One" Paolo is a young technology entrepreneur and visionary, specializing in structuring progressive business models that capture the moment and stay on top of the future.\n\nClass of 1977, a master in computer science and one in marketing. 15 yrs of deep experience in Internet technology and business strategy, served more than 100 businesses.\nHe founded his first Internet company when he was 18. Then he founded Digitix, in 1999 in Italy and in 2002 in the United States, when he moved, first to NY, then LA and finally SF; involved in the operations and partner in 5 startups.\n\nIn 2010 he founded in Silicon Valley along with other partners, Doochoo, a revolutionary platform for the opinions in Internet as an innovative marketing and user engagement tool between brands and consumers: signed with Ikea, the first client in 2011.\n\nIn 2011 he held the position of Head of Innovation and Emerging Media in H-art, company of H-farm group, interactive agency for strategic marketing and communication projects, acquired by WPP, the world's' largest marketing and comm. 
group.\nPaolo now works full time and focuses only on his venture Doochoo ( (Website hidden by Airbnb) \n\nVery active in SV, connecting together experienced entrepreneurs, investors, , start-up rookies, and working to create an "Int'l Accelerator and TT Center", meanwhile, for years he has created a bridge between Italian and int'l companies.\n\nHe received several career awards from excellence centers, universities and conferences in Italy and USA; interviewed and mentioned on Int'l papers such as Financial Times, Wired, TechCrunch, La Repubblica, Il Sole 24 Ore, RAI, and many more.\n\nPaolo speaks Italian English Spanish French, restless traveler, power rollerskater, addicted photographer.\n\nTo date, Paolo commutes monthly between SF, NY and Venezia, and now with his company Pick1 (Doochoo Inc) he is part of Start-Up Chile amazing program and 500 Startups!
## 2 Single mother, CEO and Owner of an international coaching and training business. \n\nLove to travel! Family-focused and single friendly due to my own status! Hail from Washington, DC originally. International interests.\n\n"RIOT FOR JOY" is my motto. Looking forward to getting to know YOU.
## 3 Writer.\nLiterary Manager.\nPhotographer.\nProducing Partner.\nI work all the time.\nI wear many hats.\nProfessional.\nPleasant.\nRespectful.\nOptimistic and cheerful.
## 4 I have been teaching yoga and meditation for 30 years.\nWorld-traveled,passionate,love life and committed to making the world a healthier place one person and one company at a time. Enjoy meeting new and interesting people.
## 5 Fair, open, honest and very informative for new guests to the area.
## 6 Music Industry, Record producer, Songwriter, Composer, Multi Instrumentalist, Recording Artist
## host_response_time host_response_rate host_acceptance_rate host_is_superhost
## 1 N/A N/A N/A f
## 2 within a day 100% N/A f
## 3 within an hour 100% N/A t
## 4 within a few hours 100% N/A f
## 5 N/A N/A N/A f
## 6 N/A N/A N/A f
## host_thumbnail_url
## 1 https://a0.muscache.com/im/users/521/profile_pic/1429917533/original.jpg?aki_policy=profile_small
## 2 https://a0.muscache.com/im/users/767/profile_pic/1259093012/original.jpg?aki_policy=profile_small
## 3 https://a0.muscache.com/im/pictures/user/d17cfddd-9f98-4d0c-bfee-c005cc38a7de.jpg?aki_policy=profile_small
## 4 https://a0.muscache.com/im/users/3041/profile_pic/1331080494/original.jpg?aki_policy=profile_small
## 5 https://a0.muscache.com/im/pictures/8b82a267-bc4b-4d8b-935a-463a39c8c5ae.jpg?aki_policy=profile_small
## 6 https://a0.muscache.com/im/users/3415/profile_pic/1281545642/original.jpg?aki_policy=profile_small
## host_picture_url
## 1 https://a0.muscache.com/im/users/521/profile_pic/1429917533/original.jpg?aki_policy=profile_x_medium
## 2 https://a0.muscache.com/im/users/767/profile_pic/1259093012/original.jpg?aki_policy=profile_x_medium
## 3 https://a0.muscache.com/im/pictures/user/d17cfddd-9f98-4d0c-bfee-c005cc38a7de.jpg?aki_policy=profile_x_medium
## 4 https://a0.muscache.com/im/users/3041/profile_pic/1331080494/original.jpg?aki_policy=profile_x_medium
## 5 https://a0.muscache.com/im/pictures/8b82a267-bc4b-4d8b-935a-463a39c8c5ae.jpg?aki_policy=profile_x_medium
## 6 https://a0.muscache.com/im/users/3415/profile_pic/1281545642/original.jpg?aki_policy=profile_x_medium
## host_neighbourhood host_listings_count host_total_listings_count
## 1 Culver City 1 1
## 2 Burbank 1 1
## 3 Hollywood 2 2
## 4 Santa Monica 2 2
## 5 Bellflower 1 1
## 6 Laurel Canyon 3 3
## host_verifications
## 1 ['email', 'phone', 'facebook', 'reviews', 'kba']
## 2 ['email', 'phone', 'reviews', 'jumio', 'kba', 'government_id']
## 3 ['email', 'phone', 'facebook', 'reviews', 'kba']
## 4 ['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'government_id']
## 5 ['email', 'phone', 'facebook', 'kba']
## 6 ['email', 'phone', 'reviews', 'jumio', 'government_id']
## host_has_profile_pic host_identity_verified street
## 1 t t Culver City, CA, United States
## 2 t t Burbank, CA, United States
## 3 t t Los Angeles, CA, United States
## 4 t f Santa Monica, CA, United States
## 5 t t Bellflower, CA, United States
## 6 t t Los Angeles, CA, United States
## neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed
## 1 Culver City Culver City NA
## 2 Burbank Burbank NA
## 3 Hollywood NA
## 4 Santa Monica Santa Monica NA
## 5 Bellflower Bellflower NA
## 6 Laurel Canyon Hollywood Hills West NA
## city state zipcode market smart_location country_code
## 1 Culver City CA 90230 Los Angeles Culver City, CA US
## 2 Burbank CA 91505 Los Angeles Burbank, CA US
## 3 Los Angeles CA 90046 Los Angeles Los Angeles, CA US
## 4 Santa Monica CA 90405 Los Angeles Santa Monica, CA US
## 5 Bellflower CA 90706 Los Angeles Bellflower, CA US
## 6 Los Angeles CA 90046 Los Angeles Los Angeles, CA US
## country latitude longitude is_location_exact property_type
## 1 United States 33.98209 -118.3849 t Condominium
## 2 United States 34.16562 -118.3346 t House
## 3 United States 34.09768 -118.3460 t Apartment
## 4 United States 34.00475 -118.4813 t Apartment
## 5 United States 33.87619 -118.1140 t Apartment
## 6 United States 34.11132 -118.3823 t Guest suite
## room_type accommodates bathrooms bedrooms beds bed_type
## 1 Entire home/apt 6 2.0 2 3 Real Bed
## 2 Entire home/apt 6 1.0 3 3 Real Bed
## 3 Private room 1 1.5 1 1 Real Bed
## 4 Private room 1 1.0 1 1 Pull-out Sofa
## 5 Entire home/apt 2 1.0 1 1 Real Bed
## 6 Entire home/apt 2 1.0 1 2 Real Bed
## amenities
## 1 {TV,"Cable TV",Internet,Wifi,"Air conditioning","Wheelchair accessible",Pool,Kitchen,"Free parking on premises","Pets allowed",Gym,Elevator,"Hot tub","Indoor fireplace","Buzzer/wireless intercom",Heating,"Family/kid friendly","Suitable for events",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}
## 2 {TV,"Cable TV",Internet,Wifi,"Air conditioning",Pool,Kitchen,"Pets live on this property",Dog(s),"Free street parking",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Children’s books and toys","Fireplace guards","Children’s dinnerware","Hot water",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"Single level home","BBQ grill","Patio or balcony","Luggage dropoff allowed",Other}
## 3 {Internet,Wifi,"Air conditioning","Wheelchair accessible",Kitchen,"Free parking on premises",Gym,Breakfast,Elevator,"Free street parking","Hot tub","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50","Hot water","Bed linens","Extra pillows and blankets",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"Single level home","Patio or balcony","Host greets you"}
## 4 {Internet,Wifi,Kitchen,Heating,Washer,Dryer,"Smoke detector",Essentials,Shampoo,Hangers,"Hair dryer","Host greets you"}
## 5 {TV,"Cable TV",Internet,Wifi,"Air conditioning",Kitchen,"Free parking on premises","Hot tub","Indoor fireplace",Heating,Washer,Dryer,"Smoke detector","First aid kit","Fire extinguisher",Hangers,"Hair dryer","Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
## 6 {TV,"Cable TV",Wifi,"Air conditioning",Kitchen,"Free parking on premises","Pets allowed","Free street parking",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50","Private entrance","Hot water","Bed linens","Long term stays allowed","Host greets you"}
## square_feet price weekly_price monthly_price security_deposit cleaning_fee
## 1 NA $122.00 $904.00 $2,851.00 $500.00 $240.00
## 2 NA $168.00 $0.00 $100.00
## 3 NA $79.00 $399.00 $949.00 $299.00 $85.00
## 4 NA $140.00 $800.00 $1,879.00 $100.00
## 5 NA $80.00 $399.00 $1,400.00 $100.00 $75.00
## 6 NA $82.00 $790.00 $2,450.00 $250.00 $60.00
## guests_included extra_people minimum_nights maximum_nights calendar_updated
## 1 3 $25.00 7 730 9 months ago
## 2 6 $0.00 2 14 4 days ago
## 3 1 $0.00 6 366 today
## 4 1 $0.00 1 180 4 weeks ago
## 5 1 $25.00 2 730 4 months ago
## 6 1 $9.00 3 730 3 weeks ago
## has_availability availability_30 availability_60 availability_90
## 1 t 0 0 0
## 2 t 0 0 0
## 3 t 0 0 6
## 4 t 25 55 85
## 5 t 0 0 0
## 6 t 0 0 7
## availability_365 calendar_last_scraped number_of_reviews first_review
## 1 236 2018-12-07 2 2011-08-15
## 2 135 2018-12-07 4 2016-06-14
## 3 260 2018-12-06 13 2014-06-09
## 4 360 2018-12-06 18 2011-06-06
## 5 0 2018-12-06 0
## 6 282 2018-12-07 23 2013-09-03
## last_review review_scores_rating review_scores_accuracy
## 1 2016-05-15 80 10
## 2 2018-10-21 93 10
## 3 2018-09-07 97 10
## 4 2018-11-15 96 9
## 5 NA NA
## 6 2018-10-31 81 8
## review_scores_cleanliness review_scores_checkin review_scores_communication
## 1 10 6 8
## 2 10 10 10
## 3 10 10 10
## 4 9 10 10
## 5 NA NA NA
## 6 8 8 9
## review_scores_location review_scores_value requires_license license
## 1 10 8 f
## 2 10 9 f
## 3 10 10 f
## 4 10 9 t 228269
## 5 NA NA f
## 6 9 8 f
## jurisdiction_names instant_bookable is_business_travel_ready
## 1 {"Culver City"," CA"} f f
## 2 t f
## 3 {"City of Los Angeles"," CA"} t f
## 4 {"Santa Monica"} f f
## 5 f f
## 6 {"City of Los Angeles"," CA"} f f
## cancellation_policy require_guest_profile_picture
## 1 strict_14_with_grace_period t
## 2 flexible f
## 3 strict_14_with_grace_period f
## 4 strict_14_with_grace_period f
## 5 strict_14_with_grace_period f
## 6 strict_14_with_grace_period f
## require_guest_phone_verification calculated_host_listings_count
## 1 f 1
## 2 f 1
## 3 f 2
## 4 f 2
## 5 f 1
## 6 f 3
## reviews_per_month
## 1 0.02
## 2 0.13
## 3 0.24
## 4 0.20
## 5 NA
## 6 0.36
We need to subset the data to the features that are useful for description, modeling, and prediction. We will do the following, in no particular order:
Los_Angeles_Listings_subset <- Los_Angeles_Listings %>%
  dplyr::select(id, host_id, host_since, host_response_time, host_response_rate,
                experiences_offered, host_acceptance_rate, host_is_superhost,
                host_listings_count, host_total_listings_count, host_has_profile_pic,
                host_identity_verified, neighbourhood_cleansed, city, state,
                zipcode, market, country_code, country, is_location_exact,
                property_type, room_type, accommodates, bathrooms, bedrooms,
                beds, bed_type, amenities, square_feet, price, weekly_price, monthly_price,
                security_deposit, cleaning_fee, guests_included, extra_people, minimum_nights,
                maximum_nights, has_availability, availability_30,
                availability_60, availability_90, availability_365, number_of_reviews,
                first_review, last_review, review_scores_rating, review_scores_accuracy,
                review_scores_cleanliness, review_scores_checkin, review_scores_communication,
                review_scores_location, review_scores_value, requires_license,
                instant_bookable, is_business_travel_ready, cancellation_policy,
                require_guest_profile_picture, require_guest_phone_verification,
                calculated_host_listings_count, reviews_per_month, neighbourhood_group_cleansed)
# Take a look at descriptive summary of Los_Angeles_Listings_subset dataset
summary(Los_Angeles_Listings_subset)
## id host_id host_since
## Min. : 109 Min. : 59 2017-05-10: 162
## 1st Qu.:11624794 1st Qu.: 10488351 2015-11-02: 88
## Median :19916052 Median : 37791461 2015-04-19: 83
## Mean :18186850 Mean : 64844796 2018-05-02: 82
## 3rd Qu.:25741808 3rd Qu.:106317829 2016-07-09: 79
## Max. :30584243 Max. :229328694 2014-07-11: 76
## (Other) :42477
## host_response_time host_response_rate experiences_offered
## : 3 100% :23975 none:43047
## a few days or more: 700 N/A :12011
## N/A :12011 90% : 870
## within a day : 2626 99% : 619
## within a few hours: 4663 98% : 584
## within an hour :23044 0% : 476
## (Other): 4512
## host_acceptance_rate host_is_superhost host_listings_count
## : 3 : 3 Min. : 0.000
## N/A:43044 f:31649 1st Qu.: 1.000
## t:11395 Median : 2.000
## Mean : 9.598
## 3rd Qu.: 5.000
## Max. :803.000
## NA's :3
## host_total_listings_count host_has_profile_pic host_identity_verified
## Min. : 0.000 : 3 : 3
## 1st Qu.: 1.000 f: 49 f:22651
## Median : 2.000 t:42995 t:20393
## Mean : 9.598
## 3rd Qu.: 5.000
## Max. :803.000
## NA's :3
## neighbourhood_cleansed city state
## Hollywood : 2757 Los Angeles :27454 CA :43000
## Venice : 2691 Long Beach : 1497 Ca : 32
## Downtown : 1642 West Hollywood: 1001 ca : 6
## Long Beach : 1499 Santa Monica : 962 : 2
## Hollywood Hills: 1092 Marina del Rey: 747 NY : 2
## Westlake : 1006 Beverly Hills : 709 加州 : 1
## (Other) :32360 (Other) :10677 (Other): 4
## zipcode market country_code
## 90291 : 2193 Los Angeles :41609 US:43047
## 90046 : 1661 Other (Domestic) : 1023
## 90028 : 1659 Malibu : 265
## 90026 : 1263 : 74
## 90068 : 1007 Fontana : 35
## 90036 : 988 Coastal Orange County: 13
## (Other):34276 (Other) : 28
## country is_location_exact property_type
## United States:43047 f: 9296 Apartment :16187
## t:33751 House :14510
## Condominium: 2406
## Guesthouse : 2202
## Townhouse : 1368
## Guest suite: 1324
## (Other) : 5050
## room_type accommodates bathrooms bedrooms
## Entire home/apt:26835 Min. : 1.000 Min. : 0.000 Min. : 0.000
## Private room :14261 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Shared room : 1951 Median : 3.000 Median : 1.000 Median : 1.000
## Mean : 3.678 Mean : 1.451 Mean : 1.415
## 3rd Qu.: 5.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :40.000 Max. :22.000 Max. :50.000
## NA's :28 NA's :18
## beds bed_type
## Min. : 0.000 Airbed : 131
## 1st Qu.: 1.000 Couch : 82
## Median : 1.000 Futon : 232
## Mean : 1.981 Pull-out Sofa: 157
## 3rd Qu.: 2.000 Real Bed :42445
## Max. :50.000
## NA's :34
## amenities
## {} : 140
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Heating,Essentials,Shampoo,Hangers} : 34
## {TV,Wifi,Kitchen,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Private entrance","Hot water","Body soap","Bed linens",Microwave,"Coffee maker",Refrigerator,"Dishes and silverware","Cooking basics",Stove,"Host greets you"} : 30
## {"Family/kid friendly"} : 29
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Gym,Elevator,"Hot tub",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"BBQ grill","Patio or balcony","Long term stays allowed","Paid parking on premises"} : 23
## {TV,"Cable TV",Wifi,Pool,Kitchen,"Free parking on premises",Gym,"Pets live on this property",Elevator,"Indoor fireplace",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Private entrance",Bathtub,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware",Oven,Stove,"Patio or balcony","Long term stays allowed",Beachfront}: 21
## (Other) :42770
## square_feet price weekly_price monthly_price
## Min. : 0.0 $100.00: 1381 :37410 :37901
## 1st Qu.: 400.0 $150.00: 1216 $500.00: 203 $3,000.00: 161
## Median : 800.0 $75.00 : 1194 $600.00: 192 $2,500.00: 147
## Mean : 991.7 $50.00 : 1055 $800.00: 172 $1,500.00: 141
## 3rd Qu.:1200.0 $99.00 : 1015 $700.00: 170 $1,800.00: 128
## Max. :7000.0 $125.00: 909 $650.00: 160 $2,000.00: 123
## NA's :42700 (Other):36277 (Other): 4740 (Other) : 4446
## security_deposit cleaning_fee guests_included extra_people
## :11450 : 6336 Min. : 1.000 $0.00 :20313
## $0.00 : 8578 $50.00 : 2796 1st Qu.: 1.000 $10.00 : 4148
## $100.00: 4492 $100.00: 2609 Median : 1.000 $25.00 : 3724
## $500.00: 3607 $25.00 : 1775 Mean : 1.909 $20.00 : 3509
## $200.00: 2773 $0.00 : 1756 3rd Qu.: 2.000 $15.00 : 2820
## $300.00: 2085 $150.00: 1751 Max. :16.000 $50.00 : 1949
## (Other):10062 (Other):26024 (Other): 6584
## minimum_nights maximum_nights has_availability availability_30
## Min. : 1.000 Min. : 1.0 t:43047 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 30.0 1st Qu.: 1.00
## Median : 2.000 Median : 1125.0 Median :13.00
## Mean : 5.104 Mean : 666.9 Mean :13.66
## 3rd Qu.: 3.000 3rd Qu.: 1125.0 3rd Qu.:25.00
## Max. :3000.000 Max. :1000000.0 Max. :30.00
##
## availability_60 availability_90 availability_365 number_of_reviews
## Min. : 0.0 Min. : 0.00 Min. : 0 Min. : 0.0
## 1st Qu.: 8.0 1st Qu.:18.00 1st Qu.: 51 1st Qu.: 1.0
## Median :36.0 Median :64.00 Median :160 Median : 7.0
## Mean :32.2 Mean :52.24 Mean :177 Mean : 28.5
## 3rd Qu.:54.0 3rd Qu.:83.00 3rd Qu.:335 3rd Qu.: 32.0
## Max. :60.0 Max. :90.00 Max. :365 Max. :739.0
##
## first_review last_review review_scores_rating
## : 8720 : 8720 Min. : 20.00
## 2018-11-12: 119 2018-12-02: 1196 1st Qu.: 93.00
## 2018-10-28: 116 2018-11-12: 1131 Median : 97.00
## 2018-11-11: 111 2018-11-25: 1113 Mean : 94.48
## 2018-07-08: 106 2018-11-18: 921 3rd Qu.:100.00
## 2018-08-12: 105 2018-11-24: 893 Max. :100.00
## (Other) :33770 (Other) :29073 NA's :9276
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.:10.000 1st Qu.: 9.000 1st Qu.:10.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.645 Mean : 9.438 Mean : 9.777
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :9299 NA's :9297 NA's :9335
## review_scores_communication review_scores_location review_scores_value
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.:10.000 1st Qu.: 9.000 1st Qu.: 9.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.765 Mean : 9.655 Mean : 9.483
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :9304 NA's :9340 NA's :9348
## requires_license instant_bookable is_business_travel_ready
## f:41439 f:23614 f:43047
## t: 1608 t:19433
##
##
##
##
##
## cancellation_policy require_guest_profile_picture
## flexible :12890 f:42255
## moderate :11888 t: 792
## strict : 63
## strict_14_with_grace_period:18008
## super_strict_30 : 9
## super_strict_60 : 189
##
## require_guest_phone_verification calculated_host_listings_count
## f:41950 Min. : 1.000
## t: 1097 1st Qu.: 1.000
## Median : 2.000
## Mean : 5.834
## 3rd Qu.: 5.000
## Max. :152.000
##
## reviews_per_month neighbourhood_group_cleansed
## Min. : 0.010 Mode:logical
## 1st Qu.: 0.390 NA's:43047
## Median : 1.180
## Mean : 1.898
## 3rd Qu.: 2.860
## Max. :17.840
## NA's :8720
# Removing original dataset to free up the space
rm(Los_Angeles_Listings)
# The summary output shows that "neighbourhood_group_cleansed",
# "experiences_offered" and "host_acceptance_rate" carry no usable information:
# their values are all NA, all "none", and all "N/A" (plus 3 blanks), respectively.
# We therefore remove these features.
Los_Angeles_Listings_subset$neighbourhood_group_cleansed <- NULL
Los_Angeles_Listings_subset$host_acceptance_rate <- NULL
Los_Angeles_Listings_subset$experiences_offered <- NULL
# The summary output also shows that "country_code", "country", "state",
# "has_availability" and "is_business_travel_ready" each hold a single value
# (or, for "state", almost only "CA"), so they are not useful
# for modeling and prediction. We remove them.
Los_Angeles_Listings_subset$country_code <- NULL
Los_Angeles_Listings_subset$country <- NULL
Los_Angeles_Listings_subset$state <- NULL
Los_Angeles_Listings_subset$has_availability <- NULL
Los_Angeles_Listings_subset$is_business_travel_ready <- NULL
# host_total_listings_count and host_listings_count contain the same information,
# so we keep only one of them, i.e. host_listings_count
Los_Angeles_Listings_subset$host_total_listings_count <- NULL
# The square_feet feature contains 42,700 NA values, approximately
# 99% of all records, so we remove this feature as well
summary(Los_Angeles_Listings_subset$square_feet)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 400.0 800.0 991.7 1200.0 7000.0 42700
Los_Angeles_Listings_subset$square_feet <- NULL
# We determined that the following features are not needed for this analysis:
Los_Angeles_Listings_subset$weekly_price <- NULL
Los_Angeles_Listings_subset$monthly_price <- NULL
Los_Angeles_Listings_subset$first_review <- NULL
Los_Angeles_Listings_subset$last_review <- NULL
Los_Angeles_Listings_subset$market <- NULL
Los_Angeles_Listings_subset$zipcode <- NULL
Los_Angeles_Listings_subset$city <- NULL
Los_Angeles_Listings_subset$neighbourhood_cleansed <- NULL
Los_Angeles_Listings_subset$id <- NULL
Los_Angeles_Listings_subset$host_id <- NULL
Los_Angeles_Listings_subset$host_since <- NULL
# Checking the summary statistics of updated dataset
summary(Los_Angeles_Listings_subset)
## host_response_time host_response_rate host_is_superhost
## : 3 100% :23975 : 3
## a few days or more: 700 N/A :12011 f:31649
## N/A :12011 90% : 870 t:11395
## within a day : 2626 99% : 619
## within a few hours: 4663 98% : 584
## within an hour :23044 0% : 476
## (Other): 4512
## host_listings_count host_has_profile_pic host_identity_verified
## Min. : 0.000 : 3 : 3
## 1st Qu.: 1.000 f: 49 f:22651
## Median : 2.000 t:42995 t:20393
## Mean : 9.598
## 3rd Qu.: 5.000
## Max. :803.000
## NA's :3
## is_location_exact property_type room_type accommodates
## f: 9296 Apartment :16187 Entire home/apt:26835 Min. : 1.000
## t:33751 House :14510 Private room :14261 1st Qu.: 2.000
## Condominium: 2406 Shared room : 1951 Median : 3.000
## Guesthouse : 2202 Mean : 3.678
## Townhouse : 1368 3rd Qu.: 5.000
## Guest suite: 1324 Max. :40.000
## (Other) : 5050
## bathrooms bedrooms beds bed_type
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Airbed : 131
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 Couch : 82
## Median : 1.000 Median : 1.000 Median : 1.000 Futon : 232
## Mean : 1.451 Mean : 1.415 Mean : 1.981 Pull-out Sofa: 157
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 Real Bed :42445
## Max. :22.000 Max. :50.000 Max. :50.000
## NA's :28 NA's :18 NA's :34
## amenities
## {} : 140
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Heating,Essentials,Shampoo,Hangers} : 34
## {TV,Wifi,Kitchen,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Private entrance","Hot water","Body soap","Bed linens",Microwave,"Coffee maker",Refrigerator,"Dishes and silverware","Cooking basics",Stove,"Host greets you"} : 30
## {"Family/kid friendly"} : 29
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Gym,Elevator,"Hot tub",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"BBQ grill","Patio or balcony","Long term stays allowed","Paid parking on premises"} : 23
## {TV,"Cable TV",Wifi,Pool,Kitchen,"Free parking on premises",Gym,"Pets live on this property",Elevator,"Indoor fireplace",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Private entrance",Bathtub,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware",Oven,Stove,"Patio or balcony","Long term stays allowed",Beachfront}: 21
## (Other) :42770
## price security_deposit cleaning_fee guests_included
## $100.00: 1381 :11450 : 6336 Min. : 1.000
## $150.00: 1216 $0.00 : 8578 $50.00 : 2796 1st Qu.: 1.000
## $75.00 : 1194 $100.00: 4492 $100.00: 2609 Median : 1.000
## $50.00 : 1055 $500.00: 3607 $25.00 : 1775 Mean : 1.909
## $99.00 : 1015 $200.00: 2773 $0.00 : 1756 3rd Qu.: 2.000
## $125.00: 909 $300.00: 2085 $150.00: 1751 Max. :16.000
## (Other):36277 (Other):10062 (Other):26024
## extra_people minimum_nights maximum_nights availability_30
## $0.00 :20313 Min. : 1.000 Min. : 1.0 Min. : 0.00
## $10.00 : 4148 1st Qu.: 1.000 1st Qu.: 30.0 1st Qu.: 1.00
## $25.00 : 3724 Median : 2.000 Median : 1125.0 Median :13.00
## $20.00 : 3509 Mean : 5.104 Mean : 666.9 Mean :13.66
## $15.00 : 2820 3rd Qu.: 3.000 3rd Qu.: 1125.0 3rd Qu.:25.00
## $50.00 : 1949 Max. :3000.000 Max. :1000000.0 Max. :30.00
## (Other): 6584
## availability_60 availability_90 availability_365 number_of_reviews
## Min. : 0.0 Min. : 0.00 Min. : 0 Min. : 0.0
## 1st Qu.: 8.0 1st Qu.:18.00 1st Qu.: 51 1st Qu.: 1.0
## Median :36.0 Median :64.00 Median :160 Median : 7.0
## Mean :32.2 Mean :52.24 Mean :177 Mean : 28.5
## 3rd Qu.:54.0 3rd Qu.:83.00 3rd Qu.:335 3rd Qu.: 32.0
## Max. :60.0 Max. :90.00 Max. :365 Max. :739.0
##
## review_scores_rating review_scores_accuracy review_scores_cleanliness
## Min. : 20.00 Min. : 2.000 Min. : 2.000
## 1st Qu.: 93.00 1st Qu.:10.000 1st Qu.: 9.000
## Median : 97.00 Median :10.000 Median :10.000
## Mean : 94.48 Mean : 9.645 Mean : 9.438
## 3rd Qu.:100.00 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :100.00 Max. :10.000 Max. :10.000
## NA's :9276 NA's :9299 NA's :9297
## review_scores_checkin review_scores_communication review_scores_location
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.:10.000 1st Qu.:10.000 1st Qu.: 9.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.777 Mean : 9.765 Mean : 9.655
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :9335 NA's :9304 NA's :9340
## review_scores_value requires_license instant_bookable
## Min. : 2.000 f:41439 f:23614
## 1st Qu.: 9.000 t: 1608 t:19433
## Median :10.000
## Mean : 9.483
## 3rd Qu.:10.000
## Max. :10.000
## NA's :9348
## cancellation_policy require_guest_profile_picture
## flexible :12890 f:42255
## moderate :11888 t: 792
## strict : 63
## strict_14_with_grace_period:18008
## super_strict_30 : 9
## super_strict_60 : 189
##
## require_guest_phone_verification calculated_host_listings_count
## f:41950 Min. : 1.000
## t: 1097 1st Qu.: 1.000
## Median : 2.000
## Mean : 5.834
## 3rd Qu.: 5.000
## Max. :152.000
##
## reviews_per_month
## Min. : 0.010
## 1st Qu.: 0.390
## Median : 1.180
## Mean : 1.898
## 3rd Qu.: 2.860
## Max. :17.840
## NA's :8720
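The columns removed in this step were picked by reading the summary output. As a rough cross-check, the same candidates could be flagged programmatically. The sketch below is illustrative only: the toy data frame, the helper names `missing_share` and `n_distinct_values`, and the 0.9 threshold are assumptions, not part of the original analysis.

```r
# Toy data frame standing in for the listings subset (values are made up).
toy <- data.frame(
  country     = c("United States", "United States", "United States", "United States"),
  acceptance  = c("N/A", "N/A", "", "N/A"),
  square_feet = c(NA, NA, NA, 800),
  price       = c("$122.00", "$168.00", "$79.00", "$140.00"),
  stringsAsFactors = FALSE
)

# Share of values in a column that are NA, blank, or the literal string "N/A".
missing_share <- function(x) {
  x_chr <- as.character(x)
  mean(is.na(x_chr) | x_chr %in% c("", "N/A"))
}

# Number of distinct non-missing values in a column.
n_distinct_values <- function(x) {
  x_chr <- as.character(x)
  keep <- !(is.na(x_chr) | x_chr %in% c("", "N/A"))
  length(unique(x_chr[keep]))
}

profile <- data.frame(
  column        = names(toy),
  missing_share = vapply(toy, missing_share, numeric(1)),
  n_distinct    = vapply(toy, n_distinct_values, integer(1)),
  row.names     = NULL,
  stringsAsFactors = FALSE
)

# Candidates to drop: mostly missing (assumed threshold) or at most one distinct value.
drop_candidates <- profile$column[profile$missing_share > 0.9 | profile$n_distinct <= 1]
print(profile)
print(drop_candidates)
```

A profile like this only suggests candidates; whether a near-constant column such as `state` is worth keeping remains a judgment call.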
str(Los_Angeles_Listings_subset)
## 'data.frame': 43047 obs. of 41 variables:
## $ host_response_time : Factor w/ 6 levels "","a few days or more",..: 3 4 6 5 3 3 6 6 5 6 ...
## $ host_response_rate : Factor w/ 66 levels "","0%","10%",..: 66 4 4 4 66 66 4 4 4 4 ...
## $ host_is_superhost : Factor w/ 3 levels "","f","t": 2 2 3 2 2 2 2 3 3 2 ...
## $ host_listings_count : int 1 1 2 2 1 3 13 2 1 3 ...
## $ host_has_profile_pic : Factor w/ 3 levels "","f","t": 3 3 3 3 3 3 3 3 3 3 ...
## $ host_identity_verified : Factor w/ 3 levels "","f","t": 3 3 3 2 3 3 3 3 3 2 ...
## $ is_location_exact : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 1 ...
## $ property_type : Factor w/ 44 levels "Aparthotel","Apartment",..: 16 26 2 2 2 22 43 2 2 26 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 2 2 1 1 2 2 1 2 ...
## $ accommodates : int 6 6 1 1 2 2 2 1 2 2 ...
## $ bathrooms : num 2 1 1.5 1 1 1 1 1.5 1 1 ...
## $ bedrooms : int 2 3 1 1 1 1 1 1 1 1 ...
## $ beds : int 3 3 1 1 1 2 1 1 1 1 ...
## $ bed_type : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 4 5 5 5 5 5 5 ...
## $ amenities : Factor w/ 40292 levels "{\"Air conditioning\",\"Fire extinguisher\",Essentials,Shampoo,Hangers}",..: 3226 7439 738 2180 3891 10322 3126 740 2133 34974 ...
## $ price : Factor w/ 867 levels "$0.00","$1,000.00",..: 135 187 785 155 796 801 811 862 796 562 ...
## $ security_deposit : Factor w/ 214 levels "","$0.00","$1,000.00",..: 166 2 114 1 25 102 52 130 1 1 ...
## $ cleaning_fee : Factor w/ 294 levels "","$0.00","$1,000.00",..: 103 8 277 8 262 236 109 282 109 137 ...
## $ guests_included : int 3 6 1 1 1 1 2 1 2 1 ...
## $ extra_people : Factor w/ 99 levels "$0.00","$10.00",..: 36 1 1 1 36 93 1 1 36 1 ...
## $ minimum_nights : int 7 2 6 1 2 3 5 6 2 1 ...
## $ maximum_nights : int 730 14 366 180 730 730 30 375 365 730 ...
## $ availability_30 : int 0 0 0 25 0 0 15 5 19 30 ...
## $ availability_60 : int 0 0 0 55 0 0 15 14 49 60 ...
## $ availability_90 : int 0 0 6 85 0 7 15 44 73 90 ...
## $ availability_365 : int 236 135 260 360 0 282 15 319 313 179 ...
## $ number_of_reviews : int 2 4 13 18 0 23 22 12 184 0 ...
## $ review_scores_rating : int 80 93 97 96 NA 81 89 96 94 NA ...
## $ review_scores_accuracy : int 10 10 10 9 NA 8 8 10 10 NA ...
## $ review_scores_cleanliness : int 10 10 10 9 NA 8 8 9 9 NA ...
## $ review_scores_checkin : int 6 10 10 10 NA 8 9 10 10 NA ...
## $ review_scores_communication : int 8 10 10 10 NA 9 9 10 10 NA ...
## $ review_scores_location : int 10 10 10 10 NA 9 9 9 9 NA ...
## $ review_scores_value : int 8 9 10 9 NA 8 8 9 9 NA ...
## $ requires_license : Factor w/ 2 levels "f","t": 1 1 1 2 1 1 1 1 1 1 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 1 2 2 1 1 1 1 2 1 1 ...
## $ cancellation_policy : Factor w/ 6 levels "flexible","moderate",..: 4 1 4 4 4 4 4 4 4 1 ...
## $ require_guest_profile_picture : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 1 1 ...
## $ require_guest_phone_verification: Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
## $ calculated_host_listings_count : int 1 1 2 2 1 3 5 2 1 2 ...
## $ reviews_per_month : num 0.02 0.13 0.24 0.2 NA 0.36 0.19 0.1 1.59 NA ...
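The `str()` output shows that several factors carry a blank level (`""`) and, for the host response columns, a literal `"N/A"` level, so R does not treat those entries as missing. One way to make them visible to `is.na()` is sketched below. The helper name `blank_to_na` and the toy values are assumptions for illustration, and this is not the approach used in the rest of this report.

```r
# Toy factor mimicking host_response_time (values are made up).
x <- factor(c("within an hour", "N/A", "", "within a day", "within an hour"))

# Hypothetical helper: turn blank and "N/A" levels into real NA values.
blank_to_na <- function(f, missing_levels = c("", "N/A")) {
  f <- as.character(f)
  f[f %in% missing_levels] <- NA
  factor(f)  # rebuild the factor so the unused levels disappear
}

y <- blank_to_na(x)
levels(y)      # only the genuine categories remain
sum(is.na(y))  # number of values now treated as missing
```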
# Now we will relabel the object and continue the cleaning under the new name.
LA_Listings_Cleaned <- Los_Angeles_Listings_subset
Converting Factor to Numeric
# Converting host_response_rate to a numeric column
# ("N/A" and blank entries become NA; R warns about NAs introduced by coercion)
LA_Listings_Cleaned$host_response_rate <- as.numeric(
  gsub( "%", "", as.character(LA_Listings_Cleaned$host_response_rate)))
# Converting price to a numeric column (strip "$" and thousands separators)
LA_Listings_Cleaned$price <- as.numeric(
  gsub( "[\\$,]", "", as.character(LA_Listings_Cleaned$price)))
# Converting security_deposit to a numeric column (blank entries become NA)
LA_Listings_Cleaned$security_deposit <- as.numeric(
  gsub( "[\\$,]", "", as.character(LA_Listings_Cleaned$security_deposit)))
# Converting cleaning_fee to a numeric column (blank entries become NA)
LA_Listings_Cleaned$cleaning_fee <- as.numeric(
  gsub( "[\\$,]", "", as.character(LA_Listings_Cleaned$cleaning_fee)))
# Converting extra_people to a numeric column
LA_Listings_Cleaned$extra_people <- as.numeric(
  gsub( "[\\$,]", "", as.character(LA_Listings_Cleaned$extra_people)))
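The five conversions in this step follow one pattern: strip the symbols, then call `as.numeric()`. A minimal sketch of a shared helper is shown below. The name `to_number` and the toy inputs are assumptions for illustration. Unlike the code in this report, the helper sets blank and `"N/A"` entries to NA before converting, so the coercion does not emit a warning.

```r
# Hypothetical helper: strip "$", "," and "%" and convert to numeric.
# Blank strings and "N/A" are set to NA explicitly before the conversion.
to_number <- function(x) {
  x <- as.character(x)
  x[x %in% c("", "N/A")] <- NA
  as.numeric(gsub("[$,%]", "", x))
}

# Toy inputs in the same format as the listings data (values are made up).
prices <- factor(c("$122.00", "$2,851.00", "", "$0.00"))
rates  <- factor(c("100%", "N/A", "90%"))

to_number(prices)
to_number(rates)
```

Applied to the real data, such a helper could be mapped over the price-like columns with `lapply()`, which keeps the regular expression in a single place.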
Handling Missing Values
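# Looking at the summary of the dataset with column 22 dropped.
# Note: column 22 is "maximum_nights" (see the str() output above), not
# "amenities" (column 15), so the output below omits maximum_nights
# and still includes amenities.
summary(LA_Listings_Cleaned[,-22])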
## host_response_time host_response_rate host_is_superhost
## : 3 Min. : 0.00 : 3
## a few days or more: 700 1st Qu.:100.00 f:31649
## N/A :12011 Median :100.00 t:11395
## within a day : 2626 Mean : 95.32
## within a few hours: 4663 3rd Qu.:100.00
## within an hour :23044 Max. :100.00
## NA's :12014
## host_listings_count host_has_profile_pic host_identity_verified
## Min. : 0.000 : 3 : 3
## 1st Qu.: 1.000 f: 49 f:22651
## Median : 2.000 t:42995 t:20393
## Mean : 9.598
## 3rd Qu.: 5.000
## Max. :803.000
## NA's :3
## is_location_exact property_type room_type accommodates
## f: 9296 Apartment :16187 Entire home/apt:26835 Min. : 1.000
## t:33751 House :14510 Private room :14261 1st Qu.: 2.000
## Condominium: 2406 Shared room : 1951 Median : 3.000
## Guesthouse : 2202 Mean : 3.678
## Townhouse : 1368 3rd Qu.: 5.000
## Guest suite: 1324 Max. :40.000
## (Other) : 5050
## bathrooms bedrooms beds bed_type
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Airbed : 131
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 Couch : 82
## Median : 1.000 Median : 1.000 Median : 1.000 Futon : 232
## Mean : 1.451 Mean : 1.415 Mean : 1.981 Pull-out Sofa: 157
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 Real Bed :42445
## Max. :22.000 Max. :50.000 Max. :50.000
## NA's :28 NA's :18 NA's :34
## amenities
## {} : 140
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Heating,Essentials,Shampoo,Hangers} : 34
## {TV,Wifi,Kitchen,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Private entrance","Hot water","Body soap","Bed linens",Microwave,"Coffee maker",Refrigerator,"Dishes and silverware","Cooking basics",Stove,"Host greets you"} : 30
## {"Family/kid friendly"} : 29
## {TV,Wifi,"Air conditioning",Pool,Kitchen,Gym,Elevator,"Hot tub",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"BBQ grill","Patio or balcony","Long term stays allowed","Paid parking on premises"} : 23
## {TV,"Cable TV",Wifi,Pool,Kitchen,"Free parking on premises",Gym,"Pets live on this property",Elevator,"Indoor fireplace",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Self check-in",Lockbox,"Private entrance",Bathtub,"Hot water","Bed linens","Ethernet connection",Microwave,"Coffee maker",Refrigerator,Dishwasher,"Dishes and silverware",Oven,Stove,"Patio or balcony","Long term stays allowed",Beachfront}: 21
## (Other) :42770
## price security_deposit cleaning_fee guests_included
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 1.000
## 1st Qu.: 69.0 1st Qu.: 0.0 1st Qu.: 30.0 1st Qu.: 1.000
## Median : 105.0 Median : 200.0 Median : 60.0 Median : 1.000
## Mean : 198.6 Mean : 391.5 Mean : 83.3 Mean : 1.909
## 3rd Qu.: 179.0 3rd Qu.: 400.0 3rd Qu.: 100.0 3rd Qu.: 2.000
## Max. :25000.0 Max. :5100.0 Max. :1500.0 Max. :16.000
## NA's :11450 NA's :6336
## extra_people minimum_nights availability_30 availability_60
## Min. : 0 Min. : 1.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.: 8.0
## Median : 10 Median : 2.000 Median :13.00 Median :36.0
## Mean : 15 Mean : 5.104 Mean :13.66 Mean :32.2
## 3rd Qu.: 20 3rd Qu.: 3.000 3rd Qu.:25.00 3rd Qu.:54.0
## Max. :300 Max. :3000.000 Max. :30.00 Max. :60.0
##
## availability_90 availability_365 number_of_reviews review_scores_rating
## Min. : 0.00 Min. : 0 Min. : 0.0 Min. : 20.00
## 1st Qu.:18.00 1st Qu.: 51 1st Qu.: 1.0 1st Qu.: 93.00
## Median :64.00 Median :160 Median : 7.0 Median : 97.00
## Mean :52.24 Mean :177 Mean : 28.5 Mean : 94.48
## 3rd Qu.:83.00 3rd Qu.:335 3rd Qu.: 32.0 3rd Qu.:100.00
## Max. :90.00 Max. :365 Max. :739.0 Max. :100.00
## NA's :9276
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.:10.000 1st Qu.: 9.000 1st Qu.:10.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.645 Mean : 9.438 Mean : 9.777
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :9299 NA's :9297 NA's :9335
## review_scores_communication review_scores_location review_scores_value
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.:10.000 1st Qu.: 9.000 1st Qu.: 9.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.765 Mean : 9.655 Mean : 9.483
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :9304 NA's :9340 NA's :9348
## requires_license instant_bookable cancellation_policy
## f:41439 f:23614 flexible :12890
## t: 1608 t:19433 moderate :11888
## strict : 63
## strict_14_with_grace_period:18008
## super_strict_30 : 9
## super_strict_60 : 189
##
## require_guest_profile_picture require_guest_phone_verification
## f:42255 f:41950
## t: 792 t: 1097
##
##
##
##
##
## calculated_host_listings_count reviews_per_month
## Min. : 1.000 Min. : 0.010
## 1st Qu.: 1.000 1st Qu.: 0.390
## Median : 2.000 Median : 1.180
## Mean : 5.834 Mean : 1.898
## 3rd Qu.: 5.000 3rd Qu.: 2.860
## Max. :152.000 Max. :17.840
## NA's :8720
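Negative column indices such as `-22` depend on the current column order. According to the `str()` output, position 22 holds `maximum_nights`, while `amenities` sits at position 15. A minimal sketch of dropping a column by name instead is shown below, on a made-up toy data frame.

```r
# Toy data frame with the same kind of layout (values are made up).
toy <- data.frame(
  price          = c(122, 168, 79),
  amenities      = c("{TV,Wifi}", "{}", "{Kitchen}"),
  maximum_nights = c(730, 14, 366),
  stringsAsFactors = FALSE
)

# Dropping by position depends on the current column order.
names(toy[, -3])

# Dropping by name keeps working if columns are added, removed, or reordered.
no_amenities <- toy[, setdiff(names(toy), "amenities")]
names(no_amenities)
summary(no_amenities)
```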
# For categorical variables we will replace NAs and blanks with the mode.
# For numerical variables we will replace NAs and blanks with the median.
# For each variable in which we impute values in place of NAs and blanks,
# we will also create a corresponding flag variable that records
# whether each value is imputed or actual.
# Knowing whether a value is imputed or actual can sometimes
# improve the predictive ability of the model.
LA_Listings_Cleaned$flag_host_response_rate <-
ifelse(is.na(LA_Listings_Cleaned$host_response_rate) |
LA_Listings_Cleaned$host_response_rate=='' , 1,0)
LA_Listings_Cleaned$host_response_rate <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$host_response_rate) |
LA_Listings_Cleaned$host_response_rate=='' ,
median(LA_Listings_Cleaned$host_response_rate,na.rm = TRUE),
as.character(LA_Listings_Cleaned$host_response_rate)))
LA_Listings_Cleaned$flag_host_response_time <-
ifelse(LA_Listings_Cleaned$host_response_time=='N/A' |
LA_Listings_Cleaned$host_response_time=='' , 1,0)
LA_Listings_Cleaned$host_response_time <- as.factor(
ifelse(LA_Listings_Cleaned$host_response_time=='N/A' |
LA_Listings_Cleaned$host_response_time=='' , 'within an hour',
as.character(LA_Listings_Cleaned$host_response_time)))
LA_Listings_Cleaned$flag_host_is_superhost <-
ifelse(LA_Listings_Cleaned$host_is_superhost=='N/A' |
LA_Listings_Cleaned$host_is_superhost=='' , 1,0)
LA_Listings_Cleaned$host_is_superhost <- as.factor(
ifelse(LA_Listings_Cleaned$host_is_superhost=='N/A' |
LA_Listings_Cleaned$host_is_superhost=='' , 'f',
as.character(LA_Listings_Cleaned$host_is_superhost)))
LA_Listings_Cleaned$flag_host_listings_count <-
ifelse(is.na(LA_Listings_Cleaned$host_listings_count) |
LA_Listings_Cleaned$host_listings_count=='' , 1,0)
LA_Listings_Cleaned$host_listings_count <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$host_listings_count) |
LA_Listings_Cleaned$host_listings_count=='' ,
median(LA_Listings_Cleaned$host_listings_count,na.rm = TRUE),
as.character(LA_Listings_Cleaned$host_listings_count)))
LA_Listings_Cleaned$flag_host_has_profile_pic <-
ifelse(LA_Listings_Cleaned$host_has_profile_pic=='N/A' |
LA_Listings_Cleaned$host_has_profile_pic=='' , 1,0)
LA_Listings_Cleaned$host_has_profile_pic <- as.factor(
ifelse(LA_Listings_Cleaned$host_has_profile_pic=='N/A' |
LA_Listings_Cleaned$host_has_profile_pic=='' , 't',
as.character(LA_Listings_Cleaned$host_has_profile_pic)))
# The mode of host_identity_verified is 'f' (22651 'f' vs. 20393 't' in the
# summary above), so the 3 blank entries are imputed with 'f'
LA_Listings_Cleaned$flag_host_identity_verified <-
ifelse(LA_Listings_Cleaned$host_identity_verified=='N/A' |
LA_Listings_Cleaned$host_identity_verified=='' , 1,0)
LA_Listings_Cleaned$host_identity_verified <- as.factor(
ifelse(LA_Listings_Cleaned$host_identity_verified=='N/A' |
LA_Listings_Cleaned$host_identity_verified=='' , 'f',
as.character(LA_Listings_Cleaned$host_identity_verified)))
LA_Listings_Cleaned$flag_bathrooms <-
ifelse(is.na(LA_Listings_Cleaned$bathrooms) |
LA_Listings_Cleaned$bathrooms=='' , 1,0)
LA_Listings_Cleaned$bathrooms <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$bathrooms) |
LA_Listings_Cleaned$bathrooms=='' ,
median(LA_Listings_Cleaned$bathrooms,na.rm = TRUE),
as.character(LA_Listings_Cleaned$bathrooms)))
LA_Listings_Cleaned$flag_bedrooms <-
ifelse(is.na(LA_Listings_Cleaned$bedrooms) |
LA_Listings_Cleaned$bedrooms=='' , 1,0)
LA_Listings_Cleaned$bedrooms <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$bedrooms) |
LA_Listings_Cleaned$bedrooms=='' ,
median(LA_Listings_Cleaned$bedrooms,na.rm = TRUE),
as.character(LA_Listings_Cleaned$bedrooms)))
LA_Listings_Cleaned$flag_beds <-
ifelse(is.na(LA_Listings_Cleaned$beds) |
LA_Listings_Cleaned$beds=='' , 1,0)
LA_Listings_Cleaned$beds <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$beds) |
LA_Listings_Cleaned$beds=='' ,
median(LA_Listings_Cleaned$beds,na.rm = TRUE),
as.character(LA_Listings_Cleaned$beds)))
LA_Listings_Cleaned$flag_security_deposit <-
ifelse(is.na(LA_Listings_Cleaned$security_deposit) |
LA_Listings_Cleaned$security_deposit=='' , 1,0)
LA_Listings_Cleaned$security_deposit <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$security_deposit) |
LA_Listings_Cleaned$security_deposit=='' ,
median(LA_Listings_Cleaned$security_deposit,na.rm = TRUE),
as.character(LA_Listings_Cleaned$security_deposit)))
LA_Listings_Cleaned$flag_cleaning_fee <-
ifelse(is.na(LA_Listings_Cleaned$cleaning_fee) |
LA_Listings_Cleaned$cleaning_fee=='' , 1,0)
LA_Listings_Cleaned$cleaning_fee <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$cleaning_fee) |
LA_Listings_Cleaned$cleaning_fee=='' ,
median(LA_Listings_Cleaned$cleaning_fee,na.rm = TRUE),
as.character(LA_Listings_Cleaned$cleaning_fee)))
LA_Listings_Cleaned$flag_review_scores_rating <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_rating) |
LA_Listings_Cleaned$review_scores_rating=='' , 1,0)
LA_Listings_Cleaned$review_scores_rating <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_rating) |
LA_Listings_Cleaned$review_scores_rating=='' ,
median(LA_Listings_Cleaned$review_scores_rating,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_rating)))
LA_Listings_Cleaned$flag_review_scores_accuracy <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_accuracy) |
LA_Listings_Cleaned$review_scores_accuracy=='' , 1,0)
LA_Listings_Cleaned$review_scores_accuracy <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_accuracy) |
LA_Listings_Cleaned$review_scores_accuracy=='' ,
median(LA_Listings_Cleaned$review_scores_accuracy,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_accuracy)))
LA_Listings_Cleaned$flag_review_scores_cleanliness <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_cleanliness) |
LA_Listings_Cleaned$review_scores_cleanliness=='' , 1,0)
LA_Listings_Cleaned$review_scores_cleanliness <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_cleanliness) |
LA_Listings_Cleaned$review_scores_cleanliness=='' ,
median(LA_Listings_Cleaned$review_scores_cleanliness,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_cleanliness)))
LA_Listings_Cleaned$flag_review_scores_checkin <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_checkin) |
LA_Listings_Cleaned$review_scores_checkin=='' , 1,0)
LA_Listings_Cleaned$review_scores_checkin <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_checkin) |
LA_Listings_Cleaned$review_scores_checkin=='' ,
median(LA_Listings_Cleaned$review_scores_checkin,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_checkin)))
LA_Listings_Cleaned$flag_review_scores_communication <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_communication) |
LA_Listings_Cleaned$review_scores_communication=='' , 1,0)
LA_Listings_Cleaned$review_scores_communication <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_communication) |
LA_Listings_Cleaned$review_scores_communication=='' ,
median(LA_Listings_Cleaned$review_scores_communication,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_communication)))
LA_Listings_Cleaned$flag_review_scores_location <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_location) |
LA_Listings_Cleaned$review_scores_location=='' , 1,0)
LA_Listings_Cleaned$review_scores_location <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_location) |
LA_Listings_Cleaned$review_scores_location=='' ,
median(LA_Listings_Cleaned$review_scores_location,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_location)))
LA_Listings_Cleaned$flag_review_scores_value <-
ifelse(is.na(LA_Listings_Cleaned$review_scores_value) |
LA_Listings_Cleaned$review_scores_value=='' , 1,0)
LA_Listings_Cleaned$review_scores_value <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$review_scores_value) |
LA_Listings_Cleaned$review_scores_value=='' ,
median(LA_Listings_Cleaned$review_scores_value,na.rm = TRUE),
as.character(LA_Listings_Cleaned$review_scores_value)))
LA_Listings_Cleaned$flag_reviews_per_month <-
ifelse(is.na(LA_Listings_Cleaned$reviews_per_month) |
LA_Listings_Cleaned$reviews_per_month=='' , 1,0)
LA_Listings_Cleaned$reviews_per_month <- as.numeric(
ifelse(is.na(LA_Listings_Cleaned$reviews_per_month) |
LA_Listings_Cleaned$reviews_per_month=='' ,
median(LA_Listings_Cleaned$reviews_per_month,na.rm = TRUE),
as.character(LA_Listings_Cleaned$reviews_per_month)))
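The flag-then-impute pattern repeated above for each numeric column could be factored into a small helper. The following is a minimal sketch rather than the code used in this report; `flag_and_impute_median` and the toy data frame are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: adds a 0/1 missingness flag for each column,
# then replaces missing values with the column median.
flag_and_impute_median <- function(df, cols) {
  for (col in cols) {
    # Empty strings become NA when coerced to numeric
    x <- suppressWarnings(as.numeric(as.character(df[[col]])))
    df[[paste0("flag_", col)]] <- ifelse(is.na(x), 1, 0)
    x[is.na(x)] <- median(x, na.rm = TRUE)
    df[[col]] <- x
  }
  df
}

# Toy data standing in for the listings table
toy <- data.frame(bathrooms = c(1, NA, 2, 3), beds = c(NA, 1, 1, 4))
toy <- flag_and_impute_median(toy, c("bathrooms", "beds"))
print(toy)
```

As in the code above, the flag is computed before the gaps are filled, and the median is taken over the observed values only.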
Feature Engineering
# Creating some meaningful features from less important features
LA_Listings_Cleaned$amenities_count <-
str_count(LA_Listings_Cleaned$amenities, ",")+1
# Removing actual amenities variable
LA_Listings_Cleaned$amenities <- NULL
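Counting commas and adding one assumes every listing has at least one amenity. The sketch below illustrates the edge case on toy strings; the brace-wrapped format is an assumption about the raw `amenities` field, and `count_commas` is a hypothetical base-R stand-in for `str_count(x, ",")`.

```r
# Base-R stand-in for stringr::str_count(x, ","): number of commas per string
count_commas <- function(x) {
  lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
}

# Assumed format of the raw field: a brace-wrapped, comma-separated list
amenities <- c('{TV,Wifi,"Air conditioning"}', '{Wifi}', '{}')

naive_count <- count_commas(amenities) + 1   # rule used above
# Treat an empty list as zero amenities instead of one
safe_count <- ifelse(amenities %in% c("{}", ""), 0, naive_count)

print(data.frame(amenities, naive_count, safe_count))
```

Under this assumed format, an empty list is counted as one amenity by the simple rule, so a guard of this kind may be worth considering if empty lists occur in the data.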
Dummy Coding Categorical Variables
# host_response_time variable
LA_Listings_Cleaned$host_response_within_few_days_or_more <-
ifelse(LA_Listings_Cleaned$host_response_time=='a few days or more',1,0)
LA_Listings_Cleaned$host_response_within_a_days <-
ifelse(LA_Listings_Cleaned$host_response_time=='within a day',1,0)
LA_Listings_Cleaned$host_response_within_few_hours <-
ifelse(LA_Listings_Cleaned$host_response_time=='within a few hours',1,0)
LA_Listings_Cleaned$host_response_within_an_hour <-
ifelse(LA_Listings_Cleaned$host_response_time=='within an hour',1,0)
# Removing actual variable
LA_Listings_Cleaned$host_response_time <- NULL
# host_has_profile_pic variable
LA_Listings_Cleaned$host_has_profile_pic<-
ifelse(LA_Listings_Cleaned$host_has_profile_pic == 't',1,0)
# host_identity_verified variable
LA_Listings_Cleaned$host_identity_verified<-
ifelse(LA_Listings_Cleaned$host_identity_verified == 'f',0,1)
# is_location_exact variable
LA_Listings_Cleaned$is_location_exact<-
ifelse(LA_Listings_Cleaned$is_location_exact == 'f',0,1)
# requires_license variable
LA_Listings_Cleaned$requires_license<-
ifelse(LA_Listings_Cleaned$requires_license == 'f',0,1)
# instant_bookable variable
LA_Listings_Cleaned$instant_bookable<-
ifelse(LA_Listings_Cleaned$instant_bookable == 'f',0,1)
# require_guest_profile_picture variable
LA_Listings_Cleaned$require_guest_profile_picture<-
ifelse(LA_Listings_Cleaned$require_guest_profile_picture == 'f',0,1)
# require_guest_phone_verification variable
LA_Listings_Cleaned$require_guest_phone_verification<-
ifelse(LA_Listings_Cleaned$require_guest_phone_verification == 'f',0,1)
# host_is_superhost variable
LA_Listings_Cleaned$host_is_superhost<-
ifelse(LA_Listings_Cleaned$host_is_superhost == 'f',0,1)
Bucketing Dummified Features
# cancellation_policy variable
summary(LA_Listings_Cleaned$cancellation_policy)
## flexible moderate
## 12890 11888
## strict strict_14_with_grace_period
## 63 18008
## super_strict_30 super_strict_60
## 9 189
# Bucketing strict, strict_14_with_grace_period, super_strict_30 and
# super_strict_60 under a single category of "strict"
LA_Listings_Cleaned$cancellation_policy_strict <-
ifelse(LA_Listings_Cleaned$cancellation_policy == 'strict' |
LA_Listings_Cleaned$cancellation_policy == 'strict_14_with_grace_period' |
LA_Listings_Cleaned$cancellation_policy == 'super_strict_30' |
LA_Listings_Cleaned$cancellation_policy == 'super_strict_60',1,0)
LA_Listings_Cleaned$cancellation_policy_flexible <-
ifelse(LA_Listings_Cleaned$cancellation_policy == 'flexible',1,0)
LA_Listings_Cleaned$cancellation_policy_moderate <-
ifelse(LA_Listings_Cleaned$cancellation_policy == 'moderate',1,0)
# Removing actual variable
LA_Listings_Cleaned$cancellation_policy <- NULL
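The same bucketing can be written more compactly with `%in%`. This is a sketch on a toy factor, not a replacement for the code above; `policy` and `strict_levels` are hypothetical names.

```r
# Toy factor standing in for LA_Listings_Cleaned$cancellation_policy
policy <- factor(c("flexible", "moderate", "strict",
                   "strict_14_with_grace_period", "super_strict_30",
                   "super_strict_60"))

strict_levels <- c("strict", "strict_14_with_grace_period",
                   "super_strict_30", "super_strict_60")

# %in% replaces the chain of == comparisons joined by |
cancellation_policy_strict   <- as.integer(policy %in% strict_levels)
cancellation_policy_flexible <- as.integer(policy == "flexible")
cancellation_policy_moderate <- as.integer(policy == "moderate")

print(cbind(cancellation_policy_strict,
            cancellation_policy_flexible,
            cancellation_policy_moderate))
```

Because the three indicators always sum to 1 for each listing, one of them is redundant for models that include an intercept; tree-based models are not affected by this.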
# property_type variable
summary(LA_Listings_Cleaned$property_type)
## Aparthotel Apartment Barn
## 38 16187 7
## Bed and breakfast Boat Boutique hotel
## 186 34 132
## Bungalow Bus Cabin
## 1269 4 89
## Camper/RV Campsite Casa particular (Cuba)
## 182 8 2
## Castle Cave Chalet
## 10 1 15
## Condominium Cottage Dome house
## 2406 176 4
## Dorm Earth house Farm stay
## 2 11 30
## Guest suite Guesthouse Hostel
## 1324 2202 341
## Hotel House Houseboat
## 40 14510 3
## Hut Island Lighthouse
## 5 2 1
## Loft Minsu (Taiwan) Other
## 1006 2 169
## Plane Resort Serviced apartment
## 1 2 356
## Tent Tiny house Tipi
## 25 50 6
## Townhouse Train Treehouse
## 1368 1 12
## Villa Yurt
## 814 14
# Bucketing property type into 3 categories named House, Apartment and Other
LA_Listings_Cleaned$property_type <- as.factor(
ifelse(LA_Listings_Cleaned$property_type == 'House','House',
ifelse(LA_Listings_Cleaned$property_type == 'Apartment','Apartment',
'Other'))
)
# Dummification of property type
LA_Listings_Cleaned$property_type_house <-
ifelse(LA_Listings_Cleaned$property_type == 'House',1,0)
LA_Listings_Cleaned$property_type_apartment <-
ifelse(LA_Listings_Cleaned$property_type == 'Apartment',1,0)
LA_Listings_Cleaned$property_type_other <-
ifelse(LA_Listings_Cleaned$property_type == 'Other',1,0)
# Removing actual variable
LA_Listings_Cleaned$property_type <- NULL
### room type variable
summary(LA_Listings_Cleaned$room_type)
## Entire home/apt Private room Shared room
## 26835 14261 1951
LA_Listings_Cleaned$room_type_private <-
ifelse(LA_Listings_Cleaned$room_type == 'Private room',1,0)
LA_Listings_Cleaned$room_type_shared <-
ifelse(LA_Listings_Cleaned$room_type == 'Shared room',1,0)
LA_Listings_Cleaned$room_type_entire_home <-
ifelse(LA_Listings_Cleaned$room_type == 'Entire home/apt',1,0)
# Removing actual column
LA_Listings_Cleaned$room_type <- NULL
### bed type variable
summary(LA_Listings_Cleaned$bed_type)
## Airbed Couch Futon Pull-out Sofa Real Bed
## 131 82 232 157 42445
LA_Listings_Cleaned$bed_type <-
ifelse(LA_Listings_Cleaned$bed_type == 'Real Bed' ,1,0)
#str(LA_Listings_Cleaned)
## 0% 25% 50% 75% 100%
## 0 1 2 5 803
## 0% 25% 50% 75% 100%
## 1 2 3 5 40
## 0% 25% 50% 75% 100%
## 0 1 1 2 22
## 0% 25% 50% 75% 100%
## 0 1 1 2 50
## 0% 25% 50% 75% 100%
## 0 1 1 2 50
## 0% 25% 50% 75% 100%
## 0 69 105 179 25000
## 0% 25% 50% 75% 100%
## 0 100 200 300 5100
## 0% 25% 50% 75% 100%
## 0 35 60 100 1500
## 0% 25% 50% 75% 100%
## 1 1 1 2 16
## 0% 25% 50% 75% 100%
## 0 0 10 20 300
## 0% 25% 50% 75% 100%
## 1 1 2 3 3000
## 0% 25% 50% 75% 100%
## 0 1 7 32 739
## 0% 25% 50% 75% 100%
## 20 94 97 99 100
## 0% 25% 50% 75% 100%
## 2 10 10 10 10
## 0% 25% 50% 75% 100%
## 2 9 10 10 10
## 0% 25% 50% 75% 100%
## 2 10 10 10 10
## 0% 25% 50% 75% 100%
## 2 10 10 10 10
## 0% 25% 50% 75% 100%
## 2 10 10 10 10
## 0% 25% 50% 75% 100%
## 2 9 10 10 10
## 0% 25% 50% 75% 100%
## 0.01 0.56 1.18 2.31 17.84
## 0% 25% 50% 75% 100%
## 1 18 24 33 112
## host_is_superhost avg_price
## 1 0 100
## 2 1 110
## host_listings_count avg_price
## 1 0 100.0
## 2 1 104.0
## 3 2 100.0
## 4 3 100.0
## 5 4 99.5
## 6 5 100.0
## 7 6 115.0
## 8 7 109.0
## 9 8 105.0
## 10 9 115.0
## 11 10 130.0
## 12 11 110.0
## 13 12 110.0
## 14 13 125.0
## 15 14 129.0
## 16 15 115.0
## 17 16 90.0
## 18 17 135.5
## 19 18 100.0
## 20 19 135.0
## 21 20 118.5
## 22 21 89.0
## 23 22 79.0
## 24 23 72.5
## 25 24 63.0
## 26 25 115.0
## 27 26 75.0
## 28 27 59.0
## 29 28 159.0
## 30 29 95.0
## 31 30 115.0
## 32 31 179.0
## 33 32 25.0
## 34 33 139.5
## 35 34 69.0
## 36 35 179.0
## 37 36 52.0
## 38 37 130.0
## 39 38 39.0
## 40 39 139.0
## 41 40 299.0
## 42 41 110.0
## 43 43 199.0
## 44 45 39.0
## 45 46 25.0
## 46 47 72.0
## 47 48 500.0
## 48 49 220.0
## 49 50 179.0
## 50 51 200.0
## 51 53 84.0
## 52 56 231.0
## 53 58 175.0
## 54 59 199.0
## 55 60 230.5
## 56 61 250.0
## 57 62 214.0
## 58 63 196.0
## 59 66 204.5
## 60 67 575.0
## 61 69 80.0
## 62 70 299.0
## 63 76 23.5
## 64 89 299.0
## 65 90 109.0
## 66 95 249.0
## 67 98 299.0
## 68 100 35.0
## 69 115 90.0
## 70 116 196.0
## 71 117 2233.5
## 72 148 300.0
## 73 152 4562.5
## 74 165 192.0
## 75 185 499.0
## 76 209 1100.0
## 77 218 400.0
## 78 223 560.0
## 79 272 323.0
## 80 280 454.0
## 81 343 128.5
## 82 388 645.0
## 83 447 269.0
## 84 483 259.0
## 85 520 200.0
## 86 571 162.0
## 87 664 100.0
## 88 803 249.0
## host_has_profile_pic avg_price
## 1 0 130
## 2 1 105
## host_identity_verified avg_price
## 1 0 100
## 2 1 109
## accommodates avg_price
## 1 1 50.0
## 2 2 80.0
## 3 3 100.0
## 4 4 131.5
## 5 5 160.0
## 6 6 200.0
## 7 7 225.0
## 8 8 299.0
## 9 9 265.0
## 10 10 450.0
## 11 11 280.0
## 12 12 550.0
## 13 13 329.0
## 14 14 566.0
## 15 15 262.5
## 16 16 499.0
## 17 20 3995.0
## 18 40 340.0
## cancelation_policy median_price
## 1 Strict 125
## 2 Moderate 100
## 3 Flexible 90
## property_type median_price
## 1 House 100
## 2 Apartment 105
## 3 Other 110
## room_type median_price
## 1 private 65
## 2 shared 30
## 3 entire house 149
Data Transformations and Final Dataset
A significant share of the variables we explored were skewed in one direction or the other, which made data transformations necessary. Below is a breakdown of the transformed variables that will be included in the final dataset:
########## Data transformations for outlier treatment #############
LA_Listings_Cleaned$log_host_listings_count <- log(LA_Listings_Cleaned$host_listings_count+1)
LA_Listings_Cleaned$log_accommodate <- log(LA_Listings_Cleaned$accommodates+1)
LA_Listings_Cleaned$log_bathrooms <- log(LA_Listings_Cleaned$bathrooms+1)
LA_Listings_Cleaned$log_bedrooms <- log(LA_Listings_Cleaned$bedrooms+1)
LA_Listings_Cleaned$log_price <- log(LA_Listings_Cleaned$price+1)
LA_Listings_Cleaned$log_security_deposit <- log(LA_Listings_Cleaned$security_deposit+1)
LA_Listings_Cleaned$log_cleaning_fee <- log(LA_Listings_Cleaned$cleaning_fee+1)
LA_Listings_Cleaned$log_extra_people <- log(LA_Listings_Cleaned$extra_people+1)
LA_Listings_Cleaned$log_minimum_nights <- log(LA_Listings_Cleaned$minimum_nights+1)
LA_Listings_Cleaned$log_number_of_reviews <- log(LA_Listings_Cleaned$number_of_reviews+1)
LA_Listings_Cleaned$cuberoot_review_scores_rating <-LA_Listings_Cleaned$review_scores_rating^(1/3)
LA_Listings_Cleaned$log_reviews_per_month<- log(LA_Listings_Cleaned$reviews_per_month+1)
# Removing the original columns, since only their
# transformations are used going forward
LA_Listings_Cleaned$host_listings_count<- NULL
LA_Listings_Cleaned$accommodates<- NULL
LA_Listings_Cleaned$bathrooms<- NULL
LA_Listings_Cleaned$bedrooms<- NULL
LA_Listings_Cleaned$price<- NULL
LA_Listings_Cleaned$security_deposit<- NULL
LA_Listings_Cleaned$cleaning_fee<- NULL
LA_Listings_Cleaned$extra_people<- NULL
LA_Listings_Cleaned$minimum_nights<- NULL
LA_Listings_Cleaned$number_of_reviews<- NULL
LA_Listings_Cleaned$reviews_per_month<- NULL
LA_Listings_Cleaned$review_scores_rating <- NULL
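To make the effect of these transformations concrete, the sketch below applies them to toy values. The numbers are illustrative only; `log1p()` and `expm1()` are base R functions, and `log1p(x)` computes the same quantity as the `log(x + 1)` form used above.

```r
# Toy right-skewed values (illustrative, not taken from the dataset)
price <- c(0, 69, 105, 179, 25000)

log_price_manual <- log(price + 1)   # form used in this report
log_price_log1p  <- log1p(price)     # built-in equivalent

# Back-transform to the original scale when interpreting results
price_back <- expm1(log_price_log1p)

# Cube root, as applied to review_scores_rating (toy values again)
rating <- c(20, 94, 97, 99, 100)
cuberoot_rating <- rating^(1/3)

print(round(log_price_log1p, 3))
print(round(cuberoot_rating, 3))
```

Adding 1 before taking the logarithm keeps zero values (for example, a zero security deposit) defined, since `log(0)` is `-Inf`.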
write.csv(LA_Listings_Cleaned, 'LA_Listings_Cleaned_FINAL.csv')
LA_Listings_Training <- read.csv('LA_Listings_Cleaned_FINAL.csv')
LA_Listings_Training <- LA_Listings_Training[,-26:-44]
LA_Listings_Training <- LA_Listings_Training[,-1]
LA_Listings_Training <- LA_Listings_Training[,-3]
LA_Listings_Training <- LA_Listings_Training[,-8]
write.csv(LA_Listings_Training, 'LA_Listings_Cleaned_FINAL.csv')
LA_Listings_Training <- read.csv('LA_Listings_Cleaned_FINAL.csv')
LA_Listings_Training <- LA_Listings_Training[,-1]
str(LA_Listings_Training)
## 'data.frame': 43047 obs. of 48 variables:
## $ host_response_rate : int 100 100 100 100 100 100 100 100 100 100 ...
## $ host_is_superhost : int 0 0 1 0 0 0 0 1 1 0 ...
## $ host_identity_verified : int 1 1 1 0 1 1 1 1 1 0 ...
## $ is_location_exact : int 1 1 1 1 1 1 1 1 1 0 ...
## $ beds : int 3 3 1 1 1 2 1 1 1 1 ...
## $ bed_type : int 1 1 1 0 1 1 1 1 1 1 ...
## $ guests_included : int 3 6 1 1 1 1 2 1 2 1 ...
## $ availability_30 : int 0 0 0 25 0 0 15 5 19 30 ...
## $ availability_60 : int 0 0 0 55 0 0 15 14 49 60 ...
## $ availability_90 : int 0 0 6 85 0 7 15 44 73 90 ...
## $ availability_365 : int 236 135 260 360 0 282 15 319 313 179 ...
## $ review_scores_accuracy : int 10 10 10 9 10 8 8 10 10 10 ...
## $ review_scores_cleanliness : int 10 10 10 9 10 8 8 9 9 10 ...
## $ review_scores_checkin : int 6 10 10 10 10 8 9 10 10 10 ...
## $ review_scores_communication : int 8 10 10 10 10 9 9 10 10 10 ...
## $ review_scores_location : int 10 10 10 10 10 9 9 9 9 10 ...
## $ review_scores_value : int 8 9 10 9 10 8 8 9 9 10 ...
## $ requires_license : int 0 0 0 1 0 0 0 0 0 0 ...
## $ instant_bookable : int 0 1 1 0 0 0 0 1 0 0 ...
## $ require_guest_profile_picture : int 1 0 0 0 0 0 0 0 0 0 ...
## $ require_guest_phone_verification : int 0 0 0 0 0 0 0 0 0 0 ...
## $ calculated_host_listings_count : int 1 1 2 2 1 3 5 2 1 2 ...
## $ amenities_count : int 32 41 43 12 20 24 39 57 12 13 ...
## $ host_response_within_few_days_or_more: int 0 0 0 0 0 0 0 0 0 0 ...
## $ host_response_within_a_days : int 0 1 0 0 0 0 0 0 0 0 ...
## $ host_response_within_few_hours : int 0 0 0 1 0 0 0 0 1 0 ...
## $ host_response_within_an_hour : int 1 0 1 0 1 1 1 1 0 1 ...
## $ cancellation_policy_strict : int 1 0 1 1 1 1 1 1 1 0 ...
## $ cancellation_policy_flexible : int 0 1 0 0 0 0 0 0 0 1 ...
## $ cancellation_policy_moderate : int 0 0 0 0 0 0 0 0 0 0 ...
## $ property_type_house : int 0 1 0 0 0 0 0 0 0 1 ...
## $ property_type_apartment : int 0 0 1 1 1 0 0 1 1 0 ...
## $ property_type_other : int 1 0 0 0 0 1 1 0 0 0 ...
## $ room_type_private : int 0 0 1 1 0 0 1 1 0 1 ...
## $ room_type_shared : int 0 0 0 0 0 0 0 0 0 0 ...
## $ room_type_entire_home : int 1 1 0 0 1 1 0 0 1 0 ...
## $ log_host_listings_count : num 0.693 0.693 1.099 1.099 0.693 ...
## $ log_accommodate : num 1.946 1.946 0.693 0.693 1.099 ...
## $ log_bathrooms : num 1.099 0.693 0.916 0.693 0.693 ...
## $ log_bedrooms : num 1.099 1.386 0.693 0.693 0.693 ...
## $ log_price : num 4.81 5.13 4.38 4.95 4.39 ...
## $ log_security_deposit : num 6.22 0 5.7 5.3 4.62 ...
## $ log_cleaning_fee : num 5.48 4.62 4.45 4.62 4.33 ...
## $ log_extra_people : num 3.26 0 0 0 3.26 ...
## $ log_minimum_nights : num 2.079 1.099 1.946 0.693 1.099 ...
## $ log_number_of_reviews : num 1.1 1.61 2.64 2.94 0 ...
## $ cuberoot_review_scores_rating : num 4.31 4.53 4.59 4.58 4.59 ...
## $ log_reviews_per_month : num 0.0198 0.1222 0.2151 0.1823 0.7793 ...
#summary(LA_Listings_Training)
remove('LA_Listings_Cleaned')
remove('Los_Angeles_Listings_subset')
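Dropping columns by position and stripping the index column after each `read.csv()` works, but it is sensitive to column order. The sketch below shows an alternative on toy data. Which columns the positional indices above remove is not recoverable from the text, so the `flag_` pattern used here is purely illustrative.

```r
# Toy frame standing in for the cleaned listings data
toy <- data.frame(host_is_superhost = c(0, 1, 1),
                  log_price = c(4.81, 5.13, 4.38),
                  flag_beds = c(0, 0, 1))

# Drop columns by name rather than by position, so the code keeps
# working if the column order changes
drop_cols <- grep("^flag_", names(toy), value = TRUE)
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# row.names = FALSE avoids the extra index column that otherwise has
# to be removed with [, -1] after reading the file back
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
round_trip <- read.csv(path)

print(names(round_trip))
```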
We built models with several algorithms to predict Superhost status. Each model was trained on roughly 85% of the data (36,171 records) and tested against the remainder (6,876 records) for each of the following algorithms:
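The split code itself is not shown in this report. A minimal sketch of a random 85/15 split, on toy data with the same number of rows, might look as follows. The seed, the toy columns, and the name `superhost_test` are assumptions, and the counts reported in the output differ slightly from an exact 85/15 cut, so the authors' procedure was presumably somewhat different.

```r
set.seed(123)  # arbitrary seed, chosen only to make the sketch reproducible

# Toy data with the same number of rows as the final dataset
n <- 43047
toy <- data.frame(id = seq_len(n),
                  host_is_superhost = rbinom(n, 1, 0.27))

# Draw 85% of the row indices for training; the rest form the test set
train_idx <- sample(n, size = floor(0.85 * n))
superhost_train <- toy[train_idx, ]
superhost_test  <- toy[-train_idx, ]

print(c(train = nrow(superhost_train), test = nrow(superhost_test)))
```

Sampling row indices without replacement guarantees that no listing appears in both sets.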
kNN Algorithm
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | predicted
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 4718 | 306 | 5024 |
## | 0.939 | 0.061 | 0.731 |
## | 0.770 | 0.407 | |
## | 0.686 | 0.045 | |
## -------------|-----------|-----------|-----------|
## 1 | 1406 | 446 | 1852 |
## | 0.759 | 0.241 | 0.269 |
## | 0.230 | 0.593 | |
## | 0.204 | 0.065 | |
## -------------|-----------|-----------|-----------|
## Column Total | 6124 | 752 | 6876 |
## | 0.891 | 0.109 | |
## -------------|-----------|-----------|-----------|
##
##
Naive-Bayes Algorithm
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | predicted
## actual | No | Yes | Row Total |
## -------------|-----------|-----------|-----------|
## No | 3867 | 94 | 3961 |
## | 0.976 | 0.024 | 0.576 |
## | 0.770 | 0.051 | |
## -------------|-----------|-----------|-----------|
## Yes | 1157 | 1758 | 2915 |
## | 0.397 | 0.603 | 0.424 |
## | 0.230 | 0.949 | |
## -------------|-----------|-----------|-----------|
## Column Total | 5024 | 1852 | 6876 |
## | 0.731 | 0.269 | |
## -------------|-----------|-----------|-----------|
##
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3867 94
## Yes 1157 1758
##
## Accuracy : 0.8181
## 95% CI : (0.8087, 0.8271)
## No Information Rate : 0.7307
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6087
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9492
## Specificity : 0.7697
## Pos Pred Value : 0.6031
## Neg Pred Value : 0.9763
## Prevalence : 0.2693
## Detection Rate : 0.2557
## Detection Prevalence : 0.4239
## Balanced Accuracy : 0.8595
##
## 'Positive' Class : Yes
##
C50 Algorithm
##
## Call:
## C5.0.default(x = superhost_train[, -2], y = superhost_train$host_is_superhost)
##
## Classification Tree
## Number of samples: 36171
## Number of predictors: 47
##
## Tree size: 197
##
## Non-standard options: attempt to group attributes
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | Predicted Superhosts
## Actual Superhosts | No | Yes | Row Total |
## ------------------|-----------|-----------|-----------|
## No | 4588 | 436 | 5024 |
## | 0.667 | 0.063 | |
## ------------------|-----------|-----------|-----------|
## Yes | 571 | 1281 | 1852 |
## | 0.083 | 0.186 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 5159 | 1717 | 6876 |
## ------------------|-----------|-----------|-----------|
##
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4588 571
## Yes 436 1281
##
## Accuracy : 0.8535
## 95% CI : (0.845, 0.8618)
## No Information Rate : 0.7307
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6191
##
## Mcnemar's Test P-Value : 2.414e-05
##
## Sensitivity : 0.6917
## Specificity : 0.9132
## Pos Pred Value : 0.7461
## Neg Pred Value : 0.8893
## Prevalence : 0.2693
## Detection Rate : 0.1863
## Detection Prevalence : 0.2497
## Balanced Accuracy : 0.8025
##
## 'Positive' Class : Yes
##
OneR Algorithm
##
## === Summary ===
##
## Correctly Classified Instances 28569 78.9832 %
## Incorrectly Classified Instances 7602 21.0168 %
## Kappa statistic 0.3537
## Mean absolute error 0.2102
## Root mean squared error 0.4584
## Relative absolute error 54.1038 %
## Root relative squared error 104.0237 %
## Total Number of Instances 36171
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 25249 1379 | a = No
## 6223 3320 | b = Yes
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | Predicted Superhosts
## Actual Superhosts | No | Yes | Row Total |
## ------------------|-----------|-----------|-----------|
## No | 4795 | 229 | 5024 |
## | 0.697 | 0.033 | |
## ------------------|-----------|-----------|-----------|
## Yes | 1178 | 674 | 1852 |
## | 0.171 | 0.098 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 5973 | 903 | 6876 |
## ------------------|-----------|-----------|-----------|
##
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4795 1178
## Yes 229 674
##
## Accuracy : 0.7954
## 95% CI : (0.7856, 0.8049)
## No Information Rate : 0.7307
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3798
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36393
## Specificity : 0.95442
## Pos Pred Value : 0.74640
## Neg Pred Value : 0.80278
## Prevalence : 0.26934
## Detection Rate : 0.09802
## Detection Prevalence : 0.13133
## Balanced Accuracy : 0.65917
##
## 'Positive' Class : Yes
##
RIPPER Algorithm
##
## === Summary ===
##
## Correctly Classified Instances 31044 85.8257 %
## Incorrectly Classified Instances 5127 14.1743 %
## Kappa statistic 0.6116
## Mean absolute error 0.2347
## Root mean squared error 0.3425
## Relative absolute error 60.413 %
## Root relative squared error 77.7264 %
## Total Number of Instances 36171
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 24963 1665 | a = No
## 3462 6081 | b = Yes
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | Predicted Superhosts
## Actual Superhosts | No | Yes | Row Total |
## ------------------|-----------|-----------|-----------|
## No | 4659 | 365 | 5024 |
## | 0.678 | 0.053 | |
## ------------------|-----------|-----------|-----------|
## Yes | 676 | 1176 | 1852 |
## | 0.098 | 0.171 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 5335 | 1541 | 6876 |
## ------------------|-----------|-----------|-----------|
##
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4659 676
## Yes 365 1176
##
## Accuracy : 0.8486
## 95% CI : (0.8399, 0.857)
## No Information Rate : 0.7307
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5938
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6350
## Specificity : 0.9273
## Pos Pred Value : 0.7631
## Neg Pred Value : 0.8733
## Prevalence : 0.2693
## Detection Rate : 0.1710
## Detection Prevalence : 0.2241
## Balanced Accuracy : 0.7812
##
## 'Positive' Class : Yes
##
Random Forest Algorithm
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 36171 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 72342 matrix numeric
## oob.times 36171 -none- numeric
## classes 2 -none- character
## importance 47 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 36171 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 6876
##
##
## | superhost_randomForest_predict
## superhost_subtest$host_is_superhost | No | Yes | Row Total |
## ------------------------------------|-----------|-----------|-----------|
## No | 5012 | 41 | 5053 |
## | 0.729 | 0.006 | |
## ------------------------------------|-----------|-----------|-----------|
## Yes | 114 | 1709 | 1823 |
## | 0.017 | 0.249 | |
## ------------------------------------|-----------|-----------|-----------|
## Column Total | 5126 | 1750 | 6876 |
## ------------------------------------|-----------|-----------|-----------|
##
##
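Based on the model evaluations, we determined that the ‘Random Forest’ algorithm outperformed all the other models tested: it classified 6,721 of the 6,876 test records correctly (about 97.7% accuracy), compared with 85.4% for C5.0, the next best model. Note that the Random Forest table reports 5,053 non-Superhosts in its test set rather than the 5,024 seen for the other models, so it appears to have been evaluated on a different partition.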
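The comparison can be reproduced from the test-set confusion matrices above. The sketch below copies the four cell counts for each model and derives accuracy and sensitivity from them; the data frame layout and column names are choices made for this illustration.

```r
# Test-set confusion counts copied from the tables above
# tn: non-Superhosts predicted "No"; tp: Superhosts predicted "Yes"
results <- data.frame(
  model = c("kNN", "Naive Bayes", "C5.0", "OneR", "RIPPER", "Random Forest"),
  tn = c(4718, 3867, 4588, 4795, 4659, 5012),
  fp = c( 306, 1157,  436,  229,  365,   41),
  fn = c(1406,   94,  571, 1178,  676,  114),
  tp = c( 446, 1758, 1281,  674, 1176, 1709)
)

total <- with(results, tn + fp + fn + tp)
results$accuracy    <- with(results, (tn + tp) / total)
results$sensitivity <- with(results, tp / (tp + fn))  # share of Superhosts found

print(results[order(-results$accuracy), c("model", "accuracy", "sensitivity")],
      digits = 3)
```

Accuracy alone can be misleading here because only about 27% of test listings are Superhosts, which is why sensitivity is reported alongside it.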
We revisited the hypothesis and then used the Random Forest model’s ‘importance’ measure to identify its strongest variables. The table below lists the mean decrease in Gini impurity (MeanDecreaseGini) for every feature in the dataset, in descending order of importance.
## feature MeanDecreaseGini
## 1 log_number_of_reviews 1902.849607
## 2 log_reviews_per_month 1403.107873
## 3 cuberoot_review_scores_rating 1151.949485
## 4 amenities_count 834.175902
## 5 log_host_listings_count 601.631644
## 6 calculated_host_listings_count 549.289563
## 7 availability_365 533.870086
## 8 log_price 522.865651
## 9 log_cleaning_fee 463.861478
## 10 availability_90 458.176886
## 11 availability_60 425.546695
## 12 availability_30 383.321351
## 13 log_security_deposit 362.576043
## 14 review_scores_cleanliness 335.175444
## 15 log_extra_people 315.250710
## 16 log_minimum_nights 292.749317
## 17 review_scores_accuracy 275.235448
## 18 review_scores_value 267.842379
## 19 log_accommodate 255.051576
## 20 beds 202.411203
## 21 host_response_rate 197.656280
## 22 guests_included 185.722499
## 23 log_bathrooms 158.628532
## 24 cancellation_policy_moderate 156.998636
## 25 log_bedrooms 152.516979
## 26 review_scores_communication 149.028336
## 27 host_identity_verified 119.109564
## 28 cancellation_policy_flexible 116.849558
## 29 review_scores_checkin 110.313383
## 30 instant_bookable 109.120240
## 31 review_scores_location 103.964534
## 32 property_type_apartment 102.375983
## 33 property_type_other 84.496325
## 34 property_type_house 84.093639
## 35 is_location_exact 82.458249
## 36 cancellation_policy_strict 80.394675
## 37 host_response_within_an_hour 77.265232
## 38 requires_license 68.270094
## 39 room_type_private 64.770366
## 40 host_response_within_few_hours 63.534209
## 41 room_type_entire_home 62.387486
## 42 host_response_within_a_days 47.737785
## 43 require_guest_phone_verification 25.057622
## 44 require_guest_profile_picture 20.313194
## 45 room_type_shared 19.571724
## 46 bed_type 15.360340
## 47 host_response_within_few_days_or_more 6.364909
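In the randomForest package, `importance()` returns a matrix with one row per predictor; for a classification forest fit without `importance = TRUE`, it has a single `MeanDecreaseGini` column. A ranked table like the one above can be built from such a matrix. The sketch below uses a hand-written stand-in matrix holding three values copied from the table, so it runs without a fitted model; `imp` and `ranked` are hypothetical names.

```r
# Stand-in for importance(fitted_forest): a one-column matrix with
# predictor names as row names (three values copied from the table above)
imp <- matrix(c(834.175902, 1902.849607, 1403.107873),
              ncol = 1,
              dimnames = list(c("amenities_count",
                                "log_number_of_reviews",
                                "log_reviews_per_month"),
                              "MeanDecreaseGini"))

ranked <- data.frame(feature = rownames(imp),
                     MeanDecreaseGini = imp[, "MeanDecreaseGini"],
                     row.names = NULL)
ranked <- ranked[order(-ranked$MeanDecreaseGini), ]
rownames(ranked) <- NULL

# Share of the summed decrease in Gini impurity (within these three rows only)
ranked$share <- ranked$MeanDecreaseGini / sum(ranked$MeanDecreaseGini)

print(ranked)
```

Mean decrease in Gini is a relative measure: it ranks predictors within one fitted forest and is not comparable across models or datasets.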
RECOMMENDATION
As shown above, the Random Forest algorithm was the best at predicting whether a host is a Superhost.
The best-performing model shows the following features are most important or ‘significant’ in determining the Superhost status of a host:
Based on these findings, we propose the following:
Loyal hosts who have been participating in the Superhost program should be rewarded for their dedication to Airbnb, and this would be a good way to incentivize hosts to be more available throughout the year.
FUTURE WORK
This study focused solely on the Los Angeles, CA market. Further work is therefore needed to strengthen the case for the proposal. The following are a few areas where this project could be expanded: