Cleaning the data is also a chance to understand it. “id” and “name” both identify the Airbnb room or house, and “host_id” and “host_name” both identify its owner. Because the values in the “name” and “host_name” columns are long and inconsistent, deleting them and keeping the corresponding ID columns makes the data set cleaner and easier to understand. As the IDs stand for rooms and people rather than quantities, they should be treated as categorical variables.
The remaining columns are as follows. “neighbourhood_group” and “neighbourhood” refer to the area in which the Airbnb room or house is located; “latitude” and “longitude” are the latitude and longitude of the listing. “room_type” gives the type of room, and “price” is the listing price. “minimum_nights” is the minimum number of nights a guest is required to stay. “number_of_reviews” is the total number of reviews online, “last_review” is the date on which the last review was posted, and “reviews_per_month” is the average number of reviews per month. “calculated_host_listings_count” is the total number of rooms and houses that the host has listed. “availability_365” is the number of days on which the room or house is available within one year (365 days). More information about the data set can be found here.
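The cleaning steps described above could look roughly like the following sketch. It assumes the dplyr package is installed and uses a small made-up tibble (`airbnb_toy`, a hypothetical name with invented values) in place of the real file. The conversion of “last_review” to a date is also an assumption, included only because that column is described as a date.

```r
library(dplyr)

# Hypothetical two-row stand-in for the raw Airbnb data (values are made up).
airbnb_toy <- tibble(
  id = c(1001, 1002),
  name = c("Cozy room near the park", "Sunny loft"),
  host_id = c(501, 502),
  host_name = c("Alex", "Sam"),
  room_type = c("Private room", "Entire home/apt"),
  price = c(80, 200),
  last_review = c("2019-05-21", NA)
)

airbnb_toy_clean <- airbnb_toy %>%
  # Drop the free-text columns and keep the ID columns that identify the same things.
  select(-name, -host_name) %>%
  mutate(
    # IDs label rooms and people, so store them as categorical variables (factors).
    id = as.factor(id),
    host_id = as.factor(host_id),
    # Assumption: the review date arrives as text and is converted to a Date.
    last_review = as.Date(last_review)
  )

print(airbnb_toy_clean)
```

Storing the IDs as factors keeps plotting and summary functions from treating them as numbers to be averaged or binned.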
This data set includes information about different species of pathogens, the number of cases of illness each one causes, and the cost of that illness. Cleaning it involves deleting the header, the footnote, and other rows that contain no values, and creating a new column with an abbreviation for each type of pathogen. These steps are necessary because they make the data set easier to analyze.
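pathogens_clean <- pathogens %>%
  # Drop header, footnote, and other rows with missing values
  na.omit() %>%
  # Give the columns short, descriptive names
  rename(
    "Species" = "Total cost of foodborne illness estimates for 15 leading foodborne pathogens",
    "Cases" = "...2",
    "Cost" = "...3"
  ) %>%
  # Add an abbreviation for each pathogen (logical TRUE, not the string "TRUE")
  mutate(Species_abbr = abbreviate(Species, 4, dot = TRUE, strict = TRUE)) %>%
  # Convert the text columns to numbers
  mutate(Cases = as.numeric(Cases), Cost = as.numeric(Cost)) %>%
  # Remove row 16, which is not a pathogen row
  slice(-16)
pathogens_clean %>% print(n = 10, width = Inf)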
# Analyze the "Room Type"
# A bar chart is chosen because 1) the variable is categorical and
# 2) it clearly displays the frequency of each category. Arguments such as
# "fill =" and "labs()" make the graphic easier to understand.
ggplot(NYC_Airbnb_clean, aes(room_type, fill = room_type)) +
  geom_bar() +
  scale_fill_discrete(name = "Room Type") +
  labs(x = "Room Type", y = "Count", title = "Room Type Frequency")
# Analyze the "Price"
# A histogram is chosen because 1) the variable is numerical and
# 2) it clearly displays the frequency, as it automatically counts the number
# of data points per bin. Functions such as "coord_cartesian()" and "labs()"
# customize the graphic and make it easier to understand.
# (The stray position = "dodge" argument to ggplot() had no effect and is removed.)
ggplot(NYC_Airbnb_clean, aes(price)) +
  geom_histogram() +
  coord_cartesian(ylim = c(0, 13000)) +
  scale_x_continuous(limits = range(NYC_Airbnb_clean$price), n.breaks = 15) +
  labs(x = "Price", y = "Count", title = "Price Frequency")
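To make "counting the number of data points per bin" concrete, the following base R sketch does the counting by hand. The prices and the bin width of 100 are made-up assumptions for illustration; `geom_histogram()` chooses its own bins (30 by default) unless told otherwise.

```r
# Hypothetical prices (made-up values, not taken from the data set).
toy_prices <- c(45, 60, 80, 95, 120, 150, 150, 210, 340, 480)

# Assumed bin edges every 100 units, chosen only for this illustration.
bin_edges <- seq(0, 500, by = 100)

# cut() assigns each price to a bin; table() counts the prices in each bin.
bin_counts <- table(cut(toy_prices, breaks = bin_edges, include.lowest = TRUE))
print(bin_counts)
```

Each bar of a histogram has the height of one of these counts, which is why no separate counting step is needed before plotting.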
# Analyze the "Availability"
# A histogram is chosen because 1) the variable is numerical and
# 2) it clearly displays the frequency, as it automatically counts the number
# of data points per bin.
ggplot(NYC_Airbnb_clean, aes(availability_365)) +
  geom_histogram() +
  scale_x_continuous(limits = range(NYC_Airbnb_clean$availability_365), n.breaks = 10) +
  coord_cartesian(ylim = c(0, 2000)) +
  labs(x = "Availability", y = "Count", title = "Availability Frequency")
# Analyze the "Reviews Per Month"
# A histogram is chosen because 1) the variable is numerical and
# 2) it clearly displays the frequency, as it automatically counts the number
# of data points per bin.
# na.rm is not an aesthetic, so it is passed to the geom and to range() instead of aes().
ggplot(NYC_Airbnb_clean, aes(reviews_per_month)) +
  geom_histogram(na.rm = TRUE) +
  scale_x_continuous(limits = range(NYC_Airbnb_clean$reviews_per_month, na.rm = TRUE), n.breaks = 10) +
  labs(x = "Reviews Per Month", y = "Count", title = "Reviews Per Month Frequency")
# Analyze the "Last Review Date"
# A frequency polygon ("freqpoly") and a histogram are chosen because
# 1) the variable is a date, which behaves like a quantitative variable, and
# 2) frequency polygons are described as
# ["histograms with lines."](https://dcl-data-vis.stanford.edu/distributions.html)
# The histogram is drawn first so that the red line stays visible on top of the bars.
# na.rm is not an aesthetic, so it is passed to the geoms and to range() instead of aes().
ggplot(NYC_Airbnb_clean, aes(last_review)) +
  geom_histogram(na.rm = TRUE) +
  geom_freqpoly(color = "red", na.rm = TRUE) +
  scale_x_date(limits = range(NYC_Airbnb_clean$last_review, na.rm = TRUE), date_breaks = "1 year", date_labels = "%Y") +
  labs(x = "Last Review Date", y = "Count", title = "Last Review Date Frequency")
4.2 “pathogens” data set
# Analyze the "Cost"
# A histogram is chosen because 1) the variable is numerical and
# 2) it clearly displays the frequency, as it automatically counts the number
# of data points per bin.
ggplot(pathogens_clean, aes(Cost)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::label_currency(suffix = "B", scale = 1e-9), n.breaks = 10) +
  labs(y = "Count", title = "Cost Frequency")
# Analyze the "Cases"
# A histogram is chosen because 1) the variable is numerical and
# 2) it clearly displays the frequency, as it automatically counts the number
# of data points per bin.
ggplot(pathogens_clean, aes(Cases)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::label_number(suffix = "M", scale = 1e-6), n.breaks = 10) +
  labs(y = "Count", title = "Cases Frequency")
5. Bivariate Visualizations
5.1 “NYC_Airbnb” data set
# The relationship between "Price" and "Number Of Reviews"
# "geom_point" and "geom_smooth" are chosen because 1) both variables are
# numerical and 2) combining the two functions makes the trend clear.
# The "facet_wrap" function brings a categorical variable into the analysis.
ggplot(NYC_Airbnb_clean, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ room_type, nrow = 3) +
  scale_color_discrete(name = "Room Type") +
  labs(x = "Price", y = "Number Of Reviews", title = "'Price' and 'Number Of Reviews'")
# Analyze the "Neighbourhood Group"
# "geom_col" is chosen because 1) both variables are categorical and
# 2) the "position = 'dodge'" argument makes the graphic easier to understand.
# Count the listings for each combination of neighbourhood group and room type
NYC_Airbnb_counts <- NYC_Airbnb_clean %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(n = n()) %>%
  ungroup()
ggplot(NYC_Airbnb_counts, aes(x = neighbourhood_group, y = n, fill = room_type)) +
  geom_col(position = "dodge") +
  scale_fill_discrete(name = "Room Type") +
  labs(x = "Neighbourhood Group", y = "Number", title = "Neighbourhood Group Frequency")
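As a side note on the counting step above, dplyr's `count()` gives the same table as the `group_by()`, `summarise()`, `ungroup()` chain in a single call. The sketch below compares the two on a small made-up tibble (`toy_listings`, a hypothetical name with invented rows). It is an alternative spelling, not a change to the analysis.

```r
library(dplyr)

# Hypothetical listings (made-up rows, not taken from the data set).
toy_listings <- tibble(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Manhattan"),
  room_type = c("Private room", "Private room", "Entire home/apt", "Private room", "Entire home/apt")
)

# Long form: group, count the rows in each group, then drop the grouping.
counts_long <- toy_listings %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(n = n(), .groups = "drop")

# Short form: count() groups, counts, and ungroups in one step.
counts_short <- toy_listings %>%
  count(neighbourhood_group, room_type)

print(counts_short)
```

Either result can be passed to `geom_col()` with `y = n`, because both name the count column `n`.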
# Analyze the relationship between "Room Type" and "Price"
# A jitter plot is chosen because the two variables are one categorical and one
# numerical variable, and it presents the numerical variable for each level of
# the categorical variable. With "geom_point" or a box plot, many values in the
# "price" column would overlap, which makes the distribution of the numerical
# values hard to read; "geom_jitter" solves this problem.
ggplot(data = NYC_Airbnb_clean, mapping = aes(x = room_type, y = price)) +
  geom_jitter(aes(color = room_type)) +
  scale_y_continuous(limits = range(NYC_Airbnb_clean$price), n.breaks = 15) +
  scale_color_discrete(name = "Room Type") +
  labs(x = "Room Type", y = "Price", title = "'Room Type' and 'Price'")
# Analyze the relationship between "Neighbourhood Group" and "Price"
# A jitter plot is chosen because the two variables are one categorical and one
# numerical variable, and it presents the numerical variable for each level of
# the categorical variable. With "geom_point" or a box plot, many values in the
# "price" column would overlap, which makes the distribution of the numerical
# values hard to read; "geom_jitter" solves this problem.
ggplot(data = NYC_Airbnb_clean, mapping = aes(x = neighbourhood_group, y = price)) +
  geom_jitter(aes(color = neighbourhood_group)) +
  scale_y_continuous(limits = range(NYC_Airbnb_clean$price), n.breaks = 15) +
  scale_color_discrete(name = "Neighbourhood Group") +
  labs(x = "Neighbourhood Group", y = "Price", title = "'Neighbourhood Group' and 'Price'")
# Analyze the relationship between "Room Type" and "Availability"
# A box plot is chosen because the two variables are one categorical and one
# numerical variable. The graphic shows one box per category, each summarizing
# the numerical variable for that category.
ggplot(data = NYC_Airbnb_clean, mapping = aes(x = room_type, y = availability_365)) +
  geom_boxplot() +
  labs(x = "Room Type", y = "Availability", title = "'Room Type' and 'Availability'")
# Analyze the relationship between "Neighbourhood Group" and "Availability"
# A box plot is chosen because the two variables are one categorical and one
# numerical variable. The graphic shows one box per category, each summarizing
# the numerical variable for that category.
ggplot(data = NYC_Airbnb_clean, mapping = aes(x = neighbourhood_group, y = availability_365)) +
  geom_boxplot() +
  labs(x = "Neighbourhood Group", y = "Availability", title = "'Neighbourhood Group' and 'Availability'")
# Analyze the relationship between "Last Review Date" and "Price"
# "geom_line" and "geom_point" are chosen because the two variables are a date
# and a numerical variable, and combining the two functions lets the graphic
# carry information about a third, categorical variable through the color of the points.
# na.rm is not an aesthetic, so it is passed to the geoms and to range() instead of aes().
ggplot(NYC_Airbnb_clean, aes(x = last_review, y = price)) +
  geom_line(color = "black", na.rm = TRUE) +
  geom_point(aes(color = room_type), na.rm = TRUE) +
  scale_x_date(limits = range(NYC_Airbnb_clean$last_review, na.rm = TRUE), date_breaks = "1 year", date_labels = "%Y") +
  scale_y_continuous(limits = range(NYC_Airbnb_clean$price), n.breaks = 15) +
  scale_color_discrete(name = "Room Type") +
  labs(x = "Last Review Date", y = "Price", title = "'Last Review Date' and 'Price'")
5.2 “pathogens” data set
# Analyze "Species" and "Cases"
# "geom_point" is chosen because 1) the two variables are one categorical and
# one numerical variable and 2) the function clearly shows the value of the
# numerical variable for each category.
ggplot(pathogens_clean, aes(x = Species_abbr, y = Cases)) +
  geom_point(aes(color = Species_abbr)) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::label_number(suffix = "M", scale = 1e-6), n.breaks = 10) +
  scale_color_discrete(name = "Species Abbreviation") +
  labs(x = "Species Abbreviation", title = "'Species' and 'Cases'")
# Analyze "Species" and "Cost"
# "geom_point" is chosen because 1) the two variables are one categorical and
# one numerical variable and 2) the function clearly shows the value of the
# numerical variable for each category.
ggplot(pathogens_clean, aes(x = Species_abbr, y = Cost)) +
  geom_point(aes(color = Species_abbr)) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::label_currency(suffix = "B", scale = 1e-9), n.breaks = 10) +
  scale_color_discrete(name = "Species Abbreviation") +
  labs(x = "Species Abbreviation", title = "'Species' and 'Cost'")