Introduction
In this project, I aim to examine a dataset of Airbnb listings in the city of San Francisco from September 2018 to determine which neighborhoods offer the best avaliability at the best price. I am also interested in looking at which affordable neighborhoods have a comparably high average when looking at the number of property reviews per month, and how room type breaks down between certain ideal neighborhoods. I hope to assist in combatting San Francisco’s housing crisis and helping to steer those in search of short-term rentals in a more informed direction.
Primary Data Manipulation
First, I will load the necessary packages.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(readr)
Next, I will import the dataset from Get the Data - Inside Airbnb for listings specific to San Francisco.
df <- read_csv("listings (1).csv",
col_types = cols(last_review = col_date(format = "%Y-%m-%d")))
head(df)
## # A tibble: 6 x 16
## id name host_id host_name neighbourhood_g… neighbourhood latitude
## <int> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 958 Brig… 1169 Holly <NA> Western Addi… 37.8
## 2 5858 Crea… 8904 Philip A… <NA> Bernal Heigh… 37.7
## 3 7918 A Fr… 21994 Aaron <NA> Haight Ashbu… 37.8
## 4 8142 Frie… 21994 Aaron <NA> Haight Ashbu… 37.8
## 5 8339 Hist… 24215 Rosy <NA> Western Addi… 37.8
## 6 8567 Love… 25601 Sandy <NA> Western Addi… 37.8
## # ... with 9 more variables: longitude <dbl>, room_type <chr>,
## # price <int>, minimum_nights <int>, number_of_reviews <int>,
## # last_review <date>, reviews_per_month <dbl>,
## # calculated_host_listings_count <int>, availability_365 <int>
After importing the data, now I would like to examine the column “neighbourhood” a bit more. I would like to know how many neighbourhoods the dataset outlines, as a sanity check.
df %>% summarize(distinct = n_distinct(neighbourhood))
## # A tibble: 1 x 1
## distinct
## <int>
## 1 36
Now that we know we are working with 36 neighbourhoods, I would like to group my data based on neighborhood, and find the average price per night in that neighborhood and the average availability (number of days avaliable out of 365 days in the year). I also filtered by data to only include listings that have 1 or more reviews, as I feel that the listings with zero reviews could skew my data due to new users potentially setting prices very differently than other hosts.
nbhood <- df %>% group_by(neighbourhood) %>% filter(number_of_reviews > 0) %>% summarize(count = n(), price = mean(price), avail = mean(availability_365))
Now, I would like to determine the neighborhood with the best pricing, so I would like to sort my price column in descending order.
nbhood %>% arrange(price)
## # A tibble: 36 x 4
## neighbourhood count price avail
## <chr> <int> <dbl> <dbl>
## 1 Presidio 1 105 350
## 2 Lakeshore 43 106. 217.
## 3 Bayview 102 115. 161.
## 4 Crocker Amazon 31 119. 187.
## 5 Excelsior 125 122. 161.
## 6 Ocean View 90 127. 150.
## 7 Visitacion Valley 58 137. 138.
## 8 Outer Sunset 231 143. 149.
## 9 Outer Mission 134 148. 165.
## 10 Downtown/Civic Center 337 157. 128.
## # ... with 26 more rows
This yields an interesting result - the neighborhood with the best nightly price from the listings in September of 2018 is the Presidio, but there is only one listing. Because of this, I will discredit this row ; I do not find this to be an accurate result when looking at best priced neighborhoods, and looking at a count of higher than 5 would yield more substantial results. The second best price was Lakeshore, with 43 listings and an average price of $106 per night. After, Bayview with 102 listings and $115 per night.
nbhood <- nbhood %>% filter(count > 5) %>% arrange(price)
nbhood
## # A tibble: 35 x 4
## neighbourhood count price avail
## <chr> <int> <dbl> <dbl>
## 1 Lakeshore 43 106. 217.
## 2 Bayview 102 115. 161.
## 3 Crocker Amazon 31 119. 187.
## 4 Excelsior 125 122. 161.
## 5 Ocean View 90 127. 150.
## 6 Visitacion Valley 58 137. 138.
## 7 Outer Sunset 231 143. 149.
## 8 Outer Mission 134 148. 165.
## 9 Downtown/Civic Center 337 157. 128.
## 10 Bernal Heights 335 174. 119.
## # ... with 25 more rows
Data Visualizations
For my first plot, I would like to generate a scatter plot with each dot representing a different neighborhood and the axes being price and avaliability, to reveal the general layout of the data. We see that both Lakeshore and Bayview appear as neighborhoods with high avaliability at a low price, as does Crocker Amazon and Excelsior. Nob Hill also stands out as a neighborhood with a high yearly avaliability.
nbhood %>% ggplot(mapping = aes(x = price, y = avail, color = price, label = neighbourhood)) + geom_point() + theme_minimal() + scale_color_gradient(low = "#0091ff", high = "#f0650e") + geom_text(aes(label = neighbourhood), hjust = -.1, vjust = .5)
I would also like to represent this phenomenon with a box plot to see how the listing prices vary within neighbourhoods. This plot shows us that the prices within neighborhoods are very spread out, but the scatter plot makes it much easier to see how average neighborhood nightly prices are distributed and the boxplot here is more difficult to view due to the large number of neighborhoods. I also filtered for nightly prices of less than $1,000 to remove certain outliers and make the plot easier to view, while still preserving the general spread of the data.
df %>% group_by(neighbourhood) %>% mutate(count = n()) %>% filter (count > 5) %>% filter(price < 1000) %>% ggplot(aes(neighbourhood, price)) + geom_boxplot(varwidth = T, fill = "plum")
In my third plot, after examining how the neighborhoods are distributed by price, I would like to examine the popularity (defined for my purposes as the average number of reviews per month for properties within a given neighborhood) of the listings in the neighborhoods that have a low nightly price (for me, defined as neighbourhoods with an average nightly price less than $200).
nbhood2 <- df %>% group_by(neighbourhood) %>% filter(number_of_reviews > 0) %>%summarize(count = n(), price = mean(price), avail = mean(availability_365), popularity = mean(reviews_per_month)) %>% filter(count > 5) %>% filter(price < 200) %>% filter(popularity > 0)
nbhood2
## # A tibble: 13 x 5
## neighbourhood count price avail popularity
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Bayview 102 115. 161. 2.78
## 2 Bernal Heights 335 174. 119. 2.01
## 3 Chinatown 64 175. 151. 0.552
## 4 Crocker Amazon 31 119. 187. 2.76
## 5 Excelsior 125 122. 161. 2.53
## 6 Glen Park 57 199. 126. 1.76
## 7 Inner Sunset 123 183. 136. 2.07
## 8 Lakeshore 43 106. 217. 2.14
## 9 Ocean View 90 127. 150. 3.14
## 10 Outer Richmond 109 180. 164. 2.97
## 11 Outer Sunset 231 143. 149. 3.56
## 12 Visitacion Valley 58 137. 138. 3.58
## 13 West of Twin Peaks 101 198. 172. 2.14
Now, I would like to represent my findings with a bar chart to look at the popularity of these different neighborhoods with nightly prices that are less that $200.
nbhood2 %>% ggplot(aes(reorder(neighbourhood, popularity), popularity)) + geom_col(aes(fill = popularity)) + scale_fill_gradient(low = "darkorange1", high = "dodgerblue") + coord_flip() + labs(x = "neighbourhood")
Lastly, now that I have identified several neighborhoods that have low nightly prices on average and high avaliability (Lakeshore, Bayview, Crocker Amazon, the Outer Mission and Excelsior, all under $200) and neighborhoods with high popularity and a low price (Visitacion Valley, Outer Sunset, Ocean View, all above an average of 3 reviews per month) I would like to examine how the room type is broken down in these six neighborhoods to get a clearer picture of which experiences are avaliable to travelers.
nbhood3 <- df %>% group_by(neighbourhood) %>% filter(neighbourhood == "Lakeshore" | neighbourhood == "Bayview" | neighbourhood == "Excelsior" | neighbourhood == "Visitacion Valley" | neighbourhood == "Outer Sunset" | neighbourhood == "Ocean View" | neighbourhood == "Crocker Amazon" | neighbourhood == "Outer Mission") %>% filter(number_of_reviews > 0)
nbhood3 %>% ggplot(aes(neighbourhood)) + geom_bar(aes(fill = room_type)) + scale_fill_manual(values = c("dodgerblue", "plum", "darkorange1"))
Findings and Conclusion
Overall, in my manupulation of this Airbnb data for the month of September in the city of San Francisco, I have determined that the most affordable and avaliable neighborhoods for short term rentals are currently Lakeshore, Crocker Amazon, Bayview,the Outer Sunset and Excelsior. The most popular (an average popularity of more than 3 bookings per month) affordable neighborhoods (an average nightly price of less than $200) are Visitacion Valley, the Outer Sunset, and Ocean View. Interestingly, these listings are primarily broken down between entire home/apartments and private rooms. This yields promising travel options, and I hope that others (and myself) will be able to use this information in future visits to San Francisco. I find that futher analysis of short term rental pricing and avaliability in different areas of the city would be useful and necessary, considering the current exorbitant prices in San Francisco and the frequent struggle to secure housing.
In my analysis and data visualization, I was able to stick fairly close to the plan I outlined for myself in my project proposal. One initial difference was that I planned on using the host_id category, but did not find this useful in my actual analysis and instead focused more on the other categories (neighbourhood, price, avalibility_365, and room type). I also added in an analysis of popularity of different listings by examining reviews_per_month, because I felt that my analysis was not complete without examining which listings were actually were affordable/avaliable AND frequented by travelers. I abadoned my hope of creating a map reflecting neighborhood price differences due to lack of correct data and difficulty, and did not find box plots to be the most useful in representing my data (though I did include one, as I feel that it shows how much price does differ within neighborhoods, which is an important consideration). I also wanted to employ the same idea behind the segmented drought plots, which I used in the form of a segmented bar chart to reflect the breakdown of room_type instead of showing which areas account for the most listings in SF, as there were too many neighborhoods to represent this well and I feel that examining room_type yields a more informed perspective given my focus. I tried to stick to my project proposal as much as I could, but all of my changes occured when I was actually testing what was the most effective and yielded the best visual representation in R, which on some level I do feel is expected.