San Francisco Airbnb Analysis

Introduction

In this project, I aim to examine a dataset of Airbnb listings in the city of San Francisco from September 2018 to determine which neighborhoods offer the best avaliability at the best price. I am also interested in looking at which affordable neighborhoods have a comparably high average when looking at the number of property reviews per month, and how room type breaks down between certain ideal neighborhoods. I hope to assist in combatting San Francisco’s housing crisis and helping to steer those in search of short-term rentals in a more informed direction.

Primary Data Manipulation

First, I will load the necessary packages.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)
library(readr)

Next, I will import the dataset from Get the Data - Inside Airbnb for listings specific to San Francisco.

df <- read_csv("listings (1).csv", 
    col_types = cols(last_review = col_date(format = "%Y-%m-%d")))
head(df)

## # A tibble: 6 x 16
##      id name  host_id host_name neighbourhood_g… neighbourhood latitude
##   <int> <chr>   <int> <chr>     <chr>            <chr>            <dbl>
## 1   958 Brig…    1169 Holly     <NA>             Western Addi…     37.8
## 2  5858 Crea…    8904 Philip A… <NA>             Bernal Heigh…     37.7
## 3  7918 A Fr…   21994 Aaron     <NA>             Haight Ashbu…     37.8
## 4  8142 Frie…   21994 Aaron     <NA>             Haight Ashbu…     37.8
## 5  8339 Hist…   24215 Rosy      <NA>             Western Addi…     37.8
## 6  8567 Love…   25601 Sandy     <NA>             Western Addi…     37.8
## # ... with 9 more variables: longitude <dbl>, room_type <chr>,
## #   price <int>, minimum_nights <int>, number_of_reviews <int>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <int>, availability_365 <int>

After importing the data, now I would like to examine the column “neighbourhood” a bit more. I would like to know how many neighbourhoods the dataset outlines, as a sanity check.

df %>% summarize(distinct = n_distinct(neighbourhood))

## # A tibble: 1 x 1
##   distinct
##      <int>
## 1       36

Now that we know we are working with 36 neighbourhoods, I would like to group my data based on neighborhood, and find the average price per night in that neighborhood and the average availability (number of days avaliable out of 365 days in the year). I also filtered by data to only include listings that have 1 or more reviews, as I feel that the listings with zero reviews could skew my data due to new users potentially setting prices very differently than other hosts.

nbhood <- df %>% group_by(neighbourhood) %>% filter(number_of_reviews > 0) %>% summarize(count = n(), price = mean(price), avail = mean(availability_365))

Now, I would like to determine the neighborhood with the best pricing, so I would like to sort my price column in descending order.

nbhood %>% arrange(price)

## # A tibble: 36 x 4
##    neighbourhood         count price avail
##    <chr>                 <int> <dbl> <dbl>
##  1 Presidio                  1  105   350 
##  2 Lakeshore                43  106.  217.
##  3 Bayview                 102  115.  161.
##  4 Crocker Amazon           31  119.  187.
##  5 Excelsior               125  122.  161.
##  6 Ocean View               90  127.  150.
##  7 Visitacion Valley        58  137.  138.
##  8 Outer Sunset            231  143.  149.
##  9 Outer Mission           134  148.  165.
## 10 Downtown/Civic Center   337  157.  128.
## # ... with 26 more rows

This yields an interesting result - the neighborhood with the best nightly price from the listings in September of 2018 is the Presidio, but there is only one listing. Because of this, I will discredit this row ; I do not find this to be an accurate result when looking at best priced neighborhoods, and looking at a count of higher than 5 would yield more substantial results. The second best price was Lakeshore, with 43 listings and an average price of $106 per night. After, Bayview with 102 listings and $115 per night.

nbhood <- nbhood %>% filter(count > 5) %>% arrange(price)
nbhood

## # A tibble: 35 x 4
##    neighbourhood         count price avail
##    <chr>                 <int> <dbl> <dbl>
##  1 Lakeshore                43  106.  217.
##  2 Bayview                 102  115.  161.
##  3 Crocker Amazon           31  119.  187.
##  4 Excelsior               125  122.  161.
##  5 Ocean View               90  127.  150.
##  6 Visitacion Valley        58  137.  138.
##  7 Outer Sunset            231  143.  149.
##  8 Outer Mission           134  148.  165.
##  9 Downtown/Civic Center   337  157.  128.
## 10 Bernal Heights          335  174.  119.
## # ... with 25 more rows

Data Visualizations

For my first plot, I would like to generate a scatter plot with each dot representing a different neighborhood and the axes being price and avaliability, to reveal the general layout of the data. We see that both Lakeshore and Bayview appear as neighborhoods with high avaliability at a low price, as does Crocker Amazon and Excelsior. Nob Hill also stands out as a neighborhood with a high yearly avaliability.

nbhood %>% ggplot(mapping = aes(x = price, y = avail, color = price, label = neighbourhood)) + geom_point() + theme_minimal() + scale_color_gradient(low = "#0091ff", high = "#f0650e") + geom_text(aes(label = neighbourhood), hjust = -.1, vjust = .5)

I would also like to represent this phenomenon with a box plot to see how the listing prices vary within neighbourhoods. This plot shows us that the prices within neighborhoods are very spread out, but the scatter plot makes it much easier to see how average neighborhood nightly prices are distributed and the boxplot here is more difficult to view due to the large number of neighborhoods. I also filtered for nightly prices of less than $1,000 to remove certain outliers and make the plot easier to view, while still preserving the general spread of the data.

df %>% group_by(neighbourhood) %>% mutate(count = n()) %>% filter (count > 5) %>% filter(price < 1000) %>% ggplot(aes(neighbourhood, price)) + geom_boxplot(varwidth = T, fill = "plum")

In my third plot, after examining how the neighborhoods are distributed by price, I would like to examine the popularity (defined for my purposes as the average number of reviews per month for properties within a given neighborhood) of the listings in the neighborhoods that have a low nightly price (for me, defined as neighbourhoods with an average nightly price less than $200).

nbhood2 <- df %>% group_by(neighbourhood) %>% filter(number_of_reviews > 0) %>%summarize(count = n(), price = mean(price), avail = mean(availability_365), popularity = mean(reviews_per_month)) %>% filter(count > 5) %>% filter(price < 200) %>% filter(popularity >  0)
nbhood2

## # A tibble: 13 x 5
##    neighbourhood      count price avail popularity
##    <chr>              <int> <dbl> <dbl>      <dbl>
##  1 Bayview              102  115.  161.      2.78 
##  2 Bernal Heights       335  174.  119.      2.01 
##  3 Chinatown             64  175.  151.      0.552
##  4 Crocker Amazon        31  119.  187.      2.76 
##  5 Excelsior            125  122.  161.      2.53 
##  6 Glen Park             57  199.  126.      1.76 
##  7 Inner Sunset         123  183.  136.      2.07 
##  8 Lakeshore             43  106.  217.      2.14 
##  9 Ocean View            90  127.  150.      3.14 
## 10 Outer Richmond       109  180.  164.      2.97 
## 11 Outer Sunset         231  143.  149.      3.56 
## 12 Visitacion Valley     58  137.  138.      3.58 
## 13 West of Twin Peaks   101  198.  172.      2.14

Now, I would like to represent my findings with a bar chart to look at the popularity of these different neighborhoods with nightly prices that are less that $200.

nbhood2 %>% ggplot(aes(reorder(neighbourhood, popularity), popularity)) + geom_col(aes(fill = popularity)) + scale_fill_gradient(low = "darkorange1", high = "dodgerblue") + coord_flip() + labs(x = "neighbourhood")

Lastly, now that I have identified several neighborhoods that have low nightly prices on average and high avaliability (Lakeshore, Bayview, Crocker Amazon, the Outer Mission and Excelsior, all under $200) and neighborhoods with high popularity and a low price (Visitacion Valley, Outer Sunset, Ocean View, all above an average of 3 reviews per month) I would like to examine how the room type is broken down in these six neighborhoods to get a clearer picture of which experiences are avaliable to travelers.

nbhood3 <- df %>% group_by(neighbourhood) %>% filter(neighbourhood == "Lakeshore" | neighbourhood == "Bayview" | neighbourhood == "Excelsior" | neighbourhood == "Visitacion Valley" | neighbourhood == "Outer Sunset" | neighbourhood == "Ocean View" | neighbourhood == "Crocker Amazon" | neighbourhood == "Outer Mission") %>% filter(number_of_reviews > 0)

nbhood3 %>% ggplot(aes(neighbourhood)) + geom_bar(aes(fill = room_type)) + scale_fill_manual(values = c("dodgerblue", "plum", "darkorange1"))

Findings and Conclusion

Overall, in my manupulation of this Airbnb data for the month of September in the city of San Francisco, I have determined that the most affordable and avaliable neighborhoods for short term rentals are currently Lakeshore, Crocker Amazon, Bayview,the Outer Sunset and Excelsior. The most popular (an average popularity of more than 3 bookings per month) affordable neighborhoods (an average nightly price of less than $200) are Visitacion Valley, the Outer Sunset, and Ocean View. Interestingly, these listings are primarily broken down between entire home/apartments and private rooms. This yields promising travel options, and I hope that others (and myself) will be able to use this information in future visits to San Francisco. I find that futher analysis of short term rental pricing and avaliability in different areas of the city would be useful and necessary, considering the current exorbitant prices in San Francisco and the frequent struggle to secure housing.

In my analysis and data visualization, I was able to stick fairly close to the plan I outlined for myself in my project proposal. One initial difference was that I planned on using the host_id category, but did not find this useful in my actual analysis and instead focused more on the other categories (neighbourhood, price, avalibility_365, and room type). I also added in an analysis of popularity of different listings by examining reviews_per_month, because I felt that my analysis was not complete without examining which listings were actually were affordable/avaliable AND frequented by travelers. I abadoned my hope of creating a map reflecting neighborhood price differences due to lack of correct data and difficulty, and did not find box plots to be the most useful in representing my data (though I did include one, as I feel that it shows how much price does differ within neighborhoods, which is an important consideration). I also wanted to employ the same idea behind the segmented drought plots, which I used in the form of a segmented bar chart to reflect the breakdown of room_type instead of showing which areas account for the most listings in SF, as there were too many neighborhoods to represent this well and I feel that examining room_type yields a more informed perspective given my focus. I tried to stick to my project proposal as much as I could, but all of my changes occured when I was actually testing what was the most effective and yielded the best visual representation in R, which on some level I do feel is expected.

San Francisco Airbnb Analysis - September 2018

Jenny Nova

10/22/2018

Introduction

Primary Data Manipulation

Data Visualizations

Findings and Conclusion