Visualizing Yelp User Check-ins with R

I came across a dataset released by Yelp for the Yelp Dataset Challenge, and I feel it’s fun to create an animated map to visualize the spatio-temporal patterns of user check-in records. Specifically, I want to make an animated map that visualizes the hourly check-in pattern within a day, which is kind of like the pulse of city reflected by Yelp users.

The dataset is a subset of Yelp’s data from the greater Phoenix, AZ metropolitan area, including business info, custom reviews, user check-in records, and user info. The dataset comes in four json files and can be download from here.

I wrote a script in R that parses the following information for each check-in record from the business.json and check-in.json:
– long;
– lat;
– hour;
– day of the week;
– number of check-ins within that hour;

And that gives me ~200,000 rows with ~800,000 check-in records in a csv file. The csv file (~10M) can be downloaded from this link.

To get started, we need to load the two R packages – ggmap (to retrieve the basemap as a static image) and animation (to save the animated map as a gif file)

#Install&load packages
install.packages("animation")
library("animation")

install.packages("ggmap")
library("ggmap")

Next, read in the csv file that parsed from the original json files.

#Would be much faster if you download and read it from local)
yelp <- read.csv("http://dl.dropboxusercontent.com/u/15662467/blogposts/yelp_checkins.csv", header = T)

#data preview
head(yelp)

Once have the data loaded, we need to the geographic extend of the data as a vector.

#Set the map extent to contain all check-in locations
extent <- c(min(yelp$long), min(yelp$lat), max(yelp$long), max(yelp$lat))

Now use the get_map() function from ggmap to query a static basemap that has the same geographic extend as the data, and then plot it as the basemap using ggmap() function.

Here, I chose the “watercolor” map style (type) from @Stamen as this map style doesn’t include any labels for regions or street names, which will help your readers focus on the trend of check-in patterns by removing all the text on basemap. And of course, I always enjoy the eye-candy design schemes from @Stamen

#Query the basemap from stamen and plot it
basemap <- get_map(location = extent, maptype = "watercolor", source = "stamen")
basemap <- ggmap(basemap, extent = "device")

#Draw the basemap
basemap

Okay, you got the basemap. What is left here is to plot the check-in points onto the basemap on an hourly basis. To plot the hourly check-in patterns programmatically (rather than manually plot the patterns 24 times for 24 hours) , we need some magic from the animation package. First, set up the output path for the gif file. I set it the same as my work dirctiory (getwd()).

ani.options(outdir = getwd())

Before the animation magic started, it’s a good idea to create a set of time labels to tell us which hour we are looking at before we get lost in the eternal animation loops.

#Construct the hourly labels
am <- paste(1:11, "AM", sep = "")
am <- append("12AM", am)
pm <- paste(1:11, "PM", sep = "")
pm <- append("12PM", pm)
hour_label <- append(am, pm)

Now, it’s the time for the animation package to do some magics. As a side note, the way that the animation package works is essentially the same as cel animation – you draw a bunch of images and play them sequentially at a fast pace. In R, you can use for/while loops to draw a set of graphics (frames), and the animation package will just convert them into a gif file in the order that they were being drawn.

#Create the GIF
saveGIF({
  #Set the first frame to the 0 (12am)
  hour<-0
  #Draw the check-in patterns for each hour using a while loop
  while(hour < 24) {
    #Subset the data for this hour
    this_hour <- yelp[yelp$time == hour, ]

    #Compose the map for this hour (basemap + points + time label)
    this_hour_map <- basemap +
    #Set long/lat and adjust the alpha value for each points to mitigate the overplotting problem
    geom_point(aes(x = long, y = lat, alpha = checkins/50),
    #Set the data to be plotted
    data = this_hour, col = "purple", size = 0.8) +
    #Plot the black shadow for time label
    geom_text(data = NULL, x = -111.45 + 0.003, y = 33.9 - 0.003,
              label = hour_label[hour + 1], color = "black",
              face = "bold", cex = 10) +
    #Plot the time label
    geom_text(data = NULL,x = -111.45, y = 33.9,
              label = hour_label[hour + 1], color = "#9A413B",
              face = "bold", cex = 10) +
    #Remove axis, ticks, etc.
    theme(axis.line=element_blank(), axis.text.x=element_blank(),
    axis.text.y=element_blank(), axis.ticks=element_blank(),
    axis.title.x=element_blank(), axis.title.y=element_blank(),
    panel.border = element_blank(),plot.background=element_blank(),
    legend.position="none")

    #Plot the map composed above
    plot(this_hour_map, xaxs = "i", yaxs = "i")

    #Increment to next hour
    hour <- hour + 1

    #Keep track of the current progress
    print(hour)
  }
}, movie.name = "yelp_hourly.gif", interval = 0.6, ani.width = 600, ani.height = 600)

Now, go to the path you set earlier and you will find the gif (shown below)

An animated map that shows the hourly check-in patterns of Yelp users in Phoenix, AZ (click for a better view).

An animated map that shows the hourly check-in patterns of Yelp users in Phoenix, AZ (Click for a better view).

Some reflection:
– The major problem with this map is overplotting because we are plotting ~800,000 check-in records within a fairly small area. I tried to mitigate this problem by assigning an alpha value to each point such that the alpha value for a single check-in would be 0.02. In other words, an opaque point represents a business with more than 50 check-ins. But there are definitely more ways to improve.
– You might have already noticed that the legend is missing for this map. Yes, I intentionally removed the default legend by setting “legend.position” to be “none”. You may switch the legend back, but the problem is that the legend is placed at the right side of the map by default, and it’s only possible to move it to top/bottom/left side, which will create white margins. So I decided to rather remove the legend as the main info I wanted to communicate here is the spatio-temporal patterns of the check-ins, and absolute numbers of check-ins is not essential here.
– To add some analytic aspect to this map, it would be nice to superimpose an animated histogram at the corner of the map, which is synchronized to highlight the number of check-ins for current hour.

Advertisements

2 thoughts on “Visualizing Yelp User Check-ins with R

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s