At Interana, we build a fast, scalable behavioral analytics solution for event data. By fast, I mean really fast (answers in seconds), and by big I mean really big (billions and billions of events). We currently have customers like Microsoft and Tinder that have nearly reached a trillion rows of data and are still getting results in seconds. Scale and speed are critical to Interana and the types of analytics we do (conversion, engagement, retention, root cause, etc.) on massive volumes of raw event data. We knew that designing and writing a system like this would not be easy, and it wasn’t. But we didn’t expect it to be so hard to find good data for testing, measurement, and demos. That’s why we decided to make our own data.
Today, we’re releasing eventsim to the world. This is a tool that I wrote internally to produce a stream of realistic-looking (but fake) event data. We use it for development, testing, and demos. This blog post explains why I wrote a fake data generator, how it works, and how to get (and use) it.
You can get the code for the simulator from https://github.com/interana/eventsim.
Background and Motivation
At Interana, we built a system for viewing event data. Events are measurements that capture a moment in time; each event describes an observation (something that was seen, or something that happened), along with attributes about that event. Examples of events include web page views, credit card transactions, SMS messages, and industrial sensor readings.
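To make this concrete, here is what a single event record might look like. This is a hypothetical page-view event sketched in Python; the field names and values are illustrative, not eventsim's actual output schema:

```python
import json

# A hypothetical event: a timestamp, an observation (what happened),
# and attributes describing it. Field names are illustrative only.
event = {
    "ts": 1447121325000,            # timestamp, milliseconds since the epoch
    "userId": 1042,                 # who did it
    "page": "NextSong",             # what happened
    "song": "Blister In The Sun",   # attributes of the observation...
    "userAgent": "Mozilla/5.0",
    "city": "San Francisco",
}

# Events are typically serialized one per line, e.g. as JSON.
print(json.dumps(event))
```

A stream of data like this is just many such records, one per moment observed, usually ordered by timestamp.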
In industry, there are a huge number of data sources that look like this. I worked with data like this myself (at LinkedIn, Netflix, and Verisign), and our customers produce large volumes of data like this (at Asana, Imgur, Bing, and other places). Typically, the data contains a set of events associated with a set of users over a long time period.
At Interana, we wanted to find data sets that showed how people behaved over time, and that could be used to calculate common business metrics. Unfortunately, we struggled to find free, open data sources that looked like this. We found some data sets that satisfied some of our requirements, but not all. (For example, there is a good data set of Wikipedia edits. This data set contains many events, but is less than ideal for engagement metrics.)
We decided that our best bet was to simulate the action of many users on a completely fake web site. We wanted the simulator to have the following features:
Configurable time period. We wanted to be able to create data for long or short time periods, and include timestamps up to the present.
Configurable volume. We needed to be able to create data for many different numbers of users, from tens to millions. (This lets us use the same data for small development projects and massive performance testing projects.)
Realistic traffic patterns. Many of us have worked at big consumer web sites, and know that traffic varies by time of day and day of week. We wanted more traffic in the day than at night, and more during the week than on weekends (and holidays).
One time or continuous. We wanted to be able to generate data once, or to generate data continuously.
Output to files, or to Apache Kafka.
Pseudo-random output. We wanted the data to look random, but to be generated deterministically (to ease testing and recreating data).
Different behavior for different users. We wanted different users to behave a little differently: some arrive more frequently than others, and some click around differently.
Colorful attributes. We wanted to make the data fun and interesting: to assign users names, to have them use different browsers, to have them come from different places. And we wanted them to do interesting things.
Growth and attrition. We wanted to be able to calculate growth metrics, so new users appear over time (and some leave).
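A couple of these properties fit together naturally: deterministic pseudo-random output and realistic traffic patterns. The sketch below shows the general idea in Python; it is a simplified illustration, not eventsim's actual implementation (which is written differently), and the weighting function and constants are made up for the example:

```python
import math
import random

def traffic_weight(hour, weekday):
    """Illustrative weighting: more traffic during the day than at
    night, and more on weekdays than on weekends."""
    daily = 0.5 + 0.5 * math.sin(math.pi * (hour - 6) / 12)  # peaks around midday
    weekly = 1.0 if weekday < 5 else 0.6                     # quieter weekends
    return daily * weekly

def generate_events(seed, hours=24, weekday=2, base_rate=100):
    """Deterministic pseudo-random stream: the same seed always yields
    the same events, which makes tests and demos reproducible."""
    rng = random.Random(seed)  # seeded RNG, so output is repeatable
    events = []
    for hour in range(hours):
        expected = base_rate * traffic_weight(hour, weekday)
        count = rng.randint(int(expected * 0.8), int(expected * 1.2) + 1)
        # Each event here is just (hour, user id); a real generator
        # would emit full records with timestamps and attributes.
        events.extend((hour, rng.randrange(1000)) for _ in range(count))
    return events

# Same seed -> identical output, which eases testing and recreating data.
assert generate_events(seed=42) == generate_events(seed=42)
```

The point of the seeded generator is that "random-looking" and "reproducible" are not in conflict: rerunning with the same seed recreates the exact same data set.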
After some work, I decided to simulate a fake music web site, like Apple Music, Pandora, or Spotify. I chose this use case because I think it’s intuitive for most users (lots of people have experience with music streaming services), and fun. I also had some interesting data to use for faking it: the Million Song Dataset. (I used data from that project to create realistic names and distributions of songs.)