The highest priority for any ad network is maintaining a healthy data infrastructure. But in the case of Chitika, with a network that has grown to span hundreds of thousands of websites and ever-broadening reporting requirements, that task has become more complex and nuanced over time.
In this “Logging Data” series, we’ll provide some in-depth detail on the intricacies of data collection, infrastructure, and access here at Chitika, hopefully providing some useful lessons for both newcomers and veterans in the field.
Our first post will focus on our logs – the baseline of our data collection – and the subsequent processes that coalesce the information they contain into more readily accessible formats for our data scientists.
Structure
When a user loads a webpage with Chitika ad code, what gets passed back to us is a set of basic information about the impression: the URL where the ad appeared (e.g. http://www.example.com/page.php), the size of the ad unit itself, any referring domain that brought the user to the page, the associated user agent, and a number of other characteristics.
This information is then aggregated into a JSON record that can run to hundreds of lines; a partial example follows:
{
  "handler": "minimall",
  "http": {
    "ip": "175.142.XXX",
    "size": "4918",
    "status": "200",
    "time": "255035",
    ...
This JSON output is produced in one of two ways. If the impression comes from our real-time bidding sector, the record is written directly as JSON. For some of our other streams, such as those coming from our more traditional ad units, we do a quick post-processing pass on the logs that come out of Apache for cleaning and categorization purposes; a rough sketch of that second path follows.
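To make that second path concrete, here is a minimal sketch of the idea (not our production code): the delimiter, field names, and log layout are placeholders, and the real pipeline does considerably more cleaning and categorization.

```perl
#!/usr/bin/perl
# Minimal sketch: turn one line of a custom Apache log into a JSON record.
# The pipe-delimited layout and field names here are illustrative only.
use strict;
use warnings;
use JSON;    # or JSON::PP, which ships with modern Perl

while (my $line = <STDIN>) {
    chomp $line;
    # Assume a LogFormat along the lines of: ip|status|bytes|time_us|url|referrer|user_agent
    my ($ip, $status, $size, $time, $url, $referrer, $ua) = split /\|/, $line;

    my $record = {
        handler => 'minimall',
        http    => {
            ip     => $ip,
            size   => $size,
            status => $status,
            time   => $time,
        },
        url      => $url,
        referrer => $referrer,
        ua       => $ua,
    };

    print encode_json($record), "\n";   # one JSON object per line
}
```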
While these raw logs can be queried directly for specific information, we employ a variety of internally built systems, plus a few external ones, to make that process quicker and easier. That means we first need to get the data to those systems, which is best described as a step-by-step process (a rough sketch of the per-server side follows the list):
- Logs are bucketed into 6-minute chunks on each ad server
- Chunks are compressed once complete
- Completed chunks are then rsynced to the datacenter-local aggregation server (multiple threads per ad server)
- The aggregation server acts as (surprise!) an aggregation point for that datacenter and rsyncs any chunks it receives to our final aggregation point in our analytics datacenter (multiple threads)
- As part of that final aggregation, all chunks are moved into Gluster over InfiniBand and organized by type, day, and hour (multiple threads)
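As a rough illustration of the first three steps, the sketch below buckets a timestamp into a 6-minute chunk name, then compresses and rsyncs the finished chunk. The paths, host name, and rsync flags are placeholders rather than our actual configuration.

```perl
#!/usr/bin/perl
# Sketch of the per-ad-server side of the pipeline (placeholder paths/hosts).
use strict;
use warnings;
use POSIX qw(strftime);

# A chunk name is the timestamp floored to the nearest 6-minute boundary.
sub chunk_name {
    my ($epoch) = @_;
    my @t      = localtime($epoch);
    my $bucket = int($t[1] / 6) * 6;              # minutes: 00, 06, 12, ...
    return strftime('%Y%m%d-%H', @t) . sprintf('%02d', $bucket);
}

# Work on the previous (now complete) chunk, not the one still being written.
my $chunk = '/var/log/ads/' . chunk_name(time() - 360) . '.log';

system('gzip', $chunk) == 0
    or die "gzip of $chunk failed: $?";

# Ship the compressed chunk to the datacenter-local aggregation server.
system('rsync', '-az', '--partial', "$chunk.gz",
       'aggregator.dc1.example:/incoming/') == 0
    or die "rsync of $chunk.gz failed: $?";
```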
Rsync is the biggest factor in making this process robust and fault-tolerant, and thanks to a multithreaded Perl reactor built by a few members of our team, the shuttling itself has also become highly parallelized; the sketch below illustrates the general idea.
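This is not the reactor's actual code, just a fork-based illustration of the concept: fan pending chunks out over several rsync processes in parallel instead of shipping them one at a time (hosts and paths are again placeholders).

```perl
#!/usr/bin/perl
# Illustration only: ship several pending chunks at once via parallel rsync
# processes rather than sequentially.
use strict;
use warnings;

my @chunks   = glob('/incoming/*.gz');
my $max_kids = 4;                     # transfers in flight at once
my %kids;                             # pid => chunk being shipped

for my $chunk (@chunks) {
    # Throttle: block until a transfer finishes if we're at the limit.
    reap_one() while (keys %kids) >= $max_kids;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                  # child: hand off to rsync
        exec 'rsync', '-az', '--partial', $chunk,
             'analytics.example:/gluster/incoming/';
        die "exec rsync failed: $!";
    }
    $kids{$pid} = $chunk;
}
reap_one() while keys %kids;          # drain the remaining transfers

sub reap_one {
    my $pid = wait();
    delete $kids{$pid} if $pid > 0;
}
```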
When all is said and done, data is accessible at most eight minutes behind real time: a chunk covers six minutes of traffic, and the remainder is spent on compression and the rsync hops.
The newly loaded data is then consumed by our data science team's most frequently used tools to track performance and run associated tests; more specifically, our SQL-based archive, Hadoop, a number of other internal systems, and automated scripts. A graphical view of this flow can be seen below:
Now that we’ve outlined how an impression becomes a usable piece of data, be sure to check in next week for:
Logging Data Part 2: Taming the Storage Beast
With historical data comparisons being extremely valuable, we store certain types of data for months and years at a time. Next time we’ll be focusing on the associated storage requirements and network maintenance that this amount of data necessitates.
Stay tuned!