Data Is the New Force

Embibe is fanatical about instrumenting, measuring, collecting, mining, and archiving data. Embibe owns its data, and our IP depends on it. At Embibe, we delay a release until adequate instrumentation is in place to measure how our users interact with our products and what factors lead to specific outcomes. This obsession with data has led to many insights into how students study and achieve their goals.

For instance, a student’s potential to score is a combination of two factors: their learning ability, which contributes ~61% of the overall potential to score, and their behavioural attributes, which contribute ~39%. This razor-sharp focus on being data-driven has enabled Embibe to build products that personalise education and deliver tremendous improvements in student learning outcomes.

Primary Data Collection

Data is instrumented and collected at various stages and locations across Embibe’s platform. Capturing data is not enough on its own; what matters is capturing the correct type of data, at the right time, in the proper context, and with the right level of granularity. Data capture at Embibe falls broadly into the following categories:

  • Instrumentation of Rich Event Types:
    • User-interaction explicit events – clicks, taps, hovers, scrolls, text updates,
    • User-interaction implicit events – cursor position, tap pressure, device orientation, location,
    • System-generated server-side events – page load, session refreshes, API calls,
    • System-generated client-side events – system push notifications and triggers.
  • Specific Data by Property:
    • Page views (URL, referrer, user agent, device, IP, timestamp, traffic source, campaign),
    • Practice attempt level data (timestamp, visit/re-visit, answer choice, time first seen, correct, time spent, solution viewed, hint used) – aggregated at the session level,
    • Learn behaviour data:
      • Search event data (timestamp, query, result set),
      • Result interaction data (timestamp, suggested result selected, result widget and context, widget position),
    • Test attempt event-level data (timestamp, visit/re-visit, answer choice, time first seen, correct, time spent, feedback viewed) – aggregated at the session level,
    • Doubt Resolutions (Academic Forum) question and answer details, timestamps, and user voting behaviour.
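
The practice attempt fields listed above, and their roll-up to the session level, can be pictured with a minimal Python sketch. The class name, field names, types, and the choice of summary metrics are illustrative assumptions, not Embibe's actual schema or aggregation logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PracticeAttempt:
    """One attempt-level record; all field names are illustrative assumptions."""
    session_id: str
    timestamp: float        # epoch seconds when the answer was recorded
    is_revisit: bool        # False for a first visit, True for a re-visit
    answer_choice: str
    time_first_seen: float  # epoch seconds when the question was first shown
    correct: bool
    time_spent: float       # seconds spent on this attempt
    solution_viewed: bool
    hint_used: bool


def aggregate_by_session(attempts):
    """Roll attempt-level records up to one summary dict per session."""
    grouped = defaultdict(list)
    for attempt in attempts:
        grouped[attempt.session_id].append(attempt)

    summaries = {}
    for session_id, items in grouped.items():
        summaries[session_id] = {
            "attempts": len(items),
            "correct": sum(a.correct for a in items),
            "revisits": sum(a.is_revisit for a in items),
            "hints_used": sum(a.hint_used for a in items),
            "solutions_viewed": sum(a.solution_viewed for a in items),
            "total_time_spent": sum(a.time_spent for a in items),
        }
    return summaries
```

Calling aggregate_by_session on a list of PracticeAttempt records returns a dictionary keyed by session ID, which is the granularity at which the attempt data above is described as being aggregated.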

Many practical considerations need to be accounted for when instrumenting data collection at the scale that Embibe does. For instance, we rely on several modalities to gather all this data. The user-interaction event stream is logged through third-party plugins. Server-side page-load and session event logging is instrumented in-house and pushed to NoSQL databases. Daily user activity data on properties like Practice and Test is stored in the DB for query aggregations by the front end.

Data Processing

Once primary data has been collected, it needs to be cleaned, enriched, mined, and visualised. At Embibe, we have the following broad approaches to using the data we collect:

  • In-house Reporting and Ad-hoc Analysis:
    • Log mining to generate reporting data for traffic patterns, user monetisation, test-on-test improvement, search failures, and other needs. The processed data is then pushed into Elasticsearch and visualised using Kibana dashboards.
    • Primary raw data is stored on HBase over HDFS for any required analysis to be conducted on an ad-hoc basis.
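
As an illustration of the log-mining step, the sketch below derives a small search-failure report from raw log lines. It assumes a hypothetical JSON-lines log in which each search event carries "event", "query", and "result_count" fields, and it treats a search with zero (or missing) results as a failure; neither the log format nor that definition comes from Embibe's pipeline. A report like this would be the kind of processed data that is later indexed for dashboards.

```python
import json
from collections import Counter


def search_failure_report(log_lines, top_n=5):
    """Summarise failed searches from JSON-lines logs (hypothetical format)."""
    total = 0
    failures = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if not isinstance(record, dict) or record.get("event") != "search":
            continue
        total += 1
        # Assumption: zero or missing result_count counts as a failed search.
        if record.get("result_count", 0) == 0:
            failures[record.get("query", "")] += 1

    failed = sum(failures.values())
    return {
        "total_searches": total,
        "failed_searches": failed,
        "failure_rate": failed / total if total else 0.0,
        "top_failed_queries": failures.most_common(top_n),
    }
```

The function accepts any iterable of strings, so it can be fed an open log file directly or a batch of lines read from ad-hoc storage.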

Figure 1: A High-Level Schematic of the Data Flow Stack that Powers the Intelligence that Embibe’s Data Science Lab Develops

  • Third-Party Tools for Business/Product/Marketing Self-Serve:
    • Google Analytics for broad-level traffic monitoring, including traffic sources, demographic and location information, device breakdown, page views, time spent, and retention metrics.