Doubt Resolution Product

Doubt Resolution Product, as the name suggests, is a platform that helps resolve a user's doubts. While Subject Matter Experts can offer this help, the volume and breadth of doubts arriving in parallel make it very hard for these experts to respond to every doubt manually. This in turn introduces longer wait times and degrades the user experience.
We therefore introduce Machine Learning algorithms to assist the human experts and reduce the number of questions they have to answer.
The two main challenges in solving this are capturing context from the user's doubt and identifying contextually similar questions from a huge pool of questions.
Most doubt-resolution products utilize existing corpora of content to provide similar questions or build a system capable of answering questions based on the available context around the question.
We at Embibe have millions of questions in our question bank. We use state-of-the-art models fine-tuned on our academic corpora to extract contextual information from the question text and from any diagrams or figures present in the question. We encode this information into a high-dimensional vector space and retrieve the answer with a step-by-step solution if the question is present in our question bank; otherwise, we display contextually similar questions with step-by-step solutions that the user can practice on.

Process Architecture

  • Production Version
    • Orchestrator Service
    • DPR Text Encoder Service
  • Demo Version

Implementation

This pipeline mainly contains three sequential stages when the input is an image:

  • In stage-1, we automatically detect the diagrams and figures present in the input image and crop them out.
  • In stage-2, we extract the text from the uploaded image.
  • In stage-3, the extracted text and image are projected to dense vectors, and an efficient similarity-search library, FAISS, is used to search the indexed questions for the given query.

In the case of text input, stage-3 can directly be executed.
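
As a rough sketch, the control flow across the three stages could look like the following; the three helper callables are hypothetical stand-ins for the services described below, and only the branching logic is taken from the description above.

```python
# Hypothetical orchestration of the three stages; the helper callables
# (detect_and_crop_diagrams, ocr_text, retrieve_similar) are stand-ins
# for the actual services -- only the control flow is from the post.

def resolve_doubt(query, detect_and_crop_diagrams, ocr_text, retrieve_similar, top_k=5):
    if query.get("image") is not None:
        # Stage-1: split the upload into diagram crops and a text-only image
        diagrams, text_image = detect_and_crop_diagrams(query["image"])
        # Stage-2: OCR the text-only image
        text = ocr_text(text_image)
    else:
        # Plain-text doubts skip straight to retrieval (stage-3)
        diagrams, text = [], query["text"]
    # Stage-3: dense retrieval over the indexed question bank
    return retrieve_similar(text=text, diagrams=diagrams, top_k=top_k)
```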

Stage-1: Diagram Detection

To help solve the doubt at hand, it is important to capture all the details the user provides around the question. Nearly 10% of the time, these details come in the form of figures, diagrams, graphs, chemical equations, etc. It was also observed that the performance of the OCR layer in Stage-2 is affected by the presence of a figure in the input image. We therefore introduce a diagram detection layer to ensure both that diagrams play their part when searching for similar questions, and that the presence of a diagram in the input image does not degrade OCR performance.

  1. Input: Image uploaded via the platform
  2. We use the YOLOv5 model, traditionally used for object detection and localization, trained on our academic images to detect figures and diagrams such as graphs and chemical equations, and to crop them out, leaving separate images for the text content and the diagram content (a sketch follows this list).
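
A sketch of how such a fine-tuned detector could be applied, using the standard ultralytics/yolov5 torch.hub entry point; the weights filename is a placeholder, and whiting out detected boxes is one simple way to produce the text-only image that stage-2 expects.

```python
import torch
from PIL import Image

# Load a YOLOv5 model fine-tuned on academic images; the weights file
# name here is an assumption, the torch.hub entry point is the standard one.
model = torch.hub.load("ultralytics/yolov5", "custom", path="diagram_detector.pt")

def crop_diagrams(image_path):
    """Return diagram crops plus a copy of the image with diagrams masked out."""
    image = Image.open(image_path).convert("RGB")
    results = model(image)
    crops, text_image = [], image.copy()
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        box = tuple(map(int, (x1, y1, x2, y2)))
        crops.append(image.crop(box))
        # White out the detected diagram so the OCR stage sees text only
        text_image.paste((255, 255, 255), box)
    return crops, text_image
```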

Improvements in Diagram Detection

  • Initially, the YOLOv5 model was trained on ~1100 images with the following performance:
    • Precision: 0.863, Recall: 0.898, mAP@0.5: 0.9
    • The model wasn't performing as expected on images with organic compounds/chemistry equations.
  • To improve it, the model was retrained on ~2300 images.
  • Unlike other subjects, where the diagram is fairly distant and separate from the rest of the textual content, Chemistry sometimes has diagrams in line with the text; this retraining was therefore focused mainly on such samples.

Stage-2: OCR using Mathpix

This layer takes as input the image containing only text and no diagrams, extracts the text content, and fetches similar questions based on it. To pass a text-heavy image through OCR, it needs to have high resolution, while many of the academic doubt images uploaded on the platform (~42% of user images) not only have low resolution but may also suffer from shadows, rotation, unintended text, etc. Preprocessing steps such as skew correction, shadow removal, resolution enhancement, and blur detection therefore help ensure good OCR performance; a rough sketch of two of these checks follows.
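
A minimal OpenCV sketch of blur detection and shadow removal, two of the checks mentioned above; the Laplacian-variance threshold and the kernel sizes are illustrative assumptions, not tuned production settings.

```python
import cv2
import numpy as np

def is_blurry(gray, threshold=100.0):
    # Variance of the Laplacian is a common sharpness proxy;
    # the threshold here is an illustrative value, not a tuned one.
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def remove_shadow(gray):
    # Estimate the illumination with dilation + median blur, then
    # subtract it out and renormalise -- a common shadow-removal trick.
    background = cv2.medianBlur(cv2.dilate(gray, np.ones((7, 7), np.uint8)), 21)
    diff = 255 - cv2.absdiff(gray, background)
    return cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
```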

  1. Input: The separated image from stage-1, i.e. the image containing only text
  2. To ensure the best OCR results, the user can crop and rotate the image as required.
  3. Internally, this layer uses the Mathpix API to convert the text content of the image to plain text; the API also returns a confidence score for its prediction. We use a metric called expRate (expression rate) to measure the correctness of this layer: it evaluates the percentage match between the actual text in the image and the text predicted by the API. We observed that expRate increases significantly as the confidence does. (A sketch of the API call follows this list.)
  4. If multiple sub-questions are detected, the input is divided into separate questions and each is passed to the next step.
  5. The output can be edited before it is finally passed to the DPR API, to ensure the input is fully correct.
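
A minimal sketch of the OCR call, assuming Mathpix's public v3/text endpoint; the credentials are placeholders, and any thresholding on the returned confidence is left to the caller.

```python
import base64
import requests

def mathpix_ocr(image_path, app_id, app_key):
    """Send a text-only image to the Mathpix v3/text endpoint and return
    the recognised text together with its confidence score."""
    with open(image_path, "rb") as f:
        src = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    response = requests.post(
        "https://api.mathpix.com/v3/text",
        headers={"app_id": app_id, "app_key": app_key},
        json={"src": src, "formats": ["text"]},
    )
    result = response.json()
    # The confidence score can be thresholded before trusting the extraction.
    return result.get("text", ""), result.get("confidence", 0.0)
```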

Stage-3: Most similar question using DPR

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering research. Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models such as TF-IDF or BM25 have been the de facto method. With DPR, retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. Evaluated on a wide range of open-domain QA datasets, DPR outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy, and helped its authors' end-to-end QA system establish new state-of-the-art results on multiple open-domain QA benchmarks.

We take the pre-trained DPR model and fine-tune it on our academic corpora, consisting of Question Texts, Answers, and their detailed Explanations, to make DPR domain-specific. We then use the Question Encoder part of DPR to encode questions and retrieve contextually similar ones.
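
Encoding a question with the DPR question encoder via Hugging Face transformers might look as follows; the public facebook/dpr-question_encoder-single-nq-base checkpoint stands in here for the encoder fine-tuned on the academic corpora.

```python
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Public DPR checkpoint as a stand-in; in production this would be the
# encoder fine-tuned on the academic corpora described above.
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)
encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)

def encode_question(text):
    """Project a question string to a dense vector (768-d for this model)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        return encoder(**inputs).pooler_output[0]
```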

Algorithm In Brief

  1. Input
    1. Text of Question
    2. Image URL(s) of Question
    3. top-k
  2. If text is present in the input, it is encoded by the DPR model into a question vector. If an image URL is present in the text itself, it is converted into an image before cleaning.
  3. If images (diagrams) are present in the input, they are encoded by the EfficientNet model into an image vector. (For multiple images, the vectors are averaged to obtain a single image vector.)
  4. The text and images of all Published and Approved questions are encoded using the steps above and then indexed via FAISS / Elasticsearch.
  5. The FAISS / Elasticsearch indices are queried with the question vector and the image vector to retrieve the top-k similar questions.
  6. If both text and a diagram are present, the question and image cosine scores are combined via a weighted sum, with weights based on the input question length, and the results are reranked accordingly (see the sketch after this list).
  7. The top-k results are returned based on the final score.
    For questions without multiple sub-questions or diagrams, a top-5 match is currently achieved in 92.16% of cases.
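
A minimal sketch of the FAISS side of steps 4-6, with random vectors standing in for the encoded question bank; the text/image weight of 0.7 is an illustrative value, since the actual weighting depends on the input question length.

```python
import faiss
import numpy as np

dim = 768  # dimensionality of the DPR question vectors
# Stand-in corpus: in production, these are the encoded Published and
# Approved questions, L2-normalised so inner product equals cosine score.
question_vecs = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(question_vecs)

index = faiss.IndexFlatIP(dim)  # inner product == cosine on unit vectors
index.add(question_vecs)

def search(index, query_vec, top_k=5):
    """Return (question_id, cosine_score) pairs for the top-k neighbours."""
    scores, ids = index.search(query_vec.reshape(1, -1), top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

def rerank(text_hits, image_hits, text_weight=0.7):
    """Weighted sum of text and image cosine scores (illustrative weight)."""
    combined = {}
    for qid, score in text_hits:
        combined[qid] = combined.get(qid, 0.0) + text_weight * score
    for qid, score in image_hits:
        combined[qid] = combined.get(qid, 0.0) + (1.0 - text_weight) * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```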
