AUTOMATIC INGESTION OF CONTENT FROM UNSTRUCTURED DATA SOURCES
At Embibe, we have many types of content: study material, question-and-answer pairs, video solutions, and more. Historically, ingesting this variety of content into Embibe’s data stores was a manual task in which a group of data entry operators entered data into the system using a data entry tool. This process is tedious and time-consuming, especially as we expand our content to cover thousands of exams across hundreds of syllabi.
Automatic extraction of information from unstructured data sources is an open research problem. We have been working on it for some time, both to make content ingestion easier and to enrich and augment the content available in our data stores. The problem draws on several fields, including optical character recognition, information retrieval, natural language processing, and machine learning.