Computer Science and Software Engineering Capstone Presentations

Fall Quarter

December 18, 2020

Thomas Hedrick

"Fact Graph Preprocessor"

(UWB CSS Faculty Research)

Faculty Advisor: Dr. Erika Parsons

Abstract

This capstone is part of a larger research project that seeks to implement a Fact Graph (FG) strategy for Natural Language Processing (NLP). The ultimate goal of an FG is to teach a computer to understand human language through a new approach that uses a graph instead of other structures such as trees or networks. This has many applications; take, for instance, a basic search for a topic across thousands of articles. An FG approach takes into account relationships between words, sentences, and paragraphs to provide context to language; the aim is to give a computer that context, in addition to the language itself, to process and learn from. The FG approach, like any other Machine Learning strategy, requires large amounts of data to learn from.

For this reason, the focus of this project is creating a data-preprocessing module that can be used as the first step in building the FG. This step should be both fast and consistent to support future work. The module takes information from Wikipedia's wikidump and turns it into an easy-to-read JSON format that can be passed directly into the next step of the project. Because of the large amount of processing required, we analyzed the algorithm and used parallelism and multiprocessing to speed up portions that would take much longer if run sequentially. This project also tries to make it easy to provide additional data by using the JSON format.
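As a rough illustration of the idea, the following minimal sketch converts raw pages into JSON records in parallel. It is not the project's actual code: the record fields (title, text, links), the helper names clean_page and preprocess, and the simplified handling of wiki links are all assumptions made for this example, and the pages are passed in as plain (title, wikitext) pairs rather than read from a real dump.

```python
import json
import re
from multiprocessing import Pool

# Matches [[Target]] and [[Target|shown text]] wiki links (simplified; assumption).
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def clean_page(page):
    """Turn one (title, wikitext) pair into a JSON-ready dict (hypothetical schema)."""
    title, wikitext = page
    # Collect link targets, then replace each link with its visible text.
    links = [m.group(1) for m in LINK.finditer(wikitext)]
    text = LINK.sub(lambda m: m.group(2) or m.group(1), wikitext)
    return {"title": title, "text": text, "links": links}


def preprocess(pages, workers=1):
    """Clean every page; use a process pool when more than one worker is requested."""
    if workers <= 1:
        return [clean_page(p) for p in pages]
    with Pool(processes=workers) as pool:
        # Pages are independent, so they can be cleaned in parallel.
        # pool.map preserves input order, which keeps the output consistent.
        return pool.map(clean_page, pages, chunksize=64)


if __name__ == "__main__":
    sample = [
        ("Fact graph", "A [[Graph theory|graph]] of facts about [[language]]."),
        ("Language", "Plain text with no links."),
    ]
    for record in preprocess(sample, workers=2):
        print(json.dumps(record))
```

Because each page is cleaned independently, the work divides naturally across processes, and keeping the results in input order means repeated runs produce the same file.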

The output of the preprocessing module is a 36 GB JSON data file that takes 26.5 hours to generate. This file contains relevant information that can be used for NLP and other strategies, in an easy-to-process, standard JSON format. In addition, it can be split into smaller JSON files based on starting targets of interest. This will allow future work to decide whether to use the entire dataset or subsets of it to find patterns and categorize documents across different scope sizes.
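One way to picture the splitting step is sketched below. The abstract does not define a "starting target", so this example assumes it is an article title and that the subset consists of the articles reachable from it within a fixed number of link hops. The function name subset_from_target, the max_hops parameter, and the record fields are hypothetical, and the sketch holds all records in memory, which a 36 GB file would not allow without streaming.

```python
import json
from collections import deque


def subset_from_target(records, start_title, max_hops=1):
    """Return records reachable from start_title within max_hops links (assumed rule)."""
    by_title = {r["title"]: r for r in records}
    if start_title not in by_title:
        return []
    seen = {start_title}
    queue = deque([(start_title, 0)])
    subset = []
    # Breadth-first walk over the link structure, bounded by hop count.
    while queue:
        title, hops = queue.popleft()
        subset.append(by_title[title])
        if hops == max_hops:
            continue
        for link in by_title[title]["links"]:
            if link in by_title and link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return subset


if __name__ == "__main__":
    records = [
        {"title": "A", "links": ["B"]},
        {"title": "B", "links": ["C"]},
        {"title": "C", "links": []},
        {"title": "D", "links": []},
    ]
    # A smaller JSON document containing only the neighborhood of "A".
    print(json.dumps(subset_from_target(records, "A", max_hops=1)))
```

Raising max_hops widens the scope of the subset, which matches the idea of comparing patterns across different scope sizes.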

Updated December 15, 2020