Computer Science and Software Engineering Capstone Presentations
Fall Quarter
December 18, 2020
Thomas Hedrick, "Fact Graph Preprocessor" (UWB CSS Faculty Research)
Faculty Advisor: Dr. Erika Parsons
Abstract: This capstone is part of a larger research project that
seeks to implement a Fact Graph (FG) strategy for Natural Language
Processing (NLP). The ultimate goal of an FG is to teach a computer to
understand human language through a new approach that uses a graph instead of
other structures such as trees or networks. This has many applications; take,
for instance, a basic search for a topic across thousands of articles. An FG
approach takes into account relationships between words, sentences, and
paragraphs to provide context to language; the aim is to give a computer that
context, in addition to the language itself, to process and learn from. The FG
approach, like any other machine learning strategy, requires large amounts of
data to learn from. For this reason, the focus of this project is
creating a data-preprocessing module, which can be used as a first step in
the process of building the FG. This step should be both fast and consistent
to support future work. The data-preprocessing module takes information from
Wikipedia's wikidump and turns it into an easy-to-read JSON format that can
be passed directly into the next step of the project. Due to the large amount
of processing required, we analyzed the algorithm and used parallelism
and multiprocessing to speed up portions that would take much longer
if the work were done sequentially. The use of JSON also makes it
easy to supply additional data. The output of the preprocessing module is a JSON
data file that takes 26.5 hours to generate and is 36 GB in size. This file
contains relevant information that can be used for NLP and other strategies,
in an easy-to-process standard JSON format. In addition, it can be split into
smaller JSON files based on starting targets of interest. This will allow future
work to decide whether the entire dataset or subsets of it should be
used to find patterns and categorize documents across different scope sizes.
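The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual code. It assumes pages have already been pulled out of the dump as (title, wikitext) pairs, and the record fields `title`, `text`, and `links`, the helper names, and the sample pages are all hypothetical. It also assumes that a "starting target" is an article title and that a subset is collected by following links outward from it, which the abstract does not confirm.

```python
import json
import re
from collections import deque
from multiprocessing import Pool

# Hypothetical stand-in for pages already extracted from the dump.
SAMPLE_PAGES = [
    ("Graph theory",
     "A [[Graph (discrete mathematics)|graph]] is made of "
     "[[Vertex (graph theory)|vertices]] and edges."),
    ("Graph (discrete mathematics)",
     "A graph can model a [[Tree (graph theory)|tree]]."),
    ("Tree (graph theory)", "A tree is a connected acyclic graph."),
    ("Seattle", "Seattle is a city in [[Washington (state)|Washington]]."),
]

# Captures the target of [[target]] or [[target|label]].
LINK_TARGET_RE = re.compile(r"\[\[([^\]|#]+)")
# Matches a whole wiki link and captures the text a reader would see.
LINK_MARKUP_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def preprocess_page(page):
    """Turn one (title, wikitext) pair into a JSON-ready record."""
    title, wikitext = page
    links = sorted({t.strip() for t in LINK_TARGET_RE.findall(wikitext)})
    text = LINK_MARKUP_RE.sub(r"\1", wikitext)
    return {"title": title, "text": text, "links": links}


def preprocess_all(pages, workers=2):
    """Preprocess pages in parallel; pool.map keeps the input order."""
    with Pool(processes=workers) as pool:
        return pool.map(preprocess_page, pages)


def subset_from(records, start_title, max_depth=1):
    """Collect records reachable from start_title within max_depth links."""
    by_title = {r["title"]: r for r in records}
    if start_title not in by_title:
        return []
    seen = {start_title}
    queue = deque([(start_title, 0)])
    subset = []
    while queue:
        title, depth = queue.popleft()
        subset.append(by_title[title])
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in by_title[title]["links"]:
            if link in by_title and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return subset


if __name__ == "__main__":
    records = preprocess_all(SAMPLE_PAGES)
    print(json.dumps(subset_from(records, "Graph theory"), indent=2))
```

Each page is processed independently of the others, which is why the work divides cleanly across worker processes. The subset step runs on the finished records, so smaller JSON files can be produced without repeating the preprocessing.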
Updated December 15, 2020