Capstone schedule
Computer Science and Software Engineering Capstone Presentations
Summer Quarter
August 21, 2020
Christian Rahmel, "Document Matrix Grouping for Data Provenance" (UWB CSS Faculty Research)
Faculty Advisor: Dr. Erika Parsons
Abstract: The purpose of this project is to expand and refine the data sets
available for the Data Provenance problem so that they can be used in new
approaches. Data Provenance involves finding the origin of data, and this
project aims to create meaningfully grouped data sets that can be used by the
various approaches to the problem. The project stresses the importance of
quality groupings and of arriving at these groups by means that are logically
meaningful to a human. As such, much like our current LDA data sets, we use
citations as the basis for assigning truth to these groupings. Citations can
be modeled as a graph, so this project uses an adjacency matrix to map the
finite graph that a given document set represents. In the context of this
project, a quality group is one that is heavily interconnected internally
while being independent of and disconnected from other groups. To this end,
we look to focal documents, that is, documents in the data set that have high
levels of interconnectivity with other documents in the data set. These give
us potential points of interest around which to build our highly
interconnected groups. This method of finding groupings for classification
will be useful in growing and refining the data sets currently available for
the Data Provenance problem, and it can be adapted to other data sets,
provided we know the citations present in them. Additionally, there is much
in the tool itself that can be improved and refined to further enhance the
quality of these groupings. As of the end of this project, the tool is
designed to work with a Wikipedia data set of around 500,000 highly
interconnected documents.
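
The ideas in the abstract (a citation graph stored as an adjacency matrix, focal documents with many connections, and groups that are dense inside and sparse outside) can be illustrated with a minimal sketch. This is not the project's tool: the function names, the toy citation list, the choice to treat citations as undirected links, and the rule of forming a group from a focal document plus its direct neighbors are all assumptions made only for illustration.

```python
from typing import List, Set, Tuple


def build_adjacency(num_docs: int, citations: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from (citing, cited) pairs.

    Assumption: a citation in either direction counts as one undirected link.
    """
    adj = [[0] * num_docs for _ in range(num_docs)]
    for src, dst in citations:
        if src != dst:  # ignore self-citations
            adj[src][dst] = 1
            adj[dst][src] = 1
    return adj


def focal_documents(adj: List[List[int]], top_k: int) -> List[int]:
    """Return the top_k documents with the most links (ties broken by index)."""
    degrees = [sum(row) for row in adj]
    ranked = sorted(range(len(adj)), key=lambda d: (-degrees[d], d))
    return ranked[:top_k]


def group_quality(adj: List[List[int]], group: Set[int]) -> Tuple[int, int]:
    """Count (internal_links, external_links) for a candidate group.

    A high internal count with a low external count matches the abstract's
    notion of a heavily interconnected yet independent group.
    """
    internal = 0
    external = 0
    for i in group:
        for j, linked in enumerate(adj[i]):
            if not linked:
                continue
            if j in group:
                internal += 1  # each internal link is seen from both ends
            else:
                external += 1
    return internal // 2, external


# Hypothetical toy data: six documents and their citation pairs.
citations = [(0, 1), (0, 2), (1, 2), (0, 3), (4, 5), (3, 4)]
adj = build_adjacency(6, citations)

focal = focal_documents(adj, top_k=1)
# Assumed grouping rule: the focal document plus its direct neighbors.
group = {focal[0]} | {j for j, linked in enumerate(adj[focal[0]]) if linked}
internal, external = group_quality(adj, group)
print(focal, sorted(group), internal, external)
```

A dense matrix like this is only practical for small examples. A data set of around 500,000 documents would likely need a sparse representation, which is outside the scope of this sketch.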
Updated August 19, 2020, 00:23