Computer Science and Software Engineering Capstone Presentations

Summer Quarter

August 21, 2020

 

Christian Rahmel

"Document Matrix Grouping for Data Provenance"

(UWB CSS Faculty Research)

 

Faculty Advisor: Dr. Erika Parsons

Abstract

The purpose of this project is to expand and refine the data sets available for the Data Provenance problem so that they can be used in new approaches. Data Provenance involves finding the origin of data, and this project aims to create meaningfully grouped data sets that can serve the various approaches to the problem. The project stresses the importance of quality groupings and of arriving at those groups by means that are logically meaningful to a human. As with our current LDA data sets, we use citations as the basis for assigning truth to these groupings. Citations can be modeled as a graph, so this project uses an adjacency matrix to map the finite graph that a given document set represents. In the context of this project, a quality group is one that is heavily interconnected internally while also being independent of, and disconnected from, other groups. To this end, we look to focal documents: documents in the data set that have high levels of interconnectivity with other documents in the data set. These give us potential points of interest around which to build our highly interconnected groups. This method of finding groupings for classification will be useful in growing and refining the data sets currently available for the Data Provenance problem, and it can be adapted to other data sets provided that the citations present in the data set are known. Additionally, there is much in the tool itself that can be improved and refined to further enhance the quality of these groupings. As of the end of this project, the tool is designed to work with a Wikipedia data set of around 500,000 highly interconnected documents.
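The ideas in the abstract (a citation graph stored as an adjacency matrix, focal documents, and a quality group) can be illustrated with a small Python sketch. This is not the project's tool. The function names, the toy citation list, the choice to treat citations as undirected links, and the use of link counts as the measure of interconnectivity are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def build_adjacency(num_docs: int, citations: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from (citing, cited) pairs.

    Assumption: a citation in either direction counts as one undirected link.
    """
    matrix = [[0] * num_docs for _ in range(num_docs)]
    for citing, cited in citations:
        if citing != cited:  # ignore self-citations
            matrix[citing][cited] = 1
            matrix[cited][citing] = 1
    return matrix


def focal_documents(matrix: List[List[int]], top_k: int) -> List[int]:
    """Return the top_k documents with the most links (ties broken by index).

    Assumption: "interconnectivity" is approximated by the row sum (degree).
    """
    degrees = [sum(row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda doc: (-degrees[doc], doc))
    return ranked[:top_k]


def group_quality(matrix: List[List[int]], group: Set[int]) -> Tuple[int, int]:
    """Count links inside the group and links leaving the group.

    A group is better, in the abstract's sense, when the first number is
    high and the second is low.
    """
    internal_twice = 0  # each internal link is seen from both endpoints
    external = 0
    for doc in group:
        for other, linked in enumerate(matrix[doc]):
            if not linked:
                continue
            if other in group:
                internal_twice += 1
            else:
                external += 1
    return internal_twice // 2, external


# Hypothetical toy data: 7 documents, two clusters joined by one citation (3, 4).
citations = [(0, 1), (0, 2), (0, 3), (1, 2), (4, 5), (4, 6), (5, 6), (3, 4)]
matrix = build_adjacency(7, citations)
focal = focal_documents(matrix, top_k=2)
internal, external = group_quality(matrix, {0, 1, 2, 3})
print(focal, internal, external)
```

In this toy graph, documents 0 and 4 have the most links, and the group {0, 1, 2, 3} has four internal links and a single link to the rest of the set. A dense list-of-lists matrix like this would not be practical for a set of around 500,000 documents; a sparse representation would likely be needed at that scale.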

Updated August 19, 2020, 00:23