Cleveland Big Data Meetup
Details
Hybrid info: https://keybank.zoom.us/webinar/register/WN_AAyZCVrCQ0mM0PqpCR9Pfw
(however, no pizza is available for the hybrid modality)
** For those attending in person: please try to get to Key by 4:45pm, because there is a security check-in **
5:00pm - pizza and networking
5:30pm - Presentation
Paco Nathan! Paco is an O'Reilly author on AI and Machine Learning.
"Catching Bad Guys using open data, open models in AI: a tour through anti-fraud use cases with graphs and entity resolution"
GraphRAG is a popular way to use knowledge graphs to ground AI apps in facts. Most GraphRAG tutorials use LLMs to build the graph automatically from unstructured data. However, what if you're working on use cases such as investigative journalism and sanctions compliance -- "catching bad guys" -- where decisions must be transparent and backed by evidence?
This talk explores how to leverage open data and open models for AI apps -- using entity resolution to build accountable investigative graphs and to explore otherwise hidden relations in the data that indicate fraud or corruption. Professionals who work in sanctions compliance, tax fraud, counter-terrorism, and similar areas -- which our team helps support -- generally don't present much in public. However, we can use open data and open source to illustrate where machine learning assists in these kinds of use cases.
For this talk we'll construct an investigative graph about potential money laundering, using entity resolution (ER) to merge open data from ICIJ Offshore Leaks, Open Ownership, and OpenSanctions. We'll explore techniques used in production use cases for anti-money laundering (AML), ultimate beneficial owner (UBO), rapid movement of funds (RMF), and other areas of sanctions compliance.
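To give a rough feel for what merging records across sources with an audit trail can look like, here is a minimal sketch using only the Python standard library. It is not the method used in the talk: the records, field names, source labels, and the naive name-normalization rule are all invented for illustration, and real entity resolution weighs far more evidence than a normalized name plus a country code.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical records: field names and values are invented for illustration
# and do not reflect the real schemas of any dataset named in the abstract.
RECORDS = [
    {"source": "source_a", "id": "a-1", "name": "Acme Holdings Ltd.", "country": "CY"},
    {"source": "source_b", "id": "b-7", "name": "ACME HOLDINGS LIMITED", "country": "CY"},
    {"source": "source_c", "id": "c-3", "name": "Acme Holdings Ltd", "country": "PA"},
    {"source": "source_c", "id": "c-9", "name": "Borealis Trading S.A.", "country": "PA"},
]

# Assumed, deliberately tiny table of legal-suffix variants.
SUFFIXES = {"limited": "ltd", "incorporated": "inc", "corporation": "corp"}


def match_key(record):
    """Build a naive matching key: normalized name plus country code."""
    text = unicodedata.normalize("NFKD", record["name"])
    text = text.encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^a-z0-9 ]", "", text.lower()).split()
    tokens = [SUFFIXES.get(token, token) for token in tokens]
    return (" ".join(tokens), record["country"])


def resolve(records):
    """Group records that share a key, keeping every contributing source record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    entities = []
    for number, (key, members) in enumerate(sorted(groups.items())):
        entities.append({
            "entity_id": f"E{number}",
            "match_key": key,
            # The evidence list is what makes a merge auditable later.
            "evidence": [(m["source"], m["id"]) for m in members],
        })
    return entities


if __name__ == "__main__":
    for entity in resolve(RECORDS):
        print(entity["entity_id"], entity["match_key"], entity["evidence"])
```

The point of the `evidence` list is that each resolved entity can be traced back to the exact source records that were merged, which is the property an investigator or auditor would need to review.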
First we'll build a "backbone" for the graph in ways which preserve evidence and allow for audits. Next we'll use spaCy pipelines to parse related news articles, using `GLiNER` to extract entities, then the new `spacy-lancedb-linker` to link them into the graph. Finally, we'll show graph analytics that make use of the results -- tying into what's needed for use cases such as GraphRAG.
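The three steps above (backbone, linking, analytics) can be sketched in plain Python as follows. This is a toy stand-in under stated assumptions: it does not call spaCy, `GLiNER`, or `spacy-lancedb-linker`, the node identifiers, relation names, alias table, and mentions are all hypothetical, and exact alias lookup replaces real entity linking.

```python
from collections import defaultdict, deque

# Hypothetical backbone edges (node, relation, node), as might come out of
# an entity resolution step over structured data.
BACKBONE = [
    ("E0", "OWNED_BY", "P1"),
    ("E1", "OWNED_BY", "P1"),
    ("P1", "SANCTIONED_UNDER", "S1"),
]

# Hypothetical alias table mapping lowercased surface forms to graph nodes.
ALIASES = {"acme holdings": "E0", "borealis trading": "E1"}

# Hypothetical mentions, shaped like the output of an NER step:
# (article identifier, surface text).
MENTIONS = [
    ("news-1", "Acme Holdings"),
    ("news-1", "Unknown Corp"),
    ("news-2", "Borealis Trading"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, relation, node) triples."""
    graph = defaultdict(set)
    for left, _relation, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def link_mentions(graph, mentions, aliases):
    """Attach each article to the node its mention resolves to.

    Returns the mentions that could not be linked, so they stay visible
    for manual review instead of being silently dropped.
    """
    unlinked = []
    for article, text in mentions:
        target = aliases.get(text.lower())
        if target is None:
            unlinked.append((article, text))
            continue
        graph[article].add(target)
        graph[target].add(article)
    return unlinked


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    graph = build_graph(BACKBONE)
    print("unlinked:", link_mentions(graph, MENTIONS, ALIASES))
    print("path:", shortest_path(graph, "news-1", "S1"))
```

A path query like this one is a very simple example of graph analytics: it asks how a news article connects, through linked entities and ownership, to a sanctions listing.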
This approach uses Python open source libraries, and all of the code is provided on GitHub, organized as Jupyter notebooks. For each NLP task we use state-of-the-art open models (mostly not LLMs), emphasizing how to tune for a domain context: named entity recognition, relation extraction, textgraph, and entity linking, as well as entity resolution to merge structured data and produce a semantic overlay that organizes the graph.