Document Processing Pipeline

This is a conceptual outline of my personal process for converting source documents into useful, actionable knowledge. Software that facilitated this process would be useful across many disciplines: journalists, lawyers, researchers, and analysts are all tasked with extracting meaningful insights from source documents.

Example Uses

Screen Shot 2018-02-12 at 4.17.30 PM.png

Overview

This colorful explosion is the whole process outlined in this blog post. It traces the path of data transformation from source content into outlines and article forms. Let's break it down step by step. 

Screen Shot 2018-02-12 at 3.23.33 PM.png

[1] Source Documents

For the sake of the rest of the process, I've stacked all source documents in a single column, sorted chronologically. 

Screen Shot 2018-02-12 at 3.24.22 PM.png

[2] OCR Text (Optical Character Recognition) 

Automatic text recognition improves the searchability of the documents, and allows you to visualize keywords and trends over time. 

Screen Shot 2018-02-12 at 3.24.26 PM.png
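
To make the "keywords and trends over time" idea a bit more concrete, here's a rough Python sketch. It assumes the OCR step has already happened and that each document is just a (date, text) pair. The function name keyword_counts_by_month and the sample documents are made up for illustration, not part of any real tool.

```python
import re
from collections import Counter
from datetime import date

def keyword_counts_by_month(docs, keyword):
    """Count whole-word, case-insensitive hits of `keyword` per (year, month).

    `docs` is an iterable of (date, ocr_text) pairs (a hypothetical shape).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    counts = Counter()
    for doc_date, text in docs:
        hits = len(pattern.findall(text))
        if hits:
            counts[(doc_date.year, doc_date.month)] += hits
    # Sort by (year, month) so the result reads chronologically.
    return dict(sorted(counts.items()))

# Made-up sample data standing in for OCR output.
docs = [
    (date(2018, 1, 5), "Budget meeting. The budget is late."),
    (date(2018, 1, 20), "No mention here."),
    (date(2018, 2, 12), "Budget approved."),
]
trend = keyword_counts_by_month(docs, "budget")
```

The resulting mapping could feed whatever chart you like. Bucketing by month is just one choice; days or years would work the same way.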

[3] Annotations

I don't really like the word annotation, but these are basically extracted sections of the document, which are first highlighted and then pulled out for use elsewhere. 

Screen Shot 2018-02-12 at 3.44.42 PM.png
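
One way to think about an annotation is as a pointer into a document's text plus an optional note. The sketch below is only an assumption about how that might be modeled: the Annotation class, its fields, and the extract helper are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    doc_id: str
    start: int      # inclusive character offset into the document text
    end: int        # exclusive character offset
    note: str = ""

def extract(annotation, texts):
    """Pull the highlighted text out of its source document."""
    return texts[annotation.doc_id][annotation.start:annotation.end]

# Made-up sample document.
texts = {"journal-01": "Met with the editor on Monday. Draft due Friday."}

# Locate the highlighted passage instead of hard-coding offsets.
passage = "Draft due Friday."
start = texts["journal-01"].index(passage)
highlight = Annotation("journal-01", start, start + len(passage), note="deadline")

excerpt = extract(highlight, texts)
```

Storing offsets rather than copied text keeps the annotation tied to its source, so the excerpt can always be traced back to where it came from.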

[4] Timelines

Eh, I'm not really sure what to call this one either. Timelines (red) are basically selection ranges which can extend beyond the bounds of a single document. I thought of them as date ranges, or calendar overlays, but documents typically don't have a temporal dimension, so timelines are just really big annotations. lmao.

Screen Shot 2018-02-12 at 3.24.36 PM.png
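
If timelines are just really big annotations, one possible model is a range over the whole stacked column of documents rather than over a single document. The sketch below assumes that model. Timeline and split_by_document are hypothetical names, and the three tiny documents are placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Timeline:
    start: int      # inclusive offset into the whole stacked column
    end: int        # exclusive offset into the whole stacked column
    label: str = ""

def split_by_document(timeline, stack):
    """Break a timeline into per-document pieces.

    `stack` is a list of (doc_id, text) pairs in chronological order.
    Returns a list of (doc_id, local_start, local_end) tuples.
    """
    pieces = []
    doc_start = 0
    for doc_id, text in stack:
        doc_end = doc_start + len(text)
        lo = max(timeline.start, doc_start)
        hi = min(timeline.end, doc_end)
        if lo < hi:  # the timeline overlaps this document
            pieces.append((doc_id, lo - doc_start, hi - doc_start))
        doc_start = doc_end
    return pieces

# Three placeholder documents, ten characters each.
stack = [("a", "0123456789"), ("b", "abcdefghij"), ("c", "ABCDEFGHIJ")]
pieces = split_by_document(Timeline(5, 23, label="trip"), stack)
```

Here the timeline starts halfway through the first document, covers the second entirely, and ends a few characters into the third.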

[5] Outline

This is where things get good. This step is where you sort all of the annotations and timelines into a structured outline (blue), a semantic tree of keywords and whatever else. 

Screen Shot 2018-02-12 at 3.42.43 PM.png

It's kinda like Workflowy and Dynalist, the infinitely expandable and collapsible bulleted-list platforms.
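
Here's a minimal sketch of what such an outline tree might look like in Python, assuming each node holds a title, some extracted excerpts, and child nodes. OutlineNode and render are hypothetical names, and the collapsed flag is just one guess at how expand and collapse could work.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OutlineNode:
    title: str
    excerpts: List[str] = field(default_factory=list)
    children: List["OutlineNode"] = field(default_factory=list)
    collapsed: bool = False

def render(node, depth=0):
    """Return the outline as a list of indented bullet lines."""
    lines = ["  " * depth + "- " + node.title]
    if node.collapsed:
        return lines  # hide everything beneath a collapsed node
    for excerpt in node.excerpts:
        lines.append("  " * (depth + 1) + "> " + excerpt)
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Made-up outline for illustration.
root = OutlineNode("Project", children=[
    OutlineNode("Deadlines", excerpts=["Draft due Friday."]),
    OutlineNode("People", children=[OutlineNode("Editor")], collapsed=True),
])
lines = render(root)
```

Printing "\n".join(lines) gives a bulleted list where the collapsed "People" branch shows only its title.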

 

[6] Editorialization 

I'm not sure if there's a word for "to make into an article", but I think editorialization might mean that. If not, please suggest better naming practices. Words are hard.

Anyways, this step is where you convert the outline into an article. 

Screen Shot 2018-02-12 at 3.24.53 PM.png

We've come full circle. The green article with orange images is basically this blog post, with images extracted from my journals. Whoa.
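
Turning an outline into an article could start as a simple depth-first walk: titles become headings and excerpts become paragraph text. The sketch below assumes the outline is a nested dict with "title", "excerpts", and "children" keys, which is a made-up shape. The real writing still has to be done by a human; this only produces a skeleton draft.

```python
def to_article(node, depth=0):
    """Flatten an outline node into a list of text blocks.

    `node` is a dict like {"title": str, "excerpts": [str], "children": [node]}
    (a hypothetical shape). The top-level title is upper-cased as the headline.
    """
    heading = node["title"].upper() if depth == 0 else node["title"]
    blocks = [heading]
    if node.get("excerpts"):
        blocks.append(" ".join(node["excerpts"]))
    for child in node.get("children", []):
        blocks.extend(to_article(child, depth + 1))
    return blocks

# Made-up outline for illustration.
outline = {
    "title": "Document Processing",
    "children": [
        {"title": "Annotations", "excerpts": ["Highlight first.", "Then extract."]},
        {"title": "Outline", "excerpts": ["Sort everything into a tree."]},
    ],
}
article = "\n\n".join(to_article(outline))
```

The output is a plain-text draft with one blank line between blocks, ready for actual editing.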

Who wants to build it?

Let's build this! At the top of the article I listed a few parties that might find this tool useful to their operations. I'll happily help build this if I believe in your cause. But if you're using this to deport immigrants or some spooky crap like that, I won't help you out.