formalyzer

Reads a recommendation ("recc") letter PDF and fills in graduate school web forms

Motivation

I am happy to write a recommendation letter “by hand” for a student. But then each graduate school has its own lengthy, idiosyncratic form, foisting its data-entry job upon me. This is tedious work, especially with many schools and several students. So I’ve wanted to automate the form-filling for quite a while.

Description

formalyzer extracts the text from the PDF recc letter and then, for each URL in url_list, it will:

  • launch a browser tab for that url
  • fill in the form using what the LLM has gleaned from the recc letter
  • attach the PDF via the form’s upload/attachment button

…and do no more.

The user will need to review the page and press the Submit button manually.

Requirements

  • Either ollama installed locally, or the ANTHROPIC_API_KEY environment variable set
  • beautifulsoup4, playwright, claudette, lisette, pypdf, fastcore

Technical Approach

You could try to feed the raw HTML and PDF into an LLM, but that would likely be a waste of resources – slow, expensive, and error-prone. Instead, formalyzer uses:

  • standard packages to pre-process and reduce the inputs: bs4 for HTML, pypdf for PDF
  • the LLM only for reading the reduced input texts (plus a system prompt) and outputting values to assign to form fields
  • another existing package (playwright) to fill in those fields
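To make the "reduce the inputs" step concrete, here is a minimal sketch of boiling a form page down to a short list of fillable fields. It is not formalyzer's code: formalyzer uses bs4, whereas this sketch swaps in the standard library's html.parser so it runs without extra dependencies. The names FormFieldCollector and summarize_form, and the choice of which attributes to keep, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect a compact description of each fillable form control."""

    FIELD_TAGS = {"input", "select", "textarea"}
    # Input types that carry no user-entered value (an illustrative choice)
    SKIPPED_INPUT_TYPES = {"hidden", "submit", "button", "reset", "image"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        a = dict(attrs)
        # <input> defaults to type="text"; other controls are named by their tag
        field_type = (a.get("type") or "text").lower() if tag == "input" else tag
        if field_type in self.SKIPPED_INPUT_TYPES:
            return
        self.fields.append({
            "id": a.get("id") or a.get("name") or "",
            "type": field_type,
            "hint": a.get("placeholder") or a.get("aria-label") or "",
        })


def summarize_form(html: str) -> list[dict]:
    """Return one small dict per fillable field found in the HTML."""
    parser = FormFieldCollector()
    parser.feed(html)
    return parser.fields


if __name__ == "__main__":
    sample = """
    <form>
      <input id="rec_first" type="text" placeholder="Recommender first name">
      <input type="hidden" name="csrf" value="abc123">
      <textarea name="comments"></textarea>
      <input type="submit" value="Submit">
    </form>
    """
    for field in summarize_form(sample):
        print(field)
```

A summary like this is usually far smaller than the page it came from, which is the point of pre-processing: the LLM only has to map letter contents onto a handful of field descriptions.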

Update, v0.0.5: You can now optionally send the raw HTML to Claude to verify the fields that bs4 detected, using the --verify <student name> CLI argument.

Usage

On macOS, start the Chrome browser with remote debugging enabled on port 9222 by executing this command in the terminal:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug

Then you can run this command:

formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>

where recc_info.txt contains information about the recommender (name, title, address, phone number, and email) and url_list.txt is a file containing one URL per line.
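If formalyzer cannot reach the browser, a quick check is whether Chrome is actually listening on the debugging port. The sketch below queries Chrome's /json/version endpoint, which Chrome serves when started with --remote-debugging-port. The function name chrome_debug_info is hypothetical and is not part of formalyzer.

```python
import http.client
import json
import urllib.error
import urllib.request


def chrome_debug_info(port: int = 9222, timeout: float = 2.0):
    """Return Chrome's version info as a dict if the debugging port answers, else None."""
    url = f"http://127.0.0.1:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a non-JSON reply: treat as "not available"
        return None


if __name__ == "__main__":
    info = chrome_debug_info()
    if info is None:
        print("Nothing answered on port 9222; start Chrome with --remote-debugging-port=9222")
    else:
        print("Connected to", info.get("Browser", "unknown browser"))
```

The same base URL (http://127.0.0.1:9222) is what Playwright's connect_over_cdp accepts, which is presumably how a tool like this attaches to the already-running browser.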

Optional: --verify <student name> uses Claude to analyze the raw HTML (with the student's name redacted) and check the fields detected via bs4. Using a local LLM for this step is currently far too slow. The field-value mapping can still be done via a local LLM.
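As an illustration of what redacting the student's name before sending HTML to a remote model could look like, here is a minimal sketch. It is an assumption about the general technique, not formalyzer's actual redaction logic; redact_name and the [STUDENT] placeholder are hypothetical.

```python
import re


def redact_name(text: str, full_name: str, placeholder: str = "[STUDENT]") -> str:
    """Replace the full name and each of its parts, ignoring case, on word boundaries."""
    full_name = full_name.strip()
    parts = [p for p in full_name.split() if len(p) > 1]
    # Longest first, so "Jane Doe" is replaced as a unit before "Jane" or "Doe"
    targets = sorted({full_name, *parts} - {""}, key=len, reverse=True)
    if not targets:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in targets) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


if __name__ == "__main__":
    html = "<h1>Recommendation for Jane Doe</h1><p>jane has applied to our program.</p>"
    print(redact_name(html, "Jane Doe"))
```

Simple pattern matching like this will miss nicknames, misspellings, and names embedded in e-mail addresses or IDs, so it reduces exposure rather than guaranteeing anonymity.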

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/drscotthawley/formalyzer.git

or from PyPI:

$ pip install formalyzer

After installing, run playwright install chromium to download the browser binaries.

Demo

On macOS, run these commands in Terminal:

  1. /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug &
  2. cd example
  3. python -m http.server 8000 &
  4. export ANTHROPIC_API_KEY="__your_API_key_goes_here__"
  5. formalyzer --debug recc_info.txt sample_letter.pdf sample_urls.txt

Local LLM Execution

For FERPA compliance, running a local model is preferable so that student data is not sent to a third-party service. I recommend using ollama and starting with something medium-small like qwen2.5:14b (9 GB). Start up ollama and pull the model:

ollama serve & 
ollama pull qwen2.5:14b 

Then you can use the --model CLI flag, e.g. 

formalyzer --debug --model 'ollama/qwen2.5:14b' recc_info.txt sample_letter.pdf sample_urls.txt

The quality of the form-filling will vary with the quality and size of the model you use. Smaller models like mistral (4 GB) may hallucinate many of the form field IDs, resulting in a mostly blank form. For a huge (41 GB) model, try ollama/qwen2:72b.
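One way to see how badly a given model hallucinates field IDs is to compare its output against the IDs actually found on the page. The sketch below is a hypothetical helper, not part of formalyzer; it assumes the model's answer has already been parsed into a dict mapping field ID to value.

```python
def check_llm_fields(llm_values: dict, known_ids: set) -> tuple[dict, dict, set]:
    """Split the model's answers into usable and hallucinated, and list unfilled fields."""
    # Keep only values whose field ID really exists on the page
    usable = {k: v for k, v in llm_values.items() if k in known_ids}
    # IDs the model invented; filling these would fail or do nothing
    rejected = {k: v for k, v in llm_values.items() if k not in known_ids}
    # Real fields the model gave no value for; these need manual entry
    unfilled = set(known_ids) - set(usable)
    return usable, rejected, unfilled


if __name__ == "__main__":
    page_ids = {"rec_first", "rec_last", "rec_email"}
    model_output = {"rec_first": "Ada", "recommender_surname": "Lovelace"}
    usable, rejected, unfilled = check_llm_fields(model_output, page_ids)
    print("usable:", usable)
    print("hallucinated:", rejected)
    print("still blank:", sorted(unfilled))
```

A large "hallucinated" set relative to "usable" is a sign that the model is too small for the task.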

Developer Guide

Install formalyzer in development mode

# make sure formalyzer package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to formalyzer
$ nbdev_prepare

Documentation

Documentation is hosted on this repository’s GitHub Pages site. Package-manager-specific guidelines are also available on conda and PyPI.

Limitations

Sometimes the LLM will miss certain fields – that’s just the nature of the game – so you’ll still need to fill those in by hand. But it gets most of them!