Unsupervised Fine Tuning

Training a Masked Language Model (MLM) on unlabeled financial documents, specifically 10-K filings from the SEC EDGAR database, can improve the model's grasp of financial terminology and context. Because this fine-tuning is unsupervised, it needs no labeled examples: the model learns directly from a large corpus of raw filings, which improves its ability to interpret and produce text on financial topics.

Data Preparation:

Collect URLs and file paths of 10-K filings. Example sources include the SEC Edgar website and local storage for PDF files.
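A minimal sketch of assembling such a list, assuming the local filings are PDFs stored in a single folder. The helper name `collect_filing_paths` and its parameters are illustrative and not part of the Anote SDK.

```python
from pathlib import Path


def collect_filing_paths(urls, local_dir, pattern="*.pdf"):
    """Return remote URLs followed by sorted local file paths, without duplicates."""
    # Gather matching local files in a stable, sorted order.
    local_files = sorted(p.as_posix() for p in Path(local_dir).glob(pattern))

    seen = set()
    combined = []
    for item in list(urls) + local_files:
        cleaned = item.strip()  # drop stray whitespace around an entry
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            combined.append(cleaned)
    return combined
```

Calling `collect_filing_paths(edgar_urls, "10-Ks")` with a list of EDGAR URLs would then yield one combined list that can be passed on as the document list for training.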

Continual Pre-Training:

Unsupervised fine-tuning relies on continual pre-training: a pre-trained language model is trained further, in increments, on new domain-specific data, with the aim of retaining what it learned previously.

from anoteai import Anote

api_key = 'INSERT_API_KEY_HERE'
Anote = Anote(api_key, isPrivate=False)

file_paths = [
    'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm',
    'https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/d783162d10k.htm',
    '10-Ks/aapl-10-k.pdf', '10-Ks/amzn-20221231.pdf', '10-Ks/bankofamerica-10K.pdf',
    '10-Ks/dbx-20221231.pdf', '10-Ks/google-10-k.pdf', '10-Ks/msft-10k_20200630.pdf', '10-Ks/nflx-20221231.pdf',
    '10-Ks/nvda-10-k.pdf', '10-Ks/path-20230131.pdf', '10-Ks/sstk-20221231.pdf'
]

# Start the unsupervised fine-tuning job and keep the returned model id for prediction
fine_tuned_model_id = Anote.train(
    model_name="fine_tuned_mlm_on_10ks",
    model_type="MLM",
    fine_tuning_type="unsupervised",
    document_files=file_paths
)['id']
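For intuition about what continual pre-training of an MLM optimizes, the sketch below illustrates the general masking objective: some tokens are hidden and the model is trained to recover them. This is a simplified illustration, not Anote's implementation. The 15% rate follows the common BERT convention, the whitespace tokenization in the usage note is a stand-in for a real tokenizer, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN and record the originals.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)       # the model must predict this token
        else:
            masked.append(token)
            labels.append(None)        # position is not scored
    return masked, labels
```

For example, `mask_tokens("net revenue increased due to higher services sales".split(), seed=0)` returns the sentence with a random subset of words replaced by `[MASK]`, plus the labels the model would be scored against. Training on 10-K text this way needs no human annotation, which is why the approach counts as unsupervised.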

Using the Fine-Tuned Model

Once the model is fine-tuned, it can be used to perform specific NLP tasks such as answering questions or generating text based on the financial context.

import pandas as pd
test_df = pd.read_csv("Bizbench.csv")

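# Fine-tuned masked language model: collect one answer and supporting chunk per question.
# Assigning to `row` inside iterrows() would not update test_df, so results go into lists.
answers, chunks = [], []
for _, row in test_df.iterrows():
    answer, chunk = Anote.predict(
        model_name="fine_tuned_mlm",
        model_id=fine_tuned_model_id,
        question_text=row["question"],
        context_text=row["context"]
    )
    answers.append(answer)
    chunks.append(chunk)
test_df["ft_mlm_answer"] = answers
test_df["ft_mlm_chunk"] = chunks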

test_df[["id", "ft_mlm_answer"]].to_csv("ft_mlm_submission.csv", index=False)
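Before submitting, it can help to check the predictions against known answers when they are available. The following is a rough sketch of an exact-match score, assuming gold answers exist for the test questions. The helper names `normalize` and `exact_match_rate` are illustrative, and the benchmark's official scoring may differ.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

If the test file had a gold-answer column (a hypothetical `answer` column here), the score could be computed with `exact_match_rate(test_df["ft_mlm_answer"].tolist(), test_df["answer"].tolist())`.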

Additional Resources