
Unsupervised Fine-Tuning

Training a Masked Language Model (MLM) on financial documents, specifically 10-K filings from the SEC EDGAR database, is a form of unsupervised learning that can enhance the model's understanding of financial terminology and context. Unsupervised fine-tuning lets the model learn from a large corpus of unlabeled financial documents, improving its ability to understand and generate text on financial topics.


Unsupervised Learning Overview

Unsupervised learning involves training a model on unlabeled data, allowing it to discover patterns, structures, and relationships within the data without explicit guidance. This is particularly useful in domains like finance, where large amounts of unlabeled data (e.g., 10-K filings) are available and manually labeling such data is impractical.

Masked Language Modeling (MLM) is a specific form of unsupervised learning where the model is trained to predict missing or masked words in a sentence. This helps the model develop a deep understanding of the language, including syntax, semantics, and domain-specific jargon.
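
To make this concrete, the sketch below shows masked-token prediction with a generic pre-trained model via the Hugging Face transformers library. This is purely illustrative and independent of the Anote API used later; the model name and example sentence are assumptions.

from transformers import pipeline

# Illustration only: a generic pre-trained MLM, not the fine-tuned Anote model
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the masked token from its financial context
predictions = fill_mask("The company reported a net [MASK] of $2.3 billion for fiscal 2022.")
for p in predictions[:3]:
    print(f"{p['token_str']}: {p['score']:.3f}")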

Data Preparation

Collect the URLs and file paths of 10-K filings, as sketched below. Example sources include the SEC EDGAR website for filings hosted online and local storage for PDF files. These documents provide rich and diverse text for fine-tuning the model.
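
One way to assemble the list is to combine EDGAR URLs with a glob over a local directory; this sketch assumes the PDFs live in a 10-Ks/ folder:

from pathlib import Path

# SEC EDGAR URLs for filings available online
urls = [
    'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm',
]

# Locally stored 10-K PDFs (assumes a 10-Ks/ directory exists)
local_pdfs = sorted(str(p) for p in Path('10-Ks').glob('*.pdf'))

file_paths = urls + local_pdfs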

Continual Pre-Training

Unsupervised fine-tuning leverages continual pre-training, which involves updating a pre-trained language model by training it incrementally on new, domain-specific data without forgetting previously learned information. This approach ensures that the model retains its general knowledge while adapting to specific financial contexts.

By continually exposing the model to new financial documents, such as 10-K filings, its performance on finance-related tasks such as sentiment analysis, entity recognition, and text generation can be significantly improved.

from anoteai import Anote

api_key = 'INSERT_API_KEY_HERE'
anote = Anote(api_key, isPrivate=False)

# 10-K filings: SEC EDGAR URLs plus locally stored PDFs
file_paths = [
    'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm',
    'https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/d783162d10k.htm',
    '10-Ks/aapl-10-k.pdf', '10-Ks/amzn-20221231.pdf', '10-Ks/bankofamerica-10K.pdf',
    '10-Ks/dbx-20221231.pdf', '10-Ks/google-10-k.pdf', '10-Ks/msft-10k_20200630.pdf', '10-Ks/nflx-20221231.pdf',
    '10-Ks/nvda-10-k.pdf', '10-Ks/path-20230131.pdf', '10-Ks/sstk-20221231.pdf'
]

# Launch unsupervised MLM fine-tuning on the filings and keep the model ID
fine_tuned_model_id = anote.train(
    model_name="fine_tuned_mlm_on_10ks",
    model_type="MLM",
    fine_tuning_type="unsupervised",
    document_files=file_paths
)['id']
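
For intuition, continual MLM pre-training on raw filing text looks roughly like the following Hugging Face sketch. Anote's internal training procedure is not public, so the model choice, hyperparameters, and sample texts here are assumptions for illustration:

from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from an already pre-trained checkpoint and continue training it
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Plain text extracted from 10-K filings (the extraction step is not shown)
texts = [
    "The Company reported net sales of $265.6 billion for fiscal 2018.",
    "Operating expenses increased primarily due to research and development.",
]
dataset = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model is trained to reconstruct them
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_10k", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()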

Using the Fine-Tuned Model

Once the model is fine-tuned, it can be used to perform specific NLP tasks such as answering questions or generating text based on the financial context. The model, now more adept at understanding financial documents, can provide more accurate and contextually relevant outputs when interacting with financial texts.

import pandas as pd

test_df = pd.read_csv("Bizbench.csv")

# Query the fine-tuned Masked Language Model on each benchmark question
for i, row in test_df.iterrows():
    # iterrows() yields copies, so write results back to the DataFrame with .at
    answer, chunk = anote.predict(
        model_name="fine_tuned_mlm_on_10ks",
        model_id=fine_tuned_model_id,
        question_text=row["question"],
        context_text=row["context"]
    )
    test_df.at[i, "ft_mlm_answer"] = answer
    test_df.at[i, "ft_mlm_chunk"] = chunk

test_df[["id", "ft_mlm_answer"]].to_csv("ft_mlm_submission.csv", index=False)

Benefits of Unsupervised Fine-Tuning

Domain Adaptation: The model adapts to the specific language and terminology used in financial documents, improving its performance on tasks within this domain.

Resource Efficiency: Leveraging existing unlabeled data reduces the need for expensive and time-consuming data labeling.

Continual Learning: The model can continuously improve by being exposed to new data, maintaining relevance and accuracy as financial language evolves.

Additional Resources