Train an AI to generate text with Python 🧠

by Tom Dekan
Updated: Sun 29 September 2024

We'll train a real, simple GPT language model (by fine-tuning GPT-2) using Python on macOS. Writing the code will be quick (around 10 minutes), and training will take around 30-60 minutes on a Mac.

We'll use the Tiny Shakespeare dataset, a small (about 1 MB) text file of Shakespeare's writing. It's ideal for training our GPT to generate authentic-sounding Shakespearean text, and small enough to train on a Mac quickly.

Our final model will generate text in a Shakespearean style, like this:



Here's a video guide of me building this 🙂:


Let's get started 🚀

What We'll Cover

  • Setting up the development environment on macOS
  • Understanding the GPT architecture
  • Preparing the Tiny Shakespeare dataset for training
  • Training the GPT model
  • Generating text

Prerequisites

  • A Mac running macOS (ideally with an Apple Silicon chip, e.g. M1, M2, or M3)
  • Python 3.8 or higher installed

Let's go

Step 1: Set Up the Development Environment

1.1 Set Up a Virtual Environment

Create a project directory and navigate into it:

mkdir gpt-macos
cd gpt-macos

Create a virtual environment:

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

1.2 Install Required Libraries

We'll install:

  • transformers by Hugging Face
  • datasets by Hugging Face
  • torch (PyTorch) with MPS support
  • tqdm for progress bars

Installing PyTorch with MPS Support

Note: As of September 2023, MPS support is available in the stable PyTorch release. You can install everything via:

pip install -U torch torchvision torchaudio transformers datasets accelerate tqdm

Deeper Explanation

  • PyTorch with MPS: Allows you to utilize the GPU on Apple Silicon Macs for faster training.
  • transformers: Hugging Face's library of pretrained Transformer models and tokenizers.
  • datasets: Simplifies access to datasets and efficient data preprocessing.

1.3 Verify the Installation

Create a file check_installation.py and add:

import transformers
import datasets
import torch

print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"Torch version: {torch.__version__}")
print(f"MPS Available: {torch.backends.mps.is_available()}")

Run the script:

Note: Patience is required. This takes around 90 seconds to run on my M3 Air (24GB RAM).

python check_installation.py 

After you've waited for the script to complete, you should see something like:

(venv) your-username@your-mac-name % python3 check_installation.py
Transformers version: 4.44.2
Datasets version: 3.0.0
Torch version: 2.6.0.dev20240917
MPS Available: True

Deeper Explanation

  • torch.backends.mps.is_available(): Checks if the MPS backend is available, indicating that you can use your Mac's GPU for training.

Step 2: Understanding the GPT Architecture (Optional)

We are training a GPT (Generative Pre-trained Transformer), based on the Transformer architecture. This relies on attention mechanisms to process sequences of data.

Learning about these components is optional for our guide, but might be interesting. I've added analogies for each if you want to grasp the mechanisms at a high level.

So, the key components in our model will be:

  • Self-Attention mechanism

My analogy for self-attention: focusing on different words as you read a sentence.

Imagine you're reading a sentence. As you read each word, you naturally pay attention to other words that help you understand its meaning. The self-attention mechanism is like this natural reading process, but for computers. It helps the computer "focus" on the most important words when trying to understand or generate text.

Here's a nice example by Deepgram:

Consider the following sentences: "My dog has black, thick fur as well as an active personality. I also have a cat with brown fur. What is the breed of my dog?" Without attention, the model would assign equal importance to the information about the cat and the dog, which could lead to incorrect or misleading answers. However, with attention, a well-trained language model would assign less attention to the phrase "brown fur" because it is irrelevant to the question being asked.

From their well-written article: Visualizing and Explaining Transformer Models from the Ground Up
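
If you'd like to see the mechanism itself, here is a minimal, hypothetical sketch of scaled dot-product self-attention in PyTorch. Real GPT-2 uses learned projection weights, multiple attention heads, and a causal mask; this only shows the core "weighted focus" idea:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)      # 4 token vectors, each 8-dimensional (toy embeddings)

W_q = torch.randn(8, 8)    # toy query, key, and value projections
W_k = torch.randn(8, 8)
W_v = torch.randn(8, 8)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = (Q @ K.T) / (8 ** 0.5)       # how strongly each token relates to every other token
weights = F.softmax(scores, dim=-1)   # each row sums to 1: the attention a token pays out
output = weights @ V                  # each token becomes a weighted mix of all tokens

print(weights)        # the 4x4 attention pattern
print(output.shape)   # torch.Size([4, 8])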

  • Positional encoding

My analogy for positional encoding: reading the words of a sentence in order.

When you read a sentence, you know that the order of words matters. "The cat chased the mouse" means something different from "The mouse chased the cat." Positional encoding is a way to tell the computer about word order, so it understands that the position of words in a sentence is important.

More explanation here in "What is positional encoding and why do we need it?"
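
To make this concrete, here is a small, optional sketch of the classic sinusoidal positional encoding from the original Transformer paper. (GPT-2 actually learns its position embeddings, but the goal is the same: give each position a distinctive vector to add to the token embeddings.)

import math
import torch

def sinusoidal_positions(num_positions: int, dim: int) -> torch.Tensor:
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (num_positions, 1)
    div_terms = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    encoding = torch.zeros(num_positions, dim)
    encoding[:, 0::2] = torch.sin(positions * div_terms)   # even dimensions use sine
    encoding[:, 1::2] = torch.cos(positions * div_terms)   # odd dimensions use cosine
    return encoding

print(sinusoidal_positions(num_positions=4, dim=8))   # one distinctive row per position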

  • Decoder layers

My analogy for decoder layers: Editing a draft to a final version.

Decoder layers are like multiple rounds of editing a story. When you first write a draft, you get your main ideas down. In the first editing round, you might fix grammatical errors and improve sentence structure. In the next round, you enhance the flow and clarity, and in subsequent rounds, you add more details and refine the narrative. Similarly, each decoder layer takes the initial information and progressively refines it, enhancing the model’s ability to generate coherent and accurate text.

Alternative analogy and much more explanation here: The Encoder-Decoder Concept
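
If you're curious how many of these stacked decoder blocks the GPT-2 checkpoint we'll fine-tune later contains, you can inspect its default configuration (an optional quick check):

from transformers import GPT2Config

config = GPT2Config()                        # the default 'gpt2' configuration
print(f"Decoder layers:  {config.n_layer}")  # 12
print(f"Attention heads: {config.n_head}")   # 12
print(f"Embedding size:  {config.n_embd}")   # 768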

Optional explanation over. Let's continue building!

Step 3: Preparing the Tiny Shakespeare Dataset

We'll use the Tiny Shakespeare dataset. It contains a large sample of Shakespeare's writing and is small enough (about 1 MB) to train on a Mac quickly.

3.1 Download the Dataset

Create a script prepare_data.py and add the following code:

import requests

# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text

# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)

(We could do this manually, but it's neater to keep the download step in a script.)

Run the script:

python prepare_data.py

3.2 Load and Explore the Dataset

Update prepare_data.py to contain the following code:

import requests
from datasets import Dataset

# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text

# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)


# Load data into a Hugging Face Dataset
raw_data = Dataset.from_dict({'text': [data]})

# Print the first 500 characters
print(raw_data['text'][0][:500])

And run the script to see a sample of the text.

python prepare_data.py

3.3 Tokenization

We need to tokenize the text using a tokenizer compatible with our GPT model.

Add this code to the end of prepare_data.py:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Deeper Explanation

  • Tokenizer: Converts text to numerical tokens the model can understand.
  • GPT2Tokenizer: Uses Byte Pair Encoding (BPE) to efficiently handle rare words.
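
If you want to see what the tokenizer actually does, you can temporarily add a quick check like this after creating it (optional, and the exact token IDs don't matter):

sample = "O Romeo, Romeo! wherefore art thou Romeo?"
ids = tokenizer(sample)['input_ids']

print(ids)                                    # the numerical token IDs the model sees
print(tokenizer.convert_ids_to_tokens(ids))   # the BPE pieces behind those IDs
print(tokenizer.decode(ids))                  # decoding recovers the original text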

3.4 Preprocess the Data

Tokenize and prepare the data for the model.

Again, add this code to the end of prepare_data.py:

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], return_special_tokens_mask=True)

tokenized_dataset = raw_data.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"]
)

Deeper Explanation

  • tokenize_function: Applies the tokenizer to the 'text' field.
  • batched=True: Processes multiple examples at once (efficient for larger datasets).
  • num_proc: Number of processes for parallelization.
  • You may see a warning that the tokenized sequence is longer than the model's 1024-token limit. That's expected here, because we split everything into 128-token blocks in the next step.

3.5 Group Texts into Blocks

Since GPT models expect inputs of a fixed size, we'll split the data into blocks.

So, add this code to the end of prepare_data.py:

block_size = 128

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated['input_ids'])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [concatenated[k][i:i + block_size] for i in range(0, total_length, block_size)]
        for k in concatenated.keys()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Our resulting lm_dataset contains chunks of block_size tokens with corresponding labels.
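
If you'd like to double-check that (optional), you can temporarily add something like this before saving:

print(lm_dataset)                                     # number of 128-token blocks and column names
print(tokenizer.decode(lm_dataset[0]['input_ids']))   # the first block, decoded back into text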

3.6 Save the Processed Dataset

Add this code to the end of prepare_data.py:

# Save the dataset to disk
lm_dataset.save_to_disk('lm_dataset')

Run the script:

python prepare_data.py

All of prepare_data.py

Just as a quick check, the final prepare_data.py (with all imports at the top) should look like this:

import requests
from datasets import Dataset
from transformers import GPT2Tokenizer


# Download the Tiny Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
data = response.text

# Save to file
with open('tiny_shakespeare.txt', 'w') as f:
    f.write(data)


# Load data into a Hugging Face Dataset
raw_data = Dataset.from_dict({'text': [data]})

# Print the first 500 characters
print(raw_data['text'][0][:500])



tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


def tokenize_function(examples):
    return tokenizer(examples['text'], return_special_tokens_mask=True)

tokenized_dataset = raw_data.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"]
)

block_size = 128

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated['input_ids'])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [concatenated[k][i:i + block_size] for i in range(0, total_length, block_size)]
        for k in concatenated.keys()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=4,
)

lm_dataset.save_to_disk('lm_dataset')

Step 4: Training the GPT Model

Now we'll set up and train our GPT model using the processed dataset 🚀

4.1 Load the Dataset and Model

Create a new script train.py and add:

from datasets import load_from_disk
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

# Load the dataset
lm_dataset = load_from_disk('lm_dataset')

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')

Deeper Explanation

  • GPT2LMHeadModel: GPT-2 model with a language modeling head on top, suitable for text generation tasks.
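
As an optional aside, you can check how large the model you're fine-tuning is (the 'gpt2' checkpoint has roughly 124 million parameters):

num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params:,}")   # roughly 124 million for 'gpt2'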

4.2 Configure the Device (CPU or MPS)

Add this code to the end of train.py:

import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")

model.to(device)

You'll want to use MPS if you're on an Apple Silicon Mac. Training will be much faster than on the CPU, by a factor of around 10, because the MPS backend runs the matrix multiplications at the heart of transformer training on the GPU.
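
The exact speed-up depends on your machine, so treat the 10x figure as a rough guide. If you want to see the effect yourself, here is a small optional micro-benchmark comparing one large matrix multiplication on the CPU and on MPS:

import time
import torch

size = 4096
a = torch.randn(size, size)
b = torch.randn(size, size)

start = time.time()
a @ b
print(f"CPU matmul: {time.time() - start:.2f}s")

if torch.backends.mps.is_available():
    a_mps, b_mps = a.to("mps"), b.to("mps")
    a_mps @ b_mps                  # warm-up: the first MPS call is slower
    torch.mps.synchronize()
    start = time.time()
    a_mps @ b_mps
    torch.mps.synchronize()        # wait for the GPU to finish before stopping the clock
    print(f"MPS matmul: {time.time() - start:.2f}s")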

4.3 Set Up Training Arguments

Append this to train.py:

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=5e-4,
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,  # MPS backend currently doesn't support fp16 as of writing.
)

4.4 Initialize the Trainer

Once again, append this to train.py:

from transformers import DataCollatorForLanguageModeling

# Data collator for dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    eval_dataset=lm_dataset,
    data_collator=data_collator,
)

4.5 Start Training

Finally, add this to the end of train.py:

print("Starting training...")
trainer.train()
print("βœ… Training complete. Saving model...")
trainer.save_model('./shakespeare_gpt2')
tokenizer.save_pretrained('./shakespeare_gpt2')
print("βœ… Model saved.")

The final train.py (with all imports at the top) should look like this:

from datasets import load_from_disk
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

# Load the dataset
lm_dataset = load_from_disk('lm_dataset')

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')


import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")

model.to(device)


training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=5e-4,
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,  # MPS backend currently doesn't support fp16
)

from transformers import DataCollatorForLanguageModeling

# Data collator for dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    eval_dataset=lm_dataset,
    data_collator=data_collator,
)

print("Starting training...")
trainer.train()
print("βœ… Training complete. Saving model...")
trainer.save_model('./shakespeare_gpt2')
tokenizer.save_pretrained('./shakespeare_gpt2')
print("βœ… Model saved.")

Now, run the script to train the model: ⭐️

python train.py

You should see the training progress (loss values and a progress bar) printed in your terminal.

It will take 30-60 minutes to run, depending on your Mac. Once it's done, you should see the "Training complete" and "Model saved" messages.

Deeper Explanation

  • trainer.train(): Initiates the training loop, handling forward and backward passes, optimizer steps, and logging.

Step 5: Evaluating and Generating Text

Now, let's use our trained model to generate some Shakespearean text!

5.1 Create a Text Generation Script

Create a new file called generate_text.py and add:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import argparse


# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('./shakespeare_gpt2')
model = GPT2LMHeadModel.from_pretrained('./shakespeare_gpt2')

# Define the pad_token as eos_token
tokenizer.pad_token = tokenizer.eos_token

# Update the model's configuration to recognize the pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Configure device
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS backend")
else:
    device = torch.device("cpu")
    print("Using CPU")

model.to(device)
model.eval()


def generate_text(prompt, max_length=100):
    # Encode the prompt and generate attention_mask
    encoded = tokenizer(
        prompt,
        return_tensors='pt',
        padding=False,  # No padding needed for single inputs
        truncation=True,
        max_length=512  # Adjust based on model's max input length
    )
    input_ids = encoded['input_ids'].to(device)
    attention_mask = encoded['attention_mask'].to(device)

    with torch.no_grad():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,  # Pass the attention_mask
            max_length=max_length,
            num_beams=5,
            no_repeat_ngram_size=2,
            early_stopping=True,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id  # Ensure pad_token_id is set
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)



def main():
    parser = argparse.ArgumentParser(description="Generate text based on a prompt.")
    parser.add_argument("--prompt", nargs="?", default=None, help="The prompt to generate text from.")
    args = parser.parse_args()

    if args.prompt:
        print(f"Prompt: {args.prompt}")
        print(generate_text(args.prompt)[len(args.prompt):], sep="")
    else:
        prompts = [
            "O Romeo, Romeo! wherefore art",
            "All the world's a",
            "Is this a dagger which I"
        ]

        for prompt in prompts:
            print(f"Prompt: {prompt}")
            print(generate_text(prompt)[len(prompt):], sep="")
            print("\n" + "-"*50 + "\n")

if __name__ == "__main__":
    main()

Run the script:

python generate_text.py

Deeper Explanation

  • temperature=0.7 and top_p=0.9: sampling settings that trade off creativity and coherence; lower values make the output more conservative.
  • no_repeat_ngram_size=2: prevents any two-token phrase from repeating in the generated text.
  • Expected output: for each prompt, the script prints a Shakespeare-style continuation. The exact text will vary between runs because do_sample=True enables sampling.

Step 6: Tips for Further Improvements

6.1 Experiment with Different Prompts

Try different prompts to see how the model responds.

prompts = [
    "O Romeo, Romeo! wherefore art thou",
    "All the world's a stage,",
    "Is this a dagger which I see before"
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(generate_text(prompt))
    print("\n" + "-"*50 + "\n")

6.2 Train more 🏋️‍♂️

  • Increase Epochs: More epochs may improve the model's performance.
  • Learning Rate: Fine-tune the learning rate for better convergence.
  • Batch Size: If you have enough memory, increasing the batch size can stabilize training (see the sketch below).
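
For instance, a variation of the training arguments in train.py might look like this. These values are illustrative rather than tuned, and larger batch sizes need more memory:

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=5,               # more passes over the data
    per_device_train_batch_size=4,    # larger batches, if your Mac has the memory
    per_device_eval_batch_size=4,
    evaluation_strategy='steps',
    eval_steps=100,
    save_steps=500,
    logging_steps=50,
    learning_rate=3e-4,               # a slightly gentler learning rate
    warmup_steps=100,
    save_total_limit=2,
    fp16=False,                       # MPS backend currently doesn't support fp16
)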

6.3 Fine-Tune on Additional Data

  • Add more Shakespearean works or other literature to the dataset.
  • Combine Tiny Shakespeare with other texts to enrich the language (see the sketch below).
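
As a minimal sketch of that idea (other_text.txt is a hypothetical placeholder for any extra plain-text file you've downloaded), you would concatenate the raw text in prepare_data.py before building the Dataset; everything after that stays the same:

from datasets import Dataset

with open('tiny_shakespeare.txt') as f:
    shakespeare = f.read()

with open('other_text.txt') as f:     # hypothetical extra text file
    extra = f.read()

combined = shakespeare + "\n\n" + extra
raw_data = Dataset.from_dict({'text': [combined]})   # the rest of prepare_data.py is unchanged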

Conclusion

Congrats! You can now generate simple Shakespearean text using a GPT model that you trained yourself on your Mac.

Let's get visual.

Do you want to create beautiful frontends effortlessly?
Click below to book your spot on our early access mailing list (as well as early adopter prices).

Made with care by Tom Dekan

© 2024 Photon Designer