Add tech_docs/llm/transformer_llm.md
tech_docs/llm/transformer_llm.md (new file, 85 lines)
@@ -0,0 +1,85 @@
When working with Transformers in libraries like Hugging Face's `transformers`, dictionaries and indices play crucial roles in handling data efficiently. Here is a concise explanation with examples:
### Dictionaries in Transformers
1. **Tokenization:**
   - Tokenizers convert text into tokens and return a dictionary with keys such as `input_ids`, `attention_mask`, and `token_type_ids`.
   - Example:

     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     inputs = tokenizer("Hello, how are you?", return_tensors="pt")
     print(inputs)
     # {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]),
     #  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
     #  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
     ```
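
   As a quick extension of the example above (not part of the original snippet), the same dictionary interface covers batches of texts; the standard `padding` and `truncation` keyword arguments control how sequences of different lengths are aligned:

   ```python
   # Two sentences of different lengths; padding=True pads the shorter one
   # so every tensor in the returned dictionary has the same shape.
   batch = tokenizer(
       ["Hello, how are you?", "Fine."],
       padding=True,
       truncation=True,
       return_tensors="pt",
   )
   print(batch["input_ids"].shape)   # e.g. torch.Size([2, 8])
   print(batch["attention_mask"])    # 1s for real tokens, 0s over padded positions
   ```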
2. **Model Outputs:**
   - Model outputs are dictionary-like `ModelOutput` objects (e.g., `SequenceClassifierOutput`) containing elements such as `logits`, `hidden_states`, and `attentions`.
   - Example:

     ```python
     from transformers import AutoModelForSequenceClassification

     model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
     outputs = model(**inputs)
     print(outputs)
     # SequenceClassifierOutput(loss=None, logits=tensor([[0.2438, -0.1436]], grad_fn=<AddmmBackward0>),
     #                          hidden_states=None, attentions=None)
     # The classification head is randomly initialized for "bert-base-uncased",
     # so the exact logits will differ from run to run.
     ```
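
   As a brief follow-up (an addition to the original example), `hidden_states` and `attentions` stay `None` unless they are requested explicitly via the standard forward keyword arguments:

   ```python
   outputs = model(**inputs, output_hidden_states=True, output_attentions=True)
   print(len(outputs.hidden_states))   # 13 for bert-base-uncased: embedding output + 12 layers
   print(outputs.attentions[0].shape)  # (batch_size, num_heads, seq_len, seq_len)
   ```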
3. **Training and Configuration:**
   - Training arguments and model configurations are keyword/value collections; they are constructed from named parameters and can be exported as plain dictionaries.
   - Example:

     ```python
     from transformers import Trainer, TrainingArguments

     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=16,
         per_device_eval_batch_size=16,
         warmup_steps=500,
         weight_decay=0.01,
         logging_dir='./logs',
     )
     ```
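
   As a small follow-up sketch (not part of the original snippet), both `TrainingArguments` and model configurations can be exported as plain dictionaries via `to_dict()`, which is handy for logging or saving a run's settings:

   ```python
   from transformers import AutoConfig

   args_dict = training_args.to_dict()      # TrainingArguments -> plain dict
   print(args_dict["output_dir"])           # ./results

   config = AutoConfig.from_pretrained("bert-base-uncased")
   print(config.to_dict()["hidden_size"])   # 768
   ```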
### Indices in Transformers
1. **Token Indices:**
   - Each token in the vocabulary is assigned a unique integer index, and the tokenizer maps tokens to these indices (and back).
   - Example:

     ```python
     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     # tokenize() splits the text into WordPiece tokens; no special tokens are added here
     tokens = tokenizer.tokenize("Hello, how are you?")
     token_indices = tokenizer.convert_tokens_to_ids(tokens)
     print(token_indices)
     # [7592, 1010, 2129, 2024, 2017, 1029]
     ```
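
   For the reverse direction (a brief extension of the example above), the tokenizer also maps indices back to tokens and readable text:

   ```python
   print(tokenizer.convert_ids_to_tokens(token_indices))
   # ['hello', ',', 'how', 'are', 'you', '?']
   print(tokenizer.decode(token_indices))
   # hello, how are you?
   ```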
2. **Positional Encoding:**
   - Positional indices preserve the order of tokens in a sequence. This matters because self-attention by itself is order-agnostic, so position information must be injected explicitly (fixed sinusoidal encodings in the original Transformer, learned position embeddings in BERT); see the sketch below.
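
   To make this concrete, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper. It is illustrative only: `bert-base-uncased` itself uses learned position embeddings rather than this fixed formula.

   ```python
   import math
   import torch

   def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
       """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
       position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
       # 1 / 10000^(2i/d_model) for the even dimensions 0, 2, 4, ...
       div_term = torch.exp(
           torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
       )
       pe = torch.zeros(seq_len, d_model)
       pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
       pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
       return pe

   pe = sinusoidal_positional_encoding(seq_len=8, d_model=768)
   print(pe.shape)  # torch.Size([8, 768])
   ```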
### Practical Example
Combining dictionaries and indices in a text classification task:
1. **Tokenization:**

   ```python
   from transformers import AutoTokenizer, AutoModelForSequenceClassification

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   text = "Transformers are powerful models for NLP tasks."
   inputs = tokenizer(text, return_tensors="pt")
   ```
2. **Model Inference:**

   ```python
   model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
   outputs = model(**inputs)
   logits = outputs.logits
   ```
3. **Post-Processing:**

   ```python
   import torch

   predicted_class = torch.argmax(logits, dim=1).item()
   print(f"Predicted class: {predicted_class}")
   ```
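
   As an optional extension (not in the original), a softmax turns the logits into probabilities, and `model.config.id2label` maps the class index to a label name. Because the classification head of `bert-base-uncased` is randomly initialized, the prediction is only meaningful after fine-tuning:

   ```python
   probs = torch.softmax(logits, dim=1)
   print(probs)                                    # e.g. tensor([[0.59, 0.41]])
   print(model.config.id2label[predicted_class])   # 'LABEL_0' or 'LABEL_1' by default
   ```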
### Summary
- **Dictionaries**: Used for managing complex data (e.g., tokenized inputs, model outputs, configurations).
- **Indices**: Used to represent tokens and positions, enabling efficient encoding and decoding.
Together, they facilitate the efficient processing and manipulation of text data in Transformer models.