Add tech_docs/llm/transformer_llm.md

This commit is contained in:
2024-06-14 08:06:48 +00:00
parent d681f3b253
commit 843d0b055e

@@ -0,0 +1,85 @@
When working with Transformer models in libraries like Hugging Face's `transformers`, dictionaries and indices play crucial roles in handling data efficiently. Here's a concise explanation with examples:
### Dictionaries in Transformers
1. **Tokenization:**
- Tokenizers convert text into tokens, outputting a dictionary with keys such as `input_ids`, `attention_mask`, and `token_type_ids`.
- Example:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(inputs)
# {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]),
#  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
#  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
```
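These dictionary entries can also be turned back into text; as a small sketch reusing `inputs` from above, `tokenizer.decode` maps the `input_ids` back to a string:
```python
# Decode the token indices back into text (special tokens like [CLS]/[SEP] included).
print(tokenizer.decode(inputs["input_ids"][0]))
# "[CLS] hello, how are you? [SEP]"
```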
2. **Model Outputs:**
- Model outputs are dictionary-like `ModelOutput` objects with fields such as `logits`, `hidden_states`, and `attentions`.
- Example:
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
outputs = model(**inputs)
print(outputs)
# SequenceClassifierOutput(loss=None, logits=tensor([[0.2438, -0.1436]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
# (exact logits differ from run to run: the classification head of this checkpoint is newly initialized)
```
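Because `ModelOutput` objects behave like ordered dictionaries, fields can be read either by key or by attribute; a minimal sketch reusing `outputs` from above:
```python
# ModelOutput supports both dictionary-style and attribute-style access.
print(outputs["logits"])     # same tensor as outputs.logits
print(list(outputs.keys()))  # ['logits'] -- fields left as None are not stored as keys
```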
3. **Training and Configuration:**
- Training arguments and model configurations are grouped in objects such as `TrainingArguments`, which can be exported to plain dictionaries.
- Example:
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16,
per_device_eval_batch_size=16, warmup_steps=500, weight_decay=0.01, logging_dir='./logs'
)
```
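As a minimal sketch reusing `training_args` from above, the whole configuration can be inspected as an ordinary dictionary via `to_dict()`:
```python
# Export the training configuration as a plain dict for logging or inspection.
config_dict = training_args.to_dict()
print(config_dict["num_train_epochs"])             # 3
print(config_dict["per_device_train_batch_size"])  # 16
```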
### Indices in Transformers
1. **Token Indices:**
- Each token is assigned a unique integer index in the tokenizer's vocabulary. The tokenizer maps tokens to these indices and back.
- Example:
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Hello, how are you?")
token_indices = tokenizer.convert_tokens_to_ids(tokens)
print(token_indices)
# [7592, 1010, 2129, 2024, 2017, 1029]
```
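The mapping also works in the other direction; reusing `token_indices` from above:
```python
# Map vocabulary indices back to their token strings.
print(tokenizer.convert_ids_to_tokens(token_indices))
# ['hello', ',', 'how', 'are', 'you', '?']
```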
2. **Positional Encoding:**
- Positional indices are used to maintain the order of tokens in the sequence, crucial for the Transformer's attention mechanism.
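Here is a minimal sketch of what those positional indices look like. BERT builds them internally, but the optional `position_ids` argument lets you pass them explicitly (reusing `inputs` and `model` from the earlier examples):
```python
import torch

# One position index per token: 0, 1, ..., seq_len - 1, broadcast over the batch.
seq_len = inputs["input_ids"].shape[1]
position_ids = torch.arange(seq_len).unsqueeze(0)      # shape: (1, seq_len)
outputs = model(**inputs, position_ids=position_ids)   # same result as omitting position_ids
```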
### Practical Example
Combining dictionaries and indices in a text classification task:
1. **Tokenization:**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Transformers are powerful models for NLP tasks."
inputs = tokenizer(text, return_tensors="pt")
```
2. **Model Inference:**
```python
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
outputs = model(**inputs)
logits = outputs.logits
```
3. **Post-Processing:**
```python
import torch
predicted_class = torch.argmax(logits, dim=1).item()
print(f"Predicted class: {predicted_class}")
```
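Note that `bert-base-uncased` ships without a trained classification head, so this prediction is essentially random. With a fine-tuned checkpoint (for example the sentiment model `distilbert-base-uncased-finetuned-sst-2-english`, used here purely as an illustration) the predicted index maps to a human-readable label through the model config:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# A checkpoint whose classification head is already fine-tuned (binary sentiment).
ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

inputs = tokenizer("Transformers are powerful models for NLP tasks.", return_tensors="pt")
logits = model(**inputs).logits
predicted_class = torch.argmax(logits, dim=-1).item()
print(model.config.id2label[predicted_class])  # e.g. "POSITIVE"
```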
### Summary
- **Dictionaries**: Used for managing complex data (e.g., tokenized inputs, model outputs, configurations).
- **Indices**: Used to represent tokens and positions, enabling efficient encoding and decoding.
Together, they facilitate the efficient processing and manipulation of text data in Transformer models.