In this post, I implement a simple word completion model, based on Karpathy's Char-RNN, but using supervised online linear learning on word embeddings. More precisely, I use SGDClassifier from scikit-learn,
which is a simple linear classifier that can be updated incrementally.
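To make the "updated incrementally" part concrete before diving into the full script, here is a minimal, self-contained sketch of partial_fit on synthetic data; the toy dataset, batch size and learning rate are placeholders, not values taken from the post.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.01)
classes = np.array([0, 1])

for step in range(100):
    # Stream one small batch at a time: the model is updated in place,
    # without ever holding the full dataset in memory.
    X_batch = rng.normal(size=(8, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")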
Note that this is an illustrative example, based on a few words and a small vocabulary. There are many, many ways to improve the model, and many other configurations could be envisioned. So feel free to experiment with and extend this example. That said, the grammatical structure of the generated text (do not generalize this result just yet) is surprisingly good.
My two-cent, non-scientific (?) extrapolation from this is that artificial neural networks are not intrinsically better than other methods: what is needed is a high-capacity model that is able to learn and generalize well.
Here is how to reproduce the example, assuming you have named the file word-online.py (the repository is named word-online):
uv venv venv --python=3.11
source venv/bin/activate
uv pip install -r requirements.txt
python word-online.py
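The contents of requirements.txt are not shown in the post; judging from the imports in word-online.py, a minimal version could look like this (version pins omitted, adjust as needed):

numpy
scipy
scikit-learn
gensim
tqdm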
word-online.py contains the following code:
import numpy as np
import gensim
import time  # Added for the delay parameter
from collections import deque
from tqdm import tqdm
from scipy.special import softmax
from sklearn.linear_model import SGDClassifier

# Sample text
text = """Hello world, this is an online learning example with word embeddings.
It learns words and generates text incrementally using an SGD classifier."""

def debug_print(x):
    print(f"{x}")

# Tokenization (simple space-based)
words = text.lower().split()
vocab = sorted(set(words))
vocab.append("")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim = 50  # Change to 100/300 if using a larger model
word2vec = gensim.models.Word2Vec([words], vector_size=embedding_dim, window=5, min_count=1, sg=0)

# Create word-to-index mapping
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Hyperparameters
context_size = 12  # Default 10, words used for prediction context
learning_rate = 0.005
epochs = 10

# Prepare training data
X_train, y_train = [], []
for i in tqdm(range(len(words) - context_size)):
    context = words[i:i + context_size]
    target = words[i + context_size]
    # Convert context words to embeddings
    context_embedding = np.concatenate([word2vec.wv[word] for word in context])
    X_train.append(context_embedding)
    y_train.append(word_to_idx[target])

X_train, y_train = np.array(X_train), np.array(y_train)

# Initialize SGD-based classifier
clf = SGDClassifier(loss="hinge", max_iter=1, learning_rate="constant", eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
for epoch in tqdm(range(epochs)):
    for i in range(len(X_train)):
        clf.partial_fit([X_train[i]], [y_train[i]], classes=np.arange(len(vocab)))

# 🔥 Softmax function for probability scaling
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))  # Stability trick
    return exp_logits / np.sum(exp_logits)

def sample_from_logits(logits, k=5, temperature=1.0, random_seed=123):
    """
    Applies Top-K sampling & Temperature scaling
    """
    logits = np.array(logits) / temperature  # Apply temperature scaling
    probs = softmax(logits)  # Convert logits to probabilities

    # Select top-K indices
    top_k_indices = np.argsort(probs)[-k:]
    top_k_probs = probs[top_k_indices]
    top_k_probs /= top_k_probs.sum()  # Normalize

    # Sample from Top-K distribution
    np.random.seed(random_seed)
    return np.random.choice(top_k_indices, p=top_k_probs)

def generate_text(seed="this is", length=20, k=5, temperature=1.0, random_state=123, delay=3):
    seed_words = seed.lower().split()

    # Ensure context has `context_size` words (pad with zero vectors if needed)
    while len(seed_words) < context_size:
        seed_words.insert(0, " ")

    context = deque(
        [word_to_idx[word] if word in word_to_idx else -1 for word in seed_words[-context_size:]],
        maxlen=context_size
    )

    generated = seed
    previous_word = seed

    for _ in range(length):
        # Generate embeddings, use a zero vector if word is missing
        context_embedding = np.concatenate([
            word2vec.wv[idx_to_word[idx]] if idx in idx_to_word else np.zeros(embedding_dim)
            for idx in context
        ])

        logits = clf.decision_function([context_embedding])[0]  # Get raw scores

        # Sample next word using Top-K & Temperature scaling
        pred_idx = sample_from_logits(logits, k=k, temperature=temperature)
        next_word = idx_to_word.get(pred_idx, " ")
        print(f"Generating next word: {next_word}")  # Added this line
        time.sleep(delay)  # Added this line

        if previous_word[-1] == "." and previous_word[-1] != "" and previous_word[-1] != seed:
            generated += " " + next_word.capitalize()
        else:
            generated += " " + next_word

        previous_word = next_word
        context.append(pred_idx)

    return generated

# 🔥 Generate text
print("\n\n Generated Text:")
seed = "This is a"
print(seed)
print(generate_text(seed, length=12, k=1, delay=0))  # delay seconds for next word generation, optimal for delay=0 seconds

100%|████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12164.45it/s]
100%|███████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 8.34it/s]

Generated Text:
This is a
Generating next word: classifier.
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
Generating next word: and
Generating next word: generates
Generating next word: text
Generating next word: incrementally
Generating next word: using
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
This is a classifier. An sgd classifier. And generates text incrementally using an sgd classifier.
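Since the classifier is trained through partial_fit, nothing prevents it from continuing to learn after this first pass. Here is a minimal sketch, not part of word-online.py, assuming that clf, word2vec, word_to_idx, vocab, context_size and generate_text defined above are still in scope; the new sentence is a made-up example that only reuses words already present in the vocabulary, since neither the Word2Vec model nor the class set is extended here.

# Hypothetical follow-up: one more round of online updates on a new sentence.
# All tokens below already exist in `vocab` and in the Word2Vec vocabulary.
new_text = ("this is an online learning example with word embeddings. "
            "it learns words and generates text incrementally")
new_words = new_text.lower().split()

for i in range(len(new_words) - context_size):
    context = new_words[i:i + context_size]
    target = new_words[i + context_size]
    x = np.concatenate([word2vec.wv[w] for w in context])
    # Incremental update; `classes` must match the class set used so far
    clf.partial_fit([x], [word_to_idx[target]], classes=np.arange(len(vocab)))

# Generate again with the updated model
print(generate_text("this is a", length=8, k=1, delay=0))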
Here is an R version of the same example (run in a notebook cell via the %%R magic), calling numpy, gensim and scikit-learn through reticulate; note that it uses a slightly different sample text:

%%R
library(reticulate)
library(progress)
library(stats)

# Initialize Python modules through reticulate
np <- import("numpy")
gensim <- import("gensim")
time <- import("time")  # Added for the delay parameter

# Sample text
text <- "This is a model used for classification purposes. It applies continuous learning on word vectors, converting words into embeddings, learning from those embeddings, and gradually producing text through the iterative process of an SGD classifier."

debug_print <- function(x) {
  print(paste0(x))
}

# Tokenization (simple space-based)
words <- strsplit(tolower(text), "\\s+")[[1L]]
vocab <- sort(unique(words))
vocab <- c(vocab, "")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim <- 50L  # Change to 100/300 if using a larger model
word2vec <- gensim$models$Word2Vec(list(words), vector_size=embedding_dim, window=5L,
                                   min_count=1L, sg=0L)

# Ensure " " is in the Word2Vec vocabulary
# This is the crucial step to fix the KeyError
if (!(" " %in% word2vec$wv$index_to_key)) {
  word2vec$wv$add_vector(" ", rep(0, embedding_dim))  # Add " " with a zero vector
}

# Create word-to-index mapping
word_to_idx <- setNames(seq_along(vocab) - 1L, vocab)  # 0-based indexing to match Python
idx_to_word <- setNames(vocab, as.character(word_to_idx))

# Hyperparameters
context_size <- 12L  # Default 10, words used for prediction context
learning_rate <- 0.005
epochs <- 10L

# Prepare training data
X_train <- list()
y_train <- list()
pb <- progress_bar$new(total = length(words) - context_size)
for (i in 1L:(length(words) - context_size)) {
  context <- words[i:(i + context_size - 1L)]
  target <- words[i + context_size]
  # Convert context words to embeddings
  context_vectors <- lapply(context, function(word) as.array(word2vec$wv[word]))
  context_embedding <- np$concatenate(context_vectors)
  X_train[[i]] <- context_embedding
  y_train[[i]] <- word_to_idx[[target]]
  pb$tick()
}

# Initialize SGD-based classifier
sklearn <- import("sklearn.linear_model")
clf <- sklearn$SGDClassifier(loss="hinge", max_iter=1L, learning_rate="constant",
                             eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
pb <- progress_bar$new(total = epochs)
for (epoch in 1L:epochs) {
  for (i in 1L:length(X_train)) {
    # Use the list version for indexing individual samples
    clf$partial_fit(
      np$array(list(X_train[[i]])),
      np$array(list(y_train[[i]])),
      classes=np$arange(length(vocab))
    )
  }
  pb$tick()
}

# Softmax function for probability scaling
softmax_fn <- function(logits) {
  exp_logits <- exp(logits - max(logits))  # Stability trick
  return(exp_logits / sum(exp_logits))
}

sample_from_logits <- function(logits, k=5L, temperature=1.0, random_seed=123L) {
  # Applies Top-K sampling & Temperature scaling
  logits <- as.numeric(logits) / temperature  # Apply temperature scaling
  probs <- softmax_fn(logits)  # Convert logits to probabilities

  # Select top-K indices - ensure k doesn't exceed the length of logits
  k <- min(k, length(logits))
  sorted_indices <- order(probs)
  top_k_indices <- sorted_indices[(length(sorted_indices) - k + 1L):length(sorted_indices)]

  # Handle case where k=1 specially
  if (k == 1L) {
    return(top_k_indices)
  }

  top_k_probs <- probs[top_k_indices]
  # Ensure probabilities sum to 1
  top_k_probs <- top_k_probs / sum(top_k_probs)

  # Check if all probabilities are valid
  if (any(is.na(top_k_probs)) || length(top_k_probs) != length(top_k_indices)) {
    # If there are issues with probabilities, just return the highest probability item
    return(top_k_indices[which.max(probs[top_k_indices])])
  }

  # Sample from Top-K distribution
  set.seed(random_seed)
  return(sample(top_k_indices, size=1L, prob=top_k_probs))
}

generate_text <- function(seed="this is", length=20L, k=5L, temperature=1.0,
                          random_state=123L, delay=3L) {
  seed_words <- strsplit(tolower(seed), "\\s+")[[1L]]

  # Ensure context has `context_size` words (pad with zero vectors if needed)
  while (length(seed_words) < context_size) {
    seed_words <- c(" ", seed_words)
  }

  # Use a fixed-size list as a ring buffer
  context <- vector("list", context_size)
  for (i in 1L:context_size) {
    word <- tail(seed_words, context_size)[i]
    if (word %in% names(word_to_idx)) {
      context[[i]] <- word_to_idx[[word]]
    } else {
      context[[i]] <- -1L
    }
  }

  # Track position in the ring buffer
  context_pos <- 1L
  generated <- seed
  previous_word <- seed

  for (i in 1L:length) {
    # Generate embeddings, use a zero vector if word is missing
    context_vectors <- list()
    for (idx in unlist(context)) {
      if (as.character(idx) %in% names(idx_to_word)) {
        word <- idx_to_word[[as.character(idx)]]
        context_vectors <- c(context_vectors, list(as.array(word2vec$wv[word])))
      } else {
        context_vectors <- c(context_vectors, list(np$zeros(embedding_dim)))
      }
    }
    context_embedding <- np$concatenate(context_vectors)

    logits <- clf$decision_function(np$array(list(context_embedding)))[1L, ]

    # Sample next word using Top-K & Temperature scaling
    pred_idx <- sample_from_logits(logits, k=k, temperature=temperature,
                                   random_seed=random_state + i)
    next_word <- if (as.character(pred_idx) %in% names(idx_to_word)) {
      idx_to_word[[as.character(pred_idx)]]
    } else {
      " "
    }

    print(paste0("Generating next word: ", next_word))
    if (delay > 0) {
      time$sleep(delay)  # Added delay
    }

    if (substr(previous_word, nchar(previous_word), nchar(previous_word)) == "." &&
        previous_word != "" && previous_word != seed) {
      generated <- paste0(generated, " ", toupper(substr(next_word, 1, 1)),
                          substr(next_word, 2, nchar(next_word)))
    } else {
      generated <- paste0(generated, " ", next_word)
    }

    previous_word <- next_word
    # Update context (ring buffer style)
    context[[context_pos]] <- pred_idx
    context_pos <- (context_pos %% context_size) + 1L
  }

  return(generated)
}

cat("\n\n Generated Text:\n")
seed <- "This classifier is"
cat(seed, "\n")
result <- generate_text(seed, length=2L, k=3L, delay=0L)  # delay seconds for next word generation
print(result)

Generated Text:
This classifier is
[1] "Generating next word: for"
[1] "Generating next word: text"
[1] "This classifier is for text"