In this post, I implement a simple word completion model, based on Karpathy's Char-RNN, but using supervised online linear learning on word embeddings. More precisely, I use SGDClassifier from scikit-learn,
which is a simple linear classifier that can be updated incrementally.
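To make the "updated incrementally" part concrete before diving into the full script, here is a minimal, self-contained sketch of partial_fit on synthetic data; the toy dataset, batch size and learning rate are placeholders, not values taken from the post.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.01)
classes = np.array([0, 1])

for step in range(100):
    # Stream one small batch at a time: the model is updated in place,
    # without ever holding the full dataset in memory.
    X_batch = rng.normal(size=(8, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")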
Note that this is an illustrative example, based on a few words and a small vocabulary. There are many, many ways to improve the model, and many other configurations could be envisioned. So feel free to experiment with and extend this example. That said, the grammatical structure of the generated text (do not generalize this result just yet) is surprisingly good.
My two-cent, non-scientific (?) extrapolation from this is that artificial neural networks are not intrinsically better than other methods: what is needed is a high-capacity model that is able to learn and generalize well.
Here is how to reproduce the example, assuming you have named the file word-online.py (the repository is named word-online):
uv venv venv --python=3.11
source venv/bin/activate
uv pip install -r requirements.txt
python word-online.py
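The contents of requirements.txt are not shown in the post; judging from the imports in word-online.py, a minimal version could look like this (version pins omitted, adjust as needed):

numpy
scipy
scikit-learn
gensim
tqdm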
word-online.py contains the following code:
import numpy as np
import gensim
import time  # Added for the delay parameter
from collections import deque
from tqdm import tqdm
from scipy.special import softmax
from sklearn.linear_model import SGDClassifier

# Sample text
text = """Hello world, this is an online learning example with word embeddings.
It learns words and generates text incrementally using an SGD classifier."""

def debug_print(x):
    print(f"{x}")

# Tokenization (simple space-based)
words = text.lower().split()
vocab = sorted(set(words))
vocab.append("")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim = 50  # Change to 100/300 if using a larger model
word2vec = gensim.models.Word2Vec([words], vector_size=embedding_dim, window=5, min_count=1, sg=0)

# Create word-to-index mapping
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Hyperparameters
context_size = 12  # Default 10, words used for prediction context
learning_rate = 0.005
epochs = 10

# Prepare training data
X_train, y_train = [], []
for i in tqdm(range(len(words) - context_size)):
    context = words[i:i + context_size]
    target = words[i + context_size]
    # Convert context words to embeddings
    context_embedding = np.concatenate([word2vec.wv[word] for word in context])
    X_train.append(context_embedding)
    y_train.append(word_to_idx[target])

X_train, y_train = np.array(X_train), np.array(y_train)

# Initialize SGD-based classifier
clf = SGDClassifier(loss="hinge", max_iter=1, learning_rate="constant", eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
for epoch in tqdm(range(epochs)):
    for i in range(len(X_train)):
        clf.partial_fit([X_train[i]], [y_train[i]], classes=np.arange(len(vocab)))

# 🔥 Softmax function for probability scaling
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))  # Stability trick
    return exp_logits / np.sum(exp_logits)

def sample_from_logits(logits, k=5, temperature=1.0, random_seed=123):
    """
    Applies Top-K sampling & Temperature scaling
    """
    logits = np.array(logits) / temperature  # Apply temperature scaling
    probs = softmax(logits)  # Convert logits to probabilities

    # Select top-K indices
    top_k_indices = np.argsort(probs)[-k:]
    top_k_probs = probs[top_k_indices]
    top_k_probs /= top_k_probs.sum()  # Normalize

    # Sample from Top-K distribution
    np.random.seed(random_seed)
    return np.random.choice(top_k_indices, p=top_k_probs)

def generate_text(seed="this is", length=20, k=5, temperature=1.0, random_state=123, delay=3):
    seed_words = seed.lower().split()

    # Ensure context has `context_size` words (pad with zero vectors if needed)
    while len(seed_words) < context_size:
        seed_words.insert(0, " ")

    context = deque(
        [word_to_idx[word] if word in word_to_idx else -1 for word in seed_words[-context_size:]],
        maxlen=context_size
    )

    generated = seed
    previous_word = seed

    for _ in range(length):
        # Generate embeddings, use a zero vector if word is missing
        context_embedding = np.concatenate([
            word2vec.wv[idx_to_word[idx]] if idx in idx_to_word else np.zeros(embedding_dim)
            for idx in context
        ])

        logits = clf.decision_function([context_embedding])[0]  # Get raw scores

        # Sample next word using Top-K & Temperature scaling
        pred_idx = sample_from_logits(logits, k=k, temperature=temperature)
        next_word = idx_to_word.get(pred_idx, " ")
        print(f"Generating next word: {next_word}")  # Added this line
        time.sleep(delay)  # Added this line

        if previous_word[-1] == "." and previous_word[-1] != "" and previous_word[-1] != seed:
            generated += " " + next_word.capitalize()
        else:
            generated += " " + next_word

        previous_word = next_word
        context.append(pred_idx)

    return generated

# 🔥 Generate text
print("\n\n Generated Text:")
seed = "This is a"
print(seed)
print(generate_text(seed, length=12, k=1, delay=0))  # delay seconds for next word generation, optimal for delay=0 seconds

100%|████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12164.45it/s]
100%|███████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 8.34it/s]

Generated Text:
This is a
Generating next word: classifier.
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
Generating next word: and
Generating next word: generates
Generating next word: text
Generating next word: incrementally
Generating next word: using
Generating next word: an
Generating next word: sgd
Generating next word: classifier.
This is a classifier. An sgd classifier. And generates text incrementally using an sgd classifier.
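Since the classifier is trained through partial_fit, nothing prevents it from continuing to learn after this first pass. Here is a minimal sketch, not part of word-online.py, assuming that clf, word2vec, word_to_idx, vocab, context_size and generate_text defined above are still in scope; the new sentence is a made-up example that only reuses words already present in the vocabulary, since neither the Word2Vec model nor the class set is extended here.

# Hypothetical follow-up: one more round of online updates on a new sentence.
# All tokens below already exist in `vocab` and in the Word2Vec vocabulary.
new_text = ("this is an online learning example with word embeddings. "
            "it learns words and generates text incrementally")
new_words = new_text.lower().split()

for i in range(len(new_words) - context_size):
    context = new_words[i:i + context_size]
    target = new_words[i + context_size]
    x = np.concatenate([word2vec.wv[w] for w in context])
    # Incremental update; `classes` must match the class set used so far
    clf.partial_fit([x], [word_to_idx[target]], classes=np.arange(len(vocab)))

# Generate again with the updated model
print(generate_text("this is a", length=8, k=1, delay=0))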
Here is an R version of the same example (run in a notebook cell via the %%R magic), calling numpy, gensim and scikit-learn through reticulate; note that it uses a slightly different sample text:

%%R
library(reticulate)
library(progress)
library(stats)

# Initialize Python modules through reticulate
np <- import("numpy")
gensim <- import("gensim")
time <- import("time")  # Added for the delay parameter

# Sample text
text <- "This is a model used for classification purposes. It applies continuous learning on word vectors, converting words into embeddings, learning from those embeddings, and gradually producing text through the iterative process of an SGD classifier."

debug_print <- function(x) {
  print(paste0(x))
}

# Tokenization (simple space-based)
words <- strsplit(tolower(text), "\\s+")[[1L]]
vocab <- sort(unique(words))
vocab <- c(vocab, "")  # Add unknown token for OOV words

# Train Word2Vec model (or load pretrained embeddings)
embedding_dim <- 50L  # Change to 100/300 if using a larger model
word2vec <- gensim$models$Word2Vec(list(words), vector_size=embedding_dim, window=5L,
                                   min_count=1L, sg=0L)

# Ensure " " is in the Word2Vec vocabulary
# This is the crucial step to fix the KeyError
if (!(" " %in% word2vec$wv$index_to_key)) {
  word2vec$wv$add_vector(" ", rep(0, embedding_dim))  # Add " " with a zero vector
}

# Create word-to-index mapping
word_to_idx <- setNames(seq_along(vocab) - 1L, vocab)  # 0-based indexing to match Python
idx_to_word <- setNames(vocab, as.character(word_to_idx))

# Hyperparameters
context_size <- 12L  # Default 10, words used for prediction context
learning_rate <- 0.005
epochs <- 10L

# Prepare training data
X_train <- list()
y_train <- list()
pb <- progress_bar$new(total = length(words) - context_size)
for (i in 1L:(length(words) - context_size)) {
  context <- words[i:(i + context_size - 1L)]
  target <- words[i + context_size]
  # Convert context words to embeddings
  context_vectors <- lapply(context, function(word) as.array(word2vec$wv[word]))
  context_embedding <- np$concatenate(context_vectors)
  X_train[[i]] <- context_embedding
  y_train[[i]] <- word_to_idx[[target]]
  pb$tick()
}

# Initialize SGD-based classifier
sklearn <- import("sklearn.linear_model")
clf <- sklearn$SGDClassifier(loss="hinge", max_iter=1L, learning_rate="constant",
                             eta0=learning_rate)

# Online training (stochastic updates, multiple passes)
pb <- progress_bar$new(total = epochs)
for (epoch in 1L:epochs) {
  for (i in 1L:length(X_train)) {
    # Use the list version for indexing individual samples
    clf$partial_fit(
      np$array(list(X_train[[i]])),
      np$array(list(y_train[[i]])),
      classes=np$arange(length(vocab))
    )
  }
  pb$tick()
}

# Softmax function for probability scaling
softmax_fn <- function(logits) {
  exp_logits <- exp(logits - max(logits))  # Stability trick
  return(exp_logits / sum(exp_logits))
}

sample_from_logits <- function(logits, k=5L, temperature=1.0, random_seed=123L) {
  # Applies Top-K sampling & Temperature scaling
  logits <- as.numeric(logits) / temperature  # Apply temperature scaling
  probs <- softmax_fn(logits)  # Convert logits to probabilities

  # Select top-K indices - ensure k doesn't exceed the length of logits
  k <- min(k, length(logits))
  sorted_indices <- order(probs)
  top_k_indices <- sorted_indices[(length(sorted_indices) - k + 1L):length(sorted_indices)]

  # Handle case where k=1 specially
  if (k == 1L) {
    return(top_k_indices)
  }

  top_k_probs <- probs[top_k_indices]
  # Ensure probabilities sum to 1
  top_k_probs <- top_k_probs / sum(top_k_probs)

  # Check if all probabilities are valid
  if (any(is.na(top_k_probs)) || length(top_k_probs) != length(top_k_indices)) {
    # If there are issues with probabilities, just return the highest probability item
    return(top_k_indices[which.max(probs[top_k_indices])])
  }

  # Sample from Top-K distribution
  set.seed(random_seed)
  return(sample(top_k_indices, size=1L, prob=top_k_probs))
}

generate_text <- function(seed="this is", length=20L, k=5L, temperature=1.0,
                          random_state=123L, delay=3L) {
  seed_words <- strsplit(tolower(seed), "\\s+")[[1L]]

  # Ensure context has `context_size` words (pad with zero vectors if needed)
  while (length(seed_words) < context_size) {
    seed_words <- c(" ", seed_words)
  }

  # Use a fixed-size list as a ring buffer
  context <- vector("list", context_size)
  for (i in 1L:context_size) {
    word <- tail(seed_words, context_size)[i]
    if (word %in% names(word_to_idx)) {
      context[[i]] <- word_to_idx[[word]]
    } else {
      context[[i]] <- -1L
    }
  }

  # Track position in the ring buffer
  context_pos <- 1L
  generated <- seed
  previous_word <- seed

  for (i in 1L:length) {
    # Generate embeddings, use a zero vector if word is missing
    context_vectors <- list()
    for (idx in unlist(context)) {
      if (as.character(idx) %in% names(idx_to_word)) {
        word <- idx_to_word[[as.character(idx)]]
        context_vectors <- c(context_vectors, list(as.array(word2vec$wv[word])))
      } else {
        context_vectors <- c(context_vectors, list(np$zeros(embedding_dim)))
      }
    }
    context_embedding <- np$concatenate(context_vectors)

    logits <- clf$decision_function(np$array(list(context_embedding)))[1L, ]

    # Sample next word using Top-K & Temperature scaling
    pred_idx <- sample_from_logits(logits, k=k, temperature=temperature,
                                   random_seed=random_state + i)
    next_word <- if (as.character(pred_idx) %in% names(idx_to_word)) {
      idx_to_word[[as.character(pred_idx)]]
    } else {
      " "
    }

    print(paste0("Generating next word: ", next_word))
    if (delay > 0) {
      time$sleep(delay)  # Added delay
    }

    if (substr(previous_word, nchar(previous_word), nchar(previous_word)) == "." &&
        previous_word != "" && previous_word != seed) {
      generated <- paste0(generated, " ", toupper(substr(next_word, 1, 1)),
                          substr(next_word, 2, nchar(next_word)))
    } else {
      generated <- paste0(generated, " ", next_word)
    }

    previous_word <- next_word
    # Update context (ring buffer style)
    context[[context_pos]] <- pred_idx
    context_pos <- (context_pos %% context_size) + 1L
  }

  return(generated)
}

cat("\n\n Generated Text:\n")
seed <- "This classifier is"
cat(seed, "\n")
result <- generate_text(seed, length=2L, k=3L, delay=0L)  # delay seconds for next word generation
print(result)

Generated Text:
This classifier is
[1] "Generating next word: for"
[1] "Generating next word: text"
[1] "This classifier is for text"