Readme

Part of Val Town Semantic Search.

Generates OpenAI embeddings for all public vals, and stores them in Val Town's blob storage.

  • Create a new metadata object. Also has support for getting the previous metadata and only adding new vals, but that's currently disabled.
  • Get all val names from the database of public vals, made by Achille Lacoin.
  • Put val names in batches. Vals in the same batch will have their embeddings stored in the same blob, at different offsets.
  • Iterate through all each batch, get code for all the vals, get embeddings from OpenAI, and store the result in a blob.
  • When finished, save the metadata JSON to its own blob.
  • Can now be searched using janpaul123/semanticSearchBlobs.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
import { decode as base64Decode, encode as base64Encode } from "https://deno.land/std@0.166.0/encoding/base64.ts";
import getValCode from "https://esm.town/v/janpaul123/getValCode";
import { sqlToJSON } from "https://esm.town/v/nbbaier/sqliteExportHelpers?v=22";
import { db as allValsDb } from "https://esm.town/v/sqlite/db?v=9";
import { blob } from "https://esm.town/v/std/blob";
import OpenAI from "npm:openai";
import { truncateMessage } from "npm:openai-tokens";
const dimensions = 1536;
const allVals = await sqlToJSON(
await allValsDb.execute("SELECT author_username, name, version FROM vals WHERE LENGTH(code) > 10 ORDER BY name"),
) as any;
// const allValsBlobEmbeddingsMeta = (await blob.getJSON(`allValsBlob${dimensions}EmbeddingsMeta`)) ?? {};
const allValsBlobEmbeddingsMeta = {};
const existingEmbeddingsIds = new Set(Object.keys(allValsBlobEmbeddingsMeta));
function idForVal(val: any): string {
return `${val.author_username}!!${val.name}!!${val.version}`;
}
const newValsBatches = [[]];
let currentBatch = newValsBatches[0];
for (const val of allVals) {
const id = idForVal(val);
if (!existingEmbeddingsIds.has(id)) {
currentBatch.push(val);
}
if (currentBatch.length >= 500) {
currentBatch = [];
newValsBatches.push(currentBatch);
}
}
let nextDataIndex = Math.max(
0,
...Object.values(allValsBlobEmbeddingsMeta).map((item: any) => item.batchDataIndex + 1),
);
const openai = new OpenAI();
for (const newValsBatch of newValsBatches) {
const batchDataIndex = nextDataIndex;
const embeddingsBatch = new Float32Array(dimensions * newValsBatch.length);
await Promise.all([...Array(newValsBatch.length).keys()].map(async (valIndex) => {
const val = newValsBatch[valIndex];
const code = getValCode(val);
const embedding = await openai.embeddings.create({
model: "text-embedding-3-small",
input: truncateMessage(code, "text-embedding-3-small"),
encoding_format: "base64",
dimensions,
});
const embeddingBinary = new Float32Array(base64Decode(embedding.data[0].embedding as any).buffer);
if (embeddingBinary.length != dimensions) {
throw new Error(`Invalid embeddingBinary.length: ${embeddingBinary.length}`);
}
const id = idForVal(val);
embeddingsBatch.set(embeddingBinary, dimensions * valIndex);
allValsBlobEmbeddingsMeta[id] = { batchDataIndex, valIndex };
}));
const embeddingsBatchBlobName = `allValsBlob${dimensions}EmbeddingsData_${batchDataIndex}`;
await blob.set(embeddingsBatchBlobName, embeddingsBatch.buffer);
await blob.setJSON(`allValsBlob${dimensions}EmbeddingsMeta`, allValsBlobEmbeddingsMeta);
console.log(
`Saved batch to ${embeddingsBatchBlobName} with ${newValsBatch.length} records (${embeddingsBatch.byteLength} bytes) ${
batchDataIndex + 1
}/${newValsBatches.length}`,
);
nextDataIndex++;
}
console.log(`Finished, we have indexed ${Object.keys(allValsBlobEmbeddingsMeta).length} records`);
Val Town is a social website to write and deploy JavaScript.
Build APIs and schedule functions from your browser.
Comments
Nobody has commented on this val yet: be the first!
v37
June 17, 2024