mikehiggins

Joined November 8, 2024
Public vals
3
tfidf_analyser_90s_version
HTTP
// Expanded stopwords list
lovelyYellowXerinae
Script
// Expanded stopwords list with additional web-specific terms
sanguineCyanMastodon
HTTP
Project Name: Radical Text Analyser
Primary Purpose: A web-based text analysis tool that counts words, calculates TF-IDF, analyzes sentiment, and generates fun facts about the input text.
Main Technologies: Hono framework, HTML parsing, JS Base64 encoding, and Node.js.

Word and character counts
TF-IDF score calculation for the importance of phrases
Word frequency analysis
Basic sentiment analysis (positive, negative, or neutral)
Fun fact generation about the text (e.g., average word length)
Summary generation

The tool is implemented in JavaScript using Hono as a lightweight framework and integrates web APIs for fetching external content and analyzing it in real time.

Code Structure

Imports
Hono: Used for routing and handling API requests.
encode (from JS Base64): Utility for encoding data.
parse (from Node HTML Parser): Parses HTML content, used in fetchUrlContent to clean and extract text.

Core Modules
app: Initializes the main app instance with Hono.
stopwords: Contains a list of common stop words and web-specific terms to filter out during text analysis.
synonyms: Dictionary mapping common words to their synonyms for highlighting in TF-IDF analysis.

Functions
The primary functions perform specific text analysis tasks, as detailed in the following sections.

Function Documentation
Each function's purpose, parameters, and output are described below, along with notes on how they are used within the application.

analyseSentiment(text)
Purpose: Analyzes the sentiment of a given text.
Parameters: text (string): The text to be analyzed.
Returns: An object with score (numerical sentiment score) and sentiment (positive, neutral, or negative).
Usage: Called within the /analyse endpoint to evaluate the overall sentiment of the user’s input.

calculateTFIDF(text, outputSize=7)
Purpose: Calculates TF-IDF scores to measure the importance of words/phrases in the text.
Parameters: text (string): The text for analysis. outputSize (number, optional): Limits the number of phrases to return.
Returns: Array of top phrases with scores, filtered for uniqueness and length constraints.
Usage: Called in /analyse to highlight significant terms.

calculateWordFrequency(text, topN=30)
Purpose: Counts word frequency, excluding stop words and web-specific terms.
Parameters: text (string): Input text. topN (number, optional): Number of top words to return.
Returns: Array of most frequent words with counts.
Usage: Used within /analyse to populate the word frequency analysis section.

generateFunFact(text)
Purpose: Generates a fun fact based on text statistics (e.g., word length).
Parameters: text (string): Text for analysis.
Returns: String with a randomly chosen fun fact.
Usage: Triggered in /analyse to add a unique, informational element to results.

generateSummary(text, sentenceCount=3)
Purpose: Extracts a summary from the first few sentences of the input.
Parameters: text (string): Text for summarization. sentenceCount (number, optional): Number of sentences in the summary.
Returns: Summary string.
Usage: Included in /analyse for quick overviews.

cleanText(text)
Purpose: Cleans HTML and unwanted characters from the fetched content.
Parameters: text (string): Raw HTML/text.
Returns: Cleaned text string.
Usage: Essential for preprocessing content in fetchUrlContent.

fetchUrlContent(url)
Purpose: Fetches HTML content from a URL, removes unwanted elements, and extracts main text.
Parameters: url (string): URL to fetch and clean content from.
Returns: Cleaned text string or an error message.
Usage: This function powers the URL input functionality.

Core Functionalities

Text Analysis Form (HTML)
The form on the main page allows users to submit either raw text or a URL. The client-side JavaScript processes the form submission and, if needed, triggers a URL fetch to obtain content.

Backend Processing (Routes)
GET /: Serves the main HTML page.
POST /analyse: Analyzes submitted text and returns word count, sentiment, summary, TF-IDF, and word frequency results.
POST /fetch-url: Retrieves and cleans content from a URL, then prepares it for analysis.

Dark Mode and Interactivity
Dark Mode Toggle: Client-side JavaScript manages style changes.
Share Feature: Allows users to copy a link with encoded text or URL.
Word Cloud: jQCloud generates a word cloud based on frequency data.

Maintenance Notes

Testing
Integration Tests: Ensure calculateTFIDF, calculateWordFrequency, and fetchUrlContent produce consistent outputs.
Client-Side Testing: Regularly test URL fetching functionality to ensure robustness, particularly with dynamic sites.

Error Handling
Maintain try/catch blocks in asynchronous functions like fetchUrlContent to capture network issues.
Update error messages to be user-friendly and provide feedback if inputs are invalid or exceed size limits.

Data Sanitization
Verify that cleanText removes sensitive information from fetched URLs to prevent accidental disclosure.

Rate Limits and API Usage
Consider implementing rate limiting or caching mechanisms to manage heavy usage or repeated requests for the same content.

Future Improvements

Enhanced Synonym Dictionary
Expand synonyms for better word variation support during analysis.

Advanced Sentiment Analysis
Implement a sentiment analysis library or API for more nuanced scoring and classification.

Multilingual Support
Extend stop words, sentiment analysis, and cleaning functions to support languages beyond English.

The Radical Text Analyser code primarily uses the following APIs:

Hono Framework API
Purpose: Used to set up routing and serve responses for different endpoints (GET /, POST /analyse, and POST /fetch-url).
Usage: The Hono framework handles HTTP requests and responses, forming the backbone of the server-side API.

External Content Fetching API (via fetch)
Purpose: Retrieves HTML content from external URLs when users input a URL for analysis.
Usage: The fetchUrlContent function uses fetch to make a GET request to the user-provided URL, which allows the application to process content from web pages.
Error Handling: Includes logic to check response status and handle various network errors.

Google Fonts API
Purpose: Loads the "VT323" font for styling the front-end.
Usage: Linked in the HTML section.

jQCloud Library (via CDN)
Purpose: Used for generating a word cloud from frequency data on the client side.
Usage: Loaded in the HTML section.

The TF-IDF analysis in this code is performed without using an external library. Instead, it’s implemented directly in the calculateTFIDF function. Here’s how it works:

Tokenization and Filtering: The text is split into words and phrases, excluding stopwords and web-specific terms.
Term Frequency (TF): For each unique phrase, the code calculates how frequently it appears in the text relative to the total number of words.
Inverse Document Frequency (IDF): Each term’s IDF is calculated using a formula that reduces the weight of commonly occurring phrases.
Scoring: TF and IDF are multiplied, adjusted by phrase length and rarity, to boost significant phrases.
Output: Top phrases are sorted by score, filtered for uniqueness, and returned as the most relevant terms.

This custom implementation allows for added customization, such as phrase filtering and length adjustments, which wouldn’t be as easily configurable with a standard library.

IDF = ln(Total Word Count / Document Frequency of Term) + 1

Total Word Count: The total number of words in the input text.
Document Frequency of Term: The number of documents containing the term, which is set to 1 in this single-document context to avoid division by zero.
+1 Adjustment: Adding 1 ensures that terms appearing once don't have an IDF of zero, which would nullify their score.
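
The TF and IDF steps described above can be sketched in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not the val's actual calculateTFIDF: the function name sketchTfidf and the short STOPWORDS list are hypothetical, only single words are scored (no phrases), and the phrase-length and rarity adjustments mentioned above are left out because their details are not given here.

```javascript
// Illustrative sketch only; sketchTfidf and STOPWORDS are hypothetical names,
// not taken from the val's source.
const STOPWORDS = new Set(["the", "a", "an", "and", "of", "to", "in", "is", "it", "on"]);

function sketchTfidf(text, outputSize = 7) {
  // Tokenize: lowercase, keep alphabetic words (with an optional inner apostrophe).
  const words = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)?/g) || [];
  const totalWords = words.length;
  if (totalWords === 0) return [];

  // Count occurrences of each word that is not a stopword.
  const counts = new Map();
  for (const word of words) {
    if (STOPWORDS.has(word)) continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // IDF as documented: ln(total word count / document frequency) + 1,
  // with document frequency fixed at 1 for a single document.
  const documentFrequency = 1;
  const idf = Math.log(totalWords / documentFrequency) + 1;

  // Score = TF * IDF, where TF is the count relative to the total word count.
  return [...counts.entries()]
    .map(([term, count]) => ({ term, score: (count / totalWords) * idf }))
    .sort((a, b) => b.score - a.score || a.term.localeCompare(b.term))
    .slice(0, outputSize);
}

// Usage: "cat" appears twice, so it ranks first.
const top = sketchTfidf("the cat sat on the mat and the cat slept", 3);
console.log(top.map((entry) => entry.term));
```

With the document frequency fixed at 1, the IDF term is the same constant for every word in a given text, so in this simplified form the ranking is driven by term frequency alone. Any further differentiation between terms would have to come from the phrase-length and rarity adjustments described above.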