Text Mining

Web-Based Tools and Platforms (Often No-Code/Low-Code)
  • Voyant Tools: Free, web-based environment for analyzing texts with interactive visualizations and analyses.
  • TAPoR (Text Analysis Portal for Research): Curated gateway to various text analysis and retrieval tools.
  • Gephi: Open-source desktop software for network analysis and visualization.
  • Palladio: Web-based tool for generating network graphs, maps, and timelines.
  • Other Tools: Topic Modelling Tool, Stanford Named Entity Recognizer (NER), web-based Sentiment Analyzers, collaborative text annotation platforms (Hypothes.is, Annotation Studio), and OpenRefine.
Major Literary Corpora and Data Sources
  • HathiTrust Digital Library / Research Center (HTRC): Repository with millions of digitized volumes, providing computational access.
  • Project Gutenberg: Digital library offering over 60,000 free e-books in the public domain.
  • Google Books / Ngram Viewer: Tracks the frequency of words and phrases over time using Google Books data.
  • Internet Archive: Digital library with archived web pages, books, music, and videos.
  • Other Sources: Time Magazine Corpus, collections from projects like The Victorian Web, and user-built corpora.

Development of novel computational methods to address complex literary questions beyond current AI capabilities.

  • Computational modeling of narrative structure to capture plot progression and dynamics, moving past thematic or stylistic analysis.
  • Systematic tracing of literary influence and allusion through intertextuality detection.
  • Expansion of sentiment analysis to capture a broader range of specific emotions and their interaction within literary texts.
  • Future methods may analyze the interplay between various media in literary and cultural artifacts as digital archives become more multimodal.

--------------------------------------------------

 

  • Text mining, also known as text analytics, involves using computational methods to extract knowledge, patterns, and information from large amounts of text data.
  • In literary studies, text mining applies computational techniques to analyze literary works like novels, poems, plays, and essays.
  • This enables research that goes beyond what's feasible with traditional manual reading.
  • Large digitized literary archives create both opportunities and challenges, making text mining crucial for understanding cultural heritage.
  • Text mining is a key method within the Digital Humanities (DH), which combines computational tools with traditional humanities scholarship.
  • It offers scholars new perspectives and empirical evidence for examining literary history, theory, and textual features.

-------------------------------------------------------------------------------------------------------------

 Interpretive Challenges

The path from computational output to meaningful literary insight is often complex:

  • Ambiguity: Natural language is rife with ambiguity at multiple levels. Lexical ambiguity occurs when words have multiple meanings (e.g., "bank"), syntactic ambiguity arises from sentence structure (e.g., "I saw the man with the telescope"), and figurative ambiguity involves non-literal language.[39] While humans use context to disambiguate, computational tools often struggle, potentially leading to misinterpretations if context is not adequately modeled.[39] (A short sketch of lexical ambiguity follows this list.)

  • Figurative Language: Metaphor, simile, irony, symbolism, and other forms of figurative language pose significant challenges for computational analysis, which typically operates on more literal patterns of word usage.[17] The intended meaning often deviates substantially from the surface expression, making it difficult for algorithms to capture accurately. Topic models, for example, might find the language of poetry particularly challenging due to its frequent reliance on figuration.[17]

  • Context: Understanding a literary text fully requires considering its broader historical, cultural, and discursive context. Computational methods, particularly those relying on local word patterns (like n-grams or simple sentiment analysis), may fail to capture this wider context, potentially missing crucial layers of meaning.[39]

  • Oversimplification: A frequent critique of quantitative methods in literary studies is that they risk oversimplifying the richness and complexity of literary texts.[5] Reducing a novel or poem to word counts, sentiment scores, or network diagrams might flatten its aesthetic qualities and interpretive depth, overlooking nuances readily apparent through close reading.[41] There is a danger of treating literature merely as data, losing sight of its artistic and humanistic dimensions.
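To make lexical ambiguity concrete, here is a minimal sketch using NLTK's WordNet interface; it assumes NLTK is installed and its WordNet corpus has been downloaded.

    # Requires: pip install nltk, then nltk.download("wordnet") once.
    from nltk.corpus import wordnet

    # WordNet records many distinct senses for the single surface form "bank",
    # from riverbanks to financial institutions.
    for synset in wordnet.synsets("bank"):
        print(synset.name(), "-", synset.definition())

A context-blind tool sees only the surface form; choosing among these senses is exactly the disambiguation work that human readers perform automatically.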

     

    ---------------------------------------------------------------

    Lexicon-Based AI

    In the context of Artificial Intelligence (AI), particularly within Natural Language Processing (NLP), a lexicon-based approach refers to methods that rely primarily on a predefined lexicon (a dictionary, vocabulary, or word list) to analyze text and perform tasks.

     

    Here's a more detailed explanation:

    1. What is a Lexicon?

      • In this context, a lexicon is more than just a list of words. It typically contains words or phrases along with associated information relevant to the specific AI task (a minimal sketch follows this list). This information could include:
        • Sentiment Scores: Assigning positive, negative, or neutral scores to words (e.g., "happy": +1, "sad": -1, "excellent": +2, "terrible": -2).
        • Part-of-Speech (POS) Tags: Identifying words as nouns, verbs, adjectives, etc.
        • Semantic Categories: Grouping words by meaning or topic (e.g., words related to "finance," "sports," or "emotions").
        • Intensity Modifiers: Identifying words that strengthen or weaken other words (e.g., "very," "slightly").
        • Negations: Identifying words that reverse meaning (e.g., "not," "never").
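
    To make this concrete, here is one way such a lexicon might be represented in Python. The entries, scores, and names are illustrative assumptions, not a standard resource.

      # A toy sentiment lexicon: illustrative entries only, not a real resource.
      SENTIMENT_LEXICON = {
          "happy": 1.0, "sad": -1.0,
          "excellent": 2.0, "terrible": -2.0,
      }

      # Intensity modifiers scale the score of the sentiment word they precede.
      INTENSIFIERS = {"very": 1.5, "slightly": 0.5}

      # Negations flip the sign of the sentiment word they precede.
      NEGATIONS = {"not", "never"}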
    2. How Does the Lexicon-Based Approach Work?

      • The AI system processes input text (like a sentence or document).
      • It breaks the text down into individual words or tokens.
      • It looks up these words in its predefined lexicon.
      • Based on the information found in the lexicon for each word, the system performs calculations or applies rules.
      • For example, in sentiment analysis, it might sum the sentiment scores of all the recognized words in a sentence to determine the overall sentiment.
      • Rules might adjust scores based on negations or intensifiers found nearby (sketched below).
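
    A minimal sketch of this scoring loop, reusing the toy lexicon defined above; the tokenization and reset rules are simplifying assumptions, not a production design.

      # A naive lexicon-based sentiment scorer (depends on the toy lexicon above).
      def score_sentiment(text):
          total, multiplier, negate = 0.0, 1.0, False
          for token in text.lower().split():        # crude whitespace tokenization
              word = token.strip(".,!?")
              if word in NEGATIONS:                 # "not", "never": flip the next score
                  negate = True
              elif word in INTENSIFIERS:            # "very", "slightly": scale the next score
                  multiplier = INTENSIFIERS[word]
              elif word in SENTIMENT_LEXICON:
                  score = SENTIMENT_LEXICON[word] * multiplier
                  total += -score if negate else score
                  multiplier, negate = 1.0, False   # reset after each scored word
          return total

      # score_sentiment("I am very happy")  ->  1.5 (positive)
      # score_sentiment("not happy")        -> -1.0 (negation reverses the score)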
    3. Common Applications:

      • Sentiment Analysis: This is perhaps the most common application. Systems use lexicons like SentiWordNet, VADER (Valence Aware Dictionary and sEntiment Reasoner), or AFINN to classify text as positive, negative, or neutral (a short usage sketch follows this list).
      • Topic Modeling/Categorization: Using lexicons containing words specific to certain topics to classify documents.
      • Spam Detection: Identifying emails containing words commonly found in spam (e.g., "free," "offer," "Viagra").
      • Information Extraction: Identifying specific types of entities (like company names or locations) based on predefined lists.
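
    For instance, VADER can be used off the shelf; this sketch assumes the vaderSentiment package is installed.

      # Requires: pip install vaderSentiment
      from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

      analyzer = SentimentIntensityAnalyzer()
      # polarity_scores returns proportions of negative/neutral/positive text
      # plus a normalized "compound" score in [-1, 1].
      print(analyzer.polarity_scores("The plot was excellent, but the ending felt terrible."))
      # -> a dict like {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}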
    4. Advantages:

      • Interpretability: It's often easier to understand why a lexicon-based system made a particular decision; you can trace it back to the specific words and their scores/categories in the lexicon.
      • No Training Data Required (Initially): Unlike many machine learning models, these systems don't necessarily need large amounts of labeled data to get started; the knowledge is encoded in the lexicon.
      • Control: Domain experts can directly create, modify, and curate the lexicon to tailor the system to specific needs or improve its accuracy.
      • Computational Efficiency: Often less computationally demanding than training complex machine learning models.
    5. Disadvantages:

      • Context Insensitivity: This is a major weakness. Lexicon-based systems often struggle with sarcasm, irony, negation complexity, and words with multiple meanings (polysemy). For example, "This movie was sick!" might be misinterpreted if "sick" only has a negative score in the lexicon (demonstrated in the sketch after this list).
      • Limited Coverage: The lexicon might not contain all relevant words, especially slang, jargon, misspelled words, or newly coined terms.
      • Labor-Intensive Maintenance: Creating and maintaining comprehensive, high-quality lexicons requires significant human effort and domain expertise.
      • Nuance: They often fail to capture subtle nuances in language or sentiment expressed across multiple sentences or through complex phrasing.
      • Domain Specificity: A lexicon built for one domain (e.g., product reviews) might perform poorly in another (e.g., financial news).
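
    Context insensitivity in miniature, reusing the toy scorer from above; the lexicon entry is an illustrative assumption shared by many general-purpose lexicons.

      # "sick" scored as purely negative, as in many general-purpose lexicons.
      SENTIMENT_LEXICON["sick"] = -1.5
      print(score_sentiment("This movie was sick!"))   # -1.5: slang praise is misread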
       
    ----------------------------------------------------------------------------------------- 

