Get only visible text from document.body

Kevin Jones
Posts: 625
Joined: August 12th, 2009, 10:22 am


Post by Kevin Jones »

Hello,

I am interested in getting all the text from document.body, but only the visible text, i.e., what the user can actually read on the web page.

I have experimented with textContent and with TreeWalker/NodeFilter.SHOW_TEXT, but both include the "text" inside script tags, style tags, etc. One thing I haven't tried is cloning the node, calling getElementsByTagName() for each of noscript, script, and style, and setting innerHTML = "" on the matching nodes. But even if that worked, it seems it would be inefficient and more prone to bugs (i.e., are there other tags that contain text which never shows up on the screen?).
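
For reference, the walk-and-filter idea could be sketched roughly as below. This is only an illustration, not a tested solution: the tag list is an assumption and probably incomplete, the names collectVisibleText and defaultGetStyle are made up for this example, and it uses a plain recursive walk over childNodes rather than a TreeWalker so that it can also be tried outside a browser.

```javascript
// Tags whose text is normally not rendered as page text.
// ASSUMPTION: this list is illustrative and likely incomplete.
const NON_RENDERED = new Set(["script", "style", "noscript", "template"]);

const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Default style lookup: uses getComputedStyle when the element belongs to a
// document with a window, otherwise returns null (treated as "visible").
function defaultGetStyle(element) {
  const view = element.ownerDocument && element.ownerDocument.defaultView;
  return view ? view.getComputedStyle(element) : null;
}

// Hypothetical helper: collects text nodes under `root`, skipping
// non-rendered tags, display:none subtrees, and text whose parent
// element has visibility hidden/collapse.
function collectVisibleText(root, getStyle = defaultGetStyle) {
  const parts = [];

  (function walk(element) {
    if (NON_RENDERED.has(String(element.nodeName).toLowerCase())) return;

    const style = getStyle(element);
    // display:none hides the whole subtree.
    if (style && style.display === "none") return;

    // visibility is inherited but can be overridden by descendants,
    // so it only decides whether this element's own text nodes count.
    const textVisible =
      !style || (style.visibility !== "hidden" && style.visibility !== "collapse");

    for (const child of Array.from(element.childNodes || [])) {
      if (child.nodeType === TEXT_NODE) {
        if (textVisible) parts.push(child.nodeValue);
      } else if (child.nodeType === ELEMENT_NODE) {
        walk(child);
      }
    }
  })(root);

  // Joining with a space keeps block-level text apart, at the cost of
  // sometimes adding a space between adjacent inline fragments.
  return parts.join(" ").replace(/\s+/g, " ").trim();
}

// In a page: collectVisibleText(document.body)
```

This still misses text hidden by other means (zero-size boxes, off-screen positioning, opacity, and so on), which is part of why a browser-provided route seems preferable.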

I was wondering whether Firefox has an API that would accomplish this. The browser has obviously already done this work when rendering the page, so an API that exposes that text would, I expect, be far more efficient and more reliable. I have not found anything in this vein in my searches.

Failing that, any other suggestions for accomplishing this efficiently would be welcome.

Thank you,
Allasso
patrickjdempsey
Posts: 23686
Joined: October 23rd, 2008, 11:43 am
Location: Asheville NC

Re: Get only visible text from document.body

Post by patrickjdempsey »

Text.wholeText?
https://developer.mozilla.org/en-US/doc ... /wholeText

Tip of the day: If it has "toolbar" in the name, it's crap.
What my avatar is about: https://addons.mozilla.org/en-US/seamonkey/addon/sea-fox/

Re: Get only visible text from document.body

Post by Kevin Jones »

patrickjdempsey wrote: Text.wholeText?
https://developer.mozilla.org/en-US/doc ... /wholeText


No, that returns all that other junk as well. Thanks for the suggestion.

Re: Get only visible text from document.body

Post by patrickjdempsey »

I wonder if there's a way to simulate "Select All, Copy"? That would do exactly what you want, and I think even preserve links.
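
From ordinary page script, a "Select All, then read the selection" step could be sketched with the standard Selection API, roughly as below. This is a sketch only: getTextBySelectAll is a made-up name, `win` is assumed to be a browser window, and whether the resulting string contains only rendered text depends on how the browser stringifies a selection, so the output should be checked in the browser you care about.

```javascript
// Hypothetical helper: selects everything under <body>, reads the selection
// as a string, then restores whatever the user had selected before.
// ASSUMPTION: `win` is a browser Window (or exposes the same getSelection() API).
function getTextBySelectAll(win) {
  const selection = win.getSelection();
  // getSelection() may return null in some situations, so guard for it.
  if (!selection) return "";

  // Remember the current ranges so the user's selection is not lost.
  const saved = [];
  for (let i = 0; i < selection.rangeCount; i++) {
    saved.push(selection.getRangeAt(i));
  }

  // selectAllChildren() replaces the current selection with the node's children.
  selection.selectAllChildren(win.document.body);
  const text = selection.toString();

  // Clear the temporary selection and put the saved ranges back.
  selection.removeAllRanges();
  for (const range of saved) {
    selection.addRange(range);
  }
  return text;
}

// In a page: getTextBySelectAll(window)
```

Note that this yields plain text only; preserving links would mean working with the selected ranges themselves rather than the string.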

Re: Get only visible text from document.body

Post by Kevin Jones »

Wow, I just saw this (Mozillazine does not always send update emails), and that is exactly what I did:

Code:


    // Returns the nsISelectionController for a content window, or null.
    function getSelectionController(aWindow) {

      // 'display: none' iframes don't have a selection controller, see bug 493658
      if (!aWindow.innerWidth || !aWindow.innerHeight)
        return null;

      // Yuck. See bug 138068.
      var Ci = Components.interfaces;
      var docShell = aWindow.QueryInterface(Ci.nsIInterfaceRequestor)
                            .getInterface(Ci.nsIWebNavigation)
                            .QueryInterface(Ci.nsIDocShell);

      return docShell.QueryInterface(Ci.nsIInterfaceRequestor)
                     .getInterface(Ci.nsISelectionDisplay)
                     .QueryInterface(Ci.nsISelectionController);
    }

    // win: the content window or iframe for whatever you want to select
    function getVisibleText(win) {
      let controller = getSelectionController(win);
      if (!controller)
        return "";  // no selection controller available (e.g. hidden iframe)

      controller.selectAll();
      // controller.SELECTION_NORMAL === 1
      let selection = controller.getSelection(controller.SELECTION_NORMAL);
      let selectionString = selection.toString();
      selection.removeAllRanges();  // clear the selection

      // collapse runs of whitespace to one space; this may not be necessary
      return selectionString.replace(/\s+/g, " ");
    }
 


One caveat: it also selects text for accessibility items, such as the 'alt' attribute of 'img' tags. This may or may not be desired.

This code and much more can be found in Finder.jsm.
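
The whitespace cleanup at the end of the snippet can be tried on its own. A small sketch (the function names are made up for illustration) showing that collapsing each whitespace run in one pass gives the same result as replacing every whitespace character with a space and then squeezing repeated spaces:

```javascript
// Collapse every run of whitespace (spaces, tabs, newlines, ...) to one space.
function normalizeWhitespace(str) {
  return str.replace(/\s+/g, " ");
}

// The two-step form: first turn each whitespace character into a space,
// then squeeze runs of two or more spaces down to one.
function normalizeWhitespaceTwoStep(str) {
  return str.replace(/\s/g, " ").replace(/  +/g, " ");
}

const sample = "Heading\n\n  First line\t\tsecond  line\r\n";
console.log(JSON.stringify(normalizeWhitespace(sample)));
console.log(normalizeWhitespace(sample) === normalizeWhitespaceTwoStep(sample));
```

Neither form trims leading or trailing whitespace; add .trim() if that matters.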

Thanks, Patrick.