Hello,
I am interested in getting all the text from document.body, but only the visible text, i.e., the text the user can actually read on the web page.
I have experimented with textContent and TreeWalker/NodeFilter.SHOW_TEXT, but they include "text" within script tags, style tags, etc. One thing I haven't tried is to clone the node, call getElementsByTagName() for each of noscript, script and style, and set innerHTML = "" on those nodes. But even if it worked, it seems it would be inefficient and more prone to bugs (e.g., are there other tags which contain text that does not show up on the screen?).
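For illustration, here is a minimal sketch of the tag-skipping idea, written as a read-only recursive walk instead of clone-and-strip so that nothing has to be copied or modified. The names collectVisibleText, isHiddenByCss and SKIPPED_TAGS are hypothetical, the tag list is an assumption and not exhaustive, and only display: none is checked (visibility, opacity, off-screen positioning and so on are not handled). Joining text nodes with a space is also a simplification: it can split a word that spans inline elements.

```javascript
// Tags whose text is never rendered as page text (assumed, non-exhaustive list).
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Hypothetical helper: true if the element is removed from layout by CSS.
// Only 'display: none' is checked here.
function isHiddenByCss(element) {
  const view = element.ownerDocument && element.ownerDocument.defaultView;
  if (!view) return false; // no window available, so assume it is visible
  return view.getComputedStyle(element).display === "none";
}

// Hypothetical helper: collect the text nodes under `root`, skipping
// non-rendered tags and any element the `isHidden` predicate rejects.
function collectVisibleText(root, isHidden = isHiddenByCss) {
  const parts = [];
  (function walk(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
      parts.push(node.nodeValue);
      return;
    }
    if (node.nodeType === 1) { // Node.ELEMENT_NODE
      if (SKIPPED_TAGS.has(node.tagName.toUpperCase()) || isHidden(node)) return;
    }
    for (const child of node.childNodes || []) walk(child);
  })(root);
  // Join with spaces, then collapse whitespace runs.
  return parts.join(" ").replace(/\s+/g, " ").trim();
}
```

In a page this would be called as collectVisibleText(document.body). Calling getComputedStyle on every element can be slow on large documents, so this is not necessarily more efficient than the cloning approach.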
I was wondering if Firefox has an API which would accomplish this. Obviously the browser has already done this work, so I wondered if there is an API which exposes that text. This would be far more efficient I expect, and more reliable. I have not found anything in this vein in my searches.
Failing that, I would welcome any other suggestions that would accomplish this goal efficiently.
Thank you,
Allasso
Get only visible text from document.body
-
- Posts: 625
- Joined: August 12th, 2009, 10:22 am
- patrickjdempsey
- Posts: 23686
- Joined: October 23rd, 2008, 11:43 am
- Location: Asheville NC
Re: Get only visible text from document.body
Text.wholeText?
https://developer.mozilla.org/en-US/doc ... /wholeText
Tip of the day: If it has "toolbar" in the name, it's crap.
What my avatar is about: https://addons.mozilla.org/en-US/seamonkey/addon/sea-fox/
Re: Get only visible text from document.body
patrickjdempsey wrote: Text.wholeText?
https://developer.mozilla.org/en-US/doc ... /wholeText
No, that returns all that other junk as well. Thanks for the suggestion.
- patrickjdempsey
Re: Get only visible text from document.body
I wonder if there's a way to simulate "Select All, Copy"? That would do exactly what you want, and I think even preserve links.
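One way to approximate "Select All" from page script is the standard Selection API. The following is only a sketch under assumptions: getTextViaSelection is a hypothetical name, it relies on Selection.toString() returning roughly what the browser would copy, and the result may differ between browsers. It saves and restores the user's existing selection ranges, although some browsers keep only one range.

```javascript
// Hypothetical helper: read the page text by temporarily selecting all of <body>.
// `win` is the content window (or iframe window) to read from.
function getTextViaSelection(win) {
  const selection = win.getSelection();
  if (!selection) return ""; // getSelection() can return null, e.g. for hidden iframes

  // Remember the user's current selection so it can be restored afterwards.
  const saved = [];
  for (let i = 0; i < selection.rangeCount; i++) {
    saved.push(selection.getRangeAt(i));
  }

  selection.selectAllChildren(win.document.body);
  const text = selection.toString();

  // Clear the temporary selection and put the original ranges back.
  selection.removeAllRanges();
  for (const range of saved) selection.addRange(range);

  // Collapse whitespace runs to single spaces.
  return text.replace(/\s+/g, " ").trim();
}
```

Usage would be getTextViaSelection(window) for the current page, or getTextViaSelection(iframe.contentWindow) for a same-origin iframe.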
Re: Get only visible text from document.body
Wow, I just saw this (Mozillazine does not always send update emails), and that is exactly what I did:
// Returns the nsISelectionController for a content window, or null if it has none.
function getSelectionController(aWindow) {
  // 'display: none' iframes don't have a selection controller, see bug 493658
  if (!aWindow.innerWidth || !aWindow.innerHeight)
    return null;
  // Yuck. See bug 138068.
  var Ci = Components.interfaces;
  var docShell = aWindow.QueryInterface(Ci.nsIInterfaceRequestor)
                        .getInterface(Ci.nsIWebNavigation)
                        .QueryInterface(Ci.nsIDocShell);
  var controller = docShell.QueryInterface(Ci.nsIInterfaceRequestor)
                           .getInterface(Ci.nsISelectionDisplay)
                           .QueryInterface(Ci.nsISelectionController);
  return controller;
}

// win: the content window or iframe for whatever you want to select
function getVisibleText(win) {
  let controller = getSelectionController(win);
  if (!controller)
    return ""; // nothing selectable, e.g. a hidden iframe
  controller.selectAll();
  let selection = controller.getSelection(controller.SELECTION_NORMAL); // controller.SELECTION_NORMAL === 1
  let selectionString = selection.toString();
  selection.removeAllRanges(); // clear the selection
  // collapse whitespace runs to single spaces; this may not be necessary
  return selectionString.replace(/\s+/g, " ");
}
One caveat: this will also select text for accessibility items, such as the 'alt' attribute in 'img' tags. This may or may not be desired.
This code and much more can be found in Finder.jsm.
Thanks, Patrick.