How to Convert HTML to Text in Java

This topic explains how to convert HTML to text in Java. Using a simple API, you can develop an HTML to plain text conversion application in Java that runs on Windows, Linux, or macOS.

Steps to Convert HTML to Text in Java

Configure your project to add Aspose.HTML for Java from the Maven repository
Import the required Aspose.HTML classes in your application
Read the content of the source HTML file into a String object
Initialize an HTMLDocument class object to load the source HTML String
Initialize an INodeIterator class object to iterate over the text nodes and append each one to a StringBuilder
Save the text extracted from the HTML to disk

A Java application can extract text from HTML with a few lines of code. The process starts by reading the source HTML into a String object and loading that String with the HTMLDocument class. An INodeIterator then traverses the HTML text nodes, and the value of each node is appended to a StringBuilder. Finally, the contents of the StringBuilder are saved to disk as a plain text file.
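The first and last steps of this process (reading the HTML file into a String and writing the extracted text to disk) need only the Java standard library. The following is a minimal sketch of those two steps alone. The class name `HtmlFileIo` and the methods `readHtml` and `writeText` are hypothetical names chosen for illustration, and the sketch assumes the files are UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Aspose.HTML.
public class HtmlFileIo {

    // Reads the whole file into a String using the given encoding.
    static String readHtml(Path source, Charset encoding) throws IOException {
        return new String(Files.readAllBytes(source), encoding);
    }

    // Writes the text to the target file as UTF-8 (an assumption of this sketch).
    static void writeText(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real input and output paths.
        Path html = Files.createTempFile("sample", ".html");
        Path txt = Files.createTempFile("sample", ".txt");

        writeText(html, "<p>Hello</p>");
        String content = readHtml(html, StandardCharsets.UTF_8);

        // In the full example the String would pass through HTMLDocument first.
        writeText(txt, content);
        System.out.println(readHtml(txt, StandardCharsets.UTF_8));
    }
}
```

Passing the encoding explicitly on both the read and the write avoids depending on the platform default charset, which can differ between Windows, Linux, and macOS.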

Code to Convert HTML to Text in Java

	import com.aspose.html.HTMLDocument;
	import com.aspose.html.License;
	import com.aspose.html.dom.Node;
	import com.aspose.html.dom.traversal.INodeIterator;
	import com.aspose.html.dom.traversal.filters.NodeFilter;
	import java.io.IOException;
	import java.nio.charset.Charset;
	import java.nio.charset.StandardCharsets;
	import java.nio.file.Files;
	import java.nio.file.Paths;

	public class HtmlToTextConverter {
	    public static void main(String[] args) throws Exception {

	        // Set the Aspose.HTML for Java license to use the complete feature set
	        License lic = new License();
	        lic.setLicense("HTML.Total.Java.lic");

	        // Read the HTML file into a String
	        String content;
	        try {
	            content = readFileContent("TestFile.html", StandardCharsets.UTF_8);
	        } catch (IOException exception) {
	            exception.printStackTrace();
	            return;
	        }

	        // Instantiate an HTMLDocument object to load the HTML content from the String
	        HTMLDocument document = new HTMLDocument(content, "");

	        // Create an INodeIterator that visits text nodes only and applies StyleFilter
	        INodeIterator iterator = document.createNodeIterator(document, NodeFilter.SHOW_TEXT, new StyleFilter());
	        StringBuilder textBuilder = new StringBuilder();

	        // Append the value of every accepted text node
	        Node node;
	        while ((node = iterator.nextNode()) != null) {
	            textBuilder.append(node.getNodeValue());
	        }
	        System.out.println(textBuilder.toString());

	        // Save the extracted text to disk with an explicit encoding
	        Files.write(Paths.get("HtmlToText_Java.txt"), textBuilder.toString().getBytes(StandardCharsets.UTF_8));
	    }

	    // Read the whole file into a String using the given encoding
	    public static String readFileContent(String filePath, Charset encoding) throws IOException {
	        return new String(Files.readAllBytes(Paths.get(filePath)), encoding);
	    }
	}

	class StyleFilter extends NodeFilter {

	    @Override
	    public short acceptNode(Node node) {
	        // To skip an element while fetching nodes, give its name in upper case letters.
	        // Strings are compared with equals(), since == compares object references.
	        String parentTag = node.getParentElement().getTagName();
	        return ("STYLE".equals(parentTag) || "SCRIPT".equals(parentTag))
	                ? FILTER_REJECT : FILTER_ACCEPT;
	    }
	}


The above example converts HTML to plain text in Java with a few API calls. The StyleFilter class extends the NodeFilter class and implements the acceptNode method to define a custom node filter that omits undesirable nodes, here the contents of STYLE and SCRIPT elements, during the conversion process.
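The same filtering idea can be tried without Aspose.HTML by using the W3C DOM Traversal interfaces that ship with the JDK (`org.w3c.dom.traversal`). The sketch below is an analogue for experimentation, not the Aspose API. It assumes the input is well-formed XHTML, because the JDK XML parser does not accept arbitrary HTML, and it assumes the default JDK DOM implementation, whose documents can be cast to `DocumentTraversal`. The class name `TextNodeFilterSketch` and the method `extractText` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical illustration using JDK classes only; not part of Aspose.HTML.
public class TextNodeFilterSketch {

    // Rejects text nodes whose parent element is <style> or <script>.
    static final NodeFilter SKIP_STYLE_AND_SCRIPT = node -> {
        Node parent = node.getParentNode();
        String name = parent == null ? "" : parent.getNodeName();
        return ("style".equalsIgnoreCase(name) || "script".equalsIgnoreCase(name))
                ? NodeFilter.FILTER_REJECT
                : NodeFilter.FILTER_ACCEPT;
    };

    // Parses well-formed XHTML and concatenates the accepted text nodes.
    static String extractText(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // SHOW_TEXT limits the iteration to text nodes before the filter runs.
        NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_TEXT, SKIP_STYLE_AND_SCRIPT, true);

        StringBuilder text = new StringBuilder();
        for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
            text.append(n.getNodeValue());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>";
        System.out.println(extractText(xhtml)); // prints HelloWorld
    }
}
```

The text nodes are appended with no separator, so adjacent paragraphs run together in the output. A real converter may want to insert whitespace or line breaks between block-level elements.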

In this topic, we have explored how to extract text from HTML in Java. If you are interested in converting an MD file to XPS format, proceed to the topic on how to convert Markdown to XPS using Java.
