This short topic explains how to convert HTML to text in Java. A Java application that converts HTML to plain text can be developed with a simple API and runs on Windows, Linux, or macOS.
Steps to Convert HTML to Text in Java
- Configure your project to add Aspose.HTML for Java from the Maven repository
- Import the required Aspose.HTML classes in your application
- Read the source HTML file content into a String object
- Initialize an HTMLDocument class object to load the source HTML String
- Initialize an INodeIterator class object to iterate over the nodes and append their text to a StringBuilder
- Save the extracted text from HTML on disk
A Java application can extract text from HTML with a few lines of code. We start by reading the source HTML into a String object and then load that String using the HTMLDocument class. Next, we use INodeIterator to traverse the HTML nodes and append the text of each one to a StringBuilder. Finally, the contents of the StringBuilder are saved as a plain text file on disk.
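The first and last steps described above, reading the file into a String and saving the StringBuilder to disk, need only the JDK. The following is a minimal sketch of those two steps, assuming Java 8 or later and UTF-8 encoded files. The class name TextFileRoundTrip, its helper methods, and the temporary file names are hypothetical and are not part of Aspose.HTML.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only; not part of Aspose.HTML.
final class TextFileRoundTrip {

    // Reads the whole file into a String, decoding the bytes as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Writes the text as UTF-8, creating the file or replacing its content.
    static void writeUtf8(Path path, CharSequence text) throws IOException {
        Files.write(path, text.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real input and output files.
        Path source = Files.createTempFile("sample", ".html");
        Path target = Files.createTempFile("sample", ".txt");

        writeUtf8(source, "<p>Hello</p>");
        String html = readUtf8(source);

        // In the real conversion the StringBuilder would be filled by the
        // node iteration; here it simply receives the raw file content.
        StringBuilder extracted = new StringBuilder(html);
        writeUtf8(target, extracted);

        System.out.println(readUtf8(target));
    }
}
```

Passing the charset explicitly on both the read and the write keeps the result independent of the platform default encoding, which can differ between Windows, Linux, and macOS.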
Code to Convert HTML to Text in Java
import com.aspose.html.HTMLDocument;
import com.aspose.html.License;
import com.aspose.html.dom.Node;
import com.aspose.html.dom.traversal.INodeIterator;
import com.aspose.html.dom.traversal.filters.NodeFilter;
import java.nio.file.Paths;
import java.nio.file.Files;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HtmlToTextConverter {
    public static void main(String[] args) throws Exception {
        // Set the Aspose.HTML for Java license to use the complete feature set
        License lic = new License();
        lic.setLicense("HTML.Total.Java.lic");

        // Read the HTML file into a String
        String content;
        try {
            content = readFileContent("TestFile.html", StandardCharsets.UTF_8);
        } catch (IOException exception) {
            exception.printStackTrace();
            return;
        }

        // Instantiate an HTMLDocument object to load the HTML content from the String
        HTMLDocument document = new HTMLDocument(content, "");

        // Initialize an INodeIterator instance that visits text nodes only
        INodeIterator iterator = document.createNodeIterator(document, NodeFilter.SHOW_TEXT, new StyleFilter());
        StringBuilder extractedText = new StringBuilder();

        // Holds the node currently being visited
        Node node;

        // Iterate through the nodes and collect their text
        while ((node = iterator.nextNode()) != null) {
            extractedText.append(node.getNodeValue());
        }

        System.out.println(extractedText.toString());

        // Save the extracted text to disk using an explicit encoding
        Files.write(Paths.get("HtmlToText_Java.txt"), extractedText.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static String readFileContent(String filePath, Charset encoding) throws IOException {
        // try-with-resources closes the underlying file handle once the lines are read
        try (Stream<String> lines = Files.lines(Paths.get(filePath), encoding)) {
            return lines.collect(Collectors.joining(System.lineSeparator()));
        }
    }
}

class StyleFilter extends NodeFilter {
    @Override
    public short acceptNode(Node node) {
        // To skip an element while fetching nodes, give its name in upper case letters.
        // Strings are compared with equals(), because == compares object references.
        String parentTag = node.getParentElement().getTagName();
        return ("STYLE".equals(parentTag) || "SCRIPT".equals(parentTag))
                ? FILTER_REJECT : FILTER_ACCEPT;
    }
}
The above example converts HTML to plain text in Java with a few API calls. We created a StyleFilter class that extends the NodeFilter class and implements the acceptNode method to define a custom node filter, which omits undesirable nodes (here, the contents of STYLE and SCRIPT elements) from the HTML during the conversion process.
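If more elements need to be skipped, the tag-name check inside the filter can be moved into a small lookup. The following is a minimal sketch of that idea using only the JDK. The class name SkippedTags and the method shouldSkip are hypothetical names chosen for illustration, and the sketch assumes that tag names should be matched regardless of letter case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Aspose.HTML.
final class SkippedTags {

    // Upper-case names of the elements whose text should be left out.
    private static final Set<String> SKIPPED =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("STYLE", "SCRIPT")));

    // Returns true when text under the given parent tag should be omitted.
    static boolean shouldSkip(String parentTagName) {
        return parentTagName != null
                && SKIPPED.contains(parentTagName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip("script")); // prints "true"
        System.out.println(shouldSkip("P"));      // prints "false"
    }
}
```

Inside acceptNode, the filter could then pass the parent element's tag name to shouldSkip and return FILTER_REJECT when it yields true and FILTER_ACCEPT otherwise. Adding another element to skip then becomes a one-word change to the set.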
In this topic, we have explored how to extract text from HTML in Java. If you are interested in converting an MD file to XPS format, see the topic on how to convert Markdown to XPS using Java.