How to convert an HTML document to plain text in C# and VB.NET?
Essential® DocIO converts HTML files to Word documents and vice versa. You can also convert an HTML document to plain text format and vice versa.
The Word library (DocIO) uses XmlReader to parse the content of the input HTML. The input HTML must therefore be well-formed XML (every opening tag needs a matching closing tag), even if you specify the XHTMLValidationType parameter as XHTMLValidationType.None.
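Because of this requirement, it can help to check the markup before handing it to DocIO. The following is a minimal sketch that uses only the .NET System.Xml classes. The HtmlPrecheck class and its IsWellFormed method are hypothetical names introduced for illustration and are not part of DocIO, and the reader settings are an assumption, so the check may be stricter or looser than DocIO's own parsing.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of the DocIO API.
public static class HtmlPrecheck
{
    // Returns true when the markup parses as well-formed XML.
    // On failure, returns false and passes back the parser's message.
    public static bool IsWellFormed(string html, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // Assumption: skip any DOCTYPE rather than reject it or fetch the DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        try
        {
            using (var text = new StringReader(html))
            using (var reader = XmlReader.Create(text, settings))
            {
                // Reading to the end forces the parser to check every tag.
                while (reader.Read()) { }
            }
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

A caller could read the file with File.ReadAllText("Input.html"), pass the text to HtmlPrecheck.IsWellFormed, and only construct the WordDocument when the check succeeds. Note that with the DTD ignored, named entities such as &nbsp; are reported as errors by this sketch, so a failure here does not necessarily mean DocIO would reject the same input.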
XHTML Validation
HTML content is validated against a Document Type Definition (DTD), which is a set of markup declarations that define a document type for an SGML-family markup language (GML, SGML, XML, HTML).
XHTML validation types
Essential® DocIO supports the following XHTML validation types when importing HTML content.
XHTML validation type | Description |
---|---|
XHTMLValidationType.None | Does not perform any schema validation, but the given HTML content must still meet the XHTML 1.0 format. |
XHTMLValidationType.Transitional | Allows several attributes within the tags. |
XHTMLValidationType.Strict | Does not allow the attributes inside the tags. |
Steps to convert HTML document to plain text in C#
- Create a new C# console application project.
- Install the Syncfusion.DocIO.WinForms NuGet package from NuGet.org as a reference in your .NET Framework application.
- Include the following namespaces in the Program.cs file.
C#
using Syncfusion.DocIO;
using Syncfusion.DocIO.DLS;
VB
Imports Syncfusion.DocIO
Imports Syncfusion.DocIO.DLS
- Use the following code to convert an HTML document to plain text.
C#
//Loads the HTML document with validation type None
WordDocument document = new WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None);
//Saves the document as plain text
document.Save("HTMLtoText.txt", FormatType.Txt);
//Closes the document
document.Close();
VB
'Loads the HTML document with validation type None
Dim document As WordDocument = New WordDocument("Input.html", FormatType.Html, XHTMLValidationType.None)
'Saves the document as plain text
document.Save("HTMLtoText.txt", FormatType.Txt)
'Closes the document
document.Close()
A complete working example of converting an HTML document to plain text in C# can be downloaded from here.
The input HTML document is as follows:
By executing the program, you will get the plain text as follows:
Take a moment to peruse the documentation, where you can find basic Word document processing options along with features such as mail merge, merging and splitting documents, finding and replacing text in a Word document, protecting Word documents, and, most importantly, PDF and image conversions with code examples.
Explore more about the rich set of Syncfusion® Word Framework features.
An online example to protect the Word document from editing using DocIO.
See Also:
Word to HTML and HTML to Word Conversions
Starting with v16.2.0.x, if you reference Syncfusion® assemblies from the trial setup or from the NuGet feed, you must include a license key in your projects. Refer to the link to learn about generating and registering a Syncfusion® license key in your application to use the components without a trial message.