How to Extract Text from Images in .NET Core PowerPoint Using C#?
Steps to extract text from the images in a PowerPoint presentation using C#
Step 1: Create a new .NET console application project.
Step 2: Install the Syncfusion.Presentation.Net.Core and Syncfusion.PDF.OCR.Net.Core NuGet packages as references to your project from NuGet.org.
Step 3: Include the following namespaces in the Program.cs file.
using Syncfusion.Presentation;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Graphics;
using System.Collections.Generic;
using System.IO;
Step 4: Include the following code snippet in Program.cs to get the images from the PowerPoint presentation and add them to a list of memory streams.
//Open the existing PowerPoint presentation.
using (IPresentation pptxDoc = Presentation.Open(@"../../../Template.pptx"))
{
    List<MemoryStream> pictureStreamList = new List<MemoryStream>();
    //Retrieve each slide from the presentation.
    foreach (ISlide slide in pptxDoc.Slides)
    {
        //Retrieve all the pictures from the slide.
        IPictures pictures = slide.Pictures;
        foreach (IPicture picture in pictures)
        {
            //Copy the picture bytes into a memory stream for OCR.
            pictureStreamList.Add(new MemoryStream(picture.ImageData));
        }
    }
    //Extract text from the images using the OCR processor.
    ExtractTextFromImages(pictureStreamList);
}
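The loop above adds one stream per picture, so an image that appears on several slides (a logo, for instance) would be OCR'ed once per occurrence. The following is a minimal sketch of one way to skip empty and repeated images before OCR. `PictureStreamFilter` and `Deduplicate` are hypothetical names introduced here for illustration; they are not part of the Syncfusion API, and the sketch uses only .NET base class library types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class PictureStreamFilter
{
    // Hypothetical helper (not a Syncfusion API): wraps each distinct,
    // non-empty image byte array in a MemoryStream, skipping repeats.
    public static List<MemoryStream> Deduplicate(IEnumerable<byte[]> imageDataItems)
    {
        var seenHashes = new HashSet<string>();
        var result = new List<MemoryStream>();
        using (SHA256 sha = SHA256.Create())
        {
            foreach (byte[] data in imageDataItems)
            {
                // Skip missing or empty image data.
                if (data == null || data.Length == 0)
                {
                    continue;
                }
                // Two images with identical bytes produce the same hash key.
                string key = Convert.ToBase64String(sha.ComputeHash(data));
                if (seenHashes.Add(key))
                {
                    result.Add(new MemoryStream(data));
                }
            }
        }
        return result;
    }
}
```

To use it with the snippet above, you could collect each `picture.ImageData` into a `List<byte[]>` inside the loop and pass the result of `PictureStreamFilter.Deduplicate` to `ExtractTextFromImages`. Whether this is worthwhile depends on how often images repeat in your presentations.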
Step 5: Include the following helper method in Program.cs to extract the text from each image stream.
/// <summary>
/// Extracts text from images using the OCR processor.
/// </summary>
/// <param name="pictureStreamList">List of picture streams.</param>
private static void ExtractTextFromImages(List<MemoryStream> pictureStreamList)
{
    //The tessdata folder inside the bin folder contains the language data files.
    string tessdataPath = Path.GetFullPath(@"runtimes/tessdata");
    int i = 1;
    //Get each picture and extract its text.
    foreach (MemoryStream imgStream in pictureStreamList)
    {
        //Initialize the OCR processor.
        using (OCRProcessor processor = new OCRProcessor())
        {
            //Set the OCR language to process.
            processor.Settings.Language = Languages.English;
            //Open the Unicode font read-only and release it after each image,
            //so that the next iteration can open the file again.
            using (FileStream fontStream = new FileStream(Path.GetFullPath("../../../ARIALUNI.ttf"), FileMode.Open, FileAccess.Read))
            {
                //Set the Unicode font to preserve Unicode characters.
                processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
                //Perform OCR on the image stream.
                string ocrText = processor.PerformOCR(imgStream, tessdataPath);
                //Append the recognized text to a text file, one file per image.
                using (StreamWriter writer = new StreamWriter(Path.GetFullPath(@"../../../OCRText_" + i + ".txt"), true))
                {
                    writer.WriteLine(ocrText);
                }
            }
        }
        //Dispose the image stream.
        imgStream.Dispose();
        i++;
    }
}
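The helper above depends on two things being present on disk: the tessdata folder with the language data files, and the Unicode font file. If either one is missing, the failure surfaces only when OCR runs. The following is a minimal sketch of a check you could run first. `OcrPrerequisites` and `FindMissing` are hypothetical names introduced here for illustration, not Syncfusion APIs, and the sketch assumes the language data files use the usual Tesseract `.traineddata` extension.

```csharp
using System.Collections.Generic;
using System.IO;

public static class OcrPrerequisites
{
    // Hypothetical helper (not a Syncfusion API): returns a description of
    // each prerequisite that is missing; an empty list means all were found.
    public static List<string> FindMissing(string tessdataPath, string fontPath)
    {
        var missing = new List<string>();
        if (!Directory.Exists(tessdataPath))
        {
            missing.Add("tessdata folder not found: " + tessdataPath);
        }
        else if (Directory.GetFiles(tessdataPath, "*.traineddata").Length == 0)
        {
            // Assumption: language data files end with .traineddata.
            missing.Add("no .traineddata files in: " + tessdataPath);
        }
        if (!File.Exists(fontPath))
        {
            missing.Add("font file not found: " + fontPath);
        }
        return missing;
    }
}
```

For example, you could call `OcrPrerequisites.FindMissing(Path.GetFullPath(@"runtimes/tessdata"), Path.GetFullPath("../../../ARIALUNI.ttf"))` at the start of `ExtractTextFromImages` and print each returned message before processing any image.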
A complete working sample that extracts text from the images in a PowerPoint presentation using C# can be downloaded from GitHub.
Conclusion
I hope you enjoyed learning about how to extract text from images in .NET Core PowerPoint using C#.
You can refer to our .NET PowerPoint feature tour page to learn about its other features, and to the documentation to get started quickly. You can also explore our .NET PowerPoint example to understand how to create and manipulate data.
Current customers can check out our components from the License and Downloads page. If you are new to Syncfusion, you can try our 30-day free trial to check out our other controls.
If you have any queries or require clarifications, please let us know in the comments section below. You can also contact us through our support forums, Direct-Trac, or feedback portal. We are always happy to assist you!