I am trying to write a program that extracts the text from PDF documents. The documents contain Arabic text set in several different fonts.

Extraction works for some files, but for others it returns garbled text.

I am using C# and iText 7 for this program.

Please show me an approach for doing this, with some examples.

Thank you.

What I have tried:

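// Needs: using System.Text; using System.Text.RegularExpressions;
//        using System.Collections.Generic; using iText.Kernel.Pdf;
//        using iText.Kernel.Pdf.Canvas.Parser;
//        using iText.Kernel.Pdf.Canvas.Parser.Listener;
StringBuilder processed = new StringBuilder();

var src = "d:\\text06.pdf";
var pdfDocument = new PdfDocument(new PdfReader(src));

for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);

    // Earlier attempt at forcing an encoding on the fonts, left disabled:
    //PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    //foreach (PdfObject font in fontResources.Values(true))
    //{
    // if (font is PdfDictionary)
    //    fontResources.Put(PdfName.Encoding, PdfName.IdentityH);
    //}

    // Use a fresh strategy per page so text from earlier pages is not repeated.
    var strategy = new LocationTextExtractionStrategy();
    string output = PdfTextExtractor.GetTextFromPage(page, strategy);

    // Separate pages so the last line of one page does not merge with the next.
    processed.Append(output).Append('\n');
}
pdfDocument.Close();

string[] lines = Regex.Split(processed.ToString(), "\n");

List<string> Converted_Lines = new List<string>();
foreach (string s in lines)
{
    // Inverse is my own helper (not shown here).
    string converted_string = Inverse(s);
    // The converted line was never stored before, so the text box stayed empty.
    Converted_Lines.Add(converted_string);
}

textBox1.Text = String.Join(Environment.NewLine, Converted_Lines);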
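
The Inverse helper is not shown above, so the following is only a sketch of what a post-processing step like it might do, not the actual helper. It assumes the extracted line comes back in visual (left-to-right) order and uses Arabic presentation-form code points; both are assumptions about the problem files. The class and method names (ArabicTextCleanup, ReverseVisualOrder, NormalizeForms) are hypothetical. It uses only the .NET base library, not iText.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class ArabicTextCleanup
{
    // Hypothetical helper: reverses a line text element by text element,
    // so combining marks stay attached to their base letters.
    // Assumption: the whole line is right-to-left text stored in visual order.
    public static string ReverseVisualOrder(string line)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(line);
        var sb = new StringBuilder(line.Length);
        for (int i = starts.Length - 1; i >= 0; i--)
        {
            int end = (i + 1 < starts.Length) ? starts[i + 1] : line.Length;
            sb.Append(line, starts[i], end - starts[i]);
        }
        return sb.ToString();
    }

    // Hypothetical helper: maps Arabic presentation forms (isolated, initial,
    // medial, final shapes and ligatures) back to the base letters using
    // Unicode compatibility normalization (NFKC).
    public static string NormalizeForms(string line)
    {
        return line.Normalize(NormalizationForm.FormKC);
    }

    // Reverse first, then normalize: a lam-alef ligature expands to
    // "lam, alef", which is only in logical order after the reversal.
    public static string Clean(string line)
    {
        return NormalizeForms(ReverseVisualOrder(line));
    }
}
```

In the loop above, a call such as ArabicTextCleanup.Clean(s) could stand in for Inverse(s). Reversing a whole line will also reverse any digits or Latin words in it, so mixed-direction lines would need a proper bidirectional algorithm. Neither step helps when a font in the PDF has no usable mapping from glyphs to Unicode.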

Asif 7969814 30-Aug-21 11:00am
https://stackoverflow.com/questions/40596320/extracting-arabic-text-in-c-sharp-by-using-itextsharp

https://www.codeproject.com/Questions/1067285/How-to-print-arabic-characters-to-a-pdf-file-using

https://stackoverflow.com/questions/34528259/arabic-in-pdf-using-itextsharp-in-c-sharp
sahnoune_khaled 30-Aug-21 16:35pm
Extracting text from a PDF document is a different process from creating a new one.
Extracting text from a PDF is very difficult.
Asif 7969814 31-Aug-21 5:22am
Yes, you are right. But can we use OCR instead, through the Computer Vision API and C#?

It has language support for:
ar (Arabic)
tr (Turkish)
ro (Romanian)
https://www.c-sharpcorner.com/article/cognitive-services-optical-character-recognition-ocr-from-an-image-using-com/

And this is the GitHub link:

https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/dotnet/ComputerVision/REST/CSharp-print-text.md

https://westus.dev.cognitive.microsoft.com/docs/services/56f91f2d778daf23d8ec6739/operations/56f91f2e778daf14a499e1fc
sahnoune_khaled 31-Aug-21 8:40am
Hi. But my files contain text, images, and fonts. They are ordinary PDF files, so OCR is not a good idea.
I am taking the second approach: extracting the document's data by parsing the file, finding where the text is, and converting it using the right encoding.