The MarkItDown library converts a range of file types to Markdown, which is useful for tasks such as indexing and text analysis.
Currently, it can convert the following file types:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML (with special handling for Wikipedia and others)
- Other text formats (csv, json, xml, etc.)
The API is straightforward:
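```python
from markitdown import MarkItDown

# Create a converter and convert a file to Markdown.
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")

# The converted Markdown is available as plain text.
print(result.text_content)
```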
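For indexing, you will often want to convert a whole folder rather than a single file. The following is a minimal sketch of one way to do that, not part of the library itself. It relies only on the `convert()` call and the `text_content` attribute shown above. The helper name `convert_folder`, the `SUPPORTED_SUFFIXES` set, and the `<name>.md` output naming are assumptions made for illustration, and because the exact exception types raised on failure are not assumed, the sketch catches `Exception` broadly.

```python
from pathlib import Path

# Hypothetical allow-list based on the file types listed above; adjust as needed.
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def convert_folder(converter, source_dir, output_dir):
    """Convert each supported file in source_dir and write it as <name>.md.

    `converter` is any object with a convert(path) method whose result has a
    text_content attribute, for example a MarkItDown() instance.
    Returns (written, failed): the output paths, and (input path, error) pairs.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(source_dir.iterdir()):
        # Skip subdirectories and file types outside the allow-list.
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        try:
            result = converter.convert(str(path))
        except Exception as exc:  # exact exception types are not assumed here
            failed.append((path, exc))
            continue
        # Keep the original extension in the name so report.pdf and
        # report.docx do not overwrite each other.
        target = output_dir / (path.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written, failed
```

With the library installed, a call might look like `written, failed = convert_folder(MarkItDown(), "docs", "markdown_out")`, where both directory names are placeholders. Passing the converter in as an argument keeps the helper easy to test with a stand-in object.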
This project welcomes contributions and suggestions. Before contributing, you must agree to a Contributor License Agreement (CLA), which grants us the rights to use your contributions. For details, see https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will check whether you need to sign a CLA and guide you through the process. You only need to do this once across all repositories that use our CLA.
This project follows the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or email opencode@microsoft.com with questions or comments.
Note: This project might include trademarks or logos for various projects, products, or services. If you use Microsoft trademarks or logos, you must follow Microsoft’s Trademark & Brand Guidelines. Modified versions of this project should not imply Microsoft sponsorship. Any use of third-party trademarks or logos should adhere to their respective policies.