beagle desktop search for doc and docx



simeon_banov
25-Jun-2013, 15:48
Hello, how can I make beagle search through .doc and .docx files? I have searched Google for answers, but have only come up with Ubuntu / Kubuntu solutions which do not work, and I have been stuck on this issue for 2-3 weeks now.

OS: SLED 11 updated online and then by updated repositories.

I have tried to install another desktop search application, but in the end I noticed that everything uses beagle.
I have tried to configure beagle, but with no results.
Since the official site http://www.beagle-project.org/ is not useful and I can't seem to make progress on this issue, I have no choice but to cry for help.

malcolmlewis
25-Jun-2013, 17:20
Hi
There is a command line tool, beagle-doc-extractor, but I'm not sure there is a backend for doc/docx.



beagle[tab]

beagle-build-index    beagle-extract-content  beagle-master-delete-button  beagle-search
beagle-config         beagle-imlogviewer      beagle-merge-indexes         beagle-settings
beagled               beagle-index-info       beagle-ping                  beagle-shutdown
beagle-doc-extractor  beagle-info             beagle-query                 beagle-static-query
beagle-dump-index     beagle-manage-index     beagle-removable-index       beagle-status

simeon_banov
26-Jun-2013, 08:10
I need an application with an interface, as the people who are going to work with the computer aren't that technical.

Is there any application that can search a folder (with files and subfolders) and index the doc and docx files? The collection is more than 4 GB of doc and docx files. If beagle can't do it (with the interface, no command line), I would be grateful if anyone could point me to a solution.

Isn't there a solution with an external filter? I have found the following for Ubuntu:



<?xml version="1.0" encoding="utf-8"?>
<external-filters>
  <filter>
    <mimetype>application/msword</mimetype>
    <extension>.doc</extension>
    <command>antiword</command>
    <arguments>-t %s</arguments>
  </filter>
</external-filters>


The issue here is that I can't find "antiword" for SLED.

mikewillis
26-Jun-2013, 11:18
The source is available from http://www.winfield.demon.nl/ and it compiles on SLED 11 SP2. Click on 'Linux, Unix (with sources)', then 'version 0.37 (21 Oct 2005)', unpack the .tar.gz file, then just run 'make' in the directory. The resulting binary is called antiword. If you run 'make install', it sets things up just for your current user. Helpfully, it tells you what it's doing, so you can see that it puts the files antiword and kantiword into ~/bin and copies the contents of the Resources directory to ~/.antiword. If you want it to be usable by any user, you'd put the antiword and kantiword files in /usr/local/bin and the contents of Resources in /usr/share/antiword.

antiword works on a few doc files I tried it against. Unsurprisingly, given it was written in 2005, antiword doesn't work on docx files.
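
The all-users install described above could be scripted roughly as follows. This is only a sketch: the function name and its arguments are made up for illustration, and it assumes 'make' has already been run so that antiword, kantiword and the Resources directory all sit at the top of the unpacked source tree.

```shell
#!/bin/bash
# Sketch: install an already-built antiword for all users.
# Default paths follow the post above; the function name and its
# arguments are hypothetical.
install_antiword_systemwide() {
    local src_dir="$1"                        # unpacked source tree, after 'make'
    local bin_dir="${2:-/usr/local/bin}"      # where the programs go
    local res_dir="${3:-/usr/share/antiword}" # where the Resources files go

    mkdir -p "$bin_dir" "$res_dir" || return 1
    # The two files that 'make install' would otherwise copy to ~/bin.
    install -m 755 "$src_dir/antiword" "$src_dir/kantiword" "$bin_dir/" || return 1
    # The contents of Resources, which 'make install' would put in ~/.antiword.
    cp -r "$src_dir/Resources/." "$res_dir/" || return 1
}
```

Run as root, with the path to wherever the tarball was unpacked as the first argument. The second and third arguments only need changing if you want to try it out in a scratch directory first.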

I'm not aware of anything that will allow beagle to index docx files and can't just now locate anything that does.

simeon_banov
27-Jun-2013, 10:00
I'll give it a try, thank you for the info. I still need both doc and docx, though. What about Google Desktop Search: is there a version for SLED, and does it use beagle too (as I noticed recoll used beagle)?

mikewillis
27-Jun-2013, 13:22
Google Desktop was discontinued in September 2011 http://googledesktop.blogspot.co.uk/

Beagle seems to be discontinued too. The last version was released over four years ago. The website is gone, though the domain still exists and serves up some sort of, well, I don't know quite what it's supposed to be.

In openSUSE, beagle was replaced with Tracker (https://projects.gnome.org/tracker/), but I can't find any SLED packages of it or any information on whether it indexes docx files. I installed tracker on an openSUSE machine and copied a docx file over, and tracker isn't finding anything inside the docx file, even after I modified the file with LibreOffice. But I wouldn't take that as a definitive statement that tracker can't index docx files. (The same instance of tracker does find text inside plaintext files.)

docx files are actually zip archives, so you can unzip them and grep the contents. For example, where "Sequenced By" appears in blah.docx and "blah blah blah" doesn't:


me@mine:~/Desktop> unzip -c blah.docx | grep -c "blah blah blah"
0
me@mine:~/Desktop> unzip -c blah.docx | grep -c "Sequenced By"
1


It seems like it might be possible to make a filter for beagle using that, but I don't have time to get into trying it right now.


I don't have any recommendation from experience, as I've never bothered with beagle, tracker or any equivalent. I deliberately exclude beagle from my SLED installs because I use SLED in an environment where the home directories are mounted from a server. I exclude tracker from my openSUSE installs because I've never felt I needed it.

mikewillis
27-Jun-2013, 14:01
Following up on the unzip/grep check in my previous post, it is more efficient to extract only word/document.xml:


me@mine:~/Desktop> unzip -c blah.docx word/document.xml | grep -c "blah blah blah"
0
me@mine:~/Desktop> unzip -c blah.docx word/document.xml | grep -c "Sequenced By"
1
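
The same check could be stretched to a whole folder of docx files, roughly like this. It is only a sketch: the function names are made up for illustration, and it assumes unzip, awk and find are installed. It strips the XML tags before matching because Word can split a phrase across several tags, in which case a plain grep on the raw XML misses it. XML entities such as &amp; are left undecoded.

```shell
#!/bin/bash
# Sketch: list .docx files under a folder whose body text contains a phrase.
# All function names here are hypothetical.

# Turn the XML of word/document.xml (on stdin) into rough plain text:
# end each paragraph with a newline, then drop every remaining tag.
strip_docx_tags() {
    awk '{ gsub(/<\/w:p>/, "\n"); gsub(/<[^>]*>/, ""); print }'
}

# Succeeds if the phrase occurs in the document body of one .docx file.
docx_contains() {
    local file="$1" phrase="$2"
    unzip -p "$file" word/document.xml </dev/null 2>/dev/null |
        strip_docx_tags | grep -q -F -- "$phrase"
}

# Print every matching .docx below the given folder, subfolders included.
docx_search() {
    local dir="$1" phrase="$2" file
    find "$dir" -type f -name '*.docx' -print0 |
        while IFS= read -r -d '' file; do
            if docx_contains "$file" "$phrase"; then
                printf '%s\n' "$file"
            fi
        done
}
```

For example, docx_search ~/Documents "Sequenced By" would print the path of each matching file. This unpacks every file on every search, so on a few GB of documents it will be slow compared with a real index.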

simeon_banov
27-Jun-2013, 16:01
Thank you for your replies. I'll try to configure beagle with both antiword and unzip, and I hope it works. It's a bit strange to me that an enterprise OS like SLED doesn't offer any ready-to-use solution. Yes, it has beagle, but .doc and .docx are used a lot in most workplaces, and not having them indexed is ... well, an issue. If I can't make this work I'll try to contact support, and if even they can't help me I think my boss will be sad about buying 3 laptops with SLED.

mikewillis
27-Jun-2013, 22:00
It occurred to me that someone's probably written something to convert docx files to plain text. Several people have. I picked http://docx2txt.sourceforge.net/ because it was the most recently updated of what I found.

For proof of concept purposes I copied docx2txt.pl to /usr/local/bin and made it readable and executable by everyone:

$ chmod a+rx /usr/local/bin/docx2txt.pl

Then I created /etc/beagle/external-filters.xml containing:

<external-filters>
  <filter>
    <extension>.docx</extension>
    <command>/usr/local/bin/docx2txt.pl</command>
    <arguments>%s -</arguments>
  </filter>
</external-filters>

I copied the same blah.docx file I was using earlier into my home directory, logged out and in again to restart the beagle services, then told Desktop Search to search 'My Files' for a string I know is in blah.docx. blah.docx came up in the search results. So, success, at least in that instance. I don't have any other docx files to hand for more rigorous testing.

When I selected blah.docx in the search results, the snippet of found text didn't show up in the bottom panel like it does for some other file types. I've no idea why. I don't know if not having LibreOffice on this machine is an issue. For reasons far too tedious to go into, I'm currently unable to install LibreOffice on it.

This is all very ad hoc and rough, of course, with potentially as yet unknown kinks, quirks or limitations. Manually setting this up on multiple machines is lousy for ongoing maintenance and consistency. Ideally it all needs to be put together into an rpm which can then be installed on each machine and subsequently easily updated.
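
Short of a proper rpm, a small script that writes the filter file would at least keep machines consistent. The sketch below is untested guesswork in places: it assumes beagle accepts several filter entries in one file, that the antiword entry found for Ubuntu earlier in this thread works unchanged on SLED, and that antiword lives in /usr/local/bin. Only the .docx entry was actually tried above. The function name is made up.

```shell
#!/bin/bash
# Sketch: write a beagle external-filters file covering both .doc and .docx.
# write_beagle_filters is a hypothetical name. The .doc entry is the Ubuntu
# example from earlier in the thread with an assumed /usr/local/bin path.
write_beagle_filters() {
    local target="${1:-/etc/beagle/external-filters.xml}"
    cat > "$target" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<external-filters>
  <filter>
    <mimetype>application/msword</mimetype>
    <extension>.doc</extension>
    <command>/usr/local/bin/antiword</command>
    <arguments>-t %s</arguments>
  </filter>
  <filter>
    <extension>.docx</extension>
    <command>/usr/local/bin/docx2txt.pl</command>
    <arguments>%s -</arguments>
  </filter>
</external-filters>
EOF
}
```

Run it as root with no argument to write /etc/beagle/external-filters.xml, or pass another path to inspect the result first. The beagle services still need restarting afterwards, for example by logging out and in again.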

Out of curiosity, why did your boss buy laptops with SLED on them? Also, exactly what version of SLED 11 do you have? I ask because I just noticed your original post says 'OS: SLED 11', and there was a recent post by someone who had just bought a laptop with SLED 11 pre-loaded that turned out to be a version of SLED 11 that's already End Of Life. Running cat /etc/SuSE-release should show the relevant info. The current version is SLED 11 SP2; SP3 is imminent.

simeon_banov
28-Jun-2013, 10:39
Thank you, I'll be sure to try that. I'm not very good with Linux Systems, but I'll manage with that info.


As for why my boss bought laptops with SLED on them, I don't really know. There are several reasons I can think of (none of which he has confirmed):
- SLED is new to us, and he's always like this, trying something new out.
- He hopes it's better than the free alternatives (Ubuntu, Kubuntu), which it is, if I can manage to fix this one issue.
- It came as a promotion with the laptops.


As for the exact version of SLED 11, cat /etc/SuSE-release gives me:
SUSE Linux Enterprise Desktop 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
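
When checking several machines, that output can be boiled down to a one-line summary. A minimal sketch, assuming the file always has VERSION and PATCHLEVEL lines in the form shown above; the function name is made up.

```shell
#!/bin/bash
# Sketch: print "<version> SP<patchlevel>" from a SuSE-release style file.
# sled_version is a hypothetical name; the file layout is assumed to match
# the output quoted above.
sled_version() {
    local release_file="${1:-/etc/SuSE-release}"
    awk -F' *= *' '
        /^VERSION/    { version = $2 }
        /^PATCHLEVEL/ { patch = $2 }
        END           { print version " SP" patch }
    ' "$release_file"
}
```

On the file above this prints 11 SP2, which matches the current service pack mentioned earlier.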

simeon_banov
08-Jul-2013, 08:28
Hello, I wanted to thank everyone for helping, especially mikewillis, as his suggestions and code helped me overcome this issue. I'm surprised that searching doc and docx is not enabled by default, as that's mainly what people want to search in most cases. Anyway, the solution works well enough for me and the firm, so thank you.

mikewillis
08-Jul-2013, 13:01
Glad to hear you got this working to your satisfaction. Did you have to do anything beyond what's mentioned in this thread?

This seems like something that might be worth writing up for SUSE Conversations (https://www.suse.com/communities/conversations/) or perhaps promoting to an article, though I'm not actually sure how the latter works.


I don't think it's true that .doc and .docx are what people search for in most cases, at least among people who use Linux. A lot of Linux users avoid them wherever possible. Users of an enterprise-oriented Linux distro are arguably more likely to have to deal with these formats, though.