Hi,
In my Remind project, the file tkremind is misclassified as a shell script instead of Tcl/Tk. That's because it starts like this:
#!/bin/sh
# -*-Mode: TCL;-*-
# the next line restarts using wish \
exec wish "$0" "$@"
This is a reasonably common idiom in the Tcl world; it lets you ship a Tcl script without hard-coding the path to wish. Perhaps your language classifier could look for Emacs Mode: lines or common idioms like the above.
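A rough sketch of the kind of hint-sniffing suggested above, in Python. This is not how Ohloh's classifier works; the function names (emacs_mode, restart_interpreter, guess_language) and the small interpreter table are hypothetical, made up for illustration.

```python
import posixpath
import re
from typing import Optional

# Hypothetical table: program named on the "exec" restart line -> language.
RESTART_INTERPRETERS = {"wish": "tcl", "tclsh": "tcl"}

# Emacs file-variable line: "-*- Tcl -*-" or "-*-Mode: TCL; coding: utf-8-*-".
MODE_LINE = re.compile(r"-\*-(.*?)-\*-")


def emacs_mode(header: str) -> Optional[str]:
    """Return the lower-cased Emacs mode named in the header, if any."""
    match = MODE_LINE.search(header)
    if match is None:
        return None
    body = match.group(1).strip()
    if ":" not in body:
        return body.lower() or None  # short form: "-*- Tcl -*-"
    for part in body.split(";"):
        key, _, value = part.partition(":")
        if key.strip().lower() == "mode":
            return value.strip().lower() or None
    return None


def restart_interpreter(header: str) -> Optional[str]:
    """Find 'exec <prog>' on the line right after a comment ending in a backslash."""
    lines = header.splitlines()
    for previous, line in zip(lines, lines[1:]):
        is_continued_comment = (
            previous.lstrip().startswith("#") and previous.rstrip().endswith("\\")
        )
        match = re.match(r"\s*exec\s+(\S+)", line)
        if is_continued_comment and match:
            return posixpath.basename(match.group(1))
    return None


def guess_language(header: str) -> Optional[str]:
    """Prefer an explicit Emacs mode, then fall back to the restart idiom."""
    mode = emacs_mode(header)
    if mode:
        return mode
    # dict.get(None) simply returns None when no restart line was found.
    return RESTART_INTERPRETERS.get(restart_interpreter(header))


# The first lines of tkremind as quoted in this thread.
TKREMIND_HEADER = "\n".join([
    "#!/bin/sh",
    "# -*-Mode: TCL;-*-",
    "# the next line restarts using wish \\",
    'exec wish "$0" "$@"',
])
```

Here guess_language(TKREMIND_HEADER) picks up the mode line first; a file with only the restart idiom would still be caught by the second check. A real classifier would want to limit the search to the first few lines of the file.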
Wow, that's interesting. My Windows background must be showing, because I hadn't seen that before. I'll open a bug ticket for this. I warn you that we've got a growing backlog of language parsing enhancements, so it might take us a while.
If you'll indulge me, I have some questions about this technique. To avoid hard-coding the path to ruby, around here we use:
#!/usr/bin/env ruby
Why one method over the other? Is /usr/bin/env unlikely to work on some platforms?
Is this a fairly widespread pattern, or is it a Tcl/Tk signature? If this is a widespread practice, I'm guessing that Ohloh may be mis-classifying a lot of code as shell scripts.
Actually, /usr/bin/env is probably a better technique. The exec wish hack is ancient and may predate env.
The exec hack may be useful if you want to pass arguments to the wish interpreter. Some operating systems limit how many arguments you can put on the #! line.
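To make the argument limit concrete: on Linux, the kernel hands everything after the interpreter path to the interpreter as a single argument, while other systems split or truncate the line differently. The sketch below only models that Linux behaviour; linux_shebang_argv is a hypothetical helper, not part of any real tool.

```python
from typing import List


def linux_shebang_argv(shebang: str, script: str) -> List[str]:
    """Approximate the argv Linux builds for a script with this #! line."""
    if not shebang.startswith("#!"):
        raise ValueError("not a #! line")
    # Split off the interpreter path; everything after it stays in one piece.
    parts = shebang[2:].strip().split(None, 1)
    if not parts:
        raise ValueError("no interpreter named")
    interpreter = parts[0]
    optional_arg = parts[1:]  # zero or one element, never more
    return [interpreter] + optional_arg + [script]
```

With "#!/usr/bin/env ruby" this gives env the single argument "ruby", which works. With "#!/usr/bin/env wish -f", env receives "wish -f" as one word and looks for a program of that name, which is where the exec restart line is still handy.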
The project at http://www.ohloh.net/projects/9136/analyses/latest also has its languages and licenses misidentified: it is classified as mostly shell script, and many files (those ending in .xotcl) are not classified at all. Is there a description somewhere of how language and license identification work? This would be a good topic for the FAQ. Can a project do anything to improve the language identification (e.g. Emacs mode lines such as -*- Tcl -*-)?
Here's how our language recognition works.
File extensions are our first choice method. We have a limited mapping of file extensions to languages. Occasionally, we might need to disambiguate two different languages that share the same extension, in which case we crack open the file and look for keywords.
If there is no match based on file extension, we use the file command line tool to identify the file. As I understand it, this tool only goes as far as the first line of the file, looking for a known executable in the #! line.
If file won't recognize it, we ignore it. We don't currently look for an emacs classifier or any other hints.
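The three steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Ohloh's code: the tables, the .m disambiguation rule, and the names classify and shebang_interpreter are all hypothetical, and the #! lookup only stands in for what the post says the file tool does.

```python
import posixpath
from typing import Optional

# Step 1: a (deliberately tiny, made-up) extension-to-language mapping.
EXTENSIONS = {".c": "c", ".rb": "ruby", ".sh": "shell", ".tcl": "tcl"}

# Extensions shared by two languages: look inside the file for keywords.
AMBIGUOUS = {
    ".m": {
        "keywords": {"@interface": "objective-c", "@implementation": "objective-c"},
        "default": "matlab",
    },
}

# Step 2 stand-in: known executables that may appear on the #! line.
INTERPRETERS = {"sh": "shell", "bash": "shell", "ruby": "ruby",
                "wish": "tcl", "tclsh": "tcl"}


def shebang_interpreter(first_line: str) -> Optional[str]:
    """Return the program named on a #! line, looking through /usr/bin/env."""
    if not first_line.startswith("#!"):
        return None
    words = first_line[2:].split()
    if not words:
        return None
    name = posixpath.basename(words[0])
    if name == "env" and len(words) > 1:
        name = posixpath.basename(words[1])  # "#!/usr/bin/env ruby" -> "ruby"
    return name


def classify(filename: str, contents: str) -> Optional[str]:
    """Extension first, keywords for shared extensions, then the #! line."""
    extension = posixpath.splitext(filename)[1].lower()
    if extension in AMBIGUOUS:
        rule = AMBIGUOUS[extension]
        for keyword, language in rule["keywords"].items():
            if keyword in contents:
                return language
        return rule["default"]
    if extension in EXTENSIONS:
        return EXTENSIONS[extension]
    first_line = contents.split("\n", 1)[0]
    # Step 3: None means "not recognized", i.e. the file is ignored.
    return INTERPRETERS.get(shebang_interpreter(first_line))
```

Under this sketch a file named tkremind that begins with #!/bin/sh comes back as "shell", which matches the misclassification reported at the top of the thread, and a file with an unmapped extension and no #! line comes back as None.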
This is an area of the code that is ripe for a lot of simple improvements, if only we had the resources. We'd love to get our code open so that the community could offer some patches for this.
Whatever you are doing does not work.
See: http://www.ohloh.net/projects/10160/analyses/latest
This project is in C - i.e. it's obvious from file extensions. Your sniffer thinks it is 100% assembler... but there is not one line of assembler in it.
Still, you're in good company: Google Code can't sniff licenses either.
Well, something is amiss.
I checked out the code for this project from http://www.telegraphics.com.au/svn/picide/trunk and had a look around. I found a dozen *.asm files, and no C at all.
Is the enlistment URL correct?
Ok, I tried using file on one of our .t files (XML/JS, but obviously not included in your file extension list) and get the following:
org.vexi.widgets/src/vexi/widget$ file button.t
button.t: exported SGML document text
That would imply XML but we have very few XML listings on our project:
http://www.ohloh.net/projects/3923/analyses/latest
Interestingly we have very little C/C++ (a few files, if that).