Language identification is typically associated with natural languages—identifying the language of Tweets, for example. However, after reading a challenge on HackerRank about detecting Java, C and Python, I became interested in its application to source code. I found a few other attempts at addressing this topic along the way:
1. A paper by David Klein, Kyle Murray and Simon Weber that discusses using punctuation, keywords and operators as the basis for the identification process. The authors report 48% accuracy on 25 randomly selected source code files.
2. A recent project by Daniël Heres that appears to have achieved impressive results using “machine learning and neural networks” across 18 supported languages. He states that, “for more than 99% of the documents we predict the right language in a random subset of the data we use for testing the performance of our model.”
I started working on my own solution, `codetype`, with the intent of pursuing a strategy similar to (2). My goals were to be accurate, fast and lightweight, with minimal reliance on training data.
As seen in the Scala example above, each signature consists of five keys:
- `tokens`: A combination of keywords, punctuation and operators that are indicative of a particular language.
- `first_line`: A list of regular expressions designed to match statements typically found on the first line of source code. For example, many Scala files begin with `package <...>`, while Python files often start with either an `import` statement or a shebang.
- `unique`: A list of tokens that are uncommon in other languages.
- `flags`: A list of tokens that should not appear in one language but are found in similar languages. For instance, C# has a `struct` keyword but Java does not.
- `ignores`: A list of tokens that represent the start of a line or block that should be excluded from consideration, such as comments and strings.
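As a rough illustration, a signature might be laid out like this (the entries below are invented for illustration and do not reproduce codetype's actual Scala signature):

```python
# A hypothetical signature for Scala, showing the five keys described
# above. The specific tokens and patterns here are illustrative guesses,
# not codetype's real data.
scala_signature = {
    # Keywords, punctuation and operators indicative of Scala.
    "tokens": ["def", "val", "var", "=>", "::", "object", "trait"],
    # Regexes matching statements typically found on the first line.
    "first_line": [r"^package\s+[\w.]+"],
    # Tokens that are uncommon in other languages.
    "unique": ["implicit", "sealed"],
    # Tokens found in similar languages but not in this one.
    "flags": [],
    # Line/block starters excluded from consideration (comments, strings).
    "ignores": ["//", "/*", '"'],
}
```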
These signatures are then used as a means of computing how similar a file or snippet is to each of codetype’s known languages.
`codetype`’s core codebase consists of 220 lines of Python (excluding comments) and 21 signatures, for a total uncompressed weight of approximately 32 KB. MessagePack is the only external dependency.
Each language is associated with a “base project,” which I used to measure my progress on a per-language basis throughout development. The base projects are also used to create the MessagePack-formatted version of signatures. In addition to being in binary format, the distribution version of a signature also associates each token with its average number of occurrences in its base project.
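The per-token averages could be computed roughly like this (a sketch of the idea only, not codetype's actual build step; the helper name and the example token lists are my own):

```python
from collections import Counter

def token_averages(files_tokens, signature_tokens):
    """Average number of occurrences of each signature token across a
    base project's files (a sketch of the idea, not codetype's code)."""
    counts = Counter()
    for tokens in files_tokens:
        for tok in tokens:
            if tok in signature_tokens:
                counts[tok] += 1
    n = len(files_tokens) or 1
    return {tok: counts[tok] / n for tok in signature_tokens}

# Two token lists standing in for files from a hypothetical base project.
avgs = token_averages([["def", "val", "def"], ["val"]], {"def", "val", "=>"})
```

These averages are what would then be packed, alongside the signature itself, into the MessagePack-formatted distribution file.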
When a file or string is passed to `codetype`, it is split into tokens according to this regular expression. A signature is then generated from the tokens and compared to each known signature, producing a similarity score for each known language. These scores are then filtered as follows:
- If any known language matches both the unknown’s `first_line` and `ignores`, we consider that as the only possible match. If multiple languages match both, we take the one with the highest score.
- If both the unknown’s `first_line` and `ignores` have matches but their intersection is empty, we take the highest score across both sets.
- If only one of `first_line` or `ignores` has matches, we take the highest score from that set.
- If neither `first_line` nor `ignores` has matches, we simply take the highest score across all known languages.
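The filtering rules above can be sketched as follows (a simplified model that assumes the `first_line` and `ignores` matches arrive as sets of language names):

```python
def pick_language(scores, first_line_matches, ignores_matches):
    """Apply the filtering rules described above. `scores` maps each
    known language to its similarity score; the other two arguments are
    sets of languages whose first_line / ignores entries matched."""
    both = first_line_matches & ignores_matches
    if both:
        # Languages matching both win outright; ties break on score.
        return max(both, key=scores.get)
    if first_line_matches and ignores_matches:
        # Both sets are non-empty but disjoint: best score across both.
        candidates = first_line_matches | ignores_matches
    elif first_line_matches or ignores_matches:
        # Only one set has matches: best score from that set.
        candidates = first_line_matches or ignores_matches
    else:
        # Neither matched: best score across all known languages.
        candidates = scores.keys()
    return max(candidates, key=scores.get)
```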
Consider, for example, the following output from codetype’s CLI tool:
The code snippet `print("Hello, world!")` is syntactically valid in many of `codetype`’s supported languages. We can significantly narrow our candidate pool by making a slight change:
The language signatures tell us that `--` is a comment character in only AppleScript, Haskell and Lua. This definitely increases the percentage of correctly identified files, but it also relies heavily on accurately identifying comment delimiters.
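Narrowing candidates by comment delimiter amounts to a simple lookup against the signatures (the `ignores` entries below are illustrative stand-ins, not codetype's data):

```python
def languages_with_delimiter(signatures, delimiter):
    """Return the languages whose `ignores` list contains the given
    comment delimiter (an illustrative helper)."""
    return {lang for lang, sig in signatures.items()
            if delimiter in sig.get("ignores", [])}

# Stand-in signatures for a handful of languages.
sigs = {
    "Haskell": {"ignores": ["--", "{-"]},
    "Lua": {"ignores": ["--", "--[["]},
    "AppleScript": {"ignores": ["--", "(*"]},
    "Python": {"ignores": ["#"]},
}
```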
99.4% of files were correctly identified across the 21 base projects (14,281 files), with C (97.4%) being the least accurate. Haskell was the most common culprit in misidentification cases, particularly for Python and Ruby files. However, since the base projects were used to create and refine the signatures, these results are not particularly meaningful.
In order to better measure codetype’s ability to identify languages in the “wild,” I also tested a project from GitHub’s list of trending repositories for each language. In these randomly selected projects (7,084 files), 97.8% of files were correctly identified. C and OCaml, at 92.9% and 92.2% respectively, were the least accurate. A summary of the results is shown below:
I also performed a head-to-head comparison between `codetype`, the work published by Klein et al., SourceClassifier (the PHP port) and lang-detector on the Computer Language Benchmarks Game (Heres’ work was not tested because it is not free to use).
| Tool | Supported Languages | Total Files | Correctly Identified (%) | Time Per File (sec) |
| --- | --- | --- | --- | --- |
| Klein et al. | 24 | 643 | 20.2 | 3.364 |
As you can see, `codetype` had the most success at identifying its supported languages while also being the second fastest per file. It is important to note, though, that the test results for both SourceClassifier and the work of Klein et al. are based solely on the training they provided (lang-detector does not require training).
Finally, in an attempt to measure `codetype`’s ability to identify code snippets (rather than complete files), I used the “Hello world in every programming language” project. 90.5% (19 / 21) of the “Hello, world” snippets were correctly identified. Lua and Swift were both misidentified as Python. However, the snippet in both cases, `print("Hello World")`, is in fact syntactically valid Python as well.
I consider the results discussed above to be a promising start, but there are improvements to be made. Text parsing is the most notable area of need: there is currently no support for distinguishing between, for example, `//` as a comment delimiter and `//` as a division operator. This is an even larger issue for languages, such as Matlab, that use common operators as comment delimiters. I believe the key to solving this issue is to consider comments within the overall context of their source. In other words, if a file appears to be non-Matlab according to its `first_line` matches, then `%` is probably not a comment delimiter.
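This contextual check might look something like the following (the Matlab signature entries here are illustrative assumptions, not codetype's data):

```python
import re

# Illustrative entries only; not codetype's actual Matlab signature.
MATLAB_SIG = {
    "first_line": [r"^function\b"],
    "ignores": ["%"],
}

def treat_as_comment(delimiter, first_line, sig):
    """Treat `delimiter` as a comment marker only when the file's first
    line also looks like the language in question (a sketch of the
    contextual approach described above)."""
    if delimiter not in sig["ignores"]:
        return False
    return any(re.match(p, first_line) for p in sig["first_line"])
```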
Another means of improving comment detection (and consequently language detection) could be adding a signature key for “function definitions.” The primary goal of signatures is to be brief, but I believe that enough languages have a construct along the lines of a “function” to warrant its inclusion.
A second area in need of improvement is analysis of `first_line` patterns. Currently, only the first non-comment line is considered. However, in reality, there are often many lines that could be considered a “first line” match. Take, for instance, the following Python code snippet:
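(The particular imports below are stand-ins of my choosing; any typical module-level imports make the same point.)

```python
# A Python file that opens with a bare import followed by several
# from-imports.
import sys
from collections import defaultdict
from os import path
```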
`import sys` is only so useful, as many languages have similar statements. Including the subsequent `from <...> import <...>` statements in our analysis would allow us to consider a much smaller candidate list.
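One way to sketch this improvement, using hypothetical `first_line` patterns of my own for two languages:

```python
import re

# Hypothetical first_line patterns; not codetype's actual data.
FIRST_LINE = {
    "Python": [r"^import \w+$", r"^from [\w.]+ import "],
    "Java": [r"^import [\w.]+;"],
}

def first_line_candidates(lines, patterns=FIRST_LINE, max_lines=5):
    """Collect every language whose first_line regexes match any of the
    first few lines, instead of inspecting only the first one."""
    candidates = set()
    for line in lines[:max_lines]:
        for lang, regexes in patterns.items():
            if any(re.match(r, line) for r in regexes):
                candidates.add(lang)
    return candidates
```

Here the `from <...> import <...>` line rules out Java, which a lone `import` statement could not.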
Finally, I would like to eliminate the process of scanning each file in the base projects (as mentioned in the implementation section). This aligns with codetype’s secondary goals of being standalone (i.e., requiring nothing along the lines of “training data”), lightweight and fast. I currently do this to account for some tokens being more common than others, but ultimately I think it is an unnecessary step. In the future, I plan on creating a “point system” of sorts in which tokens can be hand-assigned values based on their frequency in a given language.
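Such a point system might look like the following sketch (the weights below are invented for illustration):

```python
# Hand-assigned token weights reflecting each token's typical frequency
# in a language; the values here are invented for illustration.
POINTS = {
    "Python": {"def": 2, "self": 3, "import": 1},
    "Lua": {"function": 2, "local": 3, "end": 1},
}

def weighted_score(tokens, weights):
    """Sum the hand-assigned weights of every recognized token."""
    return sum(weights.get(tok, 0) for tok in tokens)

py = weighted_score(["def", "self", "x"], POINTS["Python"])  # 2 + 3 + 0 = 5
```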
Check back for future updates!
You may be interested in reading Identifying source code through natural language processing and Predicting Tags for StackOverflow Questions. There is also GitHub’s Linguist, a project aimed at identifying the languages of files contained in Git repositories.