While an assembly usually carries no explicit record of the language it was compiled from, it is possible to make an educated guess.  This is because each .NET compiler leaves its own unique fingerprint.  In this article I discuss both my methodology for finding these fingerprints and which ones were unique to each language I tested.

 

Methodology

For each language I made a new class library project.  I then reflected over each compiled assembly and compared the results to determine which unique characteristics it had.  It turned out that, at least for C#, F#, VB and C++, each was uniquely identifiable by the existence, or lack thereof, of certain features.

To break it down a bit:

In each project I added one class with a single public method:

  public class CSharpClass
  {
      public void LocalMethod() {}
  }

After compiling each of these projects into its own assembly, I referenced them from another testing project.  To grab a set of features for each language, I used the following three reflection calls:

Assembly.GetTypes()
Assembly.GetCustomAttributes()
Module.GetFields( BindingFlags.NonPublic | BindingFlags.Static )
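
As a rough illustration of how these three calls can be combined, here is a minimal sketch.  The FeatureCollector class and the "type:", "attribute:" and "field:" prefixes are hypothetical names chosen for this example, not part of the original test program.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper: flattens everything the three reflection calls
// return into one set of strings so assemblies can be compared.
public static class FeatureCollector
{
    public static HashSet<string> Collect(Assembly assembly)
    {
        var features = new HashSet<string>();

        // Every type defined in the assembly, including compiler-injected ones.
        foreach (Type type in assembly.GetTypes())
            features.Add("type:" + type.FullName);

        // Assembly-level custom attributes.
        foreach (object attribute in assembly.GetCustomAttributes(false))
            features.Add("attribute:" + attribute.GetType().FullName);

        // Non-public static fields declared at module level.
        foreach (Module module in assembly.GetModules())
            foreach (FieldInfo field in module.GetFields(BindingFlags.NonPublic | BindingFlags.Static))
                features.Add("field:" + field.Name);

        return features;
    }
}
```

Calling FeatureCollector.Collect(typeof(CSharpClass).Assembly) would then give the feature set for the C# test library.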

Then, with a simple program, I found which of these features were unique for each language.  This set of unique features ultimately represents a map of the imprint each compiler leaves.
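
The "simple program" amounts to a set difference.  The sketch below is one way it could look, assuming each language's features have already been collected into a set of strings; the UniqueFeatures class name is made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keeps, for each language, only the features
// that no other language's assembly also has.
public static class UniqueFeatures
{
    public static Dictionary<string, HashSet<string>> Find(
        IDictionary<string, HashSet<string>> featuresByLanguage)
    {
        var unique = new Dictionary<string, HashSet<string>>();

        foreach (var entry in featuresByLanguage)
        {
            // All features seen in any other language's assembly.
            var others = featuresByLanguage
                .Where(other => other.Key != entry.Key)
                .SelectMany(other => other.Value);

            // Copy first so the input sets are left untouched.
            var remaining = new HashSet<string>(entry.Value);
            remaining.ExceptWith(others);
            unique[entry.Key] = remaining;
        }

        return unique;
    }
}
```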

 

F#

A compiled F# library will only have one attribute by default:

Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute

This made it the easiest to differentiate of all the languages I tested.  Even more interesting, this attribute contains three fields which give the exact version of the F# compiler used to generate the assembly:

Field      Value    Type
Major      1        int
Minor      9        int
Release    6        int

I’m always impressed with how the F# team consistently goes above and beyond when it comes to the small details.
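
A check for this attribute might look like the sketch below.  It matches on the attribute's full name so that the checking project needs no compile-time reference to FSharp.Core, and it looks up Major, Minor and Release as either properties or fields because that detail is an assumption here.  FSharpDetector is a hypothetical name.

```csharp
using System;
using System.Reflection;

// Hypothetical helper: looks for the F# marker attribute by name and
// reads its version members through reflection.
public static class FSharpDetector
{
    private const string MarkerName =
        "Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute";

    // Returns "major.minor.release" when the marker is present, otherwise null.
    public static string GetCompilerVersion(Assembly assembly)
    {
        foreach (object attribute in assembly.GetCustomAttributes(false))
        {
            if (attribute.GetType().FullName != MarkerName)
                continue;

            return Read(attribute, "Major") + "."
                 + Read(attribute, "Minor") + "."
                 + Read(attribute, "Release");
        }
        return null;
    }

    // Tries a public property first, then a public field of the same name.
    private static object Read(object target, string name)
    {
        Type type = target.GetType();
        PropertyInfo property = type.GetProperty(name);
        if (property != null)
            return property.GetValue(target, null);

        FieldInfo field = type.GetField(name);
        return field != null ? field.GetValue(target) : null;
    }
}
```

Note that GetCustomAttributes instantiates the attribute, so the F# core library has to be loadable when this runs.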

 

Visual Basic

The Visual Basic assembly I generated was also easily identifiable via extra types which were automatically added:

My.MyApplication
My.MyComputer
My.MyProject
My.MyProject+MyWebServices
My.MyProject+ThreadSafeObjectProvider`1
My.Resources.Resource
My.MySettings
My.MySettingsProperty

As you can see from this list, the existence of these types in the “My” namespace is a fairly safe indicator that the Visual Basic language was used. 
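
This check is easy to sketch.  The version below treats any namespace with a "My" segment as a hit, on the assumption that a project's root namespace may be prepended to it; VisualBasicDetector is a hypothetical name, and a C# project that happens to declare its own "My" namespace would be a false positive.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical helper: flags assemblies containing types in a "My" namespace.
public static class VisualBasicDetector
{
    public static bool HasMyNamespace(Assembly assembly)
    {
        return assembly.GetTypes().Any(type => IsMyNamespace(type.Namespace));
    }

    // True when "My" appears as a whole namespace segment,
    // e.g. "My", "My.Resources" or "RootNamespace.My".
    public static bool IsMyNamespace(string typeNamespace)
    {
        return typeNamespace != null
            && (typeNamespace == "My"
                || typeNamespace.StartsWith("My.", StringComparison.Ordinal)
                || typeNamespace.EndsWith(".My", StringComparison.Ordinal)
                || typeNamespace.Contains(".My."));
    }
}
```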

 

C++ CLI / Managed C++

C++/CLI and Managed C++ are considered to be the same language with slightly different syntax as they share the same compiler.  However, there are four different compilation modes for C++ and each has somewhat different results.

  • /CLR – Common Language Runtime Support
  • /CLR:pure – Pure Common Language Runtime Support
  • /CLR:safe – Safe Common Language Runtime Support
  • /CLR:OldSyntax – Managed C++ Syntax

The /CLR, /CLR:pure and /CLR:OldSyntax settings produce easy-to-classify assemblies, as they all inject an enormous number of types (70+) into the assembly.  I verified that each contained two types from the vc_attributes namespace:

vc_attributes.YesNoMaybe
vc_attributes.AccessType

However, /CLR:safe is quite different in that it injects no types and adds no assembly attributes by default.  The generated assembly was almost completely clean.  I was forced to use Reflector to determine how to differentiate this from C#.
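
For the three modes that do inject types, the test reduces to looking for the vc_attributes namespace.  A minimal sketch, with CppTypeCheck as a hypothetical name:

```csharp
using System.Linq;
using System.Reflection;

// Hypothetical helper: detects the types injected by /CLR, /CLR:pure
// and /CLR:OldSyntax. It returns false for /CLR:safe assemblies.
public static class CppTypeCheck
{
    public static bool HasVcAttributes(Assembly assembly)
    {
        return assembly.GetTypes().Any(type => type.Namespace == "vc_attributes");
    }
}
```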

 

C#

C# was one of the most difficult assembly types to identify.  This is because it has no unique types and only one unique attribute:

System.Reflection.AssemblyConfigurationAttribute

Unfortunately, this attribute is defined in the AssemblyInfo.cs file, where a developer can freely change or remove it, and so we can’t depend on it.  Up to this point it was only necessary to use two reflection calls:

Assembly.GetTypes()
Assembly.GetCustomAttributes()

I was hoping to keep things very simple.  However, to differentiate C# from C++ compiled with /CLR:safe it’s necessary to go a bit further.  It turns out that C++ always injects a module-level field into the assembly while C# does not.  And so by using:

Module.GetFields(BindingFlags.NonPublic | BindingFlags.Static)

we can check for the existence of this kind of field and so differentiate the two languages.
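
Wrapped up as a helper, the check could look like this.  ModuleFieldCheck is a hypothetical name, and the sketch only reports whether any such field exists, since the field's name is not given here.

```csharp
using System.Linq;
using System.Reflection;

// Hypothetical helper: reports whether any module in the assembly
// declares a non-public static field at module level.
public static class ModuleFieldCheck
{
    public static bool HasModuleLevelFields(Assembly assembly)
    {
        const BindingFlags flags = BindingFlags.NonPublic | BindingFlags.Static;
        return assembly.GetModules().Any(module => module.GetFields(flags).Length > 0);
    }
}
```

Under the observation above, a true result for an otherwise featureless assembly points to C++ and a false result points to C#.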

After some investigation with Reflector, I was able to find one particular feature unique to C#.  Unfortunately, it requires disassembling functions and looking at the resulting IL.  It seems as though a C# function definition never has a .maxstack of less than 8.  In all the other languages I observed, .maxstack was set to values as low as 0 for an empty function.
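
The same value can also be read through reflection: MethodBase.GetMethodBody() returns a MethodBody whose MaxStackSize property reports it.  The sketch below finds the smallest value in an assembly; MaxStackInspector is a hypothetical name.  One possible explanation, which is an assumption and not something tested here, is that ECMA-335 gives methods with a "tiny" header an implied maximum stack of 8.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper: finds the smallest .maxstack among all methods
// and constructors that have an IL body.
public static class MaxStackInspector
{
    // Returns null when no method in the assembly has an IL body.
    public static int? SmallestMaxStack(Assembly assembly)
    {
        const BindingFlags flags =
            BindingFlags.Public | BindingFlags.NonPublic |
            BindingFlags.Instance | BindingFlags.Static |
            BindingFlags.DeclaredOnly;

        int? smallest = null;

        foreach (Type type in assembly.GetTypes())
        {
            var methods = new List<MethodBase>();
            methods.AddRange(type.GetMethods(flags));
            methods.AddRange(type.GetConstructors(flags));

            foreach (MethodBase method in methods)
            {
                MethodBody body = method.GetMethodBody();
                if (body == null)
                    continue; // abstract, extern or runtime-provided method

                if (smallest == null || body.MaxStackSize < smallest)
                    smallest = body.MaxStackSize;
            }
        }

        return smallest;
    }
}
```

A result below 8 would, under the observation above, suggest that the assembly was not produced by the C# compiler.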

However, as I am currently concerned only with these four languages, my testing on this matter has been very shallow, so please take it with a grain of salt.

 

Conclusion

I admit that my sample assembly set was very small and my feature set very large.  However, while this type of classification may not be robust enough for a system that depends on the results being absolutely correct, I’ve shown that it is entirely possible to make reasonably confident guesses as to the language used to generate a .NET assembly using only simple reflection.  It would be interesting to see how well this holds for obfuscated assemblies as well as other “bare minimum” compilations generated via different combinations of compiler settings.

The next obvious step would be to extend what I have already written into a full Bayesian classifier.  This would be much better than a hardcoded hierarchy, which would be fragile and possibly completely and repeatedly incorrect for some cases.  Another big advantage of using machine learning here is that it would be easy to add new features and classification categories.
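
To give a feel for what that might involve, here is a minimal sketch of a Bernoulli naive Bayes classifier with add-one smoothing over string features.  It is an illustration under assumptions rather than the classifier itself: the class name, the feature strings and the choice of smoothing are all made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: each training sample is one assembly's feature set,
// labelled with the language that produced it.
public class NaiveBayesLanguageClassifier
{
    private readonly Dictionary<string, int> _sampleCounts =
        new Dictionary<string, int>();
    private readonly Dictionary<string, Dictionary<string, int>> _featureCounts =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();

    public void Train(string language, IEnumerable<string> features)
    {
        if (!_sampleCounts.ContainsKey(language))
        {
            _sampleCounts[language] = 0;
            _featureCounts[language] = new Dictionary<string, int>();
        }
        _sampleCounts[language]++;

        // Count each feature at most once per sample.
        foreach (string feature in features.Distinct())
        {
            _vocabulary.Add(feature);
            int count;
            _featureCounts[language].TryGetValue(feature, out count);
            _featureCounts[language][feature] = count + 1;
        }
    }

    // Returns the most probable language, or null if nothing was trained.
    public string Classify(IEnumerable<string> features)
    {
        var present = new HashSet<string>(features);
        int total = _sampleCounts.Values.Sum();
        string best = null;
        double bestScore = double.NegativeInfinity;

        foreach (var entry in _sampleCounts)
        {
            // Log prior: share of training samples with this label.
            double score = Math.Log((double)entry.Value / total);

            // Every known feature contributes, whether present or absent.
            foreach (string feature in _vocabulary)
            {
                int count;
                _featureCounts[entry.Key].TryGetValue(feature, out count);
                double p = (count + 1.0) / (entry.Value + 2.0); // add-one smoothing
                score += Math.Log(present.Contains(feature) ? p : 1.0 - p);
            }

            if (score > bestScore)
            {
                bestScore = score;
                best = entry.Key;
            }
        }

        return best;
    }
}
```

Adding a new language then means calling Train with more labelled samples, and a new feature is just another string in the set.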