rene_mobile’s Twitter Archive, № 8,918

                Licenses for auto-generated code will be even harder with ML models doing the generation. Who owns the copyright? Whose IPR is involved? Which license is the compilation under, given that the generated code may have been "inspired" by code from a multitude of different sources? 1/ @DocSparse/1581461734665367554
                However, I am much more concerned about the general issue of provenance than just the licensing aspect: 1. Who is the author of such code? Is it whoever queried the model with a starting line? Whoever trained the model? The sum of all authors of code in the training set? 2/
              2. Who holds copyright on that code (which, depending on the jurisdiction, cannot be assigned to another entity but remains with the original author)? 3. What was the original intent of a piece of code? What was its context of use when written? 3/
            4. What were explicit (and implicit) assumptions and boundary conditions for correct execution (as correct as we currently manage to write any piece of code)? 5. Was the original code (in the training set) for demonstration/learning (i.e. simplified) or production (messy)? 4/
          6. [With my security hat on] Has the original code been updated since then? Which version was the model trained with? Were there any (functional or security/safety) bugs in that version? Has the world (protocols etc.) around that code changed? 5/
        7. [With my paranoid, attacker mindset hat on] Was the original code an intentionally malicious (e.g., bugdoored or old/outdated) contribution to the training set, which seems trivial to do when, e.g., GitHub repositories are the "crowdsourced" training set? (CC @taviso) 6/
      @taviso For "traditional" sources of code, this kind of provenance information can be found. If it was a textbook or tutorial, assume simplified code meant for teaching concepts. If it was copied from an in-production open source project, assume a messy history behind it. 7/
    @taviso If it was original code for a new problem, the original author hopefully documented some of that - although where the learned concepts etc. came from earlier will most likely be lost in the training of that particular person's brain. However, there's an important difference: 8/
      @taviso People can typically reason about how they came up with some code, even if the full history can no longer be traced. ML models (as far as I understand the state of the art) make it close to impossible to do that. Getting even an idea of the original provenance is _hard_. 9/
        @taviso Effectively, ML models are giant mix-nets (in our circles often called anonymizers) that by design do not track which training inputs resulted in a specific output. This is the antithesis of the current trend towards #transparencylogs and #reproduciblebuilds. 10/
          @taviso And without many of these aspects of code provenance, how can we trust ML-generated code for any security- or safety-critical purpose? Of course, I might be missing some novel state-of-the-art solutions (please tell me if so), but my security hat is terribly concerned. 11/fin