Assignment

From dis-Emi-A

Jump to: navigation, search


Contents

Problem of meaning

It has come to light that defining what "assignment" does is far from obvious, or natural. To start the discussion look at some of the common notions of assignment/copy in other languages. (You will need to understand the terms from structure of a variable)

Deep Copy / Clone

This is the most likely canditate for what is traditionally meant by copy. Given the expression

a = b

it says that the instance of a will get a copy of the value of b.

This is clear for simple data types, but it becomes ambiguous for compound data types. The clarity continues to anything of a POD type (to borrow C++ syntax): a block of memory is copied from one location to another, possibly a variant length block. The difficulty arises if the RValue contains loops. Consider a simple list

a.next = b
b.next = c
c.next = a
d = a

A deep copy operation would imply that all of the nodes in the list are copied, and d assumes the same value as a. That is, a complete duplicate of the list is created (a Clone).

In such a simple example it is easy to see this complete graph and have a notion of deep copy, but should that graph extend over thousands of objects in a rather arbitrary fashion it is not always easy to spot, nor is it easy to write a straight-forward algorithm to perform a copy (as an exercise to you, just think of how you would write an OOP DeepCopy function that would work if the data contained loops).

The situation is even worse if in the above one did rather

 b = a

One could see that if a complete Clone of a is created first one could then assign it to b, however, what does that really mean to the b reference from the original a? By example:

a.text = "One"; a.next = b;
b.text = "Two"; b.next = c;
c.text = "Three"; c.next = a;
b = a;
print a.text, b.text, c.text

What should be printed? I'm certain b.text should be "One" now, but what about a and c? Again, remember this is a near trivial case, so if we can't find an answer for it, how would we possibly address an even normal scenario.

Languages that have such a construct are:

  • C does this on structs so long as no pointers are used
  • C++ does this as the default = operator, though it can be overridden, and has some limitations
  • Java does this only for fundamental types (int, float, etc...)
    • Java serialization can however load/save proper deep copies of arbitrary types
  • Python augmented assignment can also do in-place modifications
    • explicitly Python copy.deepcopy

Surface Copy / Clone

Unlike a deep copy, a surface copy simply looks at the first instance layer and copies the value to there from the other instance layer. Any further references are not descended into, thus creating a surface, or shallow copy.

This construct is very easy to implement as a compiler: it is essentially a single memcpy operation (and if you look closely at structures in C you'll realize that C only actually supports surface copies and those structures which are deep copied are only due to the manner in which POD types are handled).

The problem with a surface copy is that it is normally quite useless and only in limited cases what you really want (usually in cases where it is equivalent to a deep copy). The one case where it is useful is when an object contains a series of POD like data combined with pointers to related functions/API details. In these cases, the POD data is actually the entirety of the value and the other pointers are incidentals and likely singletons in the system.

OOP additionally makes surface copies all but useless as it breaks the notion of encapsulation. Anytime a surface copy is used, it means the hidden data of the copied object can no longer be freely refactored, since some user of the class expects certain data to be copied in the surface copy.

Languages that have such a construct:

  • C does this as the = operator, and the memcpy function for structs containing pointers
  • Java's Clone function does this
  • Python copy.copy

Reference

References are the simplest of all of the assignment like constructs and likely the easiest for a compiler to implement. A reference operation simply means that some variable will be modified to point to another instance, thus two or more variables are actually referring to the same instance, and in turn thus all always have the same value.

Languages that have such a construct:

  • C can do this with pointers
  • C++ does this with references
  • Java does this for all types except fundamentals
  • Python does exlusively this (though a library adds other copy semantics)
  • Ruby basically does this (with tricks for small/fundamental objects)

Definition / Replacement

A simple example can illustrate the difference between copying and defining in this sense (using c like syntax):

a = 1;
b = 2;
c = a + b;
b = 10;
d = c;  //d == 3
a = 1;
b = 2;
#define c a + b
b = 10;
d = c;  //d == 11

This illustrates the question of binding and is a fundamental difference between functional and imperative langauges. When values are "assigned", regardless of which copy method that means, is it done immediately, or is done later when actually needed.

Bind's and closures seen in more recent languages (and in the C++ Boost library) are good indicators that although for most programs the first example is the expected norm, there are several occassions where the second would provide for improved program logic.

Languages that have such a construct:

  • Macros in C/C++
  • OCaml with the "let" construct

Mutable

The meaning of the assignment operator is also heavily tied to whether the language allows side-effects and/or mutable variables.

In a primarily functional language (OCaml), where no side effects can occur, there is no real distinction between deep/shallow copy and reference. Since none of the pointed to values can change, taking a reference to them or copying them will have no impact. This is one manner of removing ambiguity.

Similarily in Python and Java you have primarily immutable values (at least most of the fundamental types). This means that once an object is created, such as a String, its value will never change. In the same manner you've avoided the need to distinguish between copy and reference. Both of these languages do however allow mutables, and they both use assignment by reference.

Another variant of this would be a language that allows only a single assignment to a value. In this manner the difference between copying a value and creating a Clone of a value are indistinguishable, thus solving a second set of problems. The intermediary in GCC is such a language.

Such restrictions always make the job for the compiler easier, but more importantly they remove certain kinds of errors that can be made by reducing ambiguity. They do however also reduce flexibility. For example, immutables and functional languages are great at taking input an producing an output -- this can be expressed cleanly and without ambiguity. They are however not great at modifying data, such as file entries or database records, and usually involve the use of mutable functions to do so. Nonetheless, for data transformation/calculation however the functional method is preferred as it is much cleaner.


Problem of syntax

Aside from the problem of meaning, comes the problem of syntax. Recall that an abstract variable is actually the set of a name and a value, and for a computer a pointer and a block of memory.

So, given a simple construct such as:

a = 5

The full set of knowledge the compiler/computer knows is a chain of references:

"a" is the key (name) which maps to a pointer (reference) to a block of memory (instance) which is the (value) 5

Okay, there is a lot more than that actually, since you need to consider the domain/type of the data. So if somewhere in the code you see

a

as an RValue, what syntactically does that refer to:

  1. The value of a (which is 5)
  2. The name 'a'
  3. The pointer to the memory referenced by a
  4. the sizeof the block of memory used
  5. The actual block of memory, in raw format
  6. the type of the value of a (which is an Integer)
  7. the evaluation of a (if a is a closure, which it isn't in this case)

You can't ignore any of these cases, since they are all actually used, at least in a logical sense. Consider as example:

  1. 6 + a (results in 11)
  2. #define VarName( var ) #var (VarName(a) results in "a")
  3. &a (the address in memory)
  4. sizeof( a) (probably 4)
  5. memcpy( target, &a, sizeof(a) ) (we are actaully wanting the block of memory, not the address)
  6. type_info( a ) (a C++ class which has RTTI about a)
  7. n/a

Using the C/C++ syntax is a great way to show how all these syntactic views of 'a' have diverged into a rather unpleasant set of malarky. Do other languages fair better? No. None that I've seen at least.

Auto-evaluation

Once closures / code-blocks are introduced in a language the syntactic problem is extended. Given the following example:

a = { y + 6 };

meaning that a is the code to the right, and where y is some variable in scope, if we see something like

b = a + 5;

it is easy to interpret that we want the evaluation of a (since we can't add to the closure itself, only it's value), but if we have:

b = a

It is unclear if we intend on using the evaluation of a, or indeed assigning the complete expression contained in a.

Languages:

  • C requires a functional operator to evaluate any function
  • C++ does too, however a proposed extension would allow member variables to have automatic evaluation
  • Ruby does auto-evaluation by default

Decision

So what do we do? Here are some of the vital points in making a decision (not all explained above)

  1. people don't think about what is happening and just assume it work
  2. Deep-copy has several conditions which are ambiguous and not tractable
  3. Reference assignment is unambiguous and can always be done
  4. Definition/closures are capable of significantly reducing duplicate code
  5. Mutables and immutables both have their required uses
  6. "assignment" should always mean one thing, no ambiguity

So we can come up with some implementation points:

Reference assignment 
this is the only one that can be done consistently and always has a definable result. This also allows OOP inheritance to be usable by all types such that all receivers of these types can use derived types.
A Library Deep Copy 
Due to the potential hazards/intractability of deep copy it should not be the generic definition of assignment. This can then easily be added as a library function, and/or introduced as an interface type definable on objects (with the option then of adding further semantics such as surface copy, etc.)
 ??? Auto-evaluation/Non-evaluation by default 
The language is auto-typed, so let's use it. Whenver a closure is encountered it can always evaluate to either the closure, or the result of the closure. The enclosing code semantics will determine which of those is correct. Additionally explicity syntax will be given to force one or the other in an RValue. In both cases though "assignment" still takes the reference of the RValue, so assignment is consistent.
 ??? OR everything is a closure 
to avoid the evaluation problem we can also simply define that all values are actually closures taking 0 parameters and having 1 or more returns. Evaluation of a closure requires an explicit operation, otherwise the closure itself will be maintained. This yields some very useful properties, but also some confusing ones.

Everything a closure

a = 1; //a :> None -> 1
b = 2; //b :> None -> 2
x = { a + b; }; // x :> None -> a + b
y = x;  // y is { a + b; }
q = y(); //q is 3
a = 3;
r = y(); //r is 5

n = a + b; //what does this mean then?!

The last expression is curious, since it everything is a closure we can't really do "addition" on them, we can only do addition on the evaluation of the closures. So in the last item should we:

  • do auto-evalution, doing so defeats the whole purpose of everything is a closure, we could just stick with auto-evaluation and be done with it
  • treat all expressions as closures until an evaluation is called and make evaluation be a complete evaluation, resolving all contained closures recursively (could we define a less then fulyl recursive solution, perhaps all first-order closures are resolved?)

If we consider the second option the above code has a different meaning:

a = 1; //a :> None -> :Final: 1
b = 2; //b :> None -> :Final: 2
x = a + b;  //x :> None -> ctx.a + ctx.b;
y = x; //y :> None -> x
q = y(); // --> (None -> x)() --> x() --> (None -> ctx.a + ctx.b)() --> ctx.a() + ctx.b() --> 1 + 2 --> 3

This would make the system perform like Maxima, which would be really nice, but I suspect it won't make the compiler's job easy (well, from a compiler's viewpoint this is just a bunch of closures, so this just won't make the optimizer's job easy).

Personal tools