Motion Video Summary Model


Primary Algorithm Design by: Nick Chang, Primary JAVA Implementation by: Jack Huang; Special Help from: Di Zhong


Abstract

When searching a large online video database, it is often hard to pick out the exact clip we want. If video clips could be summarized on the server side so that users can preview a clip before downloading it, much time could be saved. Various techniques have been used to summarize motion video clips; the most popular is the key frame. Although a key frame is simple, it does not convey the temporal and spatial information of the motion picture. In the following discussion, we propose another video summary technique that includes those two key features.

The technique discussed here uses a well-defined model, the Affine Model, which is used for object extraction within a video. Video database search engines based on this model have been developed, such as "VideoQ" at Columbia University. The model estimates the motion of the foreground and background objects of a video clip and fits the motion to a set of parameters. While VideoQ uses the Affine Model parameters as general descriptors to search for similar video clips, the following algorithm uses these descriptors as a kind of motion vector to regenerate the scene of a clip with synthetic objects. In doing so, we not only provide users with an alternative summary of a motion picture, but also make it feasible to view the temporal content of a video online.

 

Algorithm

Two sets of parameters are extracted from AMOS, each an array of size 6 * (number of frames), i.e. six Affine Model parameters per frame. The first set describes the foreground object and the second set describes the background.

Translate the Affine Model into polar coordinates:

Affine Model Parameters: (a0, a1, a2), (a3, a4, a5)
Polar Coordinate Parameters (foreground): (r, theta)

X[1] = a0 + a1*X[0] + a2*Y[0]
Y[1] = a3 + a4*Y[0] + a5*X[0]

beta is a correction factor, explained below.

MV(x) = a0 + (a1-1)*X + a2*Y

MV(y) = a3 + (a4-1)*Y + a5*X
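As a minimal sketch of the formulas above, the following Java class applies one frame's six parameters to a point and computes the motion vector at that point. The class and method names are hypothetical and are not taken from the original implementation. The sketch assumes that a4 multiplies Y and a5 multiplies X, matching the MV(y) formula.

```java
/** Hypothetical helper holding one frame's six affine parameters a0..a5. */
final class AffineMotion {
    private final double[] a; // a[0]..a[5]

    AffineMotion(double... params) {
        if (params.length != 6) {
            throw new IllegalArgumentException("expected 6 parameters");
        }
        this.a = params.clone();
    }

    /** Maps a point (x, y) in the current frame to its position in the next frame. */
    double[] apply(double x, double y) {
        // Assumption: a1 scales x and a4 scales y, consistent with the MV formulas.
        double nx = a[0] + a[1] * x + a[2] * y;
        double ny = a[3] + a[4] * y + a[5] * x;
        return new double[] {nx, ny};
    }

    /** Motion vector at (x, y): next position minus current position. */
    double[] motionVector(double x, double y) {
        double mvx = a[0] + (a[1] - 1.0) * x + a[2] * y;
        double mvy = a[3] + (a[4] - 1.0) * y + a[5] * x;
        return new double[] {mvx, mvy};
    }

    public static void main(String[] args) {
        // Pure translation by (2, -1): a1 = a4 = 1 and a2 = a5 = 0.
        AffineMotion m = new AffineMotion(2.0, 1.0, 0.0, -1.0, 1.0, 0.0);
        double[] mv = m.motionVector(10.0, 20.0);
        System.out.println(mv[0] + ", " + mv[1]);
    }
}
```

With a1 = a4 = 1 and a2 = a5 = 0 the motion vector reduces to the translation (a0, a3) at every point, which is a quick sanity check on the parameter order.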

Object Model: A realization of the polar coordinate model above:

Foreground object:

Operation:

Background object:

Operation:

 

Mathematical Discussion:

After some manipulation, we can derive the following relationships for the foreground:

where (x, y, r) describes the foreground system in the current frame.

As these equations suggest, delta_r represents the change in the distance from the center of the foreground object to the satellite, while delta_theta represents, approximately, the change in the angle. It is an approximation because we approximate a curve by a straight line. More specifically, in polar coordinates the change in angle, in radians, is the arc length divided by the radius of the circle. We therefore add the beta function as a correction. To guard against a drastic change in angle, beta should be made proportional to the radius, because the larger the circle, the larger the error in our original approximation without beta.
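The straight-line approximation can be illustrated with a generic sketch. The code below is not the derivation used in this project, and it does not implement beta. It is a hypothetical Java helper that compares the exact polar change of a satellite at (x, y), measured from the object center, with a linear approximation that projects the displacement (mvx, mvy) onto the radial and tangential directions.

```java
/** Hypothetical sketch: polar change of a satellite at (x, y) relative to the object center. */
final class PolarDelta {
    /** Exact {delta_r, delta_theta} after moving the satellite by (mvx, mvy). */
    static double[] exact(double x, double y, double mvx, double mvy) {
        double r0 = Math.hypot(x, y);
        double r1 = Math.hypot(x + mvx, y + mvy);
        double dTheta = Math.atan2(y + mvy, x + mvx) - Math.atan2(y, x);
        // Wrap the angle difference into (-pi, pi].
        if (dTheta > Math.PI) dTheta -= 2 * Math.PI;
        if (dTheta <= -Math.PI) dTheta += 2 * Math.PI;
        return new double[] {r1 - r0, dTheta};
    }

    /** Linear approximation of {delta_r, delta_theta}; requires (x, y) != (0, 0). */
    static double[] linear(double x, double y, double mvx, double mvy) {
        double r = Math.hypot(x, y);
        double dR = (x * mvx + y * mvy) / r;            // radial component of the displacement
        double dTheta = (x * mvy - y * mvx) / (r * r);  // tangential component divided by r
        return new double[] {dR, dTheta};
    }
}
```

The gap between the two delta_theta values is the kind of error that a correction term such as beta would have to absorb.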

The background object is straightforward, as stated above. We change the spacing between the grid lines according to (a1, a4) of the background, and we shift the grid up-down and right-left according to (a0, a3) of the background. To formalize this, an equation is provided for zooming:

making the zoom ratio the same in both (x, y).
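The background update can be sketched roughly as follows. The zoom equation itself is not reproduced here, so this hypothetical Java class assumes one simple way to get a single ratio for both axes: the mean of a1 and a4. The class name, the field names and the wrapping of the offsets are illustration choices, not part of the original implementation.

```java
/** Hypothetical sketch of a background grid driven by the background affine parameters. */
final class BackgroundGrid {
    double offsetX;  // horizontal shift of the grid, in pixels
    double offsetY;  // vertical shift of the grid, in pixels
    double spacing;  // distance between neighbouring grid lines, in pixels

    BackgroundGrid(double spacing) {
        this.spacing = spacing;
    }

    /** Advances the grid by one frame using that frame's parameters a[0..5]. */
    void step(double[] a) {
        // Assumption: one zoom ratio for both axes, taken as the mean of a1 and a4.
        double zoom = (a[1] + a[4]) / 2.0;
        spacing *= zoom;
        // Shift by the translation terms (a0, a3); wrap so offsets stay within one cell.
        offsetX = wrap(offsetX + a[0], spacing);
        offsetY = wrap(offsetY + a[3], spacing);
    }

    private static double wrap(double value, double period) {
        return value - period * Math.floor(value / period);
    }
}
```

Wrapping the offsets keeps the grid covering the screen however far the background translates, since a uniform grid looks identical after a shift of one full cell.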

Performance

The performance of such a model depends largely on how well the Affine Model parameters are extracted from the original video. While the extraction of those parameters is beyond the scope of this paper, we present the detailed adjustments to the above modeling equations as we put our model to the test.

Three video clips have been used for testing:

  1. Soccer Player – left to right translation.
  2. Free Style Skiing – rotation and translation.
  3. Statue of Liberty – zooming in.

For each clip, the original video (description) is compared with the video summary:

Soccer Player, original video:
  • The soccer player is the foreground object.
  • The camera is panning heavily with the soccer player moving forward.
  • Minimal rotation occurs when he heads the soccer ball into the net.
  • No zooming.

Soccer Player, video summary:
  • We successfully capture the foreground object with the satellite system.
  • Background moves backward while foreground remains at the center.
  • 90 degree rotation when the soccer player shoots at the net.
  • Background has a very slight zoom in and zoom out fluctuation.
  • The sense of velocity is quite good, subjectively.

Free Style Skiing, original video:
  • A skier jumps off the cliff, spins in the air and exits the screen at the lower-right corner. This projectile motion is diminished by the camera movement.
  • The camera is panning with the skier, but the object does not remain at the center of the screen throughout.
  • Heavy rotation when the skier spins in the air.
  • Slight zooming when the object approaches the camera in motion.

Free Style Skiing, video summary:
  • Cannot observe the projectile motion because of the camera panning.
  • The foreground lags behind and eventually is no longer observable, as if it exits the screen from the left (see more discussion below).
  • Heavy rotation is very well captured.
  • The sense of velocity is quite good, subjectively.

Statue of Liberty, original video:
  • A camera zooms into the Statue of Liberty gradually. The object remains at the center of the screen all the time.
  • No camera panning or translation.
  • No rotation.

Statue of Liberty, video summary:
  • The foreground object does enlarge, and so does the background.
  • But the object sits slightly to the left side of the screen.
  • The speed of zooming is very well captured.
  • No rotation.

Generally speaking, the performance of our modeling system is quite good, although there is a problem in case two. There, the object eventually lags behind and leaves the screen, which does not faithfully reproduce the original video. After carefully examining the cause, we find that the translation parameters generated by AMOS are consistently negative after the middle of the video sequence. Because all our motion is cumulative, i.e. each value is added to the past position, the object eventually falls out of bounds. To fix this problem, we might scale down the pixel movement of the foreground object relative to the background, so that although it moves backward it does not fall out of bounds so easily, while we still maintain the sense of velocity coming from the background. This tuning can be arbitrary and difficult, because we would have to experiment with more video sequences before a finer adjustment could be made.
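The cumulative drift and the proposed scaling fix can be illustrated with a small sketch. The Java class below is hypothetical, and the translation values used to exercise it are synthetic rather than taken from AMOS. It accumulates per-frame x translations, optionally scaled down, and reports the first frame at which the object leaves the screen.

```java
/** Hypothetical sketch: accumulate per-frame translation with an optional scale factor. */
final class ForegroundTrack {
    /**
     * Returns the x position after each frame, starting from startX.
     * Each frame's translation tx[i] is multiplied by scale before being added.
     */
    static double[] accumulate(double startX, double[] tx, double scale) {
        double[] xs = new double[tx.length];
        double x = startX;
        for (int i = 0; i < tx.length; i++) {
            x += scale * tx[i]; // motion is cumulative: each step adds to the past position
            xs[i] = x;
        }
        return xs;
    }

    /** Index of the first frame where x leaves [0, width], or -1 if it never does. */
    static int firstOutOfBounds(double[] xs, double width) {
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] < 0 || xs[i] > width) {
                return i;
            }
        }
        return -1;
    }
}
```

For example, five consecutive translations of -40 pixels from a start of x = 160 on a 320 pixel wide screen take the object off the left edge at the fifth frame, whereas a scale of 0.5 keeps it on screen for the whole run. Choosing the scale remains an empirical matter, as noted above.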

 

Adjustments

Some detailed adjustments were made along the way for visual appeal:

Here, the larger the foreground object is, the harder it is for our eyes to perceive its change in size. Thus, we make the change proportional to the current size so that it is visually perceptible.

Here, we increase the zooming speed by a factor of 5 so that the change is visually perceptible. This factor is purely empirical.
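A rough sketch of these two adjustments, under stated assumptions: the adjustment equations themselves are not reproduced here, so the hypothetical Java helper below simply makes the size change proportional to the current size and to the deviation of the zoom ratio from 1, then multiplies it by the factor of 5.

```java
/** Hypothetical sketch of the size update for the foreground object. */
final class ZoomDisplay {
    static final double ZOOM_GAIN = 5.0; // empirical factor of 5 from the text

    /** Returns the new on-screen size given the current size and the frame's zoom ratio. */
    static double nextSize(double size, double zoomRatio) {
        // Assumption: the change is proportional to the current size and to (zoomRatio - 1).
        return size + ZOOM_GAIN * (zoomRatio - 1.0) * size;
    }
}
```

Under this assumption a zoom ratio of 1.02 grows an object of size 100 by about 10 per frame rather than 2, which is what makes a slow zoom noticeable in the summary.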

Because the Affine Model parameters are by nature a general descriptor, used as a kind of feature vector for video comparison, some threshold restrictions should be applied to determine whether an event such as rotation or zooming has actually occurred.

While no threshold is imposed on translation, since it does not change much in terms of visual perception, thresholds for both zooming and rotation are quite important:

Zooming:

Rotation:
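
The threshold expressions for zooming and rotation are not reproduced here. As a rough sketch of the general idea only, the hypothetical Java helper below treats a zoom ratio close to 1 or an angle change close to 0 as noise and suppresses it. The threshold values are placeholders chosen for illustration and are not the values used in this project.

```java
/** Hypothetical sketch: gate zoom and rotation so that small noise is not rendered as motion. */
final class MotionEvents {
    // Placeholder thresholds for illustration; not the project's actual values.
    static final double ZOOM_THRESHOLD = 0.01;     // minimum |zoomRatio - 1|
    static final double ROTATION_THRESHOLD = 0.02; // minimum |deltaTheta| in radians

    static boolean isZoom(double zoomRatio) {
        return Math.abs(zoomRatio - 1.0) > ZOOM_THRESHOLD;
    }

    static boolean isRotation(double deltaTheta) {
        return Math.abs(deltaTheta) > ROTATION_THRESHOLD;
    }

    /** Returns the zoom ratio if it passes the threshold, otherwise 1.0 (no zoom). */
    static double gateZoom(double zoomRatio) {
        return isZoom(zoomRatio) ? zoomRatio : 1.0;
    }

    /** Returns the angle change if it passes the threshold, otherwise 0.0 (no rotation). */
    static double gateRotation(double deltaTheta) {
        return isRotation(deltaTheta) ? deltaTheta : 0.0;
    }
}
```

Translation is passed through without such a gate, in line with the remark above that no threshold is imposed on it.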

 

Conclusion:

The goal of this project is to present a way to inform users about the temporal and spatial information of a video sequence online without viewing the original video. Performance-wise, this model does a good job of conveying the sense of velocity and of illustrating the occurrence of events such as zooming, rotation and camera panning. However, as discussed above, in some circumstances this model does not faithfully reproduce what happens in the original video, nor does it tell the exact direction of movement of the foreground object. This is difficult to achieve because the Affine Model parameters give only a general description of the behavior of the objects rather than the exact movement. Besides, the exact movement is too difficult to determine given the phenomena of object rotation and camera zooming.

From a subjective point of view, nevertheless, this model works acceptably with simple motion video, such as straight-line movement, as opposed to a projectile. In addition, this model presents the sense of velocity in translation, zooming and rotation well. In my opinion, this model does summarize the spatial and temporal information of a video sequence.

Demo