Multi-Object Tracking

deep learning
computer vision
Learn how to detect objects with deep learning and track their movement over time.

Adrien Vandekerckhove


February 28, 2023


If you want to directly see the results, click here. Otherwise, I will explain the steps and ideas developed in this article as I implement them.

The goal here is to showcase what is possible with computer vision and recent deep learning techniques. The one we will study here is called multi-object tracking (MOT). We will see that MOT is a direct upgrade from object detection.

With object detection, the model is trained to predict a label and a bounding box for the objects in the scene. One limitation of that approach is that it is not suited to studying the behavior of those objects over time, since there is no relation between the objects detected in one frame and those detected in the next. That’s where multi-object tracking comes in handy. Each object in the scene is first detected by a detection model (YOLOv5, for example) and then assigned an id that is kept from frame to frame. This approach is quite powerful if a workflow requires an understanding of the movement of objects (take self-driving cars as an example).
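To make the idea of "assigning an id" concrete, here is a deliberately simplified sketch. It is not how Deep Sort works (Deep Sort combines a motion model with appearance features); it only illustrates the association step by greedily matching each new box to the previous track with the highest intersection over union (IoU). The function names and the 0.3 threshold are assumptions chosen for illustration.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, next_id, threshold=0.3):
    """Give each detection the id of the best-overlapping previous track, or a new id.

    tracks: dict mapping id -> box from the previous frame.
    detections: list of boxes found in the current frame.
    next_id: iterator that produces fresh ids.
    """
    new_tracks = {}
    free = dict(tracks)  # previous tracks that are not matched yet
    for box in detections:
        best = max(free, key=lambda t: iou(free[t], box), default=None)
        if best is not None and iou(free[best], box) >= threshold:
            new_tracks[best] = box
            del free[best]
        else:
            new_tracks[next(next_id)] = box
    return new_tracks


ids = count()
frame0 = associate({}, [(0, 0, 10, 20), (50, 0, 60, 20)], ids)
frame1 = associate(frame0, [(52, 1, 62, 21), (1, 0, 11, 20), (100, 0, 110, 20)], ids)
print(frame1)
```

The two people who moved slightly keep their ids, while the box that overlaps no previous track receives a new one. A real tracker also has to cope with occlusions and missed detections, which is what the more elaborate models address.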

For the sake of simplicity in this demo, we will use MOT to detect and track people in a shopping center. The scene is kept simple to showcase the main ideas with MOT.

Let’s outline the general approach taken here :

  • Find a source video
  • Apply an MOT model to the video
  • Save the results of MOT
  • Transform the results
  • Analyze the results
  • Load them nicely in a video

Let’s now outline the tools and methods for each point :

  • Video by Coverr-Free-Footage taken from Pixabay
  • We will use the Deep Sort Model with the MMTracking library
  • We simply save the results as a CSV file. If you know that you will process lots of data, Parquet is better suited.
  • We will extract a bit more information from the bboxes, such as the speed along the x and y axes.
  • Once we have all the raw results from the tracking and the new features, we can do some exploratory analysis of the data.
  • Finally, we will reconstruct a video that adds new information about the scene with OpenCV
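As a small illustration of the storage choice mentioned in the list above, the sketch below writes a toy table to CSV and, when an optional engine is installed, to Parquet. The column values are made up, and in-memory buffers are used only to keep the example self-contained.

```python
import io

import pandas as pd

# Toy stand-in for the tracking results (values are made up for illustration).
df = pd.DataFrame(
    {
        "time_step": [0, 0, 1],
        "id": [0, 1, 0],
        "xmin": [618.2, 801.6, 619.0],
        "ymin": [116.3, 87.9, 117.1],
    }
)

# CSV: human-readable and fine for small experiments.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer)
csv_buffer.seek(0)
restored = pd.read_csv(csv_buffer, index_col=0)

# Parquet: columnar and typed, usually smaller and faster on large data.
# It needs an optional engine such as pyarrow, hence the guard.
try:
    parquet_buffer = io.BytesIO()
    df.to_parquet(parquet_buffer)
except ImportError:
    print("Install pyarrow or fastparquet to write Parquet files")
```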

Note: Deep Sort isn’t the newest MOT model but still has good accuracy and is enough for our use case. For more information on the state of the art in MOT, take a look at Papers with Code for the MOT20 benchmark



Finding a source video

Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import mmcv
import cv2

from mmtrack.apis import inference_mot, init_model
from dataclasses import dataclass, asdict
from shapely import Polygon, Point
from IPython.display import Video
from pathlib import Path
from tqdm import tqdm

Here is the video we will be working with :


Apply MOT on the video

Once mmtracking is installed, it is easy to run inference with the models supported by the library. The two main functions are init_model, which loads a model, and inference_mot, which runs inference with a multi-object tracking model. You can then call the model’s show_result method to get a new frame on which the bboxes are drawn.

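# MOT_CONFIG is assumed to be defined beforehand as the path to a Deep Sort config file from MMTracking
def track_video(input_video, out_path, out_video_name, config=MOT_CONFIG):
  imgs = mmcv.VideoReader(input_video)
  mot_model = init_model(config, device='cuda:0')

  # Folder that receives one annotated image per frame
  Path(out_path).mkdir(parents=True, exist_ok=True)

  results = []
  for i, img in enumerate(imgs):
      if i % 100 == 0:
          print(f'processing frame {i}')
      result = inference_mot(mot_model, img, frame_id=i)
      results.append(result)
      # Draw the tracked bboxes and save the frame as 000000.jpg, 000001.jpg, ...
      mot_model.show_result(
              img,
              result,
              show=False,
              wait_time=int(1000. / imgs.fps),
              out_file=f'{out_path}/{i:06d}.jpg')

  print(f'\n making the output video at {out_path} with a FPS of {imgs.fps}')
  mmcv.frames2video(out_path, out_video_name, fps=imgs.fps, fourcc='FMP4')
  return results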

We can now call the previous function with the input video we chose, the folder that will contain the new frames and the name for the newly constructed video.

results = track_video("./people.mp4", "./tracking", "./tracking.mp4")

As before, we can display the video with the Video class imported from IPython.display.


We can see that Deep Sort performs quite well in this specific environment. We need to keep in mind that the camera angle is quite good and that nothing is cluttering the view of the camera. Those factors play a big role in the accuracy of the model. Still, some people may sometimes be mislabeled or simply won’t be detected. Those cases should be handled with care if one wants to use MOT in production.

Save the results of MOT

To simplify the conversion of the Deep Sort results to a pandas DataFrame, we will use a dataclass. This makes it easier to see which attributes we will be working with.

@dataclass
class Person:
  time_step: int
  id: int
  xmin: float
  ymin: float
  xmax: float
  ymax: float
  score: float

  @property
  def centroid_x(self):
    return self.xmin + (self.xmax - self.xmin) / 2

  @property
  def centroid_y(self):
    return self.ymin + (self.ymax - self.ymin) / 2

  def __lt__(self, o):
    # Assumed ordering: compare two persons by their id
    return self.id < o.id

  def asdict(self):
    data = asdict(self)
    data["centroid_x"] = self.centroid_x
    data["centroid_y"] = self.centroid_y
    return data

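  @classmethod
  def df_from_tracking_results(cls, results) -> pd.DataFrame:
    # Each result holds "track_bboxes": one array per class, one row per tracked object
    data = []
    for i, result in enumerate(results):
      for track_bboxes in result["track_bboxes"]:
        for bbox in track_bboxes:
          person = Person(i, *bbox)
          person.id = int(person.id)  # the id arrives as a float from the bbox array
          data.append(person)
    return pd.json_normalize(obj.asdict() for obj in data)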

We can now simply construct a dataframe for any MOT model results from mmtracking with our dataclass.

df = Person.df_from_tracking_results(results)
df.to_csv("./tracking.csv")
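To show the input shape this conversion expects without running a model, here is a small sketch built on a mocked result list. The layout assumed here (one array per class, rows ordered as id, xmin, ymin, xmax, ymax, score) is inferred from how Person(i, *bbox) unpacks each row; the numbers are made up.

```python
import numpy as np
import pandas as pd

# One mocked result per frame; "track_bboxes" holds one array per object class.
mock_results = [
    {"track_bboxes": [np.array([[0, 10.0, 20.0, 30.0, 60.0, 0.99],
                                [1, 50.0, 20.0, 70.0, 60.0, 0.95]])]},
    {"track_bboxes": [np.array([[0, 12.0, 21.0, 32.0, 61.0, 0.98]])]},
]

columns = ["id", "xmin", "ymin", "xmax", "ymax", "score"]
rows = []
for time_step, result in enumerate(mock_results):
    for class_bboxes in result["track_bboxes"]:
        for bbox in class_bboxes:
            row = dict(zip(columns, bbox.tolist()))
            row["id"] = int(row["id"])  # ids are stored as floats in the array
            row["time_step"] = time_step
            rows.append(row)

mock_df = pd.DataFrame(rows)
# Box centre, the same quantity as the centroid of the Person dataclass.
mock_df["centroid_x"] = (mock_df["xmin"] + mock_df["xmax"]) / 2
mock_df["centroid_y"] = (mock_df["ymin"] + mock_df["ymax"]) / 2
print(mock_df)
```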

Transform the results

In this section, we will load and apply some simple transformations to our csv file to extract more interesting properties. First, we will look at the results contained in our csv file.

df = pd.read_csv("./tracking.csv", index_col=0)
   time_step  id        xmin        ymin        xmax        ymax     score  centroid_x  centroid_y
0          0   0  618.210022  116.318733  654.233826  205.645447  0.999987  636.221924  160.982090
1          0   1  801.630249   87.863060  834.679016  178.867615  0.999971  818.154633  133.365337
2          0   2  603.403503  -28.892660  628.354431   57.333500  0.999964  615.878967   14.220420
3          0   3  500.431854  300.716736  544.472168  417.534668  0.999961  522.452011  359.125702
4          0   4  727.580750   30.594791  754.745728   98.699463  0.999961  741.163239   64.647127

Everything looks good. We can now play a bit with the data. First, let’s see how many people appear in each frame.

df[["time_step", "id"]].groupby("time_step").count()
time_step  id
0          46
1          45
2          46
3          46
4          46
...        ...
336        37
337        37
338        38
339        40
340        41

341 rows × 1 columns

I said earlier that one limitation of object detection is that we cannot study the behavior of objects over time, because we cannot reidentify an object that appeared previously. We will see here that by removing this limitation, we can gain richer insight into the data. I will demonstrate it using the speed variable.

df[["speed_x", "speed_y", "d_time_step"]] = df.groupby("id").diff()[["centroid_x", "centroid_y", "time_step"]]
df["speed"] = np.sqrt(df["speed_x"] ** 2 + df["speed_y"] ** 2)
df.sample(10)[["time_step", "id", "speed_x", "speed_y", "speed"]]
       time_step  id    speed_x     speed_y      speed
12234        295  17  -1.411377   -1.271286   1.899514
11205        270  79   1.078552  -10.116051  10.173385
2122          47  32   1.045593   -0.295624   1.086581
3242          71  34   0.761230   -1.453631   1.640889
685           15   3  -0.426651   -3.009827   3.039916
4713         104   2   0.006409    0.883190   0.883213
9668         230  70  -0.035904   -2.074677   2.074987
2453          54  16  -0.587830   -0.057123   0.590599
13379        324  70   0.593063   -0.606087   0.847977
12060        291  20   0.154694   -1.100861   1.111676

As you can see, for each person, at each frame, we can estimate the speed. Note that the speed here is expressed in px/frame and not in m/s. For accurate speed estimation, we would need a camera that measures the depth of each pixel. Without it, we can only make rough estimates based on some reference points.

We will come back to the speed variables later, when we study the behavior of people over time.
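To illustrate the unit discussion above, here is a rough sketch that turns a displacement into px/frame and then into m/s. Everything about the calibration is an assumption: FPS is taken to be 30, PIXELS_PER_METER is a hypothetical scale derived from a reference object of known size, and a single scale ignores perspective. The sketch also divides by the number of elapsed frames, which matters when a person is missing for a frame.

```python
import numpy as np
import pandas as pd

FPS = 30                 # assumed frame rate of the video
PIXELS_PER_METER = 50.0  # hypothetical calibration from a known reference length

# Toy tracks: id 7 is missing at time_step 2, so one diff spans two frames.
toy = pd.DataFrame(
    {
        "time_step": [0, 1, 3, 0, 1],
        "id": [7, 7, 7, 8, 8],
        "centroid_x": [100.0, 103.0, 109.0, 400.0, 400.0],
        "centroid_y": [200.0, 204.0, 212.0, 50.0, 50.0],
    }
)

# Displacement and elapsed frames since the previous sighting of the same id.
delta = toy.groupby("id")[["centroid_x", "centroid_y", "time_step"]].diff()

# Dividing by the elapsed frames keeps the unit at px/frame even across gaps.
toy["speed_px_per_frame"] = (
    np.hypot(delta["centroid_x"], delta["centroid_y"]) / delta["time_step"]
)
toy["speed_m_per_s"] = toy["speed_px_per_frame"] * FPS / PIXELS_PER_METER
print(toy)
```

The first row of each id has no previous position, so its speed is missing, exactly as with the diff used in this article.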

Region detection

Another thing that we can do is to define specific regions on the camera/video. We can then detect people in those zones and if they are crossing one of them. Here, we will work with 3 regions I handpicked.

In practice, it could be used for prevention, by detecting dangerous events such as a car stuck on train tracks and sending an alert to the train driver ahead of time.

max_width = 1280
max_height = 720
lw = 1 # Line width

region1 = Polygon([(265, max_height - lw), (460, lw), (lw, lw), (lw, max_height - lw)])
region2 = Polygon([(265, max_height - lw), (460, lw), (785, lw), (1020, max_height - lw)])
region3 = Polygon([(max_width - lw, max_height - lw), (1278, lw), (785, lw), (1020, max_height - lw)])

img = plt.imread("./tracking/000000.jpg")

plt.plot(*region1.exterior.xy, color="red")
plt.plot(*region3.exterior.xy, color="green")
plt.plot(*region2.exterior.xy, color="blue")
plt.imshow(img)

Now that we have defined the regions, we can create a new column called region in which we will store the region the person is currently in.

regions = {"region_1": region1, "region_2": region2, "region_3": region3}

def in_region(row, regions):
    for key in regions:
        if regions[key].contains(Point(row.centroid_x, row.centroid_y)):
            return key

df = (
    df
    .assign(region = lambda x: x.apply(lambda row: in_region(row, regions), axis=1))
)

sns.scatterplot(data=df, x="centroid_x", y="centroid_y", hue="region")
img = plt.imread("./tracking/000000.jpg")
plt.plot(*region1.exterior.xy, color="red")
plt.plot(*region3.exterior.xy, color="green")
plt.plot(*region2.exterior.xy, color="blue")
plt.imshow(img)
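Detecting that someone crosses from one region to another, as mentioned at the start of this section, can be sketched by comparing each person's region with the region they were in at their previous sighting. This is only one possible approach; the toy table and the previous_region column name are assumptions for illustration.

```python
import pandas as pd

# Toy tracking table with the region already assigned to each row.
toy = pd.DataFrame(
    {
        "time_step": [0, 1, 2, 0, 1, 2],
        "id": [1, 1, 1, 2, 2, 2],
        "region": ["region_1", "region_1", "region_2",
                   "region_3", "region_3", "region_3"],
    }
).sort_values(["id", "time_step"])

# Region of the same person at their previous sighting (missing on first sighting).
toy["previous_region"] = toy.groupby("id")["region"].shift()

# A crossing is a row whose region differs from the previous known region.
crossings = toy[toy["previous_region"].notna()
                & (toy["region"] != toy["previous_region"])]
print(crossings[["time_step", "id", "previous_region", "region"]])
```

In this toy example, only person 1 crosses a boundary, going from region_1 to region_2 at time_step 2.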

Removing outliers

Outliers are not easy to handle. Here they arise from the Deep Sort model’s bad predictions. This means that we could get someone that is located in the top left corner of the camera and goes to the bottom right corner in one frame. The way outliers are treated should be dependent on the task being done. Here, we simply want to analyze the data and reconstruct a new video that takes the regions into account. That is why we will use simple techniques to remove them.

Position incoherence

As I said, the same person can be detected at two locations that are implausibly far apart from one frame to the next. This can be seen with a box plot of the speed samples.

# Show the box plots of the features that interest us
fig, ax = plt.subplots(1, 2, figsize=(10, 10))
sns.boxplot(data=df, x="speed_x", ax=ax[0])
sns.boxplot(data=df, x="speed_y", ax=ax[1])

# Add titles to the plots
ax[0].set_title("Speed X")
ax[1].set_title("Speed Y")

# Set limits to the x-axis
ax[0].set(xlim=(-100, 100))
ax[1].set(xlim=(-100, 100))

As you can see, there are a lot of outliers at values that don’t seem to be possible. The outliers seen on the boxplot can be detected using the interquartile range (IQR). We will use that method to remove them.

# Using the IQR of speed_x to remove outliers
q1 = df["speed_x"].quantile(0.25)
q3 = df["speed_x"].quantile(0.75)
iqr = q3 - q1

# Remove outliers (rows whose speed_x is NaN fail both comparisons and are dropped too)
df = df[(df["speed_x"] > q1 - 1.5 * iqr) & (df["speed_x"] < q3 + 1.5 * iqr)]
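The filter above only looks at speed_x. As a sketch of how the same rule could be applied to both components, the helper below wraps the IQR test in a function; the name iqr_mask and the toy values are assumptions for illustration, and unlike the strict comparisons above, between also keeps values exactly on the bounds.

```python
import pandas as pd


def iqr_mask(series, k=1.5):
    """Boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Toy speeds in px/frame; 250.0 and -300.0 mimic a jump across the image.
toy = pd.DataFrame(
    {
        "speed_x": [0.5, -1.0, 1.2, 0.3, 250.0, -0.7, 0.9, 0.1],
        "speed_y": [-2.0, -1.5, -300.0, -2.2, -1.8, -2.5, -1.9, -2.1],
    }
)

# Keep a row only if both components are plausible.
clean = toy[iqr_mask(toy["speed_x"]) & iqr_mask(toy["speed_y"])]
print(clean)
```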

The plots will now look a lot nicer without the extreme outliers.

# Show the box plots of the features that interest us
fig, ax = plt.subplots(1, 2, figsize=(10, 10))
sns.boxplot(data=df, x="speed_x", ax=ax[0])
sns.boxplot(data=df, x="speed_y", ax=ax[1])

# Add titles to the plots
ax[0].set_title("Speed X")
ax[1].set_title("Speed Y")

Missing values

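We will also drop the rows with missing values. At this point, they mostly come from centroids that fall outside all three regions, for which in_region returns nothing. Rows without a speed (the first appearance of each id) were already discarded by the comparisons in the previous filter.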

df = df.dropna()

Analyze the results

Now that we have a clean dataframe, we can look at the interesting part: visualizing behavior over time. In our input video, we can see many people walking in a shopping center, and we can also see that some of them are standing. We could then ask ourselves an interesting question: “What is the default behavior of the people seen on the camera, standing or walking?” This is what we will answer in this section.

The first step is to create new columns that indicate the time associated with the frame in seconds.

def get_time(time_step, fps=30):
  return time_step // fps

df["time 1s"] = df["time_step"].apply(get_time)
df["time 5s"] = df["time 1s"] // 5
df["time 20s"] = df["time 1s"] // 20

Now, all we need to do is estimate the joint distribution of the speed along the x and y axes and look at the samples in the tails of that distribution. We can sum up that idea quite simply by using seaborn to display the samples and the kernel density estimate over speed_x and speed_y. By default, kdeplot draws its lowest contour at an iso-proportion of 0.05 (the thresh parameter), so the outermost contour line encloses roughly 95% of the estimated probability mass.

(sns.FacetGrid(data=df, col="time 1s", col_wrap=5)
 .map(sns.scatterplot, "speed_x", "speed_y", alpha=0.5)
 .map(sns.kdeplot, "speed_x", "speed_y", levels=5, color="red")).set(xlim=(-10, 10), ylim=(-10, 10))

You can easily see on those plots that people are more likely to move fast from top to bottom on the camera than left to right. We can also answer our first question. Whilst there are people standing (the first contour line encapsulates the values close to zero for speed_x and speed_y) there are many more who are moving slowly or faster (the other four contour lines).

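def mark_outliers(df, group_col, col, new_colname, limits=(0.05, 0.95)):

    lb, ub = limits
    # One row per group, one column per quantile (lb and ub)
    limits = df.groupby(group_col)[col].quantile(list(limits)).unstack(level=1)

    # Flag the values that fall outside the [lb, ub] quantile band of their group
    df[new_colname] = df.apply(
            lambda x: 
            True if (x[col] < limits.loc[x[group_col], lb]) | (x[col] > limits.loc[x[group_col], ub]) 
            else False, 
            axis=1)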

We now create two new columns to store the speed outliers.

mark_outliers(df, "time 1s", "speed_x", "outlier_speed_x")
mark_outliers(df, "time 1s", "speed_y", "outlier_speed_y")

df[["speed_x", "outlier_speed_x", "speed_y", "outlier_speed_y"]]
         speed_x  outlier_speed_x     speed_y  outlier_speed_y
46      0.497681            False    1.304595            False
47      1.840714             True   -7.917877             True
48      0.404175            False    1.988087            False
49      0.804657            False   -0.509497            False
50      0.630310            False   -4.280685             True
...          ...              ...         ...              ...
14015  -0.303680            False  -14.349731             True
14016   0.120117            False   -0.358795            False
14017  -0.152954            False   -4.005455            False
14018  -0.303879            False   -0.013641            False
14020   1.839661             True   -5.155056             True

10596 rows × 4 columns

To be sure that we didn’t make any mistakes when selecting the outliers, we can plot our samples with a color for the outliers and another one for the rest of the data points.

df["outlier_speed"] = df["outlier_speed_x"] | df["outlier_speed_y"]

(sns.FacetGrid(data=df, col="time 1s", col_wrap=5, hue="outlier_speed")
 .map(sns.scatterplot, "speed_x", "speed_y", alpha=0.5)).set(xlim=(-10, 10), ylim=(-10, 10))

We can see that for each point that was outside the kernel density estimation, we correctly identified the point as an outlier.
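The row-by-row apply in mark_outliers can become slow on long videos. A possible vectorized variant, sketched here under the same column-name assumptions, computes the per-group quantiles with transform and compares whole columns at once. The function name and the toy data are illustrative only.

```python
import pandas as pd


def mark_outliers_vectorized(df, group_col, col, limits=(0.05, 0.95)):
    """Return a boolean Series flagging values outside the per-group quantiles."""
    lb, ub = limits
    grouped = df.groupby(group_col)[col]
    lower = grouped.transform(lambda s: s.quantile(lb))
    upper = grouped.transform(lambda s: s.quantile(ub))
    return (df[col] < lower) | (df[col] > upper)


# Toy data: two one-second groups of five samples each. With so few samples,
# the lowest and highest value of each group fall outside the 5%-95% band.
toy = pd.DataFrame(
    {
        "time 1s": [0] * 5 + [1] * 5,
        "speed_x": [0.1, 0.2, 0.3, 0.4, 9.0, -8.0, -0.1, 0.0, 0.1, 0.2],
    }
)
toy["outlier_speed_x"] = mark_outliers_vectorized(toy, "time 1s", "speed_x")
print(toy)
```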

This concludes this section. I showed how one can extract new information from a video by linking the objects across frames thanks to MOT. This gave us insight that would not have been possible with simple object detection.

Load the results nicely in a video

In this section, we will use OpenCV to create a new video that takes the regions into account.

Similarly to the way we used a dataclass to create our dataframe, we will use a dataclass to help us draw the needed information on the screen.

# COLOR_1 to COLOR_4 are assumed to be BGR color tuples defined beforehand
@dataclass
class Person:
    time_step: int
    id: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    score: float
    centroid_x: float
    centroid_y: float
    speed_x: float
    speed_y: float
    d_time_step: float
    speed: float
    region: str
    time_1s: int
    time_5s: int
    time_20s: int
    outlier_speed_x: bool
    outlier_speed_y: bool
    outlier_speed: bool

    def __le__(self, other):
        # Assumed ordering: by time_step first, then by id
        return (self.time_step, self.id) <= (other.time_step, other.id)

    @classmethod
    def from_dataframe(cls, df):
        return [cls(*row) for row in df.itertuples(index=False)]

    def _draw_bbox(self, img):
        color = (0, 0, 0)
        if self.region == "region_1":
            color = COLOR_1
        elif self.region == "region_2":
            color = COLOR_2
        elif self.region == "region_3":
            color = COLOR_3

        cv2.rectangle(img, (int(self.xmin), int(self.ymin)), (int(self.xmax), int(self.ymax)), color, 1)

    def _draw_text(self, img, text, color=(0, 0, 0), offset=(0, 0)):
        offset_x, offset_y = offset
        cv2.putText(img, text, (int(self.xmin) + offset_x, int(self.ymin) + offset_y), cv2.FONT_HERSHEY_SIMPLEX, 0.25, color, 1, cv2.LINE_AA)

    def _draw_labels(self, img, color=(0, 0, 0)):
        self._draw_text(img, f"ID:{self.id}", color, (0, -10))
        self._draw_text(img, f"SCORE:{self.score:.2f}", color, (0, -20))

    def draw(self, img):
        self._draw_bbox(img)
        self._draw_labels(img, COLOR_4)

Here we construct a dictionary with the time_step as the key; each value is a list of all the people appearing at that time_step.

persons = Person.from_dataframe(df)

dic = {}
outliers = {}

for person in persons:
    if person.time_step not in dic:
        dic[person.time_step] = []
    dic[person.time_step].append(person)


We can now use OpenCV to make a new video that we will call output.mp4

out = cv2.VideoWriter('output.mp4',cv2.VideoWriter_fourcc(*"FMP4"), 30, (1280,720))
cap = cv2.VideoCapture("people.mp4")

counts = df.groupby(["time_step", "region"])["region"].count()

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
for i in tqdm(range(total_frames)):
    _, image = cap.read()

    # Initially, no one is detected
    if i not in dic:
        # Assumed handling: write the frame unchanged and move on
        out.write(image)
        continue

    # Drawing the regions
    cv2.polylines(image, [np.array(region1.exterior.coords).astype(np.int32)], True, COLOR_1, 2)
    cv2.polylines(image, [np.array(region3.exterior.coords).astype(np.int32)], True, COLOR_3, 2)
    cv2.polylines(image, [np.array(region2.exterior.coords).astype(np.int32)], True, COLOR_2, 2)

    # Counting the number of people in each region (0 when a region is empty)
    count1 = counts.get((i, "region_1"), 0)
    count2 = counts.get((i, "region_2"), 0)
    count3 = counts.get((i, "region_3"), 0)

    # Drawing the number of people in each region
    cv2.putText(image, f"Region 1: {count1}", (10, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLOR_1, 1, cv2.LINE_AA)
    cv2.putText(image, f"Region 2: {count2}", (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLOR_2, 1, cv2.LINE_AA)
    cv2.putText(image, f"Region 3: {count3}", (10, 60), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLOR_3, 1, cv2.LINE_AA)

    # Drawing the bounding boxes and labels
    for person in dic[i]:
        person.draw(image)

    out.write(image)

# Release the reader and the writer so that output.mp4 is finalized
cap.release()
out.release()

Finally, we will display the results of our work.

Here is the video we took as input.


Here is the video we got by applying Deep Sort.


And here is the final video we got by combining Deep Sort and adding regions to the scene.



This concludes what I wanted to show with this demonstration. We touched upon many topics in Computer Vision and Machine Learning and applied them to answer some interesting questions we can ask ourselves about the data.

As I mentioned earlier, the models are not limited to people. You can track any kind of object. The most common ones will be accessible via pre-trained models. But you can also fine-tune those models or retrain parts of them to get what you want out of multi-object tracking.

If you follow the methodology presented here, you will be able to construct your own solutions to problems that involve computer vision systems.