Current multi-object tracking (MOT) algorithms are dominated by the tracking-by-detection paradigm, which divides MOT into three independent sub-tasks: target detection, appearance embedding, and data association. To improve the efficiency of this paradigm, this paper presents an anchor-free, one-stage learning framework that performs target detection and appearance embedding in a single unified network: for each point in the feature pyramid of the input image, the network learns both an object detection prediction and a feature representation. Two effective training strategies are proposed to reduce missed detections in dense pedestrian scenes. Moreover, an improved non-maximum suppression procedure is introduced that takes the spatial similarity and the appearance similarity of boxes into account simultaneously, yielding more accurate box detections and appearance embeddings. Experiments show that our MOT algorithm runs at real-time tracking speed while achieving tracking performance comparable to that of state-of-the-art MOT trackers. Code will be released to facilitate further studies of this problem.
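The abstract does not state how the spatial and appearance similarities are combined during suppression. The following is a minimal sketch under assumed design choices, not the paper's formulation: a lower-scoring box is suppressed only when a weighted sum of its IoU with an already kept box and the cosine similarity of their embeddings exceeds a threshold. The function names, the linear weighting `alpha`, and the `threshold` value are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def appearance_aware_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    alpha: float = 0.5,      # hypothetical weight on IoU vs. appearance
    threshold: float = 0.6,  # hypothetical suppression threshold
) -> List[int]:
    """Greedy NMS whose suppression test mixes IoU and embedding similarity.

    Hypothetical combination rule: sim = alpha * IoU + (1 - alpha) * cosine.
    Returns the indices of kept boxes in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        i = order.pop(0)  # highest-scoring remaining box is always kept
        keep.append(i)
        remaining = []
        for j in order:
            sim = alpha * iou(boxes[i], boxes[j]) + (1 - alpha) * cosine(
                embeddings[i], embeddings[j]
            )
            if sim <= threshold:  # not a duplicate of box i, so keep considering it
                remaining.append(j)
        order = remaining
    return keep


if __name__ == "__main__":
    # Three heavily overlapping boxes: boxes 0 and 1 share an embedding (same
    # person), box 2 has a different embedding (a neighbouring pedestrian).
    boxes = [(0, 0, 10, 20), (1, 0, 11, 20), (2, 0, 12, 20)]
    scores = [0.9, 0.8, 0.7]
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(appearance_aware_nms(boxes, scores, embeddings))             # [0, 2]
    print(appearance_aware_nms(boxes, scores, embeddings, alpha=1.0))  # [0]
```

In this toy example box 2 overlaps box 0 with an IoU of about 0.67, so IoU-only suppression (`alpha=1.0`) at a threshold of 0.6 discards it, whereas the appearance term keeps it because its embedding differs. This illustrates why jointly considering appearance can help in dense pedestrian scenes; the actual weighting used in the paper may differ.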