There are many things that can be done with deep learning, and video classification is just one of them. When you google video classification, you will mostly see work done with 2D CNNs. I would like to share my experience of developing a video classifier using a 3D CNN architecture.
While a 2D CNN takes a single video frame as input, a 3D CNN architecture takes an array of video frames, so it can produce more accurate scores on videos. For instance, a 3D CNN can predict whether a person is walking or running by considering several consecutive frames of the video.
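To make the input-shape difference concrete, here is a small numpy sketch (the 224x224 frame size and 10-frame depth match what is used later in the post):

```python
import numpy as np

# a single RGB frame, as a 2D CNN sees it: (height, width, channels)
frame = np.zeros((224, 224, 3))

# a short clip of 10 consecutive frames, as a 3D CNN sees it:
# the extra depth axis is what lets the network pick up motion
clip = np.stack([frame] * 10)

print(frame.shape)  # (224, 224, 3)
print(clip.shape)   # (10, 224, 224, 3)
```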
I used 5 classes of the UCF-101 action recognition dataset for training: applying lipstick, sumo wrestling, yoyo, hula hoop and punching a boxing bag. I used the first 10 frames of each video to feed the model.
def process_videos(filename, depth=10):
    framearray = []
    cap = cv2.VideoCapture(filename)  # open the video file
    for i in range(depth):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to the i-th of the first 10 frames
        success, frame = cap.read()  # read the i-th frame of the video
        frame = cv2.resize(frame, (224, 224))
        framearray.append(frame)
    cap.release()
    return np.array(framearray)  # returns the first ten frames of the video

for dirname in files:
    name = os.path.join(video_dir, dirname)
    for file in os.listdir(name):
        filename = os.path.join(name, file)
        X.append(process_videos(filename))

# shape the array as (number of input videos, 224, 224, 10, 3):
# each sample is 10 frames that are shaped 224x224x3
framearray = np.array(X).transpose((0, 2, 3, 1, 4))
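It is worth sanity-checking that the axis order produced at training time matches what the model will be fed at inference time; a quick numpy check (the batch size of 4 is illustrative):

```python
import numpy as np

# a batch of 4 videos, each with 10 frames of 224x224 RGB pixels
batch = np.zeros((4, 10, 224, 224, 3))

# moving the frame (depth) axis between the spatial axes and the channels
# gives (videos, height, width, frames, channels)
reordered = batch.transpose((0, 2, 3, 1, 4))
print(reordered.shape)  # (4, 224, 224, 10, 3)

# the matching permutation for a single clip without the batch axis,
# as used in the inference snippet near the end of the post
clip = np.zeros((10, 224, 224, 3)).transpose((1, 2, 0, 3))
print(clip.shape)  # (224, 224, 10, 3)
```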
After reading the frames, you need to split the data into train and test sets; you can use scikit-learn's built-in method for this. After that, you need to build the CNN architecture. Its input shape can be specified as framearray.shape[1:], and I chose to use five convolutional layers which have 128, 64, 32, 32 and 32 filters respectively. I also added activation, pooling, dropout, flatten and dense layers. The model architecture can be like the snippet below.
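The train/test split mentioned above can be sketched with scikit-learn as follows (the array sizes and 32x32 frames are illustrative, shrunk to keep the example light):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# illustrative data: 20 videos, each 10 frames of 32x32 RGB
X = np.zeros((20, 32, 32, 10, 3))
y = np.repeat(np.arange(5), 4)  # 4 videos per class, labels 0-4
Y = np.eye(5)[y]                # one-hot encode the 5 classes

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=y)
print(X_train.shape[0], X_test.shape[0])  # 15 5
```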
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(Activation('relu'))  # relu prevents vanishing gradients
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
model.add(Flatten())
model.add(Dense(1024))
model.add(Dropout(0.5))  # dropout prevents overfitting
model.add(Dense(5, activation='softmax'))
# softmax helps us interpret the outputs as per-class probabilities,
# since they sum to 1
model.compile(loss=categorical_crossentropy, optimizer=Adam(),
              metrics=['accuracy'])
# cross-entropy can be used for multiclass problems
callbacks = [ModelCheckpoint('3dcnn_model.h5', verbose=1, save_best_only=True, save_weights_only=False)]
#checkpoint is used for saving the best model during training
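As a side note on the softmax comment above, the way its outputs sum to 1 is easy to verify with a small numpy sketch (the logit values are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # raw scores for 5 classes
probs = softmax(logits)

print(round(probs.sum(), 6))  # 1.0
print(int(np.argmax(probs)))  # 0, the most likely class
```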
After building the model architecture, you need to fit the model and see whether the results are good. If the results are not as accurate as you expected, you can change the model's parameters or add some extra layers. Here is the code snippet for fitting the model:
epoch = 100
batch = 8  # it is helpful to feed the model with batches of frame stacks

history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
                    batch_size=batch, epochs=epoch, verbose=1,
                    shuffle=True, callbacks=callbacks)
After training the model, you can see its predictions on the test data. Scikit-learn has some built-in methods for this.
model = load_model('3dcnn_model.h5')
predictions = model.predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)  # classes 0-4
true_classes = np.argmax(Y_test, axis=1)  # decode the one-hot labels
conf = sklearn.metrics.confusion_matrix(true_classes, predicted_classes)
labels = ["HulaHoop", "YoYo", "Sumo", "Lipstick", "Boxing"]
sklearn.metrics.ConfusionMatrixDisplay(conf, display_labels=labels).plot(
    include_values=True)
A confusion matrix is an efficient way to see whether your model is trained well. If the results are not as expected, you can retrain the model. I got 88% accuracy and 0.2 loss on this dataset; you can see the results below.
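As a quick sanity check, overall accuracy can be read straight off the confusion matrix: correct predictions sit on the diagonal. A made-up 5-class matrix for illustration:

```python
import numpy as np

# an illustrative 5-class confusion matrix (rows: true, cols: predicted)
conf = np.array([
    [18, 1,  0,  1,  0],
    [ 0, 17, 2,  0,  1],
    [ 1, 0, 18,  0,  1],
    [ 0, 1,  0, 19,  0],
    [ 1, 0,  1,  0, 18],
])

# correct predictions are on the diagonal, so accuracy = trace / total
accuracy = conf.trace() / conf.sum()
print(accuracy)  # 0.9
```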
You can use the trained model to classify videos: simply use OpenCV to read video frames and feed them into the trained model. Here is the code snippet below.
frame_count = 0
framearray = []
cap = cv2.VideoCapture(video_path)  # path of the video to classify
while True:
    ret, frame = cap.read()
    if not ret:  # stop when the video ends
        break
    frame = cv2.resize(frame, (640, 480))
    frame_copy = frame.copy()
    frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
    framearray.append(frame)
    if frame_count >= 10:  # the trained model is fed with 10 frames
        framearray.pop(0)  # keep a sliding window of the last 10 frames
        new_frame_array = np.array(framearray).transpose((1, 2, 0, 3))
        prediction = model.predict(np.expand_dims(new_frame_array, axis=0))
        result = "RESULT: " + CATEGORIES[int(np.argmax(prediction))]
        cv2.putText(frame_copy, result, (5, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (255, 255, 255), 2)
    cv2.imshow("frame", frame_copy)
    frame_count += 1
    if cv2.waitKey(5) & 0xFF == 27:  # press Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
Thank you for reading. I hope it was helpful.