Movie Rating Problem

Github Repo

Movie type estimation

While training the matrix factorization model, we obtain embedding vectors for users and movies. These vectors can have any dimension, and they encode features of the movies and users as well as the relationships between them. I use a 500-dimensional vector to represent each movie and user. Running K-means on the original vectors to cluster the movies into 5 categories, and using t-SNE to reduce the dimension for plotting, I get the following results:

img1 Figure 1. Visualization of the Movie Embedding Layer with t-SNE and K-means
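The clustering and projection step described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code from the repo: the function name `cluster_and_project`, the random stand-in embeddings, and the parameter values (`perplexity=30`, `n_init=10`, the seed) are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_and_project(embeddings, n_clusters=5, seed=0):
    """Cluster on the original vectors, then project them to 2-D for plotting."""
    # K-means runs on the full-dimensional embeddings, not on the t-SNE output.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    # t-SNE is used only to get 2-D coordinates for the scatter plot.
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=seed).fit_transform(embeddings)
    return labels, coords


# Stand-in data (assumption): 200 movies with 500-dimensional embeddings.
# In practice these would be the weights of the trained movie embedding layer.
rng = np.random.default_rng(0)
movie_embeddings = rng.normal(size=(200, 500))

labels, coords = cluster_and_project(movie_embeddings)
```

Each row of `coords` can then be drawn as one point in a scatter plot and colored by the matching entry of `labels`.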

The plot suggests that the number of K-means categories may be too low: the blue area is so large that it tells us little. I also counted the percentage of each genre in each category (see the following figure), and the result shows that the K-means categories can barely identify genres either. Interestingly, the trend in Category 2 is quite different from the others. This category contains more recent movies than the other categories, and its genres have broader appeal, e.g. Adventure, Action, Comedy, Sci-Fi.

img2 Figure 2. Percentage of genres in each estimated category
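One way to compute per-category genre percentages is sketched below. The exact definition used for Figure 2 is not stated, so this example assumes "the percentage of movies in a category that carry a given genre"; the function name `genre_share_by_category` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def genre_share_by_category(categories, genres):
    """Return {category: {genre: percent of that category's movies with the genre}}."""
    sizes = Counter(categories)      # number of movies per category
    counts = defaultdict(Counter)    # genre counts per category
    for category, movie_genres in zip(categories, genres):
        # set() guards against a genre listed twice for the same movie.
        counts[category].update(set(movie_genres))
    return {
        category: {genre: 100.0 * n / sizes[category] for genre, n in genre_counts.items()}
        for category, genre_counts in counts.items()
    }


# Toy data (assumption): one K-means label and one genre list per movie.
categories = [0, 0, 1, 1]
genres = [["Action", "Comedy"], ["Action"], ["Drama"], ["Drama", "Comedy"]]

share = genre_share_by_category(categories, genres)
```

Because a movie can have several genres, the percentages within one category do not need to sum to 100 under this definition.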

Then I drew another scatter plot showing how many people rated each movie. This revealed the problem: some movies were rated by only a few people (the gray area marks movies with lower rating counts than the rest), so it is hard for the model to learn their behavior.

img3 Figure 3. Rating Count of each Movie

After removing the movies with fewer than 1000 ratings, the plots become:

img4 Figure 4. K-means Estimated Categories. This figure shows only the movies with higher rating counts, labeled by their estimated category number.

img5 Figure 5. Number of People Rating Each Movie. Points are labeled by rating count (range: > 1000).

img6 Figure 6. Movie Ratings. Points are labeled by each movie's actual rating (range: 0-5).

img7 Figure 7. Movie Production Year. Points are labeled by the year each movie was produced.
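The filtering step behind Figures 4 to 7 can be sketched as below. This is only an illustration: the `(user_id, movie_id, rating)` tuple layout and the function name `popular_movie_ids` are assumptions, and the toy call uses a small threshold instead of 1000 so that the result is easy to check by hand.

```python
from collections import Counter


def popular_movie_ids(ratings, min_count=1000):
    """Return the ids of movies that have at least `min_count` ratings."""
    # Each rating is assumed to be a (user_id, movie_id, rating) tuple.
    counts = Counter(movie_id for _, movie_id, _ in ratings)
    return {movie_id for movie_id, n in counts.items() if n >= min_count}


# Toy data (assumption): movie 10 has three ratings, movie 20 has one.
ratings = [(1, 10, 4.0), (2, 10, 3.0), (3, 10, 5.0), (1, 20, 2.0)]

kept = popular_movie_ids(ratings, min_count=3)
```

Only the embedding rows of the movies in `kept` would then be passed on to the plotting step.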

Analysis & Conclusion

We can draw some conclusions from these figures:

Future work

  1. Analyze the user embedding layer in the same way as the movie embedding layer.
  2. Use the rating timestamps to see in what order people watch movies.
  3. Find out which kinds of movies or users are more predictable.