Project 3 Copy1 red blue green orange

Academic year: 2014/2015
West Virginia University

December 31, 2018

1 Project 3 - Classification

Welcome to the third project of Data 8! You will build a classifier that guesses whether a movie is romance or action, using only the number of times words appear in the movie's screenplay. By the end of the project, you should know how to:

1. Build a k-nearest-neighbors classifier.
2. Test a classifier on data.

1.0.1 Logistics

Deadline. This project is due at 11:59pm on Friday 11/30. You can earn an early submission bonus point by submitting your completed project by Thursday 11/29. It's much better to be early than late, so start working now.

Checkpoint. For full credit, you must also complete Part 1 of the project (out of 4) and submit it by 11:59pm on Friday 11/16. You will have some lab time to work on these questions, but we recommend that you start the project before lab and leave time to finish the checkpoint afterward.

Partners. You may work with one other partner; this partner must be enrolled in the same lab section as you are. Only one of you is required to submit the project. On okpy.org, the person who submits should also designate their partner so that both of you receive credit.

Rules. Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

Support. You are not alone! Come to office hours, post on Piazza, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Piazza post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or tutor for help. You can find contact information for the staff on the course website.

Tests. Passing the tests for a question does not mean that you answered the question correctly. Tests usually only check that your table has the correct column labels.
However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work!

Advice. Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Also, please be sure not to re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on.

To get started, load datascience, numpy, plots, and ok.

In [ ]: # Run this cell to set up the notebook, but please don't change it.
        import numpy as np
        import math
        from datascience import *

        # These lines set up the plotting functionality and formatting.
        import matplotlib
        %matplotlib inline
        import matplotlib.pyplot as plots
        import warnings

        # These lines load the tests.
        from client.api.notebook import Notebook
        ok = Notebook('project3.ok')
        _ = ok.auth(inline=True)

2 1. The Dataset

In this project, we are exploring movie screenplays. We'll be trying to predict each movie's genre from the text of its screenplay. In particular, we have compiled a list of words that occur in conversations between movie characters. For each movie, our dataset tells us the frequency with which each of these words occurs in certain conversations in its screenplay. All words have been converted to lowercase.

Run the cell below to read the movies table. It may take up to a minute to load.

In [ ]: movies = Table.read_table('movies.csv')
        movies.where("Title", "the matrix").select(0, 1, 2, 3, 4, 5, 10, 30, 5005)

Title      | Genre  | Year | Rating | Votes  | Words | it  | not | fling
the matrix | action | ...  | 8...   | 389480 | 3792  | ... | ... | ...

The above cell prints a few columns of the row for the action movie The Matrix. The movie contains 3792 words. The word "it" appears 115 times, as it makes up 115/3792 ≈ 0.0303 of the words in the movie.
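As a rough illustration of how a frequency like this is obtained, the sketch below counts the words in a short made-up string and divides each count by the total. The function name word_frequencies and the sample text are hypothetical and not part of the project; the real dataset was built from full screenplays.

```python
from collections import Counter

def word_frequencies(text):
    """Map each lowercase word in text to its share of all words."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample line, not taken from any screenplay in the dataset.
sample = "It is not it"
freqs = word_frequencies(sample)
print(freqs["it"])  # 2 of the 4 words are "it"

# The same arithmetic reproduces the figure quoted for The Matrix.
print(round(115 / 3792, 6))
```

Because every count is divided by the same total, the frequencies for one movie always add up to 1, which makes movies of different lengths comparable.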
This dataset was extracted from a dataset from Cornell University. After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of common words in each movie.

In [ ]: print('Words with frequencies:', movies.drop(np.arange(6)).num_columns)
        print('Movies with genres:', movies.num_rows)

Words with frequencies: ...
Movies with genres: 242

2.1 1.1. Word Stemming

The columns other than Title, Genre, Year, Rating, Votes, and Words in the movies table are all words that appear in some of the movies in our dataset. These words have been stemmed, or abbreviated heuristically, in an attempt to make different inflected forms of the same base word into the same string. For example, the column "blast" is the sum of the proportions of the words "blasting", "blast", and "blasted" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.

Stemming makes it a little tricky to search for the words you want to use, so we have provided another table that will let you see examples of unstemmed versions of each stemmed word. Run the code below to load it.

In [ ]: # Just run this cell.
        vocab_mapping = Table.read_table('stem.csv')
        stemmed = np.take(movies.labels, np.arange(3, len(movies.labels)))
        vocab_table = Table().with_column('Stem', stemmed).join('Stem', vocab_mapping)
        vocab_table.take(np.arange(1100, 1110))

Stem    | Word
blame   | blamed
blame   | blame
blank   | blanks
blank   | blank
blank   | blankness
blanket | blanket
blanket | blankets
blast   | blasting
blast   | blast
blast   | blasted

Question 1.1.1 Assign stemmed_message to the stemmed version of the word named in the question.

In [ ]: # Set stemmed_message to the stemmed version of the word (which should be a string).
        # Use vocab_table.
        stemmed_message = vocab_table.where('Word', are.equal_to(...)).column('Stem').item(0)
        stemmed_message
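The two lookups this table supports (from a word to its stem, and from a stem to all of its words) can be sketched in plain Python using only the ten rows displayed above. The list rows and the helper names stem_of and words_with_stem are illustrative assumptions; the notebook itself performs these lookups on vocab_table.

```python
# (stem, word) pairs copied from the ten vocab_table rows shown above.
rows = [
    ("blame", "blamed"), ("blame", "blame"),
    ("blank", "blanks"), ("blank", "blank"), ("blank", "blankness"),
    ("blanket", "blanket"), ("blanket", "blankets"),
    ("blast", "blasting"), ("blast", "blast"), ("blast", "blasted"),
]

def stem_of(word):
    """Return the stem paired with word, or None if the word is not listed."""
    for stem, unstemmed in rows:
        if unstemmed == word:
            return stem
    return None

def words_with_stem(stem):
    """Return every listed word whose stem equals the given stem."""
    return [unstemmed for s, unstemmed in rows if s == stem]

print(stem_of("blankness"))
print(words_with_stem("blast"))
```

The first lookup returns a single string and the second returns a collection, which mirrors the difference between a question that asks for one stem and one that asks for an array of words.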
Question 1.1.2 Assign unstemmed_run to an array of the words in vocab_table that have the stem named in the question as their stemmed form.

In [ ]: # Set unstemmed_run to the unstemmed versions of the stem (which should be an array of strings).
        unstemmed_run = vocab_table.where('Stem', are.equal_to(...)).column('Word')
        unstemmed_run

Stem    | Word             | Chopped
respons | responsibilities | 9

2.2 1.2. Splitting the dataset

We're going to use our movies dataset for two purposes.

1. First, we want to train movie genre classifiers.
2. Second, we want to test the performance of our classifiers.

Hence, we need two different datasets: training and test. The purpose of a classifier is to classify unseen data that is similar to the training data. Therefore, we must ensure that there are no movies that appear in both sets. We do so by splitting the dataset randomly. The dataset has already been permuted randomly, so it's easy to split. We just take the top rows for training and the rest for test.

Run the code below (without changing it) to separate the datasets into two tables.

In [ ]: # Here we have defined the proportion of our data that we want to designate
        # for training as 17/20 of our total dataset. The remaining 3/20 of the data
        # is reserved for testing.
        training_proportion = 17/20

        num_movies = movies.num_rows
        num_train = int(num_movies * training_proportion)
        num_test = num_movies - num_train

        train_movies = movies.take(np.arange(num_train))
        test_movies = movies.take(np.arange(num_train, num_movies))

        print("Training: ", train_movies.num_rows, ";",
              "Test: ", test_movies.num_rows)

Training: 205 ; Test: 37

Question 1.2.1 Draw a horizontal bar chart with two bars that show the proportion of Action movies in each dataset. Complete the function action_proportion; it should help you create the bar chart.
In [ ]: def action_proportion(table):
            """Return the proportion of movies in a table that have the Action genre."""
            action = table.where('Genre', are.equal_to('action'))
            num_action = action.num_rows
            return num_action / table.num_rows

        # The staff solution took multiple lines. Start by creating a table.
        # If you get stuck, think about what sort of table you need for barh to work.
        proportion_array = make_array(action_proportion(train_movies),
                                      action_proportion(test_movies))
        cool_table = Table().with_columns(
            'Dataset', make_array('Training', 'Test'),
            'Proportion of Action', proportion_array)
        cool_table.barh('Dataset')

Congratulations, you have reached the checkpoint! Run the next cell to submit!

In [ ]: _ = ok.submit()

[Plot: training movies positioned by x_feature and y_feature]

Question 2.1.1 Compute the distance between the two action movies, Batman Returns and The Avengers, using the money and feel features only. Assign it the name action_distance.

Note: If you have a row, you can use item to get a value from a column by its name. For example, if r is a row, then r.item("Genre") is the value in column "Genre" in row r.

Hint: Remember the function row_for_title, redefined for you below.

In [ ]: title_index = movies.index_by('Title')
        def row_for_title(title):
            """Return the row for a title, similar to the following expression (but faster)

            movies.where('Title', title).row(0)
            """
            return title_index.get(title)[0]

In [ ]: batman = row_for_title("batman returns")
        avengers = row_for_title("the avengers")
        action_distance = ((batman.item("money") - avengers.item("money")) ** 2
                           + (batman.item("feel") - avengers.item("feel")) ** 2) ** 0.5
        action_distance

Below, we've added a third training movie, The Terminator. Before, the point closest to Batman Returns was Titanic, a romance movie. However, now the closest point is The Terminator, an action movie.

[Plot: Batman Returns and the training movies, now including The Terminator]
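To make the "closest point" idea concrete, here is a minimal plain-Python sketch of a two-feature distance and a nearest-neighbor lookup. The feature values are invented for illustration (the real frequencies live in the movies table), and the names distance_2d and nearest are hypothetical rather than part of the project.

```python
import math

# Invented (money, feel) values for illustration only; not from the dataset.
training = {
    "titanic": ("romance", 0.0010, 0.0020),
    "the avengers": ("action", 0.0030, 0.0002),
    "the terminator": ("action", 0.0005, 0.0005),
}
batman = (0.0004, 0.0007)  # also invented

def distance_2d(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def nearest(point, movies):
    """Return the title in movies whose features are closest to point."""
    return min(movies, key=lambda title: distance_2d(point, movies[title][1:]))

closest = nearest(batman, training)
print(closest, training[closest][0])
```

With these made-up numbers, removing "the terminator" from training makes "titanic" the nearest movie again, which is the same flip the paragraph above describes: a single nearest neighbor can change the predicted genre.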
Question 2.1.2 Define the function distance_from_batman_returns so that it works as described in its documentation.

Note: Your solution should not use arithmetic operations directly. Instead, it should make use of existing functionality above!

In [ ]: def distance_from_batman_returns(title):
            """The distance between the given movie and "batman returns", based on the
            features "money" and "feel".

            This function takes a single argument:
              title: A string, the name of a movie.
            """
            batman = row_for_title("batman returns")
            row1 = row_for_title(title)
            return ...

Question 2.1.3 Using the features "money" and "feel", what are the names and genres of the 7 movies in the training set closest to "batman returns"? To answer this question, make a table named close_movies containing those 7 movies with columns "Title", "Genre", "money", and "feel", as well as a column called "distance from batman" that contains the distance from "batman returns". The table should be sorted in ascending order by distance from batman.

In [ ]: # The staff solution took multiple lines.
        trainingset = train_movies.select("Title", "Genre", "money", "feel")
        close_movies = trainingset.with_column(
            "distance from batman",
            trainingset.apply(distance_from_batman_returns, "Title"))
        close_movies = close_movies.sort("distance from batman", descending=False).take(np.arange(7))
        close_movies

Title                         | Genre   | money | feel | distance from batman
the bridges of madison county | romance | ...   | ...  | ...
the fisher king               | romance | ...   | ...  | ...
broadcast news                | romance | ...   | ...  | ...
hellboy                       | action  | ...   | ...  | ...
as good as it gets            | romance | ...   | ...  | ...
...                           | action  | ...   | ...  | ...
harold and maude              | romance | ...   | ...  | ...

Congratulations are in order: you've classified your first movie! However, we can see that the classifier doesn't work too well, since it categorized Batman Returns as a romance movie (unless you count the bromance between Alfred and Batman). Let's see if we can do better!

4 3. Features

Now, we're going to extend our classifier to consider more than two features at a time.
Euclidean distance still makes sense with more than two features. For n different features, we compute the difference between corresponding feature values for two movies, square each of the n differences, sum up the resulting numbers, and take the square root of the sum.

Question 3 Write a function to compute the Euclidean distance between two arrays of features of arbitrary (but equal) length. Use it to compute the distance between the first movie in the training set and the first movie in the test set, using all of the features. (Remember that the first six columns of your tables are not features.)

Note: To convert rows to arrays, use np.array. For example, if t was a table, np.array(t.row(0)) converts row 0 of t into an array.

In [ ]: def distance(features1, features2):
            """The Euclidean distance between two arrays of feature values."""
            return sum((features1 - features2) ** 2) ** 0.5

        distance_first_to_first = distance(
            np.array(train_movies.drop(np.arange(6)).row(0)),
            np.array(test_movies.drop(np.arange(6)).row(0)))
        distance_first_to_first

As a check, distance(make_array(1, 2), make_array(1, 2)) should be 0 and distance(make_array(1, 2, 3), make_array(2, 4, 5)) should be 3.

[Plot and answer choices 1 to 3 not recoverable]

4. The word is common in both action and romance movies.
5. It is not possible to say from the plot.

What properties does a word in the bottom left corner of the plot have? Your answer should be a single integer from 1 to 5, corresponding to the correct statement from the choices above.

In [ ]: bottom_left = 1
What properties does a word in the bottom right corner have?

In [ ]: bottom_right = 3

What properties does a word in the top right corner have?

In [ ]: top_right = 4

What properties does a word in the top left corner have?
