## Intro to IS Lab 2: Data-crunched A* through Dubrovnik

In this lab, as discussed, you will use the collected data on walks around Dubrovnik to improve (maybe!) the accuracy of your planner's output. We will look at two different ways of modeling the time required for a path: a linear function of the path's features, and a nearest-neighbor (non-parametric) approach.

#### Getting set up

First of all, you will need to modify your planner to read the data from the provided file. Each line of the file starts with an arbitrary number of waypoints, followed by the minutes and seconds taken, the identity of the walker, and True or False for whether the walker was alone. The idea is to collect some features of the trip that correlate with the overall walk time - that is, assuming the time is a function of the waypoints (the path), we want to discover that function as accurately as possible. However, if we treat each path as its own unique entity, the learner will not be able to generalize. Instead, we will consider some features of each trip - for example, total distance walked, total elevation gain, etc. Depending on how your planner works, you may find different features helpful.

Of course, the set of waypoints may not strictly determine these features, depending on how the data was collected. Therefore, you should run the waypoints through some simple path planner (probably a simplified version of your Lab 1 solution) to create an assumed path for each list of waypoints, then write some code to extract the features of interest from that path.
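As a concrete starting point, here is one possible sketch of the parsing and feature-extraction steps. Note the assumptions: the waypoint encoding (whitespace-separated `x,y` pairs) is a guess, as is the `(x, y, elevation)` form of the path your planner produces - adjust both to match the real file and your Lab 1 code.

```python
import math

def parse_line(line):
    # Assumed layout: <waypoints...> <minutes> <seconds> <walker> <alone>,
    # with each waypoint encoded as "x,y" (adjust to the real format).
    tokens = line.split()
    alone = tokens[-1] == "True"
    walker = tokens[-2]
    minutes, seconds = int(tokens[-4]), int(tokens[-3])
    waypoints = [tuple(map(float, t.split(","))) for t in tokens[:-4]]
    return waypoints, minutes + seconds / 60.0, walker, alone

def features(path):
    # path: list of (x, y, elevation) points from your simplified planner.
    dist = sum(math.dist(a[:2], b[:2]) for a, b in zip(path, path[1:]))
    gain = sum(max(0.0, b[2] - a[2]) for a, b in zip(path, path[1:]))
    return [dist, gain, 1.0]  # trailing 1.0 acts as an intercept feature
```

The constant 1.0 feature is a convenience for the linear model later; drop it if you handle the intercept some other way.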

#### Linear modeling

The next part is simply to come up with a set of weights that best fit the function `w1*f1 + w2*f2 + ... = time`, where the `wi` are the weights, the `fi` are the feature values, and `time` is the recorded time for that path. Notice that if you think the best function includes a non-linear relationship to a feature, it is often possible to rethink your feature definition (for example, `log(distance)` as a feature instead of just `distance` - of course this particular example is a dumb one :) ). Use the simple multivariate regression technique shown in Sec. 18.6.2 of Russell & Norvig to converge to the best weights. Also, don't forget to keep some randomly chosen examples aside as test data to verify the quality of your results.

#### Non-parametric modeling

Here we will simply use a nearest-neighbor approach - the training example whose features are closest to the new path's features serves as the predictor. First run your basic planner to compute features for all training examples; then, for each test case, compute features the same way and find the training example with the closest feature values. This is not completely straightforward, in that you will have features with very different units (e.g. km of distance vs. m of elevation change, or whatever), so it will be up to you to decide how to scale these dimensions to determine "closest" in a sensible way.
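One possible sketch, scaling each dimension by its range in the training set - just one reasonable choice; dividing by each dimension's standard deviation would work too:

```python
def scale_factors(train_feats):
    # One scale per feature dimension: the range observed in training,
    # so kilometres and metres end up comparable.
    dims = list(zip(*train_feats))
    return [(max(d) - min(d)) or 1.0 for d in dims]

def nearest_time(train, query):
    # train: list of (feature_vector, observed_time) pairs.
    # Returns the time of the example closest to `query` in scaled space.
    scales = scale_factors([f for f, _ in train])
    def sq_dist(f):
        return sum(((a - b) / s) ** 2 for a, b, s in zip(f, query, scales))
    return min(train, key=lambda ex: sq_dist(ex[0]))[1]
```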

#### Experimentation

Now that you have a few different ways to estimate the time of a particular route, you should test them on some new example paths. Run your basic planner and have it spit out a time, then run both the linear function and the nearest neighbor to give their time predictions. Now walk the route(s) and see which is closest (no cheating! :) ).
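Glue code along these lines makes it easy to print the learned predictions side by side before you go walking - every argument name here is a placeholder for a piece you built in the earlier steps:

```python
def predict_all(waypoints, planner, feature_fn, weights, train, scales):
    # planner: waypoints -> path; feature_fn: path -> feature vector.
    feats = feature_fn(planner(waypoints))
    # Linear model: dot product of learned weights and features.
    linear = sum(w * f for w, f in zip(weights, feats))
    # Nearest neighbor: time of the closest training example in scaled space.
    def sq_dist(f):
        return sum(((a - b) / s) ** 2 for a, b, s in zip(f, feats, scales))
    nearest = min(train, key=lambda ex: sq_dist(ex[0]))[1]
    return {"linear": linear, "nearest": nearest}
```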

#### Writeup

For your writeup, explain what features you used, and why you expected them to produce an effective model. Then, show your results and discuss how well each learner functioned. In addition, try training on only your own examples. Does this do a more accurate job, even with fewer data points? Does it do so for all cases, or only some?

- Data processing, including feature computation: 15%
- Feature selection and defense thereof: 10%
- Linear regression learning (correctness and accuracy): 25%
- Nearest neighbor learning (ditto): 10%
- Mechanically correct prediction process\*: 15%
- Experimentation and writeup: 25%

\* By this I mean that it is possible/straightforward to produce the two (or more) learning-based predictions for a novel input, regardless of their accuracy.