# Lab 0: Introduction To Python 🐍

## Outline

1. [Why Python](#Why-Python?)
2. [About This Lab](#About-This-Lab)
    1. [A Jupyter Notebook](#This-is-a-Jupyter-Notebook)
3. [Working in Python](#Working-in-Python)
    1. [Expressions](#Expressions)
    2. [Variables](#Variables)
    3. [Strings](#Strings)
    4. [Lists](#Lists)
    5. [Control Flow and Loops](#Control-Flow-and-Loops)
    6. [Functions](#Functions)
    7. [Other Python Topics](#Other-Python-Topics)
4. [NumPy](#NumPy)
    1. [Creating Arrays](#Creating-Arrays)
    2. [Thinking of Arrays as Vectors](#Thinking-of-Arrays-as-Vectors)
    3. [Arrays as Matrices](#Arrays-as-Matrices)
    4. [Arrays have Axes](#Arrays-have-Axes)

## Why Python?

Python is one of the hottest languages in industry today, especially in machine learning and data science. According to Stack Overflow's [2018 Developer Survey Results](https://insights.stackoverflow.com/survey/2018/#most-loved-dreaded-and-wanted), Python is the third "most loved" and *the* "most wanted" language as chosen by industry professionals. It is widely appreciated for its clean, readable, and generally no-nonsense code that enables single-minded focus on the task at hand. It was designed to be both simple and friendly, yet still powerful and expressive.

In this course, we will be using Python 3.6.

## About This Lab

In this lab we assume that you have taken an introductory course in Computer Science, or are familiar with programming in another language (e.g. C++ or Java). If you are already familiar with Python, then you may skip to the [discussion of NumPy](#NumPy). Here we will briefly demonstrate how to write Python code and introduce some of the tools we frequently use to analyze data.

### This is a Jupyter Notebook

As you may already know, this lab is formatted as a Jupyter Notebook, which provides a Python environment that you can interact with. This means that you can run all of the code examples you find below, and even create your own code to try things out. In fact, you are highly encouraged to do so.

Each chunk of code or text in this notebook is written in a **block**. There are three types of blocks in Jupyter Notebooks: code, markdown, and raw.

#### Code Blocks

**Code blocks** are exactly what their name implies: you write code in them. They are blocks that can be executed in the interactive Python environment we mentioned earlier. You can **run** code blocks by clicking into them so that you see a blue bar to the left of the block, and then either pressing the small ▶ button in the top bar or pressing `Shift + Enter`.

**Try this!** Here is an example of a code block. Run it!

In [None]:
print("I'm a code block!")

You can also add new blocks by pressing the + button in the top bar.

#### Markdown Blocks

All of the text you see in this notebook was written with the second type of block we mentioned: **markdown blocks**. These blocks allow you to write formatted text with a simple markdown language called, well, [Markdown](https://en.wikipedia.org/wiki/Markdown). You can find a very simple demonstration of markdown [here](https://markdown-here.com/livedemo.html). You can also double-click any of the text blocks in this document, **like this one**, to see the markdown "source" code that produced it. Just **run** it to render it again.

#### Raw Blocks

The last type of block is a **raw block**, which allows you to write preformatted or "raw" text. The contents of these blocks are not rendered as happens with the markdown blocks. These blocks are not very common in a notebook, but there are times where you will find them useful.

## Working in Python

As mentioned above Python is renowned for its simplicity. And, while this introduction may be long and intense, we strongly believe that as you continue to write Python code, you will find that the language just "fades into the background," allowing you to focus on the data science that you are here to learn.

As you read through the rest of the lab, remember that all of the code blocks are interactive and that you should **run** them to see what happens. Furthermore, you are _encouraged_ to make your own blocks (with the + button) and to try experimenting with things yourself.

### Expressions

As with any programming language, complex ideas are built up from small, *primitive* expressions. More generally, anything that can be *evaluated* to a value is an expression. For example, a number (1, 2, etc.) in code is an expression because that code can be evaluated to the value of that number.

**Try this!** Execute the cells in this section and observe their output.

In [None]:
217

By combining simple expressions with operators, we can even express complex, almost magical, ideas like those used to find patterns in data. For example,

In [None]:
217 * 9 + 8 * 7 + 10
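
If you are coming from Java, a few of Python's arithmetic operators may behave differently than you expect. The following is a small sketch (the variable names are ours, chosen only for illustration) showing operator precedence, the two kinds of division, the remainder, and exponentiation.

```python
# Arithmetic operators on plain numbers
total = 217 * 9 + 8 * 7 + 10   # * binds tighter than +, as in math
true_div = 7 / 2               # / always gives a float
floor_div = 7 // 2             # // rounds down to a whole number
remainder = 7 % 2              # remainder after floor division
power = 2 ** 10                # exponentiation

print(total, true_div, floor_div, remainder, power)
# prints: 2019 3.5 3 1 1024
```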

Some other types of expressions are `strings` and `booleans`.

In [None]:
"I'm an expression too!"

And of course:

In [None]:
True or False is not False

### Variables

You can create variables by assigning a value to a name (or, if you prefer, by giving a name a value). Values can be expressed as expressions, which are evaluated prior to being assigned to a name.

In [None]:
greeting = 'Hello, DS'
greeting

If you are in a hurry, you can also assign multiple values to multiple names simultaneously. One caveat is that the entire right-hand side is evaluated before any of the names is assigned, so you could not use `one` and `two` in the expression for `three`.

In [None]:
one, two, three = 1, 2, 3

In [None]:
# this won't work!
# four, five, nine = 4, 5, four + five

**Try this!** Uncomment the last line in the previous code block and run the cell to verify that it doesn't work.
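
Because the right-hand side is evaluated in full before any name is bound, simultaneous assignment can also be used to swap two values. Here is a minimal sketch; the names `x`, `y`, `left`, `right`, and `total` are hypothetical and used only for this example.

```python
# The whole right-hand side is evaluated first, then the names are bound.
x, y = 1, 2
x, y = y, x              # swap without a temporary variable

# To build on earlier values, assign them in a separate statement first.
left, right = 4, 5
total = left + right

print(x, y, total)       # prints: 2 1 9
```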

### Strings

Working with data often means working with strings. Recall that strings are what we call words and sentences in programming languages because they are essentially a group of characters, like `a` or `b`, that have been *strung* together like `hello there!`. Being familiar with how to manipulate strings is not only important, but very useful. Many professionals love Python because of how easy it is to work with strings, especially in areas like [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing).

As in other languages, you can denote a string using the `"` symbol. For example, `"this is a string"`. Alternatively, in Python you may also use a "single quote", `'`, to denote a string (e.g. `'this is also a string'`).

**Try this!** In the following block, try assigning a string containing your name to a variable called `my_name`.

In [None]:
# replace None with a string containing your name
my_name = None
my_name

#### Indexing and Slicing

You can access specific characters in a string using square bracket notation, `[]`. 

In [None]:
my_name[0]

Remember that, as in Java, strings are indexed from `0`.

You can also get a part of a string using **slice** indexing. Slicing works by specifying the range of indices that you are interested in retrieving. The range is described the same way as a half-open interval in [interval notation](http://www.mathwords.com/i/interval_notation.htm), if you remember it from math: $[\text{begin} : \text{end})$.

The selected string will include the character at the first index given, but will not include the character of the second index.

In [None]:
my_name[0:2]

Often, you will be interested in either the first few or last few characters in a string, in which case you can leave out the corresponding index. In the following example, I only want the characters from index `1` to the end so I can leave off the ending index.

In [None]:
my_name[1:]
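
To make the half-open interval concrete, here is a short sketch on a fixed string. The variable `word` is a hypothetical example, not something defined elsewhere in this lab.

```python
word = "Python"          # hypothetical example string

first_two = word[0:2]    # indices 0 and 1; index 2 is excluded
same_thing = word[:2]    # an omitted start defaults to 0
last_two = word[-2:]     # negative indices count from the end

print(first_two, same_thing, last_two)   # prints: Py Py on
```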

**Try this!** Get all characters of the following string except the first four, without counting its length:

In [None]:
my_string = "This is a really long string and you have no clue how long it is!"
# your code here


#### More On Strings

There are many more things that you can do with strings. For example, you can concatenate them (`'hello' + ', ' + 'world'`). However, in this course, you will primarily be working with _numerical_ data instead of strings, so we will not linger on this topic. For more information on strings, please see this [article from Google](https://developers.google.com/edu/python/strings).

### Lists

Lists are like `arrays` in Java, except that you can put whatever you want into them. They are also flexible in size, so you can keep adding to them as much as you'd like. Like strings, lists are sequences, so the two share many of the same features.

Here we create a new list, using square bracket notation, and fill it with `1`, the value of the `one` variable, `'a'`, and the value of the `my_name` variable.

In [None]:
[1, one, 'a', my_name]

#### Adding to a List

You can add to lists using the `append` method like this.

In [None]:
a_list = [1, 2, 3, 4, 5]
a_list.append(6)
a_list

It is also possible to concatenate two lists, i.e. you can add two lists end-to-end.

In [None]:
a_list + a_list
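
One difference between the two approaches is worth seeing once: `+` builds a new list and leaves its operands alone, while `append` changes the list in place. A small sketch, using a hypothetical list called `numbers`:

```python
numbers = [1, 2, 3]
combined = numbers + [4, 5]   # + builds a brand-new list ...
print(numbers)                # ... so numbers is still [1, 2, 3]

numbers.append(4)             # append changes numbers in place
print(numbers)                # prints: [1, 2, 3, 4]
print(combined)               # prints: [1, 2, 3, 4, 5]
```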

#### Indexing

Because lists and strings are both sequences, you can use the same indexing and slicing techniques on lists that you used on strings.

**Try This!** Get the second through fourth elements of `a_list`.

In [None]:
# your code here


We can also do fancier indexing, like skipping elements:

In [None]:
a_list[0::2]

Or reversing it:

In [None]:
a_list[::-1]
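
Both of these are instances of the general slice form `start:stop:step`, where any of the three parts may be left out. The sketch below spells this out on a hypothetical list named `letters`.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # hypothetical example list

every_other = letters[0::2]     # start at 0, run to the end, step by 2
odd_positions = letters[1::2]   # same step, starting at index 1
backwards = letters[::-1]       # a negative step walks from the end to the start

print(every_other)      # prints: ['a', 'c', 'e']
print(odd_positions)    # prints: ['b', 'd', 'f']
print(backwards)        # prints: ['f', 'e', 'd', 'c', 'b', 'a']
```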

### Control Flow and Loops

#### If-Else Conditionals

In the example below, besides the `elif` statement (Python's spelling of "else if"), there are three more things to notice. First, the conditions don't have to be in parentheses, which allows for easier reading of code with much less clutter. Second, each `if`, `elif`, and `else` line is terminated with a `:`. This is essentially saying that we will "define" what this statement entails. And, third, each "block" following a statement is merely indented with either 4 spaces or a tab. This is, again, for readability. You can almost think of Python code as _outlining_ what you want it to do, where each statement is kind of like a heading and each block is an indented "idea", so to speak.

In [None]:
if 2 < 1 and 1 > 0:
    print('Not both!')
elif 1 > 0:
    print('Just one.')
else:
    print('Everything else')

#### Loops

The syntax for a `while` loop should look relatively similar to its Java counterpart. The differences lie in the same places as for the `if` statements. A `while` loop will iterate "while" its condition remains `True`.

In [None]:
n = 5

while n > 0:
    print(n)
    n -= 1

`for` loops may look a little unfamiliar at first, but there is good reason for this. For loops in Python have managed to achieve greater functionality without sacrificing readability. Below is an example of how you would iterate 5 times.

In [None]:
for i in range(5):
    print(i)

You can also easily iterate through lists in the same way, since lists are also `iterables`.

In [None]:
a_list = ['Hello', 'Machine', 'Learning', '!!!']

for word in a_list:
    print(word)

**Try this!** Create a for loop to print out every even number in the interval `[1, 100]`. Hint: create an appropriate `iterable` first.

In [None]:
# your code here


#### List Comprehensions

A very useful feature of Python lists is called **comprehension** notation. It allows us to use for loops to create lists! This notation is very similar to set-builder notation from math and allows us to succinctly create lists from other lists. In the following cell, we take a `range`, which is "list-like", and square each value. The `range` function returns an `iterable` that runs from an _optional_ starting point (default `0`) up to, but not including, an end point.

In [None]:
range(5)

In [None]:
squares = [x * x for x in range(5)]
squares

In [None]:
squares_plus = [x + 1 for x in squares]
squares_plus
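
Just as set-builder notation can carry a condition, a comprehension can end with an `if` clause that filters the values. The sketch below (with hypothetical names `filtered` and `looped`) shows such a comprehension next to the explicit loop it replaces.

```python
# Comprehension with a condition: squares of the multiples of 3 below 10
filtered = [x * x for x in range(10) if x % 3 == 0]

# The equivalent explicit loop
looped = []
for x in range(10):
    if x % 3 == 0:
        looped.append(x * x)

print(filtered)            # prints: [0, 9, 36, 81]
print(filtered == looped)  # prints: True
```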

### Functions

Functions in Python, and in general, can be thought of as a generalized method in Java. Where methods operate on and are attached to classes, functions are not. However, both constructs facilitate "abstraction" in our code. Abstraction is usually defined as a process by which we hide away all the little details about something in order to focus on how the thing interacts. Practically in CS this means

> [Abstraction's] main goal is to handle complexity by hiding unnecessary details from the user. -[Stackify](https://stackify.com/oop-concept-abstraction/?utm_referrer=https%3A%2F%2Fwww.google.com%2F)

This is what allows us to think about driving a car without actually thinking about all the complex detail of what happens when we press the gas pedal. In the same way, functional abstraction allows us to think about functions as a collection of code that, given a particular input, will return a particular output.

A very simple example might be a sum function, which computes the sum of a `list` of values.

In [None]:
sum([1, 2, 3, 4, 5])

With this function, I can retrieve the sum of a group of values without ever having to know _how_ the sum is actually computed. Another benefit of grouping code into functions is that it makes the code easy to test -- just call the function with some inputs and see if the right output is returned. The last benefit I will mention here is that a function is a way of separating concerns and ideas. When you write a `sum` function, all you are responsible for is the correctness and efficiency of the computation, nothing else. Also, when writing a function, you are putting code that does a particular computation into a logical grouping, much like how in writing you would group similar ideas into a paragraph.

#### Syntax

The syntax of defining a function in Python is quite simple. You let the interpreter know that you want to define a function using the `def` statement. Then, you follow that with your function's name and its arguments. There is no need to declare the types that the function accepts; Python checks at run time whether the operations in the body make sense for the values it was given. If they do not, it will let you know in the form of an error.

In [None]:
def my_function(arg1, arg2):
    output = arg1 + arg2
    return output

Notice that, again, much like the `if` statements and loops, function definitions end with a `:` and the contents are indented.

Some will miss the safety of static typing in Java (others won't). In exchange for giving it up, you avoid duplicating code between one version of a function that takes in one type of input and another version of the same function that merely takes in another type. Why write a separate `max` function for `byte`, `char`, `int`, `float`, etc. when you can just write one?
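
To see this in action, here is a sketch that calls the same `my_function` defined above with several types of arguments. The final call is our own illustration of a case where the operation is not defined, and Python reports it as a `TypeError` at run time.

```python
def my_function(arg1, arg2):
    output = arg1 + arg2
    return output

print(my_function(1, 2))          # integers are added: 3
print(my_function('ab', 'cd'))    # strings are concatenated: abcd
print(my_function([1], [2, 3]))   # lists are concatenated: [1, 2, 3]

try:
    my_function(1, 'a')           # int + str is not defined
except TypeError as error:
    print('TypeError:', error)
```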

#### More Examples

In [None]:
def mean(values):
    return sum(values) / len(values)

Notice that the `len` function returns the length of any list.

#### Lambdas

Lambdas are a special type of function called an anonymous function. Whereas with normal functions you must name the function in the definition, you do not have to name lambda functions.

```
def named_function():
    pass
```

Lambdas are treated as expressions that evaluate to functions exactly like how `'hello'` is evaluated to a string. This means that you can store lambdas in variables but this is _widely_ considered bad practice.

```
lambda x, y: x + y
```

Here, the `lambda` keyword tells Python that we want to make a lambda function. The `x` and `y` before the colon denote the arguments of the function. The expression after the colon represents the logic of the function. In this case, we could call this function `add` since it takes two values and adds them.
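
As a quick check that the two forms behave the same, the sketch below defines a named `add` function (our name, as suggested above) and calls the lambda directly without ever naming it.

```python
def add(x, y):
    return x + y

# The lambda is wrapped in parentheses and called immediately.
result = (lambda x, y: x + y)(2, 3)

print(result, add(2, 3))   # prints: 5 5
```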

Again, it is very bad practice to assign lambdas to variables. This might make you think that they are useless, but I assure you that they are not. The most common use case of lambdas is when another function takes a function as an argument. For example the `max` function.

I'm sure everyone is familiar with the `max` function, but in case you are not, this function returns the largest element in a list. Typically you would simply call the `max` function with `some_list` and get the largest element.

Try evaluating this next cell.

In [None]:
some_list = [1, 2, 3, 4]
max(some_list)

As expected, the largest value in `some_list` is `4`. However, there are cases where you want to get the max element of a list but the elements have complex structure. For example, consider a class roster, which is a list of students. In this situation, let us represent each student as a `list` containing their name, age, and graduation year.

In [None]:
roster = [
    ['Billy',  50, 2021],
    ['Meghan', 18, 2020],
    ['Jeff',   21, 2019],
    ['Alex',   21, 2019],
    ['Cate',   21, 2020]
]

We want to find the oldest student, `Billy`, in this group of students. How can we do that? Let's try directly calling `max` with this `roster` and see what we get.

In [None]:
max(roster)

The `max` function returned `['Meghan', 18, 2020]`, which isn't what we are looking for.
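
A quick look at how Python orders lists helps explain this result. The sketch below compares a few small lists directly; lists are compared element by element from left to right, and the first difference decides.

```python
# The first elements differ, so they decide: 'M' comes after 'J'.
print(['Meghan', 18, 2020] > ['Jeff', 21, 2019])   # prints: True

# Ages are never looked at, because the names already differ.
print(['Meghan', 18, 2020] > ['Billy', 50, 2021])  # prints: True

# When the first elements tie, the second elements decide.
print([1, 5] > [1, 3])                             # prints: True
```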

> **Note**: this result makes sense because the max function sees a list of lists and defaults to using the first element of each list to compare them. In this case, `Meghan` starts with `M`, which comes later in the alphabet than the rest of the students' initials.

In order to get the oldest student, we will need to show the `max` function which values we want it to compare. To do this we will use a `key`, which is a function. Instead of writing an entire function for this, we can just pass in a lambda that, when called with a student-list, will return the age of the student.

In [None]:
max(roster, key=lambda student: student[1])

Here we see that `'Billy'`, who is 50, was returned as the oldest student.
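
The `key` argument is not limited to lambdas or to `max`. As a sketch, the code below passes a named function (the hypothetical `age`) as the key instead, and uses the same idea with the built-in `sorted` to order the roster by graduation year.

```python
roster = [
    ['Billy',  50, 2021],
    ['Meghan', 18, 2020],
    ['Jeff',   21, 2019],
    ['Alex',   21, 2019],
    ['Cate',   21, 2020]
]

def age(student):
    # hypothetical helper: the age is stored at index 1
    return student[1]

oldest = max(roster, key=age)                           # same result as the lambda version
by_year = sorted(roster, key=lambda student: student[2])  # earliest graduation year first

print(oldest)        # prints: ['Billy', 50, 2021]
print(by_year[0])    # prints: ['Jeff', 21, 2019]
```

Whether to use a lambda or a named function here is mostly a matter of readability: a lambda keeps short logic next to the call, while a named function is easier to reuse and test.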

### Other Python Topics

A complete discussion of Python would include topics such as [dictionaries](https://www.programiz.com/python-programming/dictionary), [iterators](https://www.programiz.com/python-programming/iterator), and [classes](https://www.programiz.com/python-programming/object-oriented-programming), among other important topics. However, this has already been a lot and we will not run into these in the first few labs. Because of this, we will opt to introduce these other topics as they appear throughout this course.

## NumPy

You can put just about anything into a Python list. They are designed to be completely agnostic to types, making them just about as flexible as a data structure can get. You want to store a string, an int, and another list? No problem. However, this versatility doesn't come for free. In exchange for quality of life, we must give up some degree of computational efficiency, though probably not enough to tip the scales against lists in most use cases.

One case, however, where lists are not ideal is mathematical computation. Here, we don't need the flexibility that lists give us since we know upfront that we are only dealing with _numbers_ and _how many_ of these numbers we have (e.g. the dimensionality of a column vector is fixed for a particular problem). This leads us to seek an alternative data structure that is optimized for these constraints (i.e. known type and shape): the **array**.

These types of math-specialized arrays are not provided by Python itself and, instead, can be found in the `numpy` package. Here we import `numpy` with the alias `np`. This is for convention and because, simply, $2 < 5$.

In [None]:
import numpy as np

### Creating Arrays

Arrays can be created from many "list-like" objects by calling `np.array` on the object.

In [None]:
some_list = range(10)
np.array(some_list)

You may also create multi-dimensional arrays in this way. (NumPy's array type is called `ndarray`, short for N-dimensional array.)

In [None]:
some_ndlist = [
    [1, 2, 3],
    [4, 5, 6]
]
np.array(some_ndlist)
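
Every array knows its own shape and number of dimensions, which is handy for checking that an array looks the way you expect. A minimal sketch, with hypothetical names `vector` and `matrix`:

```python
import numpy as np

vector = np.array(range(10))
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(vector.shape, vector.ndim)   # prints: (10,) 1
print(matrix.shape, matrix.ndim)   # prints: (2, 3) 2
```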

#### Zeros

Sometimes you just need a _uniform_ array of some value. Here are some examples of how you can make these.

In [None]:
np.zeros(5)

In the following example, we pass in a `tuple` (kind of like a list) with the shape of the array we want, i.e. `(5, 3)`.

In [None]:
np.zeros((5, 3))

**Try this!** Create a three-dimensional array of zeros with size 3 along each of the first two dimensions and size 2 along the third dimension.

In [None]:
# your code here


This looks pretty complicated, but luckily in this course we will mostly use one- and two-dimensional arrays, as they can represent _vectors_ and _matrices_.

#### Ones

In [None]:
np.ones(5)

If you need an array of `5`s, then make an array of `ones` of the desired shape and multiply it by `5`.

In [None]:
np.ones(5) * 5
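
NumPy also provides `np.full`, which takes a shape and a fill value and builds the array in one step. The sketch below checks that both routes give the same result; the names `scaled` and `filled` are ours.

```python
import numpy as np

scaled = np.ones(5) * 5        # ones, then multiply
filled = np.full(5, 5.0)       # shape first, fill value second

print(np.array_equal(scaled, filled))   # prints: True
```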

#### Evenly Spaced Sequences

You can also get an array of evenly spaced numbers over a specified interval using `np.linspace`.

```
np.linspace(start, stop, number)
```

In [None]:
np.linspace(0, 10, 20)
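
Note that `np.linspace` includes both endpoints by default and asks for the _number_ of points. If you would rather specify the _step size_, `np.arange` is the usual alternative, and it excludes the stop value. A small sketch with hypothetical names:

```python
import numpy as np

points = np.linspace(0, 1, 5)    # 5 values, both endpoints included
steps = np.arange(0, 1, 0.25)    # step of 0.25, stop value excluded

print(points)   # 0, 0.25, 0.5, 0.75, 1
print(steps)    # 0, 0.25, 0.5, 0.75
```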

### Thinking of Arrays as Vectors

One of the main benefits of using arrays rather than lists is _vectorized_ operations, which essentially allow us to think of an entire array as a unit and operate at the array level; we don't have to concern ourselves with each individual number. From here on out, we will refer to arrays as **vectors**, since _vectorized_ operations are typically defined mathematically for vectors. In many cases we can use the two terms interchangeably, but this is not always the case.

For example, a dot product, or inner product, of two vectors (mathspeak for same-sized arrays) $\textbf{a}$ and $\textbf{b}$ is defined as $$\textbf{a} \cdot \textbf{b} = \sum_{i=1}^{d} a_i b_i.$$ This roughly translates to "multiply each element of $\textbf{a}$ by the element at the same index in $\textbf{b}$ and sum the products". A Python function for this might look like

In [None]:
def dot(a, b):
    products = []
    
    for i in range(len(a)):
        p = a[i] * b[i]
        products.append(p)
        
    return sum(products)

Notice the `for` loop that is required. Often times, we will be taking "elementwise" products, sums, etc. all of which will involve looping. Looping itself in Python is not necessarily slow, but given the constraints of this context (recall, we are only dealing with numbers and we know how many of them we have) we can leverage "super" fast C libraries to do this for us.

> **For the curious**: In this case, we use Python and NumPy as an interface for highly optimized C routines.

In [None]:
a = np.ones(3)             # a = [1, 1, 1]
b = np.array([1, 2, 3])    # b = [1, 2, 3]

dot(a, b)                  # 1*1 + 1*2 + 1*3 = 6
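
The call above still uses our hand-written, loop-based `dot`. For comparison, here is a sketch of the same computation done by NumPy itself, using `np.dot` and the equivalent `@` operator, with no explicit loop on our side.

```python
import numpy as np

a = np.ones(3)             # a = [1, 1, 1]
b = np.array([1, 2, 3])    # b = [1, 2, 3]

print(np.dot(a, b))        # prints: 6.0
print(a @ b)               # prints: 6.0
```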

Even better, all of this comes with intuitive syntax. Below is an example of "vectorized" addition.

In [None]:
a + b
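
This is a good place to contrast arrays with lists: for lists, `+` means concatenation, while for arrays it means elementwise addition. The sketch below uses hypothetical names so that it does not overwrite `a` and `b`.

```python
import numpy as np

list_sum = [1, 1, 1] + [1, 2, 3]                        # lists: concatenation
array_sum = np.array([1, 1, 1]) + np.array([1, 2, 3])   # arrays: elementwise

print(list_sum)     # prints: [1, 1, 1, 1, 2, 3]
print(array_sum)    # elementwise result: 2, 3, 4
```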

And the "elementwise" product of two vectors, $$a_i b_i \quad i = 1, \ldots, d,$$ can be computed just as easily. The cell below also scales the result by the scalar `2`.

In [None]:
2 * a * b

Now that you know what these arrays look like, let's try to add them together. Write down what you would expect to see in the cell below.

In [None]:
# your code here


There are many other vectorized array operations, such as `np.sum` and `np.min`, that also take advantage of the `array` data structure to compute results very quickly. An extensive list of mathematical operations provided by NumPy can be found in [its documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html).

Furthermore, some computations can be found as methods of arrays themselves. These include `array.min()`, `array.mean()`, etc. A list of these can be found [here](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation).
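
For the common reductions, the function form and the method form give the same answer, so which one you use is largely a matter of taste. A brief sketch on a hypothetical array called `values`:

```python
import numpy as np

values = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # hypothetical data

print(np.sum(values), values.sum())    # prints: 14.0 14.0
print(np.min(values), values.min())    # prints: 1.0 1.0
print(values.mean())                   # prints: 2.8
```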

### Arrays as Matrices

Arrays are very generic. They can be used to represent vectors, often as points in space, but they can also just represent a collection of numbers you would like to do math with. Moving from $1$ dimension to $2$ dimensions, an array could represent either a collection of vectors aligned side-by-side or a matrix in a more traditional mathematical sense. We won't spend much time on the matrix sense, as most of the applications of _ndarrays_ we will see in this class are better described as collections of vectors.

An example of this could be a collection of vectors representing the prices of several stocks and their "earnings per share," which together are used in finance to compute a price-earnings ratio, or more commonly, a [P/E ratio](https://www.investopedia.com/university/peratio/peratio1.asp).

$$
\begin{bmatrix}
\text{prices} \\
\text{earnings per share}
\end{bmatrix}
=
\begin{bmatrix}
175 & 150 & 180 \\
1.60 & 1.03 & 2.00
\end{bmatrix}
$$

In NumPy, this could be represented by a 2D array as follows.

In [None]:
prices = [175, 150, 180]
earnings = [1.60, 1.03, 2.00]

data = np.array([prices, earnings])
data

Here we have represented our data, $\mathcal{D}$, as a collection of three column vectors, each representing one observation. With this data, we can calculate the P/E ratios of each of these stocks.

$$\text{P/E Ratio}\; = \frac{\text{Price per Share}}{\text{Earnings per Share}}$$

In [None]:
def pe_ratio(data):
    prices = data[0, :]    # first row (index 0): prices
    earnings = data[1, :]  # second row (index 1): earnings per share
    
    return prices / earnings

In [None]:
f'P/E ratios are: {pe_ratio(data)}'

Notice how we did not need to _manually_ loop through the elements.
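
For comparison, here is a sketch of what the manual loop would have looked like, checked against the vectorized division. It rebuilds the same `data` array so that it can run on its own; `looped` and `vectorized` are hypothetical names.

```python
import numpy as np

prices = [175, 150, 180]
earnings = [1.60, 1.03, 2.00]
data = np.array([prices, earnings])

# Manual version: one division per stock
looped = []
for price, eps in zip(prices, earnings):
    looped.append(price / eps)

# Vectorized version: one expression for the whole row
vectorized = data[0, :] / data[1, :]

print(np.allclose(looped, vectorized))   # prints: True
```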

### Arrays have Axes

NumPy arrays have axes that correspond to the order of numbers supplied when indexing.

![axes](utility/pics/elsp_0105.png)

It makes sense to consider axes for many operations. For example, the `min` method on arrays defaults to returning the single smallest value in the entire array, ignoring the axes altogether. However, there are cases where you want the minimum of each row or of each column instead. For these cases, you can specify which axis `min` should use.

To demonstrate, we will use our sample stock dataset. Here is a reminder of what the data looked like.

In [None]:
data

If we wanted to find the minimum value in `data` we would simply use the `min` method on the array as is done in the next cell.

In [None]:
data.min()

However, often times we would like to find the minimum value in a particular row or column. Depending on the data you are representing with the array, this might mean finding the minimum stock price and stock earnings per share as with our P/E ratio example.

To find the minimum price and the minimum earnings per share (one value per row), we can pass `axis=1` so that `min` is taken along axis 1, that is, across the columns within each row.

In [None]:
data.min(axis=1)

Many vectorized operations that manipulate arrays can take an axis argument, allowing you more flexibility.
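
As one more illustration, the sketch below applies `sum` to a small hypothetical array named `grid`, first with no axis and then along each axis in turn.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])     # hypothetical 2 x 3 array

total = grid.sum()               # no axis: every element, giving 21
per_column = grid.sum(axis=0)    # collapse the rows: 5, 7, 9
per_row = grid.sum(axis=1)       # collapse the columns: 6, 15

print(total, per_column, per_row)
```

A useful way to remember this: the axis you pass is the one that disappears from the result's shape.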

Now, using the power of arrays, let's reattempt our goal from above to find the oldest student, Billy, in the group of students. Note that we do not want strings in our NumPy arrays, so we will put the names into a list and the numeric entries into a two-dimensional array (matrix).

In [None]:
names = ['Billy','Meghan','Jeff', 'Alex','Cate']
roster = [
    [50, 2021],
    [18, 2020],
    [21, 2019],
    [21, 2019],
    [21, 2020]
]
R = np.array(roster)
R

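**Try this!** Get the age and graduation year of the oldest person in our roster. Note that `max(axis=0)` returns the maximum of each column separately; here both maxima happen to belong to the same person.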

In [None]:
R.max(axis=0)

**Try this!** Here is a last challenging exercise! Using the reference on [array functions](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation) linked above, get the index of the oldest person in our roster and retrieve their name from the `names` list. Then adapt your code to reveal the name of the youngest person.

In [None]:
# your code here
