Today I completed five sections from lesson two, namely:
- Two perspectives on ML
- The computer science perspective
- The statistical perspective
- Tools for ML
- Libraries
Two perspectives on ML
The computer science perspective
Computer science and statistics are closely related fields, since each requires knowledge of the other at some point of study. From a computer science perspective, a computer scientist might define ML as:
Using data inputs, otherwise known as input features, to create a program that can generate a desired output.
While this definition is correct, an individual with a statistics background might instead define ML as:
We are trying to find a mathematical function that, given the values of the independent variables, can predict the values of the dependent variables.
While the terminology is different, the challenge is the same: how to get the best possible outcome.
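To make the statistical definition a little more concrete, here is a minimal sketch that finds a straight-line function from one independent variable to one dependent variable using ordinary least squares. The data values and the names `fit_line` and `predict` are made up for illustration, and a straight line is only one of many possible function shapes.

```python
# Toy observations (hypothetical values, chosen so that y = 2x + 1 exactly).
x = [1.0, 2.0, 3.0, 4.0]   # independent variable
y = [3.0, 5.0, 7.0, 9.0]   # dependent variable


def fit_line(xs, ys):
    """Ordinary least squares for a single independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(x, y)


def predict(value):
    """The fitted function: maps an independent value to a predicted dependent value."""
    return slope * value + intercept


print(slope, intercept)
print(predict(5.0))
```

Here the "mathematical function" is simply `predict`, and fitting it means choosing the slope and intercept that best match the observed data.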
Computer science terminology
Consider the table below:
SKU | Make | Colour | Quantity | Price |
908721 | Guess | Blue | 789 | 45.33 |
456552 | Tillys | Red | 244 | 22.91 |
789921 | A&F | Green | 387 | 25.92 |
872266 | Guess | Blue | 154 | 17.56 |
From a computer science perspective, we can say that:
For the rows in the table, we might call each row an entity or an observation about an entity. In our example above, each entity is simply a product, and when we speak of an observation, we are simply referring to the data collected about a given product. You'll also sometimes see a row of data referred to as an instance, in the sense that a row may be considered a single example (or instance) of data.
For the columns in the table, we might refer to each column as a feature or attribute which describes a property of an entity. In the above example, colour and quantity are features (or attributes) of the products.
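The row and column terminology can be sketched in a few lines of Python using the product table above. The variable names (`columns`, `rows`, `products`) are just illustrative choices, and plain dictionaries are only one possible way to hold the data.

```python
# Column names: each one is a feature (attribute) of a product.
columns = ["SKU", "Make", "Colour", "Quantity", "Price"]

# Each tuple is one row: an observation (instance) about one entity (a product).
rows = [
    (908721, "Guess", "Blue", 789, 45.33),
    (456552, "Tillys", "Red", 244, 22.91),
    (789921, "A&F", "Green", 387, 25.92),
    (872266, "Guess", "Blue", 154, 17.56),
]

# Pair every value with its feature name.
products = [dict(zip(columns, row)) for row in rows]

# One entity with all of its features.
first = products[0]
print(first)

# One feature across all entities.
colours = [p["Colour"] for p in products]
print(colours)
```

Reading across gives you an instance; reading down gives you a feature.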
Input and output
Remember that in a typical case of machine learning, you have some kind of input which you feed into the machine learning algorithm, and the algorithm produces some output. In most cases, there are multiple pieces of data being used as input. For example, we can think of a single row from the above table as a vector of data points: (908721, Guess, Blue, 789, 45.33)
Again, in computer science terminology, each element of the input vector (such as Guess or Blue) is referred to as an attribute or feature. Thus, we might feed these input features into our machine learning program and the program would then generate some kind of desired output (such as a prediction about how well the product will sell). This can be represented as:
- Output = Program(Input Features)
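As a rough sketch of that formula, the snippet below passes one row of input features into a function and gets an output back. The function `program` and its price rule are entirely hypothetical: a real ML program would learn its behaviour from data rather than use a hand-written rule like this one.

```python
def program(input_features):
    """Hypothetical stand-in for a trained ML program."""
    sku, make, colour, quantity, price = input_features
    # Made-up rule for illustration only, NOT a learned model.
    if price < 30:
        return "likely to sell well"
    return "may sell slowly"


# One row of the table, used as the input feature vector.
input_features = (908721, "Guess", "Blue", 789, 45.33)

# Output = Program(Input Features)
output = program(input_features)
print(output)
```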
Feature extraction
An important step in preparing your data for machine learning is extracting the relevant features from the raw data.
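Here is one possible way feature extraction could look for the product table above. It assumes, purely for illustration, that only colour, quantity and price are relevant, and it turns the colour text into numbers with one-hot encoding. Both the choice of features and the helper name `extract_features` are assumptions, not something the lesson prescribes.

```python
# Raw rows from the table: (SKU, Make, Colour, Quantity, Price).
raw_rows = [
    (908721, "Guess", "Blue", 789, 45.33),
    (456552, "Tillys", "Red", 244, 22.91),
    (789921, "A&F", "Green", 387, 25.92),
    (872266, "Guess", "Blue", 154, 17.56),
]

# All distinct colours, in a fixed (alphabetical) order.
colour_values = sorted({row[2] for row in raw_rows})


def extract_features(row):
    """Keep colour, quantity and price; drop SKU and make (an assumption)."""
    _sku, _make, colour, quantity, price = row
    # One-hot encoding: a 1 in the position of this row's colour, 0 elsewhere.
    one_hot = [1 if colour == c else 0 for c in colour_values]
    return one_hot + [quantity, price]


features = [extract_features(row) for row in raw_rows]
print(colour_values)
for vector in features:
    print(vector)
```

Each row is now a purely numeric vector, which is the form most ML algorithms expect as input.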
Tools for ML
Just as in traditional programming, a developer needs tools to be productive in whatever area of the tech ecosystem they work in. These tools include libraries, IDEs, APIs and frameworks that increase productivity and efficiency. The same applies to ML, where various tools have been developed to make ML more powerful and easier to implement. We mention some of these tools below and discuss the ML ecosystem.
The Machine Learning Ecosystem
There are three main components involved in the ML ecosystem:
Libraries: While working on an ML project, you likely will not want to write all of the necessary code yourself. Instead, you'll want to make use of code that has already been created and refined. That's where libraries come in. A library is a collection of pre-written (and compiled) code that you can make use of in your own project. NumPy is an example of a library popularly used in data science, while TensorFlow is a library specifically designed for machine learning.
Development environment: Just as some web developers prefer VS Code as their development environment and others prefer Atom, ML engineers also have development environments where they write code and train ML models. A development environment is a software application (or sometimes a group of applications) that provides a whole suite of tools designed to help you (as the developer or machine learning engineer) build out your projects. Jupyter Notebooks and Visual Studio are examples of development environments that are popular for coding many different types of projects, including machine learning projects.
Cloud: A cloud service is a service that offers data storage or computing power over the Internet. In the context of machine learning, you can use a cloud service to access a server that is likely far more powerful than your own machine, or that comes equipped with machine learning models that are ready for you to use. Some of the top cloud computing providers are AWS, Microsoft Azure and Google's GCP. We'll be using Microsoft Azure for this scholarship, as it is sponsored by Microsoft and they have one of the most intuitive cloud services, with an easy UI and drag-and-drop functionality.
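To show what "making use of pre-written code" looks like in practice, here is a small sketch using NumPy, the library mentioned above, on the quantity and price columns of the product table. It assumes NumPy is installed (for example via `pip install numpy`). The `stock_value` calculation is just an illustrative use, not part of the lesson.

```python
import numpy as np

# Two numeric columns from the product table, as NumPy arrays.
quantity = np.array([789, 244, 387, 154])
price = np.array([45.33, 22.91, 25.92, 17.56])

# Element-wise multiplication, with no hand-written loop.
stock_value = quantity * price

# Summary statistics come ready-made with the library.
print(quantity.mean())
print(price.max())
print(stock_value.round(2))
```

The arithmetic and statistics here are all provided by the library, which is the point of using one.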
Libraries
Core Framework and Tools
Python is a very popular high-level programming language that is great for data science. Its ease of use and wide support within popular machine learning platforms, coupled with a large catalog of ML libraries, have made it a leader in this space. Udacity offers a free course to get you started with Python.
TensorFlow is a free, open-source software library for machine learning built by Google Brain.
Keras is a Python deep-learning library. It provides an Application Programming Interface (API) that can be used to interface with other libraries, such as TensorFlow, in order to program neural networks. Keras is designed for rapid development and experimentation.
PyTorch is an open source library for machine learning, developed in large part by Facebook's AI Research lab. It is known for being comparatively easy to use, especially for developers already familiar with Python and a Pythonic code style. Udacity offers a free course to get you up and running with PyTorch.
Data Visualization
Matplotlib is a Python library designed for plotting 2D visualizations. It can be used to produce graphs and other figures that are high quality and usable in professional publications. You'll see that the Matplotlib library is used by a number of other libraries and tools, such as scikit-learn and Seaborn (below). You can easily import Matplotlib for use in a Python script or to create visualizations within a Jupyter Notebook.
Plotly is not itself a library, but rather a company that provides a number of different front-end tools for machine learning and data science—including an open source graphing library for Python.
Seaborn is a Python library designed specifically for data visualization. It is based on matplotlib, but provides a more high-level interface and has additional features for making visualizations more attractive and informative.
Bokeh is an interactive data visualization library. In contrast to a library like matplotlib that generates a static image as its output, Bokeh generates visualizations in HTML and JavaScript. This allows for web-based visualizations that can have interactive features.
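As a small illustration of importing Matplotlib in a Python script, the sketch below draws a bar chart of the quantity column from the product table earlier in this post. It assumes Matplotlib is installed. The `Agg` backend is chosen only so that the script runs without a display, and the chart design is just one reasonable option.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

# SKU and Quantity columns from the product table.
skus = ["908721", "456552", "789921", "872266"]
quantities = [789, 244, 387, 154]

# One figure with a single set of axes.
fig, ax = plt.subplots()

# One bar per product.
bars = ax.bar(skus, quantities)
ax.set_xlabel("SKU")
ax.set_ylabel("Quantity")
ax.set_title("Quantity per product")
```

In a Jupyter Notebook the figure would be shown inline. In a script, you could call `fig.savefig(...)` with a file name of your choice to write the static image to disk.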