AI Is Not Magic: It Involves Math, Statistics, Data, and Programming

Original link: thenewstack.io

This piece is the first in a series, called “Machine Learning Is Not Magic,” covering how to get started in machine learning, using familiar tools such as Excel, Python, Jupyter Notebooks and machine learning cloud services from Azure and Amazon Web Services. Check back here each Friday for future installments. 

Back in 2010, when I first encountered the concept of Machine Learning (ML), I told myself that it was only for PhDs in Computer Science, which meant that I might never get a chance to work on it.

As an ex-Microsoftie and Azure enthusiast, I decided to take a closer look at ML when Microsoft started to add Machine Learning components to Azure. Even then, I was overwhelmed and confused by the enormous number of technologies and the jargon surrounding them. With Google announcing TensorFlow and Cloud ML, followed by Amazon’s launch of its own Machine Learning service, it became very clear that ML was going to be the next big thing in the cloud.

Looking at the buzz and the hype around ML, I decided to write my first “Hello World” equivalent of Machine Learning. With each attempt, I was only left more confused and disappointed. The sheer number of articles, blogs, self-learning courses, tutorials, and samples on ML added to my anxiety. Despite all the available resources, I couldn’t even get close to creating a meaningful and complete ML implementation.

One of the main reasons I kept making a U-turn was the liberal dose of mathematics found in almost every ML resource that I bookmarked. Despite my determination and commitment, the thought that I needed to learn advanced mathematics kept pushing me away. Let me admit it: I dread dealing with mathematics. I barely managed to pass my math papers in high school. When I was a teen, I rejoiced when I found that it was possible to build a career in IT without a master’s degree in mathematics. The fact that some advanced math was a prerequisite for ML disappointed me and, in many ways, brought back the nightmare of my school days.

But as I continued to work with my customers on Internet of Things and data-centric projects, possible uses of ML kept coming up. Meanwhile, the hype around ML had reached its peak, so much so that the cloud providers started to push ML more than core IaaS components like VMs, storage, and networking. It also became extremely clear that ML was becoming front and center in many emerging technologies, including Cognitive Computing, Artificial Intelligence, Chatbots, Personal Assistants, and Predictive Maintenance.

Hit the Spreadsheets

At the beginning of 2017, I decided to spend two hours every day learning ML. The first few weeks of my learning path had nothing to do with programming; I spent them brushing up on my math skills. After shortlisting a series of math concepts that are essential for ML, I realized that the best resource readily accessible to me was my son’s high school mathematics textbook. I bookmarked a few chapters and kept working through the concepts relentlessly until I could solve the problems. I must also give due credit to Khan Academy for its excellent collection of tutorials covering the prerequisites.

After getting a handle on basic mathematics and statistics, I started experimenting with the formulae in Microsoft Excel. It was fascinating to see the primary hypothesis of ML working in Excel. Having understood how to apply modern datasets to traditional statistical formulae, I was eager to try them in Python. Though I am not an expert in Python, I am pretty comfortable writing code in it.

Drawing on my past experience with Python, I installed the required modules and configured my Python-based ML testbed. With a working formula in Excel and a basic Python environment, I successfully created my first Machine Learning model, which accurately predicted a value based on the dataset that I used for training. This was an “aha” moment for me.

Before I got to a working ML program, I read umpteen articles and watched hours of video tutorials on YouTube. But only after I managed to write my first program did it all start to make sense.

The Path Forward

The objective of this guide is to help you create a personalized learning path for Machine Learning. If you struggle with math, you will find this plan especially useful. I will tell you how much math you need to learn, what environment you need to configure, which tools you need to use, and finally, how to write your first ML program.

Before we delve into the details, I want to make a disclaimer. By no means is this going to be the most complete or exhaustive guide to ML. It may not be the most accurate in terms of terminology and official nomenclature. But I can tell you with conviction that it will get you a few steps closer to your goal of learning ML. I promise that I will stay away from the jargon and complex mathematical formulas that hinder your ability to learn.

Over the next few weeks, we will take a real-world problem and first experiment with it in the simplest tool that most of us are familiar with: Microsoft Excel. We will use Excel as a tool to explore the core premise of ML. After that, we will move to coding in Python, where I will try my best to demystify the concepts involving NumPy, Pandas, Jupyter Notebooks, Matplotlib, and Seaborn.

I will also show you how to move the final model that we create to Node.js and use it with a web application. With the core concepts behind us, I will finally show you how to use ML in the public cloud. We will take the same problem that we attempted to solve with Excel and Python to Azure ML and use the interactive ML Studio to create a predictive analytics web service. We will also attempt to take the same problem to AWS and solve it using Amazon ML.

The objective is to equip you with the basic ammunition for jumpstarting your ML learning process.

To whet your appetite, I want to introduce you to the problem we are going to solve: based on the number of years of experience, we will predict the salary of a developer at Stack Overflow.

Salary Calculator from Stack Overflow

Following the tradition of openness, the folks at Stack Overflow created a salary calculator for various jobs and seniority levels. Instead of using complex or fictitious datasets, we will use real-world data coming from Stack Overflow. The idea is to use a scenario that every developer can easily relate to. What could be better than data from Stack Overflow? Since the results are available, it is also easy to verify the hypothesis that we will create through multiple tools.

In the next installment, I will introduce Machine Learning in layman’s terms. We will take a simple dataset based on the Stack Overflow salary calculator and analyze it to understand how to create a predictive model from it. You will walk away with a clear understanding of what is called supervised machine learning and of simple linear regression. Stay tuned as I prepare the assets that you can grab from the GitHub repo.
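To make that target a little more concrete, here is a minimal sketch of simple linear regression in plain Python: fitting a straight line that maps years of experience to salary. The numbers are invented purely for illustration and are not taken from the Stack Overflow salary calculator. The names `fit_line` and `predict` are hypothetical helpers, not part of any library, and the code is not necessarily what later installments will use.

```python
# Illustrative sketch only: the data below is made up, NOT Stack Overflow data.
years = [0.0, 2.0, 4.0, 6.0, 8.0]          # years of experience (hypothetical)
salary = [50.0, 60.0, 70.0, 80.0, 90.0]    # salary in thousands (hypothetical)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    # The fitted line always passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(years, salary)


def predict(x):
    """Predict a salary (in thousands) for x years of experience."""
    return slope * x + intercept
```

The same slope and intercept can be obtained in a spreadsheet with its built-in regression functions, which is why a tool like Excel is a reasonable place to start before moving to code. "Supervised" here simply means the line is learned from examples where the answer (the salary) is already known.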

Feature image via Pixabay.