Clustering, visualisation, and dimensionality reduction.
Question 2
Explain whether each scenario is a classification or regression problem. In the case of classification, what is the number of classes? Finally, provide the values of \(N\) and \(D\).
(a) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product, we recorded if it was a success or failure, the price charged for the product, the marketing budget, the competition price, and ten other variables.
Solution
This is a classification problem, where the output \(y\) denotes if the product will be a success or failure and the input \(x\) collects information on the price charged for the product, the marketing budget, the competition price, and ten other variables. In this case, we have \(C = 2, N = 20,\) and \(D = 13\).
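As a rough illustration of how \(N\), \(D\), and \(C\) can be read off a tabular dataset, the sketch below builds placeholder data with the same shape as this scenario. The feature names `other_1` to `other_10`, the helper `make_dataset`, and all values are hypothetical stand-ins, since the exercise does not name the ten other variables or give any data.

```python
# Three inputs are named in the exercise; the remaining ten are unnamed,
# so placeholder names other_1..other_10 are used here (an assumption).
feature_names = ["price", "marketing_budget", "competition_price"] + [
    f"other_{i}" for i in range(1, 11)
]

def make_dataset(n_products):
    """Build placeholder inputs X (N rows, D columns) and labels y (N entries)."""
    X = [[0.0] * len(feature_names) for _ in range(n_products)]
    # Dummy labels that alternate, so both classes are present.
    y = ["success" if i % 2 == 0 else "failure" for i in range(n_products)]
    return X, y

X, y = make_dataset(20)
N = len(X)        # number of data points (products)
D = len(X[0])     # number of input features per product
C = len(set(y))   # number of distinct classes
print(N, D, C)    # prints: 20 13 2
```

Note that the success/failure label is the output \(y\) and is therefore not counted in \(D\).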
(b) We are interested in predicting the percentage change in the GBP/EUR exchange rate in relation to weekly changes in world stock markets. We collect weekly data for all of 2018, and for each week, we record the percentage change in the GBP/EUR exchange rate, the percentage change in the UK market, the percentage change in the German market, the percentage change in the US market, and the percentage change in the Chinese market.
Solution
This is a regression problem, where the output \(y\) denotes the percentage change in the GBP/EUR exchange rate and the input \(x\) collects information on the percentage change in the UK market, the percentage change in the German market, the percentage change in the US market, and the percentage change in the Chinese market. In this case, we have \(N = 52\) and \(D = 4\).
(c) We are interested in identifying species of birds based on audio recordings. We have ten-second audio recordings of 645 birds, and 35 features have been extracted to represent the signals in the raw audio recordings. There are 19 bird species in the dataset.
Solution
This is a classification problem, where the output \(y\) denotes the bird species and the input \(x\) contains the features that have been extracted to represent the signals in the raw audio recordings. In this case, we have \(C = 19, N = 645,\) and \(D = 35\).
Note: In 2013 Kaggle ran a competition on predicting bird species given audio recordings. More details and data can be found here: https://www.kaggle.com/c/mlsp-2013-birds.
(d) An e-commerce company sells fresh produce online. Currently, every product must be manually inspected to determine if it is rotten before being sent to the customer. To save resources, the company is interested in automating this process. They have collected 2,000 images of strawberries of size \(50 \times 50\) and have manually determined whether each strawberry is rotten.
Solution
This is a classification problem, where the output \(y\) denotes whether the strawberry is rotten and the input \(x\) is the image of the strawberry. In this case, we have \(C = 2, N = 2000,\) and \(D = 50^2\).
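The value \(D = 50^2\) comes from treating every pixel as one input feature. A minimal sketch of that flattening step is given below. It assumes a single-channel (greyscale) image, which is what \(D = 50^2\) implies; for a three-channel colour image the count would be three times larger. The helper `flatten` and the all-zero image are illustrative placeholders only.

```python
def flatten(image):
    """Concatenate the rows of a 2-D image into one flat feature vector."""
    return [pixel for row in image for pixel in row]

# Placeholder 50 x 50 image with one intensity value per pixel
# (assumption: greyscale, so one channel only).
image = [[0.0] * 50 for _ in range(50)]

x = flatten(image)
D = len(x)
print(D)  # prints: 2500
```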
Note: Amazon purchased Whole Foods in 2017. Since the purchase, Amazon has been developing algorithms and exploring different inputs (e.g. different types of images) to automate the process of manually checking for rotten foods.
Question 3
Think of some example real-life applications for the following types of learning.
(a) Describe the response and potential predictors for a Classification project.
Solution
Answers will vary but an example could be:
We could try to classify someone as ill or healthy (response variable) using inputs such as resting heart rate, resting breath rate, and mile run time.
(b) Describe the response and potential predictors for a Regression project.
Solution
Answers will vary but an example could be:
We could try to predict the salary of an individual (response variable) using inputs such as their education, work history, and skillsets.
(c) Describe a potential Clustering project.
Solution
Answers will vary but an example could be:
We could try to do market research on a product using data such as incomes, location, age, sex, and opinion polls to segment the potential customer base.
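To make the segmentation idea concrete, here is a small sketch using k-means, which is one possible clustering method and is not prescribed by the question. The six customers, their (income in thousands, age) values, the choice of two clusters, and the starting centroids are all made up for illustration. Note that there is no response variable: the algorithm only sees the inputs.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    dists = [squared_distance(p, c) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, n_iters=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers: (income in thousands, age).
customers = [(20, 25), (22, 30), (21, 27), (80, 50), (85, 55), (82, 52)]
initial = [customers[0], customers[3]]  # fixed start, so the result is reproducible

centroids, labels = kmeans(customers, initial)
print(labels)  # prints: [0, 0, 0, 1, 1, 1]
```

In practice the inputs would usually be rescaled first, since features on very different scales (such as income and age) otherwise dominate the distance unevenly.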