Class notes for the Machine Learning Nanodegree at Udacity
""" quiz materials for feature scaling clustering """
### FYI, the most straightforward implementation might
### throw a divide-by-zero error, if the min and max
### values are the same
### but think about this for a second--that means that every
### data point has the same value for that feature!
### why would you rescale it? Or even use it at all?
def featureScaling(arr):
    # guard against divide-by-zero when every value is identical
    min_x = min(arr)
    value_range = max(arr) - min_x
    if value_range == 0:
        return [1 for x in arr]
    # map the minimum to 0, the maximum to 1, everything else in between
    return [float(x - min_x) / value_range for x in arr]

# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))
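With the test data above, the scaler prints [0.0, 0.4166666666666667, 1.0]: 115 maps to 0, 175 maps to 1, and 140 sits 25/60 of the way through the range.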
Answer: SVM and K-means are affected by feature rescaling. Take the feature rescaling example where weight and height were rescaled so that their contributions to the outcome would be the same (i.e., both between 0 and 1). Because SVM and K-means compute distances, scaling the features changes the calculated distances and therefore changes the result.
In contrast, Decision Trees and Linear Regression don’t measure distances. A Decision Tree picks, for each feature, some constant value at which to split the data; if we scale that feature, the split value is scaled by the same amount and the result doesn’t change. Similarly, Linear Regression fits a coefficient for each feature, so the coefficients are independent of each other and rescaling the features won’t change the results.
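As a minimal sketch of this effect (my own illustration, not from the lecture), the snippet below uses scikit-learn's MinMaxScaler on made-up weight/height points and shows how the pairwise distances that SVM and K-means rely on change after rescaling.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import euclidean_distances

# made-up weight (lbs) and height (ft) points, purely for illustration
X = np.array([[115.0, 5.2],
              [140.0, 5.9],
              [175.0, 6.1]])

# raw distances are dominated by weight because of its larger numeric range
print(euclidean_distances(X))

# rescale each feature to [0, 1]; now weight and height contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
print(euclidean_distances(X_scaled))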
N is the number of features of our model. Feature selection is defined as:
F(N) -> M, where M ≤ N
To select all M relevant features from the N available, without knowing M, we need to try all subsets of N. Since we don’t know M, this gives us 2^N possible subsets to evaluate.
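A quick sketch (my own illustration, not from the course) of why exhaustive selection is intractable: enumerating every feature subset with itertools yields exactly 2^N candidates.

from itertools import chain, combinations

def all_feature_subsets(features):
    """Yield every subset of the feature list (including the empty set)."""
    return chain.from_iterable(
        combinations(features, k) for k in range(len(features) + 1)
    )

features = ["a", "b", "c", "d", "e"]
subsets = list(all_feature_subsets(features))
print(len(subsets))  # 2**5 = 32 candidate subsets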
Filtering is a straightforward process where the search algorithm and the learner don’t interact. In wrapping, the learner gives feedback to the search algorithm about which features are impacting the learning process. Because of this, wrapping is slower than filtering.
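To make the distinction concrete, here is a small sketch (my own, using scikit-learn rather than anything from the lecture): SelectKBest is a filter that scores features without consulting the learner, while RFE is a wrapper that repeatedly refits the learner to decide which features to drop.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=42)

# Filtering: rank features by a statistic (here an ANOVA F-test);
# the learner is never consulted.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(filter_selector.get_support(indices=True))

# Wrapping: RFE repeatedly trains the learner and drops the weakest
# feature each round, so the learner's feedback drives the search.
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=3).fit(X, y)
print(wrapper_selector.get_support(indices=True))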
In the AND quiz-example, the function to learn is a AND b. Because of that, it is easy to see which features are necessary to minimize the error for both learners. The teacher shows that the features a and b are strongly relevant, since the function is a AND b and we need both to correctly predict the result. By making e equal to NOT a, we make both a and e weakly relevant, because without one of them we can still use the other to predict the result of a AND b (or NOT e AND b). The features c and d are irrelevant, but not useless, as we’ll see next.
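As a quick illustration of the weak-relevance point (my own sketch, not from the lecture): once e = NOT a is added, either a or e can be combined with b to recover the target.

from itertools import product

# enumerate every assignment of the original inputs a and b
for a, b in product([False, True], repeat=2):
    e = not a                      # the derived feature e = NOT a
    target = a and b               # the function the learners must fit
    alternative = (not e) and b    # prediction that uses e instead of a
    assert target == alternative   # a and e are interchangeable given b
print("a AND b == (NOT e) AND b for every input")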