
Supervised Learning with scikit-learn Cheat Sheet (DRAFT)


This is a draft cheat sheet. It is a work in progress and is not finished yet.

Initial Data Processing

df.info()
Print column dtypes, non-null counts, and memory usage
df.shape
Tuple of (number of rows, number of columns)
df.head()
First five rows of the DataFrame
df.describe()
Summary statistics for the numeric columns
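A minimal runnable sketch of this inspection workflow; the DataFrame and its column names here are invented for illustration:

import pandas as pd

# Hypothetical toy DataFrame standing in for a real dataset
df = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'education': [1, 0, 1, 0],
    'satellite': [0, 1, 1, 0],
})

df.info()             # column dtypes and non-null counts
print(df.shape)       # (4, 3)
print(df.head())      # first five rows
print(df.describe())  # summary statistics for numeric columns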
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
In sns.countplot(), we specify the x-axis data to be 'education' and hue to be 'party'. Recall that 'party' is also our target variable, so the resulting plot shows the difference in voting behavior between the two parties on the 'education' bill, with each party colored differently. We manually set the palette to 'RdBu', as the Republican party has traditionally been associated with red and the Democratic party with blue.

Unsupervised

from sklearn.cluster import KMeans
# Import KMeans
model = KMeans(n_clusters=3)
# Create a KMeans instance with 3 clusters: model
model.fit(points)
# Fit model to points
labels = model.predict(new_points)
# Determine the cluster labels of new_points: labels
centroids = model.cluster_centers_
Assign the cluster centers: centroids. Note that model was created as KMeans(n_clusters=k)
df = pd.DataFrame({'NameOfArray1': array1, 'NameOfArray2': array2})
Create a DataFrame with arrays as columns: df
pd.crosstab(df['NameOfArray1'], df['NameOfArray2'])
Build a table counting how many times each label in array2 coincides with each label in array1
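Putting the snippets above together in one runnable sketch; the blob data, seed, and column names are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented 2-D points: two well-separated blobs
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                    rng.normal(5, 0.5, size=(50, 2))])
known_groups = np.array([0] * 50 + [1] * 50)

model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(points)
labels = model.predict(points)

# Cross-tabulate cluster labels against the known grouping
df = pd.DataFrame({'labels': labels, 'groups': known_groups})
print(pd.crosstab(df['labels'], df['groups']))
print(model.cluster_centers_)  # one centroid per cluster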
 

Classification

X = df.drop('targetvariable', axis=1).values
Note the use of .drop() to drop the target variable from the feature array X, and of the .values attribute to ensure X is a NumPy array
knn = KNeighborsClassifier(n_neighbors=6)
Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter
knn.fit(X, y)
Fit the classifier to the data using the .fit() method. X is the feature array, y is the target variable
from sklearn.neighbors import KNeighborsClassifier
Import KNeighborsClassifier from sklearn.neighbors
knn.predict(X_new)
Predict the label for the new data point X_new
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Create stratified training and test sets, using 0.2 for the size of the test set and a random state of 42. Stratifying the split keeps the labels distributed in the training and test sets as they are in the original dataset.
knn.score(X_test, y_test)
Compute the accuracy of the classifier's predictions on the test set using the .score() method
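The classification steps above, assembled into one runnable sketch (scikit-learn's built-in iris data is used purely as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in iris data stands in for the course dataset
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)         # fit on the training set only

X_new = X_test[:1]                # one "unseen" sample
print(knn.predict(X_new))         # predicted class label
print(knn.score(X_test, y_test))  # test-set accuracy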
np.arange(1, 9)
NumPy array of the integers 1 through 8; the stop value 9 is exclusive
for counter, value in enumerate(some_list):
    print(counter, value)
enumerate() is a built-in Python function that lets you loop over an iterable while keeping an automatic counter, so you don't have to manage an index variable yourself. The optional second argument sets the starting count:
my_list = ['apple', 'banana', 'grapes', 'pear']
for c, value in enumerate(my_list, 1):
    print(c, value)
Output:
# 1 apple
# 2 banana
# 3 grapes
# 4 pear
 

Regression

df['ColName1'].corr(df['ColName2'])
Calculate the correlation between ColName1 and ColName2 in DataFrame df
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
Returns num samples evenly spaced over the interval [start, stop]. Similar to arange, but you specify the number of samples instead of the step. Parameters: start — start of the interval (default 0); stop — end of the interval; num — number of samples to generate (int, optional); endpoint — if True, stop is the last sample; retstep — if True, return (samples, step) (default False); dtype — type of the output array.
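A quick illustration of both behaviors:

import numpy as np

# Five evenly spaced samples from 0 to 1, endpoint included
print(np.linspace(0, 1, num=5))
# [0.   0.25 0.5  0.75 1.  ]

# retstep=True also returns the spacing between samples
samples, step = np.linspace(0, 1, num=5, retstep=True)
print(step)  # 0.25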
from sklearn.linear_model import LinearRegression
Import LinearRegression
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Mean squared error regression loss
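A minimal sketch of computing the MSE (and RMSE, its square root) for a fitted model; the toy data is invented:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented 1-D data: y is roughly 2x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 1, size=100)

reg = LinearRegression()
reg.fit(X, y)
y_pred = reg.predict(X)

mse = mean_squared_error(y, y_pred)
print(mse)           # mean squared error
print(np.sqrt(mse))  # root mean squared error (RMSE)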
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
Create a linear regression object: reg
cv_scores = cross_val_score(reg, X, y, cv=5)
Compute 5-fold cross-validation scores: cv_scores
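A runnable sketch of the same pattern on synthetic data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

reg = LinearRegression()
cv_scores = cross_val_score(reg, X, y, cv=5)  # default scoring is R^2
print(cv_scores)           # one score per fold
print(np.mean(cv_scores))  # average cross-validation score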
from sklearn.linear_model import Lasso
Import Lasso
lasso = Lasso(alpha=0.4, normalize=True)
# Instantiate a lasso regressor: lasso
lasso.fit(X, y)
# Fit the regressor to the data
lasso_coef = lasso.coef_
# Extract the coefficients: lasso_coef
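A sketch of inspecting which coefficients lasso shrinks to zero (its built-in feature selection), using a stand-in dataset; note the caveat about normalize= in the comments:

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

# Built-in diabetes data stands in for the course dataset.
# Note: normalize= was removed from Lasso in scikit-learn 1.2;
# scale features beforehand (e.g. with StandardScaler) instead.
data = load_diabetes()
lasso = Lasso(alpha=0.4)
lasso.fit(data.data, data.target)
lasso_coef = lasso.coef_

# Coefficients driven to exactly 0 mark features lasso has dropped
plt.plot(range(len(data.feature_names)), lasso_coef)
plt.xticks(range(len(data.feature_names)), data.feature_names, rotation=60)
plt.ylabel('Coefficients')
plt.show()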
from sklearn.linear_model import Ridge
# Import necessary modules
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error,
                    cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
Fit ridge regression models over a range of different alphas and plot cross-validated R2 scores for each, using the display_plot() function defined above, which plots the R2 score as well as the standard error for each alpha.
cross_val_score(Ridge(normalize=True), X, y, cv=10)
Perform 10-fold cross-validation for Ridge regression
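A sketch of the full alpha sweep that feeds display_plot() (synthetic data; since normalize= was removed from Ridge in scikit-learn 1.2, the sketch passes alpha alone):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

alpha_space = np.logspace(-4, 0, 50)  # alphas from 1e-4 to 1
ridge_scores = []
ridge_scores_std = []

for alpha in alpha_space:
    ridge = Ridge(alpha=alpha)
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    ridge_scores.append(np.mean(ridge_cv_scores))
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Visualize with the display_plot() helper defined above
display_plot(np.array(ridge_scores), np.array(ridge_scores_std))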