
import numpy as np


class LDA_fs:
    """
    Performs a Linear Discriminant Analysis (LDA)

    Methods
    =======
    fit_transform():
        Fits the model to the data X and Y, derives the transformation matrix W
        and projects the feature matrix X onto the m LDA axes
    """

    def __init__(self, m):
        """
        Parameters
        ==========
        m : int
            Number of LDA axes onto which the data will be projected

        Returns
        =======
        None
        """
        self.m = m

    def fit_transform(self, X, Y):
        """
        Parameters
        ==========
        X : array(n_samples, n_features)
            Feature matrix of the dataset
        Y : array(n_samples)
            Label vector of the dataset

        Returns
        =======
        X_transform : New feature matrix projected onto the m LDA axes
        """
        # Get number of features (columns)
        self.n_features = X.shape[1]
        # Get unique class labels
        class_labels = np.unique(Y)
        # Get the overall mean vector (independent of the class labels)
        mean_overall = np.mean(X, axis=0)  # Mean of each feature
        # Initialize both scatter matrices with zeros
        SW = np.zeros((self.n_features, self.n_features))  # Within scatter matrix
        SB = np.zeros((self.n_features, self.n_features))  # Between scatter matrix

        # Iterate over all classes and select the corresponding data
        for c in class_labels:
            # Filter X for class c
            X_c = X[Y == c]
            # Calculate the mean vector for class c
            mean_c = np.mean(X_c, axis=0)
            # Calculate within-class scatter for class c
            SW += (X_c - mean_c).T.dot((X_c - mean_c))
            # Number of samples in class c
            n_c = X_c.shape[0]
            # Difference between the overall mean and the mean of class c --> between-class scatter
            mean_diff = (mean_c - mean_overall).reshape(self.n_features, 1)
            SB += n_c * (mean_diff).dot(mean_diff.T)

        # Determine SW^-1 * SB
        A = np.linalg.inv(SW).dot(SB)
        # Get the eigenvalues and eigenvectors of (SW^-1 * SB)
        eigenvalues, eigenvectors = np.linalg.eig(A)
        # Keep only the real parts of eigenvalues and eigenvectors
        # (after the transpose, each row is one eigenvector)
        eigenvalues = np.real(eigenvalues)
        eigenvectors = np.real(eigenvectors.T)
        # Sort the eigenvalues descending (high to low)
        idxs = np.argsort(np.abs(eigenvalues))[::-1]
        self.eigenvalues = np.abs(eigenvalues[idxs])
        self.eigenvectors = eigenvectors[idxs]
        # Store the first m eigenvectors as transformation matrix W
        self.W = self.eigenvectors[0:self.m]
        # Transform the feature matrix X onto LD axes
        return np.dot(X, self.W.T)
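A quick way to sanity-check an implementation like the one above is the two-class case. There the between-class scatter matrix has rank 1, so the leading eigenvector of SW^-1 * SB should point in the same direction as SW^-1 * (mean_1 - mean_0), up to sign and scale. The following is a minimal sketch of that check; the data is randomly generated for illustration only, and it recomputes the scatter matrices inline rather than calling the class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up Gaussian classes in 3 dimensions (illustrative data only)
X0 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
X1 = rng.normal(loc=[2.0, -1.0, 0.0], scale=1.0, size=(200, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# Build the within-class (SW) and between-class (SB) scatter matrices
mean_overall = X.mean(axis=0)
SW = np.zeros((3, 3))
SB = np.zeros((3, 3))
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    SW += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - mean_overall).reshape(-1, 1)
    SB += X_c.shape[0] * (diff @ diff.T)

# Leading eigenvector of SW^-1 * SB (eigenvectors are the columns here)
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(SW) @ SB)
w_eig = np.real(eigenvectors[:, np.argmax(np.abs(eigenvalues))])

# Closed-form two-class direction: SW^-1 * (mean_1 - mean_0)
w_closed = np.linalg.solve(SW, X1.mean(axis=0) - X0.mean(axis=0))

# Absolute cosine similarity; a value close to 1 means the directions agree
cosine = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(f"|cosine| between the two directions: {cosine:.6f}")
```

The absolute value is taken because an eigenvector is only defined up to sign, which is also why the sign of an LDA coefficient has to be read together with the projected histogram.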

4. Applying LDA to an industrial dataset

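To see LDA in action, we apply it to a typical task from a production environment. We have data from a simple manufacturing line with only 7 stations. Each station sends one data point (yes, I know, a single data point is highly unrealistic). Unfortunately, our line produces a significant number of defective parts, and we want to find out which stations are responsible.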

First, we load the data and take an initial look.

import pandas as pd

# URL to Github repository
url = "https://raw.githubusercontent.com/IngoNowitzky/LDA_Medium/main/production_line_data.csv"

# Read csv to DataFrame
data = pd.read_csv(url)

# Print first 5 lines
data.head()

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/c8c63823ac648330d6223f37f1ecc5be.png

Next, we use the .describe() method from Pandas to examine the distribution of the data.

# Show average, min and max of numerical values
data.describe()

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/3e772f56c494366b54ef78ece6386e3f.png

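We see 20,000 data points, with measurements ranging from -5 to +150. We therefore note that the dataset will need to be normalized later: otherwise the differing magnitudes of the values would negatively affect the LDA.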
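To make the normalization step concrete, here is a minimal NumPy sketch of a column-wise z-score. The four rows of measurements are made up for illustration and are not taken from the dataset. It uses the population standard deviation (ddof=0), which is also what sklearn's StandardScaler uses.

```python
import numpy as np

# Hypothetical measurements from three stations on very different scales
X = np.array([
    [-5.0, 0.2, 150.0],
    [-3.0, 0.4, 120.0],
    [-4.0, 0.1, 135.0],
    [-1.0, 0.3, 110.0],
])

# Column-wise z-score: subtract each feature's mean, divide by its
# population standard deviation (NumPy's default, ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After this step, no single station dominates the scatter matrices just because its raw values happen to be larger.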

How many good parts and how many bad parts do we have?

# Count the number of good and bad parts
label_counts = data['Label'].value_counts()

# Display the results
print("Number of Good and Bad Parts:")
print(label_counts)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ea0d6a2c9c981587a36f6beeb5ca4898.png

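We have 19,031 good parts and 969 bad parts. Such a heavily imbalanced dataset is a problem for further analysis. We therefore select all bad parts and an equal number of randomly chosen good parts for the subsequent steps.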

# Select all bad parts
bad_parts = data[data['Label'] == 'Bad']

# Randomly select an equal number of good parts
good_parts = data[data['Label'] == 'Good'].sample(n=len(bad_parts), random_state=42)

# Combine both subsets to create a balanced dataset
balanced_data = pd.concat([bad_parts, good_parts])

# Shuffle the combined dataset
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the number of good and bad parts in the balanced dataset
print("Number of Good and Bad Parts in the balanced dataset:")
print(balanced_data['Label'].value_counts())

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/872c6b4b1dd3882783e57eb19d3c52a8.png

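Now let's apply our from-scratch LDA implementation to the balanced dataset. We use StandardScaler from sklearn to normalize the measurements of each feature to a mean of 0 and a standard deviation of 1. We choose a single linear discriminant axis (m=1) and project the data onto it. This shows which features are most relevant for separating good parts from bad parts, and we visualize the projected data in a histogram.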

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Separate features and labels
X = balanced_data.drop(columns=['Label'])
y = balanced_data['Label']

# Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform LDA
lda = LDA_fs(m=1)  # Instantiate LDA object with 1 axis
X_lda = lda.fit_transform(X_scaled, y)  # Fit the model and project the data

# Plot the LDA projection
plt.figure(figsize=(10, 6))
plt.hist(X_lda[y == 'Good'], bins=20, alpha=0.7, label='Good', color='green')
plt.hist(X_lda[y == 'Bad'], bins=20, alpha=0.7, label='Bad', color='red')
plt.title("LDA Projection of Good and Bad Parts")
plt.xlabel("LDA Component")
plt.ylabel("Frequency")
plt.legend()
plt.show()

# Examine feature contributions to the LDA component
feature_importance = pd.DataFrame({'Feature': X.columns, 'LDA Coefficient': lda.W[0]})
feature_importance = feature_importance.sort_values(by='LDA Coefficient', ascending=False)

# Display feature importance
print("Feature Contributions to LDA Component:")
print(feature_importance)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/bc19ff379454908643a58e3ad24e81a8.png

Feature matrix projected onto one LD (m=1)

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/89ff843406588e3be91c2f2fb0098c46.png

Feature importance = how much do the stations contribute to class separation?

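The histogram shows that we can separate good parts from bad parts very well, with only a small overlap. This is already a positive result and indicates that our LDA was successful.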
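One way to put a number on "only a small overlap" is to place a threshold halfway between the two class means on the LD axis and count how many parts fall on the correct side. The sketch below illustrates the idea on made-up 1-D projections. The arrays z_good and z_bad are hypothetical stand-ins for the projected values of each class, and the sketch assumes good parts sit at lower LD values than bad parts, as in the histogram.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up 1-D projections standing in for the projected values of each class
z_good = rng.normal(loc=-1.5, scale=1.0, size=500)
z_bad = rng.normal(loc=1.5, scale=1.0, size=500)

# Threshold halfway between the two class means on the LD axis
threshold = (z_good.mean() + z_bad.mean()) / 2.0

# Assumption: good parts lie below the threshold, bad parts at or above it
correct = np.sum(z_good < threshold) + np.sum(z_bad >= threshold)
accuracy = correct / (z_good.size + z_bad.size)

print(f"threshold = {threshold:.3f}, correctly separated = {accuracy:.1%}")
```

The midpoint rule is only a rough heuristic for balanced classes with similar spread; it is meant to quantify the visual impression, not to replace a proper classifier evaluation.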

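The "LDA Coefficient" values in the table "Feature Contributions to LDA Component" are the entries of the first (and only, since m=1) eigenvector in our transformation matrix W; in the implementation above it is stored as the row lda.W[0]. They give the direction and magnitude with which the standardized measurements of each station are projected onto the LD axis. The values in the table are sorted in descending order. We have to read the table from both the top and the bottom, because the absolute value of a coefficient indicates how important the station is for class separation and therefore how much it contributes to producing defective parts. The sign indicates whether lower or higher measurements increase the likelihood of a defective part. Let's take a closer look at our example: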

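The largest absolute value belongs to Station 4, with a coefficient of -0.672. This means Station 4 has the strongest influence on part failure. Because of the negative sign, higher positive measurements are projected toward the negative end of the linear discriminant (LD) axis. The histogram shows that negative LD values are associated with good (green) parts. Conversely, low and negative values at this station increase the likelihood of part failure.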

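The second-largest absolute value belongs to Station 2, with a coefficient of 0.557. This station is therefore the second-largest contributor to part failure. The positive sign means that higher positive measurements are projected toward positive LD values. From the histogram we know that higher positive LD values are associated with a higher probability of failure. In other words, high measurements at Station 2 lead to part failure.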

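The third-largest absolute value belongs to Station 7, with a coefficient of -0.486. This makes Station 7 the third-largest contributor to part failure. The negative sign again means that higher positive values at this station lead to negative LD values (which correspond to good parts). Conversely, low and negative values at this station lead to part failure.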

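All other LDA coefficients are much smaller in magnitude than these three, so the associated stations have practically no influence on part failure.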
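Since importance is judged by absolute value, it can be more convenient to sort the coefficients that way instead of reading the table from both ends. The following sketch does this with pandas. It contains only the three coefficients quoted in the text (the remaining stations are left out because their values are not given here), and the column name "Abs Coefficient" is an illustrative choice.

```python
import pandas as pd

# Only the three coefficients quoted in the text; other stations omitted
coefficients = pd.DataFrame({
    "Feature": ["Station_2", "Station_4", "Station_7"],
    "LDA Coefficient": [0.557, -0.672, -0.486],
})

# Rank by magnitude, keeping the signed value for interpretation
coefficients["Abs Coefficient"] = coefficients["LDA Coefficient"].abs()
ranked = (
    coefficients
    .sort_values(by="Abs Coefficient", ascending=False)
    .reset_index(drop=True)
)

print(ranked)
```

This puts Station_4 first, followed by Station_2 and Station_7, which matches the order discussed above.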

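Did our LDA analysis produce the right result? As you may have guessed, the production dataset was generated synthetically. I labeled as defective every part for which the measurement at Station 2 is greater than 0.5, the value at Station 4 is less than -2.5, and the value at Station 7 is less than 3, with all three conditions holding at once. It turns out that LDA hit the mark exactly!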

# Determine if a sample is a good or bad part based on the conditions
data['Label'] = np.where(
    (data['Station_2'] > 0.5) & (data['Station_4'] < -2.5) & (data['Station_7'] < 3),
    'Bad',
    'Good',
)

5. Conclusion

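Linear Discriminant Analysis (LDA) not only reduces the complexity of a dataset but also highlights the key features that drive class separation, which makes it very effective for identifying failure causes in production systems. It is a straightforward yet powerful method with practical applications, and it is readily available in libraries such as scikit-learn.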
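For reference, here is a minimal sketch of the same kind of analysis using scikit-learn's LinearDiscriminantAnalysis in place of a from-scratch class. The data is randomly generated for illustration (four hypothetical features, two of which carry the class difference) and is not the production dataset.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Made-up data: features 0 and 2 differ between the classes, 1 and 3 do not
X_good = rng.normal(loc=[0.0, 0.0, 0.0, 0.0], scale=1.0, size=(n, 4))
X_bad = rng.normal(loc=[1.5, 0.0, -1.5, 0.0], scale=1.0, size=(n, 4))
X = np.vstack([X_good, X_bad])
y = np.array(["Good"] * n + ["Bad"] * n)

# Standardize, then project onto a single discriminant axis
X_scaled = StandardScaler().fit_transform(X)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)

# scalings_ holds the projection direction, one row per feature
direction = lda.scalings_[:, 0]
print("Projected shape:", X_lda.shape)
print("Projection direction:", direction.round(3))
```

The sign and scale of the direction returned by scikit-learn may differ from those of a hand-written eigenvector solution, so it is the relative magnitudes of the coefficients that should be compared.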

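For best results, it is essential to balance the dataset (so that each class has a similar number of samples) and to normalize it (mean 0, standard deviation 1).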

The next time you work with a large dataset that has class labels and many features, why not give LDA a try?
