Clustering Mental Health

A cluster analysis of survey responses from the tech industry.

Posted by Harrison Jansma on May 23, 2018

Clustering Mental Health in Tech Survey

Intent

The purpose of this notebook is to research and apply best practices for clustering discrete data (from surveys). I will be using data obtained from a large-scale survey of the tech workforce that asked questions about mental health in the workplace. Upon completion I will have practiced encoding discrete and categorical data, as well as analysis and visualization. Most importantly, I will apply the appropriate clustering algorithm, tune hyperparameters, and measure the results. I will conclude with a brief decision tree analysis that shows the key characteristics of each cluster.

Description of the Data

"This dataset contains the following data:

Timestamp

Age

Gender

Country

state: If you live in the United States, which state or territory do you live in?

self_employed: Are you self-employed?

family_history: Do you have a family history of mental illness?

treatment: Have you sought treatment for a mental health condition?

work_interfere: If you have a mental health condition, do you feel that it interferes with your work?

no_employees: How many employees does your company or organization have?

remote_work: Do you work remotely (outside of an office) at least 50% of the time?

tech_company: Is your employer primarily a tech company/organization?

benefits: Does your employer provide mental health benefits?

care_options: Do you know the options for mental health care your employer provides?

wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?

seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?

anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

leave: How easy is it for you to take medical leave for a mental health condition?

mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?

phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?

coworkers: Would you be willing to discuss a mental health issue with your coworkers?

supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?

mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?

phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?

mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?

obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

comments: Any additional notes or comments"

Import the Dataset and Preview


This data has been obtained from the link below.

https://www.kaggle.com/osmi/mental-health-in-tech-survey/data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
import re
%matplotlib inline
plt.style.use('ggplot')
from sklearn.preprocessing import LabelEncoder

os.chdir("C:\\Users\\harri\\.kaggle\\datasets\\osmi\\mental-health-in-tech-survey")
survey = pd.read_csv('survey.csv')
survey.head()
Out[1]:
Timestamp Age Gender Country state self_employed family_history treatment work_interfere no_employees ... leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence comments
0 2014-08-27 11:29:31 37 Female United States IL NaN No Yes Often 6-25 ... Somewhat easy No No Some of them Yes No Maybe Yes No NaN
1 2014-08-27 11:29:37 44 M United States IN NaN No No Rarely More than 1000 ... Don't know Maybe No No No No No Don't know No NaN
2 2014-08-27 11:29:44 32 Male Canada NaN NaN No No Rarely 6-25 ... Somewhat difficult No No Yes Yes Yes Yes No No NaN
3 2014-08-27 11:29:46 31 Male United Kingdom NaN NaN Yes Yes Often 26-100 ... Somewhat difficult Yes Yes Some of them No Maybe Maybe No Yes NaN
4 2014-08-27 11:30:22 31 Male United States TX NaN No No Never 100-500 ... Don't know No No Some of them Yes Yes Yes Don't know No NaN

5 rows × 27 columns

Preprocessing

Review of Changes to Come in EDA Section

  • Threw out responses that could not easily be categorized as "Male" or "Female". (No political commentary here, just avoiding outlier issues.)
  • Threw out responses with ages outside the range 15 < age < 55. (Same logic.)

Removing Superfluous Data

As they will not be useful to our clustering algorithm, we remove the Timestamp, comments, and state columns. Later on, we will also have to remove some of our demographic info so our clustering algorithm can focus solely on the question content of the survey.

In [2]:
survey = survey.drop(["Timestamp", "comments", "state"], axis =1)

Check for Empty Values

We have 1209 responses left in our data. The columns with the most non-responses are work_interfere and self_employed (with 258 and 18 non-responses, respectively). In the case of work_interfere we will have to throw out the column; imputing the values would be too detrimental, as we are missing over 20% of the data. For self_employed we fill in the missing answers with the most frequently seen response.

In [6]:
from sklearn.preprocessing import Imputer
survey = survey.drop(["work_interfere"], axis =1)

Label Encoding

Since KMeans and other machine learning algorithms can only understand numerical data, we will have to convert the text answers of the survey to discrete numbers; for example, yes/no becomes 1/0. Doing this well takes a lot of thought. Unfortunately, a major limitation of the data is the inclusion of the "Don't know" response in some of the survey questions. The ambiguity implicit in this answer will surely limit the effectiveness of a clustering algorithm; however, the purpose of this exercise is practice, so we push forward.
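
As a small illustration of why the ordered questions below get hand-built mappings rather than a plain LabelEncoder (a sketch with made-up values, not one of the original cells):

# With an ordered encoding, 'Very easy' (4) sits farther from 'Very difficult' (0)
# than from 'Somewhat easy' (3), matching the meaning of the answers; a plain
# LabelEncoder would instead number the categories alphabetically.
answers = pd.Series(['Very easy', 'Somewhat difficult', "Don't know"])
answers.map({'Very difficult': 0, 'Somewhat difficult': 1, "Don't know": 2,
             'Somewhat easy': 3, 'Very easy': 4})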

In [7]:
encoders = dict()

# Hand-built ordinal encodings for the two naturally ordered questions,
# so that "larger" answers get larger numbers.
enc_no_employees = {'1-5':0, '6-25':1, '26-100':2, '100-500':3, '500-1000':4, 'More than 1000':5}
surv_no_employees = survey.no_employees.apply(lambda row: enc_no_employees[row])

enc_leave = {'Very difficult':0, 'Somewhat difficult':1, "Don't know":2,
             'Somewhat easy':3, 'Very easy':4}
surv_leave = survey.leave.apply(lambda row: enc_leave[row])

# Label-encode the remaining categorical columns (columns 2 through 22),
# keeping each fitted encoder in case we want to invert the mapping later.
for x in range(2,23):
    encoders['le'+str(x-2)] = LabelEncoder()
    survey.iloc[:,x] = encoders['le'+str(x-2)].fit_transform(survey.iloc[:,x].astype(str))

# Overwrite the label-encoded versions of these two columns with the ordinal ones.
survey['leave'] = surv_leave
survey['no_employees'] = surv_no_employees
survey.head()
Out[7]:
Age Gender Country self_employed family_history treatment no_employees remote_work tech_company benefits ... anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence
0 37 0 44 2 0 1 1 0 1 2 ... 2 3 1 1 1 2 1 0 2 0
1 44 1 44 2 0 0 5 0 0 0 ... 0 2 0 1 0 0 1 1 0 0
2 32 1 6 2 0 0 1 0 1 1 ... 0 1 1 1 2 2 2 2 1 0
3 31 1 43 2 1 1 2 0 1 1 ... 1 1 2 2 1 0 0 0 1 1
4 31 1 44 2 0 0 3 1 1 2 ... 0 2 1 1 1 2 2 2 0 0

5 rows × 23 columns

Now that we have encoded our data, we can finish the preprocessing by imputing the self_employed column, which we skipped earlier.

In [8]:
labels = survey.columns
imp = Imputer(strategy = 'most_frequent')
survey = pd.DataFrame(imp.fit_transform(survey), columns = labels)

Beautiful. We have no missing values, and the data is encoded numerically in a way that places contextually similar answers closer together in Euclidean space. We should be good to go for clustering.
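
A one-line sanity check of that claim (not one of the original cells):

survey.isnull().sum().sum()  # expect 0 missing values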

EDA: Demographic Info

Gender of Respondents

We are going to need to do a little bit of prep here, as our survey's Gender column was collected as free text. As a result, we have issues with capitalization, spelling errors, and abbreviations. We will remove all of the respondents that are not male/female, as I cannot easily categorize an "Other" category.

In [3]:
def encode(text):
    # Map free-text gender responses (including common misspellings seen in
    # the data) to 1/0; anything unrecognized becomes 2 and is dropped below.
    text = text.lower()
    if text in ['male', 'm', 'mail', 'make', 'man', 'malr', 'mal', 'maile']:
        text = 1
        return text
    elif text in ['f', 'female', 'femake', 'woman']:
        text = 0
        return text
    elif 'female' in text:
        text = 0
        return text
    else:
        text = 2
        return text

survey.Gender = survey.Gender.apply(encode)

survey = survey[survey.Gender != 2]  # keep only the male/female responses
plt.figure(figsize =(10,5))
plt.bar(survey.Gender.value_counts().index, survey.Gender.value_counts())
plt.xticks([0, 1], ['Female', 'Male'], fontsize = 14)
plt.title('Gender Distribution of Respondents')
plt.ylabel('Number of Respondents')
plt.tight_layout()
plt.show()

Age Distribution of Respondents

We have removed responses with ages of 15 or below and 55 or above. Removal of outliers will make clustering easier as we continue. The average age of respondents is 35.

In [4]:
survey = survey[survey.Age>15]
survey = survey[survey.Age<55]

plt.figure(figsize =(10,5))
plt.bar(survey.Age.value_counts().index, survey.Age.value_counts())
plt.title('Age of Respondents')
plt.ylabel('Number of Respondents')
plt.xlabel('Age of Respondents')
av_age = np.mean(survey.Age)
plt.axvline(av_age,
            color = 'black',
            linestyle = '--',
            label = 'Average Age')
plt.legend()
plt.show()

Nationality of Respondents

Around 60% of respondents came from the US, and another 14% come from the UK.

In [5]:
plt.figure(figsize =(10,5))
plt.bar(survey.Country.value_counts().index[0:10], survey.Country.value_counts()[0:10])
plt.title('Nationality of Respondents')
plt.ylabel('Number of Respondents')
plt.xticks(rotation = 90, fontsize = 14)
plt.show()

EDA: Conclusion

As we continue on to the actual clustering of the data, we are going to remove the Country, Gender, and Age columns from our data. The reason behind this decision is that categorical variables like Country add too much complexity to the model. I might be able to come up with useful features through some smart feature engineering, but the effort would outweigh the gain, and this is merely an exercise.
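
For a sense of that complexity: one-hot encoding Country alone would add one binary column per distinct country, dwarfing the roughly twenty survey-question columns. (Illustrative only, on the raw text column; we never run this.)

# Hypothetical: dozens of sparse binary columns from a single categorical field
pd.get_dummies(survey['Country'], prefix = 'country').shape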

In [9]:
survey = survey.drop(["Age", "Gender",], axis = 1)
In [ ]:
survey = survey.drop(["Country"], axis =1)

Clustering without PCA

In this section I go through the clustering process with the KMeans and Agglomerative Clustering algorithms. In the next section we will reduce the dimension of the feature space; for now we feed the entirety of our data into each algorithm and look for the right number of clusters.

K-Means and Clustering Metrics

Inertia Plot

Below is a plot of the decrease in inertia as we increase the number of clusters. Remember, we want our inertia to be LOW. A low inertia means our clusters are very tightly packed, like peas in a pod. There is no hard elbow, so we pick 5 as a good number of clusters, based on the diminishing decrease in inertia.
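
Concretely, inertia is the sum of squared distances from each sample to its assigned cluster centroid. A minimal sketch of that computation (assuming the encoded survey frame from above; not one of the original cells):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters = 5).fit(survey)
# Sum of squared distances from each sample to its cluster's centroid
manual_inertia = sum(
    np.sum((survey.values[km.labels_ == k] - center) ** 2)
    for k, center in enumerate(km.cluster_centers_))
# manual_inertia matches km.inertia_ up to floating-point error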

In [29]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN #For clustering
from matplotlib import cm
from sklearn.metrics import silhouette_samples, silhouette_score


def inertia_plot( clust, X, start = 2, stop = 10):
    #A simple inertia plotter to decide K in KMeans
    inertia = []
    for x in range(start,stop):
        km = clust(n_clusters = x)
        labels = km.fit_predict(X)
        inertia.append(km.inertia_)
    plt.figure(figsize = (10,6))
    plt.plot(range(start,stop), inertia, marker = 'o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Inertia Decrease with K')
    plt.xticks(list(range(start, stop)))
    plt.show()

def silh_samp_cluster(clust,  X, start=2, stop=5, metric = 'euclidean'):
    # Adapted from Sebastian Raschka's book Python Machine Learning, 2nd edition
    for x in range(start, stop):
        km = clust(n_clusters = x)
        y_km = km.fit_predict(X)
        cluster_labels = np.unique(y_km)
        n_clusters = cluster_labels.shape[0]
        silhouette_vals = silhouette_samples(X, y_km, metric = metric)
        y_ax_lower, y_ax_upper =0,0
        yticks = []
        for i, c in enumerate(cluster_labels):
            c_silhouette_vals = silhouette_vals[y_km == c]
            c_silhouette_vals.sort()
            y_ax_upper += len(c_silhouette_vals)
            color = cm.jet(float(i)/n_clusters)
            plt.barh(range(y_ax_lower, y_ax_upper),
                    c_silhouette_vals,
                    height=1.0,
                    edgecolor='none',
                    color = color)
            yticks.append((y_ax_lower + y_ax_upper)/2.)
            y_ax_lower+= len(c_silhouette_vals)

        silhouette_avg = np.mean(silhouette_vals)
        plt.axvline(silhouette_avg,
                   color = 'red',
                   linestyle = "--")
        plt.yticks(yticks, cluster_labels+1)
        plt.ylabel("cluster")
        plt.xlabel('Silhouette Coefficient')
        plt.title('Silhouette for ' + str(x) + " Clusters")
        plt.show()
In [22]:
inertia_plot(KMeans, survey)

Silhouette Scores

Next we have silhouette scores for different numbers of clusters. A score of -1 means poor clustering, 0 means cluster overlap, and 1 means good clustering. (Good clustering means tightly packed clusters that are far away from each other.)
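
For intuition, here is the silhouette coefficient for a single sample, written out from its definition; sklearn's silhouette_score averages this quantity over all samples. (A sketch for illustration, not one of the original cells.)

import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_one(X, labels, i):
    # a = mean distance from sample i to the other members of its own cluster
    # b = lowest mean distance from sample i to the members of any other cluster
    # s = (b - a) / max(a, b)
    D = pairwise_distances(X)
    labels = np.asarray(labels)
    own = (labels == labels[i]) & (np.arange(len(labels)) != i)
    a = D[i, own].mean()
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)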

In [116]:
for x in range(2,8):
    km = KMeans(n_clusters = x)
    label = km.fit_predict(survey)
    print('Silhouette-Score for', x,  'Clusters: ', silhouette_score(survey, label))

Silhouette-Score for 2 Clusters:  0.1901790243780608
Silhouette-Score for 3 Clusters:  0.14167460926516368
Silhouette-Score for 4 Clusters:  0.12022983394445913
Silhouette-Score for 5 Clusters:  0.11616351570181287
Silhouette-Score for 6 Clusters:  0.11802987050715062
Silhouette-Score for 7 Clusters:  0.10049815890508677

Silhouette Plots

Lastly, we have silhouette plots, which show the silhouette scores of every sample within each cluster. Note that imbalanced clusters lead to wider bars. Samples with a higher silhouette coefficient are close to their cluster mates and far from their neighbors in other clusters.

In [117]:
silh_samp_cluster(KMeans, survey, stop =6)

Conclusion

Not great. We have a high average inertia, low silhouette scores, and a high variance in the silhouette plots. Let's try Agglomerative Clustering.

Agglomerative Clustering

Silhouette Scores

In [118]:
for x in range(2,8):
    ag = AgglomerativeClustering(n_clusters = x, )
    label = ag.fit_predict(survey)
    print('Silhouette-Score for', x,  'Clusters: ', silhouette_score(survey, label))

Silhouette-Score for 2 Clusters:  0.1965709194292001
Silhouette-Score for 3 Clusters:  0.11195933280363203
Silhouette-Score for 4 Clusters:  0.09979111096256836
Silhouette-Score for 5 Clusters:  0.08727975194310092
Silhouette-Score for 6 Clusters:  0.08760098945117002
Silhouette-Score for 7 Clusters:  0.08980342970987766

Silhouette Plots

In [119]:
silh_samp_cluster(AgglomerativeClustering, survey, stop =6)

Clustering with PCA

KMeans with PCA

Silhouette Scores and Inertias

Below we see much-improved silhouette scores and lower average inertia following a dimension reduction with Principal Component Analysis. Based on the data below, I would say that 2 components and 4-6 clusters would be the best bet.
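
One way to sanity-check how much information survives the projection is PCA's explained variance ratio. (A diagnostic sketch; these numbers were not part of the original run.)

from sklearn.decomposition import PCA

pca = PCA(n_components = 5).fit(survey)
print(pca.explained_variance_ratio_)           # variance captured by each component
print(pca.explained_variance_ratio_.cumsum())  # cumulative share of total variance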

In [121]:
from sklearn.decomposition import PCA
for y in range(2,6):
    print("PCA with # of components: ", y)
    pca = PCA(n_components =y)
    survey_p = pca.fit_transform(survey)
    for x in range(2, 9):
        km = KMeans(n_clusters = x)
        label = km.fit_predict(survey_p)
        print('Silhouette-Score for', x,  'Clusters: ', silhouette_score(survey_p, label) , '       Inertia: ', km.inertia_)
    print()

PCA with # of components:  2
Silhouette-Score for 2 Clusters:  0.43478973717080877        Inertia:  3061.14193519136
Silhouette-Score for 3 Clusters:  0.40862240624424523        Inertia:  2057.4522005745225
Silhouette-Score for 4 Clusters:  0.3882108470196273        Inertia:  1542.815091931754
Silhouette-Score for 5 Clusters:  0.3857258539343663        Inertia:  1177.7320837112477
Silhouette-Score for 6 Clusters:  0.3916480464060315        Inertia:  954.0398029866354
Silhouette-Score for 7 Clusters:  0.38317522216411953        Inertia:  824.6732608434008
Silhouette-Score for 8 Clusters:  0.3773630365562337        Inertia:  721.234582263155

PCA with # of components:  3
Silhouette-Score for 2 Clusters:  0.34806804781852896        Inertia:  4575.186547457093
Silhouette-Score for 3 Clusters:  0.3057300924763964        Inertia:  3570.774554338654
Silhouette-Score for 4 Clusters:  0.28725142347175503        Inertia:  2986.7180607475107
Silhouette-Score for 5 Clusters:  0.29009648074939304        Inertia:  2453.078766851491
Silhouette-Score for 6 Clusters:  0.28265330539433975        Inertia:  2162.637023698975
Silhouette-Score for 7 Clusters:  0.29333759099854545        Inertia:  1941.0197181797594
Silhouette-Score for 8 Clusters:  0.2903033382833832        Inertia:  1738.0616227891483

PCA with # of components:  4
Silhouette-Score for 2 Clusters:  0.3096770112894559        Inertia:  5440.124821320205
Silhouette-Score for 3 Clusters:  0.26369373190295553        Inertia:  4433.598905373727
Silhouette-Score for 4 Clusters:  0.23857610112766858        Inertia:  3846.0942201924654
Silhouette-Score for 5 Clusters:  0.24067370060962573        Inertia:  3311.829282035463
Silhouette-Score for 6 Clusters:  0.2279334463413734        Inertia:  3017.5233399320678
Silhouette-Score for 7 Clusters:  0.24126517264172975        Inertia:  2757.6447772022516
Silhouette-Score for 8 Clusters:  0.24518131489912756        Inertia:  2538.510615415523

PCA with # of components:  5
Silhouette-Score for 2 Clusters:  0.2824659555418534        Inertia:  6237.065547639066
Silhouette-Score for 3 Clusters:  0.2365893262887934        Inertia:  5225.088883387629
Silhouette-Score for 4 Clusters:  0.21084428867918975        Inertia:  4633.390322515035
Silhouette-Score for 5 Clusters:  0.2149571350100246        Inertia:  4091.183724625033
Silhouette-Score for 6 Clusters:  0.22026950302628318        Inertia:  3763.642257528536
Silhouette-Score for 7 Clusters:  0.2157564999794979        Inertia:  3483.3312471163326
Silhouette-Score for 8 Clusters:  0.2116879942756015        Inertia:  3268.5061676171335

Visualizing in 2 Dimensions

Let's visualize a scatter plot with 2 principal components and 5 clusters determined by the KMeans algorithm.

In [123]:
survey_p = pd.DataFrame(PCA(n_components = 2).fit_transform(survey))
preds = pd.Series(KMeans(n_clusters = 5).fit_predict(survey_p))
survey_p = pd.concat([survey_p, preds], axis = 1)
survey_p.columns = [0, 1, 'target']

fig = plt.figure(figsize = (18, 7))
colors = ['red', 'green', 'blue', 'purple', 'orange']
plt.subplot(121)
# One scatter call per cluster so each gets its own color and legend entry
for i in range(5):
    cluster = survey_p[survey_p.target == i]
    plt.scatter(cluster[0], cluster[1], c = colors[i], label = 'cluster ' + str(i + 1))
plt.legend()
plt.title('KMeans Clustering with 5 Clusters')
plt.xlabel('PC1')
plt.ylabel('PC2')

survey_p = pd.DataFrame(PCA(n_components = 2).fit_transform(survey))
preds = pd.Series(KMeans(n_clusters = 4).fit_predict(survey_p))
survey_p = pd.concat([survey_p, preds], axis = 1)
survey_p.columns = [0, 1, 'target']

plt.subplot(122)
for i in range(4):
    cluster = survey_p[survey_p.target == i]
    plt.scatter(cluster[0], cluster[1], c = colors[i], label = 'cluster ' + str(i + 1))
plt.legend()
plt.title('KMeans Clustering with 4 Clusters')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

Visualizing in 3 Dimensions

Now with 3 Principal Components.

In [113]:
from mpl_toolkits.mplot3d import Axes3D

survey_p3 = pd.DataFrame(PCA(n_components = 3).fit_transform(survey))
preds = pd.Series(KMeans(n_clusters = 5).fit_predict(survey_p3))
survey_p3 = pd.concat([survey_p3, preds], axis = 1)
survey_p3.columns = [0, 1, 2, 'target']

fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111, projection='3d')
colors = ['red', 'green', 'blue', 'purple', 'orange']
for i in range(5):
    cluster = survey_p3[survey_p3.target == i]
    # Pass all three principal components so the plot is actually 3D
    ax.scatter(cluster[0], cluster[1], cluster[2], c = colors[i], label = 'cluster ' + str(i + 1))
ax.legend()
ax.set_title('KMeans Clustering with 3 Principal Components')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

Conclusion

Based on the above analysis, we made some serious gains by applying PCA, but there are distinct patterns in the 2-dimensional data that I would still like to capture. Next we try the same process with Agglomerative Clustering.

Agglomerative Clustering with PCA

Silhouette Scores

Agglomerative Clustering has no inertia metric, so we report only silhouette scores here. (As an alternative diagnostic, a dendrogram is sketched after the scores below.)

In [125]:
for y in range(2,6):
    print("PCA with # of components: ", y)
    pca = PCA(n_components =y)
    survey_p = pca.fit_transform(survey)
    for x in range(2, 9):
        ag = AgglomerativeClustering(n_clusters = x, )
        label = ag.fit_predict(survey_p)
        print('Silhouette-Score for', x,  'Clusters: ', silhouette_score(survey_p, label))
    print()
PCA with # of components:  2
Silhouette-Score for 2 Clusters:  0.44520944187485545
Silhouette-Score for 3 Clusters:  0.3982310337077982
Silhouette-Score for 4 Clusters:  0.3276185047633979
Silhouette-Score for 5 Clusters:  0.32328488796632937
Silhouette-Score for 6 Clusters:  0.3316368039874285
Silhouette-Score for 7 Clusters:  0.3263468375392069
Silhouette-Score for 8 Clusters:  0.32740571021957593

PCA with # of components:  3
Silhouette-Score for 2 Clusters:  0.3611725317050327
Silhouette-Score for 3 Clusters:  0.2562683041525399
Silhouette-Score for 4 Clusters:  0.24129145462702958
Silhouette-Score for 5 Clusters:  0.22281786699002076
Silhouette-Score for 6 Clusters:  0.21074596230949919
Silhouette-Score for 7 Clusters:  0.21810689423819662
Silhouette-Score for 8 Clusters:  0.23069898398257283

PCA with # of components:  4
Silhouette-Score for 2 Clusters:  0.32445644274354735
Silhouette-Score for 3 Clusters:  0.23312693604443777
Silhouette-Score for 4 Clusters:  0.22056466501422275
Silhouette-Score for 5 Clusters:  0.21025854168941271
Silhouette-Score for 6 Clusters:  0.17803535661926223
Silhouette-Score for 7 Clusters:  0.176618042281536
Silhouette-Score for 8 Clusters:  0.18660832100208677

PCA with # of components:  5
Silhouette-Score for 2 Clusters:  0.26097624669996816
Silhouette-Score for 3 Clusters:  0.2087425037355288
Silhouette-Score for 4 Clusters:  0.16992611075290612
Silhouette-Score for 5 Clusters:  0.16041753502807943
Silhouette-Score for 6 Clusters:  0.1653342439459336
Silhouette-Score for 7 Clusters:  0.15975255083507262
Silhouette-Score for 8 Clusters:  0.155304012476521
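
Since there is no inertia to plot, a dendrogram is one alternative way to eyeball a good cluster count. (A sketch using scipy, not one of the original cells; ward linkage matches sklearn's default, and we recompute the 2-component projection so the snippet stands alone.)

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.decomposition import PCA

X2 = PCA(n_components = 2).fit_transform(survey)
Z = linkage(X2, method = 'ward')  # same criterion as AgglomerativeClustering's default
plt.figure(figsize = (10, 5))
dendrogram(Z, truncate_mode = 'lastp', p = 20)  # collapse the tree to the last 20 merges
plt.title('Dendrogram of the PCA-Reduced Survey Data (truncated)')
plt.ylabel('Merge distance')
plt.show()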

Visualizing in 2 Dimensions

In [126]:
survey_p = pd.DataFrame(PCA(n_components = 2).fit_transform(survey))
preds = pd.Series(AgglomerativeClustering(n_clusters = 5).fit_predict(survey_p))
survey_p = pd.concat([survey_p, preds], axis = 1)
survey_p.columns = [0, 1, 'target']

fig = plt.figure(figsize = (18, 7))
colors = ['red', 'green', 'blue', 'purple', 'orange']
plt.subplot(121)
for i in range(5):
    cluster = survey_p[survey_p.target == i]
    plt.scatter(cluster[0], cluster[1], c = colors[i], label = 'cluster ' + str(i + 1))
plt.legend()
plt.title('Agg Clustering with 5 Clusters')
plt.xlabel('PC1')
plt.ylabel('PC2')

survey_p = pd.DataFrame(PCA(n_components = 2).fit_transform(survey))
preds = pd.Series(AgglomerativeClustering(n_clusters = 4).fit_predict(survey_p))
survey_p = pd.concat([survey_p, preds], axis = 1)
survey_p.columns = [0, 1, 'target']

plt.subplot(122)
for i in range(4):
    cluster = survey_p[survey_p.target == i]
    plt.scatter(cluster[0], cluster[1], c = colors[i], label = 'cluster ' + str(i + 1))
plt.legend()
plt.title('Agg Clustering with 4 Clusters')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

Visualizing in 3 Dimensions

In [114]:
survey_p3 = pd.DataFrame(PCA(n_components = 3).fit_transform(survey))
preds = pd.Series(AgglomerativeClustering(n_clusters = 5).fit_predict(survey_p3))
survey_p3 = pd.concat([survey_p3, preds], axis = 1)
survey_p3.columns = [0, 1, 2, 'target']

fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111, projection='3d')
colors = ['red', 'green', 'blue', 'purple', 'orange']
for i in range(5):
    cluster = survey_p3[survey_p3.target == i]
    # Pass all three principal components so the plot is actually 3D
    ax.scatter(cluster[0], cluster[1], cluster[2], c = colors[i], label = 'cluster ' + str(i + 1))
ax.legend()
ax.set_title('Agglomerative Clustering with 3 Principal Components')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()