Tuesday, October 17, 2023

In [55]:

#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_selection import SelectKBest, mutual_info_classif

In [56]:

#Importing the member_history dataset
member_history = pd.read_excel('Member_History.xlsx', sheet_name='MemHistory')
member_history["Initiation Date"] = pd.to_datetime(member_history["Initiation Date"])
member_history.head(3)

Out[56]:

   Constituent ID  Membership Organization  Membership Level  Inception Date  Initiation Date  Expiration Date  Custom Category 01
0           68233    Individual Membership         Supporter      2004-10-20       1948-12-30       1949-12-31              (none)
1            8056    Individual Membership            Senior      2004-04-06       1948-12-30       1949-12-29              (none)
2           54161    Individual Membership         Supporter      2004-04-13       1948-12-30       1949-12-31              (none)


In [57]:

#Importing the jazz_membership dataset
jazz_membership = pd.read_excel('sf_jazz_membership_data.xlsx')
jazz_membership["Starts"] = pd.to_datetime(jazz_membership["Starts"], format="%Y-%m-%d")
jazz_membership.head(5)

Out[57]:

   Campaign name                                       Ad Set Name                                         Ad name                                        Month                    Delivery status  Delivery level   Reach  Impressions
0  NaN                                                 NaN                                                 NaN                                            NaN                      NaN              NaN             857447      7034538
1  SFJAZZ At Home | Retargeting | 23-24 Always On...   SFJAZZ | Watch Page Visitor 7dLB | West Coast       Video | SFJAZZ At Home Sizzle Reel 2023        2023-10-01 - 2023-10-05  active           ad                 138          ...
2  Concert | Prospecting | Meshell Ndegeocello (1...   SFJAZZ | All Members CRM List LAL | 50 Mile Ra...   Soundcard | Meshell Ndegeocello (10/27-10/29)  2023-10-01 - 2023-10-05  active           ad                1719         1998
3  SFJAZZ At Home | Prospecting | Sep 2023 Always...   Interest | Jazz Targeted | West Coast               Video | SFJAZZ At Home Sep Broadcasts          2023-10-01 - 2023-10-05  not_delivering   ad                   0          ...
4  SFJAZZ At Home | Prospecting | Oct 2023 Always...   SFJAZZ | All Members CRM List LAL | West Coast      Video | SFJAZZ At Home Oct Broadcasts          2023-10-01 - 2023-10-05  active           ad                2435         4208

5 rows × 25 columns

In [58]:

#Importing the facebook_Reach dataset
facebook_Reach = pd.read_csv('Facebook_Reach.csv', encoding='ISO-8859-1')
facebook_Reach["Date"] = pd.to_datetime(facebook_Reach["Date"], format="%Y-%m-%d")
facebook_Reach.head(3)

Out[58]:

   Date        Facebook reach
0  2023-01-01           35390
1  2023-01-02           29241
2  2023-01-03           21768


In [59]:

#Importing the page_Profile_visits dataset
page_Profile_visits = pd.read_csv('page_Profile_visits.csv', encoding='ISO-8859-1')
page_Profile_visits["Date"] = pd.to_datetime(page_Profile_visits["Date"], format="%Y-%m-%d")
page_Profile_visits.head(3)

Out[59]:

   Date        Facebook Page likes
0  2023-01-01                  217
1  2023-01-02                  215
2  2023-01-03                  118

In [60]:

#Importing the New_likes_and_follows dataset
New_likes_and_follows = pd.read_csv('New_likes_and_follows.csv', encoding='ISO-8859-1')
New_likes_and_follows["Date"] = pd.to_datetime(New_likes_and_follows["Date"], format="%Y-%m-%d")
New_likes_and_follows.head(3)

Out[60]:

   Date        New Facebook Page likes
0  2023-01-01                       11
1  2023-01-02                       14
2  2023-01-03                       11

In [61]:

#converting the date to datetime object format
New_likes_and_follows["Date"] = pd.to_datetime(New_likes_and_follows["Date"])

New_likes_and_follows

Out[61]:

     Date        New Facebook Page likes
0    2023-01-01                       11
1    2023-01-02                       14
2    2023-01-03                       11
3    2023-01-04                       19
4    2023-01-05                       14
...         ...                      ...
552  2023-10-05                       55
553  2023-10-06                       69
554  2023-10-07                       55
555  2023-10-08                       66
556  2023-10-09                       59

557 rows × 2 columns


In [62]:

# Performing the merge based on Initiation Date and Starts
combined_data = jazz_membership.merge(member_history, left_on='Starts', right_on='Initiation Date', how='inner')
combined_data = combined_data.merge(facebook_Reach, left_on='Starts', right_on='Date', how='inner')
combined_data = combined_data.merge(page_Profile_visits, left_on='Starts', right_on='Date', how='inner')
combined_data = combined_data.merge(New_likes_and_follows, left_on='Starts', right_on='Date', how='inner')

combined_data.head()

In [63]:

Out[63]:

   Campaign name                                      Ad Set Name                                    Ad name                                  Month                    Delivery status  Delivery level  Reach  Impressions  Frequency
0  SFJAZZ At Home | Retargeting | 23-24 Always On...  SFJAZZ | Watch Page Visitor 7dLB | West Coast  Video | SFJAZZ At Home Sizzle Reel 2023  2023-10-01 - 2023-10-05  active           ad              138    842          6.101449
   (rows 1–4 are identical to row 0)

5 rows × 38 columns


In [85]:

#Dropping unnecessary columns
columns_to_drop = [
    "Campaign name", "Ad Set Name", "Ad name",
    "Delivery level", "Attribution setting", "Result type", "Results",
    "Constituent ID", "Expiration Date", "Custom Category 01"
]

combined_data = combined_data.drop(columns=columns_to_drop)

In [86]:

combined_data.head(5)

Out[86]:

     Month                    Delivery status  Reach  Impressions  Frequency  Amount spent (USD)  Cost per result  Starts
592  2023-09-01 - 2023-09-30  active           494    5510         11.153846  192.61              64.203333        2023-07-28
     (rows 593–596 are identical to row 592)

5 rows × 31 columns


In [87]:

# Checking for missing values
missing_values = combined_data.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 Month                               0
 Delivery status                     0
 Reach                               0
 Impressions                         0
 Frequency                           0
 Amount spent (USD)                  0
 Cost per result                     0
 Starts                              0
 Ends                                0
 Link clicks                         0
 CPC (cost per link click)           0
 CTR (all)                           0
 CPM (cost per 1,000 impressions)    0
 Result rate                         0
 Clicks (all)                        0
 CPC (All)                           0
 Reporting starts                    0
 Reporting ends                      0
 Membership Organization             0
 Membership Level                    0
 Inception Date                      0
 Initiation Date                     0
 Date_x                              0
 Facebook reach                      0
 Date_y                              0
 Facebook Page likes                 0
 Date                                0
 New Facebook Page likes             0
 Membership Reg                      0
 Membership Reg Label                0
 Membership Reg Binary               0
dtype: int64

In [88]:

combined_data.shape

Out[88]: (184048, 31)

In [89]:

#dropping null values
combined_data = combined_data.dropna()

#Checking the shape of the dataset
combined_data.shape

Out[89]: (184048, 31)


Exploratory Data Analysis

In [90]:

#Summary Statistics
summary_stats = combined_data.describe()
summary_stats

Out[90]:

       Reach          Impressions    Frequency      Amount spent (USD)  Cost per result  Link clicks
count  184048.000000  184048.000000  184048.000000       184048.000000    184048.000000  184048.000000
mean    20190.627967   45421.637964       2.310083          434.256041        18.409868     436.720508
std     23907.713927   57929.976816       1.153889          504.985397        14.888930     564.666127
min        95.000000     513.000000       1.034668           17.270000         3.246533       6.000000
25%      3193.000000    5830.000000       1.697043           84.820000        10.326667      42.000000
50%     10078.000000   23722.000000       1.980823          281.780000        15.350000     207.000000
75%     31982.000000   64649.000000       2.652004          566.130000        21.165846     652.000000
max    107170.000000  304363.000000      11.826733         2372.460000       149.520000    2973.000000


In [91]:

# Distribution of 'Amount spent (USD)'
plt.figure(figsize=(12, 10))
sns.histplot(data=combined_data, x='Amount spent (USD)', bins=10, kde=True)
plt.title('Distribution of Amount spent (USD)')
plt.xlabel('Amount spent (USD)')
plt.ylabel('Frequency')
plt.show()


In [92]:

# Relationship between 'Amount spent (USD)' and 'Impressions'
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_data, x='Amount spent (USD)', y='Impressions')
plt.title('Relationship between Amount spent (USD) and Impressions')
plt.xlabel('Amount spent (USD)')
plt.ylabel('Impressions')
plt.show()


In [93]:

# Box plot of 'Delivery status' vs. 'Impressions'
plt.figure(figsize=(10, 6))
sns.boxplot(data=combined_data, x='Delivery status', y='Impressions')
plt.title('Delivery Status vs. Impressions')
plt.xlabel('Delivery status')
plt.ylabel('Impressions')
plt.show()


In [94]:

# Count of 'Delivery status'
plt.figure(figsize=(8, 5))
sns.countplot(data=combined_data, x='Delivery status')
plt.title('Count of Delivery Status')
plt.xlabel('Delivery status')
plt.ylabel('Count')
plt.show()


In [95]:

# Distribution of 'Link clicks' by 'Delivery status'
plt.figure(figsize=(10, 6))
sns.violinplot(data=combined_data, x='Delivery status', y='Link clicks')
plt.title('Distribution of Link clicks by Delivery status')
plt.xlabel('Delivery status')
plt.ylabel('Link clicks')
plt.show()


In [98]:

# Correlation heatmap of numeric variables
# (the column list is truncated in the original export; completed from the
#  variables discussed in the Figure 6 write-up)
numeric_data = combined_data[['Impressions', 'Amount spent (USD)', 'Link clicks',
                              'Facebook reach', 'Facebook Page likes',
                              'New Facebook Page likes']]
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


In [99]:

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import MaxNLocator

# Creating monthly data (the aggregation is truncated in the original export; sum assumed)
monthly_data = combined_data.groupby('Month')[['Impressions', 'Amount spent (USD)']].sum().reset_index()

# Setting up the figure and the first y-axis
fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.set_xlabel('Month')
ax1.set_ylabel('Impressions', color='tab:blue')
sns.lineplot(x='Month', y='Impressions', data=monthly_data, marker='o', color='tab:blue', ax=ax1)

# Creating the second y-axis
ax2 = ax1.twinx()
ax2.set_ylabel('Amount spent (USD)', color='tab:red')
sns.lineplot(x='Month', y='Amount spent (USD)', data=monthly_data, marker='o', color='tab:red', ax=ax2)

# Rotating x-axis labels to be vertical and reducing their frequency
ax1.xaxis.set_major_locator(MaxNLocator(nbins=10, integer=True))
plt.xticks(rotation=90)

# Showing the plot
plt.title('Monthly Trend of Impressions and Amount Spent')
plt.tight_layout()
plt.show()


In [100]:

# Pairplot of selected numeric variables
sns.pairplot(data=numeric_data, diag_kind='kde')
plt.show()


In [101]:

# Visualizing 'Facebook reach'
plt.figure(figsize=(10, 6))
sns.histplot(data=combined_data, x='Facebook reach', bins=20, kde=True)
plt.title('Distribution of Facebook Reach')
plt.xlabel('Facebook Reach')
plt.ylabel('Frequency')
plt.show()


In [102]:

# Visualizing 'Facebook Page likes'
plt.figure(figsize=(10, 6))
sns.histplot(data=combined_data, x='Facebook Page likes', bins=20, kde=True)
plt.title('Distribution of Facebook Page Likes')
plt.xlabel('Facebook Page Likes')
plt.ylabel('Frequency')
plt.show()


In [103]:

# Visualizing 'New Facebook Page likes'
plt.figure(figsize=(10, 6))
sns.histplot(data=combined_data, x='New Facebook Page likes', bins=20, kde=True)
plt.title('Distribution of New Facebook Page Likes')
plt.xlabel('New Facebook Page Likes')
plt.ylabel('Frequency')
plt.show()


In [104]:

# Box plot for Amount spent by Membership Organization
plt.figure(figsize=(12, 10))
sns.boxplot(data=combined_data, x='Membership Organization', y='Amount spent (USD)')
plt.title('Amount Spent by Membership Organization')
plt.show()


In [105]:

# Box plot for Link clicks by Membership Organization
plt.figure(figsize=(12, 10))
sns.boxplot(data=combined_data, x='Membership Organization', y='Link clicks')
plt.title('Link Clicks by Membership Organization')
plt.show()


In [106]:

# Converting "Initiation Date" to a datetime object
combined_data['Initiation Date'] = pd.to_datetime(combined_data['Initiation Date'])

# Grouping by day and counting the occurrences
initiation_date_grouped = combined_data.groupby('Initiation Date')['Initiation Date'].count()

# Displaying the result
print(initiation_date_grouped)

Initiation Date
2023-01-09     1440
2023-01-24     3536
2023-02-17     7280
2023-03-02     5520
2023-03-03     4800
2023-03-28     6000
2023-05-01    54208
2023-05-09      992
2023-05-31      416
2023-06-07     1168
2023-06-10     1248
2023-06-23    17632
2023-07-03     2720
2023-07-05     1344
2023-07-14     2560
2023-07-20     2240
2023-07-21     4800
2023-07-28     7104
2023-08-01    56952
2023-08-21      216
2023-09-01     1872
Name: Initiation Date, dtype: int64


In [107]:

# Grouping by day and counting the occurrences, then transforming it to create a new column
combined_data['Membership Reg'] = combined_data.groupby('Initiation Date')['Initiation Date'].transform('count')

# Displaying the updated DataFrame
combined_data.head()

Out[107]:

     Month                    Delivery status  Reach  Impressions  Frequency  Amount spent (USD)  Cost per result  Starts
592  2023-09-01 - 2023-09-30  active           494    5510         11.153846  192.61              64.203333        2023-07-28
     (rows 593–596 are identical to row 592)

5 rows × 31 columns


In [108]:

# Defining the threshold for "High" membership registration count
threshold = 20000

# Creating a new column "Membership Reg Label" based on the threshold
# (the exact condition is truncated in the original export; 'greater than threshold' assumed)
combined_data['Membership Reg Label'] = combined_data['Membership Reg'].apply(
    lambda x: 'High' if x > threshold else 'Low')

# Displaying the updated DataFrame
combined_data

Out[108]:

        Month                    Delivery status  Reach  Impressions  Frequency  Amount spent (USD)  Cost per result  Starts
592     2023-09-01 - 2023-09-30  active           494    5510         11.153846  192.61              64.203333        2023-07-28
        (rows 593–596 are identical to row 592)
...     ...
207475  2023-01-01 - 2023-01-31  not_delivering   4147   17250        4.159633   206.34              9.379091         2023-01-09
        (rows 207476–207479 are identical to row 207475)

184048 rows × 31 columns


In [109]:

# Mapping 'High' to 1 and 'Low' to 0 in a new column 'Membership Reg Binary'
combined_data['Membership Reg Binary'] = combined_data['Membership Reg Label'].map({'High': 1, 'Low': 0})

# Displaying the updated DataFrame
combined_data

Out[109]: (same preview as Out[108]; 184048 rows × 31 columns)


In [111]:

# Selecting the independent variables
X = [
    'Reach', 'Impressions', 'Frequency',
    'Amount spent (USD)',
    'CPC (cost per link click)',
    'CPM (cost per 1,000 impressions)', 'Clicks (all)',
    'Facebook reach', 'Facebook Page likes', 'New Facebook Page likes',
    'Membership Organization', 'Membership Level',
]

# Defining the dependent variable
y = 'Membership Reg Binary'

# Creating a new DataFrame with selected independent and dependent variables
selected_data = combined_data[X + [y]]

One Hot Encoding


In [113]:

# One-hot encoding of categorical variables
data = pd.get_dummies(selected_data, columns=['Membership Organization', 'Membership Level'])

# Separating the numerical and one-hot encoded features
numerical_features = ['Reach', 'Impressions', 'Frequency', 'Amount spent (USD)',
                      'CPC (cost per link click)', 'CPM (cost per 1,000 impressions)',
                      'Clicks (all)', 'Facebook reach', 'Facebook Page likes',
                      'New Facebook Page likes']
categorical_features = [col for col in data.columns if col not in numerical_features]

# Scaling the numerical features using StandardScaler
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])

# Displaying the encoded data
data.head()

Out[113]: (5 rows × 34 columns; preview of the standardized features for rows 592–596, e.g. Reach = -0.823863, Impressions = -0.688965)

Splitting the dataset

In [114]:

# Splitting the dataset into features (X) and target (y)
X = data.drop(columns=['Membership Reg Binary'])
y = data['Membership Reg Binary']

In [115]:

# Splitting the dataset into training and testing sets
# (the random_state value is truncated in the original export)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1. Logistic Regression Model


In [116]:

from sklearn.linear_model import LogisticRegression

# Create and fit the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Predict on the test set
logistic_predictions = logistic_model.predict(X_test)

# Calculate accuracy
accuracy_log = accuracy_score(y_test, logistic_predictions)

# Evaluate Logistic Regression
print("Accuracy: {:.2f}%".format(accuracy_log * 100))
print("Logistic Regression:")
print(classification_report(y_test, logistic_predictions))
print(confusion_matrix(y_test, logistic_predictions))

C:\Users\DELL\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy: 90.20%
Logistic Regression:
              precision    recall  f1-score   support

           0       0.91      0.84      0.87     14628
           1       0.90      0.94      0.92     22182

    accuracy                           0.90     36810
   macro avg       0.90      0.89      0.90     36810
weighted avg       0.90      0.90      0.90     36810

[[12285  2343]
 [ 1263 20919]]

2. Naive Bayes Model


In [118]:

# Initializing and training a Gaussian Naive Bayes classifier
naive_bayes_classifier = GaussianNB()
naive_bayes_classifier.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = naive_bayes_classifier.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)

# Creating a confusion matrix
confusion_matrix_result = confusion_matrix(y_test, y_pred)

# Generating a classification report
classification_report_result = classification_report(y_test, y_pred)

# Displaying the results
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Confusion Matrix:")
print(confusion_matrix_result)
print("Classification Report:")
print(classification_report_result)

Accuracy: 87.09%
Confusion Matrix:
[[10605  4023]
 [  728 21454]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.72      0.82     14628
           1       0.84      0.97      0.90     22182

    accuracy                           0.87     36810
   macro avg       0.89      0.85      0.86     36810
weighted avg       0.88      0.87      0.87     36810

3. K-Nearest Neighbour


In [119]:

# Initializing a KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Initializing SelectKBest with mutual information as the scoring function
selector = SelectKBest(score_func=mutual_info_classif, k='all')

# Fitting the selector to the training data
selector.fit(X_train, y_train)

# Getting feature scores
feature_scores = selector.scores_

# Getting the names of features and their scores
feature_names = X.columns
feature_scores_dict = dict(zip(feature_names, feature_scores))

# Sorting features by their scores
sorted_features = sorted(feature_scores_dict.items(), key=lambda x: x[1], reverse=True)

# Printing the sorted features and their scores
for feature, score in sorted_features:
    print(f"Feature: {feature}, Score: {score}")

# Training KNN with the selected features
selected_features = selector.transform(X_train)
knn_classifier.fit(selected_features, y_train)

# Making predictions on the testing set
selected_test_features = selector.transform(X_test)
y_pred = knn_classifier.predict(selected_test_features)

# Calculating accuracy with the selected features
accuracy = accuracy_score(y_test, y_pred)

# Displaying accuracy
print("Accuracy with selected features: {:.2f}%".format(accuracy * 100))

Feature: Facebook reach, Score: 0.6737807119570731
Feature: Amount spent (USD), Score: 0.6716786730812395
Feature: Frequency, Score: 0.6716515061813902
Feature: Impressions, Score: 0.6716413185939468
Feature: Reach, Score: 0.6716175475565788
Feature: CPM (cost per 1,000 impressions), Score: 0.6715632137568803
Feature: CPC (cost per link click), Score: 0.6715088799571819
Feature: Facebook Page likes, Score: 0.6664605683017847
Feature: Clicks (all), Score: 0.6265928324696309
Feature: New Facebook Page likes, Score: 0.5981146874141803
Feature: Membership Organization_Auto Renew Digital, Score: 0.14653950529041393
Feature: Membership Level_Monthly Digital, Score: 0.10963007117148704
Feature: Membership Organization_Auto Renew Core, Score: 0.09113822980519659
Feature: Membership Level_Annual Digital, Score: 0.03146238743828755
Feature: Membership Level_Supporter, Score: 0.0218728648286588
Feature: Membership Level_Contributor, Score: 0.01100593560702312
Feature: Membership Level_Senior, Score: 0.010553147368204119


Feature: Membership Organization_Digital Membership, Score: 0.004014567777799849
Feature: Membership Level_Presenter, Score: 0.003670884719077838
Feature: Membership Level_Student, Score: 0.003644116868627645
Feature: Membership Organization_Complimentary Membership, Score: 0.0034127659588549797
Feature: Membership Level_Benefactor, Score: 0.003381116702806608
Feature: Membership Organization_Individual Membership, Score: 0.0028656769728065967
Feature: Membership Level_Director, Score: 0.002366614215765006
Feature: Membership Level_Visionary, Score: 0.002102882854883248
Feature: Membership Level_Producer, Score: 0.0019864536065339333
Feature: Membership Level_Legend, Score: 0.000676206656551992
Feature: Membership Level_Artist, Score: 0.0006015666572929401
Feature: Membership Level_Emerging Patron, Score: 0.000439211408163942
Feature: Membership Level_One Month Digital, Score: 0.00019062877399766975
Feature: Membership Level_Master, Score: 0.0
Feature: Membership Level_Patron, Score: 0.0
Feature: Membership Level_Presenters Circle, Score: 0.0

Accuracy with selected features: 99.84%

In [ ]:


 

Methodology: Data Transformation and Merging for Comprehensive Analysis

In this section, I provide a detailed overview of the methodology employed to transform and merge multiple datasets into a single unified dataset for comprehensive analysis. The process involves converting the relevant columns to datetime objects and merging data from various sources. The datasets involved are 'jazz_membership', 'member_history', 'facebook_Reach', 'page_Profile_visits', and 'New_likes_and_follows'. The common columns used for merging are 'Starts', 'Date', and 'Initiation Date'.

1. Data Loading

The first step in the data transformation process is loading the individual datasets. The 'jazz_membership' and 'member_history' datasets are loaded directly from Excel files, while the 'facebook_Reach', 'page_Profile_visits', and 'New_likes_and_follows' datasets are loaded from CSV files. For the CSV files, the 'ISO-8859-1' character encoding is specified to ensure proper reading.
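A minimal sketch of this loading step, assuming the file and sheet names shown in the notebook:

    import pandas as pd

    # Excel sources
    jazz_membership = pd.read_excel('sf_jazz_membership_data.xlsx')
    member_history = pd.read_excel('Member_History.xlsx', sheet_name='MemHistory')

    # CSV sources; ISO-8859-1 avoids decode errors on non-UTF-8 characters
    facebook_Reach = pd.read_csv('Facebook_Reach.csv', encoding='ISO-8859-1')
    page_Profile_visits = pd.read_csv('page_Profile_visits.csv', encoding='ISO-8859-1')
    New_likes_and_follows = pd.read_csv('New_likes_and_follows.csv', encoding='ISO-8859-1')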

2. Changing Columns to Datetime Objects

Certain columns in the datasets that represent dates needed to be converted into datetime objects to enable time-based analysis. The following columns were converted (see the sketch after this list):

    • 'Starts' in the 'jazz_membership' dataset, specifying the start date of the membership.
    • 'Initiation Date' in the 'member_history' dataset, representing the date on which a constituent became a member.
    • 'Date' in the 'New_likes_and_follows' dataset, indicating the date of data collection.
    • 'Date' in the 'page_Profile_visits' dataset, denoting the date of profile visits.
    • 'Date' in the 'facebook_Reach' dataset, representing the date of data collection.
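A sketch of the conversion, assuming ISO-style dates (the exact format strings are truncated in the notebook export):

    jazz_membership["Starts"] = pd.to_datetime(jazz_membership["Starts"], format="%Y-%m-%d")
    member_history["Initiation Date"] = pd.to_datetime(member_history["Initiation Date"])
    # The three social-media datasets share a 'Date' column
    for df in (facebook_Reach, page_Profile_visits, New_likes_and_follows):
        df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")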

3. Data Merging

The merging process is fundamental for creating a comprehensive dataset that combines information from multiple sources, ensuring that related data points are aligned for further analysis. The 'jazz_membership' dataset is first merged with the 'member_history' dataset on a common key: 'Starts' from 'jazz_membership' and 'Initiation Date' from 'member_history'. This merge is executed as an inner join, meaning that only records whose initiation dates match membership start dates are included. The consolidated dataset is then merged with the 'facebook_Reach' dataset using its 'Date' column, again as an inner join so that only rows with matching dates are retained. The same process is repeated for the 'page_Profile_visits' and 'New_likes_and_follows' datasets, each merged on its respective 'Date' column using an inner join.
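As a sketch, the chained merges look roughly like this (keys as described above; note that pandas suffixes the duplicated 'Date' columns as 'Date_x' and 'Date_y', which is why those names appear in the missing-values check):

    combined_data = jazz_membership.merge(member_history, left_on='Starts', right_on='Initiation Date', how='inner')
    combined_data = combined_data.merge(facebook_Reach, left_on='Starts', right_on='Date', how='inner')
    combined_data = combined_data.merge(page_Profile_visits, left_on='Starts', right_on='Date', how='inner')
    combined_data = combined_data.merge(New_likes_and_follows, left_on='Starts', right_on='Date', how='inner')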

4. Data Cleaning           

Following the merging process, some columns are dropped from the dataset to remove irrelevant or redundant information. This step sharpens the dataset's focus on the variables that are essential for analysis. The columns "Campaign name", "Ad Set Name", "Ad name", "Delivery level", "Attribution setting", "Result type", "Results", "Constituent ID", "Expiration Date", and "Custom Category 01" are dropped. Null values were also checked, and rows with missing data were dropped. This is essential for ensuring data quality and preventing issues during analysis.
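A sketch of the cleaning step, using the column names listed above:

    columns_to_drop = [
        "Campaign name", "Ad Set Name", "Ad name", "Delivery level",
        "Attribution setting", "Result type", "Results",
        "Constituent ID", "Expiration Date", "Custom Category 01",
    ]
    combined_data = combined_data.drop(columns=columns_to_drop)
    combined_data = combined_data.dropna()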

 

5. Additional Date-Based Transformation

After merging, the 'Initiation Date' column is used to create a new variable, 'Membership Reg', by counting the occurrences of each initiation date. A threshold of 20,000 is defined to categorize 'Membership Reg' into 'High' or 'Low', recorded in a new column, 'Membership Reg Label'. Furthermore, another column, 'Membership Reg Binary', is introduced by mapping 'High' to 1 and 'Low' to 0 in the 'Membership Reg Label' column. This binary representation is often useful in statistical analysis.
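A sketch of this transformation, assuming a strict 'greater than' comparison at the threshold (the exact condition is truncated in the notebook export):

    # Per-row count of records sharing the same initiation date
    combined_data['Membership Reg'] = (
        combined_data.groupby('Initiation Date')['Initiation Date'].transform('count')
    )

    threshold = 20000
    combined_data['Membership Reg Label'] = combined_data['Membership Reg'].apply(
        lambda x: 'High' if x > threshold else 'Low')  # comparison operator assumed
    combined_data['Membership Reg Binary'] = combined_data['Membership Reg Label'].map(
        {'High': 1, 'Low': 0})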

6. Data Preprocessing

To ensure that the dataset is appropriately formatted and scaled for the machine learning model, a two-step data preprocessing approach was applied. This involved:

a.      One-Hot Encoding for Categorical Variables

Categorical variables, such as "Membership Organization" and "Membership Level," were transformed using one-hot encoding. This technique is crucial for converting categorical data into a numerical format that the machine learning model can understand. Each unique category within these variables was represented as binary columns, with "1" indicating the presence of a category and "0" denoting its absence. This process ensures that the model can effectively interpret and utilize categorical information.

b.     Standard Scaling for Numerical Features

The numerical features, including 'Reach,' 'Impressions,' 'Frequency,' 'Amount spent (USD),' 'CPC (cost per link click),' 'CPM (cost per 1,000 impressions),' 'Clicks (all),' 'Facebook reach,' 'Facebook Page likes,' and 'New Facebook Page likes,' were standardized using the StandardScaler. Standardization transforms numerical values to have a mean of 0 and a standard deviation of 1, ensuring that all numerical features are on a common scale. This step is important for preventing features with larger values from dominating the model's performance.
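Both preprocessing steps in one sketch, using pandas get_dummies for the encoding (the notebook imports OneHotEncoder and ColumnTransformer as well, but get_dummies is what the encoding cell applies):

    from sklearn.preprocessing import StandardScaler

    # One-hot encode the categorical variables
    data = pd.get_dummies(selected_data, columns=['Membership Organization', 'Membership Level'])

    # Standardize the numerical features to mean 0, standard deviation 1
    numerical_features = [
        'Reach', 'Impressions', 'Frequency', 'Amount spent (USD)',
        'CPC (cost per link click)', 'CPM (cost per 1,000 impressions)',
        'Clicks (all)', 'Facebook reach', 'Facebook Page likes',
        'New Facebook Page likes',
    ]
    scaler = StandardScaler()
    data[numerical_features] = scaler.fit_transform(data[numerical_features])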

Therefore, by applying one-hot encoding for categorical variables and standard scaling for numerical features, the dataset was effectively prepared for subsequent machine learning tasks. These preprocessing techniques contribute to the model's ability to make accurate predictions and capture meaningful patterns within the data.

7. Selection of Independent and Dependent Variables

The final step involves selecting the independent and dependent variables for analysis. The independent variables 'Reach', 'Impressions', 'Frequency', 'Amount spent (USD)', 'CPC (cost per link click)', 'CPM (cost per 1,000 impressions)', 'Clicks (all)', 'Facebook reach', 'Facebook Page likes', 'New Facebook Page likes', 'Membership Organization', and 'Membership Level' are chosen as candidate predictors of the dependent variable, 'Membership Reg Binary'.
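A minimal sketch mirroring the notebook's selection cell (the names feature_cols and target_col below are illustrative; the notebook uses X and y for the same lists):

    feature_cols = [
        'Reach', 'Impressions', 'Frequency', 'Amount spent (USD)',
        'CPC (cost per link click)', 'CPM (cost per 1,000 impressions)',
        'Clicks (all)', 'Facebook reach', 'Facebook Page likes',
        'New Facebook Page likes', 'Membership Organization', 'Membership Level',
    ]
    target_col = 'Membership Reg Binary'
    selected_data = combined_data[feature_cols + [target_col]]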

8. Splitting the Dataset

The dataset is split into training and testing sets using the 'train_test_split' function from the scikit-learn library. This division is crucial for modeling and evaluating the predictive power of the independent variables on membership registration.
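A sketch of the split, with an 80/20 ratio as in the notebook; the random seed value is truncated in the export, so the one below is illustrative:

    from sklearn.model_selection import train_test_split

    X = data.drop(columns=['Membership Reg Binary'])
    y = data['Membership Reg Binary']

    # 20% of rows held out for testing; seed value assumed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)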

Assumptions

  1. Data Quality. The assumptions made about data quality are that the provided datasets are accurate, complete, and free of significant errors. Any inconsistencies or missing data have been addressed during data preprocessing.
  2. Data Transformation. It is assumed that the data transformation processes, including datetime conversion and one-hot encoding, have been carried out accurately without errors.
  3. Model Suitability. The choice of machine learning models (Logistic Regression, Naïve Bayes, and K-Nearest Neighbors) is appropriate for the given dataset. The assumption is that these models are suitable for binary classification tasks.
  4. Independence of Observations. It is assumed that observations are independent, and the order of observations does not affect the results.
  5. Variable Importance. The variable importance scores are assumed to accurately reflect the influence of each feature on the outcome, and these scores guide the conclusions drawn.

Workflow Diagram

The workflow diagram shows the steps that will be taken from the start to the final prediction.


 

Exploratory Data Analysis

Figure 1: Distribution of Amount spent (USD)

Looking at Figure 1 above, it is evident that Amount spent is right-skewed. A right-skewed distribution, also known as positively skewed, is a type of probability distribution where the right tail (the larger values) is longer or extends further than the left tail (the smaller values) (Glen, 2022). Right skewness occurs when the majority of the data points have lower values but a few have exceptionally high values. In the case of "Amount spent (USD)," it is common to observe that most spending amounts are relatively low or moderate, while a few instances involve very high expenditures.

Figure 2: Relationship between Amount spent (USD) and Impressions

From Figure 2, there is a direct increasing relationship between Amount spent (USD) and Impressions: as the amount spent on campaign ads increases, the number of impressions also increases. This positive correlation suggests that investing more in advertising or promotions leads to broader exposure and visibility; allocating a higher budget results in reaching a larger audience and potentially more customers. This relationship reflects a fundamental principle of advertising, where a greater financial commitment can yield increased visibility, potentially resulting in more engagement, brand recognition, and better business outcomes.

Figure 3: Boxplot on Delivery Status vs. Impressions

From Figure 3 above, we can see that the "active" and "inactive" categories do not contain outliers. However, "not_delivering" contains a notable number of outliers, and "archived" has only one visible outlier. The "inactive" category has a higher median than the others. Looking at variability, "not_delivering" has greater variability because its whiskers are longer, while "active" and "archived" have lower variability. Additionally, the "inactive" category has a roughly symmetrical distribution because its box is evenly centered between the whiskers; the remaining categories are skewed because one whisker is longer than the other.

Figure 4: Count of Delivery Status

Figure 4 above shows the count of each Delivery Status: "not_delivering" registered the highest count, followed by "inactive" and "active," while "archived" has the lowest.

Figure 5: Distribution of Link clicks by Delivery status

Looking at Figure 5, "not_delivering" had the highest number of link clicks (over 3,000), followed by "archived" with around 1,500. "Inactive" had a relatively lower number of link clicks, while "active" registered the lowest. This suggests that ads categorized as "not_delivering" received substantial user engagement in terms of link clicks, which could be due to factors such as the ad content or audience targeting. The relatively lower count for "inactive" suggests that those ads did not perform as well in terms of user engagement, and the lowest count for "active" indicates that ads with an "active" status received the least engagement through link clicks.

Figure 6: Correlation heatmap of numeric variables

From Figure 6, there is a strong positive correlation among "Impressions," "Amount spent (USD)," and "Link clicks." This suggests that as the amount spent on advertising increases, both the number of impressions (how often an ad is displayed) and the number of link clicks tend to increase, which makes sense, as higher spending often leads to increased visibility and user interaction with ads. There is a relatively weak positive correlation among "Facebook reach," "Facebook Page likes," and "New Facebook Page likes," implying that an increase in Facebook reach (the number of users who see the content) is associated with a slight increase in both existing and new page likes; broader reach can contribute to a gradual rise in page likes. Furthermore, a very weak negative correlation is evident between the ad metrics ("Impressions," "Amount spent (USD)," "Link clicks") and the page metrics ("Facebook Page likes," "New Facebook Page likes"), meaning that as impressions and ad spending increase, there is a slight decrease in page likes and new page likes.

Figure 7: Monthly Trend of Impressions and Amount Spent

From Figure 7 above, both Impressions and Amount spent fluctuated between January 2023 and September 2023. Impressions peaked in May, while Amount spent peaked in August. Both metrics started low at the beginning of the period and later increased rapidly.

Figure 8: Pairplot of selected numeric variables

Figure 8 above shows right skewness alongside a direct relationship among the variables. This implies that the variables are positively skewed, with most data points having lower values, while a few outliers with higher values influence the direct relationship.

Figure 9: Three histograms showing the distributions of Facebook Page likes, Facebook reach, and new Facebook Page likes.

From Figure 9 above, it is evident that all three distributions are non-uniform.

Figure 10: Box plot for Amount spent by Membership Organization

From Figure 10 above, it is evident that every Membership Organization category contains outliers in "Amount spent." The "Auto Renew Core" and "Individual Membership" categories have the highest variability, while "Complimentary Membership," "Digital Membership," and "Auto Renew Digital" have lower variability, with "Complimentary Membership" the lowest.

 

Figure 11: Box plot for Link clicks by Membership Organization

From Figure 11, "Auto Renew Core" and "Individual Membership" have higher variability than the other categories because their whiskers are longer, while "Digital Membership" has the lowest variability because its whiskers are shorter. Additionally, "Digital Membership," "Auto Renew Digital," and "Complimentary Membership" contain many outliers.


 

Classification and Logistic Regression

Logistic Regression Model

Accuracy: 90.20%

Logistic Regression:

              precision    recall  f1-score   support

 

           0       0.91      0.84      0.87     14628

           1       0.90      0.94      0.92     22182

 

    accuracy                           0.90     36810

   macro avg       0.90      0.89      0.90     36810

weighted avg       0.90      0.90      0.90     36810

 

[[12285  2343]

 [ 1263 20919]]

 

The Logistic Regression model achieved an accuracy of 90.20%, which signifies the overall correctness of its predictions. This means that 90.20% of the instances in the dataset were correctly classified by the model. Precision is a metric that helps us understand the model's accuracy when it predicts the occurrence of "high membership" (class 1). For class 0, representing "low membership", the precision is 0.91, indicating that 91% of the instances predicted as "low membership" are indeed correct. Similarly, for class 1, the precision is 0.90, implying that 90% of the instances predicted as "high membership" are accurate.

Recall, also known as sensitivity or true positive rate, evaluates the model's ability to correctly identify all the actual instances of "high membership" (class 1). For class 0, the recall is 0.84, signifying that the model accurately identifies 84% of the instances in which there is "low membership." Conversely, for class 1, the recall is 0.94, indicating that the model correctly identifies 94% of the instances where there is "high membership." The support metric provides the number of instances in each class within the dataset. Specifically, there are 14,628 instances in class 0 and 22,182 instances in class 1.

The confusion matrix is a tabular representation that reveals the number of true positive, true negative, false positive, and false negative predictions. True Positives (TP) indicate that there are 20,919 instances correctly predicted as "high." True Negatives (TN) represent the 12,285 instances correctly predicted as "low." False Positives (FP) account for 2,343 instances incorrectly predicted as "high" when they are not. Lastly, False Negatives (FN) indicate that there are 1,263 instances incorrectly predicted as "low" when they should have been categorized as "high."
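These figures can be sanity-checked directly from the confusion matrix; the following worked example uses only the values reported above:

    # Confusion matrix layout from scikit-learn:
    # [[TN, FP],     [[12285,  2343],
    #  [FN, TP]]      [ 1263, 20919]]
    TN, FP, FN, TP = 12285, 2343, 1263, 20919

    accuracy = (TP + TN) / (TP + TN + FP + FN)  # 33204 / 36810 ≈ 0.9020
    precision_1 = TP / (TP + FP)                # 20919 / 23262 ≈ 0.90
    recall_1 = TP / (TP + FN)                   # 20919 / 22182 ≈ 0.94
    precision_0 = TN / (TN + FN)                # 12285 / 13548 ≈ 0.91
    recall_0 = TN / (TN + FP)                   # 12285 / 14628 ≈ 0.84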

Naïve Bayes model

Accuracy: 87.09%
Confusion Matrix:
[[10605  4023]
 [  728 21454]]
Classification Report:
              precision    recall  f1-score   support
 
           0       0.94      0.72      0.82     14628
           1       0.84      0.97      0.90     22182
 
    accuracy                           0.87     36810
   macro avg       0.89      0.85      0.86     36810
weighted avg       0.88      0.87      0.87     36810

 

The Naive Bayes model achieved an accuracy of 87.09%, which tells us that it correctly predicted the outcome for approximately 87.09% of the instances in the dataset. Looking at the confusion matrix, the model correctly predicted 21,454 instances as class 1 (true positives) and 10,605 instances as class 0 (true negatives). There were 4,023 class-0 instances incorrectly predicted as class 1 (false positives), and 728 class-1 instances mistakenly predicted as class 0 (false negatives).

For class 0, signifying "low," the precision is 0.94, indicating that 94% of the instances predicted as "low" were indeed correct. For class 1, the precision is 0.84, implying that 84% of the instances predicted as "high" are accurate. Recall, on the other hand, assesses the model's ability to correctly identify all the actual instances of class 1. For class 0, the recall is 0.72, indicating that the model accurately identifies 72% of instances of "low." On the other hand, for class 1, the recall is 0.97, suggesting that the model correctly identifies 97% of instances where there is "high membership."


 

K-Nearest Neighbor

The K-Nearest Neighbor model has an accuracy of 99.84%. The feature scores listed alongside it come from SelectKBest with mutual information (mutual_info_classif) computed on the training data; they rank how informative each feature is about the target, rather than being an importance measure produced by KNN itself. Higher scores suggest that the corresponding features carry more information about whether a user joins SF Jazz membership, and the scores are listed in descending order, so the most informative features appear at the top. Therefore, since K-Nearest Neighbor recorded the highest accuracy, this model, together with the mutual information ranking, is used to identify the factors that contribute to increased rates of joining SF Jazz membership.

Conclusion

Among the three machine learning models, K-Nearest Neighbor performed best. Therefore, based on the mutual information scores reported above, the factors that contribute to increased rates of joining SF Jazz membership include the following:

1. Facebook reach (Score: 0.6738). The number of people reached on Facebook appears to be the most influential factor; a higher reach on Facebook positively affects membership join rates.

2. Amount spent (USD) (Score: 0.6717). The amount spent on advertising or promotional activities is a key factor; higher spending positively influences membership join rates.

3. Frequency (Score: 0.6717). The frequency of interactions or ad displays is an important factor; higher frequency is associated with increased membership join rates.

4. Impressions (Score: 0.6716). The number of times ads or content were viewed plays a significant role; an increase in impressions is associated with a higher membership join rate.

5. Reach (Score: 0.6716). A broader reach, possibly across multiple channels, also contributes to increased membership join rates.

6. CPM (cost per 1,000 impressions) (Score: 0.6716). The cost per 1,000 impressions is another cost-related factor that contributes to membership join rates.

7. CPC (cost per link click) (Score: 0.6715). The cost per link click reflects the cost-effectiveness of the advertising strategy and likewise contributes to membership join rates.

8. Facebook Page likes (Score: 0.6665). The number of likes on the Facebook page indicates the page's popularity; an increase in page likes is associated with higher join rates.

9. Clicks (all) (Score: 0.6266). The total number of clicks on various links and content is a significant contributor to membership join rates.

10. New Facebook Page likes (Score: 0.5981). The number of new Facebook page likes acquired during the campaign is also a relevant factor.

Therefore, the model focuses heavily on digital marketing and online engagement metrics, particularly on social media (Facebook). As a result, to increase membership join rates, it is important to invest in strategies that enhance reach, engagement, and advertising spend, with a specific emphasis on Facebook.

 

 

 

 

 

 

 

 
