Methodology: Data Transformation and Merging for Comprehensive Analysis
In this section, I will provide a detailed overview of the methodology employed to transform and merge multiple datasets, creating a unified dataset for comprehensive analysis. The process involves changing relevant columns to datetime objects and merging data from various sources. The datasets involved in this process are 'jazz_membership,' 'member_history,' 'facebook_Reach,' 'page_Profile_visits,' and 'New_likes_and_follows.' The common columns used for merging are 'Starts', ‘Date’ and 'Initiation Date.'
1. Data Loading
The first step in the data transformation process is loading the individual datasets. The 'jazz_membership' ‘'member_history’ datasets are loaded directly from an Excel file, while the other datasets 'facebook_Reach,' 'page_Profile_visits,' and 'New_likes_and_follows' are loaded from CSV files. In the case of CSV files, specific character encoding, 'ISO-8859-1,' is specified to ensure proper reading.
2. Changing Columns to Datetime Objects
Certain columns in the datasets that represent dates needed to be converted into datetime objects to enable time-based analysis. The following columns were changed:
- 'Starts' in the 'jazz_membership' dataset, specifying the start date of the membership.
- 'Initiation Date' in the 'member_history' dataset, representing the date of initiation-becoming a member date.
- 'Date' in the 'New_likes_and_follows' dataset, indicating the date of data collection.
- 'Date' in the 'page_Profile_visits' dataset, denoting the date of profile visits.
- 'Date' in the 'facebook_Reach' dataset, representing the date of data collection.
3. Data Merging
The merging process is fundamental for creating a comprehensive dataset that combines information from multiple sources. This ensures that related data points are aligned for further analysis. The 'jazz_membership' dataset is initially merged with the 'member_history' dataset based on a common column, 'Starts' from 'jazz_membership,' and 'Initiation Date' from 'member_history.' This merging operation is executed as an inner join, meaning that only records with matching dates of initiation and membership start dates are included. Subsequently, the consolidated dataset is merged with the 'facebook_Reach' dataset using the common column 'Date' from 'facebook_Reach.' Again, an inner join is performed to retain only rows with matching dates. The same process is repeated for the 'page_Profile_visits' and 'New_likes_and_follows' datasets. Each is merged with the existing dataset based on their respective date columns ('Date') using inner joins.
4. Data Cleaning
Following the merging process, some columns are dropped from the dataset to remove irrelevant or redundant information. This step enhances the dataset's clarity and focus on the variables that are essential for analysis. Columns such as "Campaign name," "Ad Set Name," "Ad name," "Delivery level," "Campaign name", "Attribution setting", "Result type", "Results", "Constituent ID", "Expiration Date", and "Custom Category 01" are dropped from the dataset. Additionally, null values were also checked and quite a number of variables were having missing values. This was dealt with by dropping rows with missing data. This is essential to ensure data quality and to prevent issues during analysis.
5. Additional Date-Based Transformation
After merging, the 'Initiation Date' column is transformed to create a new variable, 'Membership Reg.' This transformation involves counting the occurrences of initiation dates for each day.A threshold of 20,000 is defined to categorize 'Membership Reg' into 'High' or 'Low,' and a new column, 'Membership Reg Label,' is created based on this threshold. Furthermore, another column, 'Membership Reg Binary,' is introduced by mapping 'High' to 1 and 'Low' to 0 in the 'Membership Reg Label' column. This binary representation is often useful in statistical analysis.
6. Data Preprocessing
To ensure that the dataset is appropriately formatted and scaled for the machine learning model, a two-step data preprocessing approach was applied. This involved:
a. One-Hot Encoding for Categorical Variables
Categorical variables, such as "Membership Organization" and "Membership Level," were transformed using one-hot encoding. This technique is crucial for converting categorical data into a numerical format that the machine learning model can understand. Each unique category within these variables was represented as binary columns, with "1" indicating the presence of a category and "0" denoting its absence. This process ensures that the model can effectively interpret and utilize categorical information.
b. Standard Scaling for Numerical Features
The numerical features, including 'Reach,' 'Impressions,' 'Frequency,' 'Amount spent (USD),' 'CPC (cost per link click),' 'CPM (cost per 1,000 impressions),' 'Clicks (all),' 'Facebook reach,' 'Facebook Page likes,' and 'New Facebook Page likes,' were standardized using the StandardScaler. Standardization transforms numerical values to have a mean of 0 and a standard deviation of 1, ensuring that all numerical features are on a common scale. This step is important for preventing features with larger values from dominating the model's performance.
Therefore, by applying one-hot encoding for categorical variables and standard scaling for numerical features, the dataset was effectively prepared for subsequent machine learning tasks. These preprocessing techniques contribute to the model's ability to make accurate predictions and capture meaningful patterns within the data.
7. Selection of Independent and Dependent Variables
The final step involves selecting the independent and dependent variables for analysis. Independent variables, such as 'Reach’, ‘Impressions', 'Frequency’, ‘Amount spent (USD)','CPC (cost per link click)','CPM (cost per 1,000 impressions)','Clicks (all)','Facebook reach’, ‘Facebook Page likes’, ‘New Facebook Page likes’, ‘Membership Organization’, and ‘Membership Level’ are chosen to influence the dependent variable, 'Membership Reg Binary.'
8. Splitting the Dataset
The dataset is split into training and testing sets using the 'train_test_split' function from the scikit-learn library. This division is crucial for modeling and evaluating the predictive power of the independent variables on membership registration.
Assumptions
- Data Quality. The assumptions made about data quality are that the provided datasets are accurate, complete, and free of significant errors. Any inconsistencies or missing data have been addressed during data preprocessing.
- Data Transformation. It is assumed that the data transformation processes, including datetime conversion and one-hot encoding, have been carried out accurately without errors.
- Model Suitability. The choice of machine learning models-Logistic Regression, Naïve Bayes, and K-Nearest Neighbor is appropriate for the given dataset. The assumption is that these models are suitable for binary classification tasks.
- Independence of Observations. It is assumed that observations are independent, and the order of observations does not affect the results.
- Variable Importance. The variable importance scores are assumed to accurately reflect the influence of each feature on the outcome, and these scores guide the conclusions drawn.
Workflow Diagram
The workflow diagram shows the steps that will be taken from the start to the final prediction.
Exploratory Data Analysis
Figure 1: Distribution of Amount spent (USD)
Looking at figure 1 above, it is evident that the Amount spent is right skewed. A right-skewed distribution, also known as positively skewed, is a type of probability distribution where the right tail (the larger values) is longer or extends further than the left tail (the smaller values) (Glen, 2022). Right skewness occurs when the majority of the data points have lower values, but a few data points have exceptionally high values. In the case of "Amount spent in USD," it's common to observe that most spending amounts are relatively low or moderate, but some instances are involving very high expenditures.
Figure 2: Relationship between Amount spent (USD) and Impressions
From figure 2, there is a direct increasing relationship between Amount Spent (USD) and impressions. The direct increasing relationship between "Amount spent (USD)" and "Impressions" implies that as the amount spent on a campaign ads increase, the number of impressions also increases. This positive correlation suggests that investing more in advertising or promotions leads to broader exposure and visibility. It means that allocating a higher budget result in reaching a larger audience and potentially more customers. This relationship explains the fundamental principle of advertising, where a greater financial commitment can yield increased visibility, potentially resulting in more engagement, brand recognition, and better business outcomes.
Figure 3: Boxplot on Delivery Status vs. Impressions
From figure 3 above, we can see that features such as “active” and “inactive” do not contain outliers. However. “not_delivering” contains notable number of outliers with “archived” having only 1 visible outlier. Additionally, “inactive” feature has a higher median compared to other features. Looking at the variability of the features, “not_delivering” has a greater variability because the whisker is longer. However, “active” and “archive” has a lower variability. Additionally, we can see that the feature “inactive” has a symmetrical distribution because the box is evenly centered between the whiskers. The other remaining features are considered to be skewed because one whisker is longer than the other.
Figure 4: Count of Delivery Status
Figure 4 above shows count of Delivery Status and we can see that “not_delivering” registered the highest number, followed by “inactive” and “active” while “archived” has the lowest number of count.
Figure 5: Distribution of Link clicks by Delivery status
Looking at figure 5, “not_delivering” had the highest number of link clicks of 3000 plus, followed by “archived: which had around 1500 link clicks. Additionally, “inactive: had a relatively lower number of link clicks while “active” registered the lowest number of link clicks. This suggests that when a delivery status is categorized as "not_delivering," it received a substantial amount of user engagement in terms of link clicks. This could be due to various factors, such as the content of the ad or the targeting of the audience. Moreover, "inactive" having a relatively lower number of link clicks suggests that ads categorized as "inactive" did not perform as well in terms of user engagement. Also, "active" registered the lowest number of link clicks, which indicates that ads with an "active" status did not receive a substantial amount of user engagement through link clicks.
Figure 6: Correlation heatmap of numeric variables
From figure 6, there is a strong positive correlation between “impressions”, “Amount spent (USD)”, and “Link Clicks”. This suggests that as the amount spent on advertising increases, both the number of impressions (how often an ad is displayed) and the number of link clicks also tend to increase. This relationship makes sense, as higher spending often leads to increased visibility and user interaction with ads. On the other hand, there is a relatively weak positive correlation between “Facebook reach”, “Facebook Page Likes” and “New Facebook page likes.” This implies that an increase in Facebook reach (the number of users who see the content) is associated with a slight increase in both Facebook page likes, and new page likes. This correlation indicates that broader reach can contribute to a gradual rise in page likes. Furthermore, a very weak negative correlation is evident between “impressions”, “Amount spent (USD)”, and “Link Clicks” and “Facebook Page Likes” and “New Facebook page likes.” This means that as the number of impressions and the amount spent on ads increase, there is a slight decrease in the number of Facebook page likes and new page likes.
Figure 7: Monthly Trend of Impressions and Amount Spent
From figure 7 above, there has been a fluctuating trend for both “impressions and amount spent between January 2023 to September 2023. The month in which impression was highest was in May while the month in which Amount spent was highest was in August. There was a low trend at the beginning of the month which later increased rapidly for both “impressions” and “Amount spent (USD).”
Figure 8: Pairplot of selected numeric variables
From figure 8 above, we can see right skewness in a pair plot alongside a direct relationship among the variables, it implies that one of the variables is positively skewed, with most data points having lower values. Also, there are some outliers with higher values, which influence the direct relationship.
Figure 9: Three different histograms showing the distribution of Facebook page likes, facebook reach, and new facebook page likes.
From figure 9 above, it is evident that both distributions are non-uniform.
Figure 10: Box plot for Amount spent by Membership Organization
From figure 10 above, it is evident that all the categories of membership organization contain outliers when it comes to “Amount spent.” Also, the categories “Auto Renew Core” and “individual membership” have the highest variability while “Complimentary Membership”, “Digital Membership” and “Auto Renew Digital” have lower variability with “Complimentary Membership” having the lowest variability.
Figure 11: Box plot for Link clicks by Membership Organization
From figure 11, we can see that “Auto Renew Core” and “Individual Membership” have a higher variability compared to other features because they have a longer whisker. However, “Digital membership” has the lowest variability because it has a shorter whisker. Additionally, the features “Digital membership”, “Auto Renew Digital” and “Complimentary Membership” contain many outliers.
Classification and Logistic Regression
Logistic regression Model
Accuracy: 90.20%
Logistic Regression:
precision recall f1-score support
0 0.91 0.84 0.87 14628
1 0.90 0.94 0.92 22182
accuracy 0.90 36810
macro avg 0.90 0.89 0.90 36810
weighted avg 0.90 0.90 0.90 36810
[[12285 2343]
[ 1263 20919]]
The Logistic Regression model achieved an accuracy of 90.20%, which signifies the overall correctness of its predictions. This means that 90.20% of the instances in the dataset were correctly classified by the model. Precision is a metric that helps us understand the model's accuracy when it predicts the occurrence of "high membership" (class 1). For class 0, representing "low membership", the precision is 0.91, indicating that 91% of the instances predicted as "low membership" are indeed correct. Similarly, for class 1, the precision is 0.90, implying that 90% of the instances predicted as "high membership" are accurate.
Recall, also known as sensitivity or true positive rate, evaluates the model's ability to correctly identify all the actual instances of "high membership" (class 1). For class 0, the recall is 0.84, signifying that the model accurately identifies 84% of the instances in which there is "low membership." Conversely, for class 1, the recall is 0.94, indicating that the model correctly identifies 94% of the instances where there is "high membership." The support metric provides the number of instances in each class within the dataset. Specifically, there are 14,628 instances in class 0 and 22,182 instances in class 1.
The confusion matrix is a tabular representation that reveals the number of true positive, true negative, false positive, and false negative predictions. True Positives (TP) indicate that there are 20,919 instances correctly predicted as "high." True Negatives (TN) represent the 12,285 instances correctly predicted as "low." False Positives (FP) account for 2,343 instances incorrectly predicted as "high" when they are not. Lastly, False Negatives (FN) indicate that there are 1,263 instances incorrectly predicted as "low" when they should have been categorized as "high."
Naïve Bayes model
Accuracy: 87.09%
Confusion Matrix:
[[10605 4023]
[ 728 21454]]
Classification Report:
precision recall f1-score support
0 0.94 0.72 0.82 14628
1 0.84 0.97 0.90 22182
accuracy 0.87 36810
macro avg 0.89 0.85 0.86 36810
weighted avg 0.88 0.87 0.87 36810
The Naive Bayes model achieved an accuracy of 87.09%, which tells us that it correctly predicted the outcome for approximately 87.09% of the instances in the dataset. Looking at the confusion matrix, in TP, the model correctly predicted 21,454 instances as "class 0." For TN, the model made accurate predictions of 10,605 instances as "class 0." Also, for FP, there were 4,023 instances where the model incorrectly predicted "class 1". Moreover, for FN, there are 728 instances, that the model mistakenly predicted "class 0"
For class 0, signifying "low," the precision is 0.94, indicating that 94% of the instances predicted as "low" were indeed correct. For class 1, the precision is 0.84, implying that 84% of the instances predicted as "high" are accurate. Recall, on the other hand, assesses the model's ability to correctly identify all the actual instances of class 1. For class 0, the recall is 0.72, indicating that the model accurately identifies 72% of instances of "low." On the other hand, for class 1, the recall is 0.97, suggesting that the model correctly identifies 97% of instances where there is "high membership."
K-Nearest neighbor
The K-Nearest neighbor has an accuracy of 99.84%. Additionally, the output provides feature importance scores from a K-Nearest Neighbor model used for predicting membership join rates with a very high accuracy of 99.84%. The feature importance scores indicate the extent to which each feature contributes to the model's predictions. For instance, higher feature importance scores suggest that the corresponding features have a greater impact on the likelihood of a user joining SF Jazz membership. The feature importance scores are in descending order, so the most important features are listed at the top. Therefore, since K-Nearest Neighbor recorded the highest accuracy, we will use this model to identify the factors that contribute to increased rates of joining SF Jazz membership.
Conclusion
Among the three-machine learning model, K-Nearest Neighbor performed better than other models. Therefore, based on the K-Nearest Neighbor model feature importance scores output above, the factors that contribute to increased rates of joining SF Jazz membership include the following:
1. Facebook reach (Score: 0.6736). The number of people who were reached on Facebook appears to be the most influential factor. For instance, a higher reach on Facebook positively affects membership join rates.
2. Reach (Score: 0.6717). A broader reach, possibly across multiple channels, is the second most important factor contributing to increased membership join rates.
3. Amount spent (USD) (Score: 0.6717). The amount spent on advertising or promotional activities is also a key factor. For instance, a higher spending positively influence membership join rates.
4. Impressions (Score: 0.6716). The number of times ads or content were viewed (impressions) plays a significant role in increasing membership join rates. An increase in the number of impressions increased the membership join rate.
5. CPM (cost per 1,000 impressions) (Score: 0.6716). The cost per 1,000 impressions is another cost-related factor that contributes to membership join rates.
6. Frequency (Score: 0.6715). The frequency of interactions or ad displays is an important factor in membership join rate. Higher frequency is associated with increased membership join rates.
7. CPC (cost per link click) (Score: 0.6715). The cost per link click is related to the cost-effectiveness of the advertising strategy. This increases the number of memberships join rates.
8. Facebook Page likes (Score: 0.6665). The number of likes on the Facebook page is an indicator of the page's popularity and influence on membership join rates. An increase in the number of pages likes increases the membership joining rates.
9. Clicks (all) (Score: 0.6271): The total number of clicks on various links and content is a significant contributor to membership join rates.
10. New Facebook Page likes (Score: 0.5985): The number of new Facebook page likes acquired during the campaign is also a relevant factor.
Therefore, the model seems to focus heavily on digital marketing and online engagement metrics, particularly on social media-Facebook. As a result, to increase membership join rates, it is important to invest in strategies that enhance reach, engagement, and advertisement spending, with a specific emphasis on Facebook.
if you need this contact me at ochiengfelix@gmail.com
No comments:
Post a Comment