Bellabeat Case Study

About Bellabeat

Bellabeat is a company that manufactures wellness products for women. One of its most popular products is the Leaf, a wellness tracker that can monitor steps, sleep, stress, heart rate, and more. What sets the Leaf apart from other smart devices is its jewelry-like aesthetic. Tracker data can be accessed through the Bellabeat app after connecting the tracker to a smartphone.

Business task

Analyze another brand’s smart device usage data in order to inform Bellabeat’s future marketing strategy.

The data

The data used in this study is the “FitBit Fitness Tracker Data” set from Kaggle, made available by Mobius. It was collected through a survey distributed via Amazon’s Mechanical Turk. The sample size is not ideal – just 30 eligible FitBit users – but it is enough to get an idea of trends in smart device usage. Keeping the business task in mind, only 4 of the 18 files in the data set will be used in this study.

You can skip the next section and go straight to the Analyzing section if you are not interested in the cleaning process.

Cleaning and transforming the data

For this analysis, the R programming language seems the easiest and most efficient choice. The data is not large, but it is split across multiple spreadsheets, so any merging or cleaning can be done quickly in R.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(snakecase)
library(skimr)
daily_activity <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv",show_col_types = FALSE)
daily_sleep <- read_csv("Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv",show_col_types = FALSE)
hourly_steps <- read_csv("Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv",show_col_types = FALSE)
heart_rate <- read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv",show_col_types = FALSE)

I will start the cleaning process by checking the column names, to see whether they are spelled consistently, contain uppercase characters, or show any other irregularities.
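One way to do this quick inspection (using base R’s colnames() on the data frames loaded above; skimr’s skim_without_charts() or dplyr’s glimpse() would work just as well) is:

```r
# Inspect the column names of each data frame for casing or naming issues
colnames(daily_activity)
colnames(daily_sleep)
colnames(hourly_steps)
colnames(heart_rate)
```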

Next, it is worth confirming how many users are participating – whether it really is 30 for every data set. Checking the number of unique ids is enough for this.

n_unique(daily_activity$Id)
## [1] 33
n_unique(daily_sleep$Id)
## [1] 24
n_unique(hourly_steps$Id)
## [1] 33
n_unique(heart_rate$Id)
## [1] 14

The hourly_steps and daily_activity data sets have sufficient participants. The daily_sleep and heart_rate data sets have fewer than 30 participants, which is not ideal; however, for the sake of this study, the daily_sleep data set will still be used.
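Since daily_sleep will be kept despite its smaller sample, a quick sanity check (not strictly necessary) is to confirm that its 24 participants are a subset of the 33 in daily_activity:

```r
# TRUE would mean every sleep-tracking user also appears in the activity data
all(unique(daily_sleep$Id) %in% unique(daily_activity$Id))
```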

The column names are in camel case (e.g., ActivityDate), so converting them to snake case (e.g., activity_date) is necessary for consistency and readability.

names(daily_activity) <- to_snake_case(names(daily_activity))
names(daily_sleep) <- to_snake_case(names(daily_sleep))
names(hourly_steps) <- to_snake_case(names(hourly_steps))

Although the daily_sleep data set contains one record per day, its sleep_day column stores a full datetime; the time-of-day component carries no real information here, so it is better to change the column from datetime to just a date.

daily_sleep <- daily_sleep %>% 
  mutate(sleep_day= as_date(sleep_day, format= "%m/%d/%Y  %I:%M:%S %p"))

While at it, it is worth making the date formats consistent across all the data sets.

daily_activity <- daily_activity %>% 
  mutate(activity_date= as_date(activity_date, format= "%m/%d/%Y"))
hourly_steps <- hourly_steps %>% 
  mutate(activity_hour= as_datetime(activity_hour, format= "%m/%d/%Y  %I:%M:%S %p"))

The next step in the cleaning process is checking for, and removing, any missing values.

sum(is.na(daily_activity))
## [1] 0
sum(is.na(daily_sleep))
## [1] 0
sum(is.na(hourly_steps))
## [1] 0

According to the results, the data sets do not contain any missing values. Finally, it is time to check for duplicate rows.

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_steps))
## [1] 0

Since daily_sleep has 3 duplicate rows, they should be removed, and the removal verified.

daily_sleep <- distinct(daily_sleep)
sum(duplicated(daily_sleep))
## [1] 0

Now the daily_sleep data set also consists only of distinct rows. Adding the corresponding weekday for each date will help with the analysis later on.

daily_activity <- daily_activity %>% mutate( weekday = weekdays(activity_date))
daily_sleep <- daily_sleep %>%  mutate(weekday= weekdays(sleep_day))

Analyzing

Which days have the least/most amount of steps?

#plotting Total Number of Steps according to Weekdays
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", 
             "Friday", "Saturday", "Sunday")
ggplot(data=daily_activity, aes(x=weekday, y=total_steps)) + 
  geom_bar(stat="identity", fill='pink') + 
  labs(title='Total Number of Steps per Weekday') + 
  scale_x_discrete(limits = weekday) + 
  xlab("Weekdays") + ylab("Number of Steps")

By plotting the total number of steps per weekday, the most and least active recorded days can be identified. According to the figure above, Tuesday has the most total steps, making it the most active day, while Sunday has the fewest – not very surprising, since it is a rest day for most people. Monday and Friday follow Sunday. One caveat: the one-month window does not contain each weekday an equal number of times, so totals can be skewed toward weekdays that occur more often; the averages computed below account for this.
Based on these findings, users can be given extra motivation through the app to start the week on a more positive note, with increased activity. The same goes for Fridays: the end of the week can be tiring, but users can be encouraged to keep moving through game-like features such as achievements, or a notification showing them which percentage of users they fall under based on their activity level for the day.

#Calculating the average steps per weekday
average_weekday_activity <- daily_activity %>%
  group_by(weekday) %>% 
  summarise(average_steps_per_weekday = mean(total_steps))
#plotting the average steps per weekday
ggplot(data=average_weekday_activity, 
       aes(x=weekday, y=average_steps_per_weekday)) + 
  geom_bar(stat="identity",fill="pink") + 
  scale_x_discrete(limits = weekday) + 
  xlab("Weekdays") + ylab("Average Steps") + 
  labs(title="Average Steps per Weekday by All Users")

On average, users moved the most on Saturdays, followed by Tuesdays; Sunday is the day with the fewest steps on average.

Which days have the least/most amount of minutes slept?

#plotting Total Minutes Slept per Weekday
ggplot(data=daily_sleep, aes(x=weekday, y=total_minutes_asleep)) + 
  geom_bar(stat='identity',fill="pink") + 
  scale_x_discrete(limits = weekday) + 
  xlab("Weekdays") + ylab("Total Minutes Slept") + 
  labs(title="Total Minutes Slept per Weekday")

The plot above shows the total minutes slept by all users per weekday. Interestingly, Wednesday seems to be the day with the most minutes slept. The fewest total minutes slept are on Monday, which may well be due to anxiety over starting a new week, or to putting off sleep to make the precious Sunday night last longer. On these low-sleep days, the app could remind users of the benefits of sleep, and how much better they will feel, both in the morning and in the long run, with a consistent sleep schedule.

#Calculating average minutes slept per Weekday
average_weekday_sleep <- daily_sleep %>%
  group_by(weekday) %>% 
  summarise(average_sleep_per_weekday = mean(total_minutes_asleep))
#plotting average minutes slept per weekday
ggplot(data=average_weekday_sleep, aes(x=weekday, y=average_sleep_per_weekday)) + 
  geom_bar(stat="identity",fill="pink") + 
  scale_x_discrete(limits = weekday) + 
  xlab("Weekdays") + ylab("Average Sleep") + 
  labs(title="Average Minutes Slept per Weekday by All Users")

According to this graph, Sunday is the day with the most minutes slept on average, followed by Wednesday; the remaining days differ only slightly from one another.

Categorizing Users

Instead of analyzing every single user individually, the users can be grouped into five categories based on their average daily steps. According to this article, the classification can be done as follows:

  • <5,000 steps/day: “sedentary”
  • 5,000–7,499 steps/day: “low active”
  • 7,500–9,999 steps/day: “somewhat active”
  • 10,000–12,499 steps/day: “active”
  • ≥12,500 steps/day: “highly active”
user_type <- daily_activity %>% 
  group_by(id) %>% 
  summarise(average_steps = mean(total_steps)) %>% 
  mutate(user_type = case_when(
    # conditions are evaluated in order, so each category only needs an upper bound;
    # this also avoids leaving gaps between the ranges
    average_steps < 5000 ~ "sedentary",
    average_steps < 7500 ~ "low active",
    average_steps < 10000 ~ "somewhat active",
    average_steps < 12500 ~ "active",
    TRUE ~ "highly active"))
daily_activity <- merge(x = daily_activity, y = user_type, by = "id")
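As a quick extra summary (not part of the original classification step), the distribution of users across these categories can be checked with dplyr’s count():

```r
# Number of users falling into each activity category
user_type %>% count(user_type, sort = TRUE)
```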

Which hour of the day is the least/most active?

The activity_hour column needs to be split into separate date and time columns before analyzing the hourly steps.

hourly_steps <- hourly_steps %>% 
  separate(activity_hour, into = c("date", "time"), sep= " ")

The code below calculates the average steps per hour of the day across all users.

hourly_steps_average <- hourly_steps %>% 
  group_by(time) %>% 
  summarise(average_steps = mean(step_total))

Plotting these averages produces the graph below.

ggplot(hourly_steps_average, aes(x=time, y=average_steps, fill=average_steps)) + 
  geom_col() + theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Time") + ylab("Average Steps") 

According to this graph, the hour in which people move the most on average is 6 pm, followed by 7 pm. This can easily be explained by commuting or after-work walks. The other hours of the day with high activity are between 12 and 2 pm, when most people have lunch breaks.
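Rather than reading the peaks off the plot by eye, they can also be extracted directly (a small check, assuming hourly_steps_average as computed above):

```r
# The three hours with the highest average step counts
hourly_steps_average %>% slice_max(average_steps, n = 3)
```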

Summary

According to my findings, there is a clear pattern to smart device usage. Even if there are some outliers, on average people seem to be following the same pattern.
My recommendation based on this analysis is to use the app and the smart device to send notifications throughout the day and the week at times when people do not seem to be moving much. In-app rewards or competitions would also encourage people to move more and would increase their usage time.
Keeping users engaged is easier if they have something to come back for, something to log on to every day. Daily streaks are a great way to ensure this. Once that commitment is made and a community forms in the app, more and more people may want to join a community that promotes healthy living and wellness.