Introduction

Finding the perfect name is a topic of discussion for nearly every couple naming a child. Some names have always been popular, while others pass through fads and trends. I want to look into different metrics and statistics for names and analyze the difference between male and female name trends.

I want to study the proportion of babies each year whose names are among the 100 and 500 most popular, how that proportion changes over time, and what it could mean, along with other metrics related to naming styles.

One thing I am very interested in is fad names. I often notice that some names are no longer given to babies in the current generation. Take the name Shirley: it is now often considered a grandmother’s name, one that few people in my parents’ generation or my own have. Names like Mary and Elizabeth, on the other hand, seem to be common among girls of many different generations, and John and Michael seem to have transcended generations completely. This suggests that Shirley, and others like it, were fad names in my grandparents’ generation. I want to view the distribution of fad versus timeless names among girls and boys in the US across time and see whether they differ, since it seems to me that female names are more “generational” than male names.

I would also like to see how the diversity of name choices has changed over time.

My Data

The data I am using comes from the Social Security Administration’s Baby Names data. The entities are name frequencies per year, and each row has the attributes Year, Sex, Name, and Frequency. The data begins in 1910; each time a family registers a child through Social Security, the child is added to the database for that year.

Note that many people born before 1937 never applied for a Social Security card, so their names are not included. This could affect the data from before 1937, which is something to keep in mind when tracking name metrics over time.

Immediately I have to make a choice between two datasets. There is the compiled dataset mentioned above, which includes only year, sex, name, and frequency. Alternatively, there is a dataset I could scrape directly from the Social Security site, which includes the frequencies per year per state. The drawback of the state data is that names with a frequency below 5 are omitted to protect the privacy of those individuals.

I ultimately chose to forgo the state information and use the dataset with all names included, in the hope that it will more accurately reflect US naming proportions.

For all the plots, the blue lines represent the male data, and the red lines represent the female data.

Part 1 - Data Curation

library(dplyr)    # data manipulation verbs and the %>% pipe
library(ggplot2)  # plotting

names_df <- read.csv("usa_names.csv", colClasses=c("numeric", "numeric", "character", "character", "numeric") )

names_df <- names_df %>%
  select(year, gender, name, number) %>%
  arrange(year, name)
head(names_df)
##   year gender    name number
## 1 1910      M   Aaron    111
## 2 1910      F   Abbie     28
## 3 1910      M     Abe     31
## 4 1910      M   Abner     12
## 5 1910      M Abraham    138
## 6 1910      M   Abram     10

Part 2 - Exploratory Data Analysis

Questions to Contemplate and Analyze:

  1. What proportion of babies each year have names in the top 100? In the top 500? How does this differ between males and females?
  2. How does the diversity of male vs. female names, measured by the number of unique names, differ over time?
  3. What fad/trend names can we find? What metrics can we use to find fad names?
  4. Do fad names occur more in males or females?

Question 1

  1. What proportion of babies each year have names in the top 100? In the top 500? How does this differ between males and females? To answer this, we compute and graph the share of births accounted for by the top 100 and top 500 names in each year.
names_top_100_male <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:100)

names_top_500_male <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:500)

names_df_total_births_per_year_male <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>% 
  summarise(total_births = sum(number))

names_top_100_female <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:100)

names_top_500_female <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:500)

names_df_total_births_per_year_female <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>% 
  summarise(total_births = sum(number))


# Above we have the top 100 and top 500 male and female names per year. Dividing each name's count by the total
# births for that year gives its share; summing those shares gives the proportion of babies with a top-100 name.

prop_male <- names_top_100_male %>% inner_join(names_df_total_births_per_year_male, by="year") %>%
  mutate(prop = number / total_births)
prop_female <- names_top_100_female %>% inner_join(names_df_total_births_per_year_female, by="year") %>%
  mutate(prop = number / total_births)

prop_male_500 <- names_top_500_male %>% inner_join(names_df_total_births_per_year_male, by="year") %>%
  mutate(prop = number / total_births)
prop_female_500 <- names_top_500_female %>% inner_join(names_df_total_births_per_year_female, by="year") %>%
  mutate(prop = number / total_births)
# These proportion tables are extremely valuable: they give the share of all births in a year that each top-100 (or top-500) name represents.
head(prop_male)
## # A tibble: 6 x 6
## # Groups:   year [1]
##    year gender name    number total_births   prop
##   <dbl> <chr>  <chr>    <dbl>        <dbl>  <dbl>
## 1  1910 M      John     11450       164228 0.0697
## 2  1910 M      James     9192       164228 0.0560
## 3  1910 M      William   8844       164228 0.0539
## 4  1910 M      Robert    5609       164228 0.0342
## 5  1910 M      George    5441       164228 0.0331
## 6  1910 M      Joseph    5226       164228 0.0318
head(prop_female)
## # A tibble: 6 x 6
## # Groups:   year [1]
##    year gender name     number total_births   prop
##   <dbl> <chr>  <chr>     <dbl>        <dbl>  <dbl>
## 1  1910 F      Mary      22848       352087 0.0649
## 2  1910 F      Helen     10479       352087 0.0298
## 3  1910 F      Margaret   8222       352087 0.0234
## 4  1910 F      Dorothy    7314       352087 0.0208
## 5  1910 F      Ruth       7209       352087 0.0205
## 6  1910 F      Anna       6433       352087 0.0183
prop_year_100_male <- prop_male %>%
  group_by(year) %>%
  summarise(prop_100 = sum(prop))

prop_year_100_female <- prop_female %>%
  group_by(year) %>%
  summarise(prop_100 = sum(prop))

  ggplot() +
  geom_line(data=prop_year_100_male, aes(x=year, y=prop_100), color="Blue") +
  geom_line(data=prop_year_100_female, aes(x=year, y=prop_100), color="Red") +
  labs(title="Proportion of Babies With Names in the Top 100 Most Popular Names Per Year",
         x = "Year",
         y = "Proportion")

1a. Top 100 Names: We can now answer our first question. The graph above shows that before the 1950s (mostly the pre-WWII era), a significantly higher proportion of babies of each gender received one of the 100 most popular names. There is a significant dip in 1950 for both male and female names, and another strong dip in 1986-1987 for both. It seems that 1910 and 1945 were the peaks of conformity in naming children.

Fewer babies receiving one of the 100 most popular names points to potentially higher diversity in names, which will be interesting to explore.

We also notice that throughout the time frame (1910-2016), males fairly consistently have around a 10% higher proportion of babies with top-100 names than females. I would like to look further into the difference between male and female name homogeneity, as well as why the proportion of parents choosing the most popular names is declining.
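As a compact illustration of the top-N share computed above, here is a minimal base R sketch on invented toy data. The helper `top_n_share` and the toy counts are hypothetical and are not part of the original analysis; the real pipeline uses dplyr on `names_df`.

```r
# Toy data in the same shape as names_df (all values invented for illustration).
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  gender = "M",
  name   = c("A", "B", "C", "D", "A", "B", "C", "D"),
  number = c(50, 30, 15, 5, 40, 30, 20, 10)
)

# Hypothetical helper: share of one year's births that go to the n most common names.
top_n_share <- function(counts, n) {
  sum(head(sort(counts, decreasing = TRUE), n)) / sum(counts)
}

# Apply the helper within each year (top 2 here instead of top 100).
share_top2 <- tapply(toy$number, toy$year, top_n_share, n = 2)
print(share_top2)
```

In this toy example the top-2 share falls from 0.8 to 0.7 even though the same two names stay on top, which is the kind of decline the plot shows for the top 100.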

prop_year_500_male <- prop_male_500 %>%
  group_by(year) %>%
  summarise(prop_500 = sum(prop))

prop_year_500_female <- prop_female_500 %>%
  group_by(year) %>%
  summarise(prop_500 = sum(prop))

  ggplot() +
  geom_line(data=prop_year_500_male, mapping = aes(x=year, y=prop_500), color="Blue") +
  geom_line(data=prop_year_500_female, mapping = aes(x=year, y=prop_500), color="Red") +
  labs(title="Proportion of Babies With Names in the Top 500 Most Popular Names Per Year",
         x = "Year",
         y = "Proportion")

1b. Top 500 Names: The top 500 name proportions are more interesting to me. In 1910, roughly 98-99% of all babies of each gender shared the same 500 names. There was a small downtrend until 1925, then a small upward trend until about 1950, after which the proportion has decreased significantly, following what looks like an exponential (or at least non-linear) downward curve. One takeaway could be that, for both male and female babies, names are becoming more diverse at an accelerating rate.

It could also indicate that a larger number of names are becoming popular. Imagine there were only 500 popular names in the past: if babies were given only these names, then 100% of the population would have names in the top 500 for that year. But if there were 1000 popular names with a roughly uniform distribution among them, then only around 50% of the population would have a top-500 name. So another possible explanation is that the pool of names has grown over time, and parents are choosing from more of it.
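The thought experiment above can be checked with a few lines of base R. This is only a sketch under the idealized assumption that every name in the pool is equally popular; the function name `top500_share_uniform` is made up for the illustration.

```r
# Share of births covered by the 500 most common names when every name in the
# pool is equally popular (an idealized assumption, not the real distribution).
top500_share_uniform <- function(pool_size, top_n = 500) {
  counts <- rep(1, pool_size)              # equal popularity for every name
  sum(head(counts, top_n)) / sum(counts)   # share held by the first top_n names
}

top500_share_uniform(500)    # 1: everyone has a top-500 name
top500_share_uniform(1000)   # 0.5: only half do
```

Real name counts are far from uniform, so this only bounds the intuition: a growing pool pushes the top-500 share down even if no individual name loses babies.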

We also notice a widening gap between the male and female proportions in the most-popular category: at first glance, parents are giving girls popular names less and less often. This could mean there is a wider name pool for female babies, but in general it tells us that the proportion of parents giving their children the same few names is smaller for girls than for boys (the most popular boys’ names are more dominant than the most popular girls’ names).

In 2016 (the most recent year in the data), about 76% of female babies and about 85% of male babies still had names in the top 500.

In other words, in 1910 only around 1-2% of parents strayed outside the 500 most popular names when naming their child, but in 2016 about 19% did (derivation below).

1 - [(female_prop_500 * females_born) + (male_prop_500 * males_born)] / total_born
= 1 - (0.76 * 1435587 + 0.849 * 1641986) / 3077537
= 1 - 0.80749
= 0.19251
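The same weighted average can be reproduced in R. This sketch uses the shares and birth counts quoted in the text; the variable names are my own, and the denominator is taken as the sum of the two birth counts, so the last digits may differ slightly from the hand calculation.

```r
female_share <- 0.76      # share of girls with a top-500 name in 2016 (from the text)
male_share   <- 0.849     # share of boys with a top-500 name in 2016 (from the text)
females_born <- 1435587
males_born   <- 1641986

# Births-weighted average of the two shares, then its complement.
in_top_500 <- (female_share * females_born + male_share * males_born) /
  (females_born + males_born)
outside_top_500 <- 1 - in_top_500

round(100 * outside_top_500)   # about 19 (percent)
```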

The next thing we can analyze is name diversity: are these changes in the proportion of popular names correlated with a higher overall number of names in the population? Let’s explore the total number of unique names given in each year.

Question 2

  2. What are the differences in diversity in Male vs. Female names over time, measured by the number of unique names?
#We already know how many males and females were named each year, but now we must find the number of unique names used

unique_names_per_year_male <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>% summarise(unique_names = n())

unique_names_per_year_female <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>% summarise(unique_names = n())

head(unique_names_per_year_male)
## # A tibble: 6 x 2
##    year unique_names
##   <dbl>        <int>
## 1  1910          692
## 2  1911          754
## 3  1912         1114
## 4  1913         1256
## 5  1914         1494
## 6  1915         1740
head(unique_names_per_year_female)
## # A tibble: 6 x 2
##    year unique_names
##   <dbl>        <int>
## 1  1910         1083
## 2  1911         1066
## 3  1912         1261
## 4  1913         1350
## 5  1914         1549
## 6  1915         1816
  ggplot() +
  geom_line(data=unique_names_per_year_male, aes(x=year, y=unique_names), color='blue') + 
  geom_line(data=unique_names_per_year_female, aes(x=year, y=unique_names), color='red') + 
  ylim(500,6500) +
  labs(title="Number of Unique Names Per Year",
         x = "Year",
         y = "# of Unique Names")

  2. We can see in the plot that there is a general upward trend in unique names over time. The data frame shows that in 1910 (remember that many people born before 1937 never registered with Social Security), there were 1,083 unique female names and 692 unique male names in use. In 2008, at the peak, there were 6056 unique names given to baby girls and 4425 given to baby boys. Since then there has been a slight decrease in the number of unique names for both genders, although this could be either a small bump or the start of a new trend.

This data could indicate that the decrease in the proportion of children with the same 100 and 500 most popular names is due to the rise in name diversity and the growth of the potential name pool. This could be attributed to greater diversity in the nation’s population, a cultural shift toward wanting unique names, and other factors. We cannot draw any firm conclusions about the causes of these trends from this data alone.
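Because each row of the data is one name for one year and gender, counting rows per year and gender is the same as counting unique names. A minimal base R sketch on invented toy data (the names and years are placeholders, not real results) shows both genders tabulated at once:

```r
# Toy data: one row per (year, gender, name), as in names_df.
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  gender = c("F", "F", "M", "F", "F", "F", "M"),
  name   = c("Ava", "Mia", "Noah", "Ava", "Mia", "Zoe", "Noah")
)

# Rows per year and gender = number of unique names per year and gender.
unique_counts <- table(year = toy$year, gender = toy$gender)
print(unique_counts)
```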

Question 3

  3. What fad/trend names can we find? What metrics can we use to find fad names?

To approach this problem, I first want to explore the top 10 male and female names in any given year and visually analyze whether the same names remain at the top. Are some names consistently popular? Do they change ranking quickly, or remain more stable?

names_top_10_male <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:10)

names_top_10_male %>%
  ggplot(aes(x=year, y=number, color=name)) +
    geom_line() +
    theme(legend.position="none") +
    ylim(0,100000) +
    labs(title="Popularity of Male names over time",
         x = "Year",
         y = "# of Babies With Name")

names_top_10_female <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(year) %>%
  slice(1:10)

names_top_10_female %>%
  ggplot(aes(x=year, y=number, color=name)) +
    geom_line() +
    theme(legend.position="none") +
    ylim(0,100000) +
    labs(title="Popularity of Female names over time",
         x = "Year",
         y = "# of Babies With Name")

  3. As a general trend, boys had a collection of names that stayed popular from 1925 to 1975, with slower turnover than in the female data. The most popular boys’ names were consistently the most popular, even as their share of all names decreased, although some new names did come into popularity.

The female data has a few dramatic peaks in name popularity and a few stand-out names at any one time. The female data seems to be more trend-oriented, while the male data seems to have more generation-crossing names. But we are only looking at the top 10 names, not the entire range of rankings.

We also see that the popularity of the top names has decreased drastically since 1990: the most popular names hold a significantly smaller share of the name market than they did before. We continue to see the pattern mentioned earlier, where 1945-1955 was a peak of homogeneity in naming children (in the form of one huge trend name for females, and wide use of the more timeless-seeming names for males).

As a quick aside, I wanted to find out what that huge 1945-1955 trend name in the female chart was, and why almost a hundred thousand girls in 1947 were given it. With some manipulation of the data we see that the 1947 peak is the name Linda. Let’s look at the name Linda on its own.

year_df <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  filter(year == 1947)
head(year_df)
##   year gender     name number
## 1 1947      F    Linda  99685
## 2 1947      F     Mary  71686
## 3 1947      F Patricia  51276
## 4 1947      F  Barbara  48791
## 5 1947      F   Sandra  34776
## 6 1947      F    Carol  33538
Linda_df <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  filter(name == 'Linda')
  
Linda_df %>%
  ggplot(aes(x=year, y=number, color=name)) +
    geom_line() +
    theme(legend.position="none") +
    labs(title="Popularity of 'Linda' over time",
         x = "Year",
         y = "# of Babies With Name")

Now of course, I was curious about the rise and fall of “Linda”. With a quick Google search I learned that “Linda” is actually considered the trendiest name of all time. While researching why Linda was so popular, I found another data scientist who had looked at a similar dataset. David Taylor, a biotechnologist and blogger at Prooffreader.com, analyzed a names database as well and found Linda to be the trendiest name of all time by his metrics, where he “[took] into account both the swiftness with which a name enters and then exits the naming pool, as well as the intensity of its popularity. The names on his list are therefore ones that both had a sharp rise and fall and had a major impact.” His calculation measured trend using peak height and peak width.

Now I am interested in this ‘trendiness’ metric and in how I can use measurements of how quickly a name becomes popular and then unpopular to create a trend scale for names.

It turns out that the name spiked in 1947 because of a hit 1946 song called… “Linda”, written by Jack Lawrence. It is easy to imagine that the name dropped in popularity as quickly as it rose, following the waning popularity of the song, but this is a data science project, not a speculative assignment.

Now let’s look for a trending male name. I am curious about the purple peak of 1956, which seems to show a quick rise to #1 and then a slower decline, with the name remaining among the most popular even today. I am guessing it does not stem from pop culture since it has stood the test of time; perhaps it is biblical or a common British name.

year_df <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  filter(year == 1956)
head(year_df)
##   year gender    name number
## 1 1956      M Michael  90620
## 2 1956      M   James  84842
## 3 1956      M  Robert  83905
## 4 1956      M   David  81601
## 5 1956      M    John  80759
## 6 1956      M William  58935

The name in question is Michael. Below we see its popularity trend from 1910 to 2016.

Michael_df <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  filter(name == 'Michael')
  
Michael_df %>%
  ggplot(aes(x=year, y=number, color=name)) +
    geom_line() +
    theme(legend.position="none") +
    labs(title="Popularity of 'Michael' over time",
         x = "Year",
         y = "# of Babies With Name")

Comparing the two top 10 data sets, there seem to be fewer intense spikes and trends in the male data. I want to test whether there is a significant difference in baby-name trends between males and females, but so far we have only looked at 10 names per gender and have not clearly defined a trend metric to study.

This leads us to our final exploratory question:

Question 4

  4. Do fad names occur more in males or females?

I aim to test this by looking at the relative change in popularity in names across years. I will create what I call the trend meter. If name popularity changes quickly, this means it is changing trendiness, and it will mark higher on the trend meter.

I first want to standardize all data.

zs_df_male <- names_top_500_male %>%
  group_by(year) %>%
  mutate(mean = mean(number)) %>%
  mutate(sd = sd(number)) %>%
  mutate(z = (number - mean) / sd) %>%
  ungroup()

zs_df_female <- names_top_500_female %>%
  group_by(year) %>%
  mutate(mean = mean(number)) %>%
  mutate(sd = sd(number)) %>%
  mutate(z = (number - mean) / sd) %>%
  ungroup()

While standardizing our data, we have to think about what it means, so let’s look at the top 500 most popular names in 1910. We compute our trend metrics on the top 500 names because names outside the top 500 are not making significant trends; reaching the top 500 is a reasonable point at which to start tracking a name.

Let’s look at the most popular name in 1910, John. The data is so skewed that the mean number of males per name is 326, with a standard deviation of 986 (3x as large as the mean). The name closest to this mean is Morris, ranked #91 in popularity. This means that only the top 90 of our 500 most popular names (90/500 = 18%) are above the mean and 82% are below. The data is immensely skewed, so much so that the z-score for the #1 name is 11.27, a huge outlier.
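To make the effect of skew on the mean and z-scores concrete, here is a small base R sketch. The counts are invented (one dominant name and a long tail) and only loosely mimic the 1910 data; they are not taken from the dataset.

```r
# Invented counts: one dominant name followed by a long tail.
counts <- c(11450, 900, 400, 200, 100, 50, 25, 10, 10, 5)

# Standardize the same way as in the analysis: (value - mean) / sd.
z <- (counts - mean(counts)) / sd(counts)

mean(counts)                 # pulled far above the typical count by the top name
sum(counts > mean(counts))   # how many names sit above the mean
z[1]                         # z-score of the most popular name
```

With only ten values the top z-score cannot get very large, but the pattern is the same as in the real data: a single name drags the mean above almost every other name.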

We will graph the distribution of male and female names in 1910 and 2016.

names_top_500_male_1910 <- names_top_500_male %>%
  filter(year == 1910) %>% 
  mutate(toHighlight = ifelse( number >= 326.148, "yes", "no" ) )

names_top_500_female_1910 <- names_top_500_female %>%
  filter(year == 1910) %>% 
  mutate(toHighlight = ifelse( number >= 691.130, "yes", "no" ) )

names_top_500_male_2016 <- names_top_500_male %>%
  filter(year == 2016) %>% 
  mutate(toHighlight = ifelse( number >= 2788.098, "yes", "no" ) )

names_top_500_female_2016 <- names_top_500_female %>%
  filter(year == 2016) %>% 
  mutate(toHighlight = ifelse( number >= 2186.564, "yes", "no" ) )

names_top_500_male_1910 %>%
  ggplot(mapping =  aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
  geom_bar(stat = "identity") +
  scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
    labs(title="Distribution of popularity of top 500 Male Names in 1910",
         x = "Name (ordered by popularity)",
         y = "# of Babies With Name")

names_top_500_female_1910 %>%
  ggplot(mapping =  aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
  geom_bar(stat = "identity") +
  scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
    labs(title="Distribution of popularity of top 500 Female Names in 1910",
         x = "Name (ordered by popularity)",
         y = "# of Babies With Name")

names_top_500_male_2016 %>%
  ggplot(mapping =  aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
  geom_bar(stat = "identity") +
  scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
    labs(title="Distribution of popularity of top 500 Male Names in 2016",
         x = "Name (ordered by popularity)",
         y = "# of Babies With Name")

names_top_500_female_2016 %>%
  ggplot(mapping =  aes(x=reorder(name, -number), y=number, fill=toHighlight) ) +
  geom_bar(stat = "identity") +
  scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" ) +
    labs(title="Distribution of popularity of top 500 Female Names in 2016",
         x = "Name (ordered by popularity)",
         y = "# of Babies With Name")

Above, the names whose counts are above the mean number of babies per name are colored in red (tomato). Using everything we have learned so far, we will now track ‘trendiness’.

We must formally define trendiness, and how to measure it.

Trendiness could be defined in multiple ways:

  1. Trendiness could be defined as the rate at which a name’s popularity changes relative to the mean. This could be measured as the year-to-year change in the name’s z-score, or in the proportion of babies given that name.

The problem is this: we want to see how quickly a name becomes popular and then unpopular again in order to identify a ‘fad’ name. But since we learned that popular names are in general taking up a smaller share of all names, a popular name like “Michael” would look as if it were becoming less popular when in reality it keeps the same ranking while taking up a smaller proportion of the name pool. For this reason, tracking change in z-score or change in proportion does not give accurate results: it would flag consistently popular names as fad names because their proportions change quickly.
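A toy example in base R may make this concrete. The counts and the placeholder names (`Other1`, `Other2`, `Other3`) are invented, and the helpers `share` and `rank_of` are hypothetical; the point is only that a name’s share can fall sharply while its rank does not move.

```r
# Invented counts for two years; "Michael" is the top name in both.
year1 <- c(Michael = 900, Other1 = 60, Other2 = 40)
year2 <- c(Michael = 500, Other1 = 200, Other2 = 150, Other3 = 150)

# Hypothetical helpers: a name's share of all births and its rank (1 = most common).
share   <- function(counts, name) unname(counts[name] / sum(counts))
rank_of <- function(counts, name) unname(rank(-counts, ties.method = "min")[name])

share(year1, "Michael")     # 0.9
share(year2, "Michael")     # 0.5
rank_of(year1, "Michael")   # 1
rank_of(year2, "Michael")   # 1
```

A proportion-based metric would register a large drop here, while a rank-based metric correctly sees no change.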

  2. Trendiness could be measured as the year-to-year change in each name’s ranking. We could then use the standard deviation of all the changes in ranking for a name as our trend metric, because it measures how far the changes stray from their mean, reflecting high volatility. But we would need a way to weight a move from rank 1 to 2 more heavily than a move from rank 10,000 to 9,999, since the former means more in terms of trendiness.

Imagine we have three years of data and 5 names. If the first name was always rank 1 or 2, then its mean rank would be around 1.5 with a small standard deviation. This name is POPULAR, but it does not match a trend or fad.

Now consider another name that goes from rank 5, to rank 3, to rank 1. It would have a mean rank of 3 but a higher standard deviation, meaning the name went through a trend.
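The two hypothetical names above can be compared directly in base R. The rank vectors are the invented three-year examples from the text (with 1, 2, 1 standing in for "always rank 1 or 2"), not values from the dataset.

```r
steady <- c(1, 2, 1)   # always rank 1 or 2: popular but stable
mover  <- c(5, 3, 1)   # climbs from rank 5 to rank 1

mean(mover)   # 3
sd(steady)    # about 0.58
sd(mover)     # 2
```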

We use this ‘change in ranking per year’ metric rather than the spread of the rankings themselves because a name with many small, quick swings and a name with one large shift can occupy the same set of ranks. They would then have the same average and standard deviation of rank even though they do not represent the same level of trendiness.
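Here is a minimal base R sketch of that argument, using two invented rank histories that contain exactly the same ranks in a different order. The variable names are hypothetical.

```r
flicker <- c(1, 5, 1, 5, 1, 5)   # swings back and forth every year
shift   <- c(1, 1, 1, 5, 5, 5)   # one single change in the middle

# Same ranks overall, so the mean and sd of rank cannot tell them apart.
c(mean(flicker), mean(shift))
c(sd(flicker), sd(shift))

# The sd of the year-to-year changes does separate them.
sd(diff(flicker))
sd(diff(shift))
```

The name that flips every year gets the larger value, which is what a volatility-based trend metric should reward.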

This metric should outline the trend factor of each name reasonably well, and it is how we will define trend: through a higher standard deviation, or volatility.

I will create a new data set where the entity is the name and each year is an attribute, so that we can look at the rank of each name per gender among all the names used each year. If a name is not present in more than 10 years, we will exclude it from the dataset.

I have one last interesting point before we continue. We know that sqrt(2) - sqrt(1) is greater than sqrt(10000) - sqrt(9999). For this reason we will track not the change in rank but the change in the square root of rank, so that changes among the top-ranked names are worth MORE in our trend index.
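The size of this weighting can be checked with a quick calculation in base R (the variable names are just for illustration):

```r
top_move  <- sqrt(2) - sqrt(1)           # a one-place move at the very top
tail_move <- sqrt(10000) - sqrt(9999)    # a one-place move deep in the tail

round(top_move, 4)    # 0.4142
round(tail_move, 4)   # 0.005
top_move / tail_move  # the top move counts roughly 80 times as much
```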

We also may find different results in trendiness compared to data scientist David Taylor since our trend metric is likely different. I am taking volatility as the first measure, and popularity as a second degree measure, where he likely took popularity as a higher weight when calculating his metric.

#get all male data
all_males <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number))

#get all female data
all_females <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number))

#get how many years each name was issued
names_male_year_count <- names_df %>%
  filter(gender == "M") %>%
  arrange(year, desc(number)) %>%
  group_by(name) %>%
  summarize(years_with_name = n()) %>%
  ungroup()

#get how many years each name was issued
names_female_year_count <- names_df %>%
  filter(gender == "F") %>%
  arrange(year, desc(number)) %>%
  group_by(name) %>%
  summarize(years_with_name = n()) %>%
  ungroup()

names__male <-
  names_male_year_count %>% inner_join(all_males, by='name') %>%
  filter(years_with_name > 10)

names__female <-
  names_female_year_count %>% inner_join(all_females, by='name') %>%
  filter(years_with_name > 10)

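# Note: names__male is ungrouped at this point, so dense_rank() ranks the counts across all
# years pooled; add group_by(year) before mutate() for a within-year rank (same for females below).
rank_male_spread <- names__male %>%
  mutate(rank = dense_rank(desc(number))) %>%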
  select(name, year, rank) %>%
  tidyr::spread(year, rank)

rank_female_spread <- names__female %>%
  mutate(rank = dense_rank(desc(number))) %>%
  select(name, year, rank) %>%
  tidyr::spread(year, rank)


#encode NA's
rank_male_spread[is.na(rank_male_spread)] <- 10000

rank_female_spread[is.na(rank_female_spread)] <- 10000

Our current data frame shows the rank of each name in each year for male and female babies. We encoded every NA as rank 10000 so that we can subtract these values. This lets us track the change in rank when a name goes unused for a year or begins to be used. For a name unused in two consecutive years, the change sqrt(10000) - sqrt(10000) is 0, so it lowers our standard deviation without adding to the change in rank, which is good for our trend metric.
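The idea behind the metric, for a single name, can be sketched as a small base R function. `trend_metric` is a hypothetical helper written for this illustration, and the rank histories are invented; the analysis itself computes the same quantity for all names at once with matrices.

```r
# Hypothetical helper: trend metric for one name's yearly ranks.
trend_metric <- function(ranks, missing_rank = 10000) {
  ranks[is.na(ranks)] <- missing_rank   # years in which the name was not used
  sd(diff(sqrt(ranks)))                 # volatility of year-to-year sqrt-rank changes
}

rising_name <- c(NA, NA, 400, 25, 4)      # appears, then climbs quickly
rare_name   <- c(NA, NA, NA, 9000, NA)    # shows up once, far down the ranking
never_used  <- c(NA, NA, NA)              # all changes are sqrt(10000) - sqrt(10000) = 0

trend_metric(rising_name)
trend_metric(rare_name)
trend_metric(never_used)   # 0
```

The quickly rising name scores far higher than the name that briefly appears near the bottom, which is the behavior the square-root weighting is meant to produce.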

I will use matrices to subtract each rank from the year before it.

matrix_1 <- rank_male_spread %>%
  select(-name) %>%
  as.matrix() %>%
  .[,-1]

matrix_2 <- rank_male_spread %>%
  select(-name) %>%
  as.matrix() %>%
  .[,-ncol(.)]

# This is where we subtract the square roots of consecutive ranks, which makes changes among top ranks worth more than changes of the same size among lower ranks.
diff_rank_male <- (sqrt(matrix_1) - sqrt(matrix_2)) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(name = rank_male_spread$name)

diff_rank_male$change <- rowSums( diff_rank_male[,1:106] )
diff_rank_male$avg_change <- rowMeans( diff_rank_male[,1:106] )
sd_rank_males <-suppressWarnings( transform(diff_rank_male, Trend_Metric=apply(diff_rank_male,1, sd, na.rm = TRUE)) )

trendm <- sd_rank_males %>%
  select(name, avg_change, Trend_Metric) %>%
  arrange(desc(Trend_Metric)) 
  

##female matrices  
matrix_1f <- rank_female_spread %>%
  select(-name) %>%
  as.matrix() %>%
  .[,-1]

matrix_2f <- rank_female_spread %>%
  select(-name) %>%
  as.matrix() %>%
  .[,-ncol(.)]


diff_rank_female <- (sqrt(matrix_1f) - sqrt(matrix_2f)) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(name = rank_female_spread$name)

diff_rank_female$change <- rowSums( diff_rank_female[,1:106] )
diff_rank_female$avg_change <- rowMeans( diff_rank_female[,1:106] )
sd_rank_females <-suppressWarnings( transform(diff_rank_female, Trend_Metric=apply(diff_rank_female,1, sd, na.rm = TRUE)) )

trendsm <- sd_rank_males %>%
  select(name, avg_change, Trend_Metric) %>%
  arrange(desc(Trend_Metric)) 
  
trendsf <- sd_rank_females %>%
  select(name, avg_change, Trend_Metric) %>%
  arrange(desc(Trend_Metric)) 

head(trendsm)
##      name avg_change Trend_Metric
## 1    Noah -0.5749504     6.228043
## 2    Liam -0.5663917     5.910047
## 3   Ethan -0.4998997     5.759774
## 4   Mason -0.5230817     5.481326
## 5   Jacob -0.5012225     5.326577
## 6 Michael -0.4667974     5.208330
head(trendsf)
##       name avg_change Trend_Metric
## 1   Olivia -0.6460929     6.749955
## 2      Ava -0.6040010     6.544934
## 3      Mia -0.5844095     6.248015
## 4   Sophia -0.5896020     6.233287
## 5 Isabella -0.5667308     6.095240
## 6     Mary  0.5746583     6.033457

We can now look at the ‘trendiest’, most volatile male and female names. The highest-ranking male names are Noah, Liam, and Ethan. The highest-ranking female names are Olivia, Ava, and Mia.

Noah <- names__male %>%
  mutate(rank = dense_rank(desc(number))) %>%
  select(name, year, rank) %>%
  filter(name == 'Noah')

Noah %>%
  ggplot(aes(x=year, y=rank, color=name)) +
    geom_line() +
    ylim(0,10500) +
    theme(legend.position="none") +
    labs(title="Ranking of 'Noah' over time",
         x = "Year",
         y = "Ranking")

Olivia <- names__female %>%
  mutate(rank = dense_rank(desc(number))) %>%
  select(name, year, rank) %>%
  filter(name == 'Olivia')

Linda <- names__female %>%
  mutate(rank = dense_rank(desc(number))) %>%
  select(name, year, rank) %>%
  filter(name == 'Linda')
  

ggplot() +
    geom_line(data=Olivia, aes(x=year, y=rank), color="blue") + 
    geom_line(data=Linda, aes(x=year, y=rank), color='red') + 
    ylim(0,10500) +
    labs(title="Ranking of 'Olivia' and 'Linda' over time",
         x = "Year",
         y = "Ranking")

Olivia is the blue line, and Linda is the red line.

Above, we graphed the ‘trendiest’ names according to our calculated metric. Remember that ‘trendiest’ here means most volatile, not most popular. Popularity still plays a role, though: because each change in rank is calculated on the square root of the rank (our sqrt() step), when 2 names move by a similar number of places, the move that happens nearer the top of the rankings counts for more. We are looking at name FADS.

We must also remember that rank #1 is the best position, even though it sits at the bottom of the graph, next to the x-axis.
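One way to make a chart like this easier to read is to flip the y-axis so that rank #1 sits at the top. The sketch below uses ggplot2's scale_y_reverse() on a small made-up data frame. The name toy_ranks and its values are illustrative assumptions, not the SSA data. The same idea could be applied to the plots above, though the ylim() call would then need to be replaced by limits passed to the reversed scale.

```r
library(ggplot2)

# Hypothetical toy data: a name climbing from rank 900 to rank 1.
toy_ranks <- data.frame(
  year = 2000:2005,
  rank = c(900, 400, 120, 30, 5, 1)
)

# scale_y_reverse() flips the axis so smaller (better) ranks are drawn higher.
p <- ggplot(toy_ranks, aes(x = year, y = rank)) +
  geom_line() +
  scale_y_reverse() +
  labs(title = "Toy ranking with rank #1 at the top",
       x = "Year",
       y = "Ranking")

print(p)
```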

I also graphed Linda alongside Olivia to view the difference. I think the reason Olivia ranks so much higher than Linda (even though Linda was previously considered the trendiest name) is that Olivia is more volatile on average: Linda had much slower rates of change from 1947 to around 1963, which lowers its trendiness value, whereas Olivia had rapid rank shifts in this time. Linda was ranked 46 out of 8,294 in our metric.

Now that we have the SD as our trend metric for every name, we move on to hypothesis testing for our question: do fad names occur more among males or females?

Hypothesis testing

Null Hypothesis: There is no difference in the mean ‘trend metric’ between male and female baby names.

Alternative Hypothesis: There is a difference in the mean ‘trend metric’ between male and female baby names.

I want to run a two-sample t-test comparing the two data sets. First, I will graph the distributions.

# Density plot
ggplot(sd_rank_males) + geom_density(aes(x = Trend_Metric), bw= .3)

ggplot(sd_rank_females) + geom_density(aes(x = Trend_Metric), bw = .3)
## Warning: Removed 1 rows containing non-finite values (stat_density).

We cannot see much in the density plots, although the mode of the female trend metric appears higher than the mode of the male trend metric. We will have to run our t-test on the means to see if they are significantly different.

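# Pull the Trend_Metric column out as a plain numeric vector
# (single-bracket indexing such as df['col'] returns a data.frame instead)
avector <- sd_rank_males$Trend_Metric
class(avector) 
## [1] "numeric"
avector2 <- sd_rank_females$Trend_Metric
class(avector2) 
## [1] "numeric"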
t.test(avector, avector2)
## 
##  Welch Two Sample t-test
## 
## data:  avector and avector2
## t = -52.338, df = 10413, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5227918 -0.4850455
## sample estimates:
## mean of x mean of y 
## 0.2229251 0.7268437

Conclusion

We see through our two-sample t-test that the sample estimate of the male average trend score is .223, and that of the female average trend score is .727. Our p-value is essentially 0 (less than 2.2e-16), indicating that we can reject our null hypothesis.

Recall that the trend metric is the standard deviation of the year-to-year change in the square root of a name’s ranking. It measures how quickly rankings change, weighting a change near the top of the rankings more heavily than a change of the same size further down.
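To make the definition concrete, here is a minimal sketch in base R. The helper trend_metric() and the two rank histories are hypothetical illustrations built from the description above, not the code used earlier in this project, so details such as NA handling are assumptions.

```r
# Hypothetical helper: SD of the year-over-year change in sqrt(rank).
trend_metric <- function(ranks) {
  sd(diff(sqrt(ranks)), na.rm = TRUE)
}

# The same 10-place move counts for more near the top of the rankings:
top_move  <- sqrt(11) - sqrt(1)        # rank 1 -> 11
deep_move <- sqrt(1010) - sqrt(1000)   # rank 1000 -> 1010
top_move > deep_move

# Two made-up rank histories over six years.
steady_name <- c(50, 49, 48, 47, 46, 45)     # slow, even drift
fad_name    <- c(900, 300, 20, 5, 150, 700)  # sharp rise and fall

trend_metric(steady_name)
trend_metric(fad_name)
```

The steady name changes rank every year, but by almost the same amount each time, so its standard deviation is close to zero. The fad name swings sharply in both directions, which is what pushes its metric up.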

We can say there is a significant difference in the ‘trend metric’ between male and female baby names. We can also say with 95% confidence that the average trend metric for female baby names is between .485 and .523 units higher than the average trend metric for male baby names.
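For readers who want to pull these numbers out of the test object rather than read them off the printout, the following sketch shows the relevant fields of the object returned by t.test(). The two input vectors are simulated stand-ins (sim_male and sim_female are assumed names with made-up values), so the numbers it produces will not match the results above.

```r
set.seed(1)

# Simulated stand-ins for the two Trend_Metric columns (not the real data).
sim_male   <- rexp(500, rate = 1 / 0.2)
sim_female <- rexp(500, rate = 1 / 0.7)

# t.test() runs Welch's unequal-variance test by default (var.equal = FALSE).
res <- t.test(sim_male, sim_female)

res$estimate   # the two sample means
res$conf.int   # 95% confidence interval for mean(x) - mean(y)
res$p.value    # two-sided p-value

# The interval is for male minus female, so flipping the sign and the order
# gives the interval for "female minus male".
rev(-res$conf.int)
```

This is why a confidence interval with two negative endpoints in the printout can be read as the female mean being higher than the male mean by a positive amount.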

In layman’s terms, the average female name is ‘trendier’, or ‘goes through quicker ranking changes’, than the average male name, and the difference is statistically significant.

Final Words

We have drawn some interesting conclusions from the data that invite further exploration. We have found that, for some reason, female names in the United States tend to change in popularity more rapidly than male names, there are more unique female names, and a smaller proportion of girls than boys are given the most popular names.

We have also learned that over time parents have been choosing from a wider pool of names, and the most popular names are taking up less of the total market share.

This could be a testament to the ever-growing melting pot of America that consistently brings together new cultures. Another interesting project (maybe my next project) would be to analyze changes in the heritage and cultures of families naming children, to see whether the change in names stems from the proportion of children born in the US with respect to their cultural backgrounds. That would allow me to isolate demographics and research their effects on name changes.

It wouldn’t make sense to predict which names will be “popular” or “trendy” next, since it seems many come from current culture, such as US presidents, artists, and popular figures. These external factors would be difficult to predict before they happen. But perhaps a third project would be studying the most ‘popular’ people in US culture for any given year and seeing if their names were popular baby names the next year.

Regardless of the reasons for these changes, we can conclude that US name trend analysis is far from over, although we have learned a significant amount through this analysis. It will be interesting to see how name fads change in the future.

Below are some external links to help the reader:

Dataset used: https://www.kaggle.com/salil007/a-very-extensive-exploratory-analysis-usa-names/data

Article found about “Linda”: https://www.bustle.com/p/linda-is-the-trendiest-baby-name-in-us-history-making-for-a-classic-yet-unexpected-pick-30410

Data Project by David Taylor: http://www.prooffreader.com/2014/07/trendiest-baby-names-in-social-security.html