This is an analysis of SAT scores recorded across the nation from 2005 to 2015. Other pieces of data collected include the specific letter grade by subject, the average GPA by subject, family income level, gender, and the number of years a student has studied a specific subject. The dataset is stratified by year and state, with the District of Columbia, Puerto Rico, and the Virgin Islands included as separate regions. Note that this dataset only includes the math and verbal (reading) sections of the SAT, and excludes the writing portion.
For this project, I will focus on analyzing three specific questions based on this dataset:
1. On average, do males generally perform better than females on the math section of the SAT?
2. In 2015, how did the states compare to each other in terms of average total SAT score?
3. Is there a correlation between an arts/music education and a high SAT score?
The dataset used in this analysis can be found here: https://think.cs.vt.edu/corgis/csv/school_scores/school_scores.html
Library imports:
library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
library(maps)
Data import:
df <- read_csv("school_scores.csv")
Check to ensure that we have 53 regions (50 states, Washington D.C., Puerto Rico, and the Virgin Islands):
‘Name’ is the full name of the state for this report.
df %>% summarize(distinct = n_distinct(Name))
## # A tibble: 1 x 1
## distinct
## <int>
## 1 53
Check to ensure that we have 11 years represented (from 2005 to 2015):
df %>% summarize(distinct = n_distinct(Year))
## # A tibble: 1 x 1
## distinct
## <int>
## 1 11
The total number of observations in this dataset:
nrow(df)
## [1] 577
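With 53 regions and 11 years, a complete panel would have 53 × 11 = 583 rows, so the 577 rows observed here suggest that a few region-year pairs are absent. Below is a minimal sketch of how one might list them. It uses a small made-up data frame (`toy`) in place of `df`; only the column names `Name` and `Year` come from the dataset.

```r
# Toy stand-in for df (assumption): Alaska has no 2006 row.
toy <- data.frame(
  Name = c("Alabama", "Alabama", "Alaska"),
  Year = c(2005, 2006, 2005),
  stringsAsFactors = FALSE
)

# Every region-year pair that a complete panel would contain.
all_pairs <- expand.grid(
  Name = unique(toy$Name),
  Year = unique(toy$Year),
  stringsAsFactors = FALSE
)

# Keep only the pairs that never appear in the data.
observed_keys <- paste(toy$Name, toy$Year)
all_keys <- paste(all_pairs$Name, all_pairs$Year)
missing_pairs <- all_pairs[!(all_keys %in% observed_keys), ]
print(missing_pairs)
```

Running the same steps on `df` would show which regions are missing in which years.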
The correlation between the average GPA in math and the average math score:
Note that this is just the GPA within the subject, not across all academic subjects.
df %>% summarize(correlation = cor(`Mathematics.Average GPA`, Math))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.811048
The large, positive correlation shows that higher GPAs in math are associated with higher scores on the math section of the SAT, as is expected.
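A correlation coefficient by itself says nothing about uncertainty. As a sketch, base R's `cor.test()` reports the same Pearson estimate together with a confidence interval and a p-value. The two vectors below are made-up illustrative values, not rows from the dataset.

```r
# Illustrative values only (assumption): subject GPA and mean math score.
gpa <- c(2.9, 3.1, 3.2, 3.4, 3.6)
sat_math <- c(480, 505, 510, 540, 560)

# Pearson correlation with a 95% confidence interval.
result <- cor.test(gpa, sat_math, method = "pearson")

print(result$estimate)  # same value as cor(gpa, sat_math)
print(result$conf.int)  # lower and upper bounds of the interval
```

On the real data this would be `cor.test(df$"Mathematics.Average GPA", df$Math)`.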
The average SAT score (out of 1600) across the nation during the 11-year period:
‘Math’ is the average math score of students in this state during this year.
‘Verbal’ is the average verbal (reading, not writing) score of students in this state during this year.
average_total_score <- mean(df$Math) + mean(df$Verbal)
average_total_score
## [1] 1067.017
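Because the mean is linear, adding the two column means gives the same number as averaging the per-row totals, as long as neither column has missing values. Note also that this is a simple average over state-year rows, so every row counts equally regardless of how many students took the test. A small sketch follows; the scores and the `takers` counts are hypothetical and only show what a weighted version could look like.

```r
# Hypothetical state-year averages and test-taker counts (assumptions).
math <- c(500, 520, 560)
verbal <- c(490, 510, 550)
takers <- c(1000, 3000, 500)

# Sum of means equals mean of sums when there are no missing values.
sum_of_means <- mean(math) + mean(verbal)
mean_of_sums <- mean(math + verbal)
print(all.equal(sum_of_means, mean_of_sums))

# A weighted average gives larger groups more influence.
weighted_total <- weighted.mean(math + verbal, w = takers)
print(weighted_total)
```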
The standard deviation of SAT scores across the nation during the 11-year period:
df1 <- df %>% mutate(Average_Total_Score = Math + Verbal)
standard_deviation <- sd(df1$Average_Total_Score)
standard_deviation
## [1] 90.00557
We can also represent this data in a histogram:
th <- theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = rel(1)),
legend.position = "bottom")
ggplot(data = df1) +
geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) +
labs(title = "Histogram of Average SAT Scores", x = "Average Total Score", y = "Frequency") + th
Histogram of Average SAT scores stratified by year:
ggplot(data = df1) +
geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) +
labs(title = "Histogram of Average SAT Scores By Year", x = "Average Total Score", y = "Frequency") +
facet_wrap(~Year) + th
As can be seen from the above section, the dataset contains a large amount of information. I will only be focusing on certain factors to try to answer the three core questions listed in the “Introduction” section.
A common stereotype in society is that boys are better at math than girls. While this is not necessarily true, the stereotype draws on the fact that boys have, on average, continued to score higher than girls on the math section of the SAT. One article from AEIdeas reported that a gap of roughly 30 points between boys’ and girls’ math scores has persisted for over 40 years*. In our dataset, we can compare the boys’ and girls’ mean math scores in the first 10 rows. In each of these 10 observations, the boys score higher than the girls on average.
‘Male.Math’ is the average math score of students in this state during this year who identified as male.
‘Female.Math’ is the average math score of students in this state during this year who identified as female.
*The AEIdeas article can be found here: http://www.aei.org/publication/2015-sat-test-results-confirm-pattern-thats-persisted-for-40-years-high-school-boys-are-better-at-math-than-girls/
df_gendered_math_scores <- df %>% select(Male.Math, Female.Math, Name, Year)
kable(head(df_gendered_math_scores, n=10))
Male.Math | Female.Math | Name | Year |
---|---|---|---|
582 | 538 | Alabama | 2005 |
535 | 505 | Alaska | 2005 |
549 | 513 | Arizona | 2005 |
570 | 536 | Arkansas | 2005 |
543 | 504 | California | 2005 |
577 | 546 | Colorado | 2005 |
534 | 502 | Connecticut | 2005 |
521 | 486 | Delaware | 2005 |
509 | 451 | District Of Columbia | 2005 |
516 | 484 | Florida | 2005 |
We can also compare the average male math score with the average female math score on the SAT (across the nation and over a span of 11 years):
mean(df$Male.Math)
## [1] 553.9116
mean(df$Female.Math)
## [1] 518.4159
difference <- mean(df$Male.Math) - mean(df$Female.Math)
difference
## [1] 35.49567
From this, we can see that for the period 2005-2015, boys had a higher mean math score than girls. Specifically, boys scored about 35.5 points higher than girls on average.
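One way to check whether a gap like this is distinguishable from zero is a paired t-test, since each row supplies a male and a female mean for the same state and year. The sketch below uses only the first five rows of the table above, so it illustrates the call rather than giving a result for the full dataset. State-year rows are also not independent of each other, so any p-value should be read loosely.

```r
# First five rows of the table above (2005: Alabama to California).
male <- c(582, 535, 549, 570, 543)
female <- c(538, 505, 513, 536, 504)

# Paired test: is the mean of (male - female) different from zero?
tt <- t.test(male, female, paired = TRUE)

print(unname(tt$estimate))  # mean of the row-wise differences
print(tt$conf.int)          # 95% confidence interval for that mean
print(tt$p.value)
```

On the full data the call would be `t.test(df$Male.Math, df$Female.Math, paired = TRUE)`.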
The mean difference calculated above is for a very large period spanning from 2005 to 2015. It would be more beneficial to us to see how the gender difference in math scores has changed throughout the years. To figure this out, we can create a new dataframe that includes the mean male score, mean female score, the difference between the two averages, and the year.
# Compute the male mean, female mean, and their difference for each year
df_mean_by_year <- df_gendered_math_scores %>%
  group_by(Year) %>%
  summarize(male_mean = mean(Male.Math),
            female_mean = mean(Female.Math)) %>%
  mutate(diff = male_mean - female_mean) %>%
  select(male_mean, female_mean, diff, year = Year)
kable(df_mean_by_year)
male_mean | female_mean | diff | year |
---|---|---|---|
555.1538 | 518.9615 | 36.19231 | 2005 |
555.5769 | 520.2885 | 35.28846 | 2006 |
552.7925 | 518.4151 | 34.37736 | 2007 |
554.0943 | 518.3962 | 35.69811 | 2008 |
559.8235 | 522.1373 | 37.68627 | 2009 |
560.6667 | 523.9412 | 36.72549 | 2010 |
551.1887 | 515.7925 | 35.39623 | 2011 |
552.1509 | 515.6415 | 36.50943 | 2012 |
550.0000 | 515.8113 | 34.18868 | 2013 |
551.7170 | 517.3396 | 34.37736 | 2014 |
550.3962 | 516.2453 | 34.15094 | 2015 |
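To put a number on how the gap moves over time, one option is to regress the yearly difference on the year. This is a sketch using the `diff` column from the table above; with only 11 points it describes the table rather than testing anything formally.

```r
# Yearly gap (male mean minus female mean), copied from the table above.
gap_by_year <- data.frame(
  year = 2005:2015,
  diff = c(36.19231, 35.28846, 34.37736, 35.69811, 37.68627, 36.72549,
           35.39623, 36.50943, 34.18868, 34.37736, 34.15094)
)

# Simple linear trend: change in the gap per calendar year.
trend <- lm(diff ~ year, data = gap_by_year)
slope <- coef(trend)[["year"]]
print(slope)
```

The fitted slope is small and negative, meaning the gap shrinks by only a fraction of a point per year.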
Across these 11 years, the gap between the mean scores for boys and girls has stayed within a narrow band, from about 34.2 points (2015) to about 37.7 points (2009). It has decreased very slightly since 2005, but it is consistently above 30 points. That is a sizable and persistent gap, although no formal significance test is run in this report. Many researchers and scholars have pointed towards differences in problem-solving strategies, spatial skills, and attitudes and values as reasons for the large point difference. More info about this can be found here: http://www.nctm.org/Publications/Teaching-Children-Mathematics/Blog/Current-Research-on-Gender-Differences-in-Math/
Boxplot of Male and Female Mean Math Scores By Year:
th <- theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = rel(1)),
legend.position = "bottom")
df$Year <- factor (df$Year)
ggplot(data=df) +
geom_boxplot(mapping = aes(x = Year, y = Male.Math), fill = NA, col = "blue") +
labs(title = "Gendered Math Score Averages By Year", x = "Year", y = "Mean Math Score") +
geom_boxplot(mapping = aes(x = Year, y = Female.Math), fill = NA, col = "red") + th
The boxplot provides a visual representation of the conclusions found above. Blue represents male scores, while red represents female scores. In every year between 2005 and 2015, the mean math score for boys has been higher than the mean math score for girls. Furthermore, the distance between the male mean and the female mean does not change substantially over the 11-year period. Even the outliers of the female math scores are lower than the outliers of the male math scores.
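Drawing two separate boxplot layers works, but it produces no legend, so the colors have to be explained in the text. An alternative sketch is to reshape the scores into a long format with one `Gender` column and map it to color. The `Gender` and `Score` column names are my own, and the three rows are the first 2005 rows from the earlier table.

```r
library(ggplot2)

# Wide format, as in the dataset: one column per gender.
wide <- data.frame(
  Year = c(2005, 2005, 2005),
  Male.Math = c(582, 535, 549),
  Female.Math = c(538, 505, 513)
)

# Long format: one row per (year, gender) score.
long <- rbind(
  data.frame(Year = wide$Year, Gender = "Male", Score = wide$Male.Math),
  data.frame(Year = wide$Year, Gender = "Female", Score = wide$Female.Math)
)

# One boxplot layer; ggplot2 builds the legend from the Gender column.
p <- ggplot(long, aes(x = factor(Year), y = Score, colour = Gender)) +
  geom_boxplot(fill = NA) +
  scale_colour_manual(values = c(Male = "blue", Female = "red")) +
  labs(x = "Year", y = "Mean Math Score")
```

With this layout the male and female boxes for each year are drawn side by side instead of on top of each other.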
How does each state compare to the other states academically? To answer this, we will look at the average total SAT score for each state in the most recent year in the dataset (2015).
state_scores <- df1 %>%
filter(Year == "2015") %>%
select(Average_Total_Score, Name)
state_scores <- state_scores[-c(9, 40, 48), ] # removes DC, Puerto Rico, and Virgin Islands by row position
state_scores$Name <- tolower(state_scores$Name)
colnames(state_scores)[colnames(state_scores) == 'Name'] <- 'region'
kable(state_scores %>% arrange(desc(Average_Total_Score), region))
Average_Total_Score | region |
---|---|
1215 | illinois |
1207 | north dakota |
1204 | michigan |
1203 | minnesota |
1197 | wisconsin |
1195 | missouri |
1192 | iowa |
1192 | south dakota |
1182 | kansas |
1181 | nebraska |
1176 | kentucky |
1175 | wyoming |
1170 | colorado |
1155 | tennessee |
1154 | utah |
1149 | mississippi |
1146 | oklahoma |
1141 | arkansas |
1125 | louisiana |
1121 | ohio |
1119 | montana |
1096 | new mexico |
1088 | alabama |
1056 | new hampshire |
1052 | arizona |
1047 | oregon |
1046 | massachusetts |
1046 | vermont |
1035 | virginia |
1021 | new jersey |
1014 | alaska |
1013 | washington |
1010 | connecticut |
1009 | west virginia |
1004 | pennsylvania |
1003 | california |
1003 | north carolina |
997 | indiana |
996 | hawaii |
991 | new york |
990 | nevada |
989 | rhode island |
983 | maryland |
976 | georgia |
976 | south carolina |
967 | florida |
957 | texas |
940 | maine |
930 | idaho |
924 | delaware |
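Dropping rows by position depends on the row order staying exactly as it is. A sketch of a name-based alternative is below. "District Of Columbia" is spelled as in the earlier table, while the spellings "Puerto Rico" and "Virgin Islands" are assumptions that should be checked against `unique(df$Name)`. The territory scores are left as `NA` because they are not shown in this report.

```r
# Small stand-in for the 2015 rows; state scores are from the table above.
toy_scores <- data.frame(
  Name = c("Delaware", "District Of Columbia", "Florida", "Puerto Rico"),
  Average_Total_Score = c(924, NA, 967, NA),
  stringsAsFactors = FALSE
)

# Regions to exclude, matched by name rather than by row number.
non_states <- c("District Of Columbia", "Puerto Rico", "Virgin Islands")
states_only <- toy_scores[!(toy_scores$Name %in% non_states), ]
print(states_only)
```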
We then combine this data with the map_state data:
map_state <- map_data("state")
combined_data <- map_state %>% left_join(state_scores, by = "region")
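A left join silently leaves `NA` wherever the two tables disagree on a region name, so it can help to compare the keys first. The sketch below uses short hand-written vectors instead of the real `map_state$region` and `state_scores$region` columns.

```r
# Hand-written samples of the join keys (assumption: lowercase names).
map_regions <- c("alabama", "district of columbia", "florida")
score_regions <- c("alabama", "alaska", "florida", "hawaii")

# Scores with no polygon to draw them on.
scores_without_map <- setdiff(score_regions, map_regions)

# Polygons that will get an NA fill.
map_without_scores <- setdiff(map_regions, score_regions)

print(scores_without_map)
print(map_without_scores)
```

On the real data the equivalent calls would be `setdiff(state_scores$region, unique(map_state$region))` and the reverse.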
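This data can be displayed visually on a map of the U.S. Note that map_data("state") covers only the contiguous states, so Alaska and Hawaii do not appear on the map.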
map_theme <- theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.background = element_rect(fill = "white")
)
ggplot() +
geom_polygon(data = combined_data,
mapping = aes(x = long, y = lat, group = group, fill = Average_Total_Score)) + # group by polygon so multi-part states draw correctly
geom_polygon(data = map_state,
mapping = aes(x = long, y = lat, group = group), fill = NA, col = "black") +
scale_fill_gradient(low = "red", high = "blue") +
coord_quickmap() + map_theme +
labs(title = "Average SAT Scores in 2015") + th
This matches the table: Illinois has the highest average SAT score and is therefore the bluest state on the map, while states such as Delaware, Idaho, Maine, Texas, and Florida have some of the lowest averages and appear the most red. It is interesting to see that the Midwest/Great Lakes region tends to have higher average SAT scores than the coasts or the South.
Is an education in the arts or music associated with a high SAT score? Many past studies have shown that kids who played a musical instrument tended to score higher on tests. Furthermore, Americans for the Arts reported that data from the College Board showed that students who took four years of arts scored on average 100 points better on the SAT than students who had half a year or less of arts education.
This report can be found here: https://www.americansforthearts.org/sites/default/files/pdf/get_involved/advocacy/research/2013/artsed_sat13.pdf
We can test to see if there is an association between an arts and music education and test scores by calculating the correlation between the two variables.
df$Year <- factor(df$Year) # no effect on df1, which was created earlier and keeps Year numeric
correlation2 <- df1 %>% select(`Arts/Music.Average Years`, Average_Total_Score, Year)
correlation2 %>% summarize(correlation = cor(`Arts/Music.Average Years`, Average_Total_Score))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.7465564
From this, we see that the correlation between arts education and SAT score is high and positive. Because each row is a state-year average, this means that states and years with more average years of arts or music study tend to have higher average SAT scores, which is consistent with the findings of Americans for the Arts.
We can also plot the data in a scatterplot:
ggplot(data = correlation2, mapping = aes(x=`Arts/Music.Average Years`, y=Average_Total_Score, col = Year)) +
geom_point(alpha = 0.8, position = "jitter") +
geom_smooth(method = "lm") +
labs(title = "Correlation between Arts Education and SAT Score", x = "Number of Years of Arts Education", y = "Total SAT Score") + th
As can be seen from the line of best fit on the graph, there is a strong, positive relationship between the two variables. A correlation alone cannot establish that a longer arts education causes higher SAT scores, but the two are clearly associated in this dataset.
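The line of best fit drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, and for a single predictor its R-squared equals the squared Pearson correlation. With the correlation of about 0.75 reported earlier, that is roughly 0.56, so a little over half of the variation in average scores lines up with average years of arts/music study. The sketch below shows the relationship on made-up values, not on rows from the dataset.

```r
# Made-up illustrative values (assumption), not taken from the dataset.
arts_years <- c(1.5, 1.8, 2.0, 2.3, 2.6)
total_score <- c(980, 1010, 1060, 1100, 1180)

# Ordinary least-squares fit, the same model geom_smooth(method = "lm") draws.
fit <- lm(total_score ~ arts_years)
slope <- coef(fit)[["arts_years"]]
r_squared <- summary(fit)$r.squared

# For one predictor, R-squared is the squared Pearson correlation.
print(all.equal(r_squared, cor(arts_years, total_score)^2))
print(slope)
```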
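In conclusion, the data analyzed in this project supports much of the research that other scholars in education have done. Here, we have shown three main findings from this data:
1. From 2005 to 2015, boys’ mean math scores were higher than girls’ in every year, by roughly 34 to 38 points.
2. In 2015, states in the Midwest/Great Lakes region had the highest average total SAT scores, with Illinois at the top.
3. Average years of arts/music education and average total SAT score are strongly and positively correlated (about 0.75).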
Lastly, this project deviated from the original project proposal in several ways: